This chapter contains information specific to managing storage devices in a TruCluster Server system. The chapter discusses the following subjects:
Working with CDSLs (Section 9.1)
Managing devices (Section 9.2)
Managing the cluster file system (Section 9.3)
Managing the device request dispatcher (Section 9.4)
Managing AdvFS in a cluster (Section 9.5)
Creating new file systems (Section 9.6)
Managing CDFS file systems (Section 9.7)
Backing up and restoring files (Section 9.8)
Managing swap space (Section 9.9)
Fixing problems with boot parameters (Section 9.10)
Using the
verify
command in a cluster
(Section 9.11)
You can find other information
on device management in the Tru64 UNIX Version 5.1B documentation
that is listed in
Table 9-1.
Table 9-1: Sources of Information on Storage Device Management
| Topic | Tru64 UNIX Manual |
| Administering devices | Hardware Management manual |
| Administering file systems | System Administration manual |
| Administering the archiving services | System Administration manual |
| Managing AdvFS | AdvFS Administration manual |
For information about Logical
Storage Manager (LSM) and clusters, see
Chapter 10.
9.1 Working with CDSLs
A context-dependent symbolic link (CDSL) contains a variable that identifies a cluster member. This variable is resolved at run time into a target.
A CDSL is structured as follows:
/etc/rc.config -> ../cluster/members/{memb}/etc/rc.config
When resolving a CDSL pathname, the kernel replaces the string
{memb}
with the string
membern,
where
n
is the member ID of the
current member.
For example, on a cluster member whose member ID is 2,
the pathname
/cluster/members/{memb}/etc/rc.config
resolves to
/cluster/members/member2/etc/rc.config.
CDSLs provide a way for a single file name to point to one of several
files.
Clusters use this to allow member-specific
files that can be addressed throughout the cluster by a single file name.
System data and configuration files tend to be CDSLs.
They are found
in the root (/),
/usr,
and
/var
directories.
9.1.1 Making CDSLs
The
mkcdsl
command provides a simple tool for
creating and populating CDSLs.
For example, to make a new CDSL for the
file
/usr/accounts/usage-history,
enter the following command:
# mkcdsl /usr/accounts/usage-history
When you list the results, you see the following output:
# ls -l /usr/accounts/usage-history
... /usr/accounts/usage-history -> ../cluster/members/{memb}/accounts/usage-history
The CDSL
usage-history
is created in
/usr/accounts.
No files are created in any member's
/usr/cluster/members/{memb}
directory.
To move a file into a CDSL, enter the following command:
# mkcdsl -c targetname
To replace an existing file when using the copy
(-c) option, you must also
use the force (-f) option.
The
-c
option copies the source file to the member-specific
area on the cluster member where the
mkcdsl
command
executes and then replaces the source file with a CDSL.
To copy a
source file to the member-specific area on all cluster members and
then replace the source file with a CDSL, use the
-a
option to the command as follows:
# mkcdsl -a filename
Remove a CDSL with the
rm
command,
as you do for any symbolic link.
The file
/var/adm/cdsl_admin.inv
stores a record
of the cluster's CDSLs.
When you use
mkcdsl
to add
CDSLs, the command
updates
/var/adm/cdsl_admin.inv.
If you use the
ln -s
command to create CDSLs,
/var/adm/cdsl_admin.inv
is not updated.
To update
/var/adm/cdsl_admin.inv, enter the following:
# mkcdsl -i targetname
Update the inventory when you remove a CDSL, or if you use
the
ln -s
command to create a CDSL.
For more information, see
mkcdsl(8).
9.1.2 Maintaining CDSLs
The following tools can help you maintain CDSLs:
The mkcdsl command, including its -i option for updating the CDSL inventory (see mkcdsl(8))
The clu_check_config command, which verifies the installed CDSLs
The following example shows
the output (and the pointer to a log file containing the errors) when
clu_check_config
finds a bad or missing CDSL:
# clu_check_config -s check_cdsl_config
Starting Cluster Configuration Check...
check_cdsl_config : Checking installed CDSLs
check_cdsl_config : CDSLs configuration errors : See /var/adm/cdsl_check_list
clu_check_config : detected one or more configuration errors
As a general rule, before you move a file, make sure that the destination
is not a CDSL.
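For example, you can check a destination by listing it and looking for the {memb} string in the link target:
# ls -l /etc/rc.config
... /etc/rc.config -> ../cluster/members/{memb}/etc/rc.config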
If you do mistakenly overwrite a CDSL, use the
mkcdsl -c filename
command on the appropriate cluster member
to copy the file and re-create the CDSL.
9.1.3 Kernel Builds and CDSLs
When you build a kernel in a cluster, use the
cp
command to copy the new kernel from
/sys/HOSTNAME/vmunix
to
/vmunix.
If you move the kernel to
/vmunix, you will
overwrite the
/vmunix
CDSL.
The result will be
that the next time that cluster member boots, it will use the old
vmunix
in
/sys/HOSTNAME/vmunix.
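For example, assuming the member's system name is HOSTNAME, copy (do not move) the new kernel as follows so that the /vmunix CDSL is preserved:
# cp /sys/HOSTNAME/vmunix /vmunix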
9.1.4 Exporting and Mounting CDSLs
CDSLs are intended for use when files of the same name must necessarily have different contents on different cluster members. Because of this, CDSLs are not intended for export.
Mounting CDSLs through the cluster alias is problematic, because the file contents differ depending on which cluster system gets the mount request. However, nothing prevents CDSLs from being exported. If the entire directory is a CDSL, then the node that gets the mount request provides a file handle corresponding to the directory for that node. If a CDSL is contained within an exported clusterwide directory, then the network file system (NFS) server that gets the request will do the expansion. Like normal symbolic links, the client cannot read the file or directory unless that area is also mounted on the client.
9.2 Managing Devices
Device management in a cluster is similar to that in a standalone system, with the following exceptions:
The
dsfmgr
command for managing device special
files takes special options for clusters.
Because of the mix of shared and private buses in a cluster, device topology can be more complex.
You can control which cluster members act as servers for the devices in the cluster, and which members act as access nodes.
The rest of this section describes these differences.
9.2.1 Managing the Device Special File
When using
dsfmgr, the
device special file management utility, in a cluster, keep the
following in mind:
The
-a
option requires that you
use
c
(cluster) as the
entry_type.
The
-o
and
-O
options,
which create device special files in the
old format, are not valid in a cluster.
In the output from the
-s
option, the
class scope
column in the first table uses a
c
(cluster) to indicate the scope of the device.
For more information, see
dsfmgr(8).
9.2.2 Determining Device Locations
The Tru64 UNIX
hwmgr
command can list all
hardware devices in the cluster, including those on private buses,
and correlate bus-target-LUN names with
/dev/disks/dsk*
names.
For example:
# hwmgr -view devices -cluster
HWID: Device Name Mfg Model Hostname Location
-------------------------------------------------------------------------------
3: kevm pepicelli
28: /dev/disk/floppy0c 3.5in floppy pepicelli fdi0-unit-0
40: /dev/disk/dsk0c DEC RZ28M (C) DEC pepicelli bus-0-targ-0-lun-0
41: /dev/disk/dsk1c DEC RZ28L-AS (C) DEC pepicelli bus-0-targ-1-lun-0
42: /dev/disk/dsk2c DEC RZ28 (C) DEC pepicelli bus-0-targ-2-lun-0
43: /dev/disk/cdrom0c DEC RRD46 (C) DEC pepicelli bus-0-targ-6-lun-0
44: /dev/disk/dsk3c DEC RZ28M (C) DEC pepicelli bus-1-targ-1-lun-0
44: /dev/disk/dsk3c DEC RZ28M (C) DEC polishham bus-1-targ-1-lun-0
44: /dev/disk/dsk3c DEC RZ28M (C) DEC provolone bus-1-targ-1-lun-0
45: /dev/disk/dsk4c DEC RZ28L-AS (C) DEC pepicelli bus-1-targ-2-lun-0
45: /dev/disk/dsk4c DEC RZ28L-AS (C) DEC polishham bus-1-targ-2-lun-0
45: /dev/disk/dsk4c DEC RZ28L-AS (C) DEC provolone bus-1-targ-2-lun-0
46: /dev/disk/dsk5c DEC RZ29B (C) DEC pepicelli bus-1-targ-3-lun-0
46: /dev/disk/dsk5c DEC RZ29B (C) DEC polishham bus-1-targ-3-lun-0
46: /dev/disk/dsk5c DEC RZ29B (C) DEC provolone bus-1-targ-3-lun-0
47: /dev/disk/dsk6c DEC RZ28D (C) DEC pepicelli bus-1-targ-4-lun-0
47: /dev/disk/dsk6c DEC RZ28D (C) DEC polishham bus-1-targ-4-lun-0
47: /dev/disk/dsk6c DEC RZ28D (C) DEC provolone bus-1-targ-4-lun-0
48: /dev/disk/dsk7c DEC RZ28L-AS (C) DEC pepicelli bus-1-targ-5-lun-0
48: /dev/disk/dsk7c DEC RZ28L-AS (C) DEC polishham bus-1-targ-5-lun-0
48: /dev/disk/dsk7c DEC RZ28L-AS (C) DEC provolone bus-1-targ-5-lun-0
49: /dev/disk/dsk8c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-8-lun-0
49: /dev/disk/dsk8c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-8-lun-0
49: /dev/disk/dsk8c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-8-lun-0
50: /dev/disk/dsk9c DEC RZ1CB-CS (C) DEC pepicelli bus-1-targ-9-lun-0
50: /dev/disk/dsk9c DEC RZ1CB-CS (C) DEC polishham bus-1-targ-9-lun-0
50: /dev/disk/dsk9c DEC RZ1CB-CS (C) DEC provolone bus-1-targ-9-lun-0
51: /dev/disk/dsk10c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-10-lun-0
51: /dev/disk/dsk10c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-10-lun-0
51: /dev/disk/dsk10c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-10-lun-0
52: /dev/disk/dsk11c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-11-lun-0
52: /dev/disk/dsk11c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-11-lun-0
52: /dev/disk/dsk11c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-11-lun-0
53: /dev/disk/dsk12c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-12-lun-0
53: /dev/disk/dsk12c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-12-lun-0
53: /dev/disk/dsk12c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-12-lun-0
54: /dev/disk/dsk13c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-13-lun-0
54: /dev/disk/dsk13c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-13-lun-0
54: /dev/disk/dsk13c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-13-lun-0
59: kevm polishham
88: /dev/disk/floppy1c 3.5in floppy polishham fdi0-unit-0
94: /dev/disk/dsk14c DEC RZ26L (C) DEC polishham bus-0-targ-0-lun-0
95: /dev/disk/cdrom1c DEC RRD46 (C) DEC polishham bus-0-targ-4-lun-0
96: /dev/disk/dsk15c DEC RZ1DF-CB (C) DEC polishham bus-0-targ-8-lun-0
99: /dev/kevm provolone
127: /dev/disk/floppy2c 3.5in floppy provolone fdi0-unit-0
134: /dev/disk/dsk16c DEC RZ1DF-CB (C) DEC provolone bus-0-targ-0-lun-0
135: /dev/disk/dsk17c DEC RZ1DF-CB (C) DEC provolone bus-0-targ-1-lun-0
136: /dev/disk/cdrom2c DEC RRD47 (C) DEC provolone bus-0-targ-4-lun-0
The
drdmgr
devicename
command reports which members serve the device.
Disks with multiple servers are on a shared SCSI bus.
With very few exceptions, disks that have only one server
are private to that server.
For details on the exceptions,
see
Section 9.4.1.
To learn the hardware configuration of a cluster member, enter the following command:
# hwmgr -view hierarchy -member membername
If the member is on a shared bus, the command reports devices on the shared bus. The command does not report on devices private to other members.
To get a graphical display of the cluster hardware configuration, including
active members, buses, both shared and private storage devices,
and their connections, use the
sms
command to invoke the graphical interface
for the SysMan Station, and then select Hardware
from the View menu.
Figure 9-1
shows the SysMan Station
representation of a two-member cluster.
Figure 9-1: SysMan Station Display of Hardware Configuration
9.2.3 Adding a Disk to the Cluster
For information on physically installing SCSI hardware devices, see the TruCluster Server Cluster Hardware Configuration manual. After the new disk has been installed, follow these steps:
So that all members recognize the new disk, run the following command on each member:
# hwmgr -scan comp -cat scsi_bus
Note
You must run the
hwmgr -scan comp -cat scsi_bus
command on every cluster member that needs access to the disk.
Wait a minute or so for all members to register the presence of the new disk.
If the disk that you are adding is an RZ26, RZ28, RZ29, or RZ1CB-CA model, run the following command on each cluster member:
# /usr/sbin/clu_disk_install
If the cluster has a large number of storage devices, this command can take several minutes to complete.
To learn the name of the new disk, enter the following command:
# hwmgr -view devices -cluster
You can also run the SysMan Station command and select Hardware from the Views menu to learn the new disk name.
For information about creating file systems on the disk,
see
Section 9.6.
9.2.4 Managing Third-party Storage
When a cluster member loses quorum, all of its I/O is suspended, and the remaining members erect I/O barriers against nodes that have been removed from the cluster. This I/O barrier operation inhibits non-cluster members from performing I/O with shared storage devices.
The method that is used to create the I/O barrier depends on the types of storage devices that the cluster members share. In certain cases, a Task Management function called a Target_Reset is sent to stop all I/O to and from the former member. This Task Management function is used in either of the following situations:
The shared SCSI device does not support the SCSI Persistent Reserve command set and uses the Fibre Channel interconnect.
The shared SCSI device does not
support the SCSI Persistent Reserve
command set, uses the SCSI Parallel interconnect, is a
multiported device, and does not propagate the SCSI
Target_Reset
signal.
In either of these situations, there is a delay between the
Target_Reset
and the clearing of all I/O
pending between the device and the former member.
The length of this interval depends on the device
and the cluster configuration.
During this
interval, some I/O with the former member might still occur.
This I/O, sent after the
Target_Reset,
completes in a normal way without interference from other nodes.
During an interval configurable with the
drd_target_reset_wait
kernel attribute,
the device request dispatcher suspends all new I/O to the shared
device.
This period allows time to clear those devices
of the pending I/O that originated with the former member
and were sent to the device after it received the
Target_Reset.
After this interval passes,
the I/O barrier is complete.
The default value for
drd_target_reset_wait
is
30 seconds, which is usually sufficient.
However, if you
have doubts because of third-party devices in your cluster,
contact the device manufacturer and
ask for the specifications on how long it takes their device to clear
I/O after the receipt of a
Target_Reset.
You can set
drd_target_reset_wait
at boot time
and run time.
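For example, assuming that the attribute belongs to the drd kernel subsystem (check the system attributes reference pages for your release), commands similar to the following query the current value and change it at run time; the value 45 is only an illustration:
# sysconfig -q drd drd_target_reset_wait
# sysconfig -r drd drd_target_reset_wait=45
To set the value at boot time, add the corresponding entry to the member's /etc/sysconfigtab file.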
For more information about quorum loss and system partitioning,
see the chapter on the connection manager in the
TruCluster Server
Cluster Technical Overview
manual.
9.2.5 Tape Devices
You can access a tape device in the cluster from any member, regardless of whether it is located on that member's private bus, on a shared bus, or on another member's private bus.
Placing a tape device on a shared bus allows multiple members to
have direct access to the device.
Performance considerations also
argue for placing a tape device on a shared bus.
Backing up storage
connected to a system on a shared bus with a tape drive is faster than
having to go over the cluster interconnect.
For example, in
Figure 9-2, the backup of
dsk9
and
dsk10
to the tape drive requires the data
to go over the cluster interconnect.
For the backup of any other
disk, including the semi-private disks
dsk11,
dsk12,
dsk13, and
dsk14,
the data transfer rate will be faster.
Figure 9-2: Cluster with Semi-private Storage
If the tape device is located on the shared bus, applications that access the device must be written to react appropriately to certain events on the shared SCSI bus, such as bus and device resets. Bus and device resets (such as those that result from cluster membership transitions) cause any tape device on the shared SCSI bus to rewind.
After such a reset, a
read()
or
write()
issued by a tape server application fails and returns an error
(errno).
You must explicitly set up the tape
server application to retrieve error information that is returned from
its I/O call to
reposition the tape.
When a
read()
or
write()
operation fails, use
ioctl()
with the
MTIOCGET
command option
to return a structure that contains the
error information that is needed by the application to reposition the tape.
For a description of the structure, see
/usr/include/sys/mtio.h.
The commonly
used utilities
tar,
cpio,
dump, and
vdump
are not designed in
this way, so they may unexpectedly terminate when used on a
tape device that resides on a shared bus in a cluster.
9.2.6 Formatting Diskettes in a Cluster
TruCluster Server includes support for read/write UNIX file system (UFS) file systems, as described in Section 9.3.7, and you can use TruCluster Server to format a diskette.
Versions of TruCluster Server prior to Version 5.1A do not support read/write UFS file systems. Because prior versions of TruCluster Server do not support read/write UFS file systems and AdvFS metadata overwhelms the capacity of a diskette, the typical methods to format a diskette cannot be used in a cluster.
If you must format a diskette in a cluster with a version of
TruCluster Server prior to Version 5.1A, use the
mtools
or
dxmtools
tool sets.
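For example, assuming that the mtools configuration on the member maps drive a: to its diskette device, a command similar to the following writes a DOS file system to the diskette:
# mformat a: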
For more information, see
mtools(1) and dxmtools(1).
9.2.7 CD-ROM and DVD-ROM
CD-ROM drives and DVD-ROM drives are always served devices. This type of drive must be connected to a local bus; it cannot be connected to a shared bus.
For information about managing a CD-ROM file system (CDFS)
in a cluster, see
Section 9.7.
9.3 Managing the Cluster File System
The cluster file system (CFS) provides transparent access to files that are located anywhere on the cluster. Users and applications enjoy a single-system image for file access. Access is the same regardless of the cluster member where the access request originates, and where in the cluster the disk containing the file is connected. CFS follows a server/client model, with each file system served by a cluster member. Any cluster member can serve file systems on devices anywhere in the cluster. If the member serving a file system becomes unavailable, the CFS server automatically fails over to an available cluster member.
The primary tool for managing the cluster file system
is the
cfsmgr
command.
A number of examples of using the command appear in
this section.
For more information about the
cfsmgr
command,
see
cfsmgr(8).
TruCluster Server Version 5.1B includes the
-o server=name
option to the
mount
command that causes
file systems to be served by a specific cluster member on startup.
This option is described in
Section 9.3.4.
TruCluster Server Version 5.1B includes a load monitoring daemon,
/usr/sbin/cfsd, that can monitor, report on, and
respond to file system-related member and cluster activity.
The
cfsd
daemon is described in
Section 9.3.3.
To gather statistics about CFS, use the
cfsstat
command or the
cfsmgr -statistics
command.
An example
of using
cfsstat
to get information about
direct I/O appears in
Section 9.3.6.2.
For more information
on the command, see
cfsstat(8).
For file systems on devices on the shared bus, I/O
performance depends on the load on the bus and the load on the member
serving the file system.
To simplify load balancing,
CFS allows you to easily relocate the
server to a different member.
Access to file systems on devices
that are private to a member is faster when the file systems are served by
that member.
9.3.1 When File Systems Cannot Fail Over
In most instances, CFS provides seamless failover for the file systems in the cluster. If the cluster member serving a file system becomes unavailable, CFS fails over the server to an available member. However, in the following situations, no path to the file system exists and the file system cannot fail over:
The file system's storage is on a private bus that is connected directly to a member and that member becomes unavailable.
The storage is on a shared bus and all the members on the shared bus become unavailable.
In either case, the
cfsmgr
command returns the
following status for the file system (or domain):
Server Status : Not Served
Attempts to access the file system return the following message:
filename I/O error
When a cluster member that is connected to the storage becomes available,
the file system becomes served again and accesses to the file system
begin to work.
Other than making the member available, you do not
need to take any action.
9.3.2 Direct Access Cached Reads
TruCluster Server implements direct access cached reads, which is a performance enhancement for AdvFS file systems. Direct access cached reads allow CFS to read directly from storage simultaneously on behalf of multiple cluster members.
If the cluster member that issues the read is directly connected to the storage that makes up the file system, direct access cached reads access the storage directly and do not go through the cluster interconnect to the CFS server.
If a CFS client is not directly connected to the storage that makes up a file system (for example, if the storage is private to a cluster member), that client will still issue read requests directly to the devices, but the device request dispatcher layer sends the read request across the cluster interconnect to the device.
Direct access cached reads are consistent with the existing CFS served file-system model, and the CFS server continues to perform metadata and log updates for the read operation.
Direct access cached reads are implemented only for AdvFS file systems. In addition, direct access cached reads are performed only for files that are at least 64K in size. The served I/O method is more efficient when processing smaller files.
Direct access cached reads are enabled by default and are not user-settable or tunable. However, if an application uses direct I/O, as described in Section 9.3.6.2, that choice is given priority and direct access cached reads are not performed for that application.
Use the
cfsstat directio
command to
display direct I/O statistics.
The
direct i/o
reads
field includes direct access cached read
statistics.
See
Section 9.3.6.2.3
for a description of
these fields.
# cfsstat directio
Concurrent Directio Stats:
941 direct i/o reads
0 direct i/o writes
0 aio raw reads
0 aio raw writes
0 unaligned block reads
29 fragment reads
73 zero-fill (hole) reads
0 file-extending writes
0 unaligned block writes
0 hole writes
0 fragment writes
0 truncates
9.3.3 Monitoring and Balancing the CFS Load with cfsd
When a cluster boots, the TruCluster Server software ensures that each file system is directly connected to the member that serves it. File systems on a device connected to a member's local bus are served by that member. A file system on a device on a shared SCSI bus is served by one of the members that is directly connected to that SCSI bus.
In the case of AdvFS, the member that is assigned as the CFS server for the first fileset in a domain also serves all other filesets in that domain.
When a cluster boots, typically the first member up that is connected to a shared SCSI bus is the first member to see devices on the shared bus. This member then becomes the CFS server for all the file systems on all the devices on that shared bus. Because of this, most file systems are probably served by a single member and this member can become more heavily loaded than other members, thereby using a larger percentage of its resources (CPU, memory, I/O, and so forth). In this case, CFS can recommend that you relocate file systems to other cluster members to balance the load and improve performance.
TruCluster Server Version 5.1B includes a load monitoring daemon,
/usr/sbin/cfsd, that can monitor, report on, and
respond to file-system-related member and cluster activity.
cfsd
is disabled by default and you must explicitly
enable it.
After being enabled,
cfsd
can perform the following functions:
Assist in managing file systems by locating file systems based on
your preferences and storage connectivity.
You can configure
cfsd
to automatically relocate
file systems when members join or leave the cluster,
when storage connectivity changes, or as a result of CFS memory
usage.
Collect a variety of statistics on file system usage and system load. You can use this data to understand how the cluster's file systems are being used.
Analyze the statistics that it collects and recommend file system relocations that may improve system performance or balance the file system load across the cluster.
Monitor CFS memory usage on cluster nodes and generate an alert when a member is approaching the CFS memory usage limit.
An instance of the
cfsd
daemon runs on each member
of the cluster.
If
cfsd
runs in the cluster, it
must run on each member;
cfsd
depends on a daemon
running on each member for proper behavior.
If you do not want
cfsd
to be running in the cluster, do not allow
any member to run it.
Each instance of the daemon
collects statistics on its member and monitors member-specific events
such as low CFS memory.
One daemon from the cluster automatically serves as the
"master"
daemon and is responsible for analyzing all
of the collected statistics, making recommendations, and initiating automatic
relocations.
The daemons are configured via a
clusterwide
/etc/cfsd.conf
configuration
file.
The
cfsd
daemon monitors file system performance and
resource utilization by periodically polling the
member for information, as determined by the
polling_schedule
attributes of the
/etc/cfsd.conf
configuration file for a given
policy.
The
cfsd
daemon collects information
about each member's usage of each file system, about the memory demands of
each file system, about the system memory demand on each member, and about
member-to-physical storage connectivity.
Each daemon accumulates the
statistics in the member-specific binary file
/var/cluster/cfs/stats.member.
The data in this file is
in a format specific to
cfsd
and is not intended for direct
user access.
The
cfsd
daemon updates and maintains these
files; you do not need to periodically delete or maintain them.
The following data is collected for each cluster member:
svrcfstok structure count limit. (See Section 9.3.6.3 for a discussion of this structure.)
Number of active svrcfstok structures
Total number of bytes
Number of wired bytes
The following data is collected per file system per member:
Number of read operations
Number of write operations
Number of lookup operations
Number of getattr operations
Number of readlink operations
Number of access operations
Number of other operations
Bytes read
Bytes written
Number of active
svrcfstok
structures (described in
Section 9.3.6.3)
The
cfsd
daemon also subscribes to EVM events to monitor
information on general cluster and cluster file system state, such as cluster
membership, mounted file systems, and device connectivity.
Notes
cfsd considers an AdvFS domain to be an indivisible entity. Relocating an AdvFS file system affects not only the selected file system, but all file systems in the same domain. The entire domain is relocated.
File systems of type NFS client and memory file system (MFS) cannot be relocated. In addition, member boot partitions, server-only file systems (such as UFS file systems mounted for read-write access), and file systems that are under hierarchical storage manager (HSM) management cannot be relocated.
Direct-access I/O devices on a shared bus are served by all cluster members on that bus. A single-server device, whether on a shared bus or directly connected to a cluster member, is served by a single member. If two or more file systems use the same single-server device,
cfsd does not relocate them due to performance issues that can arise if the file systems are not served by the same member.
9.3.3.1 Starting and Stopping
cfsd
The
/sbin/init.d/cfsd
file starts an
instance of the
cfsd
daemon on each cluster
member.
However, starting the daemon does not by itself make
cfsd
active: the
cfsd
daemon's behavior is
controlled via the
active
field of the stanza-formatted file
/etc/cfsd.conf.
The
active
field enables
cfsd
if set to
1.
The
cfsd
daemon is disabled (set to
0)
by default and you must explicitly enable it if you want to use it.
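For example, to enable cfsd clusterwide, set active to 1 in the cfsd stanza of /etc/cfsd.conf (as in Example 9-1) and then force the daemons to reread the file with SIGHUP, as shown below:
cfsd:
        active = 1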
The
cfsd
daemon reads the clusterwide
/etc/cfsd.conf
file at startup.
You can force
cfsd
to
reread the configuration file by sending it a SIGHUP signal, in a
manner similar to the following:
# kill -HUP `cat /var/run/cfsd.pid`
If you modify
/etc/cfsd.conf, send
cfsd
the SIGHUP signal to force it to reread the file.
If you send SIGHUP to any
cfsd
daemon process in the cluster, all
cfsd
daemons in
the cluster reread the file; you do not need to issue multiple
SIGHUP signals.
9.3.3.2 EVM Events
The
cfsd
daemon posts an EVM
sys.unix.clu.cfsd.anlys.relocsuggested
event to alert you that the latest analysis contains interesting
results.
You can use the
evmwatch
and
evmshow
commands to monitor these events.
# evmget | evmshow -t "@name [@priority]" | grep cfsd
sys.unix.clu.cfsd.anlys.relocsuggested [200]
The following command provides additional information about
cfsd
EVM events:
# evmwatch -i -f "[name sys.unix.clu.cfsd.*]" | evmshow -d | more
9.3.3.3 Modifying the
/etc/cfsd.conf Configuration File
The
/etc/cfsd.conf
configuration file, described
in detail in
cfsd.conf(4), configures cfsd
and defines
a set of file system placement policies that
cfsd
adheres to when analyzing and managing the cluster's file systems.
All file systems in the cluster have a placement
policy associated with them.
This policy specifies how each
file system is assigned to members and determines whether or not to
have
cfsd
automatically relocate it if
necessary.
If you modify this file, keep the following points in mind:
If you do not explicitly assign a file system a policy, it inherits the default policy.
The
cfsd
daemon never attempts to relocate a
cluster member's boot partition, even if the boot partition
belongs to a policy that has
active_placement
set
to perform relocations.
The
cfsd
daemon ignores the
active_placement
setting for the boot
partition.
The
active_placement
keywords determine
the events upon which an automatic relocation occurs; the
hosting_members
option determines your preference of
the members to which a file system is relocated.
The
hosting_members
option is a restrictive list,
not a preferred list.
The
placement
attribute controls how the file system is placed when
none of the members specified by
hosting_members
attributes is available.
The
cfsd
daemon treats all cluster members
identified in a single
hosting_members
entry
equally; no ordering preference is assumed by position in the
list.
To specify an ordering of member preference, use
multiple
hosting_members
lines.
The
cfsd
daemon gives preference to the members listed in the
first
hosting_members
line, followed by the members
in the next
hosting_members
line, and so
on.
There is no limit on the number of policies you can create.
You can list any number of file systems in a
filesystems
line.
If the same file
system appears in multiple policies, the last usage takes
precedence.
A sample
/etc/cfsd.conf
file is shown in
Example 9-1.
See
cfsd.conf(4) for more information.
Example 9-1: Sample /etc/cfsd.conf File
# Use this file to configure the CFS load monitoring daemon (cfsd)
# for the cluster. cfsd will read this file at startup and on receipt
# of a SIGHUP. After modifying this file, you can apply your changes
# cluster-wide by issuing the following command from any cluster
# member: "kill -HUP `cat /var/run/cfsd.pid`". This will force the
# daemon on each cluster member to reconfigure itself. You only need
# to send one SIGHUP.
#
# Any line whose first non-whitespace character is a '#' is ignored
# by cfsd.
#
# If cfsd encounters syntax errors while processing this file, it will
# log the error and any associated diagnostic information to syslog.
#
# See cfsd(8) for more information.
# This block is used to configure certain daemon-wide features.
#
# To enable cfsd, set the "active" attribute to "1".
# To disable cfsd, set the "active" attribute to "0".
#
# Before enabling the daemon, you should review and understand the
# configuration in order to make sure that it is compatible with how
# you want cfsd to manage the cluster's file systems.
#
# cfsd will analyze load every 12 hours, using the past 24 hours worth
# of statistical data.
#
cfsd:
active = 1
reloc_on_memory_warning = 1
reloc_stagger = 0
analyze_samplesize = 24:00:00
analyze_interval = 12:00:00
# This block is used to define the default policy for file systems that
# are not explicitly included in another policy. Furthermore, other
# policies that do not have a particular attribute explicitly defined
# inherit the corresponding value from this default policy.
#
# Collect stats every 2 hours all day monday-friday.
# cfsd will perform auto relocations to maintain server preferences,
# connectivity, and acceptable memory usage, and will provide relocation
# hints to the kernel for preferred placement on failover.
# No node is preferred over another.
#
defaultpolicy:
polling_schedule = 1-5, 0-23, 02:00:00
placement = favored
hosting_members = *
active_placement = connectivity, preference, memory, failover
# This policy is used for file systems that you do NOT want cfsd to
# ever relocate. It is recommended that cfsd not be allowed to relocate
# the /, /usr, or /var file systems.
#
# It is also recommended that file systems whose placements are
# managed by other software, such as CAA, also be assigned to
# this policy.
#
policy:
name = PRECIOUS
filesystems = cluster_root#, cluster_usr#, cluster_var#
active_placement = 0
# This policy is used for file systems that cfsd should, for the most
# part, ignore. File systems in this policy will not have statistics
# collected for them and will not be relocated.
#
# Initially, this policy contains all NFS and MFS file systems that
# are not explicitly listed in other policies. File systems of these
# types tend to be temporary, so collecting stats for them is usually
# not beneficial. Also, CFS currently does not support the relocation
# of NFS and MFS file systems.
#
policy:
name = IGNORE
filesystems = %nfs, %mfs
polling_schedule = 0
active_placement = 0
# Policy for boot file systems.
#
# No stats collection for boot file systems. Boot partitions are never
# relocated.
#
policy:
name = BOOTFS
filesystems = root1_domain#, root2_domain#, root3_domain#
polling_schedule = 0
# You can define as many policies as necessary, using this policy block
# as a template. Any attributes that you leave commented out will be
# inherited from the default policy defined above.
#
policy:
name = POLICY01
#filesystems =
#polling_schedule = 0-6, 0-23, 00:15:00
#placement = favored
#hosting_members = *
#active_placement = preference, connectivity, memory, failover
9.3.3.4 Understanding
cfsd Analysis and Implementing Recommendations
The
cfsd
daemons collect statistics in the
member-specific file
/var/cluster/cfs/stats.member.
These data files are in a format specific to
cfsd
and
are not intended for direct user access.
The
cfsd
daemon
updates and maintains these files; you do not need to periodically
delete or maintain them.
After analyzing these collected statistics,
cfsd
places the results of that analysis in the
/var/cluster/cfs/analysis.log
file.
The
/var/cluster/cfs/analysis.log
file is a symbolic link
to the most recent
/var/cluster/cfs/analysis.log.dated
file.
When a
/var/cluster/cfs/analysis.log.dated
file becomes 24 hours old, a new version is created and the symbolic
link is updated.
Prior versions of the
/var/cluster/cfs/analysis.log.dated
file are purged.
The
cfsd
daemon posts an EVM event to alert you that the
latest analysis contains interesting results.
The
/var/cluster/cfs/analysis.log
file
contains plain text, in a format similar to the following:
Cluster Filesystem (CFS) Analysis Report
(generated by cfsd[525485])
Recommended
relocations:
none
Filesystem usage summary:
cluster reads writes req'd svr mem
24 KB/s 0 KB/s 4190 KB
node reads writes req'd svr mem
rye 4 KB/s 0 KB/s 14 KB
swiss 19 KB/s 0 KB/s 4176 KB
filesystem
node reads writes req'd svr mem
test_one# 2 KB/s 0 KB/s 622 KB
rye 0 KB/s 0 KB/s
@swiss 2 KB/s 0 KB/s
test_two# 4 KB/s 0 KB/s 2424 KB
rye 1 KB/s 0 KB/s
@swiss 3 KB/s 0 KB/s
:
:
Filesystem placement evaluation results:
filesystem
node conclusion observations
test_one#
rye considered (hi conn, hi pref, lo use)
@swiss recommended (hi conn, hi pref, hi use)
test_two#
rye considered (hi conn, hi pref, lo use)
@swiss recommended (hi conn, hi pref, hi use)
:
:
The current CFS server of each file system is indicated by an
"at"
symbol (@).
As previously described,
cfsd
treats an AdvFS
domain as an indivisible entity, and the analysis is reported at the
AdvFS domain level.
Relocating a file
system of type AdvFS affects all file systems in the same domain.
You can use the results of this analysis to determine whether you want
a different cluster member to be the CFS server for a given file
system.
If the current CFS server is not the recommended
server for this file system based on the
cfsd
analysis, you can use the
cfsmgr
command to relocate the file system to the recommended server.
For example, assume that
swiss
is the current CFS server of the
test_two
domain and member
rye
is the recommended CFS server.
If you agree
with this analysis and want to implement the recommendation, enter the following
cfsmgr
command to change the CFS server
to
rye:
# cfsmgr -a server=rye -d test_two
# cfsmgr -d test_two
Domain or filesystem name = test_two
Server Name = rye
Server Status : OK
The
cfsd
daemon does not automatically relocate
file systems based solely on its own statistical analysis.
Rather, it
produces reports and makes recommendations that you can accept or
reject based on your environment.
However, for a select series of conditions,
cfsd
can automatically relocate
a file system based on the keywords you specify in the
active_placement
option for a given
file system policy.
The
active_placement
keywords determine
the events upon which an automatic relocation occurs; the
hosting_members
option determines the members to
which a file system is relocated and the order in which a member is
selected.
The possible values and interactions of the
active_placement
option are described in
cfsd.conf(4):
Memory
CFS memory usage is limited by the
svrcfstok_max_percent
kernel attribute, which is described in
Section 9.3.6.3.
If a cluster member
reaches this limit, file operations on file systems served by the
member begin failing with
"file table overflow"
errors.
While a member approaches its CFS memory usage limit, the kernel posts an EVM
event as a warning.
When such an event is posted,
cfsd
can attempt to free memory on the member by
relocating some of the file systems that it is serving.
Preference
While members join and leave the
cluster,
cfsd
can relocate file systems to members
that you prefer.
You might want certain file systems to be served primarily by
a subset of the cluster members.
Failover
When a file system must fail over because its CFS server leaves the
cluster,
cfsd
provides placement hints to the kernel so that the file system
fails over to a member that you prefer.
Connectivity
If a member does not have a direct
physical connection to the
devices required by a file system that it serves, a severe performance
degradation can result.
The
cfsd
daemon can
automatically relocate a
file system in the event that its current server loses connectivity to
the file system's underlying devices.
9.3.3.6 Relationship to CAA Resources
The
cfsd
daemon has no knowledge of CAA resources.
CAA allows you to use the
placement_policy,
hosting_members, and
required_resources
options to favor or limit the
member or members that can run a particular CAA resource.
If this CAA resource has an application- or resource-specific
file system, use the associated CAA action script to place
or relocate the file system.
For example, if the resource is relocated,
the action script should use
cfsmgr
to move the file system as well.
Using the
cfsmgr
command via the action script allows you
to more directly and easily synchronize the file system with the CAA resource.
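For example, a minimal sketch of the relocation step in a CAA action script's start processing might look like the following; the member name (alpha, the member on which the resource is being started) and the domain name (accounts_dmn) are placeholders for illustration:
# Hypothetical fragment of a CAA action script start entry point:
# after the application resource starts on this member, make the same
# member the CFS server for the application's AdvFS domain.
cfsmgr -a server=alpha -d accounts_dmn
if [ $? -ne 0 ]; then
    echo "cfsmgr relocation of accounts_dmn failed" >&2
fi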
9.3.3.7 Balancing CFS Load Without
cfsd
The
cfsd
daemon is the recommended method of
analyzing and balancing the CFS load on a cluster.
The
cfsd
daemon can
monitor, report on, and respond to file-system-related member and
cluster activity.
However, if you already have a process in place
to balance your file system load, or if you simply prefer to perform the
load balancing analysis yourself, you can certainly do so.
Use the
cfsmgr
command to determine good candidates
for relocating the CFS servers.
The
cfsmgr
command
displays statistics on file system usage on a per-member basis.
For example, suppose you want to determine whether to relocate the server for
/accounts
to improve performance.
First, confirm the current CFS server of
/accounts
as follows:
# cfsmgr /accounts
Domain or filesystem name = /accounts
Server Name = systemb
Server Status : OK
Then, get the CFS statistics for the current server and the candidate servers by entering the following commands:
# cfsmgr -h systemb -a statistics /accounts
Counters for the filesystem /accounts:
read_ops = 4149
write_ops = 7572
lookup_ops = 82563
getattr_ops = 408165
readlink_ops = 18221
access_ops = 62178
other_ops = 123112
Server Status : OK
# cfsmgr -h systema -a statistics /accounts
Counters for the filesystem /accounts:
read_ops = 26836
write_ops = 3773
lookup_ops = 701764
getattr_ops = 561806
readlink_ops = 28712
access_ops = 81173
other_ops = 146263
Server Status : OK
# cfsmgr -h systemc -a statistics /accounts
Counters for the filesystem /accounts:
read_ops = 18746
write_ops = 13553
lookup_ops = 475015
getattr_ops = 280905
readlink_ops = 24306
access_ops = 84283
other_ops = 103671
Server Status : OK
# cfsmgr -h systemd -a statistics /accounts
Counters for the filesystem /accounts:
read_ops = 98468
write_ops = 63773
lookup_ops = 994437
getattr_ops = 785618
readlink_ops = 44324
access_ops = 101821
other_ops = 212331
Server Status : OK
In this example, most of the read and write activity
for
/accounts
is from member
systemd, not from the member that is currently serving it,
systemb.
Assuming that
systemd
is physically connected to the storage for
/accounts,
systemd
is
a good choice as the CFS server for
/accounts.
Determine whether
systemd
and
the storage for
/accounts
are physically
connected as follows:
Find out where
/accounts
is mounted.
You can
either look in
/etc/fstab
or use the
mount
command.
If there are a large number of
mounted file systems, you might want to use
grep
as follows:
# mount | grep accounts
accounts_dmn#accounts on /accounts type advfs (rw)
Look at the directory
/etc/fdmns/accounts_dmn
to
learn the device where the AdvFS domain
accounts_dmn
is mounted as follows:
# ls /etc/fdmns/accounts_dmn
dsk6c
Enter the
drdmgr
command to learn the servers of
dsk6
as follows:
# drdmgr -a server dsk6
Device Name: dsk6
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 4
Server Name: membera
Server State: Server
Server Name: memberb
Server State: Server
Server Name: memberc
Server State: Server
Server Name: memberd
Server State: Server
Because
dsk6
has multiple servers, it is on a
shared bus.
Because
systemd
is one of the servers,
there is a physical connection.
Relocate the CFS server of
/accounts
to
systemd
as follows:
# cfsmgr -a server=systemd /accounts
Even in cases where the CFS statistics do not show an inordinate load
imbalance, we recommend that you distribute the CFS servers among the
available members that are connected to the shared bus.
Doing so can
improve overall cluster performance.
9.3.3.8 Distributing CFS Server Load via
cfsmgr
To automatically have a particular cluster member act as the CFS server
for a file system or domain, you can place a script in
/sbin/init.d
that calls the
cfsmgr
command to relocate the server for the file
system or domain to the desired cluster member.
This technique
distributes the CFS load but does not balance it.
For example, if you want cluster member
alpha
to
serve the domain
accounting, place the following
cfsmgr
command in a startup script:
# cfsmgr -a server=alpha -d accounting
Have the script look for successful relocation and retry the operation
if it fails.
The
cfsmgr
command returns a nonzero value
on failure; however, it is not sufficient for the script to
keep trying on a bad exit value.
The relocation might have failed because a failover or relocation is
already in progress.
On failure of the relocation, have the script search for one of the following messages:
Server Status : Failover/Relocation in Progress
Server Status : Cluster is busy, try later
If either of these messages occurs, have the script retry the
relocation.
On any other error, have the script print an appropriate
message and exit.
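A minimal sketch of such a startup script follows; it reuses the alpha member and accounting domain from the example above, and the retry count and sleep interval are arbitrary choices:
#!/sbin/sh
# Relocate the CFS server for the accounting domain to member alpha,
# retrying while a failover or relocation is already in progress.
retries=10
while [ $retries -gt 0 ]
do
    msg=`cfsmgr -a server=alpha -d accounting 2>&1`
    if [ $? -eq 0 ]
    then
        exit 0
    fi
    case "$msg" in
    *"Failover/Relocation in Progress"*|*"Cluster is busy, try later"*)
        # Transient condition: wait and retry the relocation.
        sleep 30
        retries=`expr $retries - 1`
        ;;
    *)
        # Any other error is treated as fatal.
        echo "cfsmgr relocation of accounting failed: $msg" >&2
        exit 1
        ;;
    esac
done
echo "cfsmgr relocation of accounting did not complete" >&2
exit 1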
9.3.4 Distributing File Systems Via the
mount -o Command
A file system on a device on a shared SCSI bus is served by one of the members that is directly connected to that SCSI bus. When a cluster boots, typically the first active member that is connected to a shared SCSI bus is the first member to see devices on the shared bus. This member then becomes the CFS server for all the file systems on all the devices on that shared bus. CFS allows you to then relocate file systems to better balance the file system load, as described in Section 9.3.3.
As an alternate approach, the
mount -o server=name
command allows you to specify which cluster member serves a given file system
at startup.
The
-o server=name
option
is particularly useful for those file systems that cannot be
relocated, such as NFS, MFS, and read/write UFS file systems:
# mount -t nfs -o server=rye smooch:/usr /tmp/mytmp
# cfsmgr -e
Domain or filesystem name = smooch:/usr
Mounted On = /cluster/members/member1/tmp/mytmp
Server Name = ernest
Server Status : OK
If the mount specified by the
mount -o
server=name
command is successful, the
specified cluster member is the CFS server for the file system.
However, if the
specified member is not a member of the cluster or is unable to serve
the file system, the mount attempt fails.
The
mount -o
server=name
command determines
where the file system is first mounted; it does not limit or
determine the cluster members to which the file system might later be
relocated or fail over.
If you combine the -o server=name option with the -o server_only option, the file system can be mounted only by the specified cluster member and the file system is then treated as a partitioned file system. That is, the file system is accessible for both read-only and read/write access only by the member that mounts it. Other cluster members cannot read from, or write to, the file system. Remote access is not allowed; failover does not occur. The -o server_only option can be applied only to AdvFS, MFS, and UFS file systems.
Note
The -o server=name option bypasses the normal server selection process and may result in a member that has less than optimal connectivity to the file system's devices serving the file system. In addition, if the member you specify is not available, the file system is not mounted by any other cluster member.
The combination of the -o server=name and -o server_only options removes many of the high-availability protections of the CFS file system: the file system can be mounted only by the specified cluster member, it can be accessed by only that member, and it cannot fail over to another member. Therefore, use this combination carefully.
The -o server=name option is valid only in a cluster, and only for AdvFS, UFS, MFS, NFS, CDFS, and DVDFS file systems. In the case of MFS file systems, the -o server=name option is supported in a limited fashion: the file system is mounted only if the specified server is the local node.
You can use the
-o
server=name
option
with the
/etc/fstab
file to create
cluster-member-specific
fstab
entries.
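For example, an entry similar to the following hypothetical line (using the accounts_dmn domain shown earlier in this chapter) requests that member systemb serve /accounts when it is mounted; check mount(8) for the exact option syntax on your system:
accounts_dmn#accounts   /accounts   advfs   rw,server=systemb   0 0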
See
mount(8).
9.3.5 Freezing a Domain Before Cloning
To allow coherent hardware snapshots in multivolume domain configurations,
file system metadata must be consistent across all volumes when the
individual volumes are cloned.
To guarantee that the metadata is
consistent, Tru64 UNIX Version 5.1B includes
the
freezefs
command, which is described in
freezefs(8)freezefs
command causes an AdvFS domain to enter into a
metadata-consistent frozen state and guarantees that it stays that way
until the specified freeze time expires or it is explicitly thawed
with the
thawfs
command.
All metadata, which
can be spread across multiple volumes or logical units (LUNs), is
flushed to disk and does not change for the duration of the freeze.
Although
freezefs
requires that you specify one
or more AdvFS file system mount directories, all of the filesets in
the AdvFS domain are affected.
The
freezefs
command considers
an AdvFS domain to be an indivisible entity.
Freezing a file system
in a domain freezes the entire domain.
When you freeze a
file system in a clustered configuration, all in-process
file system operations are allowed to complete.
Some file
system operations that do not require metadata updates work
normally even if the target file system is
frozen; for example,
read
and
stat.
Although there are slight differences in how
freezefs
functions on a single system and in a
cluster, in both cases metadata changes are not allowed on a
frozen domain.
The most notable differences in the behavior of the
commands in a cluster are the following:
Shutting down any cluster member causes all frozen file systems in the cluster to be thawed.
If any cluster member fails, all frozen file systems in the cluster are thawed.
9.3.5.1 Determining Whether a Domain Is Frozen
By default,
freezefs
freezes a file system for
60 seconds.
However, you can use the
-t
option to specify a
lesser or greater timeout value in seconds, or to specify that the
domain is to remain frozen until being thawed by
thawfs.
The
freezefs
command
-q
option allows you to query a file system to determine if it is frozen:
# freezefs -q /mnt
/mnt is frozen
In addition, the
freezefs
command posts
an EVM event when a file system is frozen or thawed.
You can use
the
evmwatch
and
evmshow
commands to determine if any domains in
the cluster are frozen or thawed, as shown in the following example:
# /usr/sbin/freezefs -t -1 /freezetest
freezefs: Successful
# evmget -f "[name sys.unix.fs.vfs.freeze]" | evmshow -t "@timestamp @@"
14-Aug-2002 14:16:51 VFS: filesystem test2_domain#freeze mounted on /freezetest was frozen
# /usr/sbin/thawfs /freezetest
thawfs: Successful
# evmget -f "[name sys.unix.fs.vfs.thaw]" | evmshow -t "@timestamp @@"
14-Aug-2002 14:17:32 VFS: filesystem test2_domain#freeze mounted on /freezetest was thawed
9.3.6 Optimizing CFS Performance
You can tune CFS performance by doing the following:
Changing the number of read-ahead and write-behind threads (Section 9.3.6.1)
Taking advantage of direct I/O (Section 9.3.6.2)
Adjusting CFS memory usage (Section 9.3.6.3)
Using memory mapped files (Section 9.3.6.4)
Avoiding full file systems (Section 9.3.6.5)
Trying other strategies (Section 9.3.6.6)
9.3.6.1 Changing the Number of Read-Ahead and Write-Behind Threads
When CFS detects sequential accesses to a file, it
employs read-ahead threads to read the next I/O block size worth of data.
CFS also employs write-behind threads to buffer the next block of data
in anticipation that it too
will be written to disk.
Use the
cfs_async_biod_threads
kernel attribute to
set the number of I/O threads that perform asynchronous read ahead and
write behind.
Read-ahead and write-behind threads apply only to
reads and writes originating on CFS clients.
The default size for
cfs_async_biod_threads
is 32.
In an environment where at one time you have more than 32 large files
sequentially accessed, increasing
cfs_async_biod_threads
can improve CFS performance,
particularly if the applications using
the files can benefit from lower latencies.
The number of read-ahead and write-behind threads is tunable
from 0 through 128.
When not in use, the threads consume few system resources.
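For example, assuming that the attribute belongs to the cfs kernel subsystem, you can check the current value and, if the attribute can be changed at run time on your system, raise it with commands similar to the following (64 is an arbitrary illustration); otherwise, set the new value in the member's /etc/sysconfigtab file and reboot:
# sysconfig -q cfs cfs_async_biod_threads
# sysconfig -r cfs cfs_async_biod_threads=64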
9.3.6.2 Taking Advantage of Direct I/O
When an application opens an AdvFS file with the
O_DIRECTIO
flag in the
open
system call, data I/O is direct to the
storage; the system software does no data caching for the file
at the file-system level.
In a cluster, this arrangement supports
concurrent direct I/O on the file from any member in the cluster.
That is,
regardless of which member originates the I/O request,
I/O to a file does not go through the cluster
interconnect to the CFS server.
Database applications frequently
use direct I/O in conjunction with raw asynchronous I/O (which is also supported in
a cluster) to improve I/O performance.
The best performance on a file that is opened for direct I/O is achieved under the following conditions:
A read from an existing location of the file
A write to an existing location of the file
When the size of the data being read or written is a multiple of the disk sector size, 512 bytes
The following conditions can result in less than optimal direct I/O performance:
Operations that cause a metadata change to a file. These operations go across the cluster interconnect to the CFS server of the file system when the application that is doing the direct I/O runs on a member other than the CFS server of the file system. Such operations include the following:
Any modification that fills a sparse hole in the file
Any modification that appends to the file
Any modification that truncates the file
Any read or write on a file that is less than 8K and consists solely of a fragment or any read/write to the fragment portion at the end of a larger file
Any unaligned block read or write that is not to an existing location of the file. If a request does not begin or end on a block boundary, multiple I/Os are performed.
When a file is open for direct I/O,
any AdvFS migrate operation (such as
migrate,
rmvol,
defragment, or
balance) on the domain will block until the I/O
that is in progress completes on all members.
Conversely, direct I/O will block until any AdvFS migrate
operation completes.
An application that uses direct I/O is responsible for managing its own caching. When performing multithreaded direct I/O on a single cluster member or multiple members, the application must also provide synchronization to ensure that, at any instant, only one thread is writing a sector while others are reading or writing.
For a discussion of direct I/O programming issues, see the chapter
on optimizing techniques in the Tru64 UNIX
Programmer's Guide.
9.3.6.2.1 Differences Between Cluster and Standalone AdvFS Direct I/O
The following list presents direct I/O behavior in a cluster that differs from that in a standalone system:
Performing any migrate operation on a file that is already opened for direct I/O blocks until the I/O that is in progress completes on all members. Subsequent I/O will block until the migrate operation completes.
AdvFS in a standalone system provides a guarantee at the sector level that, if multiple threads attempt to write to the same sector in a file, one will complete first and then the other. This guarantee is not provided in a cluster.
9.3.6.2.2 Cloning a Fileset with Files Open in Direct I/O Mode
As described in
Section 9.3.6.2, when an application
opens a file with the
O_DIRECTIO
flag in the
open
system call, I/O to the file does not go through the cluster
interconnect to the CFS server.
However, if you clone a fileset that
has files open in Direct I/O mode, the I/O does not follow this model and might cause
considerable performance degradation.
(Read performance is not impacted by the
cloning.)
The
clonefset
utility, which is described in
clonefset(8), creates a read-only copy, or clone, of an AdvFS fileset that you can back up while the original fileset remains in use.
If the fileset has files open in Direct I/O mode, when you modify a file AdvFS copies the original data to the clone storage. AdvFS does not send this copy operation over the cluster interconnect. However, CFS does send the write operation for the changed data in the fileset over the cluster interconnect to the CFS server unless the application using Direct I/O mode happens to be running on the CFS server. Sending the write operation over the cluster interconnect negates the advantages of opening the file in Direct I/O mode.
To retain the benefits of Direct I/O mode, remove the clone as
soon as the backup operation is complete so that writes are again written
directly to storage and are not sent over the cluster interconnect.
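For example, a backup sequence along the following lines creates a clone, backs it up, and removes the clone as soon as the backup completes; the domain, fileset, mount point, and tape device names are assumptions, and the exact clonefset and rmfset syntax should be checked against their reference pages:
# clonefset accounts_dmn accounts accounts_clone
# mount accounts_dmn#accounts_clone /backup_clone
# vdump -0 -f /dev/tape/tape0_d1 /backup_clone
# umount /backup_clone
# rmfset accounts_dmn accounts_clone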
9.3.6.2.3 Gathering Statistics on Direct I/O
If the performance gain for an application that uses direct I/O
is less than you expected, you can use the
cfsstat
command
to examine per-node global direct I/O statistics.
Use
cfsstat
to look at the global direct I/O
statistics without the application running.
Then execute the
application and examine the statistics again to determine whether
the paths that do not optimize direct I/O behavior were being
executed.
The following example shows how to use the
cfsstat
command to get direct I/O statistics:
# cfsstat directio
Concurrent Directio Stats:
160 direct i/o reads
160 direct i/o writes
0 aio raw reads
0 aio raw writes
0 unaligned block reads
0 fragment reads
0 zero-fill (hole) reads
160 file-extending writes
0 unaligned block writes
0 hole writes
0 fragment writes
0 truncates
The individual statistics have the following meanings:
direct i/o reads
The number of normal direct I/O read requests. These read requests were processed on the member that issued the request and were not sent to the AdvFS layer on the CFS server.
direct i/o writes
The number of normal direct I/O write requests processed. These write requests were processed on the member that issued the request and were not sent to the AdvFS layer on the CFS server.
aio raw reads
The number of normal direct I/O asynchronous read requests. These read requests were processed on the member that issued the request and were not sent to the AdvFS layer on the CFS server.
aio raw writes
The number of normal direct I/O asynchronous write requests. These write requests were processed on the member that issued the request and were not sent to the AdvFS layer on the CFS server.
unaligned block reads
The number of reads that were not a multiple of a disk sector size (currently 512 bytes). This count will be incremented for requests that do not start at a sector boundary or do not end on a sector boundary. An unaligned block read operation results in a read for the sector and a copyout of the user data requested from the proper location of the sector.
If the I/O request encompasses an existing location of the file and does not encompass a fragment, this operation does not get sent to the CFS server.
fragment reads
The number of read requests that needed to be sent to the CFS server because the request was for a portion of the file that contains a fragment.
A file that is less than 140K might contain a fragment at the end that is not a multiple of 8K. Also, a small file of less than 8K might consist solely of a fragment.
To ensure that a file of less than 8K does not consist of a fragment, always open the file only for direct I/O. Otherwise, on the close of a normal open, a fragment will be created for the file.
zero-fill (hole) reads
The number of reads that occurred to sparse areas of the files that were opened by direct I/O. This request is not sent to the CFS server.
file-extending writes
The number of write requests that were sent to the CFS server because they appended data to the file.
unaligned block writes
The number of writes that were not a multiple of a disk sector size (currently 512 bytes). This count will be incremented for requests that do not start at a sector boundary or do not end on a sector boundary. An unaligned block write operation results in a read for the sector, a copyin of the user data that is destined for a portion of the block, and a subsequent write of the merged data. These operations do not get sent to the CFS server.
If the I/O request encompasses an existing location of the file and does not encompass a fragment, this operation does not get sent to the CFS server.
hole writes
The number of write requests to an area that encompasses a sparse hole in the file that needed to be sent to AdvFS on the CFS server.
fragment writes
The number of write requests that needed to be sent to the CFS server because the request was for a portion of the file that contains a fragment.
A file that is less than 140K might contain a fragment at the end that is not a multiple of 8K. Also, a small file of less than 8K might consist solely of a fragment.
To ensure that a file of less than 8K does not consist of a fragment, always open the file only for direct I/O. Otherwise, on the close of a normal open, a fragment will be created for the file.
truncates
The number of truncate requests for direct I/O opened files. This request does get sent to the CFS server.
9.3.6.3 Adjusting CFS Memory Usage
In situations where one cluster member is the CFS server for a large number of file systems, the client members may cache a great many vnodes from the served file systems. For each cached vnode on a client, even vnodes that are not actively used, the CFS server must allocate 800 bytes of system memory for the CFS token structure that is needed to track the file at the CFS layer. In addition to this, the CFS token structures typically require corresponding AdvFS access structures and vnodes, resulting in a near-doubling of the amount of memory that is used.
By default, each client can use up to 4 percent of memory to cache vnodes. When multiple clients fill up their caches with vnodes from a CFS server, system memory on the server can become overtaxed, causing it to hang.
The
svrcfstok_max_percent
kernel attribute is designed to prevent such system hangs.
The attribute
sets an upper limit on the amount of memory that is allocated
by the CFS server to track vnode caching on clients.
The default value is 25 percent.
The memory is used only
if the server load requires it.
The memory is not allocated up front.
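For example, to display the current setting of the attribute on a member, you can query the cfs subsystem with sysconfig; the output shown here is illustrative and assumes the default value:
# sysconfig -q cfs svrcfstok_max_percent
cfs:
svrcfstok_max_percent = 25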
After the
svrcfstok_max_percent
limit
is reached on the server, an application accessing files that are served by
the member gets an
EMFILE
error.
Applications that use
perror()
to report the current
errno
setting write the message
too many open files
to the standard error stream,
stderr (typically the controlling
TTY or a log file used by the application).
Although you see
EMFILE
error messages,
no cached data is lost.
If applications start getting
EMFILE
errors, follow these steps:
Determine whether the CFS client is out of vnodes, as follows:
Get the current value of the
max_vnodes
kernel
attribute:
# sysconfig -q vfs max_vnodes
Use
dbx
to get the values of
total_vnodes
and
free_vnodes:
# dbx -k /vmunix /dev/mem
dbx version 5.0
Type 'help' for help.
(dbx) pd total_vnodes
total_vnodes_value
Get the value for
max_vnodes:
(dbx) pd max_vnodes
max_vnodes_value
If
total_vnodes
equals
max_vnodes
and
free_vnodes
equals 0, then that member
is out of vnodes.
In this case, you can increase the value of
the
max_vnodes
kernel attribute.
You can use
the
sysconfig
command to change
max_vnodes
on a running member.
For example,
to set the maximum number of vnodes to 20000, enter the following:
# sysconfig -r vfs max_vnodes=20000
If the CFS client is not out of vnodes,
then determine whether the CFS server has used all the memory
that is available for token structures
(svrcfstok_max_percent), as follows:
Log on to the CFS server.
Use
dbx
to get the current value
for
svrtok_active_svrcfstok:
# dbx -k /vmunix /dev/mem
dbx version 5.0
Type 'help' for help.
(dbx) pd svrtok_active_svrcfstok
active_svrcfstok_value
Get the value for
cfs_max_svrcfstok:
(dbx) pd cfs_max_svrcfstok
max_svrcfstok_value
If
svrtok_active_svrcfstok
is
equal to or greater than
cfs_max_svrcfstok,
then the CFS server has used all the memory that is available for token
structures.
In this case, the best solution to make the file systems usable again is to relocate some of the file systems to other cluster members. If that is not possible, then the following solutions are acceptable:
Increase the value of
cfs_max_svrcfstok.
You cannot change
cfs_max_svrcfstok
with the
sysconfig
command.
However, you can use
the
dbx assign
command to change the value of
cfs_max_svrcfstok
in the running kernel.
For example, to set the maximum number of
CFS server token structures to 80000, enter the following command:
(dbx) assign cfs_max_svrcfstok=80000
Values you assign with the
dbx assign
command are lost when the system is rebooted.
Increase the amount of memory that is available for token structures on the CFS server.
This option is undesirable on systems with small amounts of memory.
To increase
svrcfstok_max_percent, log on to the
server and run the
dxkerneltuner
command.
On the main window, select
the
cfs
kernel subsystem.
On the
cfs
window, enter an appropriate value for
svrcfstok_max_percent.
This change will
not take effect until the cluster member is rebooted.
Typically, when a CFS server reaches the
svrcfstok_max_percent
limit,
relocate some of the CFS file systems so that the burden of
serving the file systems is shared among cluster members.
You can
use startup scripts to run the
cfsmgr
and
automatically relocate file systems around the cluster at member startup.
Setting
svrcfstok_max_percent
below the default
is recommended only on smaller memory systems
that run out of memory because the default value of 25 percent is too high.
9.3.6.4 Using Memory Mapped Files
Using memory mapping to share a file across the cluster
for anything other than read-only access can negatively affect performance.
CFS I/O to a file does not perform well when multiple members
are simultaneously modifying the data.
This situation
forces premature cache flushes to ensure that all nodes have the
same view of the data at all times.
9.3.6.5 Avoiding Full File Systems
If free space in a file system is less than 50 MB or less
than 10 percent of the file system's size,
whichever is smaller, then write performance to the file system from
CFS clients suffers.
Performance suffers because all writes to nearly full file
systems are sent immediately to the
server to guarantee correct ENOSPC ("not enough space") semantics.
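To check how close a file system is to these thresholds, you can use the df command from any member; the mount point /projects is a hypothetical example:
# df -k /projects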
9.3.6.6 Other Strategies
The following measures can improve CFS performance:
Ensure that the cluster members have sufficient system memory.
In general, sharing a file for read/write access across cluster members may negatively affect performance because of all of the cache invalidations. CFS I/O to a file does not perform well if multiple members are simultaneously modifying the data. This situation forces premature cache flushes to ensure that all nodes have the same view of the data at all times.
If a distributed application does reads and writes on separate members, try locating the CFS server for the application's file systems on the member that performs the writes. Writes are more sensitive to remote I/O than reads.
If multiple applications access different sets of data in a single AdvFS domain, consider splitting the data into multiple domains. This arrangement allows you to spread the load to more than a single CFS server. It also presents the opportunity to colocate each application with the CFS server for that application's data without loading everything on a single member.
9.3.7 MFS and UFS File Systems Supported
TruCluster Server includes read/write support for memory file system (MFS) and UNIX file system (UFS) file systems.
When you mount a UFS file system in a cluster
for read/write access, or when you mount an MFS file system in a
cluster for read-only or read/write access,
the
mount
command
server_only
argument is used by default.
These
file systems are treated as partitioned file systems, as described in
Section 9.3.8.
That is, the file
system is accessible for both read-only and read/write access only by
the member that mounts it.
Other cluster members cannot read
from, or write to, the MFS or UFS file system.
Remote
access is not allowed; failover does not occur.
If you want to mount a UFS file system for read-only
access by all cluster members, you must explicitly mount it
read-only.
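For example, the following command, which assumes a hypothetical UFS file system on dsk18c, mounts it read-only so that all cluster members can read it:
# mount -t ufs -r /dev/disk/dsk18c /mnt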
9.3.8 Partitioning File Systems
CFS makes all files accessible to all cluster members. Each cluster member has the same access to a file, whether the file is stored on a device that is connected to all cluster members or on a device that is private to a single member. However, CFS makes it possible to mount an AdvFS file system so that it is accessible to only a single cluster member, which is referred to as file system partitioning.
The Available Server Environment (ASE), which is an earlier version of the TruCluster product, offered functionality like that of file system partitioning. File system partitioning is provided in TruCluster Server as of Version 5.1 to ease migration from ASE. File system partitioning in TruCluster Server is not intended as a general-purpose method for restricting file system access to a single member.
To mount a partitioned file system, log on to the member that
you want to give exclusive access to the file system.
Run the
mount
command with the
server_only
option.
This mounts the file
system on the member where you execute the
mount
command and gives that member exclusive access to the file system.
Although only the mounting member has access to the file system,
all members, cluster-wide, can see the file system mount.
The
server_only
option can be applied only to
AdvFS, MFS, and UFS file systems.
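For example, the following command, which assumes a hypothetical AdvFS fileset acct_dom#data and mount point /accounting, mounts the fileset for exclusive access by the member where you enter the command:
# mount -o server_only acct_dom#data /accounting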
Partitioned file systems are subject to the following limitations:
Starting with Tru64 UNIX Version 5.1B, file systems can be mounted under a partitioned file system if the file systems to be mounted are also partitioned file systems and are served by the same cluster member.
No failover via CFS
If the cluster member serving a partitioned file system fails, the file system is unmounted. You must remount the file system on another cluster member.
You can work around this by putting the application that uses the partitioned file system under the control of CAA. Because the application must run on the member where the partitioned file system is mounted, if the member fails, both the file system and application fail. An application that is under the control of CAA will fail over to a running cluster member. You can write the application's CAA action script to mount the partitioned file system on the new member. (A minimal action-script sketch appears after this list of limitations.)
NFS export
The best way to export a partitioned file system is to create a single node
cluster alias for the node serving the partitioned file system and include
that alias in the
/etc/exports.aliases
file.
See
Section 3.15
for additional information on
how to best utilize the
/etc/exports.aliases
file.
If you use the default cluster alias to NFS-mount file systems that the cluster serves, some NFS requests will be directed to a member that does not have access to the file system and will fail.
Another way to export a partitioned file system is to assign
the member that serves the partitioned file system the
highest cluster-alias selection priority
(selp) in the cluster.
If you do this,
the member will serve all NFS connection requests.
However,
the member will also have to handle
all network traffic of any type that is directed to the cluster,
which is not likely to be acceptable in most environments.
For more information about distributing connection requests, see Section 3.10.
No mixing partitioned and conventional filesets in the same domain
The
server_only
option applies to
all file systems in a domain.
The type of the first fileset mounted determines the type for all
filesets in the domain:
If a fileset is mounted without the
server_only
option, then attempts to mount another fileset in the domain
server_only
will fail.
If a fileset in a domain is mounted
server_only,
then all subsequent fileset mounts in that domain must be
server_only.
No manual relocation
To move a partitioned file system to a different CFS server, you must unmount the file system and then remount it on the target member. At the same time, you will need to move applications that use the file system.
No mount updates with
server_only
option
After you mount a file system normally, you cannot use the
mount -u
command with the
server_only
option on the file system.
For example, if
file_system
has already been mounted
without use of the
server_only
flag,
the following command fails:
# mount -u -o server_only file_system
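The following is a minimal sketch of a CAA action script that implements the workaround described in the failover limitation earlier in this list. The domain name acct_dom#data, the mount point /accounting, and the application start, stop, and check commands are hypothetical; an actual script must be tailored to your application and registered with CAA.
#!/usr/bin/ksh
# Minimal sketch of a CAA action script for an application that uses a
# partitioned (server_only) file system.  All names are hypothetical.
case "$1" in
start)
        # Mount the partitioned file system on this member, then start the application.
        mount -o server_only acct_dom#data /accounting || exit 1
        /usr/local/bin/start_acct_app || exit 1
        exit 0
        ;;
stop)
        # Stop the application, then unmount the partitioned file system.
        /usr/local/bin/stop_acct_app
        umount /accounting
        exit 0
        ;;
check)
        # Report failure if the hypothetical application daemon is not running.
        ps -e | grep -q acct_daemon || exit 1
        exit 0
        ;;
esac
exit 2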
9.3.9 Block Devices and Cache Coherency
A single block device can have multiple aliases.
In this situation, multiple block device special files in the
file system namespace will contain the same
dev_t.
These aliases can potentially be located across multiple domains
or file systems in the namespace.
On a standalone system, cache coherency is guaranteed among all opens
of the common underlying block device regardless of which alias was
used on the
open()
call for the device.
In a cluster, however, cache coherency can be obtained only
among all block device file
aliases that reside on the same domain or file system.
For example, if cluster member
mutt
serves a
domain with a block device file and
member
jeff
serves a domain with another
block device file with the same
dev_t, then
cache coherency is not provided if I/O
is performed simultaneously through these two aliases.
9.3.10 CFS Restrictions
The cluster file system (CFS) supports the network file system (NFS) client for read/write access.
When a file system is NFS-mounted in a cluster, CFS makes it available for read/write access from all cluster members. The member that has actually mounted it serves the file system to other cluster members.
If the member that has mounted the NFS file system shuts down or fails, the file system is automatically unmounted and CFS begins to clean up the mount points. During the cleanup process, members that access these mount points may see various types of behavior, depending upon how far the cleanup has progressed:
If members still have files open on that file system, their writes will be sent to a local cache instead of to the actual NFS-mounted file system.
After all of the files
on that file system have been closed, attempts to open a file on that
file system will fail with an
EIO
error until the
file system is remounted.
Applications may encounter
"Stale NFS
handle"
messages.
This is normal behavior on a standalone system, as
well as in a cluster.
Until the CFS cleanup is complete, members may still be able to create new files at the NFS file system's local mount point (or in any directories that were created locally beneath that mount point).
An NFS file system does not automatically fail
over to another cluster member.
Rather, you must manually remount it, at the same or a different mount point, from another
cluster member to make it available again.
Alternatively, booting a
cluster member will remount those file systems that are listed in the
/etc/fstab
file that are not currently mounted
and served in the cluster.
(If you are using AutoFS or automount, the
remount will happen automatically.)
9.4 Managing the Device Request Dispatcher
The device request dispatcher subsystem makes physical disk and tape storage transparently available to all cluster members, regardless of where the storage is physically located in the cluster. When an application requests access to a file, CFS passes the request to AdvFS, which then passes it to the device request dispatcher. In the file system hierarchy, the device request dispatcher sits right above the device drivers.
The primary tool for managing the device request dispatcher
is the
drdmgr
command.
A number of examples of using the command appear in
this section.
For more information, see
drdmgr(8).
9.4.1 Direct-Access I/O and Single-Server Devices
The device request dispatcher follows a client/server model; members serve devices, such as disks, tapes, and CD-ROM drives.
Devices in a cluster are either direct-access I/O devices or single-server devices. A direct-access I/O device supports simultaneous access from multiple cluster members. A single-server device supports access from only a single member.
Direct-access I/O devices on a shared bus are served by all
cluster members on that
bus.
A single-server device, whether on a shared bus or directly
connected to a cluster member, is served by a single member.
All other members access the served device through the serving
member.
Direct-access I/O devices are part of the device
request dispatcher subsystem, and have nothing to do with direct I/O
(opening a file with the
O_DIRECTIO
flag to the
open
system call),
which is handled by CFS.
See
Section 9.3.6.2
for
information about direct I/O and CFS.
Typically, disks on a shared bus are direct-access I/O devices, but in certain circumstances, some disks on a shared bus can be single-server. The exceptions occur when you add an RZ26, RZ28, RZ29, or RZ1CB-CA disk to an established cluster. Initially, such devices are single-server devices. See Section 9.4.1.1 for more information. Tape devices are always single-server devices.
Although single-server disks on a shared bus are supported, they are significantly slower when used as member boot disks or swap files, or for the retrieval of core dumps. We recommend that you use direct-access I/O disks in these situations.
Figure 9-3
shows a four-node cluster
with five disks and a tape drive on the shared bus.
Member systemd is not on the shared bus.
Its
access to cluster storage is routed through the cluster
interconnect.
Figure 9-3: Four Node Cluster
Disks on the shared bus are served by all the cluster members
on the bus.
You can confirm this by looking for the device
request dispatcher server of
dsk3
as follows:
# drdmgr -a server dsk3
Device Name: dsk3
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 3
Server Name: systema
Server State: Server
Server Name: systemb
Server State: Server
Server Name: systemc
Server State: Server
Because
dsk3
is a direct-access I/O device on the
shared bus, all three systems on the bus serve it: when any member on
the shared bus accesses the disk,
the access is directly from the member to the device.
Disks on private buses are served by the system that they are local to.
For example,
the server of
dsk7
is
systemb:
# drdmgr -a server dsk7
Device Name: dsk7
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 1
Server Name: systemb
Server State: Server
Tape drives are
always single-server.
Because
tape0
is on a shared bus, any member on
that bus can act as its server.
When the cluster is started,
the first active member that has access to the tape drive becomes the
server for the tape drive.
The numbering of disks indicates that when the
cluster booted,
systema
came up first.
It detected
its private disks first and labeled them, then it detected the disks on
the shared bus and labeled them.
Because
systema
came up first, it is also the server for
tape0.
To confirm this, enter the following command:
# drdmgr -a server tape0
Device Name: tape0
Device Type: Served Tape
Device Status: OK
Number of Servers: 1
Server Name: systema
Server State: Server
To change
tape0's server to
systemc,
enter the
drdmgr
command as follows:
# drdmgr -a server=systemc /dev/tape/tape0
For any single-server device, the serving member is also the access node. The following command confirms this:
# drdmgr -a accessnode tape0
Device Name: tape0
Access Node Name: systemc
Unlike the device request dispatcher
SERVER
attribute,
which for a given device is the same on all cluster members, the value
of the
ACCESSNODE
attribute is specific to a
cluster member.
Any system on a shared bus is always its own access node for the direct-access I/O devices on the same shared bus.
Because
systemd
is not on the shared bus,
for each direct-access I/O device on the shared bus you can specify
the access node to be used by
systemd
when it
accesses the device.
The access node must be one of the members on the
shared bus.
The result of the following command is that
systemc
handles all device request dispatcher activity between
systemd
and
dsk3:
# drdmgr -h systemd -a accessnode=systemc dsk3
9.4.1.1 Devices Supporting Direct-Access I/O
RAID-fronted disks are direct-access I/O capable. The following are examples of Redundant Array of Independent Disks (RAID) controllers:
HSZ80
HSG60
HSG80
RA3000 (HSZ22)
Enterprise Virtual Array (HSV110)
Any RZ26, RZ28, RZ29, and RZ1CB-CA disks already
installed in a system at the time
the system becomes a cluster member, either through the
clu_create
or
clu_add_member
command, are automatically enabled as direct-access I/O disks.
To later add one of these disks as a direct-access I/O disk, you must
use the procedure in
Section 9.2.3.
9.4.1.2 Replacing RZ26, RZ28, RZ29, or RZ1CB-CA as Direct-Access I/O Disks
If you replace an RZ26, RZ28, RZ29, or RZ1CB-CA direct-access I/O disk with a disk of the same type (for example, replace an RZ28-VA with another RZ28-VA), follow these steps to make the new disk a direct-access I/O disk:
Physically install the disk in the bus.
On each cluster member, enter the
hwmgr
command to scan for the
new disk as follows:
# hwmgr -scan comp -cat scsi_bus
Allow a minute or two for the scans to complete.
If you want the new disk to have the same device name as the disk it
replaced, use the
hwmgr -redirect scsi
command.
For details, see
hwmgr(8)
On each cluster member, enter the
clu_disk_install
command:
# clu_disk_install
Note
If the cluster has a large number of storage devices, the
clu_disk_install command can take several minutes to complete.
9.4.1.3 HSZ Hardware Supported on Shared Buses
For a list of hardware that is supported on shared buses, see the TruCluster Server Version 5.1B QuickSpecs.
If you try to use
an HSZ that does not have
the proper firmware revision on a shared bus, the cluster
might hang when there are multiple simultaneous attempts to access
the HSZ.
9.5 Managing AdvFS in a Cluster
For the most part, the Advanced file system (AdvFS) on a cluster is like that on a standalone system. However, this section describes some cluster-specific considerations:
Integrating AdvFS files from a newly added member (Section 9.5.1)
Creating only one fileset in the cluster root domain (Section 9.5.2)
Not adding filesets to a member's boot partition (Section 9.5.3)
Not adding a volume to a member's root domain (Section 9.5.4)
Using the
addvol
and
rmvol
commands
(Section 9.5.5)
Using user and group file system quotas (Section 9.5.6)
Understanding storage connectivity and AdvFS volumes (Section 9.5.7)
9.5.1 Integrating AdvFS Files from a Newly Added Member
Suppose that you add a new member to the cluster and that new member has AdvFS volumes and filesets from when it ran as a standalone system. To integrate these volumes and filesets into the cluster, you need to do the following:
Modify the
/etc/fstab
file
listing the
domains#filesets
that you want to integrate into the cluster.
Make the new domains
known to the cluster, either by manually entering the domain information
into
/etc/fdmns
or by running the
advscan
command.
For information on the
advscan
command, see
advscan(8).
For information about recreating domain entries in
/etc/fdmns,
see the section on restoring an AdvFS file system in
the Tru64 UNIX
AdvFS Administration
manual.
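As a minimal sketch of the manual approach, assume that the new member's disk dsk16c (a hypothetical name) holds a domain that you want to call tools_dom with a fileset named tools. You can recreate the domain entry and make the fileset available clusterwide as follows:
# mkdir /etc/fdmns/tools_dom
# cd /etc/fdmns/tools_dom
# ln -s /dev/disk/dsk16c
Then add an entry such as the following to the clusterwide /etc/fstab file:
tools_dom#tools   /tools   advfs rw 0 2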
9.5.2 Create Only One Fileset in Cluster Root Domain
The root domain,
cluster_root, must
contain only a single fileset.
If you create more than one fileset in
cluster_root
(you are not prevented from
doing so), it can lead to a panic if the
cluster_root
domain needs to fail over.
As an example of when this situation might occur, consider
cloned filesets.
As described in
advfs(4), you can create and mount a clone fileset (a read-only snapshot) of the fileset in the
cluster_root
domain.
If the
cluster_root
domain has to fail
over while the cloned fileset is mounted, the cluster will
panic.
Note
If you make backups of the clusterwide root from a cloned fileset, minimize the amount of time during which the clone is mounted. Mount the cloned fileset, perform the backup, and unmount the clone as quickly as possible.
9.5.3 Adding Filesets to a Member's Boot Partition Not Recommended
Although you are not prohibited from adding filesets to a member's boot
partition, we do not recommend it.
If a member leaves the cluster,
all filesets mounted from that member's boot partition are
force-unmounted and cannot be relocated.
9.5.4 Do Not Add a Volume to a Member's Root Domain
You cannot use the
addvol
command to add volumes to a member's
root domain (rootmemberID_domain#root).
Instead, you must delete the member from the cluster, use
diskconfig
or SysMan to configure the disk
appropriately, and then add the member back
into the cluster.
For the configuration requirements for a member boot
disk, see the
Cluster Installation
manual.
9.5.5 Using the addvol and rmvol Commands in a Cluster
You can manage AdvFS domains from any
cluster member, regardless of
whether the domains are mounted on the local member or a remote member.
However, when you use the
addvol
or
rmvol
command from a member that is not the CFS
server for the domain you
are managing, the commands use
rsh
to execute
remotely on the member that is the CFS server for the domain.
This
has the following consequences:
If
addvol
or
rmvol
is entered
from a member that is not the server of the domain, and if the member
that is serving the domain fails, the command can hang on the
system where it was executed until TCP times out, which can take as
long as an hour.
If this situation occurs, you can kill the command and its associated
rsh
processes and repeat the command as follows:
Get the process identifiers (PIDs) with the
ps
command and pipe
the output through
more, searching for
addvol
or
rmvol, whichever
is appropriate.
For example:
# ps -el | more +/addvol
80808001  I +     0 16253977 16253835  0.0  44  0  451700  424K wait   pts/0  0:00.09 addvol
80808001  I +     0 16253980 16253977  0.0  44  0  1e6200  224K event  pts/0  0:00.02 rsh
  808001  I +     0 16253981 16253980  0.0  44  0  a82200   56K tty    pts/0  0:00.00 rsh
Use the process IDs (in this example, PIDs
16253977,
16253980,
and
16253981) and parent process IDs
(PPIDs
16253977
and
16253980) to confirm
the association between the
addvol
or
rmvol
and the
rsh
processes.
Two
rsh
processes are associated with the
addvol
process.
All
three processes must be killed.
Kill the appropriate processes. In this example:
# kill -9 16253977 16253980 16253981
Reenter the
addvol
or
rmvol
command.
In the case of
addvol, you must use the
-F
option because the hung
addvol
command might have already changed the
disk label type to AdvFS.
Alternately, before using either the
addvol
or
rmvol
command on a domain,
you can do the following:
Use the
cfsmgr
command to learn the name of the CFS
server of the domain:
# cfsmgr -d domain_name
To get a list of the servers of all CFS domains, enter only the
cfsmgr
command.
Log in to the serving member.
Use the
addvol
or
rmvol
command.
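For example, assuming a hypothetical domain named projects_dom and an unused disk dsk14c, the sequence might look like the following. First identify the serving member:
# cfsmgr -d projects_dom
Then log in to the member shown as the server, add the volume, and confirm the result:
# addvol /dev/disk/dsk14c projects_dom
# showfdmn projects_dom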
If the CFS
server for the volume fails over to another member
in the middle of an
addvol
or
rmvol
operation, you may need to reenter the command because the new server
undoes any partial operation.
The
command does
not return a message indicating that the server failed, and the
operation must be repeated.
We recommend that you enter a
showfdmn
command for the target domain of an
addvol
or
rmvol
command after the
command returns.
The
rmvol
and
addvol
commands
use
rsh
when the member where the commands are executed
is not the server of the domain.
For
rsh
to function, the default cluster alias must
appear in the
/.rhosts
file.
The entry for
the cluster alias in
/.rhosts
can take the form of
the fully qualified host name or the unqualified host name.
Although the
plus sign (+) can appear in place of the host name, allowing
all hosts access, this is not
recommended for security reasons.
The
clu_create
command automatically places
the cluster alias in
/.rhosts,
so
rsh
normally works without your intervention.
If the
rmvol
or
addvol
command
fails because of
rsh
failure, the following
message is returned:
rsh failure, check that the /.rhosts file allows cluster alias access.
9.5.6 User and Group File System Quotas Are Supported
TruCluster Server includes quota support that allows you to limit both the number of files and the total amount of disk space that are allocated in an AdvFS file system on behalf of a given user or group.
Quota support in a TruCluster Server environment is similar to quota support in the Tru64 UNIX system, with the following exceptions:
Hard limits are not absolute because the cluster file system (CFS) makes certain assumptions about how and when cached data is written.
Soft limits and grace periods are supported, but a user might not get a message when the soft limit is exceeded from a client node, and such a message might not arrive in a timely manner.
The quota commands are effective clusterwide.
However, you must edit the
/sys/conf/NAME
system configuration file on each cluster
member to configure the system to include the quota subsystem.
If
you do not perform this step on a cluster member, quotas are
enabled on that member but you cannot enter quota
commands from that member.
TruCluster Server supports quotas only for AdvFS file systems.
Users and groups are managed clusterwide. Therefore, user and group quotas are also managed clusterwide.
This section describes information that is unique to managing
disk quotas in a TruCluster Server environment.
For general
information about managing quotas, see the Tru64 UNIX
System Administration
manual.
9.5.6.1 Quota Hard Limits
In a Tru64 UNIX system, a hard limit places an absolute upper boundary on the number of files or amount of disk space that a given user or group can allocate on a given file system. When a hard limit is reached, disk space allocations or file creations are not allowed. System calls that would cause the hard limit to be exceeded fail with a quota violation.
In a TruCluster Server environment, hard limits for the number of files are enforced as they are in a standalone Tru64 UNIX system.
However, hard limits on the total amount of disk space are not as rigidly enforced. For performance reasons, CFS allows client nodes to cache a configurable amount of data for a given user or group without any communication with the member serving that data. After the data is cached on behalf of a given write operation and the write operation returns to the caller, CFS guarantees that, barring a failure of the client node, the cached data will eventually be written to disk at the server.
Writing the cached data takes precedence over strictly enforcing the disk quota. If and when a quota violation occurs, the data in the cache is written to disk regardless of the violation. Subsequent writes by this group or user are not cached until the quota violation is corrected.
Because additional data is not written
to the cache while quota violations are being generated, the hard
limit is never exceeded by more than the
sum of
quota_excess_blocks
on all cluster members.
The actual disk space quota for a user or group is therefore
determined by the hard limit plus the sum of
quota_excess_blocks
on all cluster members.
The amount of data that a given user or group is allowed to cache is
determined by the
quota_excess_blocks
value, which is
located in the member-specific
/etc/sysconfigtab
file.
The
quota_excess_blocks
value is
expressed in units of 1024-byte blocks and the default value of 1024
represents 1 MB of disk space.
The value of
quota_excess_blocks
does
not have to be the same on all cluster members.
You might use a
larger
quota_excess_blocks
value on cluster members
on which you expect most of the data to be generated, and accept the
default value for
quota_excess_blocks
on other
cluster members.
9.5.6.2 Setting the
quota_excess_blocks Value
The value for
quota_excess_blocks
is
maintained in the
/etc/sysconfigtab
file in the
cfs
stanza.
Avoid making manual changes to this
file.
Instead, use the
sysconfigdb
command to make
changes.
This utility automatically makes any changes available
to the kernel and preserves the structure of the file so that future
upgrades merge in correctly.
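For example, the following is a minimal sketch of one way to set quota_excess_blocks to 2048 (2 MB) on a member. The stanza file name is arbitrary, and the example assumes the merge (-m) form of sysconfigdb; see sysconfigdb(8) for the exact options:
# cat > /tmp/cfs_quota.stanza << EOF
cfs:
        quota_excess_blocks = 2048
EOF
# sysconfigdb -m -f /tmp/cfs_quota.stanza cfs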
Performance for a given user or group can be
affected by
quota_excess_blocks.
If this value
is set too low, CFS cannot use the cache
efficiently.
Setting
quota_excess_blocks
to less
than 64K will have a severe performance impact.
Conversely, setting
quota_excess_blocks
too
high increases the actual
amount of disk space that a user or group can consume.
We recommend accepting the
quota_excess_blocks
default of 1 MB, or increasing it as much as is considered
practical given its effect of raising the potential upper limit on
disk block usage.
When determining
how to set this value, consider
that the worst-case upper boundary is determined as follows:
(admin specified hard limit) + (sum of "quota_excess_blocks" on each client node)
CFS makes a significant effort to minimize the amount by which the
hard quota limit is exceeded; you are very unlikely to reach
the worst-case upper boundary.
9.5.7 Storage Connectivity and AdvFS Volumes
All volumes in an AdvFS domain must have the same connectivity if failover capability is desired. Volumes have the same connectivity when either one of the following conditions is true:
All volumes in the AdvFS domain are on the same shared SCSI bus.
Volumes in the AdvFS domain are on different shared SCSI buses, but all of those buses are connected to the same cluster members.
The
drdmgr
and
hwmgr
commands can give you information about which systems serve which disks.
To get a graphical display of the cluster hardware configuration, including
active members, buses, storage devices, and their connections, use the
sms
command to invoke the graphical interface
for the SysMan Station, and then select Hardware
from the Views menu.
9.6 Considerations When Creating New File Systems
Most aspects of creating new file systems are the same in a cluster and a standalone environment. The Tru64 UNIX AdvFS Administration manual presents an extensive description of how to create AdvFS file systems in a standalone environment.
For information about adding disks to the cluster, see Section 9.2.3.
The following are important cluster-specific considerations for creating new file systems:
To ensure the highest availability, make sure that all disks that are used for volumes in an AdvFS domain have the same connectivity.
We recommend that all LSM volumes that are placed into an AdvFS domain share the same connectivity. See the Tru64 UNIX Logical Storage Manager manual for more on LSM volumes and connectivity.
When you determine whether a disk is in use, make sure it is not used as any of the following:
The cluster quorum disk
Do not use any of the partitions on a quorum disk for data.
The clusterwide root file system, the
clusterwide
/var
file system, or the
clusterwide
/usr
file system
A member's boot disk
See Section 11.1.5 for a description of the member boot disk and how to configure one.
A single
/etc/fstab
file applies to all members
of a cluster.
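For example, the following is a minimal sketch of creating a new clusterwide AdvFS file system. The disk dsk17c, the domain name data_dom, the fileset name data, and the mount point /data are hypothetical; choose a disk with the connectivity that you require:
# mkfdmn /dev/disk/dsk17c data_dom
# mkfset data_dom data
# mkdir /data
Add an entry such as the following to the clusterwide /etc/fstab file:
data_dom#data   /data   advfs rw 0 2
Then mount the fileset from any member:
# mount /data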
9.6.1 Verifying Disk Connectivity
To ensure the highest availability, make sure that all disks that are used for volumes in an AdvFS domain have the same connectivity.
Disks have the same connectivity when either one of the following conditions is true:
All disks that are used for volumes in the AdvFS domain are on the same shared SCSI bus.
Disks that are used for volumes in the AdvFS domain are on different shared SCSI buses, but all of those buses are connected to the same cluster members.
The easiest way to verify disk connectivity is to
use the
sms
command to invoke the graphical
interface for the SysMan Station, and then select
Hardware from the Views menu.
For example, in
Figure 9-1, the SCSI bus
that is connected to the
pza0s is shared by all
three cluster members.
All disks
on that bus have the same connectivity.
You can also use the
hwmgr
command to view all the
devices on the cluster and then pick out those disks that show up
multiple times because they are connected to several members.
For example:
# hwmgr -view devices -cluster
HWID: Device Name Mfg Model Hostname Location
-------------------------------------------------------------------------------
3: kevm pepicelli
28: /dev/disk/floppy0c 3.5in floppy pepicelli fdi0-unit-0
40: /dev/disk/dsk0c DEC RZ28M (C) DEC pepicelli bus-0-targ-0-lun-0
41: /dev/disk/dsk1c DEC RZ28L-AS (C) DEC pepicelli bus-0-targ-1-lun-0
42: /dev/disk/dsk2c DEC RZ28 (C) DEC pepicelli bus-0-targ-2-lun-0
43: /dev/disk/cdrom0c DEC RRD46 (C) DEC pepicelli bus-0-targ-6-lun-0
44: /dev/disk/dsk13c DEC RZ28M (C) DEC pepicelli bus-1-targ-1-lun-0
44: /dev/disk/dsk13c DEC RZ28M (C) DEC polishham bus-1-targ-1-lun-0
44: /dev/disk/dsk13c DEC RZ28M (C) DEC provolone bus-1-targ-1-lun-0
45: /dev/disk/dsk14c DEC RZ28L-AS (C) DEC pepicelli bus-1-targ-2-lun-0
45: /dev/disk/dsk14c DEC RZ28L-AS (C) DEC polishham bus-1-targ-2-lun-0
45: /dev/disk/dsk14c DEC RZ28L-AS (C) DEC provolone bus-1-targ-2-lun-0
46: /dev/disk/dsk15c DEC RZ29B (C) DEC pepicelli bus-1-targ-3-lun-0
46: /dev/disk/dsk15c DEC RZ29B (C) DEC polishham bus-1-targ-3-lun-0
46: /dev/disk/dsk15c DEC RZ29B (C) DEC provolone bus-1-targ-3-lun-0
.
.
.
In this partial output,
dsk0,
dsk1, and
dsk2
are private disks that are connected to
pepicelli's local
bus.
None of these are appropriate for a file system that
needs failover capability, and they are not good choices
for Logical Storage Manager (LSM) volumes.
Disks
dsk13
(HWID 44),
dsk14
(HWID 45), and
dsk15
(HWID 46) are connected to
pepicelli,
polishham, and
provolone.
These three disks all have the same connectivity.
9.6.2 Looking for Available Disks
When you want to determine whether disks are already in use, look for the
quorum disk, disks containing the clusterwide file systems,
and member boot disks and swap areas.
9.6.2.1 Looking for the Location of the Quorum Disk
You can learn the location of the
quorum disk by using the
clu_quorum
command.
In the following example, the partial output for the command shows that
dsk10
is the cluster quorum disk:
# clu_quorum
Cluster Quorum Data for: deli as of Wed Apr 25 09:27:36 EDT 2001
Cluster Common Quorum Data
Quorum disk: dsk10h
.
.
.
You can also use the
disklabel
command
to look for a quorum disk.
All partitions in a quorum
disk are unused, except for the
h
partition, which has
fstype
cnx.
9.6.2.2 Looking for the Location of Member Boot Disks and Clusterwide AdvFS File Systems
To learn the locations of member boot disks and clusterwide
AdvFS file
systems, look for the file domain entries in
the
/etc/fdmns
directory.
You can use the
ls
command for this.
For example:
# ls /etc/fdmns/*
/etc/fdmns/cluster_root:
dsk3c

/etc/fdmns/cluster_usr:
dsk5c

/etc/fdmns/cluster_var:
dsk6c

/etc/fdmns/projects1_data:
dsk9c

/etc/fdmns/projects2_data:
dsk11c

/etc/fdmns/projects_tools:
dsk12c

/etc/fdmns/root1_domain:
dsk4a

/etc/fdmns/root2_domain:
dsk8a

/etc/fdmns/root3_domain:
dsk2a

/etc/fdmns/root_domain:
dsk0a

/etc/fdmns/usr_domain:
dsk0g
This output from the
ls
command
indicates the following:
Disk
dsk3
is used by the clusterwide
root file system (/).
You cannot use this disk.
Disk
dsk5
is used by the clusterwide
/usr
file system.
You cannot use this disk.
Disk
dsk6
is used by the clusterwide
/var
file system.
You cannot use this disk.
Disks
dsk4,
dsk8, and
dsk2
are member boot disks.
You cannot use these disks.
You can also use the
disklabel
command to identify
member boot disks.
They have three partitions:
the
a
partition has
fstype
AdvFS,
the
b
partition has
fstype
swap, and
the
h
partition has
fstype
cnx.
Disks
dsk9,
dsk11, and
dsk12
appear to be used for data and tools.
Disk
dsk0
is the boot disk for
the noncluster, base Tru64 UNIX operating system.
Keep this disk unchanged in case you need to boot the noncluster kernel to make repairs.
9.6.2.3 Looking for Member Swap Areas
A member's primary swap area is always the
b
partition of the member boot disk.
(For information about member boot disks, see
Section 11.1.5.)
However, a member might have additional swap areas.
If a member is down, be careful not to use the member's
swap area.
To learn whether a disk has swap areas on it, use
the
disklabel -r
command.
Look in the
fstype
column in the output for
partitions with
fstype
swap.
In the following
example, partition
b
on
dsk11
is a swap partition:
# disklabel -r dsk11
.
.
.
8 partitions:
# size offset fstype [fsize bsize cpg] # NOTE: values not exact
a: 262144 0 AdvFS # (Cyl. 0 - 165*)
b: 401408 262144 swap # (Cyl. 165*- 418*)
c: 4110480 0 unused 0 0 # (Cyl. 0 - 2594)
d: 1148976 663552 unused 0 0 # (Cyl. 418*- 1144*)
e: 1148976 1812528 unused 0 0 # (Cyl. 1144*- 1869*)
f: 1148976 2961504 unused 0 0 # (Cyl. 1869*- 2594)
g: 1433600 663552 AdvFS # (Cyl. 418*- 1323*)
h: 2013328 2097152 AdvFS # (Cyl. 1323*- 2594)
You can use the SysMan Station graphical user interface (GUI) to
create and configure
an AdvFS volume.
However, if you choose to use the
command line, when it comes time to edit
/etc/fstab, you need do it only once, and
you can do it on any cluster member.
The
/etc/fstab
file is
not a CDSL.
A single file is used by all cluster members.
9.7 Managing CDFS File Systems
In a cluster, a CD-ROM drive is always a served device. The drive must be connected to a local bus; it cannot be connected to a shared bus. The following are restrictions on managing a CD-ROM file system (CDFS) in a cluster:
The
cddevsuppl
command is not supported in a
cluster.
Certain CDFS management commands work only when executed from the cluster member that is the CFS server of the CDFS file system.
Regardless of which member mounts the CD-ROM, the member that is connected to the drive is the CFS server for the CDFS file system.
To manage a CDFS file system, follow these steps:
Enter the
cfsmgr
command to learn which member
currently serves the CDFS:
# cfsmgr
Log in on the serving member.
Use the appropriate commands to perform the management tasks.
For information about using library functions that manipulate the
CDFS, see the TruCluster Server
Cluster Highly Available Applications
manual.
9.8 Backing Up and Restoring Files
Backing up and restoring user data in a cluster is similar to doing so in a standalone system.
You back up and restore CDSLs
like any other symbolic links.
To back up all the targets of CDSLs,
back up the
/cluster/members
area.
Make sure that all restore software that you plan to use
is available on the Tru64 UNIX disk of the system that
was the initial cluster member.
Treat this disk as the
emergency repair disk for the cluster.
If the cluster loses
the root domain,
cluster_root, you can
boot the initial cluster member from the Tru64 UNIX disk
and restore
cluster_root.
The
bttape
utility is not supported in clusters.
The
clonefset
utility,
described in
clonefset(8), creates a read-only clone fileset that you can back up with the
vdump
command or
other supported backup utility.
(The
dump
command is not supported by AdvFS.) You
might find it useful to use the
clonefset
utility to back up cluster file systems.
If you do
make backups of the clusterwide root from a cloned fileset,
minimize the amount of time during which the clone is mounted.
Mount the cloned fileset, perform the backup, and unmount the
clone as quickly as possible.
See
Section 9.5.2
for additional information.
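As a minimal sketch, assume that you want to back up the clusterwide /usr fileset (cluster_usr#usr) to the tape device /dev/tape/tape0; the clone name usr_clone and the mount point /clone_usr are hypothetical. Remove the clone as soon as the backup completes:
# clonefset cluster_usr usr usr_clone
# mkdir /clone_usr
# mount -r cluster_usr#usr_clone /clone_usr
# vdump -0 -f /dev/tape/tape0 /clone_usr
# umount /clone_usr
# rmfset cluster_usr usr_clone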
9.8.1 Suggestions for Files to Back Up
Back up data files and the following file systems regularly:
The clusterwide root file system
Use the same backup and restore methods that you use for user data.
The clusterwide
/usr
file system
Use the same backup and restore methods that you use for user data.
The clusterwide
/var
file system
Use the same backup and restore methods that you use for user data.
If, before installing TruCluster Server, you were using AdvFS and
had
/var
located in
/usr
(usr_domain#var),
the installation process moved
/var
into its own domain (cluster_var#var).
Because of this move, you must back up
/var
as a
separate file system from
/usr.
Member boot disks
See Section 11.1.5 for special considerations for backing up and restoring member boot disks.
9.9 Managing Swap Space
Do not put swap entries in
/etc/fstab.
In Tru64 UNIX Version 5.0 the list of swap devices was moved from the
/etc/fstab
file to the
/etc/sysconfigtab
file.
Additionally, you no longer
use the
/sbin/swapdefault
file
to indicate the swap allocation; use the
/etc/sysconfigtab
file for this purpose as well.
The swap devices and swap allocation mode are automatically placed in the
/etc/sysconfigtab
file during installation of the base operating system.
For more
information, see the Tru64 UNIX
System Administration
manual
and
swapon(8)
Put each member's swap information in
that member's
sysconfigtab
file.
Do not put any swap
information in the clusterwide
/etc/fstab
file.
Swap information in
sysconfigtab
is identified
by the
swapdevice
attribute.
The format for swap information is as follows:
swapdevice=disk_partition,disk_partition,...
For example:
swapdevice=/dev/disk/dsk1b,/dev/disk/dsk3b
Specifying swap entries in
/etc/fstab
does not
work in a cluster because
/etc/fstab
is not
member-specific; it is a clusterwide file.
If swap is
specified in
/etc/fstab, the first member
to boot and form a cluster reads and mounts all the file systems in
/etc/fstab.
The other members never see that
swap space.
The file
/etc/sysconfigtab
is a context-dependent
symbolic link (CDSL), so that each member can
find and mount its specific swap partitions.
The installation script automatically
configures one swap device for each member, and puts a
swapdevice=
entry in that member's
sysconfigtab
file.
If you want to add additional swap space, specify the new partition
with
swapon, and then put an entry in
sysconfigtab
so the partition is available
following a reboot.
For example, to configure
dsk3b
for use as a secondary swap device for a member already
using
dsk1b
for swap, enter the following
command:
# swapon -s /dev/disk/dsk3b
Then, edit that member's
/etc/sysconfigtab
and add
/dev/disk/dsk3b.
The final
entry in
/etc/sysconfigtab
will look like the
following:
swapdevice=/dev/disk/dsk1b,/dev/disk/dsk3b
9.9.1 Locating Swap Device for Improved Performance
Locating a member's swap space on a device on a shared bus results in additional I/O traffic on the bus. To avoid this, you can place swap on a disk on the member's local bus.
The only downside to locating swap local to the member is
the unlikely case where the member loses its path to the swap disk,
which can happen when an adapter fails.
In this situation, the
member will fail.
When the swap disk is on
a shared bus, the member can still use its swap partition as long
as at least one member still has a path to the disk.
9.10 Fixing Problems with Boot Parameters
If a cluster member fails to boot due to parameter problems in the
member's root domain
(rootN_domain),
you can mount that domain on a running
member and make the needed
changes to the parameters.
However, before booting the
down member, you must unmount the
newly updated member root
domain from the running cluster member.
Failure to do so can cause a crash and result in the display of the following message:
cfs_mountroot: CFS server already exists for node boot partition.
For more information, see
Section 11.1.10.
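As a minimal sketch, assume that member 2 cannot boot and that its root domain is root2_domain. From a running member, you might mount the domain, correct the boot parameters (for example, in the member's etc/sysconfigtab on that partition), and unmount it before booting member 2:
# mount root2_domain#root /mnt
Edit the parameter files under /mnt as needed, then unmount the domain:
# umount /mnt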
9.11 Using the verify Utility in a Cluster
The
verify
utility
examines the on-disk metadata structures of AdvFS file systems.
Before
using the utility, you must unmount all filesets in the file domain
to be verified.
If you are running the
verify
utility and the
cluster member on which it is running fails, extraneous mounts may be
left.
This can happen because the
verify
utility creates temporary mounts of the
filesets that are in the domain that is being verified.
On a single system these mounts go away if the system
fails while running the utility, but, in a cluster, the mounts
fail over to another cluster member.
The fact that these mounts
fail over also prevents you from mounting the filesets until
you remove the spurious mounts.
When
verify
runs, it creates a directory for
each fileset in the domain and then mounts each fileset on the
corresponding directory.
A directory is named as follows:
/etc/fdmns/domain/set_verify_XXXXXX,
where
XXXXXX
is a unique ID.
For example, if the domain name is
dom2
and the
filesets in
dom2
are
fset1,
fset2, and
fset3, enter the following command:
# ls -l /etc/fdmns/dom2
total 24
lrwxr-xr-x   1 root   system    15 Dec 31 13:55 dsk3a -> /dev/disk/dsk3a
lrwxr-x---   1 root   system    15 Dec 31 13:55 dsk3d -> /dev/disk/dsk3d
drwxr-xr-x   3 root   system  8192 Jan  7 10:36 fset1_verify_aacTxa
drwxr-xr-x   4 root   system  8192 Jan  7 10:36 fset2_verify_aacTxa
drwxr-xr-x   3 root   system  8192 Jan  7 10:36 fset3_verify_aacTxa
To clean up the failed-over mounts, follow these steps:
Unmount all the filesets in
/etc/fdmns:
# umount /etc/fdmns/*/*_verify_*
Delete the failed-over mount-point directories with the following command:
# rm -rf /etc/fdmns/*/*_verify_*
Remount the filesets like you do after a normal
completion of the
verify
utility.
For more information about
verify, see
verify(8).
9.11.1 Using the verify Utility on Cluster Root
The
verify
utility has been modified
to allow it to run on active domains.
Use the
-a
option to examine the cluster root file system,
cluster_root.
You must execute the
verify -a
utility
on the member that is serving the domain that you are examining.
Use the
cfsmgr
command to determine which
member serves the domain.
When
verify
runs with the
-a
option, it only examines the domain.
No fixes can be
done on the active domain.
The
-f
and
-d
options cannot be used with the
-a
option.
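For example, assuming that you have confirmed with cfsmgr that the local member serves the cluster_root domain, you might enter the following commands:
# cfsmgr -d cluster_root
# verify -a cluster_root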