This chapter contains information specific to managing storage devices in a TruCluster Server system. The chapter discusses the following subjects:
Working with CDSLs (Section 9.1)
Managing devices (Section 9.2)
Managing the cluster file system (Section 9.3)
Managing the device request dispatcher (Section 9.4)
Managing AdvFS in a cluster (Section 9.5)
Creating new file systems (Section 9.6)
Managing CDFS file systems (Section 9.7)
Backing up and restoring files (Section 9.8)
Managing swap space (Section 9.9)
Fixing problems with boot parameters (Section 9.10)
Using the
verify
command in a cluster
(Section 9.11)
You can find other information
on device management in the Tru64 UNIX Version 5.1B documentation
that is listed in
Table 9-1.
Table 9-1: Sources of Information on Storage Device Management
| Topic | Tru64 UNIX Manual |
| Administering devices | Hardware Management manual |
| Administering file systems | System Administration manual |
| Administering the archiving services | System Administration manual |
| Managing AdvFS | AdvFS Administration manual |
For information about Logical
Storage Manager (LSM) and clusters, see
Chapter 10.
9.1 Working with CDSLs
A context-dependent symbolic link (CDSL) contains a variable that identifies a cluster member. This variable is resolved at run time into a target.
A CDSL is structured as follows:
/etc/rc.config -> ../cluster/members/{memb}/etc/rc.config
When resolving a CDSL pathname, the kernel replaces the string
{memb}
with the string
membern,
where
n
is the member ID of the
current member.
For example, on a cluster member whose member ID is 2,
the pathname
/cluster/members/{memb}/etc/rc.config
resolves to
/cluster/members/member2/etc/rc.config.
CDSLs provide a way for a single file name to point to one of several
files.
Clusters use this to allow member-specific
files that can be addressed throughout the cluster by a single file name.
System data and configuration files tend to be CDSLs.
They are found
in the root (/),
/usr,
and
/var
directories.
9.1.1 Making CDSLs
The
mkcdsl
command provides a simple tool for
creating and populating CDSLs.
For example, to make a new CDSL for the
file
/usr/accounts/usage-history,
enter the following command:
# mkcdsl /usr/accounts/usage-history
When you list the results, you see the following output:
# ls -l /usr/accounts/usage-history
... /usr/accounts/usage-history -> ../cluster/members/{memb}/accounts/usage-history
The CDSL
usage-history
is created in
/usr/accounts.
No files are created in any member's
/usr/cluster/members/{memb}
directory.
To move a file into a CDSL, enter the following command:
# mkcdsl -c targetname
To replace an existing file when using the copy
(-c) option, you must also
use the force (-f) option.
The
-c
option copies the source file to the member-specific
area on the cluster member where the
mkcdsl
command
executes and then replaces the source file with a CDSL.
To copy a
source file to the member-specific area on all cluster members and
then replace the source file with a CDSL, use the
-a
option to the command as follows:
# mkcdsl -a filename
Remove a CDSL with the
rm
command,
as you do for any symbolic link.
The file
/var/adm/cdsl_admin.inv
stores a record
of the cluster's CDSLs.
When you use
mkcdsl
to add
CDSLs, the command
updates
/var/adm/cdsl_admin.inv.
If you use the
ln -s
command to create CDSLs,
/var/adm/cdsl_admin.inv
is not updated.
To update
/var/adm/cdsl_admin.inv, enter the following:
# mkcdsl -i targetname
Update the inventory when you remove a CDSL, or if you use
the
ln -s
command to create a CDSL.
For more information, see
mkcdsl(8).
9.1.2 Maintaining CDSLs
The following tools can help you maintain CDSLs:
The mkcdsl command, including its -i option for updating the CDSL inventory (see mkcdsl(8))
The clu_check_config command, which verifies the installed CDSLs
The following example shows
the output (and the pointer to a log file containing the errors) when
clu_check_config
finds a bad or missing CDSL:
# clu_check_config -s check_cdsl_config
Starting Cluster Configuration Check...
check_cdsl_config : Checking installed CDSLs
check_cdsl_config : CDSLs configuration errors : See /var/adm/cdsl_check_list
clu_check_config : detected one or more configuration errors
As a general rule, before you move a file, make sure that the destination
is not a CDSL.
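For example, you can check a destination by listing it and looking for the {memb} string in the link target:
# ls -l /etc/rc.config
... /etc/rc.config -> ../cluster/members/{memb}/etc/rc.config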
If you do mistakenly overwrite a CDSL, use the
mkcdsl -c filename
command on the appropriate cluster member
to copy the file and re-create the CDSL.
9.1.3 Kernel Builds and CDSLs
When you build a kernel in a cluster, use the
cp
command to copy the new kernel from
/sys/HOSTNAME/vmunix
to
/vmunix.
If you move the kernel to
/vmunix, you will
overwrite the
/vmunix
CDSL.
The result will be
that the next time that cluster member boots, it will use the old
vmunix
in
/sys/HOSTNAME/vmunix.
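For example, assuming the member's system name is HOSTNAME, copy (do not move) the new kernel as follows so that the /vmunix CDSL is preserved:
# cp /sys/HOSTNAME/vmunix /vmunix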
9.1.4 Exporting and Mounting CDSLs
CDSLs are intended for use when files of the same name must necessarily have different contents on different cluster members. Because of this, CDSLs are not intended for export.
Mounting CDSLs through the cluster alias is problematic, because the file contents differ depending on which cluster system gets the mount request. However, nothing prevents CDSLs from being exported. If the entire directory is a CDSL, then the node that gets the mount request provides a file handle corresponding to the directory for that node. If a CDSL is contained within an exported clusterwide directory, then the network file system (NFS) server that gets the request will do the expansion. Like normal symbolic links, the client cannot read the file or directory unless that area is also mounted on the client.
9.2 Managing Devices
Device management in a cluster is similar to that in a standalone system, with the following exceptions:
The
dsfmgr
command for managing device special
files takes special options for clusters.
Because of the mix of shared and private buses in a cluster, device topology can be more complex.
You can control which cluster members act as servers for the devices in the cluster, and which members act as access nodes.
The rest of this section describes these differences.
9.2.1 Managing the Device Special File
When using
dsfmgr, the
device special file management utility, in a cluster, keep the
following in mind:
The
-a
option requires that you
use
c
(cluster) as the
entry_type.
The
-o
and
-O
options,
which create device special files in the
old format, are not valid in a cluster.
In the output from the
-s
option, the
class scope
column in the first table uses a
c
(cluster) to indicate the scope of the device.
For more information, see
dsfmgr(8).
9.2.2 Determining Device Locations
The Tru64 UNIX
hwmgr
command can list all
hardware devices in the cluster, including those on private buses,
and correlate bus-target-LUN names with
/dev/disks/dsk*
names.
For example:
# hwmgr -view devices -cluster
HWID: Device Name Mfg Model Hostname Location
-------------------------------------------------------------------------------
3: kevm pepicelli
28: /dev/disk/floppy0c 3.5in floppy pepicelli fdi0-unit-0
40: /dev/disk/dsk0c DEC RZ28M (C) DEC pepicelli bus-0-targ-0-lun-0
41: /dev/disk/dsk1c DEC RZ28L-AS (C) DEC pepicelli bus-0-targ-1-lun-0
42: /dev/disk/dsk2c DEC RZ28 (C) DEC pepicelli bus-0-targ-2-lun-0
43: /dev/disk/cdrom0c DEC RRD46 (C) DEC pepicelli bus-0-targ-6-lun-0
44: /dev/disk/dsk3c DEC RZ28M (C) DEC pepicelli bus-1-targ-1-lun-0
44: /dev/disk/dsk3c DEC RZ28M (C) DEC polishham bus-1-targ-1-lun-0
44: /dev/disk/dsk3c DEC RZ28M (C) DEC provolone bus-1-targ-1-lun-0
45: /dev/disk/dsk4c DEC RZ28L-AS (C) DEC pepicelli bus-1-targ-2-lun-0
45: /dev/disk/dsk4c DEC RZ28L-AS (C) DEC polishham bus-1-targ-2-lun-0
45: /dev/disk/dsk4c DEC RZ28L-AS (C) DEC provolone bus-1-targ-2-lun-0
46: /dev/disk/dsk5c DEC RZ29B (C) DEC pepicelli bus-1-targ-3-lun-0
46: /dev/disk/dsk5c DEC RZ29B (C) DEC polishham bus-1-targ-3-lun-0
46: /dev/disk/dsk5c DEC RZ29B (C) DEC provolone bus-1-targ-3-lun-0
47: /dev/disk/dsk6c DEC RZ28D (C) DEC pepicelli bus-1-targ-4-lun-0
47: /dev/disk/dsk6c DEC RZ28D (C) DEC polishham bus-1-targ-4-lun-0
47: /dev/disk/dsk6c DEC RZ28D (C) DEC provolone bus-1-targ-4-lun-0
48: /dev/disk/dsk7c DEC RZ28L-AS (C) DEC pepicelli bus-1-targ-5-lun-0
48: /dev/disk/dsk7c DEC RZ28L-AS (C) DEC polishham bus-1-targ-5-lun-0
48: /dev/disk/dsk7c DEC RZ28L-AS (C) DEC provolone bus-1-targ-5-lun-0
49: /dev/disk/dsk8c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-8-lun-0
49: /dev/disk/dsk8c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-8-lun-0
49: /dev/disk/dsk8c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-8-lun-0
50: /dev/disk/dsk9c DEC RZ1CB-CS (C) DEC pepicelli bus-1-targ-9-lun-0
50: /dev/disk/dsk9c DEC RZ1CB-CS (C) DEC polishham bus-1-targ-9-lun-0
50: /dev/disk/dsk9c DEC RZ1CB-CS (C) DEC provolone bus-1-targ-9-lun-0
51: /dev/disk/dsk10c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-10-lun-0
51: /dev/disk/dsk10c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-10-lun-0
51: /dev/disk/dsk10c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-10-lun-0
52: /dev/disk/dsk11c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-11-lun-0
52: /dev/disk/dsk11c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-11-lun-0
52: /dev/disk/dsk11c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-11-lun-0
53: /dev/disk/dsk12c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-12-lun-0
53: /dev/disk/dsk12c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-12-lun-0
53: /dev/disk/dsk12c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-12-lun-0
54: /dev/disk/dsk13c DEC RZ1CF-CF (C) DEC pepicelli bus-1-targ-13-lun-0
54: /dev/disk/dsk13c DEC RZ1CF-CF (C) DEC polishham bus-1-targ-13-lun-0
54: /dev/disk/dsk13c DEC RZ1CF-CF (C) DEC provolone bus-1-targ-13-lun-0
59: kevm polishham
88: /dev/disk/floppy1c 3.5in floppy polishham fdi0-unit-0
94: /dev/disk/dsk14c DEC RZ26L (C) DEC polishham bus-0-targ-0-lun-0
95: /dev/disk/cdrom1c DEC RRD46 (C) DEC polishham bus-0-targ-4-lun-0
96: /dev/disk/dsk15c DEC RZ1DF-CB (C) DEC polishham bus-0-targ-8-lun-0
99: /dev/kevm provolone
127: /dev/disk/floppy2c 3.5in floppy provolone fdi0-unit-0
134: /dev/disk/dsk16c DEC RZ1DF-CB (C) DEC provolone bus-0-targ-0-lun-0
135: /dev/disk/dsk17c DEC RZ1DF-CB (C) DEC provolone bus-0-targ-1-lun-0
136: /dev/disk/cdrom2c DEC RRD47 (C) DEC provolone bus-0-targ-4-lun-0
The
drdmgr
devicename
command reports which members serve the device.
Disks with multiple servers are on a shared SCSI bus.
With very few exceptions, disks that have only one server
are private to that server.
For details on the exceptions,
see
Section 9.4.1.
To learn the hardware configuration of a cluster member, enter the following command:
# hwmgr -view hierarchy -member membername
If the member is on a shared bus, the command reports devices on the shared bus. The command does not report on devices private to other members.
To get a graphical display of the cluster hardware configuration, including
active members, buses, both shared and private storage devices,
and their connections, use the
sms
command to invoke the graphical interface
for the SysMan Station, and then select Hardware
from the View menu.
Figure 9-1
shows the SysMan Station
representation of a two-member cluster.
Figure 9-1: SysMan Station Display of Hardware Configuration
9.2.3 Adding a Disk to the Cluster
For information on physically installing SCSI hardware devices, see the TruCluster Server Cluster Hardware Configuration manual. After the new disk has been installed, follow these steps:
So that all members recognize the new disk, run the following command on each member:
# hwmgr -scan comp -cat scsi_bus
Note
You must run the
hwmgr -scan comp -cat scsi_bus
command on every cluster member that needs access to the disk.
Wait a minute or so for all members to register the presence of the new disk.
If the disk that you are adding is an RZ26, RZ28, RZ29, or RZ1CB-CA model, run the following command on each cluster member:
# /usr/sbin/clu_disk_install
If the cluster has a large number of storage devices, this command can take several minutes to complete.
To learn the name of the new disk, enter the following command:
# hwmgr -view devices -cluster
You can also run the SysMan Station command and select Hardware from the Views menu to learn the new disk name.
For information about creating file systems on the disk,
see
Section 9.6.
9.2.4 Managing Third-party Storage
When a cluster member loses quorum, all of its I/O is suspended, and the remaining members erect I/O barriers against nodes that have been removed from the cluster. This I/O barrier operation inhibits non-cluster members from performing I/O with shared storage devices.
The method that is used to create the I/O barrier depends on the types of storage devices that the cluster members share. In certain cases, a Task Management function called a Target_Reset is sent to stop all I/O to and from the former member. This Task Management function is used in either of the following situations:
The shared SCSI device does not support the SCSI Persistent Reserve command set and uses the Fibre Channel interconnect.
The shared SCSI device does not
support the SCSI Persistent Reserve
command set, uses the SCSI Parallel interconnect, is a
multiported device, and does not propagate the SCSI
Target_Reset
signal.
In either of these situations, there is a delay between the
Target_Reset
and the clearing of all I/O
pending between the device and the former member.
The length of this interval depends on the device
and the cluster configuration.
During this
interval, some I/O with the former member might still occur.
This I/O, sent after the
Target_Reset,
completes in a normal way without interference from other nodes.
During an interval configurable with the
drd_target_reset_wait
kernel attribute,
the device request dispatcher suspends all new I/O to the shared
device.
This period allows time to clear those devices
of the pending I/O that originated with the former member
and were sent to the device after it received the
Target_Reset.
After this interval passes,
the I/O barrier is complete.
The default value for
drd_target_reset_wait
is
30 seconds, which is usually sufficient.
However, if you
have doubts because of third-party devices in your cluster,
contact the device manufacturer and
ask for the specifications on how long it takes their device to clear
I/O after the receipt of a
Target_Reset.
You can set
drd_target_reset_wait
at boot time
and run time.
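For example, assuming that the attribute belongs to the drd kernel subsystem (check the system attributes reference pages for your release), commands similar to the following query the current value and change it at run time; the value 45 is only an illustration:
# sysconfig -q drd drd_target_reset_wait
# sysconfig -r drd drd_target_reset_wait=45
To set the value at boot time, add the corresponding entry to the member's /etc/sysconfigtab file.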
For more information about quorum loss and system partitioning,
see the chapter on the connection manager in the
TruCluster Server
Cluster Technical Overview
manual.
9.2.5 Tape Devices
You can access a tape device in the cluster from any member, regardless of whether it is located on that member's private bus, on a shared bus, or on another member's private bus.
Placing a tape device on a shared bus allows multiple members to
have direct access to the device.
Performance considerations also
argue for placing a tape device on a shared bus.
Backing up storage
connected to a system on a shared bus with a tape drive is faster than
having to go over the cluster interconnect.
For example, in
Figure 9-2, the backup of
dsk9
and
dsk10
to the tape drive requires the data
to go over the cluster interconnect.
For the backup of any other
disk, including the semi-private disks
dsk11,
dsk12,
dsk13, and
dsk14,
the data transfer rate will be faster.
Figure 9-2: Cluster with Semi-private Storage
If the tape device is located on the shared bus, applications that access the device must be written to react appropriately to certain events on the shared SCSI bus, such as bus and device resets. Bus and device resets (such as those that result from cluster membership transitions) cause any tape device on the shared SCSI bus to rewind.
After such a reset, a
read()
or
write()
issued by a tape server application fails and returns an error
(errno).
You must explicitly set up the tape
server application to retrieve error information that is returned from
its I/O call to
reposition the tape.
When a
read()
or
write()
operation fails, use
ioctl()
with the
MTIOCGET
command option
to return a structure that contains the
error information that is needed by the application to reposition the tape.
For a description of the structure, see
/usr/include/sys/mtio.h.
The commonly
used utilities
tar,
cpio,
dump, and
vdump
are not designed in
this way, so they may unexpectedly terminate when used on a
tape device that resides on a shared bus in a cluster.
9.2.6 Formatting Diskettes in a Cluster
TruCluster Server includes support for read/write UNIX file system (UFS) file systems, as described in Section 9.3.7, and you can use TruCluster Server to format a diskette.
Versions of TruCluster Server prior to Version 5.1A do not support read/write UFS file systems. Because prior versions of TruCluster Server do not support read/write UFS file systems and AdvFS metadata overwhelms the capacity of a diskette, the typical methods to format a diskette cannot be used in a cluster.
If you must format a diskette in a cluster with a version of
TruCluster Server prior to Version 5.1A, use the
mtools
or
dxmtools
tool sets.
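For example, assuming that the mtools configuration on the member maps drive a: to its diskette device, a command similar to the following writes a DOS file system to the diskette:
# mformat a: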
For more information, see
mtools(1) and dxmtools(1).
9.2.7 CD-ROM and DVD-ROM
CD-ROM drives and DVD-ROM drives are always served devices. This type of drive must be connected to a local bus; it cannot be connected to a shared bus.
For information about managing a CD-ROM file system (CDFS)
in a cluster, see
Section 9.7.
9.3 Managing the Cluster File System
The cluster file system (CFS) provides transparent access to files that are located anywhere on the cluster. Users and applications enjoy a single-system image for file access. Access is the same regardless of the cluster member where the access request originates, and where in the cluster the disk containing the file is connected. CFS follows a server/client model, with each file system served by a cluster member. Any cluster member can serve file systems on devices anywhere in the cluster. If the member serving a file system becomes unavailable, the CFS server automatically fails over to an available cluster member.
The primary tool for managing the cluster file system
is the
cfsmgr
command.
A number of examples of using the command appear in
this section.
For more information about the
cfsmgr
command,
see
cfsmgr(8).
TruCluster Server Version 5.1B includes the
-o server=name
option to the
mount
command that causes
file systems to be served by a specific cluster member on startup.
This option is described in
Section 9.3.4.
TruCluster Server Version 5.1B includes a load monitoring daemon,
/usr/sbin/cfsd, that can monitor, report on, and
respond to file system-related member and cluster activity.
The
cfsd
daemon is described in
Section 9.3.3.
To gather statistics about CFS, use the
cfsstat
command or the
cfsmgr -statistics
command.
An example
of using
cfsstat
to get information about
direct I/O appears in
Section 9.3.6.2.
For more information
on the command, see
cfsstat(8).
For file systems on devices on the shared bus, I/O
performance depends on the load on the bus and the load on the member
serving the file system.
To simplify load balancing,
CFS allows you to easily relocate the
server to a different member.
Access to file systems on devices
that are private to a member is faster when the file systems are served by
that member.
9.3.1 When File Systems Cannot Fail Over
In most instances, CFS provides seamless failover for the file systems in the cluster. If the cluster member serving a file system becomes unavailable, CFS fails over the server to an available member. However, in the following situations, no path to the file system exists and the file system cannot fail over:
The file system's storage is on a private bus that is connected directly to a member and that member becomes unavailable.
The storage is on a shared bus and all the members on the shared bus become unavailable.
In either case, the
cfsmgr
command returns the
following status for the file system (or domain):
Server Status : Not Served
Attempts to access the file system return the following message:
filename I/O error
When a cluster member that is connected to the storage becomes available,
the file system becomes served again and accesses to the file system
begin to work.
Other than making the member available, you do not
need to take any action.
9.3.2 Direct Access Cached Reads
TruCluster Server implements direct access cached reads, which is a performance enhancement for AdvFS file systems. Direct access cached reads allow CFS to read directly from storage simultaneously on behalf of multiple cluster members.
If the cluster member that issues the read is directly connected to the storage that makes up the file system, direct access cached reads access the storage directly and do not go through the cluster interconnect to the CFS server.
If a CFS client is not directly connected to the storage that makes up a file system (for example, if the storage is private to a cluster member), that client will still issue read requests directly to the devices, but the device request dispatcher layer sends the read request across the cluster interconnect to the device.
Direct access cached reads are consistent with the existing CFS served file-system model, and the CFS server continues to perform metadata and log updates for the read operation.
Direct access cached reads are implemented only for AdvFS file systems. In addition, direct access cached reads are performed only for files that are at least 64K in size. The served I/O method is more efficient when processing smaller files.
Direct access cached reads are enabled by default and are not user-settable or tunable. However, if an application uses direct I/O, as described in Section 9.3.6.2, that choice is given priority and direct access cached reads are not performed for that application.
Use the
cfsstat directio
command to
display direct I/O statistics.
The
direct i/o
reads
field includes direct access cached read
statistics.
See
Section 9.3.6.2.3
for a description of
these fields.
# cfsstat directio
Concurrent Directio Stats:
941 direct i/o reads
0 direct i/o writes
0 aio raw reads
0 aio raw writes
0 unaligned block reads
29 fragment reads
73 zero-fill (hole) reads
0 file-extending writes
0 unaligned block writes
0 hole writes
0 fragment writes
0 truncates
9.3.3 Monitoring and Balancing the CFS Load with cfsd
When a cluster boots, the TruCluster Server software ensures that each file system is directly connected to the member that serves it. File systems on a device connected to a member's local bus are served by that member. A file system on a device on a shared SCSI bus is served by one of the members that is directly connected to that SCSI bus.
In the case of AdvFS, the member that is assigned as the CFS server for the first fileset in a domain also serves all other filesets in that domain.
When a cluster boots, typically the first member up that is connected to a shared SCSI bus is the first member to see devices on the shared bus. This member then becomes the CFS server for all the file systems on all the devices on that shared bus. Because of this, most file systems are probably served by a single member and this member can become more heavily loaded than other members, thereby using a larger percentage of its resources (CPU, memory, I/O, and so forth). In this case, CFS can recommend that you relocate file systems to other cluster members to balance the load and improve performance.
TruCluster Server Version 5.1B includes a load monitoring daemon,
/usr/sbin/cfsd, that can monitor, report on, and
respond to file-system-related member and cluster activity.
cfsd
is disabled by default and you must explicitly
enable it.
After being enabled,
cfsd
can perform the following functions:
Assist in managing file systems by locating file systems based on
your preferences and storage connectivity.
You can configure
cfsd
to automatically relocate
file systems when members join or leave the cluster,
when storage connectivity changes, or as a result of CFS memory
usage.
Collect a variety of statistics on file system usage and system load. You can use this data to understand how the cluster's file systems are being used.
Analyze the statistics that it collects and recommend file system relocations that may improve system performance or balance the file system load across the cluster.
Monitor CFS memory usage on cluster nodes and generate an alert when a member is approaching the CFS memory usage limit.
An instance of the
cfsd
daemon runs on each member
of the cluster.
If
cfsd
runs in the cluster, it
must run on each member;
cfsd
depends on a daemon
running on each member for proper behavior.
If you do not want
cfsd
to be running in the cluster, do not allow
any member to run it.
Each instance of the daemon
collects statistics on its member and monitors member-specific events
such as low CFS memory.
One daemon from the cluster automatically serves as the
"master"
daemon and is responsible for analyzing all
of the collected statistics, making recommendations, and initiating automatic
relocations.
The daemons are configured via a
clusterwide
/etc/cfsd.conf
configuration
file.
The
cfsd
daemon monitors file system performance and
resource utilization by periodically polling the
member for information, as determined by the
polling_schedule
attributes of the
/etc/cfsd.conf
configuration file for a given
policy.
The
cfsd
daemon collects information
about each member's usage of each file system, about the memory demands of
each file system, about the system memory demand on each member, and about
member-to-physical storage connectivity.
Each daemon accumulates the
statistics in the member-specific binary file
/var/cluster/cfs/stats.member.
The data in this file is
in a format specific to
cfsd
and is not intended for direct
user access.
The
cfsd
daemon updates and maintains these
files; you do not need to periodically delete or maintain them.
The following data is collected for each cluster member:
svrcfstok structure count limit. (See Section 9.3.6.3 for a discussion of this structure.)
Number of active svrcfstok structures
Total number of bytes
Number of wired bytes
The following data is collected per file system per member:
Number of read operations
Number of write operations
Number of lookup operations
Number of getattr operations
Number of readlink operations
Number of access operations
Number of other operations
Bytes read
Bytes written
Number of active
svrcfstok
structures (described in
Section 9.3.6.3)
The
cfsd
daemon also subscribes to EVM events to monitor
information on general cluster and cluster file system state, such as cluster
membership, mounted file systems, and device connectivity.
Notes
cfsd considers an AdvFS domain to be an indivisible entity. Relocating an AdvFS file system affects not only the selected file system, but all file systems in the same domain. The entire domain is relocated.
File systems of type NFS client and memory file system (MFS) cannot be relocated. In addition, member boot partitions, server-only file systems (such as UFS file systems mounted for read-write access), and file systems that are under hierarchical storage manager (HSM) management cannot be relocated.
Direct-access I/O devices on a shared bus are served by all cluster members on that bus. A single-server device, whether on a shared bus or directly connected to a cluster member, is served by a single member. If two or more file systems use the same single-server device,
cfsd does not relocate them due to performance issues that can arise if the file systems are not served by the same member.
9.3.3.1 Starting and Stopping
cfsd
The
/sbin/init.d/cfsd
file starts an
instance of the
cfsd
daemon on each cluster
member.
However, starting the daemon does not by itself make
cfsd
active: the
cfsd
daemon's behavior is
controlled via the
active
field of the stanza-formatted file
/etc/cfsd.conf.
The
active
field enables
cfsd
if set to
1.
The
cfsd
daemon is disabled (set to
0)
by default and you must explicitly enable it if you want to use it.
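For example, to enable cfsd clusterwide, set active to 1 in the cfsd stanza of /etc/cfsd.conf (as in Example 9-1) and then force the daemons to reread the file with SIGHUP, as shown below:
cfsd:
        active = 1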
The
cfsd
daemon reads the clusterwide
/etc/cfsd.conf
file at startup.
You can force
cfsd
to
reread the configuration file by sending it a SIGHUP signal, in a
manner similar to the following:
# kill -HUP `cat /var/run/cfsd.pid`
If you modify
/etc/cfsd.conf, send
cfsd
the SIGHUP signal to force it to reread the file.
If you send SIGHUP to any
cfsd
daemon process in the cluster, all
cfsd
daemons in
the cluster reread the file; you do not need to issue multiple
SIGHUP signals.
9.3.3.2 EVM Events
The
cfsd
daemon posts an EVM
sys.unix.clu.cfsd.anlys.relocsuggested
event to alert you that the latest analysis contains interesting
results.
You can use the
evmwatch
and
evmshow
commands to monitor these events.
# evmget | evmshow -t "@name [@priority]" | grep cfsd
sys.unix.clu.cfsd.anlys.relocsuggested [200]
The following command provides additional information about
cfsd
EVM events:
# evmwatch -i -f "[name sys.unix.clu.cfsd.*]" | evmshow -d | more
9.3.3.3 Modifying the
/etc/cfsd.conf Configuration File
The
/etc/cfsd.conf
configuration file, described
in detail in
cfsd.conf(4), configures cfsd
and defines
a set of file system placement policies that
cfsd
adheres to when analyzing and managing the cluster's file systems.
All file systems in the cluster have a placement
policy associated with them.
This policy specifies how each
file system is assigned to members and determines whether or not to
have
cfsd
automatically relocate it if
necessary.
If you modify this file, keep the following points in mind:
If you do not explicitly assign a file system a policy, it inherits the default policy.
The
cfsd
daemon never attempts to relocate a
cluster member's boot partition, even if the boot partition
belongs to a policy that has
active_placement
set
to perform relocations.
The
cfsd
daemon ignores the
active_placement
setting for the boot
partition.
The
active_placement
keywords determine
the events upon which an automatic relocation occurs; the
hosting_members
option determines your preference of
the members to which a file system is relocated.
The
hosting_members
option is a restrictive list,
not a preferred list.
The
placement
attribute controls how the file system is placed when
none of the members specified by
hosting_members
attributes is available.
The
cfsd
daemon treats all cluster members
identified in a single
hosting_members
entry
equally; no ordering preference is assumed by position in the
list.
To specify an ordering of member preference, use
multiple
hosting_members
lines.
The
cfsd
daemon gives preference to the members listed in the
first
hosting_members
line, followed by the members
in the next
hosting_members
line, and so
on.
There is no limit on the number of policies you can create.
You can list any number of file systems in a
filesystems
line.
If the same file
system appears in multiple policies, the last usage takes
precedence.
A sample
/etc/cfsd.conf
file is shown in
Example 9-1.
See
cfsd.conf(4) for more information.
Example 9-1: Sample /etc/cfsd.conf File
# Use this file to configure the CFS load monitoring daemon (cfsd)
# for the cluster. cfsd will read this file at startup and on receipt
# of a SIGHUP. After modifying this file, you can apply your changes
# cluster-wide by issuing the following command from any cluster
# member: "kill -HUP `cat /var/run/cfsd.pid`". This will force the
# daemon on each cluster member to reconfigure itself. You only need
# to send one SIGHUP.
#
# Any line whose first non-whitespace character is a '#' is ignored
# by cfsd.
#
# If cfsd encounters syntax errors while processing this file, it will
# log the error and any associated diagnostic information to syslog.
#
# See cfsd(8) for more information.
# This block is used to configure certain daemon-wide features.
#
# To enable cfsd, set the "active" attribute to "1".
# To disable cfsd, set the "active" attribute to "0".
#
# Before enabling the daemon, you should review and understand the
# configuration in order to make sure that it is compatible with how
# you want cfsd to manage the cluster's file systems.
#
# cfsd will analyze load every 12 hours, using the past 24 hours worth
# of statistical data.
#
cfsd:
active = 1
reloc_on_memory_warning = 1
reloc_stagger = 0
analyze_samplesize = 24:00:00
analyze_interval = 12:00:00
# This block is used to define the default policy for file systems that
# are not explicitly included in another policy. Furthermore, other
# policies that do not have a particular attribute explicitly defined
# inherit the corresponding value from this default policy.
#
# Collect stats every 2 hours all day monday-friday.
# cfsd will perform auto relocations to maintain server preferences,
# connectivity, and acceptable memory usage, and will provide relocation
# hints to the kernel for preferred placement on failover.
# No node is preferred over another.
#
defaultpolicy:
polling_schedule = 1-5, 0-23, 02:00:00
placement = favored
hosting_members = *
active_placement = connectivity, preference, memory, failover
# This policy is used for file systems that you do NOT want cfsd to
# ever relocate. It is recommended that cfsd not be allowed to relocate
# the /, /usr, or /var file systems.
#
# It is also recommended that file systems whose placements are
# managed by other software, such as CAA, also be assigned to
# this policy.
#
policy:
name = PRECIOUS
filesystems = cluster_root#, cluster_usr#, cluster_var#
active_placement = 0
# This policy is used for file systems that cfsd should, for the most
# part, ignore. File systems in this policy will not have statistics
# collected for them and will not be relocated.
#
# Initially, this policy contains all NFS and MFS file systems that
# are not explicitly listed in other policies. File systems of these
# types tend to be temporary, so collecting stats for them is usually
# not beneficial. Also, CFS currently does not support the relocation
# of NFS and MFS file systems.
#
policy:
name = IGNORE
filesystems = %nfs, %mfs
polling_schedule = 0
active_placement = 0
# Policy for boot file systems.
#
# No stats collection for boot file systems. Boot partitions are never
# relocated.
#
policy:
name = BOOTFS
filesystems = root1_domain#, root2_domain#, root3_domain#
polling_schedule = 0
# You can define as many policies as necessary, using this policy block
# as a template. Any attributes that you leave commented out will be
# inherited from the default policy defined above.
#
policy:
name = POLICY01
#filesystems =
#polling_schedule = 0-6, 0-23, 00:15:00
#placement = favored
#hosting_members = *
#active_placement = preference, connectivity, memory, failover
9.3.3.4 Understanding
cfsd Analysis and Implementing Recommendations
The
cfsd
daemons collect statistics in the
member-specific file
/var/cluster/cfs/stats.member.
These data files are in a format specific to
cfsd
and
are not intended for direct user access.
The
cfsd
daemon
updates and maintains these files; you do not need to periodically
delete or maintain them.
After analyzing these collected statistics,
cfsd
places the results of that analysis in the
/var/cluster/cfs/analysis.log
file.
The
/var/cluster/cfs/analysis.log
file is a symbolic link
to the most recent
/var/cluster/cfs/analysis.log.dated
file.
When a
/var/cluster/cfs/analysis.log.dated
file becomes 24 hours old, a new version is created and the symbolic
link is updated.
Prior versions of the
/var/cluster/cfs/analysis.log.dated
file are purged.
The
cfsd
daemon posts an EVM event to alert you that the
latest analysis contains interesting results.
The
/var/cluster/cfs/analysis.log
file
contains plain text, in a format similar to the following:
Cluster Filesystem (CFS) Analysis Report
(generated by cfsd[525485])
Recommended
relocations:
none
Filesystem usage summary:
cluster reads writes req'd svr mem
24 KB/s 0 KB/s 4190 KB
node reads writes req'd svr mem
rye 4 KB/s 0 KB/s 14 KB
swiss 19 KB/s 0 KB/s 4176 KB
filesystem
node reads writes req'd svr mem
test_one# 2 KB/s 0 KB/s 622 KB
rye 0 KB/s 0 KB/s
@swiss 2 KB/s 0 KB/s
test_two# 4 KB/s 0 KB/s 2424 KB
rye 1 KB/s 0 KB/s
@swiss 3 KB/s 0 KB/s
:
:
Filesystem placement evaluation results:
filesystem
node conclusion observations
test_one#
rye considered (hi conn, hi pref, lo use)
@swiss recommended (hi conn, hi pref, hi use)
test_two#
rye considered (hi conn, hi pref, lo use)
@swiss recommended (hi conn, hi pref, hi use)
:
:
The current CFS server of each file system is indicated by an
"at"
symbol (@).
As previously described,
cfsd
treats an AdvFS
domain as an indivisible entity, and the analysis is reported at the
AdvFS domain level.
Relocating a file
system of type AdvFS affects all file systems in the same domain.
You can use the results of this analysis to determine whether you want
a different cluster member to be the CFS server for a given file
system.
If the current CFS server is not the recommended
server for this file system based on the
cfsd
analysis, you can use the
cfsmgr
command to relocate the file system to the recommended server.
For example, assume that
swiss
is the current CFS server of the
test_two
domain and member
rye
is the recommended CFS server.
If you agree
with this analysis and want to implement the recommendation, enter the following
cfsmgr
command to change the CFS server
to
rye:
# cfsmgr -a server=rye -d test_two
# cfsmgr -d test_two
Domain or filesystem name = test_two
Server Name = rye
Server Status : OK
The
cfsd
daemon does not automatically relocate
file systems based solely on its own statistical analysis.
Rather, it
produces reports and makes recommendations that you can accept or
reject based on your environment.
However, for a select series of conditions,
cfsd
can automatically relocate
a file system based on the keywords you specify in the
active_placement
option for a given
file system policy.
The
active_placement
keywords determine
the events upon which an automatic relocation occurs; the
hosting_members
option determines the members to
which a file system is relocated and the order in which a member is
selected.
The possible values and interactions of the
active_placement
option are described in
cfsd.conf(4):
Memory
CFS memory usage is limited by the
svrcfstok_max_percent
kernel attribute, which is described in
Section 9.3.6.3.
If a cluster member
reaches this limit, file operations on file systems served by the
member begin failing with
"file table overflow"
errors.
While a member approaches its CFS memory usage limit, the kernel posts an EVM
event as a warning.
When such an event is posted,
cfsd
can attempt to free memory on the member by
relocating some of the file systems that it is serving.
Preference
While members join and leave the
cluster,
cfsd
can relocate file systems to members
that you prefer.
You might want certain file systems to be served primarily by
a subset of the cluster members.
Failover
When a file system must fail over because its CFS server leaves the
cluster,
cfsd
provides placement hints to the kernel so that the file system
fails over to a member that you prefer.
Connectivity
If a member does not have a direct
physical connection to the
devices required by a file system that it serves, a severe performance
degradation can result.
The
cfsd
daemon can
automatically relocate a
file system in the event that its current server loses connectivity to
the file system's underlying devices.
9.3.3.6 Relationship to CAA Resources
The
cfsd
daemon has no knowledge of CAA resources.
CAA allows you to use the
placement_policy,
hosting_members, and
required_resources
options to favor or limit the
member or members that can run a particular CAA resource.
If this CAA resource has an application- or resource-specific
file system, use the associated CAA action script to place
or relocate the file system.
For example, if the resource is relocated,
the action script should use
cfsmgr
to move the file system as well.
Using the
cfsmgr
command via the action script allows you
to more directly and easily synchronize the file system with the CAA resource.
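For example, a minimal sketch of the relocation step in a CAA action script's start processing might look like the following; the member name (alpha, the member on which the resource is being started) and the domain name (accounts_dmn) are placeholders for illustration:
# Hypothetical fragment of a CAA action script start entry point:
# after the application resource starts on this member, make the same
# member the CFS server for the application's AdvFS domain.
cfsmgr -a server=alpha -d accounts_dmn
if [ $? -ne 0 ]; then
    echo "cfsmgr relocation of accounts_dmn failed" >&2
fi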
9.3.3.7 Balancing CFS Load Without
cfsd
The
cfsd
daemon is the recommended method of
analyzing and balancing the CFS load on a cluster.
The
cfsd
daemon can
monitor, report on, and respond to file-system-related member and
cluster activity.
However, if you already have a process in place
to balance your file system load, or if you simply prefer to perform the
load balancing analysis yourself, you can certainly do so.
Use the
cfsmgr
command to determine good candidates
for relocating the CFS servers.
The
cfsmgr
command
displays statistics on file system usage on a per-member basis.
For example, suppose you want to determine whether to relocate the server for
/accounts
to improve performance.
First, confirm the current CFS server of
/accounts
as follows:
# cfsmgr /accounts
Domain or filesystem name = /accounts
Server Name = systemb
Server Status : OK
Then, get the CFS statistics for the current server and the candidate servers by entering the following commands:
# cfsmgr -h systemb -a statistics /accounts
Counters for the filesystem /accounts:
read_ops = 4149
write_ops = 7572
lookup_ops = 82563
getattr_ops = 408165
readlink_ops = 18221
access_ops = 62178
other_ops = 123112
Server Status : OK
# cfsmgr -h systema -a statistics /accounts
Counters for the filesystem /accounts:
read_ops = 26836
write_ops = 3773
lookup_ops = 701764
getattr_ops = 561806
readlink_ops = 28712
access_ops = 81173
other_ops = 146263
Server Status : OK
# cfsmgr -h systemc -a statistics /accounts
Counters for the filesystem /accounts:
read_ops = 18746
write_ops = 13553
lookup_ops = 475015
getattr_ops = 280905
readlink_ops = 24306
access_ops = 84283
other_ops = 103671
Server Status : OK
# cfsmgr -h systemd -a statistics /accounts
Counters for the filesystem /accounts:
read_ops = 98468
write_ops = 63773
lookup_ops = 994437
getattr_ops = 785618
readlink_ops = 44324
access_ops = 101821
other_ops = 212331
Server Status : OK
In this example, most of the read and write activity
for
/accounts
is from member
systemd, not from the member that is currently serving it,
systemb.
Assuming that
systemd
is physically connected to the storage for
/accounts,
systemd
is
a good choice as the CFS server for
/accounts.
Determine whether
systemd
and
the storage for
/accounts
are physically
connected as follows:
Find out where
/accounts
is mounted.
You can
either look in
/etc/fstab
or use the
mount
command.
If there are a large number of
mounted file systems, you might want to use
grep
as follows:
# mount | grep accounts
accounts_dmn#accounts on /accounts type advfs (rw)
Look at the directory
/etc/fdmns/accounts_dmn
to
learn the device where the AdvFS domain
accounts_dmn
is mounted as follows:
# ls /etc/fdmns/accounts_dmn
dsk6c
Enter the
drdmgr
command to learn the servers of
dsk6
as follows:
# drdmgr -a server dsk6
Device Name: dsk6
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 4
Server Name: membera
Server State: Server
Server Name: memberb
Server State: Server
Server Name: memberc
Server State: Server
Server Name: memberd
Server State: Server
Because
dsk6
has multiple servers, it is on a
shared bus.
Because
systemd
is one of the servers,
there is a physical connection.
Relocate the CFS server of
/accounts
to
systemd
as follows:
# cfsmgr -a server=systemd /accounts
Even in cases where the CFS statistics do not show an inordinate load
imbalance, we recommend that you distribute the CFS servers among the
available members that are connected to the shared bus.
Doing so can
improve overall cluster performance.
9.3.3.8 Distributing CFS Server Load via
cfsmgr
To automatically have a particular cluster member act as the CFS server
for a file system or domain, you can place a script in
/sbin/init.d
that calls the
cfsmgr
command to relocate the server for the file
system or domain to the desired cluster member.
This technique
distributes the CFS load but does not balance it.
For example, if you want cluster member
alpha
to
serve the domain
accounting, place the following
cfsmgr
command in a startup script:
# cfsmgr -a server=alpha -d accounting
Have the script look for successful relocation and retry the operation
if it fails.
The
cfsmgr
command returns a nonzero value
on failure; however, it is not sufficient for the script to
keep trying on a bad exit value.
The relocation might have failed because a failover or relocation is
already in progress.
On failure of the relocation, have the script search for one of the following messages:
Server Status : Failover/Relocation in Progress
Server Status : Cluster is busy, try later
If either of these messages occurs, have the script retry the
relocation.
On any other error, have the script print an appropriate
message and exit.
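A minimal sketch of such a startup script follows; it reuses the alpha member and accounting domain from the example above, and the retry count and sleep interval are arbitrary choices:
#!/sbin/sh
# Relocate the CFS server for the accounting domain to member alpha,
# retrying while a failover or relocation is already in progress.
retries=10
while [ $retries -gt 0 ]
do
    msg=`cfsmgr -a server=alpha -d accounting 2>&1`
    if [ $? -eq 0 ]
    then
        exit 0
    fi
    case "$msg" in
    *"Failover/Relocation in Progress"*|*"Cluster is busy, try later"*)
        # Transient condition: wait and retry the relocation.
        sleep 30
        retries=`expr $retries - 1`
        ;;
    *)
        # Any other error is treated as fatal.
        echo "cfsmgr relocation of accounting failed: $msg" >&2
        exit 1
        ;;
    esac
done
echo "cfsmgr relocation of accounting did not complete" >&2
exit 1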
9.3.4 Distributing File Systems Via the
mount -o Command
A file system on a device on a shared SCSI bus is served by one of the members that is directly connected to that SCSI bus. When a cluster boots, typically the first active member that is connected to a shared SCSI bus is the first member to see devices on the shared bus. This member then becomes the CFS server for all the file systems on all the devices on that shared bus. CFS allows you to then relocate file systems to better balance the file system load, as described in Section 9.3.3.
As an alternate approach, the
mount -o server=name
command allows you to specify which cluster member serves a given file system
at startup.
The
-o server=name
option
is particularly useful for those file systems that cannot be
relocated, such as NFS, MFS, and read/write UFS file systems:
# mount -t nfs -o server=rye smooch:/usr /tmp/mytmp
# cfsmgr -e
Domain or filesystem name = smooch:/usr
Mounted On = /cluster/members/member1/tmp/mytmp
Server Name = ernest
Server Status : OK
If the mount specified by the
mount -o
server=name
command is successful, the
specified cluster member is the CFS server for the file system.
However, if the
specified member is not a member of the cluster or is unable to serve
the file system, the mount attempt fails.
The
mount -o
server=name
command determines
where the file system is first mounted; it does not limit or
determine the cluster members to which the file system might later be
relocated or fail over.
If you combine the -o server=name option with the -o server_only option, the file system can be mounted only by the specified cluster member and the file system is then treated as a partitioned file system. That is, the file system is accessible for both read-only and read/write access only by the member that mounts it. Other cluster members cannot read from, or write to, the file system. Remote access is not allowed; failover does not occur. The -o server_only option can be applied only to AdvFS, MFS, and UFS file systems.
Note
The -o server=name option bypasses the normal server selection process and may result in a member that has less than optimal connectivity to the file system's devices serving the file system. In addition, if the member you specify is not available, the file system is not mounted by any other cluster member.
The combination of the -o server=name and -o server_only options removes many of the high-availability protections of the CFS file system: the file system can be mounted only by the specified cluster member, it can be accessed by only that member, and it cannot fail over to another member. Therefore, use this combination carefully.
The -o server=name option is valid only in a cluster, and only for AdvFS, UFS, MFS, NFS, CDFS, and DVDFS file systems. In the case of MFS file systems, the -o server=name option is supported in a limited fashion: the file system is mounted only if the specified server is the local node.
You can use the
-o
server=name
option
with the
/etc/fstab
file to create
cluster-member-specific
fstab
entries.
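For example, an entry similar to the following hypothetical line (using the accounts_dmn domain shown earlier in this chapter) requests that member systemb serve /accounts when it is mounted; check mount(8) for the exact option syntax on your system:
accounts_dmn#accounts   /accounts   advfs   rw,server=systemb   0 0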
See
mount(8).
9.3.5 Freezing a Domain Before Cloning
To allow coherent hardware snapshots in multivolume domain configurations,
file system metadata must be consistent across all volumes when the
individual volumes are cloned.
To guarantee that the metadata is
consistent, Tru64 UNIX Version 5.1B includes
the
freezefs
command, which is described in
freezefs(8)freezefs
command causes an AdvFS domain to enter into a
metadata-consistent frozen state and guarantees that it stays that way
until the specified freeze time expires or it is explicitly thawed
with the
thawfs
command.
All metadata, which
can be spread across multiple volumes or logical units (LUNs), is
flushed to disk and does not change for the duration of the freeze.
Although
freezefs
requires that you specify one
or more AdvFS file system mount directories, all of the filesets in
the AdvFS domain are affected.
The
freezefs
command considers
an AdvFS domain to be an indivisible entity.
Freezing a file system
in a domain freezes the entire domain.
When you freeze a
file system in a clustered configuration, all in-process
file system operations are allowed to complete.
Some file
system operations that do not require metadata updates work
normally even if the target file system is
frozen; for example,
read
and
stat.
Although there are slight differences in how
freezefs
functions on a single system and in a
cluster, in both cases metadata changes are not allowed on a
frozen domain.
The most notable differences in the behavior of the
commands in a cluster are the following:
Shutting down any cluster member causes all frozen file systems in the cluster to be thawed.
If any cluster member fails, all frozen file systems in the cluster are thawed.
9.3.5.1 Determining Whether a Domain Is Frozen
By default,
freezefs
freezes a file system for
60 seconds.
However, you can use the
-t
option to specify a
lesser or greater timeout value in seconds, or to specify that the
domain is to remain frozen until being thawed by
thawfs.
The
freezefs
command
-q
option allows you to query a file system to determine if it is frozen:
# freezefs -q /mnt
/mnt is frozen
In addition, the
freezefs
command posts
an EVM event when a file system is frozen or thawed.
You can use
the
evmwatch
and
evmshow
commands to determine if any domains in
the cluster are frozen or thawed, as shown in the following example:
# /usr/sbin/freezefs -t -1 /freezetest
freezefs: Successful
# evmget -f "[name sys.unix.fs.vfs.freeze]" | evmshow -t "@timestamp @@"
14-Aug-2002 14:16:51 VFS: filesystem test2_domain#freeze mounted on /freezetest was frozen
# /usr/sbin/thawfs /freezetest
thawfs: Successful
# evmget -f "[name sys.unix.fs.vfs.thaw]" | evmshow -t "@timestamp @@"
14-Aug-2002 14:17:32 VFS: filesystem test2_domain#freeze mounted on /freezetest was thawed
9.3.6 Optimizing CFS Performance
You can tune CFS performance by doing the following:
Changing the number of read-ahead and write-behind threads (Section 9.3.6.1)
Taking advantage of direct I/O (Section 9.3.6.2)
Adjusting CFS memory usage (Section 9.3.6.3)
Using memory mapped files (Section 9.3.6.4)
Avoiding full file systems (Section 9.3.6.5)
Trying other strategies (Section 9.3.6.6)
9.3.6.1 Changing the Number of Read-Ahead and Write-Behind Threads
When CFS detects sequential accesses to a file, it
employs read-ahead threads to read the next I/O block size worth of data.
CFS also employs write-behind threads to buffer the next block of data
in anticipation that it too
will be written to disk.
Use the
cfs_async_biod_threads
kernel attribute to
set the number of I/O threads that perform asynchronous read ahead and
write behind.
Read-ahead and write-behind threads apply only to
reads and writes originating on CFS clients.
The default size for
cfs_async_biod_threads
is 32.
In an environment where at one time you have more than 32 large files
sequentially accessed, increasing
cfs_async_biod_threads
can improve CFS performance,
particularly if the applications using
the files can benefit from lower latencies.
The number of read-ahead and write-behind threads is tunable
from 0 through 128.
When not in use, the threads consume few system resources.
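For example, assuming that the attribute belongs to the cfs kernel subsystem, you can check the current value and, if the attribute can be changed at run time on your system, raise it with commands similar to the following (64 is an arbitrary illustration); otherwise, set the new value in the member's /etc/sysconfigtab file and reboot:
# sysconfig -q cfs cfs_async_biod_threads
# sysconfig -r cfs cfs_async_biod_threads=64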
9.3.6.2 Taking Advantage of Direct I/O
When an application opens an AdvFS file with the
O_DIRECTIO
flag in the
open
system call, data I/O is direct to the
storage; the system software does no data caching for the file
at the file-system level.
In a cluster, this arrangement supports
concurrent direct I/O on the file from any member in the cluster.
That is,
regardless of which member originates the I/O request,
I/O to a file does not go through the cluster
interconnect to the CFS server.
Database applications frequently
use direct I/O in conjunction with raw asynchronous I/O (which is also supported in
a cluster) to improve I/O performance.
The best performance on a file that is opened for direct I/O is achieved under the following conditions:
A read from an existing location of the file
A write to an existing location of the file
When the size of the data being read or written is a multiple of the disk sector size, 512 bytes
The following conditions can result in less than optimal direct I/O performance:
Operations that cause a metadata change to a file. These operations go across the cluster interconnect to the CFS server of the file system when the application that is doing the direct I/O runs on a member other than the CFS server of the file system. Such operations include the following:
Any modification that fills a sparse hole in the file
Any modification that appends to the file
Any modification that truncates the file
Any read or write on a file that is less than 8K and consists solely of a fragment or any read/write to the fragment portion at the end of a larger file
Any unaligned block read or write that is not to an existing location of the file. If a request does not begin or end on a block boundary, multiple I/Os are performed.
When a file is open for direct I/O,
any AdvFS migrate operation (such as
migrate,
rmvol,
defragment, or
balance) on the domain will block until the I/O
that is in progress completes on all members.
Conversely, direct I/O will block until any AdvFS migrate
operation completes.
An application that uses direct I/O is responsible for managing its own caching. When performing multithreaded direct I/O on a single cluster member or multiple members, the application must also provide synchronization to ensure that, at any instant, only one thread is writing a sector while others are reading or writing.
For a discussion of direct I/O programming issues, see the chapter
on optimizing techniques in the Tru64 UNIX
Programmer's Guide.
9.3.6.2.1 Differences Between Cluster and Standalone AdvFS Direct I/O
The following list presents direct I/O behavior in a cluster that differs from that in a standalone system:
Performing any migrate operation on a file that is already opened for direct I/O blocks until the I/O that is in progress completes on all members. Subsequent I/O will block until the migrate operation completes.
AdvFS in a standalone system provides a guarantee at the sector level that, if multiple threads attempt to write to the same sector in a file, one will complete first and then the other. This guarantee is not provided in a cluster.
9.3.6.2.2 Cloning a Fileset with Files Open in Direct I/O Mode
As described in
Section 9.3.6.2, when an application
opens a file with the
O_DIRECTIO
flag in the
open
system call, I/O to the file does not go through the cluster
interconnect to the CFS server.
However, if you clone a fileset that
has files open in Direct I/O mode, the I/O does not follow this model and might cause
considerable performance degradation.
(Read performance is not impacted by the
cloning.)
The
clonefset
utility, which is described in
clonefset(8), creates a read-only copy, or clone, of an AdvFS fileset that you can back up while the original fileset remains in use.
If the fileset has files open in Direct I/O mode, when you modify a file AdvFS copies the original data to the clone storage. AdvFS does not send this copy operation over the cluster interconnect. However, CFS does send the write operation for the changed data in the fileset over the cluster interconnect to the CFS server unless the application using Direct I/O mode happens to be running on the CFS server. Sending the write operation over the cluster interconnect negates the advantages of opening the file in Direct I/O mode.
To retain the benefits of Direct I/O mode, remove the clone as
soon as the backup operation is complete so that writes are again written
directly to storage and are not sent over the cluster interconnect.
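For example, a backup sequence along the following lines creates a clone, backs it up, and removes the clone as soon as the backup completes; the domain, fileset, mount point, and tape device names are assumptions, and the exact clonefset and rmfset syntax should be checked against their reference pages:
# clonefset accounts_dmn accounts accounts_clone
# mount accounts_dmn#accounts_clone /backup_clone
# vdump -0 -f /dev/tape/tape0_d1 /backup_clone
# umount /backup_clone
# rmfset accounts_dmn accounts_clone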
9.3.6.2.3 Gathering Statistics on Direct I/O
If the performance gain for an application that uses direct I/O
is less than you expected, you can use the
cfsstat
command
to examine per-node global direct I/O statistics.
Use
cfsstat
to look at the global direct I/O
statistics without the application running.
Then execute the
application and examine the statistics again to determine whether
the paths that do not optimize direct I/O behavior were being
executed.
The following example shows how to use the
cfsstat
command to get direct I/O statistics:
# cfsstat directio
Concurrent Directio Stats:
160 direct i/o reads
160 direct i/o writes
0 aio raw reads
0 aio raw writes
0 unaligned block reads
0 fragment reads
0 zero-fill (hole) reads
160 file-extending writes
0 unaligned block writes
0 hole writes
0 fragment writes
0 truncates
The individual statistics have the following meanings:
direct i/o reads
The number of normal direct I/O read requests. These read requests were processed on the member that issued the request and were not sent to the AdvFS layer on the CFS server.
direct i/o writes
The number of normal direct I/O write requests processed. These write requests were processed on the member that issued the request and were not sent to the AdvFS layer on the CFS server.
aio raw reads
The number of normal direct I/O asynchronous read requests. These read requests were processed on the member that issued the request and were not sent to the AdvFS layer on the CFS server.
aio raw writes
The number of normal direct I/O asynchronous write requests. These write requests were processed on the member that issued the request and were not sent to the AdvFS layer on the CFS server.
unaligned block reads
The number of reads that were not a multiple of a disk sector size (currently 512 bytes). This count will be incremented for requests that do not start at a sector boundary or do not end on a sector boundary. An unaligned block read operation results in a read for the sector and a copyout of the user data requested from the proper location of the sector.
If the I/O request encompasses an existing location of the file and does not encompass a fragment, this operation does not get sent to the CFS server.
fragment reads
The number of read requests that needed to be sent to the CFS server because the request was for a portion of the file that contains a fragment.
A file that is less than 140K might contain a fragment at the end that is not a multiple of 8K. Also, a small file of less than 8K might consist solely of a fragment.
To ensure that a file of less than 8K does not consist of a fragment, always open the file only for direct I/O. Otherwise, on the close of a normal open, a fragment will be created for the file.
zero-fill (hole) reads
The number of reads that occurred to sparse areas of the files that were opened by direct I/O. This request is not sent to the CFS server.
file-extending writes
The number of write requests that were sent to the CFS server because they appended data to the file.
unaligned block writes
The number of writes that were not a multiple of a disk sector size (currently 512 bytes). This count will be incremented for requests that do not start at a sector boundary or do not end on a sector boundary. An unaligned block write operation results in a read for the sector, a copyin of the user data that is destined for a portion of the block, and a subsequent write of the merged data. These operations do not get sent to the CFS server.
If the I/O request encompasses an existing location of the file and does not encompass a fragment, this operation does not get sent to the CFS server.
hole writes
The number of write requests to an area that encompasses a sparse hole in the file that needed to be sent to AdvFS on the CFS server.
fragment writes
The number of write requests that needed to be sent to the CFS server because the request was for a portion of the file that contains a fragment.
A file that is less than 140K might contain a fragment at the end that is not a multiple of 8K. Also, a small file of less than 8K might consist solely of a fragment.
To ensure that a file of less than 8K does not consist of a fragment, always open the file only for direct I/O. Otherwise, on the close of a normal open, a fragment will be created for the file.
truncates
The number of truncate requests for direct I/O opened files. This request does get sent to the CFS server.
9.3.6.3 Adjusting CFS Memory Usage
In situations where one cluster member is the CFS server for a large number of file systems, the client members may cache a great many vnodes from the served file systems. For each cached vnode on a client, even vnodes that are not actively used, the CFS server must allocate 800 bytes of system memory for the CFS token structure that is needed to track the file at the CFS layer. In addition to this, the CFS token structures typically require corresponding AdvFS access structures and vnodes, resulting in a near-doubling of the amount of memory that is used.
By default, each client can use up to 4 percent of memory to cache vnodes. When multiple clients fill up their caches with vnodes from a CFS server, system memory on the server can become overtaxed, causing it to hang.
The
svrcfstok_max_percent
kernel attribute is designed to prevent such system hangs.
The attribute
sets an upper limit on the amount of memory that is allocated
by the CFS server to track vnode caching on clients.
The default value is 25 percent.
The memory is used only
if the server load requires it.
The memory is not allocated up front.
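For example, to display the current setting of the attribute on a member, you can query the cfs subsystem with sysconfig; the output shown here is illustrative and assumes the default value:
# sysconfig -q cfs svrcfstok_max_percent
cfs:
svrcfstok_max_percent = 25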
After the
svrcfstok_max_percent
limit
is reached on the server, an application accessing files that are served by
the member gets an
EMFILE
error.
Applications that use
perror()
to report the current
errno
setting write the message
too many open files
to the standard error stream,
stderr (typically the controlling
TTY or a log file used by the application).
Although you see
EMFILE
error messages,
no cached data is lost.
If applications start getting
EMFILE
errors, follow these steps:
Determine whether the CFS client is out of vnodes, as follows:
Get the current value of the
max_vnodes
kernel
attribute:
# sysconfig -q vfs max_vnodes
Use
dbx
to get the values of
total_vnodes
and
free_vnodes:
# dbx -k /vmunix /dev/mem
dbx version 5.0
Type 'help' for help.
(dbx) pd total_vnodes
total_vnodes_value
Get the value for
max_vnodes:
(dbx) pd max_vnodes
max_vnodes_value
If
total_vnodes
equals
max_vnodes
and
free_vnodes
equals 0, then that member
is out of vnodes.
In this case, you can increase the value of
the
max_vnodes
kernel attribute.
You can use
the
sysconfig
command to change
max_vnodes
on a running member.
For example,
to set the maximum number of vnodes to 20000, enter the following:
# sysconfig -r vfs max_vnodes=20000
If the CFS client is not out of vnodes,
then determine whether the CFS server has used all the memory
that is available for token structures
(svrcfstok_max_percent), as follows:
Log on to the CFS server.
Use
dbx
to get the current value
for
svrtok_active_svrcfstok:
# dbx -k /vmunix /dev/mem
dbx version 5.0
Type 'help' for help.
(dbx) pd svrtok_active_svrcfstok
active_svrcfstok_value
Get the value for
cfs_max_svrcfstok:
(dbx) pd cfs_max_svrcfstok
max_svrcfstok_value
If
svrtok_active_svrcfstok
is
equal to or greater than
cfs_max_svrcfstok,
then the CFS server has used all the memory that is available for token
structures.
In this case, the best solution to make the file systems usable again is to relocate some of the file systems to other cluster members. If that is not possible, then the following solutions are acceptable:
Increase the value of
cfs_max_svrcfstok.
You cannot change
cfs_max_svrcfstok
with the
sysconfig
command.
However, you can use
the
dbx assign
command to change the value of
cfs_max_svrcfstok
in the running kernel.
For example, to set the maximum number of
CFS server token structures to 80000, enter the following command:
(dbx) assign cfs_max_svrcfstok=80000
Values you assign with the
dbx assign
command are lost when the system is rebooted.
Increase the amount of memory that is available for token structures on the CFS server.
This option is undesirable on systems with small amounts of memory.
To increase
svrcfstok_max_percent, log on to the
server and run the
dxkerneltuner
command.
On the main window, select
the
cfs
kernel subsystem.
On the
cfs
window, enter an appropriate value for
svrcfstok_max_percent.
This change will
not take effect until the cluster member is rebooted.
Typically, when a CFS server reaches the
svrcfstok_max_percent
limit,
relocate some of the CFS file systems so that the burden of
serving the file systems is shared among cluster members.
You can
use startup scripts to run the
cfsmgr
and
automatically relocate file systems around the cluster at member startup.
Setting
svrcfstok_max_percent
below the default
is recommended only on smaller memory systems
that run out of memory because the default value of 25 percent is too high.
9.3.6.4 Using Memory Mapped Files
Using memory mapping to share a file across the cluster
for anything other than read-only access can negatively affect performance.
CFS I/O to a file does not perform well when multiple members
are simultaneously modifying the data.
This situation
forces premature cache flushes to ensure that all nodes have the
same view of the data at all times.
9.3.6.5 Avoiding Full File Systems
If free space in a file system is less than 50 MB or less
than 10 percent of the file system's size,
whichever is smaller, then write performance to the file system from
CFS clients suffers.
Performance suffers because all writes to nearly full file
systems are sent immediately to the
server to guarantee correct ENOSPC ("not enough space") semantics.
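To check how close a file system is to these thresholds, you can use the df command from any member; the mount point /projects is a hypothetical example:
# df -k /projects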
9.3.6.6 Other Strategies
The following measures can improve CFS performance:
Ensure that the cluster members have sufficient system memory.
In general, sharing a file for read/write access across cluster members may negatively affect performance because of all of the cache invalidations. CFS I/O to a file does not perform well if multiple members are simultaneously modifying the data. This situation forces premature cache flushes to ensure that all nodes have the same view of the data at all times.
If a distributed application does reads and writes on separate members, try locating the CFS server for the application's file systems on the member that performs the writes. Writes are more sensitive to remote I/O than reads.
If multiple applications access different sets of data in a single AdvFS domain, consider splitting the data into multiple domains. This arrangement allows you to spread the load to more than a single CFS server. It also presents the opportunity to colocate each application with the CFS server for that application's data without loading everything on a single member.
9.3.7 MFS and UFS File Systems Supported
TruCluster Server includes read/write support for memory file system (MFS) and UNIX file system (UFS) file systems.
When you mount a UFS file system in a cluster
for read/write access, or when you mount an MFS file system in a
cluster for read-only or read/write access,
the
mount
command
server_only
argument is used by default.
These
file systems are treated as partitioned file systems, as described in
Section 9.3.8.
That is, the file
system is accessible for both read-only and read/write access only by
the member that mounts it.
Other cluster members cannot read
from, or write to, the MFS or UFS file system.
Remote
access is not allowed; failover does not occur.
If you want to mount a UFS file system for read-only
access by all cluster members, you must explicitly mount it
read-only.
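For example, the following command, which assumes a hypothetical UFS file system on dsk18c, mounts it read-only so that all cluster members can read it:
# mount -t ufs -r /dev/disk/dsk18c /mnt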
9.3.8 Partitioning File Systems
CFS makes all files accessible to all cluster members. Each cluster member has the same access to a file, whether the file is stored on a device that is connected to all cluster members or on a device that is private to a single member. However, CFS makes it possible to mount an AdvFS file system so that it is accessible to only a single cluster member, which is referred to as file system partitioning.
The Available Server Environment (ASE), which is an earlier version of the TruCluster product, offered functionality like that of file system partitioning. File system partitioning is provided in TruCluster Server as of Version 5.1 to ease migration from ASE. File system partitioning in TruCluster Server is not intended as a general-purpose method for restricting file system access to a single member.
To mount a partitioned file system, log on to the member that
you want to give exclusive access to the file system.
Run the
mount
command with the
server_only
option.
This mounts the file
system on the member where you execute the
mount
command and gives that member exclusive access to the file system.
Although only the mounting member has access to the file system,
all members, cluster-wide, can see the file system mount.
The
server_only
option can be applied only to
AdvFS, MFS, and UFS file systems.
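For example, the following command, which assumes a hypothetical AdvFS fileset acct_dom#data and mount point /accounting, mounts the fileset for exclusive access by the member where you enter the command:
# mount -o server_only acct_dom#data /accounting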
Partitioned file systems are subject to the following limitations:
Starting with Tru64 UNIX Version 5.1B, file systems can be mounted under a partitioned file system if the file systems to be mounted are also partitioned file systems and are served by the same cluster member.
No failover via CFS
If the cluster member serving a partitioned file system fails, the file system is unmounted. You must remount the file system on another cluster member.
You can work around this by putting the application that uses the partitioned file system under the control of CAA. Because the application must run on the member where the partitioned file system is mounted, if the member fails, both the file system and application fail. An application that is under the control of CAA will fail over to a running cluster member. You can write the application's CAA action script to mount the partitioned file system on the new member. (A minimal action-script sketch appears after this list of limitations.)
NFS export
The best way to export a partitioned file system is to create a single node
cluster alias for the node serving the partitioned file system and include
that alias in the
/etc/exports.aliases
file.
See
Section 3.15
for additional information on
how to best utilize the
/etc/exports.aliases
file.
If you use the default cluster alias to NFS-mount file systems that the cluster serves, some NFS requests will be directed to a member that does not have access to the file system and will fail.
Another way to export a partitioned file system is to assign
the member that serves the partitioned file system the
highest cluster-alias selection priority
(selp) in the cluster.
If you do this,
the member will serve all NFS connection requests.
However,
the member will also have to handle
all network traffic of any type that is directed to the cluster,
which is not likely to be acceptable in most environments.
For more information about distributing connection requests, see Section 3.10.
No mixing partitioned and conventional filesets in the same domain
The
server_only
option applies to
all file systems in a domain.
The type of the first fileset mounted determines the type for all
filesets in the domain:
If a fileset is mounted without the
server_only
option, then attempts to mount another fileset in the domain
server_only
will fail.
If a fileset in a domain is mounted
server_only,
then all subsequent fileset mounts in that domain must be
server_only.
No manual relocation
To move a partitioned file system to a different CFS server, you must unmount the file system and then remount it on the target member. At the same time, you will need to move applications that use the file system.
No mount updates with
server_only
option
After you mount a file system normally, you cannot use the
mount -u
command with the
server_only
option on the file system.
For example, if
file_system
has already been mounted
without use of the
server_only
flag,
the following command fails:
# mount -u -o server_only file_system
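The following is a minimal sketch of a CAA action script that implements the workaround described in the failover limitation earlier in this list. The domain name acct_dom#data, the mount point /accounting, and the application start, stop, and check commands are hypothetical; an actual script must be tailored to your application and registered with CAA.
#!/usr/bin/ksh
# Minimal sketch of a CAA action script for an application that uses a
# partitioned (server_only) file system.  All names are hypothetical.
case "$1" in
start)
        # Mount the partitioned file system on this member, then start the application.
        mount -o server_only acct_dom#data /accounting || exit 1
        /usr/local/bin/start_acct_app || exit 1
        exit 0
        ;;
stop)
        # Stop the application, then unmount the partitioned file system.
        /usr/local/bin/stop_acct_app
        umount /accounting
        exit 0
        ;;
check)
        # Report failure if the hypothetical application daemon is not running.
        ps -e | grep -q acct_daemon || exit 1
        exit 0
        ;;
esac
exit 2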
9.3.9 Block Devices and Cache Coherency
A single block device can have multiple aliases.
In this situation, multiple block device special files in the
file system namespace will contain the same
dev_t.
These aliases can potentially be located across multiple domains
or file systems in the namespace.
On a standalone system, cache coherency is guaranteed among all opens
of the common underlying block device regardless of which alias was
used on the
open()
call for the device.
In a cluster, however, cache coherency can be obtained only
among all block device file
aliases that reside on the same domain or file system.
For example, if cluster member
mutt
serves a
domain with a block device file and
member
jeff
serves a domain with another
block device file with the same
dev_t, then
cache coherency is not provided if I/O
is performed simultaneously through these two aliases.
9.3.10 CFS Restrictions
The cluster file system (CFS) supports the network file system (NFS) client for read/write access.
When a file system is NFS-mounted in a cluster, CFS makes it available for read/write access from all cluster members. The member that has actually mounted it serves the file system to other cluster members.
If the member that has mounted the NFS file system shuts down or fails, the file system is automatically unmounted and CFS begins to clean up the mount points. During the cleanup process, members that access these mount points may see various types of behavior, depending upon how far the cleanup has progressed:
If members still have files open on that file system, their writes will be sent to a local cache instead of to the actual NFS-mounted file system.
After all of the files
on that file system have been closed, attempts to open a file on that
file system will fail with an
EIO
error until the
file system is remounted.
Applications may encounter
"Stale NFS
handle"
messages.
This is normal behavior on a standalone system, as
well as in a cluster.
Until the CFS cleanup is complete, members may still be able to create new files at the NFS file system's local mount point (or in any directories that were created locally beneath that mount point).
An NFS file system does not automatically fail
over to another cluster member.
Rather, you must manually remount it, at the same or a different mount point, from another
cluster member to make it available again.
Alternatively, booting a
cluster member will remount those file systems that are listed in the
/etc/fstab
file that are not currently mounted
and served in the cluster.
(If you are using AutoFS or automount, the
remount will happen automatically.)
9.4 Managing the Device Request Dispatcher
The device request dispatcher subsystem makes physical disk and tape storage transparently available to all cluster members, regardless of where the storage is physically located in the cluster. When an application requests access to a file, CFS passes the request to AdvFS, which then passes it to the device request dispatcher. In the file system hierarchy, the device request dispatcher sits right above the device drivers.
The primary tool for managing the device request dispatcher
is the
drdmgr
command.
A number of examples of using the command appear in
this section.
For more information, see
drdmgr(8).
9.4.1 Direct-Access I/O and Single-Server Devices
The device request dispatcher follows a client/server model; members serve devices, such as disks, tapes, and CD-ROM drives.
Devices in a cluster are either direct-access I/O devices or single-server devices. A direct-access I/O device supports simultaneous access from multiple cluster members. A single-server device supports access from only a single member.
Direct-access I/O devices on a shared bus are served by all
cluster members on that
bus.
A single-server device, whether on a shared bus or directly
connected to a cluster member, is served by a single member.
All other members access the served device through the serving
member.
Direct-access I/O devices are part of the device
request dispatcher subsystem, and have nothing to do with direct I/O
(opening a file with the
O_DIRECTIO
flag to the
open
system call),
which is handled by CFS.
See
Section 9.3.6.2
for
information about direct I/O and CFS.
Typically, disks on a shared bus are direct-access I/O devices, but in certain circumstances, some disks on a shared bus can be single-server. The exceptions occur when you add an RZ26, RZ28, RZ29, or RZ1CB-CA disk to an established cluster. Initially, such devices are single-server devices. See Section 9.4.1.1 for more information. Tape devices are always single-server devices.
Although single-server disks on a shared bus are supported, they are significantly slower when used as member boot disks or swap files, or for the retrieval of core dumps. We recommend that you use direct-access I/O disks in these situations.
Figure 9-3
shows a four-node cluster
with five disks and a tape drive on the shared bus.
Member systemd is not on the shared bus.
Its
access to cluster storage is routed through the cluster
interconnect.
Figure 9-3: Four Node Cluster
Disks on the shared bus are served by all the cluster members
on the bus.
You can confirm this by looking for the device
request dispatcher server of
dsk3
as follows:
# drdmgr -a server dsk3
Device Name: dsk3
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 3
Server Name: systema
Server State: Server
Server Name: systemb
Server State: Server
Server Name: systemc
Server State: Server
Because
dsk3
is a direct-access I/O device on the
shared bus, all three systems on the bus serve it: when any member on
the shared bus accesses the disk,
the access is directly from the member to the device.
Disks on private buses are served by the system that they are local to.
For example,
the server of
dsk7
is
systemb:
# drdmgr -a server dsk7
Device Name: dsk7
Device Type: Direct Access IO Disk
Device Status: OK
Number of Servers: 1
Server Name: systemb
Server State: Server
Tape drives are
always single-server.
Because
tape0
is on a shared bus, any member on
that bus can act as its server.
When the cluster is started,
the first active member that has access to the tape drive becomes the
server for the tape drive.
The numbering of disks indicates that when the
cluster booted,
systema
came up first.
It detected
its private disks first and labeled them, then it detected the disks on
the shared bus and labeled them.
Because
systema
came up first, it is also the server for
tape0.
To confirm this, enter the following command:
# drdmgr -a server tape0
Device Name: tape0
Device Type: Served Tape
Device Status: OK
Number of Servers: 1
Server Name: systema
Server State: Server
To change
tape0's server to
systemc,
enter the
drdmgr
command as follows:
# drdmgr -a server=systemc /dev/tape/tape0
For any single-server device, the serving member is also the access node. The following command confirms this:
# drdmgr -a accessnode tape0
Device Name: tape0
Access Node Name: systemc
Unlike the device request dispatcher
SERVER
attribute,
which for a given device is the same on all cluster members, the value
of the
ACCESSNODE
attribute is specific to a
cluster member.
Any system on a shared bus is always its own access node for the direct-access I/O devices on the same shared bus.
Because
systemd
is not on the shared bus,
for each direct-access I/O device on the shared bus you can specify
the access node to be used by
systemd
when it
accesses the device.
The access node must be one of the members on the
shared bus.
The result of the following command is that
systemc
handles all device request dispatcher activity between
systemd
and
dsk3:
# drdmgr -h systemd -a accessnode=systemc dsk3
9.4.1.1 Devices Supporting Direct-Access I/O
RAID-fronted disks are direct-access I/O capable. The following are examples of Redundant Array of Independent Disks (RAID) controllers:
HSZ80
HSG60
HSG80
RA3000 (HSZ22)
Enterprise Virtual Array (HSV110)
Any RZ26, RZ28, RZ29, and RZ1CB-CA disks already
installed in a system at the time
the system becomes a cluster member, either through the
clu_create
or
clu_add_member
command, are automatically enabled as direct-access I/O disks.
To later add one of these disks as a direct-access I/O disk, you must
use the procedure in
Section 9.2.3.
9.4.1.2 Replacing RZ26, RZ28, RZ29, or RZ1CB-CA as Direct-Access I/O Disks
If you replace an RZ26, RZ28, RZ29, or RZ1CB-CA direct-access I/O disk with a disk of the same type (for example, replace an RZ28-VA with another RZ28-VA), follow these steps to make the new disk a direct-access I/O disk:
Physically install the disk in the bus.
On each cluster member, enter the
hwmgr
command to scan for the
new disk as follows:
# hwmgr -scan comp -cat scsi_bus
Allow a minute or two for the scans to complete.
If you want the new disk to have the same device name as the disk it
replaced, use the
hwmgr -redirect scsi
command.
For details, see
hwmgr(8)
On each cluster member, enter the
clu_disk_install
command:
# clu_disk_install
Note
If the cluster has a large number of storage devices, the
clu_disk_install command can take several minutes to complete.
9.4.1.3 HSZ Hardware Supported on Shared Buses
For a list of hardware that is supported on shared buses, see the TruCluster Server Version 5.1B QuickSpecs.
If you try to use
an HSZ that does not have
the proper firmware revision on a shared bus, the cluster
might hang when there are multiple simultaneous attempts to access
the HSZ.
9.5 Managing AdvFS in a Cluster
For the most part, the Advanced file system (AdvFS) on a cluster is like that on a standalone system. However, this section describes some cluster-specific considerations:
Integrating AdvFS files from a newly added member (Section 9.5.1)
Creating only one fileset in the cluster root domain (Section 9.5.2)
Not adding filesets to a member's boot partition (Section 9.5.3)
Not adding a volume to a member's root domain (Section 9.5.4)
Using the
addvol
and
rmvol
commands
(Section 9.5.5)
Using user and group file system quotas (Section 9.5.6)
Understanding storage connectivity and AdvFS volumes (Section 9.5.7)
9.5.1 Integrating AdvFS Files from a Newly Added Member
Suppose that you add a new member to the cluster and that new member has AdvFS volumes and filesets from when it ran as a standalone system. To integrate these volumes and filesets into the cluster, you need to do the following:
Modify the
/etc/fstab
file
listing the
domains#filesets
that you want to integrate into the cluster.
Make the new domains
known to the cluster, either by manually entering the domain information
into
/etc/fdmns
or by running the
advscan
command.
For information on the
advscan
command, see
advscan(8).
For information about recreating domain entries in
/etc/fdmns,
see the section on restoring an AdvFS file system in
the Tru64 UNIX
AdvFS Administration
manual.
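As a minimal sketch of the manual approach, assume that the new member's disk dsk16c (a hypothetical name) holds a domain that you want to call tools_dom with a fileset named tools. You can recreate the domain entry and make the fileset available clusterwide as follows:
# mkdir /etc/fdmns/tools_dom
# cd /etc/fdmns/tools_dom
# ln -s /dev/disk/dsk16c
Then add an entry such as the following to the clusterwide /etc/fstab file:
tools_dom#tools   /tools   advfs rw 0 2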
9.5.2 Create Only One Fileset in Cluster Root Domain
The root domain,
cluster_root, must
contain only a single fileset.
If you create more than one fileset in
cluster_root
(you are not prevented from
doing so), it can lead to a panic if the
cluster_root
domain needs to fail over.
As an example of when this situation might occur, consider
cloned filesets.
As described in
advfs(4), you can create and mount a clone fileset (a read-only snapshot) of the fileset in the
cluster_root
domain.
If the
cluster_root
domain has to fail
over while the cloned fileset is mounted, the cluster will
panic.
Note
If you make backups of the clusterwide root from a cloned fileset, minimize the amount of time during which the clone is mounted. Mount the cloned fileset, perform the backup, and unmount the clone as quickly as possible.
9.5.3 Adding Filesets to a Member's Boot Partition Not Recommended
Although you are not prohibited from adding filesets to a member's boot
partition, we do not recommend it.
If a member leaves the cluster,
all filesets mounted from that member's boot partition are
force-unmounted and cannot be relocated.
9.5.4 Do Not Add a Volume to a Member's Root Domain
You cannot use the
addvol
command to add volumes to a member's
root domain (rootmemberID_domain#root).
Instead, you must delete the member from the cluster, use
diskconfig
or SysMan to configure the disk
appropriately, and then add the member back
into the cluster.
For the configuration requirements for a member boot
disk, see the
Cluster Installation
manual.
9.5.5 Using the addvol and rmvol Commands in a Cluster
You can manage AdvFS domains from any
cluster member, regardless of
whether the domains are mounted on the local member or a remote member.
However, when you use the
addvol
or
rmvol
command from a member that is not the CFS
server for the domain you
are managing, the commands use
rsh
to execute
remotely on the member that is the CFS server for the domain.
This
has the following consequences:
If
addvol
or
rmvol
is entered
from a member that is not the server of the domain, and if the member
that is serving the domain fails, the command can hang on the
system where it was executed until TCP times out, which can take as
long as an hour.
If this situation occurs, you can kill the command and its associated
rsh
processes and repeat the command as follows:
Get the process identifiers (PIDs) with the
ps
command and pipe
the output through
more, searching for
addvol
or
rmvol, whichever
is appropriate.
For example:
# ps -el | more +/addvol
80808001  I +     0 16253977 16253835  0.0  44  0  451700  424K wait   pts/0  0:00.09 addvol
80808001  I +     0 16253980 16253977  0.0  44  0  1e6200  224K event  pts/0  0:00.02 rsh
  808001  I +     0 16253981 16253980  0.0  44  0  a82200   56K tty    pts/0  0:00.00 rsh
Use the process IDs (in this example, PIDs
16253977,
16253980,
and
16253981) and parent process IDs
(PPIDs
16253977
and
16253980) to confirm
the association between the
addvol
or
rmvol
and the
rsh
processes.
Two
rsh
processes are associated with the
addvol
process.
All
three processes must be killed.
Kill the appropriate processes. In this example:
# kill -9 16253977 16253980 16253981
Reenter the
addvol
or
rmvol
command.
In the case of
addvol, you must use the
-F
option because the hung
addvol
command might have already changed the
disk label type to AdvFS.
Alternately, before using either the
addvol
or
rmvol
command on a domain,
you can do the following:
Use the
cfsmgr
command to learn the name of the CFS
server of the domain:
# cfsmgr -d domain_name
To get a list of the servers of all CFS domains, enter only the
cfsmgr
command.
Log in to the serving member.
Use the
addvol
or
rmvol
command.
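For example, assuming a hypothetical domain named projects_dom and an unused disk dsk14c, the sequence might look like the following. First identify the serving member:
# cfsmgr -d projects_dom
Then log in to the member shown as the server, add the volume, and confirm the result:
# addvol /dev/disk/dsk14c projects_dom
# showfdmn projects_dom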
If the CFS
server for the volume fails over to another member
in the middle of an
addvol
or
rmvol
operation, you may need to reenter the command because the new server
undoes any partial operation.
The
command does
not return a message indicating that the server failed, and the
operation must be repeated.
We recommend that you enter a
showfdmn
command for the target domain of an
addvol
or
rmvol
command after the
command returns.
The
rmvol
and
addvol
commands
use
rsh
when the member where the commands are executed
is not the server of the domain.
For
rsh
to function, the default cluster alias must
appear in the
/.rhosts
file.
The entry for
the cluster alias in
/.rhosts
can take the form of
the fully qualified host name or the unqualified host name.
Although the
plus sign (+) can appear in place of the host name, allowing
all hosts access, this is not
recommended for security reasons.
The
clu_create
command automatically places
the cluster alias in
/.rhosts,
so
rsh
normally works without your intervention.
If the
rmvol
or
addvol
command
fails because of
rsh
failure, the following
message is returned:
rsh failure, check that the /.rhosts file allows cluster alias access.
9.5.6 User and Group File System Quotas Are Supported
TruCluster Server includes quota support that allows you to limit both the number of files and the total amount of disk space that are allocated in an AdvFS file system on behalf of a given user or group.
Quota support in a TruCluster Server environment is similar to quota support in the Tru64 UNIX system, with the following exceptions:
Hard limits are not absolute because the cluster file system (CFS) makes certain assumptions about how and when cached data is written.
Soft limits and grace periods are supported, but a user might not get a message when the soft limit is exceeded from a client node, and such a message might not arrive in a timely manner.
The quota commands are effective clusterwide.
However, you must edit the
/sys/conf/NAME
system configuration file on each cluster
member to configure the system to include the quota subsystem.
If
you do not perform this step on a cluster member, quotas are
enabled on that member but you cannot enter quota
commands from that member.
TruCluster Server supports quotas only for AdvFS file systems.
Users and groups are managed clusterwide. Therefore, user and group quotas are also managed clusterwide.
This section describes information that is unique to managing
disk quotas in a TruCluster Server environment.
For general
information about managing quotas, see the Tru64 UNIX
System Administration
manual.
9.5.6.1 Quota Hard Limits
In a Tru64 UNIX system, a hard limit places an absolute upper boundary on the number of files or amount of disk space that a given user or group can allocate on a given file system. When a hard limit is reached, disk space allocations or file creations are not allowed. System calls that would cause the hard limit to be exceeded fail with a quota violation.
In a TruCluster Server environment, hard limits for the number of files are enforced as they are in a standalone Tru64 UNIX system.
However, hard limits on the total amount of disk space are not as rigidly enforced. For performance reasons, CFS allows client nodes to cache a configurable amount of data for a given user or group without any communication with the member serving that data. After the data is cached on behalf of a given write operation and the write operation returns to the caller, CFS guarantees that, barring a failure of the client node, the cached data will eventually be written to disk at the server.
Writing the cached data takes precedence over strictly enforcing the disk quota. If and when a quota violation occurs, the data in the cache is written to disk regardless of the violation. Subsequent writes by this group or user are not cached until the quota violation is corrected.
Because additional data is not written
to the cache while quota violations are being generated, the hard
limit is never exceeded by more than the
sum of
quota_excess_blocks
on all cluster members.
The actual disk space quota for a user or group is therefore
determined by the hard limit plus the sum of
quota_excess_blocks
on all cluster members.
The amount of data that a given user or group is allowed to cache is
determined by the
quota_excess_blocks
value, which is
located in the member-specific
/etc/sysconfigtab
file.
The
quota_excess_blocks
value is
expressed in units of 1024-byte blocks and the default value of 1024
represents 1 MB of disk space.
The value of
quota_excess_blocks
does
not have to be the same on all cluster members.
You might use a
larger
quota_excess_blocks
value on cluster members
on which you expect most of the data to be generated, and accept the
default value for
quota_excess_blocks
on other
cluster members.
9.5.6.2 Setting the
quota_excess_blocks Value
The value for
quota_excess_blocks
is
maintained in the
/etc/sysconfigtab
file in the
cfs
stanza.
Avoid making manual changes to this
file.
Instead, use the
sysconfigdb
command to make
changes.
This utility automatically makes any changes available
to the kernel and preserves the structure of the file so that future
upgrades merge in correctly.
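For example, the following is a minimal sketch of one way to set quota_excess_blocks to 2048 (2 MB) on a member. The stanza file name is arbitrary, and the example assumes the merge (-m) form of sysconfigdb; see sysconfigdb(8) for the exact options:
# cat > /tmp/cfs_quota.stanza << EOF
cfs:
        quota_excess_blocks = 2048
EOF
# sysconfigdb -m -f /tmp/cfs_quota.stanza cfs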
Performance for a given user or group can be
affected by
quota_excess_blocks.
If this value
is set too low, CFS cannot use the cache
efficiently.
Setting
quota_excess_blocks
to less
than 64K will have a severe performance impact.
Conversely, setting
quota_excess_blocks
too
high increases the actual
amount of disk space that a user or group can consume.
We recommend accepting the
quota_excess_blocks
default of 1 MB, or increasing it as much as is considered
practical given its effect of raising the potential upper limit on
disk block usage.
When determining
how to set this value, consider
that the worst-case upper boundary is determined as follows:
(admin specified hard limit) + (sum of "quota_excess_blocks" on each client node)
CFS makes a significant effort to minimize the amount by which the
hard quota limit is exceeded; you are very unlikely to reach
the worst-case upper boundary.
9.5.7 Storage Connectivity and AdvFS Volumes
All volumes in an AdvFS domain must have the same connectivity if failover capability is desired. Volumes have the same connectivity when either one of the following conditions is true:
All volumes in the AdvFS domain are on the same shared SCSI bus.
Volumes in the AdvFS domain are on different shared SCSI buses, but all of those buses are connected to the same cluster members.
The
drdmgr
and
hwmgr
commands can give you information about which systems serve which disks.
To get a graphical display of the cluster hardware configuration, including
active members, buses, storage devices, and their connections, use the
sms
command to invoke the graphical interface
for the SysMan Station, and then select Hardware
from the Views menu.
9.6 Considerations When Creating New File Systems
Most aspects of creating new file systems are the same in a cluster and a standalone environment. The Tru64 UNIX AdvFS Administration manual presents an extensive description of how to create AdvFS file systems in a standalone environment.
For information about adding disks to the cluster, see Section 9.2.3.
The following are important cluster-specific considerations for creating new file systems:
To ensure the highest availability, make sure that all disks that are used for volumes in an AdvFS domain have the same connectivity.
We recommend that all LSM volumes that are placed into an AdvFS domain share the same connectivity. See the Tru64 UNIX Logical Storage Manager manual for more on LSM volumes and connectivity.
When you determine whether a disk is in use, make sure it is not used as any of the following:
The cluster quorum disk
Do not use any of the partitions on a quorum disk for data.
The clusterwide root file system, the
clusterwide
/var
file system, or the
clusterwide
/usr
file system
A member's boot disk
See Section 11.1.5 for a description of the member boot disk and how to configure one.
A single
/etc/fstab
file applies to all members
of a cluster.
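For example, the following is a minimal sketch of creating a new clusterwide AdvFS file system. The disk dsk17c, the domain name data_dom, the fileset name data, and the mount point /data are hypothetical; choose a disk with the connectivity that you require:
# mkfdmn /dev/disk/dsk17c data_dom
# mkfset data_dom data
# mkdir /data
Add an entry such as the following to the clusterwide /etc/fstab file:
data_dom#data   /data   advfs rw 0 2
Then mount the fileset from any member:
# mount /data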
9.6.1 Verifying Disk Connectivity
To ensure the highest availability, make sure that all disks that are used for volumes in an AdvFS domain have the same connectivity.
Disks have the same connectivity when either one of the following conditions is true:
All disks that are used for volumes in the AdvFS domain are on the same shared SCSI bus.
Disks that are used for volumes in the AdvFS domain are on different shared SCSI buses, but all of those buses are connected to the same cluster members.
The easiest way to verify disk connectivity is to
use the
sms
command to invoke the graphical
interface for the SysMan Station, and then select
Hardware from the Views menu.
For example, in
Figure 9-1, the SCSI bus
that is connected to the
pza0s is shared by all
three cluster members.
All disks
on that bus have the same connectivity.
You can also use the
hwmgr
command to view all the
devices on the cluster and then pick out those disks that show up
multiple times because they are connected to several members.
For example:
# hwmgr -view devices -cluster
HWID: Device Name Mfg Model Hostname Location
-------------------------------------------------------------------------------
3: kevm pepicelli
28: /dev/disk/floppy0c 3.5in floppy pepicelli fdi0-unit-0
40: /dev/disk/dsk0c DEC RZ28M (C) DEC pepicelli bus-0-targ-0-lun-0
41: /dev/disk/dsk1c DEC RZ28L-AS (C) DEC pepicelli bus-0-targ-1-lun-0
42: /dev/disk/dsk2c DEC RZ28 (C) DEC pepicelli bus-0-targ-2-lun-0
43: /dev/disk/cdrom0c DEC RRD46 (C) DEC pepicelli bus-0-targ-6-lun-0
44: /dev/disk/dsk13c DEC RZ28M (C) DEC pepicelli bus-1-targ-1-lun-0
44: /dev/disk/dsk13c DEC RZ28M (C) DEC polishham bus-1-targ-1-lun-0
44: /dev/disk/dsk13c DEC RZ28M (C) DEC provolone bus-1-targ-1-lun-0
45: /dev/disk/dsk14c DEC RZ28L-AS (C) DEC pepicelli bus-1-targ-2-lun-0
45: /dev/disk/dsk14c DEC RZ28L-AS (C) DEC polishham bus-1-targ-2-lun-0
45: /dev/disk/dsk14c DEC RZ28L-AS (C) DEC provolone bus-1-targ-2-lun-0
46: /dev/disk/dsk15c DEC RZ29B (C) DEC pepicelli bus-1-targ-3-lun-0
46: /dev/disk/dsk15c DEC RZ29B (C) DEC polishham bus-1-targ-3-lun-0
46: /dev/disk/dsk15c DEC RZ29B (C) DEC provolone bus-1-targ-3-lun-0
.
.
.
In this partial output,
dsk0,
dsk1, and
dsk2
are private disks that are connected to
pepicelli's local
bus.
None of these are appropriate for a file system that
needs failover capability, and they are not good choices
for Logical Storage Manager (LSM) volumes.
Disks
dsk13
(HWID 44),
dsk14
(HWID 45), and
dsk15
(HWID 46) are connected to
pepicelli,
polishham, and
provolone.
These three disks all have the same connectivity.
9.6.2 Looking for Available Disks
When you want to determine whether disks are already in use, look for the
quorum disk, disks containing the clusterwide file systems,
and member boot disks and swap areas.
9.6.2.1 Looking for the Location of the Quorum Disk
You can learn the location of the
quorum disk by using the
clu_quorum
command.
In the following example, the partial output for the command shows that
dsk10
is the cluster quorum disk:
# clu_quorum
Cluster Quorum Data for: deli as of Wed Apr 25 09:27:36 EDT 2001
Cluster Common Quorum Data
Quorum disk: dsk10h
.
.
.
You can also use the
disklabel
command
to look for a quorum disk.
All partitions in a quorum
disk are unused, except for the
h
partition, which has
fstype
cnx.
9.6.2.2 Looking for the Location of Member Boot Disks and Clusterwide AdvFS File Systems
To learn the locations of member boot disks and clusterwide
AdvFS file
systems, look for the file domain entries in
the
/etc/fdmns
directory.
You can use the
ls
command for this.
For example:
# ls /etc/fdmns/*
/etc/fdmns/cluster_root:
dsk3c

/etc/fdmns/cluster_usr:
dsk5c

/etc/fdmns/cluster_var:
dsk6c

/etc/fdmns/projects1_data:
dsk9c

/etc/fdmns/projects2_data:
dsk11c

/etc/fdmns/projects_tools:
dsk12c

/etc/fdmns/root1_domain:
dsk4a

/etc/fdmns/root2_domain:
dsk8a

/etc/fdmns/root3_domain:
dsk2a

/etc/fdmns/root_domain:
dsk0a

/etc/fdmns/usr_domain:
dsk0g
This output from the
ls
command
indicates the following:
Disk
dsk3
is used by the clusterwide
root file system (/).
You cannot use this disk.
Disk
dsk5
is used by the clusterwide
/usr
file system.
You cannot use this disk.
Disk
dsk6
is used by the clusterwide
/var
file system.
You cannot use this disk.
Disks
dsk4,
dsk8, and
dsk2
are member boot disks.
You cannot use these disks.
You can also use the
disklabel
command to identify
member boot disks.
They have three partitions:
the
a
partition has
fstype
AdvFS,
the
b
partition has
fstype
swap, and
the
h
partition has
fstype
cnx.
Disks
dsk9,
dsk11, and
dsk12
appear to be used for data and tools.
Disk
dsk0
is the boot disk for
the noncluster, base Tru64 UNIX operating system.
Keep this disk unchanged in case you need to boot the noncluster kernel to make repairs.
9.6.2.3 Looking for Member Swap Areas
A member's primary swap area is always the
b
partition of the member boot disk.
(For information about member boot disks, see
Section 11.1.5.)
However, a member might have additional swap areas.
If a member is down, be careful not to use the member's
swap area.
To learn whether a disk has swap areas on it, use
the
disklabel -r
command.
Look in the
fstype
column in the output for
partitions with
fstype
swap.
In the following
example, partition
b
on
dsk11
is a swap partition:
# disklabel -r dsk11
.
.
.
8 partitions:
# size offset fstype [fsize bsize cpg] # NOTE: values not exact
a: 262144 0 AdvFS # (Cyl. 0 - 165*)
b: 401408 262144 swap # (Cyl. 165*- 418*)
c: 4110480 0 unused 0 0 # (Cyl. 0 - 2594)
d: 1148976 663552 unused 0 0 # (Cyl. 418*- 1144*)
e: 1148976 1812528 unused 0 0 # (Cyl. 1144*- 1869*)
f: 1148976 2961504 unused 0 0 # (Cyl. 1869*- 2594)
g: 1433600 663552 AdvFS # (Cyl. 418*- 1323*)
h: 2013328 2097152 AdvFS # (Cyl. 1323*- 2594)
You can use the SysMan Station graphical user interface (GUI) to
create and configure
an AdvFS volume.
However, if you choose to use the
command line, when it comes time to edit
/etc/fstab, you need do it only once, and
you can do it on any cluster member.
The
/etc/fstab
file is
not a CDSL.
A single file is used by all cluster members.
9.7 Managing CDFS File Systems
In a cluster, a CD-ROM drive is always a served device. The drive must be connected to a local bus; it cannot be connected to a shared bus. The following are restrictions on managing a CD-ROM file system (CDFS) in a cluster:
The
cddevsuppl
command is not supported in a
cluster.
Certain CDFS management commands work only when executed from the cluster member that is the CFS server of the CDFS file system.
Regardless of which member mounts the CD-ROM, the member that is connected to the drive is the CFS server for the CDFS file system.
To manage a CDFS file system, follow these steps:
Enter the
cfsmgr
command to learn which member
currently serves the CDFS:
# cfsmgr
Log in on the serving member.
Use the appropriate commands to perform the management tasks.
For information about using library functions that manipulate the
CDFS, see the TruCluster Server
Cluster Highly Available Applications
manual.
9.8 Backing Up and Restoring Files
Backing up and restoring user data in a cluster is similar to doing so in a standalone system.
You back up and restore CDSLs
like any other symbolic links.
To back up all the targets of CDSLs,
back up the
/cluster/members
area.
Make sure that all restore software that you plan to use
is available on the Tru64 UNIX disk of the system that
was the initial cluster member.
Treat this disk as the
emergency repair disk for the cluster.
If the cluster loses
the root domain,
cluster_root, you can
boot the initial cluster member from the Tru64 UNIX disk
and restore
cluster_root.
The
bttape
utility is not supported in clusters.
The
clonefset
utility,
described in
clonefset(8), creates a read-only clone fileset that you can back up with the
vdump
command or
other supported backup utility.
(The
dump
command is not supported by AdvFS.) You
might find it useful to use the
clonefset
utility to back up cluster file systems.
If you do
make backups of the clusterwide root from a cloned fileset,
minimize the amount of time during which the clone is mounted.
Mount the cloned fileset, perform the backup, and unmount the
clone as quickly as possible.
See
Section 9.5.2
for additional information.
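As a minimal sketch, assume that you want to back up the clusterwide /usr fileset (cluster_usr#usr) to the tape device /dev/tape/tape0; the clone name usr_clone and the mount point /clone_usr are hypothetical. Remove the clone as soon as the backup completes:
# clonefset cluster_usr usr usr_clone
# mkdir /clone_usr
# mount -r cluster_usr#usr_clone /clone_usr
# vdump -0 -f /dev/tape/tape0 /clone_usr
# umount /clone_usr
# rmfset cluster_usr usr_clone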
9.8.1 Suggestions for Files to Back Up
Back up data files and the following file systems regularly:
The clusterwide root file system
Use the same backup and restore methods that you use for user data.
The clusterwide
/usr
file system
Use the same backup and restore methods that you use for user data.
The clusterwide
/var
file system
Use the same backup and restore methods that you use for user data.
If, before installing TruCluster Server, you were using AdvFS and
had
/var
located in
/usr
(usr_domain#var),
the installation process moved
/var
into its own domain (cluster_var#var).
Because of this move, you must back up
/var
as a
separate file system from
/usr.
Member boot disks
See Section 11.1.5 for special considerations for backing up and restoring member boot disks.
9.9 Managing Swap Space
Do not put swap entries in
/etc/fstab.
In Tru64 UNIX Version 5.0 the list of swap devices was moved from the
/etc/fstab
file to the
/etc/sysconfigtab
file.
Additionally, you no longer
use the
/sbin/swapdefault
file
to indicate the swap allocation; use the
/etc/sysconfigtab
file for this purpose as well.
The swap devices and swap allocation mode are automatically placed in the
/etc/sysconfigtab
file during installation of the base operating system.
For more
information, see the Tru64 UNIX
System Administration
manual
and
swapon(8)
Put each member's swap information in
that member's
sysconfigtab
file.
Do not put any swap
information in the clusterwide
/etc/fstab
file.
Swap information in
sysconfigtab
is identified
by the
swapdevice
attribute.
The format for swap information is as follows:
swapdevice=disk_partition,disk_partition,...
For example:
swapdevice=/dev/disk/dsk1b,/dev/disk/dsk3b
Specifying swap entries in
/etc/fstab
does not
work in a cluster because
/etc/fstab
is not
member-specific; it is a clusterwide file.
If swap is
specified in
/etc/fstab, the first member
to boot and form a cluster reads and mounts all the file systems in
/etc/fstab.
The other members never see that
swap space.
The file
/etc/sysconfigtab
is a context-dependent
symbolic link (CDSL), so that each member can
find and mount its specific swap partitions.
The installation script automatically
configures one swap device for each member, and puts a
swapdevice=
entry in that member's
sysconfigtab
file.
If you want to add additional swap space, specify the new partition
with
swapon, and then put an entry in
sysconfigtab
so the partition is available
following a reboot.
For example, to configure
dsk3b
for use as a secondary swap device for a member already
using
dsk1b
for swap, enter the following
command:
# swapon -s /dev/disk/dsk3b
Then, edit that member's
/etc/sysconfigtab
and add
/dev/disk/dsk3b.
The final
entry in
/etc/sysconfigtab
will look like the
following:
swapdevice=/dev/disk/dsk1b,/dev/disk/dsk3b
9.9.1 Locating Swap Device for Improved Performance
Locating a member's swap space on a device on a shared bus results in additional I/O traffic on the bus. To avoid this, you can place swap on a disk on the member's local bus.
The only downside to locating swap local to the member is
the unlikely case where the member loses its path to the swap disk,
which can happen when an adapter fails.
In this situation, the
member will fail.
When the swap disk is on
a shared bus, the member can still use its swap partition as long
as at least one member still has a path to the disk.
9.10 Fixing Problems with Boot Parameters
If a cluster member fails to boot due to parameter problems in the
member's root domain
(rootN_domain),
you can mount that domain on a running
member and make the needed
changes to the parameters.
However, before booting the
down member, you must unmount the
newly updated member root
domain from the running cluster member.
Failure to do so can cause a crash and result in the display of the following message:
cfs_mountroot: CFS server already exists for node boot partition.
For more information, see
Section 11.1.10.
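As a minimal sketch, assume that member 2 cannot boot and that its root domain is root2_domain. From a running member, you might mount the domain, correct the boot parameters (for example, in the member's etc/sysconfigtab on that partition), and unmount it before booting member 2:
# mount root2_domain#root /mnt
Edit the parameter files under /mnt as needed, then unmount the domain:
# umount /mnt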
9.11 Using the verify Utility in a Cluster
The
verify
utility
examines the on-disk metadata structures of AdvFS file systems.
Before
using the utility, you must unmount all filesets in the file domain
to be verified.
If you are running the
verify
utility and the
cluster member on which it is running fails, extraneous mounts may be
left.
This can happen because the
verify
utility creates temporary mounts of the
filesets that are in the domain that is being verified.
On a single system these mounts go away if the system
fails while running the utility, but, in a cluster, the mounts
fail over to another cluster member.
The fact that these mounts
fail over also prevents you from mounting the filesets until
you remove the spurious mounts.
When
verify
runs, it creates a directory for
each fileset in the domain and then mounts each fileset on the
corresponding directory.
A directory is named as follows:
/etc/fdmns/domain/set_verify_XXXXXX,
where
XXXXXX
is a unique ID.
For example, if the domain name is
dom2
and the
filesets in
dom2
are
fset1,
fset2, and
fset3, enter the following command:
# ls -l /etc/fdmns/dom2
total 24
lrwxr-xr-x   1 root   system    15 Dec 31 13:55 dsk3a -> /dev/disk/dsk3a
lrwxr-x---   1 root   system    15 Dec 31 13:55 dsk3d -> /dev/disk/dsk3d
drwxr-xr-x   3 root   system  8192 Jan  7 10:36 fset1_verify_aacTxa
drwxr-xr-x   4 root   system  8192 Jan  7 10:36 fset2_verify_aacTxa
drwxr-xr-x   3 root   system  8192 Jan  7 10:36 fset3_verify_aacTxa
To clean up the failed-over mounts, follow these steps:
Unmount all the filesets in
/etc/fdmns:
# umount /etc/fdmns/*/*_verify_*
Delete the failed-over mount-point directories with the following command:
# rm -rf /etc/fdmns/*/*_verify_*
Remount the filesets like you do after a normal
completion of the
verify
utility.
For more information about
verify, see
verify(8).
9.11.1 Using the verify Utility on Cluster Root
The
verify
utility has been modified
to allow it to run on active domains.
Use the
-a
option to examine the cluster root file system,
cluster_root.
You must execute the
verify -a
utility
on the member that is serving the domain that you are examining.
Use the
cfsmgr
command to determine which
member serves the domain.
When
verify
runs with the
-a
option, it only examines the domain.
No fixes can be
done on the active domain.
The
-f
and
-d
options cannot be used with the
-a
option.
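For example, assuming that you have confirmed with cfsmgr that the local member serves the cluster_root domain, you might enter the following commands:
# cfsmgr -d cluster_root
# verify -a cluster_root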