Article 138047 of comp.os.vms:
Path: nntpd.lkg.dec.com!nntpd2.cxo.dec.com!movies.enet.dec.com!randal
From: randal@movies.enet.dec.com (Paul Randal)
Newsgroups: comp.os.vms
Subject: Spiralog Technical Documentation (long)
Date: 18 Jan 1996 08:55:54 GMT
Organization: DEC OpenVMS Engineering
Lines: 621
Distribution: world
Message-ID: <4dl1uq$34u@nntpd2.cxo.dec.com>
Reply-To: cwells@movies.enet.dec.com (Clare Wells)
NNTP-Posting-Host: grip
Keywords: Spiralog
X-Newsreader: mxrn 6.18-16

I'm posting this for the Spiralog Product Manager:

===============================================================================

I have attached below a recent information sheet describing the Spiralog technology. Other than the technical design document (available only under non-disclosure), this 9 page document is the most technical document currently available which describes the technology.

The attached document is also available in PostScript format with supporting diagrams. Should you require the document in this format, please mail me at clare.wells@edo.mts.dec.com.

I apologise for the delay in completing this document. Schedule constraints in ensuring we delivered Spiralog V1 (on time) at the end of the year meant that we had to reschedule tasks which were not critical to delivery of the V1 product.

Please note that Spiralog product documentation can be ordered via normal channels.

Regards,

Clare F Wells
OpenVMS UK Product Management


Digital's Spiralog File System for OpenVMS Alpha: Information Sheet

Spiralog is Digital's new file system for OpenVMS Alpha that provides advanced, high-performance file services on single computers and OpenVMS Clusters. This Information Sheet describes the technology of this innovative new software and how it benefits OpenVMS Alpha users.

The first release of Spiralog will be available early in 1996 as a separately ordered product running with OpenVMS Alpha Version 7.0. Rights to use the product will be part of the OpenVMS Alpha license.

The Demands on File Systems Today

Every computer operating system includes a file system, the component that organises files on disk. The requirements of business computing today make ever higher demands on file systems. These demands include:

* The need to handle huge amounts of data, very large files and very large numbers of files. Disks of many gigabytes and "data farms" of terabytes are becoming very common. Increasingly demanding applications are continually pushing file systems to their performance limits. Examples include file servers such as PATHWORKS and technical applications such as image-processing, multimedia and finite-element analysis.

* The requirement for uninterrupted service, 24 hours a day and 365 days of the year, without down time for backups or system failures.

* Demands for better performance when users on different computers access the same data at the same time.

* The need to add extra computers to existing systems as an organisation's computing workload increases, without cost in performance.

* The need to use a single file system on a client-server network to save data from different clients, with each client seeing the data in its "native" format.

Existing file systems have limits intrinsic to their design that prevent them meeting these demands. For example, 32-bit addressing limits the handling of large data structures, and using tables for internal directories restricts performance with large numbers of files.

The challenge is to provide a file system that will cope with these demands well into the future. Digital's Spiralog is the file system for the new millennium that tackles these problems head on.

What are the Advantages of Spiralog?

Spiralog is an innovative file system for OpenVMS Alpha. It runs on both OpenVMS Clusters and single computers. It offers the following advantages to OpenVMS Alpha users and system managers:

* Spiralog is much faster at writing data than Files-11, the existing OpenVMS Alpha file system.
  This is made possible by advanced caching technology and by eliminating most of the disk read/write head movements that slow down conventional write operations.

* It offers a new approach to disk storage including fast, online backup and rapid recovery of data. Spiralog achieves rates of backup that are close to the maximum device throughput.

* Incremental (partial) backups are as easy as full backups. Several incremental backups can be merged into one, independently of the original volume, so that files can be restored more quickly.

* You do not need to recompile or relink your existing OpenVMS Alpha applications to obtain the benefits of Spiralog.

* Spiralog distributes the processing load across the computers in an OpenVMS Cluster, making efficient use of the combined CPU power available within the OpenVMS Cluster.

* It runs alongside Files-11, on the same computer or on different computers in an OpenVMS Cluster.

* Support for different client systems (UNIX and PC) is built into Spiralog.

* Spiralog works with existing storage technology without changes.

* To migrate your data, you just copy it from your existing Files-11 volumes to your Spiralog volumes.

* Spiralog extends the familiar system service interface of OpenVMS, allowing programmers to make full use of the unique features of Spiralog's caching.

* Spiralog provides high availability and uninterrupted service to users when computers drop out or are taken out of the OpenVMS Cluster.

What's New About Spiralog Technology?

Spiralog is a completely new design of file system in several ways. Four main advances in technology are summarised here, and discussed in more detail later:

* The Log-Structured File System

  The method that Spiralog uses to write data on a disk is an enormous advance over that used by conventional file systems, including Files-11. Data is written to the disk in a sequential fashion regardless of which files it belongs to. The record that Spiralog makes is called a "log".
  Although other file systems make limited use of logs, Spiralog is unique in that both user data and the data required by the file system itself are stored in the log. Newly created files are written to the end of the log, and when you update a file (for example, by changing its owner or by modifying some of the data in it) the updates are also written to the end of the log. Existing files on the disk are not overwritten. The advantages of this design are fast writes and very fast online backup.

* The Clerk-Server Model

  Spiralog takes the best of the client-server model and applies it to OpenVMS Cluster computing as the "Clerk-Server Model". The client application deals directly with a clerk that runs on the client's own computer. The caching of data and all of the processing of the client application's file operations are the work of the clerk. The server may be on any computer in the OpenVMS Cluster, and it alone carries out the reads and writes on a disk. The system manager can set up standby servers on other computers to take over if the computer serving the volume fails. A computer can run both clerk and server components for a disk, and can serve more than one disk.

  The clerk-server model spreads the processing load well across an OpenVMS Cluster. It also offers high availability of the system, provides consistency of data across the whole cluster and preserves data integrity if a computer fails.

* Write Optimisation by Write-Behind Caching

  Spiralog buffers all writes in the clerk's cache, both user data and the data required by the file system. This "write-behind caching", unique to Spiralog, improves the response times for applications. Spiralog maintains the correct order of the writes while they are in the cache and combines multiple writes. This reduces the communications load.

* Architecture For Multiple File Systems

  Spiralog is designed to work as a server for client programs running on different operating systems, for example, UNIX and PC.
  Developments in OpenVMS are under consideration to take further advantage of this core facility, which will enable users on different clients to use Spiralog as their file system and backup mechanism.

What Is a Log-Structured Design?

In order to understand the heart of Spiralog, we first need to look at how conventional file systems like Files-11 work. The principle they are based on is called "update-in-place", because, when a file is updated, it remains in the same place on the disk surface.

When a file is updated, the disk read/write head must first move to the place containing information about the position of the file on the disk. When the file has been located, the head then moves to where the file is stored, and updates the data. A file which has been updated already, and has increased in size, may well be stored in several pieces in different places on the disk surface. Updating the file then requires several movements of the head.

For example, consider what happens in Files-11 when a file is stored in three blocks on the disk surface. If blocks 2 and 3 of the file are updated, the before and after layouts are the same. The read/write head has to move first to the index file to locate the file header, which stores information about the file, then to the file header to find the blocks where the file is stored. Then it makes separate moves to overwrite blocks 2 and 3 and returns to update the file header with the time of the update.

As much as 85% of the time in a conventional write operation is spent in moving the disk head over the surface to the required positions on the disk surface, rather than in transferring the data itself.

The log-structured design of Spiralog eliminates most head movement. Instead of updating the existing file in the same place on the disk surface, Spiralog simply writes the new data sequentially to the end of the log. Writing takes place at the rate that the head can move along the spiral disk track.
Spiralog does not go back and overwrite the data that has been changed.

In the previous example using the Spiralog on-disk structure, when blocks 2 and 3 and the file header have been updated, the layout of the data on the disk surface is not the same. The updated blocks of data and the header have been placed together at the end of the log, so the disk head needs only to move along that section of the disk track.

What Does Spiralog Do with Updated Data?

When Spiralog updates an existing file, it writes the new data to the end of the log. This makes the original data, which remains in the log, obsolete. If Spiralog did nothing with this obsolete data, the disk would eventually fill up. To prevent this, if the obsolete data occupies a certain minimum space, that space is made available immediately for further use; otherwise Spiralog marks the obsolete data for later removal.

A component of Spiralog called "the cleaner" (garbage collector) reclaims the space occupied by marked data. Normally, the cleaner runs as a background job at times when the disk is not busy, so that it does not slow normal disk activities down. However, as the disk becomes more full, its priority becomes higher. When the cleaner has removed the obsolete data, the region of the disk can now be used for writing new data.

The cleaner uses an algorithm to decide whether to clean a region of obsolete data now, or whether to delay. This algorithm is chosen to minimise the time spent cleaning.

Does Spiralog Need a Defragmenter?

The cleaner itself does an excellent job of ensuring that the disk is not fragmented. However, files in Spiralog may become fragmented. A file can be defragmented simply by copying it, which ensures that the file is written contiguously to the end of the log. Alternatively, a file defragmentation facility may be used.

How Does Spiralog Keep Track of File Positions on Disk?

Like all file systems, Spiralog keeps track of the disk blocks used to store a file, so that it can quickly read the file. Spiralog stores this information in a data structure called a depth-balanced tree (B-tree). Many databases use this data structure because it is fast and efficient and because it scales up well to larger sizes of database.

The B-tree is kept in the Spiralog volume itself, so that it benefits from the log-structured design of the Spiralog disk as much as any other data does. Because Spiralog caches everything it reads and writes, most of the B-tree is held in the cache, so that Spiralog may not have to read the disk at all to locate the file, or at most it will require one disk read to find it.

The result is that the performance of Spiralog is good both with volumes that contain large numbers of files and with those that contain only a few very large files. Performance falls off slowly with large numbers of files, in contrast to Files-11.

Why is Spiralog's Backup So Fast?

Spiralog's backup performance is far superior to anything currently available. Backup exploits the log-structured design by reading the data sequentially and copying it directly from disk to disk or tape, without any need to interpret the data to discover the files to which each item of data belongs. This copying requires much less CPU time than the file-by-file backup used by conventional file systems.

Spiralog's backup is similar to the Files-11 physical backup of a whole disk, but only the log, not the whole disk, needs to be copied. The backup medium does not have to be a Spiralog volume. Spiralog saves directory information with the backup, so you can quickly restore individual files or directories.

In a conventional file system during a write, data can be written to any part of the disk. It is usually necessary to stop writing during a backup, as otherwise the different parts of the backup will not be consistent with one another.
For example, part of the backup of a file may represent the state of the file before a write, and another part of the backup of the same file may correspond to the same file after the write. To avoid this problem of conventional file systems, volumes are taken offline before backup.

Spiralog writes changed data only to the end of the log, and the rest of the data on the disk is not changed during a write. Therefore, Spiralog can back up all the data in the log before a particular point, even while more data is being written to the end of the log. Backup does not mean taking the volume offline, so there is no interruption to normal business.

The backed-up volume represents a "snapshot" of the disk at the time of the backup. When the backup volume is used to restore data, the files are restored in precisely the state they were in at the backup time.

How Does Spiralog Do Incremental Backups?

Because a point in the Spiralog log represents a particular point in time, it is possible to make a backup of all the changes made since that time, by just copying all the data after that point. For example, you could do a full backup on Monday and then on Tuesday do an incremental backup of all the changes since the full one.

Incremental backup in Spiralog offers a particularly great advantage in speed over Files-11, because there is no need to search directories or read file headers to find out what has changed. Incremental backup, like full backup, can be done online without interruption to service.

Spiralog provides for backups to be made at up to 128 different levels, as in the UNIX dump command, where level 0 is a full backup, and the other levels represent partial backups. A backup at a particular level saves all changes made since the last backup at a lower level. You can use this to define a schedule of full and incremental backups covering different time periods.
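
The level rule can be illustrated with a short Python sketch. It only restates the rule given above and is not Spiralog's implementation: the `Backup` record, its fields and the `base_for` function are hypothetical names, and treating a partial backup with no lower-level predecessor as a full backup is an assumption borrowed from the way dump-style levels are usually described.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MAX_LEVELS = 128  # the text allows up to 128 levels; assumed here to be 0..127


@dataclass(frozen=True)
class Backup:
    label: str  # hypothetical label, for example the day the backup was taken
    level: int  # 0 = full backup, higher numbers = partial backups


def base_for(history: Sequence[Backup], level: int) -> Optional[Backup]:
    """Return the earlier backup that a new backup at `level` is relative to.

    Rule from the text: a backup at a given level saves all changes made
    since the last backup at a lower level. None means "no base", that is,
    everything in the log is saved (always the case for level 0).
    """
    if not 0 <= level < MAX_LEVELS:
        raise ValueError(f"level must be in 0..{MAX_LEVELS - 1}")
    # Walk backwards in time and stop at the first lower-level backup.
    for backup in reversed(history):
        if backup.level < level:
            return backup
    return None


# Hypothetical history, oldest first.
history = [Backup("Mon", 0), Backup("Tue", 1), Backup("Wed", 2)]
thursday_base = base_for(history, 1)  # a level 1 backup is relative to "Mon"
```

With this history, a new level 2 backup would be relative to Tuesday's level 1 backup, while a new level 1 backup reaches back to Monday's full backup and therefore covers more of the log.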
The advantage of a multi-level schedule is that you can balance the time taken to do backups and the space taken up by the backups themselves against the time needed to restore data. For example, the Monday backup could be a full one, then Tuesday's could be a backup of all changes in the last day, Wednesday's a backup of all changes in the last two days, Thursday's a backup of the last three days and so on. Then at most two backups are needed to restore data.

Spiralog also enables you to make unscheduled backups that do not affect the backup schedule. For example, on Wednesday you may make an unscheduled full backup, but the Thursday backup will cover the period since Monday as determined by the original schedule.

What Are the Advantages of Merging Backups?

A number of backups made at different times can be merged into a single backup. This saves time when the backup is used to restore data, and also economises on the use of offline storage media. For example, Monday's full backup and the partial backups on Tuesday and Wednesday can be merged into a single one. This produces exactly the same backup as if you did a full unscheduled backup on Wednesday.

Merging is carried out independently of the original volume, which can continue to be used online, so merging does not affect any applications using that volume. The merging on Wednesday in the example just discussed does not require the original volume to be online at the time it is done. You can merge backups offline, on any computer in any OpenVMS Cluster that is running Spiralog, even if no Spiralog volumes are mounted.

How Do Standby Servers Work?

Spiralog offers high availability in part because the clerk-server model allows standby servers to be installed for each Spiralog volume. Each of the standby servers is on a different computer in the OpenVMS Cluster, as specified by the system manager. Only one server serves the volume at any one time, ensuring that all applications running in the OpenVMS Cluster see the same data.
If the computer on which the server is running fails, Spiralog switches to a standby server without interrupting service. This is called "failover".

When failover occurs, clerks may have sent data to the server for writing to the disk. Normally, the server confirms to the clerk that the data has actually been made permanent on disk. If the server's computer fails during a write, this acknowledgement may not be sent. The unique design of Spiralog's interface between the clerk and the server ensures that this does not matter. The clerk can send the same data to the new server, confident that the written file will be identical whether the failed server had succeeded in writing the data or not.

The system manager can arrange failover for other purposes, for example, taking a computer out of service.

How Does Spiralog Recover After System Failure?

In a log-structured file system, any point in the log represents a particular moment in time. If a clerk is running on a computer that fails, the data in the log up to that point can be used to reconstruct the state of all the files at that moment so that the clerk can continue. Recovery can quickly be made to a consistent state, with the loss only of the data that was still in the clerk's cache and had not been written to disk at the time of failure.

When Spiralog is writing to a disk, after it has written a certain amount of information and at certain other times, it creates a "checkpoint" in the log. The purpose of a checkpoint is to make it unnecessary to go right back to the start of the log in order to carry out a recovery. This minimises the time needed to carry out the recovery. When a checkpoint is created, it does not interrupt normal writes to the disk.

How Does Spiralog Optimise Write Performance?

Several features give Spiralog its dramatic improvement in write performance over other file systems.
One of them is the log-structured design already described, which eliminates most disk head movement for writes, even when the writes are to different files.

Another new feature is the use of a cache for writes. In Files-11, writes are sent straight to disk. In contrast, by default Spiralog stores all data in the clerk's cache before it is actually written to the disk. This is called "write-behind" caching.

While Spiralog is holding the data in the cache for a short time, the application may make further changes to the data. Spiralog maintains the order in which changes are made and records how one change overwrites another. Before the clerk passes the data to the server, it consolidates the changes into one write. It then sends only the final result of each series of changes. This means that the server makes only one write where it might otherwise have made several.

For example, if a first write changes blocks 1 and 2 of a file, and a second write changes blocks 2 and 3 of the same file, amalgamating these writes results in only one value being written to each block. In the case of block 2, the value written in the block is the one assigned by the second write, and the first value is thrown away.

Spiralog can amalgamate changes to more than one file into a single write. In some cases, Spiralog does not need to write anything at all when the writes are consolidated. An example is when a file is created and then deleted before the corresponding data in the clerk's cache is made permanent on disk.

How Long Does It Take for Data to be Written to Disk?

When write-behind caching is used, a clerk attempts to send the data to the server within 30 seconds of completion of an application's write request. If the system fails, this means that the state of a volume may not reflect changes made in the last 30 seconds. In some cases an application may need data to be made permanent more quickly on the disk.
In this case, the data can be sent straight to the disk, as is done in Files-11; this is called "write-through" caching. The data is written on the disk before it appears in the cache. The programmer can also force Spiralog to flush the caches to disk even when write-behind caching is in use, or to make writes to a file occur in a particular order.

When write-through caching is used the performance benefits of write-behind caching are lost, but the application still gains from the enormous advantages of the log-structured file system. In addition, the data remains in the cache, and if it is subsequently read, there is a resulting advantage to the speed of the read.

Spiralog can give a file or application the caching it needs, write-through or write-behind, regardless of the caching used at the same time by other applications.

What Guarantees of Data Integrity Does Spiralog Make?

The Spiralog clerk provides "atomicity" for write and delete operations on data up to 64 kilobytes in size. That is, the data is written in its entirety to the disk or not at all, or a piece of data up to that size is deleted completely or not at all, even though the operation may actually involve several disk input-output operations. After a computer failure, the data on the disk is in a consistent state and does not represent a write or delete that failed part-way through.

As described above, if a server fails without acknowledging a write, the clerk can repeat the write with a new server. It makes no difference whether the previous write was successful; the outcome of the write is the intended one.

Spiralog makes use of the OpenVMS Distributed Lock Manager to ensure that only one clerk writes a particular item of data at one time, ensuring that a file is always in a consistent state.

How Does Spiralog Improve Read Performance?

To improve read performance in Spiralog, the clerk caches everything that is read, using a larger cache than in Files-11.
The clerk can "read ahead", anticipating the data needed during sequential reads and bringing it into cache before it is actually required.

The log-structured file system brings an advantage when two or more files are being read and written frequently, because the most recently written parts of both files are kept close together near the end of the log, in contrast to the "update-in-place" disk structure. This minimises the head movements required for reading the files. Spiralog does not have the same advantage for reading single files that are rarely written, but it gives an overall read performance comparable to that of Files-11.

How Does Performance Change with Increasing Number of Files?

In many conventional file systems, including Files-11 in OpenVMS, performance can degrade rapidly as the number of files increases. In contrast, the performance of Spiralog drops off slowly with the number of files, and greatly exceeds the performance of the conventional file systems when large disks (e.g., a terabyte RAID5 array) are used. This is a consequence of Spiralog's overall design, which takes full advantage of the fast directory structure based on depth-balanced trees (discussed above).

How Does Spiralog Work with Today's Systems?

Spiralog coexists with the present OpenVMS file system, Files-11, either in a single computer or in an OpenVMS Cluster. You can use Files-11 volumes alongside Spiralog volumes (in Spiralog version 1, the system disk must be a Files-11 disk). You can continue to use existing storage technologies, including RAID5 and shadow sets, with both Spiralog and Files-11.

How Does Spiralog Integrate With Other File Systems?

Spiralog provides multiple client support and transparent file services to PCs and other clients through the design of the clerk/client interface, allowing all clients to share the benefits of Spiralog. Their data can be stored together in Spiralog volumes, with no need to create partitions.
All files share the same directory hierarchy, so they are easily visible to all clients and can have multiple names. Each client sees what appears as a local file system (FAT for DOS, a bit stream for UNIX, and so on).

This support is provided at present for PC clients through the PATHWORKS product. Enhanced client support is being considered for the future.

Everything on a Spiralog volume is backed up and restored using a single procedure, so separate procedures are not needed for different types of client.

Does Spiralog Work with OpenVMS VAX?

On an OpenVMS Cluster containing both OpenVMS Alpha and OpenVMS VAX computers, users on OpenVMS VAX computers can have access to Spiralog through DECnet and similar network products. As Spiralog is an OpenVMS Alpha product, it does not run on an OpenVMS VAX computer itself.

How Is Data Moved to a Spiralog Volume?

The commands that the system manager uses to initialise and mount Spiralog volumes are integrated into the normal DCL command set. The procedure is slightly different depending on whether you intend a computer to serve the volume or not.

A Spiralog volume is actually a "container" file which occupies all the space on a Files-11 volume. When a Spiralog volume is mounted for the first time on a single computer or on an OpenVMS Cluster, the procedure is to initialise and mount the volume as a Files-11 volume, then initialise and mount it as a Spiralog volume. This creates a Spiralog server for the volume on the computer.

When the same Spiralog volume is mounted on another computer in an OpenVMS Cluster, it can be either for a clerk or for a standby server. To create a standby server for the volume, you mount the volume first as a Files-11 volume, then as a Spiralog volume. To mount the volume with a clerk only, you simply mount it as a Spiralog volume.

Once you have mounted the Spiralog volume, you use the normal DCL COPY and BACKUP commands to transfer data to it, exactly as for any other OpenVMS volume.
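
The three mount cases described above can be summarised as a small lookup, sketched here in Python. This is only a restatement of the procedure in the text: the `Role` names and the `mount_steps` function are hypothetical, and the step strings are plain descriptions, not DCL command syntax (the actual commands are not given in this sheet).

```python
from enum import Enum
from typing import List


class Role(Enum):
    """Hypothetical names for the three cases described in the text."""
    FIRST_MOUNT = "first mount of the volume (creates its server)"
    STANDBY_SERVER = "standby server on another cluster member"
    CLERK_ONLY = "clerk only on another cluster member"


def mount_steps(role: Role) -> List[str]:
    """Return the ordered steps for a role, as descriptions rather than commands."""
    if role is Role.FIRST_MOUNT:
        return [
            "initialise the volume as a Files-11 volume",
            "mount it as a Files-11 volume",
            "initialise it as a Spiralog volume",
            "mount it as a Spiralog volume",
        ]
    if role is Role.STANDBY_SERVER:
        return [
            "mount the volume as a Files-11 volume",
            "mount it as a Spiralog volume",
        ]
    # Role.CLERK_ONLY: the container's Files-11 volume is not mounted locally.
    return ["mount the volume as a Spiralog volume"]


for step_number, step in enumerate(mount_steps(Role.STANDBY_SERVER), start=1):
    print(step_number, step)
```

The point the lookup makes visible is that only computers that may serve the volume mount the underlying Files-11 volume; a clerk-only computer needs the Spiralog mount alone.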

Will Existing Applications Work with Spiralog?

Existing OpenVMS applications work without modification with Spiralog. The few exceptions are programs that make assumptions about the underlying design of the file system, such as disk repairers and defragmentation software.

To ensure that existing applications work unchanged, Spiralog supports all existing programming interfaces at the RMS level and above. All RMS, run-time library and high-level language interfaces work with Spiralog. It also supports all existing low-level QIO interfaces that are compatible with Spiralog's new design.

Can Applications Exploit Spiralog's New Features?

An application can use Spiralog's extensions to the RMS and QIO interfaces to exploit the new features offered by Spiralog. For example, it can use these extensions to control whether it uses write-behind caching when it opens a file for write access. There are new and powerful facilities to use the clerk's caches, including specifying the order of write and delete operations sent to the server.

Do Users Need to Learn New Commands?

There is a new backup/restore facility. Apart from that, users do not need to learn any new commands to use Spiralog. They use the same DCL commands as for the existing OpenVMS Alpha file system, Files-11. For example, TYPE shows the contents of a Spiralog file, and PRINT is the command that prints it.

There is an extension to the SET FILE command that allows the user to select the default caching option for a file when an application does not specify what caching it requires.

Further reading

The Spiralog document set:

  Spiralog User's Guide
  Spiralog System Manager's Guide
  Spiralog Programmer's Guide
  Spiralog Installation Guide
  Spiralog Release Notes

Copyright and Trade Mark Information

(c) 1995 Digital Equipment Corporation. All rights reserved.

Digital believes the information in this publication is accurate as of its publication date; such information is subject to change without notice. Digital is not responsible for any inadvertent errors.

Digital conducts its business in a manner that conserves the environment and protects the safety and health of its employees, customers, and the community.

The following are trademarks of Digital Equipment Corporation: Digital, the Digital logo, OpenVMS, Alpha, VAX and PATHWORKS.

UNIX is a registered trademark in the United States and other countries licensed exclusively through X/Open Company, Ltd.

--
---
Paul Randal - DEC OpenVMS File System Engineering
"NFS - Nightmare File System"
Tardis System Administration. URL=http://www.tardis.ed.ac.uk/