Article 138047 of comp.os.vms:
Path: nntpd.lkg.dec.com!nntpd2.cxo.dec.com!movies.enet.dec.com!randal
From: randal@movies.enet.dec.com (Paul Randal)
Newsgroups: comp.os.vms
Subject: Spiralog Technical Documentation (long)
Date: 18 Jan 1996 08:55:54 GMT
Organization: DEC OpenVMS Engineering
Lines: 621
Distribution: world
Message-ID: <4dl1uq$34u@nntpd2.cxo.dec.com>
Reply-To: cwells@movies.enet.dec.com (Clare Wells)
NNTP-Posting-Host: grip
Keywords: Spiralog
X-Newsreader: mxrn 6.18-16


I'm posting this for the Spiralog Product Manager:

===============================================================================
I have attached below a recent information sheet describing the Spiralog
technology. Other than the technical design document (available only under
non-disclosure), this nine-page document is the most technical description of
the technology currently available. It is also available in PostScript format
with supporting diagrams; if you require that format, please mail me at
clare.wells@edo.mts.dec.com.

I apologise for the delay in completing this document. Schedule constraints in
ensuring that we delivered Spiralog V1 on time at the end of the year meant that
we had to reschedule tasks that were not critical to delivery of the V1
product. Please note that Spiralog product documentation can be ordered via
normal channels.

Regards,

Clare F Wells
OpenVMS UK Product Management


        Digital's Spiralog File System for OpenVMS Alpha:
        Information Sheet

Spiralog is Digital's new file system for OpenVMS Alpha that provides advanced, 
high-performance file services on single computers and OpenVMS Clusters. This 
Information Sheet describes the technology of this innovative new software and 
how it benefits OpenVMS Alpha users.

The first release of Spiralog will be available early in 1996 as a separately 
ordered product running with OpenVMS Alpha Version 7.0. Rights to use the 
product will be part of the OpenVMS Alpha license.


        The Demands on File Systems Today

Every computer operating system includes a file system, the component that 
organises files on disk. The requirements of business computing today make ever 
higher demands on file systems. These demands include:

* The need to handle huge amounts of data, very large files and very
  large numbers of files. Disks of many gigabytes and "data farms" of
  terabytes are becoming very common. Increasingly demanding applications
  are continually pushing file systems to their performance limits.
  Examples include file servers such as PATHWORKS and technical
  applications such as image-processing, multimedia and finite-element 
  analysis.

* The requirement for uninterrupted service, 24 hours a day and 365
  days of the year, without down time for backups or system failures.

* Demands for better performance when users on different computers
  access the same data at the same time.

* The need to add extra computers to existing systems as an
  organisation's computing workload increases, without cost in
  performance.

* The need to use a single file system on a client-server network to
  save data from different clients, with each client seeing the data in
  its "native" format.

Existing file systems have limits intrinsic to their design that prevent them 
meeting these demands. For example, 32-bit addressing limits the handling of 
large data structures, and using tables for internal directories restricts 
performance with large numbers of files.

The challenge is to provide a file system that will cope with these demands 
well into the future. Digital's Spiralog is the file system for the new 
millennium that tackles these problems head on.
       

        What are the Advantages of Spiralog?

Spiralog is an innovative file system for OpenVMS Alpha. It runs on both 
OpenVMS Clusters and single computers. It offers the following advantages to 
OpenVMS Alpha users and system managers:

* Spiralog is much faster at writing data than Files-11, the existing
  OpenVMS Alpha file system. This is made possible by advanced caching
  technology and by eliminating most of the disk read/write head
  movements that slow down conventional write operations.

* It offers a new approach to disk storage including fast, online
  backup and rapid recovery of data. Spiralog achieves rates of backup
  that are close to the maximum device throughput.

* Incremental (partial) backups are as easy as full backups. Several
  incremental backups can be merged into one, independently of the
  original volume, so that files can be restored more quickly.

* You do not need to recompile or relink your existing OpenVMS Alpha
  applications to obtain the benefits of Spiralog.

* Spiralog distributes the processing load across the computers in an
  OpenVMS Cluster, making efficient use of the combined CPU power 
  available within the OpenVMS Cluster.

* It runs alongside Files-11, on the same computer or on different
  computers in an OpenVMS Cluster.

* Support for different client systems (UNIX and PC) is built into
  Spiralog.

* Spiralog works with existing storage technology without changes.

* To migrate your data, you just copy it from your existing Files-11
  volumes to your Spiralog volumes.

* Spiralog extends the familiar system service interface of OpenVMS,
  allowing programmers to make full use of the unique features of
  Spiralog's caching.

* Spiralog provides high availability and uninterrupted service to
  users when computers drop out or are taken out of the OpenVMS Cluster.


        What's New About Spiralog Technology?

Spiralog is a completely new design of file system in several ways. Four main 
advances in technology are summarised here, and discussed in more detail later:

* The Log-Structured File System

The method that Spiralog uses to write data on a disk is an enormous advance 
over that used by conventional file systems, including Files-11. Data is 
written to the disk in a sequential fashion regardless of which files it 
belongs to. The record that Spiralog makes is called a "log". Although other 
file systems make limited use of logs, Spiralog is unique in that both user 
data and the data required by the file system itself are stored in the log.

Newly created files are written to the end of the log, and when you update a 
file (for example, by changing its owner or by modifying some of the data in 
it) the updates are also written to the end of the log. Existing files on the 
disk are not overwritten. The advantages of this design are fast writes and 
very fast online backup.

* The Clerk-Server Model

Spiralog takes the best of the client-server model and applies it to OpenVMS 
Cluster computing as the "Clerk-Server Model". The client application deals 
directly with a clerk that runs on the client's own computer. The caching of 
data and all of the processing of the client application's file operations are 
the work of the clerk.

The server may be on any computer in the OpenVMS Cluster, and it alone carries 
out the reads and writes on a disk. The system manager can set up standby
servers on other computers to take over if the computer serving the volume 
fails. A computer can run both clerk and server components for a disk, and can
serve more than one disk.

The clerk-server model spreads the processing load well across an OpenVMS 
Cluster. It also offers high availability of the system, provides consistency 
of data across the whole cluster and preserves data integrity if a computer 
fails.

* Write Optimisation by Write-Behind Caching

Spiralog buffers all writes in the clerk's cache, both user data and the data 
required by the file system. This "write-behind caching", unique to Spiralog, 
improves the response times for applications.

Spiralog maintains the correct order of the writes while they are in the cache 
and combines multiple writes. This reduces the communications load.

* Architecture For Multiple File Systems

Spiralog is designed to work as a server for client programs running on 
different operating systems, for example, UNIX and PC. Developments in OpenVMS 
are under consideration to take further advantage of this core facility, which 
will enable users on different clients to use Spiralog as their file system and 
backup mechanism.


        What Is a Log-Structured Design?

In order to understand the heart of Spiralog, we first need to look at how 
conventional file systems like Files-11 work. The principle they are based on 
is called "update-in-place", because, when a file is updated, it remains in the 
same place on the disk surface.

When a file is updated, the disk read/write head must first move to the place 
containing information about the position of the file on the disk. When the 
file has been located, the head then moves to where the file is stored, and 
updates the data. A file which has been updated already, and has increased in 
size, may well be stored in several pieces in different places on the disk 
surface. Updating the file then requires several movements of the head.

For example, consider what happens in Files-11 when a file is stored in three 
blocks on the disk surface. If blocks 2 and 3 of the file are updated, the 
before and after layouts are the same. The read/write head has to move first to 
the index file to locate the file header, which stores information about the 
file, then to the file header to find the blocks where the file is stored. Then 
it makes separate moves to overwrite blocks 2 and 3 and returns to update the 
file header with the time of the update.

As much as 85% of the time in a conventional write operation is spent in moving 
the disk head over the surface to the required positions on the disk surface,
rather than in transferring the data itself.

The log-structured design of Spiralog eliminates most head movement. Instead of 
updating the existing file in the same place on the disk surface, Spiralog
simply writes the new data sequentially to the end of the log. Writing takes 
place at the rate that the head can move along the spiral disk track. Spiralog 
does not go back and overwrite the data that has been changed.

In the previous example using the Spiralog on-disk structure, when blocks 2 and 
3 and the file header have been updated, the layout of the data on the disk 
surface is not the same. The updated blocks of data and the header have been 
placed together at the end of the log, so the disk head needs only to move 
along that section of the disk track.
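
The three-block example can be modelled in a few lines of Python. This is
only a toy sketch of the append-only idea: the class ToyLog, its fields and
the file name are invented for illustration and say nothing about Spiralog's
real on-disk format.

```python
# Toy model of a log-structured volume (illustrative only).
class ToyLog:
    def __init__(self):
        self.records = []   # the log: records are only ever appended
        self.latest = {}    # (file, block) -> log position of the live copy

    def write(self, file, block, data):
        """Append a new version; the old copy is left where it is."""
        self.records.append((file, block, data))
        self.latest[(file, block)] = len(self.records) - 1

    def read(self, file, block):
        return self.records[self.latest[(file, block)]][2]

    def obsolete_positions(self):
        """Log positions whose contents have been superseded."""
        live = set(self.latest.values())
        return [i for i in range(len(self.records)) if i not in live]


log = ToyLog()
for block, data in [(1, "a1"), (2, "b1"), (3, "c1")]:
    log.write("report.txt", block, data)

# Update blocks 2 and 3: both new versions land together at the end of the log.
log.write("report.txt", 2, "b2")
log.write("report.txt", 3, "c2")

print(log.read("report.txt", 2))    # b2
print(log.obsolete_positions())     # [1, 2]
```

The old copies of blocks 2 and 3 stay at positions 1 and 2 of the toy log
and are simply no longer referenced.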


        What Does Spiralog Do with Updated Data?

When Spiralog updates an existing file, it writes the new data to the end of 
the log. This makes the original data, which remains in the log, obsolete. If 
Spiralog did nothing with this obsolete data, the disk would eventually fill 
up. To prevent this, if the obsolete data occupies a certain minimum space, 
that space is made available immediately for further use; otherwise Spiralog 
marks the obsolete data for later removal.

A component of Spiralog called "the cleaner" (garbage collector) reclaims the 
space occupied by marked data. Normally, the cleaner runs as a background job 
at times when the disk is not busy, so that it does not slow normal disk 
activities down. However, as the disk becomes more full, the cleaner's priority 
rises. Once the cleaner has removed the obsolete data, that region of the disk 
can be used for writing new data.

The cleaner uses an algorithm to decide whether to clean a region of obsolete 
data now, or whether to delay. This algorithm is chosen to minimise the time 
spent cleaning.
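
This sheet does not describe the cleaner's algorithm, so the following
Python sketch is purely hypothetical. It assumes a simple policy: a region is
worth cleaning when the fraction of live data that would have to be copied
out is small, and the tolerated fraction grows as the disk fills. The function
name, the region tuples and the threshold formula are all invented for
illustration.

```python
def regions_to_clean(regions, disk_fullness):
    """Pick regions to clean now, cheapest (least live data) first.

    regions:       dict of name -> (live_blocks, obsolete_blocks)
    disk_fullness: fraction of the disk in use, 0.0 to 1.0
    """
    # Assumed policy: accept more copying work as the disk fills up.
    max_live_fraction = 0.25 + 0.5 * disk_fullness
    chosen = []
    for name, (live, obsolete) in regions.items():
        if obsolete == 0:
            continue                      # nothing to reclaim here
        live_fraction = live / (live + obsolete)
        if live_fraction <= max_live_fraction:
            chosen.append((live_fraction, name))
    return [name for _, name in sorted(chosen)]


regions = {"r0": (10, 90), "r1": (60, 40), "r2": (100, 0), "r3": (40, 60)}
quiet = regions_to_clean(regions, disk_fullness=0.2)
nearly_full = regions_to_clean(regions, disk_fullness=0.9)
print(quiet)         # ['r0']
print(nearly_full)   # ['r0', 'r3', 'r1']
```

On a mostly empty disk only the cheapest region is cleaned and the others
are delayed; on a nearly full disk more regions qualify.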


        Does Spiralog Need a Defragmenter?

The cleaner itself does an excellent job of ensuring that the disk is not 
fragmented. However, files in Spiralog may become fragmented. A file can be 
defragmented simply by copying it, which ensures that the file is written 
contiguously to the end of the log. Alternatively, a file defragmentation 
facility may be used.


        How Does Spiralog Keep Track of File Positions on Disk?

Like all file systems, Spiralog keeps track of the disk blocks used to store a 
file, so that it can quickly read the file. Spiralog stores this information in 
a data structure called a depth-balanced tree (B-tree). Many databases use this 
data structure because it is fast and efficient and because it scales up well 
to larger sizes of database.

The B-tree is kept in the Spiralog volume itself, so that it benefits from the 
log-structured design of the Spiralog disk as much as any other data does. 
Because Spiralog caches everything it reads and writes, most of the B-tree is 
held in the cache, so that Spiralog may not have to read the disk at all to 
locate the file, or at most it will require one disk read to find it.

The result is that the performance of Spiralog is good both with volumes that 
contain large numbers of files and with those that contain only a few very 
large files. Performance falls off slowly with large numbers of files, in 
contrast to Files-11.


        Why is Spiralog's Backup So Fast?

Spiralog's backup performance is far superior to anything currently available. 
Backup exploits the log-structured design by reading the data sequentially and 
copying it directly from disk to disk or tape, without any need to interpret 
the data to discover the files to which each item of data belongs. This copying 
requires much less CPU time than the file-by-file backup used by conventional 
file systems.

Spiralog's backup is similar to the Files-11 physical backup of a whole disk, 
but only the log, not the whole disk, needs to be copied. The backup medium 
does not have to be a Spiralog volume. Spiralog saves directory information 
with the backup, so you can quickly restore individual files or directories.

In a conventional file system during a write, data can be written to any part 
of the disk. It is usually necessary to stop writing during a backup, as 
otherwise the different parts of the backup will not be consistent with one 
another. For example, part of the backup of a file may represent the state of 
the file before a write, and another part of the backup of the same file may 
correspond to the same file after the write. To avoid this problem of 
conventional file systems, volumes are taken offline before backup.

Spiralog writes changed data only to the end of the log, and the rest of the 
data on the disk is not changed during a write. Therefore, Spiralog can back up 
all the data in the log before a particular point, even while more data is 
being written to the end of the log. Backup does not require taking the volume 
offline, so there is no interruption to normal business.

The backed-up volume represents a "snapshot" of the disk at the time of the 
backup. When the backup volume is used to restore data, the files are restored 
in precisely the state they were in at the backup time.
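
The reason an online backup stays consistent can be shown with a toy
Python model. The log is a plain list, and the function and variable names
are assumptions made for this sketch, not Spiralog interfaces.

```python
def backup_before(log, end, during_copy=None):
    """Copy log records [0, end) sequentially.

    during_copy, if given, is called after each record is copied to
    simulate an application writing while the backup is running.
    """
    copied = []
    for pos in range(end):
        copied.append(log[pos])
        if during_copy is not None:
            during_copy(pos)
    return copied


volume_log = ["v1 of A", "v1 of B", "v2 of A"]
snapshot_point = len(volume_log)      # backup covers everything before here

def application_write(pos):
    # New data only ever goes to the end of the log.
    volume_log.append(f"written during backup step {pos}")

snapshot = backup_before(volume_log, snapshot_point, application_write)
print(snapshot)                          # ['v1 of A', 'v1 of B', 'v2 of A']
print(len(volume_log) - snapshot_point)  # 3
```

The three records appended during the copy lie after the snapshot point,
so they cannot disturb the records already being backed up.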


        How Does Spiralog Do Incremental Backups?

Because a point in the Spiralog log represents a particular point in time, it 
is possible to make a backup of all the changes made since that time, by just 
copying all the data after that point. For example, you could do a full backup 
on Monday and then on Tuesday do an incremental backup of all the changes since 
the full one. Incremental backup in Spiralog offers a particularly great 
advantage in speed over Files-11, because there is no need to search 
directories or read file headers to find out what has changed. Incremental 
backup, like full backup, can be done online without interruption to service.

Spiralog provides for backups to be made at up to 128 different levels, as in 
the UNIX dump command, where level 0 is a full backup, and the other levels 
represent partial backups. A backup at a particular level saves all changes 
made since the last backup at a lower level. You can use this to define a 
schedule of full and incremental backups covering different time periods. The 
advantage of this is that you can balance the time taken to do backups and the 
space taken up by the backups themselves against the time needed to restore 
data.

For example, the Monday backup could be a full one, then Tuesday's could be a 
backup of all changes in the last day, Wednesday's a backup of all changes in 
the last two days, Thursday's a backup of the last three days and so on. Then 
at most two backups are needed to restore data.
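
The level rule can be sketched in Python. The code assumes only what is
stated above (a backup saves changes since the last backup at a lower level,
and level 0 is a full backup); the function names and the list-of-tuples
history are invented for illustration.

```python
def base_for(history, level):
    """Index of the most recent backup at a lower level, or None."""
    for i in range(len(history) - 1, -1, -1):
        if history[i][1] < level:
            return i
    return None


def restore_chain(history):
    """Names of the backups needed to restore the latest state, oldest first."""
    chain = []
    i = len(history) - 1
    while i is not None:
        name, level = history[i]
        chain.append(name)
        if level == 0:
            break                     # reached a full backup
        i = base_for(history[:i], level)
    return chain[::-1]


# The schedule described above: every incremental covers "since Monday".
since_monday = [("Mon", 0), ("Tue", 1), ("Wed", 1), ("Thu", 1)]
# A different schedule: each incremental covers one day only.
daily = [("Mon", 0), ("Tue", 1), ("Wed", 2), ("Thu", 3)]

print(restore_chain(since_monday))   # ['Mon', 'Thu']
print(restore_chain(daily))          # ['Mon', 'Tue', 'Wed', 'Thu']
```

The second schedule makes smaller backups each day but needs more of them
at restore time, which is the trade-off the levels let you choose.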

Spiralog also enables you to make unscheduled backups that do not affect the 
backup schedule. For example, on Wednesday you may make an unscheduled full 
backup, but the Thursday backup will cover the period since Monday as 
determined by the original schedule.


        What Are the Advantages of Merging Backups?

A number of backups made at different times can be merged into a single backup. 
This saves time when the backup is used to restore data, and also economises on 
the use of offline storage media.

For example, Monday's full backup and the partial backups on Tuesday and 
Wednesday can be merged into a single one. This produces exactly the same 
backup as if you did a full unscheduled backup on Wednesday.

Merging is carried out independently of the original volume, which can continue 
to be used online, so merging does not affect any applications using that 
volume. The merging on Wednesday in the example just discussed does not require 
the original volume to be online at the time it is done.

You can merge backups offline, on any computer in any OpenVMS Cluster that is 
running Spiralog, even if no Spiralog volumes are mounted.
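
As a rough illustration of why a merge needs only the backups and not the
original volume, each backup below is modelled as a Python dict of file name
to contents. This is an assumption made for the sketch; real backups also
carry directory information and deletions, which are ignored here.

```python
def merge_backups(backups):
    """Merge backups given oldest first; later contents win."""
    merged = {}
    for backup in backups:
        merged.update(backup)
    return merged


monday_full = {"a.txt": "a0", "b.txt": "b0"}
tuesday_incr = {"a.txt": "a1"}                   # a.txt changed on Tuesday
wednesday_incr = {"a.txt": "a2", "c.txt": "c0"}  # changes since Monday

merged = merge_backups([monday_full, tuesday_incr, wednesday_incr])
print(merged)   # {'a.txt': 'a2', 'b.txt': 'b0', 'c.txt': 'c0'}
```

The result holds the same contents that a full backup taken on Wednesday
would hold, and only the three backup sets were read to produce it.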


        How Do Standby Servers Work?

Spiralog offers high availability in part because the clerk-server model 
allows standby servers to be installed for each Spiralog volume. Each of the 
standby servers is on a different computer in the OpenVMS Cluster, as specified 
by the system manager. Only one server serves the volume at any one time, 
ensuring that all applications running in the OpenVMS Cluster see the same 
data. If the computer on which the server is running fails, Spiralog switches 
to a standby server without interrupting service. This is called "failover".

When failover occurs, clerks may have sent data to the server for writing to 
the disk. Normally, the server confirms to the clerk that the data has actually 
been made permanent on disk. If the server's computer fails during a write, 
this acknowledgement may not be sent. The unique design of Spiralog's interface 
between the clerk and the server ensures that this does not matter. The clerk 
can send the same data to the new server, confident that the written file will 
be identical whether the failed server had succeeded in writing the data
or not.
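
The property described here is that a repeated write is harmless. The
actual clerk-server interface is not documented in this sheet; the Python
sketch below assumes one simple way to get that property, namely writes that
name an absolute block and carry its complete new value. ToyServer,
clerk_write and the crash modes are hypothetical.

```python
class ToyServer:
    def __init__(self, disk, crash=None):
        self.disk = disk      # dict shared by all servers for the volume
        self.crash = crash    # None, "before_write" or "after_write"

    def write(self, file, block, data):
        if self.crash == "before_write":
            raise ConnectionError("server failed before writing")
        self.disk[(file, block)] = data   # absolute position, complete value
        if self.crash == "after_write":
            raise ConnectionError("server failed before acknowledging")
        return "ack"


def clerk_write(servers, file, block, data):
    """Send the write to each server in turn until one acknowledges it."""
    for server in servers:
        try:
            return server.write(file, block, data)
        except ConnectionError:
            continue              # fail over and resend the same data
    raise RuntimeError("no server available")


def outcome(crash_mode):
    disk = {}
    servers = [ToyServer(disk, crash_mode), ToyServer(disk)]
    ack = clerk_write(servers, "f.dat", 2, "new")
    return ack, disk


print(outcome("before_write"))   # ('ack', {('f.dat', 2): 'new'})
print(outcome("after_write"))    # ('ack', {('f.dat', 2): 'new'})
```

Whether or not the failed server got as far as writing, the disk ends up
in the same state after the clerk resends to the standby.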

The system manager can arrange failover for other purposes, for example, taking 
a computer out of service.


        How Does Spiralog Recover After System Failure?

In a log-structured file system, any point in the log represents a particular 
moment in time. If a clerk is running on a computer that fails, the data in the 
log up to that point can be used to reconstruct the state of all the files at 
that moment so that the clerk can continue. Recovery can quickly be made to a 
consistent state, with the loss only of the data that was still in the clerk's 
cache and had not been written to disk at the time of failure.

When Spiralog is writing to a disk, after it has written a certain amount of 
information and at certain other times, it creates a "checkpoint" in the log. 
The purpose of a checkpoint is to make it unnecessary to go right back to the 
start of the log in order to carry out a recovery. This minimises the time 
needed to carry out the recovery. When a checkpoint is created, it does not 
interrupt normal writes to the disk.
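
A toy Python model shows how a checkpoint shortens recovery. Here the log
is a list of (key, value) updates and a checkpoint is a log position plus the
state at that position; these structures and the function name are
assumptions for the sketch, not Spiralog's format.

```python
def recover(log, checkpoints):
    """Rebuild state from the last checkpoint, or from the start of the log.

    Returns the recovered state and the number of log records replayed.
    """
    if checkpoints:
        position, saved_state = checkpoints[-1]
        state = dict(saved_state)
    else:
        position, state = 0, {}
    replayed = 0
    for key, value in log[position:]:
        state[key] = value
        replayed += 1
    return state, replayed


log = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 2)]
checkpoints = [(3, {"a": 2, "b": 1})]   # taken after the first three records

with_checkpoint = recover(log, checkpoints)
without_checkpoint = recover(log, [])
print(with_checkpoint)      # ({'a': 2, 'b': 2, 'c': 1}, 2)
print(without_checkpoint)   # ({'a': 2, 'b': 2, 'c': 1}, 5)
```

Both runs reach the same state, but the run that starts from the
checkpoint replays two records instead of five.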


        How Does Spiralog Optimise Write Performance?

Several features give Spiralog its dramatic improvement in write performance 
over other file systems. One of them is the log-structured design already 
described, which eliminates disk head movement for writes, even when the writes 
are to different files.

Another new feature is the use of a cache for writes. In Files-11, writes are 
sent straight to disk. In contrast, by default Spiralog holds all written data 
in the clerk's cache before it is actually written to the disk. This is called 
"write-behind" caching.

While Spiralog is holding the data in the cache for a short time, the 
application may make further changes to the data. Spiralog maintains the order 
in which changes are made and records how one change overwrites another. Before 
the clerk passes the data to the server, it consolidates the changes into one 
write. It then sends only the final result of each series of changes. This 
means that the server makes only one write where it might otherwise have
made several.

For example, suppose that a first write changes blocks 1 and 2 of a file, and a 
second write changes blocks 2 and 3 of the same file. When these writes are 
amalgamated, only one value is written to each block. In the case of block 2, 
the value written is the one assigned by the second write, and the first value 
is thrown away. Spiralog can amalgamate changes to more than one file into a 
single write.

In some cases, Spiralog does not need to write anything at all when the writes 
are consolidated. An example is when a file is created and then deleted before 
the corresponding data in the clerk's cache is made permanent on disk.
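
Both examples (the overlapping writes and the file that is created and
then deleted) can be reproduced with a small Python model of a write-behind
cache. ToyClerkCache and its methods are invented for illustration and do not
represent the real clerk.

```python
class ToyClerkCache:
    def __init__(self):
        self.dirty = {}          # (file, block) -> newest data not yet sent
        self.new_files = set()   # created in cache, unknown to the server
        self.deletes = set()     # deletions of files the server already has

    def create(self, file):
        self.new_files.add(file)

    def write(self, file, blocks):
        for block, data in blocks.items():
            self.dirty[(file, block)] = data    # later write replaces earlier

    def delete(self, file):
        self.dirty = {k: v for k, v in self.dirty.items() if k[0] != file}
        if file in self.new_files:
            self.new_files.remove(file)   # never reached the server
        else:
            self.deletes.add(file)

    def flush(self):
        """Return the single consolidated batch for the server, then reset."""
        batch = {"create": sorted(self.new_files),
                 "write": dict(self.dirty),
                 "delete": sorted(self.deletes)}
        self.dirty, self.new_files, self.deletes = {}, set(), set()
        return batch


cache = ToyClerkCache()
cache.write("data.dat", {1: "w1-b1", 2: "w1-b2"})   # first write
cache.write("data.dat", {2: "w2-b2", 3: "w2-b3"})   # second write
first_batch = cache.flush()
print(first_batch["write"])
# {('data.dat', 1): 'w1-b1', ('data.dat', 2): 'w2-b2', ('data.dat', 3): 'w2-b3'}

cache.create("scratch.tmp")
cache.write("scratch.tmp", {1: "temporary"})
cache.delete("scratch.tmp")
second_batch = cache.flush()
print(second_batch)   # {'create': [], 'write': {}, 'delete': []}
```

The first batch carries one value per block, with block 2 holding the
second write's value; the second batch is empty because the scratch file
never needed to reach the disk.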


        How Long Does It Take for Data to be Written to Disk?

When write-behind caching is used, a clerk attempts to send the data to the 
server within 30 seconds of completion of an application's write request. If 
the system fails, this means that the state of a volume may not reflect changes
made in the last 30 seconds.

In some cases an application may need data to be made permanent more quickly on 
the disk. In this case, the data can be sent straight to the disk, as is done in
Files-11; this is called "write-through" caching. The data is written on the 
disk before it appears in the cache. The programmer can also force Spiralog to 
flush the caches to disk even when write-behind caching is in use, or to make 
writes to a file occur in a particular order.

When write-through caching is used the performance benefits of write-behind 
caching are lost, but the application still gains from the enormous advantages 
of the log-structured file system. In addition, the data remains in the cache, 
and if it is subsequently read, there is a resulting advantage to the speed of 
the read.

Spiralog can give a file or application the caching it needs, write-through or 
write-behind, regardless of the caching used at the same time by other 
applications.


        What Guarantees of Data Integrity Does Spiralog Make?

The Spiralog clerk provides "atomicity" for write and delete operations on data 
up to 64 kilobytes in size. That is, the data is written in its entirety to the 
disk or not at all, or a piece of data up to that size is deleted completely or 
not at all, even though the operation may actually involve several disk input-
output operations. After a computer failure, the data on the disk is in a 
consistent state and does not represent a write or delete that failed part-way 
through.

As described above, if a server fails without acknowledging a write, the clerk 
can repeat the write with a new server. It makes no difference whether the 
previous write was successful; the outcome of the write is the intended one.

Spiralog makes use of the OpenVMS Distributed Lock Manager to ensure that only 
one clerk writes a particular item of data at one time, ensuring that a file is
always in a consistent state.


        How Does Spiralog Improve Read Performance?

To improve read performance in Spiralog, the clerk caches everything that is 
read, using a larger cache than in Files-11. The clerk can "read ahead", 
anticipating the data needed during sequential reads and bringing it into cache
before it is actually required.
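
Read-ahead can be sketched in Python as follows. The detection rule (a
read of block n directly after block n-1 counts as sequential) and the window
of two blocks are assumptions chosen for the example; the sheet does not say
how the clerk decides what to prefetch.

```python
class ReadAheadCache:
    def __init__(self, disk, window=2):
        self.disk = disk          # dict of block number -> data
        self.window = window      # how many blocks to prefetch
        self.cache = {}
        self.last_block = None

    def _fetch(self, block):
        if block in self.disk and block not in self.cache:
            self.cache[block] = self.disk[block]

    def read(self, block):
        """Return (data, was_cache_hit)."""
        hit = block in self.cache
        self._fetch(block)
        if self.last_block is not None and block == self.last_block + 1:
            # Looks sequential: bring the next few blocks in early.
            for ahead in range(block + 1, block + 1 + self.window):
                self._fetch(ahead)
        self.last_block = block
        return self.cache[block], hit


disk = {n: f"blk{n}" for n in range(6)}
reader = ReadAheadCache(disk)
hits = [reader.read(n)[1] for n in range(4)]
print(hits)   # [False, False, True, True]
```

The first two reads miss, after which the sequential pattern is detected
and later blocks are already in the cache when they are requested.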

The log-structured file system brings an advantage when two or more files are 
being read and written frequently, because the most recently written parts of 
both files are kept close together near the end of the log, in contrast to the 
"update-in-place" disk structure. This minimises the head movements required for 
reading the files. Spiralog does not have the same advantage for reading single 
files that are rarely written, but it gives an overall read performance 
comparable to that of Files-11.


        How Does Performance Change with Increasing Number of Files?

In many conventional file systems, including Files-11 in OpenVMS, performance 
can degrade rapidly as the number of files increases. In contrast, the 
performance of Spiralog drops off slowly with the number of files, and greatly 
exceeds the performance of the conventional file systems when large disks 
(e.g., a terabyte RAID5 array) are used. This is a consequence of Spiralog's 
overall design, which takes full advantage of the fast directory structure 
based on depth-balanced trees (discussed above).


        How Does Spiralog Work with Today's Systems?

Spiralog coexists with the present OpenVMS file system, Files-11, either in a 
single computer or in an OpenVMS Cluster. You can use Files-11 volumes 
alongside Spiralog volumes (in Spiralog version 1, the system disk must be a 
Files-11 disk). You can continue to use existing storage technologies, 
including RAID5 and shadow sets, with both Spiralog and Files-11.


        How Does Spiralog Integrate With Other File Systems?

Spiralog provides multiple client support and transparent file services to PCs 
and other clients through the design of the clerk/client interface, allowing 
all clients to share the benefits of Spiralog. Their data can be stored 
together in Spiralog volumes, with no need to create partitions. All files 
share the same directory hierarchy, so they are easily visible to every 
client and can have multiple names. Each client sees what appears as a local 
file system (FAT for DOS, a bit stream for UNIX, and so on). This support is 
provided at present for PC clients through the PATHWORKS product. Enhanced 
client support is being considered for the future.

Everything on a Spiralog volume is backed up and restored using a single 
procedure, so separate procedures are not needed for different types of client.


        Does Spiralog Work with OpenVMS VAX?

On an OpenVMS Cluster containing both OpenVMS Alpha and OpenVMS VAX computers,
users on OpenVMS VAX computers can have access to Spiralog through DECnet and
similar network products. As Spiralog is an OpenVMS Alpha product, it does not 
run on an OpenVMS VAX computer itself.


        How Is Data Moved to a Spiralog Volume?

The commands that the system manager uses to initialise and mount Spiralog 
volumes are integrated into the normal DCL command set. The procedure is 
slightly different depending on whether you intend a computer to serve the 
volume or not.

A Spiralog volume is actually a "container" file which occupies all the space 
on a Files-11 volume. When a Spiralog volume is mounted for the first time on a 
single computer or on an OpenVMS Cluster, the procedure is to initialise and 
mount the volume as a Files-11 volume, then initialise and mount it as a 
Spiralog volume. This creates a Spiralog server for the volume on the computer.

When the same Spiralog volume is mounted on another computer in an OpenVMS 
Cluster, it can be either for a clerk or for a standby server. To create a 
standby server for the volume, you mount the volume first as a Files-11 volume,
then as a Spiralog volume. To mount the volume with a clerk only, you
simply mount it as a Spiralog volume.

Once you have mounted the Spiralog volume, you use the normal DCL COPY and 
BACKUP commands to transfer data to it, exactly as for any other OpenVMS
volume.


        Will Existing Applications Work with Spiralog?

Existing OpenVMS applications work without modification with Spiralog. The few
exceptions are programs that make assumptions about the underlying design of 
the file system, such as disk repairers and defragmentation software.

To ensure that existing applications work unchanged, Spiralog supports all 
existing programming interfaces at the RMS level and above. All RMS, run-time 
library and high-level language interfaces work with Spiralog. It also supports 
all existing low-level QIO interfaces that are compatible with Spiralog's new 
design.


        Can Applications Exploit Spiralog's New Features?

An application can use Spiralog's extensions to the RMS and QIO interfaces to 
exploit the new features offered by Spiralog. For example, it can control 
whether it uses write-behind caching when it opens a file for write access. 
There are also powerful new facilities for using the clerk's caches, including 
specifying the order of write and delete operations sent to the server.


        Do Users Need to Learn New Commands?

There is a new backup/restore facility. Apart from that, users do not need to 
learn any new commands to use Spiralog. They use the same DCL commands as for 
the existing OpenVMS Alpha file system, Files-11. For example, TYPE shows the 
contents of a Spiralog file, and PRINT is the command that prints it. There is 
an extension to the SET FILE command that allows the user to select the
default caching option for a file when an application does not specify
what caching it requires.


        Further reading

The Spiralog document set:
        Spiralog User's Guide
        Spiralog System Manager's Guide
        Spiralog Programmer's Guide
        Spiralog Installation Guide
        Spiralog Release Notes


        Copyright and Trade Mark Information

(c) 1995 Digital Equipment Corporation. All rights reserved.

Digital believes the information in this publication is accurate as of its 
publication date; such information is subject to change without notice. Digital 
is not responsible for any inadvertent errors.

Digital conducts its business in a manner that conserves the environment and 
protects the safety and health of its employees, customers, and the community.

The following are trademarks of Digital Equipment Corporation: Digital, the 
Digital logo, OpenVMS, Alpha, VAX and PATHWORKS.

UNIX is a registered trademark in the United States and other countries 
licensed exclusively through X/Open Company, Ltd.


-- 
---
Paul Randal - DEC OpenVMS File System Engineering  "NFS - Nightmare
              Tardis System Administration.              File System"
		URL=http://www.tardis.ed.ac.uk/