From: Bill Todd [billtodd@foo.mv.com]
Sent: Thursday, July 01, 1999 4:04 PM
To: GlennEverhart@FirstUSA.com
Subject: Re: CFS

Hi -

I'll try to dig up the Safety material - not having had anything to do with VMS for the past dozen years, I had no idea what Safety was. Moved your message over on a floppy from another system, so I'll set off comments with *****:

***** The extra authentication stuff is not widely used now, mainly because nothing makes it simple to do. The trusted subsystem stuff in VMS gives a kind of check by imagename, but is not so simple to administer, *****

Bringing something as system-specific as an imagename into what I'd like to be a heterogeneously-sharable file system does not appeal to me if I can reasonably avoid it: I'd rather use existing, generic facilities and, for example, have the host system extend a user's identity temporarily (external to the common file system code) by adding membership in one or more special groups that hold the necessary extended access rights.

***** and VMS tries to handle time of day by not allowing login. That misses the fact that I may be willing to let folks read their mail at 2AM but not get to customer lists, and that giving everyone multiple accounts quickly becomes infeasible. *****

I don't have much problem letting ACEs limit access by (cluster host) time of day (though with the increasing reach of SANs this may differ across cluster members - but that's going to have to be defined as not the file system's problem); as with all on-disk time values, I suspect it should be recorded as GMT (or UTC, or whatever it's called these days) and converted as required. There's no real overhead unless you use the feature, and it's relatively clear-cut to define and easy to implement - as long as you don't want an existing access to be cut off when the specified time window runs out. Even that *might* not be all that hard to do, as long as there was a bit of leeway to let low-level in-progress operations complete first - or, perhaps easier, just check the cut-off time (if one existed) before starting any user-level operation, though lengthy operations such as CopyFile could present problems here.

***** Safety has a file hiding facility. You hide a file "behind" another file, which works just like unix softlinks. I made it possible to do this conditionally on access checks though. *****

As I said, I'm not fond of this one. Not being a Unix heavy, I didn't realize that a 'softlink' (the same as what I'd call a 'symbolic link'?) could hide an existing file, rather than just being a file whose sole content was a pointer to somewhere else. Even if Unix supports such a mechanism, I'm not sure I would choose to: I think of object identity (in a file system, file identity) as something close to sacred - even more so if the object is accessible by multiple paths, such as via the attribute-based look-up functions you mentioned.

***** I attended a walkthrough of an I/O which included RMS. There is a LOT of code there which takes lots of cycles. My point here is that the OS overhead of an RT11 type I/O system is far less for normal operation. *****

First of all, it takes 100K instructions (plus a lot of on-chip cache misses) on an Alpha to have a noticeable effect on the performance of an actual disk access: if RMS comes anywhere near adding such overhead to the cost of reading data from an accessed (non-shared) file, someone should be shot - but I seriously doubt that that is the case.
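To put rough numbers on that claim (my own round figures, purely for illustration, not measurements): a late-1990s Alpha runs at a few hundred MHz and retires something on the order of a few hundred million instructions per second, so 100K instructions cost roughly 200-500 microseconds even allowing for cache misses, while a random disk access costs on the order of 10 milliseconds of seek and rotational latency. So even a 100K-instruction software path adds only a few percent to the cost of each actual disk access - which is why that's about the threshold at which the overhead becomes noticeable at all.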
Once the disk access has completed, RMS obviously does add overhead to each individual operation on the buffered data. If the buffer is large and the accesses are fine-grained, this overhead can add up to something significant compared to the cost of getting the data off the disk. However, the question then becomes whether this overhead is large compared to the per-operation processing overhead in the application that is using RMS.

As I said, it's been over 12 years since I had even tangential contact with RMS-32. What I will say is that, while there's no way you're ever going to get a full-functioned, clustered, multi-user-protected environment like OpenVMS down to RT-11-like code-path lengths, you can certainly get paths short enough to run faster than RT-11 ever did, given the hardware advances of the past decade - and this doesn't require eliminating the RMS-like layer, provided that layer is laid out with an eye toward efficiency (which I'm reasonably sure at least used to be true of RMS-32).

If, as you assert later, few programs choose to use the low-overhead RMS facilities that are available, that's not RMS's fault: either these programs are poorly conceived, or they're getting some benefit out of the RMS facilities that they'd otherwise have to create for themselves. (And I don't believe RMS has any 'defaults' in such areas: it just does whatever higher-level software tells it to in choosing between block I/O or 'stream' access - i.e., modes similar to the RT-11 mechanisms you're comparing it with - and record-oriented access.) The only possibly inappropriate 'default' I can think of would be if RMS defaulted all file access to write-sharing (which would surprise me), since that would cause locking overhead typically inappropriate for the uses you refer to - though, of course, the applications could easily specify non-shared access explicitly. So those are not 'layering' issues, and I'm not convinced that layering is an issue elsewhere either: the main reason it sometimes was in the PDP-11 environment was address-space limitations that just don't apply nowadays, coupled with processors three orders of magnitude slower than today's, which could make even moderate differences in code path lengths significant.

***** As for the extended directory attributes, the scheme is not to build all the selections ahead of time into a structure. Rather it is to exploit the notion that when looking up a file, you provide a filename and get an index file index to the actual file (a file ID). If instead of flat files for directories you put that info into a relation, the relation can have as many indices as you want and be searched as you want. The file ID still gets returned, but the selections done can have additional ones dependent on what a user or site wants. *****

As I said, the concept is simple: it's the execution and use that isn't. Let's start by assuming that you're not talking about building the entire directory structure for the system into a single relation: that screws up the per-directory access-control (and inheritance) semantics we're all used to. Making each individual directory into a b-tree, however, is reasonable, and is noticeably slower only when following paths whose disk blocks are not cache-resident. So now the question becomes whether you're simply saying that, instead of a b-tree, each directory should be a multi-keyed relation.
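To make sure I'm picturing the same thing you are, here's a rough sketch in C of what that might look like - every name, field, and size here is invented for illustration, not anything VMS (or any real file system) actually defines:

    /* Hypothetical layout: one directory = one small relation.  An
       entry is a row; the name index gives the classic look-up, and
       optional secondary indices support attribute-based searches
       that still just hand back file IDs. */

    struct btree;                    /* index structure, details elided */

    struct dir_entry {               /* one row of the relation */
        char          name[80];      /* primary key: classic name look-up */
        unsigned long file_id;       /* what every look-up finally returns */
        unsigned long owner;         /* candidate secondary index */
        unsigned long mod_time;      /* candidate secondary index (GMT) */
        /* ... whatever other attributes a site chooses to index ... */
    };

    struct directory {
        struct btree *by_name;       /* always present */
        struct btree *by_owner;      /* only if the site asks for it */
        struct btree *by_mod_time;   /* ditto */
    };

The name index alone gets you today's behavior; the optional secondary indices are the part whose value is in question.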
Given the size of a typical directory, I'd question the value of that: a full directory scan, even one that checks attributes of individual files, is usually fairly cheap. (Such scans should be performance-optimized anyway by attempting to 'cluster' the file headers on disk and access them in groups, since simple directory-list operations often return header-resident attribute information as well - and placing that information in the directory itself, as NTFS does, is not the way to go, IMO.) So while I think it would be reasonable to be able to request a list of all files in a given directory with certain attributes, I don't see much value in creating separate indexes to support this.

Now if you want to be able to scan for specified attributes on something *other* than a single-directory basis, things get more complex. Should the scan be limited to files owned by a particular user? To a specified sub-tree of the directory structure? As I said previously, in a system of any size there needs to be *some* constraint: indexing file attributes across the entire file system is usually not useful, so I'd be interested in hearing how you believe it should be done. Lacking other inspiration, I'd be inclined to support attribute-scanning only within individual directories - but to let that be extended by specifying directory wild-cards, such that a single scan could include multiple directories, each scanned in sequence for files with matching attributes.

And regardless of how such an attribute-based look-up is performed, one has to define isolation semantics for scans - just as they must be defined when processing any database query. The current wildcard name-based directory look-ups I'm familiar with offer 'degree 2' isolation ('cursor stability'): if a new entry matching the wildcard specification is inserted after the wildcard scan has passed its insertion point, it will be missed by the scan (unless the scan is operating within the larger context of a user-specified transaction with stricter isolation requirements, though no systems I'm acquainted with support this). If attribute-based scans occur only within a single directory, then perhaps they should work the same way; if they occur across directories, then some kind of search order must be established if they are to work the same way; and if they are implemented by building a list and then processing it, there will be no guarantees at all - just as there wouldn't be if name wild-carding were performed that way: by the time a list entry got processed, it could have disappeared or changed completely.

***** Making a DBMS that runs well enough in a cluster or a galaxy to handle this sort of thing is of course part of the challenge of building this kind of system. *****

Funny you should mention that: one of the reasons I'm interested in doing a new cluster FS and associated DLM from scratch is that there are some really interesting things one can do to optimize b-tree operations and associated page/record locking. It's possible that they could make existing databases (and RMS, for that matter) look pretty slow...

***** Thanks for your input though. Glad my message got at least one person thinking... it's good to see that. *****

I'm afraid I was already thinking, but *I'm* glad to see that others are as well.

***** How are things in NH by the way? (I used mv as my provider when I lived there.) *****

I thought your name sounded awfully familiar, though I couldn't place it firmly. I'm not an MV employee, they're just my ISP.
For some reason, IE 5 (I'm not sure whether IE 4 did this or not - I just upgraded rather than bother bringing IE 4 up to the current patch level) chooses to place one's service provider in the news (but not mail...) org field if you leave the org field empty.

NH isn't all that different, though it shows modest signs of entering the 20th century before it is completely over. Still, the only reason we have a woman Democrat as governor is the unmistakably off-the-wall nature of her opponent.

***** This kind of thing would have been extremely interesting to do, but when I was at DEC it was all in Scotland, and I needed to get back to Delaware for family reasons anyhow. I left some of these ideas behind. It'd be nice to see some of them implemented. *****

Yeah, and now the group in Scotland has been eliminated, I hear. Given DEC's (and now Compaq's) track record, I don't have a great deal of hope of anything interesting happening in these areas. The last I knew, Andy Goldstein wasn't interested in making any waves, and I'm not sure I can blame him, given the climate. It surprises me how many good people are still there, but I suspect a lot of them (and I *know* some of them) just feel they can't afford to leave this close to retirement.

***** BTW I have code for a much more complete VMS softlink facility if you want to look... *****

I'd love to - though, as I said above, the hiding part tends to bother me.

- bill