Adding a Soundex function to Datatrieve (or other programs)

Bart Z. Lederman
Western Union

First, some history on a discussion on DECUServe: Bob McDougall, and then Bob Hassinger and Bill Mayhew, asked for code which would do Soundex processing. I happened to be working on a project where I might need Soundex, and had some old PL/I code someone left from a previous project (with few comments) that did it, and offered to post the algorithm. (Mark?) Kozam posted the algorithm from Donald Knuth's text, "Sorting and Searching" (Volume 3 from The Art of Computer Programming), which in turn references the creators, Margaret Odell and Robert Russell (in 1918!). (There was also some discussion of a BASIC program posted on another bulletin board, which some users found difficult to understand.)

I had some questions about the algorithm as posted, particularly about what to do with non-alphabetic characters. For example, is O'hara the same as Ohara? What about "words" like 1st and 2nd? And what do you do with hyphenated words? Alan Conroy suggested the following:

"Ignore apostrophes (') - O'hara is Ohara. I don't know the history of Soundex, but I've only seen it used on names (census records, for instance) so I don't know how it would work with hyphens, etc. I assume that they are either ignored or the case is undefined."

Lisa J. Pokel then told us that one of the programs that comes with the DECUS C package that does a 'phone directory application had a C version of the Soundex routine. I had that package, and dug out the program.

How I wrote my version.

I found that this program (by Martin Minow, who credits Knuth for the algorithm) had a different implementation than that given by Mark Kozam from the same source: his example had the same value assigned to both 'M' and 'N', whereas the program gave them different values. The program also had what looked like several good improvements: it removed leading silent letters (as in ptomaine), and equated PH to F.
But it also returned the processed value as a binary number, rather than the "letter-digit-digit-digit" form given before. However, the biggest problem was that it didn't work correctly: it didn't actually ignore leading silents, nor would it change PH to F if the word starts with PH.

I therefore "fixed" the code from PHBOOK (a polite way of saying I re-wrote it). I added 'MN' as a leading pair to strip (as in "MNEMONIC"), got it to handle a leading 'PH' correctly, put the output back to a character string, etc. I decided to keep the separate assignments for 'M' and 'N', but the structure of the program is such that the assignment of values for each letter is easily changed to suit any particular application's needs.

I also made a number of other changes to suit my needs. First, I made it ignore all leading non-alphabetic characters (such as numbers and blank spaces). I also made it ignore all punctuation, not only to handle cases such as "O'hara", but also words like "x-ray", because I'm trying to match text which is going to include "words" that have hyphens, slashes, plus signs, or other characters buried in them.

I also had to consider what to do with DEC Multi-national characters (letters with accent marks). This really wasn't too difficult: most are vowels, and as all vowels without accents are given the same value (basically they are all ignored) I did the same for vowels with accents. The characters which require more consideration are the French C with cedilla "Ç", Spanish N with tilde "Ñ" and German sharp s "ß": I gave them the same values as the American C, N, and S, respectively.

This was written in the C programming language, and seemed to work fairly well. However, I couldn't easily add this function into Datatrieve because it didn't pass its character string arguments as descriptors.
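The rules described above can be sketched in a few lines. This is only an illustration in Python, not the C or Macro-32 routine discussed in this article, and it rests on several assumptions: the digit table is the common one from Knuth (so 'M' and 'N' share a value here, although the routine described above keeps them separate; the article does not give those values), the only leading pairs handled are the three named in the text (PH, PT, MN), and the names `soundex`, `normalize`, `CODES`, `SPECIAL` and `LEADING` are invented for this sketch.

```python
import unicodedata

# Assumed digit table (Knuth's usual grouping). Swap in another table to
# give 'M' and 'N' separate values, or to tune it for an application.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

# Accented consonants singled out in the text, mapped to plain letters.
SPECIAL = {"Ç": "C", "ç": "C", "Ñ": "N", "ñ": "N", "ß": "S"}

# Leading pairs named in the text: PH sounds like F, and the first
# letter of PT and MN is silent.
LEADING = (("PH", "F"), ("PT", "T"), ("MN", "N"))


def normalize(text):
    """Keep only letters, upper-cased, with accent marks removed."""
    out = []
    for ch in text:
        ch = SPECIAL.get(ch, ch)
        # NFD splits an accented letter into base letter + accent mark.
        base = unicodedata.normalize("NFD", ch)[0].upper()
        if "A" <= base <= "Z":
            out.append(base)
    return "".join(out)


def soundex(text, codes=CODES):
    """Return a letter-digit-digit-digit code, or "" if text has no letters."""
    word = normalize(text)
    if not word:
        return ""
    for pair, replacement in LEADING:
        if word.startswith(pair):
            word = replacement + word[2:]
            break
    result = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "HW":      # H and W do not separate equal codes
            prev = digit
    return (result + "000")[:4]


print(soundex("Robert"))      # R163
print(soundex("ptomaine"))    # T550
print(soundex("O'hara") == soundex("Ohara"))   # True
```

Passing a different `codes` dictionary is one way to experiment with the letter assignments without touching the rest of the routine.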
I was also considering making this code available to other users, and there is the desire to distribute public domain software in source code form only, so the recipient may check to see if anyone in the distribution process has done something nasty (like inserting virus code). Since not everyone has a C compiler, distributing the routine this way would make it unusable to many people.

I solved the problem by writing a Macro-32 version, as VMS systems are supplied with a Macro-32 assembler and anyone can re-assemble the routine from the source code. I also made it work with string descriptors, which makes it very easy to add it as a function in Datatrieve, and to call it with CHARACTER data types from FORTRAN or other languages, as most, if not all, VMS languages support string descriptors in order to call other system routines. It is this version which I am supplying to the Newsletter editor, and plan to include in the next release of the Datatrieve and Fourth Generation Languages SIG Library Collection (at the next Symposium).

(I also wrote a Macro-11 version, to run on PDP-11 systems. Although it isn't possible to add user functions into Datatrieve-11, I wanted to have it for use in my own programs.)

What to do with Soundex.

What is Soundex used for? The original purpose appears to be to find names where the exact spelling may not be known. This would certainly be useful for telephone book applications, personnel files, employee lists, and other data bases where you sometimes don't know how a person spells their name. I was also considering using it in an application where a street address has to be retrieved, and where again the exact spelling of the address may not be known. I can imagine other similar uses in retrieving data on chemicals, medicines, diseases, or other data where the person looking for information may know how the word sounds, but not exactly how it's spelled. No doubt other readers will have their own applications.
To be most useful, you will need to add the Soundex value into the database when the data is loaded. For example, a part of your record definition might look like this:

    10 LAST_NAME PIC X(32).
    10 LAST_NAME_SOUNDEX PIC X(4).

and when you store the data do something like this:

    LAST_NAME = *."last name / surname"
    LAST_NAME_SOUNDEX = FN$SOUNDEX(LAST_NAME)

It would also help a lot to make LAST_NAME_SOUNDEX a keyed field when the file is created if you are going to use the field for retrievals regularly. Then you can do things like:

    SEARCH_TEXT = *."a name you wish to find"
    PRINT domain WITH LAST_NAME = SEARCH_TEXT

Ask the user if the name was found, or count the matches. If there aren't any, then do:

    PRINT domain WITH LAST_NAME_SOUNDEX = FN$SOUNDEX(SEARCH_TEXT)

If you are not going to retrieve on a Soundex key very often, you can of course do something like:

    PRINT domain WITH FN$SOUNDEX(LAST_NAME) = FN$SOUNDEX(SEARCH_TEXT)

but then Datatrieve is going to have to read through the entire domain to find the records you want, and this sort of thing tends to give Datatrieve a bad reputation as being slow and a resource hog.

If you have an existing data file to which you want to add Soundex keys, you can always use Datatrieve to read the old data and populate a new file using FN$SOUNDEX to add the new fields.

Users at past Symposia have requested the ability to do Soundex in Datatrieve. It would be very interesting to hear about some of your experiences with this function: what you use it for, if the assignment of values to the various letters could be improved, how well it worked, etc.
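For readers who want to try the retrieval strategy outside Datatrieve, the "exact match first, then Soundex key" lookup described above can be sketched as follows. This is a rough Python illustration under stated assumptions: two dictionaries stand in for the keyed fields LAST_NAME and LAST_NAME_SOUNDEX, `soundex_key` is a plain textbook Soundex (without the leading-pair and accent handling of the routine described in this article), and the names `NameFile`, `store`, `find` and `soundex_key` are invented for the example.

```python
from collections import defaultdict

# Assumed textbook digit groups; group i gets digit i + 1.
_GROUPS = ("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R")
_CODE = {ch: str(i + 1) for i, group in enumerate(_GROUPS) for ch in group}


def soundex_key(name):
    """Simplified stand-in for FN$SOUNDEX: letters only, 4 characters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    key, prev = letters[0], _CODE.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODE.get(c, "")
        if digit and digit != prev:
            key += digit
        if c not in "HW":
            prev = digit
    return (key + "000")[:4]


class NameFile:
    """Toy stand-in for a domain with two keyed fields."""

    def __init__(self):
        self.by_name = defaultdict(list)       # like keyed LAST_NAME
        self.by_soundex = defaultdict(list)    # like keyed LAST_NAME_SOUNDEX

    def store(self, last_name, record):
        # Compute the Soundex value once, when the record is loaded.
        self.by_name[last_name].append(record)
        self.by_soundex[soundex_key(last_name)].append(record)

    def find(self, search_text):
        # Try the exact spelling first; fall back to the Soundex key.
        exact = self.by_name.get(search_text, [])
        if exact:
            return exact
        return self.by_soundex.get(soundex_key(search_text), [])


names = NameFile()
names.store("Smith", {"last_name": "Smith", "ext": "1234"})
print(names.find("Smyth"))   # finds the Smith record through the Soundex key
```

Both lookups go through a key, so neither has to read every record, which is the point of storing the Soundex value at load time rather than computing it during retrieval.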