How the konspire file index database works

How the konspire file index database works

Overview
When a server receives a client file list, it adds the file descriptors (name, size, and host address) to its index database. This database sorts the file descriptors by file name to allow searching for files by file name. The database also stores file names so that they can be search by partial names. Thus, searching for "uta" will match both "utah.zip" and "computation.exe". When a search request arrives from a client, the server queries its database and sends a collection of matching file descriptors back to the client.

The string tree
File names are stored in a string tree, which is a tree that has a single ASCII character at each node. Thus, walking through the characters in a file name involves walking through nodes in the tree (as does walking through characters in a search term). A string tree is best illustrated by a picture. Below is an image of a simple string tree built by adding words to it in the following order: FAST, MAN, ZOO, FAT, FACE, ANT, AND.

Note that the first word added to the tree forms the root and a column of nodes down the center. When MAN is added, M is later in the alphabet than F, so we move to the right. We find an empty spot, so we add an M node, and then add A and N below M. When adding ZOO, we move to the right of F and to the right of M before finding an empty spot to add a Z node. When adding FAT, the F node matches the first character, so we go down (not right or left). Then the A node matches the second character, so we keep going down, but the S node doesn't match, so we go to the left after S to find an empty spot for the T node. Adding FACE is similar, except that we go to the left of the S node to add C because C comes earlier in the alphabet than S. Adding the other words follows the same pattern

Searching for words in a string tree is very fast: search time for a word is proportional to the word's length. When searching for a word, we move down from the current node if our current word character matches the current node character, and we move left or right if our word character is to early or too late in the alphabet. Thus, when searching for MAN, we move left from F, move down from M, move down from A, and move down from N. A marker exists down from N indicating that MAN is in our tree, and when we encounter this marker, we know we've found a match. However, if we search for MAC, we move right from F, down from M, down from A, and then left from N. No node exists to the left of N, so we haven't found a match.

Searching for prefixes in the tree is equally as fast. We simply travel down the tree as far as our prefix takes us and then return all strings that are below us in the tree. If searching for the prefix FA, we move down from F and down from A, and then take all strings (all marked nodes) that are below A in the tree.
For searching file names, prefixes and whole word matches aren't sufficient, since we'd like to search for substrings as well. For instance, we may want to find all strings in the tree that contain the substring AN. Unfortunately, the basic string tree doesn't support substring searches. However, we've devised a modified string tree that supports substring searches. Below is an image of a modified string tree built by adding the same words from above.

Note that for each character in our alphabet, we have a set of pointers to *every* node in the tree that contains the character (note that this diagram is simplified in that it only shows part of the ASCII character set). Thus, when looking for AN, we can follow every A pointer in the character table to find every A node in the tree. We can then search normally from these A nodes for AN. Searching in this way correctly returns ANT, AND, and MAN. Unfortunately, searches now take on the order of the number of first character nodes in the tree (e.g., if the first character is A, search time is proportional to the number of A nodes in the tree). However, the database uses several optimizations to make this process more efficient

Optimizations
First, to prevent the database from getting bogged down by small, useless search queries (for instance, searching for the letter E by itself), we ignore all search terms that are less than three characters long. This eliminates many problem terms, but it doesn't eliminate terms like ZIP, which could potentially match half of the files in the database.

A client search request is accompanied by a number of desired results (either 10, 20, or 40). If we're searching for ZIP, we only have to look through enough Z pointers until we find more than enough matching files. However, if searching for "ZIP MOVIE", we need to find all ZIP files and all MOVIE files, and then throw out files that are not matched by both terms. This leaves us still searching for every ZIP file in the database.

To overcome this problem, we sort search terms by size of likely matching set and search for terms that are likely to match the smallest set first. Thus, if searching for "ZIP MOVIE", we would guess that MOVIE will match a smaller set (because M has fewer pointers in the character table). We'd search for all MOVIE files, and then search for ZIP files, but stop searching for ZIP files after we had found more than enough files matching both ZIP and MOVIE.

Also, because clients do not want to see search results for files that they themselves are hosting, we weed out files on the searching client as we build the collection of search matches.

In summary, when processing client search requests, we take the following optimization steps:
1. weed out and discard short search terms
2. sort search terms in smallest-likely-match-set-first order
3. search until more than enough matching files are found, weeding out results for files that are hosted by the searching client