ht://Dig: How it works

 How it works



ht://Dig Copyright © 1995-2000 The ht://Dig Group
Please see the file COPYING for license information.




The system performs three major tasks that should be performed in the following order:



Digging



Before you can search, a database of all the documents that need to be searched has to be created.



Merging



Once the document database has been created, it has to be converted to something that can be searched quickly. Also, if you want to update only changed documents, these changes have to be merged into the searchable database.
Even though this task could be performed at the same time as the digging, it is a separate process for efficiency reasons. It also gives more flexibility over what actually happens at merge time.



Searching



Finally, the databases that were created in the previous steps can be used to perform actual searches. Normally, searches will be invoked by a CGI program which gets its input from the user through an HTML form.




Digging



Digging is the first step in creating a search database. This system uses the word digging while other systems call it harvesting or gathering. In the ht://Dig system, the program htdig performs the information gathering stage. In this process, the program acts as a regular web user, except that it follows the hyperlinks that it comes across. (Actually, it will not follow all of them, just those that are within the domain it needs to gather information on.)
Each document it visits is examined, and all the unique words in the document are extracted and stored.
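The gathering step described above can be sketched as a small breadth-first crawl. This is only an illustration and not how htdig is implemented: the PAGES dictionary stands in for the web so the sketch runs without network access, and the names dig, PAGES, and the regular expressions are all assumptions made for this example.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "web" so the sketch needs no network access.
PAGES = {
    "http://example.com/": '<a href="/a.html">A</a> '
                           '<a href="http://other.org/">Other</a> Welcome home',
    "http://example.com/a.html": '<a href="/">Home</a> Apples and more apples',
    "http://other.org/": "Outside the domain",
}

LINK_RE = re.compile(r'href="([^"]+)"')   # very rough link extraction
TAG_RE = re.compile(r"<[^>]+>")           # strips tags, keeps the text
WORD_RE = re.compile(r"[a-z0-9]+")


def dig(start_url, pages=PAGES):
    """Follow links breadth-first, staying on the start URL's host.

    Returns a dict mapping each visited URL to its set of unique words.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    words_by_url = {}
    while queue:
        url = queue.popleft()
        html = pages.get(url)
        if html is None:
            continue  # unknown page: nothing to examine
        text = TAG_RE.sub(" ", html).lower()
        words_by_url[url] = set(WORD_RE.findall(text))
        for href in LINK_RE.findall(html):
            target = urljoin(url, href)
            # Only follow links that stay within the domain being dug.
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return words_by_url
```

Calling dig("http://example.com/") visits the two example.com pages and never requests the page on other.org, which mirrors the domain restriction mentioned above.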



The digging process will create at least two files. The first one is the list of all the words and the second one is a database of URLs and information about the URLs.
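As a rough picture of those two outputs, the sketch below turns per-document word sets into a word list and a URL database. The record layouts, the numeric document ids, and the function name build_dig_output are assumptions for illustration; the actual on-disk formats are not described here.

```python
def build_dig_output(words_by_url):
    """Split dig results into a word list and a URL database.

    words_by_url maps a URL to the set of unique words found in it.
    Both returned structures use made-up layouts for illustration.
    """
    url_db = {}      # document id -> information about the URL
    word_list = []   # (word, document id) pairs
    for doc_id, url in enumerate(sorted(words_by_url)):
        words = words_by_url[url]
        url_db[doc_id] = {"url": url, "word_count": len(words)}
        word_list.extend((word, doc_id) for word in sorted(words))
    return word_list, url_db


# Hypothetical input: two documents and their unique words.
sample = {
    "http://example.com/": {"welcome", "home"},
    "http://example.com/a.html": {"apples", "home"},
}
word_list, url_db = build_dig_output(sample)
```

Keeping the words separate from the URL information means the word list can later be reorganised for fast lookup without touching the per-document details.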




Merging



Once the digging process has completed, its output needs to be converted into something the search engine can actually use. The htmerge program uses the information from previous digs to create a database that the search engine can use. It uses the term 'merge' because it takes data from several databases and merges them into several other databases. The source databases include the databases created by the digging process and also any previously merged databases. These old databases are used if the digging process produced information only for documents which have changed.
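The update case can be sketched as follows: an old searchable index is combined with a dig that covers only the changed documents. This is a simplified illustration, not htmerge's actual logic; the word-to-URLs dictionary layout and the function name merge are assumptions.

```python
def merge(old_index, dug_words):
    """Merge a fresh dig into a previously merged index.

    old_index: word -> set of URLs (the old searchable database).
    dug_words: URL -> set of words, only for documents that changed.
    Returns a new word -> set of URLs index.
    """
    changed = set(dug_words)
    merged = {}
    # Keep old entries, but drop anything recorded for re-dug documents.
    for word, urls in old_index.items():
        kept = urls - changed
        if kept:
            merged[word] = set(kept)
    # Add the current words of every changed document.
    for url, words in dug_words.items():
        for word in words:
            merged.setdefault(word, set()).add(url)
    return merged


# Hypothetical data: document u1 was edited, u2 was not.
old_index = {"apple": {"u1", "u2"}, "pear": {"u1"}}
dug_words = {"u1": {"apple", "plum"}}
merged = merge(old_index, dug_words)
```

In this example "pear" disappears because the only document that contained it no longer does, while the entry for the unchanged document u2 is carried over from the old database.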



There are several optional tasks which also fit into the merge phase:



 Expiration notification:

The ht://Dig system includes a handy reminder service which allows HTML authors to add some ht://Dig-specific meta information to HTML documents. This meta information is used to email authors after a specified date. This is very useful for maintaining lists that mark new items with those annoying 'new' graphics. (Hint: things really aren't all that 'new' anymore after 6 months!)
The htnotify program performs this task.

 Fuzzy word index creation:

To allow searches to use "fuzzy" algorithms to match words, the htfuzzy program can create indexes for several different algorithms.
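To make the idea of a fuzzy word index concrete, here is a minimal sketch that groups words by their Soundex code. Soundex is used purely as an illustrative algorithm; the text above does not say which algorithms htfuzzy supports, and the function names soundex and build_fuzzy_index are assumptions for this example.

```python
# Standard Soundex letter groups; vowels, y, h and w have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(word):
    """Return the four-character Soundex code of an ASCII word."""
    word = "".join(ch for ch in word.lower() if ch.isascii() and ch.isalpha())
    if not word:
        return ""
    result = word[0].upper()
    prev = CODES.get(word[0], "")
    for ch in word[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so repeated codes count again
    return (result + "000")[:4]


def build_fuzzy_index(words):
    """Map each Soundex code to the set of words that share it."""
    index = {}
    for word in words:
        index.setdefault(soundex(word), set()).add(word)
    return index


fuzzy = build_fuzzy_index(["robert", "rupert", "digging"])
```

A search for "rupert" could then look up its code in the index and also match "robert", since both words map to the same code.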




Searching



Searching is where the users actually get to use all the information that was gathered during the dig and merge stages. The htsearch program performs the actual searches. It produces HTML output which will be seen by the users.
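The search stage can be pictured as a lookup in the merged index followed by rendering the hits as HTML. The sketch below is an assumption-laden illustration rather than htsearch's behaviour: the index layout, the rule that every query word must match, and the names search and render_results are all made up for this example.

```python
from html import escape


def search(index, query):
    """Return the sorted URLs that contain every word of the query.

    index maps a word to the set of URLs containing it. Requiring all
    words to match is an assumption made for this sketch.
    """
    terms = query.lower().split()
    if not terms:
        return []
    url_sets = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*url_sets))


def render_results(query, urls):
    """Render the hits as a small HTML fragment for the user."""
    items = "\n".join(
        f'<li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in urls
    )
    return f"<h1>Results for {escape(query)}</h1>\n<ul>\n{items}\n</ul>"


# Hypothetical merged index.
index = {
    "apple": {"http://example.com/1", "http://example.com/2"},
    "pie": {"http://example.com/2"},
}
page = render_results("apple pie", search(index, "Apple pie"))
```

The query text is escaped before it is placed in the page, which matters for any program that echoes form input back to the user.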


Andrew Scherpbier <andrew@contigo.com>
Last modified: $Date: 2000/02/17 22:05:21 $