Document Searching

HFRD Hypertext Services - Environment Overview



7 - Document Searching




The query and extract scripts provide real-time searching of plain-text and HTML documents, and document retrieval. The search is a simple-string search, not a GREP-style search. It is designed to provide a useful mechanism for locating documents containing a keyword, not for document analysis. It has the useful feature for plain-text documents of allowing the selective extraction of only the portion near the hit.

Note:

Searching is a notoriously CPU- and I/O-intensive activity. Longer searches progressively decrease their scheduling priority by one every 10 seconds, from normal to zero, helping to reduce the impact on any interactive users of the server system. The search algorithm itself is efficient, but because of this mechanism searching will take longer on a more heavily loaded system.
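The priority schedule described in the note can be sketched as follows. This is an illustration only, not the server's code: the function name is hypothetical, the starting value of 4 for "normal" priority is an assumption, and the first reduction is assumed to happen once a full 10 seconds have elapsed.

```python
def priority_after(elapsed_seconds, normal_priority=4):
    """Return the scheduling priority of a search that has run for elapsed_seconds.

    normal_priority=4 is an assumed starting value, not taken from the text.
    """
    steps = int(elapsed_seconds // 10)      # one step down per full 10 seconds
    return max(0, normal_priority - steps)  # never drops below zero


for seconds in (0, 9, 10, 25, 40, 120):
    print(seconds, priority_after(seconds))
```

Under these assumptions a search reaches priority zero after 40 seconds and stays there.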
 

Only files with a plain-text or HTML MIME data type (see 3 - Document Access and Specification) will be searched. Others may be specified, or be selected by a wildcard file specification, but their contents will not actually be searched.
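A minimal sketch of this filter is shown below. It assumes the data type is derived from the file name, and it uses Python's standard mimetypes table as a stand-in for the server's own mapping, which is not described in this section. The function name is hypothetical.

```python
import mimetypes

# Only these two data types have their contents searched.
SEARCHABLE_TYPES = {"text/plain", "text/html"}


def is_searchable(file_name):
    """Return True if the file's guessed MIME type is plain text or HTML."""
    mime_type, _encoding = mimetypes.guess_type(file_name)
    return mime_type in SEARCHABLE_TYPES


for name in ("readme.txt", "index.html", "logo.gif"):
    print(name, is_searchable(name))
```

A file such as logo.gif may still be matched by a wildcard specification, but under this rule it would simply be skipped.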

7.1 - Plain-Text Search



A search of a plain-text file is straightforward. Each line in the file is searched for the required string. The first time the string is encountered in a line is considered a hit. The line is not searched for any further occurrences.
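The line-by-line behaviour can be sketched as below. This is not the query script itself: the function name is hypothetical, and the case-insensitive default is an assumption that mirrors the default shown in the example form later in this section.

```python
def search_plain_text(lines, needle, case_sensitive=False):
    """Return the 1-based numbers of the lines that contain needle.

    A line counts as one hit no matter how often the string occurs in it.
    """
    if not case_sensitive:
        needle = needle.lower()
    hits = []
    for number, line in enumerate(lines, start=1):
        haystack = line if case_sensitive else line.lower()
        if haystack.find(needle) != -1:  # the first occurrence is enough
            hits.append(number)
    return hits


sample = ["no match", "one hit here, hit again", "HIT in caps"]
print(search_plain_text(sample, "hit"))
print(search_plain_text(sample, "hit", case_sensitive=True))
```

The second sample line contains the string twice but is reported only once.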

Searches of plain-text files allow the subsequent selection of partial documents (i.e. the retrieval of only a number of lines around any actual hit). This allows the user to selectively extract a portion of a document, avoiding the need to explicitly scan through to the section of interest.
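Partial retrieval amounts to returning a window of lines centred on the hit. The sketch below assumes a symmetric window; the function name and the default of two lines either side are hypothetical, since the text does not say how many lines the extract script returns.

```python
def extract_around(lines, hit_line, context=2):
    """Return (first_line_number, window) around the 1-based hit_line.

    The window is clipped at the start and end of the document.
    """
    start = max(1, hit_line - context)
    end = min(len(lines), hit_line + context)
    return start, lines[start - 1:end]


document = ["line %d" % n for n in range(1, 11)]
print(extract_around(document, 5))
print(extract_around(document, 1))
```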

7.2 - HTML Search



A search of an HTML file is a little more complex. As might be expected, only the text presented to the reader is searched; markup text is ignored. That is, all text not part of an HTML tag construct is extracted and searched. For example, out of the following HTML fragment

  <!-- an example HTML document -->
  <P>
  The document entitled <A HREF="example.html">"Example Document"</A>
  provides only an <I>overview</I> of the full capabilities of HTML.

only the following text would actually be searched:

  The document entitled "Example Document" provides only an overview
  of the full capabilities of HTML.
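The extraction of text from outside tag constructs can be approximated with Python's standard HTML parser, as sketched below. The class and function names are hypothetical, and the server's own extraction may differ in detail (for example in how it treats white space, which is collapsed here for readability).

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data; tags and comments are never passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def searchable_text(html):
    """Return the text outside markup, with runs of white space collapsed."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join("".join(parser.chunks).split())


fragment = '''
  <!-- an example HTML document -->
  <P>
  The document entitled <A HREF="example.html">"Example Document"</A>
  provides only an <I>overview</I> of the full capabilities of HTML.
'''
print(searchable_text(fragment))
```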


The mechanism for partial document retrieval available with plain-text files is not present with HTML documents. HTML files generally must be treated as a whole, with the formatting of a given section often very dependent on the formatting of previous sections. This makes extracting a subsection perilous without extensive syntactical analysis. On the positive side, HTML documents tend to be already divided into meaningful subdocuments (files), so that retrieving a hit naturally places it more or less within context.

7.3 - Search Syntax



A search may be initiated in one of two ways:

  1. Appending a question mark and search string to a file specification (the simple syntax of ISINDEX-style searching). This is standard HTTP, and of course must conform to HTTP syntax.
  2. Forms-based search, which allows the format and mechanism of the search to be controlled.

7.3.1 - ISINDEX Search




Placing the HTML tag <ISINDEX> within a document's text is sufficient to inform the browser that searching is available for that document. The browser will inform the user of this and allow a search of that document to be initiated at any time. Note that such a search is limited to that one document.

Using the keyword search syntax explicitly is another method of initiating a search, and additionally can use a wildcard in the document specification. For example:

  /hyperdata/html/html-primer/*.html?problem

7.3.2 - Forms-Based Search




A ``forms-based'' search is initiated by the server receiving a file specification, which of course may contain wildcards, followed by a search parameter. This is a typical HTML forms format URL. For example:

  *.txt?search=SIMPLE
  /hyperdata/.../*.*?search=THIS
  sub_directory/*.*?search=THAT
  ../sibling_directory/*.HTML?search=OTHER
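The two request styles can be told apart by whether the query string carries a search parameter. The sketch below uses Python's standard urllib.parse to split a request into file specification and search string. The function name and the returned style labels are hypothetical, and the server's actual parsing is not described here.

```python
from urllib.parse import parse_qs, unquote_plus, urlsplit


def parse_search_request(url):
    """Return (file_specification, search_string, style) for a search URL."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    if "search" in params:
        # forms-based: ...?search=STRING
        return parts.path, params["search"][0], "form"
    # ISINDEX-style: everything after the question mark is the search string
    return parts.path, unquote_plus(parts.query), "isindex"


print(parse_search_request("*.txt?search=SIMPLE"))
print(parse_search_request("/hyperdata/html/html-primer/*.html?problem"))
```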

The following link provides an online demonstration search. It searches all files (plain-text and HTML) in the current directory for the keyword "formatted". Note the difference in the way plain-text file hits are presented compared with those of HTML files.

Search for "formatted"

7.3.3 - Search Options



Additional URI components may be appended after the initial ``search='' parameter. These are appended with intervening ``&'' characters.
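As an illustration, a search URL with extra components can be assembled as below. The option names hits and case and their values are taken from the form in the Bells and Whistles section. The helper name is hypothetical, and the sketch assumes the options are ordinary name=value pairs.

```python
from urllib.parse import urlencode


def build_search_url(file_spec, search, **options):
    """Append search= and any further name=value options, joined by '&'."""
    params = [("search", search)] + list(options.items())
    return file_spec + "?" + urlencode(params)


print(build_search_url("/hyperdata/html/.../*.html", "formatted",
                       hits="document", case="yes"))
```

urlencode also takes care of characters that are not legal in a query string, so a search string containing a space is sent with a ``+'' in its place.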


7.3.4 - Example Search Form




To allow the client to enter a search string and submit a search to the server, an HTML level 2 form construct can be used. Here is an example:

  <FORM ACTION="/hyperdata/html/.../*.html">
  Search HTML documents for:
  <INPUT TYPE=text NAME="search">
  <INPUT TYPE=submit VALUE="[execute]">
  </FORM>


Bells and Whistles




A form providing all the options referred to in 7.3.3 - Search Options is shown below (some additional white space introduced for clarity):

  <FORM ACTION="/hyperdata/html/.../*.html">
  Search HTML documents for:
  <INPUT TYPE=text NAME="search">
  <INPUT TYPE=submit VALUE="[execute]">
  <BR><TT><A HREF="/hyperdata/?about=search">About</A> this search.</TT>
  <BR><TT>Output By:
  line <INPUT TYPE=radio NAME="hits" VALUE="line" CHECKED>
  document <INPUT TYPE=radio NAME="hits" VALUE="document"></TT>
  <BR><TT>Case Sensitive:
  no <INPUT TYPE=radio NAME="case" VALUE="no" CHECKED>
  yes <INPUT TYPE=radio NAME="case" VALUE="yes"></TT>
  </FORM>




