

SWISH-CONFIG - Configuration File Directives


SWISH-E CONFIGURATION FILE

A configuration file controls which files SWISH-E indexes, how they are indexed, and where the index is written. The configuration file is passed to swish as a command line argument with the -c switch (see SWISH-RUN).

The configuration file is a text file composed of comments, blank lines, and configuration directives. The order of the directives is not important. Some directives may be used more than once in the configuration file, while others can only be used once (i.e. a later occurrence overwrites an earlier one). The case of a directive name is not important -- you may use upper, lower, or mixed case.

A comment is any line that begins with a ``#''.

 
    # This is a comment

Commented example configuration files are included in the config directory of the SWISH-E distribution.
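
As a rough illustration of the file format described above, the following Python sketch reads configuration text into (directive, arguments) pairs. It is not SWISH-E's parser: the function name parse_config is hypothetical, and both the shell-style handling of quotes (via shlex) and the skipping of leading whitespace before ``#'' are assumptions made for the example.

```python
import shlex

def parse_config(text):
    """Return a list of (directive, [args]) pairs from config text."""
    directives = []
    for line in text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (those beginning with '#').
        if not stripped or stripped.startswith("#"):
            continue
        # Assumption: arguments split like shell words, honoring quotes.
        parts = shlex.split(stripped)
        # Directive names are case-insensitive, so normalize them.
        directives.append((parts[0].lower(), parts[1:]))
    return directives

example = """
# This is a comment
IndexDir ./docs
indexfile ./site.index
IndexName "Linux Documentation"
"""
for name, args in parse_config(example):
    print(name, args)
```

Normalizing the directive name to lower case reflects the rule that case does not matter; the arguments are left untouched.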

Typically, the configuration file directives are grouped together in some logical order -- that is, directives that control the source of the documents are grouped together first, and directives that control how each document is filtered or how its words are indexed follow in another group. (The directives listed below are grouped in this order.)

You may also split your directives up into different configuration files and specify more than one configuration file. This allows you to have a master configuration file used for many different indexes, and smaller configuration files for each separate index. You can specify the different configuration files when running from the command line with the -c switch, or you may include other configuration files with the IncludeConfigFile directive.

Some command line arguments can override directives specified in the configuration file. See also the SWISH-RUN page for instructions on running SWISH-E, and the SWISH-SEARCH page for information and examples on how to search your index.

The configuration file is specified to SWISH-E by the -c switch. For example,

 
    swish-e -c myconfig.conf


Directives that Control Swish

These configuration directives control the general behavior of SWISH-E.

IncludeConfigFile *path to config file*

This directive can be used to include configuration directives located in another file.

 
    IncludeConfigFile /usr/local/swish/conf/site_config.config

IndexReport [0|1|2|3|4]

IndexReport sets how detailed the reporting is while indexing. You can specify a number from 0 to 3: 0 is totally silent, 3 is the most verbose. The default is 3, so you will probably want to set a lower value.

 
    IndexReport 1

This may be overridden from the command line via the -v switch (see SWISH-RUN).

IndexFile *path*

IndexFile specifies the location of the generated index file. If not specified, SWISH-E will create the file index.swish-e in the current directory.

 
    IndexFile /usr/local/swish/site.index

The following directives are not available until Swish can parse a configuration file while searching.

EnableAltSearchSyntax [yes|NO]

Enable alternate search syntax. This allows a basic "Altavista(c)", "Lycos(c)", etc. style of search syntax, which means a search query can contain "+" and "-" as syntax parameters.

Example:

 
    swish-e -w "+word1 +word2 -word3  word4 word5"
    "+"  = following word has to be in all found documents
    "-"  = following word may not be in any document found
    " "  = following word will be searched in documents
  
SwishSearchOperators <and-word> <or-word> <not-word>

Using this config directive you can change the boolean search operators of swish-e, e.g. to adapt these to your language. The default is: AND OR NOT

Example (German):

 
    SwishSearchOperators   UND  ODER  NICHT

SwishSearchDefaultRule [<AND-WORD>|<or-word>]

SwishSearchDefaultRule defines the default Boolean operator to use if none is specified between words or phrases. The default is AND.

The word you specify must match one of the available SwishSearchOperators.

Example:

 
    SwishSearchOperators   UND  ODER  NICHT
    # Make it act like a web search engine
    SwishSearchDefaultRule ODER

ResultExtFormatName name -x format string

The output of swish can be defined by specifying a format string with the -x command line argument. Using ResultExtFormatName you can assign a predefined format string to a name.

Examples:

 
    ResultExtFormatName  moreinfo   "%c|%r|%t|%p|<author>|<publishyear>\n"

Then when searching you can specify the format string's name:

 
    swish-e ... -x moreinfo ...

See the -x switch in SWISH-RUN for more information about output formats.


Administrative Headers Directives

SWISH-E stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the SWISH-E C library. There are a number of fields available for your own use. None of these fields are required:

IndexName *text*

IndexDescription *text*

IndexPointer *text*

IndexAdmin *text*

These variables specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a fuller description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be enclosed in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.

Examples:

 
    IndexName "Linux Documentation"
    IndexDescription "This is an index of /usr/doc on our Linux machine." 
    IndexPointer "http://localhost/swish/linux/index.html"
    IndexAdmin "webmaster"


Document Source Directives

These directives control what documents are indexed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.

IndexDir [directories or files|URL|external program]

IndexDir defines the source of the documents for SWISH-E. SWISH-E currently supports three file access methods: File system, HTTP (also called spidering), and prog for reading files from an external program.

The -S command line argument is used to select the file access method.

 
    swish-e -c swish.config -S fs    - file system
    swish-e -c swish.config -S http  - spider
    swish-e -c swish.config -S prog  - external program
    

For the File system method of access IndexDir is a space-separated list of files and directories to index. You may specify more than one IndexDir directive.

Any sub-directories of any listed directory will also be indexed.

Examples:

 
    # Index this directory and any subdirectories
    IndexDir /usr/local/home/http

 
    # Index the docs directory in current directory
    IndexDir ./docs

 
    # Index these files in the current directory
    IndexDir ./index.html ./page1.html ./page2.html
    # and index this directory, too
    IndexDir ../public_html
    

For the HTTP method of access specify the URLs from which you want the spidering to begin.

Example:

 
    IndexDir http://www.my-site.com/index.html
    IndexDir http://localhost/index.html

Obviously, using the HTTP method to index is much slower than indexing local files. Be well aware that some sites do not appreciate spidering and may block your IP address. You may wish to contact the remote site before spidering their web site. More information about spidering can be found in Directives for the HTTP Access Method Only below.

For the prog method of access IndexDir specifies the path to the program(s) to execute. The external program must correctly format the documents being passed back to swish. Examples of external programs are provided in the prog-bin directory.

Note: Not all directives work with all methods.

NoContents *list of file suffixes*

Files with these suffixes will not have their contents indexed, but their file names will be indexed. File names are not normally indexed. If you specify .html or .htm then if a <TITLE> section is found those words will be indexed, otherwise the file name will be indexed.

 
    NoContents .gif .xbm .au .mov .mpg .pdf .ps

ReplaceRules [replace|remove|prepend|append]

ReplaceRules allows you to make changes to file pathnames before they're indexed. These changed file names or URLs will be returned in search results.

For example, you may index your files locally (with the File system indexing method), yet return a URL in search results. This directive can be used to map the file names to their respective URLs on your web server.

There are four operations you can specify: replace, append, remove, and prepend. They will parse the pathname in the order you've typed these commands. More than one command and its arguments can appear on the same line, but it's easier to read when commands are broken up over a few lines. You can't put a command and its argument(s) on different lines, however.

This directive uses C library regex.h regular expressions.

 
   replace "the string you want replaced" "what to change it to"
        This replaces all occurrences of the old string
        with the new one.

 
   remove "a string to remove"   

 
   prepend "a string to add before the result"

 
   append "a string to add after the result"

Examples:

 
    ReplaceRules replace "testdir/" "anotherdir/"
    ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"

 
    ReplaceRules remove "testdir/"

 
    ReplaceRules prepend "http://localhost/"
    ReplaceRules append ".html"
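
To make the ordering of the four operations concrete, here is a minimal Python sketch that applies a list of rules to a pathname one after another. It is an illustration only: apply_rules and the /var/www/ document root are hypothetical, and Python's re module is used in place of the C library regex.h, so pattern syntax may differ in places.

```python
import re

def apply_rules(path, rules):
    """Apply (operation, *args) rules to a path in the order given."""
    for op, *args in rules:
        if op == "replace":
            pattern, replacement = args
            # Insert the replacement literally; a plain string passed to
            # re.sub would have its backslashes interpreted.
            path = re.sub(pattern, lambda match: replacement, path)
        elif op == "remove":
            path = re.sub(args[0], "", path)
        elif op == "prepend":
            path = args[0] + path
        elif op == "append":
            path = path + args[0]
        else:
            raise ValueError("unknown operation: " + op)
    return path

rules = [
    ("remove", "^/var/www/"),            # hypothetical local document root
    ("prepend", "http://localhost/"),
]
print(apply_rules("/var/www/docs/index.html", rules))
```

Because each rule works on the result of the previous one, swapping the two rules in this example would leave the local path in place after the URL prefix.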

IndexContents [TXT|HTML|XML|LST|WML] *file extensions*

The IndexContents directive assigns one of Swish's document parsers to a document, based on its extension. Swish currently knows how to parse TXT, HTML, and XML documents. LST files are special multiple-document XML files, described below. WML uses the HTML parser.

Documents that are not assigned a parser with IndexContents will, by default, use the HTML parser. The DefaultContents directive may be used to assign a parser to documents that do not match a file extension defined with the IndexContents directive.

Example:

 
    IndexContents HTML .htm .html .shtml
    IndexContents TXT  .txt .log .text
    IndexContents XML  .xml

HTML is the default type for all files, unless otherwise specified (and this default can be changed by the DefaultContents directive). Swish parses titles from HTML files, if available, and keeps track of the context of the text for context searching (see -t in SWISH-RUN). HTML and XML files use different tag formats for MetaNames and PropertyNames.
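
The selection logic can be pictured as a lookup from file extension to parser name with a fallback. The Python sketch below is only a model of that description: choose_parser is a hypothetical name, and the exact, case-sensitive comparison of extensions is an assumption.

```python
import os

def choose_parser(path, index_contents, default_contents="HTML"):
    """Pick a parser name for a path from its file extension."""
    extension = os.path.splitext(path)[1]
    # Assumption: extensions are compared exactly as written.
    return index_contents.get(extension, default_contents)

# Mirrors the IndexContents example above.
index_contents = {
    ".htm": "HTML", ".html": "HTML", ".shtml": "HTML",
    ".txt": "TXT", ".log": "TXT", ".text": "TXT",
    ".xml": "XML",
}
for name in ("notes.txt", "feed.xml", "README"):
    print(name, choose_parser(name, index_contents))
```

A file with no matching extension, such as README here, falls through to the default, which is the role DefaultContents plays.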

If you use filters to convert documents, include those extensions too. For example, if a filter converts .pdf to .html, you need to tell swish that .pdf files should be indexed by the internal HTML parser:

 
    FileFilter .pdf pdf2html
    IndexContents HTML .pdf

See also Document Filter Directives.

LST files are XML files that contain multiple documents. The documents are separated by the first XML tag found in the document (each time that tag is found it is considered a new document).

 
    <tag1>         <== First document
        bla, bla
    </tag1>
    <tag2>
        bla, bla
    </tag2>
    <tag1>         <== Second document
    ...

When reporting results from a query, swish will still return the document name, but will also return a document offset in the property swishstartpos, and the length of the sub-document in swishdocsize (these properties are available by using the -x format option).
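
The splitting rule can be sketched in Python as follows. This is an illustration of the description above, not SWISH-E's code: split_lst is a hypothetical name, and offsets are counted in characters here, whereas the real swishstartpos and swishdocsize values may be byte counts.

```python
import re

def split_lst(text):
    """Split LST text into (start offset, size, content) per sub-document."""
    first = re.search(r"<([A-Za-z_][\w.-]*)[^>]*>", text)
    if first is None:
        return []
    # Each opening occurrence of the first tag starts a new document.
    opener = re.compile(r"<" + re.escape(first.group(1)) + r"(\s[^>]*)?>")
    starts = [match.start() for match in opener.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [(s, e - s, text[s:e]) for s, e in zip(starts, ends)]

sample = "<tag1>a</tag1>\n<tag2>b</tag2>\n<tag1>c</tag1>\n"
for start, size, _ in split_lst(sample):
    print(start, size)
```

In the sample, the second tag1 element begins a second sub-document, so two (offset, size) pairs are printed.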

For example, you could have text files that contain SQL queries, and the queries might generate quite a number of results (documents) from a database. You can instruct swish to ``index'' these files, and use a filter to convert the SQL queries into documents. In other words, Swish indexes a file, but a swish filter converts that file (which contains the SQL query statement) into a query and returns, perhaps, many documents.

For example, the swish config file might look like this:

 
    IndexDir ./test.sql
    IndexFile ./test.index
    MetaNames tag1 tag2 tag3
    PropertyNames tag2 tag3
    FilterDir ./
    # Here is the main part
    IndexContents LST .sql          
    FileFilter .sql mysqlfilter.sh       
    # If you also want a desc use XML not LST
    StoreDescription XML <meta1> 

Then, the *.sql files can contain the queries. For example,

 
    select tag1,tag2,tag3 from my_table

The mysqlfilter.sh program should read the *.sql file, process the query/select, and format the output in the ``LST'' style:

 
    <tag1>
        data
    </tag1>
    <tag2>
        more data
    </tag2>
    <tag3>
        even more data
    </tag3>
    <tag1>
        start of a new document
    ...

Care must be taken when returning multiple document files to swish, as swish will load all data into memory for each file. In other words, don't try to index thousands of documents as a single LST type of document.

Note: Some of this may be changed in the future to use content-types instead of file extensions. See SWISH-3.0.

DefaultContents [TXT|HTML|XML|LST|WML]

This sets the default parser for documents that are not specified in IndexContents. If not specified the default is HTML.

Example:

 
    DefaultContents HTML

The DefaultContents directive should be used when spidering, as HTML files may be returned without a file extension (such as when requesting a directory and the default index.html is returned).

FileInfoCompression [yes|NO]

** This directive is currently not supported **

Setting FileInfoCompression to yes will compress the index file to save disk space. This may result in longer indexing times. The default is no.

Also see the -e switch in SWISH-RUN for saving RAM during indexing.


Document Contents Directives

These directives control what information is extracted from your source documents, and how that information is made available during searching.

ConvertHTMLEntities [YES|no]

ASCII entities can be converted automatically while indexing documents of type HTML and XML. For performance reasons you may wish to set this to no if your documents do not contain HTML entities. The default is yes.

If ConvertHTMLEntities is set to no the entities will be indexed without conversion.

MetaNames *list of names*

META names are a way to define ``fields'' in your XML and HTML documents. You can use the META names in your queries to limit the search to just the words contained in that META name of your document. For example, you might have a META tagged field in your documents called subjects and then you can search your documents for the word ``foo'' but only return documents where ``foo'' is within the subjects META tag.

 
    swish-e -w subjects=foo

(See also the -t switch in SWISH-RUN for information about context searching in HTML documents.)

The MetaNames directive is a space separated list. For example:

 
    MetaNames meta1 meta2 keywords subjects

You may also use UndefinedMetaTags to specify automatic extraction of meta names from your HTML and XML documents.

META tags can have two formats in your HTML source documents:

 
    <META NAME="meta1" CONTENT="some content">

and

 
    <!-- META START NAME="meta1" -->
        some content
    <!-- META END -->

And in XML documents, use the format:

 
    <meta1>
        Some Content
    </meta1>

Then you can limit your search to just META meta1 like this:

 
    swish-e -w 'meta1=(apples or oranges)'

You may nest the XML and the start/end tag versions:

 
    <keywords>
        <tag1>
            some content
        </tag1>
        <tag2>
            some other content
        </tag2>
    </keywords>

Then you can search in both tag1 and tag2 with:

 
  swish-e -w 'keywords=(query words)'

MetaNames are case sensitive in XML documents.

UndefinedMetaTags [error|ignore|index|auto]

This directive defines the behavior of swish during indexing when a meta name is found but is not listed in MetaNames. There are four choices:

error - If a meta name is found that is not listed in MetaNames then indexing will be halted and an error reported.

ignore - The contents of the meta tag are ignored and not indexed.

index - The contents of the meta tag are indexed, but placed in the main index (the contents are not assigned a meta name and cannot be searched by meta name). This is the default.

auto - This method only applies to HTML and XML documents and will create meta tags automatically for HTML meta names and XML elements. Using this is the same as specifying all the meta names explicitly in a MetaNames directive.
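
The four choices can be summarized with a small Python sketch. It models only the behavior described above; handle_meta and its return convention are hypothetical and do not correspond to any SWISH-E function.

```python
def handle_meta(name, content, meta_names, mode="index"):
    """Return (meta name or None, content) to index, or None to skip."""
    if name in meta_names:
        return (name, content)
    if mode == "error":
        raise ValueError("meta name not listed in MetaNames: " + name)
    if mode == "ignore":
        return None                  # contents are not indexed at all
    if mode == "index":
        return (None, content)       # main index, no meta name attached
    if mode == "auto":
        meta_names.add(name)         # behave as if it had been listed
        return (name, content)
    raise ValueError("unknown UndefinedMetaTags mode: " + mode)

names = {"subjects"}
print(handle_meta("author", "Jane Doe", names, "index"))
print(handle_meta("author", "Jane Doe", names, "auto"))
```

Under index the words stay searchable but lose their field, while under auto the field name becomes searchable as if it had been declared.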

PropertyNames *list of meta names*

SWISH-E allows you to specify certain META tags that can be used as document properties. The contents of any META tag that has been identified as a document property can be returned as part of the search results along with the rank, file name, title, and document size (see the -p and -x switches in SWISH-RUN).

Properties are useful for returning additional data from documents in search results -- this saves the effort of reading and parsing the source files while reading SWISH-E search results, and is especially useful when the source documents are no longer available or slow to access (e.g. over http).

Another feature of properties is that SWISH-E can use the PropertyNames for sorting the search results (see the -s switch).

 
    PropertyNames author subjects

Note that the PropertyNames listed must also be listed in the MetaNames directive. Property names are case sensitive in XML documents.

Use of PropertyNames will increase the size of your index file, sometimes significantly.

PropertyNamesNumeric

This directive is similar to PropertyNames, but it flags the property as a string of digits that is stored as binary data instead of as a string. This allows -s to sort and -L to limit the property by numeric value rather than as text.

Swish uses strtoul(3) to convert the string into an unsigned long integer. Therefore, only positive integers can be stored.
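
The difference between string and numeric ordering is easy to see in a short Python sketch. The helper to_unsigned is hypothetical and is stricter than strtoul(3), which this example does not try to reproduce.

```python
values = ["9", "100", "25"]

# Stored as strings, the values sort character by character.
print(sorted(values))                    # ['100', '25', '9']

def to_unsigned(text):
    """Convert a digit string to a non-negative int; reject anything else."""
    if not (text.isascii() and text.isdigit()):
        raise ValueError("not an unsigned integer: " + repr(text))
    return int(text)

# Stored as numbers, they sort by magnitude.
print(sorted(values, key=to_unsigned))   # ['9', '25', '100']
```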

Future versions of swish may be able to store different property types (such as negative integers and real numbers). This directive may change in future releases of Swish.

PropertyNamesDate

This directive is exactly like PropertyNamesNumeric, but it also flags the number as a machine timestamp (seconds since epoch), and will print a formatted date when returning this property. See -x in SWISH-RUN.

Swish will not parse dates when indexing; you must use a timestamp.
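
Since the timestamp has to be produced before indexing, a filter or external program needs to do the conversion. The Python sketch below shows one way to turn a calendar date into seconds since the epoch; to_timestamp is a hypothetical helper, and treating the date as midnight UTC is an assumption.

```python
import calendar
from datetime import datetime, timezone

def to_timestamp(year, month, day):
    """Seconds since the epoch for midnight UTC on the given date."""
    return calendar.timegm(datetime(year, month, day).timetuple())

stamp = to_timestamp(1999, 4, 25)
print(stamp)
print(datetime.fromtimestamp(stamp, tz=timezone.utc).isoformat())
```

The resulting integer is the kind of value that would be placed in the document for a PropertyNamesDate property.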

PreSortedIndex *list of property names*

By default Swish generates presorted tables while indexing for each property name. This allows faster sorting when generating results. On large document collections this presorting may add to the indexing time, and also adds to the total size of the index. This directive can be used to customize exactly which properties will be presorted.

If PreSortedIndex is not present in the config file (the default), all properties will be presorted at indexing time. If it is present without any parameter, no properties will be presorted. Otherwise, only the property names specified will be presorted.

For example, if you only wish to sort results by a property called title:

 
    PropertyNames title age time
    PreSortedIndex  title

StoreDescription [XML <tag> size|HTML <meta> size|TXT size]

StoreDescription allows you to store a document description in the index file, and this description is returned in your search results when the -x switch is used to include the swishdescription for extended results.

For text documents you specify the type TXT and the number of characters to capture.

 
    StoreDescription TXT 20

For HTML and XML file types, specify the tag to use for the description, and optionally the number of characters to capture. If the size is not specified, swish will capture the entire contents of the tag.

 
    StoreDescription HTML <body> 20
    StoreDescription XML  <desc> 40

Note that documents must be assigned a document type with IndexContents or DefaultContents to use this feature.

TruncateDocSize *number of characters*

TruncateDocSize limits the number of bytes read from a document while indexing documents and/or using filters. If a document is larger than the specified size, only the specified number of bytes is read.

Example:

 
    TruncateDocSize    10000000

The default is zero, which means read all data.

Warning: If you use TruncateDocSize, use it with care! TruncateDocSize is a safety belt only, to limit e.g. filter output when accessing databases, or to limit ``runaway'' filters. Truncating document input may destroy document structures for swish-e (e.g. swish may miss closing tags for XML or HTML documents).

TruncateDocSize does not currently work with the prog input source method.

UseStemming [yes|NO]

Set to yes to apply the word stemming algorithm during indexing, otherwise no.

 
    UseStemming no
    UseStemming yes

When UseStemming is set to yes every word is stemmed before it is placed into the index.

The stemming function does not convert words to their root, rather programmatically removes endings on words in an attempt to make similar words with different endings stem to the same string of characters. It's not a perfect system, and searches on stemmed indexes often return curious results. For example, two entirely different words may stem to the same word.

Stemming also can be confusing when used with a wildcard (truncation). For example, you might expect to find the word ``running'' by searching for ``runn*''. But this fails when using a stemmed index, as ``running'' stems to ``run'', yet searching for ``runn*'' looks for words that start with ``runn''.
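
The wildcard problem can be reproduced with a toy model in Python. The stemmer below is a deliberately crude suffix stripper written for this example only; it is not the algorithm SWISH-E uses, and both function names are hypothetical.

```python
def toy_stem(word):
    """Toy suffix stripper for illustration only (not SWISH-E's stemmer)."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        # Collapse a doubled final consonant: "runn" -> "run".
        if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    return word

def prefix_search(index_words, prefix):
    """Return indexed words that start with the given prefix."""
    return sorted(w for w in index_words if w.startswith(prefix))

document_words = ["running", "runs"]
plain_index = set(document_words)
stemmed_index = {toy_stem(w) for w in document_words}

print(prefix_search(plain_index, "runn"))     # ['running']
print(prefix_search(stemmed_index, "runn"))   # []
```

The stemmed index holds ``run'' rather than ``running'', so a prefix that is longer than the stem finds nothing.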

It's a good idea to create both a stemmed and a non-stemmed index and allow your search interface to select which index to use.

UseSoundex [yes|NO]

When UseSoundex is set to yes every word is converted to a Soundex code before it is placed into the index.

Soundex was developed in the 1880s so records for people with similar sounding names could be found more readily. Soundex is a coded surname based on the way a surname sounds rather than spelling. Surnames that sound similar, like Smith and Smyth, are filed together under the same Soundex code. This is mostly useful for US English.

Soundex should not be used to search for sound-alike words. Metaphone would be more appropriate for generic sound matching of words. Soundex should only be used where you need to search multiple documents for proper names which sound similar. This is primarily used for indexing genealogical records. This may be useful for indexing other collections of data consisting mostly of names. Many common name variations are matched by Soundex. The only notable exception is the first letter of the name. The first letter is not matched for sound.
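
The coding idea can be sketched in Python as follows. This implements the commonly described American Soundex rules as an illustration only; SWISH-E's own Soundex code may differ in details such as the treatment of h and w, and the function name soundex is chosen for the example.

```python
CODES = {}
for digit, letters in (("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")):
    for letter in letters:
        CODES[letter] = digit

def soundex(name):
    """Return the four-character Soundex code for a name."""
    letters = [c for c in name.lower() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()        # the first letter is kept as-is
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue                   # h and w do not separate equal codes
        code = CODES.get(ch, "")       # vowels and y have no code
        if code and code != previous:
            result += code
        previous = code
    return (result + "000")[:4]        # pad or cut to four characters

for name in ("Smith", "Smyth", "Kathy", "Cathy"):
    print(name, soundex(name))
```

Smith and Smyth receive the same code, while Kathy and Cathy do not, because the first letter is stored literally rather than coded for sound.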

It may be a good idea to create both a Soundex and a non-Soundex index and allow your search interface to select which index to use.

IgnoreTotalWordCountWhenRanking [YES|no]

Set to yes to ignore the total number of words in the file when calculating ranking. This often works better with merges and small files. The default is yes.

 
    IgnoreTotalWordCountWhenRanking no

The default was changed from no to yes in version 2.2.

MinWordLimit *integer*

Set the minimum length of a word. Shorter words will not be indexed. The default is 1 (as defined in src/config.h).

 
    MinWordLimit 5

MaxWordLimit *integer*

Set the maximum length of an indexable word. Longer words will not be indexed. The default is 40 (as defined in src/config.h).

WordCharacters *string of characters*

IgnoreFirstChar *string of characters*

IgnoreLastChar *string of characters*

BeginCharacters *string of characters*

EndCharacters *string of characters*

These settings define what a word consists of to the SWISH-E indexing engine. Compiled in defaults are in src/config.h.

When indexing, SWISH-E uses WordCharacters to split up the document into words. Words are defined by any string of non-blank characters that contain only the characters listed in WordCharacters. If a string of characters includes a character that is not in WordCharacters then the word will be split into two or more separate words.

For example:

 
    WordCharacters abde

Would turn ``abcde'' into two words ``ab'' and ``de''.

Next, of these words, any characters defined in IgnoreFirstChar are stripped off the start of the word, and IgnoreLastChar characters are stripped off the end of the word. This allows, for example, periods within a word (www.slashdot.com), but not at the end of a word. Characters in IgnoreFirstChar and IgnoreLastChar must be in WordCharacters.

Finally, the resulting words MUST begin with one of the characters listed in BeginCharacters and end with one of the characters listed in EndCharacters. BeginCharacters and EndCharacters must be a subset of the characters in WordCharacters. Often, WordCharacters, BeginCharacters and EndCharacters will all be the same.

Note that the same process applies to the query while searching.

Getting these settings correct will take careful consideration and practice. It's helpful to create an index of a single test file, and then look at the words that are placed in the index (see the -v 4, -D and -k searching switches).

Currently there is only support for eight-bit characters.

Example:

 
    WordCharacters  .abcdefghijklmnopqrstuvwxyz
    BeginCharacters abcdefghijklmnopqrstuvwxyz
    EndCharacters   abcdefghijklmnopqrstuvwxyz
    IgnoreFirstChar .
    IgnoreLastChar  .

So the string

 
    Please visit http://www.example.com/path/to/file.html.

will be indexed as the following words:

 
    please
    visit
    http
    www.example.com
    path
    to
    file.html

Which means that you can search for www.example.com as a single word, but searching for just example will not find the document.
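
The three steps described above (split on WordCharacters, strip IgnoreFirstChar and IgnoreLastChar, then check BeginCharacters and EndCharacters) can be modeled in a few lines of Python. This is a sketch of the documented behavior rather than SWISH-E's implementation; split_words is a hypothetical name, and lowercasing the input is an assumption based on the example output.

```python
def split_words(text, word_chars, begin_chars, end_chars,
                ignore_first="", ignore_last=""):
    """Split text into indexable words using the five character sets."""
    words = []
    current = []
    # A trailing space acts as a sentinel that flushes the last word.
    for ch in text.lower() + " ":
        if ch in word_chars:
            current.append(ch)
            continue
        if current:
            word = "".join(current)
            current = []
            # Strip ignorable characters from the start and the end.
            word = word.lstrip(ignore_first).rstrip(ignore_last)
            # Keep only words with valid first and last characters.
            if word and word[0] in begin_chars and word[-1] in end_chars:
                words.append(word)
    return words

letters = "abcdefghijklmnopqrstuvwxyz"
sentence = "Please visit http://www.example.com/path/to/file.html."
print(split_words(sentence, "." + letters, letters, letters, ".", "."))
```

With the settings from the example, the periods inside www.example.com survive because ``.'' is a word character, while the sentence-ending period is stripped by IgnoreLastChar.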

Note: when indexing HTML documents, HTML entities are converted to their character equivalents before being processed with these directives. This is a change from previous versions of SWISH-E where you were required to include the characters 0123456789&#; to index entities. See also ConvertHTMLEntities.

Buzzwords [*list of buzzwords*|File: path]

The Buzzwords option allows you to specify words that will be indexed regardless of WordCharacters, BeginCharacters, EndCharacters, stemming, soundex, and many of the other checks done on words while indexing.

Buzzwords are case insensitive.

Buzzwords should be separated by spaces and may span multiple directives. If the special format File:filename is used then the Buzzwords will be read from an external file during indexing.

Examples:

 
    Buzzwords C++ TCP/IP

 
    Buzzwords File:./buzzwords.lst

If a Buzzword contains search operator characters they must be backslashed when searching. For example:

 
    Buzzwords C++ TCP/IP web=http

 
    ./swish-e -w 'web\=http'

IgnoreWords [*list of stop words*|File: path]

The IgnoreWords option allows you to specify words to ignore, called stopwords. The default is to not use any stopwords.

Words should be separated by spaces and may span multiple directives. If the special format File:filename is used then the stop words will be read from an external file during indexing.

In previous versions of swish you could use the directive

 
    IgnoreWords swishdefault - obsolete!

to include a default list of compiled-in stopwords. This keyword is no longer supported.

Examples:

 
    IgnoreWords www http a an the of and or

 
    IgnoreWords File:./stopwords.de

UseWords [*list of words*|File: path]

UseWords defines the words that swish will index. Only the words listed will be indexed.

You can specify a list of words following the directive (you may specify more than one UseWords directive in a config file), and/or use the File: form to specify a path to a file containing the words:

 
    UseWords perl python pascal fortran basic cobol php
    UseWords File: /path/to/my/wordlist

Please drop the swish-e list a note if you actually use this feature. It may be removed from future versions.

IgnoreLimit *integer integer*

This automatically omits words that appear too often in the files (these words are called stopwords). Specify a whole percentage and a number, such as ``80 256''. This omits words that occur in over 80% of the files and appear in over 256 files. Comment out this directive to turn off auto-stopwording.

 
    IgnoreLimit 50 1000

SWISH-E must do extra processing to adjust the entire index when this feature is used. Instead of relying on this feature, it is recommended that you decide which words are stopwords and add them to IgnoreWords in your configuration file. To do this, use IgnoreLimit one time and note the stop words that are found while indexing. Add this list to IgnoreWords, and then remove IgnoreLimit from the configuration file.
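
The two thresholds can be illustrated with a small Python sketch that finds candidate stopwords in a collection. It is a model of the rule as worded above, not SWISH-E's implementation: auto_stopwords is a hypothetical name, and both limits are interpreted here as counts of files containing the word.

```python
from collections import Counter

def auto_stopwords(documents, percent, min_files):
    """Words found in over `percent`% of files and in over `min_files` files."""
    file_counts = Counter()
    for words in documents:
        file_counts.update(set(words))   # count each word once per file
    total = len(documents)
    return sorted(
        word for word, count in file_counts.items()
        if count > min_files and count * 100 > percent * total
    )

documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["a", "bird", "flew"],
]
print(auto_stopwords(documents, 50, 2))   # ['the']
```

A list produced this way is the kind of result that could then be copied into an IgnoreWords directive.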

IgnoreMetaTags *list of names*

IgnoreMetaTags defines a list of metanames to ignore while indexing XML files. This is useful to avoid indexing specific data from a file. For example:

 
    <person>
        <first_name>
            William
        </first_name>
        <last_name>
            Shakespeare
        </last_name>
        <updated_date>
            April 25, 1999
        </updated_date>
    </person>

In the above example you might not want to index the updated date, and therefore prevent finding this record by searching

 
    -w 'person=(April)'

This is solved by:

 
    IgnoreMetaTags updated_date

Warning: Any data listed in IgnoreMetaTags will not be indexed.

See also UndefinedMetaTags.

IndexComments [NO|yes]

This option lets the user decide whether to index the contents of HTML comments. The default is no. Set to yes if comment indexing is required.

 
    IndexComments yes

Note: This is a change from the default behavior prior to version 2.2.

TranslateCharacters [*string1 string2*|:ascii7:]

The TranslateCharacters directive maps the characters in string1 to the characters listed in string2.

For example:

 
    # This will index a_b as a-b and ámo as amo
    TranslateCharacters _á -a

TranslateCharacters :ascii7: is a predefined set of characters that will translate eight bit characters to ascii7 characters. Using the :ascii7: rule will translate ``Ääç'' to ``aac''. This means: searching ``Çelik'', ``çelik'' or ``celik'' will all match the same word.

TranslateCharacters is done early in the indexing process, after converting HTML entities but before splitting the input text into words based on WordCharacters. So the characters you are translating from do not need to be listed in WordCharacters.

The same character translations take place when searching.
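
The character mapping can be modeled in Python with a translation table. This is a sketch only: translate and ascii7 are hypothetical names, and the Unicode decomposition used to approximate the :ascii7: rule is an assumption, not the table SWISH-E ships with.

```python
import unicodedata

def translate(text, string1, string2):
    """Map each character of string1 to the character at the same
    position in string2."""
    if len(string1) != len(string2):
        raise ValueError("string1 and string2 must have equal length")
    return text.translate(str.maketrans(string1, string2))

def ascii7(text):
    """Approximate :ascii7: by dropping combining accents and lowercasing."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

print(translate("a_b \u00e1mo", "_\u00e1", "-a"))   # a-b amo
print(ascii7("\u00c7elik"))                         # celik
```

Applying the same function to both the indexed text and the query is what makes the accented and unaccented spellings match.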

BumpPositionCounterCharacters *string*

When indexing, SWISH-E assigns a word position to each word. This enables phrase searching. There may be cases where you would like to prevent phrase matching. The BumpPositionCounterCharacters directive allows you to specify a set of characters that when found in the text will increment the word position -- effectively preventing phrase matches across that character.

For example, if you have a META tag:

 
    <!-- META START NAME="subjects" -->
        computer programming | apple computers
    <!-- META END -->

You might want to prevent matching ``programming apple'' in that meta name.

 
    BumpPositionCounterCharacters |

There is no default, and you may list a string of characters.
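
How a bumped position blocks a phrase match can be shown with a short Python model. Both functions are hypothetical, words are simplified to runs of letters and digits, and the assumption that each bump character adds exactly one to the position counter is made for the example.

```python
def word_positions(text, bump_chars=""):
    """Return (word, position) pairs, skipping a position at bump characters."""
    positions = []
    position = 0
    word = []
    for ch in text.lower() + " ":      # trailing space flushes the last word
        if ch.isalnum():
            word.append(ch)
            continue
        if word:
            position += 1
            positions.append(("".join(word), position))
            word = []
        if ch in bump_chars:
            position += 1              # leave a gap in the numbering
    return positions

def phrase_matches(positions, phrase):
    """True if the phrase words occur at consecutive positions."""
    wanted = phrase.lower().split()
    if not wanted:
        return False
    for i in range(len(positions) - len(wanted) + 1):
        window = positions[i:i + len(wanted)]
        same_words = [w for w, _ in window] == wanted
        adjacent = all(b[1] - a[1] == 1 for a, b in zip(window, window[1:]))
        if same_words and adjacent:
            return True
    return False

text = "computer programming | apple computers"
bumped = word_positions(text, "|")
print(bumped)
print(phrase_matches(bumped, "programming apple"))   # False
print(phrase_matches(bumped, "apple computers"))     # True
```

Without the bump character the words would be numbered 1 to 4 and ``programming apple'' would match as a phrase.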

DontBumpPositionOnMetaTags *list of names*

Since metatags are typically separate data fields, the word position counter is automatically bumped between metatags. This prevents matching a phrase that spans more than one metaname. DontBumpPositionOnMetaTags disables this feature for the listed metanames.

For example,

 
    <person>
        <first_name>
            William
        </first_name>
        <last_name>
            Shakespeare
        </last_name>
        <updated_date>
            April 25, 1999
        </updated_date>
    </person>

In the configuration file:

 
    DontBumpPositionOnMetaTags last_name

This configuration allows this phrase search:

 
    -w 'person=("william shakespeare")'

but this phrase search will fail:

 
    -w 'person=("shakespeare april")'
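
The two results can be reproduced with a small Python model. It assumes, purely for illustration, that the position counter is bumped once on entering each metatag unless that metatag is listed in DontBumpPositionOnMetaTags; the helpers field_positions and adjacent are hypothetical and do not describe SWISH-E's internals.

```python
def field_positions(fields, dont_bump=()):
    """fields is a list of (metaname, text) pairs in document order.
    Returns (word, position) pairs for every word."""
    result = []
    position = 0
    for name, text in fields:
        if name not in dont_bump:
            position += 1  # gap between this field and the previous one
        for word in text.lower().split():
            position += 1
            result.append((word, position))
    return result


def adjacent(positions, first, second):
    """True if second directly follows first (each word assumed unique)."""
    lookup = dict(positions)
    return lookup[second] - lookup[first] == 1


person = [
    ("first_name", "William"),
    ("last_name", "Shakespeare"),
    ("updated_date", "April 25, 1999"),
]
positions = field_positions(person, dont_bump={"last_name"})
print(adjacent(positions, "william", "shakespeare"))
print(adjacent(positions, "shakespeare", "april"))
```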
    


Directives for the File Access method only

Some directives have different uses depending on the source of the documents. These directives are only valid when using the File system method of indexing.

IndexOnly *list of file suffixes*

This directive specifies the allowable file suffixes (extensions) while indexing. The default is to index all files specified in IndexDir.

 
    # Only index .html .htm and .q files
    IndexOnly .html .htm .q

FollowSymLinks [yes|NO]

Set to ``yes'' to follow symbolic links while indexing, or ``no'' to skip them. The default is no.

 
    FollowSymLinks no
    FollowSymLinks yes

Note that when this is set to no, extra stat(2) system calls must be made for each file. For a large number of files you may see a small reduction in indexing time by setting this to yes.

See also the -l switch in SWISH-RUN.

FileRules [contains|is] *regular expression*

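Files matching the specified criteria will not be indexed. Patterns may use regular expressions as supported by the C regex.h library. As the examples show, each rule first names what is tested (pathname, filename, title, or directory).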

 
    FileRules pathname contains .*dir1
    FileRules filename contains # % ~ .bak .orig .old old.
    FileRules title contains construction example pointers
    FileRules directory contains .htaccess
    FileRules filename is index

Note: FileRules title works for any input method (fs, prog, or http) that is parsed as HTML, and where a title was found in the document.
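
The filename rules above can be approximated in Python as follows. This is a sketch under stated assumptions: Python's re module stands in for the C regex.h library (the two are not identical), ``contains'' is modeled as a search anywhere in the name, ``is'' as a match of the whole name, and the function is_excluded is hypothetical.

```python
import re


def is_excluded(filename, contains_patterns=(), is_patterns=()):
    """Return True if filename matches any exclusion rule."""
    # "contains": the pattern may match anywhere in the name.
    if any(re.search(p, filename) for p in contains_patterns):
        return True
    # "is": the pattern must match the whole name (an assumption).
    return any(re.fullmatch(p, filename) for p in is_patterns)


# FileRules filename contains # % ~ .bak .orig .old old.
contains_rules = ["#", "%", "~", ".bak", ".orig", ".old", "old."]
# FileRules filename is index
is_rules = ["index"]

for name in ("notes.bak", "draft~", "report.html", "index", "golden.html"):
    print(name, is_excluded(name, contains_rules, is_rules))
```

In this model the dot is a regular expression wildcard rather than a literal period, so a pattern such as .old also excludes a name like golden.html.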


Directives for the HTTP Access Method Only

These directives are available when using the HTTP Access Method of indexing.

MaxDepth *integer*

MaxDepth defines how many links the spider should follow before stopping. A value of 0 configures the spider to traverse all links. The default is MaxDepth 5.

 
    MaxDepth 5

Delay *seconds*

The number of seconds to wait between issuing requests to a server. This setting allows for more friendly spidering of remote sites. The default is 60 seconds.

 
    Delay 1

TmpDir *path*

The location of a writable temp directory on your system. The HTTP access method tells the Perl helper to place its files in this location, and the -e switch causes swish to use this directory while indexing. The default is /var/tmp.

 
    TmpDir /tmp/swish/

If this directory does not exist or is not writable, SWISH-E will fail with an error during indexing.

SpiderDirectory *path*

The location of the Perl helper script called swishspider. If you use a relative directory, it is relative to the current working directory when you run SWISH-E, not to the directory that SWISH-E is in. The default is ./

 
    SpiderDirectory /usr/local/swish/

EquivalentServer *server alias*

The same site is often referred to by different names. A common example is that http://www.some-server.com and http://some-server.com are often the same. Each line should list all the method/name combinations that should be considered equivalent. Multiple EquivalentServer directives may be used; each directive defines its own set of equivalent servers.

 
    EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
    EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
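
One way to picture what equivalent servers mean for a spider is URL normalization: every alias is rewritten to a single name, so the same page is not treated as two different documents. The Python sketch below is only a model. Choosing the first name on each line as the canonical one is an assumption, and build_alias_map and canonical_url are hypothetical helpers.

```python
from urllib.parse import urlsplit


def build_alias_map(equivalent_sets):
    """Map every listed server to the first name in its set."""
    alias_map = {}
    for names in equivalent_sets:
        canonical = names[0]
        for name in names:
            alias_map[name.rstrip("/").lower()] = canonical
    return alias_map


def canonical_url(url, alias_map):
    """Rewrite the method://server part of url to its canonical alias."""
    parts = urlsplit(url)
    server = f"{parts.scheme}://{parts.netloc}".lower()
    base = alias_map.get(server, server)
    path = parts.path or "/"
    query = f"?{parts.query}" if parts.query else ""
    return f"{base}{path}{query}"


# One inner list per EquivalentServer directive in the example above.
aliases = build_alias_map([
    ["http://library.berkeley.edu", "http://www.lib.berkeley.edu"],
    ["http://sunsite.berkeley.edu:2000", "http://sunsite.berkeley.edu"],
])
print(canonical_url("http://www.lib.berkeley.edu/index.html", aliases))
```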


Directives for the prog Access Method Only

This section details the directives that are only available for the ``prog'' document source feature of swish. The ``prog'' access method runs an external program that ``feeds'' documents to swish. This allows indexing and filtering of documents from any source.

A number of example programs for use with the ``prog'' access method are provided in the prog-bin directory. Please see those examples if you have questions about implementing a ``prog'' input program.

SwishProgParameters *list of parameters*

This is a list of parameters that will be sent to the external program when running with the ``prog'' document source method.

 
    SwishProgParameters /path/to/config hello there
    IndexDir /path/to/program.pl

Then running:

 
    swish-e -c config -S prog

swish will execute /path/to/program.pl and pass /path/to/config hello there as three command line arguments to the program. This directive makes it easy to pass settings from the swish-e configuration file to the external program.

For example, the spider.pl program (included in the prog-bin directory) uses the SwishProgParameters to specify what file to read for configuration information.

 
    SwishProgParameters spider.config
    IndexDir ./spider.pl

The spider.pl program also has a default action so you can avoid using a configuration file:

 
    SwishProgParameters default http://www.swishe.org/ http://some.other.site/
    IndexDir ./spider.pl

And the spider program will use default settings for spidering those sites.
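
To show how the values from SwishProgParameters reach an external program, here is a minimal Python sketch of a ``prog'' input program. It assumes the output format described in SWISH-RUN for the prog method (a Path-Name header and a Content-Length header giving the size in bytes, then a blank line, then the document); check that page for the exact headers your version expects. The document names and the functions format_document and emit_parameters are made up for this illustration.

```python
import sys


def format_document(path_name, content):
    """Build one document: headers, a blank line, then the content."""
    body = content.encode("utf-8")
    # Content-Length counts bytes, not characters.
    header = "Path-Name: %s\nContent-Length: %d\n\n" % (path_name, len(body))
    return header.encode("utf-8") + body


def emit_parameters(argv, out):
    """Write one tiny document per command line parameter to out."""
    # argv[1:] holds the values listed in SwishProgParameters.
    for number, parameter in enumerate(argv[1:], start=1):
        text = "Parameter %d was: %s\n" % (number, parameter)
        out.write(format_document("param-%d.txt" % number, text))


if __name__ == "__main__":
    emit_parameters(sys.argv, sys.stdout.buffer)
```

With SwishProgParameters /path/to/config hello there, such a program would receive three parameters and feed three small documents to swish.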


Document Filter Directives

Internally, SWISH-E knows how to parse only text, HTML, and XML documents. With SWISH-E filters you can index other types of documents. For example, if all your web pages are in gzip format a filter can uncompress these on the fly for indexing.

A filter is an external program that swish executes when processing a document of a given type. SWISH-E will execute the filter program for each file that matches the file extension set in the FileFilter directive.

SWISH-E calls the external program passing as default arguments:

$0

the name of the filter program

$1

the physical path name of the file to read. This may be a temporary file location if indexing by the http method.

$2

When indexing under the file system this will be the same as $1 (the path to the source file), but when indexing under the http method this will be the URL of the source document.

SWISH-E can also pass other parameters to the filter program. These parameters can be defined using the FileFilter directive. See Filter Options below.

The filter program must open the file, process its contents, and return it to SWISH-E by printing to STDOUT.

Note that filtering can add a significant amount of time to the indexing process. If you have many files to filter you should consider writing your filter in C instead of a shell or Perl script, or using the ``prog'' Access Method.
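
The calling convention described above can be sketched as a small filter written in Python. This is an illustration under assumptions, not a filter shipped with SWISH-E: it takes the first argument as the physical path of a gzipped file, ignores the second argument, and writes the uncompressed data to standard output. The function name filter_gzip is hypothetical.

```python
import gzip
import shutil
import sys


def filter_gzip(work_path, out):
    """Uncompress work_path and copy the result to the binary stream out."""
    with gzip.open(work_path, "rb") as source:
        shutil.copyfileobj(source, out)


# sys.argv[1] is the file to read; sys.argv[2] (the document path or URL)
# is not needed by this filter.
if __name__ == "__main__" and len(sys.argv) > 1:
    filter_gzip(sys.argv[1], sys.stdout.buffer)
```

Keeping the work in a function that takes an output stream makes the filter easy to try out on a sample file before wiring it into a FileFilter directive.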

FilterDir *path-to-directory*

This is the path to a directory where the filter programs are stored. SWISH-E looks in this directory to find the filter specified in the FileFilter directive. If this directive is omitted, you have to specify the full path to the filter script on each FileFilter directive.

Example:

 
    FilterDir /usr/local/swish/filters

FileFilter *extension* "filter-prog" ["filter-options"]

This maps file extensions to a filter program. If filter-prog starts with a directory delimiter (absolute path), SWISH-E doesn't use the FilterDir settings, but uses the given filter-prog path directly.

Filter options: Filter options are a string passed as arguments to the filter-prog. Filter options can contain variables, which SWISH-E replaces before running the filter.

If you omit filter-options, SWISH-E will use default parameters for the options.

 
        Default:      "'%p' '%P'"
        Which means:  pass "workfile path" and "documentfile path" to the filter (each quoted).

 
   Variables in filter options:

 
       %%   =  %
       %P   =  Full document pathname (e.g. URL, or path on filesystem)  
       %p   =  Full pathname to work file (may be a temporary file or the real document path on the filesystem)
       %F   =  Filename stripped from full document pathname
       %f   =  Filename stripped from "work" pathname
       %D   =  Directoryname stripped from full document pathname
       %d   =  Directoryname stripped from full "work" pathname

 
       Example:
          %P =  document pathname:  http://myserver/path1/mydoc.txt
          %p =  work pathname:      /tmp/tmp.1234.mydoc.txt
          %F =     mydoc.txt
          %f =     tmp.1234.mydoc.txt
          %D =     http://myserver/path1
          %d =     /tmp

 
       Important hint for security:
           When using variable substitution, quote each variable to preserve
           filename integrity, e.g. "'%f'"  -->  'file name with spaces.doc'.
           Without the quotes, your system security may be compromised, or
           filtering may not work for such files.
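
The variable substitution can be modeled in a few lines of Python. This sketch only mirrors the table and example above. It is not SWISH-E's code, the function expand_filter_options is hypothetical, and leaving unknown %x sequences untouched is an assumption.

```python
import posixpath
import re


def expand_filter_options(options, doc_path, work_path):
    """Replace the % variables in a filter-options string."""
    values = {
        "%": "%",
        "P": doc_path,                       # full document pathname or URL
        "p": work_path,                      # full pathname of the work file
        "F": posixpath.basename(doc_path),   # filename part of %P
        "f": posixpath.basename(work_path),  # filename part of %p
        "D": posixpath.dirname(doc_path),    # directory part of %P
        "d": posixpath.dirname(work_path),   # directory part of %p
    }
    # Only the sequences listed in the table are substituted.
    return re.sub(r"%([%PpFfDd])", lambda m: values[m.group(1)], options)


doc = "http://myserver/path1/mydoc.txt"
work = "/tmp/tmp.1234.mydoc.txt"
print(expand_filter_options("-d '%d' -example -url '%P' '%f'", doc, work))
```

Note that the single quotes in the options string are passed through unchanged, which is what keeps a filename with spaces together as one argument.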

Examples for filters:

 
    FileFilter .pdf       pdftotext   "'%p' -"
    FileFilter .doc       /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
    FileFilter .html.gz   gzip  "-c '%p'"
    FileFilter .pdf       pdf2html.sh
    FileFilter .html.gz   ungzip-html
    FileFilter .doc       /usr/local/filters/wword-filter.sh
    FileFilter .dot       wword-filter.sh
    FileFilter .ps        ghostscript-filter.pl
    FileFilter .mydoc     "/some/path/mydocfilter"  "-d '%d' -example -url '%P' '%f'"

Here is a simple example of a filter written in Perl. Again, if indexing speed is an issue, you should try to avoid running shell or Perl scripts as filters, since the scripts can significantly slow down indexing. But for a small number of files to filter, this method works well and is easy to implement.

Convert gzipped files to text:

 
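    #!/usr/local/bin/perl -w
    use strict;
    use Compress::Zlib ;
    
    # $ARGV[0] is the physical path of the file to read (the work file);
    # $ARGV[1], the document path or URL, is not needed here.
    my $file = shift;
    
    die "Usage: gzcat file...\n"
        unless $file;
    
    my $gz = gzopen($file, 'rb')
        or die "Cannot open $file: $gzerrno\n" ;
    
    # Copy the uncompressed data to STDOUT, where SWISH-E reads it.
    my $buffer;
    print $buffer
        while $gz->gzread($buffer) > 0 ;
    
    die "Error reading from $file: $gzerrno\n"
        if $gzerrno != Z_STREAM_END ;
    
    $gz->gzclose() ;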


Document Info

$Id: SWISH-CONFIG.pod,v 1.21 2001/06/17 04:13:33 whmoseley Exp $
