Which files SWISH-E indexes, how they are indexed, and where the index is written can all be controlled by a configuration file. The configuration file is passed to swish as a command line argument with the -c switch (see SWISH-RUN).
The configuration file is a text file composed of comments, blank lines, and configuration directives. The order of the directives is not important. Some directives may be used more than once in the configuration file, while others can only be used once (that is, a later occurrence overwrites the preceding one). The case of a directive name is not important -- you may use upper, lower, or mixed case.
A comment is any line that begins with a ``#''.
# This is a comment
Commented example configuration files are included in the config directory of the SWISH-E distribution.
Typically, configuration file directives are grouped together in some logical order -- that is, directives that control the source of the documents are grouped together first, and directives that control how each document is filtered or how its words are indexed are placed in another group. (The directives listed below are grouped in this order.)
You may also split your directives up into different configuration files
and specify more than one configuration file. This allows you to have a
master configuration file used for many different indexes, and smaller
configuration files for each separate index. You can specify the different
configuration files when running from the command line with the -c
switch, or you may include other configuration files with the
IncludeConfigFile directive.
Some command line arguments can override directives specified in the configuration file. See also SWISH-RUN for instructions on running SWISH-E, and SWISH-SEARCH for information and examples on how to search your index.
The configuration file is specified to SWISH-E by the -c switch. For example,
swish-e -c myconfig.conf
The configuration file directives are listed below in these groups:
Administrative Headers Directives -- You may add administrative information to the header of the index file.
Document Source Directives -- Directives for selecting the source documents and the location of the index file.
Document Contents Directives -- Directives that control how a document content is indexed.
Directives for the File Access method only -- These directives are only applicable to the File Access indexing method.
Directives for the HTTP Access Method Only -- Likewise, these only apply to the HTTP Access method.
Directives for the prog Access Method Only -- These only apply to the prog Access method.
Document Filter Directives -- This is a special section that describes using document filters with Swish-e.
BeginCharacters *string of characters*
BumpPositionCounterCharacters *string*
Buzzwords [*list of buzzwords*|File: path]
ConvertHTMLEntities [YES|no]
DefaultContents [TXT|HTML|XML|WML|LST]
Delay *seconds*
DontBumpPositionOnMetaTags *list of names*
EnableAltSearchSyntax [yes|NO]
EndCharacters *string of characters*
EquivalentServer *server alias*
FileInfoCompression [yes|NO]
FileRules [contains|is] *regular expression*
FollowSymLinks [yes|NO]
IgnoreFirstChar *string of characters*
IgnoreLastChar *string of characters*
IgnoreLimit *integer integer*
IgnoreMetaTags *list of names*
IgnoreTotalWordCountWhenRanking [YES|no]
IgnoreWords [*list of stop words*|File: path]
IndexAdmin *text*
IndexComments [YES|no]
IndexContents [TXT|HTML|XML|WML|LST] *file extensions*
IndexDescription *text*
IndexDir [URL|directories or files]
IndexFile *path*
IndexName *text*
IndexOnly *list of file suffixes*
IndexPointer *text*
IndexReport [0|1|2|3]
MaxDepth *integer*
MaxWordLimit *integer*
MetaNames *list of names*
MinWordLimit *integer*
NoContents *list of file suffixes*
PreSortedIndex *list of property names*
PropertyNames *list of meta names*
PropertyNamesDates *list of meta names*
PropertyNamesNumeric *list of meta names*
ReplaceRules [replace|remove|prepend|append]
ResultExtFormatName name -x format string
SpiderDirectory *path*
StoreDescription [XML <tag>|HTML <meta>|TXT size]
SwishProgParameters *list of parameters*
SwishSearchDefaultRule [<AND-WORD>|<or-word>]
SwishSearchOperators <and-word> <or-word> <not-word>
TmpDir *path*
TranslateCharacters [*string1 string2*|:ascii7:]
TruncateDocSize *number of characters*
UndefinedMetaTags [error|ignore|index|auto]
UseStemming [yes|NO]
UseSoundex [yes|NO]
UseWords [*list of words*|File: path]
WordCharacters *string of characters*
These configuration directives control the general behavior of SWISH-E.
IncludeConfigFile includes configuration directives located in another file.
IncludeConfigFile /usr/local/swish/conf/site_config.config
IndexReport sets how detailed the reporting is while indexing. You can specify a number from 0 to 3: 0 is totally silent and 3 is the most verbose. The default is 3, so you probably should define this.
IndexReport 1
This may be overridden from the command line via the -v switch (see SWISH-RUN).
IndexFile specifies the location of the generated index file. If not specified, SWISH-E will create the file index.swish-e in the current directory.
IndexFile /usr/local/swish/site.index
The following items are currently not available; they require Swish to parse a configuration file while searching.
Example:
swish-e -w "+word1 +word2 -word3 word4 word5"
"+" = following word has to be in all found documents
"-" = following word may not be in any document found
" " = following word will be searched in documents
SwishSearchOperators <and-word> <or-word> <not-word>
Using this config directive you can change the boolean search operators of swish-e, e.g. to adapt these to your language. The default is: AND OR NOT
Example (German):
SwishSearchOperators UND ODER NICHT
SwishSearchDefaultRule defines the default Boolean operator to use if none is specified between words or phrases. The default is AND.
The word you specify must match one of the available SwishSearchOperators.
Example:
SwishSearchOperators UND ODER NICHT
# Make it act like a web search engine
SwishSearchDefaultRule ODER
The output of swish can be defined by specifying a format string with the -x command line argument. Using ResultExtFormatName you can assign a predefined format string to a name.
Examples:
ResultExtFormatName moreinfo "%c|%r|%t|%p|<author>|<publishyear>\n"
Then when searching you can specify the format string's name:
swish-e ... -x moreinfo ...
See the -x switch in SWISH-RUN for more information about output formats.
SWISH-E stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the SWISH-E C library. There are a number of fields available for your own use. None of these fields are required:
These directives specify information that goes into index files to help users and administrators. IndexName should be the name of your index, like a book title. IndexDescription is a short description of the index or a URL pointing to a fuller description. IndexPointer should be a pointer to the original information, most likely a URL. IndexAdmin should be the name of the index maintainer and can include name and email information. These values should not be more than 70 or so characters and should be contained in quotes. Note that the automatically generated date in index files is in D/M/Y and 24-hour format.
Examples:
IndexName "Linux Documentation"
IndexDescription "This is an index of /usr/doc on our Linux machine."
IndexPointer "http://localhost/swish/linux/index.html"
IndexAdmin "webmaster"
These directives control what documents are indexed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.
IndexDir defines the source of the documents for SWISH-E. SWISH-E currently supports three file access methods: File system, HTTP (also called spidering), and prog for reading files from an external program.
The -S command line argument is used to select the file access method.
swish-e -c swish.config -S fs - file system
swish-e -c swish.config -S http - spider
swish-e -c swish.config -S prog - external program
For the File system method of access IndexDir is a space-separated list of files and directories to index. You may specify more than one IndexDir directive.
Any sub-directories of any listed directory will also be indexed.
Examples:
# Index this directory and any subdirectories
IndexDir /usr/local/home/http
# Index the docs directory in current directory
IndexDir ./docs
# Index these files in the current directory
IndexDir ./index.html ./page1.html ./page2.html
# and index this directory, too
IndexDir ../public_html
For the HTTP method of access specify the URLs from which you want the spidering to begin.
Example:
IndexDir http://www.my-site.com/index.html
IndexDir http://localhost/index.html
Obviously, using the HTTP method to index is much slower than indexing local files. Be well aware that some sites do not appreciate spidering and may block your IP address. You may wish to contact the remote site before spidering their web site. More information about spidering can be found in Directives for the HTTP Access Method Only below.
For the prog method of access IndexDir specifies the path to the program(s)
to execute. The external
program must correctly format the documents being passed back to swish.
Examples of external programs are provided in the prog-bin directory.
Note: Not all directives work with all methods.
NoContents lists file suffixes. Files with these suffixes will not have their contents indexed, but their file names will be indexed. File names are not normally indexed. If you specify .html or .htm and a <TITLE> section is found, the words in the title will be indexed; otherwise the file name will be indexed.
NoContents .gif .xbm .au .mov .mpg .pdf .ps
ReplaceRules allows you to make changes to file pathnames before they're indexed. These changed file names or URLs will be returned in search results.
For example, you may index your files locally (with the File system indexing method), yet return a URL in search results. This directive can be used to map the file names to their respective URLs on your web server.
There are four operations you can specify: replace, append, remove, and prepend. They will parse the pathname in the order you've typed these commands.
More than one command and its arguments can appear on the same line, but it's easier to read when commands are broken up over a few lines. You can't put a command and its argument(s) on different lines, however.
This directive uses C library regex.h regular expressions.
replace "the string you want replaced" "what to change it to"
This replaces all occurrences of the old string with the new one.
remove "a string to remove"
prepend "a string to add before the result"
append "a string to add after the result"
Examples:
ReplaceRules replace "testdir/" "anotherdir/"
ReplaceRules replace "[a-z_0-9]*_m.*\.html" "index.html"
ReplaceRules remove "testdir/"
ReplaceRules prepend "http://localhost/"
ReplaceRules append ".html"
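The way the four operations are applied in order can be sketched in a few lines of Python. This is only an illustration of the idea, not SWISH-E's implementation: the function name apply_replace_rules and the /var/www/htdocs/ document root are hypothetical, and Python's re module is used as a stand-in for the C library regex.h syntax, which differs in some details.

```python
import re

def apply_replace_rules(path, rules):
    """Apply (operation, argument...) tuples to a path in the order given."""
    for op, *args in rules:
        if op == "replace":
            pattern, replacement = args
            # A function is used as the replacement so that backslashes in
            # the replacement text are taken literally.
            path = re.sub(pattern, lambda m: replacement, path)
        elif op == "remove":
            path = re.sub(args[0], "", path)
        elif op == "prepend":
            path = args[0] + path
        elif op == "append":
            path = path + args[0]
        else:
            raise ValueError(f"unknown operation: {op}")
    return path

# Map a local file name to a URL (hypothetical document root).
rules = [
    ("remove", r"^/var/www/htdocs/"),
    ("prepend", "http://localhost/"),
]
print(apply_replace_rules("/var/www/htdocs/docs/index.html", rules))
```

Because the rules run in the order listed, swapping the two rules above would give a different result: the prefix would be added first and the remove pattern would then no longer match at the start of the string.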
The IndexContents directive assigns one of Swish's document parsers to a document, based on its file extension. Swish currently knows how to parse TXT, HTML, and XML documents. LST files are special multiple-document XML files, described below. WML uses the HTML parser.
Documents that are not assigned a parser with IndexContents will, by default, use the HTML parser. The DefaultContents directive may be used to assign a parser to documents that do not match a file extension defined with the IndexContents directive.
Example:
IndexContents HTML .htm .html .shtml
IndexContents TXT .txt .log .text
IndexContents XML .xml
HTML is the default type for all files, unless otherwise specified (and this default can be changed by the DefaultContents directive). Swish parses titles from HTML files, if available, and keeps track of the context of the text for context searching (see -t in SWISH-RUN). HTML and XML files use different tag formats for MetaNames and PropertyNames.
If you use filters to convert documents you should include those extensions, too. For example, if using a filter to convert .pdf to .html, you need to tell swish that .pdf should be indexed by the internal HTML parser:
FileFilter .pdf pdf2html
IndexContents HTML .pdf
See also Document Filter Directives.
LST files are XML files that contain multiple documents. The documents are separated by the first XML tag found in the document (each time that tag is found it is considered a new document).
<tag1> <== First document
bla, bla
</tag1>
<tag2>
bla, bla
</tag2>
<tag1> <== Second document
...
When reporting results from a query, swish will still return the document name, but will also return a document offset in the property swishstartpos, and the length of the sub-document in swishdocsize (these properties are available by using the -x format option).
For example, you could have text files that contain SQL queries, and the queries might generate quite a number of results (documents) from a database. You can instruct swish to ``index'' these files, and use a filter to convert the SQL queries into documents. In other words, Swish indexes a file, but a swish filter converts that file (which contains the SQL query statement) into a query and returns, perhaps, many documents.
For example, the swish config file might look like this:
IndexDir ./test.sql
IndexFile ./test.index
MetaNames tag1 tag2 tag3
PropertyNames tag2 tag3
FilterDir ./
# Here is the main part
IndexContents LST .sql
FileFilter .sql mysqlfilter.sh
# If you also want a desc use XML not LST
StoreDescription XML <meta1>
Then, the *.sql files can contain the queries. For example,
select tag1,tag2,tag3 from my_table
The mysqlfilter.sh program should read the *.sql file, process the query/select, and format the output in the ``LST'' style:
<tag1> data </tag1>
<tag2> more data </tag2>
<tag3> even more data </tag3>
<tag1> start of a new document
...
Care must be taken when returning multiple document files to swish, as swish will load all data into memory for each file. In other words, don't try to index thousands of documents as a single LST type of document.
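To make the LST splitting rule concrete, here is a minimal Python sketch that cuts a multi-document string at each occurrence of the first tag and reports an offset and a size for every sub-document, in the spirit of swishstartpos and swishdocsize. The function name split_lst is hypothetical, and the exact way swish computes those two properties (for example bytes versus characters) is an assumption.

```python
import re

def split_lst(text):
    """Return (start, size) pairs, one per sub-document."""
    # The first opening tag found decides where documents begin.
    first = re.search(r"<([A-Za-z_][\w.-]*)[^>]*>", text)
    if first is None:
        return []
    opener = re.compile("<" + re.escape(first.group(1)) + r"(?=[\s>])")
    starts = [m.start() for m in opener.finditer(text)]
    # Each sub-document runs up to the next opener, or to the end of input.
    ends = starts[1:] + [len(text)]
    return [(start, end - start) for start, end in zip(starts, ends)]

sample = "<tag1>a</tag1><tag2>b</tag2><tag1>c</tag1><tag2>d</tag2>"
for start, size in split_lst(sample):
    print(start, size, sample[start:start + size])
```

Note that the whole input has to be in memory before it can be split this way, which matches the warning above about very large LST files.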
Note: Some of this may be changed in the future to use content-types instead of file extensions. See SWISH-3.0
DefaultContents sets the default parser for documents that are not specified in IndexContents. If not specified the default is HTML.
Example:
DefaultContents HTML
The DefaultContents directive should be used when spidering, as HTML files may be returned without a file extension (such as when requesting a directory and the default index.html is returned).
** This directive is currently not supported **
Setting FileInfoCompression to yes will compress the index file to save disk space. This may result in longer indexing times. The default is no.
Also see the -e switch in SWISH-RUN for saving RAM during indexing.
These directives control what information is extracted from your source documents, and how that information is made available during searching.
ConvertHTMLEntities controls whether HTML entities are converted automatically while indexing documents of type HTML and XML. For performance reasons you may wish to set this to no if your documents do not contain HTML entities. The default is yes.
If ConvertHTMLEntities is set to no the entities will be indexed without conversion.
META names are a way to define ``fields'' in your XML and HTML documents. You can use the META names in your queries to limit the search to just the words contained in that META name of your document. For example, you might have a META tagged field in your documents called subjects and then you can search your documents for the word ``foo'' but only return documents where ``foo'' is within the subjects META tag.
swish-e -w subjects=foo
(See also the -t switch in SWISH-RUN for information about context searching in HTML documents.)
The MetaNames directive is a space separated list. For example:
MetaNames meta1 meta2 keywords subjects
You may also use UndefinedMetaTags to specify automatic extraction of meta names from your HTML and XML documents.
META tags can have two formats in your HTML source documents:
<META NAME="meta1" CONTENT="some content">
and
<!-- META START NAME="meta1" --> some content <!-- META END -->
And in XML documents, use the format:
<meta1> Some Content </meta1>
Then you can limit your search to just META meta1 like this:
swish-e -w 'meta1=(apples or oranges)'
You may nest the XML and the start/end tag versions:
<keywords>
<tag1> some content </tag1>
<tag2> some other content </tag2>
</keywords>
Then you can search in both tag1 and tag2 with:
swish-e -w 'keywords=(query words)'
MetaNames are case sensitive in XML documents.
UndefinedMetaTags defines the behavior of swish during indexing when a meta name is found but is not listed in MetaNames. There are four choices: error, ignore, index, and auto.
SWISH-E allows you to specify certain META tags that can be used as document properties. The contents of any META tag that has been identified as a document property can be returned as part of the search results along with the rank, file name, title, and document size (see the -p and -x switches in SWISH-RUN).
Properties are useful for returning additional data from documents in search results -- this saves the effort of reading and parsing the source files while reading SWISH-E search results, and is especially useful when the source documents are no longer available or slow to access (e.g. over http).
Another feature of properties is that SWISH-E can use the PropertyNames for sorting the search results (see the -s switch).
PropertyNames author subjects
Note that the PropertyNames listed must also be listed in the MetaNames directive. Property names are case sensitive in XML documents.
Use of PropertyNames will increase the size of your index file, sometimes significantly.
PropertyNamesNumeric is similar to PropertyNames, but it flags the property as a string of digits that will be stored as binary data instead of as a string. This allows sorting with -s and limiting with -L to work correctly on the property.
Swish uses strtoul(3) to convert the string into an unsigned long integer. Therefore, only positive integers can be stored.
Future versions of swish may be able to store different property types (such as negative integers and real numbers). This directive may change in future releases of Swish.
PropertyNamesDates is exactly like PropertyNamesNumeric, but it also flags the number as a machine timestamp (seconds since epoch), and will print a formatted date when returning this property. See -x in SWISH-RUN.
Swish will not parse dates when indexing; you must use a timestamp.
By default Swish generates presorted tables while indexing for each property name. This allows faster sorting when generating results. On large document collections this presorting may add to the indexing time, and also adds to the total size of the index. The PreSortedIndex directive can be used to customize exactly which properties will be presorted.
If PreSortedIndex is not present in the config file (default action), all the properties will be presorted at indexing time. If it is present without any parameter, no properties will be presorted. Otherwise, only the property names specified will be presorted.
For example, if you only wish to sort results by a property called title:
PropertyNames title age time
PreSortedIndex title
StoreDescription allows you to store a document description in the index file, and this description is returned in your search results when the -x switch is used to include the swishdescription for extended results.
For text documents you specify the type TXT and the number of characters to capture.
StoreDescription TXT 20
For HTML and XML file types, specify the tag to use for the description, and optionally the number of characters to capture. If the number is not specified, swish will capture the entire contents of the tag.
StoreDescription HTML <body> 20
StoreDescription XML <desc> 40
Note that documents must be assigned a document type with IndexContents or DefaultContents to use this feature.
TruncateDocSize limits the size of a document while indexing documents and/or using filters. It truncates the number of bytes read from a document to the specified size: if a document is larger, only the specified number of bytes is read.
Example:
TruncateDocSize 10000000
The default is zero, which means read all data.
Warning: If you use TruncateDocSize, use it with care! TruncateDocSize is a safety belt only, for example to limit filter output when accessing databases, or to limit ``runaway'' filters. Truncating document input may destroy document structures for swish-e (e.g. swish may miss closing tags for XML or HTML documents).
TruncateDocSize does not currently work with the prog input source method.
Set UseStemming to yes to apply the word stemming algorithm during indexing, otherwise no.
UseStemming no
UseStemming yes
When UseStemming is set to yes every word is stemmed before placing it into the index.
The stemming function does not convert words to their root, rather programmatically removes endings on words in an attempt to make similar words with different endings stem to the same string of characters. It's not a perfect system, and searches on stemmed indexes often return curious results. For example, two entirely different words may stem to the same word.
Stemming also can be confusing when used with a wildcard (truncation). For example, you might expect to find the word ``running'' by searching for ``runn*''. But this fails when using a stemmed index, as ``running'' stems to ``run'', yet searching for ``runn*'' looks for words that start with ``runn''.
It's a good idea to create both a stemmed and non-stemmed index and allow your search interface to select which index to use.
When UseSoundex is set to yes every word is converted to a Soundex code before placing it into the index.
Soundex was developed in the early 1900s (it was patented in 1918) so records for people with similar sounding names could be found more readily. Soundex is a coded surname based on the way a surname sounds rather than spelling. Surnames that sound similar, like Smith and Smyth, are filed together under the same Soundex code. This is mostly useful for US English.
Soundex should not be used to search for sound-alike words. Metaphone would be more appropriate for generic sound matching of words. Soundex should only be used where you need to search multiple documents for proper names which sound similar. This is primarily used for indexing genealogical records. This may be useful for indexing other collections of data consisting mostly of names. Many common name variations are matched by Soundex. The only notable exception is the first letter of the name. The first letter is not matched for sound.
It may be a good idea to create both a Soundex and non-Soundex index and allow your search interface to select which index to use.
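For readers unfamiliar with Soundex, the following Python sketch implements the commonly described American Soundex rules (first letter kept, consonants mapped to digits, repeated codes collapsed, result padded to four characters). It is meant to show why Smith and Smyth collide while the first letter still matters; whether SWISH-E's own Soundex routine agrees with it in every edge case is an assumption.

```python
# Letter groups that share a Soundex digit; vowels, H, W and Y have no code.
CODES = {}
for digit, letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(name):
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]                    # the first letter is kept as-is
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:      # collapse runs of the same code
            result += code
        if c not in "HW":                  # H and W do not break a run
            previous = code
    return (result + "000")[:4]            # pad or truncate to four characters

for name in ["Smith", "Smyth", "Kathy", "Cathy"]:
    print(name, soundex(name))
```

Smith and Smyth both come out as S530, while Kathy and Cathy differ only in their first character (K300 and C300), which is the first-letter limitation described above.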
Set IgnoreTotalWordCountWhenRanking to yes to ignore the total number of words in the file when calculating ranking. This is often better with merges and small files. The default is yes.
IgnoreTotalWordCountWhenRanking no
The default was changed from no to yes in version 2.2.
MinWordLimit sets the minimum length of a word. Shorter words will not be indexed. The default is 1 (as defined in src/config.h).
MinWordLimit 5
MaxWordLimit sets the maximum length of an indexable word. Longer words will not be indexed. The default is 40 (as defined in src/config.h).
These settings define what a word consists of to the SWISH-E indexing engine. Compiled in defaults are in src/config.h.
When indexing, SWISH-E uses WordCharacters to split up the document into words. Words are defined by any string of non-blank characters that contain only the characters listed in WordCharacters. If a string of characters includes a character that is not in WordCharacters then the word will be split into two or more separate words.
For example:
WordCharacters abde
Would turn ``abcde'' into two words ``ab'' and ``de''.
Next, of these words, any characters defined in IgnoreFirstChar are stripped off the start of the word, and IgnoreLastChar characters are stripped off the end of the word. This allows, for example, periods within a word (www.slashdot.com), but not at the end of a word. Characters in IgnoreFirstChar and IgnoreLastChar must be in WordCharacters.
Finally, the resulting words MUST begin with one of the characters listed in BeginCharacters and end with one of the characters listed in EndCharacters. BeginCharacters and EndCharacters must be a subset of the characters in WordCharacters. Often, WordCharacters, BeginCharacters and EndCharacters will all be the same.
Note that the same process applies to the query while searching.
Getting these settings correct will take careful consideration and practice. It's helpful to create an index of a single test file, and then look at the words that are placed in the index (see the -v 4, -D and -k searching switches).
Currently there is only support for eight-bit characters.
Example:
WordCharacters .abcdefghijklmnopqrstuvwxyz
BeginCharacters abcdefghijklmnopqrstuvwxyz
EndCharacters abcdefghijklmnopqrstuvwxyz
IgnoreFirstChar .
IgnoreLastChar .
So the string
Please visit http://www.example.com/path/to/file.html.
will be indexed as the following words:
please visit http www.example.com path to file.html
Which means that you can search for www.example.com as a single word, but searching for just example will not find the document.
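The three-step process described above (split on WordCharacters, strip IgnoreFirstChar and IgnoreLastChar, then check BeginCharacters and EndCharacters) can be sketched in Python as follows. The function name split_words is hypothetical, and lowercasing the input is inferred from the example output rather than stated by the directives themselves, so treat this as an approximation of the behavior, not the actual indexing code.

```python
def split_words(text, word_chars, begin_chars, end_chars,
                ignore_first="", ignore_last=""):
    # Step 1: cut the text wherever a character is not in word_chars.
    candidates, current = [], ""
    for ch in text.lower():
        if ch in word_chars:
            current += ch
        elif current:
            candidates.append(current)
            current = ""
    if current:
        candidates.append(current)

    words = []
    for candidate in candidates:
        # Step 2: strip ignorable characters from both ends.
        word = candidate.lstrip(ignore_first).rstrip(ignore_last)
        # Step 3: keep the word only if it begins and ends acceptably.
        if word and word[0] in begin_chars and word[-1] in end_chars:
            words.append(word)
    return words

letters = "abcdefghijklmnopqrstuvwxyz"
print(split_words("Please visit http://www.example.com/path/to/file.html.",
                  "." + letters, letters, letters, ".", "."))
```

Running the same function over a one-line test file with different character sets is a cheap way to predict what a configuration change will do before re-indexing.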
Note: when indexing HTML documents HTML entities are converted to their character equivalents before being processed with these directives. This is a change from previous versions of SWISH-E where you were required to include the characters 0123456789&#; to index entities. See also ConvertHTMLEntities.
The Buzzwords option allows you to specify words that will be indexed regardless of WordCharacters, BeginCharacters, EndCharacters, stemming, soundex and many of the other checks done on words while indexing.
Buzzwords are case insensitive.
Buzzwords should be separated by spaces and may span multiple directives. If the special format File:filename is used then the Buzzwords will be read from an external file during indexing.
Examples:
Buzzwords C++ TCP/IP
Buzzwords File:./buzzwords.lst
If a Buzzword contains search operator characters they must be backslashed when searching. For example:
Buzzwords C++ TCP/IP web=http
./swish-e -w 'web\=http'
The IgnoreWords option allows you to specify words to ignore, called stopwords. The default is to not use any stopwords.
Words should be separated by spaces and may span multiple directives. If the special format File:filename is used then the stop words will be read from an external file during indexing.
In previous versions of swish you could use the directive
IgnoreWords swishdefault - obsolete!
to include a default list of compiled in stopwords. This keyword is no longer supported.
Examples:
IgnoreWords www http a an the of and or
IgnoreWords File:./stopwords.de
UseWords defines the words that swish will index. Only the words listed will be indexed.
You can specify a list of words following the directive (you may specify more than one UseWords directive in a config file), and/or use the File: form to specify a path to a file containing the words:
UseWords perl python pascal fortran basic cobol php
UseWords File: /path/to/my/wordlist
Please drop the swish-e list a note if you actually use this feature. It may be removed from future versions.
IgnoreLimit automatically omits words that appear too often in the files (these words are called stopwords). Specify a whole percentage and a number, such as ``80 256''. This omits words that occur in over 80% of the files and appear in over 256 files. Comment out the directive to turn off auto-stopwording.
IgnoreLimit 50 1000
SWISH-E must do extra processing to adjust the entire index when this feature is used. Instead of using this feature, it is recommended that you decide which words are stopwords and add them to IgnoreWords in your configuration file. To do this, use IgnoreLimit one time and note the stop words that are found while indexing. Add this list to IgnoreWords, and then remove IgnoreLimit from the configuration file.
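The two-part IgnoreLimit test (a percentage of files and an absolute file count, both of which must be exceeded) can be sketched in Python like this. The function name auto_stopwords and the tiny in-memory index are made up for the illustration; SWISH-E's internal data structures are not described here.

```python
def auto_stopwords(word_to_files, total_files, percent, min_files):
    """Return words found in over `percent`% of files AND in over `min_files` files."""
    stop = []
    for word, files in word_to_files.items():
        count = len(files)
        # Integer arithmetic avoids rounding issues with percentages.
        if count > min_files and count * 100 > percent * total_files:
            stop.append(word)
    return sorted(stop)

# Hypothetical index of 10 files: word -> set of file numbers containing it.
index = {
    "the": set(range(9)),     # in 9 of 10 files
    "index": set(range(6)),   # in 6 of 10 files
    "swish": set(range(4)),   # in 4 of 10 files
}
print(auto_stopwords(index, total_files=10, percent=50, min_files=3))
```

With the equivalent of ``IgnoreLimit 50 3'' both "the" and "index" are reported, which is the list you would then copy into IgnoreWords.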
IgnoreMetaTags defines a list of metanames to ignore while indexing XML files. This is useful to avoid indexing specific data from a file. For example:
<person>
<first_name> William </first_name>
<last_name> Shakespeare </last_name>
<updated_date> April 25, 1999 </updated_date>
</person>
In the above example you might not want to index the updated date, and therefore prevent finding this record by searching
-w 'person=(April)'
This is solved by:
IgnoreMetaTags updated_date
Warning: Any data listed in IgnoreMetaTags will not be indexed.
See also UndefinedMetaTags.
IndexComments lets you decide whether to index the contents of HTML comments. The default is no. Set it to yes if comment indexing is required.
IndexComments yes
Note: This is a change in the default behavior prior to version 2.2.
The TranslateCharacters directive maps the characters in string1 to the characters listed in string2.
For example:
# This will index a_b as a-b and ámo as amo
TranslateCharacters _á -a
TranslateCharacters :ascii7: is a predefined set of characters that will translate eight bit characters to ascii7 characters. Using the :ascii7: rule will translate ``Ääç'' to ``aac''. This means: searching ``Çelik'', ``çelik'' or ``celik'' will all match the same word.
TranslateCharacters is done early in the indexing process, after converting HTML entities but before splitting the input text into words based on WordCharacters. So the characters you are translating from do not need to be listed in WordCharacters.
The same character translations take place when searching.
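The string1-to-string2 mapping works like a character translation table, which Python exposes directly through str.maketrans. The sketch below mirrors the example above; the ascii7 helper is only a rough stand-in for the :ascii7: rule built on Unicode decomposition, and is not the table SWISH-E actually uses (characters without a decomposition are simply dropped here).

```python
import unicodedata

def make_translator(string1, string2):
    """Map the i-th character of string1 to the i-th character of string2."""
    if len(string1) != len(string2):
        raise ValueError("string1 and string2 must have the same length")
    return str.maketrans(string1, string2)

def ascii7(text):
    """Approximate :ascii7: by decomposing accents and keeping 7-bit characters."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if ord(c) < 128)

table = make_translator("_\u00e1", "-a")      # same as: TranslateCharacters _á -a
print("a_b \u00e1mo".translate(table))
print(ascii7("\u00c7elik").lower())
```

Applying the same function to both the indexed text and the query is what makes ``Çelik'' and ``celik'' meet at the same word.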
When indexing, SWISH-E assigns a word position to each word. This enables phrase searching. There may be cases where you would like to prevent phrase matching. The BumpPositionCounterCharacters directive allows you to specify a set of characters that when found in the text will increment the word position -- effectively preventing phrase matches across that character.
For example, if you have a META tag:
<!-- META START NAME="subjects" -->
computer programming | apple computers
<!-- META END -->
You might want to prevent matching ``programming apple'' in that meta name.
BumpPositionCounterCharacters |
There is no default, and you may list a string of characters.
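The effect of bumping the position counter on phrase matching can be shown with a small Python model. The names word_positions and phrase_matches are hypothetical, the word characters are limited to lowercase letters for brevity, and the size of the bump (one extra position per bump character) is an assumption; the point is only that a gap in positions breaks adjacency.

```python
def word_positions(text, bump_chars="", word_chars="abcdefghijklmnopqrstuvwxyz"):
    """Return (word, position) pairs; bump characters leave a gap in positions."""
    positions, position, current = [], 0, ""
    for ch in text.lower() + "\n":         # newline sentinel flushes the last word
        if ch in word_chars:
            current += ch
            continue
        if current:
            position += 1
            positions.append((current, position))
            current = ""
        if ch in bump_chars:
            position += 1                  # assumed bump of one position
    return positions

def phrase_matches(positions, phrase):
    """A phrase matches when its words occupy consecutive positions."""
    words = phrase.lower().split()
    lookup = set(positions)
    return any(all((w, start + i) in lookup for i, w in enumerate(words))
               for _, start in positions)

text = "computer programming | apple computers"
plain = word_positions(text)
bumped = word_positions(text, bump_chars="|")
print(phrase_matches(plain, "programming apple"))
print(phrase_matches(bumped, "programming apple"))
```

Without the bump character the unwanted phrase matches; with it, ``programming'' and ``apple'' are no longer adjacent, while ``apple computers'' still is.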
Since metatags are typically separate data fields, the word position counter is automatically bumped between metatags. This prevents matching a phrase that spans more than one metaname. DontBumpPositionOnMetaTags disables this feature for the listed metanames.
For example,
<person>
<first_name> William </first_name>
<last_name> Shakespeare </last_name>
<updated_date> April 25, 1999 </updated_date>
</person>
In the configuration file:
DontBumpPositionOnMetaTags last_name
This configuration allows this phrase search:
-w 'person=("william shakespeare")'
but this phrase search will fail:
-w 'person=("shakespeare april")'
Some directives have different uses depending on the source of the documents. These directives are only valid when using the File system method of indexing.
IndexOnly specifies the allowable file suffixes (extensions) while indexing. The default is to index all files specified in IndexDir.
# Only index .html .htm and .q files
IndexOnly .html .htm .q
Set FollowSymLinks to ``yes'' to follow symbolic links while indexing, otherwise ``no''. The default is no.
FollowSymLinks no
FollowSymLinks yes
Note that when set to no extra stat(2) system calls must be made for each file. For a large number of files you may see a small reduction in indexing time by setting this to yes.
See also the -l switch in SWISH-RUN.
Files matching the specified criteria will not be indexed. C regex.h library regular expression pattern matching is allowed.
    FileRules pathname contains .*dir1
    FileRules filename contains # % ~ .bak .orig .old old.
    FileRules title contains construction example pointers
    FileRules directory contains .htaccess
    FileRules filename is index
Note: FileRules title works for any input method (fs, prog, or http) that is parsed as HTML, and where a title was found in the document.
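As a rough illustration of how ``contains'' rules exclude files, the following Python sketch tests a path against lists of patterns. It is not SWISH-E code: the function is_excluded is hypothetical, and Python's re module is used as a stand-in for the C regex.h library, whose syntax differs in some details.

```python
import re


def is_excluded(path, filename_contains=(), pathname_contains=()):
    """Return True if any pattern matches the file name or the full path."""
    filename = path.rsplit("/", 1)[-1]
    if any(re.search(p, filename) for p in filename_contains):
        return True
    if any(re.search(p, path) for p in pathname_contains):
        return True
    return False


# Patterns taken from the FileRules examples above.
name_rules = ["#", "%", "~", r".bak", r".orig", r".old", r"old."]
path_rules = [r".*dir1"]

print(is_excluded("/docs/dir1/page.html", pathname_contains=path_rules))
print(is_excluded("/docs/notes.bak", filename_contains=name_rules))
print(is_excluded("/docs/index.html", filename_contains=name_rules))
```

Here the first two paths are excluded and the third is kept. Because the patterns are regular expressions, a ``.'' in a pattern matches any character, not only a literal dot.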
These directives are available when using the HTTP Access Method of indexing.
MaxDepth defines how many links the spider should follow before stopping. A value of 0 configures the spider to traverse all links. The default is MaxDepth 5.
MaxDepth 5 |
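The depth limit can be pictured as a breadth-first traversal that stops expanding links once a page is MaxDepth links away from the starting page. The Python sketch below works on an in-memory link table rather than real HTTP requests. The function crawl and the sample link table are hypothetical, and counting depth from the start page is an assumption about how the spider measures it.

```python
from collections import deque


def crawl(start, links, max_depth=5):
    """Visit pages breadth-first; max_depth 0 means no depth limit."""
    seen = {start}
    queue = deque([(start, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if max_depth and depth >= max_depth:
            continue               # do not follow links from this page
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site: a links to b, b to c, c to d.
site = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(crawl("a", site, max_depth=2))
print(crawl("a", site, max_depth=0))
```

With max_depth=2 the model visits a, b and c; with max_depth=0 it visits every reachable page.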
The number of seconds to wait between issuing requests to a server. This setting allows for more friendly spidering of remote sites. The default is 60 seconds.
Delay 1 |
The location of a writable temp directory on your system. The HTTP access method tells the Perl helper to place its files in this location, and the -e switch causes swish to use this directory while indexing. The default is /var/tmp.
TmpDir /tmp/swish/ |
If this directory does not exist or is not writable, SWISH-E will fail with an error during indexing.
The location of the Perl helper script called swishspider. If you use a relative directory, it is relative to the directory you are in when you run SWISH-E, not to the directory that SWISH-E is in. The default is ./
SpiderDirectory /usr/local/swish/ |
The same site is often referred to by different names. A common example is that http://www.some-server.com and http://some-server.com are often the same site. Each line should list all the method/name combinations that should be considered equivalent. Multiple EquivalentServer directives may be used. Each directive defines its own set of equivalent servers.
    EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
    EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
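One way to picture equivalent servers is as a table that rewrites every listed name to a single canonical name, so the same page is not treated as two documents. The Python sketch below is an illustration only: build_equivalents and canonical_url are hypothetical helpers, and picking the first name on each line as the canonical one is an assumption.

```python
def build_equivalents(groups):
    """Map every server in a group to the first server listed in that group."""
    canon = {}
    for group in groups:
        for server in group:
            canon[server.rstrip("/")] = group[0].rstrip("/")
    return canon


def canonical_url(url, canon):
    """Rewrite the server part of url to its canonical equivalent, if any."""
    # Try longer names first so host:port is checked before the bare host.
    for server in sorted(canon, key=len, reverse=True):
        if url == server or url.startswith(server + "/"):
            return canon[server] + url[len(server):]
    return url


groups = [
    ["http://library.berkeley.edu", "http://www.lib.berkeley.edu"],
    ["http://sunsite.berkeley.edu:2000", "http://sunsite.berkeley.edu"],
]
table = build_equivalents(groups)
print(canonical_url("http://www.lib.berkeley.edu/a.html", table))
print(canonical_url("http://sunsite.berkeley.edu/x", table))
```

Each group corresponds to one EquivalentServer line, and URLs on servers that are not listed pass through unchanged.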
This section details the directives that are only available for the ``prog'' document source feature of swish. The ``prog'' access method runs an external program that ``feeds'' documents to swish. This allows indexing and filtering of documents from any source.
A number of example programs for use with the ``prog'' access method are provided in the prog-bin directory. Please see those examples if you have questions about implementing a ``prog'' input program.
This is a list of parameters that will be sent to the external program when running with the ``prog'' document source method.
    SwishProgParameters /path/to/config hello there
    IndexDir /path/to/program.pl
Then running:
swish-e -c config -S prog |
swish will execute /path/to/program.pl and pass /path/to/config hello there as three command line arguments to the program. This directive makes it easy to pass settings from the swish-e configuration file to the external program.
For example, the spider.pl program (included in the prog-bin directory) uses the SwishProgParameters to specify what file to read for configuration information.
    SwishProgParameters spider.config
    IndexDir ./spider.pl
The spider.pl program also has a default action so you can avoid using a configuration file:
    SwishProgParameters default http://www.swishe.org/ http://some.other.site/
    IndexDir ./spider.pl
And the spider program will use default settings for spidering those sites.
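To show how an external program might receive these parameters, here is a minimal Python sketch of a ``prog'' input program. Several things in it are assumptions rather than facts taken from this page: each parameter is treated as the path of a text file to index, and each document is preceded by Path-Name and Content-Length header lines followed by a blank line. Check the examples in the prog-bin directory for the exact format your version of swish expects.

```python
import sys


def format_document(path, content):
    """Wrap one document in the header block assumed for the prog method."""
    body = content.encode("utf-8")
    header = "Path-Name: {}\nContent-Length: {}\n\n".format(path, len(body))
    return header.encode("utf-8") + body


def main(argv, out):
    # argv[1:] holds whatever SwishProgParameters listed. In this sketch
    # each parameter is (hypothetically) a text file to feed to swish.
    for path in argv[1:]:
        with open(path, encoding="utf-8") as handle:
            out.write(format_document(path, handle.read()))


# Only run when called with parameters, e.g. by swish-e -S prog.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv, sys.stdout.buffer)
```

Content-Length is computed from the encoded bytes rather than the character count, so documents containing non-ASCII text are measured correctly.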
Internally, SWISH-E knows how to parse only text, HTML, and XML documents. With SWISH-E filters you can index other types of documents. For example, if all your web pages are in gzip format a filter can uncompress these on the fly for indexing.
A filter is an external program that swish executes when processing a document of a given type. SWISH-E will execute the filter program for each file that matches the file extension set in the FileFilter directive.
SWISH-E calls the external program passing as default arguments:
the name of the filter program
the physical path name of the file to read ($1). This may be a temporary file location if indexing by the http method.
the path name of the source document. When indexing under the file system this will be the same as $1 (the path to the source file), but when indexing under the http method this will be the URL of the source document.
SWISH-E can also pass other parameters to the filter program. These parameters can be defined using the FileFilter directive. See Filter Options below.
The filter program must open the file, process its contents, and return it to SWISH-E by printing to STDOUT.
Note that this can add a significant amount of time to the indexing process. If you have many files to filter you should consider writing your filter in C instead of a shell or perl script, or using the ``prog'' Access Method.
This is the path to a directory where the filter programs are stored. SWISH-E looks in this directory to find the filter specified in the FileFilter directive. If this directive is omitted, you have to specify the full path to the filter script on each FileFilter directive.
Example:
FilterDir /usr/local/swish/filters |
This maps file extensions to a filter program. If filter-prog starts with a directory delimiter (absolute path), SWISH-E doesn't use the FilterDir settings, but uses the given filter-prog path directly.
Filter options: Filter options are a string passed as arguments to the filter-prog. Filter options can contain variables, which SWISH-E replaces with the values described below.
If you omit filter-options, SWISH-E will use default parameters for the options.
    Default: "'%p' '%P'"
    Which means: pass "workfile path" and "documentfile path" to filter (each quoted).
Variables in filter options:
    %% = %
    %P = Full document pathname (e.g. URL, or path on filesystem)
    %p = Full pathname to work file (maybe a tmpfile or the real document path on filesystem)
    %F = Filename stripped from full document pathname
    %f = Filename stripped from "work" pathname
    %D = Directoryname stripped from full document pathname
    %d = Directoryname stripped from full "work" pathname
Example:
    %P = document pathname: http://myserver/path1/mydoc.txt
    %p = work pathname: /tmp/tmp.1234.mydoc.txt
    %F = mydoc.txt
    %f = tmp.1234.mydoc.txt
    %D = http://myserver/path1
    %d = /tmp
Important hint for security: When using variable substitution, use quotes to ensure filename integrity, e.g. "'%f'" --> 'file name with spaces.doc'. If you don't, your system security may be compromised, or filtering may not work for these files.
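The substitution rules can be checked against the example values with a short Python sketch. It is a model of the documented behaviour, not SWISH-E's own code: expand_filter_options is a hypothetical name, paths are split on ``/'' only, and leaving unknown %-sequences untouched is an assumption.

```python
import re


def expand_filter_options(options, doc_path, work_path):
    """Replace %P %p %F %f %D %d and %% in a filter option string."""
    doc_dir, _, doc_name = doc_path.rpartition("/")
    work_dir, _, work_name = work_path.rpartition("/")
    values = {
        "%": "%",
        "P": doc_path, "p": work_path,
        "F": doc_name, "f": work_name,
        "D": doc_dir, "d": work_dir,
    }
    # A single pass over the string, so "%%p" yields a literal "%p".
    return re.sub(r"%([%PpFfDd])", lambda m: values[m.group(1)], options)


doc = "http://myserver/path1/mydoc.txt"
work = "/tmp/tmp.1234.mydoc.txt"
print(expand_filter_options("'%p' '%P'", doc, work))
print(expand_filter_options("-d '%d' -example -url '%P' '%f'", doc, work))
```

The first call expands the default option string. The second expands the options from a FileFilter line that passes the work directory, the document URL, and the work file name.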
Examples for filters:
    FileFilter .pdf pdftotext "'%p' -"
    FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
    FileFilter .html.gz gzip "-dc '%p'"
    FileFilter .pdf pdf2html.sh
    FileFilter .html.gz ungzip-html
    FileFilter .doc /usr/local/filters/wword-filter.sh
    FileFilter .dot wword-filter.sh
    FileFilter .ps ghostscript-filter.pl
    FileFilter .mydoc "/some/path/mydocfilter" "-d '%d' -example -url '%P' '%f'"
Here is a simple example of a filter using Perl. Again, you should try to avoid running shell or perl scripts as filters as the scripts will significantly slow down indexing, if indexing speed is an issue. But, for a small number of files to filter, this method works well and is easy to implement.
Convert gzipped files to text:
    #!/usr/local/bin/perl -w
    use strict;
    use Compress::Zlib;

    # The first argument (%p by default) is the path of the file to read.
    my $file = shift;
    die "Usage: gzcat file...\n" unless $file;

    my $gz = gzopen($file, 'rb')
        or die "Cannot open $file: $gzerrno\n";

    # Decompress in chunks and print the text to STDOUT for SWISH-E.
    my $buffer;
    print $buffer while $gz->gzread($buffer) > 0;

    die "Error reading from $file: $gzerrno\n" if $gzerrno != Z_STREAM_END;
    $gz->gzclose();
$Id: SWISH-CONFIG.pod,v 1.21 2001/06/17 04:13:33 whmoseley Exp $