Table of Contents:
[ TOC ]
SWISH-E is Simple Web Indexing System for Humans - Enhanced. With it, you can quickly and easily index directories of files or remote web sites and search the generated indexes for words and phrases.
[ TOC ]
Well, yes. Probably the most common use of swish-e is to provide a search engine for web sites. The swish-e distribution includes CGI scripts that can be used with swish-e to add a search engine for your web site. The CGI scripts can be found in the example directory of the distribution package. See the README file for information about the scripts.
But swish-e can also be used to index all sorts of data, such as email messages, data stored in a relational database management system, XML documents, or documents such as Word and PDF documents -- or any combination of those sources. Searches can be limited to fields or MetaNames within a document, or limited to areas within an HTML document (e.g. body, title). Programs other than CGI applications can use swish, as well.
[ TOC ]
A large number of bug fixes, feature additions, and logic corrections were made in version 2.2. In addition, indexing speed has been drastically improved (reports of indexing times dropping from four hours to twenty minutes), and major parts of the indexing and search parsers have been rewritten. There are better debugging options, enhanced output formats, more document meta data (e.g. last modified date, document summary), options for indexing from external data sources, and faster spidering, just to name a few changes. (See the CHANGES file for more information.)
Since so much effort has gone into version 2.2, support for previous versions will probably be limited.
[ TOC ]
Well, yes, there are some binary distributions available. Please see the swish-e web site for a list at http://sunsite.berkeley.edu/SWISH-E/.
In general, it is recommended that you build swish from source, if possible.
[ TOC ]
Debugging CGI scripts is beyond the scope of this document. ``Internal Server Error'' basically means ``check the web server's log for an error message'', as it can mean anything from a bad shebang (#!) line to an error in the program. The CGI script swish.cgi in the example directory contains some debugging suggestions; type perldoc swish.cgi for information.
There are also many, many CGI FAQs available on the Internet. A quick web search should offer help. As a last resort you might ask your webadmin for help...
[ TOC ]
Your web server is not configured to run the program as a CGI script. This problem is described in perldoc swish.cgi.
[ TOC ]
The SWISH-E discussion list is the place to go. http://sunsite.berkeley.edu/SWISH-E/. Please do not email developers directly. The list is the best place to ask questions.
Before you post please read QUESTIONS AND TROUBLESHOOTING located in the INSTALL page.
In short, be sure to include the following when asking for help.
[ TOC ]
By default, swish-e makes its best guess as to what constitutes a reasonable word and filters out ``garbage'' words according to a set of rules. For instance, if swish-e encounters a word that has no vowels, it doesn't index it. You can change these rules by editing the config.h file in the src directory of the swish-e distribution package. By editing the rules, you may be able to index quite a few more words, or fewer, depending on your preference.
The configuration file directives WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar, and IgnoreLastChar (see SWISH-CONFIG) also control what words swish indexes.
The command line arguments -k, -v, and -T are useful when debugging these issues. Using -T INDEXED_WORDS while indexing will display each word as it is indexed. You should specify one file when using this feature since it can generate a lot of output.

    ./swish-e -c my.conf -i problem.file -T INDEXED_WORDS

You may also wish to index a single file that contains words that are or are not being indexed as you expect, and use -T to output debugging information about the index. A useful command might be:

    ./swish-e -f index.swish-e -T INDEX_FULL
[ TOC ]
This shouldn't happen. If it does, please post the details to the swish-e discussion list so the problem can be reproduced by the developers.
In the meantime, you can use a FileRules directive to exclude the particular file name, pathname, or title. If there are serious problems indexing certain types of files, they may not contain valid text (they may be binary files, for instance). You can use NoContents to exclude that type of file.
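For example, a configuration sketch (FileRules and NoContents are real directives; the particular patterns and extensions here are hypothetical placeholders):

```
# Skip files by name, path, or title while indexing
FileRules filename is index.tmp
FileRules pathname contains /private/
FileRules title contains DRAFT

# Index these types by file name only, ignoring their (binary) contents
NoContents .gif .png .exe
```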
Swish will issue a warning if an embedded null character is found in a document, and the document will be truncated at the null. This warning is an indication that you are trying to index binary data. If you need to index binary files, try to find a program that will extract the text (e.g. strings(1), catdoc(1), pdftotext(1)).
[ TOC ]
It's probably best to specify a temporary IndexFile in your configuration and then rename the index to the live index name after indexing is complete. Under Unix, rename (mv) is atomic, so any searches in progress will not be affected.
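A sketch of that arrangement, with placeholder file names (IndexFile is the real directive; index.tmp and index.swish-e are just examples):

```
# swish.conf (sketch): write the new index under a temporary name
IndexFile ./index.tmp
```

Then index and swap in one step, for example with swish-e -c swish.conf && mv index.tmp index.swish-e. The mv only happens if indexing succeeds, and searches keep hitting the old index up until the rename.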
[ TOC ]
If possible, use the file system method of indexing (-S fs) to index the documents in the web area of your file system. This avoids the overhead of spidering a web server and is much faster. (-S fs is the default method if -S is not specified.)
If this is impossible (the web server is not local, or documents are dynamically generated), swish provides two methods of spidering. First, swish includes the http method of indexing (-S http). A number of special configuration directives are available that control spidering (see Directives for the HTTP Access Method Only). A perl helper script (swishspider.pl) is included in the src directory to assist with spidering web servers. There are example configurations for spidering in the conf directory.
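As a sketch, a minimal http-method configuration might look like the following (MaxDepth, Delay, and SpiderDirectory are documented directives for the HTTP access method; the URL and paths are hypothetical):

```
IndexDir http://www.example.com/
MaxDepth 5             # follow links at most 5 hops from the start page
Delay 30               # seconds to wait between requests to the server
SpiderDirectory ./src  # directory containing swishspider.pl
```

and would be run with ./swish-e -c swish.conf -S http.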
As of swish 2.2, there's a general purpose ``prog'' document source where a
program can feed documents to swish for indexing. A number of example
programs can be found in the prog-bin
directory, including a program to spider web servers. The provided
spider.pl program is full-featured and is easily customized.
The advantage of the ``prog'' document source feature over the ``http'' method is that the program is only executed once, whereas the swishspider.pl program used in the ``http'' method is executed once for every document read from the web server. The forking of swish and compiling of the perl script can be quite expensive, time-wise.
The other advantage of the spider.pl program is that it's simple and efficient to add filtering (such as for PDF or MS Word docs) right into spider.pl's configuration. It also includes features such as MD5 checks to prevent duplicate indexing, and options to skip some files entirely, or to index a page without spidering its links. And since it's a perl program, there's no limit on the features you can add.
[ TOC ]
Swish cannot follow links generated by Javascript, as they are generated by the browser and are not part of the document.
[ TOC ]
Use a robots.txt file in your document root. This is a standard way to exclude files from search engines, and it is fully supported by swish-e. See http://www.robotstxt.org/.
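For example, a robots.txt sketch (the paths are placeholders) that asks all robots, swish-e's spiders included, to stay out of two directories:

```
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
```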
You can also modify the spider.pl perl program so that listed web pages are skipped entirely, indexed without following their links, or spidered without being indexed.
[ TOC ]
The spider.pl program has a default limit of 5MB on file size. This can be changed with the max_size parameter setting. See perldoc spider.pl for more information.
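spider.pl reads its settings from a perl configuration file. A sketch that raises the limit to 10MB (the URL and email are hypothetical; max_size is in bytes):

```perl
# spider configuration file (sketch)
@servers = (
    {
        base_url => 'http://www.example.com/',
        email    => 'admin@example.com',
        max_size => 10 * 1024 * 1024,   # bytes; the default is 5MB
    },
);
1;
```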
[ TOC ]
The spider.pl program has a number of debugging switches and can be quite verbose in
telling you what's happening, and why. See perldoc spider.pl
for instructions.
[ TOC ]
Use the ReplaceRules configuration directive to rewrite path names and URLs.
[ TOC ]
Use the ``prog'' document source method of indexing. Write a program to extract the data from your database and format it as XML, HTML, or text. See the examples in the prog-bin directory, and the next question.
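As a minimal sketch of the output format a ``prog'' program must produce, here is a shell script that emits one invented record; a real program would query the database and loop over the rows. Each document is preceded by headers giving at least its (pseudo) path and byte length, followed by a blank line:

```shell
#!/bin/sh
# Emit one hypothetical document in the -S prog header format.
doc='<html><head><title>Widget 42</title></head><body>blue widget</body></html>'
len=`printf '%s' "$doc" | wc -c | tr -d ' '`
printf 'Path-Name: /db/widget/42\n'    # pseudo-path shown in results
printf 'Content-Length: %d\n' "$len"   # byte count of the document body
printf '\n'                            # blank line ends the headers
printf '%s' "$doc"
```

swish-e would then run such a program via ./swish-e -S prog -i ./dump_db.sh (the script name here is made up).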
[ TOC ]
Swish-e can internally only handle HTML, WML, XML and TXT (text) files by default, but can make use of filters that will convert other types of files such as MS Word documents, PDF, or gzipped files into one of the file types that Swish-e understands.
The FileFilter config directive is used to define programs to use as filters, based on file extension. For example, you can use the program catdoc to convert MS Word documents to text for indexing. Please see SWISH-CONFIG and the examples in the filter-bin directory for more information.
Another option is to use the prog document source input method. In this case you write a program (such as a
perl script) that will read and convert your data as needed and then output
one of the formats that swish understands. Examples of using the prog input method for filtering are included in the prog-bin
directory of the Swish-e distribution.
The disadvantage of using the prog input method is that you must write a program that reads the documents from the source (e.g. from the file system or via a spider to read files on a web server), and also include the code to filter the documents. It's much easier to use the FileFilter option since the filter can often be implemented with just a single configuration directive.
On the other hand, the advantage of using the prog input method is speed: filtering inside a prog input method program avoids repeatedly launching the filter, which matters most when the filter is something like a Perl script (something that has a large start-up cost). This may or may not be an issue for you, depending on how much time your indexing requires.
You can also use a combination of methods. For example, say you are indexing a directory that contains PDF files using a FileFilter directive. Now you want to index a MySQL database that also contains PDF files. You can write a prog input method program to read your MySQL database and use the same FileFilter configuration parameter (and filter program) to convert the PDF files into one of the native swish formats (TXT, HTML, XML).
Do note that it will be slower to use the FileFilter method instead of running the filter directly from the prog input method program. When FileFilter is used with the prog input method swish must create a temporary file containing the output from your prog method program, and then execute the filter program.
In general, use the FileFilter method to filter documents. If indexing speed is an issue, consider writing a prog input method program. If you are already using the prog method, then filtering will probably be best accomplished within that program.
Here are two examples of how to run a filter program: one using swish's FileFilter directive, the other using a prog input method program. These filters simply use the program /bin/cat as a filter and index only .html files.
First, using the FileFilter method, here's the entire configuration file (swish.conf):

    IndexDir .
    IndexOnly .html
    FileFilter .html "/bin/cat" "'%p'"
and index with the command:

    swish-e -c swish.conf -v 1
Now, the same thing using the prog document source input method and a Perl program called catfilter.pl. You can see that it's much more work than using the FileFilter method above, but it provides a place to do additional processing. In this example, the prog method is only slightly faster; but if you needed a perl script to run as a FileFilter, then prog would be significantly faster.
    #!/usr/local/bin/perl -w
    use strict;
    use File::Find;    # for recursing a directory tree

    $/ = undef;        # slurp whole files

    find( { wanted => \&wanted, no_chdir => 1 }, '.' );

    sub wanted {
        return if -d;
        return unless /\.html$/;

        my $mtime = (stat)[9];

        # run /bin/cat as the "filter" and read its output
        my $child = open( FH, '-|' );
        die "Failed to fork $!" unless defined $child;
        exec '/bin/cat', $_ unless $child;

        my $content = <FH>;
        my $size    = length $content;

        # emit the headers swish expects, a blank line, then the document
        print <<EOF;
    Content-Length: $size
    Last-Mtime: $mtime
    Path-Name: $_

    EOF
        print $content;
    }
And index with the command:
swish-e -S prog -i ./catfilter.pl -v 1 |
This example will probably not work under Windows due to the '-|' open. A simple piped open may work just as well. That is, replace:

    my $child = open( FH, '-|' );
    die "Failed to fork $!" unless defined $child;
    exec '/bin/cat', $_ unless $child;

with this:

    open( FH, "/bin/cat $_ |" ) or die $!;
Perl will try to avoid running the command through the shell if meta
characters are not passed to the open. See perldoc -f open
for more information.
[ TOC ]
See the examples in the conf directory.
[ TOC ]
The examples in the prog-bin directory use a module to convert the PDF files into XML, so you must tell swish that you are indexing XML files for the .pdf extension:

    IndexContents XML .pdf
[ TOC ]
No. Filters (FileFilter or via ``prog'' method) are only used for building the search index database. During search requests there will be no filter calls.
[ TOC ]
Yes, you can. Just remember that swish-e retains capitalization for all characters other than [a-zA-Z], so the word ``Çelik'' is not retrieved by ``çelik'', ``Celik'', or ``celik''. You can index and use words containing any character from ! (#033) to ÿ (#255). (Note: swish uses tolower(3), so locale settings may apply.)
Also, the TranslateCharacters directive (see SWISH-CONFIG) can translate characters while indexing and searching. TranslateCharacters :ascii7: is a predefined rule that translates eight-bit characters to their seven-bit ASCII equivalents. Using the :ascii7: rule will translate ``Ääç'' to ``aac''. This means that searching for ``Çelik'', ``çelik'' or ``celik'' will all match the same word.
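You can also supply your own mapping as two parallel strings. A sketch (these particular characters are just an example):

```
# map some accented characters to their plain equivalents
TranslateCharacters äöüç aouc
```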
[ TOC ]
Phrases are indexed automatically. To search for a phrase simply place double quotes around the phrase.
For example:

    swish-e -w 'free and "fast search engine"'
[ TOC ]
Use the BumpPositionCounterCharacters configuration directive.
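For example, a sketch that treats the ``|'' character as a phrase boundary, so phrase searches will not match across text separated by it:

```
BumpPositionCounterCharacters |
```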
[ TOC ]
In your HTML files you can put keywords in HTML META tags or in XML blocks.
META tags can have three formats in your source documents. In HTML:

    <META NAME="DC.subject" CONTENT="digital libraries">

    <!-- META START NAME="meta1" --> some content <!-- META END -->

And in XML format:

    <meta2> Some Content </meta2>

Then, to inform SWISH-E about the existence of the meta names in your documents, edit the MetaNames line in your configuration file:

    MetaNames DC.subject meta1 meta2
[ TOC ]
A document property is typically data that describes the document. For example, properties might include a document's path name, its last modified date, its title, or its size. Swish stores a document's properties in the index file, and they can be reported back in search results.
Swish also uses properties for sorting. You may sort your results by one or more properties, in ascending or descending order.
Properties can also be defined within your documents. HTML and XML files can specify tags (see previous question) as properties. The contents of these tags can then be returned with search results. These user-defined properties can also be used for sorting search results.
For example, if you had the following in your documents:

    <meta name="creator" content="accounting department">

and creator is defined as a property (see PropertyNames in SWISH-CONFIG), swish can return ``accounting department'' with the result for that document:

    swish-e -w foo -p creator

Or for sorting:

    swish-e -w foo -s creator
[ TOC ]
MetaNames allows keyword searches in your documents. That is, you can use MetaNames to restrict searches to just parts of your documents.
PropertyNames, on the other hand, define text that can be returned with results, and can be used for sorting.
Both use meta tags found in your documents (as shown in the above two questions) to define the text you wish to use as a property or meta name.
You may define a tag as both a property and a meta name. For example, with

    <meta name="creator" content="accounting department">

placed in your documents, and configuration settings of:

    PropertyNames creator
    MetaNames creator

you can limit your searches to documents created by accounting:

    swish-e -w 'foo and creator=(accounting)'

That will find all documents with the word foo that also have a creator meta tag containing the word accounting. This is using MetaNames.

And you can also say:

    swish-e -w foo -p creator

which will return all documents with the word foo, but the results will also include the contents of the creator meta tag along with each result. This is using properties.

You can use properties and meta names at the same time, too:

    swish-e -w 'creator=(accounting or marketing)' -p creator -s creator

That searches only the creator meta name for either of the words accounting or marketing, prints out the contents of the creator property, and sorts the results by the creator property.
(See also the -x
output format switch in SWISH-RUN.)
[ TOC ]
It's true that indexing can take up a lot of memory! One thing you can do is make many indexes of smaller content instead of trying to do everything at once. You can then merge all the smaller pieces together with the -M switch, or use the -f switch to specify more than one index while searching.
Another option is to use the -e switch. This will require less memory, but indexing will take longer, as not all data will be stored in memory while indexing. Please report back your findings; it seems -e requires quite a bit less RAM, but often not that much more indexing time.
[ TOC ]
Or, I can't search with meta names, all the names are indexed as "plain".
Check whether #define INDEXTAGS is set to 1 in the config.h file. If it is, change it to 0, recompile, and index again. When INDEXTAGS is 1, ALL the tags are indexed as plain text; that is, you index ``title'', ``h1'', and so on, AND they lose their indexing meaning. If INDEXTAGS is set to 0, you will still index meta tags and comments, unless you have indicated otherwise in the user config file with the IndexComments directive.
Also, check for the UndefinedMetaTags setting in your configuration file.
[ TOC ]
At times it might not strictly be necessary, but since you don't really know if anything in the index has changed, it is a good rule to reindex anyway.
[ TOC ]
An example CGI script is included in the example directory. (Type perldoc swish.cgi in the example directory for instructions.)
Please be careful when picking a CGI script to use with swish. Quite a few of the scripts that have been available for swish are insecure and should not be used.
The included example CGI script was designed with security in mind. Regardless, you are encouraged to have your local Perl expert review it (and all other CGI scripts you use) before placing into production. This is just a good policy to follow.
[ TOC ]
We know of no security issues with using swish. Careful attention has been paid to common security problems, such as buffer overruns, when programming swish.
The most likely security issue with swish is when swish is run via a poorly written CGI interface. This is not limited to CGI scripts written in Perl, as it's just as easy to write an insecure CGI script in C, Java, PHP, or Python. A good source of information is included with the Perl distribution; type perldoc perlsec at your local prompt for more information. Another must-read document is located at http://www.w3.org/Security/faq/wwwsf4.html.
Note that there are many free yet insecure and poorly written CGI scripts available -- even some designed for use with swish. Free is not such a good price when you get your server hacked...
[ TOC ]
Swish-e can't, of course, because it doesn't have access to the source documents when returning results. But a front-end program of your creation can highlight terms: your program can open the source documents and then use regular expressions to replace search terms with highlighted or bolded words.
But that will fail with all but the simplest source documents. For HTML documents, for example, you must parse the document into words and tags (and comments). A word you wish to highlight may span multiple HTML tags, or be a word in a URL where you wish to highlight the entire link text.
Perl modules such as HTML::Parser and XML::Parser make word extraction possible. Next, you need to consider that swish uses settings such as WordCharacters, BeginCharacters, EndCharacters, IgnoreFirstChar, and IgnoreLastChar to define a ``word''. That is, you can't assume that a string of characters with white space on each side is a word.
Then things like TranslateCharacters, and HTML Entities may transform a source word into something else, as far as swish is concerned. Finally, searches can be limited by metanames, so you may need to limit your highlighting to only parts of the source document. Throw phrase searches and stopwords into the equation and you can see that it's not a trivial problem to solve.
All hope is not lost, though, as swish does provide some help. Using the -H option, swish will return in the headers the current index (or indexes) settings for WordCharacters (and others) required to parse your source documents as swish parsed them during indexing, and will return a ``Parsed Words:'' header that shows how swish parsed the query internally. If you use word stemming, then you will also need to stem each word in your document before comparing it with the ``Parsed Words:'' returned by swish. The swish-e stemming code is available either through the swish-e Perl module or C library (included with the swish-e distribution), or through the SWISH::Stemmer module available on CPAN.
[ TOC ]
That's a good thing! That expensive CPU is supposed to be busy.
Indexing takes a lot of work. To make indexing fast, much of the work is done in memory, and moving all that memory around requires CPU time. But there are two things you can try:
The -e option will run swish in economy mode, which uses the disk to store data while indexing. This makes swish run somewhat slower, but it also uses less memory. Since swish is writing to disk more often, it will spend more time waiting on I/O and less time in the CPU. Maybe.
The other thing is to simply lower the priority of the job using the nice(1) command:

    /bin/nice -15 swish-e -c search.conf
If you are concerned about searching time, make sure you are using the -b and -m switches to return only one page of results at a time. If you know that your result sets will be large, that you wish to return results one page at a time, and that many pages of the same query will often be requested, it may be smart to request all the documents on the first request and then cache the results to a temporary file. The perl module File::Cache makes this very simple to accomplish.
[ TOC ]
Currently, there is no configuration directive to include a file that contains a list of files to index. But there is a directive to include another configuration file:

    IncludeConfigFile /path/to/other/config

And in /path/to/other/config you can say:

    IndexDir file1 file2 file3 file4 file5 ...
    IndexDir file20 file21 file22

You may also specify more than one configuration file on the command line:

    ./swish-e -c config_one config_two config_three
Another option is to create a directory with symbolic links of the files to index, and index just that directory.
[ TOC ]
Install Linux?
[ TOC ]
Not really. Swish currently has no way to add or remove items from its index. About the only way to delete items from the index is to stat(2) all the results to make sure that all the files still exist.
Incremental additions can be handled in a couple of ways, depending on your
situation. It's probably easiest to create one main index every night (or
every week), and then create an index of just the new files between main
indexing jobs and use the -f
option to pass both indexes to swish while searching.
You can merge the indexes into one index (instead of using -f), but it's
not clear that this has any advantage over searching multiple indexes.
Using -f
gives access to the individual headers of both indexes, while -M
merges the headers, and merging indexes with different indexing settings
(Stemming, WordCharacters) may produce odd results. This is a question for
the swish-e discussion list.
How does one create the incremental index?
One method is to use the -N switch to pass a file path to swish when indexing. Swish will only index files that have a last modification date newer than that of the file supplied with the -N switch.
This option has the disadvantage that swish must process every file in every directory as if it were going to be indexed (the test for -N is done last, right before indexing of the file contents begins, and after all other tests on the file have been completed) -- all that just to find a few new files. Also, if you use the swish index file as the file passed to -N, there may be files that were added after indexing was started but before the index file was written. This could result in a file never being added to the index.
Another option is to maintain a parallel directory tree that contains symlinks pointing to the main files. When a new file is added you create a symlink to the real file in the parallel directory tree. Then just index the symlink directory to generate the incremental index.
This option has the disadvantage that you need to have a central program that creates the new files that can also create the symlinks. But, indexing is quite fast since swish only has to look at the files that need to be indexed. When you run full indexing you simply unlink (delete) all the symlinks.
Both of these methods have issues where files could end up in both indexes, or files being left out of an index. Use of file locks while indexing, and hash lookups during searches can help prevent these problems.
[ TOC ]
You can either merge the two indexes into a single index with the -M switch, or use the -f switch to specify more than one index while searching.
[ TOC ]
$Id: SWISH-FAQ.pod,v 1.10 2001/06/11 23:42:53 whmoseley Exp $
[ TOC ]