Berkeley Digital Library SunSITE

SWISH-E

Spidering

or, FILESYSTEM vs. HTTP

SWISH-E has been enhanced to support different file access methods. The current version supports access via the FILESYSTEM or via HTTP. The method is chosen at indexing time. The index format is identical for both; an index created by an executable compiled with one method can be read by an executable compiled with another.

How to Choose an Access Method

The FILESYSTEM access method is chosen by default. You can pick a method explicitly with the -S option during indexing: "-S fs" for the filesystem and "-S http" for spidering.

Excluding a method during compilation

If you would like to exclude either method from compilation, you can do so by unsetting the appropriate ALLOW_XXX_INDEXING_DATA_SOURCE variable in the config.h file.

Required Common Directives

IndexDir requires a value appropriate to the access method: use filenames or directories for the FILESYSTEM method and URLs for the HTTP method.

FILESYSTEM only directives

The following directives are now only available with the FILESYSTEM access method:

HTTP only directives

The HTTP access method implements the following directives.

MaxDepth: (default 5)
This defines how many links the spider should follow before stopping. A value of 0 configures the spider to traverse all links.
Delay: (default 60)
The number of seconds to wait between issuing requests to a server.
TmpDir: (default /var/tmp)
The location of a writeable temp directory on your system. The HTTP access method tells the Perl helper to place its files there.
SpiderDirectory: (default ./)
The location of the Perl helper script. Remember, if you use a relative directory, it is relative to your directory when you run SWISH-E, not to the directory that SWISH-E is in.
EquivalentServer: (default nothing)
This allows you to deal with servers that respond to multiple DNS names. Each directive should list all the method/name combinations that should be considered equivalent. If you have multiple directives, each one defines its own set of equivalent servers.
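
As an illustration, the HTTP directives above might be combined in a configuration file along these lines. This is only a sketch: it assumes the usual one-directive-per-line "Name value" layout of a SWISH-E configuration file, and the host names and values are placeholders rather than recommendations.

```
IndexDir http://www.example.com/
MaxDepth 3
Delay 30
TmpDir /var/tmp
SpiderDirectory ./
EquivalentServer http://www.example.com http://example.com
```

With a file like this, indexing would be run with the "-S http" option so that IndexDir is interpreted as a URL.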

Writing your own File Access Method

SWISH-E has been rearchitected to allow new file access methods to be implemented without requiring any changes to the central engine. To implement a new method, the following functions need to be implemented:

int parseconfline(char *line)
This function gives your code a chance to define its own configuration file directives. The function will never get called with comments or blank lines. Return 0 if the directive is unrecognized.
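
A minimal sketch of such a function is shown here. The directive name MyRetryCount and the variable my_retry_count are hypothetical and exist only for illustration. Returning a nonzero value for a recognized directive is an assumption, since the text above only specifies the return value 0.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical setting owned by this access method (not a real SWISH-E directive). */
static int my_retry_count = 3;

/* Sketch of parseconfline(): returns 0 when the directive is unrecognized.
   Returning 1 for a recognized directive is an assumption. */
int parseconfline(char *line)
{
    static const char name[] = "MyRetryCount";  /* hypothetical directive name */
    size_t len = sizeof name - 1;

    assert(line != NULL);

    /* The directive name must match exactly and be followed by whitespace. */
    if (strncmp(line, name, len) != 0 || !isspace((unsigned char)line[len]))
        return 0;

    /* atoi() skips the whitespace between the name and its value. */
    my_retry_count = atoi(line + len);
    return 1;
}
```

A line such as "MyRetryCount 7" would be claimed by this method, while any other directive is left for the rest of SWISH-E to handle.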

void indexpath(char *startpoint)
This function is called once for each starting point defined via the IndexDir directive. This function must call countwords() for each entity it wants indexed.

int vgetc(void *vp)
int vsize(void *vp)
These functions return the next character (or EOF) and the size of the entity being indexed, respectively. The pointer is the first argument passed to countwords().
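
To make the role of the opaque pointer concrete, here is a small sketch of the two accessors for an entity held entirely in memory. The struct mem_entity type is a hypothetical handle invented for this example; a real access method would define whatever handle suits its data source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>   /* EOF */

/* Hypothetical handle: an entity whose contents are already in memory. */
struct mem_entity {
    const char *data;  /* entity contents */
    size_t      size;  /* total number of bytes */
    size_t      pos;   /* current read position used by vgetc() */
};

/* Return the next character of the entity, or EOF once it is exhausted. */
int vgetc(void *vp)
{
    struct mem_entity *e = vp;
    assert(e != NULL);
    if (e->pos >= e->size)
        return EOF;
    return (unsigned char)e->data[e->pos++];
}

/* Return the total size of the entity in bytes. */
int vsize(void *vp)
{
    struct mem_entity *e = vp;
    assert(e != NULL);
    return (int)e->size;
}
```

The cast to unsigned char keeps character values distinct from EOF, in the same way the standard fgetc() does.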

The following function from the core engine of SWISH-E is one that you will need to use:

int countwords(void *vp, char *location, char *title, int indextitleonly)
This function must be called once for each entity you want indexed. The first argument (vp) is an opaque handle used for accessing the entity's data; it is passed back in the calls to vgetc() and vsize(). The second argument (location) is the index-specific location of the file, which is stored in the index after being modified by ReplaceRules. The third argument (title) is the descriptive title of the entity. The fourth argument (indextitleonly) should be true if the contents of the file should not be indexed.
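
The calling pattern can be sketched as follows. So that the example compiles on its own, countwords() is replaced by a stand-in that only counts calls; the real function is supplied by the SWISH-E core, and the meaning of its return value is not described here. The struct mem_entity handle, the two in-memory "documents", and the location and title formats are all hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for the core engine's countwords(); it only records the calls.
   The return value 0 is a placeholder, not documented behavior. */
static int entities_seen = 0;
static int last_indextitleonly = -1;

int countwords(void *vp, char *location, char *title, int indextitleonly)
{
    (void)vp;
    (void)title;
    assert(location != NULL);
    entities_seen++;
    last_indextitleonly = indextitleonly;
    return 0;
}

/* Hypothetical handle for an entity held in memory. */
struct mem_entity {
    const char *data;
    size_t      size;
    size_t      pos;
};

/* Sketch of indexpath(): enumerate the entities under one starting point
   and call countwords() once for each of them. */
void indexpath(char *startpoint)
{
    static const char *bodies[] = { "first document", "second document" };
    char location[256];
    char title[64];
    size_t i;

    assert(startpoint != NULL);
    for (i = 0; i < sizeof bodies / sizeof bodies[0]; i++) {
        struct mem_entity e = { bodies[i], strlen(bodies[i]), 0 };
        snprintf(location, sizeof location, "%s#%u", startpoint, (unsigned)i);
        snprintf(title, sizeof title, "Entity %u", (unsigned)i);
        countwords(&e, location, title, 0);  /* 0: index the contents too */
    }
}
```

In a real access method the handle passed as vp would be the same object that the method's vgetc() and vsize() know how to read.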


Copyright (C) 1995, 1996, 1997, 1998, 1999, 2000 Hewlett-Packard Company
Originally by Kevin Hughes, kev@kevcom.com, March 11, 1994.
SWISH-E is distributed with no warranty under the terms of the GNU Public License,
Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA
Public questions may be posted to the SWISH-E Discussion.
Document maintained at http://sunsite.berkeley.edu/SWISH-E/Manual/spidering.html by the SunSITE Manager.
Last update December 16, 1998. SunSITE Manager: manager@sunsite.berkeley.edu