ht://Dig Frequently Asked Questions

Frequently Asked Questions



ht://Dig Copyright © 1995-2000 The ht://Dig Group
Please see the file COPYING for license information.

This FAQ is compiled by the ht://Dig developers, and the most recent version is available at <http://www.htdig.org/FAQ.html>. Questions (and answers!) are greatly appreciated. Please send questions and/or answers to the ht://Dig user mailing list at <htdig@htdig.org>.



Questions



1. General

1.1. Can I search the internet with ht://Dig?
1.2. Can I index the internet with ht://Dig?
1.3. What's the difference between htdig and ht://Dig?
1.4. I sent mail to Andrew or Geoff or Gilles, but I never got a response!
1.5. I sent a question to the mailing list but I never got a response!
1.6. I have a great idea/patch for ht://Dig!
1.7. Is ht://Dig Y2K compliant?
1.8. I think I found a bug. What should I do?
1.9. Does ht://Dig support phrase or near matching?
1.10. What are the practical and/or theoretical limits of ht://Dig?
1.11. Do any ISPs offer ht://Dig as part of their web hosting services?



2. Getting ht://Dig

2.1. What's the latest version of ht://Dig?
2.2. Are there binary distributions of ht://Dig?
2.3. Are there mirror sites for ht://Dig?
2.4. Is ht://Dig available by ftp?
2.5. Are patches around to upgrade between versions?



3. Compiling

3.1. When I compile ht://Dig I get an error about libht.a.
3.2. I get an error about -lg
3.3. I'm compiling on Digital Unix and I get messages about "unresolved" and "db_open."
3.4. I'm compiling on FreeBSD and I get lots of messages about '___error' being unresolved.
3.5. I'm compiling on HP/UX and I get a complaint about "Large Files not supported."



4. Configuration

4.1. How come I can't index my site?
4.2. How can I change the output format of htsearch?
4.3. How do I index pages that start with '~'?
4.4. Can I use multiple databases?
4.5. OK, I can use multiple databases. Can I merge them into one?
4.6. Wow, ht://Dig eats up a lot of disk space. How can I cut down?
4.7. Can I use SSI or other CGIs in my htsearch results?
4.8. How do I index Word or PostScript documents?
4.9. How do I index PDF files without acroread?
4.10. How do I index documents in other languages?
4.11. How do I get rotating banner ads in search results?
4.12. How do I index numbers in documents?
4.13. How can I call htsearch from a hypertext link, rather than from a search form?
4.14. How do I restrict a search to only meta keywords entries in documents?
4.15. Can I use meta tags to prevent htdig from indexing certain files?



5. Troubleshooting

5.1. I can't seem to index more than X documents in a directory.
5.2. I can't index PDF files.
5.3. When I run "rundig," I get a message about "DATABASE_DIR" not being found.
5.4. When I run htmerge, it stops with an "out of diskspace" message.
5.5. I have problems running rundig from cron under Linux.
5.6. When I run htmerge, it stops with an "Unexpected file type" message.
5.7. When I run htsearch, I get lots of Internal Server Errors (#500).
5.8. I'm having problems with indexing words with accented characters.
5.9. When I run htmerge, it stops with a "Word sort failed" message.
5.10. When htsearch has a lot of matches, it runs extremely slowly.
5.11. When I run htsearch, it gives me a count of matches, but doesn't list the matching documents.
5.12. I can't seem to index documents with names like left_index.html with htdig.
5.13. I get Premature End of Script Headers errors when running htsearch.
5.14. I get Segmentation faults when running htdig, htsearch or htfuzzy.
5.15. Why does htdig 3.1.3 mangle URL parameters that contain bare "&" characters?
5.16. When I run htmerge, it stops with an "Unable to open word list file '.../db.wordlist'" message.
5.17. When using Netscape, htsearch always returns the "No match" page.



Answers



1. General

1.1. Can I search the internet with ht://Dig?

No, ht://Dig is a system for indexing and searching a small set of sites or an intranet. It is not meant to replace any of the many internet-wide search engines.

1.2. Can I index the internet with ht://Dig?

No, as above, ht://Dig is not meant as an internet-wide search engine. While there is theoretically nothing to stop you from indexing as much as you wish, practical considerations (e.g. time, disk space, memory, etc.) will limit this.

1.3. What's the difference between htdig and ht://Dig?

The complete ht://Dig package consists of several programs, one of which is called "htdig." This program performs the "digging" or indexing of the web pages. Of course an index doesn't do you much good without programs to sort it, search through it, etc., so the package also includes programs such as htmerge and htsearch.

1.4. I sent mail to Andrew or Geoff or Gilles, but I never got a response!

Andrew no longer does much work on ht://Dig. He has started a company, called Contigo Software, and is quite busy with that. To contact any of the current developers, send mail to <htdig3-dev@htdig.org>.

Geoff and Gilles are currently the maintainers of ht://Dig, but they are both volunteers. So while they do read all the e-mail they receive, they may not respond immediately. Questions about ht://Dig in general should be sent to the <htdig@htdig.org> mailing list.

1.5. I sent a question to the mailing list but I never got a response!

Development of ht://Dig is done by volunteers. Since we all have other jobs, it may take a while before someone gets back to you.

1.6. I have a great idea/patch for ht://Dig!

Great! Development of ht://Dig continues through suggestions and improvements from users. If you have an idea (or even better, a patch), please send it to the ht://Dig mailing list so others can use it. For suggestions on how to submit patches, please check the Guidelines for Patch Submissions. If you'd like to make a feature request, you can do so through the ht://Dig bug database, either by following the link from <www.htdig.org> or by sending mail to <bugs@htdig.org>.

1.7. Is ht://Dig Y2K compliant?

ht://Dig should be Y2K compliant since it never stores dates as two-digit years. Under ht://Dig's license (GPL), there is no warranty whatsoever, to the extent permitted by law. If you would like an iron-clad, legally-binding guarantee, feel free to check the source code itself. Versions prior to 3.1.2 did have a problem parsing the Last-Modified header returned by the HTTP server, which causes incorrect dates to be stored for documents modified after February 28, 2000 (yes, it didn't recognize 2000 as a leap year). This is fixed in the current release. If you discover something else, please let us know!
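
The leap-year rule behind the pre-3.1.2 Last-Modified problem can be sketched in a few lines. This is only an illustration in Python of the Gregorian rule, not the ht://Dig source; the function names are made up for this example.

```python
import calendar


def is_leap_year(year):
    """Gregorian rule: every 4th year, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def days_in_february(year):
    """Number of days in February for the given year."""
    return 29 if is_leap_year(year) else 28


# Cross-check the hand-written rule against the standard library.
for y in (1900, 1996, 1999, 2000, 2004, 2100):
    assert is_leap_year(y) == calendar.isleap(y)
```

A date parser that skips the "divisible by 400" clause treats 2000 like 1900, so every date after February 28, 2000 comes out one day off.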

1.8. I think I found a bug. What should I do?

Well, there are probably bugs out there. You have two options for bug-reporting. You can either mail the ht://Dig mailing list at <htdig@htdig.org> or, better yet, report it to the bug database, which ensures it won't become lost amongst all of the other mail on the list. To do this, either follow the link from <www.htdig.org> or send mail to <bugs@htdig.org>. Please try to include as much information as possible, including the version of ht://Dig, the OS, and anything else that might be helpful. Often, running the programs with one "-v" or more (e.g. "-vvv") gives useful debugging information. If you are unsure whether the problem is a bug or a configuration problem, you should discuss the problem on htdig@htdig.org (after carefully reading the FAQ and searching the mail archives, of course) to sort out what it is. The mailing list has a wider audience, so you're more likely to get help with configuration problems there than by reporting them to the bug database.

1.9. Does ht://Dig support phrase or near matching?

Phrase searching has been added for the 3.2 release, which is currently in the beta phase (3.2.0b1 as of this writing). Anyone who wishes to live on the bleeding edge (literally) to test out the phrase searching should e-mail the developer list at <htdig3-dev@htdig.org>.

1.10. What are the practical and/or theoretical limits of ht://Dig?

The code itself doesn't put any real limit on the number of pages. There are several sites in the hundreds of thousands of pages. As for practical limits, it depends a lot on how many pages you plan on indexing. Some operating systems limit files to 2 GB in size, which can become a problem with a large database. There are also slightly different limits to each of the programs. Right now htmerge performs a sort on the words indexed. Most sort programs use a fair amount of RAM and temporary disk space as they assemble the sorted list. The htdig program stores a fair amount of information about the URLs it visits, in part to only index a page once. This takes a fair amount of RAM. With cheap RAM, it never hurts to throw more memory at indexing larger sites. In a pinch, swap will work, but it obviously really slows things down.

1.11. Do any ISPs offer ht://Dig as part of their web hosting services?

Yes. A list of such ISPs is available at http://www.htdig.org/isp.html.


2. Getting ht://Dig

2.1. What's the latest version of ht://Dig?

The latest version is 3.1.5 as of this writing. A beta version of the 3.2 code, 3.2.0b1, is also available for those who wish to test it.

2.2. Are there binary distributions of ht://Dig?

We're trying to get consistent binary distributions for popular platforms. Contributed binary releases will go in http://www.htdig.org/files/binaries/ and contributions may be placed in ftp://ftp.htdig.org/incoming/.

Anyone who would like to make consistent binary distributions of ht://Dig should at least sign up to the htdig3-announce mailing list.

2.3. Are there mirror sites for ht://Dig?

Not at the moment. Currently, there is only the main server at <www.htdig.org>. If you'd be willing to mirror the site, please contact <htdig3-dev@htdig.org>.

2.4. Is ht://Dig available by ftp?

Yes. You can find the current versions and several older versions at <ftp.htdig.org>.

2.5. Are patches around to upgrade between versions?

Most versions are also distributed as a patch to the previous version's source code. The most recent exception to this was version 3.1.0b1. Since this version switched from the GDBM database to Berkeley DB 2 (DB2), the new database package needed to be shipped with the distribution. This made the potential patch almost as large as the regular distribution. Update patches resumed with version 3.1.0b2.




3. Compiling

3.1. When I compile ht://Dig I get an error about libht.a

This usually indicates that libstdc++ is either not installed or installed incorrectly. To get libstdc++ or any other GNU tool, check ftp://ftp.gnu.org/pub/gnu/

3.2. I get an error about -lg

This is due to a bug in the Makefile.config.in of version 3.1.0b1. Remove all "-ggdb" flags in Makefile.config.in. Then type "./config.status" to rebuild the Makefiles and recompile. This bug is fixed in version 3.1.0b2.

3.3. I'm compiling on Digital Unix and I get messages about "unresolved" and "db_open."

Answer contributed by George Adams <learningapache@my-dejanews.com>

What you're seeing are problems related to the Berkeley DB library. htdig needs a fairly modern version of db, which is why it ships with one that works. (See that -L../db-2.4.14/dist line? That's where htdig's db library is.)

The solution is to modify the c++ command so it explicitly references the correct libdb.a. You can do this by replacing the "-ldb" directive in the c++ command with "../db-2.4.14/dist/libdb.a". This problem has been resolved as of version 3.1.0.

3.4. I'm compiling on FreeBSD and I get lots of messages about '___error' being unresolved.

Answer contributed by Laura Wingerd <laura@perforce.com>

I got a clean build of htdig-3.1.2 on FreeBSD 2.2.8 by taking -D_THREAD_SAFE out of CPPFLAGS, and setting LIBS to null, in db/dist/configure.

3.5. I'm compiling on HP/UX and I get a complaint about "Large Files not supported."

The db/ package included with ht://Dig seems to be unable to finish building on HP/UX 10.20 in particular. After running the top-level configure script, cd into db/dist and type:

    ./configure --disable-bigfile

Then continue with the normal compilation.


4. Configuration

4.1. How come I can't index my site?

There are a variety of reasons ht://Dig won't index a site. To get to the bottom of things, it's advisable to turn on some debugging output from the htdig program. When running from the command line, try "-vvv" in addition to any other flags. This will add debugging output, including the responses from the server.

4.2. How can I change the output format of htsearch?

Answer contributed by: Malka Cymbalista <vumalki@ultra1.weizmann.ac.il>

You can change the output format of htsearch by creating different header, footer and result files that specify how you want the output to look. You then create a configuration file that specifies which files to use. In the html document that links to the search, you specify which configuration file to use.

So the configuration file would have the lines:

    search_results_header: ${common_dir}/ccheader.html
    search_results_footer: ${common_dir}/ccfooter.html
    template_map:  Long long builtin-long \
                   Short short builtin-short \
                   Default default ${common_dir}/ccresult.html
    template_name: Default

You would also put into the configuration file any other lines from the default configuration file that apply to htsearch.

The files ${common_dir}/ccheader.html and ${common_dir}/ccfooter.html and ${common_dir}/ccresult.html would be tailored to give the output in the desired format.

Assuming your configuration file is called cc.conf, the html file that links to the search has to set the config parameter equal to cc. The following line would do it:

    <input type=hidden name=config value="cc">

4.3. How do I index pages that start with '~'?

ht://Dig requests pages whose path starts with '~' just as a web browser would, so it should index them normally. If you are having problems with this, check your server log files to see what file the server is attempting to return.

4.4. Can I use multiple databases?

Yes, though you may find it easier to have one larger database and use restrict or exclude fields on searches. To use multiple databases, you will need a config file for each database. Then each file will set the "database_base" option to change the name of the databases.

4.5. OK, I can use multiple databases. Can I merge them into one?

As of version 3.1.0, you can do this with the -m option to htmerge.

4.6. Wow, ht://Dig eats up a lot of disk space. How can I cut down?

There are several ways to cut down on disk space. One is not to use the "-a" option, which creates work copies of the databases. Work copies essentially double the disk usage. Changing configuration variables can also help cut down on disk usage. Decreasing max_head_length and max_meta_description_length will cut down on the size of the excerpts stored (in fact, if you don't have use_meta_description set, you can set max_meta_description_length to 0!). Other techniques include removing the db.wordlist file and adding more words to the bad_words file.

4.7. Can I use SSI or other CGIs in my htsearch results?

Not really. Apache will not parse CGI output for SSI statements (see the Apache FAQ). The htsearch CGI itself does not understand SSI markup either, so it cannot include other CGIs. However, it is possible to do it the other way round: you can have the htsearch results included in your dynamic page.

The easiest approach is using SSI with the help of the script_name configuration file attribute. See the contrib/scriptname/ directory for a small example using SSI.

For CGI and PHP, you need a "wrapper" script to do that. For Perl script examples, see the files in contrib/ewswrap. The PHP guide (see the contributed guides at http://www.htdig.org/contrib/guides.html) not only describes a wrapper script for PHP, but also offers a step-by-step tutorial on the basics of ht://Dig and is well worth reading. For other alternatives, see question 4.11.

4.8. How do I index Word or PostScript documents?

This must be done with an external parser or converter. A sample of such a parser is the contrib/parse_doc.pl Perl script. It calls the appropriate document-to-text converters: catdoc to parse Word documents, and ps2ascii to parse PostScript files. The comments in the Perl script indicate where you can obtain these converters.

As of htdig version 3.1.4, you can use an external converter, such as the contrib/conv_doc.pl Perl script, instead of an external parser. This script is simpler to write and maintain than a full external parser, as it just converts input documents to text/plain or text/html, and passes that back to htdig to be parsed. Parsing is more consistent across document types as a result.

The most recent versions of parse_doc.pl and conv_doc.pl are available on our web site.

See question 4.9 for an example of configuring parse_doc.pl, or see the comments in conv_doc.pl for an example of its usage.

4.9. How do I index PDF files without acroread?

This too can be done with an external parser or converter, in combination with the pdftotext program that is part of the xpdf 0.90 package. A sample of such a parser is the contrib/parse_doc.pl Perl script. It uses pdftotext to parse PDF documents, then processes the text into external parser records. The most recent version of parse_doc.pl is available on our web site.

For example, you could put this in your configuration file:

    external_parsers: application/msword /usr/local/bin/parse_doc.pl \
                      application/postscript /usr/local/bin/parse_doc.pl \
                      application/pdf /usr/local/bin/parse_doc.pl

You would also need to configure the script to indicate where all of the document-to-text converters are installed.

As of htdig version 3.1.4, you can use an external converter, such as the contrib/conv_doc.pl Perl script, also available on our web site, instead of an external parser. This script is simpler, and offers more consistent parsing, because the final work is done by htdig's internal parsers. See the comments inside this script for an example of its usage.

Whether you use this external parser or converter, or acroread with the pdf_parser attribute, be sure to set the max_doc_size attribute to a value larger than the size of your largest PDF file if you want to index PDF files successfully. PDF documents cannot be parsed if they are truncated.

This also raises the questions of why two different methods of indexing PDFs are supported, and which method is preferred. The built-in PDF support, which uses acroread to convert the PDF to PostScript, was the first method provided. It had a few problems: acroread is not open source, it is not supported on all systems on which ht://Dig can run, and for some PDFs, the PostScript that acroread generated was very difficult to parse into indexable text. Also, the built-in PDF support expected PDF documents to use the same character encoding as is defined in your current locale, which isn't always the case. The external parser, which uses pdftotext, was developed to overcome these problems. xpdf 0.90 is open source, and its pdftotext utility works very well as an indexing tool. It also converts various PDF encodings to the Latin 1 set. It is the opinion of the developers that this is the preferred method. However, some users still prefer to stick with acroread, as it works well for them, and is a little easier to set up if you've already installed Acrobat.

Also, pdftotext still has some difficulty handling text in landscape orientation, even with its new -raw option in 0.90, so if you need to index such text in PDFs, you may still get better results with acroread.

See also question 5.2 below.

4.10. How do I index documents in other languages?

The first and most important thing you must do, to allow ht://Dig to properly support international characters, is to define the correct locale for the language and country you wish to support. This is done by setting the locale attribute (see question 5.8). The next step is to configure ht://Dig to use dictionary and affix files for the language of your choice. These can be the same dictionary and affix files as are used by the ispell software. A collection of these is available from Geoff Kuenning's International Ispell Dictionaries page, and we're slowly building a collection of word lists on our web site.

For example, if you install German dictionaries in common/german, you could use these lines in your configuration file:

    locale:               de_DE
    lang_dir:             ${common_dir}/german
    bad_word_list:        ${lang_dir}/bad_words
    endings_affix_file:   ${lang_dir}/german.aff
    endings_dictionary:   ${lang_dir}/german.0
    endings_root2word_db: ${lang_dir}/root2word.db
    endings_word2root_db: ${lang_dir}/word2root.db

You can build the endings database with htfuzzy endings. (This command may actually take days to complete, for releases older than 3.1.2. Current releases use faster regular expression matching, which will speed this up by a few orders of magnitude.) You will also need to redefine the synonyms file if you wish to use the synonyms search algorithm. This file is not included with most of the dictionaries, nor is the bad_words file. Current versions of ht://Dig only support 8-bit characters, so languages such as Chinese and Japanese, which require 16-bit characters, are not currently supported.

4.11. How do I get rotating banner ads in search results?

While htsearch doesn't currently provide a means of doing SSI on its output, or calling other CGI scripts, it does have the capability of using environment variables in templates.

The easiest way to get rotating banners in htsearch is to replace htsearch with a wrapper script that sets an environment variable to the banner content, or whatever dynamically generated content you want. Your script can then call the real htsearch to do the work. The wrapper script can be written as a shell script, or in Perl, C, C++, or whatever you like. You'd then need to reference that environment variable in header.html (or wrapper.html if that's what you're using), to indicate where the dynamic content should be placed.

If the dynamic content is generated by a CGI script, your new wrapper script which calls this CGI would then have to strip out the parts that you don't want embedded in the output (headers, some tags) so that only the relevant content gets put into the environment variable you want. You'd also have to make sure this CGI script doesn't grab the POST data or get confused by the QUERY_STRING contents intended for htsearch. Your script should not take anything out of, or add anything to, the QUERY_STRING environment variable.

An alternative approach is to have a cron job that periodically regenerates a different header.html or wrapper.html with the new banner ad, or changes a link to a different pre-generated header.html or wrapper.html file. For other alternatives, see question 4.7.
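
The wrapper approach described above might look roughly like the following Python sketch. Everything specific in it is an assumption for illustration: the path to the renamed real htsearch binary, the BANNER_HTML variable name, and the banner markup are all hypothetical and would need to match your own installation and templates.

```python
import os
import random

# Hypothetical values for illustration only; adjust for your installation.
REAL_HTSEARCH = "/usr/local/lib/htdig/htsearch.real"
BANNER_VAR = "BANNER_HTML"
BANNERS = [
    '<a href="/ads/1"><img src="/ads/1.gif" alt="Ad 1"></a>',
    '<a href="/ads/2"><img src="/ads/2.gif" alt="Ad 2"></a>',
]


def build_env(environ, banner):
    """Copy the CGI environment and add the banner.

    QUERY_STRING and every other variable are passed through untouched.
    """
    env = dict(environ)
    env[BANNER_VAR] = banner
    return env


def main():
    env = build_env(os.environ, random.choice(BANNERS))
    # Replace this process with the real htsearch. Standard input (the POST
    # data) and standard output are inherited unchanged.
    os.execve(REAL_HTSEARCH, [REAL_HTSEARCH], env)
```

When installed as the CGI, the script would end with a call to main(). Because the wrapper never reads standard input and copies the environment as-is, the POST data and QUERY_STRING reach htsearch unmodified.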

4.12. How do I index numbers in documents?

By default, htdig doesn't treat numbers without letters as words, so it doesn't index them. To change this behavior, you must set the allow_numbers attribute to true, and rebuild your index from scratch using rundig or htdig with the -i option, so that bare numbers get added to the index.

4.13. How can I call htsearch from a hypertext link, rather than from a search form?

If you change the search.html form to use the GET method rather than POST, you can see the URLs complete with all the arguments that htsearch needs for a query. Here is an example:

    http://www.grommetsRus.com/cgi-bin/htsearch?config=htdig&restrict=&exclude=&method=and&format=builtin-long&words=grapple+grommets

which can actually be simplified to:

    http://www.grommetsRus.com/cgi-bin/htsearch?method=and&words=grapple+grommets

with the current defaults. The "&" character acts as a separator for the input parameters, while the "+" character acts as a space character within an input parameter. Most non-alphanumeric characters should be hex-encoded following the convention for URL encoding (e.g. "%" becomes "%25", "+" becomes "%2B", etc). Any htsearch input parameter that you'd use in a search form can be added to the URL in this way. This can be embedded into an <a href="..."> tag.
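
If you generate such links from a script, the encoding rules above can be left to a library. The following is a minimal Python sketch, assuming the standard urllib.parse and html modules; the helper names htsearch_url and as_link are made up for this example.

```python
from html import escape
from urllib.parse import urlencode


def htsearch_url(base, **params):
    """Return base + '?' + the URL-encoded htsearch parameters."""
    # urlencode joins parameters with '&', turns spaces into '+', and
    # hex-encodes characters such as '%', '+' and '&' inside values.
    return base + "?" + urlencode(params)


def as_link(url, text):
    """Wrap a URL in an <a href> tag, escaping it for use inside HTML."""
    return '<a href="%s">%s</a>' % (escape(url, quote=True), escape(text))


url = htsearch_url(
    "http://www.grommetsRus.com/cgi-bin/htsearch",
    method="and",
    words="grapple grommets",
)
print(url)
print(as_link(url, "Search for grapple grommets"))
```

The first line printed is the simplified URL shown above. In the generated tag, the "&" separators are written as "&amp;", which is how they should appear inside an HTML attribute.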

4.14. How do I restrict a search to only meta keywords entries in documents?

First of all, you do not do this by using the "keywords" field in the search form. This seems to be a frequent cause of confusion. The "keywords" input parameter to htsearch has absolutely nothing to do with searching meta keywords fields. It actually predates the addition of meta keyword support in 3.1.x. A better choice of name for the parameter would have been "requiredwords", because that's what it really means: a list of words that are all required to be found somewhere in the document, in addition to the words the user specifies in the search form.

To restrict a search to meta keywords only, you must set all factors other than keywords_factor to 0, and for 3.1.x, you must then reindex your documents. In 3.2, you'll be able to change factors at search time without needing to reindex.

4.15. Can I use meta tags to prevent htdig from indexing certain files?

Yes, in each HTML file you want to exclude, add the following between the <HEAD> and </HEAD> tags:

    <META NAME="robots" CONTENT="noindex, follow">

Doing so will allow htdig to still follow links to other documents, but will prevent this document from being put into the index itself. You can also use "nofollow" to prevent following of links. See the section on Recognized META information for more details. For documents produced automatically by MhonArc, you can have that line inserted automatically by putting it in the MhonArc resource file, in the sections IDXPGBEGIN and TIDXPGBEGIN.

You can also use the noindex_start and noindex_end attributes to define one set of tags which will mark sections to be stripped out of documents, so they don't get indexed, or you can mark sections with the non-DTD <noindex> and </noindex> tags.
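
To confirm that a page really carries the robots tag before you reindex, a quick check along the following lines may help. It is a sketch using Python's standard html.parser module, not htdig's own parser, and the class and function names are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            if directive.strip():
                self.directives.add(directive.strip().lower())


def robots_directives(html_text):
    """Return the set of robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


page = '<HEAD><META NAME="robots" CONTENT="noindex, follow"></HEAD>'
print(sorted(robots_directives(page)))
```

A page is excluded from the index when "noindex" is in the returned set, and its links are skipped when "nofollow" is.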




5. Troubleshooting

5.1. I can't seem to index more than X documents in a directory.

This usually has to do with the default document size limit. If you set "max_doc_size" in your config file to something large enough to read in the directory index (try 100000 for 100K), this should fix the problem. Of course this will require more memory to read the larger file.

5.2. I can't index PDF files.

As above, this usually has to do with the default document size. What happens is ht://Dig will read in part of a PDF file and try to index it. This usually fails. Try setting "max_doc_size" in your config file to a larger value than the size of your largest PDF file.

Another common problem is that htdig can't find the acroread program, which it uses to convert PDF files to PostScript. The solution is to obtain and install Adobe Acrobat Reader 3.0, if it's available for your system. You may also need to set the pdf_parser attribute to the correct location and options for acroread.

There is a bug in Adobe Acrobat Reader version 4, in its handling of the -pairs option, which causes a segmentation violation when using it with htdig 3.1.2 or earlier. There is a workaround for this as of version 3.1.3: you must remove the -pairs option from your pdf_parser definition, if it's there. However, acroread version 4 is still very unstable (on Linux, anyway) so it is not recommended as a PDF parser. An alternative is to use an external parser with the xpdf 0.90 package installed on your system, as described in question 4.9 above.
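
If the documents live on a disk you can reach, a small script can tell you how large max_doc_size needs to be. The sketch below is an assumption-laden helper, not part of ht://Dig: it walks a local directory tree (htdig itself fetches over HTTP, so dynamically generated pages are not covered), and the 25% headroom factor is an arbitrary choice.

```python
import os


def largest_file(root, suffix=".pdf"):
    """Return (size_in_bytes, path) of the largest matching file under root.

    Returns (0, None) when no file matches.
    """
    best = (0, None)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix):
                path = os.path.join(dirpath, name)
                size = os.path.getsize(path)
                if size > best[0]:
                    best = (size, path)
    return best


def suggest_max_doc_size(root, suffix=".pdf", headroom=1.25):
    """Suggest a max_doc_size strictly larger than the largest matching file."""
    size, _path = largest_file(root, suffix)
    return int(size * headroom) + 1
```

For example, print("max_doc_size:", suggest_max_doc_size("/var/www/htdocs")) would produce a line you could adapt for your configuration file; the directory shown is only a placeholder.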

5.3. When I run "rundig," I get a message about "DATABASE_DIR" not being found.

This is due to a bug in the Makefile.in file in version 3.1.0b1. The easiest fix is to edit the rundig file and change the line "TMPDIR=@DATABASE_DIR@" to set TMPDIR to a directory with a large amount of temporary disk space for htmerge. This bug is fixed in version 3.1.0b2.

5.4. When I run htmerge, it stops with an "out of diskspace" message.

This means that htmerge has run out of temporary disk space for sorting. Either in your "rundig" script (if you run htmerge through that) or before you run htmerge, set the variable TMPDIR to a temp directory with lots of space.

5.5. I have problems running rundig from cron under Linux.

This problem commonly occurs on Red Hat Linux 5.0 and 5.1, because of a bug in vixie-cron. It causes htmerge to fail with a "Word sort failed" error. It's fixed in Red Hat 5.2. You can install vixie-cron-3.0.1-26.{arch}.rpm from a 5.2 distribution to fix the problem on 5.0 or 5.1. A quick fix for the problem is to change the first line of rundig to "#!/bin/ash", which will run the script through the ash shell, but this doesn't solve the underlying problem.

5.6. When I run htmerge, it stops with an "Unexpected file type" message.

Often this is because the databases are corrupt. Try removing them and rebuilding. If this doesn't work, some have found that the solution for question 3.2 works for this as well. This should be fixed in version 3.1.0b2.

5.7. When I run htsearch, I get lots of Internal Server Errors (#500).

Answer contributed by David R. Barstis <dbarstis@nd.edu>

If you are running Apache under Solaris, try adding "PassEnv LD_LIBRARY_PATH" to Apache's httpd.conf file. These errors can also be caused by insufficient memory, so if you often run memory-intensive programs (including htdig and htmerge themselves!), htsearch may run out of memory.

See also question 5.13.

5.8. I'm having problems with indexing words with accented characters.

Most of the time, this is caused by either not setting or incorrectly setting the locale attribute. The default locale for most systems is the "portable" locale, which strips everything down to standard ASCII. Most systems expect something like locale: en_US or locale: fr_FR. Locale files are often found in /usr/share/locale or in the location given by the $LANGUAGE environment variable. See also question 4.10.

5.9. When I run htmerge, it stops with a "Word sort failed" message.

There are three common causes of this. First of all, the sort program may be running out of temporary file space. Fix this by freeing up some space where sort puts its temporary files, or change the setting of the TMPDIR environment variable to a directory on a volume with more space. A second common problem is on systems with a BSD version of the sort program (such as FreeBSD or NetBSD). This program uses the -T option as a record separator rather than an alternate temporary directory. On these systems, you must remove the TMPDIR environment variable from rundig, or change the code in htmerge/words.cc not to use the -T option. A third cause is the cron program on Red Hat Linux 5.0 or 5.1. (See question 5.5 above.)

5.10. When htsearch has a lot of matches, it runs extremely slowly.

When you run htsearch with no customization, on a large database, and it gets a lot of hits, it tends to take a long time to process those hits. Some users with large databases have reported much higher performance, for searches that yield lots of hits, by setting the backlink_factor attribute in htdig.conf to 0, and sorting by score. The scores calculated this way aren't quite as good, but htsearch can process hits much faster when it doesn't need to look up the db.docdb record for each hit, just to get the backlink count, date or title, either for scoring or for sorting. This affects versions 3.1.0b3 and up. In version 3.2, currently under development, the databases will be structured differently, so it should perform searches more quickly.

5.11. When I run htsearch, it gives me a count of matches, but doesn't list the matching documents.

This most commonly happens when you run htsearch while the database is currently being rebuilt or updated by htdig. It can also happen when db.docdb, or db.docs.index (which maps document IDs used in db.words.db to URLs used to look up records in db.docdb), is incomplete or corrupted. You'll likely need to rebuild your database from scratch if it's corrupted. Older versions of ht://Dig were susceptible to database corruption of this sort. Versions 3.1.2 and later are much more stable.

5.12. I can't seem to index documents with names like left_index.html with htdig.

There is a bug in the implementation of the remove_default_doc attribute in htdig versions 3.1.0, 3.1.1 and 3.1.2, which causes it to match more than it should. The default value for this attribute is "index.html", so any URL in which the filename ends with this string (rather than matches it entirely) will have the filename stripped off. This is fixed in version 3.1.3.

5.13. I get Premature End of Script Headers errors when running htsearch.

This happens when htsearch dies before putting out a "Content-Type" header. If you are running Apache under Solaris, first try the solution described in question 5.7. If that doesn't work, or you're running on another system, try running "htsearch -vvv" directly from the command line to see where and why it's failing. It should prompt you for the search words, as well as the format.
See also questions 5.7 and 5.14.

5.14. I get Segmentation faults when running htdig, htsearch or htfuzzy.

Despite a great deal of debugging of these programs, we haven't been able to completely eliminate all such problems on all platforms. If you're running htsearch or htfuzzy on a BSDI system, a common cause of core dumps is a conflict between the GNU regex code bundled in htdig 3.1.2 and later, and the BSD C or C++ library. The solution is to use the BSD library's own regex code instead, as summarized by Joe Jah:

This solution may work on some other platforms as well (we haven't heard one way or the other), but will definitely not work on some platforms. For instance, on libc5-based Linux systems, the bundled regex code works fine by default, but using libc5's regex code causes core dumps.

Users of Cobalt Raq or Qube servers have complained of segmentation faults in htdig. Apparently this is due to problems in their C++ libraries, which are fixed in their experimental compiler and libraries. The following commands should install the packages you need:

rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/binutils-2.8.1-3C1.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-c++-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-g77-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/egcs-objc-1.0.2-9.mips.rpm
rpm -Uvh ftp://ftp.cobaltnet.com/pub/experimental/libstdc++-devel-2.8.0-9.mips.rpm
rpm -Uvh --force ftp://ftp.cobaltnet.com/pub/products/current/RPMS/gcc-2.7.2-C2.mips.rpm
You may have to remove the libg++ package, if you have it installed before installing libstdc++, because of conflicts in these packages. Be sure to do a "make clean" before a "make", to remove any object files compiled with the old compiler and headers.

For other causes of segmentation faults, or in other programs, getting a stack backtrace after the fault can be useful in narrowing down the problem. E.g.: try "gdb /path/to/htsearch /path/to/core", then enter the command "bt". You can also try running the program directly under the debugger, rather than attempting a post-mortem analysis of the core dump. Options to the program can be given on gdb's "run" command, and after the program is suspended on fault, you can use the "bt" command. This may give you enough information to find and fix the problem yourself, or at least it may help others on the htdig mailing list to point out what to do next.
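
If you want the backtrace as plain text to paste into a mailing list message, the post-mortem steps above can be wrapped in a small shell function. This is only a sketch: core_backtrace is a hypothetical helper name, and it assumes a gdb recent enough to accept the -batch and -ex options.

```shell
# Hypothetical helper: print a stack backtrace from a core file
# without starting an interactive gdb session.
#   $1 = path to the program that crashed (e.g. /path/to/htsearch)
#   $2 = path to the core file it left behind
core_backtrace() {
    # -batch exits when done; -ex bt runs the "bt" command after loading the core
    gdb -batch -ex bt "$1" "$2"
}
```

It would be called as "core_backtrace /path/to/htsearch /path/to/core", with the output redirected to a file if desired.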

5.15. Why does htdig 3.1.3 mangle URL parameters that contain bare "&" characters?

This is a known bug in 3.1.3, and is fixed with this patch. You can apply the patch by changing to the main source directory for htdig-3.1.3, and using the command "patch -p1 < /path/to/htdig-3.1.3-urlparmbug.patch". This is also fixed as of version 3.1.4.

5.16. When I run htmerge, it stops with an "Unable to open word list file '.../db.wordlist'" message.

The most common cause of this error is that htdig did not manage to index any documents, and so it did not create a word list. You should repeat the htdig or rundig command with the -vvv option to see where and why it is failing. See question 4.1.

5.17. When using Netscape, htsearch always returns the "No match" page.

Check your search form. Chances are there is a hidden input field with no value defined. For example, one user had

<input type=hidden name=restrict>
in his search form, instead of
<input type=hidden name=restrict value="">
The problem is that Netscape sets the missing value to a default of "  " (two spaces), rather than an empty string. For the restrict parameter, this is a problem, because htsearch won't likely find any URLs with two spaces in them. Other input parameters may similarly pose a problem.






$Author: ghutchis $
Last modified: $Date: 2000/02/25 02:31:54 $