AscToHTM performs statistical analysis on the document to determine at what character positions indentations occur. This information is used on the output pass to determine the indentation level for each source line.
AscToHTM attempts to indent the HTML code to match the output indentation level, to make it easier to read.
The indentations themselves will be marked up using <BLOCKQUOTE> ... </BLOCKQUOTE> tags.
Future versions (after version 4.0) of AscToHTM may offer you the option of using <DIV> tags.
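The indentation analysis described above can be illustrated with a minimal Python sketch. This is not AscToHTM's actual algorithm: the function names and the `min_share` threshold are assumptions made for illustration. It counts how often each leading-whitespace width occurs, keeps the common positions as indent stops, and then nests <BLOCKQUOTE> tags to match each line's level.

```python
from collections import Counter

def indent_levels(lines, min_share=0.1):
    """Return an indentation level (0 = leftmost) for each non-blank line."""
    # Width of leading spaces on every non-blank line.
    widths = [len(l) - len(l.lstrip(" ")) for l in lines if l.strip()]
    counts = Counter(widths)
    # Keep positions used by at least min_share of the lines (hypothetical threshold).
    stops = sorted(w for w, n in counts.items() if n / len(widths) >= min_share)
    # A line's level is the number of stops at or left of it, minus one.
    return [max(sum(1 for s in stops if s <= w) - 1, 0) for w in widths]

def blockquote_markup(lines):
    """Wrap indented runs in nested BLOCKQUOTE tags, indenting the HTML to match."""
    text = [l for l in lines if l.strip()]
    out, current = [], 0
    for line, level in zip(text, indent_levels(text)):
        while current < level:
            out.append("  " * current + "<BLOCKQUOTE>")
            current += 1
        while current > level:
            current -= 1
            out.append("  " * current + "</BLOCKQUOTE>")
        out.append("  " * current + line.strip())
    while current > 0:
        current -= 1
        out.append("  " * current + "</BLOCKQUOTE>")
    return out
```

Calling `blockquote_markup(["a", "    b", "    c", "d"])` wraps the two indented lines in a single <BLOCKQUOTE> pair.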
Some documents, especially ones dumped from Word, have hanging paragraph indents. That is, each paragraph starts at an offset to the rest of the paragraph.
AscToHTM struggles heroically with this, and tries not to treat this as text at two indent levels, but it does occasionally get confused.
If writing a text file from scratch with AscToHTM in mind, then it is best to avoid this practice.
AscToHTM detects and supports several types of bullets.
Bullet chars are lines of the type
- this is a bullet line
- this is a bullet paragraph
  because it carries over onto more lines
That is, a single character followed by the bullet line. AscToHTM can determine via statistical analysis which character, if any, is being used in this way. Special attention is paid to the '-' and 'o' characters.
Bullets of this type are given a <UL> ... <LI> ... </UL> markup.
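As a rough illustration of how a bullet character might be picked out statistically, here is a small Python sketch. It is an assumption-laden example rather than the program's real logic: `guess_bullet_char`, `to_ul` and the `min_count` threshold are hypothetical. It counts single leading characters, ignoring letters and digits except the conventional 'o', and then emits <UL> ... <LI> ... </UL> markup.

```python
from collections import Counter
import re

# One non-space character, then white space, then the bullet text.
BULLET_RE = re.compile(r"^\s*(\S)\s+\S")

def guess_bullet_char(lines, min_count=2):
    """Return the most frequent candidate bullet character, or None."""
    counts = Counter()
    for line in lines:
        m = BULLET_RE.match(line)
        if m:
            ch = m.group(1)
            # Letters and digits are too ambiguous, apart from 'o'.
            if ch == "o" or not ch.isalnum():
                counts[ch] += 1
    if not counts:
        return None
    ch, n = counts.most_common(1)[0]
    return ch if n >= min_count else None

def to_ul(lines, bullet):
    """Group bullet lines and their continuation lines into list items."""
    items = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(bullet + " "):
            items.append(stripped[len(bullet):].strip())
        elif items and stripped:
            items[-1] += " " + stripped  # continuation of the previous bullet
    return ["<UL>"] + ["<LI>" + item for item in items] + ["</UL>"]
```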
AscToHTM can spot numbered bullets. These can sometimes be confused with section headings in some documents. This is one area where the use of a document policy really pays dividends in sorting the sheep from the goats.
Numbered bullets are given a <OL TYPE=1 START=N> ... <LI> ... </OL> markup.
- Note:
- Not all browsers support this type of markup. In such cases, it's possible that the numbering of bullets will get reset to 1 every so often. However, this isn't a problem with either Netscape or Internet Explorer.
AscToHTM detects upper and lower case alphabetic bullets. These are marked up like numbered bullets, with TYPE=a.
AscToHTM detects upper and lower case roman numeral bullets. These are marked up like numbered bullets, with TYPE=i.
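The following Python sketch shows one way the opening <OL> tag could be built from the first marker of a numbered, alphabetic or roman numeral list. It is illustrative only: `ol_open_tag` and `roman_to_int` are hypothetical names, and the use of START for alphabetic and roman lists is an assumption based on the statement that they are marked up like numbered bullets. Because a marker such as "i" is ambiguous, the caller says whether the list is roman.

```python
ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(s):
    """Convert a roman numeral such as 'iv' or 'XII' to an integer."""
    values = [ROMAN[c] for c in s.lower()]
    total = 0
    for i, v in enumerate(values):
        # A smaller value before a larger one is subtracted (the i in iv).
        if i + 1 < len(values) and v < values[i + 1]:
            total -= v
        else:
            total += v
    return total

def ol_open_tag(marker, roman=False):
    """Build the opening OL tag for a list whose first marker is `marker`."""
    if marker.isdigit():
        return f"<OL TYPE=1 START={int(marker)}>"
    if roman:
        kind = "I" if marker.isupper() else "i"
        return f"<OL TYPE={kind} START={roman_to_int(marker)}>"
    kind = "A" if marker.isupper() else "a"
    # START is always numeric in HTML, so 'c' becomes 3.
    return f"<OL TYPE={kind} START={ord(marker.lower()) - ord('a') + 1}>"
```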
AscToHTM can attempt to spot sections of centred text. However, because this can easily go wrong, this option is normally switched off.
Centering is only switched on for single isolated lines, or any group of at least two lines. <CENTER> ... </CENTER> markup is used.
A definition line is a single line that appears to be defining something. Usually this is a line with either a colon (:) or an equals sign (=) in it. For example
IMHO = In my humble opinion
Address : Somewhere over the rainbow.
AscToHTM attempts to determine what definition characters are used and whether they are strong (only ever used in a definition) or weak (only sometimes used in a definition).
AscToHTM marks up definition lines by placing a <BR> on the end of the line to preserve the original line structure. Where this decision is made incorrectly, unexpected breaks can appear in the text.
AscToHTM offers the option of marking up the definition term in bold. This is not the default behaviour however.
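The definition line markup can be sketched in Python as below. This is a simplified illustration, not the program's implementation: the regular expression, the `markup_definition` name and the use of <B> for the bold term are assumptions. A line counts as a definition here if a colon or equals sign is followed by white space and some text.

```python
import re

# Term, optional spaces, a definition character, white space, then the value.
DEF_RE = re.compile(r"^(\s*\S[^:=]*?)\s*([:=])\s+(\S.*)$")

def markup_definition(line, bold_term=False):
    """Append <BR> to a definition line, optionally putting the term in bold."""
    m = DEF_RE.match(line)
    if not m:
        return line  # not a definition line: leave untouched
    term = m.group(1)
    rest = line[m.end(1):]  # separator and value, exactly as written
    if bold_term:
        term = f"<B>{term}</B>"
    return f"{term}{rest}<BR>"
```

Requiring white space after the definition character keeps strings such as http://... from being treated as definitions, which is one small way of reducing the unexpected breaks mentioned above.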
AscToHTM also recognises the use of definition paragraphs such as :-
Note: This is a definition paragraph whereby the whole paragraph is defining the term shown on the first line. Unfortunately AscToHTM currently only copes with single paragraphs (i.e. not with continuation paragraphs), and only with single word definitions.
This gets marked up in a <DL> <DT>...</DT> <DD>...</DD> </DL> sequence.
- Note:
- This is a "definition" paragraph, i.e. the whole paragraph defines the term shown on the first line. Unfortunately AscToHTM currently only copes with single paragraphs (i.e. not with continuation paragraphs), and only with single word definitions.
AscToHTM recognises that, especially in Internet files, it is increasingly common to quote from other text sources such as e-mail. The convention used in such cases is to insert a quote character such as > at the start of each line.
Consequently, AscToHTM adds a <BR> tag at the end of such lines to preserve the line structure of the original, and marks it up in <EM>..</EM> tags to differentiate the quoted text.
AscToHTM can look for text emphasised by placing asterisks (*) or underscores (_) either side of it. AscToHTM will convert the enclosed text to bold using <STRONG> tags and to italic using <EM> tags, respectively.
From version 3.2 onwards AscToHTM will also look for combinations of asterisks and underscores which will be placed in bold italic. The asterisks and underscores should be properly nested, e.g.
The emphasised word or phrase should span no more than a few lines, and in particular should not span a blank line. If the phrase is longer, or if AscToHTM fails to match opening and closing emphasis marks, the characters are left unconverted.
Tests are made to ignore double asterisks and underscores.
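A hedged Python sketch of this kind of emphasis matching is shown below. The regular expressions are illustrative assumptions, not AscToHTM's own rules. They skip doubled marks, refuse to cross a blank line, and leave unmatched marks unconverted; the "no more than a few lines" limit is not modelled.

```python
import re

# *text*: single asterisks only, no blank line inside, no space just inside the marks.
BOLD_RE = re.compile(r"(?<!\*)\*(?![\s*])((?:[^*\n]|\n(?!\n))+?)(?<!\s)\*(?!\*)")
# _text_: as above, and not part of a longer word such as snake_case_name.
ITALIC_RE = re.compile(r"(?<![_\w])_(?![\s_])((?:[^_\n]|\n(?!\n))+?)(?<!\s)_(?![_\w])")

def emphasise(text):
    """Convert *bold* and _italic_ spans; nested *_both_* becomes bold italic."""
    text = BOLD_RE.sub(r"<STRONG>\1</STRONG>", text)
    return ITALIC_RE.sub(r"<EM>\1</EM>", text)
```

Because bold is applied first and italic second, properly nested marks produce properly nested tags.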
Contents list lines are marked up in bold, and turned into a hyperlink pointing at the section referenced. The text is sized according to heading type in the range +/- 1 font size from normal (3).
AscToHTM can convert cross-references to other sections into hyperlinks to those sections. Unfortunately this is currently only possible for second, third, fourth... level numeric headings (n.n, n.n.n, n.n.n.n etc.).
This is because the error rate becomes too high on single numbers/letters or roman numerals. This may be refined in future releases, although it's hard to see how that would work.
AscToHTM can convert any URLs in the document to hyperlinks. This includes http and ftp URLs and any web addresses beginning with www.
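URL detection of this kind can be approximated with a regular expression, as in the Python sketch below. The pattern and the `link_urls` name are assumptions for illustration. Adding "http://" to bare www. addresses is also an assumption, made because an HREF needs a scheme.

```python
import re

# http, https or ftp URLs, or bare addresses starting with www.
# The final character class keeps trailing punctuation out of the link.
URL_RE = re.compile(r'\b((?:https?://|ftp://|www\.)[^\s<>"]+[^\s<>".,;:!?)])')

def link_urls(text):
    """Wrap each URL found in the text in an <A HREF=...> hyperlink."""
    def repl(m):
        url = m.group(1)
        href = url if "://" in url else "http://" + url
        return f'<A HREF="{href}">{url}</A>'
    return URL_RE.sub(repl, text)
```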
AscToHTM can convert any newsgroup names it spots into hyperlinks to those newsgroups. Because this is prone to error, AscToHTM currently only converts newsgroups in known USENET hierarchies such as rec.gardens by default.
This can be overcome either by
- placing "news:" in front of the newsgroup name (e.g. news:this.is.a.newsgroup.honest)
- relaxing this condition via a document policy (see the policy "Only use known groups")
- specifying the newsgroup hierarchy as recognised via a policy "Recognised USENET groups".
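The three options above can be sketched in Python as follows. This is an illustration under stated assumptions: the hierarchy list is a guess at some well-known USENET hierarchies, not AscToHTM's built-in list, and the parameters `known` and `only_known` only loosely mirror the "Recognised USENET groups" and "Only use known groups" policies.

```python
import re

# Assumed sample of well-known hierarchies; the program's real list may differ.
KNOWN_HIERARCHIES = {"alt", "comp", "misc", "news", "rec", "sci", "soc", "talk"}

# Dotted lower-case name, optionally prefixed by "news:", not part of a URL or email.
GROUP_RE = re.compile(r"(?<![\w.@/:])(news:)?([a-z][a-z0-9+-]*(?:\.[a-z0-9+-]+)+)(?![\w@])")

def link_newsgroups(text, known=KNOWN_HIERARCHIES, only_known=True):
    """Link newsgroup names that have a news: prefix or a recognised hierarchy."""
    def repl(m):
        prefix, group = m.group(1), m.group(2)
        top = group.split(".")[0]
        if prefix or not only_known or top in known:
            return f'<A HREF="news:{group}">{m.group(0)}</A>'
        return m.group(0)  # looks like a group, but the hierarchy is unknown
    return GROUP_RE.sub(repl, text)
```

With `only_known=False` ordinary dotted words such as "e.g" would also be linked, which shows why the restriction is on by default.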
AscToHTM can convert any email addresses into hypertext mailto: links.
AscToHTM can convert user-specified keywords into hyperlinks. The word or phrase to be converted must lie on a single line in the source document. Care should be taken to ensure keywords are unambiguous. Normally I mark my keywords in [] brackets if authoring for conversion by AscToHTM.
See the discussions on "link dictionaries" in 4.3.2.2 and 4.4.2.
AscToHTM recognises various types of headings. Where headings are found, and deemed to be consistent with the prevailing document policy (correct indentation, right type, in numerical sequence etc), AscToHTM will use the standard <Hn> ... </Hn> markup.
In addition to this, AscToHTM will insert a named Anchor tag (<A> ... </A>) to allow hyperlink jumps to this point. These anchors are used for example in the contents list and cross-reference hyperlinks that AscToHTM generates.
This is the preferred heading type and the type that AscToHTM has most success with. Sections of type N.N.N can be checked for consistency, and references to them can be spotted and converted into hyperlinks.
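To show what checking N.N.N sections for consistency might involve, here is a small Python sketch. It is an assumption about the kind of check involved, not the program's actual rules: a candidate heading is accepted only if its number is a valid successor of the previous accepted one, either the first subsection one level deeper or the next number at the same or a shallower level.

```python
import re

# A dotted section number at the start of the line, followed by some text.
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")

def is_consistent(prev, cur):
    """True if section number `cur` can directly follow `prev` (tuples of ints)."""
    if len(cur) == len(prev) + 1:
        # One level deeper: must be the first subsection of prev.
        return cur[:-1] == prev and cur[-1] == 1
    if len(cur) <= len(prev):
        # Same or shallower level: same parent, next number in sequence.
        return cur[:-1] == prev[:len(cur) - 1] and cur[-1] == prev[len(cur) - 1] + 1
    return False

def find_headings(lines):
    """Return the section numbers of lines accepted as headings."""
    prev, found = (), []
    for line in lines:
        m = HEADING_RE.match(line)
        if not m:
            continue
        cur = tuple(int(p) for p in m.group(1).split("."))
        if not prev or is_consistent(prev, cur):
            found.append(m.group(1))
            prev = cur
    return found
```

In the sample `["1 Intro", "1.1 Scope", "3 apples were eaten", "1.2 Goals", "2 Usage"]` the line starting with 3 is rejected because it is out of sequence.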
At present more exotic numbering schemes using roman numerals and letters of the alphabet are not fully supported. This is planned to be implemented soon, possibly via user policy files.
AscToHTM can treat wholly capitalised lines as headings. It also allows for such headings to be spread over more than one line.
AscToHTM can recognise underlined text, and optionally promote the preceding line to be a section header. The "underlining" line should have no gaps in it.
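A minimal Python sketch of underline detection follows. The set of underline characters and the minimum length of three are assumptions. The "no gaps" rule is modelled by requiring the underlining line to consist of one repeated character.

```python
def underlined_headings(lines, chars="-=_~*", min_len=3):
    """Yield each line that is immediately followed by an unbroken underline."""
    for i in range(1, len(lines)):
        under, text = lines[i].strip(), lines[i - 1].strip()
        # One repeated character with no gaps counts as underlining.
        if text and len(under) >= min_len and len(set(under)) == 1 and under[0] in chars:
            yield text
```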
Some types of documents use what look like section numbers to number paragraphs (e.g. legal documents, or sets of rules).
AscToHTM can recognise this, and mark up such lines by placing the number in bold, and not using <Hn> ... </Hn> markup on the whole line.
New in version 3.2
Some documents, especially those that were originally email or USENET posts, come with header lines, usually in the form of a number of lines with a keyword followed by a colon and then some value.
AscToHTM can recognise these (to a limited extent). Where these are detected the program will parse the header lines to extract the Subject, Author and Date of the article concerned. A heading containing this information will then be generated to replace all the unsightly header lines.
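The header handling described above can be imitated with Python's standard `email` module, as in this sketch. It only illustrates the idea: mapping "Author" to the From header and joining the fields with " | " are assumptions, and AscToHTM's own parsing is not shown in this document.

```python
from email import message_from_string

def header_summary(text):
    """Return (summary line, body) for text that starts with mail-style headers."""
    msg = message_from_string(text)
    # Subject, author (taken from the From header) and date, skipping any that are missing.
    parts = [msg.get("Subject"), msg.get("From"), msg.get("Date")]
    return " | ".join(p for p in parts if p), msg.get_payload()
```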
Lines are interpreted in context. If they appear to be underlining text, or part of some pre-formatted structure such as a table, then they are treated as such. Otherwise they become horizontal rules (<HR>).
Form feeds or page breaks also become <HR> markups.
AscToHTM normally ignores any HTML markup in the original text. The sole exceptions are any preprocessor tags which a user may insert into their text document (see Using the preprocessor).
For example :-
The use of BEGIN_PRE and END_PRE preprocessor commands (see 7.1.7) in the text documents tells AscToHTM that this portion of the document has been formatted by the user and should be left unchanged.
AscToHTM attempts to spot sections of preformatted text. This can vary from a single line (e.g. a line with a page number on the right-hand margin) to a complete table of data.
Where such text is detected AscToHTM analyses the section to determine what type of pre-formatted text it is. Options include
- Tables
- Code samples
- Ascii Art and diagrams
- some other formatted text
You can adjust the sensitivity of AscToHTM to pre-formatted text by setting the minimum number of lines required for a pre-formatted region using the "Minimum automatic <PRE> size" policy.
Tables are marked out by their use of white space, and by a regular pattern of gaps or vertical bars being spotted on each line. AscToHTM will attempt to spot the table, its columns, its headings, its cell alignment and entries that span multiple columns or rows.
Should AscToHTM wrongly detect the extent of a table, you can mark up a section of text by using the BEGIN_TABLE ... END_TABLE pre-processor commands (see 7.1.2). Alternatively you can try adding blank lines before and after, as the analysis uses white space to delimit tables.
You can alter the characteristics of all tables via the table policies (see 6.3.7).
You can alter the characteristics of all or individual tables via the table pre-processor commands (see 7.4).
Or you can suppress the whole thing altogether via the "Attempt TABLE generation" policy.
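The idea of finding columns from a regular pattern of gaps can be sketched in Python as below. This is a deliberately simple illustration with hypothetical function names. It treats a character position as a gap only if every line is blank there, and it ignores headings, alignment, vertical bars and spanning cells.

```python
def column_starts(lines):
    """Return the character positions at which columns begin."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position is a gap if every row has a space there.
    blank = [all(row[i] == " " for row in padded) for i in range(width)]
    return [i for i in range(width) if not blank[i] and (i == 0 or blank[i - 1])]

def split_rows(lines):
    """Split each line into cells at the detected column positions."""
    width = max(len(line) for line in lines)
    bounds = column_starts(lines) + [width]
    return [[line.ljust(width)[a:b].strip() for a, b in zip(bounds, bounds[1:])]
            for line in lines]

def to_table(lines):
    """Emit bare TABLE markup for the detected cells."""
    html = ["<TABLE>"]
    for row in split_rows(lines):
        html.append("<TR>" + "".join(f"<TD>{cell}</TD>" for cell in row) + "</TR>")
    html.append("</TABLE>")
    return html
```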
AscToHTM attempts to recognise code fragments in technical documents. The code is assumed to be "C++" or "Java"-like, and key indicators are, for example, the presence of ";" characters on the end of lines.
Should AscToHTM wrongly detect the extent of a code fragment, you can mark up a section of text by using the BEGIN_CODE ... END_CODE pre-processor commands (see 7.1.5).
You can choose what type of markup is used for the code fragment (see the policy "Use <CODE>..</CODE> markup").
Or you can suppress the whole thing altogether via the policy "Expect code samples".
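A code detection heuristic of the sort described might look like the Python sketch below. The 50% threshold and the inclusion of braces alongside ";" are assumptions, since only the ";" indicator is named in the text.

```python
def looks_like_code(lines, threshold=0.5):
    """Guess whether a block is C-like code from how its lines end."""
    body = [l.rstrip() for l in lines if l.strip()]
    if not body:
        return False
    # Lines ending in ';' (or a brace, an added assumption) count as indicators.
    hits = sum(1 for l in body if l.endswith((";", "{", "}")))
    return hits / len(body) >= threshold
```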
AscToHTM attempts to recognise Ascii art and diagrams in documents. Key indicators include large numbers of non-alphanumeric characters and the use of white space.
However, some diagrams use the same mix of line and alphabetic characters as tables, so the two sometimes get confused.
Should AscToHTM wrongly detect the extent or type of a diagram, you can mark up a section of text by using the BEGIN_DIAGRAM ... END_DIAGRAM pre-processor commands (see 7.1.6).
If AscToHTM detects formatted text, but decides that it is neither table, code nor art (and it knows what it likes), then the text may be put out "as normal", but with <BR> added to each line. In such regions other markup (such as bullets) may not be processed as it would be elsewhere.
AscToHTM can calculate - or be told - the title of a document. This will be placed in <TITLE>...</TITLE> markup in the <HEAD> section of each HTML page produced.
The title is calculated in the order shown below. If the first algorithm returns a value, the subsequent ones are ignored.
- If a $_$_TITLE pre-processor command (see 7.2.1) is placed in the source text, that value is used
- If the "Use first header as title" policy is set then the first heading (if any) encountered is used as the title.
- Note:
- Depending on your document structure, this is prone to give bland titles like "Introduction", "Overview" and "Summary"
- If the "Use first line as title" policy is set then the first line in the file is used as the title.
- If the "Document title" policy is set then this value is used.
- Note:
- If this is the value you want, ensure the other policies outlined above are disabled.
- Finally, if none of the above result in a title the text "Converted from <filename>" is used.
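The order of precedence listed above amounts to a simple fallback chain, sketched here in Python. The function and parameter names are hypothetical stand-ins for the pre-processor command and policies named in the list.

```python
def choose_title(filename, title_command=None, first_heading=None, first_line=None,
                 document_title=None, use_first_heading=False, use_first_line=False):
    """Return the first available title, following the order given in the list."""
    candidates = [
        title_command,                                  # TITLE pre-processor command
        first_heading if use_first_heading else None,   # "Use first header as title"
        first_line if use_first_line else None,         # "Use first line as title"
        document_title,                                 # "Document title" policy
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return f"Converted from {filename}"
```

This also shows why the "Document title" value is only used when the two "use first..." policies are disabled or yield nothing.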
AscToHTM can detect the presence of a contents list in the original document, or it can generate a contents list for you from the headings that it observes. There are a number of policies that give you control over how and where a contents list is generated (see 6.3.4).
There are three different situations in which contents lists may, or may not be generated. These are :-
By default AscToHTM will not generate a contents list for a file unless it already has one.
If it should detect a contents list in the document, then that list will be changed into hyperlinks to the named sections. In such a case, only those headings shown in the contents list are converted into links, and the link text is that in the original contents list, and not the text in the actual heading (often they are different).
- Note:
- AscToHTM currently only detects numbered contents lists, and is occasionally prone to error when they are present. If you experience problems, either delete the contents list and get AscToHTM to generate one for you, or mark up the existing list using the contents pre-processor commands (see 7.1.3).
As described in 5.6.2.1, AscToHTM will not generate a contents list by default unless it already has one.
Requesting a contents list
However, you can request that a contents list is always generated, by using the "Add contents list" policy. In this case a contents list will be either
- made from the existing contents list, or
- generated from the observed headings. In this case the contents list will only be as good as the detection of headings in the rest of the document permits.
Forcing a generated contents list
You can force a generated list to be used by disabling the "Use any existing contents list" policy.
If an existing contents list is present, it will be deleted from the output. Normally it's best to either use the existing contents list, or to delete it from the source text and request a generated list.
Contents lists placement
By default the contents list will be placed at the top of the output file. In earlier versions of AscToHTM the contents list was always placed in a separate file.
New in version 3.2
You can cause contents lists to be placed wherever you want by using the CONTENTS_LIST preprocessor command (see 7.3.3). If you do this, then contents lists will be placed only where you place CONTENTS_LIST markers.
Generating a contents list in a separate file
If you select the "Generate external contents list" policy the contents list will be placed in a separate file, and a hyperlink to that file called "Contents List" is placed at the top of the HTML page generated from the document.
You can choose the name of the external file using the "External contents list filename" policy. If omitted, the file will be called "Contents_<filename>", where <filename> is the name of the document being converted.
AscToHTM can be made to split the output into many files. At present this is only possible at numbered section headings. Each generated page usually has a navigation bar, which includes a hyperlink back to the following section in any contents list.
The behaviour is identical to that in 5.6.2.2 except that
- the output is now split into several files.
- the options to generate an external contents list in a separate file are no longer available.
- if the contents list is being generated, it is now placed at the foot of the first document, rather than at the top (unless the CONTENTS_LIST preprocessor command is used)
This is usually before the first heading (which now starts the second document), and after any document preamble.
- Note:
- Where the original contents list is used when splitting files it is possible that not every file will be directly accessible from the contents list, and that the back links to the contents list may not function as expected. In such cases you can go from the contents list to a major section, and then use the navigation bars to page through to the minor section.
When converting several files at once, AscToHTM can be made to generate a "Directory Page". This is an HTML index of all the files converted and their contents.
The policies available for controlling generation of a directory page are explained in 6.3.9.
The directory page will consist of an entry for each file converted, in the order that files are converted (usually alphabetic). Each entry will (optionally) contain :-
- A link to the file being converted. The link will either be the converted file's HTML title, or failing that, the filename itself.
- Links to each of the sections of the converted file as detected by AscToHTM.
AscToHTM can be made to add standard headers, footers and JavaScript to each page generated. It does this by allowing you to specify include files to be copied into the generated HTML. These include files can contain any valid HTML commands.
The program supports three types of such files :-
- Header files. These contain any HTML you want placing immediately after the output's <BODY> tag.
A good example might be a standard header, with a logo and links back to the home page.
- Footer files. These contain any HTML you want placing immediately before the closing </BODY> tag.
- Script files. These contain any HTML you want placing inside the <HEAD> ... </HEAD> portion of the generated file. Such tags are not usually visible.
You should place in here any JavaScript you want, although it will be difficult to make this apply to the converted text.
You can specify include files for the converted files, as well as for any directory page (see 5.6.3) that you create. If you don't specify values for the directory page, then it will use the same files as the generated files.
Converted from a single text file by AscToHTM © 1997-99 John A. Fotheringham