[AscToHTM] Documentation for AscToHTM conversion utility
----------------------------------------------------------------------------
5 HTML markup produced
5.1 Indentation
AscToHTM performs statistical analysis on the document to determine at
what character positions indentations occur. This information is used
on the output pass to determine the indentation level for each source
line.
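As a rough illustration of the statistical pass described above, the following Python sketch counts how often each leading-whitespace width occurs and keeps the common ones as indentation positions. The function names and the min_share threshold are assumptions made for this example; this is not AscToHTM's actual implementation.

```python
from collections import Counter


def find_indent_positions(lines, min_share=0.05):
    """Return leading-space widths common enough to count as indent positions.

    min_share is a hypothetical cut-off: widths used by fewer than this
    fraction of non-blank lines are treated as noise.
    """
    widths = Counter(
        len(line) - len(line.lstrip(" "))
        for line in lines
        if line.strip()  # ignore blank lines
    )
    total = sum(widths.values())
    return sorted(w for w, n in widths.items() if n / total >= min_share)


def indent_level(line, positions):
    """Map a line to the deepest detected position not beyond its own offset."""
    width = len(line) - len(line.lstrip(" "))
    level = 0
    for i, pos in enumerate(positions):
        if pos <= width:
            level = i
    return level
```

A first pass would call find_indent_positions on the whole document, and the output pass would then call indent_level for each source line.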
AscToHTM attempts to indent the HTML code to match the output
indentation level, to make it easier to read. The indentations
themselves are marked up using < UL> ... < /UL> tags.
Note:
This is arguably not strictly correct HTML, as we omit the < LI>
tag that would give a bullet character. However, this does produce
simpler HTML than alternative markups, which are not supported by
earlier browsers and which give (apparently, if not actually)
slower-loading HTML.
5.2 Header Lines
AscToHTM recognises various types of headers. Where headers are found,
and deemed to be consistent with the prevailing document policy
(correct indentation, right type, in numerical sequence etc), AscToHTM
will use the standard < Hn> ... < /Hn> heading markup.
In addition to this, AscToHTM will insert a named anchor tag
(< A NAME="..."> ... < /A>) to allow hyperlink jumps to this point.
These anchors are used, for example, in the contents list and
cross-reference hyperlinks that AscToHTM generates.
5.2.1 Numbered headers
This is the preferred heading type and the type that AscToHTM has most
success with. Sections of type N.N.N can be checked for consistency,
and references to them can be spotted and converted into hyperlinks.
At present more exotic numbering schemes using roman numerals and
letters of the alphabet are not fully supported. This is planned to be
implemented soon, possibly via user policy files.
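To show what checking N.N.N sections for consistency might involve, here is a small Python sketch that parses a section number and tests whether it is a plausible successor of the previous one. The regular expression and the function names parse_header and follows are hypothetical; the rules AscToHTM really applies are not documented here.

```python
import re

# A dotted number, an optional trailing dot, then the heading text.
HEADER_RE = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+(\S.*)$")


def parse_header(line):
    """Return (number_tuple, title) for '5.2.1 Numbered headers', else None."""
    m = HEADER_RE.match(line)
    if not m:
        return None
    return tuple(int(p) for p in m.group(1).split(".")), m.group(2)


def follows(prev, cur):
    """True if section number cur is a plausible successor of prev."""
    # Next sibling at the same or a shallower level: 5.2.1 -> 5.2.2, 5.2.4 -> 5.3
    for depth in range(1, len(prev) + 1):
        if cur == prev[:depth - 1] + (prev[depth - 1] + 1,):
            return True
    # First child of the previous section: 5.2 -> 5.2.1
    return cur == prev + (1,)
```

A line that parses as a header but fails the follows test against the previous header is more likely to be a numbered list item or a stray number than a real section heading.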
5.2.2 Capitalised headers
AscToHTM can treat wholly capitalised lines as headers. It also allows
for such headers to be spread over more than one line.
5.2.3 Underlined headers
AscToHTM can recognise underlined text, and optionally promote the
preceding line to be a section header.
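One way underlining could be recognised is sketched below in Python. The set of underline characters and the length tolerance are assumptions chosen for the example, not values taken from AscToHTM.

```python
def is_underline(line, text_above, chars="-=~_*", tolerance=2):
    """Guess whether `line` underlines `text_above`.

    The line must consist of one repeated character from `chars` and be
    roughly as long as the text above it (both limits are assumptions).
    """
    s = line.strip()
    above = text_above.strip()
    return (
        len(s) >= 3
        and s[0] in chars
        and set(s) == {s[0]}
        and bool(above)
        and abs(len(s) - len(above)) <= tolerance
    )
```

When is_underline returns True, the preceding line is a candidate for promotion to a section header.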
5.2.4 Numbered paragraphs
Some types of documents use what look like section numbers to number
paragraphs (e.g. legal documents, or sets of rules).
AscToHTM can recognise this, and marks up such lines by placing the
number in bold rather than using < Hn> ... < /Hn> heading markup on the
whole line.
5.3 Hyperlinks
5.3.1 Contents List lines
Contents list lines are marked up in bold, and turned into a hyperlink
pointing at the section referenced. The text is sized according to
heading type in the range +/- 1 font size from normal (3).
5.3.2 Cross-references
AscToHTM can convert cross-references to other sections into hyperlinks
to those sections. Unfortunately this is currently only possible for
second, third, fourth... level numeric headings (n.n, n.n.n, n.n.n.n
etc). This is because the error rate becomes too high on single
numbers/letters or roman numerals. This may be refined in later
releases.
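The restriction to multi-level numbers can be illustrated with a short Python sketch. It links only dotted numbers that match a known section, which is one way to keep the error rate down. The anchor naming scheme (section_N.N) and the function name are hypothetical.

```python
import re

# Two or more dot-separated numbers (5.3, 5.2.1); a bare "5" never matches.
XREF_RE = re.compile(r"(?<![\w.])(\d+(?:\.\d+)+)(?![\w]|\.\d)")


def link_cross_references(text, known_sections):
    """Wrap references to known multi-level section numbers in anchor tags."""
    def repl(m):
        num = m.group(1)
        if num not in known_sections:
            return num  # e.g. a version number such as 3.14
        # "section_" is an assumed anchor prefix for this sketch.
        return '<A HREF="#section_%s">%s</A>' % (num, num)
    return XREF_RE.sub(repl, text)
```

Checking candidates against the set of section numbers actually found in the document avoids turning version numbers or decimals into broken links.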
5.3.3 URLs
AscToHTM can convert any URLs in the document to hyperlinks. This
includes http and ftp URLs and any web addresses beginning with www.
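A minimal Python sketch of this kind of URL conversion follows. The pattern and the handling of trailing punctuation are assumptions for illustration; AscToHTM's own matching rules may differ.

```python
import re

# http(s)://, ftp:// or a bare www. prefix, up to the next space or quote.
URL_RE = re.compile(r'\b((?:https?://|ftp://|www\.)[^\s<>"]+)')


def link_urls(text):
    """Turn URLs in plain text into HTML hyperlinks."""
    def repl(m):
        url = m.group(1)
        trailing = ""
        # Sentence punctuation after a URL is almost never part of it.
        while url and url[-1] in ".,;:!?)":
            trailing = url[-1] + trailing
            url = url[:-1]
        # Bare www. addresses are assumed to be http.
        href = url if "://" in url else "http://" + url
        return '<A HREF="%s">%s</A>%s' % (href, url, trailing)
    return URL_RE.sub(repl, text)
```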
5.3.4 Usenet Newsgroups
AscToHTM can convert any newsgroup names it spots into hyperlinks to
those newsgroups. Currently only third-level newsgroups such as
comp.os.vms are converted, to reduce the error rate. This may be
improved in later releases.
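The following Python sketch shows one way three-part newsgroup names could be spotted. Restricting matches to a list of well-known top-level hierarchies is an assumption added here to limit false positives such as host names; it is not necessarily what AscToHTM does.

```python
import re

# Assumed list of top-level hierarchies; not taken from AscToHTM.
HIERARCHIES = ("alt", "comp", "humanities", "misc", "news",
               "rec", "sci", "soc", "talk")

# Exactly three dot-separated parts, e.g. comp.os.vms.
NEWSGROUP_RE = re.compile(
    r"(?<![\w.@/-])((?:%s)(?:\.[a-z0-9][a-z0-9+-]*){2})(?![\w+-]|\.[a-z0-9])"
    % "|".join(HIERARCHIES)
)


def link_newsgroups(text):
    """Wrap three-level newsgroup names in news: hyperlinks."""
    return NEWSGROUP_RE.sub(r'<A HREF="news:\1">\1</A>', text)
```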
5.3.5 E-mail addresses
AscToHTM can convert any email addresses into hypertext mailto: links.
5.4 Hanging paragraph indents
Some documents, especially ones dumped from Word, have hanging
paragraph indents. That is, each paragraph starts at an offset to the
rest of the paragraph.
AscToHTM struggles heroically with this, and tries not to treat this as
text at two indent levels, but it does occasionally get confused.
If writing a text file from scratch with AscToHTM in mind, then it is
best to avoid this practice.
5.5 Bullets
AscToHTM detects and supports several types of bullets.
5.5.1 Bullet chars
Bullet chars are lines of the type
    - this is a bullet line
    - this is a bullet paragraph
      because it carries over onto
      more lines
That is, a single character followed by the bullet line. AscToHTM can
determine via statistical analysis which character, if any, is being
used in this way. Special attention is paid to the '-' and 'o'
characters.
Bullets of this type are given a < UL> ... < LI> ... < /UL> markup.
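The statistical detection of a bullet character described above might look something like this Python sketch. The thresholds, and the way '-' and 'o' are favoured by needing one fewer occurrence, are assumptions made for the example.

```python
import re
from collections import Counter

# A single non-space character, then whitespace, then the line's text.
CANDIDATE_RE = re.compile(r"^\s*(\S)\s+\S")


def detect_bullet_char(lines, min_count=3, favoured="-o"):
    """Return the character most likely used as a bullet, or None."""
    counts = Counter()
    for line in lines:
        m = CANDIDATE_RE.match(line)
        if m and not m.group(1).isdigit():
            counts[m.group(1)] += 1
    # Favoured characters qualify with one fewer occurrence (assumption).
    candidates = {
        ch: n for ch, n in counts.items()
        if n >= (min_count - 1 if ch in favoured else min_count)
    }
    if not candidates:
        return None
    return max(candidates, key=lambda ch: (candidates[ch], ch in favoured))
```

Lines starting with a one-letter word such as "A" also match the pattern, which is why a count threshold is needed before a character is accepted as a bullet.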
5.5.2 Numbered bullets
AscToHTM can spot numbered bullets. These can sometimes be confused
with section headings in some documents. This is one area where the use
of a document policy really pays dividends in sorting the sheep from
the goats.
Numbered bullets are given a < OL> ... < LI> ... < /OL> markup.
Note:
Not all browsers support this type of markup. In such cases, it's
possible that the numbering of bullets will get reset to 1 every
so often. However, this isn't a problem with either Netscape or
Internet Explorer.
5.5.3 Alphabetic bullets
AscToHTM detects upper and lower case alphabetic bullets. These are
marked up like numbered bullets, with TYPE=a.
5.5.4 Roman Numeral bullets
AscToHTM detects upper and lower case roman numeral bullets. These are
marked up like numbered bullets, with TYPE=i.
5.6 Definitions
5.6.1 Definition lines
A definition line is a single line that appears to be defining
something. Usually this is a line with either a colon (:) or an equals
sign (=) in it. For example
IMHO = In my humble opinion
Address : Somewhere over the rainbow.
AscToHTM attempts to determine what definition characters are used and
whether they are strong (only ever used in a definition) or weak (only
sometimes used in a definition).
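The strong/weak distinction can be illustrated with a small Python sketch. The test for "looks like a definition" (a term of at most three words before the character, and some text after it) is a hypothetical heuristic chosen for the example.

```python
def classify_definition_chars(lines, chars=":="):
    """Label each candidate character 'strong', 'weak' or None (unused)."""
    result = {}
    for ch in chars:
        as_definition = other = 0
        for line in lines:
            if ch not in line:
                continue
            term, _, rest = line.partition(ch)
            # Assumed heuristic: short term, the character, then some text.
            if 0 < len(term.split()) <= 3 and rest.strip():
                as_definition += 1
            else:
                other += 1
        if as_definition == 0:
            result[ch] = None
        else:
            # Strong: every use of the character looked like a definition.
            result[ch] = "strong" if other == 0 else "weak"
    return result
```

With the two example lines above plus an ordinary sentence containing a colon, '=' would come out strong and ':' weak.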
AscToHTM marks up definition lines by placing a < BR> on the end of the
line to preserve the original line structure. Where this decision is
made incorrectly, unexpected breaks can appear in the text.
AscToHTM offers the option of marking up the definition term in bold.
This is not the default behaviour however.
5.6.2 Definition paragraphs
AscToHTM also recognises the use of definition paragraphs such as :-
Note: This is a definition paragraph whereby the whole
paragraph is defining the term shown on the first line.
Unfortunately AscToHTM currently only copes with single
paragraphs (i.e. not with continuation paragraphs), and
only with single word definitions.
This gets marked up in a < DL> ... < DT> ... < DD> ... < /DD> ... < /DL>
sequence, giving:
Note:
This is a "definition" paragraph, i.e. the whole paragraph defines
the term shown on the first line. Unfortunately AscToHTM currently
only copes with single paragraphs (i.e. not with continuation
paragraphs), and only with single word definitions.
5.7 Quoted lines
AscToHTM recognises that, especially in Internet files, it is
increasingly common to quote from other text sources such as e-mail.
The convention used in such cases is to insert a quote character such
as > at the start of each line.
Consequently, AscToHTM adds a < BR> tag at the end of such lines to
preserve the layout of the original.
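A minimal Python sketch of this treatment of quoted lines is given below. It assumes the line break tag is BR and that '>' is the only quote character; both are parameters of the example rather than documented AscToHTM behaviour.

```python
import html


def mark_up_quoted(lines, quote_chars=">", break_tag="<BR>"):
    """Escape each line and append a line break to quoted lines."""
    out = []
    for line in lines:
        escaped = html.escape(line, quote=False)  # '>' becomes '&gt;'
        first = line.lstrip()[:1]
        if first and first in quote_chars:
            escaped += break_tag
        out.append(escaped)
    return out
```

Unquoted lines are left free to flow together as ordinary paragraph text, while each quoted line keeps its own line in the browser.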
5.8 Pre-formatted text
5.8.1 Lines and form feeds
Lines are interpreted in context. If they appear to be underlining
text, or part of some pre-formatted structure such as a table, then
they are treated as such.
Otherwise they become horizontal rules (< HR>). Form feeds or page
breaks also become < HR> markups.
5.8.2 User defined pre-formatted text
AscToHTM normally ignores any HTML markup in the original text. The
sole exceptions are the < PRE> ... < /PRE> tags which a user may insert
into their text document.
For example :-
The use of < PRE> and < /PRE> in the text documents tells AscToHTM
that this portion of the document has been formatted and should
be left unchanged.
Note:
Because of this, care has to be taken when referring to < PRE> and
< /PRE> HTML tags in the source document. A single space after the
opening < is sufficient. All other HTML tags are ignored and
converted into "safe" text.
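The behaviour described in this section can be sketched in Python as follows. The sketch assumes the PRE tags sit on lines of their own, which the text above does not state; everything else is escaped into "safe" text.

```python
import html
import re

# An opening or closing PRE tag alone on a line (assumption of this sketch).
PRE_TAG_RE = re.compile(r"^\s*<(/?)PRE>\s*$", re.IGNORECASE)


def convert_with_pre(lines):
    """Pass PRE tags through; escape every other line, tags included."""
    out = []
    for line in lines:
        m = PRE_TAG_RE.match(line)
        if m:
            out.append("</PRE>" if m.group(1) else "<PRE>")
        else:
            # Any other markup, and '< PRE>' with a space, becomes safe text.
            out.append(html.escape(line, quote=False))
    return out
```

This also shows why a space after the opening < is enough to mention the tag safely: the spaced form no longer matches the pattern, so it is escaped like any other text.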
5.8.3 Automatically detected pre-formatted text
AscToHTM attempts to spot chunks of preformatted text. This can vary
from a single line (e.g. a line with a page number on the right-hand
margin) to a complete table of data.
Where such text is detected AscToHTM marks these sections up in < PRE>
... < /PRE> tags.
Eventually it is hoped to add full < TABLE> generation for such
sections.
5.9 Centred text
AscToHTM can attempt to spot chunks of centred text. However, because
this can easily go wrong this option is normally switched off.
Centering is only switched on for single isolated lines, or any group
of at least two lines. < CENTER> ... < /CENTER> markup is used.
----------------------------------------------------------------------------
© 1997 John A. Fotheringham