Lessons learned from the translation of documentation from LaTeX to HTML

Xavier Leroy (INRIA)


This presentation reports on the transformation of a 200-page LaTeX technical document into HTML for publication on the World Wide Web. We believe that many others who have tried to provide information on the Web have encountered difficulties similar to ours; it is therefore worthwhile to summarize the lessons learned from our experience.

The initial document

The document we worked on is the reference manual and user's documentation for our implementation of the Caml Light programming language. It was originally written in LaTeX, with a number of non-standard environments and macros written directly in TeX. Parts of the document were generated automatically: syntactic definitions (typeset from BNF descriptions) and descriptions of library functions (extracted from commented source code).

An HTML version of this document was desirable for several reasons:

The Caml Light documentation is about 200 pages long. It was therefore out of the question to rewrite it manually in HTML. Automatic translation was not obvious either, owing to the typographical gap between TeX and HTML. For instance, even though the documentation contains no mathematical formulas properly speaking (no integrals, no Greek letters, ...), TeX's math mode is frequently used in BNF definitions and program samples to obtain metavariables, subscripts, superscripts, and so on. Typographical enrichments (italics, bold, monospaced fonts) are used extensively for disambiguation (e.g. to distinguish terminal symbols from BNF operators) and must be rendered faithfully. The documentation also contains many tables and a few line drawings, which caused some difficulties.

Preparing the LaTeX source

When faced with a LaTeX construct that has no direct HTML equivalent, the latex2html translator simply turns it into a bitmap image and inserts that image in the produced HTML code. This approach was not acceptable for our purposes:

To avoid resorting to bitmaps and to allow the production of decent HTML, we introduced a number of TeX macros in the LaTeX source to ``abstract over'' the typesetting details and make the semantics of the LaTeX source more explicit. For instance, we wrote \var{x} instead of $x$ to denote the metavariable named x. Similarly, the n-th element of a sequence v was written \nth{v}{n} instead of $v_n$. The same technique was also used to eliminate low-level constructs and environments such as center and tabular.

For typesetting with TeX, the new constructs (\var, \nth, etc.) were simply macro-defined as their old forms, so the printed result was unchanged. For translation to HTML, however, they were specially recognized and the intended meaning was rendered in the best possible HTML way: for instance, \nth{v}{n} becomes <i>v(n)</i>.
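On the paper side, this scheme requires nothing more than a pair of macro definitions expanding the semantic constructs back to their original math-mode forms. The macro names \var and \nth are those from the text; the particular definitions below are only a plausible sketch:

```latex
% For paper output: expand the semantic macros to their original TeX forms.
% (Illustrative definitions; the actual ones may differ.)
\newcommand{\var}[1]{\ensuremath{\mathit{#1}}}      % \var{x} prints as $x$
\newcommand{\nth}[2]{\ensuremath{\mathit{#1}_{#2}}} % \nth{v}{n} prints as $v_n$
```

The HTML translator, by contrast, never sees these definitions: it pattern-matches on \var and \nth directly and emits the corresponding markup.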

The overall approach is basically that of LaTeX for structuring TeX documents (instead of typesetting a section header by hand, one writes \section{...} to communicate one's intentions to the typesetting program), but extended to lower-level elements of the text: metavariables, sequence elements, entries in precedence tables, and so on.

For the automatically generated parts of the documentation, we modified the generating programs to either output HTML directly (embedded in a \begin{rawhtml} ... \end{rawhtml} environment) or output simplified LaTeX that would look bad on paper but translates nicely to HTML.

Finally, a few nasty uses of math mode (arrays of formulas) had to be translated to HTML by hand and embedded in the source documentation as \begin{rawhtml} ... \end{rawhtml} blocks, immediately following the original TeX code bracketed by \begin{latexonly} ... \end{latexonly}. Keeping the two versions (LaTeX and HTML) side by side in the source helps keep them consistent during modifications. About 0.2% of the source had to be translated manually this way; the remaining 99.8% is processed automatically.
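The side-by-side bracketing described above looks roughly like this in the source. The environment names latexonly and rawhtml are those from the text; the formula shown is a made-up example, not taken from the Caml Light manual:

```latex
\begin{latexonly}
% TeX version, used when typesetting on paper
$\begin{array}{rcl} e & ::= & x \mid e_{1}\; e_{2} \end{array}$
\end{latexonly}
\begin{rawhtml}
<!-- Hand-written HTML version of the same formula -->
<i>e</i> ::= <i>x</i> | <i>e</i>(1) <i>e</i>(2)
\end{rawhtml}
```

Each tool ignores the block intended for the other: TeX skips the rawhtml body, and the HTML translator copies it verbatim while skipping the latexonly body.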

Generating HTML

We originally planned to use Nikos Drakos' latex2html translator to produce the final HTML pages. This plan backfired for two reasons:

We therefore had to write our own translator from the extended subset of LaTeX used to HTML. The translator consists of two main parts:
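The core of such a translator is the mapping from semantic constructs to HTML fragments, with the characters that are special in HTML escaped along the way. The following Caml fragment is a hypothetical sketch of that mapping only (the function names escape_html, render_var, and render_nth are ours; the actual translator sources are linked below):

```ocaml
(* Hypothetical sketch: render the semantic constructs of the extended
   LaTeX subset into HTML, escaping HTML's special characters. *)
let escape_html s =
  let buf = Buffer.create (String.length s) in
  String.iter (fun c ->
    match c with
    | '<' -> Buffer.add_string buf "&lt;"
    | '>' -> Buffer.add_string buf "&gt;"
    | '&' -> Buffer.add_string buf "&amp;"
    | c   -> Buffer.add_char buf c) s;
  Buffer.contents buf

(* \var{x} becomes <i>x</i> *)
let render_var x = "<i>" ^ escape_html x ^ "</i>"

(* \nth{v}{n} becomes <i>v(n)</i>, as described in the text *)
let render_nth v n = "<i>" ^ escape_html v ^ "(" ^ escape_html n ^ ")</i>"

let () = print_endline (render_nth "v" "n")
(* prints <i>v(n)</i> *)
```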

You may judge the final result and compare it with the DVI file obtained from the original.

The sources of the translator are available here. Some background about the Caml language is helpful to understand them.

Lessons learned

From this experience, we draw the following conclusions:

The HTML language

Despite its apparent simplicity, the HTML language is almost rich enough to format TeX-intensive technical documentation. The only features that we sorely missed were tables, subscripts, and superscripts. None of these is particularly hard to support in HTML viewers, and we hope they will be included in the next ``official'' HTML specification. We had no real need for more advanced features such as a full-fledged formula mode.

HTML viewers

We were slightly disappointed by the quality of the typesetting performed by popular HTML viewers such as Mosaic and Netscape. For instance, horizontal spacing is often incorrect when italic characters appear next to roman characters (example: _} -- that's underscore, italic brace). Also, preformatted text is often set in a smaller font, but italic tags embedded in the preformatted text are rendered in normal size (Mosaic), or in the smaller size but in monospaced italics rather than regular italics (Netscape). Both approaches make it difficult to ensure font consistency across the document.

HTML generators and transformers

The quality of publicly available programs that manipulate HTML is clearly insufficient:

The immaturity of the field probably explains many of these weaknesses. But we believe that the widespread use of Perl to program these tools is also partly responsible for the situation. Perl is a fine generalization of the Unix line-oriented tools, and it works great for programs of the kind ``read a line, massage it, print it''. Many CGI scripts fall into this category, but not the more advanced tools needed to produce HTML from other high-level sources. Perl is inherently unsuited to the parsing and transformation of structured languages such as LaTeX and HTML. Reading the whole input into a string and rewriting it over and over with s/.../.../eg just does not cut it in terms of clarity and performance; it's like playing the Well-Tempered Clavier on a Jew's harp. Languages with high-level parsing capabilities, real data structures, and clean semantics are clearly needed here.

In which source language should we write documentation?

What is a good formatting language for writing texts that should be available both on paper and on the Web? An idealistic approach would be to design a new text formatting language to replace both LaTeX and HTML, or at least one straightforwardly translatable to both. A more practical approach is to acknowledge that LaTeX is a de facto standard in computer science and that HTML is already hard to change, which suggests that more effort should be invested in smarter translators from LaTeX to HTML (e.g. ones capable of translating $x+1$ to <i>x</i>+1).
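Such a translation of simple math-mode fragments can be approximated by a crude rule: set letters (identifiers) in italics and copy digits and operators through unchanged. A hypothetical sketch in Caml (the rule and the function name translate_math are our own illustration, not part of any existing translator, and it ignores subscripts, macros, and multi-letter identifiers):

```ocaml
(* Illustrative rule only: in a math-mode fragment, italicize letters
   and pass digits and operators through unchanged. *)
let translate_math s =
  let buf = Buffer.create (String.length s) in
  String.iter (fun c ->
    if (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') then
      Buffer.add_string buf ("<i>" ^ String.make 1 c ^ "</i>")
    else
      Buffer.add_char buf c) s;
  Buffer.contents buf

let () = print_endline (translate_math "x+1")
(* prints <i>x</i>+1 *)
```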

Xavier Leroy, INRIA (Xavier.Leroy@inria.fr)