Converting Greek polytonic HTML to LaTeX

(April 2007)

For the impatient:

    greekHtmlToLatex.py file.html >file.tex 2>file.err

Download the script here.

This utility attempts to convert the text content of a Greek HTML file (ignoring the tags) into LaTeX. It uses the babel package to handle both monotonic and polytonic text, and it respects the "charset" meta declaration in the HTML code. If no "charset" is defined, it first tries to read the text as UTF-8 and, if that fails, as ISO-8859-7 (Modern Greek).
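The fallback order described above can be sketched roughly as follows. This is a minimal illustration for Python 3, not the script's actual code: the function name `decode_html` and the regular expression used to find the charset are assumptions made for this example.

```python
import re

# Assumed pattern: finds charset=... inside a <meta> tag, quoted or not.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE
)

def decode_html(raw: bytes) -> str:
    """Decode HTML bytes: declared charset first, then UTF-8, then ISO-8859-7."""
    match = META_CHARSET.search(raw)
    if match:
        declared = match.group(1).decode("ascii")
        try:
            return raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # unknown or wrong declaration: fall through to the guesses
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-7 is the last resort; undefined bytes become U+FFFD.
        return raw.decode("iso-8859-7", errors="replace")
```

Trying UTF-8 before ISO-8859-7 makes sense because ISO-8859-7 Greek text is almost never valid UTF-8, so a wrong guess fails loudly instead of silently producing garbage.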

It correctly handles HTML entities (e.g. &quot;) and character references (e.g. &#2014;), and it includes custom-built "fixups" for some of the Unicode characters used in Greek polytonic pages.
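To show what decoding entities and character references means in practice, here is a small example using Python 3's standard `html.unescape`. Whether the script itself uses this function is not stated here, so treat it as an illustration of the idea only; the sample string is made up.

```python
import html

# "&quot;" is a named entity; "&#945;" and "&#x1F00;" are decimal and
# hexadecimal character references for two Greek letters.
sample = "&quot;&#945;&#x1F00;&quot;"
decoded = html.unescape(sample)
print(decoded)
```

After this step the text contains ordinary Unicode characters, which is the point where any per-character "fixups" would apply.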

When it meets a symbol it doesn't know about, it reports the error to stderr, formatted in UTF-8. Each report gives both the offending symbol and the line it occurred in, with numbered output, so you can easily extend the script with code to handle that symbol (if you do, send it to me so I can update it for the benefit of us all).
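A sketch of this kind of reporting might look like the code below. The function names, the message format and the idea of passing in a set of known characters are all assumptions for illustration; the script's real output may look different.

```python
import sys

def find_unknown(text, known):
    """Yield (line_number, character) for every character not in `known`."""
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ch not in known:
                yield lineno, ch

def report_unknown(text, known, stream=sys.stderr):
    """Write one message per unknown character to `stream` (stderr by default)."""
    for lineno, ch in find_unknown(text, known):
        # Hypothetical message format: line number, the symbol, its code point.
        stream.write(f"line {lineno}: unknown symbol {ch!r} (U+{ord(ch):04X})\n")
```

Printing the code point next to the symbol helps when the terminal cannot render the character itself.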

HTML is not structured: it is a presentation language, and therefore the document structure can't be regenerated. The script simply maps the contents of the title, h1, h2, h3, h4 and pre tags to their respective LaTeX commands. Feel free to implement additional artificial intelligence rules.
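The tag mapping could be sketched with Python 3's standard `html.parser` module as shown here. The table `TAG_TO_LATEX` is a guess at a sensible mapping (h1 to \section and so on), not the one the script necessarily uses, and the class and function names are made up for this example.

```python
from html.parser import HTMLParser

# Assumed mapping; the commands the actual script emits may differ.
TAG_TO_LATEX = {
    "title": ("\\title{", "}"),
    "h1": ("\\section{", "}"),
    "h2": ("\\subsection{", "}"),
    "h3": ("\\subsubsection{", "}"),
    "h4": ("\\paragraph{", "}"),
    "pre": ("\\begin{verbatim}\n", "\n\\end{verbatim}"),
}

class TagMapper(HTMLParser):
    """Keep all text; emit LaTeX only around the tags listed above."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_LATEX:
            self.parts.append(TAG_TO_LATEX[tag][0])

    def handle_endtag(self, tag):
        if tag in TAG_TO_LATEX:
            self.parts.append(TAG_TO_LATEX[tag][1])

    def handle_data(self, data):
        self.parts.append(data)

def html_to_latex_body(source):
    parser = TagMapper()
    parser.feed(source)
    parser.close()
    return "".join(parser.parts)
```

This sketch does not escape LaTeX special characters such as % or &, so real input would need an extra escaping step.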

It also "semi-covers" English text: it sprays \textlatin around groups of Latin-character words that appear on the same line.
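One simple way to get this effect is a regular expression that matches runs of Latin-letter words within a line. The pattern and the name `wrap_latin` below are assumptions for illustration; the script may group words differently (for example around digits or punctuation).

```python
import re

# A run of Latin-letter words on one line, separated by spaces or tabs.
LATIN_RUN = re.compile(r"[A-Za-z]+(?:[ \t]+[A-Za-z]+)*")

def wrap_latin(line):
    """Wrap each run of Latin-letter words in \\textlatin{...}."""
    return LATIN_RUN.sub(lambda m: "\\textlatin{" + m.group(0) + "}", line)
```

A pattern like this has to run on plain text before any LaTeX commands are added, otherwise command names such as \section would get wrapped too.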

It has been tested on a collection of mostly polytonic Greek pages I have gathered from the web, but naturally it doesn't handle all Unicode input: expect to do some manual edits after the conversion.

Updated: Sat Oct 8 12:33:59 2022