(April 2007)

For the impatient: file.html >file.tex 2>file.err
Download the script here.

This utility attempts to convert the text inside a Greek HTML file (ignoring the tags) into LaTeX. It uses the babel package to handle both monotonic and polytonic text, and it respects the "charset" meta declaration in the HTML code. If no "charset" is declared, it first tries to read the text as UTF-8 and, if that fails, as ISO-8859-7 (Modern Greek).
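The decoding order described above can be sketched roughly as follows. This is only an illustration, not the script's actual code: the function name decode_html, the regular expression, and the choice to fall back to guessing when the declared charset turns out to be unknown or wrong are all assumptions.

```python
import re

# Finds charset=... inside a <meta> tag, e.g. <meta charset="utf-8"> or
# <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-7">.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def decode_html(raw: bytes) -> str:
    """Decode raw HTML bytes: declared charset first, then UTF-8, then ISO-8859-7."""
    match = META_CHARSET.search(raw)
    if match:
        declared = match.group(1).decode("ascii")
        try:
            return raw.decode(declared)
        except (LookupError, UnicodeDecodeError):
            pass  # assumption: an unusable declaration falls through to guessing
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-7 leaves a few byte values undefined, so replace those
        # instead of failing on them.
        return raw.decode("iso-8859-7", errors="replace")
```

Trying UTF-8 before ISO-8859-7 is the safer order: Greek text encoded as ISO-8859-7 almost never forms valid UTF-8 sequences, whereas UTF-8 bytes would decode "successfully" as ISO-8859-7 and silently produce garbage.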

It correctly handles HTML entities (e.g. &quot;) and character references (e.g. &#2014;), and it includes custom-built "fixups" for some of the Unicode characters used in Greek polytonic pages.
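For readers who want to reproduce the entity handling, here is a minimal sketch that relies on html.unescape from the Python standard library (available since Python 3.4). It is not necessarily how the script does it, and the wrapper name decode_references is made up for this example. The custom "fixups" are not covered here, since the page does not list them.

```python
import html


def decode_references(text: str) -> str:
    """Replace named entities and numeric character references with the
    characters they stand for."""
    return html.unescape(text)


if __name__ == "__main__":
    # A named entity, a decimal reference and a hexadecimal reference.
    print(decode_references("&quot;&alpha;&#945;&#x3B1;&quot;"))
```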

When it meets a symbol it doesn't know about, it reports the error to stderr, formatted in UTF-8, giving both the offending symbol and the numbered line it occurred in. This allows the user to easily extend the script with code to handle that symbol (if you do, send it to me so I can update the script for the benefit of us all).
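The reporting behaviour can be illustrated with a small sketch. The symbol table KNOWN below is a hypothetical three-letter stand-in for the script's real table, and the names convert and report, as well as the exact message format, are assumptions made for this example.

```python
import sys

# Hypothetical, deliberately tiny symbol table (Greek letter -> LaTeX input).
KNOWN = {"α": "a", "β": "b", "γ": "g", " ": " "}


def convert(lines, table=KNOWN):
    """Convert each line; collect (line number, symbol) for unknown symbols."""
    converted, unknown = [], []
    for lineno, line in enumerate(lines, start=1):
        out = []
        for ch in line:
            if ch in table:
                out.append(table[ch])
            else:
                unknown.append((lineno, ch))
                out.append(ch)  # pass through unchanged, to be fixed by hand
        converted.append("".join(out))
    return converted, unknown


def report(unknown, stream=None):
    """Print one message per unknown symbol, to stderr by default."""
    stream = stream or sys.stderr
    for lineno, ch in unknown:
        print(f"line {lineno}: unknown symbol {ch!r} (U+{ord(ch):04X})", file=stream)
```

Extending such a converter then amounts to adding the reported symbol to the table and running it again.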

HTML is not structured: it is a presentation language, so the document structure can't be regenerated. The script simply maps the contents of the title, h1, h2, h3, h4 and pre tags to their respective LaTeX commands. Feel free to implement additional artificial intelligence rules.
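A tag-to-command mapping of this kind could look like the sketch below, built on html.parser.HTMLParser from the standard library. The particular LaTeX commands chosen for each tag are a guess (the page names the tags but not the commands), and TagMapper and html_to_latex are names invented for this example.

```python
from html.parser import HTMLParser

# Assumed mapping: tag -> (text emitted at the opening tag, text at the closing tag).
TAG_TO_LATEX = {
    "title": ("\\title{", "}"),
    "h1": ("\\section{", "}"),
    "h2": ("\\subsection{", "}"),
    "h3": ("\\subsubsection{", "}"),
    "h4": ("\\paragraph{", "}"),
    "pre": ("\\begin{verbatim}\n", "\n\\end{verbatim}"),
}


class TagMapper(HTMLParser):
    """Keep the text, drop the tags, and wrap the contents of mapped tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_TO_LATEX:
            self.parts.append(TAG_TO_LATEX[tag][0])

    def handle_endtag(self, tag):
        if tag in TAG_TO_LATEX:
            self.parts.append(TAG_TO_LATEX[tag][1])

    def handle_data(self, data):
        self.parts.append(data)


def html_to_latex(source: str) -> str:
    parser = TagMapper()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)
```

Every tag outside the table is simply dropped while its text is kept, which matches the "ignoring the tags" behaviour described earlier. Note that this sketch does not escape LaTeX special characters and would also pass script and style contents through as text.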

It also "semi-covers" English text: it sprays \textlatin around groups of words on the same line that use Latin characters.
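One way to get this effect is a regular expression that matches runs of Latin-letter words within a line. The sketch below is only an approximation of the idea: the pattern, the function name wrap_latin, and the decision to treat only ASCII letters as Latin are assumptions.

```python
import re

# One or more words made of ASCII letters, separated by spaces or tabs.
# Newlines are excluded, so a run never crosses a line boundary.
LATIN_RUN = re.compile(r"[A-Za-z]+(?:[ \t]+[A-Za-z]+)*")


def wrap_latin(line: str) -> str:
    """Wrap each run of Latin-letter words in \\textlatin{...}."""
    return LATIN_RUN.sub(lambda m: "\\textlatin{" + m.group(0) + "}", line)


if __name__ == "__main__":
    print(wrap_latin("Το Linux kernel είναι γρήγορο"))
```

A pass like this has to run on the plain text before any LaTeX commands are inserted, otherwise the command names themselves would be wrapped as well.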

It has been tested on a collection of mostly polytonic Greek pages I have gathered from the web. Naturally, it doesn't handle all Unicode input: expect to do some manual edits after the conversion.

GitHub member ttsiodras
Last update on: Tue Aug 9 19:52:23 2016
