PDF-to-HTML Rendering
Currently AltLaw relies mostly on court web sites for cases (see SourceWebSites). Most of those sites provide cases in PDF format. PDF is great for print documents, not so great for the web. Launching a browser plug-in every time you want to read a case is too slow. So we convert the PDFs to HTML that can be displayed in the browser.
PDF-to-HTML conversion is accomplished with a hacked version of the Poppler PDF library's pdftohtml command-line tool. This utility generates an HTML page that mimics the original formatting and layout of the PDF, including page breaks. It's not perfect, but it at least produces readable output that scales with the browser's default text size.
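As a rough illustration of the conversion step, here is a minimal Python sketch that shells out to pdftohtml. It assumes the stock Poppler pdftohtml binary is on the PATH and uses only stock options (-c, -noframes, -q). AltLaw's hacked build may take different options, and the function names and file paths are hypothetical.

```python
import subprocess
from pathlib import Path


def build_pdftohtml_command(pdf_path, out_base, binary="pdftohtml"):
    """Return the argument list for one conversion (no shell involved).

    The flags are stock Poppler pdftohtml options; AltLaw's modified
    build may differ (assumption).
    """
    return [
        binary,
        "-c",         # "complex" output: try to keep the PDF's layout
        "-noframes",  # do not generate a frameset
        "-q",         # do not print messages or errors
        str(pdf_path),
        str(out_base),
    ]


def convert(pdf_path, out_dir, binary="pdftohtml"):
    """Convert one PDF, raising CalledProcessError if the tool fails."""
    pdf_path = Path(pdf_path)
    out_base = Path(out_dir) / pdf_path.stem
    subprocess.run(build_pdftohtml_command(pdf_path, out_base, binary), check=True)
    return out_base
```

Passing the arguments as a list rather than a shell string avoids quoting problems with unusual file names in downloaded cases.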
Some possible improvements:
- Improve the current PDF-to-HTML translator.
- Best-case: write a PDF parser that can recognize formatting structures like paragraphs and footnotes and convert them into semantically-meaningful HTML markup.
- Alternative: Use Flash to embed a PDF viewer in the web page. This technique is used by Justia and Scribd. An open-source PDF-to-SWF converter exists, but it requires additional Flash code to create an interface that scrolls/zooms/turns pages.
- Unlikely: Convince the courts to publish opinions in structured XML/HTML.
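To make the "best-case" idea above a little more concrete, here is a hypothetical sketch of one heuristic such a parser might use: start a new paragraph whenever the vertical gap between consecutive text lines is noticeably larger than the line height. The Line type, the 1.5 threshold, and the assumption that line positions have already been extracted from a single page are all illustrative. This is not how the current converter works, and it does not attempt footnote detection.

```python
import html
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    top: float     # distance from the top of the page, in points
    height: float  # font size, in points


def group_paragraphs(lines, gap_factor=1.5):
    """Group the lines of one page into <p> elements.

    A new paragraph starts when the distance from the previous line's top
    exceeds gap_factor times the previous line's height (a guess, not a
    tuned value).
    """
    paragraphs, current, prev = [], [], None
    for line in sorted(lines, key=lambda l: l.top):
        if prev is not None and (line.top - prev.top) > gap_factor * prev.height:
            paragraphs.append(current)
            current = []
        current.append(line.text)
        prev = line
    if current:
        paragraphs.append(current)
    # Escape text so characters like "&" and "<" are valid in HTML.
    return ["<p>" + html.escape(" ".join(p)) + "</p>" for p in paragraphs]
```

For example, four 10-point lines at tops 100, 112, 140 and 152 would come out as two paragraphs of two lines each, because only the 28-point gap exceeds the 15-point threshold.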
Poppler: http://poppler.freedesktop.org/
Source code to AltLaw's hacked Poppler: http://lawcommons.org/darcs/darcsweb.cgi?r=poppler;a=summary
An earlier version of this problem was filed as ticket #3.
A ticket in the Poppler bug database about spacing problems: https://bugs.freedesktop.org/show_bug.cgi?id=12522
