Hands on with the Tesseract OCR engine

Tesseract is an open source optical character recognition (OCR) engine that was developed at HP Labs between 1985 and 1995 and is now developed at Google. According to Google, it is one of the most accurate open source OCR engines available.

Last week a colleague asked me if I could help him get an old syllabus of more than 200 pages, written in Dutch, back into his word processor. The original digital text of this syllabus was lost. What was left was a printed version and a PDF raster image file that had been saved at the printing office. The existence of the PDF file was a major time saver: I only had to convert the PDF file to images and feed them to an OCR engine. I started with GOCR, since version 0.40 was already installed on my workstation, but I wasn’t happy with the results: too many errors. So I decided to test Tesseract 2.01. Its output was clearly more accurate than GOCR’s and very acceptable. A detailed comparison and my OCR workflow follow below.

A comparison of some results.

Page 68 is representative of the syllabus as a whole. The first step is to convert this page from PDF to an image file. For this I use pdftoppm, a tool that is part of the xpdf viewer. The following command converts page 68 to a high-resolution 600 dpi ppm file called test-000068.ppm:
pdftoppm i.ouwehand.sociaal.recht.normaal.of.super.pdf -f 68 -l 68 -r 600 test
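
The naming scheme matters later, when the workflow loops over the generated files. As a minimal sketch, assuming the six-digit zero padding seen in test-000068.ppm (other pdftoppm builds may pad differently), the expected file name can be derived from the output root and the page number. The helper name ppm_name is hypothetical.

```shell
#!/bin/sh
# Build the file name pdftoppm is expected to produce for a root and a page.
# Assumption: six-digit zero padding, as in test-000068.ppm.
# Pass the page number without leading zeros (printf would read 08 as octal).
ppm_name() {
    # $1 = output root, $2 = page number
    printf '%s-%06d.ppm\n' "$1" "$2"
}

ppm_name test 68
```
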
Below is a scaled-down image of the original text of page 68; the full-resolution ppm file is the one generated by pdftoppm.

png image: converted pdf page

GOCR
To invoke GOCR:
gocr test-000068.ppm -o gocr_page68.txt
The output file contains numerous errors:

Het eerste ?s sJechts mogeJ?jk ?ndien llO-ve_rag 96 zou worden opgezegd, A8ngez?en d?t
pas ?n 2001 mogelijk ?s, wordt ondeRocht in hoeveme in de tussenl?ggende per?ode tot
verdere deregulering kan worden overgegaan, zonder dat uiter8ard str?jd met de
verdr8gsverpl.’icht?ngen ontstaat. H?erover ?s (opnieuw) contact met het _ureau v8n de IlO
gelegd.

In dit verb8nd ?s verder van bel8ng d8t op de 8te z?_?ng van de Intern$t?onale
Arbe?dsconferentie een algemene d?scusscode(012d)e over het onde_erp ”de rol v8n pa_?culcode(012d)ere
bure8us b?j het functioneren van de 8r&eidsmar_” is gevoerd. _owel de roI v8n
arbeid_sbure8us a_s d?e van u_i’_e.ndbureaus _s hierb?j aan de orde.

Tesseract
Tesseract only accepts TIFF files, so an additional conversion is required. I use the convert tool from ImageMagick.
convert test-000068.ppm test-000068.tif
To invoke Tesseract:
tesseract test-000068.tif tesseract_page68 -l nld
The output file contains one error:

Het eerste is slechts mogelijk indien ILO­verdrag 96 zou worden opgezegd. Aangezien dit
pas in 2001 mogelijk is, wordt onderzocht in hoeverre in de tussenliggende periode tot
verdere deregulering kan worden overgegaan, zonder dat uiteraard strijd met de
verdragsverplichtingen ontstaat. Hierover is (opnieuw) contact met het Bureau van de ILO
gelegd.

ln dit verband is verder van belang dat op de 81e zitting van de Internationale
Arbeidsconferentie een algemene discussie over het onderwerp “de rol van particuliere
bureaus bij het functioneren van de arbeidsmarkt” is gevoerd. Zowel de rol van
arbeidsbureaus als die van uitzendbureaus was hierbij aan de orde.
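
To put a rough number on the difference between the two outputs, one option is to count characters that rarely occur in running Dutch text but show up often in the GOCR sample above. This is only a crude proxy, not a real accuracy measure: the character set (?, _, $ and &) is my own assumption based on that sample, and the helper name count_suspects is hypothetical. A misread letter such as a wrong capital is not caught.

```shell
#!/bin/sh
# Count suspicious characters in an OCR output file.
# Assumption: ?, _, $ and & hardly ever appear in the real text, so every
# occurrence is treated as a likely recognition error.
count_suspects() {
    # $1 = OCR output file
    # tr -cd deletes every byte that is NOT in the set; wc -c counts the rest.
    tr -cd '?_$&' < "$1" | wc -c | tr -d ' '
}

# Compare the two result files from this page, if they are present.
for f in gocr_page68.txt tesseract_page68.txt; do
    if [ -f "$f" ]; then
        echo "$f: $(count_suspects "$f") suspicious characters"
    fi
done
```
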

My final OCR workflow.
Please be aware that the generated ppm files consume a lot of space (100MB per file).
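
The figure of roughly 100 MB can be checked with a little arithmetic. The sketch below assumes A4 pages (8.27 x 11.69 inch) and an uncompressed RGB ppm with 3 bytes per pixel; it ignores the small file header. The helper name ppm_size_mb is hypothetical.

```shell
#!/bin/sh
# Estimate the size of one uncompressed RGB ppm page in megabytes.
# Assumptions: A4 paper (8.27 x 11.69 inch), 3 bytes per pixel, 1 MB = 10^6 bytes.
ppm_size_mb() {
    # $1 = resolution in dpi
    width_px=$(( 827 * $1 / 100 ))     # 8.27 inch wide
    height_px=$(( 1169 * $1 / 100 ))   # 11.69 inch high
    echo $(( width_px * height_px * 3 / 1000000 ))
}

ppm_size_mb 600   # prints 104
ppm_size_mb 300   # prints 26
```

Under these assumptions, halving the resolution cuts the disk usage to about a quarter, at the risk of less accurate recognition.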

#!/bin/sh
# Tesseract OCR workflow
# by Jeroen Leijen 7 december 2007
# http://jeroen.leijen.net
#
# Run this script from the directory that contains 
# the pdf file to be ocr-ed
#
# Keep the log of a previous run, if there is one
[ -f ocr.log ] && mv ocr.log ocr.log$$
#
echo '* Starting pdf to ppm conversion' >> ocr.log
echo "It's now `date`" >> ocr.log
pdftoppm i.ouwehand.sociaal.recht.pdf -r 600 ouwehand
#if you have limited disk space, extract only a few pages per run
#pdftoppm i.ouwehand.sociaal.recht.pdf -f 40 -l 80 -r 600 ouwehand
echo '* Finished pdf to ppm conversion' >> ocr.log
#
echo "It's now `date`" >> ocr.log
echo '* Starting ppm to tiff conversion' >> ocr.log
echo "It's now `date`" >> ocr.log
for i in *.ppm; do convert "$i" "`basename "$i" .ppm`.tif"; done
echo '* Finished ppm to tiff conversion' >> ocr.log
echo "It's now `date`" >> ocr.log
#
echo '* Starting ocr' >> ocr.log
echo "It's now `date`" >> ocr.log
for i in *.tif; do tesseract "$i" "`basename "$i" .tif`" -l nld; done
echo '* Finished ocr' >> ocr.log
echo "It's now `date`" >> ocr.log
#
# Collect all individual txt files to one file called final
for i in *.txt; do cat "$i" >> final; echo "[pagebreak]" >> final; done
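
One thing to watch with the last step: because it appends to final, running the script a second time adds all pages again. A possible variant, sketched here with a hypothetical helper called merge_pages, empties the output file first. It assumes the output name does not end in .txt, so the result is never picked up by the loop itself.

```shell
#!/bin/sh
# Merge all per-page txt files in the current directory into one file,
# starting from an empty output so a rerun does not duplicate pages.
merge_pages() {
    # $1 = output file name (default: final); should not match *.txt
    out=${1:-final}
    : > "$out"                        # create or truncate the output file
    for page in *.txt; do
        [ -f "$page" ] || continue    # skip the literal pattern if nothing matches
        cat "$page" >> "$out"
        echo "[pagebreak]" >> "$out"
    done
}
```

Calling merge_pages final after the OCR loop gives the same result as the original one-liner on a first run.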

Tesseract download, install and usage
Download from Google: http://code.google.com/p/tesseract-ocr/

Untar and compile. On my SUSE 10.1 workstation all I had to do was a plain configure, make and make install. Don’t forget to copy the language data files into the appropriate directory. In my case: cd packages/tesseract-2.01/tessdata, followed by: cp nld* /usr/local/share/tessdata/

Usage: tesseract inputimage outputbase -l langcode
