tesseract fails to train/OCR with certain numbers 6, 8, 9, 0

Bug #1010577 reported by Peter Edmond on 2012-06-08

This bug affects 1 person

Affects              Status   Importance   Assigned to   Milestone
tesseract (Ubuntu)   New      Undecided    Unassigned

Bug Description

There is a fault with training certain numbers. These are:

0,6,8,9

The problem is that if a TIFF image contains ONLY the aforementioned digits (on a single line or on multiple lines), tesseract will not train on the image and produces an 'Empty Page' response instead. No box file is created.

Changing the page segmentation mode (psm) does not alter this.

However, adding, say, a 3 to the line makes the whole line immediately recognisable to the OCR engine.

I have attached a sample 0 to 9 tiff for working with.

The error can be reproduced by making an image that contains only the aforementioned digits.

When recognising a number such as 8869860, nothing is returned by the OCR engine, even though the same digits are recognised 100% correctly when OCRed as single digits, or when extra digits are added to the end of the image to be OCRed.

The workaround is to make sure that the aforementioned digits are never seen in isolation, by artificially adding extra digits (in my case I add 53 to every image before OCRing it, and then strip the 53 off the result). Alternatively, you can break the image up into individual digits and OCR each digit separately using -psm 10.
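
For anyone wanting to script the workaround, here is a rough sketch in Python of the two text-side steps: building the tesseract command line (optionally with -psm 10 for single digits) and stripping the padding digits from the result. The function names, the file names and the assumption that the padded image has already been produced by some other tool are all hypothetical; only the 53 padding and the -psm 10 flag come from the report above.

```python
import subprocess

SENTINEL = "53"  # padding digits appended to every image, as described in the report


def tesseract_command(image_path, output_base, psm=None):
    """Build the argument list for a tesseract run (hypothetical helper).

    psm=10 asks tesseract to treat the image as a single character.
    """
    cmd = ["tesseract", image_path, output_base]
    if psm is not None:
        cmd += ["-psm", str(psm)]
    return cmd


def strip_sentinel(ocr_text, sentinel=SENTINEL):
    """Remove the padding digits from the end of the OCR output."""
    digits = "".join(ocr_text.split())  # drop spaces and newlines
    if not digits.endswith(sentinel):
        raise ValueError("padding %r not found at end of %r" % (sentinel, digits))
    return digits[:-len(sentinel)]


def ocr_padded_image(image_path, output_base):
    """Run tesseract on an already padded image and return the real digits.

    Assumes tesseract writes its result to output_base + ".txt".
    """
    subprocess.check_call(tesseract_command(image_path, output_base))
    with open(output_base + ".txt") as result:
        return strip_sentinel(result.read())
```

With this, strip_sentinel("886986053\n") gives back "8869860". The check for the padding is deliberate: if the 53 is missing from the output, the OCR run probably failed and it is better to raise an error than to return a truncated number.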

More example images available on request.
