tesseract fails to train/OCR with certain numbers 6, 8, 9, 0
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tesseract (Ubuntu) | New | Undecided | Unassigned |
Bug Description
There is a fault with training certain numbers. These are:
0,6,8,9
The problem is that if a TIFF/image contains ONLY the aforementioned numbers (on a single line or on multiple lines), then tesseract will not train with the image and produces an 'Empty Page' response. No box file is created.
Changing the page segmentation mode (psm) does not alter this.
However, adding, say, a 3 to the line makes the whole line immediately recognisable to the OCR engine.
I have attached a sample 0 to 9 tiff for working with.
The error can be demonstrated by making an image containing only the aforementioned digits.
When recognising a number such as 8869860, nothing is returned by the OCR engine, even though the same digits are recognised 100% when presented as single digits, or when extra digits are added to the end of the image to be OCRed.
The workaround is to make sure that the aforementioned digits are never seen in isolation, by artificially adding extra digits (in my case I add 53 to every image before OCRing it, and then strip the 53 off the result). Alternatively, you can break the image up into individual digits and OCR each digit separately using -psm 10.
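For anyone wanting to script the workaround, here is a rough Python sketch of the two text-side pieces: stripping the padding digits from the OCR result, and building the single-character command line. The function names and the `SENTINEL` constant are my own, purely for illustration. The step that actually pastes the extra 53 onto the image is not shown, and the `-psm` spelling is the one used in this report (newer tesseract releases spell it `--psm`).

```python
SENTINEL = "53"  # digits appended to every image before OCR, as described above


def strip_sentinel(ocr_text, sentinel=SENTINEL):
    """Return the recognised number with the trailing padding digits removed.

    Hypothetical helper: keeps only digit characters from the OCR output,
    then checks that the padding really is at the end before cutting it off.
    """
    digits = "".join(ch for ch in ocr_text if ch.isdigit())
    if not digits.endswith(sentinel):
        raise ValueError("padding digits not found at end of OCR output: %r" % ocr_text)
    return digits[: -len(sentinel)]


def build_single_char_command(image_path, output_base):
    """Build the argument list for OCRing one pre-cropped digit with -psm 10.

    Hypothetical helper: it only builds the list and does not run tesseract.
    Pass the result to subprocess.run() if tesseract is installed.
    """
    return ["tesseract", image_path, output_base, "-psm", "10"]
```

Raising an error when the 53 is missing is deliberate: if the padding did not come back, the rest of the line probably cannot be trusted either.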
More example images available on request.
This image trains/OCRs perfectly well