Comment 58 for bug 623438

George Chriss (gschriss) wrote : Re: Font size not correct in merged sandwich PDF

Treating Comment #1 as "works as intended" (with a character precision limitation) and Bug #632524 as "broken" (font size/placement has no correlation to underlying text + out-of-bounds/missing/"dog-piled" text), I'm happy to report the following:

While developing a new Inkscape extension to export hand-drawn/typed text boxes as hOCR, I came across the same issues reported in Bug #632524. The hOCR file generated by the extension uses neither 'ocr_word' nor 'ocr_cinfo' elements, just plain text within 'ocr_line' parent elements (with corresponding unique 'id' and 'bbox' attributes).

I believe hocr2pdf was mis-parsing the file, expecting each character to be contained within its own bbox. As a stop-gap measure, adding either matching <p></p> elements around each plain text line or, alternatively, a <br> at the end of each plain text line resulted in 'proper' text placement. The <title> element also needs to be escaped in this way.
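
For illustration only, here is a minimal Python sketch of the stop-gap applied while generating an 'ocr_line' element. It is not the extension's actual code: the helper name ocr_line and its parameters are hypothetical, the bbox is written in the usual hOCR title="bbox x0 y0 x1 y1" form, and placing the <br> or <p></p> inside the span (rather than around it) is an assumption.

```python
from html import escape


def ocr_line(line_id, bbox, text, workaround="br"):
    """Build one hOCR 'ocr_line' span that holds plain text only.

    Hypothetical helper: no 'ocr_word' or 'ocr_cinfo' children are emitted.
    workaround="br" appends <br> after the text, workaround="p" wraps the
    text in <p></p>, and workaround=None emits the bare text.
    """
    x0, y0, x1, y1 = bbox
    body = escape(text)  # escape &, <, > and quotes in the line text
    if workaround == "br":
        body = body + "<br>"
    elif workaround == "p":
        body = "<p>" + body + "</p>"
    # Assumption: the markup goes inside the ocr_line span.
    return (
        f"<span class='ocr_line' id='{line_id}' "
        f"title='bbox {x0} {y0} {x1} {y1}'>{body}</span>"
    )


if __name__ == "__main__":
    print(ocr_line("line_1", (10, 20, 110, 40), "Hello & bye"))
    print(ocr_line("line_2", (10, 50, 110, 70), "Second line", workaround="p"))
```

Whether hocr2pdf needs the extra markup inside or outside the span was not checked here, so treat the placement as something to verify against the output PDF.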

Tested with exact-image 0.8.8. I wasn't able to complete the build due to relocation errors, but '/objdir/frontends/hocr2pdf' was usable as-is. 'lib/hocr.cc' is the file in need of patches.