Comment 51 for bug 623438

Jakub Wilk (jwilk) wrote : Re: [Bug 623438] Re: Font size not correct in merged sandvich PDF

>Example:
><span class='ocr_line' id='line_1' title="bbox 0 0 45 20"><span class='ocr_xword' id='xword_1' title="bbox 0 0 20 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">hello</span></span><span> </span><span class='ocr_xword' id='xword_2' title="bbox 25 0 45 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">world</span></span></span>
>(note the whitespace which is not part of any ocr_xword as cuneiform will produce an incorrect bbox for it)

That looks much better than the current output or pre-0.9 output.
However, I'm not sure if/why we need ocr_cinfo at all here. AFAIU,
"x_bboxes" is analogous to "cuts" and "nlp" properties, which could be
applied to any element (e.g. directly to an ocr_xword).
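
To illustrate why a wrapper element may be unnecessary: a consumer reads
properties from whatever title attribute carries them, so "x_bboxes" could
sit next to "bbox" on the ocr_xword itself. Below is a minimal sketch in
Python, assuming properties in a title are separated by semicolons and that
x_bboxes is a flat list of integer x0 y0 x1 y1 groups. The helper names and
the sample title are hypothetical and are not taken from any existing tool.

```python
def parse_hocr_title(title):
    """Split an hOCR title attribute into {property_name: [value strings]}.

    Assumption: properties are separated by ';' and each property is a
    name followed by whitespace-separated values.
    """
    props = {}
    for clause in title.split(";"):
        parts = clause.split()
        if not parts:
            continue  # skip empty clauses, e.g. after a trailing ';'
        props[parts[0]] = parts[1:]
    return props


def char_boxes(props):
    """Group the flat x_bboxes value list into (x0, y0, x1, y1) tuples.

    Any incomplete trailing group is ignored.
    """
    flat = [int(v) for v in props.get("x_bboxes", [])]
    usable = len(flat) - len(flat) % 4
    return [tuple(flat[i:i + 4]) for i in range(0, usable, 4)]


# Hypothetical title with x_bboxes placed directly on an ocr_xword element
# (made-up coordinates for a two-character word).
title = "bbox 0 0 20 20; x_bboxes 0 0 9 20 11 0 20 20"
props = parse_hocr_title(title)
word_box = tuple(int(v) for v in props["bbox"])
boxes = char_boxes(props)
```

The same two functions work unchanged whether the title belongs to an
ocr_xword or to an ocr_cinfo span, which is the point: the parser does not
care which element the property is attached to.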

Anyway, if there are any doubts about the interpretation of the hOCR
specification (which is admittedly vague), it's better to ask at
<email address hidden> than to guess.

--
Jakub Wilk