BuildPDF confused by non-Latin characters

Bug #257733 reported by Hank Bromley
Affects: Deriver
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

(First noticed by Dan in the pdfs for peaceofaristopha00arisuoft, which was ocr'd into English and Greek, but probably true for all pdfs made to date with non-Latin characters.)

The symptoms:

Attempting to copy-and-paste non-Latin passages from the pdf to other apps does not yield characters in the proper character set. The font size for non-Latin characters also appears to be much too large (visible if you highlight text in the pdf): the oversized text actually occludes words on other lines, so those words are not found by search queries.

What we know so far:

djvu.xml (and djvu.txt) are written in utf-8, with non-Latin characters indicated directly as multibyte characters and no font specifiers; in other words, abstract characters are specified, but not specific glyphs (visual representations of those characters).
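To make the "abstract characters, not glyphs" point concrete, here is a minimal Python sketch that prints the code point and utf-8 bytes for a Latin letter and two Greek letters. The function name `utf8_breakdown` and the sample string are made up for illustration; nothing here is taken from the BuildPDF code.

```python
import unicodedata


def utf8_breakdown(text):
    """Describe each character in text: code point, Unicode name, utf-8 bytes."""
    rows = []
    for ch in text:
        encoded = ch.encode("utf-8")
        rows.append({
            "char": ch,
            "code_point": f"U+{ord(ch):04X}",
            # Fall back to a placeholder for characters without a Unicode name.
            "name": unicodedata.name(ch, "<unnamed>"),
            "utf8_hex": " ".join(f"{b:02x}" for b in encoded),
            "byte_count": len(encoded),
        })
    return rows


if __name__ == "__main__":
    # Illustrative sample: Latin "a", Greek epsilon, Greek iota with psili.
    for row in utf8_breakdown("a\u03b5\u1f30"):
        print(row)
```

The Latin letter takes one byte, the plain Greek letter two, and the polytonic Greek letter three. None of those bytes says anything about which font or glyph to draw; that choice has to be made when the text is written into the pdf.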

In order to get those multibyte characters properly inserted into the pdf, some explicit steps are needed that we're not currently taking. iText tutorials recommend two ways to deal with non-Latin characters. (See the tutorial material at http://itextdocs.lowagie.com/tutorial/fonts/index.php#basefont , particularly the two examples at the end of the 2nd bullet point - UnicodeExample and EncodingFont.)

Since we're not inserting visual text, but just the hidden layer for searching and copy/pasting, glyphs don't matter for our purposes, and therefore it seems specific fonts may not, either. So the approach in UnicodeExample may be adequate, and perhaps easier than checking whether a given font has an encoding for the characters we need to use, as the EncodingFont approach seems to require.
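As a rough way to find which items are affected, the sketch below lists the characters in a piece of OCR text that a single-byte encoding cannot represent. It assumes, without having confirmed it in the BuildPDF code, that the hidden text layer is currently written with a single-byte encoding such as Cp1252; the function names `unencodable_chars` and `needs_unicode_path` are hypothetical. It is a pre-check on the djvu.txt side only, not a substitute for the iText changes discussed above.

```python
def unencodable_chars(text, encoding="cp1252"):
    """Return the distinct characters in text that `encoding` cannot represent.

    Characters are returned in order of first appearance.
    The default encoding is an assumption, not a confirmed BuildPDF setting.
    """
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def needs_unicode_path(text, encoding="cp1252"):
    """True if any character in text falls outside the single-byte encoding."""
    return bool(unencodable_chars(text, encoding))


if __name__ == "__main__":
    sample = "peace \u03b5\u1f30\u03c1"  # illustrative mix of Latin and Greek
    print(unencodable_chars(sample))
    print(needs_unicode_path(sample))
```

Running a check like this over each item's djvu.txt would give a list of pdfs that contain non-Latin text and so would need regenerating once a fix is in place.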

There's another example of inserting characters from various character sets into an iText-generated pdf at http://itext.ugent.be/library/question.php?id=741 .

Tags: pdf