BuildPDF confused by non-Latin characters
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Deriver | New | Undecided | Unassigned | | |
Bug Description
(First noticed by Dan in the pdfs for peaceofaristoph
The symptoms:
Copying and pasting non-Latin passages from the pdf into other apps does not yield characters in the proper character set. The font size for non-Latin characters also appears to be much too large (visible if you highlight text in the pdf); the oversized text occludes words on other lines, so those words are not visible to search queries.
What we know so far:
djvu.xml (and djvu.txt) are written in utf-8, with non-Latin characters indicated directly as multibyte characters and no font specifiers; in other words, abstract characters are specified, but not specific glyphs (visual representations of those characters).
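To illustrate the distinction between bytes and abstract characters, here is a minimal sketch in plain Java (no iText involved). The class name Utf8Demo and the sample Greek word are made up for illustration and are not taken from BuildPDF or from the affected book.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only: not part of BuildPDF.
class Utf8Demo {
    /** Number of bytes the text occupies when encoded as utf-8. */
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of abstract characters (Unicode code points) in the text. */
    static int codePointCount(String text) {
        return text.codePointCount(0, text.length());
    }

    /** Round-trips the text through utf-8 bytes, as when reading djvu.txt. */
    static String roundTrip(String text) {
        byte[] raw = text.getBytes(StandardCharsets.UTF_8);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Sample Greek word written with escapes so the source stays ASCII.
        String greek = "\u03b5\u1f30\u03c1\u03ae\u03bd\u03b7";
        System.out.println("utf-8 bytes: " + utf8Length(greek));
        System.out.println("code points: " + codePointCount(greek));
        System.out.println("round trip intact: " + roundTrip(greek).equals(greek));
    }
}
```

The byte count is larger than the character count because each Greek letter takes two or three bytes in utf-8. Nothing in the byte stream says which font or glyph to use.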
In order to get those multibyte characters properly inserted into the pdf, some explicit steps are needed that we are not currently taking. iText tutorials recommend two ways to deal with non-Latin characters. (See the tutorial material at http://
Since we're not inserting visual text, but just the hidden layer for searching and copy/pasting, glyphs don't matter for our purposes, and therefore it seems specific fonts may not, either. So the approach in UnicodeExample may be adequate, and perhaps easier than checking whether a given font has an encoding for the characters we need to use, as the EncodingFont approach seems to require.
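Whichever approach is chosen, it may help to know which passages are affected. The following is a rough sketch in plain Java, not the iText UnicodeExample or EncodingFont code, that reports which non-Latin scripts occur in a string. The class ScriptScan and its method are hypothetical names; BuildPDF would have to call something like this on each line of text before writing the hidden layer.

```java
import java.lang.Character.UnicodeScript;
import java.util.EnumSet;
import java.util.Set;

// Hypothetical helper, for illustration only: not part of BuildPDF or iText.
class ScriptScan {
    /**
     * Returns the scripts other than Latin found in the text.
     * COMMON (spaces, digits, punctuation) and INHERITED (combining marks)
     * are ignored because they do not identify a script on their own.
     */
    static Set<UnicodeScript> nonLatinScripts(String text) {
        Set<UnicodeScript> found = EnumSet.noneOf(UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            UnicodeScript script = UnicodeScript.of(cp);
            if (script != UnicodeScript.LATIN
                    && script != UnicodeScript.COMMON
                    && script != UnicodeScript.INHERITED) {
                found.add(script);
            }
        });
        return found;
    }

    public static void main(String[] args) {
        // Mixed Latin and Greek sample; escapes keep the source ASCII.
        String line = "peace \u03b5\u1f30\u03c1\u03ae\u03bd\u03b7";
        System.out.println(nonLatinScripts(line));
    }
}
```

A line for which this returns an empty set could presumably go through the existing code path unchanged; only lines with a non-empty result would need the Unicode handling discussed above.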
There's another example of inserting characters from various character sets into an iText-generated pdf at http://