[upstream] Arabic text gets deformed when creating a PDF in LibreOffice Writer

Bug #1772439 reported by Miikka-Markus Alhonen
This bug affects 1 person
Affects               Status     Importance  Assigned to  Milestone
LibreOffice           Confirmed  Medium
libreoffice (Ubuntu)  Confirmed  Medium      Unassigned

Bug Description

Creating a PDF from a document written in the Arabic script deforms the textual content of the document, although it looks fine on the screen.

For example, see the attached PDF created with Writer, where the example sentence "اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ" looks as it should, but when you view it with any PDF reader, such as evince, copying the text deforms most of the words. Some characters are clearly visible but cannot be selected or searched (such as ى at the end of the first word اشترى). If I search for the second word بلال, evince tells me there are no matches in the document. The same happens when converting the file with pdftotext, which produces the following output:

‫اشتر‬

‫للا مسة لفا كتاب وَأَنَا ْ‬
‫ه‬
‫اشت َ َريْتُهَا ِ‬
‫من ْ ُ‬

Here only two of the eight words are intact, the rest are garbled in one way or another. If the text is in Latin script, both evince and pdftotext behave as expected, meaning that the textual content is transferred correctly from Writer to the PDF.
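The "how many words are intact" count above can be automated. The following is a minimal sketch, not part of the original report: the function names are hypothetical, and it assumes the pdftotext output may be wrapped in bidi control characters such as U+202B and U+202C, which it strips before comparing.

```python
import unicodedata

# Bidi control characters that may surround right-to-left lines in extracted text
# (assumption: these carry no textual content and can be dropped before comparing).
BIDI_CONTROLS = {"\u202a", "\u202b", "\u202c", "\u200e", "\u200f"}


def clean(text):
    """Drop bidi control characters and normalize to NFC."""
    stripped = "".join(ch for ch in text if ch not in BIDI_CONTROLS)
    return unicodedata.normalize("NFC", stripped)


def intact_words(original, extracted):
    """Return the words of `original` that occur as whole words in `extracted`."""
    extracted_words = set(clean(extracted).split())
    return [word for word in clean(original).split() if word in extracted_words]
```

Calling intact_words() with the sentence typed into Writer and the text returned by pdftotext gives the list of words that survived the export unchanged.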

Description: Ubuntu 17.10
Release: 17.10

libreoffice-writer:
  Installed: 1:5.4.6-0ubuntu0.17.10.1
  Candidate: 1:5.4.6-0ubuntu0.17.10.1
  Version table:
 *** 1:5.4.6-0ubuntu0.17.10.1 500
        500 http://mr.archive.ubuntu.com/ubuntu artful-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1:5.4.5-0ubuntu0.17.10.5 500
        500 http://security.ubuntu.com/ubuntu artful-security/main amd64 Packages
     1:5.4.1-0ubuntu1 500
        500 http://mr.archive.ubuntu.com/ubuntu artful/main amd64 Packages

ProblemType: Bug
DistroRelease: Ubuntu 17.10
Package: libreoffice-writer 1:5.4.6-0ubuntu0.17.10.1
ProcVersionSignature: Ubuntu 4.13.0-41.46-generic 4.13.16
Uname: Linux 4.13.0-41-generic x86_64
ApportVersion: 2.20.7-0ubuntu3.8
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Mon May 21 15:18:41 2018
InstallationDate: Installed on 2017-02-13 (462 days ago)
InstallationMedia: Ubuntu 16.10 "Yakkety Yak" - Release amd64 (20161012.2)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fi_FI.UTF-8
 SHELL=/bin/bash
SourcePackage: libreoffice
UpgradeStatus: Upgraded to artful on 2017-11-05 (196 days ago)

Miikka-Markus Alhonen (malhonen) wrote :

Olivier Tilloy (osomon) wrote :

Sorry for the lack of feedback until now, Miikka.

Are you able to test whether the same thing works correctly in other word processing software, such as AbiWord or MS Word?

Olivier Tilloy (osomon) wrote :

I'm seeing different results than what you describe when testing on Ubuntu 18.04 with libreoffice 6.0.3, although it doesn't seem to behave exactly as you would expect, either. Can you test on 18.04 and share here whether the situation is any better?

Miikka-Markus Alhonen (malhonen) wrote :

I tested the same example sentence with Ubuntu 18.04 and LibreOffice 6.0.3.2. Here’s the output from pdftotext:

ه‬
اشترى للا خمسة آفا كتاب وَأنَا اشْ ت َ َريْتُهَا ِ‬
من ْ ُ‬

Here four out of the eight words are intact, so it’s an improvement over 5.4.6 but still leaves a lot to be desired. The last word of the sentence (مِنْهُ) is broken into pieces so that the last full character ه is found on the first line and the two others on the last. Diacritical marks are sometimes placed where they are supposed to be (such as the first and the three last diacritics in the word اشْتَرَيْتُهَا) but sometimes not (the middle of the same word and the last word of the sentence مِنْهُ). This time ى is visible but the first letter of the following word ب is not.
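The misplaced diacritics described above can be located mechanically. This is a rough sketch under one assumption: a diacritic counts as "detached" when it does not follow a letter (directly or via other marks), for example when it follows a space. The function name is hypothetical.

```python
import unicodedata


def detached_marks(text):
    """Return (index, Unicode name) for nonspacing marks that have no base letter before them."""
    detached = []
    has_base = False  # True while we are inside a "letter + marks" cluster
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category == "Mn":
            # Arabic vowel signs (fatha, kasra, sukun, ...) are nonspacing marks.
            if not has_base:
                detached.append((index, unicodedata.name(ch, "UNKNOWN")))
        else:
            # Only a letter can serve as a base; spaces and controls reset the cluster.
            has_base = category.startswith("L")
    return detached
```

Running it over the pdftotext output should flag marks like the ones that end up after a space in the garbled lines, while a correctly extracted word returns an empty list.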

Here’s what MS Word 2007 (12.0.6787.5000, SP3 MSO 12.0.6785.5000) on Windows 8.1 produces when processed by pdftotext:

اشترى بالل خمسة آالف كتاب وأنا اشتريتها منه‬

So Word 2007 drops all the diacritics, and mixes up the order of the letters in the combination ل (U+0644) + ا (U+0627) producing ال instead of لا. Otherwise the output is intact and definitely much better than LO. I don't have any newer versions of MS Word at my disposal, so I can't test it further.
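The two effects seen in the Word 2007 PDF can be modelled in a few lines, which makes it possible to check whether an extracted text differs from the original only in those two ways. This is a sketch based purely on the output shown above, not on how Word or pdftotext work internally; the function names are hypothetical, and the input is assumed to be in NFC so that letters like آ are single code points.

```python
import unicodedata

LAM = "\u0644"   # ل
ALEF = "\u0627"  # ا


def strip_diacritics(text):
    """Remove nonspacing marks (the Arabic vowel signs are in category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def swap_lam_alef(text):
    """Reverse every LAM + ALEF pair, imitating the reordering seen in the output above."""
    return text.replace(LAM + ALEF, ALEF + LAM)


def word2007_like(text):
    """Apply both observed effects: drop diacritics, then reverse LAM + ALEF."""
    return swap_lam_alef(strip_diacritics(text))
```

If word2007_like(original) equals the extracted text (after removing any bidi control characters), the damage is limited to dropped diacritics and reversed ل + ا pairs.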

Olivier Tilloy (osomon) wrote :

Thanks for the confirmation Miikka. Confirming the bug, based on what I observe and your feedback.

This is most likely an upstream issue. Would you mind filing a bug report at https://bugs.documentfoundation.org/enter_bug.cgi?product=LibreOffice&format=guided and linking to it here so we can track its resolution? Thanks!

Changed in libreoffice (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
In , Miikka-Markus Alhonen (malhonen) wrote :

Description:
Creating a PDF from a document written in the Arabic script deforms the textual content of the document, although it looks fine on the screen.

For example, see the attached PDF created with Writer 5.4.6.2 on Ubuntu 17.10, where the example sentence "اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ" looks as it should, but when you view it with any PDF reader, such as evince, copying the text deforms most of the words. Some characters are clearly visible but cannot be selected or searched (such as ى at the end of the first word اشترى). If I search for the second word بلال, evince tells me there are no matches in the document. The same happens when converting the file with pdftotext, which produces the following output:

‫اشتر‬

‫للا مسة لفا كتاب وَأَنَا ْ‬
‫ه‬
‫اشت َ َريْتُهَا ِ‬
‫من ْ ُ‬

Here only two of the eight words are intact, the rest are garbled in one way or another. If the text is in Latin script, both evince and pdftotext behave as expected, meaning that the textual content is transferred correctly from Writer to the PDF.

On LO 6.0.3.2 on Ubuntu 18.04, the textual content is preserved a little better but it is still quite garbled. This is the output from pdftotext:

ه‬
اشترى للا خمسة آفا كتاب وَأنَا اشْ ت َ َريْتُهَا ِ‬
من ْ ُ‬

Here four out of the eight words are intact, and for example the last word of the sentence is divided so that the last full character is found on the first line and the rest on the third line. Some diacritics are found where they are supposed to be, others are not.

MS Word 2007 handles this case better, although it's not perfect either. This is the output from pdftotext:

اشترى بالل خمسة آالف كتاب وأنا اشتريتها منه‬

Here all diacritics are dropped and all sequences of ل (U+0644) + ا (U+0627) are reversed turning لا into ال. Otherwise the sentence is intact.

This bug was first reported on Launchpad for LO 5.4.6.2 on Ubuntu 17.10 at: https://bugs.launchpad.net/ubuntu/+source/libreoffice/+bug/1772439 . After my initial report, I have upgraded to LO 6.0.3.2 where the problem persists, although the actual output is different. Another user on Launchpad confirmed the bug on LO 6.0.3.2, as well.

Steps to Reproduce:
1. In a new Writer document, type some text in Arabic. My example sentence was: اشترى بلال خمسة آلاف كتاب وَأَنَا اشْتَرَيْتُهَا مِنْهُ
2. Create a PDF.
3. Open the created PDF with a PDF reader (such as evince) and type one of the words in the Search dialog, e.g. بلال. Alternatively select the word in the PDF reader and copy-paste it somewhere else. You can also convert the PDF to text using a utility like pdftotext.

Actual Results:
The PDF reader reports there are no matches for some of the words in the document, although they are all clearly visible. Selecting and copy-pasting the word garbles it. Pdftotext's output is garbled.

Expected Results:
All the words that are visible should also be searchable in a PDF reader, copy-pasting should preserve the text, and the output of pdftotext should match the original document.
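The steps above can also be scripted. The sketch below assumes that `soffice` and `pdftotext` are on the PATH, that a headless `--convert-to pdf` run goes through the same PDF export as the Writer UI (not verified here), and that the test document already exists; the file name used in the usage note is hypothetical.

```python
import subprocess
from pathlib import Path


def convert_command(document, out_dir):
    """Command line for a headless LibreOffice conversion to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(document)]


def extract_command(pdf):
    """Command line that makes pdftotext write UTF-8 text to stdout ("-")."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf), "-"]


def pdf_text(document, out_dir):
    """Convert `document` to PDF and return the text pdftotext extracts from it."""
    out_dir = Path(out_dir)
    subprocess.run(convert_command(document, out_dir), check=True)
    # The converted file keeps the document's base name with a .pdf extension.
    pdf = out_dir / (Path(document).stem + ".pdf")
    result = subprocess.run(extract_command(pdf), check=True, capture_output=True)
    return result.stdout.decode("utf-8")
```

For example, pdf_text("arabic-sample.odt", "out") would return the extracted text, which can then be compared word by word with the sentence typed in step 1.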

Reproducible: Always

User Profile Reset: No

Additional Info:

In , Miikka-Markus Alhonen (malhonen) wrote :

Created attachment 144554
PDF created with LO 5.4.6.2 where textual content is garbled

Miikka-Markus Alhonen (malhonen) wrote :
Changed in df-libreoffice:
importance: Unknown → Medium
status: Unknown → New
In , Beluga (beluga) wrote :

Repro. Can only successfully search with individual glyphs in PDF

Arch Linux 64-bit
Version: 6.2.0.0.alpha0+
Build ID: 8b1501d80dc9d3f42c351c6e026fa737e116cae5
CPU threads: 8; OS: Linux 4.18; UI render: default; VCL: gtk3_kde5;
Locale: fi-FI (fi_FI.UTF-8); Calc: threaded
Built on 23 September 2018

Changed in df-libreoffice:
status: New → Confirmed
In , Qa-admin-q (qa-admin-q) wrote :

Dear vaaydayaasra,

To make sure we're focusing on the bugs that affect our users today, LibreOffice QA is asking bug reporters and confirmers to retest open, confirmed bugs which have not been touched for over a year.

There have been thousands of bug fixes and commits since anyone checked on this bug report. During that time, it's possible that the bug has been fixed, or the details of the problem have changed. We'd really appreciate your help in getting confirmation that the bug is still present.

If you have time, please do the following:

Test to see if the bug is still present with the latest version of LibreOffice from https://www.libreoffice.org/download/

If the bug is present, please leave a comment that includes the information from Help - About LibreOffice.

If the bug is NOT present, please set the bug's Status field to RESOLVED-WORKSFORME and leave a comment that includes the information from Help - About LibreOffice.

Please DO NOT

Update the version field
Reply via email (please reply directly on the bug tracker)
Set the bug's Status field to RESOLVED - FIXED (this status has a particular meaning that is not appropriate in this case)

If you want to do more to help you can test to see if your issue is a REGRESSION. To do so:
1. Download and install the oldest version of LibreOffice (usually 3.3 unless your bug pertains to a feature added after 3.3) from http://downloadarchive.documentfoundation.org/libreoffice/old/
2. Test your bug
3. Leave a comment with your results.
4a. If the bug was present with 3.3 - set version to 'inherited from OOo';
4b. If the bug was not present in 3.3 - add 'regression' to keyword

Feel free to come ask questions or to say hello in our QA chat: https://kiwiirc.com/nextclient/irc.freenode.net/#libreoffice-qa

Thank you for helping us make LibreOffice even better for everyone!

Warm Regards,
QA Team

MassPing-UntouchedBug

In , Miikka-Markus Alhonen (malhonen) wrote :

Still reproducible on:

Version: 6.3.2.2
Build ID: libreoffice-6.3.2.2-snap1
CPU threads: 4; OS: Linux 4.15; UI render: default; VCL: gtk3;
Locale: fi-FI (fi_FI.UTF-8); UI-Language: en-US
Calc: threaded

pdftotext's output is again different from my initial report but it's still garbled:

أ‬
ه‬
ن‬
ا م‬
ه‬
ت‬
ي‬
ر‬
ا اشْت‬
و ن‬
اشترى بالل خمسة آالف كتاب َ‬

This time the beginning of the sentence (found on the last line of the output) is already quite good, though ل and ا in the ligature لا are reversed. Thus on evince بالل matches بلال. The end of the sentence where there are diacritical vowel marks is worse than in my initial report.

summary: - Arabic text gets deformed when creating a PDF in LibreOffice Writer
+ [upstream] Arabic text gets deformed when creating a PDF in LibreOffice
+ Writer

In , Miikka-Markus Alhonen (malhonen) wrote :

The problem seems to have been resolved on LO 7.3.0.3 on Windows 10. To test PDF output this time, I used Adobe Acrobat DC 2021.011.20039 64-bit. I haven't tested on Linux, where the problem initially appeared.

Version: 7.3.0.3 (x64) / LibreOffice Community
Build ID: 0f246aa12d0eee4a0f7adcefbf7c878fc2238db3
CPU threads: 4; OS: Windows 10.0 Build 19044; UI render: Skia/Raster; VCL: win
Locale: fr-FR (fr_FR); UI: fr-FR
Calc: CL

In , Ilmari-lauhakangas (ilmari-lauhakangas) wrote :

Unfortunately still reproduced on Linux

Arch Linux 64-bit
Version: 7.4.0.0.alpha0+ / LibreOffice Community
Build ID: 8f2b1b1cb84e1ae3139eb90b8efdf61e608adbad
CPU threads: 8; OS: Linux 5.16; UI render: default; VCL: kf5 (cairo+xcb)
Locale: fi-FI (fi_FI.UTF-8); UI: en-US
Calc: threaded Jumbo
Built on 24 February 2022
