Poor Arabic rendering in VTE
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
gnome-terminal (Ubuntu) | Fix Released | Medium | Gunnar Hjalmarsson |
vte2.91 (Ubuntu) | New | Undecided | Unassigned |
Bug Description
VTE has a number of issues when it comes to rendering Arabic letters in the terminal, which could affect a number of languages (Arabic, Urdu, Persian... etc).
Bug 1: Any Arabic word in any VTE-based terminal is displayed choppily, with spaces between its letters, which makes it hard and sometimes impossible to read. Sometimes the letters are instead crushed together so closely that reading is impossible too.
Bug 2: If non-Arabic text and Arabic text are displayed together on the same line, the entire line gets messed up and you can't understand what is being said.
Both of these bugs can be seen from the image I attached.
I reported both of these bugs together because they are probably related to each other, so it's unlikely they can be fixed separately.
Problem can be seen in any VTE-based terminal. Here I am using GNOME Terminal 3.44.0 on Ubuntu 22.04, but it can be seen in any Ubuntu version and in any terminal version as well (it has been there since forever).
I reported the bug here instead of upstream because that's what they said at the page: https:/
Happy to provide any information you need, or to do any tests or experiments.
M.Hanny Sabbagh (mhsabbagh) wrote : | #1 |
Egmont Koblinger (egmont-gmail) wrote : | #2 |
Egmont Koblinger (egmont-gmail) wrote (last edit ): | #3 |
To be absolutely fair, I have to add this:
One thing, namely the handling of BiDi _control_ characters at the very beginning of a paragraph (logical line), remains as a TODO item both in the spec and in VTE's implementation (both of which are really nontrivial).
I just ran out of motivation and time, so it's probably waiting for someone else to fix it.
In the unlikely case that "apt"'s given message begins with such a character, and VTE chopping it off is the reason for the wrong order, then it's my fault :)
M.Hanny Sabbagh (mhsabbagh) wrote : | #4 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #5 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #6 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #7 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #8 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #9 |
Hello Egmont.
Thank you for the detailed reply and for fixing my understanding about these issues. I also highly appreciate your work and help!
I was using the default font in Ubuntu, but when I changed it to Monospace 9 (or even 12) as you suggested, the spaces issues disappeared and the text is now indeed very much more readable. I attached screenshots for Monospace 9 and 12.
I am not sure however, which Monospace font is that? I mean, to which family? In the terminal settings it just says "Monospace".
So perhaps the first thing we can derive from this to fix bug 1: Use a different default terminal font in Ubuntu for GNOME Terminal instead of the one currently being used (which seems to me to be Ubuntu Mono 12)? Maybe this could be done at least when the system language is Arabic?
I will test additional possible fonts and see how they look.
For bug 2, I did a small test for "apt" output in the terminal and also in the web browser (attached images) and in Gedit. You can see that both the RTL webpage and Gedit can display the same output text very much better than the terminal, the order of the text is changed, and displays very nicely. Perhaps the only issue is from apt's translation side (they seem to have left 1 letter untranslated which is the M letter in "MB", which is giving the uncomfortable look), but aside from that, Gedit and the web browser can display the same text without issues, unlike in the GNOME Terminal which changes the order of the text.
So to sum up, if my understanding is correct, are you saying that VTE actually supports displaying RTL/BiDi correctly, and this is just an issue in "applications" like apt, tmux, vim... etc, and they need to fix this issue from their side? I mean, it is an issue in the apps themselves, and not terminal emulators (like GNOME Terminal), correct?
Thank you so much for your help, and for your work over the years. I am trying to make a small volunteering team to report all Arabic/RTL related bugs in Linux and open source software (we have sadly so many, like tens of them!), and we are gradually starting to report and investigate any bug.
I would happily test any scenario or experiment to fix any Arabic or RTL-related bug.
P.S: It sounds like Launchpad does not allow multiple image uploads (or I didn't find it), so I had to upload images one by one, sorry for that!
M.Hanny Sabbagh (mhsabbagh) wrote (last edit ): | #10 |
Also I found the default "Monospace" used in Ubuntu. It is DejaVu Sans Mono:
mhsabbagh@
DejaVuSansMono.ttf: "DejaVu Sans Mono" "Book"
Edit: to be clear, this is the font under the name "Monospace" in the fonts chooser. But the default terminal font in Ubuntu is Ubuntu Mono like I said.
Egmont Koblinger (egmont-gmail) wrote : | #11 |
Hi M.Hanny,
Re bug 1 (rendering):
Thanks for attaching screenshots, I was too lazy to do that. Indeed this is also how the letters look to me.
It would indeed be great if Ubuntu could change its default font choice, at least for Arabic locales. I don't know what would be the best place to bring it up with Ubuntu developers (I'm not one of them), maybe a dedicated bugreport here, filed against the relevant font or i18n projects, or perhaps an i18n mailing list (I'm afraid very little attention is being paid to bugreports here).
According to my memories from many-many years ago, I think fontconfig (or some other component in picking a font) can also behave differently depending on the locale. I definitely do remember that some piece of software, probably some terminal emulator, but I can't recall which, with the same settings except for the locale, picked different fonts depending on the locale (wider one for CJK locales). So _maybe_ one way to address the problem is to modify the configuration of fontconfig or some similar underlying font component, and still calling it "Ubuntu Mono" from the terminal. Anyway, I'm not familiar enough with the topic to propose how to solve it.
Once this is fixed, I believe there's one more prominent rendering issue, and that's the lack of the lam-alif ligature. VTE doesn't support ligatures in general, maybe this one should be handled by some one-off code as an exception, but it's still unclear how to handle some corner cases (e.g. when a color change, the cursor, or a linebreak comes in between). According to my memories, the conclusion with my consultant about the Arabic language family was that the lack of lam-alif ligature is not terribly bad on typewriter-like machines, my impression was that it's frowned upon but sort of acceptable. It can be further improved at any time.
Re bug 2 (order of words):
One more important thing that didn't occur to me yesterday was the nasty issue with the "paragraph direction". It's a generic issue with RTL handling, not related to terminals per se. The problem and its consequences (different ordering of words) are briefly explained and demonstrated with the simplest possible example in my BiDi proposal under "RTL and BiDi Introduction" -> "RTL and BiDi text handling in general".
As your gedit and browser screenshots show, they most likely auto-detect the paragraph direction, which ends up being RTL. The overall right alignment suggests that this is most likely the case. The terminal, on the other hand, assumes LTR paragraph direction until explicitly told otherwise.
Try the following command:
printf '\e[?2501h'
to enable auto-detection of the paragraph direction in VTE, and then re-run "apt". Does this fix the order of the words for you in VTE? (Plus it should also right-align the affected lines, just as in the browser and gedit screenshots.)
You might create a wrapper script around your "apt" that sets this, and resets (letter "l" instead of "h", as in low/high) afterwards. Or enable it for your entire shell session, bearing in mind that it will affect the behavior of other RTL tools as well, some for better, some for worse.
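A minimal sketch of such a wrapper, assuming a POSIX shell. The function name `with_bidi_autodetect` is made up for illustration; only the two escape sequences come from this comment. It uses `\033`, the octal spelling of `\e`, because not every `printf` understands `\e`.

```shell
# Hypothetical helper: run a command with VTE's paragraph-direction
# auto-detection enabled, then switch it off again.
with_bidi_autodetect() {
    printf '\033[?2501h'    # "h" (high): enable auto-detection
    "$@"                    # run the wrapped command, e.g. apt
    _rc=$?                  # remember the command's exit status
    printf '\033[?2501l'    # "l" (low): disable auto-detection again
    return "$_rc"
}
```

It could be used as `with_bidi_autodetect apt update`. The sketch writes the sequences unconditionally, so it is only meant for use on an interactive terminal, not in pipelines.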
To be honest, it was a tough choice to ...
M.Hanny Sabbagh (mhsabbagh) wrote : | #12 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #13 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #14 |
Hi Egmont.
For bug 1, yes, fontconfig can be used for this. We even have a special configuration file at /etc/fonts/
<alias>
<family>
<prefer>
<family>DejaVu Sans Mono</family>
</prefer>
</alias>
Adding this to the file made the Monospace font change to DejaVu Sans Mono in some places (e.g Gedit), but the GNOME Terminal was still using Ubuntu Mono for some reason. I don't know why at the moment.
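One way to check what such an alias actually resolves to is to ask fontconfig directly. The sketch below assumes the `fc-match` tool from fontconfig is installed; the function name `resolve_font` is made up for illustration.

```shell
# Hypothetical helper: print the family fontconfig picks for a generic
# family name, optionally restricted to a language tag.
resolve_font() {
    if ! command -v fc-match >/dev/null 2>&1; then
        echo "fc-match not found (it ships with fontconfig)" >&2
        return 127
    fi
    # With a language the pattern becomes e.g. "monospace:lang=ar".
    fc-match -f '%{family}\n' "${1:-monospace}${2:+:lang=$2}"
}
```

For example, `resolve_font monospace ar` prints the family chosen for the generic name "monospace" when Arabic coverage is requested. It only reports what fontconfig resolves; an application that asks for a specific family such as Ubuntu Mono by name is not affected by a `monospace` alias.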
Perhaps I need to open a thread on Ubuntu Discourse. There are some i18n guys there last time I remember. I will do that and see how we can change the default font in the terminal for Arabic.
About lam-alif ligature, if you meant using the letter "alif" + letter "lam" then they can be displayed in a good shape in the terminal. But if you meant this one combined letter which has both alif and lam already together then it is indeed broken in the terminal. However, not so many Arabic people use this letter, and as a workaround, it can be written as a normal lam + normal alif in order to be displayed like the first line in my new attached screenshot. I think it's not a big deal for now.
For bug 2. Wow! Indeed as you said, it works well now and the text is displayed correctly just like in Gedit. I attached a screenshot.
Is there a way where we can use this "printf '\e[?2501h'" workaround in the GNOME Terminal in Ubuntu, at least only when the Arabic language is set for the system by default? If the user is an Arabic-based user, then it makes sense to offer this advantage for him/her in the terminal I believe.
For example, we can append this by default to ~/.bashrc:
systemlanguage=
if [ "$systemlanguage" = "ar" ]; then printf '\e[?2501h'; fi
If the system language is "ar", it will enable the auto-detection of paragraph direction in the VTE session. What do you think? I can suggest this in the Ubuntu Discourse too for the i18n guys.
Thank you Egmont for your help!
Egmont Koblinger (egmont-gmail) wrote (last edit ): | #15 |
Hi M.Hanny,
I'm so glad that you're way more familiar with fontconfig quirks as well as Ubuntu processes than me. I wish you good luck in getting some better config accepted and made default in Ubuntu!
-
Re lam-alif:
As far as I remember, and https:/
- The first (right) letter is a Lam (or Laam), U+0644, which looks similar to English J.
- The second (left) letter is an Alif, U+0627, which looks similar to English I (vertical bar); or some accented variant thereof (U+0622, U+0623, U+0625, maybe more, I don't know).
In the terminal, this shows up as if you'd imagine English "IJ" with the two letters connected at the bottom, i.e., similar to a U. As in the top row of your screenshot in comment 12.
In gedit, browser etc. they show up in a way that either more closely resembles a "y", or as a shape that reminds me of an upside-down ribbon 🎗, something like an "8" but with the upper segment cut off. As in the second row of your screenshot.
I'm glad you confirm what my consultant also said, namely that it's not terribly bad to go with the not properly ligated version.
At least, it's good enough for now. My BiDi work focused on way more fundamental things, namely to get the proper (non-reversed) ordering of letters done. Rendering of Lam-Alif can always be further improved in the terminal.
-
Re making the RTL paragraph direction (and right alignment) the default:
Unfortunately the locale subsystem (according to my latest memories) does not specify whether the script is LTR or RTL. One way is to check the language part of the locale, pretty much like you did, but listing all the known RTL scripts ("he", and I guess there are some languages that are somewhat related to Arabic that have their own code, maybe Urdu and some more??). One caveat, though, is that locale-related env vars (LANG, LANGUAGE, LC_MESSAGES, LC_ALL) override each other in a certain order of priority, so probably it's wiser to invoke the `locale` utility and find the resolved value in its output. (Fun fact: some projects use gettext as a workaround for the lack of such a field in locales. Translators have to create a "translated" string containing "ltr" or "rtl", or "1" or "0", whatever choice of the programmers, to denote the default direction.)
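A rough sketch of that idea, assuming a POSIX shell. The function names are made up, and the language list (Arabic, Persian, Hebrew, Urdu) is illustrative rather than exhaustive. Note that the `locale` utility resolves LANG, LC_MESSAGES and LC_ALL, but the GNU-specific LANGUAGE variable is not folded into the value it prints for LC_MESSAGES.

```shell
# Hypothetical helper: succeed if a locale name such as "ar_EG.UTF-8"
# belongs to a language written right-to-left.
is_rtl_locale() {
    case "${1%%[_.@]*}" in      # keep only the language part, e.g. "ar"
        ar|fa|he|ur) return 0 ;;
        *)           return 1 ;;
    esac
}

# Ask the `locale` utility for the resolved LC_MESSAGES value. The value
# is quoted when it is implied by another variable and unquoted when set
# explicitly, so the quotes are optional in the pattern.
resolved_messages_locale() {
    locale | sed -n 's/^LC_MESSAGES="\{0,1\}\([^"]*\)"\{0,1\}$/\1/p'
}

# Possible use in a shell startup file, only on an interactive terminal:
if [ -t 1 ] && is_rtl_locale "$(resolved_messages_locale)"; then
    printf '\033[?2501h'    # enable paragraph-direction auto-detection
fi
```

Whether enabling this by default is actually desirable is a separate question from how to detect the locale.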
Another problem is that as the proposal says, no autodetection is the default, and if an app switches to the other mode it should revert that setting upon quitting. Which would revert what you do in the shell startup files. (I'm not aware of any app using these control codes yet, but we should think in the long term and a few steps ahead.)
I'm not absolutely certain that an autodetected default direction is better in RTL locales. I'd guess there will be places where it's an improvement (like "apt" obviously) and places where it's less desired. At this moment I'm afraid we don't have enough knowledge and experience to see the pros and cons, more data should be collected first.
Note that according to the BiDi proposal, there are 2x2 possible values for paragraph direction (in case the terminal is asked to perform BiDi shuffling):
- LTR base dir (no autodetect...
affects: | vte (Ubuntu) → vte2.91 (Ubuntu) |
M.Hanny Sabbagh (mhsabbagh) wrote : | #16 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #17 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #18 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #19 |
Hi Egmont.
Thank you for the info.
About lam-alif: Yes, you are right. It indeed is displayed like a U in my screenshot, although it should have been similar to a "y", but as you said, it is not that terribly bad and can be read without an issue. Perhaps we can improve it in a future work!
About RTL bug:
I have searched and found the following file from you regarding the modes you described: https:/
And I tried the following modes:
- alias ltr='printf "\e[1 k"'
- alias rtl='printf "\e[2 k"'
I attached images showing how they look (also a 3rd screenshot for autodetect again).
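For repeating these experiments, the base direction and the auto-detection switch can be combined in one helper. This is only a sketch: the function name is made up, and it uses nothing beyond the sequences already quoted in this thread (`\e[1 k`, `\e[2 k`, `\e[?2501h`, `\e[?2501l`), written with `\033` for portability. How the terminal interprets each combination is defined by the BiDi proposal, not by this snippet.

```shell
# Hypothetical helper: set_paragraph_direction <ltr|rtl> [auto|fixed]
# The second argument defaults to "fixed" (auto-detection off).
set_paragraph_direction() {
    case "$1:${2:-fixed}" in
        ltr:fixed) printf '\033[1 k\033[?2501l' ;;
        ltr:auto)  printf '\033[1 k\033[?2501h' ;;
        rtl:fixed) printf '\033[2 k\033[?2501l' ;;
        rtl:auto)  printf '\033[2 k\033[?2501h' ;;
        *) echo "usage: set_paragraph_direction <ltr|rtl> [auto|fixed]" >&2
           return 2 ;;
    esac
}
```

For example, `set_paragraph_direction rtl auto` sends the RTL base direction followed by the auto-detection switch, and `set_paragraph_direction ltr` sends the LTR base direction with auto-detection off.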
The RTL base one would have been good if the user were expected to write everything 100% in Arabic. However that's not the case; normal commands are written in English but the output of the commands may be Arabic, and some strings may not be translated. You can see how the shell interpreter is displayed totally messed up if you write one line in English, and the other line in Arabic. In fact, if you try to write in English using RTL base, then you will notice that the shell interpreter is moving to the left side with you as you write! :D
I think it would be good if someone developed a command line based tool in Arabic/RTL languages, then he/she can opt for using this mode by default (because their application is 100% for RTL-based audience), but in the normal average terminal it would be a horrible user experience.
The LTR one was just like the default situation in VTE. Arabic text is not displayed well and the line gets messed up.
Please tell me if you need more tests or experiments, I would be happy to provide them.
Yes, of course we can also add Urdu and Persian and other RTL-based languages to the list from $locale, it was just a simple example.
Thank you again Egmont for taking time to do all this work and respond here.
Best.
Gunnar Hjalmarsson (gunnarhj) wrote : | #20 |
I have submitted a merge request, which changes the default font for GNOME Terminal from Ubuntu Mono to DejaVu Sans Mono if the user has selected Arabic as the display language:
https:/
A version of gnome-terminal for jammy with that patch applied is available in this PPA:
https:/
Please test and provide feedback.
Changed in gnome-terminal (Ubuntu): | |
assignee: | nobody → Gunnar Hjalmarsson (gunnarhj) |
importance: | Undecided → Medium |
status: | New → In Progress |
M.Hanny Sabbagh (mhsabbagh) wrote : | #21 |
- gnome terminal with default monospace font.png (590.5 KiB, image/png)
Thank you Gunnar, for the help!
Indeed it works well in my testing (image attached): GNOME Terminal's custom font option is disabled (so it is using the default font), Ubuntu Mono is still set in GNOME Tweaks, and yet GNOME Terminal now uses the Monospace font by default, which solves the first problem.
Thanks again for the help.
Best.
Gunnar Hjalmarsson (gunnarhj) wrote : | #22 |
Thanks for testing and confirming! I have uploaded the gnome-terminal change to lunar.
Changed in gnome-terminal (Ubuntu): | |
status: | In Progress → Fix Committed |
Launchpad Janitor (janitor) wrote : | #23 |
This bug was fixed in the package gnome-terminal - 3.46.7-1ubuntu2
---------------
gnome-terminal (3.46.7-1ubuntu2) lunar; urgency=medium
* Use DejaVu as system font if LANG is Arabic (LP: #2002290)
-- Gunnar Hjalmarsson <email address hidden> Wed, 25 Jan 2023 20:14:38 +0100
Changed in gnome-terminal (Ubuntu): | |
status: | Fix Committed → Fix Released |
M.Hanny Sabbagh (mhsabbagh) wrote : | #24 |
Thank you Gunnar for fixing the font issue in gnome terminal.
So now, we can say that bug 1 is fixed. The only thing that remains is related to bug 2 and the RTL text auto-detection in VTE. I am yet to hear from Egmont on anything we can do in this regard.
I hope we can find a solution to use the auto-detection feature by default in VTE, so that the RTL issue gets fixed not just for Arabic but for many other languages as well, and not just in GNOME Terminal but basically in any terminal that uses VTE.
Happy to provide any tests or do any experiments you would like.
Best!
Gunnar Hjalmarsson (gunnarhj) wrote (last edit ): | #25 |
Hi M.Hanny Sabbagh!
Even if bug 1 is fixed, I have rewritten the patch after feedback from one of the Debian developers. Currently it looks like this:
One change is that it now honors the font size set in Tweaks, so while the family may be manipulated if LANG is an Arabic locale, the size stays unchanged.
Another change is that it now sets "Monospace" instead of specifying "DejaVu Sans Mono" explicitly. Both you and I think that it still results in "DejaVu Sans Mono", but I'm not 100% sure after some tests of my own yesterday.
It would be great if you could test again, and check if it still handles the rendering of Arabic script as expected. For your convenience I have applied the new patch in my PPA for 22.04:
https:/
Gunnar Hjalmarsson (gunnarhj) wrote : | #26 |
On 2023-02-03 10:46, Gunnar Hjalmarsson wrote:
> https:/
I got more feedback, this time in the form of a code review, and it resulted in yet another version:
> Another change is that it now sets "Monospace" instead of specifying
> "DejaVu Sans Mono" explicitly.
That is no longer true. The latest version, which is applied in the PPA in version 3.44.0-
My plea for another test still stands. ;)
Egmont Koblinger (egmont-gmail) wrote : | #27 |
> The only thing that remains is related to bug 2 and the RTL text auto-detection in VTE. I am yet to hear from Egmont on anything we can do in this regard.
Both the autodetection "on" and "off" values have pros and cons. I don't think either one is better per se than the other. One is better in some circumstances, the other is better in others.
For complete sentences that are mostly in a single direction, occasionally with an embedded word of foreign directionality, autodetection is usually better.
For a list of items, some of which might be of a foreign directionality (e.g. "ls -1"), aligning all of them in the same way (autodetection off) is better.
For users who input some text in their native language, occasionally containing text of a foreign directionality, autodetection might mess things up if that word happens to be at the beginning, and again, a fixed overall directionality (autodetection off), matching the main language, is presumably better.
When it comes to designing a graphical app or a webpage, these are decided on a case-by-case basis. I can't see how a terminal could magically _solve_ this and provide a solution that's good enough for everyone. It provides a _platform_ where apps can pick which of the two behaviors they wish to have.
To make it more complicated, terminal emulation is an utter mixture of English vs. one's native language. Some pieces of text are English by their nature, some apps are not (or not fully) translated to Arabic (Persian, Hebrew...), etc.
Also, there are multiple use cases to take into account. One using their system in Arabic and encountering quite a few English words, or fully English utilities, is one thing. One using their system in English and encountering a few embedded Arabic words is another.
--
I'm afraid at this point we don't have anywhere near enough data to justify flipping the default, even though picking the default was admittedly a somewhat arbitrary decision on my part, with my overall impression being that the current default behavior might be better for users on average. Since I cannot speak/read/write any of the RTL languages/scripts, the decision might have been biased towards pure LTR environments (i.e. not to have random lines which happen to contain an RTL piece of text be right-aligned). After months of research and work on the topic of RTL in terminals, I did not have a strong stance on this.
Technically, you can flip the default, or create that shell script snippet that does this. Would this be a good solution? I'm afraid not, it would just probably make things more complicated, as the implementation would diverge from the proposed standard, multiple implementations would diverge from each other, app developers wouldn't be sure what to do.
The topic needs further research. It should evaluate which behavior is better under which circumstances, taking into account a wide range of apps, use cases. It should study both when a basic LTR environment has scattered RTL words in it (including the case of dumping binary data), and when a basic RTL environment has occasional LTR words (and numbers etc.).
Very importantly, the decision should heavily take into ...
M.Hanny Sabbagh (mhsabbagh) wrote : | #28 |
- gnome-terminal new monospace.png (636.5 KiB, image/png)
@Gunnar:
Thank you for the continuous help. I have tested the new update and indeed it now respects the font size of the Monospace font in GNOME Tweaks (image attached).
Thanks again!
M.Hanny Sabbagh (mhsabbagh) wrote : | #29 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #30 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #31 |
M.Hanny Sabbagh (mhsabbagh) wrote : | #32 |
M.Hanny Sabbagh (mhsabbagh) wrote (last edit ): | #33 |
@Egmont:
Thank you for the valuable feedback.
First I would like to ask you to forgive me if some of my words are not clear or vague; I am not a native English speaker so it is a little bit hard when I write long responses like this. I hope I am correctly delivering my ideas to you and others.
I just would like to confirm again that my proposal is related to changing the default text behavior in VTE *only for Arabic/RTL-based languages*. Languages like Arabic, Persian, Urdu, Hebrew... etc and only these. Other users of English, French, Italian and other languages do not need to endure this change (because they do not need autodetection or RTL text).
Hence, we are talking about a change that would not affect 95% of the user base.
I understand your concern and that we may still need to do further testing. And I will work on providing more test cases for various applications. Still, I wholeheartedly believe that any Arabic/RTL user would be quite happy with this proposed change because it makes the RTL text rendering so much better regardless of the application, terminal emulator or Linux distribution used. (I will provide more of these examples).
Ideally, there shouldn't be text in other languages when the system language is set to Arabic. E.g. the output of "apt" should be fully translated to Arabic, and the interface text for "nano" should be translated too, which ultimately should result in a much better UX for Arabic users. That is, the entire command line interface should be translated to Arabic and support RTL text. Everything from Python/C/Rust syntax errors or outputs, all the way up to tools like apt, nano, vi or emacs. This is a very long journey and it may take many years until we reach that point.
However, in order to get there, we need to start somewhere. One of the main reasons perhaps why no Arabic users provided feedback all these years is that Arabic or RTL support needs a huge investment in terms of effort to make it feasible for the end user. And so far no one stepped up to do that effort. We need efforts in both the fields of translation + RTL support to make this happen. Sadly, we are missing on both.
I believe if we succeed in supporting RTL applications in the terminal and fixing the bugs we are currently facing, then this could open the door for a wide range of CLI-based computing for RTL languages speakers. We could have command line tools that fully support these languages (even RTL TUIs!), instead of just English as we have today. We could finally print correctly formatted Python error messages in Arabic instead of being afraid to translate it; because we know it wouldn't render correctly in the LTR terminal.
I want to add a few additional reasons on why feedback is small regarding your work on RTL:
- I honestly hadn't heard about it, although I have been invested in Linux and open source software for 13 years in the Arabic Linux communities. I was surprised to learn that RTL support is already there in VTE. I think I once came across a social media post about it in 2020, but because of Coronavirus and how all of our lives were messed up, I just couldn't give it enough attention at that time.
- The pol...
Gunnar Hjalmarsson (gunnarhj) wrote : | #34 |
Thanks for the latest test, M.Hanny Sabbagh. A rewritten patch is now on its way to the coming Ubuntu 23.04.
Egmont Koblinger (egmont-gmail) wrote (last edit ): | #35 |
Hi M.Hanny,
Thanks a lot for spreading the word about BiDi support in VTE!
Really no need to apologize about your English! I'm not a native English speaker either, and your English is at least as good as mine. We have no communication issues at all!
---
> Ideally, there shouldn't be a text in other languages when the system language is set to Arabic. E.g the output of "apt" should be fully translated to Arabic [...] This is a very long journey and it may take many years until we reach that point.
I disagree with you in this point. There will always be English text in the output of "apt", such as package names, I don't think they will ever be translatable (and would arguably be a mistake to go in this direction). For rapidly changing data, such as package descriptions, it's practically impossible for all translations to be fully up-to-date all the time. Think of software like dmesg printing system logs, software printing hardware information (such as identification strings of various components), think of software showing system directories' and files' names etc. Think of commands that print whois data, show a mysql table along with the column names, show source code etc. Think of ssh'ing to a remote host. Think of the long tail of utilities you find on the web that someone quickly hacked together but there's no demand to get translated to languages other than English.
Although not relevant to the world of terminal emulation, I've just checked the completeness of the Arabic translation of the GNOME desktop at https:/
You'll never have a terminal emulation experience that is fully in a foreign (I mean non-English, not necessarily RTL) language, for two reasons. One is that due to the very nature of things there'll always be tons of English stuff in the terminal, the second is that it's a utopic and reasonably unreachable dream to have all software's UI strings be fully translated to all languages. A 100% fully Japanese, or fully Hungarian, or fully Arabic experience in terminal emulators, no English word whatsoever, is neither "ideal" nor in my opinion is reachable in our lives.
> However, in order to get there, we need to start somewhere. [...] this could open the door for a wide range of CLI-based computing for RTL languages speakers [...]
I agree with this one, and I hope I could make an important step here.
> The political factor is also important. I don't want to talk about politics here of course
I don't want to talk about politics either. My work was driven by getting closer to equality for people, no matter if I agree or disagree with certain things they do; as well as by the technical challenge itself. That being said, you raised excellent points here.
> The dominant majority of Arabic users at least are using English as a system-wide language. When I made polls asking them why, they say that there are many bugs and problems in Arabic support in general on Linux [...]
I fully understand this.
Prior to my work, nobody had an idea how to do BiDi in terminals, nobody saw the whole picture. As I show in my document, everyone who thought they knew how to ...
Egmont Koblinger (egmont-gmail) wrote : | #36 |
Re "nano" with LTR vs. autodetected directionality:
The LTR screenshot is more obviously "broken" (or at least undesirable). The autodetected directionality's brokenness is less obvious; maybe no breakage is visible in this particular screenshot, but it is still broken.
Maybe it looks pretty much okay with fully RTL text. But try to edit some BiDi text where the text sporadically contains some English words, or numbers. Scroll the long lines horizontally, so that these words and numbers just scroll in or out, or just cross the terminal's edge. You'll see it breaking here and there. Combine it further with nano showing the line numbers (Alt-N), I expect more breakages. Edit some file that contains pairs of ( ) or < > characters, you'll see that the way they are mirrored or not is occasionally broken, and (I think although I haven't tried) even nano's reverse video < > signs at the margin that denote that the line continues might join this game in faulty ways.
In my BiDi proposal, in the section "Why terminals are a truly special story" I argue that fullscreen text viewers/editors (such as nano, vi, emacs, mcedit, joe...) only send partial information to the terminal (only parts of the entire text) and it's impossible to run the BiDi algorithm on partial data. I prove that it's literally impossible to achieve proper BiDi text editing user experience if performing BiDi is the job of the terminal emulators. Hence, in accordance with ECMA TR/53, in such cases it has to be the application (e.g. nano) that runs the BiDi algorithm and the terminal must not shuffle the characters. The terminal can be programmatically switched to this so-called "explicit mode" using the BDSM escape sequence.
The only way to have a proper BiDi-editing experience in nano is if nano adds BiDi support, meaning that on one hand it runs the Unicode Bidirectional Algorithm [UBA] itself, on the other hand it asks the terminal not to do that.
In "implicit mode", where the terminal runs the UBA, even if combined with automatic detection of paragraph direction and alignment, if you try to have the typical text editing user experience then there will be numerous fundamentally unfixable bugs. The only way to fix them is to use the terminal's "explicit mode" instead.
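For reference, switching between the two modes from a shell could look like the sketch below. The exact bytes are an assumption that is not stated in this thread: it takes BDSM to be ECMA-48 mode 8, with "h" (set) selecting implicit mode and "l" (reset) selecting explicit mode. The function names are made up, and the wrapper only makes sense for an application that runs the UBA itself but does not yet emit the sequence on its own.

```shell
# Assumption: BDSM is ECMA-48 mode 8; reset ("l") = explicit mode,
# set ("h") = implicit mode.
bidi_explicit_mode() { printf '\033[8l'; }   # application performs BiDi
bidi_implicit_mode() { printf '\033[8h'; }   # terminal performs BiDi

# Hypothetical wrapper: run a BiDi-aware application in explicit mode and
# return the terminal to implicit mode afterwards.
run_in_explicit_mode() {
    bidi_explicit_mode
    "$@"
    _rc=$?                  # keep the application's exit status
    bidi_implicit_mode
    return "$_rc"
}
```

Ideally the application emits these sequences itself on startup and exit, so no wrapper is needed.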
The single most important bit from my BiDi work is (reinforcing ECMA TR/53's realization) that it's either the terminal (and not the application) or the application (and not the terminal) that needs to perform the UBA, and that both of these modes are required (for some utilities like "apt" only the "implicit mode" is reasonably viable, for some other apps like "nano" only the "explicit mode" is usable), and that it has to be programmatically switchable. Prior to this work from me (both the spec and the implementation in VTE) all the terminals I'm aware of either only implemented "this" mode (making it impossible to run BiDi apps that required "that" mode), or only implemented "that" mode (making it impossible to run BiDi apps that required "this" mode), or asked the user to choose the behavior (an utterly unacceptable user experience to have to manually switch back and forth, let alone so extremely frequently), and ...
Egmont Koblinger (egmont-gmail) wrote : | #37 |
Re "cat" (which should rather be "ls -1") with LTR vs. autodetected directionality:
Let's use the convention here that uppercase letters are fake Arabic, e.g. imagine that the word [written as LTR here] "ARABIC" is a valid Arabic word which is supposed to visually appear as "CIBARA".
Let's have two files, called "english.pdf" and [written as LTR here] "ARABIC.pdf". Let's run "ls -1".
First, let's run it in a fully LTR environment. "english.pdf" should appear like this, left-aligned, it's obvious. What's the best rendering of the other filename? Should it look like "CIBARA.pdf" or "pdf.CIBARA"? Should it be aligned to the left or right edge of the terminal? Why? I, who can't read Arabic, would argue that a left-aligned "CIBARA.pdf" would be the least confusing for me (i.e. I prefer the look in your LTR screenshot), but others, especially those who do read Arabic, might disagree with me and prefer some other rendering. What is your take on this?
Now, let's run it in a fully RTL environment, e.g. the user has set up as much as reasonable to RTL and most of the filenames are RTL, but an English one, and an Arabic one with an English extension, have sneaked in. What output would you prefer from "ls -1" and why? Should the English filename appear as "english.pdf" or "pdf.english", and should it be aligned to the left or the right edge of the terminal? And the same question as in the previous paragraph: Should the Arabic filename look like "CIBARA.pdf" or "pdf.CIBARA", and should it be aligned to the left or right edge?
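To play with these questions, here is a toy Python sketch that follows the uppercase-is-fake-Arabic convention. It is not the UBA: uppercase letters are treated as RTL, other letters and digits as LTR, and anything else as a neutral that takes the direction of its neighbours if they agree and the paragraph direction otherwise. The names toy_visual and render_line are invented for this illustration.

```python
from itertools import groupby


def toy_visual(logical: str, rtl_paragraph: bool) -> str:
    """Reorder a logical string into visual order under toy rules (not the real UBA)."""
    base = "R" if rtl_paragraph else "L"
    # Strong types: uppercase = fake Arabic (RTL), other alphanumerics = LTR.
    strong = ["R" if c.isupper() else "L" if c.isalnum() else None for c in logical]
    kinds = list(strong)
    # Neutrals (e.g. "."): take the neighbours' direction if both sides agree,
    # otherwise fall back to the paragraph direction.
    for i, kind in enumerate(strong):
        if kind is None:
            before = next((k for k in reversed(strong[:i]) if k), base)
            after = next((k for k in strong[i + 1:] if k), base)
            kinds[i] = before if before == after else base
    # Group into directional runs; characters inside an RTL run are mirrored.
    runs = []
    for kind, group in groupby(zip(kinds, logical), key=lambda pair: pair[0]):
        text = "".join(ch for _, ch in group)
        runs.append(text[::-1] if kind == "R" else text)
    # In an RTL paragraph the runs themselves are laid out right to left.
    if rtl_paragraph:
        runs.reverse()
    return "".join(runs)


def render_line(logical: str, rtl_paragraph: bool, width: int) -> str:
    """Align the visual string to the left (LTR) or right (RTL) edge."""
    visual = toy_visual(logical, rtl_paragraph)
    return visual.rjust(width) if rtl_paragraph else visual.ljust(width)


if __name__ == "__main__":
    for rtl in (False, True):
        print("RTL paragraph:" if rtl else "LTR paragraph:")
        for name in ("english.pdf", "ARABIC.pdf"):
            print("|" + render_line(name, rtl, 20) + "|")
```

Under these toy rules "ARABIC.pdf" comes out as "CIBARA.pdf" in an LTR paragraph and as "pdf.CIBARA" in an RTL paragraph, while "english.pdf" keeps its order in both and only its alignment changes. That is just one of the candidate renderings discussed here, not a claim about which one is best.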
What output format would you expect from regular "ls" (using multiple columns), given a mixture of plenty of similarly crafted filenames? What output format would you expect from "ls -l"?
Plenty of questions, plenty of possible answers. E.g. for "ls -1" in an RTL environment, I asked two questions with 4 possible answers each; that's 16 possible combinations. Many of them are obviously bad, but still, there probably won't be a clear winner. I'd guess there will be 2 or 3 candidates, with some subjective weighting between them.
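For anyone who wants to look at all 16 candidates side by side, a small Python sketch can print them as mockups. The terminal width of 20 columns and the variable names are arbitrary choices for this illustration.

```python
from itertools import product

WIDTH = 20  # arbitrary mock terminal width

english_forms = ["english.pdf", "pdf.english"]
arabic_forms = ["CIBARA.pdf", "pdf.CIBARA"]
aligns = {"left": str.ljust, "right": str.rjust}


def candidates(forms):
    """All (rendering, alignment) answers for one filename: 2 x 2 = 4."""
    return list(product(forms, aligns))


# Two questions with 4 answers each give 4 x 4 = 16 combinations.
combos = list(product(candidates(english_forms), candidates(arabic_forms)))

if __name__ == "__main__":
    for n, ((eng, eng_side), (ara, ara_side)) in enumerate(combos, 1):
        print(f"--- candidate {n}: english {eng_side}, arabic {ara_side}")
        print("|" + aligns[eng_side](eng, WIDTH) + "|")
        print("|" + aligns[ara_side](ara, WIDTH) + "|")
```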
Once we have an answer to these questions, the next question should be: whose responsibility is it to implement that behavior? Is it the terminal's, or ls's, or a cooperation between the two? Finding the good answer will require deep understanding of the problem space and examining many other utilities as well. You can't just conclude from the output of 1 particular utility using 1 particular command line switch that enabling autodetection would be an overall win across thousands of use cases.
(I am the one who designed [1] and implemented RTL (right-to-left) and BiDi (bidirectional) text support in VTE.)
The two issues you report here are totally independent.
Re bug 1:
Terminal emulators, by their very nature and their legacy of maybe ~50 years, _have to_ operate in a strict rectangular grid of character cells. If you try to break out of this grid, you break pretty much everything.
Sticking to such a grid has quite a few advantages and quite a few disadvantages. The visual disadvantages are more prominent with scripts that connect the letters to each other, such as Arabic.
I'm sure that there's room for improvement in rendering, but it probably doesn't belong in VTE. Or maybe it belongs in VTE only to the extent of switching to a different font rendering engine (e.g. from freetype to harfbuzz; there's an upstream bug about it).
However, by the very nature of the grid layout, no rendering engine could perform magic and end up with a beautiful rendering if it starts with a font that doesn't have the letters of the desired width.
Long story short: You'll need to find a high quality monospace Arabic font. Or, in fact, one where the English and the Arabic letters all have the same width (or somehow merge two such fonts, an English and an Arabic one, via fontconfig; I'm not at all familiar with how to do that).
For testing, I happened to use a font where the Arabic text didn't look anywhere near as bad as in your screenshot. I probably used "Monospace 9", but I don't know if this font itself contains Arabic, or if the glyphs were substituted from another font, and if so then which one. Comparing the layout to that of, let's say, web browsers (which don't have the fixed-width constraint), and bearing in mind that I cannot read Arabic, I am confident in saying that the rendering was way better than the one in your screenshot. At the very least, letters were connected or not connected exactly as in the browser, and the overall look was also reasonably close.
So keep looking for the font that works for you.
Re bug 2:
This one is not about joining or not joining adjacent letters; this one is about figuring out how to shuffle the order of the character cells to make sure that words and sentences aren't "sdrawkcab" (backwards).
Pretty much every terminal behaves differently when it comes to RTL or BiDi text. That is, unfortunately it is literally impossible for an app to emit RTL or BiDi text and expect it to appear correctly in the terminal. Also, some applications have different requirements of the terminal than others.
Overly simplified story: Some apps need to emit logical order and expect the terminal emulator to rearrange the cells. If the terminal doesn't rearrange, the output will be "nekorb" (broken). Some other apps need to reorder the cells themselves, and if the terminal also reorders them then the output will be, again, "nekorb". No matter which approach a terminal emulator picks (i.e. to rearrange or not to rearrange according to the BiDi algorithm), one set of the applications has no chance of implementing RTL. Or, rather, no application could ever implement proper RTL, because they could not tell which kind of terminal they operate on. (Fun: ...