dash interpreter doesn't handle some Unicode characters correctly

Bug #422298 reported by ZelinskiyIS
This bug affects 8 people
Affects: dash (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: dash

The bug occurs, in particular, with the Cyrillic letters "ш" and "с".
To reproduce, open a terminal, cd /tmp (or anywhere else; it doesn't matter), and type
$ dash -c "name=прстуфхцчшщ;echo \$name > \$name"
$ ls

One would expect a file named прстуфхцчшщ to be created, with contents matching its name. Instead, ls shows a file named пр?туфхцч?щ. The file itself contains the correct string прстуфхцчшщ.

The wrong filename hexdumps to the following:
0000000 bfd0 80d1 d1d1 d182 d183 d184 d185 d186
0000010 d187 89d1 000a
On the other hand, the correct filename would hexdump to:
0000000 bfd0 80d1 81d1 82d1 83d1 84d1 85d1 86d1
0000010 87d1 88d1 89d1 000a
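
The two dumps can be compared byte by byte. The sketch below is only an illustration: the helper name `hexdump_words_to_bytes` is made up here, and it assumes hexdump's default output of 16-bit little-endian words (as on x86) with a zero byte padding an odd-length input. It decodes both dumps and checks that the wrong name is exactly the correct UTF-8 name with the bytes 0x81 and 0x88 missing.

```python
def hexdump_words_to_bytes(words):
    """Turn hexdump's default 16-bit little-endian words back into bytes."""
    out = bytearray()
    for word in words.split():
        value = int(word, 16)
        out.append(value & 0xFF)   # low byte comes first in the input
        out.append(value >> 8)     # high byte comes second
    if out and out[-1] == 0:       # assumed to be padding of an odd-length input
        out.pop()
    return bytes(out)

wrong = hexdump_words_to_bytes(
    "bfd0 80d1 d1d1 d182 d183 d184 d185 d186 d187 89d1 000a")
correct = hexdump_words_to_bytes(
    "bfd0 80d1 81d1 82d1 83d1 84d1 85d1 86d1 87d1 88d1 89d1 000a")

expected = "прстуфхцчшщ\n".encode("utf-8")
stripped = bytes(b for b in expected if b not in (0x81, 0x88))

print(correct == expected)   # the "correct" dump is the plain UTF-8 name
print(wrong == stripped)     # the "wrong" dump is the same name minus 0x81/0x88
```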

If bash is used instead of dash, the problem is not present.
The bug is present on both jaunty x86 and jaunty x86_64.

The bug is significant because "dash" is the default "sh" interpreter on these systems and is used by the system("...") function. In particular, I found the bug while debugging my automatic Python file-converting script, which failed on files with Cyrillic names containing "с" and "ш".
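
A script of this kind can sidestep the problem by not handing the redirection to `sh` at all. The following is a hedged sketch, not the actual converting script: the function name `run_to_file` and the sample child command are invented for illustration. It passes the command as an argument list, so no shell parses it, and opens the output file from Python instead of using ">".

```python
import os
import subprocess
import sys
import tempfile

def run_to_file(argv, path):
    """Run argv without a shell and write its stdout to path."""
    with open(path, "wb") as out:
        # With a list argv and the default shell=False, /bin/sh never sees the name.
        subprocess.run(argv, stdout=out, check=True)

name = "сшуЁ"
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, name)
# Hypothetical stand-in for the real converter: a child Python printing one line.
run_to_file([sys.executable, "-c", "print('converted')"], target)
print(os.path.exists(target))
```

This only avoids the shell for the redirection step; commands that genuinely need shell syntax are still affected.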
-----------------------------------------------------------------
ivze@ubuntu-laptop:/tmp$ lsb_release -rd
Description: Ubuntu 9.04
Release: 9.04

Dash package version: 0.5.4-12ubuntu2

ZelinskiyIS (ivze) wrote :

Further tests revealed that the same issue occurs with the capital letter "Ё".

ZelinskiyIS (ivze) wrote :

The source of the bug has been found.
In dash-0.5.4, at expand.c:240, there is a line
        rmescapes(p);
Following the macro shows that it simply trashes the characters 129 and 136 ("\201\210" in octal).
In the UTF-8 representation of the letters сшЁ (two bytes per letter), the second byte is exactly one of these two:
$ echo сшЁ|hexdump -b
0000000 321 201 321 210 320 201 012
0000007
That's what causes the bug. The bug is UTF-8 specific: if KOI-8 were used for Cyrillic (as it is in Debian), there would be no such bug.
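
To see which letters are at risk, one can look for characters whose UTF-8 encoding contains either of the two bytes. The snippet below is an illustrative sketch only: the function name `has_trashed_byte` is made up, and it models nothing of dash beyond the two byte values named above. It scans the Cyrillic block U+0400 to U+04FF.

```python
TRASHED = {0o201, 0o210}  # bytes 129 and 136, as named above

def has_trashed_byte(ch):
    """True if the UTF-8 encoding of ch contains byte 129 or 136."""
    return any(b in TRASHED for b in ch.encode("utf-8"))

# Every code point in the Cyrillic block whose encoding contains such a byte.
affected = [chr(cp) for cp in range(0x0400, 0x0500) if has_trashed_byte(chr(cp))]

print(["U+%04X" % ord(ch) for ch in affected])
print(all(has_trashed_byte(ch) for ch in "сшЁ"))  # the letters from the report
print(has_trashed_byte("у"))                      # a letter that survives
```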

Raúl Núñez de Arenas Coronado (dervishd) wrote :

This bug is the same as (or at least related to) #382187.

I have the same problem with the character "∕", which is U+2215 (UTF-8: 0xE2 0x88 0x95), but dash turns it into 0xE2 0x95.

ZelinskiyIS (ivze) wrote :

Although the source of the two bugs (this one and the duplicate) has been found (dash strips the bytes 0x81 = 0o201 and 0x88 = 0o210 from the target of a ">" redirection), I can't fix it myself: I am not sure what the right change is, or how to make sure it won't affect other dash features.

This bug needs attention from someone with such powers.

ZelinskiyIS (ivze) wrote :

The bug is confirmed in Karmic.

Steps to reproduce:
1) Open a terminal
2) # cd /tmp; mkdir test; cd test
3) # sh -c "name=сшуЁ; echo \$name > \$name"
4) # ls
5) See a new file "??у?" with a garbled name, but with the proper string "сшуЁ" inside.

ZelinskiyIS (ivze) wrote :

The bug is still present in 10.04 Lucid Lynx beta 1.
Dash package version: 0.5.5.1-3ubuntu1.

Alexander Korolkov (telgnik) wrote :

Looks like I've fixed this bug (see attached patch).

Some characters in dash are escaped for internal use (сшуЁ in UTF-8 = d1 81 d1 88 d1 83 d0 81 is translated to d1 81 81 d1 81 88 d1 81 83 d0 81 81), then unescaped. But redirection-to-file strings are treated differently and are unescaped twice (d1 81 81 d1 81 88 d1 81 83 d0 81 81 -> d1 81 d1 88 d1 83 d0 81 -> d1 d1 d1 83 d0) in the expandarg() function in src/expand.c (first in argstr(), then again in rmescapes()).
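
The byte sequences above can be reproduced with a small model. This is a simplified sketch in Python, not dash's actual C code: `escape` and `unescape` are hypothetical names, the escaped range 0x81 to 0x88 is an assumption chosen to match the bytes quoted above, and treating 0x88 as a marker that unescaping discards is likewise an assumption that fits the observed result.

```python
ESC = 0x81                     # escape byte seen in the sequences above
ESCAPED = range(0x81, 0x89)    # assumption: bytes that get an ESC prefix
DROPPED = 0x88                 # assumption: marker byte discarded on unescape

def escape(data):
    """Prefix every byte in the assumed special range with ESC."""
    out = bytearray()
    for b in data:
        if b in ESCAPED:
            out.append(ESC)
        out.append(b)
    return bytes(out)

def unescape(data):
    """Remove ESC prefixes (keeping the byte after them) and drop markers."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ESC:
            i += 1                     # skip the escape byte itself
            if i < len(data):
                out.append(data[i])    # keep the escaped byte verbatim
                i += 1
        elif data[i] == DROPPED:
            i += 1                     # unescaped marker byte disappears
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

def show(data):
    return " ".join("%02x" % b for b in data)

name = "сшуЁ".encode("utf-8")
internal = escape(name)
once = unescape(internal)      # the normal, correct round trip
twice = unescape(once)         # the extra pass applied to redirection targets

print(show(internal))
print(show(once))
print(show(twice))
```

In this model a single unescape restores the original name, while the second pass treats the legitimate UTF-8 continuation bytes as control bytes and eats them.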

By the way, the EXP_REDIR flag is used only for this special treatment of redirection-to-file strings. I could not find its origin: the oldest commit at git://git.kernel.org/pub/scm/utils/dash/dash.git already has it.

Jilles Tjoelker (jilles) wrote :

This patch has been pushed to dash.git. However, applying it to an older version of dash could be useful so people can take advantage of this fix without the possible instability and possible subtle incompatibilities of a new dash version.

reuben (reuben-ugcs) wrote :

This bug still affects Ubuntu 11.10. It's rather disappointing that two years after the initial bug report, Ubuntu's default shell is still unable to handle file names in common languages such as Russian. It's especially insidious, as many programs that use shell scripts internally will fail in unpredictable ways.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dash (Ubuntu):
status: New → Confirmed
Wladimir Mutel (mwg) wrote :

This bug is more than 2 years old. Any hope of getting it fixed in Precise?
