Default charsets handling for Windows archives in CJKV+th locale
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
unzip (Debian) | Fix Released | Unknown | |
unzip (Ubuntu) | Triaged | Medium | Unassigned |
Bug Description
With the current unzip package in Ubuntu, the charset must be specified explicitly to extract zip files sent from localized Windows systems.
For example, for zip files sent from a Japanese-localized Windows system:
$ zipinfo -O CP932 sent-from-
$ unzip -O CP932 sent-from-
This method does not work for GUI applications such as file-roller, because users have no way to specify a charset from the GUI.
The attached branch adds default charset handling for Windows archives in CJKV+th locales, inspired by the approach taken in Ubuntu Kylin.
As a result of bug #580961, two options were added as an Ubuntu patch:
> -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
> -I CHARSET specify a character encoding for UNIX and other archives
Ubuntu Kylin then set a default encoding through environment variables in their distribution:
http://
In Ubuntu we can go a step further and do this in a better way:
- per-user settings derived from each user's locale instead of a global setting
- a locale guard, so that other locales are not affected
Only "-O" is added, so there is no behavior change for zip files created on Ubuntu or other Linux/UNIX systems. This branch only makes zip files created on localized Windows systems extract seamlessly.
The charset list is taken from:
https:/
and
msdos/msdos.c in unzip package:
case 932: /* Japanese */
case 949: /* Korean */
case 936: /* Chinese, simple */
case 950: /* Chinese, traditional */
case 874: /* Thai */
case 1258: /* Vietnamese */
(Copied from @nobuto's branch description.)
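The per-user, locale-guarded default described above could look roughly like the sketch below. It is not the script from the attached branch. It assumes the defaults are passed through the `UNZIP` and `ZIPINFO` environment variables that unzip and zipinfo read for default options, and that the patched unzip accepts `-O CHARSET` in that form. The helper name `fallback_charset` and the locale patterns are assumptions made for illustration; only the code page numbers come from the list above.

```shell
#!/bin/sh
# Hypothetical helper: print the fallback code page for a locale name,
# or nothing for locales outside the CJKV+th set (the "locale guard").
# The locale patterns are illustrative assumptions.
fallback_charset() {
    case "$1" in
        ja_JP*) echo CP932 ;;   # Japanese
        ko_KR*) echo CP949 ;;   # Korean
        zh_CN*) echo CP936 ;;   # Chinese, simplified
        zh_TW*) echo CP950 ;;   # Chinese, traditional
        th_TH*) echo CP874 ;;   # Thai
        vi_VN*) echo CP1258 ;;  # Vietnamese
        *) ;;                   # any other locale: print nothing
    esac
}

# Effective locale: LC_ALL overrides LC_CTYPE, which overrides LANG.
charset=$(fallback_charset "${LC_ALL:-${LC_CTYPE:-${LANG:-}}}")

# Set defaults only when a charset was found, and keep any value the
# user has already put in UNZIP or ZIPINFO.
if [ -n "$charset" ]; then
    : "${UNZIP:=-O $charset}"
    : "${ZIPINFO:=-O $charset}"
    export UNZIP ZIPINFO
fi
```

Because only `-O` is set, archives that unzip identifies as created on UNIX would still be handled as before, and a user in, say, an en_US locale gets no environment change at all.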
Related branches
- Mathieu Trudel-Lapierre: Needs Information
- Sebastien Bacher: Needs Information
- Aron Xu (community): Approve
- Diff: 65 lines (+42/-0), 3 files modified
  - debian/changelog (+7/-0)
  - debian/profile.unzip-default-charset.sh (+32/-0)
  - debian/rules (+3/-0)
- Steve Langasek: Needs Fixing
- Aron Xu: Pending requested
- Diff: 147 lines (+106/-0), 6 files modified
  - debian/changelog (+9/-0)
  - debian/control (+1/-0)
  - debian/tests/control (+2/-0)
  - debian/tests/fallback-encoding (+57/-0)
  - debian/unzip-fallback-charset.sh (+36/-0)
  - debian/unzip.install (+1/-0)
Changed in unzip (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
description: updated
Changed in unzip (Debian):
status: Unknown → Confirmed
Changed in unzip (Debian):
status: Confirmed → Fix Released
Additional background:
On Windows, file names are stored in a locale-specific encoding for CJKV+th locales, and the ZIP archive does not record which encoding was used. When such an archive is extracted on a system that uses a different encoding (e.g. UTF-8 on Linux), the file names come out garbled and the unzip command replaces the undecodable characters with ???. In practice no algorithm can detect the encoding reliably, and file names are usually too short to give a detector enough data (unlike the pages a browser sees).
The upstream solution to this problem is documented in bug #580961, but it is not something ordinary users can apply directly. Hence we are adding a default -O switch, which specifies the encoding for archives created on Windows, as a locale-based hack in the distribution.
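The garbling described in the background above can be reproduced without a zip file. The sketch below uses `iconv` rather than unzip, and the helper names `raw_name` and `decode_name` are made up for the illustration. It takes the four bytes that a Japanese Windows system would store for the file name "日本" in CP932 and decodes them with an explicitly named charset, which is in effect what `-O CP932` tells unzip to do.

```shell
#!/bin/sh
# Bytes of the file name "日本" in CP932 (0x93 0xFA 0x96 0x7B), written
# as octal escapes so the script itself stays plain ASCII.
raw_name() { printf '\223\372\226\173'; }

# Decode the stored bytes from the charset given in $1 into UTF-8.
decode_name() { raw_name | iconv -f "$1" -t UTF-8; }

# With the right source charset the original name comes back.
decode_name CP932
echo
```

Nothing in the stored bytes says they are CP932, so a tool that assumes another encoding has no way to recover the name on its own.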