Comment 5 for bug 374807

Dmitry (rusdmitry) wrote :

This is not surprising, since 'grep' is a standard POSIX utility and works with POSIX locales (http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html#tag_20_55_08). If you read the POSIX standard carefully, you will find that UTF-16 and UTF-32 cannot be supported as encodings in POSIX locales: these encoding forms use 2-byte and 4-byte code units respectively, which makes the encoding of '/' and '.' nonconforming.
Quoting http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html:

"Conforming implementations shall support one or more coded character sets. Each supported locale shall include the portable character set, which is the set of symbolic names for characters in Portable Character Set.

...
POSIX.1-2008 places only the following requirements on the encoded values of the characters in the portable character set:

...

The encoded values associated with <slash> and <period> shall be invariant across all locales supported by the implementation.

The encoded values associated with the members of the portable character set are each represented in a single byte. Moreover, if the value is stored in an object of C-language type char, it is guaranteed to be positive (except the NUL, which is always zero)."
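To make the single-byte requirement concrete, here is a small sketch that dumps the bytes of '/' and '.' in UTF-8 and in UTF-16LE. It assumes an iconv that accepts the encoding name UTF-16LE (glibc and GNU libiconv both do); the variable names are only for illustration.

```shell
# Bytes of "/." as UTF-8: one byte per character.
utf8_bytes=$(printf '/.' | od -An -tx1 | tr -d ' \n')

# Same two characters as UTF-16LE: two bytes per character,
# the second byte of each code unit being NUL.
utf16_bytes=$(printf '/.' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n')

echo "UTF-8:    $utf8_bytes"
echo "UTF-16LE: $utf16_bytes"
```

In UTF-16 each of these characters takes two bytes, one of which is zero, so neither is "represented in a single byte" as the quoted text requires.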

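Another issue is that sizeof(wchar_t) is implementation-defined. My tests on Ubuntu show that sizeof(wchar_t) is 4 (bytes), so wchar_t does not map onto 2-byte UTF-16 code units there, and you need some other data type (for example a fixed-width 16-bit type such as uint16_t) to store UTF-16 code units in a portable way.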

I would say that this should not be fixed. Instead, use iconv in a pipeline to convert the input to UTF-8 and grep the converted text (though this might be resource-intensive for large XML files).
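A rough sketch of such a pipeline follows. The sample file contents and the search pattern 'needle' are made up for the example, and it assumes the input carries a byte order mark so that iconv can detect the byte order when told plain UTF-16.

```shell
# Hypothetical sample: a small UTF-16 file standing in for the real XML input.
tmp=$(mktemp)
printf '<root>\n<item>needle</item>\n<item>hay</item>\n</root>\n' \
    | iconv -f UTF-8 -t UTF-16 > "$tmp"

# Convert to UTF-8 on the fly, then grep the converted stream as usual.
matches=$(iconv -f UTF-16 -t UTF-8 "$tmp" | grep 'needle')
echo "$matches"

rm -f "$tmp"
```

For a file without a byte order mark you would probably have to name the byte order explicitly (UTF-16LE or UTF-16BE) in the -f argument.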