(In reply to comment #13)
> You are correct. It is beyond my reach to make that assumption for everyone else
> on the planet. That's why I restricted myself to doing so in a Fedora Core RFE,
> in Fedora bugzilla.

At the risk of beating a dead horse, until either jbj or RedHat decide to part Bugzillas, this is also RPM's bugzilla.

> In the context of _Fedora_ it's perfectly reasonable to label those who refuse
> to use UTF-8 as Luddites.

http://www.nizkor.org/features/fallacies/ad-hominem.html

> You just have to look at the quality of the alternative 'solution' which was
> proposed -- hacking all the RPM formats from specfile through to the database
> to tag data in random formats instead of just storing it in a consistent
> encoding in the first place.

You are using the word "random" in a manner with which I am unfamiliar. Specified and defined character encodings are not "random."

> Since you persist in trolling the Fedora bugzilla and talking about non-Fedora
> issues, I suppose I might as well capitulate and discuss it...

Get this through your head: This is not Fedora bugzilla. This is RedHat bugzilla, which is currently shared between RHEL, RPM, Fedora, RHAS, and RHN, among others. There is no "RPM" product, so the "rpm" component is used. You used it. So here we are.

Furthermore, this is a Bazaar, not a Cathedral. RPM is used by AIX, Solaris, Darwin, and numerous flavors of Linux, not just Fedora. If you have a problem with that, convince the Fedora Deities to use a different package format. Until then, suck it up and deal. Those who develop RPM and related tools concern themselves with numerous operating systems, the majority of which do NOT use UTF-8 by default.

> There's no excuse for avoiding UTF-8 in RPM internals, even outside the context
> of Fedora. That would really be pointless -- there's certainly no need to
> 'extend' its file formats when we can just store data in UTF-8, which can
> represent the older encodings.
Those who fail to learn from the mistakes of history are doomed to repeat them. You have apparently failed to learn from the mistake of assuming that the de facto standard encoding cannot change over time and does not differ between platforms. Right now, UTF-8 is a compelling replacement for Latin encodings (which are NOT obsolete, so stop erroneously using that term). In the future, UTF-8 may be found to be insufficient to the cause. The correct long-term solution is to allow spec files to specify an arbitrary encoding and to use an internal encoding which can store all data any other encoding could contain.

> fix rpmbuild to convert _to_ UTF-8 from the current locale when
> reading the specfile.

There is no relationship between current locale and the encoding of a particular spec file.

> if I check out the current libxml2/devel branch from CVS and attempt to build
> it, for example, it should _fail_. It certainly shouldn't use _my_ locale (and
> it'd fail anyway because of course my locale is UTF-8).

There is nothing whatsoever inherently wrong with any particular encoding. I should be able to create a spec file in UCS-4 or UTF-32 if I so choose. The problem is telling RPM what encoding was used, and the proper solution does not involve ASSuming UTF-8 and failing on an invalid character.

> You'd need a way to handle existing RPM databases, which may contain random data
> in unknown encodings.

"random" You keep using that word. I do not think it means what you think it means.

> And of course you might have to call it 'RPM-CHARSET' instead of 'UTF-8' to
> appease those who have religious objections to UTF-8.

The encoding should be called exactly what it is, be it UTF-8, UCS-4, or any other. My objections are not to UTF-8 itself, and they're technical, not religious.

Now, let's talk technical details to try and save the usefulness of this whole conversation. Jeff and Paul (and other RPM developers), please comment on the following two ideas:

1.
Spec files are encoded as US-ASCII/UTF-8 by default. Any spec file containing characters which cannot be encoded thusly must specify its encoding via either a header value ("Encoding: ISO-8859-2") or a macro value ("%define __spec_encoding ISO-8859-2"), whichever you think is better.

2. Values which contain non-ASCII characters should specify their encoding, similar to the way languages are currently specified. For example, PLD uses Summary(pl): and %description -l pl to denote Polish content. This could be expanded to allow Summary(pl.utf8): and %description -l pl.utf8.
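
To make the two ideas concrete, here is a rough Python sketch of how a reader might honor them. This is an illustration only, not RPM code: the function names are made up, the "Encoding:" header and the __spec_encoding macro are the hypothetical forms proposed above (neither exists in RPM today), and the internal representation is assumed to be Unicode text.

```python
import re

# Idea 1: a file-wide declaration, in either of the two proposed
# (hypothetical) forms. Both are matched on raw bytes because the
# declaration itself is plain ASCII in any ASCII-compatible encoding.
_HEADER_RE = re.compile(
    rb"^Encoding:[ \t]*([A-Za-z0-9._-]+)[ \t]*\r?$", re.MULTILINE)
_MACRO_RE = re.compile(
    rb"^%define[ \t]+__spec_encoding[ \t]+([A-Za-z0-9._-]+)[ \t]*\r?$",
    re.MULTILINE)

# Idea 2: a per-value suffix such as Summary(pl.utf8): ...
_TAG_RE = re.compile(
    rb"^([A-Za-z]+)\(([A-Za-z_]+)(?:\.([A-Za-z0-9_-]+))?\):[ \t]*(.*)$")


def declared_encoding(raw: bytes, default: str = "utf-8") -> str:
    """Return the encoding a spec file declares, or the default."""
    for pattern in (_HEADER_RE, _MACRO_RE):
        match = pattern.search(raw)
        if match:
            return match.group(1).decode("ascii")
    return default


def read_spec(raw: bytes) -> str:
    """Decode a whole spec file using its own declaration, never the locale."""
    return raw.decode(declared_encoding(raw))


def decode_localized_tag(line: bytes, default: str = "utf-8"):
    """Split b'Tag(lang[.enc]): value' and decode value with enc.

    Returns (tag, lang, value) or None if the line is not a localized tag.
    """
    match = _TAG_RE.match(line)
    if match is None:
        return None
    tag, lang, enc, value = match.groups()
    encoding = enc.decode("ascii") if enc else default
    return tag.decode("ascii"), lang.decode("ascii"), value.decode(encoding)
```

Note that the user's locale is never consulted; the only inputs are the bytes of the spec file and what it says about itself. An unknown encoding name raises LookupError and undecodable bytes raise UnicodeDecodeError, so a mislabeled file fails loudly instead of being silently mangled.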