(In reply to comment #17)
> jbj isn't even Cc'd on this Fedora bug, although as an outsider who happens to
> have an account he was of course able to make a comment.

Apparently you aren't familiar with Bugzilla's watch feature or weren't aware that jbj was watching. Regardless, your statement does nothing to contradict or invalidate my point: RPM has used this Bugzilla since its inception and will continue to do so unless and until either jbj or RedHat decide to alter the arrangement.

> Read it again. You evidently misunderstood it the first time.

Funny, I'd say the same about you.

> Note the difference between the following:
>
> A. "You smoke crack. Therefore your opinion is irrelevant".
> B. "You have very strange opinions. Therefore I suspect you smoke crack."

Neither of which is representative of what actually occurred:

C. "You disagree with me. Therefore I will imply that you are a Luddite."

> My point is that they _aren't_ specified and defined. In the absence of such
> tagging, it's line noise. It's essentially random.

It's not even close to random by any sane definition of the word. The contents of a spec file in an unspecified encoding would have at most 1 or 2 bits of entropy per byte of content. A sufficiently determined individual could almost certainly use a known-plaintext attack on certain spec file parts to produce the encoding. If this is true, applying the label "random" is clearly an overstatement.

What you mean to say is that the encoding of untagged data cannot be computationally deduced with sufficient certainty to be used with system-critical information such as that stored in an RPM database. And with that, I agree. :-)

My primary goal in all this is two-fold: One, point out that the assumption that UTF-8 is the end-all and be-all encoding for the entire lifespan of the RPM product is potentially as erroneous as the assumption that the C/POSIX locale makes for a sufficient default.
Two, open up discussion on mechanisms for tagging to allow for the future.

> Heh. Mind if I quote you on that?

Go ahead, so long as you include the entire context:

1. Fedora is one of many products sharing this Bugzilla.
2. RPM is also sharing this Bugzilla.
3. RPM does not have its own "product" selection.
4. As a result of 2 and 3, any bug filed against the "rpm" component under *any* product may be an upstream issue.

> That's only a partial solution, and it's the uninteresting part of the solution.

"Uninteresting" is often a synonym for "important." In light of goal #1 stated above, I consider it an important point.

> The more interesting part is what you do with an existing RPM database if it
> contains random data.

If you have random data in your RPM database, you have bigger issues than whether or not the random data is UTF-8 encoded randomness.

> And I do mean 'random' -- if it's in untagged character sets it might as well
> be line noise.

Nonsense. The vast majority of textual data in an RPM database is plain old ordinary ASCII, which means it's valid ISO-8859-?? as well as UTF-8. Furthermore, I am having a hard time coming up with examples of RPMDB text data which (1) would contain high-ASCII/multibyte data sequences AND (2) the interpretation of which would have significant material impact on a system. Most situations where encoding counts are things like descriptions and summaries...things that are merely cosmetic.

It seems to me that the following would suffice when upgrading a non-tagged RPMDB:

1. All data which can be interpreted as ASCII is ASCII.
2. If any bytes 0x80 and above are encountered, attempt to process as UTF-8.
3. If invalid UTF-8 is encountered, look for language tags (like fr or de) to deduce encoding (Latin-N, SJIS, BIG5).
4. If no deduction can be made with reasonable certainty, or if a deduction could cause system problems, replace invalid UTF-8 character sequences with some other character and move on.
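
For the sake of discussion, the four upgrade steps above could be sketched in Python roughly as follows. This is only an illustration of the fallback order, not anything RPM actually does: the function name upgrade_untagged, the LEGACY_BY_LANG table, and the "critical" flag are all hypothetical, and the language-to-encoding pairings are guesses that a real implementation would have to justify.

```python
# Hypothetical mapping from a language tag to a likely legacy encoding.
# These pairings are illustrative assumptions, not RPM behaviour.
LEGACY_BY_LANG = {
    "fr": "iso-8859-1",
    "de": "iso-8859-1",
    "pl": "iso-8859-2",
    "ja": "shift_jis",
    "zh_TW": "big5",
}


def upgrade_untagged(raw, lang=None, critical=False):
    """Return (text, label) for one untagged byte string.

    'critical' is an assumed flag marking fields where a wrong guess
    could cause system problems; such fields skip the guessing step.
    """
    # Step 1: anything that is pure ASCII is ASCII.
    try:
        return raw.decode("ascii"), "ascii"
    except UnicodeDecodeError:
        pass

    # Step 2: bytes >= 0x80 are present, so try strict UTF-8.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 3: invalid UTF-8; use the language tag to guess an encoding.
    guess = LEGACY_BY_LANG.get(lang)
    if guess is not None and not critical:
        try:
            return raw.decode(guess), guess
        except UnicodeDecodeError:
            pass

    # Step 4: no safe deduction; replace the invalid sequences
    # (Python substitutes U+FFFD) and move on.
    return raw.decode("utf-8", errors="replace"), "utf-8-replaced"
```

One caveat the sketch makes visible: single-byte encodings such as Latin-1 accept any byte sequence, so step 3 never "fails" for them. The language tag is the only evidence behind that guess, which is why step 4 still matters for fields where a wrong guess would hurt.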
(In reply to comment #18)
> The difference between UTF-8 and the obsolete ISO 8859 encodings is that UTF-8
> can represent all languages of the world, so there is no need for supporting
> anything else.

First off, ISO-8859 encodings are not obsolete. The vast majority of UNIX-like systems in the world still use Latin-N encodings, and that's not going to change any time soon. Fedora developers declaring something obsolete does not make it obsolete; rather, it makes said developers pretentious.

Second, as I've said before, UTF-8 being the "answer to all our encoding problems" now does not mean it will continue to be so in the future. UTF-8 owes its popularity to two compelling but potentially limiting facts: ASCII encodings don't change, and C-style NUL termination doesn't have to change. For legacy code, those are a huge win. But as more and more code becomes multilingual and encoding-agnostic, those factors reduce significantly in importance, lending additional potential to more consistent encodings such as UCS2 and UCS4.

If you really want something to become obsolete, continue thinking with blinders on. Before you know it, your thinking will be obsolete.

(In reply to comment #19)
> That viewpoint is a little excessive. It's definitely sane for 'rpmq' to be able
> to convert to the user's locale when displaying text.

Definitely.

> It might also make sense to take specfiles in obsolete charsets, if they are
> clearly marked as such. If that's done, the original Fedora RFE stands, in a
> slightly modified form -- if it _isn't_ tagged, and if it isn't valid UTF-8, we
> should reject it.

I would have no problem with that, with two provisos:

1. Encoding *should* always be tagged. Relying on a particular default should be discouraged.
2. The RPMDB encoding should be opaque as far as packagers are concerned. While UTF-8 may make sense for now, an alternate format may be preferable in the future.

> Repeat after me: Untagged data are no better than line noise.
s/no/only marginally/