Comment 21 for bug 637227

In Red Hat Bugzilla #190363, Michael (michael-redhat-bugs) wrote on 2006-05-04:

(In reply to comment #17)
> jbj isn't even Cc'd on this Fedora bug, although as an outsider who happens to
> have an account he was of course able to make a comment.

Apparently you aren't familiar with Bugzilla's watch feature or weren't aware
that jbj was watching. Regardless, your statement does nothing to contradict or
invalidate my point: RPM has used this Bugzilla since its inception and will
continue to do so unless and until either jbj or Red Hat decides to alter the
arrangement.

> Read it again. You evidently misunderstood it the first time.

Funny, I'd say the same about you.

> Note the difference between the following:
>
> A. "You smoke crack. Therefore your opinion is irrelevant".
> B. "You have very strange opinions. Therefore I suspect you smoke crack."

Neither of which is representative of what actually occurred:

C. "You disagree with me. Therefore I will imply that you are a Luddite."

> My point is that they _aren't_ specified and defined. In the absence of such
> tagging, it's line noise. It's essentially random.

It's not even close to random by any sane definition of the word. The contents
of a spec file in an unspecified encoding would have at most 1 or 2 bits of
entropy per byte of content. A sufficiently determined individual could almost
certainly use a known-plaintext attack on certain parts of a spec file to recover
the encoding. If so, applying the label "random" is clearly an overstatement.

What you mean to say is that the encoding of untagged data cannot be
computationally deduced with sufficient certainty to be used with
system-critical information such as that stored in an RPM database.

And with that, I agree. :-)
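As a rough illustration of the entropy point above, the sketch below compares how well spec-like text and truly random bytes compress. The spec fragment is made up for the example, and a zlib ratio is only a crude upper bound on entropy, so it will not reproduce the 1 or 2 bits per byte figure; it only shows that such text is measurably far from random.

```python
import os
import zlib


def compressed_bits_per_byte(data: bytes) -> float:
    """Crude upper bound on entropy: compressed size in bits per input byte."""
    return 8 * len(zlib.compress(data, 9)) / len(data)


# Hypothetical spec-file fragment, invented for this illustration.
spec_like = (
    b"Name: example\n"
    b"Version: 1.0\n"
    b"Release: 1\n"
    b"Summary: An example package for illustration\n"
    b"License: GPL\n"
    b"Group: Applications/Text\n"
    b"%description\n"
    b"This example package exists only for illustration. It installs an\n"
    b"example program and the example documentation for that program.\n"
    b"%files\n"
    b"/usr/bin/example\n"
    b"/usr/share/doc/example/README\n"
)

# Genuinely random bytes of the same length, for comparison.
random_like = os.urandom(len(spec_like))

text_bits = compressed_bits_per_byte(spec_like)
noise_bits = compressed_bits_per_byte(random_like)

print(f"spec-like text: {text_bits:.2f} bits/byte")
print(f"random bytes:   {noise_bits:.2f} bits/byte")
```

Random input does not compress at all, while even this short fragment does; a longer real spec file would be expected to compress further.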

My primary goal in all this is twofold: One, point out that the assumption
that UTF-8 is the be-all and end-all encoding for the entire lifespan of the RPM
product is potentially as erroneous as the assumption that the C/POSIX locale
makes for a sufficient default. Two, open up discussion on mechanisms for
tagging to allow for the future.

> Heh. Mind if I quote you on that?

Go ahead, so long as you include the entire context:
  1. Fedora is one of many products sharing this Bugzilla.
  2. RPM is also sharing this Bugzilla.
  3. RPM does not have its own "product" selection.
  4. As a result of 2 and 3, any bug filed against the "rpm" component under
     *any* product may be an upstream issue.

> That's only a partial solution, and it's the uninteresting part of the solution.

"Uninteresting" is often a synonym for "important." In light of goal #1 stated
above, I consider it an important point.

> The more interesting part is what you do with an existing RPM database if it
> contains random data.

If you have random data in your RPM database, you have bigger issues than
whether or not the random data is UTF-8 encoded randomness.

> And I do mean 'random' -- if it's in untagged character sets it might as well
> be line noise.

Nonsense. The vast majority of textual data in an RPM database is plain old
ordinary ASCII, which means it's valid ISO-8859-?? as well as UTF-8.
Furthermore, I am having a hard time coming up with examples of RPMDB text data
which (1) would contain high-ASCII/multibyte data sequences AND (2) the
interpretation of which would have significant material impact on a system.

Most situations where encoding counts are things like descriptions and
summaries...things that are merely cosmetic.
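The ASCII point can be checked directly. The sketch below (the sample strings are made up) decodes the same ASCII bytes under several encodings and then shows where the ambiguity actually starts: a single byte at 0x80 or above.

```python
# Hypothetical header text; any pure-ASCII byte string behaves the same way.
ascii_bytes = b"Summary: An example package"

codecs = ("ascii", "iso-8859-1", "iso-8859-15", "utf-8")
decoded = {codec: ascii_bytes.decode(codec) for codec in codecs}

# Pure ASCII decodes to the identical string under every one of these codecs.
same_everywhere = len(set(decoded.values())) == 1

# A high byte is where interpretations diverge.
high_byte = b"\xa4"
as_latin1 = high_byte.decode("iso-8859-1")   # U+00A4, currency sign
as_latin9 = high_byte.decode("iso-8859-15")  # U+20AC, euro sign

try:
    high_byte.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    # A lone continuation byte is not valid UTF-8.
    valid_utf8 = False

print(same_everywhere, repr(as_latin1), repr(as_latin9), valid_utf8)
```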

It seems to me that the following would suffice when upgrading a non-tagged RPMDB:

1. All data which can be interpreted as ASCII is ASCII.
2. If any bytes 0x80 and above are encountered, attempt to process as UTF-8.
3. If invalid UTF-8 is encountered, look for language tags (like fr or de) to
deduce encoding (Latin-N, SJIS, BIG5).
4. If no deduction can be made with reasonable certainty, or if a deduction
could cause system problems, replace invalid UTF-8 character sequences with some
other character and move on.
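The four steps above can be sketched as follows. This is only an illustration, not RPM code: the function name, the `lang` parameter, and the language-to-encoding table are all assumptions made for the example, and the "could cause system problems" check in step 4 is not modeled.

```python
from typing import Optional

# Hypothetical mapping from a language tag to a likely legacy encoding.
LEGACY_BY_LANG = {
    "fr": "iso-8859-1",
    "de": "iso-8859-1",
    "ja": "shift_jis",
    "zh_TW": "big5",
}


def upgrade_text(raw: bytes, lang: Optional[str] = None) -> str:
    """Convert one untagged text field to a Unicode string."""
    # Step 1: anything that is pure ASCII is ASCII.
    try:
        return raw.decode("ascii")
    except UnicodeDecodeError:
        pass

    # Step 2: bytes >= 0x80 are present, so try UTF-8.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Step 3: invalid UTF-8, so use the language tag to guess a legacy encoding.
    codec = LEGACY_BY_LANG.get(lang) if lang else None
    if codec is not None:
        try:
            return raw.decode(codec)
        except UnicodeDecodeError:
            pass

    # Step 4: no confident guess; replace invalid sequences and move on.
    return raw.decode("utf-8", errors="replace")


print(upgrade_text(b"rpm"))            # pure ASCII
print(upgrade_text(b"caf\xe9", "fr"))  # Latin-1 guessed from the tag
print(upgrade_text(b"caf\xe9"))        # no tag: invalid byte replaced
```

Note that a single-byte codec such as ISO-8859-1 accepts every byte value, so step 3 can never "fail" for those languages; a real implementation would need some plausibility check before trusting the guess.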

(In reply to comment #18)
> The difference between UTF-8 and the obsolete ISO 8859 encodings is that UTF-8
> can represent all languages of the world, so there is no need for supporting
> anything else.

First off, ISO-8859 encodings are not obsolete. The vast majority of UNIX-like
systems in the world still use Latin-N encodings, and that's not going to change
any time soon. Fedora developers declaring something obsolete does not make it
obsolete; rather, it makes said developers pretentious.

Second, as I've said before, UTF-8 being the "answer to all our encoding
problems" now does not mean it will continue to be so in the future. UTF-8 owes
its popularity to two compelling but potentially limiting facts: ASCII
encodings don't change, and C-style NUL termination doesn't have to change. For
legacy code, those are a huge win. But as more and more code becomes
multilingual and encoding-agnostic, those factors reduce significantly in
importance, lending additional potential to more consistent encodings such as
UCS-2 and UCS-4.
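A small sketch of the two properties mentioned above. Python has no codec named UCS-2 or UCS-4, so this uses `utf-16-le` and `utf-32-le` as stand-ins; for the ASCII sample here they produce the same bytes that UCS-2 and UCS-4 would.

```python
text = "rpm"

utf8 = text.encode("utf-8")
ucs2 = text.encode("utf-16-le")  # stand-in for UCS-2 (identical for this sample)
ucs4 = text.encode("utf-32-le")  # stand-in for UCS-4

# Property 1: ASCII text is byte-for-byte unchanged in UTF-8.
ascii_unchanged = utf8 == text.encode("ascii")

# Property 2: UTF-8 introduces no NUL bytes, so C-style strings keep working.
utf8_has_nul = b"\x00" in utf8
ucs2_has_nul = b"\x00" in ucs2
ucs4_has_nul = b"\x00" in ucs4

print(utf8, ucs2, ucs4)
print(ascii_unchanged, utf8_has_nul, ucs2_has_nul, ucs4_has_nul)
```

The wider encodings use a fixed number of bytes per character for this text (2 and 4 respectively), which is presumably the "consistency" being referred to, at the cost of embedded NUL bytes that legacy C string handling cannot tolerate.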

If you really want something to become obsolete, continue thinking with blinders
on. Before you know it, your thinking will be obsolete.

(In reply to comment #19)
> That viewpoint is a little excessive. It's definitely sane for 'rpmq' to be able
> to convert to the user's locale when displaying text.

Definitely.

> It might also make sense to take specfiles in obsolete charsets, if they are
> clearly marked as such. If that's done, the original Fedora RFE stands, in a
> slightly modified form -- if it _isn't_ tagged, and if it isn't valid UTF-8, we
> should reject it.

I would have no problem with that, with two provisos:
1. Encoding *should* always be tagged. Relying on a particular default should
be discouraged.
2. The RPMDB encoding should be opaque as far as packagers are concerned.
While UTF-8 may make sense for now, an alternate format may be preferable in the
future.

> Repeat after me: Untagged data are no better than line noise.

s/no/only marginally/
