ALPHANUMERIC/DIGIT-CHAR-P invariant broken with Unicode
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
SBCL | Triaged | Medium | Unassigned |
Bug Description
DIGIT-CHAR-P returns a non-NIL value only for the 10 ASCII digits when radix <= 10, even though it recognizes non-ASCII Unicode digit characters in other contexts.
ALPHANUMERICP is defined in SBCL by checking the UCD-GENERAL-
ALPHA-CHAR-P is defined in SBCL by checking the UCD-GENERAL-
DIGIT-CHAR-P is a little different because it can also take an optional "radix" argument. Still, it should make sense that anything with UCD-GENERAL-
The Hyperspec page for ALPHANUMERICP even makes this relationship explicit:
(alphanumericp x)
== (or (alpha-char-p x) (not (null (digit-char-p x))))
In SBCL 1.1.3 (with :SB-UNICODE in *FEATURES*), this isn't always the case. (1.1.3 isn't the latest release, but this function doesn't appear to have been updated since then.)
For example, consider #\FULLWIDTH_
* (digit-char-p #\FULLWIDTH_
NIL
* (alphanumericp #\FULLWIDTH_
T
* (or (alpha-char-p #\FULLWIDTH_
NIL
Internally, it looks like SBCL does recognize that it's a digit, with value 2:
* (sb-impl:
2
It seems like DIGIT-CHAR-P's "Special-case decimal and smaller radices" is what's causing the problem. If you ask if this character is a digit in base-11, SBCL reports that it is:
* (digit-char-p #\FULLWIDTH_
2
I expect that any character that returns a value 0-9 from DIGIT-CHAR-P with radix=11 should also return that value when radix=10.
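The expectation above can be checked mechanically. The following is a minimal sketch using only standard Common Lisp; the function name RADIX-MISMATCHES is hypothetical and not part of SBCL. It collects every character whose base-11 weight is below 10 but whose base-10 result differs.

```lisp
;; Hypothetical helper (not part of SBCL): collect characters whose
;; weight in radix 11 is 0-9 but which DIGIT-CHAR-P treats differently
;; in radix 10.
(defun radix-mismatches ()
  (loop for i from 0 below char-code-limit
        for c = (code-char i)          ; may be NIL on some implementations
        for w = (and c (digit-char-p c 11))
        when (and w
                  (< w 10)
                  (not (eql w (digit-char-p c 10))))
          collect c))
```

If the two radices were treated consistently, (radix-mismatches) would return the empty list.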
SIMPLE TEST CASE:
This code returns a list of all characters which don't meet the Hyperspec's equivalence mentioned above:
(defconstant +all-chars+
  (loop for i from 0 upto (1- char-code-limit)
        collect (code-char i)))
(loop for x in +all-chars+
      when (not (eq (alphanumericp x)
                    (or (alpha-char-p x) (not (null (digit-char-p x))))))
        collect x)
It should return the empty list, but returns 401 characters here.
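To see which characters are affected, a variant that prints each offending code point with its name may be more convenient. This is a sketch in standard Common Lisp only; INVARIANT-HOLDS-P and REPORT-VIOLATIONS are hypothetical names introduced here for illustration.

```lisp
;; Hypothetical helper: true when CHAR satisfies the HyperSpec equivalence
;; for ALPHANUMERICP. Both sides are passed through NOT so that any
;; generalized boolean compares correctly.
(defun invariant-holds-p (char)
  (eq (not (alphanumericp char))
      (not (or (alpha-char-p char)
               (digit-char-p char)))))

;; Hypothetical helper: print every violating character as
;; "U+XXXX NAME" to STREAM and return the number of violations.
(defun report-violations (&optional (stream *standard-output*))
  (let ((count 0))
    (loop for i from 0 below char-code-limit
          for c = (code-char i)        ; may be NIL on some implementations
          when (and c (not (invariant-holds-p c)))
            do (incf count)
               (format stream "U+~4,'0X ~A~%" i (char-name c)))
    count))
```

Calling (report-violations) prints one line per violating character and returns the total count.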
VERSION INFORMATION:
$ sbcl --version
SBCL 1.1.3
$ uname -a
Darwin Ken-Harris-
* *features*
(:ALIEN-CALLBACKS :ANSI-CL :BSD :C-STACK-
:COMPARE-
:DARWIN9-OR-BETTER :FLOAT-EQL-VOPS :GENCGC :IEEE-FLOATING-
:INLINE-CONSTANTS :INODE64 :LINKAGE-TABLE :LITTLE-ENDIAN
:MACH-
:OS-PROVIDES-
:OS-PROVIDES-PUTWC :OS-PROVIDES-
:SB-EVAL :SB-LDB :SB-PACKAGE-LOCKS :SB-SOURCE-
:SB-UNICODE :SBCL :STACK-
:STACK-
:STACK-
:UNWIND-
Ken Harris <email address hidden> writes:
> The Hyperspec page for ALPHANUMERICP even makes this relationship
> explicit:
>
> (alphanumericp x)
> == (or (alpha-char-p x) (not (null (digit-char-p x))))
I haven't thought this through properly, but I think that my preferred
resolution to this invariance breakage is actually to restrict
digit-char-p to the ascii set, rather than extending it to fullwidth
digit variants and similar. The reason I say that is that if you expect
(digit-char-p #\FULLWIDTH_DIGIT_TWO 11) to be 2 (and I agree that that's
reasonable, if not the only possible thing) you might also expect
(digit-char-p #\FULLWIDTH_LATIN_CAPITAL_LETTER_A 11) to be 10, which is
perhaps a little more surprising but still not impossible, because we
could just take compatibility decompositions of characters, right?
Except that then (digit-char-p #\FEMININE_ORDINAL_INDICATOR 11) would
also be 10, which is frankly not expected at all.
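The compatibility-decomposition argument can be illustrated with a small sketch. It assumes a more recent SBCL than 1.1.3, one that provides SB-UNICODE:NORMALIZE-STRING; COMPAT-DIGIT-WEIGHT is a hypothetical helper, not an existing or proposed SBCL function. Characters are given by code point: U+FF12 is the fullwidth digit two, U+FF21 the fullwidth capital A, and U+00AA the feminine ordinal indicator.

```lisp
;; Hypothetical helper, assuming SB-UNICODE:NORMALIZE-STRING is available:
;; take the NFKD compatibility decomposition of CHAR and, if it is a
;; single character, return that character's DIGIT-CHAR-P weight.
(defun compat-digit-weight (char &optional (radix 10))
  (let ((decomposed (sb-unicode:normalize-string (string char) :nfkd)))
    (and (= (length decomposed) 1)
         (digit-char-p (char decomposed 0) radix))))

;; The three cases discussed above, in radix 11.
(list (compat-digit-weight (code-char #xFF12) 11)   ; fullwidth digit two
      (compat-digit-weight (code-char #xFF21) 11)   ; fullwidth capital A
      (compat-digit-weight (code-char #x00AA) 11))  ; feminine ordinal indicator
```

Under this scheme the ordinal indicator decomposes to a plain "a" and so would receive the same weight as the letter A, which is the surprising outcome the comment warns about.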
Of course, restricting digit-char-p to interpreting only ascii digits as
numbers is irritating to those who want to work with Unicode. But I
think the answer to that is to provide and export richer Unicode
functionality, so that users can legitimately work with the Unicode data
that we store. (In my own slow way I am working on this; my github fork
of sbcl has an update to Unicode 6.2 and the beginnings of
normalization, sadly not yet complete).