Add Char type
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Mars | Triaged | Wishlist | Matt Giuca |
Bug Description
I have decided (after all this) that Mars does need a separate Char type, primarily for two reasons:
- Despite Mars historically being specified as dealing with plain bytes, I am becoming increasingly annoyed by real languages not implementing Unicode properly. So I have decided to lead by example by adding proper Unicode strings to Mars.
- Changing the Int type to Num (bug #870515) -- a floating point type -- makes the current idiom of treating strings as arrays of integers even sillier (an array of floating point numbers?). A dedicated Char type will therefore be useful.
There would still be no String type -- a string would be an Array(Char) and all string-related functions would be modified to deal with such a type.
Char would be defined as an integer in the range [0, 0x10ffff], with values representing Unicode code points. Char values would display as quoted character literals, and character/string literals would have type Char and Array(Char), respectively. Character and string literal syntax would be extended with \uxxxx and \Uxxxxxxxx notation for specifying code point values.
We would supply two new built-in functions, chr and ord, for converting between Char and Num (chr from Num to Char, ord from Char to Num). We would also need to be concerned with encodings when reading from and writing to a file, and would possibly need to specify a way to read and write raw bytes as well.
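As a rough illustration of the proposed semantics, here is a sketch in Python, whose built-in chr and ord behave much like the functions described above. This is not Mars syntax; `MAX_CODE_POINT` and `is_valid_char` are hypothetical names introduced only for this example.

```python
# Illustration only: Python's chr/ord stand in for the proposed Mars built-ins.
MAX_CODE_POINT = 0x10FFFF  # upper bound of the proposed Char range


def is_valid_char(n: int) -> bool:
    """Return True if n lies in the proposed Char range [0, 0x10ffff]."""
    return 0 <= n <= MAX_CODE_POINT


# ord maps a character to its code point; chr maps a code point back.
code_a = ord("A")
char_a = chr(0x41)

# \uxxxx takes four hex digits, \Uxxxxxxxx takes eight.
e_acute = "\u00e9"
grinning = "\U0001F600"

# The round trip is the identity for a valid code point.
round_trip = chr(ord(grinning))
```

Note that the eight-digit form is needed for any code point above 0xffff, which is why both notations are proposed.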
On further reflection, this is too big a feature to implement. The biggest problem is that I/O would need to be aware of which encoding the stream is using (and if it forces you to use UTF-8, that just makes things worse).
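The encoding problem can be seen in a small Python sketch (Python is used for illustration only; none of this is Mars code). The same two bytes read from a stream yield different characters depending on which encoding the reader assumes, so a Char-based I/O layer cannot avoid making that choice.

```python
# The same raw bytes, as they might arrive from a file or pipe.
raw = b"\xc3\xa9"

# Decoded as UTF-8, the two bytes form a single code point (U+00E9).
as_utf8 = raw.decode("utf-8")

# Decoded as Latin-1, each byte is its own code point (U+00C3, U+00A9).
as_latin1 = raw.decode("latin-1")

# Working with the undecoded bytes sidesteps the question entirely.
as_bytes = list(raw)
```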
So, rather than adding a Char type, I will settle for adding a Byte type, defined as an unsigned integer in the range [0, 0xff]. There will be no arithmetic on bytes. Byte literals will be character literals, so they aren't like Java bytes (small integers) -- they actually represent characters. The low 128 values represent ASCII characters, while the high 128 values represent byte values in some arbitrary encoding -- generally UTF-8.
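A minimal Python sketch of how such a Byte type might behave follows. It is a model under stated assumptions, not the Mars implementation: the `Byte` class, `is_ascii`, and `to_bytes` are hypothetical names, and UTF-8 is assumed as the encoding for the high values.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Byte:
    """Hypothetical model of the proposed Byte: a value in [0, 0xff]."""
    value: int

    def __post_init__(self) -> None:
        if not 0 <= self.value <= 0xFF:
            raise ValueError(f"byte out of range: {self.value!r}")

    def is_ascii(self) -> bool:
        # The low 128 values are ASCII characters.
        return self.value < 0x80

    # No __add__, __sub__, etc. are defined, so arithmetic on a Byte
    # raises TypeError, mirroring "no arithmetic on bytes".


def to_bytes(text: str) -> List[Byte]:
    """Model a string as an array of Byte (assumption: UTF-8 encoding)."""
    return [Byte(b) for b in text.encode("utf-8")]


hello = to_bytes("hi")       # two ASCII bytes
accented = to_bytes("\u00e9")  # one character, two high bytes in UTF-8
```

In this model an ASCII string is one Byte per character, while a non-ASCII character occupies several high-valued Bytes whose meaning depends on the encoding.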