Add Char type

Bug #870518 reported by Matt Giuca
Affects   Status    Importance   Assigned to   Milestone
Mars      Triaged   Wishlist     Matt Giuca

Bug Description

I have decided (after all this) that Mars does need a separate Char type, primarily for two reasons:
- Despite Mars historically being specified as dealing with plain bytes, I am becoming increasingly annoyed by real languages not implementing Unicode properly. So I have decided to lead by example by adding proper Unicode strings to Mars.
- Changing the Int type to Num (bug #870515) -- a floating point type -- makes the current idiom of treating strings as arrays of integers even sillier (an array of floating point numbers?). A dedicated Char type would therefore be useful.

There would still be no String type -- a string would be an Array(Char) and all string-related functions would be modified to deal with such a type.

Char would be defined as an integer in the range [0, 0x10ffff], with values representing Unicode code points. Char values would display as quoted character literals, and character/string literals would have type Char and Array(Char), respectively. Character and string literal syntax would be extended with \uxxxx and \Uxxxxxxxx notation for specifying code point values.
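To make the proposal concrete, here is a rough sketch in Python (not Mars) of the range check and the two new escape forms. The names make_char and parse_string_literal are made up for illustration, and the sketch assumes \uxxxx takes exactly four hex digits and \Uxxxxxxxx exactly eight. It ignores every other escape sequence, and a malformed \u escape is simply treated as literal characters.

```python
import re

# Upper bound of the proposed Char range [0, 0x10ffff].
MAX_CODE_POINT = 0x10FFFF


def make_char(value: int) -> int:
    """Return value unchanged if it is a valid Char, else raise ValueError."""
    if not 0 <= value <= MAX_CODE_POINT:
        raise ValueError(f"code point out of range: {value:#x}")
    return value


# Assumption: \u is followed by exactly 4 hex digits, \U by exactly 8.
_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})|\\U([0-9a-fA-F]{8})")


def parse_string_literal(body: str) -> list:
    """Turn the text between the quotes into a list of code points,
    i.e. something shaped like an Array(Char)."""
    chars = []
    pos = 0
    while pos < len(body):
        match = _ESCAPE.match(body, pos)
        if match:
            digits = match.group(1) or match.group(2)
            chars.append(make_char(int(digits, 16)))
            pos = match.end()
        else:
            # Any other character stands for its own code point.
            chars.append(make_char(ord(body[pos])))
            pos += 1
    return chars
```

For example, parse_string_literal(r"a\u00e9") gives the two code points 0x61 and 0xe9, while an escape such as \U00110000 is rejected because it falls outside the Char range.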

We would supply two new built-in functions, chr and ord, for conversion between Char and Num. We would also need to be concerned with encodings when reading from and writing to a file, and might need to specify a way to read and write raw bytes as well.
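As an illustration of why encodings matter here, the following Python sketch (not Mars) uses Python's own chr and ord as stand-ins for the proposed built-ins and models an Array(Char) as a list of integers. The helper names to_chars, from_chars, write_chars and read_chars are hypothetical, as is the choice of UTF-8 as the default encoding.

```python
import io


def to_chars(text: str) -> list:
    """Convert text to a list of code points (ord on each character)."""
    return [ord(c) for c in text]


def from_chars(chars: list) -> str:
    """Convert a list of code points back to text (chr on each number)."""
    return "".join(chr(n) for n in chars)


def write_chars(stream, chars, encoding="utf-8"):
    """Write code points to a binary stream using an explicit encoding."""
    stream.write(from_chars(chars).encode(encoding))


def read_chars(stream, encoding="utf-8"):
    """Read a whole binary stream and decode it into code points."""
    return to_chars(stream.read().decode(encoding))


# The same two characters produce different bytes under different encodings.
utf8_out = io.BytesIO()
write_chars(utf8_out, [0x68, 0xE9])
latin1_out = io.BytesIO()
write_chars(latin1_out, [0x68, 0xE9], encoding="latin-1")
```

Here utf8_out holds three bytes and latin1_out holds two, so a reader cannot recover the original characters without knowing which encoding the writer used.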

Matt Giuca (mgiuca)
description: updated
Matt Giuca (mgiuca) wrote :

Upon further reflection, this is too big a feature to implement. The biggest problem is that I/O would need to be aware of what encoding the stream is using (if it forces you to use UTF-8 then it just makes things worse).

So, rather than adding a Char type, I will settle for adding a Byte type, defined as an unsigned integer in the range [0, 0xff]. There will be no arithmetic on bytes. Byte literals will be character literals, so they aren't like Java bytes (small integers) -- they actually represent characters. The low 128 values represent ASCII characters, while the high 128 values represent byte values in some arbitrary encoding -- generally UTF-8.
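A rough Python sketch (not Mars) of what such a Byte type might look like: a range-checked wrapper that defines equality but no arithmetic operators, and that displays as a character literal. The class name Byte follows the text above; the '\xNN' display form for non-printable and high values, and the string_literal helper that encodes source text as UTF-8, are assumptions made for illustration.

```python
class Byte:
    """A value in [0, 0xff] that deliberately defines no arithmetic."""

    __slots__ = ("value",)

    def __init__(self, value: int):
        if not 0 <= value <= 0xFF:
            raise ValueError(f"byte out of range: {value}")
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Byte) and self.value == other.value

    def __hash__(self):
        return hash(self.value)

    def __repr__(self):
        # Printable ASCII (except quote and backslash) displays as itself;
        # everything else uses an assumed \xNN escape form.
        if 0x20 <= self.value < 0x7F and self.value not in (0x27, 0x5C):
            return f"'{chr(self.value)}'"
        return f"'\\x{self.value:02x}'"


def string_literal(text: str) -> list:
    """Model a string literal as a list of Byte values, assuming the
    source text is encoded as UTF-8."""
    return [Byte(b) for b in text.encode("utf-8")]
```

Under these assumptions an ASCII character maps to a single Byte, while a character such as "é" becomes two Byte values (0xc3, 0xa9), and an expression like Byte(1) + Byte(2) fails because no addition is defined.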

Matt Giuca (mgiuca)
Changed in mars:
milestone: 1.0 → 1.1