Comment 6 for bug 1940283

Andrew Johnson (anj) wrote :

I take that back: this isn't a bug so much as a request that we be more liberal than the JSON-5 spec permits. I might be open to that, but first we should make the error easier to understand. Here's what's happening:

The values of info() and field() entries in a database file are supposed to be valid JSON-5 values, and are currently parsed as such.

Looking at the JSON spec at https://datatracker.ietf.org/doc/html/rfc7159 and the diagrams on https://www.json.org/json-en.html, that spec doesn't actually allow *any* unescaped control characters inside string values. The spec says:

   A string begins and ends with
   quotation marks. All Unicode characters may be placed within the
   quotation marks, except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

Thus all control characters are supposed to be escaped according to the older JSON rules.
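
To make the rule concrete, here is a small Python sketch. It is not EPICS code; the helper names first_unescaped_control() and strict_json_accepts() are made up for illustration. It uses Python's standard json module, which in its default strict mode also rejects raw control characters inside strings.

```python
import json

def first_unescaped_control(body):
    """Return the index of the first raw control character (U+0000..U+001F)
    in the text between the quotes, or None if there is none.
    Hypothetical helper, for illustration only."""
    for i, ch in enumerate(body):
        if ord(ch) < 0x20:
            return i
    return None

def strict_json_accepts(text):
    """True if Python's json module (default strict mode) parses text."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# A raw tab character between the quotes violates the rule quoted above ...
raw_tab = '"a\tb"'
# ... while the two-character escape sequence (backslash, then 't') is legal.
escaped_tab = '"a\\tb"'

print(strict_json_accepts(raw_tab))      # rejected
print(strict_json_accepts(escaped_tab))  # accepted
```

The escaped form still yields a string containing a real tab once parsed, so nothing is lost by requiring the escape.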

The JSON-5 spec at https://spec.json5.org/#strings has similar language, although its BNF also confusingly allows an unescaped SourceCharacter, which the ECMAScript language specification at https://262.ecma-international.org/5.1/#sec-6 defines as "any Unicode code unit". I think that can be ignored though.

Both of our JSON parser lexers (dbLex.l and yajl_lex.c) follow those strict specifications when it comes to the set of characters allowed inside strings. I could change them to allow unescaped tab characters inside strings (please comment if you have an opinion about that either way), but I will have to fix both parsers to do that.

I agree that the error messages you got aren't particularly helpful. The character being complained about is the initial double-quote at the start of the string – the lexer couldn't match the whole string because of the illegal character between the quotes, so it back-tracked to the very start of it and complained about the quote itself. This also explains the "funny" result with the BEL character, which isn't currently legal anywhere in a .db file.
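
The back-tracking behaviour can be imitated with a regular expression in Python. This is only a sketch of the idea, not the actual rules in dbLex.l or yajl_lex.c: the pattern STRICT_STRING and the function next_token() are invented for this example, and the escape handling is simplified.

```python
import re

# Strict string: a quote, then ordinary characters or backslash escapes,
# then a closing quote. Raw control characters (U+0000..U+001F) are not in
# the ordinary set. Simplified: any non-control character may follow a
# backslash, which is looser than the real escape rules.
STRICT_STRING = re.compile(r'"(?:[^"\\\x00-\x1f]|\\[^\x00-\x1f])*"')

def next_token(text, pos=0):
    """Return ('STRING', lexeme) or ('BAD_CHAR', ch) for the input at pos.
    Hypothetical stand-in for a lexer's string rule plus its catch-all rule."""
    m = STRICT_STRING.match(text, pos)
    if m:
        return ('STRING', m.group())
    # The string rule failed as a whole, so only the single character at
    # pos is consumed and reported. For a broken string that character is
    # the opening quote, not the control character further along.
    return ('BAD_CHAR', text[pos])

print(next_token('"a\\tb"'))  # escaped tab: matched as a string
print(next_token('"a\tb"'))   # raw tab: the opening quote gets the blame
```

The second call reports the double-quote even though the real culprit is the tab two characters later, which is the confusing message described above.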

To give a more friendly error message here I can add error-matching patterns that recognize anything that looks like a string to a human but doesn't to the strict lexer, then tell the user what's wrong with their string. The first part should be relatively straightforward to code, although I'm not sure I want to write code that could analyze any kind of broken string and explain why it isn't a legal JSON string.
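
As a rough illustration of that approach, again in Python rather than lex and with invented names (LOOSE_STRING, diagnose()), a looser pattern can catch anything that looks like a string, after which the first offending character is reported. It only handles the unescaped control character case and makes no attempt to explain other kinds of broken string.

```python
import re

# Same simplified strict pattern as a real lexer's string rule might use.
STRICT_STRING = re.compile(r'"(?:[^"\\\x00-\x1f]|\\[^\x00-\x1f])*"')
# Looser "looks like a string to a human" pattern: control characters other
# than newline are tolerated between the quotes.
LOOSE_STRING = re.compile(r'"(?:[^"\\\n]|\\.)*"')

def diagnose(text, pos=0):
    """Return None if a strict string starts at pos, else an error message.
    Hypothetical helper; the message wording is made up."""
    if STRICT_STRING.match(text, pos):
        return None
    m = LOOSE_STRING.match(text, pos)
    if m is None:
        return "unexpected character %r" % text[pos]
    # It looks like a string, so point at the first raw control character.
    for offset, ch in enumerate(m.group()):
        if ord(ch) < 0x20:
            return ("string contains unescaped control character U+%04X "
                    "at offset %d; use an escape sequence instead"
                    % (ord(ch), offset))
    return "malformed string"

print(diagnose('"a\tb"'))
print(diagnose('"a\x07b"'))
```

Because the loose pattern is only tried after the strict one fails, legal input takes the same path as before and only the error case pays for the extra match.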