leastfixedpoint

How should JSON strings be represented in Erlang?

This page is a mirrored copy of an article originally posted on the LShift blog; see the archive index here.

Erlang represents strings as lists of (ASCII, or possibly ISO 8859-1) codepoints. In this regard, it’s weakly typed - there’s no hard distinction between a string, “ABC”, and a list of small integers, [65,66,67]. For example:

Eshell V5.5.4  (abort with ^G)
1> "ABC".
"ABC"
2> [65,66,67].
"ABC"
3> 

Erlang also has a binary type, a simple vector of bytes. In the rfc4627/JSON codec I made for Erlang, I chose to use binaries to represent decoded strings, as suggested by Joe Armstrong.

All was well - until I came to implement UTF-8 support after Sam Ruby got the ball rolling. Plain byte binaries will no longer work as the chosen mapping for JSON strings, since strings may contain arbitrary characters, including those with codepoints greater than 255.

It has always been the case that the ideal representation for a JSON string is an Erlang string, a list of codepoints. Binaries are really a bit of a compromise. But choosing strings-for-strings puts us straight back in a weakly-typed position: it’s possible in JSON to distinguish between “ABC” and [65,66,67], but it’s not possible to make the same distinction in Erlang. We’d need to alter the way JSON arrays are represented to compensate.
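The ambiguity is easy to demonstrate directly (a minimal illustration, not part of the codec):

```erlang
%% "ABC" and [65,66,67] are the very same Erlang term, so a
%% codec handed this value cannot tell which JSON form the
%% caller intended.
true = ("ABC" =:= [65,66,67]).
```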

Possible solutions:

  • Map strings to lists of codepoints. Map arrays to tuples rather than lists. Objects remain {obj,[...]}.
    • Pros: Terse syntax for strings and arrays, no worse than the Unicode-ignorant mapping
    • Cons: Awkward recursion over arrays, either using a counter and the element/2 BIF, or converting to a real list

  • Map strings to binaries containing UTF-8 encoded characters. Keep arrays as lists. Objects remain {obj,[...]}.

    • Pros: Keep terse syntax for strings, with the understanding that the binaries concerned must hold UTF-8-encoded text. Keeps the interface largely unchanged.
    • Cons: Codec needs to perform possibly-redundant Unicode encoding/decoding steps to ensure that the binaries hold UTF-8 even if, say, UTF-32 were the format to be used on the wire

  • Map strings to lists of codepoints. Map arrays to {arr,[...]}, as other JSON codecs do. Objects remain {obj,[...]}.

    • Pros: Natural operations on strings, natural operations on arrays (once you strip the outer {arr,…}).
    • Cons: Converting terms to JSON-encodable form is a pain, since you need to wrap each array in your term with the explicit marker atom.
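For a concrete side-by-side comparison, the JSON document ["ABC", [65,66,67]] would come back from the decoder roughly as follows under each option (a sketch; the exact shapes follow the descriptions above):

```erlang
%% JSON input: ["ABC", [65,66,67]]
%% Option 1: strings as codepoint lists, arrays as tuples.
Opt1 = {"ABC", {65,66,67}},
%% Option 2: strings as UTF-8 binaries, arrays as lists.
Opt2 = [<<"ABC">>, [65,66,67]],
%% Option 3: strings as codepoint lists, arrays tagged with arr.
Opt3 = {arr, ["ABC", {arr, [65,66,67]}]}.
```

Note how only options 2 and 3 let the decoder preserve the distinction between the string and the inner number array.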

All in all, I can’t decide which is the least distasteful option. I think I prefer the middle option, keeping strings mapped to binaries and viewing them as UTF-8 encoded text, but I really need to get some feedback on the issue.

Comments

On 14 September, 2007 at 9:05 am, Daniel Lyons wrote:

It would be interesting to see what the CouchDB guys are doing, since they are storing JSON data in Erlang (Mnesia, I believe) but presenting to the world a JSON over REST interface. Perhaps their Getting Started document will show you everything you need to know; now I think I’d like to see someone store and retrieve a list of ASCII-range integers just to see if it handles it correctly.

On 24 September, 2007 at 5:17 pm, dda wrote:

I have worked on and off on a library called mb (for multi-byte), creating a new string type and making it eas[~ier] to manipulate non-ASCII strings, including conversion between encodings and encoding-safe common string manipulation methods [left, mid, right, reverse, replace, etc]. It has been on hold for a while, and should probably be made public so that others can reuse whatever useful code there is.

The new string type is a tuple, {Encoding::atom(), String::binary()}, and I/O methods fold and unfold the data to and from the tuple. A little tedious, but it seems to work so far. This file was produced by mb’s test framework.

The advantage I saw in going this route is that the data stays as it is in reality, and mb strings can accept many encodings transparently. The encoding-safe manipulation functions [eg, getNextChar() retrieves codepoints, not bytes], make sure I don’t botch the original.
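Under that scheme, the encoding-safe iteration dda mentions might look something like this (a hypothetical sketch: getNextChar and the tuple layout are assumptions based on the description, not mb’s actual API, and it relies on the utf8 bit syntax from later Erlang releases):

```erlang
%% Walk a {utf8, Binary} mb-style string one codepoint at a
%% time, decoding a whole UTF-8 sequence rather than a byte.
get_next_char({utf8, <<C/utf8, Rest/binary>>}) ->
    {C, {utf8, Rest}};
get_next_char({utf8, <<>>}) ->
    eof.
```

For example, get_next_char({utf8, <<16#E9/utf8>>}) yields codepoint 233 (é) and the empty remainder, even though the binary itself is two bytes long.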

On 25 September, 2007 at 1:29 am, David Hopwood wrote:

I think dda’s suggested representation is a good idea (it would work best if it were adopted as the “official” representation of encoded strings in Erlang).

Cons: Codec needs to perform possibly-redundant Unicode encoding/decoding steps to ensure that the binaries hold UTF8 even if, say, UTF32 were the format to be used on the wire

UTF-32 is, to a first approximation, never used on the wire. In principle the argument stands for UTF-16, but UTF-8 is significantly more commonly used in protocols.

Incidentally, there is a thread about this post on the erlang-questions list, subject “strings, json, and what happens now” (not in the archive yet, but it will be here).

On 25 September, 2007 at 7:05 am, Jim Larson wrote:

As Joe Armstrong mentions in his reply to me, the standard formatting of a binary is pretty opaque, while a list of codepoints displays as a string as long as it maps to the Latin-1 range.

One additional feature to consider is the selective mapping of JSON object member name strings to Erlang atoms. This lets objects look a little more natural as Erlang terms:

    {obj, [{alpha, 0.123}, {renormalized, false}]}

(Ah, if only Erlang had a native dictionary type.)

If you’re worried about arbitrary JSON data exhausting the atom table, you can first try list_to_existing_atom/1 (since the member names your code is prepared to use will be pre-loaded, right?), falling back to the conventional string representation if the conversion fails.
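The fallback Jim describes could be sketched as follows (intern_key is a hypothetical helper name, not part of any codec):

```erlang
%% Map a JSON member name to an atom only if that atom already
%% exists, so arbitrary input cannot grow the atom table.
intern_key(Name) when is_list(Name) ->
    try list_to_existing_atom(Name)
    catch error:badarg -> Name
    end.
```

A name like "ok" interns to the atom ok (which always exists), while an unseen member name comes back unchanged as a string.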

On 25 September, 2007 at 8:17 am, Bruce wrote:

David: I think you meant: http://www.erlang.org/pipermail/erlang-questions/2007-September/thread.html

(note change to pipermail in October ‘06 :-).

Jim: I can’t see any replies on-list. Care to summarise?

On 25 September, 2007 at 8:39 am, Jim Larson wrote:

On 25 September, 2007 at 9:31 am, dda wrote:

I think dda’s suggested representation is a good idea (it would work best if it were adopted as the “official” representation of encoded strings in Erlang).

I tend to agree :-) This is why I really think I should get my act together and clean up and publish whatever I have. I think the beauty in that scheme is that it doesn’t force you to store and manipulate your strings in one encoding — or even to care about the encoding.

Remember the discussion, for those who were interested and cared to follow it, on the Ruby list where people clamored for Unicode, and Matz promised an all-encompassing scheme [since Japanese tend not to use UTF]? My background is in East-Asian languages, so my focus in mb was on CJK and their encodings [way too many Asian pages NOT in Unicode], although I did add a bunch of Latin* and Windows codepage encodings.

The first few times I spoke of string manipulation on Erlang-related forums — the list and the IRC chat room mostly — I drew mostly the equivalent of blank stares. I guess most users, especially long-time ones, have zero need for anything but ASCII. Erlang shows its roots, and clearly, as far as TEXT is concerned, it sucks. Whether the Erlang team fixes that will depend on the users making them aware of their needs, and probably on our help. Honestly, I don’t want an mb module, I want a new string type and BIFs and code added to the erlang: module.

On 3 October, 2007 at 4:13 pm, tonyg wrote:

Thanks all for your comments. I’ve gone with the middle option - I’m writing a post now announcing the new version of the code.

On 10 July, 2009 at 11:34 pm, Dana wrote:

String representation in erlang, [58,41] ?

On 10 July, 2009 at 11:36 pm, Dana wrote:

[58,41]