leastfixedpoint

Proper Unicode support in Erlang RFC4627 (JSON) module

This page is a mirrored copy of an article originally posted on the LShift blog; see the archive index here.

In a previous post I explored some of the options for supporting RFC4627 (JSON) Unicode-in-strings well when mapping to Erlang terms. In the end, I settled on keeping the interface almost unchanged: the only change is that binaries returned from rfc4627:decode are to be interpreted as UTF-8 encoded text now, whereas before their interpretation was less well defined.
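As a minimal sketch of what "interpreted as UTF-8 encoded text" means for a caller, the following turns such a binary back into a list of Unicode code points. It assumes a later OTP release that ships the standard `unicode` module, which postdates this article. The module name `utf8_text_sketch` and the function `code_points/1` are hypothetical and not part of rfc4627.

```erlang
-module(utf8_text_sketch).
-export([code_points/1]).

%% Hypothetical helper: interpret a binary as UTF-8 text and return
%% its Unicode code points, or an error if it is not well-formed UTF-8.
code_points(Bin) when is_binary(Bin) ->
    case unicode:characters_to_list(Bin, utf8) of
        Chars when is_list(Chars) -> {ok, Chars};
        %% unicode:characters_to_list/2 returns an {error, ...} or
        %% {incomplete, ...} tuple for ill-formed or truncated input.
        _ErrorOrIncomplete -> {error, invalid_utf8}
    end.
```

For instance, the two-byte binary `<<194,128>>` is the UTF-8 encoding of the single code point 128, so `utf8_text_sketch:code_points(<<194,128>>)` should give `{ok,[128]}`.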

The new module is available as a tarball (automatically generated from the github repository) or by browsing online here. You can also get the code using git:

git clone git://github.com/tonyg/erlang-rfc4627.git

Here are some examples using the new module. First, let’s explore the autodetection of which encoding is being used. In the following example, we see UTF-16, both big- and little-endian, as well as ill-formed and well-formed examples of UTF-8 being passed through the autodetector. (It also supports UTF-32 big- and little-endian.)

Eshell V5.5.5  (abort with ^G)
1> rfc4627:unicode_decode([34,0,228,0,34,0]).
{'utf-16le',"\"ä\""}
2> rfc4627:unicode_decode([0,34,0,228,0,34]).
{'utf-16be',"\"ä\""}
3> rfc4627:unicode_decode([34,228,34]).
** exited: {ucs,{bad_utf8_character_code}} **
4> rfc4627:unicode_decode([34,195,164,34]).
{'utf-8',"\"ä\""}
5> 
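The autodetection shown in the session above can be approximated with the heuristic from section 3 of RFC 4627: a JSON text starts with two ASCII characters, so the pattern of zero octets at the start reveals the encoding. What follows is only a sketch of that heuristic, not the module's actual implementation. The module name `detect_sketch` and its functions are hypothetical, and the decoding step assumes the standard `unicode` module from later OTP releases.

```erlang
-module(detect_sketch).
-export([detect/1, decode/1]).

%% Guess the encoding of a JSON text (a list of octets) from the
%% pattern of zero octets at its start, following RFC 4627 section 3.
detect([0,0,0,_|_]) -> {utf32, big};     % 00 00 00 xx
detect([_,0,0,0|_]) -> {utf32, little};  % xx 00 00 00
detect([0,_|_])     -> {utf16, big};     % 00 xx
detect([_,0|_])     -> {utf16, little};  % xx 00
detect(_)           -> utf8.             % anything else

%% Detect the encoding, then decode the octets to Unicode code points.
decode(Octets) ->
    Enc = detect(Octets),
    case unicode:characters_to_list(list_to_binary(Octets), Enc) of
        Chars when is_list(Chars) -> {ok, Enc, Chars};
        _ErrorOrIncomplete -> {error, Enc}
    end.
```

With this sketch, `detect_sketch:decode([34,0,228,0,34,0])` should report `{utf16,little}` together with the code points `[34,228,34]`, while the ill-formed input `[34,228,34]` is classified as UTF-8 and then rejected by the decoder.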

Now let’s look at decoding some UTF-8 encoded JSON text into Erlang terms, and vice versa.

5> rfc4627:decode([34,194,128,34]).
{ok,<<194,128>>,[]}
6> rfc4627:encode(<<194,128>>).
[34,194,128,34]
7> rfc4627:encode_noauto(<<194,128>>).
[34,128,34]
8> rfc4627:unicode_encode({'utf-32le',
        rfc4627:encode_noauto(<<194,128>>)}).
[34,0,0,0,128,0,0,0,34,0,0,0]
9> rfc4627:encode_noauto({obj, [{[27700], 123}]}).
[123,34,27700,34,58,49,50,51,125]
10> rfc4627:encode({obj, [{[27700], 123}]}).
"{\"æ°´\":123}"
11> 

Notice, in that final example, that Erlang prints the UTF-8 encoded JSON text as if it were Latin-1. This is nothing to worry about: the numbers in the returned list/string include the bytes 230, 176, 180, which are the correct UTF-8 encoding for Unicode code point 27700 (水).
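To check the arithmetic independently of rfc4627, here is a small sketch that encodes a code point as UTF-8 and then reads the resulting bytes as Latin-1, which is what the shell is effectively doing when it prints the string. It assumes the standard `unicode` module from later OTP releases; the module name `latin1_view_sketch` and both function names are hypothetical.

```erlang
-module(latin1_view_sketch).
-export([utf8_bytes/1, as_latin1/1]).

%% UTF-8 encode a single Unicode code point and return the bytes as a list.
utf8_bytes(CodePoint) when is_integer(CodePoint) ->
    binary_to_list(unicode:characters_to_binary([CodePoint], unicode, utf8)).

%% Read a list of bytes as Latin-1 text. Latin-1 maps every byte to the
%% code point with the same number, so the result has one character per byte.
as_latin1(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), latin1).
```

Here `latin1_view_sketch:utf8_bytes(27700)` should give `[230,176,180]`, and viewed as Latin-1 those three bytes are the characters æ, ° and ´, which is exactly the "æ°´" seen in the shell output.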

Comments

On 3 October, 2007 at 11:01 pm, Ciaran wrote:

Thanks for that - I’ve been using your rfc2467 module in a little test project and came up against some unicode issues. I’ll grab the latest version and have a play.

On 3 October, 2007 at 11:03 pm, Ciaran wrote:

Hmm, 2467 - Transmission of IPv6 Packets over FDDI Networks!? You would think I would have got used to typing 4627 by now.