Proper Unicode support in Erlang RFC4627 (JSON) module
This page is a mirrored copy of an article originally posted on the (now sadly defunct) LShift blog; see the archive index here.
Wed, 3 October 2007
In a previous post I explored some of the options for supporting RFC4627 (JSON) Unicode-in-strings well when mapping to Erlang terms. In the end, I settled on keeping the interface almost unchanged: the only change is that binaries returned from rfc4627:decode are now to be interpreted as UTF-8 encoded text, whereas before their interpretation was less well defined.
The new module is available as a tarball (automatically generated from the github repository) or by browsing online here. You can also get the code using git:
git clone git://github.com/tonyg/erlang-rfc4627.git
Here are some examples using the new module. First, let’s explore the autodetection of which encoding is being used. In the following example, we see UTF-16, both big- and little-endian, as well as ill-formed and well-formed examples of UTF-8 being passed through the autodetector. (It also supports UTF-32 big- and little-endian.)
Eshell V5.5.5 (abort with ^G)
1> rfc4627:unicode_decode([34,0,228,0,34,0]).
{'utf-16le',"\"ä\""}
2> rfc4627:unicode_decode([0,34,0,228,0,34]).
{'utf-16be',"\"ä\""}
3> rfc4627:unicode_decode([34,228,34]).
** exited: {ucs,{bad_utf8_character_code}} **
4> rfc4627:unicode_decode([34,195,164,34]).
{'utf-8',"\"ä\""}
5>
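For readers curious how this kind of autodetection can work, here is a minimal sketch based on the heuristic in RFC 4627, section 3: since a JSON text starts with ASCII characters, the pattern of zero octets among the first four bytes reveals the encoding. The module name `json_enc_detect` and the function `detect/1` are hypothetical, chosen for illustration; this is not the code inside the rfc4627 module, which may well do things differently (for instance, this sketch ignores byte-order marks and does no validation).

```erlang
-module(json_enc_detect).
-export([detect/1]).

%% Guess the encoding of a JSON text from the zero octets in its first
%% four bytes (RFC 4627, section 3). Clause order matters: the UTF-32
%% patterns must be tried before the UTF-16 ones, because `_` also
%% matches a zero byte.
detect(<<0, 0, 0, _, _/binary>>) -> 'utf-32be';
detect(<<_, 0, 0, 0, _/binary>>) -> 'utf-32le';
detect(<<0, _, 0, _, _/binary>>) -> 'utf-16be';
detect(<<_, 0, _, 0, _/binary>>) -> 'utf-16le';
%% Anything else, including inputs shorter than four bytes, is assumed
%% to be UTF-8. Whether it is well-formed UTF-8 is a separate check.
detect(Bin) when is_binary(Bin) -> 'utf-8'.
```

Calling `json_enc_detect:detect(list_to_binary([34,0,228,0,34,0]))` yields `'utf-16le'`, matching the first shell example above. The ill-formed input `[34,228,34]` is still guessed to be UTF-8 here; it only fails later, when the bytes are actually decoded.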
Now let’s look at decoding some UTF-8 encoded JSON text into Erlang terms, and vice versa.
5> rfc4627:decode([34,194,128,34]).
{ok,<<194,128>>,[]}
6> rfc4627:encode(<<194,128>>).
[34,194,128,34]
7> rfc4627:encode_noauto(<<194,128>>).
[34,128,34]
8> rfc4627:unicode_encode({'utf-32le', rfc4627:encode_noauto(<<194,128>>)}).
[34,0,0,0,128,0,0,0,34,0,0,0]
9> rfc4627:encode_noauto({obj, [{[27700], 123}]}).
[123,34,27700,34,58,49,50,51,125]
10> rfc4627:encode({obj, [{[27700], 123}]}).
"{\"æ°´\":123}"
11>
Notice, in that final example, that Erlang is printing the UTF-8 encoded JSON text as if it were Latin-1. This is nothing to worry about: the numbers in the returned list/string (230, 176, 180 for the three characters between the quotes) are the correct UTF-8 encoding of Unicode code point 27700.
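If you want to confirm the byte values for yourself, the following sketch converts between code points and UTF-8 bytes. It assumes a more recent Erlang/OTP release than the V5.5.5 shell shown above, one that ships the standard `unicode` module; the module name `utf8_check` and both function names are hypothetical helpers for illustration and are not part of rfc4627.

```erlang
-module(utf8_check).
-export([encode_codepoint/1, decode_bytes/1]).

%% Return the UTF-8 encoding of a single Unicode code point as a list
%% of byte values.
encode_codepoint(CodePoint) ->
    binary_to_list(unicode:characters_to_binary([CodePoint])).

%% Interpret a list of byte values as UTF-8 and return the code points.
%% For ill-formed input, unicode:characters_to_list/2 returns an
%% {error, ...} tuple instead of a list.
decode_bytes(Bytes) ->
    unicode:characters_to_list(list_to_binary(Bytes), utf8).
```

With this, `utf8_check:encode_codepoint(27700)` gives `[230,176,180]`, which the shell displays as "æ°´" when it treats the list as a Latin-1 string, and `utf8_check:decode_bytes([194,128])` gives `[128]`, the code point that encode_noauto emitted in step 7.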
Comments
On 3 October, 2007 at 11:01 pm,
wrote:
Thanks for that - I’ve been using your rfc2467 module in a little test project and came up against some unicode issues. I’ll grab the latest version and have a play.
On 3 October, 2007 at 11:03 pm,
wrote:
Hmm, 2467 - Transmission of IPv6 Packets over FDDI Networks!? You would think I would have got used to typing 4627 by now.