dsandler.org :: essays

why can't amar read (unicode)?

© 2003 dan sandler

Tuesday, November 18, 2003.

A few months ago, Joel Spolsky exhorted developers everywhere to learn about Unicode. My response: הללויה!‏ ("Hallelujah," in case your browser is Unicode-challenged). I've done a lot of work with international text in the last couple of years—including a fair amount of time spent fixing code that can't handle anything other than ISO-Latin-1 (or 7-bit ASCII)—so any progress we can make in education represents time well spent.

To that end, Amar recently discovered some weirdness in some web pages using seemingly-basic Latin-1 characters (like “é”). Here's what I told him:

Basically what's happening is that the text that you added to your
weblog is UTF-8 encoded, but DiaryLand isn't tagging the pages it sends
out as UTF-8 text. So the browser is free to interpret the text in
whatever way seems most convenient. This has a number of goofy results.
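
To make "tagging" concrete: over HTTP, the tag usually travels as a
charset parameter on the Content-Type header. Here is a minimal Python
sketch, assuming that mechanism. The helper name declared_charset and the
sample header values are made up for illustration; they are not taken
from DiaryLand's actual responses.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset named in a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Hypothetical header values, for illustration only.
print(declared_charset("text/html; charset=UTF-8"))  # a tagged page
print(declared_charset("text/html"))                  # an untagged page
```

When the second case comes back empty, the browser has nothing to go on
and has to guess.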

Case 1: Weblog source data is UTF-8; browser interprets as ISO-Latin-1.

Case 1 is what's causing the A-tilde and copyright when you look at the
post to jcruelty.  Let's look at what happened in "fiancé".
Rather than encoding e-with-acute-accent as 0xE9 (which is Latin-1
high-ascii for that character), you encoded it as UTF-8:

    1100 0011 | 1010 1001  <= this is how it looks in the page content.
                              I don't know how it got there; maybe you
                              posted it from some Mac OS X aqua
                              weblogging client; Aqua is all about the
    110x xxxx | 10xx xxxx  <= this is the UTF-8 template for
                              "character between 0x80 and 0x7FF".
    ---0 0011 | --10 1001     To reconstruct the Unicode for which
       | || \\_   || |  |     character that is, take all the x's and
        \||  \_\  || |  |     mush them together at the end of a 16-bit
         \\\   \\ || |  |     field.  Presto.
    0000 0000 | 1110 1001  <= Lo, it is Unicode 0x00E9, commonly written
                              "U+00E9", which is "é".  And we're done!

But if the browser isn't looking at it with UTF-8 glasses, it sees the
two bytes 0xC3 0xA9, which are "A-with-tilde" and "copyright" in Latin-1.
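
If you'd rather let a machine do the bit-mushing, here is a rough Python
sketch of the same steps. The function name decode_two_byte is mine, it
handles only the two-byte template (and doesn't bother rejecting overlong
forms), so treat it as an illustration rather than a real decoder.

```python
def decode_two_byte(b1, b2):
    """Decode one two-byte UTF-8 sequence (110xxxxx 10xxxxxx) by hand."""
    if b1 & 0b11100000 != 0b11000000:
        raise ValueError("first byte does not match 110x xxxx")
    if b2 & 0b11000000 != 0b10000000:
        raise ValueError("second byte does not match 10xx xxxx")
    # Keep the x bits: five from the first byte, six from the second.
    return ((b1 & 0b00011111) << 6) | (b2 & 0b00111111)


raw = bytes([0xC3, 0xA9])                      # the bytes in the page content
code_point = decode_two_byte(raw[0], raw[1])
print(f"U+{code_point:04X}", chr(code_point))  # U+00E9 é

# The same two bytes through Latin-1 glasses: one character per byte.
print(raw.decode("latin-1"))                   # Ã©
```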

Case 1 is also causing the sample Chinese character which you carefully
pasted into your weblog post (U+9BA0, ‘鮠’, which I think is some kind of
fish) to show up as “é®” (e-acute, registered-trademark) instead.
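
The same round trip is easy to reproduce in Python, assuming its codecs
are a fair stand-in for what a browser does. The variable names are just
for illustration.

```python
fish = "\u9BA0"                      # 鮠
utf8_bytes = fish.encode("utf-8")    # what ended up in the page source
print(utf8_bytes.hex(" "))           # e9 ae a0

# A browser that assumes Latin-1 turns each byte into one character.
misread = utf8_bytes.decode("latin-1")
print([f"U+{ord(ch):04X}" for ch in misread])  # ['U+00E9', 'U+00AE', 'U+00A0']
```

The third character, U+00A0, is a no-break space, which is why only the
e-acute and the registered-trademark sign are visible.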

So this explains why the stuff you posted looked fine when you posted
it, and looks weird on different browsers.  But what of the stuff you
saw that started this whole discussion?  That's a trickier problem, and
it brings us to...

Case 2: Weblog source data is Latin-1; browser interprets as bogus UTF-8.

What a broken browser you must have!  Based on the data at
http://jwinokur.diaryland.com/, I see the following sequence 
("fiancé. "):

    é . SP

A browser that's got its heart set on seeing UTF-8 (!!) will see this as:

    1110 xxxx | 10xx xxxx | 10xx xxxx  <= This is UTF-8 for "character
                                          between 0x800 and 0xFFFF".
    1110 1001 | 0010 1110 | 0010 0000  <= Whoa, this barely fits the
                                          template: bytes two and three
                                          start with 00, not 10.
    ---- 1001 | --10 1110 | --10 0000  <= If it did, this is what it
                                          would encode: 4 + 6 + 6 bits,
                                          mushed together as before.
                1001 1011 | 1010 0000  <= 0x9BA0.

So, what is U+9BA0?  It's '鮠', the very fish that turned up in your
browser.  A UTF-8-aware browser ought never to get this far: 0x2E and
0x20 aren't continuation bytes, so it ought to show the é as '?' (or
some other way of indicating "bad character") and leave the period and
the space alone.  A sloppy decoder that masks off the low six bits
without checking the top two would produce the fish and swallow the
period and the space.  Whether your browser is really that sloppy, I
can't say.
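
Here is a rough Python sketch for checking this by machine, assuming the
three bytes on the wire are 0xE9 0x2E 0x20 (é, period, space in Latin-1).
Python's built-in decoder stands in for a well-behaved UTF-8 browser.
sloppy_three_byte is a made-up function that imitates a decoder which
never checks the 10 prefix on the second and third bytes; I'm not
claiming any real browser works this way.

```python
raw = bytes([0xE9, 0x2E, 0x20])      # "é. " as Latin-1 bytes

# A strict decoder refuses: 0x2E is not a continuation byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print(strict_ok)                                     # False

# A forgiving one substitutes U+FFFD for the bad byte and moves on.
print(ascii(raw.decode("utf-8", errors="replace")))  # '\ufffd. '


def sloppy_three_byte(b1, b2, b3):
    """Mush 4 + 6 + 6 payload bits together without validating b2 and b3."""
    return ((b1 & 0x0F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)


code_point = sloppy_three_byte(*raw)
print(f"U+{code_point:04X}")                         # U+9BA0
```

U+9BA0 is 鮠, so a decoder this careless would show a fish where the é,
the period, and the space used to be.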

So we might be in Secret Case 3: Something Else!

Case 3: Weblog source data is Latin-1; browser interprets as some 
random non-Unicode character set.

Example: Japanese Shift-JIS (aka Microsoft Codepage 932).

Shift-JIS is a common wire encoding of Japanese. It's not the official
standard for Japanese text (that would be ISO-2022-JP, a much less
efficient 7-bit range-shifted encoding) but it is the Japanese character
set you're most likely to find in the wild. I don't know all the rules
for CP932 off the top of my head, but that's what the Internet is for:
CP932 reference.

So, the magic sequence is still E9 2E 20.  We might not need all of that.
Let's investigate.  Just like in UTF-8, setting the high bit (0x8-0xF in
the first byte) means "something special here".  Click on the "E9" link
on that URL and you're taken to the E9 subset of Shift-JIS, in which we
can see that 0xE92E is ... illegal!  So we're looking at illegal
Shift-JIS.  A Shift-JIS-aware browser ought to show this as '?'.
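
You can put the same question to Python instead of reading the chart.
This sketch assumes Python's cp932 and shift_jis codecs agree with that
reference on which byte pairs are legal; the helper name is_valid is made
up.

```python
def is_valid(data, encoding):
    """Report whether a strict decoder accepts the bytes in this encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


pair = bytes([0xE9, 0x2E])           # é followed by a period, in Latin-1
for name in ("latin-1", "cp932", "shift_jis"):
    print(name, is_valid(pair, name))
```

Latin-1 accepts any byte at all, so it is the two Japanese codecs that
are worth watching here.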

Maybe the browser selected Case 2 or Case 3 and displayed some erroneous
CJK character as a last-ditch effort to save the conversion.  There are
a lot of buggy character-set conversion routines out there.

By the way, the version of "fiancé" that you posted as an example
of the bizarre Chinese character result (again, converted by your
posting method to UTF-8 for easy inspection) uses the character U+9BA0
(鮠).  Looking at the CP932 chart (you've still got that open,
right?) we see that it is encoded in SJIS with 0xE9BC, which would imply
that instead of "fiancé." your browser saw
"fiancé¼" ("fianc" + "e acute" + "one-quarter fraction").  

The totally awesome text editor Vim is a useful tool in seeing what's going on, especially if you have Unicode fonts for it to use (but you can still sort of make out what's going on if you don't). In this first screenshot, I opened a new gvim (graphical Vim) window and asked it to download Amar's homepage into the empty buffer, with the following command:

:read http://jcruelty.diaryland.com

This is the result:

That's Amar's page, interpreted in the default encoding for Windows apps, ISO-Latin-1. (Well, Microsoft Windows Codepage 1252, but who's counting?) If you move the cursor to a character you want to know more about, and then enter the keystroke ga, Vim will tell you all about the character under the cursor.
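
If you don't have Vim handy, Python's unicodedata module can play the part of ga. This is a small sketch: describe is a made-up helper name, and the output format is mine, not Vim's.

```python
import unicodedata


def describe(text):
    """Return one 'U+XXXX NAME' line per character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


# "é" stored as UTF-8 but read as Latin-1.
for line in describe(b"\xc3\xa9".decode("latin-1")):
    print(line)
# U+00C3 LATIN CAPITAL LETTER A WITH TILDE
# U+00A9 COPYRIGHT SIGN
```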

Next I asked Vim to interpret the same data as UTF-8:

:set encoding=utf-8

Here's what that looks like:


I hope this is informative. For more information, consult unicode.org, RFC2279 (the UTF-8 specification), the Joel rant I mentioned, or your local library.

(Amar: Consider this version 1.1 of the email I sent you. I cleaned up a few things, and added some pertinent URLs.)

© 2004 :: daniel sandler :: dsandler*dsandler.org
PKI info :: 5DCF C171 0C41 5BB5 7810 FDA4 0ECC EA0A 8245 DBF4