Tuesday, November 18, 2003.
A few months ago, Joel Spolsky exhorted developers everywhere to learn about Unicode. My response: הללויה! ("Hallelujah," in case your browser is Unicode-challenged). I've done a lot of work with international text in the last couple of years—including a fair amount of time spent fixing code that can't handle anything other than ISO-Latin-1 (or 7-bit ASCII)—so any progress we can make in education represents time well spent.
To that end, Amar recently discovered some weirdness in some web pages using seemingly-basic Latin-1 characters (like “é”). Here's what I told him:
Basically what's happening is that the text you added to your weblog is UTF-8 encoded, but DiaryLand isn't tagging the pages it sends out as UTF-8 text. So the browser is free to interpret the text in whatever way seems most convenient. This has a number of goofy results.

Case 1: Weblog source data is UTF-8; browser interprets as ISO-Latin-1.

Case 1 is what's causing the A-tilde and copyright when you look at the post to jcruelty. Let's look at what happened in "fiancé". Rather than encoding e-with-acute-accent as 0xE9 (which is latin-1 high-ascii for that character), you encoded it as UTF-8:

    1100 0011 | 1010 1001   <= this is how it looks in the page content (0xC3 0xA9).

I don't know how it got there; maybe you posted it from some Mac OS X aqua weblogging client; Aqua is all about the Unicode.

    110x xxxx | 10xx xxxx   <= this is the UTF-8 template for "character between 0x80 and 0x7FF".
    ---0 0011 | --10 1001   <= these are the bits that sit where the x's are.

To reconstruct the Unicode for which character that is, take all the x's and mush them together at the end of a 16-bit field. Presto.

    0000 0000 | 1110 1001   <= Lo, it is Unicode 0x00E9, commonly written "U+00E9", which is "é".

And we're done! But if the browser isn't looking at it with UTF-8 glasses, it sees 0xC3A9, which is "A-with-tilde" "copyright" in Latin1.

Case 1 is also causing the sample Chinese character which you carefully pasted into your weblog post (U+9BA0, ‘鮠’, which I think is some kind of fish) to show up as “é®” (e-acute, registered-trademark) instead. Its UTF-8 encoding is E9 AE A0; the trailing A0 is a Latin-1 non-breaking space, so you don't see it.

So this explains why the stuff you posted looked fine when you posted it, and looks weird on different browsers. But what of the stuff you saw that started this whole discussion? That's a trickier problem, and it brings us to...

Case 2: Weblog source data is Latin-1; browser interprets as bogus UTF-8.

What a broken browser you must have! Based on the data at http://jwinokur.diaryland.com/, I see the following sequence ("fiancé. "):

    é  .  SP
    E9 2E 20

A browser that's got its heart set on seeing UTF-8 (!!) will see this as:

    1110 xxxx | 10xx xxxx | 10xx xxxx   <= This is UTF-8 for "character between 0x800 and 0xFFFF".
    1110 1001 | 0010 1110 | 0010 0000   <= Whoa, this doesn't really fit the template! Only the first byte matches; the other two start with 00, not 10.
    ---- 1001 | --10 1110 | --10 0000   <= If it did, this is what it would encode.
                1001 1011 | 1010 0000   <= The x's mushed together: U+9BA0.

So, what is U+9BA0? It's ‘鮠’, the same fish from your example. A UTF-8-aware browser ought to reject this sequence and show '?' (or some other way of indicating "bad character"), because the second and third bytes lack the 10 marker. A decoder that skips that check would land right on the character you saw. In case that isn't what happened, we might be in Secret Case 3: Something Else!

Case 3: Weblog source data is Latin-1; browser interprets as some random non-Unicode character set.

Example: Japanese Shift-JIS (aka Microsoft Codepage 932). Shift-JIS is a common wire encoding of Japanese. It's not the official standard for Japanese text (that would be ISO-2022-JP, a much less efficient 7-bit range-shifted encoding) but it is the Japanese character set you're most likely to find in the wild. I don't know all the rules for CP932 off the top of my head, but that's what the Internet is for: CP932 reference.

So, the magic sequence is still E9 2E 20. We might not need all of that. Let's investigate. Just like in UTF-8, setting the high bit (a first hex digit of 0x8-0xF in the first byte) means "something special here". Click on the "E9" link on that URL and you're taken to the E9 subset of Shift-JIS, in which we can see that 0xE92E is ... illegal! So we're looking at illegal Shift-JIS. A Shift-JIS-aware browser ought to show this as '?'.

Maybe the browser selected Case 2 or Case 3 and displayed some erroneous CJK character as a last-ditch effort to save the conversion. There are a lot of buggy character-set conversion routines out there.
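The bit-shuffling in Cases 1 and 2 can be sketched in a few lines of Python. This is only an illustration, not what any browser actually runs: the helper names `decode_two_byte` and `sloppy_decode_three_byte` are made up for this sketch, and it assumes the three bytes after "fianc" are E9 2E 20 (e-acute, period, space; a space is 0x20, or 32 in decimal).

```python
def decode_two_byte(b0, b1):
    """Decode a well-formed 2-byte UTF-8 sequence (110x xxxx | 10xx xxxx)."""
    if (b0 & 0xE0) != 0xC0 or (b1 & 0xC0) != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Keep the five x bits of the lead byte and the six x bits of the trail byte.
    return ((b0 & 0x1F) << 6) | (b1 & 0x3F)


def sloppy_decode_three_byte(b0, b1, b2):
    """Apply the 3-byte template (1110 xxxx | 10xx xxxx | 10xx xxxx)
    WITHOUT checking the '10' markers, as a hypothetical buggy decoder might."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# Case 1: the bytes C3 A9 really are UTF-8 for e-acute.
e_acute = decode_two_byte(0xC3, 0xA9)
print(f"C3 A9    -> U+{e_acute:04X}")

# Case 2: the Latin-1 bytes for "e-acute, period, space".
latin1_bytes = "\u00e9. ".encode("latin-1")
print(" ".join(f"{b:02X}" for b in latin1_bytes))

# A strict decoder refuses, because 2E and 20 are not continuation bytes.
try:
    latin1_bytes.decode("utf-8")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "rejected"
print("strict UTF-8 decoder:", strict_result)

# A sloppy decoder extracts the x bits anyway.
sloppy = sloppy_decode_three_byte(*latin1_bytes)
print(f"E9 2E 20 -> U+{sloppy:04X}")
```

Python's built-in decoder stands in for the well-behaved browser here; the sloppy helper is the interesting part, since it shows how skipping one validity check turns ordinary Latin-1 punctuation into a CJK code point.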
By the way, the version of "fiancé" that you posted as an example of the bizarre Chinese character result (again, converted by your posting method to UTF-8 for easy inspection) uses the character U+9BA0 (鮠). Looking at the CP932 chart (you've still got that open, right?) we see that it is encoded in SJIS with 0xE9BC, which would imply that instead of "fiancé." your browser saw "fiancé¼" ("fianc" + "e acute" + "one-quarter fraction").
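Python ships a `cp932` codec, so the Shift-JIS lookups can be cross-checked without the chart. A small sketch, again assuming the suspect bytes are E9 2E 20; it prints whatever the codec reports for U+9BA0 rather than hard-coding an expected byte pair.

```python
suspect = bytes([0xE9, 0x2E, 0x20])

# Forward direction: is E9 2E ... legal Shift-JIS (CP932)?
try:
    suspect.decode("cp932")
    cp932_verdict = "decoded"
except UnicodeDecodeError as err:
    cp932_verdict = "illegal"
    print("cp932 decoder says:", err.reason)
print("E9 2E 20 as CP932:", cp932_verdict)

# Reverse direction: which CP932 bytes would produce the fish character,
# and what would those bytes look like to a Latin-1 reader?
fish = "\u9ba0"
sjis_bytes = fish.encode("cp932")
print("U+9BA0 in CP932:", " ".join(f"{b:02X}" for b in sjis_bytes))
print("read as Latin-1:", ascii(sjis_bytes.decode("latin-1")))
```

The output goes through `ascii()` so the script also runs on a terminal that can't display the characters.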
The totally awesome text editor Vim is a useful tool for seeing what's going on, especially if you have Unicode fonts for it to use (but you can still sort of make out what's going on if you don't). In this first screenshot, I opened a new gvim (graphical Vim) window and asked it to download Amar's homepage into the empty buffer, with the following command:
:read http://jcruelty.diaryland.com
This is the result:
That's Amar's page, interpreted in the default encoding for Windows apps, ISO-Latin-1. (Well, Microsoft Windows Codepage 1252, but who's counting?) If you move the cursor to a character you want to know more about, and then enter the keystroke ga, Vim will tell you all about the character under the cursor.
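If Vim isn't handy, a rough stand-in for that lookup can be sketched with Python's `unicodedata` module. The `describe` helper is a made-up name, and its output is only loosely modelled on what Vim reports, not a copy of it.

```python
import unicodedata


def describe(ch):
    """Report a character's code point in hex, decimal and octal, plus its name."""
    cp = ord(ch)
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{cp:04X} (decimal {cp}, octal {cp:o}) {name}"


# The two characters a Latin-1 reader sees in place of e-acute, then e-acute itself.
for ch in "\u00c3\u00a9\u00e9":
    print(describe(ch))
```

Running it over the mojibake pair makes the A-tilde and copyright sign from Case 1 easy to identify by name.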
Next I asked Vim to interpret the same data as UTF-8:
:set encoding=utf-8
Here's what that looks like:
Cool.
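The two Vim views amount to decoding the same bytes twice. Here is a minimal Python sketch of that idea, using the word "fiancé" instead of the downloaded page; the `view` helper is a made-up name, and `errors="replace"` is just one way a decoder might flag bad input.

```python
def view(raw, encoding):
    """Interpret raw bytes under the given encoding, marking undecodable bytes."""
    return raw.decode(encoding, errors="replace")


word = "fianc\u00e9"
as_utf8 = word.encode("utf-8")      # 66 69 61 6E 63 C3 A9
as_latin1 = word.encode("cp1252")   # 66 69 61 6E 63 E9

# Every combination of "how it was stored" and "how it is read".
for label, raw in (("UTF-8 source", as_utf8), ("Latin-1 source", as_latin1)):
    for encoding in ("cp1252", "utf-8"):
        print(f"{label:15} viewed as {encoding:7}: {ascii(view(raw, encoding))}")

# The pasted CJK character from Case 1, read through Latin-1 glasses.
fish_mojibake = view("\u9ba0".encode("utf-8"), "cp1252")
print(ascii(fish_mojibake))
```

Only the matching pairs come out right; the mismatched ones give either A-tilde plus copyright (Case 1) or a replacement character where a strict decoder gave up (Case 2).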
I hope this is informative. For more information, consult unicode.org, RFC2279 (the UTF-8 specification), the Joel rant I mentioned, or your local library.
(Amar: Consider this version 1.1 of the email I sent you. I cleaned up a few things, and added some pertinent URLs.)