All entries for Tuesday 19 September 2006

September 19, 2006

Unicode, UCS, and encodings

About once a year or so I come up against some multi-byte text issue that requires me to learn a bit more about how Unicode works. Each time, I come away from it thinking ‘yeah; I understand this now’. So far this has always turned out to be a bit wrong. However, I’ve moved forwards a little bit today, by coming to grips with what the UCS is.

This episode was sparked by a question from Hongfeng, who asked ‘how come our web pages are served as iso-8859-1, but they can still display Chinese characters?’

It’s a good question; iso-8859-1 is a very small and parochial encoding, designed specifically for efficient coding of Western European text. It specifies encodings for the letters of the Roman alphabet, plus their accented versions, and common symbols, at one byte per character. But, unlike an encoding such as UTF-16 that can represent the whole character set, it doesn’t know anything about non-European alphabets. So when the browser sees a sequence like ‘&#1488;’, how does it know that it should convert that into an ‘aleph’ (א) glyph?
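To make the puzzle concrete, here is a minimal Python sketch (my own illustration, not anything the browser literally runs). It uses the standard library's `html.unescape` to resolve the numeric reference, and a hypothetical helper `fits_latin1` to check whether a string can be encoded as iso-8859-1 at all.

```python
import html

# Resolve a numeric character reference roughly the way an HTML parser would.
aleph = html.unescape("&#1488;")
print(aleph, ord(aleph))


def fits_latin1(text: str) -> bool:
    """Hypothetical helper: True if every character can be encoded as iso-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(fits_latin1("caf\u00e9"))  # accented Latin letters are covered
print(fits_latin1(aleph))        # Hebrew letters are not
```

The reference itself is made of plain ASCII characters, which is why it survives in an iso-8859-1 page even though the character it names does not.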

The answer is the UCS (the Universal Character Set). The UCS is the uber-charset; a set of about a hundred thousand characters in just about every alphabet from Cherokee to Olde English. Each character in each alphabet is given a unique number in the UCS, its code point, so there’s never any question of people disagreeing on whether 1488=’B’ or 1488=’aleph’ (though it’s quite possible that two code points might refer to two indistinguishable glyphs).
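A quick way to poke at code points is Python's `ord` and the standard `unicodedata` module. The sketch below is only an illustration; the three look-alike letters are my own choice of example for the "indistinguishable glyphs" point.

```python
import unicodedata

# Every character has exactly one code point, and the code point has a name.
for ch in ["B", "\u05d0"]:
    print(ord(ch), f"U+{ord(ch):04X}", unicodedata.name(ch))

# Three distinct code points whose glyphs look the same in most fonts:
# Latin A, Greek Alpha, Cyrillic A.
lookalikes = ["\u0041", "\u0391", "\u0410"]
code_points = [ord(ch) for ch in lookalikes]
names = [unicodedata.name(ch) for ch in lookalikes]
for cp, name in zip(code_points, names):
    print(cp, name)
```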

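iso-8859-1, utf-8, utf-16, and the other encodings are just different ways of turning UCS characters into bytes, each optimised for a particular kind of text. iso-8859-1 covers only a small subset of the UCS (the first 256 code points) at one byte per character, whereas the UTF-* encodings can represent all of it, at the cost of using more than one byte for many characters. So if you want to send a page of French text, you can do it in fewer bytes if you encode into iso-8859-1 than if you use UTF-*: every accented letter costs two bytes in UTF-8, and UTF-16 needs two bytes for every character. But HTML gives you the option of specifying a raw UCS code point, via the &#{number}; notation. So as soon as you come up against a character that’s not in your target charset, just look up its UCS code point and encode away.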
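The size trade-off and the numeric-reference fallback can both be sketched in a few lines of Python. This is an illustration under my own assumptions (the sample strings are made up); it relies on the standard `xmlcharrefreplace` error handler, which swaps any character the target encoding lacks for a &#{number}; reference.

```python
import html

# A short French sentence; \u00f9 is u-grave and \u00e8 is e-grave.
text = "O\u00f9 est la biblioth\u00e8que ?"

# Byte counts for the same text in three encodings
# (utf-16-le is used so that no byte-order mark is counted).
sizes = {enc: len(text.encode(enc)) for enc in ("iso-8859-1", "utf-8", "utf-16-le")}
print(sizes)

# Text containing a character that iso-8859-1 cannot represent.
mixed = "caf\u00e9 \u05d0"

# Characters missing from the target charset become numeric references.
encoded = mixed.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)

# A consumer that decodes the bytes and resolves references gets the text back.
round_trip = html.unescape(encoded.decode("iso-8859-1"))
print(round_trip == mixed)
```

For mostly-ASCII text the gap between iso-8859-1 and UTF-8 is small; it is UTF-16 that doubles the size.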

Using UCS code points is therefore somewhat more reliable than using high-byte characters. For instance, if you want to use ‘smart quotes’ in XHTML you can either use numeric references to the UCS code points 8220 and 8221, or the UTF-8 byte sequences e2 80 9c and e2 80 9d, or the HTML entities &ldquo; and &rdquo;. But if you use the HTML entities you won’t be able to parse your content as XML unless you predefine the entities wherever your markup is to be used; if you use the UTF-8 version you won’t be able to use your markup on a page which isn’t utf-8 (which is a pain when you’re syndicating other people’s data). If you use the numeric references, it will cost you a few extra bytes per character (&#8220; is seven bytes, against three in UTF-8), but it’s universally reusable. Which is good.
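The three options for smart quotes can be compared directly. The following is a rough sketch using Python's standard `xml.etree.ElementTree` as a stand-in for "an XML parser"; the helper name `parses_as_xml` and the sample fragments are hypothetical, and other parsers may report the failure differently.

```python
import xml.etree.ElementTree as ET

left, right = "\u201c", "\u201d"  # code points 8220 and 8221

# The raw UTF-8 bytes for the two quote characters.
utf8_bytes = (left + right).encode("utf-8")
print(utf8_bytes.hex())

# The same characters as numeric references, safe in any ASCII-compatible charset.
refs = (left + right).encode("ascii", errors="xmlcharrefreplace")
print(refs)


def parses_as_xml(fragment: str) -> bool:
    """Hypothetical helper: True if the fragment is well-formed XML on its own."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Numeric references need no declarations; named HTML entities do.
numeric_ok = parses_as_xml("<p>&#8220;hi&#8221;</p>")
named_ok = parses_as_xml("<p>&ldquo;hi&rdquo;</p>")
print(numeric_ok, named_ok)
```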

A couple of useful references for more info:
