Unicode, UTF-8, and All That: A Short Note
I don't know how many times I've heard people conflate UTF-8 with Unicode. I have come to believe that distinguishing the two is the first essential step in becoming "internationalization-literate".
Perhaps the problem is that representing a character according to the Unicode standard involves two levels of indirection (or more; see http://unicode.org/reports/tr17/ for all the nitty-gritty details). The first is the assignment of a number, called a code point, to every character of each of the world's languages. To quote from the Unicode Consortium Web site: "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." Since there are far more than 256 such characters, a code point does not in general fit in a single byte, so this isn't the end of the story. The second stage is the representation of this number as a byte sequence. This is the "encoding" phase. The Unicode standard lays out several encoding methods (UTF-8, UTF-16, and UTF-32), and UTF-8 is just one of them. (What distinguishes UTF-8 is that all ASCII characters require just one byte, while every other character takes two to four. Others, such as UTF-16 and the older fixed-width UCS-2, require at least two bytes for every character, and thus waste a lot of space when used for predominantly ASCII text.)
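To make the two stages concrete, here is a minimal Python sketch. The helper name `describe` and the sample characters are my own choices for illustration, not anything defined by the Unicode standard. It assumes Python 3.8 or later (for the separator argument to `bytes.hex`). The code point is the same whatever the encoding; only the byte sequence changes.

```python
def describe(ch: str) -> dict:
    """Show one character's code point and its bytes in three encodings."""
    return {
        # Stage 1: the character's number (code point), independent of bytes.
        "code_point": f"U+{ord(ch):04X}",
        # Stage 2: the same number serialized by different encodings.
        "utf-8": ch.encode("utf-8").hex(" "),
        "utf-16-be": ch.encode("utf-16-be").hex(" "),
        "utf-32-be": ch.encode("utf-32-be").hex(" "),
    }


for ch in ("A", "\u00e9", "\u20ac"):  # A, e with acute accent, euro sign
    print(ch, describe(ch))

# Predominantly ASCII text is where UTF-8 saves space.
text = "hello"
print(len(text.encode("utf-8")))      # 5 bytes
print(len(text.encode("utf-16-be")))  # 10 bytes
```

For "A" the UTF-8 column is the single byte 41, while the euro sign (U+20AC) needs three bytes in UTF-8 (e2 82 ac) but only two in UTF-16 (20 ac). So UTF-8 is not always the smallest choice, only the smallest for text that is mostly ASCII.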




Comments
Reminds me of
By Anonymous on February 3, 2010
...Joel On Software's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
~Brian
www.CallMeKung.com