Unicode is a character set: it defines a mapping from characters to numbers, called code points. Encodings such as UTF-8 and UTF-16 define how those code points are stored as bytes.
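As a minimal sketch of the difference between a code point and its encoded bytes, the snippet below takes one character and encodes it two ways. The sample character (U+4E2D) and the variable names are chosen only for illustration, and the u"" literal is used so that the snippet should run on both Python 2.7 and Python 3.

```python
# One character has one code point but several possible byte encodings.
ch = u"\u4e2d"                         # a single Chinese character, U+4E2D

code_point = ord(ch)                   # the number Unicode assigns to it
utf8_bytes = ch.encode("utf-8")        # the same character as UTF-8 bytes
utf16_bytes = ch.encode("utf-16-be")   # and as big-endian UTF-16 bytes

print(hex(code_point))
print(len(utf8_bytes))    # number of bytes in the UTF-8 form
print(len(utf16_bytes))   # number of bytes in the UTF-16 form
```

The code point stays the same in both cases; only the byte representation changes with the encoding.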

When the word “Unicode” is used as the name of an encoding (as in Java or .NET), it usually means UTF-16.

The naive way to encode Unicode is called “UCS-2”, which uses a fixed 2 bytes (for example, in big-endian order) per code point and can therefore only represent code points up to U+FFFF. In addition, an encoded string may contain many zero bytes, and a zero byte marks the end of a string in C. For this reason, UCS-2 is not suitable for UNIX-like systems.
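The zero-byte problem can be sketched as follows. Python has no codec named "ucs-2", so this example assumes "utf-16-be" as a stand-in; for characters up to U+FFFF outside the surrogate range it produces the same bytes as big-endian UCS-2. The variable names are illustrative.

```python
# Stand-in for big-endian UCS-2: identical bytes for ordinary BMP characters.
text = u"hi"
encoded = text.encode("utf-16-be")     # every ASCII letter gets a leading zero byte

# A C-style string ends at the first zero byte, so C code would stop here.
first_nul = encoded.find(b"\x00")
c_visible = encoded[:first_nul] if first_nul != -1 else encoded

print(repr(encoded))
print(first_nul)          # position of the first zero byte
print(len(c_visible))     # how many bytes a C string function would see
```

Because the very first byte is zero, a function such as strlen() would treat the whole string as empty.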

Other encodings (UTF-8, etc.)

Many documents are written in pure English. If we encode all of them in UTF-16, they take twice the space they would take in ASCII. But if we encode them in ASCII, it becomes impossible to quote text in other languages in those documents.
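The trade-off can be checked with a short sketch. The sample strings and variable names are made up for illustration; the code is written to run on Python 2.7 and Python 3.

```python
english = u"Hello, world"                        # 12 ASCII characters

ascii_size = len(english.encode("ascii"))        # 1 byte per character
utf16_size = len(english.encode("utf-16-be"))    # 2 bytes per character

# ASCII has no codes for non-English characters, so encoding them fails.
try:
    u"\u4e2d\u6587".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_size)
print(utf16_size)
print(ascii_ok)
```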

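To address these problems, encodings such as UTF-8 can be used. UTF-8 uses a variable number of bytes per character, which is space-efficient for English text. The original design allowed sequences of up to 6 bytes, enough for 2^31 code points; the current standard (RFC 3629) limits sequences to 4 bytes, which covers the whole Unicode range up to U+10FFFF. UTF-8 also has the advantage of being compatible with ASCII: any ASCII text is already valid UTF-8 with exactly the same bytes. Many web pages are transferred in this encoding.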
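A minimal sketch of both properties, variable length and ASCII compatibility. The sample characters are arbitrary picks from different code point ranges, and the names are illustrative.

```python
# Characters from increasingly high code point ranges.
samples = [u"A", u"\u00e9", u"\u4e2d", u"\U0001F600"]

# Number of UTF-8 bytes each character needs.
lengths = [len(s.encode("utf-8")) for s in samples]

# For pure ASCII text, the UTF-8 bytes and the ASCII bytes are identical.
ascii_text = u"plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

print(lengths)
print(same_bytes)
```

English letters cost one byte each, while characters further up the code point range cost more, so the common case stays cheap.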

Unicode in Python

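To handle multilingual text, Python 2 provides a Unicode string type (unicode) in addition to the default string type (str), which is really a string of bytes. To construct a Unicode string, call unicode() or write a u"..." literal. Conversion between the two is done with the decode and encode methods: decode turns a byte string into a Unicode string, and encode turns a Unicode string into a byte string. A Unicode string conceptually holds code points rather than bytes in any particular encoding; how it is stored internally is an implementation detail.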
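The round trip between the two types can be sketched like this. The byte values are the UTF-8 encoding of two Chinese characters (U+4E2D U+6587), picked for illustration. On Python 2 the decoded value is a unicode object; on Python 3 the same calls work, with bytes and str playing the two roles.

```python
raw = b"\xe4\xb8\xad\xe6\x96\x87"   # UTF-8 bytes for U+4E2D U+6587

text = raw.decode("utf-8")          # byte string -> Unicode string
back = text.encode("utf-8")         # Unicode string -> byte string

print(len(raw))       # counts bytes
print(len(text))      # counts characters
print(back == raw)    # the round trip gives back the original bytes
```

Note that len() answers a different question for each type: bytes for the byte string, characters for the Unicode string.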

Here is an example from a real application:

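# Python 2: read() returns the page as a byte string (str)
import urllib
page = urllib.urlopen("http://www.guokr.com/").read()
print page  # writes the raw bytes to the terminal unchanged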

The code above may not print Chinese characters correctly, because page is an ordinary byte string: print writes its bytes out unchanged, with no knowledge of which encoding they are in.

The exact behavior of this code depends on your system, because print simply passes the bytes to the system and lets it handle the output. If your system's default encoding is compatible with UTF-8, the text prints correctly; otherwise it comes out garbled. Either way, the idea is the same.
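To see which encoding your own system uses, a small sketch like the following may help. It relies only on the standard sys and locale modules; the variable names are illustrative, and the values printed differ from machine to machine.

```python
import locale
import sys

# Encoding attached to standard output. It can be None, for example
# when the output is piped to a file under Python 2.
stdout_encoding = getattr(sys.stdout, "encoding", None)

# Encoding the current locale prefers for text data.
preferred = locale.getpreferredencoding()

print(stdout_encoding)
print(preferred)
```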

import urllib
page = urllib.urlopen("http://www.guokr.com/").read()
# Decode the UTF-8 bytes into a Unicode string before printing
print page.decode("utf-8")

Decoding the page content returns a Unicode string. When asked to print a Unicode string, Python encodes it with the encoding of the output stream, so the correct result is printed.