Unicode is a character set: it defines a mapping from characters to numbers, called code points. Encodings such as UTF-8 and UTF-16 define how those code points are stored as bytes.
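To make the difference between a code point and its encoded bytes concrete, here is a minimal sketch. It uses a u"..." literal with an escape so that it should behave the same on Python 2.7 and Python 3.3+; the sample character (U+4E2D) and the variable names are just illustrative choices.

```python
# One character, one code point, several possible byte representations.
ch = u"\u4e2d"                          # a single Chinese character

code_point = ord(ch)                    # the number Unicode assigns to it
utf8_bytes = ch.encode("utf-8")         # the same code point as UTF-8 bytes
utf16_bytes = ch.encode("utf-16-be")    # and as big-endian UTF-16 bytes

print(hex(code_point))
print(len(utf8_bytes))                  # number of bytes in UTF-8
print(len(utf16_bytes))                 # number of bytes in UTF-16
```

The code point is the same in both cases; only the byte sequence that stores it changes.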
When the word “Unicode” is used to name an encoding, as in Java or .NET, it usually means UTF-16.
The naive way to encode Unicode is called “UCS-2”, which uses exactly 2 bytes (for example in big-endian order) to encode each code point. However, the encoded string may contain many zero bytes, since every ASCII character gets a zero high byte, and C uses a zero byte as the “end of string” marker. Thus, UCS-2 is not suitable for UNIX-like systems.
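The zero-byte problem is easy to see in a small sketch. Python has no codec named UCS-2, so this uses "utf-16-be", which produces the same bytes for characters that fit in 2 bytes. The helper c_strlen is hypothetical: it only imitates how C's strlen stops at the first zero byte.

```python
text = u"Hi"
encoded = text.encode("utf-16-be")   # same bytes as big-endian UCS-2 for this text

def c_strlen(buf):
    """Imitate C's strlen: count the bytes before the first zero byte."""
    data = bytearray(buf)
    pos = data.find(b"\x00")
    return len(data) if pos == -1 else pos

print(len(encoded))                        # real length of the encoded data
print(c_strlen(encoded))                   # what a C string function would see
print(c_strlen(text.encode("utf-8")))      # UTF-8 has no zero bytes for this text
```

Because the very first byte is zero, code that treats the buffer as a C string sees an empty string.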
Many documents are in pure English. If we encode all of them in UTF-16, they take twice the space of ASCII. But if we encode them in ASCII, it is impossible to quote strings in other languages in those documents.
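A quick sketch of this trade-off, using a made-up one-sentence "document". The "utf-16-be" codec is chosen here because it does not prepend a byte order mark, which keeps the size comparison clean.

```python
doc = u"The quick brown fox jumps over the lazy dog."

ascii_size = len(doc.encode("ascii"))       # 1 byte per character
utf16_size = len(doc.encode("utf-16-be"))   # 2 bytes per character, no BOM

# ASCII is compact, but it cannot hold text from other languages at all.
try:
    u"\u4e2d\u6587".encode("ascii")
    ascii_can_hold_chinese = True
except UnicodeEncodeError:
    ascii_can_hold_chinese = False

print(ascii_size)
print(utf16_size)
print(ascii_can_hold_chinese)
```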
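To address these problems, encodings like UTF-8 can be used. UTF-8 uses a variable number of bytes per character, which is space efficient for mostly-English text. The original design allowed up to 6 bytes and could represent 2^31 code points; the current standard (RFC 3629) limits it to 4 bytes, which still covers every Unicode code point up to U+10FFFF. UTF-8 also has the advantage of being compatible with ASCII: any ASCII text is already valid UTF-8. Many web pages are transferred in this encoding.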
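The variable width and the ASCII compatibility can both be checked in a few lines. The sample characters below are arbitrary picks, one for each UTF-8 length from 1 to 4 bytes.

```python
samples = [
    u"A",           # U+0041, plain ASCII
    u"\u00e9",      # U+00E9, Latin small letter e with acute
    u"\u4e2d",      # U+4E2D, a Chinese character
    u"\U0001F600",  # U+1F600, outside the 16-bit range
]

# Number of UTF-8 bytes needed for each sample character.
widths = [len(s.encode("utf-8")) for s in samples]

# Pure ASCII text produces identical bytes in ASCII and in UTF-8.
ascii_compatible = u"hello".encode("utf-8") == u"hello".encode("ascii")

print(widths)
print(ascii_compatible)
```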
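To handle multi-language text, Python 2 supports a Unicode string datatype besides the default byte string (str). To construct a Unicode string, simply call unicode() or write a u"..." literal. Conversion between the two goes through the decode method of a byte string (bytes to Unicode) and the encode method of a Unicode string (Unicode to bytes). A Unicode string holds code points; how Python stores it internally is an implementation detail and is not necessarily UTF-16.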
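A minimal round trip between the two string types might look like this. It is written with b"..." and escapes so that it should run on both Python 2.7 and Python 3; on Python 2, unicode(raw, "utf-8") would be equivalent to raw.decode("utf-8"). The byte values are the UTF-8 form of two Chinese characters.

```python
raw = b"\xe4\xb8\xad\xe6\x96\x87"   # byte string: 6 bytes of UTF-8 data

text = raw.decode("utf-8")          # bytes -> Unicode string (2 characters)
round_trip = text.encode("utf-8")   # Unicode string -> bytes again

print(len(raw))                     # counts bytes
print(len(text))                    # counts characters
print(round_trip == raw)
```

Note that len() counts bytes for the byte string but characters for the Unicode string, which is usually what you want when handling text.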
Here is an example in application:
The code above does not print the Chinese characters correctly, because we are using a normal (byte) string, which only holds raw bytes and knows nothing about the encoding they are in.
OK, the behavior of this code depends on your system. This is because what
After decoding the page content, a Unicode string is returned. Python knows how to encode a Unicode string for the terminal when printing it, so the correct result is printed out.
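The decoding step can be sketched without any network access. Everything here is an assumption for illustration: page_bytes stands in for the raw content of a downloaded page, and the page is assumed to be served in GBK. The second half shows why the codec matters, since the same bytes are not valid UTF-8.

```python
# Hypothetical stand-in for bytes read from a web page served in GBK.
page_text = u"\u4e2d\u6587"
page_bytes = page_text.encode("gbk")

# Decoding with the encoding the page actually uses gives a Unicode string.
decoded = page_bytes.decode("gbk")

# Decoding with the wrong codec fails instead of silently printing garbage.
try:
    page_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(decoded == page_text)
print(wrong_codec_failed)
```

Once decoded holds a Unicode string, printing it should work on any terminal whose encoding can represent Chinese characters.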