Huffman data compression in C++

This small project is motivated by Stanford CS106X assignment 5.

It is also the first time I have implemented a file compression algorithm. Although I knew the algorithm before, I never tried to implement it because I didn't know an efficient way to encode the Huffman tree into a binary file (it turns out to be easy after reading the handout on the Stanford course webpage).

The algorithm is easy, and the details are given in the CS106X handout. The complexity is linear in the number of bits in the file.
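
To make this a bit more concrete, here is a rough sketch of the tree-building step using a priority queue. This is only my own illustration of the setup described in the remarks below, not the assignment's code; Node, makeNode and buildTree are made-up names.

[sourcecode language="cpp"]
#include <cstddef>
#include <queue>
#include <vector>

// freq[s] = number of occurrences of symbol s
// (the 256 byte values plus the pseudo-EOF symbol described in the remarks below).
struct Node {
    int symbol;        // -1 for internal nodes
    long long weight;  // total frequency of this subtree
    Node *left, *right;
};

Node* makeNode(int symbol, long long weight, Node* left, Node* right) {
    Node* n = new Node;
    n->symbol = symbol; n->weight = weight; n->left = left; n->right = right;
    return n;
}

struct HeavierThan {
    bool operator()(const Node* a, const Node* b) const {
        return a->weight > b->weight;  // turns std::priority_queue into a min-heap on weight
    }
};

Node* buildTree(const std::vector<long long>& freq) {
    std::priority_queue<Node*, std::vector<Node*>, HeavierThan> pq;
    for (std::size_t s = 0; s < freq.size(); ++s)
        if (freq[s] > 0)
            pq.push(makeNode(static_cast<int>(s), freq[s], 0, 0));
    while (pq.size() > 1) {
        Node* a = pq.top(); pq.pop();  // repeatedly merge the two lightest subtrees
        Node* b = pq.top(); pq.pop();
        pq.push(makeNode(-1, a->weight + b->weight, a, b));
    }
    return pq.empty() ? 0 : pq.top();  // root of the Huffman tree
}
[/sourcecode]

Once the tree is built, the code for each symbol is simply the sequence of left/right turns from the root down to its leaf.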

Some remarks on the implementation:

  • All files to be compressed are treated as binary files, which avoids the "new line" differences between operating systems.
  • Each byte is treated as a character, which means the valid alphabet size is 2^8=256.
  • The compressed file may contain any number of bits, which need not be a multiple of 8. Since we can only write to the file byte by byte, we pad extra 0s at the end to make the bit count a multiple of 8.
  • Do not use the char type to represent a byte; use int instead. char may be signed, which makes it unsafe as an array index (we index arrays directly by byte value), and int has a larger valid range than char, so values that are invalid for a byte can be used for control purposes (see next).
  • Because of the dummy bits at the end, we cannot detect EOF in the normal way. Instead we use a special pseudo-character (value 256, which is invalid for a byte) for EOF only. To compress, first convert the byte string into an array of int and append one 256 value to the array. During decompression, stop looping when a symbol with value 256 is decoded. A small sketch of this setup is shown below.
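
Here is a minimal sketch of the input handling described above; PSEUDO_EOF and readSymbols are names made up for this illustration, not the actual assignment code.

[sourcecode language="cpp"]
#include <fstream>
#include <vector>

const int PSEUDO_EOF = 256;  // one past the largest byte value, used as the EOF marker only

// Read a whole file as raw bytes, widen each byte to int (0..255),
// and append the pseudo-EOF symbol at the end.
std::vector<int> readSymbols(const char* path) {
    std::ifstream in(path, std::ios::binary);
    std::vector<int> symbols;
    char c;
    while (in.get(c))
        symbols.push_back(static_cast<unsigned char>(c));  // never negative, safe as an array index
    symbols.push_back(PSEUDO_EOF);
    return symbols;
}
[/sourcecode]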

Experiment Result

Each number below is the compression ratio (compressed size / original size):

all_same_char (best case)   0.139
equal_freq (a-z)            0.686
random_printable_char       0.847
Google.htm                  0.724
random_byte                 1.169
huffman.exe                 0.671
NIPS2009_0300.pdf           0.997
jquery.js                   0.683
long_cn_utf-8.txt           0.732
long_cn_unicode.txt         0.889
Wikipedia-logo.png          1.043
Wallpapers.jpg              1.001
google_logo.gif             1.190
lena.bmp                    0.940
compressed file             1.028

Discussion

This result reminds me of a tutorial question in Discrete Structures: prove that no lossless compression algorithm can compress all files. I was a bit surprised when I first saw this statement, but I finally believed it once I learned the proof. The proof is not difficult:

  1. Assume some algorithm f can compress every file. f can be seen as a mapping from binary sequences to binary sequences, with len[f(x)] < len[x] for every binary sequence x.
  2. There are 2^n binary sequences of length n, and all of them are compressed by f into sequences of length less than n.
  3. However, there are only 1+2+4+8+…+2^(n-1) = 2^n-1 binary sequences of length less than n. If the compression is lossless, it must not map two different raw sequences to the same compressed sequence. Since the number of shorter sequences is less than the number of raw sequences, the pigeonhole principle gives a contradiction.

Here is another interesting proof:

If we could compress every file, then each compression would make the file at least 1 byte smaller than the original. If we kept compressing the compressed file, then after N steps the file would shrink to just 1 byte, where N is less than the number of bytes in the original file. Thus any file could be represented by the integer N plus that one final byte, since decompressing N times starting from the final byte would recover the original file. This is obviously impossible: for files of at most m bytes there are only about 256*m such (N, final byte) pairs, far fewer than the number of distinct files of that size.

The best case is when all characters are the same. In that case every character is represented by a single bit, so the ideal compression ratio would be 0.125. In our implementation we use extra space to store the Huffman table, we have the special EOF character, and the editor adds a newline character at the end of the file, so the ratio is slightly higher than optimal.

Since our algorithm can handle any binary file, the worst case is random bytes. We can see that for random_byte the output file actually becomes larger.

As long as the file is a text file containing only printable ASCII characters, the compression ratio is not bad: it should be bounded above by random_printable_char's result, which is 0.847.

Executable files compress surprisingly well. Looking at the Huffman table in the compressed file, I find that although the executable uses almost all 256 byte values, the distribution is far from uniform. This is probably caused by the uneven usage of machine instructions.

Compressed files are hard to compress again. A good compression algorithm should behave like this; otherwise, compressing twice with the same algorithm would be a better algorithm.

Image files are hard to compress. This is because most image formats are already compressed by some algorithm, so their byte distribution looks closer to random.

PDF files are hard to compress. I am not familiar with the PDF format, but I guess the reason is the same as for image files.

Chinese text is easy to compress, even though there are many distinct characters. Actually, there are only about 5000 commonly used Chinese characters, so about 13 bits per character is enough on average. UTF-8 uses 24 bits (3 bytes) for a typical Chinese character, which is definitely a waste. The UTF-16 ("Unicode") file looks harder to compress than the UTF-8 one; maybe this is another reason why we still need encodings other than Unicode. For Chinese text, the Unicode file is smaller than the UTF-8 file, both before and after compression. This is because UTF-8 may use 1, 2, 3 or 4 bytes to encode one character, while this flavour of Unicode (UTF-16) always uses 2 bytes per character. UTF-8 is optimized for backward compatibility with ASCII without losing any representable characters, so it is expected to be more expensive for non-ASCII documents. Refer to this post for more detail.

Skip List in C++

I implemented a skip list in C++ (following the pseudocode in this paper).

The skip list itself provides a simple set interface: contains, insert, remove.
I also implemented a map interface on top of the skip list.
I compared the performance of the skip list with STL set and map; STL is roughly twice as fast as the skip list.
Details of the implementation and the experiment data follow below…

Implementation

The motivation for skip lists is simple. A linked list has O(n) random access time but supports fast deletion given the node's location. An array has O(1) random access but O(n) deletion. It is therefore natural to look for something in between, say O(log n) for both random access and deletion. A binary search tree is one such structure, but it loses the linear structure and is less intuitive. A skip list is more intuitive, and it has O(log n) expected complexity for both random access and deletion.

A skip list is built on a sorted linked list, so we can expect efficient deletion; if we can also access elements efficiently, we are done. Efficient access is achieved by adding extra pointers that let us "skip" multiple nodes instead of walking through them one by one. Ideally, as shown in the figure above, each node has several levels, and each level has a pointer to the next node that reaches that level. Most nodes are low, and a few are tall. If we make sure that the number of nodes of height h shrinks exponentially with h, then we can access any node in O(log n) expected time.
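
For reference, here is a rough sketch of the node and list layout that the snippets below assume. The member names (value, next, head, tail, level, maxLevel) mirror the code that follows, but the constructor and the concrete maxLevel value are my own guesses, not the original implementation.

[sourcecode language="cpp"]
#include <vector>
using namespace std;

template <typename T>
class SkipList {
    struct Node {
        T value;
        vector<Node*> next;  // "pointer tower": next[i] is the successor at level i
        Node(const T& v = T()) : value(v) {}
    };

    Node* head;   // sentinel before the first element
    Node* tail;   // sentinel after the last element
    int level;    // number of levels currently in use
    static const int maxLevel = 32;  // cap on tower height (a guess, the original value is not shown)

public:
    SkipList() : head(new Node()), tail(new Node()), level(1) {
        head->next.assign(maxLevel, tail);  // initially every level points straight to tail
    }
    // search / insert / remove / randomLevel as shown in the snippets below
};
[/sourcecode]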

Accessing a node starts from the head at the highest level: walk along the list at that level while the next element is smaller than the element you need. When the next element is greater than or equal to it, go down one level and continue walking. Repeat until you are at level 1. At that point, either the next element at level 1 is the element you are looking for, or the search fails.

[sourcecode language="cpp"]
Node* search(const T& x, vector<Node*>* update = 0) {
    Node* cur = head;
    for (int i = level - 1; i >= 0; i--) {
        while (cur->next[i] != tail &&
               cur->next[i]->value < x)
            cur = cur->next[i];
        if (update) update->at(i) = cur;
    }
    cur = cur->next[0];
    if (cur == tail || cur->value != x)
        cur = NULL;
    return cur;
}
[/sourcecode]
This is the search method, the core of all the other operations. next is a vector of pointers that forms the "pointer tower" at each node.
[sourcecode language="cpp"]
void insert(const T& x) {
    vector<Node*> update(level);
    Node* cur = search(x, &update);
    if (cur) {
        cur->value = x;
    } else {
        Node* newNode = new Node(x);
        int v = randomLevel();
        if (v > level) {
            level = v;
            update.resize(level, head);
        }
        newNode->next.resize(v);
        for (int i = 0; i < v; i++) {
            newNode->next[i] = update[i]->next[i];
            update[i]->next[i] = newNode;
        }
    }
}
[/sourcecode]

Insertion is easy: first call search; if the element already exists, overwrite it with the new value (the two values need not be identical, only equal under the == operator). If it does not exist, insert a new node into the list and fix up the adjacent pointers (using the update vector; see the paper for details).

[sourcecode language="cpp"]
int randomLevel() {
    int res = 1;
    int r = (rand() << 16) + rand();
    while (res < maxLevel && (r & 1)) {
        res++;
        r >>= 1;
    }
    return res;
}
[/sourcecode]

randomLevel generates a random level with the required distribution: each additional level is added with probability 1/2. Note that the level is capped at maxLevel, so no single node gets an excessively tall tower.

[sourcecode language="cpp"]
void remove(const T& x) {
    vector<Node*> update(level);
    Node* cur = search(x, &update);
    if (cur) {
        for (int i = 0; i < level; i++) {
            if (update[i]->next[i] != cur) break;
            update[i]->next[i] = cur->next[i];
        }
        delete cur;
        while (level > 1) {
            if (head->next[level - 1] == tail)
                level--;
            else
                break;
        }
    }
}
[/sourcecode]

remove is similar to insert, except that only one side of the pointers needs to be maintained: the removed node is deleted directly, so there is no need to fix its outgoing pointers.
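
Putting it together, here is a quick usage sketch of the set interface, assuming the methods above are public members of the SkipList<T> class sketched earlier:

[sourcecode language="cpp"]
#include <iostream>

int main() {
    SkipList<int> s;
    s.insert(3);
    s.insert(7);
    s.insert(5);

    std::cout << (s.search(5) != 0) << "\n";  // prints 1: 5 is in the list
    s.remove(5);
    std::cout << (s.search(5) != 0) << "\n";  // prints 0: 5 has been removed
    return 0;
}
[/sourcecode]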

Experiment 1

  • Insert n (from 10000 to 100000) random numbers, repeat 10 times, and take the average time for each n.
  • Remove n (from 10000 to 100000) random numbers, repeat 10 times, and take the average time for each n.
  • Run a mix of insertions and deletions of n (from 10000 to 100000) random numbers with a 100% hit rate (never inserting an element already in the list or removing an element not in it), repeat 10 times, and take the average time for each n.


The image above shows the result of experiment 1: the x axis is n, and the y axis is the average time cost.

  • The running time grows almost linearly with n for both insertion and deletion. This is expected even though a single operation is O(log n): the experiment performs n operations, so the total is about O(n log n), which looks nearly linear at this scale.
  • Deletion is faster than insertion. This is because insertion needs to maintain pointers on both sides of the inserted node, while deletion only updates the nodes pointing to the deleted node (one side only).
  • For the combined workload, the average size of the skip list stays small, so the operations are faster.

Vim Tips

  • non-greedy reg-exp search: simply replace * with \{-}. \{-} is the non-greedy version of *.
  • match all characters including newline: use \_. instead of .
  • substitution in multiple files:

    1. :args * will add all files in the current folder to the args list.
    2. :argdo %s/old/new/eg | w will perform the substitution in every file in the args list.

    Remark: the flag "e" in the substitution command suppresses the error message when the pattern is not found, so argdo continues with the remaining files instead of stopping there.

    Warning: I haven't figured out how to undo this batch operation easily. Make sure you back up every file before doing this.

  • Search in multiple files:
    :grep <patterns> <files>, which also works on Windows (implemented using findstr).
  • :options (or :opt) will show all options together with their documentation.
  • :set fileencoding=... will make Vim write the file in the new encoding when you save it. If you don't want to modify the file but still wish to view it in another encoding, use :e ++enc=... instead.
  • Editing a file on a remote server can be frustrating over a slow connection. Vim can download the file from the remote server and let you edit it locally; when you save, it is uploaded back to the server automatically. This is done with :e scp://user@host/path/to/the/file
  • C-X C-F will auto-complete file names from the current directory. This is useful when editing a makefile.
  • backupdir is the directory where backup files (~) are saved; directory is the directory where swap files (.swp) are saved. Set these two options if you don't want your working folder filled with ~ and .swp files.
  • C-t and C-d will indent/unindent the whole line in insert mode.
  • :set <option>? will show the current value of an option, e.g. :set ff?

Unicode strings and encoding in Python

Unicode is a character set: it defines a mapping from characters to numbers (code points). Encodings like UTF-8 and UTF-16 define how those code points are represented as bytes.

When the word "Unicode" is used as the name of an encoding (as in Java or .NET), it usually means UTF-16.

The naive way to encode Unicode is called UCS-2, which uses 2 bytes (big-endian) per code point. However, the encoded string may then contain many zero bytes, and a zero byte means "end of string" in C. Thus UCS-2 is not suitable for UNIX-like systems.

Other encodings (UTF-8, etc.)

Many documents are in pure English. If we encoded all of them in UTF-16, they would take twice as much space as in ASCII; but if we encoded them in ASCII, it would be impossible to quote strings in other languages inside those documents.

To address this problem, encodings like UTF-8 can be used. UTF-8 uses a variable number of bytes per character, which is space efficient, and its original design can represent up to 2^31 characters. UTF-8 also has the advantage of being compatible with ASCII. Many webpages are transferred in this encoding.

Unicode in Python

To handle multilingual text, besides the default ASCII (byte) string, Python supports a Unicode string datatype.
To construct a Unicode string, simply call unicode().
Conversion between the default byte string and Unicode strings is done with the string's decode and encode methods.

Here is an example application:

import urllib
page=urllib.urlopen("http://www.guokr.com/").read()
print page

The code above does not print Chinese characters correctly, because page is a plain byte string and nothing tells Python how those bytes should be interpreted.

OK, the behavior of this code actually depends on your system. print simply passes the bytes to the system and lets it handle the IO, so if your system's default encoding is compatible with UTF-8, it will print correctly. But anyway, the idea is there.


import urllib
page=urllib.urlopen("http://www.guokr.com/").read()
print page.decode("utf-8")

After decoding the page content, a Unicode string is returned. Python knows how to print a Unicode string correctly (it encodes it for the terminal), so the correct result is printed.

Maze exploring game in vim script

Lots of beginners do not like using Vim's h, j, k, l keys to move the cursor, so I made this game, which hopefully can help beginners get used to h, j, k and l.

This is a maze exploring game.
A 15*15 maze is randomly generated, and the player ('@') needs to reach the target position ('X').
Only h, j, k, l are allowed for movement.
Type 'q' to quit the game.

Maximize the window before starting the game to get a better view.

Type :HJKL to start a new game.

If you want more of a challenge, try the Konami code (kkjjhlhlba).
The script is available here.