Nasty C code

Today, my friend Yuan Yuan showed me a piece of C code, which gives different result under different optimization option. I was greatly surprised that such code exists.
His original code is complicated. I minimized the code in order to demonstrate the problem.
[sourcecode language=”cpp”]
int main() {
int i=0;
int a[2]={0};
a[i]=++i;
}
[/sourcecode]
[sourcecode language=”cpp”]
int i=0;
int main() {
int a[2]={0};
a[i]=++i;
}
[/sourcecode]
This two piece of code will give different result on windows 7 Professional, compiled using MingW’s g++ 4.50. More specifically, in the first program, a finally becomes [0,1], while in second one, it is [1,0].
Let’s look at the assembly code generated by compiler:

	incl	12(%esp)		# i++;
	movl	12(%esp), %eax		
	movl	12(%esp), %edx
	movl	%edx, 4(%esp,%eax,4)	# a[i]=i;
	movl	_i, %eax		# int temp=i;
	movl	_i, %edx
	incl	%edx
	movl	%edx, _i		# i++;
	movl	_i, %edx
	movl	%edx, 8(%esp,%eax,4)	# a[temp]=i;

Comment is equivalent C code. We can see that they are doing completely different things. I am not sure which one is correct behavior of ++ operator, or even the behavior of ++ operator is undefined in this case.


The lesson is: never write code like a[i]=i++.

lzw data compression algorithm

lzw is another popular file compression algorithm. Many image format is based on lzw algorithm to compress data.

The algorithm itself is more complicated than huffman algorithm, but it has better compression ratio in most case.

I implemented the algorithm following this notes and this tutorial. Actually I haven’t fully understand the idea behind the algorithm, although the implementation works. I will write some more after I think through it later. For now, I just post experiment result and comparison with huffman algorithm.

Experiment and Comparison

data huffman lzw
all_same_char(best case) .139 .048
equal_freq(a-z) .686 .188
random_printable_char .847 1.089
Google.htm .724 .626
random_byte 1.169 1.639
huffman.exe .671 .611
NIPS2009_0300.pdf .997 1.280
jquery.js .683 .523
long_cn_utf-8.txt .732 .561
long_cn_unicode.txt .889 .729
Wikipedia-logo.png 1.043 1.505
Wallpapers.jpg 1.001 1.247
google_logo.gif 1.190 1.494
lena.bmp 0.94 .957
compressed file 1.028 1.461

To sum up, good results become better, bad ones become worse. LZW performs well on text data, while performs poorly on compressed file(including image files) and evenly distributed random bytes.

There are many popular LZW variants. Maybe good to try those later.

The Algorithm

Ok. I have think through the algorithm, and I think I get the basic idea.

  • Use fixed length code to encode bit sequence. This avoid the wastage of huffman code: in huffman, if there is a prifix code with length 7, then it implies all its 6 prefix cannot be used. Code with fixed number of bites does not have this problem, since prefix constrain is not neede already.
  • Encode long binary sequence, if it occurs frequently in the file. LZW is able to learn which sequence occurs frequently during compression.
  • LZW is stream based compression algorithm. That is to say, the algorithm can perform compress/decompress while reading from a stream(you don’t need the whole file before you can do any work). This is not true for huffman because huffman need to read file twice for compression.
  • Unlike huffman, no much extra data(byte count infomation for huffman) is needed to store in compressed file. If the code length is fixed, then no extra information is needed actually.
  • Many variant algorithm of LZW exists.
Compression Psuedo Code

[sourcecode language=”python”]
ST = empty table # a table mapping from BYTE SEQUENCE to INTEGER CODE
for x in 0…255: # initialize table with all single byte sequence
ST[char(x)]=x
S=input byte sequence
codeMax=256 # next code to be inserted into table ST
while S.length > 0:
P=longest prefix of S in ST
code=ST[P]
print code # here is the output
S=S.substring(P.length,S.length) # remove prefix P form S
if S.length > 0:
ST[P+S[0]]=codeMax++ # insert P+S[0] into table.
# This is the learning step.
# If P occurs frequently,
# P+S[0] may also occur frequently
[/sourcecode]

Remark: notice that here we use the whole input sequence instead of reading from a stream. This is for simplicity. It is easy to modify the code to work with sctream model. For the output, you can print the code integer integer in ASCII, but it would be not space efficient. Encoding it in fixed-length binary form gives better result. If the length of bits for each code is not known for decompress algorithm, then this information should be included into compressed file.

Decompress Psuedo Code

[sourcecode language=”python”]
ST = empty table # a table mapping from INTEGER CODE to BYTE SEQUENCE(different from compression algorithm)
for x in 0…255: # initialize table with all single byte sequence
ST[x]=char(x)
S=input code sequence # this is a list of individual code output by compression algorithm
codeMax=256 # next code to be inserted into table ST
PRE=ST[S[0]] # decode the first code first
print PRE
S=S.substring(1,S.lenth) #remove first byte
while S.length > 0:
code=S[0]
if code is in ST:
CUR=ST[code]
print CUR # decode into byte sequence and print
ST[codeMax++]=PRE+CUR[0] # Update table.
# This step is delayed since we only know the next code rather than next byte
else:
CUR=PRE+PRE[0] # current byte sequence
ST[codeMax++]=CUR # this is the tricky step
print CUR

PRE=CUR # current sequence becomes previous one in next iteration
S=S.substring(1,S.lenth) #remove first byte
[/sourcecode]

Remark: the decompresion algorithm construct the reverse mapping from code to byte sequence on hte fly. However, there is one tricky special case caused by the delay update of table. Maybe it would be easier to understand using example.

Assume byte sequence we want to compress is “ABABABA”. Following the compression algorithm, it would produce:

A	-	65

	# "AB" added into table, by now, table is:
	# A  B  AB
	# 65 66 256	

B	-	66

	# "BA" added into table, by now, table is:
	# A  B  AB  BA
	# 65 66 256 257

AB	-	256

	# "ABA" added into table, by now, table is:
	# A  B  AB  BA	ABA
	# 65 66 256 257	258

ABA	-	258

	# we are done :)

Now the compressed sequence is 65,66,256,258. Let’s decompress it from this sequence

65	-	A

	# now PRE = A
	# 65 66
	# A  B 

66	-	B

	# AB is added into table 
	# 65 66 256	
	# A  B  AB
	# now PRE = B

256	-	AB

	# BA is added into table 
	# 65 66 256 257
	# A  B  AB  BA
	# now PRE = AB

258	-	ABA

	# This is the special case
	# 258 is not in the table
	# it means the "delayed table update" will insert 258.
	# So the CUR should be AB+CUR[0]
	# Since we know CUR starts with AB, cur[0]==A
	# So CUR must be ABA.
	# Then ABA is added into table 
	# 65 66 256 257 228
	# A  B  AB  BA  ABA
	# We are done :)

huffman data compression in c++

This small project is motivated by Stanford CS106X assignment 5.

It is also the first time implementing file compression algorithm. Although I know the algorithm before, I didn’t tried to implement it because I don’t know any efficient way to encode huffman-tree into binary file(it turns out to be easy after reading the handout of Stanford course webpage).

The algorithm is easy, and detail is given in handout of CS106X page. The complexity is linear to the number of bits in file.

Some remarks of the implementation:

  • All files to be compressed are considered as binary file in order to eliminate “new line” issue in different OS.
  • Each btye is considered as a character, that is to say, valid alphabet set size is 2^8=256
  • The compressed file may have any number of bits, which may not be multiple of 8. Since we can only write to file byte by byte, we pad extra 0s at the end to make the bit number multiple of 8.
  • Do not use char type to represent byte. Use int instead. This is because char is signed value, which is not safe for array index(for direct hash purpose). Int has large valid range than char, and some invalid byte value can be used for control purpose(see next).
  • Because of the dummy bits at the end, we cannot detect EOF in the normal way. We use a special char(ascii 256, which is invalid value for a byte, for EOF purpose only). To do compression, first convert byte string into array of int, and append one 256 value into the array. For decompression, when a byte with value 256 is encountered, stop looping.

Experiment Result

all_same_char(best case) .139
equal_freq(a-z) .686
random_printable_char .847
Google.htm .724
random_byte 1.169
huffman.exe .671
NIPS2009_0300.pdf .997
jquery.js .683
long_cn_utf-8.txt .732
long_cn_unicode.txt .889
Wikipedia-logo.png 1.043
Wallpapers.jpg 1.001
google_logo.gif 1.190
lena.bmp 0.94
compressed file 1.028

Discussion

This result recall me a tutorial question in Discrete Structure: proof that any lossless file compression algorithm cannot compress all files. I was a bit surprised when I first saw this statement, but finally believe it when I know the proof. The proof is not difficult:

  1. assume some algorithm f can compress any file. f can be seen as a mapping from binary sequence to binary sequence. Thus, len[f(x)] < len[x] for all binary sequence x
  2. For all binary sequence of length n, there are 2^n of them. They are all compressed into sequence with length less than n by f
  3. However, for all binary sequence of length less than n, there are only 1+2+4+8+…+2^(n-1), which is 2^n-1. If the compression algorithm is lossless, then it should not map different raw sequences into same compressed sequence. By pigeon hole principle, we derive contradiction here, since the total number of shorter sequence is less than row sequence.

Here is another interesting proof:

If we can compress every file, then each time we compress a file, the size of compressed file should be at least 1 byte less than original one. If we keep on compress the compressed file, then after N steps, the file will become 1 byte only. N should be less than the number of bytes of original file. Thus, any file can be represented by a integer and one extra bit(the final 1 bit is either 0 or 1), since we can decompress N times on the bit to obtain the original file. This is obviously impossible.

The best case is all characters are same. In such case, all characters are represented by one bit, thus the ideal compress rate should be 0.125. In our case, we use extra space to store the huffman table, and also we have special EOF character, and a newline character at the end of file because of editor issue, thus the ratio is slightly higher than optimal.

Since our algorithm can handle any binary file, the worst case is random binary type. We can see that for random byte, the result file actually becomes larger.

As long as the file is text file with ascii printable characters only, the compression ratio is not bad: actually it should be bounded above by random_printable’s result, which is 0.847.

Executable files have good result. This is a bit unexpected. By looking at the huffman table in compressed file, I find that although executable file use almost all 256 bytes, the distribution is not even. This may be caused by even usage of machine instructions.

Compressed files are hard to compress again. A good compression algorithm should behave like this. Otherwise, compress twice using the same algorithm is a better algorithm.

Image files are hard to compress. This is because most image files are already compressed using some algorithm. Thus their byte distribution looks more random.

pdf files are hard to compress. I am not familiar with pdf format. I guess the reason should be the same as image files.

Executable files have good result. This is a bit unexpected. By looking at the huffman table in compressed file, I find that although executable file use almost all 256 bytes, the distribution is not even. This may be caused by even usage of machine instructions.

Chinese characters are easy to compress, although there are many of them. Actually, there are only 5000 commonly used Chinese characters. That is to say, 13 bits is enough on average. utf-8 uses at least 16 bits to represent Chinese characters, which is definitely a waste. Unicode looks harder to compress than utf-8. Maybe this is another reason why we still need many encodings except Unicode. For Chinese character, Unicode file has smaller size than utf-8 file, both before and after compression.This is because utf-8 encoding may use 1, 2, 3 or 4 bytes to encode one character, while Unicode always use 2 bytes for each character. utf-8 encoding is optimized for backward compatible with ASCII, without lose number of representable character. Thus, the uft-8 is expected to be more expensive for non-ascii document. Refer to this post for more detail.

Skip List in C++

I implemented a skip list in C++(following pseudo code in this paper).

The skip list itself provided simple set interface: contains, insert, remove
I also implemented a map interface using skip list.
I compared performance of Skip List with STL set and map. STL’s algorithm is about twice faster than skip list.
Detail of implementation and experiment data comes later…

Implementation

The motivation of Skip List is simple. Linked List has O(n) random access time, while it can perform fast deletion given node location. Array have O(1) random access, while deletion is O(n). Thus it is nature to look for something in between, say, O(log n) for random accessing and deleting. Binary search tree is one such thing, while it loss the linear structure, and not so intuitive. Skip List is a more intuitive, and it has O(log n) complexity for random accessing and deletion on average.

Skip List is based on a sorted Linked List, thus we can expect efficient deletion. If we can have efficient accessing, then we are done. Efficient accessing is achieved by adding extra pointers to let us “skip” multiple nodes instead of going through one by one. Ideally, as shown in figure above, each node got several levels, each level has a pointer, which pointing to next node reach that level. Most nodes has low height, and few are high. If we can make sure that the number of nodes with height h grow reverse exponentially with h, then we can access each node in O(log n) time on average.

Accessing a node is done by starting from highest list head with level, go along the list at that level until the next element is larger than or equal to the element you need. When the next one is larger, go one level down, and continue going along. Repeat this until you are at level 1. By now, either your next element at level 1 is the element you are looking for, or search fails.

Node* search(const T& x,vector<Node*>* update=0){
  Node* cur=head;
  for(int i=level-1;i>=0;i--){
    while(cur->next[i]!=tail &&
        cur->next[i]->value < x)
      cur=cur->next[i];
    if(update)update->at(i)=cur;
  }
  cur=cur->next[0];
  if(cur==tail || cur->value != x)
    cur = NULL;
  return cur;
}

 

This is the search method, the core of all features. next is a vector of pointers, which correspond to the “pointer tower” at each node.

void insert(const T& x){
  vector<Node*> update(level);
  Node* cur=search(x,&update);
  if(cur){
    cur->value=x;
  }else{
    Node* newNode=new Node(x);
    int v=randomLevel();
    if(v > level){
      level=v;
      update.resize(level,head);
    }
    newNode->next.resize(v);
    for(int i=0;i<v;i++) {
      newNode->next[i]=update[i]->next[i];
      update[i]->next[i]=newNode;
    }
  }
}

 

Insertion is easy: first call search, if the element already exist, overwrite it with new value(maybe not exactly the same, but == operator returns true). If not exist, then insert the node into the list, and maintain adjacent pointers(by using vector update, detail refer to the paper).

int randomLevel(){
  int res=1;
  int r=(rand()<<16)+rand();
  while(res<maxLevel && (r&1)){
    res++;
    r>>=1;
  }
  return res;
}

randLevel generate random number satisfying the required distribution. Note that to avoid single node has too high level, the random number should be no greater than current level + 1.

void remove(const T& x){
  vector<Node*> update(level);
  Node* cur=search(x,&update);
  if(cur){
    for(int i=0;i<level;i++){
      if(update[i]->next[i]!=cur)break;
      update[i]->next[i]=cur->next[i];
    }
    delete cur;
    while(level>1){
      if(head->next[level-1]==tail)
        level--;
      else
        break;
    }
  }
}

remove is similar to insert, except that only one side of pointers need to be maintained. The removed node is deleted directly, so no need to maintain out going pointers.

Experiment 1

  • Insert n(from 10000 to 100000) random numbers, repeat 10 times, find the average time for each n.
  • Remove n(from 10000 to 100000) random numbers, repeat 10 times, find the average time for each n.
  • Combination of insertion and deletion for n(from 10000 to 100000) random numbers with 100% hit rate(no insert same element, or remove element not exist in the list), repeat 10 times, find the average time for each n.


The image above shows the result of experiment 1. X axis is n, Y axis is average time cost.

  • The running time almost grow linearly with n for insertion and deletion, although theoretical complexity is O(log n).
  • Deletion is faster than Insertion. This is because insertion need to maintain pointers on both sides of inserted node, while deletion only maintain those nodes pointing to the deleted node(only one side).
  • Because of combined operation, the average size of skip list is small, thus the code is faster

Vim Tips

  • non-greedy reg-exp search: simply replace * with \{-}. \{-} is the non-greedy version of *.
  • match all characters including newline: use \_. instead of .
  • substitution in multiple files:

    1. :args * will add all files in current folder in to args list.
    2. :argdo %s/old/new/eg | w will perform substitution in all files in arg list

    Remark: the flag “e” in substitution command suppress error messages when pattern not found, and argdo will continue working on other files instead of stop there.

    Warning: I haven’t figure out how to undo this batch operation easily. Make sure you backup every file before doing this.

  • Search in multiple files:
    :grep <patterns> <files>, also works in windows(implemented using findstr)
  • : opt will show all setting options and documentations.
  • :set encoding=.. will convert current file encoding into new one. If you don’t want to modify the file, while still wish to view the file in another encoding, then use :e ++enc=...
  • Editing file on a remote server can be frustrating because of slow internet connection. Vim is able to download file from remote server, and edit it locally. When you save the file after editing, it will be uploaded to server automatically. This is done by :e scp://user@host/path/to/the/file
  • C-X C-F will auto complete file name in current directory. This is useful when editing makefile.
  • backupdir is the directory where backup files(~) saved. directory is the directory where swap file(.swp) saved. Set these two variables if you don’t want to see your working folder filled with ~ and .swp files.
  • C-t and C-d will indent/unindent whole line in insertion mode
  • :set ? will show the value of in configuration. e.g. set ff?

unicode string and encoding in python

Unicode is a character set. It defines mapping from character to numbers(code points). Encodings like utf-8, utf-16 are trying to encode code points.

The word “Unicode” referring to a encoding(java, .Net) usually means utf-16.

The naive way to encode Unicode is called “UCS-2”, which use 2 bytes(Big Endian) to encode a code point. However, encoded string may contain many bytes containing 0, which is used as “end of string” in C. Thus, UCS-2 is not suitable for UNIX like system.

Other encodings(utf-8, etc)

Many documents are in pure English. If we use utf-16 to encode all these documents, twice space is required than ASCII encoding. But if we encode all these in ASCII, then it is impossible to quote strings in other languages in those document.

To address these problem, new encodings like utf-8 can be used. utf-8 use variant number of bytes to represent characters, which is space efficient, and it is capable of representing 2^31 characters. utf-8 also have many advantage that it is compatible with ASCII encoding. Many webpages are transferred in this encoding.

Unicode in Python

To handle multi language text, besides default ASCII string, Python support a Unicode strings datatype.
To construct a Unicode string, simply call unicode().
Conversion between default string and Unicode utf-16 string is by decode and encode method of string.

Here is an example in application:

import urllib;
page=urllib.urlopen("http://www.guokr.com/").read()
print page

The code above does not print Chinese character correctly, because we are using normal string, which is not able to handle Chinese characters.

OK..the behavior of this code depends on your system. This is because what print does is pass the string to your system, and let your system to handle IO. If your system’s default encoding is compatible with utf-8, then it will print correctly. But anyway, the idea is there.


import urllib;
page=urllib.urlopen("http://www.guokr.com/").read()
print page.decode("utf-8")

After decoding the page content, the Unicode utf-16 string is returned. Python knows how to print Unicode utf-16 string correctly, so the correct result is printed out.

Maze exploring game in vim script

Lots of beginners do not like to use vim’s h,j,k,l key to move concur, so I made this game, hopefully can help beginners get used to h,j,k and l.

This is a maze exploring game.
A 15*15 maze is randomly generated, and user(‘@’) need to go to target position(‘X’).
Only h,j,k,l are allowed for controlling.
Type ‘q’ to quit the game.

Maximize the window before starting the game to get better view.

Type :HJKL to start a new game.

Want more challenge, try konami code(kkjjhlhlba)
script available here