Inorder traversal in O(1) space – Morris algorithm explained

The well-known recursive binary tree inorder traversal needs O(H) space, where H is the height of the tree.

static void inorder(Node root, List<Integer> elements) {
    if (root == null) {
        return;
    }
    inorder(root.left, elements);
    elements.add(root.value);
    inorder(root.right, elements);
}

It is not hard to derive an iterative implementation that simulates the backtracking with an explicit stack, but that still needs O(H) space.
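
For reference, a sketch of that stack-based version in C++ (using its own minimal Node struct, not the Java Node from the snippets in this post):

#include <stack>
#include <vector>
using namespace std;

struct Node {
    int value;
    Node *left, *right;
};

// Iterative inorder traversal with an explicit stack; still O(H) extra space.
void inorderIterative(Node* root, vector<int>& elements) {
    stack<Node*> st;
    Node* cur = root;
    while (cur != nullptr || !st.empty()) {
        while (cur != nullptr) {      // push the whole left spine
            st.push(cur);
            cur = cur->left;
        }
        cur = st.top(); st.pop();     // backtrack to the deepest pushed node
        elements.push_back(cur->value);
        cur = cur->right;             // then handle its right subtree
    }
}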

If parent pointers are present, they naturally form a stack structure, which makes backtracking feasible in O(1) extra space.

I recently learnt the Morris algorithm, which performs inorder traversal in O(1) extra space without using parent pointers. The source code follows.

static void inorder(Node node, List<Integer> elements) {
    while (node != null) {
        if (node.left != null) {
            Node t = node.left;
            while (t.right != null && t.right != node) {
                t = t.right;
            }
            if (t.right == node) {
                // visiting back
                elements.add(node.value);
                t.right = null;
                node = node.right;
            } else {
                t.right = node;
                node = node.left;
            }
        } else {
            elements.add(node.value);
            node = node.right;
        }
    }
}
Why mutate pointers?

The Morris algorithm uses some of the null pointers in the tree (only null right pointers) as places to store temporary data. All mutations are guaranteed to be reverted by the time the traversal finishes.

It is kinda cheating to claim it uses O(1) extra space, but it is still an impressive algorithm.

Why does it work?

The contract of the algorithm is intuitive: it takes in a binary tree, and performs inorder traversal on it, with all mutations reverted.

If the left subtree is empty, the case is trivial:

  • visit the root
  • go right and start over

The algorithm will then traverse the right subtree by contract.

Now let's focus on the case when the left subtree is not empty. On the first visit to a node:

  • find the last (rightmost) node in the left subtree
  • set its right pointer to the current root
  • go to the left child and start over

The last step lets the algorithm traverse the left subtree for you, except that after visiting the last node of that subtree, it follows the mutated right pointer and arrives back at the root. On this second visit:

  • find the last node in the left subtree, whose right pointer now points back to the root
  • set its right pointer back to null
  • visit the root
  • go right and start over

The algorithm will deal with the right subtree for you.

Note that the mutation made in the first visit (pointing the rightmost node's right pointer at the root) is reverted in the second visit (setting it back to null). Given that the left and right subtrees are handled by the same algorithm, which reverts all of its mutations by contract, we can guarantee that all mutations are reverted after the algorithm finishes.

It is not hard to see that nodes are visited in order.

When an iteration starts, it might either make a mutation or undo one. The way to tell which is to look at the right pointer of the rightmost node in the left subtree: if it already points back to the current node, we are on the second visit.

The key idea is that the algorithm performs a specific mutation that lets it "backtrack" to exactly the right node after it finishes traversing a subtree.

Why O(n) time?

It is clear that everything except the inner while loop (finding the last node in the left subtree) sums to O(n).

How long do all the "find the last node in the left subtree" scans take in total?

It is not hard to observe that each edge of the tree is walked by such a scan at most twice (once when the temporary pointer is created and once when it is removed), so the total is O(n).

Therefore, overall time complexity is still O(n).

 

Example for TreeList and BinaryIndexedTree

We mentioned BinaryIndexedTree and TreeList in previous posts. Now I will give an example problem where both data structures are needed.

Given an array a[0…n-1] storing a permutation of 0 to n-1, we define the array b[0…n-1] by the following equation:


b[i]=|{x|x occurs before i in a[], and x > i}|

The problem is: given a[], compute b[] in O(n log n) time; and given b[], compute a[] in O(n log n) time.

To convert a[] into b[], assume we have an array t[0…n] (initially all 0). We go through the elements of a[], and for each element a[i] we encounter, b[a[i]] = i - (t[0]+…+t[a[i]]), and then we set t[a[i]] = 1. The correctness is easy to check: i counts the elements seen so far, and the prefix sum counts those that are less than or equal to a[i]. Implementing this method naively gives O(n^2). Notice that the binary indexed tree is designed exactly for this kind of prefix sum operation:

[sourcecode language="cpp"]
vector<int> a2b(vector<int> a) {
    vector<int> b(a.size());
    BITWraper t(a.size(), 0);
    for (int i = 0; i < a.size(); i++) {
        b[a[i]] = i - t.sum(a[i]);
        t.add(a[i]);
    }
    return b;
}
[/sourcecode]

By default, a binary indexed tree's array index starts from 1. Thus, we implemented a wrapper class that allows an arbitrary starting index.
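
A minimal sketch of what such a wrapper might look like (a hypothetical reconstruction, assuming add(i) adds 1 at position i and sum(i) returns the prefix count over [start, i], which is how a2b uses it):

[sourcecode language="cpp"]
#include <vector>
using namespace std;

// Fenwick tree shifted so that positions may start at an arbitrary index.
struct BITWraper {
    int base;
    vector<int> t;                                 // 1-based internally
    BITWraper(int n, int start) : base(start), t(n + 1, 0) {}
    void add(int i, int v = 1) {                   // point update: position i += v
        for (int k = i - base + 1; k < (int)t.size(); k += k & -k)
            t[k] += v;
    }
    int sum(int i) {                               // prefix sum over [start, i]
        int s = 0;
        for (int k = i - base + 1; k > 0; k -= k & -k)
            s += t[k];
        return s;
    }
};
[/sourcecode]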

How do we compute a[] given b[]? Look at the large elements first and write elements down in descending order, starting from the largest. For each element i there are b[i] elements greater than it, and since we write elements down in descending order, all of those b[i] elements have already been written down; therefore i must be inserted at position b[i]. After all elements are inserted, a[] is constructed. Again, implementing this with an array or a linked list results in O(n^2) complexity, while a TreeList needs only O(n log n) time to do all the insertions.

[sourcecode language="cpp"]
vector<int> b2a(vector<int> b) {
    TreeList<int> a;
    for (int i = b.size() - 1; i >= 0; i--)
        a.insert(i, b[i]);
    return a.toVector();
}
[/sourcecode]

TreeList

As we know, the array data structure has O(1) access time and O(n) insertion/deletion at the kth position; a linked list needs O(n) for all of these operations, and gives O(1) insertion/deletion only when the node pointer is already known.

We also know that binary search tree and skip-list provide O(log n) for accessing, insertion and deletion.

Now, what if we want a data structure that behaves like a list and can insert/delete the element at the kth position efficiently?

One of my friends suggested a "blocked linked list", which is a combination of array and linked list. Assuming the input data size N is known, we can use a linked list of arrays to store the data: the list has length sqrt(N), and each array also has length sqrt(N). Insertion and deletion have complexity O(sqrt(N)); an array is split when its length exceeds sqrt(N), and adjacent blocks are merged when they become too short. This data structure is very popular in high school programming contests, where the data size is known in advance. However, O(sqrt(N)) is not good enough: we want O(log n) for all operations.

Another method is to use a binary search tree: a binary search tree augmented with a "size" field can find the kth element in O(log n) time. Each element gets a floating point value as its key for comparison. When inserting at the kth position, we first look up elements k-1 and k and take the mean of their keys as the key of the new element, so that it lands in the kth position. However, this method is not that elegant: the number of insertions the list can support depends on floating point precision.

Finally, here comes our "TreeList" data structure. I don't think I am the first person to invent it, and I don't know whether there is another name for it. The TreeList supports 3 operations:

  • insert an element at the kth position
  • remove the element at the kth position
  • get the kth element

The implementation is based on a treap; actually, any balanced binary search tree could be used. The complexity of all 3 operations is O(log n). Source code is shown below.

When do we need this data structure? Follow my latest post

[sourcecode language="cpp"]
template<class T>
class TreeList{
    // list structure, supports 3 operations:
    // - insert element into kth position
    // - remove element from kth position
    // - get kth element
    // implemented using a treap
    struct node{
        T value;
        int pri,size;
        node *left,*right;
        node(){}
        node(const T& v,node* l,node* r){
            value=v;
            left=l;
            right=r;
            size=1;
            pri=rand();
        }
    }*root,*null;
    void resize(node* cur){
        if(cur!=null)cur->size=cur->left->size+cur->right->size+1;
    }
    void left_rotate(node* &cur){
        node *res=cur->left;
        cur->left=res->right;
        res->right=cur;
        resize(cur);
        cur=res;
        resize(cur);
    }
    void right_rotate(node* &cur){
        node *res=cur->right;
        cur->right=res->left;
        res->left=cur;
        resize(cur);
        cur=res;
        resize(cur);
    }
    void remove_node(node* &cur){
        if(cur->left == null && cur->right == null){
            cur=null;
        }else if(cur->left->pri < cur->right->pri){
            left_rotate(cur);
            remove_node(cur->right);
            resize(cur);
        }else{
            right_rotate(cur);
            remove_node(cur->left);
            resize(cur);
        }
    }

    void insert_to_kth_pos(const T& x,node* &cur,int k){
        if(cur == null && k==0){
            cur = new node(x,null,null);
        }else if(k <= cur->left->size){
            insert_to_kth_pos(x,cur->left,k);
            if(cur->pri > cur->left->pri)
                left_rotate(cur);
            resize(cur);
        }else if(k <= cur->size){
            insert_to_kth_pos(x,cur->right,k-cur->left->size-1);
            if(cur->pri > cur->right->pri)
                right_rotate(cur);
            resize(cur);
        }
    }
    void remove_kth_element(node* & cur,int k){
        if(k == cur->left->size){
            remove_node(cur);
        }else if(k < cur->left->size){
            remove_kth_element(cur->left,k);
            if(cur->pri > cur->left->pri)
                left_rotate(cur);
            resize(cur);
        }else if(k < cur->size){
            remove_kth_element(cur->right,k-cur->left->size-1);
            if(cur->pri > cur->right->pri)
                right_rotate(cur);
            resize(cur);
        }
    }
    void print(node* cur){
        if(cur == null)return;
        print(cur->left);
        cout<<cur->value<<" ";
        print(cur->right);
    }
public:
    TreeList (){
        null=new node();
        null->pri = 0x7fffffff;
        null->size = 0;
        null->left=null->right=null;
        root=null;
    }
    void print(){
        print(root);
        cout<<endl;
    }
    void insert(const T& x,int k){
        insert_to_kth_pos(x,root,k);
    }
    void remove(int k){
        remove_kth_element(root,k);
    }
    T get(int k){
        node* cur=root;
        while(cur!=null){
            if(k < cur->left->size){
                cur=cur->left;
            }else if(k == cur->left->size){
                break;
            }else{
                k-=cur->left->size+1;
                cur=cur->right;
            }
        }
        return cur->value;
    }
};
[/sourcecode]

Generator

Warning: this post is very technical. You may find it very hard to understand, or boring.

One of my friends asked me about this problem.

The description of the problem is simple: given a string pattern, find the expected position of the first occurrence of that pattern in a random string.

The problem looks like dynamic programming, because there seem to be recursive relations between its subproblems.

Let's first assume there is a random string generator which generates a random string letter by letter. Let F[n] be the expected number of letters still to be generated before the pattern occurs, given that the most recently generated n letters match the first n letters of the pattern. For example, assume the pattern is ABCDE and the generator has produced ….ABC; then F[3] is the expected number of further letters needed before ABCDE appears.

We can consider the "n-prefix" of the pattern as the DP "state" (described by n), and F[0] is the answer we need: the expected number of letters needed before the pattern is observed, given no prefix match (the generator hasn't generated anything yet).

Now let's see how one state transfers into another. Given that the most recently generated n letters match the n-prefix of the pattern, what happens to the suffix-prefix match when one more letter is generated? In the best case it becomes n+1, if the newly generated letter matches the (n+1)th letter of the pattern. In the other cases, the state transfers to some number no larger than n, and that number is easy to find: it is the length of the longest suffix of the generated string that is also a prefix of the pattern.

Now the recursive relationship between states becomes clear: F[i] = 1 + \sum_{c='A'}^{'Z'} F[next(i,c)]/26, with base case F[pattern.length] = 0.
In the original problem the number of letters is a variable rather than a fixed 26, and next(i,c) is the state transition function.

Now comes another problem: this DP cannot be solved directly, because the state graph is not a DAG and no topological order can be found. Actually, this is not a DP problem at all. After listing all equations of the form F[i] = 1 + \sum_{c='A'}^{'Z'} F[next(i,c)]/26, we can simply solve the linear system using Gaussian elimination.
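
As a tiny sanity check (my own example, not part of the original problem): take a 2-letter alphabet {H, T} and the pattern HH. Writing out next(i,c) by hand gives the system

F[2] = 0
F[1] = 1 + (F[2] + F[0]) / 2
F[0] = 1 + (F[1] + F[0]) / 2

which solves to F[1] = 4 and F[0] = 6, the well-known expected waiting time for two heads in a row.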

By now the problem is mostly solved. However, when I submitted the code to the judge, I got "wrong answer". I guess this is a precision problem: I used the double data type, since the expected length can be a real number, but the required output is an integer, so I cast the result to the "long long" data type. A double stores 15-16 significant digits, while long long can represent numbers with almost 20 significant digits. Thus, in extreme cases, converting from double may lose necessary precision when the answer is very large.

I fixed this problem by using "long double" instead of "double". To the best of my knowledge, "long double" is a system-dependent data type in C. It has higher precision than a normal double: on my machine (gcc), sizeof(long double) gives 12, and actually 80 bits are used to store data, with the other 16 bits being padding. I am still not sure exactly how many significant digits a "long double" has, but this problem shows that it is useful in some cases.
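
A quick way to check the available precision on your own machine (the values in the comments are what I would expect for x86 gcc; they vary by platform):

#include <iostream>
#include <limits>
using namespace std;

int main() {
    // double: 53-bit mantissa, about 15 decimal digits
    cout << sizeof(double) << " " << numeric_limits<double>::digits10 << endl;           // 8 15
    // x86 long double: 80-bit extended format with a 64-bit mantissa, about 18 digits
    cout << sizeof(long double) << " " << numeric_limits<long double>::digits10 << endl; // 12 (or 16) 18
    return 0;
}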

#include <vector>
#include <iostream>
#include <algorithm>
#include <string>
using namespace std;
const long double EPI=1e-6;
string s;
int N;
long double abs(long double a){return a>0?a:-a;}
int g(int i,char c){
  string t=s.substr(0,i)+c;
  for(int j=0;j<i+1;j++){
    if(t==s.substr(0,t.size()))
      return i+1-j;
    t=t.substr(1);
  }
  return 0;
}
void sub_from(vector<long double> & a,vector<long double> &b, long double k){
  // a=a-bk
  for(int i=0;i<a.size();i++)
    a[i]-=b[i]*k;
}
vector<long double> solve_linear_system(vector<vector<long double> >& a){

  for(int i=0;i<a.size();i++){
    int j;
    for(j=i;j<a.size();j++)
      if(abs(a[j][i]) > EPI)
        break;
    swap(a[j],a[i]);
    for(j=i+1;j<a.size();j++)
      sub_from(a[j],a[i],a[j][i]/a[i][i]);
  }
  for(int i=a.size()-1;i>=0;i--){
    for(int j=i-1;j>=0;j--)
      sub_from(a[j],a[i],a[j][i]/a[i][i]);
  }
  vector<long double> ans;
  for(int i=0;i<a.size();i++)
    ans.push_back(a[i].back()/a[i][i]);
  return ans;
}
long double solve(){
  int m=s.size();
  vector<vector<long double> > a;
  for(int i=0;i<m;i++){
    a.push_back(vector<long double>(m+2,0));
    a[i][i]=N;
    a[i][m+1]=N;
    for(int j=0;j<N;j++)
      a[i][g(i,'A'+j)]-=1;
  }
  a.push_back(vector<long double>(m+2,0));
  a[m][m]=1; //xm=0
  vector<long double> ans=solve_linear_system(a);
  return ans[0]; //x0
}
int main(int argc, const char *argv[])
{
  int times;
  cin>>times;
  for(int i=0;i<times;i++){
    cin>>N>>s;
    if(i)cout<<endl;
    cout<<"Case "<<i+1<<":"<<endl<<(long long)solve()<<endl;
  }
  return 0;
}

 

My grep

Thanks to Robert Sedgewick (Princeton)'s lecture notes, I just implemented a "grep" program which performs regular expression matching.
The code is only about 100 lines long, with the help of the C++ STL.
Matching is done by converting the RE into an NFA. The claimed complexity is O(NM), where N is the length of the string and M is the length of the RE.

Now it only supports very basic regular expression matching:

a-z, A-Z, 0-9   literal characters
.               match any single character
*               repeat the preceding item zero or more times
()              group a regular expression
|               alternation (logical or); must appear inside ()
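
As a compact sketch of the approach from Sedgewick's notes (a reconstruction for illustration, not necessarily identical to my program): build an epsilon-transition digraph over the characters of the RE, then simulate the NFA by maintaining the set of reachable states while scanning the text.

[sourcecode language="cpp"]
#include <iostream>
#include <stack>
#include <string>
#include <vector>
using namespace std;

// State i corresponds to re[i]; state m (= re.size()) is the accept state.
struct NFA {
    string re;
    int m;
    vector<vector<int> > eps;                      // eps[v] = epsilon edges out of v

    NFA(const string& regexp) : re(regexp), m(regexp.size()), eps(regexp.size() + 1) {
        stack<int> ops;
        for (int i = 0; i < m; i++) {
            int lp = i;
            if (re[i] == '(' || re[i] == '|') {
                ops.push(i);
            } else if (re[i] == ')') {
                int orPos = ops.top(); ops.pop();
                if (re[orPos] == '|') {            // single | inside ( )
                    lp = ops.top(); ops.pop();
                    eps[lp].push_back(orPos + 1);
                    eps[orPos].push_back(i);
                } else {
                    lp = orPos;
                }
            }
            if (i < m - 1 && re[i + 1] == '*') {   // closure: loop both ways
                eps[lp].push_back(i + 1);
                eps[i + 1].push_back(lp);
            }
            if (re[i] == '(' || re[i] == '*' || re[i] == ')')
                eps[i].push_back(i + 1);
        }
    }

    void closure(int v, vector<bool>& reach) {     // epsilon closure by DFS
        if (reach[v]) return;
        reach[v] = true;
        for (int j = 0; j < (int)eps[v].size(); j++) closure(eps[v][j], reach);
    }

    bool recognizes(const string& txt) {
        vector<bool> pc(m + 1, false);
        closure(0, pc);                            // states reachable before any input
        for (int k = 0; k < (int)txt.size(); k++) {
            vector<bool> next(m + 1, false);
            for (int v = 0; v < m; v++)
                if (pc[v] && (re[v] == txt[k] || re[v] == '.'))
                    closure(v + 1, next);          // consume one character, then closure
            pc = next;
        }
        return pc[m];
    }
};

int main() {
    NFA nfa("((A*B|AC)D)");
    cout << nfa.recognizes("AAAABD") << " " << nfa.recognizes("ABCD") << endl; // 1 0
}
[/sourcecode]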

Some improvements can be done in the future:

  • Support REs like a|b, i.e. | without parentheses. This can be done by adding ( ) around the input.
  • Add support for REs like [a-z], [^a-z], [abc] (this may need a small parser).
  • Add support for metacharacters like \d: this may need an abstract character type which can match a certain set of characters.
  • To support ^ and $, I think we need another abstract type for the input string characters.

burrows wheeler algorithm for data compression

The Burrows-Wheeler algorithm is another popular data compression algorithm. My implementation follows the guideline here.

The algorithm consists of three parts:

  1. Burrows-Wheeler transformation
  2. Move-to-front transformation
  3. Huffman compression

The key step of this algorithm is the Burrows-Wheeler transformation; I will describe it in more detail later.
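
As a rough illustration in the meantime (my own sketch, not the implementation measured below), the first two stages could look like this. The transform here sorts all rotations naively, which is O(n^2 log n); real implementations use suffix arrays.

[sourcecode language="cpp"]
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// Burrows-Wheeler transform: sort all rotations of s and take the last column.
// Also returns the row index of the original string, needed to invert the transform.
pair<string, int> bwt(const string& s) {
    int n = s.size();
    vector<int> rot(n);                      // rot[i] = starting offset of a rotation
    for (int i = 0; i < n; i++) rot[i] = i;
    sort(rot.begin(), rot.end(), [&](int a, int b) {
        for (int k = 0; k < n; k++) {
            char ca = s[(a + k) % n], cb = s[(b + k) % n];
            if (ca != cb) return ca < cb;
        }
        return false;
    });
    string last(n, ' ');
    int orig = 0;
    for (int i = 0; i < n; i++) {
        last[i] = s[(rot[i] + n - 1) % n];   // last character of the i-th sorted rotation
        if (rot[i] == 0) orig = i;
    }
    return make_pair(last, orig);
}

// Move-to-front: emit the current index of each byte, then move that byte to the front.
vector<int> mtf(const string& s) {
    vector<unsigned char> order(256);
    for (int i = 0; i < 256; i++) order[i] = (unsigned char)i;
    vector<int> out;
    for (int i = 0; i < (int)s.size(); i++) {
        unsigned char c = s[i];
        int idx = find(order.begin(), order.end(), c) - order.begin();
        out.push_back(idx);
        order.erase(order.begin() + idx);
        order.insert(order.begin(), c);
    }
    return out;
}

int main() {
    pair<string, int> r = bwt("banana");
    cout << r.first << " " << r.second << endl;  // nnbaaa 3
    vector<int> m = mtf(r.first);                // runs of equal bytes become runs of zeros
    for (int i = 0; i < (int)m.size(); i++) cout << m[i] << " ";
    cout << endl;
}
[/sourcecode]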

Experiment result

data huffman lzw Burrows Wheeler
all_same_char(best case) .139 .048 .165
equal_freq(a-z) .686 .188 0.269
random_printable_char .847 1.089 .870
Google.htm .724 .626 .695
random_byte 1.169 1.639 1.189
huffman.exe .671 .611 .692
NIPS2009_0300.pdf .997 1.280 1.006
jquery.js .683 .523 0.685
long_cn_utf-8.txt .732 .561 .769
long_cn_unicode.txt .889 .729 0.936
Wikipedia-logo.png 1.043 1.505 1.062
Wallpapers.jpg 1.001 1.247 1.022
google_logo.gif 1.190 1.494 1.215
lena.bmp 0.94 .957 0.906
compressed file 1.028 1.461 1.043

It's hard to say whether this is good or bad; the discussion comes later.

lzw data compression algorithm

LZW is another popular file compression algorithm. Many image formats are based on the LZW algorithm to compress data.

The algorithm itself is more complicated than the Huffman algorithm, but it has a better compression ratio in most cases.

I implemented the algorithm following these notes and this tutorial. Actually, I haven't fully understood the idea behind the algorithm yet, although the implementation works. I will write some more after I think it through later. For now, I just post the experiment results and a comparison with the Huffman algorithm.

Experiment and Comparison

data huffman lzw
all_same_char(best case) .139 .048
equal_freq(a-z) .686 .188
random_printable_char .847 1.089
Google.htm .724 .626
random_byte 1.169 1.639
huffman.exe .671 .611
NIPS2009_0300.pdf .997 1.280
jquery.js .683 .523
long_cn_utf-8.txt .732 .561
long_cn_unicode.txt .889 .729
Wikipedia-logo.png 1.043 1.505
Wallpapers.jpg 1.001 1.247
google_logo.gif 1.190 1.494
lena.bmp 0.94 .957
compressed file 1.028 1.461

To sum up, the good results become better and the bad ones become worse. LZW performs well on text data, while it performs poorly on compressed files (including image files) and on evenly distributed random bytes.

There are many popular LZW variants. Maybe good to try those later.

The Algorithm

OK, I have thought the algorithm through, and I think I get the basic idea.

  • Use fixed-length codes for the output. This avoids the waste in Huffman codes: in Huffman, if there is a prefix code of length 7, then none of its 6 proper prefixes can be used as codes. Fixed-length codes do not have this problem, since no prefix constraint is needed.
  • Encode a long byte sequence with a single code if it occurs frequently in the file. LZW is able to learn which sequences occur frequently during compression.
  • LZW is a stream-based compression algorithm. That is to say, the algorithm can compress/decompress while reading from a stream (you don't need the whole file before you can do any work). This is not true for Huffman, because Huffman needs to read the file twice to compress it.
  • Unlike Huffman, not much extra data (byte count information in Huffman's case) needs to be stored in the compressed file. If the code length is fixed, then no extra information is needed at all.
  • Many variants of LZW exist.

Compression Pseudo Code

[sourcecode language="python"]
ST = empty table # a table mapping from BYTE SEQUENCE to INTEGER CODE
for x in 0…255: # initialize table with all single byte sequences
    ST[char(x)] = x
S = input byte sequence
codeMax = 256 # next code to be inserted into table ST
while S.length > 0:
    P = longest prefix of S in ST
    code = ST[P]
    print code # here is the output
    S = S.substring(P.length, S.length) # remove prefix P from S
    if S.length > 0:
        ST[P+S[0]] = codeMax++ # insert P+S[0] into the table.
                               # This is the learning step.
                               # If P occurs frequently,
                               # P+S[0] may also occur frequently
[/sourcecode]

Remark: notice that here we use the whole input sequence instead of reading from a stream; this is just for simplicity, and it is easy to modify the code to work with a stream model. For the output, you could print the integer codes in ASCII, but that would not be space efficient; encoding them in fixed-length binary form gives a better result. If the number of bits per code is not known to the decompression algorithm, then this information should be included in the compressed file.
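
For the curious, a minimal runnable C++ version of the compression pseudo code above might look like this (my sketch; it emits the integer codes directly instead of packing them into fixed-length bit fields):

[sourcecode language="cpp"]
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

vector<int> lzw_compress(const string& input) {
    map<string, int> table;                  // BYTE SEQUENCE -> CODE
    for (int x = 0; x < 256; x++)
        table[string(1, (char)x)] = x;       // all single-byte sequences
    int codeMax = 256;                       // next code to be inserted
    vector<int> out;
    string p;                                // longest known prefix seen so far
    for (int i = 0; i < (int)input.size(); i++) {
        char c = input[i];
        if (table.count(p + c)) {
            p += c;                          // extend the current match
        } else {
            out.push_back(table[p]);         // emit code for the longest known prefix
            table[p + c] = codeMax++;        // learning step: remember the new sequence
            p = string(1, c);
        }
    }
    if (!p.empty()) out.push_back(table[p]); // flush the last match
    return out;
}

int main() {
    vector<int> codes = lzw_compress("ABABABA");
    for (int i = 0; i < (int)codes.size(); i++) cout << codes[i] << " ";
    cout << endl;                            // 65 66 256 258, as in the example below
}
[/sourcecode]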

Decompression Pseudo Code

[sourcecode language="python"]
ST = empty table # a table mapping from INTEGER CODE to BYTE SEQUENCE (reverse of the compression table)
for x in 0…255: # initialize table with all single byte sequences
    ST[x] = char(x)
S = input code sequence # the list of codes output by the compression algorithm
codeMax = 256 # next code to be inserted into table ST
PRE = ST[S[0]] # decode the first code first
print PRE
S = S.substring(1, S.length) # remove first code
while S.length > 0:
    code = S[0]
    if code is in ST:
        CUR = ST[code]
        print CUR # decode into a byte sequence and print
        ST[codeMax++] = PRE + CUR[0] # Update table.
                                     # This step is delayed since we only know the next code rather than the next byte
    else:
        CUR = PRE + PRE[0] # current byte sequence
        ST[codeMax++] = CUR # this is the tricky step
        print CUR

    PRE = CUR # current sequence becomes the previous one in the next iteration
    S = S.substring(1, S.length) # remove first code
[/sourcecode]

Remark: the decompression algorithm constructs the reverse mapping from code to byte sequence on the fly. However, there is one tricky special case, caused by the delayed update of the table. It is probably easiest to understand with an example.

Assume byte sequence we want to compress is “ABABABA”. Following the compression algorithm, it would produce:

A	-	65

	# "AB" added into table, by now, table is:
	# A  B  AB
	# 65 66 256	

B	-	66

	# "BA" added into table, by now, table is:
	# A  B  AB  BA
	# 65 66 256 257

AB	-	256

	# "ABA" added into table, by now, table is:
	# A  B  AB  BA	ABA
	# 65 66 256 257	258

ABA	-	258

	# we are done :)

Now the compressed sequence is 65,66,256,258. Let’s decompress it from this sequence

65	-	A

	# now PRE = A
	# 65 66
	# A  B 

66	-	B

	# AB is added into table 
	# 65 66 256	
	# A  B  AB
	# now PRE = B

256	-	AB

	# BA is added into table 
	# 65 66 256 257
	# A  B  AB  BA
	# now PRE = AB

258	-	ABA

	# This is the special case
	# 258 is not in the table
	# it means the "delayed table update" will insert 258.
	# So CUR should be PRE+CUR[0] = AB+CUR[0]
	# Since CUR starts with PRE = AB, CUR[0] == A
	# So CUR must be ABA.
	# Then ABA is added into table
	# 65 66 256 257 258
	# A  B  AB  BA  ABA
	# We are done :)

huffman data compression in c++

This small project is motivated by Stanford CS106X assignment 5.

It is also the first time I have implemented a file compression algorithm. Although I knew the algorithm before, I never tried to implement it because I didn't know any efficient way to encode the Huffman tree into a binary file (it turns out to be easy after reading the handout on the Stanford course webpage).

The algorithm is easy, and the details are given in the CS106X handout. The complexity is linear in the number of bits in the file.

Some remarks on the implementation:

  • All files to be compressed are treated as binary files, in order to eliminate the "new line" issue across different OSes.
  • Each byte is treated as a character, i.e. the valid alphabet size is 2^8 = 256.
  • The compressed file may have any number of bits, not necessarily a multiple of 8. Since we can only write to the file byte by byte, we pad extra 0s at the end to make the bit count a multiple of 8.
  • Do not use the char type to represent a byte; use int instead. char is signed, which is not safe for array indexing (for direct hashing), while int has a larger valid range than char, and some values that are invalid as bytes can be used for control purposes (see next).
  • Because of the padding bits at the end, we cannot detect EOF in the normal way. We use a special pseudo-character (value 256, which is not a valid byte) for EOF purposes only. To compress, first convert the byte string into an array of ints and append one 256 value to the array. For decompression, stop the loop when the symbol with value 256 is decoded.
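
As an illustration (my sketch, not the CS106X code), building the code table over the 257-symbol alphabet described above might look like this:

[sourcecode language="cpp"]
#include <iostream>
#include <queue>
#include <string>
#include <vector>
using namespace std;

struct HNode {
    int symbol;            // 0..255 = byte value, 256 = pseudo-EOF, -1 = internal node
    long long freq;
    HNode *left, *right;
};

struct Cmp {               // min-heap on frequency
    bool operator()(const HNode* a, const HNode* b) const { return a->freq > b->freq; }
};

// Walk the tree and record the bit string of every leaf symbol.
void build_codes(const HNode* n, string prefix, vector<string>& codes) {
    if (!n->left && !n->right) {
        codes[n->symbol] = prefix.empty() ? "0" : prefix;  // single-symbol edge case
        return;
    }
    build_codes(n->left, prefix + "0", codes);
    build_codes(n->right, prefix + "1", codes);
}

vector<string> huffman_codes(const vector<long long>& freq) {  // freq has 257 entries
    priority_queue<HNode*, vector<HNode*>, Cmp> pq;
    for (int s = 0; s < (int)freq.size(); s++)
        if (freq[s] > 0) pq.push(new HNode{s, freq[s], 0, 0});
    while (pq.size() > 1) {                // repeatedly merge the two lightest subtrees
        HNode* a = pq.top(); pq.pop();
        HNode* b = pq.top(); pq.pop();
        pq.push(new HNode{-1, a->freq + b->freq, a, b});
    }
    vector<string> codes(257);
    build_codes(pq.top(), "", codes);
    return codes;
}

int main() {
    vector<long long> freq(257, 0);
    string sample = "aaaaabbc";
    for (int i = 0; i < (int)sample.size(); i++) freq[(unsigned char)sample[i]]++;
    freq[256] = 1;                         // the pseudo-EOF occurs exactly once
    vector<string> codes = huffman_codes(freq);
    cout << "a=" << codes['a'] << " b=" << codes['b'] << " c=" << codes['c'] << endl;
}
[/sourcecode]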

Experiment Result

all_same_char(best case) .139
equal_freq(a-z) .686
random_printable_char .847
Google.htm .724
random_byte 1.169
huffman.exe .671
NIPS2009_0300.pdf .997
jquery.js .683
long_cn_utf-8.txt .732
long_cn_unicode.txt .889
Wikipedia-logo.png 1.043
Wallpapers.jpg 1.001
google_logo.gif 1.190
lena.bmp 0.94
compressed file 1.028

Discussion

This result reminds me of a tutorial question in Discrete Structures: prove that no lossless file compression algorithm can compress all files. I was a bit surprised when I first saw this statement, but I believed it once I saw the proof. The proof is not difficult:

  1. Assume some algorithm f can compress any file. f can be seen as a mapping from binary sequences to binary sequences such that len[f(x)] < len[x] for every binary sequence x.
  2. There are 2^n binary sequences of length n. They are all compressed by f into sequences of length less than n.
  3. However, there are only 1+2+4+8+…+2^(n-1) = 2^n-1 binary sequences of length less than n. If the compression algorithm is lossless, it must not map different raw sequences to the same compressed sequence. By the pigeonhole principle we get a contradiction, since the number of shorter sequences is less than the number of raw sequences.

Here is another interesting proof:

If we could compress every file, then each time we compress a file, the compressed file would be at least 1 byte smaller than the original. If we keep compressing the compressed file, then after some number of steps N the file becomes a single byte, and N is less than the number of bytes in the original file. Thus any file could be represented by the integer N together with that final byte, since we could decompress N times to recover the original file. This is obviously impossible: there are far more files than such (N, byte) pairs.

The best case is when all characters are the same. In that case every character is represented by one bit, so the ideal compression ratio would be 0.125. In our case we use extra space to store the Huffman table, we have the special EOF character, and there is a newline character at the end of the file because of an editor issue, so the ratio is slightly higher than optimal.

Since our algorithm can handle any binary file, the worst case is random bytes. We can see that for random bytes the output file actually becomes larger.

As long as the file is a text file with ASCII printable characters only, the compression ratio is not bad: it should be bounded above by random_printable_char's result, which is 0.847.

Executable files compress surprisingly well. By looking at the Huffman table in the compressed file, I found that although the executable uses almost all 256 byte values, the distribution is far from even. This may be caused by the uneven usage of machine instructions.

Compressed files are hard to compress again. A good compression algorithm should behave like this; otherwise compressing twice with the same algorithm would be a better algorithm.

Image files are hard to compress. This is because most image files are already compressed using some algorithm. Thus their byte distribution looks more random.

pdf files are hard to compress. I am not familiar with pdf format. I guess the reason should be the same as image files.

Chinese text is easy to compress, although there are many distinct characters. Actually, there are only about 5000 commonly used Chinese characters, so about 13 bits per character is enough on average. utf-8 uses at least 16 bits to represent a Chinese character, which is definitely a waste. The Unicode (UTF-16) file looks harder to compress than the utf-8 one. Maybe this is another reason why we still need many encodings besides Unicode. For Chinese text, the Unicode file is smaller than the utf-8 file both before and after compression. This is because utf-8 may use 1, 2, 3 or 4 bytes to encode one character, while this Unicode encoding always uses 2 bytes per character. utf-8 is optimized for backward compatibility with ASCII without losing the number of representable characters, so it is expected to be more expensive for non-ASCII documents. Refer to this post for more detail.

Skip List in C++

I implemented a skip list in C++(following pseudo code in this paper).

The skip list itself provides a simple set interface: contains, insert, remove.
I also implemented a map interface on top of the skip list.
I compared the performance of the skip list with STL set and map: the STL containers are about twice as fast as the skip list.
Details of the implementation and experiment data come later…

Implementation

The motivation for skip lists is simple. A linked list has O(n) random access time, but supports fast deletion once the node's location is given; an array has O(1) random access, but O(n) deletion. Thus it is natural to look for something in between, say O(log n) for random access and deletion. A binary search tree is one such structure, but it loses the linear structure and is not so intuitive. A skip list is more intuitive, and it has O(log n) complexity for random access and deletion on average.

A skip list is based on a sorted linked list, so we can expect efficient deletion; if we can also get efficient access, we are done. Efficient access is achieved by adding extra pointers that let us "skip" over multiple nodes instead of walking through them one by one. Ideally, as shown in the figure above, each node has several levels, and each level has a pointer to the next node that reaches that level. Most nodes are short and few are tall. If we make sure that the number of nodes of height h decreases exponentially with h, then we can access each node in O(log n) time on average.

Accessing a node is done by starting from the list head at the highest level and going along the list at that level until the next element is larger than or equal to the element you need. When the next one is larger, go down one level and continue. Repeat this until you are at level 1. At that point, either the next element at level 1 is the element you are looking for, or the search fails.

[sourcecode language="cpp"]
Node* search(const T& x, vector<Node*>* update = 0) {
    Node* cur = head;
    for (int i = level - 1; i >= 0; i--) {
        while (cur->next[i] != tail &&
               cur->next[i]->value < x)
            cur = cur->next[i];
        if (update) update->at(i) = cur;
    }
    cur = cur->next[0];
    if (cur == tail || cur->value != x)
        cur = NULL;
    return cur;
}
[/sourcecode]
This is the search method, the core of all the other operations. next is a vector of pointers, corresponding to the "pointer tower" at each node.
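
For context, the snippets in this post assume a surrounding class; a hypothetical reconstruction of that skeleton (my assumption, not the original code) could be:

[sourcecode language="cpp"]
#include <cstdlib>
#include <vector>
using namespace std;

template<class T>
class SkipList {
    struct Node {
        T value;
        vector<Node*> next;              // one forward pointer per level (the "pointer tower")
        Node() {}
        Node(const T& v) : value(v) {}
    };
    static const int maxLevel = 32;      // hard cap on tower height
    int level;                           // number of levels currently in use
    Node *head, *tail;                   // sentinels at both ends
public:
    SkipList() : level(1) {
        head = new Node();
        tail = new Node();
        head->next.assign(maxLevel, tail);
    }
    // search / insert / remove / randomLevel as shown in this post
};
[/sourcecode]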
[sourcecode language="cpp"]
void insert(const T& x) {
    vector<Node*> update(level);
    Node* cur = search(x, &update);
    if (cur) {
        cur->value = x;
    } else {
        Node* newNode = new Node(x);
        int v = randomLevel();
        if (v > level) {
            level = v;
            update.resize(level, head);
        }
        newNode->next.resize(v);
        for (int i = 0; i < v; i++) {
            newNode->next[i] = update[i]->next[i];
            update[i]->next[i] = newNode;
        }
    }
}
[/sourcecode]

Insertion is easy: first call search; if the element already exists, overwrite it with the new value (maybe not exactly the same object, but one for which the == operator returns true). If it does not exist, insert the node into the list and fix up the adjacent pointers (using the update vector; see the paper for details).

[sourcecode language="cpp"]
int randomLevel() {
    int res = 1;
    int r = (rand() << 16) + rand();
    while (res < maxLevel && (r & 1)) {
        res++;
        r >>= 1;
    }
    return res;
}
[/sourcecode]

randomLevel generates a random level with the required distribution: each additional level is half as likely as the previous one. Note that the generated level is capped (at maxLevel in this code) so that no single node gets an excessively tall tower.

[sourcecode language="cpp"]
void remove(const T& x) {
    vector<Node*> update(level);
    Node* cur = search(x, &update);
    if (cur) {
        for (int i = 0; i < level; i++) {
            if (update[i]->next[i] != cur) break;
            update[i]->next[i] = cur->next[i];
        }
        delete cur;
        while (level > 1) {
            if (head->next[level - 1] == tail)
                level--;
            else
                break;
        }
    }
}
[/sourcecode]

remove is similar to insert, except that pointers only need to be maintained on one side: the removed node is deleted outright, so there is no need to fix up its outgoing pointers.

Experiment 1

  • Insert n (from 10000 to 100000) random numbers; repeat 10 times and take the average time for each n.
  • Remove n (from 10000 to 100000) random numbers; repeat 10 times and take the average time for each n.
  • A combination of insertions and deletions of n (from 10000 to 100000) random numbers with a 100% hit rate (never inserting an existing element or removing a missing one); repeat 10 times and take the average time for each n.


The image above shows the result of experiment 1. X axis is n, Y axis is average time cost.

  • The running time grows almost linearly with n for insertion and deletion, which is consistent with O(log n) per operation (O(n log n) in total).
  • Deletion is faster than insertion. This is because insertion has to maintain pointers on both sides of the inserted node, while deletion only touches the nodes pointing to the deleted node (one side).
  • Because of the combined operations, the average size of the skip list stays small in the third experiment, so it runs faster.