TreeList

As we know, array data structure has O(1) access time and O(n) insertion/deletion(kth element); linked list need O(n) for all operations, and O(1) insertion/deletion only when node pointer is given.

We also know that binary search tree and skip-list provide O(log n) for accessing, insertion and deletion.

Now what if we want a data structure, which behaves like a list, which can insert/delete element at kth position efficiently?

One of my friend suggest “blocked linked list”, which is a combination of array and linked list. Assume the input data size N is known, we can use a list of arrays to store data. List has length sqrt(N), each array also has length sqrt(N). Insertion and deletion has complexity O(sqrt(N)), and array is split when length exceed sqrt(N), and adjacent blocks are merged when length smaller than sqrt(N). This data structure is very popular in high school programming contest, where data size is known in advance. However, O(sqrt(N)) is not good enough: we want O(log n) for all operations.

Another method is to use binary search tree: binary search tree with “size” field can find “kth” element in O(log n) time. Elements have a floating point value as key for comparison. When inserting kth element, we first look for element k-1 and element k, and take their mean as key for new element. Thus new element is inserted into kth position. However, this method is not that elegant: size of the list depends on floating point precision.

Finally, comes our “TreeList” data structure. I don’t think I am the first guy who invent it, and I don’t know whether there is any other name for such data structure. The TreeList support 3 operations:

• insert element into kth position
• remove element in kth position
• get kth element

The implementation is based on treap. Actually, any binary search tree can be used. The complexity for all 3 operations are O(log n). Source code is shown below.

When do we need this data structure? Follow my latest post

[sourcecode language=”cpp”]
template<class T>
class TreeList{
// list structure, support 3 operations:
// -insert element into kth position
// -remove element into kth position
// -get kth element
// implemented using treap
struct node{
T value;
int pri,size;
node *left,*right;
node(){}
node(const T& v,node* l,node* r){
value=v;
left=l;
right=r;
size=1;
pri=rand();
}
}*root,*null;
void resize(node* cur){
if(cur!=null)cur->size=cur->left->size+cur->right->size+1;
}
void left_rotate(node* &cur){
node *res=cur->left;
cur->left=res->right;
res->right=cur;
resize(cur);
cur=res;
resize(cur);
}
void right_rotate(node* &cur){
node *res=cur->right;
cur->right=res->left;
res->left=cur;
resize(cur);
cur=res;
resize(cur);
}
void remove_node(node* &cur){
if(cur->left == null && cur->right == null){
cur=null;
}else if(cur->left->pri < cur->right->pri){
left_rotate(cur);
remove_node(cur->right);
resize(cur);
}else{
right_rotate(cur);
remove_node(cur->left);
resize(cur);
}
}

void insert_to_kth_pos(const T& x,node* &cur,int k){
if(cur == null && k==0){
cur = new node(x,null,null);
}else if(k <= cur->left->size){
insert_to_kth_pos(x,cur->left,k);
if(cur->pri > cur->left->pri)
left_rotate(cur);
resize(cur);
}else if(k <= cur->size){
insert_to_kth_pos(x,cur->right,k-cur->left->size-1);
if(cur->pri > cur->right->pri)
right_rotate(cur);
resize(cur);
}
}
void remove_kth_element(node* & cur,int k){
if(k == cur->left->size){
remove_node(cur);
}else if(k < cur->left->size){
remove_kth_element(cur->left,k);
if(cur->pri > cur->left->pri)
left_rotate(cur);
resize(cur);
}else if(k < cur->size){
remove_kth_element(cur->right,k-cur->left->size-1);
if(cur->pri > cur->right->pri)
right_rotate(cur);
resize(cur);
}
}
void print(node* cur){
if(cur == null)return;
print(cur->left);
cout<<cur->value<<" ";
print(cur->right);
}
public:
TreeList (){
null=new node();
null->pri = 0x7fffffff;
null->size = 0;
null->left=null->right=null;
root=null;
}
void print(){
print(root);
cout<<endl;
}
void insert(const T& x,int k){
insert_to_kth_pos(x,root,k);
}
void remove(int k){
remove_kth_element(root,k);
}
T get(int k){
node* cur=root;
while(cur!=null){
if(k < cur->left->size){
cur=cur->left;
}else if(k == cur->left->size){
break;
}else{
k-=cur->left->size+1;
cur=cur->right;
}
}
return cur->value;
}
};
[/sourcecode]

Generator

Warning: this post is very technical. You may find it very hard to understand, or boring.

The description of this problem is simple: given a string pattern, find the expected position of first occurrence of such pattern in random strings.

The problem looks like dynamic programming, because there seems exists some recursive relations between its subproblems.

Let’s first assume there exists a random string generator, which generate random strings letter by letter. Left F[n] be the number of expected letters to be generated before the pattern occur, given that most recently generated n letters matches first n letters of the pattern. For example, assume the pattern is ABCDE, and the random string generator generates ….ABC, then F[3] would be the expected number of letters needed before ABCDE is generated.

We can consider “N-prefix” of the pattern as “state” for DP(described by N), and F[0] would be the answer we need: expected number of letters needed before pattern is observed, given no prefix match(random string generator hasn’t generated anything yet).

Now let’s see how one state transfer into another state. Given that most recent generated N letters match N-prefix of patter, if one more letters is generated, how would be the suffix-prefix match goes? In the best case, it will become N+1, if the newly generated letter matches the N+1th letter in the pattern. In other cases, the state would transfer to some number no larger than N, and that number can be found easily: find the largest prefix-suffix match between generated string and pattern.

Now the recursive relationship between states becomes clear: F[i] = 1+\sum_{c=’A’}^{‘Z’}F[next(i,c)]/26. Base case would be F[pattern.length]=0.
In the original problem, number of letters is a variable, rather than fixed 26. next(i,c) is state transition function.

Now comes another problem: this DP cannot be solved because states graph is not DAG, and no topological order can be found. Actually, this is not a DP problem. After listing out all equations of form F[i] = 1+\sum_{c=’A’}^{‘Z’}F[next(i,c)]/26, we can already solve the linear system: using Gaussian-Elimination.

By now, the problem is mostly solved. However, when I submit the code to the judge, I got “wrong answer”. I guess this is because of precision problem: I use double data type, since the expected length can be real number. However, the required output is integer, thus I cast the result into “long long” data type. Real number data type can store 15-16 significant digits. However, long long data type can represent numbers with almost 20 significant digits. Thus, in extreme case, converting double may lose necessary precision when the answer is very large.

I fixed this problem by using “long double” instead of “double”. To my best knowledge, “long double” is some system dependent data type in C. It has higher precision than normal double: on my machine(gcc), sizeof(long double) gives 12, and actually 80 bits are used to store data, 16 more bits are for padding. I am still not sure how many significant digits “long double” can have, but this problem shows that at least it is useful in some cases.

#include <vector>
#include <iostream>
#include <algorithm>
#include <string>
using namespace std;
const long double EPI=1e-6;
string s;
int N;
long double abs(long double a){return a>0?a:-a;}
int g(int i,char c){
string t=s.substr(0,i)+c;
for(int j=0;j<i+1;j++){
if(t==s.substr(0,t.size()))
return i+1-j;
t=t.substr(1);
}
return 0;
}
void sub_from(vector<long double> & a,vector<long double> &b, long double k){
// a=a-bk
for(int i=0;i<a.size();i++)
a[i]-=b[i]*k;
}
vector<long double> solve_linear_system(vector<vector<long double> >& a){

for(int i=0;i<a.size();i++){
int j;
for(j=i;j<a.size();j++)
if(abs(a[j][i]) > EPI)
break;
swap(a[j],a[i]);
for(j=i+1;j<a.size();j++)
sub_from(a[j],a[i],a[j][i]/a[i][i]);
}
for(int i=a.size()-1;i>=0;i--){
for(int j=i-1;j>=0;j--)
sub_from(a[j],a[i],a[j][i]/a[i][i]);
}
vector<long double> ans;
for(int i=0;i<a.size();i++)
ans.push_back(a[i].back()/a[i][i]);
return ans;
}
long double solve(){
int m=s.size();
vector<vector<long double> > a;
for(int i=0;i<m;i++){
a.push_back(vector<long double>(m+2,0));
a[i][i]=N;
a[i][m+1]=N;
for(int j=0;j<N;j++)
a[i][g(i,'A'+j)]-=1;
}
a.push_back(vector<long double>(m+2,0));
a[m][m]=1; //xm=0
vector<long double> ans=solve_linear_system(a);
return ans[0]; //x0
}
int main(int argc, const char *argv[])
{
int times;
cin>>times;
for(int i=0;i<times;i++){
cin>>N>>s;
if(i)cout<<endl;
cout<<"Case "<<i+1<<":"<<endl<<(long long)solve()<<endl;
}
return 0;
}

My grep

Thanks to Robert Sedgewick(Princeton)’s lecture notes, I just implemented a ‘grep’ program, which perform regular expression matching.
The code is only about 100 lines long, with the help of c++ STL.
Matching is done by converting RE into NFA. The claimed complexity is O(NM), N is the length of string, M is the length of RE.

Now it only support very basic regular expression matching:

a-z,A-Z,0-9 literal meaning
. match any character
* repeat multiple times
() group regular expression
| logic or, must appear inside ()

Can do some improvement in the future:
Support RE like a|b, i.e. | without parentheses. This can be done by adding ( ) around the input.
Add support for RE like [a-z], [^a-z], [abc].(may need a parser for this)
Add support for meta characters like \d: may need an abstract character type, which can match certain set of characters.
To support ^ and $, I think we need another abstract type for input string characters. GDB tips gdb is quite a useful tool. Despite it’s text interface, it can do more than GUI debuggers. I am still learning it, and this post is used as a memo. • ptype is useful to print type, structure(class) definition. This is useful to check user defined data types(sometimes deeply defined in header files). • disassemble is useful to see assembly code of function/statement. • info registers shows registers information. • print a@10 shows 10 elements start from address of variable a. • x command examine memory content. x/8xw$eip displays 8 words since where eip register points to. x/i $eip • shows instruction eip register points to. extern “C” You may have seen following C++ code somewhere. [sourcecode language=”cpp”] extern "C"{ …. } [/sourcecode] Why do we need the extern “C”? The answer is available in this post. The idea is, C++ compiler treat c++ differently from C. A function void f(int i) in C code is translated into _f in assembly, while the same function in C++ code will be translated into __Z1fi(the i at the end means first parameter is i, but I am not sure what does __Z1 seems represent return type). Compiler determin whether the code is C or C++ by looking at file extension only(.c for c file, .cpp for c++ file). C++ code is treated differently because C++ language support function overloading. That is to say, void f(int n) and void f(char c) are two different function because they have different parameter type. Compiler implement this by using different names for these two functions: including parameter type into function name. This method works fine, but when you want to call a C function in your C++ code, and the C function is already compilede as C code, then you have trouble. [sourcecode language=”cpp”] // sqr.c int sqr(int n){ return n*n; } [/sourcecode] [sourcecode language=”cpp”] // main.c #include<stdio.h> int sqr(int n); //declaration int main(){ printf("%d\n",sqr(10)); return 0; } [/sourcecode] The code above works fine. However, if you rename “main.c” into “main.cpp”, then you have problem: compiler(actually linker) complains that “int sqrt(int) is not defined”. This is because, as a c++ function, “sqr” in “main.cpp” is translated into “__Z3sqri”, while as C code, “sqr” in “sqr.c” is translated into “_sqr”. Thus, linker cannot find the function “int sqrt(int)”. To solve this problem, we can use extern “C”{} to enclose sqr function declaration, so that the function name is translated into C style. [sourcecode language=”cpp”] // main.cpp #include<stdio.h> extern "C"{ int sqr(int n); //declaration } int main(){ printf("%d\n",sqr(10)); return 0; } [/sourcecode] Please compile and link above code using g++ instead of gcc, or else you may get “undefined reference to `__gxx_personality_v0′” problem. You may ask why functions declared in “stdio.h” works well for both C and C++ code? You can check out preprocessor’s output: [sourcecode language=”cpp”] //dummy c++ code #include<stdio.h> int main(){ return 0; } [/sourcecode] [sourcecode language=”cpp”] extern "C" { …… int __attribute__((__cdecl__)) fscanf (FILE *, const char *, …) __attribute__ ((__format__ (__scanf__, 2, 3))); int __attribute__((__cdecl__)) printf (const char *, …) __attribute__ ((__format__ (__printf__, 1, 2))); int __attribute__((__cdecl__)) scanf (const char *, …) __attribute__ ((__format__ (__scanf__, 1, 2))); …… } [/sourcecode] The output of preprocessor use extern “C” to enclose all standard C library function declaration. However, you will not be able to find extern “C” if you change file extension to .c. Stanfdard library uses following trick to do this: [sourcecode language=”cpp”] // /usr/include/sys/cdefs.h #ifdef __cplusplus # define __BEGIN_DECLS extern "C" { # define __END_DECLS } #else # define __BEGIN_DECLS # define __END_DECLS #endif [/sourcecode] All C library header files include this “cdefs.h”, and use __BEGIN_DECLS and __END_DECLS to enclose function declarations. Python tips • When write to global variable in function, declare the variable as global ... at the first line of function definition. • Conditions can be chained(!!). 1 < a < 3 checks that a is both less than 3 and more than 1. • If you want to execute python code without starting a new console(in windows), change python file extension’s name to “pyw”. This is useful for GUI applications. • __import__ function can import modules dynamically. • multi parameter function: • def f(*a): all parameters in list a • def f(**a): all parameters in dict a (should be called in this way: f(a=1,b=2)) • When using python’s iterative shell, _ variable stores return value of last command. “Regular Expression” matches all non-prime numbers I found a “regular expression” which matches all strings consists of non-prime number of 1s. ^1?$|^(111*)\1+$This is quite surprising because it can be proven that non-prime number of 1s is not a regular language. That is to say, no regular expression can match that language. However, this regular expression works. Following python code will print out all prime numbers within [0,99] [sourcecode language=”python”] import re filter(lambda n:not re.match(r"^1?$|^(111*)\1+$","1"*n),range(100)) [/sourcecode] So why it works? Let’s look at the regular expression: ^1?$ matches “” and “1”, which is trivial. In ^(111*)\1+\$, (111*) matches 2 or more 1s, so (111*)\1+ matches n*k 1s, k>=2, n>=2, which is exactly all composite numbers.

Then what’s wrong with the theory? Actually, nothing is wrong. The “regular expression” in most programming language is much more expressive than the formal regular expression defined in computational theory. In this case, we do not allow “\1” in formal regular language. Actually, this “\1” may potentially require unbounded amount of memory, because the program need to remember the matched string in “\1”, which can be arbitrarily long. In formal regular language, the memory cost should be bounded when regular expression is given: bounded by the size of corresponding DFA. Thus, “\1” kind of stuff gives regular expression much more power.

Next question is: is “regular expression” in programming language equivalent to Turing Machine?

burrows wheeler algorithm for data compression

Burrows Wheeler algorithm is another popular data compression algorithm. My implementation follows the guideline here.

The consists of three parts:

1. Burrows Wheeler transformation
2. Move to front transformation
3. Huffman conpression
4. key step of this algorithm is Burrows Wheeler transformation. Details will be given describe this later.

Experiment result
data huffman lzw Burrows Wheeler
all_same_char(best case) .139 .048 .165
equal_freq(a-z) .686 .188 0.269
random_printable_char .847 1.089 .870
random_byte 1.169 1.639 1.189
huffman.exe .671 .611 .692
NIPS2009_0300.pdf .997 1.280 1.006
jquery.js .683 .523 0.685
long_cn_utf-8.txt .732 .561 .769
long_cn_unicode.txt .889 .729 0.936
Wikipedia-logo.png 1.043 1.505 1.062
Wallpapers.jpg 1.001 1.247 1.022
lena.bmp 0.94 .957 0.906
compressed file 1.028 1.461 1.043

It’s hard to say whether it’s good or bad. Discussion session comes later.

Nasty C code

Today, my friend Yuan Yuan showed me a piece of C code, which gives different result under different optimization option. I was greatly surprised that such code exists.
His original code is complicated. I minimized the code in order to demonstrate the problem.
[sourcecode language=”cpp”]
int main() {
int i=0;
int a[2]={0};
a[i]=++i;
}
[/sourcecode]
[sourcecode language=”cpp”]
int i=0;
int main() {
int a[2]={0};
a[i]=++i;
}
[/sourcecode]
This two piece of code will give different result on windows 7 Professional, compiled using MingW’s g++ 4.50. More specifically, in the first program, a finally becomes [0,1], while in second one, it is [1,0].
Let’s look at the assembly code generated by compiler:

incl	12(%esp)		# i++;
movl	12(%esp), %eax
movl	12(%esp), %edx
movl	%edx, 4(%esp,%eax,4)	# a[i]=i;
movl	_i, %eax		# int temp=i;
movl	_i, %edx
incl	%edx
movl	%edx, _i		# i++;
movl	_i, %edx
movl	%edx, 8(%esp,%eax,4)	# a[temp]=i;

Comment is equivalent C code. We can see that they are doing completely different things. I am not sure which one is correct behavior of ++ operator, or even the behavior of ++ operator is undefined in this case.

The lesson is: never write code like a[i]=i++.

lzw data compression algorithm

lzw is another popular file compression algorithm. Many image format is based on lzw algorithm to compress data.

The algorithm itself is more complicated than huffman algorithm, but it has better compression ratio in most case.

I implemented the algorithm following this notes and this tutorial. Actually I haven’t fully understand the idea behind the algorithm, although the implementation works. I will write some more after I think through it later. For now, I just post experiment result and comparison with huffman algorithm.

Experiment and Comparison

data huffman lzw
all_same_char(best case) .139 .048
equal_freq(a-z) .686 .188
random_printable_char .847 1.089
random_byte 1.169 1.639
huffman.exe .671 .611
NIPS2009_0300.pdf .997 1.280
jquery.js .683 .523
long_cn_utf-8.txt .732 .561
long_cn_unicode.txt .889 .729
Wikipedia-logo.png 1.043 1.505
Wallpapers.jpg 1.001 1.247
lena.bmp 0.94 .957
compressed file 1.028 1.461

To sum up, good results become better, bad ones become worse. LZW performs well on text data, while performs poorly on compressed file(including image files) and evenly distributed random bytes.

There are many popular LZW variants. Maybe good to try those later.

The Algorithm

Ok. I have think through the algorithm, and I think I get the basic idea.

• Use fixed length code to encode bit sequence. This avoid the wastage of huffman code: in huffman, if there is a prifix code with length 7, then it implies all its 6 prefix cannot be used. Code with fixed number of bites does not have this problem, since prefix constrain is not neede already.
• Encode long binary sequence, if it occurs frequently in the file. LZW is able to learn which sequence occurs frequently during compression.
• LZW is stream based compression algorithm. That is to say, the algorithm can perform compress/decompress while reading from a stream(you don’t need the whole file before you can do any work). This is not true for huffman because huffman need to read file twice for compression.
• Unlike huffman, no much extra data(byte count infomation for huffman) is needed to store in compressed file. If the code length is fixed, then no extra information is needed actually.
• Many variant algorithm of LZW exists.
Compression Psuedo Code

[sourcecode language=”python”]
ST = empty table # a table mapping from BYTE SEQUENCE to INTEGER CODE
for x in 0…255: # initialize table with all single byte sequence
ST[char(x)]=x
S=input byte sequence
codeMax=256 # next code to be inserted into table ST
while S.length > 0:
P=longest prefix of S in ST
code=ST[P]
print code # here is the output
S=S.substring(P.length,S.length) # remove prefix P form S
if S.length > 0:
ST[P+S[0]]=codeMax++ # insert P+S[0] into table.
# This is the learning step.
# If P occurs frequently,
# P+S[0] may also occur frequently
[/sourcecode]

Remark: notice that here we use the whole input sequence instead of reading from a stream. This is for simplicity. It is easy to modify the code to work with sctream model. For the output, you can print the code integer integer in ASCII, but it would be not space efficient. Encoding it in fixed-length binary form gives better result. If the length of bits for each code is not known for decompress algorithm, then this information should be included into compressed file.

Decompress Psuedo Code

[sourcecode language=”python”]
ST = empty table # a table mapping from INTEGER CODE to BYTE SEQUENCE(different from compression algorithm)
for x in 0…255: # initialize table with all single byte sequence
ST[x]=char(x)
S=input code sequence # this is a list of individual code output by compression algorithm
codeMax=256 # next code to be inserted into table ST
PRE=ST[S[0]] # decode the first code first
print PRE
S=S.substring(1,S.lenth) #remove first byte
while S.length > 0:
code=S[0]
if code is in ST:
CUR=ST[code]
print CUR # decode into byte sequence and print
ST[codeMax++]=PRE+CUR[0] # Update table.
# This step is delayed since we only know the next code rather than next byte
else:
CUR=PRE+PRE[0] # current byte sequence
ST[codeMax++]=CUR # this is the tricky step
print CUR

PRE=CUR # current sequence becomes previous one in next iteration
S=S.substring(1,S.lenth) #remove first byte
[/sourcecode]

Remark: the decompresion algorithm construct the reverse mapping from code to byte sequence on hte fly. However, there is one tricky special case caused by the delay update of table. Maybe it would be easier to understand using example.

Assume byte sequence we want to compress is “ABABABA”. Following the compression algorithm, it would produce:

A	-	65

# "AB" added into table, by now, table is:
# A  B  AB
# 65 66 256

B	-	66

# "BA" added into table, by now, table is:
# A  B  AB  BA
# 65 66 256 257

AB	-	256

# "ABA" added into table, by now, table is:
# A  B  AB  BA	ABA
# 65 66 256 257	258

ABA	-	258

# we are done :)

Now the compressed sequence is 65,66,256,258. Let’s decompress it from this sequence

65	-	A

# now PRE = A
# 65 66
# A  B

66	-	B

# AB is added into table
# 65 66 256
# A  B  AB
# now PRE = B

256	-	AB

# BA is added into table
# 65 66 256 257
# A  B  AB  BA
# now PRE = AB

258	-	ABA

# This is the special case
# 258 is not in the table
# it means the "delayed table update" will insert 258.
# So the CUR should be AB+CUR[0]
# Since we know CUR starts with AB, cur[0]==A
# So CUR must be ABA.
# Then ABA is added into table
# 65 66 256 257 228
# A  B  AB  BA  ABA
# We are done :)