In an earlier post, we presented a skip list implementation in C++. In this post, I am going to walk through a skip list implementation in real-world software: LevelDB.

What is LevelDB

A key-value store library open-sourced by Google, borrowing ideas from BigTable. LevelDB is currently used in many applications, such as the Chrome browser. Project page: http://leveldb.org Github: https://github.com/google/leveldb

How is Skip List used in LevelDB

The skip list is used as an in-memory buffer that tracks recent writes, so reads of recently touched keys can be served efficiently. When the buffer is full, its contents are flushed into an SSTable (sorted string table) for persistence.

Why Not Hash Table

A hash table is good for serving reads, but not great for flushing data into an SSTable. As its name indicates, SSTable content consists of sorted byte strings, which makes lookups efficient via binary search. Flushing buffer contents to an SSTable requires traversing them in sorted order, which is hard for a hash table but easy for a skip list.

Why Not Binary Search Tree

A binary search tree (BST) is a good data structure that supports both efficient lookups and efficient traversal in sorted order. I think the biggest advantage of a skip list over a BST is its concurrency properties: a skip list supports concurrent reads while a write is in progress (though writes have to be serialized). I may create a separate post to go into more detail.

Code Structure

There are two key nested classes: Node and Iterator.

Node Code

// Implementation details follow
template<typename Key, class Comparator>
struct SkipList<Key,Comparator>::Node {
  explicit Node(const Key& k) : key(k) { }

  Key const key;

  // Accessors/mutators for links.  Wrapped in methods so we can
  // add the appropriate barriers as necessary.
  Node* Next(int n) {
    assert(n >= 0);
    // Use an 'acquire load' so that we observe a fully initialized
    // version of the returned Node.
    return reinterpret_cast<Node*>(next_[n].Acquire_Load());
  }
  void SetNext(int n, Node* x) {
    assert(n >= 0);
    // Use a 'release store' so that anybody who reads through this
    // pointer observes a fully initialized version of the inserted node.
    next_[n].Release_Store(x);
  }

  // No-barrier variants that can be safely used in a few locations.
  Node* NoBarrier_Next(int n) {
    assert(n >= 0);
    return reinterpret_cast<Node*>(next_[n].NoBarrier_Load());
  }
  void NoBarrier_SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].NoBarrier_Store(x);
  }

 private:
  // Array of length equal to the node height.  next_[0] is lowest level link.
  port::AtomicPointer next_[1];
};

What is AtomicPointer

A pointer object that supports atomic reads and writes. Each atomic read/write can be guarded by a memory barrier, which enforces ordering of memory operations so that changes made by one thread become visible to other threads in a consistent state.

What’s the difference between AtomicPointer and volatile pointer

My guess is that AtomicPointer supports reads and writes both with and without memory barriers, which is more flexible and more efficient when a barrier is unnecessary.
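To make that guess concrete, here is a sketch of what AtomicPointer might look like on top of C++11 std::atomic. The class name and method names are taken from the Node code above; the body is my assumption, since the real LevelDB port layer predates C++11 and uses platform-specific barriers:

```cpp
#include <atomic>

// A sketch of LevelDB's AtomicPointer using std::atomic<void*>.
// Illustrative only; not the actual port/atomic_pointer.h code.
class AtomicPointer {
 public:
  AtomicPointer() : rep_(nullptr) {}
  explicit AtomicPointer(void* p) : rep_(p) {}

  // Barrier-free variants: atomicity only, no ordering guarantees.
  void* NoBarrier_Load() const { return rep_.load(std::memory_order_relaxed); }
  void NoBarrier_Store(void* v) { rep_.store(v, std::memory_order_relaxed); }

  // Acquire load: a reader that sees the pointer also sees all writes
  // the storing thread performed before its release store.
  void* Acquire_Load() const { return rep_.load(std::memory_order_acquire); }
  // Release store: all prior writes become visible before the pointer does.
  void Release_Store(void* v) { rep_.store(v, std::memory_order_release); }

 private:
  std::atomic<void*> rep_;
};
```

The no-barrier variants map naturally onto relaxed atomics, which is exactly the flexibility a plain volatile pointer cannot express.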

Why next_ has length 1

Nodes are constructed using a factory method.

template<typename Key, class Comparator>
typename SkipList<Key,Comparator>::Node*
SkipList<Key,Comparator>::NewNode(const Key& key, int height) {
  char* mem = arena_->AllocateAligned(
      sizeof(Node) + sizeof(port::AtomicPointer) * (height - 1));
  return new (mem) Node(key);
}

This was my first time seeing this syntax: it is C++ placement new, which constructs the Node in pre-allocated memory. Because the allocation reserves room for height pointers, next_ effectively becomes an array of length height even though it is declared with length 1. This hack avoids the space overhead of a vector or other dynamic array, which is neat.
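The trick can be reproduced in isolation. A standalone sketch (DemoNode and NewDemoNode are hypothetical names, and std::malloc stands in for LevelDB's Arena, which is not shown here; note that indexing past a declared-size-1 array is technically outside the C++ standard, though it is exactly what LevelDB does):

```cpp
#include <cstdlib>
#include <new>

// Sketch of the over-allocation trick: declare a 1-element array,
// but allocate extra room so slots up to `height - 1` are usable.
struct DemoNode {
  explicit DemoNode(int k) : key(k) {}
  int const key;
  void* next_[1];  // really `height` slots, thanks to over-allocation
};

DemoNode* NewDemoNode(int key, int height) {
  // Reserve space for the struct plus (height - 1) extra pointers.
  void* mem = std::malloc(sizeof(DemoNode) + sizeof(void*) * (height - 1));
  return new (mem) DemoNode(key);  // placement new: construct in `mem`
}
```

Because DemoNode is trivially destructible, freeing the raw memory with std::free is enough to release a node in this sketch.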

Iterator Code

class Iterator {
 public:
  // Initialize an iterator over the specified list.
  // The returned iterator is not valid.
  explicit Iterator(const SkipList* list);

  // Returns true iff the iterator is positioned at a valid node.
  bool Valid() const;

  // Returns the key at the current position.
  // REQUIRES: Valid()
  const Key& key() const;

  // Advances to the next position.
  // REQUIRES: Valid()
  void Next();

  // Advances to the previous position.
  // REQUIRES: Valid()
  void Prev();

  // Advance to the first entry with a key >= target
  void Seek(const Key& target);

  // Position at the first entry in list.
  // Final state of iterator is Valid() iff list is not empty.
  void SeekToFirst();

  // Position at the last entry in list.
  // Final state of iterator is Valid() iff list is not empty.
  void SeekToLast();

 private:
  const SkipList* list_;
  Node* node_;
  // Intentionally copyable
};

This iterator is different from a traditional iterator in C++ or Java in that it supports seeking to a key.
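The contract of Seek(target), positioning at the first entry with a key >= target, is the same as std::map::lower_bound. A rough analogy (SeekAnalogy is a made-up helper, returning -1 when no such key exists):

```cpp
#include <map>

// Seek(target) in the skip-list iterator behaves like lower_bound:
// position at the first entry whose key is >= target.
int SeekAnalogy(const std::map<int, int>& m, int target) {
  auto it = m.lower_bound(target);  // first key >= target
  return it == m.end() ? -1 : it->first;  // -1 plays the role of !Valid()
}
```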

Difference between mine and LevelDB Skip List implementation

Template and comparator

The biggest difference is probably that LevelDB uses a template and a comparator so the code works with any key type.

Random height

template<typename Key, class Comparator>
int SkipList<Key,Comparator>::RandomHeight() {
  // Increase height with probability 1 in kBranching
  static const unsigned int kBranching = 4;
  int height = 1;
  while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
    height++;
  }
  assert(height > 0);
  assert(height <= kMaxHeight);
  return height;
}

int randomLevel() {
  int res = 1;
  int r = (rand() << 16) + rand();
  while (res < maxLevel && (r & 1)) {
    res++;
    r >>= 1;
  }
  return res;
}

My code generates height 1 with probability 1/2, height 2 with 1/4, height 3 with 1/8, and so on. LevelDB chose a different distribution: height 1 with probability 3/4, height 2 with (1/4)(3/4), height 3 with (1/4)^2(3/4), and so on. This means the LevelDB skip list is more "flat", and probably a bit less efficient for lookups; however, it stores fewer pointers overall, making it more space efficient than my code. The LevelDB implementation calls the random number generator once per iteration, which is less efficient than my version, which calls it once and consumes one bit per iteration.
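Those probabilities are easy to sanity-check by simulation. In this sketch, FractionHeightOne is a made-up helper and std::mt19937 stands in for both RNGs; branching factor 4 models LevelDB's scheme, branching factor 2 models the coin-flip scheme:

```cpp
#include <random>

// Simulate "grow with probability 1/kBranching" height generation
// and return the fraction of samples that end up with height 1.
double FractionHeightOne(int kBranching, int kMaxHeight, int samples) {
  std::mt19937 rng(42);  // fixed seed for reproducibility
  std::uniform_int_distribution<int> dist(0, kBranching - 1);
  int ones = 0;
  for (int s = 0; s < samples; s++) {
    int height = 1;
    while (height < kMaxHeight && dist(rng) == 0) height++;
    if (height == 1) ones++;
  }
  return static_cast<double>(ones) / samples;
}
```

With branching 4, roughly three quarters of the nodes stay at height 1; with branching 2, roughly half do, which matches P(height = h) = (1 - 1/b)(1/b)^(h-1).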

Search

template<typename Key, class Comparator>
typename SkipList<Key,Comparator>::Node*
SkipList<Key,Comparator>::FindGreaterOrEqual(const Key& key, Node** prev) const {
  Node* x = head_;
  int level = GetMaxHeight() - 1;
  while (true) {
    Node* next = x->Next(level);
    if (KeyIsAfterNode(key, next)) {
      // Keep searching in this list
      x = next;
    } else {
      if (prev != nullptr) prev[level] = x;
      if (level == 0) {
        return next;
      } else {
        // Switch to next list
        level--;
      }
    }
  }
}

Node* search(const T& x, vector<Node*>* update = 0) {
  Node* cur = head;
  for (int i = level - 1; i >= 0; i--) {
    while (cur->next[i] != tail && cur->next[i]->value < x)
      cur = cur->next[i];
    if (update) update->at(i) = cur;
  }
  cur = cur->next[0];
  if (cur == tail || cur->value != x)
    cur = NULL;
  return cur;
}

There is really not much difference, except that the LevelDB version returns the first element that is greater than or equal to the given key, while mine returns the exact match or NULL.

Insert

template<typename Key, class Comparator>
void SkipList<Key,Comparator>::Insert(const Key& key) {
  // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
  // here since Insert() is externally synchronized.
  Node* prev[kMaxHeight];
  Node* x = FindGreaterOrEqual(key, prev);

  // Our data structure does not allow duplicate insertion
  assert(x == nullptr || !Equal(key, x->key));

  int height = RandomHeight();
  if (height > GetMaxHeight()) {
    for (int i = GetMaxHeight(); i < height; i++) {
      prev[i] = head_;
    }
    //fprintf(stderr, "Change height from %d to %d\n", max_height_, height);

    // It is ok to mutate max_height_ without any synchronization
    // with concurrent readers. A concurrent reader that observes
    // the new value of max_height_ will see either the old value of
    // new level pointers from head_ (nullptr), or a new value set in
    // the loop below. In the former case the reader will
    // immediately drop to the next level since nullptr sorts after all
    // keys. In the latter case the reader will use the new node.
    max_height_.NoBarrier_Store(reinterpret_cast<void*>(height));
  }

  x = NewNode(key, height);
  for (int i = 0; i < height; i++) {
    // NoBarrier_SetNext() suffices since we will add a barrier when
    // we publish a pointer to "x" in prev[i].
    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
    prev[i]->SetNext(i, x);
  }
}

void insert(const T& x) {
  vector<Node*> update(level);
  Node* cur = search(x, &update);
  if (cur) {
    cur->value = x;
  } else {
    Node* newNode = new Node(x);
    int v = randomLevel();
    if (v > level) {
      level = v;
      update.resize(level, head);
    }
    newNode->next.resize(v);
    for (int i = 0; i < v; i++) {
      newNode->next[i] = update[i]->next[i];
      update[i]->next[i] = newNode;
    }
  }
}

There is not much difference, except that the LevelDB version uses the atomic pointer methods to perform reads and writes with or without memory barriers, which makes concurrent reads safe while an insert is in progress.
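The publication pattern in the insert loop, relaxed stores to initialize the new node followed by one release store to make it visible, can be sketched in isolation. LinkNode and PublishAfter are hypothetical names, std::atomic stands in for AtomicPointer, and like LevelDB's Insert() this assumes writers are externally synchronized:

```cpp
#include <atomic>

// Sketch of the publication pattern in Insert(): fully initialize the
// new node with relaxed stores, then publish it with one release store.
struct LinkNode {
  int key;
  std::atomic<LinkNode*> next;
};

// Insert `node` after `prev` in a singly linked list that is being
// read concurrently. Writers must be externally synchronized.
void PublishAfter(LinkNode* prev, LinkNode* node) {
  // Step 1: link the new node; no barrier needed, since no reader
  // can reach `node` yet.
  node->next.store(prev->next.load(std::memory_order_relaxed),
                   std::memory_order_relaxed);
  // Step 2: the release store publishes the node; a reader that
  // acquire-loads this pointer is guaranteed to observe node->key
  // and node->next fully set.
  prev->next.store(node, std::memory_order_release);
}
```

This is why NoBarrier_SetNext suffices for the new node's own links: the single release store in prev[i]->SetNext(i, x) is the one barrier that matters.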

How Skip List is used in LevelDB

A MemTable abstraction layer is built on top of the skip list and exposes a key-value store interface. The skip list stores each entry as a concatenated key-value byte string (with length prefixes), and because entries are kept in key order, MemTable data is sorted by key.
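A simplified sketch of this length-prefixed layout, using fixed 32-bit little-endian-in-memory lengths instead of LevelDB's varint32 and omitting the 8-byte sequence-number tag (EncodeEntry and DecodeKey are made-up helpers):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Simplified MemTable entry encoding: [key_len][key][val_len][value].
// Real LevelDB uses varint32 lengths and appends an 8-byte tag to keys.
std::string EncodeEntry(const std::string& key, const std::string& value) {
  std::string buf;
  uint32_t klen = key.size(), vlen = value.size();
  buf.append(reinterpret_cast<const char*>(&klen), sizeof(klen));
  buf.append(key);
  buf.append(reinterpret_cast<const char*>(&vlen), sizeof(vlen));
  buf.append(value);
  return buf;
}

// Decode the key back out of an encoded entry.
std::string DecodeKey(const std::string& entry) {
  uint32_t klen;
  std::memcpy(&klen, entry.data(), sizeof(klen));
  return entry.substr(sizeof(klen), klen);
}
```

Because each entry is one self-describing byte string, the skip list only needs a comparator that decodes the key prefix, which is what MemTable's KeyComparator does in the real code.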

class MemTableIterator: public Iterator {
 public:
  explicit MemTableIterator(MemTable::Table* table) : iter_(table) { }

  virtual bool Valid() const { return iter_.Valid(); }
  virtual void Seek(const Slice& k) { iter_.Seek(EncodeKey(&tmp_, k)); }
  virtual void SeekToFirst() { iter_.SeekToFirst(); }
  virtual void SeekToLast() { iter_.SeekToLast(); }
  virtual void Next() { iter_.Next(); }
  virtual void Prev() { iter_.Prev(); }
  virtual Slice key() const { return GetLengthPrefixedSlice(iter_.key()); }
  virtual Slice value() const {
    Slice key_slice = GetLengthPrefixedSlice(iter_.key());
    return GetLengthPrefixedSlice(key_slice.data() + key_slice.size());
  }
  .....
};

void MemTable::Add(SequenceNumber s, ValueType type,
                   const Slice& key,
                   const Slice& value) {
  // Format of an entry is concatenation of:
  //  key_size     : varint32 of internal_key.size()
  //  key bytes    : char[internal_key.size()]
  //  value_size   : varint32 of value.size()
  //  value bytes  : char[value.size()]
  size_t key_size = key.size();
  size_t val_size = value.size();
  size_t internal_key_size = key_size + 8;
  const size_t encoded_len =
      VarintLength(internal_key_size) + internal_key_size +
      VarintLength(val_size) + val_size;
  char* buf = arena_.Allocate(encoded_len);
  char* p = EncodeVarint32(buf, internal_key_size);
  memcpy(p, key.data(), key_size);
  p += key_size;
  EncodeFixed64(p, (s << 8) | type);
  p += 8;
  p = EncodeVarint32(p, val_size);
  memcpy(p, value.data(), val_size);
  assert((p + val_size) - buf == encoded_len);
  table_.Insert(buf);
}

bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Slice memkey = key.memtable_key();
  Table::Iterator iter(&table_);
  iter.Seek(memkey.data());
  if (iter.Valid()) {
    // entry format is:
    //    klength  varint32
    //    userkey  char[klength]
    //    tag      uint64
    //    vlength  varint32
    //    value    char[vlength]
    // Check that it belongs to same user key. We do not check the
    // sequence number since the Seek() call above should have skipped
    // all entries with overly large sequence numbers.
    const char* entry = iter.key();
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry+5, &key_length);
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8),
            key.user_key()) == 0) {
      // Correct user key
      const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
      switch (static_cast<ValueType>(tag & 0xff)) {
        case kTypeValue: {
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());
          return true;
        }
        case kTypeDeletion:
          *s = Status::NotFound(Slice());
          return true;
      }
    }
  }
  return false;
}

The memtable serves as a "buffer" that stores recent writes. Flushing the buffer to disk is done by traversing the memtable with its iterator.

Status BuildTable(const std::string& dbname,
                  Env* env,
                  const Options& options,
                  TableCache* table_cache,
                  Iterator* iter,
                  FileMetaData* meta) {
  ...
  iter->SeekToFirst();

  std::string fname = TableFileName(dbname, meta->number);
  if (iter->Valid()) {
    WritableFile* file;
    s = env->NewWritableFile(fname, &file);
    if (!s.ok()) {
      return s;
    }

    TableBuilder* builder = new TableBuilder(options, file);
    meta->smallest.DecodeFrom(iter->key());
    for (; iter->Valid(); iter->Next()) {
      Slice key = iter->key();
      meta->largest.DecodeFrom(key);
      builder->Add(key, iter->value());
    }
    ....
  }
}

A DB read first checks the memtable; if the key is not found there, it falls back to the SSTable files.

Status DBImpl::Get(const ReadOptions& options,
                   const Slice& key,
                   std::string* value) {
  ....
  MemTable* mem = mem_;
  MemTable* imm = imm_;
  ....
  // First look in the memtable, then in the immutable memtable (if any).
  LookupKey lkey(key, snapshot);
  if (mem->Get(lkey, value, &s)) {
    // Done
  } else if (imm != NULL && imm->Get(lkey, value, &s)) {
    // Done
  } else {
    s = current->Get(options, lkey, value, &stats);
    have_stat_update = true;
  }
  ....
}

Note that there are two memtables: mem_ and imm_. mem_ is the buffer we talked about; imm_ is a memtable in the process of being flushed to disk, if there is one. This lets writes proceed while a full memtable is being flushed to disk.