BitTree: a more powerful bitset

Recently I ran into a problem that asks for a data structure that efficiently allocates and recycles integer ids in [0, MAX_ID). Each id can be allocated again after being recycled. The data structure runs on a single machine in a single thread. The goal is to optimize both time and memory.

An intuitive solution is to use a set to track recycled ids, and a counter to track the maximum id allocated so far.

class Allocator {
  private static final int MAX_ID = 100000;

  private final Set<Integer> recycled = new HashSet<>();
  // must not be final: it is incremented on allocation
  private int maxAllocated = 0;

  int allocate() {
    if (maxAllocated < MAX_ID) {
      return maxAllocated++;
    }
    if (!recycled.isEmpty()) {
      int id = recycled.iterator().next();
      recycled.remove(id);
      return id;
    }
    throw new RuntimeException("...");
  }

  void recycle(int id) {
    if (id >= maxAllocated || recycled.contains(id)) {
      throw new RuntimeException("...");
    }
    recycled.add(id);
  }
}

The time and space complexity of this code is clearly optimal: all operations take O(1) expected time, and tracking recycled ids in memory needs O(n) space in the worst case. However, it is possible to optimize the constant factors.

Notice that all ids fall in [0, MAX_ID), which means we can easily replace the hash set with a bitset, which is more space efficient in the worst case. However, one challenge is that finding an element in a bitset requires traversing the whole set in the worst case, which is O(n). Is there a data structure that is as compact as a bitset, while allowing you to efficiently find an element in the set? The answer is yes.

Imagine a full binary tree whose leaf nodes are labeled with the numbers 0 to n-1. Each node has one of two colors: blue or red. All nodes are blue at the beginning. This tree can be used as a set of integers in [0, n).

A red leaf node indicates that the corresponding number is in the set; a blue leaf node means the opposite.

A red internal node indicates there is some red leaf in its subtree; blue means the opposite.

It is easy to insert or remove a number, and it is also easy to get one number from the set. All operations take O(log n) time. When n is INT_MAX, the height of the tree is 32, which can usually be treated as constant.

We don't want to implement this data structure with a traditional binary tree node class because of its memory overhead. Two key observations give us a compact implementation: each node has 2 states, which fits into 1 bit, and the binary tree is full. So we can use a heap-like array representation of the binary tree where each element is a single bit, which means we just need a bitset with around 2*n bits to store the whole tree.

import java.util.NoSuchElementException;

class BitTree {

    private final int[] arr;
    private final int height;
    // starting index of leaf bits
    private final long base;

    // stores int value in [0, n)
    BitTree(int n) {
        // find the number of bits needed to encode the tree
        long size = 1;
        int height = 0;
        while (size < n) {
            size <<= 1;
            height++;
        }
        size *= 2;
        // number of ints in array
        int len = Math.max(1, (int) (size >> 5));
        arr = new int[len];
        this.height = height;
        this.base = 1L << height;
    }

    void insert(int v) {
        long cur = v + base;
        while (cur >= 1) {
            set(cur, true);
            cur >>= 1;
        }
    }

    void remove(int v) {
        set(v + base, false);
        long cur = (v + base) >> 1;
        while (cur >= 1) {
            set(cur, get(cur << 1) || get((cur << 1) + 1));
            cur >>= 1;
        }
    }

    int getElement() {
        if (!get(1)) {
            throw new NoSuchElementException();
        }
        int cur = 1;
        for (int i = 0; i < height; i++) {
            cur <<= 1;
            if (!get(cur)) {
                cur++;
            }
        }
        return (int) (cur - base);
    }

    boolean isEmpty() {
        return !get(1);
    }

    boolean contains(int v) {
        return get(base + v);
    }

    private boolean get(long i) {
        int idx = (int) (i >> 5);
        int offset = (int) (i & 31);
        return (arr[idx] & (1 << offset)) != 0;
    }

    private void set(long i, boolean v) {
        int idx = (int) (i >> 5);
        int offset = (int) (i & 31);
        if (v) {
            arr[idx] |= 1 << offset;
        } else {
            arr[idx] &= ~(1 << offset);
        }
    }
}

The memory usage of BitTree is around 2x to 4x that of a traditional bitset (2x from doubling the number of bits for internal nodes, up to another 2x from rounding the leaf count up to a power of two), which is still very compact.

getElement() returns the smallest element in the set. It is easy to make it return the largest element instead: when walking down, prefer the right child.

It is also not hard to find the successor of a number in the set in O(log n) time, which means traversing this set can be done much more efficiently than with a traditional bitset.
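To make the successor idea concrete, here is a minimal sketch (not part of the original BitTree class; for readability it stores node colors in a boolean[] instead of packed bits, but the traversal logic is identical):

```java
// Sketch of successor(v): the smallest element strictly greater than v,
// found in O(log n). Uses boolean[] instead of packed bits for clarity.
class BitTreeSketch {
  private final boolean[] node;  // heap-style: children of i are 2i and 2i+1
  private final int base;        // index of the leaf for value 0

  BitTreeSketch(int n) {
    int size = 1;
    while (size < n) size <<= 1;
    node = new boolean[size * 2];
    base = size;
  }

  void insert(int v) {
    for (int cur = v + base; cur >= 1; cur >>= 1) node[cur] = true;
  }

  // Smallest element > v, or -1 if there is none.
  int successor(int v) {
    int cur = v + base;
    // Walk up until we are a left child whose right sibling subtree is red.
    while (cur > 1) {
      if ((cur & 1) == 0 && node[cur + 1]) { cur++; break; }
      cur >>= 1;
    }
    if (cur == 1) return -1;  // nothing to the right of v
    // Walk down, always taking the leftmost red child.
    while (cur < base) {
      cur <<= 1;
      if (!node[cur]) cur++;
    }
    return cur - base;
  }
}
```

Finding the predecessor is symmetric: look for a red left sibling on the way up, then descend preferring the right child.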

Skip List in LevelDB

In an earlier post, we presented a skip list implementation in C++.

I am going to go through a skip list implementation in real-world software: LevelDB.

What is LevelDB

A key-value store library open-sourced by Google. It borrows ideas from BigTable.

LevelDB is currently used in many applications, such as the Chrome browser.

Project page: http://leveldb.org

Github: https://github.com/google/leveldb

How is Skip List used in LevelDB

The skip list is used as a buffer that tracks recent writes. The buffer efficiently serves reads for recently touched keys. When the buffer is full, its content is flushed into an SSTable (sorted string table) for persistence.

Why Not Hash Table

A hash table is good for serving reads, but not great for flushing data into an SSTable. As its name indicates, SSTable content is sorted byte strings, which is good for looking things up with binary search. Flushing buffer content to an SSTable requires traversing the buffer in sorted order, which is hard for a hash table, but easy for a skip list.

Why Not Binary Search Tree

A binary search tree (BST) is a good data structure that supports efficient lookup and efficient traversal in sorted order. I think the biggest advantage of a skip list over a BST is its concurrency properties: a skip list supports concurrent reads while a write is ongoing (though writes have to be serialized). I may create a separate post to go into more details.

Code Structure

There are two key nested classes: Node and Iterator.

Node Code

// Implementation details follow
template<typename Key, class Comparator>
struct SkipList<Key,Comparator>::Node {
  explicit Node(const Key& k) : key(k) { }

  Key const key;

  // Accessors/mutators for links.  Wrapped in methods so we can
  // add the appropriate barriers as necessary.
  Node* Next(int n) {
    assert(n >= 0);
    // Use an 'acquire load' so that we observe a fully initialized
    // version of the returned Node.
    return reinterpret_cast<Node*>(next_[n].Acquire_Load());
  }
  void SetNext(int n, Node* x) {
    assert(n >= 0);
    // Use a 'release store' so that anybody who reads through this
    // pointer observes a fully initialized version of the inserted node.
    next_[n].Release_Store(x);
  }

  // No-barrier variants that can be safely used in a few locations.
  Node* NoBarrier_Next(int n) {
    assert(n >= 0);
    return reinterpret_cast<Node*>(next_[n].NoBarrier_Load());
  }
  void NoBarrier_SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].NoBarrier_Store(x);
  }

 private:
  // Array of length equal to the node height.  next_[0] is lowest level link.
  port::AtomicPointer next_[1];
};

What is AtomicPointer

A pointer object that supports atomic reads and writes. Each atomic read/write can be guarded by a memory barrier, which constrains how the CPU reorders memory accesses so that changes become visible to other threads in a well-defined order.

What’s the difference between AtomicPointer and volatile pointer

My guess is that AtomicPointer supports reads/writes both with and without memory barriers, which is more flexible and more efficient when a barrier is unnecessary.
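As a rough analogy in Java (this is not LevelDB code): since Java 9, AtomicReference exposes both plain and acquire/release access modes, which mirrors the barrier/no-barrier split in AtomicPointer:

```java
import java.util.concurrent.atomic.AtomicReference;

// Rough Java analogy of LevelDB's AtomicPointer (not LevelDB code; Java 9+).
// setRelease/getAcquire order surrounding memory accesses the way
// Release_Store/Acquire_Load do; setPlain/getPlain skip the ordering,
// like the NoBarrier_* variants.
class Link<T> {
  private final AtomicReference<T> next_ = new AtomicReference<>();

  T next()                    { return next_.getAcquire(); }  // Acquire_Load
  void setNext(T x)           { next_.setRelease(x); }        // Release_Store
  T noBarrierNext()           { return next_.getPlain(); }    // NoBarrier_Load
  void noBarrierSetNext(T x)  { next_.setPlain(x); }          // NoBarrier_Store
}
```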

Why next_ has length 1

Nodes are constructed using factory method.

template<typename Key, class Comparator>
typename SkipList<Key,Comparator>::Node*
SkipList<Key,Comparator>::NewNode(const Key& key, int height) {
  char* mem = arena_->AllocateAligned(
      sizeof(Node) + sizeof(port::AtomicPointer) * (height - 1));
  return new (mem) Node(key);
}

It was my first time seeing this syntax, but this is C++ placement new: it constructs a Node object in pre-allocated memory, with next_ overlapping an array of length height at the end of the allocation. This hack avoids the space overhead of vector or other dynamic arrays, which is neat.

Iterator Code

class Iterator {
  public:
   // Initialize an iterator over the specified list.
   // The returned iterator is not valid.
   explicit Iterator(const SkipList* list);

   // Returns true iff the iterator is positioned at a valid node.
   bool Valid() const;

   // Returns the key at the current position.
   // REQUIRES: Valid()
   const Key& key() const;

   // Advances to the next position.
   // REQUIRES: Valid()
   void Next();

   // Advances to the previous position.
   // REQUIRES: Valid()
   void Prev();

   // Advance to the first entry with a key >= target
   void Seek(const Key& target);

   // Position at the first entry in list.
   // Final state of iterator is Valid() iff list is not empty.
   void SeekToFirst();

   // Position at the last entry in list.
   // Final state of iterator is Valid() iff list is not empty.
   void SeekToLast();

  private:
   const SkipList* list_;
   Node* node_;
   // Intentionally copyable
 };

This iterator is different from a traditional iterator in C++ or Java because it supports seek operations.

Difference between mine and LevelDB Skip List implementation

Template and comparator

The biggest difference is probably that LevelDB uses a template and a comparator to make the code work with any key type.

Random height

template<typename Key, class Comparator>
int SkipList<Key,Comparator>::RandomHeight() {
  // Increase height with probability 1 in kBranching
  static const unsigned int kBranching = 4;
  int height = 1;
  while (height < kMaxHeight && ((rnd_.Next() % kBranching) == 0)) {
    height++;
  }
  assert(height > 0);
  assert(height <= kMaxHeight);
  return height;
}
int randomLevel(){
  int res=1;
  int r=(rand()<<16)+rand();
  while(res<maxLevel && (r&1)){
    res++;
    r>>=1;
  }
  return res;
}

My code generates height 1 with probability 1/2, height 2 with 1/4, height 3 with 1/8, and so on.

LevelDB chose a different distribution: height 1 with probability 3/4, height 2 with (1/4)·(3/4), height 3 with (1/4)^2·(3/4), and so on.

This means a LevelDB skip list is more "flat", and probably less efficient for lookups. However, it stores fewer pointers overall, and is more space efficient than my code.

The LevelDB implementation calls the random number generator once per loop iteration; my version draws all its random bits up front, which is slightly more efficient.
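The space claim can be checked with a small calculation: with per-level promotion probability p, the node height is geometrically distributed, so the expected number of pointers per node is 1/(1-p) if we ignore the height cap. That gives 2 pointers per node for my p = 1/2 and only 4/3 for LevelDB's p = 1/4. A quick sketch that verifies this both analytically and by simulation:

```java
import java.util.Random;

// Expected skip-list node height for promotion probability p, height cap
// ignored: E[height] = sum over h>=1 of P(height >= h)
//                    = sum over h>=0 of p^h = 1 / (1 - p).
class HeightStats {
  static double expectedHeight(double p) {
    return 1.0 / (1.0 - p);  // mean of the geometric distribution
  }

  // Empirical check: draw many random heights and average them.
  static double simulate(double p, int trials, long seed) {
    Random rnd = new Random(seed);
    long total = 0;
    for (int i = 0; i < trials; i++) {
      int h = 1;
      while (rnd.nextDouble() < p) h++;
      total += h;
    }
    return (double) total / trials;
  }
}
```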

Search

template<typename Key, class Comparator>
typename SkipList<Key,Comparator>::Node* SkipList<Key,Comparator>::FindGreaterOrEqual(const Key& key, Node** prev)
    const {
  Node* x = head_;
  int level = GetMaxHeight() - 1;
  while (true) {
    Node* next = x->Next(level);
    if (KeyIsAfterNode(key, next)) {
      // Keep searching in this list
      x = next;
    } else {
      if (prev != nullptr) prev[level] = x;
      if (level == 0) {
        return next;
      } else {
        // Switch to next list
        level--;
      }
    }
  }
}
Node* search(const T& x,vector<Node*>* update=0){
  Node* cur=head;
  for(int i=level-1;i>=0;i--){
    while(cur->next[i]!=tail &&
        cur->next[i]->value < x)
      cur=cur->next[i];
    if(update)update->at(i)=cur;
  }
  cur=cur->next[0];
  if(cur==tail || cur->value != x)
    cur = NULL;
  return cur;
}

There is not much difference, except that the LevelDB version returns the first element that is greater than or equal to the given key, while mine returns the exact match (or NULL).

Insert

template<typename Key, class Comparator>
void SkipList<Key,Comparator>::Insert(const Key& key) {
  // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
  // here since Insert() is externally synchronized.
  Node* prev[kMaxHeight];
  Node* x = FindGreaterOrEqual(key, prev);

  // Our data structure does not allow duplicate insertion
  assert(x == nullptr || !Equal(key, x->key));

  int height = RandomHeight();
  if (height > GetMaxHeight()) {
    for (int i = GetMaxHeight(); i < height; i++) {
      prev[i] = head_;
    }
    //fprintf(stderr, "Change height from %d to %d\n", max_height_, height);

    // It is ok to mutate max_height_ without any synchronization
    // with concurrent readers.  A concurrent reader that observes
    // the new value of max_height_ will see either the old value of
    // new level pointers from head_ (nullptr), or a new value set in
    // the loop below.  In the former case the reader will
    // immediately drop to the next level since nullptr sorts after all
    // keys.  In the latter case the reader will use the new node.
    max_height_.NoBarrier_Store(reinterpret_cast<void*>(height));
  }

  x = NewNode(key, height);
  for (int i = 0; i < height; i++) {
    // NoBarrier_SetNext() suffices since we will add a barrier when
    // we publish a pointer to "x" in prev[i].
    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
    prev[i]->SetNext(i, x);
  }
}
void insert(const T& x){
  vector<Node*> update(level);
  Node* cur=search(x,&update);
  if(cur){
    cur->value=x;
  }else{
    Node* newNode=new Node(x);
    int v=randomLevel();
    if(v > level){
      level=v;
      update.resize(level,head);
    }
    newNode->next.resize(v);
    for(int i=0;i<v;i++) {
      newNode->next[i]=update[i]->next[i];
      update[i]->next[i]=newNode;
    }
  }
}

There is not much difference, except that the LevelDB version uses AtomicPointer methods to perform reads/writes with or without memory barriers, to make concurrent reads safe.

How Skip List is used in LevelDB

A MemTable abstraction layer built on top of the skip list exposes a key-value store interface. The skip list stores concatenated key-value byte strings (tagged by length), and the skip list's ordering guarantees that MemTable data is sorted by key.

class MemTableIterator: public Iterator {
 public:
  explicit MemTableIterator(MemTable::Table* table) : iter_(table) { }

  virtual bool Valid() const { return iter_.Valid(); }
  virtual void Seek(const Slice& k) { iter_.Seek(EncodeKey(&tmp_, k)); }
  virtual void SeekToFirst() { iter_.SeekToFirst(); }
  virtual void SeekToLast() { iter_.SeekToLast(); }
  virtual void Next() { iter_.Next(); }
  virtual void Prev() { iter_.Prev(); }
  virtual Slice key() const { return GetLengthPrefixedSlice(iter_.key()); }
  virtual Slice value() const {
    Slice key_slice = GetLengthPrefixedSlice(iter_.key());
    return GetLengthPrefixedSlice(key_slice.data() + key_slice.size());
  }
  .....
}
void MemTable::Add(SequenceNumber s, ValueType type,
                   const Slice& key,
                   const Slice& value) {
  // Format of an entry is concatenation of:
  //  key_size     : varint32 of internal_key.size()
  //  key bytes    : char[internal_key.size()]
  //  value_size   : varint32 of value.size()
  //  value bytes  : char[value.size()]
  size_t key_size = key.size();
  size_t val_size = value.size();
  size_t internal_key_size = key_size + 8;
  const size_t encoded_len =
      VarintLength(internal_key_size) + internal_key_size +
      VarintLength(val_size) + val_size;
  char* buf = arena_.Allocate(encoded_len);
  char* p = EncodeVarint32(buf, internal_key_size);
  memcpy(p, key.data(), key_size);
  p += key_size;
  EncodeFixed64(p, (s << 8) | type);
  p += 8;
  p = EncodeVarint32(p, val_size);
  memcpy(p, value.data(), val_size);
  assert((p + val_size) - buf == encoded_len);
  table_.Insert(buf);
}
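The EncodeVarint32 call above uses the standard base-128 varint from protocol buffers: seven payload bits per byte, with the high bit set on every byte except the last. A minimal sketch of that scheme (this mirrors the protobuf varint format; it is not LevelDB's actual code):

```java
// Minimal sketch of base-128 varint32 encode/decode, the scheme used for
// key_size and value_size in the memtable entry format above.
class Varint {
  // Writes v into buf starting at offset; returns the number of bytes used.
  static int encode(byte[] buf, int offset, int v) {
    int pos = offset;
    while ((v & ~0x7F) != 0) {                  // more than 7 bits remain
      buf[pos++] = (byte) ((v & 0x7F) | 0x80);  // set continuation bit
      v >>>= 7;
    }
    buf[pos++] = (byte) v;                      // last byte: high bit clear
    return pos - offset;
  }

  static int decode(byte[] buf, int offset) {
    int result = 0, shift = 0, pos = offset;
    while (true) {
      int b = buf[pos++] & 0xFF;
      result |= (b & 0x7F) << shift;
      if ((b & 0x80) == 0) return result;       // continuation bit clear: done
      shift += 7;
    }
  }
}
```

Small values cost a single byte, which is why it is a good fit for length prefixes.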
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Slice memkey = key.memtable_key();
  Table::Iterator iter(&table_);
  iter.Seek(memkey.data());
  if (iter.Valid()) {
    // entry format is:
    //    klength  varint32
    //    userkey  char[klength]
    //    tag      uint64
    //    vlength  varint32
    //    value    char[vlength]
    // Check that it belongs to same user key.  We do not check the
    // sequence number since the Seek() call above should have skipped
    // all entries with overly large sequence numbers.
    const char* entry = iter.key();
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry+5, &key_length);
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8),
            key.user_key()) == 0) {
      // Correct user key
      const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
      switch (static_cast<ValueType>(tag & 0xff)) {
        case kTypeValue: {
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());
          return true;
        }
        case kTypeDeletion:
          *s = Status::NotFound(Slice());
          return true;
      }
    }
  }
  return false;
}

The memtable serves as a buffer that stores recent writes.

Flushing the buffer onto disk is done by traversing the memtable using the iterator.

Status BuildTable(const std::string& dbname,
                  Env* env,
                  const Options& options,
                  TableCache* table_cache,
                  Iterator* iter,
                  FileMetaData* meta) {
...
  iter->SeekToFirst();

  std::string fname = TableFileName(dbname, meta->number);
  if (iter->Valid()) {
    WritableFile* file;
    s = env->NewWritableFile(fname, &file);
    if (!s.ok()) {
      return s;
    }

    TableBuilder* builder = new TableBuilder(options, file);
    meta->smallest.DecodeFrom(iter->key());
    for (; iter->Valid(); iter->Next()) {
      Slice key = iter->key();
      meta->largest.DecodeFrom(key);
      builder->Add(key, iter->value());
    }
....
  }
}

A DB read first looks in the memtable; if the key is not found there, it reads from the SSTable files.

Status DBImpl::Get(const ReadOptions& options,
                   const Slice& key,
                   std::string* value) {
....
  MemTable* mem = mem_;
  MemTable* imm = imm_;
....
    // First look in the memtable, then in the immutable memtable (if any).
    LookupKey lkey(key, snapshot);
    if (mem->Get(lkey, value, &s)) {
      // Done
    } else if (imm != NULL && imm->Get(lkey, value, &s)) {
      // Done
    } else {
      s = current->Get(options, lkey, value, &stats);
      have_stat_update = true;
    }
....
}

Note that there are two memtables: mem_ and imm_.

mem_ is the buffer we talked about.

imm_ is a memtable that is in the process of being flushed to disk, if there is one. Keeping it separate allows new writes to go to mem_ while imm_ is being flushed.

Fancy-coder on raspberry pi

I created this blog to share thoughts and learning. This blog is hosted on a poor Raspberry Pi sitting under my table to save on electricity bills. Please bear with me if you find the site slow.

I imported some posts I wrote in school which I still find interesting. Source code in existing posts is not pretty-printed for some unknown reason, and I am too lazy to fix it.

GDB tips

gdb is quite a useful tool. Despite its text interface, it can do more than GUI debuggers. I am still learning it, and this post serves as a memo.

  • ptype prints a type or structure (class) definition. This is useful for checking user-defined data types (sometimes deeply buried in header files).
  • disassemble shows the assembly code of a function/statement.
  • info registers shows register contents.
  • print a@10 shows 10 elements starting from the address of variable a.
  • The x command examines memory content. x/8xw $eip displays 8 words starting where the eip register points; x/i $eip shows the instruction eip points to.

“Regular Expression” matches all non-prime numbers

I found a "regular expression" which matches exactly the strings consisting of a non-prime number of 1s.
^1?$|^(111*)\1+$

This is quite surprising, because it can be proven that the language of strings with a non-prime number of 1s is not regular. That is to say, no regular expression can match exactly that language.

However, this regular expression works.
The following Python code produces all prime numbers in [0, 99]:

import re
filter(lambda n:not re.match(r"^1?$|^(111*)\1+$","1"*n),range(100))

So why does it work? Let's look at the regular expression: ^1?$ matches "" and "1", which is trivial. In ^(111*)\1+$, (111*) matches 2 or more 1s, so (111*)\1+ matches n*k 1s with n>=2 and k>=2, which is exactly all composite numbers.
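The same pattern works in Java, whose regex engine also supports backreferences; a quick check (assumes Java 11+ for String.repeat):

```java
import java.util.regex.Pattern;

// The non-prime-matching pattern, checked from Java. The pattern matches
// strings of n ones exactly when n is not prime, so primality is the
// negation of a match.
class NonPrimeRegex {
  static final Pattern P = Pattern.compile("^1?$|^(111*)\\1+$");

  static boolean isPrime(int n) {
    return !P.matcher("1".repeat(n)).matches();
  }
}
```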

Then what's wrong with the theory? Actually, nothing is wrong. The "regular expression" in most programming languages is much more expressive than the formal regular expression defined in computability theory. In particular, "\1" is not allowed in a formal regular expression. This "\1" may potentially require an unbounded amount of memory, because the program needs to remember the string matched by "\1", which can be arbitrarily long. In a formal regular language, the memory cost is bounded once the regular expression is given: bounded by the size of the corresponding DFA. Thus backreferences like "\1" give "regular expressions" much more power.

The next question is: are "regular expressions" in programming languages equivalent to Turing machines?