Misapplied Math

# Twelve Days 2013: De Bruijn Sequences

## Day Ten: De Bruijn Sequences

### TL/DR

De Bruijn sequences and more generally De Bruijn graphs have lots of cool applications in combinatorics and CS. On hardware lacking a native instruction to count leading zeros in a word or find the least significant bit, a perfect hash function formed using a De Bruijn sequence provides a constant time solution. The technique is still useful when implementing arbitrary length integers or bit vectors. Outside of bit twiddling, De Bruijn graphs, sequences, and arrays improve the precision of encoders used in robotics and digital tablets, contribute to DNA fragment analysis, and find use in cryptographic applications.

## Intro to De Bruijn Graphs

De Bruijn graphs are Eulerian, Hamiltonian digraphs with a few special properties. if we have an alphabet $\Sigma$ with $m$ symbols, the graph will have $m^n$ vertices, where $n$ is a free parameter. The vertex set is constructed such that every $n$ character word formed by characters taken from the alphabet is represented by a single vertex. The edge set is constructed in a fashion that encodes "overlap" between the words. De Bruijn sequences are the Hamiltonian cycles of the graph.

## Why De Bruijn Sequences are Interesting

Under the construction above, De Bruijn sequences have the property that every $n$ character word in a $k$ character alphabet appears as a sub sequence exactly once when we slide a $n$ character window along the sequence from left to right. As an example, one construction of $B(2, 3)$ is $\{0, 0, 0, 1, 0, 1, 1, 1\}$. As we slide a three element long window from left to right, wrapping around when necessary we get:

\begin{aligned} 000 & = 0 \\ 001 & = 1 \\ 010 &= 2 \\ 101 &= 5 \\ 011 &= 3 \\ 111 &= 7 \\ 110 &= 6 \\ 100 &= 4 \end{aligned}

Every number in the range $[0, 7]$ appears exactly once. For a binary alphabet the length of the sequence will always be $2^n$ but for larger alphabets the sequence can be shorter than direct enumeration. As wikipedia points out, this has the useful property that, given a lock that accepts a pin code without an enter key (which was apparently common in Europe at one point), entering a De Bruijn sequence for the appropriate alphabet and code length is the fastest way to crack the lock. For a ten digit entry pad and a four digit pin that amounts to $B(10, 4)$, which is 10,000 characters long. Compare that with the $4 * 10^4 = 40,000$ combinations that would be required otherwise.

The ability of De Bruijn graphs/sequences to efficiently encode overlap makes them useful for things like DNA sequencing where overlapping fragments need to be recombined, and for mechanical/optical encoders that need to figure out where they are given some overlapping sequence of measurements. There's also a neat bit twiddling hack that comes about from this which we'll discuss next.

## De Bruijn Sequences and Minimal Perfect Hashes

A perfect hash is an injective map $S \mapsto \mathbb{Z}$: a function that maps all elements of $S$ onto the integers with no collisions. Minimal perfect hash functions have the additional property that keys are mapped to integers consecutively. By a simple cardinality argument that also implies that the function is as "efficient" as possible. There's a seminal paper discussing how De Bruijn sequences can be used to construct a minimal perfect hash function that, given any integer $i$, will return the location of the lowest (rightmost) set bit in $i$. This is a pretty common and important operation to the extent that lots of processors now offer an instruction to do this. However, if you look at the code behind a compiler intrinsic that performs the find-first-set operation, the fallback for architectures lacking native support usually involves this hack.

Before going any further it's worth noting that, if you've heard of this hack, you've probably also seen magic De Bruijn constant floating around for integers of various length. De Bruijn sequences are not necessarily unique and they depend on the algorithm used to generate them. Don't be surprised if the constants that you've seen don't match the ones generated by this code.

The code above will generate a De Bruijn sequence and the hash table necessary for the first-set-bit look-up. How does it stack up against the LZCNT instruction emitted by Hotspot on x86?

Intrinsic: 0.0604s, FFS: 0.1346s. Silicon can beat us by a factor of two. However, once upon a time this was the best that we could do, and it's still relevant to bit vectors/words of arbitrary length.