Misapplied Math

Twelve Days 2013: Shunting-Yard AlgorithmDec 22

Day Eleven: The Shunting-Yard Algorithm

TL/DR

Parsing is hard. Parser generators such as ANTLR make it significantly easier by providing a tool to express complex grammars via Extended Backus–Naur Form (a grammar for defining grammars) and generate parsers automatically. However, that's a pretty heavy weight solution, especially if you just need to do something quick and dirty like implement an expression evaluator. There's a significantly easier means of parsing simple grammars that describe things like mathematical equations called the Shunting-yard algorithm.

The Shunting-yard algorithm can be extended to handle more complex tasks, but at its core the algorithm parses an expression written in infix notation and applies the rules of operator precedence with optional parentheses to rewrite the original expression as something unambiguous. You can use it to rewrite the parsed expression in Reverse Polish Notation, or to produce an abstract syntax tree (AST). There are lots of examples of the former but I haven't seen to many of the latter. The code below implements both.

Parsing 101

The simplest end of the parsing spectrum involves something along the lines of taking a CSV file and converting it into a stream of usable values. On the most complex end you have parsers that can read a C++ source file and create a usable representation of it for the compiler. The former definitely doesn't merit anything complicated, the latter requires some of the most powerful tools for language recognition that we have. For "heavy duty parsing" the process usually looks something like: formal grammar -> lexer -> parser -> abstract syntax tree. If you're writing a compiler or something similar, the abstract syntax tree gets used as a unique, structurally validated, and unambiguous input to whatever happens next.

There are very powerful parsers such as LL(*) that are almost impossible to write by hand for any sizable grammar. Thankfully, tools such as ANTLR will write them for us, given a formal description of the language. However this process is fairly heavy weight, and if you're doing something simple like implementing a simple scripting language or an equation evaluator you might be able to get away without it.

Operators, Precedence, and Associativity

Most of us are used to looking at mathematical equations in infix notation: $3 + 4 \cdot 2 /(1 - 5 )^{2^3}$. Unfortunately, that syntax is very hard for a computer to deal with. We subconsciously recognize the rules of operator associativity and operator precedence for mathematical equations, and as programmers we have to know about operator precedence for the languages that we work with; we've learned to parse these things with ease so it's easy to overlook how much is actually going on. When you throw in parenthetical grouping, the task at hand is definitely non-trivial.

Alternatives such as Reverse Polish Notation write expressions in a fashion that's unambiguous and denotes order-of-operation without parentheses. However, unless you're really used to working with it (I've seen people beast through computations on RPN calculators like the HP-48), you'll have to spend some time thinking your way through the expression that you're writing or reading. As such, it's a good thing that languages don't require us to write things that way, but that does leave us with the problem of parsing an expression while taking into account associativity and precedence.

Using The Shunting-Yard Algorithm

Dijkstra first described the algorithm in 1961 (as if that guy hadn't done enough brilliant work already…). It provides a simple means of converting expressions in infix notation to prefix notation. Most of code that I've seen for it outputs the original expression in RPN but the same procedure can generate an abstract syntax tree as well (in fact, formally speaking RPN is generated by the post-order traversal of an AST). The algorithm is iterative and runs in $\O(n)$ so performance wise it's as good as you can do for a parser. The code below assumes that every operator has two operands – modifying it to accept unary operators or functions with parameters is quite simple. I also avoided the cruft of refactoring the code into something more OO to keep things short and simple. You would usually want to make the AST nicer to work with, and it's easy to specialize nodes as operators to clean up evaluation.

Twelve Days 2013: De Bruijn SequencesDec 21

Day Ten: De Bruijn Sequences

TL/DR

De Bruijn sequences and more generally De Bruijn graphs have lots of cool applications in combinatorics and CS. On hardware lacking a native instruction to count leading zeros in a word or find the least significant bit, a perfect hash function formed using a De Bruijn sequence provides a constant time solution. The technique is still useful when implementing arbitrary length integers or bit vectors. Outside of bit twiddling, De Bruijn graphs, sequences, and arrays improve the precision of encoders used in robotics and digital tablets, contribute to DNA fragment analysis, and find use in cryptographic applications.

Intro to De Bruijn Graphs

De Bruijn graphs are Eulerian, Hamiltonian digraphs with a few special properties. if we have an alphabet $\Sigma$ with $m$ symbols, the graph will have $m^n$ vertices, where $n$ is a free parameter. The vertex set is constructed such that every $n$ character word formed by characters taken from the alphabet is represented by a single vertex. The edge set is constructed in a fashion that encodes "overlap" between the words. De Bruijn sequences are the Hamiltonian cycles of the graph.

Why De Bruijn Sequences are Interesting

Under the construction above, De Bruijn sequences have the property that every $n$ character word in a $k$ character alphabet appears as a sub sequence exactly once when we slide a $n$ character window along the sequence from left to right. As an example, one construction of $B(2, 3)$ is $\{0, 0, 0, 1, 0, 1, 1, 1\}$. As we slide a three element long window from left to right, wrapping around when necessary we get:

\begin{aligned} 000 & = 0 \\ 001 & = 1 \\ 010 &= 2 \\ 101 &= 5 \\ 011 &= 3 \\ 111 &= 7 \\ 110 &= 6 \\ 100 &= 4 \end{aligned}

Every number in the range $[0, 7]$ appears exactly once. For a binary alphabet the length of the sequence will always be $2^n$ but for larger alphabets the sequence can be shorter than direct enumeration. As wikipedia points out, this has the useful property that, given a lock that accepts a pin code without an enter key (which was apparently common in Europe at one point), entering a De Bruijn sequence for the appropriate alphabet and code length is the fastest way to crack the lock. For a ten digit entry pad and a four digit pin that amounts to $B(10, 4)$, which is 10,000 characters long. Compare that with the $4 * 10^4 = 40,000$ combinations that would be required otherwise.

The ability of De Bruijn graphs/sequences to efficiently encode overlap makes them useful for things like DNA sequencing where overlapping fragments need to be recombined, and for mechanical/optical encoders that need to figure out where they are given some overlapping sequence of measurements. There's also a neat bit twiddling hack that comes about from this which we'll discuss next.

De Bruijn Sequences and Minimal Perfect Hashes

A perfect hash is an injective map $S \mapsto \mathbb{Z}$: a function that maps all elements of $S$ onto the integers with no collisions. Minimal perfect hash functions have the additional property that keys are mapped to integers consecutively. By a simple cardinality argument that also implies that the function is as "efficient" as possible. There's a seminal paper discussing how De Bruijn sequences can be used to construct a minimal perfect hash function that, given any integer $i$, will return the location of the lowest (rightmost) set bit in $i$. This is a pretty common and important operation to the extent that lots of processors now offer an instruction to do this. However, if you look at the code behind a compiler intrinsic that performs the find-first-set operation, the fallback for architectures lacking native support usually involves this hack.

Before going any further it's worth noting that, if you've heard of this hack, you've probably also seen magic De Bruijn constant floating around for integers of various length. De Bruijn sequences are not necessarily unique and they depend on the algorithm used to generate them. Don't be surprised if the constants that you've seen don't match the ones generated by this code.

The code above will generate a De Bruijn sequence and the hash table necessary for the first-set-bit look-up. How does it stack up against the LZCNT instruction emitted by Hotspot on x86?

Intrinsic: 0.0604s, FFS: 0.1346s. Silicon can beat us by a factor of two. However, once upon a time this was the best that we could do, and it's still relevant to bit vectors/words of arbitrary length.

Twelve Days 2013: Side Channel and Timing AttacksDec 20

Day Nine: Side Channel and Timing Attacks

TL/DR

Layer 7 attacks are becoming increasingly sophisticated and prevalent, and side channel attacks can break mathematically sound cryptography. They're interesting in their own right but the same principles are highly relevant to real-time computing, which is why I'm writing about this. Anything that results in measurable latency jitter or nondeterministic behavior can be a problem from both a performance and security standpoint.

Disclaimer

Cryptography and security is just an interest – it's not my day job. I'm presenting this in the context of low latency/real-time computing, and one of the best ways to illustrate potential problems in that space is to choose canonically bad data as an input. That happens to be the same thing that you would do for malicious purposes. The code that I've provided to illustrate these effects is inefficient by design. If you're interested in this stuff, read about it from the guys who know what they're doing: Dan Boneh, Paul Kocher, Trevor Perrin, and Bruce Schneier to name a few.

Information Leakage and Low Hanging Fruit

Our world banks (literally) on cryptography being mathematically sound. If it's not and someone knows that we have a big problem, so let's just assume that it is. Given that, why are there still media worthy security breaches almost daily? Attacking the problem head on as it's expressed in equation form might be very hard, and hopefully impossible given our modern computing capabilities. However, at the end of the day everything is implemented in software and code executes on hardware. Crypto is often used incorrectly, code has bugs, and hardware can leak information in surprising ways. There are lots of ways to cause big problems short of breaking the underlying algorithms.

Sources of Variance in Time

For modern desktop/server CPUs there is, by design, a huge disconnect between the code that's written (even in assembly) and the computations carried out. When you layer an operating system on top of that, there's an even larger disconnect between what you've written in userland and what happens on the bare metal. At the lowest level CPUs exploits instruction level parallelism to do as much work as possible per clock cycle. Processors such as the x86 have multiple execution units to accelerate arithmetic and add additional parallelism. As main memory access is slow relative to the CPU, speculative execution and branch prediction are used to "guess" what data will come back from a load instruction so that computation can continue based on this guess and rewind when the guess is wrong. Furthermore, a CPU's cache tries to keep recently used (or likely to be used) data closer to the processor for fast access. When the CPU goes to read a value that's not there it has to access main memory, which can be an order of magnitude slower. Cache misses, branch mispredictions, TLB page faults, and instruction level parallelism all betray the secrets of the computation going on behind the scenes.

Given the above, at times it's possible to work out the instructions/data that were used as inputs via statistical analysis of something that's hard to hide: time elapsed. Things get even worse if the attacker has physical access to the box. An analysis of the power used by the CPU, the sounds that the computer makes, or residual data left in RAM even after the computer powers down can be used to access data or recover a private encryption key. These techniques can even break virtual machine hypervisor isolation. On the application layer things are much simpler and the attacker has even more to work with. Studying variations in how long it takes to get a response to some request is enough to bleed information remotely. If the attacker wants to interfere with normal operations (denial of service), or statistically derive sensitive information, remote timings can help her/him do so.

Hash Collisions of Death

When cryptographers discuss hash collisions it's usually in the context of cryptographic hash functions. For a cryptographically secure hash function, having two distinct inputs hash to the same value by chance is really, really bad. Figuring out how to compute the original value given the hash is worse. Being able to find two distinct inputs that result in a collision is the granddaddy of them all (the inverse and the collision are called the first and second preimage, respectively). Why is this bad? Well, among other things, websites (should always) store salted, hashed passwords in lieu of the original ones. That way, if their security is compromised the attacker still has to figure out what the user's original password was before it does them any good. For strong passwords and a good hash function, doing so is computationally infeasible.

Cryptographic hashs can take thousands of times longer to compute than a simple one so they're only used when necessary. However, non-cryptographic hash collisions can still pose a big problem. Hash tables use (as their name suggests) a hash function to quickly store or look-up a value given its key. Multiple keys may have the same hash, resulting in a collision. Hash tables usually resolve collision via probing or chaining. Probing hash tables work by walking over to the next available slot after a collision occurs, and chaining hash maps store a list-like structure at collision points. The average, $\O(1)$ look-up behavior of a hash table relies on keys having uniformly distributed hash values so that all available slots are filled with the same probability. When this breaks down you're left with a structure that performs like a linked list – $\O(n)$ insertions, deletions, and look-ups.

This is where the connection to real-time computing comes in. Two things really ruin your day in this space: data structures performing at their worst-case time complexity, and memory allocation at an inopportune time. Hash collisions can result in both. Most high level OO languages try to hash built-in objects in a sensible fashion that minimizes collisions and in some cases provides security by design. However, it's often possible to engineer collisions either intentionally or unintentionally.

A Case Study: Java Strings

As of writing (and for as long as I can remember) the Oracle JDK hashes a string S with the following function:

$h = s[0] + 31 \cdot s[1] + 31^2 \cdot s[3] \ldots$

That's a speedy and perfectly reasonable hash function. Unfortunately, we can use a very inefficient brute force method to find collisions on short strings, and much better methods exist. Given some admissible set of characters the computation looks something like:

Restricting myself to ASCII [a-Z 0-9] gives 35 collisions for the word "world" – which hashes to 113318802. A few samples are: "wormE, wosMd, and xPsNE". What could we do with these colliding strings? If some remote service relied on a hash table-like structure you might be able to cause problems by a) overloading the hash table with colliding values resulting in $\O(n)$ performance, or b) glean information on other values in the collection, which is easier to do if you have known, colliding values. It's a big enough problem that starting with JDK 7.04 Oracle included a secondary string hash for the hash based collections: sun.misc.Hashing.stringHash32(). Note that generally speaking, in the absence of secondary hashing, it's actually easier to cause a collision in a hash table than a pure hash collision as we only need to find a collision modulo the table width. Java collections apply both a secondary hash, and a randomized hash seed to make this sort of reverse engineering harder to do. However, if you "roll your own" collection without these features, or use something other than a string as a key, you're still vulnerable.

So, how pronounced is the effect? Let's write a little code to find out. You'll need to run it on Java 6 given what I mentioned above.

Running this code on Oracle JDK 1.6.37 gives me:

Iteration 1 Random keys: 0.0526s Collision keys: 0.1258s Iteration 2 Random keys: 0.0518s Collision keys: 0.1238s Iteration 3 Random keys: 0.0519s Collision keys: 0.1232s Iteration 4 Random keys: 0.0508s Collision keys: 0.1235s Iteration 5 Random keys: 0.0472s Collision keys: 0.1230s.

Ouch. There's a very stable and obvious 2.5x timing differential between the colliding and random keys. Longer strings admit more collisions so it only gets worse from here.

Closing Thoughts

This is a contrived example but illustrates a real problem to the extent that Oracle addressed it in Java 7. Making strings collide is relatively hard – it's much easier to create these effects (accidentally or deliberately) using a less general type of key. Hash map performance degrades significantly under load factor and collisions, so both are big problems for low latency computing.