Misapplied Math

Trading, Data Science, CS

Day Eleven: The Shunting-Yard Algorithm

TL/DR

Parsing is hard. Parser generators such as ANTLR make it significantly easier by providing a tool to express complex grammars via Extended Backus–Naur Form (a grammar for defining grammars) and generate parsers automatically. However, that's a pretty heavy weight solution, especially if you just need to do something quick and dirty like implement an expression evaluator. There's a significantly easier means of parsing simple grammars that describe things like mathematical equations called the Shunting-yard algorithm.

The Shunting-yard algorithm can be extended to handle more complex tasks, but at its core the algorithm parses an expression written in infix notation and applies the rules of operator precedence with optional parentheses to rewrite the original expression as something unambiguous. You can use it to rewrite the parsed expression in Reverse Polish Notation, or to produce an abstract syntax tree (AST). There are lots of examples of the former but I haven't seen to many of the latter. The code below implements both.

Parsing 101

The simplest end of the parsing spectrum involves something along the lines of taking a CSV file and converting it into a stream of usable values. On the most complex end you have parsers that can read a C++ source file and create a usable representation of it for the compiler. The former definitely doesn't merit anything complicated, the latter requires some of the most powerful tools for language recognition that we have. For "heavy duty parsing" the process usually looks something like: formal grammar -> lexer -> parser -> abstract syntax tree. If you're writing a compiler or something similar, the abstract syntax tree gets used as a unique, structurally validated, and unambiguous input to whatever happens next.

There are very powerful parsers such as LL(*) that are almost impossible to write by hand for any sizable grammar. Thankfully, tools such as ANTLR will write them for us, given a formal description of the language. However this process is fairly heavy weight, and if you're doing something simple like implementing a simple scripting language or an equation evaluator you might be able to get away without it.

Operators, Precedence, and Associativity

Most of us are used to looking at mathematical equations in infix notation: 3+42/(15)233 + 4 \cdot 2 /(1 - 5 )^{2^3}. Unfortunately, that syntax is very hard for a computer to deal with. We subconsciously recognize the rules of operator associativity and operator precedence for mathematical equations, and as programmers we have to know about operator precedence for the languages that we work with; we've learned to parse these things with ease so it's easy to overlook how much is actually going on. When you throw in parenthetical grouping, the task at hand is definitely non-trivial.

Alternatives such as Reverse Polish Notation write expressions in a fashion that's unambiguous and denotes order-of-operation without parentheses. However, unless you're really used to working with it (I've seen people beast through computations on RPN calculators like the HP-48), you'll have to spend some time thinking your way through the expression that you're writing or reading. As such, it's a good thing that languages don't require us to write things that way, but that does leave us with the problem of parsing an expression while taking into account associativity and precedence.

Using The Shunting-Yard Algorithm

Dijkstra first described the algorithm in 1961 (as if that guy hadn't done enough brilliant work already…). It provides a simple means of converting expressions in infix notation to prefix notation. Most of code that I've seen for it outputs the original expression in RPN but the same procedure can generate an abstract syntax tree as well (in fact, formally speaking RPN is generated by the post-order traversal of an AST). The algorithm is iterative and runs in O(n)\O(n) so performance wise it's as good as you can do for a parser. The code below assumes that every operator has two operands – modifying it to accept unary operators or functions with parameters is quite simple. I also avoided the cruft of refactoring the code into something more OO to keep things short and simple. You would usually want to make the AST nicer to work with, and it's easy to specialize nodes as operators to clean up evaluation.

Operator.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
package parsing;

/**
 * A simple interface common to all operators in a grammar.
 *
 * @author Kelly Littlepage
 */
public interface Operator {

    /***
     *
     * @return <code>true</code> if the operator is right associative, and
     * <code>false</code> otherwise. By definition, any operator that isn't
     * right associative is left associative.
     */
    boolean isRightAssociative();

    /***
     * Compares the precedence of this operator against another operator.
     *
     * @param o The operator to compare against.
     *
     * @return -1 if this operator is of lower precedence, 1 if it's of greater
     * precedence, and 0 if they're of equal precedence. If two operators are of
     * equal precedence, right associativity and parenthetical groupings must be
     * used to determine precedence.
     */
    int comparePrecedence(Operator o);

    /***
     *
     * @return Gets the symbol used to denote the operator.
     */
    char getSymbol();


}

BaseOperator.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
package parsing;

/**
 * A simple base implementation of the [email protected] Operator} interface.
 *
 * @author Kelly Littlepage
 */
public class BaseOperator implements Operator {

    private final char symbol;
    private final boolean rightAssociative;
    private final int precedence;

    /***
     * Creates a new BaseOperator.
     *
     * @param symbol The symbol used to represent the operator.
     * @param rightAssociative <code>true</code> if the operator is right
     * associative, and false otherwise.
     * @param precedence A numerical precedence for the operator relative to
     * all other operators in use.
     */
    public BaseOperator(char symbol, boolean rightAssociative,
                        int precedence) {
        this.symbol = symbol;
        this.rightAssociative = rightAssociative;
        this.precedence = precedence;
    }

    @Override
    public boolean isRightAssociative() {
        return rightAssociative;
    }

    @Override
    public int comparePrecedence(Operator o) {
        if(o instanceof BaseOperator) {
            BaseOperator other = (BaseOperator) o;
            return precedence > other.precedence ? 1 :
                    other.precedence == precedence ? 0 : -1;
        } else {
            // Defer the comparison to the second operator reflectively
            return -o.comparePrecedence(this);
        }
    }

    @Override
    public char getSymbol() {
        return symbol;
    }

    @Override
    public String toString() {
        return Character.toString(symbol);
    }


}

ASTNode.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
package parsing;

/**
 * A simple AST node class.
 *
 * @author Kelly Littlepage
 */
public class ASTNode {

    private final char value;
    private final ASTNode leftASTNode;
    private final ASTNode rightASTNode;

    /***
     * Constructs a new AST node. There's no explicit specialization for leaf
     * nodes. Leaves are denoted by nodes where both the left and right node
     * is null.
     *
     * @param value The value held by the node.
     * @param leftASTNode The left node, or <code>null</code> if there isn't one.
     * @param rightASTNode The right node, or <code>null</code> if there isn't one.
     */
    public ASTNode(char value, ASTNode leftASTNode, ASTNode rightASTNode) {
        this.value = value;
        this.leftASTNode = leftASTNode;
        this.rightASTNode = rightASTNode;
    }

    /***
     *
     * @return The value held by the node.
     */
    public char getValue() {
        return value;
    }

    /***
     *
     * @return The left node, or <code>null</code> if there isn't one.
     */
    public ASTNode getLeftASTNode() {
        return leftASTNode;
    }

    /***
     *
     * @return The right node, or <code>null</code> if there isn't one.
     */
    public ASTNode getRightASTNode() {
        return rightASTNode;
    }


}

ShuntingYardParser.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
package parsing;

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Stack;

/**
 * A simple implementation of the Shunting Yard algorithm to parse an
 * infix expression and generate either an abstract syntax tree or another
 * expression in Reverse Polish notation. This class duplicates code between
 * [email protected] #convertInfixNotationToRPN(String)} and
 * [email protected] #convertInfixNotationToAST(String)} as the underlying algorithm is
 * nearly identical. You can easily yield RPN given an AST via post-order
 * traversal of the tree.
 *
 * Error handling is almost non existent, all operators are taken as single
 * characters. Both of these issues are easily corrected for.
 *
 * @see <a href="https://en.wikipedia.org/wiki/Shunting-yard_algorithm">
 *     https://en.wikipedia.org/wiki/Shunting-yard_algorithm</a> for details of
 *     the algorithm.
 *
 * @author Kelly Littlepage
 */
public class ShuntingYardParser {

    private final Map<Character, Operator> operators;

    private static void addNode(Stack<ASTNode> stack, char operator) {
        final ASTNode rightASTNode = stack.pop();
        final ASTNode leftASTNode = stack.pop();
        stack.push(new ASTNode(operator, leftASTNode, rightASTNode));
    }

    /***
     * Creates a new ShuntingYardParser for the given operators.
     *
     * @param operators A collection of operators that should be recognized by
     * the parser.
     */
    public ShuntingYardParser(Collection<Operator> operators) {
        this.operators = new HashMap<>();
        for(Operator o : operators) {
            this.operators.put(o.getSymbol(), o);
        }
    }

    /***
     * Convert an expression in infix notation to an abstract syntax tree.
     *
     * @param input The expression, in infix notation.
     *
     * @return An [email protected] ASTNode} that serves as the root of the AST.
     */
    public ASTNode convertInfixNotationToAST(final String input) {
        final Stack<Character> operatorStack = new Stack<>();
        final Stack<ASTNode> operandStack = new Stack<>();
        final char[] chars = input.toCharArray();
        main:
        for(char c : chars) {
            char popped;
            switch(c) {
                case ' ':
                    break;
                case '(':
                    operatorStack.push('(');
                    break;
                case ')':
                    while(!operatorStack.isEmpty()) {
                        popped = operatorStack.pop();
                        if('(' == popped) {
                            continue main;
                        } else {
                            addNode(operandStack, popped);
                        }
                    }
                    throw new IllegalStateException("Unbalanced right " +
                            "parentheses");
                default:
                    if(operators.containsKey(c)) {
                        final Operator o1 = operators.get(c);
                        Operator o2;
                        while(!operatorStack.isEmpty() && null != (o2 =
                                operators.get(operatorStack.peek()))) {
                            if((!o1.isRightAssociative() &&
                                    0 == o1.comparePrecedence(o2)) ||
                                    o1.comparePrecedence(o2) < 0) {
                                operatorStack.pop();
                                addNode(operandStack, o2.getSymbol());
                            } else {
                                break;
                            }
                        }
                        operatorStack.push(c);
                    } else {
                        operandStack.push(new ASTNode(c, null, null));
                    }
                    break;
            }
        }
        while(!operatorStack.isEmpty()) {
            addNode(operandStack, operatorStack.pop());
        }
        return operandStack.pop();
    }

    /***
     * Convert an expression in infix notation to an expression in Reverse
     * Polish Notation.
     *
     * @param input The expression, in infix notation.
     *
     * @return An expression in Reverse Polish notation.
     */
    public String convertInfixNotationToRPN(final String input) {
        final Stack<Character> operatorStack = new Stack<>();
        final StringBuilder sb = new StringBuilder();
        final char[] chars = input.toCharArray();
        main:
        for(char c : chars) {
            char popped;
            switch(c) {
                case ' ':
                    break;
                case '(':
                    operatorStack.push('(');
                    break;
                case ')':
                    while(!operatorStack.isEmpty()) {
                        popped = operatorStack.pop();
                        if('(' == popped) {
                            continue main;
                        } else {
                            sb.append(" ").append(popped);
                        }
                    }
                    throw new IllegalStateException("Unbalanced right " +
                            "parentheses");
                default:
                    if(operators.containsKey(c)) {
                        final Operator o1 = operators.get(c);
                        Operator o2;
                        while(!operatorStack.isEmpty() && null != (o2 =
                                operators.get(operatorStack.peek()))) {
                            if((!o1.isRightAssociative() &&
                                    0 == o1.comparePrecedence(o2)) ||
                                    o1.comparePrecedence(o2) < 0) {
                                operatorStack.pop();
                                sb.append(" ").append(o2.getSymbol());
                            } else {
                                break;
                            }
                        }
                        operatorStack.push(c);
                    } else {
                        if(sb.length() > 0) {
                            sb.append(" ");
                        }
                        sb.append(c);
                    }
                    break;
            }
        }
        while(!operatorStack.isEmpty()) {
            sb.append(" ").append(operatorStack.pop());
        }
        return sb.toString();
    }


}

ShuntingYardDemo.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
package parsing;

import java.util.ArrayList;
import java.util.Collection;
import java.util.Stack;

/**
 * A simple demonstration of the Shunting-Yard algorithm. It's easy to extend
 * this code to handle arbitrary operators/functions (including unary ones)
 * and to validate input.
 *
 * @author Kelly Littlepage
 */
public class ShuntingYardDemo {

    /***
     * Evaluates the calculation encoded in the given abstract syntax tree.
     * This method uses recursion to keep things clean. If you needed to
     * evaluate a very deep tree you might need to rewrite this method to use
     * depth first search and evaluate the tree using an explicit stack.
     *
     * @param tree The [email protected] ASTNode} to evaluate.
     *
     * @return The result of the computation.
     */
    private static double evaluateAST(ASTNode tree) {
        switch(tree.getValue()) {
            case '^':
                return Math.pow(evaluateAST(tree.getLeftASTNode()),
                        evaluateAST(tree.getRightASTNode()));
            case '*':
                return evaluateAST(tree.getLeftASTNode()) * evaluateAST(tree.
                        getRightASTNode());
            case '/':
                return evaluateAST(tree.getLeftASTNode()) / evaluateAST(tree.
                        getRightASTNode());
            case '+':
                return evaluateAST(tree.getLeftASTNode()) + evaluateAST(tree.
                        getRightASTNode());
            case '-':
                return evaluateAST(tree.getLeftASTNode()) - evaluateAST(tree.
                        getRightASTNode());
            default:
                return Double.valueOf(Character.toString(
                        tree.getValue()));
        }
    }

    /***
     * Evaluates the expression given in Reverse Polish notation.
     *
     * @param rpn The expression, in reverse polish notation.
     *
     * @return The result of the calculation.
     */
    private static double evaluateRPN(String rpn) {
        final Stack<String> computation = new Stack<>();
        final char[] chars = rpn.toCharArray();
        for(char c : chars) {
            final String v1;
            final String v2;
            switch(c) {
                case '^':
                    v2 = computation.pop();
                    v1 = computation.pop();
                    computation.push(Double.toString(
                            Math.pow(Double.valueOf(v1), Double.valueOf(v2))));
                    break;
                case '*':
                    v2 = computation.pop();
                    v1 = computation.pop();
                    computation.push(Double.toString(
                            Double.valueOf(v1) * Double.valueOf(v2)));
                    break;
                case '/':
                    v2 = computation.pop();
                    v1 = computation.pop();
                    computation.push(Double.toString(
                            Double.valueOf(v1) / Double.valueOf(v2)));
                    break;
                case '+':
                    v2 = computation.pop();
                    v1 = computation.pop();
                    computation.push(Double.toString(
                            Double.valueOf(v1) + Double.valueOf(v2)));
                    break;
                case '-':
                    v2 = computation.pop();
                    v1 = computation.pop();
                    computation.push(Double.toString(
                            Double.valueOf(v1) - Double.valueOf(v2)));
                    break;
                case ' ':
                    break;
                default:
                    computation.push(Character.toString(c));
            }
        }
        return Double.valueOf(computation.pop());
    }

    /***
     * A simple demonstration of parsing an infix expression and converting it
     * to either Reverse Polish Notation or an abstract syntax tree.
     *
     */
    public static void main(String[] args) {
        // Define our basic operators for arithmetic.
        final Collection<Operator> operators = new ArrayList<>();
        operators.add(new BaseOperator('^', true, 4));
        operators.add(new BaseOperator('*', false, 3));
        operators.add(new BaseOperator('/', false, 3));
        operators.add(new BaseOperator('+', false, 2));
        operators.add(new BaseOperator('-', false, 2));

        final ShuntingYardParser parser = new ShuntingYardParser(operators);
        final String input = "3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3";

        final String rpn = parser.convertInfixNotationToRPN(input);
        System.out.println("RPN expression: " + rpn);

        final ASTNode parseTree = parser.convertInfixNotationToAST(input);
        final double result = evaluateAST(parseTree);
        System.out.println("Result: " + result);

        assert 0 == Double.compare(result, evaluateRPN(rpn));
    }


}


Day Ten: De Bruijn Sequences

TL/DR

De Bruijn sequences and more generally De Bruijn graphs have lots of cool applications in combinatorics and CS. On hardware lacking a native instruction to count leading zeros in a word or find the least significant bit, a perfect hash function formed using a De Bruijn sequence provides a constant time solution. The technique is still useful when implementing arbitrary length integers or bit vectors. Outside of bit twiddling, De Bruijn graphs, sequences, and arrays improve the precision of encoders used in robotics and digital tablets, contribute to DNA fragment analysis, and find use in cryptographic applications.

Intro to De Bruijn Graphs

De Bruijn graphs are Eulerian, Hamiltonian digraphs with a few special properties. if we have an alphabet Σ\Sigma with mm symbols, the graph will have mnm^n vertices, where nn is a free parameter. The vertex set is constructed such that every nn character word formed by characters taken from the alphabet is represented by a single vertex. The edge set is constructed in a fashion that encodes "overlap" between the words. De Bruijn sequences are the Hamiltonian cycles of the graph.

DeBruijn Digraph

Why De Bruijn Sequences are Interesting

Under the construction above, De Bruijn sequences have the property that every nn character word in a kk character alphabet appears as a sub sequence exactly once when we slide a nn character window along the sequence from left to right. As an example, one construction of B(2,3)B(2, 3) is {0,0,0,1,0,1,1,1}\{0, 0, 0, 1, 0, 1, 1, 1\}. As we slide a three element long window from left to right, wrapping around when necessary we get:

000=0001=1010=2101=5011=3111=7110=6100=4\begin{aligned} 000 & = 0 \\ 001 & = 1 \\ 010 &= 2 \\ 101 &= 5 \\ 011 &= 3 \\ 111 &= 7 \\ 110 &= 6 \\ 100 &= 4 \end{aligned}

Every number in the range [0,7][0, 7] appears exactly once. For a binary alphabet the length of the sequence will always be 2n2^n but for larger alphabets the sequence can be shorter than direct enumeration. As wikipedia points out, this has the useful property that, given a lock that accepts a pin code without an enter key (which was apparently common in Europe at one point), entering a De Bruijn sequence for the appropriate alphabet and code length is the fastest way to crack the lock. For a ten digit entry pad and a four digit pin that amounts to B(10,4)B(10, 4), which is 10,000 characters long. Compare that with the 4104=40,0004 * 10^4 = 40,000 combinations that would be required otherwise.

The ability of De Bruijn graphs/sequences to efficiently encode overlap makes them useful for things like DNA sequencing where overlapping fragments need to be recombined, and for mechanical/optical encoders that need to figure out where they are given some overlapping sequence of measurements. There's also a neat bit twiddling hack that comes about from this which we'll discuss next.

De Bruijn Sequences and Minimal Perfect Hashes

A perfect hash is an injective map SZS \mapsto \mathbb{Z}: a function that maps all elements of SS onto the integers with no collisions. Minimal perfect hash functions have the additional property that keys are mapped to integers consecutively. By a simple cardinality argument that also implies that the function is as "efficient" as possible. There's a seminal paper discussing how De Bruijn sequences can be used to construct a minimal perfect hash function that, given any integer ii, will return the location of the lowest (rightmost) set bit in ii. This is a pretty common and important operation to the extent that lots of processors now offer an instruction to do this. However, if you look at the code behind a compiler intrinsic that performs the find-first-set operation, the fallback for architectures lacking native support usually involves this hack.

Before going any further it's worth noting that, if you've heard of this hack, you've probably also seen magic De Bruijn constant floating around for integers of various length. De Bruijn sequences are not necessarily unique and they depend on the algorithm used to generate them. Don't be surprised if the constants that you've seen don't match the ones generated by this code.

DeBruijn.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
import java.util.Random;

/**
 * @author Kelly Littlepage
 */
public class DeBruijn {

    /***
     * Generate a De Bruijn Sequence using the recursive FKM algorithm as given
     * in http://www.1stworks.com/ref/RuskeyCombGen.pdf .
     *
     * @param k The number of (integer) symbols in the alphabet.
     * @param n The length of the words in the sequence.
     *
     * @return A De Bruijn sequence B(k, n).
     */
    private static String generateDeBruijnSequence(int k, int n) {
        final int[] a = new int[k * n];
        final StringBuilder sb = new StringBuilder();
        generateDeBruijnSequence(a, sb, 1, 1, k, n);
        return sb.toString();
    }

    private static void generateDeBruijnSequence(int[] a,
                                                 StringBuilder sequence,
                                                 int t,
                                                 int p,
                                                 int k,
                                                 int n) {
        if(t > n) {
            if(0 == n % p) {
                for(int j = 1; j <= p; ++j) {
                    sequence.append(a[j]);
                }
            }
        } else {
            a[t] = a[t - p];
            generateDeBruijnSequence(a, sequence, t + 1, p, k, n);
            for(int j = a[t - p] + 1; j < k; ++j) {
                a[t] = j;
                generateDeBruijnSequence(a, sequence, t + 1, t, k, n);
            }
        }
    }

    /***
     * Build the minimal perfect hash table required by the De Bruijn ffs
     * procedure. See http://supertech.csail.mit.edu/papers/debruijn.pdf.
     *
     * @param deBruijnConstant The De Bruijn number to use, as produced by
     * [email protected] DeBruijn#generateDeBruijnSequence(int, int)} with k = 2.
     * @param n The length of the integer word that will be used for lookup,
     * in bits. N must be a positive power of two.
     *
     * @return A minimal perfect hash table for use with the
     * [email protected] DeBruijn#ffs(int, int[], int)} function.
     */
    private static int[] buildDeBruijnHashTable(int deBruijnConstant, int n) {
        if(!isPositivePowerOfTwo(n)) {
            throw new IllegalArgumentException("n must be a positive power " +
                    "of two.");
        }
        // We know that n is a power of two so this (meta circular) hack will
        // give us lg(n).
        final int lgn = Integer.numberOfTrailingZeros(n);
        final int[] table = new int[n];
        for(int i = 0; i < n; ++i) {
            table[(deBruijnConstant << i) >>> (n - lgn)] = i;
        }
        return table;
    }

    /***
     * Tests that an integer is a positive, non-zero power of two.
     *
     * @param x The integer to test.
     * @return <code>true</code> if the integer is a power of two, and
     * <code>false otherwise</code>.
     */
    private static boolean isPositivePowerOfTwo(int x) {
        return x > 0 && ((x & -x) == x);
    }

    /***
     * A find-first-set bit procedure based off of De Bruijn minimal perfect
     * hashing. See http://supertech.csail.mit.edu/papers/debruijn.pdf.
     *
     * @param deBruijnConstant The De Bruijn constant used in the construction
     * of the deBruijnHashTable.
     * @param deBruijnHashTable A hash 32-bit integer hash table, as produced by
     * a call to [email protected] DeBruijn#buildDeBruijnHashTable(int, int)} with n = 32.
     * @param x The number for which the first set bit is desired.
     *
     * @return <code>32</code> if x == 0, and the position (in bits) of the
     * first (rightmost) set bit in the integer x. This function chooses to
     * return 32 in the event that no bit is set so that behavior is consistent
     * with [email protected] Integer.numberOfTrailingZeros()}.
     */
    private static int ffs(int deBruijnConstant,
                           int[] deBruijnHashTable, int x) {
        if(x == 0) {
            return 32;
        }
        x &= -x;
        x *= deBruijnConstant;
        x >>>= 27;
        return deBruijnHashTable[x];
    }


}

The code above will generate a De Bruijn sequence and the hash table necessary for the first-set-bit look-up. How does it stack up against the LZCNT instruction emitted by Hotspot on x86?

DeBruijn.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
    private static void benchmarkLookup(int runCount) {
        final Random random = new Random();
        final int[] targets = new int[runCount];
        for(int i = 0; i < runCount; ++i) {
            targets[i] = random.nextInt();
        }
        final int deBruijnConstant = Integer.valueOf(
                generateDeBruijnSequence(2, 5), 2);
        final int[] hashTable = buildDeBruijnHashTable(deBruijnConstant, 32);

        // Warm up
        for(int i = 0; i < 5; ++i) {
            timeLeadingZeros(targets);
            timeFFS(targets, hashTable, deBruijnConstant);
        }

        System.out.printf("Intrinsic: %.4fs\n",
                (double) timeLeadingZeros(targets) / Math.pow(10d, 9));
        System.out.printf("FFS: %.4fs\n",
                (double) timeFFS(targets, hashTable, deBruijnConstant) /
                        Math.pow(10d, 9));
    }

    private static int checksum;

    private static long timeLeadingZeros(int[] targets) {
        // Hash to prevent optimization
        int hash = 0;
        final long startTime = System.nanoTime();
        for(int i : targets) {
            hash += Integer.numberOfTrailingZeros(i);
        }
        final long endTime = System.nanoTime();
        checksum += hash;
        return endTime - startTime;
    }

    private static long timeFFS(int[] targets, int[] hashTable,
                                int deBruijnConstant) {
        // Hash to prevent optimization
        int hash = 0;
        final long startTime = System.nanoTime();
        for(int i : targets) {
            hash += ffs(deBruijnConstant, hashTable, i);
        }
        final long endTime = System.nanoTime();
        checksum += hash;
        return endTime - startTime;
    }

Intrinsic: 0.0604s, FFS: 0.1346s. Silicon can beat us by a factor of two. However, once upon a time this was the best that we could do, and it's still relevant to bit vectors/words of arbitrary length.


Day Nine: Side Channel and Timing Attacks

TL/DR

Layer 7 attacks are becoming increasingly sophisticated and prevalent, and side channel attacks can break mathematically sound cryptography. They're interesting in their own right but the same principles are highly relevant to real-time computing, which is why I'm writing about this. Anything that results in measurable latency jitter or nondeterministic behavior can be a problem from both a performance and security standpoint.

Disclaimer

Cryptography and security is just an interest – it's not my day job. I'm presenting this in the context of low latency/real-time computing, and one of the best ways to illustrate potential problems in that space is to choose canonically bad data as an input. That happens to be the same thing that you would do for malicious purposes. The code that I've provided to illustrate these effects is inefficient by design. If you're interested in this stuff, read about it from the guys who know what they're doing: Dan Boneh, Paul Kocher, Trevor Perrin, and Bruce Schneier to name a few.

Information Leakage and Low Hanging Fruit

Our world banks (literally) on cryptography being mathematically sound. If it's not and someone knows that we have a big problem, so let's just assume that it is. Given that, why are there still media worthy security breaches almost daily? Attacking the problem head on as it's expressed in equation form might be very hard, and hopefully impossible given our modern computing capabilities. However, at the end of the day everything is implemented in software and code executes on hardware. Crypto is often used incorrectly, code has bugs, and hardware can leak information in surprising ways. There are lots of ways to cause big problems short of breaking the underlying algorithms.

Sources of Variance in Time

For modern desktop/server CPUs there is, by design, a huge disconnect between the code that's written (even in assembly) and the computations carried out. When you layer an operating system on top of that, there's an even larger disconnect between what you've written in userland and what happens on the bare metal. At the lowest level CPUs exploits instruction level parallelism to do as much work as possible per clock cycle. Processors such as the x86 have multiple execution units to accelerate arithmetic and add additional parallelism. As main memory access is slow relative to the CPU, speculative execution and branch prediction are used to "guess" what data will come back from a load instruction so that computation can continue based on this guess and rewind when the guess is wrong. Furthermore, a CPU's cache tries to keep recently used (or likely to be used) data closer to the processor for fast access. When the CPU goes to read a value that's not there it has to access main memory, which can be an order of magnitude slower. Cache misses, branch mispredictions, TLB page faults, and instruction level parallelism all betray the secrets of the computation going on behind the scenes.

Given the above, at times it's possible to work out the instructions/data that were used as inputs via statistical analysis of something that's hard to hide: time elapsed. Things get even worse if the attacker has physical access to the box. An analysis of the power used by the CPU, the sounds that the computer makes, or residual data left in RAM even after the computer powers down can be used to access data or recover a private encryption key. These techniques can even break virtual machine hypervisor isolation. On the application layer things are much simpler and the attacker has even more to work with. Studying variations in how long it takes to get a response to some request is enough to bleed information remotely. If the attacker wants to interfere with normal operations (denial of service), or statistically derive sensitive information, remote timings can help her/him do so.

Hash Collisions of Death

When cryptographers discuss hash collisions it's usually in the context of cryptographic hash functions. For a cryptographically secure hash function, having two distinct inputs hash to the same value by chance is really, really bad. Figuring out how to compute the original value given the hash is worse. Being able to find two distinct inputs that result in a collision is the granddaddy of them all (the inverse and the collision are called the first and second preimage, respectively). Why is this bad? Well, among other things, websites (should always) store salted, hashed passwords in lieu of the original ones. That way, if their security is compromised the attacker still has to figure out what the user's original password was before it does them any good. For strong passwords and a good hash function, doing so is computationally infeasible.

Cryptographic hashs can take thousands of times longer to compute than a simple one so they're only used when necessary. However, non-cryptographic hash collisions can still pose a big problem. Hash tables use (as their name suggests) a hash function to quickly store or look-up a value given its key. Multiple keys may have the same hash, resulting in a collision. Hash tables usually resolve collision via probing or chaining. Probing hash tables work by walking over to the next available slot after a collision occurs, and chaining hash maps store a list-like structure at collision points. The average, O(1)\O(1) look-up behavior of a hash table relies on keys having uniformly distributed hash values so that all available slots are filled with the same probability. When this breaks down you're left with a structure that performs like a linked list – O(n)\O(n) insertions, deletions, and look-ups.

This is where the connection to real-time computing comes in. Two things really ruin your day in this space: data structures performing at their worst-case time complexity, and memory allocation at an inopportune time. Hash collisions can result in both. Most high level OO languages try to hash built-in objects in a sensible fashion that minimizes collisions and in some cases provides security by design. However, it's often possible to engineer collisions either intentionally or unintentionally.

A Case Study: Java Strings

As of writing (and for as long as I can remember) the Oracle JDK hashes a string S with the following function:

h=s[0]+31s[1]+312s[3]h = s[0] + 31 \cdot s[1] + 31^2 \cdot s[3] \ldots

That's a speedy and perfectly reasonable hash function. Unfortunately, we can use a very inefficient brute force method to find collisions on short strings, and much better methods exist. Given some admissible set of characters the computation looks something like:

HashTiming.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
private static final int[] permissibleCharacters;

    static {
        permissibleCharacters = new int[62];
        int i = 0;
        for(int v = '0'; v <= 'z'; ++v) {
            if(isPermissibleCharacter(v)) {
                permissibleCharacters[i++] = v;
            }
        }
    }

    private static boolean isPermissibleCharacter(int x) {
        return (x >= '0' && x <= '9') ||
                (x >= 'A' && x <= 'Z') ||
                (x >= 'a' && x <= 'z');
    }

    private static List<String> generateCollisionsForString(String target,
                                                    int[] validCharacters) {
        int n = validCharacters.length;
        final List<String> outputList = new ArrayList<>();
        final int targetHash = target.hashCode();
        for(int k = 0; k <= target.length(); ++k) {
            generateCollisionsForString(targetHash, validCharacters, "", n, k,
                    outputList);
        }
        return outputList;
    }

    private static void generateCollisionsForString(int targetHash,
                                                    int[] set,
                                                    String prefix,
                                                    int n,
                                                    int k,
                                                    List<String> outputList) {
        if (k == 0) {
            if(prefix.hashCode() == targetHash) {
                outputList.add(prefix);
            }
            return;
        }
        for (int i = 0; i < n; ++i) {
            generateCollisionsForString(targetHash, set, prefix + (char) set[i],
                    n, k - 1, outputList);
        }
    }

Restricting myself to ASCII [a-Z 0-9] gives 35 collisions for the word "world" – which hashes to 113318802. A few samples are: "wormE, wosMd, and xPsNE". What could we do with these colliding strings? If some remote service relied on a hash table-like structure you might be able to cause problems by a) overloading the hash table with colliding values resulting in O(n)\O(n) performance, or b) glean information on other values in the collection, which is easier to do if you have known, colliding values. It's a big enough problem that starting with JDK 7.04 Oracle included a secondary string hash for the hash based collections: sun.misc.Hashing.stringHash32(). Note that generally speaking, in the absence of secondary hashing, it's actually easier to cause a collision in a hash table than a pure hash collision as we only need to find a collision modulo the table width. Java collections apply both a secondary hash, and a randomized hash seed to make this sort of reverse engineering harder to do. However, if you "roll your own" collection without these features, or use something other than a string as a key, you're still vulnerable.

So, how pronounced is the effect? Let's write a little code to find out. You'll need to run it on Java 6 given what I mentioned above.

HashTiming.java
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
private static void benchmarkLoadedHashMap(int randomKeys,
                                               int stringLength,
                                               List<String> collisions,
                                               int samples) {
        // Generate a hash map with 50% load
        final Map<String, String> hashMap = new HashMap<String, String>(
                2 * randomKeys);
        final String[] randKeyArr = new String[randomKeys];
        final String[] collisionKeys = new String[randomKeys];
        final Random rand = new Random();

        // Load the map with random keys
        for(int i = 0; i < randomKeys; ++i) {
            final String randomString = (new BigInteger(stringLength * 8,
                    rand)).toString();
            randKeyArr[i] = randomString;
            hashMap.put(randomString, randomString);
        }

        // Load the map with our colliding keys
        for(String s : collisions) {
            hashMap.put(s, s);
        }
        for(int i = 0; i < randomKeys; ++i) {
            collisionKeys[i] = collisions.get(i % collisions.size());
        }

        // Warm up before benchmarking
        for(int i = 0; i < 5; ++i) {
            timeAccess(randKeyArr, hashMap, samples, rand);
            timeAccess(collisionKeys, hashMap, samples, rand);
        }

        for(int i = 0; i < 5; ++i) {
            System.out.printf("Iteration %d\n", i + 1);
            System.out.printf("Random keys: %.4fs\n",
                    timeAccess(randKeyArr, hashMap, samples, rand));
            System.out.printf("Collision keys: %.4fs\n",
                    timeAccess(collisionKeys, hashMap, samples, rand));
        }
    }

    private static double timeAccess(String[] keys, Map<String, String> map,
                                     int samples, Random random) {
        final long startTime = System.nanoTime();
        for(int i = 0; i < samples; ++i) {
            poisonCache(keys, map, random);
            // Pick a key at random to minimize cache effects in the benchmark
            final String key = keys[random.nextInt(keys.length)];
            map.get(key);
        }
        final long endTime = System.nanoTime();
        return (double) (endTime - startTime) / Math.pow(10d, 9);
    }

    // Preform some basic cache busing between iterations to make sure that our
    // benchmark isn't just showing cache effects (not that those aren't useful
    // in their own right)
    private static void poisonCache(String[] keys, Map<String, String> map,
                                    Random random) {
        for(int i = 0; i < 10; ++i) {
            map.get(keys[random.nextInt(keys.length - 1)]);
        }
    }

    public static void main(String[] args) {
        final List<String> collisions = generateCollisionsForString("world",
                permissibleCharacters);
        benchmarkLoadedHashMap(10000, 3, collisions, 100000);
    }

Running this code on Oracle JDK 1.6.37 gives me:

Iteration 1 Random keys: 0.0526s Collision keys: 0.1258s Iteration 2 Random keys: 0.0518s Collision keys: 0.1238s Iteration 3 Random keys: 0.0519s Collision keys: 0.1232s Iteration 4 Random keys: 0.0508s Collision keys: 0.1235s Iteration 5 Random keys: 0.0472s Collision keys: 0.1230s.

Ouch. There's a very stable and obvious 2.5x timing differential between the colliding and random keys. Longer strings admit more collisions so it only gets worse from here.

Closing Thoughts

This is a contrived example but illustrates a real problem to the extent that Oracle addressed it in Java 7. Making strings collide is relatively hard – it's much easier to create these effects (accidentally or deliberately) using a less general type of key. Hash map performance degrades significantly under load factor and collisions, so both are big problems for low latency computing.