Misapplied Math

# Twelve Days 2013: Shunting-Yard Algorithm

## Day Eleven: The Shunting-Yard Algorithm

### TL;DR

Parsing is hard. Parser generators such as ANTLR make it significantly easier by letting you express complex grammars in Extended Backus–Naur Form (a grammar for defining grammars) and then generating the parser automatically. However, that's a pretty heavyweight solution, especially if you just need to do something quick and dirty like implement an expression evaluator. For simple grammars that describe things like mathematical expressions, there's a significantly easier approach: the Shunting-yard algorithm.

The Shunting-yard algorithm can be extended to handle more complex tasks, but at its core the algorithm parses an expression written in infix notation and applies the rules of operator precedence and associativity, with optional parentheses, to rewrite the original expression as something unambiguous. You can use it to rewrite the parsed expression in Reverse Polish Notation (RPN), or to produce an abstract syntax tree (AST). There are lots of examples of the former but I haven't seen too many of the latter. The code below implements both.

## Parsing 101

The simplest end of the parsing spectrum involves something along the lines of taking a CSV file and converting it into a stream of usable values. On the most complex end you have parsers that can read a C++ source file and create a usable representation of it for the compiler. The former definitely doesn't merit anything complicated, while the latter requires some of the most powerful tools for language recognition that we have. For "heavy duty parsing" the process usually looks something like: formal grammar -> lexer -> parser -> abstract syntax tree. If you're writing a compiler or something similar, the abstract syntax tree gets used as a unique, structurally validated, and unambiguous input to whatever happens next.
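To make the lexer stage of that pipeline concrete, here's a minimal Python sketch of a tokenizer for arithmetic expressions. The token kinds (`NUM`, `OP`, `PAREN`) and the `tokenize` function are names I made up for illustration, and the sketch assumes non-negative numeric literals and only the five binary operators used later in this post.

```python
import re

# Hypothetical token pattern: optional leading whitespace, then either a
# number (integer or decimal) or a single operator/parenthesis character.
TOKEN_RE = re.compile(r"\s*(\d+(?:\.\d+)?|[-+*/^()])")


def tokenize(text):
    """Split an infix expression into a list of (kind, lexeme) pairs."""
    tokens = []
    text = text.rstrip()
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected input near position {pos}")
        lexeme = match.group(1)
        if lexeme[0].isdigit():
            kind = "NUM"
        elif lexeme in "()":
            kind = "PAREN"
        else:
            kind = "OP"
        tokens.append((kind, lexeme))
        pos = match.end()
    return tokens


print(tokenize("4 * (1 - 5)"))
# [('NUM', '4'), ('OP', '*'), ('PAREN', '('), ('NUM', '1'), ('OP', '-'), ('NUM', '5'), ('PAREN', ')')]
```

The lexer knows nothing about precedence or grouping; it only turns characters into tokens. Deciding what those tokens mean together is the parser's job.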

There are very powerful parsing strategies such as LL(*) that are almost impossible to implement by hand for any sizable grammar. Thankfully, tools such as ANTLR will write the parser for us, given a formal description of the language. However, this process is fairly heavyweight, and if you're doing something modest like implementing a small scripting language or an equation evaluator you might be able to get away without it.

## Operators, Precedence, and Associativity

Most of us are used to looking at mathematical equations in infix notation: $3 + 4 \cdot 2 /(1 - 5 )^{2^3}$. Unfortunately, that syntax is very hard for a computer to deal with. We subconsciously recognize the rules of operator associativity and operator precedence for mathematical equations, and as programmers we have to know about operator precedence for the languages that we work with; we've learned to parse these things with ease so it's easy to overlook how much is actually going on. When you throw in parenthetical grouping, the task at hand is definitely non-trivial.
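To spell out the rules we apply subconsciously, here's a rough Python sketch of an operator table. The names `OPERATORS` and `fold` are hypothetical, and the precedence numbers are just the conventional ordering (exponentiation binds tightest and is right-associative; everything else is left-associative).

```python
import operator

# symbol -> (precedence, associativity, function); higher binds tighter.
OPERATORS = {
    "+": (1, "left", operator.add),
    "-": (1, "left", operator.sub),
    "*": (2, "left", operator.mul),
    "/": (2, "left", operator.truediv),
    "^": (3, "right", operator.pow),
}


def fold(symbol, operands):
    """Apply one operator across a chain of operands, grouped by associativity."""
    _, assoc, fn = OPERATORS[symbol]
    if assoc == "left":
        # a op b op c  ==  (a op b) op c
        result = operands[0]
        for value in operands[1:]:
            result = fn(result, value)
    else:
        # a op b op c  ==  a op (b op c)
        result = operands[-1]
        for value in reversed(operands[:-1]):
            result = fn(value, result)
    return result


print(fold("-", [8, 3, 2]))  # (8 - 3) - 2 = 3
print(fold("^", [2, 3, 2]))  # 2 ^ (3 ^ 2) = 512, not (2 ^ 3) ^ 2 = 64
```

Applied to the expression above, these rules give $2^3 = 8$ first, then $(1 - 5)^8 = 65536$, then $4 \cdot 2 / 65536 = 0.0001220703125$, and finally $3 + 0.0001220703125 = 3.0001220703125$.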

Alternatives such as Reverse Polish Notation write expressions in a fashion that's unambiguous and denotes order-of-operation without parentheses. However, unless you're really used to working with it (I've seen people beast through computations on RPN calculators like the HP-48), you'll have to spend some time thinking your way through the expression that you're writing or reading. As such, it's a good thing that languages don't require us to write things that way, but that does leave us with the problem of parsing an expression while taking into account associativity and precedence.
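To show why RPN is so convenient for a machine, here's a minimal stack-based evaluator sketched in Python. The `eval_rpn` function and `BINARY_OPS` table are illustrative names of mine, and the sketch assumes whitespace-separated tokens and binary operators only.

```python
import operator

# Assumed operator set: the five binary operators used in this post.
BINARY_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}


def eval_rpn(tokens):
    """Evaluate a sequence of RPN tokens using a single operand stack."""
    stack = []
    for token in tokens:
        if token in BINARY_OPS:
            if len(stack) < 2:
                raise ValueError(f"not enough operands for {token!r}")
            right = stack.pop()  # the most recent operand is the right-hand side
            left = stack.pop()
            stack.append(BINARY_OPS[token](left, right))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


# 3 + 4 * 2 / (1 - 5) ^ 2 ^ 3 written in RPN:
print(eval_rpn("3 4 2 * 1 5 - 2 3 ^ ^ / +".split()))  # 3.0001220703125
```

There's no precedence table and no parentheses here: the order of the tokens alone determines the order of evaluation, so the evaluator is a single loop over the input.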

## Using The Shunting-Yard Algorithm

Dijkstra first described the algorithm in 1961 (as if that guy hadn't done enough brilliant work already…). It provides a simple means of converting expressions in infix notation to postfix notation. Most of the code that I've seen for it outputs the original expression in RPN, but the same procedure can generate an abstract syntax tree as well (in fact, formally speaking, RPN is generated by the post-order traversal of an AST). The algorithm is iterative and runs in $O(n)$, so performance-wise it's as good as you can do for a parser. The code below assumes that every operator has two operands; modifying it to accept unary operators or functions with parameters is quite simple. I also avoided the cruft of refactoring the code into something more OO to keep things short and simple. You would usually want to make the AST nicer to work with, and it's easy to specialize nodes as operators to clean up evaluation.