Misapplied Math

# Accelerated FIX Processing via AVX2 Vector Instructions

## Accelerated text processing via SIMD instructions

Text isn't going anywhere as a means of storing and transmitting data. It's pretty rare that I hear anyone speak of binary protocols for scientific data short of HD5, and frameworks such as Hadoop largely rely on CSV, XML, and JSON for data interchange. As such there's good incentive to optimize text processing; on Intel x86 hardware, SSE and AVX instructions are ideal for the task. Both are examples of single instruction multiple data (SIMD) instructions – primitives that target vector registers for single instruction parallelism. I have a specific motivation in writing this post – the FIX protocol. However, the examples below would apply equally well to most text processing tasks.

## Background on the FIX Protocol

The FIX Protocol underpins a vast ecosystem of electronic trading. It came about as an easy to implement, generic, and flexible means of transmitting orders and disseminating market data via human readable text. As FIX predates mass market HFT it addressed a different use case than what's common in the binary protocols that emerged thereafter. At the time the ability to transmit extensible, loosely structured data outweighed performance considerations. That said, FIX still stands as the the only standardized, widely adopted protocol for both orders and market data. Most brokers and exchanges support it, even if they have a proprietary, lower latency offering as well.

FIX is a nightmare from a performance standpoint. Integers and decimals are transmitted as ASCII plain text necessitating extra bandwidth and a byte-by-byte conversion, messages aren't fixed length, and the protocol necessitates parsing to extract meaningful business objects. Expressed as a (sloppy/partial) EBNF grammar FIX is simply:

As an example, consider: "8=FIX.4.2|9=130|35=D|…|10=168," which is in the format tag=value|tag=value…" All messages start with a "begin string" specifying the protocol version (8=FIX.4.2) and end with a simple ASCII checksum mod 256 (10=168). An extensive, informally specified grammar addresses application layer validation.

## The Problem

People typically use a FIX engine to handle FIX. I've only described the representation of a message but FIX comes with a long list of requirements: heartbeats, reconnects, message replay, etc. Using an engine that's reasonably performant, standardized, and well tested spares you those unpleasantries. Open source options such as quickfix are in widespread use, and there's a long list of off-the-shelf commercial engines that are more performant/feature rich. If you're deeply concerned about deterministic latency and have the budget, companies such as FixNetix have pushed FIX processing and much more onto FPGAs and ASICs.

FIX engines address a very broad use case, playing no favorites between the buy side and the sell side. They conflate many concerns: connectivity, parsing, validation, persistence, and recovery. Engines are opinionated software and the way to go if you just want to get something working. However, chances are that there's plenty of code bloat and indirection to support a use case that you don't care about. If you're the initiator of an order, and not a broker or an exchange who's responsible for maintaining FIX sessions to a wide user base, that's especially true. On top of everything else, good luck separating your business logic from the engine's API in a clean, zero copy fashion.

I'm in the process of designing a trading platform (many components of which I'll open source, so stay tuned) and as such I've had an opportunity to revisit past sins – the handling of FIX messages being one of them. I decided to build a very simple, buy-side-optimized FIX framework that separates network, parsing, and persistence concerns. It won't be as beginner friendly but it will put the developer back in control of things that matter: memory management, threading, message processing, and API isolation. Initial tests show that it's an order of magnitude lower latency than most of what's out there. That's not a fair comparison seeing as it offers much less for the general use case, but it suits my purposes. Also keep in mind that network hops are always the big, roughly fixed cost expense.

## Part 1: Parsing

Playing around with the lowest level concerns – message tokenization and checksum calculation gave me a good excuse to try out the latest AVX2 introduced as part of the Intel Haswell microarchitecture. AVX2 greatly expanded AVX integer instruction support and introduced many other floating point goodies as well. AVX gets another bump in 2015-2016 with the introduction of AVX-512. At present SSE instructions target 128 bit XMM registers while AVX uses 256 bit YMM registers. AVX-512 will introduce 512 bit ZMM registers doubling Intel's superscalar capabilities once again.

Disclaimer: the code below is not well tested, it's not even close to what I use in production, and it will probably only build on Linux GCC > 4.7. Furthermore, running it on any processor that doesn't support AVX2 will merrily give you a SIGILL (illegal instruction) and kill your program. These benchmarks are quick and dirty. My test bench: Fedora 19 on a 15" late 2013 MacBook Pro (Haswell): Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz.

We'll start with tokenization. As a toy example, let's count the number of equal signs '=' and '\1' characters in a null terminated string (this is functionally equivalent to parsing a message using a visitor pattern). I used the following modified but real message for all of my benchmarks:

A canonical implementation looks something like:

The first two implementations don't use any form of vectorization and serve as our baseline. As noted, a good compiler will effectively unroll the first implementation into the second, but explicit unrolling serves as a nice illustration of this common optimization. Compiling with "CFLAGS=-march=core-avx2 -O3 -funroll-loops -ftree-vectorizer-verbose=1" shows that none of our functions were vectorized and that the optimizer left our "hand unrolling" alone in parseUnrolled().

From the vectorization report we also see that the loop in parseNaive() was unrolled seven times. The compiler is sparing with this optimization as unrolling comes at a cost. Increased code size leads to potential performance issues (long jumps in a huge function can cause an instruction cache miss, which is really, really bad latency wise). Note that by default GCC looks for vectorization opportunities at the -O3 optimization level, or whenever -ftree-vectorize is set. However, because of its potential drawbacks, global unrolling isn't enabled at any optimization level by default. Setting -funroll-loops at -02 and higher will ask GCC to treat all loops as candidates for unrolling.

The results weren't compelling but the "best implementation" using AVX did offer a 5% speedup over the next runner up - our hand unrolled loop. Averaging across 10,000,000 iterations yields:

There's a good explanation for these lackluster results. On the SSE front STTNI (String and Text New Instructions) instructions have a very high instruction latency. PCMPESTRI, emitted by _mm_cmpistri takes eight cycles to return a result. The STTNI instruction set offers a rich set of operations via its control flag but because our query is so basic the instruction's overhead isn't worth it. Worst of all, "needles" are very dense in our "haystack" so we end up duplicating vector loads. STTNI instructions perform very well on general use cases or more complicated queries, which is why functions such as strchr() and strlen() in glibc use them.

On the AVX front we use a simple bitmask comparison to find equal signs and SOH characters. That's great but we're left with a bitmask that we still have to iterate over in a serial fashion. I experimented with several approaches including iteration via LZCNT and a finer grained search than the 32-bit integer one used above. Everything that I tried, albeit not an exhaustive list, was a tie or marginally slower. The classic parallel stream compaction algorithm is, in theory what we want. However, I've yet to figure out an efficient way to reorder the data with vector shuffle operations. If anyone has an idea on this front I would love to hear from you.

## Part 2: Checksums

Given our disappointing results on the parsing front it's time for a win. Calculating a checksum is embarrassingly parallel (ignoring overflow, which we can for any practical FIX message) so it should lend itself to vectorization quite well. Let's try out a few implementations:

And the results:

The first thing we note is that GCC is pretty good at its job. With vectorization enabled we get code that's a dead tie with our first hand optimized implementation. Without vectorization enabled it's no contest. And this time around, we beat GCC's vectorized code by a factor of two.

First let's look at what GCC did to our baseline implementation. Picking through the disassembly reveals that the compiler did indeed use AVX. In short the function spends some time 16-byte aligning memory (unaligned loads to vector registers are slower, but on modern hardware there's very little difference sans pathological cases) before entering a loop where our vector is padded, sign extended, and added as packed double words on the upper and lower half of a YMM register (most AVX instructions treat a 256 bit YMM register as two independent 128 bit ones):

Our hand implemented version is easier to follow and translates directly from what we wrote so no surprises here:

Now for the fun part. In avxChecksumV1() we used the PHADDW instruction to quickly accomplish what we wanted – a sum across each 32 byte chunk of our FIX message. SIMD instructions are optimized to operate "vertically," meaning that operations such as $v_1 = (x_1, x_2)$, $v_2 = (y_1, y_2)$, $z = x + y = (x_1 + y_1, x_2 + y_2)$ are efficient, and horizontal operations such as a prefix sum are not. Almost all of the AVX/SSE add instructions have only one cycle of latency and execute as one micro operation. HADDW requires 3-4 cycles and 1-2 $\mu$-ops depending on its operands. Eliminating it should pay dividends.

As noted in the comments for avxChecksumV2() we can get free unpacking via _mm256_madd_epi16, which emits the PMADDWD instruction (one cycle, two $\mu$-ops). Evidently GCC has a better understanding of what we're trying to do this time around as it unrolls the inner loop and reorders our instructions to minimize loads and better optimize register use:

We note that there's roughly a factor of five difference between the unvectorized naive function and our best implementation that uses six AVX instructions per loop. That sounds pretty reasonable as we're working with 32 characters at a time and each instruction has one cycle of latency, so by Little's Law our max speedup is $\frac{32}{6} \approx 5.3$.

## Closing Thoughts

Having worked through these examples it's easy to see why FPGAs and ASICs have an advantage in this space. Processing and validating tag/value pairs is a simple, highly parallel problem that lends itself well to a hardware implementation. Hundreds of thousands of operations can be carried out in parallel with virtually zero latency jitter. That does however come at a cost – as demonstrated above we can tokenize a message in ~110ns, which is roughly twice the time that it would take a CPU to read from main memory (or in this case an FPGA coprocessor over a DMA bus). Unless the FPGA/ASIC does application layer validations as well, having an external process or piece of hardware parse a message for the sake of handing you a packed structure probably isn't worth it. The hardware value add comes from deterministic networking and highly parallel risk checks.

FIX is about as simple as it gets so SSE/AVX has much more to offer when the use case is more complex. Furthermore, as noted before, the distance between tag/value pairs is small, meaning that we don't get the same sort of boost that we would expect when searching for sparse delimiters in structured text. Intel has a nice paper on XML schema validation via STTNI and I came across a good article on SSE UTF-8 processing when writing this. As a side note, for sufficiently long integers specializing atoi() via fused multiply-add instructions might pay dividends. That aside…man I could go for an efficient AVX optimized array compaction algorithm.