Misapplied Math

# Twelve Days 2013: Discrete Sampling (Dec 13)

## Day Two: Discrete Random Variable Generation and The Table Method

### TL/DR

There's an efficient algorithm for generating random numbers over a discrete distribution. The details are found in the paper Fast Generation of Discrete Random Variables, along with sample code.

## Explanation

Generating random variables over a discrete distribution is a common operation. Many resampling methods in computational statistics rely on it, as do many types of simulations. Operating systems such as Linux use various sources of entropy to generate uniform random numbers over a closed interval. Well-known techniques such as the ziggurat algorithm can generate random numbers under an arbitrary distribution, and the ziggurat is reasonably efficient for most continuous distributions. However, it's hard to implement efficiently for discrete distributions: it uses rejection sampling, which wastes random numbers (and they aren't cheap to generate in the first place), and it has some pathologically bad cases.

## A First Pass

If our discrete distribution is small and known ahead of time we could hard code conditionals and operate on a single random uniform value. That's hard to beat performance-wise. However, if we need to do this for a distribution with lots of discrete categories, or one that's not known ahead of time, the if-then-else approach won't work. One alternative is as follows:
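The original snippet isn't reproduced here, so the following is a minimal C++ sketch of the idea (function and parameter names are my own): take one uniform draw on $[0, \text{total weight})$ and walk the weight array, subtracting as we go.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear scan: subtract each weight from a uniform draw u on
// [0, totalWeight) until it would go negative; that index is the sample.
std::size_t sampleLinear(const std::vector<double>& weights, double u) {
    for (std::size_t i = 0; i < weights.size(); ++i) {
        if (u < weights[i]) return i;
        u -= weights[i];
    }
    return weights.size() - 1;  // guard against floating point round-off
}
```

With weights (0.5, 0.3, 0.2), a draw of u = 0.6 falls through the first bucket and lands in the second.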

As you can see this is linear in the number of weights, and pathologically bad for certain distributions. We could do slightly better by sorting the weights ahead of time. Taking it one step further we can precompute a cumulative sum of the weights and do a binary search. That might or might not pay dividends depending on the length of the weight array among other factors (binary search has very bad cache performance).
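The cumulative-sum variant can be sketched like so (again, hypothetical names): pay an O(n) precomputation once, then each sample is an O(log n) binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Precompute running totals of the weights once, O(n).
std::vector<double> cumulate(const std::vector<double>& weights) {
    std::vector<double> cdf(weights.size());
    std::partial_sum(weights.begin(), weights.end(), cdf.begin());
    return cdf;
}

// Each sample is then a binary search for the first cumulative
// weight strictly greater than the uniform draw u.
std::size_t sampleBisect(const std::vector<double>& cdf, double u) {
    return std::upper_bound(cdf.begin(), cdf.end(), u) - cdf.begin();
}
```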

## A Better Approach

If we can pay the setup cost ahead of time and reuse our random generator, a better approach is to use a lookup table. In that case we would generate a large table of entries using a technique like the one above, and sample from the table using a random integer generator. However, we would need a table big enough to avoid resampling errors, and like binary search, this comes at the cost of poor cache performance (not to mention the memory overhead).
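A hypothetical sketch of the table approach: replicate each index in proportion to its weight, after which sampling is a single uniform integer draw and one array read. Resolution (and the resampling error mentioned above) is governed by the table size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build a lookup table in which index i appears roughly
// weights[i] * tableSize times. Larger tables quantize the
// distribution more finely at the cost of memory and cache misses.
std::vector<std::size_t> buildTable(const std::vector<double>& weights,
                                    std::size_t tableSize) {
    std::vector<std::size_t> table;
    table.reserve(tableSize);
    for (std::size_t i = 0; i < weights.size(); ++i) {
        std::size_t copies =
            static_cast<std::size_t>(weights[i] * tableSize + 0.5);
        table.insert(table.end(), copies, i);
    }
    return table;
}

// Sampling is then just: table[uniformInt % table.size()]
```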

## The Compromise

The paper linked in the TL/DR section, Fast Generation of Discrete Random Variables, compares several table-based hybrid methods. They're the best you'll do for general use cases. There's another technique, the alias method, which can offer better cache locality at the cost of having some edge cases where it performs very poorly. Once again, if your distribution is small enough, hard coded conditionals are the way to go. You can even use source code generation to create them automatically, or run-time code generation to do so on the fly.

# Twelve Days 2013: Reed–Solomon Codes (Dec 12)

## The Twelve Days of Christmas…err…Algorithms

Ok, so the "real" twelve days of Christmas starts on the 25th of December, but I like the idea of starting on the 12th. Starting today and ending Christmas Eve I'll write about a different algorithm or data structure each day. These posts will be quick reads (except for this kickoff one) and range in complexity but they'll all follow three simple rules:

1. I'm only writing about things that I've used myself and that I hope are general enough to benefit others.
2. If you don't want commentary, every post will start with a TL/DR explaining the use with links to resources/implementations when applicable.
3. They've gotta have a cool factor.

And with that, let's start day one…

## Day One: Reed–Solomon Codes

### TL/DR

Reed–Solomon codes are a fast, super useful form of error correction that's easily implemented in software. If you need to transmit data over a noisy channel (think UDP over WAN) or if you're writing a file system/highly reliable data store, they're your friend, especially for latency critical applications. There's a high quality C implementation in QFS (Quantcast File System), based on the original implementation in KFS.

### A Brief Introduction to Error Correction

Without error correcting codes, storing data on a hard drive, making a phone call, listening to music, or watching a DVD would all fall somewhere on the spectrum of "impossible" to "much less efficient." Error correction is a beautiful application of pure math to real-world problems. There's a huge range of error correction/detection schemes, but they come in two broad categories: backward error correction, in which an error is detected and the data is re-requested, and forward error correction, in which data is sent with enough redundancy to detect an error and correct for it.

Fundamental limits dictate the maximum efficiency of any error correction scheme. Within those limits, different schemes have their pros, cons, and common use cases. One important consideration is the ease and efficiency of an implementation in hardware or software. Error correction is frequently applied in real time (think cellphone conversations or reading from a Blu-ray disk), so speed is often a primary concern on the encoding or decoding end, and sometimes both.

Reed–Solomon codes are a powerful and early example of error correction. They're still in widespread use as they perform well and are relatively fast even when implemented without special hardware. Modern alternatives such as Turbo Codes are closer to the Shannon Limit but their decoding process is complex and higher latency. For Reed–Solomon, encoding is a simple linear operation, and decoding can be made efficient.

The first time I saw Reed–Solomon codes the concept went right over my head. They were presented using way more abstract algebra than I understood at the time, and I missed the fact that they rely on pretty simple, elegant concepts. Yes, you do need to know about Galois field theory and abstract algebra to prove results on the general case of Reed–Solomon, or to develop an efficient implementation. However, appreciating the beauty of how it works or experimenting with a special case requires only algebra and some faith in the underlying math. What follows is a toy example. Real implementations take a different approach mathematically for performance reasons. Standard CPUs weren't built to do efficient math on finite fields (although dedicated hardware often is).

## The Intuition

Let's forget about codes for a second and think back to geometry. A line is uniquely defined by two points. Giving someone a third point picked from that line doesn't convey any additional information, or change how you would draw that line. Now imagine that you want to tell your friend Sally two numbers over a bad phone connection. One approach is to ask her to draw a line passing through a series of points, and tell her that the slope and the intercept of that line are the two numbers that you want her to have. If you state three collinear $(x, y)$ points and Sally doesn't hear one pair – not a problem, she can still draw a line identical to yours and recover the slope and intercept. Likewise, if she goes to draw the line and finds that the three points don't line up, she'll know that something went wrong.

When we describe our line we can add as much redundancy as we want by including extra points in our description. As long as our final message arrives with at least two of them, we're good to go. However, what happens if instead of not hearing us at all, our friend hears us incorrectly? Now we have a new problem – determining which points are garbage.

Sally knows what points are permissible – the two of you agreed on some rules ahead of time. Maybe you decided that you'll never tell her a negative coordinate, or that the x value of every coordinate will always be a power of two. If she hears a point that's not valid, or if the points don't line up, she'll know that there's a problem. She might ask you to repeat yourself: backward error correction. However, she doesn't want to waste your time, so if she has enough information to sort things out on her end she'll do it: forward error correction. Note that detecting an error in this scheme or correcting for it isn't always possible. There's always a chance that she hears all of the points incorrectly, but that they meet the agreed upon restrictions, and they still form a line. Sally won't know that there was a problem in this case.

Up until now it's not clear why you wouldn't just tell Sally the same set of coordinates over and over again instead of picking a unique third one: $\{(1, 2), (2, 1), (1, 2), (2, 1), \ldots\}$. You certainly could, but if two things sound or look alike they're easier to confuse – try sorting out whether someone is saying "Nancy" or "Mancy" over a bad radio connection. Similarly, we want coordinates that are as visually and audibly distinct as possible, so sampling unique points that are far apart makes sense.

Let's start calling our coordinates codewords, and note that when we write them in binary the "distance" between them is called the Hamming distance. Hamming distance measures the number of flipped bits required to turn one codeword into another. Codewords that are far apart are desirable: when they get garbled, they're easier to tell apart. For example, if our codewords were 1111 and 0000 and we receive 1011, it's much more plausible that one bit in 1111 flipped to produce 1011 than three in 0000. If we had a little extra information we could say for sure.

The line is a good start, but can we do better? We want something that:

• Is a unique, one-to-one mapping, meaning that we can uniquely recover any message from its code.
• Lets us use as many codewords as we want.
• Provides a consistency check (in the case of our line, the fact that valid points fall on the line is a consistency check).

What about a polynomial? A line is a polynomial of degree one, and just as a line can fit any two data points, a polynomial of degree $n$ can fit any $n + 1$ data points. Is it unique? Yup, there's a simple proof showing that any polynomial $P(x)$ of degree $n$ passing through $n + 1$ data points is unique – no other polynomial of degree $n$ can pass through all of those same points unless it's identical to $P(x)$. Can we use it to construct a one-to-one mapping scheme? Not necessarily, but maybe we can with some restrictions. What about the codewords? Well, we're free to pick coefficients on our polynomial, so we can use those as codewords, just as we did slope and intercept on a line. What about the consistency check? By the uniqueness property above oversampling from a polynomial is no different from oversampling from a line – there's only one possible fit, so given a fixed amount of redundancy we'll be able to detect a fixed number of errors.

Reed–Solomon codes are, in a certain light, a formula for how to sample from a polynomial and use those samples to interpolate the original polynomial. The encoding procedure is a simple linear operation but there are lots of ways to do the decoding, and that's where things get confusing/mathy. Following the Reed-Solomon methodology gives you provable bounds on the maximum number of errors that can be detected or automatically repaired. As noted previously, and as with any coding scheme, there's always a probability of failure in which we 1) detect more damage than we can recover, or 2) end up with something that's internally consistent but incorrect. It's impossible to eliminate this probability, but we can make it arbitrarily small depending on how much duplication we're willing to tolerate (more duplication = less information density).

The procedure that follows is, at the end of the day, a glorified means of interpolating a family of polynomials. We can tolerate a certain number of errors and still pick the correct polynomial out of the family as (assuming there are at most the threshold number of errors) consistency arguments across different combinations of points rule out all curves but one, providing a unique solution. The restrictions placed on the codewords used and the fact that it needs some strange arithmetic (adding and inverting numbers modulo a prime number) isn't germane to the bigger picture. They're vestiges of our need to provide provable guarantees on the codes, make them one-to-one, and make decoding efficient.

## The Procedure

For simplicity I'm repeating the example found here with code and a step-by-step approach. If you're interested in the field theory aspect I included a very brief "why fields" motivator below the fold. These are some notes on practical implementation, and here's the Quantcast one. The math (as is) only works if we restrict ourselves to integer codewords $0 \leq c < p$, where $p$ is prime. I'm using $p = 7$ (which is why everything is evaluated modulo 7) but you're free to use a larger prime $p$ and adjust accordingly.

A few definitions:

$$\begin{aligned} m &= \text{number of message symbols} \\ n &= \text{number of code symbols} \\ e &= \left\lfloor\frac{n - m}{2}\right\rfloor = \text{max recoverable errors} \\ C_k(x) &= x(x - 1) \cdots (x - k + 1) / k! \\ P(t) &= x_m t^{m - 1} + x_{m - 1} t^{m - 2} + \cdots + x_2 t + x_1 \\ \vec{P} &= [P(0)\ P(1) \ldots\ P(n - 1)] = \text{transmitted} \\ \vec{R} &= [R(0)\ R(1) \ldots\ R(n - 1)] = \text{received} \end{aligned}$$

For a sequence $a_1, a_2, \ldots$ define the first difference as $\Delta_i = a_{i + 1} - a_i$. The second difference is the first difference of the first difference, and so on. Define $a^{(j)}$ as the sequence $0^jR_0,\ 1^jR_1, \ldots,(n - 1)^jR_{n - 1}$. Let $B$ be an $e\ \times\ (e + 1)$ matrix whose elements are defined by:

$b_{ij} = \left(\Delta^{m + e}\, a^{(j)}\right)_i$

We'll take $m = 3$ (three symbols in our message), $n = 7$ (seven symbols in our encoded, redundant message), and note that by our formula for $e$ we can correct up to $e = 2$ errors. For a message we'll use: $(2, 3, 4)$, which in turn yields the polynomial $P(t) = 4t^2 + 3t + 2$. We evaluate it at the points $(0, \ldots, 6)$ because $n = 7$, so we need to sample seven points:
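The original code block isn't reproduced here; the following C++ sketch evaluates $P(t) = 4t^2 + 3t + 2$ at $t = 0, \ldots, 6$, with all arithmetic mod 7.

```cpp
#include <cassert>
#include <vector>

// Evaluate the message polynomial at t = 0, ..., n - 1, mod p.
// msg holds (x_1, ..., x_m), i.e. coefficients in ascending order.
std::vector<int> encode(const std::vector<int>& msg, int n, int p) {
    std::vector<int> code;
    for (int t = 0; t < n; ++t) {
        int acc = 0, pw = 1;  // pw tracks t^k mod p
        for (int c : msg) {
            acc = (acc + c * pw) % p;
            pw = (pw * t) % p;
        }
        code.push_back(acc);
    }
    return code;
}
```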

We now have $\vec{P} = (2, 2, 3, 5, 1, 5, 3)$. Suppose that we receive $\vec{R} = (2, 2, 6, 5, 3, 5, 3)$, which contains two errors. We compute the matrix $B$ defined above.
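The computation itself isn't shown above; as a sketch, $B$ can be built directly from the definitions of $a^{(j)}$ and $\Delta$, mod 7 (helper names are my own):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// First difference of a sequence, mod p.
std::vector<int> diff(const std::vector<int>& a, int p) {
    std::vector<int> d;
    for (std::size_t i = 0; i + 1 < a.size(); ++i)
        d.push_back(((a[i + 1] - a[i]) % p + p) % p);
    return d;
}

// b_{ij} is the i-th element of the (m+e)-th difference of a^{(j)},
// where a^{(j)}_i = i^j * R_i (mod p), with 0^0 taken as 1.
std::vector<std::vector<int>> buildB(const std::vector<int>& R,
                                     int m, int e, int p) {
    int n = static_cast<int>(R.size());
    std::vector<std::vector<int>> B(e, std::vector<int>(e + 1));
    for (int j = 0; j <= e; ++j) {
        std::vector<int> a(n);
        for (int i = 0; i < n; ++i) {
            int pw = 1;  // i^j mod p
            for (int k = 0; k < j; ++k) pw = pw * i % p;
            a[i] = pw * (R[i] % p) % p;
        }
        for (int k = 0; k < m + e; ++k) a = diff(a, p);  // apply Δ^(m+e)
        for (int i = 0; i < e; ++i) B[i][j] = a[i];
    }
    return B;
}
```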

Thus:

$b = \left[ \begin{array}{ccc} 2 & 5 & 0 \\ 0 & 5 & 2 \end{array} \right]$

We need to solve a system of linear equations (modulo 7) with $b$ as our coefficient matrix:

$\left[ \begin{array}{ccc} 2 & 5 & 0 \\ 0 & 5 & 2 \end{array} \right] \left[ \begin{array}{c} v_0 \\ v_1 \\ v_2 \end{array} \right] = \left[ \begin{array}{c} 0 \\ 0 \end{array} \right]$

This one is easy as $v = (1, 1, 1)$ works, but in general solving a system of linear equations modulo some integer is pretty annoying, and Gaussian elimination only works cleanly when that integer is prime (every non-zero element then has a multiplicative inverse). As much fun as working through that would be…let's use Wolfram Alpha's solver instead. The solution that we want is the minimal non-zero one on the second line of output: $v = (1, 1, 1)$.

Reed–Solomon hinges on an equation called the key equation: a theorem stating that there exist polynomials $E(t)$ of degree $\leq e$ and $Q(t)$ of degree $\leq m + e - 1$ such that $Q(i) = R_i E(i)$ for all $i = 0, 1, \ldots, n - 1$. There's a proof guaranteeing that it has non-zero solutions, and the system that we just solved gave us the coefficients of $E(t)$, so we're halfway there. There's another result that allows us to express a polynomial as a series of successive differences (see the paper that I took this example from if you're interested). The end result is that, using $C_k$ and $\Delta$ as we previously defined them, we can write any degree $d$ polynomial as:

$f(x) = f(0)C_0(x) + \Delta f(0) C_1(x) + \cdots + \Delta^d f(0) C_d(x)$

We'll use this to reconstruct $Q$. To do so we need the first element of the $i$th difference of the sequence $Q(i) = R_iE(i)$ for $i = 0, \ldots, m + e - 1$.
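The code that produced the difference vector isn't shown above; here's a sketch. With $E(t) = t^2 + t + 1$ (the coefficients $v = (1, 1, 1)$ from the system we solved), we form $Q(i) = R_i E(i) \bmod 7$ and record the first element of each successive difference:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Q(i) = R_i * E(i) mod p, with E(t) = t^2 + t + 1.
std::vector<int> buildQ(const std::vector<int>& R, int p) {
    std::vector<int> q;
    for (std::size_t i = 0; i < R.size(); ++i) {
        int t = static_cast<int>(i);
        int e = (t * t + t + 1) % p;
        q.push_back((R[i] % p) * e % p);
    }
    return q;
}

// First elements of the 0th through (count - 1)th differences, mod p.
std::vector<int> leadingDiffs(std::vector<int> q, int count, int p) {
    std::vector<int> firsts;
    for (int k = 0; k < count; ++k) {
        firsts.push_back(q[0]);
        std::vector<int> d;  // first difference of q, mod p
        for (std::size_t i = 0; i + 1 < q.size(); ++i)
            d.push_back(((q[i + 1] - q[i]) % p + p) % p);
        q = d;
    }
    return firsts;
}
```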

This gives us the difference vector $(2, 4, 4, 4, 5)$. We can now write:

$Q(t) = 2C_0(t) + 4C_1(t) + 4C_2(t) + 4C_3(t) + 5C_4(t)$

Those $C_i$ polynomials have a factor of $1 / i!$ in them, which would lead you to believe that they're fractional. However, because we're doing our arithmetic modulo $p$, we're actually dealing with a modular multiplicative inverse. We can expand the polynomial above as we would any polynomial and work out the congruence on each coefficient by hand (for $t^4$ that's tantamount to solving the congruence $24x \equiv 5 \pmod{7}$). Or we can take the lazy approach again: Mathematica's PolynomialMod.
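As a sketch of what PolynomialMod does for us here, we can expand each $C_k$ as a falling factorial scaled by the modular inverse of $k!$, computed via Fermat's little theorem (all names below are my own):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Modular exponentiation and inverse (p prime, Fermat's little theorem).
int powMod(int b, int e, int p) {
    int r = 1;
    b %= p;
    for (; e > 0; e >>= 1) {
        if (e & 1) r = r * b % p;
        b = b * b % p;
    }
    return r;
}
int invMod(int a, int p) { return powMod(a, p - 2, p); }

// Expand Q(t) = sum_k d_k C_k(t), C_k(t) = t(t-1)...(t-k+1)/k!,
// returning coefficients in ascending order, mod p.
std::vector<int> expandQ(const std::vector<int>& d, int p) {
    std::vector<int> q(d.size(), 0);
    std::vector<int> falling = {1};  // numerator of C_0
    int fact = 1;                    // k! mod p
    for (std::size_t k = 0; k < d.size(); ++k) {
        if (k > 0) {
            // multiply the falling factorial by (t - (k - 1))
            std::vector<int> next(falling.size() + 1, 0);
            for (std::size_t i = 0; i < falling.size(); ++i) {
                next[i + 1] = (next[i + 1] + falling[i]) % p;
                next[i] = (next[i] +
                           falling[i] * ((p - (int)(k - 1) % p) % p)) % p;
            }
            falling = next;
            fact = fact * static_cast<int>(k) % p;
        }
        int scale = d[k] % p * invMod(fact, p) % p;
        for (std::size_t i = 0; i < falling.size(); ++i)
            q[i] = (q[i] + scale * falling[i]) % p;
    }
    return q;
}
```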

Doing so gives us $4t^4 + 2t^2 + 5t + 2$. If we ask Wolfram to factor this for us, mod 7, we get $4 (t + 3) (t + 5) (t^2 + 6 t + 4)$. Doing the same for $E(t)$ gives $E(t) = (t + 3) (t + 5)$. As we can see $Q(t)$ is the product of $E(t)$ and something else. That something else is (expanding modulo 7): $4 t^2 + 3t + 2$. Look familiar? That's our original message, derived from the damaged one that we received.
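To see that last division concretely, here's a hypothetical long-division sketch over $\mathbb{Z}_7$ that recovers the message polynomial from $Q$ and $E$:

```cpp
#include <cassert>
#include <vector>

// Long division of q by a monic divisor e over Z_p, coefficients in
// ascending order. Assumes the division is exact (zero remainder).
std::vector<int> polyDiv(std::vector<int> q, const std::vector<int>& e,
                         int p) {
    int dq = static_cast<int>(q.size()) - 1;
    int de = static_cast<int>(e.size()) - 1;
    std::vector<int> quot(dq - de + 1, 0);
    for (int k = dq - de; k >= 0; --k) {
        int c = q[k + de] % p;  // current leading coefficient
        quot[k] = c;
        for (int i = 0; i <= de; ++i)  // subtract c * e, shifted by k
            q[k + i] = ((q[k + i] - c * e[i]) % p + p) % p;
    }
    return quot;
}
```

Dividing $Q(t) = 4t^4 + 2t^2 + 5t + 2$ by $E(t) = t^2 + t + 1$ yields $4t^2 + 3t + 2$: the original message $(2, 3, 4)$.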

Having walked through this it's easy to see why this wouldn't lend itself well to practical use as is. We needed a computational algebra system to do this, and factoring polynomials over finite fields as we did to see our result isn't fun. As such, practical implementations take a different view of the math and won't touch any algebra directly.

May your 2014 be slightly less error prone.

## Optional Deeper Dive

Polynomials live in a function space, and $\{1, t, t^2, t^3 \ldots\}$ forms its basis. A basis is the smallest linearly independent set of vectors (meaning that no one vector can be produced by any combination of the others) needed to represent any other point in a vector space. Just as $\{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$ forms a basis over a three dimensional space defined in terms of $(x, y, z)$ coordinates, the basis above can represent any polynomial as a linear combination of basis vectors.

Polynomials can, generally speaking, have any type of coefficient. However, the type of polynomial that we were using had integer coefficients less than a prime number $p$: $a \in \{0, 1, \ldots, p-1\}$. More formally:

$a \in \mathbb{Z}_p = \text{the set of all congruence classes, modulo p}$

These integers form a field and we get some neat properties because of this. Our codewords are the coefficients of these polynomials – elements of the field. We chose the field that we did for our coefficients out of convenience, but in practice fields with $2^8 = 256$ elements are used so that every byte has a representation as a single codeword. Fields can be finite or infinite; ours is finite. Such finite fields are called Galois fields. Galois theory makes a lot of modern day electronics possible.

Fields are defined axiomatically, and one of the axioms, closure, means that the sum or product of any two elements in a field is still in that field. For vector spaces over the reals, the intuition is that you can never add or multiply two real numbers and get something that's not in the reals. The intuition is somewhat different for finite fields. For example, the smallest possible field has two elements, and is called GF(2). It contains the elements 0 and 1, and operations on elements in the field are carried out modulo 2. We can see that it's closed under addition by checking all combinations: 0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, 1 + 1 = 0 (mod 2). If we didn't carry out operations modulo 2 we would have 1 + 1 = 2, which is not in the field. It's easy to check the other field axioms hold for $GF(2)$, and we can do the same for the field that we just defined above.

Field extensions of $GF(2)$ such as $GF(2^8)$ admit a nice representation of codewords as bytes. There's a powerful result guaranteeing that $\exists$ a finite field with $p^d$ elements for every prime $p$ and every positive integer $d$. Working with fields makes it a lot easier to reason about certain types of algorithms, and as such they play an important role in both coding theory and cryptography.

In short, why do we care? Our ability to manipulate polynomials as we did and treat them as we would treat ordinary numbers, carrying out operations such as modular arithmetic and long division, hinges on these facts. Generalized Reed-Solomon actually takes this a step further and uses codes over rings of matrices (our coefficients are over a field so certain sets of polynomials will form a polynomial ring). Aside from that, working with $\mathbb{Z}_p$ is required to prove an identity that gives the weight of a Reed–Solomon code, and by extension the maximum number of errors that it can detect/correct. Last but not least, doing all of this in an efficient manner (and not like we did above) requires lots of tricks grounded in abstract algebra and finite field theory.

# Accelerated FIX Processing via AVX2 Vector Instructions (Dec 10)

## Accelerated text processing via SIMD instructions

Text isn't going anywhere as a means of storing and transmitting data. It's pretty rare that I hear anyone speak of binary protocols for scientific data short of HDF5, and frameworks such as Hadoop largely rely on CSV, XML, and JSON for data interchange. As such there's good incentive to optimize text processing; on Intel x86 hardware, SSE and AVX instructions are ideal for the task. Both are examples of single instruction multiple data (SIMD) instructions – primitives that target vector registers for single instruction parallelism. I have a specific motivation in writing this post – the FIX protocol. However, the examples below would apply equally well to most text processing tasks.

## Background on the FIX Protocol

The FIX Protocol underpins a vast ecosystem of electronic trading. It came about as an easy to implement, generic, and flexible means of transmitting orders and disseminating market data via human readable text. As FIX predates mass market HFT it addressed a different use case than what's common in the binary protocols that emerged thereafter. At the time the ability to transmit extensible, loosely structured data outweighed performance considerations. That said, FIX still stands as the only standardized, widely adopted protocol for both orders and market data. Most brokers and exchanges support it, even if they have a proprietary, lower latency offering as well.

FIX is a nightmare from a performance standpoint. Integers and decimals are transmitted as ASCII plain text, necessitating extra bandwidth and a byte-by-byte conversion; messages aren't fixed length; and the protocol requires parsing to extract meaningful business objects. Expressed as a (sloppy/partial) EBNF grammar, FIX is simply:

As an example, consider "8=FIX.4.2|9=130|35=D|…|10=168", which is in the format "tag=value|tag=value…" (with "|" standing in for the ASCII SOH delimiter). All messages start with a "begin string" specifying the protocol version (8=FIX.4.2) and end with a simple ASCII checksum mod 256 (10=168). An extensive, informally specified grammar addresses application layer validation.

## The Problem

People typically use a FIX engine to handle FIX. I've only described the representation of a message but FIX comes with a long list of requirements: heartbeats, reconnects, message replay, etc. Using an engine that's reasonably performant, standardized, and well tested spares you those unpleasantries. Open source options such as quickfix are in widespread use, and there's a long list of off-the-shelf commercial engines that are more performant/feature rich. If you're deeply concerned about deterministic latency and have the budget, companies such as FixNetix have pushed FIX processing and much more onto FPGAs and ASICs.

FIX engines address a very broad use case, playing no favorites between the buy side and the sell side. They conflate many concerns: connectivity, parsing, validation, persistence, and recovery. Engines are opinionated software and the way to go if you just want to get something working. However, chances are that there's plenty of code bloat and indirection to support a use case that you don't care about. If you're the initiator of an order, and not a broker or an exchange who's responsible for maintaining FIX sessions to a wide user base, that's especially true. On top of everything else, good luck separating your business logic from the engine's API in a clean, zero copy fashion.

I'm in the process of designing a trading platform (many components of which I'll open source, so stay tuned) and as such I've had an opportunity to revisit past sins – the handling of FIX messages being one of them. I decided to build a very simple, buy-side-optimized FIX framework that separates network, parsing, and persistence concerns. It won't be as beginner friendly but it will put the developer back in control of things that matter: memory management, threading, message processing, and API isolation. Initial tests show that it's an order of magnitude lower latency than most of what's out there. That's not a fair comparison seeing as it offers much less for the general use case, but it suits my purposes. Also keep in mind that network hops are always the big, roughly fixed cost expense.

## Part 1: Parsing

Playing around with the lowest level concerns – message tokenization and checksum calculation – gave me a good excuse to try out AVX2, introduced as part of the Intel Haswell microarchitecture. AVX2 greatly expanded AVX integer instruction support and introduced many other floating point goodies as well. AVX gets another bump in 2015-2016 with the introduction of AVX-512. At present SSE instructions target 128 bit XMM registers while AVX uses 256 bit YMM registers. AVX-512 will introduce 512 bit ZMM registers, doubling Intel's superscalar capabilities once again.

Disclaimer: the code below is not well tested, it's not even close to what I use in production, and it will probably only build on Linux GCC > 4.7. Furthermore, running it on any processor that doesn't support AVX2 will merrily give you a SIGILL (illegal instruction) and kill your program. These benchmarks are quick and dirty. My test bench: Fedora 19 on a 15" late 2013 MacBook Pro (Haswell): Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz.

We'll start with tokenization. As a toy example, let's count the number of equal signs '=' and '\1' characters in a null terminated string (this is functionally equivalent to parsing a message using a visitor pattern). I used the following modified but real message for all of my benchmarks:

A canonical implementation looks something like:
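The original implementation isn't reproduced here; functionally it's equivalent to this scalar sketch, which walks the message one byte at a time (the name mirrors the parseNaive() mentioned below):

```cpp
#include <cassert>
#include <cstddef>

// Naive tokenization proxy: count '=' and SOH ('\x01') delimiters in a
// null-terminated FIX message, one byte per iteration.
std::size_t parseNaive(const char* msg) {
    std::size_t count = 0;
    for (; *msg != '\0'; ++msg)
        if (*msg == '=' || *msg == '\x01') ++count;
    return count;
}
```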

The first two implementations don't use any form of vectorization and serve as our baseline. As noted, a good compiler will effectively unroll the first implementation into the second, but explicit unrolling serves as a nice illustration of this common optimization. Compiling with "CFLAGS=-march=core-avx2 -O3 -funroll-loops -ftree-vectorizer-verbose=1" shows that none of our functions were vectorized and that the optimizer left our "hand unrolling" alone in parseUnrolled().

From the vectorization report we also see that the loop in parseNaive() was unrolled seven times. The compiler is sparing with this optimization as unrolling comes at a cost. Increased code size leads to potential performance issues (long jumps in a huge function can cause an instruction cache miss, which is really, really bad latency wise). Note that by default GCC looks for vectorization opportunities at the -O3 optimization level, or whenever -ftree-vectorize is set. However, because of its potential drawbacks, global unrolling isn't enabled at any optimization level by default. Setting -funroll-loops at -O2 and higher will ask GCC to treat all loops as candidates for unrolling.

The results weren't compelling, but the "best implementation" using AVX did offer a 5% speedup over the next runner-up: our hand-unrolled loop. Averaging across 10,000,000 iterations yields:

There's a good explanation for these lackluster results. On the SSE front, STTNI (String and Text New Instructions) instructions have a very high instruction latency. PCMPISTRI, emitted by _mm_cmpistri, takes eight cycles to return a result. The STTNI instruction set offers a rich set of operations via its control flag, but because our query is so basic the instruction's overhead isn't worth it. Worst of all, "needles" are very dense in our "haystack," so we end up duplicating vector loads. STTNI instructions perform very well on general use cases or more complicated queries, which is why functions such as strchr() and strlen() in glibc use them.

On the AVX front we use a simple bitmask comparison to find equal signs and SOH characters. That's great but we're left with a bitmask that we still have to iterate over in a serial fashion. I experimented with several approaches including iteration via LZCNT and a finer grained search than the 32-bit integer one used above. Everything that I tried, albeit not an exhaustive list, was a tie or marginally slower. The classic parallel stream compaction algorithm is, in theory what we want. However, I've yet to figure out an efficient way to reorder the data with vector shuffle operations. If anyone has an idea on this front I would love to hear from you.

## Part 2: Checksums

Given our disappointing results on the parsing front it's time for a win. Calculating a checksum is embarrassingly parallel (ignoring overflow, which we can for any practical FIX message) so it should lend itself to vectorization quite well. Let's try out a few implementations:
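The implementations themselves aren't shown above; the baseline is functionally just a byte sum mod 256, per this sketch (the function name is my own):

```cpp
#include <cassert>
#include <cstddef>

// Baseline FIX checksum: sum every byte of the message, mod 256.
unsigned naiveChecksum(const char* msg, std::size_t len) {
    unsigned sum = 0;
    for (std::size_t i = 0; i < len; ++i)
        sum += static_cast<unsigned char>(msg[i]);
    return sum & 0xFF;  // mod 256
}
```

The vectorized versions discussed below compute the same sum 32 bytes at a time.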

And the results:

The first thing we note is that GCC is pretty good at its job. With vectorization enabled we get code that's a dead tie with our first hand optimized implementation. Without vectorization enabled it's no contest. And this time around, we beat GCC's vectorized code by a factor of two.

First let's look at what GCC did to our baseline implementation. Picking through the disassembly reveals that the compiler did indeed use AVX. In short the function spends some time 16-byte aligning memory (unaligned loads to vector registers are slower, but on modern hardware there's very little difference sans pathological cases) before entering a loop where our vector is padded, sign extended, and added as packed double words on the upper and lower half of a YMM register (most AVX instructions treat a 256 bit YMM register as two independent 128 bit ones):

Our hand implemented version is easier to follow and translates directly from what we wrote so no surprises here:

Now for the fun part. In avxChecksumV1() we used the PHADDW instruction to quickly accomplish what we wanted – a sum across each 32 byte chunk of our FIX message. SIMD instructions are optimized to operate "vertically," meaning that operations such as $v_1 = (x_1, x_2)$, $v_2 = (y_1, y_2)$, $z = v_1 + v_2 = (x_1 + y_1, x_2 + y_2)$ are efficient, and horizontal operations such as a prefix sum are not. Almost all of the AVX/SSE add instructions have only one cycle of latency and execute as one micro operation. PHADDW requires 3-4 cycles and 1-2 $\mu$-ops depending on its operands. Eliminating it should pay dividends.

As noted in the comments for avxChecksumV2() we can get free unpacking via _mm256_madd_epi16, which emits the PMADDWD instruction (one cycle, two $\mu$-ops). Evidently GCC has a better understanding of what we're trying to do this time around as it unrolls the inner loop and reorders our instructions to minimize loads and better optimize register use:

We note that there's roughly a factor of five difference between the unvectorized naive function and our best implementation that uses six AVX instructions per loop. That sounds pretty reasonable as we're working with 32 characters at a time and each instruction has one cycle of latency, so by Little's Law our max speedup is $\frac{32}{6} \approx 5.3$.

## Closing Thoughts

Having worked through these examples it's easy to see why FPGAs and ASICs have an advantage in this space. Processing and validating tag/value pairs is a simple, highly parallel problem that lends itself well to a hardware implementation. Hundreds of thousands of operations can be carried out in parallel with virtually zero latency jitter. That does however come at a cost – as demonstrated above we can tokenize a message in ~110ns, which is roughly twice the time that it would take a CPU to read from main memory (or in this case an FPGA coprocessor over a DMA bus). Unless the FPGA/ASIC does application layer validations as well, having an external process or piece of hardware parse a message for the sake of handing you a packed structure probably isn't worth it. The hardware value add comes from deterministic networking and highly parallel risk checks.

FIX is about as simple as it gets so SSE/AVX has much more to offer when the use case is more complex. Furthermore, as noted before, the distance between tag/value pairs is small, meaning that we don't get the same sort of boost that we would expect when searching for sparse delimiters in structured text. Intel has a nice paper on XML schema validation via STTNI and I came across a good article on SSE UTF-8 processing when writing this. As a side note, for sufficiently long integers specializing atoi() via fused multiply-add instructions might pay dividends. That aside…man I could go for an efficient AVX optimized array compaction algorithm.