Twelve Days 2013: Reed–Solomon Codes

The Twelve Days of Christmas…err…Algorithms

Ok, so the "real" twelve days of Christmas starts on the 25th of December, but I like the idea of starting on the 12th. Starting today and ending Christmas Eve I'll write about a different algorithm or data structure each day. These posts will be quick reads (except for this kickoff one) and range in complexity but they'll all follow three simple rules:

  1. I'm only writing about things that I've used myself and that I hope are general enough to benefit others.
  2. If you don't want commentary, every post will start with a TL/DR explaining the use case, with links to resources/implementations when applicable.
  3. They've gotta have a cool factor.

And with that, let's start day one…

Day One: Reed–Solomon Codes

TL/DR

Reed–Solomon codes are a fast, super useful form of error correction that's easily implemented in software. If you need to transmit data over a noisy channel (think UDP over WAN) or if you're writing a file system/highly reliable data store, they're your friend, especially for latency critical applications. There's a high quality C implementation in QFS (Quantcast File System), based on the original implementation in KFS.

A Brief Introduction to Error Correction

Without error correcting codes, storing data on a hard drive, making a phone call, listening to music, or watching a DVD would all fall somewhere on the spectrum of "impossible" to "much less efficient." Error correction is a beautiful application of pure math to real world problems. There's a huge range of error correction/detection schemes, but they fall into two broad categories: backward error correction, in which an error is detected and the data is re-requested, and forward error correction, in which data is sent with enough redundancy to detect an error and correct for it.

Fundamental limits dictate the maximum efficiency of any error correction scheme. Within those limits, different schemes have their pros, cons, and common use cases. One important consideration is the ease and efficiency of an implementation in hardware or software. Error correction is frequently applied in real time (think cellphone conversations or reading from a Blu-ray disc), so speed is often a primary concern on the encoding or decoding end, and sometimes both.

Reed–Solomon codes are a powerful and early example of error correction. They're still in widespread use as they perform well and are relatively fast even when implemented without special hardware. Modern alternatives such as Turbo Codes are closer to the Shannon Limit but their decoding process is complex and higher latency. For Reed–Solomon, encoding is a simple linear operation, and decoding can be made efficient.

The first time I saw Reed–Solomon codes the concept went right over my head. They were presented using way more abstract algebra than I understood at the time, and I missed the fact that they rely on pretty simple, elegant concepts. Yes, you do need Galois field theory and abstract algebra to prove results about the general case of Reed–Solomon, or to develop an efficient implementation. However, appreciating the beauty of how it works or experimenting with a special case requires only algebra and some faith in the underlying math. What follows is a toy example. Real implementations take a different approach mathematically for performance reasons, as standard CPUs weren't built to do efficient arithmetic over finite fields (dedicated hardware often handles it).

The Intuition

Let's forget about codes for a second and think back to geometry. A line is uniquely defined by two points. Giving someone a third point picked from that line doesn't convey any additional information, or change how you would draw that line. Now imagine that you want to tell your friend Sally two numbers over a bad phone connection. One approach is to ask her to draw a line passing through a series of points, and tell her that the slope and the intercept of that line are the two numbers that you want her to have. If you state three collinear (x, y) points and Sally doesn't hear one pair – not a problem, she can still draw a line identical to yours and recover the slope and intercept. Likewise, if she goes to draw the line and finds that the three points don't line up, she'll know that something went wrong.

When we describe our line we can add as much redundancy as we want by including extra points in our description. As long as our final message arrives with at least two of them, we're good to go. However, what happens if instead of not hearing us at all, our friend hears us incorrectly? Now we have a new problem – determining which points are garbage.

Sally knows what points are permissible – the two of you agreed on some rules ahead of time. Maybe you decided that you'll never tell her a negative coordinate, or that the x value of every coordinate will always be a power of two. If she hears a point that's not valid, or if the points don't line up, she'll know that there's a problem. She might ask you to repeat yourself: backward error correction. However, she doesn't want to waste your time, so if she has enough information to sort things out on her end she'll do it: forward error correction. Note that detecting an error in this scheme or correcting for it isn't always possible. There's always a chance that she hears all of the points incorrectly, but that they meet the agreed upon restrictions, and they still form a line. Sally won't know that there was a problem in this case.

Up until now it's not clear why you wouldn't just tell Sally the same set of coordinates over and over again instead of picking a unique third one: {(1, 2), (2, 1), (1, 2), (2, 1), …}. You certainly could, but if two things sound or look alike they're easier to confuse – try sorting out whether someone is saying "Nancy" or "Mancy" over a bad radio connection. Similarly, we want coordinates that are as visually and audibly distinct as possible, so sampling unique points that are far apart makes sense.

Let's start calling our coordinates codewords, and note that when we write them in binary the "distance" between them is called Hamming distance. Hamming distance measures the number of flipped bits required to turn one codeword into another. Codewords that are far apart are desirable, as when they get garbled they're easier to tell apart. For example, if our codewords were 1111 and 0000 and we receive 1011, it's much more plausible that one bit in 1111 flipped to produce 1011 than three in 0000. If we had a little extra information we could say for sure.
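To make that concrete, here's a tiny R sketch (the hamming helper is just for illustration, not part of the encoder below) that counts the differing bits between two codewords written as bit vectors:

# Hamming distance: number of positions at which two bit vectors differ
hamming <- function(a, b) sum(a != b)

hamming(c(1, 1, 1, 1), c(1, 0, 1, 1))   # 1: one flip turns 1111 into 1011
hamming(c(0, 0, 0, 0), c(1, 0, 1, 1))   # 3: three flips turn 0000 into 1011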

The line is a good start, but can we do better? We want something that:

  • Is a unique, one-to-one mapping, meaning that we can uniquely recover any message from its code.
  • Lets us use as many codewords as we want.
  • Provides a consistency check (in the case of our line, the fact that valid points fall on the line is a consistency check).

What about a polynomial? A line is a polynomial of degree one, and just as a line can fit any two data points, a polynomial of degree n can fit any n + 1 data points. Is it unique? Yup, there's a simple proof showing that any polynomial P(x) of degree n passing through n + 1 data points is unique – no other polynomial of degree n can pass through all of those same points unless it's identical to P(x). Can we use it to construct a one-to-one mapping scheme? Not necessarily, but maybe we can with some restrictions. What about the codewords? Well, we're free to pick the coefficients of our polynomial, so we can use those as codewords, just as we did slope and intercept on a line. What about the consistency check? By the uniqueness property above, oversampling from a polynomial is no different from oversampling from a line – there's only one possible fit, so given a fixed amount of redundancy we'll be able to detect a fixed number of errors.
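Here's a quick R sketch of that oversampling idea (plain real-number arithmetic, no modular math yet, and the variable names are just for illustration): sample a degree-two polynomial at five points, then recover the same coefficients from any three of them.

# Oversample a degree-2 polynomial: five samples, only three are needed to pin it down
coefs <- c(2, 3, 4)                       # 2 + 3t + 4t^2, lowest degree first
t <- 0:4
y <- coefs[1] + coefs[2] * t + coefs[3] * t^2

idx <- c(1, 3, 5)                         # pick any three of the five samples
V <- cbind(1, t[idx], t[idx]^2)           # Vandermonde matrix for those points
solve(V, y[idx])                          # recovers 2, 3, 4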

Reed–Solomon codes are, in a certain light, a formula for how to sample from a polynomial and use those samples to interpolate the original polynomial. The encoding procedure is a simple linear operation but there are lots of ways to do the decoding, and that's where things get confusing/mathy. Following the Reed-Solomon methodology gives you provable bounds on the maximum number of errors that can be detected or automatically repaired. As noted previously, and as with any coding scheme, there's always a probability of failure in which we 1) detect more damage than we can recover, or 2) end up with something that's internally consistent but incorrect. It's impossible to eliminate this probability, but we can make it arbitrarily small depending on how much duplication we're willing to tolerate (more duplication = less information density).

The procedure that follows is, at the end of the day, a glorified means of interpolating a family of polynomials. We can tolerate a certain number of errors and still pick the correct polynomial out of the family because (assuming there are at most the threshold number of errors) consistency arguments across different combinations of points rule out all curves but one, providing a unique solution. The restrictions placed on the codewords and the need for some strange arithmetic (adding and inverting numbers modulo a prime) aren't germane to the bigger picture. They're vestiges of our need to provide provable guarantees on the codes, make them one-to-one, and make decoding efficient.

The Procedure

For simplicity I'm repeating the example found here with code and a step-by-step approach. If you're interested in the field theory aspect I included a very brief "why fields" motivator below the fold. These are some notes on practical implementation, and here's the Quantcast one. The math (as is) only works if we restrict ourselves to integer codewords 0 ≤ c < p, where p is prime. I'm using p = 7 (which is why everything is evaluated modulo 7), but you're free to use a larger prime p and adjust accordingly.
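As a quick aside (this check isn't part of the procedure, just a sketch of why the modulus has to be prime): modulo a prime, every non-zero codeword has a multiplicative inverse, which is what lets us "divide" later on; modulo a composite like 6, some don't.

# Modulo a prime every non-zero residue has a multiplicative inverse; not so modulo a composite
sapply(1:6, function(a) any((a * (1:6)) %% 7 == 1))   # mod 7: all TRUE
sapply(1:5, function(a) any((a * (1:5)) %% 6 == 1))   # mod 6: FALSE for 2, 3, and 4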

A few definitions:

$$
\begin{aligned}
m &= \text{number of message symbols} \\
n &= \text{number of code symbols} \\
e &= \left\lfloor \frac{n - m}{2} \right\rfloor = \text{max recoverable errors} \\
C_k(x) &= x(x - 1) \cdots (x - k + 1) / k! \\
P(t) &= x_m t^{m - 1} + x_{m - 1} t^{m - 2} + \cdots + x_2 t + x_1 \\
\vec{P} &= [P(0)\ P(1)\ \ldots\ P(n - 1)] = \text{transmitted} \\
\vec{R} &= [R(0)\ R(1)\ \ldots\ R(n - 1)] = \text{received}
\end{aligned}
$$

For a sequence a_1, a_2, … define the first difference as Δ_i = a_{i+1} − a_i. The second difference is the first difference of the first difference, etc. Define a^{(j)} as the sequence 0^j R_0, 1^j R_1, …, (n − 1)^j R_{n−1}. Let B be an e × (e + 1) matrix whose elements are defined by:

$$
b_{ij} = \Delta^{m + e} a_i^{(j)}
$$

i.e., b_{ij} is the i-th element of the (m + e)-th difference of the sequence a^{(j)}.

We'll take m = 3 (three symbols in our message), n = 7 (seven symbols in our encoded, redundant message), and note that by our formula for e we can correct up to e = 2 errors. For a message we'll use (2, 3, 4), which in turn yields the polynomial P(t) = 4t^2 + 3t + 2. We evaluate it at the points 0, …, 6 because n = 7, so we need to sample seven points:

reed_solomon.R
m.vec <- c(2, 3, 4)                  # message symbols, giving P(t) = 4t^2 + 3t + 2
r.vec <- c(2, 2, 6, 5, 3, 5, 3)      # received vector (used below), containing two errors

# Horner's rule: evaluate a polynomial with coefficients coef (highest degree first) at x
polyval <- function(coef, x) {
	n <- length(coef)
	y <- x * 0 + coef[1]
	for (i in 2:n) {
		y <- coef[i] + x * y
	}
	return(y)
}

# Encode: sample P(t) at t = 0, ..., 6, working modulo 7
p.vec <- polyval(rev(m.vec), 0:6) %% 7

We now have that p.vec = (2, 2, 3, 5, 1, 5, 3). Suppose that we receive r.vec = (2, 2, 6, 5, 3, 5, 3), which contains two errors. We compute a matrix B, as defined above.

reed_solomon.R
# n-th forward difference of a sequence (n = 0 returns the sequence unchanged)
delta <- function(x, n) {
	first.diff <- function(x) {
		len <- length(x)
		if (len > 1) x[2:len] - x[1:(len - 1)] else 0
	}
	for (i in seq_len(n)) {   # seq_len(0) is empty, so delta(x, 0) == x
		x <- first.diff(x)
	}
	return(x)
}

# The sequence a^(j): 0^j * R_0, 1^j * R_1, ..., (n - 1)^j * R_(n-1)
rj.vec <- function(r, j) {
	len <- length(r)
	(0:(len - 1))^j * r
}

a0 <- rj.vec(r.vec, 0)
a1 <- rj.vec(r.vec, 1)
a2 <- rj.vec(r.vec, 2)

# Columns are the (m + e) = 5th differences of a^(0), a^(1), a^(2), modulo 7
b <- cbind(delta(a0, 5), delta(a1, 5), delta(a2, 5)) %% 7

Thus:

$$
b = \left[ \begin{array}{ccc} 2 & 5 & 0 \\ 0 & 5 & 2 \end{array} \right]
$$

We need to solve a system of linear equations (modulo 7) with b as our coefficient matrix:

$$
\left[ \begin{array}{ccc} 2 & 5 & 0 \\ 0 & 5 & 2 \end{array} \right]
\left[ \begin{array}{c} v_0 \\ v_1 \\ v_2 \end{array} \right]
=
\left[ \begin{array}{c} 0 \\ 0 \end{array} \right]
$$

This one is easy as v = (1, 1, 1) works, but in general solving a system of linear equations modulo some integer is pretty annoying, and it only behaves nicely when said integer is prime. As much fun as working through that would be…let's use Wolfram Alpha's solver instead. The solution that we want is the minimal non-zero one on the second line of output: v = (1, 1, 1).
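If you'd rather stay in R than paste the system into Wolfram Alpha, the field is small enough that a brute-force search is fine as a sketch (rebuilding b here so the snippet stands alone):

b <- matrix(c(2, 5, 0,
              0, 5, 2), nrow = 2, byrow = TRUE)     # same b as above
v.grid <- expand.grid(v0 = 0:6, v1 = 0:6, v2 = 0:6) # all 7^3 = 343 candidate vectors
in.null <- apply(v.grid, 1, function(v) all((b %*% v) %% 7 == 0))
head(v.grid[in.null & rowSums(v.grid) > 0, ], 1)    # first non-zero solution: v = (1, 1, 1)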

Reed–Solomon hinges on an equation called the key equation, a theorem stating that there exist polynomials E(t) of degree ≤ e and Q(t) of degree ≤ m + e − 1 such that Q(i) = R_i E(i) for all i = 0, 1, …, n − 1. There's a proof guaranteeing that it has non-zero solutions, and the system that we just solved gave us the coefficients of E(t), so we're halfway there. There's another result that allows us to express a polynomial as a series of successive differences (see the paper that I took this example from if you're interested). The end result is that, using C_k and Δ as we previously defined them, we can write any degree d polynomial as:

$$
f(x) = f(0)C_0(x) + \Delta f(0)\, C_1(x) + \cdots + \Delta^d f(0)\, C_d(x)
$$
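As a quick sanity check of that identity over the plain integers (base R's choose(x, k) plays the role of C_k(x) here), take f(x) = x^2:

# Newton's forward-difference form of f(x) = x^2, rebuilt from the differences at 0
f <- function(x) x^2
x <- 0:5
d1 <- diff(f(x))[1]                     # first difference at 0: 1
d2 <- diff(f(x), differences = 2)[1]    # second difference at 0: 2
f(0) * choose(x, 0) + d1 * choose(x, 1) + d2 * choose(x, 2)   # reproduces 0 1 4 9 16 25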

We'll use this to reconstruct Q. To do so we need the first element of the j-th difference of the sequence Q(i) = R_i E(i), for j = 0, …, m + 1.

reed_solomon.R
e.coef <- c(1, 1, 1)                            # E(t) = t^2 + t + 1, from v = (1, 1, 1)
qi.seq <- (polyval(e.coef, 0:6) * r.vec) %% 7   # Q(i) = R_i * E(i) for i = 0, ..., 6

# First element of the j-th difference of the Q(i) sequence, for j = 0, ..., 4
diff.vec <- sapply(0:4, function(j) delta(qi.seq, j)[1]) %% 7

This gives us diff.vec = (2, 4, 4, 4, 5). We can now write:

$$
Q(t) = 2C_0(t) + 4C_1(t) + 4C_2(t) + 4C_3(t) + 5C_4(t)
$$

Those C_i polynomials have a factor of 1/i! in them, which would lead you to believe that they're fractional. However, because we're doing our arithmetic modulo p, we're actually dealing with a modular multiplicative inverse. We can expand the polynomial above as we would any polynomial and work out the congruence on each coefficient by hand (for t^4 that's tantamount to solving the congruence 24x ≡ 5 (mod 7)). Or we can take the lazy approach again: PolynomialMod.
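Here's what that one congruence looks like in R, as a sketch (brute-forcing the inverse rather than using the extended Euclidean algorithm, since p = 7 is tiny):

# The t^4 coefficient of Q(t) is 5 / 4!; modulo 7 that means solving 24 x = 5 (mod 7)
inv.24 <- which((24 * (1:6)) %% 7 == 1)   # modular inverse of 24 (= 4!) mod 7, which is 5
(inv.24 * 5) %% 7                         # x = 4: the t^4 coefficient of Q(t)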

Doing so gives us 4t^4 + 2t^2 + 5t + 2. If we ask Wolfram to factor this for us, mod 7, we get 4(t + 3)(t + 5)(t^2 + 6t + 4). Doing the same for E(t) gives E(t) = (t + 3)(t + 5). As we can see, Q(t) is the product of E(t) and something else. That something else is (expanding modulo 7): 4t^2 + 3t + 2. Look familiar? That's our original message, derived from the damaged one that we received.
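The factoring step is where Wolfram earns its keep, but the final division Q(t) / E(t) can also be done directly over GF(7). Here's a minimal sketch (poly.div.mod is a made-up helper, not something from this post's code or any library), using the same highest-degree-first coefficient convention as polyval above:

# Long division of polynomials over GF(p); coefficients are highest degree first
poly.div.mod <- function(num, den, p) {
	num <- num %% p
	den <- den %% p
	inv.lead <- which((den[1] * (1:(p - 1))) %% p == 1)   # inverse of den's leading coefficient
	quot <- numeric(length(num) - length(den) + 1)
	for (i in seq_along(quot)) {
		quot[i] <- (num[i] * inv.lead) %% p
		span <- i:(i + length(den) - 1)
		num[span] <- (num[span] - quot[i] * den) %% p
	}
	list(quotient = quot, remainder = num[(length(quot) + 1):length(num)])
}

q.coef <- c(4, 0, 2, 5, 2)                     # Q(t) = 4t^4 + 2t^2 + 5t + 2
poly.div.mod(q.coef, c(1, 1, 1), 7)$quotient   # 4 3 2, i.e. 4t^2 + 3t + 2: the message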

Having walked through this it's easy to see why this wouldn't lend itself well to practical use as is. We needed a computational algebra system to do this, and factoring polynomials over finite fields as we did to see our result isn't fun. As such, practical implementations take a different view of the math and won't touch any algebra directly.

May your 2014 be slightly less error prone.


Optional Deeper Dive

Polynomials live in a function space, and {1, t, t^2, t^3, …} forms a basis for it. A basis is the smallest linearly independent set of vectors (meaning that no one vector can be produced by any combination of the others) needed to represent any point in a vector space. Just as {(1, 0, 0), (0, 1, 0), (0, 0, 1)} forms a basis for three dimensional space in (x, y, z) coordinates, the basis above can represent any polynomial as a linear combination of basis vectors.

Polynomials can, generally speaking, have any type of coefficient. However, the type of polynomial that we were using had integer coefficients less than a prime number p: a ∈ {0, 1, …, p − 1}. More formally:

$$
a \in \mathbb{Z}_p = \text{the set of all congruence classes modulo } p
$$

These integers form a field, and we get some neat properties because of this. Our codewords are the coefficients of these polynomials – elements of the field. We chose the field that we did for our coefficients out of convenience, but in practice fields with 2^8 = 256 elements are used so that every byte has a representation as a single codeword. Fields can be finite or infinite; ours is finite. Finite fields are called Galois fields, and Galois theory makes a lot of modern day electronics possible.

Fields are defined axiomatically, and one of the axioms, closure, means that the sum or product of any two elements in a field is still in that field. For the reals, the intuition is that you can never add or multiply two real numbers and get something that isn't a real number. The intuition is somewhat different for finite fields. For example, the smallest possible field, GF(2), has two elements. It contains the elements 0 and 1, and operations on elements in the field are carried out modulo 2. We can see that it's closed under addition by checking all combinations: 0 + 0 = 0, 0 + 1 = 1, 1 + 0 = 1, 1 + 1 = 0 (mod 2). If we didn't carry out operations modulo 2 we would have 1 + 1 = 2, which is not in the field. It's easy to check that the other field axioms hold for GF(2), and we can do the same for the field that we defined above.
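That exhaustive check is a couple of lines of R, as a sketch:

# Addition and multiplication tables for GF(2): every entry stays in {0, 1}
elems <- 0:1
outer(elems, elems, function(a, b) (a + b) %% 2)   # addition (XOR) table
outer(elems, elems, function(a, b) (a * b) %% 2)   # multiplication (AND) table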

Field extensions of GF(2) such as GF(2^8) admit a nice representation of codewords as bytes. There's a powerful result guaranteeing that there exists a finite field with p^d elements for every prime p and every positive integer d. Working with fields makes it a lot easier to reason about certain types of algorithms, and as such they play an important role in both coding theory and cryptography.

In short, why do we care? Our ability to manipulate polynomials as we did and treat them as we would treat ordinary numbers, carrying out operations such as modular arithmetic and long division, hinges on these facts. Generalized Reed–Solomon actually takes this a step further and uses codes over rings of matrices (our coefficients are over a field, so certain sets of polynomials will form a polynomial ring). Aside from that, working with Z_p is required to prove an identity that gives the weight of a Reed–Solomon code, and by extension the maximum number of errors that it can detect/correct. Last but not least, doing all of this in an efficient manner (and not like we did above) requires lots of tricks grounded in abstract algebra and finite field theory.
