Day Five: Smith–Waterman for Subsequence Search and Alignment

TL/DR

Dynamic programming offers a time/space trade-off that makes certain otherwise computationally intractable problems tractable. For optimization, if you can pose your problem as a Bellman equation, you're in luck. One of the most famous applications of the Bellman Principle of Optimality, sequence alignment, has its roots in bioinformatics. However, the algorithm is equally useful for a host of alignment and similarity scoring tasks. The Biostrings package for R implements fast global and local sequence alignment algorithms that can be used for many things outside of biology.

Time vs. Space

It's hard to overstate the importance of dynamic programming – lots of very important problems wouldn't have computationally tractable solutions without it. At its heart dynamic programming relies on memoization (storing previously computed state) to accelerate a computation. A trivial example is the computation of the Fibonacci sequence using a recursive method versus an iterative one. Doing so recursively generates a call graph that takes the shape of a binary tree, with a number of leaves exponential in $n$. Each value is defined recursively by the two previous ones, so it's easy to visualize why this happens. We can linearize the call graph by allocating a little stack space and explicitly storing the two previous values, after which the iterative computation is very simple. Similarly, we could write the function in tail recursive form, after which a tail call optimizing compiler would generate an equivalent loop.
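
To make the contrast concrete, here's a minimal Python sketch of the two approaches (the function names are arbitrary):

```python
def fib_recursive(n):
    """Naive recursion: the call graph is a binary tree whose number of
    leaves grows exponentially in n, since each value recomputes both of
    its predecessors."""
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_iterative(n):
    """Iterative version: keep only the two previous values, which
    linearizes the call graph into a simple loop."""
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev


assert fib_recursive(20) == fib_iterative(20) == 6765
```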

In general, dynamic programming relies on finding a representation of the original problem such that subsequent outputs can be computed iteratively by reusing existing state. It's a classic example of a space-time complexity trade-off. The Bellman principle of optimality gives the conditions under which a solution assembled from optimal solutions to sub-problems constitutes a global optimum.

A Motivating Example

Sequence alignment is an important task in bioinformatics and has a wide range of uses. From an evolutionary standpoint, measuring how well conserved a particular subsequence is across generations can help identify shared lineage. The study of single nucleotide polymorphisms (SNPs) between individuals in a population helps us understand differences in susceptibility to disease. In proteomics, studying the extent to which a subsequence is conserved can identify regions of significance (we understand very little about protein folding, so looking for regions that nature tells us are important is a good first pass). All of these tasks are underpinned by a need for local or global sequence alignment.

Alignments are computed relative to some scoring function that penalizes mismatches and gaps while rewarding proper alignment. Imagine two sequences: "ABBBD" and "ABCD". We want to insert gaps until the two strings line up: aligning "ABBBD" against "AB-CD" or against "ABC-D" are both possibilities, each with one gap and one mismatch. How you choose to transform your sequence depends on your scoring function.

In global alignment, two sequences of similar length are aligned such that, after applying the optimal transformations, the two strings match in their entirety. In local alignment, we're only interested in regions of high similarity. The two tasks are fundamentally at odds with each other, but both admit solutions via dynamic programming. We'll focus on local alignment.

The Smith-Waterman Algorithm

For biological applications we're usually speaking in terms of the five nucleobases when it comes to DNA/RNA (the famous T, C, A, and G that make up DNA, and the U that replaces T in RNA), or the 22 protein-building amino acids. This restriction is completely arbitrary and we can use this technique to match any sequence that we want (geometric, spatial, or otherwise) provided that we can come up with a scoring metric to compare two individual points in the sequence.

The Wikipedia article here does a good job describing the actual algorithm. For any implementation we only need to decide on a weighting scheme that defines $w(a, -)$, $w(-, b)$, and $w(a, b)$. This gives us a lot of flexibility. The Wikipedia example used a simple scheme in which:

$$
\begin{aligned}
w(a, b \mid a = b) &= w(\text{match}) = 2 \\
w(a, b \mid a \neq b) &= w(\text{mismatch}) = -1 \\
w(a, -) &= -1 \\
w(-, b) &= -1
\end{aligned}
$$

but we're free to experiment and do as we please. Weights need not be symmetric and we could easily penalize mutation more or less than a gap.
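
As a sketch, such a scheme is just a two-argument function (the constants below mirror the Wikipedia example), and making the weights asymmetric is a one-line change:

```python
MATCH, MISMATCH, GAP = 2, -1, -1   # weights from the Wikipedia example

def w(a, b):
    """Pairwise score; a or b may be "-" to denote a gap."""
    if a == "-" or b == "-":
        return GAP                  # gap penalty (could differ per side)
    return MATCH if a == b else MISMATCH
```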

The example goes on to construct the $H$ matrix, which is defined recursively using the three values to its immediate left, top, and top left corner (the edge cells $(1, j)$ and $(i, 1)$ are defined as zero). We can fill the matrix in starting at $(2, 2)$ and working our way from top left to bottom right. By construction it's easy to see that if we didn't "memoize" these values we would have to calculate each matrix value recursively, leading to a call graph shaped like a ternary tree.

Assuming that we want the actual alignment and not just an alignment score, we find the sequence of insertions/deletions/mutations necessary to generate the optimal score via a backtracking algorithm. We start at the matrix max and move backward to the largest adjacent value (the cell whose value was used when computing the current cell's value). We repeat this procedure for as long as there are adjacent cells with non-zero values. Wikipedia doesn't address this, but assuming that the gap and mismatch penalties are symmetric, we should choose a consistent direction to move in when ties occur to avoid duplicate alignments. As an example, the two sequences TAG and TCG generate structurally identical alignments, TAG -> T-G and TCG -> T-G, with the same alignment score.
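
To make the fill-and-backtrack procedure concrete, here's a compact Python sketch (a toy, not the Biostrings implementation) with the weights passed as keyword arguments; ties in the backtrack are broken by always preferring the diagonal, in keeping with the consistent-direction advice above:

```python
def smith_waterman(s1, s2, match=2, mismatch=-1, gap=-1):
    """Minimal Smith-Waterman: fill the H matrix, then backtrack from the
    maximum to recover one optimal local alignment."""
    n, m = len(s1), len(s2)
    # Extra row/column of zeros so H[i][j] scores prefixes s1[:i] and s2[:j].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in s2
            left = H[i][j - 1] + gap  # gap in s1
            H[i][j] = max(0, diag, up, left)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Backtrack from the max until we hit a zero; prefer the diagonal on ties.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch):
            a1.append(s1[i - 1])
            a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(s1[i - 1])
            a2.append("-")
            i -= 1
        else:
            a1.append("-")
            a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("TGTTACGG", "GGTTGACTA"))
```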

This procedure differs from global alignment in that we don't always start at the bottom right corner – it's possible to start anywhere in the matrix. Also note that we can find the $k$ best alignments by backtracking from the $k$ highest scoring locations – starting from the max only gives us the "optimal" alignment.

Implementation

As noted before, the Biostrings package for R has lots of sequence alignment goodies. A quick search will reveal tools for just about any language that you want (including fast, commercial grade ones implemented in C/C++ to leverage SSE and graphics cards).


Day Four: Domain Specific Search and Sorting Algorithms

TL/DR

In the most general case it's impossible to beat an average time complexity of $O(n \log n)$ when sorting, or $O(\log n)$ when searching. However, for special (but common) use cases, we can do substantially better. Radix sort can sort integers in $O(k \cdot n)$ time, where $k$ is the length of the integer in digits. Interpolation search will find a key in $O(\log \log n)$ time on average, assuming that the search data is uniformly distributed.

Radix Sort

CS 101 usually covers the optimality of binary search, and by extension the log-linear lower bound on sorting a list. However, this bound only applies to comparison sorts, and there are many ways to sort without direct comparison. Doing so isn't "cheating" from an information theoretic standpoint, just as the average $O(1)$ look-up time for a hash map isn't at odds with the $O(\log n)$ lower bound on search. Hash maps have worst case $O(n)$ performance, so trading quick average look-ups for the guaranteed $O(\log n)$ bound provided by binary search doesn't change anything. The ability to "do better" given restrictions is a recurring theme in CS.

Radix sort works well for values with a limited number of digits – note the $O(k \cdot n)$ part. The worst case performance of radix sort matches its average case, which makes its runtime very predictable. Also note that we're not restricted to sorting numbers. ASCII maps English characters to integers in ascending order, so radix sort can be applied to text as well. You can probably find a similar mapping for just about any type of input that you would want to sort. It's a great alternative to traditional sort methods for many applications. However, there's always the break-even point where the number of digits $k > \log(n)$, after which comparison-based sorts have the advantage.
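
A minimal sketch of LSD radix sort on non-negative integers; each pass is a stable bucket sort on a single digit, so $k$ passes sort the whole list:

```python
def radix_sort(values, base=10):
    """LSD radix sort for non-negative integers: one stable bucket pass
    per digit, least significant digit first."""
    if not values:
        return []
    result = list(values)
    digit = 1
    while max(result) // digit > 0:
        buckets = [[] for _ in range(base)]
        for v in result:
            buckets[(v // digit) % base].append(v)
        result = [v for bucket in buckets for v in bucket]
        digit *= base
    return result


print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```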

Interpolation Search

You really need to understand your data before applying interpolation search since the worst case scenario is $O(n)$ (and much worse than that in reality as you'll end up doing lots of unnecessary branching). However, if you know that your data is reasonably uniform, interpolation search is a great way to go. The intuition is simple. If the values in a sorted array of length $n$ are roughly uniformly distributed between $a$ and $b$ and you're looking for a value $x$, a good place to start is the interpolated position $n \cdot \frac{x - a}{b - a}$. Interpolation search does this repeatedly, subdividing intervals and interpolating again. We can apply the same principles to an arbitrary distribution; the uniform distribution was chosen as it leads to a simple, canonical implementation with provable bounds. Here's a derivation of the time complexity for interpolation search.
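
And a sketch of interpolation search over a sorted list of roughly uniform integer keys, probing at the interpolated position described above:

```python
def interpolation_search(arr, x):
    """Interpolation search on a sorted list; assumes values are roughly
    uniformly distributed. Returns an index of x, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= x <= arr[hi]:
        if arr[hi] == arr[lo]:                 # single distinct value left
            break
        # Probe where x "should" sit if values are uniform on [arr[lo], arr[hi]].
        mid = lo + int((hi - lo) * (x - arr[lo]) / (arr[hi] - arr[lo]))
        if arr[mid] == x:
            return mid
        if arr[mid] < x:
            lo = mid + 1
        else:
            hi = mid - 1
    return lo if lo < len(arr) and arr[lo] == x else -1


data = list(range(0, 1000, 7))          # roughly uniform keys
print(interpolation_search(data, 343))  # 343 == 7 * 49, so index 49
```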


Day Three: Probabilistic Data Structures and Algorithms

TL/DR

Probabilistic data structures can deal with data sets too large to handle by conventional means, or offer massive speedups at the cost of some uncertainty. Skip lists make for nice lock-free priority queues (useful in schedulers and online sorting algorithms), Bloom filters are well suited for set membership queries over data sets of arbitrary size, and locality-sensitive hashing provides quick approximate nearest neighbor queries and dimensionality reduction. Probabilistic matrix algorithms can speed up matrix math by several orders of magnitude with tight error bounds.

Case One: Probabilistic Matrix Math

Matrix decomposition and factorization techniques such as SVD and QR play an important role in many problems, from optimization and online control to machine learning. The new kid on the block, rank-revealing QR factorization, provides a very efficient means of estimating matrix rank. All three benefit from quite recent work on probabilistic matrix algorithms. It's possible to significantly improve both parallelism and computational complexity while giving up very little in terms of accuracy. The paper above presents a method for reducing the complexity of finding the top $k$ components of an SVD from $O(mnk)$ to $O(mn\log(k))$ while admitting a natural parallel implementation; a rough sketch of the idea follows the notes below. Two side notes:

  1. Joel Tropp, one of the authors of the paper above and a former prof of mine, is a great guy to follow for this. He's a very good writer and his areas of expertise are quite interesting/relevant to data science: sparse approximation, compressed sensing, and randomized algorithms.
  2. Spectral theory and random matrices are pretty fascinating fields (especially with regard to questions such as detecting spurious correlation and reasoning about the distribution of eigenvalues).
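
As a very rough sketch of the randomized recipe (not the paper's exact algorithm; the dimensions, oversampling parameter, and plain Gaussian test matrix are arbitrary choices here): sample the range of the matrix with a random projection, orthonormalize, and run the dense SVD on the much smaller projected matrix.

```python
import numpy as np

def randomized_svd(A, k, oversample=10):
    """Rough sketch of a randomized truncated SVD: sample the range of A
    with a Gaussian test matrix, orthonormalize, then run a dense SVD on
    the much smaller projected matrix."""
    m, n = A.shape
    omega = np.random.randn(n, k + oversample)   # random test matrix
    Y = A @ omega                                # sample the range of A
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for the sample
    B = Q.T @ A                                  # small (k + p) x n matrix
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k, :]


# Low-rank test matrix: the relative approximation error should be tiny.
A = np.random.randn(500, 20) @ np.random.randn(20, 300)
U, s, Vt = randomized_svd(A, k=20)
print(np.linalg.norm(A - U @ np.diag(s) @ Vt) / np.linalg.norm(A))
```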

Case Two: Skip Lists

In many cases skip lists are a nice drop-in replacement for balanced trees when concurrent access is needed. Their basic operations (search, insert, and delete) have the same big $O$ performance characteristics, but their structure makes concurrent programming much simpler. They avoid the rebalancing issues that come up in alternatives such as red-black trees (rebalancing is also what makes efficient concurrent implementations difficult), and they're pretty cache friendly. As a bonus, range queries, k-th largest, and proximity queries are all $O(\log n)$.
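
Purely as a sketch of the data structure itself (a single-threaded toy rather than a lock-free implementation; the class names and level cap are arbitrary):

```python
import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level    # one forward pointer per level

class SkipList:
    """Minimal skip list with search and insert; node heights are geometric,
    so both operations are O(log n) in expectation."""
    MAX_LEVEL = 16

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)

    def _random_level(self):
        level = 1
        while random.random() < 0.5 and level < self.MAX_LEVEL:
            level += 1
        return level

    def search(self, key):
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
        node = node.forward[0]
        return node is not None and node.key == key

    def insert(self, key):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].key < key:
                node = node.forward[lvl]
            update[lvl] = node           # last node before the insert point
        new = SkipNode(key, self._random_level())
        for lvl in range(len(new.forward)):
            new.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new


sl = SkipList()
for k in [5, 1, 9, 3]:
    sl.insert(k)
print(sl.search(3), sl.search(4))   # True False
```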

Case Three: Bloom Filters

Bloom filters provide space efficient set membership queries. They work by using $k$ hash functions to determine $k$ indices in a boolean array. When an item is inserted, each of those $k$ positions is set to true. To query an item, each of the $k$ positions is checked. If all of the positions are set to true, the item is possibly in the set, but if any one of them is false, it's definitely not. This works well provided that the array doesn't become saturated with true values, and that the $k$ hash functions are independent and distribute the key space uniformly over the hash space. They're also simple enough to prove bounds on, which is useful for parameter selection.
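
A minimal sketch, using slices of a single SHA-256 digest as a stand-in for $k$ independent hash functions (the array size and hash count below are arbitrary):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash indices into a bit array. All k bits
    set means possibly present; any bit clear means definitely absent."""
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _indices(self, item):
        # Carve k 4-byte chunks out of one SHA-256 digest; this stands in
        # for k independent hash functions.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.num_bits

    def add(self, item):
        for idx in self._indices(item):
            self.bits[idx] = True

    def __contains__(self, item):
        return all(self.bits[idx] for idx in self._indices(item))


bf = BloomFilter()
bf.add("alice")
print("alice" in bf, "bob" in bf)   # True, and almost certainly False
```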

Case Four: Locality-Sensitive Hashing

Nearest neighbor queries (where distance can be computed under an arbitrary metric function – not necessarily a spatial one) are both very useful and very hard to do efficiently in high dimensional spaces. In practice, for 30+ dimensions, exact algorithms such as k-d trees are often slower than linear search once you take into account the cache effects and branch misprediction that come up in a real world implementation.

Locality-sensitive hashing does the opposite of what most hash functions try to do in that it aims to maximize the probability of a collision for keys in close proximity. It's an example of dimensionality reduction, and as such it gets used and abused for unsupervised machine learning. For queries, I've seen it perform poorly on some geometric/spatial queries and extremely well on others, so your mileage may vary.
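
As a small sketch of one classic construction, random-hyperplane hashing (SimHash) for cosine similarity, with arbitrary dimensions and bit count:

```python
import numpy as np

def simhash_signatures(X, num_bits=16, seed=0):
    """Random-hyperplane LSH for cosine similarity: each bit records which
    side of a random hyperplane a point falls on, so points separated by a
    small angle collide with high probability."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((X.shape[1], num_bits))
    return (X @ planes > 0).astype(int)

def to_bucket(signature):
    return "".join(map(str, signature))   # hash bucket key


rng = np.random.default_rng(1)
x = rng.standard_normal(64)
near = x + 0.01 * rng.standard_normal(64)   # small perturbation of x
far = rng.standard_normal(64)               # unrelated point

sigs = simhash_signatures(np.vstack([x, near, far]))
print(to_bucket(sigs[0]) == to_bucket(sigs[1]))  # very likely True
print(to_bucket(sigs[0]) == to_bucket(sigs[2]))  # very likely False
```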