LASSO regression (least absolute shrinkage and selection operator) is a modified form of least squares regression that penalizes model complexity via a regularization parameter. It does so by including a term proportional to $\|\beta\|_1$ in the objective function, which shrinks coefficients towards zero and can even eliminate them entirely. In that light, LASSO is a form of feature selection/dimensionality reduction. Unlike other forms of regularization such as ridge regression, LASSO will actually eliminate predictors. It’s a simple, useful technique that performs quite well on many data sets.
Regularization refers to the process of adding additional constraints to a problem to avoid overfitting. ML techniques such as neural networks can generate models of arbitrary complexity that will fit in-sample data one-for-one. As we recently saw in the post on Reed–Solomon FEC codes, the same applies to regression. We definitely have a problem anytime there are more regressors than data points, but any excessively complex model will generalize horribly and do you no good out of sample.
There’s a litany of regularization techniques for regression, ranging from heuristic, hands-on ones like stepwise regression to full-blown dimensionality reduction. They all have their place, but I like LASSO because it works very well, and it’s simpler than most dimensionality reduction/ML techniques. And, despite being a nonlinear method, as of 2008 it has a relatively efficient solution via coordinate descent. We can solve the optimization in $O(n \cdot p)$ time, where $n$ is the length of the data set and $p$ is the number of regressors.
Our objective function has the form:

$$\hat{\beta} = \operatorname*{argmin}_{\beta} \left\{ \sum_{i=1}^{n} \left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$
where $\lambda \geq 0$. The first half of the equation is just the standard objective function for least squares regression. The second half penalizes regression coefficients under the $l_1$ norm. The parameter $\lambda$ determines how important the penalty on coefficient weights is.
There are two R packages that I know of for LASSO: lars (short for least angle regression – a superset of LASSO) and glmnet. glmnet includes solvers for more general models (including the elastic net – a hybrid of LASSO and ridge that can handle categorical variables). lars is simpler to work with, but the documentation isn’t great.
Here’s a simple example using data from the lars package. We’ll follow a common heuristic that recommends choosing $\lambda$ one standard error of MSE away from the minimum. Personally I prefer to examine the CV L-curve and pick a value right on the elbow, but this works.
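A minimal sketch of that workflow, using the diabetes data that ships with lars (the seed and the one-SE bookkeeping are assumptions on my part):

```r
library(lars)

# The diabetes data ships with lars: x is a 442x10 model matrix,
# y is a numeric response.
data(diabetes)
x <- diabetes$x
y <- diabetes$y

# Fit the full LASSO path via least angle regression.
fit <- lars(x, y, type = "lasso")

# 10-fold cross-validation over the path, indexed by the fraction
# of the final L1 norm.
set.seed(42)
cv <- cv.lars(x, y, K = 10, type = "lasso")

# One-SE heuristic: the sparsest point on the path whose CV error
# is within one standard error of the minimum.
limit  <- min(cv$cv) + cv$cv.error[which.min(cv$cv)]
s.best <- cv$index[min(which(cv$cv < limit))]

# Coefficients at the chosen point; exact zeros are the predictors
# that LASSO eliminated.
coef(fit, s = s.best, mode = "fraction")
```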

LASSO is a biased, linear estimator whose bias increases with $\lambda$. It’s not meant to provide the “best” fit as Gauss–Markov defines it – LASSO aims to find models that generalize well. Feature selection is a hard problem, and the best that we can do is a combination of common sense and model inference. However, no technique will save you from the worst case scenario: two very highly correlated variables, one of which is a good predictor, the other of which is spurious. It’s a crapshoot as to which predictor a feature selection algorithm will penalize in that case. LASSO has a few technical issues as well. Omitted variable bias is still an issue as it is in other forms of regression, and because of its nonlinear solution, LASSO isn’t invariant under transformations of the original data matrix.
Sensor fusion is a generic term for techniques that address the issue of combining multiple noisy estimates of state in an optimal fashion. There’s a straightforward view of it as the gain on a Kalman–Bucy filter, and an even simpler interpretation under the central limit theorem.
Control theory is one of my favorite fields with a ton of applications. As the saying goes, “if all you have is a hammer, everything looks like a nail,” and I’m always looking for ways to pose a problem as a state space and use the tools of control theory. Control theory gets you everything from cruise control and autopilot to the optimal means of executing an order under some set of volatility and market impact assumptions. The word “sensor” is general and can mean anything that produces a time series of values – it need not be a physical one like a GPS or LIDAR, but it certainly can be.
Estimating state is a pillar of control theory; before you can apply any sort of control feedback you need to know both what your system is currently doing and what you want it to be doing. What you want it to do is a hard problem in and of itself as the what requires you to figure out an optimal action given your current state, the cost of applying the control, and some (potentially infinite) time horizon. The currently doing part isn’t a picnic either as you’ll usually have to figure out “where you are” given a set of noisy measurements past and present; that’s the problem of state estimation.
The Kalman filter is one of many approaches to state estimation, and the optimal one under some pretty strict and (usually) unrealistic assumptions (the model matches the system perfectly, all noise is stationary IID Gaussian, and the noise covariance matrix is known a priori). That said, the Kalman filter still performs well enough to enjoy widespread use, and alternatives such as particle filters are computationally intensive and have their own issues.
A while back I discussed the geometric interpretation of signal extraction in which we addressed a similar problem. Assume that we have two processes generating normally distributed IID random values, $X = (\mu_1, \sigma_1)$ and $Y = (\mu_2, \sigma_2)$. We can only observe $Z = X + Y$, but what we want is $X$, so the best that we can do is $\E[X \mid Z = c]$. As it turns out, the solution has a pretty slick interpretation under the geometry of linear regression. Sensor fusion addresses a more general problem: given a set of measurements from multiple sensors, each one of them noisy, what’s the best way to produce a unified estimate of state? The sensor noise might be correlated and/or time varying, and each sensor might provide a biased estimate of the true state. Good times.
Viewing each sensor independently brings us back to the conditional expectation that we found before (assuming that the sensor has normally distributed noise of constant variance). If we know the sensor noise a priori (the manufacturer tells us that $\sigma = 1m$ on a GPS, for example) it’s easy to compute $\E[X \mid Z = c]$, where $X$ is our true state, $Y$ is the sensor noise, and $Z$ is what we get to observe. In this context it’s easy to see that we could probably just appeal to the central limit theorem, average across the state estimates using an inverse variance weighting, and call it a day. Given that we have a more detailed knowledge of the process and measurement model, can we do better?
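For $k$ independent, unbiased readings $z_1, \ldots, z_k$ of the same quantity, that weighting is the standard one:

$$\hat{x} = \frac{\sum_{i=1}^{k} z_i / \sigma_i^2}{\sum_{i=1}^{k} 1 / \sigma_i^2}, \qquad \operatorname{Var}(\hat{x}) = \left( \sum_{i=1}^{k} \frac{1}{\sigma_i^2} \right)^{-1}$$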
Let’s consider the problem of modeling a Gaussian process with $\mu = 100$ and $\sigma = 2$. We have three sensors with $\sigma_1 = .6$, $\sigma_2 = .7$, and $\sigma_3 = .8$. Sensor one has a correlation of $r_{12} = .3$ with sensor two, a correlation of $r_{13} = .1$ with sensor three, and sensor two has a correlation of $r_{23} = .1$ with sensor three. Assume that they have a bias of .1, .2, and 0, respectively.
Our process and measurement models are $\dot{x} = Ax + Bu + w$ with $w \sim N(0, Q)$ and $y = Cx + v$ with $v \sim N(0, R)$, respectively. For our simple Gaussian process the state is scalar, there’s no control input, each sensor observes the state directly, and the off-diagonal entries of $R$ are $r_{ij} \sigma_i \sigma_j$. That gives:

$$C = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \qquad Q = \sigma^2 = 4, \qquad R = \begin{bmatrix} .36 & .126 & .048 \\ .126 & .49 & .056 \\ .048 & .056 & .64 \end{bmatrix}$$
From there we can use the dse package in R to compute our Kalman gain and state estimate via sensor fusion. In many cases we would need to estimate the parameters of our model. That’s a separate problem known as system identification, and there are several R packages (dse included) that help with this. Since we’re simulating data and working with known parameters, we’ll skip that step.
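The post used dse; as a self-contained stand-in, here’s a minimal base-R sketch of the same fusion. Because each state draw is independent, the prior at every step is $N(\mu, \sigma^2)$ and the Kalman gain is constant (the seed and sample size are assumptions):

```r
library(MASS)  # mvrnorm, for correlated sensor noise

set.seed(7)
n     <- 1000
mu    <- 100; sigma <- 2    # process: IID N(mu, sigma^2)
s     <- c(.6, .7, .8)      # sensor noise standard deviations
bias  <- c(.1, .2, 0)       # sensor biases

# Sensor noise covariance from the pairwise correlations.
R <- diag(s^2)
R[1, 2] <- R[2, 1] <- .3 * s[1] * s[2]
R[1, 3] <- R[3, 1] <- .1 * s[1] * s[3]
R[2, 3] <- R[3, 2] <- .1 * s[2] * s[3]

C <- matrix(1, nrow = 3, ncol = 1)  # every sensor observes the state directly

# Simulate the true state and the three noisy, biased sensors.
x <- rnorm(n, mu, sigma)
y <- t(sapply(x, function(xt) xt + bias + mvrnorm(1, rep(0, 3), R)))

# Fuse with the Kalman gain K = P C' (C P C' + R)^{-1}, where the
# prior covariance P is just sigma^2 here. The filter is unaware of
# the sensor biases.
P    <- sigma^2
K    <- P * t(C) %*% solve(C %*% t(C) * P + R)   # 1 x 3 gain
xhat <- mu + as.vector((y - mu) %*% t(K))

# RMSD of each raw sensor and of the fused estimate.
rmsd <- function(e) sqrt(mean(e^2))
apply(y, 2, function(col) rmsd(col - x))  # ~ the individual sensor sigmas
rmsd(xhat - x)                            # smaller than any single sensor
```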

Plotting the first 50 data points gives:
It’s a little hard to tell what’s going on but you can probably squint and see that the fusion sensor is tracking the best, and that sensor three (the highest variance one) is tracking the worst. Computing the RMSD gives:
Note that the individual sensors have an RMSD almost identical to their measurement error. That’s exactly what we would expect. And, as we expected, the sensor fusion estimate does better than any of the individual ones. Because our sensor errors were positively correlated we made things harder on ourselves. Rerunning the simulation without correlation consistently gives a Kalman RMSD of $\approx .40$. How did the bias impact our simulation? Calculating $\text{bias} = \overline{y - \hat{y}}$ gives:
The Kalman filter was able to significantly overcome the bias in sensors one and two while still reducing variance. I specifically chose the bias-free sensor as the one with the most variance to make things as hard as possible. This helps to illustrate one very cool property of Kalman sensor fusion – the ability to capitalize on the bias–variance tradeoff and mix biased estimates with unbiased ones.
In a country in which people only want boys, every family continues to have children until they have a boy. If they have a girl, they have another child. If they have a boy, they stop. What is the proportion of boys to girls in the country?
It seems reasonable that such a country would succeed in their goal of skewing the population demographics. However, Google’s solution to the brainteaser goes on to justify how the population will still end up 50% male/50% female.
Interestingly enough, Google’s “official” solution of 50%/50% is incorrect, depending on how you interpret their wording. Assuming that they’re asking for:

$$\E\left[\frac{n}{X(n)}\right]$$

(what’s the expected ratio of boys to girls?) there’s a problem with their reasoning. That’s not the only twist. While for a large enough population their answer is very close to correct, for any one family the expected percentage of boys is closer to 70%.
The crux of the issue stems from an invalid application of the expectation operator. We’re interested in the random variable $R(n) = \frac{n}{X(n)}$, where $n$ is fixed and $X(n)$ is itself a random variable: the total number of children in the population. Note that because each family has exactly one boy, $n$ is both the number of families and the number of boys. If we assume that boys and girls are born with $p = .5$, the expected number of children that any one family will have before producing a boy (inclusive) is given by the geometric distribution with $p = .5$:

$$\E[X(1)] = \frac{1}{p} = 2$$
From there, it seems reasonable to assume (as the Google argument does) that:

$$\E\left[\frac{n}{X(n)}\right] = \frac{n}{\E[X(n)]} = \frac{n}{2n} = \frac{1}{2}$$
However, expectation only commutes with linear operators, so the equality above does not hold. Taking things one step further, we can find a bound showing that the ratio is greater than 1/2 for all finite populations. Jensen’s inequality (the Swiss Army Knife of mathematical sanity checks) gives that for a non-degenerate probability distribution and a strictly convex function $\varphi$, $\E[\varphi(X)] > \varphi(\E[X])$. Letting $\varphi(x) = \frac{n}{x}$ gives:

$$\E\left[\frac{n}{X(n)}\right] > \frac{n}{\E[X(n)]} = \frac{1}{2}$$
One of the most interesting things to come out of this analysis is the observation that the expected ratio of boys to girls in one family is a biased estimator of the population mean. To understand why, remember that 50% of the time we’ll observe a family with one child (a boy, making for a ratio of 100% boys), and 50% of the time we’ll observe a family with at least one girl. Working with ratios instead of sums underweights the contribution coming from families with at least one girl. Individual families will, on average, produce a ratio of boys to girls close to 70%. However, as families can have at most one boy and potentially many girls, the population ratio will approach 50% from above.
We can calculate the single-family distribution empirically or explicitly. Here’s what it looks like:
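A minimal simulation along those lines (the seed, sample size, and plotting details are assumptions):

```r
set.seed(42)
n.families <- 100000

# Family size: children are born until the first boy, so the count
# follows a geometric distribution (support 1, 2, ...) with p = .5.
children <- rgeom(n.families, prob = .5) + 1

# Each family has exactly one boy; the rest are girls.
ratio <- 1 / children

hist(ratio, breaks = 50, xlab = "Proportion of boys in one family",
     main = "Single-family distribution")
abline(v = mean(ratio), col = "red", lty = 2)
mean(ratio)  # ~ log(2) = .693
```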

The red dashed line denotes the mean value – .69 for my run of the simulation. Using the same set of sample data and treating it as one simulated population of 100,000 instead of 100,000 simulated populations of one family:
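Reusing the `children` vector from the sketch above:

```r
# One population of 100,000 families: n boys over the total number
# of children, rather than an average of per-family ratios.
cat(sprintf("Population mean: %.2f\n", n.families / sum(children)))
```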

gives “Population mean: 0.50.” To see why the population will tend towards 50% we’ll need to appeal to the Central Limit Theorem (CLT). For a rigorous explanation of the math see the excellent post by Ben Golub here. In short, by the CLT, as the number of families $n$ becomes large, the total number of children in the population $X(n)$ will tend towards $X(n) \approx \E[X(1)]n = 2n$. We’ll have $n$ boys for our $\approx 2n$ children, leading to a ratio of $\approx \frac{1}{2}$ with tight error bounds given by the CLT.
The applicability of the CLT depends, loosely speaking, on “how poorly behaved” the distribution that you’re sampling from is, and on the size of your sample. Lucky for us, the geometric distribution is well behaved (finite variance and mean – both assumptions of the CLT), and our samples are definitely independent. We’re not always that lucky though – fat-tailed distributions such as the Cauchy distribution, for which neither mean nor variance is defined, can prove problematic.
So how well does the CLT handle our family planning problem? The expected ratio of boys to girls for a population of $n$ families is given by:

$$\E\left[\frac{n}{X(n)}\right] = \frac{n}{2}\left[\psi\left(\frac{n+1}{2}\right) - \psi\left(\frac{n}{2}\right)\right]$$
where $\psi$ is the Digamma function (the derivation is here). Plotting this function we see that it rapidly converges to a ratio of .5:
We can run a quick simulation to confirm our results:
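A sketch of such a simulation, comparing one simulated population against the exact digamma expression (the seed and population sizes are assumptions):

```r
set.seed(42)

# Exact expected ratio for a population of n families.
exact.ratio <- function(n) (n / 2) * (digamma((n + 1) / 2) - digamma(n / 2))

# One simulated population of n families: n boys over total children.
sim.ratio <- function(n) n / sum(rgeom(n, prob = .5) + 1)

n.values <- c(1, 2, 5, 10, 25, 50, 100, 500, 1000)
cbind(n         = n.values,
      exact     = sapply(n.values, exact.ratio),
      simulated = sapply(n.values, sim.ratio))
```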

We see that our simulations are within a few percent of the theoretical and converge towards the true value as $n$ becomes large. So far we’re only looking at one simulated population. How do our results look if we average across many simulated populations?
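Along the same lines, we can average the per-population ratio over many simulated populations of each size (reusing `sim.ratio`, `exact.ratio`, and `n.values` from above; the number of passes is an assumption):

```r
n.passes <- 10000

# For each population size, average the ratio across n.passes
# independent simulated populations and track the spread.
summarize <- function(n) {
  ratios <- replicate(n.passes, sim.ratio(n))
  c(n = n, mean = mean(ratios), sd = sd(ratios), exact = exact.ratio(n))
}

t(sapply(n.values, summarize))
```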

The graph above depicts the empirical and exact population ratios, along with bands denoting variance between means for the simulated populations. As you can see, averaging across multiple simulated populations gives much faster convergence. The Central Limit Theorem works its magic once again, this time by smoothing out variation between our simulated populations. We can see that even for a single family population, with enough simulation passes the empirical result is almost indistinguishable from the analytical one. Pretty cool if you ask me.
As previously discussed, there’s no universal measure of randomness. Randomness implies the lack of pattern and the inability to predict future outcomes. However, the lack of an obvious model doesn’t imply randomness any more than a curve-fit one implies order. So what actually constitutes randomness, how can we quantify it, and why do we care?
First off, it’s important to note that predictability doesn’t guarantee profit. On short timescales structure appears, and it’s relatively easy to make short term predictions on the limit order book. However, these inefficiencies are often too small to capitalize on after taking into account commissions. Apparent arbitrage opportunities may persist for some time as the cost of removing the arb is larger than the payout.
Second, randomness and volatility are oft-used interchangeably, in the same way that precision and accuracy receive the same colloquial treatment. Each means something on its own, and merits consideration as such. In the example above, predictability does not imply profitability any more than randomness precludes it. Take subprime for example – the fundamental breakdown in pricing and risk control resulted from correlation and structure, not the lack thereof.
Information theory addresses the big questions of what is information, and what are the fundamental limits on it. Within that scope, randomness plays an integral role in answering questions such as “how much information is contained in a system of two correlated, random variables?” A key concept within information theory is Shannon entropy – a measure of how much uncertainty there is in the outcome of a random variable. As a simple example, when flipping a weighted coin, entropy is maximized when the probability of heads or tails is equal. If the probability of heads or tails is .9 and .1 respectively, the variable is still random, but guessing heads is a much better bet. Consequently, there’s less entropy in the distribution of binary outcomes for a weighted coin flip than there is for a fair one. The uniform distribution is a so-called maximum entropy probability distribution, as there’s no other continuous distribution with the same domain and more uncertainty. With the normal distribution you’re reasonably sure that the next value won’t be far away from the mean on one of the tails, but the uniform distribution contains no such information.
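A quick sketch of the coin example (base-2 entropy, so the units are bits):

```r
# Shannon entropy (in bits) of a Bernoulli(p) outcome.
entropy <- function(p) {
  if (p == 0 || p == 1) return(0)  # degenerate: no uncertainty
  -p * log2(p) - (1 - p) * log2(1 - p)
}

entropy(.5)  # 1.000 - a fair coin maximizes uncertainty
entropy(.9)  # 0.469 - a weighted coin is far more predictable
```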
There’s a deep connection between entropy and compressibility. Algorithms such as DEFLATE exploit patterns in data to compress the original file to a smaller size. Perfectly random strings aren’t compressible, so is compressibility a measure of randomness? Kolmogorov complexity measures, informally speaking, the shortest algorithm necessary to describe a string. For a perfectly random source, compression will actually increase the length of the string as we’ll end up with the original string (the source in this case is its own shortest description) along with the overhead of the compression algorithm. Sounds good, but there’s one slight problem – Kolmogorov complexity is an uncomputable function. In the general case, the search space for an ideal compressor is infinite, so while measuring randomness via compressibility kind of works, it’s always possible that a compression algorithm exists for which our source is highly compressible, implying that the input isn’t so random after all.
What about testing for randomness using the same tools used to assess the quality of a random number generator? NIST offers a test suite for doing so. However, there are several problems with this approach. For starters, these tests need lots of input – 100,000+ data points. Even for high frequency data that makes for a very backward looking measure. Furthermore, the tests are designed for uniformly distributed sample data. We could use the probability integral transform to map our sample from some (potentially empirical) source distribution to the uniform distribution, but now we’re stacking assumptions on top of heuristics.
Of the above it sounds like entropy gets us the closest to what we want, so let’s see what it looks like compared to volatility. We’ll start by plotting the 20-day trailing absolute realized variation of the S&P 500 cash index as a measure of volatility:
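A minimal sketch of that calculation with quantmod (the ticker symbol and date range are assumptions):

```r
library(quantmod)

# Daily log returns on the S&P 500 cash index.
getSymbols("^GSPC", from = "1990-01-01")
rets <- dailyReturn(Cl(GSPC), type = "log")

# 20-day trailing sum of absolute returns as the realized
# variation proxy.
vol <- rollapply(abs(rets), 20, sum, align = "right")
plot(vol, main = "20-day trailing absolute realized variation")
```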

Now let’s look at entropy. Entropy is a property of a random variable, and as such there’s no way to measure the entropy of data directly. However, if we concern ourselves only with the randomness of up/down moves there’s an easy solution. We treat daily returns as Bernoulli trials in which a positive or zero return is a one and a negative return is a zero. We could use a ternary alphabet in which up, down, and flat are treated separately, but seeing as there were only two flat days in this series, doing so only obfuscates the bigger picture.
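A sketch, reusing `rets` from above and the Bernoulli `entropy` function defined earlier:

```r
# Binary alphabet: 1 for a non-negative daily return, 0 otherwise.
up <- xts(as.numeric(rets >= 0), order.by = index(rets))

# Entropy of the up/down distribution over a trailing 20-day window.
H <- rollapply(up, 20, function(w) entropy(mean(w)), align = "right")
plot(H, main = "20-day trailing entropy of up/down moves")
```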

Visually the two plots look very different. We see that most of the time $H(X)$ is close to one (the mean is .96), indicating that our “coin” is fair and that the probability of a day being positive or negative over a trailing 20-day period is close to .5.
How correlated are the results? If we consider the series directly we find that $Cor(H(X),\ \sigma) = .095$. It might be more interesting to consider the frequency with which an increase in volatility is accompanied by an increase in entropy: .500 – spot-on random. Entropy and volatility are distinct concepts.
The recurring theme of “what are we actually trying to measure,” in this case randomness, isn’t trivial. Any metric, indicator, or computed value is only as good as the assumptions that went into it. For example, in the frequentist view of probability, a forecaster $F(X \mid x_{t-1}, \ldots, x_0)$ is “well calibrated” if the true proportion of outcomes $X = x$ is close to the forecasted proportion of outcomes (there’s a Bayesian interpretation as well, but the frequentist one is more obvious). It’s possible to cook up a forecaster that’s wrong 100% of the time, but spot on with the overall proportion. That’s horrible when you’re trying to predict if tomorrow’s trading session will be up or down, but terrific if you’re only interested in the long term proportion of up and down days. As such, discussing randomness, volatility, entropy, or whatever else may be interesting from an academic standpoint, but profitability is a whole other beast, and measuring something in distribution is inherently backward looking.
Traders love discussing seasonality, and September declines in US equity markets are a favorite topic. Historically September has underperformed every other month of the year, offering a mean return of −.56% on the S&P 500 index from 1950 to 2012; 54% of Septembers were bearish over the same period – more than any other month. Empirically, September deserves its moniker: “The Cruelest Month.”
As a trading strategy 54% isn’t a substantial win rate, and N is small given that the strategy only trades once a year. However, both as a portfolio overlay and as a trading position, it’s worth considering whether bearishness in September is a statistically significant anomaly or just random noise.
There are plenty of tests for seasonality in time series data. Many rely on some form of autocorrelation to detect seasonal components in the underlying series. These methods are usually parametric and subject to lots of assumptions – not robust, especially for a nonstationary, noisy time series like the market.
Furthermore, sorting monthly returns by performance and declaring one month “the most bearish” introduces a data-snooping bias. We’re implicitly performing multiple hypothesis tests by doing so, and as such we need to correct for the problem of multiple comparisons. This gets even more interesting if we consider a spreading strategy between two months, which introduces a multiple comparison bias closely related to the Birthday Paradox.
The nonparametric bootstrap is a general purpose tool for estimating the sampling distribution of a statistic from the data itself. The technique is a powerful, computationally intensive tool that’s easily applied for any sample statistic, works well on small samples, and makes few assumptions about the underlying data. However, one assumption that it does make is a biggie: the data must be independent, identically distributed (iid). That’s a deal breaker, unless you buy into the efficient market hypothesis, in which case this post is already pretty irrelevant to you. Bootstrapping dependent data is an active area of research and there’s no universal solution to the problem. However, there’s research showing that the bootstrap is still robust when these assumptions are violated.
To address the question of seasonality we need a reasonable way to pose the hypothesis that we’re testing – one that minimizes issues arising from path dependence. One approach is to consider the distribution of labeled and unlabeled monthly returns. This flavor of bootstrap is also known as a permutation test. The premise is simple. Data is labeled as “control” and “experimental.” Under the null hypothesis that there’s no difference between the two groups, a distribution is bootstrapped over unlabeled data. A sample mean is calculated for the experimental data, and a p-value is computed by finding the percentage of bootstrap replicates more extreme than the sample mean. September returns form the experimental group, and all other months comprise the control.
The monthly (log) return was calculated using the opening price on the first trading day of the month and the closing price on the last trading day. Adjusted returns on the S&P index were used in lieu of e-minis or SPY in the interest of a longer return series (we’re interested in the effect, not the execution, and at the monthly level there won’t be much of a difference in mean).
We’ll start by taking a look at the bootstrapped distributions of mean monthly returns. The heavy lifting is done by the fantastic xts, ggplot2, and quantmod libraries.
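A sketch of the bootstrap (the seed, replicate count, and plotting details are assumptions):

```r
library(quantmod)
library(ggplot2)

set.seed(42)
n.boot <- 10000

# Monthly log returns on the S&P 500 from 1950 on.
getSymbols("^GSPC", from = "1950-01-01")
monthly  <- monthlyReturn(GSPC, type = "log")
month.of <- format(index(monthly), "%b")

# Bootstrap the sampling distribution of the mean monthly return.
boot.means <- function(x)
  replicate(n.boot, mean(sample(x, length(x), replace = TRUE)))

# One distribution per calendar month, plus all months pooled
# together as the control.
groups <- c(split(as.numeric(monthly), month.of),
            list(control = as.numeric(monthly)))
boots  <- lapply(groups, boot.means)

df <- data.frame(group = rep(names(boots), each = n.boot),
                 mean  = unlist(boots))
ggplot(df, aes(x = mean)) + geom_density() + facet_wrap(~ group)
```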

The image above shows a bootstrap distribution for each calendar month, as well as the control. The control distribution is much tighter as there’s 12 times more data, which by the Central Limit Theorem should result in a distribution having $\sigma_\text{control} \approx \frac{\sigma}{\sqrt{12}}$.
Plots of bootstrapped distributions offer a nice visual representation of the probability of committing Type I and Type II errors when hypothesis testing. The tail of a single month’s return distribution extending towards the mean of the control distribution shows how a Type II error can occur if the alternative hypothesis is indeed true. The converse holds for a Type I error, in which the tail of the null hypothesis distribution extends towards the mean of an alternative hypothesis distribution.
Now for the permutation test.
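A sketch of the test, reusing `monthly` and `month.of` from above (the replicate count is an assumption):

```r
set.seed(42)
n.boot <- 100000

sep.mean <- mean(as.numeric(monthly[month.of == "Sep"]))
n.sep    <- sum(month.of == "Sep")

# Under the null, September's label is meaningless: draw groups of
# the same size from the unlabeled pool of monthly returns.
null.means <- replicate(n.boot,
  mean(sample(as.numeric(monthly), n.sep, replace = TRUE)))

# One-sided p-value: how often is an unlabeled group at least as
# bearish as the observed September mean?
mean(null.means <= sep.mean)
```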

My p-value was $p = .0134$ (your mileage will vary as this is a random sample after all) – pretty statistically significant as a stand-alone hypothesis. However, we still have a lurking issue with multiple comparisons. We cherry-picked one month out of the calendar year – September – and we need to account for this bias. The Bonferroni correction is one approach, in which case we would need $\alpha = \frac{.05}{12} = .0042$ if we were testing our hypothesis at the $\alpha = .05$ level. Less formally, our p-value is effectively 0.1608 – not super compelling.
Putting the statistics aside, had you traded this strategy from 1950 to present, your equity curve (in log returns) would look like:
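A sketch of that curve, again reusing `monthly` and `month.of`; which side of the September trade is plotted and the band construction are assumptions on my part:

```r
# Per-year September returns alongside the best and worst calendar
# month of each year.
year.of  <- format(index(monthly), "%Y")
best     <- tapply(as.numeric(monthly), year.of, max)
worst    <- tapply(as.numeric(monthly), year.of, min)
sep.rets <- as.numeric(monthly[month.of == "Sep"])

# Cumulative log returns: a long September position bracketed by
# the most bullish and most bearish month in each year.
plot(cumsum(best), type = "l", col = "darkgreen",
     ylim = range(cumsum(best), cumsum(worst)),
     xlab = "Years since 1950", ylab = "Cumulative log return")
lines(cumsum(worst), col = "darkred")
lines(cumsum(sep.rets), col = "blue")
```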

The high and low bands show the return for the most bullish and bearish month in each year. It’s easy to see that September tends to hug the bottom band but it looks pretty dodgy as a trade – statistical significance does not a good trading strategy make.
Bootstrapping September returns in aggregate makes the assumption that year-over-year, September returns are independent and identically distributed. Given that the bearishness of September is common lore, it’s reasonable to hypothesize that at this point the effect is a self-fulfilling prophecy in which traders take into account how the previous few Septembers went, or the effect in general. If traders fear that September is bearish and tighten stops or liquidate intra-month, an anomaly born out of random variance might gain traction. Whatever the cause, the data indicates that September is indeed anomalous. As for a standalone trading strategy…not so much.