Misapplied Math

Twelve Days 2013: LASSO Regression

## Day Eight: LASSO Regression

### TL/DR

LASSO regression (least absolute shrinkage and selection operator) is a modified form of least squares regression that penalizes model complexity via a regularization parameter. It does so by including a term proportional to $||\beta||_{l_1}$ in the objective function which shrinks coefficients towards zero, and can even eliminate them entirely. In that light, LASSO is a form of feature selection/dimensionality reduction. Unlike other forms of regularization such as ridge regression, LASSO will actually eliminate predictors. It’s a simple, useful technique that performs quite well on many data sets.

## Regularization

Regularization refers to the process of adding additional constraints to a problem to avoid over fitting. ML techniques such as neural networks can generate models of arbitrary complexity that will fit in-sample data one-for-one. As we recently saw in the post on Reed-Solomon FEC codes, the same applies to regression. We definitely have a problem anytime there are more regressors than data points, but any excessively complex model will generalize horribly and do you zero good out of sample.

## Why LASSO?

There’s a litany of regularization techniques for regression, ranging from heuristic, hands-on ones like stepwise regression to full blown dimensionality reduction. They all have their place, but I like LASSO because it works very well, and it’s simpler than most dimensionality reduction/ML techniques. And, despite being a non-linear method, as of 2008 it has a relatively efficient solution via coordinate descent. We can solve the optimization in $O(n\cdot p)$ time, where $n$ is the length of the data set and $p$ is the number of regressors.

## An Example

Our objective function has the form:

where $\lambda \geq 0$. The first half of the equation is just the standard objective function for least squares regression. The second half penalizes regression coefficients under the $l_1$ norm. The parameter $\lambda$ determines how important the penalty on coefficient weights is.

There are two R packages that I know of for LASSO: lars (short for least angle regression – a super set of LASSO) and glmnet. Glmnet includes solvers for more general models (including elastic net – a hybrid of LASSO and ridge that can handle catagorical variables). Lars is simpler to work with but the documentation isn’t great. As such, here are a few points worth noting:

1. The primary lars function generates an object that’s subsequently used to generate the fit that you actually want. There’s a computational motivation behind this approach. The LARS technique works by solving for a series of “knot points” with associated, monotonically decreasing values of $\lambda$. The knot points are subsequently used to compute the LASSO regression for any value of $\lambda$ using only matrix math. This makes procedures such as cross validation where we need to try lots of different values of $\lambda$ computationally tractable. Without it, we would have to recompute an expensive non-linear optimization each time $\lambda$ changed.
2. There’s a saturation point at which $\lambda$ is high enough that the null model is optimal. On the other end of the spectrum, when $\lambda = 0$, we’re left with least squares. The final value of $\lambda$ on the path, right before we end up with least squares, will correspond to the largest coefficient norm. Let’s call these coefficients $\beta_\text{thresh}$, and denote $\Delta = || \beta_\text{thresh} ||_{l_1}$. When the lars package does cross validation, it does so by computing the MSE for models where the second term in the objective function is fixed at $x \cdot \Delta,\ x \in [0, 1]$. This works from a calculation standpoint (and computationally it makes things pretty), but it’s counter intuitive if you’re interested in the actual value of $\lambda$ and not just trying to get the regression coefficients. You could easily write your own cross validation routine to use $\lambda$ directly.
3. The residual sum of squared errors will increase monotonically with $\lambda$. This makes sense as we’re trading off between minimizing RSS and the model’s complexity. As such, the smallest RSS will always correspond to the smallest value of $\lambda$, and not necessarily the optimal one.

Here’s a simple example using data from the lars package. We’ll follow a common heuristic that recommends choosing $\lambda$ one SD of MSE away from the minimum. Personally I prefer to examine the CV L-curve and pick a value right on the elbow, but this works.

## Final Notes

LASSO is a biased, linear estimator whose bias increases with $\lambda$. It’s not meant to provide the “best” fit as Gauss-Markov defines it – LASSO aims to find models that generalize well. Feature selection is hard problem and the best that we can do is a combination of common sense and model inference. However, no technique will save you from the worst case scenario: two very highly correlated variables, one of which is a good predictor, the other of which is spurious. It’s a crap shoot as to which predictor a feature selection algorithm would penalize in that case. LASSO has a few technical issues as well. Omitted variable bias is still an issue as it is in other forms of regression, and because of its non-linear solution, LASSO isn’t invariant under transformations of original data matrix.