Misapplied Math

Trading, Data Science, CS.

Twelve Days 2013: LASSO Regression

Day Eight: LASSO Regression


LASSO regression (least absolute shrinkage and selection operator) is a modified form of least squares regression that penalizes model complexity via a regularization parameter. It does so by including a term proportional to $||\beta||_{l_1}$ in the objective function which shrinks coefficients towards zero, and can even eliminate them entirely. In that light, LASSO is a form of feature selection/dimensionality reduction. Unlike other forms of regularization such as ridge regression, LASSO will actually eliminate predictors. It’s a simple, useful technique that performs quite well on many data sets.


Regularization refers to the process of adding additional constraints to a problem to avoid over fitting. ML techniques such as neural networks can generate models of arbitrary complexity that will fit in-sample data one-for-one. As we recently saw in the post on Reed-Solomon FEC codes, the same applies to regression. We definitely have a problem anytime there are more regressors than data points, but any excessively complex model will generalize horribly and do you zero good out of sample.


There’s a litany of regularization techniques for regression, ranging from heuristic, hands-on ones like stepwise regression to full blown dimensionality reduction. They all have their place, but I like LASSO because it works very well, and it’s simpler than most dimensionality reduction/ML techniques. And, despite being a non-linear method, as of 2008 it has a relatively efficient solution via coordinate descent. We can solve the optimization in $O(n\cdot p)$ time, where $n$ is the length of the data set and $p$ is the number of regressors.

An Example

Our objective function has the form:

where $\lambda \geq 0$. The first half of the equation is just the standard objective function for least squares regression. The second half penalizes regression coefficients under the $l_1$ norm. The parameter $\lambda$ determines how important the penalty on coefficient weights is.

There are two R packages that I know of for LASSO: lars (short for least angle regression – a super set of LASSO) and glmnet. Glmnet includes solvers for more general models (including elastic net – a hybrid of LASSO and ridge that can handle catagorical variables). Lars is simpler to work with but the documentation isn’t great. As such, here are a few points worth noting:

  1. The primary lars function generates an object that’s subsequently used to generate the fit that you actually want. There’s a computational motivation behind this approach. The LARS technique works by solving for a series of “knot points” with associated, monotonically decreasing values of $\lambda$. The knot points are subsequently used to compute the LASSO regression for any value of $\lambda$ using only matrix math. This makes procedures such as cross validation where we need to try lots of different values of $\lambda$ computationally tractable. Without it, we would have to recompute an expensive non-linear optimization each time $\lambda$ changed.
  2. There’s a saturation point at which $\lambda$ is high enough that the null model is optimal. On the other end of the spectrum, when $\lambda = 0$, we’re left with least squares. The final value of $\lambda$ on the path, right before we end up with least squares, will correspond to the largest coefficient norm. Let’s call these coefficients $\beta_\text{thresh}$, and denote $\Delta = || \beta_\text{thresh} ||_{l_1}$. When the lars package does cross validation, it does so by computing the MSE for models where the second term in the objective function is fixed at $x \cdot \Delta,\ x \in [0, 1]$. This works from a calculation standpoint (and computationally it makes things pretty), but it’s counter intuitive if you’re interested in the actual value of $\lambda$ and not just trying to get the regression coefficients. You could easily write your own cross validation routine to use $\lambda$ directly.
  3. The residual sum of squared errors will increase monotonically with $\lambda$. This makes sense as we’re trading off between minimizing RSS and the model’s complexity. As such, the smallest RSS will always correspond to the smallest value of $\lambda$, and not necessarily the optimal one.

Here’s a simple example using data from the lars package. We’ll follow a common heuristic that recommends choosing $\lambda$ one SD of MSE away from the minimum. Personally I prefer to examine the CV L-curve and pick a value right on the elbow, but this works.

# Compute MSEs for a range of coefficient penalties expressed as a fraction
# of the final L1 norm on the interval [0, 1].
cv.res <- cv.lars(diabetes$x, diabetes$y, type = "lasso",
mode = "fraction", plot = FALSE)
# Choose an "optimal" value one standard deviation away from the
# minimum MSE.
opt.frac <- min(cv.res$cv) + sd(cv.res$cv)
opt.frac <- cv.res$index[which(cv.res$cv < opt.frac)[1]]
# Compute the LARS path
lasso.path <- lars(diabetes$x, diabetes$y, type = "lasso")
# Compute a fit given the LARS path that we precomputed, and our optimal
# fraction of the final L1 norm
lasso.fit <- predict.lars(lasso.path, type = "coefficients",
mode = "fraction", s = opt.frac)
# Extract the final vector of regression coefficients

Final Notes

LASSO is a biased, linear estimator whose bias increases with $\lambda$. It’s not meant to provide the “best” fit as Gauss-Markov defines it – LASSO aims to find models that generalize well. Feature selection is hard problem and the best that we can do is a combination of common sense and model inference. However, no technique will save you from the worst case scenario: two very highly correlated variables, one of which is a good predictor, the other of which is spurious. It’s a crap shoot as to which predictor a feature selection algorithm would penalize in that case. LASSO has a few technical issues as well. Omitted variable bias is still an issue as it is in other forms of regression, and because of its non-linear solution, LASSO isn’t invariant under transformations of original data matrix.