Day Eight: LASSO Regression
TL/DR
LASSO regression (least absolute shrinkage and selection operator) is a modified form of least squares regression that penalizes model complexity via a regularization parameter. It does so by including a term proportional to $\beta_{l_1}$ in the objective function which shrinks coefficients towards zero, and can even eliminate them entirely. In that light, LASSO is a form of feature selection/dimensionality reduction. Unlike other forms of regularization such as ridge regression, LASSO will actually eliminate predictors. It’s a simple, useful technique that performs quite well on many data sets.
Regularization
Regularization refers to the process of adding additional constraints to a problem to avoid over fitting. ML techniques such as neural networks can generate models of arbitrary complexity that will fit insample data oneforone. As we recently saw in the post on ReedSolomon FEC codes, the same applies to regression. We definitely have a problem anytime there are more regressors than data points, but any excessively complex model will generalize horribly and do you zero good out of sample.
Why LASSO?
There’s a litany of regularization techniques for regression, ranging from heuristic, handson ones like stepwise regression to full blown dimensionality reduction. They all have their place, but I like LASSO because it works very well, and it’s simpler than most dimensionality reduction/ML techniques. And, despite being a nonlinear method, as of 2008 it has a relatively efficient solution via coordinate descent. We can solve the optimization in $O(n\cdot p)$ time, where $n$ is the length of the data set and $p$ is the number of regressors.
An Example
Our objective function has the form:
where $\lambda \geq 0$. The first half of the equation is just the standard objective function for least squares regression. The second half penalizes regression coefficients under the $l_1$ norm. The parameter $\lambda$ determines how important the penalty on coefficient weights is.
There are two R packages that I know of for LASSO: lars (short for least angle regression – a super set of LASSO) and glmnet. Glmnet includes solvers for more general models (including elastic net – a hybrid of LASSO and ridge that can handle catagorical variables). Lars is simpler to work with but the documentation isn’t great. As such, here are a few points worth noting:
 The primary lars function generates an object that’s subsequently used to generate the fit that you actually want. There’s a computational motivation behind this approach. The LARS technique works by solving for a series of “knot points” with associated, monotonically decreasing values of $\lambda$. The knot points are subsequently used to compute the LASSO regression for any value of $\lambda$ using only matrix math. This makes procedures such as cross validation where we need to try lots of different values of $\lambda$ computationally tractable. Without it, we would have to recompute an expensive nonlinear optimization each time $\lambda$ changed.
 There’s a saturation point at which $\lambda$ is high enough that the null model is optimal. On the other end of the spectrum, when $\lambda = 0$, we’re left with least squares. The final value of $\lambda$ on the path, right before we end up with least squares, will correspond to the largest coefficient norm. Let’s call these coefficients $\beta_\text{thresh}$, and denote $\Delta =  \beta_\text{thresh} _{l_1}$. When the lars package does cross validation, it does so by computing the MSE for models where the second term in the objective function is fixed at $x \cdot \Delta,\ x \in [0, 1]$. This works from a calculation standpoint (and computationally it makes things pretty), but it’s counter intuitive if you’re interested in the actual value of $\lambda$ and not just trying to get the regression coefficients. You could easily write your own cross validation routine to use $\lambda$ directly.
 The residual sum of squared errors will increase monotonically with $\lambda$. This makes sense as we’re trading off between minimizing RSS and the model’s complexity. As such, the smallest RSS will always correspond to the smallest value of $\lambda$, and not necessarily the optimal one.
Here’s a simple example using data from the lars package. We’ll follow a common heuristic that recommends choosing $\lambda$ one SD of MSE away from the minimum. Personally I prefer to examine the CV Lcurve and pick a value right on the elbow, but this works.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 

Final Notes
LASSO is a biased, linear estimator whose bias increases with $\lambda$. It’s not meant to provide the “best” fit as GaussMarkov defines it – LASSO aims to find models that generalize well. Feature selection is hard problem and the best that we can do is a combination of common sense and model inference. However, no technique will save you from the worst case scenario: two very highly correlated variables, one of which is a good predictor, the other of which is spurious. It’s a crap shoot as to which predictor a feature selection algorithm would penalize in that case. LASSO has a few technical issues as well. Omitted variable bias is still an issue as it is in other forms of regression, and because of its nonlinear solution, LASSO isn’t invariant under transformations of original data matrix.