L2 Regularization | one minute summary

You too can understand L2

Jeffrey Boschman
One Minute Machine Learning

--

Math to really understand how L2 Regularization / Weight Decay works

One of the most common techniques for preventing overfitting is L2 Regularization (so called because it uses the L2 norm). It is also known as Ridge Regression (from the original 1970 paper) or Weight Decay (the name used in deep learning frameworks, because decaying the weights is essentially what it does).
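
As a concrete illustration, here is a minimal NumPy sketch (not from the article; the toy data, the penalty strength `lam`, and the learning rate are made up for the example) that adds an L2 penalty to a linear-regression loss and shows the fitted weights shrinking:

```python
# Minimal sketch: gradient descent on linear regression, with and without an L2 penalty.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # toy inputs
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=100)    # noisy targets

def train(lam, lr=0.1, steps=500):
    """Plain gradient descent; lam=0 means no regularization."""
    w = np.zeros(10)
    for _ in range(steps):
        grad_mse = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the data loss
        grad_penalty = 2 * lam * w                 # gradient of lam * ||w||^2
        w -= lr * (grad_mse + grad_penalty)        # larger weights get pushed down more
    return w

w_plain = train(lam=0.0)
w_l2 = train(lam=1.0)
print("||w|| without L2:", np.linalg.norm(w_plain))
print("||w|| with    L2:", np.linalg.norm(w_l2))   # smaller norm, but weights are not exactly 0
```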

Prerequisite Info: Regularization

  1. Why? Large weights in a neural network are often a sign of an overly complex network that has overfit the training data. One way to keep a model from becoming too complex is therefore to stop its weights from becoming too large.
  2. What? L2 Regularization reduces model complexity by penalizing each weight in proportion to its squared magnitude (so the largest weights are penalized the most), shrinking weights toward zero without ever making them exactly 0.
  3. How? L2 Regularization adds the squared magnitude of each weight to the loss function as a penalty term (multiplied by a lambda hyperparameter). Because the derivative of this penalty is proportional to the weight itself, during gradient descent larger weights get pushed down more, whereas smaller weights barely change (which is why they shrink toward 0 but never quite reach it). The derivation after this list spells out the math.
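
To make the "because of the derivative" step explicit, here is the standard derivation (the notation is mine, not the article's: L_0 is the unregularized loss, w_i a weight, lambda the regularization strength, eta the learning rate):

```latex
% L2-regularized loss: the original loss plus lambda times the sum of squared weights
L(w) \;=\; L_0(w) + \lambda \sum_i w_i^2

% Gradient w.r.t. one weight: the penalty contributes a term proportional to w_i,
% so larger weights receive a larger extra push toward 0
\frac{\partial L}{\partial w_i} \;=\; \frac{\partial L_0}{\partial w_i} + 2\lambda w_i

% Gradient-descent update: every step multiplies the weight by (1 - 2*eta*lambda),
% i.e. the weight "decays", which is where the name Weight Decay comes from
w_i \;\leftarrow\; w_i - \eta \frac{\partial L}{\partial w_i}
      \;=\; (1 - 2\eta\lambda)\, w_i - \eta \frac{\partial L_0}{\partial w_i}
```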
