Noisy Regularization

How is adding noise to training data equivalent to regularization?

It is widely accepted that adding noise to the training data of a neural network can improve generalization performance (Bishop, 1995). It is even possible to add Gaussian noise layers directly to a neural network with out-of-the-box deep learning packages (e.g., the GaussianNoise layer in the TensorFlow documentation). A nice way to gain intuition for why this works is to consider the simple case of linear regression.
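As a concrete illustration, here is a minimal Keras sketch of such a layer in use; the input dimension, noise standard deviation, and architecture are arbitrary choices for the example, not a recommendation.

```python
import tensorflow as tf

# Minimal sketch: GaussianNoise corrupts its inputs with zero-mean noise
# during training and acts as the identity at inference time.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),                 # illustrative input dimension
    tf.keras.layers.GaussianNoise(stddev=0.1),   # add noise to the inputs
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```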

Linear Regression

Suppose we have a set $\{y_i, \boldsymbol{x}_i\}_{i=1}^n$ of independent and identically distributed observations, and that we assume a Gaussian likelihood such that

$$
\begin{aligned}
\mathbb{E}[y_i \vert \boldsymbol{x}_i] = \boldsymbol{w}^\top \boldsymbol{x}_i,
\end{aligned}
$$

then, if we optimize the parameters using the sum-of-squares loss, the estimated coefficients are given by

$$
\begin{aligned}
\hat{\boldsymbol{w}} = \argmin_{\boldsymbol{w}} \ \frac{1}{n}\sum_{i=1}^n (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2.
\end{aligned}
$$
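For reference, here is a minimal NumPy sketch of this estimator on synthetic data; the dimensions and noise level are arbitrary assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: rows of X are the x_i, entries of y are the y_i
n, d = 200, 5                                   # illustrative sizes
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# least-squares estimate minimizing the sum-of-squares loss
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```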

Adding Noise

Now suppose the data is corrupted with additive noise, such that

$$
\begin{aligned}
\tilde{\boldsymbol{x}}_i = \boldsymbol{x}_i + \boldsymbol{\delta}_i, \quad \boldsymbol{\delta}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}),
\end{aligned}
$$

where, for simplicity, we let $\boldsymbol{\Sigma}$ be a diagonal matrix. The optimization problem is now to minimize the expected sum-of-squares loss, where the expectation is taken over the random noise, such that

$$
\begin{aligned}
\hat{\boldsymbol{w}} &= \argmin_{\boldsymbol{w}} \ \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} \left[ \frac{1}{n} \sum_{i=1}^n (y_i - \boldsymbol{w}^\top \tilde{\boldsymbol{x}}_i)^2 \right], \\
&= \argmin_{\boldsymbol{w}} \ \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} \left[ \frac{1}{n} \sum_{i=1}^n \left( (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2 - 2 (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)\, \boldsymbol{w}^\top \boldsymbol{\delta}_i + \boldsymbol{w}^\top \boldsymbol{\delta}_i \boldsymbol{\delta}_i^\top \boldsymbol{w} \right) \right],
\end{aligned}
$$

which by linearity of expectation gives

$$
\begin{aligned}
\hat{\boldsymbol{w}} &= \argmin_{\boldsymbol{w}} \ \frac{1}{n} \sum_{i=1}^n \left( (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2 - 2 (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)\, \boldsymbol{w}^\top \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} \left[ \boldsymbol{\delta}_i \right] + \boldsymbol{w}^\top \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} \left[ \boldsymbol{\delta}_i \boldsymbol{\delta}_i^\top \right] \boldsymbol{w} \right), \\
&= \argmin_{\boldsymbol{w}} \ \frac{1}{n} \sum_{i=1}^n (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2 + \boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w},
\end{aligned}
$$

where the last line follows since $\mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} [\boldsymbol{\delta}_i] = \boldsymbol{0}$ and $\mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} [\boldsymbol{\delta}_i \boldsymbol{\delta}_i^\top] = \boldsymbol{\Sigma}$. For the special case of isotropic noise, such that $\boldsymbol{\Sigma} = \beta \mathbf{I}$ for some constant $\beta$, we get

$$
\begin{aligned}
\boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w} = \beta \, \Vert \boldsymbol{w} \Vert_2^2,
\end{aligned}
$$

which results in a vector of coefficients equivalent to that obtained using Ridge regression with shrinkage penalty $\beta$, which is known to improve the generalization performance of linear regression by reducing overfitting. Bishop (1995) shows more generally that, for the sum-of-squares loss, the equivalent regularization term belongs to the class of generalized Tikhonov regularizers. Direct minimization of the regularized loss therefore provides a practical alternative to training with noise.
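To make the equivalence concrete, here is a small NumPy sketch, using arbitrary synthetic data, that compares the closed-form minimizer of the penalized loss with ordinary least squares fitted on many noise-corrupted copies of the inputs; the two coefficient vectors agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data (sizes and noise scale are arbitrary illustration choices)
n, d, beta = 200, 5, 0.5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# (a) closed-form minimizer of (1/n)||y - Xw||^2 + beta * ||w||^2,
#     obtained by setting the gradient to zero: (X^T X + n*beta*I) w = X^T y
w_reg = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)

# (b) plain least squares on many copies of X corrupted with N(0, beta*I)
#     noise, which approximates minimizing the expected loss over the noise
m = 2000                                        # number of noisy replications
X_noisy = np.vstack([X + np.sqrt(beta) * rng.normal(size=(n, d)) for _ in range(m)])
y_rep = np.tile(y, m)
w_noisy, *_ = np.linalg.lstsq(X_noisy, y_rep, rcond=None)

print(np.abs(w_reg - w_noisy).max())            # small; shrinks as m grows
```

Note that the penalty enters the normal equations as $n\beta$ rather than $\beta$ because the sum-of-squares term in the objective carries a $\frac{1}{n}$ factor.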