Noisy Regularization

How is adding noise to training data equivalent to regularization?

It is widely accepted that adding noise to the training data of a neural network can improve generalization performance (Bishop, 1995). It is even possible to add Gaussian noise layers directly to a neural network with out-of-the-box deep learning packages (e.g., the GaussianNoise layer in the TensorFlow documentation). A nice way to gain intuition for why this works is to consider the simple case of linear regression.
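As a concrete illustration, here is a minimal Keras sketch of such a layer in use; the input dimension, noise standard deviation, and architecture are arbitrary choices for the example, not a recommendation.

```python
import tensorflow as tf

# Minimal sketch: GaussianNoise corrupts its inputs with zero-mean noise
# during training and acts as the identity at inference time.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32,)),                 # illustrative input dimension
    tf.keras.layers.GaussianNoise(stddev=0.1),   # add noise to the inputs
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
```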

Linear Regression

Suppose we have a set $\{y_i, \boldsymbol{x}_i\}_{i=1}^n$ of independent and identically distributed observations, and that we assume a Gaussian likelihood such that

$$
\begin{aligned}
\mathbb{E}[y_i \vert \boldsymbol{x}_i] = \boldsymbol{w}^\top \boldsymbol{x}_i,
\end{aligned}
$$

then, if we optimize the parameters using the sum-of-squares loss, the estimated coefficients are given by

$$
\begin{aligned}
\hat{\boldsymbol{w}} = \argmin_{\boldsymbol{w}} \ \frac{1}{n}\sum_{i=1}^n (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2.
\end{aligned}
$$
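For reference, here is a minimal NumPy sketch of this estimator on synthetic data; the dimensions and noise level are arbitrary assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data: rows of X are the x_i, entries of y are the y_i
n, d = 200, 5                                   # illustrative sizes
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# least-squares estimate minimizing the sum-of-squares loss
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```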

Adding Noise

Now suppose the data is corrupted with additive noise, such that

$$
\begin{aligned}
\tilde{\boldsymbol{x}}_i = \boldsymbol{x}_i + \boldsymbol{\delta}_i, \quad \boldsymbol{\delta}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}),
\end{aligned}
$$

where, for simplicity, we let $\boldsymbol{\Sigma}$ be a diagonal matrix. The optimization problem is now to minimize the expected sum-of-squares loss, where the expectation is taken over the random noise, such that

$$
\begin{aligned}
\hat{\boldsymbol{w}} &= \argmin_{\boldsymbol{w}} \ \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} \left[ \frac{1}{n} \sum_{i=1}^n (y_i - \boldsymbol{w}^\top \tilde{\boldsymbol{x}}_i)^2 \right], \\
&= \argmin_{\boldsymbol{w}} \ \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} \left[ \frac{1}{n} \sum_{i=1}^n \left( (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2 - 2 (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)\, \boldsymbol{w}^\top \boldsymbol{\delta}_i + \boldsymbol{w}^\top \boldsymbol{\delta}_i \boldsymbol{\delta}_i^\top \boldsymbol{w} \right) \right],
\end{aligned}
$$

which by linearity of expectation gives

$$
\begin{aligned}
\hat{\boldsymbol{w}} &= \argmin_{\boldsymbol{w}} \ \frac{1}{n} \sum_{i=1}^n \left( (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2 - 2 (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)\, \boldsymbol{w}^\top \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} \left[ \boldsymbol{\delta}_i \right] + \boldsymbol{w}^\top \mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} \left[ \boldsymbol{\delta}_i \boldsymbol{\delta}_i^\top \right] \boldsymbol{w} \right), \\
&= \argmin_{\boldsymbol{w}} \ \frac{1}{n} \sum_{i=1}^n (y_i - \boldsymbol{w}^\top \boldsymbol{x}_i)^2 + \boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w},
\end{aligned}
$$

where the last line follows since $\mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} [\boldsymbol{\delta}_i] = \boldsymbol{0}$ and $\mathbb{E}_{\boldsymbol{\delta} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})} [\boldsymbol{\delta}_i \boldsymbol{\delta}_i^\top] = \boldsymbol{\Sigma}$. For the special case of isotropic noise, such that $\boldsymbol{\Sigma} = \beta \mathbf{I}$ for some constant $\beta$, we get

$$
\begin{aligned}
\boldsymbol{w}^\top \boldsymbol{\Sigma} \boldsymbol{w} = \beta \, \Vert \boldsymbol{w} \Vert_2^2,
\end{aligned}
$$

which results in a vector of coefficients equivalent to that obtained using Ridge regression with shrinkage penalty $\beta$, which is known to improve the generalization performance of linear regression by reducing overfitting. Bishop (1995) shows more generally that, for the sum-of-squares loss, the equivalent regularization term belongs to the class of generalized Tikhonov regularizers. Direct minimization of the regularized loss therefore provides a practical alternative to training with noise.
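To make the equivalence concrete, here is a small NumPy sketch, using arbitrary synthetic data, that compares the closed-form minimizer of the penalized loss with ordinary least squares fitted on many noise-corrupted copies of the inputs; the two coefficient vectors agree up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic data (sizes and noise scale are arbitrary illustration choices)
n, d, beta = 200, 5, 0.5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# (a) closed-form minimizer of (1/n)||y - Xw||^2 + beta * ||w||^2,
#     obtained by setting the gradient to zero: (X^T X + n*beta*I) w = X^T y
w_reg = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)

# (b) plain least squares on many copies of X corrupted with N(0, beta*I)
#     noise, which approximates minimizing the expected loss over the noise
m = 2000                                        # number of noisy replications
X_noisy = np.vstack([X + np.sqrt(beta) * rng.normal(size=(n, d)) for _ in range(m)])
y_rep = np.tile(y, m)
w_noisy, *_ = np.linalg.lstsq(X_noisy, y_rep, rcond=None)

print(np.abs(w_reg - w_noisy).max())            # small; shrinks as m grows
```

Note that the penalty enters the normal equations as $n\beta$ rather than $\beta$ because the sum-of-squares term in the objective carries a $\frac{1}{n}$ factor.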