It is widely accepted that adding noise to the training data of a neural network can improve generalization performance (Bishop, 1995). Out-of-the-box deep learning packages even provide Gaussian noise layers that can be added directly to a network for this purpose (e.g., the GaussianNoise layer in the TensorFlow documentation). I find that a nice way to gain intuition for this is to consider the simple case of linear regression.
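As a minimal sketch (the input dimension, layer sizes, and noise level below are arbitrary choices, not taken from the source), noise injection can be added to a Keras model with a GaussianNoise layer, which perturbs its inputs during training only:

```python
import tensorflow as tf

# Minimal sketch: inject zero-mean Gaussian noise into the inputs during training.
# The GaussianNoise layer is active only when training; it acts as the identity
# at inference time.
inputs = tf.keras.Input(shape=(10,))
x = tf.keras.layers.GaussianNoise(stddev=0.1)(inputs)
x = tf.keras.layers.Dense(32, activation="relu")(x)
outputs = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```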
Linear Regression
Suppose we have a set $\{y_i, x_i\}_{i=1}^{n}$ of independent and identically distributed observations, and that we assume a Gaussian likelihood such that

$$\mathbb{E}[y_i \mid x_i] = w^\top x_i,$$
then, given that we optimize the parameters using the sum-of-squares loss, the estimated coefficients are given by

$$\hat{w} = \underset{w}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^\top x_i\right)^2.$$
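As a quick illustration (with hypothetical synthetic data; the dimensions and noise level are arbitrary), the ordinary least-squares estimate can be computed directly:

```python
import numpy as np

# Hypothetical toy data: n i.i.d. observations of a d-dimensional x with
# linear targets plus a small amount of observation noise.
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Ordinary least squares: minimizes (1/n) * sum_i (y_i - w^T x_i)^2.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```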
Adding Noise
Now suppose the data is corrupted with additive noise, such that
$$\tilde{x}_i = x_i + \delta_i, \qquad \delta_i \sim \mathcal{N}(0, \Sigma),$$
where, for simplicity, we let $\Sigma$ be a diagonal matrix. The optimization problem is now augmented to minimize the expected sum-of-squares loss, where the expectation is taken over the random noise, such that
$$
\begin{aligned}
\hat{w} &= \underset{w}{\arg\min}\; \mathbb{E}_{\delta \sim \mathcal{N}(0,\Sigma)}\left[\frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^\top \tilde{x}_i\right)^2\right] \\
&= \underset{w}{\arg\min}\; \mathbb{E}_{\delta \sim \mathcal{N}(0,\Sigma)}\left[\frac{1}{n}\sum_{i=1}^{n}\left(\left(y_i - w^\top x_i\right)^2 - 2\left(y_i - w^\top x_i\right) w^\top \delta_i + w^\top \delta_i \delta_i^\top w\right)\right],
\end{aligned}
$$
which by linearity of expectation gives
$$
\begin{aligned}
\hat{w} &= \underset{w}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n}\left(\left(y_i - w^\top x_i\right)^2 - 2\left(y_i - w^\top x_i\right) w^\top \mathbb{E}_{\delta \sim \mathcal{N}(0,\Sigma)}\left[\delta_i\right] + w^\top \mathbb{E}_{\delta \sim \mathcal{N}(0,\Sigma)}\left[\delta_i \delta_i^\top\right] w\right) \\
&= \underset{w}{\arg\min}\; \frac{1}{n}\sum_{i=1}^{n}\left(y_i - w^\top x_i\right)^2 + w^\top \Sigma w,
\end{aligned}
$$
where the last line follows since $\mathbb{E}_{\delta \sim \mathcal{N}(0,\Sigma)}[\delta_i] = 0$ and $\mathbb{E}_{\delta \sim \mathcal{N}(0,\Sigma)}[\delta_i \delta_i^\top] = \Sigma$. For the special case of isotropic noise, where $\Sigma = \beta I$ for some constant $\beta$, we get
$$w^\top \Sigma w = \beta \lVert w \rVert_2^2,$$
which results in a vector of coefficients equivalent to that obtained using ridge regression with shrinkage penalty $\beta$, which is known to improve the generalization performance of linear regression by reducing overfitting. In Bishop (1995), it is shown more generally that for the sum-of-squares loss, the equivalent regularization term belongs to the class of generalized Tikhonov regularizers. Therefore, direct minimization of the regularized loss provides a practical alternative to training with noise.
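This equivalence is easy to check numerically. The sketch below (again with hypothetical synthetic data; the dimensions, noise variance $\beta$, and number of noise replicates are arbitrary choices) approximates the expected loss by replicating the training set with fresh isotropic noise, and compares the resulting least-squares fit with the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, beta = 500, 3, 0.5

# Hypothetical training data with a known linear relationship.
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.2 * rng.normal(size=n)

# (a) Ridge regression in closed form. With the (1/n)-normalized loss plus
# beta * ||w||^2, the minimizer is w = (X^T X + n*beta*I)^{-1} X^T y.
w_ridge = np.linalg.solve(X.T @ X + n * beta * np.eye(d), X.T @ y)

# (b) Training with noisy inputs: approximate the expectation over delta by
# replicating the data set many times with fresh isotropic noise of variance beta.
reps = 2000
X_noisy = np.vstack([X + np.sqrt(beta) * rng.normal(size=(n, d)) for _ in range(reps)])
y_rep = np.tile(y, reps)
w_noisy, *_ = np.linalg.lstsq(X_noisy, y_rep, rcond=None)

print("ridge:", w_ridge)
print("noisy:", w_noisy)  # the two estimates should roughly agree
```

As the number of noise replicates grows, the empirical loss on the corrupted data converges to the expected loss derived above, so the two coefficient vectors should agree up to Monte Carlo error.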