Maximum Entropy Principle

Why do we always use the Gaussian distribution in statistical modelling?

The Gaussian distribution is popular in statistics not only because of its mathematical convenience and its role in the central limit theorem, which states that the normalized sum of many independent random variables tends towards a normal distribution, but also because of its maximum entropy property. According to the maximum entropy principle, if the only information available about a dataset is its mean and variance, the best choice is the distribution that maximizes entropy under those constraints. In other words, among all distributions with a specified mean and variance, the normal distribution introduces the fewest additional assumptions, making it the least informative model consistent with what is known. This lack of extra structure is a key reason for its widespread use in statistical inference and information theory. In this post, we derive the Gaussian distribution from the fundamental principles of information theory to illustrate this concept.

Entropy

Let $X$ be a random variable with a probability density function $f$ whose support is a set $\mathcal{X}$. The entropy of this random variable is given by

$$
\begin{aligned}
H(X) = - \int_{\mathcal{X}} f(x) \log f(x) \, dx = \mathbb{E} \left[ \log \frac{1}{f(X)} \right],
\end{aligned}
$$

which can be thought of as the average amount of information gained from observing an outcome drawn from a probability distribution. Essentially, it measures the uncertainty associated with the possible outcomes of a random variable. For a more detailed and intuitive explanation, there’s an excellent blog post here on the topic.
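As a quick sanity check of this definition, here is a minimal numerical sketch, assuming NumPy and SciPy are available and using arbitrary illustrative values for $\mu$ and $\sigma$. It evaluates $-\int f(x) \log f(x) \, dx$ for a Gaussian density by quadrature and compares the result against the known closed form $\tfrac{1}{2}\log(2 \pi e \sigma^2)$.

```python
# Numerical sanity check of differential entropy for a Gaussian,
# H(X) = -integral of f(x)*log f(x) dx, against 0.5*log(2*pi*e*sigma**2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 0.0, 2.0  # arbitrary illustrative parameters

def neg_f_log_f(x):
    """Integrand -f(x)*log f(x) of the differential entropy."""
    fx = norm.pdf(x, loc=mu, scale=sigma)
    # Guard against log(0) when the density underflows far in the tails.
    return -fx * np.log(fx) if fx > 0 else 0.0

numeric, _ = quad(neg_f_log_f, -np.inf, np.inf)
closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

print(numeric, closed_form)  # both approximately 2.112 for sigma = 2
```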

Maximizing Entropy

The principle of maximum entropy states that, among all distributions that satisfy our prior constraints, the one with the highest entropy is favored because it makes the fewest additional assumptions. In essence, it embraces “epistemic modesty” or “maximum ignorance”, ensuring that only the provided information is encoded, while remaining as non-committal as possible about what is unknown. For a given mean and variance, the maximum entropy probability distribution $f^\ast$ is given by

$$
\begin{aligned}
f^\ast = \argmin_{f} \quad & \int_{\mathcal{X}} f(x) \log f(x) \, dx, \\
\textrm{s.t.} \quad & \int_{\mathcal{X}} f(x) \, dx = 1, \\
& \int_{\mathcal{X}} f(x) (x - \mu)^2 \, dx = \sigma^2,
\end{aligned}
$$

where the first constraint simply ensures that $f$ is a valid probability density function and the second constraint fixes the second moment about $\mu$, which pins down the variance (the resulting solution also turns out to have mean $\mu$). Note that minimising $\int_{\mathcal{X}} f(x) \log f(x) \, dx$ is equivalent to maximising the entropy $-\int_{\mathcal{X}} f(x) \log f(x) \, dx$. To solve this constrained problem, we can use the method of Lagrange multipliers, for which the Lagrangian is

$$
\begin{aligned}
\mathcal{J}(f, \boldsymbol{\lambda}) = \int_{\mathcal{X}} f(x) \log f(x) \, dx + \lambda_1 \left( \int_{\mathcal{X}} f(x) \, dx - 1 \right) + \lambda_2 \left( \int_{\mathcal{X}} f(x) (x - \mu)^2 \, dx - \sigma^2 \right).
\end{aligned}
$$

Taking the derivative with respect to $f$, which amounts to differentiating the integrand pointwise in $f(x)$, and setting it to zero, we get

$$
\begin{aligned}
\frac{d}{df}\mathcal{J}(f, \boldsymbol{\lambda}) &= 1 + \log f(x) + \lambda_1 + \lambda_2 (x - \mu)^2 = 0, \\
\therefore f(x) &= \exp\left( -\lambda_1 - 1 - \lambda_2 (x - \mu)^2 \right).
\end{aligned}
$$
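As a sketch of this step, assuming SymPy is available, the snippet below differentiates the Lagrangian's integrand pointwise in $f$ and solves the resulting stationarity condition, recovering the same exponential form.

```python
# Symbolic check of the stationarity condition: differentiate the
# Lagrangian's integrand pointwise in f and solve for f.
import sympy as sp

x, mu, lam1, lam2 = sp.symbols('x mu lambda_1 lambda_2', real=True)
f = sp.symbols('f', positive=True)

# Integrand of J(f, lambda); the constant terms -lambda_1 and -lambda_2*sigma**2
# do not depend on f and so drop out of the derivative.
integrand = f * sp.log(f) + lam1 * f + lam2 * f * (x - mu)**2

stationarity = sp.diff(integrand, f)  # log(f) + 1 + lambda_1 + lambda_2*(x - mu)**2
print(sp.simplify(stationarity))
print(sp.solve(stationarity, f))      # [exp(-lambda_1 - lambda_2*(x - mu)**2 - 1)]
```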

Therefore, to satisfy the first constraint the following must hold

$$
\begin{aligned}
1 &= \int_{\mathcal{X}} \exp\left( -\lambda_1 - 1 - \lambda_2 (x - \mu)^2 \right) dx, \\
1 &= \int_{\mathcal{X}} \exp\left( -\lambda_1 - 1 \right) \exp\left( - \lambda_2 (x - \mu)^2 \right) dx, \\
\therefore \exp\left( \lambda_1 + 1 \right) &= \int_{\mathcal{X}} \exp\left( - \lambda_2 (x - \mu)^2 \right) dx.
\end{aligned}
$$

We can solve the right-hand side by substitution, setting $u = x - \mu$. Taking the support to be the whole real line, this shift of the integration variable does not affect the limits, so the integral becomes a standard Euler–Poisson integral with the well-known result

$$
\begin{aligned}
\int_{\mathcal{X}} \exp\left( -\lambda_2 u^2 \right) du = \sqrt{\frac{\pi}{\lambda_2}}.
\end{aligned}
$$
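If you would rather not take this on faith, here is a minimal numerical check, assuming SciPy and using an arbitrary positive value for $\lambda_2$.

```python
# Numerical check of the Euler-Poisson integral for an arbitrary lambda_2 > 0.
import numpy as np
from scipy.integrate import quad

lam2 = 0.7  # arbitrary positive value for illustration

numeric, _ = quad(lambda u: np.exp(-lam2 * u**2), -np.inf, np.inf)
closed_form = np.sqrt(np.pi / lam2)

print(numeric, closed_form)  # both approximately 2.118
```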

To satisfy the second constraint, the following must hold

$$
\begin{aligned}
\sigma^2 &= \int_{\mathcal{X}} \exp \left( -\lambda_1 - 1 - \lambda_2 (x - \mu)^2 \right) (x - \mu)^2 \, dx, \\
\therefore \sigma^2 \exp \left( \lambda_1 + 1 \right) &= \int_{\mathcal{X}} \exp \left( - \lambda_2 (x - \mu)^2 \right) (x - \mu)^2 \, dx,
\end{aligned}
$$

where again, by setting $u = x - \mu$, the integral on the right-hand side has the well-known result

$$
\begin{aligned}
\int_{\mathcal{X}} \exp\left( -\lambda_2 u^2 \right) u^2 \, du = \frac{1}{2 \lambda_2} \sqrt{\frac{\pi}{\lambda_2}}.
\end{aligned}
$$
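The same kind of check works here; the sketch below, assuming SymPy, evaluates the second-moment integral symbolically and confirms it matches the closed form.

```python
# Symbolic check of the second-moment Gaussian integral over the real line.
import sympy as sp

u = sp.symbols('u', real=True)
lam2 = sp.symbols('lambda_2', positive=True)

integral = sp.integrate(u**2 * sp.exp(-lam2 * u**2), (u, -sp.oo, sp.oo))
closed_form = sp.sqrt(sp.pi / lam2) / (2 * lam2)

print(sp.simplify(integral - closed_form))  # 0
```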

Therefore, putting this all together, we get

$$
\begin{aligned}
\sqrt{\frac{\pi}{\lambda_2}} = \exp\left( \lambda_1 + 1 \right) = 2 \lambda_2 \sigma^2 \exp\left( \lambda_1 + 1 \right),
\end{aligned}
$$

and so, solving first for $\lambda_2$ and then substituting the result back into the first equality, we get

$$
\begin{aligned}
\lambda_2 = \frac{1}{2 \sigma^2} \quad \textrm{and} \quad \lambda_1 = \log \sqrt{2 \pi \sigma^2} - 1,
\end{aligned}
$$

and finally

$$
\begin{aligned}
f^\ast(x) &= \exp \left( -\lambda_1 - 1 - \lambda_2 (x - \mu)^2 \right) \\
&= \exp \left( - \log \sqrt{2 \pi \sigma^2} - \frac{1}{2 \sigma^2}(x - \mu)^2 \right) \\
&= \frac{1}{\sqrt{2 \pi \sigma^2}} \exp \left( -\frac{1}{2 \sigma^2}(x - \mu)^2 \right),
\end{aligned}
$$

which is, of course, the probability density function of the Gaussian distribution. Therefore, for a fixed mean and variance, the Gaussian distribution is the maximum entropy distribution, making it the natural choice for statistical modelling.
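To see the maximum entropy property in action, here is a short closing sketch, assuming SciPy and NumPy and using arbitrary values for $\mu$ and $\sigma$. It compares the differential entropy of a Gaussian against Laplace and uniform distributions that have been parameterized to share the same mean and variance; the Gaussian comes out on top.

```python
# Empirical illustration of the maximum entropy property: among a few
# distributions sharing the same mean and variance, the Gaussian has the
# largest differential entropy.
import numpy as np
from scipy import stats

mu, sigma = 0.0, 1.5  # arbitrary illustrative parameters

candidates = {
    # Each distribution is parameterized to have mean mu and variance sigma**2:
    # Laplace(scale=b) has variance 2*b**2, and a uniform of total width 2*a
    # has variance a**2 / 3.
    "gaussian": stats.norm(loc=mu, scale=sigma),
    "laplace": stats.laplace(loc=mu, scale=sigma / np.sqrt(2)),
    "uniform": stats.uniform(loc=mu - sigma * np.sqrt(3), scale=2 * sigma * np.sqrt(3)),
}

for name, dist in candidates.items():
    print(f"{name:9s} mean={float(dist.mean()):+.3f} "
          f"var={float(dist.var()):.3f} entropy={float(dist.entropy()):.4f}")

# The Gaussian row reports the largest entropy, matching 0.5*log(2*pi*e*sigma**2).
```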