Why We Divide by n − 1

Understanding sample variance as an optimization problem, and why estimating the mean guarantees we underestimate the true variance.

Variance as an Optimization Problem

Suppose we have a random variable $X$ with unknown mean $\mu^*$ and unknown variance $\sigma^2$. We observe $n$ samples $x_1, x_2, \ldots, x_n$.

The variance of $X$ around any point $\mu$ can be written as a function:

$$V(\mu) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

This is just the average squared distance of the data from $\mu$. We can think of $V$ as a function of the centering point, and it turns out this perspective reveals something fundamental about why sample variance is biased.
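To make this concrete, here is a minimal Python sketch of $V(\mu)$; the sample values are made up purely for illustration:

```python
import numpy as np

# A small illustrative sample; the specific values are arbitrary.
x = np.array([2.1, 4.5, 3.3, 5.0, 4.1])

def V(mu):
    """Average squared distance of the data from the centering point mu."""
    return np.mean((x - mu) ** 2)

print(V(3.0))        # variance of the data around mu = 3.0
print(V(x.mean()))   # variance around the sample mean (the smallest possible value)
```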

The True Variance

The quantity we actually want to estimate is $V(\mu^*)$, the average squared deviation from the true population mean $\mu^*$:

$$V(\mu^*) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu^*)^2$$

If we knew $\mu^*$, we would be done. We could compute $V(\mu^*)$ directly and it would be an unbiased estimator of $\sigma^2$.

[1] Strictly speaking, $V(\mu^*)$ really is unbiased: each term $(x_i - \mu^*)^2$ is a single draw whose expectation is exactly $\sigma^2$. The bias discussed in the rest of this piece enters only once we estimate $\mu^*$.
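A quick simulation backs up the unbiasedness claim; the normal distribution, seed, and parameter values below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, sigma2, n = 5.0, 4.0, 10      # illustrative true mean, variance, sample size

# Average V(mu*) over many independent samples of size n.
values = []
for _ in range(100_000):
    x = rng.normal(mu_star, np.sqrt(sigma2), size=n)
    values.append(np.mean((x - mu_star) ** 2))

print(np.mean(values))                  # ~ 4.0: V(mu*) is unbiased for sigma^2
```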

But we don't know $\mu^*$. So we estimate it.

Estimating the Mean

The natural estimator of $\mu^*$ is the sample mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Now here is the critical observation. The sample mean $\bar{x}$ is not just any estimator of $\mu^*$. It has a very specific relationship to our variance function $V(\mu)$.

Claim. $\bar{x}$ is the value of $\mu$ that minimizes $V(\mu)$.

To see this, take the derivative and set it to zero:

$$\frac{dV}{d\mu} = \frac{1}{n}\sum_{i=1}^{n} -2(x_i - \mu) = 0 \quad\implies\quad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$$

The second derivative is $V'' = 2 > 0$, confirming this is a minimum.
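The same conclusion can be checked numerically. The sketch below (again with made-up data) scans a fine grid of candidate centers and confirms that the minimizer of $V$ agrees with the sample mean:

```python
import numpy as np

x = np.array([2.1, 4.5, 3.3, 5.0, 4.1])         # illustrative sample

def V(mu):
    """Average squared distance of the data from mu."""
    return np.mean((x - mu) ** 2)

# Scan a fine grid of candidate centers and find the one minimizing V.
grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 100_001)
mu_hat = grid[np.argmin([V(mu) for mu in grid])]

print(mu_hat, x.mean())                           # both ~ 3.8: the minimizer is the sample mean
```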

Key Insight

$V(\mu)$ is a convex quadratic in $\mu$, and $\bar{x}$ is its unique minimizer. Therefore, for any $\mu \neq \bar{x}$:

$$V(\bar{x}) < V(\mu)$$

In particular, since $\mu^* \neq \bar{x}$ in general:

$$V(\bar{x}) \leq V(\mu^*)$$

[Interactive figure: drag $\mu$ along the axis and watch $V(\mu)$ change. $V(\mu)$ always bottoms out at $\bar{x}$, so $V(\bar{x}) \leq V(\mu^*)$; the shaded gap between $V(\bar{x})$ and $V(\mu^*)$ is the bias.]

This is the entire argument. By plugging in $\bar{x}$ instead of $\mu^*$, we are plugging in the value that minimizes the sum of squared deviations. The true mean $\mu^*$ almost never coincides with $\bar{x}$ exactly, so $V(\bar{x})$ is almost always strictly less than $V(\mu^*)$.

We are guaranteed to underestimate the variance.

Curvature, Variance, and the n − 1

Since $V(\mu)$ is a quadratic, we can write it exactly as a Taylor expansion around its minimum:

$$V(\mu) = V(\bar{x}) + \frac{V''}{2}(\mu - \bar{x})^2$$

This follows from the identity $(x_i - \mu)^2 = (x_i - \bar{x})^2 + (\bar{x} - \mu)^2 + 2(x_i - \bar{x})(\bar{x} - \mu)$, and noting that $\sum(x_i - \bar{x}) = 0$.

The curvature is $V'' = 2$, always, regardless of the data or $n$. So the gap between $V(\mu^*)$ and $V(\bar{x})$ is:

$$V(\mu^*) - V(\bar{x}) = \frac{V''}{2}(\bar{x} - \mu^*)^2 = (\bar{x} - \mu^*)^2$$

This is the bias for a single sample. It depends on how far $\bar{x}$ happened to land from $\mu^*$.
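The quadratic expansion also means the gap $V(\mu) - V(\bar{x})$ equals $(\bar{x} - \mu)^2$ exactly, for any centering point $\mu$. A quick numerical check with illustrative data:

```python
import numpy as np

x = np.array([2.1, 4.5, 3.3, 5.0, 4.1])        # illustrative sample
xbar = x.mean()

def V(mu):
    return np.mean((x - mu) ** 2)

# For any centering point mu, the gap above the minimum is exactly (xbar - mu)^2.
for mu in (0.0, 3.0, 5.0, 7.5):
    print(np.isclose(V(mu) - V(xbar), (xbar - mu) ** 2))   # True every time
```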

Now we need two ingredients:

  1. The curvature of $V$ tells us how sensitive the loss is to perturbations in $\mu$. Here $V''/2 = 1$.

  2. The variance of $\bar{x}$ tells us how much perturbation we actually have. Since $\bar{x}$ is an average of $n$ independent draws, $\text{Var}(\bar{x}) = \sigma^2/n$.

[2.5] Why $\sigma^2/n$? Variance scales with the square of constants: $\text{Var}(\bar{x}) = \text{Var}\!\left(\frac{1}{n}\sum x_i\right) = \frac{1}{n^2}\cdot n\sigma^2 = \sigma^2/n$. Averaging $n$ independent things reduces variance by a factor of $n$.
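The $\sigma^2/n$ scaling is easy to verify by simulation; the distribution, number of trials, and sample sizes below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                                     # illustrative population variance

for n in (5, 20, 100):
    # 50,000 sample means, each computed from n independent draws.
    xbars = rng.normal(0.0, np.sqrt(sigma2), size=(50_000, n)).mean(axis=1)
    print(n, round(float(xbars.var()), 4), sigma2 / n)   # empirical Var(xbar) vs sigma^2 / n
```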

Their product gives the expected bias:

$$\mathbb{E}[\text{bias}] = \frac{V''}{2} \cdot \text{Var}(\bar{x}) = 1 \cdot \frac{\sigma^2}{n} = \frac{\sigma^2}{n}$$

So our estimator satisfies:

$$\mathbb{E}[V(\bar{x})] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$

The $n$ in the denominator of $\sigma^2/n$ comes from averaging $n$ independent samples: the more data we have, the less $\bar{x}$ wobbles, and the less we underestimate. The curvature of 2 is a fixed property of squared loss, not something we can change.

To correct the bias, divide by $n - 1$ instead of $n$:

Bessel's Correction

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

satisfies $\mathbb{E}[s^2] = \sigma^2$.
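In NumPy, this is controlled by the ddof ("delta degrees of freedom") argument of np.var: the default divides by $n$, while ddof=1 divides by $n - 1$ and applies Bessel's correction.

```python
import numpy as np

x = np.array([2.1, 4.5, 3.3, 5.0, 4.1])        # illustrative sample, n = 5

print(np.var(x))                 # divides by n     -> plug-in (biased) variance
print(np.var(x, ddof=1))         # divides by n - 1 -> Bessel-corrected variance
print(np.var(x) * len(x) / (len(x) - 1))        # same as ddof=1, rescaled by hand
```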

See the bias in action — draw repeated samples and watch the two estimators converge:

[Interactive simulation: each trial draws $n$ samples from $N(0, 4)$ and computes both estimators. The ÷ n estimator consistently underestimates $\sigma^2 = 4$, while ÷ (n − 1) is unbiased.]
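The same experiment is easy to reproduce in code. The sketch below mirrors the demo's setup ($\sigma^2 = 4$), with a sample size of n = 10 chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2, n, trials = 4.0, 10, 200_000            # sigma^2 = 4 as in the demo; n = 10 is arbitrary

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)        # sum of squared deviations, one per trial

print((ss / n).mean())          # ~ (n - 1)/n * sigma^2 = 3.6: the divide-by-n estimator runs low
print((ss / (n - 1)).mean())    # ~ sigma^2 = 4.0: Bessel's correction removes the bias
```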

One Parameter, One Sample Lost

Notice something about the bias $\sigma^2/n$. Our estimator $V(\mu^*)$ is an average of $n$ terms, each contributing $\sigma^2/n$ in expectation:

$$V(\mu^*) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu^*)^2, \qquad \mathbb{E}[\text{each term}] = \frac{\sigma^2}{n}$$

The bias is $\sigma^2/n$: exactly one term's worth. Estimating the mean "uses up" one data point, leaving $n - 1$ effective observations.

And $V''/2 = 1$ is why it's exactly one, not some fraction. The curvature of squared loss sets the exchange rate between parameters and samples: for this loss, one estimated parameter costs exactly one sample. That's the "1" in $n - 1$.

Two Parameters: Simple Linear Regression

To see this in action, consider fitting a line $y = a + bx$ to data $(x_1, y_1), \ldots, (x_n, y_n)$. The loss function is now:

$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - a - bx_i)^2$$

We estimate two parameters, the intercept $\hat{a}$ and the slope $\hat{b}$, by minimizing $L$. The Hessian of $L$ with respect to $(a, b)$ is a $2 \times 2$ matrix, and with the regressor standardized so that $\frac{1}{n}\sum x_i^2 = 1$, its trace is $\text{tr}(\mathbf{H}) = 2 \times 2 = 4$ (two squared-loss directions, each contributing curvature 2). So:

$$\frac{\text{tr}(\mathbf{H})}{2} = 2$$
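As a sanity check on that curvature bookkeeping, the sketch below builds the Hessian of $L(a, b)$ for a made-up, standardized regressor and inspects its trace. The trace-equals-4 statement assumes the regressor is standardized; the $n - 2$ result that follows does not depend on that.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
x = rng.normal(size=n)
x = (x - x.mean()) / x.std()          # standardize the regressor: mean 0, unit second moment

# Design matrix for y = a + b*x. Because L(a, b) is quadratic in (a, b),
# its Hessian is (2/n) * X^T X, independent of y.
X = np.column_stack([np.ones(n), x])
H = (2.0 / n) * X.T @ X

print(np.trace(H))                    # ~ 4: curvature 2 per parameter
print(np.trace(H) / 2)                # ~ 2: two parameters, two samples' worth
```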

Each parameter costs one sample. Two parameters, two samples lost. The unbiased estimator of variance from regression residuals divides by $n - 2$:

Residual Variance in Linear Regression

$$\hat{\sigma}^2 = \frac{1}{n - 2}\sum_{i=1}^{n}(y_i - \hat{a} - \hat{b}\,x_i)^2$$
[3] This is why a line through two points has zero residual variance: you've used all your data to fit, leaving $n - 2 = 0$ degrees of freedom. There's nothing left to measure noise with.
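A simulation sketch of the $n - 2$ claim, fitting the line with np.polyfit; the true intercept, slope, noise level, and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
a_true, b_true, sigma2, n = 1.0, 2.0, 4.0, 12   # illustrative ground truth and sample size

div_n, div_n2 = [], []
for _ in range(20_000):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = a_true + b_true * x + rng.normal(0.0, np.sqrt(sigma2), size=n)
    b_hat, a_hat = np.polyfit(x, y, 1)          # np.polyfit returns [slope, intercept] for deg=1
    resid = y - a_hat - b_hat * x
    div_n.append(np.sum(resid ** 2) / n)
    div_n2.append(np.sum(resid ** 2) / (n - 2))

print(np.mean(div_n))     # ~ (n - 2)/n * sigma^2 = 3.33: dividing by n runs low
print(np.mean(div_n2))    # ~ sigma^2 = 4.0: dividing by n - 2 is unbiased
```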

The pattern is always the same. Fit $k$ parameters by minimizing a squared loss, and the curvature tells you each one costs exactly one sample: divide by $n - k$.

The Broader Lesson

This argument generalizes far beyond variance estimation. Whenever you use data to estimate parameters and then evaluate the loss at those estimates, you will be biased optimistically — because you used the same data to both fit and evaluate.

[4] This is the same phenomenon behind overfitting in machine learning, and why training loss is always lower than test loss. AIC, cross-validation, and Stein's unbiased risk estimate all exist to correct for this.

The curvature of the loss determines the exchange rate. For squared loss it’s 1:1 — clean and simple. For other losses the rate may differ, but the structure is the same: bias = curvature × parameter uncertainty.

Bessel’s correction is not a quirk of variance estimation. It is the simplest instance of a deep idea: fitting costs information, and the curvature of your loss tells you exactly how much.