Why We Divide by n − 1
Understanding sample variance as an optimization problem, and why estimating the mean guarantees we underestimate the true variance.
Variance as an Optimization Problem
Suppose we have a random variable $X$ with unknown mean $\mu$ and unknown variance $\sigma^2$. We observe samples $x_1, x_2, \dots, x_n$.

The variance of the data around any point $m$ can be written as a function:

$$V(m) = \frac{1}{n} \sum_{i=1}^{n} (x_i - m)^2$$

This is just the average squared distance of the data from $m$. We can think of $V$ as a function of the centering point $m$, and it turns out this perspective reveals something fundamental about why sample variance is biased.
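To make the definition concrete, here is a minimal NumPy sketch of $V$ (the function name, helper style, and sample values are illustrative choices, not part of the original demo):

```python
import numpy as np

def V(m, x):
    """Average squared distance of the data x from the centering point m."""
    return np.mean((x - m) ** 2)

x = np.array([2.0, 4.0, 4.0, 5.0, 7.0])   # a small illustrative sample
print(V(3.0, x))           # V at an arbitrary centering point
print(V(np.mean(x), x))    # V at the sample mean
```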
The True Variance
The quantity we actually want to estimate is $\sigma^2$, the average squared deviation from the true population mean $\mu$:

$$\sigma^2 = \mathbb{E}\!\left[(X - \mu)^2\right]$$

If we knew $\mu$, we would be done. We could compute $V(\mu)$ directly and it would be an unbiased estimator of $\sigma^2$.
Strictly, $V(\mu)$ has expectation $\sigma^2$, not the value $\sigma^2$ itself, because each $x_i$ is a single random draw. But the key bias we discuss here comes from estimating $\mu$, a conceptually distinct and more interesting issue.
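As a quick sanity check of this claim, here is a small simulation sketch (the distribution, sample size, and seed are arbitrary choices) showing that $V(\mu)$ averages out to $\sigma^2$ when the true mean is known:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 3.0, 2.0, 10                      # true mean, true sd, sample size

# For many independent datasets, compute V(mu) with the TRUE mean plugged in.
v_at_mu = [np.mean((rng.normal(mu, sigma, n) - mu) ** 2) for _ in range(100_000)]

print(np.mean(v_at_mu))                          # ~ sigma**2 = 4.0: unbiased when mu is known
```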
But we don’t know $\mu$. So we estimate it.
Estimating the Mean
The natural estimator of $\mu$ is the sample mean:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Now here is the critical observation. The sample mean is not just any estimator of $\mu$. It has a very specific relationship to our variance function $V$.

Claim. $\bar{x}$ is the value of $m$ that minimizes $V(m)$.

To see this, take the derivative and set it to zero:

$$V'(m) = -\frac{2}{n} \sum_{i=1}^{n} (x_i - m) = 0 \quad\Longrightarrow\quad m = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}$$

The second derivative is $V''(m) = 2 > 0$, confirming this is a minimum.

$V$ is a convex quadratic in $m$, and $\bar{x}$ is its unique minimizer. Therefore, for any $m \neq \bar{x}$:

$$V(m) > V(\bar{x})$$

In particular, since $\mu \neq \bar{x}$ in general:

$$V(\bar{x}) < V(\mu)$$
Try it yourself: drag the centering point $m$ across the axis and watch $V(m)$ change. Notice it always bottoms out at $\bar{x}$.
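The original interactive demo isn't reproduced here, but a minimal NumPy sketch of the same idea (the data, grid, and seed are arbitrary) shows $V$ bottoming out at $\bar{x}$, below its value at the true mean:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 3.0, 2.0
x = rng.normal(mu, sigma, size=10)            # one observed sample
xbar = x.mean()

grid = np.linspace(mu - 3, mu + 3, 601)       # candidate centering points m
V = np.array([np.mean((x - m) ** 2) for m in grid])

print("argmin of V:", grid[V.argmin()])       # ~ xbar
print("xbar:       ", xbar)
print("V(xbar) =", np.mean((x - xbar) ** 2), "<  V(mu) =", np.mean((x - mu) ** 2))
```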
This is the entire argument. By plugging in $\bar{x}$ instead of $\mu$, we are plugging in the value that minimizes the sum of squared deviations. The true mean $\mu$ almost never coincides with $\bar{x}$ exactly, so $V(\bar{x})$ is almost always strictly less than $V(\mu)$.
We are guaranteed to underestimate the variance.
Curvature, Variance, and the n − 1
Since $V$ is a quadratic, we can write it exactly as a Taylor expansion around its minimum:

$$V(m) = V(\bar{x}) + \tfrac{1}{2} V''(\bar{x})\,(m - \bar{x})^2 = V(\bar{x}) + (m - \bar{x})^2$$
This follows from the identity $\frac{1}{n}\sum_{i}(x_i - m)^2 = \frac{1}{n}\sum_{i}(x_i - \bar{x})^2 + (m - \bar{x})^2$, and noting that $V''(\bar{x}) = 2$.
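For completeness, the identity comes from adding and subtracting $\bar{x}$ inside the square; the cross term vanishes because the deviations from $\bar{x}$ sum to zero:

$$\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n}(x_i - m)^2
&= \frac{1}{n}\sum_{i=1}^{n}\bigl[(x_i - \bar{x}) + (\bar{x} - m)\bigr]^2 \\
&= \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
  + 2(\bar{x} - m)\cdot\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})
  + (\bar{x} - m)^2 \\
&= V(\bar{x}) + (m - \bar{x})^2
\end{aligned}$$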
The curvature $V'' = 2$, always, regardless of the data or $n$. So the gap between $V(\mu)$ and $V(\bar{x})$ is:

$$V(\mu) - V(\bar{x}) = (\mu - \bar{x})^2$$

This is the bias for a single sample. It depends on how far $\bar{x}$ happened to land from $\mu$.
Now we need two ingredients:
- The curvature of $V$ tells us how sensitive the loss is to perturbations in the centering point. Here $V'' = 2$.
- The variance of $\bar{x}$ tells us how much perturbation we actually have. Since $\bar{x}$ is an average of $n$ independent draws, $\operatorname{Var}(\bar{x}) = \sigma^2 / n$.

Why $\sigma^2 / n$? Variance scales with the square of constants: $\operatorname{Var}(aX) = a^2 \operatorname{Var}(X)$. Averaging $n$ independent things reduces variance by a factor of $n$.
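A quick simulation sketch (arbitrary parameters and seed) confirms the $\sigma^2 / n$ scaling:

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma = 3.0, 2.0

for n in (5, 20, 80):
    xbars = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)
    print(n, round(xbars.var(), 4), sigma**2 / n)   # empirical vs. theoretical sigma^2 / n
```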
Their product, with the factor of $\tfrac{1}{2}$ from the Taylor expansion, gives the expected bias:

$$\mathbb{E}\!\left[V(\mu) - V(\bar{x})\right] = \mathbb{E}\!\left[(\mu - \bar{x})^2\right] = \tfrac{1}{2} \cdot 2 \cdot \frac{\sigma^2}{n} = \frac{\sigma^2}{n}$$
So our estimator $V(\bar{x}) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ satisfies:

$$\mathbb{E}\!\left[V(\bar{x})\right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$
The $n$ in the denominator of $\sigma^2/n$ comes from averaging $n$ independent samples: the more data we have, the less $\bar{x}$ wobbles, and the less we underestimate. The curvature of 2 is a fixed property of squared loss, not something we can change.
To correct the bias, divide by $n - 1$ instead of $n$:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{n}{n-1}\, V(\bar{x})$$

$s^2$ satisfies $\mathbb{E}[s^2] = \frac{n}{n-1} \cdot \frac{n-1}{n}\,\sigma^2 = \sigma^2$.
See the bias in action — draw repeated samples and watch the two estimators converge:
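The repeated-sampling demo isn't embedded here, but this short simulation sketch (arbitrary parameters and seed) makes the same comparison:

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, reps = 3.0, 2.0, 10, 200_000

samples = rng.normal(mu, sigma, size=(reps, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)          # sum of squared deviations from xbar

print("divide by n:    ", (ss / n).mean())        # ~ (n-1)/n * sigma^2 = 3.6
print("divide by n - 1:", (ss / (n - 1)).mean())  # ~ sigma^2 = 4.0
```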
One Parameter, One Sample Lost
Notice something about the bias $\sigma^2/n$. Our estimator $V(\bar{x})$ is an average of $n$ terms; had we centered at $\mu$, each would contribute $\sigma^2$ in expectation:

$$\mathbb{E}\!\left[V(\bar{x})\right] = \frac{n\sigma^2 - \sigma^2}{n} = \frac{(n-1)\,\sigma^2}{n}$$
The bias is $\sigma^2/n$: exactly one term’s worth. Estimating the mean “uses up” one data point, leaving $n - 1$ effective observations.
And the curvature is why it’s exactly one, not some fraction. The curvature of squared loss sets the exchange rate between parameters and samples: for this loss, one estimated parameter costs exactly one sample. That’s the “1” in $n - 1$.
Two Parameters: Simple Linear Regression
To see this in action, consider fitting a line to data $(x_1, y_1), \dots, (x_n, y_n)$. The loss function is now:

$$L(a, b) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - a - b x_i\right)^2$$
We estimate two parameters, the intercept $a$ and the slope $b$, by minimizing $L$. The Hessian of $L$ with respect to $(a, b)$ is now a $2 \times 2$ matrix, and the same curvature-times-uncertainty accounting runs once per parameter: each of the two squared-loss directions contributes one sample’s worth of bias, $\sigma^2/n$. So:

$$\mathbb{E}\!\left[L(\hat{a}, \hat{b})\right] = \sigma^2 - \frac{2\sigma^2}{n} = \frac{n-2}{n}\,\sigma^2$$
Each parameter costs one sample. Two parameters, two samples lost. The unbiased estimator of variance from regression residuals divides by $n - 2$:

$$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \left(y_i - \hat{a} - \hat{b} x_i\right)^2$$
This is why a line through two points has zero residual variance: you’ve used all your data to fit, leaving $n - 2 = 0$ degrees of freedom. There’s nothing left to measure noise with.
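The same numerical check works for the regression case. This simulation sketch (an arbitrary true line, noise level, and design) verifies that dividing the residual sum of squares by $n - 2$ recovers $\sigma^2$ on average:

```python
import numpy as np

rng = np.random.default_rng(4)
a_true, b_true, sigma, n, reps = 1.0, 0.5, 2.0, 10, 20_000
x = np.linspace(0, 9, n)                       # fixed design points

rss = []
for _ in range(reps):
    y = a_true + b_true * x + rng.normal(0, sigma, n)
    b_hat, a_hat = np.polyfit(x, y, 1)         # least-squares slope and intercept
    resid = y - (a_hat + b_hat * x)
    rss.append((resid ** 2).sum())

rss = np.array(rss)
print("divide by n:    ", (rss / n).mean())        # ~ (n-2)/n * sigma^2 = 3.2
print("divide by n - 2:", (rss / (n - 2)).mean())  # ~ sigma^2 = 4.0
```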
The pattern is always the same. Fit $p$ parameters by minimizing a squared loss, and the curvature tells you each one costs exactly one sample: divide by $n - p$.
The Broader Lesson
This argument generalizes far beyond variance estimation. Whenever you use data to estimate parameters and then evaluate the loss at those estimates, you will be biased optimistically — because you used the same data to both fit and evaluate.
This is the same phenomenon behind overfitting in machine learning, and why training loss is always lower than test loss. AIC, cross-validation, and Stein’s unbiased risk estimate all exist to correct for this.
The curvature of the loss determines the exchange rate. For squared loss it’s 1:1 — clean and simple. For other losses the rate may differ, but the structure is the same: bias = curvature × parameter uncertainty.
Bessel’s correction is not a quirk of variance estimation. It is the simplest instance of a deep idea: fitting costs information, and the curvature of your loss tells you exactly how much.