Why We Divide by n − 1

Understanding sample variance as an optimization problem, and why estimating the mean guarantees we underestimate the true variance.

Variance as an Optimization Problem

Suppose we have a random variable $X$ with unknown mean $\mu^*$ and unknown variance $\sigma^2$. We observe $n$ samples $x_1, x_2, \ldots, x_n$.

The variance of $X$ around any point $\mu$ can be written as a function:

$$V(\mu) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$

This is just the average squared distance of the data from $\mu$. We can think of $V$ as a function of the centering point, and it turns out this perspective reveals something fundamental about why sample variance is biased.
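To make this concrete, here is a minimal Python sketch of $V(\mu)$; the sample values are made up purely for illustration:

```python
import numpy as np

# A small illustrative sample; the specific values are arbitrary.
x = np.array([2.1, 4.5, 3.3, 5.0, 4.1])

def V(mu):
    """Average squared distance of the data from the centering point mu."""
    return np.mean((x - mu) ** 2)

print(V(3.0))        # variance of the data around mu = 3.0
print(V(x.mean()))   # variance around the sample mean (the smallest possible value)
```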

The True Variance

The quantity we actually want to estimate is $V(\mu^*)$, the average squared deviation from the true population mean $\mu^*$:

$$V(\mu^*) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu^*)^2$$

If we knew $\mu^*$, we would be done. We could compute $V(\mu^*)$ directly and it would be an unbiased estimator of $\sigma^2$.

[1] Strictly speaking, $V(\mu^*)$ really is unbiased: each term $(x_i - \mu^*)^2$ is a single draw whose expectation is exactly $\sigma^2$. The bias discussed in the rest of this piece enters only once we estimate $\mu^*$.
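A quick simulation backs up the unbiasedness claim; the normal distribution, seed, and parameter values below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
mu_star, sigma2, n = 5.0, 4.0, 10      # illustrative true mean, variance, sample size

# Average V(mu*) over many independent samples of size n.
values = []
for _ in range(100_000):
    x = rng.normal(mu_star, np.sqrt(sigma2), size=n)
    values.append(np.mean((x - mu_star) ** 2))

print(np.mean(values))                  # ~ 4.0: V(mu*) is unbiased for sigma^2
```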

But we don't know $\mu^*$. So we estimate it.

Estimating the Mean

The natural estimator of $\mu^*$ is the sample mean:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Now here is the critical observation. The sample mean $\bar{x}$ is not just any estimator of $\mu^*$. It has a very specific relationship to our variance function $V(\mu)$.

Claim. $\bar{x}$ is the value of $\mu$ that minimizes $V(\mu)$.

To see this, take the derivative and set it to zero:

$$\frac{dV}{d\mu} = \frac{1}{n}\sum_{i=1}^{n} -2(x_i - \mu) = 0 \quad\implies\quad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}$$

The second derivative is $V'' = 2 > 0$, confirming this is a minimum.
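The same conclusion can be checked numerically. The sketch below (again with made-up data) scans a fine grid of candidate centers and confirms that the minimizer of $V$ agrees with the sample mean:

```python
import numpy as np

x = np.array([2.1, 4.5, 3.3, 5.0, 4.1])         # illustrative sample

def V(mu):
    """Average squared distance of the data from mu."""
    return np.mean((x - mu) ** 2)

# Scan a fine grid of candidate centers and find the one minimizing V.
grid = np.linspace(x.min() - 1.0, x.max() + 1.0, 100_001)
mu_hat = grid[np.argmin([V(mu) for mu in grid])]

print(mu_hat, x.mean())                           # both ~ 3.8: the minimizer is the sample mean
```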

Key Insight

$V(\mu)$ is a convex quadratic in $\mu$, and $\bar{x}$ is its unique minimizer. Therefore, for any $\mu \neq \bar{x}$:

$$V(\bar{x}) < V(\mu)$$

In particular, since $\mu^* \neq \bar{x}$ in general:

$$V(\bar{x}) \leq V(\mu^*)$$

[Interactive figure: drag $\mu$ along the axis and watch $V(\mu)$ change. $V(\mu)$ always bottoms out at $\bar{x}$, so $V(\bar{x}) \leq V(\mu^*)$; the shaded gap between $V(\bar{x})$ and $V(\mu^*)$ is the bias.]

This is the entire argument. By plugging in $\bar{x}$ instead of $\mu^*$, we are plugging in the value that minimizes the sum of squared deviations. The true mean $\mu^*$ almost never coincides with $\bar{x}$ exactly, so $V(\bar{x})$ is almost always strictly less than $V(\mu^*)$.

We are guaranteed to underestimate the variance.

Curvature, Variance, and the n − 1

Since $V(\mu)$ is a quadratic, we can write it exactly as a Taylor expansion around its minimum:

$$V(\mu) = V(\bar{x}) + \frac{V''}{2}(\mu - \bar{x})^2$$

This follows from the identity $(x_i - \mu)^2 = (x_i - \bar{x})^2 + (\bar{x} - \mu)^2 + 2(x_i - \bar{x})(\bar{x} - \mu)$, and noting that $\sum(x_i - \bar{x}) = 0$.

The curvature is $V'' = 2$, always, regardless of the data or $n$. So the gap between $V(\mu^*)$ and $V(\bar{x})$ is:

$$V(\mu^*) - V(\bar{x}) = \frac{V''}{2}(\bar{x} - \mu^*)^2 = (\bar{x} - \mu^*)^2$$

This is the bias for a single sample. It depends on how far $\bar{x}$ happened to land from $\mu^*$.
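The quadratic expansion also means the gap $V(\mu) - V(\bar{x})$ equals $(\bar{x} - \mu)^2$ exactly, for any centering point $\mu$. A quick numerical check with illustrative data:

```python
import numpy as np

x = np.array([2.1, 4.5, 3.3, 5.0, 4.1])        # illustrative sample
xbar = x.mean()

def V(mu):
    return np.mean((x - mu) ** 2)

# For any centering point mu, the gap above the minimum is exactly (xbar - mu)^2.
for mu in (0.0, 3.0, 5.0, 7.5):
    print(np.isclose(V(mu) - V(xbar), (xbar - mu) ** 2))   # True every time
```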

Now we need two ingredients:

  1. The curvature of $V$ tells us how sensitive the loss is to perturbations in $\mu$. Here $V''/2 = 1$.

  2. The variance of $\bar{x}$ tells us how much perturbation we actually have. Since $\bar{x}$ is an average of $n$ independent draws, $\text{Var}(\bar{x}) = \sigma^2/n$.

[2.5] Why $\sigma^2/n$? Variance scales with the square of constants: $\text{Var}(\bar{x}) = \text{Var}\!\left(\frac{1}{n}\sum x_i\right) = \frac{1}{n^2}\cdot n\sigma^2 = \sigma^2/n$. Averaging $n$ independent things reduces variance by a factor of $n$.
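The $\sigma^2/n$ scaling is easy to verify by simulation; the distribution, number of trials, and sample sizes below are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2 = 4.0                                     # illustrative population variance

for n in (5, 20, 100):
    # 50,000 sample means, each computed from n independent draws.
    xbars = rng.normal(0.0, np.sqrt(sigma2), size=(50_000, n)).mean(axis=1)
    print(n, round(float(xbars.var()), 4), sigma2 / n)   # empirical Var(xbar) vs sigma^2 / n
```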

Their product gives the expected bias:

$$\mathbb{E}[\text{bias}] = \frac{V''}{2} \cdot \text{Var}(\bar{x}) = 1 \cdot \frac{\sigma^2}{n} = \frac{\sigma^2}{n}$$

So our estimator satisfies:

$$\mathbb{E}[V(\bar{x})] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n}\,\sigma^2$$

The $n$ in the denominator of $\sigma^2/n$ comes from averaging $n$ independent samples: the more data we have, the less $\bar{x}$ wobbles, and the less we underestimate. The curvature of 2 is a fixed property of squared loss, not something we can change.

To correct the bias, divide by $n - 1$ instead of $n$:

Bessel's Correction

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

satisfies $\mathbb{E}[s^2] = \sigma^2$.
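In NumPy, this is controlled by the ddof ("delta degrees of freedom") argument of np.var: the default divides by $n$, while ddof=1 divides by $n - 1$ and applies Bessel's correction.

```python
import numpy as np

x = np.array([2.1, 4.5, 3.3, 5.0, 4.1])        # illustrative sample, n = 5

print(np.var(x))                 # divides by n     -> plug-in (biased) variance
print(np.var(x, ddof=1))         # divides by n - 1 -> Bessel-corrected variance
print(np.var(x) * len(x) / (len(x) - 1))        # same as ddof=1, rescaled by hand
```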

See the bias in action — draw repeated samples and watch the two estimators converge:

[Interactive simulation: each trial draws $n$ samples from $N(0, 4)$ and computes both estimators. The ÷ n estimator consistently underestimates $\sigma^2 = 4$, while ÷ (n − 1) is unbiased.]
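The same experiment is easy to reproduce in code. The sketch below mirrors the demo's setup ($\sigma^2 = 4$), with a sample size of n = 10 chosen arbitrarily for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma2, n, trials = 4.0, 10, 200_000            # sigma^2 = 4 as in the demo; n = 10 is arbitrary

samples = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
xbar = samples.mean(axis=1, keepdims=True)
ss = ((samples - xbar) ** 2).sum(axis=1)        # sum of squared deviations, one per trial

print((ss / n).mean())          # ~ (n - 1)/n * sigma^2 = 3.6: the divide-by-n estimator runs low
print((ss / (n - 1)).mean())    # ~ sigma^2 = 4.0: Bessel's correction removes the bias
```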

One Parameter, One Sample Lost

Notice something about the bias $\sigma^2/n$. Our estimator $V(\mu^*)$ is an average of $n$ terms, each contributing $\sigma^2/n$ in expectation:

$$V(\mu^*) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu^*)^2, \qquad \mathbb{E}[\text{each term}] = \frac{\sigma^2}{n}$$

The bias is $\sigma^2/n$: exactly one term's worth. Estimating the mean "uses up" one data point, leaving $n - 1$ effective observations.

And $V''/2 = 1$ is why it's exactly one, not some fraction. The curvature of squared loss sets the exchange rate between parameters and samples: for this loss, one estimated parameter costs exactly one sample. That's the "1" in $n - 1$.

Two Parameters: Simple Linear Regression

To see this in action, consider fitting a line $y = a + bx$ to data $(x_1, y_1), \ldots, (x_n, y_n)$. The loss function is now:

$$L(a, b) = \frac{1}{n}\sum_{i=1}^{n}(y_i - a - bx_i)^2$$

We estimate two parameters, the intercept $\hat{a}$ and the slope $\hat{b}$, by minimizing $L$. The Hessian of $L$ with respect to $(a, b)$ is a $2 \times 2$ matrix, and with the regressor standardized so that $\frac{1}{n}\sum x_i^2 = 1$, its trace is $\text{tr}(\mathbf{H}) = 2 \times 2 = 4$ (two squared-loss directions, each contributing curvature 2). So:

$$\frac{\text{tr}(\mathbf{H})}{2} = 2$$
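As a sanity check on that curvature bookkeeping, the sketch below builds the Hessian of $L(a, b)$ for a made-up, standardized regressor and inspects its trace. The trace-equals-4 statement assumes the regressor is standardized; the $n - 2$ result that follows does not depend on that.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000
x = rng.normal(size=n)
x = (x - x.mean()) / x.std()          # standardize the regressor: mean 0, unit second moment

# Design matrix for y = a + b*x. Because L(a, b) is quadratic in (a, b),
# its Hessian is (2/n) * X^T X, independent of y.
X = np.column_stack([np.ones(n), x])
H = (2.0 / n) * X.T @ X

print(np.trace(H))                    # ~ 4: curvature 2 per parameter
print(np.trace(H) / 2)                # ~ 2: two parameters, two samples' worth
```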

Each parameter costs one sample. Two parameters, two samples lost. The unbiased estimator of variance from regression residuals divides by $n - 2$:

Residual Variance in Linear Regression

$$\hat{\sigma}^2 = \frac{1}{n - 2}\sum_{i=1}^{n}(y_i - \hat{a} - \hat{b}\,x_i)^2$$
[3] This is why a line through two points has zero residual variance: you've used all your data to fit, leaving $n - 2 = 0$ degrees of freedom. There's nothing left to measure noise with.
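A simulation sketch of the $n - 2$ claim, fitting the line with np.polyfit; the true intercept, slope, noise level, and sample size are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)
a_true, b_true, sigma2, n = 1.0, 2.0, 4.0, 12   # illustrative ground truth and sample size

div_n, div_n2 = [], []
for _ in range(20_000):
    x = rng.uniform(-1.0, 1.0, size=n)
    y = a_true + b_true * x + rng.normal(0.0, np.sqrt(sigma2), size=n)
    b_hat, a_hat = np.polyfit(x, y, 1)          # np.polyfit returns [slope, intercept] for deg=1
    resid = y - a_hat - b_hat * x
    div_n.append(np.sum(resid ** 2) / n)
    div_n2.append(np.sum(resid ** 2) / (n - 2))

print(np.mean(div_n))     # ~ (n - 2)/n * sigma^2 = 3.33: dividing by n runs low
print(np.mean(div_n2))    # ~ sigma^2 = 4.0: dividing by n - 2 is unbiased
```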

The pattern is always the same. Fit $k$ parameters by minimizing a squared loss, and the curvature tells you each one costs exactly one sample: divide by $n - k$.

The Broader Lesson

This argument generalizes far beyond variance estimation. Whenever you use data to estimate parameters and then evaluate the loss at those estimates, you will be biased optimistically — because you used the same data to both fit and evaluate.

[4] This is the same phenomenon behind overfitting in machine learning, and why training loss is always lower than test loss. AIC, cross-validation, and Stein's unbiased risk estimate all exist to correct for this.

The curvature of the loss determines the exchange rate. For squared loss it’s 1:1 — clean and simple. For other losses the rate may differ, but the structure is the same: bias = curvature × parameter uncertainty.

Bessel’s correction is not a quirk of variance estimation. It is the simplest instance of a deep idea: fitting costs information, and the curvature of your loss tells you exactly how much.