In Section 12.1 we discussed two methods for deriving estimators.
- If the quantity of interest is some summary feature of an unknown distribution, use the sample data to form an empirical distribution, and estimate the unknown quantity by computing the same summary on the empirical distribution.
- If the quantity of interest is a parameter of a data generating process that assigns chances to observed outcomes, estimate the unknown parameter, or combination of unknown parameters, with the values that would make the observed data most likely.
While these procedures for selecting estimators are reasonable, they don’t give any guarantee on the quality of the estimator. They say nothing about the reliability of the estimator, its accuracy, or the ways in which it tends to make mistakes.
In this section we will identify some properties of estimators that can be used to measure their quality. These properties are commonly used to compare estimators and to help choose among them when multiple reasonable estimators are available.
The Distribution of Estimates
Before identifying useful properties, we should clarify how we think about estimators. An estimator is a function. An estimate is the answer returned when we apply that function to an observed dataset. Once the dataset is observed, it is no longer random, so any particular estimate is just a number. This is how we thought about estimates in the previous section. We always started by saying, suppose that we observed the data $x_1, \ldots, x_n$, and then applied the estimator to $x_1, \ldots, x_n$.
The quality of an individual estimate is easy to think about. The estimate is good if the error is small. However, the quality of a single estimate doesn’t depend on the process that produces it. An unreliable process could produce a good estimate by accident, and a reliable process could return a bad estimate by accident.
So, if we want to understand the quality of the estimation process, i.e., the estimator, then we need to consider the distribution of possible estimates it might return as a function of the possible datasets we could receive. Instead of measuring the quality of a single estimate, to compare estimators we should consider how they would behave across hypothetical datasets.
For example, imagine that you needed to choose an estimator before you observed any data. Then, you would consider how the estimator might behave over all possible datasets. From this perspective, the estimate returned by the estimator is a function of a random variable, the data before observation, so it is itself random.
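To make this concrete, here is a small simulation sketch (not taken from the text; the exponential model, parameter values, and variable names are illustrative assumptions). It draws many hypothetical datasets, applies the same estimator to each, and summarizes the resulting distribution of estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: estimate the rate of an exponential distribution
# using the estimator 1 / sample mean.
true_rate = 2.0
n = 20            # observations per hypothetical dataset
trials = 10_000   # number of hypothetical datasets

# Each row is one hypothetical dataset; each dataset yields one estimate.
datasets = rng.exponential(scale=1 / true_rate, size=(trials, n))
estimates = 1 / datasets.mean(axis=1)

# The estimate varies from dataset to dataset: the estimator has a distribution.
print("mean of the estimates:", estimates.mean())
print("standard deviation of the estimates:", estimates.std())
```

Before any data is observed, this whole distribution, rather than any single number, is the right object to judge.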
Properties
Here are some properties we might want in an estimator:
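One basic property is consistency: roughly, the estimates should converge to the unknown quantity as the number of observations grows. Stated in generic notation (writing $\hat\Theta_n$ for the estimate based on $n$ observations and $\theta$ for the unknown quantity; the symbols are chosen here for illustration), a consistent estimator satisfies, for every $\epsilon > 0$:

$$
\lim_{n \to \infty} \mathbb{P}\!\left(\bigl|\hat\Theta_n - \theta\bigr| > \epsilon\right) = 0 .
$$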
Although we can never collect infinitely many observations in practice, we still care about consistency: to prove that an estimator is consistent, we usually have to show that it returns accurate estimates with high probability given sufficiently many observations.
Statements involving infinitely many observations are asymptotic, so consistency is an asymptotic guarantee. Guarantees involving finitely many observations are finite-sample guarantees.
Here’s an example finite-sample guarantee:
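One such guarantee is unbiasedness. In the same generic notation (with $\hat\Theta$ the estimator applied to the random, not-yet-observed data and $\theta$ the unknown quantity), an estimator is unbiased if its expected estimate equals the unknown quantity:

$$
\mathbb{E}[\hat\Theta] = \theta ,
\qquad \text{equivalently} \qquad
\text{bias}[\hat\Theta] = \mathbb{E}[\hat\Theta] - \theta = 0 .
$$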
The bias of an estimator, unlike the error in an individual estimate, is not a random quantity. It is determined by the distribution of the estimates, through the expectation $\mathbb{E}[\hat\Theta]$.
Estimators based on sample averages are often unbiased.
For example, in the last chapter we derived maximum likelihood estimators that take the form of a sample average: given observations $X_1, \ldots, X_n$, the estimate of the distribution's mean $\mu$ is $\hat\mu = \bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$. What is the bias in this estimator?

If the samples are identically distributed with $\mathbb{E}[X_i] = \mu$, then $\mathbb{E}[\bar X] = \mu$. So, the bias is:

$$
\mathbb{E}[\hat\mu] - \mu = \mathbb{E}[\bar X] - \mu = 0 .
$$

So, the estimator is unbiased.
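A quick Monte Carlo check (a sketch, not from the text; the Bernoulli model and parameter values are illustrative assumptions) agrees: averaged over many hypothetical datasets, the sample mean neither overshoots nor undershoots the quantity it estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative check: the sample mean of Bernoulli(p) observations
# is an unbiased estimator of p.
p, n, trials = 0.3, 25, 200_000
samples = rng.binomial(1, p, size=(trials, n))
estimates = samples.mean(axis=1)

print("average estimate:", estimates.mean())      # close to p = 0.3
print("estimated bias  :", estimates.mean() - p)  # close to 0
```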
Not all maximum likelihood estimators are unbiased. For example, if $X \sim \text{Geometric}(p)$ and we adopt the maximum likelihood estimator $\hat p = 1/X$, then the bias is:

$$
\mathbb{E}[\hat p] - p = \mathbb{E}\!\left[\frac{1}{X}\right] - p .
$$

To show that this estimator is biased, we'll use Jensen's inequality (see Section 4.1): the expected value of a convex function of a random variable is greater than the convex function of the expectation. Since $1/x$ is a convex function of $x$ for $x > 0$:

$$
\mathbb{E}\!\left[\frac{1}{X}\right] > \frac{1}{\mathbb{E}[X]} .
$$

The expected value of a geometric random variable with success probability $p$ is $1/p$ (see Sections 7.1 and 10.2). Therefore:

$$
\mathbb{E}\!\left[\frac{1}{X}\right] > \frac{1}{1/p} = p .
$$

It follows that the bias in the estimator is positive:

$$
\mathbb{E}[\hat p] - p = \mathbb{E}\!\left[\frac{1}{X}\right] - p > 0 .
$$

So, the maximum likelihood estimator for the parameter of a geometric random variable is a biased estimator. On average, its estimates will overestimate the unknown $p$.
Bias and Sample Size
If we collect $n$ independent samples $X_1, \ldots, X_n$ from a geometric distribution with unknown parameter $p$, then the maximum likelihood estimator for $p$ is:

$$
\hat p = \frac{n}{\sum_{i=1}^{n} X_i} = \frac{1}{\bar X} .
$$

We know that this estimator is biased when $n = 1$. Let's show it is biased for all finite $n$. To show it remains biased, we will use Jensen's inequality again:

$$
\mathbb{E}[\hat p] = \mathbb{E}\!\left[\frac{1}{\bar X}\right] > \frac{1}{\mathbb{E}[\bar X]} .
$$

We can simplify the denominator using expectation properties. Since expectations are linear and additive (see Section 4.2):

$$
\mathbb{E}[\bar X] = \mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[X_i] .
$$

Then, since the expected value of a geometric random variable with parameter $p$ is $1/p$:

$$
\mathbb{E}[\bar X] = \frac{1}{n}\cdot n \cdot\frac{1}{p} = \frac{1}{p} .
$$

So, by Jensen's inequality:

$$
\mathbb{E}[\hat p] > \frac{1}{\mathbb{E}[\bar X]} = \frac{1}{1/p} = p ,
$$

so the estimator remains biased for all finite $n$. No matter how many samples we observe, it will overestimate $p$ on average.

Nevertheless, the bias in the estimator decreases as the sample size $n$ increases, and converges to zero as $n$ approaches infinity. So, while the bias is never zero for any finite $n$, we can make it arbitrarily small by using a large enough sample. Estimators with this property are asymptotically unbiased.
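As a sanity check (a simulation sketch, not from the text; the parameter value and sample sizes are illustrative assumptions), we can estimate the bias of $\hat p = 1/\bar X$ for several sample sizes and watch it shrink toward zero without ever vanishing:

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.3           # illustrative true parameter
trials = 200_000  # hypothetical datasets per sample size

for n in [1, 2, 5, 10, 50, 100]:
    samples = rng.geometric(p, size=(trials, n))
    estimates = 1 / samples.mean(axis=1)
    print(f"n = {n:3d}   estimated bias = {estimates.mean() - p:+.4f}")
```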
Exact Analysis of Bias
In some cases we can analyze the bias in an estimator exactly.
For example, suppose that $X_1, \ldots, X_n \sim \text{Exponential}(\lambda)$ are drawn independently and identically. Suppose that $\lambda$ is unknown. Then, the maximum likelihood estimator for $\lambda$, given the observation $x_1, \ldots, x_n$, is:

$$
\hat\lambda = \frac{n}{\sum_{i=1}^{n} x_i} .
$$

So, as a random variable:

$$
\hat\lambda = \frac{n}{\sum_{i=1}^{n} X_i} = \frac{1}{\bar X} ,
$$

where $\bar X$ is the sample average, $\bar X = \frac{1}{n}\sum_{i=1}^{n} X_i$.

This estimator is biased for the same reason the estimator for the parameter of a geometric random variable is biased. Applying Jensen's inequality:

$$
\mathbb{E}[\hat\lambda] = \mathbb{E}\!\left[\frac{1}{\bar X}\right] > \frac{1}{\mathbb{E}[\bar X]} .
$$

The expected value of $\bar X$ is $\mathbb{E}[X_1]$ since the samples are identically distributed. The expected value of an exponential random variable is the reciprocal of its parameter (see Section 7.1), so:

$$
\mathbb{E}[\hat\lambda] > \frac{1}{\mathbb{E}[X_1]} = \frac{1}{1/\lambda} = \lambda .
$$
Therefore, the estimator has a positive bias.
In this case we can work out the exact value of the bias as a function of $n$ and $\lambda$.

First, let $S = \sum_{i=1}^{n} X_i$. Then $\hat\lambda = n / S$, so:

$$
\mathbb{E}[\hat\lambda] = n\,\mathbb{E}\!\left[\frac{1}{S}\right] .
$$

The random variable $S$ is the sum of $n$ independent, identical exponential random variables. So, we can work out its density function using the convolution formula (see Section 10.3). In Section 10.3, we derived the density for $X_1 + X_2$:

$$
f_{X_1 + X_2}(s) = \lambda^2 s\, e^{-\lambda s}, \qquad s \geq 0 .
$$

This is an example of a Gamma random variable with shape parameter 2 and rate parameter $\lambda$. On your 11th homework you iterated this analysis to show that the sum of $n$ independent, identical exponential random variables with rate parameter $\lambda$ is a Gamma random variable with shape parameter $n$ and rate parameter $\lambda$:

$$
f_S(s) = \frac{\lambda^n s^{\,n-1}}{(n-1)!}\, e^{-\lambda s}, \qquad s \geq 0 .
$$

We can now find $\mathbb{E}[1/S]$ directly:

$$
\mathbb{E}\!\left[\frac{1}{S}\right]
= \int_0^{\infty} \frac{1}{s}\, \frac{\lambda^n s^{\,n-1}}{(n-1)!}\, e^{-\lambda s}\, ds
= \frac{\lambda^n}{(n-1)!} \int_0^{\infty} s^{\,n-2} e^{-\lambda s}\, ds .
$$

To work out the integral, integrate by parts (see Section 7.1). You evaluated this integral on your 8th homework. In general:

$$
\int_0^{\infty} s^{k}\, e^{-\lambda s}\, ds = \frac{k!}{\lambda^{k+1}} .
$$

So:

$$
\int_0^{\infty} s^{\,n-2}\, e^{-\lambda s}\, ds = \frac{(n-2)!}{\lambda^{\,n-1}} .
$$

Therefore:

$$
\mathbb{E}\!\left[\frac{1}{S}\right] = \frac{\lambda^n}{(n-1)!} \cdot \frac{(n-2)!}{\lambda^{\,n-1}} = \frac{\lambda}{n-1} .
$$

Then, the expected estimate of $\lambda$ is:

$$
\mathbb{E}[\hat\lambda] = n\,\mathbb{E}\!\left[\frac{1}{S}\right] = \frac{n}{n-1}\,\lambda .
$$

This expectation is larger than $\lambda$ for all finite $n$ since $\frac{n}{n-1} > 1$. The exact bias is:

$$
\mathbb{E}[\hat\lambda] - \lambda = \frac{n}{n-1}\,\lambda - \lambda = \frac{\lambda}{n-1} .
$$
So, while the estimator is biased, its bias decreases as we account for more observations. In particular, the bias in the estimator is proportional to $\frac{1}{n-1}$, so it decays to zero at rate $1/n$.

In other words, the maximum likelihood estimator for the parameter of an exponential distribution, based on $n$ independent, identical samples, is asymptotically unbiased. The bias in the estimator converges to zero as $n$ diverges!
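The exact formula is easy to check by simulation (a sketch, not from the text; the rate and sample size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

lam = 2.0         # illustrative true rate
n = 10            # observations per hypothetical dataset
trials = 500_000  # number of hypothetical datasets

samples = rng.exponential(scale=1 / lam, size=(trials, n))
estimates = n / samples.sum(axis=1)

print("simulated expected estimate:", estimates.mean())
print("exact     expected estimate:", n / (n - 1) * lam)
print("simulated bias             :", estimates.mean() - lam)
print("exact     bias             :", lam / (n - 1))
```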
An estimator could be unbiased and consistent, yet return wildly varying estimates given finitely many observations.
We usually look for estimators that are both unbiased and precise since these estimators reliably return accurate estimates. To understand accuracy we should measure the expected size of the errors produced by the estimator.
The mean square error is a popular measure of accuracy since we can decompose it into a contribution from the bias and a contribution from the precision of the estimator. Let $\hat\Theta$ denote the estimator, viewed as a random variable, and let $\theta$ denote the unknown quantity. Then:

$$
\text{MSE}[\hat\Theta] = \mathbb{E}\!\left[(\hat\Theta - \theta)^2\right] .
$$

Let $b = \mathbb{E}[\hat\Theta] - \theta$ denote the bias in the estimator. Let $Z$ denote the centered variable $\hat\Theta - \mathbb{E}[\hat\Theta]$. Then $\hat\Theta - \theta = Z + b$, and we can use the linearity of expectation to expand the MSE:

$$
\mathbb{E}\!\left[(\hat\Theta - \theta)^2\right]
= \mathbb{E}\!\left[(Z + b)^2\right]
= \mathbb{E}[Z^2] + 2\,b\,\mathbb{E}[Z] + b^2 .
$$

The expected value of any centered variable is zero, so $\mathbb{E}[Z] = 0$. Therefore:

$$
\mathbb{E}\!\left[(\hat\Theta - \theta)^2\right] = \mathbb{E}[Z^2] + b^2 .
$$

The first term is the expected square of a centered variable, so it is the variance of the estimator. Thus, the mean square error decomposes into the variance of the estimator plus the square of its bias:

$$
\text{MSE}[\hat\Theta] = \text{Var}[\hat\Theta] + \text{bias}[\hat\Theta]^2 .
$$
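The decomposition is easy to verify numerically (a sketch, not from the text; the exponential model and values are illustrative assumptions). Here the biased estimator $\hat\lambda = 1/\bar X$ from the previous example is used again:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: the biased maximum likelihood estimator of an
# exponential rate, 1 / sample mean.
lam, n, trials = 2.0, 10, 500_000
samples = rng.exponential(scale=1 / lam, size=(trials, n))
estimates = n / samples.sum(axis=1)

mse = np.mean((estimates - lam) ** 2)
variance = estimates.var()
bias = estimates.mean() - lam

print("mean square error :", mse)
print("variance + bias^2 :", variance + bias ** 2)  # matches the MSE
```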
So, to make our estimators as accurate as possible, we will usually try to minimize the variance and the bias in the estimator at the same time. Usually, there is a trade-off between bias and variance. In many cases we cannot send both to zero, so we need to decide whether we are more willing to accept systematic errors associated with a bias, or random errors associated with fluctuations in the data.
This trade-off is a fundamental idea in estimation. You’ll see it more in future courses. Often, we can reduce the variance in an estimator only by admitting a bias.
You can think about the trade-off this way. If our estimator is as accurate as possible, then we can’t reduce both the bias and the variance, otherwise we could find a more accurate estimator.
In practice, the trade-off between bias and variance usually represents a trade-off associated with incorporating some prior information about the unknown. This prior information might be modeled by fixing a distribution over the unknown (the Bayesian approach), or by explicitly biasing the estimator closer to some fixed value that we think would be a good estimate in the absence of any data (the Frequentist approach). The latter idea is called “regularization.”
Both approaches make the estimator less variable by making it less sensitive to the data. This reduces the variance, but typically increases the bias, since by making the estimator less sensitive, we keep it closer to some prior estimate we would have made in the absence of any data. By biasing our estimator towards the estimate we would have made in the absence of data, we bias it away from the true value. This can improve the overall accuracy of the estimator if the reduction in variance outweighs the added error due to bias.
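As an illustration of the trade-off (a sketch, not from the text; the shrinkage weight, the prior guess, and all parameter values are illustrative assumptions), consider shrinking the sample mean toward a fixed prior guess. The shrunken estimator is biased, but its lower variance can give it a smaller mean square error than the unbiased sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma, n, trials = 1.0, 4.0, 10, 200_000
prior_guess = 0.0   # fixed value we shrink toward, chosen before seeing data
weight = 0.7        # how much weight the data gets relative to the prior guess

samples = rng.normal(mu, sigma, size=(trials, n))
sample_means = samples.mean(axis=1)
shrunken = weight * sample_means + (1 - weight) * prior_guess

for name, est in [("sample mean", sample_means), ("shrunken", shrunken)]:
    bias = est.mean() - mu
    mse = np.mean((est - mu) ** 2)
    print(f"{name:12s}  bias = {bias:+.3f}   variance = {est.var():.3f}   MSE = {mse:.3f}")
```

With these illustrative numbers, the shrunken estimator trades a small systematic error for a large reduction in variance, and ends up more accurate overall.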