
12.1 Estimators

Suppose that you wanted to guess an unknown value based on some related data. In statistics we call this sort of problem an estimation problem. We’ll focus on two examples in this section:

  1. Estimating an unknown average: For example, you might want to estimate the income of the average American household based on reported incomes from a sample of 10,000 households. Since most of our summary measures (expected value, variance, covariance, correlation) are defined as averages over a distribution, we can use this problem of estimating an unknown average to study estimators for a variety of measures that summarize a distribution.

  2. Estimating an unknown parameter: For example, you could model the waiting times between random events as exponential random variables with an unknown parameter, $\lambda$. Then, you could try to estimate the parameter from observed waiting times. Alternately, you might be a scientist interested in the probability that a treatment causes an effect. If you perform $n$ independent, identical trials, then the number of observed cases that show the desired effect is binomially distributed with an unknown “success” probability $p$. In this case you would want to estimate the unknown probability $p$ from the observed experimental outcomes. In both cases you are estimating an unknown parameter of a probability model. The same problem shows up anytime we try to model observed data with a distribution that depends on parameters. If we don’t know the parameters, we have to estimate them from the data.

In both cases we can write the problem abstractly as follows.

  1. There is some quantity $\theta$ whose value is unknown. We aim to estimate it. Frequentist approaches treat this quantity as fixed, while Bayesian approaches treat it as a sample of a random variable. In this book, we will adopt the Frequentist approach.

  2. We can observe some collection of random variables, $\{X_j\}_{j=1}^n$. We denote the random variables with a capital letter, and use $x_j$ to denote specific observed values. You can think of $\{X_j\}_{j=1}^n$ as the possible datasets we could observe, and $\{x_j\}_{j=1}^n$ as a specific observed dataset. The dataset should be related to the unknown quantity in the sense that changing the unknown quantity would change the distribution of possible datasets $\{X_j\}_{j=1}^n$.

  3. We adopt a procedure that produces a guess at the unknown quantity using the observed data. Since this procedure maps from a collection of input values (the data points) to an output value (the estimated value), it is a function from datasets to estimates. We call a function that accepts data and returns an estimate an estimator.

We will assume that our estimators produce an estimate given any dataset, and that the estimates are deterministic functions of the observed data. In this case, every estimator is a scalar-valued function of the data.

The definition above provides no direction for choosing an estimator. In most estimation problems there are a variety of reasonable estimators to pick from.

In this section we will review two strategies for selecting estimators.

Empirical Averages

We’ll begin with a classic example.

Suppose that $\{X_j\}_{j=1}^n$ are drawn independently and identically from a distribution with an unknown expected value, $\mathbb{E}[X_j] = \mu$. How should we estimate $\mu$?

The simplest idea is to collect the observed data points, and average them:

$$\hat{\mu}(x_1, x_2, ..., x_n) = \frac{1}{n} \sum_{j=1}^n x_j.$$

The quantity on the left is the estimator. It is a function of the sampled data that estimates the unknown mean μ\mu. The expression on the right defines the estimator. It says, to estimate an unknown mean, try averaging the observed samples.
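As a minimal sketch (plain Python, no libraries assumed), the sample-average estimator is just a function from a dataset to a number:

```python
# The sample-average estimator: a deterministic function that maps an
# observed dataset to a single guess at the unknown mean.
def mu_hat(xs):
    """Estimate an unknown mean by averaging the observed samples."""
    return sum(xs) / len(xs)

print(mu_hat([2.0, 4.0, 6.0]))  # prints 4.0
```

Any deterministic rule with this shape, data in, number out, is an estimator; averaging is just the most natural choice for an unknown mean.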

The average of a collection of observed samples is the sample or empirical average of the data.

We should be careful, at this stage, to distinguish the sample average, which is a function of the observed data and thus a random variable before observation, from the background expectation, $\mu$. The expectation is a fixed value determined by the distribution; in this example, it is unknown. The sample average is a function of the observed data, so it varies with the data. It is not the expected value $\mu$, but it provides a reasonable estimate of $\mu$. In sampling settings it is common to call the unknown, background expected value a population average or population mean to emphasize that it represents the average you would see if you could sample the entire population.

Sometimes it is helpful to think of sample averages as expected values. They are not the expected value of the underlying distribution that produced the data, but they are the expected value of a related distribution: the empirical distribution of the data.

We call an empirical distribution “empirical” since it is the distribution of the observed data. In contrast, the background distribution of each XjX_j is the distribution of each possible datapoint. It describes hypothetical outcome values.

Whenever you’ve plotted a histogram generated by a collection of samples, you’ve visualized the empirical distribution.

We can define other sample summaries by applying our usual summary measures to the empirical distribution of the data. For example, the variance in the empirical distribution is:

$$\text{Var}_J[x_J] = \frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2 \quad \text{where} \quad \bar{x} = \frac{1}{n} \sum_{j=1}^n x_j = \mathbb{E}_J[x_J].$$
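As a sketch (assuming NumPy is available), the empirical variance is the $1/n$-weighted average of squared deviations from the empirical mean, which matches NumPy's default population variance rather than the $1/(n-1)$ version:

```python
import numpy as np

# Empirical variance: the variance of the empirical distribution, which
# places probability 1/n on each observed data point.
def empirical_variance(xs):
    xs = np.asarray(xs, dtype=float)
    x_bar = xs.mean()                  # empirical mean, E_J[x_J]
    return ((xs - x_bar) ** 2).mean()  # Var_J[x_J], with 1/n weighting

xs = [1.0, 2.0, 3.0, 4.0]
print(empirical_variance(xs))  # prints 1.25
print(np.var(xs))              # same: np.var uses 1/n (ddof=0) by default
```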

We saw sample covariances and correlations in Sections 11.1 and 11.2:

$$\text{Cov}_J[x_J, y_J] = \frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})(y_j - \bar{y}), \quad \text{Corr}_J[x_J, y_J] = \frac{\text{Cov}_J[x_J, y_J]}{\text{SD}_J[x_J]\, \text{SD}_J[y_J]}.$$

Notice that, in every case, the data set is fixed (lowercase $x$ and $y$). We introduce randomness by selecting a datapoint uniformly at random. Alternately, we are simply replacing averages over the hypothetical distribution of possible samples with averages over the distribution of observed samples.

Now comes the main idea: the empirical distribution produced by independent and identical samples from a distribution will approximate that distribution.

We’ve used this idea a number of times in the course. It is related to the statement that probabilities should recover long run frequencies. You’ve observed it, and relied on it, every time you used a histogram of observed samples to estimate a background distribution. When the number of samples is large, empirical distributions settle down to the background distribution of hypothetical samples. You can check this idea by running the distribution demos we first introduced in Section 2.2.

```python
from utils_dist import run_distribution_explorer

run_distribution_explorer();
```

Notice that, in every case, as you increase the number of samples, the empirical distribution displayed as a histogram approaches the background distribution that produced the data.

It follows that most quantities produced by applying a calculation to an empirical distribution will approximate the quantity that would have been produced if we had applied the same calculation to the hypothetical distribution. In particular, sample averages will approximate the matching expectations:

$$\mathbb{E}_X[X] \approx \mathbb{E}_J[x_J], \quad \text{Var}_X[X] \approx \text{Var}_J[x_J], \quad \text{Cov}_{X,Y}[X,Y] \approx \text{Cov}_J[x_J, y_J].$$
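You can check these approximations numerically. The sketch below (NumPy assumed; the seed and the exponential example are arbitrary choices) draws i.i.d. samples from an exponential distribution with known mean $1/2$ and variance $1/4$, and compares the empirical summaries to the true values as $n$ grows:

```python
import numpy as np

# Empirical summaries approach the matching background summaries as the
# number of i.i.d. samples grows. True mean = 0.5, true variance = 0.25.
rng = np.random.default_rng(0)
for n in [100, 10_000, 1_000_000]:
    xs = rng.exponential(scale=0.5, size=n)
    print(n, xs.mean(), xs.var())
```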

This same logic justifies estimators for quantities that cannot be expressed as averages, but could be computed if we knew the background distribution. For example, it is standard practice to estimate the median of an unknown distribution with the median of a collection of samples. The median of the samples is the median of the empirical distribution, which approximates the background distribution. Similarly, if you’ve ever used a bootstrap, then you’ve used the idea that independent, identical sampling from the empirical distribution approximates independent and identical sampling from the background distribution that produced the data.
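Both ideas can be sketched in a few lines (NumPy assumed; the seed, sample size, and exponential example are arbitrary): estimate a median with the empirical median, then bootstrap by resampling from the empirical distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=500)  # true median is ln(2) ≈ 0.693

# Plug-in estimate: the median of the empirical distribution.
median_hat = np.median(data)

# Bootstrap: i.i.d. resampling from the empirical distribution stands in
# for i.i.d. sampling from the unknown background distribution.
boot_medians = [np.median(rng.choice(data, size=data.size, replace=True))
                for _ in range(1000)]
print(median_hat, np.std(boot_medians))
```

The spread of the bootstrap medians gives a rough sense of how much the median estimate would vary from dataset to dataset.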

We don’t have the tools yet to prove that empirical distributions approach the background distribution that generates samples, or even that sample averages provide good estimators for unknown averages. That analysis is the subject of the last chapter of this book.

For now, we will show that, in many cases, we can justify the use of sample averages as estimators by choosing our estimators to maximize the likelihood of the observed data.

Maximum Likelihood Estimation

Here’s a different idea.

Suppose that the unknown quantity is a parameter of a distribution that could have plausibly generated the observed data. For example, if we think our data was generated by samples from an exponential distribution, then we would want an estimate of the parameter of the exponential, $\lambda$.

This case is different from the examples described before since, if we knew the value of the parameter, then we could assign probabilities to the observed data. In the previous case we made no claim about the distribution that generated the data, so even if we had known the true mean, or variance, we could not have used that knowledge to evaluate the probability of the observed data. Accordingly, we adopted an estimation strategy that should work, no matter the actual distribution that generated the data.

If we know, or are willing to guess, the family of distributions that produced the data, then we can adopt estimators that are specially tailored to those distributions. In particular, we can select as our estimator the parameter value that, if true, would have made the observed data most likely.

Examples

Binomial

You are running a binary experiment. Your goal is to estimate the chance that your experiment succeeds, $p$. You consider $p$ fixed, but unknown. You run your experiment $n$ times, and see $S = s$ successes. How can you estimate $p$ from $n$ and $s$?

The simplest answer is: pick the estimate $\hat{p}(s) = s/n$. Here we’ve given $p$ a “hat”, $\hat{p}$, to remind ourselves that the ratio $s/n$ is an estimator, not the true probability. This estimator is entirely natural. If you ran 100 trials and saw 60 successes, then you should estimate $p \approx \hat{p} = 0.6$. After all, we defined chances as long run frequencies way back in Section 1.2.

Let’s justify that answer using maximum likelihood.

Imagine that there is some true success probability $p$, but $p$ is unknown to you. If your trials are independent and identical, then the number of successes, $S$, in $n$ trials is a binomial random variable. So:

$$\text{Pr}(S = s; p) = \text{PMF}(s; p) = \binom{n}{s} p^s (1 - p)^{n - s}.$$

So, the maximum likelihood estimation problem is:

$$\hat{p}(s) = \underset{p \in [0,1]}{\text{argmax}} \; \binom{n}{s} p^s (1 - p)^{n - s}.$$

We solved this problem in Section 3.3. Here’s our old solution:

$$\hat{p}(s) = \frac{s}{n}.$$

In other words, the maximum likelihood estimator for the probability of an event, given nn independent and identical repetitions of the process, is the number of times the event occurs, divided by the number of trials. It is the empirical frequency of the event over the observation record. The unknown probability is the hypothetical frequency over infinitely many trials.
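We can check this numerically. The sketch below (NumPy assumed; $n = 100$ and $s = 60$ are arbitrary) evaluates the binomial likelihood on a grid of candidate values of $p$ and confirms that it peaks at $s/n$:

```python
import math
import numpy as np

# The likelihood of s = 60 successes in n = 100 trials, as a function of
# the candidate success probability p, peaks at p = s/n = 0.6.
n, s = 100, 60
grid = np.linspace(0.001, 0.999, 999)
likelihood = math.comb(n, s) * grid**s * (1 - grid)**(n - s)
print(grid[np.argmax(likelihood)])  # ≈ 0.6
```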

We can view this estimator as a sample average.

Let $I_j$ be an indicator for success on the $j^{th}$ trial. Then $S = \sum_{j=1}^n I_j$. Let $i_j$ be an indicator for the observed outcome of the $j^{th}$ trial. Then, $s = \sum_{j=1}^n i_j$. So, our maximum likelihood estimator is:

$$\hat{p}(i_1, i_2, ..., i_n) = \frac{s}{n} = \frac{1}{n} \sum_{j=1}^n i_j.$$

So, given $n$ independent and identical repetitions of the process, the maximum likelihood estimator for the probability of an event is the sample average of the indicators for the event.

We’ll see that many maximum likelihood estimators are sample averages. When we can relate a sample average to a maximum likelihood estimator, then we can justify our use of the sample average as an estimator by claiming that the sample average returns the parameter estimate that, if true, would make the observed data most likely.

Let’s try a different example where the maximum likelihood estimator is not a sample average.

Geometric

Suppose that, instead of counting the number of successes in a fixed number of trials, we repeated our experiment until our first success, then stopped. The number of trials until the first success is related to the chance of success per trial, since the more likely we are to succeed, the earlier we will usually stop.

If we observe that the first success occurred on trial $w$, then $W \sim \text{Geometric}(p)$, the likelihood of the observation is $\text{Pr}(W = w; p) = (1 - p)^{w - 1} p$, and maximizing its logarithm, $(w - 1)\log(1 - p) + \log(p)$, gives the maximum likelihood estimator $\hat{p}(w) = 1/w$.

This is, again, an entirely natural answer. If it took 10 trials to see our first success, then we saw 1 success in 10 attempts, so we should estimate $p$ with $1/10$.

Note, this estimator is not a sample average. Nevertheless, we can relate it to a sample average if we draw multiple geometric variables.

Suppose that we sampled $\{W_j\}_{j=1}^n$ independently and identically from $\text{Geometric}(p)$ and observe $W_1 = w_1, W_2 = w_2, ..., W_n = w_n$. What is the maximum likelihood estimator for $p$ now?

Now, the probability of our observation is:

$$\text{Pr}(W_1 = w_1, ..., W_n = w_n; p) = \prod_{j=1}^n \text{Pr}(W_j = w_j; p) = \prod_{j=1}^n (1 - p)^{w_j - 1} p.$$

To maximize this probability, we’ll maximize:

$$g(p) = \frac{1}{n} \log(\text{Pr}(W_1 = w_1, ..., W_n = w_n; p)) = \frac{1}{n} \sum_{j=1}^n \left[ (w_j - 1) \log(1 - p) + \log(p) \right].$$

Let $\bar{w} = \frac{1}{n}\sum_{j=1}^n w_j$ denote the sample average of the observations. Then, simplifying by moving all terms that do not depend on $j$ outside the sum gives:

$$g(p) = (\bar{w} - 1) \log(1 - p) + \log(p).$$

This function has the same form as the function we maximized before, replacing $w$ with $\bar{w}$. So, it points to the same solution, replacing $w$ with $\bar{w}$: the maximum likelihood estimator is $\hat{p}(w_1, ..., w_n) = 1/\bar{w}$.
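A quick numerical check (NumPy assumed; the observed waiting times below are made up for illustration): on a grid of candidate $p$ values, $g(p)$ peaks at $1/\bar{w}$:

```python
import numpy as np

# g(p) = (w_bar - 1) log(1 - p) + log(p) is maximized at p = 1 / w_bar.
w = np.array([3, 7, 2, 8, 5])  # hypothetical observed waiting times
w_bar = w.mean()               # 5.0, so the maximizer should be 0.2

grid = np.linspace(0.001, 0.999, 999)
g = (w_bar - 1) * np.log(1 - grid) + np.log(grid)
print(grid[np.argmax(g)])  # ≈ 0.2 = 1 / w_bar
```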

Normal

What if there is more than one unknown parameter?

Then, to find an MLE, optimize jointly over all unknown parameters.

Let’s do a continuous example.

Suppose that $\{X_j\}_{j=1}^n$ are drawn independently and identically from $\text{Normal}(\mu, \sigma^2)$, with both $\mu$ and $\sigma^2$ unknown, and that we observe $X_1 = x_1, ..., X_n = x_n$. The maximum likelihood estimators jointly maximize the likelihood of the observed data over $\mu$ and $\sigma^2$.

You solved this problem in your optimization homework. Here’s the solution again:

$$\hat{\mu} = \frac{1}{n} \sum_{j=1}^n x_j = \bar{x}, \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{j=1}^n (x_j - \bar{x})^2.$$

So, in the case when our data is drawn from a normal distribution, we can justify the sample average and variance as estimators by arguing that they maximize the likelihood of the observed data!
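As a closing check (NumPy assumed; the data points and grid resolutions are arbitrary), a joint grid search over $(\mu, \sigma^2)$ lands on the sample mean and the $1/n$-weighted sample variance:

```python
import numpy as np

# Log-likelihood of i.i.d. normal data as a function of mu and sigma^2.
def normal_log_likelihood(xs, mu, var):
    return np.sum(-0.5 * np.log(2 * np.pi * var) - (xs - mu) ** 2 / (2 * var))

xs = np.array([1.2, 0.7, 2.3, 1.8, 0.5])
mu_grid = np.linspace(0.0, 3.0, 301)    # step 0.01
var_grid = np.linspace(0.05, 2.0, 391)  # step 0.005

# Joint grid search over both unknown parameters at once.
ll = np.array([[normal_log_likelihood(xs, m, v) for v in var_grid]
               for m in mu_grid])
i, j = np.unravel_index(np.argmax(ll), ll.shape)
print(mu_grid[i], var_grid[j])  # maximizer of the likelihood
print(xs.mean(), xs.var())      # sample mean and 1/n sample variance
```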