Suppose that you wanted to guess an unknown value based on some related data. In statistics we call this sort of problem an estimation problem. We’ll focus on two examples in this section:
Estimating an unknown average: For example, you might want to estimate the income of the average American household based on reported incomes from a sample of 10,000 households. Since most of our summary measures (expected value, variance, covariance, correlation) are defined as averages over a distribution, we can use this problem, estimating an unknown average, to study estimators for a variety of measures that summarize a distribution.
Estimating an unknown parameter: For example, you could model the waiting times between random events as exponential random variables with an unknown parameter, $\lambda$. Then, you could try to estimate the parameter from observed waiting times. Alternately, you might be a scientist interested in the probability that a treatment causes an effect. If you perform independent, identical trials, then the number of observed cases that show the desired effect is binomially distributed with an unknown “success” probability $p$. In this case you would want to estimate the unknown probability from the observed experimental outcomes. In both cases you are estimating an unknown parameter of a probability model. The same problem shows up anytime we try to model observed data with a distribution that depends on parameters. If we don’t know the parameters, we have to estimate them from the data.
In both cases we can write the problem abstractly as follows.
There is some quantity whose value is unknown. We aim to estimate it. Frequentist approaches treat this quantity as fixed, while Bayesian approaches treat it as a sample of a random variable. In this book, we will adopt the Frequentist approach.
We can observe some collection of random variables, $X_1, \dots, X_n$. We denote the random variables with capital letters, and use lowercase letters, $x_1, \dots, x_n$, to denote specific observed values. You can think of $X_1, \dots, X_n$ as the possible datasets we could observe, and $x_1, \dots, x_n$ as a specific observed dataset. The dataset should be related to the unknown quantity in the sense that changing the unknown quantity would change the distribution of possible datasets.
We adopt a procedure that produces a guess at the unknown quantity using the observed data. Since this procedure maps from a collection of input values (the data points) to an output value (the estimated value), it is a function from datasets to estimates. We call a function that accepts data and returns an estimate an estimator.
We will assume that our estimators produce an estimate given any dataset, and that the estimates are deterministic functions of the observed data. In this case, every estimator is a scalar-valued function of the data.
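Concretely, an estimator is just a deterministic function from a dataset to a number. Here is a minimal sketch (the function name and data are our own illustrative choices, not from the text):

```python
import numpy as np

def mean_estimator(data):
    """An estimator: maps an observed dataset to a single guessed value."""
    return np.mean(data)

# One specific observed dataset x_1, ..., x_n.
observed = [2.1, 3.7, 2.9, 4.0]

# Applying the estimator to the dataset produces the estimate.
print(mean_estimator(observed))  # 3.175
```

Feeding a different dataset to the same function produces a different estimate; before observation, the estimate is therefore a random variable.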
The definition above provides no direction for choosing an estimator. In most estimation problems there are a variety of reasonable estimators to pick from. In this section we will review two strategies for selecting estimators.
Empirical Averages¶
We’ll begin with a classic example.
Suppose that $X_1, \dots, X_n$ are drawn independently and identically from a distribution with an unknown expected value, $\mu = \mathbb{E}[X]$. How should we estimate $\mu$?
The simplest idea is to collect the observed data points and average them:

$$\hat{\mu}(x_1, \dots, x_n) = \frac{1}{n} \sum_{i=1}^{n} x_i$$

The quantity on the left is the estimator. It is a function of the sampled data that estimates the unknown mean $\mu$. The expression on the right defines the estimator. It says: to estimate an unknown mean, try averaging the observed samples.
The average of a collection of observed samples is called the sample average or empirical average of the data.
We should be careful, at this stage, to distinguish the sample average, which is a function of the observed data, and thus a random variable before observation, from the background expectation, $\mathbb{E}[X]$. The expectation is a fixed value determined by the distribution. It is, in this example, unknown. The sample average is a function of the observed data, so it varies with the data. It is not the expected value $\mathbb{E}[X]$; it provides a reasonable estimate of it. In sampling settings it is common to call the unknown, background expected value a population average or population mean to emphasize that it represents the average you would see if you could sample the entire population.
Sometimes it is helpful to think of sample averages as expected values. They are not the expected value of the underlying distribution that produced the data, but, they are the expected value of a related distribution, the empirical distribution of the data.
We call an empirical distribution “empirical” since it is the distribution of the observed data. In contrast, the background distribution of each $X_i$ is the distribution over each possible datapoint. It describes hypothetical outcome values.
Whenever you’ve plotted a histogram generated by a collection of samples, you’ve visualized the empirical distribution.
We can define other sample summaries by applying our usual summary measures to the empirical distribution of the data. For example, the variance in the empirical distribution is:

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2 \quad \text{where} \quad \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
We saw sample covariances and correlations in Sections 11.1 and 11.2:

$$\widehat{\text{cov}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu}_x)(y_i - \hat{\mu}_y), \qquad \hat{\rho}(x, y) = \frac{\widehat{\text{cov}}(x, y)}{\hat{\sigma}_x \hat{\sigma}_y}$$
Notice that, in every case, the data set is fixed (small $x$’s and $y$’s). We introduce randomness by selecting a datapoint uniformly at random. Alternately, we are simply replacing averages over the hypothetical distribution of possible samples with averages over the distribution of observed samples.
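The sample summaries above are just the usual formulas applied to the observed datapoints. Here is a short sketch with hypothetical paired data, using the same $\frac{1}{n}$ convention as the empirical distribution:

```python
import numpy as np

# Hypothetical paired observations (x_i, y_i).
x = np.array([2.0, 4.0, 4.0, 6.0, 9.0])
y = np.array([1.0, 3.0, 5.0, 5.0, 8.0])

x_bar = x.mean()                                  # sample average
var_x = ((x - x_bar) ** 2).mean()                 # variance of the empirical distribution
cov_xy = ((x - x_bar) * (y - y.mean())).mean()    # sample covariance
var_y = ((y - y.mean()) ** 2).mean()
corr_xy = cov_xy / np.sqrt(var_x * var_y)         # sample correlation

# Each summary is an expectation under the empirical distribution:
# a datapoint drawn uniformly at random from the observed data.
assert np.isclose(var_x, np.var(x))               # np.var defaults to the same 1/n convention
assert np.isclose(corr_xy, np.corrcoef(x, y)[0, 1])
```

Note that NumPy's `np.var` divides by $n$ by default, matching the empirical-distribution definition here, rather than the $n - 1$ convention used by some other libraries.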
Now comes the main idea: the empirical distribution, produced by independent and identical samples from a distribution, will approximate that distribution.
We’ve used this idea a number of times in the course. It is related to the statement that probabilities should recover long run frequencies. You’ve observed it, and relied on it, every time you used a histogram of observed samples to estimate a background distribution. When the number of samples is large, empirical distributions settle down to the background distribution of hypothetical samples. You can check this idea by running the distribution demos we first introduced in Section 2.2.
from utils_dist import run_distribution_explorer
run_distribution_explorer();

Notice that, in every case, as you increase the number of samples, the empirical distribution displayed as a histogram approaches the background distribution that produced the data.
It follows that most quantities produced by applying a calculation to an empirical distribution will approximate the quantity that would have been produced if we had applied the same calculation to the hypothetical distribution. In particular, sample averages will approximate the matching expectations.
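A small simulation illustrates the point (the exponential distribution and sample sizes here are arbitrary choices for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
rate = 2.0  # exponential rate parameter; the true mean is 1/rate = 0.5

# As the number of samples grows, the sample average settles
# toward the matching expectation of the background distribution.
for n in [10, 1_000, 100_000]:
    samples = rng.exponential(scale=1 / rate, size=n)
    print(n, samples.mean())
```

With 100,000 samples the printed sample average lands very close to the true mean of 0.5; with 10 samples it can miss badly.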
This same logic justifies estimators for quantities that cannot be expressed as averages, but could be computed if we knew the background distribution. For example, it is standard practice to estimate the median of an unknown distribution with the median of a collection of samples. The median of the samples is the median of the empirical distribution, which approximates the background distribution. Similarly, if you’ve ever used a bootstrap, then you’ve used the idea that independent, identical sampling from the empirical distribution approximates independent and identical sampling from the background distribution that produced the data.
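The plug-in logic can be sketched in a few lines (the data below is simulated for illustration; `rng.choice` with replacement implements sampling from the empirical distribution):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=1.0, size=200)  # observed dataset; true median is ln 2

# Plug-in estimate: the median of the empirical distribution.
median_hat = np.median(data)

# Basic bootstrap: i.i.d. resampling from the empirical distribution
# stands in for i.i.d. sampling from the background distribution,
# which lets us gauge the variability of the estimate.
boot_medians = [
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(2_000)
]
print(median_hat, np.std(boot_medians))  # estimate and its bootstrap standard error
```

The spread of the bootstrapped medians approximates how much the sample median would vary across repeated datasets.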
We don’t have the tools yet to prove that empirical distributions approach the background distribution that generates samples, or even that sample averages provide good estimators for unknown averages. That analysis is the subject of the last chapter of this book.
For now, we will show that, in many cases, we can justify the use of sample averages as estimators by choosing our estimators to maximize the likelihood of the observed data.
Maximum Likelihood Estimation¶
Here’s a different idea.
Suppose that the unknown quantity is a parameter of a distribution that could have plausibly generated the observed data. For example, if we think our data was generated by samples from an exponential distribution, then we would want an estimate of the parameter of the exponential, $\lambda$.
This case is different from the examples described before since, if we knew the value of the parameter, then we could assign probabilities to the observed data. In the previous case we made no claim about the distribution that generated the data, so even if we had known the true mean, or variance, we could not have used that knowledge to evaluate the probability of the observed data. Accordingly, we adopted an estimation strategy that should work, no matter the actual distribution that generated the data.
If we know, or are willing to guess, the family of distributions that produced the data, then we can adopt estimators that are specially tailored to those distributions. In particular, we can select as our estimator the parameter value that, if true, would have made the observed data most likely.
Examples¶
Binomial¶
You are running a binary experiment. Your goal is to estimate the chance that your experiment succeeds, $p$. You consider $p$ fixed, but unknown. You run your experiment $n$ times, and see $k$ successes. How can you estimate $p$ from $k$ and $n$?
The simplest answer is: pick the estimate $\hat{p} = k/n$. Here we’ve given $p$ a “hat”, $\hat{p}$, to remind ourselves that the ratio is an estimator, not the true probability. This estimator is entirely natural. If you ran 100 trials and saw 60 successes, then you should estimate $\hat{p} = 60/100 = 0.6$. After all, we defined chances as long run frequencies way back in Section 1.2.
Let’s justify that answer using maximum likelihood.
Imagine that there is some true success probability $p$, but $p$ is unknown to you. If your trials are independent and identical, then the number of successes, $K$, in $n$ trials is a binomial random variable. So:

$$P(K = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$
So, the maximum likelihood estimation problem is:

$$\hat{p} = \arg\max_{p \in [0, 1]} \binom{n}{k} p^k (1 - p)^{n-k}$$
We solved this problem in Section 3.3. Here’s our old solution:
Solution
We’ll maximize via monotone composition.
First, notice that the choose coefficient out front is constant when $n$ and $k$ are held fixed. It is a nonnegative number, so it simply scales the functional form $p^k (1-p)^{n-k}$. Therefore, the PMF is maximized, as a function of $p$, where the simpler function $p^k (1-p)^{n-k}$ is maximized.
So, we’ll maximize:

$$p^k (1 - p)^{n-k}$$

instead.
This is a product. So, let’s try putting it inside a log. Logs are monotonically increasing, so the original function is maximized where the log is maximized:

$$\log\left(p^k (1-p)^{n-k}\right) = k \log(p) + (n - k) \log(1 - p)$$

Then, setting the derivative with respect to $p$ to zero:

$$\frac{k}{p} - \frac{n-k}{1-p} = 0 \quad \Longrightarrow \quad k(1 - p) = (n - k)p \quad \Longrightarrow \quad \hat{p} = \frac{k}{n}$$
So, the simple guess, $\hat{p} = k/n$, is actually a principled choice. It is the success probability that would make our observation most likely!
In other words, the maximum likelihood estimator for the probability of an event, given independent and identical repetitions of the process, is the number of times the event occurs, divided by the number of trials. It is the empirical frequency of the event over the observation record. The unknown probability is the hypothetical frequency over infinitely many trials.
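We can sanity-check this result numerically: evaluate the binomial likelihood on a grid of candidate $p$ values and confirm the maximum sits at $k/n$ (the numbers 100 and 60 below are arbitrary illustrative choices):

```python
import numpy as np
from math import comb

n, k = 100, 60  # trials and observed successes

# Evaluate the binomial likelihood on a fine grid of candidate p values.
p_grid = np.linspace(0.001, 0.999, 999)
likelihood = comb(n, k) * p_grid**k * (1 - p_grid) ** (n - k)

# The grid maximizer matches the closed-form answer k / n.
p_hat = p_grid[np.argmax(likelihood)]
print(p_hat)  # 0.6
```

A grid search is a crude check, but because the likelihood has a single peak, it lands on the same answer the calculus gave us.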
We can view this estimator as a sample average.
Let $S_i$ be an indicator for success on the $i^{\text{th}}$ trial. Then $K = \sum_{i=1}^{n} S_i$. Let $s_i$ be an indicator for the observed outcome of the $i^{\text{th}}$ trial. Then, $k = \sum_{i=1}^{n} s_i$. So, our maximum likelihood estimator is:

$$\hat{p} = \frac{k}{n} = \frac{1}{n} \sum_{i=1}^{n} s_i$$
So, the maximum likelihood estimator for the probability of an event, given independent and identical repetitions of the process, is the sample average of the indicators for the event.
We’ll see that many maximum likelihood estimators are sample averages. When we can relate a sample average to a maximum likelihood estimator, then we can justify our use of the sample average as an estimator by claiming that the sample average returns the parameter estimate that, if true, would make the observed data most likely.
Let’s try a different example where the maximum likelihood estimator is not a sample average.
Geometric¶
Suppose that, instead of counting the number of successes in a fixed number of trials, we repeated our experiment until our first success, then stopped. The number of trials until the first success is related to the chance of success per trial since, the more likely we are to succeed, the earlier we will usually stop. If $N$ is the number of trials up to and including the first success, then $N$ is geometric, with $P(N = n) = (1 - p)^{n-1} p$, and maximizing this likelihood over $p$ gives $\hat{p} = 1/n$.

This is, again, an entirely natural answer. If it took 10 trials to see our first success, then we saw 1 success in 10 attempts, so should estimate $p$ with $\hat{p} = 1/10$.
Note, this estimator is not a sample average. Nevertheless, we can relate it to a sample average if we draw multiple geometric variables.
Suppose that we sampled $N_1, \dots, N_m$ independently and identically from the geometric distribution and observe $N_1 = n_1, \dots, N_m = n_m$. What is the maximum likelihood estimator for $p$ now?
Now, the probability of our observation is:

$$\prod_{i=1}^{m} (1 - p)^{n_i - 1} p = p^m (1 - p)^{\sum_{i=1}^{m} n_i - m}$$
To maximize this probability, we’ll maximize:

$$\log\left(p^m (1 - p)^{\sum_i n_i - m}\right) = m \log(p) + \left(\sum_{i=1}^{m} n_i - m\right) \log(1 - p)$$
Let $\bar{n} = \frac{1}{m} \sum_{i=1}^{m} n_i$ denote the sample average of the observations. Then, simplifying by moving all terms that do not depend on $i$ outside the sum gives:

$$m \log(p) + m(\bar{n} - 1) \log(1 - p)$$
This function has the same form as the function we maximized before, replacing $k$ with $m$ and $n - k$ with $m(\bar{n} - 1)$. So, it points to the same solution, replacing $\hat{p} = k/n$ with $\hat{p} = m / (m \bar{n}) = 1 / \bar{n}$.
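A quick numerical check (a sketch with simulated geometric data; the true $p$ and sample size are arbitrary choices): the grid maximizer of the log-likelihood matches one over the sample average.

```python
import numpy as np

rng = np.random.default_rng(2)
p_true = 0.25
draws = rng.geometric(p_true, size=500)  # trials until first success, per sample

m, total = len(draws), draws.sum()

# Log-likelihood of the geometric sample as a function of p.
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = m * np.log(p_grid) + (total - m) * np.log(1 - p_grid)

p_hat_grid = p_grid[np.argmax(log_lik)]
p_hat_formula = 1 / draws.mean()
print(p_hat_grid, p_hat_formula)  # the two agree, up to grid spacing
```

Both values also land near the true $p = 0.25$ used to simulate the data, as we would hope for 500 samples.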
Normal¶
What if there is more than one unknown parameter?
Then, to find an MLE, optimize jointly over all unknown parameters.
Let’s do a continuous example. Suppose that $x_1, \dots, x_n$ are observed samples, drawn independently and identically from a normal distribution with unknown mean $\mu$ and unknown standard deviation $\sigma$. What are the maximum likelihood estimators for $\mu$ and $\sigma$? You solved this problem in your optimization homework. Here’s the solution again.
Solution
To maximize the likelihood we will minimize the negative log-likelihood. Evaluating:

$$-\log\left(\prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}\right) = n \log(\sqrt{2\pi}) + n \log(\sigma) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2$$
Dividing by $n$ recovers the loss:

$$L(\mu, \sigma) = \log(\sqrt{2\pi}) + \log(\sigma) + \frac{1}{2\sigma^2} \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2$$
This function is easier to optimize. To optimize it, we’ll set its gradient to zero one parameter at a time.
Let’s start with the unknown mean:

$$\frac{\partial L}{\partial \mu} = -\frac{1}{\sigma^2} \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)$$

So, setting the partial to zero requires $\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu) = 0$, i.e. $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$. Therefore, the maximum likelihood estimator for the mean is the sample average of the observed data.
Next, minimize the loss with respect to the unknown standard deviation, $\sigma$, by setting $\mu = \hat{\mu}$ and differentiating with respect to $\sigma$:

$$\frac{\partial L}{\partial \sigma} = \frac{1}{\sigma} - \frac{1}{\sigma^3} \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

So, setting the partial to zero requires:

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$$
Therefore, the maximum likelihood estimator for the unknown variance is the sample variance, or, equivalently, the variance in the empirical distribution of the observed data.
So, in the case when our data is drawn from a normal distribution, we can justify the sample average and variance as estimators by arguing that they maximize the likelihood of the observed data!
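As a final check, a sketch with simulated normal data (the true parameters and perturbation sizes are arbitrary choices): the negative log-likelihood evaluated at the sample mean and the $1/n$ sample standard deviation beats nearby parameter values.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)

def neg_log_lik(mu, sigma):
    # Negative log-likelihood of i.i.d. normal data,
    # dropping the constant n * log(sqrt(2 * pi)).
    return len(data) * np.log(sigma) + ((data - mu) ** 2).sum() / (2 * sigma**2)

mu_hat = data.mean()               # maximum likelihood estimate of the mean
sigma_hat = np.sqrt(np.var(data))  # square root of the 1/n sample variance

# The MLE pair should achieve a lower loss than nearby parameter values.
base = neg_log_lik(mu_hat, sigma_hat)
for d_mu in (-0.1, 0.1):
    for d_sigma in (-0.1, 0.1):
        assert base <= neg_log_lik(mu_hat + d_mu, sigma_hat + d_sigma)
```

Checking a few perturbations is not a proof, of course, but it agrees with the calculus: the loss is minimized exactly at the sample mean and the sample variance.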