So far we’ve discussed two methods for visualizing functions: check the function’s characteristics (Section 3.1), then check its composition (Section 3.2). The first helps narrow down the ways in which the function can behave. The second breaks it into simpler parts.
Both of those strategies are global. They provide information about $f$ at all possible inputs.
This section is about local strategies. The simplest local strategy is the oldest. Just pick a bunch of inputs, $x_1, \dots, x_k$, plug them each in to find $f(x_1), \dots, f(x_k)$, plot each pair $(x_i, f(x_i))$, then trace a curve through your plot. This is the strategy you used in HW 2 to guess how the mode of the binomial depended on $n$ and $p$.
That strategy always works, but it is slow and laborious. It’s also not very insightful. You usually won’t learn much about the function by doing it, so if we change the problem setup slightly, say by varying a free parameter, then you’ll have to do all of your work over again.
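In code, the brute-force strategy looks something like this. A minimal sketch, using a hypothetical bell-shaped $f$ (the standard normal density):

```python
import math

# The brute-force strategy: pick a bunch of inputs, evaluate f at each,
# and print (x, f(x)) pairs that you could plot and connect by hand.
def f(x):
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

xs = [-3 + 0.5 * i for i in range(13)]  # inputs from -3 to 3
for x in xs:
    print(f"{x:+.1f}  {f(x):.4f}")
```

Note how the work scales: change the density (say, shift its mean) and every row of the table has to be recomputed from scratch.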
That said, evaluating a function at some specific input points is an essential strategy. The key is to pick those points wisely. Try to pick as few points as are sufficient to describe the function. In essence, select a set of reference points, learn what you can about $f$ near those references, then constrain your plot using what you’ve learned.
Choosing Reference Points
Make a list of reference points. In rough order of work to evaluate, check:
Any inputs where $f$ is easy to evaluate (often, $x = 0$ and $x = 1$).
Be greedy. Always do the easiest thing first.
The smallest and largest possible inputs. If necessary, take limits.
If $X$ is a random variable with support equal to some interval $[a, b]$, then you should always include $a$ and $b$ in your list of reference points.
Some of these limits are easy. For example:
If you want to plot the CDF of $X$, and $X \in [a, b]$, then $F_X(x) = 0$ for all $x < a$ and $F_X(x) = 1$ for all $x > b$.
If you want to plot a PMF or a PDF, and $X$ is unbounded in some direction (e.g. $X$ can be arbitrarily large), then your PMF/PDF must converge to zero as $x$ diverges. Otherwise, it couldn’t add/integrate to one, so would not be normalized. This is why, whenever we’ve drawn a distribution for a random variable that is unbounded above, the distribution tailed off to zero for sufficiently large inputs.
An axis of symmetry if it exists.
For instance, if you’re given a density built around the term $(x + 3)^2$, notice that this is a composition of an outer function $g$ and the inner function $h(x) = (x + 3)^2$. The inner function is a linear transformation (vertical translation, scale, horizontal translation) of our friend, $x^2$, so must have some symmetry. The function $x^2$ is even about zero. So, if we shift it left by 3, then the shifted function $(x + 3)^2$ must be even about $x = -3$.
Any roots (locations where $f(x) = 0$) or poles (locations where $f(x)$ diverges).
Sometimes roots are obvious. For instance, the function $f(x) = x(1 - x)$ will have roots at $x = 0$ and $x = 1$. Roots are obvious when our functions are factored for us.
Sometimes roots are very hard to recover. For instance, try to find the roots of an unfactored quartic like $x^4 - 3x^2 + x + 1$. Not so easy. Add a power of $x$ or two and there is no mathematical formula that could find the roots.
So, be strategic. Only look for roots if they jump out at you from the page.
To find poles, look for roots in the denominator. For instance, the function $f(x) = \frac{1}{x(1 - x)}$ will have poles at $x = 0$ and $x = 1$ since the denominator equals zero there.
So, apply the same logic about roots to poles. Only look for poles if they are obvious. It is worth putting in a bit more elbow-grease when looking for poles, since adding a vertical asymptote to a plot is more important for organizing the plot than adding a root.
Be careful with poles. Remember that, if the numerator and denominator are both zero at some $x_0$, then the ratio at $x_0$ may be zero, may diverge, or may approach some other finite number. In this case, you have to take a limit. We’ll review l’Hôpital’s rule in our chapter on asymptotics and tail behavior.
Any points of discontinuity or nondifferentiability.
These are usually easy to spot. Look for a piecewise definition, or an absolute value in the function definition.
All maxima and minima of the function.
These are the most important reference points for drawing distribution functions accurately. Always look for maxima and minima.
There are a couple strategies for finding maxima and minima. If your function is smooth (continuous and differentiable), then you should check for places where the derivative equals zero.
Once you find the roots of the derivative, evaluate its sign between the roots. This creates a partition of the number line into intervals where your function is increasing or decreasing. Anywhere a function switches from increasing to decreasing is a maximum. Anywhere that it switches from decreasing to increasing is a minimum.
The same arguments work for discrete functions, like PMFs. We have to do a bit more work since a PMF is a bar chart, and doesn’t have a well-defined slope. To see an example, look ahead to the optimization refresher.
Any inflection points of the function.
Check for places where the second derivative changes sign. Follow the same procedure we used for slope. First solve $f''(x) = 0$, then check the sign of the second derivative in between its roots.
Inflection points of the CDF correspond to maxima and minima of the PDF, since the second derivative of the CDF is the first derivative of the PDF.
Inflection points are the least important features on this list unless you are drawing a CDF. They are also often the most work to find. So, save them for last. See if you have enough information to draw your function well before finding its inflection points.
That’s a long list, but you rarely need all of it. The more you practice, the faster you’ll get at recognizing functions, and the fewer reference points you’ll need. Most functions don’t have all the references on this list, so you can usually skip a bunch with little effort. For many distributions it is enough to check the largest and smallest possible inputs, check for symmetry, and to identify maxima and minima.
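One of the checks above, that a PMF on an unbounded support must tail off to zero, is easy to confirm numerically. A minimal sketch, assuming a geometric PMF with a hypothetical $p = 0.25$:

```python
# The tail rule, checked numerically: a PMF on an unbounded support must
# decay to zero, or it couldn't sum to one. Hypothetical geometric PMF,
# p = 0.25, support k = 1, 2, 3, ...
p = 0.25
pmf = [(1 - p) ** (k - 1) * p for k in range(1, 200)]
print(round(sum(pmf), 6))  # 1.0 up to truncation error
print(pmf[-1])             # vanishingly small: the tail has died off
```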
Evaluating Locally
Then:
1. Evaluate $f$ at each reference point $x_i$. Add a dot on your plot at each pair $(x_i, f(x_i))$.

2. Evaluate the slope of $f$ at each reference. Add a small tangent line to your plot at $(x_i, f(x_i))$ with slope $f'(x_i)$. Make sure your drawing is tangent to the tangent line (has the correct slope).

3. Evaluate the slope of $f$ at some arbitrary point between each pair of adjacent references. Make sure your function is increasing and decreasing on the correct intervals. I usually draw a quick number line, mark the roots of the derivative, and add a plus sign on the intervals where $f$ is increasing, and a minus where it is decreasing. If you’ve already determined which of the roots of $f'$ are maxima and minima, then you can fill in the intervals without evaluating the slope again.

4. Evaluate the sign of the second derivative of $f$ at each reference. This can be slow. So, check whether you already know the sign of the second derivative before computing it.
Steps (1) and (2) are essential. If you’ve done (1) and (2) you usually get (3) for free. Step (4) is slow. It helps for accurate plots, especially of CDFs, but is the least important of the four. Do it last and only when necessary.
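Steps (1) and (2) can be sketched numerically. A minimal sketch, assuming a hypothetical $f(x) = x(1 - x)$ with reference points $0$, $1/2$, and $1$, using a finite-difference slope in place of a hand computation:

```python
# Steps (1)-(2): tabulate f and a slope at each reference point.
# Hypothetical f(x) = x * (1 - x); references at 0, 1/2, and 1.
def f(x):
    return x * (1 - x)

refs = [0.0, 0.5, 1.0]
eps = 1e-6  # finite-difference step standing in for a hand computation
for x in refs:
    slope = (f(x + eps) - f(x - eps)) / (2 * eps)
    print(x, f(x), round(slope, 3))
```

The horizontal tangent at $x = 1/2$ flags the maximum; the nonzero slopes at the endpoints pin down the shape in between.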
Optimization Refresher
Let’s remind ourselves how to find maxima and minima of a function, $f$.
We’ll go through four techniques. The first two are the easiest. Always do them first. They mostly require looking at your function, and identifying its structure. The last two are the slowest, but they are failsafe. They are largely mechanical. Do them last as back-ups in case the first two fail.
Symmetry
As usual, start by looking for symmetries.
If $f$ is even about an axis of symmetry, $x = a$, then $a$ must be the location of a maximum or a minimum of the function.

If this isn’t clear to you, try to draw a counterexample. You’ll see that, if you draw $f$ increasing as $x$ leaves $a$ to the right, then, by symmetry, $f$ must also increase as $x$ leaves $a$ to the left, so $a$ is a minimum.

Picture $f(x) = x^2$. You come into zero descending, then, by symmetry, leave zero ascending.

The same is true in the concave case. If, as $x$ approaches $a$ from below, $f$ is increasing, then by symmetry $f$ must be decreasing as $x$ moves away from $a$ to the right, so $a$ is a maximum. Picture $-x^2$.

The only edge case is the constant function, for which every point is both a weak maximum and a weak minimum.

It follows that if $f$ is differentiable, then $f'(a) = 0$ at any even axis of symmetry. So, when you add an even axis of symmetry, you never need to evaluate the derivative there. Just add a horizontal tangent.
If $f$ is odd about an axis of symmetry, $x = a$, then $a$ cannot be a maximum or a minimum.

If this isn’t clear to you, try to draw a counterexample. You’ll find that, if $f$ is increasing as $x$ approaches $a$ from below, then $f$ is also increasing as $x$ leaves $a$ to the right.

Picture $f(x) = x^3$ near zero.
Monotone Compositions
Suppose that $f(x) = g(h(x))$ for some outer function $g$ and some inner function $h$. Then, suppose that $g$ is a monotonic function. For instance, we’ll see lots of examples where $g$ is either $e^u$, $\log(u)$, or $u^c$ for some power $c$.

Suppose that $g$ is monotonically increasing or non-decreasing. Then, $u \ge v$ implies $g(u) \ge g(v)$. So, if $h(x^*) \ge h(x)$ for all $x$, then $f(x^*) = g(h(x^*)) \ge g(h(x)) = f(x)$ for all $x$. So, to maximize $f$ it is enough to maximize $h$.

The same logic works in reverse for monotonically decreasing, or non-increasing, functions. If $g$ is monotonically decreasing or non-increasing, then making $h$ smaller makes $g(h)$ larger. So, to maximize $f$, minimize $h$.
If $f(x) = g(h(x))$ for some outer function $g$ that is monotone, then we can optimize $f$ by finding maxima and minima of $h$. If:

$g$ is monotone ascending (increasing or non-decreasing), then:

To maximize $f$, maximize $h$.

To minimize $f$, minimize $h$.

$g$ is monotone descending (decreasing or non-increasing), then:

To maximize $f$, minimize $h$.

To minimize $f$, maximize $h$.

Moreover, when $g$ is ascending, $f$ is increasing wherever $h$ is increasing, and $f$ is decreasing wherever $h$ is decreasing; when $g$ is descending, the directions flip.
This strategy works whenever $h$ is easier to optimize than $f$. It is especially useful for distributions that are defined using an exponential. For example, we will see lots of distributions that look like:

$$f(x) = c \, e^{h(x)}$$

for some function $h(x)$. In this case, if you’re asked to optimize $f$, just optimize $h$.
Example
Suppose that you are given $g$ and $h$ where:

$$g(u) = e^{-u}, \qquad h(x) = (x + 3)^2, \qquad f(x) = g(h(x)) = e^{-(x + 3)^2},$$

and are asked to find the position of any modes. Now we have two strategies.

By symmetry: Because the inner function $(x + 3)^2$ is even about $x = -3$, we know that $x = -3$ must be the location of a maximum or a minimum.

By monotone composition: $g(u) = e^{-u}$ is monotonically decreasing, so the maxima of $f$ are the minima of $h$. Any number to an even power is nonnegative, and $(x + 3)^2$ is zero at $x = -3$, and only at $x = -3$. If $x \ne -3$ then $h(x) > 0$. Therefore, the inner function is minimized at $x = -3$, and $f$ is maximized at $x = -3$.
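The monotone-composition conclusion is easy to sanity-check on a grid. A minimal sketch, assuming the example’s $g(u) = e^{-u}$ and $h(x) = (x + 3)^2$ (an assumption about the elided formulas):

```python
import numpy as np

# g(u) = exp(-u) is decreasing, so f = g(h) peaks exactly where
# h(x) = (x + 3)**2 bottoms out. Check both claims on a grid.
x = np.linspace(-6.0, 0.0, 601)
h = (x + 3) ** 2
f = np.exp(-h)
print(np.argmin(h) == np.argmax(f))  # True: same grid point
print(x[np.argmax(f)])               # the mode, near -3.0
```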
Some functions don’t look like $g(h(x))$ at first glance, but are still easier to optimize if we rewrite them this way. In particular, if your function involves a product of terms, then, since exponentials exchange products and sums, we can write:

$$f(x) = f_1(x) f_2(x) \cdots f_k(x) = e^{\log f_1(x) + \log f_2(x) + \cdots + \log f_k(x)}.$$

Now, optimize the argument of the exponential, which is the log of $f$:

$$h(x) = \log f(x) = \log f_1(x) + \log f_2(x) + \cdots + \log f_k(x).$$
Example
Suppose that you are given $f$ where:

$$f(x) = x^{2.5} (1 - x)^{2.5}, \qquad x \in [0, 1],$$

and are asked to find the position of any modes.

Notice that $f$ is a product. So, let’s try taking its logarithm:

$$\log f(x) = 2.5 \log x + 2.5 \log(1 - x).$$

This may look harder to work with, but it’s actually easier to optimize since we’ve separated out the two terms. We’ll finish this example later in this chapter.
This trick is important for random variables generated by processes that involve a bunch of “and” statements. Those “and” statements produce products, so the corresponding distributions are often expressed as a product of terms. For instance, the binomial PMF is a product of three terms:

$$f_S(s) = \binom{n}{s} \, p^s \, (1 - p)^{n - s},$$

and the geometric PMF is a product of two:

$$f_X(k) = (1 - p)^{k - 1} \, p.$$

In both cases, if we want to find the choice of $p$ that maximizes the PMF at a given observation, we should start by taking a logarithm.
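The products-to-sums step can be checked numerically. A minimal sketch with hypothetical values $n = 10$, $p = 0.3$, $s = 4$:

```python
from math import comb, log

# Sketch: logs exchange products for sums. For the binomial PMF with
# hypothetical values n = 10, p = 0.3, s = 4, the log of the product
# equals the sum of the logs of its three terms.
n, p, s = 10, 0.3, 4
pmf = comb(n, s) * p**s * (1 - p) ** (n - s)
log_sum = log(comb(n, s)) + s * log(p) + (n - s) * log(1 - p)
print(abs(log(pmf) - log_sum) < 1e-12)  # True
```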
By Direction
In general, $x^*$ is a maximizer if $f$ is increasing to the left of $x^*$ and decreasing to the right. It is a minimizer if $f$ is decreasing to the left and increasing to the right. So, $x^*$ is a critical point (maximizer or minimizer) if $f$ changes direction at $x^*$.
Differentiable Functions
If $f$ is smooth, then we can use its derivative to check whether it is increasing or decreasing. If the derivative is positive, then $f$ is increasing. If the derivative is negative, then it is decreasing. So, $f$ has critical points where the derivative changes sign.

If $f$ is continuously differentiable (that is, its slope is continuous), then its slope cannot change sign without crossing zero. So, to find critical points, we rely on the old calculus trick:
1. Take the derivative.

2. Set it to zero, and solve for the roots.

3. Evaluate the sign of the derivative on either side of each root to classify critical points as maxima, minima, or neither.
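The three steps above can be sketched in code. A minimal sketch, using a hypothetical cubic $f(x) = x^3 - 3x$ whose derivative has roots at $x = \pm 1$:

```python
# The calculus trick on a hypothetical f(x) = x**3 - 3*x:
# f'(x) = 3*x**2 - 3 has roots at x = -1 and x = 1.
def fprime(x):
    return 3 * x**2 - 3

tests = [-2.0, 0.0, 2.0]  # one point inside each interval of the partition
signs = ["+" if fprime(t) > 0 else "-" for t in tests]
print(signs)  # ['+', '-', '+']
# The sign switches + -> - at x = -1 (a maximum)
# and - -> + at x = 1 (a minimum).
```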
Example
Let’s finish our example from before. We wanted to maximize:

$$f(x) = x^{2.5} (1 - x)^{2.5},$$

and saw that maximizing $f$ was the same as maximizing:

$$h(x) = \log f(x) = 2.5 \log x + 2.5 \log(1 - x).$$

The derivative is:

$$h'(x) = \frac{2.5}{x} - \frac{2.5}{1 - x}.$$

Setting the derivative to zero:

$$\frac{2.5}{x} = \frac{2.5}{1 - x} \iff 1 - x = x \iff x = \frac{1}{2}.$$

Then, to show $x = 1/2$ is a maximizer, note that $f(x) = 0$ at $x = 0$ and $x = 1$, and $h$ only has one critical point at $x = 1/2$, so $f$ must be increasing between 0 and $1/2$, and decreasing from $1/2$ to 1. Therefore: the mode is at $x = 1/2$.
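A quick grid search confirms the maximizer. A minimal sketch, assuming the example’s $f(x) = x^{2.5}(1 - x)^{2.5}$:

```python
import numpy as np

# Grid-search check that f(x) = x**2.5 * (1 - x)**2.5 peaks at x = 1/2.
x = np.linspace(0.0, 1.0, 1001)
f = x**2.5 * (1 - x) ** 2.5
print(x[np.argmax(f)])  # 0.5
```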
Maximum Likelihood
Suppose that we are running a binary experiment. Our goal is to estimate the chance that our experiment succeeds, $p$. We consider $p$ fixed, but unknown. We run our experiment $n$ times, and see $s$ successes. How can we estimate $p$ from $n$ and $s$?

The simplest answer is: pick the estimate $\hat{p} = s/n$. Here we’ve given $p$ a “hat”, $\hat{p}$, to remind ourselves that $\hat{p}$ is just an estimate. In other words, if I ran 100 trials and saw 60 successes, then I should estimate $\hat{p} = 60/100 = 0.6$.
Here’s a way to justify that estimate.
Before running any trials, imagine that there is some true success probability $p$, but $p$ is unknown to us. If our trials are independent and identical, then the number of successes, $S$, in $n$ trials, is a Binomial random variable. So:

$$f_S(s; p) = \binom{n}{s} \, p^s \, (1 - p)^{n - s}.$$

Here we’ve added “$; p$” inside the argument of our PMF. The semicolon indicates that we are going to add a free parameter. The symbol after the semicolon is the parameter. We’re adding it to the argument explicitly so that we remember that our function depends on both the input value $s$, and the parameter $p$.
Now, we run our experiment and see $s$ successes. What success probability, $p$, would have maximized the chance of our observation? In other words, what choice of the unknown parameter would have made our observation most likely?

Well, let $\hat{p}$ be the value of $p$ that maximizes $f_S(s; p)$ holding $s$ fixed.

So, our problem is:

$$\hat{p} = \arg\max_{p \in [0, 1]} \binom{n}{s} \, p^s \, (1 - p)^{n - s}.$$
To do so, we’ll use the monotone composition approach.
First, notice that the choose coefficient out front is constant when $s$ is held fixed. It is a nonnegative number, so it simply scales the functional form. Therefore, the PMF is maximized, as a function of $p$, where the simpler function $p^s (1 - p)^{n - s}$ is maximized.

So, we’ll maximize:

$$p^s (1 - p)^{n - s}$$

instead.

This is clearly a product. So, let’s try putting it inside a log. Logs are monotonic, so the original function is maximized where the log is maximized:

$$\log\left( p^s (1 - p)^{n - s} \right) = s \log p + (n - s) \log(1 - p).$$
That looks a lot like the example we solved earlier. In fact, from here on out, all the math is the same, just with $n$ and $s$ where we had 5 and 2.5, and $p$ where we had $x$. Repeating the same steps:

$$\frac{s}{p} = \frac{n - s}{1 - p} \iff s (1 - p) = (n - s) p \iff \hat{p} = \frac{s}{n}.$$

So, the simple guess, $\hat{p} = s/n$, is actually a principled choice. It is the success probability that would make our observation most likely!
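A brute-force check: maximize the log-likelihood over a grid of $p$ values and compare to the closed form $s/n$. A minimal sketch with the $n = 100$, $s = 60$ example from above:

```python
from math import comb, log

# Brute-force the binomial log-likelihood over a grid of p values and
# compare the maximizer to the closed form s/n. Here n = 100, s = 60.
n, s = 100, 60

def log_lik(p):
    return log(comb(n, s)) + s * log(p) + (n - s) * log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]  # p in {0.001, ..., 0.999}
p_hat = max(grid, key=log_lik)
print(p_hat)  # 0.6, matching s/n
```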
Note, this is not the success probability that is most likely given our observation. These are different statements. We’ve not said anything about the probability, or the likelihood, of the unknown parameter. We didn’t model it as random. We modelled the data as random given the parameter, not the data and parameter as jointly random. So, we can only make statements like: the chance of seeing 10 successes is 0.02 if $p$ is 0.1. We can’t make statements like: the chance that $p$ is 0.1 is 0.02 if we see 10 successes.
Try the same maximum likelihood analysis for the unknown parameter $p$ of a Geometric random variable $X$. What value of $p$ would make an observation $X = k$ most likely?
Non-differentiable Functions
What if $f$ is not a smooth function of $x$? Then it doesn’t have a well-defined slope, so our friendly “set the derivative to zero” method doesn’t apply.
We can still find maxima and minima by looking for locations where switches from increasing to decreasing, or from decreasing to increasing. Here’s an example.
On HW 2 you guessed a formula for the mode of the binomial by plotting it as a function of $n$ and $p$. Let’s check our work.
Remember that the Binomial PMF is:

$$f_S(s) = \binom{n}{s} \, p^s \, (1 - p)^{n - s}.$$
To find its mode, we want to find the $s$ that maximizes $f_S(s)$ for a given $n$ and $p$. Notice, this is not the problem we solved in the maximum likelihood example. There we held the input variable fixed, and optimized over $p$. Here, we’ll hold the parameters fixed, and optimize over the input variable, $s$.

These are quite different problems, since the PMF is a smooth function of $p$, but is a discontinuous function of $s$, and is only defined for integer $s$.
First, let’s figure out a way to decide whether the PMF is increasing or decreasing at each input.
Here’s an idea. We can decide whether the PMF is increasing or decreasing between $s - 1$ and $s$ by checking whether $f_S(s)$ is greater than or less than $f_S(s - 1)$. Since the PMF is built as a product of terms, we’ll check using a ratio:

$$\frac{f_S(s)}{f_S(s - 1)}.$$
Confirm that, if the ratio is greater than 1, then the PMF is increasing between $s - 1$ and $s$, and, if the ratio is less than one, then it is decreasing.
This ratio simplifies nicely:

$$\frac{f_S(s)}{f_S(s - 1)} = \frac{\binom{n}{s} p^s (1 - p)^{n - s}}{\binom{n}{s - 1} p^{s - 1} (1 - p)^{n - s + 1}} = \frac{n - s + 1}{s} \cdot \frac{p}{1 - p}.$$
To see why the ratio was a good choice, try simplifying the difference $f_S(s) - f_S(s - 1)$ instead.
Now, compare the ratio to 1. We’ll solve for a condition on $s$ where the PMF is increasing:

$$\frac{n - s + 1}{s} \cdot \frac{p}{1 - p} > 1 \iff (n - s + 1) p > s (1 - p) \iff s < (n + 1) p.$$
So, if $s < (n + 1) p$ then the PMF is increasing. If, on the other hand, $s > (n + 1) p$, then the PMF is decreasing. This explains the rule we saw by experiment. The PMF switches from increasing to decreasing around $(n + 1) p$. The mode is not exactly $(n + 1) p$ since the PMF only accepts integer $s$.
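We can confirm the conclusion numerically. A minimal sketch with hypothetical values $n = 10$ and $p = 0.3$, so $(n + 1) p = 3.3$:

```python
from math import comb

# The PMF increases while s < (n + 1) * p and decreases afterwards, so
# the mode sits at the integer just below (n + 1) * p = 3.3.
n, p = 10, 0.3
pmf = [comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(n + 1)]
ratios = [pmf[s] / pmf[s - 1] for s in range(1, n + 1)]
mode = max(range(n + 1), key=lambda s: pmf[s])
print(mode)                               # 3
print([round(r, 2) for r in ratios[:4]])  # the ratios cross 1 at s = 4
```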