So far we’ve discussed two methods for visualizing functions: check the function’s characteristics (Section 3.1), then check its composition (Section 3.2). The first helps narrow down the ways in which the function can behave. The second breaks it into simpler parts.
Both of those strategies are global. They provide information about $f$ at all possible inputs.
This section is about local strategies. The simplest local strategy is the oldest. Just pick a bunch of inputs, $x_1, \dots, x_k$, plug them each in to find $f(x_1), \dots, f(x_k)$, plot each pair $(x_i, f(x_i))$, then trace a curve through your plot. This is the strategy you used in HW 2 to guess how the mode of the binomial depended on $n$ and $p$.
That strategy always works, but it is slow and laborious. It’s also not very insightful. You usually won’t learn much about the function by doing it, so if we change the problem setup slightly, say by varying a free parameter, then you’ll have to do all of your work over again.
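In code, the brute-force strategy looks something like this. A minimal sketch, using a hypothetical bell-shaped $f$ (the standard normal density):

```python
import math

# The brute-force strategy: pick a bunch of inputs, evaluate f at each,
# and print (x, f(x)) pairs that you could plot and connect by hand.
def f(x):
    return math.exp(-x**2 / 2) / math.sqrt(2 * math.pi)

xs = [-3 + 0.5 * i for i in range(13)]  # inputs from -3 to 3
for x in xs:
    print(f"{x:+.1f}  {f(x):.4f}")
```

Note how the work scales: change the density (say, shift its mean) and every row of the table has to be recomputed from scratch.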
That said, evaluating a function at some specific input points is an essential strategy. The key is to pick those points wisely. Try to pick as few points as are sufficient to describe the function. In essence, select a set of reference points, learn what you can about $f$ near those references, then constrain your plot using what you’ve learned.
Choosing Reference Points
Make a list of reference points. In rough order of work to evaluate, check:
Any inputs where $f$ is easy to evaluate (often, $x = 0$ and $x = 1$).
Be greedy. Always do the easiest thing first.
The smallest and largest possible inputs. If necessary, take limits.
If $X$ is a random variable with support equal to some interval $[a, b]$, then you should always include $a$ and $b$ in your list of reference points.
Some of these limits are easy. For example:
If you want to plot the CDF of $X$, and $X \in [a, b]$, then $F_X(x) = 0$ for all $x < a$ and $F_X(x) = 1$ for all $x > b$.
If you want to plot a PMF or a PDF, and $X$ is unbounded in some direction (e.g. $X$ can be arbitrarily large), then your PMF/PDF must converge to zero as $x$ diverges. Otherwise, it couldn’t add/integrate to one, so would not be normalized. This is why, whenever we’ve drawn a distribution for a random variable that is unbounded above, the distribution tailed off to zero for sufficiently large inputs.
An axis of symmetry if it exists.
For instance, if you’re given a density built around the term $(x + 3)^2$, notice that this is a composition of an outer function $g$ and the inner function $h(x) = (x + 3)^2$. The inner function is a linear transformation (vertical translation, scale, horizontal translation) of our friend, $x^2$, so must have some symmetry. The function $x^2$ is even about zero. So, if we shift it left by 3, then the shifted function $(x + 3)^2$ must be even about $x = -3$.
Any roots (locations where $f(x) = 0$) or poles (locations where $f(x)$ diverges).
Sometimes roots are obvious. For instance, the function $f(x) = x(1 - x)$ will have roots at $x = 0$ and $x = 1$. Roots are obvious when our functions are factored for us.
Sometimes roots are very hard to recover. For instance, try to find the roots of an unfactored quartic like $x^4 - 3x^2 + x + 1$. Not so easy. Add a power of $x$ or two and there is no mathematical formula that could find the roots.
So, be strategic. Only look for roots if they jump out at you from the page.
To find poles, look for roots in the denominator. For instance, the function $f(x) = \frac{1}{x(1 - x)}$ will have poles at $x = 0$ and $x = 1$ since the denominator equals zero there.
So, apply the same logic about roots to poles. Only look for poles if they are obvious. It is worth putting in a bit more elbow-grease when looking for poles, since adding a vertical asymptote to a plot is more important for organizing the plot than adding a root.
Be careful with poles. Remember that, if the numerator and denominator are both zero at some $x_0$, then the ratio at $x_0$ may be zero, may diverge, or may approach some other finite number. In this case, you have to take a limit. We’ll review l’Hôpital’s rule in our chapter on asymptotics and tail behavior.
Any points of discontinuity or nondifferentiability.
These are usually easy to spot. Look for a piecewise definition, or an absolute value in the function definition.
All maxima and minima of the function.
These are the most important reference points for drawing distribution functions accurately. Always look for maxima and minima.
There are a couple strategies for finding maxima and minima. If your function is smooth (continuous and differentiable), then you should check for places where the derivative equals zero.
Once you find the roots of the derivative, evaluate its sign between the roots. This creates a partition of the number line into intervals where your function is increasing or decreasing. Anywhere a function switches from increasing to decreasing is a maximum. Anywhere that it switches from decreasing to increasing is a minimum.
The same arguments work for discrete functions, like PMFs. We have to do a bit more work since a PMF is a bar chart, and doesn’t have a well-defined slope. To see an example, look ahead to the optimization refresher.
Any inflection points of the function.
Check for places where the second derivative changes sign. Follow the same procedure we used for slope. First solve $f''(x) = 0$, then check the sign of the second derivative in between its roots.
Inflection points of the CDF correspond to maxima and minima of the PDF, since the second derivative of the CDF is the first derivative of the PDF.
Inflection points are the least important features on this list unless you are drawing a CDF. They are also often the most work to find. So, save them for last. See if you have enough information to draw your function well before finding its inflection points.
That’s a long list, but you rarely need all of it. The more you practice, the faster you’ll get at recognizing functions, and the fewer reference points you’ll need. Most functions don’t have all the references on this list, so you can usually skip a bunch with little effort. For many distributions it is enough to check the largest and smallest possible inputs, check for symmetry, and to identify maxima and minima.
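One of the checks above, that a PMF on an unbounded support must tail off to zero, is easy to confirm numerically. A minimal sketch, assuming a geometric PMF with a hypothetical $p = 0.25$:

```python
# The tail rule, checked numerically: a PMF on an unbounded support must
# decay to zero, or it couldn't sum to one. Hypothetical geometric PMF,
# p = 0.25, support k = 1, 2, 3, ...
p = 0.25
pmf = [(1 - p) ** (k - 1) * p for k in range(1, 200)]
print(round(sum(pmf), 6))  # 1.0 up to truncation error
print(pmf[-1])             # vanishingly small: the tail has died off
```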
Evaluating Locally
Then:
1. Evaluate $f$ at each reference point $x_i$. Add a dot on your plot at each pair $(x_i, f(x_i))$.

2. Evaluate the slope of $f$ at each reference. Add a small tangent line to your plot at $(x_i, f(x_i))$ with slope $f'(x_i)$. Make sure your drawing is tangent to the tangent line (has the correct slope).

3. Evaluate the slope of $f$ at some arbitrary point between each pair of adjacent references. Make sure your function is increasing and decreasing on the correct intervals. I usually draw a quick number line, mark the roots of the derivative, and add a plus sign on the intervals where $f$ is increasing, and a minus where it is decreasing. If you’ve already determined which of the roots of $f'$ are maxima and minima, then you can fill in the intervals without evaluating the slope again.

4. Evaluate the sign of the second derivative of $f$ at each reference. This can be slow. So, check whether you already know the sign of the second derivative before computing it.
Steps (1) and (2) are essential. If you’ve done (1) and (2) you usually get (3) for free. Step (4) is slow. It helps for accurate plots, especially of CDFs, but is the least important of the four. Do it last and only when necessary.
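Steps (1) and (2) can be sketched numerically. A minimal sketch, assuming a hypothetical $f(x) = x(1 - x)$ with reference points $0$, $1/2$, and $1$, using a finite-difference slope in place of a hand computation:

```python
# Steps (1)-(2): tabulate f and a slope at each reference point.
# Hypothetical f(x) = x * (1 - x); references at 0, 1/2, and 1.
def f(x):
    return x * (1 - x)

refs = [0.0, 0.5, 1.0]
eps = 1e-6  # finite-difference step standing in for a hand computation
for x in refs:
    slope = (f(x + eps) - f(x - eps)) / (2 * eps)
    print(x, f(x), round(slope, 3))
```

The horizontal tangent at $x = 1/2$ flags the maximum; the nonzero slopes at the endpoints pin down the shape in between.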
Optimization Refresher
Let’s remind ourselves how to find maxima and minima of a function, $f$.
We’ll go through four techniques. The first two are the easiest. Always do them first. They mostly require looking at your function, and identifying its structure. The last two are the slowest, but they are failsafe. They are largely mechanical. Do them last as back-ups in case the first two fail.
Symmetry
As usual, start by looking for symmetries.
If $f$ is even about an axis of symmetry, $x = a$, then $a$ must be the location of a maximum or a minimum of the function.

If this isn’t clear to you, try to draw a counterexample. You’ll see that, if you draw $f$ increasing as $x$ leaves $a$ to the right, then, by symmetry, $f$ must also increase as $x$ leaves $a$ to the left, so $a$ is a minimum.

Picture $f(x) = x^2$. You come into zero descending, then, by symmetry, leave zero ascending.

The same is true in the concave case. If, as $x$ approaches $a$ from below, $f$ is increasing, then by symmetry $f$ must be decreasing as $x$ moves away from $a$ to the right, so $a$ is a maximum. Picture $-x^2$.

The only edge case is the constant function, for which every point is both a weak maximum and a weak minimum.

It follows that if $f$ is differentiable, then $f'(a) = 0$ at any even axis of symmetry. So, when you add an even axis of symmetry, you never need to evaluate the derivative there. Just add a horizontal tangent.
If $f$ is odd about an axis of symmetry, $x = a$, then $a$ cannot be a maximum or a minimum.

If this isn’t clear to you, try to draw a counterexample. You’ll find that, if $f$ is increasing as $x$ approaches $a$ from below, then $f$ is also increasing as $x$ leaves $a$ to the right.

Picture $f(x) = x^3$ near zero.
Monotone Compositions
Suppose that $f(x) = g(h(x))$ for some outer function $g$ and some inner function $h$. Then, suppose that $g$ is a monotonic function. For instance, we’ll see lots of examples where $g$ is either $e^u$, $\log(u)$, or $u^c$ for some power $c$.

Suppose that $g$ is monotonically increasing or non-decreasing. Then, $u \ge v$ implies $g(u) \ge g(v)$. So, if $h(x^*) \ge h(x)$ for all $x$, then $f(x^*) = g(h(x^*)) \ge g(h(x)) = f(x)$ for all $x$. So, to maximize $f$ it is enough to maximize $h$.

The same logic works in reverse for monotonically decreasing, or non-increasing, functions. If $g$ is monotonically decreasing or non-increasing, then making $h$ smaller makes $g(h)$ larger. So, to maximize $f$, minimize $h$.
If $f(x) = g(h(x))$ for some outer function $g$ that is monotone, then we can optimize $f$ by finding maxima and minima of $h$. If:

$g$ is monotone ascending (increasing or non-decreasing), then:

To maximize $f$, maximize $h$.

To minimize $f$, minimize $h$.

$g$ is monotone descending (decreasing or non-increasing), then:

To maximize $f$, minimize $h$.

To minimize $f$, maximize $h$.

Moreover, when $g$ is ascending, $f$ is increasing wherever $h$ is increasing, and $f$ is decreasing wherever $h$ is decreasing; when $g$ is descending, the directions flip.
This strategy works whenever $h$ is easier to optimize than $f$. It is especially useful for distributions that are defined using an exponential. For example, we will see lots of distributions that look like:

$$f(x) = c \, e^{h(x)}$$

for some function $h(x)$. In this case, if you’re asked to optimize $f$, just optimize $h$.
Example
Suppose that you are given $g$ and $h$ where:

$$g(u) = e^{-u}, \qquad h(x) = (x + 3)^2, \qquad f(x) = g(h(x)) = e^{-(x + 3)^2},$$

and are asked to find the position of any modes. Now we have two strategies.

By symmetry: Because the inner function $(x + 3)^2$ is even about $x = -3$, we know that $x = -3$ must be the location of a maximum or a minimum.

By monotone composition: $g(u) = e^{-u}$ is monotonically decreasing, so the maxima of $f$ are the minima of $h$. Any number to an even power is nonnegative, and $(x + 3)^2$ is zero at $x = -3$, and only at $x = -3$. If $x \ne -3$ then $h(x) > 0$. Therefore, the inner function is minimized at $x = -3$, and $f$ is maximized at $x = -3$.
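The monotone-composition conclusion is easy to sanity-check on a grid. A minimal sketch, assuming the example’s $g(u) = e^{-u}$ and $h(x) = (x + 3)^2$ (an assumption about the elided formulas):

```python
import numpy as np

# g(u) = exp(-u) is decreasing, so f = g(h) peaks exactly where
# h(x) = (x + 3)**2 bottoms out. Check both claims on a grid.
x = np.linspace(-6.0, 0.0, 601)
h = (x + 3) ** 2
f = np.exp(-h)
print(np.argmin(h) == np.argmax(f))  # True: same grid point
print(x[np.argmax(f)])               # the mode, near -3.0
```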
Some functions don’t look like $g(h(x))$ at first glance, but are still easier to optimize if we rewrite them this way. In particular, if your function involves a product of terms, then, since exponentials exchange products and sums, we can write:

$$f(x) = f_1(x) f_2(x) \cdots f_k(x) = e^{\log f_1(x) + \log f_2(x) + \cdots + \log f_k(x)}.$$

Now, optimize the argument of the exponential, which is the log of $f$:

$$h(x) = \log f(x) = \log f_1(x) + \log f_2(x) + \cdots + \log f_k(x).$$
Example
Suppose that you are given $f$ where:

$$f(x) = x^{2.5} (1 - x)^{2.5}, \qquad x \in [0, 1],$$

and are asked to find the position of any modes.

Notice that $f$ is a product. So, let’s try taking its logarithm:

$$\log f(x) = 2.5 \log x + 2.5 \log(1 - x).$$

This may look harder to work with, but it’s actually easier to optimize since we’ve separated out the two terms. We’ll finish this example later in this chapter.
This trick is important for random variables generated by processes that involve a bunch of “and” statements. Those “and” statements produce products, so the corresponding distributions are often expressed as a product of terms. For instance, the binomial PMF is a product of three terms:

$$f_S(s) = \binom{n}{s} \, p^s \, (1 - p)^{n - s},$$

and the geometric PMF is a product of two:

$$f_X(k) = (1 - p)^{k - 1} \, p.$$

In both cases, if we want to find the choice of $p$ that maximizes the PMF at a given observation, we should start by taking a logarithm.
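The products-to-sums step can be checked numerically. A minimal sketch with hypothetical values $n = 10$, $p = 0.3$, $s = 4$:

```python
from math import comb, log

# Sketch: logs exchange products for sums. For the binomial PMF with
# hypothetical values n = 10, p = 0.3, s = 4, the log of the product
# equals the sum of the logs of its three terms.
n, p, s = 10, 0.3, 4
pmf = comb(n, s) * p**s * (1 - p) ** (n - s)
log_sum = log(comb(n, s)) + s * log(p) + (n - s) * log(1 - p)
print(abs(log(pmf) - log_sum) < 1e-12)  # True
```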
By Direction
In general, $x^*$ is a maximizer if $f$ is increasing to the left of $x^*$ and decreasing to the right. It is a minimizer if $f$ is decreasing to the left and increasing to the right. So, $x^*$ is a critical point (maximizer or minimizer) if $f$ changes direction at $x^*$.
Differentiable Functions
If $f$ is smooth, then we can use its derivative to check whether it is increasing or decreasing. If the derivative is positive, then $f$ is increasing. If the derivative is negative, then it is decreasing. So, $f$ has critical points where the derivative changes sign.

If $f$ is continuously differentiable (that is, its slope is continuous), then its slope cannot change sign without crossing zero. So, to find critical points, we rely on the old calculus trick:
1. Take the derivative.

2. Set it to zero, and solve for the roots.

3. Evaluate the sign of the derivative on either side of each root to classify critical points as maxima, minima, or neither.
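The three steps above can be sketched in code. A minimal sketch, using a hypothetical cubic $f(x) = x^3 - 3x$ whose derivative has roots at $x = \pm 1$:

```python
# The calculus trick on a hypothetical f(x) = x**3 - 3*x:
# f'(x) = 3*x**2 - 3 has roots at x = -1 and x = 1.
def fprime(x):
    return 3 * x**2 - 3

tests = [-2.0, 0.0, 2.0]  # one point inside each interval of the partition
signs = ["+" if fprime(t) > 0 else "-" for t in tests]
print(signs)  # ['+', '-', '+']
# The sign switches + -> - at x = -1 (a maximum)
# and - -> + at x = 1 (a minimum).
```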
Example
Let’s finish our example from before. We wanted to maximize:

$$f(x) = x^{2.5} (1 - x)^{2.5},$$

and saw that maximizing $f$ was the same as maximizing:

$$h(x) = \log f(x) = 2.5 \log x + 2.5 \log(1 - x).$$

The derivative is:

$$h'(x) = \frac{2.5}{x} - \frac{2.5}{1 - x}.$$

Setting the derivative to zero:

$$\frac{2.5}{x} = \frac{2.5}{1 - x} \iff 1 - x = x \iff x = \frac{1}{2}.$$

Then, to show $x = 1/2$ is a maximizer, note that $f(x) = 0$ at $x = 0$ and $x = 1$, and $h$ only has one critical point at $x = 1/2$, so $f$ must be increasing between 0 and $1/2$, and decreasing from $1/2$ to 1. Therefore: the mode is at $x = 1/2$.
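A quick grid search confirms the maximizer. A minimal sketch, assuming the example’s $f(x) = x^{2.5}(1 - x)^{2.5}$:

```python
import numpy as np

# Grid-search check that f(x) = x**2.5 * (1 - x)**2.5 peaks at x = 1/2.
x = np.linspace(0.0, 1.0, 1001)
f = x**2.5 * (1 - x) ** 2.5
print(x[np.argmax(f)])  # 0.5
```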
Maximum Likelihood
Suppose that we are running a binary experiment. Our goal is to estimate the chance that our experiment succeeds, $p$. We consider $p$ fixed, but unknown. We run our experiment $n$ times, and see $s$ successes. How can we estimate $p$ from $n$ and $s$?

The simplest answer is: pick the estimate $\hat{p} = s/n$. Here we’ve given $p$ a “hat”, $\hat{p}$, to remind ourselves that $\hat{p}$ is just an estimate. In other words, if I ran 100 trials and saw 60 successes, then I should estimate $\hat{p} = 60/100 = 0.6$.
Here’s a way to justify that estimate.
Before running any trials, imagine that there is some true success probability $p$, but $p$ is unknown to us. If our trials are independent and identical, then the number of successes, $S$, in $n$ trials, is a Binomial random variable. So:

$$f_S(s; p) = \binom{n}{s} \, p^s \, (1 - p)^{n - s}.$$

Here we’ve added “$; p$” inside the argument of our PMF. The semicolon indicates that we are going to add a free parameter. The symbol after the semicolon is the parameter. We’re adding it to the argument explicitly so that we remember that our function depends on both the input value $s$, and the parameter $p$.
Now, we run our experiment and see $s$ successes. What success probability, $p$, would have maximized the chance of our observation? In other words, what choice of the unknown parameter would have made our observation most likely?

Well, let $\hat{p}$ be the value of $p$ that maximizes $f_S(s; p)$ holding $s$ fixed.

So, our problem is:

$$\hat{p} = \arg\max_{p \in [0, 1]} \binom{n}{s} \, p^s \, (1 - p)^{n - s}.$$
To do so, we’ll use the monotone composition approach.
First, notice that the choose coefficient out front is constant when $s$ is held fixed. It is a nonnegative number, so it simply scales the functional form. Therefore, the PMF is maximized, as a function of $p$, where the simpler function $p^s (1 - p)^{n - s}$ is maximized.

So, we’ll maximize:

$$p^s (1 - p)^{n - s}$$

instead.

This is clearly a product. So, let’s try putting it inside a log. Logs are monotonic, so the original function is maximized where the log is maximized:

$$\log\left( p^s (1 - p)^{n - s} \right) = s \log p + (n - s) \log(1 - p).$$
That looks a lot like the example we solved earlier. In fact, from here on out, all the math is the same, just with $n$ and $s$ where we had 5 and 2.5, and $p$ where we had $x$. Repeating the same steps:

$$\frac{s}{p} = \frac{n - s}{1 - p} \iff s (1 - p) = (n - s) p \iff \hat{p} = \frac{s}{n}.$$

So, the simple guess, $\hat{p} = s/n$, is actually a principled choice. It is the success probability that would make our observation most likely!
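A brute-force check: maximize the log-likelihood over a grid of $p$ values and compare to the closed form $s/n$. A minimal sketch with the $n = 100$, $s = 60$ example from above:

```python
from math import comb, log

# Brute-force the binomial log-likelihood over a grid of p values and
# compare the maximizer to the closed form s/n. Here n = 100, s = 60.
n, s = 100, 60

def log_lik(p):
    return log(comb(n, s)) + s * log(p) + (n - s) * log(1 - p)

grid = [i / 1000 for i in range(1, 1000)]  # p in {0.001, ..., 0.999}
p_hat = max(grid, key=log_lik)
print(p_hat)  # 0.6, matching s/n
```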
Note, this is not the success probability that is most likely given our observation. These are different statements. We’ve not said anything about the probability, or the likelihood, of the unknown parameter. We didn’t model it as random. We modelled the data as random given the parameter, not the data and parameter as jointly random. So, we can only make statements like: the chance of seeing 10 successes is 0.02 if $p$ is 0.1. We can’t make statements like: the chance that $p$ is 0.1 is 0.02 if we see 10 successes.
Try the same maximum likelihood analysis for the unknown parameter $p$ of a Geometric random variable $X$. What value of $p$ would make an observation $X = k$ most likely?
Non-differentiable Functions
What if $f$ is not a smooth function of $x$? Then it doesn’t have a well-defined slope, so our friendly “set the derivative to zero” method doesn’t apply.
We can still find maxima and minima by looking for locations where switches from increasing to decreasing, or from decreasing to increasing. Here’s an example.
On HW 2 you guessed a formula for the mode of the binomial by plotting it as a function of $n$ and $p$. Let’s check our work.
Remember that the Binomial PMF is:

$$f_S(s) = \binom{n}{s} \, p^s \, (1 - p)^{n - s}.$$
To find its mode, we want to find the $s$ that maximizes $f_S(s)$ for a given $n$ and $p$. Notice, this is not the problem we solved in the maximum likelihood example. There we held the input variable fixed, and optimized over $p$. Here, we’ll hold the parameters fixed, and optimize over the input variable, $s$.

These are quite different problems, since the PMF is a smooth function of $p$, but is a discontinuous function of $s$, and is only defined for integer $s$.
First, let’s figure out a way to decide whether the PMF is increasing or decreasing at each input.
Here’s an idea. We can decide whether the PMF is increasing or decreasing between $s - 1$ and $s$ by checking whether $f_S(s)$ is greater than or less than $f_S(s - 1)$. Since the PMF is built as a product of terms, we’ll check using a ratio:

$$\frac{f_S(s)}{f_S(s - 1)}.$$
Confirm that, if the ratio is greater than 1, then the PMF is increasing between $s - 1$ and $s$, and, if the ratio is less than one, then it is decreasing.
This ratio simplifies nicely:

$$\frac{f_S(s)}{f_S(s - 1)} = \frac{\binom{n}{s} p^s (1 - p)^{n - s}}{\binom{n}{s - 1} p^{s - 1} (1 - p)^{n - s + 1}} = \frac{n - s + 1}{s} \cdot \frac{p}{1 - p}.$$
To see why the ratio was a good choice, try simplifying the difference $f_S(s) - f_S(s - 1)$ instead.
Now, compare the ratio to 1. We’ll solve for a condition on $s$ where the PMF is increasing:

$$\frac{n - s + 1}{s} \cdot \frac{p}{1 - p} > 1 \iff (n - s + 1) p > s (1 - p) \iff s < (n + 1) p.$$
So, if $s < (n + 1) p$ then the PMF is increasing. If, on the other hand, $s > (n + 1) p$, then the PMF is decreasing. This explains the rule we saw by experiment. The PMF switches from increasing to decreasing around $(n + 1) p$. The mode is not exactly $(n + 1) p$ since the PMF only accepts integer $s$.
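We can confirm the conclusion numerically. A minimal sketch with hypothetical values $n = 10$ and $p = 0.3$, so $(n + 1) p = 3.3$:

```python
from math import comb

# The PMF increases while s < (n + 1) * p and decreases afterwards, so
# the mode sits at the integer just below (n + 1) * p = 3.3.
n, p = 10, 0.3
pmf = [comb(n, s) * p**s * (1 - p) ** (n - s) for s in range(n + 1)]
ratios = [pmf[s] / pmf[s - 1] for s in range(1, n + 1)]
mode = max(range(n + 1), key=lambda s: pmf[s])
print(mode)                               # 3
print([round(r, 2) for r in ratios[:4]])  # the ratios cross 1 at s = 4
```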