\documentclass[12pt]{article}
\usepackage{ amssymb, amsmath, bm, graphicx }
\addtolength{\textheight}{2.0in} \addtolength{\topmargin}{-0.95in} \addtolength{\textwidth}{1.6in} \addtolength{\evensidemargin}{-0.8in} \addtolength{\oddsidemargin}{-0.9in}
\setlength{\parskip}{0.1in} \setlength{\parindent}{0.0in}
\raggedbottom
\newcommand{\given}{\, | \,}
\begin{document}
\begin{flushleft}
Prof.~David Draper \\
Department of \\
\hspace*{0.1in} Applied Mathematics and Statistics \\
University of California, Santa Cruz
\end{flushleft}
\begin{center}
\textbf{\large AMS 132: Homework 2 (final version)} \\
Target due date: Tue 14 Mar 2017 \textit{[420 total points]}
\end{center}
As with Homework 1, please collect \{all of the \texttt{R} code you used in answering the questions below\} into an Appendix at the end of your document, so that (if you do something wrong) the grader can better give you partial credit. To avoid plagiarism, if you end up using any of the code I post on the course web page, at the beginning of your Appendix you can say something like the following:
\begin{quote}
I used some of Professor Draper's \texttt{R} code in this assignment, adapting it as needed.
\end{quote}
1.~\textit{[60 points]} For each statement below (10 points each), say whether it's true or false; if true without further assumptions, briefly explain why it's true (and --- extra credit (5 points each time) --- what its implications are for statistical inference); if it's sometimes true, give the extra conditions necessary to make it true; if it's false, briefly explain how to change it so that it's true and/or give an example of why it's false. If the statement consists of two or more sub-statements and two or more of them are false, you need to explicitly address all of the false sub-statements in your answer.
\begin{itemize}
\item[(a)] Consider the sampling model $( Y_i \given \theta \, \mathcal{ B } ) \stackrel{ \textrm{\tiny IID} }{ \sim } p ( y_i \given \theta \, \mathcal{ B } )$ for $i = 1, \dots, n$, where the $Y_i$ are univariate real values, $\theta$ is a parameter vector of length $1 \le p < \infty$ and $\mathcal{ B }$ summarizes Your background information; a Bayesian analysis with the same sampling model would add a prior distribution layer of the form $( \theta \given \mathcal{ B } ) \sim p ( \theta \given { \cal B } )$ to the hierarchy. The Bernstein-von Mises Theorem says that maximum-likelihood (ML) and Bayesian inferential conclusions will be similar in this setting if (1) $n$ is large and (2) $p ( \theta | { \cal B } )$ is diffuse, but the theorem does not provide guidance on how large $n$ needs to be for its conclusion to hold in any specific sampling model.
\item[(b)] Being able to express Your sampling distribution as a member of the exponential family is helpful, because (1) You can then readily identify a set of sufficient statistics just by looking at the form of the exponential family and (2) the conjugate prior is also directly available from the exponential family form.
\item[(c)] In the basic diagram that illustrates the frequentist inferential paradigm (with the population, sample, and repeated-sampling data sets, each containing $N$, $n$, and $M$ elements, respectively (see page 6 of the extra notes from 17 Jan 2017)), when the population parameter of main interest is the mean $\theta$ and the estimator is the sample mean $\bar{ Y }$, You will always get a Gaussian long-run distribution for $\bar{ Y }$ (in the repeated-sampling data set) as long as any one of $( N, n, M )$ goes to infinity.
\item[(d)] When Your sampling model has $n$ observations and a single parameter $\theta$ (so that $p = 1$), if the sampling model is regular (i.e., if the range of possible data values doesn't depend on $\theta$), in large samples the observed information $\hat{ I } \! \left( \hat{ \theta }_{ MLE } \right)$ is $O ( n )$, meaning that (1) information in $\hat{ \theta }_{ MLE }$ about $\theta$ increases linearly with $n$ and (2) $\hat{ V } \! \left( \hat{ \theta }_{ MLE } \right) = O \! \left( \frac{ 1 }{ n } \right)$.
\item[(e)] It's easier to reason from the part (or the particular, or the sample) to the whole (or the general, or the population), and that's why statistical inference (inductive reasoning) is easier than probability (deductive reasoning).
\item[(f)] When the sampling model is a regular parametric family $p ( Y \given \theta \, { \cal B } )$, where $\theta$ is a vector of length $1 \le p < \infty$ and $Y = ( Y_1, \dots, Y_n )$, for large $n$ the repeated-sampling distribution of the (vector) MLE $\hat{ \theta }_{ MLE }$ is approximately $p$-variate normal with mean vector $\theta$ and covariance matrix $\hat{ I }^{ -1 }$ (the inverse of the observed information matrix), and the bias of $\hat{ \theta }_{ MLE }$ as an estimate of $\theta$ in large samples is $O \! \left( \frac{ 1 }{ n^2 } \right)$.
\end{itemize}
\begin{figure}[t!]
\centering
\caption{\textit{Differences $y_i$ between observed and predicted American football scores, 1981--1984.}}
\includegraphics[ scale = 0.625 ]{homework-2-figure-1.pdf}
\label{football-data-1}
\end{figure}
2.~\textit{[200 points]} People in Las Vegas who are experts on the National Football League provide a \textit{point spread} for every football game before it occurs, as a measure of the difference in ability between the two teams (and taking account of where the game will be played). For example, if Denver is a 3.5-point favorite to defeat San Francisco, the implication is that betting on whether Denver's final score minus 3.5 points exceeds or falls short of San Francisco's final score is an even-money proposition. With the definition \textit{actual outcome = (score of favorite -- score of underdog)}, Figure \ref{football-data-1} (based on data from Gelman et al.~(2014)) presents a histogram of the differences $y$ = (actual outcome -- point spread) for a sample of $n = 672$ professional football games in the early 1980s, with a normal density superimposed having the same mean $\bar{ y } = 0.07$ and standard deviation (SD) $s_u = 13.86$ as the sample (if this distribution didn't have a mean that's close to 0, the experts would be \textit{uncalibrated} and you could make money by betting against them). You can see from this figure that the model $( Y_i \given \sigma^2 \, { \cal B } ) \stackrel{ \mbox{ \tiny IID} }{ \sim } N( 0, \sigma^2 )$ is reasonable for the observed differences $y_i$.
\begin{itemize}
\item[(a)] Write down the likelihood and log likelihood functions for $\sigma^2$ in this model.
Show that $\hat{ \sigma }^2 = \frac{ 1 }{ n } \sum_{ i = 1 }^n y_i^2$, which takes the value 191.8 with the data in Figure \ref{football-data-1}, is both a sufficient statistic and the maximum likelihood estimator (MLE) for $\sigma^2$. Plot the log likelihood function for $\sigma^2$ in the range from 160 to 240 with these data, briefly explaining why it should be slightly skewed to the right. \textit{[60 points]}
\item[(b)] Show that the conjugate prior for $\sigma^2$ in this model is the \textit{scaled inverse chi-square} distribution,
\begin{equation}
\label{sichi2-1}
( \sigma^2 | { \cal B } ) \sim \chi^{ -2 } ( \nu_0, \sigma_0^2 ), \ \ \ \mbox{ i.e., } \ \ \ p( \sigma^2 | { \cal B } ) = c \, \left( \sigma^2 \right)^{ - \left( \frac{ \nu_0 }{ 2 } + 1 \right) } \exp \left( - \frac{ \nu_0 \, \sigma_0^2 }{ 2 \, \sigma^2 } \right) ,
\end{equation}
where $\nu_0$ is the prior sample size and $\sigma_0^2$ is a prior estimate of $\sigma^2$. In an attempt to be ``non-informative'' people sometimes work with a version of (\ref{sichi2-1}) obtained by letting $\nu_0 \rightarrow 0$, namely $p( \sigma^2 | { \cal B } ) = c_0 \left( \sigma^2 \right)^{ -1 }$. The resulting prior is \textit{improper} in that it integrates to $\infty$, but it turns out that posterior inferences will be sensible nonetheless (even with sample sizes as small as $n = 1$). Show that with this prior, the posterior distribution is $\chi^{ -2 } ( n, \hat{ \sigma }^2 )$. Given the interpretation of $\nu_0$ as the prior sample size and $\sigma_0^2$ as the prior estimate of $\sigma^2$, do the values of $n$ (playing the role of $\nu_0$) and $\hat{ \sigma }^2$ (playing the role of $\sigma_0^2$) in the posterior $\chi^{ -2 } ( n, \hat{ \sigma }^2 )$ make good intuitive sense? Explain briefly. \textit{[30 points]}
\begin{figure}[t!]
\centering
\caption{\textit{Prior, likelihood, and posterior densities with the football data of Figure \ref{football-data-1}.}}
\includegraphics[ scale = 0.625 ]{homework-2-figure-2.pdf}
\label{football-data-2}
\end{figure}
\item[(c)] Figure \ref{football-data-2} plots the prior, likelihood, and posterior densities on the same graph using the data in Figure \ref{football-data-1} and taking $c_0 = 2.5$ for convenience in the plot. Get \texttt{R} to reproduce this figure (\textbf{NB} \texttt{Maple} has a hard time doing this). You'll need to be careful to use the correct normalizing constant $c$ in (\ref{sichi2-1}), which can be found in the lecture notes and in Appendix A of Gelman et al.~(2014); and because the data values in this example lead to astoundingly large and small numbers on the original scale, it's necessary to do all possible computations on the log scale and wait to transform back to the original scale until the last possible moment (you'll need to use the built-in function \texttt{lgamma} in \texttt{R}). Explicitly identify the three curves, and briefly discuss what this plot implies about the updating of information from prior to posterior in this case. (The \texttt{R} sketches following Table \ref{football-data-3} offer one possible starting point for parts (a) and (c).) \textit{[40 points]}
\item[(d)] Fill in the dots in Table \ref{football-data-3} by making the following computations.
\begin{itemize}
\item You already know the frequentist estimate from part (a); to get the Bayesian estimate, let's use the posterior mean, which you can work out from the fact (mentioned in class) that
\begin{equation}
\label{sichi2-2}
\textrm{if} \ \ \ ( \sigma^2 \given \mathcal{ B } ) \sim \chi^{ -2 } ( \nu_0, \sigma_0^2 ) \ \ \ \textrm{ then } \ \ \ E ( \sigma^2 \given \mathcal{ B } ) = \left( \frac{ \nu_0 }{ \nu_0 - 2 } \right) \sigma_0^2 \ \ \ \textrm{ as long as } \ \ \ \nu_0 > 2 \, .
\end{equation}
\item Work out the Fisher information from your log-likelihood function in part (a) and use it to construct the large-sample frequentist 95\% confidence interval for $\sigma^2$.
\item In a small modification of Mr.~Gosset's result (from class) about the small-sample-exact confidence interval for $\sigma^2$ in the Gaussian sampling model when $\mu$ is unknown, it can also be shown (you're not asked to show this) that under the model $( Y_i \given \sigma^2 \, { \cal B } ) \stackrel{ \mbox{ \tiny IID} }{ \sim } N( 0, \sigma^2 )$ for $i = 1, \dots, n$ (which we're using for the football data), the exact $100 ( 1 - \alpha )$\% confidence interval for $\sigma^2$ is of the form
\begin{equation}
\label{gosset-1}
\left[ \frac{ n \, \hat{ \sigma }_u^2 }{ \left( \chi_n^2 \right)_{ 1 - \frac{ \alpha }{ 2 } } }, \frac{ n \, \hat{ \sigma }_u^2 }{ \left( \chi_n^2 \right)_{ \frac{ \alpha }{ 2 } } } \right] \, ,
\end{equation}
in which $\hat{ \sigma }_u^2 = \frac{ 1 }{ n } \sum_{ i = 1 }^n y_i^2$ is an unbiased estimate of $\sigma^2$ (here this coincides with the MLE) and $\left( \chi_\nu^2 \right)_\gamma$ is the place along the $\chi^2$ curve with $\nu$ degrees of freedom where $100 \, \gamma$\% of the total area under the curve is to the left of that place (i.e., the $\gamma$ quantile of the $\chi_\nu^2$ distribution).
\item To get the 95\% Bayesian interval, use the fact, noted in class, that the $\chi^{ -2 } ( \nu_0, \sigma_0^2 )$ density is the same as the $\Gamma^{ -1 } \! \left( \frac{ \nu_0 }{ 2 }, \frac{ \nu_0 \, \sigma_0^2}{ 2 } \right)$ distribution, and make Inverse Gamma calculations with the \texttt{qinvgamma} function in the \texttt{CRAN} package \texttt{actuar} (see the item called \textit{R code for the Bayesian Gaussian analysis of the NB10 data} on the course web page for an example of this).
\end{itemize}
Briefly summarize how the frequentist and Bayesian results are similar and how they differ. Is this an example of the Bernstein-von Mises Theorem in action? Explain briefly. (The \texttt{R} sketches following Table \ref{football-data-3} also illustrate one way to organize the calculations in this part.) \textit{[70 points]}
\end{itemize}
\begin{table}
\centering
\caption{\textit{Comparison of frequentist and Bayesian inference about $\sigma^2$ with the football data of Figure \ref{football-data-1}.}}
\bigskip
\begin{tabular}{c|c|c||c|c}
\multicolumn{3}{c||}{Frequentist} & \multicolumn{2}{c}{Diffuse-Prior Bayesian} \\ \hline
& Large-Sample & Small-Sample & \\
Estimate & 95\% Interval & 95\% Interval & Estimate & 95\% Interval \\ \hline
$\cdot$ & $( \, \cdot , \, \cdot \, )$ & $( \, \cdot , \, \cdot \, )$ & $\cdot$ & $( \, \cdot , \, \cdot \, )$
\end{tabular}
\label{football-data-3}
\end{table}
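The following \texttt{R} sketch is one possible (unverified) starting point for parts (a) and (c) of problem 2. It runs entirely off the summaries quoted in the problem ($n = 672$ and $\hat{ \sigma }^2 = 191.8$) rather than the raw game-by-game data; the function names \texttt{log.lik} and \texttt{log.dsichi2} are only illustrative; the likelihood curve is normalized numerically over the plotting grid, so its vertical scale is approximate; and the normalizing constant built into \texttt{log.dsichi2} should be checked against the lecture notes and Appendix A of Gelman et al.~(2014) before you rely on it.
\begin{verbatim}
## sketch for problem 2, parts (a) and (c): everything below runs off
## the summaries (n, sigma2.hat) given in the problem statement

n <- 672              # sample size
sigma2.hat <- 191.8   # MLE (1/n) * sum(y_i^2) from part (a)
c0 <- 2.5             # constant in the improper prior, as in part (c)

## log likelihood for sigma^2, additive constants dropped
log.lik <- function(sigma2) {
  -(n / 2) * log(sigma2) - n * sigma2.hat / (2 * sigma2)
}

## part (a): log likelihood from 160 to 240
sigma2.grid <- seq(160, 240, length = 500)
plot(sigma2.grid, log.lik(sigma2.grid), type = "l",
  xlab = "sigma^2", ylab = "log likelihood")

## log density of the scaled inverse chi-square distribution
## chi^{-2}(nu, s2); the normalizing constant uses lgamma, as suggested
log.dsichi2 <- function(sigma2, nu, s2) {
  (nu / 2) * log(nu * s2 / 2) - lgamma(nu / 2) -
    (nu / 2 + 1) * log(sigma2) - nu * s2 / (2 * sigma2)
}

## part (c): prior c0 / sigma^2, likelihood (normalized numerically
## over the grid, so only approximately on the density scale), and
## posterior chi^{-2}(n, sigma2.hat)
prior <- c0 / sigma2.grid
lik <- exp(log.lik(sigma2.grid) - max(log.lik(sigma2.grid)))
lik <- lik / (sum(lik) * (sigma2.grid[2] - sigma2.grid[1]))
posterior <- exp(log.dsichi2(sigma2.grid, n, sigma2.hat))

plot(sigma2.grid, posterior, type = "l", lty = 1,
  ylim = range(0, prior, lik, posterior),
  xlab = "sigma^2", ylab = "density")
lines(sigma2.grid, lik, lty = 2)
lines(sigma2.grid, prior, lty = 3)
legend("topright", lty = 1:3,
  legend = c("posterior", "likelihood", "prior"))
\end{verbatim}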
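A companion sketch for the estimate and interval calculations in part (d) of problem 2 follows. Here the observed information is approximated by a numerical second derivative of the log likelihood at the MLE, which you should check against your own analytic Fisher-information calculation, and the \texttt{shape} and \texttt{scale} arguments passed to \texttt{qinvgamma} assume the Inverse Gamma parameterization used by the \texttt{actuar} package; treat the resulting numbers as something to verify, not to copy straight into Table \ref{football-data-3}.
\begin{verbatim}
## sketch for problem 2, part (d): point estimates and 95% intervals

library(actuar)   # for qinvgamma; install.packages("actuar") if needed

n <- 672
sigma2.hat <- 191.8
alpha.level <- 0.05

## log likelihood for sigma^2 (additive constants dropped), as in (a)
log.lik <- function(sigma2) {
  -(n / 2) * log(sigma2) - n * sigma2.hat / (2 * sigma2)
}

## large-sample interval: observed information approximated here by a
## numerical second derivative at the MLE -- check this against your
## analytic Fisher-information calculation
eps <- 0.01
d2 <- (log.lik(sigma2.hat + eps) - 2 * log.lik(sigma2.hat) +
       log.lik(sigma2.hat - eps)) / eps^2
se.large <- sqrt(-1 / d2)
ci.large <- sigma2.hat + c(-1, 1) * qnorm(1 - alpha.level / 2) * se.large

## small-sample exact interval, straight from the displayed chi-square
## formula in this part
ci.exact <- n * sigma2.hat /
  qchisq(c(1 - alpha.level / 2, alpha.level / 2), df = n)

## Bayesian summaries: posterior chi^{-2}(n, sigma2.hat), i.e. Inverse
## Gamma with shape n / 2 and scale n * sigma2.hat / 2 in actuar's
## parameterization; posterior mean from the scaled-inverse-chi-square
## mean formula quoted above
post.mean <- (n / (n - 2)) * sigma2.hat
ci.bayes <- qinvgamma(c(alpha.level / 2, 1 - alpha.level / 2),
  shape = n / 2, scale = n * sigma2.hat / 2)

print(c(mle = sigma2.hat, posterior.mean = post.mean))
print(round(rbind(ci.large, ci.exact, ci.bayes), 1))
\end{verbatim}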
3.~\textit{[160 points]} Paleobotanists estimate the moment in the remote past when a given species became extinct by taking cylindrical, vertical core samples well below the earth's surface and looking for the last occurrence of the species in the fossil record, measured in meters above the point $P$ at which the species was known to have first emerged. Letting $\bm{ y } = ( y_1, \dots, y_n )$ denote a sample of such distances above $P$ at a random set of locations, the sampling model \vspace*{0.05in} \fbox{ $( Y_i \given \theta \, \mathcal{ B } ) \stackrel{ \mbox{ \tiny IID} }{\sim}$ Uniform$( 0, \theta ) \, ( * )$} emerges from simple and plausible assumptions. In this model the unknown $\theta > 0$ can be used, through carbon dating, to estimate the species extinction time.
This problem is about likelihood and Bayesian inference for $\theta$ in model $( * )$, and it will be seen that some of our usual intuitions (derived from the Bernoulli, Poisson, and Gaussian case studies) do not hold in this case. The marginal sampling distribution of a single observation $y_i$ in this model may be written
\begin{equation}
\label{uniform-1}
p ( y_i \given \theta \, \mathcal{ B } ) = \left\{ \begin{array}{cc} \frac{ 1 }{ \theta } & \textrm{if } 0 \le y_i \le \theta \\ 0 & \textrm{otherwise} \end{array} \right\} = \frac{ 1 }{ \theta } \, I ( 0 \le y_i \le \theta ) \, ,
\end{equation}
where (as usual) $I ( A ) = 1$ if proposition $A$ is true and 0 otherwise.
\begin{itemize}
\item[(a)] Use the fact that $( 0 \le y_i \le \theta \textrm{ for all } i = 1, \dots, n )$ if and only if $( m \triangleq \max ( y_1, \dots, y_n ) \le \theta )$ to show that the likelihood function in this model is
\begin{equation}
\label{uniform-2}
\ell ( \theta \given \bm{ y } \, \mathcal{ B } ) = \theta^{ - n } \, I ( \theta \ge m ) \, .
\end{equation}
Briefly explain why this demonstrates that $m$ is sufficient for $\theta$ in this model. \textit{[20 points]}
\item[(b)] As we've discussed in class, the maximum likelihood estimator (MLE) of a parameter $\theta$ is the value of $\theta$ (which will be a function of the data) that maximizes the likelihood function, and this maximization is usually performed by setting the derivative of the likelihood (or log likelihood) function to 0 and solving. Show by means of a rough sketch of the likelihood function in (a) that $m$ is the MLE of $\theta$, and briefly explain why the usual method for finding the MLE fails in this case. \textit{[20 points]}
% repeated-sampling distribution of m, and pivotal ci that results
\item[(c)] A positive quantity $\theta$ follows the \textit{Pareto} distribution with parameters $\alpha, \beta > 0$ --- written $( \theta \given \mathcal{ B } ) \sim \textrm{Pareto} ( \alpha, \beta )$, and named for the Italian economist Vilfredo Pareto (1848--1923) --- if it has density
\begin{equation}
\label{uniform-3}
p ( \theta \given \mathcal{ B } ) = \left\{ \begin{array}{cc} \alpha \, \beta^\alpha \, \theta^{ - ( \alpha + 1 ) } & \textrm{if } \theta \ge \beta \\ 0 & \textrm{otherwise} \end{array} \right\} \, .
\end{equation}
This distribution has mean $\frac{ \alpha \, \beta }{ \alpha - 1 }$ (if $\alpha > 1$) and variance $\frac{ \alpha \, \beta^2 }{ ( \alpha - 1 )^2 \, ( \alpha - 2 ) }$ (if $\alpha > 2$). With the likelihood function viewed as (a constant multiple of) a density for $\theta$, show that the likelihood corresponds to the Pareto$( n - 1, m )$ distribution. Show further that if the prior distribution for $\theta$ is taken to be (\ref{uniform-3}), under the model $( * )$ above, the posterior distribution is $p ( \theta \given \bm{ y } \, \mathcal{ B } ) = \textrm{Pareto} [ \alpha + n, \max ( \beta, m ) ]$, thereby demonstrating that the Pareto distribution is conjugate to the Uniform$( 0, \theta )$ likelihood. \textit{[20 points]}
\item[(d)] In an experiment conducted in the Antarctic in the 1980s to study a particular species of fossil ammonite, the following was a linearly rescaled version of the observed data: $\bm{ y } = ( 2.8, 1.7, 1.0, 5.1, 3.7, 1.5, 4.3, 2.0, 3.2, 2.1, 0.4 )$. Prior information equivalent to a Pareto prior specified by the choice $( \alpha, \beta ) = ( 2.5, 4 )$ was available.
Plot the prior, likelihood, and posterior distributions arising from this data set on the same graph, explicitly identifying the three curves, and briefly discuss what this picture implies about the updating of information from prior to posterior in this case. (For one possible way to set up the computations in parts (d) and (e), see the \texttt{R} sketch at the end of this assignment.) \textit{[30 points]}
\item[(e)] Make a table summarizing the mean and standard deviation (SD) for the prior (Pareto$( \alpha, \beta )$), likelihood (Pareto$( n - 1, m )$), and posterior (Pareto$[ \alpha + n, \max( \beta, m )]$) distributions, using the $( \alpha, \beta )$ choices and the data in part (d) above. In Bayesian updating the posterior mean is usually (at least approximately) a weighted average of the prior and likelihood means (with weights between 0 and 1), and the posterior SD is typically smaller than either the prior or likelihood SDs. Are both of these behaviors present in this case? Explain briefly. \textit{[50 points]}
\item[(f)] You've shown in part (c) that the posterior for $\theta$ based on a sample of size $n$ in model $( * )$ is $p ( \theta \given \bm{ y } \, \mathcal{ B } ) = \textrm{Pareto} [ \alpha + n, \max ( \beta, m ) ]$. Write down a symbolic expression for the posterior variance of $\theta$ in terms of $( \alpha, \beta, m, n )$. When considered as a function of $n$, what's unusual about this expression in relation to the findings in our previous case studies in this course? Explain briefly. \textit{[20 points]}
\end{itemize}
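Finally, the sketch below is one possible \texttt{R} starting point for parts (d) and (e) of problem 3, computed directly from the Pareto density and moment formulas quoted in part (c); the function names are only illustrative and the plotting grid (0 to 12) is an arbitrary choice, so adapt it and check it against your own analytical results before relying on it.
\begin{verbatim}
## sketch for problem 3, parts (d) and (e)

y <- c(2.8, 1.7, 1.0, 5.1, 3.7, 1.5, 4.3, 2.0, 3.2, 2.1, 0.4)
n <- length(y)
m <- max(y)
alpha <- 2.5   # prior parameters from part (d)
beta <- 4

## Pareto(a, b) density, mean, and SD, using the formulas in part (c)
dpareto.ab <- function(theta, a, b) {
  ifelse(theta >= b, a * b^a * theta^(-(a + 1)), 0)
}
pareto.mean <- function(a, b) ifelse(a > 1, a * b / (a - 1), NA)
pareto.sd <- function(a, b) {
  ifelse(a > 2, sqrt(a * b^2 / ((a - 1)^2 * (a - 2))), NA)
}

## part (d): prior Pareto(alpha, beta), likelihood Pareto(n - 1, m),
## posterior Pareto(alpha + n, max(beta, m)) on one graph
theta.grid <- seq(0, 12, length = 1000)
prior <- dpareto.ab(theta.grid, alpha, beta)
lik <- dpareto.ab(theta.grid, n - 1, m)
posterior <- dpareto.ab(theta.grid, alpha + n, max(beta, m))

plot(theta.grid, posterior, type = "l", lty = 1,
  xlab = "theta", ylab = "density")
lines(theta.grid, lik, lty = 2)
lines(theta.grid, prior, lty = 3)
legend("topright", lty = 1:3,
  legend = c("posterior", "likelihood", "prior"))

## part (e): means and SDs of the three distributions
a.all <- c(prior = alpha, likelihood = n - 1, posterior = alpha + n)
b.all <- c(beta, m, max(beta, m))
print(round(rbind(mean = pareto.mean(a.all, b.all),
                  sd = pareto.sd(a.all, b.all)), 2))
\end{verbatim}
\end{document}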