\documentclass[12pt]{article}
\usepackage{ amssymb, amsmath, bm, graphicx }
\addtolength{\textheight}{2.0in} \addtolength{\topmargin}{-0.95in} \addtolength{\textwidth}{1.6in} \addtolength{\evensidemargin}{-0.8in} \addtolength{\oddsidemargin}{-0.9in}
\setlength{\parskip}{0.1in} \setlength{\parindent}{0.0in}
\raggedbottom
\newcommand{\given}{\, | \,}
\begin{document}
\begin{flushleft}
Prof.~David Draper \\
Department of \\
\hspace*{0.1in} Applied Mathematics and Statistics \\
University of California, Santa Cruz
\end{flushleft}
\begin{center}
\textbf{\large AMS 132: Homework 2 (final version)} \\
Target due date: Tue 14 Mar 2017 \textit{[420 total points]}
\end{center}
As with Homework 1, please collect \{all of the \texttt{R} code you used in answering the questions below\} into an Appendix at the end of your document, so that (if you do something wrong) the grader can better give you partial credit. To avoid plagiarism, if you end up using any of the code I post on the course web page, at the beginning of your Appendix you can say something like the following:
\begin{quote}
I used some of Professor Draper's \texttt{R} code in this assignment, adapting it as needed.
\end{quote}
1.~\textit{[60 points]} For each statement below (10 points each), say whether it's true or false; if true without further assumptions, briefly explain why it's true (and --- extra credit (5 points each time) --- what its implications are for statistical inference); if it's sometimes true, give the extra conditions necessary to make it true; if it's false, briefly explain how to change it so that it's true and/or give an example of why it's false. If the statement consists of two or more sub-statements and two or more of them are false, you need to explicitly address all of the false sub-statements in your answer.
\begin{itemize}
\item[(a)] Consider the sampling model $( Y_i \given \theta \, \mathcal{ B } ) \stackrel{ \textrm{\tiny IID} }{ \sim } p ( y_i \given \theta \, \mathcal{ B } )$ for $i = 1, \dots, n$, where the $Y_i$ are univariate real values, $\theta$ is a parameter vector of length $1 \le p < \infty$ and $\mathcal{ B }$ summarizes Your background information; a Bayesian analysis with the same sampling model would add a prior distribution layer of the form $( \theta \given \mathcal{ B } ) \sim p ( \theta \given { \cal B } )$ to the hierarchy. The Bernstein-von Mises Theorem says that maximum-likelihood (ML) and Bayesian inferential conclusions will be similar in this setting if (1) $n$ is large and (2) $p ( \theta | { \cal B } )$ is diffuse, but the theorem does not provide guidance on how large $n$ needs to be for its conclusion to hold in any specific sampling model.
\item[(b)] Being able to express Your sampling distribution as a member of the exponential family is helpful, because (1) You can then readily identify a set of sufficient statistics just by looking at the form of the exponential family and (2) the conjugate prior is also directly available from the exponential family form.
\item[(c)] In the basic diagram that illustrates the frequentist inferential paradigm (with the population, sample, and repeated-sampling data sets, each containing $N$, $n$, and $M$ elements, respectively (see page 6 of the extra notes from 17 Jan 2017)), when the population parameter of main interest is the mean $\theta$ and the estimator is the sample mean $\bar{ Y }$, You will always get a Gaussian long-run distribution for $\bar{ Y }$ (in the repeated-sampling data set) as long as any one of $( N, n, M )$ goes to infinity.
\item[(d)] When Your sampling model has $n$ observations and a single parameter $\theta$ (so that $p = 1$), if the sampling model is regular (i.e., if the range of possible data values doesn't depend on $\theta$), in large samples the observed information $\hat{ I } \! \left( \hat{ \theta }_{ MLE } \right)$ is $O ( n )$, meaning that (1) information in $\hat{ \theta }_{ MLE }$ about $\theta$ increases linearly with $n$ and (2) $\hat{ V } \! \left( \hat{ \theta }_{ MLE } \right) = O \! \left( \frac{ 1 }{ n } \right)$.
\item[(e)] It's easier to reason from the part (or the particular, or the sample) to the whole (or the general, or the population), and that's why statistical inference (inductive reasoning) is easier than probability (deductive reasoning).
\item[(f)] When the sampling model is a regular parametric family $p ( Y \given \theta \, { \cal B } )$, where $\theta$ is a vector of length $1 \le p < \infty$ and $Y = ( Y_1, \dots, Y_n )$, for large $n$ the repeated-sampling distribution of the (vector) MLE $\hat{ \theta }_{ MLE }$ is approximately $p$-variate normal with mean vector $\theta$ and covariance matrix $\hat{ I }^{ -1 }$ (the inverse of the observed information matrix), and the bias of $\hat{ \theta }_{ MLE }$ as an estimate of $\theta$ in large samples is $O \! \left( \frac{ 1 }{ n^2 } \right)$.
\end{itemize}
\begin{figure}[t!]
\centering
\caption{\textit{Differences $y_i$ between observed and predicted American football scores, 1981--1984.}}
\includegraphics[ scale = 0.625 ]{homework-2-figure-1.pdf}
\label{football-data-1}
\end{figure}
2.~\textit{[200 points]} People in Las Vegas who are experts on the National Football League provide a \textit{point spread} for every football game before it occurs, as a measure of the difference in ability between the two teams (and taking account of where the game will be played). For example, if Denver is a 3.5-point favorite to defeat San Francisco, the implication is that betting on whether Denver's final score minus 3.5 points exceeds or falls short of San Francisco's final score is an even-money proposition. With the definition \textit{actual outcome = (score of favorite -- score of underdog)}, Figure \ref{football-data-1} (based on data from Gelman et al.~(2014)) presents a histogram of the differences $y$ = (actual outcome -- point spread) for a sample of $n = 672$ professional football games in the early 1980s, with a normal density superimposed having the same mean $\bar{ y } = 0.07$ and standard deviation (SD) $s_u = 13.86$ as the sample (if this distribution didn't have a mean that's close to 0, the experts would be \textit{uncalibrated} and you could make money by betting against them). You can see from this figure that the model $( Y_i \given \sigma^2 \, { \cal B } ) \stackrel{ \mbox{ \tiny IID} }{ \sim } N( 0, \sigma^2 )$ is reasonable for the observed differences $y_i$.
\begin{itemize}
\item[(a)] Write down the likelihood and log likelihood functions for $\sigma^2$ in this model.
Show that $\hat{ \sigma }^2 = \frac{ 1 }{ n } \sum_{ i = 1 }^n y_i^2$, which takes the value 191.8 with the data in Figure \ref{football-data-1}, is both a sufficient statistic and the maximum likelihood estimator (MLE) for $\sigma^2$. Plot the log likelihood function for $\sigma^2$ in the range from 160 to 240 with these data, briefly explaining why it should be slightly skewed to the right. \textit{[60 points]}
\item[(b)] Show that the conjugate prior for $\sigma^2$ in this model is the \textit{scaled inverse chi-square} distribution,
\begin{equation}
\label{sichi2-1}
( \sigma^2 | { \cal B } ) \sim \chi^{ -2 } ( \nu_0, \sigma_0^2 ), \ \ \ \mbox{ i.e., } \ \ \ p( \sigma^2 | { \cal B } ) = c \, \left( \sigma^2 \right)^{ - \left( \frac{ \nu_0 }{ 2 } + 1 \right) } \exp \left( - \frac{ \nu_0 \, \sigma_0^2 }{ 2 \, \sigma^2 } \right) ,
\end{equation}
where $\nu_0$ is the prior sample size and $\sigma_0^2$ is a prior estimate of $\sigma^2$. In an attempt to be ``non-informative'' people sometimes work with a version of (\ref{sichi2-1}) obtained by letting $\nu_0 \rightarrow 0$, namely $p( \sigma^2 | { \cal B } ) = c_0 \left( \sigma^2 \right)^{ -1 }$. The resulting prior is \textit{improper} in that it integrates to $\infty$, but it turns out that posterior inferences will be sensible nonetheless (even with sample sizes as small as $n = 1$). Show that with this prior, the posterior distribution is $\chi^{ -2 } ( n, \hat{ \sigma }^2 )$. Given the interpretation of $\nu_0$ as the prior sample size and $\sigma_0^2$ as the prior estimate of $\sigma^2$, do the values of $n$ (playing the role of $\nu_0$) and $\hat{ \sigma }^2$ (playing the role of $\sigma_0^2$) in the posterior $\chi^{ -2 } ( n, \hat{ \sigma }^2 )$ make good intuitive sense? Explain briefly. \textit{[30 points]}
\begin{figure}[t!]
\centering
\caption{\textit{Prior, likelihood, and posterior densities with the football data of Figure \ref{football-data-1}.}}
\includegraphics[ scale = 0.625 ]{homework-2-figure-2.pdf}
\label{football-data-2}
\end{figure}
\item[(c)] Figure \ref{football-data-2} plots the prior, likelihood, and posterior densities on the same graph using the data in Figure \ref{football-data-1} and taking $c_0 = 2.5$ for convenience in the plot. Get \texttt{R} to reproduce this figure (\textbf{NB} \texttt{Maple} has a hard time doing this). You'll need to be careful to use the correct normalizing constant $c$ in (\ref{sichi2-1}), which can be found in the lecture notes and in Appendix A of Gelman et al.~(2014); and because the data values in this example lead to astoundingly large and small numbers on the original scale, it's necessary to do all possible computations on the log scale and wait to transform back to the original scale until the last possible moment (you'll need to use the built-in function \texttt{lgamma} in \texttt{R}). Explicitly identify the three curves, and briefly discuss what this plot implies about the updating of information from prior to posterior in this case. (The \texttt{R} sketches following Table \ref{football-data-3} offer one possible starting point for parts (a) and (c).) \textit{[40 points]}
\item[(d)] Fill in the dots in Table \ref{football-data-3} by making the following computations.
\begin{itemize}
\item You already know the frequentist estimate from part (a); to get the Bayesian estimate, let's use the posterior mean, which you can work out from the fact (mentioned in class) that
\begin{equation}
\label{sichi2-2}
\textrm{if} \ \ \ ( \sigma^2 \given \mathcal{ B } ) \sim \chi^{ -2 } ( \nu_0, \sigma_0^2 ) \ \ \ \textrm{ then } \ \ \ E ( \sigma^2 \given \mathcal{ B } ) = \left( \frac{ \nu_0 }{ \nu_0 - 2 } \right) \sigma_0^2 \ \ \ \textrm{ as long as } \ \ \ \nu_0 > 2 \, .
\end{equation}
\item Work out the Fisher information from your log-likelihood function in part (a) and use it to construct the large-sample frequentist 95\% confidence interval for $\sigma^2$.
\item In a small modification of Mr.~Gosset's result (from class) about the small-sample-exact confidence interval for $\sigma^2$ in the Gaussian sampling model when $\mu$ is unknown, it can also be shown (you're not asked to show this) that under the model $( Y_i \given \sigma^2 \, { \cal B } ) \stackrel{ \mbox{ \tiny IID} }{ \sim } N( 0, \sigma^2 )$ for $i = 1, \dots, n$ (which we're using for the football data), the exact $100 ( 1 - \alpha )$\% confidence interval for $\sigma^2$ is of the form
\begin{equation}
\label{gosset-1}
\left[ \frac{ n \, \hat{ \sigma }_u^2 }{ \left( \chi_n^2 \right)_{ 1 - \frac{ \alpha }{ 2 } } }, \frac{ n \, \hat{ \sigma }_u^2 }{ \left( \chi_n^2 \right)_{ \frac{ \alpha }{ 2 } } } \right] \, ,
\end{equation}
in which $\hat{ \sigma }_u^2 = \frac{ 1 }{ n } \sum_{ i = 1 }^n y_i^2$ is an unbiased estimate of $\sigma^2$ (here this coincides with the MLE) and $\left( \chi_\nu^2 \right)_\gamma$ is the place along the $\chi^2$ curve with $\nu$ degrees of freedom where $100 \, \gamma$\% of the total area under the curve is to the left of that place (i.e., the $\gamma$ quantile of the $\chi_\nu^2$ distribution).
\item To get the 95\% Bayesian interval, use the fact, noted in class, that the $\chi^{ -2 } ( \nu_0, \sigma_0^2 )$ density is the same as the $\Gamma^{ -1 } \! \left( \frac{ \nu_0 }{ 2 }, \frac{ \nu_0 \, \sigma_0^2}{ 2 } \right)$ distribution, and make Inverse Gamma calculations with the \texttt{qinvgamma} function in the \texttt{CRAN} package \texttt{actuar} (see the item called \textit{R code for the Bayesian Gaussian analysis of the NB10 data} on the course web page for an example of this).
\end{itemize}
Briefly summarize how the frequentist and Bayesian results are similar and how they differ. Is this an example of the Bernstein-von Mises Theorem in action? Explain briefly. (The \texttt{R} sketches following Table \ref{football-data-3} also illustrate one way to organize the calculations in this part.) \textit{[70 points]}
\end{itemize}
\begin{table}
\centering
\caption{\textit{Comparison of frequentist and Bayesian inference about $\sigma^2$ with the football data of Figure \ref{football-data-1}.}}
\bigskip
\begin{tabular}{c|c|c||c|c}
\multicolumn{3}{c||}{Frequentist} & \multicolumn{2}{c}{Diffuse-Prior Bayesian} \\ \hline
& Large-Sample & Small-Sample & \\
Estimate & 95\% Interval & 95\% Interval & Estimate & 95\% Interval \\ \hline
$\cdot$ & $( \, \cdot , \, \cdot \, )$ & $( \, \cdot , \, \cdot \, )$ & $\cdot$ & $( \, \cdot , \, \cdot \, )$
\end{tabular}
\label{football-data-3}
\end{table}
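The following \texttt{R} sketch is one possible (unverified) starting point for parts (a) and (c) of problem 2. It runs entirely off the summaries quoted in the problem ($n = 672$ and $\hat{ \sigma }^2 = 191.8$) rather than the raw game-by-game data; the function names \texttt{log.lik} and \texttt{log.dsichi2} are only illustrative; the likelihood curve is normalized numerically over the plotting grid, so its vertical scale is approximate; and the normalizing constant built into \texttt{log.dsichi2} should be checked against the lecture notes and Appendix A of Gelman et al.~(2014) before you rely on it.
\begin{verbatim}
## sketch for problem 2, parts (a) and (c): everything below runs off
## the summaries (n, sigma2.hat) given in the problem statement

n <- 672              # sample size
sigma2.hat <- 191.8   # MLE (1/n) * sum(y_i^2) from part (a)
c0 <- 2.5             # constant in the improper prior, as in part (c)

## log likelihood for sigma^2, additive constants dropped
log.lik <- function(sigma2) {
  -(n / 2) * log(sigma2) - n * sigma2.hat / (2 * sigma2)
}

## part (a): log likelihood from 160 to 240
sigma2.grid <- seq(160, 240, length = 500)
plot(sigma2.grid, log.lik(sigma2.grid), type = "l",
  xlab = "sigma^2", ylab = "log likelihood")

## log density of the scaled inverse chi-square distribution
## chi^{-2}(nu, s2); the normalizing constant uses lgamma, as suggested
log.dsichi2 <- function(sigma2, nu, s2) {
  (nu / 2) * log(nu * s2 / 2) - lgamma(nu / 2) -
    (nu / 2 + 1) * log(sigma2) - nu * s2 / (2 * sigma2)
}

## part (c): prior c0 / sigma^2, likelihood (normalized numerically
## over the grid, so only approximately on the density scale), and
## posterior chi^{-2}(n, sigma2.hat)
prior <- c0 / sigma2.grid
lik <- exp(log.lik(sigma2.grid) - max(log.lik(sigma2.grid)))
lik <- lik / (sum(lik) * (sigma2.grid[2] - sigma2.grid[1]))
posterior <- exp(log.dsichi2(sigma2.grid, n, sigma2.hat))

plot(sigma2.grid, posterior, type = "l", lty = 1,
  ylim = range(0, prior, lik, posterior),
  xlab = "sigma^2", ylab = "density")
lines(sigma2.grid, lik, lty = 2)
lines(sigma2.grid, prior, lty = 3)
legend("topright", lty = 1:3,
  legend = c("posterior", "likelihood", "prior"))
\end{verbatim}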
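A companion sketch for the estimate and interval calculations in part (d) of problem 2 follows. Here the observed information is approximated by a numerical second derivative of the log likelihood at the MLE, which you should check against your own analytic Fisher-information calculation, and the \texttt{shape} and \texttt{scale} arguments passed to \texttt{qinvgamma} assume the Inverse Gamma parameterization used by the \texttt{actuar} package; treat the resulting numbers as something to verify, not to copy straight into Table \ref{football-data-3}.
\begin{verbatim}
## sketch for problem 2, part (d): point estimates and 95% intervals

library(actuar)   # for qinvgamma; install.packages("actuar") if needed

n <- 672
sigma2.hat <- 191.8
alpha.level <- 0.05

## log likelihood for sigma^2 (additive constants dropped), as in (a)
log.lik <- function(sigma2) {
  -(n / 2) * log(sigma2) - n * sigma2.hat / (2 * sigma2)
}

## large-sample interval: observed information approximated here by a
## numerical second derivative at the MLE -- check this against your
## analytic Fisher-information calculation
eps <- 0.01
d2 <- (log.lik(sigma2.hat + eps) - 2 * log.lik(sigma2.hat) +
       log.lik(sigma2.hat - eps)) / eps^2
se.large <- sqrt(-1 / d2)
ci.large <- sigma2.hat + c(-1, 1) * qnorm(1 - alpha.level / 2) * se.large

## small-sample exact interval, straight from the displayed chi-square
## formula in this part
ci.exact <- n * sigma2.hat /
  qchisq(c(1 - alpha.level / 2, alpha.level / 2), df = n)

## Bayesian summaries: posterior chi^{-2}(n, sigma2.hat), i.e. Inverse
## Gamma with shape n / 2 and scale n * sigma2.hat / 2 in actuar's
## parameterization; posterior mean from the scaled-inverse-chi-square
## mean formula quoted above
post.mean <- (n / (n - 2)) * sigma2.hat
ci.bayes <- qinvgamma(c(alpha.level / 2, 1 - alpha.level / 2),
  shape = n / 2, scale = n * sigma2.hat / 2)

print(c(mle = sigma2.hat, posterior.mean = post.mean))
print(round(rbind(ci.large, ci.exact, ci.bayes), 1))
\end{verbatim}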
3.~\textit{[160 points]} Paleobotanists estimate the moment in the remote past when a given species became extinct by taking cylindrical, vertical core samples well below the earth's surface and looking for the last occurrence of the species in the fossil record, measured in meters above the point $P$ at which the species was known to have first emerged. Letting $\bm{ y } = ( y_1, \dots, y_n )$ denote a sample of such distances above $P$ at a random set of locations, the sampling model \vspace*{0.05in} \fbox{ $( Y_i \given \theta \, \mathcal{ B } ) \stackrel{ \mbox{ \tiny IID} }{\sim}$ Uniform$( 0, \theta ) \, ( * )$} emerges from simple and plausible assumptions. In this model the unknown $\theta > 0$ can be used, through carbon dating, to estimate the species extinction time.
This problem is about likelihood and Bayesian inference for $\theta$ in model $( * )$, and it will be seen that some of our usual intuitions (derived from the Bernoulli, Poisson, and Gaussian case studies) do not hold in this case. The marginal sampling distribution of a single observation $y_i$ in this model may be written
\begin{equation}
\label{uniform-1}
p ( y_i \given \theta \, \mathcal{ B } ) = \left\{ \begin{array}{cc} \frac{ 1 }{ \theta } & \textrm{if } 0 \le y_i \le \theta \\ 0 & \textrm{otherwise} \end{array} \right\} = \frac{ 1 }{ \theta } \, I ( 0 \le y_i \le \theta ) \, ,
\end{equation}
where (as usual) $I ( A ) = 1$ if proposition $A$ is true and 0 otherwise.
\begin{itemize}
\item[(a)] Use the fact that $( 0 \le y_i \le \theta \textrm{ for all } i = 1, \dots, n )$ if and only if $( m \triangleq \max ( y_1, \dots, y_n ) \le \theta )$ to show that the likelihood function in this model is
\begin{equation}
\label{uniform-2}
\ell ( \theta \given \bm{ y } \, \mathcal{ B } ) = \theta^{ - n } \, I ( \theta \ge m ) \, .
\end{equation}
Briefly explain why this demonstrates that $m$ is sufficient for $\theta$ in this model. \textit{[20 points]}
\item[(b)] As we've discussed in class, the maximum likelihood estimator (MLE) of a parameter $\theta$ is the value of $\theta$ (which will be a function of the data) that maximizes the likelihood function, and this maximization is usually performed by setting the derivative of the likelihood (or log likelihood) function to 0 and solving. Show by means of a rough sketch of the likelihood function in (a) that $m$ is the MLE of $\theta$, and briefly explain why the usual method for finding the MLE fails in this case. \textit{[20 points]}
% repeated-sampling distribution of m, and pivotal ci that results
\item[(c)] A positive quantity $\theta$ follows the \textit{Pareto} distribution with parameters $\alpha, \beta > 0$ --- written $( \theta \given \mathcal{ B } ) \sim \textrm{Pareto} ( \alpha, \beta )$, and named for the Italian economist Vilfredo Pareto (1848--1923) --- if it has density
\begin{equation}
\label{uniform-3}
p ( \theta \given \mathcal{ B } ) = \left\{ \begin{array}{cc} \alpha \, \beta^\alpha \, \theta^{ - ( \alpha + 1 ) } & \textrm{if } \theta \ge \beta \\ 0 & \textrm{otherwise} \end{array} \right\} \, .
\end{equation}
This distribution has mean $\frac{ \alpha \, \beta }{ \alpha - 1 }$ (if $\alpha > 1$) and variance $\frac{ \alpha \, \beta^2 }{ ( \alpha - 1 )^2 \, ( \alpha - 2 ) }$ (if $\alpha > 2$). With the likelihood function viewed as (a constant multiple of) a density for $\theta$, show that the likelihood corresponds to the Pareto$( n - 1, m )$ distribution. Show further that if the prior distribution for $\theta$ is taken to be (\ref{uniform-3}), under the model $( * )$ above, the posterior distribution is $p ( \theta \given \bm{ y } \, \mathcal{ B } ) = \textrm{Pareto} [ \alpha + n, \max ( \beta, m ) ]$, thereby demonstrating that the Pareto distribution is conjugate to the Uniform$( 0, \theta )$ likelihood. \textit{[20 points]}
\item[(d)] In an experiment conducted in the Antarctic in the 1980s to study a particular species of fossil ammonite, the following was a linearly rescaled version of the observed data: $\bm{ y } = ( 2.8, 1.7, 1.0, 5.1, 3.7, 1.5, 4.3, 2.0, 3.2, 2.1, 0.4 )$. Prior information equivalent to a Pareto prior specified by the choice $( \alpha, \beta ) = ( 2.5, 4 )$ was available.
Plot the prior, likelihood, and posterior distributions arising from this data set on the same graph, explicitly identifying the three curves, and briefly discuss what this picture implies about the updating of information from prior to posterior in this case. (For one possible way to set up the computations in parts (d) and (e), see the \texttt{R} sketch at the end of this assignment.) \textit{[30 points]}
\item[(e)] Make a table summarizing the mean and standard deviation (SD) for the prior (Pareto$( \alpha, \beta )$), likelihood (Pareto$( n - 1, m )$), and posterior (Pareto$[ \alpha + n, \max( \beta, m )]$) distributions, using the $( \alpha, \beta )$ choices and the data in part (d) above. In Bayesian updating the posterior mean is usually (at least approximately) a weighted average of the prior and likelihood means (with weights between 0 and 1), and the posterior SD is typically smaller than either the prior or likelihood SDs. Are both of these behaviors present in this case? Explain briefly. \textit{[50 points]}
\item[(f)] You've shown in part (c) that the posterior for $\theta$ based on a sample of size $n$ in model $( * )$ is $p ( \theta \given \bm{ y } \, \mathcal{ B } ) = \textrm{Pareto} [ \alpha + n, \max ( \beta, m ) ]$. Write down a symbolic expression for the posterior variance of $\theta$ in terms of $( \alpha, \beta, m, n )$. When considered as a function of $n$, what's unusual about this expression in relation to the findings in our previous case studies in this course? Explain briefly. \textit{[20 points]}
\end{itemize}
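Finally, the sketch below is one possible \texttt{R} starting point for parts (d) and (e) of problem 3, computed directly from the Pareto density and moment formulas quoted in part (c); the function names are only illustrative and the plotting grid (0 to 12) is an arbitrary choice, so adapt it and check it against your own analytical results before relying on it.
\begin{verbatim}
## sketch for problem 3, parts (d) and (e)

y <- c(2.8, 1.7, 1.0, 5.1, 3.7, 1.5, 4.3, 2.0, 3.2, 2.1, 0.4)
n <- length(y)
m <- max(y)
alpha <- 2.5   # prior parameters from part (d)
beta <- 4

## Pareto(a, b) density, mean, and SD, using the formulas in part (c)
dpareto.ab <- function(theta, a, b) {
  ifelse(theta >= b, a * b^a * theta^(-(a + 1)), 0)
}
pareto.mean <- function(a, b) ifelse(a > 1, a * b / (a - 1), NA)
pareto.sd <- function(a, b) {
  ifelse(a > 2, sqrt(a * b^2 / ((a - 1)^2 * (a - 2))), NA)
}

## part (d): prior Pareto(alpha, beta), likelihood Pareto(n - 1, m),
## posterior Pareto(alpha + n, max(beta, m)) on one graph
theta.grid <- seq(0, 12, length = 1000)
prior <- dpareto.ab(theta.grid, alpha, beta)
lik <- dpareto.ab(theta.grid, n - 1, m)
posterior <- dpareto.ab(theta.grid, alpha + n, max(beta, m))

plot(theta.grid, posterior, type = "l", lty = 1,
  xlab = "theta", ylab = "density")
lines(theta.grid, lik, lty = 2)
lines(theta.grid, prior, lty = 3)
legend("topright", lty = 1:3,
  legend = c("posterior", "likelihood", "prior"))

## part (e): means and SDs of the three distributions
a.all <- c(prior = alpha, likelihood = n - 1, posterior = alpha + n)
b.all <- c(beta, m, max(beta, m))
print(round(rbind(mean = pareto.mean(a.all, b.all),
                  sd = pareto.sd(a.all, b.all)), 2))
\end{verbatim}
\end{document}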