\documentclass[12pt]{article} \usepackage{ amssymb, amsmath, bm, graphicx } \addtolength{\textheight}{2.0in} \addtolength{\topmargin}{-0.95in} \addtolength{\textwidth}{1.6in} \addtolength{\evensidemargin}{-0.8in} \addtolength{\oddsidemargin}{-0.9in} \setlength{\parskip}{0.1in} \setlength{\parindent}{0.0in} \raggedbottom \newcommand{\given}{\, | \,} \begin{document} \begin{flushleft} Prof.~David Draper \\ Department of \\ \hspace*{0.1in} Applied Mathematics and Statistics \\ University of California, Santa Cruz \end{flushleft} \begin{center} \textbf{\large AMS 132: Take-Home Final Exam (final draft)} \\ Absolute due date: 24 Mar 2017 \textit{[640 total points]} \end{center} As with Homeworks 1 and 2, please collect \{all of the \texttt{R} code you used in answering the questions below\} into an Appendix at the end of your document, so that (if you do something wrong) the grader can more easily give you partial credit. To avoid plagiarism, if you end up using any of the code I post on the course web page, at the beginning of the Appendix you can say something like the following: \begin{quote} I used some of Professor Draper's \texttt{R} and/or \texttt{WinBUGS} code in this assignment, adapting it as needed. \end{quote} 1.~\textit{[50 points]} For each statement below (10 points each), say whether it's true or false; if it's true without further assumptions, briefly explain why it's true (and --- extra credit (5 points each time) --- what its implications are for statistical inference); if it's sometimes true, give the extra conditions necessary to make it true; if it's false, briefly explain how to change it so that it's true and/or give an example of why it's false. If the statement consists of two or more sub-statements and two or more of them are false, you need to explicitly address all of the false sub-statements in your answer. In answering these questions you may find it helpful to consult the following references, available on the course web page: DS (DeGroot and Schervish (2012)) sections 3.10, 12.5, 12.6; Gelman et al.~(2014) Chapter 11. \begin{itemize} \item[(a)] If you can figure out how to do IID sampling from the posterior distribution of interest to you, this will often be more Monte-Carlo efficient than MCMC sampling from the same posterior. \item[(b)] A (first-order) Markov chain is a particularly simple stochastic process: to simulate where the chain goes next, you only need to know (i) where it is now and (ii) where it was one iteration ago. \item[(c)] The bootstrap is a frequentist simulation-based computational method that can be used to create approximate confidence intervals for population summaries even when the population distribution of the outcome variable $y$ of interest is not known; for example, if all you know from problem context is that your observations $\bm{ y } = ( y_1, \dots, y_n )$ are IID from \textit{some} distribution with finite mean $\mu$ and finite SD $\sigma$, you can use the bootstrap to build an approximate confidence interval for $\mu$ even though you don't know what the population distribution is. \item[(d)] Simulation-based computational methods are needed in Bayesian inference because conjugate priors don't always exist and high-dimensional probability distributions are difficult to summarize algebraically.
\item[(e)] In MCMC sampling from a distribution, you have to be really careful to use a burn-in period of just the right length, because if the burn-in goes on for too long the Markov chain will have missed its chance to find the equilibrium distribution. \end{itemize} 2.~\textit{[210 points]} In late October 1988, a survey was conducted on behalf of \textit{CBS News} of $n =$ 1,447 adults aged 18+ in the United States to ask about their preferences in the upcoming presidential election. Out of the 1,447 people in the sample, $n_1 = 727$ supported George H.W.~Bush, $n_2 = 583$ supported Michael Dukakis, and $n_3 = 137$ supported other candidates or expressed no opinion. The polling organization used a sampling method called \textit{stratified random sampling} that's more complicated than the two sampling methods we know about in this class --- IID sampling (at random with replacement) and simple random sampling (SRS: at random without replacement) --- but here let's pretend they used SRS from the population $\mathcal{ P } =$ \{all American people of voting age in the U.S.~in October 1988\}. There were about 245 million Americans in 1988, of whom about 74\% were 18 or older, so $\mathcal{ P }$ had about 181 million people in it; the total sample size of $n =$ 1,447 is so small in relation to the population size that we can regard the sampling as effectively IID. Under these conditions it can be shown that the only appropriate sampling distribution for the data vector $\bm{ N } = ( n_1, n_2, n_3 )$ is a generalization of the Binomial distribution called the \textit{Multinomial} distribution (see DS section 5.9). Suppose that a population of interest contains items of $p \ge 2$ types (in the example here: people who support \{Bush, Dukakis, other\}, so that in this case $p = 3$) and that the population proportion of items of type $j$ is $0 < \theta_j < 1$. Letting $\bm{ \theta } = ( \theta_1, \dots, \theta_p )$, note that there's a restriction on the components of $\bm{ \theta }$, namely $\sum_{ j = 1 }^p \theta_j = 1$. Now, as in the \textit{CBS News} example, suppose that someone takes an IID sample $\bm{ y } = ( y_1, \dots, y_n )$ of size $n$ from this population and counts how many elements in the sample are of type 1 (call this count $n_1$), type 2 ($n_2$), and so on up to type $p$ ($n_p$); let $\bm{ N } = ( n_1, \dots, n_p )$ be the (vector) random variable that keeps track of all of the counts. In this situation people say that $\bm{ N }$ follows the Multinomial distribution with parameters $n$ and $\bm{ \theta }$, which is defined as follows: $( \bm{ N } \given n \, \bm{ \theta } \, \mathcal{ B } ) \sim \textrm{Multinomial} ( n, \bm{ \theta } )$ iff \begin{equation} \label{multinomial-1} P ( N_1 = n_1, \dots, N_p = n_p \given n \, \bm{ \theta } \, \mathcal{ B } ) = \left\{ \begin{array}{cc} \frac{ n ! }{ n_1 ! \, n_2 ! \, \cdots \, n_p ! } \, \theta_1^{ n_1 } \, \theta_2^{ n_2 } \, \cdots \, \theta_p^{ n_p } & \textrm{if } n_1 + \cdots + n_p = n \\ 0 & \textrm{otherwise} \end{array} \right\} \, . \end{equation} The main scientific interest in this problem focuses on $\gamma = ( \theta_1 - \theta_2 )$, the margin by which Bush was leading Dukakis on the day of the survey. \begin{itemize} \item[(a)] Show that the Multinomial is indeed a direct generalization of the Binomial, if we're careful in the notational conventions we adopt. 
Here's what I mean: the Binomial distribution arises when somebody makes $n$ IID success-failure (Bernoulli) trials, each with success probability $\theta$, and records the number $X$ of successes; this yields the sampling distribution \begin{equation} \label{multinomial-2} ( X \given \theta \, \mathcal{ B } ) \sim \textrm{Binomial} ( n, \theta ) \textrm{ iff } P ( X = x \given \theta \, \mathcal{ B } ) = \left\{ \begin{array}{cc} \left( \begin{array}{c} n \\ x \end{array} \right) \theta^x \, ( 1 - \theta )^{ n - x } & \textrm{for } x = 0, \dots, n \\ 0 & \textrm{otherwise} \end{array} \right\} \, . \end{equation} Briefly and carefully explain why the correspondence between equation (\ref{multinomial-2}) and \{a version of equation (\ref{multinomial-1}) with $p = 2$\} is as in Table \ref{correspondence} \textit{[10 points]}. \begin{table}[t!] \centering \caption{\textit{The Binomial as a special case of the Multinomial: notational correspondence.}} \vspace*{0.1in} \begin{tabular}{c|c} Binomial & Multinomial $( p = 2 )$ \\ \hline $n$ & $n$ \\ $x$ & $n_1$ \\ $( n - x )$ & $n_2$ \\ $\theta$ & $\theta_1$ \\ $( 1 - \theta )$ & $\theta_2$ \end{tabular} \label{correspondence} \end{table} \item[(b)] Returning now to the general Multinomial setting, briefly explain why the likelihood function for $\bm{ \theta }$ given $\bm{ N }$ is \begin{equation} \label{multinomial-3} \ell ( \bm{ \theta } \given \bm{ N } \, \mathcal{ B } ) = c \, \prod_{ j = 1 }^p \theta_j^{ n_j } \, , \end{equation} leading to the log-likelihood function (ignoring the irrelevant constant) \begin{equation} \label{multinomial-4} \ell \ell ( \bm{ \theta } \given \bm{ N } \, \mathcal{ B } ) = \sum_{ j = 1 }^p n_j \, \log \theta_j \, . \end{equation} In finding the MLE $\hat{ \bm{ \theta } }$ of $\bm{ \theta }$, if you simply try, as usual, to set all of the first partial derivatives of $\ell \ell ( \bm{ \theta } \given \bm{ N } \, \mathcal{ B } )$ with respect to the $\theta_j$ equal to 0, you'll get a system of equations that has no solution (try it). This is because in so doing we forgot that we need to do a \textit{constrained optimization}, in which the constraint is $\sum_{ j = 1 }^p \theta_j = 1$. There are thus two ways forward to compute the MLE: \begin{itemize} \item[(i)] Solve the constrained optimization problem directly with \textit{Lagrange multipliers} \textit{(Extra credit [20 points]: do this)}, or \item[(ii)] build the constraint directly into the likelihood function: define \begin{equation} \label{multinomial-5} \ell ( \theta_1, \dots, \theta_{ p - 1 } \given \bm{ N } \, \mathcal{ B } ) = c \left( \prod_{ j = 1 }^{ p - 1 } \theta_j^{ n_j } \right) \left( 1 - \sum_{ j = 1 }^{ p - 1 } \theta_j \right)^{ n_p } \, , \end{equation} from which (ignoring the irrelevant constant) \begin{equation} \label{multinomial-6} \ell \ell ( \theta_1, \dots, \theta_{ p - 1 } \given \bm{ N } \, \mathcal{ B } ) = \sum_{ j = 1 }^{ p - 1 } n_j \, \log \theta_j + n_p \log \left( 1 - \sum_{ j = 1 }^{ p - 1 } \theta_j \right) \, . \end{equation} For $j = 1, \dots, ( p - 1 )$, show that \begin{equation} \label{multinomial-7} \frac{ \partial }{ \partial \, \theta_j } \, \ell \ell ( \theta_1, \dots, \theta_{ p - 1 } \given \bm{ N } \, \mathcal{ B } ) = \frac{ n_j }{ \theta_j } - \frac{ n_p }{ 1 - \sum_{ i = 1 }^{ p - 1 } \theta_i } \, .
\end{equation} The MLE for $( \theta_1, \dots, \theta_{ p - 1 } )$ may now be found by setting $\frac{ \partial }{ \partial \, \theta_j } \, \ell \ell ( \theta_1, \dots, \theta_{ p - 1 } \given \bm{ N } \, \mathcal{ B } ) = 0$ for $j = 1, \dots, ( p - 1 )$ and solving the resulting system of $( p - 1 )$ equations in $( p - 1 )$ unknowns \textit{(Extra credit [20 points]: do this for general $p$)}, but that gets quite messy; let's just do it for $p = 3$, which is all we need for the CBS survey anyway. Solve the two equations \begin{equation} \label{multinomial-8} \left\{ \ \frac{ n_1 }{ \theta_1 } - \frac{ n_3 }{ 1 - \theta_1 - \theta_2 } = 0, \ \ \ \frac{ n_2 }{ \theta_2 } - \frac{ n_3 }{ 1 - \theta_1 - \theta_2 } = 0 \ \right\} \end{equation} for $( \theta_1, \theta_2 )$ and then use the constraints $\sum_{ j = 1 }^3 \theta_j = 1$ and $\sum_{ j = 1 }^3 n_j = n$ to get the MLE for $\theta_3$, thereby demonstrating the (entirely obvious, after the fact) result that \begin{equation} \label{multinomial-9} \hat{ \bm{ \theta } } = \left( \hat{ \theta }_1, \hat{ \theta }_2, \hat{ \theta }_3 \right) = \left( \frac{ n_1 }{ n }, \frac{ n_2 }{ n }, \frac{ n_3 }{ n } \right) \end{equation} \textit{[30 points]}. (The result for general $p$, of course, is that $\hat{ \bm{ \theta } } = \frac{ 1 }{ n } \bm{ N }$.) \end{itemize} \item[(c)] It can be shown \textit{(Extra credit [20 points]: do this for general $p$, by working out the negative Hessian, evaluated at the MLE, to get the information matrix $\hat{ \bm{ I } }$ and then inverting $\hat{ \bm{ I } }$)} that in repeated sampling the estimated covariance matrix of the MLE vector $\hat{ \bm{ \theta } } = \left( \hat{ \theta }_1, \hat{ \theta }_2, \hat{ \theta }_3 \right)$ is \begin{equation} \label{multinomial-10} \hat{ \bm{ \Sigma } } = \left( \begin{array}{ccc} \frac{ \hat{ \theta }_1 ( 1 - \hat{ \theta }_1 ) }{ n } & - \frac{ \hat{ \theta }_1 \, \hat{ \theta }_2 }{ n } & - \frac{ \hat{ \theta }_1 \, \hat{ \theta }_3 }{ n } \\ - \frac{ \hat{ \theta }_1 \, \hat{ \theta }_2 }{ n } & \frac{ \hat{ \theta }_2 ( 1 - \hat{ \theta }_2 ) }{ n } & - \frac{ \hat{ \theta }_2 \, \hat{ \theta }_3 }{ n } \\ - \frac{ \hat{ \theta }_1 \, \hat{ \theta }_3 }{ n } & - \frac{ \hat{ \theta }_2 \, \hat{ \theta }_3 }{ n } & \frac{ \hat{ \theta }_3 ( 1 - \hat{ \theta }_3 ) }{ n } \end{array} \right) \, . \end{equation} Explain why the form of the diagonal elements of $\hat{ \bm{ \Sigma } }$ makes good intuitive sense (by thinking about the corresponding results when there are only $p = 2$ outcome categories); also explain why it makes good sense that the off-diagonal elements of $\hat{ \bm{ \Sigma } }$ are negative. Use $\hat{ \bm{ \Sigma } }$ to compute approximate large-sample standard errors for the MLEs of the $\theta_i$ and of $\gamma$; for $\widehat{ SE } \! \left( \hat{ \gamma } \right)$ you can either \begin{itemize} \item[(i)] work it out directly by thinking about the repeated-sampling variance of the difference of two (correlated) random quantities, or \item[(ii)] use the fact (from AMS 131) that if $\hat{ \bm{ \theta } }$ is a random vector with covariance matrix $\hat{ \bm{ \Sigma } }$ and $\gamma = \bm{ a }^T \, \bm{ \theta }$ for some vector $\bm{ a }$ of constants, then in repeated sampling \begin{equation} \label{multinomial-11} \hat{ V } \! \left( \hat{ \gamma } \right) = \hat{ V } \! \left( \bm{ a }^T \, \hat{ \bm{ \theta } } \right) = \bm{ a }^T \, \hat{ \bm{ \Sigma } } \, \bm{ a } \, . 
\end{equation} \end{itemize} Finally, use your estimated SE for $\hat{ \gamma }$ to construct an approximate (large-sample) 95\% confidence interval for $\gamma$. Was Bush definitively ahead of Dukakis at the point when the survey was conducted? Explain briefly. \textit{[50 points]} \item[(d)] Looking back at equation (\ref{multinomial-3}), if a conjugate prior exists for the Multinomial likelihood it would have to be of the form \begin{quote} $\theta_1$ to a power times $\theta_2$ to a (possibly different) power times~...~times $\theta_p$ to a (possibly different) power. \end{quote} There is such a distribution --- it's called the \textit{Dirichlet}$( \bm{ \alpha } )$ distribution, with $\bm{ \alpha } = ( \alpha_1, \dots, \alpha_p )$ chosen so that all of the $\alpha_i$ are positive: \begin{equation} \label{multinomial-12} p ( \bm{ \theta } \given \mathcal{ B } ) = c \, \prod_{ i = 1 }^p \theta_i^{ \alpha_i - 1 } \, . \end{equation} Briefly explain why this means that the conjugate updating rule is \begin{equation} \label{multinomial-13} \left\{ \begin{array}{ccc} ( \bm{ \theta } \given \mathcal{ B } ) & \sim & \textrm{Dirichlet} ( \bm{ \alpha } ) \\ ( \bm{ N } \given \bm{ \theta } \, \mathcal{ B } ) & \sim & \textrm{Multinomial} ( n, \bm{ \theta } ) \end{array} \right\} \longrightarrow ( \bm{ \theta } \given \bm{ N } \, \mathcal{ B } ) \sim \textrm{Dirichlet} ( \bm{ \alpha } + \bm{ N } ) \, . \end{equation} Given that $\bm{ N } = ( n_1, \dots, n_p )$ and that the $n_j$ represent sample sizes (numbers of observations $y_i$) in each of the $p$ Multinomial categories, briefly explain why this implies that, if context suggests a low-information-content prior, such a prior would correspond to choosing the $\alpha_i$ all close to 0. \textit{[20 points]} \item[(e)] Briefly explain why, if You have a valid way of sampling from the Dirichlet distribution, MCMC sampling is not necessary in fitting model (\ref{multinomial-13}) in this problem: IID Monte Carlo sampling is sufficient. It turns out that the following is a valid way to sample a vector $\bm{ \theta } = ( \theta_1, \dots, \theta_p )$ from the Dirichlet$( \bm{ \alpha } )$ distribution: \begin{itemize} \item[(i)] pick any $\beta > 0$ ($\beta = 1$ is a good choice that leads to fast random number generation); \item[(ii)] for $j = 1, \dots, p$, make $p$ independent draws $g_j$, with draw $j$ coming from the $\Gamma ( \alpha_j , \beta )$ distribution; and \item[(iii)] then just normalize: \begin{equation} \label{multinomial-14} g_j \stackrel{ \textrm{\footnotesize indep} }{ \sim } \Gamma ( \alpha_j , \beta ) \ \ \ \textrm{ and } \ \ \ \theta_j = \frac{ g_j }{ \sum_{ k = 1 }^p g_k } \, . \end{equation} \end{itemize} Write Your own code, using this algorithm, to generate $M$ IID draws from the posterior distribution implied by model (\ref{multinomial-13}), using the CBS News polling data and a diffuse Dirichlet$( \bm{ \alpha } )$ prior with $\bm{ \alpha } = ( \epsilon, \dots, \epsilon )$ for some small $\epsilon > 0$ such as 0.01; in addition to monitoring the components of $\bm{ \theta }$, also monitor $\gamma = ( \theta_1 - \theta_2 )$. Choose a value of $M$ large enough so that the Monte Carlo standard errors of the posterior means of $\gamma$ and the components of $\bm{ \theta }$ are small enough to make all of the posterior mean estimates reliable to at least 3 significant figures, and justify your choice.
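To clarify the mechanics of steps (i), (ii) and (iii), here is a minimal \texttt{R} sketch of the Gamma-normalization sampler (the function name \texttt{rdirichlet} and the objects in the final comment are illustrative placeholders, not required names; treat this as a starting point to verify and adapt, not as the finished analysis):
\begin{verbatim}
## minimal sketch of the Gamma-normalization Dirichlet sampler
## (illustrative names and settings; verify and adapt before using)
rdirichlet <- function( M, alpha ) {
  p <- length( alpha )
  ## M * p independent Gamma( alpha[ j ], 1 ) draws, one column per component j
  g <- matrix( rgamma( M * p, shape = rep( alpha, each = M ), rate = 1 ),
    nrow = M )
  g / rowSums( g )   # normalize each row; each row is one draw of theta
}
## for example, with a hypothetical prior vector alpha.prior and count vector N,
## posterior draws under the conjugate updating rule would be
##   theta.star <- rdirichlet( M, alpha.prior + N )
\end{verbatim}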
Make graphical and numerical summaries of the posterior distributions for $\gamma$ and for each of the components of $\bm{ \theta }$, and compare Your posterior distribution for $\gamma$ with Figure 3.2 (p.~70) from the Gelman et al.~(2014) book that's available at the course web site. How do Your Bayesian answers compare with those from maximum likelihood in this problem? Explain briefly. Compute a Monte Carlo estimate of $p ( \gamma > 0 \given \bm{ N } \, \mathcal{ B } )$, which quantifies the current information about whether Bush is leading Dukakis in the population of all adult Americans, and attach a Monte Carlo standard error to Your estimate. What substantive conclusions do You draw about where the Presidential race stood in late October of 1988, on the basis of Your analysis? Explain briefly. \textit{[100 points]} \textit{Extra credit ([10 points]): Use \texttt{Maple} or some equivalent environment (or paper and pen, if You're brave) to see if You can derive a closed-form expression for $p ( \gamma > 0 \given \bm{ N } \, \mathcal{ B } )$, and compare Your mathematical result with Your simulation-based findings; if no such expression is possible, explain why not.} \end{itemize} 3.~\textit{[380 points]} The data set \texttt{U.S.-family-income-2009.txt} on the course web site contains family income values $\bm{ y } = ( y_1, \dots , y_n )$ from the year 2009 (in units of \$1,000, and sorted from smallest to largest) for a random sample of $n = 842$ American households with incomes under \$1 million, obtained from the \textsf{Current Population Survey} conducted by the \textit{U.S.~Census Bureau}. The point of this problem is to draw inferences about the mean family income $\theta$ in the population $\mathcal{ P }$ of all U.S.~households (with incomes under \$1 million) in 2009. In estimating $\theta$ we'll look at two frequentist-and-Bayesian parametric methods and two frequentist nonparametric approaches; the meta-goal of this problem is to experience statistical model uncertainty and to begin to learn how to cope with it. \begin{itemize} \item[(a)] Make a histogram of these income values (on the density scale) with a large number of bars (e.g., 100); briefly comment on its shape, and visually estimate the mode. Compute the sample mean, median and SD, and compare the mean, median and mode; are they related to each other in the way you would expect them to be, given the distributional shape? Explain briefly. \textit{[40 points]} \end{itemize} Two common parametric sampling models for long-right-hand-tailed variables such as family income are based on the Lognormal and Gamma distributions. Consider first the model \begin{equation} \label{lognormal-1} \left\{ \begin{array}{c} ( Y_i \given \mu \, \sigma^2 \, \mathcal{ B } ) \stackrel{ \textrm{\tiny IID} }{ \sim } \textrm{Lognormal} ( \mu, \sigma^2 ) \\ (i = 1, \dots, n ) \end{array} \right\} \, , \end{equation} in which $Y_i \sim \textrm{Lognormal} ( \mu, \sigma^2 )$ simply means that $W_i \triangleq \log Y_i \sim N ( \mu, \sigma^2 )$. (You can see that Lognormal is a bad name for this distribution: if $W_i = \log Y_i$ is Normal then $Y_i = e^{ W_i }$, so it should really be called the Exponential Normal model, but this name has been in use for more than 100 years and we're stuck with it.) 
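To make this definition concrete, here is a small \texttt{R} sketch (purely illustrative, and not part of any question; the parameter values are arbitrary choices, not estimates from the income data) showing that exponentiating Normal draws produces the same density that \texttt{dlnorm} computes:
\begin{verbatim}
## illustrative check of the Lognormal definition: if W ~ N( mu, sigma^2 ),
## then Y = exp( W ) has the Lognormal( mu, sigma^2 ) density
set.seed( 1 )                    # arbitrary seed, for reproducibility
mu <- 4.1 ; sigma <- 0.9         # arbitrary illustrative values
w <- rnorm( 100000, mean = mu, sd = sigma )
y <- exp( w )
hist( y, breaks = 100, probability = TRUE, xlim = c( 0, 300 ),
  main = '', xlab = 'y' )
curve( dlnorm( x, meanlog = mu, sdlog = sigma ), from = 0, to = 300,
  add = TRUE, lwd = 2 )          # should track the histogram closely
\end{verbatim}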
\begin{itemize} \item[(b)] It can be shown \textit{(Extra credit [20 points]: show these two facts)} that in this model \begin{itemize} \item[(i)] the repeated-sampling density of $Y_i$ is \begin{equation} \label{lognormal-2} p ( y_i \given \mu \, \sigma^2 \, \mathcal{ B } ) = \frac{ 1 }{ \sigma \, y_i \sqrt{ 2 \, \pi } } \exp \left[ - \frac{ ( \log y_i - \mu )^2 }{ 2 \, \sigma^2 } \right] \end{equation} for $y_i > 0$ (and 0 otherwise) and \item[(ii)] $E_{ RS } ( Y_i \given \mu \, \sigma^2 \, \mathcal{ B } ) = \exp \left( \mu + \frac{ \sigma^2 }{ 2 } \right)$, meaning that if we adopt sampling model (\ref{lognormal-1}) then $\theta = \exp \left( \mu + \frac{ \sigma^2 }{ 2 } \right)$. \end{itemize} The definition of the Lognormal distribution --- $W_i \triangleq \log Y_i \sim N ( \mu, \sigma^2 )$ --- and fact (i) suggest two different ways to obtain the MLEs $( \hat{ \mu }_{ MLE }, \hat{ \sigma }_{ MLE }^2 )$ for $\mu$ and $\sigma^2$ in model (\ref{lognormal-1}). Describe both of these ways to get the ML estimators; identify which method is easier, and briefly explain why; and use the easier method to get the MLEs. Plot your histogram of the data set on the density scale from part (a) again, and superimpose a density trace of the Lognormal$( \hat{ \mu }_{ MLE }, \hat{ \sigma }_{ MLE }^2 )$ distribution; does this fit look acceptable? Explain briefly. (\textit{Hint:} You can either write your own Lognormal density function (from equation (\ref{lognormal-2})) or gain access to the Lognormal density function \texttt{dlnorm} in \texttt{R} (and the information about it available with the \texttt{help} command) by first issuing the command \texttt{library( stats )}.) \textit{[60 points]} \item[(c)] Use \texttt{RJAGS} to fit, via MCMC, a Bayesian version of the Lognormal model to the income data with a low-information-content prior. \textit{I'll provide all of the code you need to do this on the course web page.} It turns out that to get DIC values for Bayesian models in \texttt{RJAGS}, you need to run $k = 2$ or more parallel Markov chains with different random number streams, so when you ask for $M$ iterations in \texttt{RJAGS} with $k = 2$ you're actually getting $2 \, M$ iterations; I therefore recommend a burn-in of 1,000 iterations from the starting values I've given you, followed by a monitoring run of $M =$ 50,000 iterations (on my desktop at home this only took about 1.5 minutes). Compare the posterior mean values for $\mu$ and $\sigma^2$ with their corresponding MLEs, and comment briefly on whether the comparison came out as you expected it to. Summarize the posterior for $\theta$ by extracting its posterior mean, SD and 95\% interval from the MCMC output. Compare the mean, SD and several quantiles of the predictive distribution for a new $y$ value with the corresponding quantities from the income data set, and make a qqplot of your random draws from the predictive distribution against the data set; does the model seem to make good predictions? Explain briefly. Get \texttt{RJAGS} to work out the DIC value for the Lognormal model.
\textit{[60 points]} \end{itemize} Now instead consider the model (for $\alpha > 0, \beta > 0$) \begin{equation} \label{gamma-1} \left\{ \begin{array}{c} ( Y_i \given \alpha \, \beta \, \mathcal{ B } ) \stackrel{ \textrm{\tiny IID} }{ \sim } \Gamma ( \alpha, \beta ) \\ (i = 1, \dots, n ) \end{array} \right\} \ \ \ \longleftrightarrow \ \ \ p ( y_i \given \alpha \, \beta \, \mathcal{ B } ) = \left\{ \begin{array}{cc} \frac{ \beta^\alpha }{ \Gamma ( \alpha ) } y_i^{ \alpha - 1 } e^{ - \beta \, y_i } & \textrm{for } y_i > 0 \\ 0 & \textrm{otherwise} \end{array} \right\} \, , \end{equation} using the parameterization in which $E_{ RS } ( Y_i \given \alpha \, \beta \, \mathcal{ B } ) = \frac{ \alpha }{ \beta }$, meaning that if we adopt sampling model (\ref{gamma-1}) then $\theta = \frac{ \alpha }{ \beta }$. \begin{itemize} \item[(d)] Show that the log-likelihood function in this model is \begin{equation} \label{gamma-2} \ell \ell ( \alpha \, \beta \given \bm{ y } \, \mathcal{ B } ) = n \, \alpha \, \log \beta - n \, \log \Gamma ( \alpha ) + ( \alpha - 1 ) \sum_{ i = 1 }^n \log y_i - \beta \sum_{ i = 1 }^n y_i \, . \end{equation} Briefly explain why this means that no closed-form (algebraic) expressions for the MLEs for $\alpha$ and $\beta$ are possible in this model; numerical methods are needed to maximize (\ref{gamma-2}). \textit{(Extra credit [20 points]: Figure out how to use the \texttt{R} function \texttt{optim} to numerically maximize the log-likelihood function; note that \texttt{optim} has a \texttt{hessian = TRUE} option that will deliver to you not only the MLEs but also the Hessian evaluated at the MLEs.)} It turns out with this data set that $( \hat{ \alpha }_{ MLE }, \hat{ \beta }_{ MLE } ) \doteq ( 1.305, 0.01575 )$. Plot your histogram of the data set on the density scale from part (a) yet again, and superimpose a density trace of the $\Gamma( \hat{ \alpha }_{ MLE }, \hat{ \beta }_{ MLE } )$ distribution; does \textit{this} fit look acceptable? Explain briefly. \textit{[30 points]} \item[(e)] Now use \texttt{RJAGS} to fit, via MCMC, a Bayesian version of the Gamma model to the income data with a low-information-content prior. \textit{Again, I'll provide all of the code you need to do this on the course web page.} This model takes longer to fit (this fact is related to the difficulty you saw in part (d) with maximum-likelihood fitting), so you may wish to settle for a smaller value of $M$ (e.g., 10,000) than you did with the Lognormal model (on my desktop at home, 20,000 monitoring iterations took about 70 seconds; you can adjust your value of $M$ based on the speed of your machine and your patience). As you did in the Lognormal case, summarize the posterior for $\theta$ by extracting its posterior mean, SD and 95\% interval from the MCMC output. Compare the mean, SD and several quantiles of the predictive distribution for a new $y$ value with the corresponding quantities from the income data set, and make a qqplot of your random draws from the predictive distribution against the data set; does \textit{this} model seem to make good predictions? Explain briefly. Get \texttt{RJAGS} to work out the DIC value for the Gamma model. \textit{[60 points]} \item[(f)] Since interest focuses on the population mean $\theta$, one possible estimator of $\theta$ is the sample mean $\bar{ Y }$. Use the Central Limit Theorem (CLT) to construct an approximate 95\% confidence interval for $\theta$ based on $\bar{ Y }$.
\textit{[10 points]} \end{itemize} Notice that the interval you created in part (f) is based on a frequentist nonparametric method: in building your interval \vspace*{0.025in} you made no assumptions about the form of the sampling distribution. \textbf{\fbox{Question:}} \vspace*{0.025in} how well calibrated is the CLT interval? By this I mean the usual notion of validity/calibration of 95\% intervals: if the process leading to your interval in part (f) is repeated many times, do the resulting intervals include the true parameter value $\theta$ approximately 95\% of the time? In answering this question it looks like we need to know what the truth is about $\theta$. Since we don't know $\theta$, it turns out that the best we can do -- in estimating the calibration of the CLT interval process -- is to use the bootstrap idea we talked about in class: we can take random samples from our \textit{sample} and use these as proxies for what we would get if we were able to take a new random sample of $n = 842$ households from the population $\mathcal{ P }$. In this bootstrapping, we'll pretend that the true $\theta$ is the sample mean $\bar{ y } \doteq 82.88$ you calculated in part (a). \begin{itemize} \item[(g)] Write a program in \texttt{R} to repeatedly ($M =$ 10,000 times, say) draw samples of size $n = 842$ at random with replacement from the data set \texttt{U.S.-family-income-2009.txt}, compute the CLT-based 95\% interval each time, and work out the proportion of these intervals that include the actual sample mean (put a copy of your code in the Appendix, as usual). To get ready for parts (h) and (i), also keep track of the $M$ simulated sample means $\bar{ y }^*$ you calculated in assessing calibration of the CLT interval-building process. \textit{[30 points]} \item[(h)] The CLT method assumes that $n = 842$ is large enough to produce a Normal sampling distribution for $\bar{ Y }$. Check this assumption by making a normal qqplot of the $\bar{ y }^*$ values you generated in part (g) --- has the CLT done its magic with $n = 842$, even though the population distribution had a quite long right-hand tail? Explain briefly. \textit{[20 points]} \item[(i)] In producing the bootstrap distribution of the sample means $\bar{ y }^*$ in part (g), you've also done all of the hard work to construct the bootstrap confidence interval for $\theta$, which you can now obtain simply by getting \texttt{R} to work out the 2.5\% and 97.5\% quantiles of the $\bar{ y }^*$ distribution. How does this bootstrap CI compare with the CLT-based CI? Explain briefly. \textit{[20 points]} \begin{table}[t!] \centering \caption{\textit{Summary of inference about $\theta$ across six models/methods; ML = maximum likelihood.}} \vspace*{0.1in} \begin{tabular}{c|ccc} & \multicolumn{3}{c}{$\theta$} \\ \cline{2-4} & Mean/ & SD/ \\ Method/Model & Estimate & SE & 95\% Interval \\ \hline \hline Lognormal ML & $\cdot$ & --- & --- \\ Lognormal Bayes & $\cdot$ & $\cdot$ & $( \, \cdot \, , \, \cdot \, )$ \\ \hline Gamma ML & $\cdot$ & --- & --- \\ Gamma Bayes & $\cdot$ & $\cdot$ & $( \, \cdot \, , \, \cdot \, )$ \\ \hline CLT & $\cdot$ & $\cdot$ & $( \, \cdot \, , \, \cdot \, )$ \\ Bootstrap & $\cdot$ & $\cdot$ & $( \, \cdot \, , \, \cdot \, )$ \end{tabular} \label{summary-1} \end{table} \item[(j)] Summarize all of this by completing Table \ref{summary-1}; a dash (---) in an entry means that you don't need to fill it in. 
Given your conclusions about model fit and validity of method, which numbers in Table \ref{summary-1} seem most reliable to you? What do you conclude, both about mean income in the U.S.~in 2009 and about the challenge of model uncertainty? \textit{[50 points]} \end{itemize} \end{document}