A conflict(?) between Frequentists and Bayesians: The Jeffreys-Lindley Paradox

The Jeffreys-Lindley paradox is an apparently puzzling problem in statistical inference. It is often seen that frequentist and Bayesian approaches to testing a point null hypothesis (i.e. a simple hypothesis) lead to divergent results, especially when the sample size is large, and for different choices of the prior distribution of the parameter under study.

Statement of the paradox:

The paradox can be understood in the general setting as follows:
  • Let $x$ denote the observation or the data obtained from the experiment under study.
  • A test of significance rejects the null hypothesis $H_0 : \theta = \theta_0$ at level of significance $\alpha$.
  • The posterior probability of $H_0$, given the data $x$, is very high even for a small prior probability of $H_0$.
Lindley's original formulation of the problem in his paper "A Statistical Paradox" published in 1957 may be stated as follows:

Suppose we compare different sets of observations with varying sample sizes $n$, all of which produce equally significant p-values (say, 0.01) when a frequentist test of significance is performed. Then, as the sample size $n$ increases, the Bayesian approach reveals that the data increasingly support the null hypothesis. Thus the Bayesian approach accepts a null hypothesis which the frequentist approach rejects.

Lindley discussed this paradox in the context of Gaussian models. The paradox may be formally stated as follows:

In a Gaussian model $X \sim N(\theta, \sigma^2)$ with known variance $\sigma^2$, assume the null hypothesis $H_0 : \theta = \theta_0$ and any regular proper prior distribution on $\theta$. Then, for any testing level $\alpha$ and any $\varepsilon > 0$, we can find a sample size $n$ and independent, identically distributed data $x_1, x_2, \ldots, x_n$ such that

  • The sample mean $\bar{x}$ is significantly different from $\theta_0$ at level $\alpha$;
  • The posterior probability $P(H_0 \mid x_1, \ldots, x_n)$ is at least as big as $1 - \varepsilon$.
Thus it appears that the two approaches are at odds with each other regarding the acceptance or rejection of the null hypothesis based on the same set of observations.
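Before going through the algebra, the clash can be seen on a concrete numerical example. The following Python sketch is purely illustrative: the prior mass $c = 1/2$, $\sigma = 1$, the interval $(-5, 5)$, the sample size $n = 50{,}000$ and the critical value $2.5758$ are all assumed values, not taken from Lindley's paper. It computes both the frequentist two-sided p-value and the Bayesian posterior probability of the null for a sample mean sitting exactly on the 1% significance boundary.

```python
import math

def posterior_null(xbar, n, theta0=0.0, sigma=1.0, c=0.5, interval=(-5.0, 5.0)):
    """Posterior P(theta = theta0 | data) under Lindley's prior:
    probability mass c at theta0, the rest spread uniformly over `interval`."""
    width = interval[1] - interval[0]
    like_null = math.exp(-n * (xbar - theta0) ** 2 / (2 * sigma ** 2))
    # Since xbar lies well inside the interval, the integral of the Gaussian
    # likelihood over it is approximately sigma * sqrt(2 * pi / n).
    like_alt = sigma * math.sqrt(2 * math.pi / n) / width
    return c * like_null / (c * like_null + (1 - c) * like_alt)

def two_sided_p(xbar, n, theta0=0.0, sigma=1.0):
    """Two-sided p-value of the usual z-test for H0: theta = theta0."""
    z = abs(xbar - theta0) * math.sqrt(n) / sigma
    return math.erfc(z / math.sqrt(2))

# Place the sample mean exactly on the two-sided 1% significance boundary.
n = 50_000
z_alpha = 2.5758               # two-sided 1% critical value of N(0, 1)
xbar = z_alpha / math.sqrt(n)

print(two_sided_p(xbar, n))     # about 0.01: the frequentist test rejects H0
print(posterior_null(xbar, n))  # about 0.97: the Bayesian analysis favours H0
```

Both numbers are computed from the same data, yet they point in opposite directions: the test rejects the null at the 1% level while the posterior puts almost all its mass on it.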

Mathematical justification:

We proceed to show the apparent discrepancy between the two approaches in the case of the Gaussian model as stated originally by Lindley.

Let $x_1, x_2, \ldots, x_n$ be a random sample from a normal distribution of mean $\theta$ and known variance $\sigma^2$. Let the prior probability that $\theta = \theta_0$, the value under the null hypothesis, be $c$. Suppose that the remainder of the prior probability, $1 - c$, is distributed uniformly over some interval $I$ containing $\theta_0$. We shall deal with situations where $\bar{x}$, the arithmetic mean of the observations, and a minimal sufficient statistic for $\theta$, is well within the interval $I$.

The posterior probability that $\theta = \theta_0$, in the light of the sample drawn, is given by Bayes's theorem as

$$\bar{c} = \frac{c\, e^{-n(\bar{x} - \theta_0)^2 / 2\sigma^2}}{c\, e^{-n(\bar{x} - \theta_0)^2 / 2\sigma^2} + \dfrac{1-c}{|I|} \displaystyle\int_I e^{-n(\bar{x} - \theta)^2 / 2\sigma^2}\, d\theta},$$

where $|I|$ denotes the length of the interval $I$. By virtue of the assumptions about $\bar{x}$ and $I$, the integral can be evaluated as $\sigma\sqrt{2\pi/n}$.

Now suppose that the value of $\bar{x}$ is such that, on performing the usual significance test for the mean $\theta$ of a normal distribution with known variance, the result is significant at the $100\alpha\%$ point. That is, $\bar{x} = \theta_0 + z_{\alpha}\,\sigma/\sqrt{n}$, where $z_{\alpha}$ is a number dependent on $\alpha$ only and can be found from tables of the normal distribution function. Putting this value for $\bar{x}$, we have the following value for the posterior probability that $\theta = \theta_0$:

$$\bar{c} = \frac{c\, e^{-z_\alpha^2/2}}{c\, e^{-z_\alpha^2/2} + \dfrac{(1-c)\,\sigma}{|I|}\sqrt{2\pi/n}}.$$

(Note that $\sigma/\sqrt{n}$ tends to zero as $n$ increases, so that $\bar{x}$ will lie well within the interval $I$ for sufficiently large $n$.)

We observe that as $n \to \infty$, $\bar{c} \to 1$; i.e. the Bayesian approach will be increasingly inclined to accept the null hypothesis as the sample size $n$ increases while the p-value remains constant, leading to the paradox.
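This limiting behaviour is easy to check numerically. The sketch below is a minimal illustration, not Lindley's own computation: the prior mass $c = 1/2$, $\sigma = 1$, the interval length $|I| = 10$ and the grid of sample sizes are all assumed values. The test statistic is pinned at $z = 2.5758$, so the two-sided p-value stays at 0.01 while $n$ grows.

```python
import math

def posterior_given_p(n, c=0.5, sigma=1.0, width=10.0, z=2.5758):
    """Posterior probability of H0 when the sample mean sits exactly on the
    two-sided 1% significance boundary (test statistic fixed at z)."""
    like_null = c * math.exp(-z ** 2 / 2)
    like_alt = (1 - c) * sigma * math.sqrt(2 * math.pi / n) / width
    return like_null / (like_null + like_alt)

# The p-value is 0.01 in every row, yet the posterior climbs towards 1.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(posterior_given_p(n), 3))
```

The Gaussian factor $e^{-z^2/2}$ in the numerator is constant, while the $\sqrt{2\pi/n}$ term in the denominator shrinks with $n$, which is exactly why $\bar{c} \to 1$.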

Reasons behind the paradox:

We briefly discuss the reasons behind this apparent paradox.
  • For consistent tests used in the frequentist approach, the power of the test converges to 1 as the sample size increases. This means that even small deviations from the null hypothesis are regarded as significant, resulting in a small p-value; this in itself is not paradoxical, since any good test should be consistent.

  • The frequentist and Bayesian approaches answer two fundamentally different questions, and the results obtained from the two are easily misinterpreted. A small p-value obtained from the frequentist test indicates that the deviation from the null hypothesis is significant, but it does not take the alternative hypothesis into account, so it cannot conclude that the alternative is more plausible in the light of the given sample. A small p-value simply indicates that the data do not support the null hypothesis. The Bayesian approach, on the other hand, compares the posterior odds of the competing null and alternative hypotheses. It is to be understood that the null value to be tested is fundamentally different from the other values in the parameter space: we perform such tests only when the null value of the parameter holds particular interest for us. Now if the prior under $H_0$ is concentrated on a single point value and the prior under $H_1$ is very diffuse, such that the null value of the parameter is a better fit to the data than most, but not necessarily all, of the values in the parameter space, then the Bayesian approach concludes that the null is a better fit to the data than the alternative.
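The first point, that the power of a consistent test tends to 1, can also be made concrete. The following minimal sketch assumes a two-sided z-test at the 5% level (critical value 1.96), $\sigma = 1$, and a fixed small deviation $\delta = 0.05$ from the null; all of these values are chosen purely for illustration.

```python
import math

def normal_cdf(x):
    """Standard normal distribution function via the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

def power(delta, n, sigma=1.0, z_crit=1.96):
    """Power of the two-sided z-test at level 5% against theta = theta0 + delta."""
    shift = delta * math.sqrt(n) / sigma
    return normal_cdf(shift - z_crit) + normal_cdf(-shift - z_crit)

# Even a tiny deviation of 0.05 is detected almost surely once n is large.
for n in (100, 1_000, 10_000, 100_000):
    print(n, round(power(0.05, n), 3))
```

For small $n$ the test almost never detects the deviation, while for large $n$ the power is essentially 1: any fixed departure from the null, however small, eventually produces a tiny p-value.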

References:

  1.  Jeffreys, Harold (1939). Theory of Probability. Oxford University Press. MR 0000924.
  2.  Lindley, D.V. (1957). "A statistical paradox". Biometrika. 44 (1–2): 187–192. doi:10.1093/biomet/44.1-2.187. JSTOR 2333251.
  3.  Spanos, Aris (2013). "Who should be afraid of the Jeffreys-Lindley paradox?". Philosophy of Science. 80 (1): 73–93. doi:10.1086/668875.
