KURTOSIS: Peakedness or tailedness?


In most introductory courses in statistics, we get acquainted with the various descriptive measures used to summarize the different features of a given data set. Our emphasis lies primarily on the computation of these descriptive measures using a set of mathematical formulas. We do not give much thought to their interpretation. In some cases, the measures may be misinterpreted or not completely understood. Kurtosis is one such instance!
Kurtosis is one of the most useful measures of a distribution, but it is also one of the most commonly misinterpreted. Many textbooks interpret kurtosis as the degree of peakedness or flatness of a theoretical probability distribution, or of the histogram obtained from a number of sample observations. We recently came across one such instance of misinterpretation of kurtosis while solving exercises from the celebrated book Statistical Inference by George Casella and Roger L. Berger (to be precise, Exercise 2.28, p. 79, 2nd edition, 20th Indian Reprint, 2017), where it says, and we quote: "The skewness measures the lack of symmetry in the pdf. The kurtosis, although harder to interpret, measures the peakedness or flatness of the pdf." This statement is common in many introductory statistics textbooks at both the high school and undergraduate levels, but it is hardly expected in such a book, which prompted us to write this post.

In classical textbooks, the terms mesokurtic, platykurtic, and leptokurtic appear as descriptions of distributions with zero, negative, and positive kurtosis, respectively. These terms, while impressive-sounding, are actually quite misleading. The problem with these terms is that the prefixes platy- and lepto- are descriptors of the peak of a distribution, rather than its tails. Platykurtic means "broad-peaked" and leptokurtic means "thin-peaked," and this is how kurtosis is often presented: as a descriptor of the peak of the distribution. The word kurtosis itself derives from a word meaning "curve."
Kurtosis is formally defined for univariate distributions as the standardized fourth population central moment, adjusted by the constant 3:
$ \boldsymbol{\mathit{\gamma_2 = \frac{E(X-\mu)^4}{(E(X-\mu)^2)^2} -3= \frac{\mu_4}{\sigma^4}-3=\beta_2-3}}$
where E is the expectation operator, $ \mu$ is the mean, $\mu_4$ is the fourth moment about the mean, and $ \sigma$ is the standard deviation of the distribution of the random variable X. The corresponding sample counterparts to $ \boldsymbol{\gamma_2}$ and $ \boldsymbol{\beta_2}$, obtained by replacing the population moments by the sample central moments, are given by
$ \boldsymbol{\mathit{g_2 = \frac{\sum_{i=1}^{n} (X_i-\bar{X})^4/n}{(\sum_{i=1}^{n} (X_i-\bar{X})^2/n)^2}-3=\frac{m_4}{m_2^2}-3=b_2-3}}$
where $ g_2$ is the sample kurtosis, $ \bar{X}$ is the sample mean and n is the number of sample observations.
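For concreteness, the sample formula translates directly into code. The following is a minimal Python sketch (the post's own computations were done in R, per the note at the end); the example data are made up purely for illustration:

```python
def sample_kurtosis(x):
    """Excess sample kurtosis g2 = m4/m2^2 - 3, using n (not n-1) in both moments."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((xi - mean) ** 2 for xi in x) / n  # second central moment
    m4 = sum((xi - mean) ** 4 for xi in x) / n  # fourth central moment
    return m4 / m2 ** 2 - 3

# A single extreme value inflates g2 sharply, which foreshadows the outlier
# interpretation discussed later in this post.
data = [0.1, -0.2, 0.3, -0.1, 0.0, 0.2, -0.3, 0.1]
print(sample_kurtosis(data))
print(sample_kurtosis(data + [50.0]))
```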
This definition of kurtosis was given by Karl Pearson in 1905 as a measure of departure from normality. He defined a distribution to be "leptokurtic", "mesokurtic" or "platykurtic" according as $ \gamma_2$ is positive, zero or negative. He equated the concept of kurtosis with the degree of flat-toppedness relative to the normal distribution. Two distributions having the same degree of variability as measured by the standard deviation may be relatively more or less flat-topped than the normal curve. If more flat-topped, he termed them platykurtic; if less flat-topped, leptokurtic; and if equally flat-topped, mesokurtic.
As Peter H. Westfall notes in his paper "Kurtosis as Peakedness, 1905–2014. R.I.P.", a vast number of websites, textbooks and even leading academic journals are responsible for promoting the ambiguous interpretation of kurtosis as a measure of the degree of peakedness of a distribution, of the heaviness of its tails, or both; he also notes the rarity of articles that give the true interpretation of kurtosis.
For a theoretical probability distribution, $ \beta_2 \geq 1$ and hence $\gamma_2 \geq -2$, equality being attained for a two-point distribution having equal probability at the two points. For a normal distribution, $ \beta_2 = 3$ and hence $ \gamma_2 = 0$.
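The equality case is easy to verify by hand: for a variable taking the values ±c with probability 1/2 each, the mean is 0, the variance is c², and the fourth central moment is c⁴, so β₂ = 1 and γ₂ = −2 exactly. A quick numerical sketch:

```python
# Two-point distribution: P(X = c) = P(X = -c) = 1/2.
def beta2_two_point(c):
    mu = 0.5 * c + 0.5 * (-c)                          # mean: 0
    var = 0.5 * (c - mu) ** 2 + 0.5 * (-c - mu) ** 2   # variance: c^2
    mu4 = 0.5 * (c - mu) ** 4 + 0.5 * (-c - mu) ** 4   # fourth central moment: c^4
    return mu4 / var ** 2

print(beta2_two_point(3.0))  # 1.0, so gamma2 attains its lower bound of -2
```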
Westfall argues that the only unambiguous interpretation of kurtosis is in terms of tail extremity. By this he means that a "high" value of sample kurtosis reflects the presence of extreme values or outliers in the sample observations, while a "high" value of kurtosis for a theoretical distribution indicates the propensity to produce outliers when a random sample is drawn from that distribution.
For common unimodal bell-shaped distributions, it is seen that distributions having a large value of kurtosis tend to have higher peaks, and heavy-tailed distributions do sometimes have higher peaks than light-tailed distributions. However, counterexamples can be cited demonstrating that probability distributions having a lower or higher value of kurtosis than the standard normal density may be either more peaked or less peaked than the normal density. We provide four examples of densities, all having unit standard deviation and symmetric about the mean 0, which together exhibit every combination: negative $ \boldsymbol{\gamma_2}$ with a higher peak, positive $ \boldsymbol{\gamma_2}$ with a lower peak, positive $ \boldsymbol{\gamma_2}$ with a higher peak, and negative $ \boldsymbol{\gamma_2}$ with a lower peak, relative to the standard normal density $ \boldsymbol{\mathit{f(x) = \frac{1}{\sqrt{2\pi}}e^{\frac{-x^2}{2}},x \in \mathbb{R}}}$ having $ \boldsymbol{\mathit{\gamma_2 = 0}}$ and $ \boldsymbol{\mathit{f(0) = 0.3989423}}$:
  1. $ \boldsymbol{\mathit{f_1(x) = \frac{1}{3\sqrt\pi}(\frac{9}{4}+x^4)e^{-x^2},x \in \mathbb{R} , f_1(0) = 0.4231422,\mu_4 = 2.75 , \gamma_2=-0.25}}$
  2. $ \boldsymbol{\mathit{f_2(x) = \frac{3}{2\sqrt{2\pi}}e^{-\frac{x^2}{2}}-\frac{1}{6\sqrt\pi}(\frac{9}{4}+x^4)e^{-x^2},x \in \mathbb{R} ,  f_2(0) = 0.3868423,\mu_4 = 3.125 , \gamma_2=0.125}}$
  3. $ \boldsymbol{\mathit{f_3(x) = \frac{1}{6\sqrt\pi}(e^{\frac{-x^2}{4}}+4e^{-x^2}),x \in \mathbb{R} ,  f_3(0) = 0.470158,\mu_4 = 4.5 , \gamma_2 = 1.5}}$
  4. $ \boldsymbol{\mathit{f_4(x) = \frac{3\sqrt3}{16\sqrt\pi}(2+x^2)e^{\frac{-3x^2}{4}},x \in \mathbb{R} ,  f_4(0) = 0.3664519,\mu_4 = \frac{8}{3} = 2.667 , \gamma_2 = -0.333}}$
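The moments of these densities can be checked numerically. The following Python sketch (pure standard library, using Simpson's rule for the integrals; the truncation of the integration range at ±20 is a harmless numerical shortcut) evaluates each density's total mass, variance, excess kurtosis and peak height:

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += f(a + i * h) * (4 if i % 2 else 2)
    return s * h / 3

# The four example densities from the post, plus the standard normal peak height.
def f1(x): return (9 / 4 + x ** 4) * math.exp(-x ** 2) / (3 * math.sqrt(math.pi))
def f2(x): return 3 * math.exp(-x ** 2 / 2) / (2 * math.sqrt(2 * math.pi)) - 0.5 * f1(x)
def f3(x): return (math.exp(-x ** 2 / 4) + 4 * math.exp(-x ** 2)) / (6 * math.sqrt(math.pi))
def f4(x): return 3 * math.sqrt(3) * (2 + x ** 2) * math.exp(-3 * x ** 2 / 4) / (16 * math.sqrt(math.pi))

phi0 = 1 / math.sqrt(2 * math.pi)  # peak of the standard normal, ~0.3989423

for name, f in [("f1", f1), ("f2", f2), ("f3", f3), ("f4", f4)]:
    total = simpson(f, -20, 20)                      # should be ~1
    var = simpson(lambda x: x ** 2 * f(x), -20, 20)  # should be ~1 (unit sd)
    mu4 = simpson(lambda x: x ** 4 * f(x), -20, 20)
    peak = "more" if f(0) > phi0 else "less"
    print(f"{name}: f(0)={f(0):.7f}  total={total:.6f}  var={var:.6f}  "
          f"gamma2={mu4 / var ** 2 - 3:+.4f}  {peak} peaked than N(0,1)")
```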
The comparison of the peakedness of the densities is illustrated in the following figure.
Fig.1: Peakedness of various densities w.r.t. the normal density
Westfall contends that the persistent association of the "peakedness" of a distribution with its kurtosis results from the misinterpretation of histograms of heavy-tailed data, i.e. data having extreme values or outliers. To give a concrete illustration, we adopt his example.
Fig.2: Histogram of a random sample of size n = 1000 from the standard Cauchy distribution


A typical histogram, as in Fig.2, is obtained when a random sample of size n = 1000 is generated from a standard Cauchy distribution (having median zero and scale parameter unity). The histogram may give the appearance of a distinct, narrow, sharp peak around the value zero, but this is actually a visual effect of the scaling of the horizontal axis. The presence of outliers or extreme observations determines the appearance of the histogram in this case. Here the dotted lines demarcate the parts of the histogram within one standard deviation of the mean of the generated data.
The generated data has mean $ \boldsymbol{\mathit{m= -1.551403}}$ and standard deviation $ \boldsymbol{\mathit{sd= 34.85339}}$ (using n instead of the usual (n-1) in the denominator). If $ \boldsymbol{\mathit{z_i = \frac{x_i - m}{sd},i=1,2,\dots,n}}$ are the standardized values, then the sample kurtosis is $ \boldsymbol{\mathit{g_2 = b_2 - 3 = \frac{\sum_{i=1}^{n}z_i^4}{n} - 3 = 437.3871 - 3 = 434.3871}}$. The contributions of values within and outside one standard deviation of the mean to the value of $ \boldsymbol{\mathit{b_2}}$, and hence $ \boldsymbol{\mathit{g_2}}$, are
$ \boldsymbol{\mathit{p_1 = (1/n)\sum\limits_{|z_i|<1}z_i^4 = 0.007308361}}$
and
$ \boldsymbol{\mathit{p_2 = (1/n)\sum\limits_{|z_i|\geq1}z_i^4 = 437.3798}}$
Thus the proportion of the value of $ \boldsymbol{\mathit{b_2}}$ contributed by $ \boldsymbol{\mathit{p_1}}$ is 0.007308361/437.3871 = 0.00001670914, which is extremely small. This shows that the kurtosis statistic has nothing to do with measuring the peakedness of the histogram in this case; rather, it measures the effect of the outliers.
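The whole experiment is straightforward to replicate. The following Python sketch generates its own Cauchy sample (the seed is arbitrary, so the numbers will differ from the R sample used above, but the qualitative conclusion is the same):

```python
import math
import random

random.seed(1)  # arbitrary seed, purely for reproducibility of this sketch
n = 1000
# Standard Cauchy via the inverse CDF: x = tan(pi * (u - 1/2)), u ~ Uniform(0, 1)
x = [math.tan(math.pi * (random.random() - 0.5)) for _ in range(n)]

m = sum(x) / n
sd = math.sqrt(sum((xi - m) ** 2 for xi in x) / n)  # n, not n-1, in the denominator
z = [(xi - m) / sd for xi in x]

b2 = sum(zi ** 4 for zi in z) / n
p1 = sum(zi ** 4 for zi in z if abs(zi) < 1) / n  # contribution of the central portion
p2 = b2 - p1                                      # contribution of the tails
print(f"g2 = {b2 - 3:.2f}, central share of b2 = {p1 / b2:.2e}, "
      f"tail share = {p2 / b2:.6f}")
```

Note that the central share is bounded: every standardized value with |z| < 1 contributes less than 1/n, so p1 < 1 regardless of the sample, while b2 for Cauchy data is typically in the hundreds.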
The logic for why the kurtosis formula given by $ \boldsymbol{\mathit{\gamma_2}}$ measures outlier propensity is as follows. First, note that the term $ \boldsymbol{\mathit{(\frac{X-\mu}{\sigma})^4}}$ is always non-negative, with the occasional values that are far from $ \boldsymbol{\mathit{\mu}}$ (outliers) being greatly influential, since they are raised to the 4th power. The net result is that the kurtosis will be large when the distribution produces occasional outliers. The −3 term makes the kurtosis exactly 0 in the case of the normal distribution. Thus, if the kurtosis ($\boldsymbol{\mathit{\gamma_2}}$) is greater than zero, the distribution is more outlier-prone than the normal distribution, i.e. its tails are fatter than those of the normal distribution; and if the kurtosis is less than zero, the distribution is less outlier-prone than the normal distribution, i.e. its tails are thinner than those of the normal distribution.
We now proceed to give some mathematical explanation of this point. Define the random variable $ \boldsymbol{\mathit{Z = \frac{X-\mu}{\sigma}}}$ and a central portion of the distribution of Z as $ \boldsymbol{\mathit{A_h = \{z:|z|\leq{h}\}}}$ for some positive real number h. Assuming X, and hence Z, to be an absolutely continuous random variable with pdf $ \boldsymbol{\mathit{f(z)}}$, we define the quantities $ \boldsymbol{\mathit{Center_h = \int\limits_{A_h}z^4f(z)dz}}$ and $ \boldsymbol{\mathit{Tail_h = \int\limits_{A_h^c}z^4f(z)dz}}$. We note that the kurtosis $ \boldsymbol{\mathit{\gamma_2}}$ can be written as follows:
$ \boldsymbol{\mathit{\gamma_2 = \int\limits_{A_h}z^4f(z)dz+\int\limits_{A_h^c}z^4f(z)dz - 3 = Center_h +Tail_h - 3}}$
Clearly, $ \boldsymbol{\mathit{Center_h \geq 0}}$. Moreover, since $ \boldsymbol{\mathit{z^4\leq h^4}}$ when z is in $ \boldsymbol{\mathit{A_h}}$, we have $ \boldsymbol{\mathit{Center_h=\int_{A_h}z^4f(z)dz\le h^4\int_{A_h}f(z)dz}}$, and since $ \boldsymbol{\mathit{\int_{A_h}f(z)dz\le \int f(z)dz=1}}$, it follows that

$ \boldsymbol{\mathit{0 \leq Center_h \leq h^4}}$

$ \boldsymbol{\mathit{\Rightarrow Tail_h - 3 \leq Center_h + Tail_h - 3 \leq Tail_h - 3 + h^4}}$

$ \boldsymbol{\mathit{\Rightarrow Tail_h - 3 \leq \gamma_2 \leq Tail_h - 3 + h^4}}$

This implies that the kurtosis is largely determined by the tail. It also follows that for large kurtosis, the portion determined by the center is vanishingly small, no matter how many standard deviations from the mean define the center.
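These bounds can be checked numerically for the standard normal distribution, for which γ₂ = 0. The following Python sketch computes Center_h and Tail_h by Simpson's rule (tails truncated at ±20, a harmless numerical shortcut) and verifies the inequalities for a few values of h:

```python
import math

def simpson(f, a, b, n=20000):
    """Composite Simpson's rule with n (even) subintervals."""
    h = (b - a) / n
    s = f(a) + f(b)
    for i in range(1, n):
        s += f(a + i * h) * (4 if i % 2 else 2)
    return s * h / 3

def g(z):
    """Integrand z^4 * f(z), where f is the standard normal pdf (already standardized)."""
    return z ** 4 * math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

for h in (0.5, 1.0, 2.0):
    center = simpson(g, -h, h)                      # Center_h
    tail = simpson(g, -20, -h) + simpson(g, h, 20)  # Tail_h
    gamma2 = center + tail - 3                      # ~0 for the normal
    assert 0 <= center <= h ** 4                    # the key bound on the center
    assert tail - 3 <= gamma2 <= tail - 3 + h ** 4  # the sandwich on gamma2
    print(f"h={h}: Center_h={center:.4f} (<= h^4 = {h ** 4}), "
          f"Tail_h={tail:.4f}, gamma2={gamma2:.6f}")
```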

Note: The R Code for the graphs and computations is at https://github.com/curiousmindsstats/Kurtosis

References:
  1. Westfall, P. H. (2014), "Kurtosis as Peakedness, 1905–2014. R.I.P.," The American Statistician, 68(3), 191–195. DOI: 10.1080/00031305.2014.917055
  2. Kaplansky, I. (1945), "A Common Error Concerning Kurtosis," Journal of the American Statistical Association, 40, 259.
  3. Westfall, P. H., and Henning, K. S. S. (2013), Understanding Advanced Statistical Methods, Boca Raton, FL: Chapman & Hall/CRC Texts in Statistical Science Series.
  4. Pearson, K. (1905), "Das Fehlergesetz und seine Verallgemeinerungen durch Fechner und Pearson. A Rejoinder," Biometrika, 4, 169–212.
