{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CHAPTER 16. Metric-Predicted Variable on One or Two Groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Psygrammer / Bayesian R study group [1]\n", "* Kim Mu-seong" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 16.1. Estimating the Mean and Standard Deviation of a Normal Distribution\n", "* 16.2. Outliers and Robust Estimation : The t Distribution\n", "* 16.3. Two Groups\n", "* 16.4. Other Noise Distributions and Transforming Data\n", "* 16.5. EXERCISES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter, we consider a situation in which we have a metric-predicted variable that is observed for items from one or two groups." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For example, \n", " - we could measure \n", " - the blood pressure (i.e., a metric variable) for people \n", " - randomly sampled from first-year university students (i.e., a single group).\n", " - In this case, we might be interested in \n", " - how much the group’s typical blood pressure differs \n", " - from the recommended value for people of that age as published by a federal agency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* As another example, \n", " - we could measure \n", " - the IQ (i.e., a metric variable) of people \n", " - randomly sampled from everyone self-described as vegetarian (i.e., a single group). \n", " - In this case, we could be interested in \n", " - how much this group’s IQ differs \n", " - from the general population’s average IQ of 100."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In the context of the generalized linear model (GLM) introduced in the previous chapter, this chapter’s situation involves the most trivial cases of the linear core of the GLM, as indicated in the left cells of Table 15.1 (p. 434), with a link function that is the identity along with a normal distribution for describing noise in the data, as indicated in the first row of Table 15.2 (p. 443). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We will explore options for the prior distribution on parameters of the normal distribution, and methods for Bayesian estimation of the parameters. \n", "* We will also consider alternative noise distributions for describing data that have outliers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 16.1. Estimating the Mean and Standard Deviation of a Normal Distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 16.1.1 Solution by mathematical analysis\n", "* 16.1.2 Approximation by MCMC in JAGS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The normal distribution specifies the probability density of a value y, given the values of two\n", "parameters, the mean μ and standard deviation σ :" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* To get an intuition for the normal distribution as a likelihood function, consider three data values y1 = 85, y2 = 100, and y3 = 115, which are plotted as large dots in Figure 16.1. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Figure 16.1 shows p(D|μ, σ ) for different values of μ and σ . 
As you can see, there are values of μ and σ that make the data most probable, but other nearby values also accommodate the data reasonably well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The question is, given the data, how should we allocate credibility to combinations of μ and σ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* likelihood\n", " - Figure 16.1 shows examples of p(D|μ, σ ) for a particular data set at different values of μ and σ \n", "* prior\n", " - The prior, p(μ,σ), specifies the credibility of each combination of μ,σ values in the two-dimensional joint parameter space, without the data.\n", "* posterior\n", " - Bayes’ rule says that the posterior credibility of each combination of μ, σ values is the prior credibility times the likelihood, normalized by the marginal likelihood." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal now is to evaluate Equation 16.2 for reasonable choices of the prior distribution, p(μ, σ )." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16.1.1 Solution by mathematical analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We take a short algebraic tour before moving on to MCMC implementations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When σ is fixed, " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Reference: Conjugate prior\n", " - https://en.wikipedia.org/wiki/Conjugate_prior" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "* It is convenient first to consider the case in which the standard deviation of the likelihood function is fixed at a specific value. In other words, the prior distribution on σ is a spike over that specific value. We’ll denote that fixed value as σ = Sy. 
\n", "* With this simplifying assumption, we are only estimating μ because we are assuming perfectly certain prior knowledge about σ.\n", "* prior: When σ is fixed, the prior distribution on μ in Equation 16.2 can easily be chosen to be conjugate to the normal likelihood.\n", " - The term “conjugate prior” was defined in Section 6.2, p. 126.\n", " - It turns out that the product of normal distributions is again a normal distribution; \n", " - in other words, if the prior on μ is normal, then the posterior on μ is normal.\n", " - Let the prior distribution on μ be normal with mean Mμ and standard deviation Sμ.\n", "* likelihood * prior : " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### precision & posterior precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Thus, the posterior precision is the sum of the prior precision and the likelihood precision." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### posterior mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In other words, the posterior mean is a weighted average of the prior mean and the datum, with the weighting corresponding to the relative precisions of the prior and the likelihood."
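The update formulas above can be sketched directly in base R. This is not code from the book; the prior and data values below are made up purely for illustration, and the function accepts either a single datum or N data values:

```r
# Conjugate update of mu when sigma is fixed at sigmaFixed.
# Precision = 1/variance. Posterior precision = prior precision
# + N * likelihood precision; posterior mean = precision-weighted
# average of the prior mean and the data mean.
muUpdate = function( priorMean , priorSD , sigmaFixed , y ) {
  N = length(y)
  priorPrec = 1 / priorSD^2
  likePrec  = N / sigmaFixed^2
  postPrec  = priorPrec + likePrec
  postMean  = ( priorPrec * priorMean + likePrec * mean(y) ) / postPrec
  c( postMean = postMean , postSD = 1/sqrt(postPrec) )
}
# Illustrative values: broad prior, three data points
muUpdate( priorMean = 100 , priorSD = 50 , sigmaFixed = 15 ,
          y = c(85,100,115) )
```

As `length(y)` grows, `likePrec` swamps `priorPrec`, so the posterior mean is pulled toward `mean(y)`, in line with the text.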
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### N value case" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### posterior mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### posterior precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Notice that as the sample size N increases, the posterior mean is dominated by the data mean.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When μ is fixed, " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can instead estimate the σ parameter when μ is fixed. \n", "* It turns out that when μ is fixed, a conjugate prior for the precision is the gamma distribution (e.g., Gelman et al., 2013, p. 43). \n", " - It is important to understand the meaning of a gamma prior on precision. \n", " - Consider a gamma distribution \n", " - that is loaded heavily over very small values, but has a long shallow tail extending over large values. \n", " - This sort of gamma distribution on precision indicates that we believe most strongly in small precisions, but we admit that large precisions are possible. \n", " - If this is a belief about the precision of a normal likelihood function, then this sort of gamma distribution expresses a belief that the data will be more spread out, because small precisions imply large standard deviations. \n", " - If the gamma distribution is instead loaded over large values of precision, \n", " - it expresses a belief that the data will be tightly clustered." 
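The point about gamma priors on precision can be seen by simulation. This sketch is not from the book; the shape and rate values are arbitrary, chosen only so that the gamma distribution is loaded heavily over small precisions:

```r
# A gamma prior loaded over small precision values implies a belief in
# large standard deviations, because sd = 1/sqrt(precision).
set.seed(1)
precSamples = rgamma( 10000 , shape = 1.5 , rate = 10 )  # mass near zero
impliedSD   = 1 / sqrt( precSamples )
summary( impliedSD )  # mostly large SDs: data expected to be spread out
```

Loading the gamma over large precisions instead (e.g., a large shape with a small rate) would concentrate `impliedSD` near zero, expressing a belief that the data will be tightly clustered.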
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Reference: gamma distribution?\n", "\n", "* https://en.wikipedia.org/wiki/Gamma_distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### conjugate priors & gamma distribution & precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Because of its role in conjugate priors for the normal likelihood function, the gamma distribution is routinely used as a prior on precision. \n", "* But there is no logical necessity to do so, and modern MCMC methods permit more flexible specification of priors. \n", "* Indeed, because precision is less intuitive than standard deviation, it can be more useful instead to give the standard deviation a uniform prior that spans a wide range." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have assumed that the data are generated by a normal likelihood function,\n", "parameterized by a mean μ and standard deviation σ, and denoted y ∼ normal(y|μ, σ ).\n", "For purposes of mathematical derivation, we made unrealistic assumptions that the prior\n", "distribution is either a spike on σ or a spike on μ, in order to make three main\n", "points:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. A natural way to express a prior on μ is with a normal distribution, because this is conjugate with the normal likelihood when its standard deviation is fixed.\n", "2. A way to express a prior on the precision 1/σ² is with a gamma distribution, because this is conjugate with the normal likelihood when its mean is fixed. However, in practice the standard deviation can instead be given a uniform prior (or anything else that reflects prior beliefs, of course).\n", "3. 
The formulas for Bayesian updating of the parameter distribution are more conveniently expressed in terms of precision than standard deviation. Normal distributions are described sometimes in terms of standard deviation and sometimes in terms of precision, so it is important to glean from context which is being referred to. In R and Stan, the normal distribution is parameterized by mean and standard deviation. In JAGS and BUGS, the normal distribution is parameterized by mean and precision." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16.1.2 Approximation by MCMC in JAGS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is easy to estimate the mean and standard deviation in JAGS." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The right panel of Figure 16.2 instead puts a broad uniform distribution directly\n", "on σ . The low and high values of the uniform distribution are set to be far outside any realistic\n", "value for the data, so that the prior has minimal influence on the posterior. \n", "* The uniform prior on σ is easier to intuit than a gamma prior on precision, but the priors are not equivalent." 
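The parameterization differences in point 3 are easy to trip over when moving between languages. As a small check (not from the book), converting a standard deviation to a precision and back gives the same normal density in R:

```r
# R's dnorm() takes (mean, sd); JAGS/BUGS dnorm takes (mean, precision),
# with precision tau = 1/sd^2. Converting between the two conventions
# must not change the density:
mu = 100 ; sigma = 15
tau = 1 / sigma^2                            # what JAGS would be given
dnorm( 110 , mean = mu , sd = sigma )        # sd parameterization
dnorm( 110 , mean = mu , sd = 1/sqrt(tau) )  # same value via the precision
```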
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataList = list(\n", " y = y ,\n", " Ntotal = length(y) ,\n", " meanY = mean(y) ,\n", " sdY = sd(y)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " for ( i in 1:Ntotal ) {\n", " y[i] ~ dnorm( mu , 1/sigma^2 ) # JAGS uses precision\n", " }\n", " mu ~ dnorm( meanY , 1/(100*sdY)^2 ) # JAGS uses precision\n", " sigma ~ dunif( sdY/1000 , sdY*1000 )\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For purposes of illustration, we use fictitious data.\n", "* The data are \n", " - IQ (intelligence quotient) scores \n", " - from a group of people who have consumed a “smart drug.” \n", " - We know that IQ tests have been normed to the general population so that they have an average score of 100 and a standard deviation of 15. \n", "* Therefore, we would like to know how differently the smart-drug group has performed relative to the general population average." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Jags-Ymet-Xnom1grp-Mnormal-Example.R" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Not in the book: switch the working directory to the data folder so the scripts placed there can be run\n", "cur_dir = getwd()\n", "setwd(sprintf(\"%s/%s\", cur_dir, 'data'))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Example for Jags-Ymet-Xnom1grp-Mnormal.R \n", "#------------------------------------------------------------------------------- \n", "# Optional generic preliminaries:\n", "#graphics.off() # This closes all of R's graphics windows.\n", "#rm(list=ls()) # Careful! This clears all of R's memory!"
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Load The data file \n", "myDataFrame = read.csv( file=\"TwoGroupIQ.csv\" )" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " & Score & Group\\\\\n", "\\hline\n", "\t1 & 102 & Smart Drug\\\\\n", "\t2 & 107 & Smart Drug\\\\\n", "\t3 & 92 & Smart Drug\\\\\n", "\t4 & 101 & Smart Drug\\\\\n", "\t5 & 110 & Smart Drug\\\\\n", "\t6 & 68 & Smart Drug\\\\\n", "\t7 & 119 & Smart Drug\\\\\n", "\t8 & 106 & Smart Drug\\\\\n", "\t9 & 99 & Smart Drug\\\\\n", "\t10 & 103 & Smart Drug\\\\\n", "\t11 & 90 & Smart Drug\\\\\n", "\t12 & 93 & Smart Drug\\\\\n", "\t13 & 79 & Smart Drug\\\\\n", "\t14 & 89 & Smart Drug\\\\\n", "\t15 & 137 & Smart Drug\\\\\n", "\t16 & 119 & Smart Drug\\\\\n", "\t17 & 126 & Smart Drug\\\\\n", "\t18 & 110 & Smart Drug\\\\\n", "\t19 & 71 & Smart Drug\\\\\n", "\t20 & 114 & Smart Drug\\\\\n", "\t21 & 100 & Smart Drug\\\\\n", "\t22 & 95 & Smart Drug\\\\\n", "\t23 & 91 & Smart Drug\\\\\n", "\t24 & 99 & Smart Drug\\\\\n", "\t25 & 97 & Smart Drug\\\\\n", "\t26 & 106 & Smart Drug\\\\\n", "\t27 & 106 & Smart Drug\\\\\n", "\t28 & 129 & Smart Drug\\\\\n", "\t29 & 115 & Smart Drug\\\\\n", "\t30 & 124 & Smart Drug\\\\\n", "\t31 & 137 & Smart Drug\\\\\n", "\t32 & 73 & Smart Drug\\\\\n", "\t33 & 69 & Smart Drug\\\\\n", "\t34 & 95 & Smart Drug\\\\\n", "\t35 & 102 & Smart Drug\\\\\n", "\t36 & 116 & Smart Drug\\\\\n", "\t37 & 111 & Smart Drug\\\\\n", "\t38 & 134 & Smart Drug\\\\\n", "\t39 & 102 & Smart Drug\\\\\n", "\t40 & 110 & Smart Drug\\\\\n", "\t41 & 139 & Smart Drug\\\\\n", "\t42 & 112 & Smart Drug\\\\\n", "\t43 & 122 & Smart Drug\\\\\n", "\t44 & 84 & Smart Drug\\\\\n", "\t45 & 129 & Smart Drug\\\\\n", "\t46 & 112 & Smart Drug\\\\\n", "\t47 & 127 & Smart Drug\\\\\n", "\t48 & 106 & Smart Drug\\\\\n", "\t49 & 113 & Smart Drug\\\\\n", "\t50 & 109 & Smart Drug\\\\\n", "\t51 & 208 & Smart Drug\\\\\n", "\t52 & 114 & Smart Drug\\\\\n", "\t53 & 107 & Smart Drug\\\\\n", "\t54 & 50 & Smart Drug\\\\\n", "\t55 & 169 & Smart Drug\\\\\n", "\t56 & 133 & Smart Drug\\\\\n", "\t57 & 50 & Smart Drug\\\\\n", "\t58 & 97 & Smart Drug\\\\\n", "\t59 & 139 & Smart 
Drug\\\\\n", "\t60 & 72 & Smart Drug\\\\\n", "\t61 & 100 & Smart Drug\\\\\n", "\t62 & 144 & Smart Drug\\\\\n", "\t63 & 112 & Smart Drug\\\\\n", "\t64 & 109 & Placebo\\\\\n", "\t65 & 98 & Placebo\\\\\n", "\t66 & 106 & Placebo\\\\\n", "\t67 & 101 & Placebo\\\\\n", "\t68 & 100 & Placebo\\\\\n", "\t69 & 111 & Placebo\\\\\n", "\t70 & 117 & Placebo\\\\\n", "\t71 & 104 & Placebo\\\\\n", "\t72 & 106 & Placebo\\\\\n", "\t73 & 89 & Placebo\\\\\n", "\t74 & 84 & Placebo\\\\\n", "\t75 & 88 & Placebo\\\\\n", "\t76 & 94 & Placebo\\\\\n", "\t77 & 78 & Placebo\\\\\n", "\t78 & 108 & Placebo\\\\\n", "\t79 & 102 & Placebo\\\\\n", "\t80 & 95 & Placebo\\\\\n", "\t81 & 99 & Placebo\\\\\n", "\t82 & 90 & Placebo\\\\\n", "\t83 & 116 & Placebo\\\\\n", "\t84 & 97 & Placebo\\\\\n", "\t85 & 107 & Placebo\\\\\n", "\t86 & 102 & Placebo\\\\\n", "\t87 & 91 & Placebo\\\\\n", "\t88 & 94 & Placebo\\\\\n", "\t89 & 95 & Placebo\\\\\n", "\t90 & 86 & Placebo\\\\\n", "\t91 & 108 & Placebo\\\\\n", "\t92 & 115 & Placebo\\\\\n", "\t93 & 108 & Placebo\\\\\n", "\t94 & 88 & Placebo\\\\\n", "\t95 & 102 & Placebo\\\\\n", "\t96 & 102 & Placebo\\\\\n", "\t97 & 120 & Placebo\\\\\n", "\t98 & 112 & Placebo\\\\\n", "\t99 & 100 & Placebo\\\\\n", "\t100 & 105 & Placebo\\\\\n", "\t101 & 105 & Placebo\\\\\n", "\t102 & 88 & Placebo\\\\\n", "\t103 & 82 & Placebo\\\\\n", "\t104 & 111 & Placebo\\\\\n", "\t105 & 96 & Placebo\\\\\n", "\t106 & 92 & Placebo\\\\\n", "\t107 & 109 & Placebo\\\\\n", "\t108 & 91 & Placebo\\\\\n", "\t109 & 92 & Placebo\\\\\n", "\t110 & 123 & Placebo\\\\\n", "\t111 & 61 & Placebo\\\\\n", "\t112 & 59 & Placebo\\\\\n", "\t113 & 105 & Placebo\\\\\n", "\t114 & 184 & Placebo\\\\\n", "\t115 & 82 & Placebo\\\\\n", "\t116 & 138 & Placebo\\\\\n", "\t117 & 99 & Placebo\\\\\n", "\t118 & 93 & Placebo\\\\\n", "\t119 & 93 & Placebo\\\\\n", "\t120 & 72 & Placebo\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " Score Group\n", "1 102 Smart Drug\n", "2 107 Smart Drug\n", "3 92 Smart Drug\n", "4 101 Smart Drug\n", "5 110 
Smart Drug\n", "6 68 Smart Drug\n", "7 119 Smart Drug\n", "8 106 Smart Drug\n", "9 99 Smart Drug\n", "10 103 Smart Drug\n", "11 90 Smart Drug\n", "12 93 Smart Drug\n", "13 79 Smart Drug\n", "14 89 Smart Drug\n", "15 137 Smart Drug\n", "16 119 Smart Drug\n", "17 126 Smart Drug\n", "18 110 Smart Drug\n", "19 71 Smart Drug\n", "20 114 Smart Drug\n", "21 100 Smart Drug\n", "22 95 Smart Drug\n", "23 91 Smart Drug\n", "24 99 Smart Drug\n", "25 97 Smart Drug\n", "26 106 Smart Drug\n", "27 106 Smart Drug\n", "28 129 Smart Drug\n", "29 115 Smart Drug\n", "30 124 Smart Drug\n", "31 137 Smart Drug\n", "32 73 Smart Drug\n", "33 69 Smart Drug\n", "34 95 Smart Drug\n", "35 102 Smart Drug\n", "36 116 Smart Drug\n", "37 111 Smart Drug\n", "38 134 Smart Drug\n", "39 102 Smart Drug\n", "40 110 Smart Drug\n", "41 139 Smart Drug\n", "42 112 Smart Drug\n", "43 122 Smart Drug\n", "44 84 Smart Drug\n", "45 129 Smart Drug\n", "46 112 Smart Drug\n", "47 127 Smart Drug\n", "48 106 Smart Drug\n", "49 113 Smart Drug\n", "50 109 Smart Drug\n", "51 208 Smart Drug\n", "52 114 Smart Drug\n", "53 107 Smart Drug\n", "54 50 Smart Drug\n", "55 169 Smart Drug\n", "56 133 Smart Drug\n", "57 50 Smart Drug\n", "58 97 Smart Drug\n", "59 139 Smart Drug\n", "60 72 Smart Drug\n", "61 100 Smart Drug\n", "62 144 Smart Drug\n", "63 112 Smart Drug\n", "64 109 Placebo\n", "65 98 Placebo\n", "66 106 Placebo\n", "67 101 Placebo\n", "68 100 Placebo\n", "69 111 Placebo\n", "70 117 Placebo\n", "71 104 Placebo\n", "72 106 Placebo\n", "73 89 Placebo\n", "74 84 Placebo\n", "75 88 Placebo\n", "76 94 Placebo\n", "77 78 Placebo\n", "78 108 Placebo\n", "79 102 Placebo\n", "80 95 Placebo\n", "81 99 Placebo\n", "82 90 Placebo\n", "83 116 Placebo\n", "84 97 Placebo\n", "85 107 Placebo\n", "86 102 Placebo\n", "87 91 Placebo\n", "88 94 Placebo\n", "89 95 Placebo\n", "90 86 Placebo\n", "91 108 Placebo\n", "92 115 Placebo\n", "93 108 Placebo\n", "94 88 Placebo\n", "95 102 Placebo\n", "96 102 Placebo\n", "97 120 Placebo\n", "98 112 
Placebo\n", "99 100 Placebo\n", "100 105 Placebo\n", "101 105 Placebo\n", "102 88 Placebo\n", "103 82 Placebo\n", "104 111 Placebo\n", "105 96 Placebo\n", "106 92 Placebo\n", "107 109 Placebo\n", "108 91 Placebo\n", "109 92 Placebo\n", "110 123 Placebo\n", "111 61 Placebo\n", "112 59 Placebo\n", "113 105 Placebo\n", "114 184 Placebo\n", "115 82 Placebo\n", "116 138 Placebo\n", "117 99 Placebo\n", "118 93 Placebo\n", "119 93 Placebo\n", "120 72 Placebo" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myDataFrame" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " Score Group \n", " Min. : 50.00 Placebo :57 \n", " 1st Qu.: 92.75 Smart Drug:63 \n", " Median :102.00 \n", " Mean :104.13 \n", " 3rd Qu.:112.00 \n", " Max. :208.00 " ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary(myDataFrame)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# For purposes of this one-group example, use data from Smart Drug group:\n", "myData = myDataFrame$Score[myDataFrame$Group==\"Smart Drug\"]" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Optional: Specify filename root and graphical format for saving output.\n", "# Otherwise specify as NULL or leave saveName and saveType arguments \n", "# out of function calls.\n", "fileNameRoot = \"OneGroupIQnormal-\" \n", "graphFileType = \"png\" #\"eps\" " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "*********************************************************************\n", "Kruschke, J. K. (2015). 
Doing Bayesian Data Analysis, Second Edition:\n", "A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.\n", "*********************************************************************\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: coda\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the relevant model into R's working memory:\n", "source(\"Jags-Ymet-Xnom1grp-Mnormal.R\")" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: rjags\n", "Linked to JAGS 3.4.0\n", "Loaded modules: basemod,bugs\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Compiling model graph\n", " Resolving undeclared variables\n", " Allocating nodes\n", " Graph Size: 79\n", "\n", "Initializing model\n", "\n", "Burning in the MCMC chain...\n", "Sampling final MCMC chain...\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Generate the MCMC chain:\n", "mcmcCoda = genMCMC( data=myData , numSavedSteps=20000 , saveName=fileNameRoot )" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Display diagnostics of chain, for specified parameters:\n", "parameterNames = varnames(mcmcCoda) # get all parameter names\n", "for ( parName in parameterNames ) {\n", " diagMCMC( codaObject=mcmcCoda , parName=parName , \n", " saveName=fileNameRoot , saveType=graphFileType )\n", "}" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Close the graphics windows that popped up.\n", "graphics.off()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "code", "execution_count": 13, 
"metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mean Median Mode ESS HDImass HDIlow\n", "mu 107.8278271 107.8114667 107.0894331 20097.3 0.95 101.43788497\n", "sigma 25.9588549 25.7791320 25.0124458 11475.4 0.95 21.46449101\n", "effSz 0.3039062 0.3034737 0.2807922 20000.0 0.95 0.04810492\n", " HDIhigh CompVal PcntGtCompVal ROPElow ROPEhigh PcntLtROPE PcntInROPE\n", "mu 114.4334683 100 98.95 99.0 101.0 0.41 1.625\n", "sigma 30.6368277 15 100.00 14.0 16.0 0.00 0.000\n", "effSz 0.5540163 0 98.95 -0.1 0.1 0.07 5.650\n", " PcntGtROPE\n", "mu 97.965\n", "sigma 100.000\n", "effSz 94.280\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Get summary statistics of chain:\n", "summaryInfo = smryMCMC( mcmcCoda , \n", " compValMu=100.0 , ropeMu=c(99.0,101.0) ,\n", " compValSigma=15.0 , ropeSigma=c(14,16) ,\n", " compValEff=0.0 , ropeEff=c(-0.1,0.1) ,\n", " saveName=fileNameRoot )\n", "show(summaryInfo)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Display posterior information:\n", "plotMCMC( mcmcCoda , data=myData , \n", " compValMu=100.0 , ropeMu=c(99.0,101.0) ,\n", " compValSigma=15.0 , ropeSigma=c(14,16) ,\n", " compValEff=0.0 , ropeEff=c(-0.1,0.1) ,\n", " pairsPlot=TRUE , showCurve=FALSE ,\n", " saveName=fileNameRoot , saveType=graphFileType )" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "graphics.off()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 16.2. 
Outliers and Robust Estimation : The t Distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 16.2.1 Using the t distribution in JAGS\n", "* 16.2.2 Using the t distribution in Stan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When data appear to have outliers \n", "beyond what would be accommodated by a normal distribution, it would be useful to be able to describe the data with a more appropriate distribution that has taller or heavier tails than the normal. \n", "* A well-known distribution with heavy tails is the t distribution[2-6].\n", " - Figure 16.4 shows examples of the t distribution. \n", " - Like the normal distribution, it has two parameters \n", " - that control its mean and its width. \n", " - The standard deviation is controlled indirectly via the t distribution’s “scale” parameter. \n", " - In Figure 16.4, the mean is set to zero and the scale is set to one.\n", " - The t distribution has a third parameter \n", " - that controls the heaviness of its tails, \n", " - which I will refer to as the “normality” parameter, \n", " - ν (Greek letter nu). \n", " - Many people might be familiar with this parameter as the “degrees of freedom” from its use in NHST. \n", " - But because \n", " - we will not be using the t distribution as a sampling distribution, \n", " - and instead we will be using it only as a descriptive shape, \n", " - I prefer to name the parameter by its effect on the distribution’s shape. \n", " - The normality parameter can range continuously from 1 to ∞. As can be seen in Figure 16.4, \n", " - when ν = 1 the t distribution has very heavy tails, and \n", " - when ν approaches ∞ the t distribution becomes normal." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Figure 16.5 shows examples of how the t distribution is robust against outliers. 
\n", " - D = {yi}\n", " - p(D|μ, σ ) <- normal\n", " - p(D|μ, σ , ν) <- t" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* It is important to understand that the scale parameter σ in the t distribution is not the standard deviation of the distribution. \n", "* The standard deviation is actually larger than σ because of the heavy tails. \n", " - In fact, when ν drops below 2 (but is still ≥ 1), the standard deviation of the mathematical t distribution goes to infinity. \n", " - For example, in the upper panel of Figure 16.5, ν is only 1.14, which means that the standard deviation of the mathematical t distribution is infinity, even though the sample standard deviation of the data is finite. \n", " - At the same time, the scale parameter of the t distribution has value σ = 1.47.\n", "* While this value of the scale parameter is not the standard deviation of the distribution, it does have an intuitive relation to the spread of the data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The lower panel of Figure 16.5 uses realistic data that indicates levels of inorganic phosphorous, measured in milligrams per deciliter, in 177 human subjects aged 65 or older. \n", " - The authors of the data (Holcomb & Spalsbury, 2005) intentionally altered a few data points to reflect typical transcription errors and to illustrate methods for detecting and correcting such errors.\n", " - We instead assume that we no longer have access to records of the original individual measurements, and must model the uncorrected data set.\n", "* The t distribution accommodates the outliers and fits the distribution of data much better than the normal." 
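The heaviness of the t distribution's tails relative to the normal can be checked numerically in R. A small sketch, not from the book:

```r
# Probability of falling more than 4 scale-units below the center:
pnorm( -4 )        # normal: about 3e-05
pt( -4 , df = 2 )  # t with nu = 2: about 0.029, orders of magnitude more
# This is why a t likelihood treats an extreme point as unsurprising,
# leaving the estimate of central tendency "robust" against outliers.
# (R's pt() is the standardized t; shift and scale via (y-mu)/sigma
# for general mu and sigma.)
```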
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### robust estimation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The use of a heavy-tailed distribution is often called robust estimation because the estimated value of the central tendency is stable, that is, “robust,” against outliers.\n", "* The t distribution is useful as a likelihood function for modeling outliers at the level of observed data.\n", "* It is also useful for modeling outliers at higher levels, in a hierarchical prior. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16.2.1 Using the t distribution in JAGS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the normal distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataList = list(\n", " y = y ,\n", " Ntotal = length(y) ,\n", " meanY = mean(y) ,\n", " sdY = sd(y)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " for ( i in 1:Ntotal ) {\n", " y[i] ~ dnorm( mu , 1/sigma^2 ) # JAGS uses precision\n", " }\n", " mu ~ dnorm( meanY , 1/(100*sdY)^2 ) # JAGS uses precision\n", " sigma ~ dunif( sdY/1000 , sdY*1000 )\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Using the t distribution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " for ( i in 1:Ntotal ) {\n", " y[i] ~ dt( mu , 1/sigma^2 , nu ) # JAGS: dt uses precision\n", " }\n", " mu ~ dnorm( meanY , 1/(100*sdY)^2 )\n", " sigma ~ dunif( sdY/1000 , sdY*1000 )\n", " nu <- nuMinusOne+1 # nu must be >= 1\n", " nuMinusOne ~ dexp(1/29) # prior on nu-1\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "#### Jags-Ymet-Xnom1grp-Mrobust-Example.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Code not in the book: move into the data folder so that the scripts placed there can be sourced.\n", "cur_dir = getwd()\n", "setwd(sprintf(\"%s/%s\", cur_dir, 'data'))" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Optional generic preliminaries:\n", "graphics.off() # This closes all of R's graphics windows.\n", "rm(list=ls()) # Careful! This clears all of R's memory!" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the data file \n", "myDataFrame = read.csv( file=\"TwoGroupIQ.csv\" )\n", "# For purposes of this one-group example, use data from Smart Drug group:\n", "myData = myDataFrame$Score[myDataFrame$Group==\"Smart Drug\"]" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Optional: Specify filename root and graphical format for saving output.\n", "# Otherwise specify as NULL or leave saveName and saveType arguments \n", "# out of function calls.\n", "fileNameRoot = \"OneGroupIQrobust-Jags-\" \n", "graphFileType = \"png\" #\"eps\" " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "*********************************************************************\n", "Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:\n", "A Tutorial with R, JAGS, and Stan. 
Academic Press / Elsevier.\n", "*********************************************************************\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Installing packages into ‘/home/moosung/R/x86_64-pc-linux-gnu-library/3.2’\n", "(as ‘lib’ is unspecified)\n" ] }, { "ename": "ERROR", "evalue": "Error in contrib.url(repos, type): trying to use CRAN without setting a mirror\n", "output_type": "error", "traceback": [ "Error in contrib.url(repos, type): trying to use CRAN without setting a mirror\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the relevant model into R's working memory:\n", "source(\"Jags-Ymet-Xnom1grp-Mrobust.R\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Generate the MCMC chain:\n", "mcmcCoda = genMCMC( data=myData , numSavedSteps=20000 , saveName=fileNameRoot )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Display diagnostics of chain, for specified parameters:\n", "parameterNames = varnames(mcmcCoda) # get all parameter names\n", "for ( parName in parameterNames ) {\n", " diagMCMC( codaObject=mcmcCoda , parName=parName , \n", " saveName=fileNameRoot , saveType=graphFileType )\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Close the plot windows that popped up.\n", "graphics.off()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Get summary statistics of chain:\n", "summaryInfo = smryMCMC( mcmcCoda , \n", " compValMu=100.0 , 
ropeMu=c(99.0,101.0) ,\n", " compValSigma=15.0 , ropeSigma=c(14,16) ,\n", " compValEff=0.0 , ropeEff=c(-0.1,0.1) ,\n", " saveName=fileNameRoot )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "show(summaryInfo)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Display posterior information:\n", "plotMCMC( mcmcCoda , data=myData , \n", " compValMu=100.0 , ropeMu=c(99.0,101.0) ,\n", " compValSigma=15.0 , ropeSigma=c(14,16) ,\n", " compValEff=0.0 , ropeEff=c(-0.1,0.1) ,\n", " pairsPlot=TRUE , showCurve=FALSE ,\n", " saveName=fileNameRoot , saveType=graphFileType )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "graphics.off()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16.2.2 Using the t distribution in Stan" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* When you run the JAGS program yourself, you will see that it uses many steps to produce a posterior sample for σ that has an ESS exceeding 10,000. \n", "* You will also see that the ESS for ν is less than 10,000 despite the long chain. \n", " - In other words, there is high autocorrelation in the chains in JAGS.\n", "* We do not care that the chain for ν has a relatively small ESS because \n", " - (a) we do not care about the exact value of ν when interpreting the posterior, as explained above, and \n", " - (b) the exact value of ν has relatively little effect on the estimates of the other parameters. \n", " - To be sure, the posterior sample of ν must be converged and truly representative of the posterior, but it does not need to be as finely detailed as the other parameters. 
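\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Code not in the book: the ESS claims above can be checked directly with effectiveSize() from the coda package, applied to the coda object returned by genMCMC in the JAGS example above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Code not in the book: report the effective sample size of each parameter.\n", "library(coda)\n", "effectiveSize(mcmcCoda) # named vector; expect a much smaller ESS for nu than for mu" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "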
\n", "* Nevertheless, it would be less worrying if ν had a larger ESS.\n", " - The autocorrelation of the MCMC sampling in JAGS requires a long chain, \n", " - which requires us to have patience while the computer chugs along. \n", " - We have discussed two options for improving the efficiency of the sampling. \n", " - One option is to run parallel chains on multiple cores using runjags (Section 8.7, p. 215).\n", " - Another option is to implement the model in Stan, \n", " - which may explore the parameter space more efficiently with its HMC sampling (Chapter 14). \n", "* This section shows how to run the model in Stan. Its results do indeed show a posterior sample of ν with higher ESS than JAGS." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data {\n", " int Ntotal ;\n", " real y[Ntotal] ;\n", " real meanY ;\n", " real sdY ;\n", "}\n", "transformed data { // compute the constants for the priors\n", " real unifLo ;\n", " real unifHi ;\n", " real normalSigma ;\n", " real expLambda ;\n", " unifLo <- sdY/1000 ;\n", " unifHi <- sdY*1000 ;\n", " normalSigma <- sdY*100 ;\n", " expLambda <- 1/29.0 ;\n", "}\n", "parameters {\n", " real nuMinusOne ;\n", " real mu ;\n", " real sigma ;\n", "}\n", "transformed parameters {\n", " real nu ;\n", " nu <- nuMinusOne + 1 ;\n", "}\n", "model {\n", " sigma ~ uniform( unifLo , unifHi ) ;\n", " mu ~ normal( meanY , normalSigma ) ;\n", " nuMinusOne ~ exponential( expLambda ) ;\n", " y ~ student_t( nu , mu , sigma ) ; // vectorized\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stan-Ymet-Xnom1grp-Mrobust-Example.R" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Code not in the book: move into the data folder so that the scripts placed there can be sourced.\n", "cur_dir = getwd()\n", "setwd(sprintf(\"%s/%s\", cur_dir, 'data'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Optional generic preliminaries:\n", "graphics.off() # This closes all of R's graphics windows.\n", "rm(list=ls()) # Careful! This clears all of R's memory!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the data file \n", "myDataFrame = read.csv( file=\"TwoGroupIQ.csv\" )\n", "# For purposes of this one-group example, use data from Smart Drug group:\n", "myData = myDataFrame$Score[myDataFrame$Group==\"Smart Drug\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the relevant model into R's working memory:\n", "source(\"Stan-Ymet-Xnom1grp-Mrobust.R\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Optional: Specify filename root and graphical format for saving output.\n", "# Otherwise specify as NULL or leave saveName and saveType arguments \n", "# out of function calls.\n", "fileNameRoot = \"OneGroupIQrobust-Stan-\" \n", "graphFileType = \"png\" #\"eps\" " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Generate the MCMC chain:\n", "mcmcCoda = genMCMC( data=myData , numSavedSteps=20000 , saveName=fileNameRoot )" ] }, { 
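"cell_type": "markdown", "metadata": {}, "source": [ "Code not in the book: a quick check (assuming the Stan run above has completed) that the ESS of ν is indeed higher than in the JAGS version of the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Code not in the book: ESS of nu from the Stan chain, for comparison with JAGS.\n", "library(coda)\n", "effectiveSize( mcmcCoda[ , \"nu\" ] )" ] }, { 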
"cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Display diagnostics of chain, for specified parameters:\n", "parameterNames = varnames(mcmcCoda) # get all parameter names\n", "for ( parName in parameterNames ) {\n", " diagMCMC( codaObject=mcmcCoda , parName=parName , \n", " saveName=fileNameRoot , saveType=graphFileType )\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Close the plot windows that popped up.\n", "graphics.off()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Get summary statistics of chain:\n", "summaryInfo = smryMCMC( mcmcCoda , \n", " compValMu=100.0 , ropeMu=c(99.0,101.0) ,\n", " compValSigma=15.0 , ropeSigma=c(14,16) ,\n", " compValEff=0.0 , ropeEff=c(-0.1,0.1) ,\n", " saveName=fileNameRoot )\n", "show(summaryInfo)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Display posterior information:\n", "plotMCMC( mcmcCoda , data=myData , \n", " compValMu=100.0 , ropeMu=c(99.0,101.0) ,\n", " compValSigma=15.0 , ropeSigma=c(14,16) ,\n", " compValEff=0.0 , ropeEff=c(-0.1,0.1) ,\n", " pairsPlot=TRUE , showCurve=FALSE ,\n", " saveName=fileNameRoot , saveType=graphFileType )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "graphics.off()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 16.3. Two Groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 16.3.1 Analysis by NHST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### An often-used research design is a comparison of two groups. 
\n", "\n", "* For example, in the context of assaying the effect of a “smart drug” on IQ, instead of comparing the mean of the treatment group against an assumed landmark such as 100 (see Figure 16.3), it would make more sense to compare the treatment group against an identically handled placebo group.\n", "* When there are two groups, we estimate the mean and scale for each group. \n", "* When using t distributions for robust estimation, \n", " - we could also estimate the normality of each group separately. \n", " - But because there usually are relatively few outliers, \n", " - we will use a single normality parameter to describe both groups, \n", " - so that the normality parameter is estimated more stably." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data {\n", " int Ntotal ;\n", " int x[Ntotal] ; // group identifier \n", " real y[Ntotal] ;\n", " real meanY ;\n", " real sdY ;\n", "}\n", "transformed data {\n", " real unifLo ;\n", " real unifHi ;\n", " real normalSigma ;\n", " real expLambda ;\n", " unifLo <- sdY/1000 ;\n", " unifHi <- sdY*1000 ;\n", " normalSigma <- sdY*100 ;\n", " expLambda <- 1/29.0 ;\n", "}\n", "parameters {\n", " real nuMinusOne ;\n", " real mu[2] ; // 2 groups \n", " real sigma[2] ; // 2 groups\n", "}\n", "transformed parameters {\n", " real nu ;\n", " nu <- nuMinusOne + 1 ;\n", "}\n", "model {\n", " sigma ~ uniform( unifLo , unifHi ) ; // vectorized 2 groups \n", " mu ~ normal( meanY , normalSigma ) ; // vectorized 2 groups\n", " nuMinusOne ~ exponential( expLambda ) ;\n", " for ( i in 1:Ntotal ) {\n", " y[i] ~ student_t( nu , mu[x[i]] , sigma[x[i]] ) ;\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Stan-Ymet-Xnom1grp-Mrobust-Example.R" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 
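"collapsed": true }, "outputs": [], "source": [ "# Code not in the book: an illustrative sketch of how the data list for the\n", "# two-group model above could be assembled, coding Group as an integer index x.\n", "# The level order is an assumption (alphabetical): 1 = Placebo, 2 = Smart Drug.\n", "myDataFrame = read.csv( file=\"TwoGroupIQ.csv\" )\n", "y = as.numeric( myDataFrame$Score )\n", "x = as.integer( factor( myDataFrame$Group ) )\n", "dataList = list( Ntotal=length(y) , x=x , y=y , meanY=mean(y) , sdY=sd(y) )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { 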
"collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Optional generic preliminaries:\n", "graphics.off() # This closes all of R's graphics windows.\n", "rm(list=ls()) # Careful! This clears all of R's memory!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Load The data file \n", "myDataFrame = read.csv( file=\"TwoGroupIQ.csv\" )\n", "# For purposes of this one-group example, use data from Smart Drug group:\n", "myData = myDataFrame$Score[myDataFrame$Group==\"Smart Drug\"]" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the relevant model into R's working memory:\n", "source(\"Stan-Ymet-Xnom1grp-Mrobust.R\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Optional: Specify filename root and graphical format for saving output.\n", "# Otherwise specify as NULL or leave saveName and saveType arguments \n", "# out of function calls.\n", "fileNameRoot = \"OneGroupIQrobust-Stan-\" \n", "graphFileType = \"png\" #\"eps\" " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Generate the MCMC chain:\n", "mcmcCoda = genMCMC( data=myData , numSavedSteps=20000 , saveName=fileNameRoot )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ 
"#------------------------------------------------------------------------------- \n", "# Display diagnostics of chain, for specified parameters:\n", "parameterNames = varnames(mcmcCoda) # get all parameter names\n", "for ( parName in parameterNames ) {\n", " diagMCMC( codaObject=mcmcCoda , parName=parName , \n", " saveName=fileNameRoot , saveType=graphFileType )\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Close the plot windows that popped up.\n", "graphics.off()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Get summary statistics of chain:\n", "summaryInfo = smryMCMC( mcmcCoda , \n", " compValMu=100.0 , ropeMu=c(99.0,101.0) ,\n", " compValSigma=15.0 , ropeSigma=c(14,16) ,\n", " compValEff=0.0 , ropeEff=c(-0.1,0.1) ,\n", " saveName=fileNameRoot )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "show(summaryInfo)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "graphics.off()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Display posterior information:\n", "plotMCMC( mcmcCoda , data=myData , \n", " compValMu=100.0 , ropeMu=c(99.0,101.0) ,\n", " compValSigma=15.0 , ropeSigma=c(14,16) ,\n", " compValEff=0.0 , ropeEff=c(-0.1,0.1) ,\n", " pairsPlot=TRUE , showCurve=FALSE ,\n", " saveName=fileNameRoot , saveType=graphFileType )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "graphics.off()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16.3.1 Analysis by NHST" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In traditional NHST, metric data from two 
groups would be submitted to a t-test." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "myDataFrame = read.csv( file=\"TwoGroupIQ.csv\" )" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "\n", "\tWelch Two Sample t-test\n", "\n", "data: Score by Group\n", "t = -1.958, df = 111.44, p-value = 0.05273\n", "alternative hypothesis: true difference in means is not equal to 0\n", "95 percent confidence interval:\n", " -15.70602585 0.09366161\n", "sample estimates:\n", " mean in group Placebo mean in group Smart Drug \n", " 100.0351 107.8413 \n" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t.test(Score~Group , data=myDataFrame)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Notice that the p value is greater than 0.05, which means that the conventional decision would be not to reject the null hypothesis. \n", " - This conclusion conflicts with the Bayesian analysis in Figure 16.12, unless we use a conservatively wide ROPE. \n", " - The reason that \n", " - the t test is less sensitive than the Bayesian estimation \n", " - in this example is that \n", " - the t test assumes normality and therefore \n", " - its estimate of the within-group variances is too large \n", " - when there are outliers.\n", " - The t test has other problems. Unlike the Bayesian analysis, \n", " - the t test provides only a test of the equality of means, \n", " - without a test of the equality of variances. \n", " - To test equality of variances, \n", " - we need to run an additional test, \n", " - namely an F test of the ratio of variances, which would inflate the p values of both tests. \n", " - Moreover, \n", " - both tests compute p values based on hypothetical normally distributed data, \n", " - and the F test is particularly sensitive to violations of this assumption. 
\n", " - Therefore it would be better to use resampling methods to compute the p values (and to correct them for multiple tests)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 16.4. Other Noise Distributions and Transforming Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 16.5. EXERCISES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 16.1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### [Purpose: Practice using different data files in the high-level script, with an interesting real example about alcohol preference of sexually frustrated males.]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# References" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [1] Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan - http://www.amazon.com/Doing-Bayesian-Analysis-Second-Edition/dp/0124058884\n", "* [2] Student's t distribution (Korean Wikipedia) - https://ko.wikipedia.org/wiki/%EC%8A%A4%ED%8A%9C%EB%8D%98%ED%8A%B8_t_%EB%B6%84%ED%8F%AC\n", "* [3] The t distribution: concept summary - http://math7.tistory.com/55\n", "* [4] How to read a t distribution table - http://math7.tistory.com/56\n", "* [5] The difference between the standard normal distribution and the t distribution - http://blog.daum.net/_blog/BlogTypeView.do?blogid=0Isuz&articleno=1172552\n", "* [6] The t distribution (Student's t-distribution) - http://godrag77.blogspot.kr/2011/07/t-students-t-distribution.html" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.2.1" } }, "nbformat": 4, "nbformat_minor": 0 }