{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CHAPTER 16. Metric-Predicted Variable on One or Two Groups" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Psygrammer / Bayesian R study group [1]\n", "* Moosung Kim" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 16.1. Estimating the Mean and Standard Deviation of a Normal Distribution\n", "* 16.2. Outliers and Robust Estimation: The t Distribution\n", "* 16.3. Two Groups\n", "* 16.4. Other Noise Distributions and Transforming Data\n", "* 16.5. EXERCISES" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter, we consider a situation in which we have a metric-predicted variable that is observed for items from one or two groups." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For example, we could measure the blood pressure (i.e., a metric variable) of people randomly sampled from first-year university students (i.e., a single group).\n", " - In this case, we might be interested in how much the group’s typical blood pressure differs from the recommended value for people of that age as published by a federal agency." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* As another example, we could measure the IQ (i.e., a metric variable) of people randomly sampled from everyone self-described as vegetarian (i.e., a single group).\n", " - In this case, we could be interested in how much this group’s IQ differs from the general population’s average IQ of 100."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In the context of the generalized linear model (GLM) introduced in the previous chapter, this chapter’s situation involves the most trivial cases of the linear core of the GLM, as indicated in the left cells of Table 15.1 (p. 434), with a link function that is the identity, along with a normal distribution for describing noise in the data, as indicated in the first row of Table 15.2 (p. 443)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We will explore options for the prior distribution on parameters of the normal distribution, and methods for Bayesian estimation of the parameters.\n", "* We will also consider alternative noise distributions for describing data that have outliers." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 16.1. Estimating the Mean and Standard Deviation of a Normal Distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 16.1.1 Solution by mathematical analysis\n", "* 16.1.2 Approximation by MCMC in JAGS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The normal distribution specifies the probability density of a value y, given the values of two\n", "parameters, the mean μ and standard deviation σ:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ p(y|\\mu,\\sigma) = \\frac{1}{\\sigma\\sqrt{2\\pi}} \\exp\\left( -\\frac{(y-\\mu)^2}{2\\sigma^2} \\right) \\tag{16.1} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* To get an intuition for the normal distribution as a likelihood function, consider three data values y1 = 85, y2 = 100, and y3 = 115, which are plotted as large dots in Figure 16.1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Figure 16.1 shows p(D|μ, σ) for different values of μ and σ. 
As you can see, there are values of μ and σ that make the data most probable, but other nearby values also accommodate the data reasonably well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The question is, given the data, how should we allocate credibility to combinations of μ and σ?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* likelihood\n", " - Figure 16.1 shows examples of p(D|μ, σ) for a particular data set at different values of μ and σ.\n", "* prior\n", " - The prior, p(μ, σ), specifies the credibility of each combination of μ, σ values in the two-dimensional joint parameter space, without the data.\n", "* posterior\n", " - Bayes’ rule says that the posterior credibility of each combination of μ, σ values is the prior credibility times the likelihood, normalized by the marginal likelihood:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ p(\\mu,\\sigma|D) = \\frac{p(D|\\mu,\\sigma)\\, p(\\mu,\\sigma)}{\\iint d\\mu\\, d\\sigma\\, p(D|\\mu,\\sigma)\\, p(\\mu,\\sigma)} \\tag{16.2} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our goal now is to evaluate Equation 16.2 for reasonable choices of the prior distribution, p(μ, σ)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16.1.1 Solution by mathematical analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We take a short algebraic tour before moving on to MCMC implementations." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When σ is fixed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Reference: Conjugate prior\n", " - https://en.wikipedia.org/wiki/Conjugate_prior" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "* It is convenient first to consider the case in which the standard deviation of the likelihood function is fixed at a specific value. In other words, the prior distribution on σ is a spike over that specific value. We’ll denote that fixed value as σ = Sy. 
\n", "* With this simplifying assumption, we are only estimating μ because we are assuming perfectly certain prior knowledge about σ.\n", "* prior: When σ is fixed, the prior distribution on μ in Equation 16.2 can easily be chosen to be conjugate to the normal likelihood.\n", " - The term “conjugate prior” was defined in Section 6.2, p. 126.\n", " - It turns out that the product of normal distributions is again a normal distribution; \n", " - in other words, if the prior on μ is normal, then the posterior on μ is normal.\n", " - Let the prior distribution on μ be normal with mean Mμ and standard deviation Sμ.\n", "* likelihood × prior:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ \\mathrm{normal}(y|\\mu,\\sigma) \\times \\mathrm{normal}(\\mu|M_\\mu,S_\\mu) \\propto \\mathrm{normal}(\\mu|\\mu_{\\mathrm{post}},\\sigma_{\\mathrm{post}}) $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### precision & posterior precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ \\frac{1}{\\sigma_{\\mathrm{post}}^2} = \\frac{1}{S_\\mu^2} + \\frac{1}{\\sigma^2} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Thus, the posterior precision is the sum of the prior precision and the likelihood precision." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### posterior mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ \\mu_{\\mathrm{post}} = \\frac{1/S_\\mu^2}{1/S_\\mu^2 + 1/\\sigma^2}\\, M_\\mu + \\frac{1/\\sigma^2}{1/S_\\mu^2 + 1/\\sigma^2}\\, y $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In other words, the posterior mean is a weighted average of the prior mean and the datum, with the weighting corresponding to the relative precisions of the prior and the likelihood."
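] }, { "cell_type": "markdown", "metadata": {}, "source": [
"A quick numerical check of the single-datum update above (not from the book, whose code is in R/JAGS; the values Mμ = 100, Sμ = 15, σ = 15, y = 115 are made up for illustration):\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# Prior on mu is normal(M_mu, S_mu); likelihood sd sigma is fixed; one datum y.\n",
"M_mu, S_mu = 100.0, 15.0\n",
"sigma, y = 15.0, 115.0\n",
"\n",
"# Posterior precision = prior precision + likelihood precision.\n",
"post_prec = 1 / S_mu**2 + 1 / sigma**2\n",
"post_sd = math.sqrt(1 / post_prec)\n",
"\n",
"# Posterior mean = precision-weighted average of prior mean and datum.\n",
"post_mean = ((1 / S_mu**2) * M_mu + (1 / sigma**2) * y) / post_prec\n",
"\n",
"print(post_mean, post_sd)  # 107.5, about 10.61\n",
"```\n",
"\n",
"Because the prior and the likelihood have equal precision here, the posterior mean lands exactly halfway between Mμ and y."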
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The case of N values" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ p(D|\\mu,\\sigma) = \\prod_{i=1}^{N} \\mathrm{normal}(y_i|\\mu,\\sigma) $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### posterior mean" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ \\mu_{\\mathrm{post}} = \\frac{1/S_\\mu^2}{1/S_\\mu^2 + N/\\sigma^2}\\, M_\\mu + \\frac{N/\\sigma^2}{1/S_\\mu^2 + N/\\sigma^2}\\, \\bar{y} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### posterior precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "$$ \\frac{1}{\\sigma_{\\mathrm{post}}^2} = \\frac{1}{S_\\mu^2} + \\frac{N}{\\sigma^2} $$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Notice that as the sample size N increases, the posterior mean is dominated by the data mean.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### When μ is fixed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can instead estimate the σ parameter when μ is fixed.\n", "* It turns out that when μ is fixed, a conjugate prior for the precision is the gamma distribution (e.g., Gelman et al., 2013, p. 43).\n", " - It is important to understand the meaning of a gamma prior on precision.\n", " - Consider a gamma distribution that is loaded heavily over very small values, but has a long, shallow tail extending over large values.\n", " - This sort of gamma distribution on precision indicates that we believe most strongly in small precisions, but we admit that large precisions are possible.\n", " - If this is a belief about the precision of a normal likelihood function, then this sort of gamma distribution expresses a belief that the data will be more spread out, because small precisions imply large standard deviations.\n", " - If the gamma distribution is instead loaded over large values of precision, it expresses a belief that the data will be tightly clustered."
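] }, { "cell_type": "markdown", "metadata": {}, "source": [
"The N-value update can be probed numerically. This Python sketch (not from the book; the prior and data values are made up) shows the posterior mean moving from the prior mean toward the data mean as N grows:\n",
"\n",
"```python\n",
"# Conjugate-normal posterior mean for N observations with fixed likelihood sd sigma.\n",
"def posterior_mean(M_mu, S_mu, sigma, ybar, N):\n",
"    prior_prec = 1 / S_mu**2\n",
"    data_prec = N / sigma**2  # precision contributed by N observations\n",
"    return (prior_prec * M_mu + data_prec * ybar) / (prior_prec + data_prec)\n",
"\n",
"# Prior centered at 100; suppose the data mean is 115.\n",
"for N in [1, 10, 100, 1000]:\n",
"    print(N, posterior_mean(100.0, 15.0, 15.0, 115.0, N))\n",
"# N = 1 gives 107.5; by N = 1000 the posterior mean is within 0.1 of the data mean 115.\n",
"```"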
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Reference: the gamma distribution\n", "\n", "* https://en.wikipedia.org/wiki/Gamma_distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### conjugate priors & gamma distribution & precision" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Because of its role in conjugate priors for the normal likelihood function, the gamma distribution is routinely used as a prior on precision.\n", "* But there is no logical necessity to do so, and modern MCMC methods permit more flexible specification of priors.\n", "* Indeed, because precision is less intuitive than standard deviation, it can be more useful instead to give the standard deviation a uniform prior that spans a wide range." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Summary" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have assumed that the data are generated by a normal likelihood function,\n", "parameterized by a mean μ and standard deviation σ, and denoted y ∼ normal(y|μ, σ).\n", "For purposes of mathematical derivation, we made the unrealistic assumption that the prior\n", "distribution is either a spike on σ or a spike on μ, in order to make three main\n", "points:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. A natural way to express a prior on μ is with a normal distribution, because this is conjugate with the normal likelihood when its standard deviation is fixed.\n", "2. A way to express a prior on the precision 1/σ² is with a gamma distribution, because this is conjugate with the normal likelihood when its mean is fixed. However, in practice the standard deviation can instead be given a uniform prior (or anything else that reflects prior beliefs, of course).\n", "3. 
The formulas for Bayesian updating of the parameter distribution are more conveniently expressed in terms of precision than standard deviation. Normal distributions are described sometimes in terms of standard deviation and sometimes in terms of precision, so it is important to glean from context which is being referred to. In R and Stan, the normal distribution is parameterized by mean and standard deviation. In JAGS and BUGS, the normal distribution is parameterized by mean and precision." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 16.1.2 Approximation by MCMC in JAGS" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is easy to estimate the mean and standard deviation in JAGS." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The right panel of Figure 16.2 instead puts a broad uniform distribution directly\n", "on σ . The low and high values of the uniform distribution are set to be far outside any realistic\n", "value for the data, so that the prior has minimal influence on the posterior. \n", "* The uniform prior on σ is easier to intuit than a gamma prior on precision, but the priors are not equivalent." 
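] }, { "cell_type": "markdown", "metadata": {}, "source": [
"Before looking at the JAGS model, it is worth verifying that the sd and precision parameterizations of the normal density describe the same distribution. A small Python check (not from the book; the values are arbitrary):\n",
"\n",
"```python\n",
"import math\n",
"\n",
"# Normal density parameterized by standard deviation (as in R and Stan).\n",
"def dnorm_sd(y, mu, sd):\n",
"    return math.exp(-0.5 * ((y - mu) / sd)**2) / (sd * math.sqrt(2 * math.pi))\n",
"\n",
"# Normal density parameterized by precision tau = 1/sd^2 (as in JAGS and BUGS).\n",
"def dnorm_prec(y, mu, tau):\n",
"    return math.sqrt(tau / (2 * math.pi)) * math.exp(-0.5 * tau * (y - mu)**2)\n",
"\n",
"sd = 15.0\n",
"print(math.isclose(dnorm_sd(110.0, 100.0, sd), dnorm_prec(110.0, 100.0, 1 / sd**2)))  # True\n",
"```\n",
"\n",
"This is why the model block below writes 1/sigma^2 where R code would pass sigma."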
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "dataList = list(\n", "  y = y ,\n", "  Ntotal = length(y) ,\n", "  meanY = mean(y) ,\n", "  sdY = sd(y)\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", "  for ( i in 1:Ntotal ) {\n", "    y[i] ~ dnorm( mu , 1/sigma^2 )  # JAGS uses precision\n", "  }\n", "  mu ~ dnorm( meanY , 1/(100*sdY)^2 )  # JAGS uses precision\n", "  sigma ~ dunif( sdY/1000 , sdY*1000 )\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For purposes of illustration, we use fictitious data.\n", "* The data are\n", " - IQ (intelligence quotient) scores\n", " - from a group of people who have consumed a “smart drug.”\n", " - We know that IQ tests have been normed to the general population so that they have an average score of 100 and a standard deviation of 15.\n", "* Therefore, we would like to know how differently the smart-drug group has performed relative to the general-population average." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Jags-Ymet-Xnom1grp-Mnormal-Example.R" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Not in the book: switch the working directory to the data folder\n", "# so the scripts placed there for the exercise can be run.\n", "cur_dir = getwd()\n", "setwd(sprintf(\"%s/%s\", cur_dir, 'data'))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Example for Jags-Ymet-Xnom1grp-Mnormal.R \n", "#------------------------------------------------------------------------------- \n", "# Optional generic preliminaries:\n", "#graphics.off() # This closes all of R's graphics windows.\n", "#rm(list=ls()) # Careful! This clears all of R's memory!"
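] }, { "cell_type": "markdown", "metadata": {}, "source": [
"As a sketch of what the dataList feeds into the model, the Python snippet below (not from the book) mirrors meanY and sdY and the vague-prior constants derived from them; the scores are a hypothetical stand-in for the y vector read from the CSV:\n",
"\n",
"```python\n",
"import statistics\n",
"\n",
"# Hypothetical stand-in for the y vector read from TwoGroupIQ.csv.\n",
"y = [102, 107, 92, 101, 110, 68, 119]\n",
"\n",
"meanY = statistics.mean(y)\n",
"sdY = statistics.stdev(y)  # sample sd with n-1 denominator, like R's sd()\n",
"\n",
"# Constants the model statement turns into broad, minimally informative priors:\n",
"mu_prior_sd = 100 * sdY                      # mu ~ dnorm(meanY, 1/(100*sdY)^2)\n",
"sigma_lo, sigma_hi = sdY / 1000, sdY * 1000  # sigma ~ dunif(sdY/1000, sdY*1000)\n",
"\n",
"# The uniform bounds bracket any realistic sigma by three orders of magnitude.\n",
"print(meanY, sdY, mu_prior_sd, (sigma_lo, sigma_hi))\n",
"```"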
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the data file \n", "myDataFrame = read.csv( file=\"TwoGroupIQ.csv\" )" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
Score | Group | |
---|---|---|
1 | 102 | Smart Drug |
2 | 107 | Smart Drug |
3 | 92 | Smart Drug |
4 | 101 | Smart Drug |
5 | 110 | Smart Drug |
6 | 68 | Smart Drug |
7 | 119 | Smart Drug |
8 | 106 | Smart Drug |
9 | 99 | Smart Drug |
10 | 103 | Smart Drug |
11 | 90 | Smart Drug |
12 | 93 | Smart Drug |
13 | 79 | Smart Drug |
14 | 89 | Smart Drug |
15 | 137 | Smart Drug |
16 | 119 | Smart Drug |
17 | 126 | Smart Drug |
18 | 110 | Smart Drug |
19 | 71 | Smart Drug |
20 | 114 | Smart Drug |
21 | 100 | Smart Drug |
22 | 95 | Smart Drug |
23 | 91 | Smart Drug |
24 | 99 | Smart Drug |
25 | 97 | Smart Drug |
26 | 106 | Smart Drug |
27 | 106 | Smart Drug |
28 | 129 | Smart Drug |
29 | 115 | Smart Drug |
30 | 124 | Smart Drug |
31 | 137 | Smart Drug |
32 | 73 | Smart Drug |
33 | 69 | Smart Drug |
34 | 95 | Smart Drug |
35 | 102 | Smart Drug |
36 | 116 | Smart Drug |
37 | 111 | Smart Drug |
38 | 134 | Smart Drug |
39 | 102 | Smart Drug |
40 | 110 | Smart Drug |
41 | 139 | Smart Drug |
42 | 112 | Smart Drug |
43 | 122 | Smart Drug |
44 | 84 | Smart Drug |
45 | 129 | Smart Drug |
46 | 112 | Smart Drug |
47 | 127 | Smart Drug |
48 | 106 | Smart Drug |
49 | 113 | Smart Drug |
50 | 109 | Smart Drug |
51 | 208 | Smart Drug |
52 | 114 | Smart Drug |
53 | 107 | Smart Drug |
54 | 50 | Smart Drug |
55 | 169 | Smart Drug |
56 | 133 | Smart Drug |
57 | 50 | Smart Drug |
58 | 97 | Smart Drug |
59 | 139 | Smart Drug |
60 | 72 | Smart Drug |
61 | 100 | Smart Drug |
62 | 144 | Smart Drug |
63 | 112 | Smart Drug |
64 | 109 | Placebo |
65 | 98 | Placebo |
66 | 106 | Placebo |
67 | 101 | Placebo |
68 | 100 | Placebo |
69 | 111 | Placebo |
70 | 117 | Placebo |
71 | 104 | Placebo |
72 | 106 | Placebo |
73 | 89 | Placebo |
74 | 84 | Placebo |
75 | 88 | Placebo |
76 | 94 | Placebo |
77 | 78 | Placebo |
78 | 108 | Placebo |
79 | 102 | Placebo |
80 | 95 | Placebo |
81 | 99 | Placebo |
82 | 90 | Placebo |
83 | 116 | Placebo |
84 | 97 | Placebo |
85 | 107 | Placebo |
86 | 102 | Placebo |
87 | 91 | Placebo |
88 | 94 | Placebo |
89 | 95 | Placebo |
90 | 86 | Placebo |
91 | 108 | Placebo |
92 | 115 | Placebo |
93 | 108 | Placebo |
94 | 88 | Placebo |
95 | 102 | Placebo |
96 | 102 | Placebo |
97 | 120 | Placebo |
98 | 112 | Placebo |
99 | 100 | Placebo |
100 | 105 | Placebo |
101 | 105 | Placebo |
102 | 88 | Placebo |
103 | 82 | Placebo |
104 | 111 | Placebo |
105 | 96 | Placebo |
106 | 92 | Placebo |
107 | 109 | Placebo |
108 | 91 | Placebo |
109 | 92 | Placebo |
110 | 123 | Placebo |
111 | 61 | Placebo |
112 | 59 | Placebo |
113 | 105 | Placebo |
114 | 184 | Placebo |
115 | 82 | Placebo |
116 | 138 | Placebo |
117 | 99 | Placebo |
118 | 93 | Placebo |
119 | 93 | Placebo |
120 | 72 | Placebo |