{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Chapter 18. Metric Predicted Variable with Multiple Metric Predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 베이지안 통계 R\n", "* Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan\n", "* 김무성" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Contents" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 18.1. Multiple Linear Regression\n", "* 18.2. Multiplicative Interaction of Metric Predictors\n", "* 18.3. Shrinkage of Regression Coefficients\n", "* 18.4. Variable Selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this chapter, we are concerned with situations in which the value to be predicted is on a metric scale, and there are several predictors, each of which is also on a metric scale. \n", "\n", "* For example, \n", " - we might predict \n", " - a person’s college grade point average (GPA) from \n", " - his or her high-school GPA and \n", " - scholastic aptitude test (SAT) score. \n", " - Another such situation is predicting \n", " - a person’s blood pressure from \n", " - his or her height and \n", " - weight." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* multiple linear regression \n", " - predicted variable is an additive combination of predictors\n", "* interactions \n", " - nonadditive combinations of predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 18.1. 
Multiple Linear Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 18.1.1 The perils of correlated predictors\n", "* 18.1.2 The model and implementation\n", "* 18.1.3 The posterior distribution\n", "* 18.1.4 Redundant predictors\n", "* 18.1.5 Informative priors, sparse data, and correlated predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "y ∼ normal(μ, σ) and μ = β0 + β1x1 + β2x2." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.1.1 The perils of correlated predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.1.2 The model and implementation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data block of the JAGS code then standardizes the predictors by looping\n", "through the columns of the x matrix, as follows:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "data {\n", " ym <- mean(y)\n", " ysd <- sd(y)\n", " for ( i in 1:Ntotal ) { # Ntotal is the number of data rows\n", " zy[i] <- ( y[i] - ym ) / ysd\n", " }\n", " for ( j in 1:Nx ) { # Nx is the number of x predictors\n", " xm[j] <- mean(x[,j]) # x is a matrix, each column a predictor\n", " xsd[j] <- sd(x[,j])\n", " for ( i in 1:Ntotal ) {\n", " zx[i,j] <- ( x[i,j] - xm[j] ) / xsd[j]\n", " }\n", " }\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The standardized parameters are then transformed to the original scale by generalizing Equation 17.2 (p. 
485) to multiple predictors:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The JAGS model specification looks like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " for ( i in 1:Ntotal ) {\n", " zy[i] ~ dt( zbeta0 + sum( zbeta[1:Nx] * zx[i,1:Nx] ) , 1/zsigma^2 , nu )\n", " }\n", " # Priors vague on standardized scale:\n", " zbeta0 ~ dnorm( 0 , 1/2^2 )\n", " for ( j in 1:Nx ) {\n", " zbeta[j] ~ dnorm( 0 , 1/2^2 )\n", " }\n", " zsigma ~ dunif( 1.0E-5 , 1.0E+1 )\n", " nu <- nuMinusOne+1\n", " nuMinusOne ~ dexp(1/29.0)\n", " # Transform to original scale:\n", " beta[1:Nx] <- ( zbeta[1:Nx] / xsd[1:Nx] )*ysd\n", " beta0 <- zbeta0*ysd + ym - sum( zbeta[1:Nx] * xm[1:Nx] / xsd[1:Nx] )*ysd\n", " sigma <- zsigma*ysd\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Jags-Ymet-XmetMulti-Mrobust-Example.R" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Not in the book: switch the working directory to the data folder holding the scripts and data files for this exercise\n", "cur_dir = getwd()\n", "setwd(sprintf(\"%s/%s\", cur_dir, 'data'))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Example for Jags-Ymet-XmetMulti-Mrobust.R \n", "#------------------------------------------------------------------------------- \n", "# Optional generic preliminaries:\n", "#graphics.off() # This closes all of R's graphics windows.\n", "#rm(list=ls()) # Careful! This clears all of R's memory!" 
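] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The back-transformation at the end of the model block can be sanity-checked in plain R. The following is a sketch, not from the book: the toy data and the zbeta values are made up, and only the algebra mirrors the JAGS code above.\n", "\n", "```r\n", "# Verify that a standardized-scale fit and its original-scale\n", "# transform give identical predictions.\n", "set.seed(1)\n", "x <- matrix(rnorm(100 * 2, mean = 10, sd = 3), ncol = 2) # toy predictors\n", "y <- 5 + 2 * x[, 1] - x[, 2] + rnorm(100) # toy criterion\n", "ym <- mean(y) ; ysd <- sd(y)\n", "xm <- apply(x, 2, mean) ; xsd <- apply(x, 2, sd)\n", "zbeta0 <- 0.1 ; zbeta <- c(0.4, -0.3) # stand-ins for MCMC draws\n", "beta <- ( zbeta / xsd ) * ysd # same algebra as the JAGS code\n", "beta0 <- zbeta0 * ysd + ym - sum( zbeta * xm / xsd ) * ysd\n", "zx <- sweep(sweep(x, 2, xm), 2, xsd, '/') # standardize each column\n", "muZ <- (zbeta0 + zx %*% zbeta) * ysd + ym # z-scale prediction, unstandardized\n", "muO <- beta0 + x %*% beta # original-scale prediction\n", "max(abs(muZ - muO)) # essentially zero\n", "```"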
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#.............................................................................\n", "# # Two predictors:\n", "myData = read.csv( file=\"Guber1999data.csv\" )" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
(garbled HTML rendering of the 50-state Guber 1999 table removed; the same data appears intact in the plain-text output below)
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllll}\n", " & State & Spend & StuTeaRat & Salary & PrcntTake & SATV & SATM & SATT\\\\\n", "\\hline\n", "\t1 & Alabama & 4.405 & 17.2 & 31.144 & 8 & 491 & 538 & 1029\\\\\n", "\t2 & Alaska & 8.963 & 17.6 & 47.951 & 47 & 445 & 489 & 934\\\\\n", "\t3 & Arizona & 4.778 & 19.3 & 32.175 & 27 & 448 & 496 & 944\\\\\n", "\t4 & Arkansas & 4.459 & 17.1 & 28.934 & 6 & 482 & 523 & 1005\\\\\n", "\t5 & California & 4.992 & 24 & 41.078 & 45 & 417 & 485 & 902\\\\\n", "\t6 & Colorado & 5.443 & 18.4 & 34.571 & 29 & 462 & 518 & 980\\\\\n", "\t7 & Connecticut & 8.817 & 14.4 & 50.045 & 81 & 431 & 477 & 908\\\\\n", "\t8 & Delaware & 7.03 & 16.6 & 39.076 & 68 & 429 & 468 & 897\\\\\n", "\t9 & Florida & 5.718 & 19.1 & 32.588 & 48 & 420 & 469 & 889\\\\\n", "\t10 & Georgia & 5.193 & 16.3 & 32.291 & 65 & 406 & 448 & 854\\\\\n", "\t11 & Hawaii & 6.078 & 17.9 & 38.518 & 57 & 407 & 482 & 889\\\\\n", "\t12 & Idaho & 4.21 & 19.1 & 29.783 & 15 & 468 & 511 & 979\\\\\n", "\t13 & Illinois & 6.136 & 17.3 & 39.431 & 13 & 488 & 560 & 1048\\\\\n", "\t14 & Indiana & 5.826 & 17.5 & 36.785 & 58 & 415 & 467 & 882\\\\\n", "\t15 & Iowa & 5.483 & 15.8 & 31.511 & 5 & 516 & 583 & 1099\\\\\n", "\t16 & Kansas & 5.817 & 15.1 & 34.652 & 9 & 503 & 557 & 1060\\\\\n", "\t17 & Kentucky & 5.217 & 17 & 32.257 & 11 & 477 & 522 & 999\\\\\n", "\t18 & Louisiana & 4.761 & 16.8 & 26.461 & 9 & 486 & 535 & 1021\\\\\n", "\t19 & Maine & 6.428 & 13.8 & 31.972 & 68 & 427 & 469 & 896\\\\\n", "\t20 & Maryland & 7.245 & 17 & 40.661 & 64 & 430 & 479 & 909\\\\\n", "\t21 & Massachusetts & 7.287 & 14.8 & 40.795 & 80 & 430 & 477 & 907\\\\\n", "\t22 & Michigan & 6.994 & 20.1 & 41.895 & 11 & 484 & 549 & 1033\\\\\n", "\t23 & Minnesota & 6 & 17.5 & 35.948 & 9 & 506 & 579 & 1085\\\\\n", "\t24 & Mississippi & 4.08 & 17.5 & 26.818 & 4 & 496 & 540 & 1036\\\\\n", "\t25 & Missouri & 5.383 & 15.5 & 31.189 & 9 & 495 & 550 & 1045\\\\\n", "\t26 & Montana & 5.692 & 16.3 & 28.785 & 21 & 473 & 536 & 1009\\\\\n", 
"\t27 & Nebraska & 5.935 & 14.5 & 30.922 & 9 & 494 & 556 & 1050\\\\\n", "\t28 & Nevada & 5.16 & 18.7 & 34.836 & 30 & 434 & 483 & 917\\\\\n", "\t29 & New Hampshire & 5.859 & 15.6 & 34.72 & 70 & 444 & 491 & 935\\\\\n", "\t30 & New Jersey & 9.774 & 13.8 & 46.087 & 70 & 420 & 478 & 898\\\\\n", "\t31 & New Mexico & 4.586 & 17.2 & 28.493 & 11 & 485 & 530 & 1015\\\\\n", "\t32 & New York & 9.623 & 15.2 & 47.612 & 74 & 419 & 473 & 892\\\\\n", "\t33 & North Carolina & 5.077 & 16.2 & 30.793 & 60 & 411 & 454 & 865\\\\\n", "\t34 & North Dakota & 4.775 & 15.3 & 26.327 & 5 & 515 & 592 & 1107\\\\\n", "\t35 & Ohio & 6.162 & 16.6 & 36.802 & 23 & 460 & 515 & 975\\\\\n", "\t36 & Oklahoma & 4.845 & 15.5 & 28.172 & 9 & 491 & 536 & 1027\\\\\n", "\t37 & Oregon & 6.436 & 19.9 & 38.555 & 51 & 448 & 499 & 947\\\\\n", "\t38 & Pennsylvania & 7.109 & 17.1 & 44.51 & 70 & 419 & 461 & 880\\\\\n", "\t39 & Rhode Island & 7.469 & 14.7 & 40.729 & 70 & 425 & 463 & 888\\\\\n", "\t40 & South Carolina & 4.797 & 16.4 & 30.279 & 58 & 401 & 443 & 844\\\\\n", "\t41 & South Dakota & 4.775 & 14.4 & 25.994 & 5 & 505 & 563 & 1068\\\\\n", "\t42 & Tennessee & 4.388 & 18.6 & 32.477 & 12 & 497 & 543 & 1040\\\\\n", "\t43 & Texas & 5.222 & 15.7 & 31.223 & 47 & 419 & 474 & 893\\\\\n", "\t44 & Utah & 3.656 & 24.3 & 29.082 & 4 & 513 & 563 & 1076\\\\\n", "\t45 & Vermont & 6.75 & 13.8 & 35.406 & 68 & 429 & 472 & 901\\\\\n", "\t46 & Virginia & 5.327 & 14.6 & 33.987 & 65 & 428 & 468 & 896\\\\\n", "\t47 & Washington & 5.906 & 20.2 & 36.151 & 48 & 443 & 494 & 937\\\\\n", "\t48 & West Virginia & 6.107 & 14.8 & 31.944 & 17 & 448 & 484 & 932\\\\\n", "\t49 & Wisconsin & 6.93 & 15.9 & 37.746 & 9 & 501 & 572 & 1073\\\\\n", "\t50 & Wyoming & 6.16 & 14.9 & 31.285 & 10 & 476 & 525 & 1001\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " State Spend StuTeaRat Salary PrcntTake SATV SATM SATT\n", "1 Alabama 4.405 17.2 31.144 8 491 538 1029\n", "2 Alaska 8.963 17.6 47.951 47 445 489 934\n", "3 Arizona 4.778 19.3 32.175 27 448 496 944\n", "4 
Arkansas 4.459 17.1 28.934 6 482 523 1005\n", "5 California 4.992 24.0 41.078 45 417 485 902\n", "6 Colorado 5.443 18.4 34.571 29 462 518 980\n", "7 Connecticut 8.817 14.4 50.045 81 431 477 908\n", "8 Delaware 7.030 16.6 39.076 68 429 468 897\n", "9 Florida 5.718 19.1 32.588 48 420 469 889\n", "10 Georgia 5.193 16.3 32.291 65 406 448 854\n", "11 Hawaii 6.078 17.9 38.518 57 407 482 889\n", "12 Idaho 4.210 19.1 29.783 15 468 511 979\n", "13 Illinois 6.136 17.3 39.431 13 488 560 1048\n", "14 Indiana 5.826 17.5 36.785 58 415 467 882\n", "15 Iowa 5.483 15.8 31.511 5 516 583 1099\n", "16 Kansas 5.817 15.1 34.652 9 503 557 1060\n", "17 Kentucky 5.217 17.0 32.257 11 477 522 999\n", "18 Louisiana 4.761 16.8 26.461 9 486 535 1021\n", "19 Maine 6.428 13.8 31.972 68 427 469 896\n", "20 Maryland 7.245 17.0 40.661 64 430 479 909\n", "21 Massachusetts 7.287 14.8 40.795 80 430 477 907\n", "22 Michigan 6.994 20.1 41.895 11 484 549 1033\n", "23 Minnesota 6.000 17.5 35.948 9 506 579 1085\n", "24 Mississippi 4.080 17.5 26.818 4 496 540 1036\n", "25 Missouri 5.383 15.5 31.189 9 495 550 1045\n", "26 Montana 5.692 16.3 28.785 21 473 536 1009\n", "27 Nebraska 5.935 14.5 30.922 9 494 556 1050\n", "28 Nevada 5.160 18.7 34.836 30 434 483 917\n", "29 New Hampshire 5.859 15.6 34.720 70 444 491 935\n", "30 New Jersey 9.774 13.8 46.087 70 420 478 898\n", "31 New Mexico 4.586 17.2 28.493 11 485 530 1015\n", "32 New York 9.623 15.2 47.612 74 419 473 892\n", "33 North Carolina 5.077 16.2 30.793 60 411 454 865\n", "34 North Dakota 4.775 15.3 26.327 5 515 592 1107\n", "35 Ohio 6.162 16.6 36.802 23 460 515 975\n", "36 Oklahoma 4.845 15.5 28.172 9 491 536 1027\n", "37 Oregon 6.436 19.9 38.555 51 448 499 947\n", "38 Pennsylvania 7.109 17.1 44.510 70 419 461 880\n", "39 Rhode Island 7.469 14.7 40.729 70 425 463 888\n", "40 South Carolina 4.797 16.4 30.279 58 401 443 844\n", "41 South Dakota 4.775 14.4 25.994 5 505 563 1068\n", "42 Tennessee 4.388 18.6 32.477 12 497 543 1040\n", "43 Texas 5.222 15.7 
31.223 47 419 474 893\n", "44 Utah 3.656 24.3 29.082 4 513 563 1076\n", "45 Vermont 6.750 13.8 35.406 68 429 472 901\n", "46 Virginia 5.327 14.6 33.987 65 428 468 896\n", "47 Washington 5.906 20.2 36.151 48 443 494 937\n", "48 West Virginia 6.107 14.8 31.944 17 448 484 932\n", "49 Wisconsin 6.930 15.9 37.746 9 501 572 1073\n", "50 Wyoming 6.160 14.9 31.285 10 476 525 1001" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myData" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ " State Spend StuTeaRat Salary \n", " Alabama : 1 Min. :3.656 Min. :13.80 Min. :25.99 \n", " Alaska : 1 1st Qu.:4.882 1st Qu.:15.22 1st Qu.:30.98 \n", " Arizona : 1 Median :5.768 Median :16.60 Median :33.29 \n", " Arkansas : 1 Mean :5.905 Mean :16.86 Mean :34.83 \n", " California: 1 3rd Qu.:6.434 3rd Qu.:17.57 3rd Qu.:38.55 \n", " Colorado : 1 Max. :9.774 Max. :24.30 Max. :50.05 \n", " (Other) :44 \n", " PrcntTake SATV SATM SATT \n", " Min. : 4.00 Min. :401.0 Min. :443.0 Min. : 844.0 \n", " 1st Qu.: 9.00 1st Qu.:427.2 1st Qu.:474.8 1st Qu.: 897.2 \n", " Median :28.00 Median :448.0 Median :497.5 Median : 945.5 \n", " Mean :35.24 Mean :457.1 Mean :508.8 Mean : 965.9 \n", " 3rd Qu.:63.00 3rd Qu.:490.2 3rd Qu.:539.5 3rd Qu.:1032.0 \n", " Max. :81.00 Max. :516.0 Max. :592.0 Max. 
:1107.0 \n", " " ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "summary(myData)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\")\n", "fileNameRoot = \"Guber1999data-Jags-\" \n", "numSavedSteps=15000 ; thinSteps=5" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "graphFileType = \"png\" " ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "*********************************************************************\n", "Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:\n", "A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.\n", "*********************************************************************\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: coda\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the relevant model into R's working memory:\n", "source(\"Jags-Ymet-XmetMulti-Mrobust.R\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "CORRELATION MATRIX OF PREDICTORS:\n", " Spend PrcntTake\n", "Spend 1.000 0.593\n", "PrcntTake 0.593 1.000\n", "\n", "Calling 3 simulations using the parallel method...\n", "Following the progress of chain 1 (the program will wait for all chains\n", "to finish before continuing):\n", "Welcome to JAGS 3.4.0 on Wed Sep 16 20:42:43 2015\n", "JAGS is free software and comes with ABSOLUTELY NO WARRANTY\n", "Loading module: basemod: ok\n", "Loading module: bugs: ok\n", ". . Reading data file data.txt\n", ". 
Compiling data graph\n", " Resolving undeclared variables\n", " Allocating nodes\n", " Initializing\n", " Reading data back into data table\n", "Compiling model graph\n", " Resolving undeclared variables\n", " Allocating nodes\n", " Graph Size: 535\n", ". Reading parameter file inits1.txt\n", ". Initializing model\n", ". Adapting 500\n", "-------------------------------------------------| 500\n", "++++++++++++++++++++++++++++++++++++++++++++++++++ 100%\n", "Adaptation successful\n", ". Updating 1000\n", "-------------------------------------------------| 1000\n", "************************************************** 100%\n", ". . . . . . . . Updating 25000\n", "-------------------------------------------------| 25000\n", "************************************************** 100%\n", ". . . . Updating 0\n", ". Deleting model\n", ". \n", "All chains have finished\n", "Simulation complete. Reading coda files...\n", "Coda files loaded successfully\n", "Finished running the simulation\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Generate the MCMC chain:\n", "#startTime = proc.time()\n", "mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName , \n", " numSavedSteps=numSavedSteps , thinSteps=thinSteps , \n", " saveName=fileNameRoot )" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Display diagnostics of chain, for specified parameters:\n", "parameterNames = varnames(mcmcCoda) # get all parameter names\n", "for ( parName in parameterNames ) {\n", " diagMCMC( codaObject=mcmcCoda , parName=parName , \n", " saveName=fileNameRoot , saveType=graphFileType )\n", "}" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Close the graphics windows that pop up.\n", "graphics.off()" ] }, { "cell_type": "code", 
"execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mean Median Mode ESS HDImass HDIlow\n", "CHAIN 2.000000000 2.000000000 1.997413e+00 1.5 0.95 1.0000000\n", "beta0 991.492221267 991.336500000 9.899081e+02 13808.0 0.95 948.3840000\n", "beta[1] 12.832703606 12.858900000 1.273112e+01 13153.9 0.95 4.5361400\n", "beta[2] -2.878617251 -2.879090000 -2.881414e+00 13227.0 0.95 -3.3193300\n", "sigma 31.530577473 31.329000000 3.106441e+01 13692.3 0.95 23.9116000\n", "zbeta0 -0.001200131 -0.001465985 -9.973416e-05 15000.0 0.95 -0.1229360\n", "zbeta[1] 0.233739136 0.234216000 2.318889e-01 13153.9 0.95 0.0826227\n", "zbeta[2] -1.029646875 -1.029815000 -1.030645e+00 13227.0 0.95 -1.1872800\n", "zsigma 0.421415965 0.418722000 4.151860e-01 13692.3 0.95 0.3195860\n", "nu 33.397469008 24.824800000 1.103249e+01 10801.1 0.95 1.8336700\n", "log10(nu) 1.375887230 1.394885757 1.485261e+00 12329.1 0.95 0.6664712\n", " HDIhigh CompVal PcntGtCompVal ROPElow ROPEhigh PcntLtROPE\n", "CHAIN 3.000000 NA NA NA NA NA\n", "beta0 1035.490000 NA NA NA NA NA\n", "beta[1] 21.497200 NA NA NA NA NA\n", "beta[2] -2.442310 NA NA NA NA NA\n", "sigma 39.230700 NA NA NA NA NA\n", "zbeta0 0.121780 NA NA NA NA NA\n", "zbeta[1] 0.391557 NA NA NA NA NA\n", "zbeta[2] -0.873587 NA NA NA NA NA\n", "zsigma 0.524331 NA NA NA NA NA\n", "nu 90.746200 NA NA NA NA NA\n", "log10(nu) 2.072673 NA NA NA NA NA\n", " PcntInROPE PcntGtROPE\n", "CHAIN NA NA\n", "beta0 NA NA\n", "beta[1] NA NA\n", "beta[2] NA NA\n", "sigma NA NA\n", "zbeta0 NA NA\n", "zbeta[1] NA NA\n", "zbeta[2] NA NA\n", "zsigma NA NA\n", "nu NA NA\n", "log10(nu) NA NA\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Get summary statistics of chain:\n", "summaryInfo = smryMCMC( mcmcCoda , \n", " saveName=fileNameRoot )\n", "show(summaryInfo)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": 
true }, "outputs": [], "source": [ "# Display posterior information:\n", "plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName , \n", " pairsPlot=TRUE , showCurve=FALSE ,\n", " saveName=fileNameRoot , saveType=graphFileType )" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 창으로 뜨는 것들을 없앤다.\n", "graphics.off()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.1.3 The posterior distribution" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### proportion of variance accounted for" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### frequentist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In least-squares regression, the overall variance in y is algebraically decomposed into the variance of the linearly predicted values and the residual variance:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The proportion of variance accounted for is" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### bayesian" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Bayesian analysis no such decomposition of variance occurs\n", "* But for people familiar with least-squares notions who crave a statistic analogous to R^2 , we\n", "can compute a surrogate. 
At each step in the MCMC chain, a credible value of R^2 is computed as" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "where" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "is the standardized regression coefficient for the jth predictor at that step in the MCMC chain," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "is the correlation of the predicted values," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "with the jth predictor values," ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.1.4 Redundant predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.1.5 Informative priors, sparse data, and correlated predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Informed priors " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The examples in this book tend to use mildly informed priors (e.g., using information about the rough magnitude and range of the data). But a benefit of Bayesian analysis is the potential for cumulative scientific progress by using priors that have been informed from previous research." 
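] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A sketch of the surrogate computation, not from the book: it assumes the surrogate at each MCMC step is the sum over predictors of the standardized coefficient times the correlation of y with that predictor, and the matrix zbetaDraws of coefficient draws is made up (in practice it would be extracted from mcmcCoda).\n", "\n", "```r\n", "set.seed(2)\n", "x <- matrix(rnorm(50 * 2), ncol = 2) # toy predictors\n", "y <- 1 + x[, 1] - 0.5 * x[, 2] + rnorm(50, sd = 0.5) # toy criterion\n", "ryx <- cor(y, x) # 1 x Nx matrix of correlations r(y, x_j)\n", "zbetaDraws <- matrix(rnorm(1000 * 2, mean = c(0.6, -0.3), sd = 0.05),\n", " ncol = 2, byrow = TRUE) # stand-in MCMC draws of zbeta\n", "R2draws <- as.vector(zbetaDraws %*% t(ryx)) # one credible R^2 per step\n", "quantile(R2draws, c(0.025, 0.5, 0.975)) # posterior summary of R^2\n", "```"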
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### sparse data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Informed priors can be especially useful when the amount of data is small compared to the parameter space.\n", "* A strongly informed prior essentially reduces the scope of the credible parameter space, so that a small amount of new data implies a narrow zone of credible parameter values." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### correlated predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In the context of multiple linear regression, sparse data can lead to usefully precise posteriors on regression coefficients if some of the regression coefficients have informed priors and the predictors are strongly correlated. \n", "* To understand this idea, it is important to remember that when predictors are correlated, their regression coefficients are also (anti-)correlated.\n", "* For example, \n", " - recall the SAT data from Figure 18.3 (p. 514) in which spending-per-pupil and percent-taking-the-exam are correlated. Consequently, the posterior estimates of the regression coefficients had a negative correlation, as shown in Figure 18.5 (p. 518). The correlation of credible regression coefficients implies that a strong belief about the value of one regression coefficient constrains the value of the other coefficient. \n", " - That influence of one slope estimate on another can be used for inferential advantage when we have prior knowledge about one of the slopes. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 18.2. 
Multiplicative Interaction of Metric Predictors" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 18.2.1 An example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.2.1 An example" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We conceptualize the interaction term of Equation 18.2 as an additional additive predictor, like this:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "μ = β0 + β1x1 + β2x2 + β3x3, where x3 = x1x2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We create the new variable x3 = x1x2 outside the model and then submit the new variable as if it were another additive predictor.\n", " - One benefit of this approach is that we do not have to create a new model, and it is easy, in cases of many predictors, to set up interaction variables for many different combinations of variables. \n", " - Another key benefit is that we can examine the correlations of the single predictors with the interaction variables. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* To illustrate some of the issues involved in interpreting the parameters of a model with interaction, consider again the SAT data from Figure 18.3. \n", " - SAT score\n", " - the spending per pupil (Spend) \n", " - the percentage of students who took the test (PrcntTake)\n", " - When no interaction term was included in the model, the posterior distribution looked like Figure 18.5, which indicated a positive influence of Spend and a negative influence of PrcntTake.\n", "* We will include a multiplicative interaction of Spend and PrcntTake."
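] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the interaction included, μ = β0 + β1·Spend + β2·PrcntTake + β3·Spend·PrcntTake, so the credible slope on Spend is β1 + β3·PrcntTake and must be evaluated at specific PrcntTake values. A sketch with made-up coefficient draws (real draws would come from mcmcCoda):\n", "\n", "```r\n", "set.seed(3)\n", "b1 <- rnorm(1000, mean = 20, sd = 4) # stand-in draws for beta[1]\n", "b3 <- rnorm(1000, mean = -0.25, sd = 0.05) # stand-in draws for beta[3]\n", "slopeAt <- function(p) b1 + b3 * p # credible slopes on Spend at PrcntTake = p\n", "sapply(c(10, 40, 70), function(p) median(slopeAt(p))) # slope shrinks as p grows\n", "```"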
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Jags-Ymet- XmetMulti-Mrobust-Example.R" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Read in data:\n", "myData = read.csv( file=\"Guber1999data.csv\" )\n", "# Append the new interaction variable:\n", "myData = cbind( myData , SpendXPrcnt = myData[,\"Spend\"] * myData[,\"PrcntTake\"] )\n", "# Specify names of data columns to use in the analysis:\n", "yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\",\"SpendXPrcnt\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#### 수정된 버전 Jags-Ymet-XmetMulti-Mrobust-Example.R" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 책에는 없는 코드. 실습을 위해 data 폴더에 넣은 코드를 실행시키기 위한 작업\n", "cur_dir = getwd()\n", "setwd(sprintf(\"%s/%s\", cur_dir, 'data'))" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#.............................................................................\n", "# # Two predictors:\n", "myData = read.csv( file=\"Guber1999data.csv\" )" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Append the new interaction variable:\n", "myData = cbind( myData , SpendXPrcnt = myData[,\"Spend\"] * myData[,\"PrcntTake\"] )\n", "# Specify names of data columns to use in the analysis:\n", "yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\",\"SpendXPrcnt\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", 
"\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
\n" ], "text/latex": [ "\\begin{tabular}{r|lllllllll}\n", " & State & Spend & StuTeaRat & Salary & PrcntTake & SATV & SATM & SATT & SpendXPrcnt\\\\\n", "\\hline\n", "\t1 & Alabama & 4.405 & 17.2 & 31.144 & 8 & 491 & 538 & 1029 & 35.24\\\\\n", "\t2 & Alaska & 8.963 & 17.6 & 47.951 & 47 & 445 & 489 & 934 & 421.261\\\\\n", "\t3 & Arizona & 4.778 & 19.3 & 32.175 & 27 & 448 & 496 & 944 & 129.006\\\\\n", "\t4 & Arkansas & 4.459 & 17.1 & 28.934 & 6 & 482 & 523 & 1005 & 26.754\\\\\n", "\t5 & California & 4.992 & 24 & 41.078 & 45 & 417 & 485 & 902 & 224.64\\\\\n", "\t6 & Colorado & 5.443 & 18.4 & 34.571 & 29 & 462 & 518 & 980 & 157.847\\\\\n", "\t7 & Connecticut & 8.817 & 14.4 & 50.045 & 81 & 431 & 477 & 908 & 714.177\\\\\n", "\t8 & Delaware & 7.03 & 16.6 & 39.076 & 68 & 429 & 468 & 897 & 478.04\\\\\n", "\t9 & Florida & 5.718 & 19.1 & 32.588 & 48 & 420 & 469 & 889 & 274.464\\\\\n", "\t10 & Georgia & 5.193 & 16.3 & 32.291 & 65 & 406 & 448 & 854 & 337.545\\\\\n", "\t11 & Hawaii & 6.078 & 17.9 & 38.518 & 57 & 407 & 482 & 889 & 346.446\\\\\n", "\t12 & Idaho & 4.21 & 19.1 & 29.783 & 15 & 468 & 511 & 979 & 63.15\\\\\n", "\t13 & Illinois & 6.136 & 17.3 & 39.431 & 13 & 488 & 560 & 1048 & 79.768\\\\\n", "\t14 & Indiana & 5.826 & 17.5 & 36.785 & 58 & 415 & 467 & 882 & 337.908\\\\\n", "\t15 & Iowa & 5.483 & 15.8 & 31.511 & 5 & 516 & 583 & 1099 & 27.415\\\\\n", "\t16 & Kansas & 5.817 & 15.1 & 34.652 & 9 & 503 & 557 & 1060 & 52.353\\\\\n", "\t17 & Kentucky & 5.217 & 17 & 32.257 & 11 & 477 & 522 & 999 & 57.387\\\\\n", "\t18 & Louisiana & 4.761 & 16.8 & 26.461 & 9 & 486 & 535 & 1021 & 42.849\\\\\n", "\t19 & Maine & 6.428 & 13.8 & 31.972 & 68 & 427 & 469 & 896 & 437.104\\\\\n", "\t20 & Maryland & 7.245 & 17 & 40.661 & 64 & 430 & 479 & 909 & 463.68\\\\\n", "\t21 & Massachusetts & 7.287 & 14.8 & 40.795 & 80 & 430 & 477 & 907 & 582.96\\\\\n", "\t22 & Michigan & 6.994 & 20.1 & 41.895 & 11 & 484 & 549 & 1033 & 76.934\\\\\n", "\t23 & Minnesota & 6 & 17.5 & 35.948 & 9 & 506 & 579 & 1085 & 
54\\\\\n", "\t24 & Mississippi & 4.08 & 17.5 & 26.818 & 4 & 496 & 540 & 1036 & 16.32\\\\\n", "\t25 & Missouri & 5.383 & 15.5 & 31.189 & 9 & 495 & 550 & 1045 & 48.447\\\\\n", "\t26 & Montana & 5.692 & 16.3 & 28.785 & 21 & 473 & 536 & 1009 & 119.532\\\\\n", "\t27 & Nebraska & 5.935 & 14.5 & 30.922 & 9 & 494 & 556 & 1050 & 53.415\\\\\n", "\t28 & Nevada & 5.16 & 18.7 & 34.836 & 30 & 434 & 483 & 917 & 154.8\\\\\n", "\t29 & New Hampshire & 5.859 & 15.6 & 34.72 & 70 & 444 & 491 & 935 & 410.13\\\\\n", "\t30 & New Jersey & 9.774 & 13.8 & 46.087 & 70 & 420 & 478 & 898 & 684.18\\\\\n", "\t31 & New Mexico & 4.586 & 17.2 & 28.493 & 11 & 485 & 530 & 1015 & 50.446\\\\\n", "\t32 & New York & 9.623 & 15.2 & 47.612 & 74 & 419 & 473 & 892 & 712.102\\\\\n", "\t33 & North Carolina & 5.077 & 16.2 & 30.793 & 60 & 411 & 454 & 865 & 304.62\\\\\n", "\t34 & North Dakota & 4.775 & 15.3 & 26.327 & 5 & 515 & 592 & 1107 & 23.875\\\\\n", "\t35 & Ohio & 6.162 & 16.6 & 36.802 & 23 & 460 & 515 & 975 & 141.726\\\\\n", "\t36 & Oklahoma & 4.845 & 15.5 & 28.172 & 9 & 491 & 536 & 1027 & 43.605\\\\\n", "\t37 & Oregon & 6.436 & 19.9 & 38.555 & 51 & 448 & 499 & 947 & 328.236\\\\\n", "\t38 & Pennsylvania & 7.109 & 17.1 & 44.51 & 70 & 419 & 461 & 880 & 497.63\\\\\n", "\t39 & Rhode Island & 7.469 & 14.7 & 40.729 & 70 & 425 & 463 & 888 & 522.83\\\\\n", "\t40 & South Carolina & 4.797 & 16.4 & 30.279 & 58 & 401 & 443 & 844 & 278.226\\\\\n", "\t41 & South Dakota & 4.775 & 14.4 & 25.994 & 5 & 505 & 563 & 1068 & 23.875\\\\\n", "\t42 & Tennessee & 4.388 & 18.6 & 32.477 & 12 & 497 & 543 & 1040 & 52.656\\\\\n", "\t43 & Texas & 5.222 & 15.7 & 31.223 & 47 & 419 & 474 & 893 & 245.434\\\\\n", "\t44 & Utah & 3.656 & 24.3 & 29.082 & 4 & 513 & 563 & 1076 & 14.624\\\\\n", "\t45 & Vermont & 6.75 & 13.8 & 35.406 & 68 & 429 & 472 & 901 & 459\\\\\n", "\t46 & Virginia & 5.327 & 14.6 & 33.987 & 65 & 428 & 468 & 896 & 346.255\\\\\n", "\t47 & Washington & 5.906 & 20.2 & 36.151 & 48 & 443 & 494 & 937 & 283.488\\\\\n", "\t48 & West 
Virginia & 6.107 & 14.8 & 31.944 & 17 & 448 & 484 & 932 & 103.819\\\\\n", "\t49 & Wisconsin & 6.93 & 15.9 & 37.746 & 9 & 501 & 572 & 1073 & 62.37\\\\\n", "\t50 & Wyoming & 6.16 & 14.9 & 31.285 & 10 & 476 & 525 & 1001 & 61.6\\\\\n", "\\end{tabular}\n" ], "text/plain": [ " State Spend StuTeaRat Salary PrcntTake SATV SATM SATT SpendXPrcnt\n", "1 Alabama 4.405 17.2 31.144 8 491 538 1029 35.240\n", "2 Alaska 8.963 17.6 47.951 47 445 489 934 421.261\n", "3 Arizona 4.778 19.3 32.175 27 448 496 944 129.006\n", "4 Arkansas 4.459 17.1 28.934 6 482 523 1005 26.754\n", "5 California 4.992 24.0 41.078 45 417 485 902 224.640\n", "6 Colorado 5.443 18.4 34.571 29 462 518 980 157.847\n", "7 Connecticut 8.817 14.4 50.045 81 431 477 908 714.177\n", "8 Delaware 7.030 16.6 39.076 68 429 468 897 478.040\n", "9 Florida 5.718 19.1 32.588 48 420 469 889 274.464\n", "10 Georgia 5.193 16.3 32.291 65 406 448 854 337.545\n", "11 Hawaii 6.078 17.9 38.518 57 407 482 889 346.446\n", "12 Idaho 4.210 19.1 29.783 15 468 511 979 63.150\n", "13 Illinois 6.136 17.3 39.431 13 488 560 1048 79.768\n", "14 Indiana 5.826 17.5 36.785 58 415 467 882 337.908\n", "15 Iowa 5.483 15.8 31.511 5 516 583 1099 27.415\n", "16 Kansas 5.817 15.1 34.652 9 503 557 1060 52.353\n", "17 Kentucky 5.217 17.0 32.257 11 477 522 999 57.387\n", "18 Louisiana 4.761 16.8 26.461 9 486 535 1021 42.849\n", "19 Maine 6.428 13.8 31.972 68 427 469 896 437.104\n", "20 Maryland 7.245 17.0 40.661 64 430 479 909 463.680\n", "21 Massachusetts 7.287 14.8 40.795 80 430 477 907 582.960\n", "22 Michigan 6.994 20.1 41.895 11 484 549 1033 76.934\n", "23 Minnesota 6.000 17.5 35.948 9 506 579 1085 54.000\n", "24 Mississippi 4.080 17.5 26.818 4 496 540 1036 16.320\n", "25 Missouri 5.383 15.5 31.189 9 495 550 1045 48.447\n", "26 Montana 5.692 16.3 28.785 21 473 536 1009 119.532\n", "27 Nebraska 5.935 14.5 30.922 9 494 556 1050 53.415\n", "28 Nevada 5.160 18.7 34.836 30 434 483 917 154.800\n", "29 New Hampshire 5.859 15.6 34.720 70 444 491 935 
410.130\n", "30 New Jersey 9.774 13.8 46.087 70 420 478 898 684.180\n", "31 New Mexico 4.586 17.2 28.493 11 485 530 1015 50.446\n", "32 New York 9.623 15.2 47.612 74 419 473 892 712.102\n", "33 North Carolina 5.077 16.2 30.793 60 411 454 865 304.620\n", "34 North Dakota 4.775 15.3 26.327 5 515 592 1107 23.875\n", "35 Ohio 6.162 16.6 36.802 23 460 515 975 141.726\n", "36 Oklahoma 4.845 15.5 28.172 9 491 536 1027 43.605\n", "37 Oregon 6.436 19.9 38.555 51 448 499 947 328.236\n", "38 Pennsylvania 7.109 17.1 44.510 70 419 461 880 497.630\n", "39 Rhode Island 7.469 14.7 40.729 70 425 463 888 522.830\n", "40 South Carolina 4.797 16.4 30.279 58 401 443 844 278.226\n", "41 South Dakota 4.775 14.4 25.994 5 505 563 1068 23.875\n", "42 Tennessee 4.388 18.6 32.477 12 497 543 1040 52.656\n", "43 Texas 5.222 15.7 31.223 47 419 474 893 245.434\n", "44 Utah 3.656 24.3 29.082 4 513 563 1076 14.624\n", "45 Vermont 6.750 13.8 35.406 68 429 472 901 459.000\n", "46 Virginia 5.327 14.6 33.987 65 428 468 896 346.255\n", "47 Washington 5.906 20.2 36.151 48 443 494 937 283.488\n", "48 West Virginia 6.107 14.8 31.944 17 448 484 932 103.819\n", "49 Wisconsin 6.930 15.9 37.746 9 501 572 1073 62.370\n", "50 Wyoming 6.160 14.9 31.285 10 476 525 1001 61.600" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "myData" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "fileNameRoot = \"Guber1999data-interaction-Jags-\" \n", "numSavedSteps=15000 ; thinSteps=5" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "graphFileType = \"png\" " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "*********************************************************************\n", "Kruschke, J. K. (2015). 
Doing Bayesian Data Analysis, Second Edition:\n", "A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.\n", "*********************************************************************\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Loading required package: coda\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the relevant model into R's working memory:\n", "source(\"Jags-Ymet-XmetMulti-Mrobust.R\")" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "CORRELATION MATRIX OF PREDICTORS:\n", " Spend PrcntTake SpendXPrcnt\n", "Spend 1.000 0.593 0.775\n", "PrcntTake 0.593 1.000 0.951\n", "SpendXPrcnt 0.775 0.951 1.000\n", "\n", "Calling 3 simulations using the parallel method...\n", "Following the progress of chain 1 (the program will wait for all chains\n", "to finish before continuing):\n", "Welcome to JAGS 3.4.0 on Wed Oct 7 02:03:51 2015\n", "JAGS is free software and comes with ABSOLUTELY NO WARRANTY\n", "Loading module: basemod: ok\n", "Loading module: bugs: ok\n", ". . Reading data file data.txt\n", ". Compiling data graph\n", " Resolving undeclared variables\n", " Allocating nodes\n", " Initializing\n", " Reading data back into data table\n", "Compiling model graph\n", " Resolving undeclared variables\n", " Allocating nodes\n", " Graph Size: 638\n", ". Reading parameter file inits1.txt\n", ". Initializing model\n", ". Adapting 500\n", "-------------------------------------------------| 500\n", "++++++++++++++++++++++++++++++++++++++++++++++++++ 100%\n", "Adaptation successful\n", ". Updating 1000\n", "-------------------------------------------------| 1000\n", "************************************************** 100%\n", ". . . . . . . . 
Updating 25000\n", "-------------------------------------------------| 25000\n", "************************************************** 100%\n", ". . . . Updating 0\n", ". Deleting model\n", "All chains have finished\n", "Simulation complete. Reading coda files...\n", "Coda files loaded successfully\n", "Finished running the simulation\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Generate the MCMC chain:\n", "#startTime = proc.time()\n", "mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName , \n", " numSavedSteps=numSavedSteps , thinSteps=thinSteps , \n", " saveName=fileNameRoot )" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Display diagnostics of chain, for specified parameters:\n", "parameterNames = varnames(mcmcCoda) # get all parameter names\n", "for ( parName in parameterNames ) {\n", " diagMCMC( codaObject=mcmcCoda , parName=parName , \n", " saveName=fileNameRoot , saveType=graphFileType )\n", "}" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Close the plot windows that pop up.\n", "graphics.off()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mean Median Mode ESS HDImass HDIlow\n", "CHAIN 2.000000e+00 2.00000000 1.997413e+00 1.5 0.95 1.0000000\n", "beta0 1.046328e+03 1046.55500000 1.045477e+03 1192.8 0.95 964.1770000\n", "beta[1] 2.706973e+00 2.64031500 2.081181e+00 1215.5 0.95 -13.2015000\n", "beta[2] -4.064498e+00 -4.06793500 -4.089410e+00 898.6 0.95 -5.6597000\n", "beta[3] 2.037744e-01 0.20437850 2.117125e-01 813.3 0.95 -0.0561615\n", "sigma 3.104436e+01 30.81855000 3.040303e+01 13947.4 0.95 23.9590000\n", "zbeta0 -1.564266e-03 -0.00124368 3.764121e-03 15000.0 0.95 
-0.1236310\n", "zbeta[1] 4.930570e-02 0.04809150 3.790802e-02 1215.5 0.95 -0.2404560\n", "zbeta[2] -1.453822e+00 -1.45505000 -1.462733e+00 898.6 0.95 -2.0244100\n", "zbeta[3] 5.615321e-01 0.56319650 5.834098e-01 813.3 0.95 -0.1547620\n", "zsigma 4.149175e-01 0.41189950 4.063463e-01 13947.4 0.95 0.3202200\n", "nu 3.490538e+01 26.42730000 1.158489e+01 10637.7 0.95 2.1464200\n", "log10(nu) 1.400149e+00 1.42205278 1.457488e+00 12121.0 0.95 0.7047432\n", " HDIhigh CompVal PcntGtCompVal ROPElow ROPEhigh PcntLtROPE\n", "CHAIN 3.000000 NA NA NA NA NA\n", "beta0 1129.180000 NA NA NA NA NA\n", "beta[1] 17.740100 NA NA NA NA NA\n", "beta[2] -2.461580 NA NA NA NA NA\n", "beta[3] 0.468960 NA NA NA NA NA\n", "sigma 38.820800 NA NA NA NA NA\n", "zbeta0 0.120886 NA NA NA NA NA\n", "zbeta[1] 0.323124 NA NA NA NA NA\n", "zbeta[2] -0.880479 NA NA NA NA NA\n", "zbeta[3] 1.292290 NA NA NA NA NA\n", "zsigma 0.518852 NA NA NA NA NA\n", "nu 94.077300 NA NA NA NA NA\n", "log10(nu) 2.096834 NA NA NA NA NA\n", " PcntInROPE PcntGtROPE\n", "CHAIN NA NA\n", "beta0 NA NA\n", "beta[1] NA NA\n", "beta[2] NA NA\n", "beta[3] NA NA\n", "sigma NA NA\n", "zbeta0 NA NA\n", "zbeta[1] NA NA\n", "zbeta[2] NA NA\n", "zbeta[3] NA NA\n", "zsigma NA NA\n", "nu NA NA\n", "log10(nu) NA NA\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Get summary statistics of chain:\n", "summaryInfo = smryMCMC( mcmcCoda , \n", " saveName=fileNameRoot )\n", "show(summaryInfo)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Display posterior information:\n", "plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName , \n", " pairsPlot=TRUE , showCurve=FALSE ,\n", " saveName=fileNameRoot , saveType=graphFileType )" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Close the plot windows that pop up.\n", "graphics.off()" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When the analysis is run, the first thing it does is display the correlations of the predictors:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can see that the interaction variable is strongly correlated with both predictors. \n", "* Therefore, we know that \n", " - there will be strong trade-offs among the regression coefficients, \n", " - and the marginal distributions of single regression coefficients might be much wider than when there was no interaction included." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The marginal distribution for β3, also labeled as SpendXPrcnt, indicates that the modal value of the interaction coefficient is indeed positive, as we anticipated it could be. However, the 95% HDI includes zero, which indicates that we do not have very strong precision in the estimate of the magnitude of the interaction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Notice that the inclusion of the interaction term has changed the apparent marginal distributions for the regression coefficients on Spend and PrcntTake.\n", " - In particular, the marginal distribution of the regression coefficient on Spend now clearly includes zero.\n", " - This might lead a person, inappropriately, to conclude that there is not a credible influence of spending on SAT scores, because zero is among the credible values of β1.\n", " - This conclusion is not appropriate because β1 only indicates the slope on spending when the percentage of students taking the test is zero.\n", " - The slope on Spend depends on the value of PrcntTake because of the interaction."
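Because of the interaction, the slope on Spend at a given PrcntTake is β1 + β3 · PrcntTake, evaluated draw by draw through the chain. A stdlib-only Python sketch of that computation; the posterior draws here are hypothetical stand-ins, not samples from the actual mcmcCoda run:

```python
import random

random.seed(0)

# Hypothetical posterior draws for beta1 (Spend) and beta3 (SpendXPrcnt);
# in the real analysis these come from the MCMC chain (mcmcCoda).
beta1 = [random.gauss(2.7, 7.9) for _ in range(5000)]
beta3 = [random.gauss(0.20, 0.13) for _ in range(5000)]

def hdi(samples, mass=0.95):
    """Narrowest interval containing `mass` of the samples."""
    s = sorted(samples)
    n_in = round(mass * len(s))
    lo_i = min(range(len(s) - n_in + 1), key=lambda i: s[i + n_in - 1] - s[i])
    return s[lo_i], s[lo_i + n_in - 1]

# Slope on Spend at one candidate value of PrcntTake, one value per chain step:
prcnt_take = 60.0
slopes = sorted(b1 + b3 * prcnt_take for b1, b3 in zip(beta1, beta3))
median = slopes[len(slopes) // 2]
lo, hi = hdi(slopes)
print(median, lo, hi)
```

Repeating this for many candidate values of PrcntTake traces out median-and-HDI curves of the kind plotted in the middle panel of Figure 18.9.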
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To properly understand the credible slopes on the two predictors, we must consider the credible slopes on each predictor as a function of the value of the other predictor.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Recall from Equations 18.3 that the slope on x1 is β1 + β1×2 · x2.\n", "* Thus, for the present application, the slope on Spend is β1 + β3 · PrcntTake\n", " - because β1×2 is β3 and x2 is PrcntTake.\n", "* Thus, for any particular value of PrcntTake, we get the distribution of credible slopes on Spend by stepping through the MCMC chain and computing β1 + β3 · PrcntTake at each step." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We can summarize the distribution of slopes by its median and 95% HDI. \n", "* We do that for many candidate values of PrcntTake, and the result is plotted in the middle panel of Figure 18.9. \n", " - You can see that when PrcntTake is large, the credible slopes on Spend clearly exceed zero.\n", " - You can also mentally extrapolate that when PrcntTake is zero, the median and HDI will match the marginal distribution of β1 shown at the top of Figure 18.9." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In summary, \n", "* when there is interaction, the influence of the individual predictors cannot be summarized by their individual regression coefficients alone, \n", "* because those coefficients only describe the influence when the other variables are at zero. \n", "* Notice that this is true even though the interaction coefficient did not exclude zero from its 95% HDI. \n", "* In other words, if you include an interaction term, you cannot ignore it even if its marginal posterior distribution includes zero.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 18.3. 
Shrinkage of Regression Coefficients" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With so many candidate predictors of noisy data, there may be some regression coefficients that are spuriously estimated to be non-zero. \n", "* We would like the description to de-emphasize weak or spurious predictors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### t-distribution prior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* One way to implement such a description is by using a t distribution for the prior on the regression coefficients.\n", " - By setting \n", " - its mean to zero, \n", " - its normality parameter to a small value, \n", " - and its scale parameter to a moderately small value, \n", " - the t-distributed prior dictates that regression coefficients should probably be near zero, \n", " - where the narrow peak of the t distribution is. \n", " - But if a regression coefficient is clearly nonzero, \n", " - then it could be large, \n", " - as is allowed by the heavy tail of the t-distributed prior." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The empty braces at the top of Figure 18.10 refer to optional aspects of the model.\n", " - As was mentioned in the previous paragraph, the normality parameter νβ and the scale parameter σβ could be set to constants, \n", " - in which case the braces at the top of the diagram would enclose constants and the arrows would be labeled with an equal sign.\n", " - When the prior is fixed at constants, it is sometimes called a regularizer for the estimation."
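The "narrow peak, heavy tails" shape that makes the t prior shrink weak coefficients while tolerating large ones can be checked numerically. A stdlib-only Python sketch; the particular nu and scale values below are illustrative, not the ones estimated in the model:

```python
import math

def normal_pdf(x, sd=1.0):
    """Normal density with mean zero."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def t_pdf(x, nu=1.0, scale=1.0):
    """Student-t density with normality parameter nu and a scale parameter."""
    z = x / scale
    c = math.gamma((nu + 1.0) / 2.0) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2.0))
    return c * (1.0 + z * z / nu) ** (-(nu + 1.0) / 2.0) / scale

# A t prior with small nu and a moderately small scale is taller near zero
# than a unit normal, yet far heavier in the tails:
print(t_pdf(0.0, nu=1.0, scale=0.5), normal_pdf(0.0))  # peak comparison
print(t_pdf(4.0, nu=1.0, scale=0.5), normal_pdf(4.0))  # tail comparison
```

So draws near zero are pulled in hard, while a coefficient sitting four standard units out is barely penalized relative to a normal prior.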
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### double exponential distribution prior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Alternatively, a double exponential distribution could be used.\n", "* A double exponential is simply an exponential distribution on both +β and −β, equally spread.\n", "* The double exponential has a single scale parameter (and no shape parameter). \n", "* The double exponential is built into JAGS and Stan. \n", "* A well-known regularization method called lasso regression uses a double exponential weighting on the regression coefficients.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Should the scale parameter (i.e., σβ in Figure 18.10) be fixed at a constant value or should it be estimated from the data?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* If it is fixed, \n", " - then every regression coefficient experiences the same fixed regularization, \n", " - independently of all the other regression coefficients.\n", "* If the scale parameter is estimated, \n", " - then the variability of the estimated regression coefficients across predictors influences the estimate of the scale parameter, \n", " - which in turn influences all the regression coefficients.\n", " - If the model estimates σβ (instead of fixing it), the model is assuming that all the regression coefficients are mutually representative of the variability across regression coefficients." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To illustrate these ideas with a concrete example, consider again the SAT data of Figure 18.3, but now supplemented with 12 more randomly generated predictors."
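A stdlib-only Python sketch of how such random predictors can be generated (hypothetical data, not the actual values used in the book's example): with independent draws, any sample correlation between these columns and the outcome is purely an accident of sampling.

```python
import math
import random

random.seed(42)
n_rows, n_rand = 50, 12  # 50 states, 12 random predictors

# Columns of independent standard-normal noise, to be appended as predictors:
x_rand = [[random.gauss(0.0, 1.0) for _ in range(n_rows)] for _ in range(n_rand)]
y = [random.gauss(0.0, 1.0) for _ in range(n_rows)]  # hypothetical outcome

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

cors = [corr(col, y) for col in x_rand]
print([round(c, 3) for c in cors])  # small, but nonzero by accident
```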
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The x values were randomly and independently drawn from a normal distribution centered at zero, and therefore, any correlation between the predictors and any nonzero regression coefficient is an accident of random sampling." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We will first apply the simple model of Figure 18.4, for which the regression coefficients have fixed, independent,\n", "vague normal priors. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting posterior distribution is shown in Figure 18.11.\n", "* To save space, the results for random predictors 4-9 (xRand4–xRand9) have not been displayed.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Attend specifically to the distributions for the regression coefficients on spending (Spend) and random predictor 10 (xRand10).\n", " - The coefficient on Spend is still estimated to be positive and its 95% HDI falls above zero, but only barely.\n", " - The coefficient on xRand10 is negative, with its 95% HDI falling below zero, to about\n", "the same extent as Spend fell above zero.\n", " - This apparent relation of xRand10 with SAT score is spurious, an accident of random sampling. \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we repeat the analysis using the hierarchical model of Figure 18.10, with νβ fixed at one (i.e., heavy tails), and with σβ given a gamma prior that has mode 1.0 and standard deviation 1.0 (i.e., broad for the standardized data)." 
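The gamma prior's shape and rate for a given mode and standard deviation follow from mode = (shape − 1)/rate and SD = sqrt(shape)/rate; this is the reparameterization that the book's utility function gammaShRaFromModeSD implements. A Python sketch:

```python
import math

def gamma_sh_ra_from_mode_sd(mode, sd):
    """Shape and rate of the gamma distribution with the given mode and SD."""
    rate = (mode + math.sqrt(mode ** 2 + 4.0 * sd ** 2)) / (2.0 * sd ** 2)
    shape = 1.0 + mode * rate
    return shape, rate

sh, ra = gamma_sh_ra_from_mode_sd(1.0, 1.0)  # the broad prior used for sigma_beta
# Sanity check: recover the requested mode and SD from shape and rate.
print(sh, ra, (sh - 1.0) / ra, math.sqrt(sh) / ra)
```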
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Jags-Ymet-XmetMulti-MrobustShrink-Example.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The resulting posterior distribution is shown in Figure 18.12." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Notice that the marginal distribution for the regression coefficient on xRand10 is now shifted so that its 95% HDI covers zero.\n", " - The estimate is shrunken toward zero because many predictors are telling the higher-level t distribution that their\n", "regression coefficients are near zero.\n", " - Indeed, the estimate of σβ (not displayed) has its posterior mode around 0.05, even though its prior mode was at 1.0.\n", " - The shrinkage also causes the estimate of the regression coefficient on Spend to shift toward zero,\n", " - but it has also suppressed what might be a real but small regression coefficient on Spend.\n", "* Notice, however, that the marginal distribution for the coefficient on PrcntTake has not been much affected by shrinkage, presumably because it is big enough that it falls in the tail of the t distribution where the prior is relatively flat.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The shrinkage is desirable not only because it shares information across predictors (as expressed by the hierarchical prior) but also because it rationally helps control for false alarms in declaring that a predictor has a nonzero regression coefficient." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Finally, notice in Figure 18.12 that the marginal posterior distributions on many of the regression coefficients are (upside-down) funnel shaped, each with a pointy peak near zero and long concave tails." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* A funnel shape is characteristic of a posterior distribution experiencing strong shrinkage.\n", " - If a marginal posterior distribution is displayed only by a dot at its central tendency and a segment for its HDI, without a profile for its shape, then this signature of shrinkage is missed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 18.4. Variable Selection" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* 18.4.1 Inclusion probability is strongly affected by vagueness of prior\n", "* 18.4.2 Variable selection with hierarchical shrinkage\n", "* 18.4.3 What to report and what to conclude\n", "* 18.4.4 Caution: Computational methods\n", "* 18.4.5 Caution: Interaction variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Deciding which predictors to include is often called variable selection." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The motivation of the previous section was \n", " - an assumption that many predictors might have weak predictive value relative to the noise in the data, \n", " - and therefore, shrinkage would be appropriate to stabilize the estimates.\n", "* In some applications, it may be theoretically meaningful to suppose that a predictor has literally zero predictive value. \n", " - In this case, the issue is not estimating a presumed weak predictiveness relative to noise;\n", " - instead, the issue is deciding whether there is any predictiveness at all." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We are entertaining a situation in which there are many candidate predictors that may genuinely have zero real relation to the predicted value or\n", "have relations small enough to be counted as insignificant. 
\n", "* In this situation, a reasonable question is, which predictors can be credibly included in the descriptive model?\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This section introduces some basic ideas and methods of Bayesian variable selection,\n", "using some simple illustrative examples. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The topic is extensively studied and undergoing rapid development in the literature. \n", "* The examples presented here are intended to reveal some of the foundational concepts and methods, not to serve as a comprehensive\n", "reference for the latest and greatest methods." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The key to models of variable selection (as opposed to shrinkage) \n", "\n", "* is that each predictor has both a regression coefficient and an inclusion indicator, which can be thought of as simply another coefficient that takes on the values 0 or 1.\n", " - When the inclusion indicator is 1, \n", " - then the regression coefficient has its usual role. \n", " - When the inclusion indicator is 0, \n", " - the predictor has no role in the model and the regression coefficient is superfluous." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For example, \n", " - if there are four predictors, then \n", " - <δ1 , δ2 , δ3 , δ4> = [1, 1, 1, 1] is a model that includes all four predictors, \n", " - and <δ1 , δ2 , δ3 , δ4> = [0, 1, 0, 1] is a model that includes only the second and fourth predictors, and so on. \n", " - With four predictors, there are 2^4 = 16 possible models." 
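The enumeration of candidate models as inclusion-indicator vectors can be sketched in a couple of lines of Python:

```python
from itertools import product

def candidate_models(n_predictors):
    """All inclusion-indicator vectors (delta_1, ..., delta_Nx)."""
    return list(product([0, 1], repeat=n_predictors))

models = candidate_models(4)
print(len(models))            # 2**4 = 16 models
print(models[0], models[-1])  # from no predictors included to all four
```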
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* A simple way to put a prior on the inclusion indicator is to have each indicator come\n", "from an independent Bernoulli prior, such as" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### JAGS code" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " for ( i in 1:Ntotal ) {\n", " zy[i] ~ dt( zbeta0 + sum( zbeta[1:Nx] * zx[i,1:Nx] ) , 1/zsigma^2 , nu )\n", " }\n", " # Priors vague on standardized scale:\n", " zbeta0 ~ dnorm( 0 , 1/2^2 )\n", " for ( j in 1:Nx ) {\n", " zbeta[j] ~ dnorm( 0 , 1/2^2 )\n", " }\n", " zsigma ~ dunif( 1.0E-5 , 1.0E+1 )\n", " nu <- nuMinusOne+1\n", " nuMinusOne ~ dexp(1/29.0)\n", " # Transform to original scale:\n", " beta[1:Nx] <- ( zbeta[1:Nx] / xsd[1:Nx] )*ysd\n", " beta0 <- zbeta0*ysd + ym - sum( zbeta[1:Nx] * xm[1:Nx] / xsd[1:Nx] )*ysd\n", " sigma <- zsigma*ysd\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Nx : number of predictors.\n", "model {\n", " for ( i in 1:Ntotal ) {\n", " zy[i] ~ dt( zbeta0 + sum( delta[1:Nx] * zbeta[1:Nx] * zx[i,1:Nx] ) ,\n", " 1/zsigma^2 , nu )\n", " }\n", " # Priors vague on standardized scale:\n", " zbeta0 ~ dnorm( 0 , 1/2^2 )\n", " for ( j in 1:Nx ) {\n", " zbeta[j] ~ dnorm( 0 , 1/2^2 )\n", " delta[j] ~ dbern( 0.5 )\n", " }\n", " zsigma ~ dunif( 1.0E-5 , 1.0E+1 )\n", " nu <- nuMinusOne+1\n", " nuMinusOne ~ dexp(1/29.0)\n", " # Transform to original scale:\n", " beta[1:Nx] <- ( delta[1:Nx] * zbeta[1:Nx] / xsd[1:Nx] )*ysd\n", " beta0 <- zbeta0*ysd + ym - sum( delta[1:Nx] * zbeta[1:Nx] * xm[1:Nx]\n", " / xsd[1:Nx] )*ysd\n", 
" sigma <- zsigma*ysd\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a first example of applying the variable-selection method, recall the SAT data of\n", "Figure 18.3 and the posterior distribution shown in Figure 18.5." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* For each of 50 states, the average SAT score was regressed on two predictors: \n", " - average spending per pupil (Spend)\n", " - and percentage of students who took the test (PrcntTake). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With two predictors, there are four possible models involving different subsets of predictors." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Because the prior inclusion probability was set at 0.5, each model had a prior probability of 0.5^2 = 0.25." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure 18.13 shows the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Of the four possible models, \n", " - only two had a non-negligible posterior probability, namely the model that \n", " - included both predictors and\n", " - the model that included only PrcntTake.\n", "* As shown in the upper panel of Figure 18.13,\n", " - the model with both predictors has a posterior probability of about 70%.\n", " - This value is simply the number of times that the MCMC chain had <δ1 , δ2> = [1,1] divided by the total number of steps in the chain.\n", "* As shown in the lower panel of Figure 18.13, \n", " - the model with only PrcntTake has a posterior probability of about 30%.\n", " - This value is simply the number of times that the MCMC chain had <δ1 , δ2> = [0, 1] divided by the total number of steps in the chain.\n", "* Thus, the model involving both predictors is more than twice as credible as the model involving only one predictor.\n", 
"* Notice that the parameter estimates are different for different models.\n", " - For example, the estimate of the intercept is quite different for different included predictors.\n", "\n", "\n", "\n", " \n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.4.1 Inclusion probability is strongly affected by vagueness of prior" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will now see that the degree of vagueness of the prior on the regression coefficient\n", "can have an enormous influence on the inclusion probability, even though the degree\n", "of vagueness has little influence on the estimate of the regression coefficient itself." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that the prior in the model was specified as a generic broad distribution on the standardized regression coefficients, like this:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " ...\n", " # Priors vague on standardized scale:\n", " zbeta0 ̃ dnorm( 0 , 1/2ˆ2 ) # SD=2\n", " for ( j in 1:Nx ) {\n", " zbeta[j] ̃ dnorm( 0 , 1/2ˆ2 ) # SD=2\n", " delta[j] ̃ dbern( 0.5 )\n", " }\n", " ...\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The choice of SD=2 was arbitrary but reasonable because standardized regression coefficients cannot exceed ±1 in least-squares regression. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now re-run the analysis using different arbitrary degrees of vagueness on the priors for the standardized regression coefficients.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will illustrate with SD=1, like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " ...\n", " # Priors vague on standardized scale:\n", " zbeta0 ̃ dnorm( 0 , 1/1ˆ2 ) # SD=1\n", " for ( j in 1:Nx ) {\n", " zbeta[j] ̃ dnorm( 0 , 1/1ˆ2 ) # SD=1\n", " delta[j] ̃ dbern( 0.5 )\n", " }\n", " ...\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "and with SD=10, like this:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " ...\n", " # Priors vague on standardized scale:\n", " zbeta0 ̃ dnorm( 0 , 1/10ˆ2 ) # SD=10\n", " for ( j in 1:Nx ) {\n", " zbeta[j] ̃ dnorm( 0 , 1/10ˆ2 ) # SD=10\n", " delta[j] ̃ dbern( 0.5 )\n", " }\n", " ...\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Notice that the prior probability on the inclusion parameters has not changed. \n", "* The prior inclusion probability is always 0.5 in these examples." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure 18.14 shows the results." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The upper pair of panels shows the posterior probabilities when the prior on the standardized regression coefficients has SD=1.\n", " - You can see that there is a strong advantage for the two-predictor model, \n", " - with the posterior inclusion probability of Spend being about 0.82.\n", "* The lower pair of panels in Figure 18.14 shows the posterior probabilities when the prior on the standardized regression coefficients has SD=10.\n", " - You can see that there is now an advantage for the one-predictor model, \n", " - with the posterior inclusion probability of Spend being only about 0.36.\n", "* How could that be?\n", " - After all, a model with all predictors included must be able to fit the data at least as well as a model with only a subset of predictors excluded.\n", " - The reason for the lower probability of the more complex model is that each extra parameter dilutes the prior density on the pre-existing parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Consider the PrcntTake-only model. \n", " - In this model, \n", " - one of the most credible values for the regression coefficient on PrcntTake is β2 = −2.5. \n", " - The likelihood at that point involves p(D|β2 = −2.5) \n", " - and the prior density at the point involves p(β2 = −2.5), \n", " - with the posterior probability proportional to their product. \n", "* The very same likelihood can be achieved by the model that also includes Spend, \n", " - merely by setting the regression coefficient on Spend to zero:\n", " - p(D|β2 = −2.5) = p(D|β2 = −2.5, β1 = 0). 
\n", " - But the prior density at that point, p(β2 = −2.5, β1 = 0) = p(β2 = −2.5) p(β1 = 0), \n", " - will typically be less than p(β2 = −2.5), \n", " - because the prior is p(β2 = −2.5) multiplied by a probability density that is almost certainly less than one.\n", "* Thus, models that include more predictors will pay the cost of lower prior probability." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* On the other hand, the change in vagueness of the prior distribution has hardly affected the estimates of the regression coefficients at all.\n", " - Figure 18.14 shows that the estimate of the regression coefficient on Spend has a 95% HDI from about 4 to 21 for both prior distributions, regardless of whether its inclusion probability is low or high.\n", "* From these results, do we conclude that Spend should be included or not?\n", " - For me, the robustness of the explicit estimate of the regression coefficient, showing that it is non-zero, trumps the model comparison." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Bayesian model comparison can be strongly affected by the degree of vagueness in the priors, even though explicit estimates of the parameter\n", "values may be minimally affected." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.4.2 Variable selection with hierarchical shrinkage" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you have strong previous research that can inform the prior, then it should be used. But if previous knowledge is weak, then the uncertainty should be expressed in the prior.\n", "* This is an underlying mantra of the Bayesian approach: Any uncertainty should be expressed in the prior. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Thus, if you are not sure what the value of σβ should be, you can estimate it and\n", "include a higher-level distribution to express your prior uncertainty. \n", "* In other words, in Figure 18.10 (p. 
531), we estimate σβ and give it a prior distribution in place of the open\n", "braces." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the present application, we have uncertainty, and therefore, we place a broad prior\n", "on σβ ." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code below shows a few different options in commented lines." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In the code, σβ is denoted sigmaBeta. \n", " - One option sets sigmaBeta to a constant, \n", " - which produces the results reported in the previous section. \n", " - Another option puts a broad uniform prior on sigmaBeta. \n", " - A uniform prior is intuitively straight forward, but a uniform prior must\n", "always have some arbitrary bounds. \n", " - Therefore, the next option, \n", " - not commented out of the code below, \n", " - is a gamma distribution that has a mode at 1.0 \n", " - but is very broad with a standard deviation of 10.0 " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "model {\n", " ...\n", " # Priors vague on standardized scale:\n", " zbeta0 ̃ dnorm( 0 , 1/2ˆ2 )\n", " for ( j in 1:Nx ) {\n", " zbeta[j] ̃ dt( 0 , 1/sigmaBetaˆ2 , 1 ) # Notice sigmaBeta\n", " delta[j] ̃ dbern( 0.5 )\n", " }\n", " zsigma ̃ dunif( 1.0E-5 , 1.0E+1 )\n", " ## Uncomment one of the following specifications for sigmaBeta:\n", " # sigmaBeta <- 2.0\n", " # sigmaBeta ̃ dunif( 1.0E-5 , 1.0E+2 )\n", " sigmaBeta ̃ dgamma(1.1051,0.1051) # mode 1.0, sd 10.0\n", " # sigmaBeta <- 1/sqrt(tauBeta) ; tauBeta ̃ dgamma(0.001,0.001)\n", " ...\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Jags-Ymet-XmetMulti-MrobustVarSelect-Example.R" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When it is run, it produces results very similar to those in Figure 18.13 " ] }, { "cell_type": 
"markdown", "metadata": {}, "source": [ "##### 18.13 관련 식" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 18.13 관련 코드\n", "# Nx : number of predictors.\n", "model {\n", " for ( i in 1:Ntotal ) {\n", " zy[i] ̃ dt( zbeta0 + sum( delta[1:Nx] * zbeta[1:Nx] * zx[i,1:Nx] ) ,\n", " 1/zsigmaˆ2 , nu )\n", " }\n", " # Priors vague on standardized scale:\n", " zbeta0 ̃ dnorm( 0 , 1/2ˆ2 )\n", " for ( j in 1:Nx ) {\n", " zbeta[j] ̃ dnorm( 0 , 1/2ˆ2 )\n", " delta[j] ̃ dbern( 0.5 )\n", " }\n", " zsigma ̃ dunif( 1.0E-5 , 1.0E+1 )\n", " nu <- nuMinusOne+1\n", " nuMinusOne ̃ dexp(1/29.0)\n", " # Transform to original scale:\n", " beta[1:Nx] <- ( delta[1:Nx] * zbeta[1:Nx] / xsd[1:Nx] )*ysd\n", " beta0 <- zbeta0*ysd + ym - sum( delta[1:Nx] * zbeta[1:Nx] * xm[1:Nx]\n", " / xsd[1:Nx] )*ysd\n", " sigma <- zsigma*ysd\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Example for Jags-Ymet-XmetMulti-MrobustVarSelect.R " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 책에는 없는 코드. 
실습을 위해 data 폴더에 넣은 코드를 실행시키기 위한 작업\n", "cur_dir = getwd()\n", "setwd(sprintf(\"%s/%s\", cur_dir, 'data'))" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Load data file and specity column names of x (predictors) and y (predicted):\n", "myData = read.csv( file=\"Guber1999data.csv\" )" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# UNCOMMENT ONE OF THE FOLLOWING SECTIONS (In RStudio, select and ctrl-shift-C)\n", "#.............................................................................\n", "# Two predictors:\n", "yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\")\n", "fileNameRoot = \"Guber1999data-Jags-VarSelect-\" \n", "numSavedSteps=15000 ; thinSteps=25" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#.............................................................................\n", "# # Two predictors with redundant predictor:\n", "# PropNotTake = (100-myData[,\"PrcntTake\"])/100\n", "# myData = cbind( myData , PropNotTake )\n", "# yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\",\"PropNotTake\")\n", "# fileNameRoot = \"Guber1999data-Jags-Redund-VarSelect-\" \n", "# numSavedSteps=15000 ; thinSteps=30" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#.............................................................................\n", "# # Two predictors with two redundant predictors:\n", "# PropNotTake = (100-myData[,\"PrcntTake\"])/100\n", "# Partic = myData[,\"PrcntTake\"]/10 \n", "# myData = cbind( myData , PropNotTake , Partic )\n", "# yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\",\"PropNotTake\",\"Partic\")\n", "# fileNameRoot = \"Guber1999data-Jags-Redund2-VarSelect-\" \n", "# numSavedSteps=15000 
; thinSteps=15" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#.............................................................................\n", "# # Four predictors:\n", "# yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\",\"StuTeaRat\",\"Salary\")\n", "# fileNameRoot = \"Guber1999data-Jags-4X-VarSelect-\" \n", "# numSavedSteps=15000 ; thinSteps=20 " ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#.............................................................................\n", "# # Append columns of random predictors:\n", "# standardize = function(v){(v-mean(v))/sd(v)}\n", "# Ny=nrow(myData)\n", "# NxRand = 12\n", "# set.seed(47405)\n", "# for ( xIdx in 1:NxRand ) {\n", "# xRand = standardize(rnorm(Ny))\n", "# myData = cbind( myData , xRand )\n", "# colnames(myData)[ncol(myData)] = paste0(\"xRand\",xIdx)\n", "# }\n", "# yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\", paste0(\"xRand\",1:NxRand) )\n", "# fileNameRoot = \"Guber1999data-Jags-RandX-VarSelect-\" \n", "# numSavedSteps=15000 ; thinSteps=5" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#.............................................................................\n", "# # Two predictors with interaction.\n", "# # Append named column with interaction product:\n", "# myData = cbind( myData , SpendXPrcnt=myData[,\"Spend\"]*myData[,\"PrcntTake\"] )\n", "# yName = \"SATT\" ; xName = c(\"Spend\",\"PrcntTake\",\"SpendXPrcnt\")\n", "# fileNameRoot = \"Guber1999data-Jags-Inter-VarSelect-\" \n", "# numSavedSteps=15000 ; thinSteps=25" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#.............................................................................\n", "graphFileType = \"png\"" ] }, { "cell_type": "code", "execution_count": 14, 
"metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "*********************************************************************\n", "Kruschke, J. K. (2015). Doing Bayesian Data Analysis, Second Edition:\n", "A Tutorial with R, JAGS, and Stan. Academic Press / Elsevier.\n", "*********************************************************************\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "Installing packages into ‘/home/moosung/R/x86_64-pc-linux-gnu-library/3.2’\n", "(as ‘lib’ is unspecified)\n" ] }, { "ename": "ERROR", "evalue": "Error in contrib.url(repos, type): trying to use CRAN without setting a mirror\n", "output_type": "error", "traceback": [ "Error in contrib.url(repos, type): trying to use CRAN without setting a mirror\n" ] } ], "source": [ "#------------------------------------------------------------------------------- \n", "# Load the relevant model into R's working memory:\n", "source(\"Jags-Ymet-XmetMulti-MrobustVarSelect.R\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Generate the MCMC chain:\n", "#startTime = proc.time()\n", "mcmcCoda = genMCMC( data=myData , xName=xName , yName=yName , \n", " numSavedSteps=numSavedSteps , thinSteps=thinSteps , \n", " saveName=fileNameRoot )\n", "#stopTime = proc.time()\n", "#duration = stopTime - startTime\n", "#show(duration)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Display diagnostics of chain, for specified parameters:\n", "parameterNames = varnames(mcmcCoda) # get all parameter names\n", "for ( parName in parameterNames ) {\n", " diagMCMC( codaObject=mcmcCoda , parName=parName , \n", " saveName=fileNameRoot , 
saveType=graphFileType )\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 창으로 뜨는 것들을 없앤다.\n", "graphics.off()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "#------------------------------------------------------------------------------- \n", "# Get summary statistics of chain:\n", "summaryInfo = smryMCMC( mcmcCoda , \n", " saveName=fileNameRoot )\n", "show(summaryInfo)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Display posterior information:\n", "plotMCMC( mcmcCoda , data=myData , xName=xName , yName=yName , \n", " pairsPlot=TRUE , showCurve=FALSE ,\n", " saveName=fileNameRoot , saveType=graphFileType )" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# 창으로 뜨는 것들을 없앤다.\n", "graphics.off()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### As a small extension of the example, " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* it turns out that the SAT data from Guber (1999) had two additional predictors, \n", " - namely the average student-teacher ratio (StuTeaRat) in each state \n", " - and the average salary of the teachers (Salary).\n", "* These variables are also plausible predictors of SAT score. Should they be included?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we consider the correlations of the candidate predictors:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* Notice above that Salary is strongly correlated with Spend, and therefore, \n", " - a model that includes both Salary and Spend \n", " - will show \n", " - a strong trade-off between those predictors \n", " - and consequently will show inflated uncertainty in the regression coefficients for either\n", "one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Should only one or the other be included, or both, or neither?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The prior inclusion bias was 0.5 for each predictor, and therefore, \n", " - each of 2^4 = 16 models \n", " - had a prior probability of 0.5^4 = 0.0625. \n", "* The prior on the regression coefficients was hierarchical \n", " - with σβ having a gamma prior \n", " - with mode 1.0 \n", " - and standard deviation of 10.0, as explained in the previous paragraphs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Figure 18.15 shows the results." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The most probable model, \n", " - shown at the top of Figure 18.15, \n", " - includes only two predictors, namely Spend and PrcntTake, \n", " - and has a posterior probability of roughly 50%.\n", "* The runner up (in the second row of Figure 18.15)\n", " - has a posterior probability of about half that much and includes only PrcntTake.\n", "* The Salary predictor is included only in the third most probable model, \n", " - with a posterior probability of only about 8%." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.4.3 What to report and what to conclude" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* From the results of variable-selection analysis, such as in Figure 18.15, what should\n", "be reported and what should be concluded?\n", "* Which candidate predictors should be included in an explanatory model?\n", "* How should predictions of future data be made?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Unfortunately, there is no singular “correct” answer. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The analysis tells us the relative posterior credibilities of the models for our particular choice of prior.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "책의 설명을 더 읽어보자." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.4.4 Caution: Computational methods" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* The computer code that created the examples of this section is in the files Jags-Ymet-XmetMulti-MrobustVarSelect-Example.R and Jags-Ymet-XmetMulti-\n", "MrobustVarSelect.R. The code is meant primarily for pedagogical purposes, and it does not scale well for larger applications, for reasons explained presently.\n", "* There are a variety of approaches to Bayesian variable selection, and MCMC is just one. \n", " - MCMC is useful only when there are a modest number of predictors.\n", " - A useful MCMC chain will need ample opportunity to sample from all the models, which would require an impractically long chain even for moderately large numbers of predictors.\n", " - Even for a modest number of predictors, the MCMC chain can be badly autocorrelated in the model indices or inclusion indicators." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To conclude this section regarding variable selection, \n", " - it is appropriate to recapitulate the considerations at the beginning of the section. 
\n", " - Variable selection is a reasonable approach only if it is genuinely plausible and meaningful that candidate predictors have zero relation to the predicted variable. \n", " - The results can be surprisingly sensitive to the seemingly innocuous choice of prior for the regression coefficients, and, of course, the prior for the inclusion probability. \n", " - Because of these limitations, hierarchical shrinkage priors may be a more meaningful approach." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 18.4.5 Caution: Interaction variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 참고자료" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* [1] Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan - http://www.amazon.com/Doing-Bayesian-Analysis-Second-Edition/dp/0124058884" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.2.1" } }, "nbformat": 4, "nbformat_minor": 0 }