"\n",
"My name is Dmitry Smirnov; I am currently a PhD student in the Brain and Mind Laboratory, NBE Department, Aalto University, Finland. I mostly do cognitive neuroscience with fMRI, using various classification and regression models. You can find out more about me here:\n",
"\n",
"My LinkedIn\n",
"\n",
"My personal page\n",
"\n",
"## About project\n",
"\n",
"I have been reading about and experimenting with Bayesian methods for quite a while now, and in one conversation I was asked whether these methods could help answer a research question that had previously been addressed with frequentist statistics. I was eager to take up the challenge, partly to check how much I had really learned about this methodology.\n",
"\n",
"The data come from a previously published study of the influence of culture on the incidence of extreme judgments in moral dilemmas. Among other results, the original study showed that, in comparison with English-speaking subjects, Russian subjects avoided extreme judgments (Arutyunova et al., 2013). I was provided with the response tables for the Russian- and English-speaking samples and given permission to run Bayesian analyses on them.\n",
"\n",
"Reference: Arutyunova KR, Alexandrov YuI, Znakov VV, Hauser MD (2013) Moral Judgments in Russian Culture: Universality and Cultural Specificity. Journal of Cognition and Culture 13: 255–285."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Importing packages, loading data and preprocessing"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this notebook, I will use a hierarchical model. To give a quick overview: in a hierarchical model I consider not only each moral dilemma separately, but also allow the data from the various dilemmas to inform group-level parameter estimates, which in turn inform the parameters of the individual dilemmas.\n",
"In other words, each dilemma is treated as an instance of a common group-level decision-making process. So I model each dilemma separately, but tie the parameters of the dilemma-wise models to a set of group-level parameters."
]
},
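As a toy illustration of the shrinkage such a model produces (my own sketch, not part of the original analysis), consider a fixed Beta prior standing in for the learned group-level distribution: each dilemma's posterior-mean extremity rate is pulled from its raw proportion toward the prior mean, and the hierarchical model does the same thing while learning the prior from the data. All counts below are made up for illustration.

```python
import numpy as np

# Hypothetical counts of extreme answers per dilemma (k) out of n responses
k = np.array([2, 15, 28])
n = np.array([30, 30, 30])

raw = k / n                      # unpooled per-dilemma estimates

# Fixed Beta(a, b) prior standing in for the learned group-level distribution;
# its mean a / (a + b) plays the role of the group estimate
a, b = 8.0, 8.0                  # prior mean 0.5, moderate concentration
shrunk = (k + a) / (n + a + b)   # posterior means under a Beta-Binomial model

# Each shrunk estimate lies between its raw proportion and the prior mean 0.5
print(raw)
print(shrunk)
```

In the full hierarchical model the prior's mode and concentration are themselves estimated, so the amount of shrinkage is determined by how similar the dilemmas actually are.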
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I start by importing the packages I am going to use later in the code, and load the data for the Russian and English samples. Technically, dataRU and dataENG are just matrices of shape Participant × Dilemma.\n",
"\n",
"I also set up matplotlib so that it renders plots inside the notebook, and set seaborn's default grey grid style."
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import numpy as np\n",
"import pymc3\n",
"from sklearn.cross_validation import KFold  # deprecated; in newer scikit-learn this lives in sklearn.model_selection\n",
"import scipy.stats as sp"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"sns.set_style('darkgrid')"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"dataRU = np.array(pd.read_csv('http://www.becs.aalto.fi/~smirnod1/RusKarina.csv',header=None))\n",
"dataENG = np.array(pd.read_csv('http://www.becs.aalto.fi/~smirnod1/EngKarina.csv',header=None))"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(327, 30)\n",
"(332, 30)\n"
]
}
],
"source": [
"print(dataRU.shape)\n",
"print(dataENG.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now I transform the data so that extreme answers (7 and 1) are coded as '1', and all other answers as '0'. The final matrices are data1, which contains the encoded (1 = extreme, 0 = moderate) responses of 327 individuals over 30 moral dilemmas from the Russian-speaking sample, and data2, which contains the encoded responses of 332 individuals over 30 moral dilemmas from the English-speaking sample."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"data1 = np.zeros(np.shape(dataRU))\n",
"data2 = np.zeros(np.shape(dataENG))\n",
"data1[dataRU == 7] = 1\n",
"data1[dataRU == 1] = 1\n",
"data2[dataENG == 7] = 1\n",
"data2[dataENG == 1] = 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Some of the later analyses need a few parameters, so let's set them up first.\n",
"First, the model needs to know the number of stories used; S is an auxiliary variable for that.\n",
"I also need to keep the indices of the different stories; s_idx is for that.\n",
"\n",
"The Gamma priors used later need shape and rate parameters. I initialize them here; see the fuller explanation below."
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"S = data1.shape[1]  # number of stories (dilemmas)\n",
"s_idx = list(range(S))  # story indices\n",
"s = 0.01  # shape for Gamma prior\n",
"r = 0.01  # rate for Gamma prior\n",
"nsamp = 100000  # how many samples to draw with MCMC\n",
"njobs = 4  # how many jobs to run in parallel"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model specification"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"In Bayesian analysis, we need to state prior beliefs about how the model is parametrized. Here I follow the approach of John Kruschke (Kruschke, 2014: http://www.indiana.edu/~kruschke/DoingBayesianDataAnalysis/). \n",
"\n",
"\n",
"In this case, for each group I have a Bernoulli likelihood with parameter $\\theta_{g,s}$. Each $\\theta_{g,s}$ comes from a Beta distribution parametrized by mode $\\omega$ and concentration $\\kappa$ hyperparameters. $\\kappa$ gets a vague Gamma prior with shape = 0.01 and rate = 0.01, and $\\omega$ gets a uniform Beta(1,1) prior.\n",
"To recap: for each group I have a group mode $\\omega$ and a group concentration $\\kappa$, which inform the story-specific $\\theta_{g,s}$ and are in turn informed by them. In practice this produces a shrinkage effect: if a $\\theta_{g,s}$ is far from the group $\\omega$, it gets pulled towards the group estimate.\n",
"\n",
"For a more intuitive picture, we can think of the data as coming from two factories, each producing 30 coins. Each coin was flipped roughly 300 times (more precisely, 327 times in the case of the first factory and 332 in the case of the second). We assume that each factory produces coins whose bias is centred around some value, and that the coins coming from a single factory have biases close to that value.\n",
"\n",
"If I had some metric covariates, such as age or size of social circle, logistic regression would be a better choice, but here I have none, so I keep the model simple."
]
},
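To summarize the hierarchy in one place, for each group $g$, story $s$, and participant $i$ (using the same mode/concentration parametrization as the code below, where the Gamma prior sits on $\kappa_g - 2$):

```latex
\begin{aligned}
\omega_g &\sim \operatorname{Beta}(1,\ 1) \\
\kappa_g - 2 &\sim \operatorname{Gamma}(\text{shape}=0.01,\ \text{rate}=0.01) \\
\theta_{g,s} &\sim \operatorname{Beta}\!\bigl(\omega_g(\kappa_g - 2) + 1,\ (1-\omega_g)(\kappa_g - 2) + 1\bigr) \\
y_{g,i,s} &\sim \operatorname{Bernoulli}(\theta_{g,s})
\end{aligned}
```

The $+1$ terms come from converting the (mode, concentration) pair $(\omega, \kappa)$ into the standard Beta shape parameters, $a = \omega(\kappa - 2) + 1$ and $b = (1-\omega)(\kappa - 2) + 1$.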
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the code below, I use PyMC3 to run a Markov chain Monte Carlo analysis with a Metropolis-Hastings sampler. I package the model into a function, because later on I will have to run it repeatedly during posterior predictive checking."
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def TrainHierarchicalModel(data1, data2, s_idx, njobs=2, s=0.01, r=0.01, nsamp=10000):\n",
"    with pymc3.Model() as model:\n",
"        # group-level hyperparameters for each sample\n",
"        k_minus_two1 = pymc3.Gamma('k_minus_two1', alpha=s, beta=r)\n",
"        k1 = pymc3.Deterministic('k1', k_minus_two1 + 2)\n",
"        w1 = pymc3.Beta('w1', alpha=1, beta=1)\n",
"        k_minus_two2 = pymc3.Gamma('k_minus_two2', alpha=s, beta=r)\n",
"        k2 = pymc3.Deterministic('k2', k_minus_two2 + 2)\n",
"        w2 = pymc3.Beta('w2', alpha=1, beta=1)\n",
"        # story-specific thetas, tied to their group-level mode and concentration\n",
"        theta1 = pymc3.Beta('theta1', alpha=(w1 * k_minus_two1) + 1, beta=((1 - w1) * k_minus_two1) + 1, shape=len(s_idx))\n",
"        theta2 = pymc3.Beta('theta2', alpha=(w2 * k_minus_two2) + 1, beta=((1 - w2) * k_minus_two2) + 1, shape=len(s_idx))\n",
"        # likelihoods\n",
"        y1 = pymc3.Bernoulli('y1', p=theta1[s_idx], observed=data1)\n",
"        y2 = pymc3.Bernoulli('y2', p=theta2[s_idx], observed=data2)\n",
"\n",
"        # derived quantities of interest\n",
"        wdiff = pymc3.Deterministic('wdiff', w1 - w2)\n",
"        thetadiff = pymc3.Deterministic('thetadiff', theta1 - theta2)\n",
"        oddsratio = pymc3.Deterministic('oddsratio', (theta2 / (1 - theta2)) / (theta1 / (1 - theta1)))\n",
"\n",
"        step = pymc3.Metropolis()  # instantiate the MCMC sampling algorithm\n",
"        trace = pymc3.sample(nsamp // njobs, step, njobs=njobs, progressbar=False)  # integer number of draws per chain\n",
"    return model, trace"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Posterior predictive checking"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In principle, posterior predictive checking follows these steps (one brief reference: http://www.people.fas.harvard.edu/~plam/teaching/methods/modelcheck/modelcheck_print.pdf):\n",
"\n",
"1. Draw multiple candidate parameter values from the posterior.\n",
"2. Simulate datasets using these values.\n",
"3. Choose a statistic that reflects some discrepancy between the model and the real data.\n",
"4. Calculate this statistic over the simulated and real data.\n",
"5. Compare, and decide whether the model requires adjustment.\n",
"\n",
"Here I add cross-validation, so that all of these steps are embedded in the following procedure:\n",
"\n",
"1. Split the dataset into training and validation sets.\n",
"2. Simulate datasets from candidate values drawn from the model trained on the training set.\n",
"3. Select a statistic and calculate it over the simulated data and the real data from the validation set.\n",
"4. Check how this statistic differs between the real and simulated datasets."
]
},
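The five basic steps above can be sketched in miniature with plain NumPy, using a toy Beta "posterior" for a single story in place of the real MCMC trace (all names and numbers here are illustrative, not taken from the actual analysis):

```python
import numpy as np

rng = np.random.RandomState(0)

# Toy "posterior" for one story: 500 candidate values of theta (in the real
# analysis these would be draws from the trace of the trained model)
theta_draws = rng.beta(20, 40, size=500)

# Held-out "real" data for that story: 0/1 extremity codes for 300 subjects
y_real = rng.binomial(1, 0.3, size=300)

# Steps 1-2: one replicated dataset per posterior draw (broadcast over rows)
y_rep = rng.binomial(1, theta_draws[:, None], size=(500, 300))

# Steps 3-4: the discrepancy statistic is the proportion of extreme answers
stat_real = y_real.mean()
stat_rep = y_rep.mean(axis=1)

# Step 5: posterior predictive p-value, the fraction of simulations with a
# statistic at least as large as the observed one
ppp = (stat_rep >= stat_real).mean()
print(stat_real, stat_rep.mean(), ppp)
```

A ppp value near 0 or 1 would suggest the model systematically under- or over-predicts the statistic; values in between indicate the simulations cover the observed data.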
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, I need to prepare the partitioning and decide how many simulations to run. I settled on 5 folds for cross-validation (the indices are prepared here) and 500 simulated datasets for the posterior predictive checks."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"nsim = 500 # How many simulated datasets to get\n",
"n_folds = 5 # How many folds for cross validation\n",
"# Use KFold partitioner from sklearn to create cross-validation indices\n",
"cv1 = list(KFold(data1.shape[0], n_folds=n_folds))\n",
"cv2 = list(KFold(data2.shape[0], n_folds=n_folds))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, I initialize containers for the outcome statistics."
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"prop_1 = np.empty((S,n_folds),dtype=float) # Proportion of extreme answers for each story in Russian group\n",
"prop_2 = np.empty((S,n_folds),dtype=float) # Proportion of extreme answers for each story in English group\n",
"prop_yrep1 = np.empty((S,n_folds),dtype=float) # Proportions for Russian group in simulated datasets\n",
"prop_yrep2 = np.empty((S,n_folds),dtype=float) # Proportions for English group in simulated datasets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, there is going to be a lot of code, but the basic idea is as follows.\n",
"\n",
"1. For each fold, I split the data into training and testing samples.\n",
"2. I train the model using the training dataset.\n",
"3. I simulate 500 datasets using candidate values from the posterior.\n",
"4. I calculate the proportion of extreme answers for each story.