## Tutorial on study designs and measures of effect
## Clinic on the Meaningful Modeling of Epidemiological Data
## International Clinics on Infectious Disease Dynamics and Data (ICI3D) Program
## African Institute for Mathematical Sciences, Muizenberg, RSA
## Jim Scott 2012

## Identifying study designs

## For each of the following descriptions, determine what type of 
## study design was used.  The answers are at the end of this tutorial.

## Study #1
## A study examined risk factors associated with falling at an 
## assisted care living facility.  Researchers enrolled 75 patients that 
## had falls and another 211 inpatients that did not have falls.  For each 
## study participant, the researchers examined adverse event reports, 
## medical records and nurse staffing records.  They found that patients 
## with a balance deficit or lower extremity problem were at higher risk 
## for a fall.
##
## What type of study design best describes this study?
##
## a) Cohort Study
## b) Case-Control Study
## c) Cross-Sectional
## d) Correlational Study
## e) Randomized Controlled Trial

## Study #2
## Researchers were interested in identifying risk factors associated 
## with needlestick injuries among medical students.  To do so, a survey
## was mailed to 417 medical students at a national university.
## The survey included questions about demographic factors, knowledge of 
## needle handling protocols, and episodes of needlestick injury.  Overall,
## 59 students (14.1%) reported experiencing one or more needlestick injuries.  
## Investigators found that those who reported having attended at least one 
## needle handling seminar had a lower prevalence of injury compared to 
## those who had not reported attending a seminar on needle handling.
##
##
## What type of study design best describes this study?
##
## a) Cohort Study
## b) Case-Control Study
## c) Cross-Sectional
## d) Correlational Study
## e) Randomized Controlled Trial

## Study #3
## Investigators sought to determine if water treatment via solar radiation
## is effective at reducing the overall incidence of diarrheal illness.  To 
## do so, researchers solicited participants from two neighboring towns.  
## All participants received clear plastic water containers.  However, 
## participants in town A were asked to treat their drinking water
## using the solar radiation method while those in town B were given no specific
## instructions.  After 6 months of follow-up, diarrhea incidence rates were 
## compared.
##
## What type of study design best describes this study?
##
## a) Cohort Study
## b) Case-Control Study
## c) Cross-Sectional
## d) Correlational Study
## e) Randomized Controlled Trial

## Study #4
##
## Is alcohol consumption associated with HIV transmission?
## To answer this question, researchers collected data on alcohol sales 
## and HIV prevalence in 48 different countries.  To control for possible
## confounding, additional data such as GDP (Gross Domestic Product), 
## unemployment, and education were also included in the analysis. 
##
## What type of study design best describes this study?
##
## a) Cohort Study
## b) Case-Control Study
## c) Cross-Sectional
## d) Correlational Study
## e) Randomized Controlled Trial


## Analyzing a 2 x 2 table
##
## In order to demonstrate 2 x 2 table analysis and to calculate measures of effect,
## we'll look at some data collected by Lefevre et al. (2010).  In that study,
## researchers conducted an experiment to determine if beer consumption increases how
## attractive humans are to mosquitoes.  In short, a number of volunteers were 
## randomized to consume beer.  Subsequently, mosquitoes were released into a
## controlled apparatus that led them to tents filled with study participants or an
## empty tent of outdoor air (uncontaminated by participants).  Two mosquito
## releases were performed, one before beer was consumed and one after.
## You can read more details about the complete experiment online:
##  http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0009546

## Prior to beer consumption, 215 mosquitoes flew toward the tent containing study 
## participants while 219 flew toward the outdoor air.  After beer consumption, 369
## mosquitoes flew toward the participants while only 221 flew toward the outdoor air. 

## We can replicate these data in R by entering the following commands:

Timing <- c(rep(1,590),rep(0,434))
Choice <- c(rep(1,369),rep(0,221),rep(1,215),rep(0,219))

Timing <- factor(Timing, levels = c(1,0), labels=c("After","Before"))
Choice <- factor(Choice, levels = c(1,0), labels=c("Human","Outdoors"))

MosquitoData <- data.frame(Timing,Choice)

## To see the data in table format you can use the table() command:

table(MosquitoData)

## Examining the data:
## It's usually a good idea to visually inspect the data with an appropriate graph.
## A quick first plot can be obtained by entering the following command:

barplot(table(MosquitoData))

## There are a number of reasons why this isn't a very satisfying plot.  Take a 
## minute to consider what it's lacking.  How can it be improved?

## For one, it would probably make more sense to have the data stratified by 
## exposure (in this case, the timing of mosquito release: before or after beer
## consumption).  Also, the plot is lacking appropriate labels and a title.

table(Choice, Timing)  # Timing variable now in columns

## This is an immediate improvement, but still somewhat misleading because the two
## releases involved different numbers of mosquitoes (590 vs. 434):
barplot(table(Choice,Timing),main="Mosquito choice by Timing of release",
        xlab="Timing of release relative to beer consumption", 
        ylab="No. Choosing Participants (dark)", 
        col = c("darkblue", "lightblue"))

## Better is to show the actual distribution of mosquito choice by timing of release
## using proportions:
prop.table(table(Choice,Timing),margin=2)

## This can be directly inserted into the barplot command:
barplot(prop.table(table(Choice, Timing),margin=2),
        main="Distribution of Mosquito Choice by Timing of release",
        xlab="Timing of release relative to beer consumption", 
        ylab="Proportion Choosing Participants (dark)", 
        col = c("darkblue", "lightblue"))

## Now the scales are comparable and it's clear that a greater proportion of 
## mosquitoes were attracted to the participants after they consumed the beer.
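
## A legend is often clearer than describing the colors in the axis label.  The
## sketch below is one possible way to add one.  It re-enters the four counts as a
## matrix so that it runs on its own; the names 'counts' and 'props' are ours and
## are not part of the original tutorial.  It relies on barplot()'s legend.text
## and args.legend arguments.

counts <- matrix(c(369, 221, 215, 219), nrow = 2,
                 dimnames = list(Choice = c("Human", "Outdoors"),
                                 Timing = c("After", "Before")))
props <- prop.table(counts, margin = 2)   # column proportions; each column sums to 1

barplot(props,
        main = "Distribution of Mosquito Choice by Timing of release",
        xlab = "Timing of release relative to beer consumption",
        ylab = "Proportion of mosquitoes",
        col = c("darkblue", "lightblue"),
        ylim = c(0, 1.25),                # leave room above the bars for the legend
        legend.text = TRUE,               # use the row names (Human, Outdoors) as labels
        args.legend = list(x = "top", horiz = TRUE, bty = "n"))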

## The above code demonstrates the flexibility of R in creating plots.  Try 
## experimenting with different colors.  Also, it's probably more appropriate 
## to change the column ordering so that the 'Before' column is first.  Try using
## your knowledge of R's indexing system to do this on your own.  One possible 
## answer appears at the end of this tutorial.

## Now that you've looked at the data, a natural question that arises is:
## "Did drinking the beer really increase the attractiveness of the participants
## (as far as the mosquitoes are concerned, that is!)?" -or- "Could the observed
## difference be due to chance?"

## There are a number of statistical tests that could be used to answer this question
## (in particular, you could use a permutation test as demonstrated in previous lectures).
## Possibly the simplest method would be to use a chi-square test of independence.
## The null hypothesis for this test is that the column and row variables are
## independent.  In this case, we could state the null hypothesis as:
## "Beer consumption has no effect on participant attractiveness".  The chi-square
## test statistic will have an approximate chi-square distribution with 
## (r - 1)*(c - 1) degrees of freedom (where r and c represent the number of rows
## and columns in the table - here df = 1) as long as the number of observations in 
## each cell is not "small".  You can get the chi-square test statistic, df, and 
## p-value from R by using the summary command in conjunction with table():
summary(table(Choice,Timing))
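
## The chi-square statistic can also be reproduced by hand, which makes the logic
## of the test explicit.  A minimal sketch, re-entering the four counts as a matrix
## (the object names below are ours): under independence, the expected count in each
## cell is (row total * column total) / grand total, and the statistic is the sum of
## (observed - expected)^2 / expected over all cells.

obs <- matrix(c(369, 221, 215, 219), nrow = 2,
              dimnames = list(Choice = c("Human", "Outdoors"),
                              Timing = c("After", "Before")))
expctd  <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # expected counts under the null
X2      <- sum((obs - expctd)^2 / expctd)                 # Pearson chi-square statistic
deg.f   <- (nrow(obs) - 1) * (ncol(obs) - 1)              # (r - 1)*(c - 1), here 1
p.value <- pchisq(X2, df = deg.f, lower.tail = FALSE)     # upper-tail probability
c(X2 = X2, df = deg.f, p.value = p.value)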

## The resulting p-value is very small, which provides evidence against the null
## hypothesis.  Conclusion: the evidence suggests that beer consumption and 
## attractiveness are not independent(!).  Of course, more research is needed.
## Chi-square test results can also be obtained in R by using the chisq.test()
## command:
?chisq.test

## Perform the same chi-square test that you did previously, but this time, use 
## the chisq.test() command.  Note: you may need to change one of the input arguments
## to get results that exactly match those that you previously obtained.
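
## The permutation test mentioned above gives a check that does not rely on the
## chi-square approximation.  A rough sketch under a strong assumption: each mosquito
## is treated as an independent observation, so the Before/After labels can be
## shuffled freely among the 1024 mosquitoes.  The object names, the seed, and the
## number of permutations (2000) are our own choices.

set.seed(2012)                                           # make the shuffles reproducible
timing <- rep(c("After", "Before"), times = c(590, 434))
human  <- c(rep(c(1, 0), times = c(369, 221)),           # After:  369 human, 221 outdoors
            rep(c(1, 0), times = c(215, 219)))           # Before: 215 human, 219 outdoors

## difference in the proportion choosing humans (After minus Before) for given labels
diff.in.props <- function(labels) {
  mean(human[labels == "After"]) - mean(human[labels == "Before"])
}

obs.diff  <- diff.in.props(timing)                       # observed difference
perm.diff <- replicate(2000, diff.in.props(sample(timing)))
mean(abs(perm.diff) >= abs(obs.diff))                    # two-sided permutation p-value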

## Measures of effect
## The odds ratio and relative risk can be calculated directly from our table:
table(Choice,Timing)

## Take a minute to perform these calculations by hand - then check your results 
## using R.  Also be sure you know how to interpret these measures.  Which of these
## measures of effect is most appropriate for the given study?

## It's good practice to provide confidence intervals (CIs) when reporting ORs and 
## RRs. These convey the degree of uncertainty (due to sampling) that is present in  
## the estimate(s) and represent a range of plausible values for the true measure of 
## effect.  Different methods exist for calculating CIs.  We'll rely on R to do the 
## calculating for us.  One way to obtain CIs is via the 'epiDisplay' package.  In 
## RStudio, it should be listed under the 'Packages' tab.  To load it you need to
## check its box or, alternatively, type: library(epiDisplay). If it isn't installed,
## run install.packages('epiDisplay') and then library(epiDisplay).

library(epiDisplay)

##
## If you don't see it listed under the 'Packages' tab, it may not be installed on 
## your computer.  You can install it by clicking the 'Install Packages' button
## and typing in: epiDisplay.

## Once epiDisplay is loaded, you have access to many commands relevant to 
## epidemiological analysis.  The command cci() provides the OR, CI, and results
## from a number of hypothesis tests associated with 2x2 tables. 
## You can get help using:
## ?cci

## Here it makes sense to consider 'Timing' as the exposure (After = exposed, 
## Before = unexposed). 'Choice' could play the role of "disease" (Human = case,
## Outdoors = control).
table(Choice, Timing)

## Selecting the appropriate values for the cci command:
cci(369, 221, 215, 219, graph=FALSE)
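
## To see where a CI for the OR can come from, here is a minimal sketch of the
## commonly used Wald (log-scale) interval, computed directly from the four cell
## counts.  The object names are ours, and the limits may differ slightly from those
## printed by cci(), which may use a different method.

case.exp   <- 369; ctrl.exp   <- 221    # After:  Human, Outdoors
case.unexp <- 215; ctrl.unexp <- 219    # Before: Human, Outdoors

or.hat    <- (case.exp * ctrl.unexp) / (ctrl.exp * case.unexp)       # cross-product ratio
se.log.or <- sqrt(1/case.exp + 1/ctrl.exp + 1/case.unexp + 1/ctrl.unexp)
or.ci     <- exp(log(or.hat) + c(-1, 1) * qnorm(0.975) * se.log.or)  # 95% Wald interval
round(c(OR = or.hat, lower = or.ci[1], upper = or.ci[2]), 2)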

## Analogously, metrics associated with the RR can be obtained through epiDisplay's
## csi() command.  Try using csi() to obtain a CI for the RR. Note: csi() doesn't 
## plot a graph for a 2x2 table, so you don't need to specify graph=FALSE.

## Based on the CIs for the OR and RR, what conclusions can you draw about the 
## relationship between the variables in the table?


## Often, it is necessary to control for the effects that a potentially confounding 
## variable may have on an exposure/disease relationship.  When confounding is
## present, it is not possible to obtain an unbiased measure of effect without
## accounting for it.
## One way to determine if confounding is present is through the application 
## of stratified analysis. Consider the following raw dataset from a 
## hypothetical case-control study investigating gender as a risk factor for
## Malaria (adapted from Szklo & Nieto, 2000):

## replicate raw data:
Gender <- c(rep(1,156),rep(0,144))
Malaria <- c(rep(1,88),rep(0,68),rep(1,62),rep(0,82))
Workplace <- c(rep(1,35),rep(0,53),rep(1,53),rep(0,15),rep(1,52),rep(0,10),rep(1,79),rep(0,3))

Gender <- factor(Gender, levels = c(1,0), labels=c("male","female"))
Malaria <- factor(Malaria, levels = c(1,0), labels=c("case","control"))
Workplace <- factor(Workplace, levels = c(1,0), labels=c("indoor","outdoor"))

MalariaData <- data.frame(Gender,Malaria,Workplace)

## take a look at the first few lines of raw data:
head(MalariaData)

## examine the relationship between Gender and Malaria:
table(Gender,Malaria)

## An OR can be obtained using the cci command.  Try it out.  Note: remember 
## the format is cci(caseexp, controlex, casenoex, controlnoex, graph=FALSE).  
## Assume exposed = male:

## You should have found the OR = 1.71.  This represents the "crude" or "unadjusted"
## OR.  It suggests that the odds of malaria are higher for men than for women.  Now,
## let's see what happens when we stratify the data by workplace.  If workplace is 
## unrelated to these data (i.e. not a confounder) then we should get approximately
## the same OR (OR=1.71) as we did before for both levels of workplace.
##
table(Gender,Malaria,Workplace)

## Compute the ORs for each table separately using cci().  

## The resulting ORs are both close to 1.00.  This reveals two things: 1) Workplace 
## appears to be a confounder - the stratified ORs differ from the crude OR. 2) 
## Since the stratified ORs are approximately equal, workplace does not appear to 
## be an effect modifier (i.e. the ORs do not vary by workplace).  
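
## The stratum-specific ORs can also be computed in one step.  A minimal sketch: the
## counts from the three-way table are re-entered as a 2 x 2 x 2 array (the names
## 'strat' and 'cross.product' are ours) and the cross-product ratio is applied to
## each Workplace layer.  Collapsing over Workplace recovers the crude OR.

strat <- array(c(35, 52, 53, 79,      # indoor:  male/female cases, male/female controls
                 53, 10, 15, 3),      # outdoor: male/female cases, male/female controls
               dim = c(2, 2, 2),
               dimnames = list(Gender = c("male", "female"),
                               Malaria = c("case", "control"),
                               Workplace = c("indoor", "outdoor")))

cross.product <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

strat.or <- apply(strat, 3, cross.product)               # one OR per workplace
crude.or <- cross.product(apply(strat, c(1, 2), sum))    # OR ignoring workplace
round(c(strat.or, crude = crude.or), 2)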
##
## We can further explore the confounding nature of Workplace by examining the 
## relationships between workplace & gender and workplace & malaria:
table(Workplace,Gender)
cci(68, 88, 13, 131,graph=FALSE) ## Males are more likely to work outdoors

table(Workplace,Malaria)
cci(63, 18, 87, 132, graph=FALSE) ## Malaria is associated with working outside

## Because workplace is associated with both gender and malaria, it is not 
## surprising that it had a confounding effect on the relationship between
## gender and malaria.

## To complete the analysis a Mantel-Haenszel method for stratified data (not 
## covered in this tutorial) could be applied to determine a combined, adjusted
## measure of effect. 
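
## For reference only, base R provides mantelhaen.test() in the stats package.  The
## sketch below re-enters the stratified counts as a 2 x 2 x 2 array (the object
## names are ours) and also computes the Mantel-Haenszel OR by hand as
## sum(a*d/n) / sum(b*c/n), where the sums run over the strata.  It is meant as an
## illustration, not as a full treatment of the method.

mh.array <- array(c(35, 52, 53, 79,   # indoor:  male/female cases, male/female controls
                    53, 10, 15, 3),   # outdoor: male/female cases, male/female controls
                  dim = c(2, 2, 2),
                  dimnames = list(Gender = c("male", "female"),
                                  Malaria = c("case", "control"),
                                  Workplace = c("indoor", "outdoor")))

mh <- mantelhaen.test(mh.array)       # strata are taken from the third dimension
mh$estimate                           # common (adjusted) OR
mh$conf.int                           # its confidence interval

## the same point estimate by hand
mh.num <- sum(apply(mh.array, 3, function(m) m[1, 1] * m[2, 2] / sum(m)))
mh.den <- sum(apply(mh.array, 3, function(m) m[1, 2] * m[2, 1] / sum(m)))
mh.by.hand <- mh.num / mh.den
mh.by.hand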
## 
## Stratified analysis is perhaps most useful when variables are categorical and 
## the overall number of variables is small.  When dealing with a larger number of
## variables (e.g. many confounding factors) or continuous explanatory variables,
## generalized linear model methods such as logistic regression can be used to 
## estimate adjusted measures of effect.  

## For those of you that are already familiar with these types of models, you can
## use R to fit a logistic model to these data using the following commands:

my.model <- glm(Malaria == 'case' ~ Gender + Workplace, family = binomial,
                data = MalariaData)
summary(my.model)

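## The adjusted ORs can be obtained by exponentiating the coefficients.  Note that
## 'male' is the reference level of Gender, so the Gender term gives the OR for
## malaria comparing females to males, after controlling for workplace:
exp(my.model$coefficients) 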

## confidence intervals for the ORs can be obtained using:
exp(confint.default(my.model))
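
## To compare the adjusted OR directly with the crude OR of 1.71 (male vs. female),
## the reference level of gender can be set to 'female'.  A minimal sketch that fits
## the same kind of model to grouped counts, so that it runs on its own; the data
## frame 'grouped' and the model names are ours.

grouped <- data.frame(
  gender    = factor(c("male", "female", "male", "female"),
                     levels = c("female", "male")),       # female = reference
  workplace = factor(c("indoor", "indoor", "outdoor", "outdoor"),
                     levels = c("indoor", "outdoor")),    # indoor = reference
  cases     = c(35, 52, 53, 10),
  controls  = c(53, 79, 15, 3))

crude.fit <- glm(cbind(cases, controls) ~ gender, family = binomial, data = grouped)
adj.fit   <- glm(cbind(cases, controls) ~ gender + workplace, family = binomial,
                 data = grouped)

crude.or.m <- exp(coef(crude.fit))["gendermale"]   # crude OR, male vs. female
adj.or.m   <- exp(coef(adj.fit))["gendermale"]     # OR adjusted for workplace
round(c(crude = unname(crude.or.m), adjusted = unname(adj.or.m)), 2)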

##  
##
##
##  Answers to selected exercises appear below
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## Answers to Identifying Study Designs
## Study 1: Case control - participants were enrolled based on disease status
## (i.e. "falls"); distributions of various exposures were then investigated.
##
## Study 2: Cross sectional - disease (needle stick) and exposure status 
## (e.g. attending a seminar) were assessed at the same time.  There is no
## way to know for certain which came first.  A very detailed survey may have
## asked about dates, but even then, results may be vulnerable to recall bias.
##
## Study 3: Cohort (best answer) - exposed (solar radiation) and unexposed 
## (no solar radiation) groups are followed over time and incidence rates 
## between the two groups are compared.  This could be considered an RCT but
## ONLY IF participants were randomly assigned to a treatment group.
##
## Study 4: Correlational - the unit of analysis is country.  No individual
## level data were collected.  Incidentally, there is no way to determine if 
## those who are HIV positive actually consumed more alcohol (on average) than 
## those who are HIV negative.  
##
## Examining the data:
## One answer to column switching exercise: 
## barplot(prop.table(table(Choice, Timing)[,c(2,1)],margin=2),
##       main="Distribution of Mosquito Choice by Timing of release",
##       xlab="Timing of release relative to beer consumption", 
##       ylab="Proportion Choosing Participants (dark)", 
##       col = c("darkblue", "lightblue"))
##
## Chi-square test:
## chisq.test(Choice,Timing, correct=FALSE)
##
## Measures of effect:
## 369*219 / (221*215)  ## OR = 1.7007
##
## The odds of attracting a mosquito are 1.70 times as high after consuming beer
## (compared to no beer consumption)
##  
## (369/(369+221))/(215/(215+219))  ## RR = 1.2625
##
## The risk of attracting a mosquito is 1.26 times as high after consuming beer
## (compared to no beer consumption)
##
## Which is more appropriate?  In this study we know the distribution of exposure
## (i.e. before/after) conditional on disease (i.e. human/outdoors) AND the 
## distribution of disease conditional on exposure.  As a result, either measure is
## appropriate - however, it's conventional to provide RR whenever possible because
## it is usually considered to be a more intuitive measure.  In a case-control study
## only the distribution of exposure conditional on disease is known - in that case,
## it would NOT be appropriate to calculate the RR - only the OR.
##
## cci command with appropriate labels:
## cci(369, 221, 215, 219, xlab="Timing",xaxis=c("Before","After"),
## ylab="Odds of choosing a human",yaxis=c("Human","Outdoors"), 
## main="Odds of choosing a human by exposure status")  
##
## csi command:
## csi(369, 221, 215, 219)
##
## A possible interpretation:
## The range of plausible values for the OR does not include the value 1.00, 
## suggesting that the odds of exposure are unequal between groups.  Therefore,
## we have evidence that the variables are related.  Similarly, the range of
## plausible values for the RR does not include 1.00, suggesting that the risk of 
## 'choosing a human' is not equal between exposure groups.  We have statistical 
## evidence that supports the hypothesis that exposure and disease are 
## associated with one another.
##
## Crude OR for Malaria Data:
## cci(88,68,62,82, graph=FALSE)
##
## Stratified ORs
## cci(35,53,52,79,graph=FALSE)
## cci(53,15,10,3,graph=FALSE)

##  References:
##
## 1. Lefevre T, et al. (2010) Beer Consumption Increases Human Attractiveness to 
##    Malarial Mosquitoes. PLoS ONE 5(3): e9546.
##  
## 2. Szklo & Nieto (2000) Epidemiology: Beyond the Basics. Aspen Publishers.
##