## Tutorial on study designs and measures of effect
## Clinic on the Meaningful Modeling of Epidemiological Data
## International Clinics on Infectious Disease Dynamics and Data (ICI3D) Program
## African Institute for Mathematical Sciences, Muizenberg, RSA
## Jim Scott 2012
## Identifying study designs
## For each of the following descriptions, determine what type of
## study design was used. The answers are at the end of this tutorial.
## Study #1
## A study examined risk factors associated with falling at an
## assisted care living facility. Researchers enrolled 75 patients that
## had falls and another 211 inpatients that did not have falls. For each
## study participant, the researchers examined adverse event reports,
## medical records and nurse staffing records. They found that patients
## with a balance deficit or lower extremity problem were at higher risk
## for a fall.
##
## What type of study design best describes this study?
##
## a) Cohort Study
## b) Case-Control Study
## c) Cross-Sectional
## d) Correlational Study
## e) Randomized Controlled Trial
## Study #2
## Researchers were interested in identifying risk factors associated
## with needlestick injuries among medical students. To do so, a survey
## was mailed to 417 medical students at a national university.
## The survey included questions about demographic factors, knowledge of
## needle handling protocols, and episodes of needlestick injury. Overall,
## 59 students (14.1%) reported experiencing one or more needlestick injuries.
## Investigators found that those who reported having attended at least one
## needle handling seminar had a lower prevalence of injury compared to
## those who had not reported attending a seminar on needle handling.
##
##
## What type of study design best describes this study?
##
## a) Cohort Study
## b) Case-Control Study
## c) Cross-Sectional
## d) Correlational Study
## e) Randomized Controlled Trial
## Study #3
## Investigators sought to determine if water treatment via solar radiation
## is effective at reducing the overall incidence of diarrheal illness. To
## do so, researchers solicited participants from two neighboring towns.
## All participants received clear plastic water containers. However,
## participants in town A were asked to treat their drinking water
## using the solar radiation method while those in town B were given no specific
## instructions. After 6 months of follow-up, diarrhea incidence rates were
## compared.
##
## What type of study design best describes this study?
##
## a) Cohort Study
## b) Case-Control Study
## c) Cross-Sectional
## d) Correlational Study
## e) Randomized Controlled Trial
## Study #4
##
## Is alcohol consumption associated with HIV transmission?
## To answer this question, researchers collected data on alcohol sales
## and HIV prevalence in 48 different countries. To control for possible
## confounding, additional data such as GDP (Gross Domestic Product),
## unemployment, and education were also included in the analysis.
##
## What type of study design best describes this study?
##
## a) Cohort Study
## b) Case-Control Study
## c) Cross-Sectional
## d) Correlational Study
## e) Randomized Controlled Trial
## Analyzing a 2 x 2 table
##
## In order to demonstrate 2 x 2 table analysis and to calculate measures of effect,
## we'll look at some data collected by Lefevre et al. (2010). In that study,
## researchers conducted an experiment to determine if beer consumption increases how
## attractive humans are to mosquitoes. In short, a number of volunteers were
## randomized to consume beer. Subsequently, mosquitoes were released into a
## controlled apparatus that led them to tents filled with study participants or an
## empty tent of outdoor air (uncontaminated by participants). Two different mosquito
## releases were performed, one before beer was consumed and one after.
## You can read more details about the complete experiment online:
## http://www.plosone.org/article/info:doi%2F10.1371%2Fjournal.pone.0009546
## Prior to beer consumption, 215 mosquitoes flew toward the tent containing study
## participants while 219 flew toward the outdoor air. After beer consumption, 369
## mosquitoes flew toward the participants while only 221 flew toward the outdoor air.
## We can replicate these data in R by entering the following commands:
Timing <- c(rep(1,590),rep(0,434))
Choice <- c(rep(1,369),rep(0,221),rep(1,215),rep(0,219))
Timing <- factor(Timing, levels = c(1,0), labels=c("After","Before"))
Choice <- factor(Choice, levels = c(1,0), labels=c("Human","Outdoors"))
MosquitoData <- data.frame(Timing,Choice)
## To see the data in table format you can use the table() command:
table(MosquitoData)
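## As a quick sanity check (a sketch, not part of the original analysis), the same
## counts can be typed directly into a 2 x 2 matrix and compared with the output of
## table(MosquitoData). The object name 'check_counts' is purely illustrative.
check_counts <- matrix(c(369, 215, 221, 219), nrow = 2,
                       dimnames = list(Timing = c("After", "Before"),
                                       Choice = c("Human", "Outdoors")))
## Row totals should be 590 (After) and 434 (Before), matching the rep() calls above:
addmargins(check_counts)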
## Examining the data:
## It's usually a good idea to visually inspect the data with an appropriate graph.
## Immediate results can be obtained by entering the following command:
barplot(table(MosquitoData))
## There are a number of reasons why this isn't a very satisfying plot. Take a
## minute to consider what it's lacking. How can it be improved?
## For one, it would probably make more sense to have the data stratified by
## exposure - in this case, Timing of mosquito release - before or after beer
## consumption. Also, the plot is lacking appropriate labels and a title.
table(Choice, Timing) # Timing variable now in columns
## This is an immediate improvement, but still somewhat misleading because the two
## releases involved different numbers of mosquitoes (590 after vs. 434 before), so
## the raw counts are not directly comparable:
barplot(table(Choice,Timing),main="Mosquito choice by Timing of release",
        xlab="Timing of release relative to beer consumption",
        ylab="No. Choosing Participants (dark)",
        col = c("darkblue", "lightblue"))
## A better approach is to show the distribution of mosquito choice by timing of
## release using proportions:
prop.table(table(Choice,Timing),margin=2)
## This can be directly inserted into the barplot command:
barplot(prop.table(table(Choice, Timing),margin=2),
main="Distribution of Mosquito Choice by Timing of release",
xlab="Timing of release relative to beer consumption",
ylab="Proportion Choosing Participants (dark)",
col = c("darkblue", "lightblue"))
## Now the scales are comparable and it's clear that a greater proportion of
## mosquitoes were attracted to the participants after they consumed the beer.
## The above code demonstrates the flexibility of R in creating plots. Try
## experimenting with different colors. Also, it's probably more appropriate
## to change the column ordering so that the 'Before' column is first. Try using
## your knowledge of R's indexing system to do this on your own. One possible
## answer appears at the end of this tutorial.
## Now that you've looked at the data, a natural question arises:
## "Did drinking the beer really increase the attractiveness of the participants
## (as far as the mosquitoes are concerned, that is)?" -or- "Could the observed
## difference be due to chance?"
## There are a number of statistical tests that could be used to answer this question
## (in particular, you could use a permutation test as demonstrated in previous lectures).
## Possibly the simplest method would be to use a chi-square test of independence.
## The null hypothesis for this test is that the column and row variables are
## independent. In this case, we could state the null hypothesis as:
## "Beer consumption has no effect on participant attractiveness". The chi-square
## test statistic will have an approximate chi-square distribution with
## (r - 1)*(c - 1) degrees of freedom (where r and c represent the number of rows
## and columns in the table - here df = 1) as long as the number of observations in
## each cell is not "small". You can get the chi-square test statistic, df, and
## p-value from R by using the summary command in conjunction with table():
summary(table(Choice,Timing))
## The resulting p-value is very small, which provides evidence against the null
## hypothesis. Conclusion: the evidence suggests that beer consumption and
## attractiveness are not independent(!). Of course, more research is needed.
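## To see where the chi-square statistic comes from, here is a minimal sketch that
## computes it by hand from the observed and expected counts. The counts are typed
## in directly so the lines run on their own; the object names are illustrative.
obs <- matrix(c(369, 221, 215, 219), nrow = 2,
              dimnames = list(Choice = c("Human", "Outdoors"),
                              Timing = c("After", "Before")))
## Expected count under independence = (row total * column total) / grand total
expctd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
chi_sq <- sum((obs - expctd)^2 / expctd)          # Pearson statistic, no correction
chi_df <- (nrow(obs) - 1) * (ncol(obs) - 1)       # (r - 1)*(c - 1)
chi_p  <- pchisq(chi_sq, df = chi_df, lower.tail = FALSE)
c(statistic = chi_sq, df = chi_df, p.value = chi_p)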
## Chi-square test results can also be obtained in R by using the chisq.test()
## command:
?chisq.test
## Perform the same chi-square test that you did previously, but this time, use
## the chisq.test() command. Note: you may need to change one of the input arguments
## to get results that exactly match those that you previously obtained.
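## The permutation test mentioned earlier as an alternative can be sketched as
## follows. This is only one possible version, under assumptions of our own: the
## test statistic is the difference in the proportion of mosquitoes choosing humans
## (After minus Before), 2000 shuffles are used, and the seed is arbitrary. The
## object and function names are illustrative.
set.seed(1)
perm_timing <- rep(c("After", "Before"), times = c(590, 434))
perm_choice <- c(rep(c("Human", "Outdoors"), times = c(369, 221)),
                 rep(c("Human", "Outdoors"), times = c(215, 219)))
prop_diff <- function(choice, timing) {
  mean(choice[timing == "After"] == "Human") -
    mean(choice[timing == "Before"] == "Human")
}
obs_diff  <- prop_diff(perm_choice, perm_timing)
n_perm    <- 2000
## Shuffling the Timing labels breaks any association with Choice (the null hypothesis)
perm_diff <- replicate(n_perm, prop_diff(perm_choice, sample(perm_timing)))
## Two-sided p-value; the +1 counts the observed arrangement itself
perm_p    <- (sum(abs(perm_diff) >= abs(obs_diff)) + 1) / (n_perm + 1)
c(observed.difference = obs_diff, p.value = perm_p)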
## Measures of effect
## The odds ratio and relative risk can be calculated directly from our table:
table(Choice,Timing)
## Take a minute to perform these calculations by hand - then check your results
## using R. Also be sure you know how to interpret these measures. Which of these
## measures of effect is most appropriate for the given study?
## It's good practice to provide confidence intervals (CIs) when reporting ORs and
## RRs. These convey the degree of uncertainty (due to sampling) that is present in
## the estimate(s) and represent a range of plausible values for the true measure of
## effect. Different methods exist for calculating CIs. We'll rely on R to do the
## calculating for us. One way to obtain CIs is via the 'epiDisplay' package. In
## RStudio, it should be listed under the 'Packages' tab. To load it, check its
## box or, alternatively, type: library(epiDisplay).
##
## If you don't see it listed under the 'Packages' tab, it may not be installed on
## your computer. You can install it by clicking the 'Install Packages' button
## and typing in epiDisplay, or by running install.packages('epiDisplay') followed
## by library(epiDisplay).
## Once epiDisplay is loaded, you have access to many commands relevant to
## epidemiological analysis. The command cci() provides the OR, CI, and results
## from a number of hypothesis tests associated with 2x2 tables.
?cci
## Here it makes sense to consider 'Timing' as the exposure (After = exposed,
## Before = unexposed). 'Choice' could play the role of "disease" (Human = case,
## Outdoors = control).
table(Choice, Timing)
## Selecting the appropriate values for the cci command:
cci(369, 221, 215, 219, graph=FALSE)
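## For intuition about where such an interval comes from, here is a sketch of one
## widely used approximation: the Wald (Woolf) interval, computed on the log-odds
## scale. It illustrates that one method only; cci() may use a different method, so
## its limits need not match these exactly. Object names are illustrative.
human_after  <- 369; outdoors_after  <- 221
human_before <- 215; outdoors_before <- 219
or_hat    <- (human_after * outdoors_before) / (outdoors_after * human_before)
## Standard error of log(OR): square root of the sum of reciprocal cell counts
se_log_or <- sqrt(1/human_after + 1/outdoors_after + 1/human_before + 1/outdoors_before)
or_ci     <- exp(log(or_hat) + c(-1, 1) * qnorm(0.975) * se_log_or)
c(OR = or_hat, lower = or_ci[1], upper = or_ci[2])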
## Analogously, metrics associated with the RR can be obtained through epiDisplay's
## csi() command. Try using csi() to obtain a CI for the RR. Note: csi doesn't
## plot a graph for a 2x2 table, so you don't need to specify graph=FALSE.
## Based on the CIs for the OR and RR, what conclusions can you draw about the
## relationship between the variables in the table?
## Often, it is necessary to control for the effects that a potentially confounding
## variable may have on an exposure/disease relationship. When confounding is
## present, it is not possible to obtain an unbiased measure of effect without
## accounting for the confounder.
## One way to determine if confounding is present is through the application
## of stratified analysis. Consider the following raw dataset from a
## hypothetical case-control study investigating gender as a risk factor for
## malaria (adapted from Szklo & Nieto, 2000):
## replicate raw data:
Gender <- c(rep(1,156),rep(0,144))
Malaria <- c(rep(1,88),rep(0,68),rep(1,62),rep(0,82))
Workplace <- c(rep(1,35),rep(0,53),rep(1,53),rep(0,15),rep(1,52),rep(0,10),rep(1,79),rep(0,3))
Gender <- factor(Gender, levels = c(1,0), labels=c("male","female"))
Malaria <- factor(Malaria, levels = c(1,0), labels=c("case","control"))
Workplace <- factor(Workplace, levels = c(1,0), labels=c("indoor","outdoor"))
MalariaData <- data.frame(Gender,Malaria,Workplace)
## take a look at the first few lines of raw data:
head(MalariaData)
## examine the relationship between Gender and Malaria:
table(Gender,Malaria)
## An OR can be obtained using the cci command. Try it out. Note: remember
## the format is cci(caseexp, controlex, casenonex, controlnonex, graph=FALSE).
## Assume exposed = male:
## You should have found the OR = 1.71. This represents the "crude" or "unadjusted"
## OR. It suggests that the odds of malaria are higher for men than for women. Now,
## let's see what happens when we stratify the data by workplace. If workplace is
## unrelated to these data (i.e. not a confounder), then we should get approximately
## the same OR (OR=1.71) as we did before within both levels of workplace.
##
table(Gender,Malaria,Workplace)
## Compute the ORs for each table separately using cci().
## The resulting ORs are both close to 1.00. This reveals two things: 1) Workplace
## appears to be a confounder - the stratified ORs differ from the crude OR. 2)
## Since the stratified ORs are approximately equal, workplace does not appear to
## be an effect modifier (i.e. the ORs do not vary by workplace).
##
## We can further explore the confounding nature of Workplace by examining the
## relationships between workplace & gender and workplace & malaria
table(Workplace,Gender)
cci(68, 88, 13, 131,graph=FALSE) ## Males are more likely to work outdoors
table(Workplace,Malaria)
cci(63, 18, 87, 132, graph=FALSE) ## Malaria is associated with working outside
## Because workplace is associated with both gender and malaria, it is not
## surprising that it had a confounding effect on the relationship between
## gender and malaria.
## To complete the analysis a Mantel-Haenszel method for stratified data (not
## covered in this tutorial) could be applied to determine a combined, adjusted
## measure of effect.
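## Although the Mantel-Haenszel method is not covered here, a minimal sketch using
## base R's mantelhaen.test() shows what the combined estimate looks like. The
## 2 x 2 x 2 array below re-enters the stratified counts by hand (it should hold
## the same counts as table(Gender,Malaria,Workplace)); the object names
## 'strata' and 'mh' are illustrative.
strata <- array(c(35, 52, 53, 79,     # indoor:  male/female cases, male/female controls
                  53, 10, 15, 3),     # outdoor: male/female cases, male/female controls
                dim = c(2, 2, 2),
                dimnames = list(Gender    = c("male", "female"),
                                Malaria   = c("case", "control"),
                                Workplace = c("indoor", "outdoor")))
mh <- mantelhaen.test(strata)
mh$estimate   # common (workplace-adjusted) OR for male vs. female
mh$conf.int   # confidence interval for the common OR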
##
## Stratified analysis is perhaps most useful when variables are categorical and
## the overall number of variables is small. When dealing with a larger number of
## variables (e.g. many confounding factors) or continuous explanatory variables,
## generalized linear model methods such as logistic regression can be used to
## estimate adjusted measures of effect.
## For those of you that are already familiar with these types of models, you can
## use R to fit a logistic model to these data using the following commands:
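my.model <- glm(Malaria=='case' ~ Gender + Workplace, family=binomial, data=MalariaData)
summary(my.model)
## The adjusted OR for malaria and gender after controlling for workplace can be
## obtained by exponentiating the coefficients. Note: 'male' is the first (reference)
## level of Gender, so the 'Genderfemale' entry is the OR for females relative to
## males (the reciprocal of the male vs. female OR used above):
exp(my.model$coefficients)
## Wald-type confidence intervals for the ORs can be obtained using:
exp(confint.default(my.model))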
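## To get an adjusted OR on the same footing as the crude OR of 1.71 (male vs.
## female), the model can be refit with 'female' as the reference level. The sketch
## below does this from the eight stratified cell counts, using them as frequency
## weights; this is an illustrative alternative to the row-per-person data above,
## and the object names ('cell_counts', 'alt.model') are our own.
cell_counts <- data.frame(
  Gender    = factor(rep(c("male", "male", "female", "female"), times = 2),
                     levels = c("female", "male")),   # female = reference level
  Malaria   = rep(c("case", "control"), times = 4),
  Workplace = factor(rep(c("indoor", "outdoor"), each = 4),
                     levels = c("indoor", "outdoor")),
  n         = c(35, 53, 52, 79, 53, 15, 10, 3))
alt.model <- glm(Malaria == "case" ~ Gender + Workplace, family = binomial,
                 weights = n, data = cell_counts)
## 'Gendermale' is now the workplace-adjusted OR for males relative to females
exp(coef(alt.model))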
##
##
##
## Answers to selected exercises appear below
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
##
## Answers to Identifying Study Designs
## Study 1: Case control - participants were enrolled based on disease status
## (i.e. "falls"); distributions of various exposures were then investigated.
##
## Study 2: Cross sectional - disease (needle stick) and exposure status
## (e.g. attending a seminar) were assessed at the same time. There is no
## way to know for certain which came first. A very detailed survey may have
## asked about dates, but even then, results may be vulnerable to recall bias.
##
## Study 3: Cohort (best answer) - exposed (solar radiation) and unexposed
## (no solar radiation) groups are followed over time and incidence rates
## between the two groups are compared. This could be considered an RCT but
## ONLY IF participants were randomly assigned to a treatment group.
##
## Study 4: Correlational - the unit of analysis is country. No individual
## level data were collected. Incidentally, there is no way to determine if
## those who are HIV positive actually consumed more alcohol (on average) than
## those who are HIV negative.
##
## Examining the data:
## One answer to column switching exercise:
## barplot(prop.table(table(Choice, Timing)[,c(2,1)],margin=2),
## main="Distribution of Mosquito Choice by Timing of release",
## xlab="Timing of release relative to beer consumption",
## ylab="Proportion Choosing Participants (dark)",
## col = c("darkblue", "lightblue"))
##
## Chi-square test:
## chisq.test(Choice,Timing, correct=FALSE)
##
## Measures of effect:
## 369*219 / (221*215) ## OR = 1.7007
##
## The odds of attracting a mosquito after consuming beer are 1.70 times the odds
## before beer consumption.
##
## (369/(369+221))/(215/(215+219)) ## RR = 1.2625
##
## The risk of attracting a mosquito after consuming beer is 1.26 times the risk
## before beer consumption.
##
## Which is more appropriate? In this study we know the distribution of exposure
## (i.e. before/after) conditional on disease (i.e. human/outdoors) AND the
## distribution of disease conditional on exposure. As a result, either measure is
## appropriate - however, it's conventional to provide RR whenever possible because
## it is usually considered to be a more intuitive measure. In a case-control study
## only the distribution of exposure conditional on disease is known - in that case,
## it would NOT be appropriate to calculate the RR - only the OR.
##
## cci command with appropriate labels:
## cci(369, 221, 215, 219, xlab="Timing",xaxis=c("Before","After"),
## ylab="Odds of choosing a human",yaxis=c("Human","Outdoors"),
## main="Odds of choosing a human by exposure status")
##
## csi command:
## csi(369, 221, 215, 219)
##
## A possible interpretation:
## The range of plausible values for the OR does not include the value 1.00,
## suggesting that the odds of exposure are unequal between groups; therefore,
## we have evidence that the variables are related. Similarly, the range of
## plausible values for the RR does not include 1.00, suggesting that the risk of
## 'choosing a human' is not equal between exposure groups. We have statistical
## evidence that supports the hypothesis that exposure and disease are
## associated with one another.
##
## Crude OR for Malaria Data:
## cci(88,68,62,82, graph=FALSE)
##
## Stratified ORs
## cci(35,53,52,79,graph=FALSE)
## cci(53,15,10,3,graph=FALSE)
## References:
##
## 1. Lefevre T, et al. (2010) Beer Consumption Increases Human Attractiveness to
## Malarial Mosquitoes. PLoS ONE 5(3): e9546.
##
## 2. Szklo & Nieto (2000) Epidemiology: Beyond the Basics. Aspen Publishers.
##