--- title: "Analysing cultural frequency data: neutral theory and beyond" author: "Enrico Crema" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true fig_caption: true self_contained: yes fontsize: 11pt documentclass: article vignette: > %\VignetteIndexEntry{Analysing cultural frequency data: neutral theory and beyond} %\VignetteEngine{knitr::rmarkdown_notangle} --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE) ``` # Introduction and Setup This markdown document contains the R script for reproducing some of the figures of the book chapter "Analysing cultural frequency data: neutral theory and beyond" by A.Kandler and E.Crema. The main functions for running the simulations described in the chapter can be installed as an R package with the following command: ```{r,installpackage,eval=FALSE} devtools::install_github("ercrema/HERAChp.KandlerCrema") ``` To fully reproduce the figures in the text the following R packages must also be loaded: ```{r,load_packages,message=FALSE,results='hide'} library(HERAChp.KandlerCrema) library(plotrix) library(LaplacesDemon) library(ggridges) library(ggplot2) library(foreach) library(doParallel) ``` Some of the figures are based on a considerable number of simulation runs. In order to reduce running time the example below are based on 100 repetitions instead of the 1,000 used in the book chapter, and most simulations are executed using multiple threads. ```{r set_nsim} nsim = 100 #number of repetitions ncores = detectCores()-1 #use total number of available threads minus 1 ``` Notice that even with the default settings above executing all the scripts below took about 40 minutes using a machine with 8 logical CPUs (3.3 GHz Intel Core i7) and 16 GB Ram. # Figure 1: Diversity and Battleship Curves The function `transmission()` generates temporal frequencies of different cultural variants under unbiased and frequency-dependent transmission returning a variety of statistics. The example below compares the time series of the Simpson's diversity index against the so-called battle-ship curve of the most common variants. First we run the model and extract relevant outputs: ```{r,fig1.sim,fig.height=5,fig.width=8} #set random seed set.seed(123) #run simulation res.fig1=transmission(timesteps=2000,mu=0.005,N=500,warmUp=1000,bias=0,raw=TRUE) #extract matrix of frequencies mat.fig1=res.fig1$rawMatrix #compute proportions pmat.fig1=prop.table(mat.fig1,1) #reorder matrix based on total frequencies pmat.fig1=pmat.fig1[,order(apply(pmat.fig1,2,sum),decreasing=TRUE)] #extract observed diversity time-series divTS = res.fig1$obs.div #extract expected diversity under neutrality divExp = res.fig1$exp.div ``` The matrix `pmat.fig1` now contains the frequencies of each variant at each timestep, the vector `divTS` contains the corresponding time series of Simpson's diversity index, while `divExp` contains the expected level of diversity under neutrality given by $1-\frac{1}{2N\mu+1}$. We are now ready to produce Figure 1: ```{r,fig1.plot,fig.height=5,fig.width=8} #make figure par(mfrow=c(1,2)) #left panel par(mar=c(5,0,5,0)) topVariants = 10 #show only the 10 most frequent variants battleship.plot(pmat.fig1[,1:topVariants],yaxlab="",xaxlab= letters[1:topVariants],border=NA,col="grey",mar=c(5,5,5,1)) axis(2,padj=0.5,line=-4) mtext(side=2,"time",line=-2) mtext(side=1,paste0("Top ",topVariants," variants"),line=2) #right panel par(mar=c(5,0,5,0)) plot(divTS,1:1000,type="l",xlab="",axes=F,ylab="",xlim=c(0,1)) abline(v=divExp,lty=2) axis(side=1,padj=-0.5) mtext(side=1,"Diversity",line=2) ``` # Figure 2: Effect of sampling window for computing reliable estimates of the turn-over rate Here we explore how estimates of the exponent $x$ in Eq. (11) are affected by the size (duration) of the sampling window. It has been shown that under neutrality $x$ assumes a value of 0.86 and we compare this theoretical prediction against distributions of simulation outputs with different number of timesteps (i.e. differently sized of sampling window). We first execute simulations of the Wright-Fisher model (notice that with `nsim=1000` the process will take XXX hours): ```{r,fig2.simulation} #set sampling window from 10 to 500, all with 1000 generations of warmup timesteps <- c(seq(1010,1100,10),1200,1500) # set parameter space timesteps.param = rep(timesteps,each=nsim) #set up parallel computing cl <- makeCluster(ncores) registerDoParallel(cl) # execute simulation in parallel and extract estimates of the exponent x (here named b): res.fig2.vector = foreach(i=1:length(timesteps.param),.combine=c,.export="transmission") %dopar% { tmp=transmission(timesteps=timesteps.param[i],mu=0.01,N=500,warmUp=1000,bias=0,raw=F,top=10)$x } #stop cluster stopCluster(cl) # convert output into a matrix res.fig2 = matrix(res.fig2.vector,nrow=nsim,ncol=length(timesteps)) #Extract summary statistics (median, first and third quartile) from each setting of the sampling window mid.fig2=apply(res.fig2,2,median) lo.fig2=apply(res.fig2,2,quantile,0.25) up.fig2=apply(res.fig2,2,quantile,0.75) ``` Now we can visualise the output. The object `res.fig2` contains the estimates of $x$ under different sampling window size, so a quick way to plot the simulation output is to generate a box and whisker plot: ```{r,fig2.plot1,fig.width=9,fig.height=6} boxplot(res.fig2,names=timesteps-1000,xlab="Number of timesteps",ylab="Estimates of x",col="bisque3") abline(h=0.86,col=2,lty=2) #expected value of x ``` In the manuscript we used the code below: ```{r,fig2.plot2,fig.width=9,fig.height=6} #Plot results using rect() and unqeual spacing along the x-axis par(mar=c(5,4,4,1)) plot(1:14,1:14,ylim=c(min(lo.fig2),max(up.fig2)),type="n",axes=F,xlab="Number of timesteps",ylab="Estimates of x") for (x in 1:10) { rect(xleft=x-0.25,xright=x+0.25,ybottom=lo.fig2[x],ytop=up.fig2[x],col="lightgrey",border=NA) lines(x=c(x-0.25,x+0.25),y=c(mid.fig2[x],mid.fig2[x]),lwd=2) } x=11 rect(xleft=12-0.25,xright=12+0.25,ybottom=lo.fig2[x],ytop=up.fig2[x],col="lightgrey",border=NA) lines(x=c(12-0.25,12+0.25),y=c(mid.fig2[x],mid.fig2[x]),lwd=2) x=12 rect(xleft=14-0.25,xright=14+0.25,ybottom=lo.fig2[x],ytop=up.fig2[x],col="lightgrey",border=NA) lines(x=c(14-0.25,14+0.25),y=c(mid.fig2[x],mid.fig2[x]),lwd=2) # add axes axis(side=2,at=seq(0,5,0.5),cex.axis=0.8,las=2) axis(side=1, at=c(1:10,12,14),labels=c(seq(10,100,10),200,500),cex=0.9) # show the expected value of b under neutrality abline(h=0.86,lty=2,col="red") ``` # Figure 3: Effect of heterogeneity in the strength of the transmission bias in the population To simulate an heterogeneous population of learners we use the function `heteroPopTransmission()` which enables us to define the strength of the frequency-dependent transmission bias for each individual in each generation separately as a random draw from a normal distribution with a user-defined mean (argument `bmean`) and standard deviation (argument `bsd`). We then compare the distributions of Simpson diversity levels for populations with means of 0 (i.e. on average the learners engage in unbiased transmission), but different standard deviations. ```{r,fig3.simulation} #set number of generations ngen=1000 #set parameter to vary (bsd) bsd = c(0,0.1,0.2) #set parameter space bsd.param = rep(bsd,each=nsim) #set up parallel computing cl <- makeCluster(ncores) registerDoParallel(cl) res.fig3.vector = foreach(i=1:length(bsd.param),.combine=c,.export="heteroPopTransmission") %dopar% { heteroPopTransmission(N=500,bmean=0,bsd=bsd.param[i],mu=0.01,timesteps=ngen)[ngen] } #stop cluster stopCluster(cl) # convert output into a matrix res.fig3 = matrix(res.fig3.vector,nrow=nsim,ncol=length(bsd)) ``` Each column of the matrix `res.fig3` now contains the diversity levels of 1,000 simulation runs with a level of heterogeneity in the population of learners as defined by the parameter `bsd`. ```{r,fig3.plot,fig.height=4,fig.width=12} # Setup layout mat=matrix(c(1,4,4,4,2,5,5,5,3,6,6,6),nrow=2,ncol=6) layout(mat=mat,heights=c(0.45,0.55),widths=c(0.45,0.55,0.45,0.55,0.45,0.55)) # Visualise distribution of learners' frequency bias under each setting par(mar=c(5,5,1,1)) plot(0,1,type="n",axes=F,xlab="",ylab="",xlim=c(-0.6,0.6)) lines(x=c(0,0),y=c(0,2),lwd=2) axis(1,at=c(-0.4,0,0.4),padj=-0.5) mtext("b",1,line=2,cex=0.8) par(mar=c(5,5,1,1)) xx=seq(-0.6,0.6,length.out=1000) yy=dnorm(x=xx,sd=0.1) plot(xx,yy,type="n",axes=F,xlab="",ylab="") polygon(c(xx,rev(xx)),c(yy,rep(0,1000)),col="black") axis(1,at=c(-0.4,0,0.4),padj=-0.5) mtext("b",1,line=2,cex=0.8) par(mar=c(5,5,1,1)) xx=seq(-0.6,0.6,length.out=1000) yy=dnorm(x=xx,sd=0.2) plot(xx,yy,type="n",axes=F,xlab="",ylab="") polygon(c(xx,rev(xx)),c(yy,rep(0,1000)),col="black") axis(1,at=c(-0.4,0,0.4),padj=-0.5) mtext("b",1,line=2,cex=0.8) # Show simulation result par(mar=c(5,5,1,1.2)) hist(res.fig3[,1],breaks=seq(0.55,1,0.02),border=NA,col="bisque2",xlab="Diversity",main="",cex.lab=1.5,cex.axis=1.3) abline(v=1-(1/(500* 0.01*2+1)),lty=2,col=2,lwd=1.5) abline(v=mean(res.fig3[,1]),lty=3,lwd=1.5) legend("left",legend=c("Average Diversity (observed)","Expected Diversity"),col=c(1,2),lty=c(3,2),bty="n",cex=0.8,y.intersp=1.5) hist(res.fig3[,2],breaks=seq(0.55,1,0.02),border=NA,col="bisque2",xlab="Diversity",main="",cex.lab=1.5,cex.axis=1.3) abline(v=1-(1/(500* 0.01*2+1)),lty=2,col=2,lwd=1.5) abline(v=mean(res.fig3[,2]),lty=3,lwd=1.5) hist(res.fig3[,3],breaks=seq(0.55,1,0.02),border=NA,col="bisque2",xlab="Diversity",main="",cex.lab=1.5,cex.axis=1.3) abline(v=1-(1/(500* 0.01*2+1)),lty=2,col=2,lwd=1.5) abline(v=mean(res.fig3[,3]),lty=3,lwd=1.5) ``` # Figure 4: Effect of temporally changing transmission modes Here we explore how a temporary shift from in the underlying transmission bias, in our case from an unbiased transmission to an anti-conformist transmission, affects the observed level of cultural diversity. The argument `bias` in the function `transmission()` allows for a time-varying frequency-dependent transmission bias by defining the strength of the bias in each time step. ```{r} #define bias parameter bias.eq = 0 bias.neq = c(rep(0,1500),rep(0.5,10),rep(0,490)) #run simulations #res.fig4.eq=matrix(replicate(transmission(timesteps=2000,mu=0.005,N=500,warmUp=1000,bias=bias.eq,raw=TRUE)$obs.div[401:900],n=nsim),nrow=500,ncol=nsim) #set up parallel computing cl <- makeCluster(detectCores()-1) #use one core less than the total available registerDoParallel(cl) res.fig4.eq.vector = foreach(i=1:nsim,.combine=c,.export="transmission") %dopar% { transmission(timesteps=2000,mu=0.005,N=500,warmUp=1000,bias=bias.eq,raw=TRUE)$obs.div[401:900] } res.fig4.neq.vector = foreach(i=1:nsim,.combine=c,.export="transmission") %dopar% { transmission(timesteps=2000,mu=0.005,N=500,warmUp=1000,bias=bias.neq,raw=TRUE)$obs.div[401:900] } #stop cluster stopCluster(cl) res.fig4.eq = matrix(res.fig4.eq.vector,nrow=500,ncol=nsim) res.fig4.neq = matrix(res.fig4.neq.vector,nrow=500,ncol=nsim) # Compute Expected Diversity expected.diversity = 1 - (1/(2*500*0.005+1)) ``` ```{r,fig4.plot1,fig.width=8,fig.height=6} par(mfrow=c(2,1),mar=c(1,5,5,2)) plot(1:500,res.fig4.eq[,1],ylim=c(0.4,1),type="n",xlab="",ylab="Diversity",xaxs="i",axes=F,las=2) text(480,0.5,"a",cex=1.5) polygon(x=c(1:500,500:1),y=c(apply(res.fig4.eq,1,quantile,prob=0.025),rev(apply(res.fig4.eq,1,quantile,prob=0.975))),col=rgb(1,0,0,0.2),border=NA) apply(res.fig4.eq,2,lines,col=rgb(0,0,0,0.01),x=1:500) lines(1:500,apply(res.fig4.eq,1,mean),col="red") abline(h=expected.diversity,col="black",lty=2) axis(2,at=seq(0,1,0.2),las=2,hadj=0.8) par(mar=c(6,5,0,2)) plot(1:500,res.fig4.neq[,1],ylim=c(0.4,1),type="n",xlab="time",ylab="Diversity",xaxs="i",axes=F,las=2) polygon(x=c(1:500,500:1),y=c(apply(res.fig4.neq,1,quantile,prob=0.025),rev(apply(res.fig4.neq,1,quantile,prob=0.975))),col=rgb(1,0,0,0.2),border=NA) apply(res.fig4.neq,2,lines,col=rgb(0,0,0,0.01),x=1:500) lines(1:500,apply(res.fig4.neq,1,mean),col="red") abline(h=expected.diversity,col="black",lty=2) rect(xleft=100,xright=110,ybottom=-100,ytop=100,col=rgb(0,0,1,0.2),lty=3) text(480,0.5,"b",cex=1.5) axis(2,at=seq(0,1,0.2),las=2,hadj=0.8) axis(1,at=c(1,100,200,300,400,500)) ``` # Figure 7: Posterior distribution of the strength of frequency-dependent transmission for the Merzbach assemblage The code below plots the results of the ABC analysis for the Merzbach assemblage. The step-by-step description of this analysis is the subject of a forthcoming R package. ```{r,fig7.load} # load data data("post_b_merzbach") ``` ```{r,fig7.plot1,fig.width=6,fig.height=8} # Simple boxplot: par(mar=c(5,8,1,1)) boxplot(post_b_merzbach,horizontal=T,las=2,xlab="b") abline(v=0,lty=2,col=2) #unbiased transmission ``` ```{r,fig7.plot2,fig.width=6,fig.height=8} #Extract 50 and 95$ HPDI hpd50 = t(apply(post_b_merzbach,2,p.interval,prob=0.5)) hpd95 = t(apply(post_b_merzbach,2,p.interval,prob=0.95)) #Restructure into data.frame data=rbind.data.frame( data.frame(Model="Equilibrium",b=post_b_merzbach[,1],Phase="All(Eq)"), data.frame(Model="Var.Population",b=post_b_merzbach[,2],Phase="All(S.Eq)"), data.frame(Model="Var.Pop & Transmission",Phase="VIII",b=post_b_merzbach[,3]), data.frame(Model="Var.Pop & Transmission",Phase="IX",b=post_b_merzbach[,4]), data.frame(Model="Var.Pop & Transmission",Phase="X",b=post_b_merzbach[,5]), data.frame(Model="Var.Pop & Transmission",Phase="XI",b=post_b_merzbach[,6]), data.frame(Model="Var.Pop & Transmission",Phase="XII",b=post_b_merzbach[,7]), data.frame(Model="Var.Pop & Transmission",Phase="XIII",b=post_b_merzbach[,8]), data.frame(Model="Var.Pop & Transmission",Phase="XIV",b=post_b_merzbach[,9])) #make joyplot g=ggplot(data, aes(x = b, y = Phase,fill=Model)) + geom_density_ridges(rel_min_height=0.005,alpha=0.5,color="white")+scale_fill_manual(values=c("indianred","royalblue","darkgrey"))+ theme_ridges(center_axis_labels = TRUE)+ theme(plot.margin = unit(c(1,1,1,1), "cm"),legend.position = c(0.6, 0.8))+scale_y_discrete(labels=c("All","All","VIII","IX","X","XI","XII","XIII","XIV"))+coord_cartesian(xlim = c(-0.25, 0.25)) + scale_x_continuous(expand = c(0.01, 0))+ annotate("rect",xmin = 0.05, xmax = 2.5, ymin = 6.5, ymax = 8.7,alpha=0.7,fill="white")+ annotate("segment",x = 0.06, xend = 0.077, y = 7.1, yend = 7.1)+ annotate("segment",x = 0.06, xend = 0.06, y = 7.1, yend = 7.05)+ annotate("segment",x = 0.077, xend = 0.077, y = 7.1, yend = 7.05)+ annotate("segment",x = 0.06, xend = 0.06, y = 7.1, yend = 7.15)+ annotate("segment",x = 0.077, xend = 0.077, y = 7.1, yend = 7.15)+ annotate("text",x = 0.12,y = 7.1,label="50% HPDI")+ annotate("segment",x = 0.06, xend = 0.078, y = 6.8, yend = 6.8,linetype="dotted")+ annotate("text",x = 0.12,y = 6.8,label="95% HPDI") #add HPDI intervals for (i in 1:9) { g = g + annotate("segment",x = hpd50[i,1], xend = hpd50[i,2], y = i+0.3, yend = i+0.3,lwd=0.5)+ annotate("segment",x = hpd50[i,1], xend = hpd50[i,1], y = i+0.3, yend = i+0.25,lwd=0.5)+ annotate("segment",x = hpd50[i,2], xend = hpd50[i,2], y = i+0.3, yend = i+0.25,lwd=0.5)+ annotate("segment",x = hpd50[i,1], xend = hpd50[i,1], y = i+0.3, yend = i+0.35,lwd=0.5)+ annotate("segment",x = hpd50[i,2], xend = hpd50[i,2], y = i+0.3, yend = i+0.35,lwd=0.5)+ annotate("segment",x = hpd95[i,1], xend = hpd95[i,2], y = i+0.3, yend = i+0.3, linetype="dotted",lwd=0.5) } g ```