---
title: "Analysing cultural frequency data: neutral theory
and beyond"
author: "Enrico Crema"
date: "`r Sys.Date()`"
output:
  rmarkdown::html_vignette:
    toc: true
    fig_caption: true
    self_contained: yes
fontsize: 11pt
documentclass: article
vignette: >
  %\VignetteIndexEntry{Analysing cultural frequency data: neutral theory and beyond}
  %\VignetteEngine{knitr::rmarkdown_notangle}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Introduction and Setup

This markdown document contains the R script for reproducing some of the figures of the book chapter "Analysing cultural frequency data: neutral theory and beyond" by A.Kandler and E.Crema. The main functions for running the simulations described in the chapter can be installed as an R package with the following command:

```{r,installpackage,eval=FALSE}
devtools::install_github("ercrema/HERAChp.KandlerCrema")
```

To fully reproduce the figures in the text the following R packages must also be loaded:

```{r,load_packages,message=FALSE,results='hide'}
library(HERAChp.KandlerCrema)
library(plotrix)
library(LaplacesDemon)
library(ggridges)
library(ggplot2)
library(foreach)
library(doParallel)
```

Some of the figures are based on a considerable number of simulation runs. In order to reduce running time the example below are based on 100 repetitions instead of the 1,000 used in the book chapter, and most simulations are executed using multiple threads. 

```{r set_nsim}
nsim = 100 #number of repetitions
ncores = detectCores()-1 #use total number of available threads minus 1
```

Notice that even with the default settings above executing all the scripts below took about 40 minutes using a machine with 8 logical CPUs (3.3 GHz Intel Core i7) and 16 GB Ram.  

# Figure 1: Diversity and Battleship Curves

The function `transmission()` generates temporal frequencies of different cultural variants under unbiased and frequency-dependent transmission returning a variety of statistics. The example below compares the time series of the Simpson's diversity index against the so-called battle-ship curve of the most common variants. First we run the model and extract relevant outputs:


```{r,fig1.sim,fig.height=5,fig.width=8}
#set random seed
set.seed(123)
#run simulation
res.fig1=transmission(timesteps=2000,mu=0.005,N=500,warmUp=1000,bias=0,raw=TRUE)
#extract matrix of frequencies
mat.fig1=res.fig1$rawMatrix
#compute proportions
pmat.fig1=prop.table(mat.fig1,1)
#reorder matrix based on total frequencies
pmat.fig1=pmat.fig1[,order(apply(pmat.fig1,2,sum),decreasing=TRUE)]
#extract observed diversity time-series
divTS = res.fig1$obs.div
#extract expected diversity under neutrality
divExp = res.fig1$exp.div
```

The matrix `pmat.fig1` now contains the frequencies of each variant at each timestep, the vector `divTS` contains the corresponding time series of Simpson's diversity index, while `divExp` contains the expected level of diversity under neutrality given by $1-\frac{1}{2N\mu+1}$. We are now ready to produce Figure 1:


```{r,fig1.plot,fig.height=5,fig.width=8}
#make figure
par(mfrow=c(1,2))

#left panel 
par(mar=c(5,0,5,0))
topVariants = 10 #show only the 10 most frequent variants
battleship.plot(pmat.fig1[,1:topVariants],yaxlab="",xaxlab= letters[1:topVariants],border=NA,col="grey",mar=c(5,5,5,1))
axis(2,padj=0.5,line=-4)
mtext(side=2,"time",line=-2)
mtext(side=1,paste0("Top ",topVariants," variants"),line=2)
#right panel
par(mar=c(5,0,5,0))
plot(divTS,1:1000,type="l",xlab="",axes=F,ylab="",xlim=c(0,1))
abline(v=divExp,lty=2)
axis(side=1,padj=-0.5)
mtext(side=1,"Diversity",line=2)
```

# Figure 2: Effect of sampling window for computing reliable estimates of the turn-over rate

Here we explore how estimates of the exponent $x$ in Eq. (11) are affected by the size (duration) of the sampling window. It has been shown that under neutrality $x$ assumes a value of 0.86 and we compare this theoretical prediction against distributions of simulation outputs with different number of timesteps (i.e. differently sized of sampling window). We first execute simulations of the Wright-Fisher model (notice that with `nsim=1000` the process will take XXX hours): 


```{r,fig2.simulation}

#set sampling window from 10 to 500, all with 1000 generations of warmup
timesteps <- c(seq(1010,1100,10),1200,1500)

# set parameter space
timesteps.param = rep(timesteps,each=nsim)

#set up parallel computing
cl <-  makeCluster(ncores)  
registerDoParallel(cl)

# execute simulation in parallel and extract estimates of the exponent x (here named b):
res.fig2.vector = foreach(i=1:length(timesteps.param),.combine=c,.export="transmission") %dopar% {
  tmp=transmission(timesteps=timesteps.param[i],mu=0.01,N=500,warmUp=1000,bias=0,raw=F,top=10)$x
}

#stop cluster
stopCluster(cl)

# convert output into a matrix
res.fig2 = matrix(res.fig2.vector,nrow=nsim,ncol=length(timesteps))

#Extract summary statistics (median, first and third quartile) from each setting of the sampling window
mid.fig2=apply(res.fig2,2,median)
lo.fig2=apply(res.fig2,2,quantile,0.25)
up.fig2=apply(res.fig2,2,quantile,0.75)
```

Now we can visualise the output. The object `res.fig2` contains the estimates of $x$ under different sampling window size, so a quick way to plot the simulation output is to generate a box and whisker plot:

```{r,fig2.plot1,fig.width=9,fig.height=6}
boxplot(res.fig2,names=timesteps-1000,xlab="Number of timesteps",ylab="Estimates of x",col="bisque3")
abline(h=0.86,col=2,lty=2) #expected value of x
```

In the manuscript we used the code below:

```{r,fig2.plot2,fig.width=9,fig.height=6}
#Plot results using rect() and unqeual spacing along the x-axis
par(mar=c(5,4,4,1))
plot(1:14,1:14,ylim=c(min(lo.fig2),max(up.fig2)),type="n",axes=F,xlab="Number of timesteps",ylab="Estimates of x")

for (x in 1:10)
{
	rect(xleft=x-0.25,xright=x+0.25,ybottom=lo.fig2[x],ytop=up.fig2[x],col="lightgrey",border=NA)
	lines(x=c(x-0.25,x+0.25),y=c(mid.fig2[x],mid.fig2[x]),lwd=2)
}

x=11
rect(xleft=12-0.25,xright=12+0.25,ybottom=lo.fig2[x],ytop=up.fig2[x],col="lightgrey",border=NA)
lines(x=c(12-0.25,12+0.25),y=c(mid.fig2[x],mid.fig2[x]),lwd=2)

x=12
rect(xleft=14-0.25,xright=14+0.25,ybottom=lo.fig2[x],ytop=up.fig2[x],col="lightgrey",border=NA)
lines(x=c(14-0.25,14+0.25),y=c(mid.fig2[x],mid.fig2[x]),lwd=2)

# add axes
axis(side=2,at=seq(0,5,0.5),cex.axis=0.8,las=2)
axis(side=1, at=c(1:10,12,14),labels=c(seq(10,100,10),200,500),cex=0.9)

# show the expected value of b under neutrality
abline(h=0.86,lty=2,col="red")

```

# Figure 3: Effect of heterogeneity in the strength of the transmission bias in the population

To simulate an heterogeneous population of learners we use the function `heteroPopTransmission()` which enables us to define the strength of the frequency-dependent transmission bias for each individual in each generation separately as a random draw from a normal distribution with a user-defined mean (argument `bmean`) and standard deviation (argument `bsd`). We then compare the distributions of Simpson diversity levels for populations with means of 0 (i.e. on average the learners engage in unbiased transmission), but different standard deviations. 


```{r,fig3.simulation}
#set number of generations
ngen=1000
#set parameter to vary (bsd)
bsd = c(0,0.1,0.2)
#set parameter space
bsd.param = rep(bsd,each=nsim)

#set up parallel computing
cl <-  makeCluster(ncores)  
registerDoParallel(cl)

res.fig3.vector = foreach(i=1:length(bsd.param),.combine=c,.export="heteroPopTransmission") %dopar% {
  heteroPopTransmission(N=500,bmean=0,bsd=bsd.param[i],mu=0.01,timesteps=ngen)[ngen]
}

#stop cluster
stopCluster(cl)

# convert output into a matrix
res.fig3 = matrix(res.fig3.vector,nrow=nsim,ncol=length(bsd))
```

Each column of the matrix `res.fig3` now contains the diversity levels of 1,000 simulation runs with a level of heterogeneity in the population of learners as defined by the parameter `bsd`.

```{r,fig3.plot,fig.height=4,fig.width=12}
# Setup layout
mat=matrix(c(1,4,4,4,2,5,5,5,3,6,6,6),nrow=2,ncol=6)
layout(mat=mat,heights=c(0.45,0.55),widths=c(0.45,0.55,0.45,0.55,0.45,0.55))

# Visualise distribution of learners' frequency bias under each setting
par(mar=c(5,5,1,1))
plot(0,1,type="n",axes=F,xlab="",ylab="",xlim=c(-0.6,0.6))
lines(x=c(0,0),y=c(0,2),lwd=2)
axis(1,at=c(-0.4,0,0.4),padj=-0.5)
mtext("b",1,line=2,cex=0.8)

par(mar=c(5,5,1,1))
xx=seq(-0.6,0.6,length.out=1000)
yy=dnorm(x=xx,sd=0.1)
plot(xx,yy,type="n",axes=F,xlab="",ylab="")
polygon(c(xx,rev(xx)),c(yy,rep(0,1000)),col="black")
axis(1,at=c(-0.4,0,0.4),padj=-0.5)
mtext("b",1,line=2,cex=0.8)

par(mar=c(5,5,1,1))
xx=seq(-0.6,0.6,length.out=1000)
yy=dnorm(x=xx,sd=0.2)
plot(xx,yy,type="n",axes=F,xlab="",ylab="")
polygon(c(xx,rev(xx)),c(yy,rep(0,1000)),col="black")
axis(1,at=c(-0.4,0,0.4),padj=-0.5)
mtext("b",1,line=2,cex=0.8)


# Show simulation result
par(mar=c(5,5,1,1.2))
hist(res.fig3[,1],breaks=seq(0.55,1,0.02),border=NA,col="bisque2",xlab="Diversity",main="",cex.lab=1.5,cex.axis=1.3)
abline(v=1-(1/(500* 0.01*2+1)),lty=2,col=2,lwd=1.5)
abline(v=mean(res.fig3[,1]),lty=3,lwd=1.5)
legend("left",legend=c("Average Diversity (observed)","Expected Diversity"),col=c(1,2),lty=c(3,2),bty="n",cex=0.8,y.intersp=1.5)

hist(res.fig3[,2],breaks=seq(0.55,1,0.02),border=NA,col="bisque2",xlab="Diversity",main="",cex.lab=1.5,cex.axis=1.3)
abline(v=1-(1/(500* 0.01*2+1)),lty=2,col=2,lwd=1.5)
abline(v=mean(res.fig3[,2]),lty=3,lwd=1.5)

hist(res.fig3[,3],breaks=seq(0.55,1,0.02),border=NA,col="bisque2",xlab="Diversity",main="",cex.lab=1.5,cex.axis=1.3)
abline(v=1-(1/(500* 0.01*2+1)),lty=2,col=2,lwd=1.5)
abline(v=mean(res.fig3[,3]),lty=3,lwd=1.5)
```

# Figure 4: Effect of temporally changing transmission modes

Here we explore how a temporary shift from in the underlying transmission bias, in our case from an unbiased transmission to an anti-conformist transmission, affects the observed level of cultural diversity. The argument `bias` in the function `transmission()` allows for a time-varying frequency-dependent transmission bias by defining the strength of the bias in each time step. 


```{r}
#define bias parameter
bias.eq = 0
bias.neq = c(rep(0,1500),rep(0.5,10),rep(0,490)) 


#run simulations
#res.fig4.eq=matrix(replicate(transmission(timesteps=2000,mu=0.005,N=500,warmUp=1000,bias=bias.eq,raw=TRUE)$obs.div[401:900],n=nsim),nrow=500,ncol=nsim)

#set up parallel computing
cl <-  makeCluster(detectCores()-1)  #use one core less than the total available
registerDoParallel(cl)

res.fig4.eq.vector = foreach(i=1:nsim,.combine=c,.export="transmission") %dopar% {
  transmission(timesteps=2000,mu=0.005,N=500,warmUp=1000,bias=bias.eq,raw=TRUE)$obs.div[401:900]
}

res.fig4.neq.vector = foreach(i=1:nsim,.combine=c,.export="transmission") %dopar% {
  transmission(timesteps=2000,mu=0.005,N=500,warmUp=1000,bias=bias.neq,raw=TRUE)$obs.div[401:900]
}

#stop cluster
stopCluster(cl)

res.fig4.eq = matrix(res.fig4.eq.vector,nrow=500,ncol=nsim)
res.fig4.neq = matrix(res.fig4.neq.vector,nrow=500,ncol=nsim)

# Compute Expected Diversity
expected.diversity = 1 - (1/(2*500*0.005+1))
```


```{r,fig4.plot1,fig.width=8,fig.height=6}
par(mfrow=c(2,1),mar=c(1,5,5,2))

plot(1:500,res.fig4.eq[,1],ylim=c(0.4,1),type="n",xlab="",ylab="Diversity",xaxs="i",axes=F,las=2)
text(480,0.5,"a",cex=1.5)
polygon(x=c(1:500,500:1),y=c(apply(res.fig4.eq,1,quantile,prob=0.025),rev(apply(res.fig4.eq,1,quantile,prob=0.975))),col=rgb(1,0,0,0.2),border=NA)
apply(res.fig4.eq,2,lines,col=rgb(0,0,0,0.01),x=1:500)
lines(1:500,apply(res.fig4.eq,1,mean),col="red")
abline(h=expected.diversity,col="black",lty=2)
axis(2,at=seq(0,1,0.2),las=2,hadj=0.8)

par(mar=c(6,5,0,2))

plot(1:500,res.fig4.neq[,1],ylim=c(0.4,1),type="n",xlab="time",ylab="Diversity",xaxs="i",axes=F,las=2)
polygon(x=c(1:500,500:1),y=c(apply(res.fig4.neq,1,quantile,prob=0.025),rev(apply(res.fig4.neq,1,quantile,prob=0.975))),col=rgb(1,0,0,0.2),border=NA)
apply(res.fig4.neq,2,lines,col=rgb(0,0,0,0.01),x=1:500)
lines(1:500,apply(res.fig4.neq,1,mean),col="red")
abline(h=expected.diversity,col="black",lty=2)
rect(xleft=100,xright=110,ybottom=-100,ytop=100,col=rgb(0,0,1,0.2),lty=3)
text(480,0.5,"b",cex=1.5)
axis(2,at=seq(0,1,0.2),las=2,hadj=0.8)
axis(1,at=c(1,100,200,300,400,500))
```

# Figure 7: Posterior distribution of the strength of frequency-dependent transmission for the Merzbach assemblage

The code below plots the results of the ABC analysis for the Merzbach assemblage. The step-by-step description of this analysis is the subject of a forthcoming R package.  

```{r,fig7.load}
# load data
data("post_b_merzbach")
```

```{r,fig7.plot1,fig.width=6,fig.height=8}
# Simple boxplot:
par(mar=c(5,8,1,1))
boxplot(post_b_merzbach,horizontal=T,las=2,xlab="b")
abline(v=0,lty=2,col=2) #unbiased transmission
```


```{r,fig7.plot2,fig.width=6,fig.height=8}
#Extract 50 and 95$ HPDI
hpd50 = t(apply(post_b_merzbach,2,p.interval,prob=0.5))
hpd95 = t(apply(post_b_merzbach,2,p.interval,prob=0.95))


#Restructure into data.frame
data=rbind.data.frame(
data.frame(Model="Equilibrium",b=post_b_merzbach[,1],Phase="All(Eq)"),
data.frame(Model="Var.Population",b=post_b_merzbach[,2],Phase="All(S.Eq)"),
data.frame(Model="Var.Pop & Transmission",Phase="VIII",b=post_b_merzbach[,3]),
data.frame(Model="Var.Pop & Transmission",Phase="IX",b=post_b_merzbach[,4]),
data.frame(Model="Var.Pop & Transmission",Phase="X",b=post_b_merzbach[,5]),
data.frame(Model="Var.Pop & Transmission",Phase="XI",b=post_b_merzbach[,6]),
data.frame(Model="Var.Pop & Transmission",Phase="XII",b=post_b_merzbach[,7]),
data.frame(Model="Var.Pop & Transmission",Phase="XIII",b=post_b_merzbach[,8]),
data.frame(Model="Var.Pop & Transmission",Phase="XIV",b=post_b_merzbach[,9]))

#make joyplot
g=ggplot(data, aes(x = b, y = Phase,fill=Model)) + 
geom_density_ridges(rel_min_height=0.005,alpha=0.5,color="white")+scale_fill_manual(values=c("indianred","royalblue","darkgrey"))+
  theme_ridges(center_axis_labels = TRUE)+
    theme(plot.margin = unit(c(1,1,1,1), "cm"),legend.position = c(0.6, 0.8))+scale_y_discrete(labels=c("All","All","VIII","IX","X","XI","XII","XIII","XIV"))+coord_cartesian(xlim = c(-0.25, 0.25)) +  scale_x_continuous(expand = c(0.01, 0))+
    annotate("rect",xmin = 0.05, xmax = 2.5, ymin = 6.5, ymax = 8.7,alpha=0.7,fill="white")+
           annotate("segment",x = 0.06, xend = 0.077, y = 7.1, yend = 7.1)+
           annotate("segment",x = 0.06, xend = 0.06, y = 7.1, yend = 7.05)+
           annotate("segment",x = 0.077, xend = 0.077, y = 7.1, yend = 7.05)+
            annotate("segment",x = 0.06, xend = 0.06, y = 7.1, yend = 7.15)+
           annotate("segment",x = 0.077, xend = 0.077, y = 7.1, yend = 7.15)+
           annotate("text",x = 0.12,y = 7.1,label="50% HPDI")+
           annotate("segment",x = 0.06, xend = 0.078, y = 6.8, yend = 6.8,linetype="dotted")+
           annotate("text",x = 0.12,y = 6.8,label="95% HPDI")
        
#add HPDI intervals  
  for (i in 1:9)
  {
    g = g +
    annotate("segment",x = hpd50[i,1], xend = hpd50[i,2], y = i+0.3, yend = i+0.3,lwd=0.5)+
    annotate("segment",x = hpd50[i,1], xend = hpd50[i,1], y = i+0.3, yend = i+0.25,lwd=0.5)+
    annotate("segment",x = hpd50[i,2], xend = hpd50[i,2], y = i+0.3, yend = i+0.25,lwd=0.5)+ 
    annotate("segment",x = hpd50[i,1], xend = hpd50[i,1], y = i+0.3, yend = i+0.35,lwd=0.5)+
    annotate("segment",x = hpd50[i,2], xend = hpd50[i,2], y = i+0.3, yend = i+0.35,lwd=0.5)+
    annotate("segment",x = hpd95[i,1], xend = hpd95[i,2], y = i+0.3, yend = i+0.3, linetype="dotted",lwd=0.5)
  }
g
```