How to calibrate coalescent trees to geological time

There has been much interest recently in using the multi-species coalescent to estimate species phylogenies from molecular data. The advantage of the method is that incomplete lineage sorting (the discrepancy between gene trees and the species tree) and ancestral polymorphism are accounted for during phylogenetic inference. Bayesian implementations of the method are computationally expensive, and are best suited for inference among closely related species (or populations).

The figure below shows an example of a phylogeny of mouse lemurs (Microcebus spp.) estimated from RADseq (restriction site associated DNA sequencing) data using the program BPP (Yang, 2015), which implements the multi-species coalescent. Each node in the tree represents a speciation event, with the node ages given as numbers of substitutions per site. The blue bars represent the 95% credibility interval of the node age. For example, the molecular distance from the last common ancestor of M. rufus and M. berthae to the present is 1.29 × 10–4 substitutions per site. If we knew the molecular substitution rate per year for mouse lemurs, we could calibrate the tree to geological time, that is, we could convert the node ages from units of substitution per site to units of real time. I’ll explain how to do so in this post.

[Figure: the mouse lemur (Microcebus) species tree estimated with BPP. Node ages are in substitutions per site, with blue bars showing the 95% credibility intervals.]

A simple method

The mouse lemur genome is roughly 2.8 Gbp long, with perhaps the majority of the genome evolving neutrally. The RADseq data used to build the phylogeny above represent a random sample of the genome, and thus we may assume that most RAD-fragments are evolving neutrally. Estimates of the de novo mutation rate in the human genome are around 1.2 × 10–8 substitutions per site per generation (see Scally and Durbin, 2012, for a review). Given that mouse lemurs are also primates, we could use the human mutation rate as a proxy for the mouse lemur rate. Assuming a generation time of 4 years for mouse lemurs gives roughly 1.2 × 10–8 / 4 = 0.3 × 10–8 substitutions per site per year for the lemur rate. Let τ be the age of a node in the phylogeny in units of substitutions per site. Then it follows that τ = rt, where r is the mutation rate per year, and t the age in years. For the M. rufus and M. berthae divergence we have τ = 1.29 × 10–4, and thus t = τ/r = 1.29 × 10–4 / 0.3 × 10–8 = 43,000 years. We could repeat the procedure for the other nodes in the tree to obtain all the divergence ages. This is perhaps the simplest procedure to calibrate a coalescent tree to geological time.
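
For concreteness, the back-of-the-envelope calculation above can be reproduced in R (a quick sketch of the arithmetic just described):

u <- 1.2e-8    # per-generation mutation rate (human proxy)
g <- 4         # generation time in years
r <- u / g     # = 0.3e-8, mutation rate per year
tau <- 1.29e-4 # age of the M. rufus-M. berthae node (substitutions per site)
tau / r        # [1] 43000, the node age in years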

The procedure outlined above has at least two serious limitations, however. The first is that we used point estimates of τ and the mutation rate, r, thus ignoring the uncertainties associated with these estimates. The second is that the human mutation rate is probably not a good proxy for the mouse lemur rate. We now look at a more sophisticated way to calibrate the tree that overcomes these issues.

A full Bayesian approach

The following discussion assumes you have a basic grasp of Bayesian statistics, are familiar with BPP (or other similar Bayesian coalescent inference software), and with the R environment for statistical computing.

The program BPP uses a Bayesian MCMC approach to estimate the model parameters (i.e. the species phylogeny, the ancestral nucleotide diversities, θ = 4Nu, and the τ’s) from a molecular sequence alignment. Once an MCMC BPP analysis has finished, one obtains a sample from the posterior distribution of the parameters of interest. The sample size is specified by the user when running the software. For example, for the tree above I analysed 80,662 RAD-fragments and took 20,000 samples from the MCMC (see Yoder et al., 2016, for details). In fact, because the analysis is so computationally intensive, I actually ran 8 parallel MCMC analyses, each with 20,000 samples. Here I discuss the results from only one of those analyses, but the results presented in the main paper by Yoder et al. were calculated using all 8 × 20,000 = 160,000 samples.
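
As an aside, pooling the samples from parallel runs is straightforward in R. The sketch below assumes the runs were saved in files called run1/mcmc.txt and run2/mcmc.txt (hypothetical names; the actual name of BPP's sample file is set by the user in the control file):

mcmc1 <- read.table("run1/mcmc.txt", header=TRUE)  # samples from run 1
mcmc2 <- read.table("run2/mcmc.txt", header=TRUE)  # samples from run 2
mcmc <- rbind(mcmc1, mcmc2)  # pool the samples from both runs
nrow(mcmc)                   # combined sample size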

To obtain a full Bayesian estimate of the divergence times of the mouse lemurs we must first construct priors on the per-generation mutation rate, u, and the generation time in years, g. One can then obtain samples from these priors, which can be combined with the samples of τ’s from BPP to obtain the desired divergence times (Angelis and dos Reis, 2015). Let τi be the i-th sample of a node age from the MCMC, and ui and gi the i-th samples obtained from the priors. Then, the i-th sample of the mutation rate per year is given by ri = ui/gi. Finally, the i-th sample of the divergence time in years is ti = τi/ri. One simply repeats the procedure as many times as the MCMC sample size to obtain the full posterior distribution of t. The whole procedure is then repeated for the other nodes in the phylogeny. Below I exemplify the procedure using the mouse lemur phylogeny.

A prior on the mouse lemur generation time: Considerations of the reproductive biology and lifespan in wild and captive populations of mouse lemurs give an estimate of 3 to 4.5 years for the generation time in these species (Yoder et al. 2016). Thus we can construct a gamma prior for g with mean 3.75 years (the mid-point of 3–4.5) and standard deviation 0.375 years (so that 3.75 ± 2 × 0.375 spans 3–4.5 years). Recall that the gamma distribution with parameters α and β has mean α/β and variance α/β². Thus, the prior for g is

f(g) = Gamma(g | α = 100, β = 26.67).  (1)

A prior on the mouse lemur per-generation mutation rate: Estimates of the mutation rate in the lab mouse are about 0.54 × 10–8 substitutions per site per generation (Uchimura et al. 2015), while, as mentioned above, they are about 1.2 × 10–8 substitutions per site per generation in human. It is unclear whether the per-generation mutation rate in mouse lemurs should be similar to that of the lab mouse (which has a similar body size and lifespan) or to that of the human (a fellow primate). We therefore construct a conservative gamma prior on u spanning 0.54–1.2 (× 10–8) substitutions per site per generation. Using the same procedure as above, we get a mean of 0.87 (× 10–8) substitutions per site per generation, and a standard deviation of 0.165 (× 10–8) substitutions per site per generation. The prior on u is then

f(u) = Gamma(u | α = 27.80, β = 31.96).  (2)
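
Both sets of gamma parameters follow from matching the gamma mean, m = α/β, and standard deviation, s = √α/β, to the desired values, which gives α = (m/s)² and β = m/s². A small R helper (mine, not from the original analysis) does the conversion:

# gamma shape (alpha) and rate (beta) from a desired mean (m) and sd (s):
gamma.par <- function(m, s) c(alpha=(m/s)^2, beta=m/s^2)
gamma.par(m=3.75, s=0.375)  # prior on g: alpha = 100, beta = 26.67
gamma.par(m=0.87, s=0.165)  # prior on u: alpha = 27.80, beta = 31.96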

Converting the τ’s into geological times of divergence: The table below summarises the procedure. The first column (i) indicates the sample number. The second column (τrb) is the MCMC posterior sample, obtained with BPP, of the age of the M. rufus and M. berthae divergence. The third column (u) is the sample of the per-generation mutation rate from the gamma prior of Eq. (2). I generated the sample of u in R by running the command 'rgamma(2e4, 27.80, 31.96)'. The fourth column (g) is the sample of the generation time from the gamma prior of Eq. (1), also obtained with R. The fifth column (r) is the prior sample of the per-year mutation rate, obtained by combining the samples of u and g. Finally, the last column (trb) is the resulting sample of the posterior distribution of the geological time of divergence of M. rufus and M. berthae, obtained by combining the sample of r with BPP’s posterior sample of τrb. One can summarise the posterior sample of trb in the usual way to obtain the posterior mean, variance, and 95% credibility interval (CI). The posterior mean is 57,545 years before present, with 95% CI: 10,207–142,193 years. The table below can be loaded into a program such as Tracer to analyse the traces of r and trb, calculate the ESS, etc.

i        τrb (× 10–4)   u (× 10–8)   g (years)   r = u/g (× 10–8)   trb = τ/r (years)
1        1.632          0.742        4.109       0.181              90,372
2        1.531          0.903        4.033       0.224              68,378
3        1.566          1.124        2.956       0.380              41,165
...      ...            ...          ...         ...                ...
20,000   1.599          0.758        3.673       0.206              77,473
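
In R, the whole table can be generated in a few lines. In the sketch below, tau.rb is assumed to hold the 20,000 posterior samples of τrb extracted from BPP's output, in substitutions per site:

n <- 2e4                              # MCMC sample size
u <- rgamma(n, 27.80, 31.96) * 1e-8   # prior sample of u, Eq. (2)
g <- rgamma(n, 100, 26.67)            # prior sample of g, Eq. (1)
r <- u / g                            # prior sample of the per-year mutation rate
t.rb <- tau.rb / r                    # posterior sample of the divergence time
mean(t.rb)                            # posterior mean, should be close to 57,545 years
quantile(t.rb, c(0.025, 0.975))       # 95% CI, close to 10,207-142,193 years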

The advantage of the procedure just outlined is that one obtains the full, correct posterior distribution of the divergence times, integrating over all the uncertainties involved. Note that this procedure allows the use of arbitrary prior distributions for u and g (as long as one can obtain samples from them) without the need for complex analytical calculations. The procedure can also be easily extended to calculate the ancestral effective population sizes, N, from BPP’s posterior sample of the nucleotide diversities (θ = 4Nu), as sketched below. The method extends trivially to the output of other programs and methods (such as, for example, the pairwise sequentially Markovian coalescent, or PSMC).
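
For example, for a given node one divides BPP's posterior sample of θ by the prior sample of u. In the sketch below, theta is assumed to hold the posterior sample of θ for the node of interest, and u is the prior sample from the sketch above:

N <- theta / (4 * u)                   # posterior sample of the effective population size
mean(N); quantile(N, c(0.025, 0.975))  # posterior mean and 95% CI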

You can read more about our mouse lemur research, and see how estimating their times of divergence let us get a glimpse of Madagascar’s forests past, at The Washington Post and National Geographic.

References

Angelis K, and dos Reis M. (2015) The impact of ancestral population size and incomplete lineage sorting on Bayesian estimation of species divergence times. Current Zoology, 61: 874–885.

Scally A, and Durbin R. (2012) Revising the human mutation rate: implications for understanding human evolution. Nature Reviews Genetics, 13: 745–753.

Uchimura A, et al. (2015) Germline mutation rates and the long-term phenotypic effects of mutation accumulation in wild-type laboratory mice and mutator mice. Genome Research, 25: 1125–1134.

Yang Z. (2015) The BPP program for species tree estimation and species delimitation. Current Zoology, 61: 854–865.

Yoder AD, et al. (2016) Geogenetic patterns in mouse lemurs (genus Microcebus) reveal the ghosts of Madagascar’s forests past. Proceedings of the National Academy of Sciences, 113: 8049–8056.

 

Write your own Bayesian MCMC phylogenetic program with R

This is an R tutorial I wrote for a workshop on Bayesian phylogenetics at Bristol University in 2016. The tutorial first introduces a Bayesian problem where we calculate, analytically, the posterior distribution of the molecular branch length between two species. The tutorial then moves on to calculating the posterior by MCMC sampling. Here you can learn MCMC concepts such as proposal densities, proposal window size, mixing, convergence, efficiency, autocorrelation, and effective sample size. By tweaking the parameters you can see how the efficiency of the MCMC and the parameter’s ESS are affected. At the end there is an exercise where you can modify the MCMC program to convert it into a two-parameter model in which you estimate the divergence time and the molecular evolutionary rate between the two species. Send me an email if you would like to see the answers to the exercises. The workshop solves example 7.1 (p. 216) from Ziheng Yang’s book “Molecular Evolution: A Statistical Approach” (OUP; see also example 6.4, p. 188, and problems 7.1 and 7.5, p. 260 in the book).

# Workshop "Bayesian methods to estimate species divergence times"
# July 2015
# Mario dos Reis, Ziheng Yang and Phil Donoghue

# A Bayesian phylogenetics inference program

rm(list=ls())
# ##########################################
# Simple example, calculated analytically
# ##########################################

# The data are a pairwise sequence alignment of the 12s RNA gene sequences
# from human and orangutan. 
# * The alignment is n=948 nucleotides long.
# * We observed x=90 differences between the two sequences.
# (Example from p. 216, Molecular Evolution: A Statistical Approach)

# Our aim is to estimate the molecular distance d, in substitutions per site, 
# between the two sequences. Because the two species are closely related (i.e.,
# the alignment has less than 10% divergence) we use the Jukes and Cantor (1969)
# nucleotide substitution model (JC69).

# The Bayesian estimate of d, is given by the posterior distribution:
# p(d | x) = C * p(d) * p(x | d).

# p(d) is the prior on the distance.
# p(x | d) is the (Jukes-Cantor) likelihood, i.e. the probability of observing
# the data x given d, n. C is a normalising constant, C = 1 / ∫ p(d) p(x | d) dd 

# likelihood under JC69 model, i.e. L = p(x | d):
L <- function(d, x, n) (3/4 - 3*exp(-4*d/3)/4)^x * (1/4 + 3*exp(-4*d/3)/4)^(n - x)

# plot the likelihood:
curve(L(x, 90, 948), from=0, to=0.2, n=500, xlab="distance, d", ylab="likelihood, p(x | d)")

# The maximum likelihood estimate (the highest point in the curve) is
d.mle <- -3/4 * log(1 - 4/3 * 90/948) # [1] 0.1015060
abline(v=d.mle, lty=2)
title(expression(hat(d) == 0.1015))

# For the prior on d, we use an exponential distribution with mean = 0.2
# p(d) = exp(-d/mu) / mu; mu = 0.2.
d.prior <- function(d, mu) exp(-d/mu) / mu

# un-normalised posterior:
d.upost <- function(d, x, n, mu) d.prior(d, mu=mu) * L(d, x=x, n=n)

# To calculate the normalising constant C, we need to integrate the un-normalised posterior:
integrate(d.upost, lower=0, upper=Inf, x=90, n=948, mu=0.2, abs.tol=0)
# 5.167762e-131 with absolute error < 1.9e-135
C  <- 1 / 5.167762e-131

# The posterior distribution function (when x=90, n=948 and mu=0.2) is thus
d.post <- function(d) C * d.upost(d=d, x=90, n=948, mu=0.2)

# Plot posterior distribution:
curve(d.post(x), from=0, to=0.2, n=500, xlab="d", ylab="density")
# Add prior:
curve(d.prior(x, mu=0.2), add=TRUE, lty=2)

# EXERCISE 1: Try adding the likelihood curve to the plots above.
# HINT: You may want to scale the likelihood by a constant like C * 3

# The posterior mean of d is found by integrating ∫ d * p(d | x) dd
f <- function(d) d * d.post(d)
integrate(f, lower=0, upper=Inf, abs.tol=0)
# 0.1021246 with absolute error < 5.5e-06
# Add it to the plot:
abline(v=0.1021246, lty=3) # posterior mean

# Add MLE:
abline(v=d.mle, lty=3, col="red")

# The posterior distribution can be used to answer questions such as
# what is the probability that d > 0.1?
integrate(d.post, lower=0.1, upper=Inf)
# 0.5632372 with absolute error < 1.6e-08, i.e. 56.3%

# The 95% equal-tail credibility interval is (0.08191, 0.12463)
integrate(d.post, lower=0, upper=0.08191) # 2.5%
integrate(d.post, lower=0, upper=0.12463) # 97.5%

# EXERCISE 2 (VERY HARD): Can you work out how to calculate the 99% CI?
# HINT: You need to find the upper bounds where the integral is 0.005
# and 0.995 respectively. The integrate and optim functions in R may be useful.


# #############################################
# The same example analysed using
# Markov chain Monte Carlo (MCMC)
# #############################################

# We now obtain the posterior distribution by MCMC sampling.
# In most practical problems, constant C cannot be calculated (either
# analytically or numerically), and so the MCMC algorithm becomes necessary.

# Draft MCMC algorithm:
# 1. Set initial state for d.
# 2. Propose a new state d* (from an appropriate proposal density).
# 3. Accept or reject the proposal with probability
#      min(1, p(d*)p(x|d*) / p(d)p(x|d))
#      If the proposal is accepted set d=d*, otherwise d=d.
# 4. Save d.
# 5. Go to step 2.
d.mcmcf <- function(init.d, N, w) {
   # init.d is the initial state
   # N is the number of 'generations' the algorithm is run for.
   # w is the 'width' of the proposal density.
   d <- numeric(N+1) # Here we will keep the visited values of d.
   d[1] <- init.d
   acc.prop <- 0 # acceptance proportion
   for (i in 1:N) {
      # here we use a uniform density with reflection to propose a new d*
      d.prop <- abs(d[i] + runif(1, -w/2, w/2))
      #d.prop <- abs(d[i] + rnorm(1, 0, w)) # a normal proposal
      p.ratio <- d.upost(d.prop, x=90, n=948, mu=0.2) /
                 d.upost(d[i], x=90, n=948, mu=0.2)
      alpha <- min(1, p.ratio)
      # if ru < alpha accept the proposal:
      if (runif(1, 0, 1) < alpha) { d[i+1] <- d.prop; acc.prop <- acc.prop + 1 }
      # else reject it:
      else d[i+1] <- d[i]
      # Can you think of a way to print the progress of the MCMC to the screen?
   }
   # print out the proportion of times the proposal was accepted
   cat("Acceptance proportion:", acc.prop/N, "\n")
   return(d) # return the vector of d values visited during the MCMC
}

# Test the algorithm
d.1 <- d.mcmcf(0.5, 100, 0.08)

# plot output:
plot(d.1, type='l', xlab="generation", ylab="d") # This is the 'trace' plot of d.
# the chain wiggles about spending time at different values of d in
# proportion to their posterior density.

# Add 95% CI as reference:
abline(h=c(0.08191, 0.12463), lty=2, col="red")
#text(locator(1), "Burn-in phase")
#text(locator(1), "Stationary phase")

# It may be useful to calculate the time it takes to run N generations:
system.time(d.mcmcf(0.5, 1e4, 0.08))
# ~0.6s on my laptop; that means 1e6 generations would take ~60s, or 1 min.

# Get a very long sample:
d.l <- d.mcmcf(0.5, 1e4, 0.08)
plot(d.l, type='l')
# The trace plot looks like a dense, hairy caterpillar!
# This is good!

# Remove initial state and burn-in
d.l <- d.l[-(1:101)]

hist(d.l, prob=TRUE, n=50, xlab="d", main="") # The sample histogram from the MCMC
curve(d.post(x), add=TRUE, col="red", lwd=2)  # The true posterior distribution.

# Alternatively:
plot(density(d.l, adj=1.5)); rug(d.l)
curve(d.post(x), add=TRUE, col="red", lwd=2)

# EXERCISE 3: What does adj in the density function do? Try different values
# such as adj=0.1 or adj=1.0.
# How long (how many generations, N) do you need to run the MCMC so that
# the density plot is almost identical to the true distribution?

# Now we can get the posterior mean, and 95% CI from the sample:
mean(d.l) # should be close to 0.1021246
quantile(d.l, prob=c(0.025, 0.975)) # should be close to (0.08191, 0.12463)

# Effect of step size, w:
d.1 <- d.mcmcf(0.5, 100, w=0.08) # moderate step size
plot(d.1, type='b', col="black", pch=19, cex=.5) # chain moves about

d.2 <- d.mcmcf(0.5, 100, w=0.01) # tiny step size
lines(d.2, type='b', col="green", pch=19, cex=.5) # baby steps, chain moves slowly

d.3 <- d.mcmcf(0.5, 100, w=1) # large step size
lines(d.3, type='b', col="blue", pch=19, cex=.5) # chain gets stuck easily

# EXERCISE 4: The optimal acceptance proportion is about 30%. Try various step sizes, w,
# and see the effect on the acceptance proportion. What is the optimal w?
# Finding the best step size is called fine-tuning.


# ##########################################
# Autocorrelation, efficiency, effective
# sample size and thinning:
# ##########################################

# States in an MCMC sample are autocorrelated, because each new state is
# either the previous state or a modification of it.

x <- d.l[-length(d.l)] # remove the last state
y <- d.l[-1] # remove the first state
plot(x, y, xlab="current state", ylab="next state", main="lag, k=1")
cor(x, y) # this is the autocorrelation for the d sample for lag 1, ~50-60%

# We can use the acf function to calculate the autocorrelation for various lags:
d.acf <- acf(d.l)  # The correlation is lost after a lag of k=10
d.acf$acf # the actual acf values are here.

# The efficiency of an MCMC chain is closely related to the autocorrelation.
# Intuitively, if the autocorrelation is high, the chain will be inefficient,
# i.e. we will need to run the chain for a long time to obtain a good
# approximation to the posterior distribution.
# The efficiency of a chain is defined as:
# eff = 1 / (1 + 2(r1 + r2 + r3 + ...))
# where ri is the correlation for lag k=i.

eff <- function(acf) 1 / (1 + 2 * sum(acf$acf[-1]))
eff(d.acf)  # ~ 30% What does that mean?

# Efficiency is closely related to the effective sample size (ESS),
# which is defined as: ess = N * eff
N <- length(d.l) # 9,900
length(d.l) * eff(d.acf) # ~ 2,700
# That is, our MCMC sample of size 9,900 has as much information about d as an
# independent sample of size ~2,700. In other words, 30% efficiency means
# the chain is 30% as efficient as an independent sample.

# EXERCISE 5: Run a long chain (say N=1e4) with step size w=0.01.
# Calculate the efficiency and the ESS.
# Do it at home: Try various values of w. What value of w gives the highest ESS?
#                Plot estimated efficiency vs. w.

# Usually, to save hard drive space, the MCMC is 'thinned', i.e. rather than
# saving every single value sampled, we only save every m-th value. The resulting
# sample will be smaller (thus saves hard drive space) but still has nearly as much info
# as the total sample. In our case, we may reduce the sample to ~1/4 the original size:
d.reduced <- d.l[seq(from=1, to=N, by=4)]
mean(d.reduced) # still close to 0.1021246
quantile(d.reduced, c(0.025, 0.975)) # still close to (0.08191, 0.12463)


# ################################################
# MCMC diagnostics.
# How do we know our results are reliable?
# ################################################

# (1) Run the chain from different starting values, the chains
#     should converge to the same region in parameter space:
d.high  <- d.mcmcf(0.4, 1e2, 0.1)
plot(d.high, col="red", ty='l', ylim=c(0, 0.4))

d.medium <- d.mcmcf(0.1, 1e2, 0.1)
lines(d.medium, col="black", ty='l')

d.low <- d.mcmcf(0, 1e2, 0.1)
lines(d.low, col="blue", ty='l')

abline(h=c(0.08191, 0.12463), lty=2)

# (2) Compare the histograms and summary statistics obtained
#     from several long, independent runs:
d.l1 <- d.mcmcf(0.4, 1e3, 0.1)[-(1:101)]
d.l2 <- d.mcmcf(0.0, 1e3, 0.1)[-(1:101)]

plot(density(d.l1, adj=1), col="red")
lines(density(d.l2, adj=1), col="blue")

mean(d.l1); mean(d.l2)
# relative difference
abs(mean(d.l1) - mean(d.l2)) / mean(c(d.l1, d.l2)) * 100

# I cannot stress enough how important it is that you run all your
# MCMC analyses at least twice!
# You cannot assess the robustness of an MCMC if it has been run just
# once!

# Try again using N=1e5 (will take a little to finish)
d.l1 <- d.mcmcf(0.4, 1e5, 0.1)[-(1:101)]
d.l2 <- d.mcmcf(0.4, 1e5, 0.1)[-(1:101)]

# (3) Calculate ESS. Ideally, we should aim for ESS > 1,000, when
#     summary statistics such as the posterior mean and 95% CI can be
#     accurately estimated. ESS > 10,000 offers superb performance, but this
#     is rarely achieved in phylogenetic problems. ESS < 100 is poor:
length(d.l1) * eff(acf(d.l1))  # > 22,000 (nice!)

# (4) Plot running means:
rm1 <- cumsum(d.l1) / (1:length(d.l1))
rm2 <- cumsum(d.l2) / (1:length(d.l2))
plot(rm1, ty='l', ylab="posterior mean", xlab="generation"); lines(rm2, col="red")

# There are more diagnostics tests and statistics that can be applied, for
# example the potential scale reduction statistic (p. 242, Molecular Evolution:
# A Statistical Approach).

# R packages such as CODA have been written to perform MCMC 
# diagnostics, and they may prove quite useful.

# ################################################
# Bayesian MCMC - Molecular dating
# ################################################

# Now consider estimating the time of divergence between human and orangutan
# using the data above, x=90, n=948.
# The molecular distance is the product of geological time, t, and the
# molecular rate, r: d = r * t.

# Our JC69 likelihood based on r and t is:
L.rt <- function(r, t, x, n) L(d=r*t, x=x, n=n)

# The likelihood is now a 2D surface:
r <- seq(from=2e-3, to=6e-3, len=50)
t <- seq(from=10, to=40, len=50)

# creates a table of all combinations of r and t values:
rt <- expand.grid(r=r, t=t)

z <- L.rt(r=rt$r, t=rt$t, x=90, n=948)

# The likelihood surface
image(r, t, matrix(z, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z, ncol=50), nlevels=5, add=TRUE)

# The likelihood looks like a flat, curved, mountainous ridge.
# The top of the ridge is located at d.mle = 0.1015060 = r * t
curve(d.mle/x, add=TRUE, col="blue", lwd=2, lty=2)

# The molecular data are informative about d only, but not about r or t
# separately. We say that r and t are confounded parameters. Any combination
# of r and t that satisfy 0.1015060 = r * t are ML solutions.

# We can make a neat 3D plot of the likelihood:
persp(r, t, matrix(z, ncol=50), zlab="likelihood", theta=40, phi=40)

# The Bayesian method provides a useful way to estimate t and r through
# the use of priors.
# The fossil record suggests that the common ancestor of human-orang lived
# 33.7-11.2 Ma (see Benton et al. 2009 in the Timetree of Life book)
# Thus the prior on t is a normal distribution with mean 22.45 (the midpoint
# of the fossil calibration) and sd 5.6, so that the 95% range is roughly
# 33.7 to 11.2.
t.prior <- function(t, mu=22.45, sd=5.6) dnorm(t, mu, sd)
curve(t.prior(x), from=5, to=40, n=5e2, xlab="t (Ma)", ylab="density")

# For the rate, r, we use an exponential distribution with mean
# mu = 0.10 / 22.45 = 0.00445 (based on the distance MLE and the midpoint
# of the calibration)
r.prior <- function(r, mu=0.10/22.45) dexp(r, 1/mu)
curve(r.prior(x), from=0, to=.02, xlab="r (s/My)", ylab="density")

# We can plot the joint prior:
z1 <- r.prior(rt$r) * t.prior(rt$t)
image(r, t, matrix(z1, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z1, ncol=50), nlevels=5, add=TRUE)

# The un-normalised posterior is thus
rt.upost <- function(r, t, x=90, n=948) {
  r.prior(r) * t.prior(t) * L.rt(r, t, x, n)
}

# We can plot the un-normalised posterior:
z2 <- rt.upost(rt$r, rt$t)
image(r, t, matrix(z2, ncol=50), xlab="r (s/My)", ylab="t (Ma)")
contour(r, t, matrix(z2, ncol=50), nlevels=5, add=TRUE)
curve(d.mle/x, add=TRUE, col="blue", lty=2, lwd=2)

# EXERCISE 6: Can you plot the un-normalised posterior as a 3D surface using persp?

# We are now ready to build a two-parameter MCMC!

# The draft algorithm is as follows:
# 1. Set initial states for t and r.
# 2. Propose a new state r* (from an appropriate proposal density).
# 3. Accept or reject the proposal with probability
#      min(1, p(r*)p(t)p(x|d*) / p(r)p(t)p(x|d))
#      If the proposal is accepted set r=r*, otherwise r=r
# 4. Repeat 2 and 3 for t (i.e. propose t*, accept or reject).
# 5. Save r and t.
# 6. Go to step 2.

# EXERCISE 7: Create a function called rt.mcmcf based on the d.mcmcf function.
# Run your newly written phylogenetics MCMC program! You will need two step sizes,
# w1 and w2, for r and t. Vary the step sizes and check the performance of the
# MCMC chain. Run the analysis several times, and calculate posterior means and
# 95% CIs for r and t. Calculate the efficiency and ESS. A 3D density histogram of the
# MCMC posterior can be obtained using the kde2d function in the MASS package.

# EXERCISE 8: Try using a normal density (rather than uniform) with reflection as
# the proposal distribution for r and t. Does it make a difference?

# EXERCISE (VERY HARD) 9: The data are now a longer alignment with x=900 and n=9480.
# When you have so much data, the likelihood is so tiny that the computer has
# trouble calculating it. In this case it is best to work with the log-likelihood:
# lL.d <- function(d, x, n)  x * log(3/4 - 3*exp(-4*d/3)/4) +
#                            (n - x) * log(1/4 + 3*exp(-4*d/3)/4)
# You then need to calculate the log-posterior:
# log p(d | x) = log p(d) + log p(x | d)

# An MCMC algorithm can be constructed that works only with the log-prior and
# the log-likelihood. The acceptance ratio is then
# exp (log p(d*) + log p(x | d*) - log p(d) - log p(x | d))

# Can you write an MCMC program to calculate the posterior of d for such large
# dataset? All Bayesian phylogenetic programs work with the log-prior and log-likelihood.

# EXERCISE (VERY HARD) 10: The normalising constant in this case is a double integral
# C = 1 /  ∫∫ p(r) p(t) p(x | d) dr dt
# Can you calculate C using R's cubature package (the normal integrate function
# does not work, you need special multidimensional integration packages)?
# The cubature package is not a default package, so you need to use install.packages 
# to get it.

PhD position available in my lab!

I am now recruiting a PhD student to join my group at SBCS, Queen Mary University of London, in the UK.

Integrating fossils and genomes to elucidate the evolutionary history of species through time

The genomes of organisms bear the footprint of their evolutionary history. By combining information from molecular sequences (genomes) with information from the fossil record, inferences about this evolutionary history can be obtained and placed in the right geological context. However, the fossil record is notoriously incomplete, and patterns of genome evolution vary substantially among species, posing important challenges to the study of ancient evolutionary events. Recent advances in Bayesian statistics allow probabilistic modelling of the uncertainties in fossils and genomic evolutionary rates, so that robust inferences about species divergence times, integrating these sources of uncertainty, can now be made. The Bayesian method is now being used to study controversial topics such as the pattern of diversification of birds and mammals relative to the end-Cretaceous mass extinction, or the time of origin of animals over 540 million years ago in the Precambrian.

In this project the student will work on the application and/or development of Bayesian MCMC statistical methods to study species divergences through time. The project will include the collection of genomic and fossil data from online databases, and the use of computer software for analysis. Experience in the use of statistical packages (such as R) and computer programming would be an advantage. The project is suitable for students interested in genomics, palaeontology and Bayesian statistics. Students with backgrounds in the life sciences, earth sciences, physics, computing or maths are welcome to apply.

Deadline: 31st of March 2016. Interviews are expected to take place at the end of April or beginning of May.

Eligibility: open to UK and European students.

Funding: The studentship will cover tuition fees and provide an annual tax-free maintenance allowance for 3 years at Research Councils UK rates (£16,057 in 2015/16).

Reference: dos Reis et al. (2016) Bayesian molecular clock dating of species divergences in the genomics era. Nature Reviews Genetics, 17: 71–80.  go.nature.com/Q4vHzQ

More info and application form: bit.ly/dosreislab-phd

Enquiries: m.dosreisbarros@qmul.ac.uk

Workshop: Bayesian Methods to Estimate Species Divergence Times

Together with Ziheng Yang and Philip Donoghue I am organising a workshop on Bayesian phylogenetic methods. The workshop will take place on the 30th and 31st of July 2015 at the University of Bristol. To register please visit bit.ly/1INFQxZ. This page will be updated regularly, so stay tuned!

Summary

In this two-day workshop we will introduce the theory and practice of Bayesian statistical methods to analyse molecular, morphological and fossil data to estimate the times of divergence of species on a phylogeny. The workshop is aimed at scientists (graduate students, postdocs and established academics) interested in using the methods in their own research. Attendees will learn how to use R to write a basic Bayesian phylogenetics program, and how to use specialised packages such as Beast, MCMCTree and MrBayes for divergence time estimation. The workshop will introduce attendees to the newer methods for combined analysis of morphological and molecular data (so-called total evidence dating). In particular, there may be an opportunity for attendees to learn how to analyse continuous morphological data (such as landmarks) with the program MCMCTree. The workshop is made possible thanks to funding by the BBSRC, UK.

Prerequisites

Attendees are required to have basic knowledge of statistics and of the R environment for statistical computing (for example, you know how to perform basic calculations with R, how to define functions, and how to write ‘for’ loops). Attendees should ideally have some basic experience in phylogenetics (for example, you may have used a phylogenetics program such as RAxML to build a phylogenetic tree). Basic knowledge of the command line or terminal (in Windows, Mac or Linux) is also necessary. Attendees must bring their own laptops to participate in the workshop. Detailed instructions about the software attendees need to install on their laptops will be given close to the workshop date.

Draft Programme

Thursday 30th
09:00-10:00 Registration. Tea and coffee.
10:00-11:00 Lecture: Phil Donoghue on the fossil record.
11:00-12:00 Lecture: Ziheng Yang on Bayesian phylogenetics theory.
12:00-13:30 Lunch.
13:30-15:30 R workshop: How to write your own MCMC phylogenetics program.
15:30-16:00 Tea and coffee break.
16:00-18:00 R workshop: MCMC diagnostics.
18:00-19:00 Wine reception.

Friday 31st
09:00-10:00 Questions and Answers session. Tea and Coffee.
10:00-12:00 Beast workshop: Dating a virus phylogeny.
12:00-13:30 lunch.
13:30-15:30 MCMCTree workshop: Analysis of genome-scale (and morphological?) data.
15:30-16:00 Tea and coffee break.
16:00-18:00 MrBayes workshop: Combined analysis of morphological and molecular data.

Further Notes

  • You will need to bring your own laptop, and install several computer packages on it, in order to participate in the workshop. You are expected to be familiar with using the command line (a.k.a. terminal, shell, etc.) of your laptop’s operating system (whether it is Windows or a Unix-like system such as Mac OS X or Linux). If you need to polish your command line skills, you can find various tutorials here.
  • On the first day there will be a session where you will write your own R program to perform Bayesian MCMC phylogenetics. Thus you are expected to be familiar with the R environment for statistical computing. An extensive introduction to R can be found here. You can also check my own short R tutorial here. Make sure you have installed the latest R version.
  • You need to install the PAML v4.8a package, which contains the MCMCTree program to be used in the workshop. Ideally you should set the Path variable so that the operating system can find the PAML executables. Instructions for Windows Vista are on the PAML website. Linux and Mac OS X users should search online for how to set $PATH.
  • You need to install MrBayes v3.2.5. You may want to add the mb executable to your Path variable as well.
  • You need to install Beast v1.8.2, Tracer v1.6, and FigTree v1.4.2. These programs need Java installed on your system. My understanding is that there are two versions of Beast currently being developed in parallel: 1.8 and 2. We will work with 1.8 only, so make sure you have installed the right version, as there are substantial differences between 1.8 and 2.
  • Please install the software needed well in advance, and do some basic tests to make sure that the programs work correctly on your laptop. Our ability to help people out with software installation issues during the workshop will be very limited.
  • BOOK: An easy to read, yet extensive, introduction to Bayesian statistics and Bayesian phylogenetics can be found in chapters six to eight of “Molecular evolution: A statistical approach“. If you want to get a copy of the book at the discounted price of £26 please contact me by Thursday 9th July. Ziheng Yang will pre-order these in advance. The book is not necessary for the workshop.
  • FOOD: We will provide tea and coffee and lunch on both days at no extra cost. If you have any special dietary requirements please contact me as soon as possible.
  • CANCELLATIONS: If due to unforeseen circumstances you cannot make the workshop, please let us know a.s.a.p. so that we can offer your place to somebody else.

For enquiries contact: Mario dos Reis, mario.barros@ucl.ac.uk


A Crash Introduction to R

I wrote this R tutorial several years ago for people working in my lab, and then completely forgot about it. Apparently, the tutorial had been doing the rounds and was found to be quite useful, so I have decided to post it here. It is a very brief tutorial meant for people who have used R a little and who consider themselves R beginners. Lines that start with a # are comments and are ignored by R; the other lines are R commands. You should cut and paste the commands into the R console and see their effect.


# --- Crash introduction to R ---
#
# Mario dos Reis, June 2008

# This tutorial was prepared with R v 2.5.1 on Linux
# Updated Feb 2015, Tested with R v 3.0.2 on Ubuntu Linux 14.04.2

# First, R can be used as a simple calculator
2 + 5

2 * 10 / pi

# `pi' is a built-in variable, and you can probably guess its value
pi # [1] 3.141593
2 * pi^2 # [1] 19.73921
exp(1) # Exponential function
log(10) # Natural logarithm
log(0) # [1] -Inf
2e4 # Exponential notation: 2x10^4

# The built-in variable `.Last.value' stores the last value computed
5^2 # [1] 25
sqrt(.Last.value) # [1] 5

# Vectors are the simplest data structure in R
w <- 1:10 # this generates a vector with all the numbers from 1 to 10
x <- rnorm(1000) # this generates 1,000 pseudo-random numbers from a normal distribution
y <- rnorm(1000, mean=3, sd=1)

# `x' and `y' are numerical vectors of length 1,000
is.vector(x) # [1] TRUE
length(x) # [1] 1000

# A random cloud of points
plot(x, y)

# A typical boxplot
boxplot(x, y)

# or ... (can you spot the difference?)
boxplot(x, y, names=c("x", "y"), main="A Boxplot")

# vectors can be easily concatenated
z <- c(x, y)
hist(z, n=50) # a histogram

# Performing basic statistics is straightforward
mean(x)
sd(x) # standard deviation
summary(x) # various summary statistics
cor(x, y) # correlation between `x' and `y'

# R can easily handle matrices
# This generates a 10x10 matrix of pseudo-random Poisson numbers
x <- matrix(rpois(100, lambda=5), ncol=10)
# summary statistics still apply
mean(x)
range(x) # maximum and minimum values in `x'

x[1,] # the first row of the matrix

# EXERCISE: can you output the 5th column of the matrix?

# How about calculating the mean for each row in `x'?
# R has control flow structures that allow us to do this easily:
means <- numeric(10) # this creates a vector of zeroes of length 10
for (i in 1:nrow(x)) means[i] <- mean(x[i,])

# EXERCISE: Try using functions like `sd' or `range'

# R has several built-in datasets for didactic purposes
# One such dataset is `women'
women

# Women is a data frame
is.data.frame(women) # [1] TRUE

# Data frames are special types of `lists', which are more general
# data structures than vectors or matrices
names(women)
women$height # this is a vector of heights
women$weight # this is a vector of weights

# EXERCISE: try the mean() and summary() functions on the women dataset

# A simple regression model
plot(women$height, women$weight)

# This looks like a linear relationship, is it?
women.lm <- lm(weight ~ height, data=women)
# Function `lm' fits a linear model of weight on height
# The term `weight ~ height' is called a formula, this is the
# shorthand for the statistical model
# `weight = alpha + beta * height + error ~ N(0, sigma)'

# adds the estimated regression line to the plot
abline(women.lm, col="red")

# `women.lm' is an R object that contains the output of the `lm'
# function. `women.lm' is also a list.
summary(women.lm)
anova(women.lm) # The classical ANOVA table
names(women.lm)

# Individual items within `women.lm' can be extracted with the $ sign
women.lm$residuals # the residuals obtained after fitting the model

plot(women.lm$residuals ~ women$height)
abline(h=0, lty=2) # adds a dashed, horizontal line to the plot
# It looks like a linear model is a very bad representation of the data!

# For comparison
x <- 1:100
y <- 2*x + rnorm(100, sd=10)
plot(y ~ x)
xy.lm <- lm(y ~ x)
plot(xy.lm$residuals ~ x); abline(h=0, lty=2)

# Getting help is easy
help(women) # This gives a description of what is in the women dataset
?women # Has the same effect as above
?lm # A good explanation of what the `lm' function does
help.start() # Opens up your web browser with the R help files

# EXERCISE: Go through the excellent examples provided in the `lm'
# documentation.

# EXERCISE: Try `plot(women.lm)'. Can you obtain all the plots simultaneously?
# Check help(par) and look for the `mfrow' parameter. R has many graphical
# parameters that can be set to gain fine control of all the plots. HINT: try
# par(mfrow=c(2,2)) and then plot(women.lm). Try also `plot(xy.lm)'.

example(lm) # This will run the example in the `lm' documentation automatically

# If you don't know the name of the function you're looking for, try
# `help.search'
help.search("gamma distribution")

# R has fast, built-in pseudo-random number generators for any distribution you
# can think of: binomial, multinomial, beta, gamma, Poisson, etc, etc, etc.
x <- matrix(rcauchy(10000), ncol=100) # The Cauchy distribution
dim(x) # x has 10,000 elements divided into 100 rows and 100 columns
means <- apply(x, 1, mean)
plot(means, type='l') # The mean of the Cauchy distribution is not defined!

# Let's increase the sample size
x <- matrix(rcauchy(1e6), ncol=1e3)
means <- apply(x, 1, mean)
plot(means, type='l') # The sample mean wiggles erratically!

# EXERCISE: what does the `apply' function do? Find out using the `R'
# documentation. `apply' is much more efficient than using `for' loops
# for traversing along the rows or columns of a matrix. It can dramatically
# increase the speed of code originally written with for loops! Other useful
# functions are `sapply', `lapply', `aggregate' and `replicate'.

# EXERCISE: repeat the example above and convince yourself that for the Normal
# distribution the sample mean quickly converges to zero with sample size, while
# the Cauchy distribution will never converge!

# R has also `array' data structures. Arrays are multidimensional matrices
# with an arbitrary number of dimensions!
cube <- array(1:27, dim=c(3,3,3)) # this creates a cube of 3x3x3

cube[3,3,3] # [1] 27

# The contents of `cube' can also be accessed linearly
cube[16] # This is the same as cube[1,3,2]

require(MASS) # This loads the MASS package with its corresponding functions
# we need this for the galaxies and geyser datasets

# Example from p.129 in Venables and Ripley (2002) Modern Applied Statistics with S.
# 4th ed. Springer
# This is the bible for `R' and `S' (S is R's granddaddy!)

gal <- galaxies/1e3 # The speed of galaxies

hist(gal, n=20) # A histogram of galaxy speeds

# We can do something much nicer
plot(x = c(0, 40), y = c(0, 0.18), type="n", bty="l",
xlab="Velocity of galaxy (1000Km/s)", ylab="density")
rug(gal) # adds the data points as a rug at the bottom of the plot
lines(density(gal, width="SJ-dpi", n=256), lty=1)
lines(density(gal, width="SJ", n=256), lty=3)
abline(h=0, lty=2)

# Once you have a nice plot you want to keep
dev.print() # will send the plot to the network printer
dev.print(pdf, file="myplot.pdf") # will save the plot as a pdf file
dev.copy2eps(file="myplot.eps") # will save it as an eps file

# EXERCISE: try `?dev2bitmap' for saving the graphics device as a bitmap file.
# Try `?dev.list' for a series of functions about graphic devices and what they
# do.

# The Old Faithful Geyser (Example from the book above, p.131)
# We'll plot the waiting time between geyser eruptions vs. the previous
# wating time between eruptions
geyser2 <- data.frame(as.data.frame(geyser)[-1,], # removes the first line in geyser
pduration = geyser$duration[-299])
attach(geyser2) # when attaching, the internal data frame names become visible
plot(pduration, waiting)
# without attaching, the command above would have been
# plot(waiting ~ pduration, data=geyser2)

# In a similar fashion to the example above, we can fit a two dimensional kernel
# to the data in order to visualise the bivariate distribution
geyser.k2d <- kde2d(pduration, waiting, n=50)
image(geyser.k2d)
contour(geyser.k2d, add=TRUE)
detach(geyser2)

# EXERCISE: try contour on its own. HINT: set add=FALSE (its default value!)

# We can also plot a neat perspective of the bivariate distribution
persp(geyser.k2d, phi=30, theta=20, d=5,
xlab="Previous duration", ylab="Waiting time", zlab="")

# EXERCISE (HARD): check the documentation for `persp' first. Get rid of the
# bounding box and the facet edges. Colour the facets using the same coloring
# scheme as in the `image' example above (HINT: check out `heat.colors'), add
# light and shades.

# Sometimes we need to calculate confidence intervals for estimates from arbitrary
# distributions. For example, the galaxies example above shows a weird distribution
# and using normality assumptions to calculate a CI is not appropriate. Bootstrap
# is a neat way to calculate non-parametric CI's and it is very easy to do in R.

set.seed(101) # It is useful to set the random seed so as to be able to
# reproduce the sampling.
m <- 1000; gal.boot <- numeric(m)
for (i in 1:m) gal.boot[i] <- median(sample(gal, replace=T))

plot(density(gal.boot))
quantile(gal.boot, p=c(0.025, 0.975)) # 95% CI for the median

# R has very efficient built-in functions for bootstrapping
library(boot)
set.seed(101)
gal.boot2 <- boot(gal, function(x, i) median(x[i]), R = 1000)
gal.boot2
# The CI calculated above is known as the percentile CI. It is not
# considered the best. `boot.ci' will calculate a collection of CI's
# according to different methods. Please consult the book above for details
# about them.
boot.ci(gal.boot2)
plot(gal.boot2) # for some diagnostic plots

# Non-parametric regression with lowess:
help(GAGurine) # concentration of the chemical GAG in 314 children
attach(GAGurine)
plot(GAG ~ Age, pch='+')
lines(lowess(GAG ~ Age, f=1/4), col="red", lwd=2)

# The lowess function fits a non-parametric regression line. It works sort of
# like a running mean, but it has a more sophisticated algorithm. The `f' parameter
# controls the influence of neighbouring values when fitting the regression at
# a particular point. EXERCISE: repeat the example above with `f=2/3'.

# EXERCISE: check out `loess', a newer and more sophisticated version of `lowess'.
# It works in a different way.

# For more info about R check www.r-project.org
# There is a very large list of contributed packages to do any sort of
# analyses imaginable. Check http://www.stats.bris.ac.uk/R/ for an extensive
# list of packages. The R mailing list (available from the main R page above)
# provides an excellent forum to get answers to anything about R.

# A few more advanced examples: (added Feb 2015)

# The `curve' function can be used to plot mathematical functions:
curve(dnorm(x, mean=0, sd=1), from=-4, to=4) # plots the normal distribution

# R adds a bit of blank space between what is being plotted, and the plotting box
# you can remove the space using `xaxs' and `yaxs'. y-axis labels can be set to be
# horizontal with `las', we can change the box with `bty' and we can change
# the axis labels and add a title:
curve(dnorm(x, mean=0, sd=1), from=-4, to=4, xaxs="i", yaxs="i", las=1,
bty="l", ylab="Density", main="The normal distribution")
# you can add additional curves to the same plot:
curve(dnorm(x, mean=0, sd=2), from=-4, to=4, add=TRUE, lty=2)

# more functions:
curve(sin(x), from=0, to=2*pi); abline(h=0, lty=2)
# (if you use a semicolon, you can put several R commands in the same line!)

curve(x^2, from=-1, to=1, xaxs="i", yaxs="i")

# R can calculate integrals
# define the x^2 function:
x2 <- function(x) x^2
integrate(x2, lower=0, upper=1) # 0.3333333 with absolute error < 3.7e-15

# integrate a statistical distribution
integrate(dnorm, lower=0, upper=Inf) # 0.5 with absolute error < 4.7e-05

# R can also calculate derivatives:
D(expression(x^2), name="x")
D(expression(y * x^2), name="x") # derivative with respect to x
D(expression(y * x^2), name="y") # with respect to y

# For any questions, comments, corrections or suggestions, please write to:
# mariodosreis@gmail.com

Hope you find it useful!