This vignette demonstrates model comparison in MALECOT. It covers:
This tutorial assumes some prior knowledge about MALECOT, so if you are completely new to the program we recommend working through the simpler bi-allelic tutorial first.
When carrying out Bayesian analysis it is useful to make a distinction between two types of analysis:
As an example, we might create a model ℳ in which we write down the probability of the observed data x as a function of the unknown allele frequencies p. In other words, we write down the likelihood Pr(x | p, ℳ).
In Bayesian parameter estimation we are trying to get at the posterior probability Pr(p | x, ℳ). We typically do this using MCMC, which produces a series of draws from the posterior distribution.
But what if we want to know the posterior probability of the model, rather than the parameters of the model? In other words, we want to know Pr(ℳ | x). Calculating this quantity requires that we integrate over all unknown parameters:
Pr(ℳ | x) = ∫Pr(ℳ, p | x)dp
This is often an extremely high-dimensional integral - for example in MALECOT there could easily be hundreds of unknown parameters - making it computationally infeasible by most methods. Regular MCMC cannot help us here because it only produces draws from the posterior distribution rather than normalised values.
One way around this is to use an advanced MCMC technique known as thermodynamic integration (TI). In TI we run multiple MCMC chains, each at a different “rung” on a temperature ladder. The hotter the chain, the flatter the target distribution. The log-likelihoods over all chains are then combined in a single calculation that - by what can only be described as mathematical magic! - is asymptotically equal to log [Pr(x | ℳ)]. We can then apply a prior over models, for example giving each model equal weight, to arrive at the desired posterior value Pr(ℳ | x). Generalised thermodynamic integration (GTI) differs from regular TI in that it uses a slightly different calculation when combining information across rungs that leads to lower bias and higher precision.
Hopefully this is enough background to run thermodynamic MCMC in MALECOT, and to understand the results. For those eager to understand all the mathematical details, see this vignette.
For the sake of this tutorial we will use simulated bi-allelic data drawn from K = 3 subpopulations. We create a new project and bind this data:
# create project and bind data
myproj <- malecot_project()
myproj <- bind_data_biallelic(myproj, df = mysim$data, ID_col = 1, pop_col = 2)
For our first parameter set we will assume a very basic model with default parameters; which means a Poisson prior on COI, a flat prior on allele frequencies and no error estimation:
We will then run the MCMC for values of K from 1 to 5. What makes this
thermodynamic MCMC rather than regular MCMC is the
rungs
argument, which dictates the number of rungs on the
temperature ladder. We will opt for 10 rungs for now. Be warned - this
MCMC will take considerably longer to run than regular MCMC, so be
prepared to go make yourself a cup of tea!
# run thermodynamic MCMC
myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2,
samples = 1e4, rungs = 10, pb_markdown = TRUE)
## Running MCMC for K = 1
## Burn-in phase
## | |======================================================================| 100%
## converged within 200 iterations
## Sampling phase
## | |======================================================================| 100%
## completed in 9.8761 seconds
##
## Running MCMC for K = 2
## Burn-in phase
## | |======================================================================| 100%
## converged within 200 iterations
## Sampling phase
## | |======================================================================| 100%
## completed in 17.9571 seconds
##
## Running MCMC for K = 3
## Burn-in phase
## | |======================================================================| 100%
## converged within 300 iterations
## Sampling phase
## | |======================================================================| 100%
## completed in 23.9591 seconds
##
## Running MCMC for K = 4
## Burn-in phase
## | |======================================================================| 100%
## converged within 200 iterations
## Sampling phase
## | |======================================================================| 100%
## completed in 29.3485 seconds
##
## Running MCMC for K = 5
## Burn-in phase
## | |======================================================================| 100%
## converged within 200 iterations
## Sampling phase
## | |======================================================================| 100%
## completed in 34.8111 seconds
##
## Processing results
## Total run-time: 1.97 minutes
Notice that convergence times are much longer than under regular MCMC. This is because the MCMC is only deemed to have converged when every chain has converged, meaning we are only as strong as the weakest link. When running thermodynamic MCMC it is therefore important to keep an eye on convergence, and to increase the number of burn-in iterations if needed.
There are more moving parts in thermodynamic MCMC, meaning we have more things to check. First, we should perform the same diagnostic checks as for regular MCMC:
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the MALECOT package.
## Please report the issue at <https://github.com/bobverity/malecot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## ℹ The deprecated feature was likely used in the MALECOT package.
## Please report the issue at <https://github.com/bobverity/malecot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The plot_loglike_diagnostic()
function uses the cold
rung by default (in this case the 10th rung), or we can use the
rung
argument to produce plots for any given rung, for
example:
Running get_ESS()
now prints out the effective sample
size of every rung:
## rung1 rung2 rung3 rung4 rung5 rung6 rung7 rung8
## 807.0128 747.3621 828.9853 1023.9434 1056.7947 875.3339 962.8699 892.5152
## rung9 rung10
## 891.3583 887.8006
We interpret ESS for hot chains the same way as for the cold chain -
as the number of independent samples that we have obtained from the
target distribution once autocorrelation has been accounted for. If we
see small values of the ESS (less than one thousand as a rule of thumb)
then we should be concerned that that particular rung has has not
explored the space well, and we should repeat the analysis with a larger
number of samples. For the sake of this tutorial we will load in the
result of re-running this MCMC with samples = 1e5
, which
took around 20 minutes.
A useful plotting function when working with multiple rungs is the
plot_loglike()
function, which plots the 95% quantiles of
the log-likelihood over every rung:
This plot should be always-increasing from left to right, aside from
small variations due to the random nature of MCMC. If any rungs stand
out from the overall pattern then it is worth running
plot_loglike_diagnostic()
on these particular rungs, as
they may have mixed badly. As with all other checks, this should be
performed on every explored value of K.
The GTI method takes the log-likelihood values above and multiplies
them by weights, leading to a GTI “path”. The area between this path and
the zero line is our estimate of the log-evidence. We can visualise this
path using the plot_GTI_path()
function:
As our evidence estimate is derived from the area between the line and zero, it is important that this path is not too jagged. We are essentially approximating a smooth curve with a series of straight line segments, so if the straight lines cut the corners off the smooth curve then the area will be too large or too small, leading to a biased estimate. There are two ways that we can mitigate this:
GTI_pow
argumentIncreasing the number of rungs will obviously lead to a smoother
path, and it will also help with MCMC
mixing, but it comes at the cost of slowing down the MCMC. The
GTI_pow
argument changes the curvature of the path by
modifying the log-likelihood weights, with large values leading to a
more concave path. Ideally we want to choose GTI_pow
such
that the path is as straight as possible, as this will lead to smallest
difference between the true curve and the discrete approximation. The
plot above is slightly concave using the default value
GTI_pow = 3
, but it is not pathalogically concave. If we
were producing results for publication and had a computer sitting idle
overnight then we could consider re-running the MCMC with more rungs and
a lower GTI_pow
, but otherwise these results are fine.
Once we are happy with our log-likelihood estimates and our GTI path,
we can use our log-evidence estimates to compare values of K. The raw estimates can be plotted
using the plot_logevidence_K()
function, which plots 95%
credible intervals of the log-evidence for each value of K:
We can see immediately that the model favours K = 3 in this case, which agrees
with our simulation parameters. If we had seen large credible intervals
at this stage then we could re-run the model with a larger number of
samples
, but in this example the intervals are
non-overlapping and the signal is clear so there is no need.
We can also use the plot_posterior_K()
function to plot
the posterior probability of each value of K, which is obtained by taking these
raw estimates out of log space and applying an equal prior over K:
This second plot is often easier to interpret than the first as it is
in units of ordinary probability. In this case we can see that there is
a >99% posterior probability that the data were drawn from K = 3 subpopulations (which we know
to be true). Again, if we saw large credible intervals at this stage
then it would be worth re-running the MCMC with a larger number of
samples
, but here there is no need.
One of the major advantages of the model evidence is that we can use it to compare wider evolutionary models, i.e. different parameter sets. Mathematically this is as simple as summing the evidence for each value of K, weighted by the prior.
We can test this by creating a second parameter set that differs from the first in that in that we now estimate the error terms:
# create new parameter set
myproj <- new_set(myproj, name = "error model", estimate_error = TRUE)
myproj
## DATA:
## data format = biallelic
## samples = 100
## loci = 24
## pops = 3
## missing data = 0 of 2400 gene copies (0%)
##
## PARAMETER SETS:
## SET1: simple model
## * SET2: error model
##
## ACTIVE SET: SET2
## lambda = 1
## COI model = poisson
## COI max = 20
## estimate COI mean = TRUE
## estimate error = TRUE
## e1_max = 0.2
## e2_max = 0.2
We then need to run thermodynamic MCMC for this parameter set over the same range of K values. For the sake of this tutorial we will save time by loading results from file:
## Running MCMC for K = 1
## Burn-in phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## completed in 0.0453773 seconds
##
## Running MCMC for K = 2
## Burn-in phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## completed in 0.0681553 seconds
##
## Running MCMC for K = 3
## Burn-in phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## completed in 0.0849362 seconds
##
## Running MCMC for K = 4
## Burn-in phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## completed in 0.102442 seconds
##
## Running MCMC for K = 5
## Burn-in phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## Warning: convergence still not reached within 10 iterations
## Sampling phase
## | |======= | 10% | |============== | 20% | |===================== | 30% | |============================ | 40% | |=================================== | 50% | |========================================== | 60% | |================================================= | 70% | |======================================================== | 80% | |=============================================================== | 90% | |======================================================================| 100%
## completed in 0.120855 seconds
# uncomment to run MCMC
#myproj <- run_mcmc(myproj, K = 1:5, burnin = 1e4, converge_test = 1e2, samples = 1e5, rungs = 10)
# plot diagnostics
plot_loglike_dignostic(myproj, K = 3)
## rung1 rung2 rung3 rung4 rung5 rung6 rung7 rung8
## 10.000000 10.000000 10.000000 10.000000 10.000000 2.906066 1.687337 2.059786
## rung9 rung10
## 1.603670 2.409224
Assuming we are happy with the MCMC behaviour we can go ahead and plot the posterior error estimates:
We know from the simulation parameters that there are no errors in the data, meaning this extra flexibility of estimating the error is actually taking the model further away from the truth. Looking at the posterior credible intervals we can see that the model has struggled to arrive at precise estimates of the error rates, instead exploring the full prior range [0, 0.2] quite evenly.
We continue with the same analyses as in the previous section:
# use gridExtra package to arrange plots
library(gridExtra)
# produce plots and arrange in grid
plot1 <- plot_loglike(myproj, K = 3)
plot2 <- plot_GTI_path(myproj, K = 3)
plot3 <- plot_logevidence_K(myproj)
plot4 <- plot_posterior_K(myproj)
grid.arrange(plot1, plot2, plot3, plot4)
As with the previous model, the evidence is in favour of K = 3 in this example.
Finally, we come to the issue of comparing parameter sets. Just as
with the evidence estimates over K, we can plot the log-evidence of a
parameter set using the plot_logevidence_model()
function,
or the posterior distribution over models assuming an equal prior using
the plot_posterior_model()
function:
# produce evidence plots over parameter sets
plot1 <- plot_logevidence_model(myproj)
plot2 <- plot_posterior_model(myproj)
grid.arrange(plot1, plot2, nrow = 1)
Here we can see that the former, simpler model has a higher evidence than the more complex error estimation model. To understand this, we can imagine the MCMC exploring the parameter space under each model. In the simple model the error is fixed at the correct value of zero, while in the more complex model the MCMC has a finite chance of exploring different error values, all of which are sub-optimal in terms of the likelihood. So the likelihood will, on average, be higher under the simple model. In other words, in the complex model we have introduced additional flexibility that has not been compensated for by a commensurate increase in the likelihood, and so the model evidence naturally punishes the model for being over-fitted. In this example we would conclude that a zero-error model is a better fit to the data, and so we would most likely only report detailed results from this model.
It is clear from this tutorial that thermodynamic MCMC over multiple temperature rungs can take considerably longer to run than regular MCMC. Hence, the next tutorial descibes how to run MALECOT in parallel over multiple cores.