--- title: "The DRpower model" author: "Bob Verity" date: "Last updated: `r format(Sys.Date(), '%d %b %Y')`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{The DRpower model} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r, echo=FALSE, message=FALSE, warning=FALSE} library(DRpower) library(kableExtra) library(tidyverse) ``` ```{r, echo=FALSE} df_ex <- data.frame(cluster = 1:5, n_tested = 50, n_deletions = c(15, 10, 1, 2, 7)) %>% dplyr::mutate(cluster_prevalence = n_deletions / n_tested * 100) n_tested <- df_ex$n_tested n_deletions <- df_ex$n_deletions c <- nrow(df_ex) n <- df_ex$n_tested[1] p <- mean(df_ex$cluster_prev / 100) z <- qnorm(0.975) d <- z * sqrt(p * (1 - p) / (n*c)) Wald_lower <- (p - d)*100 Wald_upper <- (p + d)*100 ``` ```{r, echo=FALSE} d <- z * sd(df_ex$cluster_prevalence) / sqrt(c) standard_lower <- p*100 - d standard_upper <- p*100 + d ``` ```{r, echo=FALSE} v1 <- var(df_ex$cluster_prevalence / 100) / c v2 <- p*(1 - p) / (n*c) Deff <- v1 / v2 Neff <- n*c / Deff ``` ## 5. A Bayesian approach In the *DRpower* software we take a different approach to analysis that deals with all of the issues listed in section 4. We describe the approach at a high level here - for those wanting more mathematical details please [see this page](articles/mathematical_details.html). We start by assuming a [random effects](https://en.wikipedia.org/wiki/Random_effects_model) framework. Instead of assuming that all individuals have the same probability of carrying the *pfhrp2/3* deleted strain, we allow each site to have its own site-level prevalence. We model the mean and the spread of this site-level variation using a random effects distribution - in our case a [beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) like the one shown below: ```{r, echo=FALSE} point1 <- 14 point2 <- 30 data.frame(x = seq(0, 1, 0.001)) %>% dplyr::mutate(y = dbeta(x, 10, 40)) %>% ggplot() + theme_minimal() + geom_ribbon(aes(x = x*100, ymax = y, ymin = 0), fill = "dodgerblue3", alpha = 0.5) + geom_line(aes(x = x*100, y = y)) + geom_vline(xintercept = 20, linetype = "dashed") + annotate(x = 28, y = 8, geom = "text", label = "p = 20%") + geom_point(aes(x = point1, y = 0)) + geom_segment(aes(x = point1, y = 1, xend = point1, yend = 0.2), arrow = arrow(length = unit(0.3, "cm"))) + annotate(x = point1, y = 1.3, geom = "text", label = "14%") + geom_point(aes(x = point2, y = 0)) + geom_segment(aes(x = point2, y = 1, xend = point2, yend = 0.2), arrow = arrow(length = unit(0.3, "cm"))) + annotate(x = point2, y = 1.3, geom = "text", label = "30%") + ylim(c(0, 10)) + xlab("Site-level prevalence (%)") + ylab("Probability density") ``` This distribution has a mean value $p$ that represents the domain-level prevalence of deletions (i.e. the main thing we are trying to estimate). It also has a parameter $r$ that represents the level of intra-cluster correlation and hence overdispersion. Larger values of $r$ lead to more spread out distributions. Our objective is to estimate $p$ while taking into account uncertainty in $r$, which we can do very easily within a Bayesian framework. A full description of [Bayesian statistics](https://en.wikipedia.org/wiki/Bayesian_statistics) is beyond the scope of this document, but in short we: 1. Write down the probability of the data under the model, which can be done exactly in this case without making any approximations. 2. Come up with suitable [priors](https://en.wikipedia.org/wiki/Prior_probability) on our two unknown parameters; $p$ and $r$. A good approach here is to use [historical data](articles/historical_analysis.html) to guide us. 3. Calculate the posterior distribution of $p$, integrated over our uncertainty in $r$. Crucially, this method **takes into account everything the data tells us about intra-cluster correlation and propagates this uncertainty into our estimate of the prevalence**. We end up with a probability distribution that describes the plausible range of values that $p$ could take. We can use this in many different ways, for example we can calculate a credible interval (CrI). CrIs have a slightly different and more direct interpretation than confidence intervals. A 95% CrI means there is a 95% probability that the true value lies inside this interval (compare this with [the definition of a CI](https://thestatsgeek.com/2020/11/21/interpretation-of-frequentist-confidence-intervals-and-bayesian-credible-intervals/))! This immediately deals with the first 4 out of 5 issues listed in the previous section. We can also output from the model the posterior probability that prevalence is above the 5% threshold. We can turn this into a hypothesis test by rejecting the null hypothesis whenever this probability is above a certain level. This gives us a binary decision tool just like the traditional approach, but now with all the advantages of the Bayesian method. We can run the *DRpower* model on our example data as follows: ```{r} get_prevalence(n = n_deletions, N = n_tested) ``` ```{r, echo=FALSE} gp <- get_prevalence(n = n_deletions, N = n_tested) ``` The first output is the *maximum a posteriori* (MAP) estimate, which gives us a point estimate of `r gp$MAP`% prevalence - very close to our original estimate of `r round(p*100, 2)`% from the raw data. There are also [other summaries](articles/summarise_prevalence.html) that we might be interested in. Based on the output above, we also estimate that there is a `r gp$prob_above_threshold` probability that the prevalence of *pfhrp2/3* deletions is above 5%. By default we recommend using a cutoff of 0.95 when turning this into a hypothesis test, so in this case we can reject the hypothesis that prevalence is below 5%. Based on this analysis, therefore, we have sufficient evidence to conclude that prevalence is above our target threshold, and so a switch of RDTs is justified. In summary, the DRpower model provides an alternative way of analysing multi-site prevalence data that has some advantages over traditional methods. It can be used to calculate CrIs, and/or it can be used in a hypothesis testing framework. The next section describes how this framework can be used in study design.