The DRpower model

5. A Bayesian approach

In the DRpower software we take a different approach to analysis that addresses all of the issues listed in section 4. We describe the approach at a high level here - for full mathematical details please see this page.

We start from a random effects framework. Instead of assuming that all individuals have the same probability of carrying the pfhrp2/3 deleted strain, we allow each site to have its own site-level prevalence. We model the mean and the spread of this site-level variation using a random effects distribution - in our case a beta distribution like the one shown below:

[Figure: an example beta distribution describing the spread of site-level prevalence around its mean.]
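
Although the figure is not reproduced here, a distribution of this kind is straightforward to sketch. The snippet below (illustrative values only, not the DRpower defaults) draws site-level prevalences from a beta distribution using one common mean/intra-cluster-correlation parameterisation, alpha = p(1 - r)/r and beta = (1 - p)(1 - r)/r; the two parameters p and r are explained in the next paragraph.

p <- 0.14   # mean of the distribution (domain-level prevalence)
r <- 0.05   # intra-cluster correlation (controls the spread)

shape1 <- p * (1 - r) / r
shape2 <- (1 - p) * (1 - r) / r

# draw a site-level prevalence for each of 5 hypothetical sites
set.seed(1)
rbeta(5, shape1 = shape1, shape2 = shape2)

# visualise the full distribution
curve(dbeta(x, shape1, shape2), from = 0, to = 1,
      xlab = "site-level prevalence", ylab = "density")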

This distribution has a mean value p that represents the domain-level prevalence of deletions (i.e. the main thing we are trying to estimate). It also has a parameter r that represents the level of intra-cluster correlation and hence overdispersion. Larger values of r lead to more spread-out distributions. Our objective is to estimate p while taking into account uncertainty in r, which we can do very easily within a Bayesian framework. A full description of Bayesian statistics is beyond the scope of this document, but in short we:

  1. Write down the probability of the data under the model, which can be done exactly in this case without making any approximations.
  2. Come up with suitable priors on our two unknown parameters: p and r. A good approach here is to use historical data to guide us.
  3. Calculate the posterior distribution of p, integrated over our uncertainty in r. Crucially, this method takes into account everything the data tells us about intra-cluster correlation and propagates this uncertainty into our estimate of the prevalence (a simple sketch of these steps is given below).
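
To make these steps concrete, here is a minimal sketch of the same logic - a beta-binomial likelihood evaluated on a grid, flat priors on p and r, and the posterior of p obtained by summing over r. This is purely illustrative and is not the DRpower implementation; the counts n and N below are hypothetical.

# hypothetical site-level counts of deletions (n) and samples tested (N)
n <- c(3, 1, 2, 5, 3)
N <- c(20, 20, 20, 20, 20)

# step 1: log-likelihood of the data under the beta random effects model
loglike <- function(p, r) {
  a <- p * (1 - r) / r
  b <- (1 - p) * (1 - r) / r
  sum(lchoose(N, n) + lbeta(n + a, N - n + b) - lbeta(a, b))
}

# step 2: flat priors, represented by evenly weighted grids over (0, 1)
p_grid <- seq(0.001, 0.999, length.out = 200)
r_grid <- seq(0.001, 0.999, length.out = 200)

# step 3: joint posterior over the grid, then sum over r to obtain the
# marginal posterior of p
log_post <- outer(p_grid, r_grid, Vectorize(loglike))
post <- exp(log_post - max(log_post))
post_p <- rowSums(post)
post_p <- post_p / sum(post_p)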

We end up with a probability distribution that describes the plausible range of values that p could take. We can use this in many different ways; for example, we can calculate a credible interval (CrI). CrIs have a slightly different and more direct interpretation than confidence intervals: a 95% CrI means there is a 95% probability that the true value lies inside this interval (compare this with the definition of a CI)! This immediately deals with the first 4 out of 5 issues listed in the previous section.
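
Continuing the grid sketch above, a 95% CrI can be read off from the cumulative posterior of p (again, illustrative only - DRpower computes its intervals internally):

cdf <- cumsum(post_p)
CrI_lower <- p_grid[which(cdf >= 0.025)[1]]
CrI_upper <- p_grid[which(cdf >= 0.975)[1]]
c(CrI_lower, CrI_upper)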

The model can also output the posterior probability that prevalence is above the 5% threshold. We can turn this into a hypothesis test by rejecting the null hypothesis whenever this probability is above a certain level. This gives us a binary decision tool just like the traditional approach, but now with all the advantages of the Bayesian method.
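
In the same illustrative sketch, this posterior probability is simply the posterior mass above the threshold, and the decision rule is a comparison against the chosen cutoff:

# posterior probability that the prevalence exceeds the 5% threshold
prob_above_threshold <- sum(post_p[p_grid > 0.05])

# reject the null hypothesis if this probability exceeds the chosen cutoff
prob_above_threshold > 0.95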

We can run the DRpower model on our example data as follows:

get_prevalence(n = n_deletions, N = n_tested)
#>     MAP CrI_lower CrI_upper prob_above_threshold
#> 1 13.98      6.11     28.13               0.9969

The first output is the maximum a posteriori (MAP) estimate, which gives us a point estimate of 13.98% prevalence - very close to our original estimate of 14% from the raw data. The output also contains other summaries that we are interested in: the 95% CrI (6.11% to 28.13%) and the probability that prevalence is above the 5% threshold, which we discuss next.

From the output above, we estimate that there is a 0.9969 probability that the prevalence of pfhrp2/3 deletions is above 5%. By default we recommend using a cutoff of 0.95 when turning this into a hypothesis test, so in this case we can reject the null hypothesis that prevalence is below 5%. We therefore have sufficient evidence to conclude that prevalence is above our target threshold, and so a switch of RDTs is justified.

In summary, the DRpower model provides an alternative way of analysing multi-site prevalence data that has some advantages over traditional methods. It can be used to calculate CrIs and/or to perform hypothesis tests. The next section describes how this framework can be used in study design.