---
title: "Analysing data"
author: "Bob Verity"
date: "Last updated: `r format(Sys.Date(), '%d %b %Y')`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Analysing data}
  %\VignetteEngine{knitr::knitr}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(DRpower)
library(kableExtra)
library(tidyverse)
```

Here, we outline the main steps in analysing data using the DRpower Bayesian model. Although we focus on the *pfhrp2/3* use-case here, the same steps can be used to analyse the prevalence of drug resistance markers.

## 1. Estimate prevalence

The main thing we want to estimate is usually the prevalence of *pfhrp2/3* deletions. This is extremely simple to do, and is carried out through the `get_prevalence()` function. We pass this function two sets of values; 1) the number of *pfhrp2/3* deletions observed in each site (numerator), and 2) the total sample size in each site (denominator):

```{r, echo=FALSE}
# define observed data
num_deletions <- c(3, 12, 4)
sample_size <- c(100, 130, 65)

# estimate prevalence
z <- get_prevalence(n = num_deletions,
                    N = sample_size)
```
```{r}
# define observed data
num_deletions <- c(3, 12, 4)
sample_size <- c(100, 130, 65)

# estimate prevalence
get_prevalence(n = num_deletions,
               N = sample_size)
```

We obtain a point estimate of `r z$MAP`% prevalence, with a 95% CrI in the range [`r z$CrI_lower`% to `r z$CrI_upper`%]. When presenting our estimates we should always report the full credible interval and not just the central estimate of `r z$MAP`%, as this may give a misleading impression of how confident we are in this value.

## 2. Compare prevalence against a threshold

The second thing we may want to do is to establish whether the prevalence is above the 5% threshold at the domain level. The probability of being above this threshold is given in the `prob_above_threshold` output above, in this case `r z$prob_above_threshold`. Before conducting this analysis, we should have decided what level of confidence we need in order to accept this hypothesis - we advise using 0.95 by default. In this case, `r z$prob_above_threshold` is below 0.95 so we do not have sufficient evidence to conclude that prevalence is above 5% at the domain level.

Note that it is possible for the CrI to span the 5% threshold, but for the `prob_above_threshold` to still be greater than 0.95. This is because the CrI is two-sided, whereas the hypothesis test is one sided.

## 3. Estimate the ICC

The prevalence estimates above have already taken into account uncertainty in the intra-cluster correlation (ICC). That being said, it can be useful to present our estimate of the ICC to help contextualise results, and to guide future studies. This can be achieved through the `get_ICC()` function, which takes the same two inputs:

```{r, echo=FALSE}
# estimate ICC
z <- get_ICC(n = num_deletions,
             N = sample_size)
```
```{r}
# estimate ICC
get_ICC(n = num_deletions,
        N = sample_size)
```

We estimate that the ICC is around `r z$MAP`, and in the range [`r z$CrI_lower`, `r z$CrI_upper`]. This is a fairly low value of the ICC, and so when we conduct follow-up studies or studies in nearby regions we should take this information into account.