---
title: "Basic tutorial"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Basic tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, echo=FALSE, message=FALSE, warning=FALSE}
library(kableExtra)
```

PlasmoSim returns two types of output:

1. Population-level data giving basic epidemiological measures (e.g. prevalence) on each day of simulation
2. Individual-level data, including parasite genotypes found in the host at specific sampling points

This tutorial gives example of both outputs, and shows basic visualisation and interpretation functions.

Load necessary packages:
```{r}
library(PlasmoSim)
library(tidyverse)
```

## Daily trends

```{r, echo=FALSE}
set.seed(3)
```

First, we need to define our desired sampling strategy using a data.frame. This specifies the demes (partially isolated sub-populations) that we will sample from, at what time points, and how many hosts will be drawn at random from the population:

```{r}
# define individual-level sampling via a data.frame
sample_df <- data.frame(deme = 1,
                        time = 365,
                        n = 100)
```

Now we run the main simulation function:

```{r}
# run simulation
sim1 <- sim_falciparum(H = 1000,                     # human population size
                       M = 5000,                     # adult female mosquito population size
                       seed_infections = 100,        # number of infected hosts at time 0
                       L = 24,                       # number of loci
                       sample_dataframe = sample_df  # sampling data.frame
                       )
```

Daily output is stored in long format, which makes it easy to produce plots:

```{r, out.width="70%", out.height="70%"}
# basic plot of prevalence in "I" state (infected)
sim1$daily_values |>
  ggplot() + theme_bw() +
  geom_line(aes(x = time, y = 100 * I / 1000)) +
  ylab("Prevalence (%)")
```

We can see that prevalence jumped to 10% (100 infected hosts in the population of 1000) on day 13. This is because we set the simulation going with 100 seed infections, which here means new liver-stage infections. The default time from liver-stage to blood-stage (the intrinsic incubation period) is set to \code{u = 12} by default, hence these all emerging together on day 13. After this point we can see dynamic and stochastic changes in the number infected.

## Individual-level output

Individual-level output is also returned in long format. We will use the `kable` package to make this slightly easier to read:

```{r}
# take a peek at basic individual-level output, without the haplotypes column
sim1$indlevel |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) |>
  scroll_box(width = "1000px", height = "400px")
```

The first few columns tell us when and where (i.e. which deme) sampling occurred, the ID of the host, and whether they were positive for malaria parasites. The next two columns give genetic data (positive samples only). The `haplotypes` column gives the raw information at all 24 loci. Note that each host can be infected with multiple strains, meaning this element is actually a matrix with one row for each strain and one column for each locus. We can see this by printing out the full element for a malaria-positive host:

```{r}
# print raw genetic data
sim1$indlevel |>
  filter(sample_ID == 394) |>
  pull(haplotypes)
```

Returning to the table, the final column gives the `haplo_ID`. This is a hash of the information contained in each row of the `haplotypes` matrix, meaning each unique combination of values over all loci will be given a unique name. This can be very useful when we only care about unique genotypes and not the locus-by-locus information contained in those genotypes. For example, for the same individual as before:

```{r}
# print haplo_ID
sim1$indlevel |>
  filter(sample_ID == 394) |>
  pull(haplo_ID)
```

But what do the values in the `haplotypes` matrix actually mean? Although we have described them as haplotypes, they do not (yet) represent genetic information. Instead, each value specifies the ancestor that the information is descended from at the start of the simulation. For example, if we see a value 68 then we know that, at this locus, the information eventually traces back to the 68th seed infection at the start of the simulation. There are two reasons for encoding information like this:

1. It allows us to ask questions about identify by descent (IBD), and not just identity by state (IBS). If two samples have the same value at the same locus then we know they are descended from a common ancestor some time between the start of the simulation and the present day.
2. We can always go from this ancestral encoding to genotypes; we simply have to define a genetic value for each unique ancestor at each locus (we will see an example of this). However, the converse is not true; we cannot always go back from genotypes to ancestry.

Note that two samples having the value 68 does not imply that they are both *direct* descendants of the 68th seed infection. Rather, it implies that these samples have a common ancestor some time between the start of the simulation and the present day, and *this common ancestor* is descended from the 68th seed infection. It is much more likely that the common ancestor is much more recent than going all the way back to the start of the simulation.

Look again at the `haplotypes` matrix for this poly-clonally infected host. Notice that each row contains blocks of the values 68 and 42:

```{r}
# print raw genetic data
sim1$indlevel |>
  filter(sample_ID == 394) |>
  pull(haplotypes)
```

This pattern is due to recombination. We could use this matrix to identify blocks of IBD within a sample, or compare this matrix to another to explore IBD between samples.