Basic tutorial

PlasmoSim returns two types of output:

  1. Population-level data giving basic epidemiological measures (e.g. prevalence) on each day of simulation
  2. Individual-level data, including parasite genotypes found in the host at specific sampling points

This tutorial gives example of both outputs, and shows basic visualisation and interpretation functions.

Load necessary packages:

library(PlasmoSim)
library(tidyverse)
#> ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
#> ✔ dplyr     1.1.4     ✔ readr     2.1.5
#> ✔ forcats   1.0.0     ✔ stringr   1.5.1
#> ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
#> ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
#> ✔ purrr     1.0.2     
#> ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter()     masks stats::filter()
#> ✖ dplyr::group_rows() masks kableExtra::group_rows()
#> ✖ dplyr::lag()        masks stats::lag()
#> ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Individual-level output

Individual-level output is also returned in long format. We will use the kable package to make this slightly easier to read:

# take a peek at basic individual-level output, without the haplotypes column
sim1$indlevel |>
  kable() |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed")) |>
  scroll_box(width = "1000px", height = "400px")
time deme sample_ID positive haplotypes haplo_ID
365 1 7 FALSE NULL NULL
365 1 15 FALSE NULL NULL
365 1 31 FALSE NULL NULL
365 1 35 FALSE NULL NULL
365 1 54 FALSE NULL NULL
365 1 71 FALSE NULL NULL
365 1 81 FALSE NULL NULL
365 1 82 FALSE NULL NULL
365 1 89 FALSE NULL NULL
365 1 93 TRUE 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81 7bb9cce5c9911a60f4e7f1e827615768
365 1 95 FALSE NULL NULL
365 1 101 FALSE NULL NULL
365 1 106 FALSE NULL NULL
365 1 113 FALSE NULL NULL
365 1 120 TRUE 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39, 39 468815117ba482205c8aa5d5eca5d558
365 1 136 FALSE NULL NULL
365 1 142 TRUE 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14 5a04f475aaafb7d8d57fb74d15afdf98
365 1 159 FALSE NULL NULL
365 1 174 FALSE NULL NULL
365 1 176 TRUE 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53 0d05caa56fd36e9a1b89ec813cef6de7
365 1 183 FALSE NULL NULL
365 1 194 TRUE 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17, 17 a7ec324ac08f16ea346bc7d8a3c6d4bc
365 1 196 FALSE NULL NULL
365 1 204 FALSE NULL NULL
365 1 227 FALSE NULL NULL
365 1 240 FALSE NULL NULL
365 1 263 FALSE NULL NULL
365 1 275 FALSE NULL NULL
365 1 284 FALSE NULL NULL
365 1 287 TRUE 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21, 21 5eba0a09b0d61a2da0d5bd4fffc6516f
365 1 288 FALSE NULL NULL
365 1 291 FALSE NULL NULL
365 1 350 FALSE NULL NULL
365 1 354 FALSE NULL NULL
365 1 356 TRUE 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47, 47 1938d4f28086e6687ede6b64163e9e3d
365 1 363 FALSE NULL NULL
365 1 374 FALSE NULL NULL
365 1 381 FALSE NULL NULL
365 1 390 TRUE 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89, 47, 89 1938d4f28086e6687ede6b64163e9e3d, 4c364f6f36b13c5e685e9a946b2c38c1
365 1 394 TRUE 68, 42, 42, 68, 68, 68, 42, 68, 68, 42, 42, 68, 68, 42, 42, 68, 68, 42, 42, 68, 68, 42, 42, 68, 42, 42, 42, 68, 42, 42, 42, 68, 42, 42, 42, 68, 68, 42, 42, 68, 68, 42, 42, 68, 68, 42, 42, 68, 68, 42, 42, 68, 68, 42, 42, 68, 68, 68, 42, 68, 42, 68, 42, 68, 42, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 42, 68, 68, 68, 42, 68, 68, 68, 42, 68, 68 3a41ce4c33d3dd5808b4c225bfb16708, 2caecb59b554231a5158876dcfdda9c5, d0187d454d53dd08b30854f451e17134, 186af5c5ac9bf632066ea1c6433dac5d
365 1 397 FALSE NULL NULL
365 1 400 FALSE NULL NULL
365 1 401 FALSE NULL NULL
365 1 412 TRUE 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14 5a04f475aaafb7d8d57fb74d15afdf98
365 1 414 FALSE NULL NULL
365 1 415 FALSE NULL NULL
365 1 417 FALSE NULL NULL
365 1 420 FALSE NULL NULL
365 1 424 TRUE 47, 47, 21, 47, 47, 21, 21, 47, 47, 21, 21, 47, 21, 21, 21, 47, 21, 21, 47, 47, 21, 21, 21, 47, 47, 21, 21, 47, 47, 21, 21, 47, 47, 21, 21, 47, 47, 21, 21, 47, 47, 21, 21, 47, 47, 21, 21, 21, 47, 21, 47, 21, 47, 21, 47, 21, 47, 21, 47, 47, 21, 21, 21, 47, 21, 21, 21, 47, 21, 21, 21, 47, 21, 47, 21, 21, 21, 47, 21, 21, 21, 47, 21, 21, 47, 47, 47, 21, 47, 47, 47, 21, 47, 47, 47, 47 67623df4fcabc563d1c6f770084b8404, cd8b80d9780e7baa568ae43a9e427fed, e1a6e7b41ade666e1208f821ff536cb3, e629f1bebe0b02832ecc3fe63f457242
365 1 430 FALSE NULL NULL
365 1 436 FALSE NULL NULL
365 1 438 TRUE 42, 68, 68, 42, 68, 68, 42, 68, 68, 42, 42, 42, 42, 42, 68, 68, 42, 68, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 68, 42, 42, 42, 42, 42, 42, 42, 42, 42, 68, 42, 42, 68, 42, 68, 68, 42, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 68, 42, 68, 68, 42, 68, 68, 42, 68 d50ccfeb495e6304eb2f50069f03decb, 946e804ff63228416dd65529939c9464, 3db0498dafecf7738dd73ec80d75089e
365 1 451 TRUE 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86, 86 a3010796957ecae73d6f2f663164dc88
365 1 456 FALSE NULL NULL
365 1 457 FALSE NULL NULL
365 1 463 FALSE NULL NULL
365 1 473 FALSE NULL NULL
365 1 479 FALSE NULL NULL
365 1 480 FALSE NULL NULL
365 1 498 FALSE NULL NULL
365 1 517 FALSE NULL NULL
365 1 524 FALSE NULL NULL
365 1 538 FALSE NULL NULL
365 1 546 TRUE 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89, 89 4c364f6f36b13c5e685e9a946b2c38c1
365 1 547 FALSE NULL NULL
365 1 549 FALSE NULL NULL
365 1 551 TRUE 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14 5a04f475aaafb7d8d57fb74d15afdf98
365 1 573 FALSE NULL NULL
365 1 587 TRUE 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53, 53 0d05caa56fd36e9a1b89ec813cef6de7
365 1 590 FALSE NULL NULL
365 1 596 FALSE NULL NULL
365 1 641 FALSE NULL NULL
365 1 649 FALSE NULL NULL
365 1 652 TRUE 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81, 81 7bb9cce5c9911a60f4e7f1e827615768
365 1 681 TRUE 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87, 87 8ff6b39ffbb37ad530e1e02242acd424
365 1 688 FALSE NULL NULL
365 1 691 TRUE 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55, 55 83cacee044af7e1edd2c042d6d837a1f
365 1 703 FALSE NULL NULL
365 1 708 FALSE NULL NULL
365 1 726 FALSE NULL NULL
365 1 745 FALSE NULL NULL
365 1 754 TRUE 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42 e582bccedae1a8c866136c0c59119502
365 1 781 FALSE NULL NULL
365 1 785 FALSE NULL NULL
365 1 811 FALSE NULL NULL
365 1 818 FALSE NULL NULL
365 1 830 FALSE NULL NULL
365 1 844 TRUE 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 88, 64, 88, 64, 88, 64, 88, 64, 88, 64, 88, 88, 88, 88 195c30a32df1dd776b6aa4a20bc2f924, 0ce52b07bc461926e17172e8f37ab893
365 1 866 FALSE NULL NULL
365 1 872 TRUE 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85, 85 d9fc74b6c8e22d6169af05082e17798d
365 1 886 FALSE NULL NULL
365 1 897 TRUE 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64, 64 14e7f028287c07ea9e875126193d6aca
365 1 911 FALSE NULL NULL
365 1 931 FALSE NULL NULL
365 1 947 FALSE NULL NULL
365 1 959 FALSE NULL NULL
365 1 961 FALSE NULL NULL
365 1 963 FALSE NULL NULL
365 1 973 FALSE NULL NULL
365 1 986 FALSE NULL NULL

The first few columns tell us when and where (i.e. which deme) sampling occurred, the ID of the host, and whether they were positive for malaria parasites. The next two columns give genetic data (positive samples only). The haplotypes column gives the raw information at all 24 loci. Note that each host can be infected with multiple strains, meaning this element is actually a matrix with one row for each strain and one column for each locus. We can see this by printing out the full element for a malaria-positive host:

# print raw genetic data
sim1$indlevel |>
  filter(sample_ID == 394) |>
  pull(haplotypes)
#> [[1]]
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]   68   68   68   68   68   68   42   42   42    68    68    68    68    68
#> [2,]   42   68   42   42   42   42   42   42   42    42    42    42    42    42
#> [3,]   42   42   42   42   42   42   42   42   42    42    42    42    42    42
#> [4,]   68   68   68   68   68   68   68   68   68    68    68    68    68    68
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
#> [1,]    68    42    42    68    68    68    68    68    68    68
#> [2,]    68    68    68    68    68    68    68    42    42    42
#> [3,]    42    42    68    68    68    68    68    68    68    68
#> [4,]    68    68    68    68    68    68    68    68    68    68

Returning to the table, the final column gives the haplo_ID. This is a hash of the information contained in each row of the haplotypes matrix, meaning each unique combination of values over all loci will be given a unique name. This can be very useful when we only care about unique genotypes and not the locus-by-locus information contained in those genotypes. For example, for the same individual as before:

# print haplo_ID
sim1$indlevel |>
  filter(sample_ID == 394) |>
  pull(haplo_ID)
#> [[1]]
#> [1] "3a41ce4c33d3dd5808b4c225bfb16708" "2caecb59b554231a5158876dcfdda9c5"
#> [3] "d0187d454d53dd08b30854f451e17134" "186af5c5ac9bf632066ea1c6433dac5d"

But what do the values in the haplotypes matrix actually mean? Although we have described them as haplotypes, they do not (yet) represent genetic information. Instead, each value specifies the ancestor that the information is descended from at the start of the simulation. For example, if we see a value 68 then we know that, at this locus, the information eventually traces back to the 68th seed infection at the start of the simulation. There are two reasons for encoding information like this:

  1. It allows us to ask questions about identify by descent (IBD), and not just identity by state (IBS). If two samples have the same value at the same locus then we know they are descended from a common ancestor some time between the start of the simulation and the present day.
  2. We can always go from this ancestral encoding to genotypes; we simply have to define a genetic value for each unique ancestor at each locus (we will see an example of this). However, the converse is not true; we cannot always go back from genotypes to ancestry.

Note that two samples having the value 68 does not imply that they are both direct descendants of the 68th seed infection. Rather, it implies that these samples have a common ancestor some time between the start of the simulation and the present day, and this common ancestor is descended from the 68th seed infection. It is much more likely that the common ancestor is much more recent than going all the way back to the start of the simulation.

Look again at the haplotypes matrix for this poly-clonally infected host. Notice that each row contains blocks of the values 68 and 42:

# print raw genetic data
sim1$indlevel |>
  filter(sample_ID == 394) |>
  pull(haplotypes)
#> [[1]]
#>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
#> [1,]   68   68   68   68   68   68   42   42   42    68    68    68    68    68
#> [2,]   42   68   42   42   42   42   42   42   42    42    42    42    42    42
#> [3,]   42   42   42   42   42   42   42   42   42    42    42    42    42    42
#> [4,]   68   68   68   68   68   68   68   68   68    68    68    68    68    68
#>      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
#> [1,]    68    42    42    68    68    68    68    68    68    68
#> [2,]    68    68    68    68    68    68    68    42    42    42
#> [3,]    42    42    68    68    68    68    68    68    68    68
#> [4,]    68    68    68    68    68    68    68    68    68    68

This pattern is due to recombination. We could use this matrix to identify blocks of IBD within a sample, or compare this matrix to another to explore IBD between samples.