Real data example

Data structure

The algorithms developed in this package require an input data set containing the population-level minor allele frequency (PLMAF), the within-sample minor allele frequency (WSMAF), and the within-sample coverage across a set of loci. Although the package works with minor allele frequencies, users may instead supply the population-level and within-sample frequencies of the reference allele. The package has built-in capabilities to convert these values to minor allele frequencies.
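To make the conversion concrete, here is a minimal base R sketch of one way such a fold could work. The helper `to_minor_allele()` is hypothetical and is not the package's built-in conversion. It assumes that a locus whose population-level frequency exceeds 0.5 describes the major allele, so both frequencies at that locus are replaced by one minus their value.

```r
# Hypothetical helper, not part of the package: fold allele frequencies so
# that they refer to the minor allele at every locus.
to_minor_allele <- function(wsaf, plaf) {
  # Assumption: a locus is flipped when its population-level frequency
  # is above 0.5; loci with a missing PLAF are left untouched.
  flip <- !is.na(plaf) & plaf > 0.5
  data.frame(
    wsmaf = ifelse(flip, 1 - wsaf, wsaf),
    plmaf = ifelse(flip, 1 - plaf, plaf)
  )
}

# Only the first locus has a PLAF above 0.5, so only it is flipped.
folded <- to_minor_allele(
  wsaf = c(0.82, 0.10, 1.00),
  plaf = c(0.90, 0.20, 0.50)
)
print(folded)
```

The within-sample frequency is flipped together with the population-level frequency so that both columns keep describing the same allele.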

The example real data set included with this package is a matrix of within-sample allele frequencies (WSAFs), with samples in rows and loci in columns. The first 5 rows and 3 columns are shown below:

print(example_real_data[1:5, 1:3])
#>          Pf3D7_01_v3_94422 Pf3D7_01_v3_95518 Pf3D7_01_v3_100608
#> FP0024-C         0.8205128        0.43548387          0.4972067
#> FP0025-C         1.0000000        0.00000000          1.0000000
#> FP0028-C         0.7389381        0.48000000          1.0000000
#> FP0029-C         1.0000000        1.00000000          1.0000000
#> FP0030-C         0.6250000        0.01882353          1.0000000

From this matrix, we can estimate the population-level allele frequency (PLAF) at each locus by averaging the WSAF across all samples, ignoring missing values:

plaf <- colMeans(example_real_data, na.rm = TRUE)
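As a small illustration of what this call does, the sketch below applies `colMeans()` to a toy matrix. The matrix `toy_wsaf` and its sample and locus names are made up for this example and are not part of the package data.

```r
# Toy WSAF matrix (hypothetical values): 2 samples in rows, 3 loci in columns.
toy_wsaf <- matrix(
  c(1.0, 0.5, NA,
    0.0, 0.5, 1.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("sample_a", "sample_b"),
    c("locus_1", "locus_2", "locus_3")
  )
)

# na.rm = TRUE averages only the samples with an observed value at a locus,
# so locus_3 is estimated from sample_b alone.
toy_plaf <- colMeans(toy_wsaf, na.rm = TRUE)
print(toy_plaf)
#> locus_1 locus_2 locus_3
#>     0.5     0.5     1.0
```

Without `na.rm = TRUE`, any locus with a missing value in at least one sample would have a PLAF of `NA`.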

With the WSAF and PLAF, we can generate an input data frame. However, as our algorithms work on a per-sample basis, we must generate a list of input data frames, one per sample:

input_data <- purrr::map(seq_len(nrow(example_real_data)), function(i) {
  # One data frame per sample; loci with missing values are dropped
  tidyr::drop_na(
    tibble::tibble(wsmaf = example_real_data[i, ], plmaf = plaf)
  )
})
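For readers who prefer to see the same reshaping step without extra dependencies, the following base R sketch builds a named list of per-sample data frames from a toy matrix. The object names (`toy_wsaf`, `toy_plaf`, `toy_input`) and values are hypothetical. Naming the list by sample is our own addition and is not required by the package.

```r
# Toy WSAF matrix (hypothetical values): 2 samples in rows, 3 loci in columns.
toy_wsaf <- matrix(
  c(1.0, 0.5, NA,
    0.0, 0.5, 1.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("sample_a", "sample_b"),
    c("locus_1", "locus_2", "locus_3")
  )
)
toy_plaf <- colMeans(toy_wsaf, na.rm = TRUE)

# Build one data frame per sample and drop loci with missing values.
toy_input <- lapply(rownames(toy_wsaf), function(sample_id) {
  df <- data.frame(wsmaf = toy_wsaf[sample_id, ], plmaf = toy_plaf)
  df[complete.cases(df), , drop = FALSE]
})

# Keep the sample names so later results can be matched back to samples.
names(toy_input) <- rownames(toy_wsaf)

print(toy_input$sample_a)
```

Here `sample_a` keeps two loci because its value at `locus_3` is missing, while `sample_b` keeps all three.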

Estimate the COI

With the input data set now generated, users can estimate the COI with the compute_coi() or optimize_coi() function, depending on whether a discrete or continuous value of the COI is desired. Below we illustrate estimating the continuous COI with optimize_coi():

# Estimate the COI of a single sample
optimize_coi(input_data[[1]], data_type = "real")
#> [1] 1.2079

# Estimate the COI of multiple samples
purrr::map_dbl(input_data, ~ optimize_coi(.x, data_type = "real"))
#>  [1] 1.2079 1.7896 1.1241 1.0550 1.0708 1.1313 1.0500 1.0722 1.9574 1.0143

The estimation functions will return the estimated COI. In some cases, additional information will also be returned.

Data visualization

We recommend exploring the ggplot2 package to plot results. The Graph Gallery is a beautiful website with graphs and demos that may provide some inspiration.
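As a starting point, the sketch below draws a simple bar chart of per-sample estimates with ggplot2. It pairs the first five estimates printed above with the first five sample names from the data preview, assuming the estimates are returned in row order. The object names `coi_estimates` and `coi_plot` are our own.

```r
library(ggplot2)

# Estimates copied from the output above, paired with the sample names
# from the data preview (assumes estimates follow the row order).
coi_estimates <- data.frame(
  sample = c("FP0024-C", "FP0025-C", "FP0028-C", "FP0029-C", "FP0030-C"),
  coi = c(1.2079, 1.7896, 1.1241, 1.0550, 1.0708)
)

# One bar per sample, with bar height equal to the estimated COI.
coi_plot <- ggplot(coi_estimates, aes(x = sample, y = coi)) +
  geom_col() +
  labs(x = "Sample", y = "Estimated COI")

print(coi_plot)
```

A bar chart is only one option. For a larger number of samples, a histogram of the estimates may be easier to read.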