---
title: "Real data example"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Real data example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
# Here, we set default options for our markdown file
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 5,
  fig.height = 5
)
# Change the way tibbles print so that only 5 extra columns are listed
options(tibble.max_extra_cols = 5)

library(ggplot2)
library(coiaf)
```

## Data structure
The algorithms developed in this package require an input data set containing
the population-level minor allele frequency (PLMAF), the within-sample minor
allele frequency (WSMAF), and the within-sample coverage across a set of loci.
Although the algorithms operate on minor allele frequencies, users may instead
supply the population-level allele frequency (PLAF) and within-sample allele
frequency (WSAF) of the reference allele. The package has built-in capabilities
to convert these values to minor allele frequencies.
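
To make the idea of this conversion concrete, the chunk below gives a minimal
base R sketch. It assumes that the minor allele at a locus is whichever allele
has a population-level frequency of at most 0.5, and it uses made-up
frequencies and hypothetical object names (`plaf_ref`, `wsaf_ref`,
`plmaf_sketch`, `wsmaf_sketch`). It is an illustration only and not necessarily
how the package performs the conversion internally.

```{r minor allele sketch}
# Hypothetical reference-allele frequencies at four loci (illustrative values)
plaf_ref <- c(0.10, 0.80, 0.50, 0.95)
wsaf_ref <- c(0.00, 1.00, 0.40, 0.70)

# Assumption: the reference allele is the major allele wherever its
# population-level frequency exceeds 0.5, so both frequencies are flipped there
flip <- plaf_ref > 0.5
plmaf_sketch <- ifelse(flip, 1 - plaf_ref, plaf_ref)
wsmaf_sketch <- ifelse(flip, 1 - wsaf_ref, wsaf_ref)

data.frame(plmaf = plmaf_sketch, wsmaf = wsmaf_sketch)
```

Under this assumption the population-level values never exceed 0.5, whereas the
within-sample values still can, because an individual sample may carry mostly
the allele that is rare in the population.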

The example real data set included with this package is a matrix of the WSAFs
of multiple samples across several loci, with samples in the rows and loci in
the columns. The first 5 rows and 3 columns are shown below:
```{r print data}
print(example_real_data[1:5, 1:3])
```

Given this information, we may determine the PLAF by averaging the WSAF of all
samples across each locus, as follows:
```{r PLAF}
plaf <- colMeans(example_real_data, na.rm = TRUE)
```
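
The role of `na.rm = TRUE` is easiest to see on a small matrix. The sketch
below uses a hypothetical 3-sample, 2-locus matrix (`toy_wsaf`, with made-up
values) in which one sample has no call at the first locus. That sample is
skipped for that locus only, so each locus is averaged over the samples that
have data there.

```{r PLAF sketch}
# Hypothetical WSAF matrix: samples in rows, loci in columns
toy_wsaf <- matrix(
  c(0.0, 1.0,
    0.5, 1.0,
    NA,  0.0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("sample_", 1:3), paste0("locus_", 1:2))
)

# locus_1 is averaged over two samples, locus_2 over all three
toy_plaf <- colMeans(toy_wsaf, na.rm = TRUE)
toy_plaf
```

Without `na.rm = TRUE`, any locus with a missing value in even one sample would
have a PLAF of `NA`.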

With the WSAF and PLAF, we can generate an input data frame. However, as our
algorithms work on a per-sample basis, we must generate a list of input data
frames, one for each sample:
```{r input data}
input_data <- purrr::map(seq_len(nrow(example_real_data)), function(i) {
  tibble::tibble(wsmaf = example_real_data[i, ], plmaf = plaf) %>%
    tidyr::drop_na()
})
```


## Estimate the COI
With the input data generated, users can estimate the COI with the
`compute_coi()` or `optimize_coi()` function, depending on whether a discrete
or continuous value of the COI is desired, respectively. Below we illustrate
estimating the continuous COI with `optimize_coi()`:
```{r estimate the COI}
# Estimate the COI of a single sample
optimize_coi(input_data[[1]], data_type = "real")

# Estimate the COI of multiple samples
purrr::map_dbl(input_data, ~ optimize_coi(.x, data_type = "real"))
```

The estimation functions will return the estimated COI. In some cases,
additional information will also be returned.

## Data visualization
We recommend exploring the [`ggplot2`](https://ggplot2.tidyverse.org/index.html)
package to plot results. The [Graph
Gallery](https://r-graph-gallery.com/index.html) is a beautiful website with
graphs and demos that may provide some inspiration.
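
As a starting point, the chunk below sketches one possible plot: a histogram of
per-sample COI estimates. The values in `coi_estimates` are hypothetical
placeholders, not results from this package. In practice, they could be
replaced with the numeric vector returned by the `purrr::map_dbl()` call shown
earlier. The object names `results` and `coi_plot` are likewise illustrative.

```{r plot coi sketch}
# Hypothetical continuous COI estimates, one per sample (placeholder values,
# not output from coiaf)
coi_estimates <- c(1.0, 1.3, 2.1, 1.0, 3.4, 2.0, 1.1, 2.6)

results <- data.frame(
  sample = seq_along(coi_estimates),
  coi = coi_estimates
)

# Histogram of the estimated COI across samples
coi_plot <- ggplot2::ggplot(results, ggplot2::aes(x = coi)) +
  ggplot2::geom_histogram(binwidth = 0.5, boundary = 0) +
  ggplot2::labs(x = "Estimated COI", y = "Number of samples")

coi_plot
```

A histogram suits a first look at the distribution across samples. For
comparisons between groups of samples, a box plot or violin plot of the same
data frame may be more informative.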