---
title: "Real data example"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Real data example}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
# Set default chunk options for this vignette
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 5,
  fig.height = 5
)

# Limit tibble printing to at most 5 extra columns
options(tibble.max_extra_cols = 5)

library(ggplot2)
library(coiaf)
```

## Data structure

The algorithms developed in this package require an input data set containing
the population-level minor allele frequency (PLMAF), the within-sample minor
allele frequency (WSMAF), and the within-sample coverage across a set of loci.
Although the package works with minor allele frequencies, users may instead
supply the population-level and within-sample frequencies of the reference
allele. The package has built-in capabilities to convert these values to minor
allele frequencies.

The example real data set included with this package is a matrix of
within-sample allele frequencies (WSAFs) for multiple samples across several
loci, with samples in rows and loci in columns. The first 5 rows and 3 columns
are shown below:

```{r print data}
print(example_real_data[1:5, 1:3])
```

Given this information, we may determine the population-level allele frequency
(PLAF) by averaging the WSAF of all samples at each locus, as follows:

```{r PLAF}
plaf <- colMeans(example_real_data, na.rm = TRUE)
```

With the WSAF and PLAF, we can generate an input data frame.
However, as our algorithms work on a per-sample basis, we must generate a list
of input data frames, one per sample:

```{r input data}
input_data <- purrr::map(seq_len(nrow(example_real_data)), function(i) {
  # Pair the WSAF of sample i with the PLAF, then drop loci with missing values
  tidyr::drop_na(
    tibble::tibble(wsmaf = example_real_data[i, ], plmaf = plaf)
  )
})
```

## Estimate the COI

With the input data set generated, users can estimate the COI with the
`compute_coi()` or `optimize_coi()` function, depending on whether a discrete
or continuous value of the COI is desired. Below we illustrate estimation with
`optimize_coi()`:

```{r estimate the COI}
# Estimate the COI of a single sample
optimize_coi(input_data[[1]], data_type = "real")

# Estimate the COI of multiple samples
purrr::map_dbl(input_data, ~ optimize_coi(.x, data_type = "real"))
```

The estimation functions return the estimated COI. In some cases, additional
information is also returned.

## Data visualization

We recommend exploring the
[`ggplot2`](https://ggplot2.tidyverse.org/index.html) package to plot results.
The [Graph Gallery](https://r-graph-gallery.com/index.html) is a beautiful
website with graphs and demos that may provide some inspiration.
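
As a possible starting point, the sketch below draws a histogram of COI
estimates with `ggplot2`. It is an illustration only: `coi_estimates` is a
placeholder vector of made-up values, not output from this package, and would
be replaced by the result of the `purrr::map_dbl()` call above. The object and
column names (`coi_df`, `sample`, `coi`, `coi_plot`) are our own choices.

```{r plot COI}
library(ggplot2)

# Placeholder values for illustration only (assumption, not real estimates);
# substitute the vector returned by the purrr::map_dbl() call above
coi_estimates <- c(1.0, 1.2, 2.1, 1.0, 3.4, 2.0, 1.1, 2.6)

# One row per sample
coi_df <- data.frame(
  sample = seq_along(coi_estimates),
  coi = coi_estimates
)

# Histogram of the estimates, using bins of width 0.5
coi_plot <- ggplot(coi_df, aes(x = coi)) +
  geom_histogram(binwidth = 0.5, boundary = 0, fill = "steelblue", color = "white") +
  labs(x = "Estimated COI", y = "Number of samples") +
  theme_minimal()

coi_plot
```

A histogram suits continuous estimates. For discrete estimates, `geom_bar()`
may be a more natural choice.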