Package 'coiaf' reference manual

Title:	Complexity of Infection Estimation with Allele Frequencies
Description:	Provides a direct method for estimating complexity of infection using easily calculated measures from sequence read depth data.
Authors:	Aris Paschalidis [aut, cre] , OJ Watson [aut]
Maintainer:	Aris Paschalidis <[email protected]>
License:	MIT + file LICENSE
Version:	0.1.2
Built:	2025-02-05 06:14:15 UTC
Source:	https://github.com/bailey-lab/coiaf

Generate bootstrapped CI

Description

Generate bootstrapped confidence interval for COI estimates.

Usage

bootstrap_ci(
  data,
  max_coi = 25,
  seq_error = 0.01,
  coi_method = c("variant", "frequency"),
  solution_method = c("discrete", "continuous"),
  use_bins = FALSE,
  bin_size = 20,
  replicates = 100,
  parallel = FALSE,
  ncpus = 8
)

## Default S3 method:
bootstrap_ci(
  data,
  max_coi = 25,
  seq_error = 0.01,
  coi_method = c("variant", "frequency"),
  solution_method = c("discrete", "continuous"),
  use_bins = FALSE,
  bin_size = 20,
  replicates = 100,
  parallel = FALSE,
  ncpus = 8
)

## S3 method for class 'sim'
bootstrap_ci(
  data,
  max_coi = 25,
  seq_error = 0.01,
  coi_method = c("variant", "frequency"),
  solution_method = c("discrete", "continuous"),
  use_bins = FALSE,
  bin_size = 20,
  replicates = 100,
  parallel = FALSE,
  ncpus = 8
)
bootstrap_ci(
  data,
  max_coi = 25,
  seq_error = 0.01,
  coi_method = c("variant", "frequency"),
  solution_method = c("discrete", "continuous"),
  use_bins = FALSE,
  bin_size = 20,
  replicates = 100,
  parallel = FALSE,
  ncpus = 8
)

## Default S3 method:
bootstrap_ci(
  data,
  max_coi = 25,
  seq_error = 0.01,
  coi_method = c("variant", "frequency"),
  solution_method = c("discrete", "continuous"),
  use_bins = FALSE,
  bin_size = 20,
  replicates = 100,
  parallel = FALSE,
  ncpus = 8
)

## S3 method for class 'sim'
bootstrap_ci(
  data,
  max_coi = 25,
  seq_error = 0.01,
  coi_method = c("variant", "frequency"),
  solution_method = c("discrete", "continuous"),
  use_bins = FALSE,
  bin_size = 20,
  replicates = 100,
  parallel = FALSE,
  ncpus = 8
)

Arguments

`data`	The data for which the COI will be computed.
`max_coi`	A number indicating the maximum COI to compare the simulated data to.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".
`solution_method`	Whether to estimate discrete or continuous COIs.
`use_bins`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.
`bin_size`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.
`replicates`	The number of bootstrap replicates.
`parallel`	Whether to parallelize the confidence interval calculation. Note that parallelization only works on non-Windows machines.
`ncpus`	The number of processes to be used in parallel operation.

Value

A tibble() with columns:

coi: The mean COI.
bias: Bias of the statistic.
std.error: The standard error of the statistic.
conf.low: The lower 95% confidence interval.
conf.high: The upper 95% confidence interval.

Examples

sim_data <- sim_biallelic(coi = 5, plmaf = runif(100, 0, 0.5))
bootstrap_ci(sim_data, solution_method = "continuous")
sim_data <- sim_biallelic(coi = 5, plmaf = runif(100, 0, 0.5))
bootstrap_ci(sim_data, solution_method = "continuous")

Predict the COI

Description

Predict the COI of the sample.

Usage

compute_coi(
  data,
  data_type,
  max_coi = 25,
  seq_error = 0.01,
  bin_size = 20,
  comparison = "overall",
  distance = "squared",
  coi_method = "variant",
  use_bins = FALSE
)
compute_coi(
  data,
  data_type,
  max_coi = 25,
  seq_error = 0.01,
  bin_size = 20,
  comparison = "overall",
  distance = "squared",
  coi_method = "variant",
  use_bins = FALSE
)

Arguments

`data`	The data for which the COI will be computed.
`data_type`	The type of the data to be analyzed. One of `"sim"` or `"real"`.
`max_coi`	A number indicating the maximum COI to compare the simulated data to.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`bin_size`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.
`comparison`	This argument is no longer supported; this function will compare the theoretical curve and sample curve for all PLMAFs.
`distance`	This argument is no longer supported; this function will solve a weighted least squares minimization problem.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".
`use_bins`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.

Details

Compare the within sample allele frequency (WSMAF) and the population level allele frequency (PLMAF) of the sample to what a theoretical WSMAF and PLMAF should look like. By examining the sample's WSMAF and PLMAF to the theoretical WSMAF and PLMAF, an estimation can be made about what the COI of the sample is. We refer to the sample's WSMAF vs PLMAF as the "sample curve" and refer to the theoretical WSMAF vs PLMAF as the "theoretical curve." To determine the predicted COI value, one of three different methods can be selected:

end

Determines the distance between the theoretical and sample curve at a PLMAF of 0.5. The COI is whichever theoretical COI curve has the smallest distance to the simulated data.

ideal

Determines the distance between the theoretical and sample curve at the ideal PLMAF. The ideal PLMAF is calculated by looking at the change between the COI of \(i\) and the COI of \(i-1\) and finding the PLMAF for which this distance is maximized. The COI is whichever theoretical COI curve has the smallest distance to the simulated data at the ideal PLMAF.

overall

Determines the distance between the theoretical and simulated curve for all PLMAFs. Computes the distance between the theoretical curves and the simulated curve. The COI is whichever theoretical curve has the smallest distance to the simulated curve. There is an option to choose one of several distance metrics:

abs_sum: Absolute value of sum of difference.
sum_abs: Sum of absolute difference.
squared: Sum of squared difference.

Value

A list of the following:

coi: The predicted COI of the sample.
probability: A probability density function representing the probability of each COI.

Compute COI based on residuals of all loci against theoretical curves

Description

Compute COI based on residuals of all loci against theoretical curves

Usage

compute_coi_regression(
  data,
  data_type,
  max_coi = 25,
  seq_error = 0.01,
  distance = "squared",
  coi_method = "variant",
  seq_error_bin_size = 20
)
compute_coi_regression(
  data,
  data_type,
  max_coi = 25,
  seq_error = 0.01,
  distance = "squared",
  coi_method = "variant",
  seq_error_bin_size = 20
)

Arguments

`data`	The data for which the COI will be computed.
`data_type`	The type of the data to be analyzed. One of `"sim"` or `"real"`.
`max_coi`	A number indicating the maximum COI to compare the simulated data to.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`distance`	This argument is no longer supported; this function will solve a weighted least squares minimization problem.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".
`seq_error_bin_size`	Number of loci in smallest bin for estimating sequence error

Value

A list of the following:

coi: The predicted COI of the sample.
probability: A probability density function representing the probability of each COI.

Continuous sensitivity analysis

Description

Runs several iterations of a full COI sensitivity analysis with varying parameters.

Usage

cont_sensitivity(
  repetitions = 10,
  coi = 3,
  max_coi = 25,
  plmaf = runif(1000, 0, 0.5),
  coverage = 200,
  alpha = 1,
  overdispersion = 0,
  relatedness = 0,
  epsilon = 0,
  seq_error = 0.01,
  bin_size = 20,
  comparison = "overall",
  distance = "squared",
  coi_method = "variant",
  use_bins = FALSE
)
cont_sensitivity(
  repetitions = 10,
  coi = 3,
  max_coi = 25,
  plmaf = runif(1000, 0, 0.5),
  coverage = 200,
  alpha = 1,
  overdispersion = 0,
  relatedness = 0,
  epsilon = 0,
  seq_error = 0.01,
  bin_size = 20,
  comparison = "overall",
  distance = "squared",
  coi_method = "variant",
  use_bins = FALSE
)

Arguments

`repetitions`	The number of times each sample will be run.
`coi`	Complexity of infection.
`max_coi`	A number indicating the maximum COI to compare the simulated data to.
`plmaf`	Vector of population-level minor allele frequencies at each locus.
`coverage`	Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci.
`alpha`	Shape parameter of the symmetric Dirichlet prior on strain proportions.
`overdispersion`	The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \(\frac{p}{overdispersion}\) and \(\frac{1-p}{overdispersion}\).
`relatedness`	The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (doi:10.1371/journal.pcbi.1005348): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))."
`epsilon`	The probability of a single read being miscalled as the other allele. This error is applied in both directions.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`bin_size`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.
`comparison`	This argument is no longer supported; this function will compare the theoretical curve and sample curve for all PLMAFs.
`distance`	This argument is no longer supported; this function will solve a weighted least squares minimization problem.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".
`use_bins`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.

Value

A list of the following:

predicted_coi: A dataframe of the predicted COIs. COIs are predicted using compute_coi(). Each column represents a separate set of parameters. Each row represents a predicted COI. Predictions are done many times, depending on the value of repetitions.
probability:A list of matrices containing the probability that our model predicted each COI value. Each row contains the probability for a different run. The first row contains the average probabilities over all the runs.
param_grid: The parameter grid. The parameter grid is all possible combinations of the parameters inputted. Each row represents a unique combination.
boot_error: A dataframe containing information about the error of the algorithm. The first column indicates the COI that was fed into the simulation. The other columns indicate the mean absolute error (mae), the lower and upper bounds of the 95% confidence interval and the bias.

Discrete Sensitivity analysis

Description

Runs several iterations of a full COI sensitivity analysis with varying parameters.

Usage

disc_sensitivity(
  repetitions = 10,
  coi = 3,
  max_coi = 25,
  plmaf = runif(1000, 0, 0.5),
  coverage = 200,
  alpha = 1,
  overdispersion = 0,
  relatedness = 0,
  epsilon = 0,
  seq_error = 0.01,
  bin_size = 20,
  comparison = "overall",
  distance = "squared",
  coi_method = "variant",
  use_bins = FALSE
)
disc_sensitivity(
  repetitions = 10,
  coi = 3,
  max_coi = 25,
  plmaf = runif(1000, 0, 0.5),
  coverage = 200,
  alpha = 1,
  overdispersion = 0,
  relatedness = 0,
  epsilon = 0,
  seq_error = 0.01,
  bin_size = 20,
  comparison = "overall",
  distance = "squared",
  coi_method = "variant",
  use_bins = FALSE
)

Arguments

`repetitions`	The number of times each sample will be run.
`coi`	Complexity of infection.
`max_coi`	A number indicating the maximum COI to compare the simulated data to.
`plmaf`	Vector of population-level minor allele frequencies at each locus.
`coverage`	Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci.
`alpha`	Shape parameter of the symmetric Dirichlet prior on strain proportions.
`overdispersion`	The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \(\frac{p}{overdispersion}\) and \(\frac{1-p}{overdispersion}\).
`relatedness`	The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (doi:10.1371/journal.pcbi.1005348): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))."
`epsilon`	The probability of a single read being miscalled as the other allele. This error is applied in both directions.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`bin_size`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.
`comparison`	This argument is no longer supported; this function will compare the theoretical curve and sample curve for all PLMAFs.
`distance`	This argument is no longer supported; this function will solve a weighted least squares minimization problem.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".
`use_bins`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.

Value

A list of the following:

predicted_coi: A dataframe of the predicted COIs. COIs are predicted using compute_coi(). Each column represents a separate set of parameters. Each row represents a predicted COI. Predictions are done many times, depending on the value of repetitions.
probability:A list of matrices containing the probability that our model predicted each COI value. Each row contains the probability for a different run. The first row contains the average probabilities over all the runs.
param_grid: The parameter grid. The parameter grid is all possible combinations of the parameters inputted. Each row represents a unique combination.
boot_error: A dataframe containing information about the error of the algorithm. The first column indicates the COI that was fed into the simulation. The other columns indicate the mean absolute error (mae), the lower and upper bounds of the 95% confidence interval and the bias.

Error plot

Description

Creates a plot showing the error of the sensitivity analysis.

Usage

error_plot(
  data,
  fill = "coi",
  fill_levels = NULL,
  title = NULL,
  legend_title = fill,
  legend.position = "right",
  second_fill = NULL
)
error_plot(
  data,
  fill = "coi",
  fill_levels = NULL,
  title = NULL,
  legend_title = fill,
  legend.position = "right",
  second_fill = NULL
)

Arguments

`data`	The data to be plotted.
`fill`	The variable the data will be separated by.
`fill_levels`	The levels for the fill variable.
`title`	The title of the plot. Default to `NULL`.
`legend_title`	The text for the legend. Default to `NULL`.
`legend.position`	The position of the legend. One of `"none"`, `"left"`, `"right"`, `"bottom"`, `"top"`.
`second_fill`	Indicates if there will be a second fill variable and what it will be.

Details

Plots are created using ggplot2::geom_col(), which creates a simple bar plot. The mean absolute error is plotted in various colors, according to what parameter is being tested. In addition the 95% confidence interval is shown as black vertical lines.

Example real data

Description

A small example dataset that contains within-sample allele frequencies (WSAFs) from a sample of individuals.

Format

A matrix of data. The rows of the matrix indicate the sample name and the columns of the matrix indicate the WSAF at each locus.

Source

ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/

Likelihood of a COI

Description

A function to generate the likelihood of a specific COI value.

Usage

likelihood(coi, processed_data, distance = "squared", coi_method = "variant")
likelihood(coi, processed_data, distance = "squared", coi_method = "variant")

Arguments

`coi`	The COI for which the likelihood will be generated.
`processed_data`	The processed COI data. This is the output of `process_sim()` or `process_real()`.
`distance`	This argument is no longer supported; this function will solve a weighted least squares minimization problem.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".

Details

The likelihood can be thought of the distance between two curves: the "real" COI curve, generated from the inputted data, and the "simulated" COI curve, which depends on the COI value specified. There are three different methods implemented to compute the distance between two curves:

abs_sum: Absolute value of sum of difference.
sum_abs: Sum of absolute difference.
squared: Sum of squared difference.

Value

The likelihood for a specific COI value.

Optimize the COI

Description

A function to compute the COI of inputted data.

Usage

optimize_coi(
  data,
  data_type,
  max_coi = 25,
  seq_error = 0.01,
  bin_size = 20,
  distance = "squared",
  coi_method = "variant",
  use_bins = FALSE
)
optimize_coi(
  data,
  data_type,
  max_coi = 25,
  seq_error = 0.01,
  bin_size = 20,
  distance = "squared",
  coi_method = "variant",
  use_bins = FALSE
)

Arguments

`data`	The data for which the COI will be computed.
`data_type`	The type of the data to be analyzed. One of `"sim"` or `"real"`.
`max_coi`	A number indicating the maximum COI to compare the simulated data to.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`bin_size`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.
`distance`	This argument is no longer supported; this function will solve a weighted least squares minimization problem.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".
`use_bins`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.

Details

The function utilizes stats::optim(). In particular, the function utilizes a quasi-Newton method to compute gradients and build a picture of the surface to be optimized. The function uses a likelihood function as defined by likelihood().

Value

The predicted COI value.

Compute COI based on all points fitted to best fitting curve for COI

Description

Compute COI based on all points fitted to best fitting curve for COI

Usage

optimize_coi_regression(
  data,
  data_type,
  max_coi = 25,
  seq_error = 0.01,
  distance = "squared",
  coi_method = "variant",
  seq_error_bin_size = 20
)
optimize_coi_regression(
  data,
  data_type,
  max_coi = 25,
  seq_error = 0.01,
  distance = "squared",
  coi_method = "variant",
  seq_error_bin_size = 20
)

Arguments

`data`	The data for which the COI will be computed.
`data_type`	The type of the data to be analyzed. One of `"sim"` or `"real"`.
`max_coi`	A number indicating the maximum COI to compare the simulated data to.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`distance`	This argument is no longer supported; this function will solve a weighted least squares minimization problem.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".
`seq_error_bin_size`	Number of loci in smallest bin for estimating sequence error

Value

The predicted COI value.

Plot simulated data

Description

Generate a simple plot visualizing simulated data. Compares the derived WSMAF to the PLMAF.

Usage

## S3 method for class 'sim'
autoplot(object, ...)

## S3 method for class 'sim'
plot(x, ...)
## S3 method for class 'sim'
autoplot(object, ...)

## S3 method for class 'sim'
plot(x, ...)

Arguments

`object`, `x`	An object of class `sim`. Derived from the output of `sim_biallelic()`.
`...`	Other arguments passed on to methods.

Examples

plot(sim_biallelic(coi = 2))
plot(sim_biallelic(coi = 5))
plot(sim_biallelic(coi = 2))
plot(sim_biallelic(coi = 5))

Process real data

Description

Generate the COI curve for real data.

Usage

process_real(
  wsmaf,
  plmaf,
  coverage,
  seq_error = 0.01,
  bin_size = 20,
  coi_method = "variant"
)
process_real(
  wsmaf,
  plmaf,
  coverage,
  seq_error = 0.01,
  bin_size = 20,
  coi_method = "variant"
)

Arguments

`wsmaf`	The within-sample minor allele frequency.
`plmaf`	The population-level minor allele frequency.
`coverage`	The read coverage at each locus.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`bin_size`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".

Details

The function computes whether a SNP is a variant site or not, based on the WSMAF at that SNP. This process additionally accounts for potential sequencing error.

Value

A list of the following:

data: A tibble with

plmaf_cut: Breaks of the form ⁠[a, b)⁠.
m_variant: The average WSMAF or proportion of variant sites in each segment defined by plmaf_cut.
bucket_size: The number of loci in each bucket.
midpoints: The midpoint of each bucket.

seq_error: The sequence error inferred.
bin_size: The minimum size of each bin.
cuts: The breaks utilized in splitting the data. of each COI.

Process simulated data

Description

Generate the simulated COI curve.

Usage

process_sim(sim, seq_error = 0.01, bin_size = 20, coi_method = "variant")
process_sim(sim, seq_error = 0.01, bin_size = 20, coi_method = "variant")

Arguments

`sim`	Output of `sim_biallelic()`.
`seq_error`	The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error.
`bin_size`	This argument is no longer supported; to estimate the COI, all data points are used. Data points are not grouped in bins of changing `plaf`.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".

Details

Utilize the output of sim_biallelic(), which creates simulated data. The PLMAF is kept, and the function computes whether a SNP is a variant site or not, based on the simulated WSMAF at that SNP. This process additionally accounts for potential sequencing error. To check whether the simulated WSMAF correctly indicated a variant site or not, the phased haplotype of the parasites is computed.

Value

A list of the following:

data: A tibble with

plmaf_cut: Breaks of the form ⁠[a, b)⁠.
m_variant: The average WSMAF or proportion of variant sites in each segment defined by plmaf_cut.
bucket_size: The number of loci in each bucket.
midpoints: The midpoint of each bucket.

seq_error: The sequence error inferred.
bin_size: The minimum size of each bin.
cuts: The breaks utilized in splitting the data. of each COI.

Sensitivity plot

Description

Creates a plot of the sensitivity analysis.

Usage

sensitivity_plot(
  data,
  dims,
  result_type,
  sub_title = NULL,
  title = NULL,
  caption = NULL
)
sensitivity_plot(
  data,
  dims,
  result_type,
  sub_title = NULL,
  title = NULL,
  caption = NULL
)

Arguments

`data`	The data to be plotted.
`dims`	A list representing the number of rows and columns our plots will be split into.
`result_type`	An indicator that indicates if a count or boxplot should be plotted.
`sub_title`	A list of titles for each individual subplot.
`title`	The title of the overall figure.
`caption`	The caption of the overall figure.

Details

Creates a grid of plots. Each plot is created using ggplot2::geom_count(). The number of observations at each location is counted and then the count is mapped to point area on the plot.

The x-axis is the true COI, and the y-axis is the estimated COI. The counts are plotted in blue, and red line is drawn with the equation \(y = x\). This line indicates where the blue circles should be if the algorithm was 100% correct.

Simulate biallelic data

Description

Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level minor allele frequencies (PLMAF), and some parameters dictating skew and error distributions. Outputs include the phased haplotypes and the unphased read count and coverage data.

Usage

sim_biallelic(
  coi,
  plmaf = runif(10, 0, 0.5),
  coverage = 200,
  alpha = 1,
  overdispersion = 0,
  relatedness = 0,
  epsilon = 0
)
sim_biallelic(
  coi,
  plmaf = runif(10, 0, 0.5),
  coverage = 200,
  alpha = 1,
  overdispersion = 0,
  relatedness = 0,
  epsilon = 0
)

Arguments

`coi`	Complexity of infection.
`plmaf`	Vector of population-level minor allele frequencies at each locus.
`coverage`	Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci.
`alpha`	Shape parameter of the symmetric Dirichlet prior on strain proportions.
`overdispersion`	The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \(\frac{p}{overdispersion}\) and \(\frac{1-p}{overdispersion}\).
`relatedness`	The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (doi:10.1371/journal.pcbi.1005348): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))."
`epsilon`	The probability of a single read being miscalled as the other allele. This error is applied in both directions.

Details

Simulated data are drawn from a simple statistical model:

Strain proportions are drawn from a symmetric Dirichlet distribution with shape parameter alpha.
Phased haplotypes are drawn at every locus, one for each coi. The allele at each locus is drawn from a Bernoulli distribution with probability given by the plmaf.
The "true" within-sample allele frequency at every locus is obtained by multiplying haplotypes by their strain proportions, and summing over haplotypes. Errors are introduced through the equation \[wsmaf_{error} = wsmaf(1-e) + (1-wsmaf)e\] where \(wsmaf\) is the WSMAF without error and \(e\) is the error parameter epsilon.
Final read counts are drawn from a beta-binomial distribution with expectation \(w_{error}\). The raw number of draws is given by the coverage, and the skew of the distribution is given by the overdispersion parameter. If the overdispersion is equal to zero, then the distribution is binomial, rather than beta-binomial.

Value

An object of class sim. Contains a list of tibbles:

parameters contains each parameter and the value used to simulate data.
strain_proportions contains the proportion of each strain.
phased_haplotypes contains the phased haplotype for each strain at each locus.
data contains the following columns:
- plmaf: The population-level minor allele frequency.
- coverage: The coverage at each locus.
- counts: The count at each locus.
- wsaf: The within-sample minor allele frequency.

Examples

sim_biallelic(coi = 5)
sim_biallelic(coi = 5)

Custom ggplot2 theme

Description

Custom ggplot2 theme

Usage

theme_coiaf(
  base_size = 10,
  base_family = "",
  base_line_size = base_size/22,
  base_rect_size = base_size/22
)
theme_coiaf(
  base_size = 10,
  base_family = "",
  base_line_size = base_size/22,
  base_rect_size = base_size/22
)

Arguments

`base_size`	base font size, given in pts.
`base_family`	base font family
`base_line_size`	base size for line elements
`base_rect_size`	base size for rect elements

Examples

library("ggplot2")
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(gear))) +
  geom_point() +
  facet_wrap(~am) +
  geom_smooth(method = "lm", se = FALSE)

p + theme_coiaf()
library("ggplot2")
p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(gear))) +
  geom_point() +
  facet_wrap(~am) +
  geom_smooth(method = "lm", se = FALSE)

p + theme_coiaf()

Theoretical COI

Description

Generate the theoretical relationship between the WSMAF (\(\bf{w}\)), the PLMAF (\(\bf{p}\)), and the COI (\(k\)).

Usage

theoretical_coi(
  coi_range,
  plmaf = seq(0, 0.5, length.out = 101),
  coi_method = c("variant", "frequency")
)
theoretical_coi(
  coi_range,
  plmaf = seq(0, 0.5, length.out = 101),
  coi_method = c("variant", "frequency")
)

Arguments

`coi_range`	The COIs for which the relationship will be generated.
`plmaf`	The population-level minor allele frequency over which the relationship will be generated.
`coi_method`	The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant".

Value

A tibble() containing the generated values. Each column is named with the COI used. The last column of the tibble contains the PLMAF.

Examples

theoretical_coi(1:5)
theoretical_coi(1:5, coi_method = "frequency")
theoretical_coi(1:5)
theoretical_coi(1:5, coi_method = "frequency")

World map plot

Description

Plot a world map showing the COI in each region where reads were sampled from.

Usage

world_map(data, variable, label = NULL, alpha = 0.1, breaks = c(1, 2))
world_map(data, variable, label = NULL, alpha = 0.1, breaks = c(1, 2))

Arguments

`data`	The data to be plotted.
`variable`	The variable the data will plot.
`label`	The label for the variable.
`alpha`	The alpha value for the plotted data.
`breaks`	The breaks for the color scale.

Details

Creates a world map and overlays the COI in each region. The magnitude of the COI is indicated by both the color and the size of the bubble.

Package 'coiaf'

Help Index

Generate bootstrapped CI

Description

Usage

Arguments

Value

See Also

Examples

Predict the COI

Description

Usage

Arguments

Details

Value

Compute COI based on residuals of all loci against theoretical curves

Description

Usage

Arguments

Value

Continuous sensitivity analysis

Description

Usage

Arguments

Value

Discrete Sensitivity analysis

Description

Usage

Arguments

Value

Error plot

Description

Usage

Arguments

Details

See Also

Example real data

Description

Format

Source

Likelihood of a COI

Description

Usage

Arguments

Details

Value

See Also

Optimize the COI

Description

Usage

Arguments

Details

Value

See Also

Compute COI based on all points fitted to best fitting curve for COI

Description

Usage

Arguments

Value

See Also

Plot simulated data

Description

Usage

Arguments

See Also

Examples

Process real data

Description

Usage

Arguments

Details

Value

See Also

Process simulated data

Description

Usage

Arguments

Details

Value

See Also