Title: | Complexity of Infection Estimation with Allele Frequencies |
---|---|
Description: | Provides a direct method for estimating complexity of infection using easily calculated measures from sequence read depth data. |
Authors: | Aris Paschalidis [aut, cre] , OJ Watson [aut] |
Maintainer: | Aris Paschalidis <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.2 |
Built: | 2024-11-01 04:25:12 UTC |
Source: | https://github.com/bailey-lab/coiaf |
Generate bootstrapped confidence interval for COI estimates.
bootstrap_ci( data, max_coi = 25, seq_error = 0.01, coi_method = c("variant", "frequency"), solution_method = c("discrete", "continuous"), use_bins = FALSE, bin_size = 20, replicates = 100, parallel = FALSE, ncpus = 8 ) ## Default S3 method: bootstrap_ci( data, max_coi = 25, seq_error = 0.01, coi_method = c("variant", "frequency"), solution_method = c("discrete", "continuous"), use_bins = FALSE, bin_size = 20, replicates = 100, parallel = FALSE, ncpus = 8 ) ## S3 method for class 'sim' bootstrap_ci( data, max_coi = 25, seq_error = 0.01, coi_method = c("variant", "frequency"), solution_method = c("discrete", "continuous"), use_bins = FALSE, bin_size = 20, replicates = 100, parallel = FALSE, ncpus = 8 )
bootstrap_ci( data, max_coi = 25, seq_error = 0.01, coi_method = c("variant", "frequency"), solution_method = c("discrete", "continuous"), use_bins = FALSE, bin_size = 20, replicates = 100, parallel = FALSE, ncpus = 8 ) ## Default S3 method: bootstrap_ci( data, max_coi = 25, seq_error = 0.01, coi_method = c("variant", "frequency"), solution_method = c("discrete", "continuous"), use_bins = FALSE, bin_size = 20, replicates = 100, parallel = FALSE, ncpus = 8 ) ## S3 method for class 'sim' bootstrap_ci( data, max_coi = 25, seq_error = 0.01, coi_method = c("variant", "frequency"), solution_method = c("discrete", "continuous"), use_bins = FALSE, bin_size = 20, replicates = 100, parallel = FALSE, ncpus = 8 )
A tibble()
with columns:
The mean COI.
Bias of the statistic.
The standard error of the statistic.
The lower 95% confidence interval.
The upper 95% confidence interval.
boot::boot()
, boot::boot.ci()
, broom::tidy.boot()
sim_data <- sim_biallelic(coi = 5, plmaf = runif(100, 0, 0.5)) bootstrap_ci(sim_data, solution_method = "continuous")
sim_data <- sim_biallelic(coi = 5, plmaf = runif(100, 0, 0.5)) bootstrap_ci(sim_data, solution_method = "continuous")
Predict the COI of the sample.
compute_coi( data, data_type, max_coi = 25, seq_error = 0.01, bin_size = 20, comparison = "overall", distance = "squared", coi_method = "variant", use_bins = FALSE )
compute_coi( data, data_type, max_coi = 25, seq_error = 0.01, bin_size = 20, comparison = "overall", distance = "squared", coi_method = "variant", use_bins = FALSE )
Compare the within sample allele frequency (WSMAF) and the population level allele frequency (PLMAF) of the sample to what a theoretical WSMAF and PLMAF should look like. By examining the sample's WSMAF and PLMAF to the theoretical WSMAF and PLMAF, an estimation can be made about what the COI of the sample is. We refer to the sample's WSMAF vs PLMAF as the "sample curve" and refer to the theoretical WSMAF vs PLMAF as the "theoretical curve." To determine the predicted COI value, one of three different methods can be selected:
end
Determines the distance between the theoretical and sample curve at a PLMAF of 0.5. The COI is whichever theoretical COI curve has the smallest distance to the simulated data.
ideal
Determines the distance between the theoretical and sample curve at the ideal PLMAF. The ideal PLMAF is calculated by looking at the change between the COI of \(i\) and the COI of \(i-1\) and finding the PLMAF for which this distance is maximized. The COI is whichever theoretical COI curve has the smallest distance to the simulated data at the ideal PLMAF.
overall
Determines the distance between the theoretical and simulated curve for all PLMAFs. Computes the distance between the theoretical curves and the simulated curve. The COI is whichever theoretical curve has the smallest distance to the simulated curve. There is an option to choose one of several distance metrics:
abs_sum
: Absolute value of sum of difference.
sum_abs
: Sum of absolute difference.
squared
: Sum of squared difference.
A list of the following:
coi
: The predicted COI of the sample.
probability
: A probability density function representing the probability
of each COI.
Compute COI based on residuals of all loci against theoretical curves
compute_coi_regression( data, data_type, max_coi = 25, seq_error = 0.01, distance = "squared", coi_method = "variant", seq_error_bin_size = 20 )
compute_coi_regression( data, data_type, max_coi = 25, seq_error = 0.01, distance = "squared", coi_method = "variant", seq_error_bin_size = 20 )
A list of the following:
coi
: The predicted COI of the sample.
probability
: A probability density function representing the probability
of each COI.
Runs several iterations of a full COI sensitivity analysis with varying parameters.
cont_sensitivity( repetitions = 10, coi = 3, max_coi = 25, plmaf = runif(1000, 0, 0.5), coverage = 200, alpha = 1, overdispersion = 0, relatedness = 0, epsilon = 0, seq_error = 0.01, bin_size = 20, comparison = "overall", distance = "squared", coi_method = "variant", use_bins = FALSE )
cont_sensitivity( repetitions = 10, coi = 3, max_coi = 25, plmaf = runif(1000, 0, 0.5), coverage = 200, alpha = 1, overdispersion = 0, relatedness = 0, epsilon = 0, seq_error = 0.01, bin_size = 20, comparison = "overall", distance = "squared", coi_method = "variant", use_bins = FALSE )
repetitions |
The number of times each sample will be run. |
coi |
Complexity of infection. |
max_coi |
A number indicating the maximum COI to compare the simulated data to. |
plmaf |
Vector of population-level minor allele frequencies at each locus. |
coverage |
Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci. |
alpha |
Shape parameter of the symmetric Dirichlet prior on strain proportions. |
overdispersion |
The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \(\frac{p}{overdispersion}\) and \(\frac{1-p}{overdispersion}\). |
relatedness |
The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (doi:10.1371/journal.pcbi.1005348): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))." |
epsilon |
The probability of a single read being miscalled as the other allele. This error is applied in both directions. |
seq_error |
The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error. |
bin_size |
This argument is no longer
supported; to estimate the COI, all data points are used. Data points are
not grouped in bins of changing |
comparison |
This argument is no longer supported; this function will compare the theoretical curve and sample curve for all PLMAFs. |
distance |
This argument is no longer supported; this function will solve a weighted least squares minimization problem. |
coi_method |
The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant". |
use_bins |
This argument is no longer
supported; to estimate the COI, all data points are used. Data points are
not grouped in bins of changing |
A list of the following:
predicted_coi
: A dataframe of the predicted COIs. COIs are
predicted using compute_coi()
. Each column represents a separate set
of parameters. Each row represents a predicted COI. Predictions are done
many times, depending on the value of repetitions
.
probability
:A list of matrices containing the probability
that our model predicted each COI value. Each row contains the probability
for a different run. The first row contains the average probabilities over
all the runs.
param_grid
: The parameter grid. The parameter grid is all
possible combinations of the parameters inputted. Each row represents a
unique combination.
boot_error
: A dataframe containing information about the error
of the algorithm. The first column indicates the COI that was fed into the
simulation. The other columns indicate the mean absolute error (mae),
the lower and upper bounds of the 95% confidence interval and the bias.
Runs several iterations of a full COI sensitivity analysis with varying parameters.
disc_sensitivity( repetitions = 10, coi = 3, max_coi = 25, plmaf = runif(1000, 0, 0.5), coverage = 200, alpha = 1, overdispersion = 0, relatedness = 0, epsilon = 0, seq_error = 0.01, bin_size = 20, comparison = "overall", distance = "squared", coi_method = "variant", use_bins = FALSE )
disc_sensitivity( repetitions = 10, coi = 3, max_coi = 25, plmaf = runif(1000, 0, 0.5), coverage = 200, alpha = 1, overdispersion = 0, relatedness = 0, epsilon = 0, seq_error = 0.01, bin_size = 20, comparison = "overall", distance = "squared", coi_method = "variant", use_bins = FALSE )
repetitions |
The number of times each sample will be run. |
coi |
Complexity of infection. |
max_coi |
A number indicating the maximum COI to compare the simulated data to. |
plmaf |
Vector of population-level minor allele frequencies at each locus. |
coverage |
Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci. |
alpha |
Shape parameter of the symmetric Dirichlet prior on strain proportions. |
overdispersion |
The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \(\frac{p}{overdispersion}\) and \(\frac{1-p}{overdispersion}\). |
relatedness |
The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (doi:10.1371/journal.pcbi.1005348): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))." |
epsilon |
The probability of a single read being miscalled as the other allele. This error is applied in both directions. |
seq_error |
The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error. |
bin_size |
This argument is no longer
supported; to estimate the COI, all data points are used. Data points are
not grouped in bins of changing |
comparison |
This argument is no longer supported; this function will compare the theoretical curve and sample curve for all PLMAFs. |
distance |
This argument is no longer supported; this function will solve a weighted least squares minimization problem. |
coi_method |
The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant". |
use_bins |
This argument is no longer
supported; to estimate the COI, all data points are used. Data points are
not grouped in bins of changing |
A list of the following:
predicted_coi
: A dataframe of the predicted COIs. COIs are
predicted using compute_coi()
. Each column represents a separate set
of parameters. Each row represents a predicted COI. Predictions are done
many times, depending on the value of repetitions
.
probability
:A list of matrices containing the probability
that our model predicted each COI value. Each row contains the probability
for a different run. The first row contains the average probabilities over
all the runs.
param_grid
: The parameter grid. The parameter grid is all
possible combinations of the parameters inputted. Each row represents a
unique combination.
boot_error
: A dataframe containing information about the error
of the algorithm. The first column indicates the COI that was fed into the
simulation. The other columns indicate the mean absolute error (mae),
the lower and upper bounds of the 95% confidence interval and the bias.
Creates a plot showing the error of the sensitivity analysis.
error_plot( data, fill = "coi", fill_levels = NULL, title = NULL, legend_title = fill, legend.position = "right", second_fill = NULL )
error_plot( data, fill = "coi", fill_levels = NULL, title = NULL, legend_title = fill, legend.position = "right", second_fill = NULL )
data |
The data to be plotted. |
fill |
The variable the data will be separated by. |
fill_levels |
The levels for the fill variable. |
title |
The title of the plot. Default to |
legend_title |
The text for the legend. Default to |
legend.position |
The position of the legend. One of |
second_fill |
Indicates if there will be a second fill variable and what it will be. |
Plots are created using ggplot2::geom_col()
, which creates a simple bar
plot. The mean absolute error is plotted in various colors, according to what
parameter is being tested. In addition the 95% confidence interval is shown
as black vertical lines.
ggplot2::geom_col()
for more information on bar plots and the
ggplot2 website.
Other plotting functions:
sensitivity_plot()
,
world_map()
A small example dataset that contains within-sample allele frequencies (WSAFs) from a sample of individuals.
A matrix of data. The rows of the matrix indicate the sample name and the columns of the matrix indicate the WSAF at each locus.
ftp://ngs.sanger.ac.uk/production/malaria/pfcommunityproject/Pf6/
A function to generate the likelihood of a specific COI value.
likelihood(coi, processed_data, distance = "squared", coi_method = "variant")
likelihood(coi, processed_data, distance = "squared", coi_method = "variant")
coi |
The COI for which the likelihood will be generated. |
processed_data |
The processed COI data. This is the output of
|
distance |
This argument is no longer supported; this function will solve a weighted least squares minimization problem. |
coi_method |
The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant". |
The likelihood can be thought of the distance between two curves: the "real" COI curve, generated from the inputted data, and the "simulated" COI curve, which depends on the COI value specified. There are three different methods implemented to compute the distance between two curves:
abs_sum
: Absolute value of sum of difference.
sum_abs
: Sum of absolute difference.
squared
: Sum of squared difference.
The likelihood for a specific COI value.
Other optimization functions:
optimize_coi_regression()
,
optimize_coi()
A function to compute the COI of inputted data.
optimize_coi( data, data_type, max_coi = 25, seq_error = 0.01, bin_size = 20, distance = "squared", coi_method = "variant", use_bins = FALSE )
optimize_coi( data, data_type, max_coi = 25, seq_error = 0.01, bin_size = 20, distance = "squared", coi_method = "variant", use_bins = FALSE )
The function utilizes stats::optim()
. In particular, the function utilizes
a quasi-Newton method to compute gradients and build a picture of the
surface to be optimized. The function uses a likelihood function as defined
by likelihood()
.
The predicted COI value.
stats::optim()
for the complete documentation on the optimization
function.
Other optimization functions:
likelihood()
,
optimize_coi_regression()
Compute COI based on all points fitted to best fitting curve for COI
optimize_coi_regression( data, data_type, max_coi = 25, seq_error = 0.01, distance = "squared", coi_method = "variant", seq_error_bin_size = 20 )
optimize_coi_regression( data, data_type, max_coi = 25, seq_error = 0.01, distance = "squared", coi_method = "variant", seq_error_bin_size = 20 )
The predicted COI value.
stats::optim()
for the complete documentation on the optimization
function.
Other optimization functions:
likelihood()
,
optimize_coi()
Generate a simple plot visualizing simulated data. Compares the derived WSMAF to the PLMAF.
## S3 method for class 'sim' autoplot(object, ...) ## S3 method for class 'sim' plot(x, ...)
## S3 method for class 'sim' autoplot(object, ...) ## S3 method for class 'sim' plot(x, ...)
object , x
|
An object of class |
... |
Other arguments passed on to methods. |
Other simulated data functions:
process_sim()
,
sim_biallelic()
plot(sim_biallelic(coi = 2)) plot(sim_biallelic(coi = 5))
plot(sim_biallelic(coi = 2)) plot(sim_biallelic(coi = 5))
Generate the COI curve for real data.
process_real( wsmaf, plmaf, coverage, seq_error = 0.01, bin_size = 20, coi_method = "variant" )
process_real( wsmaf, plmaf, coverage, seq_error = 0.01, bin_size = 20, coi_method = "variant" )
The function computes whether a SNP is a variant site or not, based on the WSMAF at that SNP. This process additionally accounts for potential sequencing error.
A list of the following:
data
: A tibble with
plmaf_cut
: Breaks of the form [a, b)
.
m_variant
: The average WSMAF or proportion of variant sites in each
segment defined by plmaf_cut
.
bucket_size
: The number of loci in each bucket.
midpoints
: The midpoint of each bucket.
seq_error
: The sequence error inferred.
bin_size
: The minimum size of each bin.
cuts
: The breaks utilized in splitting the data.
of each COI.
process_sim()
to process simulated data.
Generate the simulated COI curve.
process_sim(sim, seq_error = 0.01, bin_size = 20, coi_method = "variant")
process_sim(sim, seq_error = 0.01, bin_size = 20, coi_method = "variant")
sim |
Output of |
seq_error |
The level of sequencing error that is assumed. If no value is inputted, then we infer the level of sequence error. |
bin_size |
This argument is no longer
supported; to estimate the COI, all data points are used. Data points are
not grouped in bins of changing |
coi_method |
The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant". |
Utilize the output of sim_biallelic()
, which creates simulated
data. The PLMAF is kept, and the function computes whether a SNP is a
variant site or not, based on the simulated WSMAF at that SNP. This process
additionally accounts for potential sequencing error. To check whether the
simulated WSMAF correctly indicated a variant site or not, the phased
haplotype of the parasites is computed.
A list of the following:
data
: A tibble with
plmaf_cut
: Breaks of the form [a, b)
.
m_variant
: The average WSMAF or proportion of variant sites in each
segment defined by plmaf_cut
.
bucket_size
: The number of loci in each bucket.
midpoints
: The midpoint of each bucket.
seq_error
: The sequence error inferred.
bin_size
: The minimum size of each bin.
cuts
: The breaks utilized in splitting the data.
of each COI.
process_real()
to process real data.
Other simulated data functions:
plot-simulation
,
sim_biallelic()
Creates a plot of the sensitivity analysis.
sensitivity_plot( data, dims, result_type, sub_title = NULL, title = NULL, caption = NULL )
sensitivity_plot( data, dims, result_type, sub_title = NULL, title = NULL, caption = NULL )
data |
The data to be plotted. |
dims |
A list representing the number of rows and columns our plots will be split into. |
result_type |
An indicator that indicates if a count or boxplot should be plotted. |
sub_title |
A list of titles for each individual subplot. |
title |
The title of the overall figure. |
caption |
The caption of the overall figure. |
Creates a grid of plots. Each plot is created using ggplot2::geom_count()
.
The number of observations at each location is counted and then the count is
mapped to point area on the plot.
The x-axis is the true COI, and the y-axis is the estimated COI. The counts are plotted in blue, and red line is drawn with the equation \(y = x\). This line indicates where the blue circles should be if the algorithm was 100% correct.
ggplot2::geom_count()
for more information on count plots and the
ggplot2 website.
Other plotting functions:
error_plot()
,
world_map()
Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level minor allele frequencies (PLMAF), and some parameters dictating skew and error distributions. Outputs include the phased haplotypes and the unphased read count and coverage data.
sim_biallelic( coi, plmaf = runif(10, 0, 0.5), coverage = 200, alpha = 1, overdispersion = 0, relatedness = 0, epsilon = 0 )
sim_biallelic( coi, plmaf = runif(10, 0, 0.5), coverage = 200, alpha = 1, overdispersion = 0, relatedness = 0, epsilon = 0 )
coi |
Complexity of infection. |
plmaf |
Vector of population-level minor allele frequencies at each locus. |
coverage |
Coverage at each locus. If a single value is supplied then the same coverage is applied over all loci. |
alpha |
Shape parameter of the symmetric Dirichlet prior on strain proportions. |
overdispersion |
The extent to which counts are over-dispersed relative to the binomial distribution. Counts are Beta-Binomially distributed, with the beta distribution having shape parameters \(\frac{p}{overdispersion}\) and \(\frac{1-p}{overdispersion}\). |
relatedness |
The probability that a strain in mixed infections is related to another. The implementation is similar to relatedness as defined in THE REAL McCOIL simulations (doi:10.1371/journal.pcbi.1005348): "... simulated relatedness (r) among lineages within the same host by sampling alleles either from an existing lineage within the same host (with probability r) or from the population (with probability (1-r))." |
epsilon |
The probability of a single read being miscalled as the other allele. This error is applied in both directions. |
Simulated data are drawn from a simple statistical model:
Strain proportions are drawn from a symmetric Dirichlet
distribution with shape parameter alpha
.
Phased haplotypes are drawn at every locus, one for each
coi
. The allele at each locus is drawn from a Bernoulli
distribution with probability given by the plmaf
.
The "true" within-sample allele frequency at every locus is
obtained by multiplying haplotypes by their strain proportions, and
summing over haplotypes. Errors are introduced through the equation
\[wsmaf_{error} = wsmaf(1-e) + (1-wsmaf)e\]
where \(wsmaf\) is the WSMAF without error and \(e\) is
the error parameter epsilon
.
Final read counts are drawn from a beta-binomial distribution with
expectation \(w_{error}\). The raw number of draws is given by the
coverage
, and the skew of the distribution is given by the
overdispersion
parameter. If the overdispersion
is equal to
zero, then the distribution is binomial, rather than beta-binomial.
An object of class sim
. Contains a list of
tibbles:
parameters
contains each parameter and the value used to simulate data.
strain_proportions
contains the proportion of each strain.
phased_haplotypes
contains the phased haplotype for each strain at each
locus.
data
contains the following columns:
plmaf
: The population-level minor allele frequency.
coverage
: The coverage at each locus.
counts
: The count at each locus.
wsaf
: The within-sample minor allele frequency.
Other simulated data functions:
plot-simulation
,
process_sim()
sim_biallelic(coi = 5)
sim_biallelic(coi = 5)
Custom ggplot2 theme
theme_coiaf( base_size = 10, base_family = "", base_line_size = base_size/22, base_rect_size = base_size/22 )
theme_coiaf( base_size = 10, base_family = "", base_line_size = base_size/22, base_rect_size = base_size/22 )
base_size |
base font size, given in pts. |
base_family |
base font family |
base_line_size |
base size for line elements |
base_rect_size |
base size for rect elements |
library("ggplot2") p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(gear))) + geom_point() + facet_wrap(~am) + geom_smooth(method = "lm", se = FALSE) p + theme_coiaf()
library("ggplot2") p <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(gear))) + geom_point() + facet_wrap(~am) + geom_smooth(method = "lm", se = FALSE) p + theme_coiaf()
Generate the theoretical relationship between the WSMAF (\(\bf{w}\)), the PLMAF (\(\bf{p}\)), and the COI (\(k\)).
theoretical_coi( coi_range, plmaf = seq(0, 0.5, length.out = 101), coi_method = c("variant", "frequency") )
theoretical_coi( coi_range, plmaf = seq(0, 0.5, length.out = 101), coi_method = c("variant", "frequency") )
coi_range |
The COIs for which the relationship will be generated. |
plmaf |
The population-level minor allele frequency over which the relationship will be generated. |
coi_method |
The method we will use to generate the theoretical relationship. The method is either "variant" or "frequency". The default value is "variant". |
A tibble()
containing the generated values. Each
column is named with the COI used. The last column of the tibble contains the
PLMAF.
theoretical_coi(1:5) theoretical_coi(1:5, coi_method = "frequency")
theoretical_coi(1:5) theoretical_coi(1:5, coi_method = "frequency")
Plot a world map showing the COI in each region where reads were sampled from.
world_map(data, variable, label = NULL, alpha = 0.1, breaks = c(1, 2))
world_map(data, variable, label = NULL, alpha = 0.1, breaks = c(1, 2))
data |
The data to be plotted. |
variable |
The variable the data will plot. |
label |
The label for the variable. |
alpha |
The alpha value for the plotted data. |
breaks |
The breaks for the color scale. |
Creates a world map and overlays the COI in each region. The magnitude of the COI is indicated by both the color and the size of the bubble.
This website for more information on creating bubble graphs in R.
Other plotting functions:
error_plot()
,
sensitivity_plot()