Title: | Filtering and analysis of MIP data |
---|---|
Description: | Filtering and analysis of MIP data. |
Authors: | Bob Verity [aut, cre] |
Maintainer: | Bob Verity <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.0 |
Built: | 2024-10-23 04:31:17 UTC |
Source: | https://github.com/mrc-ide/MIPanalyzer |
Simple function to check that MIPanalyzer package has loaded successfully.
check_MIPanalyzer_loaded()
check_MIPanalyzer_loaded()
Explore what effect the filter_coverage_loci()
function will have on the data without actually applying any filters. Can
be used to set coverage thresholds.
explore_filter_coverage_loci( x, min_coverage = 5, max_low_coverage = 50, breaks = 100 )
explore_filter_coverage_loci( x, min_coverage = 5, max_low_coverage = 50, breaks = 100 )
x |
object of class |
min_coverage |
the coverage threshold below which data is deemed to be low-coverage. |
max_low_coverage |
(percentage). Loci are not allowed to contain more
than this many low-coverage samples. In the |
breaks |
number of breaks spanning the range |
Explore what effect the filter_coverage_samples()
function will have on the data without actually applying any filters. Can
be used to set coverage thresholds.
explore_filter_coverage_samples( x, min_coverage = 5, max_low_coverage = 50, breaks = 100 )
explore_filter_coverage_samples( x, min_coverage = 5, max_low_coverage = 50, breaks = 100 )
x |
object of class |
min_coverage |
the coverage threshold below which data is deemed to be low-coverage. |
max_low_coverage |
(percentage). Samples are not allowed to contain more
than this many low-coverage loci. In the |
breaks |
number of breaks spanning the range |
Drop any allele for which the number of read counts is below a given threshold. Coverage is adjusted to account for dropped reads.
filter_counts( x, count_min = 2, description = "filter individual allele counts" )
filter_counts( x, count_min = 2, description = "filter individual allele counts" )
x |
object of class |
count_min |
alleles with fewer than this many counts are dropped. |
description |
brief description of the filter, to be saved in the filter history. |
Set a coverage threshold: any coverage value below this
threshold is deemed to be low-coverage. Then set a maximum percent
low-coverage samples per locus: any locus with greater than this percentage
low-coverage samples is dropped. Note that threshold values can be explored
without applying any filtering using the
explore_filter_coverage_loci()
function.
filter_coverage_loci( x, min_coverage = 5, max_low_coverage = 50, replace_low_coverage = FALSE, description = "filter loci based on coverage" )
filter_coverage_loci( x, min_coverage = 5, max_low_coverage = 50, replace_low_coverage = FALSE, description = "filter loci based on coverage" )
x |
object of class |
min_coverage |
the coverage threshold below which data is deemed to be low-coverage. |
max_low_coverage |
any locus with more than |
replace_low_coverage |
(Boolean). If |
description |
brief description of the filter, to be saved in the filter history. |
Set a coverage threshold: any coverage value below this
threshold is deemed to be low-coverage. Then set a maximum percent
low-coverage loci per sample: any sample with greater than this percentage
low-coverage loci is dropped. Note that threshold values can be explored
without applying any filtering using the
explore_filter_coverage_samples()
function.
filter_coverage_samples( x, min_coverage = 5, max_low_coverage = 50, replace_low_coverage = FALSE, description = "filter samples based on coverage" )
filter_coverage_samples( x, min_coverage = 5, max_low_coverage = 50, replace_low_coverage = FALSE, description = "filter samples based on coverage" )
x |
object of class |
min_coverage |
the coverage threshold below which data is deemed to be low-coverage. |
max_low_coverage |
any sample with more than |
replace_low_coverage |
(Boolean). If |
description |
brief description of the filter, to be saved in the filter history. |
Filter out some loci.
filter_loci(x, locus_filter, description = "")
filter_loci(x, locus_filter, description = "")
x |
object of class |
locus_filter |
boolean vector specifying whether to keep ( |
description |
brief description of the filter, to be saved in the filter history. |
Filter loci to drop invariant sites.
filter_loci_invariant(x, description = "filter loci to drop invariant sites")
filter_loci_invariant(x, description = "filter loci to drop invariant sites")
x |
object of class |
description |
brief description of the filter, to be saved in the filter history. |
Filter out over-counts, defined as count > coverage. Replace any such element with NA.
filter_overcounts(x, description = "replace overcounts with NA")
filter_overcounts(x, description = "replace overcounts with NA")
x |
object of class |
description |
brief description of the filter, to be saved in the filter history. |
Filter out some samples.
filter_samples(x, sample_filter, description = "")
filter_samples(x, sample_filter, description = "")
x |
object of class |
sample_filter |
boolean vector specifying whether to keep ( |
description |
brief description of the filter, to be saved in the filter history. |
Drop any allele for which the within-sample allele frequency
(WSAF) is below a givin threshold. Thresholds apply in both directions, for
example if wsaf_min = 0.01
then alleles with a WSAF less than 0.01
*or* greater than 0.99 will be rounded to 0 or 1, respectively. Coverage is
adjusted to account for dropped reads.
filter_wsaf(x, wsaf_min = 0.01, description = "filter individual allele WSAF")
filter_wsaf(x, wsaf_min = 0.01, description = "filter individual allele WSAF")
x |
object of class |
wsaf_min |
alleles with counts that make a WSAF less than this threshold are dropped. |
description |
brief description of the filter, to be saved in the filter history. |
Get genomic distance between samples using a distance metric that allows for mixed infections and takes account of linkage (see references for details).
get_genomic_distance(x, cutoff = 0.1, report_progress = TRUE)
get_genomic_distance(x, cutoff = 0.1, report_progress = TRUE)
x |
object of class |
cutoff |
when calculating weights, correlations below this value are ignored (see references). |
report_progress |
if |
MalariaGEN Plasmodium falciparum Community Project. "Genomic epidemiology of artemisinin resistant malaria". eLIFE (2016).
Get identity by "mixture distance". The mixture distance between two samples is the proportion of loci that have identical within-sample allele frequencies (WSAFs), or alternatively have WSAFs within a given tolerance. This extends the idea of identity by state (IBS) to continuous WSAFs rather than categorical genotypes.
get_IB_mixture(x, tol = 0, diagonal = NULL, report_progress = TRUE)
get_IB_mixture(x, tol = 0, diagonal = NULL, report_progress = TRUE)
x |
object of class |
tol |
tolerance on mixture comparisons. Default = 0. |
diagonal |
Should the diagonal of the distance matrix be changed to a given value. Default = NULL, which cause no changes. |
report_progress |
if |
Get identity by state (IBS) distance, computed as the proportion
of sites that are identical between samples. If ignore_het = TRUE
then heterozygous sites are ignored, otherwise the major strain is called
at every locus.
get_IBS_distance(x, ignore_het = TRUE, diagonal = NULL, report_progress = TRUE)
get_IBS_distance(x, ignore_het = TRUE, diagonal = NULL, report_progress = TRUE)
x |
object of class |
ignore_het |
whether to ignore heterozygous comparisons, or alternatively call the major allele at every locus (see details). |
diagonal |
Should the diagonal of the distance matrix be changed to a
given value. Default = |
report_progress |
if |
Get great circle distance between spatial points.
get_spatial_distance(lat, long)
get_spatial_distance(lat, long)
lat |
vector of latitudes. |
long |
vector of longitudes. |
Get within-sample allele frequencies from coverage and count data. Missing values can optionally be imputed by applying a summary function to the non NA values at each locus. The default summary function takes the mean of the non NA values.
get_wsaf(x, impute = TRUE, FUN = median, ...)
get_wsaf(x, impute = TRUE, FUN = median, ...)
x |
object of class |
impute |
whether to impute missing values. |
FUN |
function used to impute missing values. Default = 'median' |
... |
other arguments to pass to |
Estimates the inbreeding coefficient between all pairs of samples by maximum likelihood.
inbreeding_mle( x, f = seq(0, 1, l = 11), ignore_het = FALSE, report_progress = TRUE )
inbreeding_mle( x, f = seq(0, 1, l = 11), ignore_het = FALSE, report_progress = TRUE )
x |
object of class |
f |
values of f that are explored. |
ignore_het |
whether to ignore heterzygous comparisons, or alternatively call the major allele at every locus (see details). |
report_progress |
if |
The probability of seeing the same or different alleles at a locus
can be written in terms of the global allele frequency p and the inbreeding
coefficient f, for example the probability of seeing the same REF allele is
. This formula can be multiplied over all loci to
arrive at the overall likelihood of each value of f, which can then be
chosen by maximum likelihood. This function carries out this comparison
between all pairwise samples, passed in as a matrix. The formula above only
applies when comparing homozygous calls - for homo/het or het/het
comparisons we can either ignore these loci (the default) or convert hets
to homo by calling the major allele at every locus.
Determine if object is of class mipanalyzer_biallelic
.
is.mipanalyzer_biallelic(x)
is.mipanalyzer_biallelic(x)
x |
object of class |
Determine if object is of class mipanalyzer_multiallelic
.
is.mipanalyzer_multiallelic(x)
is.mipanalyzer_multiallelic(x)
x |
object of class |
Calculate great circle distance and bearing between spatial coordinates.
lonlat_to_bearing(origin_lon, origin_lat, dest_lon, dest_lat)
lonlat_to_bearing(origin_lon, origin_lat, dest_lon, dest_lat)
origin_lon |
The origin longitude |
origin_lat |
The origin latitude |
dest_lon |
The destination longitude |
dest_lat |
The destination latitude |
# one degree longitude should equal approximately 111km at the equator lonlat_to_bearing(0, 0, 1, 0)
# one degree longitude should equal approximately 111km at the equator lonlat_to_bearing(0, 0, 1, 0)
This package can be used to read in raw molecular inversion probe (MIP) data from vcf into a format that is convenient to work with. Data can be filtered based on counts, frequencies, missingness or other criteria. Filtered data can be analysed by common methods including PCA and various pairwise genetic metrics, and can be visualised in multiple ways. This package is intended to evolve as new MIP analyses are needed, thereby making it easy to repeat common analyses as new data becomes available.
Filtering and analysis of MIP data.
Maintainer: Bob Verity [email protected]
Converts an object of class mipanalyzer_biallelic
to
vcfR
format.
mipanalyzer_biallelic_to_vcfR(input = NULL, cutoff = 0.1)
mipanalyzer_biallelic_to_vcfR(input = NULL, cutoff = 0.1)
input |
an object of class |
cutoff |
the within-sample non-referent allele frequency cutoff to transform your biallelic site to a genotype matrix. |
Load a file from within the inst/extdata folder of the MIPanalyzer package. File extension must be one of .csv, .txt, or .rds.
mipanalyzer_file(name)
mipanalyzer_file(name)
name |
the name of a file within the inst/extdata folder. |
Conduct principal components analysis (PCA) on a matrix of within-sample allele frequencies (WSAF). Missing values must have been already imputed. Output includes the raw components, the variance in the data explained by each component, and the loadings of each component also returned.
pca_wsaf(x)
pca_wsaf(x)
x |
a matrix of within-sample allele frequencies, as produced by the
function |
Contributions of each variable are computed from the loading values
(stored as "rotation" within the prcomp
object). The percent
contribution of a variable is defined as the absolute loading value for
this variable, divided by the sum of loadings over all variables and
multiplied by 100.
Invisibly returns a list of class 'prcomp' with the following components
"sdev" the standard deviations of the principal components (i.e., the square roots of the eigenvalues of the covariance/correlation matrix, though the calculation is actually done with the singular values of the data matrix).
"rotation" the
matrix of variable loadings (i.e., a matrix whose columns contain the
eigenvectors). The function princomp
returns this in the element
loadings
.
"center, scale" the centering and scaling used.
"x" the value of the rotated data (the centred data
multiplied by the rotation matrix). Hence, cov(x)
is the
diagonal matrix diag(sdev^2)
.
"var" the variance in the data explained by each component.
"contribution" the percent contribution of a variable (i.e. a locus) to the overall variation.
Conduct principal coordinate analysis (PCoA) on a matrix of genomic distances.
pcoa_genomic_distance(x)
pcoa_genomic_distance(x)
x |
matrix of genomic distances, as produced by the function
|
Get dataframe of P.falciparum chromosome lengths
Pf_chrom_lengths()
Pf_chrom_lengths()
Plot matrix of coverage over all samples and loci.
plot_coverage(x)
plot_coverage(x)
x |
object of class |
Simple image plot of a matrix of pairwise distances.
plot_distance(m, col_pal = "plasma")
plot_distance(m, col_pal = "plasma")
m |
square matrix of pairwise distances. |
col_pal |
which viridis colour pallet to use. Options are "viridis", "plasma", "magma" or "inferno". |
Produce ggplot map.
plot_map( x_limits = c(12, 35), y_limits = c(-13, 5), col_country = grey(0.3), col_country_border = grey(0.5), size_country_border = 0.5, col_sea = grey(0.1), resolution = "coarse" )
plot_map( x_limits = c(12, 35), y_limits = c(-13, 5), col_country = grey(0.3), col_country_border = grey(0.5), size_country_border = 0.5, col_sea = grey(0.1), resolution = "coarse" )
x_limits |
longitude limits of map. |
y_limits |
latitude limits of map. |
col_country |
fill colour of countries. |
col_country_border |
colour of country borders. |
size_country_border |
size of country borders. |
col_sea |
fill colour of sea. |
resolution |
the resolution of the underlying map. Must be one of "coarse", "low", "less", "islands", "li", "high". |
Plots either the first 2 or 3 principal components.
plot_pca( pca, num_components = 2, col = NULL, col_palette = NULL, ggplot = FALSE )
plot_pca( pca, num_components = 2, col = NULL, col_palette = NULL, ggplot = FALSE )
pca |
output of |
num_components |
numeric for number of components used. Default = 2. |
col |
vector by which samples are coloured. |
col_palette |
vector of colours for each group. |
ggplot |
boolean for plotting using ggplot. Default = FALSE |
Plot PCA contribution of each variable.
plot_pca_contribution( pca, component = 1, chrom, pos, locus_type = NULL, y_buffer = 0 )
plot_pca_contribution( pca, component = 1, chrom, pos, locus_type = NULL, y_buffer = 0 )
pca |
output of |
component |
which component to plot. |
chrom |
the chromosome corresponding to each contribution value. |
pos |
the genomic position corresponding each contribution value. |
locus_type |
defines the colour of each bar. |
y_buffer |
(percent). A buffer added to the bottom of each y-axis, making room for other annotations to be added. |
Plot the variance explained by each PCA component. The number of
components shown is controlled by num_components
, with up to the
first 10 componenets shown by default. If less than the requested number of
components exist, then all the components will be shown.
plot_pca_variance(pca, num_components = 10)
plot_pca_variance(pca, num_components = 10)
pca |
output of |
num_components |
maximum components to be shown. |
Plots either the first 2 or 3 vectors of PCoA.
plot_pcoa(pcoa, num_components = 2, col = NULL, col_palette = NULL)
plot_pcoa(pcoa, num_components = 2, col = NULL, col_palette = NULL)
pcoa |
object of class "pcoa", as produced by |
num_components |
numeric for number of components used. Default = 2. |
col |
vector by which samples are coloured. |
col_palette |
vector of colours for each group. |
Simple image plot of within-sample allele frequencies. The top row of the plot corresponds to the first row of the input matrix.
plot_wsaf(x, col_pal = "plasma")
plot_wsaf(x, col_pal = "plasma")
x |
matrix of within-sample allele frequencies, with samples in rows and loci in columns. |
col_pal |
which viridis colour pallet to use. Options are "viridis", "plasma", "magma" or "inferno". |
Custom print function for class mipanalyzer_biallelic
,
printing a summary of the key elements (also equivalent to
summary(x)
). To do an ordinary print()
, use the
print_full()
function.
## S3 method for class 'mipanalyzer_biallelic' print(x, ...)
## S3 method for class 'mipanalyzer_biallelic' print(x, ...)
x |
object of class |
... |
other arguments (ignored) |
Custom print function for class mipanalyzer_multiallelic
,
printing a summary of the key elements (also equivalent to
summary(x)
). To do an ordinary print()
, use the
print_full()
function.
## S3 method for class 'mipanalyzer_multiallelic' print(x, ...)
## S3 method for class 'mipanalyzer_multiallelic' print(x, ...)
x |
object of class |
... |
other arguments (ignored) |
Draw from Beta-binomial distribution.
rbetabinom(n = 1, k = 10, alpha = 1, beta = 1)
rbetabinom(n = 1, k = 10, alpha = 1, beta = 1)
n |
number of draws. |
k |
number of binomial trials. |
alpha |
first shape parameter of beta distribution. |
beta |
second shape parameter of beta distribution. |
Draw from Dirichlet distribution given a vector of shape parameters.
rdirichlet(shape = rep(1, 3))
rdirichlet(shape = rep(1, 3))
shape |
vector of shape parameters. |
Simulate biallelic data from a simple statistical model. Inputs include the complexity of infection (COI), population-level allele frequencies (PLAF) and some parameters dicating skew and error distributions. Outputs include the phased haplotypes and the un-phased read count and coverage data.
sim_biallelic( COI = 3, PLAF = runif(10, 0, 0.5), coverage = 100, alpha = 1, overdispersion = 0, epsilon = 0 )
sim_biallelic( COI = 3, PLAF = runif(10, 0, 0.5), coverage = 100, alpha = 1, overdispersion = 0, epsilon = 0 )
COI |
complexity of infection. |
PLAF |
vector of population-level allele frequencies at each locus. |
coverage |
coverage at each locus. If a single value then the same coverage is applied over all loci. |
alpha |
shape parameter of the symmetric Dirichlet prior on strain proportions. |
overdispersion |
the extent to which counts are over-dispersed relative
to the binomial distribution. Counts are Beta-binomially distributed, with
the beta distribution having shape parameters |
epsilon |
the probability of a single read being mis-called as the other allele. Applies in both directions. |
Simulated data are drawn from a simple statistical model:
Strain proportions are drawn from a symmetric Dirichlet
distribution with shape parameter alpha
.
Phased haplotypes are drawn at every locus, one for each
COI
. The allele at each locus is drawn from a Bernoulli
distribution with probability given by the PLAF
.
The "true" within-sample allele frequency at every locus is obtained by multiplying haplotypes by their strain proportions, and summing over haplotypes. Errors are introduced through the equation
where is the WSAF
without error and
is the error parameter
epsilon
.
Final read counts are drawn from a beta-binomial distribution with
expectation . The raw number of draws is given by the
coverage
, and the skew of the distribution is given by the
overdispersion
parameter. If overdispersion = 0
then the
distribution is binomial, rather than beta-binomial.
Custom summary function for class mipanalyzer_biallelic
.
## S3 method for class 'mipanalyzer_biallelic' summary(object, ...)
## S3 method for class 'mipanalyzer_biallelic' summary(object, ...)
object |
object of class |
... |
other arguments (ignored) |
Custom summary function for class mipanalyzer_multiallelic
.
## S3 method for class 'mipanalyzer_multiallelic' summary(object, ...)
## S3 method for class 'mipanalyzer_multiallelic' summary(object, ...)
object |
object of class |
... |
other arguments (ignored) |
Convert vcf to biallelic mipanalyzer data class.
vcf_to_mipanalyzer_biallelic(file = NULL, vcfR = NULL, verbose = TRUE)
vcf_to_mipanalyzer_biallelic(file = NULL, vcfR = NULL, verbose = TRUE)
file |
path to vcf file. |
vcfR |
object of class |
verbose |
if reading from file, whether to read in verbose manner. |
Convert vcf to multiallelic mipanalyzer data class.
vcf_to_mipanalyzer_multiallelic(file = NULL, vcfR = NULL, verbose = TRUE)
vcf_to_mipanalyzer_multiallelic(file = NULL, vcfR = NULL, verbose = TRUE)
file |
path to vcf file. |
vcfR |
object of class |
verbose |
if reading from file, whether to read in verbose manner. |