Title: | Judge the performance of a panel of genetic markers using simulated data |
---|---|
Description: | An R package to judge the performance of a panel of genetic markers using data simulated for pairs of haploid genotypes. The data are simulated under a hidden Markov model of relatedness (described in Taylor, A.R., Jacob, P.E., Neafsey, D.E. and Buckee, C.O., 2019. Estimating relatedness between malaria parasites. Genetics, 212(4), pp.1337-1351) using allele frequency estimates provided by the user and inter-marker distances. The markers are treated as categorical random variables whose realisations (alleles) are unordered. The effective cardinalities and diversities of the markers can be computed using the input allele frequency estimates. Panel performance can be judged in terms of the root mean square error (RMSE) and confidence interval width of estimated relatedness, where relatedness is estimated under the same model used to simulate the data. At present, the examples we provide do not consider model misspecification; do not account for uncertainty around input allele frequency estimates; do not consider relatedness between pairs of haploid genotypes simulated using different allele frequencies; and do not account for marker drop-out (markers that fail to produce useful data, e.g. because they are monomorphic). In other words, in the examples provided, the performance of a panel is judged in its most favourable light; it will likely perform less well in reality. |
Authors: | Aimee Taylor [aut, cre] , Pierre Jacob [aut] |
Maintainer: | Aimee Taylor <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0.9000 |
Built: | 2024-12-06 04:25:59 UTC |
Source: | https://github.com/aimeertaylor/paneljudge |
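The two performance measures named in the description, RMSE and confidence interval width, can be sketched in a few lines of base R. All values below are made up for illustration; none of them come from the package or its data.

```r
# Hypothetical data-generating relatedness and some estimates of it
r_true <- 0.25
rhats <- c(0.20, 0.31, 0.27, 0.18, 0.25)

# Root mean square error of the estimates around the data-generating value
rmse <- sqrt(mean((rhats - r_true)^2))

# Hypothetical interval bounds; the width is upper minus lower
ci <- c(lower = 0.12, upper = 0.41)
ci_width <- unname(ci["upper"] - ci["lower"])

print(rmse)
print(ci_width)
```

A smaller RMSE and a narrower interval both indicate a better-performing panel for the relatedness value under consideration.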
Lengths in base pairs of chromosomes Pf3D7_01_v3 to Pf3D7_14_v3 of the 3D7 Plasmodium falciparum reference genome listed on PlasmoDB (see url below).
chr_lengths
A numeric vector named by the chromosome number.
https://plasmodb.org/plasmo/showApplication.do
Given a matrix of marker allele frequencies, compute_diversities returns the
diversity of each marker, i.e. one value per row of the matrix. Each diversity
is calculated as described in [1], i.e. without correcting for finite sample
sizes or considering uncertainty.
compute_diversities(fs, warn_fs = TRUE)
fs |
Matrix of marker allele frequencies, with one row per marker and one column per allele, laid out as described for the frequencies data set. |
warn_fs |
Logical indicating if the function should return warnings following allele frequency checks. |
Diversities of the markers (one value per marker).
Taylor, A.R., Jacob, P.E., Neafsey, D.E. and Buckee, C.O., 2019. Estimating relatedness between malaria parasites. Genetics, 212(4), pp.1337-1351.
compute_diversities(fs = frequencies$Colombia)
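As a rough illustration of what a per-marker diversity looks like, here is a minimal sketch that assumes diversity is the probability that two alleles drawn at random from the frequencies differ, i.e. one minus the sum of squared frequencies. That definition is an assumption made for this sketch (this manual does not state the formula used in [1]), and sketch_diversities and fs_toy are hypothetical names, not part of paneljudge.

```r
# Hypothetical helper (not the package function): one minus the sum of
# squared allele frequencies per marker, with no finite-sample correction
sketch_diversities <- function(fs) {
  1 - rowSums(fs^2)
}

# Toy frequency matrix: rows are markers, columns are alleles,
# zero-padded on the right as in the frequencies data set
fs_toy <- rbind(
  marker_a = c(0.5, 0.5, 0),
  marker_b = c(0.687075, 0.312925, 0)
)

print(sketch_diversities(fs_toy))
```

Under this definition, zero-padded columns contribute nothing, and a marker with two equifrequent alleles has diversity 0.5.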
Given a matrix of marker allele frequencies, compute_eff_cardinalities returns
the effective cardinality of each marker, i.e. one value per row of the
matrix. Effective cardinalities are per-marker allele counts that account for
inequifrequent alleles. Each effective cardinality is calculated as described
in [1], i.e. without correcting for finite sample sizes or considering
uncertainty.
compute_eff_cardinalities(fs, warn_fs = TRUE)
fs |
Matrix of marker allele frequencies, with one row per marker and one column per allele, laid out as described for the frequencies data set. |
warn_fs |
Logical indicating if the function should return warnings following allele frequency checks. |
Effective cardinalities of the markers (one value per marker).
Taylor, A.R., Jacob, P.E., Neafsey, D.E. and Buckee, C.O., 2019. Estimating relatedness between malaria parasites. Genetics, 212(4), pp.1337-1351.
compute_eff_cardinalities(fs = frequencies$Colombia)
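To show how an allele count can "account for inequifrequent alleles", here is a minimal sketch that assumes the effective cardinality is the reciprocal of the sum of squared allele frequencies (the inverse Simpson index). That form is an assumption made for this sketch and is not confirmed by this manual; sketch_eff_cardinalities and fs_toy are hypothetical names.

```r
# Hypothetical helper (not the package function): reciprocal of the sum of
# squared allele frequencies per marker
sketch_eff_cardinalities <- function(fs) {
  1 / rowSums(fs^2)
}

# Toy frequency matrix: rows are markers, columns are alleles
fs_toy <- rbind(
  equal  = c(0.25, 0.25, 0.25, 0.25),  # four equifrequent alleles
  skewed = c(0.97, 0.01, 0.01, 0.01),  # four alleles, one dominant
  padded = c(0.50, 0.50, 0.00, 0.00)   # two alleles, zero-padded
)

print(sketch_eff_cardinalities(fs_toy))
```

With equifrequent alleles the effective cardinality equals the raw allele count; the more skewed the frequencies, the closer it falls towards one.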
Given a matrix of marker allele frequencies, a vector of inter-marker
distances, and estimates of the relatedness and switch rate parameters,
compute_r_and_k_CIs returns confidence intervals around the parameter
estimates. The default confidence is 95%. The intervals are approximate.
They are generated using parametric bootstrap draws of the parameter
estimates, based on genotype calls for haploid genotype pairs simulated under
the HMM described in [1] using the input parameter estimates. Both the quality
of the approximation and the compute time increase with the number of
parametric bootstrap draws, which are generated in parallel using a specified
number of cores.
compute_r_and_k_CIs(
  fs,
  ds,
  khat,
  rhat,
  confidence = 95,
  nboot = 100,
  core_count = parallel::detectCores() - 1,
  warn_fs = TRUE,
  ...
)
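fs |
Matrix of marker allele frequencies, with one row per marker and one column per allele, laid out as described for the frequencies data set. |
ds |
Vector of inter-marker distances (e.g. markers$distances). |
khat |
Estimate of the switch rate parameter, k. |
rhat |
Estimate of the relatedness parameter, r. |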
confidence |
Confidence level (percentage) of the confidence interval (default 95%). |
nboot |
Number of parametric bootstrap draws from which to compute the confidence interval. Larger values provide a better approximation but prolong computation. |
core_count |
Number of cores to use to do computation. Set to 2 or more for parallel computation. Defaults to the number detected on the machine minus one. |
warn_fs |
Logical indicating if the function should return warnings following allele frequency checks. |
... |
Arguments to be passed to |
Confidence intervals around the input estimates of the switch rate parameter,
khat, and the relatedness parameter, rhat.
Taylor, A.R., Jacob, P.E., Neafsey, D.E. and Buckee, C.O., 2019. Estimating relatedness between malaria parasites. Genetics, 212(4), pp.1337-1351.
# First, simulate some data
simulated_Ys <- simulate_Ys(fs = frequencies$Colombia, ds = markers$distances, k = 5, r = 0.25)

# Second, estimate the switch rate parameter, k, and relatedness parameter, r
krhat <- estimate_r_and_k(fs = frequencies$Colombia, ds = markers$distances, Ys = simulated_Ys)

# Third, compute confidence intervals (CIs)
compute_r_and_k_CIs(fs = frequencies$Colombia, ds = markers$distances, khat = krhat['khat'], rhat = krhat['rhat'])
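The parametric bootstrap behind these intervals can be illustrated without the HMM. The sketch below swaps in a stand-in model (independent Bernoulli trials) purely for illustration: it simulates data from the fitted model, re-estimates the parameter on each simulated data set, and takes percentiles of the re-estimates. The mapping from a percentage confidence level to quantile probabilities is an assumption about how such an interval is typically formed, not a statement about the package internals, and the sketch runs serially rather than in parallel.

```r
set.seed(1)

# Stand-in fitted model (an assumption for illustration, not the HMM in [1]):
# n Bernoulli trials with estimated success probability p_hat
n <- 100
p_hat <- 0.3
nboot <- 500
confidence <- 95

# Parametric bootstrap: simulate from the fitted model, then re-estimate
p_boot <- replicate(nboot, mean(rbinom(n, size = 1, prob = p_hat)))

# Percentile interval at the requested confidence level
alpha <- 1 - confidence / 100
ci <- quantile(p_boot, probs = c(alpha / 2, 1 - alpha / 2))

print(ci)
```

Increasing nboot smooths the empirical quantiles at the cost of more simulation and re-estimation, which is the trade-off described for the nboot argument.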
Given a matrix of marker allele frequencies, a vector of inter-marker
distances, and a matrix of genotype calls for a pair of haploid genotypes,
estimate_r_and_k returns the maximum likelihood estimates of the relatedness
parameter, r, and the switch rate parameter, k, under the HMM described in
[1].
estimate_r_and_k(
  fs,
  ds,
  Ys,
  epsilon = 0.001,
  rho = 7.4 * 10^(-7),
  kinit = 50,
  rinit = 0.5,
  warn_fs = TRUE
)
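fs |
Matrix of marker allele frequencies, with one row per marker and one column per allele, laid out as described for the frequencies data set. |
ds |
Vector of inter-marker distances (e.g. markers$distances). |
Ys |
Matrix of genotype calls for a pair of simulated haploid genotypes, with one row per marker and one column per genotype, as returned by simulate_Ys. |
epsilon |
Genotyping error (default 0.001). |
rho |
Recombination rate (default 7.4 * 10^(-7)). |
kinit |
Switch rate parameter value used to initialise optimization of the negative log-likelihood. |
rinit |
Relatedness parameter value used to initialise optimization of the negative log-likelihood. |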
warn_fs |
Logical indicating if the function should return warnings following allele frequency checks. |
Maximum likelihood estimates of the switch rate parameter, k, and the
relatedness parameter, r.
Taylor, A.R., Jacob, P.E., Neafsey, D.E. and Buckee, C.O., 2019. Estimating relatedness between malaria parasites. Genetics, 212(4), pp.1337-1351.
Miles, A., Iqbal, Z., Vauterin, P., Pearson, R., Campino, S., Theron, M., Gould, K., Mead, D., Drury, E., O'Brien, J. and Rubio, V.R., 2016. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome research, 26(9), pp.1288-1299.
# First, simulate some data
simulated_Ys <- simulate_Ys(fs = frequencies$Colombia, ds = markers$distances, k = 5, r = 0.25)

# Second, estimate the switch rate parameter, k, and relatedness parameter, r
estimate_r_and_k(fs = frequencies$Colombia, ds = markers$distances, Ys = simulated_Ys)
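Maximum likelihood estimation by minimising a negative log-likelihood can be sketched with a deliberately simplified stand-in model. The sketch below assumes markers are independent (so there is no switch rate parameter k) and that there is no genotyping error; it is not the HMM of [1] and not the package's implementation. Under that assumption, the probability of observing alleles a and b at a marker is r * f(a) if a equals b, plus (1 - r) * f(a) * f(b) in either case. All object names are hypothetical.

```r
set.seed(1)

# Toy panel: m biallelic markers with hypothetical allele frequencies
m <- 200
f1 <- runif(m, 0.2, 0.8)
fs_toy <- cbind(f1, 1 - f1)

# Simulate a pair under the independence model with relatedness r_true:
# at each marker the pair is identical by descent with probability r_true
r_true <- 0.5
ibd <- rbinom(m, 1, r_true) == 1
draw_allele <- function(t) sample.int(2, 1, prob = fs_toy[t, ]) - 1L
y1 <- vapply(seq_len(m), draw_allele, integer(1))
y2 <- vapply(seq_len(m), draw_allele, integer(1))
y2[ibd] <- y1[ibd]

# Negative log-likelihood of r under the independence model
neg_loglik <- function(r) {
  fa <- fs_toy[cbind(seq_len(m), y1 + 1)]
  fb <- fs_toy[cbind(seq_len(m), y2 + 1)]
  -sum(log(r * fa * (y1 == y2) + (1 - r) * fa * fb))
}

# One-dimensional minimisation over the unit interval
fit <- optimize(neg_loglik, interval = c(0, 1))
r_hat <- fit$minimum
print(r_hat)
```

The package itself estimates two parameters jointly from initial values kinit and rinit; this one-parameter sketch only illustrates the likelihood-minimisation idea.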
A data set of allele frequencies for four countries: Colombia, French Guiana, Mali and Senegal.
frequencies
Each entry of the list is a matrix, fs say, with one row per marker and Kmax
variables, where Kmax is the maximum cardinality (per-marker allele count)
observed over all markers. If, for any marker t, the maximum cardinality
exceeds that of the t-th marker (i.e. if Kmax > Kt, where Kt is the
cardinality of the t-th marker), then all fs[t,1:Kt] are in (0,1] and all
fs[t,(Kt+1):Kmax] are zero. For example, for PF3D7_0103600 in Colombia,
Kt = 2 and
frequencies$Colombia["PF3D7_0103600",] = (0.687075, 0.312925, 0, ..., 0).
Frequency (numeric) of the first allele
...
Frequency (numeric) of the Kmax-th allele
see https://github.com/artaylor85/paneljudge/blob/master/data_raw/Process_GTseq.R
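The zero-padded layout described above can be checked mechanically. The following minimal sketch builds a toy matrix in that layout and verifies that each row sums to one and that the positive entries come before the zeros. The matrix fs_toy and its row names are hypothetical; only the first row reuses the frequencies quoted above.

```r
# Toy frequency matrix in the layout described above:
# rows are markers, columns are alleles, zero-padded on the right
fs_toy <- rbind(
  marker_1 = c(0.687075, 0.312925, 0, 0),
  marker_2 = c(0.1, 0.2, 0.3, 0.4),
  marker_3 = c(0.5, 0.25, 0.25, 0)
)

Kmax <- ncol(fs_toy)        # maximum cardinality over all markers
Kt <- rowSums(fs_toy > 0)   # per-marker cardinality

# Each row should sum to one (up to floating point error)
sums_to_one <- abs(rowSums(fs_toy) - 1) < 1e-8

# Entries 1..Kt should be positive and entries (Kt+1)..Kmax should be zero
zero_padded <- vapply(seq_len(nrow(fs_toy)), function(t) {
  all(fs_toy[t, seq_len(Kt[t])] > 0) &&
    all(fs_toy[t, seq_len(Kmax) > Kt[t]] == 0)
}, logical(1))

print(Kt)
print(all(sums_to_one) && all(zero_padded))
```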
A data set of marker attributes for markers pertaining to the example GTseq panel.
markers
A data frame with 126 rows and 8 variables:
Name (character) of the microhaplotype marker ("Amplicon" because typed using an amplicon)
Chromosome (character) of the microhaplotype marker
First base pair (integer) of the microhaplotype marker
Last base pair (integer) of the microhaplotype marker
Length (integer) of the microhaplotype marker in base pairs
Mid-point (numeric) of the microhaplotype marker
Chromosome (numeric) of the microhaplotype marker
Inter mid-point distance (numeric) between the microhaplotype marker and its subsequent neighbour (accessed as markers$distances in the examples)
see https://github.com/artaylor85/paneljudge/blob/master/data_raw/Process_GTseq.R
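To make the relationship between marker positions and inter mid-point distances concrete, here is a minimal sketch on a toy data frame. The column names chrom, start, stop and pos are hypothetical (only distances is confirmed by the examples in this manual), and setting the distance to Inf for the last marker on each chromosome is an assumption made for this sketch, not a documented property of the data set.

```r
# Toy marker table with hypothetical column names and positions
toy <- data.frame(
  chrom = c(1, 1, 1, 2, 2),
  start = c(100, 5000, 20000, 300, 9000),
  stop  = c(300, 5250, 20180, 520, 9200)
)

# Mid-point of each marker
toy$pos <- (toy$start + toy$stop) / 2

# Distance from each marker's mid-point to that of its subsequent neighbour;
# Inf when the next marker is on another chromosome or does not exist
# (an assumption for this sketch)
n <- nrow(toy)
next_same_chrom <- c(toy$chrom[-1] == toy$chrom[-n], FALSE)
toy$distances <- ifelse(next_same_chrom, c(diff(toy$pos), NA), Inf)

print(toy)
```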
Given a matrix of marker allele frequencies, a vector of inter-marker
distances, a relatedness parameter, and a switch rate parameter, simulate_Ys
returns genotype calls for a pair of haploid genotypes, simulated under the
HMM described in [1].
simulate_Ys(fs, ds, k, r, epsilon = 0.001, rho = 7.4 * 10^(-7), warn_fs = TRUE)
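fs |
Matrix of marker allele frequencies, with one row per marker and one column per allele, laid out as described for the frequencies data set. |
ds |
Vector of inter-marker distances (e.g. markers$distances). |
k |
Data-generating switch rate parameter. |
r |
Data-generating relatedness parameter. |
epsilon |
Genotyping error (default 0.001). |
rho |
Recombination rate (default 7.4 * 10^(-7)). |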
warn_fs |
Logical indicating if the function should return warnings following allele frequency checks. |
Simulated genotype calls for a pair of haploid genotypes, as modelled in [1].
Specifically, a matrix with one row per marker and 2 columns, where each
column contains a haploid genotype. For every marker t, alleles are
enumerated 0 to Kt - 1, where Kt is the cardinality (per-marker allele count)
of the t-th marker. For example, if Kt = 2, both Ys[t,1] and Ys[t,2] are
either 0 or 1.
Taylor, A.R., Jacob, P.E., Neafsey, D.E. and Buckee, C.O., 2019. Estimating relatedness between malaria parasites. Genetics, 212(4), pp.1337-1351.
Miles, A., Iqbal, Z., Vauterin, P., Pearson, R., Campino, S., Theron, M., Gould, K., Mead, D., Drury, E., O'Brien, J. and Rubio, V.R., 2016. Indels, structural variation, and recombination drive genomic diversity in Plasmodium falciparum. Genome research, 26(9), pp.1288-1299.
simulate_Ys(fs = frequencies$Colombia, ds = markers$distances, k = 10, r = 0.5)
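The general shape of such a simulation can be sketched in base R. The sketch below is not the package's implementation: it assumes a two-state Markov chain over identity by descent (IBD) in which the probability of a state change between adjacent markers is governed by 1 - exp(-k * rho * d), scaled by r or 1 - r, and it ignores genotyping error entirely (no epsilon). Both the transition form and the treatment of ds[t - 1] as the distance between markers t - 1 and t are assumptions made for illustration; sketch_simulate_Ys and the toy inputs are hypothetical.

```r
set.seed(2)

# Hypothetical sketch (not the package function): simulate a pair of
# haploid genotypes along a chain of markers, without genotyping error
sketch_simulate_Ys <- function(fs, ds, k, r, rho = 7.4e-7) {
  m <- nrow(fs)

  # Hidden IBD states: start IBD with probability r, then follow the chain
  ibd <- logical(m)
  ibd[1] <- runif(1) < r
  for (t in seq_len(m)[-1]) {
    p_switch <- 1 - exp(-k * rho * ds[t - 1])  # assumed transition scaling
    p_ibd <- if (ibd[t - 1]) 1 - (1 - r) * p_switch else r * p_switch
    ibd[t] <- runif(1) < p_ibd
  }

  # Alleles are enumerated from 0; zero-frequency alleles are never drawn
  draw_allele <- function(t) sample.int(ncol(fs), 1, prob = fs[t, ]) - 1L
  y1 <- vapply(seq_len(m), draw_allele, integer(1))
  y2 <- vapply(seq_len(m), draw_allele, integer(1))

  # Where the pair is IBD, the second genotype copies the first
  y2[ibd] <- y1[ibd]
  cbind(y1, y2)
}

# Toy inputs: three markers, zero-padded frequencies, toy distances
fs_toy <- rbind(c(0.5, 0.5, 0), c(0.2, 0.3, 0.5), c(0.687075, 0.312925, 0))
ds_toy <- c(1000, 5000, Inf)

Ys_toy <- sketch_simulate_Ys(fs = fs_toy, ds = ds_toy, k = 10, r = 0.5)
print(Ys_toy)
```

The result has one row per marker and two columns, matching the shape described above; markers with two alleles only ever yield calls of 0 or 1.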