In this vignette, to illustrate the functionality of the R package paneljudge, we inspect the performance of an example panel using example marker allele frequencies based on Plasmodium falciparum data from Colombia. Example frequencies are distributed with the package; data used to estimate them are not.
First we provide a summary of the example panel in terms of its marker count, and its markers’ positions, diversities, effective cardinalities and cardinalities. These quantities provide a rough idea of how informative the panel is regarding relatedness inference (1). Second, we simulate and fit data under the hidden Markov model (HMM) described in (1). The second step provides a more direct summary of how informative the panel is regarding relatedness inference. In the first step, the function compute_diversities() returns a warning about allele frequencies not quite summing to one and about some markers being uninformative. Since we deem neither to be problematic, we suppress warnings in the rest of the examples.
At present, our example analyses do not consider model specification; do not account for uncertainty around input allele frequency estimates; do not consider relatedness between pairs of haploid genotypes simulated using different allele frequencies; do not account for missing marker data. Otherwise stated, at present, the performance of a panel is judged in its most favourable light; it will likely perform less well in reality.
To view all the source code for this vignette, including that used to
generate the plots, simply type
edit(vignette("paneljudge_example"))
.
In this vignette, a “marker” refers to a genetic polymorphism that can be typed in order to characterise population genetic variation, while “allele” is used to refer to a realisation of a marker for a particular individual. Examples of markers include single nucleotide polmorphisms (SNPs), microsatellites, and microhaplotype markers. The latter refers to a region of the genome that spans two or more contiguous SNPs and is small enough to be typed using an amplicon (a chemical by-product of genotyping, not a biological entity, used to type single SNPs as well as SNP multiplets). Examples of alleles include a nucleotide at a SNP, an amino acid at a non-synonymous SNP, a repeat length at a microsatellite and a sequence of nucleotides at a microhaplotype marker. Arguably, one could use SNP multiplet to refer to a marker that spans two more SNPs, and microhaplotype to refer to an allele at a marker that spans two more SNPs. However, for consistency with (1), microhaplotype is used to refer to a marker spanning two or more contiguous SNPs.
We use “markers” and “alleles” in order to be as general as possible. The HMM on which package paneljudge relies does not care if markers are SNPs, microsatellites or microhaplotypes as long as they can be modelled as categorical random variables whose realisations (i.e. alleles) are unordered. To this end, microsatellites and microhaplotypes are treated as point polymorphisms and their alleles are enumerated. Enumeration serves only to label the realisations. As such, the ordinal nature of microsatellite alleles is not accounted for even if repeat lengths are enumerated in increasing order.
For relatedness inference, panels with many, evenly spaced and highly diverse markers are informative (1).
The example panel has 126 markers distributed across 14 of 14 chromosomes:
The markers in the example panel are microhaplotypes: short regions of the genome (whose length can be spanned by a single amplicon) that contain two of more SNPs. The positions of four of the microhaplotypes (TRAP, AMA1, CSP and SERA2) are annotated. Although the microhaplotypes are polymorphisms with physical length, we plot them as points (using their mid-points), since their lengths are barely visible on the above scale due to their small size (average 212 base pairs; standard deviation: 29 base pairs). Some markers are very close together (minimum distance between marker mid-points 295.5 base pairs). However, none overlap:
#> Do any marker starts preceed the stop of the previous marker? FALSE
Next we compute marker diversity, effective cardinality, cardinality and maximum potential diversity. Using paneljudge functions, marker diversities and effective cardinalities were calculated from example marker allele frequencies without accounting for the fact that they are estimates (e.g. without correcting for finite sample sizes or considering uncertainty). Marker cardinality is either a theoretical maximum (e.g. 23 for a microhaplotype spanning three biallelic single nucleotide polymorphisms), or an observed count. Here, the observed count is calculated (i.e. the per-marker count of alleles with non-zero frequency). Maximum diversity is a function of cardinality assuming equifrequent alleles.
# Compute marker summaries using paneljudge functions
diversities <- compute_diversities(frequencies$Colombia)
#> Warning: Some markers have frequencies whose sum deviates from one by up to
#> 1.00000000013978e-06.
#> Warning: Some markers are uninformative (have allele frequencies equal to one).
eff_cardinalities <- compute_eff_cardinalities(frequencies$Colombia, warn_fs = FALSE)
# Per-marker allele count:
cardinalities <- apply(frequencies$Colombia, 1, function(x){sum(x > 0)})
# Compute max. diversity as a function of cardinality
max_diversities <- sapply(1:max(cardinalities), function(x){1-sum(rep(1/x, x)^2)})
Marker diversity is the probability of picking two different alleles per marker. For an outbred diploid, it is equal to heterozygosity. Highly diverse markers are markers with many equifrequent alleles. For relatedness inference, the potential informativeness of a marker thus scales with its allele count (its cardinality). Alleles are rarely equifrequent, however. Effective cardinality is an allele count that accounts for inequifrequent alleles: it is equal to cardinality if all alleles are equifrequent; otherwise, it is less. Across the 126 markers of the example panel, marker diversity and effective cardinality range from 0 to 0.74 and 1 to 3.92, respectively:
Both marker diversity and effective cardinality are measures of marker informativeness regarding relatedness inference. Personally, I find effective cardinality more intuitive because potential effective cardinality scales linearly with cardinality, whereas potential diversity does not. Otherwise stated, diversity is less convenient because its range of possible values does not scale linearly with cardinality, making diversity hard to compare across markers with different cardinalities. The average effective cardinality multiplied by marker count, 214 in this case, provides a rough summary of the informativeness of the entire panel.
Using the example panel and marker allele frequencies, for various data-generating values of relatedness, r, we first simulate some data on a pair of haploid genotypes; second estimate r (and a switch rate parameter, k, whose data-generating value is fixed in this vignette; see footnote); and third compute 95% confidence intervals (CIs) around the estimates r̂ and k̂:
#=============================================================
# Simulate n genotype pairs for various r and fixed k
#=============================================================
# Data-generating relatedness values (named s.t. mle_CIs are named)
rs <- c("0.01"=0.01, "0.25"=0.25, "0.50"=0.50, "0.75"=0.75, "0.99"=0.99)
k <- 5 # Data-generating switch rate parameter value
n <- 5 # Number of pairs to per simulate per r in rs
fs <- frequencies$Colombia # example marker allele frequencies
ds <- markers$distances # distances between marker mid-points
mle_CIs <- lapply(rs, function(r) {
sapply(1:n, function(i) {
# First simulate genotype pair
Ys <- simulate_Ys(fs, ds, k, r, warn_fs = FALSE)
# Second, estimate r and k
krhat <- estimate_r_and_k(fs, ds, Ys, warn_fs = FALSE)
# Third, compute confidence intervals (CIs)
CIs <- compute_r_and_k_CIs(fs, ds, khat = krhat['khat'], rhat = krhat['rhat'], warn_fs = FALSE)
# End of function
return(c(krhat['rhat'], CIs['rhat',]))
})
})
#> Loading required package: foreach
#> Loading required package: rngtools
#> Warning in estimate_r_and_k(fs, ds, Ys, warn_fs = FALSE): Some per-marker
#> allele counts exceed per-marker non-zero allele frequencies. Data are
#> permissible due to non-zero epsilon.
#> Warning in estimate_r_and_k(fs, ds, Ys, warn_fs = FALSE): Some per-marker
#> allele counts exceed per-marker non-zero allele frequencies. Data are
#> permissible due to non-zero epsilon.
#> Warning in estimate_r_and_k(fs, ds, Ys, warn_fs = FALSE): Some per-marker
#> allele counts exceed per-marker non-zero allele frequencies. Data are
#> permissible due to non-zero epsilon.
#> Warning in estimate_r_and_k(fs, ds, Ys, warn_fs = FALSE): Some per-marker
#> allele counts exceed per-marker non-zero allele frequencies. Data are
#> permissible due to non-zero epsilon.
For each data-generating value of relatedness we simulated 5 haploid genotype pairs. The 95% CIs around the relatedness estimates show that, despite considerable uncertainty, at the 95% confidence level the panel is informative regarding relatedness inference across a range of data-generating values (an uninformative panel at the 95% confidence level would have 95% CIs that span the entire 0 to 1 range):
The variation in r̂ around the data-generating r (black horizontal line) is partly due to limited panel informativeness and partly due to the finite length of the genome, i.e. Mendelian sampling (2). Whole genome simulation is required to partition these two sources of variation; see footnote below.
Instead of plotting the CIs directly, we can plot the CI widths as a indicator of informativeness (an uninformative CI having width one). Since 5 haploid genotype pairs were simulated per data-generating value of r, we can also compute and plot the root mean square error (RMSE) as an measure of informativeness.
CI_widths <- sapply(mle_CIs, function(x){ # Calculate CI widths
x["97.5%",] - x["2.5%",]
})
RMSEs <- sapply(names(mle_CIs), function(r){ # Calculate RMSEs
sqrt(mean((mle_CIs[[r]] - as.numeric(r))^2))
})
Relatedness, r, is defined as a probability of identity-by-descent (IBD) averaged over the genome (1). It differs to realised relatedness, which is the fraction of the genome that is IBD, due to the finite length of the genome (i.e. due to Mendelian sampling (2)): for a genome of finite length many realised relatednesses are compatible with a given probability of IBD, meaning that there will always be some variance in realised relatedness around r. Consequently, the RMSE of r̂ compared to the data-generating r (and the width of a CI computed using the parametric bootstrap) asymptotes to some small but non-zero value when more and more markers are typed (1).
Considering a panel with a finite marker count, variation in r̂ around the data-generating r is partly due to the informativeness of the panel and partly due to Mendelian sampling (2). To set-aside variation due to Mendelian sampling and thus focus entirely on panel performance, we could compare r̂ to realised relatedness (e.g. when we compute RMSE). However, this is computationally expensive as it requires whole genome simulation:
The switch rate parameter, k, controls the rate at which the Markov chain switches between the latent states (IBD and not IBD) in the HMM (1). When considering data on a limited number of markers in a panel (versus whole genome sequence data), we treat the switch rate as a nuisance parameter (since it has little effect on the estimate, r̂, yet is near-on impossible to estimate precisely without whole genome sequence data), focusing instead on the relatedness parameter r.