Title: | What the Package Does (One Line, Title Case) |
---|---|
Description: | What the package does (one paragraph). |
Authors: | First Last [aut, cre] (YOUR-ORCID-ID) |
Maintainer: | First Last <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.0.0.9000 |
Built: | 2024-12-08 06:34:48 UTC |
Source: | https://github.com/aimeertaylor/Pv3Rs |
Finds all possible recurrence states for each recurrence compatible with the relationship graph, then takes the Cartesian product to get all vectors of recurrence states. For a recurrence to be a recrudescence, all edges between the recurrent infection and the immediately preceding infection must be clonal edges. For a recurrence to be a reinfection, all edges between the recurrent infection and any preceding infection must be stranger edges. All recurrences may possibly be relapses.
compatible_rstrs(RG, gs_per_ts)
compatible_rstrs(RG, gs_per_ts)
RG |
Relationship graph; see |
gs_per_ts |
List of vectors of genotypes for each infection. |
Vector of strings (consisting of "C", "L", "I" for recrudescence, relapse, reinfection respectively) compatible with relationship graph.
MOIs <- c(2, 2, 1) RG <- enumerate_RGs(MOIs, igraph = TRUE)[[175]] gs_per_ts <- split(paste0("g", 1:sum(MOIs)), rep(1:length(MOIs), MOIs)) # 1st recurrence can't be recrudescence, 2nd recurrence can't be reinfection plot_RG(RG, edge.curved = 0.2) compatible_rstrs(RG, gs_per_ts) # "LL" "IL" "LC" "IC"
MOIs <- c(2, 2, 1) RG <- enumerate_RGs(MOIs, igraph = TRUE)[[175]] gs_per_ts <- split(paste0("g", 1:sum(MOIs)), rep(1:length(MOIs), MOIs)) # 1st recurrence can't be recrudescence, 2nd recurrence can't be reinfection plot_RG(RG, edge.curved = 0.2) compatible_rstrs(RG, gs_per_ts) # "LL" "IL" "LC" "IC"
Compute posterior probabilities of P. vivax recurrent states relapse, reinfection and recrudescence using genetic data.
Please note, the progress bar does not necessarily increment uniformly (see details below); it may seem stuck when the code is still running.
compute_posterior( y, fs, prior = NULL, MOIs = NULL, return.RG = FALSE, return.logp = FALSE )
compute_posterior( y, fs, prior = NULL, MOIs = NULL, return.RG = FALSE, return.logp = FALSE )
y |
Observed data in the form of a list of lists. The outer list is a
list of episodes in increasing chronological order. The inner list is a list of named
markers per episode. Episode names can be specified, but they are not used.
Markers must be named. Each episode must list the same markers. If not all
markers are typed per episode, data on untyped markers can be encoded as
missing (see below). For each marker, one must specify an allelic vector: a
set of distinct alleles detected at that marker. |
fs |
List of per-marker allele frequency vectors. Names of the list must match
with the marker names in |
prior |
Matrix of prior probabilities of the recurrence states for each
recurrent episode. Each row corresponds to an episode in increasing
chronological order. The column names must be C, L, and I for
recrudescence, relapse and reinfection respectively. Row names can be
specified but they are not used. If |
MOIs |
Multiplicity of infection for each episode. If MOIs are not
provided, the most parsimonious MOIs compatible with the data will be used;
see |
return.RG |
Boolean for whether to return the relationship graphs,
defaults to |
return.logp |
Boolean for whether to return the log-likelihood for each
relationship graph, defaults to |
compute_posterior()
computes posterior probabilities proportional to the likelihood multiplied by the prior. The likelihood
sums over
ways to phase allelic data onto haploid genotypes
graphs of relationships between haploid genotypes
ways to partition alleles into clusters of identity-by-descent
compute_posterior()
expects each per-episode, per-marker allelic
vector to be a set of distinct alleles. Allele repeats at markers with
observed data, and NA
repeats at markers with missing data, are
removed in a data pre-processing step. NA
s in allelic vectors that
also contain non-NA
values are removed in a data pre-processing step.
We enumerate all possible relationship graphs between haploid genotypes,
where pairs of genotypes can either be clones, siblings, or strangers. The
likelihood of a sequence of recurrence states can be determined from the
likelihood of all relationship graphs compatible with said sequence. More
details on the enumeration and likelihood calculation of relationship graphs
can be found in enumerate_RGs
and RG_inference
respectively. For each relationship graph, the model sums over all possible
identity-by-descent partitions. Because some relationship graphs are
compatible with more identity-by-descent partitions than others, the log
p(Y|RG) progress bar does not necessarily increment uniformly.
Notable model assumptions and limitations:
Perfect detection of alleles (no genotyping error)
No within-host de novo mutations
Parasites are outbred
All siblings are regular siblings
Relationship graphs compatible with a given sequence of recurrent states are equally likely a priori
We do not recommend running 'compute_posterior() when the total genotype count (sum of per-episode multiplicities of infection) exceeds eight, because there are too many relationship graphs.
Presently, Pv3Rs
only supports prevalence data (categorical data that
signal the detection of alleles), not quantitative data (data that signal the proportional
abundance of the alleles detected).
List containing:
marg
Matrix of marginal posterior probabilities of
recurrent states for each recurrence, one row per recurrence with "C"
for recrudescence, "L" for relapse, and "I" for reinfection. marg
is
a simple summary of joint
(see next): each marginal probability of a
recurrent state is a sum over a subset of joint probabilities of
recurrent state sequences. For example, the marginal probability of "C"
at the first of two recurrences is a sum over the joint probabilities
of "CC",
"CL", and "CI".
joint
Vector of joint posterior probabilities of each recurrent state sequence, where within a sequence "C" denotes recrudescence, "L" denotes relapse, and "I" denotes reinfection.
RGs
List of relationship graphs with their log-likelihoods stored.
Only returned if return.RG
is TRUE
. See
enumerate_RGs
.
# =========================================================================== # Example where alleles are named numerically # =========================================================================== # Data y <- list(enroll = list(m1 = c('3','2'), m2 = c('1','2')), recur1 = list(m1 = c('1','4'), m2 = c('1')), recur2 = list(m1 = c('1'), m2 = NA)) # Allele frequencies fs <- list(m1 = c('1' = 0.78, '2' = 0.14, '3' = 0.07, '4' = 0.01), m2 = c('1' = 0.27, '2' = 0.73)) # Compute posterior probabilities using default prior compute_posterior(y, fs) # =========================================================================== # Example where alleles are named arbitrarily and probabilities are plotted # =========================================================================== # Data y <- list(episode0 = list(marker1 = c("Tinky Winky", "Dipsy"), marker2 = c("Tinky Winky", "Laa-Laa", "Po")), episode1 = list(marker1 = "Tinky Winky", marker2 = "Laa-Laa")) # Allele frequencies fs <- list(marker1 = c("Tinky Winky" = 0.4, "Dipsy" = 0.6), marker2 = c("Tinky Winky" = 0.1, "Laa-Laa" = 0.1, "Po" = 0.8)) # Compute posterior probabilities using default prior posterior_probs <- compute_posterior(y, fs) # Plot posterior probabilities on the simplex plot_simplex(c("Recrudescence", "Relapse", "Reinfection"), 0.5) # Simplex xy <- project2D(posterior_probs$marg[1,]) # Project probabilities points(x = xy["x"], y = xy["y"], pch = 20) # Plot projected probabilities #============================================================================ # Demonstrating the return of the prior when all data are missing #============================================================================ # Data y_missing <- list(enroll = list(m1 = NA), recur1 = list(m1 = NA), recur2 = list(m1 = NA)) # Return of the prior suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1)))) # Return of the prior re-weighted to the exclusion of recrudescence suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1)), MOIs = c(1,2,3))) # (Recrudescing parasites are clones of previous blood-stage parasites. The # Pv3R model assumes no within-host de-novo mutations and perfect allele # detection. As such, recrudescence is incompatible with an MOI increase on # the preceding infection.) # =========================================================================== # Demonstrating the cosmetic-only nature of episode names # =========================================================================== # Data y <- list(enroll = list(m1 = NA), recur2 = list(m1 = NA), recur1 = list(m1 = NA)) # Use a non-uniform prior for the purpose of illustration prior <- matrix(c(0.2,0.2,0.6,0.7,0.1,0.2), byrow = TRUE, nrow = 2, dimnames = list(c("recur1_prior", "recur2_prior"), c("C", "L", "I"))) # Print posterior and prior, noting that "recur1_prior" is returned for # "recur2", and "recur2_prior" is returned for "recur1" suppressMessages(compute_posterior(y, fs = list(m1 = c(a = 1)), prior))$marg prior #============================================================================ # Demonstrating the informative nature of non-recurrent data #============================================================================ # Data and allele frequencies y_het <- list(list(m1 = c('1', '2')), list(m1 = NA)) y_hom <- list(list(m1 = '1'), list(m1 = NA)) fs = list(m1 = c('1' = 0.5, '2' = 0.5)) # The prior is not returned despite there being no recurrent data (see # vignette XXX to understand why) suppressMessages(compute_posterior(y = y_het, fs))$marg suppressMessages(compute_posterior(y = y_hom, fs, MOIs = c(2,1)))$marg #============================================================================ # Demonstrating the effect of increasingly large relationship graphs: the # marginal probability of the first recurrence changes slightly, albeit at a # decreasing rate, as the number of additional recurrences (all without data) # increases. The change is greatest when the observed allele is rare. #============================================================================ # Data for different recurrence counts where only the 1st recurrence has data ys <- list(scenario1 = list(enroll = list(m1 = "A"), recur1 = list(m1 = "A")), scenario2 = list(enroll = list(m1 = "A"), recur1 = list(m1 = "A"), recur2 = list(m1 = NA)), scenario3 = list(enroll = list(m1 = "A"), recur1 = list(m1 = "A"), recur2 = list(m1 = NA), recur3 = list(m1 = NA)), scenario4 = list(enroll = list(m1 = "A"), recur1 = list(m1 = "A"), recur2 = list(m1 = NA), recur3 = list(m1 = NA), recur4 = list(m1 = NA))) # Allele frequencies: smaller f_A leads to larger change f_A <- 0.1; fs <- list(m1 = c("A" = f_A, "Other" = 1-f_A)) # Compute posterior probabilities and extract marginal probabilities results <- lapply(ys, function(y) compute_posterior(y, fs)$marg) # Extract results for the first recurrence results_recur1 <- sapply(results, function(result) result[1,]) results_recur1 # Results are different for different scenarios # Visualise the change in the marginal probability of the first recurrence plot_simplex(c("Recrudescence", "Relapse", "Reinfection")) # Plot simplex xy <- apply(results_recur1, 2, project2D) # Project probabilities points(x = xy["x", ], y = xy["y", ], pch = "-", col = 1:4) # Plot projections legend("left", col = 1:4, pch = "-", pt.cex = 2, bty = "n", legend = 1:4, title = "Recurrence \n count") # legend
# =========================================================================== # Example where alleles are named numerically # =========================================================================== # Data y <- list(enroll = list(m1 = c('3','2'), m2 = c('1','2')), recur1 = list(m1 = c('1','4'), m2 = c('1')), recur2 = list(m1 = c('1'), m2 = NA)) # Allele frequencies fs <- list(m1 = c('1' = 0.78, '2' = 0.14, '3' = 0.07, '4' = 0.01), m2 = c('1' = 0.27, '2' = 0.73)) # Compute posterior probabilities using default prior compute_posterior(y, fs) # =========================================================================== # Example where alleles are named arbitrarily and probabilities are plotted # =========================================================================== # Data y <- list(episode0 = list(marker1 = c("Tinky Winky", "Dipsy"), marker2 = c("Tinky Winky", "Laa-Laa", "Po")), episode1 = list(marker1 = "Tinky Winky", marker2 = "Laa-Laa")) # Allele frequencies fs <- list(marker1 = c("Tinky Winky" = 0.4, "Dipsy" = 0.6), marker2 = c("Tinky Winky" = 0.1, "Laa-Laa" = 0.1, "Po" = 0.8)) # Compute posterior probabilities using default prior posterior_probs <- compute_posterior(y, fs) # Plot posterior probabilities on the simplex plot_simplex(c("Recrudescence", "Relapse", "Reinfection"), 0.5) # Simplex xy <- project2D(posterior_probs$marg[1,]) # Project probabilities points(x = xy["x"], y = xy["y"], pch = 20) # Plot projected probabilities #============================================================================ # Demonstrating the return of the prior when all data are missing #============================================================================ # Data y_missing <- list(enroll = list(m1 = NA), recur1 = list(m1 = NA), recur2 = list(m1 = NA)) # Return of the prior suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1)))) # Return of the prior re-weighted to the exclusion of recrudescence suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1)), MOIs = c(1,2,3))) # (Recrudescing parasites are clones of previous blood-stage parasites. The # Pv3R model assumes no within-host de-novo mutations and perfect allele # detection. As such, recrudescence is incompatible with an MOI increase on # the preceding infection.) # =========================================================================== # Demonstrating the cosmetic-only nature of episode names # =========================================================================== # Data y <- list(enroll = list(m1 = NA), recur2 = list(m1 = NA), recur1 = list(m1 = NA)) # Use a non-uniform prior for the purpose of illustration prior <- matrix(c(0.2,0.2,0.6,0.7,0.1,0.2), byrow = TRUE, nrow = 2, dimnames = list(c("recur1_prior", "recur2_prior"), c("C", "L", "I"))) # Print posterior and prior, noting that "recur1_prior" is returned for # "recur2", and "recur2_prior" is returned for "recur1" suppressMessages(compute_posterior(y, fs = list(m1 = c(a = 1)), prior))$marg prior #============================================================================ # Demonstrating the informative nature of non-recurrent data #============================================================================ # Data and allele frequencies y_het <- list(list(m1 = c('1', '2')), list(m1 = NA)) y_hom <- list(list(m1 = '1'), list(m1 = NA)) fs = list(m1 = c('1' = 0.5, '2' = 0.5)) # The prior is not returned despite there being no recurrent data (see # vignette XXX to understand why) suppressMessages(compute_posterior(y = y_het, fs))$marg suppressMessages(compute_posterior(y = y_hom, fs, MOIs = c(2,1)))$marg #============================================================================ # Demonstrating the effect of increasingly large relationship graphs: the # marginal probability of the first recurrence changes slightly, albeit at a # decreasing rate, as the number of additional recurrences (all without data) # increases. The change is greatest when the observed allele is rare. #============================================================================ # Data for different recurrence counts where only the 1st recurrence has data ys <- list(scenario1 = list(enroll = list(m1 = "A"), recur1 = list(m1 = "A")), scenario2 = list(enroll = list(m1 = "A"), recur1 = list(m1 = "A"), recur2 = list(m1 = NA)), scenario3 = list(enroll = list(m1 = "A"), recur1 = list(m1 = "A"), recur2 = list(m1 = NA), recur3 = list(m1 = NA)), scenario4 = list(enroll = list(m1 = "A"), recur1 = list(m1 = "A"), recur2 = list(m1 = NA), recur3 = list(m1 = NA), recur4 = list(m1 = NA))) # Allele frequencies: smaller f_A leads to larger change f_A <- 0.1; fs <- list(m1 = c("A" = f_A, "Other" = 1-f_A)) # Compute posterior probabilities and extract marginal probabilities results <- lapply(ys, function(y) compute_posterior(y, fs)$marg) # Extract results for the first recurrence results_recur1 <- sapply(results, function(result) result[1,]) results_recur1 # Results are different for different scenarios # Visualise the change in the marginal probability of the first recurrence plot_simplex(c("Recrudescence", "Relapse", "Reinfection")) # Plot simplex xy <- apply(results_recur1, 2, project2D) # Project probabilities points(x = xy["x", ], y = xy["y", ], pch = "-", col = 1:4) # Plot projections legend("left", col = 1:4, pch = "-", pt.cex = 2, bty = "n", legend = 1:4, title = "Recurrence \n count") # legend
The MOIs returned correspond to the parsimonious explanation of the data, i.e. the minimal MOIs required to support the observed allelic diversity.
determine_MOIs(y)
determine_MOIs(y)
y |
Observed data in the form of a list of lists. The outer list is a list of episodes in chronological order. The inner list is a list of named markers per episode. For each marker, one must specify an allelic vector: a set of distinct alleles detected at that marker. |
For a given episode, the minimal MOI required to support the observed allelic diversity is equal to the maximum number of per-marker alleles observed across markers.
At present, Pv3Rs
only supports prevalence data (categorical data that
signal the detection of alleles), not quantitative (proportional abundance)
data. Allele repeats at markers with observed data, and repeat NA
s at
markers with missing data, are removed. NA
s in allelic vectors that
also contain non-NA
values are ignored.
Returns a vector of the (minimum) multiplicity of infection (MOI) for each infection.
# two infections, three markers y <- list( list(m1 = c("A", "B"), m2 = c("A"), m3 = c("C")), list(m1 = c("B"), m2 = c("B", "C"), m3 = c("A", "B", "C")) ) determine_MOIs(y) # should be c(2, 3)
# two infections, three markers y <- list( list(m1 = c("A", "B"), m2 = c("A"), m3 = c("C")), list(m1 = c("B"), m2 = c("B", "C"), m3 = c("A", "B", "C")) ) determine_MOIs(y) # should be c(2, 3)
Note that this function is not tested for input with alleles mixed with NA.
enumerate_alleles(y.inf, gs.inf, use.sym = TRUE)
enumerate_alleles(y.inf, gs.inf, use.sym = TRUE)
y.inf |
List of alleles observed across markers for genotypes within one infection. Alleles must not be repeated for the same marker. |
gs.inf |
Vector of genotype names for genotypes within one infection. |
use.sym |
Boolean for permutation symmetry is exploited as a computational shortcut. Due to permutation symmetry of intra-infection genotypes, we can fix a single assignment for one of the markers whose number of alleles observed is equal to the MOI (consider it the anchor) and permute the rest, discarding combinations that under-represent the observed marker diversity. The default behaviour is to use this symmetry such that less assignments have to be considered. |
List of dataframes, one for each marker. The columns correspond to genotypes, while the rows correspond to allele assignments for a marker.
# 3 markers y.inf <- list(m1 = c("A", "B"), m2 = c("B", "C", "D"), m3 = c("C")) enumerate_alleles(y.inf, c("g1", "g2", "g3")) # 6 assignments for m1 (BAA, ABA, BBA, AAB, BAB, ABB) # 1 assignment for m2 (accounting for permutation symmetry) # 1 assignment for m3 (CCC)
# 3 markers y.inf <- list(m1 = c("A", "B"), m2 = c("B", "C", "D"), m3 = c("C")) enumerate_alleles(y.inf, c("g1", "g2", "g3")) # 6 assignments for m1 (BAA, ABA, BBA, AAB, BAB, ABB) # 1 assignment for m2 (accounting for permutation symmetry) # 1 assignment for m3 (CCC)
A clonal partition is a partition of genotypes where a pair of genotypes of the same partition cell have a clonal relationship. Genotypes from the same infection cannot be clones. This code enumerates all clonal partitions, accounting for this intra-infection restriction.
enumerate_CPs(MOIs)
enumerate_CPs(MOIs)
MOIs |
A numeric vector specifying, for each infection, the number of distinct parasite genotypes, a.k.a. the multiplicity of infection (MOI). |
A list of all possible partitions, where each partition is encoded as a membership vector, which indices (genotype names) with the same entry corresponding to genotypes being int he same partition cell.
enumerate_CPs(c(2, 2))
enumerate_CPs(c(2, 2))
Given a specified number of alleles for a single marker,
generate_halfsib_alleles()
enumerates all the ways three half
siblings can draw alleles from their respective parents by firstly
enumerating all allelic combinations for the parents, and by secondly
enumerating all the inheritable combinations for the children.
enumerate_halfsib_alleles(n_alleles)
enumerate_halfsib_alleles(n_alleles)
n_alleles |
Positive whole number specifying a per-marker number of alleles, otherwise known as marker cardinality. |
A character matrix. Each column is an individual. Each row is a
possible allelic draw. Alleles are represented by the first n_alleles
letters of the latin alphabet.
enumerate_halfsib_alleles(3)
enumerate_halfsib_alleles(3)
Enumerate IBD partitions consistent with a given relationship graph
enumerate_IPs_RG(RG, compat = TRUE)
enumerate_IPs_RG(RG, compat = TRUE)
RG |
A relationship graph; see |
compat |
Logical, if true, a list of |
For a IBD partition to be consistent with the relationship graph given, it must satisfy the following:
genotypes within each IBD cell have clonal and sibling relationships only,
genotypes that are clones must be in the same IBD cell,
each cluster of sibling units span at most 2 IBD cells (corresponds to two parents).
Note that all IBD partitions are equally likely.
set.seed(3) RG <- sample_RG(c(2, 1, 2)) enumerate_IPs_RG(RG)
set.seed(3) RG <- sample_RG(c(2, 1, 2)) enumerate_IPs_RG(RG)
A relationship graph is a complete graph on all genotypes, where each edge is annotated as a clone, sibling, or stranger edge. The enumerated relationship graphs satisfy the following constraints:
The subgraph induced by the clone edges is a cluster graph.
The subgraph induced by the clone edges and sibling edges is a cluster graph.
Clone edges are only allowed for two genotypes from different infections.
enumerate_RGs(MOIs, igraph = TRUE)
enumerate_RGs(MOIs, igraph = TRUE)
MOIs |
A numeric vector specifying, for each infection, the number of distinct parasite genotypes, a.k.a. the multiplicity of infection (MOI). |
igraph |
Logical for whether to return |
Relationship graphs are enumerated by generating nested set partitions that
meet certain constraints. Since the clone edges induce a cluster graph, the
information encoded by clonal relationships is equivalent to a partition of
the genotypes. Note that genotypes from the same infection cannot belong to
the same partition cell. Subsequent information encoded by sibling
relationships is equivalent to further partitioning the clonal partition.
There are no constraints when enumerating the sibling partitions. The data
structure returned encodes each graph as a nested set partition. Each
partition is represented in the form of a list of vectors (clone
and
sib
) and as a membership vector (clone.vec
and sib.vec
), where each
entry identifies the partition cell the corresponding index belongs to.
A list of relationship graphs. If igraph
is FALSE
,
each element is a list of four attributes:
A list of groups of genotypes that make up the clonal cells.
A numeric vector indicating the clonal membership of each genotype.
A list of groups of clonal cells that make up the sibling cells.
A numeric vector indicating the sibling membership of each clonal cell.
Otherwise, each element is an igraph
object (see
RG_to_igraph
) along with these four attributes. Note that
the weight matrix contains information equivalent to that of the four
attributes.
graphs <- enumerate_RGs(c(2, 1, 2), igraph=TRUE) # 250 graphs
graphs <- enumerate_RGs(c(2, 1, 2), igraph=TRUE) # 250 graphs
The posterior mean of a multinomial-Dirichlet model with uniform prior fit to data on allele prevalence in initial episodes. Because the model is fit to allele prevalence (observed) not allele frequency (would requires integrating-out unknown multiplicities of infection) it is liable to underestimate the frequencies of common alleles and overestimate those of rare but detected alleles.
fs_VHX_BPD
fs_VHX_BPD
A list of nine markers; for each marker a named vector of allele frequencies that sum to one.
This is used for building a hash table for p(y at marker m|IBD).
hash.IP(IP, gs)
hash.IP(IP, gs)
IP |
List containing vectors of genotype names with each vector corresponding to an IBD cell. |
gs |
Vector containing all genotype names. |
String where the integers in the IBD membership vector have been converted to ASCII characters.
gs <- paste0("g", 1:3) IP1 <- list(c("g1", "g3"), c("g2")) IP2 <- list(c("g2"), c("g3", "g1")) hash1 <- hash.IP(IP1, gs) hash2 <- hash.IP(IP2, gs) hash1 == hash2 # TRUE, even though the order is different
gs <- paste0("g", 1:3) IP1 <- list(c("g1", "g3"), c("g2")) IP2 <- list(c("g2"), c("g3", "g1")) hash1 <- hash.IP(IP1, gs) hash2 <- hash.IP(IP2, gs) hash1 == hash2 # TRUE, even though the order is different
Summarise locus-wise data types
locus_type_summary(y)
locus_type_summary(y)
y |
A list of lists for two episodes; see |
A function to summarise the data at each locus as one of four types:
All match (all genotypes have the same allele).
All diff. (all genotypes have a different allele).
Intra-match (some intra-episode genotypes have the same allele)
Inter-match (some inter-episode genotypes have the same allele).
The number of apparent genotypes is the sum of the per-episode MOIs. When the total number of apparent genotypes exceeds 3, "Intra-match" excludes any "Inter-match" (i.e., it is equivalent to Inter-matches only), whereas an "Inter-match does not exclude an "Intra-match
A vector of strings summarising the data at each locus.
# example code y <- list( list(m1 = c("A", "B"), m2 = c("A"), m3 = c("C")), list(m1 = c("B"), m2 = c("B", "C"), m3 = c("A", "B", "C", "D")) ) locus_type_summary(y)
# example code y <- list( list(m1 = c("A", "B"), m2 = c("A"), m3 = c("C")), list(m1 = c("B"), m2 = c("B", "C"), m3 = c("A", "B", "C", "D")) ) locus_type_summary(y)
A simple progress bar to use in R packages where messages are preferred to console output. See https://gist.github.com/MansMeg/1ec56b54e1d9d238b4fd.
iter
Total number of iterations
i
Current iteration
width
Width of the R console
width_bar
Width of the progress bar
progress
The number of character printed (continous)
progress_step
Addition to progress per iteration
increment()
A messagebar object.
initialize(iter)
Initialize a messagebar object
Mans Magnusson (MansMeg @ github)
Plots the alleles (colours), which are observed in different episodes (rows), on different markers (columns), where episodes are grouped by patient. The patients and per-patient episodes are plotted from bottom to top. If more than one allele is detected per episode per marker, the corresponding row-column entry is subdivided into different colours. The legend depicts the alleles of the markers as the markers appear from left to right in the main plot. Otherwise stated, the legend is ordered by the order of markers stated on on the horizontal axis of the main plot. The colour scheme is adaptive. It is designed to visually differentiate the alleles as much as possible: the maximum range of qualitative scheme, with contrast of hue between adjacent colours, is always used; the adjacent colours are interpolated only if a given marker has more than 12 alleles. The names of the alleles are printed on top of their colours if marker_annotate.
plot_data(ys, fs = NULL, marker_annotate = TRUE)
plot_data(ys, fs = NULL, marker_annotate = TRUE)
ys |
A nested list of per-patient, per-episode, per-marker allelic data. Specifically, a per-patient list of a per-episode list of a per-marker list of character vectors of observed alleles. |
fs |
A per-marker list of numeric vectors of allele frequencies. If NULL (default), for a given marker, only the alleles present in the data are represented in the legend, and each allele is represented equally. Because the colour scheme is adaptive (see introduction), the same allele will have a different colour in a plot of an alternative data list if more or fewer alleles are observed at the given marker across the alternative data list. If fs is specified, all possible alleles are represented and legend areas are proportional to allele frequencies; i.e., common alleles have relatively large legend areas, and rare alleles have relatively small legend areas. Specify fs to fix the colour of a given allele across plots of different data lists, thereby facilitating cross-comparison. |
marker_annotate |
Logical. If true (default), the names of the alleles are printed on top of their colours in the legend. |
# Examples # Plot example Plasmodium vivax data set plot_data(ys = ys_VHX_BPD) plot_data(ys = ys_VHX_BPD, fs = fs_VHX_BPD) plot_data(ys = ys_VHX_BPD, fs = fs_VHX_BPD, marker_annotate = FALSE) # Demonstrating the adaptive nature of the colour scheme: ys <- ys_VHX_BPD["VHX_52"] # A single patient plot_data(ys, fs = fs_VHX_BPD) # Colours and the legend match plots above plot_data(ys) # Colours and the legend adapt to alleles detected in VHX_52
# Examples # Plot example Plasmodium vivax data set plot_data(ys = ys_VHX_BPD) plot_data(ys = ys_VHX_BPD, fs = fs_VHX_BPD) plot_data(ys = ys_VHX_BPD, fs = fs_VHX_BPD, marker_annotate = FALSE) # Demonstrating the adaptive nature of the colour scheme: ys <- ys_VHX_BPD["VHX_52"] # A single patient plot_data(ys, fs = fs_VHX_BPD) # Colours and the legend match plots above plot_data(ys) # Colours and the legend adapt to alleles detected in VHX_52
This function is a wrapper around plot.igraph
, written
to group parasite genotypes by infection, both spatially and using vertex
colour. Specifically, parasite genotypes within infections are vertically
distributed with some horizontal jitter when layout_by_group
is TRUE
(default), and coloured the same. It also makes sure clonal and sibling edges
are plotted differently using different line types.
plot_RG( RG, layout_by_group = TRUE, vertex_palette = "Set2", edge_lty = c(`0.5` = "dashed", `1` = "solid"), edge_col = c(`0.5` = "black", `1` = "black"), edge.width = 1.5, ... )
plot_RG( RG, layout_by_group = TRUE, vertex_palette = "Set2", edge_lty = c(`0.5` = "dashed", `1` = "solid"), edge_col = c(`0.5` = "black", `1` = "black"), edge.width = 1.5, ... )
RG |
A relationship graph, which is an |
layout_by_group |
A logical argument which if TRUE (default) overrides
the default layout of |
vertex_palette |
A character string specifying an RColorBrewer palette.
Overrides the default |
edge_lty |
A vector of edge line types corresponding to different relationships, where 0.5 represents a sibling and 1 represents a clone. |
edge_col |
A vector of edge colours corresponding to different relationships, where 0.5 represents a sibling and 1 represents a clone. |
edge.width |
Overrides the default |
... |
Additional arguments to pass to |
This function was adapted from plot_Vivax_model
at
https://github.com/jwatowatson/RecurrentVivax/blob/master/Genetic_Model/iGraph_functions.R.
RGs <- enumerate_RGs(c(2, 1, 1)) cpar <- par() # record current par before changing par(mfrow = c(3, 4), mar = c(0.1, 0.1, 0.1, 0.1)) for (i in 12:23) { plot_RG(RGs[[i]], edge.curved = 0.1) box() } par(cpar) # reset par
RGs <- enumerate_RGs(c(2, 1, 1)) cpar <- par() # record current par before changing par(mfrow = c(3, 4), mar = c(0.1, 0.1, 0.1, 0.1)) for (i in 12:23) { plot_RG(RGs[[i]], edge.curved = 0.1) box() } par(cpar) # reset par
Plots a 2D simplex, a triangle with unit sides centred at the origin, onto
which marginal posterior probabilities of relapse, reinfection and
recrudescence (or any other vector of three numbers in zero to one summing to
one) can be projected; see project2D()
and examples below.
plot_simplex( v_labels = NULL, v_cutoff = 0.5, v_colours = c("yellow", "purple", "red") )
plot_simplex( v_labels = NULL, v_cutoff = 0.5, v_colours = c("yellow", "purple", "red") )
v_labels |
A vector of labels that annotate vertices anticlockwise from top. If NULL (default), vertices are not annotated. |
v_cutoff |
An arbitrary number between 0.5 and 1 that separates regions of lower and higher probability. Beware the use of cut-offs for probable recrudescence classification and probable reinfection classification; see vignette("understanding-posterior-estimates", package = "Pv3Rs"). |
v_colours |
A vector of colours associated with the vertices anticlockwise from top; see example below. |
# Plot 2D simplex plot_simplex() xy <- project2D(v = c("C" = 1, "L" = 0, "I" = 0)) points(x = xy["x"], xy["y"], pch = "C") graphics::text(x = xy["x"], xy["y"], labels = "(1,0,0)", pos = 3) xy <- project2D(v = c("C" = 0, "L" = 1, "I" = 0)) points(x = xy["x"], xy["y"], pch = "L") graphics::text(x = xy["x"], xy["y"], labels = "(0,1,0)", pos = 3) xy <- project2D(v = c("C" = 0, "L" = 0, "I" = 1)) points(x = xy["x"], xy["y"], pch = "I") graphics::text(x = xy["x"], xy["y"], labels = "(0,0,1)", pos = 3) # ============================================================================== # Given data on an enrollment episode and a recurrence, # compute the posterior probabilities of the 3Rs and plot the deviation of the # posterior from the prior # ============================================================================== # Some data: y <- list(list(m1 = c("A", "C"), m2 = c("G", "T")), # Enrollment episode list(m1 = c("A"), m2 = c("G"))) # Recurrent episode # Some allele frequencies: fs <- list(m1 = setNames(c(0.4, 0.6), c("A", "C")), m2 = setNames(c(0.2, 0.8), c("G", "T"))) # A vector of prior probabilities: prior <- array(c(0.2, 0.3, 0.5), dim = c(1,3), dimnames = list(NULL, c("C", "L", "I"))) # Compute posterior probabilities post <- compute_posterior(y, fs, prior) # Projev_cutoff marginal prior probabilities onto x and y coordinates: xy_prior <- project2D(as.vector(prior)) # Projev_cutoff marginal posterior probabilities onto x and y coordinates: xy_post <- project2D(as.vector(post$marg)) # Plot simplex with probability greater than 0.8 considered relatively # certain plot_simplex(colnames(post$marg), 0.8) # Plot the deviation of the posterior from the prior arrows(x0 = xy_prior["x"], x1 = xy_post["x"], y0 = xy_prior["y"], y1 = xy_post["y"], length = 0.1)
# Plot 2D simplex plot_simplex() xy <- project2D(v = c("C" = 1, "L" = 0, "I" = 0)) points(x = xy["x"], xy["y"], pch = "C") graphics::text(x = xy["x"], xy["y"], labels = "(1,0,0)", pos = 3) xy <- project2D(v = c("C" = 0, "L" = 1, "I" = 0)) points(x = xy["x"], xy["y"], pch = "L") graphics::text(x = xy["x"], xy["y"], labels = "(0,1,0)", pos = 3) xy <- project2D(v = c("C" = 0, "L" = 0, "I" = 1)) points(x = xy["x"], xy["y"], pch = "I") graphics::text(x = xy["x"], xy["y"], labels = "(0,0,1)", pos = 3) # ============================================================================== # Given data on an enrollment episode and a recurrence, # compute the posterior probabilities of the 3Rs and plot the deviation of the # posterior from the prior # ============================================================================== # Some data: y <- list(list(m1 = c("A", "C"), m2 = c("G", "T")), # Enrollment episode list(m1 = c("A"), m2 = c("G"))) # Recurrent episode # Some allele frequencies: fs <- list(m1 = setNames(c(0.4, 0.6), c("A", "C")), m2 = setNames(c(0.2, 0.8), c("G", "T"))) # A vector of prior probabilities: prior <- array(c(0.2, 0.3, 0.5), dim = c(1,3), dimnames = list(NULL, c("C", "L", "I"))) # Compute posterior probabilities post <- compute_posterior(y, fs, prior) # Projev_cutoff marginal prior probabilities onto x and y coordinates: xy_prior <- project2D(as.vector(prior)) # Projev_cutoff marginal posterior probabilities onto x and y coordinates: xy_post <- project2D(as.vector(post$marg)) # Plot simplex with probability greater than 0.8 considered relatively # certain plot_simplex(colnames(post$marg), 0.8) # Plot the deviation of the posterior from the prior arrows(x0 = xy_prior["x"], x1 = xy_post["x"], y0 = xy_prior["y"], y1 = xy_post["y"], length = 0.1)
NA
sRemoves repeat alleles and all NA
s from allelic vectors with non-NA
values.
Removes repeat NA
s from allelic vectors with only NA
values.
prep_data(y)
prep_data(y)
y |
Observed data in the form of a list of lists. The outer list is a
list of episodes in chronological order. The inner list is a list of named
markers per episode. Episode names can be specified, but they are not used.
Markers must be named. Each episode must list the same markers. If not all
markers are typed per episode, data on untyped markers can be encoded as
missing (see below). For each marker, one must specify an allelic vector: a
set of distinct alleles detected at that marker. |
y <- list(list(m1 = c("A", "A", NA, "B"), m2 = c("A"), m3 = c("C")), list(m1 = c(NA, NA), m2 = c("B", "C"), m3 = c("A", "B", "C"))) prep_data(y)
y <- list(list(m1 = c("A", "A", NA, "B"), m2 = c("A"), m3 = c("C")), list(m1 = c(NA, NA), m2 = c("B", "C"), m3 = c("A", "B", "C"))) prep_data(y)
Project three probabilities that sum to one (e.g., the marginal probabilities of the 3Rs) onto the coordinates of a 2D simplex centred at the origin (i.e., a triangle centred at (0,0) with unit sides).
project2D(v)
project2D(v)
v |
A numeric vector of three numbers in zero to one that sum to one. |
A numeric vector of two coordinates that can be used to plot the
probability vector v
on the origin-centred 2D simplex (see
plot_simplex()
), where the top, left, and right vertices of the simplex
correspond with the first, second and third entries of v
respectively.
project2D(v = c(0.75,0.20,0.05))
project2D(v = c(0.75,0.20,0.05))
For a given set of per chromosome marker counts, generate per marker parental ids for a meiotic tetrad generated by sexual recombination (chromosomal crossovers followed by independent assortment). We assume
For all chromosomes, both pairs of non-sister chromatids cross over
One crossover per non-sister chromatid pair
Equal-length chromosomes (follows from above; in reality, chromosomes have different lengths, and the number of crossovers increases with length)
Equi-distributed markers (implicit)
recombine_parent_ids(chrs_per_marker)
recombine_parent_ids(chrs_per_marker)
chrs_per_marker |
A vector of chromosome numbers for each marker. |
n_chrs <- 14 # P. vivax has 14 chromosomes n_markers <- 100 # For 100 markers chrs_per_marker <- round(seq(0.51, n_chrs + 0.5, length.out = n_markers)) recombine_parent_ids(chrs_per_marker)
n_chrs <- 14 # P. vivax has 14 chromosomes n_markers <- 100 # For 100 markers chrs_per_marker <- round(seq(0.51, n_chrs + 0.5, length.out = n_markers)) recombine_parent_ids(chrs_per_marker)
The likelihood of a relationship graph (RG) can be decomposed over markers, as we assume that marker data is conditionally independent across markers given the RG. The likelihood of a RG for one marker is then found by integrating over all possible IBD (identity-by-descent) partitions that are consistent with the RG. The probability distribution of these IBD partitions are always uniform. Since different RGs may use the same IBD, the likelihood of IPs for each marker are stored in hash tables for improved computational efficiency.
RG_inference(MOIs, fs, alleles_per_m)
RG_inference(MOIs, fs, alleles_per_m)
MOIs |
A numeric vector specifying, for each infection, the number of distinct parasite genotypes, a.k.a. the multiplicity of infection (MOI). |
fs |
List of allele frequencies as vectors. Names of the list must
match with the marker names in |
alleles_per_m |
List of allele assignments as dataframes for each marker. Each column corresponds to a genotype and each row corresponds to an allele assignment. |
List of all relationship graph objects, including their log-likelihoods as a variable under each object.
# 1 marker, 2 infections, MOIs = 2, 1 MOIs <- c(2, 1) fs <- list( m1 = stats::setNames(c(0.4, 0.6), c("A", "B")), m2 = stats::setNames(c(0.2, 0.8), c("C", "D")) ) al_df1 <- as.data.frame(matrix(c("A", "B", "B"), nrow = 1)) al_df2 <- as.data.frame(matrix(c( "C", "D", "C", "D", "C", "C" ), nrow = 2, byrow = TRUE)) colnames(al_df1) <- colnames(al_df2) <- paste0("g", 1:3) alleles_per_m <- list(m1 = al_df1, m2 = al_df2) RGs <- RG_inference(MOIs, fs, alleles_per_m)
# 1 marker, 2 infections, MOIs = 2, 1 MOIs <- c(2, 1) fs <- list( m1 = stats::setNames(c(0.4, 0.6), c("A", "B")), m2 = stats::setNames(c(0.2, 0.8), c("C", "D")) ) al_df1 <- as.data.frame(matrix(c("A", "B", "B"), nrow = 1)) al_df2 <- as.data.frame(matrix(c( "C", "D", "C", "D", "C", "C" ), nrow = 2, byrow = TRUE)) colnames(al_df1) <- colnames(al_df2) <- paste0("g", 1:3) alleles_per_m <- list(m1 = al_df1, m2 = al_df2) RGs <- RG_inference(MOIs, fs, alleles_per_m)
igraph
objectConverts relationship graph to an igraph
object
RG_to_igraph(RG, gs, ts_per_gs)
RG_to_igraph(RG, gs, ts_per_gs)
RG |
Relationship graph output by |
gs |
Vector of genotype names. |
ts_per_gs |
Vector of infection numbers for each genotype. This can be
inferred from the data |
An igraph
object along with the original variables in
RG
.
set.seed(20) RG <- sample_RG(c(2, 2)) RG <- RG_to_igraph(RG, c("g1", "g2", "g3", "g4"), c(1, 1, 2, 2))
set.seed(20) RG <- sample_RG(c(2, 2)) RG <- RG_to_igraph(RG, c("g1", "g2", "g3", "g4"), c(1, 1, 2, 2))
Uses the techniques in enumerate_RGs
to uniformly sample
a relationship graph. All clonal partitions are generated, and the number
of sibling partitions consistent with each clonal partition is determined. A
clonal partition is randomly selected with probability proportional to the
corresponding number of sibling partitions, and a sibling partition is then
uniformly sampled. The nested partition is equivalent to a relationship
graph. See enumerate_RGs
for details on the nested
partition representation of a relationship graph.
sample_RG(MOIs, igraph = T)
sample_RG(MOIs, igraph = T)
MOIs |
A numeric vector specifying, for each infection, the number of distinct parasite genotypes, a.k.a. the multiplicity of infection (MOI). |
igraph |
Logical for whether to return an |
A relationship graph, i.e. one entry of the list returned by
enumerate_RGs
.
set.seed(20) RG <- sample_RG(c(2, 2))
set.seed(20) RG <- sample_RG(c(2, 2))
Partition a vector into at most two subvectors
split_two(s)
split_two(s)
s |
Vector to be split. |
Given a vector with no repeats, returns a list consisting of
a list that contains the original vector as its only element
other lists where each list contains two disjoint vectors whose union covers the vector. All possible unordered pairs are included.
gs <- paste0("g", 1:3) # 4 possibilities in total # either all genotypes in one vector (1 possibility) # or 2 genotypes in one vector and the last in one vector (3 possibilties) split_two(gs)
gs <- paste0("g", 1:3) # 4 possibilities in total # either all genotypes in one vector (1 possibility) # or 2 genotypes in one vector and the last in one vector (3 possibilties) split_two(gs)
Previously-published microsatellite data on Plasmodium vivax parasites extracted from patients enrolled in the Best Primaquine Dose (BPD) and Vivax History (VHX) trials; see https://www.nature.com/articles/s41467-019-13412-x.
ys_VHX_BPD
ys_VHX_BPD
A list of 217 patients; for each patient, a list of one or more episodes; for each episode, a list of three or more microsatellite markers; for each marker, a vector of observed alleles (repeat lengths). For example:
Patient identifier: patient 103 in the BPD trial
Episode identifier: episode one of patient 103 in the BPD trial
Marker identifier: Plasmodium vivax 3.27
Repeat length: 18