Package 'Pv3Rs'

Title: What the Package Does (One Line, Title Case)
Description: What the package does (one paragraph).
Authors: First Last [aut, cre] (YOUR-ORCID-ID)
Maintainer: First Last <[email protected]>
License: MIT + file LICENSE
Version: 0.0.0.9000
Built: 2024-12-08 06:34:48 UTC
Source: https://github.com/aimeertaylor/Pv3Rs

Help Index


Find all vectors of recurrence states compatible with relationship graph

Description

Finds all possible recurrence states for each recurrence compatible with the relationship graph, then takes the Cartesian product to get all vectors of recurrence states. For a recurrence to be a recrudescence, all edges between the recurrent infection and the immediately preceding infection must be clonal edges. For a recurrence to be a reinfection, all edges between the recurrent infection and any preceding infection must be stranger edges. All recurrences may possibly be relapses.

Usage

compatible_rstrs(RG, gs_per_ts)

Arguments

RG

Relationship graph; see enumerate_RGs.

gs_per_ts

List of vectors of genotypes for each infection.

Value

Vector of strings (consisting of "C", "L", "I" for recrudescence, relapse, reinfection respectively) compatible with relationship graph.

Examples

MOIs <- c(2, 2, 1)
RG <- enumerate_RGs(MOIs, igraph = TRUE)[[175]]
gs_per_ts <- split(paste0("g", 1:sum(MOIs)), rep(1:length(MOIs), MOIs))
# 1st recurrence can't be recrudescence, 2nd recurrence can't be reinfection
plot_RG(RG, edge.curved = 0.2)
compatible_rstrs(RG, gs_per_ts) # "LL" "IL" "LC" "IC"

Compute posterior probabilities of P. vivax recurrent states

Description

Compute posterior probabilities of P. vivax recurrent states relapse, reinfection and recrudescence using genetic data.

Please note, the progress bar does not necessarily increment uniformly (see details below); it may seem stuck when the code is still running.

Usage

compute_posterior(
  y,
  fs,
  prior = NULL,
  MOIs = NULL,
  return.RG = FALSE,
  return.logp = FALSE
)

Arguments

y

Observed data in the form of a list of lists. The outer list is a list of episodes in increasing chronological order. The inner list is a list of named markers per episode. Episode names can be specified, but they are not used. Markers must be named. Each episode must list the same markers. If not all markers are typed per episode, data on untyped markers can be encoded as missing (see below). For each marker, one must specify an allelic vector: a set of distinct alleles detected at that marker. NAs encode missing per-marker data, i.e., when no alleles are observed for a given marker. NA entries in allelic vectors that contain both NA and non-NA entries are ignored. Allele names are arbitrary, but must correspond with frequency names (see examples below). The same names can be used for alleles belonging to different markers. As such, frequencies must be specified per named allele per named marker.

fs

List of per-marker allele frequency vectors. Names of the list must match with the marker names in y. Within lists (i.e., for each marker), frequencies must be specified per allele name.

prior

Matrix of prior probabilities of the recurrence states for each recurrent episode. Each row corresponds to an episode in increasing chronological order. The column names must be C, L, and I for recrudescence, relapse and reinfection respectively. Row names can be specified but they are not used. If prior is NULL (default), per-episode recurrent states are equally likely.

MOIs

Multiplicity of infection for each episode. If MOIs are not provided, the most parsimonious MOIs compatible with the data will be used; see determine_MOIs.

return.RG

Boolean for whether to return the relationship graphs, defaults to FALSE. If return.logp is set to TRUE, then return.RG is overridden to be TRUE, as log-probabilities are returned for each relationship graph.

return.logp

Boolean for whether to return the log-likelihood for each relationship graph, defaults to FALSE. When setting return.logp to TRUE, return.RG should also be set to TRUE. Setting return.logp to FALSE allows for permutation symmetries to be exploited to save computational time, see enumerate_alleles. Setting this to TRUE will result in longer runtimes, especially when multiplicities of infection are large. Note that this argument does not affect the output of the posterior probabilities.

Details

compute_posterior() computes posterior probabilities proportional to the likelihood multiplied by the prior. The likelihood sums over

  • ways to phase allelic data onto haploid genotypes

  • graphs of relationships between haploid genotypes

  • ways to partition alleles into clusters of identity-by-descent

compute_posterior() expects each per-episode, per-marker allelic vector to be a set of distinct alleles. Allele repeats at markers with observed data, and NA repeats at markers with missing data, are removed in a data pre-processing step. NAs in allelic vectors that also contain non-NA values are removed in a data pre-processing step.

We enumerate all possible relationship graphs between haploid genotypes, where pairs of genotypes can either be clones, siblings, or strangers. The likelihood of a sequence of recurrence states can be determined from the likelihood of all relationship graphs compatible with said sequence. More details on the enumeration and likelihood calculation of relationship graphs can be found in enumerate_RGs and RG_inference respectively. For each relationship graph, the model sums over all possible identity-by-descent partitions. Because some relationship graphs are compatible with more identity-by-descent partitions than others, the log p(Y|RG) progress bar does not necessarily increment uniformly.

Notable model assumptions and limitations:

  • Perfect detection of alleles (no genotyping error)

  • No within-host de novo mutations

  • Parasites are outbred

  • All siblings are regular siblings

  • Relationship graphs compatible with a given sequence of recurrent states are equally likely a priori

  • We do not recommend running 'compute_posterior() when the total genotype count (sum of per-episode multiplicities of infection) exceeds eight, because there are too many relationship graphs.

  • Presently, Pv3Rs only supports prevalence data (categorical data that signal the detection of alleles), not quantitative data (data that signal the proportional abundance of the alleles detected).

Value

List containing:

marg

Matrix of marginal posterior probabilities of recurrent states for each recurrence, one row per recurrence with "C" for recrudescence, "L" for relapse, and "I" for reinfection. marg is a simple summary of joint (see next): each marginal probability of a recurrent state is a sum over a subset of joint probabilities of recurrent state sequences. For example, the marginal probability of "C" at the first of two recurrences is a sum over the joint probabilities of "CC", "CL", and "CI".

joint

Vector of joint posterior probabilities of each recurrent state sequence, where within a sequence "C" denotes recrudescence, "L" denotes relapse, and "I" denotes reinfection.

RGs

List of relationship graphs with their log-likelihoods stored. Only returned if return.RG is TRUE. See enumerate_RGs.

Examples

# ===========================================================================
# Example where alleles are named numerically
# ===========================================================================
# Data
y <- list(enroll = list(m1 = c('3','2'), m2 = c('1','2')),
          recur1 = list(m1 = c('1','4'), m2 = c('1')),
          recur2 = list(m1 = c('1'), m2 = NA))

# Allele frequencies
fs <- list(m1 = c('1' = 0.78, '2' = 0.14, '3' = 0.07, '4' = 0.01),
           m2 = c('1' = 0.27, '2' = 0.73))

# Compute posterior probabilities using default prior
compute_posterior(y, fs)


# ===========================================================================
# Example where alleles are named arbitrarily and probabilities are plotted
# ===========================================================================
# Data
y <- list(episode0 = list(marker1 = c("Tinky Winky", "Dipsy"),
                          marker2 = c("Tinky Winky", "Laa-Laa", "Po")),
          episode1 = list(marker1 = "Tinky Winky",
                          marker2 = "Laa-Laa"))

# Allele frequencies
fs <- list(marker1 = c("Tinky Winky" = 0.4, "Dipsy" = 0.6),
           marker2 = c("Tinky Winky" = 0.1, "Laa-Laa" = 0.1, "Po" = 0.8))

# Compute posterior probabilities using default prior
posterior_probs <- compute_posterior(y, fs)

# Plot posterior probabilities on the simplex
plot_simplex(c("Recrudescence", "Relapse", "Reinfection"), 0.5) # Simplex
xy <- project2D(posterior_probs$marg[1,]) # Project probabilities
points(x = xy["x"], y = xy["y"], pch = 20) # Plot projected probabilities


#============================================================================
# Demonstrating the return of the prior when all data are missing
#============================================================================
# Data
y_missing <- list(enroll = list(m1 = NA),
                  recur1 = list(m1 = NA),
                  recur2 = list(m1 = NA))

# Return of the prior
suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1))))

# Return of the prior re-weighted to the exclusion of recrudescence
suppressMessages(compute_posterior(y_missing, fs = list(m1 = c("A" = 1)),
                 MOIs = c(1,2,3)))

# (Recrudescing parasites are clones of previous blood-stage parasites. The
# Pv3R model assumes no within-host de-novo mutations and perfect allele
# detection. As such, recrudescence is incompatible with an MOI increase on
# the preceding infection.)


# ===========================================================================
# Demonstrating the cosmetic-only nature of episode names
# ===========================================================================
# Data
y <- list(enroll = list(m1 = NA),
          recur2 = list(m1 = NA),
          recur1 = list(m1 = NA))

# Use a non-uniform prior for the purpose of illustration
prior <- matrix(c(0.2,0.2,0.6,0.7,0.1,0.2), byrow = TRUE, nrow = 2,
                dimnames = list(c("recur1_prior", "recur2_prior"),
                                c("C", "L", "I")))

# Print posterior and prior, noting that "recur1_prior" is returned for
# "recur2", and "recur2_prior" is returned for "recur1"
suppressMessages(compute_posterior(y, fs = list(m1 = c(a = 1)), prior))$marg
prior


#============================================================================
# Demonstrating the informative nature of non-recurrent data
#============================================================================
# Data and allele frequencies
y_het <- list(list(m1 = c('1', '2')), list(m1 = NA))
y_hom <- list(list(m1 = '1'), list(m1 = NA))
fs = list(m1 = c('1' = 0.5, '2' = 0.5))

# The prior is not returned despite there being no recurrent data (see
# vignette XXX to understand why)
suppressMessages(compute_posterior(y = y_het, fs))$marg
suppressMessages(compute_posterior(y = y_hom, fs, MOIs = c(2,1)))$marg



#============================================================================
# Demonstrating the effect of increasingly large relationship graphs: the
# marginal probability of the first recurrence changes slightly, albeit at a
# decreasing rate, as the number of additional recurrences (all without data)
# increases. The change is greatest when the observed allele is rare.
#============================================================================
# Data for different recurrence counts where only the 1st recurrence has data
ys <- list(scenario1 = list(enroll = list(m1 = "A"),
                            recur1 = list(m1 = "A")),
           scenario2 = list(enroll = list(m1 = "A"),
                            recur1 = list(m1 = "A"),
                            recur2 = list(m1 = NA)),
           scenario3 = list(enroll = list(m1 = "A"),
                            recur1 = list(m1 = "A"),
                            recur2 = list(m1 = NA),
                            recur3 = list(m1 = NA)),
           scenario4 = list(enroll = list(m1 = "A"),
                            recur1 = list(m1 = "A"),
                            recur2 = list(m1 = NA),
                            recur3 = list(m1 = NA),
                            recur4 = list(m1 = NA)))

# Allele frequencies: smaller f_A leads to larger change
f_A <- 0.1; fs <- list(m1 = c("A" = f_A, "Other" = 1-f_A))

# Compute posterior probabilities and extract marginal probabilities
results <- lapply(ys, function(y) compute_posterior(y, fs)$marg)

# Extract results for the first recurrence
results_recur1 <- sapply(results, function(result) result[1,])
results_recur1 # Results are different for different scenarios

# Visualise the change in the marginal probability of the first recurrence
plot_simplex(c("Recrudescence", "Relapse", "Reinfection")) # Plot simplex
xy <- apply(results_recur1, 2, project2D) # Project probabilities
points(x = xy["x", ], y = xy["y", ], pch = "-", col = 1:4) # Plot projections
legend("left", col = 1:4, pch = "-", pt.cex = 2, bty = "n", legend = 1:4,
title = "Recurrence \n count") # legend

Determine MOIs from unphased data

Description

The MOIs returned correspond to the parsimonious explanation of the data, i.e. the minimal MOIs required to support the observed allelic diversity.

Usage

determine_MOIs(y)

Arguments

y

Observed data in the form of a list of lists. The outer list is a list of episodes in chronological order. The inner list is a list of named markers per episode. For each marker, one must specify an allelic vector: a set of distinct alleles detected at that marker.

Details

For a given episode, the minimal MOI required to support the observed allelic diversity is equal to the maximum number of per-marker alleles observed across markers.

At present, Pv3Rs only supports prevalence data (categorical data that signal the detection of alleles), not quantitative (proportional abundance) data. Allele repeats at markers with observed data, and repeat NAs at markers with missing data, are removed. NAs in allelic vectors that also contain non-NA values are ignored.

Value

Returns a vector of the (minimum) multiplicity of infection (MOI) for each infection.

Examples

# two infections, three markers
y <- list(
  list(m1 = c("A", "B"), m2 = c("A"), m3 = c("C")),
  list(m1 = c("B"), m2 = c("B", "C"), m3 = c("A", "B", "C"))
)
determine_MOIs(y) # should be c(2, 3)

Find all allele assignments for genotypes within the same infection

Description

Note that this function is not tested for input with alleles mixed with NA.

Usage

enumerate_alleles(y.inf, gs.inf, use.sym = TRUE)

Arguments

y.inf

List of alleles observed across markers for genotypes within one infection. Alleles must not be repeated for the same marker.

gs.inf

Vector of genotype names for genotypes within one infection.

use.sym

Boolean for permutation symmetry is exploited as a computational shortcut. Due to permutation symmetry of intra-infection genotypes, we can fix a single assignment for one of the markers whose number of alleles observed is equal to the MOI (consider it the anchor) and permute the rest, discarding combinations that under-represent the observed marker diversity. The default behaviour is to use this symmetry such that less assignments have to be considered.

Value

List of dataframes, one for each marker. The columns correspond to genotypes, while the rows correspond to allele assignments for a marker.

Examples

# 3 markers
y.inf <- list(m1 = c("A", "B"), m2 = c("B", "C", "D"), m3 = c("C"))
enumerate_alleles(y.inf, c("g1", "g2", "g3"))
# 6 assignments for m1 (BAA, ABA, BBA, AAB, BAB, ABB)
# 1 assignment for m2 (accounting for permutation symmetry)
# 1 assignment for m3 (CCC)

Enumerate partitions induced by clonal relationships

Description

A clonal partition is a partition of genotypes where a pair of genotypes of the same partition cell have a clonal relationship. Genotypes from the same infection cannot be clones. This code enumerates all clonal partitions, accounting for this intra-infection restriction.

Usage

enumerate_CPs(MOIs)

Arguments

MOIs

A numeric vector specifying, for each infection, the number of distinct parasite genotypes, a.k.a. the multiplicity of infection (MOI).

Value

A list of all possible partitions, where each partition is encoded as a membership vector, which indices (genotype names) with the same entry corresponding to genotypes being int he same partition cell.

Examples

enumerate_CPs(c(2, 2))

List all allelic draws for three half siblings

Description

Given a specified number of alleles for a single marker, generate_halfsib_alleles() enumerates all the ways three half siblings can draw alleles from their respective parents by firstly enumerating all allelic combinations for the parents, and by secondly enumerating all the inheritable combinations for the children.

Usage

enumerate_halfsib_alleles(n_alleles)

Arguments

n_alleles

Positive whole number specifying a per-marker number of alleles, otherwise known as marker cardinality.

Value

A character matrix. Each column is an individual. Each row is a possible allelic draw. Alleles are represented by the first n_alleles letters of the latin alphabet.

Examples

enumerate_halfsib_alleles(3)

Enumerate IBD partitions consistent with a given relationship graph

Description

Enumerate IBD partitions consistent with a given relationship graph

Usage

enumerate_IPs_RG(RG, compat = TRUE)

Arguments

RG

A relationship graph; see enumerate_RGs for details.

compat

Logical, if true, a list of partitions equivalence objects are returned. Otherwise, returns a list of IBD parititon vectors in reference to the sibling clusters.

Details

For a IBD partition to be consistent with the relationship graph given, it must satisfy the following:

  • genotypes within each IBD cell have clonal and sibling relationships only,

  • genotypes that are clones must be in the same IBD cell,

  • each cluster of sibling units span at most 2 IBD cells (corresponds to two parents).

Note that all IBD partitions are equally likely.

Examples

set.seed(3)
RG <- sample_RG(c(2, 1, 2))
enumerate_IPs_RG(RG)

Enumerate transitive relationship graphs

Description

A relationship graph is a complete graph on all genotypes, where each edge is annotated as a clone, sibling, or stranger edge. The enumerated relationship graphs satisfy the following constraints:

  • The subgraph induced by the clone edges is a cluster graph.

  • The subgraph induced by the clone edges and sibling edges is a cluster graph.

  • Clone edges are only allowed for two genotypes from different infections.

Usage

enumerate_RGs(MOIs, igraph = TRUE)

Arguments

MOIs

A numeric vector specifying, for each infection, the number of distinct parasite genotypes, a.k.a. the multiplicity of infection (MOI).

igraph

Logical for whether to return igraph objects.

Details

Relationship graphs are enumerated by generating nested set partitions that meet certain constraints. Since the clone edges induce a cluster graph, the information encoded by clonal relationships is equivalent to a partition of the genotypes. Note that genotypes from the same infection cannot belong to the same partition cell. Subsequent information encoded by sibling relationships is equivalent to further partitioning the clonal partition. There are no constraints when enumerating the sibling partitions. The data structure returned encodes each graph as a nested set partition. Each partition is represented in the form of a list of vectors (clone and sib) and as a membership vector (clone.vec and sib.vec), where each entry identifies the partition cell the corresponding index belongs to.

Value

A list of relationship graphs. If igraph is FALSE, each element is a list of four attributes:

clone

A list of groups of genotypes that make up the clonal cells.

clone.vec

A numeric vector indicating the clonal membership of each genotype.

sib

A list of groups of clonal cells that make up the sibling cells.

sib.vec

A numeric vector indicating the sibling membership of each clonal cell.

Otherwise, each element is an igraph object (see RG_to_igraph) along with these four attributes. Note that the weight matrix contains information equivalent to that of the four attributes.

Examples

graphs <- enumerate_RGs(c(2, 1, 2), igraph=TRUE) # 250 graphs

Allele frequencies computed from example Plasmodium vivax data

Description

The posterior mean of a multinomial-Dirichlet model with uniform prior fit to data on allele prevalence in initial episodes. Because the model is fit to allele prevalence (observed) not allele frequency (would requires integrating-out unknown multiplicities of infection) it is liable to underestimate the frequencies of common alleles and overestimate those of rare but detected alleles.

Usage

fs_VHX_BPD

Format

A list of nine markers; for each marker a named vector of allele frequencies that sum to one.

Source


Convert IBD partition to a unique string for hashing

Description

This is used for building a hash table for p(y at marker m|IBD).

Usage

hash.IP(IP, gs)

Arguments

IP

List containing vectors of genotype names with each vector corresponding to an IBD cell.

gs

Vector containing all genotype names.

Value

String where the integers in the IBD membership vector have been converted to ASCII characters.

Examples

gs <- paste0("g", 1:3)
IP1 <- list(c("g1", "g3"), c("g2"))
IP2 <- list(c("g2"), c("g3", "g1"))
hash1 <- hash.IP(IP1, gs)
hash2 <- hash.IP(IP2, gs)
hash1 == hash2 # TRUE, even though the order is different

Summarise locus-wise data types

Description

Summarise locus-wise data types

Usage

locus_type_summary(y)

Arguments

y

A list of lists for two episodes; see compute_posterior() for more details.

Details

A function to summarise the data at each locus as one of four types:

  • All match (all genotypes have the same allele).

  • All diff. (all genotypes have a different allele).

  • Intra-match (some intra-episode genotypes have the same allele)

  • Inter-match (some inter-episode genotypes have the same allele).

    The number of apparent genotypes is the sum of the per-episode MOIs. When the total number of apparent genotypes exceeds 3, "Intra-match" excludes any "Inter-match" (i.e., it is equivalent to Inter-matches only), whereas an "Inter-match does not exclude an "Intra-match

Value

A vector of strings summarising the data at each locus.

Examples

# example code
y <- list(
  list(m1 = c("A", "B"), m2 = c("A"), m3 = c("C")),
  list(m1 = c("B"), m2 = c("B", "C"), m3 = c("A", "B", "C", "D"))
)

locus_type_summary(y)

Message progress bar

Description

A simple progress bar to use in R packages where messages are preferred to console output. See https://gist.github.com/MansMeg/1ec56b54e1d9d238b4fd.

Fields

iter

Total number of iterations

i

Current iteration

width

Width of the R console

width_bar

Width of the progress bar

progress

The number of character printed (continous)

progress_step

Addition to progress per iteration

Methods

increment()

A messagebar object.

initialize(iter)

Initialize a messagebar object

Author(s)

Mans Magnusson (MansMeg @ github)


Plots the data

Description

Plots the alleles (colours), which are observed in different episodes (rows), on different markers (columns), where episodes are grouped by patient. The patients and per-patient episodes are plotted from bottom to top. If more than one allele is detected per episode per marker, the corresponding row-column entry is subdivided into different colours. The legend depicts the alleles of the markers as the markers appear from left to right in the main plot. Otherwise stated, the legend is ordered by the order of markers stated on on the horizontal axis of the main plot. The colour scheme is adaptive. It is designed to visually differentiate the alleles as much as possible: the maximum range of qualitative scheme, with contrast of hue between adjacent colours, is always used; the adjacent colours are interpolated only if a given marker has more than 12 alleles. The names of the alleles are printed on top of their colours if marker_annotate.

Usage

plot_data(ys, fs = NULL, marker_annotate = TRUE)

Arguments

ys

A nested list of per-patient, per-episode, per-marker allelic data. Specifically, a per-patient list of a per-episode list of a per-marker list of character vectors of observed alleles.

fs

A per-marker list of numeric vectors of allele frequencies. If NULL (default), for a given marker, only the alleles present in the data are represented in the legend, and each allele is represented equally. Because the colour scheme is adaptive (see introduction), the same allele will have a different colour in a plot of an alternative data list if more or fewer alleles are observed at the given marker across the alternative data list. If fs is specified, all possible alleles are represented and legend areas are proportional to allele frequencies; i.e., common alleles have relatively large legend areas, and rare alleles have relatively small legend areas. Specify fs to fix the colour of a given allele across plots of different data lists, thereby facilitating cross-comparison.

marker_annotate

Logical. If true (default), the names of the alleles are printed on top of their colours in the legend.

Examples

# Examples

# Plot example Plasmodium vivax data set
plot_data(ys = ys_VHX_BPD)
plot_data(ys = ys_VHX_BPD, fs = fs_VHX_BPD)
plot_data(ys = ys_VHX_BPD, fs = fs_VHX_BPD, marker_annotate = FALSE)

# Demonstrating the adaptive nature of the colour scheme:
ys <- ys_VHX_BPD["VHX_52"] # A single patient
plot_data(ys, fs = fs_VHX_BPD) # Colours and the legend match plots above
plot_data(ys) # Colours and the legend adapt to alleles detected in VHX_52

Plot a relationship graph (RG)

Description

This function is a wrapper around plot.igraph, written to group parasite genotypes by infection, both spatially and using vertex colour. Specifically, parasite genotypes within infections are vertically distributed with some horizontal jitter when layout_by_group is TRUE (default), and coloured the same. It also makes sure clonal and sibling edges are plotted differently using different line types.

Usage

plot_RG(
  RG,
  layout_by_group = TRUE,
  vertex_palette = "Set2",
  edge_lty = c(`0.5` = "dashed", `1` = "solid"),
  edge_col = c(`0.5` = "black", `1` = "black"),
  edge.width = 1.5,
  ...
)

Arguments

RG

A relationship graph, which is an igraph graph; see return value of RG_to_igraph.

layout_by_group

A logical argument which if TRUE (default) overrides the default layout of plot.igraph so that vertices that represent parasite genotypes from different infections are distributed horizontally and vertices that represent genotypes within infections are distributed vertically.

vertex_palette

A character string specifying an RColorBrewer palette. Overrides the default palette of plot.igraph.

edge_lty

A vector of edge line types corresponding to different relationships, where 0.5 represents a sibling and 1 represents a clone.

edge_col

A vector of edge colours corresponding to different relationships, where 0.5 represents a sibling and 1 represents a clone.

edge.width

Overrides the default edge.width of plot.igraph.

...

Additional arguments to pass to plot.igraph, e.g. edge.curved.

Provenance

This function was adapted from plot_Vivax_model at https://github.com/jwatowatson/RecurrentVivax/blob/master/Genetic_Model/iGraph_functions.R.

Examples

RGs <- enumerate_RGs(c(2, 1, 1))
cpar <- par() # record current par before changing
par(mfrow = c(3, 4), mar = c(0.1, 0.1, 0.1, 0.1))
for (i in 12:23) {
  plot_RG(RGs[[i]], edge.curved = 0.1)
  box()
}
par(cpar) # reset par

Plots a 2D simplex

Description

Plots a 2D simplex, a triangle with unit sides centred at the origin, onto which marginal posterior probabilities of relapse, reinfection and recrudescence (or any other vector of three numbers in zero to one summing to one) can be projected; see project2D() and examples below.

Usage

plot_simplex(
  v_labels = NULL,
  v_cutoff = 0.5,
  v_colours = c("yellow", "purple", "red")
)

Arguments

v_labels

A vector of labels that annotate vertices anticlockwise from top. If NULL (default), vertices are not annotated.

v_cutoff

An arbitrary number between 0.5 and 1 that separates regions of lower and higher probability. Beware the use of cut-offs for probable recrudescence classification and probable reinfection classification; see vignette("understanding-posterior-estimates", package = "Pv3Rs").

v_colours

A vector of colours associated with the vertices anticlockwise from top; see example below.

Examples

# Plot 2D simplex
plot_simplex()

xy <- project2D(v = c("C" = 1, "L" = 0, "I" = 0))
points(x = xy["x"], xy["y"], pch = "C")
graphics::text(x = xy["x"], xy["y"], labels = "(1,0,0)", pos = 3)

xy <- project2D(v = c("C" = 0, "L" = 1, "I" = 0))
points(x = xy["x"], xy["y"], pch = "L")
graphics::text(x = xy["x"], xy["y"], labels = "(0,1,0)", pos = 3)

xy <- project2D(v = c("C" = 0, "L" = 0, "I" = 1))
points(x = xy["x"], xy["y"], pch = "I")
graphics::text(x = xy["x"], xy["y"], labels = "(0,0,1)", pos = 3)


# ==============================================================================
# Given data on an enrollment episode and a recurrence,
# compute the posterior probabilities of the 3Rs and plot the deviation of the
# posterior from the prior
# ==============================================================================

# Some data:
y <- list(list(m1 = c("A", "C"), m2 = c("G", "T")), # Enrollment episode
          list(m1 = c("A"), m2 = c("G"))) # Recurrent episode

# Some allele frequencies:
fs <- list(m1 = setNames(c(0.4, 0.6), c("A", "C")),
           m2 = setNames(c(0.2, 0.8), c("G", "T")))

# A vector of prior probabilities:
prior <- array(c(0.2, 0.3, 0.5), dim = c(1,3),
               dimnames = list(NULL, c("C", "L", "I")))

# Compute posterior probabilities
post <- compute_posterior(y, fs, prior)

# Projev_cutoff marginal prior probabilities onto x and y coordinates:
xy_prior <- project2D(as.vector(prior))

# Projev_cutoff marginal posterior probabilities onto x and y coordinates:
xy_post <- project2D(as.vector(post$marg))

# Plot simplex with probability greater than 0.8 considered relatively
# certain
plot_simplex(colnames(post$marg), 0.8)

# Plot the deviation of the posterior from the prior
arrows(x0 = xy_prior["x"], x1 = xy_post["x"],
       y0 = xy_prior["y"], y1 = xy_post["y"], length = 0.1)

Pre-process data to remove repeats and NAs

Description

Removes repeat alleles and all NAs from allelic vectors with non-NA values. Removes repeat NAs from allelic vectors with only NA values.

Usage

prep_data(y)

Arguments

y

Observed data in the form of a list of lists. The outer list is a list of episodes in chronological order. The inner list is a list of named markers per episode. Episode names can be specified, but they are not used. Markers must be named. Each episode must list the same markers. If not all markers are typed per episode, data on untyped markers can be encoded as missing (see below). For each marker, one must specify an allelic vector: a set of distinct alleles detected at that marker. NAs encode missing per-marker data, i.e., when no alleles are observed for a given marker.

Examples

y <- list(list(m1 = c("A", "A", NA, "B"), m2 = c("A"), m3 = c("C")),
          list(m1 = c(NA, NA), m2 = c("B", "C"), m3 = c("A", "B", "C")))

prep_data(y)

Project 3D probability coordinates onto 2D simplex coordinates

Description

Project three probabilities that sum to one (e.g., the marginal probabilities of the 3Rs) onto the coordinates of a 2D simplex centred at the origin (i.e., a triangle centred at (0,0) with unit sides).

Usage

project2D(v)

Arguments

v

A numeric vector of three numbers in zero to one that sum to one.

Value

A numeric vector of two coordinates that can be used to plot the probability vector v on the origin-centred 2D simplex (see plot_simplex()), where the top, left, and right vertices of the simplex correspond with the first, second and third entries of v respectively.

Examples

project2D(v = c(0.75,0.20,0.05))

Generate parental allocations for a meiotic tetrad

Description

For a given set of per chromosome marker counts, generate per marker parental ids for a meiotic tetrad generated by sexual recombination (chromosomal crossovers followed by independent assortment). We assume

  1. For all chromosomes, both pairs of non-sister chromatids cross over

  2. One crossover per non-sister chromatid pair

  3. Equal-length chromosomes (follows from above; in reality, chromosomes have different lengths, and the number of crossovers increases with length)

  4. Equi-distributed markers (implicit)

Usage

recombine_parent_ids(chrs_per_marker)

Arguments

chrs_per_marker

A vector of chromosome numbers for each marker.

Examples

n_chrs <- 14 # P. vivax has 14 chromosomes
n_markers <- 100 # For 100 markers
chrs_per_marker <- round(seq(0.51, n_chrs + 0.5, length.out = n_markers))
recombine_parent_ids(chrs_per_marker)

Computes the likelihood of all relationship graphs

Description

The likelihood of a relationship graph (RG) can be decomposed over markers, as we assume that marker data is conditionally independent across markers given the RG. The likelihood of a RG for one marker is then found by integrating over all possible IBD (identity-by-descent) partitions that are consistent with the RG. The probability distribution of these IBD partitions are always uniform. Since different RGs may use the same IBD, the likelihood of IPs for each marker are stored in hash tables for improved computational efficiency.

Usage

RG_inference(MOIs, fs, alleles_per_m)

Arguments

MOIs

A numeric vector specifying, for each infection, the number of distinct parasite genotypes, a.k.a. the multiplicity of infection (MOI).

fs

List of allele frequencies as vectors. Names of the list must match with the marker names in alleles_per_m.

alleles_per_m

List of allele assignments as dataframes for each marker. Each column corresponds to a genotype and each row corresponds to an allele assignment.

Value

List of all relationship graph objects, including their log-likelihoods as a variable under each object.

Examples

# 1 marker, 2 infections, MOIs = 2, 1
MOIs <- c(2, 1)
fs <- list(
  m1 = stats::setNames(c(0.4, 0.6), c("A", "B")),
  m2 = stats::setNames(c(0.2, 0.8), c("C", "D"))
)
al_df1 <- as.data.frame(matrix(c("A", "B", "B"), nrow = 1))
al_df2 <- as.data.frame(matrix(c(
  "C", "D", "C",
  "D", "C", "C"
), nrow = 2, byrow = TRUE))
colnames(al_df1) <- colnames(al_df2) <- paste0("g", 1:3)
alleles_per_m <- list(m1 = al_df1, m2 = al_df2)
RGs <- RG_inference(MOIs, fs, alleles_per_m)

Converts relationship graph to an igraph object

Description

Converts relationship graph to an igraph object

Usage

RG_to_igraph(RG, gs, ts_per_gs)

Arguments

RG

Relationship graph output by enumerate_RGs with igraph=FALSE.

gs

Vector of genotype names.

ts_per_gs

Vector of infection numbers for each genotype. This can be inferred from the data y using rep(1:length(y), determine_MOIs(y)).

Value

An igraph object along with the original variables in RG.

Examples

set.seed(20)
RG <- sample_RG(c(2, 2))
RG <- RG_to_igraph(RG, c("g1", "g2", "g3", "g4"), c(1, 1, 2, 2))

Sample a transitive relationship graph

Description

Uses the techniques in enumerate_RGs to uniformly sample a relationship graph. All clonal partitions are generated, and the number of sibling partitions consistent with each clonal partition is determined. A clonal partition is randomly selected with probability proportional to the corresponding number of sibling partitions, and a sibling partition is then uniformly sampled. The nested partition is equivalent to a relationship graph. See enumerate_RGs for details on the nested partition representation of a relationship graph.

Usage

sample_RG(MOIs, igraph = T)

Arguments

MOIs

A numeric vector specifying, for each infection, the number of distinct parasite genotypes, a.k.a. the multiplicity of infection (MOI).

igraph

Logical for whether to return an igraph object.

Value

A relationship graph, i.e. one entry of the list returned by enumerate_RGs.

Examples

set.seed(20)
RG <- sample_RG(c(2, 2))

Partition a vector into at most two subvectors

Description

Partition a vector into at most two subvectors

Usage

split_two(s)

Arguments

s

Vector to be split.

Value

Given a vector with no repeats, returns a list consisting of

  • a list that contains the original vector as its only element

  • other lists where each list contains two disjoint vectors whose union covers the vector. All possible unordered pairs are included.

Examples

gs <- paste0("g", 1:3)
# 4 possibilities in total
# either all genotypes in one vector (1 possibility)
# or 2 genotypes in one vector and the last in one vector (3 possibilties)
split_two(gs)

Example Plasmodium vivax data

Description

Previously-published microsatellite data on Plasmodium vivax parasites extracted from patients enrolled in the Best Primaquine Dose (BPD) and Vivax History (VHX) trials; see https://www.nature.com/articles/s41467-019-13412-x.

Usage

ys_VHX_BPD

Format

A list of 217 patients; for each patient, a list of one or more episodes; for each episode, a list of three or more microsatellite markers; for each marker, a vector of observed alleles (repeat lengths). For example:

BPD_103

Patient identifier: patient 103 in the BPD trial

BPD_103_1

Episode identifier: episode one of patient 103 in the BPD trial

PV.3.27

Marker identifier: Plasmodium vivax 3.27

18

Repeat length: 18

Source