Title: | Joint estimation of COI and population structure for malaria genetic data |
---|---|
Description: | Carries out joint estimation of complexity of infection (COI) and population structure on malaria genetic data. Assumes a simple model in which individuals have genotypes sampled from one or more subpopulations, and the number of genotypes in an individual is equal to the COI, which is also unknown. All unknown parameters are inferred using MCMC. |
Authors: | Bob Verity [aut, cre] |
Maintainer: | Bob Verity <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.1 |
Built: | 2024-10-31 04:09:20 UTC |
Source: | https://github.com/bobverity/MALECOT |
Change the active set of a MALECOT project
active_set(project, set)
active_set(project, set)
project |
a MALECOT project, as produced by the function
|
set |
the new active set |
# TODO
# TODO
Bind data in bi-allelic format to MALECOT project. Data should be formatted as a dataframe with samples in rows and loci in columns. Genetic data should be coded as 1 (homozygote REF allele), 0 (homozygote ALT allele), or 0.5 (heterozygote). Additional meta-data columns can be specified, including a column for sample IDs and a column for sampling population.
bind_data_biallelic(project, df, ID_col = 1, pop_col = NULL, data_cols = NULL, ID = NULL, pop = NULL, missing_data = -9, name = NULL, check_delete_output = TRUE)
bind_data_biallelic(project, df, ID_col = 1, pop_col = NULL, data_cols = NULL, ID = NULL, pop = NULL, missing_data = -9, name = NULL, check_delete_output = TRUE)
project |
a MALECOT project, as produced by the function
|
df |
a dataframe containing genetic information and optional meta-data |
ID_col |
which column of the input data contains the sample IDs. If NULL
then IDs must be defined seperately through the |
pop_col |
which column of the input data contains the ostensible
population of the samples. If NULL then populations must be defined
seperately through the |
data_cols |
which columns of the input data contain genetic information. Defaults to all remaining columns of the data once meta-data columns have been accounted for |
ID |
sample IDs. Ignored if using the |
pop |
ostensible populations. Ignored if using the |
missing_data |
what value represents missing data. Defaults to -9. Must be a positive or negative integer, and cannot equal 0 or 1 as these are reserved for genetic data. |
name |
optional name of the data set to aid in record keeping |
check_delete_output |
whether to prompt the user before overwriting existing data |
# TODO
# TODO
Bind data in multi-allelic format to MALECOT project. Data should be formatted as a dataframe with three columns: "sample_ID", "locus" and "haplotype". Each row of this dataframe specifies a haplotype that was observed at that locus in that individual. Haplotypes should be coded as positive integers.
bind_data_multiallelic(project, df, pop = NULL, missing_data = -9, alleles = NULL, name = NULL, check_delete_output = TRUE)
bind_data_multiallelic(project, df, pop = NULL, missing_data = -9, alleles = NULL, name = NULL, check_delete_output = TRUE)
project |
a MALECOT project, as produced by the function
|
df |
a dataframe with three columns, as decribed above |
pop |
ostensible populations of the samples |
missing_data |
what value represents missing data. Defaults to -9. Must be a positive or negative integer |
alleles |
the number of alleles at each locus. If scalar then the same number of alleles is assumed at all loci. If NULL then the number of alleles is inferred directly from data as the maximum observed value per locus |
name |
optional name of the data set to aid in record keeping |
check_delete_output |
whether to prompt the user before overwriting existing data |
# TODO
# TODO
Simple function to check that MALECOT package has loaded successfully. Prints "MALECOT loaded successfully!" if so.
check_MALECOT_loaded()
check_MALECOT_loaded()
Delete a given parameter set from a MALECOT project.
delete_set(project, set = NULL, check_delete_output = TRUE)
delete_set(project, set = NULL, check_delete_output = TRUE)
project |
a MALECOT project, as produced by the function
|
set |
which set to delete. Defaults to the current active set |
check_delete_output |
whether to prompt the user before deleting any existing output |
# TODO
# TODO
Returns effective sample size (ESS) of chosen model run.
get_ESS(project, K = NULL)
get_ESS(project, K = NULL)
project |
a MALCOT project, as produced by the function
|
K |
get ESS for this value of K |
# TODO
# TODO
Compares qmatrix output for a chosen value of K against a
target_group
vector. Returns the order of target_group
groups, such that there is the best possible alignment against the qmatrix.
For example, if the vector returned is c(2,3,1)
then the second
group in the target vector should be matched against the first group in the
qmatrix, followed by the third group in the target vector against the
second group in the qmatrix, followed by the first group in the target
vector against the third group in the qmatrix.
get_group_order(project, K, target_group)
get_group_order(project, K, target_group)
project |
a MALCOT project, as produced by the function
|
K |
compare against qmatrix output for this value of K |
target_group |
the target group to be aligned against the qmatrix |
# TODO
# TODO
Determine if object is of class malecot_project.
is.malecot_project(x)
is.malecot_project(x)
x |
TODO |
TODO
# TODO
# TODO
Import file from the inst/extdata folder of this package
malecot_file(name)
malecot_file(name)
name |
name of file |
Define empty malecot_project object
malecot_project()
malecot_project()
TODO
# TODO
# TODO
Expand a series of colours by interpolation to produce any
number of colours from a given series. The pattern of interpolation is
designed so that (n+1)th value contains the nth value plus one more colour,
rather than being a completely different series. For example, running
more_colours(5)
and more_colours(4)
, the first 4 colours will
be shared between the two series.
more_colours(n = 5, raw_cols = col_hot_cold())
more_colours(n = 5, raw_cols = col_hot_cold())
n |
how many colours to return |
raw_cols |
vector of colours to interpolate |
TODO
new_set(project, name = "(no name)", lambda = 1, COI_model = "poisson", COI_max = 20, COI_manual = NULL, estimate_COI_mean = TRUE, COI_mean = 3, COI_dispersion = 2, estimate_error = FALSE, e1 = 0, e2 = 0, e1_max = 0.2, e2_max = 0.2)
new_set(project, name = "(no name)", lambda = 1, COI_model = "poisson", COI_max = 20, COI_manual = NULL, estimate_COI_mean = TRUE, COI_mean = 3, COI_dispersion = 2, estimate_error = FALSE, e1 = 0, e2 = 0, e1_max = 0.2, e2_max = 0.2)
project |
a MALECOT project, as produced by the function
|
name |
the name of the parameter set |
lambda |
the shape parameter(s) of the prior on allele frequencies. This
prior is Beta in the bi-allelic case, and Dirichlet in the multi-allelic
case.
|
COI_model |
the type of prior on COI. Must be one of "uniform", "poisson", or "nb" (negative binomial) |
COI_max |
the maximum COI allowed for any given sample |
COI_manual |
A vector of length n (where n is the number of samples)
allowing the COI to be specified manually. Positive values indicate fixed
COIs that should not be updated as part of the MCMC, while -1 values
indicate that COIs should be estimated. Defaults to |
estimate_COI_mean |
whether the mean COI should be estimated for each
subpopulation as part of the MCMC, otherwise the value |
COI_mean |
single scalar value specifying the mean COI for all
subpopulations (see |
COI_dispersion |
the ratio of the variance to the mean of the prior on COI. Only applies under the negative binomial model. Must be >1, as a ratio of 1 can be achieved by using the Poisson distribution |
estimate_error |
whether to estimate error probabilities |
e1 |
the probability of a true homozygote being incorrectly called as a heterozygote |
e2 |
the probability of a true heterozygote being incorrectly called as a homozygote |
e1_max |
the maximum possible value of |
e2_max |
the maximum possible value of |
TODO
# TODO
# TODO
Produce MCMC autocorrelation plot of the log-likelihood
plot_acf(project, K = NULL, rung = NULL, col = "black")
plot_acf(project, K = NULL, rung = NULL, col = "black")
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
rung |
which rung to plot. Defaults to the cold chain |
col |
colour of the trace |
Plot COI 95% credible intervals of current active set
plot_COI(project, K = NULL)
plot_COI(project, K = NULL)
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
TODO
# TODO
# TODO
Plot COI_mean 95% credible intervals of current active set
plot_COI_mean(project, K = NULL, deme_order = NULL)
plot_COI_mean(project, K = NULL, deme_order = NULL)
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
deme_order |
the order in which to plot demes. Defaults to increasing order |
TODO
# TODO
# TODO
Plot Metropolis-coupling acceptance rates
plot_coupling(project, K = NULL)
plot_coupling(project, K = NULL)
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
Produce MCMC density plot of the log-likelihood
plot_density(project, K = NULL, rung = NULL, col = "black")
plot_density(project, K = NULL, rung = NULL, col = "black")
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
rung |
which rung to plot. Defaults to the cold chain |
col |
colour of the trace |
Plot error rate 95% credible intervals of current active set
plot_e(project, K = NULL)
plot_e(project, K = NULL)
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
TODO
# TODO
# TODO
Plot GTI path of current active set
plot_GTI_path(project, K = NULL, axis_type = 1)
plot_GTI_path(project, K = NULL, axis_type = 1)
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
axis_type |
how to format the x-axis. 1 = integer rungs, 2 = values of beta |
Plot log-evidence estimates over K
plot_logevidence_K(project)
plot_logevidence_K(project)
project |
a MALECOT project, as produced by the function
|
Plot log-evidence estimates over parameter sets
plot_logevidence_model(project)
plot_logevidence_model(project)
project |
a MALECOT project, as produced by the function
|
Plot loglikelihood 95% credible intervals of current active set
plot_loglike(project, K = NULL, axis_type = 1, connect_points = FALSE, connect_whiskers = FALSE)
plot_loglike(project, K = NULL, axis_type = 1, connect_points = FALSE, connect_whiskers = FALSE)
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
axis_type |
how to format the x-axis. 1 = integer rungs, 2 = values of beta, 3 = values of beta raised to the GTI power |
connect_points |
whether to connect points in the middle of intervals |
connect_whiskers |
whether to connect points at the ends of the whiskers |
Produce diagnostic plots of the log-likelihood.
plot_loglike_dignostic(project, K = NULL, rung = NULL, col = "black")
plot_loglike_dignostic(project, K = NULL, rung = NULL, col = "black")
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
rung |
which rung to plot. Defaults to the cold chain |
col |
colour of the trace |
Plot allele frequency 95% credible intervals of current active set
plot_p(project, K = NULL, deme = 1)
plot_p(project, K = NULL, deme = 1)
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
deme |
TODO |
TODO
# TODO
# TODO
Plot posterior K
plot_posterior_K(project)
plot_posterior_K(project)
project |
a MALECOT project, as produced by the function
|
Plot posterior model
plot_posterior_model(project)
plot_posterior_model(project)
project |
a MALECOT project, as produced by the function
|
Produce plot of the prior on COI for given parameters. Options include the uniform distribution, and a modified form of Poisson and negative binomial distribution (see details).
plot_prior_COI(COI_model = "poisson", COI_mean = 3, COI_dispersion = 2, COI_max = 20)
plot_prior_COI(COI_model = "poisson", COI_mean = 3, COI_dispersion = 2, COI_max = 20)
COI_model |
the type of prior on COI. Must be one of "uniform", "poisson", or "nb" (negative binomial) |
COI_mean |
the prior mean (before truncating at |
COI_dispersion |
the ratio of the variance to the mean of the prior on COI. Only applies under the negative binomial model. Must be >1 |
COI_max |
the maximum COI allowed. Distributions are truncated at this value |
The prior on COI can be uniform, Poisson, or negative binomial. In
the uniform case there is an equal chance of any given sample having a COI
between 1 and COI_max
(inclusive). In the Poisson and negative
binomial cases it is important to note that the distribution is over
(COI-1), rather than over COI. This is because both Poisson and negative
binomial distributions allow for 0 values, which cannot be the case here
because observed samples must contain at least 1 genotype. Poisson and
negative binomial distributions are also truncated at COI_max
.
The full probability mass distribution for the Poisson case with
COI_mean
and
COI_max
can be written
where
is a normalising constant that ensures the distribution sums to unity, and
is defined as:
The mean of this distribution will generally be very close to ,
and the variance will be close to
(strictly it will approach
these values as
tends to infinity).
The full probability mass distribution for the negative binomial case with
COI_mean
,
COI_dispersion
and
COI_max
can be written
where ,
, and
is a normalising
constant that ensures the distribution sums to unity, and is defined as:
The mean of this distribution will generally be very close to and
the variance will be close to
(strictly it will approach these
values as
tends to infinity).
Produce plot of the prior on COI for given parameters. This prior is Beta in the bi-allelic case, and Dirichlet in the multi-allelic case.
plot_prior_p(lambda = 1, alleles = NULL)
plot_prior_p(lambda = 1, alleles = NULL)
lambda |
shape parameter(s) of the Beta or Dirichlet distribution. Can
be a single scalar value, in which case the dimensionality is given by the
number of |
alleles |
the dimensionality of the prior. Defaults to the length of
|
Produce posterior allocation plot of current active set.
plot_structure(project, K = NULL, base_colours = col_hot_cold(), divide_ind_on = FALSE)
plot_structure(project, K = NULL, base_colours = col_hot_cold(), divide_ind_on = FALSE)
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
base_colours |
colours from which final plotting colours are taken. These will be interpolated to produce final colours |
divide_ind_on |
whether to add dividing lines between bars |
Produce MCMC trace plot of the log-likelihood at each iteration.
plot_trace(project, K = NULL, rung = NULL, col = "black")
plot_trace(project, K = NULL, rung = NULL, col = "black")
project |
a MALECOT project, as produced by the function
|
K |
which value of K to plot |
rung |
which rung to plot. Defaults to the cold chain |
col |
colour of the trace |
Calling print()
on an object of class malecot_project
results in custom output. This function therefore stands in for the base
print()
function, and is equivalent to running
print(unclass(x))
.
print_full(x, ...)
print_full(x, ...)
x |
object of class |
... |
other arguments passed to |
Custom print function for class malecot_project, printing a
summary of the key elements (also equivalent to summary(x)
). To do
an ordinary print()
of all elements of the project, use the
print_full()
function.
## S3 method for class 'malecot_project' print(x, ...)
## S3 method for class 'malecot_project' print(x, ...)
x |
object of class |
... |
other arguments (ignored) |
When a new value of K is added in to the analysis it affects all downstream evidence estimates that depend on this K - for example the overall model evidence integrated over K. This function therefore looks through all values of K in the active set and recalculates all downstream elements as needed.
recalculate_evidence(project)
recalculate_evidence(project)
project |
a MALCOT project, as produced by the function
|
Run the main MALECOT MCMC. Model parameters are taken from the current active parameter set, and MCMC parameters are passed in as arguments. All output is stored within the project.
run_mcmc(project, K = NULL, precision = 0.01, burnin = 1000, samples = 1000, rungs = 1, GTI_pow = 3, auto_converge = TRUE, converge_test = 100, solve_label_switching_on = TRUE, coupling_on = TRUE, cluster = NULL, pb_markdown = FALSE, store_acceptance = TRUE, store_raw = TRUE, silent = FALSE)
run_mcmc(project, K = NULL, precision = 0.01, burnin = 1000, samples = 1000, rungs = 1, GTI_pow = 3, auto_converge = TRUE, converge_test = 100, solve_label_switching_on = TRUE, coupling_on = TRUE, cluster = NULL, pb_markdown = FALSE, store_acceptance = TRUE, store_raw = TRUE, silent = FALSE)
project |
a MALECOT project, as produced by the function
|
K |
the values of K that the MCMC will explore |
precision |
the level of precision at which allele frequencies are represented in the bi-allelic case. This allows the use of look-up tables for the likelihood, which significantly speeds up the MCMC. Set to 0 to use exact values (up to C++ "double" precision) rather than using look-up tables |
burnin |
the number of burn-in iterations |
samples |
the number of sampling iterations |
rungs |
the number of temperature rungs |
GTI_pow |
the power used in the generalised thermodynamic integration method. Must be greater than 1.1 |
auto_converge |
whether convergence should be assessed automatically
every |
converge_test |
test for convergence every |
solve_label_switching_on |
whether to implement the Stevens' solution to the label-switching problem. If turned off then Q-matrix output will no longer be correct, although evidence estimates will be unaffected. |
coupling_on |
whether to implement Metropolis-coupling over temperature rungs |
cluster |
option to pass in a cluster environment (see package "parallel") |
pb_markdown |
whether to run progress bars in markdown mode, in which case they are updated once at the end to avoid large amounts of output |
store_acceptance |
whether to store acceptance rates for all parameters updated by Metropolis-Hastings. Proposal distributions are tuned adaptively with a target acceptance rate of 23% |
store_raw |
whether to store raw MCMC output in addition to summary output. Setting to FALSE can considerably reduce output size in memory |
silent |
whether to suppress all console output |
# TODO
# TODO
Simulate genetic data from the same model used in the MALECOT inference step.
sim_data(n = 100, L = 24, K = 3, data_format = "biallelic", pop_col_on = TRUE, alleles = 2, lambda = 1, COI_model = "poisson", COI_max = 20, COI_manual = rep(-1, n), COI_mean = 3, COI_dispersion = 2, e1 = 0, e2 = 0, prop_missing = 0)
sim_data(n = 100, L = 24, K = 3, data_format = "biallelic", pop_col_on = TRUE, alleles = 2, lambda = 1, COI_model = "poisson", COI_max = 20, COI_manual = rep(-1, n), COI_mean = 3, COI_dispersion = 2, e1 = 0, e2 = 0, prop_missing = 0)
n |
the number of samples |
L |
the number of loci per sample |
K |
the number of subpopulations |
data_format |
whether to produce data in "biallelic" or "multiallelic"
format. Note that if biallelic format is chosen then |
pop_col_on |
TODO |
alleles |
the number of alleles at each locus. Can be a vector of length
|
lambda |
the shape parameter(s) of the prior on allele frequencies. This
prior is Beta in the bi-allelic case, and Dirichlet in the multi-allelic
case.
|
COI_model |
the distribution from which COIs are drawn. Options include
a uniform distribution ( |
COI_max |
the maximum allowed COI. Any COIs that are initially drawn larger than this value are set down to this value |
COI_manual |
option to override the MCMC and set the COI of one or more
samples manually, in which case they are not updated. Vector of length
|
COI_mean |
the mean of the distribution from which COIs are drawn. Only
applies under the Poisson and negative binomial models (under the uniform
model the mean is |
COI_dispersion |
Only used under the negative binomial model. Defines how much larger the variance is than the mean. Must be > 1 |
e1 |
the probability of a true homozygote being incorrectly called as a heterozygote |
e2 |
the probability of a true heterozygote being incorrectly called as a homozygote |
prop_missing |
the proportion of the data that is missing. Note that data are masked out at random, meaning in some rare cases (and when the proportion of missing data is large) an entire sample or locus can end up being masked out, which will throw an error when loaded into a project |
TODO
# TODO
# TODO
TODO - text
sim_data_safe(..., data_format = "biallelic", no_invariant_loci = TRUE, no_missing_samples = TRUE, no_missing_loci = TRUE, max_attempts = 1000)
sim_data_safe(..., data_format = "biallelic", no_invariant_loci = TRUE, no_missing_samples = TRUE, no_missing_loci = TRUE, max_attempts = 1000)
... |
TODO |
data_format |
TODO |
no_invariant_loci |
TODO |
no_missing_samples |
TODO |
no_missing_loci |
TODO |
max_attempts |
TODO |
TODO
# TODO
# TODO
Overload summary function for class malecot_project
## S3 method for class 'malecot_project' summary(object, ...)
## S3 method for class 'malecot_project' summary(object, ...)
object |
object of class |
... |
other arguments (ignored) |