This vignette documents how to import or enter genotypic data for the
hierfstat
package. Originally this package was written to
estimate and test hierarchical F-statistics, but was then further
developed and now include almost all features of the Fstat program
(no longer maintained), as well as several others.
The data types that hierfstat
can analyse are haploid or
diploid, unphased, multilocus genotypes. Note that each data set must be
made of only one ploidy level.
The basic data structure required by most Hierfstat
function is a data frame with the first column containing a population
identifier (preferably a number), and the next nl columns the genotype at
each of nl loci.
In hierfstat
, alleles are encoded as 1, 2 or 3 digits
numbers, and genotypes are encoded as numbers with the two alleles
collated (as in pasted together). Other type of data can be imported
(see below) but for the time being we focus on the primary data type.
Thus imagine that you have an individual genotyped at a microsatellite
locus with allele length 120
and 124
, the way
to encode it for hierfstat
is either 120124
or
124120
. If the data are SNPs, each allele at a locus could
be encoded as 1
and 2
, or you may decide to
keep the correspondence between nucleotides and alleles
(e.g. 1, 2, 3, 4
for A, C, G, T
). Thus, if the
two alleles at a SNP locus are A
and T
and an
individual is heterozygote, it could be encoded as 14
or
41
.
Example data sets are included in hierfstat
. For
instance:
library(hierfstat) #load the library
data(diploid) # info about this data set with ?diploid
head(diploid)
## Pop loc-1 loc-2 loc-3 loc-4 loc-5
## 1 1 44 43 43 33 44
## 2 1 44 44 43 33 44
## 3 1 44 44 43 43 44
## 4 1 44 44 NA 33 44
## 5 1 44 44 24 34 44
## 6 1 44 44 NA 43 44
The first individual (first row of the diploid data frame) belongs to
population 1. Its genotype at loc-1
is 44
,
thus homozygote for allele 4
. It is heterozygote for
alleles 3
and 4
at both loc-2
and
loc-3
, and homozygote for allele 3
at
loc-4
and finaly homozygote for allele 4
at
loc-5
. In fact, loc-1
and loc-4
are monomorphic, meaning that only one allele is present in all
individuals from all populations.
If a genotype is missing, it is encoded as NA
. For
instance, the fourth individual has not been typed at
loc-3
, nor did the 6th individual for the same locus.
The first column of this dataframe contains the identifier of the population to which the individual belongs. We can find how many individuals were typed in each population by using the command table:
##
## 1 2 3 4 5 6
## 8 8 5 7 9 7
As another example, we look at dataset cont.isl99
, a
data frame where alleles are encoded as 2 digits numbers:
## Pop loc.1 loc.2 loc.3 loc.4 loc.5
## 1 1 7474 1955 9168 4051 9251
## 2 1 7474 3175 9168 2410 2327
## 3 1 808 3194 9536 9751 9223
## 4 1 874 5294 1876 1310 1292
## 5 1 7484 3875 1010 5107 7712
## 6 1 874 3175 1010 5135 9292
the first individual is homozygous for allele 74
at the
first locus (loc.1
) and heterozygous fore alleles
19
and 55
at the second locus. The genotype
could have been written 5519
instead of 1955
,
it does not matter. Note the genotype of the 3rd and fourth individual
at the first locus. They both carry allele 8
, which is in
fact encoded as 08
. When it comes first, the leading 0
disappears, but it must be present in second position. Hence genotype
874
, 0874
and 7408
are the same,
but different from genotype 748
who would be understood by
hierfstat as an individual heterozygous for alleles 07
and
48
.
Last point: alleles for all loci to be analysed simultaneously must be encoded with the same number of digits.
Often the data to be imported are in a text file. If this is the
case, the easiest way to import the file into R
is via one
of the workhorse of R, the read.table
function.
If the data are in the FSTAT
format, they can be readily
imported using the function read.fstat
:
## Pop loc-1 loc-2 loc-3 loc-4 loc-5
## 1 1 44 43 43 33 44
## 2 1 44 44 43 33 44
## 3 1 44 44 43 43 44
## 4 1 44 44 NA 33 44
## 5 1 44 44 24 34 44
## 6 1 44 44 NA 43 44
adegenet
is another population genetics analysis
package, with the ability to import from several data format.
hierfstat
can import and work directly with
genind
objects generated by adegenet
:
library(adegenet,quietly=TRUE)
data(nancycats)
head(genind2hierfstat(nancycats)[,1:10]) # only the first 10 loci
## pop fca8 fca23 fca43 fca45 fca77 fca78 fca90 fca96 fca37
## N215 P01 NA 136146 139139 116120 156156 142148 199199 113113 208208
## N216 P01 NA 146146 139145 120126 156156 142148 185199 113113 208208
## N217 P01 135143 136146 141141 116116 152156 142142 197197 113113 210210
## N218 P01 133135 138138 139141 116126 150150 142148 199199 91105 208208
## N219 P01 133135 140146 141145 126126 152152 142148 193199 113113 208208
## N220 P01 135143 136146 145149 120126 150156 148148 193195 91113 208208
data(H3N2)
head(genind2hierfstat(H3N2,pop=rep(1,dim(H3N2@tab)[1]))[,1:10]) # only the first 10 positions
## pop X6 X17 X39 X42 X45 X51 X60 X72 X73
## AB434107 1 1 1 3 2 3 2 3 3 2
## AB434108 1 1 1 3 2 3 2 3 3 2
## AB438242 1 NA NA NA NA NA NA 3 3 2
## AB438243 1 NA NA NA NA NA NA 3 3 2
## AB438244 1 NA NA NA NA NA NA 3 3 2
## AB438245 1 NA NA NA NA NA NA 3 3 2
## pop loc.1 loc.2 loc.3 loc.4 loc.5 loc.6 loc.7 loc.8 loc.9 loc.10
## 1 1 129155 264292 142146 156156 157166 171179 205205 183187 196196 137140
## 2 1 145150 288292 138142 168172 157166 171175 205210 195203 196196 128134
## 3 1 155155 292300 138142 156156 157169 167171 205205 183199 184196 137140
## 4 1 150155 264292 142146 156176 157163 171175 205205 187187 196196 140140
## 5 1 150155 292300 138146 156160 157166 171171 205205 183207 187190 128128
## 6 1 155155 296296 146146 152176 157157 167171 205210 179183 196196 128140
The example below shows how to estimate gene diversities (expected
heterozygosities), observed heterozygosities, or estimate genetic
distances directly from a genind
object:
## P01 P02 P03 P04 P05 P06 P07 P08
## 0.6531333 0.7043000 0.7239111 0.7525222 0.6431222 0.7510000 0.6719333 0.7598556
## P09 P10 P11 P12 P13 P14 P15 P16
## 0.6913667 0.7020111 0.7867667 0.6718333 0.6883889 0.7928556 0.7217111 0.7016222
## P17
## 0.6046375
## P01 P02 P03 P04 P05 P06 P07 P08
## 0.5722222 0.5707111 0.6111111 0.6280222 0.5629667 0.6262667 0.5458556 0.6111111
## P09 P10 P11 P12 P13 P14 P15 P16
## 0.7407556 0.6262667 0.6627444 0.5570000 0.6752222 0.6836667 0.7474889 0.6759222
## P17
## 0.5980875
Genomic datasets are composed of (very) large numbers of bi-allelic
loci, called Single Nucleotide Polymorhisms, or SNP. A convenient and
efficient format for storing SNP data is allelic dosage, the number of
alternative alleles and individual carries at a locus. For a diploid
locus, this number can be 0 if the individual carries two reference
alleles, 1 if he is heterozygote or 2 if he carries 2 alternate alleles.
The allelic dosage format is suitable for analyses with several of the
hierfstat
functions, as described in the
hierfstat
vignette.
To import genlight objects from adegenet
, just wrap the
object’s name in an as.matrix
:
obj <-read.snp(system.file("files/exampleSnpDat.snp",package="adegenet"), chunk=2,quiet=TRUE)
as.matrix(obj)
## 1.a/t 8.g/c 11.a/c 43.t/a
## foo 1 0 2 0
## bar 0 0 1 2
## toto 1 0 NA 0
## Nyarlathotep 0 1 2 0
## an even longer label but OK since on a single line 1 1 0 0
Variant Call Format (VCF) files have become a standard for genomic data. This is the storage format used for the 1000 genomes for instance.
hierfstat
using the
function read.VCF
, built from the gaston
package function read.vcf
:library(gaston,quietly=TRUE)
filepath <-system.file("extdata", "LCT.vcf.gz", package="gaston")
x1 <- read.VCF( filepath )
## ped stats and snps stats have been set.
## 'p' has been set.
## 'mu' and 'sigma' have been set.
## A bed.matrix with 503 individuals and 607 markers.
## snps stats are set
## ped stats are set
## [1] 503 607
## A2
## A1 A C G T
## A 0 25 94 22
## C 18 0 22 111
## G 126 24 0 26
## T 27 90 22 0
by default, read.VCF
keeps only the bi-allelic SNPs
(whereas gaston::read.vcf
keeps all variants), but you can
choose to keep all variant by setting the argument
BiAllelic
to FALSE.
as.matrix(x1)
will then provide allelic dosages.
SNPRelate
package, and its function snpgdsVCF2GDS
(as the function
writes to a file, the two next code chunks will not be evaluated)library(SNPRelate)
snpgdsVCF2GDS(filepath, "test1.gds", method="biallelic.only")
snpgdsSummary("test1.gds")
f<-snpgdsOpen("test1.gds")
snpgdsVCF2GDS
stores the number of reference alleles, we
want the the number of alternate alleles:
x2<-2-snpgdsGetGeno(f)
#check that allele frequencies are the same with the two methods
all.equal(colMeans(x2)/2,colMeans(as.matrix(x1)/2),check.names=FALSE)
bed
file using
gaston
, and use an external program, plink, to convert the
VCF file into a BED file.These are only 3 possibilities that I find convenient and efficient,
but many other exist, thus feel free to import dosage data into
hierfstat
by your preferred route.
While hierfstat
contains several built-in function to
generate genetic data (e.g. sim.genot
), functions exist to
import data from external, more sophisticated genetic data simulators,
namely QuantiNemo
and ms
QuantiNemo
is a genetic simulation program for markers and traits. Data generated
by Quantinemo
can be imported using the function
read.fstat
if the save_genotype
setting in
quantinemo
is set to 1. If
the QuantiNemo
output is set to 2 (extended), 6 extra columns are outputed
and these can be read with hierfstat
using the function
qn2.read.fstat
. The component $dat
of the
object return by this function contains the genotypes of the individuals
simulated, while the component $sex
contains its sex, the
component $ped
the individuals’ pedigree and the component
$w
their fitness. For more details about the extended
FSTAT
format of QuantiNemo
, see its
manual.
## [1] "dat" "sex" "ped" "W"
## [1] "F" "F" "F" "F" "F" "F"
## Pop trait-1_locus-1 trait-1_locus-2 trait-1_locus-3 trait-1_locus-4
## 1 1 606 1515 101 404
## 2 1 606 1515 101 404
## 3 1 606 1515 101 404
## 4 1 606 1515 101 404
## 5 1 606 1515 101 404
## 6 1 606 1515 101 404
## trait-1_locus-5 trait-1_locus-6 trait-1_locus-7 trait-1_locus-8
## 1 707 404 303 101
## 2 707 415 303 101
## 3 707 404 303 101
## 4 707 415 303 101
## 5 707 415 303 101
## 6 707 1504 303 101
## trait-1_locus-9
## 1 808
## 2 808
## 3 808
## 4 808
## 5 808
## 6 808
The program ms
of Hudson is commonly used to generate genomic data, and the program msprime has
a wrapper for ms
named mspms.
I very briefly discussed the output of the ms
software
below, see its documentation
for more informations and details of arguments to be passed to
ms
.
The output looks like this:
ms 200 100 -t 20 -I 2 100 100 40 -n 2 0.01
29161
//
segsites: 23
positions: 0.0689 0.2534 0.3219 0.3350 0.3547 0.3768 0.4339 0.4359 0.4388 0.4694 0.5003 0.5422 0.6575 0.6985 0.7059 0.7147 0.7453 0.7709 0.7891 0.8439 0.8779 0.8857 0.9380
00100001100000000000000
00001001000000000001000
The first line is the ms
command line. It instructs the
program to simulate 2 populations (-I 2
) , with θ = 2N0μ = 20
(-t 20
). The 2 populations differ in size and the smallest
(the second) is a 100th of the first (-n 2 0.01
). The two
populations exchange 4Nm = 40 migrants per
generation and 100 chromosomes are sampled from each population
(-I 2 100 100 40
), and this is repeated a 100 times (hence
ms 200 100
).
The second line is a random number allowing to reapeat exactly the same sequence
The third line is a line of comments
The fourth line contains the positions of the different segregating sites
The fifth and following lines contain the genetic data, as a serie of 0 and 1 (0 is the ancestral and 1 the derived allele), collated one to the other. These are the SNP sites, with 0 being the ancestral state and 1 the derived state.
the function ms2bed
will convert ms output to a bed
object:
bed<-ms2bed(system.file("extdata","2pops_asspop.txt",package="hierfstat"))
head(as.matrix(bed[,1:10])) #first 10 columns of bed matrix
## M_1 M_2 M_3 M_4 M_5 M_6 M_7 M_8 M_9 M_10
## 1 0 0 1 0 1 0 0 2 1 0
## 2 1 0 0 0 1 0 0 1 0 0
## 3 0 0 0 0 0 0 0 2 0 0
## 4 0 0 0 0 2 0 0 2 0 0
## 5 0 0 0 0 0 0 0 2 0 0
## 6 0 0 0 0 1 0 0 2 0 0
Take some time to explore the structure of the bed
s4
object (str(bed)
), it is very useful. the @ped
slot contains info relevant to the individuals (their names, family id
etc…), while the @snps
slot contains info relevant to the
SNPs (their name, position, chromosome, etc…). The dosage matrix is
extracted with as.matrix(bed)
. This can be used as argument
to all functions containing dosage
in their names, such as
beta.dosage, pi.dosage, theta.Watt.dosage, TajimaD.dosage
.
For instance, fs.dosage
will produce estimates of
populations specific FIS and
FST.
## 1 2 All
## Fis -0.0046 0.0342 0.0148
## Fst -0.1616 0.6179 0.2282
As a more interesting example, we can reuse the eHGDP
dataset we’ve encountered previously (see ?eHGDP
), and
after having converted it to dosage data with the fstat2dos
function (there is a total of 8170
alleles in the data set), we can look at inbreeding within populations
and population structure using fs.dosage
:
dat.HGDP<-genind2hierfstat(eHGDP)
dos.HGDP<-fstat2dos(dat.HGDP[,-1])
fs.HGDP<-fs.dosage(dos.HGDP,pop=dat.HGDP[,1])
We may be interested in finding which populations have either a large FSTP > 0.15 or a low one FSTP < 0:
## Population Region Label Latitude Longitude
## 23 Colombian AMERICA 23 3.0 -68.0
## 24 Karitiana AMERICA 24 -10.0 -63.0
## 26 Pima AMERICA 26 29.0 -108.0
## 54 Surui AMERICA 54 -11.0 -62.0
## 63 Guaymi AMERICA 63 8.5 -82.0
## 64 Cabecar AMERICA 64 9.5 -84.0
## 68 Ache AMERICA 68 -24.0 -56.0
## 69 Kaingang AMERICA 69 -24.0 -52.5
## 71 Kogi AMERICA 71 11.0 -74.0
## 75 TicunaArara AMERICA 75 -4.0 -70.0
## 76 TicunaTarapaca AMERICA 76 -4.0 -70.0
## 79 Arhuaco AMERICA 79 11.0 -73.8
## Population Region Label Latitude Longitude
## 27 BantuKenya AFRICA 27 -3.0 37.0
## 28 BantuSouthWest AFRICA 28 -21.0 18.7
## 29 BantuSouthEast AFRICA 29 -28.4 27.6
## 30 Mandenka AFRICA 30 12.0 -12.0
## 31 Yoruba AFRICA 31 8.0 5.0
## 32 BiakaPygmy AFRICA 32 4.0 17.0
## 33 MbutiPygmy AFRICA 33 1.0 29.0
## 34 San AFRICA 34 -21.0 20.0
and you’ll see that samples from AFRICA have low Population specific FST, while samples from AMERICA have large population specific FST