Importing data in Hierfstat

Introduction

This vignette documents how to import or enter genotypic data for the hierfstat package. Originally this package was written to estimate and test hierarchical F-statistics, but was then further developed and now include almost all features of the Fstat program (no longer maintained), as well as several others.

Format required by most functions in Hierfstat

The data types that hierfstat can analyse are haploid or diploid, unphased, multilocus genotypes. Note that each data set must be made of only one ploidy level.

The basic data structure required by most Hierfstat function is a data frame with the first column containing a population identifier (preferably a number), and the next nl columns the genotype at each of nl loci.

In hierfstat, alleles are encoded as 1, 2 or 3 digits numbers, and genotypes are encoded as numbers with the two alleles collated (as in pasted together). Other type of data can be imported (see below) but for the time being we focus on the primary data type. Thus imagine that you have an individual genotyped at a microsatellite locus with allele length 120 and 124, the way to encode it for hierfstat is either 120124or 124120. If the data are SNPs, each allele at a locus could be encoded as 1 and 2, or you may decide to keep the correspondence between nucleotides and alleles (e.g. 1, 2, 3, 4 for A, C, G, T). Thus, if the two alleles at a SNP locus are A and T and an individual is heterozygote, it could be encoded as 14or 41.

Example data sets are included in hierfstat. For instance:

library(hierfstat) #load the library
data(diploid) # info about this data set with ?diploid
head(diploid)
##   Pop loc-1 loc-2 loc-3 loc-4 loc-5
## 1   1    44    43    43    33    44
## 2   1    44    44    43    33    44
## 3   1    44    44    43    43    44
## 4   1    44    44    NA    33    44
## 5   1    44    44    24    34    44
## 6   1    44    44    NA    43    44

The first individual (first row of the diploid data frame) belongs to population 1. Its genotype at loc-1 is 44, thus homozygote for allele 4. It is heterozygote for alleles 3and 4 at both loc-2 and loc-3, and homozygote for allele 3 at loc-4 and finaly homozygote for allele 4 at loc-5. In fact, loc-1 and loc-4 are monomorphic, meaning that only one allele is present in all individuals from all populations.

If a genotype is missing, it is encoded as NA. For instance, the fourth individual has not been typed at loc-3, nor did the 6th individual for the same locus.

The first column of this dataframe contains the identifier of the population to which the individual belongs. We can find how many individuals were typed in each population by using the command table:

table(diploid[,1])
## 
## 1 2 3 4 5 6 
## 8 8 5 7 9 7

As another example, we look at dataset cont.isl99, a data frame where alleles are encoded as 2 digits numbers:

data(cont.isl99)
head(cont.isl99)
##   Pop loc.1 loc.2 loc.3 loc.4 loc.5
## 1   1  7474  1955  9168  4051  9251
## 2   1  7474  3175  9168  2410  2327
## 3   1   808  3194  9536  9751  9223
## 4   1   874  5294  1876  1310  1292
## 5   1  7484  3875  1010  5107  7712
## 6   1   874  3175  1010  5135  9292

the first individual is homozygous for allele 74 at the first locus (loc.1) and heterozygous fore alleles 19 and 55 at the second locus. The genotype could have been written 5519 instead of 1955, it does not matter. Note the genotype of the 3rd and fourth individual at the first locus. They both carry allele 8, which is in fact encoded as 08. When it comes first, the leading 0 disappears, but it must be present in second position. Hence genotype 874, 0874 and 7408 are the same, but different from genotype 748 who would be understood by hierfstat as an individual heterozygous for alleles 07 and 48.

Last point: alleles for all loci to be analysed simultaneously must be encoded with the same number of digits.

Importing data files

Often the data to be imported are in a text file. If this is the case, the easiest way to import the file into R is via one of the workhorse of R, the read.table function.

Importing FSTAT data files

If the data are in the FSTAT format, they can be readily imported using the function read.fstat:

dip<-hierfstat::read.fstat(system.file("extdata","diploid.dat",package="hierfstat"))
head(dip)
##   Pop loc-1 loc-2 loc-3 loc-4 loc-5
## 1   1    44    43    43    33    44
## 2   1    44    44    43    33    44
## 3   1    44    44    43    43    44
## 4   1    44    44    NA    33    44
## 5   1    44    44    24    34    44
## 6   1    44    44    NA    43    44

Importing from adegenet: genind objects

adegenet is another population genetics analysis package, with the ability to import from several data format. hierfstat can import and work directly with genind objects generated by adegenet:

library(adegenet,quietly=TRUE)
data(nancycats)
head(genind2hierfstat(nancycats)[,1:10]) # only the first 10 loci
##      pop   fca8  fca23  fca43  fca45  fca77  fca78  fca90  fca96  fca37
## N215 P01     NA 136146 139139 116120 156156 142148 199199 113113 208208
## N216 P01     NA 146146 139145 120126 156156 142148 185199 113113 208208
## N217 P01 135143 136146 141141 116116 152156 142142 197197 113113 210210
## N218 P01 133135 138138 139141 116126 150150 142148 199199  91105 208208
## N219 P01 133135 140146 141145 126126 152152 142148 193199 113113 208208
## N220 P01 135143 136146 145149 120126 150156 148148 193195  91113 208208
data(H3N2)
head(genind2hierfstat(H3N2,pop=rep(1,dim(H3N2@tab)[1]))[,1:10]) # only the first 10 positions
##          pop X6 X17 X39 X42 X45 X51 X60 X72 X73
## AB434107   1  1   1   3   2   3   2   3   3   2
## AB434108   1  1   1   3   2   3   2   3   3   2
## AB438242   1 NA  NA  NA  NA  NA  NA   3   3   2
## AB438243   1 NA  NA  NA  NA  NA  NA   3   3   2
## AB438244   1 NA  NA  NA  NA  NA  NA   3   3   2
## AB438245   1 NA  NA  NA  NA  NA  NA   3   3   2
data(eHGDP)
head(genind2hierfstat(eHGDP))[,1:11] 
##   pop  loc.1  loc.2  loc.3  loc.4  loc.5  loc.6  loc.7  loc.8  loc.9 loc.10
## 1   1 129155 264292 142146 156156 157166 171179 205205 183187 196196 137140
## 2   1 145150 288292 138142 168172 157166 171175 205210 195203 196196 128134
## 3   1 155155 292300 138142 156156 157169 167171 205205 183199 184196 137140
## 4   1 150155 264292 142146 156176 157163 171175 205205 187187 196196 140140
## 5   1 150155 292300 138146 156160 157166 171171 205205 183207 187190 128128
## 6   1 155155 296296 146146 152176 157157 167171 205210 179183 196196 128140

The example below shows how to estimate gene diversities (expected heterozygosities), observed heterozygosities, or estimate genetic distances directly from a genind object:

#basic.stats(nancycats)
hierfstat::Hs(nancycats) #mean populations gene diversities
##       P01       P02       P03       P04       P05       P06       P07       P08 
## 0.6531333 0.7043000 0.7239111 0.7525222 0.6431222 0.7510000 0.6719333 0.7598556 
##       P09       P10       P11       P12       P13       P14       P15       P16 
## 0.6913667 0.7020111 0.7867667 0.6718333 0.6883889 0.7928556 0.7217111 0.7016222 
##       P17 
## 0.6046375
hierfstat::Ho(nancycats) # mean populations observed heterozygosities
##       P01       P02       P03       P04       P05       P06       P07       P08 
## 0.5722222 0.5707111 0.6111111 0.6280222 0.5629667 0.6262667 0.5458556 0.6111111 
##       P09       P10       P11       P12       P13       P14       P15       P16 
## 0.7407556 0.6262667 0.6627444 0.5570000 0.6752222 0.6836667 0.7474889 0.6759222 
##       P17 
## 0.5980875
#genet.dist(nancycats)

Allelic dosages

Genomic datasets are composed of (very) large numbers of bi-allelic loci, called Single Nucleotide Polymorhisms, or SNP. A convenient and efficient format for storing SNP data is allelic dosage, the number of alternative alleles and individual carries at a locus. For a diploid locus, this number can be 0 if the individual carries two reference alleles, 1 if he is heterozygote or 2 if he carries 2 alternate alleles. The allelic dosage format is suitable for analyses with several of the hierfstat functions, as described in the hierfstat vignette.

Importing from adegenet: genlight objects

To import genlight objects from adegenet, just wrap the object’s name in an as.matrix:

obj <-read.snp(system.file("files/exampleSnpDat.snp",package="adegenet"), chunk=2,quiet=TRUE)

as.matrix(obj)
##                                                    1.a/t 8.g/c 11.a/c 43.t/a
## foo                                                    1     0      2      0
## bar                                                    0     0      1      2
## toto                                                   1     0     NA      0
## Nyarlathotep                                           0     1      2      0
## an even longer label but OK since on a single line     1     1      0      0

importing VCF files

Variant Call Format (VCF) files have become a standard for genomic data. This is the storage format used for the 1000 genomes for instance.

  • VCF files can be imported in hierfstat using the function read.VCF, built from the gaston package function read.vcf:
library(gaston,quietly=TRUE)
filepath <-system.file("extdata", "LCT.vcf.gz", package="gaston")
x1 <- read.VCF( filepath )
## ped stats and snps stats have been set. 
## 'p' has been set. 
## 'mu' and 'sigma' have been set.
x1
## A bed.matrix with 503 individuals and 607 markers.
## snps stats are set
## ped stats are set
dim(x1)
## [1] 503 607
with(x1@snps,table(A1,A2))
##    A2
## A1    A   C   G   T
##   A   0  25  94  22
##   C  18   0  22 111
##   G 126  24   0  26
##   T  27  90  22   0

by default, read.VCF keeps only the bi-allelic SNPs (whereas gaston::read.vcf keeps all variants), but you can choose to keep all variant by setting the argument BiAllelic to FALSE.

as.matrix(x1) will then provide allelic dosages.

  • Another route to import VCF files is via the SNPRelate package, and its function snpgdsVCF2GDS (as the function writes to a file, the two next code chunks will not be evaluated)
library(SNPRelate)
snpgdsVCF2GDS(filepath, "test1.gds", method="biallelic.only")
snpgdsSummary("test1.gds")
f<-snpgdsOpen("test1.gds")

snpgdsVCF2GDS stores the number of reference alleles, we want the the number of alternate alleles:

x2<-2-snpgdsGetGeno(f) 

#check that allele frequencies are the same with the two methods
all.equal(colMeans(x2)/2,colMeans(as.matrix(x1)/2),check.names=FALSE) 
  • Last, we could import a bed file using gaston, and use an external program, plink, to convert the VCF file into a BED file.

These are only 3 possibilities that I find convenient and efficient, but many other exist, thus feel free to import dosage data into hierfstat by your preferred route.

Data generated by external genetic data simulators

While hierfstat contains several built-in function to generate genetic data (e.g. sim.genot), functions exist to import data from external, more sophisticated genetic data simulators, namely QuantiNemo and ms

Importing from Quantinemo

QuantiNemo is a genetic simulation program for markers and traits. Data generated by Quantinemo can be imported using the function read.fstat if the save_genotype setting in quantinemo is set to 1. If the QuantiNemo output is set to 2 (extended), 6 extra columns are outputed and these can be read with hierfstatusing the function qn2.read.fstat. The component $dat of the object return by this function contains the genotypes of the individuals simulated, while the component $sexcontains its sex, the component $ped the individuals’ pedigree and the component $w their fitness. For more details about the extended FSTAT format of QuantiNemo, see its manual.

dat<-qn2.read.fstat(system.file("extdata","qn2_sex.dat",package="hierfstat"))
names(dat)
## [1] "dat" "sex" "ped" "W"
head(dat$sex)
## [1] "F" "F" "F" "F" "F" "F"
head(dat$dat[,1:10])
##   Pop trait-1_locus-1 trait-1_locus-2 trait-1_locus-3 trait-1_locus-4
## 1   1             606            1515             101             404
## 2   1             606            1515             101             404
## 3   1             606            1515             101             404
## 4   1             606            1515             101             404
## 5   1             606            1515             101             404
## 6   1             606            1515             101             404
##   trait-1_locus-5 trait-1_locus-6 trait-1_locus-7 trait-1_locus-8
## 1             707             404             303             101
## 2             707             415             303             101
## 3             707             404             303             101
## 4             707             415             303             101
## 5             707             415             303             101
## 6             707            1504             303             101
##   trait-1_locus-9
## 1             808
## 2             808
## 3             808
## 4             808
## 5             808
## 6             808
#sexbias.test(dat[[1]],sex=dat[[2]])

Importing from ms

The program ms of Hudson is commonly used to generate genomic data, and the program msprime has a wrapper for ms named mspms.

I very briefly discussed the output of the ms software below, see its documentation for more informations and details of arguments to be passed to ms.

The output looks like this:

ms 200 100 -t 20 -I 2 100 100 40 -n 2 0.01

29161

//

segsites: 23

positions: 0.0689 0.2534 0.3219 0.3350 0.3547 0.3768 0.4339 0.4359 0.4388 0.4694 0.5003 0.5422 0.6575 0.6985 0.7059 0.7147 0.7453 0.7709 0.7891 0.8439 0.8779 0.8857 0.9380

00100001100000000000000

00001001000000000001000

The first line is the ms command line. It instructs the program to simulate 2 populations (-I 2) , with θ = 2N0μ = 20 (-t 20). The 2 populations differ in size and the smallest (the second) is a 100th of the first (-n 2 0.01). The two populations exchange 4Nm = 40 migrants per generation and 100 chromosomes are sampled from each population (-I 2 100 100 40), and this is repeated a 100 times (hence ms 200 100).

The second line is a random number allowing to reapeat exactly the same sequence

The third line is a line of comments

The fourth line contains the positions of the different segregating sites

The fifth and following lines contain the genetic data, as a serie of 0 and 1 (0 is the ancestral and 1 the derived allele), collated one to the other. These are the SNP sites, with 0 being the ancestral state and 1 the derived state.

the function ms2bed will convert ms output to a bed object:

bed<-ms2bed(system.file("extdata","2pops_asspop.txt",package="hierfstat"))
head(as.matrix(bed[,1:10])) #first 10 columns of bed matrix
##   M_1 M_2 M_3 M_4 M_5 M_6 M_7 M_8 M_9 M_10
## 1   0   0   1   0   1   0   0   2   1    0
## 2   1   0   0   0   1   0   0   1   0    0
## 3   0   0   0   0   0   0   0   2   0    0
## 4   0   0   0   0   2   0   0   2   0    0
## 5   0   0   0   0   0   0   0   2   0    0
## 6   0   0   0   0   1   0   0   2   0    0

Take some time to explore the structure of the bed s4 object (str(bed)), it is very useful. the @ped slot contains info relevant to the individuals (their names, family id etc…), while the @snps slot contains info relevant to the SNPs (their name, position, chromosome, etc…). The dosage matrix is extracted with as.matrix(bed). This can be used as argument to all functions containing dosagein their names, such as beta.dosage, pi.dosage, theta.Watt.dosage, TajimaD.dosage. For instance, fs.dosage will produce estimates of populations specific FIS and FST.

fs.dosage(as.matrix(bed),pop=rep(1:2,each=50)) # population specific FSTs
##           1      2    All
## Fis -0.0046 0.0342 0.0148
## Fst -0.1616 0.6179 0.2282

As a more interesting example, we can reuse the eHGDP dataset we’ve encountered previously (see ?eHGDP), and after having converted it to dosage data with the fstat2dos function (there is a total of 8170 alleles in the data set), we can look at inbreeding within populations and population structure using fs.dosage:

dat.HGDP<-genind2hierfstat(eHGDP)
dos.HGDP<-fstat2dos(dat.HGDP[,-1])
fs.HGDP<-fs.dosage(dos.HGDP,pop=dat.HGDP[,1])

We may be interested in finding which populations have either a large FSTP > 0.15 or a low one FSTP < 0:

eHGDP@other$popInfo[which(fs.HGDP$Fs[2,]>0.15),]
##        Population  Region Label Latitude Longitude
## 23      Colombian AMERICA    23      3.0     -68.0
## 24      Karitiana AMERICA    24    -10.0     -63.0
## 26           Pima AMERICA    26     29.0    -108.0
## 54          Surui AMERICA    54    -11.0     -62.0
## 63         Guaymi AMERICA    63      8.5     -82.0
## 64        Cabecar AMERICA    64      9.5     -84.0
## 68           Ache AMERICA    68    -24.0     -56.0
## 69       Kaingang AMERICA    69    -24.0     -52.5
## 71           Kogi AMERICA    71     11.0     -74.0
## 75    TicunaArara AMERICA    75     -4.0     -70.0
## 76 TicunaTarapaca AMERICA    76     -4.0     -70.0
## 79        Arhuaco AMERICA    79     11.0     -73.8
eHGDP@other$popInfo[which(fs.HGDP$Fs[2,]<0.0),]
##        Population Region Label Latitude Longitude
## 27     BantuKenya AFRICA    27     -3.0      37.0
## 28 BantuSouthWest AFRICA    28    -21.0      18.7
## 29 BantuSouthEast AFRICA    29    -28.4      27.6
## 30       Mandenka AFRICA    30     12.0     -12.0
## 31         Yoruba AFRICA    31      8.0       5.0
## 32     BiakaPygmy AFRICA    32      4.0      17.0
## 33     MbutiPygmy AFRICA    33      1.0      29.0
## 34            San AFRICA    34    -21.0      20.0

and you’ll see that samples from AFRICA have low Population specific FST, while samples from AMERICA have large population specific FST