---
title: "Importing data in Hierfstat"
author: "Jerome Goudet"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Importing data in Hierfstat}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage{utf8}{inputenc}
---

# Introduction

This vignette documents how to import or enter genotypic data for the `hierfstat` package.  Originally this package was written to estimate and test hierarchical F-statistics, but was then further developed and now include almost all features of the  [Fstat program](https://www2.unil.ch/popgen/softwares/fstat.htm) (no longer maintained), as well as several others. 


# Format required by most functions in *Hierfstat*

The data types that `hierfstat` can analyse are haploid or diploid, unphased, multilocus genotypes. Note that each data set must be made of only one ploidy level.  

The basic data structure required by most `Hierfstat` function is a data frame with the first column containing a population identifier (preferably a number), and the next $nl$ columns the genotype at each of $nl$ loci.  

In `hierfstat`, alleles are encoded as 1, 2 or 3 digits numbers, and genotypes are encoded as numbers with the two alleles collated (as in pasted together).  Other type of data can be imported (see below) but for the time being we focus on the primary data type. Thus imagine that you have an individual genotyped at a microsatellite locus with allele length `120` and `124`, the way to encode it for `hierfstat` is either `120124`or `124120`.   If the data are SNPs, each allele at a locus could be encoded as `1` and `2`, or you may decide to keep the correspondence between nucleotides and alleles (e.g. `1, 2, 3, 4` for `A, C, G, T`). Thus, if the two alleles at a SNP locus are `A` and `T` and an individual is heterozygote, it could be encoded as `14`or `41`.  

Example data sets are included in `hierfstat`. For instance:

```{r,message=FALSE}
library(hierfstat) #load the library
data(diploid) # info about this data set with ?diploid
head(diploid)
```

The first individual (first row of the diploid data frame) belongs to population 1. Its genotype at `loc-1` is `44`, thus homozygote for allele `4`.  It is heterozygote for alleles `3`and `4` at both `loc-2` and `loc-3`, and homozygote for allele `3` at `loc-4` and finaly homozygote for allele `4` at `loc-5`.  In fact, `loc-1` and `loc-4` are monomorphic, meaning that only one allele is present in all individuals from all populations.

If a genotype is missing, it is encoded as `NA`. For instance, the fourth individual has not been typed at `loc-3`, nor did the 6th individual for the same locus. 

The first column of this dataframe contains the identifier of the population to which the individual belongs. We can find how many individuals were typed in each population by using the command table:

```{r}
table(diploid[,1])
```


As another example, we look at dataset `cont.isl99`, a data frame where alleles are encoded as 2 digits numbers:

```{r}
data(cont.isl99)
head(cont.isl99)
```

the first individual is homozygous for allele `74` at the first locus (`loc.1`) and heterozygous fore alleles `19` and `55` at the second locus.   The genotype could have been written `5519` instead of `1955`, it does not matter.  Note the genotype of the 3rd and fourth individual at the first locus.  They both carry allele `8`, which is in fact encoded as `08`.  When it comes first, the leading 0 disappears, but it must be present in second position. Hence genotype `874`, `0874` and `7408` are the same, but different from genotype `748` who would be understood by hierfstat as an individual heterozygous for alleles `07` and `48`.

Last point: alleles for all loci to be analysed simultaneously must be encoded with the same number of digits.


# Importing data files 

Often the data to be imported are in a text file. If this is the case, the easiest way to import the file into `R` is via one of the workhorse of R, the `read.table` function.

## Importing FSTAT data files

If the data are in the `FSTAT` format, they can be readily imported using the function `read.fstat`:

```{r}
dip<-hierfstat::read.fstat(system.file("extdata","diploid.dat",package="hierfstat"))
head(dip)
```


## Importing from adegenet: genind objects

`adegenet` is another population genetics analysis package, with the ability to import from several data format. `hierfstat` can import and work directly with `genind` objects generated by `adegenet`:  

```{r,message=FALSE}
library(adegenet,quietly=TRUE)
data(nancycats)
head(genind2hierfstat(nancycats)[,1:10]) # only the first 10 loci
data(H3N2)
head(genind2hierfstat(H3N2,pop=rep(1,dim(H3N2@tab)[1]))[,1:10]) # only the first 10 positions
data(eHGDP)
head(genind2hierfstat(eHGDP))[,1:11] 
```

The example below shows how to estimate gene diversities (expected heterozygosities), observed heterozygosities, or estimate genetic distances directly from a `genind` object: 

```{r,message=FALSE}
#basic.stats(nancycats)
hierfstat::Hs(nancycats) #mean populations gene diversities
hierfstat::Ho(nancycats) # mean populations observed heterozygosities
#genet.dist(nancycats)
```


# Allelic dosages

Genomic datasets are composed of (very) large numbers of bi-allelic loci, called Single Nucleotide  Polymorhisms, or SNP.  A convenient and efficient format for storing SNP data is allelic dosage, the number of alternative alleles and individual carries at a locus. For a diploid locus, this number can be 0 if the individual carries two reference alleles, 1 if he is heterozygote or 2 if he carries 2 alternate alleles. The allelic dosage format is suitable for analyses with several of the `hierfstat` functions, as described in the `hierfstat` vignette. 


## Importing from adegenet: genlight objects

To import genlight objects from `adegenet`, just wrap the object's name in an `as.matrix`: 

```{R,message=FALSE}
obj <-read.snp(system.file("files/exampleSnpDat.snp",package="adegenet"), chunk=2,quiet=TRUE)

as.matrix(obj)

```

## importing VCF files 

[Variant Call Format (VCF)](https://samtools.github.io/hts-specs/) files have become a standard for genomic data. This is the storage format used for the [1000 genomes](https://www.internationalgenome.org/) for instance. 

* VCF files can be imported in `hierfstat`  using the function `read.VCF`, built from the `gaston` package function `read.vcf`:


```{r,message=FALSE}
library(gaston,quietly=TRUE)
filepath <-system.file("extdata", "LCT.vcf.gz", package="gaston")
x1 <- read.VCF( filepath )
x1
dim(x1)
with(x1@snps,table(A1,A2))
```

by default, `read.VCF` keeps only the bi-allelic SNPs (whereas `gaston::read.vcf` keeps all variants), but you can choose to keep all variant by setting the argument `BiAllelic` to FALSE.  

`as.matrix(x1)` will then provide allelic dosages.


* Another route to import VCF files is via the `SNPRelate` package, and its function `snpgdsVCF2GDS` (as the function writes to a file, the two next code chunks will not be evaluated) 

```{r,eval=FALSE}
library(SNPRelate)
snpgdsVCF2GDS(filepath, "test1.gds", method="biallelic.only")
snpgdsSummary("test1.gds")
f<-snpgdsOpen("test1.gds")
```

`snpgdsVCF2GDS` stores the number of reference alleles, 
we want the the number of alternate alleles:

```{r,eval=FALSE}
x2<-2-snpgdsGetGeno(f) 

#check that allele frequencies are the same with the two methods
all.equal(colMeans(x2)/2,colMeans(as.matrix(x1)/2),check.names=FALSE) 
```

* Last, we could import a `bed` file using `gaston`, and use an external program, [plink](https://www.cog-genomics.org/plink/2.0/), to convert the VCF file into a BED file. 

These are only 3 possibilities that I find convenient and efficient, but many other exist, thus feel free to import dosage data into `hierfstat` by your preferred route. 


# Data generated by external genetic data simulators 

While `hierfstat` contains several built-in function to generate genetic data (e.g. `sim.genot`), functions exist to import data from external, more sophisticated genetic data simulators, namely [QuantiNemo](https://www2.unil.ch/popgen/softwares/quantinemo/) and [ms](http://home.uchicago.edu/~rhudson1/source/mksamples.html) 

## Importing from Quantinemo

[QuantiNemo](https://www2.unil.ch/popgen/softwares/quantinemo/) is a genetic simulation program for markers and traits. Data generated by `Quantinemo` can be imported using the function `read.fstat` if the `save_genotype` setting in `quantinemo` is set to $1$.  If the `QuantiNemo` output is set to $2$ (extended), 6 extra columns are outputed and these can be read with `hierfstat`using the function  `qn2.read.fstat`. The component  `$dat` of the object return by this function contains the genotypes of the  individuals simulated, while the component `$sex`contains its sex, the component `$ped` the individuals' pedigree and the component `$w` their fitness.  For more details about the extended `FSTAT` format of `QuantiNemo`, see its manual. 


```{r}
dat<-qn2.read.fstat(system.file("extdata","qn2_sex.dat",package="hierfstat"))
names(dat)
head(dat$sex)
head(dat$dat[,1:10])
#sexbias.test(dat[[1]],sex=dat[[2]])
```


## Importing from ms

The program [ms](http://home.uchicago.edu/~rhudson1/source/mksamples.html) of Hudson is commonly used to generate genomic data, and the program [msprime](https://tskit.dev/msprime/docs/stable/intro.html) has a wrapper for `ms` named [mspms](https://tskit.dev/msprime/docs/stable/cli.html#sec-mspms).   

I very briefly discussed the output of the `ms` software below, see its [documentation](https://uchicago.app.box.com/s/l3e5uf13tikfjm7e1il1eujitlsjdx13/file/245765534272) for more informations and details of arguments to be passed to `ms`.  

The output looks like this:

>  ms 200 100 -t 20 -I 2 100 100 40 -n 2 0.01 
>  
>  29161
>  
>  //
>  
>  segsites: 23
>  
>  positions: 0.0689 0.2534 0.3219 0.3350 0.3547 0.3768 0.4339 0.4359 0.4388 0.4694 0.5003 0.5422 0.6575 0.6985 0.7059 0.7147 0.7453 0.7709 0.7891 0.8439 0.8779 0.8857 0.9380 
>  
>  00100001100000000000000
>  
>  00001001000000000001000

The first line is the `ms` command line. It instructs the program to simulate 2 populations (`-I 2`) , with $\theta=2N_0\mu=20$ (`-t 20`).  The 2 populations differ in size and the smallest (the second) is a 100th of the first (`-n 2 0.01`). The two populations exchange $4Nm=40$ migrants per generation and  100 chromosomes are sampled from each population (`-I 2 100 100 40`), and this is repeated a 100 times (hence `ms 200 100`). 

The second line is a random number allowing to reapeat exactly the same sequence

The third line is a line of comments

The fourth line contains the positions of the different segregating sites

The fifth and following lines contain the genetic data, as a serie of 0 and 1 (0 is the ancestral and 1 the derived allele), collated one to the other.  These are the SNP sites, with 0 being the ancestral state and 1 the derived state. 

the function `ms2bed` will convert ms output to a bed object:

```{r}
bed<-ms2bed(system.file("extdata","2pops_asspop.txt",package="hierfstat"))
head(as.matrix(bed[,1:10])) #first 10 columns of bed matrix
```

Take some time to explore the structure of the `bed` s4 object (`str(bed)`), it is very useful. the `@ped` slot contains info relevant to the individuals (their names, family id etc...), while the `@snps` slot contains info relevant to the SNPs (their name, position, chromosome, etc...). The dosage matrix is extracted with `as.matrix(bed)`.  This can be used as argument to all functions containing `dosage`in their names, such as  `beta.dosage, pi.dosage, theta.Watt.dosage, TajimaD.dosage`.  For instance, `fs.dosage`  will produce estimates of populations specific $F_{IS}$ and $F_{ST}$.


```{r}
fs.dosage(as.matrix(bed),pop=rep(1:2,each=50)) # population specific FSTs
```

As a more interesting example, we can reuse the `eHGDP` dataset we've encountered previously (see `?eHGDP`), and after having converted it to dosage data with the `fstat2dos` function (there is a total of $8170$ alleles in the data set), we can look at inbreeding within populations and population structure using `fs.dosage`:

```{r}
dat.HGDP<-genind2hierfstat(eHGDP)
dos.HGDP<-fstat2dos(dat.HGDP[,-1])
fs.HGDP<-fs.dosage(dos.HGDP,pop=dat.HGDP[,1])
```

We may be interested in finding which populations have either a large $F_{ST}^P>0.15$ or a low one $F_{ST}^P <0$:

```{r}
eHGDP@other$popInfo[which(fs.HGDP$Fs[2,]>0.15),]
eHGDP@other$popInfo[which(fs.HGDP$Fs[2,]<0.0),]

```

and you'll see that samples from AFRICA have low Population specific $F_{ST}$, while samples from AMERICA have large population specific $F_{ST}$