---
title: "Hierfstat latest features"
author: "Jerome Goudet"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Hierfstat latest features}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage{utf8}{inputenc}
---

# Introduction

This vignette documents the latest developments in `hierfstat`. Refer to the [hierfstat article](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1471-8286.2004.00828.x) and a step by step  [tutorial](https://www.sciencedirect.com/science/article/pii/S1567134807001037) for an introduction to the package

# Loading data

Data can be imported in `hierfstat`many different ways (fstat format, tabular format, dosage data, even VCF format), as described in the import vignette.  `hierfstat` can now also read `genind` objects (from package `adegenet`). Note however that only some genetic data types will be properly converted and used.  The alleles need to be encoded either as integer (up to three digits per allele), or as nucleotides (`c("a","c","g","t","A","C","G","T")`).

```{r,message=FALSE}
library(adegenet)
library(hierfstat)
```

```{r}
data(nancycats) 
is.genind(nancycats)
```
# Descriptive statistics

The function you are the most likely to want using is `basic.stats` (it calculates $H_O$, $H_S$, $F_{IS}$, $F_{ST}$ etc...). 

```{r}
bs.nc<-basic.stats(nancycats)
bs.nc
boxplot(bs.nc$perloc[,1:3]) # boxplot of Ho, Hs, Ht
```

You can also get e.g. `allele.count` and `allelic.richness`, a rarefied measure of the number of alleles at each locus and in each population. For instance, below is a boxplot of the allelic richness for the 5 first loci in the nancycats dataset

```{r}
boxplot(t(allelic.richness(nancycats)$Ar[1:5,])) #5 first loci
```

# Population statistics

Population statistics are obtained through `basic.stats` or `wc` (`varcomp.glob` can also be used and will give the same result as `wc` for a one level hierarchy). For instance, $F_{IS}$ and $F_{ST}$ in the *Galba truncatula* dataset provided with `hierfstat` are obtained as:

```{r}
data(gtrunchier) 
wc(gtrunchier[,-2])
varcomp.glob(data.frame(gtrunchier[,1]),gtrunchier[,-c(1:2)])$F #same
```


Confidence intervals on these statistics can be obtained via `boot.vc`:

```{r}
boot.vc(gtrunchier[,1],gtrunchier[,-c(1:2)])$ci
```

`boot.ppfis`and `boot.ppfst` provide bootstrap confidence intervals (bootstrapping over loci) for population specific $F_{IS}$ and pairwise $F_{ST}$ respectively. 


## Hierarchical analyses

# Testing 


## Genetic distances

`genet.dist` estimates one of 10 different genetic distances between populations as described mostly in Takezaki & Nei (1996)

```{r}
(Ds<-genet.dist(gtrunchier[,-2],method="Ds")) # Nei's standard genetic distances
```


## Population Principal coordinate analysis

Principal coordinate analysis can be carried out on this genetic distances:

```{r,eval=FALSE}
pcoa(as.matrix(Ds))
```

## Population specific $F_{ST}$s

Function `betas` will give estimates of population specific $F_{ST}^i$ for data in the `fstat` format.  For instance, using the `nancycats` dataset:

```{r}
barplot(betas(nancycats)$betaiovl)

```

Functions `fs.dosage, fst.dosage`  and `fis.dosage` give estimates of population specific $F_{ST}^i$ and $F_{IS}^i$ for dosage data, `fs.dosage` also outputs the matrix of $F_{ST}^{XY}$, as defined in [Weir and Goudet (2017)](https://academic.oup.com/genetics/article/206/4/2085/60725905) and [Weir and Hill (2002)](https://www.annualreviews.org/doi/abs/10.1146/annurev.genet.36.050802.093940). We use the `gtrunchier` dataset to illustrate the output of `fs.dosage`. We first need to convert `gtrunchier` to a dosage format  via `fstat2dos` 

```{r}
gt.dos<-fstat2dos(gtrunchier[,-c(1:2)]) #converts fstat format to dosage 
fs.gt<-fs.dosage(gt.dos,pop=gtrunchier[,2]) 
image(1:29,1:29,fs.gt$FsM,main=expression(F[ST]^{XY}))
```

We clearly see the block structure of patches within locality, with the second block along the diagonal showing higher similarity than the others. Blocks off the main diagonal are lighter, showing less genetic similarity between localities than within.   

Functions `pi.dosage`, `theta.Watt.dosage`  and `TajimaD.dosage` calculate nucleotide diversity, Watterson's $\theta_W$ and Tajima's $D$ respectively, from dosage data. 

# Individual statistics

## Individual PCA

`indpca` carries out Principal component analysis on individual genotypes. To illustrate, we use the `gtrunchier` datasets, with individuals colored according to their locality of origin:

```{r}
x<-indpca(gtrunchier[,-2],ind.labels=gtrunchier[,2])
plot(x,col=gtrunchier[,1],cex=0.5)
```


## kinships, GRM  and individual inbreeding coefficients

Functions `betas` and `beta.dosage` give estimates of individual inbreeding coefficients and kinships between individuals, the former for genotypes in the `fstat` format, the latter for dosage data:

```{r}
image(beta.dosage(gt.dos),main="kinship and inbreeding \n in Galba truncatula",xlab="",ylab="")
```

This example is just for illustrating purposes, I do not advise using these individual statistics unless you have a large number of polymorphic markers ($\geq 1000$ at least, better if $\geq 10000$, see [Goudet, Kay & Weir (2018)](https://onlinelibrary.wiley.com/doi/full/10.1111/mec.14833)).

# Simulating genetic data

It is now possible to simulate genetic data from a continent islands model, either at equilibrium via `sim.genot` or for a given number of generations via `sim.genot.t`.  These two functions have several arguments, allowing to look at the effect of population sizes, inbreeding, migration and mutation. the number of independent loci and number of alleles per loci can also be specified. It is also possible to simulate data from a finite island model using `sim.genot.t`, by specifying that argument `IIM` is `FALSE`.  Last,`sim.genot.metapop.t` generates genetic data with any migration matrix, population size and inbreeding level, while still assuming independence of the genetic markers. 


# Exporting to other programs

(refer to the import vignette to import data from other programs)

Other than the `fstat` format, `hierfstat`can now export to files in format suitable for [Bayescan](http://cmpg.unibe.ch/software/BayeScan/), [plink](https://www.cog-genomics.org/plink/2.0/) and [structure](https://web.stanford.edu/group/pritchardlab/structure.html). The functions to export to these programs are `write.bayescan`, `write.ped` and `write.struct` respectively.


# Miscellaneous 

## Sex-biased dispersal

A function to detect sex-biased dispersal, `sexbias.test` based on [Goudet et al. (2002)](https://onlinelibrary.wiley.com/doi/abs/10.1046/j.1365-294X.2002.01496.x) has been implemented. 
To illustrate its use, load the *Crocidura russula* data set. It consists of the genotypes and sexes of 140 shrews studied by [Favre et al. (1997)](https://royalsocietypublishing.org/doi/10.1098/rspb.1997.0019). In this species, mark -recapture showed an excess of dispersal from females, an unusual pattern in mammals. This is confirmed using genetic data: 

```{r}
data("crocrussula")
aic<-AIc(crocrussula$genot)
boxplot(aic~crocrussula$sex)
tapply(aic,crocrussula$sex,mean)
sexbias.test(crocrussula$genot,crocrussula$sex)
```