Inference of population structure from RAD datasets:
Understanding of shared ancestry in genetic datasets is often key to their interpretation. The fineSTRUCTURE package (Lawson et al., 2012) represents a powerful model-based approach to investigating population structure using genetic data. It offers especially high resolution in inference of recent shared ancestry (see e.g. the genetic structure of the British population in Leslie et al., 2015). The high resolution of this method derives from: a) utilizing haplotype linkage information and b) focusing on the most recent coalescence (common ancestry) among the sampled individuals to derive a "co-ancestry matrix" - a summary of nearest neighbor haplotype relationships in the dataset. Further advantages when compared with other model-based methods (e.g. STRUCTURE and ADMIXTURE) include the ability to deal with a very large number of populations, explore relationships between them, and to quantify ancestry sources in each population.
The fineSTRUCTURE pipeline was designed to meet the needs of analyzing large scale SNP datasets, where chromosomal location of the markers are known and haplotypes are typically assumed to be correctly phased. Therefore, these methods have so far been inaccessible to users without high quality genome-wide haplotypes.
RADpainter is a program designed specifically to infer the co-ancestry matrix from RAD-seq data, taking full advantage of its unique features. We package this program together with the fineSTRUCTURE MCMC clustering algorithm into fineRADstructure - a complete, easy to use, and fast population inference package for RAD-seq data.
17th July 2018: v0.3.2 - A major speedup of RADpainter (about 5x faster); fixing a bug that could have led to incorrect results when different individuals have different numbers of alleles
05th March 2018: v0.3.1 - An improvement in the handling of missing genotypes (N characters) within the rad tags; eliminates a bug that could have led to a division by zero
30th January 2018: v0.3 - can now handle arbitrary ploidy (just set the maximum number of alleles per individual via the -p parameter)
25th August 2017: v0.2 - substantial improvements in handling of missing data + improvements in inferring statistical variance directly from the data (resulting in more reasonable clustering).
DOWNLOAD AND INSTALLATION
The software is available for download from GitHub. It works on most UNIX like systems (including Linux and OSX/Apple).
Clone the package from GitHub: git clone https://github.com/millanek/fineRADstructure
Enter the folder: cd fineRADstructure
Run configure: ./configure
Run make: make
You'll need the GNU Scientific Development Library, (libgsl0-dev and libgsl0 in Ubuntu, gsl-devel in OpenSUSE), or download from https://www.gnu.org/software/gsl/ Mac/Apple users also need to have the Command Line Tools installed on their machine to run "configure" and "make".
On Linux/Unix you need a resonably new(-ish) version of the gcc compiler (gcc 4.6.2 and onwards work fine). The Apple llvm compiler that comes with the Command Line Tools works fine too (all versions as far as I know).
If you get an autotools version error like
/bin/sh: aclocal-1.14: command not found
then you probably need to re-do the autotools sequence. First run:
aclocal; autoconf; automake -a
and then re-run:
Calculate the co-ancestry matrix: ./RADpainter paint INPUT_RAD_FILE.txt
Assign individuals to populations: ./finestructure -x 100000 -y 100000 -z 1000 INPUT_RAD_FILE_chunks.out INPUT_RAD_FILE_chunks.mcmc.xml
Tree building: ./finestructure -m T -x 10000 INPUT_RAD_FILE_chunks.out INPUT_RAD_FILE_chunks.mcmc.xml INPUT_RAD_FILE_chunks.mcmcTree.xml
The input file (INPUT_RAD_FILE.txt) should be in one of the following formats:
Stacks populations program output with the --radpainter option. This is the easiest option if you have your data in Stacks.
Tag Haplotype Matrix (for unmapped data): Example
Chromosome/scaffold + Tag Haplotype Matrix (for mapped data): Example
In all cases, rows correspond to RAD loci. Data columns correspond to individual haplotype calls; only sites that are variable in the dataset are needed. In diploid individuals, if the two alleles are the same, e.g. TGAT/TGAT, they can be collapsed to just TGAT. If the two alleles differ then both alleles need to be fully specified (e.g. TGAT/CGAT). Missing data is left blank; see the examples above. If you have polyploid individuals then the best way is to write out all the alleles; e.g. TGAT/TGAG/TGAG/TGAG could be the four haplotypes for a tetraploid individual.
If you have your data in a VCF file, the RADpainter hapsFromVCF command may work; feel free to get in touch if it doesn't work for your specific VCF.
If the populations --radpainter option is for some reason not optimal, the fineRADstructure package also includes a python script Stacks2fineRAD.py that can be used for conversion from the standard output of the program "populations" in Stacks (a file usually called "batch_1.haplotypes.tsv"). It has various filtering options. The script is also available here. An example command line to run this would be: python Stacks2fineRAD.py -i Stacks_output.tsv -n 10 -m 30. The -n option, specifies the maximum number of SNPs allowed at a locus (misaligned paralogs can lead to unusually many SNPs). Using the -m option, you discard individuals with unusually high levels of missing data (e.g. using option -m 30 we set the cutoff to 30%).
Finally, there is an utility script from Edgardo M. Ortiz enabling conversion from the format used by the pyRAD and ipyrad toolkits https://github.com/edgardomortiz/fineRADstructure-tools.
Note: If you have a reference genome, the RAD loci should ideally be ordered according to genomic coordinates. If you have unmapped loci, we provide a script sampleLD.R that can reorder loci according to linkage disequilibrium (LD). If LD is strong and loci are not sorted, this could lead to overconfident clustering. Therefore, we recommend using the sampleLD.R script before running RADpainter to users with unmapped data who want to be extra careful to ensure they obtain a CONSERVATIVE upper bound on the number of statistically identifiable clusters. Example command line: Rscript sampleLD.R -s 1 -n 500 INPUT_RAD_FILE.txt INPUT_RAD_FILE_reordered.txt. This should do the job. If you want to understand the options, they are described in the R script.
Note 2: The amount of missing data should not vary too much across individuals (remove outliers if necessary) or systematically between putative populations (test a few runs with filtering tags/individuals at various stringency levels; as a rule of thumb, I'd say that missingness over 50% per individual can cause problems).
If you are familiar with R then the easiest way to plot the results may be by adapting our R scripts: fineRADstructurePlot.R and FinestructureLibrary.R, which are included within the package. Just follow the instructions in the first script. The R scripts are a simplified version of the Daniel Lawson's code from the finestructure website.
Alternatively you can try to install and use the fineStructure GUI