PII: S1872-4973(17)30149-7
DOI: http://dx.doi.org/doi:10.1016/j.fsigen.2017.07.007
Reference: FSIGEN 1748
Please cite this article as: Alexandre Gouy, Martin Zieger, STRAF—a convenient
online tool for STR data evaluation in forensic genetics, Forensic Science International:
Genetics, http://dx.doi.org/10.1016/j.fsigen.2017.07.007
This is a PDF file of an unedited manuscript that has been accepted for publication.
As a service to our customers we are providing this early version of the manuscript.
The manuscript will undergo copyediting, typesetting, and review of the resulting proof
before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that
apply to the journal pertain.
STRAF – a convenient online tool for STR data evaluation in forensic genetics
Alexandre Gouy1,2
Alexandre.Gouy@iee.unibe.ch
Martin.Zieger@irm.unibe.ch
1 Computational and Molecular Population Genetics, Institute of Ecology and Evolution, University of
2 Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
3 Institute of Forensic Medicine, Forensic Molecular Biology Dpt., University of Bern, Sulgenauweg 40,
Highlights:
STRAF (STR Analysis for Forensics) computes all the required standard statistics for
autosomal STR population data in forensic genetics.
It is freely available, easy to use, requires a very simple input file and results can be
downloaded as convenient tables or displayed graphically online.
STRAF includes a PCA module that can be used for quality control and population
substructure detection.
Abstract
Population data in forensic genetics has to be checked for a variety of statistical parameters before it
can be employed for case work. A lot of very powerful statistical tools are available for this task, most
of them developed by labs having their research focus on population genetics or evolution. However,
most of these programs require a substantial amount of experience. In addition, to our knowledge,
none of the freely available programs calculates all the common parameters for a population study in
forensic genetics at once, based on a single input file. We present here a convenient online tool that
fills this gap. STRAF (STR Analysis for Forensics) provides an intuitive interface and input file format
and computes all the relevant parameters for a classical population study based on autosomal STR
data at once and in a convenient way. In addition, STRAF includes a PCA module that can be used for
population substructure detection or quality control. The results generated by the program were
1. Introduction
tools specifically designed to compute summary statistics on STR data. To assess the utility of
software designed for forensic genetics population data analysis, we screened all publications that
appeared in 2016 in FSI Genetics, International Journal of Legal Medicine and Legal Medicine. We
found 40 studies presenting original autosomal STR population data. In 36 of them the authors
provide forensic parameters such as power of exclusion (PE), power of discrimination (PD) or
polymorphism information content (PIC). For 31 of those publications (86 %) the Excel Spreadsheet
PowerStats provided by Promega Corporation has been used to calculate the forensic parameters.
Despite its proven utility, this valuable contribution to forensic genetics data analysis also has a
variety of inconveniences. First, PowerStats does not support the analysis of complete data sets at
once because calculation of forensic parameters has to be performed locus by locus. Second, it is no
longer supported by the company (personal communication). Third, it provides only a limited variety
of statistical analyses, restricted to the parameters that are dedicated for use in forensics, but it does
not allow more general genetic analyses, such as the computation of F-statistics or testing for Hardy-
Weinberg equilibrium.
With STRAF (STR Analysis for Forensics), we provide the forensic genetics community with a freely
accessible online tool that can compute all the statistics necessary to check the validity of an
autosomal population data set at once, based on a single input file. In addition to the analysis of
autosomal STR data, the program can also handle haploid data such as Y-STR haplotypes. In the
following, we present the different features implemented in STRAF. We also validated the tool by
recalculating genetic parameters from an already published population study [1]. STRAF can be found
online at http://www.cmpg.iee.unibe.ch/services/shiny/.
STRAF is an R-based [2] online tool that brings together a couple of previously existing statistical
STRAF accepts tab-delimited text tables of varying content. The first column must contain the
sample ID and the second the population ID (if several populations are studied). Conveniently for
the analysis of forensically relevant autosomal STR data, STRAF accepts point alleles. Allele data are
entered with one column per locus for haploid samples and with two columns per locus for diploid
samples. Missing data (e.g. null alleles) must be indicated with a “0”. An example of a diploid input file is
provided online.
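The layout described above can be illustrated with a minimal Python sketch. It is not part of STRAF. The header row (each locus name repeated over its two allele columns), the locus names, and the helper `parse_diploid` are assumptions made for illustration only. Alleles are kept as text so that point alleles such as 9.3 survive unchanged.

```python
import csv
import io

def parse_diploid(text):
    """Parse tab-delimited diploid STR data (hypothetical helper).

    Assumed layout, not confirmed by the source: a header row, then one row
    per sample holding the sample ID, the population ID and two allele
    columns per locus. Alleles coded "0" are missing and stored as None.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    # Locus names are read from every second allele column of the header.
    loci = header[2::2]
    samples = []
    for row in body:
        alleles = [None if a == "0" else a for a in row[2:]]
        genotypes = {
            locus: (alleles[2 * k], alleles[2 * k + 1])
            for k, locus in enumerate(loci)
        }
        samples.append({"id": row[0], "pop": row[1], "genotypes": genotypes})
    return samples

# Illustrative data only; sample IDs and genotypes are made up.
example = (
    "ind\tpop\tTH01\tTH01\tD3S1358\tD3S1358\n"
    "S1\tA\t6\t9.3\t15\t16\n"
    "S2\tA\t7\t7\t0\t0\n"
)
samples = parse_diploid(example)
print(samples[0]["genotypes"]["TH01"])     # ('6', '9.3')
print(samples[1]["genotypes"]["D3S1358"])  # (None, None)
```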
It is also possible to use a format similar to the Genepop [3] input format, with both alleles (for
diploid data) coded in one column, either with 2 or 3 digits. Note that in order to use this format, no
The raw data can be displayed and checked on the website after import.
b. Allele frequency table
STRAF calculates the allele frequencies for all loci from the input file and returns a complete table in
a standard format that can be checked online and also downloaded as a TSV (tab-separated values)
file. The allele frequencies per locus can be plotted, to check for unexpected allele frequency
distributions. If the user wishes to save any displayed plot, all visualizations can be modified using the
c. Forensic parameters
STRAF calculates the following parameters relevant for forensic practice, based on the given
formulas:
Genetic diversity (GD), or expected heterozygosity (Hexp), is computed using its unbiased estimator
[4]:
$$H_{exp} = GD = \frac{n}{n-1}\left(1 - \sum_{i=1}^{k} p_i^2\right),$$
where n is the number of gene copies sampled, k is the number of distinct alleles at the locus, and
p_i is the frequency of the ith allele in the population.
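As an illustration of this estimator, the following Python sketch derives allele frequencies from the genotypes at one locus and applies the correction factor n/(n − 1). It is a sketch under stated assumptions, not STRAF's own code: the function names are hypothetical, the genotypes are made up, and missing alleles are assumed to be coded as `None`.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return (frequencies, n) for one locus; n counts non-missing gene copies."""
    copies = [a for g in genotypes for a in g if a is not None]
    n = len(copies)
    counts = Counter(copies)
    return {allele: c / n for allele, c in counts.items()}, n

def genetic_diversity(genotypes):
    """Unbiased expected heterozygosity: n/(n-1) * (1 - sum of squared p_i)."""
    freqs, n = allele_frequencies(genotypes)
    return n / (n - 1) * (1 - sum(p * p for p in freqs.values()))

# Four made-up diploid genotypes at one locus (8 gene copies).
locus = [("6", "9.3"), ("6", "6"), ("7", "9.3"), ("6", "7")]
freqs, n = allele_frequencies(locus)
gd = genetic_diversity(locus)
print(freqs["6"], freqs["7"], freqs["9.3"], n)  # 0.5 0.25 0.25 8
print(round(gd, 4))                              # 0.7143
```

With the frequencies 0.5, 0.25 and 0.25, the sum of squares is 0.375, so GD = (8/7) × 0.625 = 5/7.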
The match probability (PM) is defined as the probability of a match between two unrelated
$$PM = \sum_{i} G_i^2,$$
The power of discrimination (PD) is defined as the probability of discriminating between two
unrelated individuals:
$$PD = 1 - PM.$$
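A short Python sketch of these two quantities follows. It assumes that G_i denotes the observed frequency of the ith genotype at the locus, which is how PM is commonly computed, and that all genotypes are complete. The function name and the data are hypothetical.

```python
from collections import Counter

def match_probability(genotypes):
    """PM as the sum of squared observed genotype frequencies (assumed meaning of G_i)."""
    # Sort each allele pair so that ("9.3", "6") and ("6", "9.3") are one genotype.
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

# Made-up genotypes: 6/9.3 occurs twice, 6/6 and 6/7 once each.
locus = [("6", "9.3"), ("6", "6"), ("9.3", "6"), ("6", "7")]
pm = match_probability(locus)
pd = 1 - pm
print(pm, pd)  # 0.375 0.625
```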
Polymorphism Information Content (PIC) can be interpreted as the probability that the maternal and
paternal alleles of a child are deducible or the probability of being able to deduce which allele a
𝑛 𝑛−1 𝑛
$$PE = h^2\left(1 - 2hH^2\right),$$
Finally, the typical paternity index (TPI) reflects the “mean PI for random non-excluded men” (see
also [8]) for a given locus. Let H be the frequency of homozygotes, then
$$TPI = \frac{1}{2H}.$$
Again, a graphical representation of all parameters for every locus is possible. Such a quick overview
can be useful, e.g. to check whether the observed distribution of PE over all loci fits expectations.
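The PE and TPI formulas above can be sketched in Python as follows. The sketch assumes that h is the observed frequency of heterozygotes, so that h = 1 − H; the definition of h is not legible in this copy of the text, so this is an assumption. Function names and data are hypothetical, and genotypes are assumed to be complete.

```python
def homozygote_frequency(genotypes):
    """Observed proportion of homozygous genotypes (H) at one locus."""
    return sum(1 for a, b in genotypes if a == b) / len(genotypes)

def power_of_exclusion(genotypes):
    """PE = h^2 * (1 - 2 * h * H^2)."""
    H = homozygote_frequency(genotypes)
    h = 1 - H  # assumption: h is the observed heterozygote frequency
    return h ** 2 * (1 - 2 * h * H ** 2)

def typical_paternity_index(genotypes):
    """TPI = 1 / (2H); unbounded when no homozygote is observed."""
    H = homozygote_frequency(genotypes)
    if H == 0:
        return float("inf")
    return 1 / (2 * H)

# Made-up genotypes: one homozygote out of four, so H = 0.25 and h = 0.75.
locus = [("6", "9.3"), ("6", "6"), ("7", "9.3"), ("6", "7")]
pe = power_of_exclusion(locus)
tpi = typical_paternity_index(locus)
print(pe, tpi)  # 0.509765625 2.0
```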
d. F statistics
For files containing different subpopulations, FST and FIS per locus can be calculated [9, 10]. Pairwise
FST values between subpopulations are also computed over all loci. Based on the F statistics, the user
can decide to treat all subpopulations separately or as a single population, depending on their degree
of genetic similarity.
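To give an intuition for what FST measures, the sketch below uses a deliberately simple heterozygosity-based form, FST = (H_T − H_S) / H_T, for a single locus. This is not the Weir and Cockerham estimator cited above [10], which additionally corrects for sample sizes, and it is not STRAF's implementation. All names and data are hypothetical.

```python
from collections import Counter

def allele_freqs(alleles):
    """Allele frequencies from a list of gene copies."""
    counts = Counter(alleles)
    n = len(alleles)
    return {a: c / n for a, c in counts.items()}

def fst_simple(subpops):
    """Simple single-locus FST = (H_T - H_S) / H_T.

    subpops maps a population ID to the list of allele copies observed at
    one locus. Subpopulations are weighted equally (a simplification).
    """
    freqs = [allele_freqs(alleles) for alleles in subpops.values()]
    all_alleles = set().union(*freqs)
    # H_S: mean expected heterozygosity within subpopulations.
    h_s = sum(1 - sum(p * p for p in f.values()) for f in freqs) / len(freqs)
    # H_T: expected heterozygosity computed from the mean allele frequencies.
    mean_p = {a: sum(f.get(a, 0.0) for f in freqs) / len(freqs) for a in all_alleles}
    h_t = 1 - sum(p * p for p in mean_p.values())
    if h_t == 0:
        return 0.0  # monomorphic locus: no differentiation can be measured
    return (h_t - h_s) / h_t

identical = {"A": ["6", "7", "6", "7"], "B": ["7", "6", "7", "6"]}
fixed = {"A": ["6", "6", "6", "6"], "B": ["7", "7", "7", "7"]}
print(fst_simple(identical))  # 0.0
print(fst_simple(fixed))      # 1.0
```

Values close to 0 indicate that the subpopulations share similar allele frequencies, whereas values close to 1 indicate strong differentiation.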
STRAF also allows testing whether the population sample is in Hardy-Weinberg equilibrium
(HWE). The test for HWE consists of computing the p-value of an exact test based on 1,000 Monte
Carlo permutations of alleles [11]. As a rough rule of thumb, we usually assume HWE if p-values for
Principal component analysis (PCA) allows for the detection of possible genotype clusters that might
be caused by population substructure [12]. For forensic population studies based on autosomal STR
data, we usually expect a rather homogeneous distribution, i.e. observing a single cluster on the PCA
due to a lack of population structure. This expectation might be useful for quality control of the data.
In data sets that are supposed to cluster homogeneously, outliers can be identified on the PCA plot,
facilitating a check for possible errors through data analysis or sampling (e.g. samples resulting from
undeclared migration events or sampling errors). Figure 1 shows an example of the detection of a
clear outlier using a PCA based on a supposedly homogeneous data set. Checking back on the
corresponding electropherograms confirmed that the genotype of this sample has been assigned
correctly. Therefore, in this case, the outlier could be due to a migration event that has not been
PCA can also be used for substructure analysis of haploid population data. Figure 2 shows an example
of a PCA performed on Y-STR data. It has been performed using STR haplotypes of known
haplogroup, as well as samples of undefined haplogroup (marked with “?”). This permits a rough
prediction of the haplogroup of the unknown samples. It is obvious that this prediction only works
for haplogroups present in the reference data set, and such a prediction should ultimately be
confirmed by targeted SNP typing for an empirical determination of the haplogroup. If SNP typing is
done by SNaPshot assays [13], the PCA can help to choose the appropriate primer panel for the
sample.
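The use of PCA for outlier detection can be illustrated with a small, dependency-free Python sketch. Genotypes are encoded as allele copy counts, columns are centred, and the first principal axis is found by power iteration. This only demonstrates the principle; it is not the routine used by STRAF, the data are made up, and the sketch assumes complete genotypes and at least two distinct samples.

```python
import math

def allele_count_matrix(samples):
    """One row per sample, one column per (locus, allele); entries are copy counts."""
    columns = sorted({(loc, a) for s in samples for loc, g in s.items() for a in g})
    return [[s[loc].count(a) for loc, a in columns] for s in samples]

def first_pc_scores(matrix, iterations=200):
    """Scores of each row on the first principal component (power iteration)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n_rows for j in range(n_cols)]
    x = [[row[j] - means[j] for j in range(n_cols)] for row in matrix]
    # Start from the centred row with the largest norm; it lies in the row space.
    v = max(x, key=lambda row: sum(c * c for c in row))
    for _ in range(iterations):
        scores = [sum(r[j] * v[j] for j in range(n_cols)) for r in x]
        v = [sum(scores[i] * x[i][j] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return [sum(r[j] * v[j] for j in range(n_cols)) for r in x]

# Five similar made-up samples and one sample carrying entirely different alleles.
samples = [
    {"L1": ("10", "11"), "L2": ("14", "15")},
    {"L1": ("10", "11"), "L2": ("14", "14")},
    {"L1": ("10", "10"), "L2": ("14", "15")},
    {"L1": ("10", "11"), "L2": ("15", "15")},
    {"L1": ("11", "11"), "L2": ("14", "15")},
    {"L1": ("20", "21"), "L2": ("30", "31")},
]
scores = first_pc_scores(allele_count_matrix(samples))
outlier = max(range(len(scores)), key=lambda i: abs(scores[i]))
print(outlier)  # 5: the last sample lies far from the others on the first axis
```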
g. Linkage Disequilibrium
STRAF computes and represents estimates of linkage disequilibrium (LD) based on a T2 test [14]. The
p-values from the LD analysis are displayed in a table and plotted as a heatmap. The indicative color
coding depends on the strength of LD. The distribution of LD p-values can be checked, and a
Kolmogorov-Smirnov (KS) test [15, 16] can be performed to test whether this distribution is uniform. Low p-values
together with a non-uniform p-value histogram can indicate linkage disequilibrium. However, before
assuming linkage, we recommend performing a permutation test using other software packages such
as Arlequin [17]. Indeed, for infrastructure reasons (limitation of the computation time online), no
3. Concordance Test
The results generated with STRAF were verified using the raw data from a published forensic
genetics population study [1]. The forensic parameters (GD, PIC, PD, PE) that were calculated
with PowerStats in the study by Gehrig et al. correspond exactly to the values we calculated with
STRAF (data not shown). FST values per locus calculated with STRAF are also the same as in the
4. Limitations
STRAF is designed to be a user-friendly tool for standard analysis of population data in forensic
genetics. It can be used to easily calculate a variety of parameters useful for forensic practice. At the
same time it checks for statistical validity of the sampled data, by calculating different population
genetic parameters. STRAF has been developed to check data sets in forensic genetics prior to
publication or use in case work. It is not its purpose to provide in-depth analysis of genetic data for
genetic research projects, for which more powerful software such as Arlequin [17] or Genepop [3]
differentiation is not possible with STRAF. Finally, this software has been designed for the analysis of
STR markers, but it cannot currently handle other types of markers such as SNPs or INDELs
5. Further development
Suggestions for further development or bug reports are welcome and can be addressed to Alexandre
Gouy. If we notice that the forensic genetics community makes frequent use of STRAF, we will
consider further development and possibly provide more computing resources, e.g. to allow for a
6. Acknowledgments
M. Z. would like to thank Gerald Heckel and Laurent Excoffier for welcoming him in their lab as a
visiting scientist. Without this visit, this project would not have been started. We thank Christian
Gehrig for providing us with the raw data of their study, for validation of STRAF. Furthermore, we
thank Alexandre Thiéry for setting up the STRAF server. Finally, we thank Mirco Hecht for critical
References
[1] C. Gehrig, B. Balitzki, A. Kratzer, C. Cossu, N. Malik, V. Castella, Allelic proportions of 16 STR loci-
including the new European Standard Set (ESS) loci-in a Swiss population sample, Int J Legal Med
128(3) (2014) 461-5.
[2] R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.
[3] F. Rousset, genepop’007: a complete re-implementation of the genepop software for Windows
and Linux, Molecular ecology resources 8(1) (2008) 103-106.
[4] M. Nei, Molecular Evolutionary Genetics, Columbia University Press, New York, 1987.
[5] R.A. Fisher, Standard calculations for evaluating a blood-group system, Heredity (Edinb) 5(1)
(1951) 95-102.
[6] D. Botstein, R.L. White, M. Skolnick, R.W. Davis, Construction of a genetic linkage map in man
using restriction fragment length polymorphisms, Am J Hum Genet 32(3) (1980) 314-31.
[7] X. Guo, R.C. Elston, Linkage information content of polymorphic genetic markers, Human heredity
49(2) (1999) 112-8.
[8] C.H. Brenner, J.W. Morris, Paternity Index Calculations in Single Locus Hypervariable DNA Probes:
Validation and Other Studies, The International Symposium on Human Identification, Promega
Corporation, Madison, WI, USA, 1989, pp. 21-53.
[9] S. Wright, The Genetical Structure of Populations, Ann Eugen 15(4) (1951) 323-354.
[10] B.S. Weir, C.C. Cockerham, Estimating F-Statistics for the Analysis of Population-Structure,
Evolution 38(6) (1984) 1358-1370.
[11] S.W. Guo, E.A. Thompson, Performing the Exact Test of Hardy-Weinberg Proportion for Multiple
Alleles, Biometrics 48(2) (1992) 361-372.
[12] J. Novembre, M. Stephens, Interpreting principal component analyses of spatial population
genetic variation, Nat Genet 40(5) (2008) 646-649.
[13] M. Geppert, L. Roewer, SNaPshot® Minisequencing Analysis of Multiple Ancestry-Informative Y-
SNPs Using Capillary Electrophoresis, in: A. Alonso (Ed.), DNA Electrophoresis Protocols for Forensic
Genetics, Humana Press, Totowa, NJ, 2012, pp. 127-140.
[14] D.V. Zaykin, A. Pudovkin, B.S. Weir, Correlation-based inference for linkage disequilibrium with
multiple alleles, Genetics 180(1) (2008) 533-45.
[15] A. Kolmogorov, Sulla determinazione empirica di una legge di distribuzione, G. Ist. Ital. Attuari. 4
(1933) 83-91.
[16] N. Smirnov, Table for Estimating the Goodness of Fit of Empirical Distributions, Ann Math Statist
19 (1948) 279-281.
[17] L. Excoffier, H.E.L. Lischer, Arlequin suite ver 3.5: a new series of programs to perform
population genetics analyses under Linux and Windows, Molecular ecology resources 10(3) (2010)
564-7.
[18] J. Goudet, FSTAT (Version 1.2): A computer program to calculate F-statistics, Journal of Heredity
86(6) (1995) 485-486.
Figure 1: Example of PCA for quality control. A population data set with 4 defined subpopulations
(obviously with small genetic differences) typed at 21 autosomal STR loci has been submitted for
PCA. The first two axes are shown. The data set can be quickly checked for outliers, such as the one
marked in this example (arrow). The sample ID can be checked by clicking on the data point.
Figure 2: Example of PCA for haplogroup prediction. A data set of Y-STR haplotypes of known
haplogroups is submitted for PCA, together with a subset of haplotypes of unknown haplogroup,
labelled with “?”. Note that PCA can only give a rough estimate of the possible haplogroup, relative
to the samples with known haplogroups.