
Accepted Manuscript

Title: STRAF – a convenient online tool for STR data evaluation in forensic genetics

Authors: Alexandre Gouy, Martin Zieger

PII: S1872-4973(17)30149-7
DOI: http://dx.doi.org/doi:10.1016/j.fsigen.2017.07.007
Reference: FSIGEN 1748

To appear in: Forensic Science International: Genetics

Received date: 2-6-2017
Revised date: 12-7-2017
Accepted date: 13-7-2017

Please cite this article as: Alexandre Gouy, Martin Zieger, STRAF—a convenient
online tool for STR data evaluation in forensic genetics, Forensic Science International:
Genetics, http://dx.doi.org/10.1016/j.fsigen.2017.07.007

This is a PDF file of an unedited manuscript that has been accepted for publication.
As a service to our customers we are providing this early version of the manuscript.
The manuscript will undergo copyediting, typesetting, and review of the resulting proof
before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that
apply to the journal pertain.
STRAF – a convenient online tool for STR data evaluation in forensic genetics

Alexandre Gouy1,2

Alexandre.Gouy@iee.unibe.ch

Martin Zieger3 (corresponding author)

Martin.Zieger@irm.unibe.ch

Tel.: +41 31 631 31 59

1 Computational and Molecular Population Genetics, Institute of Ecology and Evolution, University of Bern, Baltzerstrasse 6, 3012 Bern, Switzerland

2 Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland

3 Institute of Forensic Medicine, Forensic Molecular Biology Dpt., University of Bern, Sulgenauweg 40, 3007 Bern, Switzerland

Highlights:

• STRAF (STR Analysis for Forensics) computes all the required standard statistics for
autosomal STR population data in forensic genetics.
• It is freely available, easy to use, requires a very simple input file and results can be
downloaded as convenient tables or displayed graphically online.
• STRAF includes a PCA module that can be used for quality control and population
substructure detection.

Abstract

Population data in forensic genetics has to be checked for a variety of statistical parameters before it

can be employed for case work. A lot of very powerful statistical tools are available for this task, most

of them developed by labs having their research focus on population genetics or evolution. However,

most of these programs require a substantial amount of experience. In addition, to our knowledge,

none of the freely available programs calculates all the common parameters for a population study in

forensic genetics at once, based on a single input file. We present here a convenient online tool that

fills this gap. STRAF (STR Analysis for Forensics) provides an intuitive interface and input file format

and computes all the relevant parameters for a classical population study based on autosomal STR

data at once and in a convenient way. In addition, STRAF includes a PCA module that can be used for

population substructure detection or quality control. The results generated by the program were

verified by recalculating parameters from an already published population study.

Keywords: population, data analysis, statistics, STR, software

1. Introduction

Quality assessment of population data is of prime importance in forensic genetics. It requires

tools specifically designed to compute summary statistics on STR data. To assess the utility of

software designed for forensic genetics population data analysis, we screened all publications that

appeared in 2016 in FSI Genetics, International Journal of Legal Medicine and Legal Medicine. We

found 40 studies presenting original autosomal STR population data. In 36 of them the authors

provide forensic parameters such as power of exclusion (PE), power of discrimination (PD) or

polymorphism information content (PIC). For 31 of those publications (86 %) the Excel Spreadsheet

PowerStats provided by Promega Corporation has been used to calculate the forensic parameters.

Despite its proven utility, this valuable contribution to forensic genetics data analysis also has a

variety of inconveniences. First, PowerStats does not support the analysis of complete data sets at

once because calculation of forensic parameters has to be performed locus by locus. Second, it is no

longer supported by the company (personal communication). Third, it provides only a limited variety

of statistical analyses, restricted to the parameters that are dedicated for use in forensics, but it does

not allow more general genetic analyses, such as the computation of F-statistics or testing for

Hardy-Weinberg equilibrium.

With STRAF (STR Analysis for Forensics), we provide the forensic genetics community with a freely

accessible online tool that can compute all the statistics necessary to check the validity of an

autosomal population data set at once, based on a single input file. In addition to the analysis of

autosomal STR data, the program can also handle haploid data such as Y-STR haplotypes. In the

following, we present the different features implemented in STRAF. We also validated the tool by

recalculating genetic parameters from an already published population study [1]. STRAF can be found

online at http://www.cmpg.iee.unibe.ch/services/shiny/.

2. Features of the software

STRAF is an R-based [2] online tool that brings together several previously existing statistical

calculations. The web application is based on the Shiny framework (https://shiny.rstudio.com/).

a. Input file format

STRAF accepts tab-delimited text (.txt) tables in several layouts. The first column must contain the

sample ID and the second the population ID (if several populations are studied). Conveniently for

the analysis of forensically relevant autosomal STR data, STRAF accepts point alleles. Allele data for

haploid samples is entered with one column per locus and for diploid data with two columns per

locus. Missing data (e.g. null alleles) must be indicated with a “0”. An example of a diploid input file is

provided online.
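As an illustration of the diploid layout described above, the following is a minimal Python sketch of how such a table could be read. It is not part of STRAF (which is written in R). The header names `ind` and `pop`, the repetition of the locus name over both allele columns, and the two sample rows are assumptions made for the example; the example file provided online remains the authoritative reference for the format.

```python
import csv
import io

def read_diploid_table(text):
    """Parse a tab-delimited diploid STR table.

    Assumed layout (header names are hypothetical): sample ID,
    population ID, then two allele columns per locus. "0" marks missing data.
    Returns (samples, genotypes) with genotypes as {locus: [(a1, a2) or None]}.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, records = rows[0], rows[1:]
    loci = header[2::2]  # every second column starts a new locus
    genotypes = {locus: [] for locus in loci}
    samples = []
    for rec in records:
        samples.append((rec[0], rec[1]))
        for k, locus in enumerate(loci):
            a1, a2 = rec[2 + 2 * k], rec[3 + 2 * k]
            # Alleles stay strings so that point alleles such as "9.3" survive.
            pair = None if "0" in (a1, a2) else (a1, a2)
            genotypes[locus].append(pair)
    return samples, genotypes

# Made-up two-sample example, for illustration only.
example = (
    "ind\tpop\tTH01\tTH01\tvWA\tvWA\n"
    "S1\tBern\t6\t9.3\t16\t17\n"
    "S2\tBern\t9.3\t9.3\t0\t0\n"
)
samples, genotypes = read_diploid_table(example)
```

Keeping alleles as text rather than numbers is a deliberate choice here: it avoids floating-point comparisons between point alleles and keeps the missing-data code "0" distinct from alleles such as "10".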

It is also possible to use a format similar to the Genepop [3] input format, with both alleles (for

diploid data) coded in one column, either with 2 or 3 digits. Note that in order to use this format, no

point alleles should be present in the data set.

The raw data can be displayed and checked on the website after import.

b. Allele frequency table

STRAF calculates the allele frequencies for all loci from the input file and returns a complete table in

a standard format that can be checked online and also downloaded as a TSV (tab-separated values)

file. The allele frequencies per locus can be plotted, to check for unexpected allele frequency

distributions. If the user wishes to save any displayed plot, all visualizations can be modified using the

“graphical parameters” section.
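The allele frequency calculation itself amounts to counting gene copies per locus. The Python sketch below shows one way to do this for a single locus; the function name and the example genotypes are hypothetical, and the handling of missing genotypes (skipped entirely) is an assumption rather than a description of STRAF's internals.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} for one locus.

    genotypes is a list of (a1, a2) pairs; missing genotypes are
    passed as None and skipped (an assumption of this sketch).
    """
    counts = Counter()
    for pair in genotypes:
        if pair is not None:
            counts.update(pair)
    total = sum(counts.values())  # number of gene copies observed
    return {allele: c / total for allele, c in counts.items()}

# Made-up genotypes for one locus, including one missing entry.
th01 = [("6", "9.3"), ("9.3", "9.3"), ("6", "7"), None]
freqs = allele_frequencies(th01)
```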

c. Forensic parameters

STRAF calculates the following parameters relevant for forensic practice, based on the given

formulas:

Genetic diversity (GD), or expected heterozygosity (H_exp), is computed using its unbiased estimator

[4]:
$$H_{\mathrm{exp}} = \mathrm{GD} = \frac{n}{n-1}\left(1 - \sum_{i} p_i^2\right),$$

where n is the number of gene copies sampled, p_i is the frequency of the ith allele in the

population, and the sum runs over all alleles observed at the locus.
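A short worked example may help to fix the notation. The Python sketch below evaluates the estimator from raw allele counts, taking n as the total number of gene copies; the function name and the counts are hypothetical.

```python
def genetic_diversity(allele_counts):
    """Unbiased expected heterozygosity: n/(n-1) * (1 - sum of p_i^2).

    allele_counts maps each allele to the number of gene copies observed.
    """
    n = sum(allele_counts.values())
    sum_p2 = sum((c / n) ** 2 for c in allele_counts.values())
    return n / (n - 1) * (1 - sum_p2)

# Made-up counts: 6 gene copies, i.e. 3 diploid individuals.
gd = genetic_diversity({"6": 2, "7": 1, "9.3": 3})
```

With these counts, the sum of squared frequencies is 7/18, so GD = (6/5)(11/18) = 11/15.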

The match probability (PM) is defined as the probability of a match between two unrelated

individuals and is calculated as [5]

$$PM = \sum_{i} G_i^2,$$

where G_i is the frequency of the genotype i at a given locus in the population.

The power of discrimination (PD) is defined as the probability of discriminating between two

unrelated individuals

$$PD = 1 - PM.$$
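Both quantities follow directly from genotype counts, as the following Python sketch illustrates. The function name and the four genotypes are hypothetical; the only design decision is that the two alleles of a genotype are treated as unordered.

```python
from collections import Counter

def match_probability(genotypes):
    """PM: sum of squared genotype frequencies at one locus."""
    # Sort each pair so that (6, 9.3) and (9.3, 6) count as the same genotype.
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    total = sum(counts.values())
    return sum((c / total) ** 2 for c in counts.values())

# Made-up genotypes for illustration.
genos = [("6", "9.3"), ("9.3", "6"), ("9.3", "9.3"), ("6", "7")]
pm = match_probability(genos)
pd = 1 - pm
```

Here the genotype frequencies are 1/2, 1/4 and 1/4, giving PM = 0.375 and PD = 0.625.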

Polymorphism Information Content (PIC) can be interpreted as the probability that the maternal and

paternal alleles of a child are deducible or the probability of being able to deduce which allele a

parent has transmitted to the child [6, 7]

$$PIC = 1 - \sum_{i=1}^{n} p_i^2 - \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} 2 p_i^2 p_j^2,$$

where p_i and p_j are allele frequencies and n is here the number of alleles at the locus.
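The double sum runs over all unordered pairs of distinct alleles, which can be written compactly in code. The Python sketch below is an illustration only; the function name is hypothetical and the input is assumed to be a list of allele frequencies summing to one.

```python
from itertools import combinations

def pic(freqs):
    """Polymorphism information content from a list of allele frequencies."""
    p = list(freqs)
    homozygosity = sum(x ** 2 for x in p)
    # combinations(p, 2) yields each pair (i, j) with i < j exactly once.
    cross = sum(2 * (x ** 2) * (y ** 2) for x, y in combinations(p, 2))
    return 1 - homozygosity - cross

value = pic([0.5, 0.5])
```

For two equally frequent alleles this gives 1 − 0.5 − 0.125 = 0.375, the maximum PIC of a biallelic marker.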

The power of exclusion (PE) is defined as [8]

$$PE = h^2\left(1 - 2hH^2\right),$$

where h is the proportion of heterozygous individuals and H the proportion of homozygous

individuals in the population sample.

Finally, the typical paternity index (TPI) reflects the “mean PI for random non-excluded men” (see

also [8]) for a given locus. Let H be the frequency of homozygotes, then

$$TPI = \frac{1}{2H}.$$
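PE and TPI both depend only on the observed proportions of heterozygous and homozygous individuals. The Python sketch below evaluates them for a hypothetical locus with h = 0.8. It assumes that h and H sum to one (no missing genotypes), and the choice to return infinity when no homozygotes are observed is a convention of this sketch, not something specified in the text.

```python
def power_of_exclusion(h):
    """PE = h^2 * (1 - 2*h*H^2), with H = 1 - h (assumed)."""
    H = 1.0 - h
    return h ** 2 * (1.0 - 2.0 * h * H ** 2)

def typical_paternity_index(h):
    """TPI = 1 / (2H); not defined when no homozygotes were observed."""
    H = 1.0 - h
    if H == 0:
        return float("inf")  # convention chosen for this sketch
    return 1.0 / (2.0 * H)

# Hypothetical locus with 80 % heterozygous individuals.
pe = power_of_exclusion(0.8)
tpi = typical_paternity_index(0.8)
```

For h = 0.8, PE = 0.64 × (1 − 0.064) ≈ 0.599 and TPI = 1/0.4 = 2.5.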

Again, a graphical representation of all parameters for every locus is possible. Such a quick overview

can be useful, e.g. to check whether the observed distribution of PE over all loci fits expectations.

d. F statistics

For files containing different subpopulations, FST and FIS per locus can be calculated [9, 10]. Pairwise

FST values between subpopulations are also computed over all loci. Based on the F statistics, the user

can decide to treat all subpopulations separately or as a single population, depending on their degree

of genetic similarity.
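To give an intuition for what FST measures, the Python sketch below computes the simple heterozygosity-based form FST = (H_T − H_S)/H_T for one locus. This is deliberately not the Weir and Cockerham estimator [10]: it ignores sample-size corrections and assumes subpopulations of equal size, and all names in it are hypothetical.

```python
def expected_het(freqs):
    """Expected heterozygosity 1 - sum(p^2), without small-sample correction."""
    return 1 - sum(p ** 2 for p in freqs.values())

def simple_fst(subpop_freqs):
    """Heterozygosity-based FST for one locus.

    subpop_freqs is a list of {allele: frequency} dicts, one per
    subpopulation; equal subpopulation sizes are assumed.
    Not defined if the pooled sample is monomorphic (H_T = 0).
    """
    k = len(subpop_freqs)
    alleles = set().union(*subpop_freqs)
    pooled = {a: sum(f.get(a, 0.0) for f in subpop_freqs) / k for a in alleles}
    h_s = sum(expected_het(f) for f in subpop_freqs) / k  # mean within-subpopulation
    h_t = expected_het(pooled)                            # total
    return (h_t - h_s) / h_t

fixed_differently = simple_fst([{"A": 1.0}, {"B": 1.0}])
identical = simple_fst([{"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5}])
```

Two subpopulations fixed for different alleles give FST = 1, and two subpopulations with identical frequencies give FST = 0, which matches the interpretation of FST as a measure of genetic differentiation.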

e. Test for Hardy-Weinberg equilibrium

STRAF can also test whether the population sample is in Hardy-Weinberg equilibrium

(HWE). The test for HWE consists in computing the p-value of an exact test based on 1,000 Monte

Carlo permutations of alleles [11]. As a rough rule of thumb, we usually assume HWE if p-values for

deviation from HWE are not below 0.05.
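The principle of such a Monte Carlo test can be sketched in a few lines of Python. The code below shuffles the alleles of one locus, re-pairs them into genotypes, and counts how often the shuffled configuration is at least as improbable under HWE as the observed one. It is written in the spirit of [11] but is not necessarily the implementation used by STRAF; the function names, the seed and the example samples are hypothetical.

```python
import math
import random
from collections import Counter

def log_config_weight(genotypes):
    """Log of the part of the HWE configuration probability that varies
    when allele counts are fixed: 2^(number of heterozygotes) / prod(n_g!)."""
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    het = sum(c for (a, b), c in counts.items() if a != b)
    return het * math.log(2) - sum(math.lgamma(c + 1) for c in counts.values())

def hwe_monte_carlo(genotypes, n_perm=1000, seed=0):
    """Monte Carlo p-value for deviation from HWE at one locus."""
    rng = random.Random(seed)
    observed = log_config_weight(genotypes)
    alleles = [a for g in genotypes for a in g]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(alleles)
        shuffled = list(zip(alleles[0::2], alleles[1::2]))
        # Count permutations at least as improbable as the observed sample
        # (small tolerance guards against floating-point rounding).
        if log_config_weight(shuffled) <= observed + 1e-9:
            hits += 1
    return hits / n_perm

# Made-up samples: one in perfect HWE proportions, one without heterozygotes.
balanced = [("A", "A")] * 5 + [("A", "B")] * 10 + [("B", "B")] * 5
extreme = [("A", "A")] * 10 + [("B", "B")] * 10
p_balanced = hwe_monte_carlo(balanced)
p_extreme = hwe_monte_carlo(extreme)
```

The balanced sample is the most probable configuration for its allele counts, so every permutation is at least as improbable and the p-value is 1. The sample without heterozygotes is the least probable configuration and yields a p-value far below 0.05.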

f. Principal component analysis

Principal component analysis (PCA) allows for the detection of possible genotype clusters that might

be caused by population substructure [12]. For forensic population studies based on autosomal STR

data, we usually expect a rather homogeneous distribution, i.e. observing a single cluster on the PCA

due to a lack of population structure. This expectation might be useful for quality control of the data.

In data sets that are supposed to cluster homogeneously, outliers can be identified on the PCA plot,

facilitating a check for possible errors through data analysis or sampling (e.g. samples resulting from

undeclared migration events or sampling errors). Figure 1 shows an example of the detection of a

clear outlier using a PCA based on a supposedly homogeneous data set. Checking back on the

corresponding electropherograms confirmed that the genotype of this sample has been assigned

correctly. Therefore, in this case, the outlier could be due to a migration event that has not been

declared upon sampling.

PCA can also be used for substructure analysis of haploid population data. Figure 2 shows an example

of a PCA performed on Y-STR data. It has been performed using STR haplotypes of known

haplogroup, as well as samples of undefined haplogroup (marked with “?”). This permits a rough

prediction of the haplogroup of the unknown samples. It is obvious that this prediction only works

for haplogroups present in the reference data set, and such a prediction should ultimately be

confirmed by targeted SNP typing for an empirical determination of the haplogroup. If SNP typing is

done by SNaPshot assays [13], the PCA can help to choose the appropriate primer panel for the

sample.

g. Linkage Disequilibrium

STRAF computes and represents estimates of linkage disequilibrium (LD) based on a T2 test [14]. The

p-values from the LD analysis are displayed in a table and plotted as a heatmap. The indicative color

coding depends on the strength of LD. The distribution of LD p-values can be checked and a

Kolmogorov-Smirnov [15, 16] test (KS) can be performed to test if this distribution is uniform. Low p-values

together with a non-uniform p-value histogram can indicate linkage disequilibrium. However, before

assuming linkage, we recommend performing a permutation test using other software packages such

as Arlequin [17]. Indeed, for infrastructure reasons (limitation of the computation time online), no

permutation test for LD analysis is yet implemented in STRAF.
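The uniformity check on the LD p-values can be illustrated with a small Python sketch of the one-sample KS statistic against the uniform distribution on [0, 1]. The function names and example values are hypothetical, and the 5 % decision rule uses the standard large-sample approximation D > 1.358/√n, which is only a rough guide when few locus pairs are tested.

```python
import math

def ks_uniform_statistic(p_values):
    """Kolmogorov-Smirnov distance between the empirical distribution
    of the p-values and the uniform distribution on [0, 1]."""
    xs = sorted(p_values)
    n = len(xs)
    # Largest gap above and below the uniform CDF, F(x) = x.
    d_plus = max((i + 1) / n - x for i, x in enumerate(xs))
    d_minus = max(x - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

def exceeds_asymptotic_5pct(p_values):
    """Large-sample rule of thumb: reject uniformity at the 5 % level
    when D > 1.358 / sqrt(n). Only an approximation for small n."""
    d = ks_uniform_statistic(p_values)
    return d > 1.358 / math.sqrt(len(p_values))

# Made-up p-values: evenly spread versus piled up near zero.
evenly_spread = [0.1, 0.3, 0.5, 0.7, 0.9]
piled_up = [0.01] * 50
```

The evenly spread values give D = 0.1 and no rejection, whereas fifty p-values of 0.01 give D = 0.99 and a clear rejection, the pattern that would prompt a follow-up permutation test.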

3. Concordance Test

The results generated with STRAF were verified by using the raw data from a published forensic

genetics population study [1]. The forensic parameters (GD, PIC, PD, PE) that have been calculated

with PowerStats in the study from Gehrig et al. correspond exactly to the values we calculated with

STRAF (data not shown). FST values per locus calculated with STRAF are also the same as in the

original study, computed using FSTAT [18].

4. Limitations

STRAF is designed to be a user-friendly tool for standard analysis of population data in forensic

genetics. It can be used to easily calculate a variety of parameters useful for forensic practice. At the

same time it checks for statistical validity of the sampled data, by calculating different population

genetic parameters. STRAF has been developed to check data sets in forensic genetics prior to

publication or use in case work. It is not its purpose to provide in-depth analysis of genetic data for

genetic research projects, for which more powerful software such as Arlequin [17] or Genepop [3]

should be considered. As an example, statistically testing the significance of population

differentiation is not possible with STRAF. Finally, this software has been designed for the analysis of

STR markers, but it cannot currently handle other types of markers such as SNPs or INDELs

(insertions and deletions).

5. Further development

Suggestions for further development or bug reports are welcome and can be addressed to Alexandre

Gouy. If we notice that the forensic genetics community makes frequent use of STRAF we will

consider further development and possibly provide more computing resources, e.g. to allow for a

permutation test of LD or the analysis of other types of genetic markers.

6. Acknowledgments

M. Z. would like to thank Gerald Heckel and Laurent Excoffier for welcoming him in their lab as a

visiting scientist. Without this visit, this project would not have been started. We thank Christian

Gehrig for providing us with the raw data of their study, for validation of STRAF. Furthermore, we

thank Alexandre Thiéry for setting up the STRAF server. Finally, we thank Mirco Hecht for critical

reading of the manuscript.

References

[1] C. Gehrig, B. Balitzki, A. Kratzer, C. Cossu, N. Malik, V. Castella, Allelic proportions of 16 STR loci –
including the new European Standard Set (ESS) loci – in a Swiss population sample, Int J Legal Med
128(3) (2014) 461-5.
[2] R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for
Statistical Computing, Vienna, Austria. URL: https://www.R-project.org/.
[3] F. Rousset, genepop’007: a complete re-implementation of the genepop software for Windows
and Linux, Molecular ecology resources 8(1) (2008) 103-106.
[4] M. Nei, Molecular Evolutionary Genetics, Columbia University Press, New York, 1987.
[5] R.A. Fisher, Standard calculations for evaluating a blood-group system, Heredity (Edinb) 5(1)
(1951) 95-102.
[6] D. Botstein, R.L. White, M. Skolnick, R.W. Davis, Construction of a genetic linkage map in man
using restriction fragment length polymorphisms, Am J Hum Genet 32(3) (1980) 314-31.
[7] X. Guo, R.C. Elston, Linkage information content of polymorphic genetic markers, Human heredity
49(2) (1999) 112-8.
[8] C.H. Brenner, J.W. Morris, Paternity Index Calculations in Single Locus Hypervariable DNA Probes:
Validation and Other Studies, The International Symposium on Human Identification, Promega
Corporation, Madison, WI, USA, 1989, pp. 21-53.
[9] S. Wright, The Genetical Structure of Populations, Ann Eugen 15(4) (1951) 323-354.
[10] B.S. Weir, C.C. Cockerham, Estimating F-Statistics for the Analysis of Population-Structure,
Evolution 38(6) (1984) 1358-1370.
[11] S.W. Guo, E.A. Thompson, Performing the Exact Test of Hardy-Weinberg Proportion for Multiple
Alleles, Biometrics 48(2) (1992) 361-372.
[12] J. Novembre, M. Stephens, Interpreting principal component analyses of spatial population
genetic variation, Nat Genet 40(5) (2008) 646-649.
[13] M. Geppert, L. Roewer, SNaPshot® Minisequencing Analysis of Multiple Ancestry-Informative Y-
SNPs Using Capillary Electrophoresis, in: A. Alonso (Ed.), DNA Electrophoresis Protocols for Forensic
Genetics, Humana Press, Totowa, NJ, 2012, pp. 127-140.
[14] D.V. Zaykin, A. Pudovkin, B.S. Weir, Correlation-based inference for linkage disequilibrium with
multiple alleles, Genetics 180(1) (2008) 533-45.
[15] A. Kolmogorov, Sulla determinazione empirica di una legge di distribuzione, G. Ist. Ital. Attuari. 4
(1933) 83-91.
[16] N. Smirnov, Table for Estimating the Goodness of Fit of Empirical Distributions, Ann Math Statist
19 (1948) 279-281.
[17] L. Excoffier, H.E.L. Lischer, Arlequin suite ver 3.5: a new series of programs to perform
population genetics analyses under Linux and Windows, Molecular ecology resources 10(3) (2010)
564-7.
[18] J. Goudet, FSTAT (Version 1.2): A computer program to calculate F-statistics, Journal of Heredity
86(6) (1995) 485-486.

Figure 1: Example of PCA for quality control. A population data set with 4 defined subpopulations
(obviously with small genetic differences) typed at 21 autosomal STR loci has been submitted for
PCA. The first two axes are shown. The data set can be quickly checked for outliers, as in
this example (arrow). The sample ID can be checked by clicking on the data point.

Figure 2: Example of PCA for haplogroup prediction. A data set of Y-STR haplotypes of known
haplogroups is submitted for PCA, together with a subset of haplotypes of unknown haplogroup,
labelled with “?”. Note that PCA can only give a rough estimate of the possible haplogroup, relative
to the samples with known haplogroups.
