Population Substructure

POPULATION SUBSTRUCTURE AND ITS IMPACT ON GENOME-WIDE
ASSOCIATION STUDIES WITH ADMIXED POPULATIONS
by
Jinghua Liu
____________________________________________________________________
A Dissertation Presented to the

FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(STATISTICAL GENETICS AND GENETIC EPIDEMIOLOGY)
August 2012
Copyright 2012
Jinghua Liu
Table of Contents
List of Tables
List of Figures
Abstract
Chapter 1 Introduction
1.1 Use of admixed populations for genetic association studies
1.2 Degree of admixture in different populations
1.2.1 Migration history
1.2.2 Cryptic structure
1.2.3 Self-reported ancestry, global ancestry, and local ancestry
1.3 Background of the Hispanic population
1.3.1 Background
1.3.2 Population substructures identified among Hispanic samples
1.3.3 Asthma among Hispanics
1.3.4 Potential confounding of genetic association studies among Hispanics
1.4 Methods for control of confounding
1.4.1 EIGENSTRAT
1.4.2 STRUCTURE & ADMIXTURE
1.4.3 HAPMIX & LAMP
1.5 Remaining challenges for genetic association studies among Hispanics
1.6 Introduction to graphical modeling
Chapter 2 Population substructures among Hispanics
2.1 The USC Children's Health Study (CHS)
2.1.1 Samples & markers
2.1.2 Potential confounding by population substructure observed from previous studies
2.2 Ancestry informative markers
2.3 HapMap III populations
2.3.1 Diverse ethnic populations
2.3.2 Global ancestry estimator: EIGENSTRAT, STRUCTURE & ADMIXTURE
2.3.3 Local ancestry estimates among HapMap MEX samples
2.3.4 Comparing methods
2.4 Population substructure among CHS samples
2.4.1 Global ancestry estimates
2.4.2 Local ancestry estimates
Chapter 3 Confounding and Heterogeneity in Genetic Association Studies with Admixed Populations
3.1 Introduction
3.2 Materials and Methods
3.2.1 Graphical Model
3.2.2 Regression models
3.2.3 Simulations
3.2.4 Scenarios
3.2.5 USC Children's Health Study (CHS)
3.3 Results
3.3.1 Simulation result
3.3.2 Results from the Children's Health Study
3.4 Discussion
Chapter 4 Mapping by admixture linkage disequilibrium
4.1 Introduction
4.1.1 Concept for admixture mapping
4.1.2 Testing for excess ancestry proportions
4.1.3 Advantages of admixture mapping
4.1.4 Purpose of this study
4.2 Materials and Methods
4.2.1 Regression models
4.2.2 Simulation framework
4.2.3 Scenarios
4.2.4 Real data analysis among African Americans
4.3 Results
4.3.1 Simulation results
4.3.2 Real data analysis results
4.4 Discussion
Chapter 5 Summary
Chapter 6 Future Directions
Bibliography
List of Tables
Table 2-1 Estimated individual global European, Asian, Amerindian, and
African ancestry proportions across HapMap MEX samples from
different approaches.
Table 2-2 Pearson correlation of estimated individual global European
ancestry proportion between different approaches.
Table 2-3 Characteristics of different approaches for estimating individual
global and local ancestries.
Table 2-4 Estimated individual global European, Asian/Amerindian, and
African ancestry proportions among CHS samples through
STRUCTURE with K=6.
Table 3-1 Simulated scenarios A-C.
Table 3-2 P-value and effect estimate for selected markers across ethnic
groups and models.
Table 3-3 Investigation of heterogeneity for SNP rs10519951 in the
Children's Health Study combined samples.
Table 4-1 Type 1 error among models for admixture scan.
Table 4-2 Power among models for admixture scan using only ancestry
information.
Table 4-3 Power among models for admixture scan incorporating genotype
information.
Table 4-4 Analysis details for regions with great disagreement between
case-only and case-control analysis.
Table 4-5 Known Hits that replicated from the conventional model.
Table 4-6 Known Hits that are replicated only in models incorporating local
ancestry information.
Table 4-7 Effect size for SNPs and on chromosome 9.
List of Figures
Figure 1-1 Graphical model for the concept of confounding. Y represents
the trait of interest, GM represents the genotyped marker, and X
represents unknown environmental and genetic factors. Observed
variables are drawn as solid squares, and unobserved variables are
drawn as dashed circles. Association between variables is
represented by a solid line connecting the variables.
Figure 2-1 Clusters identified among HapMap III samples through
EIGENSTRAT.
Figure 2-2 Finer scale clusters among HapMap samples identified through
EIGENSTRAT within European, African, and Asian ancestry.
Figure 2-3 STRUCTURE results from analyzing only HapMap samples for K
from 2 to 7.
Figure 2-4 Ln likelihood of the data estimated from STRUCTURE with
K from 2 to 7 among HapMap samples.
Figure 2-5 ADMIXTURE results from analyzing only HapMap samples for K
from 2 to 7.
Figure 2-6 Cross validation errors of the data estimated from
ADMIXTURE with K from 2 to 10 among HapMap samples.
Figure 2-7 Comparison of estimated individual global ancestry among
HapMap samples from STRUCTURE, ADMIXTURE, and
EIGENSTRAT.
Figure 2-8 Estimated individual local ancestry on chromosome 22 from
HAPMIX and LAMP for three selected HapMap MEX samples.
Figure 2-9 Comparison of estimated individual global ancestry among
HapMap MEX samples from STRUCTURE, ADMIXTURE,
HAPMIX, and LAMP.
Figure 2-10 Clusters identified through EIGENSTRAT by analyzing
HapMap and CHS combined samples.
Figure 2-11 Ancestry proportions identified through STRUCTURE by
analyzing HapMap and CHS combined samples.
Figure 2-12 Individual global ancestry estimated from HAPMIX for CHS
samples.
Figure 3-1 (a) Potential confounding paths in genetic association studies
among admixed populations. Y represents the outcome of interest,
GM the SNP at a marker locus being tested for association, L the
individual local ancestry in the immediate neighborhood of the
marker locus, Q the individual global ancestry averaged through
L across the genome, X represents other causal factors, either
unmeasured environmental factors, or unmeasured causal loci
present across the genome, that may be associated with global
ancestry, and GL the immediate neighborhood of the marker locus
that is used to estimate individual local ancestry L. (b) Directions
of admixture LD and the LD in the parental populations.
Figure 3-2 Type I error rates with and without control for confounding at the
marker locus (GM) in scenario A.
Figure 3-3 Effect of adjustment by global and local ancestries on power in
scenario B.
Figure 3-4 Comparison of power in scenario C when there is heterogeneity
due to differential LD between ancestries.
Figure 3-5 Local ancestry along chromosome 4 for selected CHS Hispanic
samples. Global ancestry Q represents estimated European
ancestry proportion.
Figure 3-6 Q-Q plots for models (1)-(4) among combined samples.
Figure 3-7 Analysis results across models (4) and (5) for combined samples.
Figure 3-8 Plausibility of scenario B in the ENCODE regions.
Figure 3-9 Distribution of the D' difference between the CEU and the Asian
populations in the ENCODE regions.
Figure 4-1 Simulation framework for admixture mapping.
Figure 4-2 Simulation results across different LD levels among markers
with allele frequencies greater than 0.4 between populations.
Figure 4-3 Genome-wide admixture scan using the SNP association and the
models that use only local ancestry information.
Figure 4-4 Genome-wide admixture scan using the SNP association and the
models incorporating both genotype and ancestry information
(SUM, MIX, and CCcom).
Figure 4-5 Comparison of the performance between the proposed CCmix (2df)
model and the conventional model.
Figure 4-6 Comparison of the performance between the proposed CCcom
(2df) model and the SNP association analysis on the region on
chromosome 8.
Figure 4-7 Comparison of models CCcom (2df) and CCcom_GL (3df).
Figure 4-8 Comparison of models CCcom (2df) and CCcom_GL (3df) on
the region on chromosome 8.
Figure 4-9 Comparison of models CCcom (2df) and CCcom_GL (3df) on
the region on chromosome 9.
Figure 4-10 Comparison of model CCcom_GL (3df) with the SNP association
model.
Figure 5-1 Comparison of proposed models.
Figure 5-2 Changes in results between proposed models.
Figure 5-3 Histogram of changes in log10(p-value) when comparing
Y~G+L+GL+Q (3df) to the conventional model Y~G+Q.
Figure 5-4 Weighted changes in results between proposed models.
Abstract
Association studies among admixed populations pose many challenges. The purpose of
this study is to compare methods for ancestry estimation and to investigate the control
of confounding and the capture of heterogeneity in SNP effects through the use of individual
ancestries. In addition, a general regression framework is proposed to perform admixture
mapping for both case-only and case-control study designs among admixed populations.
For confounding and heterogeneity, simulation results indicate that 1) adjustment for
global ancestry can control for confounding; 2) additional adjustment for local ancestry
may increase power when the induced admixture LD is in the opposite direction to the
LD in the ancestral populations; and 3) the inclusion of a SNP by local ancestry interaction
term can increase power when there is substantial differential LD between ancestral
populations. Analysis of genome-wide data from the University of Southern
California's Children's Health Study of childhood asthma highlights rs10519951 (p=8.5E-7)
from the model with the interaction term, a SNP lacking any evidence of association
in the SNP association analysis (p=0.5). For admixture mapping, simulation and
real data analysis results among African Americans from the Multiethnic Cohort Study of
prostate cancer indicate that 1) case-only analysis suffers from spurious results in
regions with biased local ancestry estimation; 2) our proposed regression model yields
performance similar to that of existing methods; 3) it is more powerful to incorporate
genotype information for admixture mapping; and 4) it is more powerful to incorporate
a SNP by local ancestry interaction to capture the admixture signal and heterogeneity by
local ancestry simultaneously.
Chapter 1 Introduction
1.1 Use of admixed populations for genetic association studies
Genome-wide association studies (GWAS) have been relatively successful at identifying
numerous risk variants for many diseases and traits (Hindorff et al.). A majority of these
studies have been performed among individuals of European descent (Broderick et al.;
Gudbjartsson et al.; Hunter et al.; Saxena et al.; I. Tomlinson et al.; Zanke et al.; Bilguvar
et al.; Graham et al.; Liu et al.; Tenesa et al.; I. P. Tomlinson et al.; Wallace et al.;
Benyamin et al.; Hicks et al.; H. Kim et al.; Kottgen et al.; Landi et al.; Ma et al.; Org et
al.; Pillai et al.; Simon-Sanchez, Schulte, et al.; Song et al.; Xiong et al.; Arking et al.;
Birlea et al.; Chalasani et al.; Eijgelsheim et al.; Hor et al.; Lascorz et al.; Tan et al.;
Lessard et al.; O'Seaghdha et al.; Ryu et al.; Van Laer et al.; Voight et al.; Boger et al.; S.
Kim et al.; McKay et al.; Panoutsopoulou et al.; Reilly et al.; Simon-Sanchez, van Hilten,
et al.; K. Wang et al.; Wijsman et al.).
GWAS within European populations are convenient for sample collection, and
these studies are also free of major population structure and heterogeneity. However, the
field is gradually moving towards GWAS in more diverse ethnic populations (Rosenberg
et al.).
This is mainly due to the desire to generalize genetic findings to other
populations, the belief that variation in local LD structure across populations can assist in
localization of the putative causal variants, and the belief that the limited genetic
variation within individuals of European ancestry will not be sufficient to find all
underlying variants for each disease (Cooper, Tayo and Zhu; Haiman and Stram). Simulation
studies have shown that association studies using a broader range of
populations, with more or different genetic variation as well as differences in disease
prevalence, have additional power for discovery (Pulit, Voight and de Bakker). To date,
there have been several reported GWAS among relatively homogeneous non-Caucasian
populations such as Chinese (Garcia-Barcelo, Tang, et al.; Garcia-Barcelo, Yeung, et al.;
Shu et al.; Guo et al.; Kung et al.; Han et al.; Lei et al.; Ng et al.; Tse et al.; Zhang et al.;
Chen et al.; Tsai et al.; F. Wang et al.), Japanese (Hattori et al.; Hiura, Shen, et al.; Satake
et al.; Hiura, Tabara, et al.; Kamatani et al.; Tanaka et al.; Unoki et al.; Yamada et al.;
Yasuda et al.; Low et al.; Cui et al.; Kumar et al.), and Korean (Cho et al.; Yoon et al.; J.
J. Kim et al.).
Furthermore, there have been GWAS among admixed populations such as
African Americans (Bostrom et al.; Charles et al.; Lettre et al.) and Hispanics (Hayes et
al.; Norris et al.; Rich et al.; Palmer et al.). The advantage of conducting GWAS among
admixed populations over homogeneous populations is the extended linkage
disequilibrium (LD) along the genome (Bonilla et al.; Gonzalez Burchard et al.).
1.2 Degree of admixture in different populations

1.2.1 Migration history
Modern humans originated in Africa, and the expansion from Africa to Asia, more
specifically to the Middle East, occurred about 100,000 years ago (Cavalli-Sforza,
Menozzi and Piazza). The Middle East is considered the center where modern humans
appeared and spread to other parts of the world. It is hypothesized that about 60,000 years
ago, East Asia was reached from the Middle East through two routes, one through Central
Asia and the other through South Asia. The expansion from Southeast Asia to Australia
happened about 55,000 years ago. The expansion to Europe occurred about 35,000 years
ago from West Asia.
There is strong evidence that the expansion from
Northeast Asia to America, more specifically from Siberia to Alaska, happened between
35,000 and 15,000 years ago by way of the Bering Strait (Fagan). Based on linguistic,
dental, and genetic information, there were three major migrations from Siberia to America
(Cavalli-Sforza, Menozzi and Piazza): (1) a first migration, followed by a rapid
occupation of the whole continent by the Amerinds; (2) a second migration, named after
the Na-Dene family, which mainly settled in southern Alaska and on the northwestern coast of
North America; and (3) a third migration of the Eskimo-Aleut, who occupied Alaska,
the northern coast of North America, and the Aleutian Islands. It is possible that the
Na-Dene and the Eskimo-Aleut had common origins in Asia (Cavalli-Sforza, Menozzi and
Piazza).
The first wave of migration formed the native populations (Amerindians) living in
the Americas prior to the arrival of Columbus. Columbus's arrival in the Caribbean in
1492 brought European and African ancestries into America and marks the formation of
Latino (Hispanic) populations in the region (Gonzalez Burchard et al.).
1.2.2 Cryptic structure
Populations that have been geographically distant from each other for a long time form a broad
range of ethnicities and tend to have different allele frequencies. Among
such populations, for example Caucasians and Asians, ethnicity provides a good measure
of genetic homogeneity.
There is also cryptic structure among admixed populations that carry genes from
two or more ancestries. When two geographically isolated populations come together,
genes flow in small amounts each generation, and there is infusion of individuals from one
population into the other (Cavalli-Sforza, Menozzi and Piazza). For example, African
Americans originate from the admixture of Caucasians and Africans, while Hispanics
originate from the admixture of Caucasians, Africans, and Amerindians. Individuals
from an admixed population have diverse ancestry compositions.
1.2.3 Self-reported ancestry, global ancestry, and local ancestry
Individuals can self-identify into a single ethnic category and one or more race groups.
Ethnicity indicates one's geography, nationality, or the country of birth of the person's
ancestors, whereas race is more of an indicator of the person's genetic background. These
two together form the self-reported ancestry of an individual. Individual global ancestry
is the proportion of the genome an individual received from each ancestral population. For
individuals from major continental populations, self-reported ancestry would be expected
to agree closely with global ancestry. Another ancestry indicator, individual local ancestry,
carries additional ancestry information for admixed individuals. Genetic admixture
occurs when individuals from previously distinct populations interbreed. After several
generations, the genomes of the individuals in an admixed population become a mosaic
composed of chromosomal segments originating from each of the ancestral populations.
The ancestral origin of these segments is then referred to as local ancestry.
Depending on their particular genealogical history, each admixed individual will have
different proportions of chromosomal segments throughout their genome originating
from each of the ancestral populations. The average of these proportions across the
genome is the global ancestry defined above. This individual and localized
admixture is due to genetic drift, population admixture, local chromosomal structure, and
mating patterns (Salanti, Sanderson and Higgins).
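The relationship between local and global ancestry described above can be illustrated with a minimal sketch. It assumes that local ancestry has already been called at a set of equally weighted loci, with one ancestry label per haplotype; the function name, the labels (EUR, AMR, AFR), and the toy data are hypothetical and are not taken from any of the software discussed in this dissertation. In practice, segments would usually be weighted by their genetic or physical length.

```python
from collections import Counter


def global_ancestry(local_ancestry):
    """Average local ancestry calls into global ancestry proportions.

    local_ancestry: list of (haplotype1_label, haplotype2_label) tuples,
    one tuple per locus. Loci are weighted equally (an assumption).
    """
    counts = Counter()
    for hap1, hap2 in local_ancestry:
        counts[hap1] += 1
        counts[hap2] += 1
    total = 2 * len(local_ancestry)  # two haplotypes per locus
    return {pop: n / total for pop, n in counts.items()}


# Hypothetical admixed individual with local ancestry called at five loci.
toy_individual = [
    ("EUR", "EUR"),
    ("EUR", "AMR"),
    ("AMR", "AMR"),
    ("EUR", "AFR"),
    ("EUR", "AMR"),
]
proportions = global_ancestry(toy_individual)
```

Two individuals with the same global proportions can still differ at every locus, which is why local ancestry carries information that global ancestry does not.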
Self-reported ancestry can be a proxy for unmeasured environmental factors.
For example, subgroups of the Hispanic population, such as Hispanic White vs. Hispanic Black,
could capture the diversity of environmental exposure, culture, lifestyle, and
socioeconomic status among the Hispanic origin subgroups. These environmental factors
may or may not differ between individuals with different genetic composition (e.g.
difference in global or local ancestry composition). However, self-reported ancestry is
not an accurate proxy for an individual's genetic ancestry, especially for admixed
individuals. Individual global ancestry, in most cases estimated from
genotype data, is a better indicator of the individual's genome composition. It captures the
cryptic structure within admixed populations and any environmental and genetic
divergence that accompanies the underlying substructure. In contrast, individual local ancestry
more likely reflects the recombination events since the initial admixture, and is an
accurate locus-specific ancestry indicator. It captures the diversity in LD structure across
the genome for each individual as well as the extended LD along the region for admixed
populations.
1.3 Background of the Hispanic population

1.3.1 Background
Hispanics are the largest and fastest growing minority population in the United States.
The Hispanic population accounted for 13% of the nation's total population in 2000 and
grew by 43% between 2000 and 2010, accounting for over half of the increase in the total
population over that period [U.S. Census Bureau. 2010]. In 2010 there were 50.5 million
Hispanics in the United States, 16% of the nation's total population [U.S.
Census Bureau. 2010].
According to the 2010 U.S. Census, Hispanic origin is the heritage or nationality
group of the individual's ancestors before their arrival in the United States. Hispanic or
Latino refers to a person of Cuban, Mexican, Puerto Rican, South or Central American,
or other Spanish culture or origin regardless of race [U.S. Census Bureau. 2010]. The
three major subgroups of Hispanic origin recorded in the 2010 census are (a) Mexican
or Mexican American, (b) Puerto Rican, and (c) Cuban. Individuals of Hispanic origin
can be of any race group, e.g. White, Black, or Asian.
As the migration history of the Americas shows, Hispanics are mainly an admixed
population of European, Amerindian, and African ancestries (Bertoni et al.;
Choudhry, Coyle, et al.; Choudhry, Seibold, et al.; Lee et al.; Salari et al.). The three
major Hispanic origin subgroups are distributed differently across the country, with Mexicans
mainly clustering in the Southwest and Chicago, Puerto Ricans in the Northeast, and
Cubans in Florida (Denavas and Hall). This uneven distribution of Hispanic origins, together
with marital behaviors in different areas, results in genetic heterogeneity (Bertoni et al.) as
well as diversity in environmental and socioeconomic factors among Hispanic
populations (Reibman and Liu).
1.3.2 Population substructures identified among Hispanic samples
Hispanics are an admixed population with mainly European, Amerindian, and African
ancestries. The average ancestry proportions of Hispanics differ across regions.
Hispanics from the San Luis Valley in southern Colorado have about 62.7%
European, 34.1% Amerindian, and 3.2% African ancestry (Bonilla et al.). Puerto
Ricans have 59.7% European, 19.1% Amerindian, and 21.3% African ancestry
(Choudhry, Coyle, et al.). Hispanics from Los Angeles have on average 48% European,
44% Amerindian, and 8% African ancestry (Price, Patterson, Yu, et al.), while
Hispanics residing in 6 census tracts of Los Angeles County from the Latino Eye
Study have 40.1% European, 45.2% Amerindian, and 4.9% African ancestry (Shtir et
al.). Hispanics from the San Francisco Bay Area, California have 45.4% European,
51.0% Amerindian, and 3.7% African ancestry (Choudhry, Coyle, et al.). Hispanic
samples collected in New York City are a mixture of 29.2% European,
44.8% Amerindian, and 26.0% African ancestries (Lee et al.). In general, Hispanics
of Puerto Rican and Cuban origin tend to have a greater proportion of African ancestry
and a lower proportion of Amerindian ancestry than Mexican Americans
(Reibman and Liu).
1.3.3 Asthma among Hispanics
Puerto Ricans have the highest asthma prevalence and mortality in the United States,
while Mexican Americans have the lowest (Carter-Pokras and Gergen; Freeman,
Schneider and McGarvey; Homa, Mannino and Lara; Choudhry, Coyle, et al.; Salari et
al.). Results from a study of Mexican and Puerto Rican parent-child trios support
the LTA4H and ALOX5AP genes as risk factors for asthma in Hispanic populations (Via
et al.). A GWAS among Puerto Rican samples identified 5q23 as a susceptibility region for
asthma (Choudhry, Taub, et al.). (will add more reference and established findings)
Exposures associated with asthma diagnosis include environmental tobacco smoke and the
presence of dampness/mold, roaches, and furry pets in the home (Freeman, Schneider and
McGarvey). Ancestry by environment interactions (e.g. SES) modify the risk of asthma
among Hispanics (Choudhry, Seibold, et al.). Environmental factors associated with
asthma also differ among Hispanic subgroups; e.g., bathroom mold and roaches were
significantly associated with asthma in Puerto Ricans but not in Mexicans or
Dominicans (Freeman, Schneider and McGarvey).
1.3.4 Potential confounding of genetic association studies among Hispanics
A standard design for genetic association studies entails selecting cases and unrelated
population-based controls that are representative of the source population that gives rise
to the cases (Devlin and Roeder). When the sample being studied is drawn from a
population consisting of sub-populations with varying rates of disease, cases will be more
likely than randomly-selected controls to arise from the sub-populations with the higher
rates of disease. Furthermore, for any genetic locus at which allele frequencies differ
among the sub-populations, spurious associations will be induced using the standard
case-control analysis, resulting in false positive or negative results (Thomas and Witte).
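The mechanism described above can be illustrated with a small numerical sketch. The two sub-populations, their sizes, disease risks, and carrier frequencies are hypothetical values chosen only for illustration; they are not taken from any of the cited studies. Within each sub-population the variant is assumed to have no effect on disease, and the sketch compares the within-stratum odds ratios with the odds ratio obtained when the sub-populations are pooled.

```python
def odds_ratio(case_carriers, cases, control_carriers, controls):
    """Odds ratio comparing carriers with non-carriers among cases and controls."""
    a = case_carriers                  # carriers among cases
    b = cases - case_carriers          # non-carriers among cases
    c = control_carriers               # carriers among controls
    d = controls - control_carriers    # non-carriers among controls
    return (a * d) / (b * c)


# Hypothetical sub-populations (illustrative numbers only).
subpops = {
    "A": {"n": 10000, "risk": 0.20, "carrier_freq": 0.60},
    "B": {"n": 10000, "risk": 0.05, "carrier_freq": 0.20},
}

stratum_or = {}
totals = {"case_carriers": 0.0, "cases": 0.0, "control_carriers": 0.0, "controls": 0.0}
for name, p in subpops.items():
    cases = p["n"] * p["risk"]
    controls = p["n"] - cases
    # No true effect: carrier frequency is identical in cases and controls.
    case_carriers = cases * p["carrier_freq"]
    control_carriers = controls * p["carrier_freq"]
    stratum_or[name] = odds_ratio(case_carriers, cases, control_carriers, controls)
    totals["case_carriers"] += case_carriers
    totals["cases"] += cases
    totals["control_carriers"] += control_carriers
    totals["controls"] += controls

pooled_or = odds_ratio(totals["case_carriers"], totals["cases"],
                       totals["control_carriers"], totals["controls"])

print(stratum_or)  # odds ratio within each sub-population
print(pooled_or)   # odds ratio when the sub-populations are pooled
```

Under these assumed values the within-stratum odds ratios equal 1, while the pooled odds ratio is about 1.75, solely because sub-population A contributes both more cases and more carriers.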
Significant systematic differences in ancestry proportions (with regard to
Amerindian and European ancestries) between cases and controls among Hispanic
samples have been observed consistently in many studies (Aldrich et al.; Choudhry, Taub,
et al.; Salari et al.). This indicates that there is potential confounding due to population
substructure among Hispanic populations, and it is necessary to adjust for this
confounding in association studies of asthma among Hispanic populations (Aldrich et al.;
Choudhry, Taub, et al.).
1.4 Methods for control of confounding

It has been suggested that confounding by population substructure might be essentially a
non-issue within broad population groupings defined by self-identified race or ethnicity
(Wacholder, Rothman and Caporaso; Jorm and Easteal). If ethnicity provides a good
measure of genetic homogeneity, or disease risks vary little within such groups, then
incorporating this information into an association study (i.e., by matching cases and
controls on, or analytically adjusting for, ethnicity) should help reduce the potential for
bias due to population stratification. However, for an admixed population, the variation
in allele frequencies and disease rates within these broad ethnic groupings (e.g.,
African-Americans and Hispanics) is unclear. Differing allele frequencies and disease
rates arise when there is random mating within, but little or no mating between,
sub-populations. The choice of mates is determined by geographic, socioeconomic, religious,
cultural, and physical characteristics, which may not segregate into broadly defined
ethnic groupings. Furthermore, among sub-populations, these characteristics are dynamic
over time and space, often undergoing their own evolution (Cavalli-Sforza and Feldman).
Therefore, adjusting for or matching on self-identified race or ethnicity as a proxy for
genetic sub-populations may not fully control for population stratification in these
admixed populations (Choudhry, Coyle, et al.; Serre et al.).
There have been several approaches to control for confounding due to population
substructure for GWAS among population-based samples. These include approaches
aiming to control the confounding but not necessarily to estimate the population
structures. Genomic control approaches attempt to adjust the test statistic distribution for
the presence of stratification (Devlin and Roeder), and logistic regression approaches
adjust for many unlinked markers (Setakis, Stirnadel and Balding). Alternatively, there
are approaches that rely on the estimation of population structures. These include latent
variable approaches attempting to identify the specific structure to analytically adjust for
the stratification (Hoggart, Parra, et al.; Satten, Flanders and Yang; Pritchard, Stephens
and Donnelly; Alexander, Novembre and Lange), and distance-based multivariate
approaches capturing the variation with fewer dimensions than the original data
(Engelhardt and Stephens; Li and Yu; Miclaus, Wolfinger and Czika; Price, Patterson,
Plenge, et al.). For the most part, approaches that attempt to estimate the structure focus
solely on global ancestry for a given individual although it has been recently suggested
that correcting for individual local ancestries may be required for genome-wide
association scans in admixed populations (Bryc et al.; Kang et al.; Qin et al.; X. Wang et
al.).
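As an illustration of the first family of approaches, the genomic control idea can be sketched as follows: an inflation factor lambda is estimated as the median of the observed 1-df chi-square association statistics divided by the median expected under the null (approximately 0.455), and every statistic is then divided by lambda. This is a minimal sketch of that general idea only; the function name and the example statistics are hypothetical.

```python
from statistics import median

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_stats):
    """Return (lambda, adjusted statistics) for a list of 1-df chi-square statistics."""
    lam = median(chi2_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # common convention: never inflate the statistics
    return lam, [x / lam for x in chi2_stats]


# Hypothetical association statistics from unlinked markers.
observed = [0.20, 0.50, 0.91, 1.30, 6.00]
lam, adjusted = genomic_control(observed)
print(lam)
print(adjusted)
```

In practice lambda would be estimated from many thousands of markers; five values are used here only to keep the example readable.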
In this study, STRUCTURE, ADMIXTURE, and EIGENSTRAT were used to
estimate individual global ancestry, and HAPMIX and LAMP were used to assess
individual local ancestry.
1.4.1 EIGENSTRAT
EIGENSTRAT (Price, Patterson, Plenge, et al.) applies principal components analysis to
genotype data to infer continuous axes of genetic variation. SNPs are centered and scaled,
and eigenvectors are then calculated from the covariance matrix between individuals
based on the SNP genotypes. The top continuous axes of variation (top eigenvectors)
are used to infer substructures within the study samples.
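The steps above (center and scale each SNP, form the covariance matrix between individuals, extract the top eigenvector) can be sketched on a toy data set. This is not the EIGENSTRAT implementation: the scaling by sqrt(p(1 - p)) with the sample allele frequency p is a simplifying assumption, power iteration stands in for a full eigendecomposition, and the genotype matrix is hypothetical.

```python
import math


def standardize(genotypes):
    """Center and scale each SNP (column) of a 0/1/2 genotype matrix.

    Simplifying assumption: each SNP is scaled by sqrt(p * (1 - p)), with p the
    sample allele frequency; monomorphic SNPs are dropped.
    """
    n = len(genotypes)
    columns = []
    for j in range(len(genotypes[0])):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        p = mean / 2.0
        if p <= 0.0 or p >= 1.0:
            continue
        sd = math.sqrt(p * (1.0 - p))
        columns.append([(g - mean) / sd for g in col])
    return [[col[i] for col in columns] for i in range(n)]


def individual_covariance(x):
    """Covariance matrix between individuals (rows) of the standardized matrix."""
    m = len(x[0])
    return [[sum(a * b for a, b in zip(ri, rj)) / m for rj in x] for ri in x]


def top_eigenvector(matrix, iterations=500):
    """Leading eigenvector by power iteration (adequate for a toy example)."""
    v = [float(i + 1) for i in range(len(matrix))]
    for _ in range(iterations):
        w = [sum(a * b for a, b in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v


# Hypothetical genotypes: rows are individuals, columns are SNPs.
# The first three and last three individuals come from two different groups.
genotypes = [
    [2, 2, 2, 0, 0, 1],
    [2, 1, 2, 0, 1, 1],
    [2, 2, 1, 1, 0, 0],
    [0, 0, 1, 2, 2, 1],
    [0, 1, 0, 2, 1, 1],
    [1, 0, 0, 2, 2, 2],
]
pc1 = top_eigenvector(individual_covariance(standardize(genotypes)))
print([round(c, 2) for c in pc1])
```

In this toy example the coordinates of the first eigenvector take opposite signs in the two groups, which is the sense in which a top eigenvector reflects substructure.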
1.4.2 STRUCTURE & ADMIXTURE
STRUCTURE (Pritchard, Stephens and Donnelly) is a model-based approach. It
assumes K founder populations, each characterized by a set of allele frequencies across a
number of independent markers, and assumes Hardy-Weinberg equilibrium within each
population. Individuals are assumed to originate from one or more of the K populations and
are probabilistically assigned to populations. The program models the likelihood of the
observed genotypes based on the assigned populations and the allele frequencies within each
population, and uses a Markov chain Monte Carlo (MCMC) algorithm to sample the
posterior distribution. The basic idea of the program ADMIXTURE (Alexander,
Novembre and Lange) is similar to that of STRUCTURE. However, instead of relying on
MCMC to sample the posterior distribution, ADMIXTURE uses an optimization
technique to maximize the likelihood. Like STRUCTURE, ADMIXTURE
gives the estimated proportion of ancestry (averaged across the
genome) from each contributing population for each individual under study.
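The likelihood underlying both programs can be sketched for a single individual and two founder populations. The sketch makes several simplifying assumptions that the programs themselves do not: the population allele frequencies are treated as known rather than estimated jointly, K is fixed at 2, and a simple grid search replaces both MCMC sampling and ADMIXTURE's optimizer. All frequencies and genotypes are hypothetical.

```python
import math


def log_likelihood(q, genotypes, freq_pop1, freq_pop2):
    """Log-likelihood of one individual's 0/1/2 genotypes given ancestry fraction q.

    q is the proportion of ancestry from population 1. Markers are treated as
    independent and in Hardy-Weinberg equilibrium, so each genotype is
    Binomial(2, p) with p = q * f1 + (1 - q) * f2.
    """
    total = 0.0
    for g, f1, f2 in zip(genotypes, freq_pop1, freq_pop2):
        p = q * f1 + (1.0 - q) * f2
        total += math.log(math.comb(2, g) * p ** g * (1.0 - p) ** (2 - g))
    return total


def estimate_q(genotypes, freq_pop1, freq_pop2, steps=100):
    """Grid-search maximum-likelihood estimate of q on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotypes, freq_pop1, freq_pop2))


# Hypothetical allele frequencies in two founder populations (all strictly
# between 0 and 1 so that every genotype has non-zero probability).
f1 = [0.90, 0.80, 0.10, 0.20, 0.90, 0.10, 0.85, 0.15]
f2 = [0.10, 0.20, 0.90, 0.80, 0.10, 0.90, 0.15, 0.85]

q_pop1_like = estimate_q([2, 2, 0, 0, 2, 0, 2, 0], f1, f2)  # resembles population 1
q_admixed = estimate_q([1, 1, 1, 1, 1, 1, 1, 1], f1, f2)    # heterozygous everywhere
print(q_pop1_like, q_admixed)
```

The estimated q plays the role of the individual global ancestry proportion described above.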
1.4.3 HAPMIX & LAMP
HAPMIX (Price, Tandon, et al.) is a haplotype-based approach that infers local ancestry
from dense genome-wide data. It assumes two homogeneous reference ancestral
populations for the admixed population under study, and the reference populations are
required to be close to the true founder populations of the study samples. The key
assumption of the program is that the genome of an admixed individual is a mosaic of small
regions that originate from the reference populations. It calculates the likelihood that a haplotype
from an admixed individual is from one reference population or the other at each locus,
and likelihoods from nearby loci are combined through a Hidden Markov Model (HMM)
to get a probabilistic ancestry estimator for each locus. Another advantage of this
program is that, instead of first estimating haplotypes among the study samples and assuming
no error during the phasing process, HAPMIX incorporates a built-in phasing process and
averages the inference about ancestry across all possible phase solutions for each
admixed individual.
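The way an HMM combines evidence from nearby loci can be sketched with a forward-backward pass over a single haplotype. This is a deliberately reduced illustration and not the HAPMIX model: it uses reference allele frequencies instead of copying from reference haplotypes, assumes a constant ancestry-switch probability between adjacent loci, and ignores phasing uncertainty. The function name, parameters, and data are hypothetical.

```python
def local_ancestry_posterior(haplotype, freq_a, freq_b, switch_prob=0.05, prior_a=0.5):
    """Posterior probability of ancestry A at each locus of one haplotype.

    haplotype: list of 0/1 alleles; freq_a / freq_b: frequency of allele 1 in
    reference populations A and B; switch_prob: probability that ancestry
    changes between adjacent loci (held constant here for simplicity).
    """
    n = len(haplotype)
    # Emission probabilities of the observed allele under each ancestry.
    emit = []
    for allele, fa, fb in zip(haplotype, freq_a, freq_b):
        emit.append((fa if allele == 1 else 1.0 - fa,
                     fb if allele == 1 else 1.0 - fb))
    stay = 1.0 - switch_prob

    # Forward pass (normalized at each locus to avoid underflow).
    forward = []
    a, b = prior_a * emit[0][0], (1.0 - prior_a) * emit[0][1]
    forward.append((a / (a + b), b / (a + b)))
    for t in range(1, n):
        pa, pb = forward[-1]
        a = (pa * stay + pb * switch_prob) * emit[t][0]
        b = (pb * stay + pa * switch_prob) * emit[t][1]
        forward.append((a / (a + b), b / (a + b)))

    # Backward pass (also normalized).
    backward = [(1.0, 1.0)] * n
    for t in range(n - 2, -1, -1):
        na, nb = backward[t + 1]
        a = stay * emit[t + 1][0] * na + switch_prob * emit[t + 1][1] * nb
        b = switch_prob * emit[t + 1][0] * na + stay * emit[t + 1][1] * nb
        backward[t] = (a / (a + b), b / (a + b))

    # Combine forward and backward messages into per-locus posteriors.
    result = []
    for (fa, fb), (ba, bb) in zip(forward, backward):
        result.append(fa * ba / (fa * ba + fb * bb))
    return result


# Hypothetical data: allele 1 is common in population A and rare in B.
haplotype = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
freq_a = [0.9] * 10
freq_b = [0.1] * 10
posterior = local_ancestry_posterior(haplotype, freq_a, freq_b)
print([round(p, 3) for p in posterior])
```

In this example the posterior favors ancestry A over the first half of the haplotype and ancestry B over the second half, mirroring the mosaic structure assumed for admixed genomes.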
LAMP (Pasaniuc, Sankararaman, et al.) uses a sliding window-based framework.
It assumes admixed populations arise from K ancestral populations. It partitions an
individual's genome into small windows and assumes no more than one recombination
event that changes the ancestry within each window. It chooses a window size for each
locus, and at each locus, it computes the likelihood of having different ancestry upstream
and downstream within the window.
1.5 Remaining challenges for genetic association studies among Hispanics
With regard to population substructure in the Hispanic population, the challenges
remaining for conducting genetic association studies in this population include the
following:
a) There are several existing methods for estimating individual global and local
ancestry. The question is how to choose between these methods for ancestry estimation
according to the study design, purpose of the study (e.g. adjusting for confounding,
capturing heterogeneity), and the samples under study (general knowledge about the
geography and genetic composition of the samples).
b) In an admixed population, each individual carries various proportions of
founder ancestries, and these proportions vary across the genome, resulting in LD
patterns that vary within and between populations. What does the local ancestry look like,
and how much diversity across the genome is observed for these admixed individuals? In
addition, although there have been studies showing that local ancestry could be a
confounder and that adjustment for individual local ancestry is necessary in genetic
association studies, all these conclusions are drawn from simulation studies. There has
not been any empirical study showing the need to adjust for local ancestry in addition to the
adjustment for global ancestry (which is generally used in genetic association studies
these days).
c) In an admixed population such as Hispanics, there are indicators of self-reported
ancestry (e.g. Hispanic White, Hispanic Asian, and Hispanic Black), as well as estimated
individual global and local ancestries. What is the relation among all these
ancestry indicators? How much do they correlate with each other, and how much
additional information does each of them capture? What is the contribution of each
ancestry indicator to the interpretation of underlying genetic and environmental effects on
the trait under study?
d) Is there heterogeneity in SNP marginal effects by self-reported ancestry or by
estimated individual global and local ancestry? Is there heterogeneity of SNP effects on
disease risk through a three-way interaction among the SNP, global (or self-reported)
ancestry, and local ancestry?
e) Is the testing of gene-by-environment interaction confounded by population
substructure among admixed individuals? (will add more reference for this section)
1.6 Introduction to graphical modeling

Figure 1-1 uses a graphical model to give a structural representation of the concept of
confounding. Assume that we are interested in the unknown causal relation between a
gene (GM) and a disease (Y). If we undertake an association study in which there
exists an unknown factor X (e.g. population substructure, other underlying genetic factors,
or any environmental factors) that is associated with genotype and is a risk factor for the
disease, then confounding by this unobserved factor X can occur.
Figure 1-1 Graphical model for the concept of confounding. Y represents the trait of
interest, GM represents the genotyped marker, and X represents unknown environmental
and genetic factors. Observed variables are drawn as solid squares, and unobserved
variables are drawn as dashed circles. An association between variables is represented by
a solid line connecting them.
Chapter 2 Population substructures among Hispanics

2.1 The USC Children's Health Study (CHS)
2.1.1 Samples & markers
The CHS is an ongoing cohort study investigating environmental and genetic influences
on asthma in children. The study design is discussed in detail elsewhere (Navidi et al.;
McConnell et al.; Li et al.). In this project, we include a total of 2,839 samples (1,246
asthma cases and 1,593 controls) from two self-reported ethnic groups: 1,489 non-Hispanic
Whites and 1,350 Hispanics. Genotyping of these samples was performed at
the USC Genome Center using both the Illumina HumanHap550 and the Illumina
Human 610-Quad BeadChips.
2.1.2 Potential confounding by population substructure observed from previous studies
Several previous studies have shown that population-based genetic association
studies among Hispanic populations might be confounded by population stratification.
For example, a case-control study among Hispanics showed that Puerto Rican asthma
cases had a significantly lower proportion of African ancestry and a significantly higher
proportion of European ancestry than controls (Choudhry, Coyle, et al.). In addition,
European ancestry was found to be associated with more severe asthma in Mexican
Americans, and there was a strong inverse correlation between Native American and
European ancestry in this Hispanic population (Salari et al.).
2.2 Ancestry informative markers

Ancestry informative markers (AIMs) are markers whose allele frequencies differ
between parental populations and are therefore selected to estimate individuals' ancestral
proportions. In order to study the observed structure within the CHS multiethnic
populations, three exclusive AIM groups are used to distinguish continental as
well as finer within-continental substructures. AIM233 (Smith et al.) and AIM557 (Seldin
et al.) are informative marker sets selected to identify continental genetic
structures. AIM233 contains AIMs from four lists, each with 100 SNPs, which are
optimal for distinguishing four population mixtures: European vs. West-African,
European vs. Amerindian, West-African vs. Amerindian, and European vs. East Asian.
Of these 400 SNPs, 233 are unique and have a high probability of successful genotyping
using Illumina. AIM557 is a subset of markers found on the Illumina Linkage IV panel
that are informative for distinguishing European ancestry from African, East Asian, South
Asian, and Amerindian ancestries. Furthermore, in order to detect possible finer scale
structures within the European population, we include a group of European substructure
ancestry informative markers, AIM192 and AIM1211 (Tian et al.), that are
selected from the Illumina 300K and 500K platforms. AIM192 is informative for
identifying Northern/Southern European substructures, and AIM1211 is informative for
identifying substructures along a West-East gradient within northern Europeans.
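One simple way to quantify how informative a marker is for ancestry is the absolute difference in allele frequency (often called delta) between two parental populations. The sketch below ranks markers by this measure. It is an illustration of the general idea only and is not necessarily the criterion used to build the AIM panels cited above; the marker names and frequencies are hypothetical.

```python
def rank_aims(freq_pop1, freq_pop2, top_n):
    """Rank markers by absolute allele-frequency difference between two populations."""
    deltas = {
        snp: abs(freq_pop1[snp] - freq_pop2[snp])
        for snp in freq_pop1
        if snp in freq_pop2
    }
    ranked = sorted(deltas, key=deltas.get, reverse=True)
    return ranked[:top_n]


# Hypothetical allele frequencies in two parental populations.
pop1 = {"snpA": 0.95, "snpB": 0.50, "snpC": 0.10, "snpD": 0.30}
pop2 = {"snpA": 0.05, "snpB": 0.45, "snpC": 0.80, "snpD": 0.35}

top_markers = rank_aims(pop1, pop2, top_n=2)
print(top_markers)
```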
2.3 HapMap III populations

2.3.1 Diverse ethnic populations
HapMap Phase III samples can be used as reference populations when estimating individual
global and local ancestries among the study samples. HapMap III includes samples from
11 populations: Yoruba in Ibadan, Nigeria (YRI), Luhya in Webuye, Kenya (LWK),
African ancestry in Southwest USA (ASW), Maasai in Kinyawa, Kenya (MKK), Toscani in Italy
(TSI), Utah residents with Northern and Western European ancestry from the CEPH
collection (CEU), Han Chinese in Beijing, China (CHB), Japanese in Tokyo, Japan (JPT),
Chinese in Metropolitan Denver, Colorado (CHD), Mexican ancestry in Los Angeles,
California (MEX), and Gujarati Indians in Houston, Texas (GIH). When investigating
population structures, only unrelated samples are included in the analysis. In the end,
988 unrelated samples (113 YRI, 90 LWK, 49 ASW, 143 MKK, 88 TSI, 112 CEU, 84
CHB, 86 JPT, 85 CHD, 50 MEX, and 88 GIH) are included in this study.
2.3.2 Global ancestry estimator: EIGENSTRAT, STRUCTURE & ADMIXTURE
2.3.2.1 Results from EIGENSTRAT

Individual global ancestry among HapMap III samples is estimated through the
EIGENSTRAT program based on the 1981 AIMs. Figure 2-1 shows plots of the first
eigenvector against eigenvectors two through seven for HapMap III samples. The first
eigenvector clearly distinguishes two major continental clusters, the African
ancestry related groups (YRI, LWK, ASW, and MKK) and the other ethnic groups.
Within the African ancestry, it also separates YRI and LWK from ASW and MKK. The
second eigenvector identifies the Asian cluster (CHB, JPT, and CHD) as well as the
European cluster (TSI and CEU). The third eigenvector further separates the
Indian cluster (GIH) from the other ethnic groups. The fourth and fifth eigenvectors
separate the Amerindian (MEX) as well as the Maasai component of the African ancestry
(MKK) from the other ethnic groups. Eigenvector six is the major eigenvector that
clearly distinguishes the southern (TSI) and northern (CEU) clusters within the
European ancestry. Finally, eigenvector seven distinguishes JPT from CHB
and CHD within the Asian ancestry. The remaining eigenvectors (eigenvectors eight to
ten) do not show any pattern of clusters related to the ethnic groups within the
HapMap III samples. The top ten eigenvectors explain 20% of the variance in the data.
Figure 2-2 shows the finer scale clusters within the European (TSI and CEU),
African (YRI, LWK, ASW, and MKK), and Asian (CHB, JPT, and CHD) ancestries,
respectively. For each ancestry group, the two eigenvectors that are most informative
for identifying the finer clusters within that ancestry are plotted against each other.
Within the European ancestry, eigenvectors 4 and 6 separate TSI, which represents the
southern European cluster, from CEU. Within the African ancestry, the
combination of eigenvectors 1 and 4 separates the four ethnic groups perfectly
from each other. The YRI and LWK are two small clusters that are close to
each other, while the ASW and MKK are relatively far apart with greater variation within
the clusters. For the Asian ancestry, JPT is clearly separated from the other ethnic
groups mainly through eigenvector 7, and CHB and CHD are together identified as a
homogeneous Asian cluster.
Figure 2-1 Clusters identified among HapMap III samples through EIGENSTRAT.
Figure 2-2 Finer scale cluster among HapMap samples identified through EIGENSTRAT
within European, African, and Asian ancestry.
2.3.2.2 Results from STRUCTURE

I also estimate individual global ancestry through the STRUCTURE program based on
the same set of AIMs among HapMap III samples. When running STRUCTURE, the
number of clusters (K) is set to a range of values based on our knowledge of
the samples under study, and the admixture model is used.
The length of the burn-in period is set at 20,000, and the number of MCMC steps after
burn-in is set at 10,000. I perform 10 independent runs for each K, and the final number of
clusters K used to interpret the substructures is decided based on both the estimated Ln
probability of the data and knowledge about the geography and the plausible ancestry
proportions among the samples.
The results for K ranging from 2 to 7 are shown in Figure 2-3. For each plot, the
horizontal axis represents individuals grouped by ethnicity. The vertical axis
represents the estimated individual ancestry coefficients, each a continuous variable
between 0 and 1 indicating the proportion of the individual's genome that derives from
a given ancestry. Different estimated ancestries are represented by different colors. Here,
green represents the estimated European ancestry; red, African ancestry;
orange, the Maasai component of the African ancestry; blue-purple,
Asian ancestry; pink, the Amerindian ancestry; and blue,
the Western Indian ancestry.
As shown in Figure 2-3, with K=2, STRUCTURE mainly identifies the African
ancestry. There are four ethnic groups (YRI, LWK, ASW, and MKK) that mostly consist
of African ancestry. For K=3, STRUCTURE further identifies Asian ancestry. There are
three ethnic groups (CHB, JPT, and CHD) that are homogeneous for Asian ancestry.
STRUCTURE separates out the Indian ancestry among the GIH samples for K=4, and for
K=5 it identifies the Maasai component of the African ancestry within the MKK ethnic
group. The Amerindian ancestry is finally identified for K=6 within the MEX ethnic
group. No additional ancestry clusters could be identified by the
STRUCTURE program with K greater than 6.

Figure 2-4 shows the Ln likelihood of the data estimated by STRUCTURE
with K ranging from 2 to 7. The likelihood increases dramatically from K=2 to K=3,
followed by a relatively small increase from K=3 to K=6, and then
levels off after K=6. Based on the estimated likelihood from STRUCTURE and the
geography of the samples, six clusters (K=6) best represent the
population substructure within the HapMap III samples.
Figure 2-3 STRUCTURE results from analyzing only HapMap samples for K from 2 to 7.
Figure 2-4 Ln likelihood of the data estimated from the STRUCTURE with K from 2 to 7
among HapMap samples.
2.3.2.3 Results from ADMIXTURE

The results from ADMIXTURE are shown in Figure 2-5 for K ranging from 2 to 7.
Similar to the STRUCTURE analysis, ADMIXTURE identifies the three
major continental ancestries (European, African, and Asian), as well
as three other distinct ancestry components: Amerindian among MEX samples, Indian
ancestry mainly among GIH samples, and the Maasai component of African ancestry.
The major differences between the results from ADMIXTURE and STRUCTURE are: 1)
the estimated proportion of the Maasai component among MKK and LWK samples is higher
from ADMIXTURE; 2) the estimated Amerindian ancestry among MEX is higher from
ADMIXTURE; 3) JPT is separated from CHB and CHD by ADMIXTURE.
Figure 2-5 ADMIXTURE results from analyzing only HapMap samples for K from 2 to 7.
Figure 2-6 Cross validation errors of the data estimated from the ADMIXTURE with K
from 2 to 10 among HapMap samples.
Figure 2-6 shows the cross-validation errors estimated by
ADMIXTURE with K from 2 to 10. The error decreases dramatically from K=2 to K=3,
followed by a relatively small decrease from K=3 to K=6, and then increases again after
K=6. This supports the conclusion drawn from the STRUCTURE program that six
clusters (K=6) best represent the population substructure within the HapMap III
samples. However, the EIGENSTRAT result shows that the 1981 selected AIMs
are able to distinguish JPT from the other Asian ethnic groups. The finer scale
substructure within the Asian ancestry identified by ADMIXTURE with K=7 is
consistent with this finding from the EIGENSTRAT program.
Figure 2-7 Comparison of estimated individual global ancestry among HapMap samples
from STRUCTURE, ADMIXTURE, and EIGENSTRAT.
In order to compare the results from EIGENSTRAT, STRUCTURE, and
ADMIXTURE, we plot the top eigenvectors along with the population substructures
identified through STRUCTURE and ADMIXTURE. As shown in Figure 2-7, the top
four eigenvectors capture population substructures similar to those from STRUCTURE
and ADMIXTURE. Eigenvector 7 supports the finer scale substructure identified among
Asian populations through ADMIXTURE with K=7.
2.3.3 Local ancestry estimates among HapMap MEX samples
When estimating individual local ancestry among HapMap MEX samples, phased
haplotypes of HapMap III CEU and Asian (CHB and JPT) samples are used as the reference
ancestral populations for HAPMIX, and allele frequencies from each group serve as
the reference allele frequencies for LAMP. In addition, global ancestry is calculated by
averaging the estimated local ancestries across the genome for each individual.
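The averaging step can be written out explicitly. In this minimal sketch, local ancestry at each locus is assumed to be coded as the (expected) number of European copies, between 0 and 2; this coding and the example values are assumptions made for illustration.

```python
def global_ancestry(local_copies):
    """Average European ancestry proportion across loci.

    local_copies: per-locus number of European copies (0 to 2; may be
    fractional if the local ancestry estimate is probabilistic).
    """
    return sum(local_copies) / (2.0 * len(local_copies))


# Hypothetical local ancestry calls at eight loci for one individual.
local_calls = [2, 2, 1, 1, 0, 2, 1, 1]
print(global_ancestry(local_calls))
```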
As shown by the STRUCTURE result with K=6, the MEX admixed samples
consist mainly of Amerindian and European ancestries. When estimating individual
local ancestry among MEX samples, it would be optimal to use a homogeneous Amerindian
population as one of the reference populations; however, such a homogeneous population of
Amerindian ancestry is not available in this study. STRUCTURE results comparing K=5
and K=6 show that the Asian ancestry is closest to the Amerindian ancestry, and the
HapMap III CHB and JPT are homogeneous ethnic groups of Asian ancestry.
Therefore, CHB and JPT are used as the Asian reference population to represent the
Amerindian ancestry within MEX samples, and HapMap III CEU and Asian
samples are used as the reference populations when estimating individual local ancestries
among MEX samples.
Figure 2-8 plots the estimated individual local ancestry on chromosome 22 from
HAPMIX and LAMP for three selected MEX samples. Sample NA19679 has an
estimated European ancestry proportion of 0.84 from STRUCTURE with K=5, and the
estimated European ancestry proportions for samples NA19676 and NA19759 are 0.63
and 0.42, respectively. In each plot, the horizontal axis represents markers ordered by
their physical position on chromosome 22, and the vertical axis represents the proportion
of European ancestry at each locus. The genome consists mainly of three different
kinds of regions: regions with ~0% European ancestry (two copies from the Asian parental
population); regions with ~50% European ancestry (one copy from the European parental
population and one copy from the Asian parental population); and regions with ~100%
European ancestry (two copies from the European parental population). For
samples with greater global European ancestry estimated by the STRUCTURE
program, the local ancestry estimated through both HAPMIX and LAMP consistently
shows a greater proportion of European ancestry. The local ancestry estimated by LAMP
roughly agrees with that estimated by the HAPMIX program. The Pearson correlations between
the HAPMIX and LAMP local ancestry estimates for samples NA19679, NA19676, and NA19759 are
0.71, 0.82, and 0.74, respectively.
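For reference, the Pearson correlation between two sets of per-locus ancestry estimates can be computed as in the following sketch. The two vectors are hypothetical stand-ins for the estimates from two methods at the same loci; they are not the HAPMIX or LAMP output reported above.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical proportion of European ancestry at six loci from two methods.
method_a = [0.0, 0.5, 0.5, 1.0, 1.0, 0.5]
method_b = [0.0, 0.5, 1.0, 1.0, 1.0, 0.5]
r = pearson(method_a, method_b)
print(round(r, 3))
```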
2.3.4 Comparing methods
As the individual global ancestry is an average of the local ancestries across the genome,
besides estimating individual global ancestry based on selected AIMs through
STRUCTURE and ADMIXTURE, we also calculate individual global ancestry by
averaging the estimated individual local ancestry across the genome (437,599 loci in
total). Individual global ancestry estimators are shown in Figure 2-9 for STRUCTURE,
ADMIXTURE, HAPMIX, and LAMP. The order of the individuals is the same across
the plots. The estimated individual global ancestries are consistent among the different
approaches. As shown in Table 2-1, the estimated Asian ancestry proportion is similar
among STRUCTURE with K=5 (31%), ADMIXTURE with K=5 (31%), HAPMIX
(36%), and LAMP (34%). For K=6, STRUCTURE and ADMIXTURE result in much
lower European ancestry proportions. As shown in Table 2-2, the Pearson correlation of
European ancestry proportion is high (>0.9) between the different approaches. HAPMIX and
LAMP result in very similar global ancestry estimators (as shown in Figure 10).
Figure 2-8 Estimated individual local ancestry on chromosome 22 from HAPMIX and
LAMP for three selected HapMap MEX samples.
Figure 2-9 Comparison of estimated individual global ancestry among HapMap MEX
samples from STRUCTURE, ADMIXTURE, HAPMIX, and LAMP.
Table 2-1 Estimated individual global European, Asian, Amerindian, and African
ancestry proportions across HapMap MEX samples from different approaches.
Approach         European   Asian   Amerindian   African   Other
STRUCTURE K=5    60%        31%                  5%        4%
STRUCTURE K=6    42%        14%     42%          2%        0%
ADMIXTURE K=5    50%        31%                  6%        3%
ADMIXTURE K=6    27%        6%      57%          4%        6%
HAPMIX           64%        36%
LAMP             66%        34%
Table 2-2 Pearson correlation of estimated individual global European ancestry
proportion between different approaches.
                 STRUCTURE   ADMIXTURE   ADMIXTURE   HAPMIX      LAMP
                 K=6         K=5         K=6
STRUCTURE K=5    0.97        0.99        0.96        0.95        0.94
STRUCTURE K=6                0.95        0.99        0.97        0.97
ADMIXTURE K=5                            0.94        0.93        0.92
ADMIXTURE K=6                                        0.96        0.96
HAPMIX                                                           1.00
Table 2-3 Characteristics of different approaches for estimating individual global and
local ancestries.
Approach     Global  Local  Markers                  Time (a)  Choice for study designs
STRUCTURE    Yes            AIMs                     1 hr      Studies with genotyped AIMs. Need precise &
                                                               interpretable estimators for individual
                                                               global ancestry.
ADMIXTURE    Yes            AIMs; random markers     1 min     Studies without AIMs. Need quick but
                            (e.g. 10,000 ~ 100,000)            interpretable global estimator that is good
                                                               enough for adjusting for confounding.
EIGENSTRAT   Yes            AIMs; random markers     5 sec     Studies without AIMs. Need quick adjustment
                            (e.g. 10,000 ~ 100,000)            in the study without complete interpretation
                                                               of the estimated cluster. Studies that would
                                                               like to capture other potential unknown
                                                               substructures.
HAPMIX       Yes     Yes    GWAS; 1000 Genomes;      15 hr     Studies that need a precise local ancestry
                            sequencing                         estimator. Samples mainly consist of two
                                                               parental ancestries.
LAMP         Yes     Yes    GWAS; 1000 Genomes;      3 min     Studies that need a quick local ancestry
                            sequencing                         estimator with a trade-off in accuracy.
                                                               Studies with samples that apparently consist
                                                               of more than two parental ancestries.
(a) The time for STRUCTURE, ADMIXTURE, and EIGENSTRAT
is measured for running the program on 1981 AIMs among 50 samples; the time
for HAPMIX and LAMP is measured for running the program
across the whole genome (437,599 autosomal SNPs) among 50 samples.
2.4 Population substructure among CHS samples

Among the 1981 AIMs that are available in HapMap, 1746 are available in both
HapMap III and the CHS GWAS. HapMap III samples serve as the reference
population when estimating individual global and local ancestries among CHS samples.
2.4.1 Global ancestry estimates
Individual global ancestry is estimated through EIGENSTRAT and STRUCTURE. The
results from EIGENSTRAT are shown in Figure 2-10. Figure 2-10 (a) shows the
estimated clusters from the top ten eigenvectors within HapMap reference populations.
The result is similar to that from the previous analysis of HapMap samples alone
(Figure 2-1). The major difference between running EIGENSTRAT on HapMap
samples only and on combined HapMap and CHS samples is that the Amerindian ancestry
among HapMap MEX samples is identified earlier (eigenvector two) when combining
HapMap with CHS than when analyzing HapMap samples alone (eigenvector four).
This is driven by the large Hispanic group within the CHS samples. Figure 2-10 (b)
shows the clusters within CHS non-Hispanic samples. The scale for each plot is the same
as that in (a). Compared with the clusters identified among the HapMap reference
populations, the CHS non-Hispanic White population, marked in green in the plot, is a
fairly homogeneous population that clusters closely around the European ancestry. There are
a few individuals lying between the European and the other ancestry clusters (small tails
toward the Asian and African ancestries). Most of the CHS non-Hispanic Asian samples
(marked in blue-purple) cluster with the Asian ancestry, and most non-Hispanic Black
samples fall in the African cluster. CHS self-identified non-Hispanic Mix
samples (plotted as black circles) are basically a combination of the three CHS non-Hispanic
groups stated above, and the non-Hispanic Other population (pink dots) includes several
individuals with some level of Amerindian ancestry, as shown in the plot of
eigenvector one against eigenvector three. These samples are genetically closer to the
CHS Hispanic samples that are shown in Figure 2-10 (c). In this figure, self-identified
Hispanic White, Hispanic Mix, and Hispanic Other samples are marked in pink, and
Hispanic Asian and Hispanic Black samples are marked in blue-purple and red,
respectively. Overall, the CHS Hispanic population is an admixed population mainly
clustering between the European and Amerindian ancestries, with a few individuals
reaching towards the Asian and African ancestries identified among HapMap reference
samples.
Population substructure identified through STRUCTURE is consistent with that
from EIGENSTRAT. Results for the HapMap reference populations and for the CHS samples
are plotted separately in Figure 2-11 (a) and (b). When HapMap samples are combined with
CHS samples, STRUCTURE with K=7 identifies an additional ancestry, the Southern European
ancestry within TSI, compared with the result shown before (Figure
2-3).
The estimated individual global European, Asian, Amerindian, and African
ancestry proportions among CHS samples through STRUCTURE with K=6 are shown in
Table 2-4.
Figure 2-10 Clusters identified through EIGENSTRAT by analyzing HapMap and CHS
combined samples. (a) Identified clusters within HapMap reference populations.
(b) Identified clusters within CHS non-Hispanic samples. (c) Identified clusters
within CHS Hispanic samples.

Figure 2-11 Ancestry proportions identified through STRUCTURE by analyzing
HapMap and CHS combined samples. (a) Identified ancestry proportions within HapMap
reference populations. (b) Identified ancestry proportions within CHS samples with K=6.
(c) Identified ancestry proportions within CHS samples with K=7.

Table 2-4 Estimated individual global European, Asian/Amerindian, and African ancestry
proportions among CHS samples through STRUCTURE with K=6.

Group            Num    European   Asian   Amerindian   African   Other
Non-Hispanics
  White          1749   96%        1%      1%           0%        2%
  Asian          48     7%         75%     0%           0%        18%
  Black          16     24%        0%      0%           75%       1%
  Mix            56     62%        27%     1%           1%        11%
  Other                 52%        1%      42%          0%        5%
Hispanics
  White          401    67%        0%      29%          0%        4%
  Asian                 23%        44%     23%          0%        10%
  Black                 26%        3%      20%          49%       2%
  Mix            282    68%        5%      22%          0%        5%
  Other          839    45%        0%      49%          0%        6%

2.4.2 Local ancestry estimates
Similar to the approach used to estimate individual local ancestry among HapMap MEX
samples, HapMap CEU and Asian (CHB and JPT) samples are used as the reference populations
when running HAPMIX on CHS samples. Figure 2-12 shows the individual
global ancestry obtained by averaging the estimated local ancestry across the genome.
Figure 2-12 Individual global ancestry estimated from HAPMIX for CHS samples.
Chapter 3 Confounding and Heterogeneity in Genetic Association Studies with Admixed Populations
3.1 Introduction
Genome-wide association studies (GWAS) have been relatively successful at identifying
numerous risk variants for many diseases and traits (Hindorff et al.). A majority of these
studies have been performed with individuals of European ancestry and so have been free
of major population structure and heterogeneity. However, the field is gradually moving
towards GWAS in more diverse ethnic populations (Rosenberg et al.). To date, there
have been several reported GWAS among relatively homogeneous non-Caucasian
populations, including Chinese (Garcia-Barcelo, Tang, et al.; Guo et al.; Han et al.; Lei et
al.; Ng et al.; Tse et al.; Zhang et al.), Japanese (Hattori et al.; Hiura, Shen, et al.;
Kamatani et al.; Tanaka et al.; Unoki et al.; Yamada et al.; Yasuda et al.), and Korean
populations (Cho et al.). In part, this is due to the desire to generalize genetic findings to
other populations, as well as to the belief that the limited genetic variation within
individuals with European ancestry will not be sufficient to find all underlying variants
for each disease. Association studies using a broader range of populations with more or
different genetic variations as well as differences in disease prevalence may have
additional power for discovery (Pulit, Voight and de Bakker).
In addition to expansion to other homogeneous populations, GWAS among
admixed populations such as African Americans (Adeyemo et al.; Barnholtz-Sloan et al.)
and Hispanics (Hayes et al.; Norris et al.; Palmer et al.; Rich et al.) may be advantageous
due to extended linkage disequilibrium (LD) along the genome (Bonilla et al.; Gonzalez
Burchard et al.). However, such admixed populations pose new challenges in association
studies, most notably potential confounding due to subtle population stratification and
heterogeneity of the effects due to differential LD. Genetic admixture occurs when
individuals from previously distinct populations interbreed.
For example, African-Americans originate from the admixture of Caucasians and Western
Africans, while Hispanic-Americans originate from the admixture of Caucasians, Western
Africans, and Amerindians. After several generations, the genomes of the individuals in an
admixed population become a mosaic composed of chromosomal segments originating from each
of the ancestral populations. The ancestral variation for these segments is referred to as
local ancestry. Depending on their particular genealogical history, each admixed
individual will have different proportions of chromosomal segments throughout their
genome originating from each of the ancestral populations. The average of these
proportions across the genome is referred to as global ancestry.

While controlling for self-identified race and/or ethnicity is possible for broad-scale structure, when finer-scale stratification or admixture is suspected an alternative
approach is to perform family-based studies to obtain valid inference (Gauderman, Witte
and Thomas). Between these two extreme study designs there exist many approaches
that attempt to either account for the effects of the confounding or to identify the
unknown structure. These include approaches aiming to control the confounding but not
necessarily to estimate the population structures.
In an admixed population, each individual contains various proportions of founder
ancestries with these proportions varying across the genome. This individual and
localized admixture is due to genetic drift, population admixture, local chromosomal
structure, and mating patterns (Salanti, Sanderson and Higgins), and results in LD
patterns varying within and between populations. When testing genetic markers that are
proxies for a disease causal locus (as in a GWAS), this differential LD can result in
heterogeneity of effect estimates by local ancestry. This heterogeneity not only can
impact the power in a GWAS, but it can influence meta-analysis, as well. In fact, many
researchers have leveraged this to identify causal variants arguing that consistency in
effect estimates across multiple ethnic groups bolsters support for that specific variant
being a true causal polymorphism (Teslovich et al.; Waters et al.).
To deal with heterogeneity, most current studies perform a test of interaction
between the SNP of interest and ethnicity and then conduct a stratified analysis if
appropriate. These methods are optimal for the combined analysis of more
homogeneous populations (e.g. European and Asian), but it is unclear how appropriate
they are for certain admixed populations with variation in local ancestry along the
chromosome (e.g. Hispanics) or when combining an admixed population with others.
In this chapter, we use a graphical diagram to clarify the mechanisms by which
admixture can lead to confounding and how heterogeneity in effect estimates may arise.
Based on these mechanisms, we use simulations to investigate the source and effect of
confounding and to test for heterogeneity via an interaction term between SNP and local
ancestry. Across all models and simulation scenarios we focus on effect estimation,
type I error and power. Finally, we apply these models to a GWAS investigating the
impact of genetic variation on asthma in the University of Southern California's
Children's Health Study. We discuss the overall impact of global ancestry on this
analysis and identify several empirical examples where accounting for local ancestry
impacts inference.
3.2 Materials and Methods

3.2.1 Graphical Model
Figure 3-1 (a) is a graphical model representing the relationship of several factors
involved in genetic association studies among admixed populations (Greenland;
Greenland, Pearl and Robins). Here, Y represents the outcome of interest. GM represents
the SNP at a marker being tested for association with Y (with effect $\beta_{G_M}$). GD represents
an unmeasured causal locus (with effect $\beta_{G_D}$) for which we are testing GM as a proxy. X
represents other causal factors that are associated with global ancestry (Q), including
unmeasured environmental factors and/or unmeasured causal loci. Global ancestry is
most often estimated from a subset of markers (i.e. ancestry informative markers)
(Seldin et al.; Shtir et al.; Smith et al.; Tian et al.). Alternatively, we view global ancestry
as the average of local ancestries along the genome. Ancestral variation can lead to
differences in allele frequencies for measured genetic variants (q) and unmeasured causal
variants (p). We assume that the local ancestry at GM and GD is the same, and that there are
additional SNPs (GL) that can be used to estimate local ancestry for each subject at each
location.
Figure 3-1 (a) Potential confounding paths in genetic association studies among admixed
populations. Y represents the outcome of interest, GM the SNP at a marker locus being
tested for association, L the individual local ancestry in the immediate neighborhood of
the marker locus, Q the individual global ancestry averaged through L across the genome,
X represents other causal factors, either unmeasured environmental factors, or
unmeasured causal loci present across the genome, that may be associated with global
ancestry, and GL the immediate neighborhood of the marker locus that is used to estimate
individual local ancestry L. (b) Directions of admixture LD and the LD in the parental
populations.
There are paths between factors that together may lead to confounding of the
relationship between the marker (GM and potentially GD) and Y. By definition, global
ancestry Q is correlated with local ancestry L ($\rho_{Q,L}$). When allele
frequencies differ between ancestral populations at GM (variation in q), Q is thus related to GM,
which results in the confounding path GM-L-Q-X-Y when testing the marker locus GM. Similarly,
when allele frequencies differ at GD, there will be the confounding path GD-L-Q-X-Y if we are testing the disease locus GD or even the marker locus GM.
There are two components that affect the magnitude of the LD between GM and
GD in an admixed population: LD within parental populations (D'); and the admixture LD
induced by differential frequencies between ancestral populations at both the marker and
the disease locus. As shown in Figure 3-1 (a), the admixture LD is indicated as the path
GM-L-GD marked in red, and the LD within the parental populations is in black. When the
directions of the LD are the same between parental populations, as indicated in Figure
3-1 (b), the reference alleles for GM and GD are chosen so that the correlation between
these two loci is positive within the admixed population. Similarly, a reference local
ancestry population for L can be defined such that L is positively correlated with GM in
the admixed population. Thus, the reference allele is the same in both parental
populations. Given these reference definitions, when L is negatively correlated with GD
(left panel), there exists an overall negative correlation between GM and GD through the
path GM-L-GD. In this situation, the admixture LD is in a different direction from the LD in
the parental populations. This results in a corresponding reduction in the observed
magnitude of the LD in the admixed population. In contrast, when L is positively
correlated with GD, there is an overall positive correlation between GM and GD through the
path GM-L-GD (right panel). In this situation, the admixture LD is in the same direction as
the LD in the parental populations, and the observed LD between GM and GD in the
admixed population is enhanced. In addition to the scenarios discussed above, when the
LD in the two ancestral populations is in opposite directions, the admixture LD will
always enhance the LD in one ancestry while reducing the level of LD in the other. In
summary, for a marker GM, admixture LD has the potential to act as an additional
confounder of the GM-Y effect. For a disease locus GD, there is no such potential.
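The two LD components described above can be illustrated with a short numerical sketch. It is not taken from the dissertation: it assumes the simplest textbook setting, in which haplotypes from two parental populations are pooled with mixing proportion `alpha` (a hypothetical parameter name), and it works with the LD coefficient D rather than D'. Under that assumption the LD in the pool is the weighted within-population LD plus an admixture term proportional to the product of the allele frequency differences at the two loci.

```python
def pooled_ld(alpha, p1, q1, d1, p2, q2, d2):
    """Split the LD coefficient D of a pooled haplotype sample into two parts.

    alpha      : share of haplotypes drawn from parental population 1 (assumed)
    p1, p2     : allele frequency at the disease locus GD in populations 1 and 2
    q1, q2     : allele frequency at the marker locus GM in populations 1 and 2
    d1, d2     : LD coefficient D between GD and GM within each population
    """
    within = alpha * d1 + (1 - alpha) * d2
    admixture = alpha * (1 - alpha) * (p1 - p2) * (q1 - q2)
    return within, admixture, within + admixture


def pooled_ld_direct(alpha, p1, q1, d1, p2, q2, d2):
    """Same total LD, computed directly from pooled haplotype frequencies."""
    h11 = alpha * (p1 * q1 + d1) + (1 - alpha) * (p2 * q2 + d2)
    p = alpha * p1 + (1 - alpha) * p2
    q = alpha * q1 + (1 - alpha) * q2
    return h11 - p * q


# Frequencies differ in the same direction at both loci: admixture LD adds.
same = pooled_ld(0.5, 0.3, 0.3, 0.05, 0.7, 0.7, 0.05)
# Frequencies differ in opposite directions: admixture LD offsets parental LD.
opposite = pooled_ld(0.5, 0.3, 0.7, 0.05, 0.7, 0.3, 0.05)
print(same, opposite)
```

In the first call the admixture term has the same sign as the parental LD and the total increases; in the second it has the opposite sign and the total shrinks, which mirrors the right and left panels of Figure 3-1 (b).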
Finally, individual local ancestries may modify the marginal effect at the marker
locus because of the differential LD existing across ancestral populations. That is, within
a study population the level of association between GD and GM varies across individuals as a
function of L (e.g. D'1 ≠ D'2). Thus, L acts as an effect modifier of the association
between GM and Y.
3.2.2 Regression models
We use the following generalized linear models to investigate the efficiency of
controlling for confounding by global ancestry and the potential impact on power of
adjusting for local ancestry:

$$g(\mu_Y) = \beta_0 + \beta_{G_M} G_M \tag{1}$$

$$g(\mu_Y) = \beta_0 + \beta_{G_M} G_M + \beta_Q Q \tag{2}$$

$$g(\mu_Y) = \beta_0 + \beta_{G_M} G_M + \beta_L L + \beta_Q Q \tag{3}$$

Specifically, $g(\mu_Y)$ is the logit link, Y is a dichotomized outcome with values of 0
(unaffected) and 1 (affected), and $\mu_Y$ is the probability of Y=1, conditional on the
covariates included in the model. Alternative outcomes can be handled in a similar
manner in the generalized linear framework. GM represents the number of variant alleles
for each individual and $\beta_{G_M}$ is the corresponding marginal effect. A Wald or likelihood
ratio test of $\beta_{G_M} = 0$ can be used to test association. For investigating heterogeneity, we
compared model (3) to a model that also includes a GM × L interaction term:

$$g(\mu_Y) = \beta_0 + \beta_{G_M} G_M + \beta_L L + \beta_{int} (G_M \times L) + \beta_Q Q \tag{4}$$

Here, we use a 2-df likelihood ratio test for the joint test of $\beta_{G_M} = 0$ and $\beta_{int} = 0$. This
joint test has been shown to be nearly optimal across many different scenarios for main
and interacting effects (Kraft et al.).
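As a small illustration of the likelihood ratio tests mentioned above, the sketch below converts two maximized log-likelihoods into a p-value. It assumes the log-likelihoods come from nested models fitted elsewhere (the function name and arguments are hypothetical) and uses the closed-form chi-square tail probabilities for 1 and 2 degrees of freedom, so that only the standard library is needed.

```python
import math


def lrt_pvalue(loglik_null, loglik_full, df):
    """P-value of a likelihood ratio test between two nested models.

    loglik_null : maximized log-likelihood of the restricted model
    loglik_full : maximized log-likelihood of the larger model
    df          : number of parameters set to zero under the null (1 or 2)
    """
    stat = max(0.0, 2.0 * (loglik_full - loglik_null))
    if df == 1:
        # Chi-square(1) upper tail: P(Z^2 > x) = erfc(sqrt(x / 2)).
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # Chi-square(2) is exponential with mean 2: P(X > x) = exp(-x / 2).
        return math.exp(-stat / 2.0)
    raise ValueError("only df = 1 or df = 2 are handled in this sketch")


# Joint 2-df test: compare a model without GM and GM x L to model (4).
print(lrt_pvalue(-1200.0, -1190.0, df=2))
```

For the joint test of the main effect and the interaction, `loglik_null` would be the fit without GM and GM × L and `loglik_full` the fit of model (4); for a 1-df test of the marginal effect, the two fits differ only in the GM term.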
3.2.3 Simulations
We conduct simulations to investigate the performance of the models defined
above in controlling for confounding and capturing heterogeneity of effects. Simulations are
based on the framework represented in the graphical model in Figure 3-1: we simulate
data based on the confounding paths and assess the Type I error and power after adjusting
for individual ancestries. To test the gain in power as well as the potential over-adjustment by local ancestry, we include simulation scenarios that model the admixture
LD. In addition, we simulate data with and without LD differences between populations
to gauge the impact of heterogeneity between ancestries. In all scenarios, we generate
cases for a binary disease outcome (Y) using a logistic regression model incorporating
the disease locus GD and the individual global ancestry Q, with a 50% average probability
of being a case. For simplicity we assume a direct relationship of Q to Y.
More specifically, in the simulations, we assume that individuals come from the
admixture of two parental ancestries, A1 and A2. The steps for simulating the dataset are
as follows:
a) Each individual i is assigned a local ancestry representing the number of genetic
copies from ancestry group A2: 0 (Li =0), 1 (Li =1) or 2 (Li =2). We generate 600
individuals within each ancestry group.
b) Assign allele frequencies at the disease locus (GD) within each parental ancestry:
p1 & p2
c) Assign allele frequencies at the marker locus (GM) within each parental ancestry:
q1 & q2
d) Assign LD between GD and GM within each parental ancestry: D'1 & D'2
e) Within each ancestry group, calculate the haplotype frequencies at the disease
(GD) and the marker (GM) loci based on the assigned allele frequencies (p and q)
and LD (D') within each parental ancestry.
f) Given the haplotype frequencies within each parental ancestry, generate two
haplotypes for each individual conditional on their local ancestry. That is, two
haplotypes from A1 for each individual with Li =0, one haplotype from A1 and
one haplotype from A2 for each individual with Li =1, and two haplotypes from
A2 for each individual with Li =2. From this, we obtain the genotypes at both the
marker (GM) and the disease (GD) loci.
g) For simulating global ancestry, we generate Q conditional on L for each
individual from the regression model: $Q_i = \bar{L}/2 + 0.2(L_i - \bar{L}) + \varepsilon_i$, with $\varepsilon_i \sim N(0, 0.3)$.
Then Q is truncated between 0 and 1. In this way, the generated Q is
related to local ancestry L with a correlation around 0.5 to reflect the observed
distribution among the Hispanic samples in the CHS.
h) We then probabilistically generate case-status for all 1,800 samples using a
logistic regression model incorporating the disease locus GD and the individual
global ancestry Q. Variables in the logistic regression model are mean centered
and there is a baseline risk of 50%, thus resulting in approximately equal numbers
of cases and controls for each replicate.
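Steps a) through h) can be sketched in a few lines of standard-library Python. This is an illustrative reading of the procedure, not the code used for the dissertation. The function and variable names are hypothetical, the 0.3 in step g) is treated as a standard deviation (the text does not say whether it is a variance), and the mean of L is taken to be 1 because the three local ancestry groups have equal size.

```python
import math
import random


def haplotype_freqs(p, q, d_prime):
    """Haplotype frequencies keyed by (GD allele, GM allele) in one ancestry.

    p is the allele frequency at GD, q at GM, d_prime the signed D'.
    """
    if d_prime >= 0:
        d_max = min(p * (1 - q), (1 - p) * q)
    else:
        d_max = min(p * q, (1 - p) * (1 - q))
    d = d_prime * d_max
    freqs = {
        (1, 1): p * q + d,
        (1, 0): p * (1 - q) - d,
        (0, 1): (1 - p) * q - d,
        (0, 0): (1 - p) * (1 - q) + d,
    }
    # Guard against tiny negative values caused by floating-point rounding.
    return {hap: max(0.0, f) for hap, f in freqs.items()}


def simulate_replicate(a1, a2, beta_gd, beta_q, n_per_group=600, seed=0):
    """One replicate; a1 and a2 are (p, q, D') tuples for ancestries A1 and A2."""
    rng = random.Random(seed)
    freqs = [haplotype_freqs(*a1), haplotype_freqs(*a2)]
    people = []
    for local in (0, 1, 2):  # number of copies from A2 (step a)
        for _ in range(n_per_group):
            ancestries = [1] * local + [0] * (2 - local)
            haps = [
                rng.choices(list(freqs[a]), weights=list(freqs[a].values()))[0]
                for a in ancestries
            ]  # step f
            g_d = haps[0][0] + haps[1][0]
            g_m = haps[0][1] + haps[1][1]
            # Step g, with mean(L) = 1 and 0.3 assumed to be a standard deviation.
            q_anc = 0.5 + 0.2 * (local - 1) + rng.gauss(0.0, 0.3)
            q_anc = min(1.0, max(0.0, q_anc))
            people.append({"L": local, "GD": g_d, "GM": g_m, "Q": q_anc})
    mean_gd = sum(x["GD"] for x in people) / len(people)
    mean_q = sum(x["Q"] for x in people) / len(people)
    for x in people:  # step h: mean-centered covariates, 50% baseline risk
        eta = beta_gd * (x["GD"] - mean_gd) + beta_q * (x["Q"] - mean_q)
        x["Y"] = int(rng.random() < 1.0 / (1.0 + math.exp(-eta)))
    return people


people = simulate_replicate((0.3, 0.3, 0.9), (0.7, 0.7, 0.9),
                            beta_gd=math.log(1.2), beta_q=math.log(3.0))
print(len(people), sum(x["Y"] for x in people))
```

Each simulated record can then be passed to whichever logistic regression routine is used to fit models (1) to (4).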
Note that, in the simulations, instead of directly simulating the observed LD in the
admixed populations, we generate the two components of the LD between GM and GD (as
described in the graphical model) separately: LD within each parental ancestry is
assigned as D'1 and D'2; for the admixture LD, following its definition, we assign different
allele frequencies to parental ancestries A1 and A2 to generate the induced
admixture LD.
3.2.4 Scenarios
Across all the simulation scenarios (Table 3-1), we vary the population-specific
parameters for each parental ancestry (p, q, and D') and the causal model parameters
($\beta_Q$ and $\beta_{G_D}$).
In scenario A, there is no genetic causal effect ($\beta_{G_D} = \log(1)$) but a strong global
ancestry effect ($\beta_Q = \log(3.0)$). The allele frequency is fixed at 0.3 in ancestral population
A1 at both GM and GD, with the allele frequency in ancestral population A2 varying from
0.3 to 0.7 to simulate the induced admixture LD. The LD is the same (D'=0.9) within
each ancestral population. In this scenario, we simulate different allele frequencies
between ancestral populations to investigate the efficiency of control for confounding by
individual ancestries in models (2) Y ~ G + Q and (3) Y ~ G + Q + L. In scenario B, we
simulate the genetic causal effect ($\beta_{G_D} = \log(1.2)$) when no effect of Q on Y ($\beta_Q = \log(1)$)
is present. We simulate a positive correlation (LD) between GM and GD in each ancestral
population: D'1 = D'2 = 0.9. The induced admixture LD due to differential allele
frequencies between ancestries is simulated as described in scenario A and, in addition,
we simulate induced admixture LD in the same direction as, and in a different direction from, the
LD in the original ancestral populations. Finally, in scenario C, we simulate a genetic
causal effect ($\beta_{G_D} = \log(1.2)$) as well as an effect of Q on Y ($\beta_Q = \log(3.0)$), and we
simulate the same allele frequency across ancestral populations (p=q=0.3). In this
scenario, we vary the D' difference between populations from 0 to 1.8 (D'1=0.9, D'2
varies from -0.9 to 0.9) to gauge the impact of heterogeneity.
Table 3-1 Simulated scenarios A-C.

Scenario   Q effect $\beta_Q$   SNP effect   Allele freq difference^a   D' difference (heterogeneity)
A          log(3.0)             None         [0,0.4]                    None
B          None                 log(1.2)     [0,0.4]                    None
C          log(3.0)             log(1.2)     None                       [0,1.8]

a Allele frequencies at both the disease and the marker loci.
For each simulated scenario, we create 1,800 individuals with an equal number of
individuals (NL = 600) within each local ancestry group (L = {0, 1, 2}, where L indicates
the number of copies from ancestral population A2, as defined in step a)). Conditional on L and the
corresponding specified parameters for allele frequency, LD and risk, we generate GD,
GM, and Q. We then probabilistically generate case-status for all 1,800 individuals using
a logistic regression model incorporating the disease locus GD and the individual global
ancestry Q. Variables in the logistic regression model are mean centered and there is a
baseline risk of 50%, thus resulting in approximately equal numbers of cases and controls
for each replicate. This simulation framework does not directly simulate potential
confounding or heterogeneity by L. Rather, potential confounding and heterogeneity are
induced by simulating haplotypes, global ancestry and disease status conditional on local
ancestries as reflected in our graphical framework. Specifically, potential confounding is
induced via the path GM-L-Q-Y. The Type I error and empirical power are calculated as
the proportion of significant tests ($\alpha$ = 0.05) among 10,000 replicates.
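The empirical Type I error or power is simply a rejection proportion, and its Monte Carlo precision follows from the binomial standard error. The helper below is a minimal sketch with hypothetical names; it assumes one p-value per replicate has already been collected.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replicates with p < alpha and its binomial standard error."""
    n = len(p_values)
    rate = sum(p < alpha for p in p_values) / n
    std_err = math.sqrt(rate * (1.0 - rate) / n)
    return rate, std_err


# Toy input: four replicate p-values, two of them below 0.05.
print(rejection_rate([0.01, 0.20, 0.04, 0.50]))
```

With 10,000 replicates and a true rejection rate of 0.05, the standard error is about 0.002, so estimated Type I error rates outside roughly 0.046 to 0.054 are unlikely to be simulation noise.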
3.2.5 USC Children's Health Study (CHS)
The CHS is an ongoing cohort study investigating environmental and genetic influences
on asthma in children. The study design is discussed in detail elsewhere (Li et al.;
McConnell et al.; Navidi et al.). The CHS GWAS is a nested case-control study from the
ongoing longitudinal CHS cohort with approximately equal number of cases and controls
for non-Hispanic whites and Hispanics. All CHS subjects and their parents gave informed
consent and the study was approved by the University of Southern California Institutional
Review Board. In this study, we include a total of 2,839 samples from two self-reported
ethnic groups: 1,396 non-Hispanic Whites and 1,171 Hispanics. Among non-Hispanic
samples there are 595 cases and 801 controls, and there are 532 cases and 639 controls
among Hispanics. We analyze the CHS data stratified by ethnicity and in a combined
sample, assuming that the non-Hispanic white individuals all have two copies of
European local ancestry at each location. Genotyping of these samples was performed at
the USC Genome Center utilizing both the Illumina HumanHap550 and the Illumina
Human 610-Quad BeadChips and the analysis is conducted on 437,599 autosomal SNPs
passing a stringent quality control procedure. We perform several genome-wide scans in
the CHS samples with additional covariates (age, gender, community of residence, and
self-reported ethnicity).
We estimate individual local ancestry L through the program HAPMIX (Price,
Tandon, et al.). HAPMIX requires two parental reference populations and estimates
individual local ancestry from dense genotyping chips. We focus our investigation on the
confounding and heterogeneity due to the structure defined by a European/Amerindian
admixture. As the Amerindian component has greatest similarity with Asian populations,
and since a homogeneous or non-admixed Amerindian population is unavailable in
HapMap, we used the homogeneous HapMap phase III CEPH and East Asian (Han
Chinese in Beijing, China and Japanese in Tokyo, Japan) populations as the two
reference populations for local admixture estimation ("The International Hapmap Project";
Altshuler et al.). The average of all local ancestry estimates across the genome for each
individual is used to estimate global ancestry ($Q_i = \sum_{m=1}^{M} L_{im} / 2M$, summing over the M SNP locations) for each individual.
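The averaging of local ancestry into global ancestry described above amounts to the following one-line computation. The sketch assumes that the local ancestry estimate at each SNP is expressed as an expected number of copies between 0 and 2; the function name is hypothetical.

```python
def global_ancestry(local_copies):
    """Global ancestry proportion from per-SNP local ancestry copy numbers.

    local_copies: estimated number of copies (0 to 2, possibly fractional)
    of the reference ancestry at each of the M SNP locations.
    """
    m = len(local_copies)
    return sum(local_copies) / (2.0 * m)


# Four SNP locations with 2, 2, 1 and 0 copies give a proportion of 5/8.
print(global_ancestry([2, 2, 1, 0]))
```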
3.3 Results
3.3.1 Simulation result
3.3.1.1 Confounding
In scenario A (Figure 3-2), at the marker locus GM, when the allele frequency difference
between ancestral populations is greater than 0.1, the crude model has substantially
elevated Type I error rate while models (2) Y ~ G + Q and (3) Y ~ G + Q + L efficiently
control for the confounding. The pattern is the same when testing the disease locus GD.
In scenario B, there is no confounding path simulated and all models have the correct test
size (not shown). Reflecting these patterns, adjustment by global ancestry (when needed)
results in an unbiased effect estimate. In contrast, there is very little impact on the effect
estimate from adjustment with local ancestry. However, when the induced LD due to
differential allele frequency between ancestries is in the same direction as the LD from
the parental ancestries (Figure 3-3 (a)), adjusting for local ancestry results in a slight loss
in power. When the induced LD is in the opposite direction to the LD from the parental
ancestries (Figure 3-3 (b)), additional adjustment of local ancestries results in a slight
increase in power. However, this decrease/increase in power is negligible for allele
frequency differences <0.1. When testing the disease variant directly, as opposed to a
marker, the pattern is the same as that shown in Figure 3-3 (a).
Figure 3-2 Type I error rates with and without control for confounding at the marker
locus (GM) in scenario A.

Figure 3-3 Effect of adjustment by global and local ancestries on power in scenario B.
(a) Induced LD is in the same direction as the LD in the parental ancestries.
(b) Induced LD is in a different direction from the LD in the parental ancestries.
3.3.1.2 Heterogeneity
Figure 3-4 shows the performance of including the GM × L interaction term by
comparing model (3) Y ~ G + Q + L and model (4) Y ~ G + Q + L + G×L. For tests at the
marker locus GM, when the LD between GM and GD is similar in the ancestral
populations (D' difference < 0.7), there is a reduction in power for the two-degree-of-freedom test of
the joint effect of GM and GM × L from model (4). As the difference in LD increases
(> 0.7), model (4) has similar or greater power than model (3). For testing a disease
variant (D' difference = 0), the loss in power is 0.10 for the simulated scenario.
Figure 3-4 Comparison of power in scenario C when there is heterogeneity due to
differential LD between ancestries.
3.3.2 Results from the Children's Health Study
Figure 3-5 displays the local ancestry estimates across chromosome 4 for five individuals
sampled across a range of global coefficients of ancestry (Q = {0, 0.25, 0.5, 0.75, 1.0}).
For the individuals with European global ancestry not equal to 0% or 100% there is
substantial diversity as to which regions contain more than zero copies of local European
ancestry. Averaging the estimated local ancestries at each SNP location across all
Hispanic individuals shows that, on average, each location has slightly more than one
copy of European ancestry (bottom panel in Figure 3-5). The average of all local
ancestry estimates across the genome and across all the individuals yields an estimate of
average European ancestry around 1.35 copies.
Figure 3-5 Local ancestry along chromosome 4 for selected CHS Hispanic samples.
Global ancestry Q represents the estimated European ancestry proportion.
When investigating the impact of confounding in the CHS GWAS, as shown in
Figure 3-6, a crude analysis Y~G among the combined Hispanic and non-Hispanic White
samples results in a large overdispersion ($\lambda_1$ = 1.27), which is reduced
substantially with adjustment by global or both global and local ancestries in models (2)
Y~G+Q ($\lambda_2$ = 1.02) and (3) Y~G+Q+L ($\lambda_3$ = 1.02). However, the overdispersion parameter
is a summary across the entire genome. In the CHS data for example, about 50% of the
markers have a smaller p-value from model (3) Y~G+Q+L compared to model (2)
Y~G+Q. About 10% of the SNPs found to be potentially interesting (p<0.05) from an
analysis with both global and local ancestry are not noteworthy with an analysis adjusting
only for global ancestry. In terms of effect estimates, among those SNPs potentially
interesting (p < 0.05) from an analysis adjusting for global ancestry only, additional
adjustment for local ancestry results in a 10% or greater change in the effect estimate for
over 9% of the SNPs.
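The text does not state how the overdispersion parameter is estimated. A common choice, assumed here purely for illustration, is the genomic-control inflation factor: the median of the 1-df chi-square statistics across SNPs divided by the median of a chi-square distribution with 1 degree of freedom (about 0.455). The sketch below recovers the statistics from two-sided p-values; the function name is hypothetical.

```python
from statistics import NormalDist, median

# Median of a chi-square(1) variable: the squared 75th percentile of N(0, 1).
CHI2_1DF_MEDIAN = NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(p_values):
    """Genomic-control inflation factor from two-sided 1-df test p-values.

    Each p-value must lie strictly between 0 and 1 (p = 1 is also accepted).
    """
    std_normal = NormalDist()
    chi2_stats = [std_normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2_stats) / CHI2_1DF_MEDIAN


# Evenly spread p-values mimic a scan with no inflation, so the factor is near 1.
uniform_like = [(i + 0.5) / 1000 for i in range(1000)]
print(genomic_inflation(uniform_like))
```

A value near 1 indicates little genome-wide inflation, which is how the reduction from 1.27 to 1.02 is read above; as the text notes, such a summary can hide changes at individual SNPs.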
Figure 3-6 Q-Q plots for models (1)-(4) among combined samples.
Figure 3-7 shows the p-values from a GWAS analysis of the combined non-Hispanic white and
Hispanic populations in the CHS using models (3) and (4). The most
notable SNP rs10119122 ($p = 4.5 \times 10^{-8}$) from model (3) remains noteworthy with the 2-df
test from model (4). There is an additional SNP (rs10519951) that lacks association
with model (3) (p = 0.50), but is noteworthy with model (4) as indicated by a much
smaller p-value ($p = 8.5 \times 10^{-7}$).
Figure 3-7 Analysis results across models (3) and (4) for combined samples.
Table 3-2 provides the SNP effect estimates and corresponding p-values from
both models for these two SNPs from the ethnic-specific and combined analyses. In
addition, for model (4) in the Hispanic-only analysis and the combined analysis we
present the expected genetic effect estimate within each local ancestry stratum, obtained by
stratifying individuals by their estimated local ancestry at each SNP and using the
stratum-specific effect estimates from model (4). For SNP rs10119122 there are similar
allele frequencies across the two populations and there is very little heterogeneity
indicated from model (4) in Hispanics. Thus, the marginal effect estimate from model (3)
in the Hispanic-only analysis ($\beta_{G_M} = -0.30$) is similar to the non-Hispanic estimate
($\beta_{G_M} = -0.36$). It is also interesting to note that, while the non-Hispanic white only analysis
of rs10119122 gives an effect estimate of $\beta_{G_M} = -0.36$, the Hispanic-only
analysis within the stratum of individuals with two copies of European ancestry gives an
almost identical estimate, $\beta_{G_M} = -0.37$.
In contrast, SNP rs10519951 has very little evidence for an effect in the non-Hispanic
White analysis from model (3) ($\beta_{G_M} = 0.19$, p = 0.07). In the Hispanic-only
analysis, rs10519951 has a sizeable inverse effect on asthma from model (3) ($\beta_{G_M} = -0.29$,
$p = 5.2 \times 10^{-3}$) and further evidence of heterogeneity by local ancestry from model
(4); specifically, the test for the interaction term alone has a p-value of $1.6 \times 10^{-5}$. Also, the
estimates of effect are comparable between the non-Hispanic whites only ($\beta_{G_M} = 0.19$)
and the Hispanics only within the stratum of individuals having two copies of European
ancestry ($\beta_{G_M} = 0.16$). In the Hispanic-only analysis within the stratum of individuals carrying
0 copies of European ancestry the estimate is $\beta_{G_M} = -1.12$. The contrast in estimates
across strata is reflected in the more significant result from model (4) in the combined
sample ($p = 8.5 \times 10^{-7}$).
Table 3-2 P-value and effect estimate for selected markers across ethnic groups and models.

                       non-Hispanic White               Hispanics                                              Combined^e
Marker      Strata^a   N^b    Freq   Model 3^c          N      Freq   Model 3^c           Model 4^d        N      Freq   Model 3^c           Model 4^d
rs10519951  All        1386   0.19   0.19 (7.1×10^-2)   1160   0.24   -0.29 (5.2×10^-3)   - (1.6×10^-5)    2546   0.21   -0.05 (5.0×10^-1)   - (8.5×10^-7)
            L0         0      -                         119    0.34                       -1.12            119    0.34                       -1.17
            L1         0      -                         531    0.26                       -0.48            531    0.26                       -0.49
            L2         1386   0.19                      510    0.19                       0.16             1896   0.19                       0.18
rs10119122  All        1372   0.66   -0.36 (2.1×10^-5)  1143   0.62   -0.30 (1.8×10^-3)   - (4.7×10^-3)    2515   0.64   -0.34 (4.5×10^-8)   - (1.6×10^-7)
            L0         0      -                         12     0.46                       0.03             12     0.46                       0.04
            L1         0      -                         410    0.60                       -0.17            410    0.60                       -0.17
            L2         1372   0.66                      721    0.64                       -0.37            2093   0.65                       -0.37

a Estimated individual local ancestry. L is rounded into three categorical groups: 0 (L ≤ 0.5), 1 (0.5 < L ≤ 1.5), and 2 (L > 1.5).
b Sample size within each local stratum.
c Effect estimate $\beta_{G_M}$ of the SNP marginal effect followed by the corresponding p-value in parentheses from Model (3):
Y~G+Q+L.
d The expected effect estimate $\beta_{G_M}$ of the SNP effect within each local stratum, with the 2-df test p-value in parentheses, from
Model (4): Y~G+L+G×L+Q.
e Combined non-Hispanic White and Hispanic samples for analysis.
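The local ancestry strata used in Table 3-2 (footnote a) can be reproduced with a small helper. This is a sketch with a hypothetical function name; it assumes the estimated local ancestry is an expected copy number between 0 and 2 and places the boundary values 0.5 and 1.5 in the lower stratum.

```python
from collections import Counter


def local_stratum(l_hat):
    """Map an estimated local ancestry copy number to stratum 0, 1 or 2."""
    if l_hat <= 0.5:
        return 0
    if l_hat <= 1.5:
        return 1
    return 2


# Count individuals per stratum at one SNP from their estimated copy numbers.
estimates = [0.02, 0.48, 0.97, 1.20, 1.51, 1.98, 2.00]
print(Counter(local_stratum(x) for x in estimates))
```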
In contrast, SNP rs10519951 has very little evidence for an effect in the non-Hispanic
white analysis from either model 4 ($\beta_{G_M} = 0.08$, p = 0.38) or model (5) (p =
0.02), albeit there is some evidence of heterogeneity. In the Hispanic only analysis,
rs10519951 has a sizeable inverse effect on asthma from model (4) ($\beta_{G_M} = -0.29$,
$p = 5.2 \times 10^{-3}$) and further evidence of heterogeneity by local ancestry from model (5)
($p = 1.6 \times 10^{-5}$). For both the non-Hispanic white and Hispanic analyses, there is a larger
inverse effect for those individuals with zero copies of European ancestry. This
similarity in heterogeneity by local ancestry is reflected in the significant result from
model (5) in the combined sample ($p = 1.9 \times 10^{-7}$).
3.4 Discussion
When confounding arises through global ancestry via a path that links an external factor
to the marker being evaluated, then global ancestry alone can control for the confounding.
This assumes that the estimated global ancestry accurately captures the underlying factor.
Previous studies have argued that adjusting for local ancestry is necessary for controlling
for confounding (Kang et al.; Qin et al.; X. Wang et al.); however, these papers simulated
local ancestry as a strict confounder and did not allow for induced LD in admixed
populations. Our simulations demonstrate that the impact of adjustment for local ancestry is
more nuanced within admixed populations. When the admixture LD is in
a different direction from the LD in the parental ancestries, there is a reduction in the
magnitude of the LD in the admixed population, and additional adjustment for local
ancestry can increase the power to detect the true association at the marker locus. However,
this potential gain in power comes with a risk: when the admixture LD is in the same
direction as the LD within the ancestral populations, adjustment for local ancestry will
result in over-adjustment. To quantify the occurrence of the simulated scenarios in real
populations, we examine data from the ENCODE regions from the HapMap ENCODE
Genotyping Project ("The Encode (Encyclopedia of DNA Elements) Project").
For
comparison of the two relevant ancestral populations, we estimate allele frequency and D'
within the CEU and East Asian samples.
To mimic the potential LD between a genotyped SNP and an unobserved disease variant in a genome-wide scan, we limit our comparison to SNPs genotyped in the CHS study (treated as marker loci) versus SNPs not genotyped but contained within ENCODE (treated as disease loci). Figure 3-8(a) shows the joint distribution of D' within the Asian and the CEU populations for the ENCODE regions. Across all the estimated pairwise D' values, about 38% are in opposite directions in the two populations (the gray areas). For the 62% of markers that have LD in the same direction in the parental populations (the white areas in Figure 3-8(a)), the direction of the induced LD determines whether there is a resulting over-adjustment or gain in power. The joint distribution of the direction of D' within the parental ancestries and the direction of the induced LD is shown in Figure 3-8(b) for these variants (the variants from the white regions in Figure 3-8(a)). In Figure 3-8(b), the x-axis is the direction of D' within the parental ancestries (positive or negative) and the y-axis is the direction of the induced LD.
When the direction of the allele frequency difference between ancestries at the marker locus is the same as that at the disease locus (e.g., at both loci the allele frequency in European ancestry is higher than that in Amerindian ancestry), we assign a positive sign to the induced LD; when the allele frequency differences at the marker and the disease loci are in opposite directions, we assign a negative sign to the induced LD. Among these loci, the induced LD and the LD within the ancestral populations are in the same direction for about 57%.
(a) Direction of the LD in the parental ancestries.
(b) Direction of the induced LD and the LD in the parental ancestries.

Figure 3-8 Plausibility of scenario B in the ENCODE regions.
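The sign convention for the induced LD and its comparison with the parental LD can be sketched as follows. This is only an illustrative sketch: the function names and the category labels are hypothetical, and ties (a zero difference or a zero D') are simply reported as undetermined, which is our assumption rather than a rule stated in the analysis.

```python
def sign(x):
    """Return +1, -1, or 0 according to the sign of x."""
    return (x > 0) - (x < 0)


def induced_ld_sign(marker_freq_pop1, marker_freq_pop2,
                    disease_freq_pop1, disease_freq_pop2):
    """Positive when the allele frequency difference between ancestries points
    the same way at the marker and disease loci, negative when it points in
    opposite ways."""
    return (sign(marker_freq_pop1 - marker_freq_pop2)
            * sign(disease_freq_pop1 - disease_freq_pop2))


def classify_pair(d_prime_pop1, d_prime_pop2, induced_sign):
    """Compare the direction of the induced LD with the parental LD."""
    s1, s2 = sign(d_prime_pop1), sign(d_prime_pop2)
    if 0 in (s1, s2, induced_sign):
        return "undetermined"
    if s1 != s2:
        return "parental LD in opposite directions"
    if induced_sign == s1:
        return "same direction as parental LD"
    return "opposite direction to parental LD"
```

For example, `classify_pair(0.5, 0.3, induced_ld_sign(0.8, 0.2, 0.6, 0.1))` falls in the "same direction" category, the situation in which adjustment for local ancestry risks over-adjustment.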
When investigating heterogeneity of effect estimates by local ancestry, there is also a potential loss in power from testing both the SNP main effect and the interaction via a 2-df test (Figure 3-4). In the ENCODE regions, about 30% of the estimated differences in D' between the populations are greater than 0.7 (Figure 3-9). Our simulation results demonstrate that model (4), with the SNP-local ancestry interaction, has greater power than the conventional model for D' differences above 0.7. Thus, one may expect an increase in power for about 30% of the SNPs, with the remaining 70% having no change or a slight reduction in power. Given this tradeoff, a GWAS for discovery using only model (4) may not be the most advantageous approach. In practice, however, most investigators will first perform an analysis without an interaction term. Subsequent analyses that include the interaction term offer the potential to uncover previously unidentified regions. In such a two-step approach, one would need to consider the impact on type I error, but for discovery and further follow-up such impact may be negligible.
Figure 3-9 Distribution of the D' difference between the CEU and the Asian populations
in the ENCODE regions.
The graphical model we presented helps to construct and interpret our empirical investigation of the CHS data. Following this graphical model, we view global ancestry as a composite of all local ancestry estimates along the genome, i.e., the L-Q path. Thus, we estimate global ancestry as an average of all local ancestry estimates. This is in contrast to the more commonly used approach of estimating global ancestry using selected ancestry informative markers (AIMs) and the STRUCTURE program (Falush, Stephens and Pritchard "Inference of Population Structure Using Multilocus Genotype Data: Linked Loci and Correlated Allele Frequencies"; Falush, Stephens and Pritchard "Inference of Population Structure Using Multilocus Genotype Data: Dominant Markers and Null Alleles"; Hubisz et al.; Pritchard, Stephens and Donnelly), or EIGENSTRAT (Price, Patterson, Plenge, et al.). In the graphical modeling framework, these approaches could be represented with a box for observed AIMs pointing directly to Q. For comparison, we use the HapMap Phase III (release 2) samples as reference populations and estimate global ancestry using the program STRUCTURE with 1,637 selected AIMs (Seldin et al.; Shtir et al.; Smith et al.; Tian et al.). The estimated individual global ancestries are highly correlated ($R^2 = 0.96$) with the global ancestry values calculated by averaging the estimated individual local ancestry across the genome (437,599 loci in total).
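As a minimal sketch of the averaging approach, the following computes a global ancestry value from per-locus local ancestry estimates and the squared correlation between two sets of global ancestry estimates. The function names are hypothetical, and the division by 2 (to express ancestry as a proportion rather than a number of copies) is an assumption that can be switched off by setting `copies=1`.

```python
from statistics import mean


def global_ancestry(local_ancestry, copies=2):
    """Average per-locus local ancestry (0 to 2 copies) across the genome.

    Dividing by `copies` rescales the result to a proportion in [0, 1].
    """
    return mean(local_ancestry) / copies


def r_squared(x, y):
    """Squared Pearson correlation between two lists of estimates."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)
```

In practice `x` would hold the STRUCTURE-based estimates and `y` the averaged local ancestry values, one entry per individual.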
For the CHS GWAS, the most notable SNP (rs10119122) from the marginal test of the SNPs remains noteworthy with the 2-df test from model (4). In contrast, rs10519951, a SNP in NR3C2, only has a significant p-value from model (4) ($p = 8.5 \times 10^{-7}$ in the combined samples). Although rs10519951 does not reach the conventional cutoff for genome-wide significance for main effects (i.e., $\alpha = 5 \times 10^{-8}$), the genome-wide significance level for a 2-df test that involves correlated local ancestries across the genome is unclear and is an active area of research. Such significance levels may depend on the specific admixed population investigated, since the distribution of local ancestry for each individual across all locations in the genome will depend upon the sample. In this case a permutation test for determining significance may be required. Whether the top SNPs are strictly significant or not, there is clear potential for additional information to be gained from including local ancestry in a test of heterogeneity. Overall for the CHS, and consistent with results from ENCODE, the interaction model results in smaller p-values for 35% of the SNPs across the genome. Notably, for those SNPs with a smaller p-value from model (4), the change is often substantial, suggesting that a great deal of additional information may be captured by jointly considering the main and interaction terms.
When testing the disease variant, adjusting for local ancestry most often results in
a loss of power from over-adjustment when the allele frequency is different between
ancestries.
Likewise, when investigating a measured causal variant in an admixed
population there will be no influence of differential LD between the marker and the
causal variant. Thus, the inclusion of a SNP by local ancestry interaction term will not
capture any additional information and stratified estimates across local ancestry strata
should be similar. This offers a potential approach to leverage differential LD patterns in
an admixed population to help identify causal variants when performing fine-scale
mapping or sequencing studies (Stacey et al.).
In addition to capturing the heterogeneity of the SNP effect among admixed populations, it is possible that this observed effect is induced by another genetic or environmental factor that drives the observed effect modification, is correlated with self-identified ethnicity, and is thus related to local ancestry via global ancestry (i.e., X-Q-L). To investigate the source (environmental or genetic) of the heterogeneity, one can perform further analyses within strata of individual global ancestry and, if available, strata of self-reported ethnicity. For SNP rs10519951, Table 3-3 shows that the heterogeneity captured by local ancestry is attenuated when stratifying by global ancestry or self-identified ethnicity. These results suggest that this particular observed heterogeneity is most likely due to local genetic structure and not global genetic or environmental differences.
Table 3-3 Investigation of heterogeneity for SNP rs10519951 in the Children's Health Study combined samples.

Strata                     N      Allele freq   beta    p-value
All
  Combined                 2546   0.21          -0.05   $5.0 \times 10^{-1}$ (a)
L (b)                                                   $8.5 \times 10^{-7}$
  L0 (Asian)               119    0.34          -1.17
  L1                       531    0.26          -0.49
  L2 (European)            1896   0.19           0.18
Q (b)                                                   $2.9 \times 10^{-2}$ (d)
  Q0.25 (Asian)            21     0.43          -0.76
  Q0.5                     480    0.30          -0.35
  Q0.75 (European)         2045   0.19           0.06
E (c)                                                   $5.6 \times 10^{-3}$ (d)
  Hispanics                1160   0.24          -0.29
  Non-Hispanic White       1386   0.19           0.19

(a) Conventional analysis testing the SNP main effect only.
(b) Estimated individual ancestry is rounded into three categorical groups when presenting the number of samples and the allele frequency within each stratum. L: 0 ($L \le 0.5$), 1 ($0.5 < L \le 1.5$), and 2 ($L > 1.5$). Q: 0.25 ($Q \le 0.33$), 0.5 ($0.33 < Q \le 0.66$), and 0.75 ($Q > 0.66$).
(c) Self-reported ethnicity.
(d) 2-df test of the SNP by strata interaction and the SNP marginal effect.
We have demonstrated that one needs to consider the impact of adjustment by
local ancestry in addition to the common practice of adjusting for global ancestry. While
the adjustment for local ancestry reflects the induced admixture LD within admixed
populations, the impact of inclusion of local ancestry depends upon the LD patterns in the
ancestral populations. Furthermore, we have also demonstrated the potential for a 2-df
test of SNP main effect and SNP by local ancestry interaction to increase power when
there is substantial differential LD between ancestral populations. We realize that for
most GWAS utilizing admixed populations, investigators will first scan the genome with
a marginal test of association. Thus, we view analyses with the interaction term as
secondary follow-up to uncover previously unidentified regions with substantial
heterogeneity of SNP effect by local ancestry.
Chapter 4 Mapping by admixture linkage disequilibrium

4.1 Introduction
4.1.1 Concept for admixture mapping
Mapping by admixture linkage disequilibrium is also known as admixture mapping. It is a test for association of the disease with ancestry, conditional on the admixture. When genetic risk variants differ in frequency between ancestral populations, cases are more likely to inherit alleles derived from the ancestral population that carries more disease susceptibility alleles (Patterson et al.). As a result, in regions near the disease locus, cases tend to have a higher ancestry proportion from the population in which the disease is more prevalent. The major steps in admixture mapping are inferring ancestral origins at each locus and testing for excess ancestry proportions among cases. The number of markers required for admixture mapping depends on the number of generations since admixture and the information content for ancestry of the markers (a function of allele frequencies in the ancestral populations) (McKeigue).
4.1.2 Testing for excess ancestry proportions
Admixture mapping using family data tests for excess transmission of the genome that derives from one ancestral population (McKeigue; Zheng and Elston). It is a test of association conditional on parental admixture. Near the disease locus, the ancestry is skewed toward one of the ancestral populations given the ancestry of the parents.
Admixture mapping using unrelated individuals can be conducted in either case-only or case-control samples (Montana and Pritchard). The main idea of the case-control study design is to test whether the mean local ancestry among the cases diverges significantly from the mean local ancestry among the controls at each locus:
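\[
T_{CC} = \frac{(\bar{L}_d - \bar{L}_c) - 2(\bar{Q}_d - \bar{Q}_c)}{SD(\bar{L}_d - \bar{L}_c)}
\]
\[
\bar{L}_d = \frac{1}{N_d}\sum_{i}^{N_d} L_{i,d}, \qquad
\bar{L}_c = \frac{1}{N_c}\sum_{i}^{N_c} L_{i,c}, \qquad
\bar{Q}_d = \frac{1}{N_d}\sum_{i}^{N_d} Q_{i,d}, \qquad
\bar{Q}_c = \frac{1}{N_c}\sum_{i}^{N_c} Q_{i,c}
\]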
In this equation, $N_d$ and $N_c$ represent the number of cases and controls, respectively. $L_{i,d}$ represents the local ancestry estimate for individual $i$ among the cases and $Q_{i,d}$ represents the global ancestry estimate for the same individual. $\bar{L}_d$ and $\bar{L}_c$ represent the average local ancestry at the tested locus among the cases and controls, respectively. $\bar{Q}_d$ and $\bar{Q}_c$ represent the average global ancestry across the cases and controls, respectively. Note that environmental and social factors may also differ between ancestral populations (Risch et al.). For a case-control study design, an overall difference in ancestry (a difference in global ancestry) could lead to confounding of the association between local ancestry and disease. Therefore, the test statistic $T_{CC}$ tests whether the difference in local ancestry between cases and controls diverges significantly from the difference in global ancestry between cases and controls.
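A minimal sketch of the case-control statistic is given below. The text does not specify how $SD(\bar{L}_d - \bar{L}_c)$ is estimated; here we assume the usual unpooled standard error of a difference in means, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_case_control(local_cases, global_cases, local_controls, global_controls):
    """Difference in mean local ancestry between cases and controls,
    offset by twice the difference in mean global ancestry."""
    diff_local = mean(local_cases) - mean(local_controls)
    diff_global = mean(global_cases) - mean(global_controls)
    # Assumed estimator: unpooled standard error of the difference in means.
    se = sqrt(variance(local_cases) / len(local_cases)
              + variance(local_controls) / len(local_controls))
    return (diff_local - 2 * diff_global) / se
```

When the difference in local ancestry is fully explained by the difference in global ancestry, the numerator is zero and the statistic is zero regardless of how different the two groups are overall.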
The case-only study design tests if the mean local ancestry significantly diverges
from the genome-wide mean (global ancestry) among the set of cases:
\[
T_{CO} = \frac{\bar{L}_d - 2\bar{Q}_d}{SD(\bar{L}_d)}
\]
The case-only analysis uses the global ancestry of the cases themselves in place of a control group, which controls for the potential confounding that may arise in the case-control admixture study design. However, there are issues that arise when using the case-only study design in an admixture scan. For example, if the local ancestry estimate is biased in certain regions among both cases and controls (elevated ancestry proportions towards one parental ancestry), a case-only analysis will lead to spurious results.
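The sensitivity of the case-only statistic to a regional bias in local ancestry estimation can be illustrated with a small sketch. The estimator of $SD(\bar{L}_d)$ (sample standard deviation divided by $\sqrt{N_d}$), the function name, and the toy numbers are all assumptions made for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def t_case_only(local_cases, global_cases):
    """Mean local ancestry among cases compared with twice the mean
    global ancestry among the same cases."""
    # Assumed estimator: standard error of the mean local ancestry.
    se = stdev(local_cases) / sqrt(len(local_cases))
    return (mean(local_cases) - 2 * mean(global_cases)) / se


# Toy data: local ancestry agrees with global ancestry on average.
local_cases = [1, 0, 1, 0, 1, 0, 1, 0]
global_cases = [0.25] * 8
unbiased = t_case_only(local_cases, global_cases)

# The same cases with an upward bias of 0.3 in the local ancestry estimate.
# Controls would show the same bias, but the case-only test never sees them.
biased = t_case_only([x + 0.3 for x in local_cases], global_cases)
```

Here `unbiased` is zero while `biased` is clearly positive, even though nothing distinguishes cases from controls at this locus.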
4.1.3 Advantages of admixture mapping
First, linkage disequilibrium decays rapidly with distance. Genome-wide association studies using background LD (LD within the ancestral populations) require a relatively dense set of markers (Gabriel et al.). In contrast, admixture mapping takes advantage of the extended LD within admixed populations. Considering the size of haplotypes and the size of ancestry blocks in an admixed population, admixture mapping using admixture LD requires fewer markers for testing regions with excess ancestry proportions from one population. Early studies focused on the use of relatively evenly spaced ancestry informative markers (AIMs) across the genome for capturing each region of local ancestry and for testing association. In recent studies, admixture mapping with randomly selected markers has become an alternative to mapping with AIMs. A second advantage of admixture mapping is that, for a rare disease, a case-only analysis is more efficient than a case-control analysis (Hoggart, Shriver, et al.). With the establishment of reliable approaches for local ancestry estimation, there have been many recent papers that develop models to incorporate local ancestry information in genetic association studies with admixed samples (Chanock; Ding et al.; Pasaniuc, Zaitlen, et al.; Shriner, Adeyemo and Rotimi; Zhu et al.; Fejerman et al.).
4.1.4 Purpose of this study
We propose a general regression framework to perform admixture mapping for both case-only and case-control study designs. We then test the performance of the proposed models in comparison with existing approaches. Finally, we apply the various models to a real data set consisting of African Americans from the Multiethnic Cohort Study of prostate cancer. For comparison with the work presented in Chapter 3, we compare the performance of the models leveraging admixture information with that of a model that incorporates a SNP by local ancestry interaction.
4.2 Materials and Methods

4.2.1 Regression models
4.2.1.1 Proposed models for admixture mapping in regression framework

For the case-control analysis for admixture mapping, we can rewrite the test statistic as:
\[
T_{CC} = \frac{(\bar{L}_d - 2\bar{Q}_d) - (\bar{L}_c - 2\bar{Q}_c)}{SD(\bar{L}_d - \bar{L}_c)}
\]
Therefore, the t-test for the case-control analysis for admixture mapping is mathematically identical to a regression model $L_i - 2Q_i = \alpha + \beta_Y Y_i + \varepsilon_i$ with a test of whether the coefficient for $Y$ (case-control status), $\beta_Y$, is significantly different from zero. Generalizing this model to the regression we use for testing the association between ancestry and disease outcome, we propose a regression model (CCreg) to implement the original idea of admixture mapping for the case-control study design:
\[
\mathrm{logit}(Y_i) = \alpha + \beta_L L_i + \beta_Q Q_i + \varepsilon_i
\]
\[
H_0: \beta_L = 0 \qquad H_a: \beta_L \neq 0
\]
In this model, $L_i$ represents the estimated local ancestry in terms of the number of ancestral chromosomal segments (in the range from 0 to 2) for individual $i$ at the tested locus, and $Q_i$ represents the estimated global ancestry as an average of the local ancestry across the genome divided by 2 (in the range from 0 to 1) for the same individual. $Y_i$ is an indicator of disease status (cases vs. controls), and $\beta_L$ represents the marginal effect of local ancestry. The model tests the association between $Y_i$ and local ancestry ($L_i$) with adjustment for global ancestry ($Q_i$) to control for potential confounding.
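The link between the case-control t-test and the linear regression of $L_i - 2Q_i$ on case status rests on a standard property of least squares with a binary regressor: the fitted slope equals the difference in group means. The sketch below checks this numerically on toy data; the helper name and the data are hypothetical.

```python
from statistics import mean


def ols_slope_intercept(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


# Toy data: case status, local ancestry (0 to 2), global ancestry (0 to 1).
status = [1, 1, 1, 0, 0, 0]
local = [2, 1, 2, 1, 0, 1]
glob = [0.6, 0.5, 0.7, 0.4, 0.3, 0.5]

# Response of the linear model: L_i - 2 * Q_i.
response = [l - 2 * q for l, q in zip(local, glob)]
slope, intercept = ols_slope_intercept(status, response)

mean_cases = mean(r for r, y in zip(response, status) if y == 1)
mean_controls = mean(r for r, y in zip(response, status) if y == 0)
# slope matches mean_cases - mean_controls; intercept matches mean_controls.
```

The slope is therefore the numerator of $T_{CC}$, which is why testing the coefficient of $Y$ reproduces the case-control comparison.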
Similarly, for the case-only analysis, we can rewrite the test statistic as:
\[
T_{CO} = \frac{(\bar{L}_{z=0} - 2\bar{Q}_{z=0}) - 0}{SD(\bar{L}_{z=0})}
\]
Here, we define two groups: $Z = 0$ for the cases and $Z = 1$ for a hypothetical group with the measured independent variables all equal to 0 ($\bar{L}_{z=1} - 2\bar{Q}_{z=1} = 0$). In this way, the t-test for the case-only analysis for admixture mapping is mathematically identical to a regression model $L_i - 2Q_i = \alpha + \beta_Z Z_i + \varepsilon_i$ with a test of whether $\beta_Z$ is significantly different from zero. Because $L_i - 2Q_i = 0$ when $Z = 1$, we have $\beta_Z = -\alpha$, so the model simplifies to $L_i - 2Q_i = \alpha - \alpha Z_i + \varepsilon_i$ with a test of whether $\alpha$ is significantly different from zero. Further restricting the data to the $Z = 0$ group (only the cases), we obtain a regression model (COreg) that implements the idea of admixture mapping for the case-only analysis:
\[
L_i = \alpha + 2Q_i + \varepsilon_i
\]
\[
H_0: \alpha = 0 \qquad H_a: \alpha \neq 0
\]
In this model, the test of whether $L_i$ is significantly different from $2Q_i$ is equivalent to the test of whether $\alpha$ is significantly different from 0.
In addition to the model CCreg, we further propose a model CCcom that incorporates both genotype and local ancestry information in the test:
\[
\mathrm{logit}(Y_i) = \alpha + \beta_G G_i + \beta_L L_i + \beta_Q Q_i + \varepsilon_i
\]
A 2-df likelihood ratio test is used to jointly test the marginal effects of the genotype ($G_i$) and the local ancestry ($L_i$).
4.2.1.2 Existing approaches
ADM is an approach proposed by Pasaniuc et al. (Pasaniuc, Zaitlen, et al.) for admixture
mapping among cases. This approach defines the likelihood of the data for the individual
within each ancestral strata:
In the likelihood, represents the multiplicative risk for disease given one or two
reference ancestral population,
$l_{i,N_i}$ represents the likelihood for individual $i$ with $N_i$ copies of the allele derived from the reference ancestral population. Assuming individuals are independent of each other, the likelihood of the data is then written as the product of the individual likelihoods, and a likelihood ratio test is used to derive the ADM score, which follows a chi-square distribution with 1 df.
The other two approaches proposed by Pasaniuc et al. (Pasaniuc, Zaitlen, et al.)
for admixture mapping using both genotype and local ancestry information are MIX and
SUM. Both tests combine the ADM model together with a SNP association model. The
SNP association model adjusts for local ancestry, and therefore is conditionally
independent of the ADM model:
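\[
\begin{aligned}
L(p_{A,0}, p_{B,0}, R) ={} & \prod_{Y \in \{0,1\}} p_{A,Y}^{\,2RR_{AA,Y} + RV_{AA,Y}} (1 - p_{A,Y})^{\,2VV_{AA,Y} + RV_{AA,Y}} \\
& \times \prod_{Y \in \{0,1\}} p_{A,Y}^{\,RR_{AB,Y} + 0.5RV_{AB,Y}} (1 - p_{A,Y})^{\,VV_{AB,Y} + 0.5RV_{AB,Y}} \, p_{B,Y}^{\,RR_{AB,Y} + 0.5RV_{AB,Y}} (1 - p_{B,Y})^{\,VV_{AB,Y} + 0.5RV_{AB,Y}} \\
& \times \prod_{Y \in \{0,1\}} p_{B,Y}^{\,2RR_{BB,Y} + RV_{BB,Y}} (1 - p_{B,Y})^{\,2VV_{BB,Y} + RV_{BB,Y}}
\end{aligned}
\]
\[
p_{A,1} = \frac{R\,p_{A,0}}{1 - p_{A,0} + R\,p_{A,0}}
\]
\[
SNP = 2\left[ \max_{p_{A,0},\, p_{B,0},\, R} \log L(p_{A,0}, p_{B,0}, R) - \max_{p_A,\, p_B} \log L(p_A, p_B, 1) \right]
\]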
In the likelihood, $R$ represents the relative increase in risk per extra reference allele, $p_{A,0}$ represents the allele frequency in ancestral population A among controls, and $RR_{AA,0}$ represents the number of individuals with genotype RR and both alleles derived from ancestral population A among the controls. The MIX (1-df) and the SUM (2-df) models are likelihood ratio tests that combine the likelihood (MIX) or the test statistics (SUM) of the single SNP test (adjusting for local ancestry) and the ADM. Both approaches assume no heterogeneity of the SNP effect across ancestral populations. More specifically, the SUM score is the sum of the ADM and SNP statistics, and the SUM statistic follows a chi-square distribution with 2 df. The MIX model multiplies the likelihoods
Ladmix() and L(pA,0,pB,0,R) assuming the following relationship between and R:
And then the MIX statistic is calculated as:
\[
MIX = 2\left[ \max_{p_{A,0},\, p_{B,0},\, R} \log L_{combined}(p_{A,0}, p_{B,0}, R) - \max_{p_{A,0},\, p_{B,0}} \log L_{combined}(p_{A,0}, p_{B,0}, 1) \right]
\]
This statistic follows a chi-square distribution with 1 df.
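For reference, p-values for chi-square statistics with 1 df (ADM, SNP, MIX) and 2 df (SUM) have simple closed forms, sketched below with the standard library only. The function names are hypothetical, and the sketch takes the ADM and SNP statistics as given rather than computing them from the likelihoods.

```python
from math import erfc, exp, sqrt


def chi2_pvalue_1df(stat):
    """Upper-tail probability of a chi-square statistic with 1 df."""
    return erfc(sqrt(stat / 2.0))


def chi2_pvalue_2df(stat):
    """Upper-tail probability of a chi-square statistic with 2 df."""
    return exp(-stat / 2.0)


def sum_score(adm_stat, snp_stat):
    """SUM score: the ADM and SNP statistics added together (2 df)."""
    return adm_stat + snp_stat
```

For instance, `chi2_pvalue_2df(sum_score(adm_stat, snp_stat))` gives the SUM p-value for a pair of precomputed statistics.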

4.2.1.3 Summary of the proposed models
In summary, we propose three models in a regression framework for admixture mapping: Model (1), which uses only ancestry information for the case-only study design; Model (2), which uses only ancestry information for the case-control study design; and Model (3), which incorporates ancestry and SNP genotype information:
COreg: $L_i = \alpha + 2Q_i + \varepsilon_i$ (1)
CCreg: $\mathrm{logit}(Y_i) = \alpha + \beta_L L_i + \beta_Q Q_i + \varepsilon_i$ (2)
CCcom: $\mathrm{logit}(Y_i) = \alpha + \beta_G G_i + \beta_L L_i + \beta_Q Q_i + \varepsilon_i$ (2-df) (3)
4.2.2 Simulation framework
A total of 9,641 African-American samples with measured genotypes were used for the
simulation. Local ancestry was estimated through HAPMIX using HapMap 2 YRI &
CEU as the reference populations. Global ancestry was then calculated as an average of
the local ancestry across the genome divided by 2 for each individual.
Figure 4-1 Simulation framework for admixture mapping.

As shown in the framework above (Figure 4-1), $G_M$ represents the observed marker locus and $G_D$ represents the true disease locus, which is not directly observed in the genotype data. LD ($r^2$) is the predefined linkage disequilibrium between $G_M$ and $G_D$. $p$ represents the allele frequency at $G_M$ and is calculated directly from the observed genotypes, while $q$ represents the allele frequency at $G_D$. During the simulation, $q$ is set equal to $p$. Haplotype frequencies at $G_M$ and $G_D$ are calculated from the predefined LD and the allele frequencies at these two loci. The probability of the genotypes at $G_D$ given $G_M$, $\Pr(G_D \mid G_M)$, is then calculated from the haplotype frequencies at $G_M$ and $G_D$. Given the conditional probability $\Pr(G_D \mid G_M)$ and the observed genotypes at $G_M$, the genotypes at the disease locus $G_D$ are generated for each individual.
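One way to carry out the steps above is sketched here. It is not the exact simulation code: it assumes a positive LD coefficient $D = \sqrt{r^2\,p(1-p)\,q(1-q)}$, independent haplotypes within an individual, and alleles coded 1 (counted allele) or 0; all function names are hypothetical.

```python
import random
from math import sqrt


def haplotype_freqs(p, q, r2):
    """Frequencies of (marker allele, disease allele) haplotypes."""
    d = sqrt(r2 * p * (1 - p) * q * (1 - q))  # assumed positive D
    freqs = {(1, 1): p * q + d, (1, 0): p * (1 - q) - d,
             (0, 1): (1 - p) * q - d, (0, 0): (1 - p) * (1 - q) + d}
    if min(freqs.values()) < -1e-12:
        raise ValueError("r2 is not attainable for these allele frequencies")
    # Clamp tiny negative values caused by floating point rounding.
    return {h: max(f, 0.0) for h, f in freqs.items()}


def disease_allele_prob(p, q, r2):
    """Pr(disease allele = 1 | marker allele) for marker alleles 1 and 0."""
    h = haplotype_freqs(p, q, r2)
    return {1: h[(1, 1)] / p, 0: h[(0, 1)] / (1 - p)}


def simulate_disease_genotype(g_marker, p, q, r2, rng):
    """Draw G_D (0, 1, or 2) given the observed marker genotype G_M."""
    cond = disease_allele_prob(p, q, r2)
    marker_alleles = [1] * g_marker + [0] * (2 - g_marker)
    return sum(rng.random() < cond[a] for a in marker_alleles)
```

A call such as `simulate_disease_genotype(1, 0.3, 0.3, 0.9, random.Random(1))` draws one disease-locus genotype for a heterozygous marker genotype with $q = p = 0.3$.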
After simulating the disease locus genotypes, we resample 20,000 individuals
from our African-American samples with replacement. For these samples, we generate
case/control status for a binary disease outcome (Y) using a logistic regression model
incorporating the disease locus GD and the predefined disease prevalence (0.1) in the
population.
Then, 1,000 cases and 1,000 controls are randomly selected (without
replacement) from the 20,000 samples. Admixture mapping was conducted among these
selected samples to compare the performance of the proposed models.
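The outcome-generation and sampling steps can be sketched as follows. How the prevalence enters the logistic model is not spelled out above; this sketch assumes the intercept is calibrated so that the average disease probability in the resampled population equals the target prevalence, with a log-additive effect of $G_D$. Function names are hypothetical.

```python
import random
from math import exp, log


def expit(x):
    return 1.0 / (1.0 + exp(-x))


def calibrate_intercept(genotypes, log_or, prevalence, lo=-20.0, hi=20.0):
    """Bisection search for the intercept giving the target prevalence."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        avg = sum(expit(mid + log_or * g) for g in genotypes) / len(genotypes)
        if avg < prevalence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def simulate_status(genotypes, odds_ratio, prevalence, rng):
    """Binary disease status from a logistic model on the disease genotype."""
    log_or = log(odds_ratio)
    b0 = calibrate_intercept(genotypes, log_or, prevalence)
    return [int(rng.random() < expit(b0 + log_or * g)) for g in genotypes]


def sample_cases_controls(status, n_cases, n_controls, rng):
    """Indices of randomly selected cases and controls (no replacement)."""
    cases = [i for i, y in enumerate(status) if y == 1]
    controls = [i for i, y in enumerate(status) if y == 0]
    return rng.sample(cases, n_cases), rng.sample(controls, n_controls)
```

With 20,000 resampled individuals and a prevalence of 0.1, roughly 2,000 cases are expected, which is enough to draw 1,000 cases and 1,000 controls.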
4.2.3 Scenarios
Scenario A: Testing the performance of the models under the null hypothesis. The effect of $G_D$ on disease outcome $Y$ (odds ratio) equals 1.0. Scenario B: Testing the performance of the models under the alternative hypothesis. The LD ($r^2$) between $G_M$ and $G_D$ equals 0.9. The effect of $G_D$ on disease outcome $Y$ (odds ratio) is fixed at 2.0.
SNPs are grouped into strata according to the allele frequency differences between European and African ancestries (as calculated from HapMap III populations). The performance of the models was tested within each stratum.
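The stratification and the per-stratum summary can be sketched as below, using the cut points 0.2 and 0.4 that appear in the result tables. The treatment of values falling exactly on a cut point and the function names are assumptions. The same rejection-rate calculation gives the type I error under Scenario A and the power under Scenario B.

```python
def frequency_stratum(freq_eur, freq_afr):
    """Stratum label for the absolute allele frequency difference."""
    diff = abs(freq_eur - freq_afr)
    if diff < 0.2:
        return "< 0.2"
    if diff <= 0.4:  # assumed handling of the boundary value
        return "0.2 ~ 0.4"
    return "> 0.4"


def rejection_rate_by_stratum(records, threshold):
    """Fraction of loci with p-value below threshold, within each stratum.

    records: iterable of (freq_eur, freq_afr, p_value) tuples, one per locus.
    """
    counts = {}
    for freq_eur, freq_afr, p_value in records:
        key = frequency_stratum(freq_eur, freq_afr)
        n, k = counts.get(key, (0, 0))
        counts[key] = (n + 1, k + (p_value < threshold))
    return {key: k / n for key, (n, k) in counts.items()}
```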
4.2.4 Real data analysis among African Americans
We apply the models to the African-American samples in the MEC (Haiman et al.). After quality control, there were 9,641 individuals and 863,431 SNPs remaining for the analysis. Among these samples, there were 4,905 cases and 4,732 controls. The average global European ancestry proportion (calculated for each individual as the average local ancestry across the genome divided by 2) is 0.205 among cases and 0.215 among controls.
4.3 Results
4.3.1 Simulation results
A significance threshold of 0.05 is used to assess the type I error rate of the tested models. SNPs are grouped by their allele frequency differences between the CEU and YRI populations from HapMap III. As shown in Table 4-1, under the null hypothesis (OR = 1.0), the overall type I error rates are around 0.05 except for the ADM model (0.027) and the SUM score model (0.035).
Table 4-1 Type I error among models for admixture scan (OR = 1.0; LD = 0.9).

                                                 Allele freq differences (1)
Information used       Model              All      < 0.2    0.2 ~ 0.4   > 0.4
Number of loci                            12393    7839     3275        1279
Genotype               SNP association    0.051    0.051    0.048       0.055
Ancestry               COreg              0.058    0.057    0.062       0.055
Ancestry               ADM                0.027    0.027    0.028       0.023
Ancestry               CCreg              0.052    0.051    0.053       0.054
Genotype & ancestry    CCcom              0.051    0.053    0.050       0.045
Genotype & ancestry    SUM                0.035    0.038    0.032       0.026
Genotype & ancestry    MIX                0.048    0.050    0.046       0.043

(1) Absolute allele frequency differences between African and European populations.
To assess the power of the tested models, a threshold of 1e-05 is used for models
that use only local ancestry information (COreg, ADM, and CCreg), and a threshold of
5e-08 is used for models that incorporate a SNP-based test (SNP association, CCcom,
SUM, and MIX).
Table 4-2 shows the power for models using only ancestry information (COreg, ADM, and CCreg). The simulation results indicate that the case-only analysis is more powerful than the case-control analysis. In addition, our proposed model COreg performs better than the ADM model for the case-only analysis. For models incorporating genotype information, as shown in Table 4-3, models CCcom, SUM, and MIX result in power similar to that of the SNP association model. Across all the tested models, the power for detecting the association increases with greater allele frequency differences between European and African populations.
Table 4-2 Power among models for admixture scan using only ancestry information (OR = 2.0; LD = 0.9).

                              Allele freq differences (1)
Models            All      < 0.2    0.2 ~ 0.4   > 0.4
Number of loci    12393    7839     3275        1279
COreg             0.033    0.001    0.020       0.250
ADM               0.015    0.000    0.005       0.122
CCreg             0.011    0.000    0.005       0.094

(1) Absolute allele frequency differences between African and European populations.
Table 4-3 Power among models for admixture scan incorporating genotype information (OR = 2.0; LD = 0.9).

                               Allele freq differences (1)
Models             All      < 0.2    0.2 ~ 0.4   > 0.4
Number of loci     12393    7839     3275        1279
SNP association    0.739    0.659    0.865       0.935
CCcom              0.726    0.644    0.852       0.922
SUM                0.728    0.643    0.855       0.928
MIX                0.756    0.675    0.870       0.956

(1) Absolute allele frequency differences between African and European populations.
In conclusion, the simulation results suggest that at the LD = 0.9 level (dense markers across the genome), when genotype information is available, it is always more powerful to incorporate it in the analysis, and little power is gained by incorporating ancestry information. However, among markers with large allele frequency differences between populations (> 0.4), when LD decreases to 0.4, our proposed model that incorporates both ancestry and genotype information begins to show greater power than the SNP association analysis; and when LD is lower than 0.28, our proposed case-only model COreg begins to show greater power than the SNP association analysis (Figure 4-2). Note that the case-only analysis has constant power across different levels of LD because the local ancestry background remains constant.
Figure 4-2 Simulation results across different LD levels among markers with allele frequency differences greater than 0.4 between populations.
4.3.2 Real data analysis results
4.3.2.1 Results across the genome

We apply the models (SNP association, COreg, CCreg, and CCcom) to a whole-genome scan among the MEC African Americans. Figure 4-3 shows the results from the SNP association model and the models that use only local ancestry information. The pattern from the case-only analysis is similar to that from the case-control analysis, but the case-only analysis yields much more significant results. Note that the resulting pattern is almost identical between ADM and COreg, with COreg attaining more significant p-values. On the other hand, some regions (e.g., regions on Chromosomes 6, 8, 11, and 15) show a substantial disagreement between the case-only and the case-control analyses. Within these regions, the case-only analysis results in very significant p-values, while only one of these regions (on Chromosome 8) reaches genome-wide significance (1e-05) in the case-control analysis using only local ancestry. The difference may be due to over-adjustment for individual global ancestry in the case-control setting, or to spurious results introduced by the individual local ancestry estimation (e.g., if the estimated local ancestry proportion is higher than the average among both cases and controls). To understand the reason for the difference, we list detailed analysis results and ancestry estimates for these regions in Table 4-4.
Panels: SNP association (black and gray) with COreg (red); SNP association (black and gray) with CCreg (red).
Figure 4-3 Genome-wide admixture scan using the SNP association model and the models that use only local ancestry information.
As shown in Table 4-4, on Chromosomes 6, 11, and 15, there are regions with elevated local ancestry estimates (of European origin) among both cases and controls. The highly significant signals from the case-only analysis (here we only show the result from COreg) match these regions closely. In addition, at these loci, the case-control analysis does not show any hint of association. Therefore, we suspect that the signals detected within these regions are false positives caused by the local ancestry estimation procedure. On the other hand, for the region on Chromosome 8, only the local ancestry estimate (proportion inherited from European ancestry) among the cases deviates from the mean. In this region, cases tend to have a lower European ancestry proportion than one would expect. At the most significant locus, all the models (SNP association, case-only, and case-control analysis) reach the genome-wide significance level (1e-05). Therefore, we view this region as a promising candidate for the disease under study. The reduced signal from the case-control analysis may be due to the adjustment for individual global ancestry, as global ancestry is highly correlated with local ancestry.
Table 4-4 Analysis details for regions with great disagreement between case-only and case-control analysis.
Columns (regions): Chr6, Chr8, Chr11, Chr15
Row 1: -log10(p), COreg (red) vs. CCreg (blue)
Row 2: Local ancestry, cases (red) vs. controls (blue)
Figure 4-4 shows the results from the model using both genotype and ancestry information (CCcom) and compares its performance to the SUM and MIX models. The results indicate that the performance is similar across the three models, and most of the signals from these mixed models are well captured by the SNP association model that uses only the genotype information. Note that the SUM score and the MIX score models combine the genotype signal from the case-control analysis with the admixture signal from the case-only analysis; consequently, these two models generate spurious results in the regions on Chromosomes 6, 11, and 15 (regions shown in Table 4-4). In contrast, our proposed mixed model, CCcom, combines the genotype signal with the admixture signal from the case-control analysis and is therefore immune to these false positive regions. Model CCcom is thus a more appropriate model for combining genotype and ancestry information.
Panels: conventional model (black and gray) with CCcom (red); with SUM (red); with MIX (red).
Figure 4-4 Genome-wide admixture scan using the SNP association model and the models incorporating both genotype and ancestry information (SUM, MIX, and CCcom).
4.3.2.2 Results on known hits

We further check the results on the known hits for prostate cancer among MEC African
Americans. Table 4-5 lists the SNPs that are replicated by the SNP association model
(p-value cutoff 0.05). These SNPs are further classified into two subgroups: group A
contains SNPs that are significant only in the SNP association model, and group B
contains SNPs that also meet the 0.05 cutoff in at least one of the models that use only
ancestry information (COreg or CCreg). For each SNP, the model with the most
significant p-value is highlighted in bold. As expected, for SNPs in group A, the most
significant p-value comes from either the SNP association model or model CCcom, which
combines genotype and ancestry information, and incorporating ancestry information in
the combined model yields little gain in statistical significance. For group B SNPs, all
the most significant p-values come from the combined model CCcom. Note that the
results are almost identical between ADM and COreg, with COreg attaining more
significant p-values. In addition, the performance of CCcom is similar to that of MIX,
while the SUM score model results in relatively conservative (less significant) p-values.
Table 4-5 Known hits that are replicated by the conventional model.
Group  SNP         Chr  SNP association  COreg     CCreg     CCcom
A      rs10187424  2    3.33E-02         2.03E-01  8.08E-02  4.38E-02
A      rs12621278  2    1.07E-02         9.77E-01  6.78E-01  3.66E-02
A      rs7584330   2    1.21E-02         7.48E-01  7.59E-01  1.22E-02
A      rs2292884   2    2.90E-03         8.11E-01  6.16E-01  4.19E-03
A      rs12653946  5    2.34E-03         3.43E-01  8.89E-01  9.48E-03
A      rs1983891   6    2.52E-04         6.94E-02  3.18E-01  1.14E-03
A      rs339331    6    1.74E-06         2.45E-01  6.11E-02  1.09E-06
A      rs9364554   6    9.92E-03         5.48E-01  9.40E-01  1.73E-02
A      rs10993994  10   2.43E-03         5.51E-01  8.19E-01  7.46E-03
A      rs7127900   11   1.70E-03         4.64E-01  6.85E-01  4.67E-03
A      rs11228565  11   2.06E-03         9.35E-01  9.61E-01  7.25E-03
A      rs7210100   17   2.65E-08         7.56E-02  6.61E-02  6.04E-09
A      rs8102476   19   1.89E-02         3.91E-01  2.20E-01  1.30E-02
A      rs11672691  19   3.74E-02         4.04E-01  3.18E-01  1.07E-01
B      rs2028898   2    1.56E-02         1.30E-01  3.55E-02  9.36E-03
B      rs10486567  7    8.03E-04         3.91E-02  2.27E-03  4.49E-05
B      rs1512268   8    6.30E-07         4.76E-02  3.74E-03  6.37E-07
B      rs5759167   22   1.00E-04         7.86E-03  1.26E-03  4.05E-05
Table 4-6 lists the SNPs that are not replicated in the SNP association analysis but are
captured by the models incorporating ancestry information (p-value cutoff 0.05). Note
that rs2121875 and rs130067 yield very significant p-values in the case-only analysis
(COreg) but show no signal in the case-control analysis (CCreg). This is due to large
differences between local and global ancestry in both cases and controls. The average
global ancestry estimates are 0.205 and 0.214 among cases and controls, respectively, for
these two SNPs. For rs2121875, the average local ancestry estimates are 0.44 and 0.46
among cases and controls; at rs130067, the estimates are 0.46 and 0.48. At both loci,
local ancestry estimates are elevated among both the cases and the controls. In the
case-only analysis, the comparison to global ancestry therefore yields a significant
p-value. In the case-control analysis, since local ancestry is comparable between cases
and controls, there is no significant association.
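The contrast between the two analyses can be illustrated with a small numerical sketch.
It is not the COreg or CCreg regression itself: both are replaced here by simple z-tests
on mean ancestry, and the standard deviation (0.35) and group sizes (1000 cases and 1000
controls) are hypothetical values chosen only for illustration. The ancestry means are
those quoted above for rs2121875.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of a standard normal z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def case_only_z(local_mean, global_mean, sd, n_cases):
    """Case-only contrast: mean local ancestry vs. mean global ancestry in cases."""
    return (local_mean - global_mean) / (sd / math.sqrt(n_cases))


def case_control_z(local_cases, local_controls, sd, n_cases, n_controls):
    """Case-control contrast: mean local ancestry in cases vs. in controls."""
    se = sd * math.sqrt(1.0 / n_cases + 1.0 / n_controls)
    return (local_cases - local_controls) / se


# Means quoted in the text for rs2121875.
GLOBAL_CASES = 0.205
LOCAL_CASES, LOCAL_CONTROLS = 0.44, 0.46

# Hypothetical values, NOT taken from the study.
SD, N_CASES, N_CONTROLS = 0.35, 1000, 1000

p_case_only = two_sided_p(
    case_only_z(LOCAL_CASES, GLOBAL_CASES, SD, N_CASES)
)
p_case_control = two_sided_p(
    case_control_z(LOCAL_CASES, LOCAL_CONTROLS, SD, N_CASES, N_CONTROLS)
)

print(f"case-only p-value:    {p_case_only:.3g}")
print(f"case-control p-value: {p_case_control:.3g}")
```

Under these assumed values, the case-only contrast is extreme because local ancestry is
elevated relative to global ancestry, while the case-control contrast is unremarkable
because the elevation is shared by cases and controls.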
Table 4-6 Known Hits that are replicated only in models incorporating local ancestry
information.
SNP        Chr  SNP association  COreg     CCreg     CCcom
rs6763931  3    2.85E-01         4.11E-03  3.26E-02  9.97E-02
rs2121875  5    4.62E-01         3.26E-04  9.00E-01  7.01E-01
rs130067   6    1.36E-01         8.93E-13  8.27E-01  3.24E-01
rs2928679  8    8.94E-01         6.54E-02  2.93E-03  1.02E-02
rs4962416  10   1.08E-01         6.83E-02  1.58E-03  3.88E-03
4.3.2.3 Building regression models

As model CCcom generates valid results across the genome, we compare the
performance of CCcom to the SNP association model in Figure 4-5. In this figure, the
x-axis plots the -log10(p) for the genotype marginal effect from the conventional model,
and the y-axis plots the -log10(p) for the 2-df test of the genotype and local ancestry
main effects from the CCcom model. The solid line marks where the two models perform
the same. The results show that 10 SNPs (all on chromosome 8) are significant in CCcom
but not in the SNP association analysis (colored red), and that for 275 SNPs (274 on
chromosome 8 and 1 on chromosome 22) the difference between models is greater than 4
(colored blue).
Figure 4-5 Comparison of the performance between the proposed CCcom (2df) model
and the conventional model.
Figure 4-6 shows the detailed analysis results for the region on chromosome 8. In
this figure, the red and blue colored SNPs are among the ones colored in Figure 4-5.
Region A (highlighted in red) is the region containing the association signals from the
SNP association analysis, and most of the red colored SNPs from Figure 4-5 fall into this
region. Region B comprises the two regions highlighted in blue on either side of Region
A, and most of the blue colored SNPs from Figure 4-5 fall into these regions. For the red
colored SNPs, the association signal comes from both the genotype and the ancestry; for
the majority of the blue colored SNPs, the association signal comes mainly from the
ancestry information.
Figure 4-6 Comparison of the performance between the proposed CCcom (2df) model
and the SNP association analysis on the region on chromosome 8.
[x-axis: position on chromosome 8.]
Furthermore, we build an additional model based on CCcom that includes the SNP by
local ancestry interaction, and propose a 3-df test of G, L, and GL (model CCcom_GL):
\[ \operatorname{logit}(Y) \sim \alpha + \beta_G G + \beta_L L + \beta_{int} G L + \beta_Q Q \quad \text{(3-df test)} \]
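For reference, the 3-df test can be written as a comparison of nested models. The sketch
below writes \beta_G, \beta_L, and \beta_{int} for the coefficients of G, L, and the G by L
interaction, and assumes a likelihood ratio statistic, which is one common choice; a Wald
or score test on the same three parameters would be an alternative. The symbols \ell_1
(maximized log-likelihood of the full model), \ell_0 (maximized log-likelihood of the null
model containing only the intercept and Q), and \Lambda are notation introduced here for
illustration.

```latex
\[
\begin{aligned}
H_0 &: \beta_G = \beta_L = \beta_{int} = 0 \\
\Lambda &= 2\,(\ell_1 - \ell_0) \;\sim\; \chi^2_3 \quad \text{asymptotically under } H_0 \\
\log \mathrm{OR}_{G \mid L = l} &= \beta_G + \beta_{int}\, l, \qquad l \in \{0, 1, 2\}
\end{aligned}
\]
```

The last line gives the SNP log odds ratio within each local ancestry stratum when G and
L are both coded additively.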
Figure 4-7 compares the performance between CCcom and the SNP association
model. Similar to Figure 4-5, SNPs marked in red are the ones that meet the significance
cutoff in model CCcom_GL (3-df) and have more significant p-values than in CCcom
(2-df). There are 9 such SNPs across the genome, and all of them are on chromosome 8
(as shown in Figure 4-8). In the right panel, the 12 SNPs marked in blue have much
smaller p-values than in the conventional model (the difference in -log10(p) is greater
than 4). Among these SNPs, 9 are on chromosome 9, and the remaining 3 are on
chromosomes 2, 5, and 8, respectively. Results for the region on chromosome 9 are
shown in Figure 4-9. The 9 blue colored SNPs (from Figure 4-8) are gathered within the
two highlighted regions. The first region contains ZFAND5, a transcription factor gene
that regulates NFkappaB activation and apoptosis. The second region contains
ALDH1A1, which encodes an aldehyde dehydrogenase enzyme that is responsible for
alcohol metabolism and is also involved in the regulation of metabolic responses to a
high-fat diet.
Figure 4-7 Comparison of models CCcom (2df) and CCcom_GL (3df).
Figure 4-8 Comparison of models CCcom (2df) and CCcom_GL (3df) on the region on
chromosome 8.
Figure 4-9 Comparison of models CCcom (2df) and CCcom_GL (3df) on the region on
chromosome 9.
Finally, we take the most significant SNP from each of the highlighted regions in
Figure 4-9 and show the effect estimates within each local ancestry stratum. As shown in
Table 4-7, the allele frequencies for rs4073226 are 0.720 for L=0 (both copies of African
ancestry) and 0.493 for L=2 (both copies of European ancestry). These estimated allele
frequencies are consistent with those from the HapMap samples (0.714 for YRI and 0.535
for CEU). For rs3815836, the allele frequencies are 0.162 (L=0) and 0.537 (L=2), which
are also consistent with the HapMap samples (0.129 for YRI and 0.566 for CEU). For
both SNPs, the effect sizes are in opposite directions between local ancestry strata L=0
and L=2.
Table 4-7 Effect sizes for the two SNPs on chromosome 9.
Marker     Local strata  N     Allele freq  CCcom (2-df)   CCcom_GL (3-df)
                                            (p-value)      (p-value)
rs4073226  L=0           6189  0.720                       -0.059
           L=1           2945  0.616                        0.178
           L=2            504  0.493                        0.415
           All           9638  0.676        0.048 (0.27)   (1.6×10^-5)
rs3815836  L=0           6211  0.162                        0.144
           L=1           2925  0.343                       -0.127
           L=2            503  0.537                       -0.398
           All           9639  0.237        -0.001 (0.94)  (9.5×10^-6)
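The stratum-specific estimates in Table 4-7 are linear in L, which is what a model with
G, L, and GL terms produces when G and L are coded additively: the SNP log odds ratio
in stratum L equals the G coefficient plus L times the interaction coefficient. The sketch
below back-calculates these two coefficients from the L=0 and L=1 estimates and checks
that they reproduce the L=2 estimate. The coefficient values are inferred from the table
for illustration and are not reported in the text; the function names are hypothetical.

```python
import math


def implied_coefficients(effect_l0, effect_l1):
    """Back-calculate (beta_G, beta_int) from the L=0 and L=1 log odds ratios."""
    beta_g = effect_l0
    beta_int = effect_l1 - effect_l0
    return beta_g, beta_int


def stratum_log_or(beta_g, beta_int, local_ancestry):
    """SNP log odds ratio within a local ancestry stratum (L = 0, 1, or 2)."""
    return beta_g + beta_int * local_ancestry


# Stratum-specific effects (L=0, L=1, L=2) as listed in Table 4-7.
table_4_7 = {
    "rs4073226": (-0.059, 0.178, 0.415),
    "rs3815836": (0.144, -0.127, -0.398),
}

results = {}
for snp, (e0, e1, e2) in table_4_7.items():
    beta_g, beta_int = implied_coefficients(e0, e1)
    predicted_l2 = stratum_log_or(beta_g, beta_int, 2)
    results[snp] = (beta_g, beta_int, predicted_l2, e2)
    print(
        f"{snp}: beta_G={beta_g:+.3f}, beta_int={beta_int:+.3f}, "
        f"predicted L=2 effect={predicted_l2:+.3f} (table: {e2:+.3f}), "
        f"odds ratio at L=2={math.exp(predicted_l2):.2f}"
    )
```

For both SNPs the implied interaction coefficient has the opposite sign to the L=0 effect,
which is why the effect direction flips between L=0 and L=2.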
4.4 Discussion
Results from both simulation and real data analysis indicate that case-only analysis
(COreg and ADM) is more powerful than case-control analysis (CCreg), and that our
proposed regression model COreg for case-only analysis is more powerful than the ADM
model. When SNP genotypes are available, it is more powerful to incorporate genotype
information in the model (CCcom, MIX, and SUM), and our proposed regression model
CCcom yields performance similar to that of the MIX and SUM models. On the other
hand, the real data analysis shows that the case-only analysis suffers from spurious
results in regions with biased local ancestry estimation (e.g. the estimated ancestry
proportion is higher than it should be among both cases and controls). Therefore, the
case-only models COreg and ADM, as well as the MIX and SUM models that incorporate
signals from the case-only analysis, may produce false positives across the genome. In
conclusion, we suggest using our proposed regression model CCcom with the 2-df test of
both SNP and local ancestry for admixture scans in admixed populations.
Moving from the model using only genotype information to the model
incorporating both genotype and local ancestry information, and finally to the model
that further considers heterogeneity by local ancestry, at each step we gain power at
certain loci and at the same time lose power due to the additional degree of freedom in
the test. More specifically, compared to the SNP association model, model CCcom (2-df)
results in more significant p-values for 43% of the loci; and compared to CCcom (2-df),
model CCcom_GL (3-df) results in more significant p-values for 34% of the loci. When
comparing CCcom_GL directly to the conventional model (as shown in Figure 4-10), we
find that 47% of the loci attain more significant p-values, and that the difference
between the two models tends to be greater among these loci. For example, rs11777807
on chromosome 8 has a p-value of 2.4e-3 from the conventional model but a highly
significant p-value of 4.1e-12 from model CCcom_GL (3-df). This change suggests that,
after an initial scan with the conventional model, it is more powerful to run an
admixture scan using CCcom_GL (3-df), a model that captures the admixture signal and
heterogeneity by local ancestry simultaneously.
Figure 4-10 Comparison of model CCcom_GL (3df) with the SNP association model.
Note that our suggestions on the admixture scan for GWAS are based on the
results from both simulations and real data analysis. Because the real data analysis
shows that the case-only analysis can produce spurious results (as shown in Table 4-4),
we do not suggest using models that capture the admixture signal through the case-only
analysis (COreg, ADM, SUM, and MIX). However, once the issue with case-only analysis
is resolved (e.g. no more regions with elevated local ancestry estimates), our
suggestions for admixture scans in GWAS may change accordingly.
As shown in the simulation, for models that use only local ancestry information,
COreg > ADM > CCreg in terms of power. For the combined models, the performance is
similar across the CCcom, MIX, and SUM models. Compared to the existing models, our
proposed regression models have the following advantages: 1) they are easier to set up
and run; 2) they are more flexible for including additional factors in the model (e.g. the
SNP by local ancestry interaction); 3) their results are easier to interpret, since the log
odds ratios of the SNP effect, the local ancestry effect, and the ancestry stratum-specific
SNP effect can be obtained directly from the regression coefficients in the model; and
4) they can be applied to continuous outcomes as well.
Chapter 5 Summary
Figure 5-1 summarizes our suggested models for GWAS among admixed populations.
The two models to the left, Y~G+Q and Y~G+GL+L+Q (2df), are the models we
proposed in Chapter 3. The two models to the right, Y~G+L+Q (2df) and
Y~G+GL+L+Q (3df), are the models we proposed in Chapter 4. As shown in the figure,
moving from the left panel to the right panel, we incorporate admixture signals into the
models; moving from the upper panel to the lower panel, we incorporate heterogeneity
signals into the models.
Figure 5-1 Comparison of proposed models.
Figure 5-2 shows how the results change when different signals are incorporated in
the model. Compared to the conventional model Y~G+Q, about 43% of the loci have
more significant p-values after incorporating local ancestry information (Y~G+L+Q 2df),
and about 40% of the loci have more significant p-values after incorporating the G by L
interaction in the model (Y~G+GL+L+Q 2df). When comparing the model that
incorporates both the admixture signal and the heterogeneity signal (Y~G+GL+L+Q 3df)
to the conventional model, about 47% of the loci have more significant p-values. The
distribution of the changes in log10(p-value) is shown in Figure 5-3. Among the loci
with a smaller difference between the two models, e.g. where the absolute value of the
difference is smaller than 0.6, the conventional model has an advantage over the
advanced model (more than half of these loci have more significant p-values from the
conventional model); however, among the loci with a greater difference between models,
our proposed 3-df test model has the advantage over the conventional model.
Figure 5-2 Changes in results between proposed models.

Figure 5-3 Histogram of changes in log10(p-value) when comparing Y~G+L+GL+Q
(3df) to the conventional model Y~G+Q.
[x-axis: difference in log10(p-value), with visible tick labels -1 and >2; bar labels from
left to right: 0.06%, 52.6%, 41.5%, 4.91%, 0.93%.]
Although, compared to the conventional model, the model incorporating both
admixture and heterogeneity signals results in less significant p-values for more than
half of the loci (~53%) across the genome, the proposed model tends to have the
advantage among the loci that show a greater difference between models (as indicated in
Figure 5-3). To reflect the amount of change between models in the comparison, we
calculated a weighted change for the loci above (Wabove) and below (Wbelow) the
expected line in Figure 4-10, respectively:
\[ W_{above} = \frac{\sum_{i \in \{y > x\}} (y_i - x_i)}{N_{i \in \{y > x\}}}, \qquad W_{below} = \frac{\sum_{i \in \{y < x\}} (y_i - x_i)}{N_{i \in \{y < x\}}} \]
Here yi represents the -log10(p) from model Y~G+GL+L+Q (3df) at locus i and
xi represents the -log10(p) from the conventional model Y~G+Q at the same locus. The
weighted changes across models are shown in Figure 5-4. In each comparison, Wabove is
greater than the absolute value of Wbelow, suggesting that on average (considering the
differences between models), the potential to gain information outweighs the potential
loss in significance due to an additional degree of freedom in the test. Therefore, GWAS
among admixed populations will benefit from incorporating admixture and heterogeneity
signals in the analysis.
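The computation of Wabove and Wbelow can be sketched as follows. The sketch assumes
that N denotes the number of loci in each set (so each quantity is the mean change within
its set) and that loci lying exactly on the line are skipped. The function name is
hypothetical. The first pair of p-values is the rs11777807 example from Section 4.4; the
remaining three pairs are made-up values for illustration only.

```python
import math


def weighted_changes(p_proposed, p_conventional):
    """Return (W_above, W_below) for paired p-values from two models.

    y_i = -log10(p) from the proposed model and x_i = -log10(p) from the
    conventional model. Loci with y_i == x_i lie on the line and are skipped.
    """
    above, below = [], []
    for p_y, p_x in zip(p_proposed, p_conventional):
        diff = -math.log10(p_y) + math.log10(p_x)  # y_i - x_i
        if diff > 0:
            above.append(diff)
        elif diff < 0:
            below.append(diff)
    w_above = sum(above) / len(above) if above else 0.0
    w_below = sum(below) / len(below) if below else 0.0
    return w_above, w_below


# First locus: rs11777807 (values from the text). Others: hypothetical.
p_conventional = [2.4e-3, 0.5, 0.04, 0.2]
p_proposed = [4.1e-12, 0.6, 0.06, 0.1]

w_above, w_below = weighted_changes(p_proposed, p_conventional)
print(f"W_above = {w_above:.3f}, W_below = {w_below:.3f}")
```

In this toy input half of the loci lose significance, yet the mean gain is much larger
than the mean loss, which is the pattern the comparison of Wabove with the absolute
value of Wbelow is meant to capture.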
Figure 5-4 Weighted changes in results between proposed models.
Chapter 6 Future Directions

We have discussed the performance of various models, including SNP association,
models using only ancestry information, models using both ancestry and genotype
information, and models accounting for heterogeneity by local ancestry (SNP by local
ancestry interaction). Each model has its own advantage under different scenarios. The
issue then becomes how to select among the models, or how to build a more complete
model starting from the initial SNP association model. This model selection can be
based on the AIC, which incorporates a penalty for overfitting. Alternatively, one could
generate priors for each model under each scenario (e.g. the allele frequency difference
between populations at each locus) and select among the models in a Bayesian
framework.
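As a minimal sketch of AIC-based selection at a single locus, the example below applies
the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated
parameters and ln(L) is the maximized log-likelihood. The log-likelihood values are
hypothetical, and the parameter counts assume an intercept plus a single global ancestry
covariate Q; neither comes from the analyses in this dissertation.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


# model formula -> (maximized log-likelihood, number of parameters).
# All numbers below are hypothetical and for illustration only.
candidates = {
    "Y~G+Q": (-6500.0, 3),       # intercept, G, Q
    "Y~G+L+Q": (-6495.5, 4),     # adds local ancestry L
    "Y~G+GL+L+Q": (-6495.0, 5),  # adds the G by L interaction
}

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: AIC = {score:.1f}")
print(f"selected model: {best}")
```

In this made-up example, adding L improves the fit enough to offset its penalty, while
the interaction term does not, so the model with G, L, and Q is selected.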
Principal Component Analysis (PCA) has been widely used for the detection of
population substructure and has become a standard approach for adjusting for
confounding in genetic association studies. However, issues remain with this approach.
First, the currently prevalent practice is to use the top 10 PCs to control for confounding
in the association study. The top 10 PCs may not be enough to control for finer structure
in the data, and may be too many for studies with only major population structure (and
therefore cause loss of power when the sample size is small). The question of how many
PCs are appropriate for the adjustment therefore remains open, and the answer will
differ across study populations. Second, PCs are less interpretable than STRUCTURE
results, so there is no clear cutoff for defining outliers in terms of samples' ancestry
composition. Special clustering methods are necessary for summarizing the results from
PCA and interpreting the identified population substructures.
Local ancestry estimates may generate spurious results for admixture scans
among only the cases. We observed regions with elevated ancestry estimates in both
cases and controls among the MEC African American samples (as shown in Table 4-4).
Further research is needed to investigate the underlying causes of this bias and how to
account for it in the estimation of local ancestry. In addition, there is a need for more
robust statistical methods for ancestry estimation (i.e. methods that give robust
estimates when the exact reference population is not available or not known).
Another challenge for association studies is over-adjustment, for example, in
controlling for confounding. To guard against potential population stratification, we
suggest adjusting for individual global ancestry in the regression model. This has been
shown to be the most efficient approach in the literature as well as in our simulations
and real data analysis. However, we may also lose power at loci that are in LD with the
disease causal locus. Figure 3-3 (a) shows the simulation result under the scenario in
which there is no global ancestry effect on the disease. The test is at a marker locus that
is in LD with the underlying disease locus. The model adjusting for global ancestry (red
dashed line) has less power than the crude model (blue dotted line), especially when
there is a substantial allele frequency difference between ancestries.
The conclusions and suggestions we give in Chapters 3 and 4 are based on the
GWAS design. For Next Generation Sequencing, our conclusions may change
accordingly. First of all, the causal locus is more likely to be genotyped in sequencing
data than in GWAS data. Therefore, there will still be confounding issues, but
heterogeneity by local ancestry will be less likely.
Another interesting area is how to use ancestry information to facilitate
genotyping (especially for rare variants), association analysis, and the summary of the
results. Genotyping of rare variants is a big challenge because of the sparse data in the
cluster of the minor genotype (or even missing genotype clusters). Current approaches
borrow information on the cluster distributions from a set of other SNPs. It is also
possible to use local ancestry information as a prior for the genotype clusters of rare
variants, e.g. to obtain a better estimate of the haplotype probabilities around the
region. For analyzing sequencing data, local ancestry information could also be used to
generate priors for selecting SNPs for a joint analysis. For example, the results from
admixture mapping could suggest supporting regions, as described in Figure 4-6
(Region B). However, this information should be considered together with the allele
frequencies, because an admixture scan is only powerful at loci that have substantial
allele frequency differences between ancestral populations (with the further caveat that
the disease prevalence has to differ between ancestral populations).
In addition, the sequencing data discussed above come from sequencing each
individual sample. An alternative approach is pooled sequencing, in which cases and
controls are grouped into different pools. In this case, population stratification becomes
an issue for samples from admixed populations.
Bibliography
"The Encode (Encyclopedia of DNA Elements) Project." Science 306.5696 (2004):
636-40. Print.
"The International Hapmap Project." Nature 426.6968 (2003): 789-96. Print.
Adeyemo, A., et al. "A Genome-Wide Association Study of Hypertension and Blood
Pressure in African Americans." PLoS Genet 5.7 (2009): e1000564. Print.
Aldrich, M. C., et al. "Comparison of Statistical Methods for Estimating Genetic
Admixture in a Lung Cancer Study of African Americans and Latinos." Am J
Epidemiol 168.9 (2008): 1035-46. Print.
Alexander, D. H., J. Novembre, and K. Lange. "Fast Model-Based Estimation of
Ancestry in Unrelated Individuals." Genome Res 19.9 (2009): 1655-64. Print.
Altshuler, D. M., et al. "Integrating Common and Rare Genetic Variation in Diverse
Human Populations." Nature 467.7311 (2010): 52-8. Print.
Arking, D. E., et al. "Genome-Wide Association Study Identifies Gpc5 as a Novel
Genetic Locus Protective against Sudden Cardiac Arrest." PLoS One 5.3 (2010):
e9879. Print.
Barnholtz-Sloan, J. S., et al. "Fgfr2 and Other Loci Identified in Genome-Wide
Association Studies Are Associated with Breast Cancer in African-American and
Younger Women." Carcinogenesis 31.8 (2010): 1417-23. Print.
Benyamin, B., et al. "Variants in Tf and Hfe Explain Approximately 40% of Genetic
Variation in Serum-Transferrin Levels." Am J Hum Genet 84.1 (2009): 60-5. Print.
Bertoni, B., et al. "Admixture in Hispanics: Distribution of Ancestral Population
Contributions in the Continental United States." Hum Biol 75.1 (2003): 1-11. Print.
Bilguvar, K., et al. "Susceptibility Loci for Intracranial Aneurysm in European and
Japanese Populations." Nat Genet 40.12 (2008): 1472-7. Print.
Birlea, S. A., et al. "Genome-Wide Association Study of Generalized Vitiligo in an
Isolated European Founder Population Identifies Smoc2, in Close Proximity to
Iddm8." J Invest Dermatol 130.3 (2010): 798-803. Print.
Boger, C. A., et al. "Cubn Is a Gene Locus for Albuminuria." J Am Soc Nephrol 22.3
(2011): 555-70. Print.
Bonilla, C., et al. "Admixture in the Hispanics of the San Luis Valley, Colorado, and Its
Implications for Complex Trait Gene Mapping." Ann Hum Genet 68.Pt 2 (2004):
139-53. Print.
Bostrom, M. A., et al. "Candidate Genes for Non-Diabetic Esrd in African Americans:
A Genome-Wide Association Study Using Pooled DNA." Hum Genet 128.2 (2010):
195-204. Print.
Broderick, P., et al. "A Genome-Wide Association Study Shows That Common Alleles
of Smad7 Influence Colorectal Cancer Risk." Nat Genet 39.11 (2007): 1315-7. Print.
Bryc, K., et al. "Colloquium Paper: Genome-Wide Patterns of Population Structure
and Admixture among Hispanic/Latino Populations." Proc Natl Acad Sci U S A 107
Suppl 2 (2010): 8954-61. Print.
Carter-Pokras, O. D., and P. J. Gergen. "Reported Asthma among Puerto Rican,
Mexican-American, and Cuban Children, 1982 through 1984." Am J Public Health
83.4 (1993): 580-2. Print.
Cavalli-Sforza, L. L., and M. W. Feldman. "The Application of Molecular Genetic
Approaches to the Study of Human Evolution." Nat Genet 33 Suppl (2003): 266-75.
Print.
Cavalli-Sforza, L. L., Paolo Menozzi, and Alberto Piazza. The History and Geography of
Human Genes. Princeton, N.J.: Princeton University Press, 1994. Print.
Chalasani, N., et al. "Genome-Wide Association Study Identifies Variants Associated
with Histologic Features of Nonalcoholic Fatty Liver Disease." Gastroenterology
139.5 (2010): 1567-76, 76 e1-6. Print.
Chanock, S. J. "A Twist on Admixture Mapping." Nat Genet 43.3 (2011): 178-9. Print.
Charles, B. A., et al. "A Genome-Wide Association Study of Serum Uric Acid in African
Americans." BMC Med Genomics 4 (2011): 17. Print.
Chen, Z. J., et al. "Genome-Wide Association Study Identifies Susceptibility Loci for
Polycystic Ovary Syndrome on Chromosome 2p16.3, 2p21 and 9q33.3." Nat Genet
43.1 (2011): 55-9. Print.
Cho, Y. S., et al. "A Large-Scale Genome-Wide Association Study of Asian Populations
Uncovers Genetic Factors Influencing Eight Quantitative Traits." Nat Genet 41.5
(2009): 527-34. Print.
Choudhry, S., et al. "Population Stratification Confounds Genetic Association Studies
among Latinos." Hum Genet 118.5 (2006): 652-64. Print.
Choudhry, S., et al. "Dissecting Complex Diseases in Complex Populations: Asthma in
Latino Americans." Proc Am Thorac Soc 4.3 (2007): 226-33. Print.
Choudhry, S., et al. "Genome-Wide Screen for Asthma in Puerto Ricans: Evidence for
Association with 5q23 Region." Hum Genet 123.5 (2008): 455-68. Print.
Cooper, R. S., B. Tayo, and X. Zhu. "Genome-Wide Association Studies: Implications
for Multiethnic Samples." Hum Mol Genet 17.R2 (2008): R151-5. Print.
Cui, R., et al. "Common Variant in 6q26-Q27 Is Associated with Distal Colon Cancer
in an Asian Population." Gut (2011). Print.
Denavas, C., and M. A. Hall. "The Hispanic Population in the United States: March
1986 and 1987." Curr Popul Rep Popul Charact.434 (1988): 1-89. Print.
Devlin, B., and K. Roeder. "Genomic Control for Association Studies." Biometrics 55.4
(1999): 997-1004. Print.
Ding, L., et al. "Comparison of Measures of Marker Informativeness for Ancestry and
Admixture Mapping." BMC Genomics 12.1 (2011): 622. Print.
Eijgelsheim, M., et al. "Genome-Wide Association Analysis Identifies Multiple Loci
Related to Resting Heart Rate." Hum Mol Genet 19.19 (2010): 3885-94. Print.
Engelhardt, B. E., and M. Stephens. "Analysis of Population Structure: A Unifying
Framework and Novel Methods Based on Sparse Factor Analysis." PLoS Genet 6.9
(2010). Print.
Fagan, Brian M. The Great Journey : The Peopling of Ancient America. New York, N.Y.:
Thames and Hudson, 1987. Print.
Falush, D., M. Stephens, and J. K. Pritchard. "Inference of Population Structure Using
Multilocus Genotype Data: Dominant Markers and Null Alleles." Mol Ecol Notes 7.4
(2007): 574-78. Print.
---. "Inference of Population Structure Using Multilocus Genotype Data: Linked Loci
and Correlated Allele Frequencies." Genetics 164.4 (2003): 1567-87. Print.
Fejerman, L., et al. "Admixture Mapping Identifies a Locus on 6q25 Associated with
Breast Cancer Risk in Us Latinas." Hum Mol Genet (2012). Print.
Freeman, N. C., D. Schneider, and P. McGarvey. "Household Exposure Factors,
Asthma, and School Absenteeism in a Predominantly Hispanic Community." J Expo
Anal Environ Epidemiol 13.3 (2003): 169-76. Print.
Gabriel, S. B., et al. "The Structure of Haplotype Blocks in the Human Genome."
Science 296.5576 (2002): 2225-9. Print.
Garcia-Barcelo, M. M., et al. "Genome-Wide Association Study Identifies Nrg1 as a
Susceptibility Locus for Hirschsprung's Disease." Proc Natl Acad Sci U S A 106.8
(2009): 2694-9. Print.
Garcia-Barcelo, M. M., et al. "Genome-Wide Association Study Identifies a
Susceptibility Locus for Biliary Atresia on 10q24.2." Hum Mol Genet 19.14 (2010):
2917-25. Print.
Gauderman, W. J., J. S. Witte, and D. C. Thomas. "Family-Based Association Studies." J
Natl Cancer Inst Monogr.26 (1999): 31-7. Print.
Gonzalez Burchard, E., et al. "Latino Populations: A Unique Opportunity for the
Study of Race, Genetics, and Social Environment in Epidemiological Research." Am J
Public Health 95.12 (2005): 2161-8. Print.
Graham, R. R., et al. "Genetic Variants near Tnfaip3 on 6q23 Are Associated with
Systemic Lupus Erythematosus." Nat Genet 40.9 (2008): 1059-61. Print.
Greenland, S. "Quantifying Biases in Causal Models: Classical Confounding Vs
Collider-Stratification Bias." Epidemiology 14.3 (2003): 300-6. Print.
Greenland, S., J. Pearl, and J. M. Robins. "Causal Diagrams for Epidemiologic
Research." Epidemiology 10.1 (1999): 37-48. Print.
Gudbjartsson, D. F., et al. "Variants Conferring Risk of Atrial Fibrillation on
Chromosome 4q25." Nature 448.7151 (2007): 353-7. Print.
Guo, Y., et al. "Genome-Wide Association Study Identifies Aldh7a1 as a Novel
Susceptibility Gene for Osteoporosis." PLoS Genet 6.1 (2010): e1000806. Print.
Haiman, C. A., et al. "Genome-Wide Association Study of Prostate Cancer in Men of
African Ancestry Identifies a Susceptibility Locus at 17q21." Nat Genet 43.6 (2011):
570-3. Print.
Haiman, C. A., and D. O. Stram. "Exploring Genetic Susceptibility to Cancer in Diverse
Populations." Curr Opin Genet Dev 20.3 (2010): 330-5. Print.
Han, J. W., et al. "Genome-Wide Association Study in a Chinese Han Population
Identifies Nine New Susceptibility Loci for Systemic Lupus Erythematosus." Nat
Genet 41.11 (2009): 1234-7. Print.
Hattori, E., et al. "Preliminary Genome-Wide Association Study of Bipolar Disorder
in the Japanese Population." Am J Med Genet B Neuropsychiatr Genet 150B.8 (2009):
1110-7. Print.
Hayes, M. G., et al. "Identification of Type 2 Diabetes Genes in Mexican Americans
through Genome-Wide Association Studies." Diabetes 56.12 (2007): 3033-44. Print.
Hicks, A. A., et al. "Genetic Determinants of Circulating Sphingolipid Concentrations
in European Populations." PLoS Genet 5.10 (2009): e1000672. Print.
Hindorff, L. A., et al. "Potential Etiologic and Functional Implications of Genome-Wide
Association Loci for Human Diseases and Traits." Proc Natl Acad Sci U S A
106.23 (2009): 9362-7. Print.
Hiura, Y., et al. "Identification of Genetic Markers Associated with High-Density
Lipoprotein-Cholesterol by Genome-Wide Screening in a Japanese Population: The
Suita Study." Circ J 73.6 (2009): 1119-26. Print.
Hiura, Y., et al. "A Genome-Wide Association Study of Hypertension-Related
Phenotypes in a Japanese Population." Circ J 74.11 (2010): 2353-9. Print.
Hoggart, C. J., et al. "Control of Confounding of Genetic Associations in Stratified
Populations." Am J Hum Genet 72.6 (2003): 1492-504. Print.
Hoggart, C. J., et al. "Design and Analysis of Admixture Mapping Studies." Am J Hum
Genet 74.5 (2004): 965-78. Print.
Homa, D. M., D. M. Mannino, and M. Lara. "Asthma Mortality in U.S. Hispanics of
Mexican, Puerto Rican, and Cuban Heritage, 1990-1995." Am J Respir Crit Care Med
161.2 Pt 1 (2000): 504-9. Print.
Hor, H., et al. "Genome-Wide Association Study Identifies New Hla Class Ii
Haplotypes Strongly Protective against Narcolepsy." Nat Genet 42.9 (2010): 786-9.
Print.
Hubisz, MJ, et al. "Inferring Weak Population Structure with the Assistance of
Sample Group Information." Molecular Ecology Resources 9.5 (2009): 1322-32. Print.
Hunter, D. J., et al. "A Genome-Wide Association Study Identifies Alleles in Fgfr2
Associated with Risk of Sporadic Postmenopausal Breast Cancer." Nat Genet 39.7
(2007): 870-4. Print.
Jorm, A. F., and S. Easteal. "Assessing Candidate Genes as Risk Factors for Mental
Disorders: The Value of Population-Based Epidemiological Studies." Soc Psychiatry
Psychiatr Epidemiol 35.1 (2000): 1-4. Print.
Kamatani, Y., et al. "Genome-Wide Association Study of Hematological and
Biochemical Traits in a Japanese Population." Nat Genet 42.3 (2010): 210-5. Print.
Kang, S. J., et al. "Assessing the Impact of Global Versus Local Ancestry in Association
Studies." BMC Proc 3 Suppl 7 (2009): S107. Print.
Kim, H., et al. "Genome-Wide Association Study of Acute Post-Surgical Pain in
Humans." Pharmacogenomics 10.2 (2009): 171-9. Print.
Kim, J. J., et al. "A Genome-Wide Association Analysis Reveals 1p31 and 2p13.3 as
Susceptibility Loci for Kawasaki Disease." Hum Genet 129.5 (2011): 487-95. Print.
Kim, S., et al. "Genome-Wide Association Study of Csf Biomarkers Abeta1-42, T-Tau,
and P-Tau181p in the Adni Cohort." Neurology 76.1 (2011): 69-79. Print.
Kottgen, A., et al. "Multiple Loci Associated with Indices of Renal Function and
Chronic Kidney Disease." Nat Genet 41.6 (2009): 712-7. Print.
Kraft, P., et al. "Exploiting Gene-Environment Interaction to Detect Genetic
Associations." Hum Hered 63.2 (2007): 111-9. Print.
Kumar, V., et al. "Common Variants on 14q32 and 13q12 Are Associated with Dlbcl
Susceptibility." J Hum Genet (2011). Print.
Kung, A. W., et al. "Association of JAG1 with Bone Mineral Density and Osteoporotic
Fractures: A Genome-Wide Association Study and Follow-up Replication Studies."
Am J Hum Genet 86.2 (2010): 229-39. Print.
Landi, M. T., et al. "A Genome-Wide Association Study of Lung Cancer Identifies a
Region of Chromosome 5p15 Associated with Risk for Adenocarcinoma." Am J Hum
Genet 85.5 (2009): 679-91. Print.
Lascorz, J., et al. "Genome-Wide Association Study for Colorectal Cancer Identifies
Risk Polymorphisms in German Familial Cases and Implicates MAPK Signalling
Pathways in Disease Susceptibility." Carcinogenesis 31.9 (2010): 1612-9. Print.
Lee, Y. L., et al. "Comparing Genetic Ancestry and Self-Reported Race/Ethnicity in a
Multiethnic Population in New York City." J Genet 89.4 (2010): 417-23. Print.
Lei, S. F., et al. "Genome-Wide Association Scan for Stature in Chinese: Evidence for
Ethnic Specific Loci." Hum Genet 125.1 (2009): 1-9. Print.
Lessard, C. J., et al. "Identification of a Systemic Lupus Erythematosus Susceptibility
Locus at 11p13 between PDHX and CD44 in a Multiethnic Study." Am J Hum Genet
88.1 (2011): 83-91. Print.
Lettre, G., et al. "Genome-Wide Association Study of Coronary Heart Disease and Its
Risk Factors in 8,090 African Americans: The NHLBI CARe Project." PLoS Genet 7.2
(2011): e1001300. Print.
Li, Q., and K. Yu. "Improved Correction for Population Stratification in Genome-Wide
Association Studies by Identifying Hidden Population Structures." Genet Epidemiol
32.3 (2008): 215-26. Print.
Li, Y. F., et al. "Glutathione S-Transferase P1, Maternal Smoking, and Asthma in
Children: A Haplotype-Based Analysis." Environ Health Perspect 116.3 (2008): 409-15. Print.
Liu, Y. Z., et al. "Identification of PLCL1 Gene for Hip Bone Size Variation in Females in
a Genome-Wide Association Study." PLoS One 3.9 (2008): e3160. Print.
Low, S. K., et al. "Genome-Wide Association Study of Pancreatic Cancer in Japanese
Population." PLoS One 5.7 (2010): e11824. Print.
Ma, D., et al. "A Genome-Wide Association Study of Autism Reveals a Common Novel
Risk Locus at 5p14.1." Ann Hum Genet 73.Pt 3 (2009): 263-73. Print.
McConnell, R., et al. "Air Pollution and Bronchitic Symptoms in Southern California
Children with Asthma." Environ Health Perspect 107.9 (1999): 757-60. Print.
McKay, J. D., et al. "A Genome-Wide Association Study of Upper Aerodigestive Tract
Cancers Conducted within the INHANCE Consortium." PLoS Genet 7.3 (2011):
e1001333. Print.
McKeigue, P. M. "Mapping Genes That Underlie Ethnic Differences in Disease Risk:
Methods for Detecting Linkage in Admixed Populations, by Conditioning on Parental
Admixture." Am J Hum Genet 63.1 (1998): 241-51. Print.
Miclaus, K., R. Wolfinger, and W. Czika. "SNP Selection and Multidimensional Scaling
to Quantify Population Structure." Genet Epidemiol 33.6 (2009): 488-96. Print.
Montana, G., and J. K. Pritchard. "Statistical Tests for Admixture Mapping with Case-Control
and Cases-Only Data." Am J Hum Genet 75.5 (2004): 771-89. Print.
Navidi, W., et al. "Design and Analysis of Multilevel Analytic Studies with
Applications to a Study of Air Pollution." Environ Health Perspect 102 Suppl 8
(1994): 25-32. Print.
Ng, C. C., et al. "A Genome-Wide Association Study Identifies ITGA9 Conferring Risk of
Nasopharyngeal Carcinoma." J Hum Genet 54.7 (2009): 392-7. Print.
Norris, J. M., et al. "Genome-Wide Association Study and Follow-up Analysis of
Adiposity Traits in Hispanic Americans: The IRAS Family Study." Obesity (Silver
Spring) 17.10 (2009): 1932-41. Print.
O'Seaghdha, C. M., et al. "Common Variants in the Calcium-Sensing Receptor Gene
Are Associated with Total Serum Calcium Levels." Hum Mol Genet 19.21 (2010):
4296-303. Print.
Org, E., et al. "Genome-Wide Scan Identifies CDH13 as a Novel Susceptibility Locus
Contributing to Blood Pressure Determination in Two European Populations." Hum
Mol Genet 18.12 (2009): 2288-96. Print.
Palmer, N. D., et al. "Candidate Loci for Insulin Sensitivity and Disposition Index
from a Genome-Wide Association Analysis of Hispanic Participants in the Insulin
Resistance Atherosclerosis (IRAS) Family Study." Diabetologia 53.2 (2010): 281-9.
Print.
Panoutsopoulou, K., et al. "Insights into the Genetic Architecture of Osteoarthritis
from Stage 1 of the arcOGEN Study." Ann Rheum Dis 70.5 (2011): 864-7. Print.
Pasaniuc, B., et al. "Inference of Locus-Specific Ancestry in Closely Related
Populations." Bioinformatics 25.12 (2009): i213-21. Print.
Pasaniuc, B., et al. "Enhanced Statistical Tests for GWAS in Admixed Populations:
Assessment Using African Americans from CARe and a Breast Cancer Consortium."
PLoS Genet 7.4 (2011): e1001371. Print.
Patterson, N., et al. "Methods for High-Density Admixture Mapping of Disease
Genes." Am J Hum Genet 74.5 (2004): 979-1000. Print.
Pillai, S. G., et al. "A Genome-Wide Association Study in Chronic Obstructive
Pulmonary Disease (COPD): Identification of Two Major Susceptibility Loci." PLoS
Genet 5.3 (2009): e1000421. Print.
Price, A. L., et al. "Principal Components Analysis Corrects for Stratification in
Genome-Wide Association Studies." Nat Genet 38.8 (2006): 904-9. Print.
Price, A. L., et al. "A Genomewide Admixture Map for Latino Populations." Am J Hum
Genet 80.6 (2007): 1024-36. Print.
Price, A. L., et al. "Sensitive Detection of Chromosomal Segments of Distinct Ancestry
in Admixed Populations." PLoS Genet 5.6 (2009): e1000519. Print.
Pritchard, J. K., M. Stephens, and P. Donnelly. "Inference of Population Structure
Using Multilocus Genotype Data." Genetics 155.2 (2000): 945-59. Print.
Pulit, S. L., B. F. Voight, and P. I. de Bakker. "Multiethnic Genetic Association Studies
Improve Power for Locus Discovery." PLoS One 5.9 (2010): e12600. Print.
Qin, H., et al. "Interrogating Local Population Structure for Fine Mapping in Genome-Wide
Association Studies." Bioinformatics 26.23 (2010): 2961-8. Print.
Reibman, J., and M. Liu. "Genetics and Asthma Disease Susceptibility in the US Latino
Population." Mt Sinai J Med 77.2 (2010): 140-8. Print.
Reilly, M. P., et al. "Identification of ADAMTS7 as a Novel Locus for Coronary
Atherosclerosis and Association of ABO with Myocardial Infarction in the Presence
of Coronary Atherosclerosis: Two Genome-Wide Association Studies." Lancet
377.9763 (2011): 383-92. Print.
Rich, S. S., et al. "A Genome-Wide Association Scan for Acute Insulin Response to
Glucose in Hispanic-Americans: The Insulin Resistance Atherosclerosis Family Study
(IRAS FS)." Diabetologia 52.7 (2009): 1326-33. Print.
Risch, N., et al. "Categorization of Humans in Biomedical Research: Genes, Race and
Disease." Genome Biol 3.7 (2002): comment2007. Print.
Rosenberg, N. A., et al. "Genome-Wide Association Studies in Diverse Populations."
Nat Rev Genet 11.5 (2010): 356-66. Print.
Ryu, E., et al. "Genome-Wide Association Analyses of Genetic, Phenotypic, and
Environmental Risks in the Age-Related Eye Disease Study." Mol Vis 16 (2010):
2811-21. Print.
Salanti, G., S. Sanderson, and J. P. Higgins. "Obstacles and Opportunities in Meta-Analysis
of Genetic Association Studies." Genet Med 7.1 (2005): 13-20. Print.
Salari, K., et al. "Genetic Admixture and Asthma-Related Phenotypes in Mexican
American and Puerto Rican Asthmatics." Genet Epidemiol 29.1 (2005): 76-86. Print.
Satake, W., et al. "Genome-Wide Association Study Identifies Common Variants at
Four Loci as Genetic Risk Factors for Parkinson's Disease." Nat Genet 41.12 (2009):
1303-7. Print.
Satten, G. A., W. D. Flanders, and Q. Yang. "Accounting for Unmeasured Population
Substructure in Case-Control Studies of Genetic Association Using a Novel Latent-Class
Model." Am J Hum Genet 68.2 (2001): 466-77. Print.
Saxena, R., et al. "Genome-Wide Association Analysis Identifies Loci for Type 2
Diabetes and Triglyceride Levels." Science 316.5829 (2007): 1331-6. Print.
Seldin, M. F., et al. "European Population Substructure: Clustering of Northern and
Southern Populations." PLoS Genet 2.9 (2006): e143. Print.
Serre, D., et al. "Correction of Population Stratification in Large Multi-Ethnic
Association Studies." PLoS One 3.1 (2008): e1382. Print.
Setakis, E., H. Stirnadel, and D. J. Balding. "Logistic Regression Protects against
Population Structure in Genetic Association Studies." Genome Res 16.2 (2006): 290-6. Print.
Shriner, D., A. Adeyemo, and C. N. Rotimi. "Joint Ancestry and Association Testing in
Admixed Individuals." PLoS Comput Biol 7.12 (2011): e1002325. Print.
Shtir, C. J., et al. "Variation in Genetic Admixture and Population Structure among
Latinos: The Los Angeles Latino Eye Study (LALES)." BMC Genet 10 (2009): 71. Print.
Shu, X. O., et al. "Identification of New Genetic Risk Variants for Type 2 Diabetes."
PLoS Genet 6.9 (2010). Print.
Simon-Sanchez, J., et al. "Genome-Wide Association Study Reveals Genetic Risk
Underlying Parkinson's Disease." Nat Genet 41.12 (2009): 1308-12. Print.
Simon-Sanchez, J., et al. "Genome-Wide Association Study Confirms Extant PD Risk
Loci among the Dutch." Eur J Hum Genet (2011). Print.
Smith, M. W., et al. "A High-Density Admixture Map for Disease Gene Discovery in
African Americans." Am J Hum Genet 74.5 (2004): 1001-13. Print.
Song, H., et al. "A Genome-Wide Association Study Identifies a New Ovarian Cancer
Susceptibility Locus on 9p22.2." Nat Genet 41.9 (2009): 996-1000. Print.
Stacey, S. N., et al. "Ancestry-Shift Refinement Mapping of the C6orf97-ESR1 Breast
Cancer Susceptibility Locus." PLoS Genet 6.7 (2010): e1001029. Print.
Tan, L., et al. "A Genome-Wide Association Analysis Implicates SOX6 as a Candidate
Gene for Wrist Bone Mass." Sci China Life Sci 53.9 (2010): 1065-72. Print.
Tanaka, Y., et al. "Genome-Wide Association of IL28B with Response to Pegylated
Interferon-Alpha and Ribavirin Therapy for Chronic Hepatitis C." Nat Genet 41.10
(2009): 1105-9. Print.
Tenesa, A., et al. "Genome-Wide Association Scan Identifies a Colorectal Cancer
Susceptibility Locus on 11q23 and Replicates Risk Loci at 8q24 and 18q21." Nat
Genet 40.5 (2008): 631-7. Print.
Teslovich, T. M., et al. "Biological, Clinical and Population Relevance of 95 Loci for
Blood Lipids." Nature 466.7307 (2010): 707-13. Print.
Thomas, D. C., and J. S. Witte. "Point: Population Stratification: A Problem for Case-Control
Studies of Candidate-Gene Associations?" Cancer Epidemiol Biomarkers Prev
11.6 (2002): 505-12. Print.
Tian, C., et al. "Analysis and Application of European Genetic Substructure Using 300
K SNP Information." PLoS Genet 4.1 (2008): e4. Print.
Tomlinson, I. P., et al. "A Genome-Wide Association Study Identifies Colorectal
Cancer Susceptibility Loci on Chromosomes 10p14 and 8q23.3." Nat Genet 40.5
(2008): 623-30. Print.
Tomlinson, I., et al. "A Genome-Wide Association Scan of Tag SNPs Identifies a
Susceptibility Variant for Colorectal Cancer at 8q24.21." Nat Genet 39.8 (2007): 984-8. Print.
Tsai, F. J., et al. "Identification of Novel Susceptibility Loci for Kawasaki Disease in a
Han Chinese Population by a Genome-Wide Association Study." PLoS One 6.2
(2011): e16853. Print.
Tse, K. P., et al. "Genome-Wide Association Study Reveals Multiple Nasopharyngeal
Carcinoma-Associated Loci within the HLA Region at Chromosome 6p21.3." Am J
Hum Genet 85.2 (2009): 194-203. Print.
Unoki, H., et al. "SNPs in KCNQ1 Are Associated with Susceptibility to Type 2 Diabetes
in East Asian and European Populations." Nat Genet 40.9 (2008): 1098-102. Print.
Van Laer, L., et al. "A Genome-Wide Association Study for Age-Related Hearing
Impairment in the Saami." Eur J Hum Genet 18.6 (2010): 685-93. Print.
Via, M., et al. "The Role of LTA4H and ALOX5AP Genes in the Risk for Asthma in
Latinos." Clin Exp Allergy 40.4 (2010): 582-9. Print.
Voight, B. F., et al. "Twelve Type 2 Diabetes Susceptibility Loci Identified through
Large-Scale Association Analysis." Nat Genet 42.7 (2010): 579-89. Print.
Wacholder, S., N. Rothman, and N. Caporaso. "Population Stratification in
Epidemiologic Studies of Common Genetic Variants and Cancer: Quantification of
Bias." J Natl Cancer Inst 92.14 (2000): 1151-8. Print.
Wallace, C., et al. "Genome-Wide Association Study Identifies Genes for Biomarkers
of Cardiovascular Disease: Serum Urate and Dyslipidemia." Am J Hum Genet 82.1
(2008): 139-49. Print.
Wang, F., et al. "Genome-Wide Association Identifies a Susceptibility Locus for
Coronary Artery Disease in the Chinese Han Population." Nat Genet 43.4 (2011):
345-9. Print.
Wang, K., et al. "Integrative Genomics Identifies LMO1 as a Neuroblastoma
Oncogene." Nature 469.7329 (2011): 216-20. Print.
Wang, X., et al. "Adjustment for Local Ancestry in Genetic Association Analysis
of Admixed Populations." Bioinformatics (2010). Print.
Waters, K. M., et al. "Consistent Association of Type 2 Diabetes Risk Variants Found
in Europeans in Diverse Racial and Ethnic Groups." PLoS Genet 6.8 (2010). Print.
Wijsman, E. M., et al. "Genome-Wide Association of Familial Late-Onset Alzheimer's
Disease Replicates BIN1 and CLU and Nominates CUGBP2 in Interaction with APOE."
PLoS Genet 7.2 (2011): e1001308. Print.
Xiong, D. H., et al. "Genome-Wide Association and Follow-up Replication Studies
Identified ADAMTS18 and TGFBR3 as Bone Mass Candidate Genes in Different Ethnic
Groups." Am J Hum Genet 84.3 (2009): 388-98. Print.
Yamada, Y., et al. "Identification of CELSR1 as a Susceptibility Gene for Ischemic
Stroke in Japanese Individuals by a Genome-Wide Association Study."
Atherosclerosis 207.1 (2009): 144-9. Print.
Yasuda, K., et al. "Variants in KCNQ1 Are Associated with Susceptibility to Type 2
Diabetes Mellitus." Nat Genet 40.9 (2008): 1092-7. Print.
Yoon, K. A., et al. "A Genome-Wide Association Study Reveals Susceptibility Variants
for Non-Small Cell Lung Cancer in the Korean Population." Hum Mol Genet 19.24
(2010): 4948-54. Print.
Zanke, B. W., et al. "Genome-Wide Association Scan Identifies a Colorectal Cancer
Susceptibility Locus on Chromosome 8q24." Nat Genet 39.8 (2007): 989-94. Print.
Zhang, X. J., et al. "Psoriasis Genome-Wide Association Study Identifies Susceptibility
Variants within LCE Gene Cluster at 1q21." Nat Genet 41.2 (2009): 205-10. Print.
Zheng, C., and R. C. Elston. "Multipoint Linkage Disequilibrium Mapping with
Particular Reference to the African-American Population." Genet Epidemiol 17.2
(1999): 79-101. Print.
Zhu, X., et al. "Combined Admixture Mapping and Association Analysis Identifies a
Novel Blood Pressure Genetic Locus on 5p13: Contributions from the CARe
Consortium." Hum Mol Genet 20.11 (2011): 2285-95. Print.