
Resource

Pan-Genome of Wild and Cultivated Soybeans


Authors
Yucheng Liu, Huilong Du,
Pengcheng Li, ..., Bin Han,
Chengzhi Liang, Zhixi Tian

Correspondence
cliang@genetics.ac.cn (C.L.),
zxtian@genetics.ac.cn (Z.T.)

In Brief
A high-quality graph-based soybean pan-
genome is constructed through de novo
genome assemblies of 26 representative
wild and cultivated soybean accessions,
demonstrating the impact of structural
variation on key agronomic traits.

Highlights
• De novo genome assemblies for 26 representative soybeans
• Construction of a graph-based genome
• Identification of large structural variations and gene fusion events
• Link structural variations to gene expressions and agronomic traits

Liu et al., 2020, Cell 182, 162–176
July 9, 2020 © 2020 Elsevier Inc.
https://doi.org/10.1016/j.cell.2020.05.023

Resource
Pan-Genome of Wild and Cultivated Soybeans
Yucheng Liu,1,7,8 Huilong Du,2,7,8 Pengcheng Li,3 Yanting Shen,4 Hua Peng,2,7 Shulin Liu,1 Guo-An Zhou,1 Haikuan Zhang,3
Zhi Liu,1,7 Miao Shi,3 Xuehui Huang,5 Yan Li,6 Min Zhang,1 Zheng Wang,1 Baoge Zhu,1 Bin Han,6 Chengzhi Liang,2,7,*
and Zhixi Tian1,7,9,*
1State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing 100101, China
2State Key Laboratory of Plant Genomics, Institute of Genetics and Developmental Biology, Innovation Academy for Seed Design, Chinese Academy of Sciences, Beijing 100101, China
3Berry Genomics Corporation, Beijing 100015, China
4School of Pharmaceutical Sciences, Guangzhou University of Chinese Medicine, Guangzhou 510006, China
5College of Life Sciences, Shanghai Normal University, Shanghai 200234, China
6National Center for Gene Research, CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China
7College of Advanced Agriculture Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
8These authors contributed equally
9Lead Contact

*Correspondence: cliang@genetics.ac.cn (C.L.), zxtian@genetics.ac.cn (Z.T.)


https://doi.org/10.1016/j.cell.2020.05.023

SUMMARY

Soybean is one of the most important vegetable oil and protein feed crops. Capturing its entire genomic diversity requires a complete, high-quality pan-genome built from diverse soybean accessions. In this study, we performed individual de novo genome assemblies for 26 representative soybeans that were selected from 2,898 deeply sequenced accessions. Using these assembled genomes together with three previously reported genomes, we constructed a graph-based genome and performed pan-genome analysis, which identified numerous genetic variations that cannot be detected by direct mapping of short sequence reads onto a single reference genome. The structural variations from the 2,898 accessions that were genotyped based on the graph-based genome and the RNA sequencing (RNA-seq) data from the representative 26 accessions helped to link genetic variations to candidate genes that are responsible for important traits. This pan-genome resource will promote evolutionary and functional genomics studies in soybean.

INTRODUCTION

A growing number of reports suggest that one or a few reference genomes cannot represent the full range of genetic diversity of a species (Scherer et al., 2007; Li et al., 2010a, 2014; Hirsch et al., 2014; Lin et al., 2014; Saxena et al., 2014; Yao et al., 2015; Golicz et al., 2016a; Zhou et al., 2017; Wang et al., 2018b; Hurgobin et al., 2018), which limits the identification of genetic variants, particularly larger structural variants (SVs) such as presence/absence variants (PAVs) and copy number variants (CNVs) that have been revealed to play key roles in the genetic determination of agronomical traits (Xu et al., 2006; Shomura et al., 2008; Cook et al., 2012; Hufford et al., 2012; Hirsch et al., 2014; Lu et al., 2015; Deng et al., 2017; Lye and Purugganan, 2019). Pan-genome construction is becoming increasingly necessary (Tettelin et al., 2005; Golicz et al., 2016b; Li et al., 2017; Tao et al., 2019). In addition, conventional linear references are limited in showing the genotypes of different alleles of each locus and in the identification of larger SVs. There is a trend toward construction of graph-based genomes, which contain the variations in a population and enable fast and accurate genotyping, particularly for larger SVs (Garrison et al., 2018; Kim et al., 2019; Rakocevic et al., 2019; Eggertsson et al., 2019; Ameur, 2019).

Soybean is one of the most important vegetable oil and protein feed crops. Cultivated soybean (Glycine max [L.] Merr.) was domesticated from its wild relative (Glycine soja [Sieb. and Zucc.]) in China 5,000 years ago. At present, over 60,000 accessions adapted to different ecoregions have been developed (Carter et al., 2004; Wilson, 2008; Li et al., 2020). The first soybean reference genome of a cultivated accession (Williams 82, termed Wm82 in this study) opened the door to soybean functional genomics (Schmutz et al., 2010; Chan et al., 2012; Wang and Tian, 2015). Intergenomic comparisons demonstrate that extensive genetic diversity exists between wild soybeans and cultivated soybeans and also among cultivated soybeans from different geographic areas (Bandillo et al., 2015; Hyten et al., 2006; Li et al., 2010b; Liu et al., 2017; Shen et al., 2018; Xie et al., 2019). Recently, two other reference genomes have become available, one from "Zhonghuang 13" (ZH13) that is the most widely planted soybean cultivar in China (Shen et al., 2018; Shen et al., 2019) and another from a wild soybean


(W05) (Xie et al., 2019). Comparisons between Wm82 and these two genomes further demonstrate that a considerable number of CNVs and PAVs exist in different accessions, underscoring the need to construct a complete pan-genome from diverse soybean accessions.

A pan-genome from seven wild soybeans has been constructed using second-generation sequencing technology (Li et al., 2014). However, due to technology limitations at the time, genome assembly, annotation, and chromosome-scale SV interrogations need to be improved (Li et al., 2014). In this study, we individually de novo assembled 26 soybean genomes and constructed a graph-based genome using these assembled genomes together with three previously reported genomes. Pan-genome analyses disclosed numerous genetic variations that cannot be detected by mapping second-generation sequencing reads to a single genome, including hundreds of thousands of large structural variations and dozens of gene fusion events, which helped us identify candidate genes associated with agronomic traits.

RESULTS

De Novo Genome Assembly and Annotation of 26 Soybean Accessions
To make the pan-genome represent the full range of genetic diversity of soybean, we sequenced a total of 2,898 soybean accessions (871 from our previous studies [Fang et al., 2017; Zhou et al., 2015] and 2,027 from this study) by Illumina technology with an average coverage depth of more than 13× for each accession (Figure 1A; Table S1). These accessions included 103 wild soybeans, 1,048 landraces, and 1,747 cultivars, which were collected globally and represented the full range of soybean geographic distributions (Figure 1A). After mapping against the genome Gmax_ZH13 (Shen et al., 2019), we identified 31,870,983 single-nucleotide polymorphisms (SNPs).

Phylogenetic analyses using whole-genome SNPs classified these 2,898 accessions into six major groups (a to f), with all wild soybeans in one group and all cultivated soybeans in five groups that were correlated with their geographic distribution (Figure 1B). To construct a representative pan-genome, in addition to ZH13, we selected 26 accessions for de novo assembly. These accessions included 3 wild soybeans, 9 landraces, and 14 cultivars that best represented the 2,898 accessions in terms of phylogenetic relationships and geographic distributions (Figure 1A; Table S2). We also tended to select accessions that contributed greatly to breeding and production. For example, SoyL04, SoyL06, and SoyL09 had been used as parental lines to develop more than one hundred varieties in China; SoyC04, SoyC11, and SoyC13 are the most popular varieties in different planting regions of China.

The 26 accessions were sequenced individually using single-molecule real-time (SMRT) sequencing with an average coverage depth of 96×, optical mapping with an average coverage depth of 277×, chromosome conformation capture (Hi-C) sequencing with an average coverage depth of 136×, and Illumina sequencing (HiSeq) with an average coverage depth of 68× (Table S2). Subsequently, we performed de novo genome assembly for each accession. The contig N50 sizes of the 26 whole-genome assemblies ranged from 18.8 to 26.8 Mb with a mean of 22.6 Mb, and scaffold N50 sizes ranged from 50.3 to 52.3 Mb with a mean of 51.2 Mb. The final assembled genome sizes ranged from 992.3 Mb to 1,059.8 Mb with a mean of 1,011.6 Mb. For each accession, an average of 99% of contigs were anchored to the chromosomes (Tables 1 and S2). The Illumina reads from individual accessions were re-mapped onto the corresponding assembled genomes, and the mapping ratio reached 99.4% (Table S2), indicating a high completeness of each assembled genome.

Repetitive DNA made up 54.4% of each genome on average, with a range from 53.6% to 55.3% (Table 1). Among the repetitive sequences, long terminal repeat (LTR)-retrotransposons were the most abundant (Table S2), which is consistent with previously reported soybean genomes (Schmutz et al., 2010; Shen et al., 2018, 2019; Xie et al., 2019). To annotate the protein-coding and small RNA genes, for each of the 26 accessions, we collected 9 samples from roots, stems, leaves, flowers, and seeds at different developmental stages and performed RNA sequencing (RNA-seq) (with a mean of 8 Gb for each sample) and small RNA-seq (with a mean of 278 Mb for each sample) (Table S2). An average of 56,522 protein-coding genes, 553 microRNA (miRNA), 171 small nuclear RNA (snRNA), and 439 ribosomal RNA (rRNA) genes per genome were predicted (Tables 1 and S2). Benchmarking universal single-copy orthologs (BUSCO) evaluation showed that a mean of 95.6% of the 1,440 single-copy Embryophyta genes were completely assembled in these genomes (Table S2), indicating high completeness of the gene annotation.

Core and Dispensable Genes
We further performed pan-genome analyses for the 26 newly de novo assembled genomes plus the ZH13 genome using a reported strategy (Hirsch et al., 2014; Hu et al., 2017; Li et al., 2014; Wang et al., 2018b; Zhao et al., 2018). Ortholog analysis classified all genes from the 27 soybean genomes into 57,492 families, which was close to the number from wild soybeans (Li et al., 2014). The total gene set increased as additional genomes were added and approached a plateau when n = 25 (Figure 2A), indicating the representativeness of these 27 soybean accessions. Of the total gene set, 20,623 families were present in all 27 accessions and were defined as core genes, 8,163 families present in 25 to 26 accessions (>90% of the collection) were defined as softcore genes, 28,679 families present in 2 to 24 accessions were defined as dispensable genes, and 27 families present in only one accession were defined as private genes (Figure 2B). Although the dispensable and private gene families together accounted for a large proportion (49.9%) of the total gene set in the 27 accessions, they accounted for an average of only 19.1% of the genes in individual accessions (Figures 2B–2D; Table S3).

We found that 77.5% of the core genes and 72.1% of the softcore genes contained InterPro domains, which was much higher than the percentages in the dispensable and private genes (49.0% and 38.5%, respectively) (Figure 2E). Moreover, the nucleotide diversity (π) and dN/dS were higher in dispensable genes than in core genes (Figures 2F and 2G). These results indicated that core genes were more functionally conserved than


Figure 1. Geographic Distribution and Phylogenetic Analysis of 2,898 Resequenced Soybean Accessions
(A) Geographic distribution of the 2,898 accessions. The number of accessions collected from each region is indicated by the size of pie, and the ratio of wild soybean (G. soja, indicated by purple), landrace (green), and cultivar (orange) for each region is shown in the pie. The total number of soybean germplasm resources from each country or region is indicated by the degree of blue color.
(B) Phylogenetic tree of all accessions inferred from whole-genome SNPs. The lines with different colors indicate wild soybean (purple), landrace (green) and cultivar (orange). The geographic origin of each accession is divided into eight clades. EU, Europe; NA, North America; JAP, Japan; KOR, Korea; RUS, Russia; CHN, China; I–VI, Eco-regions of soybeans in China. The 26 accessions used for de novo assembly are indicated in the phylogenetic tree. SoyW, wild soybean; SoyL, landrace; SoyC, cultivar.
See also Table S1.

dispensable genes. Pfam enrichment and Gene Ontology (GO) analyses showed that core genes were enriched in biological processes related to growth, immune system, reproduction, cellular processes, and cellular component organization or biogenesis, and in domains such as the ring finger domain, AP2 domain, WD domain, WRKY DNA-binding domain, and bZIP transcription factors. In contrast, the dispensable and private genes were enriched for abiotic and biotic response genes, such as different NBS gene families (especially NBS-LRR) and stress upregulated nod genes (Figures S1A and S1B), which was consistent with previous findings in other plants (Gordon et al., 2017; Wang et al., 2018b; Zhao et al., 2018). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses showed that the core genes were enriched in pathways related to basic metabolism and biosynthesis of secondary metabolites, whereas the dispensable genes were more enriched in pathways related to specific metabolism, such as fatty acid biosynthesis and fatty acid degradation (Figure S1C), indicating that the dispensable and private genes may play important roles in determining the divergence of fatty acid composition in different accessions.

Sequence Variation Identification in 29 Soybean Genomes
To discover sequence variations, the 26 genomes plus three previously reported genomes, Wm82, ZH13, and W05, were compared. We anchored the 28 genome sequences onto the


Table 1. Summary of the Assembly and Annotation of 26 Soybean Genomes


Sample, Contig N50 (Mb), Scaffold N50 (Mb), Assembly Size (Mb), Chromosome Size (Mb), Chromosome Loading Ratio (%), Repetitive Content (%), Gene Number (#)
SoyW01 23.9 52.1 1,021.3 1,010.7 99.0 53.9 57,920
SoyW02 20.5 51.9 1,007.0 983.9 97.7 54.2 55,516
SoyW03 26.8 52.3 1,014.5 1,008.1 99.4 55.3 57,884
SoyL01 23.3 50.6 999.2 991.8 99.3 54.8 54,502
SoyL02 23.1 52.3 1,011.7 1,003.6 99.2 54.7 54,803
SoyL03 21.3 51.3 1,039.5 1,002.7 96.5 54.2 55,769
SoyL04 23.0 51.4 1,005.2 995.5 99.0 54.5 58,881
SoyL05 22.1 51.2 1,059.8 993.8 93.8 54.5 56,086
SoyL06 21.5 50.7 997.8 997.2 99.9 54.5 56,573
SoyL07 23.7 50.9 1,004.7 991.0 98.6 54.8 58,259
SoyL08 22.6 51.1 999.7 993.6 99.4 54.8 58,496
SoyL09 23.1 51.0 1,028.2 997.1 97.0 54.6 59,588
SoyC01 23.5 51.5 1,003.8 999.7 99.6 54.7 54,405
SoyC02 23.8 51.2 1,011.1 997.1 98.6 54.4 55,191
SoyC03 23.9 51.3 1,007.8 994.6 98.7 54.0 57,792
SoyC04 22.7 50.9 1,001.5 998.7 99.7 54.4 57,777
SoyC05 22.7 50.3 992.3 989.3 99.7 54.3 58,040
SoyC06 23.0 52.0 1,007.9 997.4 99.0 54.1 55,619
SoyC07 21.8 51.2 1,008.8 1,000.9 99.2 54.7 54,792
SoyC08 22.4 50.9 1,002.3 990.0 98.8 54.0 54,747
SoyC09 22.6 50.4 1,004.1 995.3 99.1 53.6 55,926
SoyC10 19.8 51.0 1,004.9 995.4 99.1 54.1 56,011
SoyC11 18.8 51.4 1,024.5 1,000.1 97.6 53.8 56,750
SoyC12 20.0 50.8 1,025.1 995.6 97.1 53.8 55,000
SoyC13 23.8 51.1 1,010.1 991.4 98.2 54.3 57,976
SoyC14 23.0 51.6 1,007.4 988.5 98.1 54.1 55,267
See also Table S2.
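The summary statistics quoted in the text for contig N50 and gene number can be rechecked directly from Table 1. The following is a minimal sketch, not the authors' pipeline: it copies two columns of the table into a dictionary and recomputes the range and mean. The names `TABLE1` and `summarize` are illustrative assumptions.

```python
from statistics import mean

# (contig N50 in Mb, predicted gene number) per accession, copied from Table 1.
TABLE1 = {
    "SoyW01": (23.9, 57920), "SoyW02": (20.5, 55516), "SoyW03": (26.8, 57884),
    "SoyL01": (23.3, 54502), "SoyL02": (23.1, 54803), "SoyL03": (21.3, 55769),
    "SoyL04": (23.0, 58881), "SoyL05": (22.1, 56086), "SoyL06": (21.5, 56573),
    "SoyL07": (23.7, 58259), "SoyL08": (22.6, 58496), "SoyL09": (23.1, 59588),
    "SoyC01": (23.5, 54405), "SoyC02": (23.8, 55191), "SoyC03": (23.9, 57792),
    "SoyC04": (22.7, 57777), "SoyC05": (22.7, 58040), "SoyC06": (23.0, 55619),
    "SoyC07": (21.8, 54792), "SoyC08": (22.4, 54747), "SoyC09": (22.6, 55926),
    "SoyC10": (19.8, 56011), "SoyC11": (18.8, 56750), "SoyC12": (20.0, 55000),
    "SoyC13": (23.8, 57976), "SoyC14": (23.0, 55267),
}


def summarize(values):
    """Return (min, max, mean) of an iterable of numbers."""
    values = list(values)
    return min(values), max(values), mean(values)


contig_min, contig_max, contig_mean = summarize(v[0] for v in TABLE1.values())
_, _, gene_mean = summarize(v[1] for v in TABLE1.values())

print(f"contig N50: {contig_min}-{contig_max} Mb, mean {contig_mean:.1f} Mb")
print(f"mean gene number: {gene_mean:.0f}")
```

Because the table values are already rounded, means recomputed this way can differ from the published figures in the last digit for some columns.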

ZH13 genome (Shen et al., 2019). A total of 14,604,953 SNPs and 12,716,823 small insertions and deletions (indels, defined as ≤50 bp in this work) were identified. Although the SNP number from the pan-genome was less than that from the 2,898 accessions (14,604,953 versus 31,870,983), the distributions of SNPs from these two datasets exhibited similar patterns across the genome (Figure 3A). In particular, when the SNPs with minor allele frequency (MAF) <0.01 were removed from the 2,898 accessions, the correlation between the SNP distributions from the 29 soybean genomes and the 2,898 accessions reached 0.553 (Figure S2A). We also calculated the nucleotide diversity, dN, and dS in the 29 genomes and the 2,898 accessions, respectively, and found that each of them showed high correlations between the 29 soybean genomes and the 2,898 accessions (Figure S2A), further indicating the representativeness of the selected soybean accessions.

In addition to identifying SNPs and small indels, de novo construction of the pan-genome provides a valuable platform for the identification of larger SVs (Tettelin et al., 2005; Li et al., 2017; Tao et al., 2019; Yang et al., 2019). Comparing the 28 genomes to ZH13 identified a total of 723,862 PAVs (defined as insertions or deletions >50 bp in this work), 27,531 CNVs, 21,886 translocation events (including 6,801 intra-chromosome translocations and 15,085 inter-chromosome translocations), and 3,120 inversion events (Table S4). We found that most of the PAVs had a length from 1 kb to 2 kb, translocations were concentrated from 10 kb to 30 kb, inversions mainly ranged from 100 to 200 kb, and the CNVs varied from 2 to >10 copies with an enrichment of 2 and 3 (Figure S2B).

The 723,862 PAVs made up a total of 4.71 Gb of sequence with a mean of 167.09 Mb in each accession, which accounted for 16% of the assembled genome in each accession (Table S4). We found that more than 90% of the length variation of the assembled genomes resulted from PAVs (Figure S3A; Table S4), indicating that PAVs were a major contributor to genome size variation. For example, compared with SoyW03, a total of 1.2 Mb was deleted in ZH13 in the region from 22.7 to 23.5 Mb (ZH13 genome position), whereas a total of 1.2 Mb was further deleted in the same region in SoyW02 compared to ZH13, which resulted in SoyW03 having the longest sequence and SoyW02 having the shortest sequence among the 26 genomes for chromosome 7 (Figure S3B). Similarly, for chromosome 17, SoyL03 had the longest sequence and SoyC11 had the shortest sequence. Consistent with their


Figure 2. Pan- and Core Genome Analyses of 27 Soybean Accessions
(A) Variation of gene families in the pan-genome and core genome along with an additional soybean genome.
(B) Compositions of the pan-genome and individual genomes. The histogram shows the number of gene families in the 27 genomes with different frequencies. Pie shows the proportion of the gene family marked by each composition. A yellow block with a dashed border indicates unclustered gene in each genome, with the count of gene number.
(C) Presence and absence information of pan gene families in the 27 soybean genomes.
(D) Gene number of each composition in individual genomes.
(E) Proportion of genes with InterPro domains in core, softcore, dispensable, and private genes. Orange histograms indicate the genes with InterPro domain annotation; green histograms indicate the genes without InterPro domain annotation.
(F and G) Comparisons of nucleotide diversity (π) and dN/dS in core, softcore, dispensable, and private genes. Box edges depict the interquartile range, whiskers depict 1.5× the interquartile range, and centerlines depict the median. Multiple comparisons are done by Student-Newman-Keuls test with α = 0.001.
Each row in (C) and (D) represents a soybean accession. The accessions from the top to the bottom are SoyL01, SoyC01, SoyL02, SoyC08, SoyC07, SoyC12, SoyC14, SoyC02, SoyW02, ZH13, SoyC06, SoyL03, SoyL05, SoyC10, SoyC09, SoyL06, SoyC11, SoyW03, SoyC04, SoyC13, SoyC03, SoyW01, SoyC05, SoyL07, SoyL08, SoyL04, and SoyL09.
See also Figure S1 and Table S3.
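The core, softcore, dispensable, and private categories summarized in Figure 2B follow simple presence-count thresholds across the 27 genomes. As a minimal sketch, assuming a family is described only by the number of accessions in which it is present, the rule can be written as below. `classify_family` and `REPORTED_COUNTS` are illustrative names, not part of the authors' code; the counts themselves are the family numbers stated in the text.

```python
N_GENOMES = 27  # 26 newly assembled genomes plus ZH13


def classify_family(n_present, n_genomes=N_GENOMES):
    """Assign a gene family to a pan-genome category from its presence count."""
    if not 1 <= n_present <= n_genomes:
        raise ValueError("presence count must be between 1 and n_genomes")
    if n_present == n_genomes:
        return "core"
    if n_present > 0.9 * n_genomes:  # 25 or 26 of 27 accessions
        return "softcore"
    if n_present >= 2:               # 2 to 24 of 27 accessions
        return "dispensable"
    return "private"                 # exactly one accession


# Family counts per category as stated in the text.
REPORTED_COUNTS = {
    "core": 20623,
    "softcore": 8163,
    "dispensable": 28679,
    "private": 27,
}

total_families = sum(REPORTED_COUNTS.values())
variable_share = (
    REPORTED_COUNTS["dispensable"] + REPORTED_COUNTS["private"]
) / total_families

print(total_families)                 # total number of gene families
print(f"{100 * variable_share:.1f}")  # percent dispensable plus private
```

The four reported counts sum to the 57,492 families given in the text, and the dispensable plus private share reproduces the stated 49.9%.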

chromosome lengths, compared with ZH13, a total of 1.2 Mb of insertion was found in SoyL03 in the region from 26.5 to 26.6 Mb (ZH13 genome position), whereas a total of 0.1 Mb was deleted in the same region in SoyC11 (Figure S3C).

Graph-Based Genome and SV Characterization
To construct an integrated graph-based genome for soybean using the 29 individually de novo assembled genomes, we merged the 776,399 SVs from all genomes into a set of 124,222 nonredundant SVs. Similar to the patterns of core and dispensable gene families (Figure 2A), the nonredundant SV set grew as additional samples were added and tended to flatten. In parallel, the set of shared SVs declined, leaving a final 130 SVs shared in all samples (Figure 3B). Based on the allele frequency of these SVs, we classified the SVs into four categories: core (present in all 28 samples), softcore (present in >90% of samples but not all; 26–27), dispensable (present in more than one but <90% of samples; 2–25), or private (present in only one sample). We found that wild soybeans had a higher ratio of private SVs (average of 22.2%) than cultivated soybeans (average of 6.7%) (Figure 3C). However, there is an exception: Wm82 had the highest ratio of private SVs, which may be due to the genome assembly of Wm82.a2.v1 being mainly based on second-generation sequencing technology.

We found that the SVs tended to be enriched in repetitive DNA regions (Figure 3D). This pattern is consistent with our previous investigation of nonreference transposons using short-read resequencing data (Tian et al., 2012). Nevertheless, many more PAVs were identified in this study (25,800 per sample) than in the previous analysis (1,100 per sample). We investigated the sequence composition of each PAV and found that 78.5% of the PAVs came from repetitive DNA (Figure 3E), which supported the theory that variation of repetitive sequences largely contributed to the divergence of different genomes (Kumar and Bennetzen, 1999).

Subsequently, we built an integrated graph-based genome with the PAVs containing less than 90% repetitive DNA using the ZH13 genome as a standard linear base reference genome. Then, we mapped the resequencing short reads of the 2,898 accessions onto the graph-based genome and identified a total of 55,402 SVs. The precision, recall, and F1 score were 0.94, 0.75, and 0.83, respectively, which are comparable to those of yeast


Figure 3. Genetic Variations from 29 Soybean Genomes and 2,898 Resequenced Accessions
(A) Distribution of genetic variations from 29 genomes and 2,898 resequenced soybean accessions. (a) Gene density. (b–e) SNP density, π, dN, and dS from 29 genomes and 2,898 resequenced soybean accessions. The black block represents a large repetitive region. (f) Distribution of larger structural variations and repeat sequences across the soybean genome. Red bars indicate numbers of larger structural variations. Blue bars indicate repeat contents. Orange bars indicate the average similarity of 29 genomes to ZH13.
(B) Variants from each sample are merged using a nonredundant strategy starting with W05 and iteratively adding unique calls from additional accessions.
(C) The number of variants in each discovery class is shown per accession.
(D) Structural variation density from repeat/non-repeat genome regions by continuous 500-kb windows. Significance was tested by Fisher's exact test; *** p < 0.001.
(E) Structural variation number plotted against repetitive DNA coverage.
(F) Structural variations from 2,898 accessions plotted against discovery frequency. The structural variations are identified by mapping short reads against the integrated graph-based genome.
(G) Numbers of detected and novel SVs from wild soybeans, landraces, and cultivars. Box edges depict the interquartile range, whiskers 1.5× the interquartile range, and centerlines the median.
(H) GWAS of seed luster using the PAVs genotyped based on the graph-based genome.
(I) A 10-kb PAV results in the presence-and-absence of SoyZH13_15G114704, an HPS encoding gene.
(J) Comparison of the seed luster variation between the two haplotypes of the 10-kb PAV.
See also Figures S2, S3, and S4 and Tables S3 and S4.
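The precision, recall, and F1 score reported for SV genotyping against the graph-based genome are related by the standard harmonic-mean formula, F1 = 2PR / (P + R). The sketch below recomputes F1 from the reported precision and recall. The true-positive, false-positive, and false-negative counts are hypothetical values chosen only to reproduce those two ratios; the text does not give the underlying counts.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive, and false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts (not from the paper) that give precision 0.94 and recall 0.75.
precision, recall = precision_recall(tp=282, fp=18, fn=94)
f1 = f1_score(precision, recall)

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With the reported precision of 0.94 and recall of 0.75, the formula gives about 0.834, which matches the stated F1 of 0.83 after rounding.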


(Hickey et al., 2020). Consistent with the results from humans (Audano et al., 2019), the number of identified SVs decreased as discovery frequency increased (Figure 3F). In addition to the SVs from the 29 genomes, 3,584 novel SVs were identified from the 2,898 accessions, which mainly came from the lower discovery frequency categories (Figure 3F). The wild soybeans contained more SVs than landraces and cultivars, both for the SVs from the 29 genomes and for the novel SVs from the graph-based genome (Figure 3G).

The SVs from the 2,898 accessions allowed us to investigate whether they contribute to phenotypic variation of any agronomical traits. For instance, seed luster is an important trait for soybean, and a previous study suggested that the accumulation of hydrophobic protein from soybean (HPS) is associated with variation in seed luster (Gijzen et al., 2003). However, the responsible genes remain unclear. A genome-wide association study (GWAS) on seed luster using the SVs genotyped from the graph-based genome identified a significant signal on chromosome 15 (Figure 3H), within which a 10-kb PAV resulted in the presence-and-absence of an HPS encoding gene (Figure 3I). We found that the soybeans with and without the 10-kb PAV had higher ratios of luster and lusterless seeds, respectively (Figure 3J), indicating that this PAV might be one of the genetic variations responsible for controlling seed luster variation in soybean.

Sequence Variations and Paleopolyploid
Previous analyses from the reference genome Wm82 revealed that a recent genome-wide duplication (WGD) in soybean occurred 13 million years ago and resulted in nearly 50% of the genes being present with duplications (Schmutz et al., 2010). We investigated the duplications from the recent WGD in individual genomes and found that, similar to that of Wm82, the WGD accounted for a mean of 54% of the total genome (Table S3). In addition, the WGD from individual genomes also showed a similar pattern to that of Wm82, tending to occur in gene-rich regions and away from repetitive DNA sequences (Figures S4A and S4B). It has been suggested that duplicates have a slower evolution rate than singletons in eukaryotes (Davis and Petrov, 2004; Jordan et al., 2004; Beilstein et al., 2010; Yang and Gaut, 2011; Du et al., 2012; Fang et al., 2016). Consistently, we found that the nucleotide diversity from the 29 genomes in the WGD regions was significantly lower than that in the non-WGD regions (Figure S4C). In addition, the WGD regions contained a higher ratio of core and softcore genes, whereas non-WGD regions contained a higher ratio of dispensable and private genes (Figure S4D). In parallel, more SVs were identified in the non-WGD regions although they accounted for less genome sequence than WGD regions (46% versus 54%). Moreover, WGD regions held fewer private SVs than non-WGD regions (Figure S4E), implying that genome duplication not only restrained the evolutionary rate of genes but also acted as an important genetic force shaping the evolution of SVs.

Indel-associated substitution is a general mutational mechanism in eukaryotes (Tian et al., 2008). We asked whether larger SVs, such as PAVs, could influence the single-nucleotide mutation rate as well. To this end, we extracted 1-kb sequences from each side of individual PAVs. Then the 1-kb sequences were divided into ten continuous 100-bp windows, and the nucleotide diversity of each window was calculated individually. The calculation demonstrated that nucleotide diversity increased close to the PAVs and declined monotonically with distance, tending to flatten 700 bp away from PAVs (Figure S4F). We further compared the patterns between WGD and non-WGD regions and found that although the nucleotide diversity from WGD regions declined faster and the final flattened values were lower than those from non-WGD regions, the nucleotide diversity immediately surrounding PAVs showed comparable values in WGD and non-WGD regions (Figure S4F). The results indicated that whole-genome duplication influenced the rate of decline of indel-associated substitution but had little effect on the peak substitution value that closely surrounds the PAVs.

Gene Structure Variation and Gene Fusion
Gene structure variations provide a major genetic source of trait diversity (Golicz et al., 2016a). From the pan-genome analyses, we identified a total of 27,175 genes from the 26 de novo assembled accessions that were partially absent from the ZH13 genome. Simultaneously, a total of 48,249 genes were lost from at least one of the 26 de novo assembled accessions. We found that 2.2% of the SNPs from the 29 genomes were located in the annotated protein-coding regions of the ZH13 genome. Using the protein-coding genes from ZH13 as the reference, these SNPs resulted in a total of 5,474 nonredundant premature stop codons in other accessions and 841 premature stop codons in ZH13 (Table S4). Of the indels, 3.2% were located in coding regions, of which 385,950 resulted in frameshifts in one or multiple accessions (Table S4). In addition, 8.1% of the presence-absence genes came from larger SVs. For example, a 16-kb deletion was found in SoyC13 that matched the ZH13 genome at the 52.23–52.25 Mb region on chromosome 18. Accompanied by this deletion, SoyZH13_18G184700 was lost from SoyC13. Similarly, a 23-kb insertion was found in SoyC11 that matched the ZH13 genome at 16,525,487 on chromosome 06. As a result, this insertion introduced three additional gene models, SoyC10_06G170400, SoyC10_06G170500, and SoyC10_06G170600, relative to ZH13 (Figures S3D and S3E).

Gene fusion by read-through plays important roles in gene evolution (Jones and Begun, 2005). The high-quality pan-genome data allow us not only to detect previously reported alleles but also to find new gene structure variations, including gene fusion. For example, E3 was reported to be an important gene responsible for major flowering loci in soybean (Watanabe et al., 2009), which encoded a homolog of PHYA in Arabidopsis (Figure S5A). Several alleles have been detected in natural populations (Tsubokura et al., 2014). As expected, compared with E3 (SoyZH13_19G210400) from ZH13 (termed Hap E3-Mi-1), a 2.6-kb insertion in the third intron (termed Hap E3-Ha-1) and a 13.3-kb deletion starting from the third intron (termed Hap E3-tr) were identified from our de novo genomes (Figures 4A and 4B). Furthermore, haplotypes with a substitution of G to A (3182) at the third exon and a T/-(141) indel at the first exon were identified from the accessions with the 2.6-kb insertion. Each of these two polymorphisms resulted in a frameshift, and the corresponding haplotypes were termed Hap E3-Ha-2 and Hap E3-Ha-3, respectively. We also identified two indels (T/-(611) and G/-(768)) under the genetic background of Hap


Figure 4. Structural Variation of E3 in Different Accessions
(A) Physical location of E3 in the genome.
(B) Haplotypes of E3 in different accessions.
(C) A 13.3-kb deletion results in gene loss of SoyZH13_19G210500 and gene fusion of E3 (SoyZH13_19G210400) and its neighboring gene SoyZH13_19G210600 in the accessions with haplotype E3-tr.
(D) Validation of the gene fusion by amplification of transcripts. L05, L09, C13, and C14 are representative accessions of haplotype E3-tr with the 13.3-kb deletion, and W02, C12, and ZH13 are representative accessions without the deletion. The cDNA from each accession is used as a template to amplify different fragments.
(E) Expression profiling of E3 and its neighboring genes in ZH13.
See also Figure S5 and Tables S5 and S6.
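
The logic summarized in (C) can be sketched in a few lines of Python: given a deletion interval and gene annotations, label each gene as lost, truncated, or intact, and report the pair of surviving genes that flank the deletion as a fusion candidate. This is only an illustrative sketch, not the pipeline used in this study. The coordinates are hypothetical (chosen so that the deletion spans 13.3 kb), the names `Interval`, `classify_genes`, and `fusion_candidate` are invented for the example, and strand is ignored for simplicity.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    name: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def classify_genes(deletion: Interval, genes: list[Interval]) -> dict[str, str]:
    """Label each gene as 'intact', 'lost', or 'truncated' relative to a deletion."""
    labels = {}
    for g in genes:
        if g.end < deletion.start or g.start > deletion.end:
            labels[g.name] = "intact"      # no overlap with the deletion
        elif deletion.start <= g.start and g.end <= deletion.end:
            labels[g.name] = "lost"        # gene lies entirely inside the deletion
        else:
            labels[g.name] = "truncated"   # partial overlap
    return labels


def fusion_candidate(deletion: Interval, genes: list[Interval]) -> Optional[tuple[str, str]]:
    """Return (upstream, downstream) gene names that the deletion brings together.

    Assumption for illustration: a candidate exists when a gene is truncated at
    the left edge of the deletion and a different surviving gene extends past
    the right edge. Strand and reading frame are not checked.
    """
    labels = classify_genes(deletion, genes)
    ordered = sorted(genes, key=lambda g: g.start)
    left = [g for g in ordered
            if labels[g.name] == "truncated" and g.start < deletion.start]
    if not left:
        return None
    upstream = left[-1]
    right = [g for g in ordered
             if labels[g.name] != "lost"
             and g.end > deletion.end
             and g.name != upstream.name]
    if not right:
        return None
    return (upstream.name, right[0].name)


# Hypothetical coordinates, for illustration only.
deletion = Interval("del_13.3kb", 10_000, 23_299)
genes = [
    Interval("E3", 2_000, 11_000),
    Interval("SoyZH13_19G210500", 14_000, 18_000),
    Interval("SoyZH13_19G210600", 25_000, 30_000),
]

labels = classify_genes(deletion, genes)
candidate = fusion_candidate(deletion, genes)
print(labels)
print(candidate)
```

A candidate reported this way is only a hypothesis from genome coordinates; as in (D), it would still need transcript-level confirmation.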

E3-Mi-1, which both resulted in frameshifts, and the corresponding haplotypes were termed E3-Mi-2 and E3-Mi-3, respectively (Figure 4B).

Moreover, we found that the 13.3-kb deletion was accompanied by the loss of a complete gene, SoyZH13_19G210500 (Figure 4B). Surprisingly, our RNA-seq data showed that in addition to resulting in the loss of the last exon of E3 and an entire gene of SoyZH13_19G210600, the 13.3-kb deletion in Hap E3-tr caused transcriptional read-through of E3 and SoyZH13_19G210600 (termed Hap E3-tr-1). In addition, a new transcript form of E3 with an additional exon appeared (termed Hap E3-tr-2) (Figure 4C). To confirm that the read-through and the additional transcript truly existed, we designed primers from E3 and SoyZH13_19G210600 to amplify the transcript sequences from cDNA (Table S5). A clear band was amplified from the accessions without the deletion when the primers from the third and last exons were used; however, no product was obtained from E3-tr accessions, confirming the loss of the last exon. In contrast, clear bands were amplified from the E3-tr accessions when the primers from E3 and SoyZH13_19G210600 were used, but not from the accessions without the deletion (Figure 4D). Sanger sequencing of the PCR products showed that the fragment indeed perfectly matched the fusion of the E3 and SoyZH13_19G210600 transcripts (Figure S5B), demonstrating the existence of a gene fusion of E3 and SoyZH13_19G210600. Similarly, the new transcript form of E3 was also confirmed using primers from the third exon and the additional exon (Figure 4D). Transcriptional data showed that E3 and its neighboring genes, including SoyZH13_19G210500, were all expressed in ZH13 (Figure 4E), indicating that they are functional. Inspired by this result, we investigated gene fusion events among the assembled genomes at a genome-wide level through genome sequence comparison and expression detection. In total, we identified 15 gene fusion events that occurred in different accessions (Table S6). Future functional studies of these fusion genes will help to illuminate how new genes form and evolve and, in turn, shape biological diversity (Long et al., 2013).

Contribution of Structural Variations in Soybean Domestication
Previous studies identified a number of important loci that may be responsible for soybean domestication and adaptation, but most of the causative genetic variation in these regions has not been well determined (Lam et al., 2010; Li et al., 2014; Zhou et al., 2015; Lu et al., 2017; Torkamaneh et al., 2018; Wang et al., 2018a). The large number of SVs from dozens of independently de novo assembled genomes enabled us to clarify evolutionary processes that cannot be detected from one or a


few genomes. For instance, the alteration of seed coat pigmentation is one of the obvious traits selected during soybean domestication. Almost all wild soybeans exhibited black seed coat color, and most cultivars exhibited yellow seed coat color. The classically defined I locus is an important domestication locus responsible for the change in seed coat color from black to colorless (Woodworth, 1921; Zhou et al., 2015), which was reported to be associated with reduced chalcone synthase (CHS) gene expression in yellow soybean seed coats via homology-dependent gene silencing (Tuteja et al., 2004, 2009; Wang et al., 1994). A recent study demonstrated that the gene silencing might be caused by an SV consisting of an inversion and gene duplication of the CHS cluster that came from double crossover events (Xie et al., 2019). Of the 29 accessions, four wild soybeans and landrace SoyL02 exhibited black seed coat color, and the other cultivated soybeans exhibited yellow seed coat color (Figure 5A). Phylogenetic analyses using polymorphisms around the I locus showed that the 29 accessions could be classified into five major haplotypes (H1–H5), with the accessions with black seed coat forming a distinct haplotype, H1 (Figure 5A). Structural variation analyses showed that, compared with H1, the accessions from H3, H4, and H5 contained the inversion and gene duplication (Figure 5B), which is consistent with a previous report (Xie et al., 2019). However, we found that this SV did not exist in the accessions from H2, although they exhibited a yellow seed coat color. Nevertheless, we found that in haplotype H2, a 23.4-kb sequence from H1 was duplicated and inverted into the CHS cluster, which might have resulted from a double crossover event and led to the pseudogenization of two CHS genes (Figures 5B and 5C). Moreover, we found several additional SV events accompanying the inversion and duplication, including an additional inversion between the two CHS clusters that occurred in the H3 haplotype, a deletion based on H3 that might have come from unequal recombination of CHS genes and resulted in H4, and an insertion based on H3 that might have come from tandem duplication and resulted in H5 (Figure 5C). Genetic distance estimates suggested that the SVs from H2 and H3 might have originated 4,500 and 4,200 years ago, whereas H4 and H5 might have originated 600 years ago (Figure 5A).

Some PAVs that were associated with soybean domestication were also found. For example, a 360-kb inversion was found on chromosome 7 from 43.30–43.66 Mb of the ZH13 genome (Figure S6A). Interestingly, this inversion showed differences only between wild soybeans and cultivated soybeans and might have occurred 4,700 years ago (Figure S6B). Selection analysis demonstrated that this inversion was located in a selective sweep from soybean domestication (Figure S6C). Further investigation showed that the inversion resulted in a gene (SoyZH13_07G220000/SoyW02_07G232200) exhibiting SV between wild and cultivated soybeans (Figure S6D). This region was orthologous to a region on chromosome 17, which came from the recent WGD. Synteny analyses demonstrated that these two regions showed continuous synteny in wild soybean but diverged by the presence or absence of this inversion in cultivated soybeans. In contrast to the SV of SoyZH13_07G220000 and SoyW02_07G232200 in wild and cultivated soybeans, the orthologs on chromosome 17 (SoyZH13_17G036200 and SoyW02_17G034500) exhibited structural conservation between wild and cultivated soybeans (Figure S6D). The results suggested that the inversion occurred during domestication from wild soybean to cultivated soybean.

Structural Variations Affect Gene Expression and Are Associated with Agronomic Traits
Gene expression could be affected by gene structural variations and in turn lead to agronomic trait changes. For example, an insertion of a Ty1/copia-like retrotransposon disrupted E4 function by decreasing its expression level, and the soybean with this insertion exhibited insensitivity under long-day conditions (Liu et al., 2008). We found that 17,696 SVs were located within or around the high-confidence core genes (gene body with a ±3 kb flanking region). To investigate whether SVs affected the expression of more genes, we compared the expression level of these genes among the 26 accessions for de novo assembly using their RNA-seq data from 9 tissues using principal component analyses (PCA) and then performed correlation analyses between the SV and expressional PCA data. We found that 1,021 gene structural variations were associated with their expression level changes among different accessions (Table S7).

The analyses enabled us to identify some candidate genes that may be responsible for important traits. For example, iron deficiency chlorosis is a common problem for soybean production in calcareous soil. Genetic investigation demonstrated that several significant quantitative trait loci (QTLs) were responsible for iron deficiency chlorosis, one of which was on chromosome 14 (Lin et al., 1997) (Figure 6A). SoyZH13_14G179600, a gene encoding a Fe2+/Zn2+ regulated transporter, was found in this QTL region. We found that a 1.4-kb indel existed in the promoter region of SoyZH13_14G179600 in different genomes. The 1.4-kb sequence was composed of two equal-length terminal inverted repeats of 700 bp and was flanked by a 9-bp target site duplication (Figure S7), which met the criterion of Mutator (Wicker et al., 2007). The PAV of this Mutator was highly linked to the other five polymorphisms in the exon regions and classified the 26 accessions into two distinct groups: Hap-1 without the deletion and Hap-2 with this deletion (Figure 6B). Based on our RNA-seq data, the accessions of Hap-2 exhibited higher expression levels than those of Hap-1 (Figure 6C). Fe availability is largely determined by the pH value. In soils with higher pH, Fe usually exists predominantly in the form of insoluble ferric oxides. However, at lower pH, ferric Fe is freed from the oxide and becomes more available for uptake by roots (Morrissey and Guerinot, 2009). In China, lower latitude regions (Southern and Eastern) are mainly composed of ferralsol and have lower pH values, whereas the higher latitude regions (Northeastern, Northwestern, and Huanghuai regions) have relatively higher pH values (Dai et al., 2009; Shi et al., 2006). Interestingly, the Hap-2 accessions with higher expression were preferentially located in higher latitude regions, and Hap-1 accessions with lower expression were preferentially located in lower latitude regions (Figures 6D–6F), indicating that genetic divergence of SoyZH13_14G179600 contributed to soybean adaptation in iron uptake.

DISCUSSION

Soybean provides more than half of the global production of oilseed and more than a quarter of the protein for human food


Figure 5. Evolution of the I Locus in Soybeans
(A) Phylogenetic analysis of the I locus in 29 soybean accessions. The genetic variations surrounding the CHS cluster in the 29 soybean accessions were illuminated and the phylogenetic tree was constructed based on the SNPs from the CHS cluster. The major evolutionary events and the estimated divergence ages are labeled. ya, years ago.
(B) Structural variations in the I locus in 29 soybean accessions. The haplotypes of the accessions are clustered based on the structure of CHS units.
(C) Evolutionary process of five main I haplotypes. Color blocks represent CHS unit by sequence similarity. The white block shows unclear similarity of certain CHS units. Black triangle, strand of CHS unit.
See also Figure S6.

and animal feed (Graham and Vance, 2003; Wilson, 2008). To ensure an adequate food supply for the expanding worldwide human population, soybean production must be doubled by 2050 (Foley et al., 2011; Tilman et al., 2011; Ray et al., 2013). In turn, more effective plant breeding of soybean varieties is urgently needed. A comprehensive evaluation and utilization of genetically diverse germplasm is essential for crop improvement (Golicz et al., 2016a; Varshney et al., 2020). A large number of soybean accessions have been genotyped using short-read-based high-throughput sequencing technologies (Lam et al., 2010; Li et al.,


Figure 6. Structural Variation in an Fe Efficiency QTL Candidate Gene in Different Accessions


(A) SoyZH13_14G179600, a Fe2+/Zn2+ regulated transporter, is located in the Fe efficiency QTL and exhibits structural variation among accessions.
(B) Haplotype divergence of SoyZH13_14G179600 in different accessions.
(C) Expression levels of two haplotypes of SoyZH13_14G179600 in 29 pan-genome accessions. Tissues: A, root from growth stage V1; B, stem from growth stage
V1; C, young leaf from growth stage V1; D, mature leaf from growth stage R1; E, old leaf from growth stage R4; F, flower from growth stage R1; G, pod and seed
before 4 weeks; H, seed at 6 weeks; I, seed at 8 weeks.
(D–F) Comparison of the geographic and latitude distribution of the two haplotypes of SoyZH13_14G179600 in soybean accessions. (D) The geographic distribution of the two haplotypes. (E) Count of the accession number of the two haplotypes in each geographic region. (F) Comparison of latitude distribution of the two haplotypes in soybean accessions. I–VI, eco-regions of soybeans in China; A, America; J, Japan; K, Korea; R, Russia. Violin plots represent the latitude density of the data. Distribution of the data with the median shown as a horizontal line. Box edges depict interquartile range, whiskers 1.5× the interquartile range, and centerlines the median. Significance was tested by the Wilcoxon test; *** p < 0.001.
See also Figure S7 and Table S7.
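
The comparison in (F) can be illustrated with a minimal Python sketch of a two-sided Wilcoxon rank-sum test. Several points are assumptions for the example: the caption names only "the Wilcoxon test", so the rank-sum (two independent groups) variant is assumed; the p value uses a normal approximation without tie or continuity correction; the latitude values are invented toy data, not the accessions analyzed here; and the function names are hypothetical.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(x, y):
    """Two-sided rank-sum test (normal approximation, no tie correction).

    Returns (U statistic of x, z score, two-sided p value).
    """
    n1, n2 = len(x), len(y)
    ranks = rank_with_ties(list(x) + list(y))
    rank_sum_x = sum(ranks[:n1])
    u_x = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u_x - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u_x, z, p


# Toy latitudes in degrees north, invented for illustration only.
hap1_latitude = [23.1, 25.4, 28.0, 30.2, 31.5, 26.7]
hap2_latitude = [38.9, 41.2, 43.5, 45.0, 39.7, 47.3]

u_stat, z_score, p_value = wilcoxon_rank_sum(hap1_latitude, hap2_latitude)
print(u_stat, z_score, p_value)
```

For real data with many ties or small groups, an exact or tie-corrected implementation from a statistics package would be the safer choice.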

2010b, 2013; Zhou et al., 2015; Han et al., 2016; Maldonado dos Santos et al., 2016; Fang et al., 2017; Torkamaneh et al., 2018). However, because only a single released soybean reference genome, Wm82, was available (Schmutz et al., 2010), only SNPs and small indels were identified, leaving the larger SVs almost overlooked. This limited the capture of the full landscape of genetic variations and the pinpointing of causal variations in QTL cloning and genome-wide association studies.

The pan-genome dataset of 27 wild and cultivated soybean accessions from this study will provide a promising platform for future in-depth functional genomics studies of soybean. First, the high-quality genomes enabled the identification of numerous complex variations that cannot be detected by simply mapping short reads to a single genome. The large amount of genetic variation has been deposited in the publicly accessible Genome Sequence Archive (GSA) database in the BIG Data Center and in Figshare databases. Second, the graph-based genome offers a new platform for mapping short-read data to determine genetic variations at the pan-genome level instead of against a single genome and for preventing erroneous variation calls around SVs (Rakocevic et al., 2019). Therefore, re-analysis of the large amount of previously resequenced data based on the graph-based genome will generate more comprehensive information than before, giving those reported data new value. In addition, coupled with the RNA-seq and small RNA sequencing (smRNA-seq) data from individual accessions, the platform will make it possible to link SVs with gene expression and, in turn, will greatly promote gene discovery.

Several genetic diversity bottlenecks occurred during soybean domestication and improvement, resulting in narrowed genetic diversity among modern cultivars (Hyten et al., 2006; Zhou et al., 2015), which greatly restricts the subsequent creation of


elite cultivars. Creation of experimental populations is essential for both functional studies and breeding (Huang et al., 2015). In addition to the traditional experimental populations that combine the genomes of two parents, there is a trend to develop multi-parent-based mixture populations, such as nested association mapping (NAM) populations (Yu et al., 2008) and multi-parent advanced generation intercrosses (MAGIC) populations (Huang et al., 2015). However, a challenge in experimental population creation is to select appropriate representative parents. The 27 accessions were carefully selected from 2,898 globally collected soybean germplasms in terms of phylogenetic relationships and geographic distributions. These accessions will be promising candidates for NAM or MAGIC population construction. Therefore, the pan-genome will not only facilitate functional study but will also be quite useful for soybean breeding.

In summary, the pan-genome provides a comprehensive resource and will greatly benefit the soybean breeding and research community.

STAR★METHODS

Detailed methods are provided in the online version of this paper and include the following:

• KEY RESOURCES TABLE
• RESOURCE AVAILABILITY
  ◦ Lead Contact
  ◦ Materials Availability
  ◦ Data and Code Availability
• EXPERIMENTAL MODEL AND SUBJECT DETAILS
• METHOD DETAILS
  ◦ SNP Calling and Phylogenetic Analyses
  ◦ Genome Assembly
  ◦ Repeat Analysis and Gene Annotation
  ◦ Synteny Analysis
  ◦ Structural Variation Identification
  ◦ Genetic Variation Analysis
  ◦ Gene and miRNA Expression
  ◦ Core and Dispensable Gene Family Clustering
  ◦ CHS gene unit identification
  ◦ Gene Fusion Event Identification
• QUANTIFICATION AND STATISTICAL ANALYSIS

SUPPLEMENTAL INFORMATION

Supplemental Information can be found online at https://doi.org/10.1016/j.cell.2020.05.023.

ACKNOWLEDGMENTS

This work was supported by the National Natural Science Foundation of China (31788103, 31525018, and 91531304) and the Chinese Academy of Sciences (XDA24030501, ZDRW-ZS-2019-2, and 153E11KYSB20190045).

AUTHOR CONTRIBUTIONS

Z.T. designed the experiments and managed the project. Y. Liu, H.D., Y.S., P.L., H.Z., M.S., and C.L. performed genome assembly. Y. Liu, H.D., Y.S., S.L., H.P., G.-A.Z., X.H., Y. Li, Z.W., Z.L., B.Z., M.Z., B.H., C.L., and Z.T. performed the data analysis and validation. Y. Liu and Z.T. wrote the manuscript.

DECLARATION OF INTERESTS

The authors declare no competing interests.

Received: December 22, 2019
Revised: April 7, 2020
Accepted: May 12, 2020
Published: June 17, 2020

REFERENCES

Ameur, A. (2019). Goodbye reference, hello genome graphs. Nat. Biotechnol. 37, 866–868.
Audano, P.A., Sulovari, A., Graves-Lindsay, T.A., Cantsilieris, S., Sorensen, M., Welch, A.E., Dougherty, M.L., Nelson, B.J., Shah, A., Dutcher, S.K., et al. (2019). Characterizing the major structural variant alleles of the human genome. Cell 176, 663–675.
Axtell, M.J. (2013). ShortStack: comprehensive annotation and quantification of small RNA genes. RNA 19, 740–751.
Bandillo, N., Jarquin, D., Song, Q.J., Nelson, R., Cregan, P., Specht, J., and Lorenz, A. (2015). A population structure and genome-wide association analysis on the USDA soybean germplasm collection. Plant Genome 8. https://doi.org/10.3835/plantgenome2015.04.0024.
Beilstein, M.A., Nagalingum, N.S., Clements, M.D., Manchester, S.R., and Mathews, S. (2010). Dated molecular phylogenies indicate a Miocene origin for Arabidopsis thaliana. Proc. Natl. Acad. Sci. USA 107, 18724–18728.
Camacho, C., Coulouris, G., Avagyan, V., Ma, N., Papadopoulos, J., Bealer, K., and Madden, T.L. (2009). BLAST+: architecture and applications. BMC Bioinformatics 10, 421.
Cantarel, B.L., Korf, I., Robb, S.M., Parra, G., Ross, E., Moore, B., Holt, C., Sánchez Alvarado, A., and Yandell, M. (2008). MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res. 18, 188–196.
Carter, T.E., Nelson, R., Sneller, C.H., and Cui, Z. (2004). Genetic diversity in soybean. In Soybeans: Improvement, Production and Uses, Third Edition, H.R. Boerma and J.E. Specht, eds. (American Society of Agronomy, Inc.), pp. 303–416.
Chakraborty, M., Emerson, J.J., Macdonald, S.J., and Long, A.D. (2019). Structural variants exhibit widespread allelic heterogeneity and shape variation in complex traits. Nat. Commun. 10, 4872.
Chan, C., Qi, X., Li, M.W., Wong, F.L., and Lam, H.M. (2012). Recent developments of genomic research in soybean. J. Genet. Genomics 39, 317–324.
Cook, D.E., Lee, T.G., Guo, X., Melito, S., Wang, K., Bayless, A.M., Wang, J., Hughes, T.J., Willis, D.K., Clemente, T.E., et al. (2012). Copy number variation of multiple genes at Rhg1 mediates nematode resistance in soybean. Science 338, 1206–1209.
Dai, W., Huang, Y., Wu, L., and Yu, J. (2009). Relationships between soil organic matter content (SOM) and pH in topsoil of zonal soils in China. Acta Pedologica Sinica 46, 851–860.
Davis, J.C., and Petrov, D.A. (2004). Preferential duplication of conserved proteins in eukaryotic genomes. PLoS Biol. 2, E55.
Deng, Y., Zhai, K., Xie, Z., Yang, D., Zhu, X., Liu, J., Wang, X., Qin, P., Yang, Y., Zhang, G., et al. (2017). Epigenetic regulation of antagonistic receptors confers rice blast resistance with yield balance. Science 355, 962–965.
Du, H., and Liang, C. (2019). Assembly of chromosome-scale contigs by efficiently resolving repetitive sequences with long reads. Nat. Commun. 10, 5360.
Du, J., Grant, D., Tian, Z., Nelson, R.T., Zhu, L., Shoemaker, R.C., and Ma, J. (2010). SoyTEdb: a comprehensive database of transposable elements in the soybean genome. BMC Genomics 11, 113.
Du, J., Tian, Z., Sui, Y., Zhao, M., Song, Q., Cannon, S.B., Cregan, P., and Ma, J. (2012). Pericentromeric effects shape the patterns of divergence, retention,

and expression of duplicated genes in the paleopolyploid soybean. Plant Cell 24, 21–32.
Dudchenko, O., Batra, S.S., Omer, A.D., Nyquist, S.K., Hoeger, M., Durand, N.C., Shamim, M.S., Machol, I., Lander, E.S., Aiden, A.P., and Aiden, E.L. (2017). De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science 356, 92–95.
Durand, N.C., Shamim, M.S., Machol, I., Rao, S.S., Huntley, M.H., Lander, E.S., and Aiden, E.L. (2016). Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell Syst. 3, 95–98.
Edgar, R.C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5, 113.
Eggertsson, H.P., Kristmundsdottir, S., Beyter, D., Jonsson, H., Skuladottir, A., Hardarson, M.T., Gudbjartsson, D.F., Stefansson, K., Halldorsson, B.V., and Melsted, P. (2019). GraphTyper2 enables population-scale genotyping of structural variation using pangenome graphs. Nat. Commun. 10, 5402.
Fang, C., Ma, Y., Yuan, L., Wang, Z., Yang, R., Zhou, Z., Liu, T., and Tian, Z. (2016). Chloroplasts DNA Underwent Independent Selection from Nuclear Genes during Soybean Domestication and Improvement. J. Genet. Genomics 43, 217–221.
Fang, C., Ma, Y., Wu, S., Liu, Z., Wang, Z., Yang, R., Hu, G., Zhou, Z., Yu, H., Zhang, M., et al. (2017). Genome-wide association studies dissect the genetic networks underlying agronomical traits in soybean. Genome Biol. 18, 161.
Foley, J.A., Ramankutty, N., Brauman, K.A., Cassidy, E.S., Gerber, J.S., Johnston, M., Mueller, N.D., O’Connell, C., Ray, D.K., West, P.C., et al. (2011). Solutions for a cultivated planet. Nature 478, 337–342.
Garrison, E., Sirén, J., Novak, A.M., Hickey, G., Eizenga, J.M., Dawson, E.T., Jones, W., Garg, S., Markello, C., Lin, M.F., et al. (2018). Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat. Biotechnol. 36, 875–879.
Gijzen, M., Weng, C., Kuflu, K., Woodrow, L., Yu, K., and Poysa, V. (2003). Soybean seed lustre phenotype and surface protein cosegregate and map to linkage group E. Genome 46, 659–664.
Golicz, A.A., Batley, J., and Edwards, D. (2016a). Towards plant pangenomics. Plant Biotechnol. J. 14, 1099–1105.
Golicz, A.A., Bayer, P.E., Barker, G.C., Edger, P.P., Kim, H., Martinez, P.A., Chan, C.K., Severn-Ellis, A., McCombie, W.R., Parkin, I.A., et al. (2016b). The pangenome of an agronomically important crop plant Brassica oleracea. Nat. Commun. 7, 13390.
Gordon, S.P., Contreras-Moreira, B., Woods, D.P., Des Marais, D.L., Burgess, D., Shu, S., Stritt, C., Roulin, A.C., Schackwitz, W., Tyler, L., et al. (2017). Extensive gene content variation in the Brachypodium distachyon pan-genome correlates with population structure. Nat. Commun. 8, 2184.
Graham, P.H., and Vance, C.P. (2003). Legumes: importance and constraints to greater use. Plant Physiol. 131, 872–877.
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith, R.K., Jr., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D., et al. (2003). Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31, 5654–5666.
Han, Y., Zhao, X., Liu, D., Li, Y., Lightfoot, D.A., Yang, Z., Zhao, L., Zhou, G., Wang, Z., Huang, L., et al. (2016). Domestication footprints anchor genomic regions of agronomic importance in soybeans. New Phytol. 209, 871–884.
Hickey, G., Heller, D., Monlong, J., Sibbesen, J.A., Sirén, J., Eizenga, J., Dawson, E.T., Garrison, E., Novak, A.M., and Paten, B. (2020). Genotyping structural variants in pangenome graphs using the vg toolkit. Genome Biol. 21, 35.
Hirsch, C.N., Foerster, J.M., Johnson, J.M., Sekhon, R.S., Muttoni, G., Vaillancourt, B., Peñagaricano, F., Lindquist, E., Pedraza, M.A., Barry, K., et al. (2014). Insights into the maize pan-genome and pan-transcriptome. Plant Cell 26, 121–135.
Hu, Z., Sun, C., Lu, K.C., Chu, X., Zhao, Y., Lu, J., Shi, J., and Wei, C. (2017). EUPAN enables pan-genome studies of a large number of eukaryotic genomes. Bioinformatics 33, 2408–2409.
Huang, B.E., Verbyla, K.L., Verbyla, A.P., Raghavan, C., Singh, V.K., Gaur, P., Leung, H., Varshney, R.K., and Cavanagh, C.R. (2015). MAGIC populations in crops: current status and future prospects. Theor. Appl. Genet. 128, 999–1017.
Hufford, M.B., Xu, X., van Heerwaarden, J., Pyhäjärvi, T., Chia, J.M., Cartwright, R.A., Elshire, R.J., Glaubitz, J.C., Guill, K.E., Kaeppler, S.M., et al. (2012). Comparative population genomics of maize domestication and improvement. Nat. Genet. 44, 808–811.
Hurgobin, B., Golicz, A.A., Bayer, P.E., Chan, C.K., Tirnaz, S., Dolatabadian, A., Schiessl, S.V., Samans, B., Montenegro, J.D., Parkin, I.A.P., et al. (2018). Homoeologous exchange is a major cause of gene presence/absence variation in the amphidiploid Brassica napus. Plant Biotechnol. J. 16, 1265–1274.
Hyten, D.L., Song, Q., Zhu, Y., Choi, I.-Y., Nelson, R.L., Costa, J.M., Specht, J.E., Shoemaker, R.C., and Cregan, P.B. (2006). Impacts of genetic bottlenecks on soybean genome diversity. Proc. Natl. Acad. Sci. USA 103, 16666–16671.
Jones, C.D., and Begun, D.J. (2005). Parallel evolution of chimeric fusion genes. Proc. Natl. Acad. Sci. USA 102, 11373–11378.
Jones, P., Binns, D., Chang, H.Y., Fraser, M., Li, W., McAnulla, C., McWilliam, H., Maslen, J., Mitchell, A., Nuka, G., et al. (2014). InterProScan 5: genome-scale protein function classification. Bioinformatics 30, 1236–1240.
Jordan, I.K., Wolf, Y.I., and Koonin, E.V. (2004). Duplicated genes evolve slower than singletons despite the initial rate increase. BMC Evol. Biol. 4, 22.
Kang, H.M., Sul, J.H., Service, S.K., Zaitlen, N.A., Kong, S.Y., Freimer, N.B., Sabatti, C., and Eskin, E. (2010). Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354.
Kim, D., Langmead, B., and Salzberg, S.L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12, 357–360.
Kim, D., Paggi, J.M., Park, C., Bennett, C., and Salzberg, S.L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915.
Koren, S., Walenz, B.P., Berlin, K., Miller, J.R., Bergman, N.H., and Phillippy, A.M. (2017). Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27, 722–736.
Kumar, A., and Bennetzen, J.L. (1999). Plant retrotransposons. Annu. Rev. Genet. 33, 479–532.
Lam, H.M., Xu, X., Liu, X., Chen, W., Yang, G., Wong, F.L., Li, M.W., He, W., Qin, N., Wang, B., et al. (2010). Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nat. Genet. 42, 1053–1059.
Lee, T.-H., Guo, H., Wang, X., Kim, C., and Paterson, A.H. (2014). SNPhylo: a pipeline to construct a phylogenetic tree from huge SNP data. BMC Genomics 15, 162.
Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760.
Li, W., and Godzik, A. (2006). Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659.
Li, L., Stoeckert, C.J., Jr., and Roos, D.S. (2003). OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13, 2178–2189.
Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., and Durbin, R.; 1000 Genome Project Data Processing Subgroup (2009). The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079.
Li, Y.H., Li, W., Zhang, C., Yang, L., Chang, R.Z., Gaut, B.S., and Qiu, L.J. (2010b). Genetic diversity in domesticated soybean (Glycine max) and its wild progenitor (Glycine soja) for simple sequence repeat and single-nucleotide polymorphism loci. New Phytol. 188, 242–253.
Li, R., Li, Y., Zheng, H., Luo, R., Zhu, H., Li, Q., Qian, W., Ren, Y., Tian, G., Li, J., et al. (2010a). Building the sequence map of the human pan-genome. Nat. Biotechnol. 28, 57–63.
Li, Y.H., Zhao, S.C., Ma, J.X., Li, D., Yan, L., Li, J., Qi, X.T., Guo, X.S., Zhang, L., He, W.M., et al. (2013). Molecular footprints of domestication and

improvement in soybean revealed by whole genome re-sequencing. BMC Genomics 14, 579.
Li, Y.H., Zhou, G., Ma, J., Jiang, W., Jin, L.G., Zhang, Z., Guo, Y., Zhang, J., Sui, Y., Zheng, L., et al. (2014). De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nat. Biotechnol. 32, 1045–1052.
Li, M., Chen, L., Tian, S., Lin, Y., Tang, Q., Zhou, X., Li, D., Yeung, C.K.L., Che, T., Jin, L., et al. (2017). Comprehensive variation discovery and recovery of missing sequence in the pig genome using multiple de novo assemblies. Genome Res. 27, 865–874.
Li, M.W., Wang, Z., Jiang, B., Kaga, A., Wong, F.L., Zhang, G., Han, T., Chung, G., Nguyen, H., and Lam, H.M. (2020). Impacts of genomic research on soybean improvement in East Asia. Theor. Appl. Genet. 133, 1655–1678.
Lieberman-Aiden, E., van Berkum, N.L., Williams, L., Imakaev, M., Ragoczy, T., Telling, A., Amit, I., Lajoie, B.R., Sabo, P.J., Dorschner, M.O., et al. (2009). Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science 326, 289–293.
Lin, S., Cianzio, S., and Shoemaker, R. (1997). Mapping genetic loci for iron deficiency chlorosis in soybean. Mol. Breed. 3, 219–229.
Lin, K., Zhang, N., Severing, E.I., Nijveen, H., Cheng, F., Visser, R.G., Wang, X., de Ridder, D., and Bonnema, G. (2014). Beyond genomic variation–comparison and functional annotation of three Brassica rapa genomes: a turnip, a rapid cycling and a Chinese cabbage. BMC Genomics 15, 250.
Liu, B., Kanazawa, A., Matsumura, H., Takahashi, R., Harada, K., and Abe, J. (2008). Genetic redundancy in soybean photoresponses associated with duplication of the phytochrome A gene. Genetics 180, 995–1007.
Liu, Z., Li, H., Wen, Z., Fan, X., Li, Y., Guan, R., Guo, Y., Wang, S., Wang, D., and Qiu, L. (2017). Comparison of genetic diversity between Chinese and American soybean (Glycine max (L.)) accessions revealed by high-density SNPs. Front. Plant Sci. 8, 2014.
Long, M., VanKuren, N.W., Chen, S., and Vibranovski, M.D. (2013). New gene evolution: little did we know. Annu. Rev. Genet. 47, 307–333.
Lowe, T.M., and Eddy, S.R. (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25, 955–964.
Murray, M.G., and Thompson, W.F. (1980). Rapid isolation of high molecular weight plant DNA. Nucleic Acids Res. 8, 4321–4325.
Nawrocki, E.P., Kolbe, D.L., and Eddy, S.R. (2009). Infernal 1.0: inference of RNA alignments. Bioinformatics 25, 1335–1337.
Nei, M. (1972). Genetic distance between populations. Am. Nat. 106, 283–292.
Nei, M., and Gojobori, T. (1986). Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol. Biol. Evol. 3, 418–426.
Pertea, M., Pertea, G.M., Antonescu, C.M., Chang, T.C., Mendell, J.T., and Salzberg, S.L. (2015). StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat. Biotechnol. 33, 290–295.
Pertea, M., Kim, D., Pertea, G.M., Leek, J.T., and Salzberg, S.L. (2016). Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. Nat. Protoc. 11, 1650–1667.
Price, M.N., Dehal, P.S., and Arkin, A.P. (2010). FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490.
R Development Core Team (2013). R: A language and environment for statistical computing (R Foundation for Statistical Computing).
Rakocevic, G., Semenyuk, V., Lee, W.P., Spencer, J., Browning, J., Johnson, I.J., Arsenijevic, V., Nadj, J., Ghose, K., Suciu, M.C., et al. (2019). Fast and accurate genomic analyses using genome graphs. Nat. Genet. 51, 354–362.
Ray, D.K., Mueller, N.D., West, P.C., and Foley, J.A. (2013). Yield trends are insufficient to double global crop production by 2050. PLoS ONE 8, e66428.
Salamov, A.A., and Solovyev, V.V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10, 516–522.
Saxena, R.K., Edwards, D., and Varshney, R.K. (2014). Structural variations in plant genomes. Brief. Funct. Genomics 13, 296–307.
Scherer, S.W., Lee, C., Birney, E., Altshuler, D.M., Eichler, E.E., Carter, N.P., Hurles, M.E., and Feuk, L. (2007). Challenges and standards in integrating surveys of structural variation. Nat. Genet. 39, S7–S15.
Schmutz, J., Cannon, S.B., Schlueter, J., Ma, J., Mitros, T., Nelson, W., Hyten, D.L., Song, Q., Thelen, J.J., Cheng, J., et al. (2010). Genome sequence of the palaeopolyploid soybean. Nature 463, 178–183.
Shen, Y., Zhou, Z., Wang, Z., Li, W., Fang, C., Wu, M., Ma, Y., Liu, T., Kong, L.A., Peng, D.L., and Tian, Z. (2014). Global dissection of alternative splicing
in paleopolyploid soybean. Plant Cell 26, 996–1008.
Lu, F., Romay, M.C., Glaubitz, J.C., Bradbury, P.J., Elshire, R.J., Wang, T., Li,
Y., Li, Y., Semagn, K., Zhang, X., et al. (2015). High-resolution genetic mapping Shen, Y., Liu, J., Geng, H., Zhang, J., Liu, Y., Zhang, H., Xing, S., Du, J., Ma, S.,
of maize pan-genome sequence anchors. Nat. Commun. 6, 6914. and Tian, Z. (2018). De novo assembly of a Chinese soybean genome. Sci.
China Life Sci. 61, 871–884.
Lu, S., Zhao, X., Hu, Y., Liu, S., Nan, H., Li, X., Fang, C., Cao, D., Shi, X., Kong,
L., et al. (2017). Natural variation at the soybean J locus improves adaptation to Shen, Y., Du, H., Liu, Y., Ni, L., Wang, Z., Liang, C., and Tian, Z. (2019). Update
the tropics and enhances yield. Nat. Genet. 49, 773–779. soybean Zhonghuang 13 genome to a golden reference. Sci. China Life Sci.
62, 1257–1260.
Lye, Z.N., and Purugganan, M.D. (2019). Copy number variation in domestica-
Shi, X.Z., Yu, D.S., Warner, E.D., Sun, W.X., Petersen, G.W., Gong, Z.T., and
tion. Trends Plant Sci. 24, 352–365.
Lin, H. (2006). Cross-reference system for translating between genetic soil
Maldonado dos Santos, J.V., Valliyodan, B., Joshi, T., Khan, S.M., Liu, Y., classification of china and soil taxonomy. Soil Sci. Soc. Am. J. 70, 78–83.
Wang, J., Vuong, T.D., de Oliveira, M.F., Marcelino-Guimarães, F.C., Xu, D.,
Shomura, A., Izawa, T., Ebana, K., Ebitani, T., Kanegae, H., Konishi, S., and
et al. (2016). Evaluation of genetic variation among Brazilian soybean cultivars
Yano, M. (2008). Deletion in a gene associated with grain size increased yields
through genome resequencing. BMC Genomics 17, 110.
during rice domestication. Nat. Genet. 40, 1023–1028.
Marçais, G., Delcher, A.L., Phillippy, A.M., Coston, R., Salzberg, S.L., and Zi-
Simão, F.A., Waterhouse, R.M., Ioannidis, P., Kriventseva, E.V., and Zdobnov,
min, A. (2018). MUMmer4: A fast and versatile genome alignment system.
E.M. (2015). BUSCO: assessing genome assembly and annotation complete-
PLoS Comput. Biol. 14, e1005944.
ness with single-copy orthologs. Bioinformatics 31, 3210–3212.
McCarthy, E.M., and McDonald, J.F. (2003). LTR_STRUC: a novel search and Stanke, M., and Morgenstern, B. (2005). AUGUSTUS: a web server for gene
identification program for LTR retrotransposons. Bioinformatics 19, 362–367. prediction in eukaryotes that allows user-defined constraints. Nucleic Acids
McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, Res. 33, W465-7.
A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M., and DePristo, M.A. (2010). Tamura, K., Stecher, G., Peterson, D., Filipski, A., and Kumar, S. (2013).
The Genome Analysis Toolkit: a MapReduce framework for analyzing next- MEGA6: molecular evolutionary genetics analysis version 6.0. Mol. Biol.
generation DNA sequencing data. Genome Res. 20, 1297–1303. Evol. 30, 2725–2729.
Morrissey, J., and Guerinot, M.L. (2009). Iron uptake and transport in plants: Tao, Y., Zhao, X., Mace, E., Henry, R., and Jordan, D. (2019). Exploring and ex-
the good, the bad, and the ionome. Chem. Rev. 109, 4553–4567. ploiting pan-genomics for crop improvement. Mol. Plant 12, 156–169.
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., and Wold, B. (2008). Tettelin, H., Masignani, V., Cieslewicz, M.J., Donati, C., Medini, D., Ward, N.L.,
Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Angiuoli, S.V., Crabtree, J., Jones, A.L., Durkin, A.S., et al. (2005). Genome
Methods 5, 621–628. analysis of multiple pathogenic isolates of Streptococcus agalactiae:

Cell 182, 162–176, July 9, 2020 175


ll
Resource
implications for the microbial "pan-genome". Proc. Natl. Acad. Sci. USA 102, 13950–13955.
Thorvaldsdóttir, H., Robinson, J.T., and Mesirov, J.P. (2013). Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief. Bioinform. 14, 178–192.
Tian, D., Wang, Q., Zhang, P., Araki, H., Yang, S., Kreitman, M., Nagylaki, T., Hudson, R., Bergelson, J., and Chen, J.Q. (2008). Single-nucleotide mutation rate increases close to insertions/deletions in eukaryotes. Nature 455, 105–108.
Tian, Z., Zhao, M., She, M., Du, J., Cannon, S.B., Liu, X., Xu, X., Qi, X., Li, M.W., Lam, H.M., and Ma, J. (2012). Genome-wide characterization of nonreference transposons reveals evolutionary propensities of transposons in soybean. Plant Cell 24, 4422–4436.
Tilman, D., Balzer, C., Hill, J., and Befort, B.L. (2011). Global food demand and the sustainable intensification of agriculture. Proc. Natl. Acad. Sci. USA 108, 20260–20264.
Torkamaneh, D., Laroche, J., Tardivel, A., O'Donoughue, L., Cober, E., Rajcan, I., and Belzile, F. (2018). Comprehensive description of genomewide nucleotide and structural variation in short-season soya bean. Plant Biotechnol. J. 16, 749–759.
Törönen, P., Medlar, A., and Holm, L. (2018). PANNZER2: a rapid functional annotation web server. Nucleic Acids Res. 46 (W1), W84–W88.
Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D.R., Pimentel, H., Salzberg, S.L., Rinn, J.L., and Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc. 7, 562–578.
Tsubokura, Y., Watanabe, S., Xia, Z., Kanamori, H., Yamagata, H., Kaga, A., Katayose, Y., Abe, J., Ishimoto, M., and Harada, K. (2014). Natural variation in the genes responsible for maturity loci E1, E2, E3 and E4 in soybean. Ann. Bot. 113, 429–441.
Tuteja, J.H., Clough, S.J., Chan, W.C., and Vodkin, L.O. (2004). Tissue-specific gene silencing mediated by a naturally occurring chalcone synthase gene cluster in Glycine max. Plant Cell 16, 819–835.
Tuteja, J.H., Zabala, G., Varala, K., Hudson, M., and Vodkin, L.O. (2009). Endogenous, tissue-specific short interfering RNAs silence the chalcone synthase gene family in glycine max seed coats. Plant Cell 21, 3063–3077.
Varshney, R.K., Sinha, P., Singh, V.K., Kumar, A., Zhang, Q., and Bennetzen, J.L. (2020). 5Gs for crop genetic improvement. Curr. Opin. Plant Biol. 13, 1–7.
Wang, Z., and Tian, Z. (2015). Genomics progress will facilitate molecular breeding in soybean. Sci. China Life Sci. 58, 813–815.
Wang, C.S., Todd, J.J., and Vodkin, L.O. (1994). Chalcone synthase mRNA and activity are reduced in yellow soybean seed coats with dominant I alleles. Plant Physiol. 105, 739–748.
Wang, W., Zheng, H., Fan, C., Li, J., Shi, J., Cai, Z., Zhang, G., Liu, D., Zhang, J., Vang, S., et al. (2006). High rate of chimeric gene origination by retroposition in plant genomes. Plant Cell 18, 1791–1802.
Wang, K., Li, M., and Hakonarson, H. (2010). ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164.
Wang, M., Li, W., Fang, C., Xu, F., Liu, Y., Wang, Z., Yang, R., Zhang, M., Liu, S., Lu, S., et al. (2018a). Parallel selection on a dormancy gene during domestication of crops from multiple families. Nat. Genet. 50, 1435–1441.
Wang, W., Mauleon, R., Hu, Z., Chebotarov, D., Tai, S., Wu, Z., Li, M., Zheng, T., Fuentes, R.R., Zhang, F., et al. (2018b). Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 557, 43–49.
Watanabe, S., Hideshima, R., Xia, Z., Tsubokura, Y., Sato, S., Nakamoto, Y., Yamanaka, N., Takahashi, R., Ishimoto, M., Anai, T., et al. (2009). Map-based cloning of the gene associated with the soybean maturity locus E3. Genetics 182, 1251–1262.
Wicker, T., Sabot, F., Hua-Van, A., Bennetzen, J.L., Capy, P., Chalhoub, B., Flavell, A., Leroy, P., Morgante, M., Panaud, O., et al. (2007). A unified classification system for eukaryotic transposable elements. Nat. Rev. Genet. 8, 973–982.
Wilson, R.F. (2008). Soybean: Market Driven Research Needs in Genetics and Genomics of Soybean (Springer).
Woodworth, C.M. (1921). Inheritance of cotyledon, seed-coat, hilum and pubescence colors in soy-beans. Genetics 6, 487–553.
Xie, C., Mao, X., Huang, J., Ding, Y., Wu, J., Dong, S., Kong, L., Gao, G., Li, C.Y., and Wei, L. (2011). KOBAS 2.0: a web server for annotation and identification of enriched pathways and diseases. Nucleic Acids Res. 39, W316-22.
Xie, M., Chung, C.Y., Li, M.W., Wong, F.L., Wang, X., Liu, A., Wang, Z., Leung, A.K., Wong, T.H., Tong, S.W., et al. (2019). A reference-grade wild soybean genome. Nat. Commun. 10, 1216.
Xu, K., Xu, X., Fukao, T., Canlas, P., Maghirang-Rodriguez, R., Heuer, S., Ismail, A.M., Bailey-Serres, J., Ronald, P.C., and Mackill, D.J. (2006). Sub1A is an ethylene-response-factor-like gene that confers submergence tolerance to rice. Nature 442, 705–708.
Yang, L., and Gaut, B.S. (2011). Factors that contribute to variation in evolutionary rate among Arabidopsis genes. Mol. Biol. Evol. 28, 2359–2369.
Yang, N., Wu, S., and Yan, J. (2019). Structural variation in complex genome: detection, integration and function. Sci. China Life Sci. 62, 1098–1100.
Yao, W., Li, G., Zhao, H., Wang, G., Lian, X., and Xie, W. (2015). Exploring the rice dispensable genome using a metagenome-like assembly strategy. Genome Biol. 16, 187.
Yu, J., Holland, J.B., McMullen, M.D., and Buckler, E.S. (2008). Genetic design and statistical power of nested association mapping in maize. Genetics 178, 539–551.
Yu, G., Wang, L.G., Han, Y., and He, Q.Y. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16, 284–287.
Zhao, Q., Feng, Q., Lu, H., Li, Y., Wang, A., Tian, Q., Zhan, Q., Lu, Y., Zhang, L., Huang, T., et al. (2018). Pan-genome analysis highlights the extent of genomic variation in cultivated and wild rice. Nat. Genet. 50, 278–284.
Zhou, Z., Wang, Z., Li, W., Fang, C., Shen, Y., Li, C., Wu, Y., and Tian, Z. (2013). Comprehensive analyses of microRNA gene evolution in paleopolyploid soybean genome. Plant J. 76, 332–344.
Zhou, Z., Jiang, Y., Wang, Z., Gou, Z., Lyu, J., Li, W., Yu, Y., Shu, L., Zhao, Y., Ma, Y., et al. (2015). Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat. Biotechnol. 33, 408–414.
Zhou, P., Silverstein, K.A., Ramaraj, T., Guhlin, J., Denny, R., Liu, J., Farmer, A.D., Steele, K.P., Stupar, R.M., Miller, J.R., et al. (2017). Exploring structural variation and gene family architecture with De Novo assemblies of 15 Medicago genomes. BMC Genomics 18, 261.


STAR★METHODS

KEY RESOURCES TABLE

REAGENT or RESOURCE | SOURCE | IDENTIFIER

Critical Commercial Assays
Blood & Cell Culture DNA Midi Kit | QIAGEN | Cat# 19060
Plant Genomic DNA kit | Tiangen | Cat# DP305
RNAprep pure Plant Kit | Tiangen | Cat# DP432

Deposited Data
Soybean: G. max and G. soja; PacBio, Bionano, Hi-C and DNA sequencing data | This study | BioProject: PRJCA002030
Soybean: G. max and G. soja; de novo assembly genome data | This study | BioProject: PRJCA002030
Soybean: G. max and G. soja; RNA and micro-RNA | This study | BioProject: PRJCA002030
Soybean: G. max and G. soja; DNA sequencing data | Zhou et al., 2015; Fang et al., 2017; this study | BioProject: PRJCA002030; see 10.1038/nbt.3096, 10.1186/s13059-017-1289-9 for detail

Oligonucleotides
Primer: P1. AGGTTGTTGTAGCAGCCACGC | This paper | N/A
Primer: P2. CCTCAGATTCATGTCCATCGCGTCC | This paper | N/A
Primer: P3. GCTGAGTTGACAGGTCTTCCAGTTGATG | This paper | N/A
Primer: P4. GCTTCTTCTCCATTGGTTGAGTGGC | This paper | N/A
Primer: P5-F. CCCTCAAATGCAAAGAACGTGATTATC | This paper | N/A
Primer: P5-R. GATAATCACGTTCTTTGCATTTGAGGG | This paper | N/A

Software and Algorithms
BWA 0.7.12-r1039 | Li and Durbin, 2009 | https://sourceforge.net/projects/bio-bwa/
SAMtools v1.4 | Li et al., 2009 | https://sourceforge.net/projects/samtools/
Picard v1.87 | N/A | http://broadinstitute.github.io/picard/
GATK v3.7-0 | McKenna et al., 2010 | https://github.com/broadinstitute/gatk/
ANNOVAR | Wang et al., 2010 | https://doc-openbio.readthedocs.io/projects/annovar/en/latest/
SNPhylo | Lee et al., 2014 | http://chibba.pgml.uga.edu/snphylo/
FastTree 2 v2.1.10 SSE3 | Price et al., 2010 | http://www.microbesonline.org/fasttree/
BioNano Solve v3.0.1 | Bionano Genomics | https://bionanogenomics.com/
CANU v1.7.1 | Koren et al., 2017 | https://github.com/marbl/canu/
HERA | Du and Liang, 2019 | https://github.com/liangclab/HERA/
Juicer v1.5 | Durand et al., 2016 | https://github.com/aidenlab/juicer/
3D-DNA | Dudchenko et al., 2017 | https://github.com/theaidenlab/3d-dna/
RepeatMasker v4.0.7 | N/A | http://repeatmasker.org/
ShortStack v3.8.5 | Axtell, 2013 | https://github.com/MikeAxtell/ShortStack/
Infernal 1.1.2 | Nawrocki et al., 2009 | http://eddylab.org/infernal/
tRNAscan-SE v2.0.0 | Lowe and Eddy, 1997 | http://lowelab.ucsc.edu/tRNAscan-SE/
MAKER | Cantarel et al., 2008 | http://www.yandell-lab.org/index.html
Augustus v3.0.3 | Stanke and Morgenstern, 2005 | https://sourceforge.net/projects/augustus/
FGENESH | Softberry | N/A
PASA | Haas et al., 2003 | https://sourceforge.net/projects/pasa/
MUMmer4 | Marçais et al., 2018 | https://mummer4.github.io/
SVMU | Chakraborty et al., 2019 | https://github.com/mahulchak/svmu
Hisat2 v2.1.2 | Kim et al., 2019 | https://ccb.jhu.edu/software/hisat2/index.shtml
StringTie 1.3.4 | Pertea et al., 2015 | https://ccb.jhu.edu/software/stringtie/
OrthoMCL v2.0.9 | Li et al., 2003 | https://orthomcl.org/orthomcl/
KOBAS 3.0 | Xie et al., 2011 | http://kobas.cbi.pku.edu.cn/
InterProScan 5 | Jones et al., 2014 | https://www.ebi.ac.uk/interpro/
Pannzer2 | Törönen et al., 2018 | http://ekhidna2.biocenter.helsinki.fi/sanspanz/
R 3.5.0 | R Development Core Team, 2013 | https://www.r-project.org/
R package clusterProfiler 3.10.1 | Yu et al., 2012 | http://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
cd-hit v4.6 | Li and Godzik, 2006 | http://weizhongli-lab.org/cd-hit/
BLASTn | Camacho et al., 2009 | ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
BLASTp | Camacho et al., 2009 | ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
MUSCLE v3.8.31 | Edgar, 2004 | http://drive5.com/muscle/
MEGA6 | Tamura et al., 2013 | https://www.megasoftware.net/
IGV v2.5.0 | Thorvaldsdóttir et al., 2013 | https://software.broadinstitute.org/software/igv/
BUSCO 3.0.2 | Simão et al., 2015 | https://busco.ezlab.org/
vg v1.6.0 | Garrison et al., 2018 | https://github.com/vgteam/vg
EMMAX | Kang et al., 2010 | http://csg.sph.umich.edu/kang/emmax/download/index.html

RESOURCE AVAILABILITY

Lead Contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Zhixi Tian
(zxtian@genetics.ac.cn).

Materials Availability
This study did not generate new unique reagents.

Data and Code Availability


The sequencing data used in this study, assembled chromosomes, unplaced scaffolds, and annotations have been deposited in the
Genome Sequence Archive (GSA) and Genome Warehouse (GWH) databases at the BIG Data Center (https://bigd.big.ac.cn/gsa/index.jsp)
under accession number PRJCA002030. The genetic variants from the 2,898 accessions, including SNPs and small indels, are
stored in VCF format and have been deposited in the Genome Variation Map (GVM) database at the BIG Data Center
(http://bigd.big.ac.cn/gvm/getProjectDetail?project=GVM000063). The genetic variants from the 29 genomes (including SNPs, small
indels, PAVs, CNVs, inversions, and translocations) have been deposited in the Figshare database (https://figshare.com/s/689ae685ad2c368f2568).

EXPERIMENTAL MODEL AND SUBJECT DETAILS

All soybean accessions were planted at the Experimental Station of the Institute of Genetics and Developmental Biology, Chinese
Academy of Sciences, Beijing (40°22′ N, 116°23′ E) during the 2018 growing season.


For de novo assembly, samples were collected from a single plant for each of the 26 accessions. For SMRT sequencing, DNA was
isolated using the Blood & Cell Culture DNA Midi Kit (QIAGEN Inc., Valencia, CA, USA). A 20 kb library was constructed and
sequenced using the Sequel™ Sequencing Plate 1.2 on the Pacific Biosciences Sequel platform. For HiSeq, DNA was isolated using
the Plant Genomic DNA kit (Tiangen, Beijing, China). A small-insert (450 bp) library was prepared and sequenced on an Illumina
HiSeq2500 platform to produce 150 bp paired-end reads. For optical mapping, high-molecular-weight DNA was isolated from young
leaves and then labeled using the direct labeling enzyme DLE-1. The labeled DNA samples were imaged using the
BioNano Irys system, and only molecules longer than 150 kb were used for further analysis. For Hi-C, leaves were fixed in 1% (vol/vol)
formaldehyde for library construction. Cell lysis, chromatin digestion, proximity-ligation treatments, DNA recovery and subsequent
DNA manipulations were performed according to a previously described method (Lieberman-Aiden et al., 2009). MboI or DpnII was
used as the restriction enzyme for chromatin digestion. The Hi-C library was sequenced on the Illumina HiSeq X Ten platform to
produce 150 bp paired-end reads.
For whole-genome resequencing of 2,027 soybean accessions, total DNA was extracted by the cetyltrimethylammonium bromide
(CTAB) method (Murray and Thompson, 1980). A paired-end sequencing library was constructed for each accession with an insert
size of approximately 350 bp and sequenced on HiSeq2500 and HiSeq X Ten platforms.
For RNA-Seq and miRNA-Seq, nine samples of leaf, flower and seed were collected separately at early, middle and late
developmental stages for each accession (A, root from growth stage V1; B, stem from growth stage V1; C, young leaf from growth
stage V1; D, mature leaf from growth stage R1; E, old leaf from growth stage R4; F, flower from growth stage R1; G, pod and seed
before 4 weeks; H, seed at 6 weeks; I, seed at 8 weeks). Total RNA was isolated using the RNAprep Pure Plant Kit (TIANGEN). RNA-seq
libraries were constructed following a previously described method (Shen et al., 2014) and sequenced on the NovaSeq
platform. Small RNA libraries were constructed as described previously (Zhou et al., 2013) and sequenced on a NextSeq 500
platform. All sequencing was carried out at BerryGenomics Company (http://www.berrygenomics.com/; Beijing, China).

METHOD DETAILS

SNP Calling and Phylogenetic Analyses


The resequencing reads of the 2,898 accessions were mapped to the soybean ZH13 reference genome (Shen et al., 2019) with BWA
(Li and Durbin, 2009) v0.7.12-r1039 using the default parameters. Nonunique and unmapped reads were then removed with SAMtools
(Li et al., 2009) v1.4, and duplicated reads were removed with the Picard package (http://broadinstitute.github.io/picard/, v1.87).
SNP calling was carried out using the Genome Analysis Toolkit (McKenna et al., 2010) (GATK, v3.7-0-gcfedb67). SNPs and INDELs
were annotated against the gene models of the ZH13 genome using ANNOVAR (Wang et al., 2010).
To build a phylogeny of soybeans from the large SNP dataset, whole-genome SNPs were first filtered by SNPhylo (Lee et al., 2014)
with parameters of -m 0.01, -M 0.01, and -l 0.5. The remaining SNPs were then used to construct the phylogeny with FastTree2
(Price et al., 2010) v2.1.10 SSE3 with the parameters -nt -gtr.
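The site-level part of the SNP filtering above can be illustrated with a minimal Python sketch. It assumes that -m and -M denote a minimum minor-allele frequency and a maximum missing rate; the function name and the genotype encoding are hypothetical, and the LD-based pruning controlled by -l is not reproduced here. This is not the SNPhylo implementation.

```python
from typing import List, Optional


def passes_site_filters(genotypes: List[Optional[int]],
                        min_maf: float = 0.01,
                        max_missing: float = 0.01) -> bool:
    """Decide whether one biallelic SNP site is kept.

    genotypes holds the alternate-allele count per diploid sample
    (0, 1 or 2), or None when the genotype is missing.
    """
    n_samples = len(genotypes)
    called = [g for g in genotypes if g is not None]
    if n_samples == 0 or not called:
        return False
    # Fraction of samples without a genotype call.
    missing_rate = 1 - len(called) / n_samples
    if missing_rate > max_missing:
        return False
    # Alternate-allele frequency among called chromosomes.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf
```

With the thresholds quoted in the text, a site is dropped when more than 1% of samples are missing or when the rarer allele is carried by fewer than 1% of called chromosomes.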

Genome Assembly
De novo assembly followed a previously reported pipeline (Du and Liang, 2019; Shen et al., 2019). In brief, Canu (Koren et al.,
2017) v1.7.1 was used to assemble PacBio subreads into PacBio contigs, after which HiSeq reads were used for error correction.
Bionano optical maps were assembled into consensus physical maps by BioNano Solve v3.0.1 (https://bionanogenomics.com/,
Solve_06082017Rel). HERA (Du and Liang, 2019) was then used to combine the PacBio contigs and Bionano-based physical maps
into PacBio-BioNano hybrid scaffolds. To anchor the hybrid scaffolds onto chromosomes, the Hi-C sequencing data were aligned to
the scaffolds with Juicer (Durand et al., 2016) v1.5 and 3D-DNA (Dudchenko et al., 2017).

Repeat Analysis and Gene Annotation


Repeats were analyzed with a method combining de novo structure analyses and homology comparison (Schmutz et al., 2010). First,
LTR_STRUC was employed to identify LTR retrotransposons (McCarthy and McDonald, 2003). Then the repeat regions were
searched by RepeatMasker (http://www.repeatmasker.org/) v4.0.7 using the annotated TEs together with TE elements in SoyTEdb
(Du et al., 2010). For non-coding genes, MIRNAs and their corresponding miRNAs were predicted by ShortStack (Axtell, 2013)
v3.8.5. snRNAs and snoRNAs were predicted by the cmscan module in Infernal (Nawrocki et al., 2009) v1.1.2. tRNAs were predicted
by tRNAscan-SE (Lowe and Eddy, 1997) v2.0.0.
For gene annotation, a strategy combining ab initio gene finding, homology-based gene prediction and RNA-Seq reads was used
(Shen et al., 2019). First, Augustus (Stanke and Morgenstern, 2005) v3.0.3 trained by FGENESH (Salamov and Solovyev, 2000) was
used for ab initio gene finding. Second, cDNA sequences of Arabidopsis thaliana (167_TAIR10), Glycine max (275_Wm82.a2.v1),
Medicago truncatula (285_Mt4.0v1), Oryza sativa (323_v7.0), Phaseolus vulgaris (442_v2.1) and Solanum lycopersicum (390_ITAG2.4)
downloaded from Phytozome v12 were used to predict homologous genes. Third, transcripts from 9 RNA-seq datasets for each
accession were used as ESTs after Hisat2 (v2.1.0) (Kim et al., 2015) alignment and StringTie (v1.3.4) (Pertea et al., 2015) assembly.
Then, MAKER (Cantarel et al., 2008) was used to combine all these gene predictions to construct the main structure of
protein-coding genes, and PASA (Haas et al., 2003) was used to predict alternative splicing types. Finally, mRNAs that encoded
peptides shorter than 30 amino acids or that did not start with methionine were filtered out. IGV (Thorvaldsdóttir et al., 2013)
v2.5.0 was used to visualize RNA-seq paired-end mapping regions. To assess gene annotation completeness, BUSCO (Simão et al.,
2015) (v3.0.2, lineage dataset embryophyta_odb9) was run for each accession.

Synteny Analysis
Synteny analysis of the 28 genomes against Gmax_ZH13 was performed via whole-genome alignment using MUMmer4 (Marçais et al.,
2018). The published genomes of the cultivated accession Wm82 (275_Wm82.a2.v1) and the wild accession W05 (ASM419377v1) were
also used in this study. The genomes were aligned using NUCmer (-c 1000), and the alignment blocks were then filtered using
delta-filter in one-to-one alignment mode (-1). Blocks longer than 1000 bp were used for further structural variation
detection.
For whole genome duplication (WGD) investigation, we first masked all repetitive DNA except predicted genes, and compared
translations of the remaining genic nucleotide sequences using MUMmer4 (Marçais et al., 2018). Only the top reciprocal best matches
between any two chromosomes in a comparison were considered duplicated regions from the recent WGD. We then evaluated
the colinear chains. To increase sensitivity, chains were built on the basis of gene order, excluding positions of non-orthologous
genes, rather than gene coordinates. Chains were required to have at least five colinear genes with no more than ten intervening
genes between neighbors. The resulting gene pairs were classified as WGD syntenic. To investigate the relationship between
structural variations and WGD, the WGD information from the Williams 82 genome (Schmutz et al., 2010) was used as a reference. To
compare nucleotide diversity between WGD and non-WGD regions, the WGD and non-WGD blocks were divided into consecutive 100-bp
windows, and nucleotide diversity was calculated in each window.
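The chaining rule described above (at least five colinear genes, no more than ten intervening genes between neighbors) can be sketched in Python as follows. This is a simplified, hypothetical illustration rather than the authors' procedure: it assumes matched gene pairs are given as gene-order ranks on two chromosomes, handles forward-orientation chains only, and uses a greedy single pass that can miss interleaved chains.

```python
def colinear_chains(pairs, min_genes=5, max_gap=10):
    """Greedy chaining of matched genes by gene order.

    pairs: iterable of (rank_a, rank_b), the gene-order ranks of a
    matched gene pair on chromosome A and chromosome B.
    A pair extends the current chain when both ranks advance by at
    most max_gap + 1 (i.e. at most max_gap intervening genes).
    """
    chains, current = [], []
    for a, b in sorted(pairs):
        if current:
            prev_a, prev_b = current[-1]
            step_a, step_b = a - prev_a, b - prev_b
            if 1 <= step_a <= max_gap + 1 and 1 <= step_b <= max_gap + 1:
                current.append((a, b))
                continue
            # The chain is broken: keep it only if it is long enough.
            if len(current) >= min_genes:
                chains.append(current)
        current = [(a, b)]
    if len(current) >= min_genes:
        chains.append(current)
    return chains
```

Working on ranks rather than base-pair coordinates mirrors the stated design choice of chaining by gene order, so large intergenic insertions do not break a chain.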
To assess the effect of PAVs on SNP evolution, 1 kb of flanking sequence was extracted from each side of each PAV and divided into
ten consecutive 100-bp windows, and nucleotide diversity was calculated in each window. The nucleotide diversity reported for
each window position was the mean over all corresponding windows.
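The windowed calculation can be sketched as below, assuming the flanking sequences have already been aligned to equal length without gaps; the function names are illustrative and the per-site pairwise-difference definition of nucleotide diversity is used. The paper does not describe its implementation, so this is only one plausible reading.

```python
from itertools import combinations


def window_pi(seqs, window=100):
    """Nucleotide diversity (mean pairwise differences per site)
    in consecutive, non-overlapping windows of aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two aligned sequences")
    length = len(seqs[0])
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    values = []
    for start in range(0, length - window + 1, window):
        diffs = 0
        for s1, s2 in combinations(seqs, 2):
            w1, w2 = s1[start:start + window], s2[start:start + window]
            diffs += sum(1 for x, y in zip(w1, w2) if x != y)
        values.append(diffs / (n_pairs * window))
    return values


def mean_by_position(per_flank_values):
    """Average window i over all flanks (one list of values per flank)."""
    n = len(per_flank_values)
    return [sum(col) / n for col in zip(*per_flank_values)]
```

For a 1 kb flank, `window_pi` returns ten values, and `mean_by_position` averages each of the ten positions across all PAV flanks.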

Structural Variation Identification


SNPs and indels were identified using show-snps (-ClrT) of the MUMmer4 toolkit. We used the SVMU (structural variants from
MUMmer) (Chakraborty et al., 2019) pipeline to automate presence/absence variation (PAV) discovery by parsing the NUCmer
output. In the SVMU results, insertions and deletions (tagged INS/DEL) were treated as PAVs. Genome regions detected neither
as synteny blocks by NUCmer nor as insertions/deletions by SVMU were also treated as PAV regions.
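The last rule, treating regions covered by neither synteny blocks nor SVMU calls as PAVs, amounts to taking the complement of a set of intervals. A minimal sketch, assuming half-open 0-based coordinates on a single chromosome (the function name and coordinate convention are assumptions for illustration):

```python
def uncovered_regions(chrom_length, covered):
    """Return the parts of [0, chrom_length) not covered by any interval.

    covered: iterable of (start, end) half-open intervals, e.g. synteny
    blocks plus INS/DEL calls on the same chromosome; they may overlap.
    """
    regions, pos = [], 0
    for start, end in sorted(covered):
        if start > pos:
            regions.append((pos, start))
        pos = max(pos, end)
    if pos < chrom_length:
        regions.append((pos, chrom_length))
    return regions
```

For example, with blocks at 100–300, 250–400 and 800–1000 on a 1000-bp sequence, the uncovered regions are 0–100 and 400–800.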
For copy number variation (CNV), we first filtered out synteny blocks shorter than 100 bp. A sequence region covered by two or
more separate overlapping synteny blocks (> 90% identity) was called as a CNV. Translocation and inversion events (both referring
to structural variations ≥ 1 kb) were detected by manual inspection of their location and orientation relative to neighboring
blocks, based on the non-allelic homology blocks from the above MUMmer4 alignment. Neighboring blocks belonging to the same type
of event were merged. For structural variation merging, we followed a method previously reported for human genomes (Audano
et al., 2019). The ZH13 genome was set as the reference genome, SoyW01 served as the initial callset, and new sites were added
per sample. Any variant in a sample that had 50% reciprocal overlap with an existing discovered variant was excluded. Merging
was performed separately for each variant type.
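The merging rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: variants are represented as hypothetical (chrom, start, end, svtype) tuples with positive length on reference coordinates, and "50% reciprocal overlap" is read as the overlap covering at least half of both variants.

```python
def reciprocal_overlap(a, b):
    """Smallest fraction of either interval covered by their overlap."""
    (s1, e1), (s2, e2) = a, b
    overlap = min(e1, e2) - max(s1, s2)
    if overlap <= 0:
        return 0.0
    return min(overlap / (e1 - s1), overlap / (e2 - s2))


def merge_callsets(samples, min_ro=0.5):
    """Build a nonredundant callset sample by sample.

    samples: list of per-sample call lists; the first list is taken as
    the initial callset. Each call is (chrom, start, end, svtype).
    A new call is skipped when it reciprocally overlaps an existing
    call of the same type on the same chromosome by >= min_ro.
    """
    if not samples:
        return []
    merged = list(samples[0])
    for calls in samples[1:]:
        for chrom, start, end, svtype in calls:
            redundant = any(
                c == chrom and t == svtype
                and reciprocal_overlap((start, end), (s, e)) >= min_ro
                for c, s, e, t in merged
            )
            if not redundant:
                merged.append((chrom, start, end, svtype))
    return merged
```

Comparing only calls of the same `svtype` corresponds to performing the merge separately for each variant type.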
For graph-based genome construction and analyses, the ZH13 genome was set as the reference, the nonredundant structural
variations with less than 90% repetitive sequence were saved in variant call format (VCF), and the graph-based genome was
constructed with the vg (https://github.com/vgteam/vg, version v1.6.0) toolkit (Garrison et al., 2018). To genotype the structural
variations in the 2,898 accessions, we mapped the Illumina short reads from each accession to the graph-based genome with the vg
toolkit using default parameters.

Genetic Variation Analysis


SNP density, dN, dS and nucleotide diversity (π) were calculated for both the resequencing dataset of the 2,898 soybean accessions
and the pan-genome dataset in 500 kb non-overlapping windows. Overlapping repeat regions (marked by RepeatMasker v4.0.7) from ZH13
were merged, and regions > 100 kb after merging were treated as large repeat regions. To ensure data confidence,
SNP density, dN, dS and π in large repeat regions were masked. Structural variation density, repeat content, and average synteny
similarity were calculated in 500 kb non-overlapping windows across the genome. SV density was counted as the number of structural
variations > 50 kb; repeat content was counted as the proportion of sequence marked as repeats by annotation; and average
synteny similarity was calculated as the average proportion of syntenic sequence between the 28 de novo assembled genomes and
ZH13. dN and dS were calculated using a previously reported method (Nei and Gojobori, 1986). The genetic distance between
populations and the divergence age of the evolutionary events in Figures 5 and S6 were estimated using the method of Nei (1972).
Cultivated soybean was reported to have been domesticated from wild soybean approximately 5,000 years ago (Carter et al., 2004).
Therefore, in divergence age estimations for different evolutionary events, the genetic distance between wild soybeans and modern
cultivars was set as 5,000 years as a control.
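As an illustration of the distance-based dating, the sketch below computes Nei's (1972) standard genetic distance, D = -ln(J_XY / sqrt(J_X J_Y)), from allele frequencies and rescales a distance to years using the 5,000-year wild-versus-cultivar calibration. The linear rescaling and the function names are assumptions made for this example; the paper does not spell out the exact calculation.

```python
import math


def nei_distance(freqs_x, freqs_y):
    """Nei's (1972) standard genetic distance between two populations.

    freqs_x, freqs_y: one list of allele frequencies per locus, with
    alleles in the same order in both populations.
    """
    jx = jy = jxy = 0.0
    for px, py in zip(freqs_x, freqs_y):
        jx += sum(p * p for p in px)
        jy += sum(q * q for q in py)
        jxy += sum(p * q for p, q in zip(px, py))
    identity = jxy / math.sqrt(jx * jy)
    return -math.log(identity)


def scaled_age(d_event, d_wild_vs_cultivar, calibration_years=5000):
    """Assumed linear scaling: distance proportional to divergence time."""
    return calibration_years * d_event / d_wild_vs_cultivar
```

Under this assumption, an event whose genetic distance is half the wild-versus-cultivar distance would be dated to roughly 2,500 years.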

Gene and miRNA Expression


After clipping the adaptor sequences and removing the low-quality reads, the RNA sequence data from each sample were mapped to
the reference soybean genome using Hisat2 v2.1.2 and StringTie v1.3.4 with default parameters (Pertea et al., 2016; Trapnell et al.,
2012). Gene expression was normalized as the number of reads per kilobase of exon sequence in a gene per million mapped
reads (RPKM) (Mortazavi et al., 2008).
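The RPKM normalization defined above reduces to a one-line formula; the following sketch (with a hypothetical function name) makes the units explicit.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and mapped read total must be positive")
    kilobases = exon_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / kilobases / millions
```

For instance, 500 reads on a gene with 2,000 bp of exon sequence in a library of 10 million mapped reads gives `rpkm(500, 2000, 10_000_000)`, which evaluates to 25.0.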
The miRNA expression analysis was performed using a previously reported method (Zhou et al., 2013). In brief, the small RNA
sequences that mapped to mature miRNA-5p or miRNA-3p were defined as the expression of miRNA-5p and miRNA-3p, respectively.
The final expression of miRNA-5p or miRNA-3p was calculated as TPM.

Core and Dispensable Gene Family Clustering


The core and dispensable gene sets were estimated based on gene family clustering using OrthoMCL (Li et al., 2003) v2.0.9. For each
de novo accession and ZH13, genes whose CDS was 100% identical to that of another gene were removed using cd-hit-est from the
CD-HIT (Li and Godzik, 2006) v4.6 toolkit with the parameters -c 1 -aS 1. Protein sequences of the remaining genes were subjected to
homology searching by BLASTp (Camacho et al., 2009) with the parameters -evalue 1e-10 -max_target_seqs 116. OrthoMCL
(version 2.0.9) was used to process the BLAST results with the parameters percentMatchCutoff = 50 and -I 1.5 for gene family
clustering. Gene families shared among all accessions were defined as core gene families, gene families missing in one or two
accessions were defined as softcore gene families, gene families missing in more than two accessions were defined as
dispensable gene families, and those present in only one accession were defined as private gene families.
For phylogenetic analysis of each gene family, MUSCLE (Edgar, 2004) v3.8.31 was used for sequence alignment and MEGA6
(Tamura et al., 2013) was used for phylogenetic tree building.
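The four gene family categories defined above can be expressed as a small decision function. The sketch below is illustrative; the default of 27 accessions (26 de novo assemblies plus ZH13) is an assumption inferred from the text, and the function name is hypothetical.

```python
def classify_family(n_present, n_total=27):
    """Classify a gene family by the number of accessions carrying it.

    n_total defaults to 27 (assumed: 26 de novo accessions plus ZH13).
    """
    if not 1 <= n_present <= n_total:
        raise ValueError("n_present must be between 1 and n_total")
    n_missing = n_total - n_present
    if n_missing == 0:
        return "core"
    if n_present == 1:
        return "private"
    if n_missing <= 2:
        return "softcore"
    return "dispensable"
```

The private check is placed before the softcore check so that a family found in a single accession is always reported as private, even in a very small panel.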
For gene function annotation, KEGG pathway analysis was performed using KOBAS 3.0 (Xie et al., 2011), protein domains were
annotated by InterProScan 5 (Jones et al., 2014), and Gene Ontology terms were annotated by PANNZER2 (Törönen et al., 2018). The
enrichment test was performed with the clusterProfiler (Yu et al., 2012) v3.10.1 package in R 3.5.0 (R Development Core Team, 2013).
QTL information was obtained from SoyBase (https://www.soybase.org/search/qtllist_by_symbol.php).

CHS Gene Unit Identification


The CHS gene is highly duplicated at the I locus on chromosome 08. To identify all CHS units (genes and/or pseudogenes),
we searched the genomic sequence of SoyZH13_08G103000 (CHS4), from the start codon to the stop codon, against the 29 soybean
genomes using BLASTn (Camacho et al., 2009) (e-value < 1e-20). Hits with identity > 90% were selected as putative CHS units.
The NJ phylogeny of the SNPs from the region surrounding the I locus was built with MEGA6 (Tamura et al., 2013).
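The hit selection described above can be sketched as a filter over BLASTn results. This is a minimal illustration that assumes the search was written in the standard 12-column tabular format (-outfmt 6); the source does not state the output format, and the names Hit and select_chs_units are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    subject: str     # subject sequence (e.g. chromosome) name
    start: int       # leftmost subject coordinate
    end: int         # rightmost subject coordinate
    strand: str      # "+" or "-", inferred from coordinate order
    identity: float  # percent identity
    evalue: float


def select_chs_units(lines: Iterable[str],
                     min_identity: float = 90.0,
                     max_evalue: float = 1e-20) -> List[Hit]:
    """Keep hits with identity > min_identity and e-value < max_evalue.

    Expects BLAST tabular columns: qseqid sseqid pident length mismatch
    gapopen qstart qend sstart send evalue bitscore.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        pident = float(fields[2])
        sstart, send = int(fields[8]), int(fields[9])
        evalue = float(fields[10])
        if pident > min_identity and evalue < max_evalue:
            # BLAST reports minus-strand hits with sstart > send
            strand = "+" if sstart <= send else "-"
            hits.append(Hit(fields[1], min(sstart, send), max(sstart, send),
                            strand, pident, evalue))
    return hits
```

Each retained hit is only a putative CHS unit; distinguishing genes from pseudogenes would need further inspection that this sketch does not attempt.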

Gene Fusion Event Identification


To identify gene fusion events among different genomes, we adapted a previously reported method for retrogene investigation
(Wang et al., 2006). The cDNA sequences from each pair of genomes were blasted against each other, and gene pairs in which one
gene from the A genome matched two or a few adjacent, separate genes from the B genome (or in which the internal genes had no
syntenic genes in this region) were considered candidates. Subsequently, we filtered these candidate pairs by the
following criteria: (1) to eliminate unlikely protein-coding genes, cDNAs with small open reading frames (< 100 amino acids) were
discarded; (2) candidates in which the matching length of one of the separated genes to the fusion gene was less than 50% were
discarded; (3) candidates in which the fusion gene was shorter than 60% of the total length of the separated genes were discarded;
and (4) genes without RNA-seq transcriptional support were discarded.
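The four filters above can be sketched as a single predicate. This is a minimal illustration under stated assumptions, not the authors' code: the data classes and field names are hypothetical, criterion (2) is read as "the aligned length is at least 50% of that separated gene's own length", and criteria (1) and (4) are applied to the fusion gene as well as to every separated gene.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitGene:
    length: int          # cDNA length in bp
    orf_aa: int          # longest open reading frame, in amino acids
    matched_length: int  # bp of this gene aligned to the fusion gene
    has_rnaseq: bool     # supported by RNA-seq data


@dataclass
class Candidate:
    fusion_length: int       # cDNA length of the putative fusion gene
    fusion_orf_aa: int
    fusion_has_rnaseq: bool
    split_genes: List[SplitGene]


def passes_filters(c: Candidate) -> bool:
    genes = c.split_genes
    # (1) discard cDNAs with open reading frames < 100 amino acids
    if c.fusion_orf_aa < 100 or any(g.orf_aa < 100 for g in genes):
        return False
    # (2) each separated gene must match the fusion gene over >= 50%
    #     of its own length (assumed denominator)
    if any(g.matched_length < 0.5 * g.length for g in genes):
        return False
    # (3) fusion gene must be >= 60% of the summed separated-gene length
    if c.fusion_length < 0.6 * sum(g.length for g in genes):
        return False
    # (4) require RNA-seq support for all genes involved
    return c.fusion_has_rnaseq and all(g.has_rnaseq for g in genes)
```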

QUANTIFICATION AND STATISTICAL ANALYSIS

All details of the statistics applied are provided in the figures and corresponding legends. Statistical analyses were
performed in R 3.5.0. The genome-wide association study was performed using the EMMAX software package (Kang et al., 2010). The
significance threshold was 1 × 10⁻⁶. Seed luster phenotypes were classified into three categories: luster, intermediate, and
lusterless. The phenotypes came from 754 accessions.


Supplemental Figures


Figure S1. Functional Annotation of Different Gene Categories, Related to Figure 2


(A) Pfam enrichment of core and dispensable gene families.
(B) Gene ontology analyses of core and dispensable gene families.
(C) Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses of core and dispensable gene families.

Figure S2. SNP and Structural Variation of the 29 De Novo Assembled Genomes, Related to Figure 3
(A) Correlation of SNP density, π, dN, and dS from 29 de novo assembled genomes and 2,898 resequenced accessions.
(B) Characterization of large structural variations in the 29 soybean genomes. The heatmap shows the presence frequency of structural variations in the 29 genomes.

Figure S3. Presence-Absence Variation Leads to Chromosome Size Variation and Gene Gain and Loss of the 27 De Novo Assembled Ge-
nomes, Related to Figure 3
(A) Correlation between presence-absence variation (PAV) and chromosome size variation in the 27 de novo assembled genomes.
(B) Comparison of the PAVs in chromosome 07. The left panel compares PAVs from accessions with the longest and shortest chromosome sizes. The right panel
shows the variation in chromosome size in all 27 de novo assembled accessions. +/−, presence/absence genome size (bp) compared to ZH13.
(C) Comparison of the PAVs in chromosome 17. The left panel compares PAVs from accessions with the longest and shortest chromosome sizes. The right panel
shows the variation in chromosome size in all 27 de novo assembled accessions. +/−, presence/absence genome size (bp) compared to ZH13.
(D) A 16 kb deletion from SoyC14 results in the loss of SoyZH13_18G184700.
(E) A 23 kb insertion from SoyC10 results in the gain of SoyC10_06G170400, SoyC10_06G170500, and SoyC10_06G170600.

Figure S4. Comparison of Gene and Structural Variation Characteristics between Whole Genome Duplication (WGD) and Non-WGD Regions,
Related to Figure 3
(A) Comparison of gene density between WGD and non-WGD regions.
(B) Comparison of repetitive DNA proportions between WGD and non-WGD regions.
(C) Comparison of nucleotide diversity between WGD and non-WGD regions. The data are given as the mean ± 95% CI. Significance was tested by unpaired two-
sample Wilcoxon test; *** p < 0.001.
(D) Gene components in WGD and non-WGD regions.
(E) Structural variation components in WGD and non-WGD regions.
(F) Comparison of the PAV-driven single-nucleotide mutation rate between WGD and non-WGD regions. The data are the mean ± 95% CI.

Figure S5. Validation of Gene Fusion between E3 (SoyZH13_19G210400) and Its Neighboring Gene SoyZH13_19G210600 in the Accessions of
Haplotype E3-tr, Related to Figure 4
(A) Neighbor-joining phylogenetic analyses of E3 in different species.
(B) Validation of the gene fusion through cDNA amplification from different accessions. L05, L09, C13, and C14 are representative accessions of haplotype E3-tr with
the 13.3 kb deletion, and W02, C12, and ZH13 are representative accessions without the deletion. Sequencing of the amplification products from L05, L09, C13, and C14 cDNA
confirms the gene fusion.
(C) Neighbor-joining phylogenetic analyses of SoyZH13_19G210600 in different species.

Figure S6. An Inversion on Chromosome 07 between Wild Soybean and Cultivated Soybean Genomes Is Associated with Soybean
Domestication, Related to Figure 5
(A) Diagram of the 360 kb inversion between wild (SoyW02 as sample) and cultivated soybean (ZH13 as sample) genomes.
(B) The inversion is associated with the divergence between wild and cultivated soybeans, which might have occurred approximately 4,700 years ago. ya,
years ago.
(C) Fst between wild soybean and cultivated soybeans on chromosome 07 (left panel). Distribution of the Fst value of the 360 kb INV region (red arrow) at the whole
genome level (right panel).
(D) Synteny and gene structure comparison of this inversion on chromosome 7 and its whole genome duplicated region on chromosome 17.

Figure S7. Structure and Sequences of the Mutator from SoyZH13_14G179600, Related to Figure 6
The Mutator comprises two terminal inverted repeats (TIRs). Each TIR is approximately 700 bp in length, and the two are reverse complements of each other. This
Mutator has the target site duplication TGATAAATG.
