Pi Is 0092867420306188
Correspondence
cliang@genetics.ac.cn (C.L.),
zxtian@genetics.ac.cn (Z.T.)
In Brief
A high-quality graph-based soybean pan-
genome is constructed through de novo
genome assemblies of 26 representative
wild and cultivated soybean accessions,
demonstrating the impact of structural
variation on key agronomic traits.
Highlights
• De novo genome assemblies for 26 representative soybeans
Resource
Pan-Genome of Wild and Cultivated Soybeans
Yucheng Liu,1,7,8 Huilong Du,2,7,8 Pengcheng Li,3 Yanting Shen,4 Hua Peng,2,7 Shulin Liu,1 Guo-An Zhou,1 Haikuan Zhang,3
Zhi Liu,1,7 Miao Shi,3 Xuehui Huang,5 Yan Li,6 Min Zhang,1 Zheng Wang,1 Baoge Zhu,1 Bin Han,6 Chengzhi Liang,2,7,*
and Zhixi Tian1,7,9,*
1State Key Laboratory of Plant Cell and Chromosome Engineering, Institute of Genetics and Developmental Biology, Innovation Academy for
Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200032, China
7College of Advanced Agriculture Sciences, University of Chinese Academy of Sciences, Beijing 100049, China
8These authors contributed equally
9Lead Contact
SUMMARY
Soybean is one of the most important vegetable oil and protein feed crops. Capturing its entire genomic
diversity requires a complete, high-quality pan-genome constructed from diverse soybean accessions.
In this study, we performed individual de novo genome assemblies for 26 representative soybeans that
were selected from 2,898 deeply sequenced accessions. Using these assembled genomes together with
three previously reported genomes, we constructed a graph-based genome and performed pan-genome
analysis, which identified numerous genetic variations that cannot be detected by direct mapping of short
sequence reads onto a single reference genome. The structural variations from the 2,898 accessions that
were genotyped based on the graph-based genome and the RNA sequencing (RNA-seq) data from the repre-
sentative 26 accessions helped to link genetic variations to candidate genes that are responsible for impor-
tant traits. This pan-genome resource will promote evolutionary and functional genomics studies in soybean.
INTRODUCTION
Increasing reports have suggested that one or a few reference genomes cannot represent the full range of genetic diversity of a species (Scherer et al., 2007; Li et al., 2010a, 2014; Hirsch et al., 2014; Lin et al., 2014; Saxena et al., 2014; Yao et al., 2015; Golicz et al., 2016a; Zhou et al., 2017; Wang et al., 2018b; Hurgobin et al., 2018), which limits the identification of genetic variants, particularly larger structural variants (SVs) such as presence/absence variants (PAVs) and copy number variants (CNVs) that have been revealed to play key roles in the genetic determination of agronomical traits (Xu et al., 2006; Shomura et al., 2008; Cook et al., 2012; Hufford et al., 2012; Hirsch et al., 2014; Lu et al., 2015; Deng et al., 2017; Lye and Purugganan, 2019). Pan-genome construction is becoming increasingly necessary (Tettelin et al., 2005; Golicz et al., 2016b; Li et al., 2017; Tao et al., 2019). In addition, conventional linear references are challenged in showing the genotypes of different alleles of each locus and in the identification of larger SVs. There is a trend toward construction of graph-based genomes, which contain the variations in a population and enable fast and accurate genotyping, particularly for larger SVs (Garrison et al., 2018; Kim et al., 2019; Rakocevic et al., 2019; Eggertsson et al., 2019; Ameur, 2019).
Soybean is one of the most important vegetable oil and protein feed crops. Cultivated soybean (Glycine max [L.] Merr.) was domesticated from its wild relative (Glycine soja [Sieb. and Zucc.]) in China 5,000 years ago. At present, over 60,000 accessions adapted to different ecoregions have been developed (Carter et al., 2004; Wilson, 2008; Li et al., 2020). The first soybean reference genome of a cultivated accession (Williams 82, termed Wm82 in this study) opened the door to soybean functional genomics (Schmutz et al., 2010; Chan et al., 2012; Wang and Tian, 2015). Intergenomic comparisons demonstrate that extensive genetic diversity exists between wild soybeans and cultivated soybeans and also among cultivated soybeans from different geographic areas (Bandillo et al., 2015; Hyten et al., 2006; Li et al., 2010b; Liu et al., 2017; Shen et al., 2018; Xie et al., 2019). Recently, two other reference genomes have become available, one from "Zhonghuang 13" (ZH13) that is the most widely planted soybean cultivar in China (Shen et al., 2018; Shen et al., 2019) and another from a wild soybean
(W05) (Xie et al., 2019). Comparisons between Wm82 and these two genomes further demonstrate that a considerable amount of CNVs and PAVs exist in different accessions, promoting the need for construction of a complete pan-genome from diverse soybean accessions.
A pan-genome from seven wild soybeans has been constructed using second generation sequencing technology (Li et al., 2014). However, due to technology limitations at the time, genome assembly, annotation, and chromosome-scale SV interrogations need to be improved (Li et al., 2014). In this study, we individually de novo assembled 26 soybean genomes and constructed a graph-based genome using these assembled genomes together with three previously reported genomes. Pan-genome analyses disclosed numerous genetic variations that cannot be detected by mapping second generation sequencing reads to a single genome, including hundreds of thousands of large structural variations and dozens of gene fusion events, which helped us to identify candidate genes associated with agronomic traits.
RESULTS
De Novo Genome Assembly and Annotation of 26 Soybean Accessions
To make the pan-genome represent the full range of genetic diversity of soybean, we sequenced a total of 2,898 soybean accessions (871 from our previous studies [Fang et al., 2017; Zhou et al., 2015] and 2,027 from this study) by Illumina technology with an average coverage depth of more than 13× for each accession (Figure 1A; Table S1). These accessions included 103 wild soybeans, 1,048 landraces, and 1,747 cultivars, which were collected globally and represented a full range of soybean geographic distributions (Figure 1A). After mapping against the genome Gmax_ZH13 (Shen et al., 2019), we identified 31,870,983 single-nucleotide polymorphisms (SNPs).
Phylogenetic analyses using whole-genome SNPs classified these 2,898 accessions into six major groups (a to f), including all wild soybeans in one group and all cultivated soybeans in five groups that were correlated to their geographic distribution (Figure 1B). To construct a well representative pan-genome, in addition to ZH13, we selected 26 accessions for de novo assembly. These accessions included 3 wild soybeans, 9 landraces, and 14 cultivars that best represented the 2,898 accessions in terms of phylogenetic relationships and geographic distributions (Figure 1A; Table S2). We also tended to select the accessions that greatly contributed to breeding and production. For example, SoyL04, SoyL06, and SoyL09 had been used as parental lines to develop more than one hundred varieties in China; SoyC04, SoyC11, and SoyC13 are the most popular varieties in different planting regions of China.
The 26 accessions were sequenced individually using single-molecule real-time (SMRT) sequencing with an average coverage depth of 96×, optical mapping with an average coverage depth of 277×, chromosome conformation capture (Hi-C) sequencing with an average coverage depth of 136×, and Illumina sequencing (HiSeq) with an average coverage depth of 68× (Table S2). Subsequently, we performed de novo genome assembly for each accession. The contig N50 sizes of the 26 whole-genome assemblies ranged from 18.8 to 26.8 Mb with a mean of 22.6 Mb, and scaffold N50 sizes ranged from 50.3 to 52.3 Mb with a mean of 51.2 Mb. The final assembled genome sizes ranged from 992.3 Mb to 1059.8 Mb with a mean of 1011.6 Mb. For each accession, an average of 99% of contigs were anchored to the chromosomes (Tables 1 and S2). The Illumina reads from individual accessions were re-mapped onto the corresponding assembled genomes, and the mapping ratio reached 99.4% (Table S2), indicating a high completeness of each assembled genome.
Repetitive DNA made up 54.4% of each genome with a range from 53.6% to 55.3% (Table 1). Among the repetitive sequences, long terminal repeat (LTR)-retrotransposons were the most abundant (Table S2), which is consistent with previously reported soybean genomes (Schmutz et al., 2010; Shen et al., 2018, 2019; Xie et al., 2019). To annotate the protein-coding and small RNA genes, for each of the 26 accessions, we collected 9 samples from roots, stems, leaves, flowers, and seeds at different developmental stages and performed RNA sequencing (RNA-seq) (with a mean of 8 Gb for each sample) and small RNA-seq (with a mean of 278 Mb for each sample) (Table S2). An average of 56,522 protein coding genes, 553 microRNA (miRNA), 171 small nuclear RNA (snRNA), and 439 ribosomal RNA (rRNA) genes per genome were predicted (Tables 1 and S2). Benchmarking universal single-copy orthologs (BUSCO) evaluation showed that a mean of 95.6% of the 1,440 single copy Embryophyta genes were completely assembled in these genomes (Table S2), indicating high completeness of the gene annotation.
Core and Dispensable Genes
We further performed pan-genome analyses for the 26 newly de novo assembled genomes plus the ZH13 genome using a reported strategy (Hirsch et al., 2014; Hu et al., 2017; Li et al., 2014; Wang et al., 2018b; Zhao et al., 2018). Ortholog investigation classified all genes from the 27 soybean genomes into 57,492 families, which was close to the number from wild soybeans (Li et al., 2014). The total gene sets increased as additional genomes were added and approached a plateau when n = 25 (Figure 2A), indicating the representativeness of these 27 soybean accessions. Of the total gene sets, 20,623 families were present in all 27 accessions and were defined as core genes, 8,163 families present in 25 to 26 accessions (>90% of the collection) were defined as softcore genes, 28,679 families present in 2 to 24 accessions were defined as dispensable genes, and 27 families present in only one accession were defined as private genes (Figure 2B). Although the total of dispensable and private gene families accounted for a larger proportion (49.9%) of the total gene sets in the 27 accessions, they accounted for an average of 19.1% of the genes in individual accessions (Figures 2B–2D; Table S3).
We found that 77.5% of the core genes and 72.1% of the softcore genes contained InterPro domains, which was much higher than the percentages in the dispensable and private genes (49.0% and 38.5%, respectively) (Figure 2E). Moreover, the nucleotide diversity (π) and dN/dS were higher in dispensable genes than in core genes (Figures 2F and 2G). These results indicated that core genes were more functionally conserved than
Figure 1. Geographic Distribution and Phylogenetic Analysis of 2,898 Resequenced Soybean Accessions
(A) Geographic distribution of the 2,898 accessions. The number of accessions collected from each region is indicated by the size of pie, and the ratio of wild
soybean (G. soja, indicated by purple), landrace (green), and cultivar (orange) for each region is shown in the pie. The total number of soybean germplasm re-
sources from each country or region is indicated by the degree of blue color.
(B) Phylogenetic tree of all accessions inferred from whole-genome SNPs. The lines with different colors indicate wild soybean (purple), landrace (green) and
cultivar (orange). The geographic origin of each accession is divided into eight clades. EU, Europe; NA, North America; JAP, Japan; KOR, Korea; RUS, Russia;
CHN, China; I–VI, Eco-regions of soybeans in China. The 26 accessions used for de novo assembly are indicated in the phylogenetic tree. SoyW, wild soybean;
SoyL, landrace; SoyC, cultivar.
See also Table S1.
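The four gene-family categories used in the Core and Dispensable Genes section (core, softcore, dispensable, private) amount to a threshold rule on the number of accessions that carry a family. The following is a minimal Python sketch of that rule, not the authors' pipeline: the function name `classify_family` and the way the 90% cutoff is coded are assumptions made for illustration, and only the counts in `reported` are taken from the text.

```python
def classify_family(n_present: int, n_total: int = 27) -> str:
    """Assign a pan-genome category from the number of accessions carrying a family.

    Hypothetical helper: the thresholds follow the definitions in the text
    (all accessions = core, >90% but not all = softcore, 2 or more = dispensable,
    exactly one = private).
    """
    if not 1 <= n_present <= n_total:
        raise ValueError("n_present must be between 1 and n_total")
    if n_present == n_total:
        return "core"
    if n_present > 0.9 * n_total:
        return "softcore"
    if n_present >= 2:
        return "dispensable"
    return "private"


# Family counts reported in the text for the 27 genomes.
reported = {"core": 20623, "softcore": 8163, "dispensable": 28679, "private": 27}
total_families = sum(reported.values())
variable_share = 100 * (reported["dispensable"] + reported["private"]) / total_families

print(total_families)            # 57492
print(round(variable_share, 1))  # 49.9
```

With 27 accessions the 90% cutoff places families found in 25 or 26 accessions in the softcore class, and the reported counts add up to the 57,492 families and the 49.9% dispensable-plus-private share given in the text.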
dispensable genes. Pfam enrichment and Gene Ontology (GO) analyses showed that core genes were enriched in biological processes related to growth, immune system, reproductive, cellular, and cellular component organization or biogenesis, such as ring finger domain, AP2 domain, WD domain, WRKY DNA-binding domain, and bZIP transcription factors. In contrast, the dispensable and private genes were enriched for abiotic and biotic response genes, such as different NBS gene families (especially NBS-LRR) and stress upregulated nod genes (Figures S1A and S1B), which was consistent with previous findings in other plants (Gordon et al., 2017; Wang et al., 2018b; Zhao et al., 2018). Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analyses showed that the core genes were enriched in pathways related to the basic metabolism and biosynthesis of secondary metabolites. The dispensable genes, by comparison, were more enriched in pathways related to specific metabolism, such as fatty acid biosynthesis and fatty acid degradation (Figure S1C), indicating that the dispensable and private genes may play important roles in determining the divergence of fatty acid composition in different accessions.
Sequence Variation Identification in 29 Soybean Genomes
To discover sequence variations, the 26 genomes plus three previously reported genomes, Wm82, ZH13, and W05, were compared. We anchored the 28 genome sequences onto the
ZH13 genome (Shen et al., 2019). A total of 14,604,953 SNPs and 12,716,823 small insertions and deletions (indels, referring to ≤50 bp in this work) were identified. Although the SNP number from the pan-genome was less than that from the 2,898 accessions (14,604,953 versus 31,870,983), the distributions of SNPs from these two datasets exhibited similar patterns across the genome (Figure 3A). Particularly, when the SNPs with minor allele frequency (MAF) <0.01 were removed from the 2,898 accessions, the correlation between the SNP distributions from the 29 soybean genomes and the 2,898 accessions reached 0.553 (Figure S2A). We also calculated the nucleotide diversity, dN and dS in the 29 genomes and 2,898 accessions, respectively, and found that each of them showed high correlations between the 29 soybean genomes and the 2,898 accessions (Figure S2A), further indicating the representativeness of the selected soybean accessions.
In addition to identifying SNPs and small indels, de novo construction of the pan-genome provides a valuable platform for the identification of larger SVs (Tettelin et al., 2005; Li et al., 2017; Tao et al., 2019; Yang et al., 2019). Comparing the 28 genomes to ZH13 identified a total of 723,862 PAVs (referring to >50 bp insertion or deletion in this work), 27,531 CNVs, 21,886 translocation events (including 6,801 intra-chromosome translocations and 15,085 inter-chromosome translocations), and 3,120 inversion events (Table S4). We found that most of the PAVs had a length from 1 kb to 2 kb, translocations were concentrated from 10 kb to 30 kb, inversions mainly ranged from 100 to 200 kb, and the CNVs varied from 2 to >10 with an enrichment of 2 and 3 (Figure S2B).
The 723,862 PAVs made up a total of 4.71 Gb of sequence with a mean of 167.09 Mb in each accession, which accounted for 16% of the assembled genome in each accession (Table S4). We found that more than 90% of the length variation of the assembled genomes resulted from PAVs (Figure S3A; Table S4), indicating that PAV was a major contributor to genome size variation. For example, compared with SoyW03, a total 1.2 Mb deletion was found in ZH13 from the region of 22.7 to 23.5 Mb (ZH13 genome position), whereas a total of 1.2 Mb was further deleted at the same region in SoyW02 compared to ZH13, which resulted in SoyW03 having the longest sequence and SoyW02 having the shortest sequence among the 26 genomes for chromosome 7 (Figure S3B). Similarly, for chromosome 17, SoyL03 had the longest sequence and SoyC11 had the shortest sequence. Consistent with their
chromosome lengths, compared with ZH13, a total of 1.2 Mb insertion was found in SoyL03 from the region of 26.5 to 26.6 Mb (ZH13 genome position), whereas a total of 0.1 Mb was deleted in the same region in SoyC11 (Figure S3C).
Graph-Based Genome and SV Characterization
To construct an integrated graph-based genome for soybean using the 29 individually de novo assembled genomes, we merged the 776,399 SVs from all genomes into a set of 124,222 nonredundant SVs. Similar to the patterns of core and dispensable gene families (Figure 2A), the nonredundant SV set grew as additional samples were added and tended to flatten. In parallel, the set of shared SVs declined, leaving a final 130 SVs shared in all samples (Figure 3B). Based on the allele frequency of these SVs, we classified the SVs into four categories: core (present in all 28 samples), softcore (present in >90% of samples but not all; 26–27), dispensable (present in more than one but <90% of samples; 2–25) or private (present in only one sample). We found that wild soybeans had a higher ratio of private SVs (average of 22.2%) than cultivated soybeans (average of 6.7%) (Figure 3C). However, there is an exception: Wm82 had the highest ratio of private SVs, which may be due to the genome assembly of Wm82.a2.v1 being mainly based on a second generation sequencing technology.
We found that the SVs tended to be enriched in repetitive DNA regions (Figure 3D). This pattern is consistent with our previous investigation of nonreference transposons using short-read resequencing data (Tian et al., 2012). Nevertheless, many more PAVs were identified from this study (25,800 per sample) than from the previous analysis (1,100 per sample). We investigated the sequence composition of each PAV and found that 78.5% of the PAVs came from repetitive DNA (Figure 3E), which supported the theory that variation of repetitive sequences largely contributed to the divergences of different genomes (Kumar and Bennetzen, 1999).
Subsequently, we built an integrated graph-based genome with the PAVs containing less than 90% repetitive DNA using the ZH13 genome as a standard linear base reference genome. Then, we mapped the resequencing short reads of the 2,898 accessions onto the graph-based genome, and identified a total of 55,402 SVs. The precision, recall, and F1 score were 0.94, 0.75, and 0.83, respectively, which are comparable to that of yeast
Figure 3. Genetic Variations from 29 Soybean Genomes and 2,898 Resequenced Accessions
(A) Distribution of genetic variations from 29 genomes and 2,898 resequenced soybean accessions. (a) Gene density. (b–e) SNP density, π, dN, and dS from 29
genomes and 2,898 resequenced soybean accessions. The black block represents a large repetitive region. (f) Distribution of larger structure variations and
repeat sequences across the soybean genome. Red bars indicate numbers of larger structural variations. Blue bars indicate repeat contents. Orange bars
indicate the average similarity of 29 genomes to ZH13.
(B) Variants from each sample are merged using a nonredundant strategy starting with W05 and iteratively adding unique calls from additional accessions.
(C) The number of variants in each discovery class is shown per accession.
(D) Structural variation density from repeat/non-repeat genome regions by continuous 500-kb windows. Significance was tested by Fisher’s exact test; ***
p < 0.001.
(E) Structural variation number plots against repetitive DNA coverage.
(F) Structural variations from 2,898 accessions plot against discovery frequency. The structural variations are identified by mapping short reads against the
integrated graph-based genome.
(G) Numbers of detected and novel SVs from wild soybeans, landraces, and cultivars. Box edges depict interquartile range, whiskers 1.5× the interquartile range,
and centerlines the median.
(H) GWAS of seed luster using the PAVs genotyped based on graph-based genome.
(I) A 10-kb PAV results in the presence-and-absence of SoyZH13_15G114704, an HPS encoding gene.
(J) Comparison of the seed luster variation between the two haplotypes of the 10-kb PAV.
See also Figures S2, S3, and S4 and Tables S3 and S4.
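The accumulation strategy described for panel (B), starting from W05 and iteratively adding unique calls from further accessions while tracking what remains shared by all, can be sketched as below. This is a simplified Python illustration under stated assumptions, not the authors' procedure: it assumes equivalent SV calls have already been collapsed to a common identifier, the function name `accumulate_svs` is hypothetical, and the SV identifiers and call sets are made-up toy data (only the accession names come from the text).

```python
from typing import Dict, List, Set, Tuple


def accumulate_svs(
    calls: Dict[str, Set[str]], order: List[str]
) -> List[Tuple[str, int, int]]:
    """Track the nonredundant and shared SV sets as samples are added in order.

    Returns one tuple per sample: (sample, size of the nonredundant set so far,
    number of SVs shared by every sample added so far).
    """
    nonredundant: Set[str] = set()
    shared: Set[str] = set()
    curve = []
    for i, sample in enumerate(order):
        svs = calls[sample]
        nonredundant |= svs  # add calls not seen in earlier samples
        shared = set(svs) if i == 0 else shared & svs  # keep calls present in all
        curve.append((sample, len(nonredundant), len(shared)))
    return curve


# Toy call sets with hypothetical SV identifiers.
calls = {
    "W05": {"sv1", "sv2", "sv3"},
    "SoyW02": {"sv1", "sv2", "sv4"},
    "SoyC11": {"sv1", "sv4", "sv5"},
}
order = ["W05", "SoyW02", "SoyC11"]

for sample, n_nonredundant, n_shared in accumulate_svs(calls, order):
    print(sample, n_nonredundant, n_shared)
# W05 3 3
# SoyW02 4 2
# SoyC11 5 1
```

In the toy data the nonredundant set grows while the shared set shrinks, which is the qualitative behavior reported for the real SV set. Deciding when two SV calls from different assemblies are the same event (for example by breakpoint tolerance) is the hard part in practice and is outside this sketch.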
(Hickey et al., 2020). Consistent with the results from human beings (Audano et al., 2019), the number of identified SVs decreased along with the increase of discovery frequency (Figure 3F). In addition to the SVs from the 29 genomes, 3,584 novel SVs were identified from the 2,898 accessions, which mainly came from lower discovery frequency catalogs (Figure 3F). The wild soybeans contained more SVs than landraces and cultivars, either for the SVs from the 29 genomes or the novel SVs from the graph-based genome (Figure 3G).
The SVs from 2,898 accessions allowed us to investigate whether they contribute to phenotypic variation of any agronomical traits. For instance, seed luster is an important trait for soybean, and a previous study suggested that the accumulation of hydrophobic protein from soybean (HPS) is associated with the variations of seed luster (Gijzen et al., 2003). However, the responsible genes remain unclear. A genome-wide association study (GWAS) on seed luster using the SVs genotyped from the graph-based genome identified a significant signal on chromosome 15 (Figure 3H), among which a 10-kb PAV resulted in the presence-and-absence of an HPS encoding gene (Figure 3I). We found that the soybeans with and without the 10-kb PAV had a higher ratio of luster and lusterless seed, respectively (Figure 3J), indicating this PAV might be one of the responsible genetic variations controlling seed luster variation in soybean.
Sequence Variations and Paleopolyploid
Previous analyses from the reference genome Wm82 revealed that a recent genome-wide duplication (WGD) in soybean occurred 13 million years ago and resulted in nearly 50% of the genes presented with duplications (Schmutz et al., 2010). We investigated the duplications from the recent WGD in individual genomes and found that, similar to that of Wm82, the WGD accounted for a mean of 54% of the total genome (Table S3). In addition, the WGD from individual genomes also showed a similar pattern to that of Wm82, tending to occur in gene-rich regions and away from repetitive DNA sequences (Figures S4A and S4B). It has been suggested that duplicates have a slower evolution rate than singletons in eukaryotes (Davis and Petrov, 2004; Jordan et al., 2004; Beilstein et al., 2010; Yang and Gaut, 2011; Du et al., 2012; Fang et al., 2016). Consistently, we found that the nucleotide diversity from the 29 genomes in the WGD regions was significantly lower than that in the non-WGD regions (Figure S4C). In addition, the WGD regions contained a higher ratio of core and softcore genes, whereas non-WGD regions contained a higher ratio of dispensable and private genes (Figure S4D). In parallel, more SVs were identified in the non-WGD regions although they accounted for less genome sequence than WGD (46% versus 54%). Moreover, WGD regions held fewer private SVs than non-WGD regions (Figure S4E), implying that genome duplication not only restrained the evolutionary rate of genes but also acted as an important genetic force to shape the evolution of SVs.
Indel-associated substitution is a general mutational mechanism in eukaryotes (Tian et al., 2008). We asked whether larger SVs, such as PAVs, could influence the single-nucleotide mutation rate as well. To this end, we extracted 1-kb sequences from each side of individual PAVs. Then the 1-kb sequences were divided into ten continuous 100-bp windows, and the nucleotide diversity of each window was calculated individually. The calculation demonstrated that nucleotide diversity increased close to the PAVs and declined monotonically with distance, tending to flatten 700 bp away from PAVs (Figure S4F). We further compared the patterns between WGD and non-WGD regions, and found that although the nucleotide diversity from WGD declined faster and the final flattened values were lower than those from non-WGD, the nucleotide diversity surrounding PAVs either from WGD or non-WGD showed comparable values (Figure S4F). The results indicated that whole-genome duplication influenced the declining rate of indel-associated substitution but had little effect on the top substitution value that closely surrounds the PAVs.
Gene Structure Variation and Gene Fusion
Gene structure variations provide a major genetic source of trait diversity (Golicz et al., 2016a). From the pan-genome analyses, we identified a total of 27,175 genes from the 26 de novo assembled accessions that were partially absent from the ZH13 genome. Simultaneously, a total of 48,249 genes were lost from at least one of the 26 de novo assembled accessions. We found that 2.2% of the SNPs from the 29 genomes were located in the annotated protein coding regions of the ZH13 genome. Counting with the protein coding genes from ZH13 as the reference, these SNPs resulted in a total of 5,474 nonredundant premature stop codons in other accessions and 841 premature stop codons in ZH13 (Table S4). Of the indels, 3.2% were located in coding regions, of which 385,950 resulted in frameshifts in one or multiple accessions (Table S4). In addition, 8.1% of the presence-absence genes came from larger SVs. For example, a 16-kb deletion was found in SoyC13 that matched to the ZH13 genome at the 52.23–52.25 Mb region on chromosome 18. Accompanied by this deletion, SoyZH13_18G184700 was lost from SoyC13. Similarly, a 23-kb insertion was found in SoyC11 that matched to the ZH13 genome at 16,525,487 on chromosome 06. As a result, this insertion introduced three additional gene models, SoyC10_06G170400, SoyC10_06G170500, and SoyC10_06G170600, relative to ZH13 (Figures S3D and S3E).
Gene fusion by read-through plays important roles in gene evolution (Jones and Begun, 2005). The high-quality pan-genome data allow us not only to detect previously reported alleles but also to find new gene structure variations, including gene fusion. For example, E3 was reported to be an important gene responsible for major flowering loci in soybean (Watanabe et al., 2009), which encoded a homolog of PHYA in Arabidopsis (Figure S5A). Several alleles have been detected in natural populations (Tsubokura et al., 2014). As expected, compared with E3 (SoyZH13_19G210400) from ZH13 (termed Hap E3-Mi-1), a 2.6-kb insertion in the third intron (termed Hap E3-Ha-1) and a 13.3-kb deletion starting from the third intron (termed Hap E3-tr) were identified from our de novo genomes (Figures 4A and 4B). Furthermore, haplotypes with a substitution of G to A (3182) at the third exon and a T/-(141) indel at the first exon were identified from the accessions with the 2.6-kb insertion. Each of these two polymorphisms resulted in a frameshift, and the corresponding haplotypes were termed Hap E3-Ha-2 and Hap E3-Ha-3, respectively. We also identified two indels (T/-(611) and G/-(768)) under the genetic background of Hap
E3-Mi-1, which both resulted in frameshifts, and the corresponding haplotypes were termed E3-Mi-2 and E3-Mi-3, respectively (Figure 4B).
Moreover, we found that the 13.3-kb deletion was accompanied by the loss of a complete gene, SoyZH13_19G210500 (Figure 4B). Surprisingly, our RNA-seq data showed that in addition to resulting in the loss of the last exon of E3 and an entire gene of SoyZH13_19G210600, the 13.3-kb deletion in Hap E3-tr caused transcriptional read-through of E3 and SoyZH13_19G210600 (termed Hap E3-tr-1). In addition, a new transcript form of E3 with an additional exon appeared (termed Hap E3-tr-2) (Figure 4C). To confirm that the read-through and the additional transcript truly existed, we designed primers from E3 and SoyZH13_19G210600 to amplify the transcript sequences using cDNA (Table S5). A clear band was amplified from the accessions without the deletion when the primers from the third and last exons were used; however, no product was obtained from E3-tr accessions, confirming the loss of the last exon. In contrast, clear bands were amplified from the E3-tr accessions when the primers from E3 and SoyZH13_19G210600 were used, but not in the accessions without the deletion (Figure 4D). PCR sequencing with the Sanger-based platform showed that the fragment indeed perfectly matched the fusion of E3 and SoyZH13_19G210600 transcripts (Figure S5B), demonstrating the existence of gene fusion of E3 and SoyZH13_19G210600. Similarly, the new transcript form of E3 was also confirmed using primers from the third exon and the additional exon (Figure 4D). Transcriptional data demonstrated that E3 and its neighboring genes, including SoyZH13_19G210500, were all expressed in ZH13 (Figure 4E), indicating they are functional. Inspired by this result, we investigated gene fusion events among the assembled genomes at a genome-wide level through genome sequence comparison and expression detection. In total, we identified 15 gene fusion events that occurred in different accessions (Table S6). Future functional studies of these fusion genes will help to illuminate how new genes are formed and evolved and in turn to determine biological diversity (Long et al., 2013).
Contribution of Structural Variations in Soybean Domestication
Previous studies identified a number of important loci that may be responsible for soybean domestication and adaptation, but most of the causative genetic variation in these regions has not been well determined (Lam et al., 2010; Li et al., 2014; Zhou et al., 2015; Lu et al., 2017; Torkamaneh et al., 2018; Wang et al., 2018a). The large number of SVs from dozens of independently de novo assembled genomes enabled us to clarify evolutionary processes that cannot be detected from one or a
few genomes. For instance, the alteration of seed coat pigmentation is one of the obvious selection traits during soybean domestication. Almost all wild soybeans exhibited black seed coat color, and most cultivars exhibited yellow seed coat color. The classically defined I locus is an important domestication locus responsible for the changes in seed coat color from black to colorless (Woodworth, 1921; Zhou et al., 2015), which was reported to be associated with reduced chalcone synthase (CHS) gene expression in yellow soybean seed coats via homology-dependent gene silencing (Tuteja et al., 2004, 2009; Wang et al., 1994). A recent study demonstrated that gene silencing might be caused by an SV of inversion and gene duplication of the CHS cluster that came from double crossover events (Xie et al., 2019). Of the 29 accessions, four wild soybeans and landrace SoyL02 exhibited black seed coat color, and the other cultivated soybeans exhibited yellow seed coat color (Figure 5A). Phylogenetic analyses using polymorphism around the I locus showed that the 29 accessions could be classified into five major haplotypes (H1–H5), with accessions with black seed coat as a distinct haplotype of H1 (Figure 5A). Structural variation analyses showed that, compared with H1, the accessions from H3, H4, and H5 contained the inversion and gene duplication (Figure 5B), which is consistent with a previous report (Xie et al., 2019). However, we found that this SV did not exist in the accessions from H2, although they exhibited a yellow seed coat color. Nevertheless, we found that in haplotype H2, a 23.4-kb sequence from H1 was duplicated and inverted into the CHS cluster, which might have resulted from a double crossover event that resulted in the pseudogenicity of two CHS genes (Figures 5B and 5C). Moreover, we found several additional SV events accompanying the inversion and duplication, including an additional inversion between the two CHS clusters that occurred in the H3 haplotype, a deletion based on H3 that might come from an unequal combi-
that the inversion occurred during domestication from wild soybean to cultivated soybean.
Structural Variations Affect Gene Expression and Are Associated with Agronomic Traits
Gene expression could be affected by gene structural variations and in turn lead to agronomic trait changes. For example, an insertion of Ty1/copia-like retrotransposon disrupted E4 function by decreasing its expression level, and the soybean with this insertion exhibited insensitivity under long day conditions (Liu et al., 2008). We found that 17,696 SVs were located within/around the high confidence core genes (gene body with a ±3 kb flanking region). To investigate whether the SVs affected gene expression, we compared the expression level of these genes among the 26 accessions for de novo assembly using their RNA-seq data from 9 tissues using principal component analyses (PCA) and then performed correlation analyses between the SV and expressional PCA data. We found that 1,021 gene structural variations were associated with their expression level changes among different accessions (Table S7).
The analyses enabled us to identify some candidate genes that may be responsible for important traits. For example, iron deficiency chlorosis is a common problem for soybean production in calcareous soil. Genetic investigation demonstrated that several significant quantitative trait loci (QTLs) were responsible for iron deficiency chlorosis, one of which was on chromosome 14 (Lin et al., 1997) (Figure 6A). SoyZH13_14G179600, a gene encoding a Fe2+/Zn2+ regulated transporter, was found in this QTL region. We found that a 1.4-kb indel existed in the promoter region of SoyZH13_14G179600 in different genomes. The 1.4-kb sequence was composed of two equal length terminal inverted repeats of 700 bp and was flanked by a 9-bp target site duplication (Figure S7), which met the criterion of Mutator (Wicker
nation of CHS genes and result in H4, and an insertion based on et al., 2007). The PAV of this Mutator was highly linked to the
H3 that might come from tandem duplication and result in H5 other five polymorphisms in the exon regions and classified the
(Figure 5C). Genetic distance estimated that the SVs from H2 26 accessions into two distinct groups: Hap-1 without the dele-
and H3 might have originated 4,500 and 4,200 years ago, tion and Hap-2 with this deletion (Figure 6B). Based on our RNA-
whereas H4 and H5 might have originated 600 years ago seq data, the accessions of Hap-2 exhibited higher expression
(Figure 5A). levels than those of Hap-1 (Figure 6C). Fe availability is greatly
Some PAVs that were associated with soybean domestication determined by the pH value. In soils with higher pH, Fe usually
were also found. For example, a 360-kb inversion was found in exists in a predominate form of insoluble ferric oxides. However,
chromosome 7 from 43.30–43.66 Mb of the ZH13 genome (Fig- at lower pH, ferric Fe is freed from the oxide and becomes more
ure S6A). Interestingly, this inversion showed differences only be- available for uptake by roots (Morrissey and Guerinot, 2009). In
tween wild soybeans and cultivated soybeans and might occur China, lower latitude regions (Southern and Eastern) are mainly
4,700 years ago (Figure S6B). Selective analysis demonstrated comprised of ferralsol and have lower pH values, whereas the
that this inversion was located in a selective sweep from soybean higher latitude regions (Northeastern, Northwestern, and Huan-
domestication (Figure S6C). Further investigation illuminated that ghuai regions) have relatively higher pH values (Dai et al., 2009;
the inversion resulted in a gene (SoyZH13_07G220000/SoyW02_ Shi et al., 2006). Interestingly, the Hap-2 accessions with higher
07G232200) exhibiting SV between wild and cultivated soybeans expression were preferentially located in higher latitude regions,
(Figure S6D). This region was orthologs to chromosome 17, which and Hap-1 accessions with lower expression were preferentially
came from the recent WGD. Synteny analyses demonstrated that located in lower latitude regions (Figures 6D–6F), indicating that
these two regions showed continuous synteny in wild soybean, but genetic divergence of SoyZH13_14G179600 contributed to soy-
showed divergence of with/without this inversion in cultivated bean adaptation in iron uptake.
soybeans. In contrast to the SV of SoyZH13_07G220000 and
SoyW02_07G232200 in wild and cultivated soybeans, its DISCUSSION
orthologous on chromosome 17 (SoyZH13_17G036200 and
SoyW02_17G034500) exhibited structure conservation between Soybean provides more than half of the global production of
wild and cultivated soybeans (Figure S6D). The results suggested oilseed and more than a quarter of the protein for human food
and animal feed (Graham and Vance, 2003; Wilson, 2008). To ensure an adequate food supply for the expanding worldwide human
population, soybean production must be doubled by 2050 (Foley et al., 2011; Tilman et al., 2011; Ray et al., 2013). In turn,
more effective plant breeding of soybean varieties is urgently needed. A comprehensive evaluation and utilization of
genetically diverse germplasm is essential for crop improvement (Golicz et al., 2016a; Varshney et al., 2020). A large number
of soybean accessions were genotyped using short-read-based high-throughput sequencing technologies (Lam et al., 2010; Li et
al., 2010b, 2013; Zhou et al., 2015; Han et al., 2016; Maldonado dos Santos et al., 2016; Fang et al., 2017; Torkamaneh et al.,
2018). However, because only a single released soybean reference genome, Wm82 (Schmutz et al., 2010), was available, only SNPs
and small indels were identified, leaving the larger SVs almost overlooked. This limited the capture of the full landscape of
genetic variations and the pinpointing of causal variations in QTL cloning and genome-wide association studies.
The pan-genome dataset of 27 wild and cultivated soybean accessions from this study will provide a promising platform for
future in-depth functional genomics studies in soybean. First, the high-quality genomes enabled the identification of numerous
complex variations that cannot be detected by simply mapping short reads to a single genome. The larger amount of genetic
variation has been deposited in the publicly accessible Genome Sequence Archive (GSA) database in the BIG Data Center and in
Figshare databases. Second, the graph-based genome offers a new platform to map short read data to determine the genetic
variations at the pan-genome level instead of against a single genome and to prevent erroneous variation calls around SVs
(Rakocevic et al., 2019). Therefore, re-analysis of the large amount of previously resequenced data based on the created
graph-based genome will generate more comprehensive information than ever, giving those reported data new value. In addition,
coupled with the RNA-seq and small RNA sequencing (smRNA-seq) data from individual accessions, the platform will make it
possible to link SVs with gene expression and, in turn, will greatly promote gene discovery.
Several genetic diversity bottlenecks occurred during soybean domestication and improvement, resulting in narrowed genetic
diversity among modern cultivars (Hyten et al., 2006; Zhou et al., 2015), which greatly restricts the subsequent creation of
STAR+METHODS
Continued
REAGENT or RESOURCE | SOURCE | IDENTIFIER
Augustus v3.0.3 | Stanke and Morgenstern, 2005 | https://sourceforge.net/projects/augustus/
FGENESH | Softberry | N/A
PASA | Haas et al., 2003 | https://sourceforge.net/projects/pasa/
MUMmer4 | Marçais et al., 2018 | https://mummer4.github.io/
SVMU | Chakraborty et al., 2019 | https://github.com/mahulchak/svmu
Hisat2 v2.1.2 | Kim et al., 2019 | https://ccb.jhu.edu/software/hisat2/index.shtml
StringTie 1.3.4 | Pertea et al., 2015 | https://ccb.jhu.edu/software/stringtie/
OrthoMCL v2.0.9 | Li et al., 2003 | https://orthomcl.org/orthomcl/
KOBAS 3.0 | Xie et al., 2011 | http://kobas.cbi.pku.edu.cn/
InterProScan 5 | Jones et al., 2014 | https://www.ebi.ac.uk/interpro/
Pannzer2 | Törönen et al., 2018 | http://ekhidna2.biocenter.helsinki.fi/sanspanz/
R 3.5.0 | R Development Core Team, 2013 | https://www.r-project.org/
R package ClusterProfiler 3.10.1 | Yu et al., 2012 | http://bioconductor.org/packages/release/bioc/html/clusterProfiler.html
cd-hit v4.6 | Li and Godzik, 2006 | http://weizhongli-lab.org/cd-hit/
BLASTn | Camacho et al., 2009 | ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
BLASTp | Camacho et al., 2009 | ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
MUSCLE v3.8.31 | Edgar, 2004 | http://drive5.com/muscle/
MEGA6 | Tamura et al., 2013 | https://www.megasoftware.net/
IGV v2.5.0 | Thorvaldsdóttir et al., 2013 | https://software.broadinstitute.org/software/igv/
BUSCO 3.0.2 | Simão et al., 2015 | https://busco.ezlab.org/
vg v1.6.0 | Garrison et al., 2018 | https://github.com/vgteam/vg
EMMAX | Kang et al., 2010 | http://csg.sph.umich.edu/kang/emmax/download/index.html
RESOURCE AVAILABILITY
Lead Contact
Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Zhixi Tian
(zxtian@genetics.ac.cn).
Materials Availability
This study did not generate new unique reagents.
All soybean accessions were planted at the Experimental Station of the Institute of Genetics and Developmental Biology, Chinese
Academy of Sciences, Beijing (40°22′N and 116°23′E) during the 2018 growing season.
For de novo assembly, samples were collected from a single plant for each of the 26 accessions. For SMRT sequencing, DNA was
isolated using the Blood & Cell Culture DNA Midi Kit (QIAGEN Inc., Valencia, CA, USA). A 20 kb library was constructed and
sequenced using the Sequel™ Sequencing Plate 1.2 on the Pacific Biosciences Sequel platform. For HiSeq, DNA was isolated using
the Plant Genomic DNA kit (Tiangen, Beijing, China). A small-insert library (450 bp) was prepared and sequenced on an Illumina
HiSeq2500 platform for 150 bp paired-end reads. For optical mapping, young leaves were used for high-molecular-weight DNA
isolation, after which the DNA was labeled using the direct labeling enzyme DLE-1. These labeled DNA samples were imaged using the
BioNano Irys system, and only molecules longer than 150 kb were used for further analysis. For Hi-C, leaves were fixed in 1% (vol/vol)
formaldehyde for library construction. Cell lysis, chromatin digestion, proximity-ligation treatments, DNA recovery and subsequent
DNA manipulations were performed according to a previously described method (Lieberman-Aiden et al., 2009). MboI or DpnII was
used as the restriction enzyme in chromatin digestion. The Hi-C library was sequenced on the Illumina HiSeq X Ten platform for
150 bp paired-end reads.
For whole genome re-sequencing of 2,027 soybean accessions, total DNA was extracted by the cetyltrimethylammonium bromide
(CTAB) method (Murray and Thompson, 1980). A paired-end sequencing library of each accession was constructed with an insert
size of approximately 350 bp and sequenced by Hiseq2500 and Hiseq X Ten.
For RNA-Seq and miRNA-Seq, nine samples of leaf, flower and seed were collected separately at early, middle and late
developmental stages for each accession (A, root from growth stage V1; B, stem from growth stage V1; C, young leaf from growth
stage V1; D, mature leaf from growth stage R1; E, old leaf from growth stage R4; F, flower from growth stage R1; G, pod and seed
before 4 weeks; H, seed at 6 weeks; I, seed at 8 weeks). Total RNA was isolated using the RNAprep Pure Plant Kit (TIANGEN).
RNA-seq library construction followed a previously described method (Shen et al., 2014), and the libraries were sequenced on the
NovaSeq platform. Small RNA libraries were constructed as described previously (Zhou et al., 2013) and sequenced on a NextSeq 500
platform. All sequencing was carried out at BerryGenomics Company (http://www.berrygenomics.com/; Beijing, China).
METHOD DETAILS
Genome Assembly
De novo assembly followed a previously reported pipeline (Du and Liang, 2019; Shen et al., 2019). In brief, Canu (Koren et al.,
2017) v1.7.1 was used to assemble PacBio subreads into PacBio contigs, after which HiSeq reads were used for error correction.
Bionano optical maps were assembled into consensus physical maps by BioNano Solve v3.0.1 (https://bionanogenomics.com/,
Solve_06082017Rel). Then HERA (Du and Liang, 2019) was used to combine the PacBio contigs and Bionano-based physical maps
into PacBio-BioNano hybrid scaffolds. To anchor the hybrid scaffolds onto chromosomes, the Hi-C sequencing data were aligned to
the scaffolds by Juicer (Durand et al., 2016) v1.5 and 3D-DNA (Dudchenko et al., 2017).
visualize RNA-seq paired-end mapping regions. To assess gene annotation completeness, BUSCO (Simão et al., 2015) (v3.0.2,
lineage dataset embryophyta_odb9) was run for each accession.
Synteny Analysis
Synteny analysis of the 28 genomes with Gmax_ZH13 was performed via whole-genome alignment using MUMmer4 (Marçais et al.,
2018). The published genomes of the cultivated accession Wm82 (275_Wm82.a2.v1) and the wild accession W05 (ASM419377v1) were also
used in this study. The genomes were aligned using NUCmer (-c 1000), and the alignment blocks were then filtered using
delta-filter in one-to-one alignment mode (-1). Blocks longer than 1,000 bp were used for further structural variation
detection.
For whole genome duplication (WGD) investigation, we first masked all repetitive DNA except predicted genes, and compared
translations of the remaining genic nucleotide sequences using MUMmer4 (Marçais et al., 2018). Only the top reciprocal best matches
between any two chromosomes in a comparison were considered duplication regions from the recent WGD. Then we evaluated
the colinear chains. To increase sensitivity, chaining was created on the basis of gene order, excluding positions of non-orthologous
genes, rather than with gene coordinates. Chains were required to have at least five colinear genes with no more than ten intervening
genes between neighbors. The resulting gene-pairs were classified as WGD syntenic. To investigate the relationship between struc-
tural variations and WGD, the WGD information from the Williams 82 genome (Schmutz et al., 2010) was used as a reference. To
compare the nucleotide diversity between WGD and non-WGD, the WGD and non-WGD blocks were divided into 100-bp continuous
windows, and the nucleotide diversity was calculated in each window.
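The chaining rule described above (chains built on gene order, at least five colinear genes, no more than ten intervening genes between neighbors) can be illustrated with a minimal sketch. This is not the authors' implementation: it is a simple greedy chainer that handles only the forward orientation, and the function name, the `(rank_a, rank_b)` input format, and the example data are all hypothetical. Dedicated synteny tools typically use dynamic programming instead.

```python
from typing import List, Tuple

# Each pair is (gene rank on chromosome A, gene rank on chromosome B),
# i.e. positions in gene order rather than base-pair coordinates.
Pair = Tuple[int, int]


def colinear_chains(pairs: List[Pair],
                    min_genes: int = 5,
                    max_gap: int = 10) -> List[List[Pair]]:
    """Greedily chain matched gene pairs in forward orientation.

    A pair extends a chain when it lies downstream of the chain's last
    pair on both chromosomes with at most `max_gap` intervening genes.
    Chains with fewer than `min_genes` pairs are discarded.
    """
    chains: List[List[Pair]] = []
    for a, b in sorted(pairs):
        best = None
        for chain in chains:
            last_a, last_b = chain[-1]
            gap_a = a - last_a - 1  # intervening genes on chromosome A
            gap_b = b - last_b - 1  # intervening genes on chromosome B
            if 0 <= gap_a <= max_gap and 0 <= gap_b <= max_gap:
                if best is None or len(chain) > len(best):
                    best = chain
        if best is not None:
            best.append((a, b))
        else:
            chains.append([(a, b)])
    return [c for c in chains if len(c) >= min_genes]


# Hypothetical example: five nearby pairs form one chain; two distant pairs do not.
example = [(0, 0), (1, 1), (3, 2), (4, 4), (6, 5), (30, 40), (31, 41)]
chains = colinear_chains(example)
```

Working on gene ranks rather than coordinates means that large insertions between genes do not break a chain, which is the sensitivity gain the text refers to.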
To assess the effect of PAVs on SNP evolution, 1 kb of sequence was extracted from each flank of every PAV and divided into
ten continuous 100-bp windows, and the nucleotide diversity was calculated for each window. The nucleotide diversity reported
for each window position was the mean across all corresponding windows.
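As an illustration of the windowed calculation, the sketch below computes nucleotide diversity (the mean number of pairwise differences per site) in continuous 100-bp windows. It assumes equal-length, gap-free aligned sequences and ignores missing data; the paper does not state which software or input format was used, so the function names and inputs here are hypothetical.

```python
from itertools import combinations
from typing import List


def nucleotide_diversity(seqs: List[str]) -> float:
    """Mean pairwise differences per site across aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    diffs = 0
    n_pairs = 0
    for s1, s2 in combinations(seqs, 2):
        diffs += sum(1 for x, y in zip(s1, s2) if x != y)
        n_pairs += 1
    return diffs / (n_pairs * length)


def windowed_diversity(seqs: List[str], window: int = 100) -> List[float]:
    """Nucleotide diversity in continuous, non-overlapping windows."""
    length = len(seqs[0])
    return [
        nucleotide_diversity([s[start:start + window] for s in seqs])
        for start in range(0, length - window + 1, window)
    ]


# Hypothetical 200-bp alignment of three sequences with one variant site
# at the first position of the second window.
aln = ["A" * 200, "A" * 100 + "T" + "A" * 99, "A" * 200]
pi_windows = windowed_diversity(aln)
```

For a 1-kb flank this yields ten values per PAV side, which can then be averaged position by position across all PAVs.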
2012;). Gene expression was normalized using the number of reads per kilobase of exon sequence in a gene per million mapped
reads (RPKM) (Mortazavi et al., 2008).
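The RPKM normalization named above follows directly from its definition: reads mapped to a gene, divided by exon length in kilobases and by total mapped reads in millions. The sketch below is a minimal illustration; the function name and the example numbers are hypothetical and are not taken from the study.

```python
def rpkm(read_count: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of exon sequence per million mapped reads."""
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    # 1e9 = 1e3 (bp -> kb) * 1e6 (reads -> millions of reads)
    return read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads, 2,000 bp of exon, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
```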
The miRNA expression analysis was performed using a previously reported method (Zhou et al., 2013). In brief, the small RNA se-
quences that mapped to mature miRNA-5p or miRNA-3p were defined as the expression of miRNA-5p and miRNA-3p, respectively.
The final expression of miRNA-5p or miRNA-3p was calculated as TPM.
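The text does not define TPM for the miRNA data. A minimal sketch follows, assuming the definition commonly used for small RNA sequencing (reads assigned to a mature miRNA per million mapped small RNA reads, with no length normalization). The function name, miRNA names, and counts are hypothetical.

```python
from typing import Dict


def reads_per_million(counts: Dict[str, int], total_mapped_reads: int) -> Dict[str, float]:
    """Scale raw mature-miRNA read counts to counts per million mapped reads."""
    if total_mapped_reads <= 0:
        raise ValueError("total mapped reads must be positive")
    return {name: count * 1e6 / total_mapped_reads for name, count in counts.items()}


# Hypothetical counts for the 5p and 3p arms of one miRNA in a
# library with 1,000 mapped small RNA reads.
mirna_tpm = reads_per_million({"miR-x-5p": 30, "miR-x-3p": 10}, 1000)
```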
All details of the statistics applied are provided in the figures and corresponding legends. Statistical analyses were
performed in R 3.5.0. The genome-wide association study was performed using the EMMAX software package (Kang et al., 2010). The
significance threshold was 1 × 10⁻⁶. The phenotypic variation in seed luster was classified into three categories: luster,
intermediate, and lusterless. The phenotypes came from 754 accessions.
Supplemental Figures
Figure S2. SNP and Structural Variation of the 29 De Novo Assembled Genomes, Related to Figure 3
(A) Correlation of SNP density, π, dN, and dS from 29 de novo assembled genomes and 2,898 resequenced accessions.
(B) Characterization of larger structural variations in the 29 soybean genomes. The heatmap shows the presence frequency of structural variation in 29 genomes.
Figure S3. Presence-Absence Variation Leads to Chromosome Size Variation and Gene Gain and Loss of the 27 De Novo Assembled Genomes, Related to Figure 3
(A) Correlation between presence-absence variation (PAV) and chromosome size variation in the 27 de novo assembled genomes.
(B) Comparison of the PAVs in chromosome 07. The left panel compares PAVs from accessions with the longest and shortest chromosome sizes. The right panel
shows the variation in chromosome size in all 27 de novo assembled accessions. +/−, presence/absence genome size (bp) compared to ZH13.
(C) Comparison of the PAVs in chromosome 17. The left panel compares PAVs from accessions with the longest and shortest chromosome sizes. The right panel
shows the variation in chromosome size in all 27 de novo assembled accessions. +/−, presence/absence genome size (bp) compared to ZH13.
(D) A 16 kb deletion from SoyC14 results in the loss of SoyZH13_18G184700.
(E) A 23 kb insertion from SoyC10 results in the gain of SoyC10_06G170400, SoyC10_06G170500, and SoyC10_06G170600.
Figure S4. Comparison of Gene and Structural Variation Characteristics between Whole Genome Duplication (WGD) and Non-WGD Regions,
Related to Figure 3
(A) Comparison of gene density between WGD and non-WGD regions.
(B) Comparison of repetitive DNA proportions between WGD and non-WGD regions.
(C) Comparison of nucleotide diversity between WGD and non-WGD regions. The data are given as the mean ± 95% CI. Significance was tested by unpaired two-
sample Wilcoxon test; *** p < 0.001.
(D) Gene components in WGD and non-WGD regions.
(E) Structure variation components in WGD and non-WGD regions.
(F) Comparison of the PAV driven single-nucleotide mutation rate between WGD and non-WGD regions. The data are the mean ± 95% CI.
Figure S5. Validation of Gene Fusion between E3 (SoyZH13_19G210400) and Its Neighboring Gene SoyZH13_19G210600 in the Accessions of
Haplotype E3-tr, Related to Figure 4
(A) Neighbor-joining phylogenetic analyses of E3 in different species.
(B) Validation of the gene fusion through cDNA amplification from different accessions. L05, L09, C13, C14 are representative accessions of haplotype E3-tr with
the 13.3 kb deletion, and W02, C12 and ZH13 are representative accessions without the deletion. Sequencing of the amplification from L05, L09, C13, C14 cDNA
proves the gene fusion.
(C) Neighbor-joining phylogenetic analyses of SoyZH13_19G210600 in different species.
Figure S6. An Inversion on Chromosome 07 between Wild Soybean and Cultivated Soybean Genomes Is Associated with Soybean
Domestication, Related to Figure 5
(A) Diagram of the 360 kb inversion between wild (SoyW02 as sample) and cultivated soybean (ZH13 as sample) genomes.
(B) The inversion is associated with the divergence between wild and cultivated soybeans, which might have occurred approximately 4,700 years ago. ya,
years ago.
(C) Fst between wild soybean and cultivated soybeans on chromosome 07 (left panel). Distribution of the Fst value of the 360 kb INV region (red arrow) at the whole
genome level (right panel).
(D) Synteny and gene structure comparison of this inversion on chromosome 7 and its whole genome duplicated region on chromosome 17.
Figure S7. Structure and Sequences of the Mutator from SoyZH13_14G179600, Related to Figure 6
The Mutator is composed of two terminal inverted repeats (TIRs). Each TIR is approximately 700 bp in length, and the two are reverse complements of each other. This
Mutator has the target site duplication TGATAAATG.