Professional Documents
Culture Documents
Optimization Data For Whole-Genome Sequencing Protocol
Optimization Data For Whole-Genome Sequencing Protocol
1 ARTICLE INFORMATION
2 Article title
iew
3 Optimization data for an ARTIC-/Illumina-based whole-genome sequencing protocol and pipeline for
4 SARS-CoV-2 analysis
5 Authors
6 Christian Bundschuh1*, Niklas Weidner1,2, Julian Klein1, Tobias Rausch3, Nayara Azevedo3,
7 Anja Telzerow1,3, Katharina Laurence Jost1, Paul Schnitzler1, Hans-Georg Kräusslich1,4 and Vladimir
v
8 Benes3*
re
9 Affiliations
12
13
2Heidelberg
er
University, Medical Faculty Heidelberg, Department of Infectious Diseases, Microbiology
and Hygiene, Heidelberg, Germany
14 3Genomics Core Facility, European Molecular Biology Laboratory (EMBL), Heidelberg, Germany
pe
15 4Deutsches Zentrum für Infektionsforschung, partner site Heidelberg, Germany
18 Keywords
ot
20 Abstract
tn
21 In January 2021, Germany commenced surveillance of SARS-CoV-2 variants under the Corona
22 Surveillance Act, which ceased in July 2023. The objective was to bolster pandemic control, as
23 specific alterations in amino acids, particularly within the spike protein, were linked to heightened
24 transmission and decreased vaccine effectiveness.
rin
25 Consequently, our team conducted whole genome sequencing using the commercially accessible
26 ARTIC protocol on Illumina's NextSeq500 platform and MiSeq for SARS-CoV-2 positive samples
27 obtained from patients at Heidelberg University Hospital, affiliated hospitals, and the public health
28 office in the Rhine-Neckar/Heidelberg region. Throughout the pandemic, we refined the existing
ep
29 ARTIC V4 protocol as well as our bioinformatics pipeline, the details of which are outlined in this
30 report. This report reflects the protocol for the MiSeq analysis, the protocol for the NextSeq500 can
31 be found in our previous publication (Bundschuh et al., 2024).
Pr
32
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
33 SPECIFICATIONS TABLE
Subject Genomics; Virology
iew
Specific subject Establishment of an optimized ARTIC- and Illumina-based Whole-genome
area sequencing protocol and pipeline for SARS-CoV-2
Protocol, Figure
v
Processed data/optimized protocol
re
Data collection Inclusion criteria was a PCR cycle threshold of <32.
Library preparation adhered mainly to the original ARTIC protocol but with the
pe
described optimizations.
Sequencing data was generated on an Illumina NextSeq500 and MiSeq and the
data evaluation was performed with the described bioinformatic tools, such as
trim galore, kraken2, Burrows-Wheeler Alignment tool, SAMtools, Alfred,
FreeBayes, bcftools, Ensembl Variant Effect Predictor, Pangolin and Nextclade.
ot
location
Heidelberg University and European Molecular Biology Laboratory Heidelberg
(EMBL)
Related research Bundschuh, C., Weidner, N., Klein, J., Rausch, T., Azevedo, N., Telzerow, A., Mallm,
article J. P., Kim, H., Steiger, S., Seufert, I., Börner, K., Bauer, K., Hübschmann, D.,
Jost, K. L., Parthé, S., Schnitzler, P., Boutros, M., Rippe, K., Müller, B.,
Pr
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
34 VALUE OF THE DATA
35 Allows the easy establishment of a free, public and open-source SARS-CoV-2 whole genome
36 sequencing bioinformatics pipeline
iew
37 Full protocol on the library preparation allows for an easy establishment in any laboratory
38 with the necessary tools
39 Protocol optimizations reduce hands-on time and costs per analysis
40 BACKGROUND
41 Towards the end of 2019, a novel coronavirus (2019-nCoV), later named severe acute respiratory
v
42 syndrome coronavirus 2 (SARS-CoV-2) by the World Health Organization (WHO), emerged as the
43 causative agent for a surge in pneumonia cases known as coronavirus disease (COVID-19) in Wuhan,
re
44 China (World Health Organization (WHO), 2020). The rapid spread of the virus led to the escalation
45 of the epidemic in China into a global pandemic, with over 676 million confirmed SARS-CoV-2 cases
46 by March 2023 (data not updated beyond that point) (World Health Organization (WHO), 2023,
47 Center for Systems Science and Engineering (Johns Hopkins University), 2023).
er
48 The Spike (S) protein, comprising 1273 amino acids, was early identified as crucial for SARS-CoV-2
49 host cell entry (Xia et al., 2020). Consequently, various mutations in the S protein have been
50 associated with increased viral transmissibility, disease severity, reinfection despite natural
pe
51 immunity, and vaccine efficacy. As a result, monitoring circulating variants and their respective
52 mutations has become a vital epidemiological strategy for pandemic control (Abdool Karim and de
53 Oliveira, 2021), making viral genome sequencing essential for epidemiological surveillance.
54 This report outlines our group's adapted and refined whole genome sequencing protocol, ARTIC V4,
55 initially developed by New England Biolabs (Tyson et al., 2020) as well as the optimized
ot
56 bioinoformatics pipeline.
57 DATA DESCRIPTION
tn
58 First part is the wet-lab part of the protocol and describes the protocols for RNA purification, library
59 preparation (ARTIC protocol) with the respective substeps, such as cDNA synthesis, cDNA
60 amplification, fragmentation/end prep, adaptor ligation, PCR Enrichment of Adaptor-ligated
61 DNA/Dual Index PCR, library pooling, pool cleanup and library quality control.
rin
62 Second is the setting of the sequencer and the respective sequencing quality criteria.
63 Afterwards follows the data analysis, with sequence adapter trimming, host-read contamination
64 removal, alignment to SARS-CoV-2 reference genome, variant calling, consensus computation,
65 lineage classification and summary report generation.
ep
67 example: This folder contains an exemplary data analysis with a SARS-CoV-2 COG-UK sample
Pr
68 kraken2: contains the files for the respective application as well as a readme
69 ref: includes all files for pangolin and a readme on how to update and use pangolin
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
70 scripts: contains the scripts required for the process of adapter-trimming, host-read contamination
71 removal, sequence alignment, variant calling, consensus computation and lineage classification
iew
73 README.md contains an explanation of the repository and the subfolders/scripts etc.
v
76 EXPERIMENTAL DESIGN, MATERIALS AND METHODS
re
77 RNA purification
78 For the analysis of SARS-CoV-2 sequencing, RNA was extracted from specimens collected from the
79 upper respiratory tract, including nasopharyngeal and oropharyngeal swabs, as well as pharyngeal
er
80 washes. The extraction process involved automated magnetic bead-based nucleic acid extraction
81 protocols.
82 Our group employed either the QIASymphony DSP Virus/Pathogen mini-Kit from Qiagen or the
pe
83 Chemagic Viral DNA/RNA 300 Kit H96 from PerkinElmer for magnetic bead RNA extraction, following
84 the protocols provided by the manufacturers.
85 ARTIC Protocol
86 The NEBNext ARTIC SARS-CoV-2 FS Library Prep Kit for Illumina (E7650) contains the enzymes,
ot
87 buffers and oligonucleotides required to convert a broad range of total RNA into high quality,
88 targeted, libraries for next-generation sequencing on the Illumina platform. Primers targeting the
89 human EDF1 (NEBNext ARTIC Human Primer Mix 1) and NEDD8 (NEBNext ARTIC Human Primer Mix
90 2) genes are supplied as optional internal controls. The fast, user-friendly workflow also has minimal
tn
91 hands-on time. Each kit component must pass rigorous quality control standards, and for each new
92 lot the entire set of reagents is functionally validated together by construction and sequencing of
93 indexed libraries on an Illumina sequencing platform. For larger volume requirements, customized
94 and bulk packaging is available by purchasing through the OEM/Bulks department at NEB.
rin
ep
Pr
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
v iew
re
er
pe
ot
tn
95
96 cDNA Synthesis
97 The presence of carry-over products can interfere with sequencing accuracy, particularly for low
rin
98 copy targets. Therefore, it is important to carry out the appropriate no template control (NTC)
99 reactions to demonstrate that positive reactions are meaningful.
100 Gently mix and spin down the LunaScript RT SuperMix reagent. Prepare the cDNA synthesis reaction
ep
RNA sample 4 -
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
MM/strip tube: 25 µL
iew
104 controls.
105 The wells to add the controls should be constant:
v
re
er
pe
ot
106
107 The controls should be switched each plate!
108 Seal the plate, vortex and spin prior to placing it in the thermocycler.
tn
111
114 Note: 2.5 μl cDNA input is recommended. If using less than 2.5 μl of cDNA, add nuclease-free water
115 to a final volume of 2.5 μl. We recommend setting up the cDNA synthesis and cDNA amplification
116 reactions in different rooms to minimize cross-contamination of future reactions.
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
117 Gently mix and spin down reagents. Prepare the split pool cDNA amplification reactions as described
118 below:
iew
1 rxn 96-well plate
cDNA 2.5 -
v
ARTIC SARS-CoV-2 Primer Mix 1 0.7 70
re
120
122 Seal the plate, vortex and spin prior to placing it in the thermocycler.
Hold 4°C ∞ 1
124 *Set heated lid to 105°C
125 Combine the Pool A and Pool B PCR reactions for each sample.
ep
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
130
iew
132 Fragmentation/End Prep
133 Mix 11 μl of the pooled amplicons (from step 2.4) with 99 μl of nuclease free water in a new 96well
134 plate (Eppendorf TwinPlate).
135 Ensure that the Ultra II FS Reaction Buffer is completely thawed. If a precipitate is seen in the buffer,
136 pipette up and down several times to break it up, and quickly vortex to mix. Place on ice until use.
v
137 Vortex the Ultra II FS Enzyme Mix 5-8 seconds prior to use and place on ice.
re
138 Note: It is important to vortex the enzyme mix prior to use for optimal performance.
139 Add the following components to a 0.2 ml thin wall PCR tube on ice:
Component
er Volume (µL) Volume (µL)
140 Seal the plate, vortex and spin prior to placing it in the thermocycler.
ot
Temperature Time
37°C 30 minutes
tn
65°C 30 minutes
4°C ∞
142 *Set heated lid to 75°C
143
rin
144 If necessary, samples can be stored at –20°C; however, a slight loss in yield (~20%) may be
145 observed. We recommend continuing with adaptor ligation before stopping.
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
NEBNext Ultra II Ligation MM 6 600
iew
148 *Mix the Ultra II Ligation Master Mix by pipetting up and down several times prior to adding
149 to the reaction.
151 Seal the plate, vortex and spin prior to placing it in the thermocycler
152 Incubate at 20°C for 15 minutes in a thermocycler with the heated lid off.
v
153 Add 0.75 μl of • (red) USER® Enzyme to the ligation mixture from Step 4.3.
re
154 Mix well and incubate at 37°C for 15 minutes with the heated lid set to ≥ 47°C.
158 *Oligos from NEB dual index 96-well plate. 4 different sets.
tn
159 **The leftover of the Adaptor Ligated DNA Fragments can be stored overnight @ –20°C.
160 Seal the plate, vortex and spin before starting the program in the thermocycler.
161 Place the tube on a thermocycler and perform PCR amplification using the following PCR cycling
rin
162 Conditions:
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
167 Pipette 4 µL of each library into a strip with a multichannel pipette and combine all volume into a
168 1.5mL Eppendorf tube (~380 µL).
iew
170 Vortex NEBNext Sample Purification Beads to resuspend.
171 Measure the pool volume and add 0.9X of resuspended beads to the PCR reaction. Mix well by
172 pipetting up and down at least 10 times. Be careful to expel all of the liquid out of the tip during the
173 last mix. Vortexing for 3-5 seconds on high can also be used. If centrifuging samples after mixing, be
174 sure to stop the centrifugation before the beads start to settle out.
v
175 Incubate samples on bench top for at least 5 minutes at room temperature.
re
176 Place the tube/plate on an appropriate magnetic stand to separate the beads from the supernatant.
177 If necessary, quickly spin the sample to collect the liquid from the sides of the tube or plate wells
178 before placing on the magnetic stand.
179 After 5 minutes (or when the solution is clear), carefully remove and discard the supernatant. Be
er
180 careful not to disturb the beads that contain DNA targets.
185 Repeat the precious washing step for a total of two washes. Be sure to remove all visible liquid after
186 the second wash. If necessary, briefly spin the tube/plate, place back on the magnet and remove
ot
188 Air dry the beads for up to 5 minutes while the tube/plate is on the magnetic stand with the lid
189 open.
tn
190 Note: Do not over-dry the beads. This may result in lower recovery of DNA.
191 Remove the tube/plate from the magnetic stand. Elute the DNA target from the beads by adding 105
192 μl of 0.1x TE. Mix well by pipetting up and down 10 times, or on a vortex mixer. Incubate for at least
rin
194 If necessary, quickly spin the sample to collect the liquid from the sides of the tube or platewells before
195 placing back on the magnetic stand.
ep
196 Place the tube/plate on the magnetic stand. After 5 minutes (or when the solution is clear), transfer
197 100 μl to a new eppendorf tube.
198 Repeat the cleanup steps with the 0.9x of resuspended beads
Pr
199 Elute the DNA target from the beads by adding 55 μl of 0.1x TE and mix well
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
200 Place the tube/plate on the magnetic stand. After 5 minutes (or when the solution is clear), transfer
201 50 μl to a new eppendorf tube and store at –20°C.
202 Library QC
iew
203 Measure the concentration of the final pool with Qubit DNA BR (HS DNA reagents) kit.
204 Assess the library size distribution with Agilent Tapestation 4150 with D1000 tape and D1000
205 reagents. The sample may need to be highly diluted (in rare cases).
206 The expected concentration is between 100 and 200 ng/ul by BR Qubit.
v
207 A peak sized at 250 bp is expected on a Tapestation 4150, based on a 30-minute fragmentation time:
re
er
pe
ot
tn
rin
208
210 Sequencing
ep
211 Sequencing was conducted using an Illumina MiSeq System. This system is a desktop sequencer
212 supported by automated software, offering various analytical capabilities including whole genome
213 sequencing, exome sequencing, whole transcriptome sequencing, mRNA-Seq, and methylation
214 sequencing.
Pr
215 The MiSeq sequencing method is based on 2-channel sequencing by synthesis, employing
216 fluorescence-labeled nucleotide bases. These bases are incorporated into the (c)DNA template
217 strands, and detection is achieved accordingly. Each sequencing cycle involves reversible terminator-
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
218 bound deoxyribonucleotide triphosphates (dNTPs), minimizing incorporation bias and reducing error
219 rates.
220 Users can adjust settings and configurations to meet specific requirements, such as choosing
iew
221 between high or mid output flow cells and opting for paired-end reads of either 75 or 150 bp read
222 length. For our experiments, we utilized paired-end reads with a 75 bp length.
v
226 involved a multi-stage pipeline comprising several tools for tasks such as sequencing adapter
227 trimming, alignment, quality control, mutation calling, viral genome assembly, and lineage
re
228 classification. The specific tools used in each stage are detailed in the following sections.
244 Secondary usage for the generation of the viral consensus sequence and its subsequent quality
rin
248 Secondary usage for the generation of the viral consensus sequence and its subsequent quality
249 control (consensus computation).
252 regions. Due to the overlapping amplicon design of the applied protocol this did not result in further
253 dropout (unobserved) regions.
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
254 Secondary usage for the generation of the viral consensus sequence and its subsequent quality
255 control (consensus computation).
256 The parameters for iVar consensus and quality control were established in accordance with the
iew
257 Robert Koch Institute (RKI) SARS-CoV-2 sequencing submission criteria. These criteria include
258 ensuring a consensus sequence with at least 90% informative sequence, less than 5% unobserved
259 bases (Ns), a minimum coverage of 20x, and a minimum of 90% read support for informative
260 positions (Robert Koch Institute (RKI), 2021).
v
262 FreeBayes was utilized for variant calling.
re
263
264 bcftools was used for quality filtering and normalization of variants.
265 Secondary usage for the generation of the viral consensus sequence and its subsequent quality
266 control (consensus computation). er
267 Ensembl Variant Effect Predictor (McLaren et al., 2016)
268 Subsequently, all identified variants were annotated with the Ensembl Variant Effect Predictor for
269 their functional consequences.
pe
270 Pangolin (Rambaut et al., 2020)
271 Pangolin (with default parameters) was used for the conducting of lineage classification.
277 labels, and variants resulting in amino acid changes. This script facilitated the generation of
278 comprehensive reports, including metadata tracking sheets and gzipped FASTA files for all viral
279 assemblies. These files were prepared for immediate upload to the German electronic sequencing
280 data hub (DESH) operated by the RKI.
rin
285 LIMITATIONS
286 Not applicable.
Pr
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
287 ETHICS STATEMENT
288 This study was approved by the ethics committee of the Medical Faculty at the University of
289 Heidelberg for the analysis of proband samples by whole genome sequencing of the viral RNA (S-
iew
290 316/2021).
291 Furthermore, the sequencing of SARS-CoV-2 positive samples adhered to the governmental
292 regulations outlined in the Coronavirus-Surveillanceverordnung (CorSurV) by Germany and Baden-
293 Württemberg.
v
294
295 Christian Bundschuh: Writing – review & editing, Writing – original draft, Investigation,
re
296 Data curation.
297 Niklas Weidner: Writing – review & editing, Investigation, Data curation.
298 Julian Klein: Investigation.
299
300
er
Tobias Rausch: Writing – review & editing, Formal analysis, Data curation.
Nayara Azevedo: Investigation.
301 Anja Telzerow: Investigation.
pe
302 Katharina Laurence Jost: Writing – review & editing, Investigation.
303 Paul Schnitzler: Writing – review & editing, Supervision.
304 Hans-Georg Kräusslich: Writing – review & editing, Supervision, Conceptualization.
ot
305 Vladimir Benes: Writing – review & editing, Investigation, Formal analysis.
306 ACKNOWLEDGEMENTS
307 We thank all diagnostics employees of the Virology department of the University Hospital Heidelberg
tn
308 and EMBL GeneCore's NGS protocol-automation team for their support. This work was supported by
309 the program for surveillance and control of SARS-CoV-2 mutations of the state of Baden-
310 Württemberg and the German Federal Research Network Applied Surveillance (CorSurV).
rin
311 For the publication fee we acknowledge financial support by Heidelberg University.
314 that could have appeared to influence the work reported in this paper.
315 REFERENCES
316 Bundschuh, C., Weidner, N., Klein, J., Rausch, T., Azevedo, N., Telzerow, A., Mallm, J. P., Kim, H.,
317 Steiger, S., Seufert, I., Börner, K., Bauer, K., Hübschmann, D., Jost, K. L., Parthé, S., Schnitzler,
Pr
318 P., Boutros, M., Rippe, K., Müller, B., Bartenschlager, R., Kräusslich, H. G. & Benes, V. 2024.
319 Evolution of SARS-CoV-2 in the Rhine-Neckar/Heidelberg Region 01/2021 - 07/2023. Infect
320 Genet Evol, 119, 105577.
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020
ed
321 Rausch T. SARS-CoV-2 data analysis. https://doi.org/10.5281/zenodo.10847332
322 Abdool Karim, S. S. & de Oliveira, T. 2021. New SARS-CoV-2 Variants - Clinical, Public Health, and
323 Vaccine Implications. N Engl J Med, published online 24 March 2021.
iew
324 Center for Systems Science and Engineering (Johns Hopkins University). 2023. COVID-19 Dashboard
325 [Online]. Available: https://coronavirus.jhu.edu/map.html [Accessed 19/01/2023].
326 Garrison, E. & Marth, G. 2012. Haplotype-based variant detection from short-read sequencing. arXiv,
327 1207.
328 Grubaugh, N. D., Gangavarapu, K., Quick, J., Matteson, N. L., De Jesus, J. G., Main, B. J., Tan, A. L., Paul,
329 L. M., Brackney, D. E., Grewal, S., Gurfield, N., Van Rompay, K. K. A., Isern, S., Michael, S. F.,
v
330 Coffey, L. L., Loman, N. J. & Andersen, K. G. 2019. An amplicon-based sequencing framework
331 for accurately measuring intrahost virus diversity using PrimalSeq and iVar. Genome Biol, 20,
332 8.
re
333 Hadfield, J., Megill, C., Bell, S. M., Huddleston, J., Potter, B., Callender, C., Sagulenko, P., Bedford, T. &
334 Neher, R. A. 2018. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics, 34,
335 4121-4123.
336 Li, H. 2011. A statistical framework for SNP calling, mutation discovery, association mapping and
337 population genetical parameter estimation from sequencing data. Bioinformatics, 27, 2987-
338 93.
er
339 Li, H. & Durbin, R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform.
340 Bioinformatics, 25, 1754-60.
pe
341 Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. & Durbin, R.
342 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078-9.
343 McLaren, W., Gil, L., Hunt, S. E., Riat, H. S., Ritchie, G. R., Thormann, A., Flicek, P. & Cunningham, F.
344 2016. The Ensembl Variant Effect Predictor. Genome Biol, 17, 122.
345 Rambaut, A., Holmes, E. C., O'Toole, Á., Hill, V., McCrone, J. T., Ruis, C., du Plessis, L. & Pybus, O. G.
346 2020. A dynamic nomenclature proposal for SARS-CoV-2 lineages to assist genomic
ot
351 alignment statistics, feature counting and feature annotation for long- and short-read
352 sequencing. Bioinformatics, 35, 2489-2491.
353 Robert Koch Institute (RKI). 2021. Qualitätsvorgaben für die Sequenzdaten [Online]. Available:
354 https://www.rki.de/DE/Content/InfAZ/N/Neuartiges_Coronavirus/DESH/Qualitaetskriterien.
355 pdf [Accessed 06/12/2021].
rin
356 Tyson, J. R., James, P., Stoddart, D., Sparks, N., Wickenhagen, A., Hall, G., Choi, J. H., Lapointe, H.,
357 Kamelian, K., Smith, A. D., Prystajecky, N., Goodfellow, I., Wilson, S. J., Harrigan, R., Snutch, T.
358 P., Loman, N. J. & Quick, J. 2020. Improvements to the ARTIC multiplex PCR method for SARS-
359 CoV-2 genome sequencing using nanopore. bioRxiv.
ep
360 Wood, D. E., Lu, J. & Langmead, B. 2019. Improved metagenomic analysis with Kraken 2. Genome Biol,
361 20, 257.
362 World Health Organization (WHO) 2020. Director-General's remarks at the media briefing on 2019-
363 nCoV on 11 February 2020.
364 World Health Organization (WHO) 2023. Coronavirus disease (COVID-19) Weekly Epidemiological
Pr
This preprint research paper has not been peer reviewed. Electronic copy available at: https://ssrn.com/abstract=4795020