Identification of Cancer-Related Mutations in Human Pluripotent Stem Cells Using RNA-seq Analysis

PROTOCOL
https://doi.org/10.1038/s41596-021-00591-5
Elyad Lezmi ✉ and Nissim Benvenisty ✉
Human pluripotent stem cells (hPSCs) are known to acquire genetic aberrations during in vitro propagation. In addition to
recurrent chromosomal aberrations, it has recently been shown that these cells also gain point mutations in cancer-related
genes, predominantly in TP53. Routine quality control of hPSCs is therefore critical for both basic research and clinical
applications. Here we discuss the relevance of detecting mutations for various hPSC applications, and present a detailed
protocol to identify cancer-related point mutations using data from RNA sequencing, an assay commonly performed
during the growth and differentiation of hPSCs. In this protocol, we describe how to process and align the sequencing
data, analyze it and conservatively interpret the results in order to generate an accurate estimation of mutations in
tumor-related genes. This pipeline is designed to work in high throughput and is available as a software container at
https://github.com/elyadlezmi/RNA2CM. The protocol requires minimal command-line skills and can be carried out in 1–2 d.
Introduction
Genetic stability of hPSCs

The potential to differentiate into all cell types and the ability for unlimited self-renewal in vitro are
the hallmarks of human pluripotent stem cells (hPSCs)1. Pluripotency is the eminent feature of these
cells, harnessed in the fields of regenerative medicine2 and disease modeling3, as well as in basic
research of early human development4. Continuous propagation in culture is another critical feature
of hPSCs, enabling an infinite source of biologic material, but also known to compromise the normal
state of the cells with time.
Prolonged culturing can result in genomic instability, and culture-adapted cells have been shown
to acquire several genomic and epigenomic aberrations5,6, many of which are similar to changes that
occur upon cancerous transformation7. Chromosomally aberrant culture-adapted cells are
distinguished not only in their proliferation and differentiation capacity8, but they also have an
increased tumorigenic potential and gradually take over the culture9,10. Hence the validation of hPSC
chromosomal integrity is imperative, and should be routinely conducted in both academic and
clinical facilities11.
Point mutations in cancer-related genes were first identified in human induced pluripotent stem
cells (iPSCs)12, although the origin and biological effects of these mutations were unclear. A later
study that examined the exome sequence of 140 embryonic stem cell lines, including lines prepared
for potential clinical application, identified recurrent dominant negative mutations in TP53 (ref. 13).
Further inspection of RNA-sequencing (RNA-seq) data of an additional set of >100 PSC lines
revealed additional coding mutations in TP53, where in some cases the allelic fraction rose above
50%, suggesting loss of heterozygosity in the TP53 locus over time. Interestingly, the specific
mutations identified are among the most common mutations in human cancers.
Cancer-related mutations were next shown to be acquired during culturing, in a study that
compared whole-exome sequencing (WES) of early-passaged cells with RNA-seq data from later
passages, of two of the most commonly used hPSC lines. Mutations were observed predominantly in
TP53 but also less frequently in other genes, e.g., EGFR and CDK12 (ref. 14). Interestingly, some of the
mutations initially identified in the original study actually originated from the contamination of
mouse feeder cells on which the hPSCs were cultured15. Hence, the fact that hPSCs are commonly
cultured on mouse feeders, and the potential consequences of contaminating mouse DNA or RNA on
mutation identification, need to be addressed when developing pipelines to detect hPSC mutations
acquired in culture. New point mutations and other genomic aberrations are continuously identified
in hPSCs as next-generation sequencing methodologies improve16,17, supporting the importance of
routine quality control of these cells. Future analysis of cancer-related mutations in low and high
passage numbers of pluripotent stem cells will enable us to determine more precisely the rate at which
these mutations accumulate during cell growth in culture.

The Azrieli Center for Stem Cells and Genetic Research, The Hebrew University of Jerusalem, Jerusalem, Israel. ✉e-mail: elyad.lezmi@mail.huji.ac.il;
nissimb@mail.huji.ac.il
4522 NATURE PROTOCOLS | VOL 16 | SEPTEMBER 2021 | 4522–4537 | www.nature.com/nprot

[Figure 1. PSC applications shown: developmental biology, disease modeling and cell therapy. Potential consequences shown:
altered cell cycle regulation; aberrant differentiation capacity; mutation burden increase with passaging; unknowingly
modeling another disease, e.g., Li–Fraumeni syndrome (TP53 mutation); risk of graft contributing to cancer formation.]
Fig. 1 | Relevance of cancer-related point mutations in hPSCs. Outlined are the potential consequences of
pluripotent stem cells carrying cancer-related mutations in various hPSC applications.
Relevance of cancer-related point mutations

hPSC-derived cells are constantly being evaluated for transplantation to patients suffering from
various conditions, e.g., macular degeneration, diabetes and Parkinson’s disease18. Although the effect
of nucleotide variations throughout the genome is considered harmless in most cases, mutations in
key loci might have an adverse effect on the transplanted cell’s state. Mutations that confer a selective
advantage to the cell could be drivers of faster proliferation, higher tumorigenicity and ultimately the
cause of cancer upon cell therapy to patients. Cancer-related mutations are not relevant only to
clinical applications. Since these mutations are mostly involved in pathways affecting the cell cycle,
DNA-damage response and general cell identity, using cells that carry such mutations in basic
research and developmental studies can be problematic, because phenotypes might arise owing to
interaction with underlying mutations rather than from pure experimental manipulation. Disease
modeling is another common application of hPSCs, especially by deriving iPSCs from patients via
reprogramming. Since iPSCs could acquire point mutations during culturing, such cells used as a
disease model could be in fact modeling a cancer-related syndrome, resulting in phenotypes erroneously
attributed to the modeled disease rather than to the cancer-related mutations. For example,
by modeling any disease using iPSCs bearing mutations in TP53, VHL or PTEN, the study is
unknowingly modeling Li–Fraumeni syndrome, Von Hippel–Lindau disease or Cowden syndrome,
respectively. Overall, we see great importance in the ability to rigorously evaluate cells
designated for developmental studies, disease modeling and clinical transplantation (Fig. 1).
Protocol overview
The high-throughput sequencing era has brought enormous amounts of data about variations in the
human DNA sequence. Databases such as the Catalogue of Somatic Mutations in Cancer (COSMIC)19
and software like Functional Analysis through Hidden Markov Models (FATHMM)20 have
greatly enhanced our ability to characterize the harmful potential of different mutations. In addition,
databases such as dbSNP21 and the 1000 Genomes Project22 provide extensive information about
genomic polymorphism in the general population. This protocol takes advantage of the present
feasibility of RNA-seq technology, the increasing availability of excellent bioinformatic tools and the
constant growth of publicly available knowledge.

[Figure 2, panel a. Main stages of the procedure and tools used:
Steps 1–7: Read processing and alignment to reference (Trimmomatic, STAR, XenofilteR)
Steps 8–13: Variant discovery (GATK4)
Steps 14–19: Variant filtration and annotation (GATK4, Bcftools)
Steps 20–26: Identification of high-confidence cancer-related mutations (Pandas)
Figure 2, panel b. Steps with input and output files:
RNA-seq experiment (hESC RNA sample): output Sample.fastq
Low-quality base trimming: Sample.fastq to Trimmed.fastq
Alignment to genome (GRCh38, GRCm38): Trimmed.fastq to toHuman.bam and toMouse.bam
Filter out mouse reads: toHuman.bam and toMouse.bam to Filtered.bam
Mark duplicate reads: Filtered.bam to MarkDups.bam
Split N-CIGAR reads: MarkDups.bam to SplitN.bam
Base quality recalibration (dbSNP154.vcf): SplitN.bam to Recalibrated.bam
Variant calling: Recalibrated.bam to Variants.vcf
Hard filtering: Variants.vcf to Filtered.vcf
Variant annotation (dbSNP154.vcf, CosmicMutations.vcf): Filtered.vcf to varTable.tsv
Keep mutations identified in >20 cancer tumors
Discard common SNPs
Add information of COSMIC mutations in census genes (CosmicCensus.tsv)
Keep nonsilent mutations in Tier 1 census genes
Keep mutations predicted to be pathogenic by FATHMM: output CancerMutations.csv]
Fig. 2 | Schematic representation of the pipeline for identification of cancer-related mutations in hPSCs from RNA-seq data. a, Colored boxes
represent the main stages of the procedure; the tools used in each stage are between parentheses in italics. b, Schematic representation of the pipeline
where each step is represented by a box; input and output files are between the steps in italics.
In the protocol, we discuss best practices for the processing of hPSC RNA-seq data, including
eliminating sources of noise such as poorly sequenced reads, or reads that originate from murine
RNA of feeder cells. This is followed by restrictive variant calling that reveals discrepancies with the
reference genome. Next, we filter out all polymorphic sites with a minor allele frequency >1%, in at
least one of the populations defined by the 1000 Genomes Project, as these are very unlikely to be
pathogenic. Finally, we compare the remaining variants with validated somatic mutations that were
identified in a sufficient number of tumors in the COSMIC dataset (>20 tumors; Fig. 2), and are predicted to be
highly pathogenic by FATHMM.
This pipeline requires an RNA-seq sample as input, and outputs a table with detailed information
about cancer-related mutations found in the input sample (Fig. 2).
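
The filtering logic outlined above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the record fields (population_maf, cosmic_tumor_count, fathmm_prediction), the label "PATHOGENIC" and the example records are hypothetical, while the >1% allele frequency cutoff and the >20 tumor cutoff are taken from the text and Fig. 2.

```python
def is_candidate(variant, max_maf=0.01, min_tumors=20, fathmm_label="PATHOGENIC"):
    """Return True if a variant passes the three filters described in the overview.

    `variant` is a hypothetical dict; the field names are illustrative only.
    """
    # 1) Discard polymorphic sites: MAF > 1% in at least one population.
    if any(maf > max_maf for maf in variant["population_maf"].values()):
        return False
    # 2) Keep only mutations reported in more than `min_tumors` tumors.
    if variant["cosmic_tumor_count"] <= min_tumors:
        return False
    # 3) Keep only mutations predicted to be pathogenic.
    return variant["fathmm_prediction"] == fathmm_label


# Made-up records, used only to exercise the function.
example_variants = [
    {"id": "var1", "population_maf": {"AFR": 0.0, "EUR": 0.0},
     "cosmic_tumor_count": 150, "fathmm_prediction": "PATHOGENIC"},
    {"id": "var2", "population_maf": {"AFR": 0.03, "EUR": 0.0},   # common SNP
     "cosmic_tumor_count": 150, "fathmm_prediction": "PATHOGENIC"},
    {"id": "var3", "population_maf": {},                          # rarely seen in tumors
     "cosmic_tumor_count": 5, "fathmm_prediction": "PATHOGENIC"},
    {"id": "var4", "population_maf": {},                          # not predicted pathogenic
     "cosmic_tumor_count": 150, "fathmm_prediction": "NEUTRAL"},
]

kept = [v["id"] for v in example_variants if is_candidate(v)]
print(kept)
```

Applying the population filter first mirrors the order of the description; since the three tests are independent, the order affects only speed, not the result.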
Applications of the method

The main application of this protocol is to perform quality control of hPSCs designated for clinical
applications, but also of cells used for disease modeling and in basic research. Cancer-related
mutations in hPSCs can affect various cellular processes and might mask experimental results or lead
to misleading conclusions. Thus, we see a great importance in validation of the cancer-related
genomic integrity of cells not just for clinical use but also in basic research.
As part of the variant calling, the pipeline outputs information about the allelic fraction within the
population of cells investigated. This information could be used to track culture dynamics under
selective pressure. Moreover, application of this protocol to RNA-seq samples obtained at different
time points could serve as a tool to track genomic changes in cells over time and calculate the cancer-
related mutation rate under various conditions.
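
As a rough illustration of how the allelic fraction could be used to follow a mutation over time, the sketch below computes the fraction from reference and alternate read counts and the average change per passage between the earliest and latest sample. The function names and the input layout are assumptions made for this example; the pipeline itself reports the allelic fraction as part of variant calling.

```python
def allelic_fraction(ref_reads, alt_reads):
    """Fraction of reads at a site that support the alternate (mutant) allele."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / total


def fraction_change_per_passage(timepoints):
    """Average change in allelic fraction per passage.

    `timepoints` is a list of (passage_number, ref_reads, alt_reads) tuples
    (hypothetical layout); only the earliest and latest passages are compared.
    """
    ordered = sorted(timepoints)
    first, last = ordered[0], ordered[-1]
    span = last[0] - first[0]
    if span == 0:
        raise ValueError("need samples from at least two different passages")
    delta = allelic_fraction(last[1], last[2]) - allelic_fraction(first[1], first[2])
    return delta / span


# Made-up read counts for one site sampled at passages 10 and 30.
print(allelic_fraction(30, 10))
print(fraction_change_per_passage([(10, 90, 10), (30, 50, 50)]))
```

Because RNA-seq coverage depends on expression level, fractions computed from few reads are noisy and should be interpreted with care.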
Comparison with DNA-based methods for mutation identification

Sanger sequencing or other PCR-based methods are gene specific and require laborious experimental
design if applied to hundreds of genes; hence, next-generation sequencing methods such as WES and
whole-genome sequencing (WGS) are preferred owing to their high throughput. RNA-seq has now
become the standard tool for expression quantification, but it is also increasingly applied for the
identification of small variants as well as somatic mutations. Notably, mutations called from RNA
data may show some differences from mutations called from DNA data13,23,24, but only a direct
comparison between mutations called from RNA-seq and WES/WGS would allow determination of
the exact sensitivity and specificity of each of these methods. One substantial limitation of RNA-seq is
that, unlike DNA-seq, it reveals only sequences of expressed genes, while nontranscribed loci and
regulatory regions are overlooked in these analyses. On the other hand, WES/WGS usually offers
limited read coverage, which acts as a limiting factor for high-confidence mutation calling. Still, even
mutations in highly transcribed genes might be overlooked in RNA-seq data, if they introduce a
premature stop-codon that leads to nonsense-mediated decay, resulting in lack of transcripts
representing the mutated allele. Additional biological factors such as RNA editing and monoallelic
expression from imprinted loci, and technical factors such as variations in sample preparation, also
contribute to the discrepancies between RNA- and DNA-based mutation identification. All of these
mentioned biases should be accounted for when the methodology for mutation analysis is chosen,
and complementation of RNA-based mutation calling with WES/WGS-based mutation calling is
recommended, especially in cases where the cells are intended to be used for clinical applications.
Advantages and limitations

The main limitation of mutation identification using RNA-seq data is that it is an indirect measurement
of the DNA sequence; hence, it is prone to biases originating from the transcriptional
landscape and may be affected by posttranscriptional editing. Additionally, in RNA-seq data, the base
coverage differs between genes as it correlates with the abundance of expression of each gene, unlike
DNA-seq, which is more uniform along the human genome. It is important to note that this protocol
can detect only nucleotide variants, and is not intended to detect chromosomal aberrations or copy-number
variations. We recommend complementing this protocol with RNA-based methods such as
eSNP-karyotyping25 to validate the complete genomic integrity of the sample.
The major advantage of RNA-seq is that many loci have higher read coverage than in WGS/WES,
which allows sequence characterization with high confidence. Additionally, most laboratories are
proficient in the methodology of RNA-seq, and since it is routinely performed for expression analysis,
the same RNA-seq data can also be used to validate genomic integrity post hoc. In addition, publicly
available RNA-seq data are abundant and accessible via repositories such as the Sequence Read Archive
(SRA) and the European Nucleotide Archive (ENA), providing ample opportunities to apply this
pipeline for high-throughput analysis of published data from the literature.
Several RNA-based pipelines for somatic mutation identification have been suggested, e.g., the
RNA-MuTect pipeline23, a GATK-MuTect2 based pipeline24 and the RADIA pipeline26. While these
pipelines all make use of RNA-seq data for mutation calling, they are also based on comparison of the
RNA-seq data with healthy-control DNA-seq data23,26, which is usually unavailable when inspecting
newly derived hPSC lines. This protocol is designed to analyze independent samples, identifying an
absolute number of potential cancer-related mutations, without comparison with any baseline, and
thus it applies very strict filters to minimize the false discovery rate.
Materials
CRITICAL The materials list represents the minimal requirements for execution of the pipeline locally
and manually (not recommended). Alternatively, all steps can be streamlined via Nextflow27 and
containerized with Docker28, enabling the user to skip any software installation and computational
environment setup (Box 1).
Hardware
● A workstation or computer cluster, running a 64-bit UNIX-based operating system, with 30 GB of free
hard disk space and at least 64 GB of available random-access memory
CRITICAL The main limiting factor for the throughput of this pipeline is the number of available CPU threads.
Software
CRITICAL Note that some components might be already present on your system. The pipeline was
tested with the specified software versions but is not necessarily restricted to these specific versions.
● Python3 v3.6.9 (https://www.python.org)
● Pandas package for Python 3 (https://pandas.pydata.org)
● Java 8 (https://www.java.com)
● STAR29 v2.5.4b (https://github.com/alexdobin/STAR)
● SAMtools30 v1.7-2 (http://www.htslib.org)
● BCFtools31
● Tabix32

Box 1 | Automated usage of the pipeline using the RNA2CM software tool
Software
RNA2CM (https://github.com/elyadlezmi/RNA2CM)
Nextflow (https://www.nextflow.io)
Docker (https://www.docker.com)
Singularity (optional, required only for execution on computer clusters managed by the SLURM workload manager) (https://sylabs.io/singularity)
Procedure
Equipment setup
CRITICAL This setup replaces all steps of the Equipment Setup and is performed only once before the first time the pipeline is run.
1 Nextflow and Docker (use Singularity instead of Docker for execution on SLURM-clusters) are the only prerequisites for the RNA2CM tool.
Install both, and make sure they are running properly on your system. If the following commands do not generate any error message, the
installation has been successful.
$ nextflow run hello # test that nextflow is working

$ docker run hello-world # test that docker is working
2 Download and extract the project directory (using either git or an internet browser):
$ git clone https://github.com/elyadlezmi/RNA2CM # clone the project using git
3 Download the files CosmicMutantExportCensus.tsv.gz and CosmicCodingMuts.vcf.gz from the COSMIC website (login required), then move them
into the project’s subdirectory named data (RNA2CM/data).
4 Execute the script named setup.nf, which is responsible for setting up all the reference data and will complete the installation (this might take a
while).
$ nextflow run /path/to/RNA2CM/setup.nf # run the installation script

The setup.nf script can take three optional arguments:
-profile: choose the executor profile between a standard dockerized usage on a local workstation or usage on a
SLURM cluster (requires Singularity instead of Docker) (standard/cluster, default: standard).
--cpu: the number of threads for multithreading (int, default 8).
--readLength: the expected Illumina read length for optimal alignment by STAR (int, default 100).
Example for automated run of the Equipment setup on a local system, using four CPUs, and defining a read length
of 150 bases:
$ nextflow run /path/to/RNA2CM/setup.nf --cpu 4 --readLength 150
Pipeline execution
CRITICAL This step replaces Steps 2–26 of the Procedure.
$ nextflow run /path/to/RNA2CM --fastq sample.fastq.gz # for single-end reads

$ nextflow run /path/to/RNA2CM --fastq sample_1.fastq.gz --fastq2 sample_2.fastq.gz # for paired-ends reads
Optional arguments
-profile: (standard/cluster, default: standard).
--cpu: (int, default 8).
--prefix: output files have standard names, but a custom prefix can be added (character string).
--keepInter: whether to keep intermediate alignment and VCF files (true/false, default: false).
--filterMouse: whether to perform mouse contamination cleanup (true/false, default true).
Example for a paired-end RNA-seq run, using four CPUs, keeping intermediate files:
$ nextflow run /path/to/RNA2CM --fastq esc_1.fastq.gz --fastq2 esc_2.fastq.gz --cpu 4 --keepInter true
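
When many samples are processed, the command line above can be assembled programmatically. The helper below is a hypothetical sketch, not part of RNA2CM: it only builds the argument list from the options documented in this box (-profile, --fastq, --fastq2, --cpu, --prefix, --keepInter, --filterMouse), and the function and parameter names are assumptions of this example.

```python
def build_rna2cm_command(pipeline_dir, fastq, fastq2=None, cpu=8,
                         keep_inter=False, filter_mouse=True,
                         prefix=None, profile="standard"):
    """Build the argument list for one RNA2CM run (illustrative helper)."""
    cmd = ["nextflow", "run", pipeline_dir, "-profile", profile, "--fastq", fastq]
    if fastq2 is not None:                      # second file for paired-end runs
        cmd += ["--fastq2", fastq2]
    cmd += ["--cpu", str(cpu),
            "--keepInter", "true" if keep_inter else "false",
            "--filterMouse", "true" if filter_mouse else "false"]
    if prefix:                                  # optional custom output prefix
        cmd += ["--prefix", prefix]
    return cmd


command = build_rna2cm_command("/path/to/RNA2CM", "esc_1.fastq.gz",
                               fastq2="esc_2.fastq.gz", cpu=4, keep_inter=True)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a system where Nextflow is installed.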
Java programs (do not require installation)

● Trimmomatic33 v0.39 (http://www.usadellab.org/cms)
● GATK34 v4.1.8.0 (https://gatk.broadinstitute.org)
Data
● Primary assembly of the human reference genome (GRCh38 build) in FASTA format35 (https://www.gencodegenes.org/human)
● Gene annotation file for the primary assembly of the human reference genome in GTF format35 (https://www.gencodegenes.org/human)
● dbSNP data of polymorphic sites in the GRCh38 build of the human reference genome, in VCF format21 (https://www.ncbi.nlm.nih.gov/snp)
● Data of all coding mutations from COSMIC in VCF format19; the file is named CosmicCodingMuts.vcf.gz (https://cancer.sanger.ac.uk/cosmic)
● Data of all cancer mutations in cancer census genes in tab-separated file format36; the file is named CosmicMutantExportCensus.tsv.gz (https://cancer.sanger.ac.uk/cosmic)
Other programs and data necessary in case of potential contamination by murine reads (as in the
case of cells cultured on mouse feeder cells)
● R 3.4 (https://www.r-project.org)
● XenofilteR37 v.1.6 (https://github.com/PeeperLab/XenofilteR)
● Primary assembly of the mouse reference genome in FASTA format35 (https://www.gencodegenes.org/mouse)
● Gene annotation file for the primary assembly of the mouse reference genome in GTF format35 (https://www.gencodegenes.org/mouse)
Equipment setup
CRITICAL In all subsequent code examples, the commands to be typed in the shell terminal are after
the $ sign (or > sign in Python and R shells), while comments appear after the # sign.
CRITICAL The equipment setup is performed only once before the first time the procedure is run.
Reference data download ● Timing 1–2 h

1 Download the project ZIP file named RNA2CM.zip from https://github.com/elyadlezmi/RNA2CM, and
unzip it. You can keep the directory anywhere; in this example, it is unzipped into the Home directory.
$ cd ~
$ wget https://github.com/elyadlezmi/RNA2CM/archive/master.zip && unzip master.zip && mv RNA2CM-master RNA2CM
CRITICAL STEP The directory RNA2CM contains scripts for automated execution of both the preliminary setup and the
procedure. Box 1 describes the steps covered by each script and provides detailed usage
instructions.
2 Enter the COSMIC website (https://cancer.sanger.ac.uk/cosmic), and create a username and password
for registered login. Login with your username, and download the VCF file of all coding mutations
named CosmicCodingMuts.vcf.gz, and the tab-separated file of all cancer mutations in census genes
named CosmicMutantExportCensus.tsv.gz. Transfer both files into the data subdirectory.
? TROUBLESHOOTING
3 Download the human and mouse reference genomes and the matching annotation files from the
GENCODE website into the data subdirectory, and unzip the files. If your sample has no risk of
murine RNA contamination, you can skip all parts addressing the mouse reference genome for the
rest of the protocol.
$ cd ~/RNA2CM/data
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.primary_assembly.genome.fa.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.primary_assembly.annotation.gtf.gz
$ gunzip GRCh38.primary_assembly.genome.fa.gz gencode.v34.primary_assembly.annotation.gtf.gz GRCm38.primary_assembly.genome.fa.gz gencode.vM25.primary_assembly.annotation.gtf.gz
CRITICAL STEP Download files named ‘primary assembly’, as these files do not contain
alternative haplotypes that might reduce alignment rate in later steps.

4 Download the latest version of dbSNP in VCF format (this protocol uses build 154 for GRCh38)
and its index file (which has a tbi suffix), from NCBI’s FTP site, into the data subdirectory.
$ wget https://ftp.ncbi.nih.gov/snp/archive/b154/VCF/GCF_000001405.38.gz
$ wget https://ftp.ncbi.nih.gov/snp/archive/b154/VCF/GCF_000001405.38.gz.tbi

Reference data setup ● Timing 2–4 h

5 Create STAR index files for both the human and mouse genomes, inside new directories named GRCh38
and GRCm38, respectively, within the data subdirectory. In the --sjdbOverhang option, you should
specify the Illumina read length minus 1 (in this case, we use reads of length 101), and in the --runThreadN
option, you should specify the number of threads for multithreading (10 in this example).
$ mkdir ~/RNA2CM/data/GRCh38
$ cd ~/RNA2CM/data/GRCh38
$ STAR --runThreadN 10 --runMode genomeGenerate --genomeDir ~/RNA2CM/data/GRCh38 --genomeFastaFiles ~/RNA2CM/data/GRCh38.primary_assembly.genome.fa --sjdbGTFfile ~/RNA2CM/data/gencode.v34.primary_assembly.annotation.gtf --sjdbOverhang 100
$ mkdir ~/RNA2CM/data/GRCm38
$ cd ~/RNA2CM/data/GRCm38
$ STAR --runThreadN 10 --runMode genomeGenerate --genomeDir ~/RNA2CM/data/GRCm38 --genomeFastaFiles ~/RNA2CM/data/GRCm38.primary_assembly.genome.fa --sjdbGTFfile ~/RNA2CM/data/gencode.vM25.primary_assembly.annotation.gtf --sjdbOverhang 100
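
If the read length of a run is not known in advance, it can be estimated from the FASTQ file before choosing the --sjdbOverhang value. The following is a minimal sketch; the function names are assumptions of this example, and it takes the longest read among the first records of the untrimmed file, on the assumption that the value should be based on the maximum read length when reads vary in length.

```python
import gzip


def max_read_length(fastq_gz_path, max_reads=100000):
    """Longest read among the first `max_reads` records of a gzipped FASTQ file."""
    longest = 0
    with gzip.open(fastq_gz_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number >= 4 * max_reads:
                break
            if line_number % 4 == 1:            # second line of each 4-line record
                longest = max(longest, len(line.strip()))
    if longest == 0:
        raise ValueError("no reads found in " + str(fastq_gz_path))
    return longest


def sjdb_overhang(read_length):
    """Value for --sjdbOverhang: read length minus 1."""
    return read_length - 1
```

For reads of length 101, as in the example above, sjdb_overhang returns 100.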
6 Create an index file and a GATK dictionary file for the human reference FASTA file. This produces
files with an identical name as the reference FASTA but with a fai suffix (index) and a dict suffix
(dictionary).
$ cd ~/RNA2CM/data
$ samtools faidx GRCh38.primary_assembly.genome.fa
$ gatk CreateSequenceDictionary -R GRCh38.primary_assembly.genome.fa
? TROUBLESHOOTING
7 Create a four-column BED file, with the coordinates of all exons from the human annotation GTF,
gzip the file and index it with tabix. This file will serve later as an interval file for GATK tools.
$ awk '{if($3=="exon") {print $1"\t"$4-100"\t"$5+100"\t"substr($16,2,length($16)-3)}}' gencode.v34.primary_assembly.annotation.gtf | sort -k 1,1 -k2,2n | bgzip > GRCh38_exome.bed.gz
$ tabix GRCh38_exome.bed.gz
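
For readers less familiar with awk, the conversion performed in this step can be sketched in Python. This is an illustrative equivalent, not a replacement for the command above: the function name is hypothetical, the gene name is read from the gene_name attribute rather than from a fixed field position, and, unlike the awk one-liner, the padded start is clamped at 0.

```python
import re

GENE_NAME = re.compile(r'gene_name "([^"]+)"')


def exon_to_bed(gtf_line, padding=100):
    """Convert one GTF exon line to a (chrom, start, end, gene) tuple.

    Returns None for comment lines and for features other than exons.
    """
    if gtf_line.startswith("#"):
        return None
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9 or fields[2] != "exon":
        return None
    match = GENE_NAME.search(fields[8])
    gene = match.group(1) if match else "."
    start = max(0, int(fields[3]) - padding)    # pad upstream, never below 0
    end = int(fields[4]) + padding              # pad downstream
    return (fields[0], start, end, gene)


example_line = ('chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\t'
                'gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; '
                'gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1";')
print(exon_to_bed(example_line))
```

The 100-base padding on each side matches the $4-100 and $5+100 terms of the awk command.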
8 Since the chromosome names in the dbSNP file from NCBI (NC_000001.11, NC_000002.12…) and
the names in the COSMIC VCF (1, 2…) are not compatible with the chromosome names in the
GENCODE release (chr1, chr2…), it is necessary to rename the chromosome names in both files.
We provide mapping files for the chromosome names within the data directory. Renaming is easily
performed using the annotate function from bcftools. After renaming, the original VCF files are no
longer needed. It is essential to index the renamed files using tabix.
# rename chromosomes of dbSNP file
$ bcftools annotate --threads 10 --output-type z --rename-chrs remapNCBI.txt --output dbSNPbuild154Renamed.vcf.gz GCF_000001405.38.gz
$ tabix dbSNPbuild154Renamed.vcf.gz
# rename chromosomes of COSMIC file
$ tabix CosmicCodingMuts.vcf.gz
$ bcftools annotate --threads 10 --output-type z --rename-chrs remapCOSMIC.txt --output CosmicCodingMutsRenamed.vcf.gz CosmicCodingMuts.vcf.gz
$ tabix CosmicCodingMutsRenamed.vcf.gz
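
The mapping files passed to --rename-chrs are plain text with one "old_name new_name" pair per line, separated by whitespace. The sketch below shows how such a file could be generated for COSMIC-style names. It is only an illustration: the provided remapCOSMIC.txt should be used in practice, and the assumption that the mitochondrial contig is named MT in the COSMIC VCF and chrM in GENCODE should be checked against your files.

```python
def cosmic_to_gencode_map():
    """Map COSMIC-style contig names (1, 2, ..., X, Y, MT) to GENCODE names."""
    names = [str(n) for n in range(1, 23)] + ["X", "Y"]
    mapping = {name: "chr" + name for name in names}
    mapping["MT"] = "chrM"                      # assumed mitochondrial naming
    return mapping


def format_rename_file(mapping):
    """Render the mapping as 'old<TAB>new' lines, one pair per line."""
    return "".join(f"{old}\t{new}\n" for old, new in mapping.items())


print(format_rename_file(cosmic_to_gencode_map()), end="")
```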
The last data file needed for the analysis is a text file with the VCF headers necessary for later
annotations of variants from your sample. For convenience, this file is already supplied within the
data subdirectory.

Table 1 | Layout of the RNA2CM directory and description of all files used in the analysis
Path         File                                     Description
RNA2CM       RNA2CMsetup.nf                           Nextflow script automating the preliminary setup
RNA2CM       RNA2CM.nf                                Nextflow script automating Steps 2–25
RNA2CM/data  GRCh38                                   STAR human genome index files (directory)
RNA2CM/data  GRCm38                                   STAR mouse genome index files (directory)
RNA2CM/data  CosmicCodingMutsRenamed.vcf.gz           COSMIC mutation annotation
RNA2CM/data  CosmicCodingMutsRenamed.vcf.gz.tbi       Index file
RNA2CM/data  CosmicMutantExportCensus.tsv.gz          COSMIC cancer census genes table
RNA2CM/data  dbSNPbuild154Renamed.vcf.gz              dbSNP annotation
RNA2CM/data  dbSNPbuild154Renamed.vcf.gz.tbi          Index file
RNA2CM/data  GRCh38_exome.bed.gz                      Intervals file
RNA2CM/data  GRCh38_exome.bed.gz.tbi                  Index file
RNA2CM/data  GRCh38.primary_assembly.genome.dict      Reference genome dictionary file
RNA2CM/data  GRCh38.primary_assembly.genome.fa        Human reference genome sequence
RNA2CM/data  GRCh38.primary_assembly.genome.fa.fai    Reference genome index file
RNA2CM/data  Header.txt                               VCF headers file
RNA2CM/data  remapCOSMIC.txt                          Chromosome names remapping file
RNA2CM/data  remapNCBI.txt                            Chromosome names remapping file
RNA2CM/data  CommonAdapters.fa                        Adapters file for Trimmomatic
By this step, you should have all the data needed to perform the procedure. These necessary files
and their locations are detailed in Table 1.
? TROUBLESHOOTING
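
Before starting the procedure, it can be useful to confirm that the data subdirectory is complete. The check below is a hypothetical convenience sketch, not part of RNA2CM; the file names are copied from Table 1 (including the capitalization of Header.txt as printed there, which may differ from the file on disk), and the function name is an assumption of this example.

```python
from pathlib import Path

# File names as listed in Table 1 for the RNA2CM/data directory.
EXPECTED_FILES = [
    "CosmicCodingMutsRenamed.vcf.gz", "CosmicCodingMutsRenamed.vcf.gz.tbi",
    "CosmicMutantExportCensus.tsv.gz",
    "dbSNPbuild154Renamed.vcf.gz", "dbSNPbuild154Renamed.vcf.gz.tbi",
    "GRCh38_exome.bed.gz", "GRCh38_exome.bed.gz.tbi",
    "GRCh38.primary_assembly.genome.dict", "GRCh38.primary_assembly.genome.fa",
    "GRCh38.primary_assembly.genome.fa.fai",
    "Header.txt", "remapCOSMIC.txt", "remapNCBI.txt", "CommonAdapters.fa",
]


def missing_reference_files(data_dir, expect_mouse=True):
    """Return the names of expected files or index directories that are absent."""
    data_dir = Path(data_dir)
    index_dirs = ["GRCh38"] + (["GRCm38"] if expect_mouse else [])
    missing = [name for name in EXPECTED_FILES if not (data_dir / name).is_file()]
    missing += [name for name in index_dirs if not (data_dir / name).is_dir()]
    return missing
```

An empty list returned for ~/RNA2CM/data indicates that every entry of Table 1 is in place; set expect_mouse=False when the mouse genome steps are skipped.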
Procedure
CRITICAL In this pipeline, each step generates an output file that serves as the input for the next step,
which results in high disk space consumption. Although each file can be deleted during the
process after its final usage, it is recommended to keep intermediate files until the final output is reached,
as these intermediate files can be useful for troubleshooting.
CRITICAL For the following example code to work properly, do not change any of the reference file
names or locations.
Sample acquisition ● Timing 15–60 min

CRITICAL As an example, the analysis is performed on an RNA-seq sample from Collinson et al.38,
which is a single-end Illumina run of wild-type human embryonic stem cells of the H9 cell line; these
cells were cultured on a feeder layer of mouse embryonic fibroblasts.
CRITICAL Publicly available RNA-seq data can be retrieved from databases such as ENA and SRA.
1 Acquire fair-quality RNA-seq data of the cells to be inspected in gzipped FASTQ format
(fastq.gz suffix), and place the file (or pair of files in paired-end Illumina runs) in a new directory
for the analysis (here the directory is named SRR3090631 after the accession number of this
RNA-seq run).
$ mkdir ~/SRR3090631 # create the directory for the analysis
$ cd ~/SRR3090631
# download the example run file from the ENA database
$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR309/001/SRR3090631/SRR3090631.fastq.gz
CRITICAL STEP It is highly recommended to inspect the quality of FASTQ files with the FastQC
software, as results derived from low-quality data could be unreliable. A good standard for high-quality
RNA-seq data is a low level of overrepresented sequences (<1%), and per base sequence
quality scores that are >30 along the read. Low-quality bases at the end of Illumina reads are quite
common but, unlike low-quality bases in the middle of the read, they do not pose a problem as
they are trimmed in this pipeline.
? TROUBLESHOOTING
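
To make the per base quality criterion concrete, the sketch below decodes FASTQ quality strings and averages the scores at each read position. It is a simplified illustration and not a substitute for FastQC: the function names are assumptions of this example, and the standard Phred+33 encoding of current Illumina data is assumed.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]


def mean_quality_per_position(quality_strings):
    """Mean Phred score at each read position across a set of reads."""
    sums, counts = [], []
    for quality_string in quality_strings:
        for position, score in enumerate(phred_scores(quality_string)):
            if position == len(sums):           # first read reaching this position
                sums.append(0)
                counts.append(0)
            sums[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(sums, counts)]


# Two made-up reads of length 2: 'I' encodes Q40 and '5' encodes Q20.
print(mean_quality_per_position(["II", "I5"]))
```

Positions whose mean falls well below 30 in the middle of the read, rather than at its end, would be a reason to examine the run more closely.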
Variable assignment ● Timing 5 min

2 For better code readability and to eliminate typing errors, assign the sample’s name (without the
fastq.gz suffix) to a variable, as well as the number of threads for multithreading (this example uses
ten threads).
$ SAMPLE=SRR3090631 # name of the sample (name without “fastq.gz” extension)

$ THREADS=10 # number of processors for multi-threading
3 Also assign to variables the absolute paths to the STAR genome directories, the human
reference genome FASTA file, the intervals BED file, the dbSNP VCF, the COSMIC VCF and
the header file.
# absolute path to annotations and reference files

$ STAR_HUMAN_INDEX=~/RNA2CM/data/GRCh38
$ STAR_MOUSE_INDEX=~/RNA2CM/data/GRCm38
$ REFERENCE_GENOME=~/RNA2CM/data/GRCh38.primary_assembly.genome.fa
$ INTERVALS=~/RNA2CM/data/GRCh38_exome.bed.gz
$ DBSNP=~/RNA2CM/data/dbSNPbuild154Renamed.vcf.gz
$ COSMIC_VCF=~/RNA2CM/data/CosmicCodingMutsRenamed.vcf.gz
$ HEADER=~/RNA2CM/data/header.txt
CRITICAL STEP The protocol can be paused between any of the next steps, but the subsequent
code relies on the variables assigned in Steps 2 and 3. Every time you open a new terminal, you
should reassign these variables.
Preprocessing of RNA-seq data before variant discovery ● Timing 1–4 h

4 Remove residual adapter sequences from the reads as they can decrease alignment efficiency, and
crop low-quality bases from the edges of the reads to eliminate sources of noise that might affect
downstream analysis. Both tasks can be done simultaneously using Trimmomatic, which outputs a
‘trimmed’ fastq.gz file. Adapter trimming is enabled by providing a FASTA file with the sequences
that should be removed (the directory Trimmomatic-0.39 already contains files with common
adapter sequences).
$ java -jar /Trimmomatic-0.39/trimmomatic-0.39.jar SE -threads $THREADS ${SAMPLE}.fastq.gz ${SAMPLE}.trimmed.fastq.gz ILLUMINACLIP:/Trimmomatic-0.39/adapters/TruSeq3-SE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
? TROUBLESHOOTING
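
The effect of the SLIDINGWINDOW:4:15 and MINLEN:36 settings can be illustrated with a simplified sketch. This is not Trimmomatic's implementation, and details such as the exact cut position may differ; it only shows the idea of scanning from the 5′ end and cutting the read where the mean quality of a four-base window first drops below 15, then discarding reads shorter than 36 bases. The function names are assumptions of this example.

```python
def sliding_window_trim(qualities, window=4, min_mean=15):
    """Number of 5' bases kept by a simplified sliding-window quality trim.

    The read is cut at the start of the first window whose mean Phred score
    is below `min_mean`; reads shorter than the window are treated as one window.
    """
    n = len(qualities)
    if n == 0:
        return 0
    last_start = max(0, n - window)
    for start in range(last_start + 1):
        chunk = qualities[start:start + window]
        if sum(chunk) / len(chunk) < min_mean:
            return start
    return n


def survives_minlen(kept_length, min_length=36):
    """True if a trimmed read is long enough to be retained."""
    return kept_length >= min_length


# Made-up read: 40 high-quality bases followed by 4 low-quality bases.
kept = sliding_window_trim([30] * 40 + [2] * 4)
print(kept, survives_minlen(kept))
```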
5 Align the reads to the human reference genome using the STAR aligner in two-pass mode,
outputting a sorted-by-coordinate alignment file in BAM format.
$ STAR --runThreadN $THREADS --genomeDir $STAR_HUMAN_INDEX --readFilesIn ${SAMPLE}.trimmed.fastq.gz --outFileNamePrefix ${SAMPLE} --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat --outSAMattributes NM --twopassMode Basic --outFilterMultimapNmax 1 --outFilterMismatchNoverLmax 0.1
CRITICAL STEP If the cells were grown on mouse feeders, or there is any other potential source
of mouse cell contamination, it is essential to filter out RNA reads of murine origin, as they can be
a major source of false positives (Steps 6–7). If there is no such risk, you can skip to the variant
discovery stage (Step 8).
? TROUBLESHOOTING

6 Align the reads to the mouse reference genome to output an aligned-to-mouse genome BAM file.
$ STAR --runThreadN $THREADS --genomeDir $STAR_MOUSE_INDEX \
    --readFilesIn ${SAMPLE}.trimmed.fastq.gz --outFileNamePrefix ${SAMPLE}GRCm \
    --outSAMtype BAM SortedByCoordinate --readFilesCommand zcat \
    --outSAMattributes NM --twopassMode Basic --outFilterMultimapNmax 1 \
    --outFilterMismatchNoverLmax 0.1
7 Pass both human and mouse aligned BAM files to the XenofilteR tool. This will remove reads that
aligned better to the mouse genome than to the human genome, and will output a final BAM file
suitable for downstream variant discovery.
$ R
> library("XenofilteR")
> bp.param <- SnowParam(workers = 10, type = "SOCK") # number of threads (10 in this case)
> sample.list <- matrix(c('SRR3090631Aligned.sortedByCoord.out.bam',
    'SRR3090631GRCmAligned.sortedByCoord.out.bam'), ncol = 2)
> output.names <- c('SRR3090631')
> XenofilteR(sample.list, destination.folder = "./", bp.param = bp.param,
    MM_threshold = 8, output.names = output.names)
> quit()
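The two STAR runs in Steps 5 and 6 use identical options and differ only in the genome index and the output prefix. The sketch below makes that explicit by building both argument lists from one function. It is illustrative only: star_command is a hypothetical helper, and the index paths are placeholders rather than the locations used by the pipeline.

```python
def star_command(sample, genome_dir, threads, prefix_suffix=""):
    """Build the STAR argument list used in Steps 5 and 6 (hypothetical helper)."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", f"{sample}.trimmed.fastq.gz",
        "--outFileNamePrefix", f"{sample}{prefix_suffix}",
        "--outSAMtype", "BAM", "SortedByCoordinate",
        "--readFilesCommand", "zcat",
        "--outSAMattributes", "NM",
        "--twopassMode", "Basic",
        "--outFilterMultimapNmax", "1",
        "--outFilterMismatchNoverLmax", "0.1",
    ]


# Placeholder index locations; substitute the directories assigned in Step 3.
human = star_command("SRR3090631", "/path/to/GRCh38", 10)
mouse = star_command("SRR3090631", "/path/to/GRCm38", 10, prefix_suffix="GRCm")

# Positions at which the two commands differ: the index and the output prefix.
differences = [i for i, (h, m) in enumerate(zip(human, mouse)) if h != m]
for i in differences:
    print(human[i], "|", mouse[i])
```

Either list could be passed to subprocess.run(..., check=True) to launch the alignment. STAR appends Aligned.sortedByCoord.out.bam to the prefix, which yields the two file names supplied to XenofilteR in Step 7.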
Variant discovery ● Timing 1–3 h

CRITICAL The following steps outline how to identify discrepancies between your sample and the
reference genome, using the Genome Analysis Toolkit-4 (GATK4).

CRITICAL Since RNA-seq data cover only transcribed regions, this procedure supplies GATK4 tools
with an intervals file in BED format describing all transcribed regions of the genome (in the -L option).
This option is available with the tools SplitNCigarReads, BaseRecalibrator, ApplyBQSR and
HaplotypeCaller; it is not mandatory, but it significantly reduces computation time.
8 Mark duplicate reads in the filtered BAM file using MarkDuplicates of Picard tools (if you did
not perform murine read removal, supply in the --I option the aligned-to-human BAM file
generated in Step 5, ${SAMPLE}Aligned.sortedByCoord.out.bam).
$ gatk MarkDuplicates --CREATE_INDEX true --I ./Filtered_bams/${SAMPLE}_Filtered.bam \
    --O ${SAMPLE}marked_duplicates.bam --VALIDATION_STRINGENCY SILENT \
    --M ${SAMPLE}marked_dup_metrics.txt
9 Using GATK’s SplitNCigarReads, modify the marked-duplicates BAM file to split reads that
contain Ns in their CIGAR string (due to splice junctions within the reads).
$ gatk SplitNCigarReads -L $INTERVALS -R $REFERENCE_GENOME \
    -I ${SAMPLE}marked_duplicates.bam -O ${SAMPLE}splitN.bam
10 Add read groups to the aligned reads in the BAM file using AddOrReplaceReadGroups of Picard
tools. This operation is needed because GATK4 requires the BAM headers to contain this piece of
information, which is not generated by STAR during the alignment process.
$ gatk AddOrReplaceReadGroups --CREATE_INDEX true --I ${SAMPLE}splitN.bam \
    --O ${SAMPLE}.grouped.bam --RGID rnasq --RGLB lb --RGPL illumina \
    --RGPU pu --RGSM $SAMPLE
11 Perform base recalibration with GATK’s BaseRecalibrator on the BAM file, by supplying the tool
with annotation files of known polymorphisms in VCF format. This will output a recalibration table
used in the next step.

$ gatk BaseRecalibrator -L $INTERVALS -I ${SAMPLE}.grouped.bam \
    --use-original-qualities -R $REFERENCE_GENOME --known-sites $DBSNP \
    -O ${SAMPLE}.recal_data.table
12 Apply base recalibration with GATK’s ApplyBQSR.
$ gatk ApplyBQSR -L $INTERVALS -R $REFERENCE_GENOME -I ${SAMPLE}.grouped.bam \
    --use-original-qualities --add-output-sam-program-record \
    --bqsr-recal-file ${SAMPLE}.recal_data.table -O ${SAMPLE}.recal_output.bam
13 Call variants using GATK’s HaplotypeCaller.
$ gatk HaplotypeCaller -L $INTERVALS -R $REFERENCE_GENOME \
    -I ${SAMPLE}.recal_output.bam -O ${SAMPLE}.output.vcf.gz \
    --dont-use-soft-clipped-bases --pcr-indel-model AGGRESSIVE
This is the final step of the variant calling that will output a file in VCF format describing all the
discrepancies of your sample compared with the reference genome.
CRITICAL STEP If during the library preparation the sample was not PCR amplified in any step,
change the ‘--pcr-indel-model’ option from ‘AGGRESSIVE’ to ‘NONE’.
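Since the protocol can be paused between steps, it may be useful to check which intermediate files of the variant discovery stage already exist. The sketch below lists the output written by each of Steps 8–13, following the file names used in the commands above, and reports the first step whose output is absent. The helper next_step is hypothetical and not part of RNA2CM.

```python
import os

# Output file written by each step of the variant discovery stage,
# following the naming used in the commands of Steps 8-13.
STEP_OUTPUTS = [
    (8, "{s}marked_duplicates.bam"),
    (9, "{s}splitN.bam"),
    (10, "{s}.grouped.bam"),
    (11, "{s}.recal_data.table"),
    (12, "{s}.recal_output.bam"),
    (13, "{s}.output.vcf.gz"),
]


def next_step(sample, directory="."):
    """Return the first step whose output file is missing, or None if all exist."""
    for step, pattern in STEP_OUTPUTS:
        path = os.path.join(directory, pattern.format(s=sample))
        if not os.path.exists(path):
            return step
    return None


print(next_step("SRR3090631"))
```

Note that the presence of a file does not prove that the step finished successfully; a file left by an interrupted run should be regenerated.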
Hard filtering and annotation of the called variants ● Timing 1–2 h

14 To eliminate false positives, use GATK’s VariantFiltration to mark in the VCF all variant clusters of
three or more variants within a window of 35 bp. Also mark any variant that meets one or
both of the following conditions: Fisher Strand (FS) >30 and Quality by Depth (QD) <2. The marked
variants are removed in Step 15.
$ gatk VariantFiltration --R $REFERENCE_GENOME --V ${SAMPLE}.output.vcf.gz \
    --window 35 --cluster 3 --filter-name "FS" --filter "FS > 30.0" \
    --filter-name "QD" --filter "QD < 2.0" -O ${SAMPLE}.hardfilter.vcf.gz
15 Use bcftools to exclude all variants that did not pass the hard filter, and filter out positions covered
by fewer than ten reads and alternative alleles covered by fewer than five reads.
$ bcftools view --threads $THREADS \
    -i 'FILTER="PASS" && FORMAT/DP >= 10 && FORMAT/AD[:1] >= 5' \
    --output-type z --output-file ${SAMPLE}filtered.vcf.gz ${SAMPLE}.hardfilter.vcf.gz
$ tabix ${SAMPLE}filtered.vcf.gz # index vcf
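The thresholds of Steps 14 and 15 can be read as a single keep-or-discard rule per variant. The sketch below restates them in plain Python to make the logic easy to inspect. It is only an illustration: the function names are hypothetical, the example values are invented, and the cluster filter (three or more variants within 35 bp) is not modeled.

```python
def hard_filter_labels(fs, qd):
    """Return the filter names a variant would receive in Step 14 (FS and QD only)."""
    labels = []
    if fs > 30.0:   # Fisher Strand: evidence of strand bias
        labels.append("FS")
    if qd < 2.0:    # Quality by Depth: low confidence relative to coverage
        labels.append("QD")
    return labels


def keep_variant(fs, qd, depth, alt_depth):
    """Apply the Step 15 rule: passes the hard filters, DP >= 10 and ALT reads >= 5."""
    return not hard_filter_labels(fs, qd) and depth >= 10 and alt_depth >= 5


# Invented example values, for illustration only.
print(keep_variant(fs=1.2, qd=15.0, depth=71, alt_depth=28))   # well-supported variant
print(keep_variant(fs=45.0, qd=15.0, depth=71, alt_depth=28))  # fails the FS filter
print(keep_variant(fs=1.2, qd=15.0, depth=40, alt_depth=4))    # too few ALT reads
```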
16 Add gene names to the variants by supplying the GRCh38_exome.bed file to bcftools’s annotate
function. The -h option adds headers to the VCF that are necessary for importing annotations in
this and the subsequent steps. Next, index the annotated VCF.
$ bcftools annotate --threads $THREADS -a $INTERVALS -h $HEADER \
    -c CHROM,FROM,TO,Gene --output-type z --output ${SAMPLE}named.vcf.gz \
    ${SAMPLE}filtered.vcf.gz
$ tabix ${SAMPLE}named.vcf.gz
17 Using the dbSNP VCF file, annotate the variants in your sample by assigning the known variants
with an RS-ID (dbSNP unique identifier), and import to the sample’s VCF the information about
the prevalence of each SNP, into a new field named ‘COMMON’. This field determines for each
variant if it has an allelic frequency >1% in at least one defined population. This can be done using
bcftools’s annotate function.
$ bcftools annotate --threads $THREADS -a $DBSNP -c INFO/RS,INFO/COMMON \
    --output-type z --output ${SAMPLE}dbSNP.vcf.gz ${SAMPLE}named.vcf.gz
$ tabix ${SAMPLE}dbSNP.vcf.gz
? TROUBLESHOOTING
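A common cause of annotation errors is a mismatch in chromosome naming between the sample VCF and the annotation VCF (for example ‘chr17’ versus ‘17’). The sketch below reads the chromosome name of the first record in a gzip-compressed VCF, so that the naming style of two files can be compared quickly. The helper first_contig is hypothetical; it inspects only the first record and therefore detects differences in naming style, not missing chromosomes.

```python
import gzip


def first_contig(vcf_gz_path):
    """Return the chromosome name of the first record in a gzipped VCF, or None."""
    with gzip.open(vcf_gz_path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):      # skip meta-information and header lines
                return line.split("\t", 1)[0]
    return None
```

For example, comparing first_contig('SRR3090631named.vcf.gz') with the value returned for the annotation VCF would reveal whether one file uses the ‘chr’ prefix and the other does not.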

18 Annotate your sample by importing data from the COSMIC VCF. Import the ID field containing
the stable COSMIC ID of each cancer-related mutation and the info field named ‘CNT’ that
represents the number of cancer samples presenting this specific mutation.
$ bcftools annotate --threads $THREADS -a $COSMIC_VCF -c ID,INFO/CNT \
    --output-type z --output ${SAMPLE}.annotated.vcf.gz ${SAMPLE}dbSNP.vcf.gz
$ tabix ${SAMPLE}.annotated.vcf.gz
? TROUBLESHOOTING
19 At this point, the original VCF output of GATK’s HaplotypeCaller has been annotated with gene
names and filtered to remove unreliable variants; common SNPs have been marked using dbSNP, and
mutations that appear in cancer have been marked using the COSMIC VCF. Use bcftools’s query
function to export the data to a table containing columns with the necessary information for
downstream cancer mutation identification.
$ bcftools query -H \
    -f '%CHROM\t%POS\t%REF\t%ALT\t%INFO/Gene\t%ID\t%INFO/CNT\t%INFO/RS\t%INFO/COMMON\t[%AD]\n' \
    ${SAMPLE}.annotated.vcf.gz > ${SAMPLE}varTable.tsv
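The -H option writes a header in which every column name carries an index prefix, which is why the Python code in the following steps refers to columns such as ‘[6]ID’ and ‘[7]CNT’. As a small illustration, the sketch below strips these prefixes from a header line written in the form that the later steps expect. The function clean_column is hypothetical and is not used by the protocol itself.

```python
import re


def clean_column(name):
    """Strip the optional leading '#', whitespace and '[n]' index from a column name."""
    return re.sub(r"^#?\s*\[\d+\]", "", name)


# Header in the form expected by Steps 20-26 for the example sample.
header = (
    "# [1]CHROM\t[2]POS\t[3]REF\t[4]ALT\t[5]Gene\t[6]ID\t"
    "[7]CNT\t[8]RS\t[9]COMMON\t[10]SRR3090631:AD"
)
columns = [clean_column(c) for c in header.split("\t")]
print(columns)
```

The pattern tolerates the presence or absence of a space after the leading ‘#’, so it should not depend on the exact header layout.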
Identification of high-confidence pathogenic cancer-related mutations within the sample ● Timing 15 min
20 Run the following Python script, which ensures that only variants that appear in 20 or more cancer
tumors of the COSMIC database are retained in the variant table (the CNT column specifies the
number of tumors in which the mutation was present).
$ python3
> import pandas as pd
> SAMPLE = 'SRR3090631'
> df = pd.read_csv(f"{SAMPLE}varTable.tsv", sep='\t')
> df = df[df['[6]ID'] != '.']
> df['[7]CNT'] = df['[7]CNT'].astype("int32")
> df = df[df['[7]CNT'] >= 20]
21 Remove from the variant table all the entries annotated as ‘common SNPs’, as these variants are
unlikely to be pathogenic.
> df = df[df['[9]COMMON'] != '1']
22 Merge the COSMIC Census table (obtained in Step 2 of the equipment setup) with your variant
table, on the basis of the COSMIC ID that corresponds to the ‘GENOMIC_MUTATION_ID’
column in the COSMIC census table (importing information about the pathogenicity, coding effect
and experimental evidence of each variant).
> mutations = pd.read_csv('~/RNA2CM/data/CosmicMutantExportCensus.tsv.gz',
    sep='\t', compression='gzip', encoding='latin1')
> mutations = mutations[['Tier', 'GENOMIC_MUTATION_ID', 'Mutation AA',
    'Mutation Description', 'FATHMM prediction', 'FATHMM score']]
> df = df.merge(mutations, left_on='[6]ID', right_on='GENOMIC_MUTATION_ID')
23 Filter the variants to leave only mutations in genes that belong to Tier 1 (‘Tier’ column) of the COSMIC
census genes. Tier 1 genes must possess experimental evidence for their role in cancer formation in
addition to documented mutations that change the gene product’s activity to promote oncogenic
transformation, in contrast to Tier 2 genes that are classified as cancer genes using less strict criteria.
> df = df[df['Tier'] == 1]

Table 2 | Description of the final output table
Column name | Description
Gene | Symbol of the gene harboring the mutation
CHROM | Chromosome of the mutation
POS | Chromosomal coordinate for the position of the mutation
REF | Reference sequence in the position
ALT | The sequence identified in the analyzed sample
COSMIC CNT | In how many COSMIC tumors this mutation has been identified
RS (dbSNP) | The dbSNP identifier of this mutation (if one exists)
GENOMIC MUTATION ID | COSMIC identifier (can be used to retrieve further information from the COSMIC database)
Mutation AA | Amino-acid change resulting from the mutation
Mutation Description | Type of effect this mutation has on the translated protein
FATHMM Score | FATHMM predicted pathogenicity score (0–1) of the mutation
AD | Allelic depth; how many reads show the REF (left) and ALT (right) allele (can be used to infer zygosity or heterogeneity of the population)

Table 3 | Final output table summarizing the anticipated results from the provided example
Gene | CHROM | POS | REF | ALT | COSMIC CNT | RS (dbSNP) | GENOMIC MUTATION ID | Mutation_AA | Mutation Description | FATHMM score | AD
TP53 | chr17 | 7673821 | G | A | 65 | 55832599 | COSV52678166 | p.R267W | Substitution - Missense | 0.98159 | 43,28
24 Remove mutations that are described as ‘Substitution - coding silent’ in the ‘Mutation Description’
column.
> df = df[df['Mutation Description'] != 'Substitution - coding silent']
25 Leave only entries described as ‘PATHOGENIC’ in the ‘FATHMM prediction’ column.
> df = df[df['FATHMM prediction'] == 'PATHOGENIC']
26 At this stage, the analysis is done, and all that is left is reformatting the table for better readability
and saving the final table to a CSV file that can be opened with Microsoft Excel or any text
editor. The following lines of code remove duplicated entries, then reorder and rename the columns
of the table before writing it to a file.
> df = df.drop_duplicates(subset='[6]ID')
> df = df[['[5]Gene', '# [1]CHROM', '[2]POS', '[3]REF', '[4]ALT', '[7]CNT',
    '[8]RS', 'GENOMIC_MUTATION_ID', 'Mutation AA', 'Mutation Description',
    'FATHMM score', f'[10]{SAMPLE}:AD']]
> df = df.rename(columns={'[5]Gene': 'Gene', '# [1]CHROM': 'CHROM', '[2]POS': 'POS',
    '[3]REF': 'REF', '[4]ALT': 'ALT', '[7]CNT': 'COSMIC_CNT', '[8]RS': 'RS(dbSNP)',
    'Mutation AA': 'Mutation_AA', 'Mutation Description': 'Mutation_Description',
    f'[10]{SAMPLE}:AD': 'AD'})
To save the table as a CSV file, type:
> df.to_csv(f'{SAMPLE}_cancer_mutations.csv', index=False)
The cancer_mutations.csv table details the cancer mutations identified in the sample (Table 2).
The hPSC sample used in this example harbors a single mutation in the TP53 gene, resulting in a
table named SRR3090631_cancer_mutations.csv that has only a single row (Table 3).
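The AD column of the output table can be turned into an approximate fraction of reads that carry the mutation, which is one way to reason about zygosity or heterogeneity of the culture. The sketch below does this for the AD value reported in Table 3. The function name alt_allele_fraction is hypothetical, and the calculation assumes a single ALT allele.

```python
def alt_allele_fraction(ad):
    """Fraction of reads supporting the ALT allele, from an AD string 'REF,ALT'."""
    ref_reads, alt_reads = (int(x) for x in ad.split(",")[:2])
    total = ref_reads + alt_reads
    return alt_reads / total if total else float("nan")


fraction = alt_allele_fraction("43,28")  # AD value of the TP53 mutation in Table 3
print(round(fraction, 2))
```

For the example this gives roughly 0.39 (28 of 71 reads). Such values should be interpreted with caution, since allele-specific expression can shift RNA-based fractions away from the underlying DNA allele frequency.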

Troubleshooting
Troubleshooting advice can be found in Table 4.
Table 4 | Troubleshooting table
Step | Problem | Possible reason | Solution
Equipment setup, 2 | The workstation does not have access to an internet browser | Some workstations and computer clusters do not have a graphical interface and are accessible only via the command line | Download the files via scripted download. Instructions are available at the COSMIC website
Equipment setup, 7 | Error generated: ‘usr/bin/env “python”: No such file or directory’ | The GATK wrapper script does not recognize the python interpreter | Change ‘python’ to ‘python3’ in the first line of the wrapper script (~/RNA2CM/bin/gatk-4.1.8.0/gatk)
Equipment setup, 8 | Tabix generates an error | The VCF file is corrupted | Make sure the VCF download was completed successfully and that the file is not truncated (this can be done by comparing the checksum hash string of the downloaded file with the hash published on the download page using the md5sum program)
1 | RNA-seq reads are not in a compressed format, or not in FASTQ format | Reads are in an uncompressed FASTQ file or in an unaligned BAM format | Use bgzip to compress the file (fastq.gz) or bedtools bamtofastq to convert to FASTQ format
1 | The fastQC quality check shows poor-quality parameters | Library preparation was performed with low-quality RNA or the sequencing run malfunctioned | Run Trimmomatic (Step 4), and reinspect the quality of the reads. Repeat experiment if quality is still low
4 | No reads pass the Trimmomatic filter | The read length is shorter than 36 bp | Adjust the ‘MINLEN’ parameter in the Trimmomatic command to reduce the minimal read length accepted
4–6 | The sequencing data are divided into two files | You might have paired-end sequencing data | View the Trimmomatic and STAR manuals for paired-end syntax, and adjust Steps 4–6 accordingly
15 and 16 | bcftools annotate generates an error | There is a discrepancy between the names of chromosomes in the VCF and the reference genome FASTA file, or between the names of the headers in the VCF and the headers in the header.txt file | Make sure the chromosome names in the annotation VCF are identical to the names in the reference genome file and that the headers in the header.txt file match the headers in the annotation VCF
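The checksum comparison suggested in Table 4 for truncated downloads can also be done without the md5sum program. The following is a minimal sketch using Python’s standard hashlib module; the function names are hypothetical, and the published hash must be copied by hand from the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, published_md5):
    """Compare the file's MD5 digest with the hash published by the provider."""
    return md5_of_file(path) == published_md5.strip().lower()
```

Reading in chunks keeps memory use low, which matters for multi-gigabyte VCF files.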
Timing
Equipment setup
Steps 1–4, data download: 1–2 h
Steps 5–8, reference data setup: 2–4 h
Procedure
Step 1, sample acquisition: 15–60 min (for downloading of most RNA-seq samples from the European
Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena/browser/home) database)
Steps 2 and 3, variable assignment: 5 min
Steps 4–7, preprocessing of RNA-seq data before variant discovery: 1–4 h
Steps 8–13, variant discovery: 1–3 h
Steps 14–19, hard filtering and variant annotation: 1–2 h
Steps 20–26, identification of cancer-related mutations: 30 min
Anticipated results
The main output of this pipeline is the cancer_mutations.csv table, but additional files produced by the
pipeline should be used for quality control of the analysis. For example, the intermediate files can be used
for assessing the number of variants initially generated and the number of variants filtered along the
different steps of the analysis. The following description of the example data can be used as a benchmark
for expected numbers of variants and the proportion of variants eventually filtered out. The output.vcf.gz
file is the raw unfiltered variant calling file. This file will typically contain ~50,000 (low-confidence)
variants. The varTable.tsv file is the filtered (and annotated) derivative of output.vcf.gz, and typically
contains 30–50% of the original variants. Among these filtered variants, it is expected that ~95% will be
annotated in dbSNP and ~85% will be common SNPs (represented in >1% of the population). In the
likely (and typically desired) case of the absence of any cancer mutations, the cancer_mutations.csv table
will be completely empty. When cancer mutations are identified in the sample, the
cancer_mutations.csv table provides extensive information about each mutation in its different columns.
A description of the table and explanations of its fields are available in Table 2.
In the provided example, variant calling on the ESC sample (SRR3090631) identified ~45,000
variants, which were reduced to ~18,300 high-confidence variants after hard filtering and application of
the minimal coverage requirement. Of these variants, ~17,350 are annotated in dbSNP and
~15,750 are considered common SNPs. The pipeline identified one cancer-related mutation in this
sample, in the known tumor suppressor gene TP53. It is a single-nucleotide G-to-A
substitution that results in a missense substitution (arginine to tryptophan) in the translated protein.
Table 3 summarizes the results for the example sample given in the protocol.
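As a quick consistency check, the approximate counts reported for the example sample can be converted into the proportions quoted as benchmarks. The short calculation below uses only the rounded numbers given in the text, so the results are themselves approximate.

```python
# Approximate counts reported for the example sample (SRR3090631).
called = 45_000     # variants in the raw output.vcf.gz
retained = 18_300   # variants left after hard filtering and coverage requirements
in_dbsnp = 17_350   # retained variants annotated in dbSNP
common = 15_750     # retained variants flagged as common SNPs

retained_fraction = retained / called
dbsnp_fraction = in_dbsnp / retained
common_fraction = common / retained

print(f"retained after filtering: {retained_fraction:.1%}")
print(f"annotated in dbSNP:       {dbsnp_fraction:.1%}")
print(f"common SNPs:              {common_fraction:.1%}")
```

The three values come to about 41%, 95% and 86%, in line with the expected 30–50%, ~95% and ~85%.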
Data availability
The RNA-seq sample used as an example in this protocol can be retrieved from the Sequence Read
Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra) under the accession number
SRR3090631.
Code availability
All the code used in this protocol is available at https://github.com/elyadlezmi/RNA2CM (ref. 39).
References
1. De Los Angeles, A. et al. Hallmarks of pluripotency. Nature 525, 469–478 (2015).
2. Tabar, V. & Studer, L. Pluripotent stem cells in regenerative medicine: challenges and recent progress. Nat.
Rev. Genet. 15, 82–92 (2014).
3. Avior, Y., Sagi, I. & Benvenisty, N. Pluripotent stem cells in disease modelling and drug discovery. Nat. Rev.
Mol. Cell Biol. 17, 170–182 (2016).
4. Shahbazi, M. N., Siggia, E. D. & Zernicka-Goetz, M. Self-organization of stem cells into embryos: a window
on early mammalian development. Science 364, 948–951 (2019).
5. Weissbein, U., Benvenisty, N. & Ben-David, U. Genome maintenance in pluripotent stem cells. J. Cell Biol.
204, 153–163 (2014).
6. Bar, S. & Benvenisty, N. Epigenetic aberrations in human pluripotent stem cells. EMBO J. 38, 1–18 (2019).
7. Na, J., Baker, D., Zhang, J., Andrews, P. W. & Barbaric, I. Aneuploidy in pluripotent stem cells and
implications for cancerous transformation. Protein Cell 5, 569–579 (2014).
8. Jo, H. Y. et al. Functional in vivo and in vitro effects of 20q11.21 genetic aberrations on hPSC differentiation.
Sci. Rep. 10, 1–14 (2020).
9. Ben-David, U. & Benvenisty, N. The tumorigenicity of human embryonic and induced pluripotent stem cells.
Nat. Rev. Cancer 11, 268–277 (2011).
10. Ben-David, U. et al. Aneuploidy induces profound changes in gene expression, proliferation and
tumorigenicity of human pluripotent stem cells. Nat. Commun. 5, 4825 (2014).
11. Simonson, O. E., Domogatskaya, A., Volchkov, P. & Rodin, S. The safety of human pluripotent stem cells in
clinical treatment. Ann. Med. 47, 370–380 (2015).
12. Gore, A. et al. Somatic coding mutations in human induced pluripotent stem cells. Nature 471, 63–67 (2011).
13. Merkle, F. T. et al. Human pluripotent stem cells recurrently acquire and expand dominant negative P53
mutations. Nature 545, 229–233 (2017).
14. Avior, Y., Lezmi, E., Eggan, K. & Benvenisty, N. Cancer-related mutations identified in primed human
pluripotent stem cells. Cell Stem Cell 28, 10–11 (2021).
15. Stirparo, G. G., Smith, A. & Guo, G. Cancer-related mutations are not enriched in naive human pluripotent
stem cells. Cell Stem Cell 28, 164–169.e2 (2021).
16. Halliwell, J., Barbaric, I. & Andrews, P. W. Acquired genetic changes in human pluripotent stem cells: origins
and consequences. Nat. Rev. Mol. Cell Biol. 21, 715–728 (2020).
17. Merkle, F. T. et al. Biological insights from the whole genome analysis of human embryonic stem cells.
Preprint at bioRxiv https://doi.org/10.1101/2020.10.26.337352 (2020).
18. Trounson, A. & DeWitt, N. D. Pluripotent stem cells progressing to the clinic. Nat. Rev. Mol. Cell Biol. 17,
194–200 (2016).
19. Tate, J. G. et al. COSMIC: the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 47,
D941–D947 (2019).
20. Shihab, H. A. et al. Predicting the functional, molecular, and phenotypic consequences of amino acid
substitutions using hidden Markov models. Hum. Mutat. 34, 57–65 (2013).
21. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
22. Auton, A. et al. A global reference for human genetic variation. Nature 526, 68–74 (2015).
23. Yizhak, K. et al. RNA sequence analysis reveals macroscopic somatic clonal expansion across normal tissues.
Science 364, eaaw0726 (2019).
24. Coudray, A., Battenhouse, A. M., Bucher, P. & Iyer, V. R. Detection and benchmarking of somatic mutations
in cancer genomes using RNA-seq data. PeerJ 6, (2018).
25. Weissbein, U., Schachter, M., Egli, D. & Benvenisty, N. Analysis of chromosomal aberrations and
recombination by allelic bias in RNA-Seq. Nat. Commun. 7, 12144 (2016).
26. Radenbaugh, A. J. et al. RADIA: RNA and DNA integrated analysis for somatic mutation detection.
PLoS One 9, e111516 (2014).
27. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nat. Biotechnol. 35,
316–319 (2017).
28. Merkel, D. Docker: lightweight Linux containers for consistent development and deployment. Linux J.
239, 2 (2014).
29. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
30. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
31. Danecek, P. & McCarthy, S. A. BCFtools/csq: haplotype-aware variant consequences. Bioinformatics 33,
2037–2039 (2017).
32. Li, H. Tabix: fast retrieval of sequence features from generic TAB-delimited files. Bioinformatics 27,
718–719 (2011).
33. Bolger, A. M., Lohse, M. & Usadel, B. Trimmomatic: a flexible trimmer for Illumina sequence data.
Bioinformatics 30, 2114–2120 (2014).
34. Brouard, J.-S., Schenkel, F., Marete, A. & Bissonnette, N. The GATK joint genotyping workflow is
appropriate for calling variants in RNA-seq experiments. J. Anim. Sci. Biotechnol. 10, 44 (2019).
35. Frankish, A. et al. GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Res.
47, D766–D773 (2019).
36. Sondka, Z. et al. The COSMIC Cancer Gene Census: describing genetic dysfunction across all human cancers.
Nat. Rev. Cancer 18, 696–705 (2018).
37. Kluin, R. J. C. et al. XenofilteR: computational deconvolution of mouse and human reads in tumor xenograft
sequence data. BMC Bioinformatics 19, 366 (2018).
38. Collinson, A. et al. Deletion of the polycomb-group protein EZH2 leads to compromised self-renewal and
differentiation defects in human embryonic stem cells. Cell Rep. 17, 2700–2714 (2016).
39. Lezmi, E. Identification of cancer-related mutations in human pluripotent stem cells utilizing RNA-seq
analysis. elyadlezmi/RNA2CM https://doi.org/10.5281/zenodo.4810015 (2021).
Acknowledgements
We thank S. Kinreich and A. Pagis for testing the pipeline and providing their constructive input and all members of The Azrieli Center
for Stem Cells and Genetic Research for critical reading of the manuscript. This work was partially supported by the Israel Science
Foundation (494/17), the Rosetrees Trust, and Azrieli Foundation. N.B. is the Herbert Cohn Chair in Cancer Research.
Author contributions
E.L. and N.B. designed the analysis. E.L. developed the bioinformatic pipeline and wrote the manuscript with input from N.B., who
supervised the work.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to E.L. or N.B.
Peer review information Nature Protocols thanks Anna Esteve-Codina and the other, anonymous reviewer(s) for their contribution to
the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Received: 16 December 2020; Accepted: 16 June 2021;
Published online: 6 August 2021
Related links
Key references using this protocol
Merkle, F. et al. Nature 545, 229–233 (2017): https://doi.org/10.1038/nature22312
Avior, Y. et al. Cell Stem Cell 28, 10–11 (2021): https://doi.org/10.1016/j.stem.2020.11.013
