Identification of Cancer-Related Mutations in Human Pluripotent Stem Cells Using RNA-seq Analysis
https://doi.org/10.1038/s41596-021-00591-5
github.com/elyadlezmi/RNA2CM. The protocol requires minimal command-line skills and can be carried out in 1–2 d.
Introduction
The Azrieli Center for Stem Cells and Genetic Research, The Hebrew University of Jerusalem, Jerusalem, Israel. ✉e-mail: elyad.lezmi@mail.huji.ac.il;
nissimb@mail.huji.ac.il
Fig. 1 | Relevance of cancer-related point mutations in hPSCs. Outlined are the potential consequences of
pluripotent stem cells carrying cancer-related mutations in various hPSC applications.
Labels recoverable from the figure: developmental biology; disease modeling; cell therapy; aberrant
differentiation capacity; mutation burden increase with passaging; … syndrome (TP53 mutation).
cultured on mouse feeders, and the potential consequences of contaminating mouse DNA or RNA on
mutation identification need to be addressed when developing pipelines to detect hPSC mutations
acquired in culture. New point mutations and other genomic aberrations are continuously identified
in hPSCs as next-generation sequencing methodologies improve16,17, supporting the importance of
routine quality control of these cells. Future analysis of cancer-related mutations in low and high
passage numbers of pluripotent stem cells will enable us to determine more precisely the rate at which
these mutations accumulate during cell growth in culture.
Protocol overview
The high-throughput sequencing era has brought enormous amounts of data about variations in the
human DNA sequence. Databases such as the Catalogue of Somatic Mutations in Cancer (COSMIC)19
and software like Functional Analysis through Hidden Markov Models (FATHMM)20 have
greatly enhanced our ability to characterize the harmful potential of different mutations. In addition,
databases such as dbSNP21 and the 1000 Genomes Project22 provide extensive information about
genomic polymorphism in the general population. This protocol takes advantage of the present
feasibility of RNA-seq technology, the increasing availability of excellent bioinformatic tools and the
constant growth of publicly available knowledge.
Fig. 2 | Schematic representation of the pipeline for identification of cancer-related mutations in hPSCs from RNA-seq data. a, Colored boxes
represent the main stages of the procedure; the tools used in each stage are between parentheses in italics. b, Schematic representation of the pipeline
where each step is represented by a box; input and output files are between the steps in italics.
Panel a, recoverable labels: Steps 1–7, read processing and alignment to reference (Trimmomatic, STAR, XenofilteR); Steps 14–19, variant
filtration and annotation (GATK4, Bcftools).
Panel b, recoverable labels: RNA-seq experiment (hESC RNA sample); low-quality base trimming; filter out mouse reads; base quality
recalibration; hard filtering; variant annotation; keep mutations identified in >20 cancer tumors; discard common SNPs; keep nonsilent
mutations in Tier 1 census genes. Files named in the panel: Sample.fastq, Trimmed.fastq, GRCh38, GRCm38, toHuman.bam, toMouse.bam,
SplitN.bam, Recalibrated.bam, Variants.vcf, Filtered.vcf, varTable.tsv, dbSNP154.vcf, CosmicMutations.vcf, CosmicCensus.tsv.
In the protocol, we discuss best practices for the processing of hPSC RNA-seq data, including
eliminating sources of noise such as poorly sequenced reads, or reads that originate from murine
RNA of feeder cells. This is followed by restrictive variant calling that reveals discrepancies with the
reference genome. Next, we filter out all polymorphic sites with a minor allele frequency >1%, in at
least one of the populations defined by the 1000 Genomes Project, as these are very unlikely to be
pathogenic. Finally, we compare the remaining variants with validated somatic mutations that were
identified in a reasonable number of cancer tumors in the COSMIC dataset, and are predicted to be
highly pathogenic by FATHMM.
This pipeline requires an RNA-seq sample as input, and outputs a table with detailed information
about cancer-related mutations found in the input sample (Fig. 2).
Materials
CRITICAL The materials list represents the minimal requirements for execution of the pipeline locally
and manually (not recommended). Alternatively, all steps can be streamlined via Nextflow27 and
containerized with Docker28, enabling the user to skip any software installation and computational
environment setup (Box 1).
Hardware
● A workstation or computer cluster, running a 64-bit UNIX-based operating system, with 30 GB of free
hard disk space and at least 64 GB of available random-access memory
CRITICAL The main limiting factor for the throughput of this pipeline is the number of available CPU threads.
Software
CRITICAL Note that some components might be already present on your system. The pipeline was
tested with the specified software versions but is not necessarily restricted to these specific versions.
● Python3 v3.6.9 (https://www.python.org)
● Java 8 (https://www.java.com)
● STAR29 v2.5.4b (https://github.com/alexdobin/STAR)
● SAMtools30 v1.7-2 (http://www.htslib.org)
● BCFtools31 v1.7-2 (http://www.htslib.org)
● Tabix32 v1.7-2 (http://www.htslib.org)
Box 1 | Automated usage of the pipeline using the RNA2CM software tool
Software
RNA2CM (https://github.com/elyadlezmi/RNA2CM)
Nextflow (https://www.nextflow.io)
Docker (https://www.docker.com)
Singularity (optional, required only for execution on computer clusters managed by the SLURM workload manager) (https://sylabs.io/singularity)
Procedure
Equipment setup
CRITICAL This setup replaces all steps of the Equipment Setup and is performed only once before the first time the pipeline is run.
1 Nextflow and Docker (use Singularity instead of Docker for execution on SLURM-clusters) are the only prerequisites for the RNA2CM tool.
Install both, and make sure they are running properly on your system. If the following commands do not generate any error message, the
installation has been successful.
2 Download and extract the project directory (using either git or an internet browser):
3 Download the files CosmicMutantExportCensus.tsv.gz and CosmicCodingMuts.vcf.gz from the COSMIC website (login required), then move them
into the project’s subdirectory named data (RNA2CM/data).
4 Execute the script named setup.nf, which is responsible for setting up all the reference data and will complete the installation (this might take a
while).
Pipeline execution
CRITICAL This step replaces Steps 2–26 of the Procedure.
Data
● Primary assembly of the human reference genome (GRCh38 build) in FASTA format35 (https://www.gencodegenes.org/human)
● Gene annotation file for the primary assembly of the human reference genome in GTF format35 (https://www.gencodegenes.org/human)
● dbSNP data of polymorphic sites in the GRCh38 build of the human reference genome, in VCF format21 (https://www.ncbi.nlm.nih.gov/snp)
● Data of all coding mutations from COSMIC in VCF format19; the file is named CosmicCodingMuts.vcf.gz (https://cancer.sanger.ac.uk/cosmic)
Other programs and data necessary in case of potential contamination by murine reads (as in the
case of cells cultured on mouse feeder cells)
● R 3.4 (https://www.r-project.org)
● XenofilteR37 v.1.6 (https://github.com/PeeperLab/XenofilteR)
● Primary assembly of the mouse reference genome in FASTA format35 (https://www.gencodegenes.org/mouse)
● Gene annotation file for the primary assembly of the mouse reference genome in GTF format35 (https://www.gencodegenes.org/mouse)
Equipment setup
CRITICAL In all subsequent code examples, the commands to be typed in the shell terminal are after
the $ sign (or > sign in python and R shells), while comments appear after the # sign.
CRITICAL The equipment setup is performed only once before the first time the procedure is run.
$ cd ~
RNA2CM contains scripts for automated execution of both the preliminary setup and the
procedure. Box 1 describes the steps covered by each script and provides detailed usage
instructions.
2 Enter the COSMIC website (https://cancer.sanger.ac.uk/cosmic), and create a username and password
for registered login. Log in with your username, and download the VCF file of all coding mutations
named CosmicCodingMuts.vcf.gz, and the tab-separated file of all cancer mutations in census genes
named CosmicMutantExportCensus.tsv.gz. Transfer both files into the data subdirectory.
? TROUBLESHOOTING
3 Download the human and mouse reference genomes and the matching annotation files from the
GENCODE website into the data subdirectory, and unzip the files. If your sample has no risk of
murine RNA contamination, you can skip all parts addressing the mouse reference genome for the
rest of the protocol.
$ cd ~/RNA2CM/data
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/GRCh38.primary_assembly.genome.fa.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_34/gencode.v34.primary_assembly.annotation.gtf.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/GRCm38.primary_assembly.genome.fa.gz
$ wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M25/gencode.vM25.primary_assembly.annotation.gtf.gz
$ gunzip GRCh38.primary_assembly.genome.fa.gz gencode.v34.primary_assembly.annotation.gtf.gz GRCm38.primary_assembly.genome.fa.gz gencode.vM25.primary_assembly.annotation.gtf.gz
CRITICAL STEP Download files named ‘primary assembly’, as these files do not contain
$ mkdir ~/RNA2CM/data/GRCh38
$ cd ~/RNA2CM/data/GRCh38
$ STAR --runThreadN 10 --runMode genomeGenerate --genomeDir ~/RNA2CM/data/GRCh38 --genomeFastaFiles ~/RNA2CM/data/GRCh38.primary_assembly.genome.fa --sjdbGTFfile ~/RNA2CM/data/gencode.v34.primary_assembly.annotation.gtf --sjdbOverhang 100
$ mkdir ~/RNA2CM/data/GRCm38
$ cd ~/RNA2CM/data/GRCm38
$ STAR --runThreadN 10 --runMode genomeGenerate --genomeDir ~/RNA2CM/data/GRCm38 --genomeFastaFiles ~/RNA2CM/data/GRCm38.primary_assembly.genome.fa --sjdbGTFfile ~/RNA2CM/data/gencode.vM25.primary_assembly.annotation.gtf --sjdbOverhang 100
6 Create an index file and a GATK dictionary file for the human reference FASTA file. This produces
an index with the same name as the reference FASTA plus a .fai suffix, and a dictionary in which the
.fa suffix is replaced by .dict.
$ cd ~/RNA2CM/data
$ samtools faidx GRCh38.primary_assembly.genome.fa
$ gatk CreateSequenceDictionary -R GRCh38.primary_assembly.genome.fa
? TROUBLESHOOTING
7 Create a four-column BED file, with the coordinates of all exons from the human annotation GTF,
gzip the file and index it with tabix. This file will serve later as an interval file for GATK tools.
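The conversion described in this step can be sketched as follows. This is a minimal illustration and not the authors' command: the function name gtf_exons_to_bed, the use of gene_name as the fourth column and the toy GTF records are all assumptions. GTF coordinates are 1-based and inclusive, whereas BED coordinates are 0-based and half-open, so the start is decremented by one. Rows are de-duplicated and sorted because tabix expects position-sorted input; compression and indexing still have to be done afterwards.

```python
import re

def gtf_exons_to_bed(gtf_lines):
    """Yield (chrom, start0, end, gene_name) for every exon record in a GTF."""
    name_pattern = re.compile(r'gene_name "([^"]+)"')
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip GTF header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # keep exon features only
        match = name_pattern.search(fields[8])
        gene = match.group(1) if match else "."
        # GTF is 1-based inclusive; BED is 0-based half-open
        yield fields[0], int(fields[3]) - 1, int(fields[4]), gene

# Toy records (hypothetical gene and coordinates); the exon is listed twice,
# as happens when two transcripts share it
example_gtf = [
    "##description: toy example\n",
    'chr1\tSRC\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    'chr1\tSRC\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
    'chr1\tSRC\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_name "GENE_A";\n',
]
bed_rows = sorted(set(gtf_exons_to_bed(example_gtf)))
for chrom, start, end, gene in bed_rows:
    print(f"{chrom}\t{start}\t{end}\t{gene}")
```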
8 Since the chromosome names in the dbSNP file from NCBI (NC_000001.11, NC_000002.12…) and
the names in the COSMIC VCF (1, 2…) are not compatible with the chromosome names in the
GENCODE release (chr1, chr2…), it is necessary to rename the chromosome names in both files.
We provide mapping files for the chromosome names within the data directory. Renaming is easily
performed using the annotate function from bcftools. After renaming, the original VCF files are no
longer needed. It is essential to index the renamed files using tabix.
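The renaming logic can be pictured with the short sketch below. It assumes the two-column "old_name new_name" layout that bcftools annotate expects for its chromosome-renaming file; the helper names, the toy records and the toy variant IDs are hypothetical. Unlike bcftools, this sketch does not rewrite ##contig header lines, so it is meant only to show what the mapping does to each record.

```python
def parse_chrom_map(map_text):
    """Parse 'old_name new_name' pairs, one pair per line."""
    mapping = {}
    for line in map_text.splitlines():
        if line.strip():
            old, new = line.split()
            mapping[old] = new
    return mapping

def rename_chromosomes(vcf_lines, mapping):
    """Rewrite the CHROM column of VCF records; header lines pass through."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, rest = line.split("\t", 1)
        # names without an entry in the mapping are left unchanged
        yield mapping.get(chrom, chrom) + "\t" + rest

# Toy mappings following the naming schemes described in the text
dbsnp_map = parse_chrom_map("NC_000001.11 chr1\nNC_000002.12 chr2\n")
cosmic_map = parse_chrom_map("1 chr1\n2 chr2\n")

dbsnp_records = ["#CHROM\tPOS\tID\tREF\tALT", "NC_000001.11\t1000\trs0\tA\tG"]
cosmic_records = ["#CHROM\tPOS\tID\tREF\tALT", "2\t2000\tCOSV0\tC\tT"]
renamed_dbsnp = list(rename_chromosomes(dbsnp_records, dbsnp_map))
renamed_cosmic = list(rename_chromosomes(cosmic_records, cosmic_map))
print(renamed_dbsnp[1])
print(renamed_cosmic[1])
```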
The last data file needed for the analysis is a text file with the VCF headers necessary for later
annotations of variants from your sample. For convenience, this file is already supplied within the
data subdirectory.
By this step, you should have all the data needed to perform the procedure. These necessary files
and their locations are detailed in Table 1.
? TROUBLESHOOTING
Procedure
CRITICAL In this pipeline, each step generates an output file that serves as the input for the next step,
which results in substantial hard disk space consumption. Although each file can be deleted during the
process after its final usage, it is recommended to keep intermediate files until the final output is reached,
as these intermediate files can be useful for troubleshooting.
CRITICAL For the following example code to work properly, do not change the names or locations of
any of the reference files.
which is a single-end Illumina run of wild-type human embryonic stem cells of the H9 cell line; these
cells were cultured on a feeder layer of mouse embryonic fibroblasts.
CRITICAL Publicly available RNA-seq data can be retrieved from databases such as ENA and SRA.
1 Acquire fair-quality RNA-seq data of the cells to be inspected in gzipped FASTQ format
(fastq.gz suffix), and place the file (or pair of files in paired-end Illumina runs) in a new directory
for the analysis (here the directory is named SRR3090631 after the accession number of this
RNA-seq run).
CRITICAL STEP It is highly recommended to inspect the quality of FASTQ files via the fastQC
software, as results derived from low-quality data could be unreliable. A good standard for
high-quality RNA-seq data is a low level of overrepresented sequences (<1%) and per base sequence
quality scores that are >30 along the read. Low-quality bases at the end of Illumina reads are quite
common but, unlike low-quality bases in the middle of the read, they do not pose a problem as
they are trimmed in this pipeline.
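To make the quality threshold concrete, the sketch below decodes FASTQ quality strings and computes the mean Phred score at each read position, which is what the per base sequence quality plot summarizes. It assumes Phred+33 encoding; the function name and the toy quality strings are illustrative, and fastQC remains the recommended tool for the actual inspection.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Return the mean Phred score at each read position."""
    totals, counts = [], []
    for qual in quality_strings:
        for i, char in enumerate(qual):
            if i == len(totals):
                # first read that reaches this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(char) - offset
            counts[i] += 1
    return [total / count for total, count in zip(totals, counts)]

# Toy quality lines (fourth line of each FASTQ record): 'I' = Q40, '5' = Q20
toy_qualities = ["IIII5", "IIII5", "III55"]
means = mean_quality_per_position(toy_qualities)

# 1-based read positions whose mean quality is not above 30
low_positions = [i + 1 for i, q in enumerate(means) if q <= 30]
print(means)
print(low_positions)
```

In this toy input only the last position falls below the threshold, which is the pattern the text describes as harmless because read ends are trimmed.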
? TROUBLESHOOTING
3 Also assign to variables the absolute paths to the STAR genome directories, the human
reference genome FASTA file, the intervals BED file, the dbSNP VCF, the COSMIC VCF and
the header file.
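Because every later command depends on these paths, it can help to verify them before starting. The following optional sketch is not part of the protocol: missing_references is a hypothetical helper, and the file names in the demonstration are placeholders rather than the names listed in Table 1.

```python
from pathlib import Path
import tempfile

def missing_references(paths):
    """Return the labels of reference paths that do not exist on disk."""
    return sorted(label for label, path in paths.items() if not Path(path).exists())

# Demonstration in a temporary directory with placeholder file names
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "reference.fa").touch()  # only this file is created
    references = {
        "human_fasta": data_dir / "reference.fa",
        "intervals_bed": data_dir / "intervals.bed.gz",
    }
    missing = missing_references(references)
print(missing)
```

An empty list means every path resolves; any label that is printed points to a variable that should be corrected before continuing.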
CRITICAL STEP The protocol can be paused between any of the next steps, but the subsequent
code relies on the variables assigned in Steps 2 and 3. Every time you open a new terminal, you
should reassign these variables.
CRITICAL STEP If the cells were grown on mouse feeders, or there is any other potential source
of mouse cell contamination, it is essential to filter out RNA reads of murine origin as they can be
a major source of false positives (Steps 6–7). If there is no such risk, you can skip to the variant
discovery stage (Step 8).
? TROUBLESHOOTING
7 Pass both human and mouse aligned BAM files to the XenofilteR tool. This will remove reads that
aligned better to the mouse genome than to the human genome, and will output a final BAM file
suitable for downstream variant discovery.
$ R
> library("XenofilteR")
> bp.param <- SnowParam(workers = 10, type = "SOCK") # number of threads (10 in this case)
> sample.list <- matrix(c('SRR3090631Aligned.sortedByCoord.out.bam', 'SRR3090631GRCmAligned.sortedByCoord.out.bam'), ncol = 2)
> output.names <- c('SRR3090631')
> XenofilteR(sample.list, destination.folder = "./", bp.param = bp.param, MM_threshold = 8, output.names)
> quit()
with an intervals file in BED format describing all transcribed regions of the genome (in the -L option).
This option is available with the tools SplitNCigarReads, BaseRecalibrator, ApplyBQSR and
HaplotypeCaller; it is not mandatory, but it significantly reduces computation time.
8 Mark duplicated reads from the Filtered BAM file using MarkDuplicates of Picard tools (if you did
not perform murine read removal, supply the aligned-to-human BAM file generated in Step 4 to
the --I option).
9 Using GATK’s SplitNCigarReads, modify the marked-duplicates BAM file to split reads that
contain Ns in their CIGAR string (due to splice junctions within the reads).
10 Add read groups to the aligned reads in the BAM file using AddOrReplaceReadGroups of Picard
tools. This operation is needed because GATK4 requires the BAM headers to contain this piece of
information, which is not generated by STAR during the alignment process.
11 Perform base recalibration with GATK’s BaseRecalibrator on the BAM file, by supplying the tool
with annotation files of known polymorphisms in VCF format. This will output a recalibration table
used in the next step.
This is the final step of the variant calling that will output a file in VCF format describing all the
discrepancies of your sample compared with the reference genome.
CRITICAL STEP If during the library preparation the sample was not PCR amplified in any step,
15 Use bcftools to exclude all variants that did not pass the hard filter, and filter out positions covered
by fewer than ten reads and alternative alleles covered by fewer than five reads.
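The filtering rule in this step can be expressed on a single VCF record as in the sketch below. It is not the bcftools command used in the protocol. It assumes a single-sample VCF whose FORMAT column carries the AD (allelic depths, reference allele first) and DP (read depth) fields, that variants passing the hard filter are labeled PASS, and that the best-supported alternative allele is the one tested. The helper name, the toy records and the filter label QD2 are hypothetical.

```python
def passes_coverage(vcf_record, min_depth=10, min_alt_reads=5):
    """Apply the FILTER, depth and alternative-allele support criteria."""
    fields = vcf_record.rstrip("\n").split("\t")
    filter_status, fmt_keys, sample_values = fields[6], fields[8], fields[9]
    if filter_status != "PASS":
        return False  # did not pass the hard filter
    sample = dict(zip(fmt_keys.split(":"), sample_values.split(":")))
    try:
        depth = int(sample["DP"])
        alt_reads = max(int(n) for n in sample["AD"].split(",")[1:])
    except (KeyError, ValueError):
        return False  # DP or AD missing or not numeric
    return depth >= min_depth and alt_reads >= min_alt_reads

# Toy single-sample records (values are illustrative only)
deep = "chr1\t1000\t.\tG\tA\t500\tPASS\t.\tGT:AD:DP\t0/1:20,15:35"
shallow = "chr1\t2000\t.\tC\tT\t50\tPASS\t.\tGT:AD:DP\t0/1:5,3:8"
failed = "chr1\t3000\t.\tA\tG\t20\tQD2\t.\tGT:AD:DP\t0/1:20,15:35"
results = [passes_coverage(record) for record in (deep, shallow, failed)]
print(results)
```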
16 Add gene names to the variants by supplying the GRCh38_exome.bed file to bcftools’s annotate
function. The -h option adds headers to the VCF that are necessary for importing annotations in
this and the subsequent steps. Next, index the annotated VCF.
17 Using the dbSNP VCF file, annotate the variants in your sample by assigning the known variants
with an RS-ID (dbSNP unique identifier), and import to the sample’s VCF the information about
the prevalence of each SNP, into a new field named ‘COMMON’. This field determines for each
variant if it has an allelic frequency >1% in at least one defined population. This can be done using
bcftools’s annotate function.
? TROUBLESHOOTING
? TROUBLESHOOTING
19 At this point, the original VCF output of GATK’s HaplotypeCaller has been annotated with gene
names, unreliable variants have been filtered out, common SNPs have been marked using dbSNP and
mutations that appear in cancer have been marked using the COSMIC VCF. Use bcftools’s query
function to export the data to a table containing columns with the necessary information for
downstream cancer mutation identification.
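$ python3
> import pandas as pd
> SAMPLE = 'SRR3090631'
> df = pd.read_csv(f"{SAMPLE}varTable.tsv", sep='\t')
> df = df[df['[6]ID'] != '.'] # drop variants with no ID annotation
> df['[7]CNT'] = df['[7]CNT'].astype("int32") # COSMIC count (CNT) as an integer
> df = df[df['[7]CNT'] >= 20] # keep variants with a CNT of at least 20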
21 Remove from the variant table all the entries annotated as ‘common SNPs’, as these variants are
unlikely to be pathogenic.
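A dependency-free sketch of this filter is shown below. It works on rows read with the standard csv module rather than on the pandas table used in the protocol, where the same rule would be a boolean mask on the corresponding column. The column name COMMON and its encoding (1 for a common SNP, any other value otherwise) are assumptions about how the field was exported in Step 19, so both should be adapted to the actual header of your table.

```python
import csv
import io

def drop_common_snps(rows, column="COMMON"):
    """Keep only rows whose COMMON value does not mark a common SNP."""
    return [row for row in rows if row[column] != "1"]

# Toy tab-separated variant table; IDs and positions are illustrative only
toy_table = (
    "CHROM\tPOS\tID\tCOMMON\n"
    "chr1\t1000\tCOSV0\t1\n"
    "chr1\t2000\tCOSV1\t.\n"
)
rows = list(csv.DictReader(io.StringIO(toy_table), delimiter="\t"))
rare_rows = drop_common_snps(rows)
print([row["ID"] for row in rare_rows])
```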
22 Merge the COSMIC Census table (obtained in Step 2 of the equipment setup) with your variant
table, on the basis of the COSMIC ID that corresponds to the ‘GENOMIC_MUTATION_ID’
column in the COSMIC census table (importing information about the pathogenicity, coding effect
and experimental evidence of each variant).
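The merge can be pictured as a key lookup: the COSMIC ID of each variant is matched against the GENOMIC_MUTATION_ID column of the census table, and the census columns are appended to the variant row. The sketch below uses plain dictionaries and toy rows; the variant-side column name ID and all toy values are assumptions. Because a census export can list the same mutation in more than one row, the sketch keeps the first row per ID, which is a simplification and not necessarily the protocol's choice.

```python
def merge_on_cosmic_id(variants, census, variant_key="ID",
                       census_key="GENOMIC_MUTATION_ID"):
    """Inner-join variant rows with census rows that share a COSMIC ID."""
    census_by_id = {}
    for row in census:
        census_by_id.setdefault(row[census_key], row)  # keep first row per ID
    merged = []
    for variant in variants:
        match = census_by_id.get(variant[variant_key])
        if match is not None:
            # census columns are appended to the variant columns
            merged.append({**variant, **match})
    return merged

# Toy variant rows and census rows (hypothetical IDs and values)
variants = [
    {"CHROM": "chr1", "POS": "1000", "ID": "COSV0"},
    {"CHROM": "chr2", "POS": "2000", "ID": "COSV9"},
]
census = [
    {"GENOMIC_MUTATION_ID": "COSV0", "Tier": "1",
     "Mutation Description": "Substitution - Missense"},
    {"GENOMIC_MUTATION_ID": "COSV0", "Tier": "1",
     "Mutation Description": "Substitution - Missense"},
]
merged = merge_on_cosmic_id(variants, census)
print(merged)
```

Variants without a census entry (COSV9 in the toy input) drop out of the result, which is the behavior of an inner join.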
23 Filter the variants to leave only mutations in genes that belong to Tier 1 (‘Tier’ column) of the COSMIC
census genes. Tier 1 genes must possess experimental evidence for their role in cancer formation in
addition to documented mutations that change the gene product’s activity to promote oncogenic
transformation, in contrast to Tier 2 genes that are classified as cancer genes using less strict criteria.
> df = df[df['Tier'] == 1]
Table 3 | Final output table summarizing the anticipated results from the provided example
Gene | CHROM | POS | REF | ALT | COSMIC CNT | RS (dbSNP) | GENOMIC MUTATION ID | Mutation_AA | Mutation Description | FATHMM score | AD
TP53 | chr17 | 7673821 | G | A | 65 | 55832599 | COSV52678166 | p.R267W | Substitution - Missense | 0.98159 | 43,28
24 Remove mutations that are described as ‘Substitution - coding silent’ in the ‘Mutation Description’
column.
26 At this stage, the analysis is done, and all that is left is reformatting the table for better readability
and saving the final table to a CSV file format that can be opened with Microsoft Excel or any text
editor. The following lines of code reorder and rename the columns of the table before writing it to
a file.
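The reordering, renaming and export described in this step can be sketched with the standard csv module as follows. The mapping from raw to readable column names is hypothetical (only [6]ID and [7]CNT appear in the code above, and the output names follow Table 3), and an in-memory buffer stands in for the output file so that the sketch runs without touching the disk.

```python
import csv
import io

def reformat_rows(rows, column_map):
    """Select, reorder and rename columns according to column_map."""
    return [{new: row[old] for old, new in column_map.items()} for row in rows]

def write_csv(rows, fieldnames, handle):
    """Write the rows as CSV with a header line."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical raw-to-readable mapping; its order defines the output column order
column_map = {"GENE": "Gene", "[6]ID": "GENOMIC MUTATION ID", "[7]CNT": "COSMIC CNT"}
toy_rows = [{"[7]CNT": "25", "[6]ID": "COSV0", "GENE": "GENE_A"}]

buffer = io.StringIO()  # stands in for an opened output file
write_csv(reformat_rows(toy_rows, column_map), list(column_map.values()), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file with this module, opening it with newline="" avoids blank lines between rows on some platforms.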
The cancer_mutations.csv table details the cancer mutations identified in the sample (Table 2).
The hPSC sample used in this example harbors a single mutation in the TP53 gene, resulting in a
table named SRR3090631_cancer_mutations.csv that has only a single row (Table 3).
Step | Problem | Possible reason | Solution
Equipment setup, 2 | The workstation does not have access to an internet browser | Some workstations and computer clusters do not have a graphical interface and are accessible only via the command line | Download the files via scripted download. Instructions are available at the COSMIC website
Equipment setup, 7 | Error generated: ‘usr/bin/env “python”: No such file or directory’ | The GATK wrapper script does not recognize the python interpreter | Change ‘python’ to ‘python3’ in the first line of the wrapper script (~/RNA2CM/bin/gatk-4.1.8.0/gatk)
Equipment setup, 8 | Tabix generates an error | The VCF file is corrupted | Make sure the VCF download was completed successfully and that the file is not truncated (this can be done by comparing the CheckSum hash string of the downloaded file with the hash published on the download page using the md5sum program)
1 | RNA-seq reads are not in a compressed FASTQ format | Reads are in an uncompressed FASTQ file or in an unaligned BAM format | Use bgzip to compress the file (fastq.gz) or bedtools bamtofastq to convert to FASTQ format
1 | The fastQC quality check shows poor-quality parameters | Library preparation was performed with low-quality RNA or the sequencing run malfunctioned | Run Trimmomatic (Step 4), and reinspect the quality of the reads. Repeat experiment if quality is still low
4 | No reads pass the Trimmomatic filter | The read length is shorter than 36 bp | Adjust the ‘MINLEN’ parameter in the Trimmomatic command to reduce the minimal read length accepted
4–6 | The sequencing data are divided into two files | You might have paired-ends sequencing data | View the Trimmomatic and STAR manuals for paired-ends syntax, and adjust Steps 4–6 accordingly
15 and 16 | bcftools annotate generates an error | There is a discrepancy between the names of chromosomes in the VCF and the reference genome Fasta file, or between the names of the headers in the VCF and the headers in the header.txt file | Make sure the chromosome names in the annotation VCF are identical to the names in the reference genome file and that the headers in the header.txt file match the headers in the annotation VCF
Timing
Equipment setup
Steps 1–4, data download: 1–2 h
Steps 5–8, reference data setup: 2–4 h
Procedure
Step 1, sample acquisition: 15–60 min (for downloading of most RNA-seq samples from the European
Nucleotide Archive (ENA; https://www.ebi.ac.uk/ena/browser/home) database)
Steps 2 and 3, variable assignment: 5 min
Steps 4–7, preprocessing of RNA-seq data before variant discovery: 1–4 h
Steps 8–13, variant discovery: 1–3 h
Steps 14–19, hard filtering and variant annotation: 1–2 h
Steps 20–26, identification of cancer-related mutations: 30 min
Anticipated results
The main output of this pipeline is the cancer_mutations.csv table, but additional files produced by the
pipeline should be used for quality control of the analysis. For example, the intermediate files can be used
for assessing the number of variants initially generated and the number of variants filtered along the
different steps of the analysis. The following description of the example data can be used as a benchmark
for expected numbers of variants and the proportion of variants eventually filtered out. The output.vcf.gz
file is the raw unfiltered variant calling file. This file will typically contain ~50,000 (low confidence)
variants. The varTable.tsv file is the filtered (and annotated) derivative of output.vcf.gz, and contains
typically 30–50% of the original variants. Among the called variants, it is expected that ~95% will be
annotated in dbSNP and ~85% will be common SNPs (represented in >1% of the population). In the
likely (and typically desired) case of the absence of any cancer mutations, the cancer_mutations.csv table
will be completely empty. When cancer mutations are indeed identified in the sample, the
cancer_mutations.csv table provides extensive information about each mutation in its different columns. Description
of the table and explanations about its different fields are available in Table 2.
In the provided example, variant calling on the ESC sample (SRR3090631) identified ~45,000
variants, which were reduced to ~18,300 high-confidence variants after hard filtering and satisfying
the minimal coverage requirement. Out of these variants, ~17,350 are annotated in dbSNP and
~15,750 variants are considered as common SNPs. The pipeline identified in this sample one
cancer-related mutation in the known tumor suppressor gene TP53. It is a single-nucleotide G-to-A
substitution that results in a missense substitution (arginine to tryptophan) in the translated protein.
Table 3 summarizes the results for the exemplary sample given in the protocol.
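As a quick consistency check, the approximate counts reported for the example can be converted into the proportions quoted earlier in this section. The inputs below are the rounded values from the text, so the resulting percentages are approximate as well.

```python
called = 45_000           # variants reported by variant calling (approximate)
high_confidence = 18_300  # after hard filtering and the coverage requirement
in_dbsnp = 17_350         # high-confidence variants annotated in dbSNP
common = 15_750           # high-confidence variants considered common SNPs

retained_pct = 100 * high_confidence / called
dbsnp_pct = 100 * in_dbsnp / high_confidence
common_pct = 100 * common / high_confidence

print(f"retained after filtering: {retained_pct:.1f}%")  # 40.7%
print(f"annotated in dbSNP: {dbsnp_pct:.1f}%")           # 94.8%
print(f"common SNPs: {common_pct:.1f}%")                 # 86.1%
```

These values fall within the 30–50% retention range and close to the ~95% and ~85% proportions given as benchmarks above.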
Data availability
The RNA-seq sample used as an example in this protocol can be retrieved from the Sequence Read
Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra) under the accession number
SRR3090631.
Code availability
All the code used in this protocol is available at https://github.com/elyadlezmi/RNA2CM (ref. 39).
References
1. De Los Angeles, A. et al. Hallmarks of pluripotency. Nature 525, 469–478 (2015).
2. Tabar, V. & Studer, L. Pluripotent stem cells in regenerative medicine: challenges and recent progress. Nat.
Rev. Genet. 15, 82–92 (2014).
3. Avior, Y., Sagi, I. & Benvenisty, N. Pluripotent stem cells in disease modelling and drug discovery. Nat. Rev.
Mol. Cell Biol. 17, 170–182 (2016).
4. Shahbazi, M. N., Siggia, E. D. & Zernicka-Goetz, M. Self-organization of stem cells into embryos: a window
on early mammalian development. Science 364, 948–951 (2019).
5. Weissbein, U., Benvenisty, N. & Ben-David, U. Genome maintenance in pluripotent stem cells. J. Cell Biol.
204, 153–163 (2014).
6. Bar, S. & Benvenisty, N. Epigenetic aberrations in human pluripotent stem cells. EMBO J. 38, 1–18 (2019).
7. Na, J., Baker, D., Zhang, J., Andrews, P. W. & Barbaric, I. Aneuploidy in pluripotent stem cells and
implications for cancerous transformation. Protein Cell 5, 569–579 (2014).
8. Jo, H. Y. et al. Functional in vivo and in vitro effects of 20q11.21 genetic aberrations on hPSC differentiation.
Sci. Rep. 10, 1–14 (2020).
9. Ben-David, U. & Benvenisty, N. The tumorigenicity of human embryonic and induced pluripotent stem cells.
Nat. Rev. Cancer 11, 268–277 (2011).
10. Ben-David, U. et al. Aneuploidy induces profound changes in gene expression, proliferation and
tumorigenicity of human pluripotent stem cells. Nat. Commun. 5, 4825 (2014).
11. Simonson, O. E., Domogatskaya, A., Volchkov, P. & Rodin, S. The safety of human pluripotent stem cells in
clinical treatment. Ann. Med. 47, 370–380 (2015).
12. Gore, A. et al. Somatic coding mutations in human induced pluripotent stem cells. Nature 471, 63–67 (2011).
13. Merkle, F. T. et al. Human pluripotent stem cells recurrently acquire and expand dominant negative P53
mutations. Nature 545, 229–233 (2017).
14. Avior, Y., Lezmi, E., Eggan, K. & Benvenisty, N. Cancer-related mutations identified in primed human
pluripotent stem cells. Cell Stem Cell 28, 10–11 (2021).
15. Stirparo, G. G., Smith, A. & Guo, G. Cancer-related mutations are not enriched in naive human pluripotent
stem cells. Cell Stem Cell 28, 164–169.e2 (2021).
16. Halliwell, J., Barbaric, I. & Andrews, P. W. Acquired genetic changes in human pluripotent stem cells: origins
and consequences. Nat. Rev. Mol. Cell Biol. 21, 715–728 (2020).
17. Merkle, F. T. et al. Biological insights from the whole genome analysis of human embryonic stem cells.
Preprint at bioRxiv https://doi.org/10.1101/2020.10.26.337352 (2020).
18. Trounson, A. & DeWitt, N. D. Pluripotent stem cells progressing to the clinic. Nat. Rev. Mol. Cell Biol. 17,
194–200 (2016).
19. Tate, J. G. et al. COSMIC: the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 47,
D941–D947 (2019).
20. Shihab, H. A. et al. Predicting the functional, molecular, and phenotypic consequences of amino acid
substitutions using hidden Markov models. Hum. Mutat. 34, 57–65 (2013).
21. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Acknowledgements
We thank S. Kinreich and A. Pagis for testing the pipeline and providing their constructive input and all members of The Azrieli Center
for Stem Cells and Genetic Research for critical reading of the manuscript. This work was partially supported by the Israel Science
Foundation (494/17), the Rosetrees Trust, and Azrieli Foundation. N.B. is the Herbert Cohn Chair in Cancer Research.
Author contributions
E.L. and N.B. designed the analysis. E.L. developed the bioinformatic pipeline and wrote the manuscript with input from N.B., who
supervised the work.
Competing interests
The authors declare no competing interests.
Additional information
Correspondence and requests for materials should be addressed to E.L. or N.B.
Peer review information Nature Protocols thanks Anna Esteve-Codina and the other, anonymous reviewer(s) for their contribution to
the peer review of this work.
Reprints and permissions information is available at www.nature.com/reprints.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
Key references using this protocol
Merkle, F. et al. Nature 545, 229–233 (2017): https://doi.org/10.1038/nature22312
Avior, Y. et al. Cell Stem Cell 28, 10–11 (2021): https://doi.org/10.1016/j.stem.2020.11.013