COMP90016 2023 07 Variant Calling I

Computational Genomics
Lecture 7
Variant Calling I
Dr Khalid Mahmood
Before watching this lecture, make sure you are familiar with:
1 Intro & Genomics I
2 Genomics II
3 Sequencing technologies
5 Sequence alignment
Today: 7 Variant calling I
Overview
● What is a genetic variant?
● Types of genetic variants

○ SNVs, indels, structural variants
● Genomic sequencing data
● Genomic resources
● Sensitivity and specificity
● Variant calling and genotyping

Variants
Genetic variant: a difference between genome sequences.
In most cases this means a difference in DNA sequence or structure compared to a reference genome.
Humans share ~99.8% identical DNA.
Among primates, e.g. the human and chimpanzee genomes are ~99% identical.

Variants
There is a high degree of similarity, but the human genome is large: ~3 billion nucleotides.
This results in approximately 4-5 million variants between any individual and the reference genome.
This seemingly small number of variants likely explains a significant proportion of phenotypic diversity among humans.
Variants
Given the size of the genome and the possible combinations of alleles, the total number of known variants in humans is growing as more sequencing data becomes available.
According to dbSNP, there are ~700 million distinct known genetic variants in the human genome.
A significant proportion of the variants are common, i.e. seen in more than 1% of the population.
Sherry,S.T., Ward,M. and Sirotkin,K. (1999) dbSNP—Database for Single

Nucleotide Polymorphisms and Other Classes of Minor Genetic
Variation. Genome Res., 9, 677–679.
Variants
Germline variants
● Variants you are born with; part of your genetic code
● Should be in the DNA of all cells of your body
● Can be very common (e.g. half the population)
Somatic variants
● Variants you acquire during your life
● Can differ between cells
Each human has ~3M germline variants when compared to “the” human reference genome.
What is variant calling?
CGTGCCTAGAGTGGGATGGGCCATTGTTCATCTTCTGGCCCCTGTTGTCTGCATGTAACTTAATACCACAACCAGGCATA
GGGGAAAGATTGGAGGAAAGATGAGTGAGAGCATCAACTTCTCTCACAACCTAGGCCAGTAAGTAGTGCTTGTGCTCATC
TCCTTGGCTGTGATACGTGGCCGGCCCTCGCTCCAGCAGCTGGACCCCTACCTGCCGTCTGCTGCCA/TCGGAGCCCAAA
GCCGGGCTGTGACTGCTCAGACCAGCCGGCTGGAGGGAGGGCC/GCTCAGCAGGTCTGGCTTTGGCCCTGGGAGAGCAGG
TGGAAGATCAGGCAGGCCATCGCTGCCACAGAACCCAGTGGATTGGCCTAGGTGGGATCTCTGAGCTCAACAAGCCCTCT
CTGGGTGGTAGGTGCAGAGACGGGAGGGGCAGAGCCGCAGGCACAGCCAAGAGGGCTGAAGAAATGGTAGAACGGAGCAG
CTGGTGATGTGTGGGCCCACCGGCCCCAGGCTCCTGTCTCCCCCCAGGTGTGTGGTGATGCCAGGCATGCCCTTCCCCAG
CATCAGGTCTCCAGAGCTGCAGAAGACGACGGCCGACTTGGATCACACTCTTGTG/AAGTGTCCCCAGTGTTGCAGAGGT
GAGAGGAGAGTAGACAGTGAGTGGGAGTGGCGTCGCCCCTAGGGCTCTACGGGGCCGGCGTCTCCTGTCTCCTGGAGAGG
CTTCGATGCCCCTCCACACCCTCTTGATCTTCCCTGTGATGTCATCTGGAGCCCTGCTGCTTGCGGTGGCCTATAAAGCC
TCCTAGTCTGGCTCCAAGGCCTGGCAGAGTCTTTCCCAGGGAAAGCTACA/TAGCAGCAAACAGTCTGCATGGGTCATCC
CCTTCACTCCCAGCTCAGAGCCCAGGCCAGGGGCCCCCAAGAAAGGCTCTGGTGGAGAACCTGTGCATGAAGGCTGTCAA
CCAGTCCATAGGCAAGCCTGGCTGCCTCCAGCTGGGTCGACAGACAGGGGCTGGAGAAGGGGAGAAGAGGAAAGTGAGGT
TGCCTGCCCTGTCTCCTACCTGAGGCTGAGGAAGGAGAAGGGGATGCACTGTTGGGGAGGCAGCTGTAACTCAAAGCCTT
AGCCTCTGTTCCCACGAAGGCAGGGCCATCAGGCACCAAAGGGATTCTGCCAGCATAGTGCTCCTGGACCAGTGATACAC
CCGGCACCCTGTCCTGGACACGCTGTTGGCCTGGATCTGAGCCCTGGTGGAGGTCAAAGCCACCTTTGGTTCTGCCATTG
CTGCTGTGTGGAAGTTCACTCCTGCCTTTTCCTTTCCCTAGAGCCTCCACCACCCCGAGATCACATTTCTCACTGCCTTT
TGTCTGCCCAGTTTCACCAGAAGTAGGCCTCTTCCTGACAGGC/TAGCTGCACCACTGCCTGGCGCTGTGCCCTTCCTTT
GCTCTGCCCGCTGGAGACGGTGTTTGTCATGGGCCTGGTCTGCAGGGATCCTGCTACAAAGGTGAAACCCAGGAGAGTGT
GGAGTCCAGAGTGTTGCCAGGACCCAGGCACAGGCATTAGTGCCCGTTGGAGAAAACAGGGGAATCCCGAAGAAATGGTG
GGTCCTGGCCATCCGTGAGATCTTCCCAGGTGTGCCGTTTTCTCTGGAAGCCTCTTAAGAACACAGTGGCGCAGGCTGGG
TGGAGCCGTCCCCCCATGGAGCACAGGCA/GGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTC
CTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCTCCCAAGGAAGTAGGTCTGAGCAGCTTGTCCTGGCTGTG
TCCATGTCAGAGCAACGGCCCAAGTCTGGGTCTGGGGGGGAAGGTGTCATGGAGCCCCCTACGATTCCCAGTCGTCCTCG
TCCTCCTCTGCCTGTGGCTGCTGCGGTGGCGGCAGAGGAGGGATGGAGTCTGACACGCGGGCAAAGGCTCCTCCGGGCCC
CTCACCAGCCCCAGGTCCTTTCCCAGAGATGCCTGGAGGGAAAAGGCTGAGTGAGGGTGGTTGGTGGGAAACCCTGGTTC
CCCCAGCCCCCGGA/CGACTTAAATACAGGAAGAAAAAGGCAGGACAGAATTACAAGGTGCT
Common types of variants
● SNVs: single-nucleotide variants.
● Indels: (small) insertions or deletions.
● SNV and indel lengths range from 1bp-1kb.
● Other types of variants include:
● CNV: copy-number variation.
● SV: structural variation.
● Impacts larger regions of DNA:
● extra/missing DNA
● Rearranged regions etc.
[Figure: example alignments for each variant type]
Single-nucleotide variants (SNVs): ctccgag vs ctctgag
Insertion deletions (Indels): ctc--ag vs ctctgag
Structural Variants (SVs): ctccgag vs ctctaggtaaagag
Why study human genetic variants?
DNA controls all cells in our body.
Genetic variants are responsible for genetic diversity

● i.e. biological differences, or phenotype
Some variants are associated with disease

● germline variants may be associated with disease or
increased risk
● somatic variants can explain tumour aetiology in cancer
Types of variants: SNVs
SNV: single nucleotide variant
This is a one-letter substitution:
ACGA → ATGA
If the variant is common in a population (conventionally, present in at least 1% of it), it is called a SNP: single nucleotide polymorphism.
SNPs differ from the reference genome but are, by definition, relatively common in a given population.
(A) Biological effects of a SNV
[Figure: SNV consequences by the position the variant overlaps on a transcript (non-coding vs coding). Source: VEP, ensembl.org]
Biological effects of a SNV
Missense mutation: a non-synonymous variant that changes one amino acid to another. This can affect protein function, depending on the nature of the amino acid change, its location, and how conserved the affected region is.
AGG -> AAG changes Arginine (R) to Lysine (K)
Silent mutation: a synonymous mutation that does not alter the amino acid, but in some cases can still have a phenotypic effect, e.g. by disrupting transcription or splicing.
GCG -> GCA both code for the amino acid Alanine (A)
[Figure: codon table]
Nonsense mutation: a non-synonymous variant that changes an amino acid codon to a STOP codon, resulting in premature termination of translation.
TCA -> TAA changes Serine (S) to a stop codon (stop_gained)
This truncates the protein prematurely and is likely to affect protein function; it can also prevent translation via nonsense-mediated decay.
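The three coding consequences above can be checked mechanically by translating the reference and alternate codons. The following is a minimal sketch, not how VEP or any particular annotator works: the function name `classify_coding_snv` is hypothetical, and the codon table is the standard genetic code written out in T, C, A, G order.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon substitution by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"   # silent: same amino acid (or stop to stop)
    if alt_aa == "*":
        return "nonsense"     # amino acid codon becomes a stop codon
    if ref_aa == "*":
        return "stop_lost"    # stop codon becomes an amino acid codon
    return "missense"         # one amino acid replaced by another


# The three examples from the slides.
for ref, alt in [("AGG", "AAG"), ("GCG", "GCA"), ("TCA", "TAA")]:
    print(ref, "->", alt, classify_coding_snv(ref, alt))
```

This only looks at the codon itself; as the slides note, a synonymous change can still matter through effects on transcription or splicing.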
Non-coding variants: depending on where the variant overlaps the transcript, non-coding variants can also have a biological impact, e.g.
● SNVs in UTRs may affect gene regulation
● SNVs in splice sites (edges of introns) can affect splicing
● SNVs upstream of a gene can affect promoter binding
(B) Types of variants: Indels
Indel: insertion or deletion:
ACTATGAGATTACA       Reference
AC---GAGATTACA       3bp deletion
ACTATGAGATTATGTGCA   4bp insertion
Indels usually refer to short modifications.
Large insertions/deletions are treated separately.

Biological effect of an indel
Indels in coding regions can have multiple consequences.
Recall that there are three reading frames in RNA:
AGG.CAG.CCT.GCA.GCC.CTT.GGC.C
A.GGC.AGC.CTG.CAG.CCC.TTG.GCC
AG.GCA.GCC.TGC.AGC.CCT.TGG.CC
An indel whose length is divisible by 3 (3bp, 6bp, 9bp…) preserves the reading frame; a 3bp indel affects only one or two codons:
AGG.GAG.CCT.GCA.GCC.CTT.GGC.C   before
AGG.GCT.GCA.GCC.CTT.GGC.C       after a 3bp deletion (GAG.CCT becomes GCT)
Biological effect of an indel
An indel whose length is not divisible by 3 (1bp, 2bp, 4bp…) is called a frameshift. It can disrupt a large stretch of the protein-coding sequence by changing every downstream amino acid:
ATG.GAG.CCT.GCA.GCC.CTT.GAC - M E P A A L D
ATG.GCC.TGC.AGC.CCT.TGA - M A C S P Stop
This often also leads to a premature STOP codon.
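The frameshift example can be reproduced by translating the sequence before and after removing two bases. This is a small sketch under stated assumptions: `translate` is a hypothetical helper that reads codons from the first base and stops at the first stop codon, and deleting the "AG" at 0-based positions 4-5 is one way to obtain the second sequence shown on the slide.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first base; stop after the first stop codon (*)."""
    protein = []
    for i in range(0, len(dna) - 2, 3):   # trailing partial codons are ignored
        aa = CODON_TABLE[dna[i:i + 3].upper()]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


reference = "ATGGAGCCTGCAGCCCTTGAC"
# 2bp deletion: drop the bases at 0-based positions 4 and 5 ("AG").
frameshifted = reference[:4] + reference[6:]

print(translate(reference))     # reference protein
print(translate(frameshifted))  # every codon after the deletion is re-read
```

Every codon downstream of the deletion is read in a new frame, which is why the new protein ends at a stop codon the reference frame never encounters.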
Just like with SNVs, indels in non-coding regions can also have
biological effects:
● Indels in UTRs may affect gene regulation
● Indels in splice sites can affect splicing
Example variants in humans
● 1 bp deletion (CYP3A5*7): sodium transport gene
○ linked to predisposition to hypertension
● 1 bp deletion (HERC2): E3 ubiquitin-protein ligase
○ affects expression of the neighbouring OCA2 gene
○ linked with eye colour (blue)
● Missense variant (Gly551Asp) in CFTR:
○ associated with CFTR-linked disorders such as cystic fibrosis
● BRCA1/BRCA2 genes: help repair damaged DNA
○ ~50% of women who inherit a damaging variant in these genes will develop cancer by age 80
(C) Types of variants: structural variation
SV: structural variation
A broad category, including

● CNVs - large insertions and deletions
● translocations
● inversions
Image: http://compbio.cs.brown.edu/projects/structvar/
Types of variants: CNV
CNV: copy number variation
Large-scale duplications or deletions of portions of the genome.
Somatic CNV may be called CNA: copy number alteration.
Image: http://en.wikipedia.org/wiki/Copy-number_variation
Types of variants: CNV (continued)
When CNV is present, the amount of DNA from the duplicated/deleted regions changes, so our algorithms can look for the resulting change in read depth.
These methods won’t work for translocations or inversions. For these structural rearrangements we usually look for the breakpoints.
[Figure: aligned sequence reads]
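The read-depth idea can be sketched as follows. This is a deliberately naive illustration, not any published CNV caller: the window depths are made up, the functions `estimate_copy_number` and `call_cnv` and the `tolerance` threshold are hypothetical, and the median depth is assumed to represent the normal diploid state. Real tools also correct for biases such as GC content and mappability.

```python
from statistics import median


def estimate_copy_number(window_depths, ploidy=2):
    """Scale each window's depth so that the median window equals the ploidy."""
    baseline = median(window_depths)
    return [ploidy * depth / baseline for depth in window_depths]


def call_cnv(window_depths, ploidy=2, tolerance=0.5):
    """Label each window as gain, loss or neutral from its estimated copy number."""
    calls = []
    for copies in estimate_copy_number(window_depths, ploidy):
        if copies >= ploidy + tolerance:
            calls.append("gain")
        elif copies <= ploidy - tolerance:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


# Made-up mean read depths for nine consecutive genomic windows.
depths = [30, 31, 29, 45, 46, 30, 15, 14, 30]
print(call_cnv(depths))
```

A balanced translocation or inversion leaves every window's depth unchanged, which is why depth alone cannot find them.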

Biological effect of CNV
Gene duplications or deletions
● impact the expression of particular RNAs and proteins, which
○ can affect biological function of proteins
○ can affect regulation or interaction with other genes
● deletion of regions containing genes
○ often just one copy of a gene is lost
○ if the two copies had different variants (heterozygous), this is loss of heterozygosity (LOH)
CNV is implicated in some diseases, e.g. autism, schizophrenia, bipolar disorder
Chiang et al. (2017) Nat Genet.

Methods and resources
Databases and resources:
○ dbSNP – a public archive of human SNVs and indels
○ gnomAD – an aggregation of >150,000 exomes/genomes used to estimate population-level frequencies of genetic variants
[Figure: UCSC genome browser views showing the reference, gene and annotation tracks]
Variant detection using NGS
datacarpentry.org
Variant detection from NGS data
Input:
● a reference genome
● set of reads aligned to reference genome
Output:
● a set of variant calls
● a measure of quality for each variant call (confidence information)
A variant is defined by:
(1) CHROM: chromosome
(2) POS: position on the reference sequence (1-based in VCF)
(3) REF: reference allele
(4) ALT: variant allele
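The four defining fields map naturally onto a small record type. The sketch below is illustrative only: the `Variant` class and its `kind` method are hypothetical, and classifying by comparing REF and ALT lengths is a simplification that ignores symbolic alleles and multi-allelic sites. The first record is taken from the VCF example later in this lecture; the other two are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    chrom: str  # CHROM: chromosome
    pos: int    # POS: position on the reference sequence
    ref: str    # REF: reference allele
    alt: str    # ALT: variant allele

    def kind(self) -> str:
        """Classify the variant by comparing the REF and ALT allele lengths."""
        if len(self.ref) == 1 and len(self.alt) == 1:
            return "SNV"
        if len(self.ref) < len(self.alt):
            return "insertion"
        if len(self.ref) > len(self.alt):
            return "deletion"
        return "MNV"  # same length, longer than 1bp: multi-nucleotide substitution


snv = Variant("20", 14370, "G", "A")        # from the VCF example in this lecture
insertion = Variant("20", 100, "A", "ATG")  # hypothetical record
deletion = Variant("20", 200, "CTT", "C")   # hypothetical record
print(snv.kind(), insertion.kind(), deletion.kind())
```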

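[Figure: aligned sequence reads against the reference, contrasting a possible sequencing error with a likely variant (SNV)]
[Figure: aligned sequence reads showing a deletion, an SNV and an insertion, each with a variant score]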
Variant detection is non-trivial because of errors in the data
● read errors (experimental errors)
● alignment errors
Errors are rare, but so are variants, e.g. humans have about 1 germline variant per 1000bp.
Somatic variants (important in cancer research) are usually even rarer: 1 somatic variant per 1,000,000bp.
An algorithm must trade off sensitivity against specificity.

Sensitivity and specificity
Sensitivity
How many real variants do we succeed in finding?
○ High sensitivity = fewer false negatives
Specificity
How many non-variant sites do we correctly reject?
○ High specificity = fewer false positives
False discovery rate (FDR)
How many of the variants we call are not real?
Sensitivity and specificity
              Variant      No variant
Called        TP           FP             N_called
Not called    FN           TN             N_not called
              N_variant    N_no variant   N_total sites
TP = true positives    FP = false positives
TN = true negatives    FN = false negatives
Sensitivity = TP / (TP+FN) Specificity = TN / (TN+FP)
True positive rate (TPR) = TP / (TP+FN) = sensitivity
False positive rate (FPR) = FP / (FP+TN) = 1 - specificity
False discovery rate (FDR) = FP / (FP+TP)
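The formulas above translate directly into code. In this sketch the function name `calling_metrics` is hypothetical and the counts are made up (one million sites, of which 100 are true variants); it assumes all denominators are non-zero.

```python
def calling_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute the slide's metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # = true positive rate
        "specificity": tn / (tn + fp),
        "fpr": fp / (fp + tn),           # = 1 - specificity
        "fdr": fp / (fp + tp),
    }


# Made-up counts: 1,000,000 sites, 100 of them true variants.
metrics = calling_metrics(tp=90, fp=30, fn=10, tn=999_870)
for name, value in metrics.items():
    print(f"{name}: {value:.5f}")
```

With these made-up numbers the specificity is above 99.99%, yet a quarter of the calls are false. Because variants are rare, even a tiny false positive rate applied to a very large number of non-variant sites produces many false calls.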
Sensitivity vs specificity
[Figure: sensitivity vs specificity curves, annotated "Better algorithm/data"]
Source: SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics. 2010; 26(6): 730–736.
VCF - Variant Call Format
Standard file format for variants.
Developed for the 1000 Genomes Project.
VCF is:
● human-readable
○ text-based, tab-separated
● machine-readable
● self-describing
○ header lines describe fields
Binary equivalent is BCF.

VCF files
##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001
20 14370 . G A 29 PASS NS=3;DP=14;DB GT:GQ:DP 0|0:48:1
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP 0|0:49:3
20 1110696 . A G,T 67 PASS NS=2;DP=10;AA=T GT:GQ:DP 1|2:21:6
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP 0|0:54:7
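Because VCF is plain text, a data line can be split into its fixed columns with a few lines of code. This is a teaching sketch, not a replacement for a dedicated VCF library: `parse_vcf_record` is a hypothetical helper, it assumes QUAL is numeric, and it splits on any whitespace because the example above is shown with spaces (real VCF files are tab-separated).

```python
def parse_vcf_record(line: str) -> dict:
    """Parse one VCF data line (not a '#' header line) into a dictionary."""
    fields = line.split()  # real VCF is tab-separated; split() also accepts spaces
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]

    info_dict = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_dict[key] = value if sep else True  # flags such as DB carry no value

    format_keys = fmt.split(":")
    samples = [dict(zip(format_keys, s.split(":"))) for s in fields[9:]]

    return {
        "CHROM": chrom, "POS": int(pos), "ID": var_id, "REF": ref,
        "ALT": alt.split(","),   # several ALT alleles are comma-separated
        "QUAL": float(qual), "FILTER": filt,
        "INFO": info_dict, "samples": samples,
    }


record = parse_vcf_record(
    "20 1110696 . A G,T 67 PASS NS=2;DP=10;AA=T GT:GQ:DP 1|2:21:6"
)
print(record["ALT"], record["samples"][0]["GT"])
```

In the sample genotype `1|2`, the numbers index into the allele list: 0 is REF, 1 is the first ALT allele (G) and 2 is the second (T).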
Variant quality scores: Phred-like
Quality   Chance it's wrong   Accuracy
10        1 in 10             90%
20        1 in 100            99%
30        1 in 1000           99.9%
40        1 in 10,000         99.99%
50        1 in 100,000        99.999%
QUAL is a Phred-scaled quality score for the assertion made in ALT:
Q = -10 log10 P(call in ALT is wrong)
where Q = Phred quality score and P = probability of the call being incorrect.
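The Phred relationship is easy to apply in both directions. A minimal sketch, with hypothetical function names:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """P(call is wrong) for a Phred-scaled quality Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred-scaled quality for an error probability P: Q = -10 log10(P)."""
    return -10 * math.log10(p)


# QUAL values of the first two records in the VCF example: 29 (PASS) and 3 (q10).
for qual in (29, 3):
    print(qual, round(phred_to_error_prob(qual), 4))
```

A QUAL of 3 corresponds to roughly a 50% chance that the call is wrong, which is consistent with that record failing the q10 filter.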

Variant calling
We recall here the evidence required to identify variant sites from sequencing data:
● Depth of coverage – number of reads at the position
● Bases in the region that mismatch the reference
● Sequencing quality of the bases
● Mapping quality of the reads
But in many cases we use a probabilistic variant calling approach known as genotyping. This means we try to calculate the most likely genotype at a given site.
Genotyping
Reminder: humans are diploid
(two copies of each chromosome)
So you could have 0, 1 or 2 copies of a variant.
These are the three possible genotypes:
AA AB BB
Haploid species: genotypes are A or B (wild-type or variant)

Triploid: possible genotypes are AAA, AAB, ABB, BBB.
And so on.
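The "and so on" can be made concrete: for two alleles, the genotypes at any ploidy are the unordered selections of A and B. In this sketch the function names are hypothetical, and the expected fraction of alternate-allele reads is simply the share of B copies, ignoring sequencing error.

```python
from itertools import combinations_with_replacement


def possible_genotypes(ploidy: int, alleles: str = "AB") -> list:
    """All unordered genotypes for the given ploidy (A = reference, B = alternate)."""
    return ["".join(g) for g in combinations_with_replacement(alleles, ploidy)]


def expected_alt_fraction(genotype: str) -> float:
    """Expected fraction of reads carrying allele B, ignoring sequencing error."""
    return genotype.count("B") / len(genotype)


for ploidy in (1, 2, 3):
    genotypes = possible_genotypes(ploidy)
    print(ploidy, [(g, round(expected_alt_fraction(g), 2)) for g in genotypes])
```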
Genotyping
Diploid genotypes AA, AB, or BB
A stands for the reference allele (C,G,A,T)

B stands for the alternate allele
BB is a homozygous variant
AB is a heterozygous variant
AA is homozygous reference (ie no variant)
Genotyping
Recall that in sequencing:
● we make lots of copies of the DNA
● don't manage to read all of them (sample randomly)
The random processes introduce noise into read depth.
Allele frequencies
We expect to see read counts matching the possible genotypes - approximately!
[Figure: percent alternate alleles expected for each genotype. Diploid: AA, AB, BB. Triploid: AAA, AAB, ABB, BBB.]
Genotyping
Given all this, the task of calling variants and genotyping requires a more sophisticated approach.
The aim is to improve accuracy and report the most likely variant/genotype combination.
One of the initial approaches to this used Bayes' theorem.
A general approach to single-nucleotide polymorphism discovery. Nat Genet (1999)

Genotyping
Genotyping or genotype calling = choosing a genotype given the data
Usual strategy: find the probability of each genotype given the observed data:
P(Genotype | Data)
Then take the most likely genotype.

Genotyping – Data (D)
Example data D:
At this site or column:
Read depth N=18
Reference allele A = C (11 reads)
Alternate allele B = T (7 reads)
We might guess that P(CT | Data) is going to be higher than P(CC | Data) or P(TT | Data).
Bayesian Genotyping
We want:
P(Genotype | Data)
But easier to calculate:
P(Data | Genotype)
Then use Bayes' theorem:
P(G|D) = P(D|G) P(G) / P(D)
Bayes theorem
Bayes theorem considers prior probabilities.
e.g. if I see a crop circle in my field, what is the chance I have been visited by aliens?
Let's say aliens always leave crop circles (100%) – P(crop circle|aliens)
Aliens are unlikely – P(aliens)
But crop circles happen occasionally – P(crop circle)
P(aliens|crop circle) = P(crop circle|aliens) P(aliens) / P(crop circle)
where P(crop circle|aliens) is, let's say, 100%; P(aliens) is extremely small; and P(crop circle) is small-ish
=> very small probability (other possible explanations exist)
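Putting numbers on the crop-circle example shows how strongly the prior dominates. Only the 100% likelihood comes from the slide; the values chosen here for "extremely small" and "small-ish" are arbitrary assumptions for illustration, and `posterior` is a hypothetical helper.

```python
def posterior(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E)."""
    return likelihood * prior / evidence


p_circle_given_aliens = 1.0  # from the slide: aliens always leave crop circles
p_aliens = 1e-9              # assumed value for "extremely small"
p_circle = 1e-3              # assumed value for "small-ish"

p_aliens_given_circle = posterior(p_circle_given_aliens, p_aliens, p_circle)
print(p_aliens_given_circle)
```

Even with a likelihood of 100%, the posterior stays tiny because the prior is far smaller than the overall probability of seeing a crop circle.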
Genotyping
P(G|D) = P(D|G) P(G) / P(D)
The terms are:
P(D|G) - probability of seeing the set of reads (observed bases), for the given genotype
P(G) - prior probability of the genotype, our prior expectation of seeing this genotype
P(D) - probability of seeing this particular set of reads

Genotyping
P(G|D) = P(D|G) P(G) / P(D)
P(D) = probability of seeing this data
For a given data set D (reads), this is just a constant - the same for AA, AB and BB.
So instead of calculating P(D) directly, we can treat it as a normalisation factor chosen so that:
P(AA|D) + P(AB|D) + P(BB|D) = 1
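Written out, the normalisation constraint fixes P(D) as the sum of likelihood times prior over the three genotypes (the law of total probability), so no separate model for P(D) is needed:

```latex
P(D) = \sum_{G' \in \{AA,\,AB,\,BB\}} P(D \mid G')\,P(G')
\qquad\Longrightarrow\qquad
P(G \mid D) = \frac{P(D \mid G)\,P(G)}{\sum_{G' \in \{AA,\,AB,\,BB\}} P(D \mid G')\,P(G')}
```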
Genotyping: P(D|G)
Start by calculating P(D|G) = probability of seeing the reads given the genotype,
i.e. for our data D, what are P(D|AA), P(D|AB), P(D|BB)?

Genotyping: P(D|G)
Example data D:
At this site, read depth N=18
Reference allele A = C
Alternate allele B = T
There are 11 C’s & 7 T’s
What are
P(Data | CC)
P(Data | CT)
P(Data | TT) ?
Genotyping: P(D|G)
For heterozygous genotype AB, we are choosing each read
randomly from A and B. This gives a binomial distribution with
p=0.5 (like coin-tossing).
[Figure: binomial distribution. y-axis: probability of observing that data. x-axis: observed allele fraction, i.e. the different possible values of the data in terms of the allele fractions actually observed.]
Genotyping: P(D|G)
For homozygous genotypes (AA or BB) we expect all data to be
A (for AA) or B (for BB) except where there is a sequencing
error.
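For the example site (N=18 with 11 C's and 7 T's), the three likelihoods can be computed with a binomial model. This is a simplified sketch rather than the model of any particular genotyper: the function name is hypothetical, the per-base error rate of 1% is an assumption, and an error is assumed to turn one allele into the other (real errors can produce any base).

```python
from math import comb


def genotype_likelihood(n_alt: int, depth: int, alt_copies: int,
                        ploidy: int = 2, error_rate: float = 0.01) -> float:
    """Binomial P(D|G): probability of n_alt alternate reads out of depth reads."""
    frac = alt_copies / ploidy
    # A read shows the alternate allele either correctly or through an error.
    p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
    return comb(depth, n_alt) * p_alt ** n_alt * (1 - p_alt) ** (depth - n_alt)


# Example data from the slides: depth 18, reference C (11 reads), alternate T (7 reads).
likelihoods = {
    "CC": genotype_likelihood(7, 18, alt_copies=0),
    "CT": genotype_likelihood(7, 18, alt_copies=1),
    "TT": genotype_likelihood(7, 18, alt_copies=2),
}
for genotype, value in likelihoods.items():
    print(genotype, f"{value:.3g}")
```

Under this model the heterozygous likelihood does not depend on the error rate (each read is a coin toss), whereas the homozygous likelihoods require 7 or 11 independent errors and are therefore many orders of magnitude smaller.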
Genotyping: P(D|G)
P(D|G) will give a more definite prediction of which allele fraction
to expect if we have
● lower sequencing error
● higher read depth
In practice genotyping software uses a model for the expected sequencing errors, based on the sequencing technology, e.g.
● Illumina: per-base substitution errors like C->G, becoming more likely towards the ends of reads
Genotyping: prior P(G)
P(G) = prior probability of genotype G
If the genome were random, this would be
P(AA) = 0.25
P(AB) = 0.5
P(BB) = 0.25
This assumes that variants are as likely as non-variants.
The genome is not random! It's constrained by evolution.
Most of our genomes are very similar, i.e. we expect to see the reference genome sequence more often.
Each human has about 1 SNP per kilobase, i.e. a 1/1000 chance of a variant at each site.
If each chromosome’s allele is chosen randomly from the population, then instead of 0.5 and 0.25:
P(AB) ~ 1/1000 P(BB) ~ 1/1,000,000

In other words, variants are rare enough that we need a lot of read evidence to be confident what we are seeing is real.
If the sequencing error rate is high compared to the rate of true variants, or if we don’t have high enough read depth, it is difficult to tell the difference between false positives and true variants.
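Combining the likelihood, the prior and the normalisation gives the full posterior, and shows why depth matters. This is a sketch under stated assumptions, not a production genotyper: the priors are the approximate values from the slide (with P(AA) taken as the remainder), the 1% error rate is assumed, errors are assumed to swap one allele for the other, and `genotype_posteriors` is a hypothetical name.

```python
from math import comb

# Approximate priors from the slide; P(AA) is taken as the remainder.
PRIORS = {"AA": 1 - 1e-3 - 1e-6, "AB": 1e-3, "BB": 1e-6}
ALT_COPIES = {"AA": 0, "AB": 1, "BB": 2}


def genotype_posteriors(n_alt: int, depth: int, error_rate: float = 0.01) -> dict:
    """P(G|D) for a diploid site, normalised so the three posteriors sum to 1."""
    unnormalised = {}
    for genotype, prior in PRIORS.items():
        frac = ALT_COPIES[genotype] / 2
        p_alt = frac * (1 - error_rate) + (1 - frac) * error_rate
        likelihood = (comb(depth, n_alt)
                      * p_alt ** n_alt * (1 - p_alt) ** (depth - n_alt))
        unnormalised[genotype] = likelihood * prior
    total = sum(unnormalised.values())  # plays the role of P(D)
    return {g: value / total for g, value in unnormalised.items()}


low_depth = genotype_posteriors(n_alt=1, depth=3)    # one alternate read in three
high_depth = genotype_posteriors(n_alt=7, depth=18)  # the example site from the slides

print(max(low_depth, key=low_depth.get), low_depth)
print(max(high_depth, key=high_depth.get), high_depth)
```

At depth 3, a single alternate read is better explained by a sequencing error on a homozygous-reference site, because the prior for AB is so small. At depth 18 with 7 alternate reads, the evidence overwhelms the prior and AB is called.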
Genotyping
[Figure: effect of read depth. Increasing read depth -> higher confidence.]
Source: SNVMix: predicting single nucleotide variants from next-generation sequencing of tumors. Bioinformatics. 2010; 26(6): 730–736.
Genotyping: summary
P(G|D) = P(D|G) P(G) / P(D)
● P(D|G): binomial distribution + error model for the sequencing technology
● P(G): prior probability of each genotype, based on knowledge of the sample’s biology
● P(D): just a constant for any given D; the probability of getting the data we got, out of all possible data, randomly
