
Computational Biology and Biomedical Data Analysis
The central dogma of molecular biology explains how information is transmitted from DNA to RNA to proteins, and ultimately to function.

Metabolism is the web of interconnected chemical reactions that use nutrients to produce
energy and the necessary building blocks for cellular functioning and growth.

Metabolism and the central dogma are interconnected in living organisms. Metabolism
provides the energy and building blocks necessary for DNA replication, RNA synthesis, and
protein synthesis. The flow of genetic information in the central dogma depends on the
availability of energy generated by metabolism.
Comparative Genomics and Sequence Alignment:
Comparison of biological sequences (DNA, RNA or proteins)

Sequence: ordered set of n elements/characters representing nucleotides or amino acids.

Proteins are composed of twenty amino acids.

Comparative Genomics:
Goal: To assess the degree of similarity between biological sequences

Why similarity? What can similarity tell us about biology?

- To detect homology (shared evolutionary ancestry of two sequences)

- To reconstruct evolutionary history

- To determine the function and/or structure of new sequences (by comparing them to sequences with known function)

PROKARYOTES (Bacteria and Archaea) vs. EUKARYOTES (Plants, animals and fungi):

- Small cells, no nucleus vs. larger cells (often multicellular) with a nucleus and specialized organelles
- No introns vs. introns
- Not much intergenic DNA vs. lots of intergenic DNA
- Typically 1-10 Mb genomes vs. 100 Mb - 100 Gb genomes


Homology and similarity between sequences:
The limited alphabet in biological sequences can lead to similarities between sequences by
chance. Similarity can arise from shared ancestry (homology) or independent convergence.
Homologous sequences share a common ancestor but similarity doesn't imply the same
function.

Homology: orthology and paralogy

- Orthologous sequences are homologous sequences that belong to two different species.

- Orthologs are the result of speciation events.

- Paralogous sequences are homologous sequences within the same species. Paralogs are the result of duplication events.

Sequence alignment:

Sequence alignment can be used to find 'conserved patterns' or similarity patterns between pairs and groups of sequences (in DNA and RNA).

- Global alignment: attempt to align or match every element in a sequence

- Local alignment: attempt to align regions of sequences


Evolutionary events: Point mutations in genes:

Main problems we will discuss in comparative genomics:

I. Similarity of sequences - How can we define similarity?

• The number of matching nucleotides (when aligned)?


• The amount of shared information?
• The “distance” between the two sequences under some metric?

Dot plots are graphical methods to observe similarity between sequences.

Dot Plots - interpretation summary


II. Pairwise sequence alignment
Objective: To find the similarity regions in two sequences S1 and S2.

How: Optimizing a score function to quantify the similarity between two sequences, allowing for evolutionary events by introducing gaps in one or both sequences.
We look for the longest set of adjacent bases that are equal in the pair of sequences.

Fill the matrix with scores and pointers:
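As an illustration of this matrix-filling step, here is a minimal Python sketch of a Needleman-Wunsch-style global alignment fill; the scoring values (match=+1, mismatch=-1, gap=-2) are assumptions chosen for the example, not necessarily the ones used in the course.

import itertools

def global_alignment_score(s1, s2, match=1, mismatch=-1, gap=-2):
    n, m = len(s1), len(s2)
    # (n+1) x (m+1) score matrix; first row/column are cumulative gap penalties
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Fill each cell with the best of: diagonal (match/mismatch), up (gap), left (gap)
    for i, j in itertools.product(range(1, n + 1), range(1, m + 1)):
        diag = F[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
        F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[n][m]

print(global_alignment_score("GATTACA", "GCATGCT"))

(In practice one also stores pointers to the winning move in each cell, so the optimal alignment can be recovered by tracing back from the bottom-right corner.)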


III. Comparison of sequences against a database

-BLAST

Biological networks (Topic 3)

Protein-protein interaction networks: mathematical representations of the physical contacts between proteins in the cell.

Gene regulatory networks (GRNs):

ATAC-seq data - technique used in molecular biology to assess genome-wide chromatin accessibility

ChIP-seq data - method used to analyze protein interactions with DNA (high quality but low coverage)

Computational inference methods – the goal is to obtain a ‘reliable’ GRN from the
available experimental data.

Network players and relationships in bipartite networks

Bipartite graphs: two types of nodes; edges run between nodes of different types
[Figure: example bipartite network with nodes labeled 1-7]

Paths between nodes i and k:

- Length 1: number of paths = A_ik (0 in the example)

- Length 2: number of paths = Σ_j A_ij A_jk = (A²)_ik (1 in the example)
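A minimal numpy sketch of this idea (the adjacency matrix below is a made-up toy example, not the one in the lecture figure):

import numpy as np

# Toy adjacency matrix of a small undirected network (assumed example)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

i, k = 0, 2
print(A[i, k])                              # paths of length 1 between i and k
print(np.linalg.matrix_power(A, 2)[i, k])   # paths of length 2 = (A^2)_ik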


Complex biological networks: modularity and group structure (Topic 4)

➔Start with a regular lattice

➔Randomize an increasing number of its links

A quantitative measure of modularity:

Modularity of a partition:
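The formula is not reproduced in these notes; a commonly used form (Newman-Girvan modularity, with notation assumed here) is

M = \sum_{s} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]

where l_s is the number of links inside module s, d_s is the total degree of the nodes in module s, and L is the total number of links in the network.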

Finding the maximum modularity is a difficult (NP-complete) combinatorial optimization problem.
There are problems with modularity maximization:

1. Resolution limit

➔Modularity optimization fails to identify modules smaller than a scale which depends
on the total size of the network and on the degree of interconnectedness of the
modules, even in cases where modules are unambiguously defined.

2. Hierarchically organized modularity

➔ Modular structure may be hierarchical (modules within modules), and modularity maximization only captures one scale (or, worse, a mixture of scales)

Infomap is a very accurate algorithm not based on modularity maximization

Are real networks modular?

➔ Problem: if you look for modules, you find them

➔ Solution: obtain the modularity M for the real network, and compare M to the distribution of modularities in an ensemble of random networks with the same degree sequence as the real network (configuration model); a minimal sketch of this comparison follows below
Modules provide a convenient way to summarize network data

Connectors that span several modules are often key for system-wide behavior

The modular structure of a network determines its dynamic behavior

Computational Metabolism (Topic 5)


Metabolism: the set of all biochemical reactions inside a cell. Most metabolic reactions are catalyzed by enzymes. Main functions of metabolism:

● Conversion of food/fuel to energy (ATP)

● Conversion of food/fuel to proteins, lipids, nucleic acids, etc

● Elimination of wastes

Basic notion of reactions within a cell:


Role of an enzyme:

A reaction might be thermodynamically favorable, but typically there is an activation state involved that implies the existence of an activation energy barrier:

Changing the kinetic constants (i.e. k1, k-1 in our example) does not change the equilibrium constant Keq

At equilibrium the velocities of both reactions are equal

Modeling enzyme kinetics:

The enzyme must combine with the substrate to yield a product, so that we have the following
steps:

Why?

– High specificity of enzymes to substrates

– Substrates often prevent enzyme inactivation

– Shape of the S curve

Mechanisms

– Lock & key (template) – enzyme is assumed to have an active site (when
folded) whose structure is complementary to S

– Other mechanisms/factors (flexibility, strength of bonds)

Classification: EC numbers
Simple model for enzyme kinetics (Henri 1903 – Michaelis & Menten 1914)

● Henri observed that at the initial stages the velocity of the reaction is proportional to [E] and that it increases non-linearly with [S] up to a maximum rate.

Assumptions:

● E is a catalyst

● E+S react rapidly

● 1 E and 1 S form ES; 1 ES breaks down into 1 P, 1E - equilibrium hypothesis:

● [E],[S] and [ES] are at equilibrium

● [S]>>[E]; ES formation does not deplete S

● Overall rate, limited by the step ES to P + E

● Early stages of reaction: reverse reaction rate negligible.

General procedure to obtain reaction velocities for rapid equilibrium systems:

1. Write reactions involved in S to P

2. Write mass balance for the enzyme

3. Write the velocity-dependent equation: v is proportional to the concentration of the product-forming species [ES] times the kinetic constant (kp)

4. Divide by [E]t (the total enzyme concentration)

5. Express the concentration of each species in terms of that of the free enzyme [E]

6. Substitute into equation for v
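Applying these steps to the simple scheme E + S ⇌ ES → E + P under the rapid-equilibrium assumptions gives the familiar Henri-Michaelis-Menten equation (a sketch in the usual textbook notation):

K_S = \frac{[E][S]}{[ES]}, \qquad [E]_t = [E] + [ES], \qquad v = k_p [ES]

\Rightarrow \quad v = \frac{k_p [E]_t [S]}{K_S + [S]} = \frac{V_{\max}[S]}{K_S + [S]}, \qquad V_{\max} = k_p [E]_t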

Velocity equations in general:

Velocity only depends on 'intermediate species' ESA, ES, ESI:

E. coli's metabolism:

● ~2000 reactions

● ~800 substrates

● ~1400 enzymes
This means we have at least 4000 parameters for Henri-Michaelis-Menten-like equations!
The Law of Mass Action:

• The Law of Mass Action states that the reaction rate is proportional to the probability of a
collision of the reactants.

• This probability is in turn proportional to the concentration of the reactants raised to the power of their molecularity, i.e. the number of molecules of each species that enter the specific reaction

In general, for substrate concentrations [Si] with 'molecularities' {ni} and product concentrations [Pj] with molecularities {mj}, the forward and backward rates are products of the concentrations raised to these powers:
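(Standard mass-action form, written here for reference:)

v = v_+ - v_- = k_+ \prod_i [S_i]^{n_i} - k_- \prod_j [P_j]^{m_j}

and setting the net rate to zero at equilibrium gives

K_{eq} = \frac{k_+}{k_-} = \frac{\prod_j [P_j]_{eq}^{m_j}}{\prod_i [S_i]_{eq}^{n_i}}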

• The equilibrium constant Keq characterizes the ratio of substrate and product concentrations
in equilibrium (Seq and Peq), that is, the state with equal forward and backward rates.

• The dynamics of the concentrations can be described by ordinary differential equations (ODEs), e.g. for the reaction S1 + S2 → 2P:
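(For this reaction the ODEs take the standard mass-action form, reconstructed here:)

\frac{d[S_1]}{dt} = \frac{d[S_2]}{dt} = -v, \qquad \frac{d[P]}{dt} = 2v, \qquad v = k_+ [S_1][S_2] - k_- [P]^2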

• (The time course of S1,S2 and P is obtained by integrating these ODEs)

Stoichiometric coefficients:

Stoichiometric coefficients denote the proportions of substrates and products involved in a reaction.

Stoichiometric Matrix:

The stoichiometric coefficients nij assigned to the m substances Si and the r reactions vj can be expressed by the so-called stoichiometric matrix N, where each column belongs to a reaction and each row to a substance.

[Figure: a network of reactions and its stoichiometric matrix]
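As a toy illustration (my own example, not the one in the lecture figure): for the reactions v1: A → B and v2: B + C → D, with rows ordered A, B, C, D, the stoichiometric matrix is

N = \begin{pmatrix} -1 & 0 \\ 1 & -1 \\ 0 & -1 \\ 0 & 1 \end{pmatrix}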

Mathematical description of a metabolic system:

Vector of concentration values

Vector of reaction rates

Parameter Vector

Stoichiometric matrix N, typically obtained via a metabolic reconstruction, so that the balance equation can be rewritten as shown below.
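(The balance equation referred to here is usually written as:)

\frac{d\mathbf{X}}{dt} = N \cdot \mathbf{v}(\mathbf{X}; \mathbf{p})

where X is the vector of concentrations, v the vector of reaction rates, and p the parameter vector.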

Flux Balance Analysis:

The stoichiometric analysis can be constrained in various ways to simplify the resolution of the
system, and to limit the solution space.

One of the techniques used to analyze the complete metabolic genotype of a microbial strain is
FBA:

1. It relies on balancing metabolic fluxes

2. It is based on the fundamental law of mass conservation

3. It is performed under steady-state conditions (time derivatives equal to zero)

4. It requires information only about:

- the stoichiometry of metabolic pathways;

- metabolic demands;

- and a few strain specific parameters

It does NOT require enzymatic kinetic data


● The time constants characterizing metabolic transients are typically very rapid compared to
the time constants of cell growth, and the transient mass balances can be simplified to only
consider the steady-state behavior (dX/dt=0).

This yields a condition that is analogous to Kirchhoff's law for circuits (charge cannot accumulate = metabolites cannot accumulate)

● This problem is obviously underdetermined: we want to find the set of fluxes that solve the problem, but the number of equations m (number of metabolites) is smaller than the number of reactions r, and we have one flux per reaction!

To find a specific 'phenotype' (a set of reaction fluxes that is a solution) we need to introduce constraints (some fluxes must be positive/negative and satisfy mass balance) and an objective function.

Metabolic capability of an organism:

● Metabolic genotype: the group of solutions to the steady-state equation determines the metabolic capabilities of a given organism (or the metabolic reconstruction for that organism).
● Metabolic phenotype: solutions compatible with the expression of a given set of genes.

● Metabolic flexibility is the manifestation of two principal properties:

– Stoichiometric redundancy means that the network can redistribute flexibly its
metabolic fluxes

– Robustness is the ability of the network to adjust to decreased fluxes through a particular enzyme without significant changes in overall metabolic function
Advantages and disadvantages of FBA

Advantages:

It relies solely on stoichiometric characteristics (can be used on any fully sequenced/characterized organism)

Does not need kinetic parameters (that are difficult to obtain)

Disadvantages:

It does not uniquely specify the fluxes (the particular flux distribution chosen by the
cell is a function of regulatory mechanisms that determine the kinetic characteristics of
enzymes/enzyme expression)

Sometimes disagrees with experimental data (discrepancies can often be accounted for
when regulatory loops are considered)

Cannot be used for modeling dynamic behavior (it is complementary to kinetic models)

Building a genome-scale metabolic network model:

Reactions are associated with specific genes


Block II
Quantitative analysis of biomedical data
1. Introduction to statistical learning
Statistical learning refers to a set of methods and techniques used to analyze and
make predictions or decisions based on data.

Suppose we have the following situations:

- A drug company wants to increase sales by changing the amount spent in different types of
advertising;

-The amount of sales is our dependent variable or response.

-The dependent variable is usually denoted by Y

-The amounts spent on TV, radio and newspaper advertising are our independent
variables or features

The features are usually denoted as X1 , X2 ...

The goal of statistical learning: Relate features (input) to response (output)
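In the usual notation (the exact parameterization is an assumption on my part; it matches the description of f below), this relationship is written

Y = f(X; \theta) + \varepsilon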


f is a fixed but unknown function that represents:

● The systematic information that X (and the parameters θ) yield about Y

● The stochastic influence of other factors ε that can only be modeled probabilistically
The goal of statistical learning is to estimate f

Why would we want to estimate f?

1. PREDICTION We often want to make predictions of Y (e.g. predict sales of our drug
if we double the amount of TV advertising, or predict if a person will need to stay
at the ICU) In terms of prediction, we often treat f as a “black box”

2. INSIGHT Understand the relationship between X and Y:

● Which features are associated with the response?


● What is the relationship between the response and each feature?
● Can the relationship between Y and each predictor be adequately summarized using a linear equation? Is the relationship more complicated?
A black-box model is not enough to address these questions.

3. Often, a combination of PREDICTION and INSIGHT

Let’s consider (again) the drug company:

Linear regression assumes a linear relationship between the response (sales) and the features (advertising in TV, radio and newspaper).

SIMPLE LINEAR REGRESSION
Which media contribute to sales? We can examine the p-values associated with each
predictor’s t-statistic

How large is the effect of each medium on sales? We can look at the confidence intervals for
the coefficients βj

Is there synergy among the advertising media? We can add an interaction term

But how did we get to these answers? What do we assume in a linear regression?

1. We assume a linear relationship between feature(s) and response (formula after this list):

2. We use least squares to obtain the values of the parameters: we choose the parameters that minimize the mean squared error, MSE, a measure of error (why MSE? See the formula after this list).

3. We select a number of features. Consider what happens to the MSE if we compare two different linear models:
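For reference, a standard way to write the linear model of point 1 and the MSE of point 2 (ISL-style notation, reconstructed here):

Y \approx \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2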

Answering this “model-selection” question opens the door to all other questions

Loosely speaking: best model = lowest error

But which error do we care about?

For the training set, the error always decreases with more complex models

But if we try on a new set (test set), at some point the MSE grows: Overfitting

Let’s consider the common case in which the stochastic term is additive with 0 mean:

Our estimate for the unknown function is:

Then, if we generate many different test sets, our predictions will have an error. This is a general result: bias and variance.

Expected test MSE: the error we would obtain if we repeatedly estimated f using a large number of training sets. It decomposes as follows:
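(Standard decomposition, reconstructed here; x0 denotes a test point and f̂ the estimated function:)

E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)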

Variance: Amount by which the estimated f^ would change if we used a different training data
set

Bias: Error that is introduced by approximating the real f by the estimated f^

But what should we do when we are not given a test set?

Cross-validation

A class of methods that estimate the test error rate by holding out a subset of the
training observations from the fitting process, and then applying the statistical learning method
to those held out observations

1. Validation set approach:

Hold out some of the training data and use it as your validation set

Training set and test set of comparable size

Drawbacks:
● We “waste” a lot of data that could be used for training, so we overestimate
the error (bias)
● The result is highly dependent on the validation set you happen to choose

2. Leave-one-out cross-validation:

Hold out only one point, calculate error and then repeat for each point

Drawbacks:

● Unfeasible if the dataset is large


3. k-fold cross-validation:

Split the data set into k subsets (typically k=5 or k=10); then use k-1 subsets for training and 1 for testing. Repeat with each subset as the test set (a minimal code sketch follows after this list).

● Not as biased and variable as the validation set approach


● Not as costly as the leave-one-out
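A minimal scikit-learn sketch of k-fold cross-validation (the synthetic data and the linear model are placeholders chosen for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# 5-fold cross-validation: estimate the test MSE without a separate test set
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print(-scores.mean())   # cross-validated estimate of the test MSE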

1. Sometimes Y values are not real numbers but categories: Classification

Instead of the MSE, it is common to use the error rate (ER), defined in terms of an indicator function:
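(Standard definition, reconstructed here in ISL-style notation:)

\mathrm{ER} = \frac{1}{n} \sum_{i=1}^{n} I\left(y_i \neq \hat{y}_i\right)

where I(y_i ≠ ŷ_i) is an indicator variable that equals 1 if the prediction is wrong and 0 otherwise.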

The bias-variance trade-off is similar to regression settings.
2. Non-parametric methods

Non-parametric methods do not explicitly assume a parametric form for f(X), and thereby
provide an alternative and more flexible approach for performing regression

Here we consider one of the simplest and best-known non-parametric methods, K-nearest
neighbors (KNN)

KNN can be used for regression or classification

K-nearest neighbors regression: given a value for K and a prediction point x, KNN regression first identifies the K training observations that are closest to x, represented by N(x). It then estimates f(x) using the average of all the training responses in N(x).
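In formula form (standard KNN notation):

\hat{f}(x) = \frac{1}{K} \sum_{x_i \in N(x)} y_i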

Quantitative analysis of biomedical data


2. Classification
Examples of classification problems

1. A person arrives at the emergency room with a set of symptoms that could
possibly be attributed to one of three medical conditions. Which of the three
conditions does the individual have?

2. On the basis of DNA sequence data for a number of patients with and without a
given disease, a biologist would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.

3. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user's IP address, past transaction history, and so forth.
Problems with KNN for classification

Choice of K (cross-validation?)

Problematic with high-dimensional feature spaces and/or sparse training sets

It provides no understanding whatsoever (e.g. of which features are important)

Let’s look at other, more nuanced methods

1. Logistic regression

Predict default on a credit card based on account balance and income of the person

Why not just linear regression?

We want a model of the conditional probability of Y=1 given the value of X:

Again, we cannot simply use a linear model of this probability, because a linear model does not generate probabilities (values between 0 and 1). We need an S-shaped function that does!
The logistic function is one such function.

In logistic regression:
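With a single feature X (balance), the logistic regression model is usually written as (standard notation, reconstructed here):

p(X) = \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \qquad \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X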

Making predictions : What is the probability that a person with $1000 balance defaults?

When we have more than 2 categories for Y, logistic regression is not as natural and is less frequently used. Also, in some situations the parameters in logistic regression are unstable.

2. Linear and quadratic discriminant analysis (LDA & QDA)


Problem setup:
Now we have K different categories, that is, Y can take K different values k=1, 2, …, K.
We want to be able to estimate the probability that a given instance belongs to class
Y=k given some features X:
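Discriminant analysis approaches this through Bayes' theorem (standard notation, reconstructed here): with prior probabilities π_k and class-conditional densities f_k(x),

\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}

LDA assumes each f_k(x) is Gaussian with a class-specific mean and a shared covariance matrix; QDA allows a class-specific covariance matrix.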
Quantitative analysis of biomedical data
3. Tree-based methods and support vector machines

1. Decision trees:

Our features are X1 and X2 , and we wish to classify yellow vs green

We segment the feature space into a number of simple regions

Since the set of splitting rules used to segment the predictor space can be summarized in a
tree, these types of approaches are known as decision tree methods
Step 1. Choose the split that results in the highest “purity”

Step 2. Iterate, for each region:



Step 3. Prune the tree
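A minimal scikit-learn sketch of fitting and pruning a classification tree (the dataset and the pruning parameter are placeholders chosen for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Recursive binary splitting chooses, at each node, the split that most increases purity
# (decreases Gini impurity, the default criterion); ccp_alpha > 0 applies
# cost-complexity pruning to the grown tree.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())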

Linear model or decision tree?

Advantages and disadvantages of classification trees:

▲ Trees are very flexible (e.g. linear or nonlinear decision boundaries)

▲ Trees are very easy to explain to people, even easier than linear regression!

▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small)

▲ Trees can easily handle qualitative predictors

▼ Trees generally do not have the same level of predictive accuracy as some of the other
classification approaches

▼ Trees can be very non-robust: a small change in the data can cause a large change in the
final estimated tree (high variance)
-Bagging of regular decision trees still has a major limitation:

All bagged trees tend to be very similar to each other: Using similar features, in similar
order, with similar splits…

Highly correlated trees = very redundant (not independent) predictions

Random forests address this limitation

- Decorrelating bagged trees: random forests

Instead of looking at the split with the highest purity (or lowest squared error, in regression) among all p features, consider, at each split, only a subset of m < p features
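In scikit-learn this per-split feature subsampling corresponds to the max_features parameter (a minimal sketch; the dataset and parameter values are placeholders chosen for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 500 bagged trees; each split considers only sqrt(p) of the p features (decorrelation)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())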

The Random forest is an excellent algorithm:

● Very flexible

● Does not overfit (or at least, not much)

● Always very good results (although perhaps not the best)

2. Support vector machines (SVM)

We will be discussing three methods generally called “support vector machines”

Maximal margin classifier

Support vector classifier

Support vector machine (SVM, proper)


The starting point for all three methods is the idea of “separating hyperplanes”

In a p-dimensional space, a hyperplane is a flat subspace of dimension p−1:

● In 2D, a hyperplane is a line

● In 3D, a hyperplane is a plane

A hyperplane is defined by an equation of the following form:
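(Standard notation, reconstructed here:)

\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0

and a point x is classified according to the sign of β_0 + β_1 x_1 + ... + β_p x_p.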

In this particular example:

Maximum margin classifier:

Prefer a classifier that does not separate perfectly in the interest of:

● Greater robustness to individual observations

● Better classification of most of the training observations

Effect of C

High C:

● Many support vectors

● Low variance

● High bias

Low C:

● Fewer support vectors

● Higher variance

● Low bias
SVM with different nonlinear kernels:

Extensions of the SVM:


Quantitative analysis of biomedical data
4. Introduction to deep learning
Fitting (learning) the parameters (weights)

In convolutional neural networks we combine several layers of convolution:


Convolutional neural networks are very efficient representations, allowing us to use far fewer weights:

● By exploiting the low-dimensional structure of the data (2D structure in images): no all-to-all
connections

● By reusing weights: only a few shared weights in each filter

Therefore, they are very powerful for datasets with “spatial” structure
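A back-of-the-envelope illustration of the weight savings (my own toy numbers, assuming a 28x28 grayscale input):

# Fully connected layer mapping a 28x28 image to 28x28 hidden units: all-to-all weights
dense_weights = (28 * 28) * (28 * 28)   # = 614,656 weights (plus biases)

# Convolutional layer with 32 filters of size 3x3 on a single input channel:
# the same few weights are reused at every spatial position
conv_weights = 32 * (3 * 3 * 1)         # = 288 weights (plus 32 biases)

print(dense_weights, conv_weights)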
