
Computational Biology and Biomedical Data Analysis
The central dogma of molecular biology explains how information is transmitted from DNA to RNA to proteins, and ultimately to function.

Metabolism is the web of interconnected chemical reactions that use nutrients to produce
energy and the necessary building blocks for cellular functioning and growth.

Metabolism and the central dogma are interconnected in living organisms. Metabolism
provides the energy and building blocks necessary for DNA replication, RNA synthesis, and
protein synthesis. The flow of genetic information in the central dogma depends on the
availability of energy generated by metabolism.
Comparative Genomics and Sequence Alignment:
Comparison of biological sequences (DNA, RNA or proteins)

Sequence: ordered set of n elements/characters representing nucleotides or amino acids.

Proteins are composed of twenty amino acids.

Comparative Genomics:
Goal: To assess the degree of similarity between biological sequences

Why similarity? What can similarity tell us about biology?

- To detect homology (shared evolutionary ancestry of two sequences)

- To reconstruct evolutionary history

- To determine the function and/or structure of new sequences (by comparing them to sequences with known function)

PROKARYOTES (Bacteria and Archaea) vs. EUKARYOTES (Plants, animals and fungi):

- Small cells, no nucleus vs. larger cells (often multicellular) with a nucleus and specialized organelles
- No introns vs. introns
- Not much intergenic DNA vs. lots of intergenic DNA
- Typically 1-10 Mb genomes vs. 100 Mb - 100 Gb genomes


Homology and similarity between sequences:
The limited alphabet in biological sequences can lead to similarities between sequences by
chance. Similarity can arise from shared ancestry (homology) or independent convergence.
Homologous sequences share a common ancestor but similarity doesn't imply the same
function.

Homology: orthology and paralogy

- Orthologous sequences are homologous sequences that belong to two different species.

- Orthologs are the result of speciation events.

- Paralogous sequences are homologous sequences within the same species. Paralogs are the result of duplication events.

Sequence alignment:

Sequence alignment can be used to find 'conserved patterns' or similarity patterns between pairs and groups of sequences (in DNA and RNA).

- Global alignment: attempt to align or match every element in a sequence

- Local alignment: attempt to align regions of sequences


Evolutionary events: Point mutations in genes:

Main problems we will discuss in comparative genomics:

I. Similarity of sequences - How can we define similarity?

• The number of matching nucleotides (when aligned)?


• The amount of shared information?
• The “distance” between the two sequences under some metric?

Dot plots are graphical methods to observe similarity between sequences.

Dot Plots - interpretation summary


II. Pairwise sequence alignment
Objective: To find the similarity regions in two sequences S1 and S2.

How: Optimizing a score function to quantify the similarity between two sequences, allowing for evolutionary events by introducing gaps in one or both sequences.
We look for the longest set of adjacent bases that are equal in the pair of sequences.

Fill the matrix with scores and pointers:
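As an illustration of this matrix-filling step, here is a minimal Python sketch of a Needleman-Wunsch-style global alignment fill; the scoring values (match=+1, mismatch=-1, gap=-2) are assumptions chosen for the example, not necessarily the ones used in the course.

import itertools

def global_alignment_score(s1, s2, match=1, mismatch=-1, gap=-2):
    n, m = len(s1), len(s2)
    # (n+1) x (m+1) score matrix; first row/column are cumulative gap penalties
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    # Fill each cell with the best of: diagonal (match/mismatch), up (gap), left (gap)
    for i, j in itertools.product(range(1, n + 1), range(1, m + 1)):
        diag = F[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
        F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[n][m]

print(global_alignment_score("GATTACA", "GCATGCT"))

(In practice one also stores pointers to the winning move in each cell, so the optimal alignment can be recovered by tracing back from the bottom-right corner.)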


III. Comparison of sequences against a database

-BLAST

Biological networks (Topic 3)

Protein-protein interaction networks: mathematical representations of the physical contacts between proteins in the cell.

Gene regulatory networks (GRNs):

ATAC-seq data - technique used in molecular biology to assess genome-wide chromatin accessibility

ChIP-seq data - method used to analyze protein interactions with DNA (high quality but low coverage)

Computational inference methods – the goal is to obtain a ‘reliable’ GRN from the
available experimental data.

Network players and relationships in bipartite networks

Bipartite graphs: two types of nodes; edges run between nodes of different types
[Figure: example bipartite network with nodes labeled 1-7]

Paths between nodes i and k:

- Length 1: number of paths = A_ik (0 in the example)

- Length 2: number of paths = Σ_j A_ij A_jk = (A²)_ik (1 in the example)
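A minimal numpy sketch of this idea (the adjacency matrix below is a made-up toy example, not the one in the lecture figure):

import numpy as np

# Toy adjacency matrix of a small undirected network (assumed example)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

i, k = 0, 2
print(A[i, k])                              # paths of length 1 between i and k
print(np.linalg.matrix_power(A, 2)[i, k])   # paths of length 2 = (A^2)_ik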


Complex biological networks: modularity and group structure (Topic 4)

➔Start with a regular lattice

➔Randomize an increasing number of its links

A quantitative measure of modularity:

Modularity of a partition:
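The formula is not reproduced in these notes; a commonly used form (Newman-Girvan modularity, with notation assumed here) is

M = \sum_{s} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]

where l_s is the number of links inside module s, d_s is the total degree of the nodes in module s, and L is the total number of links in the network.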

Finding the maximum modularity is a difficult (NP-complete) combinatorial optimization problem.
There are problems with modularity maximization:

1. Resolution limit

➔Modularity optimization fails to identify modules smaller than a scale which depends
on the total size of the network and on the degree of interconnectedness of the
modules, even in cases where modules are unambiguously defined.

2. Hierarchically organized modularity

➔ Modular structure may be hierarchical (modules within modules), and modularity maximization only captures one scale (or, worse, a mixture of scales)

Infomap is a very accurate algorithm not based on modularity maximization

Are real networks modular?

➔ Problem: if you look for modules, you find them

➔ Solution: obtain the modularity M for the real network, and compare M to the distribution of modularities in an ensemble of random networks with the same degree sequence as the real network (configuration model); a minimal sketch of this comparison follows below
Modules provide a convenient way to summarize network data

Connectors that span several modules are often key for system-wide behavior

The modular structure of a network determines its dynamic behavior

Computational Metabolism (Topic 5)


Metabolism: the set of all biochemical reactions inside a cell. Most metabolic reactions are catalyzed by enzymes. Main functions of metabolism:

● Conversion of food/fuel to energy (ATP)

● Conversion of food/fuel to proteins, lipids, nucleic acids, etc

● Elimination of wastes

Basic notion of reactions within a cell:


Role of an enzyme:

A reaction might be thermodynamically favorable, but typically there is an activation state involved that implies the existence of an activation energy barrier:

Changing the kinetic constants (i.e. k1, k-1 in our example) does not change the equilibrium constant Keq

At equilibrium the velocities of both reactions are equal

Modeling enzyme kinetics:

The enzyme must combine with the substrate to yield a product, so that we have the following
steps:

Why?

– High specificity of enzymes to substrates

– Substrates often prevent enzyme inactivation

– Shape of the S curve

Mechanisms

– Lock & key (template) – enzyme is assumed to have an active site (when
folded) whose structure is complementary to S

– Other mechanisms/factors (flexibility, strength of bonds)

Classification: EC numbers
Simple model for enzyme kinetics (Henri 1903 – Michaelis & Menten 1914)

● Henri observed that at the initial stages the velocity of the reaction is proportional to [E] and that it increases non-linearly with [S] up to a maximum rate.

Assumptions:

● E is a catalyst

● E+S react rapidly

● 1 E and 1 S form ES; 1 ES breaks down into 1 P, 1E - equilibrium hypothesis:

● [E],[S] and [ES] are at equilibrium

● [S]>>[E]; ES formation does not deplete S

● Overall rate, limited by the step ES to P + E

● Early stages of reaction: reverse reaction rate negligible.

General procedure to obtain reaction velocities for rapid equilibrium systems:

1. Write reactions involved in S to P

2. Write mass balance for the enzyme

3. Write the velocity-dependent equation: v is proportional to the concentration of the product-forming species [ES] times the kinetic constant (kp)

4. Divide by [E]t (the total enzyme concentration)

5. Express the concentration of each species in terms of that of the free enzyme [E]

6. Substitute into equation for v
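Applying these steps to the simple scheme E + S ⇌ ES → E + P under the rapid-equilibrium assumptions gives the familiar Henri-Michaelis-Menten equation (a sketch in the usual textbook notation):

K_S = \frac{[E][S]}{[ES]}, \qquad [E]_t = [E] + [ES], \qquad v = k_p [ES]

\Rightarrow \quad v = \frac{k_p [E]_t [S]}{K_S + [S]} = \frac{V_{\max}[S]}{K_S + [S]}, \qquad V_{\max} = k_p [E]_t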

Velocity equations in general:

Velocity only depends on 'intermediate species' ESA, ES, ESI:

E. coli's metabolism:

● ~2000 reactions

● ~800 substrates

● ~1400 enzymes
This means we have at least 4000 parameters for Henri-Michaelis-Menten-like equations!
The Law of Mass Action:

• The Law of Mass Action states that the reaction rate is proportional to the probability of a
collision of the reactants.

• This probability is in turn proportional to the concentration of the reactants raised to the power of their molecularity, i.e. the number of molecules of each species that enter the specific reaction

In general, for substrate concentrations [Si] with 'molecularities' {ni} and product concentrations [Pj] with molecularities {mj}, the forward and backward rates are products of the concentrations raised to these powers:
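(Standard mass-action form, written here for reference:)

v = v_+ - v_- = k_+ \prod_i [S_i]^{n_i} - k_- \prod_j [P_j]^{m_j}

and setting the net rate to zero at equilibrium gives

K_{eq} = \frac{k_+}{k_-} = \frac{\prod_j [P_j]_{eq}^{m_j}}{\prod_i [S_i]_{eq}^{n_i}}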

• The equilibrium constant Keq characterizes the ratio of substrate and product concentrations
in equilibrium (Seq and Peq), that is, the state with equal forward and backward rates.

• The dynamics of the concentrations can be described by ordinary differential equations (ODEs), e.g. for the reaction S1 + S2 → 2P:
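(For this reaction the ODEs take the standard mass-action form, reconstructed here:)

\frac{d[S_1]}{dt} = \frac{d[S_2]}{dt} = -v, \qquad \frac{d[P]}{dt} = 2v, \qquad v = k_+ [S_1][S_2] - k_- [P]^2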

• (The time course of S1,S2 and P is obtained by integrating these ODEs)

Stoichiometric coefficients:

Stoichiometric coefficients denote the proportions of substrates and products involved in a reaction.

Stoichiometric Matrix:

The stoichiometric coefficients nij assigned to the m substances Si and the r reactions vj can be expressed by the so-called stoichiometric matrix N, where each column belongs to a reaction and each row to a substance.

[Figure: a network of reactions and its stoichiometric matrix]
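As a toy illustration (my own example, not the one in the lecture figure): for the reactions v1: A → B and v2: B + C → D, with rows ordered A, B, C, D, the stoichiometric matrix is

N = \begin{pmatrix} -1 & 0 \\ 1 & -1 \\ 0 & -1 \\ 0 & 1 \end{pmatrix}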

Mathematical description of a metabolic system:

Vector of concentration values

Vector of reaction rates

Parameter Vector

Stoichiometric matrix N, typically obtained via a metabolic reconstruction, so that the balance equation can be rewritten as shown below.
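(The balance equation referred to here is usually written as:)

\frac{d\mathbf{X}}{dt} = N \cdot \mathbf{v}(\mathbf{X}; \mathbf{p})

where X is the vector of concentrations, v the vector of reaction rates, and p the parameter vector.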

Flux Balance Analysis:

The stoichiometric analysis can be constrained in various ways to simplify the resolution of the
system, and to limit the solution space.

One of the techniques used to analyze the complete metabolic genotype of a microbial strain is
FBA:

1. It relies on balancing metabolic fluxes

2. It is based on the fundamental law of mass conservation

3. It is performed under steady-state conditions (time derivatives equal to zero)

4. It requires information only about:

- the stoichiometry of metabolic pathways;

- metabolic demands;

- and a few strain specific parameters

It does NOT require enzymatic kinetic data


● The time constants characterizing metabolic transients are typically very rapid compared to
the time constants of cell growth, and the transient mass balances can be simplified to only
consider the steady-state behavior (dX/dt=0).

This yields a condition that is analogous to Kirchhoff's law for circuits (charge cannot accumulate = metabolites cannot accumulate)

● This problem is obviously underdetermined: we want to find the set of fluxes that solve the problem, but the number of equations m (number of metabolites) is smaller than the number of reactions r, and we have one flux per reaction!

To find a specific 'phenotype' (a set of reaction fluxes that is a solution) we need to introduce constraints (some fluxes must be positive/negative and satisfy mass balance) and an objective function.

Metabolic capability of an organism:

● Metabolic genotype: the group of solutions to the steady-state equation determines the metabolic capabilities of a given organism (or the metabolic reconstruction for that organism).
● Metabolic phenotype: solutions compatible with the expression of a given set of genes.

● Metabolic flexibility is the manifestation of two principal properties:

– Stoichiometric redundancy means that the network can redistribute flexibly its
metabolic fluxes

– Robustness is the ability of the network to adjust to decreased fluxes through a particular enzyme without significant changes in overall metabolic function
Advantages and disadvantages of FBA

Advantages:

It relies solely on stoichiometric characteristics (can be used on any fully sequenced/characterized organism)

Does not need kinetic parameters (that are difficult to obtain)

Disadvantages:

It does not uniquely specify the fluxes (the particular flux distribution chosen by the
cell is a function of regulatory mechanisms that determine the kinetic characteristics of
enzymes/enzyme expression)

Sometimes disagrees with experimental data (discrepancies can often be accounted for
when regulatory loops are considered)

Cannot be used for modeling dynamic behavior (it is complementary to kinetic models)

Building a genome-scale metabolic network model:

Reactions are associated with specific genes


Block II
Quantitative analysis of biomedical data
1. Introduction to statistical learning
Statistical learning refers to a set of methods and techniques used to analyze and
make predictions or decisions based on data.

Suppose we have the following situations:

- A drug company wants to increase sales by changing the amount spent in different types of
advertising;

-The amount of sales is our dependent variable or response.

-The dependent variable is usually denoted by Y

-The amounts spent on TV, radio and newspaper advertising are our independent
variables or features

The features are usually denoted as X1 , X2 ...

The goal of statistical learning: Relate features (input) to response (output)
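In the usual notation (the exact parameterization is an assumption on my part; it matches the description of f below), this relationship is written

Y = f(X; \theta) + \varepsilon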


f is a fixed but unknown function that represents:

● The systematic information that X (and the parameters θ) yield about Y

● The stochastic influence of other factors ε that can only be modeled probabilistically
The goal of statistical learning is to estimate f

Why would we want to estimate f?

1. PREDICTION We often want to make predictions of Y (e.g. predict sales of our drug
if we double the amount of TV advertising, or predict if a person will need to stay
at the ICU) In terms of prediction, we often treat f as a “black box”

2. INSIGHT Understand the relationship between X and Y:

● Which features are associated with the response?


● What is the relationship between the response and each feature?
● Can the relationship between Y and each predictor be adequately summarized using a linear equation? Is the relationship more complicated?
A black-box model is not enough to address these questions.

3. Often, a combination of PREDICTION and INSIGHT

Let’s consider (again) the drug company:

Linear regression assumes a linear relationship between the response (sales) and the features (advertising in TV, radio and newspaper).

SIMPLE LINEAR REGRESSION
Which media contribute to sales? We can examine the p-values associated with each
predictor’s t-statistic

How large is the effect of each medium on sales? We can look at the confidence intervals for
the coefficients βj

Is there synergy among the advertising media? We can add an interaction term

But how did we get to these answers? What do we assume in a linear regression?

1. We assume a linear relationship between feature(s) and response (formula after this list):

2. We use least squares to obtain the values of the parameters: we choose the parameters that minimize the mean squared error, MSE, a measure of error (why MSE? See the formula after this list).

3. We select a number of features. Consider what happens to the MSE if we compare two different linear models:
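For reference, a standard way to write the linear model of point 1 and the MSE of point 2 (ISL-style notation, reconstructed here):

Y \approx \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2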

Answering this “model-selection” question opens the door to all other questions

Loosely speaking: best model = lowest error

But which error do we care about?

For the training set, the error always decreases with more complex models

But if we try on a new set (test set), at some point the MSE grows: Overfitting

Let’s consider the common case in which the stochastic term is additive with 0 mean:

Our estimate for the unknown function is:

Then, if we generate many different test sets, our predictions will have an error. This is a general result: bias and variance.

Expected test MSE: the error we would obtain if we repeatedly estimated f using a large number of training sets. It decomposes as follows:
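(Standard decomposition, reconstructed here; x0 denotes a test point and f̂ the estimated function:)

E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)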

Variance: Amount by which the estimated f^ would change if we used a different training data
set

Bias: Error that is introduced by approximating the real f by the estimated f^

But what should we do when we are not given a test set?

Cross-validation

A class of methods that estimate the test error rate by holding out a subset of the
training observations from the fitting process, and then applying the statistical learning method
to those held out observations

1. Validation set approach:

Hold out some of the training data and use it as your validation set

Training set and test set of comparable size

Drawbacks:
● We “waste” a lot of data that could be used for training, so we overestimate
the error (bias)
● The result is highly dependent on the validation set you happen to choose

2. Leave-one-out cross-validation:

Hold out only one point, calculate error and then repeat for each point

Drawbacks:

● Unfeasible if the dataset is large


3. k-fold cross-validation:

Split the data set into k subsets (typically k=5 or k=10); then use k-1 subsets for training and 1 for testing. Repeat with each subset as the test set (a minimal code sketch follows after this list).

● Not as biased and variable as the validation set approach


● Not as costly as the leave-one-out
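A minimal scikit-learn sketch of k-fold cross-validation (the synthetic data and the linear model are placeholders chosen for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, just for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# 5-fold cross-validation: estimate the test MSE without a separate test set
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring="neg_mean_squared_error")
print(-scores.mean())   # cross-validated estimate of the test MSE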

1. Sometimes Y values are not real numbers but categories: Classification

Instead of the MSE, it is common to use the error rate (ER), defined in terms of an indicator function:
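(Standard definition, reconstructed here in ISL-style notation:)

\mathrm{ER} = \frac{1}{n} \sum_{i=1}^{n} I\left(y_i \neq \hat{y}_i\right)

where I(y_i ≠ ŷ_i) is an indicator variable that equals 1 if the prediction is wrong and 0 otherwise.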

The bias-variance trade-off is similar to regression settings.
2. Non-parametric methods

Non-parametric methods do not explicitly assume a parametric form for f(X), and thereby
provide an alternative and more flexible approach for performing regression

Here we consider one of the simplest and best-known non-parametric methods, K-nearest
neighbors (KNN)

KNN can be used for regression or classification

K-nearest neighbors regression: given a value for K and a prediction point x, KNN regression first identifies the K training observations that are closest to x, represented by N(x). It then estimates f(x) using the average of all the training responses in N(x).
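In formula form (standard KNN notation):

\hat{f}(x) = \frac{1}{K} \sum_{x_i \in N(x)} y_i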

Quantitative analysis of biomedical data


2. Classification
Examples of classification problems

1. A person arrives at the emergency room with a set of symptoms that could
possibly be attributed to one of three medical conditions. Which of the three
conditions does the individual have?

2. On the basis of DNA sequence data for a number of patients with and without a
given disease, a biologist would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.

3. An online banking service must be able to determine whether or not a transaction being performed on the site is fraudulent, on the basis of the user's IP address, past transaction history, and so forth.
Problems with KNN for classification

Choice of K (cross-validation?)

Problematic with high-dimensional feature spaces and/or sparse training sets

It provides no understanding whatsoever (e.g. of which features are important)

Let’s look at other, more nuanced methods

1. Logistic regression

Predict default on a credit card based on account balance and income of the person

Why not just linear regression?

We want a model of the conditional probability of Y=1 given the value of X:

Again, we cannot simply use a linear model of this probability, because a linear model does not generate probabilities (values between 0 and 1). We need an S-shaped function that does!
The logistic function is one such function.

In logistic regression:
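With a single feature X (balance), the logistic regression model is usually written as (standard notation, reconstructed here):

p(X) = \Pr(Y = 1 \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}, \qquad \log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X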

Making predictions : What is the probability that a person with $1000 balance defaults?

When we have more than 2 categories for Y, logistic regression is not as natural and is less frequently used. Also, in some situations the parameters in logistic regression are unstable.

2. Linear and quadratic discriminant analysis (LDA & QDA)


Problem setup:
Now we have K different categories, that is, Y can take K different values k=1, 2, …, K.
We want to be able to estimate the probability that a given instance belongs to class
Y=k given some features X:
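Discriminant analysis approaches this through Bayes' theorem (standard notation, reconstructed here): with prior probabilities π_k and class-conditional densities f_k(x),

\Pr(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}

LDA assumes each f_k(x) is Gaussian with a class-specific mean and a shared covariance matrix; QDA allows a class-specific covariance matrix.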
Quantitative analysis of biomedical data
3. Tree-based methods and support vector machines

1. Decision trees:

Our features are X1 and X2 , and we wish to classify yellow vs green

We segment the feature space into a number of simple regions

Since the set of splitting rules used to segment the predictor space can be summarized in a
tree, these types of approaches are known as decision tree methods
Step 1. Choose the split that results in the highest “purity”

Step 2. Iterate, for each region:



Step 3. Prune the tree
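A minimal scikit-learn sketch of fitting and pruning a classification tree (the dataset and the pruning parameter are placeholders chosen for illustration):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Recursive binary splitting chooses, at each node, the split that most increases purity
# (decreases Gini impurity, the default criterion); ccp_alpha > 0 applies
# cost-complexity pruning to the grown tree.
tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())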

Linear model or decision tree?

Advantages and disadvantages of classification trees:

▲ Trees are very flexible (e.g. linear or nonlinear decision boundaries)

▲ Trees are very easy to explain to people, even easier than linear regression!

▲ Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small)

▲ Trees can easily handle qualitative predictors

▼ Trees generally do not have the same level of predictive accuracy as some of the other
classification approaches

▼ Trees can be very non-robust: a small change in the data can cause a large change in the
final estimated tree (high variance)
-Bagging of regular decision trees still has a major limitation:

All bagged trees tend to be very similar to each other: Using similar features, in similar
order, with similar splits…

Highly correlated trees = very redundant (not independent) predictions

Random forests address this limitation

- Decorrelating bagged trees: random forests

Instead of looking at the split with the highest purity (or lowest squared error, in regression) among all p features, consider, at each split, only a subset of m < p features
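In scikit-learn this per-split feature subsampling corresponds to the max_features parameter (a minimal sketch; the dataset and parameter values are placeholders chosen for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# 500 bagged trees; each split considers only sqrt(p) of the p features (decorrelation)
rf = RandomForestClassifier(n_estimators=500, max_features="sqrt", random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())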

The Random forest is an excellent algorithm:

● Very flexible

● Does not overfit (or at least, not much)

● Always very good results (although perhaps not the best)

2. Support vector machines (SVM)

We will be discussing three methods generally called “support vector machines”

Maximal margin classifier

Support vector classifier

Support vector machine (SVM, proper)


The starting point for all three methods is the idea of “separating hyperplanes”

In a p-dimensional space, a hyperplane is a flat subspace of dimension p−1:

● In 2D, a hyperplane is a line

● In 3D, a hyperplane is a plane

A hyperplane is defined by an equation of the following form:
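(Standard notation, reconstructed here:)

\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p = 0

and a point x is classified according to the sign of β_0 + β_1 x_1 + ... + β_p x_p.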

In this particular example:

Maximum margin classifier:

Prefer a classifier that does not separate perfectly in the interest of:

● Greater robustness to individual observations

● Better classification of most of the training observations

Effect of C

High C:

● Many support vectors

● Low variance

● High bias

Low C:

● Fewer support vectors

● Higher variance

● Low bias
SVM with different nonlinear kernels:

Extensions of the SVM:


Quantitative analysis of biomedical data
4. Introduction to deep learning
Fitting (learning) the parameters (weights)

In convolutional neural networks we combine several layers of convolution:


Convolutional neural networks are very efficient representations, allowing us to use far fewer weights:

● By exploiting the low-dimensional structure of the data (2D structure in images): no all-to-all
connections

● By reusing weights: only a few shared weights in each filter

Therefore, they are very powerful for datasets with “spatial” structure
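A back-of-the-envelope illustration of the weight savings (my own toy numbers, assuming a 28x28 grayscale input):

# Fully connected layer mapping a 28x28 image to 28x28 hidden units: all-to-all weights
dense_weights = (28 * 28) * (28 * 28)   # = 614,656 weights (plus biases)

# Convolutional layer with 32 filters of size 3x3 on a single input channel:
# the same few weights are reused at every spatial position
conv_weights = 32 * (3 * 3 * 1)         # = 288 weights (plus 32 biases)

print(dense_weights, conv_weights)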
