Computational Biology and Biomedical Data Analysis
The central dogma of molecular biology explains how information flows from DNA through RNA to
proteins, and from there to cellular function.
Metabolism is the web of interconnected chemical reactions that use nutrients to produce
energy and the necessary building blocks for cellular functioning and growth.
Metabolism and the central dogma are interconnected in living organisms. Metabolism
provides the energy and building blocks necessary for DNA replication, RNA synthesis, and
protein synthesis. The flow of genetic information in the central dogma depends on the
availability of energy generated by metabolism.
Comparative Genomics and Sequence Alignment:
Comparison of biological sequences (DNA, RNA or proteins)
Comparative Genomics:
Goal: To assess the degree of similarity between biological sequences
- To determine the function and/or structure of new sequences (by comparing them to sequences
with known function)
- Orthologous sequences are homologous sequences found in two different species (they diverged
through speciation).
Sequence alignment:
Sequence alignment can be used to find ‘conserved patterns’, i.e. similarity patterns, between
pairs and groups of sequences (in DNA and RNA).
-BLAST
ChIP-seq data – a method used to analyze protein interactions with DNA (high quality but
low coverage)
Computational inference methods – the goal is to obtain a ‘reliable’ gene regulatory network
(GRN) from the available experimental data.
Bipartite graphs: two types of nodes; edges run between nodes of different types
Modularity of a partition:
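Assuming the usual Newman-Girvan form (with L the total number of links, l_s the number of links inside module s, and d_s the total degree of the nodes in module s):

    M = \sum_s \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]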
1. Resolution limit
➔ Modularity optimization fails to identify modules smaller than a scale which depends
on the total size of the network and on the degree of interconnectedness of the
modules, even in cases where modules are unambiguously defined.
➔ Solution: obtain the modularity M for the real network, then compare M to the
distribution of modularities in an ensemble of random networks with the same degree
sequence as the real network (the configuration model)
Modules provide a convenient way to summarize network data
Connectors that span several modules are often key for system-wide behavior
● Elimination of wastes (one of the functions of metabolism)
Changing the kinetic constants (e.g. k_1 and k_{-1} in our example) does not change the
equilibrium constant K_{eq} = k_1 / k_{-1}
The enzyme must combine with the substrate to yield a product, so that we have the following
steps:
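Presumably the standard scheme, with rate constants k_1, k_{-1} and k_2:

    E + S \overset{k_1}{\underset{k_{-1}}{\rightleftharpoons}} ES \overset{k_2}{\rightarrow} E + P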
Why?
Mechanisms
– Lock & key (template) – the enzyme is assumed to have an active site (when
folded) whose structure is complementary to S
Classification: EC numbers
Simple model for enzyme kinetics (Henri 1903 – Michaelis & Menten 1914)
● Henri observed that at initial stages the velocity of the reaction is proportional to [E] and
that it increases non-linearly with [S] up to a maximum rate.
Assumptions:
● E is a catalyst
4. Divide by [E]_t
5. Express the concentration of each species in terms of that of the free enzyme [E]
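Carrying these steps through (the standard quasi-steady-state derivation) yields the Henri-Michaelis-Menten rate law:

    v = \frac{V_{max}[S]}{K_M + [S]}, \quad V_{max} = k_2 [E]_t, \quad K_M = \frac{k_{-1} + k_2}{k_1}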
E. coli's metabolism:
● ~2000 reactions
● ~800 substrates
● ~1400 enzymes
This means we have at least ~4,000 parameters for Henri-Michaelis-Menten-like equations!
The Law of Mass Action:
• The Law of Mass Action states that the reaction rate is proportional to the probability of a
collision of the reactants.
• This probability is in turn proportional to the concentrations of the reactants, each raised to
the power of its molecularity, i.e. the number of molecules of that species entering the specific reaction
In general, for substrate concentrations [S_i] with 'molecularities' {n_i} and product
concentrations [P_j] with molecularities {m_j},
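the net reaction rate takes the standard mass-action form:

    v = k_{+} \prod_i [S_i]^{n_i} - k_{-} \prod_j [P_j]^{m_j}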
• The equilibrium constant Keq characterizes the ratio of substrate and product concentrations
in equilibrium (Seq and Peq), that is, the state with equal forward and backward rates.
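In this notation, the standard mass-action result is:

    K_{eq} = \frac{k_{+}}{k_{-}} = \frac{\prod_j [P_j]_{eq}^{m_j}}{\prod_i [S_i]_{eq}^{n_i}}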
Stoichiometric coefficients:
Stoichiometric Matrix:
The stoichiometric coefficients n_ij assigned to the m substances S_i and the r reactions v_j can
be expressed by the so-called stoichiometric matrix N = (n_ij)
Parameter Vector
Stoichiometric matrix N – typically obtained via a metabolic reconstruction – so that the balance
equation can be rewritten as:
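In the usual vector notation (with S the vector of metabolite concentrations and v the vector of reaction fluxes):

    \frac{d\mathbf{S}}{dt} = N \mathbf{v}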
The stoichiometric analysis can be constrained in various ways to simplify the resolution of the
system, and to limit the solution space.
One of the techniques used to analyze the complete metabolic genotype of a microbial strain is
flux balance analysis (FBA):
- metabolic demands;
Which yields a condition that is analogous to Kirchhoff's law for circuits (charge cannot
accumulate = metabolites cannot accumulate)
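Presumably the steady-state condition, in which metabolite concentrations do not change:

    N \mathbf{v} = 0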
● This problem is obviously underdetermined – we want to find the set of fluxes that solve
the problem, but the number of equations m (number of metabolites) is smaller than the number
of reactions r – and we have one flux per reaction!
To find the specific 'phenotype' (the set of reaction fluxes that is a solution) we need to introduce
constraints (some fluxes must be positive/negative and satisfy mass balance) and an objective
function, as in the sketch below.
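A minimal FBA sketch in Python, assuming a hypothetical two-metabolite toy network (not the real E. coli reconstruction) and scipy's linear-programming solver; the network, bounds and objective here are illustrative assumptions:

    # Toy flux balance analysis: maximize an assumed "biomass" flux v3
    # subject to N v = 0 and flux bounds (all values hypothetical).
    import numpy as np
    from scipy.optimize import linprog

    # Rows = metabolites (A, B); columns = reactions:
    # v1: -> A (uptake), v2: A -> B, v3: B -> (biomass export)
    N = np.array([[1, -1,  0],
                  [0,  1, -1]])

    bounds = [(0, 10), (0, 10), (0, 10)]   # assumed flux bounds
    c = np.array([0, 0, -1])               # linprog minimizes, so negate v3

    res = linprog(c, A_eq=N, b_eq=np.zeros(2), bounds=bounds, method="highs")
    print("optimal fluxes:", res.x)        # expect v1 = v2 = v3 = 10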
● Metabolic genotype: the group of solutions to the steady-state equation determines the
metabolic capabilities of a given organism (or the metabolic reconstruction for that organism)
● Metabolic phenotype: solutions compatible with the expression of a given set of genes.
– Stoichiometric redundancy means that the network can flexibly redistribute its
metabolic fluxes
Advantages:
Disadvantages:
- It does not uniquely specify the fluxes (the particular flux distribution chosen by the
cell is a function of regulatory mechanisms that determine the kinetic characteristics of
enzymes / enzyme expression)
- Sometimes disagrees with experimental data (discrepancies can often be accounted for
when regulatory loops are considered)
- A drug company wants to increase sales by changing the amounts spent on different types of
advertising;
- The amounts spent on TV, radio and newspaper advertising are our independent
variables or features
● The stochastic influence of other factors ε that can only be modeled probabilistically
The goal of statistical learning is to estimate f in the model Y = f(X) + ε
1. PREDICTION: We often want to make predictions of Y (e.g. predict sales of our drug
if we double the amount of TV advertising, or predict whether a person will need to stay
in the ICU). In terms of prediction, we often treat f as a “black box”
Linear regression assumes a linear relationship between the response (sales) and the features
(advertising in TV, radio and newspaper).
SIMPLE LINEAR REGRESSION
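In the standard one-predictor form:

    Y \approx \beta_0 + \beta_1 X, \quad \text{e.g. } sales \approx \beta_0 + \beta_1 \cdot TV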
Which media contribute to sales? We can examine the p-values associated with each
predictor’s t-statistic
How large is the effect of each medium on sales? We can look at the confidence intervals for
the coefficients βj
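A hedged sketch of how these two questions can be checked in Python with statsmodels; the data are synthetic and the coefficients are made-up assumptions:

    # Fit sales ~ TV + radio + newspaper on synthetic data and inspect
    # p-values and 95% confidence intervals for each coefficient.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 100, size=(200, 3))   # columns: TV, radio, newspaper spend
    sales = 3 + 0.05*X[:, 0] + 0.10*X[:, 1] + rng.normal(0, 1, 200)  # newspaper: no effect

    fit = sm.OLS(sales, sm.add_constant(X)).fit()   # add_constant adds beta_0
    print(fit.pvalues)     # which media contribute to sales?
    print(fit.conf_int())  # how large is each effect (95% CI for each beta_j)?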
But how did we get to these answers? What do we assume in a linear regression?
MSE = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 (mean squared error, our measure of error)
Answering this “model-selection” question opens the door to all other questions
For the training set, the error always decreases with more complex models
But if we try on a new set (test set), at some point the MSE grows: Overfitting
Let’s consider the common case in which the stochastic term is additive with 0 mean: Y = f(X) + ε, with E[ε] = 0.
Then, if we generate many different test sets, our predictions will have an expected error:
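Assuming the standard bias-variance decomposition, at a test point x_0:

    E[(y_0 - \hat{f}(x_0))^2] = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\varepsilon)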
This is a general result. Bias and variance:
Expected test MSE: the average test MSE we would obtain if we repeatedly estimated f using a
large number of training sets
Bias: the error introduced by approximating a possibly complicated real-life relationship with a
much simpler model
Variance: the amount by which the estimate f^ would change if we used a different training data
set
Cross-validation
A class of methods that estimate the test error rate by holding out a subset of the
training observations from the fitting process, and then applying the statistical learning method
to those held-out observations
1. Validation set approach: hold out some of the training data and use it as your validation set
Drawbacks:
● We “waste” a lot of data that could be used for training, so we overestimate
the error (bias)
● The result is highly dependent on the validation set you happen to choose
2. Leave-one-out cross-validation:
Hold out only one point, calculate error and then repeat for each point
Drawbacks:
3. k-fold cross-validation:
Split the data set into k subsets (typically k = 5 or k = 10); then use k−1 subsets for training and 1
for testing. Repeat with each subset as the test set.
The estimate is CV_{(k)} = \frac{1}{k}\sum_{i=1}^{k} MSE_i, where MSE_i is the test error on the i-th held-out fold.
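As an illustration (a sketch with scikit-learn on synthetic data; k = 5 is an assumption):

    # Estimate the test MSE of linear regression by 5-fold cross-validation.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

    # scoring="neg_mean_squared_error" returns -MSE per fold; negate the mean.
    scores = cross_val_score(LinearRegression(), X, y, cv=5,
                             scoring="neg_mean_squared_error")
    print("CV estimate of test MSE:", -scores.mean())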
Non-parametric methods do not explicitly assume a parametric form for f(X), and thereby
provide an alternative and more flexible approach for performing regression
Here we consider one of the simplest and best-known non-parametric methods, K-nearest
neighbors (KNN)
K-nearest neighbors regression: given a value for K and a prediction point x_0, KNN regression
first identifies the K training observations that are closest to x_0, represented by N(x_0). It then
estimates f(x_0) using the average of the training responses in N(x_0).
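A small sketch with scikit-learn (the data and the choice K = 5 are illustrative assumptions):

    # KNN regression: predict f(x0) as the average response of the K nearest
    # training points.
    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

    knn = KNeighborsRegressor(n_neighbors=5)   # K = 5; tune K via cross-validation
    knn.fit(X, y)
    print(knn.predict([[0.5]]))                # average of the 5 nearest responses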
1. A person arrives at the emergency room with a set of symptoms that could
possibly be attributed to one of three medical conditions. Which of the three
conditions does the individual have?
2. On the basis of DNA sequence data for a number of patients with and without a
given disease, a biologist would like to figure out which DNA mutations are
deleterious (disease-causing) and which are not.
Choice of K (cross-validation?)
Problematic with high-dimensional feature spaces and/or sparse training sets
1. Logistic regression
Example: predict default on a credit card based on the account balance and income of the person.
We cannot simply use a linear model for the probability of default, because it does not generate
probabilities (values between 0 and 1). We need an S-shaped function that does!
The logistic function is one such function
In logistic regression:
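In the standard one-predictor form:

    p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}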
Making predictions : What is the probability that a person with $1000 balance defaults?
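A hedged sketch in Python with scikit-learn; the balance data and the "true" coefficients are synthetic assumptions, not the actual Default data set:

    # Fit logistic regression on synthetic balance/default data and
    # predict P(default | balance = $1000).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    balance = rng.uniform(0, 3000, size=(1000, 1))
    p_true = 1 / (1 + np.exp(-(-8 + 0.005 * balance[:, 0])))  # assumed ground truth
    default = rng.random(1000) < p_true

    clf = LogisticRegression(max_iter=1000).fit(balance, default)
    print(clf.predict_proba([[1000]])[0, 1])   # P(default | balance = 1000)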
When we have more than 2 categories for Y, logistic regression is not as natural and is less
frequently used. Also, in some situations the parameters in logistic regression are unstable.
1. Decision trees:
Since the set of splitting rules used to segment the predictor space can be summarized in a
tree, these types of approaches are known as decision tree methods
Step 1. Choose the split that results in the highest “purity”
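A minimal illustration with scikit-learn (using Gini impurity as the purity measure, which is scikit-learn's default and an assumption here):

    # Fit a shallow decision tree; each split is chosen to maximize "purity"
    # (equivalently, minimize Gini impurity) among all candidate splits.
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = load_iris(return_X_y=True)
    tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
    print(export_text(tree))   # the splitting rules, summarized as a tree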
▼ Trees generally do not have the same level of predictive accuracy as some of the other
classification approaches
▼ Trees can be very non-robust: a small change in the data can cause a large change in the
final estimated tree (high variance)
- Bagging of regular decision trees still has a major limitation:
All bagged trees tend to be very similar to each other, using similar features, in similar
order, with similar splits…
Random forests address this: instead of looking at the split with the highest purity (or lowest
squared error, in regression) among all p features, consider, at each split, only a random subset
of m < p features (see the sketch below)
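A sketch of this idea with scikit-learn (data set and settings are illustrative):

    # Random forest: bagged trees that consider only a random subset of
    # features (max_features) at each split, which decorrelates the trees.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0)
    rf.fit(X, y)
    print(rf.score(X, y))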
● Very flexible
Prefer a classifier that does not separate the training data perfectly, in the interest of greater
robustness to individual observations and better classification of most of the training observations.
Effect of C (the budget for margin violations):
High C:
● Low variance
● High bias
Low C:
● Higher variance
● Low bias
SVM with different nonlinear kernels:
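A small illustration with scikit-learn (data and settings are assumptions; note that scikit-learn's C parameter is, roughly, the inverse of the "budget" C above):

    # Train SVMs with different kernels on a nonlinearly separable data set.
    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    for kernel in ("linear", "poly", "rbf"):
        clf = SVC(kernel=kernel, C=1.0).fit(X, y)
        print(kernel, clf.score(X, y))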
Convolutional neural networks:
● They exploit the low-dimensional structure of the data (the 2D structure in images): no all-to-all
connections
Therefore, they are very powerful for datasets with “spatial” structure