Summary Molecular Systems Biology
Replicates should be independent observations that mimic all variability present between two
samples except the studied treatment.
More replicates will increase the resolution of your results (smaller effects can be discovered)
Randomize your samples to avoid possible systematic bias of unexpected external sources.
Blocking can be used to take expected sources of variation into account (e.g. measurement day)
Pooling your data can be helpful when comparing different treatments but watch out because
outliers or high variance can affect the results
Model assumptions: equal variance among treatments, errors are independent and normally
distributed
The p-value is the probability of observing a test statistic at least as extreme as the measured one
if H0 is true. The smaller the p-value, the stronger the evidence against H0 and in favor of Ha.
A result is considered significant if the p-value lies below a threshold alpha that is chosen in advance.
ANOVA can be used to compare more than two groups (gain of power by combining information)
- Use all observations together to estimate variability
- Sometimes it allows you to estimate the effect of one treatment over all levels of another treatment
- Sometimes it allows you to estimate the interaction between treatments
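A minimal sketch of how a one-way ANOVA uses all observations together to estimate variability: the F statistic is the between-group mean square divided by the pooled within-group mean square. The function name and the example data are assumptions made for illustration. Turning F into a p-value needs the F distribution, which is left out here.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the F statistic of a one-way ANOVA for a list of groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Between-group sum of squares: group means versus the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: pooled over all groups
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical measurements for three treatments
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_anova_f(groups)
```

A large F means the group means differ more than expected from the variability within groups.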
On exam: if asked what test to use in a situation, specify what kind of t-test
Clustering is useful for:
- Identifying groups of possibly co-regulated metabolites/genes
- Identifying new classes of biological samples
- Detecting possible experimental artifacts (e.g. clustering occurs based on the day a sample
was analyzed instead of biological meaning)
- Checking data to see whether samples are clustered according to known categories (a
supervised approach like classification is better for this)
When variables are measured on different scales it is wise to apply standardization to the
observations when studying Euclidean or Manhattan distance.
Autoscaling = subtract mean and divide by standard deviation
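A short sketch of autoscaling for one variable. It assumes the sample standard deviation (n - 1 in the denominator); the function name is made up for this example.

```python
from statistics import mean, stdev

def autoscale(values):
    """Subtract the mean and divide by the sample standard deviation."""
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

scaled = autoscale([2.0, 4.0, 6.0])
```

After autoscaling every variable has mean 0 and standard deviation 1, so no variable dominates the distance just because of its unit.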
Sometimes it’s better to replace the original measurements by ranks (for non-normal distributions,
high probability of outliers, data that can only be ranked)
To study the relationships between objects, make a score plot by projecting the original data on a
plane made from two PCs (usually PC1-PC2). Some information will be lost: it remains in the
dimensions that are not plotted.
To study relationships between variables, make a loadings plot by projecting the original axes as
vectors on a plane made from two PCs. The original axes correspond to the original variables, so a
loadings plot shows which variables contribute to which PC (direction of vector) and to what
extent (length of vector). Two positively correlated variables will have a small angle between their
vectors whereas two negatively correlated variables will have a wide angle (close to 180 degrees).
A biplot combines a score plot and a loadings plot and can be used to investigate relationships
among objects, among variables and between objects and variables.
A PCA only reflects the variation in the data, not the source of the variation.
Percentages on the axes of a PCA plot indicate the fraction of the variance in the original data that
each PC explains. Check how much of the variation is retained by the plotted PCs.
When testing for many different GO terms, adjust your p-value for multiple testing
Discriminant analysis:
- Using a combination of several variables for separation of data
- Useful to identify the class of a new sample or to select a subset of variables
- Relies on having more objects than variables
o Often not the case for –omics studies. Few samples and many variables (wide data)
Cross-validation can first be used to optimize your model parameters. After the parameters are
optimized, a second cross-validation has to be performed in order to estimate the ability of the
model to predict new data (cross validation error).
All steps in the whole classification process need to be cross validated so also filtering, variable
selection, optimization of parameters, etc.
Random Forest (method of classification for data with large number of variables):
- A forest consists of many unpruned trees
- Make new datasets from original dataset. Draw with replacement (same as bootstrapping)
- Fit tree to each new dataset
o Use a random subset of the variables at each node
- Use ‘out of bag’ samples (samples that are not part of a new dataset because other samples were
drawn more than once) for cross validation.
- Run your target sample through each tree and compare the different outcomes
- Count the different outcomes. The outcome with the highest count is the final outcome
Random Forest results in lower variance without higher bias
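The resampling and voting steps above can be sketched as follows. This only shows drawing with replacement, finding the out of bag samples and taking the majority vote. Fitting the trees is left out, and the votes are hypothetical.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n sample indices with replacement (one new dataset)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, drawn):
    """Indices that were never drawn, so 'out of bag' for this tree."""
    return sorted(set(range(n)) - set(drawn))

def majority_vote(predictions):
    """Return the outcome predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)   # fixed seed so the sketch is reproducible
n_samples = 10
drawn = bootstrap_indices(n_samples, rng)
oob = out_of_bag(n_samples, drawn)

# Hypothetical outcomes of five trees for one target sample
final_class = majority_vote(["A", "B", "A", "A", "B"])
```

On average roughly one third of the samples end up out of bag for a given tree, which is what makes them usable as a built-in test set.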
Variable importance: randomly permute the values of a single variable and check whether the out of
bag error changes. If the error increases a lot, the variable was important. Do this with every variable.
Besides looking at the out of bag error, also quantify the decrease in Gini impurity.
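A minimal sketch of the Gini impurity of a node and the decrease in impurity produced by one split. The function names and the toy labels are assumptions for illustration. A random forest implementation would sum these decreases per variable over all nodes and trees.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted

parent = ["A", "A", "B", "B"]
decrease = gini_decrease(parent, ["A", "A"], ["B", "B"])   # perfect split
```

A variable that often produces large decreases separates the classes well and is therefore considered important.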
Normalization methods:
- Total counts: (mapped reads per gene / total mapped reads in sample) * mean of the total
counts across all samples
o Corrects for differences in library size
o You can also divide by upper quartile or median of the counts different from 0
- RPKM: (mapped reads per gene * 10^9) / (total mapped reads in sample * gene length)
o Corrects for differences in library size and gene length
- Trimmed mean of M-values (TMM)
o One sample is considered reference, the others are tests
o For each test sample compute the TMM (compute the log ratio between sample and
reference per gene (M-value), exclude the genes with the most extreme log ratios and
expression levels (trimmed) and compute the weighted mean of the remaining log ratios (mean))
o The outcome should be close to 0 if most genes are not differentially expressed
o Otherwise it provides an estimate of a correction factor for the library size
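The total count and RPKM formulas above can be sketched as follows. The data layout (a dictionary of samples with gene counts) and the function names are assumptions for this example. TMM is not shown because it needs trimming and weighting steps that go beyond a short sketch.

```python
def total_count_normalize(counts):
    """counts: {sample: {gene: reads}}. Rescale each sample to the mean library size."""
    totals = {s: sum(genes.values()) for s, genes in counts.items()}
    mean_total = sum(totals.values()) / len(totals)
    return {
        s: {g: reads / totals[s] * mean_total for g, reads in genes.items()}
        for s, genes in counts.items()
    }

def rpkm(reads, total_mapped_reads, gene_length_bp):
    """Reads per kilobase of gene per million mapped reads."""
    return reads * 1e9 / (total_mapped_reads * gene_length_bp)

# Hypothetical counts: s2 was sequenced three times deeper than s1
counts = {
    "s1": {"geneA": 10, "geneB": 90},    # library size 100
    "s2": {"geneA": 30, "geneB": 270},   # library size 300
}
normalized = total_count_normalize(counts)
value = rpkm(reads=500, total_mapped_reads=1_000_000, gene_length_bp=2000)
```

In this toy example both samples get the same normalized value for geneA, because the only difference between them was library size.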
Comparing the different methods: normalization of RNA-seq data for differential analysis is essential.
Total count and RPKM methods are ineffective. Instead use DESeq or TMM.
False positive rate = #FP / (#FP + #TN)
True positive rate = #TP / (#TP + #FN) = power
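As a small worked example, the two rates can be computed from lists of true and predicted labels. The label lists are made up for illustration.

```python
def confusion_counts(truth, predicted):
    """Count TP, FP, TN and FN for two equally long lists of booleans."""
    pairs = list(zip(truth, predicted))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, fp, tn, fn

truth     = [True, True, True,  False, False, False, False, False]
predicted = [True, True, False, True,  False, False, False, False]

tp, fp, tn, fn = confusion_counts(truth, predicted)
false_positive_rate = fp / (fp + tn)
true_positive_rate = tp / (tp + fn)   # power
```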
Lecture 9: Biological networks and their properties (was not in lecture, summary based on slides)
Biological networks can be used to study interactions, pathways or similarity
It is widely believed that cellular networks (metabolic, protein-protein interaction, etc) are scale free.
A scale-free network is robust against random node removal: removing a non-hub node has little overall effect.
In some networks, interactions may vary over time. A party hub interacts with all its neighbors
simultaneously. A date hub interacts with different neighbors at different times. A date hub often
indicates a link between various biological processes.
Boolean networks can be used to model gene regulation. The expression level of each gene is related
to the expression states of some other genes using logical functions.
A Boolean network is defined by a set of nodes (V = {v1, …, vn}) and a list of Boolean functions
(B = (B1, …, Bn)). The state of the network (S = {σ1, …, σn}) holds the state of every node. The state
of a node is determined by the states of the nodes connected to it at that time.
A state transition graph represents the state of a network iterating over different time points.
Attractor state: a network ends in a single state or in a set of states which it oscillates between
(equilibrium state). The period of the attractor is the number of different states it consists of.
Total number of states in a network = 2^number of nodes
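A minimal sketch of a Boolean network with its state transition graph and one attractor. The three-gene network and its logical functions are hypothetical, and synchronous updating (all nodes change at the same time step) is an assumption of this example.

```python
from itertools import product

# Hypothetical 3-gene network: each function maps the current network state
# to the next value of one node.
functions = [
    lambda s: s[2],       # gene 0 copies gene 2
    lambda s: s[0],       # gene 1 copies gene 0
    lambda s: not s[1],   # gene 2 is repressed by gene 1
]

def step(state):
    """Apply all Boolean functions at once (synchronous update)."""
    return tuple(bool(f(state)) for f in functions)

def find_attractor(state):
    """Iterate until a state repeats and return the cycle of states reached."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]

# State transition graph: every possible state mapped to its successor
all_states = list(product([False, True], repeat=len(functions)))
transitions = {s: step(s) for s in all_states}

attractor = find_attractor((False, False, False))
period = len(attractor)
```

With 3 nodes there are 2^3 = 8 states. In this toy network six of them form one cycle and the remaining two form a second cycle.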
Lecture 11: Bayesian networks (was not in lecture, summary based on slides)
A Bayesian network is a model of probabilistic relationships among a set of variables. Its aim is to
capture conditional dependency between states in the data. It is a directed acyclic graph.
Nodes represent binary states of a variable and links represent the probability of a relationship
between the two nodes.
You can use database information for generating a network. However, data is generated by different
groups so different methods, normalization, etc. apply. It is best to use data generated by yourself
but if that is not an option, limit yourself to one type of data (method of obtaining data).
Mutual information (ranges from 0 to infinity): how much information about one variable is encoded in the other
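A minimal sketch of mutual information for two discrete variables, estimated from observed frequencies and expressed in bits. Continuous expression data would first have to be binned or handled with a different estimator, which this example does not cover. The function name and data are illustrative.

```python
from collections import Counter
from math import log2

def mutual_information(x, y):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(x)
    px = Counter(x)
    py = Counter(y)
    pxy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), count in pxy.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

x = [0, 0, 1, 1]
mi_identical = mutual_information(x, x)               # y is fully determined by x
mi_independent = mutual_information(x, [0, 1, 0, 1])  # no shared information
```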
Similarity measure cannot distinguish direct from indirect interactions resulting in many false
positives in the connections. Methods to reduce these false positives:
- Aracne
o Compute mutual information matrix
o Use data processing inequality (DPI)
For A → B → C, mutual information AC < AB and AC < BC
Remove the edge between A and C if this is the case
- CLR
o Compute mutual information matrix
o Calculate the statistical likelihood of each mutual information value compared to the
interaction value distribution of all interactions.
o Remove edge if the interaction is not in the significant region for both genes.
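The DPI step of Aracne can be sketched as below. This is a simplified illustration, not the published implementation: it has no tolerance parameter and no significance threshold on the mutual information values, and the edge values are made up.

```python
from itertools import permutations

def apply_dpi(mi):
    """mi: {frozenset({a, b}): mutual information}. Return the edges kept after DPI."""
    nodes = sorted({n for edge in mi for n in edge})
    to_remove = set()
    for a, b, c in permutations(nodes, 3):
        ac, ab, bc = frozenset({a, c}), frozenset({a, b}), frozenset({b, c})
        if ac in mi and ab in mi and bc in mi:
            # A-C is the weakest edge of the triplet: treat it as indirect
            if mi[ac] < mi[ab] and mi[ac] < mi[bc]:
                to_remove.add(ac)
    # Edges are only removed after all triplets were checked on the full matrix
    return {edge: v for edge, v in mi.items() if edge not in to_remove}

mi = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"B", "C"}): 0.8,
    frozenset({"A", "C"}): 0.3,   # weaker than both other edges
}
kept = apply_dpi(mi)
```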
To compare the quality of different networks with different thresholds, compare the TP rate to the
FP rate in an ROC curve. Each threshold gives one data point (FP rate, TP rate); together the points
form the curve.
Usually you expect more negatives than positives so a large false positive rate (fraction of negatives
that is perceived as positive) is more detrimental than a low true positive rate (fraction of positives
that is perceived as positive). Focus on the left side of the x-axis of the ROC curve (low FPR).
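A short sketch of how the points of an ROC curve are obtained: one (FPR, TPR) pair per threshold. The edge scores and the list of true interactions are hypothetical.

```python
def roc_points(scores, truth, thresholds):
    """Return one (FPR, TPR) point per threshold; an edge is called if score >= threshold."""
    positives = sum(truth)
    negatives = len(truth) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, is_true in zip(scores, truth) if s >= t and is_true)
        fp = sum(1 for s, is_true in zip(scores, truth) if s >= t and not is_true)
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.9, 0.8, 0.6, 0.4, 0.2]          # e.g. mutual information per edge
truth = [True, True, False, True, False]    # whether the edge is a real interaction
points = roc_points(scores, truth, thresholds=[1.0, 0.7, 0.5, 0.0])
```

Lowering the threshold moves the point to the upper right: more true edges are found, but more false ones are accepted as well.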
Only network inference is usually insufficient for a trustworthy network. Integrative bioinformatics
improves the network with other data (omics measurements, databases, etc.).
Explicit data integration: combine measurements to construct a multilayer model with preknowledge
about use of the data.
Implicit data integration: combine measurements to enhance a single layer model. “throw data on a
pile and hope something comes out”
When integrating, keep in mind that your data is heterogeneous in type, reliability, nomenclature,
coverage, possible bias, etc.
Set-based integration: select different types of evidence (data) for interaction and weigh them based
on their reliability. Combine the weighted scores for the different possible interactions. Select
interactions for final model based on a certain threshold.
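Set-based integration can be sketched as below. Combining the evidence as a weighted sum is an assumption of this example, and the sources, weights, scores and threshold are all hypothetical.

```python
def integrate(evidence, weights, threshold):
    """evidence: {source: {interaction: score}}; weights: {source: reliability}."""
    combined = {}
    for source, scores in evidence.items():
        for interaction, score in scores.items():
            # Weigh each piece of evidence by the reliability of its source
            combined[interaction] = combined.get(interaction, 0.0) + weights[source] * score
    selected = {i for i, total in combined.items() if total >= threshold}
    return combined, selected

evidence = {
    "coexpression": {("A", "B"): 1.0, ("B", "C"): 1.0},
    "literature": {("A", "B"): 1.0},
}
weights = {"coexpression": 0.3, "literature": 0.6}
combined, selected = integrate(evidence, weights, threshold=0.5)
```

Here the interaction A-B is supported by both sources and passes the threshold, whereas B-C is only supported by the less reliable source and is dropped.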
When integrating the results from the different data sources, you have different integration forms:
- Early integration: put all results through same classifier, compare output to threshold to
obtain binary output.
- Late integration: put all results through unique classifiers, compare output to threshold to
obtain binary output.
- Intermediate integration: translates results into same data type (probability, similarity, etc.),
put these through same classifier, compare output to threshold to obtain binary output.
Motif: a pattern in the connections between a small number of nodes (about 3). See if you can find
a certain motif in a network and how often. Compare this to random networks with the same number
of nodes and the same degrees to find enriched motifs.
Modules can have shared nodes, whereas clusters can’t. (modules are a good way of separating
cellular processes as one protein can be involved in different processes)
Special way of clustering networks (previous cluster methods also possible): minimize normalized cut
Proteins have diverse properties (lifetime, localization, concentration in cell, ligand interactions, etc.)
making proteomics more diverse than transcriptomics. Besides, interactions are time dependent. A
measurement is only a snapshot from the current situation.
The 3D structure of a protein tells a lot about its properties. 3D structure determination techniques:
- X-ray crystallography: hit crystallized molecule with X-ray beam and study diffraction pattern
- NMR: study structure by inducing and measuring chemical shift of H atoms in a molecule
- EM: advancing technology similar to X-ray but with electron beam on a single protein
- Homology modelling: model structure based on other similar proteins with known structure
Fragmentation of proteins into peptides is usually done by trypsin digestion (cleaves on the
C-terminal side of lysine and arginine)
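A minimal sketch of an artificial trypsin digestion using only the rule stated above (cut after every K or R). Missed cleavages and any exceptions to the rule are ignored here, and the example sequence is made up.

```python
def trypsin_digest(sequence):
    """Cleave a protein sequence after every lysine (K) or arginine (R)."""
    peptides = []
    current = ""
    for residue in sequence:
        current += residue
        if residue in "KR":
            peptides.append(current)
            current = ""
    if current:   # C-terminal peptide that does not end in K or R
        peptides.append(current)
    return peptides

peptides = trypsin_digest("MKWVTFISLLRAGSK")
```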
Using a collision cell after the ionizer in your setup allows you to filter for peptides with a certain charge.
Doubly charged precursors result in spectra that are easiest to interpret. Doubly charged precursors
fragment in such a way that most peptide bonds (the weakest bonds) break with comparable
frequency, which usually results in two singly charged fragment ions.
De novo peptide identification is difficult. It is better to compare to a simulated library based on a
database: take sequence from database → artificial digestion → artificial mass spectrometry →
library spectrum (for all sequences of interest).
MS can only detect a few peptides at the same time. Therefore it is best to separate your peptides
after digestion, for instance using liquid chromatography.
Different samples can be compared by adding isotopically labeled amino acids (different mass) to
the growth medium of one/some of your samples (SILAC).
- Estimates relative protein levels between samples with high accuracy
- Can be used on complex mixtures of proteins
- No addition of tag so no altered chromatography
- Ideal for cultures but tricky for whole organisms
- Expensive
Currently the preferred technique for comparing different samples is label free quantification (LFQ).
At least three independent technical replicates are used followed by a statistical test to study
significant differences between samples.
- Can be applied to all types of samples
- Estimates relative protein levels between samples with high accuracy
- Can be used on complex mixtures of proteins
- Requires replicates so more material
- Requires more MS runs
- Requires well-annotated genomes
In the paper "Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single
cells" a relation between protein concentration and corresponding mRNA concentration is studied.
- For quantifying number of proteins in a cell
o Fuse YFP to protein (one strain for each protein)
o Measure amount of fluorescence and translate it to protein copy number per cell
- For quantifying number of mRNA in a cell
o Create fluorescent tag that binds to YFP sequence of studied protein
o Measure amount of fluorescence and translate it to mRNA levels per cell
- There is little correlation between protein and mRNA numbers per cell
o Possibly due to short lifetime of mRNA after translation compared to protein lifetime
o Possibly due to localization of protein to different cell