
Molecular Systems Biology

Lecture 1: Experimental design


Make an experimental design before gathering data
- Determine how many replicates per treatment
o Same amount for each treatment in order to get a balanced design (except when one
specific comparison is more important)
o Watch out for pseudo replicates (= replicate measured from same experimental unit)
- Minimize bias from systematic effects from sources other than your treatments
- Minimize random variation
o Use same test conditions, minimize technical and measurement variation, etc.
- Make sure design meets requirements for statistical analyses

Replicates should be independent observations that capture all variability present between two
samples except that caused by the studied treatment.
More replicates increase the resolution of your results (smaller effects can be detected)

Technical replicate: measure same sample several times


- In case one measurement gets lost
- To detect and quantify variation due to measurement technique
- Not to be used for statistical comparison across samples
Biological replicate – type 1: from the same individual (organism/cell line/etc.)
Biological replicate – type 2: from different individuals (organism/cell line/etc.)
- Provides standard error for the comparison of treatments
- Provides a formal way to generalize to the wider population of individuals
- Most important kind of replications (needed for tests, p-values, etc.)
Technical replicates do not compensate for lack of biological replicates

Degrees of freedom (df) = number of individuals – number of treatment combinations


Experimental unit = unit to which treatment is applied

Randomize your samples to avoid possible systematic bias of unexpected external sources.
Blocking can be used to take expected sources of variation into account (e.g. measurement day)

Include ‘empties’, negative controls and positive controls in your experiment

Pooling your data can be helpful when comparing different treatments but watch out because
outliers or high variance can affect the results

A factorial design can be used when multiple factors are of interest


Lecture 2: Differential expression
Different ways to analyze results when comparing two situations:
- Fold change: ratio of the signal under two conditions (B/A)
o More than two times up- or down-regulated (fold change > 2) → significant
o The log of the fold change is also often used
o This method is not preferred: it is not statistical and does not take variance into account
- Significance: quantify difference between two conditions and compare to variance within the
conditions
o Calculate outcome of a test statistic and determine the significance of the outcome

Model assumptions: equal variance among treatments, errors are independent and normally
distributed

p-value tells you how likely the measured outcome of the test statistic would be if H0 is true. The
smaller the p-value, the stronger the indication that Ha is true and H0 is not. A result is considered
significant if the p-value lies below a value alpha that is chosen in advance.

Type I error: false positive, p-value < alpha while H0 is true


Type II error: false negative, p-value > alpha while Ha is true

ANOVA can be used to compare more than two groups (gain of power by combining information)
- Use all observations together to estimate variability
- Sometimes it allows estimating the effect of one treatment over all levels of another treatment
- Sometimes it allows estimating the interaction between treatments

On exam: if asked what test to use in a situation, specify what kind of t-test

Lecture 3: False Discovery Rate (FDR)


When you have many measurements, you can analyze different error types
- Per comparison error rate (PCER): expected false discoveries per single test, just like for a t-
test with predefined alpha. Alpha is the allowed fraction of false positives. Using alpha = 0.05 still
results in many false positives when dealing with a large number of measurements. Therefore
a different method needs to be used.
- Family wise error rate (FWER): probability of one or more false discoveries over all tests. This
is a very strict method inducing many false negatives so only attractive if the ‘cost’ of one or
a few false positives is very high or when you expect little or no true positives.
Calculated using the Bonferroni method (significance threshold = alpha / number of tests)
- False discovery rate (FDR): expected false discovery rate in significant set
Calculated using the Benjamini-Hochberg method (BH95)
o Order the raw p-values from lowest to highest
o Find the ranks of the p-values with p <= rank number * alpha / number of tests
o k = highest rank where this inequality holds
o For ranks 1 to k → H0 is rejected
o This method results in q-values instead of p-values. The q-value for a test is the
lowest p-value in the list from the rank of the p-value for the test until rank k.
The q-value does not refer to a single test anymore but to a set of tests

Lecture 4: Cluster analysis


Clustering: arranging objects in non-predefined groups (unsupervised) based on characterization by
values of a set of variables

Useful for:
- Identifying groups of possibly co-regulated metabolites/genes
- Identifying new classes of biological samples
- Detecting possible experimental artifacts (e.g. clustering occurs based on the day a sample
was analyzed instead of biological meaning)
- Checking data to see whether samples are clustered according to known categories (a
supervised approach like classification is better for this)

Things to think about when clustering:


- Which samples/treatments/units/variables are used? → filtering
o All samples, based on statistical comparison, based on ‘sufficient’ variation, etc.
- Which similarity/distance measure (strength of relationship between two objects) is used?
o Euclidean: linear distance between objects in a multidimensional space
o Manhattan: similar to Euclidean but distance over straight lines parallel to the axes
o Pearson correlation: measure for correlation between two variables (variable A and B
both increase from situation X to Y → correlated)
o Absolute Pearson correlation: take the absolute value (or square) of the Pearson coefficient
to only measure correlation between changes irrespective of direction.
- Which cluster algorithm is applied?
o Hierarchical: produces a dendrogram
▪ Divisive: start at the top of the tree, everything in one group. Divide into groups
from there on
▪ Agglomerative: start at the bottom of the tree, everything in separate groups.
Combine groups from there on
o Calculation of the distance between two clusters is based on pairwise distances between
members of both clusters.
▪ Complete linkage: largest possible distance between objects of each group
▪ Single linkage: shortest possible distance between objects of each group
▪ Average linkage: distance between average location of objects in each group
o UPGMA, most popular way to construct a tree (hierarchical, agglomerative, average
linkage); a code sketch follows after this list
▪ Fuse the two samples with the shortest distance. Use branches of length: distance/2
▪ Treat the fused samples as one sample and find the shortest distance between their
average and the next closest sample. Fuse this sample to the existing group
▪ Repeat the second step until all samples are fused.
- How to decide the number of clusters?
- How to assess the quality of the clustering?
o Include repeated measurements of a single object in your clustering
o Include known objects/variables to allow cluster identification
o Test difference between groups with statistical test
o Make clusters with varying distance measure, cluster algorithm, etc. and compare
o Bootstrapping
▪ Make new datasets (e.g. 1000) from the original dataset with the same size. Draw
with replacement (duplicates are allowed).
▪ Construct a tree for each new dataset
▪ Count for each group in the original tree how many times it re-occurs in the
bootstrap trees
▪ Put these values at the branching point for the corresponding groups in your
original tree (higher value = better cluster)

When variables are measured on different scales it is wise to standardize the
observations when using Euclidean or Manhattan distances.
Autoscaling = subtract the mean and divide by the standard deviation

Sometimes it’s better to replace the original measurements by ranks (for non-normal distributions,
high probability of outliers, data that can only be ranked)

Lecture 5: Principal components analysis (PCA)


Visual inspection of multivariate data becomes difficult when dealing with more than 3 variables. PCA
reduces the dimensions of the data and thus its complexity.

Example: PCA for going from 3 dimensions to 2 dimensions


- Determine the average location of all data points (center of gravity) in the multidimensional
space. This will be the new origin.
- Fit a line through the data that captures the widest spread (lowest sum of squares for the
distances between the line and the data points). This is the first principal component.
- Fit a line orthogonal to the first principal component that captures the widest left over
spread in the data. This is the second principal component.
- Project all observations on the plane of PC1 and PC2.

To study the relationships between objects, make a score plot by projecting the original data on a
plane made from two PCs (usually PC1-PC2). Some information is lost; it remains in the
dimensions that are not plotted.
To study relationships between variables, make a loadings plot by projecting the original axes as
vectors on a plane made from two PCs. The original axes corresponded to the original variables, so a
loadings plot will show which variables contribute to which PC (direction of vector) and to what
extent (length of vector). Two positively correlated variables will have a small angle between their
vectors whereas two negatively correlated variables will have a wide angle.
A biplot combines a score plot and a loadings plot and can be used to investigate relationships
among objects, among variables, and between objects and variables.

A PCA only reflects the variation in the data, not the source of the variation.
Percentages on the axes of a PCA indicate the amount of variance of the original data that is still
present in the PCA. Check how much of the variation is retained in the chosen PCs.

Lecture 6: GO enrichment analysis


Once you have a group/cluster of proteins/genes, you want to know to which GO terms they belong
GO is divided into 3 ontologies:
- Molecular function
- Biological process
- Cellular component (location)

Use a hypergeometric test for an enrichment analysis.


Outcome is yes or no (enriched gene belongs to GO group or not) → proportions
Compare sample proportion (# successes in the sample(x)/sample size(K)) to population proportion
(# successes in population(M)/population size(N))

When testing for many different GO terms, adjust your p-value for multiple testing

Lecture 7: Classification trees and Random Forest


Classification: arranging objects in predefined groups (supervised) based on characterization by
values of a set of variables

Discriminant analysis:
- Using a combination of several variables for separation of data
- Useful to identify the class of a new sample or to select a subset of variables
- Relies on having more objects than variables
o Often not the case for –omics studies. Few samples and many variables (wide data)

- Use training set to make a classifier


- Use a test set (with classes known in advance) to evaluate the quality of your classifier
- Adjust your classifier based on test results and test again (repeat until satisfied with results)
- Run your target set through your classifier

Cross-validation, e.g. 10-fold (Leave One Out Cross-Validation (LOOCV) is the special case where
each set contains a single sample):

- Divide your samples into 10 sets of equal size
- Use 9 sets (90%) to build a classifier
- Use it to predict the classes of the remaining 10%
- Repeat steps 2 and 3 with different sets of 10% as test sets

Cross-validation can first be used to optimize your model parameters. After the parameters are
optimized, a second cross-validation has to be performed in order to estimate the ability of the
model to predict new data (cross validation error).

All steps in the whole classification process need to be cross-validated, including filtering, variable
selection, optimization of parameters, etc.

Solutions to ‘few samples many variables’ problem (need to be included in CV procedure):


- Only use a selected set of variables (filter out non-relevant variables before building the classifier)
- Perform a PCA to reduce the number of variables
- Shrinkage methods
o weigh variables based on their contribution to the classifier
o put a penalty on the summed weights of the variables used in the classifier
Gini impurity (i): measure of impurity within a node. The decrease in Gini impurity after splitting a
node tells you how much the split improves the division of the data into classes.


Stopping criteria for splitting your data


- When all nodes are pure
- When a node reaches a certain (minimum) number of elements
- Pruning your tree
o Remove splits with low decrease in impurity
o Protects against overfitting
- Pruned trees have lower variance but higher bias

Random Forest (method of classification for data with large number of variables):
- A forest consists of many unpruned trees
- Make new datasets from original dataset. Draw with replacement (same as bootstrapping)
- Fit tree to each new dataset
o Use a random subset of the variables at each node
- Use ‘out of bag’ samples (samples that are not part of a new dataset because other samples
were drawn multiple times) for cross-validation.
- Run your target sample through each tree and compare the different outcomes
- Count the different outcomes. The outcome with the highest count is the final outcome
Random Forest results in lower variance without higher bias

Variable importance: change the values of a single variable and check whether the out-of-bag error
changes. If the error increases a lot, the variable was important. Do this for every variable.
Besides looking at the out-of-bag error, you can also quantify the decrease in Gini impurity
attributable to the variable.

Lecture 8: RNA seq


Transcriptomics is useful for determining the transcriptional structure of a gene and for
quantification and comparison of expression levels.
The RNA-seq workflow:
- Extract RNA
- Reverse transcribe long RNA fragments into cDNA
- Amplify cDNA and add adaptors
- Sequence the reads
- Align read sequences with a reference genome or transcriptome (or make a de novo
transcriptome)
- Quantify reads mapped to certain positions

Take alternative splicing into account, it occurs often.

Factors that affect the number of counts per gene:


- Library size/sequencing depth = total number of reads in library
- Gene length
- GC content (affects extent of amplification)
- Sequence effects during sequencing (a sequence like TTTTTT can alter the sequencing)
- Ribosomal RNA depletion (often samples are contaminated with large amounts of ribosomal
RNA. This is taken out to an extent, but this can vary between samples)
- RNA content (some genes can be transcribed a thousand-fold more than others)

Normalization methods:
- Total counts: (mapped reads per gene / total mapped reads in sample) * mean of the total
counts across all samples
o Corrects for differences in library size
o You can also divide by upper quartile or median of the counts different from 0
- RPKM: (mapped reads per gene * 10^9) / (total mapped reads in sample * gene length)
o Corrects for differences in library size and gene length
- Trimmed mean of M-values (TMM)
o One sample is considered reference, the others are tests
o For each test sample compute the TMM (compute log ratio between sample and
reference (M-value), exclude most expressed genes (trimmed) and compute
weighted mean of log ratios (mean))
o The outcome should be close to 0 if most genes are not differentially expressed
o Otherwise it provides an estimate of a correction factor for the library size

Comparing the different methods: normalization of RNA-seq data for differential analysis is essential.
Total count and RPKM methods are ineffective. Instead use DESeq or TMM.
False positive rate = #FP / (#FP + #TN)
True positive rate = #TP / (#TP + #FN) = power

Lecture 9: Biological networks and their properties (was not in lecture, summary based on slides)
Biological networks can be used to study interactions, pathways or similarity

Degree: number of links to other nodes


Directed network: links between nodes have a specific direction (indegree and outdegree may vary)
Betweenness: measure of centrality of a node or link. A sum of (number of shortest paths between
two nodes going through the node of interest / total number of shortest paths between two nodes)
for each possible node pair.
Neighborhood: all nodes connected to a certain node
Distance: the shortest path length between two nodes (number of links between the nodes)
Network diameter: maximal shortest path distance between any two nodes
Closeness: based on the average length of the shortest paths to all other nodes in the network (often defined as the inverse of this average)
Clustering coefficient: 2 * (# links between nodes in neighborhood) / (degree * (degree - 1))
Scale free network: network for which the degree distribution follows a power law
Hub: highly connected node

It is widely believed that cellular networks (metabolic, protein-protein interaction, etc.) are scale free.
A scale-free network is more robust: removal of a non-hub node has little overall effect.

In some networks, interactions may vary over time. A party hub interacts with all its neighbors
simultaneously. A date hub interacts with different neighbors at different times. A date hub often
indicates a link between various biological processes.

Lecture 10: Boolean networks


Gene regulatory networks modulate the different stages of gene expression and its regulation.
Making gene regulatory networks based only on expression data does not say anything about
posttranslational modifications, expression levels, localization, etc. Therefore it limits the scope to
regulatory effects resulting from transcription factor transcript levels. It disregards actual proteins.

- Obtain expression data


- Normalize data and put into gene expression data matrix
- Study type of inference (Boolean, Bayesian, etc.)
- Make your model
- Validate your results and make predictions

Boolean networks can be used to model gene regulation. The expression level of each gene is related
to the expression states of some other genes using logical functions.

A Boolean network is defined by a set of nodes (V = {v1, …, vn}) and a list of Boolean functions (B = (B1,
…, Bn)). The state of a node (S = {σ1, …, σn}) is determined by the state of the other connected nodes
at that time.
A state transition graph represents the state of a network iterating over different time points.
Attractor state: a network ends in a single state or a set of states between which it oscillates
(equilibrium state). The period of the attractor is the number of different states it consists of.
Total number of states in a network = 2^number of nodes

Setting up a Boolean network:


- Discretize your data (determine for which expression level a gene is on/off)
- Make a table with the different states of the system over time
- Optionally: make a state transition graph of the table
- When you know the states (σ) of a certain time point, you also know the outcomes of the Boolean
functions (B) of the connected nodes at the previous time point (they are the same). Add these
values to the table.
- See whether you can find a step in time with the change of only one state where the rest of
the states remains the same.
- Between these two time points, look at the changes in the Boolean functions. For instance: If
there is a change in B2 when all states are the same except σ1, then you know that B2
depends on σ1.
- If B2 goes from 0 to 1 when σ1 goes from 1 to 0 (or vice versa), the interaction is inhibiting.
Otherwise the interaction is activating
- Continue for all time steps to find all interactions
- Build network based on the interactions

Lecture 11: Bayesian networks (was not in lecture, summary based on slides)
A Bayesian network is a model of probabilistic relationships among a set of variables. Its aim is to
capture conditional dependency between states in the data. It is a directed acyclic graph.
Nodes represent binary states of a variable and links represent the probability of a relationship
between the two nodes.

Interpretation of the figure (the classic cloudy/sprinkler/rain/wet-grass network):

Chance of cloudiness (C) = F: 0.5, T: 0.5
Chance of sprinkler on (S) = cloudiness is false → F: 0.5, T: 0.5; cloudiness is true → F: 0.9, T: 0.1
Chance of raining (R) = cloudiness is false → F: 0.8, T: 0.2; cloudiness is true → F: 0.2, T: 0.8
Chance of wet grass (W) = sprinkler is false, rain is false → F: 1, T: 0
sprinkler is false, rain is true → F: 0.1, T: 0.9
sprinkler is true, rain is false → F: 0.1, T: 0.9
sprinkler is true, rain is true → F: 0.01, T: 0.99
Overall chance of rain = P(C=F) * 0.2 + P(C=T) * 0.8 = 0.5 * 0.2 + 0.5 * 0.8 = 0.5
For the best model, make a set of optimal models and compare them to extract common features.
Bootstrap to assess quality: change the data slightly, make a new model, and compare the results.

Lecture 12: Reverse engineering of transcription networks


Boolean and Bayesian networks work for a small number of genes (<50). Now the focus is on networks for
large sets of genes. More genes in a network means less information about how they interact, only
whether they interact.

You can use database information for generating a network. However, data is generated by different
groups so different methods, normalization, etc. apply. It is best to use data generated by yourself
but if that is not an option, limit yourself to one type of data (method of obtaining data).

What to do with missing data points:


- Ignore the corresponding samples or genes (the whole row or whole column)
- Replace the missing values by some kind of average
- K nearest neighbor imputation (KNN imputation)
o Find the genes with the most similar expression profiles to the gene with missing data
▪ Usually between 2 and 10
o Replace missing data points by the average of the similar genes under the same conditions

Sequence based inference:


- Cluster genes with similar expression patterns
- Search for over represented patterns (motifs) in the regulatory regions of the clustered genes
- Search for known transcription factor binding sites in databases

Association network based inference:


- Obtain a similarity measure between the expression levels of the genes
- Connect nodes corresponding to genes with high similarity score
- Optional: assign directionality arrows based on previous knowledge (known TF)

Mutual information (0 to infinity): how much information about one variable is encoded in the other

A similarity measure cannot distinguish direct from indirect interactions, resulting in many false
positives in the connections. Methods to reduce these false positives:
- Aracne
o Compute mutual information matrix
o Use the data processing inequality (DPI); a code sketch follows after this list
▪ For a chain A → B → C: MI(A,C) < MI(A,B) and MI(A,C) < MI(B,C)
▪ Remove the edge between A and C if this is the case
- CLR
o Compute mutual information matrix
o Calculate the statistical likelihood of each mutual information value compared to the
interaction value distribution of all interactions.
o Remove edge if the interaction is not in the significant region for both genes.

For both methods you need to assign a threshold value.


- In case of known interactions (training set), fix threshold to maximize TP and minimize FP
- Select threshold so that you have fixed number of interactions
- Select only the top scoring x% interactions
- Build multiple networks with multiple thresholds and choose best based on its properties
o Scale freeness, clustering structure, clustering of genes with similar function.

To compare the quality of different networks with different thresholds, compare the TP rate to the
FP rate in an ROC curve. Each individual threshold gives one data point; together the points form a
curve.
Usually you expect more negatives than positives so a large false positive rate (fraction of negatives
that is perceived as positive) is more detrimental than a low true positive rate (fraction of positives
that is perceived as positive). Focus on the left side of the x-axis of the ROC curve (low FPR).

Lecture 13: Network enhancement


Molecular interactions occur between many different types of molecules (proteins, DNA, RNA,
metabolites, etc.). Physical interaction: molecules actually physically interact. Functional interaction:
any type of interaction, state of a molecule is somehow influenced by another molecule. Be aware of
what kind of interaction you are looking at in a database.

Network inference alone is usually insufficient for a trustworthy network. Integrative bioinformatics
improves the network with other data (omics measurements, databases, etc.).
Explicit data integration: combine measurements to construct a multilayer model with prior knowledge
about use of the data.
Implicit data integration: combine measurements to enhance a single-layer model. “Throw data on a
pile and hope something comes out.”
When integrating, be aware that your data may be heterogeneous in type, reliability, nomenclature,
coverage, possible bias, etc.

Set-based integration: select different types of evidence (data) for interaction and weigh them based
on their reliability. Combine the weighted scores for the different possible interactions. Select
interactions for final model based on a certain threshold.
When integrating the results from the different data sources, you have different integration forms:
- Early integration: put all results through same classifier, compare output to threshold to
obtain binary output.
- Late integration: put all results through unique classifiers, compare output to threshold to
obtain binary output.
- Intermediate integration: translates results into same data type (probability, similarity, etc.),
put these through same classifier, compare output to threshold to obtain binary output.

Lecture 14: Network mining


Networks can contain a lot of information, but how do you extract it?
Networks can be translated into a binary adjacency matrix or a weight matrix with interaction strengths.
Annotation of a node can be obtained by looking at neighboring nodes with known annotation.
Annotate if the number of neighbors with a certain annotation exceeds a threshold (chi-square test)

Motif: a recurring connectivity pattern among a small number of nodes (about 3). See if you can find
a certain motif in a network and how often. Compare this to random networks with the same number
of nodes and degrees to find enriched motifs.

Modules can have shared nodes, whereas clusters can’t. (modules are a good way of separating
cellular processes as one protein can be involved in different processes)

Special way of clustering networks (previous cluster methods also possible): minimize normalized cut

Lecture 15: Proteomics I


So far, we have been looking at networks based on co-expression. There are many other forms, like
networks based on protein interaction. Protein interactions cannot be deduced from expression
analysis. Only by measuring protein interactions directly.

Proteins have diverse properties (lifetime, localization, concentration in cell, ligand interactions, etc.)
making proteomics more diverse than transcriptomics. Besides, interactions are time dependent; a
measurement is only a snapshot of the current situation.

The 3D structure of a protein tells a lot about its properties. 3D structure determination techniques:
- X-ray crystallography: hit crystallized molecule with X-ray beam and study diffraction pattern
- NMR: study structure by inducing and measuring chemical shift of H atoms in a molecule
- EM: advancing technology similar to X-ray but with electron beam on a single protein
- Homology modelling: model structure based on other similar proteins with known structure

Mass spectrometry can be used to identify proteins


Both positive and negative ion modes are possible depending on functional groups (-OH, -NH3,
-COOH, etc.) in the sample.

MS spectra have m/z on the x-axis → mass/charge obtained during ionization.


Actual mass = (peak value * charge) – mass of the added protons
One peptide can result in several peaks due to the isotope distribution. The mass of a neutron is 1 Da,
so based on the distance between the isotope peaks you can determine the charge of the peptide.
Example: distance between peaks on the spectrum is 0.33 m/z → charge is 3 (1/3 ≈ 0.33)

Fragmentation of proteins into peptides is usually done by trypsin digestion (cleaves on the C-terminal
side of lysine and arginine)

Using a collision cell after the ionizer in your setup allows you to filter for peptides with a certain charge.
Doubly charged precursors result in spectra that are easiest to interpret. Fragmentation of the
doubly charged precursors occurs in a way that most peptide bonds (weakest bonds) break with
comparable frequency and usually results in two singly charged fragment ions.
De novo peptide identification is difficult. It is better to compare to a simulated library based on a
database: take a sequence from the database → artificial digestion → artificial mass spectrometry →
library spectrum (for all sequences of interest).

MS can only detect a few peptides at the same time. Therefore it is best to separate your peptides
after digestion, for instance using liquid chromatography.

Lecture 16: Proteomics II (was not in lecture, summary based on slides)


For complex peptide mixtures an additional separation step might be necessary. Different forms of
liquid chromatography are commonly used.
- Reverse phase: peptides in a polar solvent pass through a nonpolar column. The polarity of the
peptide determines the extent of binding to the column, separating peptides based on hydrophilic properties.
- Cation/anion exchange chromatography: peptides are loaded onto a charged column to
which they bind due to their own charge. Peptides are eluted with an increasing salt gradient.
More strongly bound peptides need a higher salt concentration to elute. Binding strength depends on
the charge of the peptide, so peptides are separated based on charge.
- Combinations of different separation techniques can be used

Different samples can be compared by adding isotopically labeled amino acids (different mass) to
the growth medium of one or some of your samples (SILAC).
- Estimates relative protein levels between samples with high accuracy
- Can be used on complex mixtures of proteins
- No addition of tag so no altered chromatography
- Ideal for cultures but tricky for whole organisms
- Expensive

Currently the preferred technique for comparing different samples is label free quantification (LFQ).
At least three independent technical replicates are used followed by a statistical test to study
significant differences between samples.
- Can be applied to all types of samples
- Estimates relative protein levels between samples with high accuracy
- Can be used on complex mixtures of proteins
- Requires replicates so more material
- Requires more MS runs
- Requires well-annotated genomes

Studying post-translational modifications of your proteins, for instance phosphorylation:


- Filter out only the phosphorylated proteins with specific columns
- In the mass spectrometer, some peptides can lose their phosphate group
- Find the originally phosphorylated peptides by looking for peaks with a difference in m/z value
corresponding to the lost phosphate group (98 Da)

Studying protein complexes using MS:


- Separate complex from mixture using specific antibody for part of the complex
o If no antibody is available, fuse complex with a tag and use antibody for the tag
- After other proteins are washed away from the matrix, elute the complex and perform MS on the eluate
- Sensitive to false negatives
o Bait for antibody must be properly localized in the complex
o Optional tag may interfere with function and structure
o Temporary interactions may be missed
- Sensitive to false positives
o Sticky proteins sticking to column or complex
Lecture 17: Proteomics III
Another way of studying protein complexes is by selecting based on fractionation. You first simplify
the proteome by separating the proteins based on biochemical properties. You expect stable
complexes to co-fractionate.
In the paper “Census of human soluble protein complexes” they make use of this strategy:
- Separate proteins based on size, charge, etc.
- Look for different complexes that co-fractionate, using MS for identification
- Make a network based on the co-fractionation results
- New members of both already known and new complexes were found

In the paper “Quantifying E. coli proteome and transcriptome with single-molecule sensitivity in single
cells” the relation between protein concentration and the corresponding mRNA concentration is studied.
- For quantifying number of proteins in a cell
o Fuse YFP to protein (one strain for each protein)
o Measure amount of fluorescence and translate it to protein copy number per cell
- For quantifying number of mRNA in a cell
o Create a fluorescent probe that binds to the YFP mRNA sequence of the studied protein
o Measure amount of fluorescence and translate it to mRNA levels per cell
- There is little correlation between protein and mRNA numbers per cell
o Possibly due to the short lifetime of mRNA compared to the protein lifetime
o Possibly due to localization of the protein to a different cell
