Project Progress Report
We will use two datasets. The first is from The Cancer Genome Atlas (TCGA). The goal
of this analysis is to find out what genetic features are related to lung cancer (FireBrowse,
study on lung cancer gene expression. The data was originally published on Bioconductor in
The first dataset consists of lung adenocarcinoma gene expression data. The mRNAseq
preprocessor picks the “scaled_estimate” value from the Illumina HiSeq/GA2 mRNAseq level_3 (v2)
dataset and builds a log2-transformed mRNAseq matrix for downstream analysis.
Preprocessing is already done, but the raw data is available if necessary. The other dataset
So far we have examined both datasets and begun investigating how best to integrate the
two platforms in a useful way. The challenge lies in finding the most effective method of
integration, which we are currently researching. The goal is to unify the two studies and
report on how their respective findings compare post-integration.
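One simple way to picture the integration step is sketched below. It is only an illustration under stated assumptions, not the method we have settled on. It assumes both expression matrices are already on a log2 scale and keyed by a shared gene identifier, keeps only the genes measured in both studies, and standardizes each gene within its own study before combining samples. The function names (`zscore_rows`, `integrate`), the gene labels, and all values are hypothetical. The sketch is written in Python for brevity; the same steps translate directly to R.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Standardize each gene (row) to mean 0 and unit variance within one study."""
    scaled = {}
    for gene, values in matrix.items():
        mu = mean(values)
        sd = pstdev(values)
        # A gene with constant expression carries no signal; map it to zeros.
        scaled[gene] = [0.0 if sd == 0 else (v - mu) / sd for v in values]
    return scaled


def integrate(study_a, study_b):
    """Keep genes present in both studies and concatenate their samples."""
    shared = sorted(set(study_a) & set(study_b))
    a = zscore_rows({g: study_a[g] for g in shared})
    b = zscore_rows({g: study_b[g] for g in shared})
    return {g: a[g] + b[g] for g in shared}


# Toy log2 expression values (hypothetical, not taken from either dataset).
study_a = {"GENE1": [8.0, 10.0, 12.0], "GENE2": [5.0, 5.0, 5.0], "GENE3": [7.0, 9.0, 8.0]}
study_b = {"GENE1": [3.0, 5.0], "GENE2": [2.0, 4.0], "GENE4": [1.0, 2.0]}

combined = integrate(study_a, study_b)
for gene, values in combined.items():
    print(gene, [round(v, 3) for v in values])
```

Standardizing within each study removes platform-specific differences in scale, but it also discards absolute expression levels, so it may not suit every downstream comparison.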
We hope to strengthen the findings made with the two platforms, which will ultimately
help in understanding the relationship between the human genome and lung cancer. We still need
to find the best way to get the dataset from the Broad Institute into a workable format in R. Further,
There is a chance that we will not be able to integrate the two platforms because we lack
the information needed to do so. First, we will attempt to get what is required from the
publishers of the data, or from another publication. If this fails, we can perform an in-depth