
Project Progress Report

Declan Levine, Tong Liu, and Vincent Wu

We will use two datasets. The first is from The Cancer Genome Atlas (TCGA). The goal of this analysis is to find out which genetic features are related to lung cancer (FireBrowse, firebrowse.org/?cohort=LUSC#). The second dataset is from Bioconductor and comes from a study of lung cancer gene expression. The data accompany a paper originally published in 2004 (Scharpf R, Zhong S, Parmigiani G (2019). lungExpression: ExpressionSets for Parmigiani et al., 2004 Clinical Cancer Research paper. R package version 0.24.0).

The first dataset consists of lung adenocarcinoma gene expression data. The mRNAseq preprocessor picks the “scaled_estimate” value from the Illumina HiSeq/GA2 mRNAseq level_3 (v2) dataset and builds a log2-transformed mRNAseq matrix for downstream analysis. Preprocessing is already done, but the raw data are available if necessary. The second dataset, “lungExpression”, is represented as an ExpressionSet and is also already preprocessed.
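
The log2 step described above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the project itself targets R), and the function name, the gene labels, and the toy values are all hypothetical. How the actual preprocessor handles zero values is not stated here, so treating them as missing is an assumption.

```python
import math


def log2_transform(scaled_estimates):
    """Return the log2 of each scaled estimate.

    Assumption: zero, negative, or missing values are reported as None,
    because log2 is undefined for them.
    """
    transformed = {}
    for gene, value in scaled_estimates.items():
        if value is None or value <= 0:
            transformed[gene] = None
        else:
            transformed[gene] = math.log2(value)
    return transformed


# Toy values for hypothetical genes, not taken from either dataset.
example = {"GENE_A": 0.25, "GENE_B": 4.0, "GENE_C": 0.0}
transformed = log2_transform(example)
```

Working on the log2 scale compresses the large dynamic range of expression values, which is why the matrix is transformed before downstream analysis.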

So far we have examined both datasets and begun to investigate how best to integrate the two platforms. The challenge lies in finding the most effective method, which we are currently researching. The goal is to unify the two studies and report how their respective findings compare after integration.

We hope to strengthen the findings made with the two platforms, which will ultimately help in understanding the relationship between the human genome and lung cancer. We still need to find the best way to get the dataset from the Broad Institute into a workable format in R. Further, we need to convert these data into a form that can be integrated for further analysis.
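
One possible way to put two platforms on a common footing, shown only as a sketch, is to keep the genes measured on both, standardize each gene within its own platform, and then combine the samples. This is an assumption about how integration might be approached, not the method chosen for this project. The Python below is illustrative (the project works in R), and the function names, gene labels, and values are hypothetical.

```python
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Standardize each gene (row) to mean 0 and SD 1 within one dataset."""
    scaled = {}
    for gene, values in matrix.items():
        mu = mean(values)
        sd = pstdev(values)
        # A gene with no variation carries no signal; map it to zeros.
        scaled[gene] = [0.0 if sd == 0 else (v - mu) / sd for v in values]
    return scaled


def integrate(platform_a, platform_b):
    """Keep genes present on both platforms and concatenate their samples."""
    shared = sorted(set(platform_a) & set(platform_b))
    a = zscore_rows({g: platform_a[g] for g in shared})
    b = zscore_rows({g: platform_b[g] for g in shared})
    return {g: a[g] + b[g] for g in shared}


# Toy matrices: gene -> expression values per sample (hypothetical).
rnaseq = {"GENE_A": [1.0, 3.0], "GENE_B": [2.0, 2.0], "GENE_C": [0.0, 4.0]}
microarray = {"GENE_A": [10.0, 20.0, 30.0], "GENE_B": [5.0, 7.0, 9.0]}
combined = integrate(rnaseq, microarray)
```

Standardizing within each platform removes differences in measurement scale, but it also discards absolute expression levels, so it suits comparisons of relative patterns rather than raw magnitudes.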


We intend to use edgeR to run a differential expression analysis on the integrated dataset and compare the outcome with the prior results.
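
edgeR is an R/Bioconductor package that fits negative binomial models to read counts, and the sketch below does not reproduce that model. It only illustrates, in Python and with hypothetical sample names and counts, the basic idea behind a differential expression comparison: scale counts by library size (counts per million) and compute a log2 fold change between two groups. The function names and the `prior` parameter are assumptions made for this illustration.

```python
import math


def counts_per_million(counts_by_sample):
    """Scale each sample's counts by its library size (total counts)."""
    cpm = {}
    for sample, counts in counts_by_sample.items():
        library_size = sum(counts.values())
        cpm[sample] = {g: c / library_size * 1e6 for g, c in counts.items()}
    return cpm


def log2_fold_change(cpm, group_a, group_b, gene, prior=1.0):
    """log2 ratio of mean CPM (group_b over group_a).

    `prior` is a small constant added to both means to avoid log of zero.
    """
    mean_a = sum(cpm[s][gene] for s in group_a) / len(group_a)
    mean_b = sum(cpm[s][gene] for s in group_b) / len(group_b)
    return math.log2((mean_b + prior) / (mean_a + prior))


# Toy counts for hypothetical samples and genes.
counts = {
    "normal_1": {"GENE_A": 100, "GENE_B": 900},
    "tumor_1": {"GENE_A": 400, "GENE_B": 600},
}
cpm = counts_per_million(counts)
lfc = log2_fold_change(cpm, ["normal_1"], ["tumor_1"], "GENE_A")
```

A fold change alone says nothing about statistical significance; that is the part a dedicated package such as edgeR supplies through its dispersion estimates and tests.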

There is a chance that we will not be able to integrate the two platforms because we lack the information needed to do so. We will first attempt to obtain what is required from the publishers of the data or from another publication. If this fails, we can perform an in-depth analysis of each dataset independently and compare the results.
