(IJCST-V1I2P7) : T.Shanmugavadivu, T.Ravichandran


International Journal of Engineering Trends and Applications (IJETA) Volume 1 Issue 2, Sep-Oct 2014

RESEARCH ARTICLE

OPEN ACCESS

Gene Analysis for Class Predication and Class Discovery


T. Shanmugavadivu¹, T. Ravichandran²
Research Scholar¹, Principal²
Karpagam University¹
Hindustan Institute of Technology²
Coimbatore, India

ABSTRACT
Classification of patient samples is a crucial aspect of cancer diagnosis and treatment. We present a method for
classifying samples by computational analysis of gene expression data. We divide the classification problem into two
parts: class discovery and class prediction. Class discovery refers to the process of dividing samples into reproducible
classes that have similar behavior or properties, while class prediction places new samples into already known classes.
We describe a method for performing class prediction and illustrate its strength by correctly classifying bone marrow
and blood samples from acute leukemia patients. We also describe how to use our predictor to validate newly
discovered classes.
Keywords:- Gene class, Feature Selection, Feature Prediction.

I. INTRODUCTION
Expression profiling experiments often involve measuring the relative amount of mRNA expressed in two or more
experimental conditions. This is because altered levels of a specific sequence of mRNA suggest a changed need for the
protein coded for by the mRNA, perhaps indicating a homeostatic response or a pathological condition. For example,
higher levels of mRNA coding for alcohol dehydrogenase suggest that the cells or tissues under study are responding to
increased levels of ethanol in their environment. Similarly, if cancer cells express higher levels of mRNA associated
with a particular transmembrane receptor than normal cells do, it might be that this receptor plays a role in cancer. A
drug that interferes with this receptor may prevent or treat cancer. In developing a drug, one may perform gene
expression profiling experiments to help assess the drug's toxicity, perhaps by looking for changing levels in the
expression of cytochrome P450 genes, a biomarker of drug metabolism. Gene expression profiling may become an
important diagnostic test.

II. SYSTEM MODEL
Evaluating the proportion of an individual mRNA species in polysomes may provide a means to identify mRNA
features that contribute to translational regulation. Such an evaluation would also require knowledge of the full-length
sequence of the mature transcript. There are over 28 000 publicly available full coding-region cDNA sequences for
Arabidopsis. These cDNAs provide reliable coding and 3′-UTR sequence information, but may not begin at the 5′
terminus of the mRNA. Over 14 000 full-length cDNAs (FL-cDNAs) with high-quality 5′-UTR sequences have also
been characterized. The cDNA and FL-cDNA sequence data provide a valuable resource for bioinformatic
characterization of features of the 5′-UTR, coding sequence and 3′-UTR sequences that underlie variation in
translational regulation. In this report, a quantitative assessment of the proportion of mRNA in polysomes for over
11 000 genes was used to evaluate the significance of general mRNA sequence features on translational regulation
under non-stress (NS) and DS conditions in Arabidopsis.

III. CLASSIFICATION METHODS
Nearest-neighbor classification methods are based on a distance function for pairs of tumor mRNA samples, such as
the Euclidean distance or one minus the correlation of their gene expression profiles. The k-nearest-neighbor rule
proceeds as follows to classify test set observations on the basis of the learning set. For each tumor sample in the test
set, (a) find the k closest tumor samples in the learning set, and (b) predict the class by majority vote; that is, choose
the class that is most common among those k neighbors. The number of neighbors k is chosen by cross-validation; that
is, by running the classifier on the learning set only. Each tumor sample in the learning set is treated in turn as if it
were in the test set: its distance to all of the other learning set tumor samples (except itself) is computed, and it is
classified by the rule. The classification for each learning set observation is then compared to the truth to produce the
cross-validation error rate. This is done for a number of values of k (here k ∈ {1, 3, 5, ..., 21}), and the k for which the
cross-validation error rate is smallest is retained for use on the test set.
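
The nearest-neighbor procedure of Section III (distance function, majority vote among the k closest learning-set samples, and choice of k by leave-one-out cross-validation on the learning set) can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this work: the function names, the toy two-gene profiles, the class labels and the tie-breaking behavior are all assumptions, and the default candidate set (odd k from 1 to 21) is an assumed reading of the range given in the text.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_minus_correlation(a, b):
    """One minus the Pearson correlation of two expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 1.0  # assumption: treat an undefined correlation as zero
    return 1.0 - cov / math.sqrt(va * vb)


def knn_predict(sample, learn_x, learn_y, k, dist=euclidean):
    """Majority vote among the k learning-set samples closest to `sample`."""
    order = sorted(range(len(learn_x)), key=lambda i: dist(sample, learn_x[i]))
    votes = Counter(learn_y[i] for i in order[:k])
    # Ties go to the class seen first among the neighbors (an assumption).
    return votes.most_common(1)[0][0]


def loo_error(learn_x, learn_y, k, dist=euclidean):
    """Leave-one-out cross-validation error rate on the learning set."""
    errors = 0
    for i in range(len(learn_x)):
        rest_x = learn_x[:i] + learn_x[i + 1:]
        rest_y = learn_y[:i] + learn_y[i + 1:]
        if knn_predict(learn_x[i], rest_x, rest_y, k, dist) != learn_y[i]:
            errors += 1
    return errors / len(learn_x)


def choose_k(learn_x, learn_y, candidates=range(1, 22, 2), dist=euclidean):
    """Return the candidate k with the smallest leave-one-out error."""
    usable = [k for k in candidates if k < len(learn_x)]
    return min(usable, key=lambda k: loo_error(learn_x, learn_y, k, dist))


# Toy learning set: six samples, two genes each (made-up values and labels).
learn_x = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
learn_y = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

best_k = choose_k(learn_x, learn_y)
prediction = knn_predict([4.9, 5.1], learn_x, learn_y, best_k)
```

Passing `dist=one_minus_correlation` switches to the correlation-based distance. When several values of k share the smallest error, this sketch keeps the smallest such k, which is one possible convention.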

ISSN: 2393-9516

www.ijetajournal.org

Page 35


IV. CLASSIFICATION TREES


Binary tree structured classifiers are constructed by repeated splits of subsets (nodes) of the space of gene
expression profiles into two descendant subsets, starting with the whole space itself. Each terminal subset is assigned
a class label, and the resulting partition of the space corresponds to the classifier. There are three main aspects to tree
construction: (a) selection of the splits, so that the data in each of the descendant subsets are purer than the data in
the parent subset; (b) the decision to declare a node terminal, which is done using cross-validation to prune the tree;
and (c) the assignment of each terminal node to a class. Different tree classifiers use different approaches.
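
Aspect (a), choosing a split so that the descendant subsets are purer than the parent, can be illustrated with a short Python sketch. The text does not name an impurity measure, so the Gini index used here is an assumption, as are the function names and the toy data; other tree classifiers use other criteria.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(x, y):
    """Return (gene index, threshold, impurity) of the purest binary split.

    x: list of expression profiles (one list of floats per sample)
    y: list of class labels, one per sample
    If no split improves on the parent node, gene and threshold are None.
    """
    n = len(x)
    best = (None, None, gini(y))
    for g in range(len(x[0])):
        values = sorted(set(row[g] for row in x))
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y[i] for i in range(n) if x[i][g] <= t]
            right = [y[i] for i in range(n) if x[i][g] > t]
            # Impurity of the split, weighted by the size of each descendant.
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (g, t, score)
    return best


# Toy node: four samples, two genes; only the first gene separates the classes.
node_x = [[1.0, 7.0], [2.0, 3.0], [8.0, 6.0], [9.0, 2.0]]
node_y = ["A", "A", "B", "B"]
split = best_split(node_x, node_y)
```

Applying `best_split` recursively to each descendant subset grows the tree; pruning by cross-validation and class assignment at the terminal nodes (aspects (b) and (c)) are not covered by this sketch.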

Generation of trees:
Step 1: Let n be the number of samples in the training data S.
Step 2: Assign equal weight 1/n to each sample in S.
Step 3: For each of k iterations:
    Step 4: Apply the decision tree algorithm to the weighted samples.
            Compute the error e of the obtained tree on the weighted samples.
            Store the obtained tree together with its error e.
            If e is equal to zero:
                Terminate generation of trees.
    Step 5: For each sample in S:
        Step 6: If the sample is classified correctly by the obtained tree:
                    Multiply the weight of the sample by e/(1-e).
            Normalize the weights of all samples.

Classification:
Step 1: Given a new sample.
Step 2: Assign a weight of zero to all classes.
Step 3: For each of the stored trees:
            Add -log(e/(1-e)) to the weight of the class predicted by the tree.
        Return the class with the highest weight.
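
The tree generation and classification steps listed above can be sketched in Python as follows. This is an illustration under stated assumptions and not the implementation used in this work: a one-level tree (decision stump) stands in for the decision tree algorithm, every tree is stored with its error so that the classification step can weight its vote, a perfect tree has its error capped at a small positive value to keep its vote finite, and generation also stops when e reaches 0.5 (a common safeguard that the steps above do not state). All names and the toy data are hypothetical.

```python
import math


def train_stump(x, y, w):
    """Weighted one-level decision tree, the weak learner assumed here."""
    classes = sorted(set(y))
    best = None
    for g in range(len(x[0])):
        for t in sorted(set(row[g] for row in x)):
            for left in classes:
                for right in classes:
                    err = sum(wi for row, yi, wi in zip(x, y, w)
                              if (left if row[g] <= t else right) != yi)
                    if best is None or err < best[0]:
                        best = (err, g, t, left, right)
    return best[1:]  # (gene index, threshold, left label, right label)


def stump_predict(stump, row):
    g, t, left, right = stump
    return left if row[g] <= t else right


def boost(x, y, k):
    """Generate up to k trees, reweighting the samples after each one."""
    n = len(x)
    w = [1.0 / n] * n  # equal initial weights
    trees = []
    for _ in range(k):
        stump = train_stump(x, y, w)
        correct = [stump_predict(stump, row) == yi for row, yi in zip(x, y)]
        e = sum(wi for wi, ok in zip(w, correct) if not ok)  # weighted error
        if e == 0:
            trees.append((stump, 1e-10))  # assumption: cap e so the vote is finite
            break
        if e >= 0.5:
            break  # assumption: stop when the tree is no better than chance
        trees.append((stump, e))
        # Shrink the weights of correctly classified samples, then normalize.
        w = [wi * e / (1 - e) if ok else wi for wi, ok in zip(w, correct)]
        total = sum(w)
        w = [wi / total for wi in w]
    return trees


def classify(trees, row):
    """Weighted vote: each tree adds -log(e/(1-e)) to its predicted class."""
    if not trees:
        raise ValueError("no trees were stored")
    votes = {}
    for stump, e in trees:
        label = stump_predict(stump, row)
        votes[label] = votes.get(label, 0.0) - math.log(e / (1 - e))
    return max(votes, key=votes.get)


# Toy data: one gene, three samples that no single stump separates.
train_x = [[1.0], [2.0], [3.0]]
train_y = ["A", "B", "A"]
trees = boost(train_x, train_y, 3)
predicted = [classify(trees, row) for row in train_x]
```

On this toy set each stump misclassifies at least one sample, while the weighted vote of the stored stumps classifies all three, which is the effect the reweighting is meant to achieve.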

V. RESULTS AND DISCUSSION


The variation in translation of cellular mRNAs under NS and DS conditions was determined by use of an
oligonucleotide microarray designed to monitor 23 000 gene transcripts. Hybridizations were performed with mRNAs
from sucrose density gradient fractions that contained non-polysomal and polysomal complexes. The proportion of
individual mRNA species in polysomes (ribosome loading) was determined for the genes with transcripts detected in
both the non-polysomal and polysomal fractions. The results were consistent with those obtained with a DNA
oligonucleotide array that monitored fewer genes (8000). In NS leaves, ribosome loading for these genes ranged from
61.9 to 92.1% of each mRNA species in polysomal complexes, whereas under DS these values fell to 45.9 and 86.8%.
The decrease in the average proportion of an mRNA species in polysomes, from 82 to 72%, was significant
(P < 0.0001). DS also broadened the modal range of ribosome loading for a large proportion of the mRNAs, indicative
of greater constraints on translation. Over 50% of the mRNAs with a 2-fold or greater increase in abundance in
response to DS showed no decrease in ribosome loading, whereas over 70% of the more than 11 000 mRNAs monitored
showed a significant decrease in ribosome loading. This indicates that many DS-induced mRNAs can circumvent the
global repression of mRNA translation.

VI. CONCLUSION
We formulate the problem of class discovery and distinguish it as a special subclass of the broad category of
clustering problems. We describe how to efficiently assign statistical significance to how well individual genes separate
tissue classes (for both the TNoM and the INFO methods). Based on these efficient methods, we propose several
criteria for evaluating the statistical significance of putative sample classifications. The central idea is to quantify the
overabundance of genes that are informative with respect to any such putative classification. We then combine these
methods with search heuristics and develop an efficient search procedure for finding multiple significant classifications
in data sets.
The main criterion we use in searching for new classifications is the max-surprise score. This score is appealing both
because of its clear definition and because it can be efficiently evaluated. Our evaluation on synthetic data shows that
searching using the max-surprise score can recover a true classification under a wide range of operating parameters,
including the number of relevant and irrelevant genes, the amount of variance in the expression level, and the
difference between the expression levels of genes in the two classes.

