Professional Documents
Culture Documents
Cancerous Profiles - 2017 - Conference - Paper
Cancerous Profiles - 2017 - Conference - Paper
Abstract— There are variety of options available for cancer trying their hard to reveal the hidden aspects and networks, which
treatment. The type of treatment recommended for an individual can help in proper diagnosis and treatment of diseases like
is influenced by various factors such as cancer-type, the severity Cancer. Data Mining and Machine learning approaches are giving
of cancer (stage) and most important the genetic heterogeneity. In a powerful hand in such a data driven analysis. Gene expression
such a complex environment, the targeted drug treatments are involves the overall process of information retrieval from the
likely to be irresponsive or respond differently. To study anti- gene, hence helps in the synthesis of functional products called
cancer drug response we need to understand cancerous profiles. protein. The amount of mRNA produced by the gene at a
These cancerous profiles carry information which can reveal the particular instant of time, corresponds to the gene expression
underlying factors responsible for cancer growth. Hence, there is value. These expression values may alter depending upon the
need to analyze cancer data for predicting optimal treatment environment, any biological regulator and biological pathways
options. Analysis of such profiles can help to predict and discover involved. The process of mapping information from genes to
potential drug targets and drugs. In this paper the main aim is to proteins synthesis is carried by agent called mRNA. Transcription
provide machine learning based classification technique for and Translation are the two sub processes that are involved in this
cancerous profiles. process. Transcription involves duplication of gene sequence in
Keywords: Cancer, Clustering, Machine Learning, Genes, Drug the form of RNA. Once the genetic coding is copied on messenger
Prediction RNA(mRNA) then it exists the nucleus and enters cytoplasm and
I. Introduction eventually encoded protein is synthesized. Translation involves
interpretation of mRNA sequence of amino acids so as to
We all living organisms are made up of basic unit of life, called synthesize proteins.
Cells. Individual cells describes a completely complex
functionality. What makes them more interesting, are genes.
Genes are the carrier of genetic information within the Cell. The
information about the inherited phenotypic traits in living
organisms is determined by genes. Genetics is a branch of science
that has evolved ever since study of genes started. Advancement
in bioinformatics has raised the patient’s life expectancy and
boosted the treatment procedure of various chronic diseases.
Screening of various diseases like diabetics, cancer and heart
attack is no more a tedious task. Chip technology in healthcare
has provided laboratory on-a- chip devices. These chips helps in
predicting the drug responses corresponding to patient’s genetic
profile.
32
compound contains 900 AUC values. Each tissue has different Algorithmic description of Back propagation feed forward
IC50 and AUC value. The tissue values have been utilized as inner network
clustering in the proposed algorithm. There can be multiple input variables (x1, . . . ,xm) corresponding
A. Pre-processing to input layer with multiple nodes. Input layer just passes the input
The fourth column of each dataset is the tissue on the basis of data to intermediate processing nodes (nodes 1, 2 and 3) and
which preprocessing has to be performed. The Pseudo-Code mathematically can be written as below:
description of the preprocessing is given above. ͳǤ ݎ݁ݕ݈݄ܽ݊݁݀݀݅ݐݏݎ݂݂݅ݏ݉ݑݏ݀݁ݐ݄݃݅݁ݓǣ
݊ଷ ൌ ݓଵଷ ݔଵ ݓଶଷ ݔଶ
Pseudo-Code-2 ݊ସ ൌ ݓଵସ ݔଵ ݓଶସ ݔଶ
ʹǤ݊݅ݐܿ݊ݑ݂݊݅ݐܽݒ݅ݐ݄ܿܽ݁ݐݕ݈ܣǣ
1. Neuralarch = function createcluster(org_data)
ሺሻ
2. [row,cols] = Size (org_data) ͵Ǥ݂݁݀݊݉ݑݏ݀݁ݐ݄݃݅݁ݓ݄݁ݐ݁ݐ݈ܽݑ݈ܿܽܥͷǣ
3. Group= [ ] // Initializing the group parameters ݊ହ ൌ ݓଷହ ݕଷ ݓସହ ݕସ
ͶǤ݄ܶ݁݊ݐݑݐݑ݈ܽ݊݅ܨ
4. Groupname = [ ] For each set of target based on the inner tissue types, group will
5. Grpcount = 0 Traindata = [ ] record_count = 1 be created.
6. Group[0] = ord_data(1).issuenumber TABLE I. Grouping of Tissue Types
7. Groupname[0] = org_data(1).issuename
TCGA Tissue
8. Currentgroup = groupname[0]
Cell classific Tiss sub-
9. for i = 1:rows line ation ue type IC50 AUC
10. if (ord_data (i).Tissue name==Currentgroup) Group 1
IST_ SKCM Skin Melan .0042 .1550
11. Traindata[recordcount,0:cols] MELI oma
12. Record_count = record_count+1 C32 SKCM Skin Melan .0060 .1760
oma
13. Group(record_count) = grpcount
RPM1 SKCM Skin Melan .0100 .2230
14. Else oma
15. Groupcount = groupcount+1 Group 2
A549 LUAD Lung Lung .0045 .1540
16. Traindata[recordcount,0:colsRecord_count] NSCL
17. Group(record_count)=grpcount HCC- LUAD Lung Lung .0141 .2700
44 NSCL
18. End if
19. End for The first four records belong to same tissue ‘melanoma’ and
hence for neutral, they will be put in group1 and the last two
records belong to tissue ‘Lung_NSCL’ & hence they will be put
B. Feed forward Neural Network in group 2 as per neural architecture. The neural network object
is initialized with the train set and an equal number of group
In feed forward neural networks output of one layer act as input values. There would be 30 hidden neurons for the conversion of
to intermediate next layer without making any acyclic input data at input layer to the transformed data at the hidden
dependency. Output is generally obtained from final network layer. Now, the certainty of taking neurons is random. It is just an
layer. It can be written as below: estimate and it is completely dependent upon the data size. The
neural network will support true training only if the number of
ሾݔǡ ݐሿ ൌ ݐ݁ݏܽݐ̴ܽ݀ݐ݂݈݅݁݉݅ݏǢ rows in training data and group is same.
݊݁ ݐൌ ݂݂݁݁݀ݐ݁݊݀ݎܽݓݎሺͳͲሻǢ
݊݁ ݐൌ ݊݅ܽݎݐሺ݊݁ݐǡ ݔǡ ݐሻǢ The epoch is the total number of iterations which the neural
ݓ݁݅ݒሺ݊݁ݐሻ; network can run to completely understand the data pattern of the
ݕൌ ݊݁ݐሺݔሻǢ training set. 50 is the maximum number of iteration here but it is
݂ݎ݁ൌ ݉ݎ݂ݎ݁ሺ݊݁ݐǡ ݕǡ ݐሻ; not necessary that the network will run all those 50 iterations.
There are certain performance measures for the neural network.
33
If any of the performance measures are satisfied then it would stop
the training at that moment only and the network will roll back.
The network will choose that value where the MSE would be
least.
34
Table II: Target Class (SVM) positives. Its value must be high in order to get good proposed
algorithm. The obtained value of (p) for 100 samples is 89.2%.
No. of Precision Recall F- Accuracy
Samples measure
35
Although various approaches have been proposed in literature for microarray data classification using a novel genetic swarm
classification but gene selection still remains a major curse. algorithm,” Expert Systems with Applications, Vol. 39, No.
Cancer is a heterogeneous disease which consists of various 2, pp. 1811–1821, 2012.
subtypes. Hence, there is urgent need to develop systems or [5] Liwei Fan, Kim-LengPoh, and Peng Zhou “A sequential
methods that can help in early diagnosis and prognosis of cancer feature extraction approach for naive bayes classification of
type. Past decade has evolved various new approaches related to microarray data,” Expert Systems with Applications, Vol. 36,
cancer research. Various biological and computational techniques No. 6, pp. 9919–9923, 2009.
have been used by scientists to early detect cancer type. [6] Todd R Golub, Donna K Slonim, Pablo Tamayo, Christine
Collection of large cancer data repositories has hiked the research Huard, Michelle Gaasen beek, Jill P Mesirov, Hilary Coller,
in this domain. Various machine learning approaches have been Mignon L Loh, James R Downing and Mark A Caligiuri
used to predict if tumor is malignant or not. “Molecular classification of cancer: class discovery and class
So, in order to address aforementioned challenges the proposed prediction by gene expression monitoring,” Science, Vol.
technique is an attempt to solve classification problem for 286, No. 5439, pp. 531–537, 1999.
cancerous genomic profiles. Our technique is based on concept of [7] Gersende Fort and Sophie Lambert-Lacroix, “Classification
utilizing SVM and NN machine learning algorithm. Result using partial least squares with penalized logistic regression,”
provides comparative analysis of model performance when the Bioinformatics, Vol. 21, No. 7, pp. 1104–1111, 2005.
sample size is varied. As the sample size increase model [8] Leping Li, Clarice R Weinberg, Thomas A Darden and Lee
performance also increases, which shows positive aspect towards G Pedersen, “Gene selection for sample classification based
the robustness and adaptivity of the model. In future, this on gene expression data study of sensitivity to choice of
approach can be extended to implement integrative framework for parameters of the GA/KNN method,” Bioinformatics, Vol.
anti-cancer drug prediction. 17, No. 12, pp. 1131–1142, 2001.
[9] John N Weinstein, Eric A Collisson, Gordon B Mills, Kenna
R Mills Shaw, Brad A Ozenberger, Kyle Ellrott, Ilya
VI. References Shmulevich, Chris Sander and Joshua M Stuart, “The cancer
[1] Santanu Ghorai, Anirban Mukherjee, Sanghamitra Sengupta genome atlas pan-cancer analysis project,” Nature Genetics,
and Pranab K Dutta, “Cancer classification from gene Vol. 45, No. 10, pp. 1113–1120, 2013.
expression data by nppc ensemble,” IEEE Transactions on [10] Jianting Sheng, Fuhai Li, and Stephen TC Wong, “Optimal
Computational Biology and Bioinformatics (TCBB), Vol. 8,
drug prediction from personal genomics profiles,”
No. 3, pp. 659–671, 2011.
Biomedical and Health Informatics, Vol. 19, No. 4, pp.
[2] Alexandre R Zlotta, “Genome sequencing identifies a basis 1264–1270, 2015.
for everolimus sensitivity,” European urology, Vol. 64, No. [11] Vandana Bhatia and Rinkle Rani, “A parallel fuzzy
3, pp. 29-33, 2013. clustering algorithm for large graphs using Pregel,” Expert
[3] GopaIyer, Aphrothiti J Hanrahan, Matthew I Milowsky, Systems with Applications, Vol. 78, pp-135-144, 2017.
Hikmat Al-Ahmadie, Sasinya N Scott, Manickam
[12] Bharti Saneja, and Rinkle Rani, “An efficient approach for
Janakiraman, Mono Pirun, Chris Sander, Nicholas D Socci outlier detection in big sensor data of health care,”
and Irina Ostrovnaya, “Genome sequencing identifies a basis International Journal of Communication Systems, DOI:
for everolimus sensitivity,” Science, Vol. 338, No. 6104, pp. 10.1002/dac.3352, 2017
221–229, 2012.
[4] P Ganesh Kumar, T Aruldoss Albert Victoire, P Renukadevi
and Durairaj Devaraj, “Design of fuzzy expert system for
36