
A Comparative Study of Machine Learning Algorithms to Predict Ovarian Cancer


Pranam R Betrabet, Department of Computer Science and Engineering, MIT, Manipal, India (pranamrrao@gmail.com)
Ms. Shwetha Rai, Department of Computer Science and Engineering, MIT, Manipal, India (shwetha.rai@manipal.edu)

Abstract— Ovarian cancer (OC) is one of the most widely recognized kinds of cancer in women. Ovarian cancer is the cancer that starts in the ovary or at the end of the fallopian tubes next to the ovary. Most patients with ovarian cancer have advanced illness at the time of diagnosis because early-stage tumors are often asymptomatic, leading to poorer long-term survival. To help tackle this problem, machine learning can support better and lower-cost health care. In this research, we investigate the feasibility of employing five machine learning algorithms, namely SVM, Random Forest, Logistic Regression, KNN, and Decision Tree, for identifying whether an ovarian tumor found in a woman is malignant or benign.

Keywords—Ovarian Cancer, Machine Learning, Cancer Dataset, Predictive Model

I. INTRODUCTION

Cancer is an evolving health concern of the present population. People aged 35–65 are increasingly affected by this deadly disease. The social and economic cost implications of cancer are enormous for society. Families facing this kind of illness suffer greatly from the direct costs involved, and reduced productivity adds indirect costs to society. It is also important that clinical services be equipped to deliver the necessary level of care for the illness through the healthcare delivery system at both the primary and secondary levels. Ovarian cancer is considered the most hazardous kind of malignant tumour among women across the world. The rate of survival can be high if the disease is diagnosed at an early stage. Thus, it is very crucial to predict these types of cancer at an early stage.

II. LITERATURE REVIEW

Moore et al. compared RMI and ROMA for predicting epithelial ovarian cancer (EOC) in 457 women and concluded that ROMA identified EOC with better accuracy than RMI [1].

Wang et al. performed a meta-analysis to evaluate the diagnostic value of HE4, CA125, and ROMA based on 32 studies; they suggested that HE4 is useful for diagnosing OC, especially in the premenopausal population, while CA125 and ROMA are more suitable for diagnosing OC in the postmenopausal population [2].

Zhang et al. developed a linear multi-marker model combining HE4, CA125, progesterone (Prog), and estradiol (E2); the last two markers were suggested by a body of epidemiological evidence on the development and progression of OC [3,4]. Their multi-marker model showed significant improvement over CA125 or HE4 alone in differentiating benign pelvic masses (BPM) from EOC patients [5].

In summary, the current clinical tests for diagnosis of OC focus on a small number of biomarkers that were selected either by meeting gene/protein over-expression criteria [6] or by epidemiological evidence. Machine learning (ML) is an emerging research area which offers a variety of useful methodologies that can handle high-dimensional datasets, and it excels in providing methods that can efficiently and effectively evaluate a large number of variables to construct an accurate model for prediction [7,8,9,10].

In medical research, ML methods such as decision trees have been applied successfully to predict mortality from trauma [11,12] and breast cancer [13].

In this study, we employ machine learning methodologies to develop a predictive model by investigating a panel of 49 measures from 349 patients, which include demographics, blood routine tests, general chemistry, and tumor markers, for classifying BOT and OC.

III. METHODOLOGY

Five machine learning algorithms were considered in this research, namely KNN, Logistic Regression, SVM, Random Forest, and Decision Tree. First, the collected data was preprocessed by applying suitable preprocessing methods: duplicate records present in the dataset were removed, and missing values were imputed with the column mean in order to handle them efficiently. The five machine learning algorithms were then run on this pre-processed data, and their corresponding accuracies and other evaluation metrics were compared. Second, an outlier removal technique, namely IQR outlier detection, was applied to the pre-processed data, after which the five algorithms were run again and their accuracies were compared.
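The preprocessing steps described above (duplicate removal, mean imputation, and IQR-based outlier removal) can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the column name used in the IQR helper is a placeholder.

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, then fill missing values with each column's mean."""
    df = df.drop_duplicates()
    return df.fillna(df.mean(numeric_only=True))

def remove_iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep only rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile(0.25), df[column].quantile(0.75)
    iqr = q3 - q1
    return df[(df[column] >= q1 - k * iqr) & (df[column] <= q3 + k * iqr)]
```

In practice the IQR rule would be applied per feature column; the factor k = 1.5 is the conventional default, as the paper does not state the threshold used.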
Third, feature subset selection techniques, namely PCA and t-SNE, were applied to the pre-processed dataset in order to extract the features most relevant to the target class, keeping only the top-ranked features. The five algorithms mentioned above were run again on the processed dataset after applying each of these feature selection techniques; the accuracies and other evaluation metrics were calculated for each model, and the results were compared to find out which algorithm produces the best accuracy.

FEATURE SELECTION METHODS

A. PCA (Principal Component Analysis)
Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest. PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components, obtaining lower-dimensional data while preserving as much of the data's variation as possible.

B. TSNE (T-Distributed Stochastic Neighbour Embedding)
t-SNE is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modelled by nearby points and dissimilar objects are modelled by distant points with high probability.

CLASSIFICATION METHODS

A. KNN Classification
KNN classification is a prominent technique for classifying an unseen instance by examining the instances nearest to it. The KNN algorithm works by discovering the K instances closest to the unseen instance and assigning it the majority class among them.

B. Logistic Regression
Logistic regression is a predictive analysis used to predict the outcome of a categorical dependent variable based on one or more independent variables. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval, or ratio-level independent variables. Previous research shows that logistic regression is used widely in medical research, especially for correlating dichotomous outcomes with predictor variables that may include different physiological data.

C. SVM Classification
SVM is a supervised machine learning technique that classifies the data into two different classes via a separating hyperplane. The vectors, or cases, that define the hyperplane are called support vectors. SVM is a classification method based on statistical learning theory. SVM plays a role similar to C4.5, except that it does not use decision trees at all.

D. Random Forest Classification
Random Forest classification is a bagging technique implemented by building multiple decision trees at training time, getting a prediction from each tree, and selecting the best one as the output; the output is selected mainly by voting. It produces good results most of the time, even when a large amount of data is missing, and it provides good accuracy on unscaled data.

E. Decision Tree Classification
A decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. The topmost decision node in a tree, which corresponds to the best predictor, is called the root node. Decision trees can handle both categorical and numerical data.
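The comparison of the five classifiers described above can be sketched with scikit-learn as follows. Synthetic data of the same shape (349 samples, 49 features) stands in for the clinical dataset, which is not reproduced here; hyperparameters are library defaults, as the paper does not state them.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Placeholder data with the paper's dimensions: 349 patients, 49 measures.
X, y = make_classification(n_samples=349, n_features=49, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Mean 10-fold cross-validated accuracy per model, as in the paper's protocol.
scores = {name: cross_val_score(m, X, y, cv=10).mean() for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.4f}")
```

The same loop can be re-run after each preprocessing or feature-selection variant to reproduce the comparison structure of Tables 1–3.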

Fig. 1 Flow Diagram of Complete Procedure


IV. RESULTS AND ANALYSIS

In the dataset, out of 349 records, 10% (35 records) were considered as testing data and the remaining 314 data samples were considered as training data. At first, we ran the five machine learning algorithms on the pre-processed data without any feature selection methods applied to the dataset and obtained the results. This time, all 49 features in the dataset were considered while training the model.
Four commonly used evaluation metrics, namely accuracy score, precision score, recall score, and F1 score, were considered in order to evaluate and compare the models.
Cross-validation with number of folds = 10 was performed for each of the machine learning algorithms, and the mean score over these 10 folds was taken for each evaluation metric. The metrics are calculated as follows:

Accuracy Score = (TP+TN)/(TP+TN+FP+FN)
Precision Score = TP/(TP+FP)
Recall Score = TP/(TP+FN)
F1 Score = 2 * (Precision * Recall)/(Precision + Recall)
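As a concrete check of the formulas above, they can be computed directly from confusion-matrix counts. The counts below (TP = 13, TN = 15, FP = 5, FN = 1) are hypothetical, chosen because they reproduce the four values in the KNN column of Table 1.

```python
def evaluation_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

m = evaluation_metrics(tp=13, tn=15, fp=5, fn=1)
print({k: round(v, 4) for k, v in m.items()})
# → {'accuracy': 0.8235, 'precision': 0.7222, 'recall': 0.9286, 'f1': 0.8125}
```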

The values of the evaluation metrics obtained are summarized in Table 1.

Metric            KNN      SVM      Random Forest   Decision Tree   Logistic Regression
Accuracy Score    0.8235   0.8235   0.9412          0.7941          0.7941
Precision Score   0.7222   0.7222   0.9444          0.6667          0.6667
Recall Score      0.9286   0.9286   0.9444          0.9231          0.9231
F1 Score          0.8125   0.8125   0.9444          0.7742          0.7742

Table 1. Evaluation metrics with all 49 features

Next, a feature subset selection method, namely PCA, was applied on the processed dataset. The 10 best PCA features were selected, and the dataset was then tested with all five algorithms using only these 10 best features; the respective values of the evaluation metrics were obtained.
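The PCA reduction step can be sketched as follows. Synthetic data again stands in for the clinical dataset, and standardizing the features before PCA is an assumption on our part, since the paper does not state its scaling.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder data with the paper's dimensions.
X, _ = make_classification(n_samples=349, n_features=49, random_state=0)

# Standardize, then keep the first 10 principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)  # (349, 10)
print(f"variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```

The classifiers are then retrained on `X_reduced` in place of the full 49-feature matrix.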

The values of the evaluation metrics obtained are summarized in Table 2.

Metric            KNN      SVM      Random Forest   Decision Tree   Logistic Regression
Accuracy Score    0.6471   0.7647   0.7059          0.7353          0.7059
Precision Score   0.6111   0.6667   0.5000          0.6667          0.5556
Recall Score      0.6875   0.8571   0.9000          0.8000          0.8333
F1 Score          0.6471   0.7500   0.6429          0.7273          0.6667

Table 2. Evaluation metrics with the 10 best PCA features

Finally, t-SNE, a feature subset selection method, was applied on the processed dataset. The best 3 t-SNE features were selected, and the dataset was then tested with all five algorithms using only these 3 best features; the respective values of the evaluation metrics were obtained.
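The t-SNE step can be sketched similarly. Note that t-SNE provides no transform for unseen data, so the 3-dimensional embedding must be computed on the full dataset before any train/test split; that this is how the pipeline was run is an assumption, as is the use of synthetic stand-in data.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE

# Placeholder data with the paper's dimensions.
X, _ = make_classification(n_samples=349, n_features=49, random_state=0)

# Embed all samples into 3 dimensions; these 3 coordinates serve as the features.
X_embedded = TSNE(n_components=3, random_state=0).fit_transform(X)
print(X_embedded.shape)  # (349, 3)
```

Because the embedding is fit on all samples at once, this setup leaks test-set structure into the features, which may partly explain why t-SNE-based results differ from the PCA ones.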

The values of the evaluation metrics obtained are summarized in Table 3.

Metric            KNN      SVM      Random Forest   Decision Tree   Logistic Regression
Accuracy Score    0.6176   0.7353   0.6471          0.5000          0.7353
Precision Score   0.5556   0.7222   0.6667          0.3889          0.7778
Recall Score      0.6667   0.7647   0.6667          0.5385          0.7368
F1 Score          0.6061   0.7429   0.6667          0.4516          0.7568

Table 3. Evaluation metrics with the 3 best t-SNE features
V. CONCLUSION AND FUTURE WORK

In this research, we investigated the feasibility of employing five machine learning algorithms, namely SVM, Random Forest, Logistic Regression, KNN, and Decision Tree, for identifying whether an ovarian tumor found in a woman is malignant or benign. These experiments were conducted on a real-world dataset which was pre-processed as an initial stage; missing values were filled with the mean value of the column. To obtain the accuracy, we chose the average accuracy over ten folds. From our observations, we can say that the Random Forest classifier gives the best prediction accuracy when all the features of the dataset are included.

As future work, we plan to collect time-series data from hospitals and apply machine learning algorithms to it. The aim is to merge feature learning with the machine learning algorithms employed in this research work.

REFERENCES

[1] R.G. Moore, M. Jabre-Raughley, A.K. Brown, K.M. Robison, M.C. Miller, W.J. Allard, R.J. Kurman, R.C. Bast, S.J. Skates, Comparison of a novel multiple marker assay vs the Risk of Malignancy Index for the prediction of epithelial ovarian cancer in patients with a pelvic mass, American Journal of Obstetrics and Gynecology 203 (3) (2010) 228.e221-228.e226.
[2] J. Wang, J. Gao, H. Yao, Z. Wu, M. Wang, J. Qi, Diagnostic accuracy of serum HE4, CA125 and ROMA in patients with ovarian cancer: a meta-analysis, Tumor Biology 35 (6) (2014) 6127–6138.
[3] A. Lukanova, R. Kaaks, Endogenous hormones and ovarian cancer: epidemiology and current hypotheses, Cancer Epidemiology, Biomarkers & Prevention 14 (1) (2005) 98–107.
[4] S.M. Ho, Estrogen, progesterone and epithelial ovarian cancer, Reproductive Biology and Endocrinology 1 (2003) 73.
[5] P. Zhang, C. Wang, L. Cheng, P. Zhang, L. Guo, W. Liu, Z. Zhang, Y. Huang, Q. Ou, X. Wen, et al., Development of a multi-marker model combining HE4, CA125, progesterone, and estradiol for distinguishing benign from malignant pelvic masses in postmenopausal women, Tumor Biology 37 (2) (2016) 2183–2191.
[6] L.J. Havrilesky, C.M. Whitehead, J.M. Rubatt, R.L. Cheek, J. Groelke, Q. He, D.P. Malinowski, T.J. Fischer, A. Berchuck, Evaluation of biomarker panels for early stage ovarian cancer detection and monitoring for disease recurrence, Gynecologic Oncology 110 (3) (2008) 374–382.
[7] I.H. Witten, E. Frank, M.A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers Inc., 2011.
[8] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, First Edition, Addison-Wesley Longman Publishing Co. Inc., 2005.
[9] M.D. Ganggayah, N.A. Taib, Y.C. Har, P. Lio, S.K. Dhillon, Predicting factors for survival of breast cancer patients using machine learning techniques, BMC Medical Informatics and Decision Making 19 (1) (2019) 48.
[10] R. Miao, T.C. Badger, K. Groesch, P.L. Diaz-Sylvester, T. Wilson, A. Ghareeb, J.A. Martin, M. Cregger, M. Welge, C. Bushell, et al., Assessment of peritoneal microbial features and tumor marker levels as potential diagnostic tools for ovarian cancer, PLOS ONE 15 (1) (2020) e0227707.
[11] L.H. Kim, J.L. Quon, T.A. Cage, M.B. Lee, L. Pham, H. Singh, Mortality prediction and long-term outcomes for civilian cerebral gunshot wounds: A decision-tree algorithm based on a single trauma center, Journal of Clinical Neuroscience 75 (2020) 71–79.
[12] C.-S. Rau, S.-C. Wu, P.-C. Chien, P.-J. Kuo, Y.-C. Chen, H.-Y. Hsieh, H.-Y. Hsieh, Prediction of Mortality in Patients with Isolated Traumatic Subarachnoid Hemorrhage Using a Decision Tree Classifier: A Retrospective Analysis Based on a Trauma Registry System, International Journal of Environmental Research and Public Health 14 (11) (2017) 1420.
[13] R. Sumbaly, N. Vishnusri, S. Jeyalatha, Diagnosis of Breast Cancer using Decision Tree Data Mining Technique, International Journal of Computer Applications 98 (2014) 16–24.
