Mobile Application Development

A

CAPSTONE PROJECT REPORT


ON

LUNG CANCER PREDICTION


PROJECT WORK SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE AWARD OF DIPLOMA.

SUBMITTED BY

ADARSH MISHRA
SAHIL POTALE
BHAVESH GHADE
UNDER THE GUIDANCE OF

MRS.MANJULA ATHANI

DEPARTMENT OF COMPUTER ENGINEERING


MAHARASHTRA STATE BOARD OF TECHNICAL EDUCATION
Academic Year (2022-2023)

CERTIFICATE

This is to certify that Mr. MISHRA ADARSH KRISHNAMOHAN of PRAVIN PATIL
COLLEGE OF DIPLOMA ENGINEERING AND TECHNOLOGY, having Enrollment no.
2005630126, has completed the final-year project titled LUNG CANCER PREDICTION during
the academic year 2022-2023.

The project was completed in a group of three persons under the guidance of the faculty
guide.

Project Members
1 Adarsh Mishra
2 Sahil Potale
3 Bhavesh Ghade

MRS. MANJULA ATHANI


Name & Signature of Guide
Contact No: 9766682460

Acknowledgement

I express my sincere thanks to the Principal, Mrs. R.B. Patil, who gave
me the opportunity to pursue my Diploma in the Computer Engineering department. I also
express my thanks to the H.O.D., Mrs. Manjula Athani, and the other staff of the
Computer Engineering department. I would like to thank my guide, Mrs.
Manjula Athani, for her encouragement and guidance, which helped me in
completing the project. Finally, I would like to thank my colleagues and friends
who helped me in completing the project successfully.

I would also like to express my heartfelt gratitude to my parents, teachers and
friends for their direction, motivation and selfless support.

PROJECT MEMBERS

1 Adarsh Mishra
2 Sahil Potale
3 Bhavesh Ghade

Abstract

Cancer is a very dangerous and common disease that causes death
worldwide. Early diagnosis of cancer improves the chances of being
cured. Cancer causes abnormal growth of cells, which can spread to
other parts of the body. The idea of this project is to help treat lung cancer
at an early stage by predicting it with machine learning techniques. The
common causes of lung cancer are smoking, working in a smoky
environment, breathing industrial or other air pollution, and genetic factors.
In this paper we propose a genetic algorithm based dataset
classification for prediction with multiple models. The genetic
algorithm (GA) has shown better performance than
particle swarm optimization and differential evolution. In this project we
use a CNN (Convolutional Neural Network), a deep learning technique, together
with image processing to make predictions from images. This system helps
cancer patients learn about their stage independently, and they can access
the system from anywhere and at any time. Thus, this system is developed
to detect the cancer stage on the basis of an X-ray.

Contents

1. Introduction
1.1 Current Scenario
1.2 Problem Faced in Current Scenario
1.3 Solution and Planning
1.4 Flow Diagram of Lung Cancer Prediction
2. Literature Review
3. Scope of Project
4. Methodology
4.1 Hardware and Software Requirements
5. Designing
5.1 Block Diagram
5.2 Activity Diagram
5.3 Data Flow Diagram
6. Results and Applications
6.1 Results
6.2 Database Design
6.3 Application
7. Conclusion and Future Scope
7.1 Conclusion
7.2 Future Scope
8. Appendix
8.1 Gantt Charts
9. References & Bibliography
9.1 References
9.2 Bibliography

1. Introduction to project

1.1 Current scenario


Lung cancer is a type of cancer that begins in the lungs. The lungs are two
spongy organs in the chest that take in oxygen when you inhale and release carbon
dioxide when you exhale. Lung cancer is the leading cause of cancer deaths
worldwide. According to Globocan estimates, the incidence of lung cancer in India
is 70,275 cases (all ages, both genders), with an age-standardized incidence rate
of 6.9 per 100,000 population.
1.2 Problems in existing system

While lung cancer is the second most common cancer in the U.S., it's not often
detected early. However, lung cancer screening offers hope for catching the cancer
early when it's easier to treat. Unlike some other cancers, lung cancer usually has
no noticeable symptoms until it's in an advanced stage.
1.3 Solution
As the saying goes, “prevention is better than cure”, and this is the main reason for
choosing this topic for the project. To address these concerns, we are building
this project using various machine learning techniques.

Using machine learning and deep learning techniques, the software will
predict the stage of cancer a person has from an X-ray image and will
suggest the early precautions one should take to avoid growth in the number of
cancer cells.

Modules:
Login and signup page:
The patient or user can sign in or sign up to his/her account to get
detailed information about the patient’s record.

The login page has two data fields, email and password. The patient can fill
in the credentials, sign in to the account and see the dashboard.

The registration form has name, email and password fields for
registering a new user, who can sign up from this form.
Dashboard:
The dashboard is visible after a successful login. It routes to the different modules.

Home Page:
A home page (or homepage) is the main page of a website or application. The
term may also refer to the start page shown when the application first opens.
After logging in to our application, the user can access the
dashboard. The first thing the user sees is the home page, where the user can get basic
information about our application, such as what it is about and what its basic
agenda is.

Prediction:
After learning about the basic architecture (or agenda) of our application, the
user or patient comes to this module through the dashboard. This is the major,
integral module of our application. Only here can the user detect the cancer
stage with the help of an X-ray and find out whether he/she is cancerous or not. Here
we have used various UI elements such as Buttons, the Image Picker widget and a Container.

We have used 2 buttons, one for selecting a picture from the gallery and one for taking
a picture with the camera in real time. The Image Picker widget helps us select a
picture from the gallery at the click of a button. We have placed all these
components in a Container.

Doctors:

After the prediction, the user gets to know the stage of lung
cancer he/she has. If the stage is severe, the application suggests doctors and
medical centers where this kind of treatment is available, along with
their public details. Here we have stored the details of the doctors in a Container,
in a tabular format.

History:

After performing the prediction multiple times, the user or
patient may want to compare the current result with previous ones. The history module
can be used for this; it shows all the predictions performed by a single account.
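The history module described above can be sketched as a small per-account store. This is only an illustration under stated assumptions: the names `PredictionRecord` and `PredictionHistory` are hypothetical, the records are kept in memory rather than in Firebase, and the result strings are placeholders.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PredictionRecord:
    """One prediction performed by a user (hypothetical structure)."""
    image_name: str
    result: str
    created_at: datetime = field(default_factory=datetime.now)


class PredictionHistory:
    """Keeps every prediction grouped by the account that made it."""

    def __init__(self):
        self._records = {}  # account email -> list of PredictionRecord

    def add(self, account, record):
        self._records.setdefault(account, []).append(record)

    def for_account(self, account):
        """Return the account's predictions, newest first."""
        records = self._records.get(account, [])
        return sorted(records, key=lambda r: r.created_at, reverse=True)


history = PredictionHistory()
history.add("user@example.com",
            PredictionRecord("xray_1.png", "placeholder result A", datetime(2023, 1, 10)))
history.add("user@example.com",
            PredictionRecord("xray_2.png", "placeholder result B", datetime(2023, 2, 15)))

for record in history.for_account("user@example.com"):
    print(record.created_at.date(), record.image_name, record.result)
```

Sorting by `created_at` puts the current result first, so it can be compared directly with the previous ones.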

1.4 Flow Diagram of Lung Cancer Prediction:

Fig 1.1 System Flow Diagram

Fig 1.1 shows the system flow diagram of lung cancer prediction. First,
the user has to open the website or the application and log in. For
security reasons, a 4-digit OTP is generated and sent to the phone number used
for login/signup.
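The OTP step described above can be sketched in Python as follows. This is a minimal illustration, not the application's actual implementation: the function names `generate_otp` and `verify_otp` are hypothetical, and sending the code to the phone is left out.

```python
import hmac
import secrets

OTP_LENGTH = 4  # the flow described above uses a 4-digit OTP


def generate_otp(length=OTP_LENGTH):
    """Return a random numeric OTP as a string, keeping leading zeros."""
    # secrets draws from a cryptographically strong random source.
    return "".join(secrets.choice("0123456789") for _ in range(length))


def verify_otp(expected, submitted):
    """Compare the stored OTP with the user's input in constant time."""
    return hmac.compare_digest(expected.encode(), submitted.encode())


otp = generate_otp()
print(len(otp), otp.isdigit())
```

The OTP is kept as a string so that codes starting with zero stay four digits long.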

After that, a browse option is available on the GUI of the application. The
user has to choose that option and select the document (in image format) from the
device memory. The image must be an X-ray of the lungs. The system then
starts working and tries to match the image against the database, and based on that it
gives the accuracy between the original and the reference image.

Based on the stage, it suggests the early precautions one should take.

2. Literature Review

Cancer is a very dangerous and common disease that causes death worldwide.
Early diagnosis of cancer improves the chances of being cured. Cancer
causes abnormal growth of cells, which can spread to other parts of the body. In this paper
we discuss the early prediction of lung cancer with the help of data mining techniques.
The lungs are spongy organs, and when they are affected by cancer cells it can lead to
loss of life. The common causes of lung cancer are smoking, working in a smoky environment,
breathing industrial or other air pollution, and genetic factors. In this paper we have
proposed a genetic algorithm based dataset classification for prediction with multiple
models. The genetic algorithm (GA) has shown better performance
than particle swarm optimization and differential evolution.

Lung cancer is one of the leading causes of mortality in every country, affecting
both men and women. Lung cancer has a poor prognosis, resulting in a high death
rate. The computing sector is becoming fully automated, and the medical industry is also
automating itself with the aid of image recognition and data analytics. This paper
examines the accuracy of three classifiers, Support Vector
Machine (SVM), K-Nearest Neighbor (KNN) and Convolutional Neural Network
(CNN), that classify lung cancer at an early stage so that many lives can be saved.
The data sets used in this study are taken
from the UCI datasets for patients affected by lung cancer. The main aim of this
paper is to analyze the accuracy of the classification algorithms using the
WEKA tool. The experimental results show that SVM gives the best result with
95.56%, followed by CNN with 92.11% and KNN with 88.40%.
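The accuracy figures quoted above are the share of test samples a classifier labels correctly. The sketch below shows that calculation and how classifiers are ranked by it. The labels and the classifier names are made-up toy values for illustration, not data from the cited study.

```python
def accuracy(true_labels, predicted_labels):
    """Return the percentage of predictions that match the true labels."""
    if not true_labels or len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)


# Toy data (assumption): 1 = cancerous, 0 = not cancerous.
true_labels = [1, 0, 1, 1, 0, 1, 0, 0]
predictions = {
    "classifier_a": [1, 0, 1, 1, 0, 1, 0, 1],
    "classifier_b": [1, 1, 1, 0, 0, 1, 0, 1],
}

scores = {name: accuracy(true_labels, labels) for name, labels in predictions.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Accuracy alone can hide false negatives, which matter most in cancer screening, so sensitivity is usually reported alongside it.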

Machine learning based lung cancer prediction models have been proposed to
assist clinicians in managing incidental or screen detected
indeterminate pulmonary nodules. Such systems may be able to reduce variability
in nodule classification, improve decision making and ultimately reduce the
number of benign nodules that are needlessly followed or worked-up. In this
article, we provide an overview of the main lung cancer prediction approaches
proposed to date and highlight some of their relative strengths and weaknesses.
We discuss some of the challenges in the development and validation of such
techniques and outline the path to clinical adoption.

In this project we use a CNN (Convolutional Neural Network), a deep learning
technique, together with image processing to make predictions from images. This system
helps cancer patients learn about their stage independently, and they can access
the system from anywhere and at any time. Thus, this system is developed to detect
the cancer stage on the basis of an X-ray.

3. Scope of Project

Lung cancer results in over 1.7 million deaths per year, making it the deadliest
of all cancers worldwide, more than breast, prostate, and colorectal cancers
combined, and it is the sixth most common cause of death globally, according to
the World Health Organization. While lung cancer has one of the worst survival
rates among all cancers, interventions are much more successful when the cancer
is caught early. The user can get information about the early precautions one should
take and about nearby doctors. The application interface is responsive and
user friendly. The application can be used by an individual to find out whether
he/she is cancerous or not. It can also be used by doctors to double-check their
reports. ML and DL capabilities are to be included, giving the computer the ability
to understand, analyze, manipulate, and potentially generate results. CNN and
image processing technology can be used for analyzing X-ray or CT
scan images.

4. Methodology
In this project we used Flutter and Python for the frontend and Firebase as the
backend.
4.1 Flutter:
Flutter Version: 3.7
The first version of Flutter was announced in 2015 at the Dart
Developer Summit. It was initially known by the codename Sky and ran on
the Android OS. On December 4, 2018, the first stable version of the
framework, Flutter 1.0, was released. As of October 24, 2019, the stable release of
the framework was Flutter v1.9.1+hotfix.6; this project uses version 3.7.
In general, creating a mobile application is a complex and challenging
task. There are many frameworks available that provide excellent features
for developing mobile applications. For developing mobile apps, Android
provides a native framework based on the Java and Kotlin languages, while iOS
provides a framework based on the Objective-C/Swift languages. Thus, we need
two different languages and frameworks to develop applications for both OSs.
To overcome this complexity, several frameworks have been
introduced that support both OSs along with desktop apps. These types of
frameworks are known as cross-platform development tools.

4.2 Python:
Python Version: 3.11.2
Python is a very popular general-purpose, interpreted, interactive, object-
oriented, high-level programming language. It is dynamically typed
and garbage-collected, and it is designed to be highly readable: it frequently uses
English keywords where other languages use punctuation, and it has fewer
syntactical constructions than other languages.
History of Python:
Python was developed by Guido van Rossum in the late eighties and early
nineties at the National Research Institute for Mathematics and Computer
Science in the Netherlands.

Python is derived from many other languages, including ABC, Modula-3, C,
C++, Algol-68, Smalltalk, the Unix shell and other scripting languages.

Python is copyrighted. Its source code is available under the Python Software
Foundation License, an open-source license that is compatible with the GNU General
Public License (GPL).

Python is now maintained by a core development team,
although Guido van Rossum still holds a vital role in directing its progress.

4.3 Firebase:

Firebase is a backend platform for building Web, Android and iOS
applications. It offers a real-time database, different APIs, multiple
authentication types and a hosting platform.

Firebase was initially an online chat service provider to various websites
through an API and ran under the name Envolve. It became popular as developers used
it to exchange application data, such as game state, in real time across their users
more than for chat. This resulted in the separation of the

We built this project using various machine learning (ML)
and deep learning techniques such as CNN (Convolutional Neural
Network), Linear Regression and Multiple Regression. It monitors the
health status of the patient. Using these techniques, we can predict the stage of
lung cancer of an individual with the help of an X-ray image. The system
also tries to match the X-ray image against the dataset of X-ray images, and on
this basis it reports the accuracy.
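The report does not state how the matching against the dataset is computed. One simple possibility, shown here purely as an assumption, is a pixel-wise comparison that returns a similarity score. The function names, the flat grayscale pixel lists and the reference labels are all hypothetical.

```python
def similarity(image_a, image_b):
    """Return a 0..100 similarity score for two equally sized grayscale images.

    Images are flat lists of pixel values in the range 0..255 (assumption).
    """
    if not image_a or len(image_a) != len(image_b):
        raise ValueError("images must be non-empty and of the same size")
    mean_abs_diff = sum(abs(a - b) for a, b in zip(image_a, image_b)) / len(image_a)
    return 100.0 * (1.0 - mean_abs_diff / 255.0)


def best_match(query, dataset):
    """Return (label, score) of the reference image most similar to `query`."""
    scored = ((label, similarity(query, ref)) for label, ref in dataset.items())
    return max(scored, key=lambda pair: pair[1])


# Toy 2x2 "images" flattened to four pixels each.
dataset = {
    "reference_a": [0, 0, 255, 255],
    "reference_b": [255, 255, 0, 0],
}
label, score = best_match([0, 10, 250, 255], dataset)
print(label, round(score, 2))
```

A trained CNN compares learned features rather than raw pixels, so this sketch only illustrates the idea of scoring a query image against reference images.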

4.4 CNN:

CNNs were first developed and used around the 1980s. The most that a
CNN could do at that time was recognize handwritten digits. It was mostly
used in the postal sector to read zip codes, pin codes, etc. The important thing
to remember about any deep learning model is that it requires a large amount
of data to train and also a lot of computing resources. This was a major
drawback for CNNs at that time, and hence CNNs were limited to the
postal sector and failed to enter the wider world of machine learning.
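The core operation of a CNN layer is sliding a small kernel over the image and summing the element-wise products. The sketch below shows that operation in plain Python on a toy image. It is an illustration of the technique, not the model used in this project, and the image and kernel values are made up.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (lists of lists) without padding."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Sum of element-wise products of the kernel and the image patch.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Replace negative responses with zero, as a CNN activation does."""
    return [[max(0, value) for value in row] for row in feature_map]


# Toy 4x4 image with a vertical edge between the second and third columns.
toy_image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Kernel that responds where brightness increases from left to right.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = relu(conv2d_valid(toy_image, edge_kernel))
print(feature_map)
```

The feature map is non-zero only in the column where the edge lies, which is how a convolutional layer localizes a pattern in an image.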

4.5 TensorFlow:

TensorFlow Version: 1.15

TensorFlow is an open-source software library. It was
originally developed by researchers and engineers working on the Google
Brain Team within Google’s Machine Intelligence research organization for
the purposes of conducting machine learning and deep neural network
research, but the system is general enough to be applicable in a wide variety
of other domains as well. At its core, TensorFlow is a software library for
numerical computation using data flow graphs.
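In a data flow graph, nodes are operations and edges carry the values between them. The toy evaluator below illustrates the concept in plain Python. It is not TensorFlow's API: the `Node` class and the helper functions are hypothetical names used only for this sketch.

```python
class Node:
    """One operation in the graph together with the nodes that feed it."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value


def constant(value):
    return Node("const", value=value)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def evaluate(node):
    """Compute a node's value by first evaluating the nodes it depends on."""
    if node.op == "const":
        return node.value
    args = [evaluate(dep) for dep in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError("unknown operation: " + node.op)


# Build the graph for y = w * x + b first, then run it.
x, w, b = constant(3.0), constant(2.0), constant(1.0)
y = add(mul(w, x), b)
result = evaluate(y)
print(result)
```

Building the graph and running it are separate steps here, which mirrors the graph-then-execute style of computation the paragraph above describes.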

4.6 Keras:

Keras is an open-source, high-level neural network library written
in Python that can run on top of Theano, TensorFlow, or CNTK. It
was developed by a Google engineer, Francois Chollet. It is designed to be
user-friendly, extensible, and modular to facilitate faster experimentation
with deep neural networks. It supports convolutional networks and
recurrent networks individually as well as in combination.

4.1 Hardware and Software Requirements:

Hardware:
❖ HDD or SSD.
❖ Intel 2.60 GHz Processor i5 (10th Gen)
❖ 4 GB RAM or above

Software:

❖ 64-bit operating system (Windows 10)
❖ Android Studio
❖ Jupyter Notebook / PyCharm
❖ Python 3.8 or above
❖ Firebase Database

5. Designing

5.1 Block Diagram

Fig. 5.1 Block-Diagram of the Lung cancer Prediction

Fig 5.1 shows the block diagram of lung cancer prediction. After
login/signup, when the user inputs the image, the first thing the system does is check
whether the image has a suitable resolution or size for prediction. Once this guideline is
satisfied, the system applies the prediction algorithm to the image. It then
extracts a feature from the image, and this feature indicates whether the image is cancerous
or not. After scanning, the system tries to match that image against the images in the
dataset. Based on the matching process, the system predicts whether the patient is cancerous or not.
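The first check in the block diagram can be sketched as below. The report does not give the accepted resolution or file formats, so the limits here (`MIN_WIDTH`, `MIN_HEIGHT`, `ALLOWED_FORMATS`) and the function name `check_image` are assumptions chosen for illustration.

```python
MIN_WIDTH = 224    # assumed minimum width in pixels
MIN_HEIGHT = 224   # assumed minimum height in pixels
ALLOWED_FORMATS = {"png", "jpg", "jpeg"}  # assumed accepted image formats


def check_image(width, height, file_name):
    """Return (ok, reason) telling whether an upload can go on to prediction."""
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if extension not in ALLOWED_FORMATS:
        return False, "unsupported format"
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        return False, "resolution too low"
    return True, "ok"


print(check_image(512, 512, "xray.PNG"))
print(check_image(100, 512, "xray.png"))
print(check_image(512, 512, "report.pdf"))
```

Rejecting unsuitable uploads before prediction keeps the later stages from running on images the model was never meant to handle.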

5.2 Activity Diagram

Fig. 5.2 Activity diagram for Lung Cancer Prediction

Fig 5.2 shows the activity diagram of lung cancer prediction. First, the user
logs in or signs up to the application using valid credentials. After that, the user reaches the
dashboard of our application. The user can then select any tab (Doctors, Prediction, History,
Account) as needed. The main detection takes place in the Prediction tab.

5.3 Data Flow Diagram

Fig. 5.3.1 Data Flow Diagram LEVEL 0 for Lung Cancer Prediction

In the level 0 DFD, the user gives the input to the system and the system
shows the result on the basis of the image.

Fig. 5.3.2 Data Flow Diagram LEVEL 1 for Lung Cancer Prediction

Fig 5.3.2 shows the level 1 data flow diagram of lung cancer prediction.
First, the user has to open the website or the application and log in.
For security reasons, a 4-digit OTP is generated and sent to the phone number
used for login/signup.

After that, a browse option is available on the GUI of the application. The
user has to choose that option and select the document (in image format) from the
device memory. The image must be an X-ray of the lungs. The system then
starts working and tries to match the image against the database, and based on that it
gives the accuracy between the original and the reference image.
Based on the stage, it suggests the early precautions one should take.

6. Results and Applications

6.1 RESULTS

6.1.1 Welcome page

Fig 6.1.1 Welcome page of Lung Cancer Prediction

Fig 6.1.1 shows the welcome page of our application. This is the first page of
the application.

6.1.2 Login/Signup page

Fig. 6.1.2.1 Login page of Lung Cancer Prediction

Fig. 6.1.2.1 shows the login page of our application. Here existing patients or
users enter their correct credentials to gain access to the website/application.

Fig. 6.1.2.2 Sign Up page for Lung Cancer Prediction

Fig. 6.1.2.2 shows the signup page. Here new patients or users can create their
own accounts using valid credentials of their choice.

6.1.3 Dashboard

Fig. 6.1.3 Dashboard of Lung Cancer Prediction

Fig. 6.1.3 shows the dashboard. After a successful login, this dashboard appears on the screen.

6.1.4 Prediction

Fig 6.1.4 Prediction Page for Lung Cancer Prediction

Fig 6.1.4 shows the prediction page. The data is collected from the user as an
image document. This is where the actual prediction takes place.

6.2 DATABASE DESIGN

Fig 6.2 Data Base Design

Fig 6.2 shows the database of our application. It consists of the identifier, the
provider, the sign-in date and the User UID.

6.3 APPLICATIONS

• This project can prove to be a boon in the medical field.

• The major application of this project is to inform patients which
stage of lung cancer they have.

• It can be used to reduce the paperwork of reports.

• It can also be used to learn about the early precautions one should take.

• It can be used in remote places such as villages that doctors cannot
easily reach.

• It is a friendly app for patients.

• It suggests nearby doctors and treatment.

• It can also be used for the betterment of the medical community by treating
cancer at an early stage.

7. Conclusion and future scope

7.1 Conclusion

As we know, prevention is better than cure, and this is the main goal of our
project. Lung cancer is one of the most common cancers people
face, even at a very young age. The vast majority (85%) of cases of
lung cancer are due to long-term tobacco smoking. About 10–15% of cases occur
in people who have never smoked; in these cases, factors such as
genetics play a role. However, cancer cells in the lungs are very difficult to detect at an
early stage and are often detected only in the advanced stages. Since our motive is
to detect the cancer cells at an early stage, we proposed this project.

7.2 Future Scope

Currently the system supports only the prediction of lung cancer, but it can
be scaled, and support can be added for other conditions, like cancer of the
heart. Patients with conditions other than lung cancer could then also use the system for
their convenience, so it would target a larger number of users. The user can get
information about the early precautions one should take and about nearby doctors.
The application interface is responsive and user friendly. The system would be
constantly upgraded with new features and regularly tested for errors and bugs, thus
providing more accuracy and a less error-prone environment. ML and DL
capabilities are to be included, giving the computer the ability to understand,
analyze, manipulate, and potentially generate results. CNN and image
processing technology can be used for analyzing X-ray images.

8. Appendix

8.1 Gantt Chart

Fig 8.1 Gantt Chart for Lung Cancer Prediction

9. References & Bibliography

9.1 References:
[1] F. Leena Vinmala, Dr. A. Kumar Kombaiya, “Prediction of
Lung Cancer using Data Mining Techniques”, International
Journal of Engineering Research & Technology (IJERT)

[2] Dakhaz Mustafa Abdullah, Adnan Mohsin Abdulazeez, Amira
Bibo Sallow, “Lung cancer Prediction and Classification based on
Correlation Selection method Using Machine Learning
Techniques”, Qubahan Academic Journal

[3] Timor Kadir, Fergus Gleeson, “Lung cancer prediction using
machine learning and advanced imaging techniques”, Translational
Lung Cancer Research

9.2 Bibliography
• Anuradha A. Puntambekar, Yogesh S. Gunjal, Narendra S. Joshi,
Yogesh B. Patil, “Software Engineering”, Technical Publication

• Guido van Rossum, “Python Reference Manual”, Nirali Prakashan

• Ganesh Mante, Naresh Shende, Mrs. Jyoti Mante, “Software Testing”,
Nirali Prakashan

• Vijay T. Patil, Mrs. Manisha A. Pokharkar, Dr. Kishor S. Wagh,
“Advanced Computer Network”, Nirali Prakashan

• https://www.lucidchart.com/pages/landing
• (PDF) Lung Cancer Prediction Using Deep Learning Framework (researchgate.net)
• https://www.researchgate.net/publication/325977576_Smart_health_band_using_IoT
• https://ieeexplore.ieee.org/document/8628067

Review Article

Lung cancer prediction using machine learning and advanced imaging techniques
Timor Kadir1, Fergus Gleeson2
1Optellum Ltd, Oxford, UK; 2Department of Radiology, Oxford University Hospitals NHS Foundation Trust, Oxford, UK
Contributions: (I) Conception and design: All authors; (II) Administrative support: All authors; (III) Provision of study materials or patients: All authors; (IV)
Collection and assembly of data: All authors; (V) Data analysis and interpretation: All authors; (VI) Manuscript writing: All authors; (VII) Final approval of
manuscript: All authors.
Correspondence to: Timor Kadir. Optellum Ltd, Oxford Center for Innovation, New Road, Oxford, UK. Email: timor.kadir@optellum.com.

Abstract: Machine learning based lung cancer prediction models have been proposed to assist clinicians in
managing incidental or screen detected indeterminate pulmonary nodules. Such systems may be able to reduce
variability in nodule classification, improve decision making and ultimately reduce the number of benign nodules
that are needlessly followed or worked-up. In this article, we provide an overview of the main lung cancer
prediction approaches proposed to date and highlight some of their relative strengths and weaknesses. We discuss
some of the challenges in the development and validation of such techniques and outline the path to clinical
adoption.

Keywords: Pulmonary nodules; lung neoplasms; lung; machine learning; decision making

Submitted Apr 07, 2018. Accepted for publication May 22, 2018. doi: 10.21037/tlcr.2018.05.15
View this article at: http://dx.doi.org/10.21037/tlcr.2018.05.15

Introduction

The demonstration of a 20% reduction in lung cancer mortality in the USA National Lung
Screening Trial (NLST) (1) and the subsequent decision by the U.S. Centers for Medicare and
Medicaid Services to provide Medicare coverage for lung cancer screening has paved the way
for nationwide lung cancer screening in the USA.

This decision also underscored the pivotal role of low-dose computed tomography (LDCT) in
the detection of lung cancer. However, one of the acknowledged downsides of LDCT based
screening is its relatively high false positive rate. For example, the rate of positive
screening tests in the NLST was approximately 27% in the first two rounds of the LDCT arm
and 17% in the third year of screening. A screening CT was considered positive if it
contained a non-calcified nodule of at least 4 mm in its long axis or other suspicious
abnormalities were present. Over the three rounds, over 96% of such positive screens were
false positives and 72% had some form of diagnostic follow-up.

To address this issue, the American College of Radiologists (ACR) Lung Imaging Reporting
and Data System (Lung-RADS™) tool (2) for standardized reporting of CT based lung cancer
screening adopted a threshold for solid nodules of <6 mm for its category 2 where no
additional diagnostic work-up is recommended and the subject is imaged again at annual
screening. However, new nodules of 4 mm and greater are considered category 3 and a
6-month follow-up LDCT is recommended in recognition of their increased probability of
malignancy.

The impact of Lung-RADS was analysed in a retrospective analysis of the NLST (3).
Lung-RADS was shown to reduce the overall screening false positive rate to 12.8% and 5.3%
at baseline and interval imaging respectively at the cost of a reduction of sensitivity
from 93.5% in the NLST to 84.9% using Lung-RADS at baseline and 93.8% in the NLST and
84.9% using Lung-RADS after baseline. However, while Lung-RADS reduces the overall false
positive rate, the false positive rate of positive screens, i.e., Lung-RADS 3 and above,
remains very high at 93% at baseline and 89% after baseline; of 3,591 Lung-RADS 3 and
above screens, 3,343 were false positives

Translational Lung Cancer Research, Vol 7, No 3 June 2018 305

at baseline and of 2,858 Lung-RADS 3 and above screens after baseline 2,543 were false positives. Therefore, while the adoption of Lung-RADS can reduce the total number of benign nodules being worked-up within a screening programme, at a cost of just under 10% loss in sensitivity, there remain a very large number of benign nodules being investigated, and the nodule classification task remains a challenging one.

One approach to address this problem is to adopt computer aided diagnosis (CADx) technology as an aid to radiologists and pulmonary medicine physicians. Given an input CT and possible additional relevant patient meta-data, such techniques aim to provide a quantitative output related to the lung cancer risk. One may consider the goal of such systems to be two-fold. First, to reduce the variability in assessing and reporting the lung cancer risk between interpreting physicians. Indeed, computer assisted approaches have been shown to improve consistency between physicians in a variety of clinical contexts, including nodule detection (4) and mammography screening (5) and one might expect such decision support tools could provide the same benefit in nodule classification. Second, CADx could improve classification performance by supporting the less experienced or non-specialised clinicians in assessing the risk of a particular nodule being malignant.

In this article, we review progress made towards the development and validation of lung cancer prediction models and nodule classification CADx software. While we do not intend this to be a comprehensive review, we do aim to provide an overview of the main approaches taken to date and outline some of the challenges that remain to bring this technology to routine clinical use.

Risk models

There have been a number of lung cancer risk models developed and validated that one may consider to be a form of CADx tool (6-9). Typically based on logistic regression, such tools aim to provide an overall risk of the patient having cancer based on patient meta-data such as age, sex and smoking history and nodule characteristics such as nodule size, morphology and growth, if a previous CT was available. Although such tools currently require manual entry by the user, they do produce an objective lung cancer risk score which may be used in the decision-making process. However, despite their attraction and good performance, their adoption and performance as part of decision making has not been studied. The British Thoracic Society (BTS) guidelines on the management of incidentally detected pulmonary nodules (10) recommend the use of the Brock model (6). Anecdotally, many physicians report using them for patient communication only and feel that such models do not add a great deal to their clinical expertise. More specifically, questions remain as to the utility of such models when the patient population is different to that of the training data. It is clear that for such models to be clinically useful, knowledge of the training data used is critical, and this also will determine the clinical scenarios in which they may be used. There are clearly significant differences in the pre-test probabilities of a nodule being malignant in different patient groups. For instance, patients with a current or prior history of malignancy are at significantly different risk of nodule malignancy than non-smokers with no significant prior history.

From a technical perspective, such models have a number of limitations. Foremost is the reliance on human interpretation of input variables such as nodule size, morphology and even the reliance on the patient’s own estimate of factors such as smoking history. For example, under the Brock model, a 1 mm increase in the reported size of a 5 mm spiculated solid nodule in a 50-year-old female almost doubles its risk, from 0.98% to 1.89%. However, inter-radiologist variability in reporting nodule size is typically greater than this (11). Moreover, inter-reader variability in reporting morphology and nodule type is common even amongst experienced thoracic radiologists (12,13).

Some recent work to address this has been proposed by Ciompi et al. (14) where an automated system for the classification of nodules into solid, non-solid, part-solid, calcified, perifissural and spiculated types was proposed. Overall classification accuracy is reported to be within the inter-radiologist variability at 79.5% but this varies between 86% for solid and calcified nodules down to 43% for spiculated nodule classification. Of course, since the ground-truth classifications were provided by radiologist opinion, the performance at validation cannot be expected to improve on that. As the authors point out, the nodule types are radiologist developed concepts that, while useful for clinical purposes, lack a precise definition. The impact of the system’s output as an input to the Brock model was not reported and ultimately this approach should be judged

306 Kadir and Gleeson. Lung cancer prediction using machine learning

on its ability to improve malignancy prediction.

Radiomics

The term Radiomics refers to the automatic extraction of quantitative features from medical images (15,16) and has been the subject of a great deal of investigation with applications including automated lesion classification, response assessment and therapy planning. Fundamentally, the Radiomic approach aims to turn image voxels into a set of numbers that characterize the biological property of interest such as lesion malignancy, tumour grade or therapy response. Although research into what are termed Radiomics methods has seen an explosion in the last decade, the technical methods that it builds on have a very long history in the fields of computer vision and medical image understanding in the area of texture analysis. Indeed, many of the so-called Radiomic features are based on techniques that were first proposed in the 1970s (17) for the classification of textured images and have been largely superseded in the computer vision literature. Nevertheless, their application to medical image processing research has in some areas yielded some significant insights, in particular in how such quantitative features relate to tumour pheno- and genotypes. The idea that such advanced quantitative techniques may add to the qualitative clinical interpretation of radiologists is gaining momentum and is likely to move into mainstream clinical practice in the coming 5 to 10 years.

For a given application, the Radiomic approach proceeds in two phases: first a training or feature selection phase and then a second testing or application phase. The training phase typically proceeds as follows. First, a large set, typically some hundreds or thousands, of features are defined a priori. Next, the features are extracted from a large corpus of training data where the object of interest, say a tumour, has been delineated such that a computer algorithm can extract the quantitative features automatically. Finally, a step known as feature selection is applied that aims to select a smaller subset, e.g., some tens of such features that efficiently captures the imaging characteristics of the biological phenomena of interest. For example, in the case of nodule classification into benign and malignant we may pick the features, either individually or in combination, that perform the best at this task on the training data.

In the testing phase, the Radiomics are applied to a particular patient’s image, with the process being similar to the training phase but now the selected features are identified by the algorithm, extracted and then used to classify the patient.

Of course, both at training and testing steps, a classification algorithm will need to be defined to convert the Radiomics values into classifications. For small sets of individual features, we may simply use thresholds on the Radiomic features; however, for larger sets of features more sophisticated techniques from the field of machine learning, such as Support Vector Machines (SVMs) and Random Forests are typically used to yield better results. A very good review of Radiomics approaches applied to the classification of pulmonary nodules is provided in Wilson et al. (18).

One criticism of some of the earlier Radiomics work is the lack of independent training and validation data (19). Indeed, it is not unusual to find very high classification rates being reported based on the training data whereas it is well established within the machine learning literature that such results may be subject to “overfitting”: the apparent excellent performance that cannot be replicated on unseen and independent datasets. In fact, one measure of the goodness of a well-trained classifier is the difference in performance between training sets and test sets. This phenomenon has led to a generally over-optimistic view of the performance with area under the curve (AUC) numbers reported in the high 80s and 90s range that cannot be replicated on independent data.

The 2015 SPIE-AAPM-NCI LungX Lung Nodule Classification Challenge (20) was a first attempt at a Grand Challenge style competition and provided a sobering view of the actual real-life performance one might expect to see in clinical practice. Ten groups, including our own, submitted computer methods to classify nodules as benign or malignant. No additional training data was provided but a limited “calibration” dataset of ten cases was provided. Therefore, all groups were required to utilize either publicly available or their own proprietary datasets. Many of the methods used the Radiomic/texture feature extraction technique followed by a classification step. AUCs ranged from 0.5 to 0.68 with only three of the methods outperforming random chance with statistical significance. Despite our classifier achieving the highest AUC and winning the competition, the performance was significantly below what we had seen on other independent datasets. In the next section, we provide some details of the system that have not been published previously along with some insights gained during the competition and subsequent analysis.
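The overfitting concern raised in the Radiomics discussion can be illustrated with a minimal sketch. It is not taken from the article: the data are synthetic noise, the helper names (`auc`, `noise_cases`, `column`) are hypothetical, and the sample sizes are arbitrary assumptions. Selecting the single best of 200 uninformative features on a small training set yields an apparently good training AUC that is not reproduced on independent cases.

```python
import random

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def noise_cases(n_cases, n_features, rng):
    """Synthetic cases whose features carry no information about the label."""
    labels = [i % 2 for i in range(n_cases)]
    features = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
                for _ in range(n_cases)]
    return features, labels

def column(rows, j):
    """Values of feature j across all cases."""
    return [row[j] for row in rows]

rng = random.Random(0)           # fixed seed so the run is repeatable
N_FEATURES = 200                 # assumed palette size, far smaller than in practice
train_x, train_y = noise_cases(40, N_FEATURES, rng)
test_x, test_y = noise_cases(2000, N_FEATURES, rng)

# "Feature selection": keep the feature with the best AUC on the training data.
best = max(range(N_FEATURES), key=lambda j: auc(column(train_x, j), train_y))
train_auc = auc(column(train_x, best), train_y)
test_auc = auc(column(test_x, best), test_y)
print(f"feature {best}: training AUC {train_auc:.2f}, independent test AUC {test_auc:.2f}")
```

Because every feature is pure noise, the gap between the two printed AUCs comes entirely from selecting on the training data, which is why performance should be reported on cases that played no part in selection or training.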


Figure 1 A block diagram of the LungX winning system.
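The four test-time steps in Figure 1 (segmentation, feature extraction, risk score regression, thresholding) can be sketched as a toy pipeline. This is an illustrative sketch under stated assumptions, not the winning system: the image is a small 2D grid rather than a CT volume, only the region inside the segmentation is used, the three features are simple statistics rather than texture features, and a logistic function with made-up weights stands in for the trained SVM regression. All function names and numbers are hypothetical.

```python
import math

def segment(image, centre, radius, threshold):
    """Step I: fixed threshold inside a circular ROI (2D stand-in for a spherical ROI)."""
    cy, cx = centre
    return {(y, x)
            for y, row in enumerate(image)
            for x, value in enumerate(row)
            if (y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2 and value >= threshold}

def features(image, mask):
    """Step II: toy features (mean, standard deviation, area) over the segmented pixels."""
    if not mask:
        raise ValueError("empty segmentation")
    values = [image[y][x] for y, x in mask]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [mean, std, float(len(values))]

def risk_score(feats, weights, bias):
    """Step III: logistic stand-in for a trained regressor; returns a value in (0, 1)."""
    z = bias + sum(w * f for w, f in zip(weights, feats))
    return 1.0 / (1.0 + math.exp(-z))

def classify(score, cutoff=0.5):
    """Step IV: threshold the risk score."""
    return "malignant" if score >= cutoff else "benign"

# Toy image: lung-like background (-800) with a 3x3 soft-tissue block (20) in the middle.
image = [[-800.0] * 7 for _ in range(7)]
for y in range(2, 5):
    for x in range(2, 5):
        image[y][x] = 20.0

mask = segment(image, centre=(3, 3), radius=3, threshold=-400.0)
feats = features(image, mask)
score = risk_score(feats, weights=[0.01, 0.05, 0.1], bias=-1.0)  # hypothetical weights
label = classify(score)
print(len(mask), feats, round(score, 3), label)
```

Keeping each stage as a separate function mirrors the staged design of the pipeline, where every step is built and tuned on its own.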

LungX winning entry

Figure 1 provides an overview of the main steps in the algorithm used in the winning entry. The software has four main steps at test time, i.e., when used to classify a nodule: (I) nodule segmentation, (II) texture feature extraction, (III) risk score regression and (IV) risk score thresholding.

The nodule segmentation is required because the subsequent step of feature extraction is applied to a region of interest (ROI) around the nodule. Each nodule was segmented in a semi-automated thresholding approach using a commercial software package (Mirada RTx, Mirada Medical Ltd.). The user first defined a spherical ROI around each nodule and then applied a fixed threshold to the ROI. Next, the user could adjust the threshold to improve the segmentation and finally, manual editing tools could be used to edit the segmentation to remove any voxels that did not correspond to the nodule of interest that the segmentation had included. Typically, adjacent vessels would need to be excluded in this manner. In later work, we replaced the semi-automated method with a more automated technique that did not require any user interaction other than to identify the centre and diameter of the nodule (21).

We extracted 15 texture features from two regions, the first inside the nodule segmentation and the second in a surrounding region defined automatically. Based on earlier work using our internal databases, we found that better performance could be achieved if the region inside the nodule was treated separately to the immediate surrounding parenchyma. The insight here is that the texture of the nodule carries separate information to the region in the nearby parenchyma and the very different ranges of Hounsfield units in each region would make it difficult for one set of texture features to capture the patterns. We believe this was a significant contribution to the performance of the system.

The 15 features were selected from a palette of over 1,300 classical texture features including Haralick (17), Gabor (22), along with simple measures such as mean, standard deviation and volume. We utilized a fully automated feature selection strategy that aimed to select a small subset of features that optimised classification performance over an in-house training dataset. Since it is computationally infeasible to test all combinations of the full palette of features, we utilized a sequential “greedy” algorithm that, starting with the optimal pair of features found by exhaustive search over all pairs of features, selected features one-by-one so as to maximise the performance over the training dataset at each step.

Finally, an SVM regression algorithm with a cubic kernel was trained using the libSVM library. The output of this step is a number between 0 and 1 that reflects the likelihood that a particular nodule is malignant.

The training dataset we utilized for the competition was mostly derived from the Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI dataset) (23). This publicly available dataset comprises a wide variety of nodules and comes with multiple segmentations and likelihood of malignancy score estimated by expert clinicians. Nodules were included in our training set if at least three sets of clinician-drawn contours and corresponding likelihood-of-malignancy scores were included in the XML metadata. The malignancy scores are integers from one to five inclusive and are recorded per clinician. Only nodules whose malignancy scores were all below 3 (the benign set) or all above 3 (the malignant set)


[Figure 2 plots. Panel A, ROC for LIDC-IDRI: 20 splits, mean AUC 0.967, std AUC 0.024; leave-one-out AUC 0.983. Panel B, ROC for Oxford dataset: 20 splits, mean AUC 0.854, std AUC 0.019; leave-one-out AUC 0.864; trained on LIDC-IDRI AUC 0.682. Axes in both panels: false positive rate (horizontal) against true positive rate (vertical), each from 0 to 1.]

Figure 2 ROC curves for the LungX system trained and tested on the LIDC-IDRI dataset (A) and the Oxford data (B). LIDC-IDRI, Lung Image Database Consortium and Image Database Resource Initiative; ROC, receiver operating characteristic curve; AUC, area under the curve.
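As a hedged illustration of how per-split ROC curves and a mean and standard deviation of AUC such as those quoted for Figure 2 can be produced, the sketch below builds an ROC curve by sweeping a threshold, integrates it with the trapezoid rule, and repeats this over a stratified 20-fold split. The data are synthetic and the "classifier" learns only the direction of a single feature; the function names and the data generator are assumptions for illustration, not the authors' code.

```python
import random
import statistics

def roc_points(scores, labels):
    """ROC curve as (false positive rate, true positive rate) pairs, sweeping the threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i, (score, label) in enumerate(ranked):
        tp += label
        fp += 1 - label
        # Emit a point only when the next score differs, so tied scores move diagonally.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / n_neg, tp / n_pos))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def cross_validated_aucs(feature, labels, k, rng):
    """Stratified k-fold: fit a one-parameter classifier, score each held-out fold."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    aucs = []
    for f in range(k):
        test_idx = pos[f::k] + neg[f::k]
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # "Training": learn only whether larger values indicate the positive class.
        mean_pos = statistics.mean(feature[i] for i in train_idx if labels[i] == 1)
        mean_neg = statistics.mean(feature[i] for i in train_idx if labels[i] == 0)
        sign = 1.0 if mean_pos >= mean_neg else -1.0
        scores = [sign * feature[i] for i in test_idx]
        aucs.append(trapezoid_auc(roc_points(scores, [labels[i] for i in test_idx])))
    return aucs

rng = random.Random(1)
labels = [i % 2 for i in range(400)]                  # synthetic labels
feature = [rng.gauss(1.5 * y, 1.0) for y in labels]   # synthetic, partly informative feature
aucs = cross_validated_aucs(feature, labels, k=20, rng=rng)
mean_auc = statistics.mean(aucs)
std_auc = statistics.stdev(aucs)
print(f"{len(aucs)} splits, mean AUC {mean_auc:.3f}, std AUC {std_auc:.3f}")
```

The spread across splits is worth reporting alongside the mean, since each held-out fold here contains only 20 cases and individual fold AUCs are noisy.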

were included, yielding a labeled subset of 222 nodules overall for the LIDC-IDRI training set.

Figure 2 shows the Receiver Operating Characteristic (ROC) curves for the system as trained and tested on the LIDC-IDRI dataset using 20-way cross-validation. With such high AUCs, we were suspicious that the dataset was too easy to classify and so we trained and validated on a second dataset, PLAN, to examine the system’s performance further. The PLAN nodule database was built up from nodules collected from the Oxford University Hospitals NHS trust. This set consists of 709 nodules, 377 malignant and 328 benign, diagnosed either using histology or by 2-year stable follow-up. Using 20-way cross-validation, the average AUC was 0.854; the ROC curves are shown on the right of Figure 2. In the end, the system we submitted used both LIDC-IDRI and PLAN for feature selection but the SVM was trained only on the LIDC-IDRI dataset.

Convolutional neural networks and deep learning

Convolutional Neural Networks (CNN) trained using deep learning techniques have come to dominate pattern detection, recognition, segmentation and classification applications in both medical and non-medical fields. Indeed, where sufficient training data is available, CNNs have largely superseded the previous generation of Radiomic/texture analysis methods described above. In our own work, once we had collected and curated sufficiently large training sets by the end of 2016, our CNN based techniques started to outperform the previous state-of-the-art texture and SVM based method. While a detailed exposition of such techniques is beyond the scope of this article, it is worth understanding the main differences to previous methods and their advantages.

Feature learning vs. feature selection

Unlike Radiomic/texture analysis approaches, CNN techniques build features from scratch rather than selecting from a palette of engineered or pre-selected set that rely on the contextual knowledge of the algorithm developer.

Hierarchical features

The first few layers of a CNN typically comprise several layers of features allowing the network to learn the relationships between features in a much more sophisticated way than can be achieved with a single feature extraction stage. Consider this illustrative example: a texture feature, such as local entropy of the joint histogram, can be used to detect spiculations extending into the parenchyma. But a CNN can learn this and also learn that spiculations encompass the whole perimeter of the nodule and that this is a sign of a malignant nodule.
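The spiculation example can be made concrete with a toy two-layer composition. This is only a sketch of the idea of stacked features: the kernel and thresholds are set by hand rather than learned, the input is a hypothetical one-dimensional radial profile sampled around a nodule's perimeter, and none of the names come from the article.

```python
def circular_conv(signal, kernel):
    """1D convolution that wraps around, since a perimeter has no start or end."""
    n, half = len(signal), len(kernel) // 2
    return [sum(kernel[j] * signal[(i + j - half) % n] for j in range(len(kernel)))
            for i in range(n)]

def relu(values):
    """Keep positive responses only."""
    return [max(0.0, v) for v in values]

def layer1_spikes(radii):
    """First layer: respond to sharp local protrusions in the radial profile."""
    return relu(circular_conv(radii, [-1.0, 2.0, -1.0]))

def layer2_coverage(spike_map, n_sectors=4, min_response=1.0):
    """Second layer: fraction of perimeter sectors containing at least one spike response."""
    size = len(spike_map) // n_sectors
    hits = sum(1 for s in range(n_sectors)
               if max(spike_map[s * size:(s + 1) * size]) >= min_response)
    return hits / n_sectors

smooth = [5.0] * 16                       # round nodule: constant radius
one_spike = list(smooth)
one_spike[2] = 8.0                        # a single protrusion
all_around = list(smooth)
for i in (1, 5, 9, 13):
    all_around[i] = 8.0                   # protrusions in every sector

coverage = {name: layer2_coverage(layer1_spikes(profile))
            for name, profile in [("smooth", smooth),
                                  ("one_spike", one_spike),
                                  ("all_around", all_around)]}
print(coverage)
```

The first layer alone cannot tell one protrusion from many; the second layer, operating on the first layer's output, captures whether the pattern covers the whole perimeter. In a CNN both layers would be learned from data rather than written by hand.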


End-to-end learning

CNNs are typically trained “end-to-end” meaning that the entire network is trained to optimize the problem of interest, i.e., all the parameters of the network are adjusted until the peak classification rate is achieved. In contrast, each stage of the LungX texture-based approach that we developed had to be built and optimised individually and there was no guarantee that the entire pipeline would be optimal.

Segmentation-free

The CNN approach can operate without the nodule segmentation step because segmentation is handled in an implicit way within the algorithm. In subsequent analysis of our LungX algorithm, we found significant sensitivity of the prediction score to the segmentation step.

The Kaggle Data Science Bowl 2017: lung cancer detection

The 2017 lung cancer detection Data Science Bowl (DSB) competition hosted by Kaggle was a much larger two-stage competition than the earlier LungX competition with a total of 1,972 teams taking part. In stage 1, a large training dataset of 1,397 patients was provided comprising 362 with lung cancer and 1,035 without, along with an initial validation set of 198 patients. This validation set was used to produce the public stage 1 leader-board, using which the competitors could judge their performance. In stage 2, a further unseen dataset of 506 patients, on which the final competition results were judged, was made available for 7 days. This two-stage approach was used to avoid competitors inferring the test set labels using many entries. In contrast to the LungX competition, here the competitors needed to produce a completely automated pipeline, taking in a CT image and outputting a likelihood of cancer.

The results were judged using the log-loss function, popular on Kaggle competitions. Unlike AUC, the log-loss function penalizes confident but incorrect outputs more heavily than less confident ones. All top three entries utilized CNNs trained using Deep Learning and scored within a few decimal places of each other, scoring 0.39755, 0.40117 and 0.40127, where a log-loss of zero corresponds to a perfect score. AUC-ROC results of 0.85 and 0.87 were subsequently reported for the top-two teams (24,25) respectively.

The winning entry (24) utilized a 3D Convolutional Neural Network trained on a combination of DSB training data and the publicly available dataset used in the LUNA16 nodule detection competition (26) which itself was derived from the LIDC-IDRI dataset (23). Since no nodules are identified in the validation and test datasets, a reliable automated nodule detection step is critical for correct classification. In fact, based on the subsequent write-ups from the winning teams (24,25), much of the effort was put into this step rather than the subsequent classification step.

Is size everything?

One interesting observation regarding the distribution of nodule sizes was made by the winning team. The LUNA16 dataset contained many more small nodules (mean = 8 mm), whereas the DSB datasets comprised many larger lesions (mean = 14 mm), therefore the team had to adjust the training algorithm to compensate for this. Moreover, the distribution of nodule sizes between cancer and benign patients was reported to be very different; the malignant nodules were large and the benign were small. Hence predicting the diagnosis based on size alone would be expected to produce good results. The issue of size bias in training and test sets is a critical issue and one which we have studied in some depth.

It is well known from the risk model literature that the strongest predictor of a nodule’s malignancy, imaged at one point in time, is its size, whether expressed as its long axis, an average of the long and short axes or as a volume. The reason is quite simple: benign nodules are typically caused by processes that are self-limiting in size, e.g., inflamed lymph nodes, whereas malignant tumours have no such limits, and are constrained by other factors such as the duration of growth, the cell replication time, the ability of the tumour to invade adjacent structures, and its vascular and oxygen supply. Therefore, one might expect that nodule size, either implicitly or explicitly, will be included as part of any nodule CADx system.

However, additional differences in the size distribution of benign and malignant nodules may also occur due to selection bias in data collection. For example, a naive approach to collecting examples of malignant lesions might be to select all retrospective CTs for patients diagnosed with lung cancer and all retrospective CTs for patients with benign nodules. However, outside of a screening programme, most patients diagnosed with lung cancer present with symptoms prior to diagnostic imaging and


[Figure 3 diagram: Dataset A (640 size-matched) → SVM → Model A; Dataset B (640 NLST) → SVM → Model B.]

Figure 3 Investigating the role of nodule size within a machine learning model of nodule malignancy. Model A was trained on size-matched
data and model B was trained on unmatched data. SVM, support vector machine.
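The size-matching idea behind Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the published procedure: the diameters are synthetic draws from made-up distributions, matching is greedy and without replacement (the article does not say how ties or reuse were handled), and only the 0.8 mm tolerance and the 4 to 20 mm range are taken from the text. The sketch compares how well size alone separates the classes with and without matching.

```python
import random

def auc(pos, neg):
    """Probability that a malignant value exceeds a benign one (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def size_match(malignant, benign, tolerance_mm=0.8):
    """For each malignant diameter pick the closest unused benign one; reject if too far."""
    available = list(benign)
    pairs = []
    for m in malignant:
        if not available:
            break
        j = min(range(len(available)), key=lambda k: abs(available[k] - m))
        if abs(available[j] - m) <= tolerance_mm:
            pairs.append((m, available.pop(j)))
    return pairs

rng = random.Random(2)
# Synthetic diameters in mm: malignant spread over 4-20, benign skewed towards small sizes.
malignant = [rng.uniform(4.0, 20.0) for _ in range(300)]
benign_pool = []
while len(benign_pool) < 3000:
    d = 4.0 + rng.expovariate(1.0 / 4.0)
    if d <= 20.0:
        benign_pool.append(d)

pairs = size_match(malignant, benign_pool)
matched_m = [m for m, _ in pairs]
matched_b = [b for _, b in pairs]
unmatched_b = rng.sample(benign_pool, len(pairs))   # benign drawn from the overall distribution

auc_matched = auc(matched_m, matched_b)
auc_unmatched = auc(matched_m, unmatched_b)
print(f"{len(pairs)} pairs; size-only AUC matched {auc_matched:.2f}, unmatched {auc_unmatched:.2f}")
```

On matched data, size alone should perform close to chance, so any performance above that must come from other features; on unmatched data, size alone already separates the classes to a noticeable degree.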

hence are typically at a late stage and their nodules are consequently larger than benign nodules. A machine learning algorithm trained on such data would perform very poorly when applied to, for example, a screening application where the distribution of malignant nodule sizes is more similar to benign ones.

We explored this issue further by comparing the performance of a CADx system trained on size-matched and size-unmatched data (27). Figure 3 illustrates the experiment. Two datasets were created from the US NLST. The first (A), comprising 640 solid nodules, was built to remove size as a discriminatory factor between benign and malignant; all malignant solid nodules between 4 and 20 mm diameter were selected, and for each, a benign solid nodule was selected that most closely matched it in diameter. Any malignant nodule for which an equivalently sized benign could not be found within 0.8 mm was rejected. Sizes were measured using automated volumetric segmentation. The second dataset (B), also comprising 640 subjects, included all malignant nodules in A but benign nodules were randomly selected following the empirical size distribution of the whole NLST dataset. Therefore, nodule size cannot be a discriminative factor in A but would be in B. Two nodule classifiers were built using texture features combined with an SVM classifier; this was utilized here because the small datasets prevented the use of a CNN model.

The average AUC for the classifier trained on dataset A was 0.70 whereas using size alone on the same dataset gave an AUC of 0.50 as would be expected. The AUC was 0.91 for the classifier trained on dataset B. This indicates that the classifier can learn morphological features that can discriminate between benign and malignant nodules and, moreover, that such features add approximately 0.2 AUC points to using size alone.

Coincidentally, the performance on size matched data was very close to that we achieved on the LungX competition data (AUC: 0.70 and 0.68) which was subsequently revealed to have also used size-matched data in the test set (20).

Conclusions

We have provided an overview of the main approaches used for nodule classification and lung cancer prediction from CT imaging data. In our experience, given sufficient training data, the current state-of-the-art is achieved using CNNs trained with Deep Learning achieving a classification performance in the region of low 90s AUC points. When evaluating system performance, it is important to be aware of the limitations or otherwise of the training and validation data sets used, i.e., were the patients smokers or non-smokers, or were patients with a current or prior history of malignancy included.

Given an apparent acceptable level of performance, the next stage is to test such CADx systems in a clinical setting but before this can be done, we must first define the way in which the output of the CADx should be utilized in clinical decision making. Who should use such a system and how should it be integrated into their decisions? Should the algorithm produce an absolute risk of malignancy and how should this be expressed; should it be incorporated into clinical opinion and how much weight should clinicians or patients lend to it? Should the algorithms be incorporated into or designed to fit current guidelines such as Lung-RADS or the BTS guidelines? If nodules are followed over time, should the algorithm incorporate changes in nodule


volume or should this be assessed separately? Is success defined by a reduction in the numbers of false positive scans defined as those needing further follow up or intervention, or by detecting all lung cancers and earlier than determined by following current guidelines? Who should be compared to the algorithm when determining its value? Should the comparison be experts or general radiologists, as it may be difficult to be significantly better than an expert but may be of substantial help to a generalist, and most scans are not interpreted by experts? Relatively little work has been done to address such questions.

Acknowledgments

The authors would like to thank the numerous research scientists and clinical staff involved in the project for their contributions: Sarim Ather, Djamal Boukerrouri, Amalia Cifor, Monica Enescu, Mark Gooding, William Hickes, Samia Hussain, Aymeric Larrue, Jean Lee, Heiko Peschl, Lyndsey Pickup, Shameema Stalin, Ambika Talwar, Eugene Teoh, Julien Willaime and Phil Whybra.
Funding: Part of this work was funded by Innovate UK project TSB 101676.

Footnote

Conflicts of Interest: T Kadir is CTO, Director and shareholder of Optellum Ltd. F Gleeson is a shareholder and advisor to Optellum Ltd.

References

1. National Lung Screening Trial Research Team, Aberle DR, Adams AM, et al. Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening. N Engl J Med 2011;365:395-409.
2. Lung CT Screening Reporting & Data System. Available online: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/Lung-Rads
3. Pinsky PF, Gierada DS, Black W, et al. Performance of Lung-RADS in the National Lung Screening Trial: A Retrospective Assessment. Ann Intern Med 2015;162:485-91.
4. Awai K, Murao K, Ozawa A, et al. Pulmonary Nodules at Chest CT: Effect of Computer-aided Diagnosis on Radiologists Detection Performance. Radiology 2004;230:347-52.
5. Freer TW, Ulissey MJ. Screening Mammography with Computer-aided Detection: Prospective Study of 12,860 Patients in a Community Breast Center. Radiology 2001;220:781-6.
6. McWilliams A, Tammemagi MC, Mayo JR, et al. Probability of Cancer in Pulmonary Nodules Detected on First Screening CT. N Engl J Med 2013;369:910-9.
7. Gould MK, Ananth L, Barnett PG, et al. A Clinical Model To Estimate the Pretest Probability of Lung Cancer in Patients With Solitary Pulmonary Nodules. Chest 2007;131:383-8.
8. Swensen SJ, Silverstein MD, Ilstrup DM, et al. The probability of malignancy in solitary pulmonary nodules. Application to small radiologically indeterminate nodules. Arch Intern Med 1997;157:849-55.
9. Deppen SA, Blume JD, Aldrich MC, et al. Predicting lung cancer prior to surgical resection in patients with lung nodules. J Thorac Oncol 2014;9:1477-84.
10. Callister ME, Baldwin DR, Akram AR, et al. British Thoracic Society guidelines for the investigation and management of pulmonary nodules. Thorax 2015;70 Suppl 2:ii1-54.
11. Revel MP, Bissery A, Bienvenu M, et al. Are two-dimensional CT measurements of small noncalcified pulmonary nodules reliable? Radiology 2004;231:453-8.
12. Bartlett EC, Walsh SL, Hardavella G, et al. Interobserver Variation in Characterisation of Incidentally-Detected Pulmonary Nodules: An International, Multicenter Study. Available online: http://4wcti.org/2017/SS5-3.cgi
13. Zinovev D, Feigenbaum J, Furst J, et al. Probabilistic lung nodule classification with belief decision trees. Conf Proc IEEE Eng Med Biol Soc 2011;2011:4493-8.
14. Ciompi F, Chung K, van Riel SJ, et al. Towards automatic pulmonary nodule management in lung cancer screening with deep learning. Sci Rep 2017;7:46479.
15. Aerts HJ, Velazquez ER, Leijenaar RT, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 2014;5:4006.
16. Lambin P, Rios-Velazquez E, Leijenaar R, et al. Extracting more information from medical images using advanced feature analysis. Eur J Cancer 2012;48:441-6.
17. Haralick RM, Shanmugam K, Dinstein I. Textural Features for Image Classification. IEEE Trans Syst Man Cybern Syst 1973;3:610-21.
18. Wilson R, Devaraj A. Radiomics of pulmonary nodules and lung cancer. Transl Lung Cancer Res 2017;6:86-91.
19. Chalkidou A, O’Doherty MJ, Marsden PK. False Discovery Rates in PET and CT Studies with Texture Features: A Systematic Review. PLoS One 2015;10:e0124165.
20. Armato SG 3rd, Drukker K, Li F, et al. LUNGx Challenge for computerized lung nodule classification. J Med Imaging (Bellingham) 2016;3:044506.
21. Willaime JM, Pickup L, Boukerroui D, et al. Impact of segmentation techniques on the performance of a CT texture-based lung nodule classification system. Available online: https://posterng.netkey.at/esr/viewing/index.php?module=viewing_poster&task=&pi=135229
22. Lee TS. Image Representation Using 2D Gabor Wavelets. IEEE Trans Pattern Anal Mach Intell 1996;18:1-13.
23. Armato SG 3rd, McLennan G, Bidaut L, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys 2011;38:915-31.
24. Li Y, Chen KZ, Sui XZ, et al. Establishment of a mathematical prediction model to evaluate the probability of malignancy or benign in patients with solitary pulmonary nodules. Beijing Da Xue Xue Bao Yi Xue Ban 2011;43:450-4.
25. Hammack D. Forecasting Lung Cancer Diagnoses with Deep Learning. Available online: https://raw.githubusercontent.com/dhammack/DSB2017/master/dsb_2017_daniel_hammack.pdf
26. Setio AA, Traverso A, de Bel T, et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med Image Anal 2017;42:1-13.
27. Pickup L, Declerck J, Munden R, et al. MA 14.13 Nodule Size Isn't Everything: Imaging Features Other Than Size Contribute to AI Based Risk Stratification of Solid Nodules. J Thorac Oncol 2017;12:S1860-1.

Cite this article as: Kadir T, Gleeson F. Lung cancer prediction using machine learning and advanced imaging techniques. Transl Lung Cancer Res 2018;7(3):304-312. doi: 10.21037/tlcr.2018.05.15


Lung cancer Prediction and Classification based on Correlation Selection method Using Machine Learning Techniques

Article · May 2021


DOI: 10.48161/qaj.v1n2a58

Lung cancer Prediction and Classification based on
Correlation Selection method Using Machine Learning
Techniques
1st Dakhaz Mustafa Abdullah
Technical College of Informatics, Akre
Duhok Polytechnic University
Duhok, Iraq
dakhaz.abdullah@dpu.edu.krd

2nd Adnan Mohsin Abdulazeez
Presidency of Duhok Polytechnic University
Duhok Polytechnic University
Duhok, Iraq
adnan.mohsin@dpu.edu.krd

3rd Amira Bibo Sallow
College of Engineering
Nawroz University
Duhok, Iraq
amira.bibo@nawroz.edu.krd

https://doi.org/10.48161/qaj.v1n2a58

Abstract: Lung cancer is one of the leading causes of mortality in every country, affecting both men and women. Lung cancer has a poor prognosis, resulting in a high death rate. The computing sector is becoming fully automated, and the medical industry is also automating itself with the aid of image recognition and data analytics. This paper examines the accuracy of three classifiers, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Convolutional Neural Network (CNN), for classifying lung cancer at an early stage so that many lives can be saved. The datasets used in this study are taken from the UCI repository and describe patients affected by lung cancer. The principal aim of this paper is to analyze the accuracy of the classification algorithms using the WEKA tool. The experimental results show that SVM gives the best result with 95.56%, followed by CNN with 92.11% and KNN with 88.40%.

Keywords: Lung Cancer, Machine Learning, SVM, KNN, CNN.

I. INTRODUCTION
Cancer can arise in several organs, sometimes simultaneously, and different types of cancer occur in different organs of the body. The illness may go unnoticed for long periods of time. According to WHO reports, cancer may be prevented if it is detected early enough. The patient's life span can be extended if he or she receives an early diagnosis [1][2][3][4]. Lung cancer has a poor prognosis that differs greatly depending on tumor staging at the time of diagnosis. In clinical practice, lung cancer is divided into two types: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC) [5][6]. It is a malignant tumor characterized by unregulated cell growth. Lung cancer develops mostly as a result of long-term tobacco use [7]. According to research, a healthy individual may be affected by nineteen distinct forms of cancer. Lung cancer has the highest death rate among all of these cancers. This disease is expected to kill over 1.7 million people per year [8]. Research in the area of machine learning (ML) has already grown a great deal, which helps to reduce human labor. ML combines statistics and computing in the area of artificial intelligence to create algorithms that become more efficient when exposed to relevant data [9][10]. Many systems lack adequate detection accuracy, and some systems must still be developed further in order to approach the highest accuracy of 100%. Pulmonary cancer identification and classification have been based on machine learning and image processing techniques [11]. However, some characteristics of lung cancer patients, such as their smoking rate, may aid in early detection of the disease [12][13][14]. Researchers started to use machine learning for medical diagnosis after the advent of artificial intelligence, for example using a machine learning approach to investigate the classification of diseases in traditional Chinese medicine (TCM) clinical data. Machine learning has also provided valuable guidelines on the diagnosis of brain disorders, in terms of network architecture, feature learning and classification prediction, through the implementation of brain networks based on machine learning [15]. It will be a key step towards improved early detection [16].

This paper provides an effective method to predict lung cancer at an early stage with a high accuracy ratio. The dataset used is taken from the UCI machine learning repository. Three classifiers, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Convolutional Neural Network (CNN), are then applied, and their accuracy is examined using the WEKA tool. The present study aims to develop machine learning models that detect lung cancer with better accuracy.

This paper is organized as follows. Section 2 introduces lung cancer. Section 3 describes the materials and methods used in this paper. Related work is described in Section 4. Section 5 presents the theory, an introduction to machine learning and its types, and the confusion matrix. Section 6 presents the performance evaluation and results. Finally, Section 7 concludes the paper.

II. LUNG CANCER
Carcinogenesis is the unchecked proliferation of one or more cell types. Good tissues do not support the growth of

This is an open access article distributed under the Creative Commons Attribution License

normal cells, and when they do, they divide quickly and become tumors. Primary lung cancer originates in the lungs, while secondary lung cancer starts elsewhere in the body and then spreads to the lungs. It is one of the most aggressive types of cancer and a life-threatening danger to the human body [17]. If this unchecked development can be identified correctly at an early point, it can help to reduce the likelihood of unnecessary surgery and improve the chance of recovery. Chronic Obstructive Pulmonary Disease (COPD) attacks areas of the lungs and causes diseases such as measles, influenza, pneumonia, and other respiratory issues such as asthma. Small Cell Lung Cancer (SCLC), or oat cell cancer, and Non-Small Cell Lung Cancer (NSCLC) are the two main forms of lung cancer; they develop and spread in different ways and are treated accordingly. Within the non-small cell lung cancer category, there are three subtypes (adenocarcinomas, squamous cell carcinomas, large cell carcinomas). Fig. (1) shows the two types of lung cancer. Mixed small cell/large cell cancer is a disease that occurs when a patient shows symptoms of both types of cancer. Adenocarcinoma (NSCLC) is more common and progresses more slowly than small cell lung cancer. Small cell lung cancer is linked to smoking and progresses more rapidly, becoming a large tumor that spreads across the body [18][19].

Fig. 1. Lung Cancer Types

III. MACHINE LEARNING
Machine learning is a subfield of Artificial Intelligence [20]. Machine learning is also used for complex data classification and decision making [21][22]. Machine learning gives systems the opportunity to learn automatically and improve over time without being explicitly programmed. The implementation of algorithms aids the computer in learning and making the required decisions [11][23]. Machine learning strategies and activities are broadly divided into three categories:

a. Supervised learning
Machine learning, in its simplest form, employs programmed algorithms that learn and refine their functions by processing input data and making predictions within a reasonable range. These algorithms aim to predict more precisely as they are fed fresh data [24][25]. There are several ways of grouping machine learning algorithms. Two categories of problems, classification problems and regression problems, are well suited to supervised learning algorithms. In classification, the output variable usually takes on a limited number of discrete values [26][27].

b. Un-Supervised Learning
Unsupervised learning is a form of learning that occurs without the presence of a supervisor [28]. The machine is given sample inputs but no target outputs during learning. Since there is no target value, clustering is used to ensure that the algorithm distinguishes between the data correctly. It is the problem of finding unknown structure in unlabeled data [29][30]. Because no labeled examples are given to the learner, there is no error or reward signal with which to evaluate a candidate solution. Unlike supervised learning and reinforcement learning, unsupervised learning has no teacher, and produces results that are unrelated to prior experience. It is closely connected to density estimation in statistics [21][31].

c. Reinforcement Learning
In reinforcement learning, learning comes from interacting with the environment. A reinforcement learning agent learns from the consequences of its actions rather than from explicitly articulated instructions, and chooses its actions based on previous behavior and on new choices. Since specific input/output data sets are not provided, this differs from traditional supervised learning. Instead, the focus is on performance, which entails striking a balance between exploration (of uncharted territory) and exploitation (of existing knowledge) [32][33].

IV. DEEP LEARNING
Deep learning is a type of machine learning technique that uses representation learning to identify important features for classification problems [6]. The primary characteristic of deep learning is how it handles features, which it can learn directly from data. To learn complex features, a deep learning model combines the simpler features it has already learned from the data. Deep learning is accomplished using multiple-layer artificial neural networks, such as the Deep Neural Network (DNN), Convolutional Neural Network (CNN), and the Recurrent Neural Network (RNN) [13][23].

V. RELATED WORKS
Roy et al. [34] use a combination of biomedical image processing techniques and knowledge discovery in data to improve accuracy and assess precise significance for early detection of lung carcinoma. Lung images acquired from CT (Computed Tomography) scans are pre-processed, and Region of Interest (ROI) segmentation is performed. The Random Forest procedure is used to distinguish the distinct features. Using an SVM classifier, the SURF (Speeded Up Robust Features) algorithm was used to extract features like entropy, correlation, power, and variance from saliency-enhanced images. The image's classification determines whether it is benign or malignant (carcinomic). CT scan images were used as the dataset. The SVM classification and random forest algorithm were used to carry out the whole operation. The best outcome is achieved using SVM classification. This technique achieves 94.5% accuracy overall, 74.2% sensitivity, 66.3% recall, and 77.6% specificity.

For lung cancer diagnosis, Faisal et al. [12] evaluate machine learning classifiers and ensembles; classifiers such as Multilayer Perceptron (MLP), Naïve Bayes, Decision

Tree, Neural Network, Gradient Boosted Tree, and SVM are evaluated. The dataset was downloaded from the UCI repository and is also used to analyze random forest and majority voting-based ensembles for predicting lung cancer. According to the performance assessments, Gradient Boosted Tree outperformed all other individual and ensemble classifiers, achieving 90% precision.

Baskar et al. [35] propose machine learning methods that use Delta Radiomics to extract the characteristics of cancer nodules. Lung cancer nodule malignancy is predicted using the Support Vector Machine (SVM). The SVM can examine compact features in a lung cancer nodule image, and image classification is useful in distinguishing between multiple nodules. As a result, SVM is recommended as the best tool for diagnosing and detecting lung cancer, with a 90.9% accuracy rate.

Boban et al. [36] apply ML algorithms, including the Multilayer Perceptron (MLP), KNN and SVM classifiers, to 400 lung disease images (i.e., CT scan images). The output is segmented after feature extraction, and the accuracy of the classifiers is compared. A CT scan image passed to a classifier contains irrelevant content, so the Gray Level Co-occurrence Matrix (GLCM) is used to extract the most important features. The reported accuracy is 98% for MLP, 70.45% for SVM, and 99.2% for KNN.

Using deep learning, Sreekumar et al. [37] proposed a method for detecting malignant pulmonary nodules from CT scans. A preprocessing pipeline was used to mask out the lung areas from the scans. A 3D CNN model based on the C3D network architecture was used to extract the features. For false-positive reduction, the researchers used the Lung Image Database Consortium (LIDC-IDRI) as well as some material from the LUNA16 grand challenge. The end result is a model that predicts the coordinates of malignant pulmonary nodules and demarcates the associated areas in CT scans. For identifying malignant lung nodules and estimating their malignancy scores, the final model had a sensitivity of 86%.

Banerjee et al. [38] suggested a framework for tumor classification, with ANN, Random Forest, and SVM as machine learning algorithms. Artificial neural networks are more accurate for both region-based and texture-based features. When the precision is compared to the proposed model, it can be shown that accuracy has improved while recall has decreased. MATLAB R2017a was used for digital image analysis, and a Jupyter notebook was used for machine learning classification. The accuracy for region-based features was 79% for Random Forest, 86% for SVM, and 92% for ANN, while the accuracy for texture-based features was 70% for Random Forest, 80% for SVM, and 96% for ANN.

Maleki et al. [39] developed a k-Nearest-Neighbors technique in which a genetic algorithm was used to select features efficiently, to reduce the dimensions of the dataset and to improve the speed of the classifier. An experimental approach is used to determine the best value of k to increase the accuracy of the proposed algorithm. Applying the proposed solution to the lung cancer database shows 100% accuracy.

Reddy et al. [40] propose a model that is successful in detecting the stages of lung cancer using machine learning algorithms. The model combines K-NN, Decision Tree, and Neural Network structures with the bagging ensemble approach to improve overall prediction accuracy. Compared with the individual algorithms, the proposed model's predictions are more accurate. The versions with and without bagging are compared to draw conclusions. The bootstrap aggregating methodology improves the individual models' performance, with accuracy scores of 97% (Decision Tree), 94% (K-NN), and 96% (Neural Networks), respectively. The integrated model has an accuracy score of 0.98. The accuracy of the integrated model is increased by 3.33 percent.

Günaydin et al. [41] proposed machine learning methods for detecting lung cancer nodules that used Principal Component Analysis, K-NN, SVM, Naïve Bayes, Decision Trees, and ANN to detect anomalies. The approaches were then compared both with and without preprocessing. The experimental findings indicate that Artificial Neural Networks produce the best results with 82.43% accuracy after image processing, while Decision Tree produces the best results with 93.24% accuracy without image processing. The Standard Digital Image Database of the Japanese Society of Radiological Technology (JSRT) CT (computed tomography) was used as the dataset.

Early identification of lung nodules from low-dose computed tomography (LDCT) images was suggested by Elnakib et al. [42]. Initially, the proposed system preprocesses the raw data in order to increase the contrast of the low-dose images. Compact deep learning features of various architectures, including AlexNet, VGG16 and VGG19 networks, are then explored. A genetic algorithm (GA) is trained to identify the features most relevant to early detection, optimizing the derived collection of features. In order to reliably diagnose lung nodules, various types of classifiers are then tested. The method is validated on the International Early Lung Cancer Action Project (I-ELCAP) data, using 320 images from 50 different subjects. With VGG19 and SVM classification, the suggested system achieves the highest detection accuracy of 96.25%, with 97.5% sensitivity and 95% specificity.

VI. MATERIAL AND METHODS

A. Dataset
The data used in this work is a lung cancer dataset that was previously published and later made available in the UCI machine learning repository under the name "Lung Cancer Data set". This dataset was used to show the capability of the optimum discriminant plane in ill-posed situations. The dataset contains data on the pathological forms of lung cancer. It contains 32 observations on three forms of lung cancer using 56 attributes [43].

B. Classification Models
To compare the output of the classifiers, three classification methods are used. The smallest number of features was used to attain higher efficiency. The classifier models are described briefly below.

1. Support Vector Machine
Support Vector Machine is a supervised learning algorithm that uses classification to analyze data and recognize patterns. The SVM classifier divides the textures into two categories or classes: normal and abnormal images [44]. It is used to effectively map the nodule. SVM is a margin classifier that separates the two classes with a hyperplane, which is why it is often referred to as a non-probabilistic binary classifier. The support vectors are the training data points nearest to the separating hyperplane, and the Support Vector Machine is the maximum-margin classifier: the gap between the cancer nodules and the hyperplane is as wide as possible [45][46].

2. K-Nearest Neighbor Classifier
The KNN algorithm is a supervised classification method. It is a simple algorithm that looks for the nearest match. The test sample is compared against the training database, and the test sample's label is determined by the closest match among the k nearest neighbors. To calculate the distances between test samples and database samples, various distance measures such as Euclidean, cosine similarity, and city block are used [47].

3. Convolutional Neural Network
As a supervised deep learning tool, CNN is an excellent choice. This algorithm is suitable for multi-class classification and binary classification (for example, predicting whether or not a diagnostic image contains a malignant tumor) [48][49]. CNNs are often used to solve a wide range of pattern and image recognition problems. This deep learning approach is effective and appropriate for visual data because of three key characteristics. To begin with, local receptive fields are well matched to the property of image data of being correlated locally but uncorrelated globally. Second, since the convolution is applied to the entire image, shared weights allow for significant parameter reduction without affecting image processing. Finally, a grid-structured image allows for pooling operations that reduce data complexity without sacrificing valuable information [50][51].

C. Proposed Model
The paper suggests a model to predict and classify the lung cancer classes. The proposed model consists of data preprocessing, feature selection, classification and evaluation. Figure (2) shows the block diagram for the proposed work.

Fig. 2. Block Diagram for Proposed method

To classify the lung cancer dataset in this article, WEKA classifiers were used. WEKA was established by a team of researchers from New Zealand's University of Waikato [52]. It is a Java-based open-source platform that can run data mining and machine learning algorithms, such as data pre-processing, classification, clustering, and association rule extraction, among other things. WEKA is a popular choice among analysts because of its ease of use and open-source nature [53].

In this paper, the Correlation Attribute (CA) method was used for feature selection. CA is a feature subset selection algorithm [1]. It evaluates an attribute by calculating the correlation (Pearson's product moment correlation) between it and the class [4]. The main objective of CA is to obtain a highly relevant subset of features that are uncorrelated with each other. In this way, the dimensionality of datasets can be drastically reduced and the performance of learning algorithms can be improved [2]. The Ranker search method is used with CA. Features are ranked based on their correlation values, and those features that are most suitable for the machine learning algorithm are selected [3].
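The feature selection and classification steps described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the WEKA implementation used in the paper: the function names `pearson`, `rank_features` and `knn_predict`, the toy data, and the treatment of class labels as plain numeric codes are all assumptions made for the example, and WEKA's own correlation evaluator may handle nominal classes differently.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(rows, labels):
    """Return feature indices sorted by |correlation with the class|, best first."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)

def knn_predict(train_rows, train_labels, query, k=3):
    """Majority vote among the k training rows closest to query (Euclidean distance)."""
    nearest = sorted(range(len(train_rows)),
                     key=lambda i: math.dist(train_rows[i], query))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]

# Hypothetical toy data (not the UCI dataset): feature 0 tracks the class, feature 1 is noise.
rows = [[1.0, 5.0], [1.2, 1.0], [0.9, 3.0], [3.0, 4.0], [3.1, 2.0], [2.9, 5.0]]
labels = [1, 1, 1, 2, 2, 2]

top = rank_features(rows, labels)[:1]            # keep the single best-ranked feature
reduced = [[r[j] for j in top] for r in rows]    # project the data onto that feature
prediction = knn_predict(reduced, labels, [3.05], k=3)
```

Ranking by absolute correlation and then keeping the top entries mirrors the Ranker idea of scoring each attribute independently; it does not by itself remove features that are correlated with each other.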

VII. PERFORMANCE EVALUATION METRICS

A. Confusion Matrix
The confusion matrix is a visual assessment tool used in machine learning. The predicted class results are represented in the columns of a confusion matrix, whereas the actual class results are represented in the rows [54]. This matrix includes all the raw data about a classification model's predictions on a specified dataset, and is used to determine how accurate a model is. It is a square matrix with the rows representing the instances' actual class and the columns representing their predicted class. For a binary classification task, the confusion matrix is a 2 x 2 matrix that reports the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
TP  FN
FP  TN

Precision, recall, and F-measure, which are commonly utilized in the text mining and machine learning communities, were used to evaluate the algorithms. The four types of classified items are true positives (TP: items correctly labeled as belonging to the class), false positives (FP: items falsely labeled as belonging to the class), false negatives (FN: items incorrectly labeled as not belonging to the class), and true negatives (TN: items correctly labeled as not belonging to the class). Recall is determined from the number of true positives and false negatives using the following formula [55][56]:

$$\text{Recall} = \frac{TP}{TP + FN}$$

Recall is also known as "sensitivity" or the "true positive rate." Precision (also known as "positive predictive value") is measured using the number of true positive and false positive items as follows:

$$\text{Precision} = \frac{TP}{TP + FP}$$

The measure that combines precision and recall is known as the F-measure, given as:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

where β denotes the relative weight of recall with respect to precision. A value of β = 1 (which is often used) means that recall and precision are of equal importance. A lower value implies that precision is more important, whereas a higher value indicates that recall is more important.

B. ROC curve
The area under the ROC curve, or simply AUC, summarizes the relationship between a binary classifier's true and false positive rates for various decision thresholds. Several authors have shown that AUC is superior to overall accuracy for classifier assessment, rendering it one of the most common metrics for static imbalanced data. To measure AUC, however, one must sort a specified dataset and iterate through each example [57]. This means that AUC cannot be computed directly on vast data streams, since it would necessitate scanning the whole stream after each example. As a result, the usage of AUC for data streams has been restricted to estimations on periodic holdout sets or whole streams, rendering it inherently biased or computationally infeasible for realistic implementations [58].

VIII. PERFORMANCE EVALUATION AND RESULTS
The confusion matrix was used to evaluate the accuracy of each classifier. The experimental results show that, using five attributes, the SVM classifier produces the best accuracy of 95.56 percent, and the CNN accuracy is 92.11 percent, while KNN has the lowest accuracy, which is 88.40 percent, as shown in Tables 1, 2, and 3. Table 1 shows the results for the SVM algorithm, Table 2 shows the results for the KNN algorithm, and Table 3 shows the results for the CNN algorithm.

TABLE 1. USING SVM CLASSIFIER

Class  TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
1      0.986    0.109    0.951      0.986   0.968      0.946
2      0.882    0.005    0.938      0.882   0.909      0.984
3      0.667    0.000    1.000      0.667   0.800      0.968
4      0.857    0.005    0.947      0.857   0.900      0.944
5      1.000    0.000    1.000      1.000   1.000      1.000
Avg    0.956    0.076    0.956      0.956   0.954      0.955

TABLE 2. USING K-NEAREST NEIGHBOR CLASSIFIER

Class  TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
1      0.964    0.234    0.899      0.964   0.931      0.878
2      0.706    0.016    0.800      0.706   0.750      0.892
3      0.333    0.000    1.000      0.333   0.500      0.815
4      0.762    0.011    0.889      0.762   0.821      0.887
5      0.900    0.005    0.947      0.900   0.923      0.959
Avg    0.897    0.164    0.898      0.897   0.891      0.881

TABLE 3. USING CONVOLUTIONAL NEURAL NETWORK

Class  TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area
1      0.906    0.031    0.984      0.906   0.944      0.978
2      0.882    0.027    0.750      0.882   0.811      0.987
3      1.000    0.020    0.600      1.000   0.750      0.994
4      0.952    0.016    0.870      0.952   0.909      0.988
5      1.000    0.011    0.909      1.000   0.952      0.997
Avg    0.921    0.027    0.934      0.921   0.924      0.982

Table (4) shows the comparison between the three classifiers in terms of the time taken to build the model and the accuracy of the classifier.

TABLE 4. COMPARISON OF RESULT BY TIME AND ACCURACY

Classifier  Time taken to build model  Accuracy
SVM         1.77 sec.                  95.56
KNN         0.01 sec.                  89.65
CNN         3.79 sec.                  92.11

Figure (3) shows that the CNN algorithm takes the longest time to build its model, while the KNN algorithm had the shortest time.
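The per-class figures reported in Tables 1 to 3 follow from confusion-matrix counts. The sketch below shows one way such figures could be derived, assuming rows hold the actual class and columns the predicted class as described in Section VII. The function names and the 3-class `toy` matrix are hypothetical and are not taken from the paper; only the precision and recall values passed to `f_beta` at the end come from Table 1.

```python
def f_beta(precision, recall, beta=1.0):
    """F-measure: weighted harmonic mean of precision and recall."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0

def per_class_metrics(matrix, cls, beta=1.0):
    """One-vs-rest metrics for class index cls.

    matrix[i][j] counts instances of actual class i predicted as class j.
    """
    n = len(matrix)
    tp = matrix[cls][cls]
    fn = sum(matrix[cls]) - tp                        # actual cls, predicted as another class
    fp = sum(matrix[r][cls] for r in range(n)) - tp   # predicted cls, actually another class
    tn = sum(map(sum, matrix)) - tp - fn - fp
    recall = tp / (tp + fn) if tp + fn else 0.0       # also the TP rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    return {"tp_rate": recall, "fp_rate": fp_rate, "precision": precision,
            "recall": recall, "f_measure": f_beta(precision, recall, beta)}

def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(map(sum, matrix))
    return sum(matrix[i][i] for i in range(len(matrix))) / total

# Hypothetical 3-class confusion matrix, for illustration only.
toy = [[8, 1, 1],
       [2, 7, 1],
       [0, 0, 10]]
class0 = per_class_metrics(toy, 0)
overall = accuracy(toy)

# Precision 0.938 and recall 0.882 are the class 2 entries of Table 1.
table1_class2_f = round(f_beta(0.938, 0.882), 3)  # 0.909, matching the F-Measure column
```

With β = 1 the function reduces to the usual F1 score, which is consistent with the F-Measure column of Table 1 for the rows checked here.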

Fig. 3. Time analysis

Figure (4) shows that the SVM algorithm had the highest accuracy, while KNN's accuracy has the minimum value.

Fig. 4. Accuracy analysis

IX. COMPARATIVE STUDIES

As shown in Table (5), the researchers used several different methods, different datasets and different ways of feature selection/feature extraction. In comparison with related work, we obtain a good result in this work with the dataset and methods that we used. The researchers in [34] obtained 94.5% using a CT scan image dataset and SURF (Speeded Up Robust Features). In addition, the researchers in [35] and [36] used the same kind of dataset: the researchers in [35] obtained 90.9% using the Delta Radiomics method for feature extraction, but the researchers in [36] obtained better accuracy by using the GLCM function for feature extraction with the MLP (98%), SVM (70.45%), and KNN (99.2%) classifiers. Using the UCI dataset, the researchers in [12] obtained a good result of 90% and used several classifiers.

The researchers in [37] and [38] used the same dataset (LIDC-IDRI). The researchers in [37] obtained 86% sensitivity, while the researchers in [38], relying on 2 features and applying different classifiers, obtained a good result (Random Forest 70%, SVM 80%, and ANN 96%). The researchers in [39] and [40] also used the same dataset and the same number of features (23), but the researchers in [39] obtained a higher accuracy of 100% by using the KNN classifier and a genetic algorithm for feature selection, while the researchers in [40] obtained Decision Tree 97%, K-NN 94%, and Neural Networks 96%. The researchers in [41] used the JSRT dataset and several classifiers to obtain their experimental results (ANN 82.43% and Decision Tree 93.24%). Finally, the researchers in [42] used an LDCT dataset and a smart genetic algorithm together with SVM to obtain 96.25% accuracy.

TABLE 5. COMPARISON OF RELATED WORK

Ref | Dataset | Feature No. | Feature Selection | Feature Extraction | Classifier | Result
Roy et al. [34] | CT scan images | - | - | SURF (Speeded Up Robust Features) | Random forest algorithm and SVM classification | 94.5%
Faisal et al. [12] | UCI | - | - | - | SVM, C4.5, Multi-Layer Perceptron, Decision Tree, Naïve Bayes, and Neural Network | 90%
Baskar et al. [35] | CT images | - | - | Features extracted by Delta Radiomics | SVM | 90.9%
Boban et al. [36] | CT images | 8 | - | Features extracted using GLCM function | MLP, SVM, and KNN | MLP 98%, SVM 70.45%, KNN 99.2%
Sreekumar et al. [37] | LIDC-IDRI | - | - | 3D CNN model based on the C3D network architecture | CNN | Sensitivity 86%
Banerjee et al. [38] | LIDC-IDRI | 2 | - | - | Random Forest, SVM, and ANN | Random Forest 70%, SVM 80%, ANN 96%
Maleki et al. [39] | Data World source | 23 | Genetic algorithm for feature selection | - | KNN | 100%
Reddy et al. [40] | Data World source | 23 | - | - | Decision Tree, K-NN, and Neural Networks | Decision Tree 97%, K-NN 94%, Neural Networks 96%
Günaydin et al. [41] | JSRT | - | - | - | K-NN, SVM, Naïve Bayes, Decision Trees, and Artificial Neural Networks | ANN 82.43%, Decision Tree 93.24%
Elnakib et al. [42] | LDCT images | - | Smart genetic algorithm | - | SVM | 96.25%
This work | UCI | - | Correlation Method | - | SVM, KNN, CNN | SVM 95.56, KNN 88.40, CNN 92.11

X. CONCLUSION
Lung cancer is one of the most dangerous diseases and the most common cause of death; the severity of the disease lies in the difficulty of diagnosing it in the early stages. This paper investigates three classifiers to find the one that best classifies lung cancer at an early stage. The datasets included in this study were derived from the UCI databases for lung cancer patients. The focus of this paper is on using the WEKA tool to investigate the accuracy of classification algorithms. The results show that the Support Vector Machine (SVM) gives the best accuracy, 95.56%, which can help detect lung cancer in its early stages and save lives, while K-Nearest Neighbor (KNN) gave the lowest accuracy, 88.40%.
REFERENCES
[1] P. Chaudhari, H. Agarwal, and V. Bhateja, “Data augmentation for cancer classification in oncogenomics: an improved KNN based approach,” Evol. Intell., pp. 1–10, 2019.
[2] S. F. Khorshid and A. M. Abdulazeez, “BREAST CANCER DIAGNOSIS BASED ON K-NEAREST NEIGHBORS: A REVIEW,” PalArch’s J. Archaeol. Egypt/Egyptology, vol. 18, no. 4, pp. 1927–1951, 2021.
[3] F. Q. Kareem and A. M. Abdulazeez, “Ultrasound Medical Images Classification Based on Deep Learning Algorithms: A Review.”
[4] D. Q. Zeebaree, A. M. Abdulazeez, D. A. Zebari, H. Haron, and H. N. A. Hamed, “Multi-Level Fusion in Ultrasound for Cancer Detection Based on Uniform LBP Features.”
[5] J. R. F. Junior, M. Koenigkam-Santos, F. E. G. Cipriano, A. T. Fabro, and P. M. de Azevedo-Marques, “Radiomics-based features for pattern recognition of lung cancer histopathology and metastases,” Comput. Methods Programs Biomed., vol. 159, pp. 23–30, 2018.
[6] I. Ibrahim and A. Abdulazeez, “The Role of Machine Learning Algorithms for Diagnosing Diseases,” J. Appl. Sci. Technol. Trends, vol. 2, no. 01, pp. 10–19, 2021.
[7] P. Das, B. Das, and H. S. Dutta, “Prediction of Lungs Cancer Using Machine Learning,” EasyChair, 2020.
[8] G. A. P. Singh and P. K. Gupta, “Performance analysis of various machine learning-based approaches for detection and classification of lung cancer in humans,” Neural Comput. Appl., vol. 31, no. 10, pp. 6863–6877, 2019.

[9] B. Charbuty and A. Abdulazeez, “Classification Based on Decision Tree Algorithm for Machine Learning,” J. Appl. Sci. Technol. Trends, vol. 2, no. 01, pp. 20–28, 2021.
[10] H. A. Hussein and A. M. Abdulazeez, “COVID-19 PANDEMIC DATASETS BASED ON MACHINE LEARNING CLUSTERING ALGORITHMS: A REVIEW,” PalArch’s J. Archaeol. Egypt/Egyptology, vol. 18, no. 4, pp. 2672–2700, 2021.
[11] D. M. Abdullah and N. S. Ahmed, “A Review of most Recent Lung Cancer Detection Techniques using Machine Learning,” Int. J. Sci. Bus., vol. 5, no. 3, pp. 159–173, 2021.
[12] M. I. Faisal, S. Bashir, Z. S. Khan, and F. H. Khan, “An evaluation of machine learning classifiers and ensembles for early stage prediction of lung cancer,” in 2018 3rd International Conference on Emerging Trends in Engineering, Sciences and Technology (ICEEST), 2018, pp. 1–4.
[13] D. Q. Zeebaree, H. Haron, and A. M. Abdulazeez, “Gene selection and classification of microarray data using convolutional neural network,” in 2018 International Conference on Advanced Science and Engineering (ICOASE), 2018, pp. 145–150.
[14] D. Q. Zeebaree, H. Haron, A. M. Abdulazeez, and D. A. Zebari, “Trainable model based on new uniform LBP feature to identify the risk of the breast cancer,” in 2019 International Conference on Advanced Science and Engineering (ICOASE), 2019, pp. 106–111.
[15] H. Tang, J. Zhao, and X. Yang, “Explore machine learning for analysis and prediction of lung cancer related risk factors,” in Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence, 2018, pp. 41–45.
[16] P. R. Radhika, R. A. S. Nair, and G. Veena, “A Comparative Study of Lung Cancer Detection using Machine Learning Algorithms,” in 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), 2019, pp. 1–4.
[17] A. I. Rahmani and M. Katouli, “Diagnosing Lung Cancer Using Grasshopper Optimization Algorithm and k-Nearest Neighbor Classification,” J. homepage http//iieta. org/journals/rces, vol. 6, no. 4, pp. 69–75, 2019.
[18] Y. Nai et al., “Improving Lung Lesion Detection in Low Dose Positron Emission Tomography Images Using Machine Learning,” in 2018 IEEE Nuclear Science Symposium and Medical Imaging Conference Proceedings (NSS/MIC), 2018, pp. 1–3.
[19] S. Senthil and B. Ayshwarya, “Lung cancer prediction using feed forward back propagation neural networks with optimal features,” Int. J. Appl. Eng. Res., vol. 13, no. 1, pp. 318–325, 2018.
[20] M. R. Mahmood, A. M. Abdulazeez, and Z. ORMAN, “A NEW HAND GESTURE RECOGNITION SYSTEM USING ARTIFICIAL NEURAL NETWORK.”
[21] M. Somvanshi, P. Chavan, S. Tambade, and S. V. Shinde, “A review of machine learning techniques using decision tree and support vector machine,” Proc. - 2nd Int. Conf. Comput. Commun. Control Autom. ICCUBEA 2016, 2017, doi: 10.1109/ICCUBEA.2016.7860040.
[22] D. M. Abdulqader, A. M. Abdulazeez, and D. Q. Zeebaree, “Machine Learning Supervised Algorithms of Gene Selection: A Review,” Mach. Learn., vol. 62, no. 03, 2020.
[23] O. Ahmed and A. Brifcani, “Gene Expression Classification Based on Deep Learning,” in 2019 4th Scientific International Conference Najaf (SICN), 2019, pp. 145–149.
[24] N. O. M. Salim and A. M. Abdulazeez, “Human Diseases Detection Based On Machine Learning Algorithms: A Review,” Int. J. Sci. Bus., vol. 5, no. 2, pp. 102–113, 2021.
[25] N. M. Abdulkareem and A. M. Abdulazeez, “Machine Learning Classification Based on Radom Forest Algorithm: A Review,” Int. J. Sci. Bus., vol. 5, no. 2, pp. 128–142, 2021.
[26] R. Sathishkumar, K. Kalaiarasan, A. Prabhakaran, and M. Aravind, “Detection of Lung Cancer using SVM Classifier and KNN Algorithm,” in 2019 IEEE International Conference on System, Computation, Automation and Networking (ICSCAN), 2019, pp. 1–7.
[27] S. Uddin, A. Khan, M. E. Hossain, and M. A. Moni, “Comparing different supervised machine learning algorithms for disease prediction,” BMC Med. Inform. Decis. Mak., vol. 19, no. 1, pp. 1–16, 2019.
[28] N. Najat and A. M. Abdulazeez, “Gene clustering with partition around mediods algorithm based on weighted and normalized Mahalanobis distance,” in 2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), 2017, pp. 140–145.
[29] S. Hussein, P. Kandel, C. W. Bolan, M. B. Wallace, and U. Bagci, “Lung and pancreatic tumor characterization in the deep learning era: novel supervised and unsupervised learning approaches,” IEEE Trans. Med. Imaging, vol. 38, no. 8, pp. 1777–1787, 2019.
[30] B. M. S. Hasan and A. M. Abdulazeez, “A Review of Principal Component Analysis Algorithm for Dimensionality Reduction,” J. Soft Comput. Data Min., vol. 2, no. 1, pp. 20–30, 2021.
[31] D. M. Sulaiman, A. M. Abdulazeez, H. Haron, and S. S. Sadiq, “Unsupervised Learning Approach-Based New Optimization K-Means Clustering for Finger Vein Image Localization,” in 2019 International Conference on Advanced Science and Engineering (ICOASE), 2019, pp. 82–87.
[32] H. U. Dike, Y. Zhou, K. K. Deveerasetty, and Q. Wu, “Unsupervised learning based on artificial neural network: A review,” in 2018 IEEE International Conference on Cyborg and Bionic Systems (CBS), 2018, pp. 322–327.
[33] H. R. Abdulqadir and A. M. Abdulazeez, “Reinforcement Learning and Modeling Techniques: A Review,” Int. J. Sci. Bus., vol. 5, no. 3, pp. 174–189, 2021.
[34] K. Roy et al., “A Comparative study of Lung Cancer detection using supervised neural network,” in 2019 International Conference on Opto-Electronics and Applied Optics (Optronix), 2019, pp. 1–5.
[35] S. Baskar, P. M. Shakeel, K. P. Sridhar, and R. Kanimozhi, “Classification System for Lung Cancer Nodule Using Machine Learning Technique and CT Images,” in 2019 International Conference on Communication and Electronics Systems (ICCES), 2019, pp. 1957–1962.
[36] B. M. Boban and R. K. Megalingam, “Lung Diseases Classification based on Machine Learning Algorithms and Performance Evaluation,” in 2020 International Conference on Communication and Signal Processing (ICCSP), 2020, pp. 315–320.
[37] A. Sreekumar, K. R. Nair, S. Sudheer, H. G. Nayar, and J. J. Nair, “Malignant Lung Nodule Detection using Deep Learning,” in 2020 International Conference on Communication and Signal Processing (ICCSP), 2020, pp. 209–212.
[38] N. Banerjee and S. Das, “Prediction Lung Cancer–In Machine Learning Perspective,” in 2020 International Conference on Computer Science, Engineering and Applications (ICCSEA), 2020, pp. 1–5.
[39] N. Maleki, Y. Zeinali, and S. T. A. Niaki, “A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection,” Expert Syst. Appl., vol. 164, p. 113981, 2021.
[40] D. Reddy, E. N. H. Kumar, D. Reddy, and P. Monika, “Integrated Machine Learning Model for Prediction of Lung Cancer Stages from Textual data using Ensemble Method,” in 2019 1st International Conference on Advances in Information Technology (ICAIT), 2019, pp. 353–357.
[41] Ö. Günaydin, M. Günay, and Ö. Şengel, “Comparison of lung cancer detection algorithms,” in 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), 2019, pp. 1–4.
[42] A. Elnakib, H. M. Amer, and F. E. Z. Abou-Chadi, “Early Lung Cancer Detection Using Deep Learning Optimization,” 2020.
[43] S. M. Salaken, A. Khosravi, A. Khatami, S. Nahavandi, and M. A. Hosen, “Lung cancer classification using deep learned features on low population dataset,” in 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), 2017, pp. 1–5.
[44] A. Asuntha and A. Srinivasan, “Deep learning for lung Cancer detection and classification,” Multimed. Tools Appl., vol. 79, no. 11, pp. 7731–7762, 2020.
[45] W. Rahane, H. Dalvi, Y. Magar, A. Kalane, and S. Jondhale, “Lung cancer detection using image processing and machine learning healthcare,” in 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), 2018, pp. 1–5.
[46] H. S. Yahia and A. M. Abdulazeez, “Medical Text Classification Based on Convolutional Neural Network: A Review,” Int. J. Sci. Bus., vol. 5, no. 3, pp. 27–41, 2021.

[47] S. Potghan, R. Rajamenakshi, and A. Bhise, “Multi-Layer Perceptron Based Lung Tumor Classification,” in 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2018, pp. 499–502.
[48] S. S. Raoof, M. A. Jabbar, and S. A. Fathima, “Lung Cancer Prediction using Machine Learning: A Comprehensive Approach,” in 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2020, pp. 108–115.
[49] J. Saeed and A. M. Abdulazeez, “Facial Beauty Prediction and Analysis Based on Deep Convolutional Neural Network: A Review,” J. Soft Comput. Data Min., vol. 2, no. 1, pp. 1–12, 2021.
[50] Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li, and A. K. Nandi, “Applications of machine learning to machine fault diagnosis: A review and roadmap,” Mech. Syst. Signal Process., vol. 138, p. 106587, 2020.
[51] N. Omar, A. M. Abdulazeez, A. Sengur, and S. G. S. Al-Ali, “Fused faster RCNNs for efficient detection of the license plates,” Indones. J. Electr. Eng. Comput. Sci., vol. 19, no. 2, pp. 974–982, 2020.
[52] Z. Zainudin, S. M. Shamsuddin, and S. Hasan, “Deep Learning for Image Processing in WEKA Environment,” Int. J. Adv. Soft Compu. Appl, vol. 11, no. 1, 2019.
[53] V. Mhetre and M. Nagar, “Classification based data mining algorithms to predict slow, average and fast learners in educational system using WEKA,” in 2017 International Conference on Computing Methodologies and Communication (ICCMC), 2017, pp. 475–479.
[54] Al Janabi, K. B., & Kadhim, R. (2018). Data reduction techniques: a comparative study for attribute selection methods. International Journal of Advanced Computer Science and Technology, 8(1), 1-13.
[55] Sugianela, Y., & Ahmad, T. (2020, February). Pearson Correlation Attribute Evaluation-based Feature Selection for Intrusion Detection System. In 2020 International Conference on Smart Technology and Applications (ICoSTA) (pp. 1-5). IEEE.
[56] Demisse, G. B., Tadesse, T., & Bayissa, Y. (2017). Data mining attribute selection approach for drought modeling: A case study for Greater Horn of Africa. arXiv preprint arXiv:1708.05072.
[57] Kumar, S., & Chong, I. (2018). Correlation analysis to identify the effective data in machine learning: Prediction of depressive disorder and emotion states. International journal of environmental research and public health, 15(12), 2907.
[58] O. Caelen, “A Bayesian interpretation of the confusion matrix,” Ann. Math. Artif. Intell., vol. 81, no. 3, pp. 429–450, 2017.
[59] N. Milosevic, A. Dehghantanha, and K.-K. R. Choo, “Machine learning aided Android malware classification,” Comput. Electr. Eng., vol. 61, pp. 266–274, 2017.
[60] J. Xu, Y. Zhang, and D. Miao, “Three-way confusion matrix for classification: A measure driven view,” Inf. Sci. (Ny)., vol. 507, pp. 772–794, 2020.
[61] Z. Yang, T. Zhang, J. Lu, D. Zhang, and D. Kalui, “Optimizing area under the ROC curve via extreme learning machines,” Knowledge-Based Syst., vol. 130, pp. 74–89, 2017.
[62] D. Brzezinski and J. Stefanowski, “Prequential AUC: properties of the area under the ROC curve for data streams with concept drift,” Knowl. Inf. Syst., vol. 52, no. 2, pp. 531–562, 2017.


Materials Today: Proceedings 50 (2022) 40–47


Prediction of Cancer Disease using Machine Learning Approach


F.J. Shaikh a,*, D.S. Rao b
a School of Computer Engineering and Technology, MIT Academy of Engineering, Alandi Road, Pune, India
b Computer Science Engineering, Koneru Lakshmaiah Education Foundation, Hyderabad Campus, India

ARTICLE INFO

Article history:
Available online 16 April 2021

Keywords:
Cancer
Deep learning
ML
ANN
SVM
Decision trees

ABSTRACT

Cancer has been identified as a diverse condition with many different subtypes. Timely screening and treatment of a given form of cancer is now a requirement in early cancer research because it supports the medical treatment of patients. Many research teams have studied the application of machine learning (ML) and deep learning (DL) methods in biomedicine and bioinformatics to classify people with cancer into high- or low-risk categories. These techniques have therefore been used to model the development and treatment of cancer, since ML tools are capable of detecting key features in complex datasets. Many of these methods are widely used to develop predictive models aimed at predicting a cure for cancer; they include artificial neural networks (ANNs), support vector machines (SVMs) and decision trees (DTs). While ML methods can help us understand cancer progression, an adequate level of validation is needed before they can be considered for everyday clinical practice.
In this study, the ML and DL approaches used in cancer progression modeling are reviewed. The predictive models discussed are mostly tied to specific supervised ML techniques, inputs, and data samples.
© 2021 Elsevier Ltd. All rights reserved.
Selection and peer-review under responsibility of the scientific committee of the International Virtual Conference on Advanced Nanomaterials and Applications. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

* Corresponding author.
E-mail address: farhahashaikh20@gmail.com (F.J. Shaikh).

1. Introduction

Lung cancer carries the greatest overall disease burden and is the most prevalent cancer in both men and women [1]. Other reports estimate that some 221,200 new cases of lung cancer occurred in 2015, representing approximately 13% of all cancer diagnoses. Approximately 27 percent of all cancer deaths are attributed to lung cancer [2]. Lung nodules must therefore be closely examined and monitored at an early stage. In this study, the ML & DL approaches used in cancer progression modeling are reviewed. The predictive models discussed here are based on different supervised ML techniques, inputs and data samples.

A Local Binary Pattern (LBP) is an image operator that transforms an image into an array, or image, of integer labels that describe the small-scale appearance of the image. These labels are then used for further image analysis, most commonly in the form of a histogram. The LBP texture operator has become a popular approach in various applications thanks to its discriminative power and computational simplicity [2].

Over the last three decades, prostate cancer in men and breast cancer in women have had the highest prevalence, but lung cancer remains the leading cause of cancer mortality [3]. One of the main reasons is that prognostic models for prostate and breast cancer are comparatively more advanced and systematic than those for lung cancer. Thus, there is an urgent need to establish an effective early-stage lung cancer prediction model. For both linear and non-linear problems, SVM has superior predictive performance and is widely used in various fields, including medicine. Even though SVM is a superior classifier, the field of cancer prognosis models is relatively immature [4].

The mutation test [5] has become an important tool for deciding the right therapy options for patients in clinical tests. Direct sequencing is an alternative, screening-based approach for unknown mutations. The Epidermal Growth Factor Receptor (EGFR) mutation test has been identified for genetic mutation testing in lung cancer [4]. A comparison of ensemble versions of two types of classifier, Artificial Neural Network (ANN) and Support Vector Machine (SVM), with their non-ensemble variants has been published. The misclassification weight for the majority class is higher than that for the minority class, which is likely to cause misclassification. Traditional classification algorithms therefore do not perform well.

1.1. Topology of machine learning & deep learning algorithms

Various deep learning and machine learning algorithms are used to predict different types of disease. Support Vector Machine (SVM), Neural Network (NN), logistic regression (LR), Naive Bayes (NB), fuzzy logic, transfer learning, ensemble learning, transductive learning, KNN, and AdaBoost are the ones most often used in the literature. SVM variants include Boosted SVM and MLSVM, which earlier contributions used to predict distinct diseases. Similarly, NN variants include the Dynamic Neural Network (DNN) and the Convolutional Neural Network (CNN), which have been employed to diagnose different diseases. GBDT is a modified form of DT, and CVIFLR is a modified form of LR; both are used for detecting diseases. RF and fuzzy logic have been extended to HRFLM and Fuzzy SVM, respectively, to predict specific diseases. Improved machine learning techniques can therefore be used to predict lung cancer efficiently.

2. Literature review

Chao Tan et al. [1] explored the feasibility of predicting early lung cancer by combining trace element analysis with AdaBoost (a machine learning ensemble method) using decision stumps as weak classifiers. For illustration, a cancer dataset of 9 trace elements measured in 122 urine samples was used. The sample set was partitioned using the Kennard and Stone (KS) algorithm combined with alternate sampling. The AdaBoost prediction results were compared with those of Fisher Discriminant Analysis (FDA). On the test set, AdaBoost reached 100% sensitivity in both cases, with the other reported rates, including accuracy, ranging from 93.8% to 96.7% across case A and case B. Compared with FDA, AdaBoost was less sensitive to the structure of the data, and its variation was easier to monitor. AdaBoost appeared superior to FDA, showing that combining AdaBoost with urine analysis could be a valuable method for diagnosing early lung cancer in clinical practice.

Tae-Woo Kim et al. [2] developed a decision tree for occupational lung cancer. Between 1992 and 2007, 153 lung cancer cases were reported by the Occupational Safety and Health Research Institute (OSHRI). The target variable was whether a case was accepted as work-related lung cancer; the independent variables were age, sex, smoking years, histology, industry size, latency, working time and exposure. The classification and regression tree (CART) approach was used to search for indicators of work-related lung cancer. Exposure to known lung carcinogens was the best indicator in the CART model. Because the CART model is not absolute, the work-relatedness of lung cancer must still be determined carefully.

Maciej Zięba et al. [3] introduced boosted SVM in 2014, which is dedicated to solving imbalanced data problems. The proposed solution combined the advantages of ensemble classifiers with cost-sensitive support vectors for imbalanced data. In addition, a method for extracting decision rules from the boosted SVM was presented. The efficiency of the proposed solution was then assessed by comparing its performance on imbalanced data with that of other algorithms. Finally, boosted SVM was used to estimate post-operative life expectancy in patients with lung cancer.

A pathway activity transformation approach for multiclass data, called Analysis-of-Variance Based Feature Set (AFS), was suggested by Worrawat Engchuan [4]. The results of classification using pathway activity derived from the proposed approach indicate high classification power, in three-fold validation and robustness tests, on all four lung cancer data sets used.

H. Azzawi et al. [5] proposed a gene expression programming (GEP) model in 2016 to predict lung cancer from microarray data. To extract important lung cancer related genes, the authors used two gene selection approaches and proposed specific GEP prediction models accordingly. Reliability was tested by cross-data-set validation. The test results show that, in terms of precision, sensitivity, specificity, and area under the receiver operating characteristic curve, the GEP model using fewer features outperformed other models. The GEP model was found to be a better approach to lung cancer diagnosis problems.

Panayiotis Petousis et al. [6] created and evaluated a range of dynamic Bayesian networks (DBNs) to help inform decisions about lung cancer screening by providing insights into how longitudinal data can be used. Five DBNs for high-risk individuals were built and explored using the LDCT arm of the NLST dataset. Three of the DBNs were designed with a reverse construction approach and two through structure learning methods. All models are based on demographics, smoking status, cancer history, family history of lung cancer, exposure risk factors, lung cancer comorbidities and LDCT screening information. Given the uncertainty arising from lung cancer screening, a lung cancer state model was used to identify the individual's cancer status over time. The models were tested on balanced cancer and non-cancer training and test sets in order to address data imbalance and overfitting. The results were compared with expert judgments. In all three NLST test intervention stages, the average area under the curve (AUC) of the receiver operating characteristic (ROC) was above 0.75. The DBNs outperformed comparison models such as logistic regression and naïve Bayes. The lung screening DBNs demonstrated strong discrimination and predictive power in both cancer and non-cancer cases.

Chip M. Lynch et al. [7] used the SEER database to model the survival of lung cancer patients with linear regression, decision trees, gradient boosting machines (GBM), support vector machines (SVMs) and a custom ensemble. To allow comparisons between the different approaches, the main data attributes used were tumor grade, tumor size, gender, age, stage and number of primaries. Rather than being divided into classes, survival was treated as a continuous target, as a first step towards improving survival prediction. The results indicated that the predicted values agree with the actual values for low to moderate survival times, which constitute the majority of the data. The most influential model within the custom ensemble was GBM, whereas decision trees may be inapplicable because they produce only a few discrete outputs. Of the five individual models produced, GBM was the most precise, with an RMSE of 15.32. Although SVM underperformed with an RMSE of 15.82, statistical analysis singled it out as the only model that generated a distinctive output. The results of the models were consistent with a traditional Cox proportional hazards model used as a reference. The authors conclude that applying these supervised learning techniques to SEER data can be used to estimate patient survival time, with the ultimate goal of informing patient care decisions, and that their performance on this particular dataset may be on par with that of conventional methods.

Deep learning relies on "deep" neural networks with several hidden layers, which learn the relations between computed morphological and structural features [9]. With a deep belief network, a sensitivity of 73.40% and a specificity of 82.20% were reached.
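Several of the studies reviewed here report sensitivity and specificity. As a reminder of how these rates relate to the counts in a binary confusion matrix, here is a minimal Python sketch. The counts are hypothetical, chosen only to make the arithmetic easy to follow, and are not taken from any cited study.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of actual cancer cases flagged as cancer."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of cancer-free cases flagged as cancer-free."""
    return tn / (tn + fp)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all cases that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical confusion-matrix counts (illustration only).
tp, fn = 80, 20  # cancer cases: detected / missed
tn, fp = 90, 10  # cancer-free cases: correctly cleared / false alarms

print(f"sensitivity = {sensitivity(tp, fn):.2f}")
print(f"specificity = {specificity(tn, fp):.2f}")
print(f"accuracy    = {accuracy(tp, tn, fp, fn):.2f}")
```

A screening model is usually judged on both rates together, since a high sensitivity can be obtained at the cost of a low specificity and vice versa.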

Deep convolutional neural networks (CNNs) are used to identify or label medical images in some research papers. A multiscale two-layer CNN used to diagnose lung cancer in 2015 [13] recorded 86.84% accuracy. In [12], three significant and previously understudied factors were exploited and extensively analyzed: CNN architecture, data set characteristics, and transfer learning.

Predictive models for breast cancer survival were built in [15] on a large dataset using two major data mining methods, artificial neural networks and decision trees, together with logistic regression. The unbiased estimate of the three prediction models was measured by 10-fold cross-validation for comparative analysis. Results indicate that the decision tree (C5) is the best predictor, with 93.6 percent accuracy on the holdout sample; artificial neural networks came second with 91.2 percent, and the logistic regression models reached 89.2 percent. A further study built predictive models for prostate cancer survival, using support vector machines (SVM) alongside the three techniques. In [16], compared with artificial neural networks and decision trees, SVM is identified as the most accurate predictor (92.85% accuracy). Prostate cancer survival is also examined in this context with artificial neural networks, decision trees and logistic regression. In another study, methods were compared on data from colon cancer patients to predict survival, and neural networks were found to be more accurate.

In [19], an ensemble of the three most effective classification methods yields the best prediction and area under the ROC curve for colon cancer survival. Some studies have evaluated the survival of lung cancer patients by analyzing the SEER database with machine learning techniques such as ensemble-based classification methods, SVM and logistic regression. Data classification techniques have also been analyzed for assessing the probability that patients with certain symptoms develop lung cancer. Comparisons on lung cancer data were made with the C4.5 and Naïve Bayes classifiers, which achieved 90% on survival estimates. In [19,20], a joint voting scheme over five decision tree classifiers was established to provide the best predictor in terms of precision and area under the ROC curve for lung cancer survival. Association rule mining techniques have been used to identify interesting association rules or correlations among a wide range of items; different methods for extracting rules, along with standard criteria, have been proposed to indicate the best way to choose the rules and optimize them for a specific dataset.

A lung cancer rule book was developed using automated techniques; some of the rules were redundant and were removed manually on the basis of domain expertise. Three parameters were involved: the maximum branching factor, the addition of a new branch, and the factor used to add a new branch. The authors proposed a tree algorithm that starts from the whole dataset and descends in depth through the data with a greedy approach. Each tree node was a segment and therefore an association rule. The attributes were: age, birthplace, cancer grade, diagnostic confirmation, farthest tumor extension, lymph node involvement, type of surgery performed, reason for no surgery, order of surgery and radiation, scope of lymph node surgery, and cancer stage. Measuring the effectiveness of treatment and surgery is a desired outcome of a SEER dataset analysis, despite the lack of chemotherapy information in the dataset.

Treatment effectiveness was taken into account in [21]. The study investigated whether patients with lung cancer survived longer with surgery, radiation, or both. A propensity score was used, which represents the conditional probability of receiving treatment for a unit given a set of observed covariates. Two methods, logistic regression and classification tree, were used to estimate the score. As patients can be treated with surgery and radiation separately or together, the score for every class was calculated and the variables were then numbered. A clustering was generated on the combination of survival time and radiation, together with a classification tree for each cluster. The results showed that patients who did not receive radiation, with or without surgery, had the longest survival time.

2.1. Analysis of ML applicability in cancer

A comprehensive search was conducted for the use of ML techniques in predicting cancer susceptibility, recurrence and survival. Two online databases, PubMed and Scopus, were accessed. Because of the large number of articles returned by the search queries, further review was required. Most of the studies use different input data types: clinical, genetic, histological, imaging, socioeconomic, epidemiological or mixed. According to [13] and their survey of ML use in cancer prediction, there has been a large rise in papers published in the last decade. From the initial set of papers, we selected a representative list following a well-defined structure. We selected studies that predict the desired outcome using recognizable ML techniques and data integrated from heterogeneous sources. Table 1 lists some of the publications in this study. For each suggested approach, it specifies the type of cancer, the ML technique, the number of patients, the type of data and the overall precision achieved. Each table corresponds to a particular study scenario, i.e., prediction of cancer susceptibility, prediction of cancer recurrence and prediction of cancer survival. Note that for publications applying more than one ML technique to prediction, only the most precise predictive model is presented here. Various studies have attempted to predict cancer recurrence after remission and have managed to improve prediction accuracy compared with alternative statistical techniques. In addition, the large majority of these papers used molecular and clinical data for their predictions. Using data of this kind as input is a growing trend, driven by the growth of HTTs.

2.2. Case study 1

Applications of machine learning to predicting cancer risk susceptibility are relatively limited among the 79 papers surveyed in this study (only 3). One of the more interesting papers developed a retrospective method to predict the presence of 'spontaneous' breast cancer using single nucleotide polymorphism (SNP) profiles of steroid metabolizing enzymes (CYP 450). Sporadic, non-familial breast cancers account for about 90% of breast cancers (Dumitrescu and Cotarla, 2015). The hypothesis of the study was that certain combinations of SNPs in these genes lead to an increased accumulation of environmental toxins or hormones in breast tissue and hence to a higher risk of breast cancer. The authors obtained SNP data (98 SNPs from 45 cancer-associated genes) for 63 breast cancer patients and 74 patients without breast cancer (controls). Key to the success of this research was that the authors used various methods to improve the sample-per-feature ratio and examined several machine learning methods in order to find an optimal classifier. In particular, the authors quickly reduced the starting set of 98 SNPs to just 2–3 SNPs that appeared to be maximally informative. Instead of almost 3:2 (with all 98 SNPs used), the sample-per-feature ratio became 45:1 (for 3 SNPs) and 68:1 (for 2 SNPs). This made it possible to avoid the "curse of dimensionality" (Bellman, 1961; Somorjai et al., 2013). Once the feature set had been reduced, a number of machine learning techniques were applied, including a naïve Bayes model, various decision tree methods and a sophisticated SVM.
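The sample-per-feature ratios quoted for this case study follow directly from the cohort size (63 patients plus 74 controls) and the number of SNPs retained. The short Python check below reproduces that arithmetic; the function name is an illustrative choice and not something taken from the study.

```python
def samples_per_feature(n_samples: int, n_features: int) -> float:
    """Return how many samples are available for each input feature."""
    if n_features <= 0:
        raise ValueError("n_features must be positive")
    return n_samples / n_features


# Cohort size reported in the text: 63 breast cancer patients + 74 controls.
n_samples = 63 + 74

# Feature counts reported in the text: all 98 SNPs, then 3 SNPs, then 2 SNPs.
for n_snps in (98, 3, 2):
    ratio = samples_per_feature(n_samples, n_snps)
    print(f"{n_snps:>2} SNPs -> {ratio:.1f} samples per feature")
```

With 137 samples the ratios come out at about 1.4, 45.7 and 68.5, which matches the "almost 3:2", 45:1 and 68:1 figures in the text. Fewer features for the same number of samples means more evidence per parameter, which is why the reduction helps against the curse of dimensionality.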


Table 1
Features and challenges of existing lung cancer prediction models.

Author [citation] | Methodology | Features | Challenges
Tan et al. [1] | Adaboost | Has attained high sensitivity and best performance. It is very simple to implement. | It is very sensitive to noisy data.
Tae-Woo Kim et al. [2] | Decision Tree (DT) | These are simple to interpret. It should be taken as the minimal decision standard of work-relatedness for lung cancer. | They suffer from overfitting.
Maciej Zięba et al. [3] | Boosted SVM | It is used in medical applications for predicting post-operative life expectancy in lung cancer patients. It is used to solve imbalanced data problems. | The running time of training algorithms does not scale well with the size of the training set. Many parameters need to be set accurately for attaining the best results.
Worrawat Engchuan [4] | SVM | It is used to build n-hyperplanes and n-features for dividing each different class apart with maximal margin. |
H. Azzawi et al. [5] | Gene Expression Programming | It has a better solution for predicting lung cancer difficulties. Has high accuracy. |
Panayiotis Petousis et al. [6] | Dynamic Bayesian Networks | Has demonstrated high discrimination and predictive power. It is used to acquire the probability of a positive outcome of a biopsy for the given individual. |
Chip M. Lynch et al. [7] | DT | It is the best predictor by attaining high accuracy. It automatically prunes to a very short three-level depth. |
P. Petousis et al. [10] | Partially-Observable Markov Decision Process (POMDP) | It optimizes the lung cancer prediction during the improvement of test specificity. It reduces the false positive rates. |
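Table 1 notes that Adaboost is very sensitive to noisy data. The sketch below shows why, using a generic textbook form of discrete AdaBoost with one-dimensional decision stumps. It is not the implementation of Tan et al. [1]; the toy data and all names are assumptions made for illustration.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump: predicts `polarity` when x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    values = sorted(set(xs))
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for thr in thresholds:
        for pol in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, thr, pol) != y)
            if best is None or err < best[0]:
                best = (err, thr, pol)
    return best


def adaboost(xs, ys, rounds):
    """Discrete AdaBoost; returns the stump ensemble and the final sample weights."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, thr, pol = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1.0 - 1e-10)   # avoid division by zero
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Misclassified samples are up-weighted, correct ones down-weighted.
        weights = [w * math.exp(-alpha * y * stump_predict(x, thr, pol))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble, weights


# Hypothetical toy data: labels switch from -1 to +1 at x = 4.5,
# and the label of x = 2 is flipped to simulate label noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
clean = [-1, -1, -1, -1, 1, 1, 1, 1]
noisy = [-1, 1, -1, -1, 1, 1, 1, 1]

_, w_clean = adaboost(xs, clean, rounds=1)
_, w_noisy = adaboost(xs, noisy, rounds=1)
```

After a single round on the noisy labels, the one mislabelled sample carries half of the total weight, so the following rounds concentrate on fitting it. This is the mechanism behind the sensitivity to noise listed in the table.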

A precision of 69% was found for the SVM, and 67% and 68%, respectively, for the naive Bayes and Decision Tree classification systems. The outputs are about 23–25% better than original. The extensive level of cross-validation and confirmation conducted was another notable feature of this study. Each model's predictive power was validated in at least three ways. Firstly, model training was evaluated and monitored with 20-fold cross-validation. A bootstrap resampling approach was used in which the cross-validation was performed 5 times and the outputs were averaged, to keep the stochastic element in the division of samples to a minimum. In addition, the selection process was carried out within each fold, 100 times in total (5 times for each of 20 folds), in order to reduce bias in feature selection (i.e. selecting the most informative SNP subset). The outputs were then compared with a random permutation test that had 50 percent predictive precision. While the researchers tried to reduce the stochastic element in sample partitioning, it could have been better to use leave-one-out cross-validation, which would have removed this stochastic element completely. This trial was conducted on the theory that environmental toxins or hormones accumulate in breast tissue and that some combinations of SNPs are associated with an increased risk of breast cancer. The authors obtained SNP data (98 SNPs from 45 cancer-associated genes) on 63 breast cancer patients and 74 cancer-free (control) patients. It was vital to the progress of this research that the researchers used various models to minimize the sample-per-feature ratio and analyzed several machine learning methods in order to find the optimum classifier. The study also points out the way in which machine learning can reveal significant information about the biology of spontaneous or non-familial breast cancer and its polygenic risk factors.

2.3. Case study 2

Cancer survival prediction. Almost half of all the machine learning studies on cancer prediction concern survival (1-year or 5-year survival rates). One report of specific interest (Futschik et al., 2013) used hybrid machine learning to predict the outcomes of diffuse large B-cell lymphoma (DLBCL). In particular, both clinical and genomic (microarray) information was combined to create one clinical classification to predict DLBCL survival. The approach differs somewhat from that of the Listgarten et al. study (2014), which used only genomic (SNP) data. Futschik et al. hypothesized, rightly, that combining clinical knowledge with microarray data may give a better output than a microarray-only or clinical-data-only classifier. In addition, different kinds of "Evolving Fuzzy Neural Network" (EFuNN) classifiers were built to handle the genomic data, separately from the Bayesian classification system. The best EFuNN classifier used a combination of 17 genes from the microarray data and reached an accuracy of 78.5%. In order to achieve a consensus prediction, the EFuNN and the Bayesian classifiers were combined in a hierarchical modular structure. This hybrid classifier has a precision of 87.5 per cent, a significant improvement over the performance of either classifier alone. This was 10% better than the best-performing machine learning classifier (77.6% by VMS). A cross-validation strategy was applied for the EFuNN classifier, possibly because the sample was small. As with Case Study #1, no external validation set was available to check the overall system. With just 56 patients (samples) categorized using 17 gene features, the sample-per-feature ratio (SFR) is just above 3. An SFR below 5 does not necessarily ensure a robust classification (Somorjai et al., 2013). Moreover, it is clear that the researchers were aware of this problem and spent considerable time explaining in depth the internal functioning of their classifier to justify their approach. This included a summary of how the Bayesian classifier was constructed, how the EFuNN works and how the two classification systems cooperate to make one prediction. The researchers also tested the independence of the microarray data from the clinical data and subsequently verified it. This eye for detail is particularly outstanding in such a study. The whole research reveals how the capacity to use both genomic and clinical data significantly improves cancer prediction accuracy.

2.4. Case study 3

The study by De Laurentiis et al. is a particularly good example, and it also discusses some of the shortcomings observed in existing research. The authors wanted to predict the possibility of

recurrence in patients with breast cancer within five years. Seven predictive variables were combined, comprising clinical information such as patient age, tumor size and number of axillary metastases. Protein biomarkers, such as oestrogen and progesterone receptor levels, were also included. The focus of the research was to produce an automated, quantitative predictive approach more precise than the classical tumor-node-metastasis (TNM) staging system. TNM is an expert-based system that relies largely on the professional judgment of a pathologist or clinician. The researchers used an ANN model with information from 2441 breast cancer patients (seven data points each). The resulting sample-per-feature ratio was far higher than the recommended minimum of five (Somorjai et al., 2013). The whole dataset was divided into three sets: training (1/3), monitoring (1/3) and test (1/3). Furthermore, the authors collected 310 separate samples from another institution to carry out an external assessment on breast cancer patients. This helped the researchers to test the generalization of their system beyond their own institution, a step not taken in the two studies discussed above. The analysis is notable not only for the volume of data and the thoroughness of validation, but also for the level of quality control in data processing. The information, for example, was entered and collected independently in a relational database and was independently verified by the referring doctors to maintain quality. The sample of 2441 patients and 17 000 data points was large enough to be representative of typical breast cancer population demographics when subdivided into the data sets. Nevertheless, by examining the data distributions for the patients in each set (training, monitoring, testing and external), the authors explicitly verified this assumption and demonstrated that the distributions are very similar. Through this consistency and attention to detail, the authors built an extremely accurate and robust classifier. Since the study's aim was to produce a system that predicted the recurrence of breast cancer better than the traditional TNM staging method, comparing the ANN model with the TNM staging predictions was important. This was done by using a receiver operating characteristic (ROC) curve to compare the performance. The ANN model (0.726), measured by the area under the ROC curve, exceeded the TNM system (0.677). This research is an excellent illustration of a well-designed and well-tested machine learning study. A large enough set of data was obtained, and the data for each sample was independently checked for quality assurance and precision. In addition, blinded validation sets were available for assessing the generality of the machine learning system, both from the original data set and from an external source. Finally, the precision of the model was compared directly with that of the traditional TNM prediction scheme. The only weakness of this analysis was that the researchers evaluated only one form of ANN algorithm. Given the type and the amount of data used, another machine learning technique could well have exceeded their ANN model.

3. Research gap

Lung cancer is the second largest cause of death from cancer worldwide. The average 5-year survival rate for patients with lung cancer does not exceed 14 percent, which is significantly less than the rate for patients with cancer in other organs such as the breast, cervix, bladder, prostate or colon [18]. Thus, early prediction of lung cancer is very important so that appropriate treatment can reduce deaths. Healthcare is one of the significant sources of big data, and accurate examination of healthcare information is in high demand for detecting lung cancer at an early stage. Several recent studies aim to recognize lung cancer more reliably using big data, and there is a need for classification approaches that improve detection accuracy with respect to time. In addition, machine learning techniques are being modelled to enhance the detection of new variants. The ability of machine learning to solve composite tasks with dynamic environments and knowledge has contributed to its success in prediction research, especially for lung cancer, enabled by novel meta-heuristic algorithms.

Although the prediction of lung cancer has many advantages, the existing methodologies still have a few defects, so a new method needs to be implemented. Adaboost [31] has attained high sensitivity and best performance, and it is very simple to implement, but it is very sensitive to noisy data. DT [32] is simple to interpret, should be taken as the minimal decision standard of work-relatedness for lung cancer, is the best predictor by attaining high accuracy, and automatically prunes to a very short three-level depth. However, the running time of training algorithms does not scale well with the size of the training set. SVM [34] is used to build n-hyperplanes and n-features for dividing each different class apart with maximal margin, and it improves the classification power and robustness. Yet, many parameters need to be set accurately for attaining the best results. Gene Expression Programming [35] has a better solution for predicting lung cancer difficulties, and has high accuracy. However, there are some disadvantages: if the models are easy to manipulate, they lose functional complexity. Dynamic Bayesian Networks [36] have demonstrated high discrimination and predictive power, and are used to acquire the probability of a positive biopsy outcome for a given individual. There are a few challenges, however: a longer search time might affect the performance. POMDP [38] optimizes lung cancer prediction during the improvement of test specificity, and it reduces the false positive rates, but its performance needs to be improved. Hence, a new model needs to be introduced that provides the best performance, and the above limitations are useful for the development of the new method.

4. Research objectives

The objectives of this research work are as follows.

1. To review various state-of-the-art lung cancer prediction models and develop a new feature extraction model.
2. To compare the symptoms of cancer for early notification.
3. To design and develop a deep learning model to predict lung cancer.
4. To validate the proposed model by comparing it with other conventional models.
5. To send automatic notification on detecting the cancer.

5. Discussion

The latest research on predicting cancer using ML & DL techniques is discussed in this study. Further to the short details of the ML & DL field and of the data preprocessing techniques, the selection techniques and the classification algorithms that

were employed, we discussed three specific case studies based on popular ML tools, concerning the prediction of cancer susceptibility, cancer recurrence and cancer survival. Clearly, a huge number of ML & DL concepts released over the past decade produce precise outputs regarding particular cancer predictions. Moreover, it is crucial for the extraction of clinical decisions to identify potential problems, including experimental design, collecting suitable samples of data and validating classified results. Moreover, despite claims that the ML classification methods contribute to appropriate and efficient decision-making, very few have in fact entered clinical practice. Recent advances in omics technology have led us to better understand a wide range of diseases, but the validation results need to be accurate before gene expression signatures can be used in hospitals. In general, only a few labelled samples are available. The small number of data samples is a very frequent drawback observed in the research surveyed in this article. A basic requirement in the use of classification schemes to model a disease is that the training datasets be large enough. A relatively large dataset makes it possible to divide the data adequately into training and test sets and therefore to validate the estimators reasonably. A small training sample, compared with the dimensionality of the data, can result in misclassifications, while the estimators can become unstable and biased. It is clear that a richer set of patients used for survival prediction may improve the capacity of the predictive model. Apart from data size, the quality of the dataset and the feature selection schemes are important for efficient ML and DL and thus for precise cancer prediction. Using feature selection methods to select the most informative subset of features for training could lead to robust models. Feature sets consisting of histology and pathology assessments are also characterized by reproducible values. Given that such features are not static entities, it is essential that ML & DL techniques can be adapted to multiple feature sets over time. We also found that SVM and ANN classifiers are among the most frequently used ML algorithms for cancer prediction [35]. As discussed in our introductory section, ANNs have been widely used for nearly 30 years [40]. SVMs are a newer method for cancer prediction but have already been widely adopted because of their reliable predictive results. However, the selection of the best algorithm depends on a large number of parameters, which include the types of data collected, the sample size, time limits and the type of prediction results. New methods for overcoming the above-mentioned limitations should be explored regarding the future of cancer modeling. More accurate results and reasoned conclusions would be obtained through efficient quantitative analysis of the heterogeneous data sets used. Further research on the basis of more public databases, which gather valid cancer data for all diagnosed patients, is needed. Their use by scholars will allow their modeling studies to generate relevant outputs and support integrated clinical decision-making.

6. Conclusion

The whole study explains and compares the findings of various machine learning and deep learning techniques applied to cancer prognosis. Specifically, several trends have been identified relating to the kinds of machine learning techniques used, the kinds of training data incorporated, the kinds of endpoint predictions made, the sorts of cancers investigated, and the overall performance of cancer prediction or outcome methods. While ANNs continue to be prevalent, it is clear that a broader variety of alternative learning approaches is also being used to predict at least three different cancer types. Furthermore, it is clear that machine learning methods typically increase the efficiency or predictive accuracy of most prognoses, in particular when compared with conventional statistical or expert systems. Although most studies are generally well designed and fairly well validated, more attention to the planning and implementation of experiments is desirable, in particular with regard to the quantity and quality of biological data. Improving the experimental design and the biological validation of several machine classification systems would undoubtedly increase the general quality, replicability and reproducibility of many systems. Overall, we believe that the usage of machine learning & deep learning classifiers will probably become quite common in many clinical and hospital settings if the quality of studies continues to improve. The integration of multifaceted heterogeneous data, combined with the application of different analytical and classification methods, can offer a promising tool for cancer inference and for predicting the disease.

In future, by using the proposed framework, we would like to use other state-of-the-art machine learning algorithms and feature extraction methods to allow a more intensive comparative analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Chao Tan, Hui Chen, Chengyun Xia, Early prediction of lung cancer based on the combination of trace element analysis in urine and an Adaboost algorithm, J. Pharm. Biomed. Anal. 49 (3) (2009) 746–752.
[2] D.-H. Tae-Woo Kim, Chung-Yill Park, Decision tree of occupational lung cancer using classification and regression analysis, Safety Health Work 1 (2) (2010) 140–148.
[3] M. Zięba, J.M. Tomczak, Marek Lubicz, Jerzy Świątek, Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients, Appl. Soft Comput. 14 (2014) 99–108.
[4] Worrawat Engchuan, Jonathan H. Chan, Pathway activity transformation for multi-class classification of lung cancer datasets, Neurocomputing 165 (2015) 81–89.
[5] H. Azzawi, J. Hou, Y. Xiang, R. Alanni, Lung cancer prediction from microarray data by gene expression programming, IET Syst. Biol. 10 (5) (2016) 168–178.
[6] P. Petousis, S.X. Han, Denise Aberle, Alex A.T. Bui, Prediction of lung cancer incidence on the low-dose computed tomography arm of the National Lung Screening Trial: a dynamic Bayesian network, Artif. Intell. Med. 72 (2016) 42–55.
[7] C.M. Lynch, J.D. Behnaz Abdollahi, A. Fuqua, R. de Carlo, James A. Bartholomai, Rayeanne N. Balgemann, Victor H. van Berkel, Hermann B. Frieboes, Prediction of lung cancer patient survival via supervised machine learning classification techniques, Int. J. Med. Inf. 108 (2017) 1–8.
[9] D.S. Rao, D.P. Tripathy, Optimization of machinery noise using Genetic Algorithm, Noise Conference 2017, Michigan, 2017, 527–537.
[10] P. Petousis, A. Winter, W. Speier, D.R. Aberle, W. Hsu, A.A.T. Bui, Using sequential decision making to improve lung cancer screening performance, IEEE Access 7 (2019) 119403–119419.
[12] V. Krishnaiah, G. Narsimha, C. Subhash, Diagnosis of lung cancer prediction system using data mining classification techniques, Int. J. Comp. Sci. Inf. Technol. 4 (1) (2013) 39–45.
[13] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.

[15] L. Demidova, I. Klyueva, Y. Sokolova, N. Stepanov, N. Tyart, Intellectual approaches to improvement of the classification decisions quality on the base of the SVM classifier, Procedia Comput. Sci. 103 (2017) 222–230.
[16] N. Picco, R.A. Gatenby, A.R.A. Anderson, Stem cell plasticity and niche dynamics in cancer progression, IEEE Trans. Biomed. Eng. 64 (3) (2017) 528–537.
[18] Paweł Krawczyk, Tomasz Kucharczyk, Kamila Wojas-Krawczyk, Screening of Gene Mutations in Lung Cancer for Qualification to Molecularly Targeted Therapies, INTECH Open Access Publisher, 2012.
[19] A. Colquhoun, L. McHugh, E. Tulchinsky, M. Kriajevska, J. Mellon, Combination treatment with ionising radiation and Gefitinib ('Iressa', ZD1839), an epidermal growth factor receptor (EGFR) inhibitor, significantly inhibits bladder cancer cell growth in vitro and in vivo, J. Radiat. Res. 48 (5) (2007) 351–360.
[20] E. Adetiba, O.O. Olugbara, Lung cancer prediction using neural network ensemble with histogram of oriented gradient genomic features, Sci. World J. (2015).
[21] S.S. Alahmari, D. Cherezov, D.B. Goldgof, L.O. Hall, R.J. Gillies, M.B. Schabath, Delta radiomics improves pulmonary nodule malignancy prediction in lung cancer screening, IEEE Access 6 (2018) 77796–77806.
[22] S. Park, S.J. Lee, E. Weiss, Y. Motai, Intra- and inter-fractional variation prediction of lung tumors using fuzzy deep learning, IEEE J. Transl. Eng. Health Med. 4 (2016) 1–12.
[23] A. Raweh, M. Nassef, A. Badr, A hybridized feature selection and extraction approach for enhancing cancer prediction based on DNA methylation, IEEE Access 6 (2018) 15212–15223.
[24] J. Pati, Gene expression analysis for early lung cancer prediction using machine learning techniques: an eco-genomics approach, IEEE Access 7 (2019) 4232–4238.
[25] B. Zhang et al., Ensemble learners of multiple deep CNNs for pulmonary nodules classification using CT images, IEEE Access 7 (2019) 110358–110371.
[26] C. Arunkumar, S. Ramakrishnan, Prediction of cancer using customised fuzzy rough machine learning approaches, Healthcare Technol. Lett. 6 (1) (2019) 13–18.
[27] H. Guo, U. Kruger, G. Wang, M.K. Kalra, P. Yan, Knowledge-based analysis for mortality prediction from CT images, IEEE J. Biomed. Health. Inf. 24 (2) (2020) 457–464.
[28] J. Yang, N. Li, S. Fang, K. Yu, Y. Chen, Semantic features prediction for pulmonary nodule diagnosis based on online streaming feature selection, IEEE Access 7 (2019) 61121–61135.
[29] Raja Mohammad Taisir Masadeh, Basel A. Mahafzah, Ahmad Abdel-Aziz Sharieh, Sea lion optimization algorithm, Int. J. Adv. Comp. Sci. Appl. 10 (5) (2019) 388–395.
[31] A. Jemal, F. Bray, M.M. Center, J.J. Ferlay, E. Ward, D. Forman, CA A Cancer J. Clin. 61 (2) (2011) 69–90.
[32] D. Delen, G. Walker, A. Kadam, Predicting breast cancer survivability: a comparison of three data mining methods, Artif. Intell. Med. 34 (2) (2005) 113–127.
[34] D. Delen, N. Patil, Knowledge Extraction from Prostate Cancer Data, Proceedings of the 39th Annual Hawaii International Conference on, vol. 5, 2006.
[35] M. Hoogendoorn, L.M.G. Moons, M.E. Numans, R.-J. Sips, Utilizing data mining for predictive modeling of colorectal cancer using electronic medical records, in: International Conference on Brain Informatics and Health BIH 2014: Brain Informatics and Health, 2014, pp. 132–141.
[36] R. Al-Bahrani, A. Agrawal, A. Choudhary, Colon cancer survival prediction using ensemble data mining on SEER data, 2013 IEEE International Conference on Big Data, Silicon Valley, CA, 2013, pp. 9–16.
[38] C.M. Lynch, V.H.V. Berkel, H.B. Frieboes, Application of unsupervised analysis techniques to lung cancer patient data, PLoS One 12 (9) (2017).
[40] N. Arshadi, I. Jurisica, Data mining for case-based reasoning in high-dimensional biological domains, IEEE Trans. Knowl. Data Eng. 17 (8) (2005) 1127–1137.

Further Reading

[8] D.S. Rao, D.P. Tripathy, Optimization of machinery noise using Differential Evolution algorithm, Int. J. Min. Mineral Eng. 8 (4) (2017) 294–309.
[11] D.S. Rao, D.P. Tripathy, A Genetic Algorithm approach for optimization of machinery noise calculations, Noise Vibr. Worldwide 50 (4) (2019) 112–123.
[14] David Meyer, Friedrich Leisch, Kurt Hornik, The support vector machine under test, Neurocomputing 55 (1–2) (2003) 169–186.
[17] W. Kim, K.S. Kim, J.E. Lee, D.Y. Noh, S.W. Kim, Y.S. Jung, M.Y. Park, R.W. Park, Development of novel breast cancer recurrence prediction model using support vector machine, J. Breast Cancer 15 (2) (2012) 230–238.
[30] Z.W. Huang, A. Mcwilliams, H. Lui, D. Mclean, S. Lan, H.S. Zeng, Near-infrared Raman spectroscopy for optical diagnosis of lung cancer, Int. J. Cancer 107 (6) (2003) 1047–1052.
[33] D. Delen, Analysis of cancer data: a data mining approach, Expert Syst. 20 (1) (2009) 100–112.
[37] D. Fradkin, I. Muchnik, D. Schneider, Machine Learning Methods in the Analysis of Lung Cancer Survival Data, DIMACS Technical Report, 2005.
[39] D. Chen, K. Xing, D. Henson, L. Sheng, A.M. Schwartz, X. Cheng, Developing prognostic systems of cancer patients by ensemble clustering, J. Biomed. Biotechnol. (2009).
[41] G. Dimitoglou, J.A. Adams, C.M. Jim, Comparison of the C4.5 and a naive bayes classifier for the prediction of lung cancer survivability, J. Comput. (2012).
[42] A. Agrawal, S. Misra, Ramanathan Narayanan, Lalith Polepeddi, Alok Choudhary, Lung cancer survival prediction using ensemble data mining on seer data, Sci. Program. 20 (1) (2012) 29–42.
[43] S.M. Agrawal, R. Narayanan, L. Polepeddi, A. Choudhary, A lung cancer outcome calculator using ensemble data mining on SEER data, Proceedings of the Tenth International Workshop on Data Mining in Bioinformatics, 2011.
[44] D.L. Tong, A.C. Schierz, Hybrid genetic algorithm-neural network: feature extraction for unpreprocessed microarray data, Artif. Intell. Med. 53 (1) (2011) 47–56.
[45] M.R. Mohebian, H.R. Marateb, M. Mansourian, Miguel Angel Mañanas, Fariborz Mokarian, A hybrid computer-aided-diagnosis system for prediction of breast cancer recurrence (HPBCR) using optimized ensemble learning, Comput. Struct. Biotechnol. J. 15 (2017) 75–85.
[46] M. Zięba, J.M. Tomczak, M. Lubicz, J. Świątek, Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life
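
Worked example. Case studies 2 and 3 rely on two simple checks: the sample-per-feature ratio (SFR) and the area under the ROC curve. The sketch below shows how both could be computed. The patient and feature counts are the ones quoted in the case studies; the labels, scores and function names are hypothetical and serve only as an illustration.

```python
def sample_per_feature_ratio(n_samples, n_features):
    """Number of samples available per feature used by the classifier."""
    return n_samples / n_features


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a
    positive case receives a higher score than a negative case
    (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Figures quoted in the case studies.
sfr_case_2 = sample_per_feature_ratio(56, 17)    # 56 patients, 17 gene features
sfr_case_3 = sample_per_feature_ratio(2441, 7)   # 2441 patients, 7 variables

# Hypothetical recurrence labels (1 = recurrence) and model scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(labels, scores)
```

With these inputs the SFR of case study 2 falls below the recommended minimum of five, while that of case study 3 lies far above it. In the same way, two models such as the ANN and the TNM system can be compared by computing `roc_auc` for each on the same patients.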
