Lung Cancer Project

Project Analysis Report
On
“LUNG CANCER PREDICTION SYSTEM”
Group No.09
Submitted by
Sumit Mishra (2104500100060)
Abhiyansh Gupta (2104500100005)
Yash Vardhan Gupta (2104500100069)
Project Guide
Ms. Imana Azram
Submitted To
Mr. Pradeep Kumar Maurya
(Project-In-Charge)
Department of Computer Science and Engineering
Shri Ram Murti Smarak College of Engineering, Technology & Research, Bareilly
TABLE OF CONTENTS
DECLARATION
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
LIST OF FIGURES
INTRODUCTION
MOTIVATION
PROBLEM STATEMENT
OBJECTIVE
LITERATURE REVIEW
TOOLS AND TECHNOLOGY
METHODOLOGY
PROJECT OUTCOME
APPLICATION
CONCLUSION
REFERENCES
DECLARATION
I hereby declare that this submission is my own work and that, to the best of my
knowledge and belief, it contains no material previously published or written by
another person nor material which to a substantial extent has been accepted for
the award of any other degree or diploma of the university or other institute of
higher learning, except where due acknowledgment has been made in the text.
Signature…………………………
Name…………………………….
Roll No………………………......
Date………………………………
Name…………………………….
Roll No………………………......
Date………………………………
Name…………………………….
Roll No………………………......
Date………………………………
CERTIFICATE
This is to certify that the Mini Project Report entitled “LUNG CANCER
PREDICTION SYSTEM”, which is submitted by Sumit Mishra (2104500100060),
Abhiyansh Gupta (2104500100005) and Yash Vardhan Gupta (2104500100069), is a
record of the candidates' own work carried out by them under my supervision.
The matter embodied in this work is original and has not been submitted for the
award of any other work or degree.
Mr. Pradeep Kumar Maurya

Project Incharge (CSE) Supervisor
ACKNOWLEDGEMENT
It gives us a great sense of pleasure to present the report of the B.Tech Mini
Project undertaken during B.Tech, Third Year. We owe a special debt of gratitude
to Assistant Professor Ms. Imana Azram (CSE Department), SRMS CET&R, Bareilly
for her constant support and guidance throughout the course of our work. Her
sincerity, thoroughness and perseverance have been a constant source of
inspiration for us. It is only through her cognizant efforts that our endeavors
have seen the light of day.
We also take the opportunity to acknowledge the contribution of Prof.
L.S. Maurya, Principal, SRMS CET&R, Bareilly for his full support and assistance
during the development of the project.
We also do not like to miss the opportunity to acknowledge the contribution of
all the faculty members of the department for their kind assistance and
cooperation during the development of our project. Last but not the least, we
acknowledge our friends for their contribution in the completion of the project.
Name……………………………. Name…………………………….
Roll No………………………...... Roll No………………………......
Date………………………………
Date………………………………
Name…………………………….
Roll No………………………......
Date………………………………
ABSTRACT
Lung cancer is a disease in which cells in the lungs multiply uncontrollably.
The mortality rate among both men and women has risen with the increasing
incidence of cancer. Lung cancer cannot be fully prevented, but its risk can be
reduced, so detecting it at the earliest possible stage is crucial for the
survival of patients. The number of people affected by lung cancer is directly
proportional to the number of chain smokers. In this project, lung cancer
prediction was carried out using a classification algorithm called Logistic
Regression. The objectives include early detection, risk profiling,
and improved prognostication, contributing to personalized medicine and
optimized healthcare resource allocation. Ethical considerations guide the
integration of the model into clinical workflows, ensuring data privacy and
responsible use. The results demonstrate the model’s effectiveness in providing
actionable insights for healthcare professionals, fostering timely interventions, and
potentially reducing the burden of late-stage lung cancer. This predictive model
serves as a valuable tool in the ongoing efforts to enhance lung cancer screening,
patient care, and public health outcomes.
LIST OF FIGURES
Figure No. 1
Figure No. 2
Figure No. 3
Figure No. 4
Figure No. 5
Figure No. 6
Figure No. 7
Figure No. 8
INTRODUCTION
The objective of this study is to investigate and predict lung diseases with the
help of machine learning algorithms. The most common lung diseases include
asthma, allergies, chronic obstructive pulmonary disease (COPD), bronchitis,
emphysema and lung cancer. It is important to predict the likelihood of a lung
disease before it occurs, so that individuals can be cautious and take the
necessary steps in time. In this paper, we have worked with a collection of data
and classified it with various machine learning algorithms.
We have collected 323 instances with 19 attributes. These data have been
collected from patients suffering from numerous lung diseases along with other
symptoms. The lung disease attribute contains two categories, ‘Positive’ and
‘Negative’. ‘Positive’ means that the person has a lung disease, and ‘Negative’
means that the person does not. The dataset has been trained with the K-Fold
Cross Validation technique, and five machine learning algorithms have been used:
Bagging, Logistic Regression, Random Forest, Logistic Model Tree and Bayesian
Networks. The accuracies of these algorithms are 88.00%, 88.92%, 90.15%, 89.23%
and 83.69%, respectively.
To identify and classify lung cancer efficiently, this study introduces a
methodology for designing an effective ML classification model. The model is
specifically designed to extract crucial information from medical images while
conserving image energy. Furthermore, this study presents novel algorithms for
multiview medical image registration and fusion, specifically applied to CT
scans. To accomplish this, it is crucial to train our model on a large dataset
that covers a wide range of cancer instances. As a result, our research focuses
on applying our proposed technique to a comprehensive and well-organized dataset
called LIDC-IDRI (Lung Image Database Consortium and Image Database Resource
Initiative). Additionally, several preprocessing methods are applied to the
dataset to enhance its quality. To enhance the robustness and accuracy of our
model, we incorporate k-fold cross-validation during the training process.
This technique allows for comprehensive validation and assessment of the model’s
performance. Moreover, we apply feature extraction techniques to the dataset to
extract the information that is crucial for effective model training. To further
improve the performance of cancer classification, we employ a ResNet-18 model
in our implementation. By using this model, we are able to achieve enhanced
predictive capabilities and better overall performance in classifying cancer
cases. This approach ensures a more robust and reliable model for accurate
cancer diagnosis.
In this study, the performance of our proposed model relies on the effective
utilization of a multilayer convolutional neural network (CNN). The ResNet-18
model is specifically designed to identify tumors accurately and classify their
respective stages. This model demonstrates promising results in diagnosing and
classifying tumors based on their stage. During the evaluation of our models,
we considered several performance indicators to assess their effectiveness.
These indicators include precision, recall, accuracy, mutual information (MI),
normalized cross-correlation (NCC), peak signal-to-noise ratio (PSNR), and
root-mean-square error (RMSE). By analyzing the models from various aspects and
considering the classifier algorithm employed in the proposed model, we achieved
a high accuracy of 98.2% in the detection of lung cancer.
Therefore, the main aim of this research paper is to propose and demonstrate
the effectiveness of a novel approach to detecting and classifying lung cancer
using advanced ML techniques. This approach involves the development of an
efficient ML classification model, the utilization of feature extraction
methods, the exploration of multiview medical image registration and fusion
algorithms, and evaluation using performance indicators. The research aims to
contribute to the field by providing a robust and accurate solution for lung
cancer detection, ultimately improving patient outcomes and advancing medical
applications in this domain.
Lung cancer is one of the deadliest cancers because it affects the lungs
directly, and the lungs are the organs that remove carbon dioxide from the blood
and supply it with oxygen. The danger is most severe for those who smoke
frequently, but secondhand smoke and exposure to certain toxins also cause the
disease, which affects the future of the next generation. The treatments can
include radiation therapy, immunotherapy, surgery, and drug therapy. Machine
learning technologies may be able to minimize variability in nodule
categorization, enhance decision-making, and, as a result, reduce the number of
benign nodules that are followed or worked up unnecessarily. The main motive is
to propose a model that can predict lung cancer and help prevent it at an early
stage. The algorithm achieved an accuracy of 99 percent.
From 2005 to 2015, lung cancer incidence rates decreased by 2.5% per year in
men and 1.2% per year in women. Symptoms do not usually occur until the cancer
is advanced, and may include persistent cough, sputum streaked with blood, chest
pain, voice change, worsening shortness of breath, and recurrent pneumonia or
bronchitis. We therefore use a machine learning technique to detect cancer in
its early stages. Here we use Logistic Regression to classify the dataset.
MOTIVATION
The motivation for developing a lung cancer prediction model is rooted in the
urgent need to address the significant impact of lung cancer on global public
health. Several compelling reasons underscore the importance of investing in and
advancing predictive models for lung cancer:
• High Mortality Rates: Lung cancer is one of the leading causes of
cancer-related deaths worldwide. The mortality rates associated with lung
cancer are alarming, emphasizing the critical need for early detection and
intervention to improve patient survival.
• Late-stage Diagnosis: A substantial number of lung cancer cases are
diagnosed at advanced stages, limiting the effectiveness of available
treatments. Early detection is crucial for enabling more successful treatment
options and improving overall prognosis.
• Limited Screening Tools: Current screening methods for lung cancer, such
as low-dose computed tomography (LDCT), have limitations in terms of
cost, accessibility, and potential overdiagnosis. Developing a reliable
prediction model can complement existing screening methods and aid in the
identification of high-risk individuals who may benefit from more intensive
screening.
• Optimizing Healthcare Resources: Lung cancer imposes a significant
burden on healthcare systems due to the costs associated with late-stage
treatments and decreased productivity. A predictive model can help optimize
resource allocation by focusing on early detection and preventive measures,
potentially reducing the economic impact of lung cancer.
• Advancements in Technology: The rapid advancements in machine
learning and artificial intelligence present an opportunity to leverage these
technologies for the benefit of healthcare. Developing a sophisticated lung
cancer prediction model harnesses the power of these tools to analyze
complex data sets and derive meaningful insights.
PROBLEM STATEMENT
• Late-Stage Diagnosis and High Mortality in Lung Cancer:

The problem we aim to address is the high mortality rate associated with
lung cancer due to late-stage diagnoses. Currently, lung cancer is often
detected at an advanced stage, limiting treatment options and leading to poor
outcomes. Our project seeks to develop a predictive model to identify lung
cancer in its early stages, thereby improving the chances of successful
treatment and reducing mortality rates.
• Early Detection: Develop a predictive model that can accurately identify
potential lung cancer cases at an early stage, allowing for timely and
effective intervention.
• Integration of Diverse Data Sources: Integrate and analyze diverse
datasets, including medical imaging (X-rays, CT scans), patient
demographics, and clinical histories, to enhance the predictive accuracy of
the system.
• Robust Feature Selection: Identify and prioritize the most relevant features
and biomarkers associated with lung cancer to improve the efficiency and
interpretability of the prediction model.
• Validation and Generalization: Validate the predictive model on diverse
and representative datasets to ensure its generalizability across different
populations and demographic groups.
• Interpretability and Explainability: Design the system to provide clear
explanations for its predictions, ensuring that healthcare professionals can
trust and understand the decision-making process.
• The successful development of an intelligent lung cancer prediction system
addresses these challenges, contributing to improved patient outcomes,
reduced healthcare costs, and advancements in the field of early cancer
detection.
OBJECTIVE
• To develop a highly accurate predictive model for early detection of lung
cancer, aiming to reduce mortality rates by enabling timely intervention and
treatment.
• To improve the overall prognosis and quality of life for individuals at risk
of this devastating disease.
• To design the data input process so that it avoids errors and guides the
management toward getting correct information from the computerized system.
• To identify individuals at an early stage who are at risk of developing lung
cancer before symptoms manifest, enabling timely and effective
intervention.
• To stratify individuals into different risk groups based on their likelihood
of developing lung cancer. This allows for targeted screening and preventive
measures for high-risk populations.
• To provide accurate predictions to assist healthcare professionals in
estimating the likelihood of lung cancer, contributing to better prognosis and
treatment planning.
• To facilitate the efficient allocation of healthcare resources by focusing on
individuals with an elevated risk, reducing unnecessary screenings for
low-risk populations, and optimizing resource utilization.
• To contribute to the era of personalized medicine by tailoring screening and
intervention strategies based on individual risk profiles, genetic factors, and
other relevant variables.
• To potentially reduce healthcare costs associated with late-stage lung cancer
treatments by promoting early detection and preventive measures.
LITERATURE REVIEW
A literature review plays a vital role in the project. It helps in gaining
detailed knowledge of the basic ideas to focus on and in collecting information
from different perspectives. Through the literature review we learn how to
prioritise the work and complete it as intended. It also lets us weigh the pros
and cons of adopting a methodology, which supports decision making and makes
the work more efficient. The studies reviewed on lung cancer detection using
machine learning are summarised below.
1. A Review of most Recent Lung Cancer Detection Techniques using
Machine Learning. Nawzat Ahmed, February 2021, International Journal of
Science and Business. Lung cancer is a dangerous form of cancer and is difficult
to detect. Many techniques have been developed in recent years to diagnose lung
cancer, most of them utilizing CT scan images and some of them using X-ray
images. In addition, multiple classifier methods are paired with numerous
segmentation algorithms to use image recognition to identify lung cancer
nodules. The study found that CT scan images are more suitable for obtaining
accurate results; therefore, CT scan images are mostly used for detection of
cancer. The extracted features are fed to a specified classifier to classify
them as normal or malignant accordingly. Many classifiers have been used by
researchers in the literature, such as multi-layer perceptron (MLP), SVM, Naïve
Bayes, Neural Network, Gradient Boosted Tree, Decision Tree, k-nearest
neighbors, random forest, multinomial naïve Bayes, stochastic gradient descent,
and ensemble classifiers. The highest accuracy among these, about 97%, was
obtained by (Alam et al., 2018) using a multi-class SVM classifier together with
marker-controlled watershed-based segmentation for image segmentation. On the
other hand, all the works implemented using Deep Learning methods obtained high
accuracy, where the highest result was about 99% by (Li et al., 2020) using
multi-resolution patch-based CNNs.
2. An Extensive Review on Lung Cancer Detection Using Machine Learning
Techniques: A Systematic Study. Debnath Bhattacharya, 24 March 2020, Revue
d’Intelligence Artificielle. The main objective of this research paper is to
investigate the accuracy levels of various machine learning algorithms and
classifiers. The detection rate for lung cancer using CT is 2.6 to 10 times
greater than with analogue radiography. To overcome the issues of manual
diagnosis and to bring down the workload, computer-aided detection (CAD)
methods focus on processing the diagnostic imaging data and detecting latent
lesions. The model was tested on a collection of 453 CT images of patients, of
which 217 were used as the training set; validation achieved an overall
accuracy of 82.9%.
3. Lynch, Chip M., et al. "Prediction of lung cancer patient survival via
supervised machine learning classification techniques." International journal
of medical informatics 108 (2017). International journal of advance scientific
research and engineering trends. In pre-processing, the input CT image is
processed to improve its quality. This enhanced version contributes to the
further steps of any automated system. Image segmentation is the process in
which a digital image is partitioned into multiple segments. In the case of
images, segments correspond to sets of pixels, or superpixels. In image
processing, Otsu's method is used to automatically perform clustering-based
image thresholding. It reduces a grey-level image to a binary image. The
algorithm assumes that the image contains two classes of pixels following a
bimodal histogram (foreground pixels and background pixels); it then computes
the optimum threshold value which separates the two classes. The Sobel filter
is used for calculating the gradient for edge detection. In the MATLAB Image
Processing Toolbox, fspecial('sobel') is used for Sobel filtering. Grey-Level
Co-Occurrence Matrix: A statistical method of examining texture that considers
the spatial relationship of pixels in an image is the grey-level co-occurrence
matrix (GLCM), also known as the grey-level spatial dependence matrix.
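The thresholding step described above can be illustrated with a minimal sketch of Otsu's method. It is written in plain Python for a flat list of grey levels and is not the implementation used in the reviewed paper; the function name `otsu_threshold` and the six sample pixel values are hypothetical and chosen only for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises the between-class variance.

    Pixels with value <= t form one class and the remaining pixels the other.
    """
    # Histogram of grey levels.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_between = 0, -1.0
    w0 = 0    # number of pixels in the lower class so far
    sum0 = 0  # sum of grey levels in the lower class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so no valid split at this level
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor of 1 / total**2).
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_t, best_between = t, between
    return best_t


# Hypothetical bimodal "image": three dark pixels and three bright pixels.
sample = [10, 12, 11, 200, 205, 198]
t = otsu_threshold(sample)
binary = [1 if p > t else 0 for p in sample]
```

The resulting `binary` list is the two-level image: dark pixels map to 0 and bright pixels to 1.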
4. Using Multi-level Convolutional Neural Network for Classification of Lung
Nodules on CT images. Juan Lyu, Sai Ho Ling, Senior Member, IEEE, 2018,
IEEE. Lung cancer is one of the four major cancers in the world. Accurate
diagnosis of lung cancer in the early stage plays an important role in
increasing the survival rate. Computed Tomography (CT) is an effective method
to help the doctor detect lung cancer. In this paper, the authors developed a
multi-level convolutional neural network (ML-CNN) to investigate the problem of
lung nodule malignancy classification. In ML-CNN, there are two convolution
layers followed by batch normalization (BN) [21] and pooling layers. BN is used
after the convolution operation and before the activation operation. It is used
to reduce the internal covariate shift. Covariate shift is the problem that
arises when the distribution of network activations changes between the
training and production stages. ML-CNN has 3 levels, which share the same
structure and the same number of feature maps in the last convolution step.
However, their convolutional kernels are different.
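To make the role of batch normalization concrete, the following is a minimal sketch of the BN forward pass for a single feature over one mini-batch. It is not taken from the ML-CNN paper; the function name `batch_norm`, the default values of `gamma`, `beta` and `eps`, and the sample activations are assumptions made for illustration.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift it.

    gamma and beta are the learnable scale and shift; eps avoids division
    by zero when the batch variance is very small.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Hypothetical activations of one feature map position across four samples.
activations = [1.0, 2.0, 3.0, 4.0]
normalised = batch_norm(activations)
```

After normalisation the values have approximately zero mean and unit variance, which keeps the input distribution of the next layer stable during training.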
5. Lung cancer Prediction and Classification based on Correlation Selection
method Using Machine Learning Techniques. Qubahan Academic Journal.
Lung cancer is one of the leading causes of mortality in every country. This
paper examines the accuracy of three classifiers, Support Vector Machine (SVM),
K-Nearest Neighbor (KNN) and Convolutional Neural Network (CNN), that classify
lung cancer at an early stage so that many lives can be saved. The experimental
results show that SVM gives 85.56%, CNN gives 92.11% and KNN gives 88.40%. The
confusion matrix is a visual assessment method used in machine learning. It is
a square matrix in which the rows represent the real class of the instances and
the columns represent their predicted class. This matrix includes all the raw
data about a classification model's predictions on a given dataset and is used
to determine how accurate the model is. For binary classification, the
confusion matrix is a 2 x 2 matrix that reports the number of true positives
(TP), true negatives (TN), false positives (FP), and false negatives (FN).
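As an illustration of how the 2 x 2 confusion matrix is filled in, the sketch below counts TP, FN, FP and TN from two label lists. The function name `confusion_matrix` and the six example labels are hypothetical; rows hold the real class and columns the predicted class, following the layout described above.

```python
def confusion_matrix(actual, predicted, positive=1):
    """Return [[TP, FN], [FP, TN]] for a binary classification problem.

    Rows are the real class (positive, negative); columns are the
    predicted class (positive, negative).
    """
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1  # real positive predicted as negative
        elif p == positive:
            fp += 1  # real negative predicted as positive
        else:
            tn += 1
    return [[tp, fn], [fp, tn]]


# Hypothetical labels: 1 = cancer, 0 = no cancer.
actual = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0]
matrix = confusion_matrix(actual, predicted)  # [[2, 1], [1, 2]]
accuracy = (matrix[0][0] + matrix[1][1]) / len(actual)
```

The diagonal entries are the correct predictions, so accuracy is their sum divided by the number of instances.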
6. Predicting Lung Cancer Survivability using SVM and Logistic Regression
Algorithms. Avijith Mandal, September 2017, International Journal of
Computer Applications. Lung cancer is one of the major and most frequent causes
of cancer deaths globally in terms of both incidence and mortality. The main
reasons behind the increase in deaths are late detection of the disease and
shortcomings in effective treatment, so early detection is needed to save lives
from this disease. The survivability rate of lung cancer can be predicted with
the help of modern machine learning techniques. Accordingly, it would be
valuable to determine the survival possibilities among the patients. In this
study, data cleaning, feature selection, splitting and classification
techniques have been applied to predict the survivability of lung cancer as
accurately as possible. The study reveals that the logistic regression
classifier gives the highest accuracy of 77.40%, compared to the support vector
machine classifier, which gives 76.20% accuracy. The logistic regression
classifier also gives the maximum classification accuracy compared with every
other classifier tested. The work can be further enhanced by modifying the
logistic regression classifier, which gives the highest accuracy.
7. Lung cancer Prediction and Classification Using Recurrent Neural
Network. V Raaga Varsini, 11 November 2021, International Journal of Research
in Engineering, Science and Management. There are two main types of lung
cancer, Small Cell Lung Cancer (SCLC) and Non-Small Cell Lung Cancer (NSCLC),
which develop and spread in their own ways. Non-small cell lung cancer has
three subtypes: adenocarcinomas, squamous cell carcinomas and large cell
carcinomas. Small/large cell cancer is a disease in which a patient shows the
symptoms of both types of cancer. Adenocarcinoma (NSCLC) is more common and
progresses more slowly than small cell lung cancer.
TOOLS AND TECHNOLOGY
• Machine learning
• HTML
• CSS
• Flask
• Logistic Regression Algorithm
• Random Forest Algorithm
• Bagging (Bootstrap Aggregating)
• Bayesian Networks Algorithm
• LMT (Logistic Model Tree) Algorithm
Machine Learning: Machine Learning (ML) is a subset of artificial intelligence
that focuses on teaching computer systems to learn and make predictions from
data without explicit programming.
ML algorithms use patterns and statistical techniques to continuously improve
their performance, enabling applications in diverse fields, from healthcare to
finance and beyond. The process of learning begins with observations or data,
such as examples, direct experience, or instruction, in order to look for
patterns in data and make better decisions in the future based on the examples
that we provide.
At its core, machine learning involves the development of algorithms and models
that can recognize patterns, extract meaningful insights, and adapt their
behavior based on experience. In supervised learning, algorithms are trained on
labeled datasets, where they learn to map input data to specific outputs.
Unsupervised learning, on the other hand, involves finding inherent patterns
and structures within unlabeled data. Reinforcement learning focuses on
training models through interaction with an environment, receiving feedback in
the form of rewards or penalties.
HTML: HyperText Markup Language (HTML) is the standard markup language used to
create web pages. It is a fundamental building block of the World Wide Web and
is often used in conjunction with CSS (Cascading Style Sheets) and JavaScript
to create interactive and visually appealing websites. Elements can be nested
within one another, creating a hierarchical structure. Attributes, which
provide additional information about elements, are specified within the opening
tag. HTML allows for the creation of links, images, lists, tables, and more,
providing a versatile framework for designing web pages.
CSS: Cascading Style Sheets (CSS) is a web technology that defines the visual
appearance of HTML elements, allowing web developers to create visually
appealing, responsive, and consistent web pages. CSS simplifies styling and
layout control, making it a fundamental tool for modern web design. Using a
selector-based syntax, developers can target specific HTML elements and apply
styles such as colors, fonts, spacing, and positioning. One of the key
advantages of CSS is its ability to separate the structure (HTML) from the
presentation (CSS) of a web page, promoting clean and maintainable code. CSS
also enables the creation of responsive designs through media queries, ensuring
that web content adapts to various screen sizes, from desktop monitors to
mobile devices.
Flask: Flask is a lightweight and versatile Python web framework, ideal for
developing web applications. It offers a minimalistic design that allows
developers to create web services quickly and efficiently. Flask is known for
its simplicity, flexibility, and extensive community support, making it a
popular choice for web development projects.
Logistic regression: Logistic regression is a statistical and machine learning
algorithm used for binary classification tasks, where the output variable has
two classes, typically labeled as 0 and 1. It models the probability of an
outcome based on one or more predictor variables. It is a linear model that
uses the logistic function to transform the output into a probability score
between 0 and 1. Despite its name, logistic regression is employed for
classification rather than regression tasks. It is simple to implement,
interpretable, and effective for problems like spam detection or medical
diagnosis, which is why it is widely used in machine learning. It is especially
useful when the relationship between the features and the log-odds of the
binary outcome is approximately linear. The model is trained using a method
called maximum likelihood estimation. The objective is to find the model
parameters that maximize the likelihood of the observed outcomes given the
input features. This is often done using optimization algorithms like gradient
descent.
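The training procedure described above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code of the project: the function names `fit_logistic` and `predict_proba`, the learning rate, the number of epochs and the one-feature toy dataset are all assumptions. Gradient descent is applied to the average negative log-likelihood, which is equivalent to maximizing the likelihood.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of class 1 for one feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights w and bias b by batch gradient descent."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi  # gradient of the log-loss
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        # Step against the average gradient.
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Hypothetical one-feature dataset: larger values belong to class 1.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
low_risk = predict_proba(w, b, [0.0])   # below 0.5
high_risk = predict_proba(w, b, [5.0])  # above 0.5
```

A prediction is turned into a class label by comparing the probability with a threshold, usually 0.5.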
Random Forest: Random Forest is an ensemble learning method used in machine
learning. It combines multiple decision trees to make predictions. Each tree in the
forest is trained on a random subset of the data and features, reducing overfitting.
Random Forest is versatile and can handle both classification and regression tasks,
often outperforming individual decision trees. It's used in applications like image
classification, financial modeling, and recommendation systems.
One key feature of Random Forest is the introduction of randomness during the
training process. Instead of using all features for each tree, a random subset of
features is selected at each split point. This helps in decorrelating the trees and
promotes diversity in the ensemble.
Bagging (Bootstrap Aggregating): Bagging is an ensemble machine learning
technique that aims to improve the stability and accuracy of models by training
multiple instances of the same learning algorithm on different subsets of the
training data. The key idea behind bagging is to reduce variance and minimize
overfitting by introducing diversity in the training process.
Bagging involves creating multiple bootstrap samples from the original training
dataset. Bootstrap sampling is random sampling with replacement, meaning that
each sample can contain multiple occurrences of the same data point, while
others might be omitted.
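A minimal sketch of bagging is given below, assuming a very simple base learner: a one-feature threshold "stump" that places its cut halfway between the class means. The names `bootstrap_sample`, `train_stump` and `bagging_predict`, the toy data and the number of models (11) are hypothetical and serve only to show bootstrap sampling followed by majority voting.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data at random, with replacement."""
    return [rng.choice(data) for _ in data]


def train_stump(sample):
    """Base learner: threshold halfway between the two class means.

    Assumes the positive class (label 1) tends to have larger feature values.
    """
    pos = [x for x, label in sample if label == 1]
    neg = [x for x, label in sample if label == 0]
    if not pos or not neg:
        majority = 1 if pos else 0  # sample contained a single class
        return lambda x: majority
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


def bagging_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]


# Hypothetical (feature, label) pairs.
data = [(1, 0), (2, 0), (3, 0), (7, 1), (8, 1), (9, 1)]
rng = random.Random(0)
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(11)]
prediction = bagging_predict(models, 8.5)
```

Each model sees a slightly different sample, and the vote smooths out the variation between them.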
Bayesian networks Algorithm: Bayesian networks, also known as belief networks
or Bayes nets, are a type of probabilistic graphical model that represents a
set of variables and their probabilistic dependencies in the form of a directed
acyclic graph (DAG). These networks are named after Bayesian probability
theory, which provides a framework for updating beliefs based on new evidence.
Bayesian networks can be constructed by domain experts or learned from data.
Learning involves estimating the parameters, that is, the conditional
probability tables (CPTs), and the structure of the graph from observed data.
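The idea of a DAG with conditional probability tables can be illustrated with a three-node chain, Smoker → Cancer → Cough. The sketch below is an assumption-laden toy example: the structure and every probability in the CPTs are hypothetical values chosen for illustration and are not estimated from the project dataset.

```python
# Hypothetical CPTs for the chain Smoker -> Cancer -> Cough.
P_SMOKER = 0.3                               # P(smoker)
P_CANCER_GIVEN = {True: 0.10, False: 0.01}   # P(cancer | smoker?)
P_COUGH_GIVEN = {True: 0.80, False: 0.20}    # P(cough | cancer?)


def joint(smoker, cancer, cough):
    """Joint probability as the product of each node given its parent."""
    p = P_SMOKER if smoker else 1 - P_SMOKER
    p *= P_CANCER_GIVEN[smoker] if cancer else 1 - P_CANCER_GIVEN[smoker]
    p *= P_COUGH_GIVEN[cancer] if cough else 1 - P_COUGH_GIVEN[cancer]
    return p


def posterior_cancer(smoker, cough):
    """P(cancer | smoker, cough), obtained by enumerating the joint."""
    with_cancer = joint(smoker, True, cough)
    without_cancer = joint(smoker, False, cough)
    return with_cancer / (with_cancer + without_cancer)


# Belief in cancer for a smoker rises once a cough is observed.
prior = P_CANCER_GIVEN[True]
updated = posterior_cancer(smoker=True, cough=True)
```

Observing the cough moves the belief from the prior of 0.10 to about 0.31 under these assumed tables, which is the kind of evidence-based update the network provides.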
LMT Algorithm: The logistic model tree (LMT) algorithm is a popular
classification method that combines a decision tree and logistic regression models.
The combination of two complementary algorithms produces an accurate and
interpretable classifier by combining the advantages of both logistic regression and
tree induction. However, LMT has the disadvantage of high computational cost,
which makes the algorithm undesirable in practice. In this paper, we propose an
efficient method to learn the logistic regression models in the tree. We employ
least angle regression to update the regression model in LogitBoost so that the
algorithm efficiently learns sparse logistic regression models composed of relevant
input variables. We compare the performance of our proposed method with the
original LMT algorithm using 14 benchmark datasets and show that the training
time dramatically decreases while the accuracy is preserved.
METHODOLOGY
The study can be divided into four main stages, which are given as follows:
• Data Collection
• Data Preprocessing
• Data Training
• Application of ML algorithms
Data Collection: The data for this analysis has been collected from some
hospitals in Dhaka. The National Institute of Diseases of the Chest and Hospital
and the National Institute of Cancer Research & Hospital (NICRH) helped us by
providing the majority of the data. A total of 323 instances and 19 attributes
have been collected, where each instance holds the information for an
individual patient.
Data Preprocessing: After gathering the data, we carried out the preprocessing.
In this stage we used two unsupervised filters from the widely used machine
learning platform WEKA (Waikato Environment for Knowledge Analysis). First, we
applied the ReplaceMissingValues filter to our dataset. This replaces all
missing values for nominal and numeric attributes with the modes and means,
respectively. Second, we applied the Randomize filter, which randomly shuffles
the order of the instances in the dataset.
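The mean and mode replacement of missing values can be illustrated with a small
Python analogue. This is a sketch of the idea only, not WEKA's implementation;
the function name replace_missing and the two example columns are hypothetical.

```python
from statistics import mean, mode


def replace_missing(column):
    """Fill None entries with the mean (numeric column) or mode (nominal column)."""
    present = [value for value in column if value is not None]
    numeric = all(isinstance(value, (int, float)) for value in present)
    fill = mean(present) if numeric else mode(present)
    return [fill if value is None else value for value in column]


# Hypothetical columns; the attribute names are illustrative only.
age = [45, None, 60, 51]
smoking = ["yes", "no", None, "yes"]

print(replace_missing(age))
print(replace_missing(smoking))
```

The fill value is computed from the observed entries only, so the missing
entries do not influence the mean or the mode.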
Data Training: The data was trained using the K-Fold Cross-Validation method of
WEKA. It is a resampling procedure that evaluates the prediction model by
splitting the original dataset into two parts, a training set and a test set.
The parameter K determines the number of groups into which the randomly shuffled
dataset is split; each group serves once as the test set while the remaining
groups form the training set.
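The splitting step of K-fold cross-validation can be sketched as follows. The
instance count of 323 comes from the data collection stage; K = 10, the seed and
the helper name k_fold_splits are assumptions made for illustration.

```python
import random


def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over shuffled indices."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k groups of nearly equal size
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 323 instances as reported above; k=10 is an assumed value.
for train, test in k_fold_splits(323, k=10):
    print(len(train), len(test))
```

Every instance appears in exactly one test fold, so each instance is used for
testing once and for training K - 1 times.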
Application of Machine Learning Algorithms: After the training phase,
classification was performed using various machine learning algorithms. Among
them, Bagging, Logistic Regression, Random Forest, Logistic Model Tree and
Bayesian Networks performed best. We therefore selected those five algorithms
for our model.
Model Evaluation: Evaluate the performance of the trained models using the
testing dataset. Common evaluation metrics for classification tasks include
accuracy, precision, recall, F1 score, and area under the receiver operating
characteristic (ROC) curve.
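The threshold-based metrics named above follow directly from the confusion
matrix counts. The sketch below computes them for a binary task; the label
vectors are hypothetical, and the area under the ROC curve is left out because
it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical labels: 1 = disease present, 0 = disease absent.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In a screening setting, recall is often watched most closely, because a false
negative corresponds to a missed case.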
Hyperparameter Tuning: Fine-tune the hyperparameters of your chosen models
to improve performance. Use techniques such as grid search or randomized search
to find the optimal hyperparameter values.
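Grid search simply scores every combination of candidate values and keeps the
best one. The sketch below shows the idea with a toy threshold model; the data,
the hyperparameter names scale and threshold, and the helper grid_search are all
hypothetical. In practice the score would come from cross-validation rather than
from the training data itself.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score


# Toy data and model (hypothetical): predict 1 when feature * scale > threshold.
X = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]


def accuracy(scale, threshold):
    preds = [1 if x * scale > threshold else 0 for x in X]
    return sum(p == t for p, t in zip(preds, y)) / len(y)


best_params, best_score = grid_search(
    {"scale": [1, 2], "threshold": [1, 3, 5, 7]}, accuracy
)
print(best_params, best_score)
```

Randomized search differs only in that it samples a fixed number of combinations
instead of enumerating all of them.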
Validation and Cross-Validation: Perform validation on additional datasets if
available. Implement cross-validation techniques, such as k-fold cross-validation,
to assess the model's robustness and generalizability.
Interpretability and Explainability: Depending on the application, consider the
interpretability of your model. Understand how the model makes predictions and
ensure that it provides explanations that can be understood by healthcare
professionals.
Fig.No.1
PROJECT OUTCOME
• Prediction of Support Vector Machine (SVM)
Fig.No.2
• Visualizing Categories Column
Fig.No.3
• Heat Map of Data
Fig.No.4
• Output on the Given Test Dataset
Fig.No.5
• Output on the User-Given Data
Fig.No.6
• Parameter of Main Output
Fig.No.7
• Best Parameter
Fig.No.8
APPLICATIONS
• Early Diagnosis: Detecting lung cancer in its early stages, when treatment is
most effective, can significantly improve patient outcomes and reduce
mortality.
• Risk Assessment: It can be used to assess an individual's risk of developing
lung cancer based on factors like age, smoking history, genetic
predisposition, and environmental exposure.
• Treatment Planning: Predictive models can assist oncologists in creating
personalized treatment plans, selecting the most suitable therapies, and
estimating treatment response.
• Research and Clinical Trials: The system can aid in identifying eligible
participants for clinical trials and conducting research on lung cancer risk
factors and treatment effectiveness.
• Public Health: Public health agencies can use predictive systems to assess
lung cancer prevalence, plan targeted awareness campaigns, and allocate
resources for prevention and early detection programs.
• Continuous Monitoring and Maintenance: Set up mechanisms for
continuous monitoring of the application's performance and user feedback.
Schedule regular updates and maintenance to address any issues, improve
the model, and incorporate the latest advancements in machine learning.
• Predictive Analytics Dashboard: Create a dashboard that provides an
overview of predictive analytics, including trends, model performance
metrics, and patient outcomes. This can assist healthcare providers in
monitoring the impact of predictions over time.
• Patient Risk Stratification: Implement a risk stratification feature that
categorizes patients into different risk groups based on their likelihood of
developing lung cancer. This can help prioritize interventions and screenings
for high-risk individuals.
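The risk stratification idea in the last item can be sketched as a mapping from
a predicted probability to a risk group. The cut-off values 0.2 and 0.6 and the
function name risk_group are hypothetical placeholders chosen for illustration;
real cut-offs would have to be set with clinical input.

```python
def risk_group(probability, low=0.2, high=0.6):
    """Map a predicted probability to a risk group using assumed cut-offs."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie between 0 and 1")
    if probability < low:
        return "low"
    if probability < high:
        return "moderate"
    return "high"


# Hypothetical model outputs for four patients.
predicted = [0.05, 0.35, 0.60, 0.92]
groups = [risk_group(p) for p in predicted]
print(groups)
```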
CONCLUSION
In conclusion, our study shows that lung diseases can be correctly classified
using machine learning techniques. However, obtaining real-time data was one of
the primary concerns we faced in the initial stages. In addition, we were not
able to obtain similar data from existing works against which to compare our
results. It is worth mentioning, however, that our dataset has many attributes,
which is rare to find in online sources. Among the five algorithms, Random
Forest gives the best performance, ahead of LR, Bagging, LMT and Bayes Net. The
accuracy levels obtained are 88.00%, 88.9231%, 90.1538%, 89.2308% and 83.6929%,
of which the highest, 90.1538%, is that of Random Forest. As future work,
studies with deep learning methods such as Neuron Advanced Ensemble Learning,
Fuzzy Inference System, and Convolutional Neural Network would be useful and
beneficial.
Through the intricate analysis of diverse datasets and the application of advanced
machine learning algorithms, the model has proven its potential to identify
individuals at risk of developing lung cancer at an early stage. This capability is
crucial for facilitating timely interventions, thereby enhancing the prognosis and
overall outcomes for affected individuals. Furthermore, the model's ability to
stratify risk accurately allows healthcare professionals to tailor screening and
preventive measures based on individualized risk profiles, optimizing resource
utilization in the process. The ethical considerations embedded in the model's
design underscore its responsible use and seamless integration into clinical
workflows, ensuring patient privacy and ethical standards. As we look ahead,
ongoing refinement, continuous monitoring, and collaboration with healthcare
practitioners will be paramount for the sustained effectiveness and ethical
deployment of this predictive tool. In essence, the lung cancer prediction model not
only holds great promise for improving patient care but also contributes valuable
insights to the broader landscape of predictive medicine, ushering in a new era of
personalized healthcare.