
A

PROJECT REPORT
ON

“DEEPLUNG: PREDICTIVE MODELLING FOR LUNG CANCER RISK ASSESSMENT”
(Submitted in partial fulfillment for the award of Degree of Bachelor of Technology)
IN
Computer Science Engineering

SUBMITTED BY
PRIYANSHI HEMRAJANI | NAMAN POKHARNA | MAYANK MEHRANIYA | PIYUSH SONI | DAKSH SONI
UNDER THE GUIDANCE
OF
PROF. DR. R.K. SOMANI
DEAN
SCHOOL OF ENGINEERING AND TECHNOLOGY

SESSION: (2023-24)

Sangam University, NH-79, Bhilwara Chittor By-pass, Chittor Road, Bhilwara-311001



SANGAM UNIVERSITY

AUTHOR’S DECLARATION

I hereby declare that the work presented in this Project Report, entitled “DeepLung: Predictive Modelling for Lung Cancer Risk Assessment”, submitted to the Department of Computer Science Engineering, Sangam University, in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Computer Science Engineering, is a record of my own investigations carried out under the guidance of Prof. Dr. R.K. Somani, Dean, School of Engineering and Technology, Sangam University, Bhilwara, Rajasthan, India.

I have not submitted the matter presented in this report anywhere for the award of any other degree.

PRIYANSHI HEMRAJANI | NAMAN POKHARNA | MAYANK MEHRANIYA | PIYUSH SONI | DAKSH SONI
Computer Science & Engineering
Enrollment No.: 2020BTCS032
Sangam University, Bhilwara (Raj.)

Countersigned by
Prof. Dr. R.K. Somani
Dean
Department of Computer Science & Engineering
School of Engineering & Technology
Sangam University, Bhilwara (Raj.)



SANGAM UNIVERSITY

CERTIFICATE

It gives me great pleasure to certify that the project entitled “DeepLung: Predictive Modelling for Lung Cancer Risk Assessment” has been carried out by Priyanshi Hemrajani, Naman Pokharna, Mayank Mehraniya, Piyush Soni, and Daksh Soni under the supervision of Prof. Dr. R.K. Somani. I recommend the submission of the project.

Date: ………………….

Sign ………………………………………
(Dr. Vikas Somani)
Head of Department of Computer Science & Engineering,
School of Engineering & Technology,
Sangam University, Bhilwara



SANGAM UNIVERSITY

ACKNOWLEDGEMENT

This project would not have been successful without the guidance and support of a large number of individuals.

Prof. Dr. R.K. Somani, Dean of the School of Engineering and Technology and my project supervisor, believed in me from the initial stages of this work. He provided insightful commentary during our regular meetings and was consistently supportive of my proposed research directions. I am honored to be his first graduate student. His constant support and intellectual guidance inspired me to explore new ideas, and I am glad to have worked under his supervision.

I am grateful to Dr. Vikas Somani, Head of the Department of Computer Science Engineering, for his excellent support during this work. Dr. Awanit Kumar, Coordinator of the Bachelor of Technology (Computer Science) programme, spent many hours listening to my concerns, helping me navigate the bureaucracy, and assisting me with my most important decisions.

I thank all my friends and classmates for their love and support; I have greatly enjoyed their company during my time at the university, and I would like to thank everyone who made my stay an unforgettable and rewarding experience.

Last, but not least, I would like to thank my parents for supporting me in every way to complete my degree.

Priyanshi Hemrajani | Naman Pokharna | Mayank Mehraniya | Piyush Soni | Daksh Soni
Enrollment No.: 2020BTCS032



SANGAM UNIVERSITY

ABSTRACT

Lung cancer remains a significant global health challenge, often diagnosed at advanced
stages with limited treatment options, leading to high morbidity and mortality rates. This
project focuses on developing a machine learning-based predictive model to assess an
individual's risk of developing lung cancer. Leveraging diverse input factors including
smoking habits, environmental pollutants, genetic predisposition, occupational hazards, and
health parameters, the aim is to create a robust and accurate predictive tool.

The project objectives encompass comprehensive data collection, preprocessing, and feature
engineering to extract crucial insights from datasets sourced from reliable medical records
and research databases. Various machine learning algorithms, including logistic regression,
decision trees, random forests, and neural networks, are employed to build predictive models.
Hyperparameter tuning and ensemble methods are utilized to enhance model performance
and robustness.

The models are rigorously evaluated using cross-validation techniques and diverse evaluation
metrics to ensure reliability and generalizability. Interpretability techniques are applied to
explain model predictions, facilitating user trust and understanding, particularly among
healthcare professionals. Ethical considerations regarding patient data privacy and
compliance with regulations are strictly adhered to throughout the project lifecycle.

The ultimate goal is to create a user-friendly predictive tool that aids in early detection,
personalized risk assessment, and targeted interventions for lung cancer. This model has the
potential to significantly impact public health initiatives by informing preventive measures,
policy changes, and resource allocation strategies to mitigate the burden of lung cancer. The
project aims to contribute to advancements in predictive analytics applied to healthcare while
striving to improve patient outcomes and reduce the societal and economic impact of this
devastating disease.



ABBREVIATIONS

ML-LCRA: Machine Learning for Lung Cancer Risk Assessment
LC-PMA: Lung Cancer Predictive Modeling Approach
LMRC: Lung Cancer Risk Classifier
LDAP: Lung Disease Assessment Project
CARL: Cancer Assessment via Risk Learning
MIRA-LC: Machine Intelligence for Risk Assessment in Lung Cancer
LPRAM: Lung Cancer Prediction and Risk Assessment Model
LEARC: Lung Evaluation and Risk Classification
PREDICAN: Predictive Analysis for Lung Cancer
LCAI: Lung Cancer AI-Assisted Identification
L-CARE: Lung Cancer Assessment via Risk Estimation
AIR-LCR: AI for Lung Cancer Risk
RISK-LC: Risk Identification for Lung Cancer
LUNAR-M: Lung Cancer Risk Modeling
SMART-LC: System for Modeling and Assessing Lung Cancer Risk
LCAID: Lung Cancer AI Diagnosis
L-RISK: Lung Cancer Risk Intelligence System
PROLUCAR: Predictive Lung Cancer Assessment and Risk
LC-PREDICT: Lung Cancer Prediction Tool
LUNGPROF: Lung Cancer Risk Profiler



CONTENTS

Author’s Declaration ................................................................ ii
Certificate ........................................................................ iii
Acknowledgements .................................................................... iv
Abstract ............................................................................. v
Abbreviations ....................................................................... vi
Contents ........................................................................... vii
List of Figures ..................................................................... ix

1. Introduction to Predictive Modelling for Lung Cancer ............................. 1
   1.1. Objectives .................................................................. 1
        1.1.1. Scope of the Study ................................................... 3
        1.1.2. Motivation behind the Research ....................................... 5
        1.1.3. Limitations and Constraints .......................................... 6
        1.1.4. Proposed Solution .................................................... 7
   1.2. Problem Statement .......................................................... 10
        1.2.1. Problem Overview and Context ........................................ 10
2. Review of Related Literature .................................................... 12
   2.1. Existing Research and Studies .............................................. 12
   2.2. Research Gap ............................................................... 18
3. Proposed System and Methodology ................................................. 21
   3.1. System Architecture Design ................................................. 21
   3.2. System Flow and Use Case ................................................... 27
   3.3. System Algorithm ........................................................... 35
4. Result and Discussion ........................................................... 39
   4.1. Model Performance and Graphs ............................................... 39
5. Conclusion and Recommendations .................................................. 45
   5.1. Summary and Concluding Remarks ............................................. 45
   5.2. Practical Uses and Implications ............................................ 47
   5.3. Future Work and Enhancements ............................................... 51
6. References ...................................................................... 54
7. Appendix ........................................................................ 57
   7.1. Technical Details and Additional Graphs/Charts ............................. 57
   7.2. Supplementary Information .................................................. 64



LIST OF FIGURES

Fig 1: Lung Cancer Prediction Architecture
Fig 2: Architectural Model of LSTM
Fig 3: GRU Accuracy Comparison
Fig 4: Use Case Diagram of the System
Fig 5: UML Sequence Diagram
Fig 6: Flowchart of the Methodology for Cancer Detection
Fig 7: Decision Tree
Fig 8: ROC curves for risk prediction models in the MOLTEST BIS cohort. ROC, receiver operating characteristic curve; LLP, Liverpool Lung Project; AUC, area under the receiver operating characteristic curve.
Fig 9: Graphs
Fig 10: Input Data
Fig 11: Axes Input Plot
Fig 12: Dataset Details
Fig 13: Correlation Matrix
Fig 14: Lung Cancer due to Air Pollution
Fig 15: Level vs. Count
Fig 16: Label Graph



1. Introduction to Predictive Modelling for Lung Cancer

1.1. Objectives

Data Collection and Preparation:

 Gather diverse datasets encompassing information on smoking habits, environmental exposures, genetic factors, occupational history, health parameters, and demographics from reliable sources and medical records.
 Perform data preprocessing tasks, including handling missing values, outlier
detection, data normalization, and ensuring data consistency and quality.
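As a concrete illustration of these preparation steps, the sketch below assumes a hypothetical tabular file `lung_cancer_data.csv` with placeholder column names (`age`, `smoking_years`, `air_pollution_index`, `gender`, `occupation`); it shows one common way to impute missing values, limit outliers, scale numeric features, and encode categorical variables with pandas and scikit-learn, not the project's actual preprocessing code.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names; the real schema depends on the collected data.
df = pd.read_csv("lung_cancer_data.csv")
numeric_cols = ["age", "smoking_years", "air_pollution_index"]

# Impute missing numeric values with the median (robust to skewed clinical data).
df[numeric_cols] = SimpleImputer(strategy="median").fit_transform(df[numeric_cols])

# Clip extreme outliers to the 1st/99th percentiles rather than dropping records.
for col in numeric_cols:
    low, high = df[col].quantile([0.01, 0.99])
    df[col] = df[col].clip(low, high)

# Standardize numeric features; in a full workflow this would be fit on the
# training split only (or inside a Pipeline) to avoid information leakage.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])

# One-hot encode categorical columns such as gender or occupation.
df = pd.get_dummies(df, columns=["gender", "occupation"])
```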

Feature Selection and Engineering:

 Conduct thorough exploratory data analysis (EDA) to identify relevant features associated with lung cancer risk.
 Apply feature selection techniques to choose the most influential and
discriminative features.
 Perform feature engineering to create new features or transformations that may
enhance the predictive power of the model.

Model Development:

 Implement various machine learning algorithms (e.g., logistic regression, decision trees, random forests, support vector machines, neural networks) for building predictive models.
 Train multiple models using the prepared dataset, employing appropriate
hyperparameter tuning and model optimization techniques to enhance
performance.
 Explore ensemble methods to combine the strengths of multiple models for
improved prediction accuracy.
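To make this objective concrete, the following sketch trains several of the algorithms listed above and tunes each with a small grid search. The dataset file, column names, and hyperparameter grids are illustrative assumptions, and the target column `lung_cancer` is assumed to be encoded as 0/1.

```python
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Hypothetical preprocessed dataset with a 0/1 target column "lung_cancer".
df = pd.read_csv("lung_cancer_clean.csv")
X, y = df.drop(columns=["lung_cancer"]), df["lung_cancer"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Candidate algorithms with small illustrative hyperparameter grids.
candidates = {
    "logistic_regression": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "decision_tree": (DecisionTreeClassifier(random_state=42), {"max_depth": [3, 5, 10]}),
    "random_forest": (RandomForestClassifier(random_state=42), {"n_estimators": [100, 300]}),
    "neural_network": (MLPClassifier(max_iter=2000, random_state=42),
                       {"hidden_layer_sizes": [(32,), (64, 32)]}),
}

best_models = {}
for name, (estimator, grid) in candidates.items():
    # 5-fold grid search, optimizing ROC-AUC, gives each algorithm its best settings.
    search = GridSearchCV(estimator, grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    best_models[name] = search.best_estimator_
    print(f"{name}: best cross-validated ROC-AUC = {search.best_score_:.3f}")
```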

Model Evaluation and Validation:



 Assess the performance of developed models using cross-validation techniques to
ensure robustness and generalizability.
 Utilize appropriate evaluation metrics (such as accuracy, precision, recall, F1-
score, ROC-AUC) to measure model performance.
 Validate the model on independent datasets or through external validation to
confirm its reliability.
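Continuing the previous sketch (reusing the illustrative `best_models`, `X_train`, and `y_train` defined there), the snippet below shows how stratified k-fold cross-validation can report the metrics named above for every tuned model.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate

# Stratified folds preserve the class balance of the lung-cancer label in each split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]

for name, model in best_models.items():
    scores = cross_validate(model, X_train, y_train, cv=cv, scoring=metrics)
    summary = {m: round(scores[f"test_{m}"].mean(), 3) for m in metrics}
    print(name, summary)
```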

Interpretability and Explainability:

 Enhance the interpretability of the model by employing techniques such as feature importance analysis, SHAP (Shapley Additive Explanations), or LIME (Local Interpretable Model-agnostic Explanations).
 Provide explanations for model predictions to facilitate understanding and trust
among users, particularly healthcare professionals.

Ethical Considerations and Data Privacy:

 Ensure compliance with ethical guidelines and data privacy regulations in handling sensitive health-related information.
 Implement appropriate data anonymization techniques and robust security
measures to protect patient confidentiality.

User Interface Development (Optional):

 Develop a user-friendly interface or dashboard to facilitate easy interaction with the predictive model for healthcare professionals or end-users.
 Design the interface to visualize predictions, risk factors, and recommendations
based on individual profiles.

Documentation and Reporting:

 Create comprehensive documentation detailing the methodologies, algorithms used, data sources, preprocessing steps, model development, and evaluation processes.
 Prepare a detailed report summarizing the findings, model performance,
limitations, and recommendations for further enhancements or applications.

Deployment and Integration:

 Deploy the finalized model in a suitable environment, making it accessible for real-time predictions or integration within healthcare systems if applicable.



 Collaborate with healthcare institutions or relevant stakeholders for potential
integration into clinical practice or public health initiatives.
 By addressing these objectives, the project aims to develop a reliable and accurate
lung cancer prediction model that supports early detection, personalized risk
assessment, and proactive interventions, contributing to improved healthcare
outcomes and public health initiatives.

1.1.1. Scope of the Study


The scope of a study involving the development of a machine learning-based
predictive model for lung cancer risk assessment is comprehensive and
multidimensional. Here's a detailed breakdown of the scope:

1. Data Collection and Preprocessing:


 Identifying Relevant Data Sources: Gathering data from diverse sources
such as medical records, surveys, research papers, and public databases to
acquire information on:
 Smoking habits: Quantity, duration, type of tobacco, cessation
attempts, etc.
 Environmental pollutants: Air quality indices, exposure levels to
toxins, geographical data, etc.
 Genetic predisposition: Genetic markers, family history, genotypic
data.
 Occupational hazards: Exposure to carcinogens in specific industries
or occupations.
 Other relevant parameters: Demographics, lifestyle factors, medical
history, etc.
 Data Preprocessing: Cleaning and formatting data, handling missing values,
encoding categorical variables, and ensuring data consistency and quality.

2. Feature Engineering:
 Feature Selection: Identifying the most relevant features that significantly
contribute to lung cancer risk using techniques like correlation analysis,
feature importance ranking, etc.
 Feature Transformation: Normalizing, scaling, or transforming features to
ensure uniformity and enhance model performance.
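As one possible illustration of these two steps, the sketch below derives a "pack-years" feature (a standard cumulative smoking-exposure measure: packs per day multiplied by years smoked) and then keeps the features that share the most mutual information with the label. The file and column names are hypothetical placeholders, and the label is assumed to be encoded as 0/1.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Hypothetical cleaned dataset; column names are placeholders for the real schema.
df = pd.read_csv("lung_cancer_clean.csv")

# Feature engineering: derive cumulative smoking exposure from raw smoking columns.
df["pack_years"] = (df["cigarettes_per_day"] / 20.0) * df["years_smoked"]

X = df.drop(columns=["lung_cancer"])
y = df["lung_cancer"]              # assumed to be encoded as 0/1

# Feature selection: keep the 10 features sharing the most mutual information
# with the lung-cancer label (captures non-linear associations as well).
selector = SelectKBest(score_func=mutual_info_classif, k=10)
selector.fit(X, y)
print("Selected features:", list(X.columns[selector.get_support()]))
```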

3. Model Development:
 Machine Learning Algorithms: Exploring various algorithms like logistic
regression, decision trees, random forests, support vector machines, neural
networks, etc., to build and compare predictive models.



 Model Training: Using a portion of the data to train the models, tuning
hyperparameters, and evaluating model performance using cross-validation
techniques.
 Ensemble Methods: Employing ensemble techniques (e.g., stacking,
boosting) to enhance model robustness and accuracy.

4. Evaluation and Validation:


 Performance Metrics: Assessing the model's performance using appropriate
metrics like accuracy, precision, recall, F1-score, ROC-AUC, etc.
 Validation: Conducting rigorous validation on separate test datasets to ensure
the generalizability and reliability of the developed model.

5. Interpretability and Explainability:


 Model Interpretation: Explaining the relationships between input factors and
the model's predictions, facilitating understanding for medical professionals
and end-users.
 Visualizations: Generating visual aids (e.g., feature importance plots, decision
boundaries) to enhance interpretability.

6. Ethical Considerations and Privacy:


 Ethical Guidelines: Ensuring adherence to ethical standards and regulations
concerning patient data, consent, and confidentiality.
 Privacy Protection: Implementing measures to safeguard sensitive
information and anonymizing data where necessary.
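A minimal sketch of one anonymization option follows: direct identifiers are replaced with salted one-way hashes (pseudonymization) before the data is shared for modeling. The column names and the environment-variable salt are assumptions, and a real deployment would follow the applicable regulations rather than this simplified example.

```python
import hashlib
import os

# The salt is kept outside the research dataset (e.g., in an environment variable)
# so the mapping cannot be reversed from the shared data alone.
SALT = os.environ.get("PSEUDONYM_SALT", "replace-with-a-secret-salt")

def pseudonymize(patient_id: str) -> str:
    """Return a stable, non-reversible token for a direct patient identifier."""
    return hashlib.sha256((SALT + patient_id).encode("utf-8")).hexdigest()

# Typical usage on a pandas DataFrame (column names are hypothetical):
# df["patient_id"] = df["patient_id"].astype(str).map(pseudonymize)
# df = df.drop(columns=["name", "address", "phone"], errors="ignore")
```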

7. Deployment and Recommendations:


 Implementation: Developing a user-friendly interface or integrating the
model into existing healthcare systems for practical use.
 Recommendations: Providing personalized risk assessments and actionable
recommendations for individuals based on their assessed lung cancer risk.

8. Continuous Improvement:
 Model Updating: Establishing a framework for continuous model
improvement with new data and emerging research to enhance accuracy and
relevance over time.
 Feedback Mechanism: Creating a mechanism to receive feedback from
healthcare professionals and users for ongoing refinement.

Conclusion:
The scope of this study encompasses a comprehensive and interdisciplinary approach
involving data collection, preprocessing, model development, evaluation, ethical
considerations, deployment, and continuous improvement. The ultimate goal is to
create a reliable, accurate, and user-friendly predictive tool for assessing an individual's risk of developing lung cancer and providing personalized interventions for early detection and prevention.

1.1.2. Motivation behind the Research


The motivation behind developing a lung cancer prediction model using machine
learning techniques is multifaceted and rooted in addressing several critical aspects:

 Early Detection and Prevention: Lung cancer is often diagnosed at advanced stages when treatment options are limited and the prognosis is poor. By
creating a predictive model, the primary motivation is to enable early detection
of the disease. Early identification of individuals at high risk can prompt
timely screenings, leading to earlier diagnosis and potentially more effective
treatment strategies, thus improving survival rates.

 Personalized Healthcare: Each individual's risk factors for lung cancer can
vary significantly. By considering diverse input factors such as smoking
habits, environmental exposures, genetic predisposition, and health history, the
model aims to provide personalized risk assessments. This personalized
approach allows for tailored interventions and recommendations specific to an
individual's risk profile, enhancing the effectiveness of preventive measures.

 Public Health Impact: Lung cancer remains a significant public health challenge globally. Developing a predictive model contributes to public health
initiatives by providing insights into risk factors and prevalence. This
information can aid policymakers, healthcare providers, and public health
authorities in formulating targeted interventions, implementing smoking
cessation programs, improving environmental regulations, and allocating
resources more effectively to combat lung cancer at a population level.

 Research Advancement: The project fosters advancements in the field of predictive analytics and machine learning applied to healthcare. Developing a
robust predictive model involves data collection, preprocessing, feature
engineering, and model evaluation, contributing to methodological
advancements in analyzing complex health-related data. This can potentially
pave the way for similar predictive models for other types of cancers or
diseases.

 Improving Patient Outcomes: Ultimately, the goal is to improve patient outcomes and quality of life. By accurately identifying individuals at higher risk, the model can empower healthcare providers to offer timely interventions, including counseling, lifestyle modifications, early screenings, and appropriate medical care. This proactive approach has the potential to reduce the incidence of lung cancer and its associated morbidity and mortality.

 Reducing Healthcare Costs: Early detection and prevention strategies can significantly reduce the economic burden associated with treating advanced-
stage lung cancer. By focusing on preventive measures and early interventions,
healthcare costs related to extensive treatments and hospitalizations for
advanced stages of the disease can be curtailed.

In summary, the motivation behind creating a lung cancer prediction model lies in its
potential to revolutionize early detection, personalize healthcare interventions,
positively impact public health policies, advance research methodologies, enhance
patient outcomes, and alleviate the societal and economic burdens associated with
lung cancer.

1.1.3. Limitations and Constraints


To develop a machine learning-based predictive model for lung cancer risk
assessment, there are several limitations and constraints that should be acknowledged
and considered:

1. Data Availability and Quality:


 Limited or Incomplete Data: Availability of comprehensive data on all
relevant factors (genetic, environmental, occupational, etc.) might be
restricted.
 Data Quality: Inaccuracies, missing values, or biases within the dataset can
affect model performance and reliability.

2. Ethical and Privacy Concerns:


 Data Privacy and Confidentiality: Adhering to strict privacy regulations
(such as HIPAA) might restrict access to certain sensitive patient information,
impacting the comprehensiveness of the dataset.
 Ethical Considerations: Balancing the need for data access with ethical
considerations regarding patient consent, confidentiality, and fair use of data.

3. Model Development Challenges:


 Complexity of Lung Cancer Development: Lung cancer is influenced by
multifaceted factors, and capturing this complexity within a model might be
challenging.



 Overfitting or Underfitting: Ensuring the model's balance between capturing
intricate patterns and generalizing well to new data.

4. Interpretability and Explainability:


 Complexity of Machine Learning Models: Certain models like neural
networks might lack interpretability, making it difficult to explain the model's
predictions, especially in a medical context.
 Communication to Stakeholders: Explaining model predictions and
recommendations to healthcare professionals and individuals in a
comprehensible manner might be challenging.

5. Deployment and Practical Application:


 Integration with Healthcare Systems: Compatibility issues or resistance to
adopting new technologies within existing healthcare systems.
 User Acceptance: Ensuring that healthcare professionals and individuals trust
and understand the model's predictions and recommendations.

6. Continual Improvement and Maintenance:


 Dynamic Nature of Data: Continuous updates and additions to the dataset
and staying updated with the latest research might be resource-intensive.
 Model Drift: Ensuring that the model maintains accuracy over time as the
underlying patterns in data change.

7. External Factors and Generalizability:


 Geographical and Population Differences: Models developed using specific
datasets might not generalize well to diverse populations or different
geographical regions.
 External Influences: New environmental factors, changes in lifestyle, or
healthcare advancements might affect the model's relevance and accuracy.

8. Resource Constraints:
 Computational Resources: Availability of computational power and
infrastructure required for processing large datasets and training complex
models.
 Budget and Time Constraints: Limitations in funding and time could affect
the extent of data collection, model development, and validation processes.
Understanding and addressing these limitations and constraints are crucial for
managing expectations, ensuring ethical compliance, and developing a model that is
both effective and practical for real-world application.

1.1.4. Proposed Solution



Solution Proposal: Lung Cancer Prediction Model

1. Data Acquisition and Preprocessing:

 Data Collection: Gather diverse datasets from reputable sources, including medical records, research databases, surveys, and relevant literature, encompassing information on smoking habits, environmental exposures, genetic factors, occupational history, health parameters, and demographics.
 Data Preprocessing: Perform data cleaning to handle missing values, outliers,
and inconsistencies. Normalize or scale numerical features and encode
categorical variables for compatibility with machine learning algorithms.

2. Feature Engineering and Selection:

 Exploratory Data Analysis (EDA): Conduct comprehensive EDA to understand relationships between features and lung cancer incidence. Identify correlations, distributions, and patterns in the data.
 Feature Selection: Employ techniques like correlation analysis, mutual
information, or feature importance ranking to select the most relevant features
that significantly contribute to lung cancer risk prediction.
 Feature Engineering: Create new features or transformations that capture
complex relationships or interactions between variables, enhancing the
predictive power of the model.

3. Model Development and Optimization:

 Algorithm Selection: Experiment with various machine learning algorithms (e.g., logistic regression, decision trees, random forests, support vector machines, neural networks) to build predictive models.
 Hyperparameter Tuning: Use techniques like grid search or random search
to optimize hyperparameters for each model, improving their performance.
 Ensemble Methods: Explore ensemble methods such as bagging, boosting, or
stacking to combine multiple models for increased predictive accuracy and
robustness.
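The stacking option can be sketched as follows, reusing the illustrative `X_train`/`y_train` split from the earlier sketches; the choice of base learners and meta-learner is an assumption, not a prescribed configuration.

```python
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression

# Combine complementary base learners; a logistic-regression meta-learner blends
# their out-of-fold predictions (stacking), which limits overfitting.
stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=300, random_state=42)),
        ("gb", GradientBoostingClassifier(random_state=42)),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
```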

4. Model Evaluation and Validation:

 Cross-validation: Employ k-fold cross-validation to assess model performance on different subsets of the dataset, ensuring generalizability.



 Performance Metrics: Measure model performance using appropriate
evaluation metrics like accuracy, precision, recall, F1-score, ROC-AUC, and
confusion matrices.
 External Validation: Validate the final model on independent datasets or with
real-world data to confirm its reliability and applicability.
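A minimal hold-out evaluation, continuing the earlier sketches (`stack`, `X_test`, `y_test`), might look like the following; in practice an independent cohort is preferable to a random split for true external validation.

```python
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

# Final check on data never used for training or tuning.
y_prob = stack.predict_proba(X_test)[:, 1]
y_pred = stack.predict(X_test)

print("ROC-AUC:", round(roc_auc_score(y_test, y_prob), 3))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```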

5. Model Interpretability and Explainability:

 Feature Importance Analysis: Use techniques such as SHAP values, permutation importance, or LIME to explain the importance of features in predicting lung cancer risk.
 Visualizations: Generate visual explanations or plots that illustrate how
different factors contribute to an individual's risk, aiding in model
interpretation and user understanding.
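SHAP and LIME require their own libraries; as one model-agnostic alternative that needs only scikit-learn, the sketch below estimates permutation importance on the held-out data from the earlier sketches (each feature is shuffled in turn and the drop in ROC-AUC is recorded).

```python
from sklearn.inspection import permutation_importance

# Larger score drop when a feature is permuted means the model relies on it more.
result = permutation_importance(stack, X_test, y_test,
                                scoring="roc_auc", n_repeats=20, random_state=42)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"{X_test.columns[idx]}: "
          f"{result.importances_mean[idx]:.4f} +/- {result.importances_std[idx]:.4f}")
```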

6. Ethical Considerations and Deployment:

 Data Privacy and Ethics: Ensure compliance with ethical standards, patient
confidentiality, and data protection regulations throughout the project.
 Model Deployment: Deploy the finalized model in a suitable environment,
considering integration into healthcare systems or making it accessible
through a user-friendly interface for healthcare professionals.

7. Documentation and Reporting:

 Comprehensive Documentation: Create detailed documentation outlining the methodologies, algorithms utilized, data sources, preprocessing steps, model development, evaluation outcomes, and limitations.
 Report Generation: Prepare a comprehensive report summarizing the project
findings, model performance, recommendations for healthcare practices, and
potential future enhancements.
By executing these steps and implementing the proposed solution, the aim is to
develop a robust and accurate lung cancer prediction model. This model can assist
healthcare professionals in assessing individual risks, enabling early interventions,
and contributing to personalized healthcare strategies aimed at reducing the burden of
lung cancer. Additionally, this solution contributes to advancing predictive analytics in
healthcare, fostering research, and potentially impacting public health policies to
combat lung cancer more effectively.



1.2. Problem Statement
"Developing a machine learning-based predictive model for lung cancer risk assessment
leveraging diverse input factors such as smoking habits, exposure to environmental
pollutants, genetic predisposition, occupational hazards, and other relevant parameters.
The objective is to create a robust and accurate predictive tool that identifies and
evaluates the likelihood of an individual developing lung cancer, thereby facilitating early
intervention and personalized preventive measures."

1.2.1. Problem Overview and Context


Problem Overview: Developing a Lung Cancer Prediction Model
Lung cancer remains one of the most prevalent and fatal types of cancer worldwide,
often diagnosed at advanced stages when treatment options are limited. The aim of
this project is to create a machine learning-based predictive model that assesses an
individual's risk of developing lung cancer. This model will leverage various input
factors, including but not limited to:

 Smoking Habits: Smoking is a well-established primary risk factor for lung cancer. The model will consider different aspects such as duration, intensity, and cessation of smoking habits.

 Environmental Pollutants: Exposure to air pollution, industrial emissions, second-hand smoke, radon, asbestos, and other environmental toxins significantly contributes to lung cancer risk. Data related to exposure levels and duration will be integrated into the model.

 Genetic Predisposition: Certain genetic factors and family history play a role
in predisposing individuals to lung cancer. Genetic markers and family history
data will be considered to assess genetic susceptibility.

 Occupational Hazards: Certain occupations involve exposure to carcinogens (e.g., asbestos in construction work). Occupational history and exposure data will be incorporated into the model.



 Health Parameters: Additional health-related information such as pre-
existing respiratory conditions, history of chronic diseases, age, gender, and
demographic factors will also be taken into account.

Objectives:

 Model Development: Construct a robust predictive model utilizing machine learning algorithms (e.g., logistic regression, decision trees, random forests, neural networks) to analyze the relationships between the input factors and the likelihood of developing lung cancer.

 Feature Selection and Engineering: Identify the most influential features contributing to lung cancer risk. Perform feature engineering to enhance the model's accuracy and interpretability.

 Data Collection and Preprocessing: Collect diverse datasets from reliable sources (medical records, surveys, research databases) and preprocess the data to handle missing values and outliers and to ensure compatibility for model training.

 Model Evaluation and Validation: Assess the model's performance using appropriate evaluation metrics (e.g., accuracy, precision, recall, ROC-AUC) through cross-validation techniques to ensure its reliability and generalizability.

 Ethical Considerations: Ensure the ethical use of sensitive health-related data, maintaining patient privacy and confidentiality throughout the project lifecycle.

Outcome:

 The ultimate goal is to create a user-friendly predictive tool that healthcare professionals can utilize for early detection, personalized risk assessment, and targeted intervention strategies. This model could assist in proactive measures such as smoking cessation programs, environmental policy changes, and personalized healthcare interventions, potentially reducing the burden of lung cancer and improving patient outcomes.

2. Review of Related Literature

2.1. Existing Research and Studies

“An evaluation of machine learning classifiers and ensembles for early stage prediction of
lung cancer” (M.I. Faisal): This research paper delves into the realm of predictive modeling
using statistical and machine learning techniques, emphasizing their significance across
various domains like software fault prediction, spam detection, disease diagnosis, and
financial fraud identification. Recognizing the critical role of predicting lung cancer
susceptibility in guiding effective treatments, the study aims to assess different predictors'
effectiveness in enhancing lung cancer detection efficiency based on symptomatic data.
Multiple classifiers—such as Support Vector Machine (SVM), C4.5 Decision Tree, Multi-
Layer Perceptron, Neural Network, and Naïve Bayes (NB)—are rigorously evaluated using a
benchmark dataset sourced from the UCI repository.[1]

"Lung cancer classification tool using microarray data and support vector machines" (G.
Salano): This study introduces an innovative system that harnesses gene expression data from
oligonucleotide microarrays. Its primary goal is threefold: predict the presence or absence of
lung cancer, identify the specific subtype if present, and pinpoint marker genes linked to the
particular lung cancer type. The proposed system serves as a promising tool for expedited
diagnosis and complements existing lung cancer classification methods.[2]

S. H. Liu, "Prediction of lung cancer based on serum biomarkers by gene expression programming methods”: The swift differentiation between small cell lung cancer (SCLC) and
non-small cell lung cancer (NSCLC) tumors holds pivotal significance in lung cancer
diagnosis. This research study focused on serum markers—lactate dehydrogenase (LDH), C-
reactive protein (CRP), carcino-embryonic antigen (CEA), neurone specific enolase (NSE),
and Cyfra21-1—as indicators reflecting distinct lung cancer characteristics. The study
conducted classification of lung tumors based on these biomarkers, involving 120 NSCLC
and 60 SCLC patients. It aimed to establish an optimal joint utilization of biomarkers for
accurate classification, enhancing the ability to differentiate between SCLC and NSCLC
tumors.[3]

Y. Choi “Early-stage lung cancer diagnosis by deep learning-based spectroscopic analysis of circulating exosomes”: The approach involves exploring cell exosome features via deep
learning and identifying similarities in human plasma exosomes without extensive human
data learning. The deep learning model, trained on SERS signals from exosomes of normal
and lung cancer cell lines, achieved a 95% accuracy in classifying them. In a study involving
43 patients, including stage I and II cancer patients, the model predicted that 90.7% of the patients' plasma exosomes had higher similarity to lung cancer cell exosomes compared to
healthy controls, correlating with cancer progression. [4]

S.J. Lee “A machine-learning approach using PET-based radiomics to predict the histological subtypes of lung cancer”: The research focused on utilizing machine learning
techniques and PET-based radiomic features to predict histological subtypes in lung cancer. It
involved 396 patients (210 ADCs, 186 squamous cell carcinomas) who underwent FDG
PET/CT scans before treatment. Key clinical factors (age, sex, tumor size, smoking status)
and 40 radiomic features extracted from PET images were studied. The study identified the
most significant features associated with lung cancer subtypes using Gini coefficient scores.
[5]

S. Jondhale “Lung cancer detection using image processing and machine learning
healthcare”: Lung cancer remains a leading cause of mortality in India, necessitating
advanced diagnosis and detection methods. With the elusive nature of its causes, early
detection becomes paramount for successful treatment. This research focuses on a lung
cancer detection system employing image processing and machine learning techniques to
classify the presence of lung cancer in CT images and blood samples. CT scan images,
known for their efficacy compared to Mammography, are used to classify patients' images as
normal or abnormal. [6]

M. A. Yousuf, "Detection of Lung cancer from CT image using Image Processing and Neural
network": Lung cancer detection in its premature stages is a focal point of research due to its
critical impact on patient outcomes. The proposed system is designed as a two-stage process
aimed at detecting lung cancer in its early phases, employing a series of steps encompassing
image acquisition, preprocessing, binarization, thresholding, segmentation, feature extraction,
and neural network-based detection. The system begins by inputting lung CT images,
subsequently undergoing preprocessing via various image processing techniques. In the first
stage, a Binarization technique is applied to convert the image into a binary format, followed
by comparison with a predefined threshold value to identify potential lung cancer regions.
The second stage involves segmentation to isolate the lung CT image, and a robust feature
extraction method is employed to capture critical features from the segmented images. [7]

Viergever, "Computer-aided diagnosis in chest radiography: a survey": Chest radiographs


continue to hold a prominent place in clinical practice, despite the inherent complexity in
their interpretation. Consequently, there is ongoing interest in computer-aided diagnosis
(CAD) systems to aid in the analysis of chest images. This survey aims to categorize and
provide concise reviews of over 150 papers spanning the last three decades, focusing on the
computer-based analysis of chest images. The literature review encompasses a wide array of
techniques and methodologies utilized in computer analysis for chest radiography. Various
approaches and advancements in CAD systems are summarized, highlighting their strengths
and limitations. [8]



Nice, Jr., "Digital computer determination of a medical diagnostic index directly from chest
X-ray images": This pioneering research employed digital technology to record chest X-ray
images on magnetic tape via a flying spot scanner and analog-to-digital converter.
Subsequently, a digital computer system processed these taped images utilizing a stored
program. The computer's automated analysis focused on measuring the maximum transverse
diameter of the heart shadow and rib cage shadow from the X-ray images. The calculated
ratio between these two measurements yielded the cardiothoracic ratio, a standard diagnostic
index extensively used by physicians to detect cardiac pathology, particularly heart
enlargement. Notably, this research marks the first successful determination of this diagnostic
index directly from unaltered X-ray films through the innovative use of a digital computer.
[9]

H. M. Joseph, "Image processing": This research explores the visualization of a scalar function of two independent variables as an image, enabling the conception of all
mathematical operations as modifications or processing of the original image. Specifically,
the study focuses on a class of modifying operators achieved through specialized scanning
techniques, eliminating the need for rapid access memory storage devices. The investigation
identifies two significant operators: contour enhancement and contour outlining. Contour
enhancement exhibits effects similar to deblurring, akin to aperture correction, and crispening
observed in television practices. [10]

J. M. Hollywood, "A new technique for improving the sharpness of pictures": This research
focuses on a technique known as "crispening" designed to enhance the apparent picture
definition in the CBS color-television system. The method utilizes nonlinear circuitry to
modify the apparent rise time of an isolated step input applied to a bandwidth-limited system.
The principle behind crispening involves adding a second waveform, representing the
difference between the desired and original waveforms, to a slow transition waveform. This
addition aims to create a narrower "spike" shape, superimposed on the original waveform,
effectively reducing the rise time by about half.[11]

Fredendall, "Analysis synthesis and evaluation of the transient response of television


apparatus": This research delves into the relationship between the sharpness of detail in
television pictures and the transmitter's capacity to transmit abrupt changes in picture half-
tone. The study focuses on the utilization of square waves, particularly a square-wave test
signal with a sufficiently long period, as a suitable method for evaluating subjective
sharpness in transmitted pictures. The paper deduces rules for evaluating the expected
subjective sharpness based on the square-wave response of the transmitting apparatus. It
introduces rapid chart methods for analyzing a square-wave output into sine-wave amplitude
and phase responses.[12]
N. Ayache, "Medical image analysis: Progress over two decades and the challenges ahead":
The paper explores the evolution of medical image analysis within the pattern analysis and machine intelligence (PAMI) community, tracing its trajectory from initial applications of
pattern analysis and computer vision techniques to medical datasets to its emergence as a
distinct and significant discipline. Over the past two to three decades, the field has undergone
significant transformation due to the unique challenges posed by medical image analysis.
Notable aspects include the distinct types of image information obtained, the complex and
fully three-dimensional nature of medical image data, the nonrigid motion and deformation of
objects, and the statistical variation present in both normal and abnormal image ground
truths.[13]

R.P.A. Grzeszczuk, "Clinical Applications of Three-Dimensional Rendering of Medical Data Sets": This paper focuses on highlighting the diverse clinical applications of volumetric
rendering techniques in medical imaging, propelled by advancements in high-resolution
imaging modalities like MRI and CT, alongside progress in computer technology. It aims to
provide a comprehensive overview for those seeking a general understanding of the clinical
3D rendering process and its applications. The research identifies and outlines various clinical
applications that demonstrate potential for utilizing volumetric rendering of medical images.
These applications span different stages of medical practice, including diagnostics,
preoperative planning, intraoperative navigation, surgical robotics, postoperative validation,
training, and telesurgery.[14]

S. Tsuji, "A Plan-Guided Analysis of Cineangiograms for Measurement of Dynamic Behavior


of the Heart Wall": This research paper presents a system tailored for processing noisy
dynamic images, focusing on cineangiograms—X-ray motion pictures capturing the beating
heart through the injection of X-ray opaque dye via a catheter. The system's primary task
involves detecting both the internal and external surfaces of the left ventricular chamber and
measuring the spatial and temporal changes in heart wall thickness, crucial for diagnosing
various heart diseases.[15]
Yu et al. (2016) obtained histopathology whole-slide images of lung adenocarcinoma and squamous cell carcinoma stained with hematoxylin and eosin. Patients' images
were taken from TCGA (The Cancer Genome Atlas) and the Stanford TMA (Tissue
Microarray Database), plus an additional 294 photos. Even when conducted with the greatest
of intentions, an assessment of human pathology cannot properly predict the patient's
prognosis. A total of 9,879 quantitative elements of an image were retrieved, and machine
learning algorithms were used to select the most important aspects and differentiate between
patients who survived for a short period of time and those who survived for a long period of
time after being diagnosed with stage I adenocarcinoma or squamous cell carcinoma. The
researchers used the TMA cohort to validate the survival rate of the recommended framework
(P0.036 for tumor type). According to the findings of this study, the characteristics that are
created automatically may be able to forecast the prognosis of a lung cancer patient and, as a
consequence, may help in the development of personalized medication. The methodologies
that were outlined can be utilized in the analysis of histopathology images of various organs
[16].



Pol Cirueda and his colleagues used an aggregation of textures that kept the spatial covariances across features consistent. Mixing the local responses of texture operator pairs
is done using traditional aggregation functions like the average; nonetheless, doing so is a
vital step in avoiding the problems of traditional aggregation. Pretreatment computed
tomography (CT) scans were utilized in order to assist in the prediction of NSCLC nodule
recurrence prior to the administration of medication. After that, the recommended methods
were put to use in order to compute the kind of NSCLC nodule recurrence according to the
manifold regularized sparse classifier. These discoveries, which offer up new study
possibilities on how to use morphological, tissue traits to evaluate cancer invasion, need to be
confirmed and investigated further. However, this will not be possible without more research.
When modeling orthogonal information, the author focused on the textural characteristics of
nodular tissue and coupled those characteristics with other variables such as the size and
shape of the tumor [17].

The creation of a method for the early detection and accurate diagnosis of lung cancer using CT, PET, and X-ray images by Manasee Kurkure and Anuradha Thakare in 2016 has garnered a significant amount of attention and enthusiasm.
genetic algorithm that permits the early identification of lung cancer nodules by diagnostics
allows for the optimization of the findings to be accomplished. It was necessary to employ
both Naive Bayes and a genetic algorithm in order to properly and swiftly classify the various
stages of cancer images. This was done in order to circumvent the intricacy of the generation
process. The categorization has an accuracy rate of up to eighty percent [18].

Sangamithraa and Govindaraju [19] used a preprocessing strategy to remove unwanted noise with median and Wiener filters, improving the quality of the data. The K-means method is used to segment the CT images, and EK-means clustering is applied for clustering. Fuzzy EK-means segmentation is utilized to extract contrast, homogeneity, area, correlation, and entropy features from the images, and a back-propagation neural network is used to perform the classification [20].

Ashwini Kumar Saini et al. (2016) provided a summary of the types of noise that can affect lung cancer images and the strategies for removing them. Due to the fact
that lung cancer is considered to be one of the most life-threatening kinds of cancer, it is
essential that it be detected in its earlier stages. If the cancer has a high incidence and
mortality rate, this is another indication that it is a particularly dangerous form of the disease.
The quality of the digital dental X-ray image analysis must be significantly improved for the
study to be successful. A pathology diagnosis in a clinic continues to be the gold standard for
detecting lung cancer, despite the fact that one of the primary focuses of research right now is
on finding ways to reduce the amount of image noise. X-rays of the chest, cytological
examinations of sputum samples, optical fiber investigations of the bronchial airways, and
final CT and MRI scans are the diagnostic tools that are utilized most frequently in the
detection of lung malignancies. Despite the availability of screening methods like CT and MRI that are more sensitive and accurate in many parts of the world, chest radiography
continues to be the primary and most prevalent kind of surgical treatment. It is routine
practice to test for lung cancer in its early stages using chest X-rays and CT scans; however,
there are problems associated with the scans' weak sensitivities and specificities [19].

Neural ensemble-based detection (NED) is the name given to the automated method of illness diagnosis suggested in Kureshi et al.'s research [21]. The approach that was
suggested utilized feature extraction, classification, and diagnosis as its three main
components. In this experiment, the X-ray chest films that were taken at Bayi Hospital were
utilized. This method is recommended because it has a high identification rate for needle
biopsies in addition to a decreased number of false negative identifications. As a result, the
accuracy is improved automatically, and lives are saved [22].

Kulkarni and Panditrao [23] have created a novel algorithm for early-stage cancer
identification that is more accurate than previous methods. The program makes use of a
technology that processes images. The amount of time that passes is one of the factors that is
considered while looking for anomalies in the target photographs. The position of the tumor
can be seen quite clearly in the original photo. In order to get improved outcomes, the
techniques of watershed segmentation and Gabor filtering are utilized at the preprocessing
stage. The extracted interest zone produces three phases that are helpful in recognizing the
various stages of lung cancer: eccentricity, area, and perimeter. These phases may be found in
the extracted interest zone. It has been revealed that the tumors come in a variety of
dimensions. The proposed method is capable of providing precise measurements of the size
of the tumor at an early stage [21].

Westaway et al. [24] used a radiomic approach to identify three-dimensional properties from
images of lung cancer in order to provide predictive information. Classifiers are devised to estimate how long a patient is likely to survive.
Moffitt Cancer Center in Tampa, Florida, served as the location from where these
photographs for the experiment's CT scans were obtained. Based on the properties of the
pictures produced by CT scans, which may suggest phenotypes, human analysis may be able
to generate more accurate predictions. When a decision tree was used to make the survival
predictions, it was possible to accurately forecast seventy-five percent [23] of the outcomes.

CT (computed tomography) images of lung cancer have been categorized with the use of a
lung cancer detection method that makes use of image processing. This method was
described by Chaudhary and Singh [25]. Several other approaches, including segmentation,
preprocessing, and the extraction of features, have been investigated thus far. The authors
have distinguished segmentation, augmentation, and feature extraction, each in its own
unique section. In Stages I, II, and III, the cancer is contained inside the chest and manifests
as larger, more invasive tumors. By Stage IV, however, cancer has spread to other parts of the
body [24].



2.2. Research Gap

From the provided research summaries, several potential research gaps or areas for
further exploration might be identified:
Noise Reduction and Image Enhancement Techniques: While the reviewed studies touch
upon noise reduction in medical imaging, there might be room to delve deeper into
advanced noise reduction and image enhancement techniques specifically tailored for
dynamic medical images like cineangiograms. Investigating more robust algorithms
could lead to better image quality and more accurate boundary detection.
Automated Boundary Detection: Despite the sophisticated edge detection methods
mentioned, there could be scope for developing more automated and efficient
algorithms to detect boundaries accurately, particularly in cases of low-contrast
regions or images affected by noise. This could involve exploring machine learning or
deep learning techniques for improved segmentation and boundary detection.
Real-time Processing and Analysis: Expanding research on real-time processing of
dynamic medical images, such as cineangiograms, might be valuable. Developing
systems that can process and analyze images in near-real-time during medical
procedures could aid clinicians by providing immediate feedback and guidance.
Clinical Validation and Standardization: While the mentioned research shows
promising results compared to radiologist-detected boundaries, further clinical
validation across a larger and more diverse dataset could be beneficial. Additionally,
establishing standardized protocols and benchmarks for evaluating the accuracy and
reliability of image processing systems in clinical settings could enhance their
adoption.
Integration of Multiple Imaging Modalities: Exploring the integration of data from
various imaging modalities (e.g., MRI, CT scans) alongside cineangiograms could
provide a more comprehensive understanding of cardiac structures and functions. This
integration might offer richer diagnostic insights and improve the accuracy of disease
detection.
User Interface and Clinical Adoption: Investigating user-friendly interfaces and
system integration into clinical workflows could bridge the gap between research and
practical clinical application. Ensuring ease of use and seamless integration of these
systems into existing medical practices is crucial for their widespread adoption.



Addressing these potential research gaps could contribute to advancements in medical
imaging technology, enhancing diagnostic accuracy, clinical decision-making, and
ultimately improving patient care in the field of cardiology and beyond.
Research on Lung Cancer Detection using Image Processing and Machine Learning:
Research Summary: This study focuses on lung cancer detection using image
processing and machine learning techniques, highlighting the importance of early-
stage detection for favorable prognosis.
Potential Research Gap: While the research outlines the use of SVM and image
processing for lung cancer detection, further exploration into hybrid models
integrating diverse machine learning algorithms might improve accuracy.
Additionally, investigating the integration of multiple imaging modalities (like CT
scans and histopathological images) for more comprehensive detection could be
valuable.
Research on Computer-Aided Diagnosis in Chest Radiography:
Research Summary: The paper reviews computer-aided diagnosis in chest
radiography, emphasizing the challenges and advancements in this domain.
Potential Research Gap: The research identifies challenges but does not delve into
specific methods to overcome them. Further exploration could involve proposing
novel algorithms or approaches to tackle the challenges posed by interpreting chest
radiographs, thereby enhancing accuracy and efficiency.
Research on Cardiac Diagnosis via Digital Computer System:
Research Summary: This pioneering research introduces a digital computer system for
cardiac diagnosis via chest X-ray films, aiming to enhance diagnostic accuracy.
Potential Research Gap: While the study successfully establishes a method for cardiac
diagnosis, future research could explore the application of this system to a wider
range of cardiac conditions. Moreover, validating the system's accuracy across diverse
patient populations could enhance its reliability and practical utility.
Research on 3D Volumetric Rendering in Medical Imaging:
Research Summary: This paper discusses clinical applications and implementation of
volumetric rendering in medical imaging, emphasizing potential uses in diagnostics,
preoperative planning, etc.
Potential Research Gap: The research provides an overview but lacks detailed insights
into specific volumetric rendering techniques or implementation challenges. Future
studies could focus on evaluating and comparing different rendering methods,
considering their efficacy, limitations, and practical feasibility in clinical settings.
Research on Square Wave Analysis for Television Picture Sharpness:
Research Summary: This study explores the analysis of square waves for evaluating
television picture sharpness, focusing on the relationship between transmitter
responses and image quality.



Potential Research Gap: While the research covers square wave analysis, further
exploration into advanced techniques for enhancing image sharpness could be
beneficial. Investigating modern image processing methods and their impact on image
quality in television could be an area of interest.
Research on Heart Wall Surface Detection in Cineangiograms:
Research Summary: This research presents a plan-guided analysis system for
cineangiograms, aimed at detecting heart wall surfaces and measuring wall thickness.
Potential Research Gap: While the study demonstrates effective boundary detection,
future research might focus on real-time implementation and validation across a
broader dataset. Exploring automated segmentation techniques and their robustness in
noisy dynamic images could further improve accuracy.
These summaries suggest potential areas for future research, including advancements
in machine learning algorithms, novel image processing techniques, validation across
diverse datasets, and real-time implementation for practical clinical applications.
Addressing these gaps could lead to more accurate and reliable diagnostic tools in
medical imaging and television picture processing.



3. Proposed System and Methodology

3.1 System Architecture and Design

Fig 1: Lung Cancer Prediction Architecture [25]

Designing the system architecture for a machine learning-based predictive model for lung
cancer risk assessment involves several components and considerations. Here's a high-
level overview of the system architecture and design model for such a project:

System Architecture:
1. Data Collection and Preprocessing:
 Data Sources: Gather data from diverse sources such as medical records, surveys,
public databases, and research studies containing information on smoking habits,
environmental pollutants, genetic predisposition, occupational hazards, and other
relevant parameters.
 Data Preprocessing Pipeline: Develop a robust pipeline for cleaning, formatting,
encoding, and standardizing data. This includes handling missing values, outlier
detection, and feature scaling.
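
To make the preprocessing pipeline concrete, the following is a minimal sketch using scikit-learn; the column names and the tiny in-line records are hypothetical placeholders rather than fields of an actual dataset.

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative records; the column names are assumptions, not real dataset fields.
df = pd.DataFrame({
    "age": [52, 67, None, 45],
    "pack_years": [30, None, 10, 0],
    "smoking_status": ["current", "former", "never", np.nan],
})

numeric_cols = ["age", "pack_years"]
categorical_cols = ["smoking_status"]

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),         # handle missing values
    ("scale", StandardScaler()),                          # feature scaling
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),   # encode categorical risk factors
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

X = preprocessor.fit_transform(df)   # cleaned, encoded, standardized feature matrix
print(X.shape)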



2. Feature Engineering and Selection:
 Feature Engineering: Implement techniques to extract, transform, and create
relevant features that contribute significantly to lung cancer risk assessment. This
might include normalization, dimensionality reduction, and feature scaling.
 Feature Selection: Employ methods to identify the most impactful features for
building the predictive model, such as correlation analysis, feature importance
ranking, and domain knowledge-based selection.
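
As an illustration of the feature-selection step, the sketch below ranks features by mutual information and keeps the top k; the synthetic data generated here merely stands in for the preprocessed risk-factor matrix.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in for the preprocessed feature matrix (X) and binary labels (y).
X, y = make_classification(n_samples=500, n_features=20, n_informative=6, random_state=0)

selector = SelectKBest(score_func=mutual_info_classif, k=10)   # keep the 10 most informative features
X_selected = selector.fit_transform(X, y)

print("Retained feature indices:", selector.get_support(indices=True))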

3. Model Development and Training:


 Machine Learning Model Selection: Experiment with various machine learning
algorithms (e.g., logistic regression, decision trees, random forests, neural
networks) to develop the predictive model.
 Model Training and Validation: Utilize a portion of the dataset for training,
perform hyperparameter tuning, and validate the model using cross-validation
techniques to ensure robustness and generalizability.
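
A minimal sketch of this stage, assuming scikit-learn and synthetic placeholder data, combines a random forest with grid-search hyperparameter tuning and 5-fold cross-validation:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data stands in for the real, preprocessed risk-assessment dataset.
X, y = make_classification(n_samples=1000, n_features=15, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                  # 5-fold cross-validation for generalizability
    scoring="roc_auc",
)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", search.best_estimator_.score(X_test, y_test))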

4. Interpretability and Explainability:


 Model Interpretation: Implement methods to enhance model interpretability, such as
feature importance visualization, SHAP (Shapley Additive explanations), LIME
(Local Interpretable Model-Agnostic Explanations), or other explainable AI
techniques.
 Visualizations: Generate visual aids to explain model predictions and help healthcare
professionals understand the rationale behind the risk assessments.
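
One lightweight way to approximate this step without extra dependencies is permutation importance from scikit-learn, sketched below on synthetic data; SHAP or LIME could be substituted where richer, instance-level explanations are needed.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Permutation importance: how much does shuffling each feature degrade the model?
X, y = make_classification(n_samples=800, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.4f}")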

5. Deployment and Integration:


 System Integration: Design an interface or platform to integrate the predictive
model into existing healthcare systems or as a standalone tool for easy access by
healthcare professionals.
 Scalability and Performance: Ensure the system's scalability and efficiency to
handle large volumes of data and provide real-time predictions.

Design Model:
1. Sequential Model:
The process flow might follow a sequential pattern, starting from data collection,
preprocessing, feature engineering, model development, validation, interpretation, and
finally, deployment.

2. Modular Design:
Modularize different components of the system architecture for easier maintenance and
scalability. Modules might include data ingestion, preprocessing, feature engineering,
model training, validation, and deployment.

3. Feedback Loop:



Implement a feedback loop mechanism to continuously improve the model by
incorporating new data, feedback from healthcare professionals, and advancements in
research.

4. Security and Privacy:


Incorporate robust security measures to protect sensitive patient data and ensure
compliance with privacy regulations (e.g., encryption, access controls, anonymization
techniques).

5. Documentation and Monitoring:


Document each stage of the system architecture and model development for transparency
and reproducibility. Implement monitoring tools to track model performance and data
drift.

6. Collaboration and Interdisciplinary Approach:


Encourage collaboration between data scientists, healthcare professionals, domain
experts, and ethicists throughout the project to ensure the model's accuracy, relevance,
and ethical compliance.

Conclusion:
The system architecture and design model for a machine learning-based predictive model
for lung cancer risk assessment should emphasize data quality, model performance,
interpretability, scalability, security, and ethical considerations. It should be flexible
enough to adapt to evolving data and healthcare needs while delivering accurate risk
assessments and actionable insights for early intervention and personalized preventive
measures.

Prediction Models
The prediction problem is formulated as binary classification. The hospitalization at which cancer was diagnosed was used to derive the class label: patients diagnosed with cancer were assigned to the positive class ('1'), and all other patients to the negative class ('0'). We experimented with two different RNN models. These models are well suited to sequence data, especially when each data point depends on the preceding one, as in our case, because they maintain a memory that stores the states of previous inputs in order to construct the sequence's subsequent output. This mechanism is known as the hidden state. The following equations describe the learning process:



To calculate the hidden state h_t for the next time step, we use the input weights W_x and the hidden-unit weights W_h together with the current input x_t, the previous hidden state h_(t-1), and the bias b_h of the recurrent layer, and then apply the nonlinear transformation ReLU:

h_t = ReLU(W_x * x_t + W_h * h_(t-1) + b_h)

To predict the output y_t, we multiply the newly learned hidden state by the weights W_y of the output layer, add the bias b_y of the output neurons, and pass the result through a sigmoid function:

y_t = sigmoid(W_y * h_t + b_y)

The first model contains layers with LSTM units capable of learning long-term
dependencies in sequential data. Remembering information for long periods is practically
their default behavior. The second model has layers with GRUs. Unlike the LSTM unit,
the GRU has gating units that modulate information flow without separating memory
cells [38]. This structure allows the model to adaptively capture dependencies from long data sequences without discarding information from earlier parts of the sequence.

The architectures of both models are identical, with one hidden layer of 64 neurons (Fig.
2). Empirical evaluation of RNN models showed that both the LSTM and GRU
demonstrated superiority over traditional ML models [39]. Since LSTM and GRU
architectures have shown superior results in various applications, we compared both in our experiments.



Fig 2: Architectural Model of LSTM [26]

SVD and an embedding layer were tested separately with both RNN methods. The output layer contains only one neuron with the sigmoid activation function. The adaptive learning-rate optimization algorithm Adam was used to train the RNN models [40].

A potential problem when training neural networks is the number of epochs: too many epochs can lead to overfitting, whereas too few may result in an underfit model. For this reason, the sequential learning models in our application used early stopping, which monitors the model's performance during training and halts it when the validation loss (binary cross-entropy) starts to increase consistently. As a result, both RNN models were trained for up to 20 epochs unless stopped earlier by this method.

We used a batch size of 64 because it reduces the memory required by the overall training procedure, and because small batch sizes have been reported across many applications to improve training stability and generalization performance [41].
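
A minimal Keras sketch of the setup described above (one recurrent layer of 64 units, a single sigmoid output neuron, Adam with binary cross-entropy, early stopping on the validation loss, batch size 64, up to 20 epochs) is given below; the input shape, the patience value, and the random tensors are placeholders, since the actual patient sequences are not reproduced here.

import numpy as np
import tensorflow as tf

timesteps, n_features = 30, 20            # assumed sequence length and feature count
X = np.random.rand(1000, timesteps, n_features).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

def build_model(cell="lstm"):
    recurrent = tf.keras.layers.LSTM(64) if cell == "lstm" else tf.keras.layers.GRU(64)
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(timesteps, n_features)),
        recurrent,                                        # one hidden recurrent layer of 64 units
        tf.keras.layers.Dense(1, activation="sigmoid"),   # single output neuron
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3,
                                              restore_best_weights=True)
model = build_model("gru")
model.fit(X, y, epochs=20, batch_size=64, validation_split=0.25, callbacks=[early_stop])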

To compare the performance of the proposed sequence learning models, we also trained four standard machine learning models: DT, MLP, RF, and KNN. Only the default settings provided in the scikit-learn Python library were used for DT and MLP, without parameter
tuning [42]. For RF and KNN, we used the standard implementation with basic settings
(for RF the maximum depth was set to 10 and the number of trees to 100, while for KNN
the number of nearest neighbors was 3). All prediction models were run separately for
each of the four studied cancers. We trained the models on 80% of patients selected entirely at random and used the remaining 20% for testing; 25% of the training set was held out for validation. All models were run on balanced datasets, and we
measured test accuracy, Area Under the Receiver Operating Characteristic curve
(AUROC), sensitivity (recall), specificity, precision, and F1 score. Prediction accuracy was chosen as the primary metric since there are equal numbers of patients in both classes for each cancer. However, we also reported the AUROC score for a more comprehensive
evaluation of the models. The difference between these two metrics is based on the
decision threshold, i.e. class probability threshold. In binary classification, the threshold is
the value over which a sample is assigned to class one. AUROC is a metric that evaluates
a binary classifier's output over decision thresholds varying between 0 and 1, whereas the
accuracy indicates how well a classifier performs for the default threshold of 0.5. High accuracy and high AUROC indicate that the classifier performs well for the default threshold and similarly well for many other threshold values. In addition, a highly accurate classifier should have high sensitivity and specificity. Since the AUROC score
summarizes the model's efficacy in terms of sensitivity and specificity for various
decision thresholds, we calculated those two metrics only for the 0.5 threshold.
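
The baseline comparison described above can be sketched as follows; the settings mirror the text (default DT and MLP, RF with a maximum depth of 10 and 100 trees, KNN with 3 neighbours, an 80/20 split), while the synthetic data is only a placeholder for the real cohorts.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

models = {
    "DT": DecisionTreeClassifier(),                       # scikit-learn defaults
    "MLP": MLPClassifier(),                               # defaults; may warn about convergence on toy data
    "RF": RandomForestClassifier(max_depth=10, n_estimators=100),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    y_prob = model.predict_proba(X_te)[:, 1]
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    print(name,
          "acc=%.3f" % accuracy_score(y_te, y_pred),
          "auroc=%.3f" % roc_auc_score(y_te, y_prob),
          "sens=%.3f" % recall_score(y_te, y_pred),
          "spec=%.3f" % (tn / (tn + fp)),
          "prec=%.3f" % precision_score(y_te, y_pred),
          "f1=%.3f" % f1_score(y_te, y_pred))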

Fig 3: GRU’s accuracy Comparison [27]



3.2. System Flow and Use Case

Creating a use case diagram for the machine learning-based predictive model for lung
cancer risk assessment involves identifying the primary actors interacting with the system
and illustrating their interactions. Here's a simplified representation of the use case
diagram:

Use Case Diagram for Lung Cancer Risk Assessment System:


Actors:
 Healthcare Professional: Interacts with the system to access risk assessments and
recommendations for patients.
 System: Represents the machine learning-based predictive model for lung cancer
risk assessment.

Use Cases:
Collect Data:
 Description: The system collects diverse data sources related to patients' smoking
habits, environmental exposure, genetics, etc.
 Actors: System
Preprocess Data:
 Description: The system cleans, preprocesses, and prepares the collected data for
model training.
 Actors: System
Train Model:
 Description: The system utilizes machine learning algorithms to train the
predictive model based on the preprocessed data.
 Actors: System
Validate Model:
 Description: The system evaluates the trained model's performance using
validation techniques.
 Actors: System
Provide Risk Assessment:
 Description: Healthcare professionals interact with the system to obtain
personalized lung cancer risk assessments for patients.
 Actors: Healthcare Professional, System
Present Recommendations:
 Description: The system provides actionable recommendations based on the risk
assessment for early intervention and personalized preventive measures.
 Actors: Healthcare Professional, System

Fig 4: Use Case Diagram of the System [28]

Relationships:
 Healthcare Professional --> Provide Risk Assessment --> System: Initiates the
request for patient-specific risk assessment.
 Healthcare Professional --> Present Recommendations --> System: Receives
personalized recommendations based on the risk assessment.
 System --> Collect Data --> System: Collects diverse data sources for model
training.
 System --> Preprocess Data --> System: Cleans and prepares collected data for
training.
 System --> Train Model --> System: Utilizes data to train the predictive model.



 System --> Validate Model --> System: Assesses the model's performance
through validation.

This use case diagram outlines the primary interactions between the actors (healthcare
professionals and the system) and the key functionalities involved in the development and
utilization of the predictive model for lung cancer risk assessment.

A use case is a representation of interactions between an actor (an external entity, which
can be a user or another system) and a system. It describes the functionality or behavior
of a system from the perspective of its users. Each use case represents a specific goal or
action that an actor wants to achieve when interacting with the system.

Components of a Use Case:


 Use Case Name: Describes the action or goal that an actor wants to accomplish.

 Actors: Represent entities interacting with the system. They can be users, external
systems, or any other role that engages with the system to achieve specific tasks.

 Description: Details the specific functionality or behavior associated with the use
case.

 Trigger: Describes the event or condition that initiates the use case.

 Preconditions: Specifies any conditions that must be true for the use case to start.

 Postconditions: States the expected outcome or state of the system after the use
case is completed successfully.

 Flow of Events: Describes the sequence of steps or actions that occur when the
use case is executed. It typically includes the main flow (basic course of actions)
and alternative flows (exceptions or variations).

 Exceptions: Covers exceptional scenarios or error conditions that might occur during the execution of the use case.



Example: Use Case - Provide Risk Assessment

Use Case Name: Provide Risk Assessment


Actors: Healthcare Professional, System
Description: This use case involves a healthcare professional interacting with the system
to obtain personalized risk assessments for patients regarding their likelihood of
developing lung cancer.
Trigger: The healthcare professional requires a risk assessment for a specific patient or
group of patients.

Preconditions:
 The system has collected and preprocessed relevant patient data.
 The machine learning model for lung cancer risk assessment is trained and
validated.

Postconditions:
 The healthcare professional receives the personalized risk assessment for the
patient(s).
 The system maintains the confidentiality and security of patient data.

Flow of Events:
 Healthcare Professional requests risk assessment: The healthcare professional
logs into the system and provides patient-specific information required for the risk
assessment.
 System processes the request: The system utilizes the trained predictive model to
analyze the provided data and generates a personalized risk assessment.
 System presents risk assessment: The system displays the risk assessment
results to the healthcare professional, providing insights into the patient's
likelihood of developing lung cancer.
 Healthcare Professional reviews and interprets the assessment: The healthcare
professional interprets the risk assessment and uses it to inform further medical
decisions or interventions.

Exceptions:
If the system encounters errors in data processing or model failure, it notifies the
healthcare professional and prompts appropriate actions or troubleshooting steps.



Fig 5: UML Sequence Diagram

A Unified Modeling Language (UML) diagram for the machine learning-based predictive
model for lung cancer risk assessment involves various components such as class
diagrams, activity diagrams, sequence diagrams, and more. For the purposes of this
project, let's create a high-level UML diagram outlining the main components and their
interactions:

UML Diagram for Lung Cancer Risk Assessment System:


Class Diagram:
A class diagram showcases the system's classes, their attributes, methods, and
relationships.

Classes:
 Data Collector: Responsible for collecting diverse data sources.
 Data Preprocessor: Handles data cleaning, formatting, and preprocessing tasks.
 Model Trainer: Utilizes machine learning algorithms to train the predictive
model.



 Model Validator: Evaluates the trained model's performance using validation
techniques.
 Healthcare Professional: Represents the user interacting with the system.
 Predictive Model: Encapsulates the machine learning model for lung cancer risk
assessment.
Activity Diagram:
An activity diagram illustrates the flow of activities or processes within the system.

Activities:
 Collect Data: DataCollector gathers data from various sources.
 Preprocess Data: Data Preprocessor cleans and prepares the collected data.
 Train Model: Model Trainer uses data to train the predictive model.
 Validate Model: Model Validator assesses the model's performance.
 Provide Risk Assessment: Interaction between Healthcare Professional and
Predictive Model to obtain risk assessments.
 Present Recommendations: Predictive Model presents actionable
recommendations based on risk assessments.
Sequence Diagram:
A sequence diagram shows the interactions between objects in a specific scenario or use
case.

Sequence:
 Healthcare Professional -> Provide Risk Assessment -> Predictive Model:
Healthcare Professional initiates a request for risk assessment.
 Predictive Model -> Provide Risk Assessment -> Healthcare Professional:
Predictive Model generates and provides risk assessment to Healthcare
Professional.
 Healthcare Professional -> Present Recommendations -> Predictive Model:
Healthcare Professional receives and interprets the recommendations.
This UML diagram provides a high-level overview of the system's components (classes),
their relationships, and the flow of activities (activity and sequence diagrams) involved in
the development and utilization of the predictive model for lung cancer risk assessment. It
serves as a visual representation to understand the system's structure and behavior at a
conceptual level.



Fig 6: Flowchart of the methodology for Cancer Detection

SAM (State-Action-Model) is an architectural pattern used for structuring front-end applications. However, it might not directly apply to a machine learning-based predictive
model for lung cancer risk assessment, which typically involves data processing, model
development, and deployment in a backend or server-side environment. Nonetheless, I
can provide an adapted interpretation of SAM principles tailored to the development and
deployment phases of the predictive model system:

State:
In the context of the lung cancer risk assessment model:



 Data State: Represents the diverse data collected from various sources (smoking
habits, environmental pollutants, genetic predisposition, occupational hazards,
etc.).
 Preprocessed Data State: Indicates the cleaned, formatted, and preprocessed data
ready for model training.
 Trained Model State: Signifies the machine learning model trained on the
preprocessed data.
 Validation State: Denotes the state where the model is validated for its accuracy
and performance.
 Prediction State: Represents the system's ability to predict lung cancer risk for a
specific individual based on input data.

Action:
Actions refer to the transformation of the system's state. In this context:

 Collect Data Action: Involves the collection of diverse data sources related to lung
cancer risk factors.
 Preprocess Data Action: Cleansing, formatting, and preparing the collected data
for model training.
 Train Model Action: Utilizes the preprocessed data to train the machine learning
model.
 Validate Model Action: Evaluates and validates the trained model's performance
using cross-validation or other techniques.
 Predict Risk Action: Involves using the trained and validated model to predict
lung cancer risk for individuals.

Model:
The model here represents the machine learning model itself, developed to predict lung
cancer risk based on various input factors.

 Machine Learning Model: Includes the algorithms, parameters, and trained weights resulting from the model training process.
 Model Evaluation Metrics: Indicate the performance metrics (accuracy, precision,
recall, etc.) obtained during model validation.
 Deployment Model: The model in its deployable form integrated into a system or
application for real-time risk assessment.
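
As a purely illustrative sketch of this adapted SAM view (not part of the actual system), the pipeline states and the actions that move between them can be expressed as a small Python state container:

from enum import Enum, auto

class PipelineState(Enum):
    RAW_DATA = auto()
    PREPROCESSED = auto()
    TRAINED = auto()
    VALIDATED = auto()
    SERVING = auto()

class RiskAssessmentSAM:
    def __init__(self):
        self.state = PipelineState.RAW_DATA
        self.model = None                           # the "Model" part of State-Action-Model

    def preprocess(self, raw_records):
        # Action: clean the raw records, then advance the state.
        cleaned = [r for r in raw_records if r is not None]
        self.state = PipelineState.PREPROCESSED
        return cleaned

    def train(self, features, labels):
        # Action: fit a model (placeholder logic) and advance the state.
        self.model = {"n_samples": len(features)}   # stand-in for a fitted estimator
        self.state = PipelineState.TRAINED

    def validate(self, metrics):
        # Action: record evaluation metrics and mark the model as validated.
        self.model["metrics"] = metrics
        self.state = PipelineState.VALIDATED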



3.3. System Algorithm
Developing a machine learning-based predictive model for lung cancer risk assessment
involves several algorithms and techniques that contribute to different stages of the
system. Here's an overview of the key algorithms and methodologies involved in the
system's workflow:

1. Data Collection and Preprocessing:


 Data Collection Algorithm: Depending on the sources, different algorithms
might be employed to retrieve data from various repositories or sources.
 Data Cleaning Algorithm: Techniques such as outlier detection, handling
missing values, and normalization methods to ensure data quality and consistency.
 Feature Engineering Techniques: Algorithms like principal component analysis
(PCA), feature scaling, or selection algorithms to derive relevant features from
raw data.

2. Model Development and Training:


 Supervised Learning Algorithms: Utilizing supervised learning algorithms to
train the predictive model:
 Logistic Regression: For binary classification predicting lung cancer risk.
 Decision Trees: Capturing non-linear relationships between features.
 Random Forests: Ensemble technique for improved accuracy and robustness.
 Support Vector Machines (SVM): Separating data into classes using
hyperplanes.
 Neural Networks: Deep learning models for complex pattern recognition.
 Hyperparameter Tuning Algorithms: Grid Search, Random Search, or Bayesian
Optimization to fine-tune model hyperparameters for better performance.
 Cross-Validation Algorithms: K-fold cross-validation or stratified cross-
validation to assess model generalizability.

3. Model Evaluation and Validation:


Evaluation Metrics: Algorithms to compute various performance metrics:
 Accuracy, Precision, Recall, F1-score: Assessing the model's overall performance.
 ROC-AUC (Receiver Operating Characteristic - Area Under Curve): Evaluating
binary classification performance.
 Confusion Matrix Analysis: Understanding model prediction and
misclassification.
Validation Techniques: Bootstrapping, Monte Carlo cross-validation, or holdout
validation for robust model evaluation.

4. Interpretability and Explainability:



 Feature Importance Algorithms: Methods like SHAP (Shapley Additive
explanations), LIME (Local Interpretable Model-Agnostic Explanations), or
permutation importance to understand feature contributions.
 Visualization Algorithms: Algorithms for generating visual aids like feature
importance plots, decision boundaries, or partial dependence plots for
interpretation.

5. Deployment and Integration:


 Scalable Algorithms: Ensuring the chosen algorithms are scalable and efficient
for real-time prediction in a production environment.
 Integration Algorithms: API integration, containerization (e.g., Docker), or
deployment on cloud platforms using algorithms for streamlined deployment.
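
As one possible integration sketch (an assumption, not the project's actual deployment), a trained and serialized model could be exposed behind a small REST endpoint, for example with Flask:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("lung_risk_model.joblib")   # assumed path to a serialized scikit-learn model

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                # expects {"features": [[...], ...]}
    probabilities = model.predict_proba(payload["features"])[:, 1]
    return jsonify({"risk_scores": probabilities.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)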

Conclusion:
The system algorithm for developing a machine learning-based predictive model for lung
cancer risk assessment involves a diverse set of algorithms encompassing data collection,
preprocessing, model development, evaluation, interpretability, and deployment. The
choice of algorithms depends on factors like data characteristics, model complexity,
interpretability requirements, and deployment environments, among others. These
algorithms collectively contribute to creating a robust and accurate predictive tool for
lung cancer risk assessment.

Decision Tree
Role in Lung Cancer Risk Assessment:
Feature Importance:
Decision trees help identify the most crucial features influencing lung cancer risk by
assessing feature importance. Attributes like smoking habits, environmental pollutants,
genetic predisposition, etc., are ranked based on their contribution to classification.
 Interpretability:



Decision trees offer high interpretability, making it easier for healthcare professionals to
understand and explain the model's predictions. The decision path in the tree can be
visualized and easily comprehended.
 Handling Non-Linear Relationships:

Fig 7: Decision Tree [29]

Decision trees can capture non-linear relationships between input factors and lung cancer
risk, which might be crucial as certain risk factors might not have a linear impact on the
risk.

Model Building Process:


Tree Construction:
The tree construction starts with the entire dataset and recursively splits it into subsets
based on features to create decision nodes.



The splitting occurs based on metrics like Gini impurity or information gain to maximize
the homogeneity of subsets concerning the target variable (lung cancer risk).
Pruning:
Techniques like pre-pruning (limiting tree depth, setting a minimum number of samples
for a split) or post-pruning (pruning nodes after tree construction) are used to prevent
overfitting.
Prediction:
Once the tree is built, predictions for lung cancer risk are made by traversing the tree
from the root node to leaf nodes, where the final prediction or class label resides.
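
The construction, pruning, and prediction steps above map directly onto scikit-learn's DecisionTreeClassifier, as in the sketch below; the data are synthetic and the pruning values are illustrative.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=600, n_features=8, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

tree = DecisionTreeClassifier(
    criterion="gini",        # splitting metric (use criterion="entropy" for information gain)
    max_depth=4,             # pre-pruning: limit tree depth
    min_samples_split=10,    # pre-pruning: minimum samples required to split a node
    random_state=7,
).fit(X_tr, y_tr)

print(export_text(tree))                              # human-readable decision paths
print("Test accuracy:", tree.score(X_te, y_te))
print("Feature importances:", tree.feature_importances_)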



4. Result and Discussions

4.1. Model Performance and Graphs


The latency and performance of a machine learning-based predictive model for lung
cancer risk assessment can vary based on several factors, including data size, model
complexity, chosen algorithms, hardware infrastructure, and real-time deployment
requirements. Here's a detailed breakdown:

Latency:
Training Latency:
Training a machine learning model involves processing the collected data, feature
engineering, algorithm execution, hyperparameter tuning, and model validation. The
duration can range from minutes to several hours or even days, depending on the dataset
size, algorithm complexity, and available computational resources.
Prediction Latency:
Once the model is trained and deployed, the time taken to predict lung cancer risk for an
individual depends on:
 Model Complexity: Simple models like logistic regression might have lower
prediction times compared to complex models like deep neural networks.
 Size of Input Data: Larger input data or higher dimensionality may increase
prediction time.
 Hardware and Software Infrastructure: Utilization of powerful hardware
(GPUs/TPUs) and optimized software frameworks can reduce prediction latency.
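
Prediction latency can be measured directly around the model's predict call, as in the simple timing sketch below (synthetic model and data, illustrative only):

import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=30, random_state=3)
model = LogisticRegression(max_iter=1000).fit(X, y)

start = time.perf_counter()
model.predict_proba(X[:1000])                 # batch of 1,000 "patients"
elapsed = time.perf_counter() - start
print(f"Latency: {elapsed * 1000:.2f} ms total, {elapsed / 1000 * 1e6:.1f} microseconds per record")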

Performance:
Model Performance Metrics:
 Accuracy: The ability of the model to correctly predict lung cancer risk.
 Precision: Proportion of correctly predicted positive instances (lung cancer cases)
among all instances predicted as positive.
 Recall: Proportion of correctly predicted positive instances among all actual
positive instances.
 F1-score: Harmonic mean of precision and recall, balancing both metrics.
 ROC-AUC: Area under the Receiver Operating Characteristic curve, assessing the
model's ability to distinguish between classes.
Validation and Testing:



 The model's performance is evaluated using validation techniques (e.g., cross-
validation) on separate datasets to ensure its generalizability and reliability.
Scalability and Resource Utilization:
 The model's ability to handle increased data sizes, maintain consistent
performance, and efficiently utilize available computational resources (CPU,
memory, GPUs) is crucial for scalability.

Factors Affecting Latency and Performance:


Dataset Size and Complexity:
Larger datasets with more features can increase both training and prediction latency.
Model Complexity and Algorithms:
Complex models like ensemble methods (e.g., Random Forests) or deep learning architectures might have longer training times but potentially higher performance.
Hardware Infrastructure:
Utilization of GPUs or specialized hardware accelerators can significantly reduce
computation time for training and predictions.
Optimization Techniques:
Optimizing algorithms, feature engineering, and utilizing parallel processing or
distributed computing can enhance performance.

For context, a large published study of lung cancer risk prediction (the CanPredict [lung] model) reported the following.

Background: Lung cancer is the second most common cancer in incidence and the leading cause of cancer deaths worldwide. Meanwhile, lung cancer screening with low-dose CT can reduce mortality. The UK National Screening Committee recommended targeted lung cancer screening on Sept 29, 2022, and asked for more modelling work to be done to help refine the recommendation. The study aims to develop and validate a risk prediction model, the CanPredict (lung) model, for lung cancer screening in the UK and to compare its performance against seven other risk prediction models.

Methods: For this retrospective, population-based cohort study, linked electronic health records were used from two English primary care databases: QResearch (Jan 1, 2005 to March 31, 2020) and Clinical Practice Research Datalink (CPRD) Gold (Jan 1, 2004 to Jan 1, 2015). The primary study outcome was an incident diagnosis of lung cancer. A Cox proportional-hazards model was used in the derivation cohort (12.99 million individuals aged 25-84 years from the QResearch database) to develop the CanPredict (lung) model in men and women. Discrimination measures (Harrell's C statistic, D statistic, and the explained variation in time to diagnosis of lung cancer, R²D) and calibration plots were used to evaluate model performance by sex and ethnicity, using data from QResearch (4.14 million people for internal validation) and CPRD (2.54 million for external validation). Seven models for predicting lung cancer risk (Liverpool Lung Project [LLP]v2, LLPv3, Lung Cancer Risk Assessment Tool [LCRAT], Prostate, Lung, Colorectal, and Ovarian [PLCO]M2012, PLCOM2014, Pittsburgh, and Bach) were selected to compare their performance with the CanPredict (lung) model using two approaches: (1) in ever-smokers aged 55-74 years (the population recommended for lung cancer screening in the UK), and (2) in the populations determined by each model's own eligibility criteria.

Findings: There were 73,380 incident lung cancer cases in the QResearch derivation cohort, 22,838 cases in the QResearch internal validation cohort, and 16,145 cases in the CPRD external validation cohort during follow-up. The predictors in the final model included sociodemographic characteristics (age, sex, ethnicity, Townsend score), lifestyle factors (BMI, smoking and alcohol status), comorbidities, family history of lung cancer, and personal history of other cancers. Some predictors differed between the models for women and men, but model performance was similar between sexes. The CanPredict (lung) model showed excellent discrimination and calibration in both internal and external validation of the full model, by sex and ethnicity. The model explained 65% of the variation in time to diagnosis of lung cancer in both sexes in the QResearch validation cohort and 59% (R²D) in both sexes in the CPRD validation cohort. Harrell's C statistics were 0.90 in the QResearch (validation) cohort and 0.87 in the CPRD cohort, and the D statistics were 2.8 in the QResearch (validation) cohort and 2.4 in the CPRD cohort. Compared with seven other lung cancer prediction models, the CanPredict (lung) model had the best performance in discrimination, calibration, and net benefit across three prediction horizons (5, 6, and 10 years) in the two approaches. The CanPredict (lung) model also had higher sensitivity than the currently recommended UK models (LLPv2 and PLCOM2012), as it identified more lung cancer cases than those models when screening the same number of individuals at high risk.

Fig 8: ROC curves for risk prediction models in the MOLTEST BIS cohort. ROC, receiver operating characteristic curve; LLP, Liverpool Lung Project; AUC, area under the receiver operating characteristic curve. [30]

Interpretation: The CanPredict (lung) model was developed, and internally and externally validated, using data from 19.67 million people from two English primary care databases. The model has potential utility for risk stratification of the UK primary care population and for selecting individuals at high risk of lung cancer for targeted screening. If the model is recommended for implementation in primary care, each individual's risk can be calculated using information in the primary care electronic health records, and people at high risk can be identified for the lung cancer screening programme.
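
The study's core technique, a Cox proportional-hazards model evaluated with Harrell's C statistic, can be sketched in Python with the lifelines package; this is not the study's code, and the covariates, follow-up times, and event indicators below are random placeholders.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(25, 85, size=1000),
    "smoker": rng.integers(0, 2, size=1000),
    "bmi": rng.normal(27, 4, size=1000),
    "time_to_event": rng.exponential(10, size=1000),   # years of follow-up (placeholder)
    "lung_cancer": rng.integers(0, 2, size=1000),      # event indicator (placeholder)
})

cph = CoxPHFitter()
cph.fit(df, duration_col="time_to_event", event_col="lung_cancer")
cph.print_summary()
print("Harrell's C (concordance):", cph.concordance_index_)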



Graphs

Fig 9: Graphs



5. Conclusion and Recommendations

5.1. Summary and Concluding Remark


The project revolves around the creation of a machine learning-based predictive model
tailored for lung cancer risk assessment. The primary aim is to develop a robust tool
capable of evaluating an individual's probability of developing lung cancer by considering
a wide array of influential factors such as smoking habits, exposure to environmental
pollutants, genetic predisposition, occupational hazards, and other pertinent parameters.
The comprehensive model development process involves multiple stages: initial data
collection from diverse sources, meticulous data preprocessing, including cleaning and
feature engineering, model development using various algorithms like decision trees,
logistic regression, random forests, and neural networks, followed by model validation
through performance evaluation metrics like accuracy, precision, recall, F1-score, and
ROC-AUC. Additionally, the project emphasizes interpretability and explainability,
incorporating methods for feature importance and model visualization to ensure
healthcare professionals can comprehend and trust the model's predictions. Latency in
training and prediction, along with performance metrics, is crucial for assessing the
model's efficiency. Balancing model complexity, hardware infrastructure, optimization
techniques, and continuous updates based on evolving data and technology are pivotal for
maintaining accuracy, scalability, and relevance over time. Ultimately, the envisioned
outcome is a reliable predictive tool facilitating early intervention and personalized
preventive measures in the realm of lung cancer assessment within healthcare.

The primary objective of this project is to develop a robust machine learning-based predictive model specifically designed for assessing an individual's risk of developing
lung cancer. The model aims to leverage an extensive range of critical input factors,
including smoking habits, exposure to environmental pollutants, genetic predisposition,
occupational hazards, and other relevant parameters. This comprehensive model
development process entails several stages, starting from the collection of diverse data
sources to meticulous preprocessing, feature engineering, and the utilization of various
algorithms such as decision trees, logistic regression, random forests, and neural
networks. Furthermore, model validation using performance evaluation metrics like
accuracy, precision, recall, F1-score, and ROC-AUC is integral to ensuring the model's
reliability and effectiveness.

Moreover, emphasis has been placed on interpretability and explainability, incorporating
methods for understanding feature importance and visualizing the model's decision-
making process. This strategic approach aims to enhance the model's transparency and
facilitate the comprehension of predictions by healthcare professionals. Latency in both the training and prediction phases, together with the model's performance metrics, serves as a critical evaluation criterion for assessing the model's efficacy.

In conclusion, this project represents a multifaceted endeavor to develop an accurate, interpretable, and reliable predictive tool for lung cancer risk assessment. Striking a
balance between model complexity, hardware optimization, and continuous updates based
on evolving data trends and technological advancements will be pivotal for maintaining
the model's accuracy, scalability, and practical relevance within healthcare settings.
Ultimately, the envisioned outcome is a cutting-edge predictive model that not only
identifies individuals at risk of developing lung cancer but also aids in enabling early
interventions and personalized preventive measures in the realm of healthcare.

Lung cancer is the leading cause of cancer-related death in this generation and is expected to remain so for the foreseeable future. Treatment is feasible when the symptoms of the disease are detected early. Current developments in computational intelligence make it possible to construct a sustainable prototype model for lung cancer detection and treatment support without negatively impacting the environment: by reducing wasted resources and the manual effort required, it saves both time and money. To optimise detection on the lung cancer dataset, a machine learning model based on support vector machines (SVMs) was used. An SVM classifier categorises lung cancer patients based on their symptoms, and the Python programming language was used to implement the model. The effectiveness of the SVM model was evaluated against several different criteria, using cancer datasets from the University of California, Irvine (UCI) machine learning repository. The favourable findings of this research can help smart cities deliver better healthcare to their citizens: patients with lung cancer can obtain real-time, cost-effective assessment with minimal effort and latency from any location and at any time. The proposed model was compared with existing SVM and SMOTE methods and achieved an accuracy of 98.8%, outperforming the existing methods.

The data were obtained from the UCI machine learning repository. The dataset contains 32 examples, each with 57 features, and all predictive attributes take nominal values in the range 0-3. The nominal attribute and class label values are translated into binary form, which makes the data easier to analyse; this conversion from nominal to binary form is the most widely used and standardized method in data analysis. The dataset contains missing values, which affect the performance of the algorithm, so care must be taken when analysing the data. The class label has three levels of severity: high, medium, and low. Because a significant amount of the input data is missing, the data are prepared by replacing missing values with the most frequent value in the corresponding column. The processed data are then analysed using a Python tool: classifiers are applied once the data have been transformed into a form suitable for classification. Ten-fold cross-validation is applied to assess the classifier; the available data are split into ten folds, and training and evaluation are repeated ten times so that predictions are obtained for every record, which yields a more reliable estimate than a single train/test split. Classification accuracy is defined as the number of correct predictions out of the total number of predictions. These values depend on the outcome of the experiment and are expressed in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
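
A compact way to reproduce this evaluation protocol is sketched below: 10-fold cross-validated predictions followed by a confusion-matrix breakdown into TP, TN, FP, and FN; synthetic data stands in for the UCI lung cancer set.

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=5)

y_pred = cross_val_predict(SVC(), X, y, cv=10)          # every sample is predicted exactly once
tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()

print("Accuracy:", accuracy_score(y, y_pred))
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")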

The proposed method is the most efficient because of the computations performed in this system. Once the given data are supplied, they are compared in several formats against the many records in the other datasets attached to the system and analysed. These analyses compute the structure and dimensions of the data while comparing the given data with the records in the attached datasets, and the data involved in these calculations define the decision boundaries. The changes in these boundaries as neighbouring samples are linked together help the method to analyse the various shape models and to compute the result more accurately. Thus, its accuracy is high.

As demonstrated by the evaluation findings, SVM with two iterations of SMOTE resampling (Figures 3-8) produced the best performance on the Lung Cancer dataset. Compared with earlier methods, this approach achieves the highest value for every parameter investigated. The lung cancer dataset contains two minority classes, so after two rounds of SMOTE all classes are equally represented. A third run of SMOTE generates synthetic samples for class B, which was previously the majority class; however, these additional samples do not improve classification performance. The best way to combine SVM and SMOTE is therefore to apply SMOTE twice on the same dataset before training the SVM.
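
Assuming the imbalanced-learn package is available, the SVM-plus-SMOTE combination could be sketched as below on a simplified, synthetic two-class problem; the study's three-level severity labels and its repeated resampling rounds for each minority class are not reproduced exactly.

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Imbalanced synthetic data standing in for the lung cancer dataset.
X, y = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

# Oversample the minority class before training the SVM.
X_res, y_res = SMOTE(random_state=4).fit_resample(X_tr, y_tr)

clf = SVC().fit(X_res, y_res)
print(classification_report(y_te, clf.predict(X_te)))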

5.2. Practical Uses and Implications

The development of a machine learning-based predictive model for lung cancer risk
assessment holds several practical uses and significant implications within healthcare and
beyond:

Practical Uses:



Early Intervention and Preventive Measures:
Identification of individuals at a higher risk of developing lung cancer enables healthcare
professionals to implement targeted preventive measures and interventions. This could
include personalized counseling, regular screenings, lifestyle modifications, and cessation
programs for smoking or reducing exposure to environmental pollutants.
Improved Patient Care and Management:
Healthcare providers can tailor patient care plans based on individualized risk
assessments, optimizing resource allocation and prioritizing care for high-risk individuals.
This leads to more efficient and effective healthcare delivery.
Healthcare Resource Allocation:
Targeted risk assessment assists in efficient allocation of healthcare resources by focusing
on high-risk groups or individuals, optimizing screening programs, and allocating
interventions where they are most needed.
Public Health Policies and Awareness Campaigns:
Insights from the predictive model could inform public health policies aimed at reducing
lung cancer risk factors on a larger scale. It can also support the development of public
awareness campaigns for smoking cessation, environmental regulations, and occupational
safety measures.

Implications:
Early Detection and Improved Outcomes:
Early identification of individuals at risk may lead to early detection of lung cancer,
potentially improving treatment outcomes by enabling timely intervention and
management.
Ethical Considerations:
Handling sensitive health-related data and making predictions about an individual's health
condition raises ethical concerns regarding patient privacy, data security, informed
consent, and fair use of predictive analytics in healthcare.
Health Equity and Accessibility:
Ensuring equitable access to risk assessment tools and interventions is crucial to prevent
exacerbating health disparities among different socioeconomic groups or regions.
Continuous Improvement and Validation:
Ongoing validation, refinement, and improvement of the model are critical to maintaining
accuracy, especially considering the evolving nature of medical data and healthcare
practices.

Overall Impact:



The successful implementation of a predictive model for lung cancer risk assessment has
the potential to significantly impact public health strategies, patient care, resource
allocation, and individual health outcomes. By enabling early identification of at-risk
individuals and facilitating targeted interventions, such a model can contribute to
reducing the burden of lung cancer and improving overall healthcare effectiveness and
efficiency. However, careful consideration of ethical, legal, and social implications is
essential to ensure responsible and equitable use of predictive analytics in healthcare.

Roadblocks
Developing a machine learning-based predictive model for lung cancer risk assessment
involves several challenges and roadblocks that can hinder the project's progress. Some of
the key roadblocks include:

Data Quality and Availability:


 Data Accessibility: Accessing comprehensive and diverse datasets encompassing
various risk factors like smoking habits, environmental exposure, genetic
predisposition, and occupational hazards can be challenging due to data silos or
limited availability.
 Data Quality Issues: Incomplete, inconsistent, or biased data can impact the
model's accuracy and reliability. Handling missing values, outliers, and ensuring
data consistency poses significant challenges.

Model Development and Performance:


 Model Complexity and Overfitting: Complex models might lead to overfitting,
reducing the model's generalizability. Balancing model complexity with
interpretability and performance is challenging.
 Algorithm Selection and Tuning: Choosing the right algorithms and
hyperparameters, along with optimizing model performance without
compromising accuracy, is a complex task.

Interpretability and Explainability:


 Interpretability of Model: Ensuring the model's predictions are interpretable and
explainable to healthcare professionals is crucial for its acceptance and trust.
Black-box models might lack interpretability.
Ethical and Regulatory Challenges:



 Privacy and Confidentiality: Dealing with sensitive patient health data requires
strict adherence to privacy regulations (such as HIPAA in the United States) and
ensuring patient confidentiality.
 Ethical Considerations: Making predictions about an individual's health condition
raises ethical concerns regarding consent, fairness, bias, and the responsible use of
predictive analytics in healthcare.

Deployment and Integration:


 Scalability and Deployment: Deploying the model in real-world healthcare
settings while ensuring scalability, efficiency, and compatibility with existing
systems can be a complex task.
 Continual Validation and Improvement: Continuous validation and improvement
of the model to adapt to evolving data and healthcare practices require ongoing
resources and efforts.

Feasibility Analysis
A feasibility analysis for a machine learning-based predictive model for lung cancer risk
assessment involves evaluating various aspects to determine the project's viability, including
technical, economic, operational, and scheduling feasibility.

Technical Feasibility:
 Data Availability and Quality: Assess the availability of diverse data sources
containing relevant factors like smoking habits, environmental exposure, genetic
predisposition, etc. Evaluate data quality, considering completeness, consistency, and
potential biases.
 Technology and Tools: Determine the feasibility of employing suitable technologies,
algorithms, and tools for data preprocessing, model development, validation, and
deployment. Consider hardware and software requirements for computational
resources.

 Model Complexity and Interpretability: Assess the feasibility of developing a model that balances complexity with interpretability, ensuring healthcare professionals can comprehend and trust the model's predictions.

Economic Feasibility:



 Cost Estimation: Evaluate the costs associated with data acquisition, data
preprocessing, model development, validation, infrastructure, deployment,
maintenance, and personnel (data scientists, healthcare experts, IT professionals).
 Return on Investment (ROI): Estimate potential benefits in terms of improved
healthcare outcomes, reduced healthcare costs through early intervention, and
resource optimization against the incurred costs.

Operational Feasibility:
 Resource Availability: Assess the availability of skilled personnel, domain experts
(healthcare professionals), and IT infrastructure needed for model development,
implementation, and ongoing maintenance.
 Integration with Healthcare Systems: Determine the feasibility of integrating the
predictive model into existing healthcare systems or workflows while ensuring
compatibility and acceptance by healthcare professionals.

Scheduling Feasibility:
 Timeline and Milestones: Evaluate the feasibility of meeting project deadlines,
considering the complexities involved in data collection, preprocessing, model
development, validation, and deployment.
 Risk Assessment and Mitigation: Identify potential risks (e.g., data quality issues,
model performance limitations, regulatory hurdles) and develop mitigation strategies
to address them.

5.3. Future Work and Enhancement


Future work and enhancements for the machine learning-based predictive model for lung
cancer risk assessment encompass a range of possibilities aimed at advancing its
accuracy, interpretability, scalability, and applicability within healthcare settings. Some
prospective areas for further development include:
Enhanced Data Integration and Quality Improvement:
Incorporation of Additional Data Sources: Integrating more comprehensive and diverse
datasets, including longitudinal data, genetic markers, and environmental exposure
records, to enhance the model's predictive capabilities.
Advanced Data Preprocessing Techniques: Implementing more sophisticated methods for
handling missing data, outlier detection, and feature engineering to improve data quality
and ensure the model's robustness.
Model Development and Interpretability:



Ensemble Methods and Advanced Algorithms: Exploring ensemble learning techniques or
advanced algorithms to enhance predictive accuracy while maintaining model
interpretability for healthcare professionals.
Explainable AI (XAI) Techniques: Implementing state-of-the-art explainable AI methods
to improve the model's interpretability, providing clear insights into the factors
influencing the risk assessment predictions.
Validation and Continuous Improvement:
Longitudinal Studies and Real-Time Validation: Conducting longitudinal studies to
validate the model's performance over time and considering real-time validation methods
to adapt the model to evolving data trends.
Feedback Mechanisms and Iterative Updates: Implementing feedback loops from
healthcare practitioners to continuously refine and update the model, ensuring it stays
relevant and aligned with clinical practices.
Ethical Considerations and Regulatory Compliance:
Ethical Framework and Fairness Assessments: Developing an ethical framework for the
model's usage, including fairness assessments to mitigate biases and ensure equitable
predictions across diverse populations.
Regulatory Compliance and Data Privacy: Continuously aligning with evolving
healthcare regulations, ensuring compliance with data privacy laws, and adopting secure
data handling practices.
Integration and Deployment:
Scalability and Real-World Deployment: Enhancing the model's scalability for large-scale
deployment in diverse healthcare settings, ensuring seamless integration with existing
healthcare systems, and optimizing deployment for real-time risk assessment.
Collaborative Partnerships and Knowledge Sharing: Establishing collaborative
partnerships with healthcare institutions for broader data access, domain expertise, and
knowledge sharing to drive continuous improvements.
The future work and enhancements for the machine learning-based predictive model for
lung cancer risk assessment aim to propel its accuracy, interpretability, regulatory
compliance, and practicality within healthcare. Continual advancements in data
integration, model development, ethical considerations, and deployment strategies will be
instrumental in fostering a reliable and effective predictive tool that aids in early
intervention and personalized preventive measures, ultimately improving outcomes in
lung cancer assessment and healthcare delivery.
In our research, we leveraged 45,856 de-identified chest CT screening cases (in some of which cancer was found) from the NIH's research dataset from the National Lung Screening Trial study and from Northwestern University. We validated the results with a second dataset and also compared our results against six U.S. board-certified radiologists.



When using a single CT scan for diagnosis, our model performed on par or better than the
six radiologists. We detected five percent more cancer cases while reducing false-positive
exams by more than 11 percent compared to unassisted radiologists in our study. Our
approach achieved an AUC of 94.4 percent (AUC is a common metric used in machine
learning and provides an aggregate measure for classification performance).

Despite the value of lung cancer screenings, only 2-4 percent of eligible patients in the
U.S. are screened today. This work demonstrates the potential for AI to increase both
accuracy and consistency, which could help accelerate adoption of lung cancer screening
worldwide.
These initial results are encouraging, but further studies will assess the impact and utility
in clinical practice. We’re collaborating with Google Cloud Healthcare and Life Sciences
team to serve this model through the Cloud Healthcare API and are in early conversations
with partners around the world to continue additional clinical validation research and
deployment.



6. References

[1] M.I. Faisal, S. Bashir, Z.S. Khan, F.H. Khan, “An evaluation of machine learning
classifiers and ensembles for early-stage prediction of lung cancer” December 2018 3rd
International Conference on Emerging Trends in Engineering, Sciences and Technology
(ICEEST), IEEE (2018), pp. 1-4
[2] J. Cabrera, A. Dionisio and G. Solano, "Lung cancer classification tool using microarray
data and support vector machines", Information Intelligence Systems and Applications
(IISA), 2015, July, 2015.
[3] Z. Yu, X. Z. Chen, L. H. Cui, H. Z. Si, H. J. Lu and S. H. Liu, "Prediction of lung cancer
based on serum biomarkers by gene expression programming methods", Asian Pacific Journal
of Cancer Prevention, vol. 15, no. 21, pp. 9367-9373, 2014.
[4] H. Shin, S. Oh, S. Hong, M. Kang, D. Kang, Y.G. Ji, Y. Choi “Early-stage lung cancer
diagnosis by deep learning-based spectroscopic analysis of circulating exosomes” ACS
Nano, 14 (5) (2020), pp. 5435-5444
[5] S.H. Hyun, M.S. Ahn, Y.W. Koh, S.J. Lee “A machine-learning approach using PET-
based radiomics to predict the histological subtypes of lung cancer” Clin. Nucl. Med., 44
(12) (2019), pp. 956-960
[6] W. Rahane, H. Dalvi, Y. Magar, A. Kalane, S. Jondhale “Lung cancer detection using
image processing and machine learning healthcare” 2018, March International Conference
on Current Trends towards Converging Technologies (ICCTCT), IEEE (2018), pp. 1-5
[7] B. A. Miah and M. A. Yousuf, "Detection of Lung cancer from CT image using Image
Processing and Neural network", 2nd International Conference on Electrical Engineering and
Information and Communication Technology (ICEEICT), May 2015.
[8] B.V. Ginneken, B. M. Romeny and M. A. Viergever, "Computer-aided diagnosis in chest
radiography: a survey", IEEE transactions on medical imaging, vol. 20, no. 12, 2001.
[9] H. Becker, W. Nettleton, P. Meyers, J. Sweeney and C. Nice, Jr., "Digital computer
determination of a medical diagnostic index directly from chest X-ray images", IEEE Trans.
Biomed. Eng., vol. BME-11, pp. 67-72, 1964.
[10] L. S. Kovasznay and H. M. Joseph, "Image processing", Proc. IRE, vol. 43, pp. 560-570,
May 1955.
[11] P. C. Goldmark and J. M. Hollywood, "A new technique for improving the sharpness of
pictures", PRoc. I.R.E., vol. 39, pp. 1314, October 1951.
[12] Bedford and Fredendall, "Analysis synthesis and evaluation of the transient response of
television apparatus", Proc. I.R.E., vol. 30, pp. 453-455, October 1942.
[13] J. Duncan and N. Ayache, "Medical image analysis: Progress over two decades and the
challenges ahead", IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 85-106, Jan. 2000.

[14] R. Shahidi, R. Tombropoulos and R.P.A. Grzeszczuk, "Clinical Applications of Three-
Dimensional Rendering of Medical Data Sets", Proc. IEEE, vol. 86, no. 3, pp. 555-568, Mar.
1998.
[15] M. Yachida, M. Ykeda and S. Tsuji, "A Plan-Guided Analysis of Cineangiograms for
Measurement of Dynamic Behavior of the Heart Wall", IEEE Trans. Pattern Analysis and
Machine Intelligence, vol. 2, pp. 537-543, 1980.
[16] J. Talukdar and P. Sarma, "A survey on lung cancer detection in CT scans images using
image processing techniques", International Journal of Current Trends in Science and
Technology, vol. 8, no. 3, pp. 20181-20186, 2018.
[17] K. H. Yu, C. Zhang, G. J. Berry, et al., "Predicting non-small cell lung cancer prognosis
by fully automated microscopic pathology image features", Nature Communications, vol. 7,
article 12474, 2016. doi: 10.1038/ncomms12474.
[18] P. Cirujeda, Y. D. Cid, H. Muller, et al., "A 3-D Riesz-covariance texture model for
prediction of nodule recurrence in lung CT", IEEE Transactions on Medical Imaging, vol. 35,
no. 12, pp. 2620-2630, 2016. doi: 10.1109/TMI.2016.2591921.
[19] P. B. Sangamithraa and S. Govindaraju, "Lung tumour detection and classification using
EK-mean clustering", Proceedings of the 2016 IEEE International Conference on Wireless
Communications, Signal Processing and Networking (WiSPNET), Chennai, India, 2016.
[20] M. Kurkure and A. Thakare, "Lung cancer detection using genetic approach", Proceedings
of the 2nd International Conference on Computing, Communication, Control and Automation
(ICCUBEA), Pune, India, 2017.
[21] N. Kureshi, S. S. R. Abidi and C. Blouin, "A predictive model for personalized
therapeutic interventions in non-small cell lung cancer", IEEE Journal of Biomedical and
Health Informatics, vol. 20, no. 1, pp. 424-431, 2016. doi: 10.1109/JBHI.2014.2377517.
[22] A. Kumar, B. Gautam, C. Dubey and P. K. Tripathi, "A review: role of doxorubicin in
treatment of cancer", International Journal of Pharmaceutical Sciences and Research, vol. 5,
no. 10, pp. 4117-4128, 2014.
[23] A. Kulkarni and A. Panditrao, "Classification of lung cancer stages on CT scan images
using image processing", IEEE International Conference on Advanced Communication, Control
and Computing Technologies (ICACCCT), Ramanathapuram, India, 2014, pp. 1384-1388.
[24] D. D. Westaway, C. W. Toon, M. Farzin, et al., "The International Association for the
Study of Lung Cancer/American Thoracic Society/European Respiratory Society grading system
has limited prognostic significance in advanced resected pulmonary adenocarcinoma",
Pathology, vol. 45, no. 6, pp. 553-558, 2013. doi: 10.1097/PAT.0b013e32836532ae.
[25] "Automatic detection of lung cancer from biomedical data set using discrete AdaBoost
optimized ensemble learning generalized neural networks."
[26] B. Cortez, "An architecture for emergency event prediction using LSTM recurrent
neural networks."

[27] J. Peters and S. Schaal, "Policy gradient methods for robotics", IEEE International
Conference on Intelligent Robots and Systems, pp. 2219-2225, 2006. ISBN 142440259X.
doi: 10.1109/IROS.2006.282564.
[28] Y. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli and Y. Bengio, "Identifying
and attacking the saddle point problem in high-dimensional non-convex optimization",
arXiv:1406.2572 [cs, math, stat], June 2014. URL: http://arxiv.org/abs/1406.2572.
[29] https://www.kaggle.com/code/subhajeetdas/lung-cancer-prediction/notebook
[30] M. Firmino, A. H. Morais, R. M. Mendoa, M. R. Dantas, H. R. Hekis and R. Valentim,
"Computer-aided detection system for lung cancer in computed tomography scans: Review and
future prospects", BioMedical Engineering OnLine, vol. 13, article 41, Apr. 2014.
ISSN 1475-925X. doi: 10.1186/1475-925X-13-41.

7. Appendix

7.1. Technical Details and Additional Graphs/Charts

Import Libraries
!pip install dtreeviz

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score, \
    confusion_matrix, ConfusionMatrixDisplay, classification_report

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import warnings
warnings.filterwarnings("ignore")

/kaggle/input/cancer-patients-and-air-pollution-a-new-link/cancer patient data sets.csv

Load Data
df = pd.read_csv("/kaggle/input/cancer-patients-and-air-pollution-a-new-link/cancer patient data sets.csv")
df

Data Cleaning & Visualization


df.isnull().sum()

Fig 10: Input Data

sns.heatmap(df.isnull(), cmap = 'viridis')

Fig 11: Axes Input Plot

df.drop(columns=['index', 'Patient Id'], axis=1, inplace=True)


df
df.size

df.dtypes

df.iloc[:, 1:24].plot(title="Dataset Details")
df_corr = df.corr()
df_corr

Fig 12: Dataset Details

plt.title("Correlation Matrix")
sns.heatmap(df_corr, cmap='viridis')
sea = sns.FacetGrid(df, col = "Level", height = 4)
sea.map(sns.distplot, "Age")

Fig 13: Correlation Matrix

sea = sns.FacetGrid(df, col = "Level", height = 4)


sea.map(sns.distplot, "Gender")
x = df.iloc[:, 0:23]
x

df['Level'].replace(to_replace = 'Low', value = 0, inplace = True)
df['Level'].replace(to_replace = 'Medium', value = 1, inplace = True)
df['Level'].replace(to_replace = 'High', value = 2, inplace = True)

df['Level'].value_counts()

plt.figure(figsize = (20, 27))

for i in range(24):
    plt.subplot(8, 3, i+1)
    sns.distplot(df.iloc[:, i], color = 'red')
    plt.grid()

plt.figure(figsize = (11, 9))

plt.title("Lung Cancer Chances Due to Air Pollution")
plt.pie(df['Level'].value_counts(), explode = (0.1, 0.02, 0.02),
        labels = ['High', 'Medium', 'Low'], autopct = "%1.2f%%", shadow = True)
plt.legend(title = "Lung Cancer Chances", loc = "lower left")

Fig 14: Lung Cancer due to Air Pollution

sns.displot(df['Level'], kde=True)

Fig 15: Level Vs Count

y = df.Level.values
y

Train & Test Splitting the Data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

Function to Measure Performance


def perform(y_pred):
    # Print the headline metrics, the confusion matrix and a full classification
    # report for the held-out test set, then plot the confusion matrix with class labels.
    print("Precision : ", precision_score(y_test, y_pred, average = 'micro'))
    print("Recall : ", recall_score(y_test, y_pred, average = 'micro'))
    print("Accuracy : ", accuracy_score(y_test, y_pred))
    print("F1 Score : ", f1_score(y_test, y_pred, average = 'micro'))
    cm = confusion_matrix(y_test, y_pred)
    print("\n", cm)
    print("\n")
    print("**"*27 + "\n" + " " * 16 + "Classification Report\n" + "**"*27)
    print(classification_report(y_test, y_pred))
    print("**"*27 + "\n")

    cm = ConfusionMatrixDisplay(confusion_matrix = cm,
                                display_labels = ['Low', 'Medium', 'High'])
    cm.plot()

Random Forest
model_rf = RandomForestClassifier()
model_rf.fit(x_train, y_train)
y_pred_rf = model_rf.predict(x_test)
perform(y_pred_rf)
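
The pickle module is imported at the top of the notebook but not used in the listing
reproduced here; presumably the trained model is serialized for reuse. A minimal sketch,
assuming the Random Forest is the model being saved and using an illustrative filename:

# Assumption: model_rf is the model to persist; the filename is illustrative only.
with open("lung_cancer_rf.pkl", "wb") as f:
    pickle.dump(model_rf, f)

# Reload and sanity-check the saved model on the held-out test split.
with open("lung_cancer_rf.pkl", "rb") as f:
    loaded_model = pickle.load(f)
print(loaded_model.score(x_test, y_test))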

Fig 16: Label Graph
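
The pages omitted here train the remaining classifiers; the dtreeviz visualization below
assumes a fitted decision tree (model_dt) and a feature_names list, neither of which appears
in the reproduced listing. A minimal sketch of that missing step, under those assumptions:

# Sketch only: fit a decision tree so that model_dt and feature_names exist for the
# dtreeviz call below; max_depth=4 is an illustrative choice that keeps the tree readable.
model_dt = DecisionTreeClassifier(max_depth = 4)
model_dt.fit(x_train, y_train)
y_pred_dt = model_dt.predict(x_test)
perform(y_pred_dt)

feature_names = list(x.columns)   # column names of the feature matrix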

import dtreeviz

viz_model = dtreeviz.model(model_dt,
                           X_train=x_train, y_train=y_train,
                           feature_names=feature_names,
                           target_name='Lung Cancer',
                           class_names=['Low', 'Medium', 'High'])

v = viz_model.view()           # render as SVG into internal object
v.save("Lung Cancer.svg")      # save as svg

viz_model.view()

