
VIETNAM NATIONAL UNIVERSITY, HANOI

INTERNATIONAL SCHOOL

FINAL REPORT
Enterprise Analytics for Decision Support

Topic: Lung Cancer Analysis & Prediction Based on Classification Algorithms
Instructor: Đỗ Trung Tuấn

Student names: Nguyễn Minh Đức - 21070519
Trịnh Quang Minh - 21070143
Trần Việt Dũng - 21070726
Nguyễn Bích Diệp - 21070891
Đinh Duy Minh - 21070638

Hanoi, December 29th, 2023


Contribution

No | Member | Contribution | Dedication
1 | Nguyễn Minh Đức | Model planning, training, and visualization | 20%
2 | Trịnh Quang Minh | Coding, research | 20%
3 | Trần Việt Dũng | Recommendation, PowerPoint | 20%
4 | Nguyễn Bích Diệp | Introduction, report | 20%
5 | Đinh Duy Minh | Presentation, PowerPoint | 20%

Table of Contents

Contribution
I. INTRODUCTION
  1. Topic Selection
    1.1 Data Source
    1.2 Context
    1.3 Content
    1.4 Columns
    1.5 Goal
II. DESCRIPTIVE METHODS
  1. Develop analytical models with classification
    1.1 Pre-Processing Data
    1.2 Feature selection
    1.3 Splitting the dataset into training and testing sets
  2. Develop Classification Models
    2.1 Model Planning
    2.2 Model Training
III. RECOMMENDATION
IV. ACKNOWLEDGMENT

I. INTRODUCTION
1. Topic Selection
1.1 Data Source
Cancer Patients Data (kaggle.com)
Our code source

1.2 Context
Many people's lives are cut short by cancer, and the number of cancer patients continues to rise. However, thanks to advances in big data, we are now better equipped to combat this malicious disease.

1.3 Content
This dataset contains information about cancer patients and their lifestyles. By analyzing this data, we can gain insight into the factors associated with lung cancer and how best to predict a patient's risk of the disease.

1.4 Columns

• Age: The age of the patient. (Numeric)
• Gender: The gender of the patient. (Categorical)
• Air Pollution: The level of air pollution exposure of the patient. (Categorical)
• Alcohol use: The level of alcohol use of the patient. (Categorical)
• Dust Allergy: The level of dust allergy of the patient. (Categorical)
• Occupational Hazards: The level of occupational hazards of the patient. (Categorical)
• Genetic Risk: The level of genetic risk of the patient. (Categorical)
• Chronic Lung Disease: The level of chronic lung disease of the patient. (Categorical)
• Balanced Diet: The level of a balanced diet of the patient. (Categorical)
• Obesity: The level of obesity of the patient. (Categorical)
• Smoking: The level of smoking of the patient. (Categorical)
• Passive Smoker: The level of passive smoking exposure of the patient. (Categorical)
• Chest Pain: The level of chest pain of the patient. (Categorical)
• Coughing of Blood: The level of coughing of blood of the patient. (Categorical)
• Fatigue: The level of fatigue of the patient. (Categorical)
• Weight Loss: The level of weight loss of the patient. (Categorical)
• Shortness of Breath: The level of shortness of breath of the patient. (Categorical)
• Wheezing: The level of wheezing of the patient. (Categorical)
• Swallowing Difficulty: The level of swallowing difficulty of the patient. (Categorical)
• Clubbing of Finger Nails: The level of clubbing of the fingernails of the patient.
(Categorical)

1.5 Goal

• Determine which patients have cancer and which do not, using these patient records.
• Predict which people are prone to cancer with respect to parameters such as their Age, Gender, Alcohol Use, Genetic Risk, and Smoking.

II. DESCRIPTIVE METHODS


1. Develop analytical models with classification
1.1 Pre-Processing Data
We use Jupyter Notebook to build our analytical model. Jupyter Notebook is a familiar environment for data analysis that lets us write and execute arbitrary Python code in the browser, making it a perfect fit for machine learning, data visualization, and analysis.

The very first step before analyzing the data is to import the libraries that are already installed in the Jupyter environment.
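
A minimal import cell might look like the following sketch; the exact set of libraries is assumed from the pandas, pyplot, and seaborn-style steps described in the rest of this report:

```python
# Core libraries for data handling, plotting, and statistical charts.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```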

We upload the “data sets.csv” file to the notebook and display the collected dataset using pandas. Displaying the head shows the first 5 rows of all 25 columns (including 1 label column).
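
A sketch of this step, assuming the file name mentioned above:

```python
# Load the collected dataset into a pandas DataFrame.
df = pd.read_csv("data sets.csv")

# Show the first five rows of the 25 columns (including the label column).
df.head()
```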

First, we check for possible missing data in the set. Fortunately, the data file is completely filled. However, if the dataset had missing values, we could fill them with the interpolate function rather than removing the data (after categorizing the data, of course).
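
A minimal sketch of the missing-value check:

```python
# Count missing values per column; every count is zero for this dataset.
print(df.isnull().sum())

# If numeric columns did have gaps, interpolation would be one
# alternative to dropping rows:
# df = df.interpolate()
```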

After that, we use the pandas describe function to display basic descriptive statistics about the data set, including the mean, standard deviation, min, max, and percentiles.
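
In code, this is a single call:

```python
# Summary statistics (mean, std, min, max, percentiles) for each column.
df.describe()
```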

As we can see, the “Level” column's dtype is not int. To make the labels usable by basic machine learning algorithms, we must convert the letters into numbers. In this case, Low is encoded as 1, Medium as 2, and High as 3.
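
A sketch of the encoding, assuming the column holds the strings "Low", "Medium", and "High":

```python
# Map the categorical risk labels to integers (1 = Low, 2 = Medium, 3 = High).
level_map = {"Low": 1, "Medium": 2, "High": 3}
df["Level"] = df["Level"].map(level_map)
```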

Here we use pyplot to check whether there is any imbalance in the dataset by visualizing the proportion of each class (1 = Low, 2 = Medium, 3 = High). In this case, the categories are balanced at roughly 1:1:1, which makes class bias unlikely.
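
One way to draw this check, as a sketch:

```python
# Bar chart of how many patients fall into each risk level.
df["Level"].value_counts().sort_index().plot(kind="bar")
plt.xlabel("Level (1 = Low, 2 = Medium, 3 = High)")
plt.ylabel("Number of patients")
plt.show()
```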

1.2 Feature selection
For analysis purposes, we can break the data down further by gender. Doing so shows a higher representation of gender one than gender two.
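
A seaborn sketch of that breakdown, assuming the Gender column from the list above:

```python
# Risk levels split by gender (coded 1 and 2 in this dataset).
sns.countplot(data=df, x="Level", hue="Gender")
plt.show()
```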

Let's start by looking to see whether there is any correlation in the data. We create a correlation matrix and display it as a heat map to clearly show the correlation between each pair of variables.
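
A sketch of the heat map; numeric_only keeps any non-numeric ID columns out of the calculation:

```python
# Correlation matrix of the (numerically encoded) features as a heat map.
plt.figure(figsize=(12, 10))
sns.heatmap(df.corr(numeric_only=True), cmap="coolwarm")
plt.show()
```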

As you can see, the heat map shows a high correlation between some of the variables, such as Alcohol use with Genetic Risk and Occupational Hazards with Genetic Risk. However, since there are many variables to look at, we can narrow the set down to the most important ones using the SelectKBest algorithm with the ANOVA F-ratio statistic.

This helps improve interpretability and reduces overfitting and training time. The method generates F-ratio scores for all features, from which we can determine which ones to use for machine learning.
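
A minimal sketch, assuming “Level” is the target column and the remaining feature columns are numeric:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Separate features and target, then score every feature with the ANOVA F-ratio.
X = df.drop(columns=["Level"])
y = df["Level"]

selector = SelectKBest(score_func=f_classif, k="all").fit(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores)
```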

The F-score table displays the most important variables. We take the features that scored more than 200, as they showed the least redundancy.

As we can see, 13 features were selected, and we created a new data frame to hold them. We can now move on to preparing the data for machine learning.
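
Continuing the scoring sketch above, the new data frame could be built like this:

```python
# Keep only the features whose F-score exceeded 200 (13 features here).
selected = scores[scores > 200].index
X_new = X[selected]
```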

1.3 Splitting the dataset into training and testing sets


Prior to building the models, we separate the data into a training set and a testing set, setting random_state=0 so as to get the same train and test split across different executions. The data is split so we can fit a scaler on the training data and then apply it to the unseen (test) data. We hold out 25% of the data for testing.
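
A sketch of the split, using the selected features from the previous section:

```python
from sklearn.model_selection import train_test_split

# Hold out 25% of the data; random_state=0 makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X_new, y, test_size=0.25, random_state=0
)
```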

After that, the data is scaled with the StandardScaler function from the sklearn package, which applies the Z-score formula. This can help reduce the effect of outliers when modeling later.
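
As a sketch, the scaler is fitted on the training set only and then applied to both sets:

```python
from sklearn.preprocessing import StandardScaler

# Learn the Z-score parameters (mean, std) from the training data only,
# then apply the same transformation to the unseen test data.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
```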

2. Develop Classification Models


2.1 Model Planning
The dataset is now ready for analysis. For this classification problem, we use the algorithm listed below:

• Logistic Regression

The metrics used for evaluation are Confusion Matrix, Accuracy, Precision, Recall, and F1
Score.

Before we go into the specifics of each metric, it is vital to understand the four building blocks that make up the assessment metrics:

• True Positive (TP): Both the actual label and the prediction are positive.
• True Negative (TN): Both the actual label and the prediction are negative.
• False Positive (FP): The actual label is negative, but the prediction is positive.
• False Negative (FN): The actual label is positive, but the prediction is negative.

Confusion Matrix

A confusion matrix is an N × N matrix used to assess the effectiveness of a classification model, where N represents the number of target classes. The matrix compares the actual target values to the machine learning model's predictions. This provides us with a comprehensive picture of how well our classification model is working and the kinds of errors it is producing.

For a binary classification problem, we would have a 2 x 2 matrix as shown below:
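
              Predicted: No         Predicted: Yes
Actual: No    TN (True Negative)    FP (False Positive)
Actual: Yes   FN (False Negative)   TP (True Positive)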


In this case, there are two potential predicted classes: "yes" and "no." Since we are forecasting the presence of lung cancer, "yes" means the patient has the condition, and "no" means they don't.

Accuracy

Accuracy is the most intuitive performance metric: it is simply the ratio of correctly predicted observations to the total observations. However, the fundamental downside of accuracy is that it obscures the problem of class imbalance.

Accuracy = Correct Predictions / All Predictions = (TP + TN) / (TP + TN + FP + FN)

Precision

Precision is a measure of how many of the positive predictions made are correct (true positives). It is the fraction of samples that the classifier predicted to be in the positive class out of all samples predicted to be in the positive class.

Precision = Correct Positive Predictions / All Positive Predictions = TP / (TP + FP)

Recall

Recall is a statistic that counts the number of correct positive predictions out of all actual positives. Recall is the go-to metric when there is a large cost associated with a false negative.

Recall = Correct Positive Predictions / All Actual Positives = TP / (TP + FN)

F1 score

The F1-score combines a classifier's precision and recall into a single statistic by calculating their harmonic mean. Its primary application is comparing the performance of multiple classifiers. Assume classifier A has greater recall but classifier B has better precision; the F1-scores of both classifiers can be used to assess which one delivers superior results in this scenario. It is usually employed in the evaluation of binary classification systems.

F1 score = 2 × (Precision × Recall) / (Precision + Recall)
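
As a sketch, all of these metrics are available in scikit-learn; y_pred below stands for the predictions of the fitted model from the next section, and the weighted average accounts for the three Low/Medium/High classes:

```python
from sklearn.metrics import (
    confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
)

# y_test: actual labels; y_pred: the model's predictions on the test set.
print(confusion_matrix(y_test, y_pred))
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred, average="weighted"))
print("Recall   :", recall_score(y_test, y_pred, average="weighted"))
print("F1 score :", f1_score(y_test, y_pred, average="weighted"))
```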

2.2 Model Training
With the new, cleaned data, having dropped the 12 least related columns, the predictive model should provide good accuracy.

• Logistic Regression
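
A sketch of training and evaluating the model on the scaled data from the previous section:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Fit logistic regression on the scaled training data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

# Evaluate on the held-out test set.
y_pred = model.predict(X_test_scaled)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred, average="weighted"))
```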

We can see that the accuracy is quite good (0.93) while the F1 score is also high at 0.87. Hence, this model is highly recommended.

III. RECOMMENDATION
It is commonly known that the older you get, the more diseases you develop. Surprisingly, however, the 30 to 40 age group was the most likely to develop lung cancer. As you can see in the graph, the number of people in their 30s with lung cancer is almost twice that of people in their 20s and other age groups.

We strongly recommend regular blood tests after age 30 to better prevent and diagnose
possible lung cancer risks.

We also created a few predictive models. From the results, Logistic Regression (93%) is the best predictive model we created. Users can enter the frequency of their symptoms and receive a predicted result. We hope that one day patients will be able to self-test and screen their symptoms at home, simply entering numbers and getting results on the spot.

Overall, our findings hold promise for enhanced patient interaction, better lung cancer
monitoring, and real-time data exchange and analysis. Furthermore, it allows everyone to be
actively involved in achieving desired health outcomes, saving costs, and making optimal
decisions.

IV. ACKNOWLEDGMENT
Firstly, our group would like to express our gratitude to the International School, VNU Hanoi (VNU-IS), for including the Enterprise Analytics for Decision Support course in the curriculum, allowing us the opportunity to acquire such valuable knowledge.
Particularly, we want to extend our sincerest thanks to Assoc. Prof. PhD Do Trung Tuan for wholeheartedly imparting knowledge to us. The time spent studying under your guidance has been truly amazing, as it involved not only theoretical learning but also practical and valuable experiences. This will serve as a foundation for us to confidently step onto the path we've chosen. The Data Science course has been not only beneficial but also highly practical.
However, due to our limited knowledge and some difficulties in grasping the practical aspects, we acknowledge that our report may contain shortcomings and inaccuracies. Despite putting in our best effort, we are certain that there are areas that need improvement. We humbly request your review and feedback to help refine our work.
We sincerely appreciate your consideration and guidance!
