Application of Exploratory Data Analysis To Generate Inferences On The Occurrence of Breast Cancer Using A Sample Dataset



Conference Paper · June 2020


DOI: 10.1109/ICIEM48762.2020.9160290


2020 International Conference on Intelligent Engineering and Management (ICIEM)

Sabeel Ashfaq Khan
Department of Computer Science and Engineering
Amity University Dubai
Dubai, UAE
sabeelK@amitydubai.ae

Senthil Velan S
Department of Computer Science and Engineering
Amity University Dubai
Dubai, UAE
svelan@amityuniversity.ae

Abstract: Exploratory Data Analysis (EDA) is a data analysis technique that can be used to visually represent the knowledge embedded in a given data set. The technique can be applied to medical data processing to improve the services offered by healthcare providers. Breast cancer has become prevalent in women and requires proper methods to identify the possibility of occurrence at an early stage for treatment and cure. In this context, the visualization techniques of EDA can be applied to existing datasets for learning and prediction. In this research work, the analysis aims to find the predominant features that help in predicting whether a tumor is benign or malignant. To achieve this goal, various graphical techniques are used for a better visual understanding. Based on the results obtained, EDA plays a significant role in describing the data using statistical and visualization analysis without making any speculations about the content. With this dataset and the EDA approach we have achieved a clear understanding of the significant features needed to predict whether the tumor is benign or malignant.

Keywords: Exploratory Data Analysis, R Language, Breast Cancer, Benign, Malignant

I. INTRODUCTION

Rich, high-volume data is the modern fuel that possesses inherent characteristics for driving the [17] intelligent decision-making abilities of today's smart businesses and services. Compared with the energy sector, unprocessed raw data is equivalent to crude oil, and the fuel that powers the internal combustion engine is the intelligent information processed from the raw data. Just as fractional distillation extracts different products from crude oil, extracting intelligent information at different levels improves decisions at different levels across the business unit.

[2] Exploratory data analysis (EDA) is a process by which a given data set is analyzed to extract useful information. The process commonly depicts the data in a visual form, enabling better understanding and supporting informed decision making by business entities.

The relevance of applying EDA in [6] medical data processing has increased rapidly, mainly because of how effective and improved these classification techniques are and how close to accurate their predictions can be. This also reduces the cost of medicines and improves the ability to conduct more clinical research.

With reference to cancer, [18] exploratory data analysis (EDA) plays a significant role in improving diagnosis, patient care, disease management and hospital administration. [5] Applying the EDA process can help to determine which cancer is present in the patient, and it can also help in identifying other attributes needed for developing a predictive model that could provide early diagnosis and early detection in patients with cancer.

Breast cancer has become a very common occurrence in women due to continuously changing food habits and various other factors. Diagnosing the occurrence at an early stage can be life-changing for a woman. EDA can help in analyzing the [15] visualized data to identify the occurrence of breast cancer at an earlier stage. A trained identification model can be developed to enable such identification, which helps in early detection and improves the possibility of a cure.

II. APPLYING EDA FOR BREAST CANCER DATA

A. Process Flow of EDA

EDA is the initial step of [9] deciphering data by first showing its visual representation using the different tools available in a data processing environment. The process flow of a generic EDA mechanism is shown in Fig. 1. The data is first cleaned and trimmed using built-in functions of R Studio, where the raw data is loaded and read using the function read.csv for a scripted dataset.

Fig. 1. A Generic Process Flow of EDA

After completing the analysis using either [4] R Studio or Python, two outcomes are obtained, as specified in Fig. 1. The first outcome is the enriched data and the second is the data visualization, which shows the enriched data being further classified using various algorithms.
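The loading and cleaning stage of this flow can be sketched in Python, which the paper names alongside R Studio. This is a minimal sketch rather than the authors' script: the three inline rows and the helper name `load_and_clean` are hypothetical, and only the column names `id` and `diagnosis` follow the dataset described in this paper.

```python
import csv
import io

# Hypothetical three-row sample standing in for the raw .csv file.
RAW_CSV = """id,diagnosis,radius_mean,texture_mean
1001,M,17.99,10.38
1002,B,13.54,14.36
1003,B,,15.70
"""

def load_and_clean(text):
    """Read CSV text, drop the id column, and split complete rows from rows with gaps."""
    complete, incomplete = [], []
    for row in csv.DictReader(io.StringIO(text)):
        row.pop("id")  # the identifier carries no predictive information
        if any(value == "" for value in row.values()):
            incomplete.append(row)
            continue
        # Convert every measurement to float; keep the diagnosis label as text.
        cleaned = {k: (v if k == "diagnosis" else float(v)) for k, v in row.items()}
        complete.append(cleaned)
    return complete, incomplete

complete_rows, incomplete_rows = load_and_clean(RAW_CSV)
print(len(complete_rows), "complete rows;", len(incomplete_rows), "rows with missing values")
```

Rows with gaps are returned separately here instead of being silently dropped, so that they can be re-checked as the text recommends.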

978-1-7281-4097-1/20/$31.00 ©2020 IEEE

Authorized licensed use limited to: AMITY University. Downloaded on September 20,2020 at 09:44:06 UTC from IEEE Xplore. Restrictions apply.
B. Application to Breast Cancer Data

Breast cancer is the most common cancer among women and a leading cause [19] of cancer death among them. A tumor, a type of abnormal growth, is identified, which can be malignant (cancerous) or benign (non-cancerous). Various tests such as [7] MRI, [14] mammogram and biopsy are performed to diagnose and identify the type of cancer. The Breast Cancer dataset is retrieved in raw format as a .csv file whose rows and columns hold the attributes and features that characterize the cancer cells.

Fig. 2 shows the process flow for applying EDA to the Breast Cancer dataset. The first step takes the raw Breast Cancer dataset as input. The second applies descriptive statistics to trim and clean the data for easier use. The third is univariate plotting, which produces a frequency pie chart and histograms of the features. The fourth is multivariate plotting, which gives the correlation between the features. The fifth is data preparation using the PCA model. The sixth applies the different classification models. The seventh is model evaluation, where the most suitable classification algorithm for the dataset is chosen. The last step is the comparison of diagnoses to determine whether the tumor is benign or malignant.

Fig. 2. Process Flow of applying EDA to Breast Cancer Dataset.

1) Breast Cancer Raw Dataset

This dataset includes 32 columns. Two of them are the main attributes: the ID number and the diagnosis (malignant or benign). The remaining columns describe ten real-valued features of the tumor, or breast cancer cell nucleus, each recorded as a mean, a standard error and a worst value; for example, the mean radius, the standard error of the radius and the worst radius. Initially the raw dataset was imported into R Studio using the function that reads a .csv file, and the required libraries such as ggplot2, reshape2, corrplot and caret, among others, were loaded.

2) Applying Descriptive Statistics

The first step is to inspect the raw dataset using the function str(bc_data), which shows the structure of the dataset. [3] Secondly, we drop the unwanted attributes and features to clean or trim the data. This makes the EDA process more efficient and the data visualization more accurate. [8] The function used to remove an unwanted attribute was bc_data <- medical_data[,-c(0:1)]. We then check for missing values: if the output is TRUE, we have to re-check the data and find them; if there are no missing values, the output is FALSE. Lastly, to tidy the data we convert the diagnosis to a factor:

bc_data$diagnosis <- as.factor(bc_data$diagnosis)

Once we have a good sense of the data, the next step is to take a closer look at the attributes and data using the function summary(bc_data).

3) Univariate Plotting

One of the main aims of this analysis is to visualize which features are most helpful in determining whether the tumor is malignant or benign. To understand and analyze the features we use univariate plotting. For this dataset we first computed the frequency of each cancer diagnosis: a frequency table was created and a pie chart was drawn from it. This pie chart displays the percentage of malignant and benign observations.

All the features were then visualized in terms of their mean, standard error and worst value with [10] histogram plots, by breaking up the columns into groups according to their suffix (_mean, _se and _worst).

4) Multivariate Analysis

In this step we analyze more than one variable at a time, where the variables can be correlated with each other. [11] The features provide 30 predictors, and we need to find the correlation between them. So we calculate the collinearity using the function:

corMatMy <- cor(bc_data[,2:31])

And to plot the correlation we used the function:

corrplot(corMatMy, order = "hclust", tl.cex = 0.7)

On further analysis we eliminate highly correlated features to avoid predictive bias, because they provide redundant information.

5) Data Preparation by creating PCA model

The PCA (Principal Component Analysis) model transforms the data to a new coordinate system such that the greatest variance by any scalar projection of the data lies on the first coordinate, PC1, and the second greatest variance on the second coordinate, PC2. Data preparation is an important step that exposes the structure of the data to the machine for further processing. PCA is helpful because a model can tend to fail when many of the features are correlated.

6) Classification Algorithms

This step is crucial for prediction on the dataset; it is a data mining technique that falls under machine learning. In this step various machine learning models such as Random Forest, KNN, Neural Networks (NNET) and Naïve Bayes are applied. The first part of this step trains the algorithm, the second makes predictions and the third evaluates the predictions.

7) Model Evaluation

In model evaluation we compare all the machine learning models that were trained and tested, and observe which model best suits the main purpose of the dataset. The comparison is done via bwplot and various other plotting methods to identify the highest ROI.

8) Comparison of diagnosis if it is Benign or Malignant

This last step is the final analysis, where we perform the prediction and data visualization on the real-time dataset to determine whether the given tumor or cell nucleus is benign (non-cancerous) or malignant (cancerous). Diagnoses of female patients are collected to classify whether the tumor is benign or malignant.

III. INFERENCES FROM THE DATASET

To evaluate the process flow of applying [13] EDA to the Breast Cancer dataset, four experiments were performed to visualize the dataset. This was done to understand which features have larger and which have smaller predictive value, so that we could create a model to classify whether the tumor is benign or malignant.

Fig. 3. Frequency of Cancer Diagnosis.
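The frequency table behind Fig. 3 can be sketched in Python (the authors worked in R). The class counts of 357 benign and 212 malignant observations are the ones reported in Experiment (1); the function name `diagnosis_shares` is hypothetical.

```python
from collections import Counter

def diagnosis_shares(labels):
    """Return {label: (count, percentage of all observations)}."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}

# Rebuild the label column from the class counts reported for this dataset.
labels = ["B"] * 357 + ["M"] * 212
shares = diagnosis_shares(labels)
print(shares)
```

With these counts the benign share is 62.7% and the malignant share is 37.3% of the 569 observations.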


Fig. 4. Histograms of the mean values of the features.

Fig. 5. Histograms of the standard errors of the features.

Fig. 6. Histograms of the worst values of the features.
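The grouping of columns by suffix that underlies Figs. 4 to 6 can be sketched in Python. The column list is a hypothetical subset that follows the naming pattern used in the paper, and `group_by_suffix` is an illustrative helper, not a function from the authors' code.

```python
from collections import defaultdict

SUFFIXES = ("_mean", "_se", "_worst")

def group_by_suffix(columns):
    """Map each suffix to the columns that end with it; other columns are skipped."""
    groups = defaultdict(list)
    for name in columns:
        for suffix in SUFFIXES:
            if name.endswith(suffix):
                groups[suffix].append(name)
                break
    return dict(groups)

columns = ["diagnosis", "radius_mean", "area_mean", "area_se", "symmetry_se", "perimeter_worst"]
groups = group_by_suffix(columns)
for suffix, names in groups.items():
    print(suffix, names)
```

Each resulting group would then be plotted as one panel of histograms, split by diagnosis.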


Fig. 7. Correlation plot of the features.
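The collinearity check behind Fig. 7 can be sketched as follows. The 0.9 cutoff is taken from Table I; the four short columns are made-up numbers, and the greedy filter `drop_correlated` is an assumption about how redundant features might be removed, not the authors' procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, cutoff=0.9):
    """Keep a feature only if |r| <= cutoff against every feature already kept."""
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Made-up measurements: perimeter and area closely track the radius.
features = {
    "radius_mean": [10.0, 12.0, 14.0, 18.0, 20.0],
    "perimeter_mean": [63.0, 76.0, 88.0, 113.0, 126.0],
    "area_mean": [314.0, 452.0, 616.0, 1018.0, 1257.0],
    "texture_mean": [18.0, 12.0, 21.0, 14.0, 17.0],
}
print(drop_correlated(features))
```

In this toy data only the radius and texture columns survive, because perimeter and area carry almost the same information as the radius. The result of a greedy filter depends on column order, so a different ordering could keep a different representative of each correlated group.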

A. Experiment (1) Frequency Pie Chart:

Experiment (1) displays a pie chart. In Fig. 3, M = Malignant indicates the presence of cancer cells and B = Benign indicates their absence. Of the 569 observations, 357 (62.7%) are benign [14] and 212 (37.3%) are malignant. Since the dataset is not very large, the number of positives is relatively high; with a large enough dataset, the share of negatives would be expected to be significantly higher and that of positives lower.

B. Experiment (2) Visualizing data via Histograms:

Experiment (2) presents the features as histograms. Fig. 4 shows the mean of each feature, identified by the suffix _mean, with the variables plotted against the diagnosis (B = Benign or M = Malignant). In this figure the best separation is found in area_mean and perimeter_mean.

Fig. 5 shows the standard error of each feature, identified by the suffix _se, plotted against the diagnosis in the same way. In this figure the best separation is found in symmetry_se and smoothness_se.

Fig. 6 shows the worst value of each feature, that is, the mean of the three largest values, identified by the suffix _worst and again plotted against the diagnosis. In this figure the best separation is found in concave.points_worst, concavity_worst and perimeter_worst.

C. Experiment (3) Correlation Plot:

Experiment (3) presents a heat map. Fig. 7 shows the calculated collinearity, which reveals the bivariate relationships between the predictors. Some features may carry highly correlated information, so by eliminating such features we can remove predictive bias from the information contained in the features.

TABLE I. FEATURE SELECTION AND COLLINEARITY

Features            Representation        Feature Info.   All Correlations
Diagnosis           diagnosis             Nominal         <= 0.9
Texture             _mean, _se, _worst    Numerical       <= 0.9
Area                _mean, _se, _worst    Numerical       <= 0.9
Radius              _mean, _se, _worst    Numerical       <= 0.9
Perimeter           _mean, _se, _worst    Numerical       <= 0.9
Concave Points      _mean, _se, _worst    Numerical       <= 0.9
Concavity           _mean, _se, _worst    Numerical       <= 0.9
Compactness         _mean, _se, _worst    Numerical       <= 0.9
Fractal Dimension   _mean, _se, _worst    Numerical       <= 0.9
Symmetry            _mean, _se, _worst    Numerical       <= 0.9
Smoothness          _mean, _se, _worst    Numerical       <= 0.9

Table I shows the feature selection and feature information required to determine whether the tumor is benign or malignant. The features are represented with the suffix designations for the mean, standard error and worst value. The table also displays the collinearity between these features.

D. Experiment (4) Data Preparation (PCA model):

Data preparation is by far the most important step in EDA, where we prepare the actual data so that the machine can make predictions easily. Fig. 8 shows the PCA (Principal Component Analysis) model, which constructs orthogonal variables. The x-axis represents the components and the y-axis represents the variances. [12] The summary of this PCA model has three parts: standard deviation, proportion of variance and cumulative proportion. Fig. 8 shows that PC1 has the highest standard deviation, with a value of 3.6444.

Fig. 8. Data preparation by applying PCA model.
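The three rows of a PCA summary (standard deviation, proportion of variance, cumulative proportion) are linked by simple arithmetic, sketched below in Python. Only the PC1 standard deviation of 3.6444 comes from the paper. The assumption that all 30 features were standardized before PCA, so that the total variance equals 30, is ours, as are the helper name `variance_summary` and the three-component example.

```python
def variance_summary(sdevs):
    """Return (proportions of variance, cumulative proportions) per component."""
    variances = [s ** 2 for s in sdevs]
    total = sum(variances)
    proportions = [v / total for v in variances]
    cumulative, running = [], 0.0
    for p in proportions:
        running += p
        cumulative.append(running)
    return proportions, cumulative

# Hypothetical standard deviations for a three-component model.
props, cum = variance_summary([2.0, 1.0, 1.0])
print(props, cum)

# PC1 from the paper, assuming 30 standardized features (total variance 30).
pc1_share = 3.6444 ** 2 / 30
print(round(pc1_share, 4))
```

Under that assumption, PC1 alone would account for roughly 44% of the total variance.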

IV. CONCLUSION

The Exploratory Data Analysis done on the Breast Cancer dataset helped us in visualizing various aspects needed to determine whether the tumor or diagnosis is benign or malignant. By performing univariate plotting we get a distinct separation between the diagnoses, classifying them as benign or malignant. [16] The use of histograms displayed a bivariate analysis of the mean, standard error and worst value, giving the best separations for each attribute. The correlation plot showed the highest correlations among the features from area_se to concave.points_mean, with coefficients on the positive side between 0 and 1. Lastly, the PCA model showed the highest standard deviation in PC1, with a value of 3.6444. Overall, after the complete EDA process we can build a model that can take real-time datasets or diagnoses and predict whether the tumor or cell nucleus is benign or malignant.

REFERENCES

[1] "UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set". 2020. Archive.Ics.Uci.Edu. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.
[2] Velleman, Paul F., and David C. Hoaglin. "Exploratory data analysis." (2012).
[3] Bock, Hans-Hermann, and Edwin Diday, eds. Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer Science & Business Media, 2012.
[4] Peng, Roger. Exploratory data analysis with R. Lulu.com, 2012.
[5] Morrow, Alyssa Kramer, George Zhixuan He, Frank Austin Nothaft, Eric Tongching Tu, Justin Paschall, Nir Yosef, and Anthony Douglas Joseph. "Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets." Cell Systems 9, no. 6 (2019): 609-613.
[6] Kanaan, Yasmine M., Robert L. Copeland Jr, Melvin Gaskins, and Robert L. DeWitty Jr. "Exploratory Data Analysis on Breast Cancer Prognosis." In Encyclopedia of Information Science and Technology, Fourth Edition, pp. 1794-1805. IGI Global, 2018.
[7] Nazarpour, Ahad. "Application of CA fractal model and exploratory data analysis (EDA) to delineate geochemical anomalies in the: Takab 1: 25,000 geochemical sheet, NW Iran." Iranian Journal of Earth Sciences 10, no. 2 (2018): 173-180.
[8] Brownstein, Naomi C., Andreas Adolfsson, and Margareta Ackerman. "Descriptive statistics and visualization of data from the R datasets package with implications for clusterability." Data in brief 25 (2019): 104004.
[9] Mannila, Heikki. "Data mining: machine learning, statistics, and databases." In Proceedings of 8th International Conference on Scientific and Statistical Data Base Management, pp. 2-9. IEEE, 1996.
[10] Khan, Rubina, and MADKI MR. "Comparison and analysis of various histogram equalization techniques." Int. J. Eng. Sci. Technol 4, no. 04 (2012): 1787-1792.
[11] Cohen, Jacob, Patricia Cohen, Stephen G. West, and Leona S. Aiken. Applied multiple regression/correlation analysis for the behavioral sciences. Routledge, 2013.
[12] Abdi, Hervé, and Lynne J. Williams. "Principal component analysis." Wiley interdisciplinary reviews: computational statistics 2, no. 4 (2010): 433-459.
[13] Myatt, Glenn J. Making sense of data: a practical guide to exploratory data analysis and data mining. John Wiley & Sons, 2007.
[14] Tang, Jinshan, Rangaraj M. Rangayyan, Jun Xu, Issam El Naqa, and Yongyi Yang. "Computer-aided detection and diagnosis of breast cancer with mammography: recent advances." IEEE transactions on information technology in biomedicine 13, no. 2 (2009): 236-251.
[15] Qin, Xuedi, Yuyu Luo, Nan Tang, and Guoliang Li. "Making data visualization more efficient and effective: a survey." The VLDB Journal 29, no. 1 (2020): 93-117.
[16] Golino, Hudson, Dingjing Shi, Alexander P. Christensen, Luis Eduardo Garrido, Maria Dolores Nieto, Ritu Sadana, Jotheeswaran Amuthavalli Thiyagarajan, and Agustin Martinez-Molina. "Investigating the performance of exploratory graph analysis and traditional techniques to identify the number of latent factors: A simulation and tutorial." Psychological Methods (2020).
[17] Dey, Samrat K., Md Mahbubur Rahman, Umme R. Siddiqi, and Arpita Howlader. "Analyzing the epidemiological outbreak of COVID-19: A visual exploratory data analysis approach." Journal of Medical Virology 92, no. 6 (2020): 632-638.
[18] Kumar, Dheeraj, and James C. Bezdek. "Visual Approaches for Exploratory Data Analysis: A Survey of the Visual Assessment of Clustering Tendency (VAT) Family of Algorithms." IEEE Systems, Man, and Cybernetics Magazine 6, no. 2 (2020): 10-48.
[19] Thaung, Su Myat, Hla Myo Tun, Khin Kyu Kyu Win, Myint Myint Than, Aye Su Su Phyo, Atar Mon, Sao Hone Pha, Cherry Tin, Saw Aung Yein Oo, and Tin Tin Hla. "Exploratory Data Analysis Based on Remote Health Care Monitoring System by Using IoT." Communications 8, no. 1 (2020): 1-8.
[20] J. Dsouza and S. Velan, "Preventive Maintenance for Fault Detection in Transfer Nodes using Machine Learning," 2019 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE), Dubai, United Arab Emirates, 2019, pp. 401-404.
