Application of Exploratory Data Analysis To Generate Inferences On The Occurrence of Breast Cancer Using A Sample Dataset
net/publication/343489158
Amity University, Dubai, UAE
All content following this page was uploaded by Senthil Velan S. on 20 September 2020.
Abstract—Exploratory Data Analysis (EDA) is a data analysis technique that can be used to visually represent the knowledge embedded in a given data set. The technique can be applied in medical data processing to improve the services offered by healthcare providers. Breast cancer has become prevalent in women, and proper methods are required to identify the possibility of its occurrence at an early stage for treatment and cure. In this context, the visualization techniques of EDA can be applied to existing datasets for learning and prediction. In this research work, the analysis aims at finding the predominant features that are helpful in predicting whether a tumor is benign or malignant. To achieve this goal, various graphical techniques are used for better visual understanding. The results show that EDA plays a significant role in describing the data using statistical and visual analysis without making any assumptions about the content. With this dataset and the EDA approach, we have achieved a clear understanding of the significant features needed to predict whether a tumor is benign or malignant.

Keywords—Exploratory Data Analysis, R Language, Breast Cancer, Benign, Malignant

I. INTRODUCTION

Rich, high-volume data is the modern fuel that possesses inherent characteristics for driving the [17] intelligent decision-making abilities of today's smart businesses and services. In comparison with the energy sector, unprocessed raw data is equivalent to crude oil, and the fuel that powers internal combustion engines is the intelligent information processed from the raw data. Just as fractional distillation extracts different products from crude oil, extracting intelligent information at different levels improves decisions at different levels across the business unit.

[2] Exploratory data analysis (EDA) is a process by which a given data set is analyzed to extract useful information. The process commonly depicts the data in a visual form, enabling better understanding and informed decision making by business entities.

The relevance of applying EDA in [6] medical data processing has increased rapidly, mainly because these classification techniques have become effective and refined enough to give an almost accurate prediction. This also reduces the cost of medicines and improves the ability to conduct more clinical research.

With reference to cancer, [18] EDA plays a significant role in improving diagnosis, patient care, disease management and hospital administration. [5] Applying the EDA process can help determine which cancer is present in a patient, and it can also help identify other attributes needed for developing a predictive model that could provide early diagnosis and early detection in patients having cancer.

Breast cancer has become a very common occurrence in women due to continuously changing food habits and various other factors. Diagnosing it at an early stage can be life changing for a woman. EDA can help in analyzing the [15] visualized data to identify the occurrence of breast cancer at an earlier stage. A trained identification model can be developed to enable such identification, which will help in early detection and the possibility of a cure.

II. APPLYING EDA FOR BREAST CANCER DATA

A. Process Flow of EDA

EDA is the initial step of [9] deciphering data by first presenting it visually using the different tools available in a data processing environment. The process flow of a generic EDA mechanism is shown in Fig. 1. The data is first cleaned and trimmed using built-in functions of R Studio, where the raw data is loaded and read using the function read.csv for a scripted dataset.

Fig. 1. A Generic Process Flow of EDA

After completing the analysis using either [4] R Studio or Python, two outcomes are obtained, as specified in Fig. 1. The first outcome is the enriched data, and the second is the data visualization, which shows the enriched data being further classified using various algorithms.
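As a rough illustration of the load-and-trim step, the following Python sketch (the paper notes that the analysis can be done in R Studio or Python) reads a small CSV, drops an identifier column, checks for missing values and counts the diagnosis labels. The inline sample rows, the column names `id`, `diagnosis`, `radius_mean` and `texture_mean`, and all helper names are assumptions made for illustration; they are not taken from the authors' script or from the real dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the spirit of the dataset; the values are made up.
SAMPLE_CSV = """id,diagnosis,radius_mean,texture_mean
101,M,17.9,10.4
102,B,13.5,14.4
103,B,12.1,
104,M,20.6,17.8
"""

def load_rows(text):
    """Read CSV text into a list of dicts (a rough analogue of read.csv)."""
    return list(csv.DictReader(io.StringIO(text)))

def drop_columns(rows, names):
    """Return rows without the given columns, e.g. a row identifier."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]

def has_missing(rows):
    """True if any cell is empty, mirroring a TRUE/FALSE missing-value check."""
    return any(value == "" for row in rows for value in row.values())

def drop_incomplete(rows):
    """Keep only rows in which every cell is filled."""
    return [row for row in rows if all(value != "" for value in row.values())]

raw = load_rows(SAMPLE_CSV)
trimmed = drop_columns(raw, {"id"})
missing_before = has_missing(trimmed)
clean = drop_incomplete(trimmed)
label_counts = Counter(row["diagnosis"] for row in clean)

print("columns:", list(clean[0].keys()))
print("missing before cleaning:", missing_before)
print("label counts:", dict(label_counts))
```

Dropping incomplete rows is only one possible policy; the paper itself says the data should be re-checked when missing values are reported, so a real pipeline might repair the values instead.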
978-1-7281-4097-1/20/$31.00 ©2020 IEEE
Authorized licensed use limited to: AMITY University. Downloaded on September 20,2020 at 09:44:06 UTC from IEEE Xplore. Restrictions apply.
2020 International Conference on Intelligent Engineering and Management (ICIEM)
B. Application to Breast Cancer Data

Breast cancer, the most common cancer amongst women, is a leading cause [19] of cancer death in women. A tumor, a type of abnormal growth, is identified, which can be Malignant (cancerous) or Benign (non-cancerous). Various tests such as [7] MRI, [14] mammogram and biopsy are performed to diagnose and identify the type of cancer. The Breast Cancer dataset is retrieved in raw format as a .csv file having rows and columns of various attributes and features that are relevant to the cancer cells.

We now look at the process flow diagram for applying EDA to the Breast Cancer dataset. The first part of the model takes the raw Breast Cancer dataset as input. The second applies descriptive statistics to the dataset to trim and clean the data for easier use. The third is univariate plotting, which comprises a frequency pie chart and histograms of the features. The fourth is multivariate plotting, which gives the correlation between the features. The fifth is data preparation using the PCA model. The sixth covers the different classification models needed. The seventh is model evaluation, where the most suitable classification algorithm for the dataset is chosen. The last is the comparison of diagnoses to determine whether the tumor is benign or malignant.
1) Breast Cancer Raw Dataset

This dataset included 32 columns, with two main attributes: the ID number and the Diagnosis (Malignant or Benign). It had ten real-valued features describing the tumor or the breast cancer cell nucleus. For each feature, the mean, the standard error and the worst value were recorded in separate columns, for example the mean radius, the standard error of the radius and the worst radius. Initially, the raw dataset was imported into R Studio using the function for reading a .csv file, and the required libraries such as ggplot2, reshape2, corrplot, caret and various others were loaded.

2) Applying Descriptive Statistics

The first step is to inspect the raw dataset using the function str(bc_data), which displays the structure of the dataset. [3] Secondly, we drop the unwanted attributes and features to clean or trim the data. This makes the EDA process more efficient and the data visualization more accurate. [8] The function used to remove an unwanted attribute was bc_data <- medical_data[,-c(0:1)]. We then check for missing values: if the output is TRUE, we have to re-check the data and find the missing values; if there are none, the output is FALSE. Lastly, to tidy the data we convert the diagnosis to a factor:

bc_data$diagnosis <- as.factor(bc_data$diagnosis)

Once we have a good sense of the data, the next step is to take a closer look at the attributes and data using the function summary(bc_data).

5) Data Preparation (PCA Model)

The PCA (Principal Component Analysis) model transforms the data to a new coordinate system such that the greatest variance of any scalar projection of the data lies on the first coordinate, PC1, and the second greatest variance on the second, PC2. Data preparation is an important step because it exposes the structure of the data to the machine for further processing. PCA is applied because a model can tend to fail when many features are correlated; PCA replaces them with uncorrelated components.

6) Classification Algorithms

This step is crucial for prediction on the dataset; it is a data mining technique which falls under machine learning. In this step, various machine learning models such as Random Forest, KNN, Neural Networks (NNET) and Naïve Bayes are applied. The first part of this step trains the algorithm, the second makes predictions, and the third evaluates the predictions.

7) Model Evaluation

In model evaluation, we compare all the machine learning models that were trained and tested, and observe which model best suits the main purpose of the dataset. Comparison via bwplot and various other plotting methods is done to identify the highest ROI.
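The train, predict and evaluate parts of the classification step can be sketched with a tiny k-nearest-neighbours classifier in Python. This is a minimal illustration and not the authors' caret-based pipeline: the two-feature toy data, the choice of k = 3, the fixed train/test split and all function names are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, point, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    distances = sorted(
        (math.dist(x, point), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Made-up (radius, texture)-like pairs; "B" = benign, "M" = malignant.
train_x = [(12.0, 14.0), (13.1, 15.2), (11.5, 13.0),
           (19.8, 22.1), (21.0, 20.5), (18.7, 23.3)]
train_y = ["B", "B", "B", "M", "M", "M"]
test_x = [(12.5, 14.5), (20.1, 21.9)]
test_y = ["B", "M"]

# Part 1: "training" for k-NN is just storing the labelled points.
# Part 2: make predictions for the held-out points.
predictions = [knn_predict(train_x, train_y, p, k=3) for p in test_x]
# Part 3: evaluate the predictions against the known labels.
score = accuracy(predictions, test_y)

print("predictions:", predictions)
print("accuracy:", score)
```

The sketch uses raw feature values; with features on very different scales, standardizing them first would usually be advisable so that no single feature dominates the distance.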
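The multivariate plotting stage summarizes pairwise relationships between features as correlation coefficients, which a correlation plot then visualizes. The Python sketch below computes Pearson's r for every pair of columns; the feature names and values are invented for illustration, and the function names are hypothetical.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_pairs(columns):
    """Map each pair of column names to its Pearson r."""
    return {
        (a, b): pearson(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }

# Made-up columns: area is built to rise with radius, smoothness to fall.
columns = {
    "radius_mean": [10.0, 12.0, 14.0, 16.0, 18.0],
    "area_mean": [310.0, 450.0, 610.0, 800.0, 1010.0],
    "smoothness_mean": [0.12, 0.11, 0.10, 0.09, 0.08],
}

pairs = correlation_pairs(columns)
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))

for pair, r in pairs.items():
    print(pair, round(r, 3))
print("strongest pair:", strongest)
```

Values of r close to 1 or -1 indicate strongly collinear features, which is the situation that motivates the PCA-based data preparation.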
showing the mean, standard error and worst mean. It also displays the collinearity between these features.

D. Experiment (4) Data Preparation (PCA model):

Data preparation is by far the most important step in EDA, where we create the actual data on which the machine makes predictions. Fig. 8 shows the PCA (Principal Component Analysis) model, which constructs orthogonal variables. The x-axis represents the components and the y-axis represents the variances. [12] The summary of this PCA model has three parts: standard deviation, proportion of variance and cumulative proportion. Fig. 8 shows that PC1 has the highest standard deviation, with a value of 3.6444.
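The three-part PCA summary (standard deviation, proportion of variance, cumulative proportion) can be reproduced from the eigenvalues of the feature correlation matrix. The Python sketch below works on correlations, which is equivalent to standardizing the features first, and finds the eigenvalues with Jacobi rotations. This is one way to obtain them and not necessarily what the authors' R call does internally; the data values and function names are made up. As a rough cross-check, and assuming the PCA was run on all 30 standardized features (the paper does not say), a PC1 standard deviation of 3.6444 would correspond to 3.6444^2 / 30, about 44% of the total variance.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix for a list of equal-length columns."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([v - mean for v in col])
    norms = [math.sqrt(sum(v * v for v in col)) for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]

def symmetric_eigenvalues(matrix, tol=1e-12, max_rotations=1000):
    """Eigenvalues of a symmetric matrix via Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    size = len(a)
    if size < 2:
        return [a[0][0]] if size == 1 else []
    for _ in range(max_rotations):
        # Pick the largest off-diagonal entry to eliminate next.
        p, q = max(
            ((i, j) for i in range(size) for j in range(i + 1, size)),
            key=lambda ij: abs(a[ij[0]][ij[1]]),
        )
        if abs(a[p][q]) < tol:
            break
        theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
        c, s = math.cos(theta), math.sin(theta)
        # Apply the rotation J^T A J: first update columns p and q ...
        for k in range(size):
            akp, akq = a[k][p], a[k][q]
            a[k][p] = c * akp - s * akq
            a[k][q] = s * akp + c * akq
        # ... then update rows p and q.
        for k in range(size):
            apk, aqk = a[p][k], a[q][k]
            a[p][k] = c * apk - s * aqk
            a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(size)), reverse=True)

def pca_summary(columns):
    """Per-component standard deviation, proportion and cumulative proportion."""
    eigenvalues = [
        max(ev, 0.0)  # guard against tiny negative rounding errors
        for ev in symmetric_eigenvalues(correlation_matrix(columns))
    ]
    total = sum(eigenvalues)
    summary, running = [], 0.0
    for ev in eigenvalues:
        running += ev / total
        summary.append((math.sqrt(ev), ev / total, running))
    return summary

# Made-up feature columns (one list per feature).
features = [
    [10.0, 12.0, 14.0, 16.0, 18.0],
    [310.0, 450.0, 610.0, 800.0, 1010.0],
    [0.12, 0.10, 0.11, 0.08, 0.09],
]

for index, (sd, prop, cum) in enumerate(pca_summary(features), start=1):
    print(f"PC{index}: sd={sd:.4f} proportion={prop:.4f} cumulative={cum:.4f}")
```

Because the eigenvalues of a correlation matrix sum to the number of features, the squared standard deviations always add up to that count, and the cumulative proportion of the last component is 1.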
IV. CONCLUSION

The Exploratory Data Analysis performed on the Breast Cancer dataset helped us visualize the various aspects needed to determine whether a tumor is Benign or Malignant. Univariate plotting gives a distinct separation of the diagnosis into Benign and Malignant. [16] The histograms displayed a bivariate analysis of the mean, standard error and worst mean, giving the best separations for each attribute. The correlation plot showed the highest correlations for the features from area_se to concave.points_mean, which fall on the positive side of the scale, between 0 and 1. Lastly, the PCA model showed the highest standard deviation in PC1, with a value of 3.6444. Overall, after the complete EDA process, we can build a model that can take real-time datasets or diagnoses and predict whether the tumor or cell nucleus is Benign or Malignant.

REFERENCES

[1] "UCI Machine Learning Repository: Breast Cancer Wisconsin (Diagnostic) Data Set". 2020. Archive.Ics.Uci.Edu. https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29.
[2] Velleman, Paul F., and David C. Hoaglin. "Exploratory data analysis." (2012).
[3] Bock, Hans-Hermann, and Edwin Diday, eds. Analysis of symbolic data: exploratory methods for extracting statistical information from complex data. Springer Science & Business Media, 2012.
[4] Peng, Roger. Exploratory data analysis with R. Lulu.com, 2012.
[5] Morrow, Alyssa Kramer, George Zhixuan He, Frank Austin Nothaft, Eric Tongching Tu, Justin Paschall, Nir Yosef, and Anthony Douglas Joseph. "Mango: Exploratory Data Analysis for Large-Scale Sequencing Datasets." Cell Systems 9, no. 6 (2019): 609-613.
[6] Kanaan, Yasmine M., Robert L. Copeland Jr, Melvin Gaskins, and Robert L. DeWitty Jr. "Exploratory Data Analysis on Breast Cancer Prognosis." In Encyclopedia of Information Science and Technology, Fourth Edition, pp. 1794-1805. IGI Global, 2018.
[7] Nazarpour, Ahad. "Application of CA fractal model and exploratory data analysis (EDA) to delineate geochemical anomalies in the: Takab 1: 25,000 geochemical sheet, NW Iran." Iranian Journal of Earth Sciences 10, no. 2 (2018): 173-180.
[8] Brownstein, Naomi C., Andreas Adolfsson, and Margareta Ackerman. "Descriptive statistics and visualization of data from the R datasets package with implications for clusterability." Data in brief 25 (2019): 104004.
[9] Mannila, Heikki. "Data mining: machine learning, statistics, and databases." In Proceedings of 8th International Conference on Scientific and Statistical Data Base Management, pp. 2-9. IEEE, 1996.
[10] Khan, Rubina, and MADKI MR. "Comparison and analysis of various histogram equalization techniques." Int. J. Eng. Sci. Technol 4, no. 04 (2012): 1787-1792.
[11] Cohen, Jacob, Patricia Cohen, Stephen G. West, and Leona S. Aiken. Applied multiple regression/correlation analysis for the behavioral sciences. Routledge, 2013.
[12] Abdi, Hervé, and Lynne J. Williams. "Principal component analysis." Wiley interdisciplinary reviews: computational statistics 2, no. 4 (2010): 433-459.
[13] Myatt, Glenn J. Making sense of data: a practical guide to exploratory data analysis and data mining. John Wiley & Sons, 2007.
[14] Tang, Jinshan, Rangaraj M. Rangayyan, Jun Xu, Issam El Naqa, and Yongyi Yang. "Computer-aided detection and diagnosis of breast cancer with mammography: recent advances." IEEE transactions on information technology in biomedicine 13, no. 2 (2009): 236-251.
[15] Qin, Xuedi, Yuyu Luo, Nan Tang, and Guoliang Li. "Making data visualization more efficient and effective: a survey." The VLDB Journal 29, no. 1 (2020): 93-117.
[16] Golino, Hudson, Dingjing Shi, Alexander P. Christensen, Luis Eduardo Garrido, Maria Dolores Nieto, Ritu Sadana, Jotheeswaran Amuthavalli Thiyagarajan, and Agustin Martinez-Molina. "Investigating the performance of exploratory graph analysis and traditional techniques to identify the number of latent factors: A simulation and tutorial." Psychological Methods (2020).
[17] Dey, Samrat K., Md Mahbubur Rahman, Umme R. Siddiqi, and Arpita Howlader. "Analyzing the epidemiological outbreak of COVID-19: A visual exploratory data analysis approach." Journal of Medical Virology 92, no. 6 (2020): 632-638.
[18] Kumar, Dheeraj, and James C. Bezdek. "Visual Approaches for Exploratory Data Analysis: A Survey of the Visual Assessment of Clustering Tendency (VAT) Family of Algorithms." IEEE Systems, Man, and Cybernetics Magazine 6, no. 2 (2020): 10-48.
[19] Thaung, Su Myat, Hla Myo Tun, Khin Kyu Kyu Win, Myint Myint Than, Aye Su Su Phyo, Atar Mon, Sao Hone Pha, Cherry Tin, Saw Aung Yein Oo, and Tin Tin Hla. "Exploratory Data Analysis Based on Remote Health Care Monitoring System by Using IoT." Communications 8, no. 1 (2020): 1-8.
[20] J. Dsouza and S. Velan, "Preventive Maintenance for Fault Detection in Transfer Nodes using Machine Learning," 2019 International Conference on Computational Intelligence and Knowledge Economy (ICCIKE), Dubai, United Arab Emirates, 2019, pp. 401-404