Clustering On Breast Cancer Wisconsin


K L UNIVERSITY

COMPUTER SCIENCE ENGINEERING DEPARTMENT

A Project Based Lab Report


On
Artificial Intelligence for Data Science

SUBMITTED BY:

ID NUMBER NAME

2000031259 M.GOUTHAM

2000031826 SAI KRISHANA

UNDER THE ESTEEMED GUIDANCE OF

Dr. Sunanda

Associate Professor

KL UNIVERSITY
Green Fields, Vaddeswaram – 522 502
Guntur Dt., AP, India
Clustering on Breast Cancer Wisconsin (Diagnostic)
Data Set

ABSTRACT
Breast cancer is one of the most common cancers worldwide and is found most frequently
in women. Early detection of breast cancer improves the chance of a cure; therefore,
many studies are under way to identify methods that can detect breast cancer in its
early stages. This study aimed to find the effects of the k-means clustering algorithm
with different computation measures, such as centroid, distance, split method, epoch,
attribute, and iteration, and to identify the combination of measures with the
potential for highly accurate clustering.

The k-means algorithm was used to evaluate the impact of centroid initialization,
distance measures, and split methods on clustering. The experiments were performed using
the breast cancer Wisconsin (BCW) diagnostic dataset. Foggy and random centroids were used
for centroid initialization. For the foggy centroid, the first centroid was calculated
from random values. For the random centroid, the initial centroid was taken as (0, 0).
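
The k-means procedure described above can be sketched as follows. This is a minimal illustration in Python, not the application's own code: the function names, the toy 2-D points, and the choice of the last data point as the second starting centroid are all assumptions made for the example. Only the (0, 0) starting centroid comes from the text. The loop stops either when the centroids stop moving or after a fixed number of epochs.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, centroids, max_epochs=10, distance=euclidean):
    """Run k-means from the given initial centroids.

    Stops when the centroids no longer change or after max_epochs
    passes over the data, whichever comes first.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_epochs):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels


# Toy 2-D points (invented for illustration, not BCW data).
points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
# One centroid starts at the origin; the other start is an arbitrary choice.
final_centroids, labels = kmeans(points, [(0, 0), points[-1]])
```

Passing the initial centroids in as an argument keeps the initialization strategy separate from the clustering loop, so different starting schemes can be compared without changing the loop itself.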

Problem Statement

The number of specialists with expertise in breast cancer is limited. Patients must make an
appointment with them before a medical check-up, so many patients wait too long to get their
check-up results.

Besides that, the number of experienced medical staff is decreasing from day to day. When they
retire, new staff replace them. The new staff are inexperienced compared with their predecessors,
so they must learn many things about the related work. The application helps to manage this
problem and aids inexperienced physicians in checking their diagnoses.

Theoretical Background

Breast cancer is one of the most common cancers in the world. It is a tumour that develops from the
cells of the breast. It occurs in both men and women, although breast cancer in men is rare.
Breast cancer is the leading cause of death among women between 40 and 55 years old and the
second overall cause of death among women (exceeded only by lung cancer) (Imaginis.com, 2011).
According to the World Health Organization, more than 1.2 million women are diagnosed with
breast cancer each year worldwide. White women have a higher incidence of breast cancer than
African American women beginning at age 45 (cancer.org, 2011). Not all tumours are cancerous:
a tumour can be non-cancerous (benign) or cancerous (malignant).

The breast cancer dataset used was collected by Dr. William H. Wolberg between 1989 and 1991 at the
University of Wisconsin-Madison Hospitals (Mangasarian and Wolberg 1990). The dataset can be
retrieved from the UCI Machine Learning Repository. It contains 699 samples: 16 samples have
attributes with missing values and 683 samples have complete data. Each sample record has
nine attributes, each graded on an interval scale from 1 to 10; the grades are ordinal (an
ordered set). The class attribute has two states: 2 for benign and 4 for malignant.
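
As a rough sketch of how the complete records could be separated from those with missing values, the Python snippet below parses comma-separated rows. It assumes each row holds a sample id, the nine attributes, and the class, and that a missing attribute is marked with "?". The function name and the three sample rows are invented for illustration and are not taken from the dataset.

```python
def load_complete_rows(lines):
    """Parse comma-separated records: id, nine attributes, class.

    Rows with a missing attribute (assumed to be marked "?") are skipped.
    Returns (features, labels) with labels 2 = benign, 4 = malignant.
    """
    features, labels = [], []
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) != 11 or "?" in fields:
            continue  # skip blank, malformed, or incomplete rows
        values = [int(f) for f in fields]
        features.append(values[1:10])  # drop the id, keep the nine attributes
        labels.append(values[10])
    return features, labels


# Made-up rows in the assumed layout, for illustration only.
sample = [
    "1000001,5,1,1,1,2,1,3,1,1,2",
    "1000002,8,10,10,8,7,10,9,7,1,4",
    "1000003,6,8,8,1,3,?,3,7,1,2",
]
features, labels = load_complete_rows(sample)
```

Applied to the full file under these assumptions, the same filter would keep the 683 complete samples and drop the 16 incomplete ones.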

Software Requirements

The software used to develop this application is NetBeans IDE 6.9, both for creating the interface
and for coding the application's functions. Java is the programming language used with
NetBeans IDE 6.9.
Flow chart
Data Analytics: EDA and Plotting
Result Analysis

The results were obtained by employing the k-means algorithm and are discussed for different
cases with variable parameters. The calculations were based on the centroid
(foggy/random), distance (Euclidean/Manhattan/Pearson), split (simple/variance), threshold
(constant epoch/same centroid), attribute (2-9), and iteration (4-10). An average positive
prediction accuracy of approximately 92% was obtained with this approach. Better results were
found for the same-centroid threshold and the highest-variance split. The results achieved
using Euclidean and Manhattan distances were better than those using Pearson correlation.
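
For reference, the three distance measures compared above can be written as short Python functions. This is only a sketch: the report does not state how the Pearson correlation was turned into a distance, so the common convention of 1 - r is assumed here, as is the handling of constant vectors. The function names are illustrative.

```python
import math


def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of the absolute differences along each attribute."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pearson_distance(a, b):
    """1 - r, where r is the Pearson correlation (assumed convention)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        # Correlation is undefined for a constant vector; treating the
        # pair as uncorrelated is an assumption of this sketch.
        return 1.0
    return 1.0 - cov / (norm_a * norm_b)
```

Under this convention the Pearson distance ignores scale: the vectors (1, 2, 3) and (2, 4, 6) are at distance 0, whereas the Euclidean and Manhattan distances between them are non-zero.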
Conclusion & Future Scope

The findings of this work provide an extensive understanding of the computational parameters
that can be used with k-means. The results indicate that k-means has the potential to classify
the BCW dataset.
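
Because k-means produces unlabelled clusters, using it for classification requires mapping each cluster to a class. The report does not say how its accuracy figure was computed. The Python sketch below shows one common approach, assumed here purely for illustration: each cluster takes the majority class of its members, and accuracy is the fraction of samples whose cluster label matches their true class. The function name and the example values are invented.

```python
from collections import Counter


def cluster_accuracy(cluster_ids, true_labels):
    """Label each cluster with its majority class, then score the result."""
    majority = {}
    for c in set(cluster_ids):
        members = [t for cid, t in zip(cluster_ids, true_labels) if cid == c]
        majority[c] = Counter(members).most_common(1)[0][0]
    correct = sum(majority[cid] == t for cid, t in zip(cluster_ids, true_labels))
    return correct / len(true_labels)


# Invented example: 2 = benign, 4 = malignant.
clusters = [0, 0, 0, 1, 1, 1, 1, 0]
truth = [2, 2, 2, 4, 4, 4, 2, 4]
accuracy = cluster_accuracy(clusters, truth)  # 6 of 8 samples match
```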
