
SRM INSTITUTE OF SCIENCE AND TECHNOLOGY

Program Offered: M. Tech /CSE

Course Title: DATA ANALYSIS AND VISUALIZATION

Project Name: MLSC MINI PROJECT – 2

Group: Group 15

Members:

Harish R
DATA UNDERSTANDING:

Step 1: Import the necessary libraries.

Step 2: Read the dataset from the .csv file; head() is used to print the first 5 rows.

shape displays the total number of rows and columns in the dataset as (rows, columns).
The dataset is used for breast cancer prediction.
The target variable is 'diagnosis'.
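A minimal sketch of Steps 1 and 2, assuming the breast cancer data is stored in a file named 'data.csv' (the actual file name in the project may differ):

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('data.csv')   # assumed file name for the breast cancer dataset
print(df.head())               # first 5 rows
print(df.shape)                # (rows, columns)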

Column names and meanings:


id: ID number
diagnosis: The diagnosis of breast tissues (M = malignant, B = benign)
radius_mean: mean of distances from center to points on the perimeter
texture_mean: standard deviation of gray-scale values
perimeter_mean: mean size of the core tumor
area_mean: area of the tumor
smoothness_mean: mean of local variation in radius lengths
compactness_mean: mean of perimeter^2 / area - 1.0
concavity_mean: mean of severity of concave portions of the contour
concave_points_mean: mean for number of concave portions of the contour
fractal_dimension_se: standard error for "coastline approximation" - 1
radius_worst: "worst" or largest mean value for mean of distances from center to points on the perimeter
texture_worst: "worst" or largest mean value for standard deviation of gray-scale values
perimeter_worst: "worst" or largest mean size of the core tumor
area_worst: "worst" or largest area of the tumor
smoothness_worst: "worst" or largest mean value for local variation in radius lengths
compactness_worst: "worst" or largest mean value for perimeter^2 / area - 1.0
concavity_worst: "worst" or largest mean value for severity of concave portions of the contour
concave_points_worst: "worst" or largest mean value for number of concave portions of the contour
fractal_dimension_worst: "worst" or largest mean value for "coastline approximation" - 1
info() is used to display the datatype and non-null count of every column in the dataset.

b. Calculate the five-point summary for the numerical variables

describe() is used to display statistics such as count, mean, std, min, the quartiles, and max for the data.
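A short sketch of these two summaries, continuing with the df created above:

df.info()              # datatype and non-null count for every column
print(df.describe())   # count, mean, std, min, 25%, 50%, 75%, max for the numeric columns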
Summarize observations for the categorical variables

describe(include='object') is used to display information for the categorical data only.

B : Benign
M: Malignant
The total count of each category is displayed using value_counts().
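A sketch of the categorical summaries, continuing with the same df:

print(df.describe(include='object'))    # count, unique, top, freq for the categorical columns
print(df['diagnosis'].value_counts())   # number of B (benign) and M (malignant) records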
Check for defects in the data such as missing values, nulls, and outliers, and also check for class imbalance.

Step 6: isna().sum() is used to display the total number of missing values in each column.
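A minimal sketch of the missing-value and class-imbalance checks, continuing with the same df:

print(df.isna().sum())                                # missing values per column
print(df['diagnosis'].value_counts(normalize=True))   # class proportions, to judge imbalance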


Step 7: Data Visualization

a) A histogram (count plot) is drawn for the categorical data, B and M, from the diagnosis column.

b) A scatter plot is used to see how spread out the data is and to visualize the outliers.
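A sketch of these plots with seaborn and matplotlib; the report does not say which pair of features was plotted, so radius_mean against texture_mean is used here purely as an illustrative choice:

sns.countplot(x='diagnosis', data=df)   # bar chart of the B and M counts
plt.show()

# Scatter plot of two features to see how the data is spread and to spot outliers
plt.scatter(df['radius_mean'], df['texture_mean'], alpha=0.5)
plt.xlabel('radius_mean')
plt.ylabel('texture_mean')
plt.show()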
Step 8: DATA PREPARATION
The missing data is replaced with the mean (or median) value of the column using the fillna() and mean() functions.

The total percentage of missing values is displayed.


Remove unnecessary columns from the data.

For example, the 'id' column is not required to predict the category of cancer.

Label encoding is done using LabelEncoder() so that the categorical data can be used for building the model (categorical values cannot be used directly in calculations, so they are encoded as numbers so that mathematical operations can be performed on them).
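A minimal sketch of these preparation steps, continuing with the same df; mean imputation is shown, though the report notes the median would also work:

from sklearn.preprocessing import LabelEncoder

df = df.fillna(df.mean(numeric_only=True))   # replace missing numeric values with the column mean
print(df.isna().sum() / len(df) * 100)       # percentage of missing values per column

df = df.drop(columns=['id'])                 # 'id' carries no predictive information

le = LabelEncoder()
df['diagnosis'] = le.fit_transform(df['diagnosis'])   # B -> 0, M -> 1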
a) Histograms are plotted for the numerical data and the skewness of each feature is calculated.
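A short sketch of this step, continuing with the same df:

df.hist(figsize=(15, 12))           # histogram of every numerical column
plt.show()
print(df.skew(numeric_only=True))   # skewness of each numerical feature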
Step 9: A correlation graph is made to understand the correlation between the variables.

Correlation is calculated between the target and the feature variables. Pearson's correlation coefficient helps you find the relationship between two quantities: it measures the strength of the association between two variables. Its value lies between -1 and +1, where +1 means a strong positive correlation, 0 means no correlation, and -1 means a negative correlation.
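A minimal sketch of this step, continuing with the label-encoded df from Step 8 (pandas uses Pearson correlation by default):

corr = df.corr(numeric_only=True)
print(corr['diagnosis'].sort_values(ascending=False))   # each feature's correlation with the target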
Step 10: Covariance is calculated using cov() and a heatmap is plotted to visually understand the relationships between the variables.
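A sketch of this step with seaborn; the heatmap here is drawn from the covariance matrix itself, which is one reasonable reading of the report:

cov = df.cov(numeric_only=True)     # covariance matrix
plt.figure(figsize=(12, 10))
sns.heatmap(cov, cmap='coolwarm')   # heatmap of the pairwise relationships
plt.show()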

Step 11: Numerical and categorical data are separated. Since there is no categorical data left after encoding, we only have to separate the feature variables from the target variable.
Step 12: Now the data is split into train and test sets: 75% train data and 25% test data.
The train data is used to train the model and the test data is used to evaluate the accuracy of the designed model.
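A minimal sketch of Steps 11 and 12, continuing with the same df; random_state is an assumed value for reproducibility:

from sklearn.model_selection import train_test_split

X = df.drop(columns=['diagnosis'])   # feature variables
y = df['diagnosis']                  # target variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)   # 75% train / 25% test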

Scaling the features makes the flow of gradient descent smooth and helps algorithms reach the minimum of the cost function quickly. Without feature scaling, the algorithm may be biased toward the features whose values are larger in magnitude.
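The report does not name the scaler used; StandardScaler is one common choice and is shown here as a sketch:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on the training data only
X_test_scaled = scaler.transform(X_test)         # apply the same statistics to the test data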
Step 13: The Decision Tree model is built.
Step 14: The accuracy of the model is calculated; the designed model provides an accuracy of 95%.
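A sketch of Steps 13 and 14 with scikit-learn; the report does not state the tree's hyperparameters, so defaults are assumed:

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

tree = DecisionTreeClassifier(random_state=42)   # hyperparameters assumed to be the defaults
tree.fit(X_train, y_train)                       # trees do not require scaled features

y_pred = tree.predict(X_test)
print(accuracy_score(y_test, y_pred))            # the report quotes about 95%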

The confusion matrix is used to calculate the precision, recall, and F1-score.

Feature importance is calculated as the decrease in node impurity weighted by the probability
of reaching that node.
Step 15: The other performance metrics are calculated:

Precision score, recall, and F1 score.
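A sketch of Step 15 and of the feature-importance calculation, continuing from the fitted tree above:

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, and F1-score per class

# Impurity-based feature importances of the fitted decision tree
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))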


Step 16: The Logistic Regression model is built.

The same approach is followed for importing the libraries, reading the dataset, and performing the exploratory data analysis.
Step 17: The logistic regression summary report is displayed using the summary() function of the logistic regression library.
Step 18: The confusion matrix is built and, using it, the different performance measures are calculated.
The accuracy of the Logistic Regression model is 96.5%.
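A sketch of Steps 16-18; the report does not name the library whose summary() it uses, so statsmodels' Logit is assumed for the summary report and scikit-learn's LogisticRegression for the confusion matrix and accuracy (on this data the Logit fit may warn about separation or convergence):

import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

# Step 17: summary report, assuming statsmodels is the library meant here
logit_model = sm.Logit(y_train, sm.add_constant(X_train_scaled)).fit()
print(logit_model.summary())

# Step 18: confusion matrix and accuracy with scikit-learn's LogisticRegression
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train_scaled, y_train)
y_pred_lr = log_reg.predict(X_test_scaled)
print(confusion_matrix(y_test, y_pred_lr))
print(accuracy_score(y_test, y_pred_lr))   # the report quotes 96.5%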

We conclude that, on this data, the Logistic Regression model (96.5% accuracy) performs slightly better than the Decision Tree model (95% accuracy).
