MLSC CT2 Report
Group: Group 15
Members:
Harish R
DATA UNDERSTANDING:
Step 2: Read the dataset (the .csv file); head() is used to print the first 5 rows.
shape displays the total number of rows and columns in the dataset as (rows, columns).
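A minimal sketch of this step. The report's CSV file is not included here, so scikit-learn's built-in copy of the Wisconsin breast cancer data stands in for it:

```python
from sklearn.datasets import load_breast_cancer

# sklearn's copy of the breast cancer data stands in for the
# report's unnamed .csv file (pd.read_csv would be used there).
data = load_breast_cancer(as_frame=True)
df = data.frame

print(df.head())   # first 5 rows
print(df.shape)    # (rows, columns)
```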
The dataset above is used for breast cancer prediction.
The target variable is 'diagnosis'.
B : Benign
M: Malignant
The total count of each category is displayed using value_counts()
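The class counts can be sketched as follows; in sklearn's copy of the data the target is already numeric (0 = malignant, 1 = benign), standing in for the M/B diagnosis column:

```python
from sklearn.datasets import load_breast_cancer

# target 0 = malignant, 1 = benign stands in for the M/B labels.
df = load_breast_cancer(as_frame=True).frame
counts = df['target'].value_counts()
print(counts)  # count of each diagnosis class
```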
Checking for defects in the data such as missing values, nulls, and
outliers, and also checking for class imbalance.
For example, the column 'id' is not required to predict the category of cancer, so it can be dropped.
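A sketch of the defect check and of dropping the unneeded column; since sklearn's copy of the data has no 'id' column, a hypothetical one is added first to mirror the CSV:

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
df['id'] = range(len(df))       # hypothetical 'id' column mimicking the CSV

print(df.isnull().sum().sum())  # total missing values in the data
df = df.drop(columns=['id'])    # 'id' carries no predictive signal
```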
Label encoding is done using LabelEncoder() so that the categorical data can be used for
building the model (while designing the model, categorical values cannot be used directly in
calculations, so they are encoded numerically so that mathematical calculations can be performed using them).
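The encoding step can be sketched on a toy frame mimicking the report's 'diagnosis' column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame standing in for the report's 'diagnosis' column.
df = pd.DataFrame({'diagnosis': ['M', 'B', 'B', 'M', 'B']})
le = LabelEncoder()
# Classes are sorted alphabetically: B -> 0, M -> 1.
df['diagnosis'] = le.fit_transform(df['diagnosis'])
print(df['diagnosis'].tolist())  # [1, 0, 0, 1, 0]
```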
a) Histograms are plotted for the numerical data and skewness is calculated.
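A sketch of the skewness computation (the histogram call is shown as a comment, since plotting output cannot be reproduced here):

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
features = df.drop(columns=['target'])

# features.hist(figsize=(12, 10)) would draw the histograms;
# skewness is computed per numerical column.
skew = features.skew()
print(skew.sort_values(ascending=False).head())
```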
Step 9: A correlation graph is made to show the correlation between variables.
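The correlation matrix behind that graph can be sketched as:

```python
from sklearn.datasets import load_breast_cancer

df = load_breast_cancer(as_frame=True).frame
corr = df.corr()

# A heatmap of `corr` (e.g. seaborn's heatmap) gives the correlation
# graph; here we list the features most correlated with the target.
print(corr['target'].abs().sort_values(ascending=False).head())
```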
Step 11: Numerical data and categorical data are separated. Since we don't have
categorical data, we only have to separate the feature variables and the target variable.
Step 12: Now the data is split into train and test data.
Train data is 75% and test data is 25%.
Train Data is used to train the model and Test data is used to evaluate the accuracy of
the model designed.
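The split can be sketched with scikit-learn (the `random_state` is an arbitrary choice for reproducibility, not taken from the report):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 75% train / 25% test, matching the report's split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
print(X_train.shape, X_test.shape)
```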
Scaling the features makes gradient descent converge smoothly and helps algorithms quickly
reach the minimum of the cost function. Without feature scaling, the algorithm may be biased
toward features whose values are higher in magnitude.
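One common way to do this is standardization, sketched here with StandardScaler (the report does not name the scaler used):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Every feature now has mean ~0 and standard deviation ~1,
# so no feature dominates by magnitude alone.
print(X_scaled.mean(axis=0).round(3)[:5])
print(X_scaled.std(axis=0).round(3)[:5])
```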
Step 13: The Decision Tree Model is Built
Step 14: The accuracy of the model is calculated and the model designed provides an
accuracy of 95%
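Steps 13 and 14 can be sketched as follows; the exact accuracy depends on the split, so the 95% figure from the report will not reproduce exactly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Fit the decision tree on the training split, score on the test split.
tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
acc = accuracy_score(y_test, tree.predict(X_test))
print(f"decision tree accuracy: {acc:.3f}")
```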
Feature importance is calculated as the decrease in node impurity weighted by the probability
of reaching that node.
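These impurity-based importances are exposed directly by the fitted tree:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
tree = DecisionTreeClassifier(random_state=42).fit(data.data, data.target)

# feature_importances_: impurity decrease at each node, weighted by
# the fraction of samples reaching it, summed per feature (sums to 1).
imp = pd.Series(tree.feature_importances_, index=data.feature_names)
print(imp.sort_values(ascending=False).head())
```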
Step 15: The various other performance metrics are calculated.
The same approach of importing libraries, reading the dataset, and exploratory data analysis
is followed for the logistic regression model.
Step 17: The logistic regression summary report is displayed using summary() from the logistic
regression library.
Step 18: A confusion matrix is built, and different performance measures are calculated
from it.
Accuracy of the Logistic Model is 96.5%
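Step 18 can be sketched with scikit-learn; as with the tree, the exact 96.5% figure depends on the split and will not reproduce exactly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Scale using statistics from the training split only, then fit.
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
pred = clf.predict(scaler.transform(X_test))

print(confusion_matrix(y_test, pred))  # rows: true class, cols: predicted
acc = accuracy_score(y_test, pred)
print(f"logistic regression accuracy: {acc:.3f}")
```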