AIB Class Assignment PGP 25 160


Artificial Intelligence for Business

Naïve Bayes Classifier

Submitted by:

Paras Nath Munda

PGP/25/160

Naive Bayes Classifier - Predict if a person earns more than $50,000.

● The Naive Bayes Classifier is used in a Jupyter Notebook to predict whether a person
earns more than $50,000 per year.
● Imported the Pandas data-analysis package, NumPy for numerics, and matplotlib for
visualization.
● pd.read_csv reads the data from the CSV file, and .head() displays the first few rows.
● There are 32,561 instances and 15 attributes in the data set.

● Displays the number of rows and columns of the data frame (see the sketch below).
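
A minimal sketch of these loading steps, assuming the census income data is stored locally as adult.csv with no header row (the file name is an assumption):

```python
# Loading sketch; "adult.csv" is an assumed local file name.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# skipinitialspace strips the space that follows each comma in some copies of the file
df = pd.read_csv('adult.csv', header=None, skipinitialspace=True)

print(df.head())    # preview the first rows; columns are numbered 0, 1, 2, ...
print(df.shape)     # (32561, 15): 32,561 instances and 15 attributes
```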


● Since the attributes were labelled as numbers (0, 1, 2, …), they were renamed to
descriptive names such as “age”, “workclass”, etc.
● Code was run to find the categorical variables, which are fundamental to the classifier;
it identified 9 categorical variables, and the data frame was viewed for the same.
● Checked for missing values by looking for nulls.
● Examined the frequency of values in the categorical variables to help identify the
missing values.
● Since nulls in the data set were coded as “?” rather than NaN, “?” was replaced with
NaN so that Python could identify the missing values in each categorical variable.
● The data was split into training and test sets: 70% for training and 30% for testing.
● After exploring the missing values, the data was further cleaned.
● Missing values were imputed with new values in order to remove the nulls from the set.
● The data was encoded into numerical format; initially there were 14 feature columns,
but after encoding there are 113 (see the preprocessing sketch after this list).
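
A sketch of the preprocessing steps above, continuing from the loading sketch. The column names follow the standard Adult census data set, and the choice of most-frequent-value imputation and the exact “?” coding are assumptions about the notebook:

```python
# Rename the numbered columns with descriptive names (standard Adult data set names).
col_names = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',
             'marital_status', 'occupation', 'relationship', 'race', 'sex',
             'capital_gain', 'capital_loss', 'hours_per_week',
             'native_country', 'income']
df.columns = col_names

# Find the categorical (object-typed) variables -- 9 in this data set.
categorical = [col for col in df.columns if df[col].dtype == 'O']
print(len(categorical), 'categorical variables')

# Nulls are coded as '?', so replace them with NaN and recount the missing values.
df = df.replace('?', np.nan)
print(df.isnull().sum())

# Split into features/target, then 70% training / 30% test.
from sklearn.model_selection import train_test_split
X = df.drop(['income'], axis=1)
y = df['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Impute the remaining missing categorical values with the most frequent value.
for col in ['workclass', 'occupation', 'native_country']:
    X_train[col] = X_train[col].fillna(X_train[col].mode()[0])
    X_test[col] = X_test[col].fillna(X_train[col].mode()[0])

# One-hot encode the categorical variables: the 14 feature columns become 113.
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test).reindex(columns=X_train.columns, fill_value=0)
print(X_train.shape, X_test.shape)
```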
● The data set is fed to the Naïve Bayes Classifier to train the model; the type utilized
here is the Gaussian classifier.
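
A minimal training sketch with scikit-learn’s Gaussian Naïve Bayes, continuing with the variable names from the preprocessing sketch:

```python
# Fit a Gaussian Naive Bayes model on the 70% training split.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb.fit(X_train, y_train)
```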

● Predictions on the test set are then used to measure the accuracy of the model.
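
The prediction and accuracy step might look like this (variable names continue from the sketches above):

```python
# Predict on the 30% test split and compute the accuracy score.
from sklearn.metrics import accuracy_score

y_pred = gnb.predict(X_test)
print('Model accuracy: {:.4f}'.format(accuracy_score(y_test, y_pred)))
```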


● From the classifier’s results we can calculate the accuracy score and the counts of True
Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).
● The same data can be visualised as a confusion matrix using a seaborn heatmap.
● The performance parameters, that is, accuracy, classification error, precision, and
recall (sensitivity), are also calculated using their formulas.
● The curve of True Positive Rate vs False Positive Rate (ROC curve) generated from the
algorithm shows that the AUC is greater than 0.7, indicating a good model (see the
evaluation sketch below).
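
A sketch of the evaluation steps above: the confusion-matrix counts, the seaborn heatmap, the performance formulas, and the ROC curve with its AUC. The positive-label string '>50K' is an assumption about how the income column is coded:

```python
# Confusion matrix: rows are actual labels, columns are predicted labels.
import seaborn as sns
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score

cm = confusion_matrix(y_test, y_pred)
TN, FP, FN, TP = cm.ravel()      # negative class ('<=50K') is listed first

# Visualise the confusion matrix as a heatmap.
sns.heatmap(cm, annot=True, fmt='d', cmap='YlGnBu')
plt.xlabel('Predicted label')
plt.ylabel('Actual label')
plt.show()

# Performance parameters from their standard formulas.
accuracy = (TP + TN) / (TP + TN + FP + FN)
classification_error = (FP + FN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)          # also called sensitivity

# ROC curve (True Positive Rate vs False Positive Rate) and the AUC.
y_score = gnb.predict_proba(X_test)[:, 1]     # probability of the '>50K' class
fpr, tpr, thresholds = roc_curve(y_test, y_score, pos_label='>50K')
plt.plot(fpr, tpr, label='Gaussian Naive Bayes')
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

print('ROC AUC: {:.4f}'.format(roc_auc_score(y_test, y_score)))
```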
