https://replit.com/@nevp/nev#Main.java

Experiment: 05
Name: Mayur Pawade    Roll No: 219

Aim: To implement the Naive Bayes Classification Algorithm in Java/Python
The Naive Bayes classification algorithm is a probabilistic classifier. It is based on
probability models that incorporate strong independence assumptions between the input
features. These independence assumptions often do not hold in reality, which is why
they are considered naive.
The probability models are derived using Bayes' theorem (credited to Thomas Bayes).
Depending on the nature of the probability model, the Naive Bayes algorithm can be
trained in a supervised learning setting.
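In symbols (standard Bayes' theorem with the naive independence assumption; the notation is mine, not taken from the InfoSphere documentation), for a class C and input values x_1, ..., x_n:

P(C | x_1, ..., x_n) = P(C) * P(x_1 | C) * ... * P(x_n | C) / P(x_1, ..., x_n)

The class with the highest posterior probability is chosen; the denominator is the same for every class, so it can be ignored when comparing classes.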
Data mining in InfoSphere Warehouse uses maximum-likelihood parameter estimation
for Naive Bayes models. The generated Naive Bayes model conforms to the Predictive
Model Markup Language (PMML) standard.
A Naive Bayes model consists of a large cube that includes the following dimensions:
 Input field name.
 Input field value for discrete fields, or input field value range for continuous
fields (continuous fields are divided into discrete bins by the Naive Bayes algorithm).
 Target field value.
This means that a Naive Bayes model records how often a target field value appears
together with a value of an input field. For example, the model stores how often
"age group = young" occurs together with "buys car = yes" in the training data.
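A minimal sketch of such a counting structure, using a nested map to stand in for the cube (the field and value names below are illustrative, taken from the dataset at the end of this report):

import java.util.HashMap;
import java.util.Map;

class CountCubeSketch {
    // cube: input field name -> input field value -> target value -> count
    static Map<String, Map<String, Map<String, Integer>>> cube = new HashMap<>();

    // Record one co-occurrence of an input-field value with a target value.
    static void record(String field, String value, String target) {
        cube.computeIfAbsent(field, f -> new HashMap<>())
            .computeIfAbsent(value, v -> new HashMap<>())
            .merge(target, 1, Integer::sum);
    }

    public static void main(String[] args) {
        record("age group", "young", "yes");
        record("age group", "young", "yes");
        record("age group", "young", "no");
        System.out.println(cube); // e.g. {age group={young={yes=2, no=1}}}
    }
}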

You can activate the Naive Bayes classification algorithm with a configuration
command. The algorithm includes the probability-threshold parameter ZeroProba.
The value of the probability-threshold parameter is used if one of the
above-mentioned dimensions of the cube is empty. A dimension is empty if no
training-data record exists with that combination of input-field value and
target-field value.
The default value of the probability-threshold parameter is 0.001. Optionally, you
can modify the probability threshold, for example by setting the value to 0.0002.
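As a rough illustration of what such a probability floor does, here is a hypothetical helper (this is not the InfoSphere Warehouse command or API; only the parameter name ZeroProba comes from the text above):

class ZeroProbaSketch {
    // Apply a probability floor so that an empty dimension (a combination
    // never seen in the training data) does not zero out the whole product.
    static double flooredProbability(int jointCount, int conditionCount,
                                     double zeroProba) {
        if (conditionCount == 0 || jointCount == 0) {
            return zeroProba; // empty dimension: fall back to the threshold
        }
        return (double) jointCount / conditionCount;
    }

    public static void main(String[] args) {
        System.out.println(flooredProbability(0, 14, 0.001)); // prints 0.001
    }
}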

Types of Naive Bayes classification

There are three types of Naive Bayes model, which are given below:

o Gaussian: The Gaussian model assumes that features follow a normal
distribution. This means that if predictors take continuous values instead of
discrete ones, the model assumes these values are sampled from a Gaussian
distribution.

o Multinomial: The Multinomial Naive Bayes classifier is used when the
data is multinomially distributed. It is primarily used for document
classification problems, i.e., deciding which category a particular document
belongs to, such as sports, politics, education, etc.
The classifier uses the frequency of words as the predictors.

o Bernoulli: The Bernoulli classifier works similarly to the Multinomial
classifier, but the predictor variables are independent Boolean variables,
such as whether or not a particular word is present in a document.
This model is also well known for document classification tasks.
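As a small illustration of the Gaussian variant, the sketch below computes the normal density that Gaussian Naive Bayes uses as the likelihood of a continuous feature value (in practice the mean and variance are estimated per class from the training data; the numbers here are made up):

class GaussianLikelihoodSketch {
    // Normal (Gaussian) density N(x; mean, variance), used by Gaussian
    // Naive Bayes as the likelihood of a continuous feature value x.
    static double density(double x, double mean, double variance) {
        double diff = x - mean;
        return Math.exp(-diff * diff / (2 * variance))
                / Math.sqrt(2 * Math.PI * variance);
    }

    public static void main(String[] args) {
        // Likelihood of x = 1.5 under a class with mean 0 and variance 1.
        System.out.println(density(1.5, 0.0, 1.0)); // about 0.1295
    }
}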

Classifier
In data science, a classifier is a type of machine learning algorithm used to
assign a class label to a data input. An example is an image recognition
classifier that labels an image (e.g., "car," "truck," or "person"). Classifier
algorithms are trained using labelled data; in the image recognition example, for
instance, the classifier receives training data consisting of labelled images.
After sufficient training, the classifier can then receive unlabelled images as
inputs and will output a classification label for each image.
Classifier algorithms employ sophisticated mathematical and statistical methods
to generate predictions about the likelihood of a data input being classified in a
given way. In the image recognition example, the classifier statistically predicts
whether an image is likely to be a car, a truck, or a person, or some other
classification that the classifier has been trained to identify.

Types of classifiers

Six commonly used classifiers in machine learning are discussed below:

1. Perceptron:
For binary classification problems, the Perceptron is a linear machine
learning technique. It is one of the original and most basic forms of
artificial neural networks.
It isn't "deep" learning, but it is a necessary building block. It is made
up of a single node, or neuron, that processes a row of data and predicts a
class label. This is achieved with a weighted sum of the inputs plus a bias
(whose input is fixed at 1); this weighted sum is called the activation.
The Perceptron is a linear classification algorithm. This means it learns a
decision boundary that divides two classes in the feature space using a line
(called a hyperplane in higher dimensions).
As a result, it is best suited to problems where the classes can be separated
by a line or linear model, known as linearly separable problems. The model's
coefficients, referred to as input weights, are trained using the stochastic
gradient descent optimization procedure.
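A minimal sketch of the Perceptron's prediction step, under the description above (the weights and bias are illustrative; training them with stochastic gradient descent is omitted):

class PerceptronSketch {
    // Weighted sum of the inputs plus a bias, followed by a step
    // activation: predict 1 if the activation is non-negative, else 0.
    static int predict(double[] x, double[] w, double bias) {
        double activation = bias;
        for (int i = 0; i < x.length; i++) {
            activation += w[i] * x[i];
        }
        return activation >= 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        double[] w = {0.5, -0.2}; // made-up "trained" weights
        System.out.println(predict(new double[]{1.0, 2.0}, w, 0.1)); // 1
    }
}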
 

2. Logistic Regression:
Under the Supervised Learning approach, logistic regression is one of the most
prominent Machine Learning algorithms. It is a method for predicting a
categorical dependent variable from a set of independent variables.
A logistic function is used to describe the probability of the possible
outcomes of a single trial. Logistic regression was created for this purpose
(classification), and it is especially good for determining how numerous
independent factors affect a single outcome variable.
Logistic Regression is quite similar to Linear Regression except in how they
are used: Linear Regression is used for regression problems, whereas Logistic
Regression is used for classification problems.
The algorithm's main drawbacks are that it only works when the predicted
variable is binary, it requires that all predictors be independent of one
another, and it expects the data to be free of missing values.
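The logistic function mentioned above can be sketched in a few lines (a minimal illustration, not a full logistic regression implementation):

class LogisticSketch {
    // Logistic (sigmoid) function: squashes a linear score z into (0, 1)
    // so it can be read as the probability of the positive class.
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        System.out.println(sigmoid(0.0)); // 0.5: a score of 0 is a coin flip
        System.out.println(sigmoid(4.0)); // about 0.982
    }
}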
 

3. Naive Bayes:
The Naive Bayes family of probabilistic algorithms calculates the
likelihood that any given data point falls into one or more of a set of
categories (or does not). It is a supervised learning approach for
classification problems, based on Bayes' theorem. It is a probabilistic
classifier, which means it makes predictions based on an object's likelihood.
In text analysis, Naive Bayes is used to classify customer comments, news
articles, emails, and other types of content into categories, themes, or
"tags" in order to organise them according to specified criteria.
The likelihood of each tag for a given text is calculated using Naive Bayes
algorithms, and the tag with the highest probability is output.
In other words, the probability of A being true given that B is true equals
the probability of B being true given that A is true, multiplied by the
probability of A and divided by the probability of B:
P(A|B) = P(B|A) x P(A) / P(B). Applied tag by tag, this estimates the
likelihood that a data piece belongs in a certain category.

4. K-Nearest Neighbours:
K nearest neighbours is a straightforward method that stores all available
examples and categorizes new ones using a similarity metric (e.g., a distance
function).
KNN has been used as a non-parametric approach in statistical estimation and
pattern recognition since the early 1970s. It is a form of lazy learning since
it does not try to build a general internal model; instead, it only stores
instances of the training data.
A case is classified by a majority vote of its neighbours, with the case being
assigned to the class most common among its K closest neighbours as determined
by a distance function. If K = 1, the case is simply assigned to the class of
its nearest neighbour.
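A minimal sketch of the majority-vote step described above, using one-dimensional features and absolute difference as the distance function (both simplifications are mine, for brevity):

import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

class KnnSketch {
    // Classify a query by majority vote among its k nearest neighbours.
    static String classify(double[] xs, String[] labels, double query, int k) {
        Integer[] idx = new Integer[xs.length];
        for (int i = 0; i < xs.length; i++) idx[i] = i;
        // Sort indices by distance from the query point.
        Arrays.sort(idx, Comparator.comparingDouble(i -> Math.abs(xs[i] - query)));
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k; i++) votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 1.2, 3.5, 3.7};
        String[] labels = {"a", "a", "b", "b"};
        System.out.println(classify(xs, labels, 1.1, 3)); // "a"
    }
}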

5. Support Vector Machine:
The Support Vector Machine, or SVM, is a popular Supervised Learning
technique that can be used for both classification and regression problems.
However, it is mostly used in Machine Learning for classification problems.
The SVM algorithm's goal is to find the optimal line or decision boundary
that divides n-dimensional space into classes, so that additional data points
can readily be placed in the correct category in the future. This optimal
decision boundary is called a hyperplane.
SVM selects the extreme points/vectors that help create the hyperplane.
These extreme instances are called support vectors, which is why the method
is called a Support Vector Machine. The decision boundary, or hyperplane,
separates the two categories with the widest possible margin.
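For a trained linear SVM, prediction reduces to the sign of w·x + b, i.e., which side of the hyperplane the point falls on (a minimal sketch with made-up weights; finding the maximum-margin w and b is the training problem and is omitted):

class SvmDecisionSketch {
    // Sign of the decision function w·x + b: which side of the
    // separating hyperplane the point x lies on.
    static int classify(double[] x, double[] w, double b) {
        double score = b;
        for (int i = 0; i < x.length; i++) {
            score += w[i] * x[i];
        }
        return score >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        double[] w = {1.0, -1.0}; // illustrative hyperplane normal
        System.out.println(classify(new double[]{2.0, 0.5}, w, -0.5)); // 1
    }
}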

6. Random Forest:
Random forest is a supervised learning approach used in machine learning
for classification and regression. It is a classifier that averages the results
of many decision trees applied to distinct subsets of a dataset to improve
predictive accuracy.
It is also known as a meta-estimator since it fits a number of decision trees
on different sub-samples of the dataset and uses averaging to enhance the
model's predictive accuracy and prevent over-fitting. The size of each
sub-sample is the same as the size of the original input sample, but the
samples are drawn with replacement.
It builds a "forest" out of a collection of decision trees that are usually
trained using the "bagging" method. The main idea of the bagging approach is
that combining many learning models enhances the final result. Rather than
relying on a single decision tree, the random forest gathers the prediction
from each tree and outputs the final prediction based on the majority of votes.
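The majority vote at the end of a random forest can be sketched on its own (the tree predictions below are made up; growing the trees on bootstrap samples is omitted):

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class ForestVoteSketch {
    // Aggregate individual tree predictions by majority vote, as the
    // bagging step of a random forest does for classification.
    static String majorityVote(String[] treePredictions) {
        Map<String, Integer> votes = new HashMap<>();
        for (String p : treePredictions) votes.merge(p, 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        System.out.println(majorityVote(new String[]{"yes", "no", "yes"})); // yes
    }
}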

Code:
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class Dataset {
    // Each tuple is one CSV row, stored as attribute name -> value.
    public ArrayList<Map<String, String>> tuples;

    public Dataset() {
        this.tuples = new ArrayList<Map<String, String>>();
    }

    public void addTuple(Map<String, String> tuple) {
        this.tuples.add(tuple);
    }

    // Read a CSV file whose first line holds the attribute names.
    public static Dataset readFromCSV(String path)
            throws FileNotFoundException, IOException {
        Dataset ds = new Dataset();
        BufferedReader br = new BufferedReader(new FileReader(path));
        String firstLine = br.readLine();
        String[] headers = firstLine.split(",");
        String line;
        while ((line = br.readLine()) != null) {
            String[] vals = line.split(",");
            Map<String, String> row = new HashMap<String, String>();
            for (int i = 0; i < headers.length; i++) {
                // trim() guards against stray spaces in the CSV.
                row.put(headers[i].trim(), vals[i].trim());
            }
            ds.addTuple(row);
        }
        br.close();
        return ds;
    }

    // Distinct values of the given attribute (e.g., the class labels).
    public Set<String> getClassNames(String attr) {
        Set<String> set = new HashSet<String>();
        for (Map<String, String> tuple : tuples) {
            set.add(tuple.get(attr));
        }
        return set;
    }

    // Prior probability P(attr = clazz): fraction of tuples with that value.
    public float pcx(String attr, String clazz) {
        float count = 0f;
        for (Map<String, String> tuple : tuples) {
            if (tuple.get(attr).equals(clazz)) {
                count++;
            }
        }
        return count / tuples.size();
    }

    // Likelihood P(attr = val | classAttr = clazz): among the tuples of the
    // given class, the fraction whose attribute equals val.
    public float pxgc(String classAttr, String clazz, String attr, String val) {
        float classCount = 0f;
        float matchCount = 0f;
        for (Map<String, String> tuple : tuples) {
            if (tuple.get(classAttr).equals(clazz)) {
                classCount++;
                if (tuple.get(attr).equals(val)) {
                    matchCount++;
                }
            }
        }
        if (classCount == 0) {
            return 0f; // class never seen in the training data
        }
        return matchCount / classCount;
    }

    // Naive Bayes prediction: for each class, multiply the prior P(class)
    // by the likelihoods P(value | class) of the tuple's attribute values,
    // then return the class with the highest score.
    public String predict(Map<String, String> tuple, String classAttr) {
        Set<String> classes = getClassNames(classAttr);
        Map<String, Float> probabilities = new HashMap<String, Float>();

        for (String clazz : classes) {
            float classProbability = pcx(classAttr, clazz);
            float givenXProbability = 1f;
            for (String attr : tuple.keySet()) {
                if (attr.equals(classAttr)) {
                    continue;
                }
                givenXProbability *= pxgc(classAttr, clazz, attr, tuple.get(attr));
            }
            float finalProbability = classProbability * givenXProbability;
            System.out.println("calculated " + clazz + " : " + finalProbability);
            probabilities.put(clazz, finalProbability);
        }

        // Pick the class with the maximum score.
        String prediction = "NAN";
        float currMax = 0f;
        for (String clazz : classes) {
            float probability = probabilities.get(clazz);
            if (currMax < probability) {
                prediction = clazz;
                currMax = probability;
            }
        }
        return prediction;
    }
}

class Main {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        Dataset ds = Dataset.readFromCSV("data2.csv");

        // The unseen tuple to classify.
        Map<String, String> toPredict = new HashMap<String, String>();
        toPredict.put("type of family", "single");
        toPredict.put("age group", "young");
        toPredict.put("income status", "low");

        String prediction = ds.predict(toPredict, "buys car");
        System.out.println(prediction);
    }
}
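With data2.csv in the working directory, the program can be compiled and run with, for example, javac Main.java followed by java Main (assuming both classes live in Main.java, as in the Replit link above). It prints the score computed for each class and then the predicted label. Note that the string comparisons are case-sensitive, which is why the attribute values in toPredict must match the spelling used in the CSV exactly.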

Dataset:
type of family,age group,income status,buys car
nuclear,young,low,yes
extended,old,low,no
childless,middle,low,no
childless,young,medium,yes
single,middle,medium,yes
childless,young,low,no
nuclear,old,high,yes
nuclear,middle,medium,yes
extended,middle,high,yes
single,old,low,no

Output:
