
THYNK UNLIMITED

WE LEARN FOR THE FUTURE

CLASSIFICATION IN
MACHINE LEARNING
PRESENTATION
PRESENTED BY:

PHẠM TUẤN DŨNG


WHAT IS ARTIFICIAL
INTELLIGENCE?
Artificial intelligence, or AI, is technology that enables
computers and machines to simulate human intelligence
and problem-solving capabilities.

On its own or combined with other technologies (e.g.,
sensors, geolocation, robotics), AI can perform tasks that
would otherwise require human intelligence or
intervention.

Digital assistants, GPS guidance, autonomous vehicles,
and generative AI tools (like OpenAI's ChatGPT) are just
a few examples of AI in the daily news and our daily lives.
WHAT IS MACHINE LEARNING?
Machine learning (ML) is a branch of
artificial intelligence (AI) and computer
science that focuses on using data
and algorithms to enable AI to imitate the
way that humans learn, gradually
improving its accuracy.

It is a key driver of AI applications,
including natural language processing,
image recognition, and recommendation
systems.
TYPES OF MACHINE LEARNING
Supervised Learning
Unsupervised Learning
Semi-Supervised Learning
Reinforcement Learning


SUPERVISED LEARNING:
Supervised learning covers algorithms that predict the
output for a new input based on previously known
(input, outcome) pairs. This data pair is also called
(data, label). Supervised learning is the most popular
group of Machine Learning algorithms.

Supervised learning algorithms are further divided
into two main types:

Classification
Regression


CLASSIFICATION:
Classification is a supervised machine learning method where the model tries to
predict the correct label of a given input data point. In classification, the model is
fully trained using the training data, and then it is evaluated on test data before
being used to perform prediction on new, unseen data.

For instance, an algorithm can learn to
predict whether a given email is spam or
ham (not spam):
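The spam-or-ham example above can be sketched with a toy word-count classifier. This is a minimal illustration of training on (data, label) pairs and then predicting labels for new inputs, not a real spam filter; the messages and scoring rule are invented.

```python
# A toy spam/ham classifier: score each word by how often it appears in
# spam versus ham training messages, then classify by the summed score.

def train(examples):
    """Count, per word, spam occurrences minus ham occurrences."""
    scores = {}
    for text, label in examples:
        for word in text.lower().split():
            delta = 1 if label == "spam" else -1
            scores[word] = scores.get(word, 0) + delta
    return scores

def classify(scores, text):
    """Label a message 'spam' if its words lean toward the spam class."""
    total = sum(scores.get(w, 0) for w in text.lower().split())
    return "spam" if total > 0 else "ham"

training_data = [
    ("win a free prize now", "spam"),
    ("free money win big", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday?", "ham"),
]
model = train(training_data)
print(classify(model, "win free money"))  # leans spam
print(classify(model, "monday meeting"))  # leans ham
```

Real classifiers (e.g. Naive Bayes or logistic regression) use probabilities rather than raw counts, but the train-then-predict workflow is the same.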
REGRESSION:
Regression is a statistical method used
to analyze the relationship between a
dependent variable (target variable) and
one or more independent variables
(predictor variables). The goal is to
determine the most suitable function
that describes the relationship between
these variables.

It seeks to find the best-fitting model,
which can be used to make predictions
or draw conclusions.
UNSUPERVISED LEARNING:
In this setting, we do not know the outcome or label but only the input data. The
unsupervised learning algorithm will rely on the structure of the data to perform
certain tasks, such as clustering or dimensionality reduction for convenient storage
and calculation.
Mathematically, unsupervised learning is when we only have the input data X without
knowing the corresponding label Y.
Unsupervised learning algorithms are further divided into two main types:

Clustering
Association
CLUSTERING:
Clustering is the process of arranging a group of objects in such a manner that the
objects in the same group (which is referred to as a cluster) are more similar to
each other than to the objects in any other group. Data professionals often use
clustering in the Exploratory Data Analysis phase to discover new information and
patterns in the data. As clustering is unsupervised machine learning, it doesn’t
require a labeled dataset.
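Clustering can be illustrated with a tiny k-means run. The sketch below works on 1-D data with k = 2 and a naive initialization; the points are invented, and a real implementation (e.g. scikit-learn's KMeans) handles multiple dimensions, restarts, and empty clusters.

```python
# Minimal 1-D k-means (k=2): repeatedly assign each point to the nearest
# center, then move each center to the mean of its assigned points.
def kmeans_1d(points, iters=10):
    centers = [min(points), max(points)]  # naive initialization
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            i = 0 if abs(p - centers[0]) <= abs(p - centers[1]) else 1
            clusters[i].append(p)
        centers = [sum(c) / len(c) for c in clusters]
    return centers, clusters

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers, clusters = kmeans_1d(points)
print(centers)  # two cluster means: one near 1.0, one near 8.1
```

Note that no labels were needed: the grouping emerges purely from the structure of the data, which is exactly the unsupervised setting described above.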
ASSOCIATION:
Association learning, often referred to in the context of association rule learning, is
a rule-based machine learning method for discovering interesting relations
between variables in large databases. It is intended to identify strong rules
discovered in databases using some measures of interestingness.

This method is widely used for market basket
analysis, where it is used to find relationships
between items that are frequently bought
together.
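The two standard "measures of interestingness" are support (how often an itemset occurs) and confidence (how often the rule's right-hand side occurs given its left-hand side). The sketch below computes them by hand for one rule over invented basket data; it is not a full Apriori implementation.

```python
# Support and confidence for the rule {bread} -> {butter} over toy baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "butter"}))       # 0.5  (2 of 4 baskets)
print(confidence({"bread"}, {"butter"}))  # ~0.67 (2 of the 3 bread baskets)
```

Algorithms such as Apriori or FP-Growth simply search for all rules whose support and confidence exceed chosen thresholds.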
SEMI-SUPERVISED LEARNING:
Semi-supervised learning is a branch of machine learning that combines
supervised and unsupervised learning by using both labeled and unlabeled data to
train artificial intelligence (AI) models for classification and regression tasks.

In fact, many Machine Learning problems belong to this group because collecting
labeled data takes a lot of time and has high costs. Many types of data even require
experts to label (medical images, for example). In contrast, unlabeled data can be
collected at low cost from the internet.
REINFORCEMENT LEARNING:
Reinforcement learning (RL) is a machine learning (ML) technique that trains
software to make decisions that achieve optimal results. It mimics the
trial-and-error learning process that humans use to achieve their goals.

An example of reinforcement learning is
teaching a computer program to play a video
game. The program learns by trying different
actions, receiving points for good moves and
losing points for mistakes. Over time, it learns
the best strategies to maximize its score and
improve its performance in the game.
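The trial-and-error idea above can be sketched with tabular Q-learning on a toy environment: a 5-state corridor where the agent starts at state 0 and earns a reward of 1 for reaching state 4. The environment, hyperparameters, and episode count are all invented for illustration.

```python
import random

# Tabular Q-learning on a 5-state corridor: start at 0, reward 1 at state 4.
random.seed(0)
n_states, actions = 5, [-1, +1]          # actions: step left or right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.3    # learning rate, discount, exploration

for episode in range(300):
    s = 0
    while s != n_states - 1:
        # Epsilon-greedy: mostly exploit the best known action, sometimes explore.
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Q-learning update: nudge Q(s, a) toward r + gamma * max_a' Q(s', a').
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])
        s = s2

# The learned greedy policy: best action in each non-terminal state.
best = {s: max(actions, key=lambda act: Q[(s, act)]) for s in range(n_states - 1)}
print(best)  # the agent learns to move right (+1) in every state
```

Good moves (reaching the reward) raise Q-values, which then propagate backwards through the corridor, exactly the "points for good moves" dynamic described above.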
SOME BASIC MACHINE LEARNING ALGORITHMS

LINEAR REGRESSION
An algorithm used to predict the value of a
variable based on the value of another variable.

DECISION TREE
A graph of decisions and their possible
consequences.

RANDOM FOREST
An ensemble learning method for classification,
regression, and other tasks that works by
building a multitude of decision trees at
training time.
LINEAR REGRESSION:
Linear Regression is one of the most important algorithms in Machine Learning,
especially in the Supervised Learning category. This algorithm predicts
continuous values based on input data. Linear Regression finds a linear relationship
between the input variable (X) and the output variable (Y) by finding a straight line
of the form Y = mX + b, where:
m is the slope of the line, also known as the weight.
b is the y-axis intercept.
LINEAR REGRESSION:
The goal of the algorithm is to adjust the weights m and b so that the distance
between the data points and the line is minimized, usually measured by calculating
the sum of squared errors. The Linear Regression algorithm is used, for example, to
predict sales based on advertising costs, or house prices based on location and area.
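For one input variable, the m and b that minimize the sum of squared errors have a closed-form solution, sketched below in plain Python. The data points are invented (roughly following y = 2x + 1) for illustration.

```python
# Least-squares fit of the line Y = mX + b for toy 1-D data.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]  # roughly y = 2x + 1, with noise

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n
# Closed-form slope: covariance of X and Y divided by variance of X.
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
     / sum((x - mean_x) ** 2 for x in xs))
# The best-fit line always passes through the point of means.
b = mean_y - m * mean_x
print(round(m, 2), round(b, 2))  # slope near 2, intercept near 1
```

Libraries such as scikit-learn (LinearRegression) compute the same fit, generalized to many input variables.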
DECISION TREE:
A decision tree is a flowchart-like structure in which each internal node represents a
"test" on an attribute (e.g. whether a coin flip comes up heads or tails), each branch
represents the outcome of the test, and each leaf node represents a class label
(the decision taken after computing all attributes).
Each leaf node is labeled with the most
common class in the corresponding sub-
dataset.

Once built, the decision tree can be used
to classify new data by following the rules
from root to leaf.

The Decision Tree algorithm is applied to
classification and prediction problems in
machine learning and data mining.
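The root-to-leaf classification described above can be written directly as nested conditionals. The fruit attributes, thresholds, and labels below are invented; a learned tree has the same shape, with its tests chosen automatically from data.

```python
# A hand-written decision tree: each `if` is an internal node testing an
# attribute, each `return` is a leaf carrying a class label.
def classify_fruit(diameter_cm, color):
    if diameter_cm < 5:           # internal node: test on diameter
        if color == "red":        # internal node: test on color
            return "cherry"       # leaf: class label
        return "grape"            # leaf
    return "apple"                # leaf

print(classify_fruit(2, "red"))    # follows root -> left -> "cherry"
print(classify_fruit(8, "green"))  # follows root -> right -> "apple"
```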
RANDOM FOREST:
The Random Forest algorithm combines
many decision trees to create a more
stable and powerful Machine Learning
model. Each decision tree in a Random
Forest is trained on a randomly selected
subset of the data, and each tree
produces its own prediction.

When there is a new data point to predict, Random Forest will make a prediction by
combining the predictions of all the subtrees. Finally, the algorithm will choose the
result with the most votes as the conclusion.
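The voting step can be sketched on its own. Below, each "tree" is just a stub function standing in for a decision tree trained on a random data subset (the rules are invented); the forest's prediction is the majority vote of the trees.

```python
from collections import Counter

# Stub "trees" -- in a real forest each would be a learned decision tree.
def tree_a(text): return "spam" if "free" in text else "ham"
def tree_b(text): return "spam" if "win" in text else "ham"
def tree_c(text): return "ham"  # a weak tree that always predicts ham

forest = [tree_a, tree_b, tree_c]

def forest_predict(text):
    """Majority vote over the predictions of all trees."""
    votes = Counter(tree(text) for tree in forest)
    return votes.most_common(1)[0][0]

print(forest_predict("win free money"))  # two of three trees vote spam
print(forest_predict("meeting notes"))   # all three trees vote ham
```

Averaging many imperfect trees is what makes the ensemble more stable than any single tree.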
ENTROPY AND GINI INDEX IN DECISION TREES
Both entropy and Gini index are
impurity measures used in decision
trees to guide the process of splitting
data points.

They essentially tell you how mixed up
the data is at a particular node in the
tree, and the goal is to make the data
purer (more homogeneous) as you move
down the tree.
ENTROPY:
Entropy, in the context of decision
trees, is a measure of impurity or
disorder within a dataset at a
specific node. It essentially tells you
how mixed up the data is in terms of
class labels.

Entropy is calculated using a formula that involves the probabilities of
each class being present in the data. For a two-class problem, the result is
a value between 0 and 1, where:
0 indicates perfect purity: all data points belong to the same class
(e.g., all emails are spam).
1 indicates a complete mix-up: there is an equal probability of either class
being present (completely random).
HOW TO CALCULATE ENTROPY:
Entropy is computed from the probability pᵢ of each class at the node:

Entropy = -Σᵢ pᵢ log₂(pᵢ)

Example:
If we had a total of 10 data points in our dataset, with 3 belonging to the
positive class and 7 to the negative class, then:

Entropy = -(0.3 log₂ 0.3 + 0.7 log₂ 0.7) ≈ 0.88

The entropy is approximately 0.88.
The higher the entropy, the more disorder or impurity.
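The entropy calculation above translates directly into code. This sketch works from raw class counts, as in the 3-positive / 7-negative example:

```python
import math

def entropy(counts):
    """Entropy of a node, given the count of data points in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(round(entropy([3, 7]), 2))  # 0.88 -- the example above
print(entropy([5, 5]))            # 1.0  -- complete mix-up
print(entropy([10]))              # 0.0  -- perfect purity
```

The `if c > 0` guard skips empty classes, since log₂(0) is undefined (the 0·log₂0 term is taken as 0 by convention).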
ENTROPY IN DECISION TREE
INFORMATION GAIN:
Information gain, directly related to entropy in decision trees, tells you how
much more organized your data becomes after splitting it on a
particular feature. In simpler terms, it measures the reduction in uncertainty
about the class labels achieved by learning the value of that feature.

Mathematically, information gain can be expressed with the formula below:

information gain = (entropy of parent node) - (weighted average entropy of
the child nodes)
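A worked example makes the formula concrete. The class counts below are invented: a parent node with a 5/5 class mix is split into two children that each lean strongly toward one class.

```python
import math

def entropy(counts):
    """Entropy of a node, given the count of data points in each class."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

parent = [5, 5]                 # 5 positive, 5 negative: entropy = 1.0
children = [[4, 1], [1, 4]]     # the split separates the classes fairly well

# Weight each child's entropy by the fraction of points it received.
n = sum(sum(c) for c in children)
weighted = sum(sum(c) / n * entropy(c) for c in children)
gain = entropy(parent) - weighted
print(round(gain, 3))  # ~0.278 bits of uncertainty removed by this split
```

When growing a tree, the feature with the highest information gain is chosen for the split at each node.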
GINI INDEX:
In decision trees, the Gini index, also
known as Gini impurity, is another
measure of impurity used alongside
entropy. It essentially tells you how likely
you are to misclassify a data point if you
were to randomly pick one from a set.

Gini specifically looks at the probability of making a mistake. For a two-class
problem, it is a value between 0 and 0.5, where:
0 represents perfect purity: all data points belong to the same class (no
chance of misclassification).
0.5 represents a complete mix-up: there is an equal probability of either class
being present (completely random, high chance of misclassification).
GINI INDEX FORMULA:
The Gini index is computed from the probability pᵢ of each class at the node:

Gini = 1 - Σᵢ pᵢ²

Example:
For the same 3-positive / 7-negative split used in the entropy example:

Gini = 1 - (0.3² + 0.7²) = 0.42
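The Gini calculation, like entropy, is a few lines of code over the class counts at a node:

```python
def gini(counts):
    """Gini impurity of a node, given the count of data points in each class."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([3, 7]), 2))  # 0.42 -- the 3-vs-7 example
print(gini([10, 0]))           # 0.0  -- perfectly pure node
print(gini([5, 5]))            # 0.5  -- maximally mixed (two classes)
```

In practice, Gini and entropy usually pick very similar splits; Gini is slightly cheaper to compute since it needs no logarithm, which is why it is the default in some libraries.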
THANK YOU VERY MUCH
FOR LISTENING.
