
Data Mining: Classification

Ajay Kumar Jena


Classification and Prediction
• What is classification? What is prediction?

• Issues regarding classification and prediction

• Classification by decision tree induction


Data Mining Task
• A massive amount of data is available in the
information industry. It is of no use unless converted
to information.

• The goal is to analyze the data and extract the useful
information from it.

• Depending on the kind of data to be mined, two types of
task are performed in data mining:
• Classification and Prediction
• Descriptive

Classification vs. Clustering
• In classification there are some predefined classes.
• The categories are defined in advance, and we have a training data
set.
• For every record, the category/label/class is predefined.
• As it is a supervised learning process, the output of each
row is given.
• In clustering, we do not have any predefined classes.
• In clustering we try to find homogeneous groups in the
data, which is why it is called exploratory data mining.
• Training examples have only the attribute values.
• As the class/category/label is not available with the data,
it is unsupervised learning.
Regression Analysis

• Mainly used for prediction; known as predictive data
mining.
• Here the output is not a class.
• From the attribute values of a record we have to predict the
output.
• The output may be integer or real; mainly real.
• It is supervised learning and uses a training data set.
• Examples:
• Predict the value of gold in the next six months.
• Predict the value of a stock for the next day.
• Predict the rainfall in cm for the next month.

Linear Regression
• Example: predict this year's crop yield from data on the
rainfall, manure and pesticides used on the land.

• It uses independent and dependent variables.

• An independent variable is not changed by the effect of
other variables and is used to manipulate the
dependent variable. It is denoted as X.

• The value of a dependent variable changes when the
independent variable is manipulated. It is
denoted as Y.
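
As a minimal sketch of the idea above, a straight line Y = a + bX can be fitted by ordinary least squares. The function name `fit_line` and the rainfall and yield figures are made up for illustration, and only one independent variable is used to keep the example short.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: rainfall is X (independent), crop yield is Y (dependent).
rainfall = [10.0, 20.0, 30.0, 40.0]
crop_yield = [25.0, 45.0, 65.0, 85.0]

intercept, slope = fit_line(rainfall, crop_yield)

# Predict the yield for a rainfall value that is not in the training data.
predicted_yield = intercept + slope * 25.0
print(intercept, slope, predicted_yield)
```

With several independent variables (rainfall, manure, pesticides) the same principle applies, but the fit is usually done with a linear algebra library rather than by hand.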

Applications of Linear Regression
• To estimate the GDP of a country
• To predict the runs a player would score in the coming
matches based on previous performances

Descriptive Function
• The descriptive function deals with the general
properties of data in the database. Some descriptive
functions are

• Class / Concept Description

• Mining of Frequent Patterns

• Mining of Association

• Mining of Correlations

• Mining of Clusters
Classification and Prediction
• Classification is the process of finding a model that
describes the data classes or concepts. The purpose is
to be able to use this model to predict the class of
objects whose class label is unknown. This derived
model is based on the analysis of sets of training data.
The derived model can be presented in the following
forms:
• Classification (IF-THEN) Rules

• Decision Trees

• Neural Networks
• Mathematical formulae
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
• Prediction:
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
Classification Learning: Definition

• Given a collection of records (training set)
• Each record contains a set of attributes; one of the
attributes is the class
• Find a model/function for the class attribute as a
function of the values of the other attributes
• Goal: previously unseen records should be
assigned a class as accurately as possible
• Use test set to estimate the accuracy of the model.
• Often, the given data set is divided into training and test
sets, with training set used to build the model and test
set used to validate it.
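
The division into training and test sets can be sketched as follows. This is only an illustration: the function name `split_train_test`, the 30% test fraction, and the example records are assumptions, not part of the slides.

```python
import random


def split_train_test(records, test_fraction=0.3, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    shuffled = list(records)               # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (attribute values, class label)
records = [({"income": "high" if i % 2 else "low"}, "safe" if i % 2 else "risky")
           for i in range(10)]

training_set, test_set = split_train_test(records)
print(len(training_set), len(test_set))
```

The model is built from `training_set` only; `test_set` is held back and used afterwards to estimate accuracy.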
Illustrating Classification Learning
Examples of Classification Task
• Classifying email as spam or non-spam
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix,
beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment,
sports, etc.
• Predicting tumor cells as benign or malignant (normal/abnormal)
Classification - A Two-Step Process
• Model construction: describing a set of predetermined classes
– Building the Classifier or Model
– Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
– The set of tuples used for model construction: training set
– The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
– Using Classifier for Classification
– Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• The test set must be independent of the training set; otherwise the
accuracy estimate is over-optimistic and over-fitting goes undetected
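
The accuracy rate described above can be computed with a few lines of code. This is a minimal sketch; the function name `accuracy_rate` and the example labels are illustrative assumptions.

```python
def accuracy_rate(known_labels, predicted_labels):
    """Percentage of test samples whose predicted class matches the known class."""
    if len(known_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not known_labels:
        raise ValueError("test set is empty")
    correct = sum(1 for known, predicted in zip(known_labels, predicted_labels)
                  if known == predicted)
    return 100.0 * correct / len(known_labels)


# Hypothetical test set: known labels vs. the labels a classifier produced.
known = ["safe", "risky", "safe", "safe"]
predicted = ["safe", "risky", "risky", "safe"]
print(accuracy_rate(known, predicted))  # 3 of 4 correct -> 75.0
```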
Classification Process (1): Model Construction
Example: Loan application

The data classification process: Learning: Training data are analyzed by a
classification algorithm. Here, the class label attribute is loan decision, and the
learned model or classifier is represented in the form of classification rules.
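
A learned model in the form of classification (IF-THEN) rules could look like the sketch below. The attributes `age` and `income`, their values, the specific rules, and the default class are all hypothetical; the slide only states that the class label attribute is the loan decision.

```python
def loan_decision(applicant):
    """Apply hypothetical IF-THEN rules to one applicant (a dict of attributes)."""
    # IF age = youth THEN loan_decision = risky
    if applicant["age"] == "youth":
        return "risky"
    # IF income = high THEN loan_decision = safe
    if applicant["income"] == "high":
        return "safe"
    # IF age = middle_aged AND income = low THEN loan_decision = risky
    if applicant["age"] == "middle_aged" and applicant["income"] == "low":
        return "risky"
    # Default class when no rule fires (an assumption of this sketch)
    return "safe"


print(loan_decision({"age": "senior", "income": "high"}))      # safe
print(loan_decision({"age": "middle_aged", "income": "low"}))  # risky
```

The rules are checked in order, so the first rule that matches decides the class.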
Classification Process (2): Use the Model in Prediction

Classification: Test data are used to estimate the accuracy of the classification
rules. If the accuracy is considered acceptable, the rules can be applied to the
classification of new data tuples.
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the aim
is to establish the existence of classes or
clusters in the data
Issues (1): Data Preparation

• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
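
Two of the preparation steps above can be sketched in a few lines: filling missing values (one simple data cleaning strategy) and min-max normalization (one common transformation). The function names and the income figures are illustrative assumptions; other strategies are equally valid.

```python
def fill_missing_with_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # all values identical: nothing to scale
    return [(v - low) / (high - low) for v in values]


# Hypothetical attribute column with one missing value.
incomes = [20.0, None, 40.0, 60.0]
cleaned = fill_missing_with_mean(incomes)   # [20.0, 40.0, 40.0, 60.0]
normalized = min_max_normalize(cleaned)     # [0.0, 0.5, 0.5, 1.0]
print(cleaned, normalized)
```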
Issues (2): Evaluating Classification Methods
• Predictive accuracy
• Speed
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability:
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
The problem
• Given a set of training cases/objects and their attribute
values, try to determine the target attribute value of new
examples.
– Classification
– Prediction
• Use a decision tree to predict categories for new events.
• Use training data to build the decision tree.
[Figure: training events and their categories are used to build a decision tree; new events are passed through the tree to obtain a category.]
Decision Tree

Each day belongs to one of two categories: Yes / No


Decision Tree
Decision tree to represent learned target functions

• Each internal node tests an attribute
• Each branch corresponds to an attribute value
• Each leaf node assigns a classification
• The tree can also be represented by logical formulas.
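
The structure described above can be sketched with nested dictionaries. The weather attributes (`outlook`, `humidity`, `wind`), their values, and the shape of the tree are hypothetical; only the Yes / No categories come from the slides.

```python
# Internal node: dict with the tested attribute and one branch per value.
# Leaf node: a plain string holding the class.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {"attribute": "humidity",
                  "branches": {"high": "No", "normal": "Yes"}},
        "overcast": "Yes",
        "rain": {"attribute": "wind",
                 "branches": {"strong": "No", "weak": "Yes"}},
    },
}


def classify(node, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node


def rules_for(node, label, conditions=()):
    """List every root-to-leaf path ending in `label` as a conjunction of tests."""
    if not isinstance(node, dict):
        return [conditions] if node == label else []
    rules = []
    for value, child in node["branches"].items():
        rules += rules_for(child, label, conditions + ((node["attribute"], value),))
    return rules


day = {"outlook": "sunny", "humidity": "high", "wind": "weak"}
print(classify(tree, day))
print(rules_for(tree, "Yes"))
```

Each entry returned by `rules_for` is one conjunction of attribute tests; the class "Yes" is the disjunction of those conjunctions, which is the logical-formula view of the same tree.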
