Classification
Regression Analysis
• In classification, the classes are predefined.
• The categories are defined in advance, and a training data
set is available.
• For every record in the training set, the category (label,
class) is already known.
• Because classification is a supervised learning process, the
output for each row is given.
• In clustering, there are no predefined classes.
• Clustering tries to find homogeneous groups in the data,
which is why it is called exploratory data mining.
• The training examples contain only attribute values.
• Because the class (category, label) is not available with
the data, clustering is unsupervised learning.
Regression Analysis
Linear Regression
• Example: predicting this year's crop yield from data on
rainfall and on the manure and pesticides applied to the land
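The crop-yield example can be sketched as a least-squares fit. This is a minimal illustration that assumes a single predictor (rainfall) for brevity; the slide's example also uses manure and pesticides. The function names and all numbers are invented for the sketch and are not real measurements.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (unnormalised)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply a fitted (intercept, slope) pair to a new predictor value."""
    intercept, slope = model
    return intercept + slope * x


# Made-up illustrative data: rainfall in mm, yield in tonnes per hectare
rainfall_mm = [400.0, 500.0, 600.0, 700.0, 800.0]
yield_t_per_ha = [2.0, 2.5, 3.0, 3.5, 4.0]

model = fit_line(rainfall_mm, yield_t_per_ha)
forecast = predict(model, 650.0)  # expected yield for 650 mm of rainfall
print(model, forecast)
```

With several predictors, the same idea would be solved as a multiple regression, typically with a linear algebra library rather than by hand.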
Applications of Linear Regression
• To estimate the GDP of a country
• To predict the runs a player would score in the coming
matches based on his previous performances
Descriptive Function
• The descriptive function deals with the general
properties of data in the database. Some descriptive
functions are:
• Mining of Association
• Mining of Correlations
• Mining of Clusters
Classification and Prediction
• Classification is the process of finding a model that
describes the data classes or concepts. The purpose is
to be able to use this model to predict the class of
objects whose class label is unknown. This derived
model is based on the analysis of sets of training data.
The derived model can be presented in the following
forms:
• Classification (IF-THEN) Rules
• Decision Trees
• Neural Networks
• Mathematical formulae
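As a minimal sketch of the first form, a derived model expressed as IF-THEN rules could be applied to a record as follows. The attributes, thresholds, and class names are hypothetical and chosen only for illustration; a real rule set would be derived from training data.

```python
# Each rule is (condition, class label); the first rule whose condition holds fires.
# Hypothetical credit-approval rules, for illustration only.
RULES = [
    (lambda r: r["has_defaulted"], "reject"),      # IF has_defaulted THEN reject
    (lambda r: r["income"] >= 50_000, "approve"),  # IF income >= 50000 THEN approve
]
DEFAULT_CLASS = "review"  # used when no rule applies


def classify(record, rules=RULES, default=DEFAULT_CLASS):
    """Return the class label of the first matching rule, else the default."""
    for condition, label in rules:
        if condition(record):
            return label
    return default


print(classify({"income": 60_000, "has_defaulted": False}))
print(classify({"income": 60_000, "has_defaulted": True}))
print(classify({"income": 20_000, "has_defaulted": False}))
```

Keeping the rules in an ordered list makes the model easy to read and to extend, which is one reason rule-based models are considered interpretable.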
Classification vs. Prediction
• Classification:
– predicts categorical class labels
– classifies data (constructs a model) based on the
training set and the values (class labels) in a
classifying attribute and uses it in classifying new data
• Prediction:
– models continuous-valued functions, i.e., predicts
unknown or missing values
• Typical Applications
– credit approval
– target marketing
– medical diagnosis
– treatment effectiveness analysis
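The difference between the two tasks can be illustrated with one simple method used in both ways. The sketch below assumes a k-nearest-neighbour approach, which the slides do not name, and uses invented data: for classification the neighbours vote on a categorical label, and for prediction their continuous values are averaged.

```python
from collections import Counter


def nearest_targets(train, x, k):
    """Targets of the k training points whose (1-D) feature is closest to x."""
    ranked = sorted(train, key=lambda pair: abs(pair[0] - x))
    return [target for _, target in ranked[:k]]


def classify(train, x, k=3):
    """Classification: majority vote over categorical labels."""
    return Counter(nearest_targets(train, x, k)).most_common(1)[0][0]


def predict_value(train, x, k=3):
    """Prediction: average of continuous target values."""
    targets = nearest_targets(train, x, k)
    return sum(targets) / len(targets)


# Made-up training sets: (feature, target)
labelled = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
            (8.0, "high"), (9.0, "high"), (10.0, "high")]
numeric = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0),
           (8.0, 80.0), (9.0, 90.0), (10.0, 100.0)]

print(classify(labelled, 2.5))       # a categorical class label
print(predict_value(numeric, 2.5))   # a continuous value
```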
Classification Learning: Definition
Classification: Test data are used to estimate the accuracy of the classification
rules. If the accuracy is considered acceptable, the rules can be applied to the
classification of new data tuples.
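The accuracy check described here can be sketched as follows. The classifier, the test tuples, and the acceptance threshold are all hypothetical; what counts as "acceptable" depends on the application.

```python
def accuracy(classifier, test_set):
    """Fraction of test tuples whose predicted class equals the known class."""
    correct = sum(1 for features, label in test_set if classifier(features) == label)
    return correct / len(test_set)


# Hypothetical one-rule classifier: IF value >= 5 THEN "high" ELSE "low"
def rule_classifier(value):
    return "high" if value >= 5 else "low"


# Made-up test data, held out from training: (feature, true class)
test_set = [(1, "low"), (4, "low"), (6, "high"), (7, "low"), (9, "high")]

ACCEPTABLE = 0.75  # assumed threshold, chosen only for this sketch
score = accuracy(rule_classifier, test_set)
acceptable = score >= ACCEPTABLE
print(score, acceptable)
```

The test set must be kept separate from the training set; measuring accuracy on the training data itself tends to give an overly optimistic estimate.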
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
– Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
– New data is classified based on the training set
• Unsupervised learning (clustering)
– The class labels of the training data are unknown
– Given a set of measurements, observations, etc., the
aim is to establish the existence of classes or
clusters in the data
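To make the unsupervised case concrete, here is a minimal sketch that assumes k-means, one common clustering method that the slides do not name. The input has no class labels; the data and the starting centres are invented for illustration.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny 1-D k-means: assign each value to its nearest centre, then re-centre."""
    groups = [[] for _ in centers]
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[idx].append(v)
        # An empty group keeps its previous centre
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Unlabelled, made-up measurements
values = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
centers, groups = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)
print(groups)
```

The groups that emerge are discovered from the attribute values alone, in contrast to classification, where the classes are given with the training data.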
Issues (1): Data Preparation
• Data cleaning
– Preprocess data in order to reduce noise and handle
missing values
• Relevance analysis (feature selection)
– Remove the irrelevant or redundant attributes
• Data transformation
– Generalize and/or normalize data
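Two of these preparation steps can be sketched in a few lines. The sketch assumes mean imputation for missing values and min-max scaling for normalization; both are common choices rather than the only ones, and the sample column is made up.

```python
def fill_missing(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Data transformation: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


raw = [10.0, None, 30.0, 20.0]  # made-up attribute column with one missing value
cleaned = fill_missing(raw)
normalized = min_max_normalize(cleaned)
print(cleaned)
print(normalized)
```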
Issues (2): Evaluating Classification Methods
• Predictive accuracy
• Speed
– time to construct the model
– time to use the model
• Robustness
– handling noise and missing values
• Scalability
– efficiency in disk-resident databases
• Interpretability
– understanding and insight provided by the model
• Goodness of rules
– decision tree size
– compactness of classification rules
The problem
• Given a set of training cases/objects and their attribute
values, try to determine the target attribute value of new
examples.
– Classification
– Prediction
• Use a decision tree to predict categories for new events.
• Use training data to build the decision tree.
[Figure: Training Events and Categories are used to build the Decision Tree; New Events are passed to the Decision Tree, which outputs a Category.]
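The two steps (build the tree from training data, then use it to categorize new events) can be sketched as below. The sketch assumes categorical attributes and an entropy-based choice of split attribute in the style of ID3; the slide does not specify a splitting criterion. The attribute names and training events are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of category labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def build_tree(rows, attributes):
    """rows: list of (attribute_dict, category). Returns a nested dict or a category."""
    labels = [category for _, category in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority  # leaf node

    def split_entropy(attr):
        # Weighted entropy of the categories after splitting on attr
        groups = {}
        for features, category in rows:
            groups.setdefault(features[attr], []).append(category)
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())

    best = min(attributes, key=split_entropy)
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {features[best] for features, _ in rows}:
        subset = [(f, c) for f, c in rows if f[best] == value]
        branches[value] = build_tree(subset, remaining)
    return {"attribute": best, "branches": branches, "default": majority}


def classify(tree, event):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(event[tree["attribute"]], tree["default"])
    return tree


# Made-up training events and their categories
training = [
    ({"outlook": "sunny", "windy": "no"}, "play"),
    ({"outlook": "sunny", "windy": "yes"}, "play"),
    ({"outlook": "rainy", "windy": "no"}, "play"),
    ({"outlook": "rainy", "windy": "yes"}, "stay"),
]
tree = build_tree(training, ["outlook", "windy"])
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```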
Decision Tree