16CS63: Machine Learning


Course Content

UNIT 1 Introduction to Machine Learning


(a) Machine Learning Techniques & Types
(b) Familiarization of Data Analytics and Intelligent Learning Algorithms
UNIT 2 Regression & Classification Methods
Regression Methods - (a) Linear Regression
(b) Non-linear Regression
Classification Methods - (a) Naive Bayes.
(b) Support Vector Machine (SVM).
(c) K – Nearest Neighbor
UNIT 3 Ensemble Methods
Ensemble Methods - (a) Decision Tree.
(b) Random Forest.
(c) AdaBoost
UNIT 4 Dimensionality Reduction & Parameter Estimation
Dimensionality Reduction - (a) Principal Component Analysis (PCA).
(b) Linear Discriminant Analysis (LDA).
Parameter Estimation – (a) Maximum Likelihood Estimate (MLE).
UNIT 5 Clustering Methods
Clustering Methods - (a) K – Means Clustering
(b) Hierarchical Clustering.
UNIT 6 Reinforcement Methods
Reinforcement Methods – (a) Markov Decision Process
(b) Q - Learning
UNIT 7 Perceptron
Perceptron - (a) Perceptron.
(b) Multilayer Perceptron.
UNIT 8 Artificial Neural Network
Artificial Neural Network - (a) Multi-layer Artificial Neural Network.
(b) Back Propagation Learning.
(c) RBF Network
UNIT 1 Introduction to Machine Learning
(a) Machine Learning Techniques & Types
(b) Familiarization of Data Analytics and Intelligent Learning Algorithms
PREREQUISITE

• Concepts of Linear Algebra, Probability and Statistics, and Data Mining.


Assessment Process
• Three Internal Assessments will be conducted which includes
5 marks of MCQ.
• The average of best two is taken and will be scaled to 30 marks
• Semester end examination is conducted for 70 marks.
• To pass the course as per the regulations of the University, students shall secure a minimum of 40 marks (IA and SEE together), provided SEE marks >= 28.

Assessment of Course Outcome is shown below



Tools in ML

• Anaconda, Keras: Python IDE / deep-learning library
• Julia: numeric and statistical computation
• Orange: data-visualization toolbox/widgets
• GlueViz: data visualization
• Weka, IBM SPSS, Mahout: regression and the "3 C's" (collaborative filtering, clustering, classification)


Tools in ML

Source: 10 Most Popular Machine Learning Software Tools in 2020 (updated) | by Sophia Martin | Towards Data Science


What Is Data Mining?

• Data mining (knowledge discovery from data)
• Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Exploratory data analysis
• Alternative names
• Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
• Query processing
• Expert systems or small ML/statistical programs
Why Data Mining?—Potential Applications

• Data analysis and decision support
• Market analysis and management
• Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
• Risk analysis and management
• Forecasting, customer churn, improved underwriting, quality control, competitive analysis
• Fraud detection and detection of unusual patterns (outliers)
• Other applications
• Text mining (news groups, email, documents) and Web mining
• DNA and bio-data analysis
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all

[Figure: “The Data Gap”: total new disk (TB) since 1995 grows far faster than the number of analysts]


What is Data Mining?
• Many definitions
• Non-trivial extraction of implicit, previously unknown and potentially useful information from data


What is (not) Data Mining?

What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”

What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
– Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
• Enormity of data
• High dimensionality of data
• Heterogeneous, distributed nature of data

[Figure: data mining at the intersection of statistics, machine learning/AI/pattern recognition, and database systems]


Data Mining Tasks
• Prediction Methods
• Use some variables to predict unknown or future values of
other variables.

• Description Methods
• Find human-interpretable patterns that describe the data.

From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996



Data Mining Tasks...
• Classification [Predictive]
• Clustering [Descriptive]
• Regression [Predictive]



Market Analysis and Management
• Where does the data come from?
• Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
• Target marketing
• Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
• Determine customer purchasing patterns over time
• Cross-market analysis
• Associations/co-relations between product sales, & prediction based on such associations
• Customer profiling
• What types of customers buy what products (clustering or classification)
Data Mining: A KDD Process

• Data mining is the core of the knowledge discovery process
• Problem definition - data understanding - data preparation - modelling - evaluation - deployment

[Figure: KDD pipeline: Databases -> Data Cleaning -> Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]


Steps of a KDD Process
• Learning the application domain
• relevant prior knowledge and goals of application
• Creating a target data set: data selection
• Data cleaning and preprocessing: (may take 60% of effort!)
• Data reduction and transformation
• Find useful features, dimensionality/variable reduction,
invariant representation.
• Choosing functions of data mining
• summarization, classification, regression, association,
clustering.
• Choosing the mining algorithm(s)
• Data mining: search for patterns of interest
• Pattern evaluation and knowledge presentation
• visualization, transformation, removing redundant patterns, etc.
Data Mining and Business Intelligence

[Figure: business intelligence pyramid, with increasing potential to support business decisions toward the top]
• Making decisions (End User)
• Data presentation, visualization techniques (Business Analyst)
• Data mining, information discovery (Data Analyst)
• Data exploration: statistical analysis, querying and reporting (Data Analyst)
• Data warehouses / data marts, OLAP (DBA)
• Data sources: paper, files, information providers, database systems, OLTP
Data Mining: On What Kinds of Data?

• Relational database
• Advanced database and information repository
• Object-relational database
• Spatial and temporal data
• Time-series data
• Stream data
• Multimedia database
• Heterogeneous and legacy database
• Text databases & WWW



Data Warehouse—Integrated

• Constructed by integrating multiple, heterogeneous data sources


• relational databases, flat files, on-line transaction records
• Data cleaning and data integration techniques are applied.
• Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.



OLTP vs. OLAP



Multi-tier architecture

MOLAP; ROLAP; HOLAP


Data warehouse models

• Virtual warehouse (a logical structure)
• Data mart
• Enterprise warehouse


Virtual data warehouse
• Collective view of the complete data. A virtual data warehouse has no historic data; it can be considered a logical data model of the underlying metadata.
• Data virtualization makes all data, regardless of where it is located and regardless of what format it is in, look as if it is in one place and in a consistent format.
• Advantages:
• Accesses information directly from the source in real time
• Quickly validates new business models using an agile approach to data integration
• Reduces IT operational costs
• Increases end-user productivity by empowering users with better information access


Advantages of a Data Mart
• Efficient access — A data mart gives time-saving access to a specific subset of data.
• Inexpensive data warehouse alternative — Data marts can be an inexpensive alternative to developing an enterprise data warehouse.
• Improved data warehouse performance — Dependent and hybrid data marts can improve the performance of a data warehouse by taking on the burden of processing to meet the needs of the analyst. When dependent data marts are placed in a separate processing facility, they significantly reduce analytics processing costs as well.
• Data maintenance — Different departments can own and control their data.
• Simple setup — The simple design requires less technical skill to set up.


Data mart



Enterprise warehouse

• Four perspectives:
• Business Perspective:
• characterises the procedures and standards by which the business works on a
day-to-day basis.
• Application Perspective:
• characterises the interactions among the procedures and standards utilised by the
organisation.
• Information Perspective
• characterises and groups the raw information, such as record documents,
databases, pictures, presentations and spreadsheets that a company requires to
operate efficiently.
• Technology Perspective:
• characterises the equipment, working frameworks, programming and
networking solutions utilised by an organisation
Definition
• Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn.
• “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
• Machine learning focuses on the development of computer programs that can access data and use it to learn for themselves.
• The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust actions accordingly.


ML can play a key role in a wide range of critical applications, such as:
1. data mining,
2. natural language processing,
3. image recognition,
4. expert systems.

ML provides potential solutions in all these domains and more, and is set to be a pillar of our future civilization.


MACHINE LEARNING TECHNIQUES

• 1. Regression: Regression algorithms are mostly used to make predictions on numbers, i.e., when the output is a real or continuous value.
• Algorithms under Regression:
a. Simple Linear Regression Model: A statistical method that analyses the relationship between two quantitative variables. This technique is mostly used in financial fields, real estate, etc.
b. Logistic Regression: It is used in cases of fraud detection, clinical trials, etc., wherever the output is binary.
c. Support Vector Regression: SVR is a bit different from SVM. In simple regression, the aim is to minimize the error, while in SVR we try to fit the error within a threshold.
d. Multivariate Regression Algorithm: This technique is used in the case of multiple predictor variables. It can be implemented with matrix operations using Python's NumPy library, as sketched after this list.
e. Multiple Regression Algorithm: It works with multiple quantitative variables in both linear and non-linear regression algorithms.
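
Item (d) mentions that multivariate regression can be implemented with NumPy's matrix operations; the sketch below is a minimal, illustrative version using the least-squares solution of the normal equations. The data, column count and values are all invented for demonstration.

```python
import numpy as np

# Toy data: 5 samples with 2 hypothetical predictor variables
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([8.1, 7.9, 18.2, 17.8, 25.1])

# Prepend a column of ones so the model also learns an intercept
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares solution of the normal equations: theta = (X^T X)^-1 X^T y
theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("intercept:", theta[0], "coefficients:", theta[1:])
```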



MACHINE LEARNING TECHNIQUES

• 2. Classification: A classification model, a method of supervised learning, draws a conclusion from observed values as one or more outcomes in a categorical form. For example, email has filters like inbox, drafts, spam, etc. There are a number of algorithms in the classification model, like Logistic Regression, Decision Tree, Random Forest, Multilayer Perceptron, etc.
• 2 types: binary classifier and multi-class classifier.
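
As a small, hedged illustration of a binary classifier, the sketch below fits Logistic Regression (one of the algorithms named above) on synthetic data; it assumes scikit-learn is available.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic two-class data standing in for, e.g., spam vs. not spam
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train the classifier and check accuracy on held-out data
clf = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```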



MACHINE LEARNING TECHNIQUES

• 3. Clustering: Clustering is a Machine Learning technique that involves grouping data points into specific groups. If we have some objects or data points, then we can apply clustering algorithm(s) to analyze and group them as per their properties and features.
• Clustering methods:
• Density-based methods: Clusters are considered dense regions, distinguished by their internal similarity and their difference from the surrounding lower-density region.
• Hierarchical methods: The clusters formed in this method have a tree-like structure, with new clusters formed from previous clusters. There are two types of hierarchical methods: agglomerative (bottom-up) and divisive (top-down).
• Partitioning methods: This method partitions the objects into k clusters, and each partition forms one cluster.
• Grid-based methods: Data are combined into a number of cells that form a grid-like structure.


MACHINE LEARNING TECHNIQUES

• 4. Association Analysis: Association rule mining finds interesting associations and relationships among large sets of data items. An association rule shows how frequently an itemset occurs in a transaction. A typical example is Market Basket Analysis.
• Market Basket Analysis is one of the key techniques used by large retailers to show associations between items. It allows retailers to identify relationships between the items that people frequently buy together, as in the sketch below.
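
The sketch below illustrates the core idea by counting how often item pairs co-occur across transactions; the baskets are invented, and a real association miner (e.g., Apriori) would additionally prune by minimum support and derive rule confidence.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (item names are invented)
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

# Count how often each pair of items is bought together
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of a pair = co-occurrence count / number of transactions
for pair, count in pair_counts.most_common(3):
    print(pair, "support =", count / len(transactions))
```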



MACHINE LEARNING TECHNIQUES

• 5. Reinforcement Learning: The algorithm learns to react to the environment.
• Value-Based:
• In a value-based Reinforcement Learning method, you try to maximize a value function V(s). In this method, the agent expects a long-term return of the current states under policy π.
• Policy-Based:
• In a policy-based RL method, you try to come up with a policy such that the action performed in every state helps you to gain maximum reward in the future.
• Model-Based:
• In this Reinforcement Learning method, you need to create a virtual model for each environment. The agent learns to perform in that specific environment.
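
Reinforcement learning is treated in depth in Unit 6; as a hedged preview of the value-based flavor described above, here is a minimal tabular Q-learning sketch. The 2-state, 2-action environment and all constants are invented for illustration.

```python
import numpy as np

# Tiny invented environment: 2 states, 2 actions; action 1 in state 1 pays off
n_states, n_actions = 2, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.2   # learning rate, discount, exploration rate
Q = np.zeros((n_states, n_actions))     # table of action values being maximized
rng = np.random.default_rng(0)

def step(state, action):
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    return (state + action) % n_states, reward

state = 0
for _ in range(1000):
    # epsilon-greedy: mostly exploit the current table, sometimes explore
    action = int(rng.integers(n_actions)) if rng.random() < epsilon else int(Q[state].argmax())
    next_state, reward = step(state, action)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q)  # the learned action values
```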



Algorithms by Learning Style

There are different ways an algorithm can model a problem based on its interaction with the experience or environment, or whatever we choose to call the input data. There are three different styles of machine learning algorithm:

• 1. Supervised Learning
• 2. Unsupervised Learning
• 3. Reinforcement Learning


SUPERVISED LEARNING
 Supervised learning is when the model is trained on a labelled dataset.
 A model is prepared through a training process in which it is required to make predictions.
 Example problems are classification and regression.


Linear regression
• Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task.
• Regression models a target prediction value based on independent variables.


While training the model we are given:
x: input training data (univariate: one input variable/parameter)
y: labels to data (supervised learning)
θ1: intercept
θ2: coefficient of x
Once we find the best θ1 and θ2 values, we get the best-fit line, as in the sketch below.
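
A minimal sketch of recovering θ1 and θ2 by least squares with NumPy (the training points are invented):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # x: input training data (univariate)
y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])  # y: labels to data

# Degree-1 least-squares fit; np.polyfit returns [slope, intercept]
theta2, theta1 = np.polyfit(x, y, 1)
print(f"best-fit line: y = {theta1:.2f} + {theta2:.2f} * x")
```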



UNSUPERVISED LEARNING
 In unsupervised learning, data is not labeled and does not have a known result.
 A model is prepared by deducing structures present in the input data.
 Example algorithms include the Apriori algorithm and k-Means.


Clustering
• Clustering is the task of dividing the population or data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups.
• Clustering algorithms:
• K-means clustering algorithm: the simplest unsupervised learning algorithm that solves the clustering problem.
• The k-means algorithm partitions n observations into k clusters, where each observation belongs to the cluster with the nearest mean serving as a prototype of the cluster (see the sketch below).
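
A minimal k-means sketch with NumPy (Lloyd's algorithm) on two synthetic blobs; k, the data and the fixed iteration count are illustrative choices, and a library implementation would also handle empty clusters and convergence checks.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic blobs of points in 2-D
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(3.0, 0.5, (50, 2))])

k = 2
centroids = X[rng.choice(len(X), k, replace=False)]  # random initial prototypes

for _ in range(10):
    # Assignment step: each observation joins the cluster with the nearest mean
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each mean moves to the centroid of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(centroids)  # should land near (0, 0) and (3, 3)
```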



APPLICATIONS OF CLUSTERING

• Marketing: It can be used to characterize and discover customer segments for marketing purposes.
• Biology: It can be used for classification among different species of plants and animals.
• Libraries: It is used to cluster different books on the basis of topics and information.
• Insurance: It is used to group customers and their policies and to identify frauds.


REINFORCEMENT LEARNING

• Reinforcement learning is the training of machine learning models to make a sequence of decisions.
• The agent learns to achieve a goal in an uncertain, potentially complex environment.


Artificial Intelligence vs Machine Learning vs Deep Learning


Intelligent Computing



Familiarization of Data Analytics



What is data?

• The quantities, characters or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical or mechanical recording media.
• Unstructured data: data in a form that cannot be used easily by a computer program.
• Semi-structured data: data that does not conform to a data model but has some structure.
• Structured data: data in organized form.


Why Now?



What is BIG DATA?
• Big Data is a phrase used to mean a massive volume of both structured and unstructured data that is difficult to process using traditional database and software techniques.
• Big Data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.
• Big Data are “high-volume”, “high-velocity” and/or “high-variety” information assets that require new forms of processing tools/software.
• It comprises large datasets that cannot be processed using traditional computing techniques, involving huge volume, high velocity and an extensible variety of data.


Types of Data

STRUCTURED DATA
• Data that can be processed, stored and retrieved in a fixed format.
• It refers to highly organized information that can be readily and seamlessly stored and accessed from a database by a simple search engine.
• For instance, the employee table in an employee database is structured data: the employee details, their job positions, their salary, etc. will be accessed in an organized format.
• Sources: relational databases; flat files; legacy databases.

UNSTRUCTURED DATA
• It refers to data that lacks any specific form or structure. This makes it very difficult and time-consuming to process and analyze unstructured data.
• Email is an example.
• Sources: text both internal and external to an organization; social media; mobile data.

SEMI-STRUCTURED DATA
• Semi-structured data pertains to data containing both structured and unstructured elements.
• To be precise, it refers to data that, although not classified under a particular repository, contains vital information or tags that segregate individual elements within the data.
• CSV files are an example.
• Sources: data exchange formats like JSON.


How is the data Generated?
• Big Data is not new; it existed even before the term was invented.
• But it became important only after the rise of social media in 2008.
• The data can be generated by humans, machines or a human-machine combination.
• Data can be generated anywhere, anytime, wherever any information can be generated and stored, in any format. The generated data can be from hospitals, industries, militaries or anywhere else.
• The generated data can be broadly classified on the basis of its source:
i. Machine generated
ii. Human generated
iii. Organization generated
MACHINE GENERATED
• It is the biggest source of Big Data: data generated from real-time sensors in industrial machinery and vehicles.
• Data comes from various sensors, cameras, satellites, log files, bioinformatics, activity trackers, personal health-care trackers and many other sensing devices.
• Examples: submarines, activity trackers.

HUMAN GENERATED
• The vast amount of social media data such as status updates, tweets, photos, videos, etc. The data generated in this category is normally highly unstructured.
• People are generating large amounts of data on social networking sites, online photo-sharing sites and video-sharing sites.
• Large-scale data is generated using blogging sites, email, mobile text messages and personal documents. Most of this data is text.

ORGANIZATION GENERATED
• Organization-generated data is highly structured in nature and trustworthy.
• Organizations store data for current and future use, as well as for analysis of the past.
• Consider an example of an organization that collects sales transactions. These transaction records can be used to detect correlated products, estimate demand and capture illegal activity. Using proper analytics, the organization can build inventories to match predicted growth and demand.


Different Sources of DATA GENERATION
• Sensors/meters and activity records from electronic devices
• Internet data: social interaction
• Primary data: surveys, experiments, observations
• Location data: mobile data
• Secondary data: business transactions. Data produced as a result of business activities can be recorded in structured or unstructured databases. When recorded in structured databases, the most common problem in analyzing that information and getting statistical indicators is the big volume of information and the periodicity of its production, because sometimes this data is produced at a very fast pace; thousands of records can be produced in a second when big companies like supermarket chains are recording their sales.
• Device data


Elements of Big Data
• Volume
• Velocity
• Variety
• Veracity
• Value
• Valence

5 V’s of BIG DATA


Significance of BIG DATA
1. Cost Savings
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics can easily identify new sources of data, which helps businesses analyze data immediately.
3. Understand the Market Conditions: By analyzing big data you can get a better understanding of current market conditions. For example, by analyzing customers’ purchasing behavior, a company can find out which products sell the most and produce products according to this trend.
4. Control Online Reputation: Big data tools can do sentiment analysis. Therefore, you can get feedback about who is saying what about your company.
5. To Boost Customer Acquisition and Retention: The customer is the most important asset any business depends on. No single business can claim success without first establishing a solid customer base.
6. To Solve Advertisers’ Problems and Offer Marketing Insights: Big data analytics can help change all business operations. This includes the ability to match customer expectations, change the company’s product line and, of course, ensure that marketing campaigns are powerful.
7. As a Driver of Innovation and Product Development: Another huge advantage of big data is the ability to help companies innovate and redevelop their products.


Big Data Analytics

• What is Big Data Analytics? Analyzing large volumes of data, or big data.
• What Big Data Analytics isn’t?


Classification of Analytics

First school of thought:
1. Basic analytics: slicing and dicing of data.
2. Operational analytics: enterprise business processes.
3. Advanced analytics: forecasting the future.
4. Monetized analytics: business revenue.

Second school of thought: Analytics 1.0, 2.0, 3.0


Top Challenges in Big Data Analytics



What is Data?

• A collection of data objects and their attributes
• An attribute is a property or characteristic of an object
• Examples: eye color of a person, temperature, etc.
• Attribute is also known as variable, field, characteristic, or feature
• A collection of attributes describes an object
• Object is also known as record, point, case, sample, entity, or instance

Attribute Values
• Attribute values are numbers or symbols assigned to an attribute
• Distinction between attributes and attribute values
• The same attribute can be mapped to different attribute values
• Example: height can be measured in feet or meters
• Different attributes can be mapped to the same set of values
• Example: attribute values for ID and age are integers, but the properties of the attribute values can be different: ID has no limit, while age has a maximum and minimum value


Types of Attributes
• There are different types of attributes
• Nominal
• Examples: ID numbers, eye color, zip codes
• Ordinal
• Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in
{tall, medium, short}
• Interval
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
• Ratio
• Examples: temperature in Kelvin, length, time, counts



Properties of Attribute Values
• The type of an attribute depends on which of the following properties it possesses:
• Distinctness: =, ≠
• Order: <, >
• Addition: +, -
• Multiplication: *, /

• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties
Attribute Type / Description / Examples / Operations

• Nominal: The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another (=, ≠). Examples: zip codes, employee ID numbers, eye color, sex: {male, female}. Operations: mode, entropy, contingency correlation, χ² test.
• Ordinal: The values of an ordinal attribute provide enough information to order objects (<, >). Examples: hardness of minerals, {good, better, best}, grades, street numbers. Operations: median, percentiles, rank correlation, run tests, sign tests.
• Interval: For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists (+, -). Examples: calendar dates, temperature in Celsius or Fahrenheit. Operations: mean, standard deviation, Pearson's correlation, t and F tests.
• Ratio: For ratio variables, both differences and ratios are meaningful (*, /). Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current. Operations: geometric mean, harmonic mean, percent variation.
Statistical Framework
• Hypothesis class - The hypothesis class H from which we believe C is drawn, namely, the set of rectangles.
• Hypothesis - The learning algorithm then finds the particular hypothesis, h ∈ H, to approximate C as closely as possible. Though the expert defines this hypothesis class, the values of the parameters are not known; that is, though we choose H, we do not know which particular h ∈ H is equal, or closest, to C. But once we restrict our attention to this hypothesis class, learning the class reduces to the easier problem of finding the four parameters that define h. The aim is to find h ∈ H that is as similar as possible to C. Let us say the hypothesis h makes a prediction for an instance x such that h(x) = 1 if h classifies x as a positive example, and h(x) = 0 if h classifies x as a negative example.
Statistical Framework
• Empirical Error - The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X. The error of hypothesis h given the training set X is E(h|X) = Σ_{t=1}^{N} 1(h(xᵗ) ≠ rᵗ), where 1(a ≠ b) is 1 if a ≠ b and is 0 if a = b.
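
A small sketch of computing E(h|X) for invented labels and predictions:

```python
import numpy as np

r = np.array([1, 0, 1, 1, 0])     # required outputs r^t (invented)
h_x = np.array([1, 0, 0, 1, 0])   # predictions h(x^t) of some hypothesis h

# E(h|X) = sum over t of 1(h(x^t) != r^t)
errors = int(np.sum(h_x != r))
print("empirical error:", errors, "mismatches out of", len(r))
```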
• Generalization - It is the concept of how well our hypothesis will correctly classify future examples that are not part of the training set; how well a model trained on the training set predicts the right output for new instances is called generalization. For the best generalization, we should match the complexity of the hypothesis class H with the complexity of the function underlying the data.


Statistical Framework
• Most Specific Hypothesis - It is denoted by S, and is the tightest rectangle that includes all the positive examples and none of the negative examples. This gives us one hypothesis, h = S, as our induced class. Note that the actual class C may be larger than S but is never smaller.
• Most General Hypothesis - It is denoted by G, and is the largest rectangle we can draw that includes all the positive examples and none of the negative examples.
• Version Space - Any h ∈ H between S and G is a valid hypothesis with no error, said to be consistent with the training set, and such h make up the version space. Given another training set, S, G, the version space, the parameters and thus the learned hypothesis h can be different.
• Doubt - In some applications, a wrong decision may be very costly, and in such a case we can say that any instance that falls between S and G is a case of doubt, which we cannot label with certainty due to lack of data. In such a case, the system rejects the instance and defers the decision to a human expert.


Statistical Framework
• Reject - When more than one hypothesis is true (value is 1), we cannot choose a class; this is the case of doubt, and the classifier rejects such cases.
• VC Dimension - The maximum number of points that can be shattered by H is called the Vapnik-Chervonenkis (VC) dimension of H, denoted VC(H), and measures the capacity of H.
• Confidence Probability - The probability that the model's predicted value matches the observed outcome (i.e., is not misclassified), denoted by 1 - δ.


Statistical Framework
• Margin - Given X, we can find S, or G, or any h from the version space and use it as our hypothesis h. It seems intuitive to choose h halfway between S and G; this is to increase the margin, which is the distance between the boundary and the instances closest to it. For our error function to have a minimum at the h with the maximum margin, we should use an error (loss) function which not only checks whether an instance is on the correct side of the boundary but also how far away it is. That is, instead of h(x) that returns 0/1, we need a hypothesis that returns a value carrying a measure of the distance to the boundary, and we need a loss function which uses it, different from 1(·) that checks for equality.
Statistical Framework
• Noise - Noise is any unwanted anomaly in the data; due to noise, the class may be more difficult to learn and zero error may be infeasible with a simple hypothesis class.
• Ill-Posed Problem - After seeing N example cases, there remain 2^(2^d - N) possible functions. This is an example of an ill-posed problem, where the data by itself is not sufficient to find a unique solution.
• Inductive Bias - The set of assumptions we make to have learning possible is called the inductive bias of the learning algorithm.
• Model Selection - Learning is not possible without inductive bias, and now the question is how to choose the right bias. This is called model selection, which is choosing between possible H. In answering this question, we should remember that the aim of machine learning is rarely to replicate the training data but to predict new cases. That is, we would like to be able to generate the right output for an input instance outside the training set, one for which the correct output is not given in the training set.
• Decision Boundary - A decision boundary or decision surface is a hypersurface that partitions the underlying vector space into two sets, one for each class. The classifier will classify all the points on one side of the decision boundary as belonging to one class and all those on the other side as belonging to the other class. If the decision surface is a hyperplane, then the classification problem is linear, and the classes are linearly separable.


Statistical Framework
• Underfitting - If H is less complex than the function, we have underfitting, for example, when trying to fit a line to data sampled from a third-order polynomial. In such a case, as we increase the complexity, the training error decreases. But if we have an H that is too complex, the data is not enough to constrain it and we may end up with a bad hypothesis, h ∈ H, for example, when fitting two rectangles to data sampled from one rectangle.
• Overfitting - If there is noise, an overcomplex hypothesis may learn not only the underlying function but also the noise in the data, and may make a bad fit. This is called overfitting. In such a case, having more training data helps, but only up to a certain point.
• Appropriate Fitting - It is the point just before the error on the test dataset starts to increase, where the model has good skill on both the training dataset and the unseen test dataset.
Statistical Framework
• Triple Trade-Off - Given a training set and H, we can find the h ∈ H that has the minimum training error, but if H is not chosen well, no matter which h ∈ H we pick, we will not have good generalization. In all learning algorithms that are trained from example data, there is a trade-off between three factors:
• the complexity of the hypothesis we fit to the data, namely, the capacity of the hypothesis class;
• the amount of training data; and
• the generalization error on new examples.


Data Quality
• What kinds of data quality problems arise?
• How can we detect problems with the data? (data cleaning)
• What can we do about these problems?

• Examples of data quality problems:
• Noise and outliers
• Missing values
• Duplicate data


Measurement and Data Collection Issues
• We define measurement and data collection errors and then consider a variety of problems that involve measurement error: noise, artifacts, bias, precision, and accuracy.
Measurement and Data Collection Errors
• The term measurement error refers to any problem resulting from the measurement process. A common
problem is that the value recorded differs from the true value to some extent. For continuous attributes, the
numerical difference of the measured and true value is called the error.
Noise and Artifacts
• Noise is the random component of a measurement error. It may involve the distortion of a value or the
addition of spurious objects.
• Data errors may be the result of a more deterministic phenomenon, such as a streak in the same place on a set
of photographs. Such deterministic distortions of the data are often referred to as artifacts.

Contd…
Precision, Bias, and Accuracy
• Precision: The closeness of repeated measurements (of the same quantity) to one another. (In classification, precision = TP / (TP + FP).)
• Recall: TP / (TP + FN), i.e., true positives over all actual positive values.
• Bias: A systematic variation of measurements from the quantity being measured.
• Precision is often measured by the standard deviation of a set of values, while bias is measured by taking the
difference between the mean of the set of values and the known value of the quantity being measured. Bias
can only be determined for objects whose measured quantity is known by means external to the current
situation.
• Accuracy: The closeness of measurements to the true value of the quantity being measured.
• Accuracy depends on precision and bias, but since it is a general concept there is no specific formula for
accuracy in terms of these two quantities.
• One important aspect of accuracy is the use of significant digits. The goal is to use only as many digits to
represent the result of a measurement or calculation as are justified by the precision of the data.
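
A small sketch of the measurement-quality definitions above: precision as the standard deviation of repeated measurements, and bias as the difference between their mean and the known true value (all numbers invented):

```python
import numpy as np

true_value = 10.0                                        # known quantity (invented)
measurements = np.array([10.2, 9.9, 10.1, 10.3, 10.0])   # repeated measurements

precision = measurements.std(ddof=1)            # spread of the repeated measurements
bias = measurements.mean() - true_value         # systematic offset from the true value
print("precision (std dev):", precision, " bias:", bias)
```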



Noise
• Noise refers to modification of original values
• Examples: distortion of a person’s voice when talking on a poor phone and
“snow” on television screen

[Figure: two sine waves, and the same two sine waves plus noise]
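
A sketch of the figure's idea, generating two summed sine waves and the same signal with additive random noise (NumPy only; frequencies and noise level are arbitrary):

```python
import numpy as np

t = np.linspace(0.0, 1.0, 500)
# Two sine waves, as in the figure (frequencies chosen arbitrarily)
clean = np.sin(2 * np.pi * 5 * t) + np.sin(2 * np.pi * 12 * t)
# The same signal distorted by additive random noise
noisy = clean + np.random.default_rng(0).normal(0.0, 0.5, t.shape)
print("std of added noise:", (noisy - clean).std())
```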
Outliers
• Outliers are data objects with characteristics that are considerably
different than most of the other data objects in the data set



Dimensionality Reduction

• There are a variety of benefits to dimensionality reduction. A key benefit is that many data mining algorithms work better if the dimensionality (the number of attributes in the data) is lower. This is partly because dimensionality reduction can eliminate irrelevant features and reduce noise, and partly because of the curse of dimensionality.
• Another benefit is that a reduction of dimensionality can lead to a more understandable model, because the model may involve fewer attributes. Also, dimensionality reduction may allow the data to be more easily visualized.
• Even if dimensionality reduction doesn't reduce the data to two or three dimensions, data is often visualized by looking at pairs or triplets of attributes, and the number of such combinations is greatly reduced. Finally, the amount of time and memory required by the data mining algorithm is reduced with a reduction in dimensionality.
• The term dimensionality reduction is often reserved for those techniques that reduce the dimensionality of a data set by creating new attributes that are a combination of the old attributes. The reduction of dimensionality by selecting new attributes that are a subset of the old is known as feature subset selection or feature selection.


Dimensionality Reduction

• Purpose:
• Avoid the curse of dimensionality
• Reduce the amount of time and memory required by data mining algorithms
• Allow data to be more easily visualized
• May help to eliminate irrelevant features or reduce noise

• Techniques:
• Principal Component Analysis (see the sketch below)
• Others: supervised and non-linear techniques
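
A minimal PCA sketch via SVD on mean-centered synthetic data, projecting 3-D points onto the top two principal components (PCA itself is covered in Unit 4; this is only an illustrative preview):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 3-D data whose third direction carries almost no variance
X = rng.normal(size=(100, 3)) * np.array([2.0, 1.0, 0.1])

Xc = X - X.mean(axis=0)                        # center each attribute
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X2 = Xc @ Vt[:2].T                             # project onto the top 2 components
print("variance kept by 2 components:", (S[:2] ** 2).sum() / (S ** 2).sum())
```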



Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the space that it occupies.
• Definitions of density and distance between points, which are critical for clustering and outlier detection, become less meaningful.
• "Less is more": the curse of dimensionality (Bellman, 1961).


Feature Subset Selection

• Another way to reduce dimensionality of data

• Redundant features
• duplicate much or all of the information contained in one or more other
attributes
• Example: purchase price of a product and the amount of sales tax paid

• Irrelevant features
• contain no information that is useful for the data mining task at hand
• Example: students' ID is often irrelevant to the task of predicting students'
GPA
Feature Subset Selection
• Techniques:
• Brute-force approach: (next class)
• Try all possible feature subsets as input to data mining algorithm
• Embedded approaches:
• Feature selection occurs naturally as part of the data mining
algorithm
• Filter approaches:
• Features are selected before the data mining algorithm is run (see the sketch after this list)
• Wrapper approaches:
• Use the data mining algorithm as a black box to find best subset of
attributes
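
A small sketch of a filter approach: rank features by absolute correlation with the target before any mining algorithm runs. The data is synthetic, and only features 0 and 3 matter by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only features 0 and 3 actually influence the target
y = 3 * X[:, 0] - 2 * X[:, 3] + rng.normal(size=200)

# Score each feature by |correlation with y|, then keep the top 2
scores = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
keep = np.argsort(scores)[::-1][:2]
print("selected feature indices:", sorted(keep.tolist()))
```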



Issues in Machine Learning
• 1. Understanding Which Processes Need Automation - It is becoming increasingly difficult to evaluate which problems you're seeking to solve. The easiest processes to automate are the ones that are done manually every day with no variable output. Complicated processes require further inspection before automation. While Machine Learning can definitely help automate some processes, not all automation problems need Machine Learning.

• 2. Lack of Quality Data - While enhancing algorithms often consumes most of the time of developers in AI, data quality is essential for the algorithms to function as intended. Noisy, dirty and incomplete data are the problems in ideal Machine Learning; the lack of good data is a core issue.


Issues in Machine Learning
• 3. Inadequate Infrastructure - Machine Learning requires vast data-churning capabilities. Legacy/normal systems often can't handle the workload and buckle under pressure.
• 4. Implementation - Organizations often have analytics engines working with them by the time they choose to upgrade to Machine Learning. Integrating newer Machine Learning methodologies into existing methodologies is a complicated task. Maintaining proper interpretation and documentation goes a long way toward easing implementation.
• 5. Lack of Skilled Resources - Deep analytics and Machine Learning in their current forms are still new technologies. Thus, there is a shortage of skilled employees available to manage and develop analytical content for Machine Learning. Data scientists often need a combination of domain experience as well as in-depth knowledge of science, technology, and mathematics.
