Lecture 2 Introduction

INTRODUCTION TO
MACHINE LEARNING
Pooja Vashisth
What Is Machine Learning?
 Machine learning is defined as an automated
process that extracts patterns from data. To build
the models used in predictive data analytics
applications, we use supervised machine learning.
Supervised machine learning techniques
automatically learn a model of the relationship
between a set of descriptive features and a target
feature based on a set of historical examples, or
instances.
Dataset
 Instance
 Training
 Testing
 Machine learning algorithms automate the process
of learning a model that captures the relationship
between the descriptive features and the target
feature in a dataset.
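The terms above (dataset, instance, training, testing) can be made concrete with a small sketch. It is only an illustration under stated assumptions: the loan-style dataset, the feature names, and the single-threshold "model" are all hypothetical and are not taken from the lecture.

```python
import random

# Hypothetical historical examples (instances): two descriptive features
# ("age", "income") and one target feature ("defaulted").
DATASET = [
    {"age": 22, "income": 15000, "defaulted": True},
    {"age": 25, "income": 18000, "defaulted": True},
    {"age": 31, "income": 21000, "defaulted": True},
    {"age": 38, "income": 24000, "defaulted": True},
    {"age": 29, "income": 35000, "defaulted": False},
    {"age": 44, "income": 42000, "defaulted": False},
    {"age": 52, "income": 58000, "defaulted": False},
    {"age": 47, "income": 61000, "defaulted": False},
]


def split_dataset(instances, test_fraction=0.25, seed=0):
    """Shuffle a copy of the instances and split it into training and testing sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict(threshold, instance):
    """The model: predict a default when income is at or below the threshold."""
    return instance["income"] <= threshold


def accuracy(threshold, instances):
    """Fraction of instances whose target the model predicts correctly."""
    correct = sum(predict(threshold, i) == i["defaulted"] for i in instances)
    return correct / len(instances)


def learn_threshold(training):
    """Learning step: pick the income threshold that fits the training set best."""
    candidates = sorted(i["income"] for i in training)
    return max(candidates, key=lambda t: accuracy(t, training))


training_set, testing_set = split_dataset(DATASET)
model = learn_threshold(training_set)
train_accuracy = accuracy(model, training_set)
test_accuracy = accuracy(model, testing_set)
print("threshold:", model, "train:", train_accuracy, "test:", test_accuracy)
```

The model is learned from the training set only; the testing set is held back so that the model can be checked on instances it has not seen.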
Machine learning is an ill-posed
problem
 Because a single consistent model cannot be found
based on the sample training dataset alone, we say
that machine learning is fundamentally an ill-posed
problem.
 The set of assumptions that defines the model selection
criteria of a machine learning algorithm is known as the
inductive bias of the machine learning algorithm.
There are two types of inductive bias that a machine
learning algorithm can use: a restriction bias and a
preference bias. A restriction bias constrains the set of
models that the algorithm will consider during the
learning process. A preference bias guides the learning
algorithm to prefer certain models over others.
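A small sketch can show why the training data alone does not pin down a single model, and how the two kinds of bias narrow the choice. Everything here is an assumption made for illustration: three binary descriptive features (a, b, c), five hypothetical training instances, a restriction to single-feature models, and a preference for the model that predicts the fewest positives.

```python
from itertools import product

# All 8 possible instances over three binary descriptive features (a, b, c).
inputs = list(product([0, 1], repeat=3))

# Hypothetical training set: 5 of the 8 possible instances, with their targets.
training = {
    (0, 0, 0): 0,
    (0, 1, 1): 0,
    (1, 0, 0): 1,
    (1, 1, 0): 1,
    (0, 0, 1): 0,
}

# Every possible model assigns one label to each possible instance: 2**8 models.
all_models = [dict(zip(inputs, labels))
              for labels in product([0, 1], repeat=len(inputs))]


def is_consistent(model):
    """A model is consistent if it reproduces every training target."""
    return all(model[x] == y for x, y in training.items())


# Without any bias, several models agree with the training data
# but disagree with each other on the unseen instances.
consistent = [m for m in all_models if is_consistent(m)]

# Restriction bias (assumed): only consider models that copy or negate one feature.
restricted = {}
for idx, name in enumerate("abc"):
    restricted[name] = {x: x[idx] for x in inputs}
    restricted["not " + name] = {x: 1 - x[idx] for x in inputs}
consistent_restricted = [n for n, m in restricted.items() if is_consistent(m)]

# Preference bias (assumed): among consistent models, prefer the one that
# predicts the fewest positives overall.
preferred = min(consistent, key=lambda m: sum(m.values()))

print(len(all_models), len(consistent), consistent_restricted)
```

The restriction bias shrinks the set of candidate models before learning starts, while the preference bias ranks the candidates that remain.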
Problems
 There are two kinds of mistakes that an
inappropriate inductive bias can lead to:
underfitting and overfitting.
 No Free Lunch Theorem
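Underfitting and overfitting can be seen by comparing training and testing error for models with different biases. The sketch below is an illustration only: the data (a straight line with alternating noise), the constant model, the memorising nearest-neighbour model, and the least-squares line are all assumptions chosen to make the effect visible.

```python
# Hypothetical training data: y = 2*x plus alternating noise of +1 / -1.
train_x = [0, 1, 2, 3, 4, 5]
train_y = [1, 1, 5, 5, 9, 9]
# Hypothetical testing data drawn from the noise-free relationship y = 2*x.
test_x = [0.4, 1.4, 2.4, 3.4, 4.4]
test_y = [2 * x for x in test_x]


def mse(model, xs, ys):
    """Mean squared error of a model over a set of instances."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def constant_model(x):
    """Too simple (underfits): ignores x and always predicts the mean target."""
    return mean_y


def memorising_model(x):
    """Too flexible (overfits): returns the target of the nearest training instance."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# Least-squares straight line fitted to the training data.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def linear_model(x):
    """A bias that matches the data: a straight line."""
    return intercept + slope * x


for name, model in [("constant", constant_model),
                    ("memorising", memorising_model),
                    ("linear", linear_model)]:
    print(name, mse(model, train_x, train_y), mse(model, test_x, test_y))
```

The constant model has high error on both sets, the memorising model has zero training error but a worse testing error than the line, and the line does well on both.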
The Predictive Data Analytics Project
Lifecycle: CRISP-DM
 Business Understanding
 Data Understanding
 Data Preparation: Building predictive data
analytics models requires specific kinds of data,
organized in a specific kind of structure known as
an analytics base table (ABT).
 Modeling
 Evaluation
 Deployment
Predictive Data Analytics Tools
 The tools from IBM and SAS are enterprise-wide
solutions that integrate with those companies' other
offerings.
 An interesting alternative to using an
application-based solution for building predictive
data analytics models is to use a programming
language. Two of the most commonly used programming
languages for predictive data analytics are R and Python.
Data to Insights to Decisions
 Predictive data analytics projects are not handed to
data analytics practitioners fully formed.
 A key step, then, in any data analytics project is to
understand the business problem that the
organization wants to solve and, based on this, to
determine the kind of insight that a predictive
analytics model can provide to help the
organization address this problem. This defines the
analytics solution that the analytics practitioner will
set out to build using machine learning. Defining the
analytics solution is the most important task in the
Business Understanding phase of the CRISP-DM
process.
Key Questions
 What is the business problem? What are the goals
that the business wants to achieve?
 How does the business currently work?
 In what ways could a predictive analytics model
help to address the business problem?
 2.1.1 Case Study: Motor Insurance Fraud (from text)

Assessing Feasibility
 Is the data required by the solution available, or could
it be made available?
 The first question addresses data availability.
 What is the capacity of the business to utilize the
insights that the analytics solution will provide?
 The second issue affecting the feasibility of an analytics
solution is the ability of the business to utilize the insight that
the solution provides. If a business is required to drastically
revise all its processes to take advantage of the insights
that can be garnered from a predictive model, the business
may not be ready to do this no matter how good the model
is.
Designing the Analytics Base Table
 The basic structure in which we capture these
historical datasets is the analytics base table (ABT),
a schematic of which is shown in Table 2.1. An
analytics base table is a simple, flat, tabular data
structure made up of rows and columns. The columns
are divided into a set of descriptive features and a
single target feature. Each row contains a value for
each descriptive feature and the target feature and
represents an instance about which a prediction can
be made.
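The ABT structure described above can be sketched as a list of rows with a fixed set of columns. The column names and values below are hypothetical, loosely modelled on an insurance claim scenario, and the helper functions are illustrative rather than part of any library.

```python
# Hypothetical column layout: several descriptive features and one target feature.
DESCRIPTIVE_FEATURES = ["claim_amount", "num_claimants", "injury_type"]
TARGET_FEATURE = "fraud"

# Each row is one instance about which a prediction can be made.
abt = [
    {"claim_amount": 1625, "num_claimants": 2, "injury_type": "soft tissue", "fraud": True},
    {"claim_amount": 15028, "num_claimants": 1, "injury_type": "back", "fraud": False},
    {"claim_amount": 2200, "num_claimants": 1, "injury_type": "broken limb", "fraud": False},
]


def validate_abt(rows, descriptive, target):
    """Check that every row has a value for each descriptive feature and the target."""
    expected = set(descriptive) | {target}
    for number, row in enumerate(rows):
        if set(row) != expected:
            raise ValueError(f"row {number} does not match the ABT columns")


def split_features_and_target(rows, descriptive, target):
    """Separate the descriptive feature values from the target values."""
    features = [[row[name] for name in descriptive] for row in rows]
    targets = [row[target] for row in rows]
    return features, targets


validate_abt(abt, DESCRIPTIVE_FEATURES, TARGET_FEATURE)
features, targets = split_features_and_target(abt, DESCRIPTIVE_FEATURES, TARGET_FEATURE)
print(features)
print(targets)
```

Keeping the table flat, with one target column, is what lets a supervised learning algorithm consume it directly.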
Analytics Base Table
In designing an ABT, the first decision an analytics practitioner needs to make is on the
prediction subject for the model they are trying to build. The prediction subject defines
the basic level at which predictions are made, and each row in the ABT will represent
one instance of the prediction subject—the phrase one-row-per-subject is often used to
describe this structure.
 The actual process for determining domain concepts
is essentially one of knowledge elicitation—
attempting to extract from domain experts the
knowledge about the scenario we are trying to
model.
Designing and Implementing Features
 Data availability
 Timing
 Longevity
Different Types of Data
 Numeric: True numeric values that allow arithmetic
operations (e.g., price, age)
 Interval: Values that allow ordering and subtraction, but do
not allow other arithmetic operations (e.g., date, time)
 Ordinal: Values that allow ordering but do not permit
arithmetic (e.g., size measured as small, medium, or large)
 Categorical: A finite set of values that cannot be ordered
and allow no arithmetic (e.g., country, product type)
 Binary: A set of just two values (e.g., gender)
 Textual: Free-form, usually short, text data (e.g., name,
address)
 We often reduce this categorization to just two data
types: continuous (encompassing the numeric and
interval types), and categorical (encompassing the
categorical, ordinal, binary, and textual types).
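The reduction from six data types to two can be written down as a simple lookup. The feature names below are hypothetical examples matching the ones in the list; the mapping itself follows the categorization stated above.

```python
# The six data types from the list above, reduced to the two broad types.
REDUCED_TYPE = {
    "numeric": "continuous",
    "interval": "continuous",
    "ordinal": "categorical",
    "categorical": "categorical",
    "binary": "categorical",
    "textual": "categorical",
}

# Hypothetical features, each tagged with one of the six data types.
feature_types = {
    "price": "numeric",
    "purchase_date": "interval",
    "size": "ordinal",
    "country": "categorical",
    "gender": "binary",
    "address": "textual",
}


def reduce_types(features):
    """Map each feature's detailed data type to continuous or categorical."""
    return {name: REDUCED_TYPE[data_type] for name, data_type in features.items()}


reduced = reduce_types(feature_types)
continuous = sorted(name for name, kind in reduced.items() if kind == "continuous")
categorical = sorted(name for name, kind in reduced.items() if kind == "categorical")
print("continuous:", continuous)
print("categorical:", categorical)
```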
Different Types of Features
 The features in an ABT can be of two types:
 raw features or
 derived features
 Aggregate
 Flags
 Ratios
 Mappings
 For propensity modeling, there are two key
periods: the observation period, over which
descriptive features are calculated, and the
outcome period, over which the target feature is
calculated.
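The feature types and the two periods above can be combined in one sketch: descriptive features are computed from events in the observation period, and the target from events in the outcome period. The customers, purchases, period dates, thresholds, and feature names are all hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical periods for a propensity-to-buy model.
OBSERVATION = (date(2023, 1, 1), date(2023, 6, 30))
OUTCOME = (date(2023, 7, 1), date(2023, 9, 30))

# Hypothetical raw data: one record per customer, plus a purchase log.
customers = {
    "c1": {"age": 34, "monthly_income": 3000},
    "c2": {"age": 61, "monthly_income": 2000},
}
purchases = [
    ("c1", date(2023, 2, 10), 120.0),
    ("c1", date(2023, 5, 3), 80.0),
    ("c1", date(2023, 8, 15), 40.0),
    ("c2", date(2023, 3, 20), 50.0),
]


def in_period(day, period):
    """True if the day falls inside the (start, end) period, inclusive."""
    start, end = period
    return start <= day <= end


def age_band(age):
    """Mapping: convert a continuous age into a small set of categories."""
    if age < 30:
        return "young"
    return "middle" if age < 60 else "senior"


def build_row(customer_id):
    """Build one ABT row (one row per customer) from the raw data."""
    person = customers[customer_id]
    observed = [amount for cid, day, amount in purchases
                if cid == customer_id and in_period(day, OBSERVATION)]
    outcome = [amount for cid, day, amount in purchases
               if cid == customer_id and in_period(day, OUTCOME)]
    total_spend = sum(observed)
    return {
        "age": person["age"],                                      # raw feature
        "total_spend": total_spend,                                # aggregate
        "big_purchase": any(a > 100 for a in observed),            # flag
        "spend_to_income": total_spend / person["monthly_income"], # ratio
        "age_band": age_band(person["age"]),                       # mapping
        "purchased": len(outcome) > 0,                             # target
    }


abt = [build_row(customer_id) for customer_id in customers]
for row in abt:
    print(row)
```

Keeping the two periods separate ensures that the descriptive features only use information that would be available before the outcome is known.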
Legal Issues
 Anti-discrimination legislation
 Personal data
