
AI Fundamentals for Non-Data Scientists

Specific Machine Learning Methods: A Deep Dive

Kartik Hosanagar, Professor of Operations, Information and Decisions


A Closer Look at ML Methods

All different ways of approximating the relationship between X and y


• Logistic regression
• Decision trees and random forests
• Neural nets
Logistic Regression

• Used for binary classification, i.e. when outcomes can take only 1 of 2 values (yes/no, click/no click, healthy/sick)
• Among the most useful and popular methods (along with ordinary least squares) in statistics, data science, and academia
• Is it really machine learning?
• Early development dates to the 19th century
Logistic Regression

• Goal is to estimate the probability of a given outcome, conditional on some variables (x)
• Logit function constrains probabilities to between 0 and 1
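The formula on the original slide does not survive extraction; in standard notation, the logistic model being described is presumably:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$

Equivalently, the log-odds (logit) is linear in the variables: $\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$.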
Logistic Regression

• Equivalent to finding the “best-fit” line/plane that separates the data
• Example: How do age and income predict whether a customer will purchase your new product?
• Red dots: purchase
• Blue dots: no purchase
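A minimal sketch of the age/income example using scikit-learn (the data values here are made up for illustration):

```python
# Illustrative sketch (not from the slides): fitting the age/income
# purchase example with scikit-learn. Data values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [age, income in $k]; label: 1 = purchase, 0 = no purchase
X = np.array([[25, 40], [32, 55], [47, 90], [51, 120], [36, 60], [29, 45]])
y = np.array([0, 0, 1, 1, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# Predicted purchase probability for a 40-year-old earning $75k
print(model.predict_proba([[40, 75]])[0, 1])
```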
Decision Trees

• Easy-to-interpret models built by iteratively looking for the features in your data that are most predictive
• Example:
Output: Will it rain or not today?
Inputs: Temperature, wind, pressure
Decision Trees

Question: How do we decide which variable to split on at each branch? (e.g. consider temperature or pressure first?)
• Also, how do we select what value to split on?
• e.g. 70 degrees vs. 60 degrees

Mathematical answer: Select variables/splits that minimize the “entropy” of the dataset
• In simple terms, choose the variable/split that provides the most predictive power at each step
• A “greedy” algorithm looks for the best split at each step in the process
• Continue to split branches on different variables until a desired depth or data size limit is reached
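A sketch of the entropy calculation behind this split-scoring, assuming made-up rain labels and a hypothetical “temperature > 70” split:

```python
# Minimal sketch (not from the slides) of the entropy score a decision
# tree uses to evaluate candidate splits.
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical rain labels before and after splitting on "temperature > 70"
before = ["rain", "rain", "no", "no", "no", "rain"]
left   = ["rain", "rain", "rain"]   # temperature <= 70
right  = ["no", "no", "no"]         # temperature > 70

# Information gain = entropy reduction from the split; the greedy
# algorithm picks the variable/threshold with the largest gain.
gain = entropy(before) - (len(left) / len(before)) * entropy(left) \
                       - (len(right) / len(before)) * entropy(right)
print(gain)  # 1.0 bit here: a perfect split
```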
Random Forests

• An “ensemble” algorithm that harnesses the power of many decision trees
• Very popular, powerful, and relatively simple ML algorithm
• Basic idea:
• Take many random samples of your dataset
• For each subset of data, train a decision tree (for each node in the tree, only use a random subset of features)
• To make the final prediction, treat each decision tree as a “vote” and choose the prediction with the most votes among all the trees
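This recipe maps directly onto scikit-learn’s implementation; a sketch with synthetic placeholder data:

```python
# Sketch of the random-forest recipe described above, using scikit-learn
# (assumed illustration; the dataset is synthetic).
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_samples=500, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees, each fit on a random sample
    max_features="sqrt",  # random subset of features considered at each node
    random_state=0,
)
forest.fit(X_train, y_train)

# Each tree "votes"; predict() returns the majority class
print(forest.predict(X_train[:5]))
```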
Random Forests
Neural Networks (NNs)

• Loosely inspired by biological neurons
• Neurons take input from other neurons, apply a transformation, and pass the signal on
• Single “neuron”: inputs are scaled by “weights” (the model’s “parameters”) and combined
• Deep neural net: many such neurons arranged in layers
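A toy sketch of the single neuron in the figure, with made-up input and weight values:

```python
# Toy sketch (not from the slides): one "neuron" computes a weighted
# sum of its inputs plus a bias, then applies an activation function.
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(weights, inputs) + bias   # weighted sum of inputs
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

x = np.array([0.5, -1.2, 3.0])   # signals from upstream neurons
w = np.array([0.8, 0.1, -0.4])   # the "weights" (learned parameters)
print(neuron(x, w, bias=0.2))    # signal passed on to the next layer
```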


Neural Network for Face Recognition

See https://www.youtube.com/watch?v=aircAruvnKk for a simple video tutorial


Neural Networks (NNs)

Layers of a deep neural network


• Input layer: Neurons correspond to the RGB values of all the pixels in the image
• Final layer has the outputs we expect (e.g. 3 neurons corresponding to
three different people who are going to be identified)
• Hidden layers in between

See https://www.youtube.com/watch?v=aircAruvnKk for a simple video tutorial


Neural Networks (NNs)

• NNs have become very successful in recent years


• Are often among the best algorithms in top ML competitions (especially
with images, audio, video, etc.)
• Why?
• Lots of parameters = ability to build very complex models
• Recent advances in computation (GPUs) and algorithms (backpropagation) have allowed for more layers (more complex models)
• NNs are hard to understand and interpret
• We know inputs and outputs, but the middle layers are merely numbers
from our point of view
• Lots of work is being done to “open up” black box models so we can
understand what they are doing
Other ML Methods

• Boosting
• Support vector machines (SVM)
• Neural nets - much more complicated and varied than covered here
• Many, many other regression techniques
• LASSO, Ridge, weighted regression, kernel regression
• Want to learn more?
• Statistics or Intro CS courses on machine learning
AI Fundamentals for Non-Data Scientists
Intro to Model Selection

Kartik Hosanagar, Professor of Operations, Information and Decisions


Model Selection

• For any prediction problem, there are many algorithms and methods
available - decision trees, random forests, neural networks, and more
• Model evaluation and selection is done by evaluating model performance
on a validation dataset
• Holdout validation: Partition available data into a training dataset and a
holdout; evaluate model performance on holdout
• Cross-validation: Create a number of partitions (validation datasets) from
the training dataset; fit model to the training dataset (sans the validation
data); evaluate model against each validation dataset; repeat with each
validation set and average results to obtain the cross-validation error
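Both strategies are one-liners in scikit-learn; a sketch with a placeholder model and synthetic data:

```python
# Sketch of holdout validation and cross-validation with scikit-learn
# (assumed illustration; the estimator and data are placeholders).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Holdout validation: fit on the training partition, score on the holdout
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_hold, y_hold))

# 5-fold cross-validation: average the score across the validation folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("cross-validation accuracy:", scores.mean())
```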
Data vs. Model

• Often Data > Methods


• Microsoft researchers (Banko and Brill) evaluated the performance of multiple models
for a language understanding task
• Varied size of training dataset (up to 1B
words)
• Among modern methods, performance differences between algorithms are relatively small compared to the differences between the same algorithm trained with more/less data

“Unreasonable effectiveness of data” – Peter Norvig (Google)


AI Fundamentals
Feature Engineering and Deep Learning Introduction

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


What About Unstructured Data?

• Big revolution in AI is ability to predict from “unstructured data”

What About Unstructured Data?

• When data are not structured, features have to be “engineered” from the
data
• A time-consuming and challenging process, and often requires domain
expertise
• One of the most difficult parts of the ML process, where data scientists
spend a lot of their time
• Can be as much an art as a science
Feature Engineering

Example: Features you might engineer from real estate pictures
• Take individual images and extract individual features
• Requires knowledge of real estate
• Requires access to a realtor and a software developer
• Good amount of guessing involved — very likely to miss critical features
AI Fundamentals
Deep Learning

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Deep Learning

• Working with unstructured data requires a feature engineering step to generate features for the model
• If the model’s error is too large or its performance is unacceptable, engineers try a new set of features or adjust the features used in the current model
• Domain expertise is still required
• What are the right features to extract?
• Can be especially hard when using unstructured data like images,
sounds, or essays
Example: X-Ray Diagnostics

Domain experts (e.g. radiologists) label the images
Deep Learning

• Eliminates the need for feature extraction


Why is Deep Learning a “Game Changer”?

• Feature engineering is expensive


• Requires domain expertise, is error-prone, and is uncertain
• Deep learning can lead to massive performance improvements, relative to
hand-coded features
• Computation is getting cheaper, making deep learning more feasible
• Going to let us predict from unstructured data (images, online reviews,
health data, audio, etc.) at scale
• For any prediction or classification task, deep learning substitutes more and more data (labeled examples) for expertise
Deep Learning Examples

• Image recognition
• Detecting fake news
• Detecting knockoffs from luxury products
AI Fundamentals
Evaluating ML Performance

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Evaluating ML Performance

• What should we tell the algorithm to optimize on?


• Loss/cost functions
• Accuracy
• Precision
• Recall
• Specificity
• Why are there so many?
Example: “Identify Fraudulent Credit Card Transactions”

Actual value Predicted value

Fraudulent Fraudulent
Legitimate Legitimate
Legitimate Legitimate
Legitimate Legitimate
Fraudulent Fraudulent
Fraudulent Legitimate
Legitimate Legitimate
Legitimate Fraudulent
Legitimate Legitimate
Legitimate Legitimate
Evaluating ML Performance

• Is the classifier doing a good job?


• Depends on what we care about
• Is it more costly to miss a fraudulent transaction?
• Is it very costly to put a valuable customer’s credit card on hold
accidentally?
• There are different costs and benefits of getting different types of labels
wrong in a prediction task
AI Fundamentals
How Deep Learning Works

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Unstructured Data

• Unstructured data can be formulated into raw digital form
• Data are then pre-processed to standardize them for the prediction task
• This data is passed into neural networks (modeled after neurons)
Neural Networks

• Data forms the input layer
• Engineers choose a loss/cost function to compare predictions against the training labels
• How close are we to predicting the right answers?
• At each layer, inputs are weighted, summed, and passed to the next layer
Backpropagation

• Tuning the network
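A highly simplified sketch of what “tuning the network” means: repeatedly adjust the weights against the gradient of the loss. This is a single sigmoid neuron with made-up data, not a full network:

```python
# Simplified sketch (not from the slides): one-neuron backpropagation.
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # input
y_true = 1.0                     # training label
w = np.array([0.1, 0.1, 0.1])    # initial weights
lr = 0.5                         # learning rate (a hyperparameter)

for _ in range(100):
    y_pred = 1.0 / (1.0 + np.exp(-np.dot(w, x)))              # forward pass
    loss = (y_pred - y_true) ** 2                              # squared-error loss
    grad = 2 * (y_pred - y_true) * y_pred * (1 - y_pred) * x   # chain rule
    w -= lr * grad                                             # step against the gradient

print(loss)  # loss shrinks as the weights are tuned
```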


Neural Networks

• Limited domain information embedded in the model


• Substituting computation for expert knowledge
• Great for tasks with a lack of domain understanding for feature extraction
Deep Learning

• What is the role of the engineer?


• Setting hyperparameter values — requires engineering knowledge, but less domain knowledge
• Epochs
• Batch size
• Learning rate
• Regularization
• Activation function
• Number of hidden layers
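A sketch of where these hyperparameters appear in code, assuming the Keras API and random placeholder data:

```python
# Sketch (assumed illustration): the hyperparameters above, in Keras.
import numpy as np
import tensorflow as tf

X_train = np.random.rand(100, 8)           # placeholder data
y_train = np.random.randint(0, 2, 100)     # placeholder binary labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",                              # activation function
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)), # regularization
    tf.keras.layers.Dense(64, activation="relu"),                             # number of hidden layers
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # learning rate
    loss="binary_crossentropy",
)

model.fit(X_train, y_train,
          epochs=20,      # passes over the training data
          batch_size=32)  # examples per gradient update
```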
Deep Learning

• Workflow changes — we only need examples


AI Fundamentals
Limitations of Deep Learning

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Limitations of Deep Learning

• Do you have plenty of varied training data available?


• Deep learning is going to require much more training data than other
approaches
• Also need a lot of variation in the data
• Do you have the necessary data storage and computational power
available?
• The hardware requirements of deep learning are much more significant
• Do you need insight into why a deep learning model made a specific
decision?
• Lack of interpretability (explainability)
AI Fundamentals
Common Loss Functions

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Accuracy

• Fraction of labels (answers) that the algorithm predicts correctly
• Accuracy = 0.70
• Classification error = 0.30

Actual value    Predicted value
Legitimate      Legitimate
Fraudulent      Legitimate
Fraudulent      Fraudulent
Legitimate      Legitimate
Legitimate      Fraudulent
Legitimate      Legitimate
Fraudulent      Fraudulent
Fraudulent      Legitimate
Legitimate      Legitimate
Legitimate      Legitimate
Precision

• What proportion of positive identifications were actually correct?
• If “fraudulent” is the positive class, what proportion of the ones we call fraudulent are actually fraudulent?
• Precision = 5/7 = 0.71

Actual value    Predicted value
Fraudulent      Fraudulent
Legitimate      Fraudulent
Legitimate      Legitimate
Fraudulent      Fraudulent
Fraudulent      Legitimate
Fraudulent      Fraudulent
Legitimate      Legitimate
Legitimate      Fraudulent
Fraudulent      Fraudulent
Fraudulent      Fraudulent
Sensitivity (Recall)

• How many relevant instances (fraudulent) did you catch?
• Recall = 5/6 ≈ 0.83

Actual value    Predicted value
Fraudulent      Fraudulent
Legitimate      Fraudulent
Legitimate      Legitimate
Fraudulent      Fraudulent
Fraudulent      Legitimate
Fraudulent      Fraudulent
Legitimate      Legitimate
Legitimate      Fraudulent
Fraudulent      Fraudulent
Fraudulent      Fraudulent
Specificity

• Proportion of legitimates correctly identified as such
• Specificity = 2/4 = 0.50

Actual value    Predicted value
Fraudulent      Fraudulent
Legitimate      Fraudulent
Legitimate      Legitimate
Fraudulent      Fraudulent
Fraudulent      Legitimate
Fraudulent      Fraudulent
Legitimate      Legitimate
Legitimate      Fraudulent
Fraudulent      Fraudulent
Fraudulent      Fraudulent
Definitions: Type I and Type II Error

True positive: Fraction of times we identify a Fraudulent as a Fraudulent
True negative: Fraction of times we identify a Legitimate as a Legitimate
False positive (Type I error): Fraction of times we identify a Legitimate as a Fraudulent
False negative (Type II error): Fraction of times we identify a Fraudulent as a Legitimate
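A sketch computing all four metrics from the 10-transaction table used in the precision/recall/specificity slides above, treating “Fraudulent” as the positive class:

```python
# Sketch: the four metrics from the 10-transaction example above.
actual    = ["F", "L", "L", "F", "F", "F", "L", "L", "F", "F"]
predicted = ["F", "F", "L", "F", "L", "F", "L", "F", "F", "F"]

tp = sum(a == "F" and p == "F" for a, p in zip(actual, predicted))  # 5
tn = sum(a == "L" and p == "L" for a, p in zip(actual, predicted))  # 2
fp = sum(a == "L" and p == "F" for a, p in zip(actual, predicted))  # 2 (Type I errors)
fn = sum(a == "F" and p == "L" for a, p in zip(actual, predicted))  # 1 (Type II error)

print("accuracy:   ", (tp + tn) / len(actual))   # 0.70
print("precision:  ", tp / (tp + fp))            # 5/7 ≈ 0.71
print("recall:     ", tp / (tp + fn))            # 5/6 ≈ 0.83
print("specificity:", tn / (tn + fp))            # 2/4 = 0.50
```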
Confusion Matrix
Receiver Operating Characteristic (ROC) Curve

• ROC curves suggest a tradeoff: as the decision threshold varies, the true positive rate (sensitivity) rises together with the false positive rate (1 − specificity)
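A sketch of plotting an ROC curve with scikit-learn; the labels and scores below are made up, standing in for a trained classifier’s predicted probabilities:

```python
# Sketch (assumed illustration): an ROC curve from predicted scores.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]   # 1 = Fraudulent
y_scores = [0.9, 0.6, 0.2, 0.8, 0.4, 0.7, 0.1, 0.55, 0.85, 0.75]

fpr, tpr, _ = roc_curve(y_true, y_scores)   # one point per threshold
plt.plot(fpr, tpr)
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.title(f"ROC curve, AUC = {roc_auc_score(y_true, y_scores):.2f}")
plt.show()
```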


AI Fundamentals
Tradeoffs Between Loss Functions

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Tradeoffs Between Loss Functions

• What are the relative costs of false negatives and false positives?
Tradeoffs Between Loss Functions

When might we want a sensitive test?


• Medical application — screening for a disease
• Want to make sure we don’t miss a person with the condition
• Even at the potential cost of falsely identifying some cases as having the
condition, even if they actually don’t
• Radar system detecting incoming aircraft
Tradeoffs Between Loss Functions

When might we want a precise test?


• Deciding when it’s okay for a car to take a left turn
• Want to be absolutely sure that it is okay to take a left before recommending the decision
• Even if it means we will miss some opportunities to take a left turn
• Identifying violations that have severe punishments (e.g. cheating that leads
to expulsion)
AI Fundamentals
How is Training Data Acquired?

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Training Data

• The data that the algorithm uses to learn the best mapping between the
inputs and the right predictions, or outputs
Training Data

Where does training data come from?


• Archival, or historical, data in the organization
• In many domains there are records of decisions that have been made
that can be used to train a model
• E.g. resume screening
• Human data labeling to generate training data
• Platforms for crowdsourcing the task of labeling data
• Using customers to label data
• Google and Gmail spam filtering
• Social networks
AI Fundamentals
The Over-Fitting Problem

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Key to ML Algorithms

• Training data is the key to building ML algorithms, but we care most about
performance on out-of-sample data
• The point is to predict outcomes where we don’t already know what is going
to happen
The Over-Fitting Problem

• Over-fitting is the danger that the model performs well on training data, but
not other data sets
• ML engineers try to avoid fitting the model to the point that it picks up noise
in the training data
• Trying to balance using the training data to build an accurate model, with
having a model that still performs well on out-of-sample data
• Example: Studying the test vs. studying the material
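A sketch of over-fitting in action, using synthetic noisy data: an unconstrained decision tree “studies the test” (memorizes the training data) yet generalizes worse than a simpler tree:

```python
# Sketch (assumed illustration): a deep tree memorizes noise, a shallow
# tree generalizes better on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, flip_y=0.2, random_state=0)  # noisy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)            # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("deep tree    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow tree train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
# Typical result: the deep tree scores ~1.0 on training data but worse
# than the shallow tree on the held-out test data.
```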
The Over-Fitting Problem

• The challenge is in capturing the relevant aspects of the model vs. capturing
the noise in the training data
• This is called the “bias-variance” tradeoff
The “Bias-Variance” Tradeoff

Example: Customer targeting


• Want to run a promotion where we target specific customers
• Have training data based on a small set of customers who we’ve run the
promotion on in the past
• We’d like to run a model that picks up the relevant attributes of those
customers that can be useful for predicting future customers who might be
responsive
• Could run the model to the point where it goes overboard and picks up
noise in the data
• E.g. first name “Julie” and promotion response
AI Fundamentals
Test Data

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Test Data

• Test data (also called a “hold-out sample”) is a data set that is not used to
train or build the model, but can be used to validate the model
• Validating performance on a data set that’s not used to build the model (test
data) helps ensure the model also works well on outside samples
Performance Tradeoffs with Training and Test Data
Where Does Test Data Come From?

• One common approach is for ML engineers to start with all of the data for
which they have labels, and then divide it up into training and test data (e.g.
conduct a 70/30 split)
• Example: Insurance data to predict accident likelihood
• Take all historical data and divide it up
• Everything up until the last 6 months is used as training data
• Everything from 6 months ago to the present day is used as test data
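A sketch of that time-based split with pandas; the file name and column names here are hypothetical:

```python
# Sketch (assumed illustration): chronological train/test split.
# "claims.csv" and its "date" column are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("claims.csv", parse_dates=["date"])

cutoff = df["date"].max() - pd.DateOffset(months=6)
train = df[df["date"] <= cutoff]   # everything up to the last 6 months
test  = df[df["date"] > cutoff]    # the most recent 6 months

print(len(train), len(test))
```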
AI Fundamentals
Examples of End-to-End AI Workflow

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


End-to-End AI Workflow

ML to identify disease in medical images


• Collect large amounts of data on medical diagnostic images with expert
decisions (e.g. from radiologists)
• These experts have labeled these images as indicating a condition or not
• Build the model on a training sample and evaluate performance on test samples
• The machine can be taught to predict with accuracy, given a medical image,
whether someone has a condition
• Key point — never had to ask anyone to describe what aspects of medical
images suggest a condition
• No need to talk to a medical expert/radiologist
Magic of Machine Learning

• Data and computation substitute for expertise


• Advantages
• Consistency
• Scale/speed
• Works really well for some tasks
• Faster/cheaper to build
