
AI Fundamentals for Non-Data Scientists

Specific Machine Learning Methods: A Deep Dive

Kartik Hosanagar, Professor of Operations, Information and Decisions


A Closer Look at ML Methods

All different ways of approximating the relationship between X and y


• Logistic regression
• Decision trees and random forests
• Neural nets
Logistic Regression

• Used for binary classification, i.e. when outcomes can take only 1 of 2 values (yes/no, click/no click, healthy/sick)
• Among the most useful and popular methods (along with ordinary least squares) in statistics, data science, and academia
• Is it really machine learning?
• Early development dates to the 19th century
Logistic Regression

• Goal is to estimate the probability of a given outcome, conditional on some variables (x)
• Logit function constrains probabilities to between 0 and 1
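The formula on the original slide does not survive extraction; in standard notation, the logistic model being described is presumably:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)}}$$

Equivalently, the log-odds (logit) is linear in the variables: $\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$.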
Logistic Regression

• Equivalent to finding the “best-fit” line/plane that separates the data
• Example: How do age and income predict whether a customer will purchase your new product?
• Red dots: purchase
• Blue dots: no purchase
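A minimal sketch of the age/income example using scikit-learn (the data values here are made up for illustration):

```python
# Illustrative sketch (not from the slides): fitting the age/income
# purchase example with scikit-learn. Data values are made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Features: [age, income in $k]; label: 1 = purchase, 0 = no purchase
X = np.array([[25, 40], [32, 55], [47, 90], [51, 120], [36, 60], [29, 45]])
y = np.array([0, 0, 1, 1, 1, 0])

model = LogisticRegression()
model.fit(X, y)

# Predicted purchase probability for a 40-year-old earning $75k
print(model.predict_proba([[40, 75]])[0, 1])
```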
Decision Trees

• Easy-to-interpret models built by iteratively looking for the features in your data that are most predictive
• Example:
Output: Will it rain or not today?
Inputs: Temperature, wind, pressure
Decision Trees

Question: How do we decide which variable to split on at each branch? (e.g. consider temperature or pressure first?)
• Also, how do we select what value to split on?
• e.g. 70 degrees vs. 60 degrees

Mathematical answer: Select variables/splits that minimize the “entropy” of the dataset
• In simple terms, choose the variable/split that provides the most predictive power at each step
• A “greedy” algorithm looks for the best split at each step in the process
• Continue to split branches on different variables until a desired depth or data size limit is reached
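A sketch of the entropy calculation behind this split-scoring, assuming made-up rain labels and a hypothetical “temperature > 70” split:

```python
# Minimal sketch (not from the slides) of the entropy score a decision
# tree uses to evaluate candidate splits.
import numpy as np

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical rain labels before and after splitting on "temperature > 70"
before = ["rain", "rain", "no", "no", "no", "rain"]
left   = ["rain", "rain", "rain"]   # temperature <= 70
right  = ["no", "no", "no"]         # temperature > 70

# Information gain = entropy reduction from the split; the greedy
# algorithm picks the variable/threshold with the largest gain.
gain = entropy(before) - (len(left) / len(before)) * entropy(left) \
                       - (len(right) / len(before)) * entropy(right)
print(gain)  # 1.0 bit here: a perfect split
```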
Random Forests

• An “ensemble” algorithm that harnesses the power of many decision trees
• Very popular, powerful, and relatively simple ML algorithm
• Basic idea:
• Take many random samples of your dataset
• For each subset of data, train a decision tree (for each node in the tree, only use a random subset of features)
• To make the final prediction, treat each decision tree as a “vote” and choose the prediction with the most votes among all the trees
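This recipe maps directly onto scikit-learn’s implementation; a sketch with synthetic placeholder data:

```python
# Sketch of the random-forest recipe described above, using scikit-learn
# (assumed illustration; the dataset is synthetic).
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X_train, y_train = make_classification(n_samples=500, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees, each fit on a random sample
    max_features="sqrt",  # random subset of features considered at each node
    random_state=0,
)
forest.fit(X_train, y_train)

# Each tree "votes"; predict() returns the majority class
print(forest.predict(X_train[:5]))
```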
Random Forests
Neural Networks (NNs)

• Loosely inspired by biological neurons
• Neurons take input from other neurons, apply a transformation, and pass the signal on
• Single “neuron”: inputs are scaled by “weights” (the model’s “parameters”) and combined
• Deep neural net: many such neurons arranged in layers
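A toy sketch of the single neuron in the figure, with made-up input and weight values:

```python
# Toy sketch (not from the slides): one "neuron" computes a weighted
# sum of its inputs plus a bias, then applies an activation function.
import numpy as np

def neuron(inputs, weights, bias):
    z = np.dot(weights, inputs) + bias   # weighted sum of inputs
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

x = np.array([0.5, -1.2, 3.0])   # signals from upstream neurons
w = np.array([0.8, 0.1, -0.4])   # the "weights" (learned parameters)
print(neuron(x, w, bias=0.2))    # signal passed on to the next layer
```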


Neural Network for Face Recognition

See https://www.youtube.com/watch?v=aircAruvnKk for a simple video tutorial


Neural Networks (NNs)

Layers of a deep neural network


• Input layer: Neurons correspond to the RGB values of all the pixels in the image
• Final layer has the outputs we expect (e.g. 3 neurons corresponding to
three different people who are going to be identified)
• Hidden layers in between

See https://www.youtube.com/watch?v=aircAruvnKk for a simple video tutorial


Neural Networks (NNs)

• NNs have become very successful in recent years


• Are often among the best algorithms in top ML competitions (especially
with images, audio, video, etc.)
• Why?
• Lots of parameters = ability to build very complex models
• Recent advances in computation (GPUs) and algorithms (backpropagation) have allowed for more layers (more complex models)
• NNs are hard to understand and interpret
• We know inputs and outputs, but the middle layers are merely numbers
from our point of view
• Lots of work is being done to “open up” black box models so we can
understand what they are doing
Other ML Methods

• Boosting
• Support vector machines (SVM)
• Neural nets - much more complicated and varied than covered here
• Many, many other regression techniques
• LASSO, Ridge, weighted regression, kernel regression
• Want to learn more?
• Statistics or Intro CS courses on machine learning
AI Fundamentals for Non-Data Scientists
Intro to Model Selection

Kartik Hosanagar, Professor of Operations, Information and Decisions


Model Selection

• For any prediction problem, there are many algorithms and methods
available - decision trees, random forests, neural networks, and more
• Model evaluation and selection is done by evaluating model performance
on a validation dataset
• Holdout validation: Partition available data into a training dataset and a
holdout; evaluate model performance on holdout
• Cross-validation: Create a number of partitions (validation datasets) from
the training dataset; fit model to the training dataset (sans the validation
data); evaluate model against each validation dataset; repeat with each
validation set and average results to obtain the cross-validation error
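Both strategies are one-liners in scikit-learn; a sketch with a placeholder model and synthetic data:

```python
# Sketch of holdout validation and cross-validation with scikit-learn
# (assumed illustration; the estimator and data are placeholders).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)

# Holdout validation: fit on the training partition, score on the holdout
X_train, X_hold, y_train, y_hold = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_hold, y_hold))

# 5-fold cross-validation: average the score across the validation folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("cross-validation accuracy:", scores.mean())
```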
Data vs. Model

• Often Data > Methods


• Microsoft researchers (Banko and Brill) evaluated the performance of multiple models
for a language understanding task
• Varied size of training dataset (up to 1B
words)
• Among modern methods, performance differences between algorithms are relatively small compared to the differences between the same algorithm trained with more/less data

“Unreasonable effectiveness of data” – Peter Norvig (Google)


AI Fundamentals
Feature Engineering and Deep Learning Introduction

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


What About Unstructured Data?

• Big revolution in AI is ability to predict from “unstructured data”

What About Unstructured Data?

• When data are not structured, features have to be “engineered” from the
data
• A time-consuming and challenging process, and often requires domain
expertise
• One of the most difficult parts of the ML process, where data scientists
spend a lot of their time
• Can be as much an art as a science
Feature Engineering

Example: Features you might engineer from real estate pictures
• Take individual images and extract individual features
• Requires knowledge of real estate
• Requires access to a realtor and a software developer
• Good amount of guessing involved — very likely to miss critical features
AI Fundamentals
Deep Learning

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Deep Learning

• Working with unstructured data requires a feature engineering step to generate features for the model
• If the model’s error is too large or its performance is unacceptable, engineers try a new set of features or adjust the features used in the current model
• Domain expertise is still required
• What are the right features to extract?
• Can be especially hard when using unstructured data like images,
sounds, or essays
Example: X-Ray Diagnostics

Domain experts (e.g. radiologists) label the images
Deep Learning

• Eliminates the need for feature extraction


Why is Deep Learning a “Game Changer”?

• Feature engineering is expensive


• Requires domain expertise, is error-prone, and is uncertain
• Deep learning can lead to massive performance improvements, relative to
hand-coded features
• Computation is getting cheaper, making deep learning more feasible
• Going to let us predict from unstructured data (images, online reviews,
health data, audio, etc.) at scale
• For any prediction or classification task, deep learning substitutes more and more data (labeled examples) for expertise
Deep Learning Examples

• Image recognition
• Detecting fake news
• Detecting knockoffs from luxury products
AI Fundamentals
Evaluating ML Performance

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Evaluating ML Performance

• What should we tell the algorithm to optimize on?


• Loss/cost functions
• Accuracy
• Precision
• Recall
• Specificity
• Why are there so many?
Example: “Identify Fraudulent Credit Card Transactions”

Actual value Predicted value

Fraudulent Fraudulent
Legitimate Legitimate
Legitimate Legitimate
Legitimate Legitimate
Fraudulent Fraudulent
Fraudulent Legitimate
Legitimate Legitimate
Legitimate Fraudulent
Legitimate Legitimate
Legitimate Legitimate
Evaluating ML Performance

• Is the classifier doing a good job?


• Depends on what we care about
• Is it more costly to miss a fraudulent transaction?
• Is it very costly to put a valuable customer’s credit card on hold
accidentally?
• There are different costs and benefits of getting different types of labels
wrong in a prediction task
AI Fundamentals
How Deep Learning Works

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Unstructured Data

• Unstructured data can be formulated into raw digital form
• Data are then pre-processed to standardize them for the prediction task
• This data is passed into neural networks (modeled after neurons)
Neural Networks

• Data forms the input layer
• Engineers choose a loss/cost function to compare predictions against the training labels
• How close are we to predicting the right answers?
• At each layer, inputs are weighted, summed, and passed to the next layer
Backpropagation

• Tuning the network
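A highly simplified sketch of what “tuning the network” means: repeatedly adjust the weights against the gradient of the loss. This is a single sigmoid neuron with made-up data, not a full network:

```python
# Simplified sketch (not from the slides): one-neuron backpropagation.
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # input
y_true = 1.0                     # training label
w = np.array([0.1, 0.1, 0.1])    # initial weights
lr = 0.5                         # learning rate (a hyperparameter)

for _ in range(100):
    y_pred = 1.0 / (1.0 + np.exp(-np.dot(w, x)))              # forward pass
    loss = (y_pred - y_true) ** 2                              # squared-error loss
    grad = 2 * (y_pred - y_true) * y_pred * (1 - y_pred) * x   # chain rule
    w -= lr * grad                                             # step against the gradient

print(loss)  # loss shrinks as the weights are tuned
```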


Neural Networks

• Limited domain information embedded in the model


• Substituting computation for expert knowledge
• Great for tasks with a lack of domain understanding for feature extraction
Deep Learning

• What is the role of the engineer?


• Setting hyperparameter values — requires engineering knowledge, but less domain knowledge
• Epochs
• Batch size
• Learning rate
• Regularization
• Activation function
• Number of hidden layers
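A sketch of where these hyperparameters appear in code, assuming the Keras API and random placeholder data:

```python
# Sketch (assumed illustration): the hyperparameters above, in Keras.
import numpy as np
import tensorflow as tf

X_train = np.random.rand(100, 8)           # placeholder data
y_train = np.random.randint(0, 2, 100)     # placeholder binary labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu",                              # activation function
                          kernel_regularizer=tf.keras.regularizers.l2(0.01)), # regularization
    tf.keras.layers.Dense(64, activation="relu"),                             # number of hidden layers
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # learning rate
    loss="binary_crossentropy",
)

model.fit(X_train, y_train,
          epochs=20,      # passes over the training data
          batch_size=32)  # examples per gradient update
```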
Deep Learning

• Workflow changes — we only need examples


AI Fundamentals
Limitations of Deep Learning

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Limitations of Deep Learning

• Do you have plenty of varied training data available?


• Deep learning is going to require much more training data than other
approaches
• Also need a lot of variation in the data
• Do you have the necessary data storage and computational power
available?
• The hardware requirements of deep learning are much more significant
• Do you need insight into why a deep learning model made a specific
decision?
• Lack of interpretability (explainability)
AI Fundamentals
Common Loss Functions

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Accuracy

• Fraction of labels (answers) that the algorithm predicts correctly
• Accuracy = 0.70
• Classification error = 0.30

Actual value    Predicted value
Legitimate      Legitimate
Fraudulent      Legitimate
Fraudulent      Fraudulent
Legitimate      Legitimate
Legitimate      Fraudulent
Legitimate      Legitimate
Fraudulent      Fraudulent
Fraudulent      Legitimate
Legitimate      Legitimate
Legitimate      Legitimate
Precision

• What proportion of positive identifications were actually correct?
• If “fraudulent” is the positive class, what proportion of the ones we call fraudulent are actually fraudulent?
• Precision = 5/7 = 0.71

Actual value    Predicted value
Fraudulent      Fraudulent
Legitimate      Fraudulent
Legitimate      Legitimate
Fraudulent      Fraudulent
Fraudulent      Legitimate
Fraudulent      Fraudulent
Legitimate      Legitimate
Legitimate      Fraudulent
Fraudulent      Fraudulent
Fraudulent      Fraudulent
Sensitivity (Recall)

• How many relevant instances (fraudulent) did you catch?
• Recall = 5/6 ≈ 0.83

Actual value    Predicted value
Fraudulent      Fraudulent
Legitimate      Fraudulent
Legitimate      Legitimate
Fraudulent      Fraudulent
Fraudulent      Legitimate
Fraudulent      Fraudulent
Legitimate      Legitimate
Legitimate      Fraudulent
Fraudulent      Fraudulent
Fraudulent      Fraudulent
Specificity

• Proportion of legitimates correctly identified as such
• Specificity = 2/4 = 0.50

Actual value    Predicted value
Fraudulent      Fraudulent
Legitimate      Fraudulent
Legitimate      Legitimate
Fraudulent      Fraudulent
Fraudulent      Legitimate
Fraudulent      Fraudulent
Legitimate      Legitimate
Legitimate      Fraudulent
Fraudulent      Fraudulent
Fraudulent      Fraudulent
Definitions: Type I and Type II Error

True positive: Fraction of times we identify a Fraudulent as a Fraudulent
True negative: Fraction of times we identify a Legitimate as a Legitimate
False positive (Type I error): Fraction of times we identify a Legitimate as a Fraudulent
False negative (Type II error): Fraction of times we identify a Fraudulent as a Legitimate
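A sketch computing all four metrics from the 10-transaction table used in the precision/recall/specificity slides above, treating “Fraudulent” as the positive class:

```python
# Sketch: the four metrics from the 10-transaction example above.
actual    = ["F", "L", "L", "F", "F", "F", "L", "L", "F", "F"]
predicted = ["F", "F", "L", "F", "L", "F", "L", "F", "F", "F"]

tp = sum(a == "F" and p == "F" for a, p in zip(actual, predicted))  # 5
tn = sum(a == "L" and p == "L" for a, p in zip(actual, predicted))  # 2
fp = sum(a == "L" and p == "F" for a, p in zip(actual, predicted))  # 2 (Type I errors)
fn = sum(a == "F" and p == "L" for a, p in zip(actual, predicted))  # 1 (Type II error)

print("accuracy:   ", (tp + tn) / len(actual))   # 0.70
print("precision:  ", tp / (tp + fp))            # 5/7 ≈ 0.71
print("recall:     ", tp / (tp + fn))            # 5/6 ≈ 0.83
print("specificity:", tn / (tn + fp))            # 2/4 = 0.50
```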
Confusion Matrix
Receiver Operating Characteristic (ROC) Curve

• ROC curves suggest a tradeoff: as the decision threshold varies, the true positive rate (sensitivity) rises together with the false positive rate (1 − specificity)
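A sketch of plotting an ROC curve with scikit-learn; the labels and scores below are made up, standing in for a trained classifier’s predicted probabilities:

```python
# Sketch (assumed illustration): an ROC curve from predicted scores.
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]   # 1 = Fraudulent
y_scores = [0.9, 0.6, 0.2, 0.8, 0.4, 0.7, 0.1, 0.55, 0.85, 0.75]

fpr, tpr, _ = roc_curve(y_true, y_scores)   # one point per threshold
plt.plot(fpr, tpr)
plt.xlabel("False positive rate (1 - specificity)")
plt.ylabel("True positive rate (sensitivity)")
plt.title(f"ROC curve, AUC = {roc_auc_score(y_true, y_scores):.2f}")
plt.show()
```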


AI Fundamentals
Tradeoffs Between Loss Functions

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Tradeoffs Between Loss Functions

• What are the relative costs of false negatives and false positives?
Tradeoffs Between Loss Functions

When might we want a sensitive test?


• Medical application — screening for a disease
• Want to make sure we don’t miss a person with the condition
• Even at the potential cost of falsely identifying some cases as having the
condition, even if they actually don’t
• Radar system detecting incoming aircraft
Tradeoffs Between Loss Functions

When might we want a precise test?


• Deciding when it’s okay for a car to take a left turn
• Want to be absolutely sure that it is okay to take a left before recommending the decision
• Even if it means we will miss some opportunities to take a left turn
• Identifying violations that have severe punishments (e.g. cheating that leads
to expulsion)
AI Fundamentals
How is Training Data Acquired?

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Training Data

• The data that the algorithm uses to learn the best mapping between the
inputs and the right predictions, or outputs
Training Data

Where does training data come from?


• Archival, or historical, data in the organization
• In many domains there are records of decisions that have been made
that can be used to train a model
• E.g. resume screening
• Human data labeling to generate training data
• Platforms for crowdsourcing the task of labeling data
• Using customers to label data
• Google and Gmail spam filtering
• Social networks
AI Fundamentals
The Over-Fitting Problem

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Key to ML Algorithms

• Training data is the key to building ML algorithms, but we care most about
performance on out-of-sample data
• The point is to predict outcomes where we don’t already know what is going
to happen
The Over-Fitting Problem

• Over-fitting is the danger that the model performs well on training data, but
not other data sets
• ML engineers try to avoid fitting the model to the point that it picks up noise
in the training data
• Trying to balance using the training data to build an accurate model, with
having a model that still performs well on out-of-sample data
• Example: Studying the test vs. studying the material
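A sketch of over-fitting in action, using synthetic noisy data: an unconstrained decision tree “studies the test” (memorizes the training data) yet generalizes worse than a simpler tree:

```python
# Sketch (assumed illustration): a deep tree memorizes noise, a shallow
# tree generalizes better on held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, flip_y=0.2, random_state=0)  # noisy labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)            # no depth limit
shallow = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("deep tree    train/test:", deep.score(X_train, y_train), deep.score(X_test, y_test))
print("shallow tree train/test:", shallow.score(X_train, y_train), shallow.score(X_test, y_test))
# Typical result: the deep tree scores ~1.0 on training data but worse
# than the shallow tree on the held-out test data.
```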
The Over-Fitting Problem

• The challenge is in capturing the relevant aspects of the model vs. capturing
the noise in the training data
• This is called the “bias-variance” tradeoff
The “Bias-Variance” Tradeoff

Example: Customer targeting


• Want to run a promotion where we target specific customers
• Have training data based on a small set of customers who we’ve run the
promotion on in the past
• We’d like to run a model that picks up the relevant attributes of those
customers that can be useful for predicting future customers who might be
responsive
• Could run the model to the point where it goes overboard and picks up
noise in the data
• E.g. first name “Julie” and promotion response
AI Fundamentals
Test Data

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


Test Data

• Test data (also called a “hold-out sample”) is a data set that is not used to
train or build the model, but can be used to validate the model
• Validating performance on a data set that’s not used to build the model (test
data) helps ensure the model also works well on outside samples
Performance Tradeoffs with Training and Test Data
Where Does Test Data Come From?

• One common approach is for ML engineers to start with all of the data for
which they have labels, and then divide it up into training and test data (e.g.
conduct a 70/30 split)
• Example: Insurance data to predict accident likelihood
• Take all historical data and divide it up
• Everything up until the last 6 months is used as training data
• Everything from 6 months ago to the present day is used as test data
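A sketch of that time-based split with pandas; the file name and column names here are hypothetical:

```python
# Sketch (assumed illustration): chronological train/test split.
# "claims.csv" and its "date" column are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("claims.csv", parse_dates=["date"])

cutoff = df["date"].max() - pd.DateOffset(months=6)
train = df[df["date"] <= cutoff]   # everything up to the last 6 months
test  = df[df["date"] > cutoff]    # the most recent 6 months

print(len(train), len(test))
```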
AI Fundamentals
Examples of End-to-End AI Workflow

Prasanna (Sonny) Tambe, Associate Professor of Operations, Information and Decisions


End-to-End AI Workflow

ML to identify disease in medical images


• Collect large amounts of data on medical diagnostic images with expert
decisions (e.g. from radiologists)
• These experts have labeled these images as indicating a condition or not
• Build the model on a training sample and evaluate performance on test samples
• The machine can be taught to predict with accuracy, given a medical image,
whether someone has a condition
• Key point — never had to ask anyone to describe what aspects of medical
images suggest a condition
• No need to talk to a medical expert/radiologist
Magic of Machine Learning

• Data and computation substitute for expertise


• Advantages
• Consistency
• Scale/speed
• Works really well for some tasks
• Faster/cheaper to build
