Tutorial 3

Introduction to Business Analytics and Machine Learning

TUTORIAL 3
Definition
• Analytics is a field of computer science that uses math, statistics, and machine
learning to find meaningful patterns in data.
• Business Analytics: analytics applied to making business decisions.
• Analytics may be considered a three-step process:

1. Descriptive Analytics (What happened?): It involves looking at the history of business
activities to get a fair idea of how a business performed in the past. (Technique: EDA)

2. Predictive Analytics, or forecasting (What is likely to happen in the future?): Facts
and information from the past are leveraged to understand the future course the
business may take. (Techniques: Regression, Decision Tree, Machine Learning, etc.)

3. Prescriptive Analytics (What action should we take?): Based on the findings of
descriptive and predictive analytics, it determines the best course of action in a scenario.
(Techniques: Optimization algorithms {linear/non-linear programming, genetic
algorithms, etc.}, simulations, game theory, etc.)
Exploratory Data Analysis (EDA)
• Analysts need first to explore the data for potential research questions
before jumping into confirming the answers with hypothesis testing and
inferential statistics.

• Observations & Variables

We call them variables because their values may vary across observations.
Further reference: Sarah Boslaugh’s Statistics in a Nutshell, 2nd edition (O’Reilly)

EDA contd…
Types of variables
• Binary: It can only take two levels (e.g. Married? (yes or no); Made purchase?
(yes or no); Wine type? (red or white)).
• Nominal: A variable with more than two levels and no intrinsic ordering between
them (e.g. Favorite color (orange, blue, burnt sienna, and so forth)).
• Ordinal: It takes more than two levels, where there is an intrinsic ordering
between these levels (e.g. Beverage size (small, medium, large)).
• Discrete: It can take only a fixed number of countable values between any two
values (e.g. units sold: 10, 12, 13, …).
• Continuous: It can in theory take an infinite number of values between any two
values (e.g. Height within a range of 59 to 75 inches: 59.3, 60.2, …).
EDA contd…
• Interval vs. Ratio Type?

• Exploratory Data Analysis refers to the critical process of performing initial
investigations on data to discover patterns, spot anomalies, test hypotheses, and
check assumptions with the help of summary statistics and graphical representations.
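
As a minimal sketch of the summary-statistics side of EDA, the snippet below uses Python's standard `statistics` module on a small, made-up list of daily units sold. The data and the "more than two standard deviations from the mean" rule for spotting anomalies are assumptions chosen for illustration, not a prescribed method.

```python
import statistics

# Hypothetical sample: units sold per day (illustrative values only).
units_sold = [10, 12, 13, 12, 11, 14, 12, 48]

mean = statistics.mean(units_sold)
median = statistics.median(units_sold)
stdev = statistics.stdev(units_sold)  # sample standard deviation

# Assumed rule of thumb: flag values more than two standard deviations from the mean.
anomalies = [x for x in units_sold if abs(x - mean) > 2 * stdev]

print(f"mean={mean:.2f} median={median} stdev={stdev:.2f}")
print("possible anomalies:", anomalies)
```

The gap between the mean and the median already hints that one value is pulling the average up, which is the kind of pattern EDA is meant to surface before any formal testing.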
EDA contd…
Data Visualization
• Univariate visualization: Only one variable is visualized graphically (e.g. bar
charts, pie charts, histogram, etc.)

• Bivariate visualization: Each point is placed according to its value on two
attributes (e.g. scatterplot)

• Multivariate visualization: More than two variables are visualized simultaneously
EDA contd…
Data Visualization

• Tools: Tableau, Python, QlikView, SAS Visual Analytics, Power BI, R, etc.
EDA contd…
Correlation
• Correlation Analysis is a statistical method used to discover whether there is a
relationship between two variables/datasets, and how strong that relationship may be.
EDA contd…
Methods to find Correlation Coefficient
• Pearson Coefficient (generally, useful for linear relationship between two
continuous variables)

• Spearman's Rank Coefficient (generally, useful for ordinal or non-normally
distributed data)

• Kendall's Rank Coefficient (generally, appropriate for ordinal or non-normally
distributed data)
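
The difference between the Pearson and Spearman coefficients can be sketched in plain Python. The data below is hypothetical, and the `ranks` helper assumes there are no tied values to keep the example short; a library routine would handle ties properly.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Rank values from 1 to n (assumes no ties, to keep the sketch short)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """Spearman rank coefficient: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: y grows monotonically, but not linearly, with x.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(f"Pearson:  {pearson(x, y):.3f}")
print(f"Spearman: {spearman(x, y):.3f}")
```

Because the relationship is monotonic but curved, the rank-based coefficient reports a perfect association while the Pearson coefficient falls slightly short of 1.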

“Correlation Does Not Imply Causation”


Regression for Predictive Model Building
• Businesses want to make faster and better decisions than their competitors, so
they would like a fairly good idea of what is expected to happen in the future.

Simple Linear Regression


• Simple linear regression estimates the impact of a unit change in the independent
variable X on the dependent variable Y.
• The equation for linear regression:
$Y = \beta_0 + \beta_1 X + \epsilon$
• H0: There is no linear influence of our independent variable on our
dependent variable ($\beta_1 = 0$).
• Ha: There is a linear influence of our independent variable on our dependent
variable ($\beta_1 \neq 0$).
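
A minimal sketch of estimating the intercept and slope by ordinary least squares follows. The spend and sales figures are invented, and the function name `fit_simple_linear_regression` is a hypothetical helper rather than a library call.

```python
def fit_simple_linear_regression(x, y):
    """Return (beta0, beta1) that minimize squared error for y = beta0 + beta1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    beta1 = sxy / sxx                  # estimated change in Y per unit change in X
    beta0 = mean_y - beta1 * mean_x    # estimated Y when X is zero
    return beta0, beta1

# Hypothetical data: advertising spend (X) and sales (Y), in arbitrary units.
spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]   # constructed as Y = 1 + 2X, so the fit is exact

beta0, beta1 = fit_simple_linear_regression(spend, sales)
print(f"beta0={beta0:.2f}, beta1={beta1:.2f}")
```

With real data the points would not fall exactly on a line, and testing H0 would additionally require the standard error of the slope, which this sketch leaves out.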
What is Machine Learning?
• Machine Learning is the science (and art) of programming
computers so they can learn from data.
• For example, a bank might deploy machine learning to
detect whether a customer will default on a loan. As more
data is fed in, the algorithm may find patterns and
relationships in the data and use them to better predict
the likelihood of a default.
Why Use Machine Learning?
• Consider how you would write a spam filter using traditional programming
techniques:

1. First you would look at what spam typically looks like. You might notice
that some words or phrases (such as “4U,” “credit card,” “free,” and
“amazing”) tend to come up a lot in the subject.

2. You would write a detection algorithm for each of the patterns that you
noticed, and your program would flag emails as spam if a number of these
patterns are detected.

3. You would test your program, and repeat steps 1 and 2 until it is good
enough.
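
The steps above can be sketched as a small rule-based filter. The pattern list reuses the example phrases from step 1; the `threshold` parameter and the sample subjects are assumptions made for illustration.

```python
# Hand-written patterns, taken from the example phrases in step 1.
SPAM_PATTERNS = ["4u", "credit card", "free", "amazing"]

def is_spam(subject, threshold=2):
    """Flag a subject line when at least `threshold` patterns appear in it."""
    text = subject.lower()
    hits = sum(1 for pattern in SPAM_PATTERNS if pattern in text)
    return hits >= threshold

print(is_spam("Amazing FREE credit card 4U"))   # True
print(is_spam("Meeting agenda for Monday"))     # False
print(is_spam("Amazing offer For U"))           # False: "For U" does not match the "4u" rule
```

Every new spam wording needs another hand-written entry in `SPAM_PATTERNS`, which is the maintenance burden that motivates the machine learning approach.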
Traditional approach
Machine Learning approach
Problem with Traditional approach
• If spammers notice that all their emails containing “4U” are
blocked, they might start writing “For U” instead. A spam
filter using traditional programming techniques would need to
be updated to flag “For U” emails. If spammers keep working
around your spam filter, you will need to keep writing new
rules forever.
Automatically adapting to change
Types of Machine Learning Systems
• Broadly classifying:
1. Supervised learning
▪ In supervised learning, the training data you feed to the algorithm includes the
desired solutions, called labels.
▪ The spam filter is a good example of this: it is trained with many example emails
along with their class (spam or ham), and it must learn how to classify new emails.
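
As a rough illustration of learning from labeled examples, the sketch below trains a tiny Naïve Bayes classifier on six made-up subject lines. The training data, function names, and the use of add-one smoothing are assumptions for this example; a real filter would need far more data and a proper library.

```python
import math
from collections import Counter

# Hypothetical labeled training data: (subject, label) pairs.
training_data = [
    ("free credit card offer", "spam"),
    ("amazing deal 4u", "spam"),
    ("free amazing prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
    ("lunch on monday", "ham"),
]

def train(examples):
    """Count words per class and examples per class."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, class_counts

def classify(text, word_counts, class_counts):
    """Return the class with the higher Naive Bayes log-probability."""
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    scores = {}
    for label in word_counts:
        total_words = sum(word_counts[label].values())
        # Log prior: the share of training examples carrying this label.
        score = math.log(class_counts[label] / sum(class_counts.values()))
        for word in text.split():
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

word_counts, class_counts = train(training_data)
print(classify("free prize 4u", word_counts, class_counts))
print(classify("monday project meeting", word_counts, class_counts))
```

No rule about any specific word is written by hand here; the word weights come entirely from the labeled examples, so retraining on new examples updates the filter.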
Types of Machine Learning Systems
• Supervised learning deals with two distinct kinds of problems:
▪ Classification problems
➢ Classification problems are often solved using algorithms such as Naïve Bayes,
Support Vector Machines, Random Forest, Logistic Regression (used to
calculate or predict the probability of a binary (yes/no) event occurring), etc.

▪ Regression problems
➢ Regression problems are often solved using linear regression, non-linear
regression, Bayesian linear regression, etc.

• Recommender systems are a notable application that often relies on supervised
learning. E-commerce companies such as Amazon, streaming sites like Netflix, and
social media platforms such as TikTok, Instagram, and YouTube, among many others,
use recommender systems to make appropriate recommendations to their target
audience.
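
To make the logistic regression mentioned under classification problems more concrete, here is a minimal sketch of how it turns a score into the probability of a yes/no event. The feature `debt_ratio` and the coefficient values are hypothetical; in practice the coefficients would be learned from labeled data.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_default_probability(debt_ratio, intercept=-3.0, weight=6.0):
    """Probability of default under the assumed (not learned) coefficients."""
    return sigmoid(intercept + weight * debt_ratio)

for ratio in (0.1, 0.5, 0.9):
    p = predict_default_probability(ratio)
    label = "default" if p >= 0.5 else "no default"
    print(f"debt_ratio={ratio:.1f} -> p={p:.2f} ({label})")
```

The 0.5 cut-off used to turn the probability into a class is itself a choice; a bank might pick a lower threshold if missed defaults are costlier than false alarms.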
Types of Machine Learning Systems
2. Unsupervised Learning
▪ In unsupervised learning, as you might guess, the training data is unlabeled.
▪ The main task of unsupervised learning is to find patterns in the data.
Types of Machine Learning Systems
• Some of the most important unsupervised learning algorithms:

• Clustering
▪ k-Means
▪ Hierarchical Cluster Analysis (HCA)
▪ Expectation Maximization

• Visualization and dimensionality reduction
▪ Principal Component Analysis (PCA)
▪ Kernel PCA
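
As an illustration of clustering, the following is a minimal one-dimensional k-means sketch. The spend values and the initial centroids are made up, and real implementations work on multi-dimensional data with more careful initialization.

```python
def k_means_1d(points, centroids, iterations=10):
    """Run k-means on 1-D points, starting from the given initial centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments will no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabeled data: customer spend with two visible groups.
spend = [10, 12, 11, 50, 52, 51]
centroids, clusters = k_means_1d(spend, centroids=[10, 12])
print(centroids)
print(clusters)
```

No labels are supplied at any point; the two groups emerge only from the distances between the values.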
Types of Machine Learning Systems
3. Reinforcement Learning

• The learning system (agent) can observe the environment, select and
perform actions, and get rewards in return (or penalties in the form of
negative rewards).

• It does not have a labelled dataset or results associated with data, so the only
way to perform a given task is to learn from experience.

• For every correct action or decision, the algorithm receives a positive reward,
whereas for every incorrect action it receives a negative reward (a penalty).
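
The reward loop can be sketched with a deliberately simple, deterministic agent. The environment below (where action 2 is the "correct" one), the learning rate, and the try-each-action-once exploration rule are all assumptions made to keep the example short.

```python
def reward_for(action):
    """Hypothetical environment: action 2 is 'correct', all others are not."""
    return 1.0 if action == 2 else -1.0

def run_agent(n_actions=3, steps=20, learning_rate=0.5):
    """Learn action values purely from rewards; no labeled data is used."""
    values = [0.0] * n_actions   # the agent's estimate of each action's reward
    history = []
    for step in range(steps):
        if step < n_actions:
            action = step        # explore: try every action once
        else:
            action = max(range(n_actions), key=lambda a: values[a])  # exploit
        reward = reward_for(action)
        # Move the estimate for the chosen action toward the observed reward.
        values[action] += learning_rate * (reward - values[action])
        history.append(action)
    return values, history

values, history = run_agent()
print("learned values:", values)
print("last actions:  ", history[-5:])
```

After the short exploration phase the agent keeps choosing the action that earned a positive reward, which is the "learn from experience" behavior described above.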
• Summary

• https://www.youtube.com/watch?v=1FZ0A1QCMWc
Main Challenges of Machine Learning
• Insufficient or poor-quality data

It should be noted, however, that small- and medium-sized datasets are still very
common, and it is not always easy or cheap to get extra training data, so don’t
abandon algorithms just yet.
Main Challenges of Machine Learning (Contd.)
• Nonrepresentative Training Data
▪ In order to generalize well, it is crucial that your training data be representative of
the new cases you want to generalize to.

▪ The set of countries we used earlier for training the linear model was not perfectly
representative; a few countries were missing.
▪ It seems that very rich countries are not happier than moderately rich countries (in
fact they seem unhappier), and conversely some poor countries seem happier than
many rich countries.
Main Challenges of Machine Learning (Contd.)
• Poor quality data (training data is full of errors, outliers, and noise)

• Overfitting the Training Data
▪ Say you are visiting a foreign country and the taxi driver rips you off. You might be
tempted to say that all taxi drivers in that country are thieves (overgeneralization).
▪ In Machine Learning this is called overfitting.

Why did it happen?
• If the training set is noisy, or if it is too small (which introduces sampling noise),
then the model is likely to detect patterns in the noise itself.
Main Challenges of Machine Learning (Contd.)
• Overfitting happens when the model is too complex relative to the amount
and noisiness of the training data. The possible solutions are:

▪ To simplify the model by selecting one with fewer parameters (e.g., a linear model
rather than a high-degree polynomial model), by reducing the number of attributes
in the training data, or by constraining the model
▪ To gather more training data
▪ To reduce the noise in the training data (e.g., fix data errors and remove outliers)
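
A small, hedged illustration of overfitting: the sketch below compares a very flexible "memorizer" (it returns the label of the nearest training point) with a straight-line fit. The data is invented, with a true relationship of y = 2x and training labels shifted by ±1; the validation labels are kept noise-free for simplicity.

```python
def mse(model, xs, ys):
    """Mean squared error of a model's predictions."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_memorizer(xs, ys):
    """Very flexible model: predict the y of the nearest training x."""
    def predict(x):
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x))
        return ys[nearest]
    return predict

def fit_line(xs, ys):
    """Constrained model: least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

# Hypothetical data: true relationship y = 2x, training labels carry noise of +/-1.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3, 3, 7, 7, 11, 11]
val_x = [1.2, 2.2, 3.2, 4.2, 5.2]
val_y = [2 * x for x in val_x]  # noise-free for simplicity

for name, model in [("memorizer", fit_memorizer(train_x, train_y)),
                    ("line", fit_line(train_x, train_y))]:
    print(f"{name:9s} train MSE={mse(model, train_x, train_y):.2f} "
          f"validation MSE={mse(model, val_x, val_y):.2f}")
```

The memorizer reproduces the training labels perfectly, noise included, yet does worse than the simpler line on the new points, which is the pattern the solutions above are meant to counter.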
Main Challenges of Machine Learning (Contd.)
• Underfitting the Training Data
▪ A linear model of life satisfaction is prone to underfit; reality is just more complex
than the model. The possible solutions are:
▪ Selecting a more powerful model, with more parameters
▪ Feeding better features to the learning algorithm (feature engineering)
▪ Reducing the constraints on the model

• Ethical considerations and bias

• Interpretability and explainability of models

• Selection of appropriate algorithms
