
Big Data Management

NOTES
SYLLABUS
Books:
MLP: Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: A
guide for data scientists
DSB: Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic
thinking

Other:
PPBD: Bollier, D., & Firestone, C. M. (2010). The promise and peril of big data.
http://www.lsv.fr/~monmege/teach/learning2013/ThePromiseAndPerilOfBigData.pdf

Learning Objectives
 Explain the business value of big data and be able to deploy machine learning techniques to analyze it for broad
contexts, such as classification, regression, and clustering.
o
 Evaluate methods for the testing and assessment of data models and critically reflect on the meaning of findings.
 Recognize the practical and ethical boundaries of machine learning and big data management.
o Thinking about the social xxx
o Is it useful?
 Create a business case by identifying a valuable data set and working collaboratively to apply and justify an
appropriate machine learning technique.
o Synthesize into the group project

Readings:
1. Follows the structure and order of each day's session
2. Goes through the Python code
3. High-level overview – critique of big data
Group Project & Exam
 Form a group of 2-4 students
 Get an interesting, relevant data set
 Use appropriate machine learning techniques to extract value from the data
 Write a paper describing what you did, why you did it, and the limitations faced
 Deliverables:
o One-page project plan (optional)
o Final paper (max. 15 pages) -- Group oral exam

Oral exam
 As a group
 20 minutes per student
 Assessing your knowledge of course content
 Assessing your completion of the learning objectives

Group project examples


 Get data from Twitter. Apply classification techniques to identify bots.
 Get data from wine auctions. Apply linear regression to predict price appreciation.
 Get data on customer purchases. Apply clustering to identify customer segments.

Group project suggestions


 Form your groups as soon as possible
 Make sure you and your group members have the same goals
 Find your dataset early and get to know it
o kaggle
o Often poorly documented
 Be ambitious, but realistic – you don't need to be innovative – aim in between easy and ambitious – still interesting, not boring

Course project comments and expectations


 There’s no hard line on how many instances and attributes constitute a ”big enough” dataset.
 Your business case is probably fine. If you have references to support how you motivate the case, then that’s even
better.
 You should implement multiple models to compare them.
o Use more complex models too, not just the simple methods shown in class
 You should critically reflect on how you've interpreted the data and results, issues of ethics, and whether you're
interpreting correlation as causation.
 Your code/notebook can be added as an appendix or uploaded as a separate file.

Finding a dataset online

Search with key words here: https://datasetsearch.research.google.com


Can you tick all these boxes?

[] the dataset has been shared by a reputable source


- Not always a clear yes or no
- NGO, governmental, or corporate entity – something recognizable
[] there are enough instances and attributes to enable ML models
- No hard line, what is appropriate
[] it is clear to us what each attribute is
- You should know what the attributes are – go through each column, or documentation of what the attributes are
[] the values in the dataset are sensible
- The ranges of the values
[] there are no comments or other indications of issues with the dataset online
- Discussion board on Kaggle
- Obvious red flags

Code?
If anything is important you can include code, but only the actual fitting, the outputting of results, or the interpretation of something particular.
The math behind the algorithms/models > he doesn't care about this. He wants to see critical reflections – which sometimes involve an
explanation of the math if that is the cause of differences between two models.
1. INTRODUCTION
Topic: Introduction
Learning Objective: 1
Syllabus: MLP p. 1-26; DSB p. 1-18
Activities to be done before next class: Readings
Exercise: Python Basics

Session 1
What is big data?
The 3 Vs of big data. An easy, intuitive distinction – if, due to one of these, your data can't be processed with the normal tools,
you are probably working with big data. Overly simplistic.
What distinguishes big data from just data?

 Volume
o Terabytes of data being captured
 Variety
o Not only is data captured, but it is captured by different types of computers and stored in different types of formats
o Text, images, videos etc
 Velocity
o Not just various types of data, data is constantly changing, in motion.
 Veracity
o Idea that big data is difficult to assess
 Value
o Innovation, revenue

More and more becomes digitized

Not only online, but in the physical world as well

Lots of computers

The business case for big data


2 different ways
 Transforms traditional business
o E.g., selling books: a traditional bookstore sells products to customers and can track which books sell. That's it.
o A bookstore that also sells books online can track what customers buy, how they navigate through the site, and similarities across
customer types, and can predict what book they might buy next.
o Same value proposition, but delivered in a different way; this data-driven salesmanship can outperform what a traditional bookstore
would manage otherwise.
 Creates new business
o E.g., data-driven advertising;
o Microtargeting?
o Big data is the product itself.
"Kinda" – organizations in the survey say they use big data

Same survey: the challenges they face with big data projects include a

shortage of technical expertise.

Machine learning –

attention to big data is increasing, and there is a

close association between ML and big data:

big data can't be processed through conventional means

– but machine learning can. The use of machine learning increases
with the increase in big data.

What is machine learning?


Definition: A field of artificial intelligence involving computer algorithms that 'learn' by finding patterns in sample data, and
apply these findings to new data to make predictions or provide other useful outputs.

Since big data can't be processed by normal means, we use machine learning models

AI, statistics:

AI and ML are, at their core, statistics


Supervised vs. unsupervised

2 types of models

Supervised
Trained with labelled datasets, e.g., pictures labelled by humans
(training to see whether a picture is a dog or not).
We know the output we want to predict.
Most commonly used.
Downside: labelling pictures/data is a manual, boring task.

Unsupervised
No labelled training data – we don't know what we are looking for
(e.g., trends in online purchasing data).
Downside: outputs can be difficult to interpret and it is hard to evaluate
performance.
Classification vs. regression vs. clustering
3 common ML tasks – our focus

Classification
 Input labelled training data (supervised)
o Predicting which class the given data belongs to
 Output categorical labels from predefined possibilities
o Binary: dog / not dog
o Multiclass: identifying more than two classes
o Using the same algorithms with some adaptation
 Example algorithms:
o k-nearest neighbors
o Logistic regression
o Decision Trees
 Example applications:
o Detecting fake news
o Identifying cancer in medical scans

Regression
 Input labelled training data (supervised)
 Output continuous value
o Predicting a continuous output
 Example algorithms:
o Linear regression
o Polynomial regression
o Ridge regression
 Example applications:
o Financial forecasting
o Predicting future temperatures

We don't know how much something will increase or decrease


Climate change: we want to know how much the temperature might rise

Clustering
 Input unlabeled data (unsupervised)
 Output categorical labels from unknown possibilities
o We don't know what we are looking for
 Example algorithms:
o k-means
o DBSCAN
 Example applications:
o Market segmentation
o Identifying player typologies in sports

Typologies beyond the standard positions – football


Identifying clusters within a dataset; recruitment

Business: clustering online purchasing data
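To make the three tasks concrete, a minimal scikit-learn sketch (my own illustration, not from the slides; the bundled toy datasets and algorithm choices are just assumptions):

# Minimal sketch of the three ML tasks in scikit-learn (illustrative only)
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Classification: supervised, categorical target (e.g., malignant vs. benign)
X_cls, y_cls = load_breast_cancer(return_X_y=True)
clf = DecisionTreeClassifier().fit(X_cls, y_cls)

# Regression: supervised, continuous target (a disease progression score)
X_reg, y_reg = load_diabetes(return_X_y=True)
reg = LinearRegression().fit(X_reg, y_reg)

# Clustering: unsupervised, no target -- the algorithm proposes the groups itself
X_clu, _ = load_iris(return_X_y=True)
clu = KMeans(n_clusters=3, n_init=10).fit(X_clu)
print(clu.labels_[:10])  # cluster assignments for the first ten instances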

The big picture


 By using machine learning, we can extract value from big data to give a more informed basis for all sorts of business
decisions...
o Who to hire
o What customers to target
o Where to allocate resources
o What to sell
o By using machine learning, we can remove all sorts of human biases and weaknesses in decision processes

By delegating decision processes to machines we can work around flaws in human decision making

WARNING: DANGER AHEAD


 Garbage in, garbage out (GIGO)
o Models are only as good as the data they're trained on
 Causality
o Common ML techniques are geared towards prediction;
o they say nothing about cause and effect, which is what most people want to know
 Example: a model based on shoe size and test scores
 shows shoe size is a predictor of test scores; predictions are better with shoe size included
 but the relationship is really driven by age
 Ethics
o Biases; relates to GIGO
o Privacy of data

Summary
 Digitization and increased computing capacities have given rise to big data.
o Need tools to handle the 3 Vs (machine learning is one of the tools)
 Machine learning involves a suite of computational algorithms for extracting value from big data.
 Machine learning can be supervised or unsupervised.
 Common machine learning tasks are classification, regression, and clustering.

Why do we use machine learning instead of traditional statistics?


Machine learning is better at processing big data

A phone service provider wants to predict which of their customers might switch to a different provider. They have data
indicating which customers have switched in the past, plus their account details. Which kind of ML task best suits this
challenge?
Classification – predict whether a customer will churn or not churn
Regression – could also be justified: a continuous target variable, the probability of someone churning.
Clustering – probably not, given they have data with a target value (clustering is an unsupervised ML technique), but it could be done in some way
Ch. 1
The book, chap. 1

We can use intuition and back it up with data. Data improves decision making.

- Data science (options) is a set of principles that guide the extraction of knowledge from data. – About finding
valuable/interesting patterns in data.
- Data mining (doing) is the extraction of knowledge from data, via technologies that incorporate these principles. –
This is useful for predicting.
- CRISP: Used to extract useful knowledge from data to solve business problems systematically by following stages.

Data manipulation:
- Python: Programming to manipulate with data to set up machine learning.
- Tableau: Analyzing and visualizing data.

Machine learning:
- WEKA: makes it easy to create models without coding by applying algorithms.
- In early stages determine if the problem should be approached supervised or unsupervised. If supervised, define the
target value!

Supervised:
- You mark what the right answer is – the system tries to figure out from data how to find the right answer.
- You have a label that you try to predict.
- Decision trees, linear classifications (our focus), and regressions.

Unsupervised:
- You don't have a label. You can still organize by:
- Clustering, market basket analysis. It is hard to say how good the outcome is.

Always training and prediction phase.


Training:
- Label = BUY MOTORBIKE
- Input = Excel sheet → extract the features → SEX, NAME etc. (features could also be an image or audio file)
- Then the ML algorithm tries to figure out (training) what we can get from the features.
o Decision Tree: compares who bought motorbikes and who did not, and creates the tree.
Prediction:
- Predict on new data based on the result of training.

Accuracy of 78%. Good? Depends on:


- Training vs. Test data
- Baseline
- Expected value

Specified target value (supervised):
- Examines each instance that is labeled with a target value indicating what class it belongs to, and looks at how often it predicts correctly.
- There is a correct answer!
- Target information must exist in the data.

Unspecified target value (unsupervised):
- Looks at patterns. There is no correct answer; you must look at the clusters and see whether they correspond to anything and can be useful.
- Useful if you do not have data on whether customers buy or not.

- Data mining vs. results (chap. 2)

Data mining process:
- Mining historical data to create a model based on the target value.
- Data mining creates a model/classifier.

Results (using the model on new data):
- Applied to new data where we do not know the results!
- Afterwards the model is put to use and applied to new data → creates predictions.
Exercise 1: Python Basics
Topic: Introduction
Learning Objective: 1
Activities to be done before next class: Complete extra introductory Python tutorial (optional)

General exercise session objectives


 Learn to actually deploy ML techniques for analyzing large datasets with Python
 Work collaboratively and receive peer-to-peer feedback
 Have the opportunity to receive personal feedback from me and the TAs

Today’s exercise objectives


 Download Python and Jupyter via Anaconda distribution
 Complete today’s exercise notebook
 Talk to one another, form study groups, and form project groups
 Self-evaluate: should you seek out extra help with programming?
2. FOUNDATIONS OF DATA-ANALYTIC THINKING
Topic: Foundations of data-analytic thinking
Learning Objective: 1
Syllabus: MLP p. 1-26; DSB p. 1-18
Activities to be done before next class: Readings
Exercise: Pandas

Session 2
Focus: Translation from business problem to data-analytic tasks

Meet problems that will raise questions – could be:

Are our employees happy at work?


How do we hire more effectively?
Who are the most profitable customers?
What products and services do people want?
Which of our customers are likely to churn?

Business questions to data-analytic tasks


Purpose: highlight the challenge.
Data science and ML must be able to accurately translate business questions into quantitative data-analytic techniques.

Business questions:
§ Verbal
§ Vague
§ Interpretive – can be interpreted in many ways
§ Qualitative – depend on the context in which they're asked
§ Often explanatory, even if expressed in predictive terms – explain cause and effect

Data-analytic tasks (in great contrast to vague business questions):
§ Mathematical
§ Precise
§ Positivist – the core – data is valid and independent of the researcher. Reproducible – returns the same result every time you inquire
§ Quantitative
§ Often predictive – with exceptions, of course

Types of data-analytic tasks


First step is to be aware of the different techniques and tools within DA – cover fundamental approaches

 Classification: predicting whether or not something will occur / belongs to a certain class


o Medicine
 Regression: predicting HOW MUCH something will happen – a point estimate
 Clustering: grouping individuals within a population based on similarities
 Similarity matching: closely related to clustering. Using data to quantify how similar observations are to other observations
 Co-occurrence grouping: associations, product recommendations (peanut butter and jelly)
 Profiling: characterizing typical behavior of individuals within a population. Detecting anomalies
 Link prediction: similar to co-occurrence; network theory, a structure of nodes
 Data reduction: replacing big data with smaller data
o Principal component analysis
 Causal modeling: what influences what; explanatory
Example: Predicting news article popularity
A struggling media company hires you as a data science consultant.
The executive board gives you data from their website and asks: how can we increase our readership?
...how do you approach this problem?

Terms:
 Instance = row
 Column = attribute – independent variable, covariate
 Target = one special column/attribute – the dependent variable we want to predict! Here, shares – none of the other
attributes is a measure of reads

Logical first step – look at data available

Use shares as proxy measure


The more shares = more sees and reads

Regression: what type of articles get the highest number of shares?

Supervised ML > Continuous variable > regression

Linear function

Target = dependent variable (shares), the one we want to predict

Numeric value

Translating the question into an equation > we can say precise, quantitative things.

Ex. Increasing X will decrease Y

Useful in guiding interventions


Classification: what type of articles get on the ’trending’ page?
Change the question > change the target variable

No longer a continuous variable but a categorical one


Binary classification
Different quantitative translation > the equation changes

The equation is different; the point is that it depends on the business question –

we turned a vague business question into a data-analytic task


Different attributes affect the number of shares etc.

Are shares an accurate indication of reads?

When translating, information and context get lost – we need to iterate
to ensure that the translation actually solves the task

Paper:
Argues many data science projects are affected by hidden biases
True to some extent

Example: an algorithm that was biased against black defendants

UK government: an exam algorithm used to estimate grades of A-level students during the pandemic downgraded students, but upgraded grades
for students at private schools

Qualitative thinking should be taken more seriously

Interpretavism:

Reflexivity –

Sensibility: Interpretivism
Definition: An epistemological approach probing the multiple and contingent ways that meaning is ascribed to objects, actions, and situations.
Example: Trace ethnography (Geiger & Ribes, 2011; Geiger & Halfaker, 2017)
Notes: Things can be understood in different ways. Shares can be understood in many ways – e.g., an article that has gone viral for the wrong reasons ruins a media company's reputation.

Sensibility: Abductive reasoning
Definition: A mode of inference that updates and builds upon preexisting assumptions based on new observations in order to generate a novel explanation for a phenomenon.
Example: Iterations of open coding, theoretical coding, and selective coding (Thornberg & Charmaz, 2013)
Notes: Supervised learning requires labeled training data, and training data gets recycled; you shouldn't assume the labels in the training data are true without investigating them.

Sensibility: Reflexivity
Definition: A process by which researchers systematically reflect upon their own positions relative to their object, context, and method of inquiry.
Example: Brain dumps, situational mapping, and toolkit critiques (Markham, 2017)
Notes: Situational mapping – analysts may know very little about the people behind the data points. Toolkit critiques – relevant, e.g., when you picked a dataset: why do we use this one? Because it is available – recognize the limitations of the dataset. Constantly ask why you study what you study and what assumptions go unremarked.
Group discussion
Imagine you’re a team of data science consultants hired by a municipality to develop an ML model to predict where crimes will
take place. The municipality’s police force is understaffed, so they want to station their officers accordingly. The municipality
provides you with a big, historical dataset where each instance includes the date, neighborhood, and number of arrests made
that day.

How could you incorporate qualitative sensibilities in your approach to this task?

Researcher’s bias:
Affect the business question
Affect the interpretation of data and result

Reflexivity:
Reflecting on your biases
Be aware of your own bias
How the model will be used: if you allocate more resources to a place where there are more crimes, there will be more
arrests
Reflecting on the implications of how the model is used. Broken windows problem: focusing on markers of disorder can sustain more
serious crime.

Interpretation:
How is the target interpreted?
What is a crime?
An arrest is not necessarily a crime; it could be the officer's fault

Interpret: how are 'arrests' / crime understood?

Even though it is an arrest in the data sheet, it is not necessarily a crime – interpretivism.

Associating arrests with crime?

The toolkit critique: is this the best data for the problem? Or is it just the one that is available to us?

Abductive reasoning:
Can we somehow recode the data? Adding additional detail; breaking down 'arrest' > not just stealing a bag of chips
Cross Industry Standard Process for Data Mining (CRISP-DM)

The CRISP model (The Cross Industry Standard Process for Data Mining)
Turning BQ into DA task

Model is exploratory; a strategy and approach (circle – this can be hard for developers to understand) and
has 6 stages: B.U, D. U, D. P, M, E, D.

B.U:
- Addressing the problem and use scenario/practical business goals to achieve.
- What do we want and how to achieve it?
D.U:
- Important to understand the strengths and limitations of the data.
- Is the data enough to achieve the goal you want → if not, be creative.
- Was it free/costly?
- Was it collected for this purpose or not?
D. P.
1. Data format – you want rows & columns + if supervised, one column with the target.
2. Missing values, errors, or other problems.
3. Converting – data types, or numerical to categorical values = discretization ([Low, Medium, High]
for income).
4. Leaks – data that is too good, that you would not normally have.

M:
- The data mining phase → creating the model – classification, regression, clustering…
E:
- How well did the model work? How often does it give right answer?
- Is it general enough to apply to new data? Hard to do, but important! Think back on business
problem.
- How important is it that it gives the right answer: Depends on situation!
- Are the results valid?
D:
- Does the model produce any value?
o Perhaps we can target customers differently?
Example: Overview
An established electronics e-retailer is facing increasing competition from newer sites. Web stores are cropping up as fast
(or faster!) than customers are migrating to the Web, so the company must find ways to remain profitable despite the rising
costs of customer acquisition. One proposed solution is to cultivate existing customer relationships in order to maximize the
value of each of the company's current customers.

The e-retailer has hired you, a data science consultant, to lead the project.

We want to turn this business problem into a data analytic task

Business understanding
The practical business goals you want to achieve; the use scenario.
What exactly do we want to do?
How exactly, in data-analytic terms, would we do it?
Legal/admin, financial, business objectives
Involves as many stakeholders as possible; involve in the process.

Example: Business Understanding
The e-retailer specifies the following goals to you:
1. Improve cross-sales by making better recommendations
2. Increase customer loyalty with a more personalized service
Still vague business questions

Together with the e-retailer, you translate those goals into data-analytic terms:
1. With historical data about previous purchases, build a model that identifies items frequently bought together (co-
occurrence grouping; “market-basket analysis”)
2. With a database containing the personal details of registered customers, build a model to identify different
customer typologies (clustering; “profiling”)
Went from vague goals like "better recommendations" to precise data-analytic tasks

Data understanding
The data that will provide the basis for a solution.
What data is readily available?
What are the limitations of the available data?
Is different data needed?
How much will collecting different data cost (in time and money)?

Rarely is the data just ready to go; it is historical data that the organization has collected not for the purpose of clustering but
for other reasons
Explore the data
Get descriptive statistics; validate the quality of the data

Example: Data Understanding


The e-retailer provides several potential datasets: purchase data, the product database, and the customer database.

You recognize that these datasets are probably sufficient, but you also recognize some limitations. For example:
 The product database containing information about each product is not linked to the purchase data.
 The customer database has lots of missing data because many customers did not fill out all fields in the
registration form. Website where you need to create an account

Maybe not linked with the purchase data >


Gain a deeper understanding of the data to fulfill your use case

Data preparation
Most time intensive and frustrating

The data format required to enable analysis.


How should missing or erroneous data points be handled?
Does data need to be converted from one type to another? Continuous variable to categorical?
Are there any "leaks" in the data? Not privacy issues, but a variable in the historical data that gives away information about the target variable.
- Example: are you training the model on attributes you won't have at prediction time?
- Detecting skin cancer: the historical datasets have attributes like "have previous xxx"
- If you deploy where you don't have that historical data, the model is not useful

Example: Data Preparation


To utilize the product database, you find a way to merge it with the purchase data so that not only can you train your model
on what specific products are frequently purchased together but what kinds of products are purchased together.
- Generalizable (laptop purchases with warranty)

To handle the missing data in the customer database, you first conduct an exploratory analysis to see whether data are
missing completely at random (MCAR) or missing not at random (MNAR).
- Clustering;

You decide to delete instances with missing data because the customer database is massive and it looks like a case of MCAR.

Modeling
Into the analysis
The machine learning technique used to analyze the data.
Which specific algorithm fits best?
Is classification, regression, or clustering the appropriate technique?
What specific algorithm(s) suit the problem faced?
How can the model be tuned?
Iterative: run different default algorithms, compare and contrast etc. to see how performance changes
Go back to data preparation when needed

Example: Modeling

For the co-occurrence grouping of products purchased, perhaps you try out several different algorithms to compare, like
the “apriori algorithm” vs. the “FP growth algorithm”.

2 unsupervised ML techniques

For the clustering with the customer database, perhaps you’ve opted to use the “k-means algorithm.”
You might then tune the n_clusters parameter depending on how many different customer typologies it makes sense to
identify.
This is obviously just optimizing on the data –
also remember the business goals (see the sketch below)
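A hedged sketch of this modeling step – trying several n_clusters settings for k-means and comparing them on a simple statistical criterion. The customer data is simulated with make_blobs, because the e-retailer's database is hypothetical:

# Tuning n_clusters for k-means on stand-in customer data (illustrative only)
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Fake "customer database": 500 customers, 4 numeric attributes
customer_features, _ = make_blobs(n_samples=500, n_features=4, centers=3, random_state=0)

for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(customer_features)
    score = silhouette_score(customer_features, model.labels_)
    print(f"n_clusters={k}: silhouette={score:.3f}")

Remember that the statistically best n_clusters is not automatically the most actionable number of customer typologies for the business.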
Evaluation
The assessment of modeling results.

How well does the model perform from a technical perspective?


Is the model general enough to apply to new data?
Are the model and results comprehensible?
Does the model satisfy the original business goal?

Main point: assess the model from a technical/statistical but also a business perspective


The technical evaluation already happens in the modeling stage

Example: Evaluation
For the co-occurrence grouping of products purchased, you compare the “apriori algorithm” to the “FP growth algorithm”
and report how each algorithm performs on some statistical criteria and the diversity of recommendations made.
How likely is an association

For the clustering with the customer database, you show how different settings of n_clusters may be more interpretable
and actionable than others.

Deployment
The application of the model and production of value.
How can the model be integrated into existing business operations?
How does the use of the model necessitate new business operations?
How should the model be used, adjusted, and re-deployed to realize ROI?

Example: Deployment
The e-retailer is happy with your work and wants to deploy both models. What now?

A web developer is needed to integrate the recommendations generated by your co-occurrence grouping model into the
website UI.

The marketing team is needed to develop specialized promotions and product packages to suit the customer typologies your
clustering model identified.

Finally, you point out that if the e-retailer succeeds in growing their customer base, your models must be re-trained and
re-deployed.

CRISP-DM implications and reflections


 Codifies how business questions can be translated into data-analytic tasks
o
 Requires collaboration between business- oriented and data-oriented teams
 Iterative, not linear
No results were reproducible
Summary
 Translating business questions into data-analytic tasks means translating vague, qualitative ideas into precise,
quantitative definitions.
 Quantitative does not mean objective. Data-analytic thinking needs qualitative sensibilities.
o Be critical
 The CRISP-DM process model is an iterative framework for how to approach business problems from a data scientific
perspective.

Which of the following is NOT a common characteristic of machine learning tasks?


- Explanatory: regression, clustering algorithms, and decision trees find correlations, not causal connections
o There are exceptions

What is the first stage in the CRISP-DM model?


- Business Understanding
o Defining practical business goals you want to achieve. Important to start here. Measuring success whether it
achieves business goals
Failures of big data and machine learning are always due to errors of programming or mathematics.
- False
o "always" > failures CAN be due to that, but often it is the training data (garbage in, garbage out)
o Qualitative thinking – understanding what the algorithm actually measures

Ch. 2
The book, chap. 2

Data Mining Techniques


- For finding valuable/interesting patterns in data (Data science).

Classification
- Involves defining a small number of classes, and then trying to predict for each instance, which
class they belong to.

- Ex: Churn problem: will or will not churn.


- And probability estimation (closely related task): You will get a percentage of this (a score).

(Predicts whether customers buy motorbikes)

Predictive Analytics = using historical data as a basis to make predictions. (!!!!)


- Then we see which classes our data falls into.
- The data mining process results in a model/classifier. This can be put to use and applied to new
data.

Or sentiment analysis = a classification problem, where texts are classified as being positive or
negative – for this you need texts that are labeled positive or negative (ex. reviews on Trustpilot).

Regression
- Gives a numerical value for each instance.

(Predicts how much will customer spend? Or use the motorbike?)

Similarity Matching
- Compare instances based on their attributes and determine how similar they are.
- For good similarity matching it is important to have information about the relevant attributes +
and figure out which attributes are most important.

- Ex: Recommending motorbikes based on what people with your profile have previously bought.
o Amazon uses this to recommend books based on what you previously bought.
Clustering
- Group individuals in a population together based on similarities.
- Is not driven by a purpose, - and the groupings are not predefined.
- Is more open-ended than classification and regression (?).

- Ex. Do customers form natural groups? – this can later be used to design campaigns for each
group.

Co-occurrence grouping
- Looks at similarity of objects, based on if they appear together in transactions.

- Ex. What items are commonly purchased together?

Profiling
- Attempts to characterize typical behavior of an individual, group or population.

- Ex. Fraud detection – is a credit card used differently than it is normally used.

Link prediction
- Ex. If customers rated motorbikes, then they could get a recommendation based on their ratings
of the motorbike.

Data reduction
- Creates smaller dataset based on a bigger. Can make it more precise and easier to work with.

Causal modeling
- Help us understand what actions/events actually influences others.

- Ex. If targeted customers were more likely to buy motorbikes, was it then because we targeted
them or because they would have bought anyway?

o Done by A/B testing (one control group, then apply marketing to another group)
o Results in a causal conclusion with assumptions – there are always assumptions. Consider the
assumptions before going forward.

Supervised: Classification vs. regression


Tip to differentiate: is there continuity in the output? If yes, it is regression.

Classification
- Has a categorical (often binary) target.
- Goal: predict a class label – a choice from a predefined list of possibilities.
- Separated into two:
o Binary classification: distinguishing between exactly two classes (like yes/no) – one positive/negative class
o Multiclass classification: classification between more than two classes
- "Will this customer purchase service X if given incentive Y?" This is a classification problem because it has a binary target (the customer either purchases or does not).
- "Which service package (X1, X2, or none) will a customer likely purchase if given incentive Y?" This is also a classification problem, with a three-valued target.

Regression
- Has a numeric target.
- Goal: predict a continuous number/floating-point number (real number).
- "How much will this customer use the service?" This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer.
- Predicting a person's annual income.

Generalization, Overfitting and Underfitting

Generalization
 If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set. We want to build a model that is able to generalize as accurately as possible.
 There is a sweet spot in between that will yield the best generalization performance – a trade-off between overfitting and underfitting.

Overfitting
 Building a model that is too complex for the amount of information we have, as our novice data scientist did, is called overfitting.
 Occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data.
 Often occurs when the model is too complex.

Underfitting
 Choosing too simple a model is called underfitting.
 Occurs if your model is too simple—say, "Everybody who owns a house buys a boat"—then you might not be able to capture all the aspects of and variability in the data, and your model will do badly even on the training set.

Supervised ML Algorithms

k-Nearest Neighbors (k-NN)

k-NN Classification
 The simplest machine learning algorithm.
 Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset—its "nearest neighbors."
 In its simplest version, the k-NN algorithm only considers exactly one nearest neighbor, which is the closest training data point to the point we want to make a prediction for. The prediction is then simply the known output for this training point.
 Can also consider an arbitrary number, k, of neighbors. Here it uses voting to assign a label; for each test point, we count how many neighbors belong to class 0 and how many neighbors belong to class 1. We then assign the class that is more frequent: the majority class among the k nearest neighbors.

Analysis:
 Using few neighbors corresponds to high model complexity (as shown on the right side of Figure 2-1), and using many neighbors corresponds to low model complexity.

k-NN Regression
 When using multiple nearest neighbors, the prediction is the average, or mean, of the relevant neighbors.
 Evaluate the model using the score method, which returns the R² score (coefficient of determination), a measure of goodness of prediction. Value between 0 and 1, where 1 corresponds to perfect prediction.

Strengths
+ Easy to understand
+ Reasonable performance without many adjustments
+ Fast to build the model

Weaknesses
- When the training set is very large (in features or samples), prediction can be slow
- Important to preprocess the data
- Often does not perform well on datasets with many features; particularly bad with 'sparse' datasets
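A minimal k-NN classification sketch with scikit-learn (the toy dataset and the particular k values are illustrative assumptions), showing how train and test scores change with the number of neighbors:

# k-NN: few neighbors = complex model, many neighbors = simple model
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for k in (1, 3, 15):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"k={k}: train={knn.score(X_train, y_train):.2f}, test={knn.score(X_test, y_test):.2f}")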

Linear Models

Linear Models  Linear models make a prediction using a linear function of the input
features
Regression
Linear Regression -  Linear regression finds the parameters w and b that mini‐ mize the mean
Ordinary Least squared error between predictions and the true regression targets, y, on
Squares (OLS) the training set.
 Overfitting data
Ridge regression  Also a linear model for regression, so the formula it uses to make
predictions is the same one used for ordinary least squares.
 The coefficients (w) are chosen not only so that they predict well on the
training data, but also to fit an additional constraint

 We also want the magnitude of coefficients to be as small as possible; in


other words, all entries of w should be close to zero. Intuitively, this
means each feature should have as little effect on the outcome as
possible (which translates to having a small slope), while still predicting
well. This constraint is an example of what is called regularization.
Regularization means explicitly restricting a model to avoid overfitting.
 A less complex model means worse performance on the training set, but
better generalization. As we are only interested in generalization perfor‐
mance, we should choose the Ridge model over the LinearRegression
model.
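A small sketch (my own, with a bundled toy dataset standing in for real data) comparing OLS and Ridge in scikit-learn; score() returns R², and a large gap between train and test scores indicates overfitting:

# OLS vs. Ridge: regularization trades training fit for generalization
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ols = LinearRegression().fit(X_train, y_train)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)  # alpha controls how strongly w is pushed toward zero

print("OLS  :", ols.score(X_train, y_train), ols.score(X_test, y_test))
print("Ridge:", ridge.score(X_train, y_train), ridge.score(X_test, y_test))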

Other analytic techniques:


- Who is the most profitable customer? ← Use a standard database query (used when you already know what could be
interesting).
- Is there really a difference between the profitable customer and the average customer? ← Statistical hypothesis testing
can confirm or disconfirm.
- Who are the customers? Can I characterize them? ← Database query (1), then data mining techniques to find patterns
(2).
- Will some particular new customers be profitable? How much revenue should I expect this customer to generate? ←
Data mining techniques.
Exercise 2: Pandas
Topic: Foundations of data-analytic thinking
Learning Objective: 1
Today's exercise objectives:
- Learn the basics of the pandas library
- Practice finding programming solutions in documentation and other places online
- Get comfortable wrangling with dataframes in Python
Activities to be done before next class:
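A minimal pandas sketch in the spirit of these objectives (the column names and values are made up):

# A few pandas basics: building, inspecting, filtering, and aggregating a DataFrame
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 51, 29, 42],
    "churn": ["no", "yes", "no", "yes"],
})

print(df.head())                           # inspect the first rows
print(df.describe())                       # descriptive statistics for numeric columns
print(df[df["age"] > 30])                  # filter rows with a boolean condition
print(df.groupby("churn")["age"].mean())   # aggregate an attribute per group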
3. PREDICTIVE MODELLING
Topic: Predictive modelling
Learning Objective: 1
Syllabus: MLP p. 70-84; DSB p. 43-80
Activities to be done before next class: Readings
Exercise: Decision Trees

Session 3
Prediction, Information & Segmentation

What is a model?
A simplified representation of reality created to serve a purpose. Based on some assumptions.
It is simplified based on some assumptions about what is and is not important for the specific purpose, or sometimes based on
constraints on information or tractability.

Linear regression – simplifies the relation between 2 variables to understand it better.


Point: models are intentionally simplistic; they do not capture everything in reality. All models are wrong but some are useful – precisely because they're
simplistic. A crucial tool for turning big data into business value.

What is a predictive model?


A formula for estimating an unknown value of interest: the target.
The formula could be mathematical, or it could be a logical statement such as a rule.

A model for predicting things,


e.g., via conditional statements.
Distinct from descriptive or causal models;
we do NOT focus on descriptive or causal models here.

Supervised segmentation
The process of splitting (segmenting) data into subgroups depending on their attributes and a known target variable.
When done using values of attributes that will be known when the target is not, then these segments can be used to predict
the value of the target variable.

Decision rules/formula
One way of building a predictive model
Selecting informative attributes
How do we select an attribute to partition data in an informative way? Segment the data into groups that are as pure as
possible.
By pure we mean homogeneous with respect to the target variable.

Selecting informative attributes


"Informative" = segmenting data into
groups that are as pure as possible
Pure = homogeneous with respect to the target
variable

Example: will people say yes/no to a product package?


Split people into groups

Training data set:


Impure: one person in the segment still says no, thus not pure
Pure: every person in the segment says yes.

Measures to help us make informative segmentations:


 Gini impurity
 Entropy
 Information gain

What type of segmentation is the best?


Use the measures > they can be calculated for any given segmentation > they tell how good the segmentation is

Gini impurity

Gini impurity for a group with a binary target variable


Gini = 1 – (probability of "yes")² – (probability of "no")²

Measures the probability of incorrectly classifying a randomly chosen instance:
it indicates the likelihood of new, random data
being misclassified if it were given a random class
label according to the class distribution in the
dataset.

Ranges from 0 to 0.5 (binary target)
Worst = 0.5: like labelling randomly

0 = pure, completely certain

0% chance that you incorrectly categorize

a totally informative segment


0.5 = worst possible, impure
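A tiny Python sketch of the Gini formula above, just to make the 0 and 0.5 endpoints concrete:

# Gini impurity for a binary segment, following the formula above
def gini(p_yes: float) -> float:
    p_no = 1 - p_yes
    return 1 - p_yes**2 - p_no**2

print(gini(1.0))   # 0.0   -> pure segment, totally informative
print(gini(0.5))   # 0.5   -> worst possible, like labelling randomly
print(gini(0.75))  # 0.375 -> somewhere in between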

Entropy

Entropy for a variable with two possible outcomes


Entropy = – [(probability of "yes") × log2(probability of "yes") + (probability of "no") × log2(probability of "no")]
We want to reduce chaos/entropy

Ranges between 0 and 1


0: no uncertainty
1: all possible outcomes are equally probable

Entropy = 0 when we are certain of the outcome (e.g., certain the outcome is negative)

In between: when neither of the 2 outcomes is certain

A 50/50 split = entropy of 1

Information gain

IG = (Entropy before segmentation) – (Entropy after segmentation)

Measures how good/informative a segmentation is

Example: information gain = 1
when the left and right child segments both have entropy 0 and the parent has entropy 1;
since 1 – 0 = 1, this is a perfect segmentation

These measures decide how our models are built
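A small sketch computing entropy and information gain with the formulas above; the 50/50 parent split into two pure children is a hypothetical example reproducing the "perfect segmentation" case:

# Entropy and information gain for binary segments
from math import log2

def entropy(p_yes: float) -> float:
    if p_yes in (0.0, 1.0):   # a pure segment has no uncertainty
        return 0.0
    p_no = 1 - p_yes
    return -(p_yes * log2(p_yes) + p_no * log2(p_no))

# Hypothetical segmentation: a 50/50 parent split into two pure children
parent = entropy(0.5)                                 # 1.0
children = 0.5 * entropy(1.0) + 0.5 * entropy(0.0)    # weighted by child sizes
print("Information gain:", parent - children)         # 1.0 -> perfect segmentation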


Decision Trees

Terminology break: the anatomy of a decision tree

Top:
Root node – where the first segmentation takes place; a conditional statement:
is some attribute equal to some value?

True = follow the path to the left (left is true, right is false)
False = follow the path to the right

Branch node
Another conditional statement

Leaf node
Where the pure groups are supposed to be
Eventually we want to make pure leaf nodes

We end up with an interpretable tree structure,

using the measures to do supervised segmentation

Building a decision tree


Tree induction – from a dataset, making segmentations > a tree

Classify all data points in the training data


StatQuest example:

Binary attributes

Target attribute = churn (this is what we predict)

Identify the most informative attribute to make the first segmentation – try one attribute after another:
segment the data into two groups based on the first attribute, then repeat for the next attribute.

A homogeneous group – e.g., customers who did not churn – is what we want.

Which is the better segmentation? Apply the measures to
calculate how good each segmentation is.

Calculating Gini impurity


Use the proportion of 'yes' in each node to calculate that node's Gini impurity.

Total Gini:

Calculate the Gini for each branch, then take a weighted total to
account for the fact that there
might be more data in one node than the other –
unbalanced data in the nodes.

A total Gini close to 0.5 is pretty bad, as the worst is 0.5.

Do the same for the other
attributes – in the example, subscription is
the most informative, as its
Gini is 0.214.

Leaf nodes

For a continuous attribute, check which split point provides the most information:
a split on its own is not very informative, so
check all candidate split points to know whether
there is a better one.

In the StatQuest example the best splits are at 15 and 44,

but these are not lower than
the best attribute found so far.
Back to building our tree...
We figured out the root node > now the branch node >

Check the impurity at the left leaf node >


more calculation:

what is the most informative


attribute for the remaining four instances?

"Calls" is the most informative


attribute for the remaining four
instances > this is the next
branch node in our tree

Back to the tree

Tree induction gives a simple predictive


model.
Using our tree...

Run a new customer's (Sam's) attributes through the


tree > Sam will churn

What’s wrong with our tree?


 Overfit to the specific dataset used
 Ungeneralizable
 Probably not very good at making predictions with new, unseen customers
Should we trust our model's prediction?
It might make an accurate prediction,
but it might not be able to extrapolate to new data that comes in;
it assumes every new data point will follow the same pattern as the training data.
Overview of Decision Trees

What is a decision tree?


A non-parametric supervised machine learning technique that repeatedly segments data according to some measure of
informativeness, resulting in a model with a tree-like structure.

Supervised
Needs labeled data
Known target variable

(we couldn't calculate Gini if we didn't know churn)

Unlike linear regression,


decision trees don't need a linear relationship between attributes and target

Decision tree algorithms


Most widely used:
 ID3 selects informative attributes based on information gain
o Iterates through the data
o Only handles categorical attributes
 C4.5 selects informative attributes based on information gain or gain ratios
o Successor to ID3
o A few improvements:
o Handles continuous and categorical attributes
o Can handle missing data
o Pruning
 CART selects informative attributes based on Gini impurity
o Most generic: Classification And Regression Trees
o The default in scikit-learn
Classification trees
Intuitive
Classify into discrete categories
Attributes can be continuous or categorical

 Categorical target variable


 Can be binary, or multiclass
 Selects informative attributes by Gini impurity, information gain, or gain ratios

Regression trees have continuous values at the leaf nodes;

a classification measure (Gini) is not used for regression trees.

The measure is what we base our segmentation on.

Regression > numerical target values only

Regression trees
 Continuous target variable
 Selects informative attributes by the sum of squared residuals (SSR) or mean squared error (MSE)

How well a line fits

Decision trees are non-parametric.


Linear regression assumes a linear relationship through the data,

and does not work well if there is not a linear relationship between


attribute and target; decision trees make no such assumption.
https://www.youtube.com/watch?v=g9c66TUylZ4

Pros and cons

Advantages of trees
- Flexible
- Easy to implement
o No reliance on assumptions (handles missing data, no preprocessing)
- Easy to interpret – white box models

Disadvantages of trees
- Prone to overfitting
o Partition ALL data into categories; fitting every single data point in the training data set is not necessarily good when predicting
- Unstable
o Small deviations will result in different trees being built
- Computationally expensive
Pruning
1) The models are simple – useful because they're simple
2) The models tend to overfit the data they are trained on

So we put constraints on how complex the trees should be.

A technique for reducing a decision tree's complexity (and risk of overfitting) by removing the least informative branches.

Two pruning strategies:


§ Post-pruning (or, "pruning")
§ Pre-pruning (or, "tuning")

Pruning away = removing the small, noisy branches


that are based on a small sample of data > remove them

How do we know which branches are uninformative?

Post-pruning (or, pruning)


Building the tree but then removing or collapsing nodes that contain little information
E.g., minimal cost-complexity pruning: an algorithm that recursively trims leaf nodes to find a sub-tree that optimally balances
complexity and accuracy – for each node, calculate a measure of whether its information outweighs the cost of its complexity

Retroactive – done after letting the tree grow


Removes branches based on small samples

Pre-pruning (or, tuning)


Stopping the creation of the tree early – telling the tree how much it can grow before you plant it
Setting parameters

 E.g., max_depth: parameter specifying the maximum depth of the tree (note: max_depth = 1 means only a root node)
o Most common
o Depth = 1: a root node and 2 leaves
 E.g., min_samples_leaf: parameter specifying the minimum number of samples required to be at a leaf node
o Also reduces the computational expense of running the tree

Adjusting to a simpler tree > reduces the risk of overfitting (see the scikit-learn sketch below)
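A minimal pre-pruning sketch with scikit-learn's DecisionTreeClassifier; the dataset and parameter values are illustrative assumptions, and post-pruning (minimal cost-complexity pruning) is available via the ccp_alpha parameter:

# Pre-pruning: limiting tree growth with max_depth and min_samples_leaf
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X_train, y_train)
# post-pruning could instead be done with DecisionTreeClassifier(ccp_alpha=...)

# The fully grown tree typically fits the training data perfectly but generalizes worse
print("full  :", full.score(X_train, y_train), full.score(X_test, y_test))
print("pruned:", pruned.score(X_train, y_train), pruned.score(X_test, y_test))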


Ensembles of decision trees
Ensemble methods combine several decision trees to produce better predictive performance than utilizing a single decision
tree.
Reduce risk of overfitting by combining multiple decision trees

Rather multiple than a single tree

Two ensemble methods:


1. Bagging (or, “bootstrap aggregation”)
2. Boosting

Bagging (or, bootstrap aggregation)


Take many random samples from the data with replacement, build separate decision trees for each sample, then aggregate
predictions for new inputs
Lots of different trees > when new data point, every tree votes, aggregate, taking the average vote. Training many smaller trees
on smaller subsets of data > crowd effect

E.g., Random Forest: ensemble algorithm that randomly samples data and attributes to create many data subsets, builds
separate trees for each subset, and then averages the predictions for new inputs
Each tree is different
Additional parameters to tune: n_estimators (the number of trees in the forest)
More trees yield better accuracy but with diminishing returns; rule of thumb > use as many as you have time or computational power for.
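A bagging sketch using scikit-learn's RandomForestClassifier (toy dataset; n_estimators=100 is just a reasonable default assumption):

# Random forest: many trees on random data/attribute subsets, predictions averaged
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))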

Boosting
Successively build tree and combine them so that each new tree corrects for the errors of the previous one

Each new tree can learn from errors from earlier trees. The final will be the best one as it learns from its ancestors.

E.g., XGBoost: ensemble algorithm that implements parallel (rather than sequential) gradient tree boosting to minimize a
specified loss function
The loss function is used to find the errors; boosting optimizes this function.
Works well on structured data sets.
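A boosting sketch; it uses scikit-learn's GradientBoostingClassifier rather than XGBoost, but the xgboost library's XGBClassifier exposes a similar fit/score interface (dataset and settings are illustrative assumptions):

# Boosting: each new tree corrects the errors of the previous ones
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# learning_rate scales how much each successive tree is allowed to correct
boost = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0).fit(X_train, y_train)
print("test accuracy:", boost.score(X_test, y_test))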

Transparent models are more likely to be used(?)


Real-world applications
People can be averse to algorithms > they take advice from a human rather than from an ML model.
In CRISP-DM terms this is a failure in deployment:
there is no value in big data if people don't use its output.

People don't want to use algorithms they don't understand,

e.g., doctors using a model that makes predictions on patients.


Transparency doesn't always matter;
users might not care whether they understand the model.

Sometimes models are transparent because they have to be (by law):

GDPR

and other regulation on algorithms

Decision trees are among the most legally


accepted models in our time because they're
transparent and interpretable
Summary
 Predictive models are formulas or rules for estimating an unknown value of interest.
 Supervised segmentation — iteratively identifying informative attributes and partitioning data — is one
straightforward way of building predictive models.
 Informative attributes can be quantitatively identified with measures like Gini impurity and information gain.
 Decision trees are supervised machine learning algorithms (they need labels to train on) that perform supervised
segmentation to generate predictive models with a hierarchical, tree-like structure.
 Decision trees are easy to implement and interpret (pro) but are prone to overfitting (con).
 Pruning and ensemble algorithms can reduce overfitting.
 But using ensembles of decision trees can undercut transparency and interpretability, which is the key attraction of
decision trees.

Recap
1) Which statement is NOT true about models in the context of big data?
Models are intentionally simplistic
Models are useful because they are simplistic

2) What is a key disadvantage of decision tree models?

 Difficult to interpret
 Difficult to implement
 Lots of data preprocessing required. Some – but in general very little compared to other.
 Prone to overfitting

3) Which of the following are ways to reduce decision trees' risk of overfitting?
 Pruning
o Most prominent one.
 Tuning
o Variation of pruning – prepruning. How big can the tree grow.
 Bagging
o Ensembling methods – also reducing risk of overfitting
 Boosting
 All of the above
o
Ch. 3
The book, chap. 3

Predictive Analytics = using historical data as a basis to make predictions.


- The data mining process results in a model/classifier. This can be put to use and applied to new data. Then we see
which classes our data falls into.
- A contrast (not a strict difference) to descriptive modeling = the primary purpose of the model is not to estimate a value, but
to gain insights into the underlying phenomenon or process (it will show us what people who buy bikes look like).

Model induction
= creating models based on data (training data/labeled data (because the instances are labeled)).

Supervised Segmentation
- Supervised = data includes the target value (meaning; each instance is labeled)
- Segmentation = Segmenting data into subgroups based on what we want to predict,
→ by creating decision trees.

Selecting Informative Attributes


- THIS IS DONE TO GET A DATA SET THAT IS AS PURE AS POSSIBLE.
- First, we must find and select important/informative attributes. Informative = reduces uncertainty about something.

- We want the subgroups to be as pure as possible. = homogeneous with respect to target

- Technical implications
1. Attributes rarely split a group perfectly (= pure). If one subset is pure, the other might not be.
2. Is it best to have one pure subset or broader purity overall?
3. Not all attributes are binary... some have more values. Etc.

Technical implications can be addressed by evaluating how well each attribute splits a set into subsets → entropy, from Claude
Shannon.
Entropy: a measure of disorder that can be applied to the set, telling how impure the set or subset is.
- The higher the entropy, the harder it is to predict anything.

entropy(S) = −p1 × log2(p1) - p2 × log2(p2)

Based on the entropy, we can define IG:


IG: measures how much an attribute improves the overall entropy
- IG is a function of both parent set and children (subsets):
o Parent: (Global) Original set
o Children: Result of splitting on the attribute values.

The amount of information the attribute provides depends on how much purer the children are than their parents.

Calculating
Entropy of set:

entropy(S) = −p1 × log2(p1) - p2 × log2(p2)


entropy(S) ≈ −0.6677 × log2(0.6677) - 0.3323 × log2(0.3323)
entropy(S) ≈ 0.9172

Entropy of NumberChildrenAtHome:

Value | Instances in Total | Correct Proportion ((Total − Wrong)/Total) | Wrong Proportion (Wrong/Total) | Entropy (−p1 × log2(p1) − p2 × log2(p2)) | Share of Total (Instances/16519)
0     | 9990  | 0.8059 | 0.1941 | 0.71   | 0.6048
1     | 2197  | 0.7533 | 0.2467 | 0.8060 | 0.1330
2     | 1462  | 0.5896 | 0.4104 | 0.9767 | 0.0885
3     | 1066  | 0.6942 | 0.3058 | 0.8883 | 0.0645
4     | 952   | 0.7258 | 0.2742 | 0.8474 | 0.0576
5     | 852   | 0.8392 | 0.1608 | 0.6362 | 0.0516

Weighted entropy = 0.6048 × 0.71 + 0.1330 × 0.8060 + 0.0885 × 0.9767 + 0.0645 × 0.8883 + 0.0576 × 0.8474 + 0.0516 × 0.6362 ≈ 0.7620

IG of NumberChildrenAtHome:

IG = 0.9172 – 0.7620 = 0.1552

If we also did this for YearlyIncome, we would find that this variable gives purer splits than NumberChildrenAtHome.

The highest IG is typically the root node. – but not in our situation…
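A short Python check of the worked example above, computing the weighted child entropy and the information gain from the table's numbers:

# Reproducing the NumberChildrenAtHome example: weighted entropy and IG
from math import log2

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

parent_entropy = entropy(0.6677)                         # ~0.9172
groups = [(9990, 0.8059), (2197, 0.7533), (1462, 0.5896),
          (1066, 0.6942), (952, 0.7258), (852, 0.8392)]  # (instances, correct proportion)
total = sum(n for n, _ in groups)                        # 16519

weighted = sum((n / total) * entropy(p) for n, p in groups)
print("weighted child entropy:", round(weighted, 4))              # ~0.762
print("information gain:", round(parent_entropy - weighted, 4))   # ~0.155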

Segmentation if regression:
- Look at variance?

Entropy chart:

- This is about getting the smallest shaded area.

Classification tree / Decision tree:


- Used as prediction models.
- Tree induction takes a divide-and-conquer approach, starting with the whole dataset and applying variable selection
to try to create the “purest” subgroups possible.

Visualizing Segmentation for comparing decision trees


- It is often easier to compare decision trees based on how they partition the instance space instead of the number of leaves.
- For example, the figure above shows a simple classification tree next to a two-dimensional graph of the instance
space: Balance on the x axis and Age on the y axis.

Trees as sets of rules:


- Using IF-statements
- Classification trees are equivalent to rules.
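To illustrate "trees as sets of rules", a toy sketch where a small churn tree is written as nested IF-statements (the attribute names and threshold are made up):

# A decision tree expressed as rules: root node, branch node, leaf outcomes
def predict_churn(has_subscription: bool, monthly_calls: int) -> str:
    if has_subscription:            # root node
        return "no churn"           # leaf
    else:
        if monthly_calls < 10:      # branch node
            return "churn"          # leaf
        else:
            return "no churn"       # leaf

print(predict_churn(has_subscription=False, monthly_calls=4))  # "churn"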

Probability estimation:
- Is about assigning a probability of membership to each subset.
- Frequency-based estimate: the probability of a new instance being positive is p/(p+n), where p is the number of instances in the subset that belong to the class and n the number that do not.
- To avoid saying that a subset with a single instance gives a 100% probability (which amounts to overfitting), we use the Laplace
correction: (p+1)/(p+n+2).
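A small numeric sketch of the two estimates; the Laplace-corrected form below is the standard binary version (add 1 to the count and 2 to the denominator), which I assume is the one the book intends:

# Frequency-based vs. Laplace-corrected probability estimate for a leaf
def frequency_estimate(p: int, n: int) -> float:
    return p / (p + n)              # p positives, n negatives in the subset

def laplace_estimate(p: int, n: int) -> float:
    return (p + 1) / (p + n + 2)    # standard binary Laplace correction (assumption)

# A leaf with a single positive instance: the raw estimate overconfidently says 100%
print(frequency_estimate(1, 0))  # 1.0
print(laplace_estimate(1, 0))    # ~0.667 -- pulled toward 50% because the evidence is thin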

Building classification tree


- First, consider if we should stop or continue.
o If continue, then pick the best attribute.
o We can continue if two (or more) pure subsets are created, but not continue if it is a termination leaf node.
- Secondly, repeat
Exercise 3: Decision Trees
Topic: Predictive modelling
Learning Objective: 1, 4
Activities to be done before next class:

Exercises: Decision trees (and intro to scikit-learn)

Today’s exercise objectives


 Start working with scikit-learn (the machine learning library we'll use in this course)
 Implement a basic preprocessing pipeline for data analysis
 Be able to implement a decision tree classifier and interpret its output
 Simulate a real case with healthcare data that involves applying machine learning

Quick notes on preprocessing


 What’s a train-test split? And why do we do it?
o Using a portion of the data to train a model on, and a separate "untouched" portion of the data to test the
model on
o We want generalizable, predictive models that perform well on new, unseen data; model validation
(remember slide 39?)
 What does it mean to have imbalanced data? And why do we correct for it?
o When the target class labels in a dataset are unequally distributed
o Churn – giant data set 1% is churned.
o We want our models to be able to predict all possible classes
 What are dummy variables?
o Variables that binarize a categorical variable with no natural order

Dummy variable example


E.g., a categorical attribute with 3 possible classes/outcomes becomes 3 binary columns;

dummy variables are easier for the models to handle (see the pandas sketch below)


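A preprocessing sketch combining dummy variables and a train-test split (the column names and values are made up; stratify keeps the class balance similar across the splits):

# Dummy variables + stratified train-test split
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green", "red", "blue"],
    "age":    [23, 45, 31, 52, 19, 40],
    "churn":  [1, 0, 0, 1, 0, 1],
})

X = pd.get_dummies(df[["colour", "age"]])   # binarize the unordered categorical attribute
y = df["churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, stratify=y, random_state=0)
print(X_train.columns.tolist())             # colour_blue, colour_green, colour_red, age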
4. FITTING A MODEL TO DATA
Topic: Fitting a model to data
Learning Objective: 1
Syllabus: MLP p. 133-141; DSB p. 81-110
Activities to be done before next class: Readings
Exercise: Linear regression I

Session 4
What is a predictive model?
A formula for estimating an unknown value of interest: the target variable
The formula could be mathematical, or it could be a logical statement such as a rule.

Both regression and classification tasks


Decision rules rather than equations

What is parametric modeling?


A technique for building a predictive model whereby a formula is pre-specified and parameters are estimated (or “learned”) from some data.
Prespecifying a formula > assume that the data follows this general dynamic > estimate the parameters.
Non-parametric modeling > assumes no such pre-specified form.

Parameters are variables in a formula for which the values are unknown.

Example of parametric model formula:


- Relationship between x and y
- Predict Y
- Base prediction on knowledge about x
y = β0 + β1x1 + ε

Linear Regression
A supervised algorithm that learns to predict a dependent variable (”target variable”) as a function of some independent
variable (”attribute"), by finding a line that best "fits" the data.
At this point, just ONE attribute in the linear regression model. The same logic applies to multiple attributes.

The target variable must be a continuous value (since it is a regression). The attribute can be any type of variable (e.g., binary, categorical, continuous, ...).

Linear regression equation


The straight line: y = β0 + β1x + ε
 y: target variable (some number we want to predict)
 β1: coefficient (the slope of the line) – the strength of the relationship between x and y
 ε: error (everything our model is missing)
 β0: intercept (where the line crosses the y-axis) – can be called “bias”
 x: attribute (some feature we know that we base our prediction on)

The fitting task: figure out the values of β0 and β1.

Fitted linear regression equation

ŷ = β̂0 + β̂1x
 ŷ: predicted/estimated target variable
 β̂1: estimated coefficient (the slope of the line)
 β̂0: estimated intercept (where the line crosses the y-axis)
 x: attribute (some feature we know that we base our prediction on)

Differences from the previous equation:
- Hats: these are not the real values but estimates – we don’t know the real values.
- No error term.

Linear regression assumptions


Assumptions about the data (for the equation to be a valid description):
 Linearity: the relationship between the attribute(s) and target variable is linear.
 No multicollinearity: attributes are not correlated with each other.
 Normality of errors: prediction errors are normally distributed (not biased).
 Homoscedasticity: prediction errors have equal variance across the attribute range (homogeneity of variance).

Is your linear regression model good at predicting new, unseen data? For pure prediction you can largely ignore these assumptions – just be careful when interpreting coefficients: if the assumptions are not met, fitting the model to data can return incorrect interpretations.
Fitting a linear regression

Example from statquest

Fitting LR – to predict housing prices
- 6 instances, each representing a home sold in the last year (with its attributes, the independent variables).
- Dependent variable: house price.
- We assume a relationship between Size and price.
- Start by assuming a very simple linear model: ignore the size of the house and predict that every house’s price is the average price in our data. This corresponds to estimating the intercept b0 to be 471.
- We know this doesn’t fit well, so we quantify a metric for the fit.
- Calculate residuals: how far an observed (actual) data point lies from the estimated data point (the “hat” value). This quantifies how bad the line is.
- Calculate the residual sum of squares (RSS): sum the squared differences of actual minus estimated y.
- Remember: the horizontal line predicts that the price of each house is the average, i.e. the same prediction (471) for every house; the other value is the observed price from our training data.
- An RSS of 0 would be great, but RSS is influenced by the scale we are working on.
- Optimize the measure by minimizing it: find the line that minimizes the RSS.
- Visualise the optimization: RSS on the y-axis, the candidate (rotated) line on the x-axis.
- Initially RSS is a big number. Rotate the line a little bit > RSS goes down; rotating further reduces RSS more; rotating too far > RSS goes up again – we have passed the best fit of our model.
- Assume this is our best model: we estimate the intercept to be -38.9.

Fitting a linear regression: interim summary


 We set out to build a model to estimate/predict house price (continuous target variable) based on house size
(continuous attribute).
 Assuming a linear relationship between price and size, we fit a linear regression model to our data by finding the line
that minimized RSS — “Ordinary Least Squares (OLS)” estimation.
 This OLS estimation specified the following model: price = -38.9 + 0.94(size)
 -38.9 is our intercept, β̂0, where the line crosses the y-axis
 0.94 is our coefficient, β̂1, showing that our model estimates price to increase by $940 for each square foot of a
house’s size (remember our price scale? 0.94 × $1000 = $940)

Plug in size for the variable and output the Y (price)
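A minimal scikit-learn sketch of the same idea; the size/price pairs below are made-up stand-ins, not the actual values from the slides:

import numpy as np
from sklearn.linear_model import LinearRegression

size = np.array([[600], [800], [1000], [1200], [1400], [1600]])   # house size in sq. ft.
price = np.array([520, 710, 900, 1090, 1280, 1470])               # price in $1000s (hypothetical)

model = LinearRegression().fit(size, price)    # OLS: finds the line that minimizes RSS
print(model.intercept_, model.coef_[0])        # estimated intercept b0 and coefficient b1
print(model.predict([[1100]]))                 # plug in a size, get a predicted price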


Fitting a linear regression (with categorical attribute)

Binary attribute: whether or not the house has an ocean view (either yes or no).

Fitting the LR model:
- Initial simple model: predict everything as the average house price.
- Minimize RSS with OLS estimation.
- This gives the best fitting model.

Comparing the two fitted models:
- Are both fitted models equally good? How do we know which model to use? No – check the goodness of fit.
- First, we can evaluate by eye: the left model is best, as the line is closest to the data points (less variance); the right model doesn’t fit the data well and would make bad predictions.
- We want a quantitative measure of goodness of fit > R-squared.

OLS - R-Squared, or R2, equation


A measure of goodness of fit
A metric that represents the percentage of variance in a target variable (Y) that is captured by the attribute(s) - X; a measure
of a model’s ”goodness of fit.”

R-Squared ranges from 0 to 1. R-Squared = 1 indicates that a model captures 100% of the variance.
In principle it can be negative?

R-squared is also called the coefficient of determination.

R² = 1 − RSS/TSS
- TSS = total sum of squares: the RSS of a horizontal line through the mean of y – the “oversimple model”. This is the denominator.
- RSS = residual sum of squares – we know this! This is the numerator.
- R² expresses how much better the finalized model is than the initial, overly simplistic horizontal-line model.
- Favor the model with the highest R-squared.

However:
- It doesn’t take into account the number of attributes used. With the same number of attributes this is not a problem, but you can’t directly compare a simple model to a complex one.
- If a more complex model has a higher R-squared, it might just be overfitting the training data.

When fitting an LR model, it will output R².
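A small sketch of the R-squared calculation itself, assuming we already have observed and predicted values (numbers are illustrative):

import numpy as np

y_true = np.array([520, 710, 900, 1090])       # observed prices
y_pred = np.array([540, 700, 880, 1100])       # the model's predictions

rss = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
tss = np.sum((y_true - y_true.mean()) ** 2)    # RSS of the horizontal "mean" line
print(1 - rss / tss)                           # R-squared: closer to 1 = better fit

scikit-learn’s r2_score (and LinearRegression’s .score() method) computes the same quantity.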


Polynomial Regression: Adding Attributes
- We can add more attributes; the same logic applies.
- With more than one attribute, instead of a 2D line we fit a 3D plane (or higher-dimensional hyperplane), still making the residuals as small as possible.
- The relationship can be non-linear: the equation might look different – it is simply 2nd- and 3rd-degree polynomial terms that describe how much the line curves.
- We like simple models: too complex a model risks overfitting!
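A minimal sketch of polynomial regression in scikit-learn: the single attribute is expanded into polynomial terms and a linear model is fit on top (degree and data are illustrative):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.array([[1], [2], [3], [4], [5], [6]])        # one attribute (hypothetical)
y = np.array([2.1, 4.3, 9.2, 15.8, 26.1, 37.9])     # a curved relationship

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print(poly_model.predict([[7]]))                    # still "plug in x, get y"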
Logistic Regression
A supervised algorithm that can be used to classify data into categories, or classes, by predicting the probability that an
instance falls into a particular class based on its attributes.

The target variable must be a categorical value. The attribute can be any type of variable (e.g., binary, categorical,
continuous,... ).

Used for classification – NOT regression, despite the name (the name probably stems from it being an extension of linear regression).

Logistic regression equation


A mathematical equation – the logistic function – whose fit is optimized to whatever training data we have. It looks a bit more complicated:

p(y = 1) = 1 / (1 + e^(−f(x)))

 y = target variable (the class label we want to predict) – binary, one class or the other
 p(y = 1) = probability that the target variable is a certain class
 f(x) = linear predictor (the linear regression equation!) – can be complicated; today just a single attribute

Fitting a logistic regression


For classification

House size as attribute

Binary (not continuous) variable, whether or


not the house was sold above market
valuation.

Plot (to the right):
Y-axis = probability, from 0% to 100%, that the house sold for more than its valuation
X-axis = house size


- The fitted line is an s-curve (the logistic function); we shift/rotate the curve a bit to fit the data.
- We could calculate RSS, but we can’t use OLS here. Logistic regression tries to follow the same logic as linear regression, but uses something else.
- Logistic regression transforms the y-axis from probability to the log-odds scale, which runs from minus to plus infinity.
- Instead of calculating residuals, candidate lines are drawn on the log-odds scale and the data points are projected onto the line.
- Each point’s log-odds value is then converted back to the probability scale. For example, a log-odds of 2 translates back to a probability of about 0.88, which is plotted on the probability scale.

MLE – Log-likelihood

 “Maximum Likelihood Estimation”: finding the line that maximizes the log-likelihood. This plays the same role that minimizing RSS (OLS) plays for linear regression.
- For positive points, the likelihood contribution is the y-coordinate (the predicted probability); for negative (yellow) points it is 1 minus the y-coordinate.
- Higher log-likelihood = better.
- Keep trying candidate lines until we find the one that maximizes the log-likelihood (just as we minimized RSS for linear regression).
Classification threshold
Set a classification threshold: what is the cut-off on the probability scale for classifying instances as 0 vs. 1? This determines how the model will classify new instances.
Logistic regression interpretation

Focus on f(x). Important difference: these coefficients can’t be interpreted as easily as with linear regression.
 Intercept = -3.75 = the log-odds of selling above valuation when size is 0. Translating to probability:
exp(−3.75)/(1 + exp(−3.75)) ≈ 0.023, i.e. roughly a 2.3% chance of a house selling above valuation without taking size into account.

 Coefficient = 0.007 = the change in log-odds per 1 square foot of house size (also on the log-odds scale). Converting to odds – not probability: exp(0.007) ≈ 1.007. For each square foot of size a house has, the odds of selling above valuation are multiplied by 1.007 (the odds very slightly increase).

Fitting a logistic regression: interim summary


 We set out to build a model to classify/predict whether a house will sell for more than its market valuation (binary,
categorical target variable) based on house size (continuous attribute).
 Assuming a logistic relationship between the target and attribute, we fit a logistic regression model to our data by
finding the line that maximized log-likelihood — “Maximum Likelihood Estimation (MLE).”
 This MLE specified the following parameters:
 -3.75 is our intercept, β 0 , where the line crosses the y-axis on the log-odds scale
 0.007 is our coefficient, β 1, showing that our model estimates the log-odds of selling above valuation to slightly
increase for each square foot of a house’s size
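A minimal scikit-learn sketch of fitting and interpreting a logistic regression; the size/label data is made up for illustration (size in hundreds of square feet):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[5], [7], [9], [11], [13], [15]])   # house size (hundreds of sq. ft.)
y = np.array([0, 0, 0, 1, 1, 1])                  # sold above valuation? (0/1)

# Fit via maximum likelihood (scikit-learn adds some L2 regularization by default)
clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)                  # both on the log-odds scale
print(np.exp(clf.coef_))                          # odds multiplier per unit of X
print(clf.predict_proba([[10]]))                  # class probabilities for a new house
print(clf.predict([[10]]))                        # class label at the default 0.5 threshold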
Multi-class logistic regression
 One-vs-Rest – most common
o Train a separate binary classifier for each class
o Several binary classification tasks; start with this
 One-vs-One
o Makes even more models; computationally expensive
 Multinomial logistic regression

Multiclass classification tasks: how does it work with multiple class labels? Make multiple (binary) models and combine/compare them.

Why use logistic and linear regression instead of classification and regression trees?*
- Take this with a grain of salt – it is a generalization.
- Classification tree models give binary (hard) outputs; logistic regression gives probabilities, so you can adjust the threshold.
- Which to use depends on the real-world use case.

Loss functions
Loss functions in brief
A loss function is a quantitative measure of model error; a calculation of how good or bad a model is performing.

RSS and log-likelihood quantify how well models are performing based on goodness of fit (GOF). But GOF is not usually what we care about most – prediction accuracy is.

Common loss functions in machine learning:
 Mean Squared Error (MSE)
 Mean Absolute Error (MAE)
 Root Mean Squared Error (RMSE)
Think about what the errors mean in the context of your use case.
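A quick sketch of these loss functions using scikit-learn’s metrics (values are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.5, 7.0, 11.0])

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mse)                 # RMSE: back on the original unit scale
print(mse, mae, rmse)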

Gradient descent
An optimization algorithm that iteratively estimates a model’s parameter(s) to minimize a specified loss function.
With gradient descent we don’t have to use OLS or MLE directly. (For MLE we would use gradient ascent, since we maximize the log-likelihood.)

Starting from an initial estimate, the algorithm takes steps towards the minimized loss function, and decreases the size of the steps as the slope of the gradient decreases. In contrast to OLS and MLE, it is generic: it works with whatever loss function you specify.

- It takes big steps when the slope is steep; the steps decrease as the slope becomes flatter.
- When the slope is 0, it is done.
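A minimal gradient descent sketch for simple linear regression minimizing MSE; the learning rate and data are illustrative, not from the lecture:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # hypothetical attribute
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])     # hypothetical target

b0, b1 = 0.0, 0.0      # initial estimates
lr = 0.01              # learning rate (step size)

for _ in range(5000):
    error = (b0 + b1 * x) - y
    grad_b0 = 2 * error.mean()            # gradient of MSE with respect to b0
    grad_b1 = 2 * (error * x).mean()      # gradient of MSE with respect to b1
    b0 -= lr * grad_b0                    # step opposite to the gradient
    b1 -= lr * grad_b1

print(b0, b1)          # approaches the OLS solution (~0.14, ~1.96)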
Should we care about these “basic” models?
Reflect about bigger picture

Summary
 Parametric modeling is a technique for building a predictive model whereby a formula is pre-specified and parameters
are estimated (or “learned”) from some data.
 Linear regression is a parametric model that fits a linear function to data in order to generate continuous predictions.
 Logistic regression is a parametric model that fits a logistic, or sigmoid, function to data in order to generate
categorical predictions. For classifications
 Compared to decision trees, regression models are generally less computationally expensive, less reliant on a large
training dataset, more robust to variance, and provide more nuanced predictions.
 Compared to regression models, decision trees are easier to implement and interpret.
 Parametric models can be compared by their “goodness of fit.” However, the main marker of performance for most
machine learning tasks is predictive accuracy (...more on this in the coming weeks).

Question: What statement is NOT true about parametric models?
1. Parametric models make no assumptions about the relationship between attributes X and target variable Y. ← NOT true (this is the answer)
2. Parametric models pre-specify a formula and estimate its parameters from some data. (e.g., linear regression)
3. Linear and logistic regression are examples of parametric models.
4. Parametric models can be evaluated by their "goodness of fit."

Question: Linear regression models are often fit to data with Ordinary Least Squares (OLS) estimation. What does OLS optimize?
1. R-squared – no, that is a goodness-of-fit measure
2. Residual sum of squares (RSS) ← correct: the distance between the actual observed and estimated data points
3. Log-likelihood – no, that belongs to logistic regression
4. Mean Squared Error (MSE) – could be relevant, but it is not what OLS minimizes (gradient descent could target it)

Question: You've fit a logistic regression model to some data that outputs a coefficient of 0.70. What does this coefficient tell you?
1. There's a 70% chance of Y being assigned a class label of 1 without taking X into account.
2. The log odds of Y is expected to change by 0.70 per one unit change in X, on average. ← correct (everything is converted to the log-odds scale when fitting)
3. Y is expected to change by 0.70 per one unit change in X, on average.
4. For each unit of X an instance has, the odds of Y being assigned a class label of 1 is multiplied by 0.70.

References:
Regression models are complicated.
Here are some resources I used in this lecture:
 Linear regression:
o https://mlu-explain.github.io/linear-regression/
o https://www.youtube.com/watch?v=PaFPbb66DxQ&t=400s
o https://www.youtube.com/watch?v=nk2CQITm_eo&t=907s
 Logistic regression:
o https://mlu-explain.github.io/logistic-regression/
o https://www.youtube.com/watch?v=yIYKR4sgzI8&t=75s
o https://www.youtube.com/watch?v=BfKanl1aSG0&t=302s
Ch. 4
The book, chap. 4

Parametric Modeling
- We start by specifying the structure of a model (e.g., a line) and leave certain parameters unspecified. → The learning
process is to find the best values of those parameters.
- We want to know parameters or numerical values from the training data.
- Once we have established values for the parameters, we will have a model that can produce the predictions we
want – for classification or regression.
- We look for some kind of mathematical formula that combines the attributes in the best way, to get useful
information about the target value.
o There are many kinds, but we can get a long way with:
 a linear model, that describes the line that best separates the data. (also natural for regression).

Another type/process of predictive modeling instead of Decision tree:


Parameter fitting:
- Can be used both for classification and regression
- It is about finding a discriminant function (the mathematical formula that combines the attributes in the best way). It specifies the structure of the model (the line) and leaves some parameters unspecified = a partially specified equation.
- And about estimating/finding the best parameters to fit the dataset – the parameters weight the attributes so as to give useful information about the target value.
o Based on training data.

Variety of parameter fittings:


Linear modeling techniques
- Each uses different functions to find a discriminant function.
- Linear classifier: Separates the instances by introducing a boundary.
Linear regression
- Linear discriminant function:
o Y = ax+b
o a = weight
o Multiple: Y = a1*x1 + a2*x2 +b
- Classification function:
o Class(x), + if …., * if ….
Finds best using least square method to create an
objective function.
- Takes the distance from the line (absolute value) and add them together.
o Takes the weight for each attribute and adds the weighted attribute values together.

- It has to be a numeric value! (ex. 0 or 1)

- Equation:
o F(x) = w0 + w1*x1 + w2*x2 …….

Prediction problem:
- We want predicted outcome to be as close to the real outcome as possible.

SVM (Support Vector Machines)


- A way to find an objective function (hyperplane)!
- Finds the widest bar between the attributes and the discriminant function will be in the center
line through the bar.
o The wider bar, the more generalizing.
- The attributes can fall into the margin, that is not a problem.
Logistic regression
Its name is something of a misnomer—logistic regression doesn’t really do what we call regression, which
is the estimation of a numeric target value. Logistic regression applies linear models to class probability
estimation, which is particularly useful for many applications.

- Equation:
o F(x) = w0 + w1*x1 + w2*x2 …….

- What are the chances that something happens?

Classification via Mathematical functions


Decision boundaries

Instance Space-view of tree models

A dataset split by a classification tree with four leaf nodes: the instance space is broken up into regions by horizontal and vertical decision boundaries that partition it into similar regions.
Examples in each region should have similar values for the target variable.
A main purpose of creating homogeneous regions is so that we can predict the target variable of a new, unseen instance by determining which segment it falls into.
A linear classifier, by contrast, is essentially a weighted sum of the values for the various attributes.

We want new instances to be as far away from the discriminant function (decision boundary) as possible.

Unsupervised Learning

Types of unsupervised learning:


1. Unsupervised transformations = Unsupervised transformations of a dataset are algorithms that create a new
representation of the data which might be easier for humans or other machine learning algorithms to understand
compared to the original representation of the data.
- Dimensionality reduction: takes a high-dimensional representation of the data, consisting of many features,
and finds a new way to represent this data that summarizes the essential characteristics with fewer features.
- Finding the parts or components that “make up” the data. An example of this is topic extraction on
collections of text documents.

2. Clustering algorithms = Partition data into distinct groups of similar items.


- Example: Uploading pictures to social media site and organizing them into pictures of the same persons

Challenges in Unsupervised Learning


 Evaluating whether the algorithm learned something useful. Unsupervised Learning is usually applied to data that
does not contain any label information, so we don’t know what the right output should be. Often the only way to
evaluate the result of an unsupervised algorithm is to inspect it manually
 Usages:
o In exploratory settings, when a data scientist wants to understand the data better, rather than as part of a
larger automatic system.
o As a preprocessing step for supervised algorithms. Learning a new representation of the data can sometimes
improve the accuracy of supervised algorithms, or can lead to reduced memory and time consumption.
Exercise 4: Linear regression I
Topic: Fitting a model to data
Learning objective: 1, 4
Activities to be done before next class: –

Today’s exercise objectives


§ Learn the basics of fitting a model to data
§ Fit a linear regression model with scikit-learn
§ Fit a logistic regression model with scikit-learn
§ Interpret the outputs of the models
5. OVERFITTING AND ITS AVOIDANCE
Topic: Overfitting and its avoidance
Learning objective: 1, 3
Syllabus: DSB p. 111-140
Activities to be done before next class: Readings
Exercise: Linear regression II

Session 5
Generalization & Overfitting

Definitions

What is machine learning?


A field of artificial intelligence involving computer algorithms that ‘learn’ by finding patterns in sample data, and apply these
findings to new data to make predictions or provide other useful outputs.

Don’t worry about the specific definition > the interesting part is what is highlighted.

What is a predictive model?


A formula for estimating an unknown value of interest: the target.
The formula could be mathematical, or it could be a logical statement such as a rule.
 Different kinds – parametric, non-parametric.

What is generalization?
The property of a model or modeling process, whereby the model applies to data that were not used to build the model.
If a model cannot generalize well to new data, then it will not be able to perform the prediction task that it was intended for.
 Key concept of ML: Generalization
o This property of ML models enables them to handle big data. Focus on generalization > how well they predict
new data.
o How do we evaluate ML models > on how well they do generalization.
 Side note: Humans develop skills that are generalizable. That is something that AI lacks
o AI is narrow intelligence

 Example: Tesla’s AI
o A horse carriage drives in front of the car – on the screen in the car it is shown as something else entirely.
o The car doesn’t recognize horse carriages. This is an example of a failure of generalization – the ML model doesn’t know what it’s encountering. It only knows what it was trained on, and it hasn’t been trained on horse carriages.
o There are 2 ways we can explain this failure: under- and overfitting.

What is overfitting?
When a model fits its training data so well that it cannot perform accurately on new, unseen data.
If the model fits perfectly, it might show a perfect R-squared (1). This can be a red flag: the model has memorized all the data – including the noise – so when applied to unseen data it will predict the same noise it was trained on. The model is tailored so perfectly to the training data that it comes at the expense of generalization.

It occurs when a model is too complex, perhaps as a result of too much training, too many input features, or not enough
regularization. Too many attributes
 Troublesome – hard to spot because it is often related to a “perfect” model (perfect r-square indicating that the
model is perfect)
 Deep learning, can happen when letting a model train too much
What is underfitting?
When a model is unable to capture the relationship between the input and output variables accurately, generating a high
error rate on both the training set and unseen data.
It occurs when a model is too simple, perhaps as a result of a model needing more training, more input features, or less
regularization.

 More obvious, easier to spot.


 When the model is overly simplistic > we don’t want too-simple models either.
 Not all can be captured by linear regression
Under- vs. overfitting models
Under- vs. overfitting regression models
Hypothetic linear regression – one attribute

Underfit:
 Too simple a model
 It doesn’t capture non-linear relations between x and y
 Bad R-squared

Optimal:
 Captures the quadratic relationship between x and y, but does it simply – it doesn’t bend over backwards to capture every data point

Overfit:
 More complicated, too complex – a high-degree polynomial goes through all data points
 R-squared of 1 – indicating a “good” model, but this is wrong
 The model will also be capturing the noise from the training data it was trained on

Under- vs. overfitting classification models


Classification problem with 2 attributes: classifying instances as either black or blue (2 classes); classification applies the same logic as regression.
Making decision boundaries: fit a line to the data in order to separate the data points.
Underfit:
 Too simplistic > essentially guessing a partition

Optimal:
 The optimal partition – a straight line

Overfit:
 In the real world we don’t want the weird curves.
 It might be a mistake that the blue point lies among the black ones; the overfit model captures this noise from the training data, and will also apply it to new, unseen data.
Bias and Variance
Understand 2 key concepts in order to understand under- and overfitting.

Bias (the statistical kind)


The difference between the average prediction and the true value; a measure of systematic error.

The “average model”: imagine running the modeling process on many independent datasets and averaging the predictions. In practice it is difficult to identify whether we are observing a bias (systematic error) or just a one-off mistake.

Bias is bad > but should we minimize it at all costs?

Variance
Variance = How much predictions vary for a given data point on average; a measure of how “sensitive” a model is.
A small change in X resulting in a huge change in Y indicates a sensitive model.
Again uses the concept of the average model: variance measures how widely dispersed the model’s predictions are.

The bias-variance tradeoff


These statistics tell us how much error we will observe when we apply an ML model to data.

 Mean squared error (the most common loss function in prediction models) decomposes into:
o Bias squared
o Variance
o (plus irreducible noise)

Expected error (MSE) = Bias² + Variance (+ irreducible noise)

We want to minimize both bias and variance. The issue: it is hard to drive bias to 0; we are trying to minimize the observed error by pulling on 2 levers at once. What does it mean to pull the 2 levers – bias and variance?

X-axis: model complexity. Y-axis: error (mean squared error).
 As we progress along the x-axis (more complexity), bias decreases and variance increases. This shows that too much complexity backfires.
 As we progress in the opposite direction (less complexity), bias increases and variance decreases. This shows that too little complexity also backfires.
Optimal: we want to build models with the optimal level of complexity. The chart also shows where over- and underfitting occur…

Underfitting
Left side: The model doesn’t capture
the relationship between x and y. It’s
too simple.

Low variance (consistent predictions),


but high bias (wrong predictions)

Overfitting

Right side: Too high complexity

Low bias (perfect predictions


sometimes), but high variance

The bias-variance tradeoff in words


= The tension of trying to minimize error by simultaneously reducing bias and variance.
Techniques for reducing bias often increase variance, and techniques for reducing variance often increase bias.
Visualizing bias and variance:

- Low bias (on the mark), low variance (consistent)
- Low bias (on the mark), high variance (inconsistent predictions)
- High bias (off the mark), low variance (consistent) – hitting consistently wrong
- High bias (off the mark), high variance (inconsistent)
Holdout data
How do we deal with bias and variance?
Partition data in test and training data “Train test split”

Do we have a large enough dataset? If yes: partition the data. Use 70-80% as training data; the remaining 20-30% we don’t touch – this is holdout data, used to check generalizability, not just fit (test data).

Holdout data = data for which we know the value of the target variable, but which will not be used to build the model (i.e., “test data”). Partition the data > build the model on the rest.
This implies that separate “training data” is used to build the model.
Implies that separate “training data” is used to build the model.

Training data – the data used to train or build (“fit”) the model. The model learns patterns and relationships in order to make predictions later on.
Be careful: if there is bias in the training data, the model will inherit this bias and apply it to new data.

Test/holdout data – the test set, put aside when building the model (20-30%).
We evaluate the model on it. Don’t touch this data – save it for when the model is done. Looking at this data before the model is done invites overfitting. Think of it as a “little experiment”.

Analogy: we build a theory on the training data; in order to test the theory, we test it on our “test set”.

Holdout data alternative scheme

Partition the data into 3 sets:
- Training data: build our model > the default model.
- Validation set (~20%): observe the error and adjust/tune the model. Why? If we use the same data for both training and tuning, we might overfit – the model memorizes and gets tailored to one specific dataset. We avoid this by making predictions on the validation set and tuning/adjusting based on that.
- Test set (~20%): don’t touch this before the model is done; only run it once.

Detection of under- and overfitting

Compare goodness of fit on the training data and the test data; this gives an idea of how well the model generalizes.
- Good on training but bad on test > overfit: the model fits the training data too well.
- Bad on both > underfit: the model is biased, not capturing anything, no general signal.
- Optimal: performing well on both. To get a balance we may even accept a decrease in the training score.
- Lucky: bad on training, good on test.

The fitting graph
- Vary the complexity and you get different results on the test data.
- Run a train-test split and plot both test and training scores for each level of complexity.
- 2 key points:
1. There are often benefits to either increasing or decreasing complexity. Don’t care about a perfect training score – that is usually a red flag.
2. The sweet spot: the complexity level where the training and test curves begin to move away from each other.

Occam’s Razor, or the principle of parsimony


A scientific and philosophical rule stating that the simplest of competing theories be preferred to the more complex.

A simple model that solves complex problems is preferred.


A note on R-squared, or R2
 R-squared typically ranges from 0 to 1... but when applying the model to new, unseen data to make predictions, R-
squared can be negative.
o
 R-squared = 1 indicates that a model captures 100% of the variance.
 R-squared < 0 indicates that a model is a worse fit than a horizontal line for some given data.
o Can’t see negative, probably some overfitting.

Correction: true – R-squared is a measure of GOF for linear regression, and it is not negative on the training data.

Cross-Validation & Regularization


High-level: identifying and mitigating overfitting. We want useful, generalizable models.

Navigating the bias-variance tradeoff. The challenge: avoid overfitting > avoid memorizing, only pick up generalizable signals.
How can we as researchers avoid the risk of overfitting?

How can we avoid overfitting?

4 solutions to avoid overfitting:
 Cross-validation
 Regularization
 Ensemble methods: combining many models to cancel out their mistakes
 Get a larger and more representative training dataset
o Having the right data and understanding the data available – think critically about it: is it representative of the population we are applying the model to?
Cross-validation, or k-fold cross-validation
Holdout data, or the train-test split

Example of cross-validation
Partition the available data so that it enables this experimentation: we want to make the most of both the training and the test data.
- The test data gives a sense of how generalizable the model is, but if the test set is small we can’t be sure of the score.
- Purpose: run an experiment – build on one part of the data, test generalization on untouched data.
- What can we do to make the most of the available data? > k-fold cross-validation.

k-fold cross-validation
A method of evaluating a model’s performance whereby data is split into k subsets (“folds”) and the train-test evaluation is
repeated k times, such that each subset serves as the test set once.
Test scores are returned by each fold and averaged together for a final score.

The fold scores are averaged together; also look at the standard deviation. If the scores vary a lot, that is a concern – there is not much stability in the way the model generalizes.

Step by step (k = 5):
1. Partition the data into 5 subsets (“folds”) – many splits instead of just 2.
2. Fold 1 is the test set; keep the remaining 4 for training.
3. Then put aside fold 2 as the test set and train on the other 4.
4. Fold 3 as test, use the rest as training.
5. Fold 4 as test, output a test score; and so on – each fold gets the opportunity to be the test data.
Finally, calculate the mean and standard deviation of the fold scores > more robust. The score will differ depending on which data is used for test and training.
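A minimal k-fold cross-validation sketch in scikit-learn (the data is synthetic, generated just for illustration):

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                                        # 100 instances, 3 attributes
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = cross_val_score(LinearRegression(), X, y, cv=5)    # 5 folds; each is the test set once
print(scores)                                               # one R-squared score per fold
print(scores.mean(), scores.std())                          # average and spread (stability)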

k-fold cross-validation alternative scheme


20-30% for test data >

For remaining training data you do


cross-validation

When happy with result from cross-


validation >

Then apply final model to test data for


final evaluation

Cross-validation considerations

 Useful for cases where you have small datasets and you need to utilize every little bit of information to develop your model.
o With limited datasets there is no need to get too complicated: instead of a single split, systematically swap which data is used for training and test.
 Computationally expensive when using very large datasets and/or complex models (e.g., deep learning models): partitioning, training and testing repeatedly takes many resources.
 There are different types of cross-validation; k-fold is the most commonly used, and the one we focus on in this course.
Regularization
Calibrating a model’s fit to data against the model’s complexity; “desensitizing” a model to its training data – the training data might not be perfect, so don’t take its small peculiarities too seriously.

Regularization techniques include pruning, imposing explicit complexity penalties in the model’s algorithm, and feature selection – all for the sake of generalizability.

Another method to avoid overfitting:
- Add a term to the RSS that reflects the complexity of the model. We look at the L1 and L2 norms.
- Also consider feature selection: prevent the model from accessing all the attributes. Models with many attributes are complex.

L2 Regularization, or Ridge Regression


A linear regression model with an adjusted loss function that penalizes coefficient size.

As the penalty term λ increases, coefficients (slope) decrease, due effectively to less priority being given to minimizing RSS. If
λ = 0, then a standard OLS line is fit. As λ approaches infinity, the slope gets asymptotically close to 0.
The penalty term is often denoted as lambda λ , or alpha α

Ridge regression assumes the relationship between x and y might not be as steep as it appears in a small training sample once we move to bigger data / the population.

Lambda is set by the programmer – it is not learned from the data.

The L2 norm: the bigger the coefficients, the bigger the L2 norm. As lambda increases, the L2 norm term is weighted more heavily.

Example:
- With a small training set, OLS linear regression fits the training points well, but on new, unseen data (orange) the fit is not good.
- We could have used ridge regression instead, if we doubt that the training data is representative.
- With one attribute there is one coefficient. Optimizing the ridge objective instead of plain OLS, the fitted line depends on what we set lambda to: the line becomes more horizontal as we increase lambda, and close to horizontal when lambda is increased a lot.
- (Note: the regularized lines here are purely illustrative; they have not actually been fit to the points depicted.)
- As lambda increases, the L2 norm gets weighted more than the RSS, and in the limit we end up with a horizontal line.
- Why assume the relationship is weaker? Because it will often be less steep in the population than in a small sample – but there is no guarantee.
- (Blue curve: with smaller training data, ridge regression gives a steep increase in score.)
increase in score
L1 Regularization, or Lasso Regression
A linear regression model with an adjusted loss function that penalizes model complexity, and is capable of zeroing out
coefficients.
The penalty term λ functions the same as with ridge regression, but here, some coefficients may be reduced to exactly zero.
This means some features are entirely ignored by the model.

Similar to ridge regression: both impose bias with the purpose of reducing variance.

Key difference:
- Ridge regression can lead to small coefficients, but they can’t be reduced to exactly 0, so it always includes the same attributes.
- L1 (lasso) applied to linear regression can remove attributes: coefficients get fitted one at a time, and an attribute’s coefficient can be set to 0 if it doesn’t improve the model – effectively doing feature selection.

The L1 norm is the important change in the penalty term (absolute values of the coefficients instead of squares).

L1 vs. L2: compare them and try to understand the difference. L2 (ridge) is often the default choice:
1. Lasso is more computationally expensive for statistical software – coefficients are fitted one at a time.
2. Ridge handles correlated attributes more gracefully, while lasso tends to pick one of them.

- If you have a large number of features and some might not be relevant for your model > use lasso (L1).
- If you think/know all attributes/features might be important > use ridge (L2).

Other regularization techniques and hyperparameters
 Elastic net
o Combines both L1 and L2 at the same time.
 Stepwise regression
o A standard statistical approach: incrementally add features, using statistical significance to decide what to include in the final model.
Other hyperparameters that regularize models include:
 max_depth and min_samples_leaf for decision trees
 n_estimators for ensembles of trees
 C for logistic regression and support vector classifiers
o Functions in a similar way to lambda (read the documentation): it must be positive, and small values mean a lot of regularization is applied.
 ... and many more.

What are hyperparameters?


External model parameters that control how the model is built.
Whereas model parameters are estimated/learned from some data, model hyperparameters are specified and tuned by the programmer.

We can optimize lambda through grid search; it is not estimated as part of fitting/prediction.

What is grid search?


A method for hyperparameter tuning that exhaustively generates and compares candidate models from a grid of possible
(hyper)parameter values.

When implemented, an algorithm iterates over a (hyper)parameter grid containing the prespecified range of values to evaluate
for each (hyper)parameter.
It is an automated way to report a score for each candidate and pick the best one.
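A minimal grid search sketch with scikit-learn, tuning the ridge penalty; the candidate values and data are illustrative:

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)
X = rng.rand(100, 4)
y = X @ np.array([1.5, -2.0, 0.0, 0.0]) + rng.normal(scale=0.2, size=100)

param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]}   # the (hyper)parameter grid
search = GridSearchCV(Ridge(), param_grid, cv=5)        # every candidate is cross-validated
search.fit(X, y)
print(search.best_params_, search.best_score_)          # best alpha and its average CV score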

Case study: overfitting and the GIGO principle


An example of what can go wrong if we don’t take today’s concepts into account: the garbage-in, garbage-out (GIGO) concept.
- A study claimed to predict/classify criminality based on pictures of faces, reporting accuracies of around 90% – which sounds impressively high.
- Conclusion drawn: the ML model is picking up on the smallest signals that humans cannot.
- Image set A: “criminal” faces; image set B: “non-criminal” faces.
- Concerns from an ML standpoint: the target variable is being convicted of a crime – would a judge like you? The model may simply be picking up features of attractiveness and other facial features, such as smiling.
Takeaways:
 Garbage in, garbage out: a machine learning model can be only as good and unbiased as the training data provided to
it.
o If the model is overfitting, consider getting better training data.
 Overfitting (?) and too-good-to-be-true performance: be wary if your model is producing remarkable results based on
unremarkable features.
o Too-good-to-be-true performance is probably not true.

 Occam’s razor: do not appeal to the extraordinary (neural networks picking up features that the human brain does
not) when the ordinary (smiles) is sufficient.

Summary
 Useful machine learning models are generalizable.
 Overfitting is when a model is fit so precisely to the data on which it was trained that it is no longer able to generalize
to new, unseen data. No longer useful
 Avoiding overfitting means dealing with the bias-variance tradeoff.
 To avoid overfitting we can use cross-validation, regularization, ensemble methods, and/or get a larger and more
representative training dataset.
 Cross-validation involves partitioning your dataset into training data (for building a model) and holdout/test data (for
evaluating a model).
 k-fold cross-validation is a method of evaluating a model’s performance whereby data is split into k subsets (“folds”)
and the train-test evaluation is repeated k times, such that each subset serves as the test set once.
 Regularization involves calibrating a model’s fit to data with the model’s complexity; “desensitizing” a model to
training data.
 Regularization often entails tuning hyperparameters, which can be done systematically with grid search.

Question: Which statement is NOT true about overfitting?
1. It occurs when a model is too complex
2. It's observed when a model performs well on training data but poorly on test data
3. It occurs when a model is too simple ← NOT true (this is the answer)
4. Techniques for avoiding it include regularization, cross-validation, ensemble methods, and getting better training data
In general, overfitting happens when the model is too COMPLEX: it bends over backwards to fit the training data and captures its peculiarities (noise) instead of general patterns.

Question: If you have a model displaying 100% accuracy (or an R-squared of 1.0) on some training data, your model probably has...
1. Low variance, low bias
2. Low variance, high bias – also possible
3. High variance, high bias
4. High variance, low bias ← most likely
It depends on the data – we can't know for sure. But if a model shows 100% on training data, it is probably overfitting; we don't want a perfect training score, as it indicates overfitting = HIGH variance (a small change in X gives a great change in Y).

Question: Which of the following is NOT true about hyperparameters?
1. They can be directly estimated from training data ← NOT true (this is the answer)
2. They control how a model is built
3. They are specified and tuned by the programmer
4. A common method for tuning them is grid search
Grid search: try different settings and find out what performs best.

Some resources used in this lecture:


Generalization and the bias-variance tradeoff:
 https://mlu-explain.github.io/bias-variance/
 https://www.ibm.com/cloud/learn/overfitting
 https://www.ibm.com/cloud/learn/underfitting
Regularization:
 https://www.youtube.com/watch?v=Q81RR3yKn30&t
 https://www.youtube.com/watch?v=NGf0voTMlcs
Case study:
 https://www.callingbullshit.org/case_studies/case_study_criminal_machine_learning.html

Ch. 5
The book, chap. 5

Evaluating a model
- When evaluating the model, it is not enough to look at the accuracy for the dataset used to build the model (that is
called a table model, and does not generalize), you need some test data.
- Generalization; The model should generalize from the training data! – then you can later test the model on the
test data (Holdout data)
- Overfitting; If the model is closely fitted to the training data, in a way that does not generalize to other data.
- As the model becomes more complex, the errors go down for both holdout and training data at first. But the holdout-data error does not continue to decrease, because of overfitting.
- Looking at the table model: the holdout-data performance stays at the base rate b, because the new instances are not in the table. Training-data performance keeps getting better until all rows are memorized.

Overfitting in tree induction:


- With decision tree, you can keep splitting nodes until you get single instance for each leaf node.
- Decision trees are very likely to overfit, as they will find classification for new instances.
- Below is a fitting graph for tree induction. Sweet spot = where overfitting begins to occur and the performance
on the holdout data decreases.
o By using holdout data, we get a chance of noticing the overfitting.

Technique for evaluating: How good is the model really?


- When looking at accuracy on training data, you can’t assume that the accuracy will be similar with new data.

Hold out data:


- = Hold out some data and see what the accuracy is on this data.
- How do we know that the data is representative?
Solved by cross validation.

Cross validation:
- All data will be held out once
- You get the test results five times and take the average of this.

Learning curve
- Generally, models tend to improve as the size of the training data increases, but the improvement is typically not constant.
- At some point there is no longer much value in adding training data.
MinNumObj
- Overfitting happens when model gets overly complex.
- Avoid overfitting by changing min. number of instances per leaf.
- A small number of instances = not much reason to think that the decision is accurate!
- Prune = to cut back (trim branches).

- MinNumObj removes branches if they do not provide power to classify instances.


- Instances are represented in parent node if children are lower than minimum.

Key points:
- You can’t assess a model by looking at its performance on training data.
- You want the model to Generalize – apply well to data other than training data.
- There is a danger of Overfitting – model is closely fitted to training data, in a way that doesn’t generalize to other
data.
- To assess a model, you need to use holdout data – keep some data from the training data, to be used for test data
- Cross validation does this systematically with many different partitions (or folds) in the dataset
Quiz:
- The red line in Figure 5-2 shows the base error rate. Let's say that, overall 45% of customers churn, and 55% do
not. What is the value of the base error rate in this case?
o 45%.
What would b be? Since the table model always
predicts no churn for every new case with which it is
presented, it will get every no churn case right and
every churn case wrong. Thus the error rate will be
the percentage of churn cases in the population.
This is known as the base rate, and a classifier that
always selects the majority class is called a base rate
classifier.

- Where does overfitting start to occur in Figure 5-3?


o 100

- You have a data set with 500 instances, and you will perform
10-fold cross-validation. In this case, how many instances of
data will you use for testing in each iteration?
o 50 instances will be in one fold.

- You have a data set with 1000 instances, and you want to
perform 5-fold cross-validation. In this case, how many
instances will you use for training in each iteration? → 800
(because: 1000/5 × (5−1) = 800)
Exercise 5: Linear regression II
Topic: Overfitting and its avoidance
Learning objective: 1, 4
Activities to be done before next class: Send one-page project proposal to JB for written feedback (optional)

 Implementing a multiple linear regression model with a train-test split


 Detect overfitting
 Implement cross-validation
6. SIMILARITY, NEIGHBOURS AND CLUSTERING
Topic: Similarity, Neighbours and Clustering
Learning objective: 1
Syllabus: MLP p. 170-178; DSB p. 141-186
Activities to be done before next class: Readings
Exercise: k-means and clustering

Session 6
Similarity and Neighbors
DP = data points
Sim = similarity
Metr = metric

What is similarity?
Core idea of similarity

A way of measuring how alike, close together, or related data are, usually involving a so-called distance metric.
Similarity can be used for supervised and unsupervised machine learning; it can be used for classification, regression, and
clustering tasks.
 Similarity is basic and intuitive; we use it in everyday life: we behave some way, get a reaction, and next time predict the same reaction. The same applies to similarity in data.
 Look for the most similar data points we learned from in the training data; predict that the same will apply in the test data.

Distance metrics (DM)


In the real world, data is often on a 2-dimensional plane; big data is often higher-dimensional (many attributes), and many attributes are irrelevant – some are just noise.
There is a huge variety of distance metrics that quantify the distance between two data points. Depending on the nature and scale of the data, some are more accurate/appropriate than others:

 Euclidean distance
 Manhattan distance
 Minkowski distance
 Cosine similarity
 Jaccard distance
 ...

Euclidean distance
Measures the distance between two points as the length of a straight line.
One the most intuitive and commonly used distance metrics. Good off-the-shelf performance with low-dimensional data, but
requires that data has been normalized and performs poorly with high-dimensional data.
 Simple, intuitive, widely used; the default setting in many ML programs.
 Challenge: high-dimensional data.
o Technically: on a 2D or 3D plane it is a flat, straight line; as dimensionality increases, distances behave strangely and the measure becomes less meaningful.
 Most applicable to low-dimensional data.
The distance between points A and B follows from the Pythagorean theorem (right triangles): d(A, B) = √(Σ (ai − bi)²).

Manhattan distance
Measures the distance between two points along axes at right angles; also commonly referred to as taxicab distance.
Less intuitive than Euclidean distance but displays better performance with high- dimensional data.

 Taxicab distance: the total north-south distance plus the east-west distance you travel.
 Add together the absolute differences between the data points along each dimension.

Minkowski distance
A generalization of Euclidean, Manhattan, and other vector-based distance measures.
Introduces a parameter p that can be adjusted to suit your use case, but interpretability can be troublesome without a firm
understanding of vector-based distance metrics.

 Pulls multiple distance metrics into one formula: add together the absolute differences between the data points along each dimension, each raised to the power p (and take the p-th root).
 Setting p determines what distance metric you use: p = 1 gives Manhattan distance, p = 2 gives Euclidean distance.
 Nice: it brings everything together, and p is easy to adjust.
 Be aware: interpretability can be troublesome without a firm understanding of vector-based distance metrics.

Cosine similarity
Measures the similarity between two points as the cosine of the angle between two vectors.
Well-suited to high-dimensional data but ignores the magnitude (size; length) of the vectors. Only accounts for the orientation
between to points, rather than the distance per se.

 Tricky – it is not measuring distance per se (not as a line between points); it is well suited to high-dimensional data.
 It doesn’t depend on the length of the vectors, only on the angle between them: cosine similarity is scale-invariant and assumes that the orientation of instances in the feature space is what matters.
 Interpretation of the cosine value:
o -1 = the points go in opposite directions (180° angle)
o 0 = the points are at right angles to one another (90°)
o 1 = the points have the same orientation – they are “very close”

Jaccard distance
Treats data points as sets of characteristics and measures distance as one minus the proportion of all unique characteristics
(the union) that are shared by the two (the intersection).
Well-suited to high-dimensional data (e.g., text data), but is susceptible to being skewed by the size of the data. Increasing
dataset size may increase the union without increasing the intersection.
 It is different: not a vector-based metric – there is no “distance” in the geometric sense.
 Each instance (row) is treated as a bucket/set of characteristics; we check how many characteristics present in one instance are also present in the other.
 It is affected by the size of the dataset: with lots of features, the union can grow without the intersection growing.

Jaccard distance = 1 minus (intersection divided by union).
Example: across the two instances there are 3 unique values and only 1 is shared, so the intersection is 1 and the union is 3, giving a Jaccard distance of 1 − 1/3 ≈ 0.67.
It doesn’t care what the values represent – it only looks at common values across the buckets.
It makes little sense for numerical datasets; it is used for text datasets (words shared across documents), for example.
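A small sketch of these distance metrics with scipy (Manhattan distance is called "cityblock" there; the two points and sets are arbitrary):

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan / taxicab distance
print(distance.minkowski(a, b, p=2))   # p=2 reproduces Euclidean, p=1 Manhattan
print(distance.cosine(a, b))           # 1 - cosine similarity (orientation only)

# Jaccard distance on sets of characteristics
s1, s2 = {"red", "sweet", "round"}, {"red", "sour", "round"}
print(1 - len(s1 & s2) / len(s1 | s2)) # 1 - intersection/union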
Nearest-neighbor reasoning
Once you have a way to measure the distance (or similarity) between instances, you can make predictions for new, unseen
data based on what known instances are closest (or most similar).
The instances that are most similar to any given instances are called its nearest neighbors.

 Used to make predictions.
 “Lazy learning”: no model is being built, no fitted equation, etc.
 Pattern recognition.

With categorical variables it gets difficult (measuring distance on them is not straightforward).

K-nearest neighbors for regression

To estimate a value for a new instance...


1. Calculate the distance between all known instances and the new instance.
2. Identify the knearestneighbors.
3. Average the known target variables of the nearest neighbors.
a. Instead of “vote”
b. Continuous value
c. Adapting > predict continuous variables instead of categorical

Regression task example: predict a customer’s future credit score (a numeric value, not categorical) to decide whether we will make an offer in the first place.
1. Calculate distances and identify the 3 nearest neighbors.
2. Average the credit scores of the 3 nearest neighbors – past customers similar to “David” had credit scores around 600.
Distance weighting: people further away have less influence on David’s prediction.

Picking k for KNN regression


 If k = 1, you have a complex model that captures the variance of your training data (risk of overfitting).
 If k = N, you have a simple model that predicts the average value of your target variable for all new data (risk of underfitting).
o The right k is somewhere in the middle.
 Use grid search and cross-validation: try different settings, compare and contrast.
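A minimal KNN regression sketch in scikit-learn; the customer data is a made-up stand-in for the credit-score example:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical past customers: [age, income] and their known credit scores
X = np.array([[25, 30], [30, 45], [35, 60], [40, 52], [45, 80], [50, 75]], dtype=float)
y = np.array([580, 610, 640, 600, 700, 680])

# (In practice, normalize the attributes first - distances are scale-sensitive.)
knn = KNeighborsRegressor(n_neighbors=3)   # k=3: average the 3 most similar customers
knn.fit(X, y)
print(knn.predict([[38, 55]]))             # predicted credit score for a new customer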

Problems with nearest neighbor reasoning


KNN is intuitive, but:

 Interpretability and justifiability: the algorithm doesn’t produce a model per se.
o It is easy to explain how a prediction was made, but hard to justify it, which is inappropriate for some real-world applications, and it is difficult to make counterfactual (“what if”) statements.
 Distance metrics are affected by the curse of dimensionality and by varying scales (lots of attributes, high dimensions, etc.).
 It is computationally expensive to query the entire dataset for every prediction: for a massive dataset requiring fast predictions, KNN is impractical.

Clustering
Unsupervised

What is clustering?
- An unsupervised machine learning task that involves grouping together similar instances into so-called “clusters.”
o The key thing is that in contrast to the others, clustering is unsupervised. So, we’re not predicting something
specific.
o The objective is more exploratory, we don’t know what we’re looking for.
o Might look for the company’s natural customers
- As an unsupervised task, the input data is unlabeled; the modeling is not driven by a pre-specified target variable, but
rather seeks to identify naturalistic groupings.

K-means
- An iterative algorithm that groups unlabeled data into k non-overlapping clusters.
o The most intuitive and most used.
o Non-overlapping: Exclusive (?)
- Represents clusters of data by their centroid — their ”cluster centers,” or the arithmetic means (averages) of the
values along each dimension for the instances in the cluster.

To group data into k clusters…


1. Randomly define k centroids.
a. The algorithm simply places them somewhere in the feature space.
2. Assign each instance to the nearest centroid.
a. Gets assigned to a cluster
3. Re-define k centroids by calculating the actual centroid of the assigned instances.

4. Re-assign instances to the nearest centroid.

5. Iterate until convergence (cluster assignment stops changing).

We are trying to find 3 natural clusters (see the video at the link below the picture).

https://www.linkedin.com/pulse/k-means-clustering-itsreal- use-case-
surayya-shaikh/
Evaluating your clusters
- The goal is to minimize distortion (or inertia). The lower the distortion, the better and more “coherent” your clustering
is.
o How good are these clusters?
o How good/natural is your cluster? We use distortion to evaluate this.
- Distortion is the within-cluster sum-of-squares; the sum of the squared differences between each data point and its
corresponding centroid. Scikit-learn calls this inertia.

This measure of distortion is not mentioned in today’s exercise, but inertia is. Distortion is the same thing as inertia – they
function the same way.

K-means problems
- Assumes regularly shaped clusters.
o Can’t see natural clusters if it doesn’t have a regular shape.
o Only cares about minimizing distortion/inertia.
- The initial placement of centroids affects how long it takes to converge and which instances are assigned to a
particular cluster.
o
- How do you pick k (the number of clusters)?

By default, sklearn doesn’t place the initial centroids purely at random; it uses k-means++. If you do it truly randomly, it can give some weird results.

n_init: by default, sklearn runs the algorithm ten times (with different initializations) and keeps the best result.


Picking k with the elbow method
Try different values of k and plot the distortion for each. A higher value of k always lowers distortion, but picking it higher has consequences. After 3 the curve is more or less flat, so you would pick 3: anything over that risks overfitting, and with 3 you still capture roughly the same amount of variance.
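A minimal k-means/elbow sketch with scikit-learn on synthetic blob data; inertia_ is scikit-learn's name for distortion:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # data with 3 natural clusters

# Elbow method: fit k-means for several k and compare the distortion (inertia)
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)            # flattens out after the "natural" number of clusters

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)   # final clustering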

Hierarchical clustering
- An algorithm that builds nested clusters by successively merging the most similar clusters (or data points) until all data
points have been merged into a single cluster, such that the clustering can be represented as a tree-like dendrogram.
o Bottom-up clustering.
- There are several algorithms that are considered part of the hierarchical clustering family. What we’re referring to
here is sometimes called “Agglomerative Clustering.”

To group data into clusters…


1. Calculate the pairwise distance between all clusters (on the first iteration, each
instance is treated as its own cluster).
2. Merge the two most similar clusters into a single cluster.
3. Plot the merge on a dendrogram with the height equal to the distance between
the two clusters just merged.
4. Repeat steps 1-3 until all data has been clustered.

https://dashee87.github.io/data%20science/general/Clustering-
with-Scikit-with-GIFs/

Interpreting a dendrogram

Why use hierarchical clustering over k-means? It is easily interpretable and explainable: while k-means just presents the final clusters, hierarchical clustering shows what’s going on behind the scenes.
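A minimal agglomerative clustering / dendrogram sketch, using scikit-learn for the cluster labels and scipy for the dendrogram (synthetic data):

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=42)

# Bottom-up merging of the most similar clusters (Ward linkage)
labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# The dendrogram shows the sequence of merges and the distance at each merge
dendrogram(linkage(X, method="ward"))
plt.show()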

Linkage functions
Clustering considerations
- The distance metric you use matters.
- The dimensionality and scaling of your data matters.
- Sometimes clusters are difficult to interpret.
o Up to the programmer to interpret. Fx which one is natural.
o There aren’t any labeled clusters, so it’s up to the programmer.
- Oftentimes clusters are difficult to evaluate as there is no ground truth with which to compare.
o There is no ground truth with clustering.
Other relevant algorithms
For clustering:
§Affinity Propagation
§ DBSCAN
§... https://scikit-learn.org/stable/modules/clustering.html
For dimensionality reduction:
§ PCA
§ UMAP
§... https://scikit-learn.org/stable/modules/unsupervised_reduction.html

Clustering in practice
- Market segmentation
- Profiling and anomaly detection
- Document/text analysis
o Touch on in the next lecture
o Rarely your main model, but more used as explaining and guiding your next steps.
- Data exploration and problem definition

- They group different players based on their stats.


- If you’re planning football transfers and want to buy new
players, they can check which type of player they are scarce
on.

Summary
 Similarity is a measure of how alike, close together, or related data are, usually derived from a so-called distance
metric, of which there are many.
 By measuring the similarity (or distance) between instances, you can make predictions for new, unseen data based on
what known instances are most similar (or closest). This “nearest neighbor reasoning” is the logic of k-nearest
neighbor algorithms.
 Clustering is an unsupervised machine learning task that follows the logic of similarity and nearest neighbors, which
involves grouping together similar instances into so-called “clusters.”
o We touched on k-means
o Hierarchical clustering: nested clustering
 Two common clustering algorithms are k-means and hierarchical clustering (agglomerative clustering).
 K-means clustering is an iterative, centroid-based algorithm that groups unlabeled data into k non-overlapping
clusters.
 Hierarchical clustering builds nested clusters by successively merging the most similar clusters until all data points
have been merged into a single cluster.
 Clustering is useful for exploratory data analysis, where there is no known target variable to be predicted or classified.
 However, nearest neighbor methods and clustering suffer from the so-called “curse of dimensionality.”

Question: Which statement is NOT true about similarity?
- It can be used for supervised and unsupervised machine learning
- It can be used for classification, regression, and clustering tasks
- It's typically quantified with distance metrics
- It's easy to apply to high-dimensional data (this is the NOT-true statement: the curse of dimensionality makes it hard)

Question: If you set k=1 with the KNN algorithm, you have...
- A complex model that risks overfitting the training data (predictions change drastically in y – a complex model, with a risk of overfitting)
- A simple model that risks underfitting the training data
- A complex model that risks underfitting the training data
- A simple model that risks overfitting the training data
Not a complex model that underfits – a simple model is the one that risks underfitting.

Question: Clustering is most useful for...
- Predicting continuous values
- Classifying data (putting it into categories)
- Exploratory data analysis, where there is no known target variable (clustering is an unsupervised learning technique – we don't know what we're looking for, so it's exploratory)

Ch. 6
The book, chap. 6

Similarity
About finding patterns in data.
- Gets value from datasets by comparing.
- Can be used for classification, regression and (is a basis for) clustering.

- Example: Used by Amazon to provide book recommendations, where similarity plays a big role.

How to: look at the differences in each attribute, and measure the distance:
1. Look at the age attribute: 40-23 = 17, current address: 10-2 = 8, RS = 1.
2. Then measure the shortest distance between the points using Euclidean distance, which also works for more than two dimensions.
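The formula itself is not reproduced in these notes; in general, and applied to the attribute differences above (17, 8, and 1), it is:

d(A, B) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2)
d = sqrt(17^2 + 8^2 + 1^2) = sqrt(289 + 64 + 1) = sqrt(354) ≈ 18.8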

K-Nearest Neighbors Reasoning

About finding the most similar alternatives.

- One of the simplest and most powerful ways to use similarity is to identify a small number of the most similar instances.
- Can be used for classification and regression.

- Example: Whiskey. Find one that is close to what you like.


- Translating: To be able to compare different features, we translate them into binary
attributes.

- Then we can compute distance in the usual way by combining the differences in the
feature values.

- Distance is based on Euclidean distance.

Our project:
- Results in a lot of attributes, like street name...

Clusters
- Unsupervised segmentation
o No target but look for a natural way to group the data.
o Two ways of grouping the data:
Hierarchical clustering

- Group the two clusters that are most similar.
- You find the distance between each point on the plane.
- Creates a dendrogram over the hierarchy of the clusters.
o The number of joins corresponds to the number of clusters.

K-means clustering

1. K: you choose the number of clusters you want.
2. Place that number of centroids randomly on your plot.
3. Measure the distance from each instance to the centroids and assign each instance to the closest centroid.
4. Move the centroids to minimize the distortion; each centroid ends up in the center of its cluster.
Exercise 6: k-means and clustering
Topic Similarity, Neighbours and Clustering
Learning 1, 4
Objective
Activities to be
done before next
class

Today’s exercise objectives


 Implement k-means clustering,
 Interpret the results to get insight of your data,
 Understand how and when to use the elbow method, and
 Use some basic visualizations to explore and understand your dataset.
7. EVIDENCE AND PROBABILITIES; TEXT-AS-DATA
Topic Evidence and probabilities; text-as-data
Learning 1
Objective
Syllabus MLP p. 329-340
DSB p. 233-278
Activities to be Readings
done before next
class
Exercise Sentiment analysis

Session 7

Evidence and Probabilities


Data as evidence
We can think of each instance in a dataset as evidence for or against different values for the target.
- Think of each instance as evidence.
Then, based on the combined evidence we possess, we can estimate the probability of a new, unseen instance having a certain
target value.
- Probabilistically and
- Generatively
- Given what we know, how likely is it that we encounter…

What is probability?
The chance that an event will occur.
- A mathematical framework that allows us to analyze chance.
Probabilities range from 0 to 1, where 0 (0%) indicates with complete certainty that an event will not occur, and 1 (100%) indicates with complete certainty that an event will occur.

What is an event?
A possible outcome within a sample space.

The sample space of a phenomenon is the set of all possible outcomes. For example, the sample space of rolling a die once is
{1, 2, 3, 4, 5, 6}.
- 6 outcomes – the sample space contains 6 possible outcomes.
- Different events can be defined over it: rolling an odd number, rolling more than three, rolling a 1 or a 2, etc.
Finding the probability of an event
When all possible outcomes are equally likely, the probability of an event A, denoted by p(A), is obtained by adding the probabilities of the individual outcomes in the event; equivalently, p(A) = (number of outcomes in A) / (total number of outcomes in the sample space).
The central challenge is to estimate the probability of an event.

How to calculate probability

Example:
Event B: 4 different outcomes can occur and still satisfy the event definition, so p(B) sums over those 4 outcomes.

Unconditional probability
A probability that is not affected by or dependent on other events.
For example, the probability of rolling a die and getting a 6, or the probability of flipping a coin and getting heads.
- Nice and intuitive

Combining unconditional probabilities

p(A and B) = p(A) x p(B) – but unfortunately this only works because Event A and Event B are independent; they don't affect one another.

When combining events by multiplying, the probability of both occurring is always lower than either probability alone (like placing two bets instead of one: more risk).

Checking for independence


How do we know whether events are independent? In the real world it is harder to tell. Mathematically, we need to establish this before we continue our calculations.

To determine if events A and B are independent, ask:

 Is p(A|B) = p(A)?
 Is p(B|A) = p(B)?
 Is p(A and B) = p(A) x p(B)?

If one of these is true, the two others are also true, and the events are independent.
You only need to check one.
What do we mean by "|"? See conditional probability below.

Conditional probability
The probability of an event (A) given that another event (B) has occurred; denoted by p(A|B).
For example, the probability of a customer churning given that they have purchased a special subscription, or the probability of having COVID given that you tested positive.

The probability is influenced by whether another event has occurred.

Formula for conditional probability
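The formula itself is not reproduced in these notes; the standard definition is:

p(A|B) = p(A and B) / p(B)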

Example: checking for independence

Data on 10 customers: for each, we know whether they bought a special subscription and whether they churned (the training data).

Check mathematically: is p(churn and special subscription) the same as p(churn) x p(special subscription)?

They are NOT equal, so churning and having a special subscription are not independent.

In practice: if we know a customer has a special subscription, should we do something to keep their business? We don't want them to churn.

p(churn | special subscription) = 25%: there is a 25% probability that a person churned given that they have a special subscription.
p(churn | no special subscription) = 67%.

What can we say? A 67% probability of churning given they do NOT have a subscription, compared to 25% if they DO have a special subscription. So one way to prevent churn could be to promote special subscriptions.

Bayes’ theorem
Blue terms;

Left side: Posterior probability

Right side: Prior probability

Psych: how rational are people
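The equation itself is not reproduced in these notes; the standard statement is:

p(H|E) = p(E|H) * p(H) / p(E)

where p(H|E) is the posterior probability, p(H) is the prior probability, p(E|H) is the probability of the evidence given the hypothesis, and p(E) is the probability of the evidence.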

Bayes’ theorem for classification

Apply to classification task

Calculated easy out of training data

Denon: how XX among all examples

Prob of E given C > if the world generating an in


 Not possible

Naïve Bayes
Adjust Bayes' theorem with a simplifying assumption: conditional independence.

Approach each attribute independently.

"Naïve": the attributes are assumed to be independent of one another, which is not always realistic – therefore NAÏVE.

Denominator: p(E) becomes a constant when comparing classes, so we can just focus on the numerator.

Naïve Bayes in words


A classification algorithm that estimates a probability of an instance belonging to each candidate class and reports the class
with highest probability.
Makes the “naïve” assumption of conditional independence between every pair of features given the class label.

Note: There are several variations of the Naïve Bayes algorithm used for classification. We’re focusing on the most common
variant, Multinomial Naïve Bayes.

Fake news detection with Naive Bayes

There are different variations of Naive Bayes; it represents a family of algorithms. Here we use Multinomial Naive Bayes.

A classic classification task. Data set: 20 tweets, labeled as containing fake news or not.
- Fake news = target variable
- The tweet text provides the attributes
- Features = whether the tweet contains the words "trump", "lizard", "election"

We calculate the conditional probability of observing each word given the class label – a perfect case for Naive Bayes.

First, using the Naive Bayes formula (it is just the numerator of Bayes' theorem), plug in the numbers, e.g., the conditional probability of observing the word "election" given the class.

The prior comes from the balance of class labels, without looking at any features in the tweets.

We're not done yet: the 12% we get for the "fake" class doesn't mean anything on its own. We have to compare it with the corresponding probability for true news, i.e., the conditional probability of a true-news tweet containing the same words.

We can't interpret the individual numbers in isolation; in a classification task it's important to compare the two. Since 0.2 > 0.12, i.e., 20% is higher, the new tweet is classified as more likely to be true news.
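A minimal sketch of this workflow in scikit-learn; the toy tweets and labels below are placeholders, not the 20-tweet dataset from the lecture:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ["trump lizard election", "election results announced",
          "lizard people run the election", "new policy announced today"]
labels = ["fake", "true", "fake", "true"]

vec = CountVectorizer()
X = vec.fit_transform(tweets)          # word counts per tweet

# alpha is the smoothing parameter discussed in the Smoothing section below
nb = MultinomialNB(alpha=1.0)
nb.fit(X, labels)

new_tweet = vec.transform(["election fraud by lizard people"])
print(nb.predict(new_tweet))           # predicted class label
print(nb.predict_proba(new_tweet))     # probabilities compared across classes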
Smoothing
A technique that accounts for attributes not present in the training data and prevents zero probabilities in further computations.
Smoothing is done with the alpha parameter in scikit-learn. The default (alpha=1) adds 1 to all possible outcomes to avoid assigning any zero probabilities.

Naive Bayes multiplies conditional probabilities, so if one of them is 0, it zeroes out everything.

This is a kind of overfitting: nothing can happen that doesn't happen in the training set. Smoothing is a way of avoiding overfitting in the context of Naive Bayes.

Alpha parameter

The problem with Naive Bayes: if a word never appears with the label "true" in the training data, there is a 0% chance of being labeled true, and this zeroes out all other features because of the multiplication. Everything containing the word "lizard" would get probability 0, even though someone can be saying true things about lizards.

Therefore we need to perform smoothing: add a value to the numerator so it will never be 0, and avoid zeroing out the other features.

Naïve Bayes considerations


Should we use it for projects?

 Computationally efficient
o Fast
 Good for handling high-dimensional data (e.g., text)
 Multinomial Naïve Bayes suits data with discrete, integer attribute values (different Naïve Bayes variants can handle
other attribute types) – multinomial distribution.
o If the attributes are not integers, use a different Naïve Bayes variant.
 Not good for probability estimation per se; only good for ranking class labels
o With many features, the estimated probabilities become smaller and smaller.
 Strict independence assumption means semantics are not appreciated
o It ignores linguistic meaning.
Any data can be high dimensional.
Text-As-Data
Typical high-dimensional data = text.

What do we mean by text-as-data?


Techniques for turning text into machine-readable inputs for statistical and machine learning models.
Text data is ubiquitous and naturalistic; it can be collected via web scraping, APIs, and by scanning physical documents. But text
is messy, unstructured, context-dependent, and requires lots of preprocessing.
- So far we have only worked with spreadsheets of numbers.
- Text is difficult for computers to read.
- Text is not clean; e.g., typos.

Google Ngram Viewer

Millions of digitized books. Example: comparing two terms, "communism" and "capitalism," over time.

Representing text
Basic text-as-data concepts and terminology:
 Corpus: a collection of documents
 Document: one unit of text for analysis (e.g., a sentence, blog, book, etc.)
 N-gram: a sequence of adjacent words
 Token/term: a word

Example: a corpus consisting of 3 documents (3 sentences).

Corpus (cf. dataset): the collection of the 3 documents/sentences.

Document (cf. instance): one sentence.

Token: one individual word, e.g., "rhythm".

N-gram: a sequence of adjacent words, e.g., "natural rhythm" is a bigram; "a natural rhythm" is a trigram.

Data preprocessing
A number of preprocessing steps are unique to text data:

 Case normalization: typically, all text is changed to lowercase


 Punctuation: typically, all punctuation is removed
 Stemming: removing suffixes and plurals (e.g., jumped, jumping, and jumps all become jump )
 Stopwords: removing very common words that don’t convey much information (e.g., the, are, that )

Bag-of-words
Once preprocessing is done, you can go in different directions. The most common direction is the bag-of-words approach.

A text-as-data approach that treats every document as a collection of individual words, ignoring grammar, word order,
sentence structure, and (usually) punctuation.
It treats every word in a document as a potentially important keyword of the document. It’s straightforward, computationally
inexpensive, and tends to perform (at least acceptably) well for many tasks.

Metaphor: text is written with nice syntax, but to make it readable for a computer we put all the words in a bag and look at each word individually. This loses context.

Example: 3 small documents.
Preprocessing: lowercasing, etc.
Bag of words – 3 things we do:
1. Tokenize: split the string, chopping it up into small tokens.
2. Count the occurrences of each token.
3. Vectorization: turn the string of text into an instance as we know it – attributes the computer can read.

Result = a document-term matrix (high dimensional).

We can now analyse the text using different measures.


Measures
We try to capture interesting terms. What makes a term interesting:
1) It occurs frequently within a document (TF).
2) It does not appear in lots of documents across the corpus (IDF).

Term frequency (TF)


The most recognizable of the measures.
The number of times a term occurs in a document.
To account for the fact that documents may vary in length, term frequency is often normalized in some way, such as by dividing each count by the total number of words in the document.

Written as an equation, TF measures how frequently a term appears within a document. On its own it doesn't tell us much, but it can show which words are interesting.

Inverse document frequency (IDF)


A measure of whether a term is common or rare in a given corpus.
A high IDF indicates that a term does not appear in many documents; a low IDF indicates that a term does appear in many
documents.
A high IDF = the term is rare across the corpus; a low IDF = the term appears in many documents.

How to calculate IDF:

NB: IDF is calculated once for each term in the corpus, not for each term in each document – it operates at another level than TF.

It can never be less than 1. As the number of documents containing the term increases, the IDF decreases; rare terms get a higher IDF, and such a term may be saying something important about the documents it does appear in.

IDF is calculated for the corpus as a whole, not within a specific document!

TFIDF
We end up combining IDF with TF.
A combined measure of term frequency and inverse document frequency; term frequency weighted by inverse document
frequency.
The most widely used feature representation of text data.

TFIDF = TF times IDF – simple.

It can be used as input for ML algorithms, etc.
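The equations themselves were only shown on the slides; a common formulation, consistent with the notes above (the +1 is why the IDF can never be less than 1, and scikit-learn's slightly smoothed variant likewise adds 1), is:

TF(t, d) = (number of times term t occurs in document d) / (total number of terms in document d)
IDF(t) = log(total number of documents / number of documents containing t) + 1
TFIDF(t, d) = TF(t, d) x IDF(t)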

Text-as-data with scikit-learn


These are the 3 most common measures. In scikit-learn we use a couple of functions to compute them.


CountVectorizer

1. Stopwords are not removed by default (common words like "is"); tell the vectorizer to ignore them with the stop_words parameter (default is None).

2. It only counts unigrams by default – one word at a time, single tokens. If you want to consider bigrams or trigrams, specify a range of n-grams with ngram_range.

3. Adjust it to ignore overly common or rare terms with two parameters: max_df (identifies overly common terms; a fraction between 0 and 1; by default no terms are ignored) and min_df (overly rare terms; an integer – e.g., setting it to 20 ignores terms appearing in fewer than 20 documents).

TfidfVectorizer works the same way; to change its behaviour, look in the documentation.
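A minimal sketch of both vectorizers and the parameters mentioned above, on a toy corpus (the documents and parameter values are just placeholders):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["The quick brown fox", "The lazy dog", "A quick lazy fox"]

cv = CountVectorizer(stop_words='english',   # default is None: stopwords kept
                     ngram_range=(1, 2),     # unigrams and bigrams
                     max_df=0.9,             # ignore overly common terms
                     min_df=1)               # ignore overly rare terms
dtm = cv.fit_transform(docs)                 # sparse document-term matrix
print(cv.get_feature_names_out())            # one column per term/n-gram

tfidf = TfidfVectorizer(stop_words='english')
X_tfidf = tfidf.fit_transform(docs)          # TFIDF-weighted matrix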

Text-as-data and Natural Language Processing (NLP)


Many machine learning and AI-related tasks are built on text-as-data techniques, such as...
 Text classification
 Sentiment analysis
 Stance detection
 Bot detection.
 ”Fake news” detection
 Topic modelling
o Find topics within a document
 Argument mining & knowledge graphs
 Text summarization
 Chat bots
 Text generation (e.g., GPT-3)

Trying to get the computer to understand text.

Sentiment analysis

Step 1: vectorize the review text
Step 2: input the feature matrix into a classification algorithm (e.g., logistic regression; naïve bayes)
Step 3: compare across algorithms and across text representations
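A minimal sketch of steps 1-2, assuming a small placeholder set of reviews and labels (a real workflow would also split into training and test data and compare several algorithms, as in step 3):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = ["loved it", "terrible product", "would buy again"]  # placeholders
labels = [1, 0, 1]                                              # 1 = positive

# Step 1 (vectorize) and Step 2 (classify) chained in one pipeline
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(reviews, labels)
print(model.predict(["not great", "really loved this"]))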
Data preprocessing

If not combined with XX

Summary
 We can think of each instance in a dataset as evidence for or against different values for the target.
 If we know all possible outcomes (class labels), we can estimate the conditional probability of an outcome being
observed given its attributes.
 Assuming conditional independence, Naïve Bayes algorithms estimate the probability of an instance belonging to a
candidate class and reports the class with highest probability.
o Naïve bayes as baseline

 (Multinomial) Naïve Bayes is computationally efficient and good for high- dimensional data (e.g., text), but only good
for class labeling (not probability estimation ala logistic regression).
 Approaching text-as-data means turning text into machine-readable inputs for statistical and machine learning
models.
 Text data is ubiquitous and naturalistic, but messy and high-dimensional — a central challenge with text data is the
necessary preprocessing.
 Text data is commonly represented as a bag of words.

 Key feature representations derived from bag-of-words are TF, IDF, and TFIDF.
 Combining text-as-data feature representations with machine learning gives rise to NLP, which powers wide-ranging
tasks like text classification and topic modelling.

Question: Which statement is NOT true about Naive Bayes?
- It's good for high-dimensional data, like text
- It assumes conditional independence among attributes
- It's good for probability estimation, just like logistic regression (NOT true: the probabilities are only used for ranking/comparing classes, not for probability estimation per se)
- There are several variations of the Naive Bayes algorithm for classification

Question: Which of the following is an advantage of using text data?
- It's naturalistic (correct, "often": in the context of big data, think of messages on social media, emails, manifestos, etc.)
- It's unambiguous
- It's easy to preprocess (no: it's difficult to preprocess)
- It's neatly structured (no: it doesn't come in tables, dataframes, etc.)
The other options are disadvantages of text data.

Question: When you apply CountVectorizer to text data, what is happening?
- Text data is transformed into a document-term matrix (correct: a column is created for each word observed in the data)
- TFIDF is calculated (it could be calculated afterwards, but not by CountVectorizer)
- A machine learning model is trained
- Sentiment analysis is performed (not right away, though CountVectorizer can be involved in sentiment analysis)
Ch. 7
The book, chap. 9-10

Patterns in text data


- When looking at patterns in Data mining, we need to work with evidence and probability.

o Probability of an event occurring: P(C)


 Ex. Independent:
If A and B are independent the product of
P(AB) = P(A) * P(B).

o Conditional probability of an event occurring: if we have relevant evidence of an event that affects another event.
 Ex. Dependent:
P(C given E) or P(C | E)
 Many events are not independent; normally we say that the probability of A and B occurring is the probability of A times the probability of B given A.
If B depends on A, then:
P(AB) = P(A) * P(B | A).
P(AB) = P(B) * P(A | B).

Bayes rule:
- A way to figure out how likely a hypothesis is given evidence.

- It can be derived from the conditional probability formulas above by replacing B and A with H (hypothesis) and E (evidence):

- To see how likely a hypothesis is given the evidence, p(H | E) (e.g., how many people arriving at the hospital with red spots have measles), it is often hard to get a good estimate of the probability of H given E, but it might be easier to get data on E given H, e.g., the proportion of the patients with measles that also have red spots (this is often collected anyway).
- In addition we need data on how often people in general have red spots: p(E). This can be hard to get, but often we do not even need p(E), because: for classification we want to find the hypothesis/class c that has the highest probability given E (the attribute vector), so we typically compare two classes, C1 and C2, given E, and determine which one is larger – therefore we just need to know which has the larger numerator, since both have the same denominator.

- A way to use Bayes Rule is in Classification:

o H would be c, and E is the vector of attributes.


 Assume the evidence attributes are independent (naïve bayes), so we can take the product of each.
o Easy: for p(E | c) we need to know the number of instances in class c, and then how many had E1 – then E2, etc...

Text mining:
- Can be used to find all articles about a topic or classify documents as positive, negative or neutral.
- To do this each instance should be represented as a vector of attributes.
o Ex: a list of all words; 1 if the word is present in the document and 0 if not, or the number of occurrences in the document.
o (In the book's example there are eleven features representing all the words in the three documents.)

o Additionally, (1) A word is more interesting if it has more frequency in the document and (2) A word is less
interesting if it occurs in many other documents.

o We measure this by using a score called TF*IDF:


TF = Term frequency and IDF = Inverse document frequency.
 The bigger the ratio, the more interesting the term (values between 0 and 1).

Dummy values
- You want text to be turned into numbers so you can work with them. It takes up more space on the screen but
makes the machine happier.

Example: Mining News Data


- In this example we will predict stock prices from news articles.
- We pick articles that mention a specific stock on a specific day and look at change in stock price on that day.
- We define a binary label on the stock price change: if it rises or falls by more than 5%, the label is Change, otherwise NoChange. This way we can label articles with Change or NoChange.

- Instance: News Article on given day mentioned a given stock (represented as a row)
- The attributes of each article are the TFIDF values of terms in a given article.
- Then we can build a classifier!
2) We can order the attributes based on how well they predict a change in stock price. Then we know what words
are more interesting in news articles about stocks.
3) One way to think about classification is to ask: What is the EVIDENCE for a given class.
Exercise 7: Sentiment analysis
Topic Evidence and probabilities; text-as-data
Learning 1, 4
Objective
Activities to be done before next class
Today's exercise objectives
 Use CountVectorizer to turn unstructured text into machine-readable attributes
 Perform sentiment analysis with those attributes
 Interpret the results to get insight out of your data
8. DECISION ANALYTIC THINKING I: MODEL EVALUATION AND ETHICS
Topic Decision analytic thinking I: Model evaluation and ethics
Learning 1, 2, 3
Objective
Syllabus DSB p. 187-208
PPBD p. 1-40
Activities to be Readings
done before next
class
Exercise Model evaluation

Session 8
Measures of Model Performance
What is a good model?
A model that answers your question or solves your problem. It depends on the use case.
- Not single answer.
What is a good predictive model?
A model that makes accurate predictions on new, unseen data.
A good predictive model should make more accurate predictions than a relevant baseline model or simple rules of thumb.
- Generalize (lec 5)
What is an appropriate baseline?
It depends on the use case.
A baseline could be a dummy model (e.g., a classifier that always predicts one class label), the simplest version of a model (e.g.,
a linear regression vs. a polynomial regression), whatever model is currently deployed in practice, or the current state-of-the-
art for a given task.
- Addition to accurate prediction.
Model performance metrics
Depending on the task, there are different measures to choose from:
 Regression:
o Goodness of fit (R-squared) – see lecture 4
o Prediction error (MSE, MAE, RMSE)
 Classification:
o Accuracy, precision, recall, F1
o Sensitivity, specificity
o Expected value
 Clustering:
o Distortion/inertia – see lecture 6
Regression - Evaluating regression models
Regression models are most commonly evaluated by their goodness of fit (e.g., R-squared) or by their prediction errors,
which are quantified by a loss function (e.g., MSE, MAE, RMSE, etc.).

Loss functions measure how bad it is to get an error of a particular size and direction. Different loss functions suit different use
cases.

Prediction error

Example: test data on house size (attribute values, X) and the actual target values (Y). Predicted target values are generated by a fitted linear regression – a straight line defined by the regression coefficients.

Calculate the prediction error for each instance: y - ŷ.

Prediction error = the difference between the actual and predicted target values.

Loss functions
Each works with the prediction error.

Mean Absolute Error = MAE: the mean of |y - ŷ|

Mean Squared Error = MSE: the mean of (y - ŷ)²

Root Mean Squared Error = RMSE: the same as MSE but taking the square root, √MSE

Why care about using different loss functions for different use cases? They differ in how the loss grows with the size of the error (along the x axis):

MAE: the loss is proportional to the prediction error.

MSE: large errors should be penalized disproportionately more than small errors.
This only makes sense in some contexts (e.g., house prices – avoid big errors; it doesn't matter much if there's a small error).
NOT suitable for, e.g., medical use cases.

RMSE: small errors should already be penalized disproportionately relative to large errors; it penalizes everything a lot.

When predicting house prices, MSE might make sense: getting close to the true value is good enough.

Contrast this with predicting the correct dose of medicine for a patient: you want it perfect – NO errors, not even small ones. The difference between small and big errors doesn't matter, since either has big consequences. You want to train the model to be perfect, so MSE is not a good choice.

Example: mean squared error (MSE)

Take the prediction errors already calculated, square each one, sum them, and divide by the number of instances (here 5) to get the average.

You can't interpret the MSE on its own – it is useful for evaluation when compared to other models or a relevant baseline.

Prediction error with scikit-learn


One line of code! Use
mean_squared_Error > input true target
value

Calcualte RMSE

Calculate MAE
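A minimal sketch with scikit-learn; y_test (actual values) and y_pred (predicted values) are assumed to come from a fitted regression model:

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)                          # RMSE = square root of MSE
mae = mean_absolute_error(y_test, y_pred)
print(mse, rmse, mae)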
Classification: Evaluating classification models
Classification models are evaluated by how often they predict the correct class label.
Classification performance can be quantified by several measures, such as accuracy, precision, recall, F1 scores, sensitivity, and
specificity.
- There are different ways to quantify it.

Accuracy, precision, recall, F1 + Sensitivity & Specificity

Confusion matrix
A contingency table that separates the predictions made by a classifier according to the actual class labels.

One axis: predicted class label; the other: actual class label. Each instance is given a negative or positive label (0 or 1, equivalently).

This makes clear that there are 4 types of predictions:
1. True positive
2. False positive: incorrectly predicting the positive class
3. True negative
4. False negative
Common classification metrics

All range from 0 to 1: 1 is perfect, 0 is bad.

Ideally you want both high precision and high recall, but there is a trade-off between them; you can shift the classification threshold.

Precision: high precision means that most/all instances predicted to be positive are in fact positive – there aren't many false positives.
Precision = TP / (TP + FP)
Precision matters less when you would rather overlook a 0 than overlook a 1.

Accuracy: how often the prediction matches the actual class label – the easy one.
Accuracy = (TP + TN) / (P + N)

Recall: high recall means that most/all instances that are in fact positive are predicted to be positive – there aren't many false negatives.
Recall = TP / (TP + FN)
Predicting 0 when it is actually 1 is a false negative; when false negatives are worse than false positives, prioritize recall over precision.

F1: high F1 means that both precision and recall are high; it is the harmonic mean of precision and recall.
F1 = 2 x Precision x Recall / (Precision + Recall)
When precision goes up, recall often goes down and vice versa – F1 captures this trade-off.

Other metrics (e.g., for pandemic tests; plug in the numbers from the confusion matrix):
Sensitivity: high sensitivity means there are few false negatives; the classifier doesn't miss (m)any positive instances.
Sensitivity = TP / (TP + FN)
Specificity: high specificity means there are few false positives; the classifier doesn't mistakenly label (m)any negative instances as positive.
Specificity = TN / (TN + FP)

Example: classification report


10 people for whom we know some attributes;

Label each person as churner or non churner


Generate class report
Calculate all measures to evaluate

Precision = 0.67

Accuracy = 0.7

F1 = 0.73
Not directly at confusion matrix, but at
calculated precision and recall calculations

Straight forward with Python

Just use one line of code


Interpret now we know the values

Classify people who churn

Arbitrary to change the labels – we can


sway 0 for 1s
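A minimal sketch; y_test (actual labels) and y_pred (predicted labels) are assumed to come from a fitted classifier:

from sklearn.metrics import confusion_matrix, classification_report

print(confusion_matrix(y_test, y_pred))       # TP/FP/FN/TN counts
print(classification_report(y_test, y_pred))  # precision, recall, F1, accuracy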
Multi-class confusion matrix

Partition the matrix


Expected value
The weighted average of the values of the different possible outcomes, where the weight given to each value is its
probability of occurrence; a measure of the prospective business value of a deployed model.
Expected value calculation provides a framework for evaluating models with respect to their intended business application.

The expected value builds on the confusion matrix. For a binary classification task there are 4 possible outcomes.

The value of each outcome is derived not from the data but decided on our own, using reasonable business logic to explain why we chose it.

Example: Expected value


Let’s say a customer who churns costs the company an average of $100.
To try and prevent churn, a marketing brochure has been developed that is to be mailed to customers at risk of churning. The
brochure materials cost $1.
Correctly predicting an instance of churn thus earns the company $99, incorrectly predicting an instance of churn thus costs
the company $1 and missing an instance of churn costs $100.

The company has hired us to classify who are churners and who are not.

Note down the values for each outcome: a true positive earns the company $99; a false positive loses one dollar (wasted brochure – we shouldn't have mailed them); a false negative (missed churner) costs $100; a true negative costs nothing.

Plug in the numbers: the probability of each outcome is calculated from the confusion matrix and multiplied by the corresponding dollar value.

Is the result good? On its own we don't know – it is difficult or impossible to interpret in isolation. It is important to compare and contrast it with other models or with a baseline.
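Putting the pieces together, the expected value per customer is (the probabilities come from the confusion matrix; the dollar values from the business case above):

Expected value = p(TP) * 99$ + p(FP) * (-1$) + p(FN) * (-100$) + p(TN) * 0$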
Model evaluation considerations
 A meaningful model evaluation requires a firm understanding of the real-world use case that the model is intended to
be applied to.
o Only with a practical understanding of the use case can you define what a good model is.
 It's good practice to compare models across several measures of performance.
o Don't fixate on one specific loss function or metric and ignore the others.
o Recommendation: try as many measures as make sense; compare, contrast, and understand them.
 There are many other ways to evaluate machine learning models that we haven’t covered, including other loss
functions, ROC Curves (and AUC), and extensions of the expected value framework.
Big Data Ethics

What do we mean by ethics in the context of big data?


Principles that guide safe, responsible, and equitable use of big data in business and society.
Several related concepts fall under this header of ethics in this context, such as responsible AI, algorithmic fairness,
accountability, and transparency.
- Safe = doesn’t introduce new harms
- Responsible = in way that considers rather than just reacting
- Equitable = Inclusive, fair and non-discriminating

New initiatives in the area of big data ethics

Companies like Microsoft have dedicated ethics departments; there are institute-oriented initiatives such as the Alan Turing Institute; and entire fields of academic study have emerged.

3 sources of ethical concern


In relation to big data and ML:

1. Biased training data


2. Privacy-invasive data
3. Black-boxed automation

NOTE: There are definitely other sources of ethics violations. We’re just focussing on these three today.
Main streams of discourse

1. Biased training data


When predictions made by a machine learning model reproduce biases captured in historical training data; a version of the
“garbage in, garbage out” principle.
If a dataset captures systematic biases that are present in society, a machine learning model trained on that data will
propagate those biases.
- "Bias" is a whole umbrella term.
- Whatever patterns are in the input will also be reflected in the output.

Example: two people have both been arrested for drug possession, yet the model gives different risk scores for black vs. white defendants. Trained on historical criminal data, the model picks up a racial bias and reproduces it in its predictions, e.g., by picking up predictive attributes directly correlated with race.

Example: scanning the CVs of job applicants. The training data captured a gender bias – male candidates were historically preferred. (This has since been corrected.)

Example: Hungarian has no grammatical gender, but when translating to English the model applies genders to the sentences. The translation is NOT neutral; it depends on the data.
2. Privacy-invasive data
Big data is often generated ambiently by people without their knowledge, meaning such data may reveal sensitive
information that people did not consent to sharing.
- Often generated on social media, e.g., private photos.
- Terms and conditions – we are often unaware of how our data is used.
Ethically, the availability of data does not necessitate its use. Statistically, bigger data isn't always better.
- Just because we can, doesn't mean we should.

Consider a business case: data-driven advertising, built on data modeling and profiling.

As it turns out, Cambridge Analytica collected its data in a shady way: Facebook's terms and conditions were bent in order to get user data, which was then used to classify people into political groups – leveraging big data for political purposes to affect the election.

Another example: learning analytics in academia (e.g., in Copenhagen), assessing students' academic performance.

The most useful data is often also the most privacy-invasive data, and it should always be compared with other possible data sources.

3. Black-boxed automation
When machine learning models are deployed in contexts where their functionality is not transparent or explainable to the
implicated stakeholders.
If a black-boxed model automates or augments a decision-making process, then there is little opportunity to challenge a
potentially misguided decision.
- If I go to the doctor and am told I need a life-changing operation, I want to know why. What if it's based on wrong data? If the doctor just says "the algorithm told me," that seems wrong.
- At the same time, if you are a financial trader you also use algorithms to make decisions; the algorithm tells you which trades to make.
o You still need to be able to justify the algorithm's decision, otherwise you lose your job.
- Note: it may also be easier to correct algorithms than humans.

Black-boxed algorithms
In private companies, algorithms are considered proprietary information – trade secrets.
Concern: bias can be built into the algorithm, and we wouldn't know.

Example: a trial where someone faced with criminal charges wants to challenge the decision. If an algorithm is part of that decision, the attorneys argue you should be allowed to know how the decision was made. But the algorithm just pops out an answer, which makes review by an appellate judge impossible: was the constitution followed?

Observation: credit scoring models are becoming more complex and automated, e.g., for approving bank loans. There are risks in doing this – always anticipate the social implications!

A survey of people's attitudes towards black-box models found that people say they value interpretability above all. Paradoxically, they are still willing to trade interpretability away for accuracy gains.

What can we do?


 If biases in training data are identifiable, then models can be built in a way to correct for them (e.g., weighting training
data; adjusting classification thresholds) – shift direction of predictions.
o We are black boxes ourselves; we base decisions on intuition, etc.
 Critically reflect on whether "big data" is actually the "right data"
o Often, going out and asking people is more relevant than big data.
o Do you really need to scrape all data from people's social lives?
 Independent auditing of algorithms
o Internal ethics departments are proving to be ineffective (as at Google).
o Companies hide behind trade secrets, so external auditing is needed.
 Regulatory policies (e.g., a right to explanation)
o A legal right to an explanation?
o GDPR gives the right to know how your data is used, etc.
 Integrate qualitative sensibilities into data-analytic thinking (see lecture 2).
o See the exercise about interpretivism, etc.
o When framing an ML task, we can often prevent sources of ethical concern from manifesting into real-life issues.
Summary
 A good machine learning model is one that answers your question, solves your problem, or otherwise suits the use
case.
 Identifying a good model requires evaluation against an appropriate baseline.
 There are several different measures of model performance to base evaluations on (e.g., loss functions; classification
reports; expected value).
 Technical measures of model performance only capture part of the picture; big data ethics must also be considered before any deployment – ideally already before developing the model.
 Practicing big data ethics means following and supporting principles for safe, responsible, and equitable use of big data in business and society.
 Three key sources of ethical violations are biased training data, privacy-invasive data, and black-boxed automation.
 To prevent such potential ethical violations, we can identify and correct biases in model development, emphasize
using the “right data” instead of “big data,” promote independent algorithm auditing, and integrate qualitative
sensibilities into data-analytic thinking.

Question: Which statement is NOT true about measures of model performance (e.g., loss functions, classification reports, etc.)?
- They often require an appropriate baseline to compare against
- Loss functions quantify how bad it is to get errors of a particular size and direction (they turn prediction errors into a loss used to compare models)
- It's good practice to evaluate models on one single measure (NOT true: when in doubt between measures, use several, and compare and contrast – especially when evaluating models)
- Increasing recall often leads to a decrease in precision

Question: What is the benefit of using an expected value framework to evaluate a model?
- It quantifies the prospective business value of deploying the model (correct: it turns purely statistical measures into monetary, business terms; applied to classification)
- It allows us to ignore other measures of model performance
- It allows us to skip the process of comparing against a baseline model
- It provides a measure for comparing classification and regression models

Question: Which statement is NOT true about big data ethics?
- A machine learning model trained on biased data will display those biases in its outputs
- Models trained on bigger data always outperform models trained on less data (NOT true: per the paper on privacy-invasive data, models trained on small data can outperform models trained on big, privacy-invasive data – just because you can, doesn't mean you should)
- Using black-boxed models to automate decision processes limits opportunities to challenge misguided decisions
- Machine learning models can be built to mitigate biases or prejudice in society
Ch. 8
The book, chap. 7-8
Cohen et al. 199-206

How to evaluate and compare if it is a good model?


(1) Accuracy
- Accuracy does not necessarily equal a good model.
o What Training and test data was used?
o Think about the data (use confusion matrix) and how important it is that it predicts correctly.
Ex: Fraud: High accuracy, but not a good model.

(2) Confusion Matrix


- Use confusion matrix to look at when the model makes mistakes.
o Is it important (ex. cancer), and costly?
- If balanced, also remember to make the data representative:

(3) Expected Value


- Look at value of outcomes and their probability
EV = p(o1) * v(o1) + p(o2) * v(o2) + …

- Probability: is easy to get from data/classifier when building the model (confusion matrix).
- Value: Not as straightforward – depends on the business understanding of the outcome.

- Targeting example: Calculating benefit from targeting consumers.

Expected benefit of targeting = PR * VR + [1 - PR] * VNR

- Two outcomes: VR = value if responding, VNR = value if not responding.
- PR = probability of responding.

- Example:
o VR = 100$ (profit) – 1$ (targeting cost) = 99$
o VNR = -1$ (targeting cost)

Expected benefit of targeting = PR * 99$ - [1 - PR] * 1$

Rearranging: the expected benefit is positive when PR * 99$ > (1 - PR) * 1$, i.e., when 100 * PR > 1, so target if the probability of responding is above 0.01 = 1%.

- Proof (at the 1% threshold):
o Responds: probability 1%, value 99$, EV contribution = 0.99$
o Doesn't respond: probability 99%, value -1$, EV contribution = -0.99$

= 0.99$ – 0.99$ = 0$ (equivalently, per 100 targeted consumers: 99$ – 99$ = 0$), i.e., exactly break-even.

EV for evaluating and comparing classifiers:

- From the confusion matrix (shares of the 100 test instances):
o True positives: probability 51/100 = 51%, value 99$, EV contribution = 99$ * 51% = 50.49$
o False positives: probability 6/100 = 6%, value -1$, EV contribution = -1$ * 6% = -0.06$
o False negatives: probability 5%, value 0$, EV contribution = 0$
o True negatives: probability 38%, value 0$, EV contribution = 0$

= 50.49$ – 0.06$ = 50.43$

Visualizing:
Thresholds
- In classification, each instance is given a class and a score.
o The classifier is not perfect and makes mistakes.
- The model can predict positive above a certain score (the threshold).
o A score of 0.99 means the instance is predicted positive, which can be true or false. In this case it is true, since the instance actually is positive.

o At 0.65: 10 above are classified as positive, 6 were positive and 4 were negative.

- The threshold with the highest score/profit/accuracy should be chosen.


o This can be calculated based on the confusion matrix/cost (see profit curve).

Profit curve
- Graph with the profit of different classifiers based on different thresholds.
- The threshold with the highest profit should be chosen.

- Example:
o Profit = 9$ per responder and it costs 5$ to advertise, so the value matrix is (true positive = 9$ - 5$ = 4$, false positive = -5$, negatives = 0$):
4$ -5$
0$ 0$

o Profit curve:
 If 0% is targeted, profit is 0. (This is the case when there is no positive
classification. It estimates everything to be negative)
 Choose the threshold with max profit, in this case almost 50%.

o Budget:
 Another way to use the model: you have a budget of 40,000, so you can reach 8,000 people (at 5$ per offer) out of 100,000. You should generate offers for the 8% with the largest scores, i.e., read the profit curve at 8% – this will give a profit of 100$.

ROC graph

Each point in a ROC graph corresponds to a specific confusion matrix.

- The closer a classifier's point is to the upper left corner, the better.
- The diagonal line in the middle corresponds to random guessing.

How to: (number 3)

Cumulative response and Lift curve


- Graph showing the percentage of positives targeted based on the total number of test
instances (decreasing by score).
- Also called hit rate
Lift curve:
- Graph showing the ratio of the hit rate (% true positive / % test instances)

- Ex: 40 % true positive / 20% test instances = 2.0

- The more instances we target, the smaller the share of them that will be true positives.
o Ex. 35% / 25% = 1.4

Cohen et al.
- About Peer-to-peer lending.
- About Jasmine Gonzales; Young professional wants to diversify her portfolio.
- Data:
o Loans categorized from A – G. A = safest and G = riskiest. This needs to be balanced.
o Loans from 2007- 2017. 100 features (Loan amount etc.)
o For simplicity, we only look at expired loans (default 13.9% and fully paid = 86.5% in total)
- Project:
o (1) Introduction:
 (1) She must figure out how much to invest here and other places (this requires data about other
places) – She must figure out how much to invest.
 (2) Objective: Get the highest return based on her risk tolerance etc.
 (3) Is old data ok? Tend to indicate patterns
 (4) Attributes differences (some are grouped, some changes)
 (5) Leakage – some data is not available at prediction time; it should not be used for predicting, even though it might have a great impact on the model.
o (2) Combine all data to one data set etc.
o (3) Data exploration
o (4)
 Do attributes affect other attributes (Bayes)?
 Time aspect? Solved by looking at different time periods.
Exercise 8: Model evaluation
Topic Decision analytic thinking I: Model evaluation and ethics
Learning 1, 4
Objective
Activities to be
done before next
class

Today’s exercise objectives


 Implement several different machine learning models for comparison
 Evaluate regression models with different loss functions
 Evaluate classification models with various metrics
9. DECISION ANALYTIC THINKING II: CORRELATION AND CAUSATION
Topic Decision analytic thinking II: Correlation and causation
Learning 1,2,3
Objective
Syllabus DSB p. 279-314
Activities to be Readings
done before next
class
Exercise Causal Inference

Session 9
More classification metrics
Prediction vs. Explanation

A provocative article from 2008: stop looking for explanatory models – the numbers speak for themselves. The perception has since changed.

Google Flu Trends

Predicted flu outbreaks better than domain experts – just software developers making predictions based on data. It seemed like a huge win for big data, with clear benefits for society.

But this became the limitation: the model missed the swine flu outbreak. Google updated the model, but errors kept coming; it overestimated outbreaks and no longer made accurate predictions. The project failed.

Why did it fail?

The methodology behind Google Flu Trends: relatively few data points but around 50 million attributes (highly dimensional) – a big risk of overfitting.

Takeaway
Correlation isn’t aways enough.
Tons of data doesn’t nec lead to truth
Obvious when looking at stupid correlations:

Corr from caus

Causality is important in BD
Statistical observations
Correlation
When two variables display an increasing or decreasing trend.
- A shared increasing/decreasing trend between variables.
For example, X and Y are correlated if observing a change in X tells you that Y will either increase or decrease.
- E.g., as sales of pumpkin spice lattes go up, would we expect sales of winter coats to go up as well?

Association
When one variable provides information about another variable.
- The terms are often used interchangeably.
- Correlation is a specific type of association (association is the broader concept); not all associations are correlations.
- If knowing something about X tells us something about Y, there is an association; if they share an increasing/decreasing trend, it is a correlation.

Causation
When a change in one variable causes a change in another variable.
- If we intervene and change the causal variable, we get the desired change in the outcome.
The gold standard for identifying causation is randomized controlled trials.
- A type of experiment: the control group gets a placebo, the treatment group gets the treatment.

Statistical observations – relations


- Association does not imply causation.
- Correlation implies association, but not causation.
- Causation implies association, but not correlation.

What can we do with these statistical observations?

Prediction
Forecasting future observations.
Prediction is what machine learning is generally good at. Prediction is where association is enough (oftentimes).
- Association

Explanation
Accurately describing the causal mechanisms underpinning observations.
- Haven’t done this in course
- Ml not good at this
Explanation is what machine learning is generally not good at. Explanation is where association is not enough (oftentimes).
- Requires > we Know cause and effect relationships
- In business – we often want explanation.
- ML > only predictions and association
o Indicative of association, but not necessarily

Closely related but not always in agreement

Prediction without explanation


Using known causally irrelevant attributes to predict future observations.
“Sam wears a size 8 shoe and is likely to score better on the SAT test than Taylor, who wears a size 1 shoe.”
- Shoes doesn’t casualy cause score (buying bigger size wouldn’t help)
- Corelates with age > intelligence > scores

Prediction with explanation


Using “known” cause-and-effect relationships to predict future observations.
“Grandpa is more likely to develop lung cancer than the average person because he smokes two packs of cigarettes every day.”
- Best case scenario.
- It is very difficult to know such associations with 100% certainty.
- Based on smoking behaviour we predict lung cancer.
o A causal relationship that can also be used as an explanation; it is an indicator of causation.

Explanation without prediction


“Knowing” about instances of cause-and-effect that can’t be generalized to make new predictions.
“Donald Trump won the 2016 election because voter turnout wasn’t as high as expected, third-party candidates split the
Democratic block, and Facebook didn’t crack down on fake news.”
- Interpretive and plausible; gives a feeling of understanding.
- Not the same as being able to predict something.
- The explanation fits the historical data – it may be overfitting (verbally).

Data-driven explanation without prediction

Example: a study found that some words get retweeted more – a phenomenon called "moral contagion." This turned out to be misleading: when the study was repeated with different Twitter data, the result did not replicate well compared to the originally observed model. Don't rely blindly on big observational data.


Group discussion
 Form a group of 3-5 and discuss whether the business problems on the next slide require predictive or explanatory
answers.
 Then, come up with your own example of a predictive business question and your own example of an explanatory
business question.

1. A café wants to reduce food waste, so they want to know which days of the week are going to be busy to stock
accordingly.
- Prediction: It doesn't matter why a day is busy; they just need to know when a day is likely to be busy.

2. A telecoms company wants to know why people churn so they can design new product packages to prevent it.
- Explanation: "why" implies causality. They could predict which customers are likely to churn based on historical data, but that wouldn't explain why they churn.

3. A YouTube influencer wants to know how to make a viral video.


- Explanatory: they could use prediction to find video attributes that correlate with views, but those features wouldn't necessarily explain why a video goes viral.

Rule of thumb: "why" and "how" questions are explanatory; "if" and "will" questions are predictive.

Other

Predictive examples:
- Are sales likely to increase next quarter?
- Where might traffic jams occur?
- Which students are likely to drop out?
- How many claims will an insurance company get?

Explanatory examples:

- How can we increase sales next quarter


- Why do traffic jams occur
- Why do students drop out of school
Thinking About Causality
Conceptual tools help us make causal assumptions visually explicit, so we avoid misinterpreting correlation as causation.

When does association mean causation?

These are conceptual tools, not statistical ones; they help us judge whether an association is indicative of causation or not.

The ladder of causation


A theoretical framework. In any data analysis, the task is to identify which level you can operate at:
- Association: no causality.
- Intervention: causality – what happens if we manipulate a variable?
- Counterfactuals: the highest form of causal reasoning, which can answer all explanatory questions.

To move from association to intervention, we need controlled experiments where we manipulate a variable. The problem with observational data is that we can't do experiments, so we are stuck at association; this is difficult, as we often can only analyze association.

To move from intervention to counterfactuals, we must make assumptions. No amount of data can get us to this level of understanding; making causal assumptions is not purely data-driven.

Directed acyclic graphs (DAGs)


A way of visually representing causal assumptions as nodes (variables) and arrows (causal relationships).
They’re directed, because arrows can only point in one direction, and acyclic, because they do not allow for cycles between
nodes (a variable cannot causally affect itself).
- A tool for thinking about causality in big observational data sets.

Why DAGs are useful in the context of big data


DAGs help us identify when associations might or might not be indicative of causation; they help us draw valid causal
inferences from observational data.
DAGs show us when including certain attributes (covariates) in a model influences the accuracy of effect estimates (i.e.,
correlation coefficients; feature importance) for better or worse.
- If you are purely focusing on prediction, you don't need DAGs; you never look at feature importance, so it doesn't matter.
- If you try to learn from regression coefficients or feature importance, DAGs are important: which attributes matter, which do not, etc.

DAG terminology
 Outcome: a dependent variable; the target variable.
 Exposure: an independent variable; the attribute(s) you’re interested in.
o Effect of exposures on an outcome
 Ancestor: a variable that causally affects another variable, influencing it either directly (ancestor → X) or indirectly
(ancestor → mediator → X). Direct ancestors are also called parents.
 Descendant: a variable causally affected by another variable, either directly (X → descendant) or indirectly (X →
mediator → descendant). Direct descendants are also called children.

 Path: a sequence of edges that connect a sequence of nodes. In a DAG for observational data, a path is a sequence of
arrows connecting variables. The arrows of a path need not point in the same direction.
 Causal path: a path that consists only of chains and can transmit a causal association if unblocked.
 Noncausal path: a path that contains at least one fork or inverted fork and can transmit a noncausal association if
unblocked.
 Adjusting/controlling for a variable: introducing information about a variable into an analysis (e.g., adding a variable
into a multiple linear regression).
o E.g., stratification, matching, or adding attributes into our models.

https://journals.sagepub.com/doi/full/10.1177/2515245917745629

Drawing a DAG: a quick DAGitty demo


http://www.dagitty.net

We don’t know if its correct, but reasonable

Drawing a DAG: conceptual step-by-step


 What do you care to know? What is your outcome? What is (are) your exposure(s)? You can have multiple exposures.
o In a causal way
 Beyond your exposure(s), what are other plausible ancestors of your outcome?
o Expanding the graph.
 What are other plausible ancestors of your exposure(s)?
 Are there any plausible causal relationships among the ancestors your just identified, the exposure(s), and the
outcome?
 Are there any unobserved variables that could play a causal role among your variables?
o Think beyond the attributes present in your dataset.

Drawing a DAG: considerations


 There’s often more than one defensible DAG
o A DAG is built on assumptions – it is a visualization of your assumptions.
o Causal thinking made transparent.
o Common ground
 Finding it impossible to come up with a DAG? Maybe you shouldn’t be working on the project without more domain
knowledge...
o This also holds if you are a data scientist/consultant and hit "writer's block" when drawing the DAG.

Covariate roles
DAGs quickly get big, messy, and complex, but they are built from basic structures – little modules of nodes:

 Confounder
 Mediator
 Collider

Confounder
A variable that is an ancestor of both the exposure and outcome variables; a common cause.
Failing to adjust for confounders in your model leads to inaccurate effect estimates (i.e., correlation coefficients; feature
importance)

Example: there is no causal relationship between X and Y, but Z is a confounder of both, creating an association despite no causation.

Whether or not confounders exist in your data set, they exist in the real world.

Side bar: omitted variable bias


When a relevant, confounding variables is unobserved, ignored, or otherwise omitted from a model, in turn giving rise to a
spurious association between an exposure and outcome.

E.g., failing to control for weather when modelling the influence of ice cream sales on drownings, given that

[E = ice cream sales] ← [weather] → [swimming] → [O = drownings] 


When a confounder is part of a bigger chain, it can act as a proxy confounder.

Mediator
A variable that is a descendant of the exposure and an ancestor of the outcome.
Adjusting for mediators in your model leads to inaccurate effect estimates (i.e., coefficients; feature importance); it removes or
“blocks” association despite causation.
Statistical association flows through the arrows, passing through nodes.

Chain: X → mediator → Y. Adjusting for the mediator suggests NO causal relationship between X and Y; there is one, but if you include the mediator, the model will say there is not, which is wrong.

Example: a logistic regression to classify churners. If a mediator is included, the model output can mislead you into thinking there is no relationship between subscription type and churn.

Collider
A variable that is a descendant of both the exposure and outcome variables.
Adjusting for colliders in your model leads to inaccurate effect estimates (i.e., coefficients; feature importance); it introduces
association despite no causation.

With a collider there is a path between exposure and outcome, but NO causal path.

Example: there is no causal relationship between appearance and acting skills in general, but within a sample of successful actors (the collider), actors tend to be attractive and/or talented, so the two appear to trade off against each other.

Side bar: selection bias


When the way data is sampled induces a spurious correlation between an exposure and outcome by unwittingly adjusting
for a collider.

E.g., only analyzing data on restaurants in business when modelling the relationship between location quality and food quality,
given that
Seems rare and easy to avoid. Often end up adjusting without knowing.
[E = good location] → [restaurant success] ← [O = good food]
- Only look at restaurants in business; adjusting for a collider
- Neg correlation – no causal relationship.

- The raw association makes no distinction between causal and non-causal relationships.
- (From the lecture example: don't include the "happiness" variable!)
- We want to block the non-causal ("pink") paths while keeping the causal paths open.

Exercise
1. Go to http://www.dagitty.net/learn/graphs/index.html
2. Complete the game at the bottom of the page to test your knowledge of DAG terminology
3. Go to http://www.dagitty.net/learn/graphs/roles.html
4. Complete the game at the bottom of the page to test your knowledge of covariate roles

So... what’s the problem?


 Sometimes it’s obvious when there are confounders, mediators, and/or colliders, but sometimes it’s not.
 Sometimes it’s genuinely unclear whether a known variable is a confounder, mediator, or collider.
 Sometimes (oftentimes?) we’re entirely unaware of causally relevant variables.
o E.g., confounders that are not in the dataset.
 Adjusting for covariates can either increase or decrease the accuracy of causal effect estimates.
 Evaluating models on prediction metrics (e.g., R-squared; classification accuracy) tells us nothing about whether our
causal assumptions are correct.
o This is where the trouble lies

What can we do?


 Focus on prediction problems and ignore all this causality stuff; but then much of big observational data becomes useless for answering explanatory questions
 Find sufficient adjustment sets
o Sets of covariates that, once adjusted/controlled for, close all biasing paths (pink) while keeping desired
causal paths open
 Find instrumental variables
o Observable variables that can be included in a model to account for unobservable (latent) confounders
o Think of something that is a confounder but is not in the dataset
o Ask what observable proxy we can include for it
 If you do prediction anyway: make your assumptions and their implications explicit

Summary
 Association is when one variable provides information on another variable (correlation is one type of association).
 Causation is when a change in one variable causes a change in another variable.
 Common machine learning models identify association, not causation.
 Prediction involves forecasting future observations. It’s what machine learning is generally good at; it’s when
association is (often) enough.
 Explanation involves accurately describing the causal mechanisms underpinning observations. It’s what machine
learning is generally not good at; it’s where association is (often) not enough.
 In order to infer causation from data, we need to make causal assumptions. No data can, by itself, show that there is a causal
relationship.
 DAGs help us define our assumptions and spot covariates with different roles so that we might draw valid causal
assumptions.
 We want to adjust for confounders.
 We do not want to adjust for mediators or colliders.
 Accurately constructing a DAG is hard. There is often more than one defensible DAG for a given scenario.
 The best way to avoid misinterpreting correlation as causation is by framing business questions as predictive tasks,
rather than explanatory tasks.

Question: Which statement is NOT true about prediction and explanation?
Answer: “Machine learning is generally good at explanation, not prediction” – this is the false statement; ML is NOT good at explanation.
- True: Machine learning is generally good at prediction, not explanation.
- True: Prediction is when association is (oftentimes) enough.
- True: Explanation is when association is (oftentimes) not enough.

Question: Select the predictive problem(s).
Answer:
- Which customers are most likely to churn? – predictive (not why, but who is most likely).
- Will sales increase next quarter? – predictive.
- What characteristics of a news article make it get shared more online? – explanatory.
- Why do flu outbreaks occur? – explanatory.

Question: You've just fit a multiple linear regression to some data and now you're inspecting the coefficients. What's one reason why the coefficients could be inaccurate measures of causal effects?
Answer:
- Your model neglected a confounder – can be an unobserved confounder, like the ice cream example.
- Your model adjusted for a mediator – we don’t want these; adjusting for one blocks the causal pathway, making it look like there is no causal effect when association is just passing through another node.
- Your model adjusted for a collider – we don’t want to include these; doing so shows an association where there is no causation.
- (A low R-squared is not, by itself, a reason the coefficients are causally inaccurate.)

Ch. 11-12 (DSB, the course book)

Expected Value Framework


- In both of the following cases, the EV framework is used to understand the true business problem:

Case Study: Targeting donors with mail.

- Look at:
o (1) Probability of donating
o (2) How much they will donate ← this is what we want to maximize.

- EV for targeting:

Expected benefit of targeting = P(R|x) * (dR(x) − c) + [1 − P(R|x)] * (−c)

- P(R|x) = probability of response given customer x.


- dR(x) = value/donation we get from a given customer x when they respond.
- dNR(x) = value/donation we get when there is no response = 0, so it drops out of the equation.
- c = cost of targeting
- We want this to be greater than zero.

Case Study: Churn Example


- About customers changing company when their subscription runs out.
- Business objective; minimize money lost because of churn (some customers are more valuable
than others, and these should stay).

- EV for targeting:

Expected benefit of targeting x = EBT(x) = P(S|x, T) * (uS(x) − c) + [1 − P(S|x, T)] * (uNS(x) − c)

- S = stay

- EV for NOT targeting:

Expected benefit of not targeting x = EBnotT(x) = P(S|x, notT) * uS(x) + [1 − P(S|x, notT)] * uNS(x)

- We want to target where EBT(X) > EBnotT(X)
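A minimal Python sketch of plugging numbers into the two expected-value formulas above. The probabilities, values, and cost below are hypothetical placeholders, not figures from the book.

# Expected value of targeting vs. not targeting a customer with a retention offer.
def expected_benefit_targeting(p_stay_if_targeted, value_stay, value_churn, cost):
    # EBT(x) = P(S|x,T)*(uS(x) - c) + (1 - P(S|x,T))*(uNS(x) - c)
    return (p_stay_if_targeted * (value_stay - cost)
            + (1 - p_stay_if_targeted) * (value_churn - cost))

def expected_benefit_not_targeting(p_stay_if_not_targeted, value_stay, value_churn):
    # EBnotT(x) = P(S|x,notT)*uS(x) + (1 - P(S|x,notT))*uNS(x)
    return (p_stay_if_not_targeted * value_stay
            + (1 - p_stay_if_not_targeted) * value_churn)

ebt = expected_benefit_targeting(0.80, value_stay=200, value_churn=0, cost=10)   # 150.0
ebnt = expected_benefit_not_targeting(0.60, value_stay=200, value_churn=0)       # 120.0
print(ebt, ebnt, "-> target" if ebt > ebnt else "-> don't target")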

How to obtain data:


- Benefits and costs can easily be found by looking back, but it is hard to estimate the probabilities.
o Problems with looking back:
 (1) Perhaps offer has changed
 (2) Selection bias. We want data to be representative of customers in general, but
they may have been selected because they seemed like good targets in the past.
- We do have data on people staying without being targeted, but not data on people staying when
targeted.

Another technique in Data Mining:


- Most widely used are: Classification, regression and clustering – but there are many more!
- Co-occurrences and Associations:
o About finding items that go together
o Analyzing market basket data (ex. people buying a sandwich and also a water).
o If A occurs, B is likely to occur as well.
o Measurements used:
 (1) Support of rule = Frequency = % of transactions where it applies
 (2) Confidence of rule = Probability of B given A = P(B|A).
 (3) Lift = ratio of the probability of A and B occurring together to the product of their separate
probabilities.
 If Lift = 1, they are independent of each other.
 If Lift > 1, there is a connection between the two → they co-occur more frequently than expected.

Lift = P(A, B) / (P(A) * P(B))

 If
o 30% of transactions involve beer
o 40% of transactions involve lottery tickets
o 20% of transactions involve beer and lottery tickets
o 0.3 * 0.4 = 0.12 → probability of beer and lottery co-occurring if they were independent.

Lift = 0.2/(0.3*0.4) = 0.2/0.12 = 1.67

o This means that there is an interesting connection between the two.
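The same beer-and-lottery calculation from the notes, written out as a short Python sketch:

# Support, confidence, and lift for the beer & lottery example
p_a = 0.30     # P(beer)
p_b = 0.40     # P(lottery tickets)
p_ab = 0.20    # P(beer and lottery tickets)

support = p_ab                 # frequency of the rule
confidence = p_ab / p_a        # P(B | A)
lift = p_ab / (p_a * p_b)      # > 1 means A and B co-occur more than expected

print(support, round(confidence, 2), round(lift, 2))   # 0.2, 0.67, 1.67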


Exercise 9: Causal Inference
Topic: Decision analytic thinking II: Correlation and causation
Learning objectives: 2, 3, 4
Activities to be done before next class:
- Understand the fragility of correlations
- Recognize how correlations can be misinterpreted as causation
- Draw a DAG to visualize your causal thinking
10. CONCLUSIONS

Topic: Conclusions
Learning objectives: 1, 2, 3
Syllabus: DSB p. 315-347
Activities to be done before next class: Readings
Exercise: Project workshop

Session 10
Course narrative in a nutshell
 Digitization and increased computing capacities generate data of such great volume, variety, and velocity that it’s
been dubbed “big data”
 Big data is capital. Big data revolutionizes business by transforming traditional business models (e.g., selling books)
and by creating entirely new business opportunities (e.g., data-driven advertising)
 Machine learning (ML) models are needed to extract value from big data. But translating qualitative business
questions into quantitative ML tasks isn’t always straightforward
 There are many different models, each of which suits a particular type of task
 Supervised learning techniques
o Classification is an ML task where you predict a categorical target variable (supervised learning; e.g.,
logistic regression)
o Regression is an ML task where you predict a continuous target variable (supervised learning; e.g., linear
regression)
 Unsupervised:
o Clustering is an ML task where you group together similar instances into “clusters” (unsupervised
learning; e.g., K-means)

 Sometimes big data is unstructured (e.g., text data). Here you need to do some feature engineering to get machine-
readable attributes for ML models (e.g., text-as-data; TFIDF)

 To evaluate ML models, you need to compare them with appropriate metrics and against an appropriate
baseline (especially important if the model is used for, e.g., medical purposes)
o E.g., loss functions for supervised tasks; separate metrics for clustering
o Important: think about which metric suits the task
 But measures of model performance don’t tell us everything we need to know before deployment. The ethical
implications of a model should be anticipated, not reacted to
o Beyond technical evaluation techniques
o Even if R-squared is high and the model performs well technically...
o ...should it be deployed? That is an ethical question
o Consider the implications of deploying a model
 Moreover, common measures of model performance (e.g., R-squared, loss functions, classification reports) do nothing
to distinguish association from causation, meaning that it’s easy to misinterpret outputs
o Causality
 Big data and ML brings many opportunities, but it’s not a panacea

Pick one task and try out different models; know the pros and cons of each (see the sketch after the table).
Task            Model covered in lecture and/or reading          One strength of the model
Classification  Decision trees                                   Interpretability
Classification  Logistic regression                              Robust to variance in training data
Classification  KNN classification                               No training time; “lazy learner”
Classification  (Multinomial) Naïve Bayes                        Good for high-dimensional data (e.g., text)
Regression      Regression trees                                 Non-parametric
Regression      Linear regression                                Flexible (add polynomial terms, etc.)
Regression      Regularized regression (i.e., ridge and lasso)   Guards against overfitting
Regression      KNN regression                                   No training time; “lazy learner”
Clustering      K-means                                          Easy to implement
Clustering      Hierarchical clustering                          Interpretability
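As a sketch of trying several models on one task, here is a minimal scikit-learn loop on toy data; the chosen models and default settings are illustrative assumptions, not a recommendation.

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Toy classification data standing in for your own dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy = {scores.mean():.3f}")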

Extra resources
On big data’s business value
 Big Data: The Management Revolution (McAfee & Brynjolfsson, 2012)

On turning business questions into ML tasks
 IBM’s CRISP-DM guide
 Why the Data Revolution Needs Qualitative Thinking (Tanweer et al., 2017)

On statistics and models
 StatQuest YouTube channel
 Machine Learning University visual explainers

On big data ethics
 Six Provocations for Big Data (Crawford & boyd, 2012)
 Algorithms rule us all (VPRO Documentary, 2018)

On causality
 DAGitty
 Thinking Clearly About Correlations and Causation (Rohrer, 2018)
Expectations for the paper
 Clearly and concisely tell me what you did and why you did it
 Use course content and synthesize it in a clever, logical, justified way
 You don’t need to limit yourself to course content (e.g., if it suits your project, use XGBoost, feature selection
techniques, PCA, DBSCAN, whatever), but course content should be the focus
 The performance of your models is not connected to the grade you receive per se. It’s more important to show that
you understand your models and demonstrate how to improve them — an excellent project leaves no stone unturned
o A low score plus showing how to improve it is better than a perfect score with no reflection

Paper

Structuring the paper


Bolded = subheaders
 Introduction: Introduce the business/societal problem you’re tackling; make it clear why big data and ML suit the
problem
o Why is it interesting, and why does it make sense to use big data?
 Data: What dataset are you using? Why is it the best dataset for your problem?
o Description: Where did you get the data? What’s the target variable (if applicable)? What’s the distribution of
your target variable? What are your attributes? How many instances are there? Is there any missing data?

o Preprocessing: Did you create dummy variables? Did you remove or impute missing data? Did you balance
the dataset? Did you scale your attributes?

 Modelling: What task are you doing? What key considerations guided your modelling choices?
o Model A: e.g., we first implemented a KNN classifier because... and tuned the k hyperparameter
 Don’t have to tune all, but explain and justify WHY you chose the one
o Model B: e.g., then we implemented a logistic regression because... and tuned the C hyperparameter with
gridsearchCV...
o Model C: e.g., then we implemented XGBoost because... and tuned hyperparameters with gridsearchCV,
including max_depth, eta...
o Baseline model: e.g., a dummy classifier, in addition to the three models above (KNN is also often used as a baseline)
 Results: model evaluation (probably with a table showing metrics for each model)
o Put the results in a table; just show the statistics
o Report only the final, tuned version of each model
 Include training score and test score
 Keep the reporting of statistics neutral
 Discussion: Summarise your interpretation of the results. Reflect on any limitations or potential ethical concerns.
o Interpret the results as humans
o Ethics can be its own section; for some contexts it is clearly relevant, otherwise explain why it is not
o E.g., issues around using private data
 Conclusion: Would you recommend deploying any of your models? If so, why? If not, what more would you need to
do/know? (keep this brief!)
o Could be part of the discussion section
o Imagine you were a consultant: would you recommend deploying one of the models, and what more would you need to know before doing so?
 References: not included in the 15-page limit

How many iterations on each model?


E.g., decision tree: first with no max depth, then adjust.
Don’t report scores for every iteration; start with a baseline, then adjust. Show what you tuned and the settings you used.
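A minimal sketch of the baseline-then-tune workflow described above, on toy data; the grid values, scoring choice, and dataset are illustrative assumptions rather than a prescribed setup.

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X_train, y_train = make_classification(n_samples=500, n_features=10, random_state=0)

# Baseline: a dummy classifier that always predicts the most frequent class
baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X_train, y_train, cv=5)
print("baseline accuracy:", baseline.mean())

# Tuned model: logistic regression with C chosen by GridSearchCV
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print("best C:", grid.best_params_, "cross-validated score:", grid.best_score_)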

Paper dos and don'ts


DO
 Do attach your Python notebook as a separate file
 Do use headings, sub-headings, and sub-sub-headings — keep it structured!
 Do include references to motivate your business case, justify your modelling decisions, and/or support your reflections
 Do show all the relevant steps of your work
DON’T
 Don’t include big chunks of code in the main text
 Don’t include elaborate visualizations. We didn’t cover that in this course
 Don’t waste space demonstrating rote memorization (e.g., don’t describe how to fit a line with Ordinary Least
Squares)

Oral Exam

Format of the oral exam


 As a group; you’re all in the room together with the examiner and the co-examiner
 20 minutes per student (e.g., a 3-person group gets up to 60 minutes, including grade deliberation and delivery)
 Many groups start by briefly presenting their project (~5 mins.), but this is not required
 Examiner and co-examiner lead a discussion to assess your understanding of course content and fulfillment of the
course learning objectives
 Examiner and co-examiner deliberate grade with you outside the room, then call you in to deliver the grade

Expectations for the oral exam


 Every group member contributes; every group member is responsible for the entire project (it’s not guaranteed that
every group member gets the same grade)
 Two kinds of questions may come up during the discussion:
 Trivia-like questions
o E.g., what’s the difference between linear and logistic regression?
 Project-based questions (it will mostly be this kind of question)
o E.g., which model evaluation metric do you think is most important for your use case?

Oral exam dos and don'ts


 Do make yourself comfortable
 Do bring a hard copy of your paper if it helps you
 Do be honest about what you know
 Do be honest about problems you might have had programming
 Don’t interrupt your peers. Ask before providing help on a question
 Don’t fight with me about your grade. I can’t change it once it’s been delivered
Algorithms:
- Gradient boosting: weak trees that become strong together.
- These are ensemble models.
o CatBoost: builds a single model out of many small trees!
 It can handle both strings and numbers (categorical) – but it does not understand empty fields, so we add
-9999 and "unknown".
o LightGBM:
 Inserts the mean where numbers are missing.
 Only understands numbers, not strings.
 Categorical encoding: it assigns each category a code. One could also use one-hot encoding, which creates
many features with 1s and 0s.
o Combining the two: it did not work out for us; we just took the average?
o We could also have used XGBoost (the first of these to appear).

Full trees:
- J48: built using Information Gain; takes the most informative split each time.
- Random forest: builds many full trees (J48-style trees) and takes the majority vote (the prediction that occurs most often).
Find an illustration online.
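A minimal scikit-learn sketch of the two kinds of tree ensembles described above, on toy data; the hyperparameters are defaults and the data is made up, purely for illustration.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Bagging of full trees with majority voting
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# Sequentially boosted weak trees
boosted = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X, y)

print(forest.score(X, y), boosted.score(X, y))   # training accuracy only, for illustration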

Extra:
Precision / Recall for evaluating.
- Used to judge how important it is to get this amount of true positives / false negatives, etc.
- How important is it?
- Precision: true positives / (true positives + false positives).
- For us, a wrong guess does not cost much!
- False positives cost us money, and CatBoost has few of these!
- False negatives also cost us money, because those customers don’t buy, since they don’t know your company.
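A short sketch computing precision and recall with scikit-learn; the labels and predictions below are made up.

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual class, columns = predicted class
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)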

We swapped the Confusion Matrix around:


- True Expected Value:
o 4995$ * 0.… = 1328$, which is what we earn per person who walks into the shop.

Nominal = Female = F, Male = M, or 0 and 1; it counts how many there are of each.

Log loss: just the same as for the binary model.

To find out how good the model can get, we can hold out part of the training set and test on it.

The test data is not used until we get to Python!

*** Some records appear more than once! WHY???


- They answer the questionnaire twice, stating whether they have bought or not. The questionnaire is sent out twice a year.
- Is it representative, given that the people who receive the questionnaire have an interest in motorcycles?
o People can also make typos? – Used Google Maps to find the address.

- AveMonthSpend: deleted because the test data does not have it ???

The Python code:
Pandas: can build tables (DataFrames) in Python and fill in values where cells are empty – Python preprocessing

- Reduce memory usage, by about 10% (this was for the fraud data).


- We add an index to be able to merge. If we did it on CustomerID, it would not all work because some records come
earlier.
- Then we print how many null objects there are, to see how many cells are empty. E.g., "88 non-null" means there are 88 that are not
empty.
- Remove the customer IDs, because we had too many, and add them back in the order we wanted.
- We removed colons etc.; we could not convert it to an .arff file because there were many symbols it did not understand.
- Clean the test data in the same way.
- Remove features that are 90% identical or 90% missing (see the sketch below).
- Define what is numeric (e.g., income) and what is categorical/string (e.g., phone number).
- Check that the test set does not contain AveMonthSpend or BikeBuyer.
- Create Weka files.
- We make a back-up, so we keep the original.

- ”Balance in training set” = the baseline, like ZeroR.
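A minimal pandas sketch of the cleaning steps described above; the DataFrame, column names, and thresholds are made up for illustration.

import numpy as np
import pandas as pd

# 10 toy rows; "PhoneNumber" is 90% missing on purpose
df = pd.DataFrame({
    "CustomerID": range(10),
    "Income": [55000, np.nan, 72000, 61000, 58000, np.nan, 64000, 70000, 52000, 67000],
    "PhoneNumber": ["555-1234"] + [np.nan] * 9,
})

df.info()   # prints non-null counts per column, like "88 non-null"

# Drop columns that are at least 90% missing (the same idea applies to 90%-constant columns)
mostly_missing = df.columns[df.isna().mean() >= 0.9]
df = df.drop(columns=mostly_missing)

# Impute the remaining numeric gaps, e.g. with the column mean
df["Income"] = df["Income"].fillna(df["Income"].mean())
print(df)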

CatBoost:
- "features": you define an object/array with the features we want to work with.
- Add <miss> in the categorical columns.
- cat_features makes it handle categorical values automatically (the same idea as NomToNumeric in WEKA).
- We have no early stopping; it runs 200 iterations.
- cv_result stands for cross-validation.
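A minimal CatBoost sketch along the lines described above; it assumes the catboost package is installed, and the toy data, column names, and settings are made up.

import pandas as pd
from catboost import CatBoostClassifier

X = pd.DataFrame({
    "Income": [55000, 61000, 72000, 48000],
    "Occupation": ["clerical", "professional", "<miss>", "manual"],  # categorical, missing marked as <miss>
})
y = [0, 1, 1, 0]

model = CatBoostClassifier(iterations=200, verbose=False)
model.fit(X, y, cat_features=["Occupation"])   # CatBoost handles the categorical column itself
print(model.predict(X))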

LGBM
- Cannot handle string categorical features, so we wrote code for it.
o def cat_to_int ← converts all categorical variables to integers (see the sketch below).
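A hypothetical cat_to_int helper like the one mentioned above, turning string columns into integer codes so LightGBM can consume them; the function name and demo data are illustrative.

import pandas as pd

def cat_to_int(df):
    """Map every string/object column to integer category codes (missing values become -1)."""
    df = df.copy()
    for col in df.select_dtypes(include="object").columns:
        df[col] = df[col].astype("category").cat.codes
    return df

demo = pd.DataFrame({"Gender": ["F", "M", None, "F"], "Age": [25, 40, 31, 52]})
print(cat_to_int(demo))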

2 models:
- We messed up! We just took the average of the two accuracies.
- The probability that they do NOT buy! The probabilities are an average of the two models.
- Getting the best score from CatBoost: model 2.
o Validation is 5-fold cross-validation.
- bst.best_score = the LightGBM score.
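A minimal sketch of averaging predicted probabilities from two models, which is what the notes describe; the arrays below are placeholders for predict_proba outputs.

import numpy as np

p_catboost = np.array([0.82, 0.34, 0.57])   # P(buy) from model 1
p_lightgbm = np.array([0.76, 0.40, 0.61])   # P(buy) from model 2

p_ensemble = (p_catboost + p_lightgbm) / 2   # simple average of the probabilities
predictions = (p_ensemble >= 0.5).astype(int)
print(p_ensemble, predictions)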

Feature engineering:
- Take the year born, calculate age, and then create age groups (see the sketch below).
- Do the same for income – create income groups.
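A minimal pandas sketch of the age feature engineering described above; the reference year, bins, and labels are illustrative assumptions.

import pandas as pd

df = pd.DataFrame({"BirthYear": [1960, 1975, 1988, 1999]})
df["Age"] = 2024 - df["BirthYear"]          # hypothetical reference year
df["AgeGroup"] = pd.cut(df["Age"],
                        bins=[0, 30, 45, 60, 120],
                        labels=["<30", "30-44", "45-59", "60+"])
print(df)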

231 is the one we end up with.
