Big Data Management – Notes
SYLLABUS
Books:
MLP: Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists
DSB: Provost, F., & Fawcett, T. (2013). Data Science for Business: What you need to know about data mining and data-analytic
thinking
Other:
PPBD: Bollier, D., & Firestone, C. M. (2010). The promise and peril of big data.
http://www.lsv.fr/~monmege/teach/learning2013/ThePromiseAndPerilOfBigData.pdf
Learning Objectives
Explain the business value of big data and be able to deploy machine learning techniques to analyze it for broad
contexts, such as classification, regression, and clustering.
Evaluate methods for the testing and assessment of data models and critically reflect on the meaning of findings.
Recognize the practical and ethical boundaries of machine learning and big data management.
o Thinking about the social xxx
o Is it useful?
Create a business case by identifying a valuable data set and working collaboratively to apply and justify an
appropriate machine learning technique.
o Synthesize into the group project
Readings:
1. Follows the day's structure and progression
2. Walks through the Python code
3. High-level overview – critique of big data
Group Project & Exam
Form a group of 2-4 students
Get an interesting, relevant data set
Use appropriate machine learning techniques to extract value from the data
Write a paper describing what you did, why you did it, and the limitations faced
Deliverables:
o One-page project plan (optional)
o Final paper (max. 15 pages) -- Group oral exam
Oral exam
As a group
20 minutes per student
Assessing your knowledge of course content
Assessing your completion of the learning objectives
Code?
If anything is important you can include code, but only the actual fitting or outputting of results, or when interpreting something in particular.
Math behind the algorithms/models > he doesn't care about this. He wants to see critical reflections – which sometimes involve
explaining the math, if that is the cause of differences between two models.
1. INTRODUCTION
Topic: Introduction
Learning Objective: 1
Syllabus: MLP p. 1-26; DSB p. 1-18
Activities to be done before next class: Readings
Exercise: Python Basics
Session 1
What is big data?
3 Vs of big data. An easy, intuitive distinction – if, due to one of these, the data can't be processed with normal tools, you probably
are working with big data. Overly simplistic.
What distinguishes big data from just data?
Volume
o Terabytes of data being captured
Variety
o Not only is data being captured, it is captured by different types of computers and stored in different types of formats
o Text, images, videos, etc.
Velocity
o Not just various types of data, data is constantly changing, in motion.
Veracity
o Idea that big data is difficult to assess
Value
o Innovation, revenue
Lots of computers
Machine learning
Since big data can't be processed by normal means, we use machine learning models
AI
Statistics
Downside:
Manual, boring tasks of labeling pictures/data
Most commonly used
No labeled dataset
Trends
Classification
Input labelled training data (supervised)
o Predicting what class given data belongs to
o Output categorical labels from predefined possibilities
o Binary: dog / not dog
o Multiclass > identifying among multiple classes
o Using the same algorithm with some adaptation
Example algorithms:
o k-nearest neighbors
o Logistic regression
o Decision Trees
Example applications:
o Detecting fake news
o Identifying cancer in medical scans
Regression
Input labelled training data (supervised)
Output continuous value
o Predicting a continuous output
Example algorithms:
o Linear regression
o Polynomial regression
o Ridge regression
Example applications:
o Financial forecasting
o Predicting future temperatures
Clustering
Input unlabeled data (unsupervised)
Output categorical labels from unknown possibilities
o We don't know what we are looking for
Example algorithms:
o k-means
o DBSCAN
Example applications:
o Market segmentation
o Identifying player typologies in sports
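A minimal sketch of all three task types in scikit-learn (the course's library); the tiny dataset, labels, and target values below are invented purely for illustration:
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])

# Classification: labelled input, categorical output
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, ["dog", "dog", "dog", "not dog", "not dog", "not dog"])
print(clf.predict([[2.5]]))          # -> ['dog']

# Regression: labelled input, continuous output
reg = LinearRegression().fit(X, [2.1, 4.2, 5.9, 20.3, 21.8, 24.1])
print(reg.predict([[5.0]]))          # -> a continuous value

# Clustering: unlabeled input, groupings discovered by the algorithm
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                    # cluster assignment per instance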
By delegating the decision process to machines we can work around flaws in human decision making
Summary
Digitization and increased computing capacities have given rise to big data.
o Need tools for the 3 Vs (machine learning is one of the tools)
Machine learning involves a suite of computational algorithms for extracting value from big data.
Machine learning can be supervised or unsupervised.
Common machine learning tasks are classification, regression, and clustering.
A phone service provider wants to predict which of their customers might switch to a different provider. They have data
indicating which customers have switched in the past, plus their account details. Which kind of ML task best suits this
challenge?
Classification – predict whether a customer will churn or not churn
Regression – could also be justified: a continuous target variable, the probability of someone churning.
Clustering – probably not, given they have data with a target value (clustering is an unsupervised ML technique), but it could be done in some way
Ch. 1
The book, chap. 1
We can use intuition and back it up with data. Data improves decision making.
- Data science (options) is a set of principles that guide the extraction of knowledge from data. – About finding
valuable/interesting patterns in data.
- Data mining (doing) is the extraction of knowledge from data, via technologies that incorporate these principles. –
This is useful for predicting.
- CRISP: Used to extract useful knowledge from data to solve business problems systematically by following stages.
Data manipulation:
- Python: Programming to manipulate with data to set up machine learning.
- Tableau: Analyzing and visualizing data.
Machine learning:
- WEKA: makes it easy to create models without coding by applying algorithms.
- In early stages determine if the problem should be attacked supervised or unsupervised. If supervised, define the
target value!
Supervised: You mark what the right answer is – the system tries to figure out from the data how to find the right answer. You have a label that you try to predict.
Unsupervised: You don't have a label. You can still organize the data.
Session 2
Focus: Translation from business problem to data-analytic tasks
Business problems:
§ Verbal
§ Vague
§ Interpretive – can be interpreted in many ways
§ Qualitative – depends on the context in which they're asked
§ Often explanatory, even if expressed in predictive terms – explain cause and effect
Data-analytic tasks (in great contrast to vague business questions):
§ Mathematical
§ Precise
§ Positivist – the core: data is valid and independent of the researcher. Reproducible – returns the same result every time you inquire
§ Quantitative
§ Often predictive – with exceptions, of course
Terms:
Instance = Rows
Column = attribute – the independent variable, a covariate
Target = one special column/attribute – the dependent variable we want to predict!
Linear function
Numeric value
The point depends on the business question.
When translating:
Information and context get lost – we need to iterate
Ensure that the translation solves the task
Paper:
Argues many data science projects are affected by hidden biases
True to some extent
UK government: an exam algorithm used to estimate grades of A-level students during the pandemic downgraded students; it upgraded grades
for students at private schools
Interpretivism:
Abductive reasoning: A mode of inference that updates and builds upon preexisting assumptions based on new observations in order to generate a novel explanation for a phenomenon. Methods: iterations of open coding, theoretical coding, and selective coding (Thornberg & Charmaz, 2013). Supervised ML requires labeled training data; beware recycling of training data – we shouldn't assume the labels in the training data are true without investigating them.
Reflexivity: A process by which researchers systematically reflect upon their own positions relative to their object, context, and method of inquiry. Methods: brain dumps, situational mapping, and toolkit critiques (Markham, 2017). Situational mapping: researchers may know very little about the people behind the data points.
How could you incorporate qualitative sensibilities in your approach to this task?
Researcher’s bias:
Affect the business question
Affect the interpretation of data and result
Reflexivity:
Reflecting on your biases
Be aware of your own bias
How the model will be used: if you're allocating more resources to a place where there are more crimes, there will be more
arrests
Reflecting on the implications of how the model is used. Broken windows problem: focusing on markers of disorder can sustain more
serious crime.
Interpretation:
How is the target interpreted?
What is a crime?
An arrest is not necessarily a crime; could be the officer’s fault
Even though it is an arrest in the data sheet, it is not necessarily a crime – interpretivism.
The tool critique; Is this the best data for the problem? Or is it just the one that is available to us
Abductive reasoning:
Can we somehow recode the data? Adding additional attributes; breaking down 'arrest' > not just stealing a bag of chips
Cross Industry Standard Process for Data Mining (CRISP-DM)
The CRISP model (The Cross Industry Standard Process for Data Mining)
Turning BQ into DA task
The model is exploratory; a strategy and approach (a circle – this can be hard for developers to understand) and
has 6 stages: Business Understanding (B.U), Data Understanding (D.U), Data Preparation (D.P), Modeling (M), Evaluation (E), Deployment (D).
B.U:
- Addressing the problem and use scenario/practical business goals to achieve.
- What do we want and how to achieve it?
D.U:
- Important to understand strengths and limitations of data.
- Is the data enough to achieve the goal you want → if not, be creative.
- Was it free/costly?
- Was it collected for this purpose or?
D. P.
1. Data format – you want rows & columns + if supervised; one column with target.
2. Missing values, errors, or other problems.
3. Converting – data types, or numerical to categorical values = discretization, e.g., income → [Low, Medium, High]
4. Leaks – Too good data, that you would not normally have.
M:
- The data mining phase → creating the model – classification, regression, clustering…
E:
- How well did the model work? How often does it give right answer?
- Is it general enough to apply to new data? Hard to do, but important! Think back on business
problem.
- How important is it that it gives the right answer: Depends on situation!
- Are the results valid?
D:
- Does the model produce any value?
o Perhaps we can target customers differently?
Example: Overview
An established electronics e-retailer is facing increasing competition from newer sites. Web stores are cropping up as fast
(or faster!) than customers are migrating to the Web, so the company must find ways to remain profitable despite the rising
costs of customer acquisition. One proposed solution is to cultivate existing customer relationships in order to maximize the
value of each of the company's current customers.
The e-retailer has hired you, a data science consultant, to lead the project.
Business understanding
The practical business goals you want to achieve; the use scenario.
What exactly do we want to do?
How exactly, in data-analytic terms, would we do it?
Legal/admin, financial, business objectives
Involves as many stakeholders as possible; involve in the process.
Together with the e-retailer, you translate those goals into data-analytic terms:
1. With historical data about previous purchases, build a model that identifies items frequently bought together (co-occurrence grouping; "market-basket analysis")
2. With a database containing the personal details of registered customers, build a model to identify different
customer typologies (clustering; “profiling”)
Went from better recommendations to
Data understanding
The data that will provide the basis for a solution.
What data is readily available?
What are the limitations of the available data?
Is different data needed?
How much will collecting different data cost (in time and money)?
Rarely is the data just ready to go; historical data that an organization has collected not for the purpose of clustering but
for other reasons
Explore data
Get descriptive statistics; validate the quality of the data
You recognize that these datasets are probably sufficient, but you also recognize some limitations. For example:
The product database containing information about each product is not linked to the purchase data.
The customer database has lots of missing data because many customers did not fill out all fields in the
registration form. Website where you need to create an account
Data preparation
Most time intensive and frustrating
To handle the missing data in the customer database, you first conduct an exploratory analysis to see whether data are
missing completely at random (MCAR) or missing not at random (MNAR).
- Clustering;
You decide to delete instances with missing data because the customer database is massive and it looks like a case of MCAR.
Modeling
Into the analysis
The machine learning technique used to analyze the data.
What specific algorithms fit best?
Is classification, regression, or clustering the appropriate technique?
What specific algorithm(s) suit the problem faced?
How can the model be tuned?
Iterative: running different default algorithms, comparing and contrasting to see how performance changes
Back to data preparation if needed
Example: Modeling
For the co-occurrence grouping of products purchased, perhaps you try out several different algorithms to compare, like
the “apriori algorithm” vs. the “FP growth algorithm”.
2 unsupervised ML techniques
For the clustering with the customer database, perhaps you've opted to use the "k-means algorithm."
You might then tune the n_clusters parameter depending on how many different customer typologies it makes sense to
identify.
Obviously this is just optimizing on the data
Also remember the business
Evaluation
The assessment of modeling results.
Example: Evaluation
For the co-occurrence grouping of products purchased, you compare the “apriori algorithm” to the “FP growth algorithm”
and report how each algorithm performs on some statistical criteria and the diversity of recommendations made.
How likely is an association
For the clustering with the customer database, you show how different settings of n_clusters may be more interpretable
and actionable than others.
Deployment
The application of the model and production of value.
How can the model be integrated into existing business operations?
How does the use of the model necessitate new business operations?
How should the model be used, adjusted, and re-deployed to realize ROI?
Example: Deployment
The e-retailer is happy with your work and wants to deploy both models. What now?
A web developer is needed to integrate the recommendations generated by your co-occurrence grouping model into the
website UI.
The marketing team is needed to develop specialized promotions and product packages to suit the customer typologies your
clustering model identified.
Finally, you point out that if the e-retailer succeeds in growing their customer base, your models must be re-trained and re-deployed.
Ch. 2
The book, chap. 2
Classification
- Involves defining a small number of classes, and then trying to predict for each instance, which
class they belong to.
Or Sentiment analysis = A classification problem, where texts are classified as being positive or
negative – for this you need texts that are labeled positive or negative (ex. reviews on Trustpilot).
Regression
- Gives a numerical value for each instance.
Similarity Matching
- Compare instances based on their attributes and determine how similar they are.
- For good similarity matching it is important to have information about the relevant attributes +
and figure out which attributes are most important.
- Ex: Recommending motorbikes based on what people with your profile have previously bought.
o Amazon uses this to recommend books similar to those you previously bought.
Clustering
- Group individuals in a population together based on similarities.
- Is not driven by a purpose, - and the groupings are not predefined.
- Is more open-ended than classification and regression (?).
- Ex. Do customers form natural groups? – this can later be used to design campaigns for each
group.
Co-occurrence grouping
- Looks at similarity of objects, based on if they appear together in transactions.
Profiling
- Attempts to characterize typical behavior of an individual, group or population.
- Ex. Fraud detection – is a credit card used differently than it is normally used.
Link prediction
- Ex. If customers rated motorbikes, then they could get a recommendation based on their ratings
of the motorbike.
Data reduction
- Creates smaller dataset based on a bigger. Can make it more precise and easier to work with.
Causal modeling
- Help us understand what actions/events actually influences others.
- Ex. If targeted customers were more likely to buy motorbikes, was it then because we targeted
them or because they would have bought anyway?
o Done by A/B testing: (one control group and then apply marketing to another group)
o Results in a causal conclusion with assumptions – there are always assumptions. Consider the
assumptions before going forward.
Classification — Goal: Predict a class label, a choice from a predefined list of possibilities.
"Will this customer purchase service X if given incentive Y?" This is a classification problem because it has a binary target (the customer either purchases or does not).
"Which service package (X1, X2, or none) will a customer likely purchase if given incentive Y?" This is also a classification problem, with a three-valued target.
Regression — Goal: Predict a continuous number/floating-point number (real number).
"How much will this customer use the service?" This is a regression problem because it has a numeric target. The target variable is the amount of usage (actual or predicted) per customer.
Predicting a person's annual income is also a regression problem.
Overfitting Building a model that is too complex for the amount of information we
have, as our novice data scientist did, is called overfitting.
Occurs when you fit a model too closely to the particularities of the
training set and obtain a model that works well on the training set but is
not able to generalize to new data
Often occurs when the model is too complex
Underfitting Choosing too simple a model is called underfitting.
Occurs if your model is too simple—say, “Everybody who owns a house
buys a boat”—then you might not be able to capture all the aspects of
and variability in the data, and your model will do badly even on the
training set.
Supervised ML Algorithms
k-Nearest Neighbors (k-NN) – Regression
Analyze: Using few neighbors corresponds to high model complexity (as shown on the right side of Figure 2-1), and using many neighbors corresponds to low model complexity.
When using multiple nearest neighbors, the prediction is the average, or mean, of the relevant neighbors.
Evaluate the model: Using the score method, which returns the R2 score (coefficient of determination), a measure of goodness of prediction. Value between 0 and 1, where 1 corresponds to perfect prediction.
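A minimal sketch of k-NN regression, assuming scikit-learn and synthetic data; note how the R2 score from .score() changes with the number of neighbors:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(60, 1))                     # one attribute
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=60)   # noisy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Few neighbors = high complexity; many neighbors = low complexity
for k in (1, 3, 15):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train, y_train)
    print(k, round(knn.score(X_test, y_test), 2))        # R2 on held-out data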
Linear Models – Regression
Linear models make a prediction using a linear function of the input features.
Linear Regression – Ordinary Least Squares (OLS): Linear regression finds the parameters w and b that minimize the mean squared error between predictions and the true regression targets, y, on the training set.
Prone to overfitting data.
Ridge regression: Also a linear model for regression, so the formula it uses to make predictions is the same one used for ordinary least squares.
The coefficients (w) are chosen not only so that they predict well on the training data, but also to fit an additional constraint.
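A minimal sketch contrasting OLS and ridge on the same synthetic data (alpha is scikit-learn's name for the strength of the extra constraint; the values here are arbitrary):
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(30, 5))
y = X @ np.array([3.0, 0.0, 1.5, 0.0, -2.0]) + rng.normal(scale=0.5, size=30)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)   # alpha = strength of the constraint

print(ols.coef_)    # unconstrained coefficients
print(ridge.coef_)  # shrunk towards zero, but not exactly zero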
Session 3
Prediction, Information & Segmentation
What is a model?
A simplified representation of reality created to serve a purpose. Based on some assumptions.
It is simplified based on some assumptions about what is and is not important for the specific purpose, or sometimes based on
constraints on information or tractability.
Supervised segmentation
The process of splitting (segmenting) data into subgroups depending on their attributes and a known target variable.
When done using values of attributes that will be known when the target is not, then these segments can be used to predict
the value of the target variable.
Decision rules/formula
One way of building a predictive model
Selecting informative attributes
How do we select an attribute to partition data in an informative way? Segment the data into groups that are as pure as
possible.
By pure we mean homogeneous with respect to the target variable.
Gini impurity
Ranges from 0 to 0.5
Worst = 0.5 (the classes are split randomly)
Entropy
Entropy = 0 when we are certain of the outcome (e.g., certain that the outcome is negative)
On the right (of the plot): entropy is 0
In between: when neither of the 2 outcomes is certain
A 0.5/0.5 split gives an entropy of 1
Information gain
Information gain = 1 when the children's entropy is 0
Top:
Root node – where the first segmentation takes place; a conditional statement
(is some attribute equal to some value)
Branch node
Another conditional statement
Leaf node
Where the pure groups are supposed to be
Eventually we make pure leaf nodes
Binary attributes
Leaf nodes
Supervised
Needs labeled data
Known target variable
Continuous >
Regression trees
Continuous target variable
Selects informative attributes by the sum of squared residuals (SSR) or mean squared error (MSE)
Computationally expensive
Pruning
1) Models are simple – useful because they're simple
2) Models tend to overfit the data they are trained on
A technique for reducing a decision tree’s complexity (and risk of overfitting) by removing the least informative branches.
Pruning away
E.g., max_depth: parameter specifying the maximum depth of the tree (note: max_depth = 1 means only a root node)
o Most common
o Depth = 1, root node and 2 leaves
E.g., min_samples_leaf: parameter specifying the minimum number of samples required to be at a leaf node
o Minimum number of samples per leaf
o Reduces the computational expense of running the tree
E.g., Random Forest: ensemble algorithm that randomly samples data and attributes to create many data subsets, builds
separate trees for each subset, and then averages the predictions for new inputs
Each tree is different
Additional parameters to tune: n_estimators (the number of trees in the forest)
More trees yield better accuracy, with diminishing returns; rule of thumb > use as many as you have time or computational power for.
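A minimal sketch of the pruning parameters and a random forest, using a built-in scikit-learn toy dataset; the parameter values are arbitrary choices for illustration:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruned tree: limit depth and require a minimum number of samples per leaf
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
tree.fit(X_train, y_train)

# Random forest: many randomized trees whose predictions are averaged
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print(tree.score(X_test, y_test), forest.score(X_test, y_test))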
Boosting
Successively build tree and combine them so that each new tree corrects for the errors of the previous one
Each new tree can learn from errors from earlier trees. The final will be the best one as it learns from its ancestors.
E.g., XGBoost: ensemble algorithm that implements parallel (rather than sequential) gradient tree boosting to minimize a
specified loss function
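XGBoost itself is a separate library; as a sketch of the same boosting idea, here is scikit-learn's GradientBoostingClassifier (xgboost's XGBClassifier exposes a similar fit/predict interface):
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree corrects the errors of the previous ones
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gb.fit(X_train, y_train)
print(gb.score(X_test, y_test))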
Loss function > finds the errors; boosting optimizes this function.
Structured data sets
GDPR
Regulation on algorithms
Recap
1) Which statement is NOT true about models in the context of big data?
Models are intentionally simplistic
Models are useful because they are simplistic
Difficult to interpret
Difficult to implement
Lots of data preprocessing required. (Some is needed – but in general very little compared to other models.)
Prone to overfitting
3) Which of the following are ways to reduce decision trees' risk of overfitting?
Pruning
o Most prominent one.
Tuning
o Variation of pruning – prepruning. How big can the tree grow.
Bagging
o Ensembling methods – also reducing risk of overfitting
Boosting
All of the above
Ch. 3
The book, chap. 3
Model induction
= creating models based on data (training data/labeled data (because they are labeled)).
Supervised Segmentation
- Supervised = data includes the target value (meaning; each instance is labeled)
- Segmentation = Segmenting data into subgroups based on what we want to predict.
→ By creating decision trees.
- Technical implications
1. Attributes rarely split a group perfectly (= pure). If one subset is pure, the other might not be.
2. Best to have one pure subset or more broadly purity?
3. Not all attributes are binary... Has more values. Etc.
These technical implications can be addressed by evaluating how well each attribute splits a set into subsets → entropy, by Claude
Shannon.
Entropy: measures the disorder that can be applied to the set, or how impure the set or subset is.
- The higher entropy, the harder is it to predict anything.
The amount of information the attribute provides depends on how much purer the children are than their parents.
Calculating
Entropy of a set: entropy = −p1 × log2(p1) − p2 × log2(p2) − …
Entropy of NumberChildrenAtHome:
Value | Instances in Total | Correct Proportion ((Total − Wrong)/Total) | Wrong Proportion (Wrong/Total) | Entropy (−p1 × log2(p1) − p2 × log2(p2)) | Share of Total (Instances/16519)
0 | 9990 | 0.8059 | 0.1941 | 0.7100 | 0.6048
1 | 2197 | 0.7533 | 0.2467 | 0.8060 | 0.1330
2 | 1462 | 0.5896 | 0.4104 | 0.9767 | 0.0885
3 | 1066 | 0.6942 | 0.3058 | 0.8883 | 0.0645
4 | 952 | 0.7258 | 0.2742 | 0.8474 | 0.0576
5 | 852 | 0.8392 | 0.1608 | 0.6362 | 0.0516
IG of NumberChildrenAtHome:
If we also did this for YearlyIncome, we would find that this variable is purer than NumberChildrenAtHome.
The highest IG is typically the root node. – but not in our situation…
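A small sketch of the entropy and information gain calculations from the table above (pure Python, nothing assumed beyond the math module):
import math

def entropy(counts):
    # entropy of a class distribution given as a list of counts
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    # IG = entropy(parent) minus the weighted average entropy of the children
    parent_total = sum(parent_counts)
    weighted = sum((sum(child) / parent_total) * entropy(child)
                   for child in child_counts_list)
    return entropy(parent_counts) - weighted

# First row of the table: proportions 0.8059 / 0.1941 give entropy ~0.71
print(round(entropy([8059, 1941]), 2))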
Segmentation for regression:
- Look at variance?
Entropy chart:
Probability estimation:
- Is about assigning a probability of membership to each subset.
- Frequency based estimate: The probability of a new instance being positive is p/(p+n)
- To avoid being able to say that a subset with one instance gives a 100% probability (which amounts to overfitting), we look at the Laplace
correction.
Session 4
What is a predictive model?
A formula for estimating an unknown value of interest: the target variable
The formula could be mathematical, or it could be a logical statement such as a rule.
Parameters are variables in a formula for which the values are unknown.
Linear Regression
A supervised algorithm that learns to predict a dependent variable (”target variable”) as a function of some independent
variable (”attribute"), by finding a line that best "fits" the data.
At this point > just ONE attribute of linear regression models. Same logic applies to multiple attributes.
The target variable must be a continuous value (since it is regression). The attribute can be any type of variable (e.g., binary,
categorical, continuous, ...).
If assumptions not met > fitting model to data could return incorrect interpretations?
Fitting a linear regression
Calculate residuals:
Binary
Fitting LR model
R-Squared ranges from 0 to 1. R-Squared = 1 indicates that a model captures 100% of the variance.
In principle it can be negative?
Coefficient of determination
However:
Doesn't take into account the number of attributes used
Can be non-linear
The equation might look different – it is just 2nd- and 3rd-degree polynomials that show how much the line curves
Logistic Regression
The target variable must be a categorical value. The attribute can be any type of variable (e.g., binary, categorical,
continuous, ...).
P = probability
y1 = target variable (the class label we want to predict) – binary, one class or the other
p(y1) = probability that the target variable is a certain class
f(x) = linear predictor (the linear regression equation!) – can be complicated – today just a single attribute
Plot to the right:
Y-axis = probability, the 0%-100% chance that the house sold for more than its valuation
X = house size
MLE – Log-likelihood
“Maximum Likelihood Estimation” ...finding the line that maximizes the log-likelihood
The equivalent of R-squared for linear regression
Higher = better
Focus on f(x)
Important difference:
These coefficients can't be interpreted as easily as with linear regression
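A minimal sketch of fitting a logistic regression and reading its coefficient on the log-odds scale; the house-size data is synthetic, invented to mirror the plot described above:
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
house_size = rng.uniform(50, 250, size=(200, 1))   # X: house size (made up)
sold_above = house_size.ravel() + rng.normal(scale=40, size=200) > 150  # y: binary

logreg = LogisticRegression(max_iter=5000).fit(house_size, sold_above)

print(logreg.coef_)                    # change in log odds per unit of X
print(np.exp(logreg.coef_))            # the same coefficient as an odds ratio
print(logreg.predict_proba([[180]]))   # p(y) for one new house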
Polynomial regression
Why use logistic and linear regression instead of classification and regression trees?*
Take with a grain of salt > generalizing:
RSS and log-likelihood > quantify how well models are performing based on goodness of fit (GOF).
But GOF is usually not the main marker of performance;
usually prediction accuracy is.
Gradient descent
An optimization algorithm that iteratively estimates a model’s parameter(s) to minimize a specified loss function.
Don’t have to do OLS or MLE.
Starting from an initial estimate, the algorithm takes steps towards the minimized loss function, and decreases the size of the
steps as the slope of the gradient decreases.
In contrast to OLS and MLE > works with any (differentiable) loss function
Slope is 0 = done
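A minimal sketch of gradient descent for a one-attribute linear regression, minimizing mean squared error; the learning rate and iteration count are arbitrary choices:
import numpy as np

rng = np.random.RandomState(0)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=100)   # true w=2, b=1 plus noise

w, b, lr = 0.0, 0.0, 0.01          # initial estimates and learning rate
for _ in range(2000):
    pred = w * x + b
    grad_w = 2 * np.mean((pred - y) * x)   # slope of the MSE loss w.r.t. w
    grad_b = 2 * np.mean(pred - y)         # slope of the MSE loss w.r.t. b
    w -= lr * grad_w                       # step against the gradient
    b -= lr * grad_b

print(round(w, 2), round(b, 2))    # should approach the true 2.0 and 1.0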
Should we care about these “basic” models?
Reflect about bigger picture
Summary
Parametric modeling is a technique for building a predictive model whereby a formula is pre-specified and parameters
are estimated (or “learned”) from some data.
Linear regression is a parametric model that fits a linear function to data in order to generate continuous predictions.
Logistic regression is a parametric model that fits a logistic, or sigmoid, function to data in order to generate
categorical predictions. For classifications
Compared to decision trees, regression models are generally less computationally expensive, less reliant on a large
training dataset, more robust to variance, and provide more nuanced predictions.
Compared to regression models, decision trees are easier to implement and interpret.
Parametric models can be compared by their “goodness of fit.” However, the main marker of performance for most
machine learning tasks is predictive accuracy (...more on this in the coming weeks).
Question: What statement is NOT true about parametric models?
Answer options:
1. Parametric models make no assumptions about the relationship between attributes X and target variable Y. (False > this is the NOT-true statement)
2. Parametric models pre-specify a formula and estimate its parameters from some data. (Linear regression etc.)
3. Linear and logistic regression are examples of parametric models.
4. Parametric models can be evaluated by their "goodness of fit."

Question: Linear regression models are often fit to data with Ordinary Least Squares (OLS) estimation. What does OLS optimize?
Answer options:
1. R-squared – goodness of fit
2. Residual sum of squares (RSS) – the distance between the actual observed and the estimated data points (this is what OLS optimizes)
3. Log-likelihood > that is logistic regression
4. Mean Squared Error (MSE) > could be relevant, but not what OLS minimizes; used with gradient descent

Question: You've fit a logistic regression model to some data that outputs a coefficient of 0.70. What does this coefficient tell you?
Answer options:
1. There's a 70% chance of Y being assigned a class label of 1 without taking X into account.
2. The log odds of Y is expected to change by 0.70 per one unit change in X, on average. (Correct > everything is converted onto the log-odds scale to fit)
3. Y is expected to change by 0.70 per one unit change in X, on average.
4. For each unit of X an instance has, the odds of Y being assigned a class label of 1 is multiplied by 0.70.
References:
Regression models are complicated.
Here are some resources I used in this lecture:
Linear regression:
o https://mlu-explain.github.io/linear-regression/
o https://www.youtube.com/watch?v=PaFPbb66DxQ&t=400s
o https://www.youtube.com/watch?v=nk2CQITm_eo&t=907s
Logistic regression:
o https://mlu-explain.github.io/logistic-regression/
o https://www.youtube.com/watch?v=yIYKR4sgzI8&t=75s
o https://www.youtube.com/watch?v=BfKanl1aSG0&t=302s
Ch. 4
The book, chap. 4
Parametric Modeling
- We start by specifying the structure of a model (e.g., a line) and leave certain parameters unspecified. → The learning
process is to find the best values of those parameters.
- We want to know parameters or numerical values from the training data.
- Once we have established values for the parameters, we will have a model that can produce the predictions we
want – for classification or regression.
- We look for some kind of mathematical formula that combines the attributes in the best way, to get useful
information about the target value.
o There are many kinds, but we can get a long way with:
a linear model, that describes the line that best separates the data. (also natural for regression).
- Equation:
o f(x) = w0 + w1*x1 + w2*x2 + …
Prediction problem:
- We want predicted outcome to be as close to the real outcome as possible.
- Equation:
o f(x) = w0 + w1*x1 + w2*x2 + …
We want new attributes as far away from the discriminant function as possible.
Unsupervised Learning
Session 5
Generalization & Overfitting
Definitions
Don't worry about the specific definition > what's interesting is what is highlighted
What is generalization?
The property of a model or modeling process, whereby the model applies to data that were not used to build the model.
If a model cannot generalize well to new data, then it will not be able to perform the prediction task that it was intended for.
Key concept of ML: Generalization
o This property of ML models enables them to handle big data. Focus on generalization > how well they predict
new data.
o How do we evaluate ML models > on how well they generalize.
Side note: Humans develop skills that are generalizable. That is something that AI lacks
o AI is narrow intelligence
Example: Teslas AI
o In front of the car drives a horse carriage – on the screen in the car it shows something else entirely
o The car doesn't recognize horse carriages. An example of a failure of generalization – the ML model doesn't know what it's
encountering. It only knows what it's trained on, and it hasn't been trained on horse carriages.
o 2 ways we can explain this failure; under-and overfitting
What is overfitting?
When a model fits its training data so well that it cannot perform accurately on new, unseen data.
If the model fits perfectly, it might show a perfect R-squared (1). This might be a red flag: the model has memorized all the data, so
it has also remembered all the noise. As a result, when applied to unseen data the model will predict the same noise it was trained
on. The model is tailored so perfectly to the training data that it comes at the expense of generalization.
It occurs when a model is too complex, perhaps as a result of too much training, too many input features, or not enough
regularization. Too many attributes
Troublesome – hard to spot because it is often related to a “perfect” model (perfect r-square indicating that the
model is perfect)
Deep learning, can happen when letting a model train too much
What is underfitting?
When a model is unable to capture the relationship between the input and output variables accurately, generating a high
error rate on both the training set and unseen data.
It occurs when a model is too simple, perhaps as a result of a model needing more training, more input features, or less
regularization.
(Figure, three panels:) Left – too simplistic > guessing a partition. Middle – the optimal partition, a straight line. Right – in the real
world we don't want the weird curves. It might be a mistake that the blue point lies among the black ones; the model overfits and
captures the noise from the training data, and this will also be applied to new unseen data.
Bias and Variance
Understand 2 key concepts in order to understand under- and overfitting
Average model:
Run on independent data sets > in practice it is
difficult to identify whether we are observing a bias
(systematic error) or a mistake
Variance
Variance = How much predictions vary for a given data point on average; a measure of how “sensitive” a model is.
A small change in X resulting in a huge change in Y > indicates how sensitive our model is
Underfitting
Left side: The model doesn’t capture
the relationship between x and y. It’s
too simple.
Overfitting
Hold out data = Data for which we know the value of the target variable, but which will not be used to build the model (i.e.,
“test data”). Partition the data > build the model
Implies that separate “training data” is used to build the model.
Test/holdout
Test set, put aside when building the model (20-30%)
Evaluate the model on it. Don't touch this data – save it for when you're done building the model. We risk imposing overfitting if we
look at this data before the model is done.
A "little experiment"
Analogy: building a theory on training data; in order to test the theory, we test it on our "test set"
Training data: build our model > default model
20% testing: don't touch this before the model is done; only run on it once
Compare the two > get an idea of how well the model generalizes
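A minimal sketch of the holdout logic with scikit-learn's train_test_split (the dataset is a built-in toy set, and the 25% test share is scikit-learn's default):
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(model.score(X_train, y_train))  # often near-perfect on training data
print(model.score(X_test, y_test))    # the honest estimate of generalization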
2 key points:
1. There are often benefits to either increasing or decreasing complexity
2. There is a sweet spot
Level > where the training and test error curves begin to move away from each other.
Correction (to the earlier R-squared question):
True – it is a measure of GOF for linear regression
It is not negative
Example of cross-validation
Partition the available data so that it enables this experimentation
k-fold cross-validation
A method of evaluating a model’s performance whereby data is split into k subsets (“folds”) and the train-test evaluation is
repeated k times, such that each subset serves as the test set once.
Test scores are returned by each fold and averaged together for a final score.
Average the scores together
Look at the standard deviation
If varied: a concern – not much stability in the way the model generalizes
Step by step (k = 5):
2: Put aside fold no. 2 > train on the remaining 4
3: Fold no. 3 as test, use the rest as training
4: Put aside fold no. 4, output a test score
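A minimal sketch of k-fold cross-validation with cross_val_score, assuming scikit-learn and a built-in toy dataset:
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print(scores)          # one test score per fold
print(scores.mean())   # the final, averaged score
print(scores.std())    # a high spread = unstable generalization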
Cross-validation considerations
Useful for cases where you have small datasets and you need to utilize every little bit of information to develop your
model.
o Limited datasets > no need to get too complicated
o Instead of a single split, systematically swap data between training and test
Computationally expensive when using very large datasets and/or complex models (e.g., deep learning models).
o Partitioning, training and testing takes many resources
Different types of cross-validation?
o K-fold most commonly used
o Focus on k-fold in this course
Regularization
Calibrating a model's fit to data with the model's complexity; "desensitizing" a model to training data. Desensitizing the model
to the data > the training data might not be perfect; don't take small peculiarities too seriously
Regularization techniques include pruning, imposing explicit complexity penalties in the model’s algorithm, and feature
selection. For the sake of generalizability
As the penalty term λ increases, coefficients (slope) decrease, due effectively to less priority being given to minimizing RSS. If
λ = 0, then a standard OLS line is fit. As λ approaches infinity, the slope gets asymptotically close to 0.
The penalty term is often denoted as lambda λ , or alpha α
L2 Regularization, or Ridge Regression
Assume the relationship between x and y might not be the same on bigger data sets
L2 norm
Increasing the L2 penalty: coefficients become small
What happens:
Orange – new unseen data
No guarantee
Blue > with smaller data > steep
increase in score
L1 Regularization, or Lasso Regression
A linear regression model with an adjusted loss function that penalizes model complexity, and is capable of zeroing out
coefficients.
The penalty term λ functions the same as with ridge regression, but here, some coefficients may be reduced to exactly zero.
This means some features are entirely ignored by the model.
Key difference:
Ridge regularization can lead to small coefficients, but they can't be reduced to exactly 0
It always includes the same attributes
Adding L1 to linear regression > might remove attributes, effectively doing feature selection. Attributes get fitted one at a time; an
attribute's coefficient can be set to 0 if it doesn't improve the model
L1 norm
Important change
L1 vs L2:
Compare them and try to understand the difference
If you have a large number of features and some might not be relevant for your model > lasso
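A minimal sketch of the L1-vs-L2 difference on synthetic data with deliberately irrelevant features; the alpha values are arbitrary:
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 6))
# only features 0 and 1 matter; the other four are irrelevant noise
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

print(Ridge(alpha=1.0).fit(X, y).coef_)   # small but nonzero everywhere
print(Lasso(alpha=0.5).fit(X, y).coef_)   # irrelevant features zeroed out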
When implemented, an algorithm iterates over a (hyper)parameter grid containing the prespecified range of values to evaluate
for each (hyper)parameter.
An automated way > it reports the scores; pick the best one
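A minimal sketch of grid search with scikit-learn's GridSearchCV; the alpha grid and dataset are arbitrary choices for illustration:
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)            # fits and cross-validates every value in the grid

print(grid.best_params_)  # the best-scoring setting
print(grid.best_score_)   # its mean cross-validated score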
Conclusion:
ML models pick up on the smallest
signals that humans cannot
Facial features
Smiling
Takeaways:
Garbage in, garbage out: a machine learning model can be only as good and unbiased as the training data provided to
it.
o getting better training data if overfitting
Overfitting (?) and too-good-to-be-true performance: be wary if your model is producing remarkable results based on
unremarkable features.
o Too-good-to-be-true performance is probably not true
Occam’s razor: do not appeal to the extraordinary (neural networks picking up features that the human brain does
not) when the ordinary (smiles) is sufficient.
Summary
Useful machine learning models are generalizable.
Overfitting is when a model is fit so precisely to the data on which it was trained that it is no longer able to generalize
to new, unseen data. No longer useful
Avoiding overfitting means dealing with the bias-variance tradeoff.
To avoid overfitting we can use cross-validation, regularization, ensemble methods, and/or get a larger and more
representative training dataset.
Cross-validation involves partitioning your dataset into training data (for building a model) and holdout/test data (for
evaluating a model).
k-fold cross-validation is a method of evaluating a model’s performance whereby data is split into k subsets (“folds”)
and the train-test evaluation is repeated k times, such that each subset serves as the test set once.
Regularization involves calibrating a model’s fit to data with the model’s complexity; “desensitizing” a model to
training data.
Regularization often entails tuning hyperparameters, which can be done systematically with grid search.
Question: Which statement is NOT true about overfitting?
Answer options:
1. It occurs when a model is too complex
2. It's observed when a model performs well on training data but poorly on test data
Ch. 5
The book, chap. 5
Evaluating a model
- When evaluating the model, it is not enough to look at the accuracy for the dataset used to build the model (that is
called a table model, and does not generalize), you need some test data.
- Generalization; The model should generalize from the training data! – then you can later test the model on the
test data (Holdout data)
- Overfitting; If the model is closely fitted to the training data, in a way that does not generalize to other data.
- As the model becomes more complex, the errors initially go down on both holdout and training data. But the holdout
error does not continue to decrease, because of overfitting.
- Looking at table data: The Holdout data keeps being at b, because it is not in the table. Training data keeps
getting better till all rows are used.
Cross validation:
- All data will be held out once
- You get the test results five times and take the average of this.
Learning curve
- Generally, models tend to improve as the size of the training data
increases, but the improvement is typically not constant.
- At some point there is no longer more value in adding training data.
MinNumObj
- Overfitting happens when model gets overly complex.
- Avoid overfitting by changing min. number of instances per leaf.
- A small number of instances = not much reason to think that the decision is accurate!
- Prune = beskære.
Key points:
- You can’t assess a model by looking at is performance on training data.
- You want the model to Generalize – apply well to data other than training data.
- There is a danger of Overfitting – model is closely fitted to training data, in a way that doesn’t generalize to other
data.
- To assess a model, you need to use holdout data – keep some data from the training data, to be used for test data
- Cross validation does this systematically with many different partitions (or folds) in the dataset
Quiz:
- The red line in Figure 5-2 shows the base error rate. Let's say that, overall 45% of customers churn, and 55% do
not. What is the value of the base error rate in this case?
o 45%.
What would b be? Since the table model always
predicts no churn for every new case with which it is
presented, it will get every no churn case right and
every churn case wrong. Thus the error rate will be
the percentage of churn cases in the population.
This is known as the base rate, and a classifier that
always selects the majority class is called a base rate
classifier.
- You have a data set with 500 instances, and you will perform
10-fold cross-validation. In this case, how many instances of
data will you use for testing in each iteration?
o 50 instances will be in one fold.
- You have a data set with 1000 instances, and you want to
perform 5-fold cross-validation. In this case, how many
instances will you use for training in each iteration? → 800
(because: 1000/5 × (5−1) = 800)
Exercise 5: Linear regression I!
Topic: Overfitting and its avoidance
Learning Objective: 1, 4
Activities to be done before next class: Send one-page project proposal to JB for written feedback (optional)
Session 6
Similarity and Neighbors
DP = data points
Sim = similarity
Metr = metric
What is similarity?
Core idea of similarity
A way of measuring how alike, close together, or related data are, usually involving a so-called distance metric.
Similarity can be used for supervised and unsupervised machine learning; it can be used for classification, regression, and
clustering tasks.
Similarity is basic and intuitive; used in everyday life. Behavior -> reaction – next time we predict the same reaction. The same
applies to similarity in ML: look for the most similar data points among those we learned from; predict that the same will apply in the test data
Euclidean distance
Manhattan distance
Minkowski distance
Cosine similarity
Jaccard distance
...
Euclidean distance
Measures the distance between two points as the length of a straight line.
One of the most intuitive and commonly used distance metrics. Good off-the-shelf performance with low-dimensional data, but
requires that data has been normalized and performs poorly with high-dimensional data.
Simple, intuitive, widely used
The default setting in ML programs
Challenge: high-dimensional data
o Technical: on a 2D plane everything is flat, a straight line. As dimensions increase, things curve; it becomes more of a tangled
ball than the straight-line distance between the points.
Most applicable: low-dimensional data
Measured between point A and B
Based on triangles: the Pythagorean theorem
Manhattan distance
Measures the distance between two points along axes at right angles; also commonly referred to as taxicab distance.
Less intuitive than Euclidean distance but displays better performance with high- dimensional data.
Taxicab distance: the total north-south distance plus the east-west distance you travel
Minkowski distance
A generalization of Euclidean, Manhattan, and other vector-based distance measures.
Introduces a parameter p that can be adjusted to suit your use case, but interpretability can be troublesome without a firm
understanding of vector-based distance metrics.
p = 1 → Manhattan distance
p = 2 → Euclidean distance
Cosine similarity
Measures the similarity between two points as the cosine of the angle between two vectors.
Well-suited to high-dimensional data but ignores the magnitude (size; length) of the vectors. It only accounts for the orientation
between two points, rather than the distance per se.
Jaccard distance
Treats data points as sets of characteristics and measures distance as one minus the proportion of all unique characteristics
(the union) that are shared by the two (the intersection).
Well-suited to high-dimensional data (e.g., text data), but is susceptible to being skewed by the size of the data. Increasing
dataset size may increase the union without increasing the intersection.
Different from the other metrics
It isn't a vector-based metric – no distance in space
A bucket of values (rows). Checks instances – how many characteristics present in this one are also present in other
instances
Affected by the size of the dataset
o Lots of features: more opportunity for the union to grow without the intersection growing
Jaccard distance = 1 minus (the intersection divided by the union)
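A minimal sketch of the distance metrics above using scipy.spatial.distance on two toy points:
from scipy.spatial import distance

a, b = [1, 2, 3], [4, 0, 3]

print(distance.euclidean(a, b))              # straight-line distance
print(distance.cityblock(a, b))              # Manhattan / taxicab distance
print(distance.minkowski(a, b, p=3))         # p=1 is Manhattan, p=2 is Euclidean
print(distance.cosine(a, b))                 # 1 - cosine similarity
print(distance.jaccard([1, 0, 1], [1, 1, 0]))  # on boolean/binary vectors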
To make predictions
Lazy learning; no model being built, no fitted equation, etc.
Pattern recognition.
Regression task
Predict whether
Clustering
Unsupervised
What is clustering?
- An unsupervised machine learning task that involves grouping together similar instances into so-called “clusters.”
o The key thing is that in contrast to the others, clustering is unsupervised. So, we’re not predicting something
specific.
o The objective is more exploratory, we don’t know what we’re looking for.
o Might look for the company’s natural customers
- As an unsupervised task, the input data is unlabeled; the modeling is not driven by a pre-specified target variable, but
rather seeks to identify naturalistic groupings.
K-means
- An iterative algorithm that groups unlabeled data into k non-overlapping clusters.
o The most intuitive and most used.
o Non-overlapping: Exclusive (?)
- Represents clusters of data by their centroid — their ”cluster centers,” or the arithmetic means (averages) of the
values along each dimension for the instances in the cluster.
We are trying to find 3 natural clusters. (See the video at the link below the picture.)
https://www.linkedin.com/pulse/k-means-clustering-itsreal-use-case-surayya-shaikh/
Evaluating your clusters
- The goal is to minimize distortion (or inertia). The lower the distortion, the better and more “coherent” your clustering
is.
o How good are these clusters?
o How good/natural is your cluster? We use distortion to evaluate this.
- Distortion is the within-cluster sum-of-squares; the sum of the squared differences between each data point and its
corresponding centroid. Scikit-learn calls this inertia.
This measure of distortion is not mentioned in today's exercise, but inertia is. Distortion is the same thing as inertia! They
function the same way.
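A minimal sketch of k-means and its inertia_ attribute on synthetic blobs; trying several values of k hints at the "elbow" way of picking k:
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

for k in (2, 3, 4, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # inertia keeps falling; look for the "elbow"

print(km.cluster_centers_)            # the centroids of the last fit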
K-means problems
- Assumes regularly shaped clusters.
o Can't see natural clusters if they don't have a regular shape.
o Only cares about minimizing distortion/inertia.
- The initial placement of centroids affects how long it takes to converge and which instances are assigned to a
particular cluster.
- How do you pick k (the number of clusters)?
Hierarchical clustering
- An algorithm that builds nested clusters by successively merging the most similar clusters (or data points) until all data
points have been merged into a single cluster, such that the clustering can be represented as a tree-like dendrogram.
o Bottom-up clustering.
- There are several algorithms that are considered part of the hierarchical clustering family. What we’re referring to
here is sometimes called “Agglomerative Clustering.”
https://dashee87.github.io/data%20science/general/Clustering-with-Scikit-with-GIFs/
Interpreting a dendrogram
Linkage functions
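A minimal sketch of hierarchical (agglomerative) clustering with scipy, producing a dendrogram; the linkage method ("ward") is one arbitrary choice among several ("single", "complete", "average", ...):
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=0)

Z = linkage(X, method="ward")   # successively merges the two closest clusters
dendrogram(Z)                   # the tree-like view of the merge order
plt.show()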
Clustering considerations
- The distance metric you use matters.
- The dimensionality and scaling of your data matters.
- Sometimes clusters are difficult to interpret.
o Up to the programmer to interpret, e.g., which clustering is natural.
o There aren’t any labeled clusters, so it’s up to the programmer.
- Oftentimes clusters are difficult to evaluate as there is no ground truth with which to compare.
o There is no ground truth with clustering.
Other relevant algorithms
For clustering:
§ Affinity Propagation
§ DBSCAN
§... https://scikit-learn.org/stable/modules/clustering.html
For dimensionality reduction:
§ PCA
§ UMAP
§... https://scikit-learn.org/stable/modules/unsupervised_reduction.html
Clustering in practice
- Market segmentation
- Profiling and anomaly detection
- Document/text analysis
o Touch on in the next lecture
o Rarely your main model, but more used as explaining and guiding your next steps.
- Data exploration and problem definition
Summary
Similarity is a measure of how alike, close together, or related data are, usually derived from a so-called distance
metric, of which there are many.
By measuring the similarity (or distance) between instances, you can make predictions for new, unseen data based on
what known instances are most similar (or closest). This “nearest neighbor reasoning” is the logic of k-nearest
neighbor algorithms.
Clustering is an unsupervised machine learning task that follows the logic of similarity and nearest neighbors, which
involves grouping together similar instances into so-called “clusters.”
o We touched on K-means
o Hierarchical clustering: nested clustering
Two common clustering algorithms are k-means and hierarchical clustering (agglomerative clustering).
K-means clustering is an iterative, centroid-based algorithm that groups unlabeled data into k non-overlapping
clusters.
Hierarchical clustering builds nested clusters by successively merging the most similar clusters until all data points
have been merged into a single cluster.
Clustering is useful for exploratory data analysis, where there is no known target variable to be predicted or classified.
However, nearest neighbor methods and clustering suffer from the so-called “curse of dimensionality.”
Question Answer
Which statement is NOT true about similarity? It can be used for supervised and unsupervised machine learning
Ch. 6
The book, chap. 6
Similarity
About finding patterns in data.
- Gets value from datasets by comparing.
- Can be used for classification, regression and (is a basis for) clustering.
How to: Looks the differences in each attribute, and measures the distance:
1. Look at age attribute 40-23 = 17, current address 10-2 = 8, RS = 1.
2. Then measure the shortest distance between points using Euclidean distance:
- Then we can compute distance in the usual way by combining the differences in the
feature values.
Our project:
- Result in a lot of attributes like street name..
Clusters
- Unsupervised segmentation
o No target but look for a natural way to group the data.
o Two ways to grouping the data:
Hierarchy clustering
K-means clustering
Session 7
What is probability?
The chance that an event will occur.
- A mathematical framework that allows us to analyze chance
Probabilities range from 0 to 1, where 0 (0%) indicates with complete certainty that an event will not occur, and 1 (100%)
indicates with complete certainty that an event will occur.
What is an event?
A possible outcome within a sample space.
The sample space of a phenomenon is the set of all possible outcomes. For example, the sample space of rolling a die once is
{1, 2, 3, 4, 5, 6}.
- 6 outcomes – sample space contains 6 possible outcomes.
- Different events can occur: an odd number, over three, a 1, a 2, etc.
Finding the probability of an event
The probability of an event A, denoted by p(A), is obtained by adding the
probabilities of the individual outcomes in the event.
When all the possible outcomes are equally likely: p(A) = (number of outcomes in A) / (number of outcomes in the sample space).
Central challenge: estimating the probability of an event
Example:
Event B: 4 different outcomes qualify and still satisfy
the event definition
Unconditional probability
A probability that is not affected by or dependent on other events.
For example, the probability of rolling a die and getting a 6, or the probability of flipping a coin and getting heads.
- Nice and intuitive
AB: multiplying.
The probability of two events occurring together is lower than either alone
Placing two bets instead of 1 > more risk.
When combining, the joint probability will always be lower.
If one of these is true, the two others are also true: both events are independent.
You can just check one.
What do we mean by "|"?
Conditional probability
The probability of an event (A) given that another event (B) has occurred; denoted by p(A|B).
For example, the probability of a customer churning given that they have purchased a special subscription, or the probability of
having COVID given that you tested positive.
Example:
Check for independence
Data on 10 customers
We have data on whether they bought a
special subscription and whether they
churned
Training data
Check mathematically
Bayes’ theorem
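The formula itself (reconstructed here from the Ch. 5 notes further below): p(H | E) = p(E | H) × p(H) / p(E), where H is a hypothesis (class) and E is the evidence (attributes).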
Naïve Bayes
Adjust it with some assumptions
The assumption of conditional independence
is what makes it "naïve"
Note: There are several variations of the Naïve Bayes algorithm used for classification. We’re focusing on the most common
variant, Multinomial Naïve Bayes.
Multinomial Naïve Bayes
A CLASSIC classification task
Data set: 20 tweets,
labeled as containing fake news or not.
The tweet text as the attribute
Not meaningful –
Alpha parameter (the smoothing parameter)
Computationally efficient
o Fast
Good for handling high-dimensional data (e.g., text)
Multinomial Naïve Bayes suits data with discrete, integer attribute values (different Naïve Bayes variants can handle
other attribute types) – multinomial distribution.
o If the values are not integers, use a different variant of Naïve Bayes
Not good for probability estimation per se; only good for ranking class labels
o Think about what happens with many features > the estimated probabilities become smaller the more you multiply.
Strict independence assumption means semantics are not appreciated
o Ignoring linguistic meaning.
Any data can be high-dimensional
Text-As-Data
Typical high-dimensional data = text
Source:
Representing text
Basic text-as-data concepts and terminology
Corpus: a collection of documents
Document: one unit of text for analysis (e.g., a sentence, blog, book, etc.)
N-gram: a sequence of adjacent words
Token/term: a word
Token:
one word; e.g., ”rhythm”
individual word
N-gram:
a sequence of adjacent words;
e.g., ”natural rhythm” is a bigram;
“a natural rhythm” is a trigram
Data preprocessing
A number of preprocessing steps are unique to text data
Bag-of-words
Once preprocessing is done > different directions are possible. The most common direction is bag-of-words.
A text-as-data approach that treats every document as a collection of individual words, ignoring grammar, word order,
sentence structure, and (usually) punctuation.
It treats every word in a document as a potentially important keyword of the document. It’s straightforward, computationally
inexpensive, and tends to perform (at least acceptably) well for many tasks.
Loses context
3 small documents; preprocessing: lower-casing etc.
Bag of words – 3 things we do (see the sketch below):
1. Tokenize: split the string, chopping it up into small tokens.
2. Count occurrences of each token.
3. Vectorization: turn each document into a vector of token counts.
This can also be written as an equation.
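The three steps map directly onto scikit-learn’s CountVectorizer, sketched here with three invented documents:
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat.", "The dog sat!", "The dog barked."]  # three small invented documents

vec = CountVectorizer(lowercase=True)  # preprocessing: lower-casing; punctuation is dropped by the tokenizer
bow = vec.fit_transform(docs)          # tokenize + count occurrences of each token = vectorization

print(vec.get_feature_names_out())     # the vocabulary: one feature per token
print(bow.toarray())                   # each row is one document's count vector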
TFIDF
IDF ends up being combined with TF: a combined measure of term frequency and inverse document frequency; term frequency weighted by inverse document frequency, tfidf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t)) for a corpus of N documents of which df(t) contain term t.
The most widely used feature representation of text data.
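A minimal TFIDF sketch with the same kind of toy documents, assuming scikit-learn’s TfidfVectorizer (which uses a smoothed idf variant):
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the dog barked"]  # invented documents

vec = TfidfVectorizer()
X = vec.fit_transform(docs)  # term frequency weighted by inverse document frequency
print(vec.get_feature_names_out())
print(X.toarray().round(2))  # "the" (in every document) gets a low weight relative to rarer words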
Sentiment analysis
Summary
We can think of each instance in a dataset as evidence for or against different values for the target.
If we know all possible outcomes (class labels), we can estimate the conditional probability of an outcome being
observed given its attributes.
Assuming conditional independence, Naïve Bayes algorithms estimate the probability of an instance belonging to each
candidate class and report the class with the highest probability.
o Naïve Bayes as a baseline
(Multinomial) Naïve Bayes is computationally efficient and good for high-dimensional data (e.g., text), but only good
for class labeling (not probability estimation à la logistic regression).
Approaching text-as-data means turning text into machine-readable inputs for statistical and machine learning
models.
Text data is ubiquitous and naturalistic, but messy and high-dimensional — a central challenge with text data is the
necessary preprocessing.
Text data is commonly represented as a bag of words.
Key feature representations derived from bag-of-words are TF, IDF, and TFIDF.
Combining text-as-data feature representations with machine learning gives rise to NLP, which powers wide-ranging
tasks like text classification and topic modelling.
Question: Which statement is NOT true about Naive Bayes?
Options: It's good for high-dimensional data, like text / It assumes conditional independence among attributes / It's good for probability estimation, just like logistic regression / It's unambiguous
Answer: "It's good for probability estimation, just like logistic regression" is NOT true.
- The probabilities it estimates are only useful for comparing classes against each other (ranking), not as calibrated estimates.
o Conditional probability of an event occurring: when we have relevant evidence about another event that affects it.
Ex. Dependent: P(C given E), written P(C | E).
Many events are not independent; in general, the probability of A and B both occurring is the probability of A times the probability of B given A:
P(AB) = P(A) * P(B | A)
P(AB) = P(B) * P(A | B)
Bayes rule:
- A way to figure out how likely a hypothesis is given evidence.
- To see how likely a hypothesis is given the evidence, p(H | E) (e.g., people arriving at the hospital with red spots – how
many have measles?), it is often hard to get good estimates of the probability of H given E, but it might be easier to
get data on E given H, e.g., the proportion of patients with measles that also have red spots (this is often
collected anyway).
- We also need data on how often people in general have red spots, p(E) ← this can be hard, but often we do not even
need p(E). Because: for classification we want to find the hypothesis (class) c that has the highest probability given E
(the attribute vector), so we would typically compare two classes, C1 and C2, given E, and determine which one is
larger – therefore we just need to know which has the larger numerator! Both have the same denominator! See the sketch below.
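A toy numerator comparison (all numbers invented) showing why p(E) can be ignored when picking the most probable class:
# p(C | E) = p(E | C) * p(C) / p(E); p(E) is identical for both classes,
# so the class with the larger numerator wins.
p_E_given_c1, p_c1 = 0.80, 0.10   # invented p(E | C1), p(C1)
p_E_given_c2, p_c2 = 0.05, 0.90   # invented p(E | C2), p(C2)

num1 = p_E_given_c1 * p_c1        # 0.08
num2 = p_E_given_c2 * p_c2        # 0.045
print("C1" if num1 > num2 else "C2")  # -> C1, without ever computing p(E)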
Text mining:
- Can be used to find all articles about a topic or classify documents as positive, negative or neutral.
- To do this each instance should be represented as a vector of attributes.
o Ex: list all words; 1 if the word is present in the document and 0 if not, or look at the number of occurrences in the
document. (There are eleven features representing all the words in the three documents.)
o Additionally, (1) a word is more interesting if it occurs more frequently in the document, and (2) a word is less
interesting if it occurs in many other documents.
Dummy values
- You want text to be turned into numbers so you can work with them. It takes up more space on the screen but
makes the machine happier.
- Instance: a news article that on a given day mentioned a given stock (represented as a row).
- The attributes of each article are the TFIDF values of terms in a given article.
- Then we can build a classifier!
2) We can order the attributes based on how well they predict a change in stock price. Then we know what words
are more interesting in news articles about stocks.
3) One way to think about classification is to ask: What is the EVIDENCE for a given class.
Exercise 7: Sentiment analysis
Topic Evidence and probabilities; text-as-data
Learning 1, 4
Objective
Activities to be Today’s exercise objectives
done before next Use CountVectorizer to turn unstructured text into machine-readable attributes. Perform sentiment
class analysis with those attributes
Interpret the results to get insight out of your data
8. DECISION ANALYTIC THINKING I: MODEL EVALUATION AND ETHICS
Topic Decision analytic thinking I: Model evaluation and ethics
Learning 1, 2, 3
Objective
Syllabus DSB p. 187-208
PPBD p. 1-40
Activities to be Readings
done before next
class
Exercise Model evaluation
Session 8
Measures of Model Performance
What is a good model?
A model that answers your question or solves your problem. It depends on the use case.
- No single answer.
What is a good predictive model?
A model that makes accurate predictions on new, unseen data.
A good predictive model should make more accurate predictions than a relevant baseline model or simple rules of thumb.
- Generalize (lec 5)
What is an appropriate baseline?
It depends on the use case.
A baseline could be a dummy model (e.g., a classifier that always predicts one class label), the simplest version of a model (e.g.,
a linear regression vs. a polynomial regression), whatever model is currently deployed in practice, or the current state-of-the-
art for a given task.
- In addition to making accurate predictions.
Model performance metrics
Depending on the task, there are different measures to choose from:
Regression:
o Goodness of fit (R-squared) See lecture 4
o Prediction error (MSE, MAE, RMSE)
Classification:
o Accuracy, precision, recall, F1
o Sensitivity, specificity
o Expected value
Clustering:
o Distortion/inertia See lecture 6
Regression - Evaluating regression models
Regression models are most commonly evaluated by their goodness of fit (e.g., R-squared) or by their prediction errors,
which are quantified by a loss function (e.g., MSE, MAE, RMSE, etc.).
Loss functions measure how bad it is to get an error of a particular size and direction. Different loss functions suit different use
cases.
Prediction error: the difference between the actual and predicted value, y − ŷ.
Straight line: the regression coefficients define the fitted line.
Loss functions
Each works with the prediction error. For example (see the sketch below):
RMSE = square root of the mean of (y − ŷ)²
MAE = mean of |y − ŷ|
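A minimal sketch of both calculations with invented values:
import numpy as np

y_true = np.array([3.0, 5.0, 2.0, 7.0])  # invented actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # invented predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))        # MAE: average absolute error
rmse = np.sqrt(np.mean(errors**2))   # RMSE: squaring penalizes large errors more heavily
print(mae, rmse)                     # 0.875 vs ~1.146 for these values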
Classification: Evaluating classification models
Classification models are evaluated by how often they predict the correct class label.
Classification performance can be quantified by several measures, such as accuracy, precision, recall, F1 scores, sensitivity, and
specificity.
- Different ways to quantify.
Confusion matrix
A contingency table that separates the decisions made by the classifier: one axis is the predicted class label, the other the actual class label.
Ideally both high precision and high recall, but there is a trade-off: shifting the classification threshold raises one and lowers the other.
If false negatives are worse than false positives, prioritize recall over precision.
F1: a high F1 means that both precision and recall are high. It is the harmonic mean of precision and recall:
F1 Score = (2 × Precision × Recall) / (Precision + Recall)
Precision up, recall down, and vice versa – F1 captures this tradeoff.
Other metrics:
Sensitivity: high sensitivity means there are few false negatives; the classifier doesn’t miss (m)any positive instances.
Sensitivity = TP / (TP + FN)
(E.g., pandemic tests – plug the confusion matrix numbers into the formulas.)
Specificity: high specificity means there are few false positives; the classifier doesn’t mistakenly label (m)any negative instances as positive.
Specificity = TN / (TN + FP)
Worked example: Precision = 0.67, Accuracy = 0.7, F1 = 0.73.
F1 is not read directly off the confusion matrix, but computed from the calculated precision and recall.
These measures also extend to multi-class problems – plug in the numbers per class.
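A scikit-learn sketch reproducing those figures; the labels below are invented, but chosen so that precision ≈ 0.67, accuracy = 0.7, and F1 ≈ 0.73:
from sklearn.metrics import confusion_matrix, classification_report

# Invented labels giving 4 TP, 1 FN, 2 FP, 3 TN
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))       # rows = actual, columns = predicted (sklearn's convention)
print(classification_report(y_true, y_pred))  # precision, recall, F1 per class, plus accuracy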
NOTE: There are definitely other sources of ethics violations. We’re just focussing on these three today.
Main streams of discourse
Example: a model trained on historical criminal data picks up racial bias and reproduces this bias in
its predictions – it picks up predictive attributes that are directly correlated
with race.
Example: scanning CVs of job applicants; as it turns out, the model reproduces historical hiring bias.
Example (academic, Copenhagen): learning analytics for assessing academic performance.
3. Black-boxed automation
When machine learning models are deployed in contexts where their functionality is not transparent or explainable to the
implicated stakeholders.
If a black-boxed model automates or augments a decision-making process, then there is little opportunity to challenge a
potentially misguided decision.
- If I go to the doctor for a life-changing operation: why this decision? Could it be based on wrong data?
- If the doctor just says “the algorithm told me”, that seems wrong.
- At the same time, a financial trader also uses algorithms to make decisions; the algorithm tells them which trades to make.
o They must be able to justify the algorithm’s decision, otherwise they lose their job.
Easier to correct algorithms than humans.
Black-boxed algorithms
Private companies consider their algorithms proprietary information – trade secrets.
Concern: bias can be built into the algorithm, and we wouldn’t know.
Example: on trial, faced with criminal charges, you should be able to challenge the
decision. If the algorithm is part of that decision, you should
be allowed to know how the decision was made (a point argued by attorneys).
Question: Which statement is NOT true about measures of model performance (e.g., loss functions, classification reports, etc.)?
Options include: They often require an appropriate baseline to compare against / Loss functions quantify how bad it is to get errors of a particular size and direction
Note: loss functions turn prediction errors into a loss used to compare models – a tradeoff.
- Probability: easy to get from data/the classifier when building the model (confusion matrix).
- Value: not as straightforward – depends on the business understanding of the outcome.
- Example:
o vR = 100$ (profit) – 1$ (targeting cost) = 99$
o vNR = −1$ (targeting cost)
- Worked out, with a 1% response probability:
EV = p(R) × vR + p(NR) × vNR = 0.01 × 99$ + 0.99 × (−1$) = 0.99$ − 0.99$ = 0$
At a 1% response rate, targeting exactly breaks even.
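The same break-even calculation as a short sketch:
p = 0.01                       # response probability
v_r, v_nr = 99, -1             # value if the customer responds / doesn't respond

ev = p * v_r + (1 - p) * v_nr  # 0.99 - 0.99 = 0.0: targeting exactly breaks even
print(ev)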
Visualizing:
Thresholds
- In classification, each instance is given a class and a score.
o The classifier is not perfect and makes mistakes.
- The model predicts positive above a given score (the threshold).
o A score of 0.99 means the instance is predicted positive, which can be true or false. In this
case it is true, since the instance actually is positive.
o At a threshold of 0.65: the 10 instances above are classified as positive; 6 were actually positive and 4 were negative.
Profit curve
- Graph with the profit of different classifiers based on different thresholds.
- The threshold with the highest profit should be chosen.
- Example:
o Profit = 9$ per sale and it costs 5$ to advertise.
o Cost/benefit matrix (true positive = 9$ − 5$ = 4$; false positive = −5$; non-targeted = 0$):
4$ -5$
0$ 0$
o Profit curve:
If 0% is targeted, profit is 0. (This is the case when there is no positive
classification; everything is estimated to be negative.)
Choose the threshold with max profit, in this case almost 50%.
o Budget:
Another way to use the model: you have a budget of 40,000$, so you can reach
8,000 people out of 100,000. Generate offers to the 8%
with the largest scores = look at the max profit up to 8% → this will give a profit
of 100$.
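A sketch of the profit-curve idea using the 4$/−5$ benefits from the matrix above; the scores and labels are invented for illustration:
import numpy as np

scores = np.array([0.99, 0.9, 0.8, 0.7, 0.65, 0.5, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,    1,   0,   1,   0,    1,   0,   0,   0,   0])

for threshold in [0.95, 0.75, 0.45, 0.05]:
    targeted = scores >= threshold
    tp = labels[targeted].sum()            # true positives among the targeted
    fp = (targeted & (labels == 0)).sum()  # false positives among the targeted
    print(threshold, 4 * tp - 5 * fp)      # choose the threshold with max profit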
ROC graph
- The closer a classifier (its confusion matrix point) is to the upper left corner, the better.
- The diagonal line in the middle corresponds to random guessing.
Cohen et al.
- About Peer-to-peer lending.
- About Jasmine Gonzales; Young professional wants to diversify her portfolio.
- Data:
o Loans categorized from A – G. A = safest and G is riskier. This needs to be balanced.
o Loans from 2007- 2017. 100 features (Loan amount etc.)
o For simplicity, we only look at expired loans (default 13.9% and fully paid = 86.5% in total)
- Project:
o (1) Introduction:
(1) She must figure out how much to invest here and how much in other places (this requires data about
those other places).
(2) Objective: get the highest return given her risk tolerance etc.
(3) Is old data OK? It tends to indicate patterns.
(4) Attribute differences (some are grouped, some change over time).
(5) Leakage ← some data is not available at prediction time. This should not be used for predicting – perhaps it has a
great impact on the model.
o (2) Combine all data to one data set etc.
o (3) Data exploration
o (4)
Do attributes affect other attributes (Bayes)?
Time aspect? Solved by looking at different time periods.
Exercise 8: Model evaluation
Topic Decision analytic thinking I: Model evaluation and ethics
Learning 1, 4
Objective
Activities to be
done before next
class
Session 9
More classification metrics
Prediction vs. Explanation
Example: an outbreak-prediction project overestimated outbreaks – no accurate predictions.
The project failed.
Takeaway
Correlation isn’t always enough.
Tons of data doesn’t necessarily lead to truth.
This is obvious when looking at spurious correlations.
Causality is important in big data.
Statistical observations
Correlation
When two variables display an increasing or decreasing trend.
- A shared linear trend between the variables.
For example, X and Y are correlated if observing a change in X tells you that Y will either increase or decrease.
- E.g., as sales of pumpkin spice lattes go up, do we expect sales of winter coats to go up too?
Association
When one variable provides information about another variable.
- Often used interchangeably with correlation.
- Correlation is a specific type of association (association is the broader concept).
Causation
When a change in one variable causes a change in another variable.
- If you intervene and change one variable, you get the desired change in the other.
The gold standard for identifying causation is with randomized control trials.
- A type of experiment:
- Control group – gets a placebo
- Treatment group – gets the treatment
Prediction
Forecasting future observations.
Prediction is what machine learning is generally good at. Prediction is where association is enough (oftentimes).
- Association
Explanation
Accurately describing the causal mechanisms underpinning observations.
- We haven’t done this in the course.
- ML is not good at this.
Explanation is what machine learning is generally not good at. Explanation is where association is not enough (oftentimes).
- Requires that we know cause-and-effect relationships.
- In business, we often want explanation.
- ML gives only predictions and association.
o Outputs are indicative of association, but not necessarily causation.
Misleading
1. A café wants to reduce food waste, so they want to know which days of the week are going to be busy to stock
accordingly.
- Prediction: it doesn’t matter why a day is busy; they just need to know when a day is likely to be busy.
2. A telecoms company wants to know why people churn so they can design new product packages to prevent it.
- Explanation: “why” implies causality. They could predict which customers are likely to churn based on historical data, but that
wouldn’t explain why they churn.
Other predictive examples:
- Are sales likely to increase next quarter?
- Where might traffic jams occur?
- Which students are likely to drop out?
- How many claims will an insurance company get?
Explanatory questions:
- Explanation requires causal reasoning; association alone gives no causation.
- Intervention establishes causality.
- Counterfactuals: the highest form of causal reasoning; they can answer all explanatory questions.
DAG terminology
Outcome: a dependent variable; the target variable.
Exposure: an independent variable; the attribute(s) you’re interested in.
o Effect of exposures on an outcome
Ancestor: a variable that causally affects another variable, influencing it either directly (ancestor → X) or indirectly
(ancestor → mediator → X). Direct ancestors are also called parents.
Descendant: a variable causally affected by another variable, either directly (X → descendant) or indirectly (X →
mediator → descendant). Direct descendants are also called children.
Path: a sequence of edges that connect a sequence of nodes. In a DAG for observational data, a path is a sequence of
arrows connecting variables. The arrows of a path need not point in the same direction.
Causal path: a path that consists only of chains and can transmit a causal association if unblocked.
Noncausal path: a path that contains at least one fork or inverted fork and can transmit a noncausal association if
unblocked.
Adjusting/controlling for a variable: introducing information about a variable into an analysis (e.g., adding a variable
into a multiple linear regression).
o Stratification, matching, adding attributes into our models
https://journals.sagepub.com/doi/full/10.1177/2515245917745629
Covariate roles
DAGs quickly get big, messy, and complex; they are built from basic structures – little modules of nodes:
Confounder
Mediator
Collider
Confounder
A variable that is an ancestor of both the exposure and outcome variables; a common cause.
Failing to adjust for confounders in your model leads to inaccurate effect estimates (i.e., correlation coefficients; feature
importance)
- Failing to adjust introduces association despite no causation.
E.g., failing to control for weather when modelling the influence of ice cream sales on drownings, given that warm weather drives both.
Mediator
A variable that is a descendant of the exposure and an ancestor of the outcome.
Adjusting for mediators in your model leads to inaccurate effect estimates (i.e., coefficients; feature importance); it removes or
“blocks” association despite causation.
- Statistical association flows through the arrows, passing through nodes.
Chain: exposure → mediator → outcome.
Adjusting for the mediator suggests NO causal relationship between X and Y when there is one:
if you include the mediator, the model will say there is no relationship, which
is wrong.
Collider
A variable that is a descendant of both the exposure and outcome variables.
Adjusting for colliders in your model leads to inaccurate effect estimates (i.e., coefficients; feature importance); it introduces
association despite no causation.
- Normally the two causes are independent; conditioning on the collider makes them trade off against each other.
E.g., only analyzing data on restaurants still in business when modelling the relationship between location quality and food quality,
given that both contribute to staying in business.
Seems rare and easy to avoid, but we often end up adjusting without knowing (e.g., through how the sample is selected).
[E = good location] → [restaurant success] ← [O = good food]
- Only looking at restaurants in business = adjusting for a collider.
- Result: a negative correlation with no causal relationship – see the simulation sketch below.
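A small simulation (all variables invented) of the restaurant example: location and food quality are generated independently, but conditioning on being in business induces a negative correlation.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
location = rng.normal(size=n)        # E: location quality
food = rng.normal(size=n)            # O: food quality, generated independently of location
in_business = (location + food) > 0  # collider: staying in business depends on both

print(np.corrcoef(location, food)[0, 1])                            # ~0: no association overall
print(np.corrcoef(location[in_business], food[in_business])[0, 1])  # clearly negative: collider bias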
Exercise
1. Go to http://www.dagitty.net/learn/graphs/index.html
2. Complete the game at the bottom of the page to test your knowledge of DAG terminology
3. Go to http://www.dagitty.net/learn/graphs/roles.html
4. Complete the game at the bottom of the page to test your knowledge of covariate roles
Summary
Association is when one variable provides information on another variable (correlation is one type of association).
Causation is when a change in one variable causes a change in another variable.
Common machine learning models identify association, not causation.
Prediction involves forecasting future observations. It’s what machine learning is generally good at; it’s when
association is (often) enough.
Explanation involves accurately describing the causal mechanisms underpinning observations. It’s what machine
learning is generally not good at; it’s where association is (often) not enough.
In order to infer causation from data, we need to make causal assumptions. No data can show on its own that there is a causal
relationship.
DAGs help us define our assumptions and spot covariates with different roles so that we might draw valid causal
conclusions.
We want to adjust for confounders.
We do not want to adjust for mediators or colliders.
Accurately constructing a DAG is hard. There is often more than one defensible DAG for a given scenario.
The best way to avoid misinterpreting correlation as causation is by framing business questions as predictive tasks,
rather than explanatory tasks.
Question: Which statement is NOT true about prediction and explanation?
Options include: Machine learning is generally good at prediction, not explanation / Machine learning is generally good at explanation, not prediction
Answer: "Machine learning is generally good at explanation, not prediction" is NOT true – ML is NOT good at explanation!
Ch. 11-12
The book, chap. 11-12
- Look at:
o (1) Probability of donating
o (2) How much they will donate ← this is what we want to maximize.
- EV for targeting:
Expected benefit of targeting x: EB_T(x) = P(S | x, T) × (u_S(x) − c) + [1 − P(S | x, T)] × (u_NS(x) − c)
- S = stay
Expected benefit of not targeting x: EB_notT(x) = P(S | x, notT) × u_S(x) + [1 − P(S | x, notT)] × u_NS(x)
Target x when EB_T(x) > EB_notT(x).
If:
o 30% of transactions involve beer
o 40% of transactions involve lottery tickets
o 20% of transactions involve beer and lottery tickets
o 0.3 × 0.4 = 0.12 → the probability of beer and lottery co-occurring if they were independent. Since the observed 0.20 is higher than 0.12, the two are not independent (they co-occur more than chance would suggest).
Topic Conclusions
Learning 1,2,3
Objective
Syllabus DSB p. 315-347
Activities to be Readings
done before next
class
Exercise Project workshop
Session 10
Course narrative in a nutshell
Digitization and increased computing capacities generate data of such great volume, variety, and velocity that it’s
been dubbed “big data”
Big data is capital. Big data revolutionizes business by transforming traditional business models (e.g., selling books)
and by creating entirely new business opportunities (e.g., data-driven advertising)
Machine learning (ML) models are needed to extract value from big data. But translating qualitative business
questions into quantitative ML tasks isn’t always straightforward
There are many different models, each of which suits a particular type of task
Supervised learning techniques
o Classification is an ML task where you predict a categorical target variable (supervised learning; e.g.,
logistic regression)
o Regression is an ML task where you predict a continuous target variable (supervised learning; e.g., linear
regression)
Unsupervised:
o Clustering is an ML task where you group together similar instances into “clusters” (unsupervised
learning; e.g., K-means)
Sometimes big data is unstructured (e.g., text data). Here you need to do some feature engineering to get machine-
readable attributes for ML models (e.g., text-as-data; TFIDF)
To evaluate ML models, you need to compare them with appropriate metrics and compare against an appropriate
baseline – especially important if the model is used for, e.g., medical purposes!
o Loss functions
o Clustering: distortion/inertia
o Important: think about which metric suits the use case
But measures of model performance don’t tell us everything we need to know before deployment. The ethical
implications of a model should be anticipated, not reacted to
o Beyond technical evaluation techniques
o Even if R-squared is high and the model performs well technically,
o should it be deployed? That is an ethical question about
o the implications of deploying a model
Moreover, common measures of model performance (e.g., R-squared, loss functions, classification reports) do nothing
to distinguish association from causation, meaning that it’s easy to misinterpret outputs
o Causality
Big data and ML bring many opportunities, but they are not a panacea
Pick one task > there are different models to try out, each with pros and cons
Task | Model covered in lecture and/or reading | One strength of the model
Classification | Decision trees | Interpretability
Extra resources
On big data’s business value
Big Data: The Management Revolution (McAfee & Brynjolfsson, 2012)
Paper
Oral Exam
Full trees:
- J48: built using Information Gain; takes the most informative attribute each time.
- Random forest: builds many full trees (J48 trees) and takes the majority vote (the outcome that occurs most often)
Find a picture online.
Extra:
Precision / Recall for evaluating.
- Used to judge how important it is that we have this amount of true positives / false negatives etc.
- How important is it?
- Precision: true positives / (true positives + false positives).
- For us it doesn’t cost much to guess wrong!
- False positives cost us money, and CatBoost has a low number of these!
- False negatives also cost us money, because those customers don’t buy, since they don’t know your company.
Nominal = Female = F, Male = M, or 0 and 1; it counts how many there are of each.
To find out how good the model is, we can hold out part of the training set and test on it.
- AveMonthSpend: deleted because the test data doesn’t have it ???
The Python code:
Pandas: can build tables in Python and fill in values where cells are empty – Python preprocessing
CatBoost:
- “features”: you define an object/array with the features we want to work with.
- Adds <miss> in categorical features.
- cat_features means it automatically handles categorical values (the same as NomToNumeric in WEKA).
- We have no ‘early stopping’. It runs 200 iterations.
- cv_result stands for cross-validation.
LGBM
- Can’t take categorical features directly, so we wrote code.
o def cat_to_int ← converts all categorical variables to an integer.
2 models:
- We messed up! We just took the average of the two accuracies.
- The probability that they do NOT buy! The probabilities are an average of the two models.
- Getting the best score from CatBoost: Model2
o Validation is 5-fold cross-validation.
- bst.best_score = the LightGBM score
Feature engineering:
- Take year born, calculate age, and then bin into age groups.
- Do the same for income – create income groups. See the sketch below.
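A minimal pandas sketch of that feature engineering; the column names and values are hypothetical stand-ins for the project data:
import pandas as pd

# Hypothetical customer data (column names and values assumed for illustration)
df = pd.DataFrame({"birth_year": [1960, 1985, 1999, 1972],
                   "income": [30_000, 55_000, 21_000, 80_000]})

df["age"] = 2024 - df["birth_year"]  # year born -> age (assuming 2024 as "now")
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=["young", "middle", "older"])
df["income_group"] = pd.qcut(df["income"], q=2, labels=["lower", "higher"])  # same idea for income
print(df)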