
Regression

Welcome to Numerical targets!

Big Data, Machine Learning, and their Real World Applications


Pre-College Program
Columbia University, SPS
Regression Concepts
• Intro to Regression
• Linear Regression
• Mean Squared Error (MSE)
• Other models for regression
• Metrics for Regression
• Time Series: a special case of regression
• Housing Data Lab
Which tasks are regression tasks?
• Predicting the highest price passengers are willing to pay for a taxi.
• Forecasting how many tickets will be sold for morning and evening
movie showings.
• Identifying the age group of a person: under 18, under 35, 35+.
• Determining how many days visa processing will take based on the
country, entry permit type, and the applicant's details.
Linear Regression: The Basics
• Fits the best straight line to the observed data
• y = mx + b
• Finds the slope (m) and intercept (b) that minimize the sum of squared errors (least squares)
• Extends to multiple dimensions (multiple features)
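As a minimal sketch of what "finding m and b" means for a single feature, the closed-form least squares solution can be written in plain Python. The function name `fit_line` and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return the slope m and intercept b that minimize the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least squares solution for one feature:
    #   m = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
    #   b = mean_y - m * mean_x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Made-up points that lie exactly on the line y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
m, b = fit_line(xs, ys)
print(f"slope m = {m}, intercept b = {b}")
```

With real, noisy data the points will not sit exactly on the line, and m and b are the values that make the squared errors as small as possible overall.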
Mean Squared Error
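• Cost function, also used for evaluation
• $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $\hat{y}_i$ is the predicted value and $y_i$ is the expected value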
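A small worked example may help here: MSE is the average of the squared differences between expected and predicted values. The helper name `mse` and the numbers below are invented for illustration.

```python
def mse(expected, predicted):
    """Mean squared error: average of (expected - predicted)^2."""
    n = len(expected)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(expected, predicted)) / n


# Made-up values: the errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
expected = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
error = mse(expected, predicted)
print(f"MSE = {error:.4f}")  # (1 + 0 + 4) / 3
```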
Mean Absolute Error
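• $\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$, where $\hat{y}_i$ is the predicted value and $y_i$ is the expected value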
A Note on Target Distributions: When to Use MSE vs MAE?
• MSE is a good choice when your target variable follows a normal or otherwise symmetric distribution
• MAE is a good choice when your target variable follows a skewed distribution, because it penalizes large errors less heavily than MSE
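One way to see the difference is to compare both metrics on predictions with and without a single large miss, which is the kind of error a skewed target tends to produce. This is only an illustrative sketch; the helper names and numbers are made up.

```python
def mae(expected, predicted):
    """Mean absolute error: average of |expected - predicted|."""
    return sum(abs(y - p) for y, p in zip(expected, predicted)) / len(expected)


def mse(expected, predicted):
    """Mean squared error: average of (expected - predicted)^2."""
    return sum((y - p) ** 2 for y, p in zip(expected, predicted)) / len(expected)


expected = [10.0, 12.0, 11.0, 13.0]
all_close = [11.0, 11.0, 12.0, 12.0]  # every prediction is off by 1
one_miss = [11.0, 11.0, 12.0, 23.0]   # same, but the last one is off by 10

print("all close:", mae(expected, all_close), mse(expected, all_close))
print("one miss: ", mae(expected, one_miss), mse(expected, one_miss))
```

The single large miss raises MAE from 1 to 3.25 but raises MSE from 1 to 25.75, because squaring amplifies large errors.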
Coefficient of Determination – R^2 score (Goodness of Fit)
• Usually between 0 and 1
• Can be negative when the model fits worse than always predicting the mean of the target
• The closer to 1, the better the fit
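The three cases above can be checked with a short sketch that uses the standard definition R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations from the mean. The helper name `r2` and the data are made up for illustration.

```python
def r2(expected, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(expected) / len(expected)
    ss_res = sum((y - p) ** 2 for y, p in zip(expected, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in expected)
    return 1.0 - ss_res / ss_tot


expected = [1.0, 2.0, 3.0, 4.0]

r2_good = r2(expected, [1.1, 1.9, 3.2, 3.8])  # close to the targets
r2_mean = r2(expected, [2.5, 2.5, 2.5, 2.5])  # always predicts the mean
r2_bad = r2(expected, [4.0, 3.0, 2.0, 1.0])   # predicts the reverse trend

print(f"good fit:      R^2 = {r2_good:.2f}")
print(f"mean baseline: R^2 = {r2_mean:.2f}")
print(f"bad fit:       R^2 = {r2_bad:.2f}")
```

A model that always predicts the mean scores exactly 0, so a negative score means the model does worse than that baseline.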
Random Forest Regressor
• A random forest is a collection of decision trees; for regression, its prediction is the average of the predictions of all trees.
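The averaging can be checked directly with scikit-learn. The sketch below assumes scikit-learn and NumPy are installed and uses synthetic data invented for illustration. It averages the individual trees by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (made up): a noisy sine curve with one feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X, y)

X_new = np.array([[2.0], [5.0], [8.0]])

# One row of predictions per fitted tree, then average across the trees
per_tree = np.array([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

forest_prediction = forest.predict(X_new)
print("average of trees:", manual_average)
print("forest.predict:  ", forest_prediction)
```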
Which Model to Choose? Underfitting vs Overfitting
How to choose a model?
• Usually we try a few different models (algorithms), such as linear regression, decision tree, and random forest.
• We can tune the hyperparameters of each model to see which settings perform better, then compare the models.
• We stick to one evaluation metric for all models, such as MSE or MAE. Using several metrics is fine as long as every model is evaluated on all of them.
• We also consider whether we want a simpler or a more complex model, depending on our application.
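A rough sketch of this workflow, assuming scikit-learn and NumPy are installed: several candidate models are fit on the same training split and scored with the same metric (MSE) on the same test split. The synthetic data and the specific hyperparameter values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (made up): two features, a linear target plus noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 1.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1
)

# Candidate models, including two settings of one hyperparameter (max_depth)
candidates = {
    "linear regression": LinearRegression(),
    "decision tree (max_depth=3)": DecisionTreeRegressor(max_depth=3, random_state=1),
    "decision tree (max_depth=8)": DecisionTreeRegressor(max_depth=8, random_state=1),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=1),
}

# Same metric for every model
test_mse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_mse[name] = mean_squared_error(y_test, model.predict(X_test))

for name, score in sorted(test_mse.items(), key=lambda item: item[1]):
    print(f"{name}: test MSE = {score:.3f}")
```

In practice, hyperparameter settings are often compared on a separate validation split or with cross-validation, so that the test set is only used for the final comparison.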
Housing Data Lab
• Load the housing data from Kaggle: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset
• Fit a linear regression model using at least one feature to predict the price of the house (split into training and testing data first)
• Use model.coef_ to find the coefficients of the linear regression
• If you use categorical features, make sure you encode them
• What are the first 5 residuals?
• Calculate the MSE of the first 5 predictions.
• What is the R^2 value of the entire fit of the model? Use sklearn.metrics.r2_score()
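The lab steps could look roughly like the sketch below, assuming scikit-learn and NumPy are installed. It does not load the Kaggle file: the data is a synthetic stand-in, and the feature names `area` and `mainroad` are assumptions used only for illustration. R^2 is computed on the test split here, which is one reasonable reading of "the entire fit".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Stand-in data, NOT the Kaggle dataset: one numeric and one yes/no feature
rng = np.random.default_rng(42)
area = rng.uniform(1500, 9000, size=100)
mainroad = rng.choice(["yes", "no"], size=100)
price = 500 * area + 400_000 * (mainroad == "yes") + rng.normal(0, 200_000, size=100)

# Encode the categorical feature as 1 (yes) / 0 (no)
mainroad_encoded = (mainroad == "yes").astype(int)
X = np.column_stack([area, mainroad_encoded])

# Split first, then fit on the training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
print("coefficients:", model.coef_, "intercept:", model.intercept_)

# Residuals are expected minus predicted values
predictions = model.predict(X_test)
residuals = y_test - predictions
first5_residuals = residuals[:5]
print("first 5 residuals:", first5_residuals)

# MSE of the first 5 predictions, and R^2 on the whole test split
mse_first5 = mean_squared_error(y_test[:5], predictions[:5])
r2 = r2_score(y_test, predictions)
print("MSE (first 5):", mse_first5)
print("R^2 (test set):", r2)
```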
Time Series: Autoregression
• Autoregression: using one variable (the past) to predict the same variable at another time (the future).
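A minimal sketch of the idea, using only the previous value as the "past" (an order-1 autoregression): each value is paired with the one that follows it, and a straight line is fit to those pairs by least squares. The function name `fit_ar1` and the series are made up for illustration.

```python
def fit_ar1(series):
    """Fit series[t] ~ slope * series[t - 1] + intercept by least squares."""
    past = series[:-1]    # values at time t - 1
    future = series[1:]   # the same variable at time t
    n = len(past)
    mean_p = sum(past) / n
    mean_f = sum(future) / n
    slope = sum((p - mean_p) * (f - mean_f) for p, f in zip(past, future)) / sum(
        (p - mean_p) ** 2 for p in past
    )
    intercept = mean_f - slope * mean_p
    return slope, intercept


# Made-up series in which each value is 0.5 * (previous value) + 2
series = [10.0, 7.0, 5.5, 4.75, 4.375]
slope, intercept = fit_ar1(series)

# Predict the next, unseen value from the most recent one
next_value = slope * series[-1] + intercept
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, forecast = {next_value:.4f}")
```

The same pairing trick works with more lags: each extra past value becomes one more feature in a multiple linear regression.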
Final Project Specifications
• 15-20 minute presentations + 5 minutes for questions
• Use slides: Introduction, Analysis, Conclusions, Future work
• Show your code
• Data Analysis Project:
• At least 5 visualizations with conclusions.
• Make sure you tell a story with your data.
• Machine Learning Project:
• At least two models with comparisons (except if using neural networks), evaluated with the same metric (MSE, precision, etc.)
• What challenges did you face? Talk about hyperparameter tuning, model
comparison, metrics, etc.
