
Data Science Lifecycle

#1 Data Pre-processing

Steps >>>

a) Clean the data - use data wrangling techniques to identify and replace missing or garbage
values; the mean, median, or mode can be used as the replacement value

b) Normalize the data: to ensure that all attributes have the same weight and the absolute values of one
attribute do not skew the results

c) Format the data
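A minimal pandas sketch of these steps, assuming a hypothetical DataFrame df with a numeric column
'price' (the column name and values are illustrative, not from the source):

import pandas as pd
import numpy as np

df = pd.DataFrame({'price': [13495.0, 16500.0, np.nan, 13950.0, 17450.0]})

# a) Clean: replace a missing value with the mean of the column
df['price'] = df['price'].fillna(df['price'].mean())

# b) Normalize: min-max scaling so the attribute lies between 0 and 1
df['price'] = (df['price'] - df['price'].min()) / (df['price'].max() - df['price'].min())

# c) Format: make sure the column has a consistent data type
df['price'] = df['price'].astype('float64')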

#2 Exploratory Analysis
Also called EDA - analyzing the data to summarize its main characteristics and gain a better
understanding of the data set.

a) Use descriptive statistics - mean, median, std. deviation, box plots, scatter plots, etc.

b) Group data, especially categorical (object) data - heatmaps, pivot tables, etc.

c) Perform correlation to understand relationship between variables - scatterplot.

i> Pearson Correlation (a scipy sketch follows this list)

Two factors determine the correlation:

i. Correlation coefficient (between -1 and 1; 1 means a strong positive relationship, -1
means a strong negative relationship, 0 means no relationship)
ii. p-value - confidence level of the identified correlation (p < 0.001 - strong certainty,
p > 0.1 - no certainty)

ii> ANOVA, the analysis of variance, to find correlation between categorical variables

Two factors to be calculated:

i. F-test - a larger F value means stronger correlation
ii. p-value - confidence level of the identified correlation (p < 0.001 - strong certainty,
p > 0.1 - no certainty)
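A quick sketch of both checks with scipy, using made-up arrays (all names and values here are
illustrative):

import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation: returns the coefficient and the p-value
coef, p_value = stats.pearsonr(x, y)
print(coef, p_value)

# ANOVA: compare the means of groups formed by a categorical variable;
# returns the F-test score and the p-value
group_a = [20, 22, 19, 24]
group_b = [30, 29, 32, 31]
group_c = [40, 42, 39, 41]
f_score, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_score, p_value)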
#3 Model Development

Conduct Simple Linear Regression, Multiple Linear Regression, and Polynomial Regression (a combined
sklearn sketch follows this list)

a. Simple Linear Regression: used to predict the response (dependent) variable as a function of a single
predictor (independent) variable.

y = a + bx (a is the intercept of the regression line, i.e. the predicted value of y when x = 0, and b is
the slope)

b. Multiple Linear Regression: used to predict the response (dependent) variable as a function of
multiple predictor (independent) variables.

y = a + b*x1 + c*x2 and so on

c. Polynomial Regression: used in case of non-linear relationships between variables. To get a better
fit and predictive model, we use polynomial regression to describe curvilinear relationships. It is a form
of linear regression in which the relationship between the independent variable x and the dependent
variable y is modeled as an nth-degree polynomial.

y = a + b*x + c*x^2 + ... + k*x^n - nth order

y = a + b*x + c*x^2 - quadratic

y = a + b*x + c*x^2 + d*x^3 - cubic
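A minimal sklearn sketch of all three fits, on small made-up arrays (the data is illustrative only):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # single predictor
y = np.array([2.0, 4.1, 6.2, 8.1, 9.9])

# a. Simple Linear Regression: one predictor
slr = LinearRegression().fit(x, y)
print(slr.intercept_, slr.coef_)                    # a (intercept) and b (slope)

# b. Multiple Linear Regression: several predictors, same estimator
x_multi = np.array([[1, 7], [2, 6], [3, 5], [4, 4], [5, 3]])
mlr = LinearRegression().fit(x_multi, y)

# c. Polynomial Regression: expand x into polynomial terms (1, x, x^2),
#    then fit them with ordinary linear regression
poly = PolynomialFeatures(degree=2)
x_poly = poly.fit_transform(x)
pr = LinearRegression().fit(x_poly, y)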

### In-Sample Evaluation ###

a) Measures for Evaluation

i. R-Squared - R-squared, also known as the coefficient of determination, is a measure of how close
the data is to the fitted regression line. The value of R-squared is the percentage of variation of the
response variable (y) that is explained by the linear model. The value lies between 0 and 1; the closer
to 1, the better the fit.
ii. Mean Squared Error - The Mean Squared Error measures the average of the squares of the errors,
that is, the differences between the actual values (y) and the estimated values (yhat). The smaller,
the better.
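Both measures are available in sklearn.metrics; a short sketch with made-up actual and predicted
values:

from sklearn.metrics import mean_squared_error, r2_score

y_actual = [3.0, 5.0, 7.0, 9.0]
y_predicted = [2.8, 5.1, 7.3, 8.9]

print(r2_score(y_actual, y_predicted))              # closer to 1 is better
print(mean_squared_error(y_actual, y_predicted))    # smaller is better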
b) Visualization – plots can help evaluate the accuracy of the developed models

i. Simple Linear Regression:

Some methods to evaluate the model using visualization are:

1) Scatter plots - a linear distribution of the points suggests that a linear model fits the data well.

2) Residual Plot - The difference between the observed value (y) and the predicted value (Yhat) is called
the residual (e). When we look at a regression plot, the residual is the distance from the data point to
the fitted regression line. If the points in a residual plot are randomly (evenly) spread around the x-axis,
then a linear model is appropriate for the data. Why is that? Randomly spread out residuals means that
the variance is constant, and thus the linear model is a good fit for this data.

ii. Visualization for Multiple/Polynomial Regression:

To evaluate model fitness, a distribution plot is used to compare the distribution of predicted values
with the distribution of actual values. If the two distributions are similar, the model is likely accurate.
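A sketch of both plots with seaborn (the arrays, including the stand-in predicted values, are made up
for illustration):

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x_data = np.array([1, 2, 3, 4, 5, 6])
y_data = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2])
y_hat = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])   # stand-in predictions

# i. Residual plot: residuals should scatter randomly around zero
#    if a linear model is appropriate for the data
sns.residplot(x=x_data, y=y_data)
plt.show()

# ii. Distribution plot: compare actual values against predicted values
ax = sns.kdeplot(y_data, color='r', label='Actual Values')
sns.kdeplot(y_hat, color='b', label='Fitted Values', ax=ax)
plt.legend()
plt.show()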
#4 Model Evaluation

The dataset is split into training and testing data to evaluate the model and check how it predicts values
in the real world.

Methods >>>

1) Split the data into two sets: a) training data (the larger set) and b) testing data

Generalization Error: the error we get when predicting previously unseen values, i.e. how much the
predictions differ from the actual values.
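A sketch of the split with sklearn (the arrays and the 30% test size are illustrative choices):

import numpy as np
from sklearn.model_selection import train_test_split

x_data = np.arange(10, dtype=float).reshape(-1, 1)   # made-up feature column
y_data = 2.0 * x_data.ravel() + 1.0                  # made-up target

# Hold out 30% of the rows as the testing set
x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=0)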

2) Cross-Validation: one of the most common out-of-sample evaluation techniques

The entire data set is used for both testing and training by dividing the data into folds. For
example, a data set may be divided into 4 folds, each holding 25% of the data. In turn, 3 folds are used
for training and 1 fold for testing, until every fold has been used for both testing and training. At the
end, we take the average of all the out-of-sample estimates to finalize the model; this average is
reported as the mean estimation error or the mean R-squared.

The lower the mean error (or, equivalently, the higher the mean R-squared), the better the model.

For calculating the R^2 value:

# lr is an estimator, e.g. LinearRegression(); for regressors the default score is R^2
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, x_data, y_data, cv=3)   # cv=3 means 3 folds

For prediction:

from sklearn.model_selection import cross_val_predict
yhat = cross_val_predict(lr, x_data, y_data, cv=3)


3) Overfitting, Underfitting and Model Selection

Underfitting – the model is too simple to fit the data; predicted values do not align with most actual
values.

Overfitting – the model is too flexible and follows even the noise, missing the actual underlying function.

The higher the polynomial order, the lower the error on the training data; but on unseen data the error
starts increasing again after a certain order. For example, an 8th-order polynomial might turn out to be
the optimal order for model selection, as in the sketch below.
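A self-contained sketch of picking the order by test-set R-squared (the data and the range of orders
are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_data = np.linspace(0, 10, 60).reshape(-1, 1)        # made-up curved data
y_data = np.sin(x_data).ravel() + 0.1 * rng.standard_normal(60)

x_train, x_test, y_train, y_test = train_test_split(
    x_data, y_data, test_size=0.3, random_state=0)

# Test-set R^2 rises with order at first, then falls once the model
# starts following the noise; pick the order where it peaks
scores = []
for order in range(1, 11):
    poly = PolynomialFeatures(degree=order)
    lr = LinearRegression().fit(poly.fit_transform(x_train), y_train)
    scores.append(lr.score(poly.transform(x_test), y_test))

best_order = int(np.argmax(scores)) + 1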

4) Ridge Regression

Ridge regression is a method for preventing overfitting. Using a higher-order polynomial improves the
fit on the training data, but overfitting becomes a big problem when you have multiple independent
variables, or features. Ridge regression controls the magnitude of the polynomial coefficients by
introducing the parameter alpha. Alpha is a parameter we select before fitting or training the model.

Sklearn provides a Ridge estimator for conducting ridge regression during model fitting. To identify
the optimal value of alpha, run the model with different alpha values, from small to large, and select
the alpha with the highest R-squared.
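A sketch of that alpha search, reusing the x_train/x_test split from the sketch above (the candidate
alpha values are illustrative):

from sklearn.linear_model import Ridge

# Fit a Ridge model for each candidate alpha and keep the best test R^2
best_alpha, best_score = None, float('-inf')
for alpha in [0.001, 0.01, 0.1, 1, 10, 100]:
    ridge = Ridge(alpha=alpha).fit(x_train, y_train)
    score = ridge.score(x_test, y_test)
    if score > best_score:
        best_alpha, best_score = alpha, score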

5) Grid Search

The alpha we discussed in ridge regression is known as a hyperparameter. Multiple such
hyperparameters may be tuned to improve the prediction model.

Scikit-learn has a means of automatically iterating over such parameters using cross-validation, called
grid search.
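A sketch with sklearn's GridSearchCV, again reusing x_data and y_data from the earlier sketches (the
parameter grid is illustrative):

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Search the candidate alphas with 4-fold cross-validation
parameters = {'alpha': [0.001, 0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(Ridge(), parameters, cv=4)
grid.fit(x_data, y_data)

print(grid.best_params_)    # alpha with the best mean cross-validation score
print(grid.best_score_)     # that score (R^2 for regressors by default)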

Glossary

1) Pearson Correlation:

The Pearson correlation measures the linear dependence between two variables X and Y and tells you
how strong the relationship is.

The coefficient gives the strength and direction of the correlation between the two variables, and the
p-value indicates how statistically significant the observed correlation is. Normally, we choose a
significance level of 0.05, which means that we are 95% confident that the correlation between the
variables is significant.

2) ANOVA (Analysis of Variance):

The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant
differences between the means of two or more groups. ANOVA returns two parameters:

F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual
means deviate from the assumption, and reports it as the F-test score. A larger score means there is a
larger difference between the means.
P-value: the p-value tells how statistically significant our calculated score value is.

If a variable (for example, price) is strongly correlated with the variable we are analyzing, we expect
ANOVA to return a sizeable F-test score and a small p-value.

3) In-sample vs. out-of-sample evaluation

If you are forecasting for an observation that was part of the data sample, it is an in-sample forecast.

If you are forecasting for an observation that was not part of the data sample, it is an out-of-sample
forecast.

So the question you have to ask yourself is: was the particular observation used for the model fitting or
not? If it was used for the model fitting, then the forecast of the observation is in-sample; otherwise,
it is out-of-sample. If you use data from 1990-2013 to fit the model and then forecast for 2011-2013,
it is an in-sample forecast; but if you only use 1990-2010 for fitting the model and then forecast
2011-2013, it is an out-of-sample forecast.

4) R-squared:

R-squared (R2) is a statistical measure that represents the proportion of the variance for a dependent
variable that's explained by an independent variable or variables in a regression model.

5) Pipeline

A machine learning pipeline is used to help automate machine learning workflows. It works by chaining
a sequence of data transformations that feed into a model, which can then be tested and evaluated to
achieve an outcome. Machine learning (ML) pipelines consist of several steps to train a model and are
iterative: the steps are repeated to continuously improve the accuracy of the model and arrive at a
successful algorithm.

A typical machine learning pipeline would consist of the following processes:

Data collection > Data cleaning > Feature extraction (labelling and dimensionality reduction) > Model
validation > Visualization
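In sklearn terms, a pipeline can be sketched by chaining named steps; the steps and data below are
illustrative, not a definitive recipe:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Each (name, transformer/estimator) pair runs in sequence
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('polynomial', PolynomialFeatures(degree=2)),
    ('model', LinearRegression()),
])

x_data = np.arange(10, dtype=float).reshape(-1, 1)
y_data = x_data.ravel() ** 2

pipe.fit(x_data, y_data)       # runs every transform, then trains the model
yhat = pipe.predict(x_data)    # applies the transforms and predicts in one call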
