Ai Hon 4
Statistical Methods
Standard Deviation
Normalization / Feature Scaling
Min-Max Scaling
Bias
Variance
Regularization
Ridge Regression
Lasso Regression
Cross-Validation Techniques: K-Fold, LOOCV, Stratified K-Fold
Grid Search CV
CV Error
What is meant by statistical methods?
Definition. Statistical methods are mathematical formulas, models, and techniques
that are used in statistical analysis of raw research data. The application of statistical
methods extracts information from research data and provides different ways to assess
the robustness of research outputs.
Statistical methods refer to general principles and techniques which are commonly
used in the collection, analysis and interpretation of data.
Standard deviation:
The sample standard deviation is given by

s = √( Σ (xi − x̄)² / (n − 1) )

Here,
s = sample standard deviation
n = number of observations in the sample
xi = ith observation in the sample
x̄ = sample mean
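As a quick check, the sample standard deviation can be computed directly from this definition (the data values here are made up for illustration):

```python
import math
import statistics

def sample_std(xs):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_std(data))  # same value as statistics.stdev(data)
```

The result agrees with `statistics.stdev`, which implements the same n − 1 (sample) formula.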
Feature Scaling is a technique to standardize the independent features present in the
data in a fixed range. It is performed during the data pre-processing.
Working:
Given a dataset with the features Age, Salary, and BHK Apartment, with a data size of 5,000
people, each having these independent data features.
Each data point is labeled as:
Class1- YES (means with the given Age, Salary, BHK Apartment feature value one
can buy the property)
Class2- NO (means with the given Age, Salary, BHK Apartment feature value one
can’t buy the property).
Using a dataset to train the model, one aims to build a model that can predict whether
one can buy a property or not with given feature values.
Once the model is trained, an N-dimensional (where N is the no. of features present in
the dataset) graph with data points from the given dataset, can be created. The figure
given below is an ideal representation of the model.
As shown in the figure, star data points belong to Class1 – Yes and circles
represent Class2 – No labels, and the model gets trained using these data points. Now
a new data point (diamond as shown in the figure) is given and it has different
independent values for the 3 features (Age, Salary, BHK Apartment) mentioned above.
The model has to predict whether this data point belongs to Yes or No.
Prediction of the class of new data points:
The model calculates the distance of this data point from the centroid of each class
group. Finally, this data point will belong to that class, which will have a minimum
centroid distance from it.
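The centroid-distance prediction described here can be sketched as follows. The training points and the new data point are made-up values; note also that without feature scaling the Salary column dominates the distance:

```python
import numpy as np

# Made-up training points: [Age, Salary, BHK]
yes_points = np.array([[30, 70000, 3], [45, 90000, 2], [38, 85000, 3]])  # Class1 - YES
no_points  = np.array([[22, 20000, 1], [25, 30000, 1], [28, 25000, 2]])  # Class2 - NO

def nearest_centroid(point, classes):
    """Assign the point to the class whose centroid is closest (Euclidean distance)."""
    centroids = {label: pts.mean(axis=0) for label, pts in classes.items()}
    return min(centroids, key=lambda label: np.linalg.norm(point - centroids[label]))

new_point = np.array([35, 80000, 2])  # the "diamond" data point
print(nearest_centroid(new_point, {"YES": yes_points, "NO": no_points}))
```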
Feature Scaling:-
Feature scaling is a method used to normalize the range of independent variables or
features of data. In data processing, it is also known as data normalization and is generally
performed during the data preprocessing step.
Feature Engineering is a big part of Data Science and Machine Learning, and Feature Scaling is one of the
last steps in the Feature Engineering life cycle. It is a technique to standardize the independent features
of the data to a fixed range or scale, hence the name Feature Scaling.
In simple words, once we are done with all the other steps of feature engineering, like encoding variables
and handling missing values, we scale all the variables to a small range of, say, -1 to +1, so all the
data gets squeezed to decimal values between -1 and +1. Scaling keeps the distribution of the data,
the correlation, and the covariance exactly the same, while bringing every independent feature column
to a smaller scale. We do this because most ML algorithms perform significantly better
after scaling.
Standardization:
Standard Scaler
Normalization:
Min Max Scaling
Mean Normalization
Max Absolute Scaling
Robust Scaling etc.
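As an illustration of standardization (the Standard Scaler approach), here is a sketch on made-up salary values: each value has the mean subtracted and is divided by the standard deviation:

```python
import numpy as np

salaries = np.array([20000., 30000., 70000., 85000., 90000.])  # made-up values

# Standardization (z-score): subtract the mean, divide by the standard deviation.
standardized = (salaries - salaries.mean()) / salaries.std()

print(standardized.mean())  # approximately 0
print(standardized.std())   # approximately 1
```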
Normalization
Normalization is a technique often applied as part of data preparation for machine learning. The goal of
normalization is to change the values of numeric columns in the dataset to use a common scale, without
distorting differences in the ranges of values or losing information.
Min Max Scaling
Min-max normalization is one of the most common ways to normalize data. For every feature,
the minimum value of that feature gets transformed into a 0, the maximum value gets
transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.
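A minimal sketch of min-max normalization on made-up ages, using the formula (x − min) / (max − min):

```python
import numpy as np

ages = np.array([22., 25., 28., 30., 38., 45.])  # made-up values

# Min-max scaling: the minimum maps to 0, the maximum to 1,
# and everything else to a decimal in between.
scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled)
```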
Bias:-
It is a lack of objectivity when looking at something. The bias can be both intentional and
unintentional. For example, a person may like one shirt more than two others when given a choice
because the shirt they picked is also their favorite color.
Statistical bias is anything that leads to a systematic difference between the true parameters
of a population and the statistics used to estimate those parameters.
Bias is a statistical term that means a systematic deviation from the actual value. It can create
serious problems for the researcher because, unlike random error, it cannot be reduced by merely
increasing the sample size. Bias is the difference between the expected value of an estimator and
the true value of the parameter. Here, we discuss the classification of bias and its different types.
Selection Bias
Spectrum Bias
Cognitive Bias
Data-Snooping Bias
Omitted-Variable Bias
Exclusion Bias
Analytical Bias
Reporting Bias
Funding Bias
Classification of Bias
The bias is mainly categorized into two different types
Measurement Bias
Measurement bias takes place while the survey is being carried out, and it arises
chiefly for three reasons:
(i) Errors while recording the data
While recording data, errors happen due to the malfunction of instruments that are used for data
collection or because of ineffective handling of these tools by the researchers concerned with data
collection.
(ii) Leading Questions
The questions prepared for the survey might be worded so as to lead respondents toward the
answers preferred by the researcher. There can also be more answer choices offered for the
preferred response than for conflicting views.
(iii) Inadvertent false responses from respondents
There can be situations where many respondents misunderstand a question and choose an
incorrect option. If the sample group is composed of numerous senior citizens who are asked
to answer by remembering their previous experiences, they may provide some false inputs
because of lapses in memory.
Variance:-
What Is Variance?
The term variance refers to a statistical measurement of the spread between numbers
in a data set. More specifically, variance measures how far each number in the set is
from the mean (average), and thus from every other number in the set. Variance is often
depicted by this symbol: σ2. It is used by both analysts and traders to
determine volatility and market security.
The square root of the variance is the standard deviation (SD or σ), which helps
determine the consistency of an investment’s returns over a period of time.
Understanding Variance
Statisticians use variance to see how individual numbers relate to each other within a
data set, rather than using broader mathematical techniques such as arranging numbers
into quartiles. The advantage of variance is that it treats all deviations from the mean as
the same regardless of their direction. Because the deviations are squared, they cannot sum to
zero and falsely give the appearance of no variability at all in the data.
One drawback to variance, though, is that it gives added weight to outliers. These are
the numbers far from the mean. Squaring these numbers can skew the data. Another
pitfall of using variance is that it is not easily interpreted. Users often employ it primarily
to take the square root of its value, which indicates the standard deviation of the data.
As noted above, investors can use standard deviation to assess how consistent returns
are over time.
Here’s a hypothetical example to demonstrate how variance works. Let’s say returns for
stock in Company ABC are 10% in Year 1, 20% in Year 2, and −15% in Year 3. The
average of these three returns is 5%. The differences between each return and the
average are 5%, 15%, and −20% for each consecutive year.
Squaring these deviations yields 0.25%, 2.25%, and 4.00%, respectively. Adding these
squared deviations gives a total of 6.5%. Dividing 6.5% by one less than the number of
returns in the data set (3 − 1 = 2), since this is a sample, gives a variance
of 3.25% (0.0325). Taking the square root of the variance yields a standard deviation of
about 18% (√0.0325 ≈ 0.180) for the returns.
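The arithmetic of this Company ABC example can be reproduced directly:

```python
# Yearly returns for the hypothetical Company ABC example above.
returns = [0.10, 0.20, -0.15]

mean = sum(returns) / len(returns)                     # 5%
squared_devs = [(r - mean) ** 2 for r in returns]      # 0.25%, 2.25%, 4.00%
variance = sum(squared_devs) / (len(returns) - 1)      # sample variance (divide by n - 1)
std_dev = variance ** 0.5

print(variance)  # approximately 0.0325
print(std_dev)   # approximately 0.180
```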
Variance is essentially the degree of spread in a data set about the mean value of that
data. It shows the amount of variation that exists among the data points. Visually, the
larger the variance, the "fatter" a probability distribution will be. In finance, if something
like an investment has a greater variance, it may be interpreted as more risky or volatile.
Standard deviation is the square root of variance. It is sometimes more useful since
taking the square root removes the units from the analysis. This allows for direct
comparisons between different things that may have different units or different
magnitudes. For instance, to say that increasing X by one unit increases Y by two
standard deviations allows you to understand the relationship between X and Y
regardless of what units they are expressed in.
Variance is the average squared deviations from the mean, while standard deviation is
the square root of this number. Both measures reflect variability in a distribution, but their
units differ:
Standard deviation is expressed in the same units as the original values (e.g.,
minutes or meters).
Variance is expressed in much larger units (e.g., meters squared).
Although the units of variance are harder to intuitively understand, variance is important
in statistical tests.
Statistical tests such as variance tests or the analysis of variance (ANOVA) use
sample variance to assess group differences of populations. They use the variances of
the samples to assess whether the populations they come from significantly differ from
each other.
What is Homoscedasticity?
Homoscedasticity means that the variance of the errors (residuals) is constant across all values
of the independent variables; when the error variance changes across values, the data are said to
be heteroscedastic. Linear regression inference typically assumes homoscedasticity.
What is Regularization?
Sometimes a machine learning model performs well with the training data but does not perform
well with the test data: it cannot predict the output for unseen data because it has fit the
noise in the training output, and such a model is called overfitted. This problem
can be dealt with using a regularization technique.
This technique allows us to retain all variables or features in the model while reducing
the magnitude of their coefficients. Hence, it maintains accuracy as well as the
generalization of the model.
It mainly regularizes or reduces the coefficient of features toward zero. In simple words,
"In regularization technique, we reduce the magnitude of the features by keeping the
same number of features."
Regularization works by adding a penalty or complexity term to the complex model. Let's
consider the simple linear regression equation:
y = β0 + β1x1 + β2x2 + β3x3 + ⋯ + βnxn + b
Here y is the value to be predicted, x1, …, xn are the features, β1, …, βn are the weights
or magnitudes attached to the features, β0 represents the bias of the model, and b represents
the intercept.
Linear regression models try to optimize the coefficients β and the intercept b to minimize the
cost function. The cost function for the linear model is the sum of squared prediction errors:

Cost = Σi (yi − ŷi)²

Now, we add a penalty term to this loss function and optimize the parameters so the model can
predict accurate values of y. The loss function for linear regression is called RSS,
or the Residual Sum of Squares.
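A small sketch of the RSS loss, with made-up data and assumed coefficient values:

```python
import numpy as np

# Made-up data and assumed coefficients, to illustrate the RSS loss.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

beta0, beta1 = 0.0, 2.0          # assumed intercept and weight
y_hat = beta0 + beta1 * x        # model predictions
rss = np.sum((y - y_hat) ** 2)   # Residual Sum of Squares

print(rss)  # approximately 0.10
```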
Techniques of Regularization
There are mainly two types of regularization techniques, which are given below:
Ridge Regression
o Ridge regression is one of the types of linear regression in which a small amount
of bias is introduced so that we can get better long-term predictions.
o Ridge regression is a regularization technique, which is used to reduce the
complexity of the model. It is also called L2 regularization.
o In this technique, the cost function is altered by adding the penalty term to it. The
amount of bias added to the model is called Ridge Regression penalty. We can
calculate it by multiplying with the lambda to the squared weight of each individual
feature.
o The equation for the cost function in ridge regression is:
  Cost = Σi (yi − ŷi)² + λ Σj βj²
o In the above equation, the penalty term regularizes the coefficients of the model;
hence ridge regression shrinks the magnitudes of the coefficients, which
decreases the complexity of the model.
o As we can see from the above equation, if the values of λ tend to zero, the
equation becomes the cost function of the linear regression model. Hence,
for the minimum value of λ, the model will resemble the linear regression model.
o A general linear or polynomial regression will fail if there is high collinearity
between the independent variables, so to solve such problems, Ridge regression
can be used.
o It helps to solve the problems if we have more parameters than samples.
Lasso Regression (L1 Regularization):-
o Lasso regression is another regularization technique to reduce the complexity of
the model. It stands for Least Absolute Shrinkage and Selection Operator.
o It is similar to the Ridge Regression except that the penalty term contains only the
absolute weights instead of a square of weights.
o Since it takes absolute values, hence, it can shrink the slope to 0, whereas Ridge
Regression can only shrink it near to 0.
o It is also called L1 regularization. The equation for the cost function of Lasso
regression is:
  Cost = Σi (yi − ŷi)² + λ Σj |βj|
o Some of the features in this technique are completely neglected for model
evaluation.
o Hence, the Lasso regression can help us to reduce the overfitting in the model as
well as the feature selection.
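A sketch contrasting the two penalties with scikit-learn, on made-up data where only the first of five features matters; the `alpha` values (scikit-learn's name for the λ penalty strength) are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Made-up data: y depends only on the first of five features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks coefficients toward 0
lasso = Lasso(alpha=0.5).fit(X, y)  # L1 penalty: can set coefficients exactly to 0

print(ridge.coef_)  # all five coefficients shrunk, but nonzero
print(lasso.coef_)  # sparse: irrelevant coefficients can become exactly 0
```

This shows the behavior described above: ridge only shrinks coefficients near zero, while lasso can zero them out entirely, performing feature selection.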
Cross-validation:-
Cross-validation is a resampling method that uses different portions of the data to test
and train a model on different iterations. It is mainly used in settings where the goal is
prediction, and one wants to estimate how accurately a predictive model will perform in
practice.
Cross-Validation is a resampling technique with the fundamental idea of splitting the
dataset into 2 parts- training data and test data. Train data is used to train the model and
the unseen test data is used for prediction.
Cross-validation is classified into two broad categories: non-exhaustive and exhaustive
methods.
Non-exhaustive Methods
Non-exhaustive cross validation methods, as the name suggests do not compute all ways
of splitting the original data. Let us go through the methods to get a clearer understanding.
Holdout method
This is a quite basic and simple approach in which we divide our entire dataset into two
parts: training data and testing data. As the name suggests, we train the model on the training
data and then evaluate it on the testing set. Usually, the training data is set to more than
twice the size of the testing data, so the data is split in a ratio of 70:30 or 80:20.
In this approach, the data is first shuffled randomly before splitting. As the model is trained
on a different combination of data points, the model can give different results every time
we train it, and this can be a cause of instability. Also, we can never assure that the train
set we picked is representative of the whole dataset.
Also when our dataset is not too large, there is a high possibility that the testing data may
contain some important information that we lose as we do not train the model on the
testing set.
The hold-out method is good to use when you have a very large dataset, you’re on a time
crunch, or you are starting to build an initial model in your data science project.
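The holdout split above can be sketched with scikit-learn's `train_test_split` (toy arrays; the 70:30 ratio and `random_state` are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # 10 toy samples with 2 features
y = np.arange(10)

# Shuffle, then hold out 30% of the data for testing (a 70:30 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(len(X_train), len(X_test))  # 7 3
```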
K-fold cross validation is one way to improve the holdout method. This method
ensures that the score of our model does not depend on the way we picked the train
and test set. The data set is divided into k subsets and the holdout method is
repeated k times. Let us go through this in steps:
1. Shuffle the dataset and split it into k equal-sized subsets (folds).
2. Hold out one fold as the test set and train the model on the remaining k − 1 folds.
3. Evaluate the model on the held-out fold and record the score.
4. Repeat steps 2 and 3 until every fold has been the test set once, then average the k scores.
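The k-fold procedure can be sketched with scikit-learn's `KFold` (toy data; `n_splits=5` and the `random_state` are arbitrary choices):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)  # 10 toy samples

# Each of the 5 folds is held out exactly once as the test set.
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print("train:", train_idx, "test:", test_idx)
```

Over the five iterations, every sample appears in a test fold exactly once.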
Using K Fold on a classification problem can be tricky. Since we are randomly shuffling
the data and then dividing it into folds, chances are we may get highly imbalanced folds
which may cause our training to be biased. For example, suppose we get a fold in which the
majority of points belong to one class (say positive) and only a few to the negative class.
This will certainly ruin our training, and to avoid it we make stratified folds using stratification.
Stratification is the process of rearranging the data so as to ensure that each fold is a
good representative of the whole. For example, in a binary classification problem where
each class comprises 50% of the data, it is best to arrange the data such that every
fold contains about half the instances of each class.
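A sketch of stratification with scikit-learn's `StratifiedKFold`, using a made-up imbalanced label array; every fold preserves the 2:1 class ratio of the full dataset:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(12).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 4)  # imbalanced: 8 negatives, 4 positives

# Stratified folds keep the class proportions of the full dataset.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
for _, test_idx in skf.split(X, y):
    print(np.bincount(y[test_idx]))  # [2 1] in every fold
```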
Exhaustive Methods
Exhaustive cross validation methods train and test on all possible ways to divide the original
sample into a training and a validation set.
Leave-P-Out cross validation
When using this exhaustive method, we take p points out of the total number of data points
in the dataset (say n). While training the model, we train it on these (n − p) data points
and test the model on the p held-out points. We repeat this process for all possible
combinations of p points from the original dataset. Then, to get the final accuracy, we
average the accuracies from all these iterations.
Leave-One-Out cross validation (LOOCV)
This is a simple variation of Leave-P-Out cross validation with the value of p set to one.
This makes the method far less costly, since for n data points and p = 1 there are only
n train-test combinations.
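A sketch with scikit-learn's `LeaveOneOut` on five toy samples, showing the n train-test combinations:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(5).reshape(-1, 1)  # 5 toy samples

# LOOCV: one split per sample; each split tests on exactly one point.
splits = list(LeaveOneOut().split(X))
print(len(splits))  # 5 splits for 5 data points
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```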
For time-series data the above-mentioned methods are not the best ways to evaluate the
models. Here are two reasons as to why this is not an ideal way to go:
1. Shuffling the data messes up the time section of the data as it will disrupt the order
of events
2. Using cross-validation, there is a chance that we train the model on future data
and test on past data, which breaks the golden rule of time series: “peeking
into the future is not allowed”.
The purpose of cross–validation is to test the ability of a machine learning model to predict
new data. It is also used to flag problems like overfitting or selection bias and gives
insights on how the model will generalize to an independent dataset.
Holdout Method
K-Fold Cross-Validation
Stratified K-Fold Cross-Validation
Leave-P-Out Cross-Validation
Cross-validation is a procedure that is used to avoid overfitting and estimate the skill of
the model on new data. There are common tactics that you can use to select the value of
k for your dataset.
What is GridSearchCV?
GridSearchCV is the process of performing hyperparameter tuning in order to determine
the optimal values for a given model. As mentioned above, the performance of a model
significantly depends on the value of hyperparameters. Note that there is no way to know
in advance the best values for hyperparameters so ideally, we need to try all possible
values to know the optimal values. Doing this manually could take a considerable amount
of time and resources and thus we use GridSearchCV to automate the tuning of
hyperparameters.
param_grid = { 'C': [...], 'gamma': [...],
               'kernel': ['rbf', 'linear', 'sigmoid'] }
Here C, gamma, and kernel are some of the hyperparameters of an SVM model. Note
that the rest of the hyperparameters will be set to their default values.
GridSearchCV tries all the combinations of the values passed in the dictionary and
evaluates the model for each combination using the Cross-Validation method. Hence
after using this function we get accuracy/loss for every combination of hyperparameters
and we can choose the one with the best performance.
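A runnable sketch of GridSearchCV with an SVM; the dataset (iris) and the specific grid values are assumptions for illustration, not taken from the original:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Assumed grid: every combination is evaluated with 5-fold cross-validation.
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [1, 0.1, 0.01],
    'kernel': ['rbf', 'linear', 'sigmoid'],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # combination with the best mean CV accuracy
print(search.best_score_)
```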
Parameter:
The configuration parameters of a model are internal to the model. These are learned and
set by the model itself.
Hyperparameter:
Hyperparameters are parameters that are explicitly specified and control the training
process. These are set manually by a machine learning engineer/practitioner.
What is cross-validation?
Cross-Validation is a technique used in model selection to better estimate the test error
of a predictive model. The idea behind cross-validation is to create a number of partitions
of sample observations, known as the validation sets, from the training data set. After
fitting a model on to the training data, its performance is measured against each validation
set and then averaged, gaining a better assessment of how the model will perform when
asked to predict for new observations. The number of partitions to construct depends on
the number of observations in the sample data set as well as the decision made regarding
the bias-variance trade-off, with more partitions leading to a smaller bias but a higher
variance.
K-Fold cross-validation
This is the most common use of cross-validation. Observations are split into K partitions,
the model is trained on K – 1 partitions, and the test error is predicted on the left out
partition k. The process is repeated for k = 1,2…K and the result is averaged. If K=n, the
process is referred to as Leave One Out Cross-Validation, or LOOCV for short. This
approach has low bias but is computationally expensive, and the estimates from each fold
are highly correlated. In this tutorial we will use K = 5.
CV Error:-
Cross validated error is your best guess for the average error you would see with your
regression model on new data.
The basic idea in calculating cross validation error is to divide the training data into
k folds (e.g. k = 5 or k = 10). Each fold is then held out one at a time, the model is
trained on the remaining data, and that model is used to predict the target for
the holdout observations.
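A sketch of computing cross validated error with scikit-learn; the diabetes dataset, linear regression, and k = 5 are assumptions for illustration:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# k = 5 folds: each fold is held out once, the model is trained on the rest,
# and the mean squared error on the held-out fold is recorded.
scores = cross_val_score(LinearRegression(), X, y, cv=5,
                         scoring='neg_mean_squared_error')
cv_error = -scores.mean()  # estimated average error on new data

print(cv_error)
```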