
Lecture 3 - Exploratory Data Analysis

Part I – Introduction to EDA


Definition: Exploratory data analysis (EDA) is a critical part of the data science process, and
the first step toward building a model. The basic tools of EDA are plots, graphs, and summary
statistics.
Although EDA involves a lot of visualization, we distinguish between EDA and data visualization:
● EDA is done toward the beginning of the analysis.
● Data visualization is done toward the end to communicate one's findings.
● In EDA, the graphics are made solely to help you understand what is going on.

Data Science Cycle

Part II - Guide to perform EDA


1. First, we find out how many observations (rows) and variables (columns) we have in our
data, in addition to the name of the variables.
2. We use the “table” function in R to find out the number of observations for each type in a
specific variable.
3. Before we explore the data, we need to take care of missing values by setting them to
NA.
4. It's very helpful to create cross tables using the "table" function in R, especially if you
have a spatio-temporal application in mind.
5. Make sure that the data type of each column is representative (Boolean, character,
integer, etc.), and keep in mind that R reads the letters T and F as Boolean values when you
may want them to be characters, which will affect the ML algorithm later on.
6. Apply "split-apply-combine" if needed: split the data frame into groups, apply
a function to each group, and then combine the results into a new data frame.
7. Change the type of a variable as needed, and draw multiple graphs in the same
figure using the "wrap" function in R for easier comparison.
8. It’s helpful to use boxplots to observe the distribution of certain variables.

Conclusion: After getting to know your variables, take care of missing values and variable
types, and then use visualizations to analyze your variables.
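
The notes reference R's "table" function; below is a minimal pandas sketch of the same EDA steps. The file name "data.csv" and the columns "city", "month", and "temperature" are hypothetical, used only for illustration.

```python
# A minimal pandas sketch of the EDA steps above; file and column names are hypothetical.
import pandas as pd
import numpy as np

df = pd.read_csv("data.csv")

# 1. Number of observations, variables, and variable names
print(df.shape)
print(df.columns.tolist())

# 2. Counts per category (analogous to R's table())
print(df["city"].value_counts())

# 3. Mark sentinel values (e.g. -999) as missing
df = df.replace(-999, np.nan)
print(df.isna().sum())

# 4. Cross table of two categorical variables
print(pd.crosstab(df["city"], df["month"]))

# 5. Check and fix column types
print(df.dtypes)
df["city"] = df["city"].astype(str)

# 6. Split-apply-combine: group, aggregate, recombine
print(df.groupby("city")["temperature"].agg(["mean", "median"]))

# 8. Boxplot to inspect a variable's distribution
df.boxplot(column="temperature", by="city")
```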

Lecture 4 - Introduction to Machine Learning


Supervised Learning: Classification, Regression or Ranking
Unsupervised Learning: Anomaly Detection, Clustering
Noisy data is meaningless data that cannot be understood and interpreted correctly by
machines, such as unstructured text.
Overfitting occurs when a model begins to "memorize" training data rather than "learning" to
generalize from the trend.
Underfitting occurs when the model or algorithm shows low variance but high bias: the
expected value of the estimates differs systematically from the true underlying quantity
being estimated.
The trade-off between overfitting and underfitting is known as the bias-variance trade-off.
Ranking labels in supervised learning are called "ordinals".
A majority class exists when one class has many more records than the others; a common rule
of thumb is to treat the data as imbalanced when the minority class makes up only around 5% to 10% of it.
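
A small illustrative sketch of under- vs. overfitting, not from the lecture: it fits polynomials of increasing degree to synthetic noisy data, where a low degree underfits (high bias) and a very high degree overfits (high variance).

```python
# Illustrative only: polynomial fits of different degrees on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

for degree in (1, 3, 12):
    coeffs = np.polyfit(x, y, degree)      # fit polynomial of the given degree
    y_hat = np.polyval(coeffs, x)
    train_mse = np.mean((y - y_hat) ** 2)
    # degree 1 underfits, degree 12 memorizes the noise, degree 3 is a balance
    print(f"degree={degree:2d}  training MSE={train_mse:.3f}")
```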

Lecture 5 - kNN
Lazy Learning – Classification Using Nearest Neighbors

NN classifiers: classify unlabeled examples by assigning them the class of the most similar
labeled examples.
Strengths:
● Simple and effective
● Makes no assumptions about the underlying data distribution
● Fast Training Phase

Weaknesses:
● Does not produce a model, which limits the ability to find novel insights into relationships
among features
● Slow Classification Phase
● Requires a large amount of memory

The kNN algorithm


1. Let k be an integer specified in advance. Begin with a training dataset.
2. Assume that we have a test dataset containing unlabeled examples that otherwise have
the same features as the training data.
3. For each record in the test dataset, identify k records in the training data that are the
"nearest" in similarity.
4. The unlabeled test instance is assigned the class of the majority of the k nearest
neighbors.
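
A minimal scikit-learn sketch of this procedure, using the built-in iris data purely for illustration; the train/test split stands in for the "training" and "test" datasets named above.

```python
# kNN sketch: store the training data, then label each test record
# with the majority class of its k nearest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

knn = KNeighborsClassifier(n_neighbors=5)   # k specified in advance
knn.fit(X_train, y_train)                   # "training" mostly just stores the data

print(knn.score(X_test, y_test))            # accuracy on the unlabeled-style test set
```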

Scaling and Normalization


The traditional method is min-max normalization, which transforms all the values so that they
fall between 0 and 1:

X_new = (X − min(X)) / (max(X) − min(X))

The issue with this method is that it squashes the outliers together with the rest of the values,
so you can no longer identify them.
To solve this problem we use z-score standardization:

X_new = (X − μ) / σ
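
A short sketch contrasting the two approaches with scikit-learn's built-in scalers, on a made-up column containing one outlier.

```python
# Min-max normalization vs. z-score standardization.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

print(MinMaxScaler().fit_transform(X).ravel())
# the outlier squashes the remaining values into a narrow band near 0

print(StandardScaler().fit_transform(X).ravel())
# z-scores leave the outlier clearly separated from the other values
```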

Calculating the Distance


The distance is usually calculated after scaling and normalization, typically using the
Euclidean distance. Since the Euclidean distance isn't defined for nominal data, we can first
encode such data, either with dummy encoding or with one-hot encoding.

Best practices for kNN


● Start by removing the ID column (not descriptive) and setting aside the label column (for
learning purposes), for more accurate results.
● Perform hyperparameter tuning.
● Test alternative values of k (see the cross-validation sketch below).
● Perform cross-validation.
● Change the training set vs. testing set ratio.
● Change the distance function.
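
A hedged sketch of several of these practices at once, using scikit-learn's GridSearchCV: scaling inside a pipeline, cross-validation, alternative values of k, and alternative distance functions. The iris data is used only as a stand-in.

```python
# Tune k and the distance metric with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_iris(return_X_y=True)

pipe = Pipeline([("scale", MinMaxScaler()),
                 ("knn", KNeighborsClassifier())])

param_grid = {
    "knn__n_neighbors": [1, 3, 5, 7, 9, 11],    # alternative values of k
    "knn__metric": ["euclidean", "manhattan"],  # alternative distance functions
}

search = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```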
Lecture 6 - Decision Trees and Rules
Decision trees and rule learners:
● Divide data into smaller and smaller portions
● Identify patterns that can be used for prediction.
● Knowledge is presented in the form of logical structures
● Understood without any statistical knowledge.
May not be an ideal fit:
● when the data has a large number of nominal features with many levels.
● when the data has a large number of numeric features.

Divide and Conquer


Decision trees are built using divide and conquer:
● Begin at the root node
● The node represents the entire dataset
● Choose a feature that is the most predictive of the target class.
● The examples are then partitioned into groups of distinct values of this feature;
● This decision forms the first set of tree branches.
● The algorithm continues to divide-and-conquer the nodes,
● Chooses the best candidate feature each time until a stopping criterion is reached.

The C5.0 decision tree algorithm

Strengths:
● An all-purpose classifier that does well on most problems
● Can handle numeric or nominal features, missing data
● Uses only the most important features
● Can be used on data with relatively few training examples
Weaknesses:
● Often biased toward splits on features having a large number of levels
● It is easy to overfit or underfit the model
● Can have trouble modeling some relationships due to reliance on axis-parallel splits
● Small changes in training data can result in large changes to decision logic
● Large trees can be difficult to interpret and the decisions they make may seem
counterintuitive.
Choosing the best split

C5.0 and many other decision tree algorithms use entropy to measure purity; the best split is
the one that results in the purest partitions (if a segment of data contains only a single class,
it is considered pure).
The entropy of a sample of data indicates how mixed the class values are: the minimum value of 0
means the data is completely homogeneous, while a value of 1 indicates the maximum amount of
disorder (for two classes).
For a given segment of data (S):
c: the number of different class levels
p(i): the proportion of values falling into class level i

Entropy(S) = − Σ (from i = 1 to c) p(i) · log2(p(i))

The algorithm uses entropy to calculate information gain, the change in homogeneity that results
from a split on a feature F:

InfoGain(F) = Entropy(S1) − Entropy(S2)

where Entropy(S1) is the entropy of the segment before the split and Entropy(S2) is the weighted
sum of the entropies of the partitions after the split. The higher the information gain, the
better a feature is at creating homogeneous groups after a split on that feature.
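
A small sketch computing entropy and information gain for one candidate split, following the definitions above; the class counts are made up for illustration.

```python
# Entropy (base-2) and information gain for a candidate split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

parent = np.array(["yes"] * 9 + ["no"] * 5)   # segment before the split
left   = np.array(["yes"] * 6 + ["no"] * 2)   # partitions after the split
right  = np.array(["yes"] * 3 + ["no"] * 3)

weighted_child = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
info_gain = entropy(parent) - weighted_child
print(round(entropy(parent), 3), round(info_gain, 3))
```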

Pruning the decision tree

A decision tree can continue to grow indefinitely, choosing splitting features and dividing into
smaller and smaller partitions until each example is perfectly classified or the algorithm runs out
of features to split on.
If the tree grows overly large, many of the decisions it makes will be overly specific and the
model will have been overfitted to the training data.
Pruning a decision tree means reducing its size so that it generalizes better to unseen data.
Pre-pruning: stop the tree from growing (and doing needless work) once it reaches a certain
number of decisions or once the decision nodes contain only a small number of examples.
Post-pruning: grow a tree that is intentionally too large, then use pruning criteria based on the
error rates at the nodes to reduce the size of the tree to a more appropriate level.

Improving model performance


Adaptive boosting: a process in which many decision trees are built and the trees vote on the
best class for each example. It combines a number of weakly performing learners to create a team
that is much stronger than any one of the learners alone. Each of the models has a unique set
of strengths and weaknesses and may be better or worse at certain problems.
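
The lecture discusses boosting in the context of C5.0 in R; the sketch below uses scikit-learn's AdaBoostClassifier as an analogous approach, on synthetic data.

```python
# Adaptive boosting: an ensemble of shallow trees (weak learners) that
# jointly vote on the class of each example.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(boosted, X, y, cv=5).mean())
```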

Defining a cost matrix: this creates a matrix with two rows and two columns that specifies the
cost of each type of decision (e.g. false positives vs. false negatives).
Rule-based learners should be the baseline for our model (read more about it in the slides).

Lecture 7 - Data Preparation for ML


Data preparation is the process of transforming raw data into a form that is more appropriate for
modeling.

Data Cleaning

The performance of an ML algorithm is only as good as its data, so:

● Use statistics to define normal data and identify outliers
● Identify columns that have a single value or no variance and remove them
● Identify duplicate rows and remove them
● Mark empty values as missing
● Impute missing values using statistics or a learned model.

Basic Data Cleaning:


1. Remove columns (variables) that contain a single value.
2. Treat with caution columns (variables) that contain only a few distinct values.
3. Delete rows that contain duplicate data.

Outlier Identification and Removal


An outlier is an observation that is unlike the other observations. They are rare, distinct, or do
not fit in some way.
If the data is Gaussian:
If we know that the distribution of values in the sample is Gaussian or Gaussian-like, we can
use the standard deviation of the sample as a cut-off for identifying outliers.
Given the mean (mu) and standard deviation (sigma), a simple way to identify outliers is to
compute a z-score for every value xi, defined as z = (xi − mu) / sigma, the number of standard
deviations xi lies away from the mean.
Otherwise:
We use the interquartile range (IQR) method. The IQR is calculated as the difference between the
75th and the 25th percentiles of the data and defines the box in a box-and-whisker plot.
The IQR can be used to identify outliers by defining limits on the sample values that are a factor
k of the IQR below the 25th percentile or above the 75th percentile.
The common value for the factor k is 1.5. A factor k of 3 or more can be used to identify values
that are extreme outliers, or "far outs" when described in the context of box-and-whisker plots.
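
A minimal sketch of the IQR method with k = 1.5, on made-up data.

```python
# IQR outlier rule: flag values beyond k * IQR outside the 25th/75th percentiles.
import numpy as np

x = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102, 12, 14, 17])

q25, q75 = np.percentile(x, [25, 75])
iqr = q75 - q25
k = 1.5                                    # use 3 or more for "extreme" outliers
lower, upper = q25 - k * iqr, q75 + k * iqr

outliers = x[(x < lower) | (x > upper)]
print(lower, upper, outliers)              # 102 falls outside the limits
```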

Automatic Outlier Detection


The local outlier factor, or LOF for short, is a technique that builds on the idea of nearest
neighbors for outlier detection. Each example is assigned a score indicating how isolated it is,
or how likely it is to be an outlier, based on the size of its local neighborhood. The examples
with the largest scores are more likely to be outliers.
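
A hedged sketch of automatic outlier detection with scikit-learn's LocalOutlierFactor on synthetic data.

```python
# LOF scores each point by how isolated it is within its local neighborhood.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               [[8.0, 8.0]]])                 # one obvious outlier appended last

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                   # -1 marks likely outliers
print(np.where(labels == -1)[0])              # index 100 should appear
```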

Missing Values
1. Mark invalid or corrupt values as missing in your dataset.
2. Confirm that the presence of marked missing values causes problems for learning
algorithms.
3. You may choose to remove rows with missing data from your dataset and evaluate a
learning algorithm on the transformed dataset.
4. Removing rows can be too limiting on some predictive modeling problems, for example if more
than 30% of the data is missing.

How to deal with it?


Statistical Imputation: estimate a statistical value for a column from the values that are
present, then replace all missing values in the column with the calculated statistic.
Common statistics calculated include the column mean value, median value, mode value, or a
constant value.
KNNImputer Data Transform: The KNNImputer from scikit-learn is a data transformation that
is first configured based on the method used to estimate the missing values. The default
distance measure is a Euclidean distance measure that is NaN aware, e.g. will not include NaN
values when calculating the distance between members of the training dataset.
Iterative Imputation: Iterative imputation refers to a process where each feature is modeled as
a function of the other features, e.g. a regression problem where missing values are predicted.
Each feature is imputed sequentially, one after the other, allowing prior imputed values to be
used as part of a model in predicting subsequent features. This approach may be generally
referred to as fully conditional specification (FCS) or multivariate imputation by chained
equations (MICE).
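
A sketch of the three imputation strategies described above, using scikit-learn's SimpleImputer, KNNImputer, and IterativeImputer on a tiny made-up array.

```python
# Statistical, kNN-based, and iterative (MICE-style) imputation.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [np.nan, 8.0]])

print(SimpleImputer(strategy="median").fit_transform(X))   # column statistic
print(KNNImputer(n_neighbors=2).fit_transform(X))          # NaN-aware Euclidean distance
print(IterativeImputer(random_state=0).fit_transform(X))   # each feature modeled from the others
```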

Feature Selection

Feature selection is the process of reducing the number of input variables when developing a
predictive model. It is desirable to reduce the number of input variables to both reduce the
computational cost of modeling and, in many cases, to improve the performance of the model.
Statistical-based feature selection methods involve evaluating the relationship between each
input variable and the target variable using statistics and selecting those input variables that
have the strongest relationship with the target variable.

Unsupervised Selection: Do not use the target variable (e.g. remove redundant variables).
Supervised Selection: Use the target variable (e.g. remove irrelevant variables).
● Intrinsic: Algorithms that perform automatic feature selection during training.
● Filter: Select subsets of features based on their relationship with the target.
● Wrapper: Search subsets of features that perform according to a predictive model.

Feature selection is also related to dimensionality reduction techniques in that both methods
seek fewer input variables to a predictive model. The difference is that feature selection selects
features to keep or remove from the dataset, whereas dimensionality reduction creates a
projection of the data resulting in entirely new input features. As such, dimensionality reduction
is an alternative to feature selection rather than a type of feature selection.

Choosing a filter-based selection statistic depends on the data types of the input and output
variables:
A. Numerical input, numerical output: a regression predictive modeling problem with numerical
input variables. The most common techniques are correlation coefficients, such as Pearson's for
a linear correlation, or rank-based methods for a nonlinear correlation.
B. Numerical input, categorical output: a classification predictive modeling problem with
numerical input variables. This might be the most common type of classification problem. Again,
the most common techniques are correlation-based, although in this case they must take the
categorical target into account.
C. Categorical input, numerical output: a regression predictive modeling problem with categorical
input variables. You can use the same methods as in case B (described above), but in reverse.
D. Categorical input, categorical output: a classification predictive modeling problem with
categorical input variables. The most common correlation measure for categorical data is the
chi-squared test. You can also use mutual information (information gain) from the field of
information theory.

Recursive Feature Elimination


RFE is a wrapper-type feature selection algorithm. This means that a different machine learning
algorithm is given and used in the core of the method, is wrapped by RFE, and is used to help
select features. RFE works by searching for a subset of features, starting with all features in
the training dataset and successively removing features until the desired number remains.
Features are scored either using the provided machine learning model (e.g. some algorithms
like decision trees offer importance scores) or by using a statistical method.
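
A minimal sketch of RFE wrapping a decision tree, on synthetic data.

```python
# Recursive feature elimination: keep the n best-ranked features.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

rfe = RFE(estimator=DecisionTreeClassifier(random_state=0),
          n_features_to_select=4)
rfe.fit(X, y)

print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # 1 = selected; larger numbers were eliminated earlier
```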

Feature Importance
Feature importance refers to a class of techniques for assigning scores to the input features of
a predictive model that indicate the relative importance of each feature when making a
prediction.
The scores can be used to better understand the data and the model, and to reduce the number of
input features.
Coefficients as Feature Importance: Linear machine learning algorithms fit a model where the
prediction is the weighted sum of the input values. Examples include linear regression, logistic
regression, and extensions that add regularization, such as ridge regression, LASSO, and the
elastic net.
Data transforms

Data transforms change the type or distribution of data variables and may be applied to input as
well as output variables.
Types of transforms:
● Discretization Transform: Encode a numeric variable as an ordinal variable
● Ordinal Transform: Encode a categorical variable into an integer variable
● One Hot Transform: Encode a categorical variable into binary variables
● Binary Hot Transform: A space-efficient adaptation of the One Hot Transform
● Normalization Transform: Scale a variable to the range 0 to 1
● Standardization Transform: Scale a variable to a standard Gaussian
● Power Transform: Change the distribution of a variable to be more Gaussian
● Quantile Transform: Impose a probability distribution such as uniform or Gaussian

How to Scale Numerical Data


Data Normalization: Normalization is a rescaling of the data from the original range so that all
values are within the new range of 0 and 1.
Data Standardization: Standardizing a dataset involves rescaling the distribution of values so
that the mean of observed values is 0 and the standard deviation is 1.
Robust Scaling (useful when the data contains outliers you want to keep): exclude the outliers
from the calculation of the statistics used for scaling, for example by using the median and
interquartile range instead of the mean and standard deviation, then use those values to scale
the variable.
How to Encode Categorical Data
Discretization: A numerical variable can be converted to an ordinal variable by dividing the
range of the numerical variable into bins and assigning values to each bin.
Ordinal Encoding: An integer ordinal encoding is a natural encoding for ordinal variables. For
categorical variables, it imposes an ordinal relationship where no such relationship may exist.
One Hot Encoding: The one-hot encoding creates one binary variable for each category.

Make Data More Gaussian


Many machine learning algorithms perform better when the distribution of variables is Gaussian.
Some algorithms like linear regression and logistic regression explicitly assume the real-valued
variables have a Gaussian distribution.

A power transform will make the probability distribution of a variable more Gaussian.
We can apply a power transform directly by calculating the log or square root of the variable,
although this may or may not be the best power transformer for a given variable.
Instead, we can use a generalized version of the transform that finds a parameter (lambda or λ)
that best transforms a variable to a Gaussian probability distribution.

Some transformation techniques: Box-Cox transform, Yeo-Johnson transform, quantile transforms
(normal quantile transform, uniform quantile transform).
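
A sketch of making a skewed variable more Gaussian with scikit-learn's PowerTransformer, which learns the lambda parameter mentioned above from the data (Yeo-Johnson by default; Box-Cox requires strictly positive values).

```python
# Power transform: learn lambda and reshape a right-skewed variable.
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=(1000, 1))   # right-skewed data

pt = PowerTransformer(method="yeo-johnson")
x_gauss = pt.fit_transform(x)

print(pt.lambdas_)                        # fitted lambda parameter
print(x_gauss.mean(), x_gauss.std())      # roughly 0 and 1 (standardize=True by default)
```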

How to Transform Numerical to Categorical Data


Some machine learning algorithms may prefer or require categorical or ordinal input variables,
such as some decision tree and rule-based algorithms.

Uniform: Each bin has the same width in the span of possible values for the variable.
Quantile: Each bin has the same number of values, split based on percentiles.
Clustered: Clusters are identified and examples are assigned to each group.
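
A sketch of the three binning strategies using scikit-learn's KBinsDiscretizer, whose strategy options map directly to uniform, quantile, and clustered bins.

```python
# Discretization: uniform-width, quantile, and k-means-clustered bins.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))

for strategy in ("uniform", "quantile", "kmeans"):
    disc = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    bins = disc.fit_transform(x)
    counts = np.bincount(bins.ravel().astype(int))
    print(strategy, counts)   # quantile bins hold roughly equal numbers of values
```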

Feature Engineering
Feature engineering is the process of creating new input variables from the available data in
order to expose clearer trends, either by adding broader context to a single observation or by
decomposing a complex variable.
Some common techniques:
● Adding a Boolean flag variable for some state.
● Adding a group or global summary statistic, such as a mean.
● Adding new variables for each component of a compound variable, such as a date-time.
● Polynomial Transform: Create copies of numerical input variables that are raised to a
power
● Feature Crossing

How to Derive New Input Variables


Polynomial Features: Polynomial features are those features created by raising existing
features to an exponent.
It is also common to add new variables that represent the interaction between features, e.g. a
new column that represents one variable multiplied by another.

Feature Crossing
A feature cross is a synthetic feature that encodes nonlinearity in the feature space by
multiplying two or more input features together.
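
A sketch of polynomial features and feature crosses with scikit-learn's PolynomialFeatures, on a tiny made-up matrix.

```python
# Polynomial features and interaction terms (a simple form of feature cross).
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])

# Degree-2 polynomial features: 1, x1, x2, x1^2, x1*x2, x2^2
poly = PolynomialFeatures(degree=2, include_bias=True)
print(poly.fit_transform(X))

# Interaction-only features: 1, x1, x2, x1*x2
cross = PolynomialFeatures(degree=2, interaction_only=True)
print(cross.fit_transform(X))
```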

Dimensionality Reduction
Dimensionality reduction techniques create a projection of the data into a lower-dimensional
space that still preserves the most important properties of the original data.

Techniques from linear algebra can be used for dimensionality reduction. Specifically,
matrix factorization methods can be used to reduce a dataset matrix into its constituent
parts.
Manifold Learning: Techniques from high-dimensionality statistics are used to create a
low-dimensional projection of high-dimensional data, often for the purposes of data
visualization.

Autoencoder Methods: Deep learning neural networks can be constructed to perform dimensionality
reduction.

Principal Component Analysis (PCA)


Principal Component Analysis (PCA) is a method for reducing the dimensionality of data.
In PCA, data with 𝑚-columns (features) is projected into a subspace with 𝑚 or fewer
columns, whilst retaining the essence of the original data.

Many statistical models suffer from high correlation between covariates; PCA can be used to
produce linear combinations of the covariates that are uncorrelated with each other.
In PCA, we simplify a dataset with many variables by turning the original variables into a
smaller number of "Principal Components".
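
A minimal scikit-learn sketch of PCA, projecting the four-column iris data onto two principal components (standardizing first, since PCA is sensitive to scale).

```python
# PCA: project m-column data onto fewer, uncorrelated components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print(X_2d.shape)                       # (150, 2)
print(pca.explained_variance_ratio_)    # share of variance kept per component
```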

Lecture 8 - Clustering with kmeans


Clustering is an unsupervised machine learning process that automatically divides data
into clusters or groupings of similar items. It does this without having been told what the
groups should look like ahead of time.
The resulting clusters can then be used for actions such as customer segmentation, detecting
anomalous behavior, or simplifying extremely large datasets by grouping a large number of features
with similar values into a much smaller number of homogeneous categories.

The k-means algorithm for clustering


Strengths:
● Uses simple principles for identifying clusters which can be explained in non-statistical terms
● Highly flexible and can be adapted to address nearly all of its shortcomings with
simple adjustments
● Fairly efficient and performs well at dividing the data into useful clusters.
Weaknesses:
● Less sophisticated than more recent clustering algorithms
● Because it uses an element of random chance, it is not guaranteed to find the
optimal set of clusters
● Requires a reasonable guess as to how many clusters naturally exist in the data
Due to the heuristic nature of k-means, you may end up with somewhat different final
results by making only slight changes to the starting conditions. If the results vary
dramatically, this could indicate a problem.

Using distance to assign and update clusters: Traditionally, k-means uses Euclidean distance, but
Manhattan distance or Minkowski distance is also sometimes used.

Choosing the appropriate number of clusters

The algorithm can be sensitive to randomly chosen cluster centers. Choosing the
number of clusters requires a delicate balance: Setting k to be very large will improve the
homogeneity of the clusters, but risks overfitting the data.

Sometimes the number of clusters is dictated by business requirements or the motivation for the
analysis. Without any a priori knowledge at all, one rule of thumb suggests setting k equal to
the square root of n/2, where n is the number of examples in the dataset.
The elbow method attempts to gauge how the homogeneity or heterogeneity within the
clusters changes for various values of k. There are numerous statistics to measure
homogeneity and heterogeneity within clusters that can be used with the elbow method.
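
A sketch of k-means and the elbow method using the within-cluster sum of squares (inertia_) as the homogeneity statistic, on synthetic data with four true clusters.

```python
# Elbow method: inertia drops sharply until k reaches the true number of
# clusters, then flattens out.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(X)
    print(k, round(km.inertia_, 1))
```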
Lecture 9 - Regression Methods
Regression is concerned with specifying the relationship between a single numeric dependent
variable (the value to be predicted) and one or more numeric independent variables (the
predictors).

Regression is commonly used for modeling complex relationships among data elements, estimating
the impact of a treatment on an outcome, and extrapolating into the future.
Examples of regression models: simple linear regression, multiple regression, logistic regression,
Poisson regression.

Simple linear regression

Simple linear regression defines the relationship between a dependent variable y and a single
independent predictor variable x using a line:

y = α + βx

The ordinary least squares (OLS) estimation method allows us to determine the optimal estimates
of α and β. It can be shown using calculus that the value of b (the estimate of β) that results
in the minimum squared error is:

b = Cov(x, y) / Var(x)

while the optimal value of a (the estimate of α) is:

a = ȳ − b·x̄

The denominator for b is the variance of x; the numerator is the covariance of x and y.
Correlations

The correlation between two variables indicates how closely their relationship follows a straight
line. The correlation ranges between -1 and +1; the extreme values indicate a perfectly linear
relationship, while a correlation close to zero indicates the absence of a linear relationship.
Pearson's correlation is defined as:

ρ(x, y) = Cov(x, y) / (σx · σy)

Multiple linear regression


Most real-world analyses have more than one independent variable. Therefore, it is likely that
you will be using multiple linear regression most of the time you use regression for a numeric
prediction task.
Strengths:
● By far the most common approach for modeling numeric data
● Can be adapted to model almost any data
● Provides estimates of the strength and size of the relationships among features and the
outcome
Weaknesses:
● Makes strong assumptions about the data
● The model's form must be specified by the user in advance
● Does not do well with missing data
● Only works with numeric features, so categorical data require extra processing
● Requires some knowledge of statistics to understand the model

Multiple regression is an extension of simple linear regression: the goal is to find the values
of the beta coefficients that minimize the prediction error of the linear equation.

Minimizing error:
How do we solve for the vector β that minimizes the sum of the squared errors between the
predicted and actual y values?
It has been shown in the literature that the best estimate of the vector β can be computed as:

β̂ = (XᵀX)⁻¹ Xᵀy