
Unit 1

Math, Probability, and Statistical Modelling
Exploring Probability and Inferential Statistics

• Making predictions and searching for structure in data are among the most
important parts of data science.
• Probability and statistics are important because they support a wide range of
analytical tasks.
• Probability and statistics are involved in many of the predictive algorithms used
in machine learning, and they help in deciding how reliable the data is.
• Probability is one of the most fundamental concepts in statistics.
• A statistic is a result that’s derived from performing a mathematical
operation on numerical data.
• Probability is all about chance, whereas statistics is more about how we
handle and analyse data using different techniques.
Statistics Basics:-
• Statistics is the study of the collection, analysis, interpretation, presentation,
and organization of data.
• It is a method of collecting and summarising data, with applications ranging
from small-scale studies to large-scale ones.
• Whether it is the study of the population of the country or its economy, stats
are used for all such data analysis.
• Statistics has a huge scope in many fields such as sociology, psychology,
geology, weather forecasting, etc.
• The data collected here for analysis could be quantitative or qualitative.
Quantitative data is of two types: discrete and continuous.
Discrete data takes fixed, countable values, whereas continuous data is not fixed
but can take any value within a range.
Exploring Descriptive and Inferential Statistics

• In general, you use statistics in decision making. Statistics come in two flavours:
• Descriptive: Descriptive statistics provide a description that illuminates some
characteristic of a numerical dataset, including the dataset's distribution, central
tendency (such as the mean or median), and dispersion (as in the range, standard
deviation, and variance).
• Inferential: Rather than focus on pertinent descriptions of a dataset, inferential
statistics carve out a smaller section of the dataset and attempt to deduce
significant information about the larger dataset.
• Use this type of statistics to get information about a real-world measure in which
you’re interested.
Descriptive Statistics
• Descriptive statistics describe the characteristics of a numerical dataset, but that
doesn’t tell you why you should care.
• Most data scientists are interested in descriptive statistics only because of what
they reveal about the real-world measures they describe.
• For example, a descriptive statistic is often associated with a degree of accuracy,
indicating the statistic’s value as an estimate of the real-world measure.
• You can use descriptive statistics in many ways — to detect outliers, for example,
or to plan for feature pre-processing requirements or to quickly identify what
features you may want, or not want, to use in an analysis.
Statistic            Class value
Mean                 79.18
Range                66.21 – 96.53
Proportion >= 70     86.7%
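As a quick illustration of how descriptive statistics like these might be computed, here is a small Python sketch using NumPy; the scores array is hypothetical example data, not the actual class data behind the table above.

```python
import numpy as np

# Hypothetical exam scores for a class (illustrative only)
scores = np.array([79.5, 66.21, 84.3, 96.53, 71.8, 88.0, 92.1, 74.6, 81.2, 77.9])

mean = scores.mean()                    # central tendency
low, high = scores.min(), scores.max()  # dispersion: min and max of the range
proportion_70 = (scores >= 70).mean()   # share of scores at or above 70

print(f"Mean: {mean:.2f}")
print(f"Range: {low:.2f} - {high:.2f}")
print(f"Proportion >= 70: {proportion_70:.1%}")
```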
Inferential Statistics
• Inferential statistics are used to reveal something about a real-world measure.
• Inferential statistics do this by providing information about a small data selection,
so you can use this information to infer something about the larger dataset from
which it was taken.
• In statistics, this smaller data selection is known as a sample, and the larger,
complete dataset from which the sample is taken is called the population.
• If your dataset is too big to analyse in its entirety, pull a smaller sample of this
dataset, analyse it, and then make inferences about the entire dataset based on
what you learn from analysing the sample.
• You can also use inferential statistics in situations where you simply can’t afford
to collect data for the entire population.
• In this case, you’d use the data you do have to make inferences about the
population at large.
• At other times, you may find yourself in situations where complete information
for the population is not available. In these cases, you can use inferential statistics
to estimate values for the missing data based on what you learn from analysing the
data that is available
• For an inference to be valid, you must select your sample carefully so that you get
a true representation of the population.
• Even if your sample is representative, the numbers in the sample dataset will
always exhibit some noise — random variation, in other words — that guarantees
the sample statistic is not exactly identical to its corresponding population statistic.
Probability basics:-
• Probability denotes the possibility of the outcome of any random event.
• The meaning of this term is to check the extent to which any event is likely
to happen.
• For example, when we flip a coin in the air, what is the possibility of getting
a head? The answer to this question is based on the number of possible
outcomes. Here the possibility is either head or tail will be the outcome. So,
the probability of a head to come as a result is 1/2.
• The probability is the measure of the likelihood of an event to happen. It
measures the certainty of the event. The formula for probability is given by;
• P(E) = Number of Favourable Outcomes/Number of total outcomes
• P(E) = n(E)/n(S)
• Probability denotes the possibility of something happening.
• It is a mathematical concept that predicts how likely events are to
occur.
• The probability values are expressed between 0 and 1.
• The definition of probability is the degree to which something is likely
to occur.
• This fundamental theory of probability is also applied to probability
distributions.
Axioms of probability
• An axiom is a rule or principle that most people believe to be true. It is the
premise on the basis of which we do further reasoning.
• There are three axioms of probability that make the foundation of
probability theory-
• Axiom 1: Probability of Event
• The first axiom is that the probability of an event is always between 0 and 1: a
probability of 1 indicates that the event is certain to occur, and 0 indicates that
the event cannot occur.

• Axiom 2: Probability of Sample Space


• For sample space, the probability of the entire sample space is 1.
• Axiom 3: Mutually Exclusive Events
• And the third one is: the probability of the union of two mutually exclusive
(disjoint) events is the sum of their individual probabilities.
Probability distributions
• In Statistics, the probability distribution gives the possibility of each
outcome of a random experiment or event. It provides the probabilities
of different possible occurrences.
• A probability distribution is a mathematical function that describes
the probability of different possible values of a variable. Probability
distributions are often depicted using graphs or probability tables.
• For example, a probability distribution table could tell us the probability that a
certain soccer team scores a certain number of goals in a given game.
Probability distributions
• When the roulette wheel spins off, you intuitively understand that
there is an equal chance that the ball will fall into any of the slots of
the cylinder on the wheel.
• The slot where the ball will land is totally random, and the probability,
or likelihood, of the ball landing in any one slot over another is the
same.
• Because the ball can land in any slot, with equal probability, there is an
equal probability distribution, or a uniform probability distribution —
the ball has an equal probability of landing in any of the slots in the
cylinder.
• But the slots of the roulette wheel are not all the same — the wheel has 18 black
slots and 20 slots that are either red or green. Because of this arrangement, there is
18/38 probability that your ball will land on a black slot.
• You plan to make successive bets that the ball will land on a black slot.
Random variable:-
• A random variable is a variable whose value is unknown or a function
that assigns values to each of an experiment's outcomes.
• A random variable is a numerical description of the outcome of a
statistical experiment.
• In probability and statistics, random variables are used to quantify
outcomes of a random occurrence, and therefore, can take on many
values.
• Random variables are required to be measurable and are typically real
numbers.
• For example, the letter X may be designated to represent the sum of
the resulting numbers after three dice are rolled.
• In this case, X could be 3 (1 + 1+ 1), 18 (6 + 6 + 6), or somewhere
between 3 and 18, since the highest number of a die is 6 and the lowest
number is 1.
• Random variables are often designated by letters and can be classified
as discrete, which are variables that have specific values, or
continuous, which are variables that can have any values within a
continuous range.
• A random variable has a probability distribution that represents the
likelihood that any of the possible values would occur.
• Let’s say that the random variable, Z, is the number on the top face of
a die when it is rolled once.
• The possible values for Z will thus be 1, 2, 3, 4, 5, and 6. The
probability of each of these values is 1/6 as they are all equally likely
to be the value of Z.
• For instance, the probability of getting a 3, or P(Z = 3), when a die is
thrown is 1/6, and so is the probability of getting a 4 or a 2 or any of the
other faces of the die. Note that the sum of all the probabilities is 1.
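The distribution of a random variable such as Z (a single die roll) or X (the sum of three dice) can be enumerated directly. Below is a minimal Python sketch, assuming fair six-sided dice.

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# Z: a single die roll -- each face has probability 1/6
Z_dist = {z: Fraction(1, 6) for z in range(1, 7)}
print(Z_dist[3])               # P(Z = 3) = 1/6

# X: the sum of three dice -- enumerate all 6**3 equally likely outcomes
counts = Counter(sum(roll) for roll in product(range(1, 7), repeat=3))
X_dist = {x: Fraction(c, 6**3) for x, c in counts.items()}
print(X_dist[3], X_dist[18])   # P(X = 3) = P(X = 18) = 1/216
print(sum(X_dist.values()))    # all probabilities sum to 1
```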
• Discrete Random Variables
• Discrete random variables take on a countable number of distinct values. Consider
an experiment where a coin is tossed three times.
• If X represents the number of times that the coin comes up heads, then X is a
discrete random variable that can only have the values 0, 1, 2, 3 (from no heads in
three successive coin tosses to all heads). No other value is possible for X.

• Continuous Random Variables


• Continuous random variables can represent any value within a specified range or
interval and can take on an infinite number of possible values.
• An example of a continuous random variable would be an experiment that
involves measuring the amount of rainfall in a city over a year or the average
height of a random group of 25 people.
• To understand discrete and continuous distribution, think of two
variables from a dataset describing cars.
• A “color” variable would have a discrete distribution because cars
have only a limited range of colours (black, red, or blue, for example).
The observations would be countable per the color grouping.
• A variable describing cars’ miles per gallon, or “mpg,” would have a
continuous distribution because each car could have its own separate
value for “mpg.”
Types of Probability Distributions

• Two major kind of distributions based on the type of likely values for
the variables are,
1.Discrete Distributions
2.Continuous Distributions
Discrete Distribution Vs Continuous Distribution

Discrete Distributions:
• Have a finite number of different possible outcomes
• We can add up individual values to find out the probability of an interval
• Can be expressed with a graph, piece-wise function, or table
• The graph consists of bars lined up one after the other

Continuous Distributions:
• Have infinitely many consecutive possible values
• We cannot add up individual values to find out the probability of an interval because there are too many of them
• Can be expressed with a continuous function or graph
• The graph consists of a smooth curve
DISCRETE DISTRIBUTIONS:

• Discrete distributions have a finite number of different possible outcomes.
• Characteristics of Discrete Distributions:
• We can add up individual values to find out the probability of an
interval
• Discrete distributions can be expressed with a graph, piece-wise
function or table
• In discrete distributions, the graph consists of bars lined up one after the
other
• Expected values might not be achievable
• In a graph, a discrete distribution looks like a series of separated bars.

Examples of Discrete Distributions:


1.Bernoulli Distribution
2.Binomial Distribution
3.Uniform Distribution
4.Poisson Distribution
• Bernoulli Distribution
• In a Bernoulli distribution there is only one trial and only two possible
outcomes, i.e. success or failure. It is denoted by Y ~ Bern(p).
• Characteristics of Bernoulli distributions
• It consists of a single trial
• Two possible outcomes
• E(Y) = p
• Examples and Uses:
• Guessing a single True/False question.
• It is mostly used when trying to find out what we expect to obtain from a single
trial of an experiment.
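A brief SciPy sketch of a Bernoulli distribution; the success probability p = 0.6 is just an assumed example value.

```python
from scipy.stats import bernoulli

p = 0.6                       # assumed probability of success
rv = bernoulli(p)

print(rv.pmf(1), rv.pmf(0))   # P(success) = 0.6, P(failure) = 0.4
print(rv.mean())              # E(Y) = p
print(rv.rvs(size=10))        # simulate ten single trials (0s and 1s)
```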
Binomial Distribution

• A sequence of identical Bernoulli events is called Binomial and follows a
Binomial distribution. It is denoted by Y ~ B(n, p).
• Characteristics of Binomial distribution
• Over the n trials, it measures the frequency of occurrence of one of the
possible results.
• E(Y) = n × p
• P(Y = y) = C(n, y) × p^y × (1 – p)^(n–y)
• Examples and Uses:
• Simply determine how many times we obtain a head if we flip a coin 10
times.
• It is mostly used when we try to predict how likely an event is to occur over
a series of trials.
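For the coin-flip example above (how many heads in 10 flips of a fair coin), a short SciPy sketch:

```python
from scipy.stats import binom

n, p = 10, 0.5        # 10 coin flips, fair coin
rv = binom(n, p)

print(rv.pmf(5))      # P(exactly 5 heads) = C(10, 5) * 0.5**5 * 0.5**5, about 0.246
print(rv.mean())      # E(Y) = n * p = 5
print(rv.cdf(3))      # P(3 or fewer heads)
```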
CONTINUOUS DISTRIBUTIONS:

• Continuous distributions have infinitely many consecutive possible
values.
• Characteristics of Continuous Distributions
• We cannot add up individual values to find out the probability of an
interval because there are many of them
• Continuous distributions can be expressed with a continuous function
or graph
• In continuous distributions, graph consists of a smooth curve
• To calculate the chance of an interval, we require integrals
Examples of Continuous Distributions
1.Normal Distribution
2.Chi-Squared Distribution
3.Exponential Distribution
4.Logistic Distribution
5.Students’ T Distribution
Normal Distribution

• It shows a distribution that most natural events follow. It is denoted by
Y ~ N(µ, σ²). The main characteristics of normal distribution are:
• The graph obtained from a normal distribution is a bell-shaped curve, symmetric
and with thin tails.
• 68% of all its values should fall in the interval (µ – σ, µ + σ)
• E(Y) = µ
• Var(Y) = σ²
• Examples and Uses
• Normal distributions are mostly observed in the size of animals in the
desert.
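A short SciPy sketch that checks the 68% rule; the mean and standard deviation here are assumed example values, not measurements.

```python
from scipy.stats import norm

mu, sigma = 170, 10   # assumed example: mean 170, standard deviation 10
rv = norm(mu, sigma)

# Probability mass within one standard deviation of the mean (about 0.68)
print(rv.cdf(mu + sigma) - rv.cdf(mu - sigma))
print(rv.mean(), rv.var())   # E(Y) = mu, Var(Y) = sigma**2
```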
Categorical (non-numeric) distribution
Represents either non-numeric categorical variables or ordinal variables (a special case of numeric variable that
can be grouped and ranked like a categorical variable).
Conditional probability with Naïve Bayes

• You can use the Naïve Bayes machine learning method, which was borrowed straight from the
statistics field, to predict the likelihood that an event will occur, given evidence defined in your
data features — something called conditional probability.
• Naïve Bayes is primarily a classification method, and it is especially useful if you need to
classify text data.
• This model is easy to build and is mostly used for large datasets. It is a probabilistic machine
learning model that is used for classification problems.
• The core of the classifier depends on the Bayes theorem with an assumption of independence
among predictors. That means changing the value of a feature doesn’t change the value of another
feature.
• Why is it called Naive?
• It is called Naive because of the assumption that 2 variables are independent when they may not
be. In a real-world scenario, there is hardly any situation where the features are independent.
• Conditional probability is defined as the likelihood of an event or outcome occurring, based on the occurrence
of a previous event or outcome. Conditional probability is calculated by multiplying the probability of the
preceding event by the updated probability of the succeeding, or conditional, event.
• A conditional probability would look at such events in relationship with one another.
• Conditional probability is thus the likelihood of an event or outcome occurring based on the occurrence of
some other event or prior outcome.
• Two events are said to be independent if one event occurring does not affect the probability that the other
event will occur.
• However, if one event occurring or not does, in fact, affect the probability that the other event will occur, the
two events are said to be dependent. If events are independent, then the probability of some event B is not
contingent on what happens with event A.
• A conditional probability, therefore, relates to those events that are dependent on one another.
• Conditional probability is often portrayed as the "probability of A given B," notated as P(A|B).
• Four candidates A, B, C, and D are running for a political office. Each
has an equal chance of winning: 25%. However, if candidate A drops
out of the race due to ill health, the probability will change: P(Win |
One candidate drops out) = 33.33%.
The formula for conditional probability is:
P(B|A) = P(A and B) / P(A)
which you can also rewrite as:
P(B|A) = P(A∩B) / P(A)
Example:-
• In a group of 100 sports car buyers, 40 bought alarm systems, 30
purchased bucket seats, and 20 purchased an alarm system and bucket
seats. If a car buyer chosen at random bought an alarm system, what is
the probability they also bought bucket seats?
• Step 1: Figure out P(A). It’s given in the question as 40%, or 0.4.
• Step 2: Figure out P(A∩B). This is the intersection of A and B: both
happening together. It’s given in the question 20 out of 100 buyers, or
0.2.
• Step 3: Insert your answers into the formula:
P(B|A) = P(A∩B) / P(A) = 0.2 / 0.4 = 0.5
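The same calculation can be reproduced in a few lines of Python, working directly from the counts given in the example:

```python
# Counts from the example: 100 buyers, 40 bought alarm systems,
# 20 bought both an alarm system and bucket seats
total = 100
alarm = 40              # event A
alarm_and_seats = 20    # event A and B together

p_a = alarm / total
p_a_and_b = alarm_and_seats / total

p_b_given_a = p_a_and_b / p_a   # P(B|A) = P(A and B) / P(A)
print(p_b_given_a)              # 0.5
```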
Bayes’ Theorem (Example)
• Mathematically, Bayes’ theorem can be stated as:

P(A|B) = [P(B|A) × P(A)] / P(B)

Basically, we are trying to find the probability of event A, given that event B is true.
Here P(A) is called the prior probability, which means it is the probability of the
event before the evidence is seen.
P(A|B) is called the posterior probability, i.e. the probability of the event after the
evidence is seen.
• With regards to our dataset, this formula can be re-written as
P(Y|X) = [P(X|Y) × P(Y)] / P(X), where:
• Y: class of the variable
• X: dependent feature vector (of size n)
What is Naive Bayes?

• Bayes’ rule provides us with the formula for the probability of Y given some
feature X. In real-world problems, we hardly find any case where there is
only one feature.
• When the features are independent, we can extend Bayes’ rule to what is
called Naive Bayes. It assumes that the features are independent, meaning that
changing the value of one feature doesn’t influence the values of the other
variables, and this is why we call this algorithm “naive”.
• Naive Bayes can be used for various things like face recognition, weather
prediction, Medical Diagnosis, News classification, Sentiment Analysis, and
a lot more.
• When there are multiple X variables, we simplify the expression by assuming that the X’s
are independent, so for n values of X the Naive Bayes formula becomes:

P(y | x1, x2, …, xn) ∝ P(y) × P(x1 | y) × P(x2 | y) × … × P(xn | y)
Naive Bayes Example
• Let’s take a dataset to predict whether we can pet an animal or not.
Assumptions of Naive Bayes

• All the variables are independent. That is, if the animal is a Dog, that
doesn’t mean that the Size will be Medium.
• All the predictors have an equal effect on the outcome. That is, the
animal being a dog does not have more importance in deciding if we can
pet it or not. All the features have equal importance.
• We should try to apply the Naive Bayes formula to the above dataset;
however, before that, we need to do some precomputations on our
dataset.
• We also need the probabilities (P(y)), which are calculated in the table
below. For example, P(Pet Animal = NO) = 6/14.
• Now if we send our test data, suppose test = (Cow, Medium, Black)
Probability of petting an animal :

And the probability of not petting an animal:


• We know that P(Yes|Test) + P(No|Test) = 1, so we will normalize the result:

We see here that P(Yes|Test) > P(No|Test), so the prediction that we can pet this animal
is “Yes”.
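Because the dataset table itself is not reproduced here, the sketch below uses a small hypothetical dataset with the same kind of categorical features (Animal, Size, Color) to show how a categorical Naïve Bayes classifier could be fit with scikit-learn; the rows and the resulting prediction are illustrative only, not the worked example above.

```python
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data: (Animal, Size, Color) -> can we pet it?
X_raw = [
    ["Dog", "Medium", "Black"],
    ["Dog", "Small",  "White"],
    ["Cat", "Small",  "Black"],
    ["Cow", "Big",    "Brown"],
    ["Cow", "Medium", "Black"],
    ["Cat", "Medium", "White"],
]
y = ["Yes", "Yes", "Yes", "No", "No", "Yes"]

# Encode the categorical strings as integer codes, as CategoricalNB expects
encoder = OrdinalEncoder()
X = encoder.fit_transform(X_raw)

model = CategoricalNB()
model.fit(X, y)

test = encoder.transform([["Cow", "Medium", "Black"]])
print(model.predict(test))          # predicted class for the test row
print(model.predict_proba(test))    # normalized P(No|Test) and P(Yes|Test)
```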
Types of Naïve Bayes
• Naïve Bayes comes in these three popular flavors:
• »»MultinomialNB: Use this version if your variables (categorical or continuous) describe discrete
frequency counts, like word counts.
• This version of Naïve Bayes assumes a multinomial distribution, as is often the case with text data.
• It does not accept negative values.
• »»BernoulliNB: If your features are binary, you use Bernoulli Naïve Bayes to make
predictions.
• This version works for classifying text data, but isn’t generally known to perform as well as
MultinomialNB.
• If you want to use BernoulliNB to make predictions from continuous variables, that will work, but
you first need to sub-divide them into discrete interval groupings (also known as binning).
• »»GaussianNB: Use this version if all predictive features are normally distributed. It’s not a good
option for classifying text data, but it can be a good choice if your data contains both positive and
negative values (and if your features have a normal distribution, of course).
Quantifying Correlation
• Many statistical and machine learning methods assume that your features are independent.
• To test whether they’re independent, though, you need to evaluate their correlation — the extent
to which variables demonstrate interdependency.
• We will have brief introduction to Pearson correlation and Spearman’s rank correlation.
• Correlation is used to test relationships between quantitative variables or categorical variables. In
other words, it's a measure of how things are related. The study of how variables are correlated is
called correlation analysis.
• Some examples of data that have a high correlation: Your caloric intake and your weight.
• Correlation means finding out the association between two variables, and correlation
coefficients are used to find out how strong the relationship between the two variables is. The most
popular correlation coefficient is Pearson’s Correlation Coefficient. It is very commonly used in
linear regression.
• Correlation is quantified per the value of a variable called r, which
ranges between –1 and 1.
• The closer the r-value is to 1 or –1, the more correlation there is
between two variables.
• If two variables have an r-value that’s close to 0, it could indicate that
they’re independent variables.
Calculating correlation with Pearson’s r
• If you want to uncover dependent relationships between continuous variables in
a dataset, you’d use statistics to estimate their correlation.
• The simplest form of correlation analysis is the Pearson correlation, which
assumes that
• Your data is normally distributed.
• You have continuous, numeric variables.
• Your variables are linearly related.
• Because the Pearson correlation has so many conditions, only use it to determine
whether a relationship between two variables exists, but not to rule out possible
relationships.
• If you were to get an r-value that is close to 0, it indicates that there is no linear
relationship between the variables, but that a nonlinear relationship between them
still could exist.
• Consider the example of car price detection where we have to detect
the price considering all the variables that affect the price of the car
such as carlength, curbweight, carheight, carwidth, fueltype, carbody,
horsepower, etc.
• We can see in the scatterplot, as the carlength, curbweight, carwidth
increases price of the car also increases.
• So, we can say that there is a positive correlation between the above
three variables with car price.
• Here, we also see that there is no correlation between the carheight
and car price.
• To find the Pearson coefficient, also referred to as the Pearson correlation
coefficient or the Pearson product-moment correlation coefficient, the two
variables are placed on a scatter plot. The variables are denoted as X and Y.
• There must be some linearity for the coefficient to be calculated; a scatter
plot not depicting any resemblance to a linear relationship will be useless.
• The closer the resemblance to a straight line of the scatter plot, the higher
the strength of association.
• Numerically, the Pearson coefficient is represented the same way as a
correlation coefficient that is used in linear regression, ranging from -1 to
+1.
Formula:-
r = [ n Σxy – (Σx)(Σy) ] / √( [ n Σx² – (Σx)² ] × [ n Σy² – (Σy)² ] )
Find the value of the correlation
coefficient from the following table:

Subject    Age x    Glucose Level y
1          43       99
2          21       65
3          25       79
4          42       75
5          57       87
6          59       81
Subject    Age x    Glucose Level y    xy       x²       y²
1          43       99                 4257     1849     9801
2          21       65                 1365     441      4225
3          25       79                 1975     625      6241
4          42       75                 3150     1764     5625
5          57       87                 4959     3249     7569
6          59       81                 4779     3481     6561
Σ          247      486                20485    11409    40022

r = [6(20485) – (247)(486)] / √( [6(11409) – 247²] × [6(40022) – 486²] )
  = 2868 / 5413.27 = 0.529809
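The same coefficient can be computed (or checked) with SciPy, using the age and glucose values from the table:

```python
from scipy.stats import pearsonr

age = [43, 21, 25, 42, 57, 59]
glucose = [99, 65, 79, 75, 87, 81]

r, p_value = pearsonr(age, glucose)
print(round(r, 4))   # about 0.5298, matching the hand calculation
```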
Ranking variable-pairs using Spearman’s rank
correlation
• The Spearman’s rank correlation is a popular test for determining correlation
between ordinal variables.
• By applying Spearman’s rank correlation, you’re converting numeric variable-
pairs into ranks by calculating the strength of the relationship between variables
and then ranking them per their correlation.
• The Spearman’s rank correlation assumes that
• Your variables are ordinal.
• Your variables are related non-linearly.
• Your data is non-normally distributed.
Formula:-
ρ = 1 – (6 Σd²) / (n(n² – 1))
• The scores for nine students in physics and math are as follows:
• Physics: 35, 23, 47, 17, 10, 43, 9, 6, 28
• Mathematics: 30, 33, 45, 23, 8, 49, 12, 4, 31
• Compute the student’s ranks in the two subjects and compute the
Spearman rank correlation.
Add a third column, d, to your data. The d is the difference
between ranks.
• Sum (add up) all of your d-squared values.
4 + 4 + 1 + 0 + 1 + 1 + 1 + 0 + 0 = 12. You’ll need this for the formula
• = 1 – (6*12)/(9(81-1))
= 1 – 72/720
= 1-0.1
= 0.9
The Spearman Rank Correlation for this set of data is 0.9.
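SciPy can rank the scores and compute the same coefficient directly:

```python
from scipy.stats import spearmanr

physics = [35, 23, 47, 17, 10, 43, 9, 6, 28]
maths = [30, 33, 45, 23, 8, 49, 12, 4, 31]

rho, p_value = spearmanr(physics, maths)
print(rho)   # 0.9, matching the hand calculation
```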
Reducing Data Dimensionality with Linear Algebra

• Any intermediate-level data scientist should have a pretty good
understanding of linear algebra and how to do math using matrices.
• Array and matrix objects are the primary data structure in analytical
computing.
• You need them in order to perform mathematical and statistical
operations on large and multidimensional datasets — datasets with
many different features to be tracked simultaneously.
• When you have a dataset with a large number of features, it is a very challenging
task to work with.
• Having a high number of variables is both a boon and a curse.
• It’s great that we have loads of data for analysis, but it is challenging due to
its size.
• It’s not feasible to analyse each and every variable at a microscopic level.
• It might take us days or months to perform any meaningful analysis, and
we’ll lose a ton of time and money for our business.
• Also, the amount of computational power this will take is higher.
• We need a better way to deal with high dimensional data so that we can
quickly extract patterns and insights from it.
• So, using dimensionality reduction techniques, we can reduce the
number of features in a dataset without losing much information,
and keep (or improve) the model’s performance.
• It’s a really powerful way to deal with huge datasets.
What is Dimension?

• Let’s first define what a dimension is. Given a matrix A, the dimension
of the matrix is the number of rows by the number of columns. If A
has 3 rows and 5 columns, A would be a 3x5 matrix.
• Now, in the simplest terms, dimensionality reduction is exactly
what it sounds like: you’re reducing the dimension of a matrix to
something smaller than it currently is.
• Given a square (n by n) matrix A, the goal would be to reduce the
dimension of this matrix to be smaller than n x n.
• Current Dimension of A : n
Reduced Dimension of A : n - x, where x is some positive integer
• the most common application would be for data visualization
purposes. It’s quite difficult to visualize something graphically which
is in a dimension space greater than 3.
• Through dimensionality reduction, you’ll be able to transform your
dataset of 1000s of rows and columns into one small enough to
visualize in 3 / 2 / 1 dimensions.
What is dimensionality Reduction?
• As data generation and collection keeps increasing, visualizing it and
drawing inferences becomes more and more challenging.
• One of the most common ways of doing visualization is through
charts.
• Suppose we have 2 variables, Age and Height. We can use a scatter or
line plot between Age and Height and visualize their relationship
easily:
• Now consider a case in which we have, say 100 variables (p=100).
• In this case, we can have 100(100-1)/2 = 4950 different plots.
• It does not make much sense to visualize each of them separately.
• In such cases where we have a large number of variables, it is better to
select a subset of these variables (p<<100) which captures as much
information as the original set of variables.
• We can reduce the p dimensions of the data into a subset of k dimensions
(k << p). This is called dimensionality reduction.
Benefits of Dimensionality Reduction

• It helps to remove redundancy in the features and noise, which
ultimately enhances visualization of the given dataset.
• It leads to better memory management, because less data has to be stored.
• It improves the performance of the model by choosing the right features
and removing the unnecessary ones from the dataset.
• Fewer dimensions require less computation, so the model trains faster,
often with improved model accuracy.
• It considerably reduces the complexity and overfitting of the overall model
and improves its performance.
• Dimensionality reduction can be achieved by simply dropping columns.
• for example, those that may show up as collinear with others or identified as
not being particularly predictive of the target as determined by an attribute
importance ranking technique.
• But it can also be achieved by deriving new columns based on linear
combinations of the original columns.
• In both cases, the resulting transformed data set can be provided to machine
learning algorithms to yield faster model build times, faster scoring times,
and more accurate models.

• While SVD can be used for dimensionality reduction, it is often used in
digital signal processing for noise reduction, image compression, and other
areas.
Eigenvector and Eigenvalue:-
• Eigenvectors and eigenvalues have many important applications in computer
vision and machine learning in general.
• Well known examples are PCA (Principal Component Analysis) for
dimensionality reduction or EigenFaces for face recognition.
• An eigenvector is a vector whose direction remains unchanged when a linear
transformation is applied to it.
• To conceptualize an eigenvector, think of a matrix called A. Now consider a
nonzero vector called x and that Ax = λx for a scalar λ.
• In this scenario, scalar λ is what’s called an eigenvalue of matrix A.
• The eigenvalue λ is permitted to take on a value of 0.
• Furthermore, x is the eigenvector that corresponds to λ, and it is not
permitted to be the zero vector.
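A minimal NumPy sketch of the relationship Ax = λx, using an arbitrary 2 x 2 example matrix:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)

# Each column of `eigenvectors` is an eigenvector x with A @ x = lambda * x
for lam, x in zip(eigenvalues, eigenvectors.T):
    print(lam, np.allclose(A @ x, lam * x))   # True for every pair
```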
SVD(Singular Value Decomposition)
• The SVD linear algebra method decomposes the data matrix into the three
resultant matrices shown in Figure .
• The product of these matrices, when multiplied together, gives you back your
original matrix.
• SVD is handy when you want to remove redundant information by compressing
your dataset.
.

The SVD of an m x n matrix A is given by the formula:

A = u · S · vᵀ

where:
• »»A: This is the matrix that holds all your original data.
• »»u: This is the matrix of left-singular vectors of A (the eigenvectors of A·Aᵀ), and it holds all the
important, non-redundant information about your data’s observations.
• »»v: This is the matrix of right-singular vectors of A (the eigenvectors of Aᵀ·A). It holds all the important,
non-redundant information about the columns in your dataset’s features.
• »»S: This is the diagonal matrix of singular values (the square roots of the eigenvalues of Aᵀ·A). It contains all
the information about the procedures performed during the compression.
• A: Input data matrix — m x n matrix (e.g. m documents, n terms)
• U: Left Singular Vectors — m x r matrix (m documents, r concepts)
• Σ: Singular Values — r x r diagonal matrix (strength of each ‘concept’),
where r is the rank of matrix A
• V: Right Singular Vectors — n x r matrix (n terms, r concepts)
The theorem states the following:
1. U, Σ, V: unique
2. U, V: Both matrices are orthonormal in
nature. Orthonormal matrices are those whose
columns have a Euclidean length of 1; in other terms,
the sum of squared values in each column of U and
V equals 1. Also, the columns are
orthogonal: in simple terms, the dot product of any
two distinct columns of U (or of V) is
zero.
3. Σ: Diagonal — all the entries (singular
values) are positive and are sorted in decreasing
order (σ1 ≥ σ2 ≥ … ≥ 0)
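A small NumPy sketch that decomposes an arbitrary example matrix and verifies that the product of the three factors reconstructs it:

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0],
              [4.0, 0.0, 5.0],
              [0.0, 6.0, 0.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

print(U.shape, s.shape, Vt.shape)           # (4, 3) (3,) (3, 3)
print(s)                                    # singular values, sorted in decreasing order
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: U * S * V^T reconstructs A
```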
Example:-
• Left Singular Matrix (U): Columns of matrix U can be thought of as concepts; the first
column of U corresponds to the SciFi concept and the second column of U corresponds to
the Romance concept. What it basically shows is that the first 4 users correspond to the SciFi
concept and the last 3 users correspond to the Romance concept.
• Matrix U would be a “Users-to-Concept” similarity matrix. Each value in matrix U determines
how much a given user corresponds to a given concept (in our case there are two
concepts, SciFi and Romance). In the given matrix, for example, the first user
corresponds to the SciFi concept whereas the fifth user corresponds to the Romance concept.
• Singular Values (Σ): In this diagonal matrix, each diagonal value is a non-zero positive
value. Each value depicts the strength of a concept. For instance, it can be seen that the
“strength” of the SciFi concept is higher than that of the Romance concept.
• Right Singular Matrix (V): V is a “movie-to-concept” matrix. For instance, it shows
that the first three movies belong heavily to the first concept, i.e. the SciFi concept, while the last two
belong to the second concept, which is the Romance concept.
• Although it might sound complicated, it’s pretty simple. Imagine that you’ve compressed your
dataset and it has resulted in a matrix S that sums to 100.
• If the first two values in S add up to 94, this means that the first two components contain
94 percent of the dataset’s information.
• In other words, the first two columns of the u matrix and the first two rows of the v matrix contain
94 percent of the important information held in your original dataset, A.
• To isolate only the important, non-redundant information, you’d keep only those two columns and
discard the rest.
• When you go to reconstruct your matrix by taking the dot product of S, u, and v, you’ll probably
notice that the resulting matrix is not an exact match to your original dataset. That’s the data that
remains after much of the information redundancy and noise has been filtered out by SVD.
• When deciding the number of rows and columns to keep, it’s okay to get rid of rows and columns, as long as
you make sure that you retain at least 70 percent of the dataset’s original information.
Reducing dimensionality with factor analysis

• Factor analysis is along the same lines as SVD in that it’s a method you can use for filtering out
redundant information and noise from your data.
• An offspring of the psychometrics field, this method was developed to help you derive a root
cause, in cases where a shared root cause results in shared variance — when a variable’s variance
correlates with the variance of other variables in the dataset.
• A variable’s variability measures how much variance it has around its mean.
• The greater a variable’s variance, the more information that variable contains.
• When you find shared variance in your dataset, that means information redundancy is at
play.
• You can use factor analysis or principal component analysis to clear your data of this information
redundancy.
• In order to apply Factor Analysis, we must make sure the data we have
is suitable for it.
• The simplest approach would be to look at the correlation matrix of
the features and identify groups of intercorrelated variables.
• If there are some correlated features with a correlation degree of more
than 0.3, perhaps it would be interesting to use Factor Analysis.
Groups of highly intercorrelated features will be merged into one
latent variable, called a factor.
• Factor analysis makes the following assumptions:
• Your features are metric — numeric variables on which meaningful calculations
can be made.
• Your features should be continuous or ordinal.
• You have more than 100 observations in your dataset and at least 5 observations
per feature.
• Your sample is homogenous.
• There is r > 0.3 correlation between the features in your dataset.
• In factor analysis, you do a regression on features to uncover underlying latent
variables, or factors.
• You can then use those factors as variables in future analyses, to represent the
original dataset from which they’re derived.
• At its core, factor analysis is the process of fitting a model to prepare a dataset for
analysis by reducing its dimensionality and information redundancy.
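A brief scikit-learn sketch of fitting a factor analysis model; the synthetic data and the choice of two factors are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data: 200 observations, 6 correlated features driven by 2 latent factors
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2)
factors = fa.fit_transform(X)    # factor scores, usable as variables in later analyses

print(factors.shape)             # (200, 2)
print(fa.components_.shape)      # (2, 6) estimated loadings
```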
Decreasing dimensionality and removing outliers with PCA

• Principal component analysis (PCA) is another dimensionality reduction technique that’s
closely related to SVD.
• This unsupervised statistical method finds relationships between features in your dataset
and then transforms and reduces them to a set of non-information-redundant principal
components — uncorrelated features that embody and explain the information that’s
contained within the dataset (that is, its variance).
• These components act as a synthetic, refined representation of the dataset, with the
information redundancy, noise, and outliers stripped out.
• You can then take those reduced components and use them as input for your machine
learning algorithms, to make predictions based on a compressed representation of your
data.
• Principal Component Analysis is an unsupervised dimension reduction technique that
focuses on capturing maximum variation of the data.
• In this technique, variables are transformed into a new set of variables,
which are linear combinations of the original variables.
• This new set of variables is known as the principal components.
• They are obtained in such a way that the first principal component
accounts for most of the possible variation in the original data, after
which each succeeding component has the highest possible remaining variance.
• The PCA algorithm is based on some mathematical concepts such as:
• Variance and Covariance
• Eigenvalues and eigenvectors
• The PCA model makes these two assumptions:
• Multivariate normality is desirable, but not required. (normal distribution)
• Variables in the dataset should be continuous.
• Although PCA is like factor analysis, there are two major differences
• One difference is that PCA does not regress to find some underlying cause of
shared variance, but instead decomposes a dataset to succinctly represent its most
important information in a reduced number of features.
• The other key difference is that, with PCA, the first time you run the model, you
don’t specify the number of components to be discovered in the dataset. You let
the initial model results tell you how many components to keep, and then you
rerun the analysis to extract those features.
• A small amount of information from your original dataset will not be captured by
the principal components.
• Just keep the components that capture at least 95 percent of the dataset’s total
variance. The remaining components won’t be that useful, so you can get rid of
them.
• When using PCA for outlier detection, simply plot the principal components on an
x-y scatter plot and visually inspect for areas that might have outliers.
• Those data points correspond to potential outliers that are worth investigating.
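A short scikit-learn sketch of the workflow described above: fit PCA without fixing the number of components, inspect the explained variance, then keep just enough components to cover 95 percent of it. The data here is synthetic, for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
X[:, 5:] = X[:, :5] + 0.05 * rng.normal(size=(300, 5))   # redundant, correlated features

X_scaled = StandardScaler().fit_transform(X)

# First pass: let the model report how much variance each component explains
pca_full = PCA().fit(X_scaled)
print(pca_full.explained_variance_ratio_)

# Second pass: keep just enough components for 95% of the total variance
pca = PCA(n_components=0.95)
components = pca.fit_transform(X_scaled)
print(components.shape)
```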
Modelling Decisions with Multi-Criteria Decision
Making
• You can use MCDM methods in anything from stock portfolio management to
fashion-trend evaluation, from disease outbreak control to land development
decision making.
• Anywhere you have two or more criteria on which you need to base your
decision, you can use MCDM methods to help you evaluate alternatives.
• To use multi-criteria decision making, the following two assumptions must be
satisfied:
• Multi-criteria evaluation: You must have more than one criterion to optimize.
• Zero-sum system: Optimizing with respect to one criterion must come at the
sacrifice of at least one other criterion.
• This means that there must be trade-offs between criteria — to gain with respect to
one means losing with respect to at least one other.
Example 1:-
• The best way to get a solid grasp on MCDM is to see how it’s used to solve a real- world problem.
• MCDM is commonly used in investment portfolio theory.
• Pricing of individual financial instruments typically reflects the level of risk you incur, but an entire portfolio
can be a mixture of virtually riskless investments (U.S. government bonds, for example) and minimum-,
moderate-, and high-risk investments.
• Your level of risk aversion dictates the general character of your investment portfolio.
• Highly risk-averse investors seek safer and less lucrative investments, and less risk-averse investors choose
riskier investments.
• In the process of evaluating the risk of a potential investment, you’d likely consider the following criteria:
• Earnings growth potential: Here, an investment that falls under an earnings growth potential threshold gets
scored as a 0; anything above that threshold gets a 1.
• Earnings quality rating: If an investment falls within a ratings class for earnings quality, it gets scored as a
0; otherwise, it gets scored as a 1.
• earnings quality refers to various measures used to determine how suitable a
company’s reported earnings are
• Dividend performance: When an investment doesn’t reach a set dividend
performance threshold, it gets a 0; if it reaches or surpasses that threshold, it gets a
1.
• Imagine that you’re evaluating 20 different potential investments.
• In this evaluation, you’d score each criterion for each of the investments.
• To eliminate poor investment choices, simply sum the criteria scores for each of
the alternatives and then dismiss any investments that do not get a total score of 3
— leaving you with the investments that fall within a certain threshold of earning
growth potential, that have good earnings quality, and whose dividends perform at
a level that’s acceptable to you.
Example 2:-
• A shopper is in an electronics shop. Their objective is to purchase a new
mobile phone.
• Their criteria are price, screen size, storage space and appearance, with
price, storage space and screen size as the non-beneficial values and
appearance being a beneficial value that the shopper evaluates using a
five-point scale. To this shopper, all the criteria are equal in value, so each
has a weight of 25%. They've reduced their choices to three phones with the
following ratings:
• Phone A: 16,000, 6.2 inches, 32 GB, average looks (3 out of 5)
• Phone B: 19,000, 5.8 inches, 64 GB, excellent looks (5 out of 5)
• Phone C: 17,500, 6.0 inches, 64 GB, above-average looks (4 out of 5)
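One simple way to work this example through is a weighted-sum score in which the non-beneficial criteria (price, screen size, and storage, as framed above) are normalized so that lower raw values score higher, and the beneficial criterion (appearance) so that higher values score higher. The sketch below is a rough illustration of that idea under these assumptions, not a prescribed MCDM procedure.

```python
# Phones: (price, screen size in inches, storage in GB, appearance out of 5)
phones = {
    "Phone A": (16000, 6.2, 32, 3),
    "Phone B": (19000, 5.8, 64, 5),
    "Phone C": (17500, 6.0, 64, 4),
}
weights = [0.25, 0.25, 0.25, 0.25]
# Per the example, price, screen size and storage are treated as non-beneficial
# (lower is better) and appearance as beneficial (higher is better).
beneficial = [False, False, False, True]

def normalize(values, is_beneficial):
    lo, hi = min(values), max(values)
    if is_beneficial:
        return [(v - lo) / (hi - lo) for v in values]
    return [(hi - v) / (hi - lo) for v in values]

columns = list(zip(*phones.values()))
normalized = [normalize(col, b) for col, b in zip(columns, beneficial)]

scores = {
    name: sum(w * normalized[j][i] for j, w in enumerate(weights))
    for i, name in enumerate(phones)
}
print(scores)   # higher weighted score = more suitable under these assumptions
```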
• In mathematics, a set is a group of numbers that shares some similar
characteristic.
• In traditional set theory, membership is binary — in other words, an
individual is either a member of a set or it’s not.
• If the individual is a member, it is represented with the number 1. If it
is not a member, it is represented by the number 0.
• Traditional MCDM is characterized by binary membership.
Focusing on fuzzy MCDM
• If you prefer to evaluate suitability within a range, instead of using binary membership terms of 0
or 1, you can use fuzzy multi-criteria decision making (FMCDM) to do that.
• With FMCDM you can evaluate all the same types of problems as you would with MCDM.
• The term fuzzy refers to the fact that the criteria being used to evaluate alternatives offer a range of
acceptability — instead of the binary, crisp set criteria associated with traditional MCDM.
• Evaluations based on fuzzy criteria lead to a range of potential outcomes, each with its own level
of suitability as a solution.
• One important feature of FMCDM: You’re likely to have a list of several fuzzy criteria, but
these criteria might not all hold the same importance in your evaluation.
• To correct for this, simply assign weights to criteria to quantify their relative importance.
Introducing Regression Methods
• Machine learning algorithms of the regression variety were adopted from the statistics field, to
provide data scientists with a set of methods for describing and quantifying the relationships
between variables in a dataset.
• Use regression techniques if you want to determine the strength of correlation between variables
in your data.
• Regression analysis is a group of statistical methods that estimate the relationship between
a dependent variable (otherwise known as the outcome variables) and one or more independent
variables (often called predictor variables).
• Unlike many other models in Machine Learning, regression analyses can be used for two separate
purposes.
• First, in the social sciences, it is common to use regression analyses to infer a causal relationship
between a set of variables
• second, in data science, regression models are frequently used to predict and forecast new values.
• A regression model provides a function that describes the relationship
between one or more independent variables and a response, dependent,
or target variable.
• You can use regression to predict future values from historical values,
but be careful
• Regression methods assume a cause-and-effect relationship between
variables, but present circumstances are always subject to flux.
• Predicting future values from historical ones will generate incorrect
results when present circumstances change.
Linear regression
• Linear regression is a machine learning method you can use to describe and
quantify the relationship between your target variable, y — the predictant, in
statistics lingo — and the dataset features you’ve chosen to use as predictor
variables (commonly designated as dataset X in machine learning).
• Linear regression shows the linear relationship between the independent(predictor)
variable i.e. X-axis and the dependent(output) variable i.e. Y-axis.
• If there is a single input variable X(dependent variable), such linear regression is
called simple linear regression.
• you can also use linear regression to quantify correlations between several
variables in a dataset — called multiple linear regression.
• Equation of Simple Linear Regression, where a0 is the intercept, a1 is the
coefficient or slope, x is the independent variable and y is the
dependent variable:

y = a0 + a1x + ε

Y = Dependent Variable (Target Variable)
X = Independent Variable (Predictor Variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input
value)
ε = random error
• Equation of Multiple Linear Regression, where b0 is the intercept and
b1, b2, b3, b4, …, bn are the coefficients or slopes of the independent variables
x1, x2, x3, x4, …, xn, and y is the dependent variable:

y = b0 + b1x1 + b2x2 + b3x3 + … + bnxn + ε

• A Linear Regression model’s main aim is to find the best fit linear
line and the optimal values of intercept and coefficients such that
the error is minimized.
The above graph presents the linear relationship between the output(y) variable and predictor(X)
variables. The blue line is referred to as the best fit straight line. Based on the given data points, we
attempt to plot a line that fits the points the best.
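A minimal scikit-learn sketch of fitting such a best-fit line; the x and y values below are made-up example data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical predictor (X) and target (y) values
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9, 12.2])

model = LinearRegression().fit(X, y)

print(model.intercept_)      # estimated intercept (a0)
print(model.coef_)           # estimated slope (a1)
print(model.predict([[7]]))  # prediction for a new x value
```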
• Before using linear regression, though, make sure you’ve considered
its limitations:
• Linear regression only works with numerical variables, not categorical ones.
• If your dataset has missing values, it will cause problems. Be sure to address
your missing values before attempting to build a linear regression model.
• If your data has outliers present, your model will produce inaccurate results.
• Check for outliers before proceeding.
• The linear regression assumes that there is a linear relationship
between dataset features and the target variable. Test to make sure this
is the case, and if it’s not, try using a log transformation to
compensate.
• The linear regression model assumes that all features are independent
of each other.
• Prediction errors, or residuals, should be normally distributed.
• you should have at least 20 observations per predictive feature if you
expect to generate reliable results using linear regression.
Logistic regression
• Logistic Regression is a “Supervised machine learning” algorithm that can be used to model the
probability of a certain class or event. It is used when the data is linearly separable and the outcome
is binary in nature.
• Logistic regression is a machine learning method you can use to estimate values for a categorical
target variable based on your selected features.
• Your target variable should be numeric, and contain values that describe the target’s class — or
category.
• Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome
must be a categorical or discrete value. It can be either Yes or No, 0 or 1, true or False, etc. but
instead of giving the exact value as 0 and 1, it gives the probabilistic values which lie between 0
and 1.
• Logistic Regression is very similar to Linear Regression except in how they are used. Linear
Regression is used for solving regression problems, whereas Logistic Regression is used for
solving classification problems.
• In Logistic regression, instead of fitting a regression line, we fit an "S"
shaped logistic function, which predicts two maximum values (0 or 1).
• The curve from the logistic function indicates the likelihood of something
such as whether the cells are cancerous or not, a mouse is obese or not
based on its weight, etc.
• Logistic Regression is a significant machine learning algorithm because it
has the ability to provide probabilities and classify new data using
continuous and discrete datasets.
• Logistic Regression can be used to classify the observations using different
types of data and can easily determine the most effective variables used for
the classification.
Logistic Function (Sigmoid Function):

• The sigmoid function is a mathematical function used to map the
predicted values to probabilities.
• It maps any real value into another value within a range of 0 and 1.
• The output of logistic regression must be between 0 and 1, and it
cannot go beyond this limit, so it forms a curve like the letter “S”. This
S-shaped curve is called the sigmoid function or the logistic
function.
• In logistic regression, we use the concept of a threshold value, which
defines the probability cut-off between 0 and 1: values above the
threshold tend to 1, and values below the threshold tend to 0.
The Logistic Regression equation can be obtained from the Linear
Regression equation. The mathematical steps to get the Logistic Regression
equation are given below:
• We know the equation of a straight line can be written as:

y = b0 + b1x1 + b2x2 + … + bnxn

• In Logistic Regression, y can be between 0 and 1 only, so let's
divide the above equation by (1 – y):

y / (1 – y); this is 0 for y = 0 and infinity for y = 1

• But we need a range between –[infinity] and +[infinity], so taking the
logarithm of the equation, it becomes:

log[ y / (1 – y) ] = b0 + b1x1 + b2x2 + … + bnxn
• One cool thing about logistic regression is that, in addition to predicting the class of observations in
your target variable, it indicates the probability for each of its estimates. Though logistic regression
is like linear regression, its requirements are simpler, in that:
• There does not need to be a linear relationship between the features and target variable.
• Residuals don’t have to be normally distributed.
• Predictive features are not required to have a normal distribution.
• When deciding whether logistic regression is a good choice for you, make sure to consider the
following limitations:
• Missing values should be treated or removed.
• Your target variable must be binary or ordinal.
• Predictive features should be independent of each other.
• Logistic regression requires a greater number of observations (than linear regression) to produce a
reliable result.
• The rule of thumb is that you should have at least 50 observations per predictive feature if you
expect to generate reliable results.
Example:-
• Let us consider a problem where we are given a dataset containing
Height and Weight for a group of people.
• Our task is to predict the Weight for new entries in the Height column.
• So we can figure out that this is a regression problem where we will
build a Linear Regression model.
• We will train the model with provided Height and Weight values.
• Once the model is trained we can predict Weight for a given unknown
Height value.
• Now suppose we have an additional field Obesity and we have to classify whether a
person is obese or not depending on their provided height and weight.
• This is clearly a classification problem where we have to segregate the dataset into two
classes (Obese and Not-Obese).
• So, for the new problem, we can again follow the Linear Regression steps and build a
regression line.
• This time, the line will be based on two parameters, Height and Weight, and the regression
line will fit between two discrete sets of values.
• As this regression line is highly susceptible to outliers, it will not do a good job in
classifying two classes.
• To get a better classification, we will find probability for each output value from the
regression line.
• Now based on a predefined threshold value, we can easily classify the output into two
classes Obese or Not-Obese.
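A rough scikit-learn sketch of the height/weight obesity classifier described above, with made-up training rows; the decision threshold defaults to 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical (height in cm, weight in kg) rows and obesity labels (1 = obese)
X = np.array([[150, 45], [160, 55], [165, 60], [170, 95],
              [175, 110], [180, 120], [185, 80], [155, 85]])
y = np.array([0, 0, 0, 1, 1, 1, 0, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)

new_person = np.array([[172, 100]])
print(model.predict_proba(new_person))  # probabilities for [not obese, obese]
print(model.predict(new_person))        # class chosen using the 0.5 threshold
```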
Ordinary least squares (OLS) regression
methods
• Ordinary Least Squares regression (OLS) is a common technique for estimating coefficients
of linear regression equations which describe the relationship between one or more independent
quantitative variables and a dependent variable (simple or multiple linear regression).
• Least squares stands for minimizing the sum of squared errors (SSE). Maximum likelihood and the Generalized
Method of Moments estimator are alternative approaches to OLS.
• Example: We want to predict the height of plants depending on the number of days they have
spent in the sun. Before getting exposure, they are 30 cm. A plant grows 1 mm (0.1 cm) after being
exposed to the sun for a day.
• Y is the height of the plants
• X is the number of days spent in the sun
• β0 is 30 because it is the value of Y when X is 0.
• β1 is 0.1 because it is the coefficient multiplied by the number of days.
• A plant being exposed 5 days to the sun has therefore an estimated height of Y = 30 + 0.1*5 = 30.5
cm.
How do ordinary least squares (OLS) work?

• The OLS method aims to minimize the sum of squared differences
between the observed and predicted values.
• For example, if your real values are 2, 3, 5, 2, and 4 and your
predicted values are 3, 2, 5, 1, 5, then the total error would be (3-
2)+(2-3)+(5-5)+(1-2)+(5-4)=1-1+0-1+1=0 and the average error
would be 0/5=0, which could lead to false conclusions.
• However, if you compute the mean squared error, then you get (3-
2)^2+(2-3)^2+(5-5)^2+(1-2)^2+(5-4)^2=4 and 4/5=0.8. By scaling the
error back to the data and taking the square root of it, we get
sqrt(0.8)=0.89, so on average, the predictions differ by 0.89 from the
real value.
• Now, the idea of Simple Linear Regression is finding those parameters
α and β for which the error term is minimized.
• To be more precise, the model will minimize the squared errors:
indeed, we do not want our positive errors to be compensated by the
negative ones, since they are equally penalizing for our model.
• This procedure is called Ordinary Least Squared error — OLS.
• With OLS, you do this by squaring the vertical distance values that describe the
distances between the data points and the best-fit line, adding up those squared
distances, and then adjusting the placement of the best-fit line so that the summed
squared distance value is minimized.
• Use OLS if you want to construct a function that’s a close approximation to your
data.
• As always, don’t expect the actual value to be identical to the value predicted by
the regression.
• Values predicted by the regression are simply estimates that are most similar to the
actual values in the model.
• OLS is particularly useful for fitting a regression line to models containing more than one
independent variable.
• In this way, you can use OLS to estimate the target from dataset features.
• When using OLS regression methods to fit a regression line that has more than one independent
variable, two or more of the IVs may be interrelated.
• When two or more IVs are strongly correlated with each other, this is called multicollinearity.
• Multicollinearity tends to adversely affect the reliability of the IVs as predictors when they’re
examined apart from one another.
• Luckily, however, multicollinearity doesn’t decrease the overall predictive reliability of the model
when it’s considered collectively.
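As an illustration, the sketch below fits a multiple linear regression by least squares using NumPy's lstsq solver; the two features, their true coefficients, and the noise level are assumptions made up for the example. The RMSE it reports plays the same role as the 0.89 figure above: the average gap between predicted and actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative independent variables and a synthetic target
X = rng.normal(size=(100, 2))
y = 1.5 + 2.0 * X[:, 0] - 0.7 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Add an intercept column and solve the least-squares problem
X_design = np.column_stack([np.ones(len(X)), X])
coeffs, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Root mean squared error of the fitted line
y_hat = X_design @ coeffs
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print("coefficients:", coeffs.round(2), "RMSE:", round(rmse, 3))
```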
Detecting Outliers
• Many statistical and machine learning approaches assume that there are no outliers
in your data.
• Outlier removal is an important part of preparing your data for analysis.
• Analysing extreme values
• Outliers are data points with values that are significantly different from those of the
majority of data points comprising a variable.
• It is important to find and remove outliers because, left untreated, they skew the variable's
distribution, make the variance appear falsely high, and cause intervariable correlations to be
misrepresented.
Effect of outlier on dataset
• Most machine learning and statistical models assume that your data is free of outliers, so spotting
and removing them is a critical part of preparing your data for analysis.
• Not only that, you can use outlier detection to spot anomalies that represent fraud, equipment
failure, or cybersecurity attacks.
• In other words, outlier detection is a data preparation method and an analytical method in its own
right.
• Outliers fall into the following three categories:
• Point: Point outliers are data points with anomalous values compared to the normal range of values
in a feature.
• Contextual: Contextual outliers are data points that are anomalous only within a specific context.
• To illustrate, if you are inspecting weather station data from January in Orlando, Florida, and you
see a temperature reading of 23 degrees F, this would be quite anomalous because the average
temperature there is 70 degrees F in January.
• But consider if you were looking at data from January at a weather
station in Anchorage, Alaska — a temperature reading of 23 degrees F
in this context is not anomalous at all.
• Collective: Collective outliers appear near one another, all having similar values that are
anomalous compared to the majority of values in the feature.
• You can detect outliers using either a univariate or multivariate
approach.
• This outlier could be the result of many different issues:
• Human error
• Instrument error
• Experimental error
• Intentional creation
• Data processing error
• Sampling error
• Natural outlier
• The purpose of identifying an outlier can also vary.
• An outlier can indicate that something has changed in the process that produces the data,
which is useful in cases such as:
• Fraud detection
• Intrusion detection
• Fault diagnostics
• Time series monitoring
• Health monitoring
Detecting outliers with univariate analysis
• Univariate outlier detection is where you look at features in your dataset, and inspect them
individually for anomalous values.
• There are two simple methods for doing this:
• Tukey outlier labelling
• Tukey boxplot
• Detecting outliers with Tukey outlier labelling is somewhat cumbersome, but if you want to do it, the trick
is to see how far the minimum and maximum values are from the 25th and 75th percentiles.
• The distance between the 1st quartile Q1 (at 25 percent) and the 3rd quartile (at 75 percent) Q3
is called the inter-quartile range (IQR), and it describes the data’s spread.
• When you look at a variable, consider its spread, its Q1 / Q3 values, and its minimum and
maximum values to decide whether the variable is suspect for outliers.
• Any data point that falls outside of either 1.5 times the IQR below the first quartile
or 1.5 times the IQR above the third quartile is considered an outlier.
• Here’s a good rule of thumb: a = Q1 - 1.5*IQR and b = Q3 + 1.5*IQR.
• If your minimum value is less than a, or your maximum value is greater than b, the
variable probably has outliers.
Method:
1. Sort your data from low to high.
2. Identify the first quartile (Q1), the median, and the third quartile (Q3).
3. Calculate your IQR = Q3 – Q1.
4. Calculate your upper fence = Q3 + (1.5 * IQR).
5. Calculate your lower fence = Q1 – (1.5 * IQR).
6. Use your fences to highlight any outliers: all values that fall outside your fences.
Your dataset has 11 values. You have a couple of extreme values in your dataset, so you’ll use the
IQR method to check whether they are outliers.

26 37 24 28 35 22 31 53 41 64 29
Step 1: Sort your data from low to high
First, you’ll simply sort your data in ascending order.

22 24 26 28 29 31 35 37 41 53 64

Step 2: Identify the median, the first quartile (Q1), and the third quartile (Q3)
The median is the value exactly in the middle of your dataset when all values are ordered from low to
high.
Since you have 11 values, the median is the 6th value. The median value is 31.

22 24 26 28 29 31 35 37 41 53 64
• Next, we’ll use the exclusive method for identifying Q1 and Q3. This
means we remove the median from our calculations.
• The Q1 is the value in the middle of the first half of your dataset,
excluding the median. The first quartile value is 26.
22 24 26 28 29

Your Q3 value is in the middle of the second half of your dataset, excluding the median. The third
quartile value is 41.

35 37 41 53 64
Calculate your IQR
The IQR is the range of the middle half of your dataset. Subtract Q1 from Q3 to calculate the IQR.

IQR = Q3 – Q1 = 41 – 26 = 15
Calculate your upper fence
The upper fence is the boundary around the third quartile. Any values exceeding the upper fence are outliers.

Upper fence = Q3 + (1.5 * IQR) = 41 + (1.5 * 15) = 41 + 22.5 = 63.5
Calculate your lower fence
The lower fence is the boundary around the first quartile. Any values less than the lower fence are outliers.

Lower fence = Q1 – (1.5 * IQR) = 26 – (1.5 * 15) = 26 – 22.5 = 3.5
Use your fences to highlight any outliers
Go back to your sorted dataset from Step 1 and highlight any values that are greater than the upper
fence or less than your lower fence.
These are your outliers.
• Upper fence = 63.5
• Lower fence = 3.5

22 24 26 28 29 31 35 37 41 53 64

Only 64 lies outside the fences (64 > 63.5), so 64 is the single outlier in this dataset.
• In comparison, a Tukey boxplot is a pretty easy way to spot outliers.
• Each boxplot has whiskers that extend up to 1.5 * IQR beyond the first and third quartiles; any
values that lie beyond these whiskers are plotted as outliers.
• Figure shows outliers as they appear within a Tukey boxplot.
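A sketch of the fence calculation for the example dataset above. The quartiles are computed with the exclusive method used in the worked example; other quartile conventions (such as NumPy's default interpolation) can give slightly different fences.

```python
import numpy as np

data = np.sort(np.array([26, 37, 24, 28, 35, 22, 31, 53, 41, 64, 29]))

# Exclusive method: drop the median, then take the median of each half
n = len(data)
q1 = np.median(data[: n // 2])        # lower half -> 26
q3 = np.median(data[(n + 1) // 2 :])  # upper half -> 41

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # 3.5
upper_fence = q3 + 1.5 * iqr  # 63.5

outliers = data[(data < lower_fence) | (data > upper_fence)]
print("IQR:", iqr, "fences:", (lower_fence, upper_fence), "outliers:", outliers)
```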
Detecting outliers with multivariate analysis

• Sometimes outliers only show up within combinations of data points from disparate variables.
• These outliers really wreak havoc on machine learning algorithms, so it’s important to detect and
remove them.
• You can use multivariate analysis of outliers to do this.
• A multivariate approach to outlier detection involves considering two or more variables at a time
and inspecting them together for outliers.
• There are several methods you can use, including
• Scatter-plot matrix
• Boxplot
• Density-based spatial clustering of applications with noise (DBScan)
• Principal component analysis (PCA)
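As one illustration of the multivariate approach, the sketch below applies scikit-learn's DBSCAN to synthetic two-dimensional data; points labelled -1 do not belong to any dense cluster and can be treated as outliers. The eps and min_samples values are assumptions that would need tuning on real data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)

# A dense cloud of normal points plus a few far-away multivariate outliers
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
extreme = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([normal, extreme])

# DBSCAN labels points that do not belong to any dense region as -1 (noise);
# the three extreme points are flagged, possibly with a few sparse edge points
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)
print("points flagged as outliers:", (labels == -1).sum())
```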
Introducing Time Series Analysis
• A time series is just a collection of data on attribute values over time.
• Time series analysis is performed to predict future instances of the measure based
on the past observational data.
• To forecast or predict future values from data in your dataset, use time series
techniques
• In time series the order of observations provides a source of additional information
that should be analysed and used in the prediction process
• Time series are typically assumed to be generated at regularly spaced interval of
time (e.g. daily temperature), and so are called regular time series.
• A time series is ordered by time; the ordering can be in years, months, weeks, days, hours,
minutes, or seconds.
• A time series is a sequence of observations taken at successive, discrete points in time.
• A time series can be visualized as a run chart.
• Time Series Analysis (TSA) is used in many fields for time-based predictions, such as weather
forecasting, finance, signal processing, and engineering domains like control systems and
communications systems.
• Because TSA works with information that is ordered in a particular time sequence, it is
distinct from spatial and other analyses.
• Time series can have one or more variables that change over time.
• If there is only one variable varying over time, we call it Univariate
time series.
• If there is more than one variable it is called Multivariate time series.
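A small sketch, assuming pandas and made-up weather readings, of what univariate and multivariate time series look like as data structures:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=7, freq="D")
rng = np.random.default_rng(2)

# Univariate time series: one variable observed over time
temperature = pd.Series(rng.normal(20, 2, 7), index=dates, name="temperature")

# Multivariate time series: several variables observed at the same timestamps
weather = pd.DataFrame({
    "temperature": temperature,
    "humidity": rng.uniform(40, 80, 7),
}, index=dates)

print(temperature.head())
print(weather.head())
```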
How to analyse Time Series?

• The key steps are listed here for reference:
• Collecting the data and cleaning it
• Preparing visualizations of the key features against time
• Checking the stationarity of the series
• Developing charts to understand the nature of the series
• Model building – AR, MA, ARMA and ARIMA
• Extracting insights from the predictions
Identifying patterns in time series
• Time series exhibit specific patterns.
• Take a look at Figure to get a better understanding of what these patterns are all
about.
• Constant time series remain at roughly the same level over time, but are subject
to some random error.
• In contrast, trended series show a stable linear movement up or down.
• Whether constant or trended, time series may also sometimes exhibit seasonality
— predictable, cyclical fluctuations that reoccur seasonally throughout a year.
• As an example of seasonal time series, consider how many businesses show
increased sales during the holiday season.
• Let’s discuss the types of time series data and their influence. There are two major types:
• Stationary
• Non-stationary
• Stationary: a stationary series has no trend, seasonality, cyclical, or irregularity
components, and it follows these rules of thumb:
• The MEAN should be constant over the period of analysis.
• The VARIANCE should be constant with respect to the time frame.
• The COVARIANCE between two observations should depend only on the lag between them, not on
the point in time at which it is measured.
• Non-stationary: this is just the opposite of stationary.
• If you’re including seasonality in your model, incorporate it in the quarter, month,
or even 6-month period — wherever it’s appropriate.
• Time series may show nonstationary processes — or, unpredictable cyclical
behaviour that is not related to seasonality and that results from economic or
industry-wide conditions instead.
• Because they’re not predictable, nonstationary processes can’t be forecasted.
• You must transform nonstationary data to stationary data before moving forward
with an evaluation.
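One common way to check stationarity is the augmented Dickey-Fuller test. The sketch below, assuming statsmodels and a simulated random walk with drift, applies the test before and after first differencing, a standard transformation toward stationarity:

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(4)

# A trended (nonstationary) series: a random walk with drift
nonstationary = np.cumsum(rng.normal(loc=0.3, scale=1.0, size=300))

# First differencing is a common transformation toward stationarity
differenced = np.diff(nonstationary)

for name, series in [("original", nonstationary), ("differenced", differenced)]:
    stat, pvalue, *_ = adfuller(series)
    print(f"{name}: ADF statistic = {stat:.2f}, p-value = {pvalue:.3f}")
# A small p-value (e.g. < 0.05) is evidence that the series is stationary
```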
Modelling univariate time series data

• Similar to how multivariate analysis is the analysis of relationships between multiple
variables, univariate analysis is the quantitative analysis of only one variable at a time.
• When you model univariate time series, you are modelling time series changes
that represent changes in a single variable over time.
• Autoregressive moving average (ARMA) is a class of forecasting methods that
you can use to predict future values from current and historical data.
• As its name implies, the family of ARMA models combines autoregression techniques (analyses
that assume that previous observations are good predictors of future values and perform an
autoregression to forecast those future values) with moving average techniques (models that
measure the level of the time series and then update the forecast model if any changes are
detected).
• If you’re looking for a simple model or a model that will work for only a small
dataset, the ARMA model is not a good fit for your needs.
• An alternative in this case might be to just stick with simple linear regression.
• In Figure , you can see that the model forecast data and the actual data are a very
close fit.
• To use the ARMA model for reliable results, you need to have at least 50
observations.
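A minimal sketch of fitting an ARMA-type model with statsmodels, expressed as ARIMA(p, d=0, q) on a simulated AR(2) series of 200 observations, comfortably above the 50-observation guideline. The simulated coefficients and the chosen order are assumptions for the example, not a recipe for real data.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)

# Simulate a simple AR(2) process with 200 observations
n = 200
y = np.zeros(n)
noise = rng.normal(scale=1.0, size=n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + noise[t]

# ARMA(p, q) is ARIMA(p, d=0, q); fit the model and forecast ahead
fitted = ARIMA(y, order=(2, 0, 1)).fit()
print(fitted.summary())
print("next 10 forecasts:", fitted.forecast(steps=10).round(2))
```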
