
Unit 2: DATA ANALYSIS

Data format and types of EDA. Univariate non-graphical EDA: categorical data,
characteristics of quantitative data, central tendency, spread, skewness and
kurtosis. Univariate graphical EDA: histograms, stem-and-leaf plots, boxplots,
quantile-normal plots. Bivariate analysis: correlation coefficient, scatter
plots and heatmaps. Types of bivariate analysis: scatter plots, regression
analysis, correlation coefficients.
Population Distribution
• Population distribution refers to the distribution of a particular characteristic
or variable among all individuals or units in a specific population.

• For example, if you want to know the average height of the residents of India,
your population is all the residents of India.

• Population characteristics (parameters) include the mean (μ), standard
deviation (σ), proportion (P), median, and percentiles. The value of a
population characteristic is fixed. Together, these characteristics describe
the population distribution.
Sampling Distribution

A sampling distribution is the probability distribution of a statistic
obtained from random samples of a given population. Also known as a
finite-sample distribution, it shows how often each value of the statistic
occurs, and how spread out those values are, across samples from a specific
population.

The sampling distribution depends on multiple factors: the statistic, the
sample size, the sampling process, and the overall population. It is used to
judge how much statistics such as means, ranges, variances, and standard
deviations are likely to vary from one sample to the next.
How Does it Work?
1. Select a random sample of a specific size from a given population.

2. Calculate a statistic for the sample, such as the mean, median, or standard
deviation.

3. Repeat steps 1 and 2 many times, then develop a frequency distribution of
the sample statistics that you calculated.

4. Plot that frequency distribution. The resulting graph is the sampling
distribution.
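The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of heights is made up, and the function name `sampling_distribution`, the sample size of 50, and the 1,000 repetitions are arbitrary choices, not values from these notes.

```python
import random
import statistics
from collections import Counter

def sampling_distribution(population, sample_size, n_samples,
                          statistic=statistics.mean, seed=0):
    """Return one value of `statistic` per random sample drawn from `population`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    values = []
    for _ in range(n_samples):
        sample = rng.sample(population, sample_size)  # step 1: random sample
        values.append(statistic(sample))              # step 2: statistic of the sample
    return values

# Hypothetical population: 10,000 heights in cm (made-up numbers).
population_rng = random.Random(42)
population = [population_rng.gauss(165, 10) for _ in range(10_000)]

sample_means = sampling_distribution(population, sample_size=50, n_samples=1_000)

# Step 3: frequency distribution of the sample means, rounded to the nearest cm.
frequencies = Counter(round(m) for m in sample_means)

# Step 4: a crude text plot of that frequency distribution.
for value in sorted(frequencies):
    print(f"{value:>4} | {'#' * (frequencies[value] // 10)}")
```

The sample means cluster around the population mean and are much less spread out than the individual heights, which is the behaviour the sampling distribution is meant to show.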
Univariate graphical EDA
Data scientists often use visualization to discover anomalies and
patterns. The graphical method is a more subjective approach to EDA.
These are some of the graphical tools used for univariate analysis:

• Histogram

• Stem-and-leaf plots

• Boxplots

• Quantile-normal plots
Histogram
A histogram represents the actual count of values that fall in each range
(bin).

It shows the frequency of the data as rectangles, similar to a bar graph, and
the bars can be either vertical or horizontal.
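As a rough illustration of the counting behind a histogram, the sketch below bins a small list of values and prints horizontal bars. The exam scores, the bin width of 10, and the helper name `histogram_counts` are all made up for this example.

```python
from collections import Counter

def histogram_counts(values, bin_width, start):
    """Count how many values fall in each bin [start + k*w, start + (k+1)*w)."""
    counts = Counter((v - start) // bin_width for v in values)
    return {start + int(k) * bin_width: counts[k] for k in sorted(counts)}

# Hypothetical exam scores (made-up data for illustration).
scores = [52, 55, 61, 63, 64, 68, 70, 71, 71, 75, 78, 82, 85, 91]

counts = histogram_counts(scores, bin_width=10, start=50)

# One horizontal bar per bin; the bar length is the count.
for left_edge, count in counts.items():
    print(f"{left_edge}-{left_edge + 9}: {'#' * count}")
```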
Stem-and-leaf plots
• A stem-and-leaf plot is a simple substitute for a histogram.

• A stem-and-leaf plot is a special table where each data value is split into
a "stem" (the first digit or digits) and a "leaf" (usually the last digit).
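The stem/leaf split can be sketched as follows, assuming non-negative whole numbers where the last digit is the leaf. The data and the function name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Split each non-negative integer into a stem (leading digits) and a leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

# Hypothetical data for illustration.
data = [12, 15, 21, 23, 23, 27, 30, 34, 38, 41]

plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
```

Each printed row is one stem followed by its leaves, so the row lengths give the same shape a histogram with bins of width 10 would show.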
Boxplots
• Another very useful univariate graphical technique is the boxplot.

• Boxplots are very good at presenting information about central tendency,
symmetry and skew, as well as outliers.
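The numbers a boxplot draws can be computed directly. The sketch below is one possible version: it assumes the common 1.5 × IQR rule for flagging outliers and the "inclusive" quartile method of Python's `statistics.quantiles`, neither of which is specified in these notes. The data is made up.

```python
import statistics

def boxplot_summary(values):
    """Return the quantities a boxplot draws, using the 1.5 * IQR outlier rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "q1": q1, "median": median, "q3": q3,              # box and center line
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }

# Hypothetical data with one unusually large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

summary = boxplot_summary(data)
print(summary)
```

The median gives the central tendency, the position of the median inside the box hints at symmetry or skew, and values beyond the fences are reported as outliers.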
Quantile-normal plots
The final univariate graphical EDA technique is the most complicated.

It is called the quantile-normal (QN) plot or, more generally, the
quantile-quantile (QQ) plot.
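A quantile-normal plot pairs each sorted observation with the quantile a normal distribution would put at the same position. The sketch below computes those pairs only (no drawing). It assumes plotting positions of (i + 0.5) / n, which is one common convention among several, and fits the normal using the sample mean and standard deviation. The data and the name `qq_points` are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted observation with the matching quantile of a fitted normal."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n are an assumed convention, not the only one.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical measurements for illustration.
data = [4.1, 4.8, 5.0, 5.2, 5.5, 5.9, 6.1, 6.4, 7.3]

points = qq_points(data)
for expected, observed in points:
    print(f"normal quantile {expected:6.2f}   observed {observed:6.2f}")
```

If the data is roughly normal, the pairs fall close to a straight line when plotted against each other.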
Bivariate Analysis

• Correlation coefficient

• Scatter plots

• Heatmaps
Correlation coefficient
The correlation coefficient is the specific measure that quantifies the strength of
the linear relationship between two variables in a correlation analysis.

OR

A correlation coefficient is a statistical measure of the degree to which changes
to the value of one variable predict changes to the value of another. In positively
correlated variables, the values increase or decrease in tandem. In negatively
correlated variables, the value of one increases as the value of the other
decreases.
A correlation can take one of three forms:

• Positive correlation: both variables change in the same direction.

• Neutral correlation: no relationship in the change of the variables.

• Negative correlation: the variables change in opposite directions.


Name   Years of Experience   Annual Salary
Ann    30                    120,000
Rob    21                    105,000
Tom    19                    90,000
Ivy    10                    82,000
The relationship behind a correlation can take several forms:

• One variable could cause or depend on the values of another variable.

• One variable could be lightly associated with another variable.

• Two variables could depend on a third unknown variable.

A correlation can be measured with different coefficients:

• Pearson’s r
• Spearman’s rho
• Kendall’s tau
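To make the three coefficients concrete, the sketch below computes each one for the experience and salary table given earlier. The implementations are simplified textbook versions written for this example: the rank function and Kendall's tau assume there are no tied values, which holds for this table but not in general.

```python
from itertools import combinations
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: covariance scaled by both standard deviations."""
    x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank from 1 (smallest) upward; assumes no ties."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson_r(ranks(x), ranks(y))

def kendall_tau(x, y):
    """Kendall's tau (no ties): concordant minus discordant pairs over all pairs."""
    score = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        product = (x1 - x2) * (y1 - y2)
        score += 1 if product > 0 else -1 if product < 0 else 0
    return score / (len(x) * (len(x) - 1) / 2)

# Values from the table: Ann, Rob, Tom, Ivy.
experience = [30, 21, 19, 10]
salary = [120_000, 105_000, 90_000, 82_000]

r = pearson_r(experience, salary)
rho = spearman_rho(experience, salary)
tau = kendall_tau(experience, salary)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Salary rises with experience for every pair of people in the table, so the two rank-based coefficients reach their maximum, while Pearson's r is slightly lower because the points do not lie exactly on a straight line.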
Scatter plots
For two quantitative variables, the basic graphical EDA technique is the
scatter plot, which has one variable on the x-axis, one on the y-axis, and a
point for each case in your dataset.

If one variable is explanatory and the other is the outcome, it is a very strong
convention to put the outcome on the y (vertical) axis.
Heatmaps
A heatmap is a graphical representation of data that uses colors to visualize the
values of a matrix.

Typically, brighter colors (often reddish ones) represent more common values or
higher activity, and darker colors represent less common values or lower activity.

A heatmap is also known as a shading matrix.

Heatmaps in Seaborn can be plotted by using the seaborn.heatmap() function.
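In bivariate analysis, the matrix shown in a heatmap is often a correlation matrix. The sketch below builds such a matrix with the standard library and prints a rough text "heatmap" using shading characters instead of colors. The three columns are made-up data and the helper names are hypothetical. With seaborn installed, the same matrix could instead be passed to `seaborn.heatmap()`.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    x_mean, y_mean = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_matrix(columns):
    """Return the matrix of pairwise correlations, in the order of the dict keys."""
    names = list(columns)
    return [[pearson_r(columns[a], columns[b]) for b in names] for a in names]

def shade(value):
    """Map a correlation in [-1, 1] to one of five shading characters."""
    levels = " .:*#"
    index = min(int((value + 1) / 2 * len(levels)), len(levels) - 1)
    return levels[index]

# Hypothetical data for illustration.
columns = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 55, 61, 70, 74, 80],
    "hours_gaming": [6, 5, 5, 3, 2, 1],
}

matrix = correlation_matrix(columns)
for name, row in zip(columns, matrix):
    cells = " ".join(f"{shade(v)} {v:+.2f}" for v in row)
    print(f"{name:>14}: {cells}")
```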


Regression analysis
• Regression analysis is one of the most widely used statistical methods for
investigating or estimating the relationship between a dependent variable and a set of
independent variables.

• Regression analysis can also provide the equation of the fitted curve or line.
Additionally, it may report the correlation coefficient.

OR

• Regression analysis is often used to model or analyze data. Most survey analysts use it to
understand the relationship between the variables, which can then be used to predict
the outcome.

• It is widely used when the dependent and independent variables are linked in
a linear or non-linear fashion, and the target variable takes continuous
values.

What is the purpose of a regression model?

Regression analysis is used for one of two purposes: predicting the value of the
dependent variable when information about the independent variables is known,
or estimating the effect of an independent variable on the dependent variable.
For example, suppose a soft drink company wants to expand its manufacturing unit to a
new location. Before moving forward, the company wants to analyze its revenue generation
model and the various factors that might impact it. Hence, the company conducts an online
survey with a specific questionnaire.

Regression analysis makes it easier for the company to analyze the survey results
and understand the relationship between variables such as electricity and revenue. Here,
revenue is the dependent variable.
Linear Regression
• Linear regression is the most extensively used modelling technique. It assumes a linear
connection between a dependent variable (Y) and an independent variable (X).

• It employs a regression line, also known as a best-fit line.

• The linear connection is defined as Y = c + m*X + e,

where ‘c’ denotes the intercept,

‘m’ denotes the slope of the line, and

‘e’ is the error term.


Consider a dataset where the independent attribute is represented by x and the
dependent attribute is represented by y.

The equation of a straight line is y = mx + c, where m is the slope and c is
the intercept. To prepare a simple regression model of the given dataset, we
need to calculate the slope and intercept of the line that best fits the data
points.

How do we calculate the slope and intercept? The formulas are given below.
Slope = Sxy / Sxx
where Sxy is the sample covariance of x and y, and Sxx is the sample variance
of x.
Intercept = y_mean – Slope * x_mean
where x_mean and y_mean are the means of x and y.

For example, with Sxy = 28, Sxx = 10, x_mean = 3 and y_mean = 14.6:
Slope = 28/10 = 2.8
Intercept = 14.6 – 2.8 * 3 = 6.2

Therefore, the desired equation of the regression model is y = 2.8x + 6.2.
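The slope and intercept formulas translate directly into code. The notes do not show the dataset behind the worked numbers, so the five points below are a hypothetical dataset chosen only because it reproduces Sxy = 28, Sxx = 10, an x mean of 3 and a y mean of 14.6. The sketch uses sums of deviations rather than the covariance and variance themselves. The ratio is the same because the common divisor cancels.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    x_mean = sum(x) / len(x)
    y_mean = sum(y) / len(y)
    # Sums of deviation products; dividing both by (n - 1) would give the
    # sample covariance and variance, and the divisor cancels in the ratio.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical dataset (an assumption, chosen to match the worked numbers).
x = [1, 2, 3, 4, 5]
y = [7, 14, 15, 18, 19]

slope, intercept = fit_line(x, y)
print(f"y = {slope:.1f} x + {intercept:.1f}")

# Using the fitted line to predict y for a new x value.
predicted = slope * 6 + intercept
print(f"predicted y at x = 6: {predicted:.1f}")
```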


Logistic Regression
• When the dependent variable is discrete, the logistic regression technique is
applicable.

• In other words, this technique is used to compute the probability of mutually
exclusive occurrences such as pass/fail, true/false, 0/1, and so forth.

• Thus, the target variable can take only one of two values. A sigmoid curve
represents its connection to the independent variable, and the predicted
probability lies between 0 and 1.
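The sigmoid curve mentioned above can be sketched as follows. The coefficients (intercept -4.0, slope 1.0) and the pass/fail setting are invented for illustration. In practice they would be estimated from data, which this sketch does not attempt.

```python
from math import exp

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + exp(-z))

def pass_probability(hours, intercept=-4.0, slope=1.0):
    """Probability of passing given hours studied (hypothetical coefficients)."""
    return sigmoid(intercept + slope * hours)

hours_studied = [0, 2, 4, 6, 8]
probabilities = [pass_probability(h) for h in hours_studied]

for hours, p in zip(hours_studied, probabilities):
    label = "pass" if p >= 0.5 else "fail"  # 0.5 is an assumed decision threshold
    print(f"{hours} hours -> P(pass) = {p:.3f} -> predicted {label}")
```

The probability rises smoothly with the independent variable but never leaves the range 0 to 1, and a threshold turns it into one of the two allowed outcomes.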
Polynomial Regression
• Polynomial regression analysis is used to represent a non-linear relationship
between dependent and independent variables.

• It is a variant of the multiple linear regression model, except that the
best-fit line is curved rather than straight.
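One way to see why polynomial regression is a variant of multiple linear regression is to treat the powers of x (1, x, x², ...) as separate independent variables and fit them by ordinary least squares. The sketch below does this by solving the normal equations with a small Gaussian elimination routine. The data is made up (generated from y = 1 + 2x + 3x²), and the function names are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a * coeffs = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        known = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - known) / m[r][r]
    return coeffs

def fit_polynomial(x, y, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree], lowest power first."""
    k = degree + 1
    # Each row is [1, x, x^2, ...]: the powers act as separate predictors.
    features = [[xi ** p for p in range(k)] for xi in x]
    xtx = [[sum(row[i] * row[j] for row in features) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(features, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
x = [-2, -1, 0, 1, 2]
y = [9, 2, 1, 6, 17]

coefficients = fit_polynomial(x, y, degree=2)
print(coefficients)
```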
