Unit_2
• For example, if you want to know the average height of the residents of India,
the population is all the residents of India.
• Population characteristics include the mean (μ), standard deviation (σ), proportion (P),
median, percentiles, etc. The value of a population characteristic is fixed.
These characteristics describe the population distribution.
Sampling Distribution
2. Calculate a statistic for the sample, such as the mean, median, or standard
deviation.
4. Plot the frequency distribution of each sample statistic that you developed
from the step above. The resulting graph will be the sampling distribution.
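The procedure above can be sketched in Python. This is a minimal illustration, not a definitive implementation: it assumes that the steps not shown here are drawing a random sample from the population and repeating the draw many times, and the population of heights, the sample size, and the function name `sampling_distribution` are hypothetical.

```python
import random
import statistics


def sampling_distribution(population, sample_size, num_samples, seed=0):
    """Return the mean of each of `num_samples` random samples."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        # Assumed step: draw a random sample (without replacement).
        sample = rng.sample(population, sample_size)
        # Step 2: calculate a statistic for the sample (here, the mean).
        means.append(statistics.mean(sample))
    return means


# Hypothetical population: 10,000 simulated heights in cm.
population_rng = random.Random(42)
population = [population_rng.gauss(165, 10) for _ in range(10_000)]

sample_means = sampling_distribution(population, sample_size=50, num_samples=1000)

print("population mean:", round(statistics.mean(population), 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("spread of sample means:", round(statistics.stdev(sample_means), 2))
```

The list `sample_means` is what step 4 would plot as a frequency distribution. The sample means cluster around the population mean and vary much less than the individual heights do.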
Univariate graphical EDA
Data scientists often use visualization to discover anomalies and
patterns. The graphical method is a more subjective approach to EDA.
These are some of the graphical tools used to perform univariate analysis:
• Histogram
• Stem-and-leaf plots
• Boxplots
• Quantile-normal plots
Histogram
A histogram shows the actual count of observations that fall in each range of values.
horizontal.
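As a rough sketch of the counting behind a histogram, the Python below groups values into equal-width ranges and prints one bar per range. The data, the bin width, and the function name `histogram_counts` are hypothetical choices made for illustration.

```python
def histogram_counts(values, bin_width, start):
    """Count how many values fall in each [left, left + bin_width) range."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        left = start + index * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical exam scores.
scores = [52, 67, 71, 58, 93, 85, 76, 64, 79, 88]
counts = histogram_counts(scores, bin_width=10, start=50)

# Text rendering: one row per range, one '#' per observation.
for left, n in counts.items():
    print(f"[{left}, {left + 10}) {'#' * n}")
```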
Stem-and-leaf plots
• A simple substitute for a histogram is a stem-and-leaf plot.
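A minimal sketch of a stem-and-leaf plot in Python, assuming two-digit non-negative integers so that the tens digit is the stem and the units digit is the leaf. The data and the function name `stem_and_leaf` are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group non-negative integers by stem (value // 10) and leaf (value % 10)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return dict(groups)


# Hypothetical exam scores.
scores = [52, 67, 71, 58, 93, 85, 76, 64, 79, 88]
plot = stem_and_leaf(scores)

for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")
```

Unlike a histogram, each row keeps the individual values, so the original data can be read back from the plot.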
• Correlation coefficient
• Scatter plots
• Heatmaps
Correlation coefficient
The correlation coefficient is the specific measure that quantifies the strength of
the linear relationship between two variables in a correlation analysis.
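One common choice is the Pearson correlation coefficient. The sketch below computes it from first principles; the data and the function name `pearson_r` are hypothetical, and the source does not say which coefficient it has in mind.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print(round(r, 4))
```

The value lies between -1 and 1: values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.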
OR
Rob 21 105,000
Tom 19 90,000
Ivy 10 82,000
• One variable could cause or depend on the values of another variable.
If one variable is explanatory and the other is the outcome, the strong convention
is to put the outcome on the y (vertical) axis and the explanatory variable on the
x (horizontal) axis.
Heatmaps
A heatmap is a graphical representation of data that uses colors to visualize the values
of a matrix.
Brighter colors, typically reddish ones, represent more common values or higher activity,
and darker colors represent less common values or lower activity.
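The mapping from matrix values to colors can be sketched without a plotting library. In the hypothetical example below, text characters stand in for a color scale (an assumption made only to keep the example self-contained), and each value is scaled between the matrix minimum and maximum.

```python
# Stand-in for a color scale, ordered from lowest to highest value.
SHADES = " .:*#"


def to_heatmap_rows(matrix, shades=SHADES):
    """Map each matrix value to a shade according to its relative size."""
    flat = [v for row in matrix for v in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1  # avoid division by zero for a constant matrix
    rows = []
    for row in matrix:
        cells = []
        for v in row:
            level = int((v - low) / span * (len(shades) - 1))
            cells.append(shades[level])
        rows.append("".join(cells))
    return rows


# Hypothetical activity matrix.
activity = [
    [0, 2, 4],
    [2, 6, 8],
    [4, 8, 8],
]
rows = to_heatmap_rows(activity)
for line in rows:
    print(line)
```

A plotting library would perform the same scaling step and then look up a color instead of a character.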
• Regression analysis can also provide the equation of the fitted curve or line.
Additionally, it may report the correlation coefficient.
OR
• Regression analysis is often used to model or analyze data. Most survey analysts use it to
understand the relationship between the variables, which can then be used to predict
the outcome.
• It is widely used when the dependent and independent variables are linked in
a linear or non-linear fashion, and the target variable has a set of continuous
values.
Regression analysis is used for one of two purposes: predicting the value of the
dependent variable when information about the independent variables is known,
or estimating the effect of an independent variable on the dependent variable.
For example, suppose a soft drink company wants to expand its manufacturing unit to a
newer location. Before moving forward, the company wants to analyze its revenue generation
model and the various factors that might impact it. Hence, the company conducts an online
survey with a specific questionnaire.
After using regression analysis, it becomes easier for the company to analyze the survey results
and understand the relationship between different variables like electricity and revenue – here,
revenue is the dependent variable.
Linear Regression
• The most extensively used modelling technique is linear regression, which assumes a linear
connection between a dependent variable (Y) and an independent variable (X).
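As an illustration, a straight line Y = intercept + slope · X can be fitted by ordinary least squares. The sketch below is one standard way to do this for a single independent variable; the data and the function name `fit_line` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical observations of X and Y.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)

# Predict Y for a new value of X.
prediction = intercept + slope * 6
print(round(slope, 2), round(intercept, 2), round(prediction, 2))
```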
applicable.
• Thus, the target variable can take on only one of two values. A sigmoid
curve represents its connection to the independent variable, and the predicted
probability has a value between 0 and 1.
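The sigmoid curve mentioned above is the function used in logistic regression to turn a linear combination of inputs into a probability. The sketch below shows only that mapping; the weight and bias are hypothetical values chosen for illustration, not fitted from data.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    # Two branches keep math.exp from overflowing for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_probability(x, weight, bias):
    """Probability that the target is 1 for input x."""
    return sigmoid(weight * x + bias)


def predict_class(x, weight, bias, threshold=0.5):
    """Return one of the two possible target values, 0 or 1."""
    return 1 if predict_probability(x, weight, bias) >= threshold else 0


# Hypothetical, hand-picked parameters (not estimated from data).
weight, bias = 1.5, -6.0
for x in (2, 4, 6):
    p = predict_probability(x, weight, bias)
    print(x, round(p, 3), predict_class(x, weight, bias))
```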
Polynomial Regression
• The technique of polynomial regression analysis is used to represent a non-linear
relationship between dependent and independent variables.
• It is a variant of the multiple linear regression model, except that the best fit
line is curved rather than straight.
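A minimal sketch of why polynomial regression is a variant of multiple linear regression: the powers x, x², ... are treated as separate predictors, and the coefficients are found by ordinary least squares. The example solves the normal equations with a small Gaussian elimination routine; the function names and data are hypothetical, and a numerical library would normally be used instead.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree]."""
    size = degree + 1
    # Normal equations built from the predictors 1, x, x^2, ...
    ata = [[sum(x ** (i + j) for x in xs) for j in range(size)]
           for i in range(size)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve_linear_system(ata, aty)


# Hypothetical data lying exactly on y = 1 + 2x + 3x^2.
xs = [-2, -1, 0, 1, 2]
ys = [1 + 2 * x + 3 * x * x for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
print([round(c, 6) for c in coeffs])
```

The model stays linear in its coefficients even though the fitted curve is not a straight line in x.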