
UNIT II EXPLORATORY DATA ANALYTICS (9 Hrs)


Introduction – Data Preparation – Exploratory Data Analysis: Data summarization – Data distribution – Outlier Treatment – Measuring asymmetry – Continuous distribution; Estimation: Mean – Variance – Skewness and Kurtosis – Box Plots – Pivot Table – Heat Map – Correlation Statistics – ANOVA – Sampling – Covariance – Correlation.
Exploratory Data Analysis (EDA)
• Data analysis involves the processes of cleaning, transforming, and analyzing data, and building models to extract specific, relevant insights.
• These insights are beneficial for making important business decisions in real-time situations.
• Exploratory Data Analysis is important for any business.
• It lets data scientists analyze the data before reaching any conclusion.
• It also ensures that the results produced are valid and applicable to business outcomes and goals.
• Exploratory Data Analysis (EDA) is a data analytics process for understanding the data in depth and learning its different characteristics, often with visual means.
Univariate analysis vs. bivariate analysis
• Univariate analysis looks at one variable; bivariate analysis looks at two variables and their relationship.
• Univariate is a term commonly used in statistics to describe a type of data which consists of observations on only a single characteristic or attribute.
• A simple example of univariate data would be the salaries of workers in an industry.
Univariate analysis
Univariate Descriptive Statistics

• Some ways you can describe patterns found in univariate data include measures of central tendency (mean, mode, and median) and measures of dispersion: range, variance, maximum, minimum, quartiles (including the interquartile range), and standard deviation.

• You have several options for describing univariate data graphically:

• Frequency Distribution Tables.


• Bar Charts.
• Histograms.
• Frequency Polygons.
• Pie Charts.
Bivariate
Bivariate data helps you study two variables.
• For example, suppose you are studying a group of college students.
• To find their average SAT score and average age, you have two pieces of the puzzle to find (SAT score and age).
• Bivariate data refers to a dataset that contains exactly two variables.

• This type of data occurs all the time in real-world situations and we
typically use the following methods to analyze this type of data:

• Scatterplots
• Correlation Coefficients
• Simple Linear Regression
Example 1: Business
• Businesses often collect bivariate data about total money spent on
advertising and total revenue.

• For example, a business may collect advertising-spend and revenue figures for 12 consecutive sales quarters.
• This is an example of bivariate data because it contains information on exactly two variables: advertising spend and total revenue.

• The business may decide to fit a simple linear regression model to this
dataset and find the following fitted model:

• Total Revenue = 14,942.75 + 2.70*(Advertising Spend)

• This tells the business that for each additional dollar spent on advertising,
total revenue increases by an average of $2.70.
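As a sketch of how such a model could be fitted, here is a minimal NumPy example. The advertising and revenue figures are hypothetical (the slide's 12-quarter table is not reproduced), generated around the fitted line quoted above:

```python
import numpy as np

# Hypothetical advertising spend (x) and total revenue (y) for 12 quarters;
# these values are generated around the slide's fitted line
# Total Revenue = 14,942.75 + 2.70 * (Advertising Spend).
rng = np.random.default_rng(0)
ad_spend = np.array([1000, 2000, 3000, 4000, 5000, 6000,
                     7000, 8000, 9000, 10000, 11000, 12000], dtype=float)
revenue = 14942.75 + 2.70 * ad_spend + rng.normal(0, 500, 12)

# Fit a degree-1 polynomial: polyfit returns (slope, intercept)
slope, intercept = np.polyfit(ad_spend, revenue, 1)
print(f"Total Revenue = {intercept:.2f} + {slope:.2f} * (Advertising Spend)")
```

With the noise included, the recovered slope and intercept land close to the values used to generate the data.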
Example 2: Medical
Medical researchers often collect bivariate data to gain a better
understanding of the relationship between variables related to health.

For example, a researcher may collect data about the age and resting heart rate of 15 individuals.
• The researcher may then decide to calculate the correlation between
the two variables and find it to be 0.812.

• This indicates that there is a strong positive correlation between the two variables: as age increases, resting heart rate tends to increase in a predictable manner as well.
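A correlation coefficient like the 0.812 quoted above can be computed with NumPy's corrcoef; the age and heart-rate values below are hypothetical stand-ins for the researcher's table of 15 individuals:

```python
import numpy as np

# Hypothetical age and resting-heart-rate values for 15 individuals
# (stand-ins for the researcher's table, which is not reproduced here)
age = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 28, 38, 48, 58, 68])
heart_rate = np.array([60, 62, 64, 65, 68, 70, 72, 74, 75, 78,
                       61, 66, 69, 73, 76])

# Pearson correlation coefficient between the two variables
r = np.corrcoef(age, heart_rate)[0, 1]
print(round(r, 3))  # a value near +1 indicates a strong positive correlation
```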
Example 3: Economics
• Economists often collect bivariate data to understand the relationship
between two socioeconomic variables.

• For example, an economist may collect data on the total years of schooling and total annual income among individuals in a certain city.
• He may then decide to fit the following simple linear regression
model:

• Annual Income = -45,353 + 7,120*(Years of Schooling)


• This tells the economist that for each additional year of schooling, annual income increases by $7,120 on average.
Example 4: Academics
• Researchers often collect bivariate data to understand what variables
affect the performance of university students.

• For example, a researcher may collect data on the number of hours studied per week and the corresponding GPA for students in a certain class.
She may then create a simple scatterplot to visualize the relationship
between these two variables:
• Clearly there is a positive association between the two variables: As
the number of hours studied per week increases, the GPA of the
student tends to increase as well.
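A scatterplot like the one described can be drawn with matplotlib; the hours-studied and GPA values below are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Hypothetical hours-studied and GPA values for ten students
hours = [2, 4, 5, 6, 7, 8, 10, 12, 14, 15]
gpa = [2.1, 2.4, 2.6, 2.7, 3.0, 3.1, 3.3, 3.5, 3.7, 3.8]

plt.scatter(hours, gpa)
plt.xlabel("Hours studied per week")
plt.ylabel("GPA")
plt.title("Hours studied vs. GPA")
plt.savefig("hours_vs_gpa.png")
```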
Multivariate analysis
• Multivariate analysis is based on the observation and analysis of more than one statistical outcome variable at a time.
• In design and analysis, the technique is used to perform trade studies
across multiple dimensions while taking into account the effects of all
variables on the responses of interest.
Objectives of multivariate data analysis:

• Multivariate data analysis helps in the reduction and simplification of data as much as possible without losing any important details.

• As MVA has multiple variables, the variables are grouped and sorted on the
basis of their unique features.

• The variables in multivariate data analysis could be dependent or independent. It is important to verify the collected data and analyze the state of the variables.
• In multivariate data analysis, it is very important to understand the
relationship between all the variables and predict the behavior of the
variables based on observations.

• Statistical hypotheses are created and tested based on the parameters of the multivariate data.
• This testing is carried out to determine whether or not the
assumptions are true.
Advantages of multivariate data analysis:

The following are the advantages of multivariate data analysis:

• As multivariate data analysis deals with multiple variables, all the variables can either
be independent or dependent on each other.
• This helps the analysis to search for factors that can help in drawing accurate
conclusions.

• Since the analysis is tested, the drawn conclusions are closer to real-life situations.
Importance of EDA in Data Science
• The Data Science field is now very important in the business world as it provides many opportunities to make vital business decisions by analyzing the huge volumes of gathered data.
• Understanding the data thoroughly needs its exploration from every
aspect.
• The impactful features enable making meaningful and beneficial
decisions; therefore, EDA occupies an invaluable place in Data
science.
Objective of Exploratory Data Analysis
• The overall objective of exploratory data analysis is to obtain vital
insights and hence usually includes the following sub-objectives:

• Identifying and removing data outliers


• Identifying trends in time and space
• Uncover patterns related to the target
• Creating hypotheses and testing them through experiments
• Identifying new sources of data
Role of EDA in Data Science

• The role of exploratory data analysis builds on the objectives listed above.
• After formatting the data, the performed analysis indicates patterns
and trends that help to take the proper actions required to meet the
expected goals of the business.
• As we expect specific tasks to be done by any executive in a particular
job position, it is expected that proper EDA will fully provide answers
to queries related to a particular business decision.
Steps Involved in Exploratory Data Analysis (EDA)

• 1. Data Collection
• Nowadays, data is generated in huge volumes and various forms
belonging to every sector of human life, like healthcare, sports,
manufacturing, tourism, and so on. Every business knows the
importance of using data beneficially by properly analyzing it.
However, this depends on collecting the required data from various
sources through surveys, social media, and customer reviews, to
name a few. Without collecting sufficient and relevant data, further
activities cannot begin.
2. Finding all Variables and Understanding Them

• When the analysis process starts, the first focus is on the available
data that gives a lot of information.
• This information contains changing values about various features or
characteristics, which helps to understand and get valuable insights
from them. It requires first identifying the important variables which
affect the outcome and their possible impact. This step is crucial for
the final result expected from any analysis.
3. Cleaning the Dataset

• The next step is to clean the data set, which may contain null values
and irrelevant information.
• These are to be removed so that data contains only those values that
are relevant and important from the target point of view.
• This not only saves time but also reduces the computational power required for estimation.
• Preprocessing takes care of issues such as null values, outliers, and other anomalies.
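The cleaning step above can be sketched with pandas; the dataset and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# A small illustrative dataset (hypothetical) containing null values and an
# irrelevant free-text column
df = pd.DataFrame({
    "age": [25, np.nan, 35, 40],
    "income": [50000, 60000, np.nan, 80000],
    "notes": ["ok", "ok", "ok", "ok"],  # irrelevant to the analysis target
})

cleaned = (df.drop(columns=["notes"])  # remove the irrelevant column
             .dropna())                # remove rows containing null values
print(cleaned.shape)  # (2, 2): only the fully observed rows remain
```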
4. Identify Correlated Variables

• Finding a correlation between variables helps to know how a particular variable is related to another.
• The correlation matrix method gives a clear picture of how different
variables correlate, which further helps in understanding vital
relationships among them.
5. Choosing the Right Statistical Methods

• As will be seen in later sections, depending on the data (categorical or numerical), its size, the type of variables, and the purpose of analysis, different statistical tools are employed.
• Statistical formulae applied for numerical outputs give fair
information, but graphical visuals are more appealing and easier to
interpret.
6. Visualizing and Analyzing Results

• Once the analysis is over, the findings are to be observed cautiously and carefully so that proper interpretation can be made.
• The trends in the spread of data and correlation between variables
give good insights for making suitable changes in the data
parameters.
• The data analyst should have the requisite capability to analyze and
be well-versed in all analysis techniques.
• The results obtained will be specific to the data of that particular domain and can be applied in fields such as retail, healthcare, and agriculture.
Types of Exploratory Data Analysis

• EDA techniques are commonly classified along two axes: the number of variables (univariate, bivariate, or multivariate) and whether the technique is graphical or non-graphical. The main types discussed below are:

• Univariate non-graphical
• Univariate graphical
• Multivariate non-graphical
• Multivariate graphical
• 1. Univariate Non-Graphical
• It is the simplest of all types of data analysis used in practice.
• As the name suggests, uni means only one variable is considered
whose data (referred to as population) is compiled and studied.
• The main aim of univariate non-graphical EDA is to find out the details
about the distribution of the population data and to know some
specific parameters of statistics.
• The significant parameters which are estimated from a distribution
point of view are as follows:
• Central Tendency: This term refers to values located at the data's central
position or middle zone.
• The three generally estimated parameters of central tendency are mean,
median, and mode.
• Mean is the average of all values in data, while the mode is the value that
occurs the maximum number of times.
• The Median is the middle value with equal observations to its left and right.
• Range: The range is the difference between the maximum and minimum values in the data, thus indicating how far the data extends above and below the central value.
• Variance and Standard Deviation: Two more useful parameters are
standard deviation and variance.
• Variance is a measure of dispersion that indicates the spread of all
data points in a data set.
• It is the most commonly used measure of dispersion and is the mean of the squared differences between each data point and the mean, while the standard deviation is its square root.
• The larger the value of standard deviation, the farther the spread of
data, while a low value indicates more values clustering near the
mean.
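The parameters above can be computed with Python's standard statistics module; the sample values are illustrative:

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]  # small illustrative sample

print("mean:    ", st.mean(data))          # 5
print("median:  ", st.median(data))        # 4.5
print("mode:    ", st.mode(data))          # 4 (most frequent value)
print("range:   ", max(data) - min(data))  # 7
print("variance:", st.pvariance(data))     # 4 (population variance)
print("std dev: ", st.pstdev(data))        # 2.0 (square root of the variance)
```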
• 2. Univariate Graphical
• The graphs in this section are based on Auto MPG dataset available
on the UCI repository.
• Some common types of univariate graphics are:
• Stem-and-leaf Plots: This is a very simple but powerful EDA method
used to display quantitative data but in a shortened format.
• It displays the values in the data set, keeping each observation intact
but separating them as stem (the leading digits) and remaining or
trailing digits as leaves.
• However, histograms are mostly used in its place now.
• Histograms (Bar Charts): These plots are used to display both grouped
or ungrouped data.
• On the x-axis, values of variables are plotted, while on the y-axis are
the number of observations or frequencies.
• Histograms offer a quick way to understand your data, revealing characteristics such as central tendency, dispersion, and outliers.
• There are many types of histograms, a few of which are listed below:

• Simple Bar Charts: These are used to represent categorical variables with rectangular bars,
where the different lengths correspond to the values of the variables.
• Multiple or Grouped charts: Grouped bar charts are bar charts representing multiple sets of data
items for comparison where a single color is used to denote one specific series in the dataset.
• Percentage Bar Charts: These are bar graphs that depict the data in the form of percentages for each observation.
• Box Plots: These are used to display the distribution of a quantitative variable in the data. If the data set contains categorical variables, the plots can show comparisons between their groups.
• Further, if outliers are present in the data, they can be easily identified. These graphs are very useful for quartile-based comparisons, i.e., the 25%, 50%, and 75% points of the data.
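A histogram and a box plot of a quantitative variable can be drawn side by side with matplotlib; the values below are synthetic stand-ins for a column such as Auto MPG's mpg:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for a quantitative column such as Auto MPG's "mpg"
rng = np.random.default_rng(0)
values = rng.normal(loc=23, scale=6, size=300)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(values, bins=20)   # frequencies of grouped values
ax1.set_title("Histogram")
ax2.boxplot(values)         # median, quartiles, and potential outliers
ax2.set_title("Box plot")
fig.savefig("univariate_graphics.png")
```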
• 3. Multivariate Non-Graphical
• The multivariate non-graphical exploratory data analysis technique is
usually used to show the connection between two or more variables
with the help of either cross-tabulation or statistics.
• 4. Multivariate Graphical
• Graphics are used in multivariate graphical data to show the
relationships between two or more variables.
• Here the outcome depends on more than two variables, while the
change-causing variables can also be multiple.
• Some common types of multivariate graphics include:

• A) Scatter Plot

• The essential graphical EDA technique for two quantitative variables is the scatter plot: one variable appears on the x-axis and the other on the y-axis, with one point for every case in your dataset. This can be used for bivariate analysis.
• B) Multivariate Chart

• A multivariate chart is a type of control chart used to monitor two or more interrelated process variables.
• This is beneficial in situations such as process control, where
engineers are likely to benefit from using multivariate charts.
• These charts allow monitoring multiple parameters together in a
single chart
• C) Run Chart

• A run chart is a data line chart drawn over time. In other words, a run
chart visually illustrates the process performance or data values in a
time sequence.
• Rather than summary statistics, seeing data across time yields a more
accurate conclusion. A trend chart or time series plot is another name
for a run chart.
• D) Bubble Chart

• Bubble charts are scatter plots that display multiple circles (bubbles) in a two-dimensional plot.
• These are used to assess the relationships between three or more
numeric variables.
• In a bubble chart, every single dot corresponds to one data point, and the values of the variables for each point are indicated by its horizontal position, vertical position, dot size, and dot color.
• E) Heat Map

• A heat map is a colored graphical representation of multivariate data structured as a matrix of columns and rows. The heat map transforms the correlation matrix into color coding and represents the coefficients to visualize the strength of correlation among variables.
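A correlation heat map can be drawn with seaborn's heatmap; the DataFrame below is synthetic:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data: "a" and "b" are strongly related, "c" is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
df = pd.DataFrame({"a": x,
                   "b": x + rng.normal(scale=0.3, size=100),
                   "c": rng.normal(size=100)})

# Color-code the correlation matrix; annot writes each coefficient in its cell
sns.heatmap(df.corr(), annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.savefig("corr_heatmap.png")
```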
Data Summarization
• Data summarization is a process of reducing the complexity and
volume of data sets by extracting the most relevant and meaningful
information.
• It can help you perform exploratory data analysis (EDA) and generate
hypotheses for further investigation
• The term Data Summarization can be defined as the presentation of a
summary/report of generated data in a comprehensible and
informative manner.
• To relay information about the dataset, the summary is computed over the entire dataset.
What are data summarization tools?
• Microsoft Excel, Google Sheets, SQL, Tableau, R or Python, and SAS
are some of the most commonly used tools in the data analytics
industry.
• Choosing the right tool for data summarization is crucial for accurate
and efficient data analysis.
Data Summarization in Data Mining: Centrality
Mean: This is used to calculate the numerical average of the set of
values.

Mode: This shows the most frequently repeated value in a dataset.

Median: This identifies the value in the middle of all the values in the
dataset when values are ranked in order.
Data distribution
• Data distribution is a function that specifies all possible values for a
variable and also quantifies the relative frequency (probability of how
often they occur).
• A distribution describes how the values of a variable are scattered across a population.
Binomial Distribution

• Binomial Distribution is simply an extension of Bernoulli distribution.


• If we repeat Bernoulli trials for n times, we will get a Binomial
distribution.
• If we want to model the number of successes in n trials, we use
Binomial Distribution.
• As each unit of Binomial is a Bernoulli trial, the outcome is always
binary.
• The observations are independent of each other.
• The Probability Mass Function is given by P(X = k) = C(n, k) · p^k · (1 − p)^(n − k), where n is the number of trials, k is the number of successes, and p is the probability of success in a single trial.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid", palette="muted")
fig, ax = plt.subplots(figsize=(15, 8))
# 1000 draws of the number of successes in 20 trials with p = 0.5
binomial = np.random.binomial(20, 0.5, 1000)
sns.countplot(x=binomial)
Normal Distribution

• The most common and naturally occurring distribution is the Normal Distribution.
• It is otherwise known as the Gaussian Distribution. There is no field where this distribution is not seen: finance, statistics, chemistry, you name it.
• It is an omnipresent distribution.
• A classic example is the distribution of SAT scores: a large number of students score around the mean.
• As the distance from the mean increases on either side, the probability decreases.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid", palette="cividis")
fig, ax = plt.subplots(figsize=(15, 7))
# 1000 samples from a normal distribution with mean 50 and standard deviation 5
nums = np.random.normal(50, 5, 1000)
sns.histplot(nums, kde=True)  # distplot is deprecated in recent seaborn versions
• Exponential Distribution
• The exponential distribution is often associated with the time elapsed
until some event happens.
• The events within the time interval occur continuously and at an
average constant rate.
• If you know high school chemistry, the time course of a first-order chemical reaction, or the time until a radioactive substance decays, follows an exponential distribution.
• A more general example could be the number of months a car battery lasts.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import expon

sample_space = np.arange(0, 5, 0.05)
sns.set(style="darkgrid", palette="muted")
fig, ax = plt.subplots(figsize=(15, 7))
plt.ylim(0, 1.6)
# The scale parameter is the inverse of the rate parameter (scale = 1 / lambda)
sns.lineplot(x=sample_space, y=expon.pdf(sample_space, scale=2), label="lambda = 0.5 and beta = 2")
sns.lineplot(x=sample_space, y=expon.pdf(sample_space, scale=1), label="lambda = 1 and beta = 1")
sns.lineplot(x=sample_space, y=expon.pdf(sample_space, scale=2/3), label="lambda = 1.5 and beta = 0.67")
plt.ylabel('f(x)')
plt.xlabel('x')
plt.title('Exponential distribution PDF', fontdict={'size': 16})
plt.show()
Outlier Treatment
• One of the most important steps as part of data preprocessing is
detecting and treating the outliers as they can negatively affect the
statistical analysis and the training process of a machine learning
algorithm resulting in lower accuracy.
• What are Outliers?
• We have all heard of the idiom ‘odd one out’, which means something unusual in comparison to the others in a group.

• Similarly, an outlier is an observation in a given dataset that lies far from the rest of the observations. That means an outlier is vastly larger or smaller than the remaining values in the set.
• Why do they occur?
• An outlier may occur due to the variability in the data, or due to
experimental error/human error.

• They may indicate an experimental error or heavy skewness in the data (heavy-tailed distribution).
• What do they affect?
• In statistics, we have three measures of central tendency namely
Mean, Median, and Mode. They help us describe the data.
• Example:
• Consider a small dataset, sample= [15, 101, 18, 7, 13, 16, 11, 21, 5,
15, 10, 9]. By looking at it, one can quickly say ‘101’ is an outlier that
is much larger than the other values.
• Detecting Outliers
• If our dataset is small, we can detect the outlier by just looking at the dataset.
But what if we have a huge dataset, how do we identify the outliers then? We
need to use visualization and mathematical techniques.

• Below are some of the techniques of detecting outliers

• Boxplots
• Z-score
• Interquartile Range (IQR)
• Detecting outliers using Boxplot:
• Python code for boxplot is:
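A minimal sketch, using matplotlib's boxplot on the sample from the earlier example, together with the 1.5 × IQR rule that the boxplot's whiskers are based on:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# The small sample from the earlier example, containing the outlier 101
sample = [15, 101, 18, 7, 13, 16, 11, 21, 5, 15, 10, 9]

plt.boxplot(sample)  # the outlier shows up as a point beyond the upper whisker
plt.title("Boxplot of sample")
plt.savefig("sample_boxplot.png")

# The same rule the boxplot uses: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
outliers = [x for x in sample if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(outliers)  # [101]
```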
Mean
• The "average" number; found by adding all data points and dividing
by the number of data points.
• Mean is used to find the average value around which your data values
range.
• Generally, when working with data, you may want to know the
average data value.
• This will give you a term that incorporates every data value from the
dataset.
• The mean (aka the arithmetic mean, different from the geometric mean) of a dataset is the sum of all values divided by the total number of values.
• It is the most commonly used measure of central tendency and is often referred to as the “average.”
Now, you will understand the mean with the help of an example.
Consider a class of 12 students who have obtained marks out of 50 in mathematics.
Since there are 12 data points, all you have to do is add up each student's marks and divide the result by 12.

Doing so gives a mean of 37. This means that, on average, a student belonging to the above class will score 37 out of 50 in mathematics.
What Is Mode?

• The Mode refers to the most frequently occurring value in your data.
• You find the frequency of occurrence of each number and the
number with the highest frequency is your mode.
• If there are no recurring numbers, then there is no mode in the data.
• Using the mode, you can find the most commonly occurring point in
your data.
• This is helpful when you have to find the central tendency of
categorical values, like the flavor of the most popular chip sold by a
brand.
• You cannot find the average based on the orders; instead, you choose
the chip flavor with the highest orders.
Over here, the value 35 occurs the most frequently and hence is the mode. But what if the values are grouped into class intervals? In that case, you must use the formula below:

Mode = l + ((f1 − f0) / (2·f1 − f0 − f2)) × h

Where,

l = lower limit of the modal class

h = size of the class interval

f1 = frequency of the modal class

f0 = frequency of the class preceding the modal class

f2 = frequency of the class succeeding the modal class

What Is Median?

• Median refers to the middle value of your data.


• To find the median, you first sort the data in either ascending or
descending order and then find the numerical value present in the
middle of your data.
• The median refers to the middle value of your data.
• You can use the median to figure out the point around which your
data is centered.
• It divides the data into two halves and has the same number of data
points above and below.
Now, use the same example of a class of 12 students and their marks in
mathematics and find the median of this data.
• To find the middle term, you first have to sort the data or arrange the
data in ascending or descending order.
• This ensures that consecutive terms are next to each other.
So, the middle term in the range of marks is 37.
This means that half the marks lie below 37 and half lie above it.
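The three measures can be computed together with Python's statistics module; the marks below are hypothetical values chosen so the summary statistics match this example (mean 37, median 37, mode 35):

```python
import statistics as st

# Hypothetical marks (out of 50) for 12 students, chosen so the summary
# statistics match this example: mean 37, median 37, mode 35
marks = [30, 32, 35, 35, 35, 36, 38, 39, 40, 41, 41, 42]

print("mean:  ", st.mean(marks))    # 37
print("median:", st.median(marks))  # 37.0
print("mode:  ", st.mode(marks))    # 35
```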
Skewness and Kurtosis
• Skewness is a measure of the symmetry or asymmetry of a data distribution, and kurtosis measures whether the data is heavy-tailed or light-tailed relative to a normal distribution.
• Data can be positively skewed (the tail stretches towards the right side) or negatively skewed (the tail stretches towards the left side).
What are the three types of skewness?
• Right skew (also called positive skew).
• A right-skewed distribution is longer on the right side of its peak than
on its left.
• Left skew (also called negative skew).
• A left-skewed distribution is longer on the left side of its peak than on
its right.
• Zero skew. A distribution with zero skew is symmetric: both sides of its peak are mirror images.
• 1. Positively Skewed:
• In a distribution that is positively skewed, the values are more concentrated towards the left side, and the right tail is spread out.
• Hence, the statistical results are pulled towards the right-hand side: the long right tail drags the mean furthest to the right, followed by the median.
• In this distribution, Mean > Median > Mode.
Positively Skewed
2. Negatively Skewed:
• In a negatively skewed distribution, the data points are more concentrated towards the right-hand side, and the left tail is spread out.
• The long left tail drags the mean furthest to the left, followed by the median.
• In this distribution, Mode > Median > Mean.
Negatively Skewed
What Is a Normal Distribution?
• A normal distribution is a continuous probability distribution for a
random variable.
• A random variable is a variable whose value depends on the outcome
of a random event.
• For example, flipping a coin will give you either heads or tails at
random.
• You cannot determine with absolute certainty if the following
outcome is a head or a tail.
Normal Distribution
