Unit 1 - Intro to EDA
INTRODUCTION
TO EXPLORATORY DATA
ANALYSIS (EDA)
Josie C. Calfoforo
CONTENT:
Introduction to EDA
Importance of EDA
Data Types
Visualization
1. INTRODUCTION TO EDA
Detecting Handling
Data preparation
Problem definition
Development and
Finding the hidden correlations and relationships among the data, developing predictive models, evaluating the models, and calculating accuracies (e.g. tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models).
Presenting the results to the target audience in the form of graphs, summary tables, maps, and diagrams. Results obtained from the dataset should be interpretable by the business stakeholders. Common graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.
Steps for EDA
1. Importing the required libraries for EDA
2. Loading the data into the dataframe
3. Checking the types of data
4. Dropping irrelevant columns
5. Renaming the columns
6. Dropping the duplicate rows
7. Dropping the missing or null values
8. Detecting Outliers
9. Plot different features against one another
(scatter), against frequency (histogram)
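Steps 1 to 7 above can be sketched with pandas roughly as follows. The small in-memory table, its column names (`Make`, `HP`, `Price`, `Notes`), and the decision that `Notes` is the irrelevant column are all invented for illustration; a real analysis would typically load a file with `pd.read_csv()` instead. Outlier detection and plotting (steps 8 and 9) are left out of this sketch.

```python
# 1. Import the required libraries
import numpy as np
import pandas as pd

# 2. Load the data into a dataframe (hypothetical toy data instead of a file)
df = pd.DataFrame({
    "Make":  ["A", "B", "B", "C", "D"],
    "HP":    [120.0, 150.0, 150.0, np.nan, 200.0],
    "Price": [20000, 25000, 25000, 18000, 40000],
    "Notes": ["x", "y", "y", "z", "w"],
})

# 3. Check the types of data
print(df.dtypes)

# 4. Drop irrelevant columns (assumed here to be "Notes")
df = df.drop(columns=["Notes"])

# 5. Rename the columns
df = df.rename(columns={"Make": "make", "HP": "horsepower", "Price": "price"})

# 6. Drop the duplicate rows (the two identical "B" rows collapse into one)
df = df.drop_duplicates()

# 7. Drop rows with missing or null values (the "C" row has no horsepower)
df = df.dropna()

print(df.shape)
```

Dropping rows is the simplest way to handle duplicates and nulls, but it discards data; whether that is acceptable depends on the dataset.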
Visualization
❑ Univariate: Looking at one variable/column at a time. The goal is to describe and summarize the data and to analyze the patterns present in it, exploring each variable in the dataset separately. It applies to two kinds of variables: categorical and numerical.
▪ Summary measures: central tendency (mean, median, and mode), dispersion (range, variance, and standard deviation), and quartiles (interquartile range).
▪ Univariate data can be described through:
➢ Frequency Distribution Tables
➢ Bar charts
➢ Histograms
➢ Boxplot
➢ Pie charts
➢ Frequency Polygons
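The univariate summary measures listed above can be computed with pandas as in the sketch below. The sample values and the name `score` are made up; note that pandas uses the sample (n - 1) definitions for variance and standard deviation by default.

```python
import pandas as pd

# Hypothetical numerical column
values = pd.Series([2, 4, 4, 5, 7, 9, 11], name="score")

summary = {
    # Central tendency
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),   # a list, because there can be ties
    # Dispersion
    "range": values.max() - values.min(),
    "variance": values.var(),         # sample variance (ddof=1)
    "std": values.std(),              # sample standard deviation (ddof=1)
    # Quartiles
    "iqr": values.quantile(0.75) - values.quantile(0.25),
}

# Frequency distribution table: how often each value occurs
freq_table = values.value_counts().sort_index()

print(summary)
print(freq_table)
```

For a categorical column, `value_counts()` gives the same kind of frequency table, which is the usual input for a bar chart or pie chart.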
Visualization
❑ Bivariate: Looking at two different variables
▪ Types of bivariate analysis:
▪ Numerical Variables (Numerical-Numerical)
➢ Scatter plot
➢ Pie plot
➢ Heatmap (seaborn)
Visualization
❑ Bivariate: Looking at two different variables
▪ Types of bivariate analysis:
o Numerical Variables (Numerical-Numerical)
➢ Scatter plot
➢ Linear correlation
o Bivariate Analysis of two categorical Variables
(Categorical-Categorical)
➢ Chi-square Test
o Bivariate Analysis of one numerical and one
categorical variable (Numerical-Categorical)
➢ Z-test and t-test
➢ Analysis of Variance (ANOVA)
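As a rough illustration of the three bivariate cases, the sketch below uses a made-up table (`hours`, `score`, `group`, `passed` are hypothetical columns). It computes the Pearson correlation for two numerical variables, a contingency table with the chi-square statistic worked out by hand for two categorical variables, and per-group means for the numerical-categorical case. It stops at the statistics and does not compute p-values; `scipy.stats` offers `chi2_contingency`, `ttest_ind`, and `f_oneway` for the full tests.

```python
import pandas as pd

# Hypothetical data
df = pd.DataFrame({
    "hours":  [1, 2, 3, 4, 5, 6],
    "score":  [52, 55, 61, 64, 70, 73],
    "group":  ["a", "a", "a", "b", "b", "b"],
    "passed": ["no", "no", "yes", "no", "yes", "yes"],
})

# Numerical-Numerical: Pearson linear correlation coefficient
r = df["hours"].corr(df["score"])

# Categorical-Categorical: contingency table and chi-square statistic
observed = pd.crosstab(df["group"], df["passed"])
obs = observed.to_numpy()
row_totals = obs.sum(axis=1).reshape(-1, 1)
col_totals = obs.sum(axis=0).reshape(1, -1)
expected = row_totals * col_totals / obs.sum()   # counts expected under independence
chi2 = ((obs - expected) ** 2 / expected).sum()  # no continuity correction applied

# Numerical-Categorical: group means, the quantities a t-test or ANOVA compares
group_means = df.groupby("group")["score"].mean()

print(r, chi2)
print(group_means)
```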
Visualization
❑ Multivariate: Looking at three or more variables
➢ Cluster Analysis
➢ Factor Analysis
➢ Multiple Regression Analysis
➢ Principal Component Analysis
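Of the multivariate techniques listed, principal component analysis is the easiest to show in a few lines. The sketch below is one possible way to do it, using NumPy's singular value decomposition on centered data; the three synthetic features are an assumption made for illustration, built so that they mostly share a single underlying factor.

```python
import numpy as np

rng = np.random.default_rng(0)
t = rng.normal(size=200)

# Three hypothetical features driven mostly by the same factor t, plus small noise
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    -t + 0.1 * rng.normal(size=200),
])

# Center each column, then decompose
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance captured by each principal component, and its share of the total
explained_variance = S ** 2 / (len(X) - 1)
explained_ratio = explained_variance / explained_variance.sum()

# Data projected onto the principal components
scores = Xc @ Vt.T

print(explained_ratio)
```

Because the features are nearly collinear by construction, the first component should account for almost all of the variance; on real data the split is usually less extreme.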
Data Visualization
1. Bar Chart
Syntax: sns.barplot()
2. Pie
Pie plot: Proportional representation of the numerical data in a column.
Syntax: dataframe.plot.pie(y='column_name')
3. Histogram
➢ Representation of the distribution of data.
Syntax: dataframe.hist()
4. Scatter Plot
5. Heatmap
Syntax: seaborn.heatmap()
sns.heatmap(dataframe.corr(), annot=True)
6. Box Plot
❑ Shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable.
Syntax: seaborn.boxplot()
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.
scale: The method used to scale the width of each violin. This parameter belongs to seaborn.violinplot(), not seaborn.boxplot().
9. Bubble Plot
❑ The value of an additional numeric variable is represented through the size of the dots.
❑ Needs three numerical variables as input: one is represented by the X axis, one by the Y axis, and one by the dot size.
10. 3D Scatter Plot
❑ Plots data points on three axes in an attempt to show the relationship between three variables.
❑ Each row in the data table is represented by a marker whose position depends on its values in the columns set on the X, Y, and Z axes.
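The slides give no syntax for the bubble plot and the 3D scatter plot, so the following is only one possible sketch, using Matplotlib directly rather than seaborn. The dataframe, its column names `x`, `y`, `z`, and the size multiplier of 40 are arbitrary choices made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; not needed in a notebook
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401  (registers "3d" on older Matplotlib)
import pandas as pd

# Hypothetical data: three numerical variables
df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [10, 20, 15, 30],
    "z": [5, 3, 8, 1],
})

fig = plt.figure(figsize=(10, 4))

# Bubble plot: x and y set the position, z sets the dot size
ax_bubble = fig.add_subplot(1, 2, 1)
ax_bubble.scatter(df["x"], df["y"], s=df["z"] * 40, alpha=0.5)
ax_bubble.set_xlabel("x")
ax_bubble.set_ylabel("y")
ax_bubble.set_title("Bubble plot (size = z)")

# 3D scatter plot: each row becomes one marker at (x, y, z)
ax_3d = fig.add_subplot(1, 2, 2, projection="3d")
ax_3d.scatter(df["x"], df["y"], df["z"])
ax_3d.set_xlabel("x")
ax_3d.set_ylabel("y")
ax_3d.set_zlabel("z")
ax_3d.set_title("3D scatter plot")
```

In an interactive session, `plt.show()` displays the figure; `fig.savefig(...)` writes it to a file.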
Outlier detection
● An outlier is a data point, or set of data points, that lies far from the rest of the values in the dataset.
● Outliers can often be identified by visualizing the data.
● For example:
○ In a boxplot, the data points that lie outside the upper and lower bounds can be considered outliers.
○ In a scatterplot, the data points that lie outside the groups of data points can be considered outliers.
Outlier Removal
● Calculate the interquartile range (IQR) as follows:
➢ Calculate the first and third quartiles (Q1 and Q3).
➢ Calculate the interquartile range, IQR = Q3 - Q1.
➢ Find the lower bound, which is Q1 - 1.5 * IQR.
➢ Find the upper bound, which is Q3 + 1.5 * IQR.
➢ Replace the data points that lie outside this range.
➢ They can be replaced by the mean or the median.
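The IQR procedure can be sketched with pandas as below, assuming the conventional fences of 1.5 * IQR beyond the quartiles. The sample values are made up, and the median is used as the replacement here; the mean would work the same way, although it is itself pulled toward the outliers.

```python
import pandas as pd

# Hypothetical column; 95 is an obvious outlier
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0, 10.0])

# First and third quartiles, and the interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Lower and upper bounds
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the bounds and replace them with the median
is_outlier = (values < lower) | (values > upper)
cleaned = values.mask(is_outlier, values.median())

print(lower, upper)
print(cleaned.tolist())
```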