Unit 1

UNIT 1: INTRODUCTION TO EXPLORATORY DATA ANALYSIS (EDA)
Josie C. Calfoforo
CONTENT:
Introduction to EDA
Importance of EDA
Data Types
Python Packages for EDA
Steps for Exploratory Data Analysis
Visualization
1. INTRODUCTION TO EDA
❖ Exploratory Data Analysis (EDA) is the critical process of performing
initial investigations on data to discover patterns, spot anomalies,
test hypotheses, and check assumptions with the help of summary
statistics and graphical representations.
❖ EDA is primarily used to provide a better understanding of a dataset's
variables and their relationships.
❖ Developed by the American mathematician John Tukey in the 1970s, EDA
techniques continue to be widely used in the data exploration process
today.
❖ EDA is an approach to data analysis that uses a variety of techniques
to gain insights about the data.
Basic steps in any exploratory data analysis:
• Cleaning and preprocessing
• Statistical analysis
• Visualization for trend analysis, anomaly detection, and outlier
detection (and removal)
2. IMPORTANCE OF EDA
➢Identifying the most important variables/features in
your dataset.
➢Testing a hypothesis or checking assumptions related
to the dataset.
➢To check the quality of data for further processing and
cleaning.
➢Deliver data-driven insights to business stakeholders.
➢Verify expected relationships that exist in the data.
➢To find unexpected structure or insights in the data.
2. IMPORTANCE OF EDA
➢Improve understanding of variables by extracting the mean, minimum,
and maximum values, etc.
➢Discover errors, outliers, and missing values in the data.
➢Identify patterns by visualizing data in graphs such as bar graphs,
scatter plots, heatmaps, and histograms.
CRISP-DM
Credit to Mark Muir for the image, taken from
https://blogs.sap.com/2018/08/28/sap-machine-learning-approaching-your-project/
Two Categories of Data
Unstructured data is most often categorized as qualitative data, and it
cannot be processed and analyzed using conventional tools and methods.
Examples of unstructured data include text, video, audio, mobile
activity, social media activity, satellite imagery, and surveillance
imagery.
Structured data are the typical rectangular data frames or tables: a
repeated data pattern with a predefined number of fields. They are easy
to query, sort, and process. Examples: relational databases, time
series, JSON files, CSVs, and Excel files.
Data Types
Data Types and Scales of Measurement
Structured Data Types
Categorical - Any data that is non-numeric.
➢ Ordinal - Have a set order, but the interval between measurements is
not meaningful (e.g. rating happiness on a scale of 1-10, economic
status, etc.).
➢ Binary - Have only two values (e.g. Male or Female).
➢ Nominal - Have no set order and thus only give names or labels to
various categories (e.g. countries).
Numerical - Data in the form of numbers.
➢ Continuous - Can take any value within a range (e.g. height).
➢ Interval - Have meaningful intervals between measurements, but
there is no true starting point (zero). E.g. temperature in degrees
Celsius, calendar dates, etc.
➢ Ratio - Have meaningful intervals and a true zero, which makes this
the highest level of measurement (e.g. height, weight, length, etc.).
➢ Discrete - Take only countable, separate values (e.g. days in the
month).
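These scales can be mirrored in pandas. The following is a minimal sketch, assuming a small made-up table; the column names and values are invented for illustration and are not part of the original slides.

```python
import pandas as pd

# Hypothetical survey data, invented for illustration.
df = pd.DataFrame({
    "country": ["PH", "JP", "PH", "US"],                 # nominal
    "is_member": [True, False, True, True],              # binary
    "satisfaction": ["low", "high", "medium", "high"],   # ordinal
    "height_cm": [160.5, 172.0, 158.2, 181.3],           # continuous
    "visits": [3, 1, 4, 2],                              # discrete
})

# Nominal: unordered categories.
df["country"] = df["country"].astype("category")

# Ordinal: categories with an explicit order, so comparisons make sense.
levels = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
df["satisfaction"] = df["satisfaction"].astype(levels)

print(df.dtypes)

highest = df["satisfaction"].max()
at_least_medium = df[df["satisfaction"] >= "medium"]
```

Declaring the order explicitly is what lets pandas compare and sort ordinal values; a plain string column would sort alphabetically instead.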
Python Packages for EDA
Descriptive Statistics (Pandas)
❑ Used to make preliminary assessments about the population distribution
of the variable.
Commonly used statistics:
1. Central tendency:
Mean - The average value of all the data points: dataframe.mean()
Median - The middle value when all the data points are put in an ordered
list: dataframe.median()
Mode - The data point which occurs the most in the dataset:
dataframe.mode()
2. Spread: A measure of how far the data points are from the mean or
median.
Variance - The mean of the squared deviations from the mean:
dataframe.var() (by default pandas divides by n - 1, which gives the
sample variance)
Standard deviation - The square root of the variance: dataframe.std()
3. Skewness: A measure of asymmetry: dataframe.skew()
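A minimal sketch of these statistics on a single column, assuming a tiny made-up dataset (the `score` column and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical scores, invented for illustration.
df = pd.DataFrame({"score": [2, 4, 4, 5, 10]})

mean = df["score"].mean()
median = df["score"].median()
mode = df["score"].mode()               # a Series, because ties are possible

variance = df["score"].var()            # sample variance (divides by n - 1)
pop_variance = df["score"].var(ddof=0)  # population variance (divides by n)
std = df["score"].std()                 # square root of the sample variance

skew = df["score"].skew()               # positive here: long right tail

print(mean, median, mode.tolist(), variance, pop_variance, std, skew)
```

The `ddof` argument is worth noticing: the two variance results differ on small samples, so it helps to state which one is being reported.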
Descriptive Statistics (Pandas)
Other methods to get a quick look at the data:
❑ describe(): Summarizes the central tendency, dispersion, and shape of
a dataset's distribution, excluding NaN values.
Syntax: pandas.DataFrame.describe()
❑ info(): Prints a concise summary of the dataframe, including the index
dtype and columns, non-null values, and memory usage.
Syntax: pandas.DataFrame.info()
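A short sketch of both methods, assuming a small made-up dataframe with one numeric and one text column (names and values are invented for illustration):

```python
import io
import pandas as pd

# Hypothetical records, invented for illustration.
df = pd.DataFrame({
    "age": [23, 35, None, 41],
    "city": ["Iloilo", "Manila", "Cebu", None],
})

# By default describe() summarizes numeric columns only.
summary = df.describe()
print(summary)

# include="all" adds non-numeric columns (count, unique, top, freq).
full_summary = df.describe(include="all")

# info() prints to stdout; a buffer captures the text instead.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()
print(info_text)
```

Note that the `count` row reports non-null values, so it doubles as a quick check for missing data.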
Null values
Detecting null values:
• isnull(): An alias for dataframe.isna(). This function returns a
dataframe of boolean values indicating missing values.
• Syntax: dataframe.isnull()
Handling null values:
• Dropping the rows with null values: the dropna() function is used to
delete rows or columns with null values.
• Replacing missing values: the fillna() function can fill the missing
values with a special value like the mean or median.
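A minimal sketch of detecting and handling nulls, assuming a made-up dataframe with one missing value per column (names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical measurements, invented for illustration.
df = pd.DataFrame({
    "height": [150.0, None, 170.0, 160.0],
    "weight": [50.0, 60.0, None, 70.0],
})

# Detect: boolean mask of missing cells, then a count per column.
mask = df.isnull()                  # same result as df.isna()
missing_per_column = mask.sum()

# Handle, option 1: drop rows (default) or columns (axis=1) with nulls.
dropped_rows = df.dropna()
dropped_cols = df.dropna(axis=1)

# Handle, option 2: fill each column's nulls with that column's mean.
filled = df.fillna(df.mean())

print(missing_per_column)
print(filled)
```

Dropping is simple but discards data; in this example dropping columns would remove everything, which is why filling is often preferred for sparse gaps.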
Steps for EDA
Problem definition
Defining the main problem of the analysis, defining the main
deliverables, outlining the main roles and responsibilities, obtaining
the current status of the data, defining the timetable, and performing
cost/benefit analysis.
Data preparation
Define the sources of data, define data schemas and tables, understand
the main characteristics of the data, clean the dataset, delete
non-relevant datasets, transform the data, and divide the data into
required chunks for analysis.
Steps for EDA
Data analysis
Summarizing the data, finding the hidden correlations and relationships
among the data, developing predictive models, evaluating the models,
and calculating accuracies (e.g. tables, graphs, descriptive statistics,
inferential statistics, correlation statistics, searching, grouping,
and mathematical models).
Development and representation of the results
Presenting the dataset to the target audience in the form of graphs,
summary tables, maps, and diagrams. Results analyzed from the dataset
should be interpretable by the business stakeholders. Most graphical
analysis techniques include scatter plots, character plots, histograms,
box plots, residual plots, mean plots, and others.
Steps for EDA
1. Importing the required libraries for EDA
2. Loading the data into the dataframe
3. Checking the types of data
4. Dropping irrelevant columns
5. Renaming the columns
6. Dropping the duplicate rows
7. Dropping the missing or null values
8. Detecting Outliers
9. Plot different features against one another
(scatter), against frequency (histogram)
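Steps 1 to 7 of this list can be sketched as follows. This is an illustration under stated assumptions, not a fixed recipe: the CSV content, the column names, and the choice of which column counts as irrelevant are all invented.

```python
import io                      # step 1: import the required libraries
import pandas as pd

# Hypothetical CSV content, invented for illustration.
raw = io.StringIO(
    "ID,Make,Engine HP,Notes\n"
    "1,Toyota,130,ok\n"
    "2,Honda,158,ok\n"
    "2,Honda,158,ok\n"
    "3,Ford,,check\n"
)

df = pd.read_csv(raw)                      # step 2: load into a dataframe
print(df.dtypes)                           # step 3: check the data types

df = df.drop(columns=["Notes"])            # step 4: drop irrelevant columns
df = df.rename(columns={                   # step 5: rename the columns
    "ID": "id",
    "Make": "make",
    "Engine HP": "hp",
})
df = df.drop_duplicates()                  # step 6: drop duplicate rows
df = df.dropna()                           # step 7: drop missing values

print(df)
```

With a real file, `pd.read_csv` would take a path instead of the in-memory buffer used here.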
Visualization
❑ Univariate: Looking at one variable/column at a time.
Derive the data, define and summarize it, and analyze the
pattern present in it. In a dataset, it explores each variable
separately. It applies to both kinds of variables: categorical
and numerical.
▪ Summary measures: central tendency (mean, mode, and median),
dispersion (range, variance, and standard deviation), and
quartiles (interquartile range).
▪ Univariate data can be described through:
➢ Frequency Distribution Tables
➢ Bar charts
➢ Histograms
➢ Boxplot
➢ Pie charts
➢ Frequency Polygons
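A minimal sketch of univariate summaries, assuming one categorical and one numerical column with made-up values (the names `grade` and `hours` are invented for illustration):

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "grade": ["A", "B", "B", "C", "B", "A"],
    "hours": [1, 2, 2, 3, 5, 11],
})

# Categorical variable: frequency distribution table.
freq = df["grade"].value_counts()
rel_freq = df["grade"].value_counts(normalize=True)

# Numerical variable: range and quartiles.
value_range = df["hours"].max() - df["hours"].min()
q1 = df["hours"].quantile(0.25)
q3 = df["hours"].quantile(0.75)
iqr = q3 - q1

print(freq)
print(value_range, q1, q3, iqr)
```

The frequency table is the numeric counterpart of a bar chart, and the quartiles are the values a box plot draws.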
Visualization
❑ Bivariate: Looking at two different variables
▪ Types of bivariate analysis:
▪ Numerical Variables (Numerical-Numerical)
➢ Scatter plot
➢ Pie plot
➢ Heatmap (seaborn)
Visualization
❑ Bivariate: Looking at two different variables
▪ Types of bivariate analysis:
o Numerical Variables (Numerical-Numerical)
➢Scatter plot
➢ Linear correlation
o Bivariate Analysis of two categorical Variables
(Categorical-Categorical)
➢ Chi-square Test
o Bivariate Analysis of one numerical and one
categorical variable (Numerical-Categorical)
➢ Z-test and t-test
➢ Analysis of Variance (ANOVA)
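The three pairings can be explored descriptively with pandas alone. The sketch below assumes a small made-up dataset and stops short of the formal tests named above (chi-square, t-test, ANOVA), which would need a statistics library.

```python
import pandas as pd

# Hypothetical study data, invented for illustration.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 64, 70, 73],
    "group": ["a", "a", "a", "b", "b", "b"],
    "passed": ["no", "no", "yes", "yes", "yes", "yes"],
})

# Numerical-Numerical: Pearson linear correlation coefficient.
r = df["hours"].corr(df["score"])

# Categorical-Categorical: contingency table, the usual input
# to a chi-square test.
table = pd.crosstab(df["group"], df["passed"])

# Numerical-Categorical: compare the numeric variable across groups,
# the comparison that a t-test or ANOVA would formalize.
group_means = df.groupby("group")["score"].mean()

print(r)
print(table)
print(group_means)
```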
Visualization
❑ Multivariate : Looking at three or more variables
▪ Types of multivariate analysis:
➢ Cluster Analysis
➢ Factor Analysis
➢ Multiple Regression Analysis
➢ Principal Component Analysis
Data Visualization
1. Bar Chart
❑ Presents data with rectangular bars with lengths proportional to the
values that they represent.
Syntax: sns.barplot()
2. Pie Plot
❑ Proportional representation of the numerical data in a column.
Syntax: dataframe.plot.pie(y='column_name')
3. Histogram
➢ Representation of the distribution of data.
Syntax: dataframe.hist()
4. Scatter Plot
Shows the data as a collection of points.
Syntax: dataframe.plot.scatter(x='x_column_name', y='y_column_name')
5. Heatmap (sns)
❑ Graphical representation of data using colors to visualize the values
of a matrix.
❑ Use annot to represent the cell values with text.
Syntax: seaborn.heatmap()
Example: sns.heatmap(dataframe.corr(), annot=True)
6. Box Plot
❑ Shows the distribution of quantitative data in a way that facilitates
comparisons between variables or across levels of a categorical
variable.
Syntax: seaborn.boxplot()
Depicts numerical data graphically through their quartiles. The box
extends from the Q1 to Q3 quartile values of the data, with a line
at the median (Q2).
7. Line Plot
❑ Displays data with the help of symbols above a number line showing the
frequency of each value.
❑ Used to organize the data in a simple way and is very easy to
interpret.
8. Violin Plot
Syntax: seaborn.violinplot()
Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.
scale: The method used to scale the width of each violin (renamed
density_norm in seaborn 0.13).
9. Bubble Plot
❑ The value of an additional numeric variable is represented through the
size of the dots.
❑ Needs three numerical variables as input: one is represented by the
X axis, one by the Y axis, and one by the dot size.
10. 3D Scatter Plot
❑ Plots data points on three axes to show the relationship between
three variables.
❑ Each row in the data table is represented by a marker whose position
depends on its values in the columns set on the X, Y, and Z axes.
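Several of the charts above can be drawn with pandas plotting, which uses matplotlib underneath. The sketch below is one possible layout under stated assumptions: the data is made up, the column names are invented, and pandas plotting is used in place of seaborn to keep the example small.

```python
import matplotlib
matplotlib.use("Agg")   # draw off-screen; remove to use an interactive window
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 6],
    "z": [10, 40, 90, 160, 250],
    "label": ["a", "b", "a", "c", "b"],
})

fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Bar chart of category frequencies.
df["label"].value_counts().plot.bar(ax=axes[0, 0], title="Bar chart")

# Histogram of one numerical column.
df["y"].plot.hist(ax=axes[0, 1], bins=5, title="Histogram")

# Scatter plot of two numerical columns.
df.plot.scatter(x="x", y="y", ax=axes[1, 0], title="Scatter plot")

# Bubble plot: a scatter plot whose marker size encodes a third variable.
df.plot.scatter(x="x", y="y", s=df["z"], ax=axes[1, 1], title="Bubble plot")

fig.tight_layout()
# Call plt.show() or fig.savefig(...) to view the result.
```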
Outlier detection
● An outlier is a point or set of data points that lie far from the rest
of the data values of the dataset.
● Outliers can often be identified by visualizing the data.
● For example:
○ In a boxplot, the data points which lie outside the upper and lower
bounds can be considered outliers.
○ In a scatterplot, the data points which lie outside the groups of
data points can be considered outliers.
Outlier Removal
● Calculate the interquartile range (IQR) and the bounds as follows:
➢Calculate the first and third quartiles (Q1 and Q3).
➢Calculate the interquartile range, IQR = Q3 - Q1.
➢Find the lower bound, which is Q1 - 1.5*IQR.
➢Find the upper bound, which is Q3 + 1.5*IQR.
➢Replace the data points which lie outside this range.
➢They can be replaced by the mean or median.
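A minimal sketch of the IQR rule, using the common convention of bounds at Q1 - 1.5*IQR and Q3 + 1.5*IQR. The values are made up, and replacing outliers with the median is only one of the options mentioned above.

```python
import pandas as pd

# Hypothetical measurements with one obvious outlier, invented for
# illustration.
values = pd.Series(
    [10, 12, 12, 13, 12, 11, 14, 13, 15, 102], dtype="float64"
)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the bounds.
is_outlier = (values < lower) | (values > upper)

# Replace flagged points with the median of the original data.
cleaned = values.mask(is_outlier, values.median())

print(lower, upper)
print(cleaned.tolist())
```

The median is used for the replacement because, unlike the mean, it is barely affected by the outlier being replaced.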
REFERENCES:
❖ Hands-On Exploratory Data Analysis. Retrieved from
https://github.com/PacktPublishing/Hands-on-Exploratory-Data-Analysis-with-Python#download-a-free-pdf
❖ https://medium.com/analytics-vidhya/why-you-need-to-explore-your-data-how-you-can-start-13de6f29c8c1
❖ https://github.com/drshahizan/Python_EDA#-notes
❖ https://medium.com/@fareedkhandev/python-programming-in-microsoft-excel-2c88df7633df
❖ https://seaborn.pydata.org/generated/seaborn.heatmap.html
❖ https://www.geeksforgeeks.org/seaborn-heatmap-a-comprehensive-guide/
