
UNIT 1

INTRODUCTION TO EXPLORATORY DATA ANALYSIS (EDA)
Josie C. Calfoforo
CONTENT:
Introduction to EDA

Importance of EDA

Data Types

Python Packages for EDA

Steps for Exploratory Data Analysis

Visualization
1. INTRODUCTION TO EDA

❖ Exploratory Data Analysis (EDA) is the critical process of performing initial investigations on data to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
EDA is primarily used to provide a better understanding of a dataset's variables and their relationships.

Developed by the American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data exploration process today.
EDA is an approach to data analysis that uses a variety of techniques to gain insights about the data.

Basic steps in any exploratory data analysis:
• Cleaning and preprocessing
• Statistical analysis
• Visualization for trend analysis, anomaly detection, and outlier detection (and removal)
2. IMPORTANCE OF EDA
➢ Identify the most important variables/features in your dataset.
➢ Test a hypothesis or check assumptions related to the dataset.
➢ Check the quality of the data for further processing and cleaning.
➢ Deliver data-driven insights to business stakeholders.
➢ Verify expected relationships that exist in the data.
➢ Find unexpected structure or insights in the data.
➢ Improve understanding of variables by extracting summary values such as the mean, minimum, and maximum.
➢ Discover errors, outliers, and missing values in the data.
➢ Identify patterns by visualizing data in graphs such as bar graphs, scatter plots, heatmaps, and histograms.
CRISP-DM

Credit to Mark Muir for the image, taken from https://blogs.sap.com/2018/08/28/sap-machine-learning-approaching-your-project/
Two Categories of Data

Unstructured data is most often categorized as qualitative data, and it cannot be processed and analyzed using conventional tools and methods. Examples include text, video, audio, mobile activity, social media activity, satellite imagery, and surveillance imagery.

Structured data is the typical rectangular data frame or table: a repeated data pattern with a predefined number of fields. It is easy to query, sort, and process. Examples: relational databases, time series, JSON files, CSVs, or Excel files.
Data Types
Data Types and Scales of Measurement
Structured Data Types
Categorical – Any data that is nonnumeric.
➢ Ordinal - Has a set order, but the interval between measurements is not meaningful (e.g. rating happiness on a scale of 1-10, economic status, etc.).
➢ Binary - Has only two values (e.g. Male or Female).
➢ Nominal - Has no set order and thus only gives names or labels to various categories (e.g. countries).
Numerical – Data in the form of numbers.
➢ Continuous - Numbers that can take any value within a range and have no logical end to them (e.g. height).
➢ Interval - Has meaningful intervals between measurements, but there is no true starting point (zero), e.g. temperature, dates, time gap, etc.
➢ Ratio - The highest level of measurement: meaningful intervals and a true zero point (e.g. height, weight, length, etc.).
➢ Discrete - Countable values that have a logical end to them (e.g. days in the month).
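
As a rough sketch of how these scales might be represented in pandas, the example below builds a small made-up table. The column names, values, and the three satisfaction levels are illustrative assumptions, not part of the original material.

```python
import pandas as pd

# Made-up records; every column name and value here is an assumption.
df = pd.DataFrame({
    "country": ["PH", "JP", "PH", "US"],                 # nominal
    "satisfaction": ["low", "high", "medium", "high"],   # ordinal
    "is_member": [True, False, True, True],              # binary
    "height_cm": [160.5, 172.0, 158.2, 180.1],           # continuous (ratio)
    "visits": [3, 1, 4, 2],                              # discrete
})

# Nominal: categories without any order.
df["country"] = df["country"].astype("category")

# Ordinal: categories with an explicit order, so comparisons are meaningful.
levels = pd.CategoricalDtype(["low", "medium", "high"], ordered=True)
df["satisfaction"] = df["satisfaction"].astype(levels)

print(df.dtypes)
print(df["satisfaction"].max())                  # highest level present
print((df["satisfaction"] >= "medium").sum())    # rows at "medium" or above
```

Declaring the order explicitly keeps pandas from sorting the levels alphabetically, which would put "high" before "low".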
Python Packages for EDA
Descriptive Statistics (Pandas)
❑ Used to make preliminary assessments about the population distribution of the variable.
Commonly used statistics:
1. Central tendency:
Mean – The average value of all the data points: dataframe.mean()
Median – The middle value when all the data points are put in an ordered list: dataframe.median()
Mode – The data point which occurs the most in the dataset: dataframe.mode()
2. Spread: A measure of how far the data points are from the mean or median.
Variance – The mean of the squares of the individual deviations from the mean: dataframe.var() (pandas divides by n − 1 by default, which gives the sample variance)
Standard deviation – The square root of the variance: dataframe.std()
3. Skewness: A measure of asymmetry: dataframe.skew()
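
The statistics listed above can be tried on a single column. This is a minimal sketch on made-up scores; the column name `score` is an assumption. It also shows the difference between the default sample variance and the population variance (`ddof=0`).

```python
import pandas as pd

# Made-up scores; the column name is an illustrative assumption.
df = pd.DataFrame({"score": [2, 4, 4, 4, 5, 5, 7, 9]})
col = df["score"]

print("mean:", col.mean())
print("median:", col.median())
print("mode:", col.mode().tolist())     # mode() returns a Series; ties give several values
print("sample variance:", col.var())    # divides by n - 1 (ddof=1) by default
print("population variance:", col.var(ddof=0))  # mean of the squared deviations
print("standard deviation:", col.std())
print("skewness:", col.skew())          # positive here: the tail is on the right
```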
Other methods to get a quick look at the data:

❑ describe(): Summarizes the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.
Syntax: pandas.DataFrame.describe()

❑ info(): Prints a concise summary of the dataframe, including the index dtype and columns, non-null values, and memory usage.
Syntax: pandas.DataFrame.info()
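
A short sketch of both calls on a made-up dataframe (the column names and values are assumptions). Note that `describe()` covers only numeric columns by default and that `info()` prints its report rather than returning it.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "age": [23, 35, np.nan, 41],
    "city": ["A", "B", "C", None],
})

summary = df.describe()   # numeric columns only by default; NaN values are excluded
print(summary)

df.info()                 # prints dtypes, non-null counts, and memory usage
```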
Null values

Detecting null values:
• isnull(): An alias for dataframe.isna(). This function returns a dataframe of boolean values indicating missing values.
• Syntax: dataframe.isnull()

Handling null values:
• Dropping the rows with null values: the dropna() function is used to delete rows or columns with null values.
• Replacing missing values: the fillna() function can fill the missing values with a special value such as the mean or median.
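
The detection and handling options can be compared side by side. This is a minimal sketch on made-up measurements; the column names are assumptions, and filling with the median is only one of the possible choices.

```python
import numpy as np
import pandas as pd

# Made-up measurements with one missing value per column.
df = pd.DataFrame({
    "height": [160.0, np.nan, 172.0, 168.0],
    "weight": [55.0, 70.0, np.nan, 62.0],
})

mask = df.isnull()                # same result as df.isna()
print(mask.sum())                 # number of missing values per column

dropped = df.dropna()             # drop every row that contains a null
filled = df.fillna(df.median())   # fill each column with its own median
print(dropped)
print(filled)
```

Dropping keeps only complete rows (two of four here), while filling keeps all rows at the cost of inventing plausible values.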
Steps for EDA
Problem definition: Defining the main problem of the analysis, defining the main deliverables, outlining the main roles and responsibilities, obtaining the current status of the data, defining the timetable, and performing a cost/benefit analysis.

Data preparation: Defining the sources of data, defining data schemas and tables, understanding the main characteristics of the data, cleaning the dataset, deleting non-relevant datasets, transforming the data, and dividing the data into required chunks for analysis.
Data analysis: Summarizing the data, finding the hidden correlations and relationships among the data, developing predictive models, evaluating the models, and calculating accuracies. Techniques used include tables, graphs, descriptive statistics, inferential statistics, correlation statistics, searching, grouping, and mathematical models.

Development and representation of the results: Presenting the dataset to the target audience in the form of graphs, summary tables, maps, and diagrams. Results analyzed from the dataset should be interpretable by the business stakeholders. Common graphical analysis techniques include scatter plots, character plots, histograms, box plots, residual plots, mean plots, and others.
Steps for EDA
1. Importing the required libraries for EDA
2. Loading the data into the dataframe
3. Checking the types of data
4. Dropping irrelevant columns
5. Renaming the columns
6. Dropping the duplicate rows
7. Dropping the missing or null values
8. Detecting Outliers
9. Plotting different features against one another (scatter plot) or against frequency (histogram)
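
Steps 1 to 7 of the list above might look like this in pandas. It is a sketch under stated assumptions: the data is made up and read from an in-memory string instead of a real file, and the column names (`Notes`, `Yr`) are hypothetical.

```python
import io

import pandas as pd   # step 1: import the required libraries

# Stand-in for a CSV file on disk; the contents are made up.
raw = io.StringIO(
    "ID,Make,Yr,Price,Notes\n"
    "1,Ford,2015,12000,ok\n"
    "2,Audi,2018,30000,\n"
    "2,Audi,2018,30000,\n"
    "3,Kia,2016,,ok\n"
    "4,BMW,2019,35000,ok\n"
)

df = pd.read_csv(raw)                    # step 2: load the data into a dataframe
print(df.dtypes)                         # step 3: check the types of data
df = df.drop(columns=["Notes"])          # step 4: drop irrelevant columns
df = df.rename(columns={"Yr": "Year"})   # step 5: rename columns
df = df.drop_duplicates()                # step 6: drop duplicate rows
df = df.dropna()                         # step 7: drop rows with missing values
print(df)
```

The order matters in this example: the duplicate row is removed before the null check, so three of the five original rows remain.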
Visualization
❑ Univariate: Looking at one variable/column at a time. Derive the data, define and summarize it, and analyze the pattern present in it. Each variable in the dataset is explored separately. It applies to two kinds of variables: categorical and numerical.
▪ Summary measures: central tendency (mean, mode, and median), dispersion (range, variance, standard deviation), and quartiles (interquartile range).
▪ Univariate data can be described through:
➢ Frequency Distribution Tables
➢ Bar charts
➢ Histograms
➢ Boxplot
➢ Pie charts
➢ Frequency Polygons
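
Before drawing any chart, the same univariate summaries can be computed directly. The sketch below uses made-up data (the column names `grade` and `hours` are assumptions) to build a frequency table for a categorical column and the range and interquartile range for a numerical one.

```python
import pandas as pd

# Made-up data: one categorical and one numerical column.
df = pd.DataFrame({
    "grade": ["A", "B", "B", "C", "B", "A"],
    "hours": [1, 2, 3, 4, 5, 6],
})

# Categorical: frequency distribution table (counts and proportions).
freq = df["grade"].value_counts()
rel = df["grade"].value_counts(normalize=True)
print(freq)
print(rel)

# Numerical: range and interquartile range.
value_range = df["hours"].max() - df["hours"].min()
q1, q3 = df["hours"].quantile([0.25, 0.75])
print("range:", value_range, "IQR:", q3 - q1)
```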
❑ Bivariate: Looking at two different variables
▪ Types of bivariate analysis:
o Two numerical variables (Numerical-Numerical)
➢ Scatter plot
➢ Linear correlation
➢ Heatmap (seaborn)
o Two categorical variables (Categorical-Categorical)
➢ Chi-square test
o One numerical and one categorical variable (Numerical-Categorical)
➢ Z-test and t-test
➢ Analysis of Variance (ANOVA)
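
Each of the three pairings has a simple descriptive counterpart in pandas. The sketch below uses made-up data with assumed column names; it computes the quantities that the formal tests (chi-square, t-test, ANOVA) would start from, but does not run those tests.

```python
import pandas as pd

# Made-up data; all column names are illustrative assumptions.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [52, 55, 61, 64, 70, 73],
    "group": ["a", "a", "a", "b", "b", "b"],
    "passed": ["no", "no", "yes", "yes", "yes", "yes"],
})

# Numerical-Numerical: Pearson linear correlation.
r = df["hours"].corr(df["score"])
print("correlation:", round(r, 3))

# Categorical-Categorical: contingency table (the input to a chi-square test).
table = pd.crosstab(df["group"], df["passed"])
print(table)

# Numerical-Categorical: per-group means (what a t-test or ANOVA compares).
means = df.groupby("group")["score"].mean()
print(means)
```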
❑ Multivariate : Looking at three or more variables

▪ Types of multivariate analysis:

➢ Cluster Analysis
➢ Factor Analysis
➢ Multiple Regression Analysis
➢ Principal Component Analysis
Data Visualization
1. Bar Chart
❑ Presents data with rectangular bars with lengths proportional to the values that they represent.

Syntax: sns.barplot()
2. Pie Plot
❑ Proportional representation of the numerical data in a column.

Syntax: dataframe.plot.pie(y='column_name')
3. Histogram
➢ Representation of the distribution of data.

Syntax: dataframe.hist()
4. Scatter Plot
Shows the data as a collection of points.

Syntax: dataframe.plot.scatter(x='x_column_name', y='y_column_name')
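
The histogram and scatter plot calls can be tried together with pandas' built-in plotting, which relies on matplotlib. This is a minimal sketch on made-up data; the column names are assumptions, and the figure is written to an in-memory buffer so that no display or output file is required.

```python
import io

import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up measurements; column names are illustrative assumptions.
df = pd.DataFrame({
    "height_cm": [150, 155, 160, 162, 165, 170, 172, 175, 180, 185],
    "weight_kg": [48, 52, 55, 58, 60, 66, 68, 72, 78, 85],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: distribution of one numerical column.
df["height_cm"].plot.hist(bins=5, ax=ax1, title="Histogram")

# Scatter plot: one numerical column against another.
df.plot.scatter(x="height_cm", y="weight_kg", ax=ax2, title="Scatter plot")

fig.tight_layout()
buffer = io.BytesIO()
fig.savefig(buffer, format="png")   # pass a file name instead to save to disk
```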
5. Heatmap (sns)
❑ Graphical representation of data using colors to visualize the values of a matrix.
❑ Use annot to represent the cell values with text.

Syntax: seaborn.heatmap()
Example: sns.heatmap(dataframe.corr(), annot=True)
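
A correlation heatmap could be produced as follows. This is a sketch on made-up numeric columns (the names `x`, `y`, and `z` are assumptions). If the dataframe also holds non-numeric columns, recent pandas versions need `dataframe.corr(numeric_only=True)`.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up numeric data; z is a perfect negative copy of x.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 6],
    "z": [5, 4, 3, 2, 1],
})

corr = df.corr()   # pairwise Pearson correlations

# annot=True writes each cell value as text; vmin/vmax fix the color scale.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation matrix")
```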
6. Box Plot
❑ Shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable.

Syntax: seaborn.boxplot()

Depicts numerical data graphically through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2).
7. Line Plot
❑ Displays data with the help of symbols above a number line showing the frequency of each value.
❑ Used to organize the data in a simple way and is very easy to interpret.
8. Violin Plot
Syntax: seaborn.violinplot()

Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.
scale: The method used to scale the width of each violin (renamed density_norm in seaborn 0.13).
9. Bubble Plot
❑ The value of an additional numeric variable is represented through the size of the dots.
❑ Needs 3 numerical variables as input: one is represented by the X axis, one by the Y axis, and one by the dot size.
10. 3D Scatter Plot
❑ Plots data points on three axes in an attempt to show the relationship between three variables.
❑ Each row in the data table is represented by a marker whose position depends on its values in the columns set on the X, Y, and Z axes.
Outlier detection
● An outlier is a point or set of data points that lies far from the rest of the data values in the dataset.
● Outliers can often be identified by visualizing the data.
● For example:
○ In a boxplot, the data points that lie outside the upper and lower bounds can be considered outliers.
○ In a scatterplot, the data points that lie outside the groups of data points can be considered outliers.
Outlier Removal
● Use the interquartile range (IQR) as follows:
➢ Calculate the first and third quartiles (Q1 and Q3).
➢ Calculate the interquartile range, IQR = Q3 − Q1.
➢ Find the lower bound, which is Q1 − 1.5 × IQR.
➢ Find the upper bound, which is Q3 + 1.5 × IQR.
➢ Replace the data points that lie outside this range, for example with the mean or median.
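
The procedure can be sketched in pandas using the usual 1.5 × IQR fences, that is, a lower bound of Q1 − 1.5 × IQR and an upper bound of Q3 + 1.5 × IQR. The values are made up, and replacing outliers with the median is one option among several (dropping them is another).

```python
import pandas as pd

# Made-up values with one obvious outlier (60).
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 60])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # lower fence
upper = q3 + 1.5 * iqr   # upper fence

is_outlier = (values < lower) | (values > upper)
print("bounds:", lower, upper)
print("outliers:", values[is_outlier].tolist())

# Replace the flagged points with the median of the column.
cleaned = values.mask(is_outlier, values.median())
print(cleaned.tolist())
```

The median is used for the replacement because, unlike the mean, it is barely affected by the outlier being replaced.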
