
FOUNDATION TO DATA SCIENCE

Business Analytics

Unit 1: BASIC STATISTICS REFRESHER AND HOW TO EXPLORE DATA
Day 5: Exploratory Data Analysis

Prof. Dr. George Mathew
B.Sc., B.Tech, PGDCA, PGDM, MBA, PhD
EXPLORATORY DATA ANALYSIS
Exploratory data analysis is both a data analysis
perspective and a set of techniques.
We will present unique and conventional techniques
including graphical and tabular devices to visualize the
data.
In exploratory data analysis (EDA) the researcher has
the flexibility to respond to the patterns revealed in the
preliminary analysis of the data. Thus, patterns in the
collected data guide the data analysis or suggest
revisions to the preliminary data analysis plan. This
flexibility is an important attribute of this approach.
Data Exploration, Examination, and Analysis in the
Research Process
Exploratory data analysis techniques
Frequency Tables, Bar Charts, and Pie Charts
Nominal Displays of Data (Minimum Age for Social Networking)
Histograms
The histogram is a conventional solution for the display of interval-ratio
data.
Histograms are used when it is possible to group the variable’s values
into intervals.
Histograms are constructed with bars (or asterisks) that represent data
values, where each value occupies an equal amount of area within the
enclosed area.
Data analysts find histograms useful for
(1) displaying all intervals in a distribution, even those without
observed values, and
(2) examining the shape of the distribution for skewness,
kurtosis, and the modal pattern.
When looking at a histogram, one might ask: Is there a single hump (a
mode)? Are subgroups identifiable when multiple modes are present?
Are straggling data values detached from the central concentration?
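The points above can be tried out with a minimal text histogram. This is a sketch only: the ages are invented for illustration, and the helper name `text_histogram`, the starting value, and the interval width are assumptions rather than anything prescribed by the source. It groups values into equal-width intervals, draws one asterisk per value, and keeps intervals that have no observed values.

```python
from collections import Counter


def text_histogram(values, low, width, n_bins):
    """Count values in n_bins equal-width intervals starting at `low`.

    Returns a list of (start, end, count) tuples, one per interval,
    including intervals whose count is zero. Each interval covers
    start <= value < end.
    """
    counts = Counter()
    for v in values:
        idx = int((v - low) // width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return [(low + i * width, low + (i + 1) * width, counts[i])
            for i in range(n_bins)]


# Hypothetical ages, invented for illustration only.
ages = [21, 23, 24, 25, 27, 28, 28, 29, 31, 32, 33, 35, 36, 38, 64]
bins = text_histogram(ages, low=20, width=10, n_bins=5)

for start, end, count in bins:
    # One asterisk per value, so every value occupies the same area.
    print(f"{start:>3}-{end - 1:<3} {'*' * count}")
```

With this invented data the display shows a single hump in the 20s and 30s, two empty intervals (40-49 and 50-59), and one straggling value in the 60s that is detached from the central concentration.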
Pareto Diagrams
In quality management, J. M. Juran first applied this concept by noting that only a
vital few defects account for most problems evaluated for quality and that the
trivial many explain the rest. Historically, this has come to be known as the
80/20 rule: an 80 percent improvement in quality or performance can be expected
by eliminating 20 percent of the causes of unacceptable quality or performance.
The Pareto diagram is a bar chart whose percentages sum to 100 percent. The
data are derived from a multiple-choice, single-response scale; a multiple-choice,
multiple-response scale; or frequency counts of words (or themes) from content
analysis. The respondents’ answers are sorted in decreasing importance, with bar
height in descending order from left to right. The pictorial array that results
reveals the highest concentration of improvement potential in the fewest number of
remedies. The cumulative frequency line in this exhibit shows that the top two
problems (the repair did not resolve the customer’s problem, and the product was
returned multiple times for repair) accounted for 80 percent of the perceptions of
inadequate repair service.
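The numbers behind a Pareto diagram can be sketched in a few lines: sort the categories by frequency in descending order, convert each to a percentage of the total, and accumulate. The complaint categories and counts below are hypothetical, chosen only so that the top two categories reach 80 percent as in the description above; they are not the data behind the exhibit. The function name `pareto_table` is likewise an assumption.

```python
def pareto_table(counts):
    """Return rows of (label, count, percent, cumulative_percent),
    sorted by count in descending order."""
    total = sum(counts.values())
    rows = []
    running = 0
    for label, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += n
        rows.append((label, n, 100.0 * n / total, 100.0 * running / total))
    return rows


# Hypothetical complaint counts, invented for illustration only.
complaints = {
    "returned multiple times": 30,
    "staff discourteous": 8,
    "problem not resolved": 50,
    "repair took too long": 12,
}

rows = pareto_table(complaints)
for label, n, pct, cum_pct in rows:
    print(f"{label:<25} {n:>3} {pct:>6.1f}% {cum_pct:>6.1f}%")
```

The first two columns give the bar heights from left to right; the last column is the cumulative frequency line, which always ends at 100 percent.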
Boxplots
The boxplot, or box-and-whisker plot, is another technique used
frequently in exploratory data analysis.

A boxplot reduces the detail of the stem-and-leaf display and
provides a different visual image of the distribution’s location,
spread, shape, tail length, and outliers. Boxplots are extensions of
the five-number summary of a distribution.
This summary consists of the median, the upper and lower
quartiles, and the largest and smallest observations. The median
and quartiles are used because they are particularly resistant
statistics.
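A short sketch of the five-number summary, and of what "resistant" means in practice, follows. The scores are invented for illustration. Quartile conventions differ between textbooks and software; this sketch assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, which may not match the convention used to draw the course exhibits.

```python
import statistics


def five_number_summary(values):
    """Return (smallest, lower quartile, median, upper quartile, largest)."""
    # Assumption: "inclusive" quartiles; other conventions give slightly
    # different quartile values for the same data.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical scores, invented for illustration; 95 is an extreme value.
scores = [12, 15, 17, 19, 22, 24, 25, 28, 95]
# Same data with the extreme value pulled in to 29.
scores_tamed = scores[:-1] + [29]

summary = five_number_summary(scores)
summary_tamed = five_number_summary(scores_tamed)

print(summary, round(statistics.mean(scores), 2))
print(summary_tamed, round(statistics.mean(scores_tamed), 2))
```

Replacing the extreme value changes the mean and the largest observation, but the median and both quartiles stay where they were. That stability is the reason the boxplot is built on them.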
z-scores
A z-score allows us to measure the relative
location of a value in the data set. More
specifically, a z-score helps us determine how
far a particular value is from the mean relative
to the data set’s standard deviation. Suppose
we have a sample of n observations, with the
values denoted by $x_1, x_2, \ldots, x_n$. In
addition, assume that the sample mean, $\bar{x}$, and
the sample standard deviation, $s$, are already
computed. Associated with each value, $x_i$, is
another value called its z-score.
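The standard definition is $z_i = (x_i - \bar{x}) / s$: the distance of a value from the sample mean, expressed in standard deviations. A minimal sketch follows, using invented values and Python's `statistics.stdev`, which computes the sample standard deviation (divisor $n - 1$). The function name `z_scores` is an assumption for illustration.

```python
import statistics


def z_scores(values):
    """Return the z-score of each value: (x_i - mean) / sample std dev."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return [(x - x_bar) / s for x in values]


# Hypothetical sample, invented for illustration only.
sample = [46, 54, 42, 46, 32]
z = z_scores(sample)

for x, zi in zip(sample, z):
    print(f"x = {x:>2}  z = {zi:+.2f}")
```

For this sample the mean is 44 and the standard deviation is 8, so the value 54 has a z-score of +1.25 (1.25 standard deviations above the mean) and 32 has a z-score of -1.50. A z-score of zero marks a value equal to the mean.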
Box Plots
A box plot is a graphical summary of the
distribution of data. A box plot is developed
from the quartiles for a data set. Figure 14 is
a box plot for the home sales data. Here are
the steps used to construct the box plot:
Box Plots
What can we learn from these box plots?
The most expensive houses appear to be in Shadyside and
the cheapest houses in Hamilton. The median home selling
price in Groton is about the same as the median home
selling price in Irving. However, home sales prices in Irving
have much greater variability. Homes appear to be selling
in Irving for many different prices, from very low to very
high. Home selling prices have the least variation in Groton
and Hamilton. Unusually expensive home sales (relative to
the respective distribution of home sales values) have
occurred in Fairview, Groton, and Irving, which appear as
outliers. Groton is the only location with a low outlier, but
note that most homes sell for very similar prices in Groton,
so the selling price does not have to be too far from the
median to be considered an outlier.
