
Exploratory Data Analysis

FIN 580
Fall 2018
Outline
• Standard exploratory data analysis techniques.
• Histogram
• Kernel density
• Bar chart
• Pie chart
• Mosaic Plot
• Boxplot
• QQ plot
Data Set
• We are interested in analyzing the wage for U.S. workers.
• It is based on data taken from Berndt (1991).
• They represent a random subsample of cross-section data originating
from the May 1985 Current Population Survey by the US Census
Bureau, comprising 533 observations and 11 variables.
Variables
• Wage: Wage (in dollars per hour).
• Education: Number of years of education.
• Experience: Number of years of potential work experience (age - education - 6).
• Age: Age in years.
• Ethnicity: Factor with levels "cauc", "hispanic", "other".
• Region: Factor. Does the individual live in the South?
• Gender: Factor indicating gender.
• Occupation: Factor with levels "worker" (tradesperson or assembly line worker), "technical"
(technical or professional worker), "services" (service worker), "office" (office and clerical
worker), "sales" (sales worker), "management" (management and administration).
• Sector: Factor with levels "manufacturing" (manufacturing or mining), "construction", "other".
• Union: Factor. Does the individual work on a union job?
• Married: Factor. Is the individual married?
Summary
• A useful way of gaining a quick overview of a data set is to use the
summary() method for data frames.
• It provides a summary for each of the variables.
• The type of the summary depends on the class of the respective
variable.
> summary(CPS1985)
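To illustrate how the summary depends on the class of each variable, here is a minimal sketch on a small synthetic data frame. The name `toy` and its values are made up for illustration and are not taken from CPS1985.

```r
# Synthetic stand-in data; values are illustrative only, not from CPS1985.
toy <- data.frame(
  wage   = c(5.10, 4.95, 6.67, 4.00, 7.50, 13.07),
  gender = factor(c("female", "female", "male", "male", "male", "male"))
)

# Numeric column: minimum, quartiles, median, mean, and maximum.
num_summary <- summary(toy$wage)
print(num_summary)

# Factor column: frequency count per level.
fac_summary <- summary(toy$gender)
print(fac_summary)

# On the whole data frame, summary() picks the appropriate method per column.
print(summary(toy))
```

With the real data, the same call would be applied to the CPS1985 data frame directly, as on the slide.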
Numerical Variable
> summary(wage)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   5.250   7.780   9.032  11.250  44.500
> mean(wage)
[1] 9.031707
> median(wage)
[1] 7.78
> var(wage)
[1] 26.4287
> sd(wage)
[1] 5.140885
> fivenum(wage)
[1] 1.00 5.25 7.78 11.25 44.50
Histogram
• A histogram is a visual representation of the distribution of a dataset.
• The shape of a histogram is its most obvious and informative
characteristic:
• It allows you to easily see where a relatively large amount of the data
is situated and where there is very little data to be found.
• It provides information about where the middle is in your data
distribution, how close the data lie around this middle and where
possible outliers are to be found.
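A histogram of this kind can be sketched in base R as follows. The data here are simulated (a log-normal sample chosen only because it is right-skewed like wages); `wage_sim` is a hypothetical name, and with the real data one would pass the wage variable instead.

```r
set.seed(1)
# Synthetic right-skewed "wages"; log-normal chosen only for illustration.
wage_sim <- rlnorm(500, meanlog = 2, sdlog = 0.5)

# freq = FALSE puts density (not counts) on the y-axis,
# so the bars have total area 1.
h <- hist(wage_sim, freq = FALSE, breaks = 20,
          main = "Histogram of simulated wages", xlab = "wage")

# Overlay a kernel density estimate for comparison.
lines(density(wage_sim), lwd = 2)
```

Using the density scale rather than counts is what allows the kernel density curve to be drawn on the same axes.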
Graphical Summaries
Histogram
Histogram with Kernel Density
Kernel Density
• As an estimator of the density, the histogram is coarse; a much better
estimator is the kernel density estimator (KDE).
• The estimator takes its name from the so-called kernel function,
denoted here by K, which is a probability density function that is
symmetric about 0.
• The standard normal density function is a common choice for K and
will be used here.
• The kernel density estimator based on $Y_1, Y_2, \ldots, Y_n$ is
$$\hat{f}(y) = \frac{1}{nb} \sum_{i=1}^{n} K\left(\frac{y - Y_i}{b}\right)$$
• $b$ is the bandwidth, which determines the resolution of the estimator.
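The estimator can be coded directly from its definition. The sketch below is one possible implementation, not necessarily what the course used: the helper `kde()` is a hypothetical name, the data are simulated, and the bandwidth comes from R's rule-of-thumb `bw.nrd0()`, which is an assumption made here for convenience.

```r
# Kernel density estimate at the points y, from data Y with bandwidth b,
# using the standard normal density as the kernel K.
kde <- function(y, Y, b) {
  sapply(y, function(y0) sum(dnorm((y0 - Y) / b)) / (length(Y) * b))
}

set.seed(1)
Y <- rnorm(200)                        # synthetic data for illustration
b <- bw.nrd0(Y)                        # rule-of-thumb bandwidth (assumed choice)
grid <- seq(-4, 4, length.out = 201)
f_hat <- kde(grid, Y, b)

# Compare with the built-in estimator, which uses a binned approximation.
d <- density(Y, bw = b, kernel = "gaussian")
plot(d, main = "KDE: density() vs. direct formula")
lines(grid, f_hat, lty = 2)
```

The two curves should nearly coincide. Rerunning with a larger or smaller `b` shows how the bandwidth trades smoothness against resolution.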
KDE
Categorical Variable
• For categorical data, it makes no sense to compute means and
variances.
• Instead one needs a table indicating the frequencies with which the
categories occur.
> summary(occupation)
    worker  technical   services     office      sales management
       155        105         83         97         38         55
> tab <- table(occupation)
> prop.table(tab)
occupation
    worker  technical   services     office      sales management
    0.2908     0.1970     0.1557     0.1820     0.0713     0.1032
Bar Chart
• A bar chart is often used to display frequencies for each category of data.
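A bar chart of the occupation frequencies could be drawn as in the sketch below. The counts are copied from the summary(occupation) output shown earlier; the plotting options (`las`, labels) are illustrative choices.

```r
# Occupation counts as reported in the summary(occupation) output.
counts <- c(worker = 155, technical = 105, services = 83,
            office = 97, sales = 38, management = 55)

# One bar per category; bar height is the frequency.
# barplot() returns the x-positions of the bar midpoints.
mids <- barplot(counts, ylab = "Frequency", las = 2,
                main = "Occupation")
```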
Pie Chart
• A pie chart graphically shows relative frequencies.
• A pie chart is a circle subdivided into slices that represent the categories.
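A pie chart of the same occupation data might look like the following sketch. The counts again come from the summary(occupation) output; the label format is an illustrative choice.

```r
# Occupation counts as reported in the summary(occupation) output.
counts <- c(worker = 155, technical = 105, services = 83,
            office = 97, sales = 38, management = 55)

# Relative frequencies: each slice's angle is proportional to its share.
shares <- counts / sum(counts)
slice_labels <- sprintf("%s (%.1f%%)", names(shares), 100 * shares)

pie(shares, labels = slice_labels, main = "Occupation shares")
```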
Two Categorical Variables
• The relationship between two categorical variables is typically
summarized by a contingency table.
> xtabs(~ gender + occupation, data = CPS1985)
        occupation
gender   worker technical services office sales management
  male      126        53       34     21    21         34
  female     29        52       49     76    17         21
Mosaic Plot
• Mosaic plots give a graphical representation of how the counts in a
contingency table decompose across the categories.
• Counts are represented by rectangles.
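As a sketch, the gender-by-occupation table can be rebuilt from the xtabs() output shown earlier and passed to base R's `mosaicplot()`. Entering the counts by hand is only a device to keep the example self-contained; with the data frame available one would plot the xtabs() result directly.

```r
# Gender-by-occupation counts copied from the xtabs() output.
tab <- as.table(matrix(
  c(126, 53, 34, 21, 21, 34,
     29, 52, 49, 76, 17, 21),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    gender = c("male", "female"),
    occupation = c("worker", "technical", "services",
                   "office", "sales", "management")
  )
))

# Each tile's area is proportional to the corresponding cell count.
mosaicplot(tab, main = "Gender by occupation", las = 2)

# Conditional distribution of occupation within each gender (rows sum to 1).
print(round(prop.table(tab, margin = 1), 3))
```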
Boxplots
• This technique graphs five statistics: the minimum and maximum
observations, and the first, second, and third quartiles.
• The line in the middle of the box is at the median.
• The “whiskers” are the vertical dashed lines extending from the top
and bottom of each box.
• The whiskers extend to the smallest and largest data points whose
distance from the bottom or top of the box is at most 1.5 times the
IQR.
Boxplots
• The ends of the whiskers are indicated by horizontal lines.
• Any points that lie outside the whiskers are called outliers.
• All observations beyond the whiskers are plotted with an “o”.
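The whisker rule can be checked numerically with `boxplot.stats()`, which returns the values a boxplot draws. The sample below is synthetic and chosen so that one point falls beyond the upper whisker. Note that R computes the box edges as hinges, which can differ slightly from other quartile definitions.

```r
# Small synthetic sample with one unusually large value (illustration only).
x <- c(1, 5, 6, 7, 8, 9, 10, 11, 12, 40)

# coef = 1.5 is the multiple of the box height used for the whisker limit.
bs <- boxplot.stats(x, coef = 1.5)

# Lower whisker end, lower hinge, median, upper hinge, upper whisker end.
print(bs$stats)

# Observations beyond the whiskers, plotted individually by boxplot().
print(bs$out)

boxplot(x, main = "Boxplot of a synthetic sample")
```

Here the box runs from 6 to 11, so the whiskers may reach at most 7.5 units beyond it; the value 40 lies outside that range and is flagged.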
Boxplot of Wage
Parallel Boxplots of Wage
• Box plots are useful for visually comparing the “centers” and “spreads” of
multiple data sets.
Parallel Boxplots of log(wage)
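Parallel boxplots on the raw and log scales can be sketched with the formula interface of `boxplot()`. The data frame `sim`, its two groups "A" and "B", and the log-normal parameters are all made up for illustration; with the real data one would group wage by a factor such as gender or occupation.

```r
set.seed(1)
# Synthetic wages for two groups (log-normal, illustration only).
sim <- data.frame(
  wage  = c(rlnorm(150, meanlog = 2.2, sdlog = 0.5),
            rlnorm(150, meanlog = 2.0, sdlog = 0.5)),
  group = factor(rep(c("A", "B"), each = 150))
)

op <- par(mfrow = c(1, 2))   # two panels side by side
# Raw scale: right skew tends to stretch the upper tail.
b_raw <- boxplot(wage ~ group, data = sim, main = "wage")
# Log scale: the distributions look more symmetric and easier to compare.
b_log <- boxplot(log(wage) ~ group, data = sim, main = "log(wage)")
par(op)                      # restore the previous layout
```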
Quantile-Quantile Plot (QQ Plot)
• Quantile-quantile plots are used to compare two distributions.
• If the two data sets are essentially samples from the same
distribution, and the samples are the same size, then the
corresponding ordered values (order statistics) should roughly match
up.
• We can examine this graphically in a quantile-quantile plot, which is
just a scatterplot of the two sorted datasets, i.e. the smallest
observation from each is plotted against the other, then the next
smallest, and so on.
QQ Plot
• If the two distributions are similar, the plot should be approximately a
straight line with slope 1.
• If the two distributions are very different in shape, then no straight
line may fit the plot well.
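A minimal sketch of both uses of a QQ plot follows, with two simulated samples of equal size drawn from the same distribution. The sample sizes and the normal distribution are assumptions made for illustration.

```r
set.seed(1)
x <- rnorm(100)   # synthetic sample 1
y <- rnorm(100)   # synthetic sample 2, same distribution

# With equal sample sizes, the QQ plot is sort(x) plotted against sort(y).
qq <- qqplot(x, y, main = "QQ plot of two samples")
abline(0, 1, lty = 2)   # reference line with slope 1 through the origin

# Comparing a single sample against the normal distribution instead:
qqnorm(x)
qqline(x)
```

Because both samples come from the same distribution, the points should scatter around the dashed line; replacing `y` with, say, a skewed sample would bend the pattern away from it.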
QQ Plot
Scatter Plots
Scatter Plots
The End
Thank You!
