
INSY 662 – Fall 2023

Data Mining and Visualization

Week 2-1: Data Visualization and Exploratory Analysis

September 5, 2023
Elizabeth Han
Today’s Class

▪ Exploratory data analysis

▪ Data visualization basics and common pitfalls

▪ Coding session

Exploratory Data Analysis
▪ Allows deeper understanding of the dataset

▪ Helps specify hypotheses

▪ Adds robustness to predictive models

Exploratory Data Analysis
▪ Common steps
– Understand the list of columns, their data types,
whether they have missing values or not, etc.
– For categorical variables, check the count of
each category
– For numerical variables, check descriptive stats
– Check relationship among variables
– Check patterns of observations
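
The common steps above can be sketched with pandas. This is a minimal sketch under stated assumptions, not the course's own code: the small DataFrame and its column names (make, body_style, horsepower, price) are made up for illustration and are not taken from automobile.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "toyota", "toyota", "bmw"],
    "body_style": ["sedan", "sedan", "wagon", "hatchback", "sedan", None],
    "horsepower": [102, 121, 110, 62, 70, 182],
    "price": [13950.0, 16430.0, np.nan, 5348.0, 6938.0, 30760.0],
})

# Step 1: columns, data types, and missing values.
dtypes = df.dtypes
missing_counts = df.isna().sum()

# Step 2: count of each category for a categorical variable.
make_counts = df["make"].value_counts()

# Step 3: descriptive statistics for numerical variables.
numeric_summary = df[["horsepower", "price"]].describe()

# Step 4: relationship among numerical variables (Pearson correlation).
correlations = df[["horsepower", "price"]].corr()

# Step 5: patterns of observations, e.g. mean price per make.
mean_price_by_make = df.groupby("make")["price"].mean()

print(dtypes)
print(missing_counts)
print(make_counts)
print(numeric_summary)
print(correlations)
print(mean_price_by_make)
```

With a real file, the DataFrame would typically come from pd.read_csv instead of being typed in by hand.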

Data Visualization Basics
▪ Histogram
– For univariate analysis of a continuous variable
– Capture the distribution, skewness, kurtosis

[Figure: histograms with bin size = $500, bin size = $10,000, and bin size = $50,000]
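
How much the bin size changes the picture can be checked numerically. The sketch below is an illustration only: the price values are hypothetical and histogram_counts is a helper name introduced here, built on numpy.histogram.

```python
import numpy as np

# Hypothetical prices in dollars; values are illustrative only.
prices = np.array([5200, 6900, 7300, 7800, 8400, 9100, 9900,
                   12500, 13900, 16400, 18200, 24500, 30700, 41300])

def histogram_counts(values, bin_width):
    """Count observations in equal-width bins that start at zero."""
    upper = (values.max() // bin_width + 1) * bin_width
    edges = np.arange(0, upper + bin_width, bin_width)
    counts, _ = np.histogram(values, bins=edges)
    return counts

for width in (500, 10_000, 50_000):
    counts = histogram_counts(prices, width)
    print(f"bin width ${width:,}: {len(counts)} bins, largest count {counts.max()}")
```

On this toy data, the narrowest bins hold at most one observation each, while the widest bin width collapses every observation into a single bar, so neither extreme shows the shape of the distribution.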

Data Visualization Basics
▪ Boxplots
– For both univariate and bivariate analyses
– Capture the distribution and skewness
– Show outliers
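
The quantities a boxplot draws can be computed directly. The sketch below assumes the common 1.5 x IQR rule for flagging outliers (the default whisker setting in matplotlib); the slides do not state which rule is used, and the prices and the boxplot_summary helper are hypothetical.

```python
import numpy as np

# Hypothetical prices in dollars; values are illustrative only.
prices = np.array([5200, 6900, 7300, 7800, 8400, 9100, 9900,
                   12500, 13900, 16400, 18200, 41300])

def boxplot_summary(values):
    """Return quartiles and the points outside the 1.5 * IQR fences."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

summary = boxplot_summary(prices)
print(summary)
```

A median that sits closer to the first quartile than to the third, as it does here, is the boxplot's hint of right skew.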

Data Visualization Basics
▪ Bar plot
– Serves a similar function to a boxplot

Data Visualization Basics
▪ Scatterplots
– For bivariate analysis
– Show the relationship between variables

Data Visualization Basics
▪ Correlation matrix
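
One possible way to build and draw a correlation matrix is sketched below with pandas and matplotlib. The variable names and values are hypothetical, and the heatmap styling (diverging colormap, annotated cells) is one choice among many, not the method used in the slides.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numerical variables; names and values are illustrative only.
df = pd.DataFrame({
    "horsepower": [62, 70, 102, 110, 121, 182],
    "curb_weight": [1900, 2000, 2300, 2500, 2700, 3100],
    "city_mpg": [35, 31, 24, 23, 21, 16],
    "price": [5348, 6938, 13950, 15250, 16430, 30760],
})

corr = df.corr()  # pairwise Pearson correlations, values in [-1, 1]

fig, ax = plt.subplots(figsize=(5, 4))
# A diverging colormap centered at zero keeps positive and negative distinct.
image = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iloc[i, j]:.2f}", ha="center", va="center")
fig.colorbar(image, ax=ax, label="Pearson correlation")
fig.tight_layout()
fig.savefig("correlation_matrix.png")
```

The matrix is symmetric with ones on the diagonal, so only one triangle carries distinct information.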

Data Visualization Basics
▪ What if we want to see more than two variables?
– Use a color scheme

– Use different sizes or transparencies

Data Visualization Basics
▪ What if we want to see more than two variables?
– Use different shapes

– Show in separate panels
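
The four options above (color, size or transparency, shape, separate panels) can be sketched with matplotlib. Everything in the data is hypothetical, including the column names and the marker mapping; the sketch only shows where each extra variable is plugged in.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; names and values are illustrative only.
df = pd.DataFrame({
    "horsepower": [62, 70, 102, 110, 121, 182],
    "price": [5348, 6938, 13950, 15250, 16430, 30760],
    "city_mpg": [35, 31, 24, 23, 21, 16],
    "curb_weight": [1900, 2000, 2300, 2500, 2700, 3100],
    "body_style": ["hatchback", "sedan", "sedan", "wagon", "sedan", "wagon"],
})
markers = {"hatchback": "o", "sedan": "s", "wagon": "^"}
styles = sorted(df["body_style"].unique())

# One plot: color = city_mpg, size = curb_weight, shape = body_style.
fig_combined, ax_combined = plt.subplots(figsize=(6, 4))
for style in styles:
    group = df[df["body_style"] == style]
    points = ax_combined.scatter(
        group["horsepower"].to_numpy(),
        group["price"].to_numpy(),
        c=group["city_mpg"].to_numpy(),
        cmap="viridis",
        vmin=df["city_mpg"].min(),  # shared limits keep colors comparable
        vmax=df["city_mpg"].max(),
        s=group["curb_weight"].to_numpy() / 20,
        marker=markers[style],
        alpha=0.7,  # transparency helps when points overlap
        label=style,
    )
fig_combined.colorbar(points, ax=ax_combined, label="city_mpg")
ax_combined.legend(title="body_style")
ax_combined.set_xlabel("horsepower")
ax_combined.set_ylabel("price")
fig_combined.savefig("combined_encodings.png")

# Separate panels: one scatterplot per body_style with shared axes.
fig_panels, panel_axes = plt.subplots(
    1, len(styles), figsize=(12, 4), sharex=True, sharey=True
)
for ax, style in zip(panel_axes, styles):
    group = df[df["body_style"] == style]
    ax.scatter(group["horsepower"], group["price"])
    ax.set_title(style)
    ax.set_xlabel("horsepower")
panel_axes[0].set_ylabel("price")
fig_panels.savefig("separate_panels.png")
```

Sharing the axes across panels keeps the groups directly comparable; packing color, size, and shape into one plot is shown for completeness and can become hard to read on larger datasets.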

Data Visualization Pitfalls
▪ Data visualization is a powerful tool for providing
insights into the data

▪ However, if used incorrectly, it can also lead to
misperception or misleading messages

▪ That is because humans are prone to cognitive biases

▪ There are five common ways to create
misleading graphs

Data Visualization Pitfalls
▪ Manipulation of the baseline
– The natural baseline of bar charts should be zero.
– Truncated graph, which uses a different baseline,
can lead to incorrect interpretation.
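
A small worked example makes the effect of a truncated baseline concrete. The two values and the apparent_ratio helper are hypothetical; the function simply compares the drawn bar heights for a given baseline.

```python
def apparent_ratio(larger, smaller, baseline=0.0):
    """Ratio of drawn bar heights when the axis starts at `baseline`."""
    if baseline >= smaller:
        raise ValueError("baseline must be below both values")
    return (larger - baseline) / (smaller - baseline)

# Hypothetical values; illustrative only.
sales_a, sales_b = 100.0, 95.0

true_ratio = apparent_ratio(sales_a, sales_b, baseline=0.0)
truncated_ratio = apparent_ratio(sales_a, sales_b, baseline=90.0)

print(f"zero baseline: bar A looks {true_ratio:.2f}x as tall as bar B")
print(f"baseline at 90: bar A looks {truncated_ratio:.2f}x as tall as bar B")
```

The underlying difference is about 5%, yet with the baseline moved to 90 the first bar is drawn twice as tall as the second.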

Data Visualization Pitfalls
▪ Manipulation of the y-axis
– The y-axis represents the magnitude of the plot
– Stretching or squeezing the y-axis can distort the
perceived variability of the data
– Especially salient for graphs that show changes
over time
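
The effect of the y-axis range can be quantified as the share of the axis height that the data actually occupies. The values, axis limits, and the axis_fill_fraction helper below are hypothetical and only illustrate the arithmetic.

```python
def axis_fill_fraction(values, y_min, y_max):
    """Fraction of the y-axis height covered by the data's range."""
    if y_max <= y_min:
        raise ValueError("y_max must exceed y_min")
    return (max(values) - min(values)) / (y_max - y_min)

# Hypothetical measurements over time; illustrative only.
values = [48.0, 50.0, 49.0, 52.0, 51.0]

wide = axis_fill_fraction(values, 0.0, 100.0)   # axis from 0 to 100
tight = axis_fill_fraction(values, 47.0, 53.0)  # axis hugging the data

print(f"wide axis: data spans {wide:.0%} of the height")
print(f"tight axis: data spans {tight:.0%} of the height")
```

The same series looks almost flat on the wide axis and highly volatile on the tight one, even though the numbers are identical.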

Data Visualization Pitfalls
▪ Manipulation of the timeframe
– In line graphs that intend to show the time trend,
the range of the x-axis can control the narrative
– Readers should identify which trend the graph intends
to show (e.g., short- or long-term) and compare that
with the narrative
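
How the chosen window changes the story can be shown with a short calculation. The monthly series and the percent_change helper are hypothetical; the point is only that the same data supports opposite narratives depending on the x-axis range.

```python
def percent_change(values):
    """Percent change from the first to the last value in the window."""
    return (values[-1] - values[0]) / values[0] * 100

# Hypothetical monthly values; illustrative only.
series = [100, 104, 109, 115, 122, 130, 139, 149, 146, 142, 139, 137]

full_window = percent_change(series)        # all twelve months
recent_window = percent_change(series[-5:])  # last five months only

print(f"full window: {full_window:+.1f}%")
print(f"recent window: {recent_window:+.1f}%")
```

Over the full window the series rises, while the last five months alone show a decline.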

Data Visualization Pitfalls
▪ Manipulation of the graph type
– Different types of graphs have different applications
(e.g., time trend, emphasizing differences)
– Using the “wrong” type of graph can distort the
readers’ perception

Data Visualization Pitfalls
▪ Use of different standards
– People tend to have a “common sense” of how the
data should be visualized
– For example, the color that represents positivity
vs. negativity, the shade that represents the
density, etc.
– Going against this “common sense” can mislead
the audience

Let’s do some coding!
▪ Please download automobile.csv and
from MyCourses.

▪ Assume that you are a data scientist at a new
automobile company.

▪ The company wants to understand the current
automobile market so that the price for its
products can be determined accordingly.

▪ You will be working with a raw dataset to do
pre-processing and exploratory analysis.

