
INSY 662 – Fall 2023

Data Mining and Visualization

Week 2-1: Data Visualization and Exploratory Analysis


September 5, 2023
Elizabeth Han
Today’s Class

▪ Exploratory data analysis

▪ Data visualization basics and common pitfalls

▪ Coding session

Exploratory Data Analysis
▪ Allows deeper understanding of the dataset

▪ Helps specify hypotheses

▪ Adds robustness to predictive models

Exploratory Data Analysis
▪ Common steps
– Understand the list of columns, their data types,
whether they have missing values or not, etc.
– For categorical variables, check the count of
each category
– For numerical variables, check descriptive stats
– Check relationship among variables
– Check patterns of observations
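
The common steps above can be sketched with pandas. This is a minimal sketch on a hypothetical toy table: the column names and values are invented for illustration and are not taken from automobile.csv.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative, not from automobile.csv.
df = pd.DataFrame({
    "body_style": ["sedan", "sedan", "hatchback", "wagon", "sedan", None],
    "horsepower": [110.0, 95.0, None, 120.0, 150.0, 88.0],
    "price": [13500, 11200, 9800, 16400, 24500, 8700],
})

# Step 1: columns, data types, and missing values.
dtypes = df.dtypes
missing_counts = df.isna().sum()

# Step 2: count of each category (dropna=False also counts missing values).
category_counts = df["body_style"].value_counts(dropna=False)

# Step 3: descriptive statistics for the numerical variables.
numeric_summary = df.describe()

# Step 4: relationships among the numerical variables.
correlations = df[["horsepower", "price"]].corr()

# Step 5: one simple pattern check on observations: fully duplicated rows.
duplicate_rows = int(df.duplicated().sum())

print(dtypes, missing_counts, category_counts,
      numeric_summary, correlations, duplicate_rows, sep="\n\n")
```

Duplicate detection is only one possible reading of "patterns of observations"; other checks (e.g., sorting or grouping rows) may fit a given dataset better.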

Data Visualization Basics
▪ Histogram
– For univariate analysis of a continuous variable
– Captures the distribution, skewness, and kurtosis

[Figure: three histogram panels with bin sizes of $500, $10,000, and $50,000]
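
To see why the bin size matters, the sketch below counts the same values with the three bin sizes named on the slide. The price values are synthetic (drawn from a log-normal distribution with a fixed seed), and `histogram_counts` is a hypothetical helper, not part of the course script.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed "price" values in dollars; purely illustrative.
prices = rng.lognormal(mean=10.0, sigma=0.5, size=1_000)

def histogram_counts(values, bin_width):
    """Count values in equal-width bins that start at zero."""
    upper = np.ceil(values.max() / bin_width) * bin_width
    edges = np.arange(0, upper + bin_width, bin_width)
    counts, edges = np.histogram(values, bins=edges)
    return counts, edges

# Same data, three bin widths: small bins look noisy, large bins hide the shape.
results = {}
for width in (500, 10_000, 50_000):
    counts, _ = histogram_counts(prices, width)
    results[width] = counts
    print(f"bin width ${width:,}: {len(counts)} bins, "
          f"tallest bin holds {counts.max()} values")
```

Every value lands in exactly one bin in all three cases; only the resolution of the picture changes.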

Data Visualization Basics
▪ Boxplots
– For both univariate and bivariate analyses
– Capture the distribution and skewness
– Show outliers
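
A boxplot is usually drawn from quartiles, and points beyond the whiskers are flagged as outliers. The sketch below assumes the common 1.5 × IQR whisker convention (the slide does not state which rule is used), and `boxplot_stats` is a hypothetical helper with made-up input values.

```python
import numpy as np

def boxplot_stats(values, whisker=1.5):
    """Quartiles, fences, and outliers under the 1.5 * IQR convention."""
    values = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    # Points outside the fences are the ones a boxplot draws individually.
    outliers = values[(values < lower_fence) | (values > upper_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "lower_fence": float(lower_fence),
        "upper_fence": float(upper_fence),
        "outliers": outliers.tolist(),
    }

# Made-up values (e.g., prices in thousands of dollars) with one extreme point.
stats = boxplot_stats([8, 9, 10, 11, 12, 13, 14, 45])
print(stats)
```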

Data Visualization Basics
▪ Bar plot
– Serves a similar function to a boxplot

Data Visualization Basics
▪ Scatterplots
– For bivariate analysis
– Show the relationship between variables

Data Visualization Basics
▪ Correlation matrix
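
A correlation matrix can be computed directly with pandas. The sketch below uses invented numbers and illustrative column names (not necessarily those in automobile.csv), and relies on `DataFrame.corr`, which computes Pearson correlations by default.

```python
import pandas as pd

# Hypothetical numerical columns; values are invented for illustration.
df = pd.DataFrame({
    "engine_size": [90, 110, 130, 150, 180],
    "horsepower": [70, 95, 110, 140, 175],
    "city_mpg": [38, 32, 28, 24, 19],
    "price": [8000, 11000, 13500, 17000, 25000],
})

# Pairwise Pearson correlations; the result is symmetric with ones on the diagonal.
corr = df.corr()
print(corr.round(2))
```

Each cell summarizes one scatterplot in a single number, so the matrix is a quick way to decide which pairs deserve a closer look.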

Data Visualization Basics
▪ What if we want to see more than two variables?
– Use a color scheme

– Use different sizes or transparencies

Data Visualization Basics
▪ What if we want to see more than two variables?
– Use different shapes

– Show in separate panels
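
The four options above (color, size or transparency, shape, separate panels) can be sketched with matplotlib. The data and column names below are invented for illustration, and the `Agg` backend line is an assumption for running without a display.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; column names and values are illustrative only.
df = pd.DataFrame({
    "horsepower": [70, 95, 110, 140, 175, 88],
    "price": [8000, 11000, 13500, 17000, 25000, 9500],
    "city_mpg": [38, 32, 28, 24, 19, 35],
    "curb_weight": [2000, 2300, 2500, 2800, 3100, 2100],
    "body_style": ["hatchback", "sedan", "sedan", "wagon", "sedan", "hatchback"],
})

# Color encodes a third variable, marker size a fourth, alpha adds transparency.
fig_color, ax_color = plt.subplots(figsize=(5, 4))
points = ax_color.scatter(
    df["horsepower"], df["price"],
    c=df["city_mpg"], s=df["curb_weight"] / 20, cmap="viridis", alpha=0.7,
)
fig_color.colorbar(points, ax=ax_color, label="city_mpg")
ax_color.set_xlabel("horsepower")
ax_color.set_ylabel("price")

# One panel per category, each with its own marker shape.
markers = {"hatchback": "o", "sedan": "s", "wagon": "^"}
fig_panels, panel_axes = plt.subplots(
    1, len(markers), figsize=(10, 3), sharex=True, sharey=True
)
for ax, (style, group) in zip(panel_axes, df.groupby("body_style")):
    ax.scatter(group["horsepower"], group["price"], marker=markers[style])
    ax.set_title(style)
    ax.set_xlabel("horsepower")
panel_axes[0].set_ylabel("price")

# Call fig_color.savefig(...) / fig_panels.savefig(...) or plt.show() to view.
```

Sharing the axes across panels keeps the categories directly comparable.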

Data Visualization Pitfalls
▪ Data visualization is a powerful tool for gaining
insights into the data

▪ However, if used incorrectly, it can also lead to
misperception or misleading messages

▪ That is because humans are prone to cognitive
biases

▪ There are five common ways to create
misleading graphs

Data Visualization Pitfalls
▪ Manipulation of the baseline
– The natural baseline of a bar chart is zero.
– A truncated graph, which uses a nonzero baseline,
can lead to incorrect interpretation.
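
The effect of a truncated baseline can be quantified as the ratio of the drawn bar lengths. The sketch below uses made-up values, and `apparent_ratio` is a hypothetical helper written for this illustration.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar lengths for values b and a when bars start at `baseline`."""
    if min(a, b) <= baseline:
        raise ValueError("both values must lie above the baseline")
    return (b - baseline) / (a - baseline)

# Made-up values: b is 10% larger than a.
honest = apparent_ratio(50, 55)                  # bars start at zero
truncated = apparent_ratio(50, 55, baseline=45)  # bars start at 45

print(f"zero baseline: second bar looks {honest:.2f}x as tall")
print(f"baseline at 45: second bar looks {truncated:.2f}x as tall")
```

With the baseline at 45, a 10% difference is drawn as a bar twice as tall.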

Data Visualization Pitfalls
▪ Manipulation of the y-axis
– The y-axis represents the magnitude of the plotted values
– Stretching or squeezing the y-axis can distort the
perceived variability of the data
– Especially salient for graphs that show changes
over time
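
One rough way to reason about stretching and squeezing is the fraction of the y-axis that the data actually occupy. The sketch below uses made-up readings, and `visual_span_fraction` is a hypothetical helper, not a standard metric.

```python
def visual_span_fraction(values, y_min, y_max):
    """Fraction of the y-axis height covered by the data's own range."""
    if y_max <= y_min:
        raise ValueError("y_max must be greater than y_min")
    return (max(values) - min(values)) / (y_max - y_min)

# Made-up readings that vary by about 4%.
readings = [98, 101, 99, 102, 100]

squeezed = visual_span_fraction(readings, 0, 200)    # wide axis: looks flat
stretched = visual_span_fraction(readings, 97, 103)  # tight axis: looks volatile

print(f"axis 0 to 200:  data fill {squeezed:.0%} of the height")
print(f"axis 97 to 103: data fill {stretched:.0%} of the height")
```

The same five readings look nearly constant on the first axis and highly variable on the second.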

Data Visualization Pitfalls
▪ Manipulation of the timeframe
– In line graphs that intend to show a time trend,
the range of the x-axis can control the narrative
– Readers should identify what trend the graph intends
to show (e.g., short-term or long-term) and compare that
with the narrative
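
The choice of timeframe can flip the apparent trend. The sketch below fits a straight line to a made-up series over the full range and over the last four periods only; `trend_slope` is a hypothetical helper and the numbers are invented.

```python
import numpy as np

def trend_slope(values):
    """Slope of a least-squares line fitted to equally spaced observations."""
    x = np.arange(len(values))
    slope, _intercept = np.polyfit(x, values, deg=1)
    return float(slope)

# Made-up monthly series: long-run growth followed by a recent decline.
monthly_sales = [100, 104, 109, 115, 122, 130, 128, 125, 121]

full_slope = trend_slope(monthly_sales)         # whole x-axis range
recent_slope = trend_slope(monthly_sales[-4:])  # last four periods only

print(f"full range:   {full_slope:+.2f} per period")
print(f"last 4 only:  {recent_slope:+.2f} per period")
```

A graph cropped to the last four periods supports a "declining" narrative, while the full range supports a "growing" one.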

Data Visualization Pitfalls
▪ Manipulation of the graph type
– Different types of graphs have different applications
(e.g., time trend, emphasizing differences)
– Using the “wrong” type of graph can mislead
readers

Data Visualization Pitfalls
▪ Use of a different standard
– Humans tend to have a “common sense” of how
data should be visualized
– For example, the color that represents positivity
vs. negativity, the shade that represents
density, etc.
– Going against this “common sense” can mislead
the audience

Let’s do some coding!
▪ Please download automobile.csv and Week2-1.py
from MyCourses.

▪ Assume that you are a data scientist at a new
automobile company.

▪ The company wants to understand the current
automobile market so that the price for its
products can be determined accordingly.

▪ You will be working with a raw dataset to do
pre-processing and exploratory analysis.
