Data Visualization


3 Visualization
! Visualization is the conversion of data into a visual or
tabular format so that the characteristics of the data
and the relationships among data items or attributes
can be analyzed or reported.
! Visualization of data is one of the most powerful and
appealing techniques for data exploration.
! Humans have a well-developed ability to analyze large
amounts of information that is presented visually.
! Can detect general patterns and trends
! Can detect outliers and unusual patterns
! A picture is worth a thousand words.
[Figure: Dallenbach's Figure]
[Figure: Texture Applied to Colon Data; panels: No texture, Principal Direction]
[Figure: two panels with axes labeled temperature, precipitation, pressure, wind speed]

Sea Surface Temperature July-1982
! Tens of thousands of data points summarized in a single figure




Representation
! Is the mapping of information to a visual format
! Data objects, their attributes, and the relationships
among data objects are translated into graphical
elements such as points, lines, shapes, and colors.
! Example:
! Objects are often represented as points
! Their attribute values can be represented as the position of the
points or the characteristics of the points, e.g., color, size, and shape
! If position is used, then the relationships of points, i.e., whether they
form groups or a point is an outlier, are easily perceived.
Selection
! Elimination or de-emphasis of certain objects and attributes
! May involve choosing a subset of attributes
! Dimensionality reduction is often used to reduce the number of
dimensions to two or three
! Alternatively, pairs of attributes can be considered
! May also involve choosing a subset of objects
! A region of the screen can only show so many points
! Can sample, but want to preserve points in sparse areas
Visualization: Histograms
! A plot that displays the distribution of values for
attributes by dividing the possible values into bins
and showing the number of objects that fall into
each bin (p. 111); a.k.a. frequency histogram
! For categorical data, each value is a bin; if there are
too many possible values, some can be combined.
! For continuous attributes, the range of values is
divided into bins, typically of equal width, and the
values in each bin are counted
! Bar plot with each bin represented by one bar
! Shape of the histogram depends on the number of bins

Visualization: Histograms
! Example: Petal Width (10 and 20 bins, respectively)
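
The effect of the bin count can be tried directly in R. The following is a minimal sketch, assuming R's built-in iris data set stands in for the petal width data on the slide; the break points and the names pw, breaks10, and breaks20 are chosen here for illustration.

```r
# Petal width from R's built-in iris data set (150 flowers).
pw <- iris$Petal.Width

# Explicit break points give exactly 10 and 20 equal-width bins over [0, 2.5].
breaks10 <- seq(0, 2.5, length.out = 11)
breaks20 <- seq(0, 2.5, length.out = 21)

# Draw the two histograms side by side.
old_par <- par(mfrow = c(1, 2))
h10 <- hist(pw, breaks = breaks10, main = "10 bins", xlab = "Petal Width")
h20 <- hist(pw, breaks = breaks20, main = "20 bins", xlab = "Petal Width")
par(old_par)

# Both histograms count the same 150 observations, only grouped differently.
sum(h10$counts)
sum(h20$counts)
```

Explicit break points are used because a single number passed to breaks is only a suggestion that R may adjust.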

Visualization: Histograms
[Figure: Histogram of scores$e1p, drawn twice; x-axis: scores$e1p (0 to 100), y-axis: Frequency (0 to 10)]
Visualization: Frequency Polygon
! Join the top midpoints of the histogram columns
! Place points at the midpoint of each histogram bin
! Add extra points at zero frequency at either or both ends
! Draw lines to connect the points
! Good for understanding the shapes of distributions
! Especially convenient for comparing sets of data
! To compare multiple data sets:
! hide (or delete) the histogram chart
! show only the frequency polygons, in different colors
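
The construction steps can be sketched in R as follows. This is an illustration under stated assumptions, not necessarily the code behind the slide figures: the scores are made up, and the name h1 simply follows the axis labels visible in the figures.

```r
# Hypothetical exam scores (made up for illustration, not the course data).
set.seed(1)
scores <- sample(30:100, 40, replace = TRUE)

# Bin the scores into 10 equal-width bins without drawing the histogram.
h1 <- hist(scores, breaks = seq(0, 100, by = 10), plot = FALSE)

# Polygon vertices: bin midpoints, plus one zero-count point beyond each end.
x <- c(h1$mids[1] - 10, h1$mids, h1$mids[length(h1$mids)] + 10)
y <- c(0, h1$counts, 0)

# Draw only the polygon (no bars); type = "o" shows points joined by lines.
plot(x, y, type = "o", col = "blue",
     xlab = "Score", ylab = "Frequency", main = "Frequency Polygon")
```

To compare a second data set, its polygon can be added to the same chart with lines() in a different color.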
Visualization: Frequency Polygon
[Figure: Histogram of scores$e1p; x-axis: scores$e1p (0 to 100), y-axis: Frequency (0 to 10)]
Visualization: Frequency Polygon
[Figure: frequency polygon; x-axis: c(5, h1$breaks + 5), y-axis: c(0, h1$counts, 0), with frequencies from 0 to 10]
Visualization Techniques: Box Plots
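[Figure: annotated box plot; labels: outlier, 10th percentile, 25th percentile, 50th percentile, 75th percentile, and an upper whisker (labeled 10th percentile in the source, presumably 90th)]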
Example of Box Plots
! Box plots can be used to compare attributes
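
A minimal sketch of comparing attributes with box plots, assuming R's built-in iris data set as example data. Note that R's boxplot() by default extends the whiskers to the most extreme points within 1.5 times the interquartile range of the box, not to the 10th and 90th percentiles, so the percentiles are computed separately here.

```r
# Compare the four numeric attributes of the built-in iris data set.
bp <- boxplot(iris[, 1:4], main = "Iris Attributes", ylab = "Centimeters")

# bp$stats holds, per attribute: lower whisker, lower hinge, median,
# upper hinge, upper whisker.
bp$stats

# Percentiles like those labeled in the figure can be computed directly.
quantile(iris$Sepal.Length, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
```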

Example of Box Plots
[Figure: three box plots labeled 1, 2, 3; y-axis 0 to 100]
[Figure: Comparison of Exam Scores; box plots for Exam1, Exam2, Final; y-axis 0 to 100]
Visualization Techniques: Scatter Plots
! Attribute values determine the position
! Two-dimensional scatter plots most common, but can have
three-dimensional scatter plots
! Often additional attributes can be displayed by using the size,
shape, and color of the markers that represent the objects
! Useful for visualizing paired numeric data
! Arrays of scatter plots can compactly summarize the
relationships of several pairs of attributes
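
As an illustration, a single scatter plot and an array of scatter plots might be drawn in R as below. This sketch assumes the built-in iris data set; the color choices and the name species_col are arbitrary.

```r
# Color each flower by species (a third attribute shown through marker color).
species_col <- c("red", "green3", "blue")[as.integer(iris$Species)]

# Two-dimensional scatter plot: position encodes petal length and width.
plot(iris$Petal.Length, iris$Petal.Width, col = species_col, pch = 19,
     xlab = "Petal Length", ylab = "Petal Width")

# Array of scatter plots for every pair of the four numeric attributes.
pairs(iris[, 1:4], col = species_col, pch = 19,
      main = "Scatter Plot Array of Iris Attributes")
```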


[Figure: Scatter Plot Array of Iris Attributes]

Visualization Techniques: Contour Plots
! Contour plots
! Useful when a continuous attribute is measured on a spatial grid
! They partition the plane into regions of similar values
! The contour lines that form the boundaries of these regions connect
points with equal values (isolines)
! Most common example is contour maps of elevation
! Can also display temperature, rainfall, air pressure, etc.
! Also known as level plot
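
A short sketch of a contour plot of elevation, assuming R's built-in volcano data set (a grid of heights) in place of the sea surface temperature data, which is not available here.

```r
# volcano is a built-in 87 x 61 matrix of elevations on a regular grid.
dim(volcano)

# Draw isolines of elevation; each line connects grid points of equal height.
contour(volcano, main = "Elevation Contours (volcano data)")

# filled.contour() shades the regions between isolines (a level plot).
filled.contour(volcano, color.palette = terrain.colors)
```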

[Figure: Contour Plot Example: SST Dec, 1998; units: Celsius]
Histogram in Excel
! Excel (before Excel 2016) does not natively support histograms
! 1. Manual histogram via bar chart
! somewhere in your sheet, create two more columns
! one column for bins; bins should be of equal size
! the other column for counts: the number of values in each bin
! count manually and enter into the counts column OR use the built-in
function COUNTIFS(range1, criteria1, range1, criteria2) instead
! select the values in the counts column
! Insert -> Chart -> Bar -> 2D-Bar
! 2. Activate the Analysis ToolPak
! http://www.excel-easy.com/data-analysis/analysis-toolpak.html
! find the Analysis menu under the Data tab

Histogram in SPSS
! Depending on which version of SPSS you use, you may get a
different user interface
! SPSS does not automatically bin scale (ratio or interval) data for
you; you must create a new binned variable using Visual Binning
under the Transform menu
! Open a data file or type the data in
! Define variables using Data -> Define Variable Properties OR by
editing in the Variable View in the Data Editor
! Graphs menu -> Chart Builder
! From the Gallery area at the bottom of the dialog box,
! select Histogram
! select the icon of the simple histogram and drag it to the large chart
preview window
! drag the variable of interest to the X-axis in the chart preview window

R: Basic Commands
! getwd(): returns the current working directory/folder
(Misc -> Get Working Directory)
! setwd(directory): changes the working directory to the specified
directory (surrounded by double quotes)
(Misc -> Change Working Directory)
! to import the csv (comma-separated values) data file data1.txt, type
scores <- read.csv("data1.txt", header=T)
! if the data has variable names in the first row, they are automatically
picked up and can be referred to as
scores$var1, scores$var2, ...
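
The import step can be tried without the course file. The sketch below writes a tiny made-up CSV file to a temporary folder first; the column names and values are hypothetical.

```r
# Write a tiny made-up CSV file to a temporary folder so the example runs
# anywhere (file name and column names are hypothetical).
csv_path <- file.path(tempdir(), "data1.csv")
writeLines(c("student,exam1,final", "1,80,85", "2,65,70", "3,90,95"), csv_path)

getwd()                                       # show the current working directory
scores <- read.csv(csv_path, header = TRUE)   # first row supplies variable names

names(scores)    # "student" "exam1" "final"
scores$exam1     # 80 65 90
```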

R: Basic Commands
! getwd(): returns the current working directory (or working folder)
(Misc -> Get Working Directory)
! setwd(directory): changes the working directory to the specified
directory (surrounded by double quotes)
(Misc -> Change Working Directory)
! to import the csv (comma-separated values) data file data1.txt, type
g1 <- read.csv("data1.txt", header=T)
! with the header=T (TRUE) option, variable names in the first row
are automatically picked up and can be referred to as
g1$var1, g1$var2, ...

R: Basic Commands
! summary(variable): returns summary statistics: min, 1st quartile,
median, mean, 3rd quartile, max
! cor(variable1, variable2): calculates the correlation coefficient
! dim(matrix): returns # of rows and # of columns
! is.numeric(variable): TRUE if the variable is numeric
! is.factor(variable): TRUE if the variable is a factor (categorical)
! runif(n, a, b): generates n uniformly distributed random #s (decimal)
between a and b (if n is > 1, a vector)
! sample(a:b, n, replace=T): generates n random #s (whole)
between a and b, inclusive. With replacement if replace=T,
without if replace=F.
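
A brief sketch of these commands in use, assuming the built-in iris data set as example data; the seed value is arbitrary and only makes the random draws repeatable.

```r
set.seed(42)                              # make the random draws reproducible

x <- runif(5, 0, 1)                       # 5 random decimals between 0 and 1
y <- sample(1:6, 5, replace = TRUE)       # 5 random whole numbers from 1 to 6

summary(iris$Sepal.Length)                # Min, 1st Qu., Median, Mean, 3rd Qu., Max
cor(iris$Petal.Length, iris$Petal.Width)  # correlation coefficient
dim(iris)                                 # 150 rows, 5 columns
is.numeric(iris$Sepal.Length)             # TRUE
is.factor(iris$Species)                   # TRUE
```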

R: Basic Commands
! each column (variable) can also be accessed by
scores[, 1], scores[, 2], ...
as R stores the data as a data frame (a 2-dimensional table)
! scores[row, col] accesses a single cell at the specified location
! scores[row, ] accesses the specified row (and all columns)
! scores[, col] accesses the specified column (and all rows)
! seq(a, b) creates a vector with [a, a+1, a+2, ..., b-1, b]
! seq(a, b, by=2) creates a vector with [a, a+2, a+4, ...]
! to access sub-data, for example rows 5 to 10 (and all columns)
scores[ seq(5, 10), ]
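
The indexing forms above can be tried on a small table. This is a sketch with made-up data; the column names and values are hypothetical.

```r
# Made-up score table (names and values are hypothetical).
scores <- data.frame(student = 1:12,
                     exam1   = seq(50, 94, by = 4),
                     final   = seq(55, 99, by = 4))

scores[2, 3]            # single cell: row 2, column 3
scores[2, ]             # row 2, all columns
scores[, 2]             # column 2, all rows (returned as a vector)
seq(5, 10)              # 5 6 7 8 9 10
seq(1, 9, by = 2)       # 1 3 5 7 9
scores[seq(5, 10), ]    # rows 5 to 10, all columns
```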


R: Exercise
! download data1.csv from the course website
! read the data file (has a header) into a data frame g1
! g1 <- read.csv("data1.csv", header=T)
! access g1 in various ways
! dim(g1) shows # of rows and columns
! g1 shows the entire data table
! g1[1, ] shows the first row (student 1's grades)
! g1[, 1] shows the first column (all student IDs)
! g1[1, 3] shows student 1's hw0 score
! extract parts of g1 (student id, exam1, exam2, final) and assign to g2
! g2 <- g1[, c(1, 7, 13, 19)] OR
! g2 <- data.frame(sid=g1$student, e1=g1$exam1, e2=g1$exam2,
final=g1$final)

Histogram in R
! The most powerful and versatile of the three programs (Excel, SPSS, R)
! Once the environment is set up and the data file is loaded:
assume g1 <- read.csv("data1.txt", header=T) has been executed
! Make a histogram for final exam scores
! hist(g1$final): creates a histogram with a default bin width based on the
data
! hist(g1$final, breaks=seq(0, 120, by=10)): creates a histogram with
a bin width of 10, so the bins are 0-10, 10-20, ..., 110-120
! hist(g1$final, breaks=seq(0, 120, by=10), col="blue"): same as
above, but the rectangles (bars) are filled in blue
! more options:
! xlab="x-axis label", ylab="y-axis label", main="chart title"
! xlim=c(0, 120), ylim=c(0, 10)
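
Putting these options together might look like the sketch below. The scores are randomly generated stand-ins for g1$final, since the course data file is not available here.

```r
# Hypothetical final exam scores out of 120 (made up for illustration).
set.seed(7)
final <- round(runif(30, min = 40, max = 120))

# Bins of width 10 from 0 to 120, blue bars, custom labels and axis limits.
h <- hist(final, breaks = seq(0, 120, by = 10), col = "blue",
          xlab = "Final exam score", ylab = "Number of students",
          main = "Histogram of Final Exam",
          xlim = c(0, 120), ylim = c(0, 10))

h$breaks   # the 13 break points 0, 10, ..., 120
h$counts   # number of scores in each of the 12 bins
```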

Example: Histogram in R
! Download data1.csv from the course website: this file contains
all the scores for a class I taught in one semester
! All variables are quantitative attributes with true zeroes
! Start R, change the working directory to where the data file was
downloaded, and import the data file into a data frame called g1
g1 <- read.csv("data1.csv", header=T)
! Type in R: summary(g1$final)
! Type in R: hist(g1$final, col="blue", main="Histogram of Final
Exam")
! Type in R: axis(2, seq(0, 10, by=1)) -> to manipulate the y-axis
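
When replacing the y-axis ticks, it can help to suppress the default axis first so the two sets of ticks do not overlap. A minimal sketch with made-up scores follows; the yaxt = "n" step is an addition not shown on the slide.

```r
# Hypothetical final exam scores (made up; the course file is not used here).
final <- c(52, 58, 61, 66, 70, 71, 75, 78, 80, 83, 85, 88, 91, 95, 99)

summary(final)

# yaxt = "n" suppresses the default y-axis so axis() can draw a custom one.
hist(final, col = "blue", main = "Histogram of Final Exam", yaxt = "n")

# Side 2 is the left (y) axis; place a tick at every whole number 0..10.
axis(2, at = seq(0, 10, by = 1))
```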

Importing Data in R
! To import Excel files into R, you must first install the gdata and gtools
libraries.
! In R, Packages & Data -> Package Installer
! In the R Package Installer dialog box, choose Korea1 (or Korea2) as
the source; then, in the search box on the right, type in gtools
! Check Install Dependencies
! Click Install Selected
! Do the same for gdata
! Type in the R Console: library(gdata)
! g = read.xls("data1.xlsx"): by default, it imports sheet 1
! (in class, we used <- to assign a value to a variable, but for
simplicity, we will use = instead)

Plotting in R
! plot(data) draws unfilled circular points at each data point by
default
! plot(xData, yData) plots yData against xData
! There are many options:
! type="p" (draw points), "l" (draw lines), "h" (histogram-like vertical lines)
! pch=___: point character (0 to 25)
! lty=___: line type (0:blank, 1:solid, 2:dashed, 3:dotted, 4:dotdash,
5:longdash, 6:twodash)
! lwd=___: line width (1, 2, 3, ...)
! main="chart title"
! sub="subtitle"
! xlab="x-axis title"
! ylab="y-axis title"
! col="color_of_points_or_lines": "red", "blue", "skyblue", etc.
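
Several of these options can be combined in one call. The sketch below uses made-up data; type = "b" (both points and lines) is a further valid value of type that is not in the list above.

```r
# Made-up data: x from 1 to 10 and y as its square.
x <- 1:10
y <- x^2

# Points and lines together ("b"), filled circles, dashed red line of width 2.
plot(x, y, type = "b", pch = 19, lty = 2, lwd = 2, col = "red",
     main = "Chart Title", sub = "Subtitle",
     xlab = "x-axis title", ylab = "y-axis title")

# Add a second series to the same chart as a solid blue line.
lines(x, 10 * x, type = "l", lty = 1, col = "blue")
```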

To Save a Graph to File in R
! Determine which graphics format to use, and create a file with
! bmp(filename, width=480, height=480, units="px", pointsize=12):
Windows only
! jpeg(filename, width=480, height=480, units="px", pointsize=12,
quality=75): universal
! postscript
! pdf
! For bmp and jpeg, the default (if unspecified) is width=480, height=480 pixels
! To draw a chart and save it into a file called chart.jpg
! jpeg("chart.jpg") creates the chart.jpg file
! plot(data) puts the graph into the chart.jpg file
! dev.off() closes (and saves) chart.jpg
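
The same open, draw, close sequence works for any of the graphics devices. The sketch below uses pdf() instead of jpeg(), on the assumption that pdf() is available in every R build while jpeg() depends on how R was compiled; note that pdf() sizes are given in inches, and the file name is arbitrary.

```r
# Save into a temporary folder so the example does not clutter the working
# directory (the file name chart.pdf is arbitrary).
out_file <- file.path(tempdir(), "chart.pdf")

pdf(out_file, width = 7, height = 7)   # open the file as the graphics device
plot(1:10, (1:10)^2, type = "b")       # the graph is drawn into the file
dev.off()                              # close the device, finishing the file

file.exists(out_file)                  # TRUE once the device is closed
```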
