Unit 4 Part A

Unit 4

Data Visualization
Data visualization is the graphical representation of information and data. By using 
visual elements like charts, graphs, and maps, data visualization tools provide an
accessible way to see and understand trends, outliers, and patterns in data. In the
world of Big Data, data visualization tools and technologies are essential to analyze
massive amounts of information and make data-driven decisions.

Our eyes are drawn to colors and patterns. We can quickly identify red from blue,
square from circle. Our culture is visual, including everything from art and
advertisements to TV and movies. Data visualization is another form of visual art that
grabs our interest and keeps our eyes on the message.
Why data visualization?
Data visualization gives us a clear idea of what the information means by giving
it visual context through maps or graphs. This makes the data more natural for
the human mind to comprehend and therefore makes it easier to identify trends,
patterns, and outliers within large data sets.
Uses of Data Visualization:
● Gain insight into an information space by mapping data onto graphical
primitives
● Provide a qualitative overview of large data sets
● Search for patterns, trends, structure, irregularities, and relationships among
data
● Help find interesting regions and suitable parameters for further
quantitative analysis
● Provide a visual proof of derived computer representations
● It is a powerful technique to explore the data
with presentable and interpretable results.
● In the data mining process, it acts as a primary step in the pre-processing
portion.
● It supports the data cleaning process by finding incorrect data and
corrupted or missing values.
● It also helps to construct and select variables, which means we have to
determine which variables to include in the analysis and which to discard.
● In the process of Data Reduction, it also plays a crucial role while
combining the categories.
There are four types of visualization techniques:
1) Pixel-oriented visualization techniques
2) Geometric projection visualization techniques
3) Icon-based visualization techniques
4) Hierarchical visualization techniques
Pixel-Oriented Visualization Techniques
● A simple way to visualize the value of a dimension is to use a pixel where
the color of the pixel reflects the dimension’s value.
● For a data set of m dimensions, pixel-oriented techniques create m windows
on the screen, one for each dimension.
● The m dimension values of a record are mapped to m pixels at the
corresponding positions in the windows. The colors of the pixels reflect the
corresponding values.
● Inside a window, the data values are arranged in some global order shared by
all windows.
● The global order may be obtained by sorting all data records in a way that’s
meaningful for the task at hand.
Pixel-oriented visualization
● AllElectronics maintains a customer information table, which consists of four
dimensions: income, credit limit, transaction volume, and age.
● We can sort all customers in income-ascending order, and use this order to
lay out the customer data in the four visualization windows.
● The pixel colors are chosen so that the smaller the value, the lighter
the shading.
● Using pixel-based visualization, we can easily observe that credit_limit
increases as income increases, that customers whose income is in the
middle range are more likely to purchase more from AllElectronics, and
that there is no clear correlation between income and age.
● Note that the windows do not have to be rectangular. For example, the circle
segment technique uses windows in the shape of segments of a circle.
● This technique can ease the comparison of dimensions because the
dimension windows are located side by side and form a circle.
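The layout described above can be sketched in base R. This is only an illustration under stated assumptions: the customer table is simulated (it is not AllElectronics data), the column names are hypothetical, and each window is a plain 20 x 20 rectangle rather than a circle segment.

```r
# Simulated customer table (hypothetical values, for illustration only).
set.seed(1)
n <- 400
income <- runif(n, 20, 120)
customers <- data.frame(
  income = income,
  credit_limit = 0.5 * income + rnorm(n, sd = 3),
  transaction_volume = 100 - (income - 70)^2 / 40 + rnorm(n, sd = 5),
  age = runif(n, 18, 80)
)

# Global order shared by all windows: income ascending.
sorted <- customers[order(customers$income), ]

# Gray scale from white to black, so smaller values get lighter shading.
shades <- gray(seq(1, 0, length.out = 64))

# One window per dimension; each record occupies the same pixel position
# in every window.
op <- par(mfrow = c(2, 2), mar = c(1, 1, 2, 1))
for (dim_name in names(sorted)) {
  window <- matrix(sorted[[dim_name]], nrow = 20, ncol = 20)
  image(window, col = shades, axes = FALSE, main = dim_name)
}
par(op)
```

Because every window uses the same income-sorted order, a smooth gradient in a window suggests a relationship with income, while a noisy window (age, in this simulated data) suggests none.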
Geometric Projection Visualization Techniques

● A drawback of pixel-oriented visualization techniques is that they cannot
help us much in understanding the distribution of data in a multidimensional
space. For example, they do not show whether there is a dense area in a
multidimensional subspace.
● Geometric projection techniques help users find interesting projections
of multidimensional data sets.
● The central challenge the geometric projection techniques try to address is
how to visualize a high-dimensional space on a 2-D display.
● A scatter plot displays 2-D data points using Cartesian coordinates. A third
dimension can be added using different colors or shapes to represent
different data points.
Scatter Plot 2D Scatter Plot 3D
● The scatter-plot matrix technique is a useful extension to the scatter plot.
For an n dimensional data set, a scatter-plot matrix is an n * n grid of 2-D
scatter plots that provides a visualization of each dimension with every
other dimension.
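A scatter-plot matrix can be drawn with the base R function pairs(). The sketch below uses the iris data set that ships with R; the choice of data set and of color as the extra dimension are assumptions made for illustration.

```r
# The first four columns of iris are numeric measurements.
measurements <- iris[, 1:4]
n_dims <- ncol(measurements)

# pairs() draws an n x n grid of 2-D scatter plots, with the variable
# names on the diagonal. Color encodes the species as a further dimension.
pairs(measurements, col = as.integer(iris$Species), pch = 19,
      main = "Scatter-plot matrix of the iris measurements")
```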
Parallel Coordinates
This type of visualization is used for plotting multivariate, numerical data.
Parallel Coordinates Plots are ideal for comparing many variables together
and seeing the relationships between them, for example when comparing an
array of products that share the same attributes.

In a Parallel Coordinates Plot, each variable is given its own axis and all
the axes are placed in parallel to each other. Each axis can have a
different scale, as each variable works off a different unit of measurement,
or all the axes can be normalized to keep all the scales uniform.

The order in which the axes are arranged can affect how the reader
understands the data. One reason for this is that the relationships
between adjacent variables are easier to perceive than those between
non-adjacent variables. Re-ordering the axes can therefore help in
discovering patterns or correlations across variables.
To visualize n-dimensional data points, the parallel coordinates
technique draws n equally spaced axes, one for each dimension,
parallel to one of the display axes.
A data record is represented by a polygonal line that intersects each
axis at the point corresponding to the associated dimension value.
A major limitation of the parallel coordinates technique is that it
cannot effectively show a data set of many records.
Even for a data set of several thousand records, visual clutter and
overlap often reduce the readability of the visualization and make the
patterns hard to find.
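The construction above can be sketched with base R graphics only. This is a minimal illustration, assuming the iris data set and a simple min-max normalization so that all axes share one scale; the helper name rescale is hypothetical.

```r
# Rescale each dimension to [0, 1] so that all axes use a uniform scale.
rescale <- function(v) (v - min(v)) / (max(v) - min(v))
scaled <- as.data.frame(lapply(iris[, 1:4], rescale))
n_dims <- ncol(scaled)

# matplot() draws one line per column, so the data is transposed: each
# record becomes a polygonal line crossing the n equally spaced axes.
matplot(1:n_dims, t(as.matrix(scaled)), type = "l", lty = 1,
        col = as.integer(iris$Species), xaxt = "n",
        xlab = "", ylab = "scaled value",
        main = "Parallel coordinates (iris)")
axis(1, at = 1:n_dims, labels = names(scaled))

# Mark the parallel axes themselves.
abline(v = 1:n_dims, col = "gray")
```

With only 150 records the lines already overlap noticeably, which hints at the clutter problem described above for larger data sets.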
Parallel Coordinates
Icon-Based Visualization Techniques
Icon-based visualization techniques use small icons to represent multidimensional data
values. We look at two popular icon-based techniques: Chernoff faces and stick figures.

Chernoff faces were introduced in 1973 by statistician Herman Chernoff. They display
multidimensional data of up to 18 variables (or dimensions) as a cartoon human face.

Chernoff faces help reveal trends in the data. Components of the face, such as the eyes, ears,
mouth, and nose, represent values of the dimensions by their shape, size, placement, and
orientation.

For example, dimensions can be mapped to the following facial characteristics: eye size, eye
spacing, nose length, nose width, mouth curvature, mouth width, mouth openness, pupil size,
eyebrow slant, eye eccentricity, and head eccentricity.
● Chernoff faces make use of the ability of the human mind to recognize small
differences in facial characteristics and to assimilate many facial characteristics at
once.
● Chernoff faces make the data easier for users to digest.
● In this way, they facilitate visualization of regularities and irregularities present in the
data, although their power in relating multiple relationships is limited.
● Another limitation is that specific data values are not shown.
● Furthermore, facial features vary in perceived importance. This means that the
similarity of two faces (representing two multidimensional data points) can vary
depending on the order in which dimensions are assigned to facial characteristics.
Therefore, this mapping should be carefully chosen.
● Eye size and eyebrow slant have been found to be important.
Chernoff Faces
● Asymmetrical Chernoff faces were proposed as an extension to the original
technique.
● Since a face has vertical symmetry (along the y-axis), the left and right side of a face
are identical, which wastes space.
● Asymmetrical Chernoff faces double the number of facial characteristics, thus
allowing up to 36 dimensions to be displayed.
● The Stick Figure Visualization Technique maps multidimensional data to five-piece
stick figures, where each figure has four limbs and a body.
● Two dimensions are mapped to the display (x and y) axes and the remaining
dimensions are mapped to the angle and/or length of the limbs.
● For example, with census data, age and income can be mapped to the display axes, and
the remaining dimensions (gender, education, and so on) are mapped to stick figures.
If the data items are relatively dense with respect to the two display dimensions, the
resulting visualization shows texture patterns, reflecting data trends.
Hierarchical Visualization Techniques
● The visualization techniques discussed so far focus on visualizing multiple
dimensions simultaneously.
● However, for a large data set of high dimensionality, it would be difficult to
visualize all dimensions at the same time.
● Hierarchical visualization techniques partition all dimensions into subsets
(i.e., subspaces). The subspaces are visualized in a hierarchical manner.
● “Worlds-within-Worlds,” also known as n-Vision, is a representative
hierarchical visualization method.
● Suppose we want to visualize a 6-D data set, where the dimensions are
F, X1, ..., X5.
● We want to observe how dimension F changes with respect to the other
dimensions. We can first fix the values of dimensions X3, X4, X5 to some
selected values, say, c3, c4, c5.
● We can then visualize F, X1, X2 using a 3-D plot, called a world. The position
of the origin of the inner world is located at the point (c3, c4, c5) in the outer
world, which is another 3-D plot using dimensions X3, X4, X5.
● A user can interactively change, in the outer world, the location of the origin
of the inner world.
● The user then views the resulting changes of the inner world. Moreover, a
user can vary the dimensions used in the inner world and the outer world.
● Given more dimensions, more levels of worlds can be used, which is why
the method is called “worlds-within-worlds.”
Worlds - Within - Worlds
● As another example of hierarchical visualization methods, tree-maps
display hierarchical data as a set of nested rectangles.
● For example, a tree-map can be used to visualize Google news stories.
● All news stories are organized into seven categories, each shown in a
large rectangle of a unique color.
● Within each category (i.e., each rectangle at the top level), the news
stories are further partitioned into smaller subcategories.
Tree Map
Visualizing Complex Data and Relations
● In early days, visualization techniques were mainly for numeric data.
Recently, more and more non-numeric data, such as text and social networks,
have become available.
● Visualizing and analyzing such data attracts a lot of interest.
● There are many new visualization techniques dedicated to these kinds of
data. For example, many people on the Web tag various objects such as
pictures, blog entries, and product reviews.
● A tag cloud is a visualization of statistics of user-generated tags.
● Often, in a tag cloud, tags are listed alphabetically or in a user-preferred
order. The importance of a tag is indicated by font size or color.
Tag Cloud
● Tag clouds are often used in two ways.
● First, in a tag cloud for a single item, we can use the size of a tag to
represent the number of times that the tag is applied to this item by different
users.
● Second, when visualizing the tag statistics on multiple items, we can use the
size of a tag to represent the number of items that the tag has been applied
to, that is, the popularity of the tag.
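The first use (tag counts for a single item) can be sketched in base R. The tags below are hypothetical, and the layout is deliberately simple: tags sit on one row in alphabetical order, with font size growing with the count.

```r
# Hypothetical tags applied to a single item by different users.
tags <- c("sunset", "beach", "sunset", "travel", "beach", "sunset",
          "ocean", "travel", "sunset")

# Count how often each tag was applied; table() orders names alphabetically.
tag_counts <- table(tags)

# Draw the tags on an empty canvas, with font size (cex) scaled by count.
# Spacing is fixed, so large tags may overlap on a small device.
plot(NULL, xlim = c(0, length(tag_counts) + 1), ylim = c(0, 2),
     axes = FALSE, xlab = "", ylab = "", main = "Tag cloud sketch")
text(x = seq_along(tag_counts), y = 1, labels = names(tag_counts),
     cex = 1 + as.numeric(tag_counts) / 2)
```

For the second use, the same code applies if each entry of tags records one item that carries the tag, so that the count becomes the tag's popularity.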
● In addition to complex data, complex relations among data entries also raise
challenges for visualization.
● For example, a disease influence graph can be used to visualize the
correlations between diseases.
● The nodes in the graph are diseases, and the size of each node is
proportional to the prevalence of the corresponding disease.
● Two nodes are linked by an edge if the corresponding diseases have a strong
correlation. The width of an edge is proportional to the strength of the
correlation pattern of the two corresponding diseases.
PIE CHART
The R programming language has numerous libraries to create charts and graphs. A pie
chart is a representation of values as slices of a circle with different colors. The
slices are labeled, and the number corresponding to each slice is also shown in
the chart.
In R, the pie chart is created using the pie() function, which takes a vector of
positive numbers as input. The additional parameters are used to control labels,
color, title, etc.
Syntax
pie(x, labels, radius, main, col, clockwise)
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart (a value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating whether the slices are drawn clockwise or
anticlockwise.
# Create data for the graph.
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")
# Give the chart file a name.
png(file = "city.png")
# Plot the chart.
pie(x,labels)
# Save the file.
dev.off()
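The optional parameters listed in the syntax can be combined as in the following sketch. It reuses the same values; the percentage labels, the output file name, and the rainbow() palette are choices made here for illustration.

```r
x <- c(21, 62, 10, 53)
cities <- c("London", "New York", "Singapore", "Mumbai")

# Label each slice with its share of the total, rounded to whole percent.
percent <- round(100 * x / sum(x))
slice_labels <- paste0(cities, " (", percent, "%)")

# Give the chart file a name, plot with title and colors, then save.
png(file = "city_percent.png")
pie(x, labels = slice_labels, main = "City pie chart",
    col = rainbow(length(x)), clockwise = TRUE)
dev.off()
```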
BAR CHART
A bar chart represents data as rectangular bars, with the length of each bar
proportional to the value of the variable. R uses the function barplot() to
create bar charts. R can draw both vertical and horizontal bars in a bar
chart, and each bar can be given a different color.
Syntax
The basic syntax to create a bar-chart in R is −
barplot(H,xlab,ylab,main, names.arg,col)
Following is the description of the parameters used −
H is a vector or matrix containing numeric values used in bar chart.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the bar chart.
names.arg is a vector of names appearing under each bar.
col is used to give colors to the bars in the graph.
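A minimal bar chart using these parameters might look as follows. The revenue values, month labels, and file name are hypothetical sample data.

```r
# Hypothetical revenue values and the month names shown under each bar.
H <- c(7, 12, 28, 3, 41)
M <- c("Mar", "Apr", "May", "Jun", "Jul")

# Give the chart file a name.
png(file = "barchart_months_revenue.png")

# Plot the bar chart; barplot() returns the bar midpoints invisibly.
bar_midpoints <- barplot(H, names.arg = M, xlab = "Month", ylab = "Revenue",
                         col = "blue", main = "Revenue chart", border = "red")

# Save the file.
dev.off()
```

Passing horiz = TRUE to barplot() draws the same bars horizontally.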
Group Bar Chart and Stacked Bar Chart
We can create a bar chart with groups of bars, or with stacks in each bar, by using a
matrix as the input. When the data has more than one series of values, each series is
stored as a row of the matrix, which is then used to create the group bar chart or
the stacked bar chart.
# Define the bar colors, month labels, and region names.
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
# Create the matrix of the values: one row per region, one column per month.
Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)
# Give the chart file a name.
png(file = "barchart_stacked.png")
# Create the stacked bar chart.
barplot(Values, main = "total revenue", names.arg = months, xlab = "month",
        ylab = "revenue", col = colors)
# Add the legend to the chart.
legend("topleft", regions, cex = 1.3, fill = colors)
# Save the file.
dev.off()
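By default, barplot() stacks the rows of a matrix within each bar. A group bar chart can be sketched from the same values by adding beside = TRUE, which places the regions side by side within each month. The output file name is an assumption.

```r
colors <- c("green", "orange", "brown")
months <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")

# One row per region, one column per month.
Values <- matrix(c(2, 9, 3, 11, 9, 4, 8, 7, 3, 12, 5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)

png(file = "barchart_grouped.png")
# beside = TRUE draws the rows as adjacent bars instead of stacking them.
barplot(Values, beside = TRUE, main = "Total revenue", names.arg = months,
        xlab = "Month", ylab = "Revenue", col = colors)
legend("topleft", regions, cex = 1.3, fill = colors)
dev.off()
```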
BOXPLOT
Boxplots are a measure of how well distributed the data in a data set is. A
boxplot divides the data set using three quartiles. The graph represents the
minimum, maximum, median, first quartile and third quartile of the data set. It
is also useful in comparing the distribution of data across data sets by drawing
boxplots for each of them.
Boxplots are created in R by using the boxplot() function.
Syntax
boxplot(x, data, varwidth, names, main)
Following is the description of the parameters used −
x is a vector or a formula.
data is the data frame.
varwidth is a logical value. Set it to TRUE to draw each box with a width that
grows with the sample size of its group (proportional to its square root).
names are the group labels which will be printed under each boxplot.
main is used to give a title to the graph.
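As a sketch of these parameters, the example below compares mileage across cylinder counts using the mtcars data set that ships with R. The data set, labels, and file name are choices made here for illustration.

```r
# Give the chart file a name.
png(file = "boxplot_mpg.png")

# The formula mpg ~ cyl draws one box of mpg values per cylinder count.
# boxplot() also returns the computed statistics invisibly.
box_stats <- boxplot(mpg ~ cyl, data = mtcars, varwidth = TRUE,
                     names = c("4 cyl", "6 cyl", "8 cyl"),
                     xlab = "Number of cylinders",
                     ylab = "Miles per gallon",
                     main = "Mileage data")

# Save the file.
dev.off()
```

The returned list holds the five summary values per group in box_stats$stats and the group sizes in box_stats$n.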
HISTOGRAM
A histogram represents the frequencies of values of a variable bucketed
into ranges. A histogram is similar to a bar chart, but the difference is that it
groups the values into continuous ranges. The height of each bar in a histogram
represents the number of values present in that range.
R creates histogram using hist() function. This function takes a vector as
an input and uses some more parameters to plot histograms.
Syntax
hist(v,main,xlab,xlim,ylim,breaks,col,border)
v is a vector containing numeric values used in histogram.
main indicates title of the chart.
col is used to set color of the bars.
border is used to set border color of each bar.
xlab is used to give description of x-axis.
xlim is used to specify the range of values on the x-axis.
ylim is used to specify the range of values on the y-axis.
breaks is used to control how the values are split into ranges (bins).
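The following sketch puts these parameters together. The values in v and the file name are hypothetical, and the breaks are given explicitly as bin boundaries so that the ranges are predictable.

```r
# Hypothetical sample values.
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

# Give the chart file a name.
png(file = "histogram.png")

# Bins are (0,10], (10,20], ..., (40,50]; hist() returns the bin counts.
h <- hist(v, main = "Histogram of v", xlab = "Weight",
          col = "yellow", border = "blue",
          xlim = c(0, 50), ylim = c(0, 5),
          breaks = seq(0, 50, by = 10))

# Save the file.
dev.off()

print(h$counts)  # 2 3 2 3 1
```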
LINE GRAPH:
A line chart is a graph that connects a series of points by drawing line
segments between them. These points are ordered by one of their
coordinates (usually the x-coordinate). Line charts are usually used for
identifying trends in data.
The plot() function in R is used to create the line graph.
Syntax
plot(v, type, col, xlab, ylab, main)
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and
"o" to draw both points and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the title of the chart.
col is used to give colors to both the points and lines.
Multiple Lines in a Line Chart
More than one line can be drawn on the same chart by using
the lines() function.
After the first line is plotted, the lines() function can take an additional vector
as input to draw the second line in the chart.
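A sketch of a chart with two lines follows. The two vectors, the labels, and the file name are hypothetical; ylim is set from both vectors so that the second line is not clipped by the range of the first.

```r
# Hypothetical rainfall values for two years.
v <- c(7, 12, 28, 3, 41)
t <- c(14, 7, 6, 19, 3)

# Use a y-range that covers both series.
y_range <- range(v, t)

# Give the chart file a name.
png(file = "line_chart_two_lines.png")

# Plot the first line with points and lines, then add the second line.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rainfall",
     main = "Rainfall chart", ylim = y_range)
lines(t, type = "o", col = "blue")
legend("top", legend = c("v", "t"), col = c("red", "blue"), lty = 1, pch = 1)

# Save the file.
dev.off()
```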
Scatter Plot
Scatterplots show many points plotted in the Cartesian plane. Each point represents
the values of two variables. One variable is chosen in the horizontal axis and another
in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
x is the data set whose values are the horizontal coordinates.
y is the data set whose values are the vertical coordinates.
main is the title of the graph.
xlab is the label in the horizontal axis.
ylab is the label in the vertical axis.
xlim is the limits of the values of x used for plotting.
ylim is the limits of the values of y used for plotting.
axes indicates whether both axes should be drawn on the plot.
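A simple scatterplot using these parameters can be sketched with the mtcars data set that ships with R. The choice of columns, axis limits, and file name are assumptions for illustration; points that fall outside xlim or ylim are not shown.

```r
# Weight (wt) and mileage (mpg) of the cars in mtcars.
input <- mtcars[, c("wt", "mpg")]

# Give the chart file a name.
png(file = "scatterplot.png")

# Plot weight on the horizontal axis and mileage on the vertical axis.
plot(x = input$wt, y = input$mpg,
     xlab = "Weight", ylab = "Mileage",
     xlim = c(2.5, 5), ylim = c(15, 30),
     main = "Weight vs mileage")

# Save the file.
dev.off()
```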
