
Eeman Qureshi SDA Lab 4

Lab 4: Data Visualization in STATA: Histograms, Boxplots, and Scatterplots.

Key Principles of Data Visualization:


Strive for clarity & simplicity
• Maximize impact, minimize noise
• If it doesn’t add value or serve a purpose then get rid of it

Focus on creating a Narrative:


• Don’t just show the data, tell a story with it
• Communicate key insights, clearly, quickly, and powerfully

Strike a balance between DESIGN & FUNCTION:


• When it comes to design, selecting the right type of graph is critical.
• Beautiful is good, functional is better, BOTH is ideal

RULE OF THUMB
If a viewer can’t interpret the story in 10-15 seconds, then it’s time to SIMPLIFY.

3 KEY QUESTIONS YOU NEED TO ANSWER WHEN VISUALIZING DATA


1. What type of data are you working with?
• Integer, real, categorical, time-series, geo-spatial etc.

2. What are you trying to communicate?


• Relationship, comparison, composition, distribution, trends etc.

3. Who is the visualization for?


• Analyst, client, professor, CEO, intern

Data Visualizations in STATA:


We use the help graph command to go over the different types of visualizations that are
available in STATA. We can also set a different scheme in STATA to change the overall look of our
graphs, such as the economist scheme or cleanplots.
Types of graphs in STATA:
1) One way graphs (Histogram, bar graph)
2) Two way graphs (Scatter plots, line plots)

For STATA graphs we need the following:

1) Graph command
2) Graph type
3) The variable name we want to use

Type help graph in STATA

Different schemes of graphs in STATA:


Type help set scheme in STATA to see the different schemes available to you
set scheme economist
set scheme s2color
set scheme s1mono
set scheme cleanplots
Note: cleanplots is a user-written scheme, so it must be installed before it can be set.

Now let’s open STATA’s example dataset ‘auto’ with the following command:
sysuse auto, clear
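
To compare schemes without changing the session default, one option is to pass scheme() to a single graph. The lines below are a minimal sketch using the auto data; the graph names mono and color are arbitrary labels chosen for this illustration.

```stata
sysuse auto, clear
* Draw the same histogram under two schemes; name() keeps both graphs open
histogram price, scheme(s1mono) name(mono, replace)
histogram price, scheme(s2color) name(color, replace)
```

set scheme changes the look of every graph drawn afterwards, whereas scheme() affects only the graph it is attached to.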

Histograms:
Visualizing Continuous Variables through One way graphs:
histogram price
histogram price, normal
histogram price, freq
sum price
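
As a small sketch of how these options combine, the lines below put frequencies on the y-axis, fix the number of bins, and overlay a normal curve for the auto data. The choice of 8 bins and the title text are assumptions made only for this illustration.

```stata
sysuse auto, clear
* Frequency on the y-axis, 8 bins, with a normal density overlaid
histogram price, frequency bin(8) normal title("Distribution of price")
* Numeric summary (mean, sd, min, max) to read alongside the graph
summarize price
```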

Q: Open data1_sample.dta and keep only the observations with household size greater than or
equal to 1 and rent less than 2000.
use data1_sample.dta, clear
keep if hhsize >= 1 & rent < 2000
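
One point to watch with a "greater than or equal to" condition: STATA treats missing numeric values as larger than any number, so hhsize >= 1 is also true when hhsize is missing. A hedged sketch of a stricter filter follows; it assumes data1_sample.dta is in the current working directory and that missing household sizes should be dropped, which the lab does not state.

```stata
use data1_sample.dta, clear
* How many observations have a missing household size?
count if missing(hhsize)
* Keep household size of 1 or more (excluding missing) and rent below 2000
keep if hhsize >= 1 & !missing(hhsize) & rent < 2000
```

The rent condition needs no such guard, because a missing rent is never less than 2000.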

Bar Graphs:
Visualize data using the following graph commands

help graph bar


Q: Make bar graph of income:
graph bar income
Note: This visualization is not useful because it does not communicate effectively or tell a story,
and its purpose is not clear. It shows a single bar for the mean of income, which we could get
from the sum command.

Q: Now suppose we want to make a comparison of average income across states:


Earlier we did this with the bysort command, so type the following command in your STATA
window:
bysort state: sum income

Q: Provide the command to view the mean of income for each state in separate bar graphs:
graph bar income, by(state)
To minimize the clutter and improve clarity, show all states in a single graph with over():
graph bar income, over(state)

Something’s not right about the above visualization: with many states, the labels along the axis
may be crowded and hard to read. The legend as well as the axes of the graph are extremely
important, so you must ensure these are clearly visible.

Horizontal Bar Graph:


Let’s try the command below:
graph hbar income, over(state)

Q: Provide the command to see the total of ‘income’ by state:


graph hbar (sum) income, by(state)
Q: Provide the command to see the total of ‘income’ by state, with all states in the same
graph:
graph bar (sum) income, over(state)

Q: Provide the command below to get the bar graph of the variable 'income' by state and
gender:
graph hbar income, over(state) by(sex)

The over() option places the groups side by side within the same plot, whereas by() splits them
into separate panels, one for each group.
graph hbar income, over(state) over(sex)
The above command visualizes both sex and state within the same plot.
In the same way that we specified sum in brackets, we can also ask the graph to show other
statistics, such as the median or the mean. If no statistic is specified, the mean is shown.
Q: Suppose we wanted to see the mean and median of income together for each of the states in
the same graph?
graph hbar (median) income (mean) income, over(state)
*OR
graph hbar income (median) income, over(state)

Note: When we run the same command for sum and mean together, we can only see the sum in the
graph. The sum is so much larger than the mean that, on a shared axis, the bars for the mean are
too small to be visible.
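
The difference between over() and by(), and the use of two statistics at once, can be tried on the auto data without data1_sample.dta. This is only a sketch: price, rep78 and foreign stand in for income, state and sex, and the legend text is an assumption.

```stata
sysuse auto, clear
* over() twice: all groups share one plot
graph hbar price, over(rep78) over(foreign)
* over() plus by(): one panel for each value of foreign
graph hbar price, over(rep78) by(foreign)
* Mean and median of the same variable, side by side for each group
graph hbar (mean) price (median) price, over(foreign) ///
    legend(label(1 "Mean price") label(2 "Median price"))
```

The /// line continuation works in a do-file; in the Command window, type the command on a single line.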

Box Plots:
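A box plot summarizes a distribution: the line in the middle of the box is the median and the box
spans the interquartile range (25th to 75th percentile). The whiskers do not show the standard
deviation. In STATA they extend to the adjacent values, that is, the most extreme observations within
1.5 times the interquartile range of the box, and points beyond them are plotted individually as
outside values.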

Type the following command in your STATA windows:


graph box rent, by(state)
graph hbox rent, over(state)
graph hbox rent, over(state) noout

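*The noout option (short for nooutsides) hides the outside values in the plot, which gives a clearer
picture of the bulk of the distribution. It does not remove those observations from the data.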
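
To connect the parts of the box to actual numbers, one option is to compare the plot with the detailed summary. The sketch below uses the auto data, with price and foreign standing in for rent and state.

```stata
sysuse auto, clear
* Percentiles (25%, 50%, 75%) correspond to the box edges and the median line
summarize price, detail
* Same distribution by group, first with and then without outside values
graph hbox price, over(foreign)
graph hbox price, over(foreign) nooutsides
```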

SCATTERPLOTS:
Scatterplots are used to visualize the relationship between two variables; in other words, they
show the bivariate relationship between them.
You always write the y-variable before the x-variable in the scatterplot command:
graph twoway scatter rent size
Get rent separately for different states:
graph twoway scatter rent size, by(state)

Note: The over() option does not work with twoway graphs such as scatterplots.
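
Since over() is not available here, one way to show groups in the same scatterplot is to overlay one scatter per group with an if condition. This is a sketch on the auto data, where foreign takes the values 0 (domestic) and 1 (foreign); the axis titles and legend text are assumptions.

```stata
sysuse auto, clear
* One scatter per group, drawn on the same axes
twoway (scatter price weight if foreign == 0, msymbol(O)) ///
       (scatter price weight if foreign == 1, msymbol(T)), ///
       legend(order(1 "Domestic" 2 "Foreign")) ///
       xtitle("Weight (lbs.)") ytitle("Price")
```

This works well for a small number of groups; with many groups, by() panels are usually easier to read.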

You can also change the shape of the scatter points using the msymbol() option with the
following symbols:

• diamond (d)
• circle (o)
• triangle (t)
• plus (+)

Lowercase letters draw small markers and uppercase letters (D, O, T) draw large ones.

Options: To change the shape of the points on the graph from dots to triangles, use the msymbol()
option:
graph twoway scatter rent size, msymbol(t)

Likewise, you can change the colors using the mcolor() option (as provided in your do-files):
graph twoway scatter rent size, mcolor(lime) msymbol(t)

You can also change the intensity of the color by multiplying the color name by a number. Values
between 0 and 1 lighten the color (0.1 is very pale, 0.9 is close to the full color), 1 leaves it
unchanged, and values above 1 darken it.
graph twoway scatter rent size, mcolor(lime*0.3) msymbol(t)

*Add axis labels and a title to your scatter plot (type the command on one line):

graph twoway scatter rent size, mcolor(lavender) msymbol(t) xtitle(Size) ytitle(Rent) title(Rent vs Size)
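
A subtitle can be added in the same way with the subtitle() option. The lines below are a sketch on the auto data; the marker color, title text and subtitle text are assumptions chosen for illustration.

```stata
sysuse auto, clear
* Large triangles in a lightened color, with title, subtitle and axis titles
twoway scatter price weight, msymbol(T) mcolor(navy*0.5) ///
    title("Price vs Weight") subtitle("1978 automobile data") ///
    xtitle("Weight (lbs.)") ytitle("Price")
```

Putting titles in double quotes is the safer habit, since it keeps commas or parentheses inside the text from being read as part of the command.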

Questions
Q1: As discussed earlier, the distribution of a continuous variable can be
visualized with a histogram. Using the 'histogram' command, draw the histogram of the
variable ‘income’.
Q2: As you will have noticed, the y-axis of the histogram of income shows density. Can you
change the y-axis to the frequency of the values in the variable 'income'? Using the
STATA help on histogram, redraw the histogram with frequency on the y-axis; your
graph should have 10 bins.
Q3: Draw a graph to show the distribution of rent for each state within the same graph
and comment on which state shows the widest distribution of rent.
Q4: Draw a scatterplot of rent (y-axis) against size (x-axis). The points should be
triangles and they should be red.
