
Data Analysis

Marc-Andrea Fiorina
September 27, 2023

• During the training, find all materials in our shared OneDrive: bit.ly/rrf23-materials

Development Impact Evaluation (DIME)


The World Bank
Introduction
Outline of data analysis process
Exploratory analysis
• Get quantities, concepts, and
presentation “sketched out”
• Dynamic documents provide
“dashboard” view of work
Final analysis
• Select, polish, and produce
publication-ready exhibits
• Dynamic documents compile
completed results
Export reproducible outputs

2
Outline of data analysis process

Two stages of data analysis:

• Exploratory data analysis: the research team will look for patterns in the
data, in a descriptive fashion
• Final data analysis: the research team will decide on, polish, and structure the
main results for research outputs

Note: For projects with pre-analysis plans, the main specifications will be
pre-defined, so the exploratory phase has fewer implications for final outputs.

3
Exporting “raw” reproducible outputs
Raw outputs
• Results are exported to files that
can be used as inputs for papers
and reports
• Self-standing tables and graphs
• Accessible formats
• PNG
• EPS
• TEX
• XLSX
Exploratory analysis outputs can be
compiled dynamically with various tools

4
Compile “final” results in dynamic documents

All final outputs should be dynamic


• Final outputs such as papers, briefs,
and reports created from results
should update automatically when
the raw outputs are updated
• Almost all final outputs should be
published in PDF or HTML (even
presentations)
• LaTeX is an extremely useful tool
and R Markdown is gaining
popularity (LaTeX training)

5
Exploratory data analysis
Structure of analysis workflow

              Stage One ("Exploratory")           Stage Two ("Final")

Purpose       Rapid iteration                     Polished communication
Coding        As little (efficient) as possible   As much as it takes
Flexibility   High: iterating design              Low: data structure fixed
Formatting    Little                              Complete styling and annotation

6
Stage One: Exploratory analysis

Exploratory analysis is “Stage One” work:

• No polished or custom-formatted outputs – need to iterate designs quickly


• Team knows the study well, so you can understand results without visual
narrative components
• Organize exploratory results broadly into “not-too-long” scripts
• Good code hygiene: Always clear the workspace and load analysis data
fresh (fully modular code)
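
As a sketch, a fresh-workspace header for an exploratory do-file might look like this (the global, dataset path, and variable names are hypothetical):

```stata
* Exploratory script sketch -- global, path, and variable names are hypothetical
clear all                                  // start from a completely fresh workspace
use "${data}/analysis_data.dta", clear     // load the constructed analysis data fresh

* Quick descriptive look, no formatting
tab treatment
summarize outcome_1 outcome_2, detail
```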

7
Exploratory Analysis: Code organization

Data construction scripts stay separate from data analysis scripts.

• If you need to add information or change grouping attributes or other data


characteristics, go back to data construction
• Exploratory analysis allows you to figure out what samples and variables you
need (and don’t need)
• You will always run data construction scripts followed by data analysis scripts
through the master script
• Typically, you will have all three open at the same time so you can move back
and forth easily and keep code organized

8
Exploratory Analysis: Code organization

Exploratory analysis code is separate, with potentially many files:

• “Chunks” of code should run independently and be clearly labeled with


comments
• Use different do-files for different topics
• OK to include relevant interactive outputs (like tab, table, reg)
• You will want to output most results as easily-browsable files (.png, .csv, etc.)
• Separate GitHub branches can be useful to experiment with new analysis
methods/outputs
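
For instance, one topic-specific do-file might contain independently runnable, clearly labeled chunks like these (variable, global, and file names are hypothetical):

```stata
*--------------------------------------------------------*
* Chunk 1: Balance of sample across arms (hypothetical)
*--------------------------------------------------------*
tab treatment, missing

*--------------------------------------------------------*
* Chunk 2: Outcome distribution, saved as a browsable PNG
*--------------------------------------------------------*
histogram outcome_1, by(treatment)
graph export "${outputs}/explore-outcome1.png", replace
```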

9
Example of exploratory analysis code

10
Example of exploratory analysis output

11
Example of exploratory analysis output

12
Final data analysis
Structure of analysis workflow

              Stage One ("Exploratory")      Stage Two ("Final")

Purpose       Rapid iteration                Polished communication
Coding        As little as possible          As much as it takes
Flexibility   High: iterating design         Low: data structure fixed
Formatting    Little                         Complete styling and annotation

13
Stage Two: Final outputs

Final analysis is “Stage Two” work:

• Getting outputs into publication-ready format is time consuming


• Hard-code custom formatting only after the team has agreed on a final style
of the output
• The goal is to reduce the number of times you will need to make very precise
adjustments to the code and aesthetics of the output

14
Code organization

Data construction scripts stay separate from data analysis scripts.

• The data sets and variables you need should have been completed in
exploratory analysis
• This means you should not be subsetting data or generating new variables
• You should also remove or run quietly all commands with console outputs
(including regress) – Stata in particular is very slow in printing to the Results
window
• If you think you need to, think carefully about why you are changing the data...

15
Using analysis data: Final outputs

Developing “visual storytelling” components

• Tables and figures must be self-standing: it should be easy to read and


understand them with only the information they contain
• Remember to add labels to variables and axes
• Include in the notes all relevant information, such as sample used, model
specification, units and variable definitions
• Highly accessible formats can be useful for sharing individual outputs (PDF,
PNG, XLSX)
• Lightweight technical formats are ultimately preferred and can be
version-controlled (TEX, CSV, EPS)

16
Final analysis: Script organization

A well-organized analysis folder:


• Has “the right number” of separate do-files: You can group outputs
conceptually, but the essential part is that it is easy to know where the code
for each output is
• Typically, outputs will be created in order within at least four scripts
(main_figures, main_tables, appendix_figures, appendix_tables)
• Add more files if code for any output is very long (complex calculations or
hard-coded formatting) – you can always call longer code from the above files
using do long-simulation.do or equivalent
• Effective organization allows a reader to quickly find and run the part of the
code they are interested in from looking only at the final output (manuscript,
report, brief, presentation, etc.). DIME Analytics reproducibility reviewers (and
future you) will thank you!

17
Final analysis: Script organization

A well-organized analysis script:

• Starts with a completely fresh workspace


• Loads the constructed dataset (in Stata, with use at the start of every script)
• Is written so research decisions are clear (sample, clusters, controls)
• Has simple code that allows the reader to focus on the econometrics
• Exports the results obtained to an accessible file format
• Runs completely independently of all other scripts, except for the master script
• Can be linked to its outputs by name
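
A minimal skeleton satisfying these points might look like this (the global, file names, and variable names are hypothetical):

```stata
* main_tables.do -- sketch of one final-analysis script; all names hypothetical
clear all                                  // completely fresh workspace
use "${data}/analysis_data.dta", clear     // constructed dataset, loaded fresh

* Table 1: main specification -- sample, clusters, and controls stated explicitly
eststo clear
eststo: quietly regress outcome_1 treatment baseline_control, ///
    vce(cluster village_id)
esttab using "${outputs}/tab-1.tex", label se replace    // accessible, named output
```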

18
Example of final analysis code

19
Final analysis: Professional expectations

• Is the design clean, professional, and complete?


• Can someone else understand what the figure shows without any
additional information?
• Have you written detailed notes for the result (not in the code) to accompany
it?
• Is the number of observations correct given your data?
• Are the results realistic given the econometric question?
• Is the result interpretable to a real-world meaning?
• Are the scales, units, and measures visible and appropriate?

20
Example of final analysis output

21
Example of final analysis output

22
Automating outputs from
statistical software
All results must be automatically exported from code

Never set up a workflow that requires


copying and pasting final results!

23
Automating outputs from code

All analysis results are exported first as “raw” outputs – even final analysis

• You can find example do-files for tables at


https://github.com/worldbank/stata-tables
• Stata commands for tables: estout, outreg2, and outwrite
• Stata command for figures: graph export
• R packages for tables: stargazer, huxtable, and kable
• R package for figures: ggplot2

24
Automating outputs from code

Manual outputs are impossible to replicate

• You will always need to update outputs


• Copying results from a software console is error-prone, inefficient, and
unnecessary
• Automating the creation of outputs will save you time
• Our blogpost (about tables):
https://blogs.worldbank.org/impactevaluations/nice-and-fast-tables-stata

25
Exporting figures in Stata

Figures must be exported in high-quality, accessible file formats with


informative names (like fig-1.png)

• Always use graph export for final outputs – PNG is the most common, but
EPS or TIF may be required for some publishers
• Detailed information about creating visualizations on the DIME Wiki at
https://dimewiki.worldbank.org/Stata_Coding_Practices:_Visualization
• Depending on your use case (rendering is slow), use the nodraw option
whenever possible to avoid repeatedly drawing images on screen
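
A sketch of that pattern (the graph name, global, and variable names are hypothetical):

```stata
* Build the graph in memory without drawing it, then export once -- names hypothetical
graph twoway (scatter outcome_1 age), name(fig1, replace) nodraw
graph export "${outputs}/fig-1.png", name(fig1) width(2000) replace
```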

26
Exporting tables in Stata

Tables must be exported in publication-compatible file formats with


informative names (like tab-1.tex)

• DIME Analytics commands like iebaltab have built-in support for common
export formats. Use these!
• estout can solve most of your problems
• It can export both summary statistics and regression tables easily
• It also supports a lot of customization, and exports both to Excel and LaTeX
• In Stata 17, table and collect export have new functions and syntax
• In most recent versions, putexcel and putdocx can also be useful
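
As a sketch of both uses of estout in one script (variable, global, and file names are hypothetical; esttab picks the output format from the file extension):

```stata
* Summary statistics table via estpost + esttab -- names hypothetical
estpost summarize outcome_1 outcome_2 baseline_control
esttab using "${outputs}/tab-sumstats.tex", cells("mean sd min max") replace

* Regression table with the same command, exported to CSV
eststo clear
eststo: regress outcome_1 treatment
esttab using "${outputs}/tab-reg.csv", se label replace
```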

27
Exporting tables in Stata

Some less common alternatives:

• You may save results as a dataset in various ways (such as svmat), format
them there, and then export them to Excel with export excel, to CSV with
export delimited, or to LaTeX with dataout
• You can create matrices and export them using mat2txt or outwrite. This
tends to make the code harder to read, and there are easier ways to export
tables in most formats.
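
A sketch of the results-as-dataset route (variable and path names are hypothetical):

```stata
* Turn regression results into a dataset, then export -- names hypothetical
regress outcome_1 treatment
matrix b = r(table)'               // one row per coefficient
clear                              // drops the data but keeps the matrix
svmat b, names(col)                // matrix columns become variables
export delimited using "${outputs}/tab-raw.csv", replace
```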

28
Exporting tables in Stata

If you need to create a table with a very particular format, consider writing it
manually using file write. If you do this, make sure you have very clear
comments and organization so readers can easily locate the important statistics
and econometrics and ignore the formatting commands.
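
A sketch of that structure, with comments separating the statistics from the formatting (file handle, path, and variable names are hypothetical):

```stata
* Hand-written LaTeX table via file write -- names and layout hypothetical
summarize outcome_1
file open fh using "${outputs}/tab-custom.tex", write replace
file write fh "\begin{tabular}{lc}" _n
* ---- the statistic that matters is written on this line ----
file write fh "Mean outcome & " %9.2f (r(mean)) " \\" _n
* ---- everything below is formatting only ----
file write fh "\end{tabular}" _n
file close fh
```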

29
Resources for outputs in R

R commands are slightly more idiosyncratic than Stata's but tend to be


well-supported for export:

• For R users, the stargazer and huxtable packages are the easiest way to
export formatted regression and summary statistics tables to LaTeX (and HTML)
• Use modelsummary and gtsummary where appropriate. Info at
https://github.com/RRMaximiliano/r-latex-tables-sum-stats
• Creating custom tables is also much easier in R, since you can combine
objects into data frames and matrices and use one of these commands, or
even write.csv, to export them
• You can find sample codes and examples in our DIME R training repository at
https://github.com/worldbank/dime-r-training

30
Resources for outputs in R

Custom R functions for output creation

• It is very easy and time-saving to create custom functions in R


• As a result, we can use functions to create templates for plots and tables,
which can then be used repeatedly to avoid unnecessary copy/pasting
• With enough experience, you can even use this to automate customization of
stargazer or huxtable outputs, or even to create custom LaTeX templates
• You can find more information on how to do this in Chapter 19 of the
ggplot2 textbook

31
Next steps and resources
What next?

• If you follow the steps outlined in this lesson, most of the data work involved in
the last step of the research process – publication – will already be done.
• Your analysis code will be organized in a reproducible way, so all you will
need to do to release a replication package is a last round of code review.
• This will allow you to focus on what matters: writing up your results into a
compelling story.

32
DIME resources

• Development Research in Practice -


https://worldbank.github.io/dime-data-handbook/analysis.html
• R Econ Visual Library -
https://worldbank.github.io/r-econ-visual-library
• Stata Visual Library -
https://worldbank.github.io/Stata-IE-Visual-Library
• DIME LaTeX training -
https://github.com/worldbank/DIME-LaTeX-Templates
• Checklist: Reviewing graphs -
https://dimewiki.worldbank.org/Checklist:_Reviewing_Graphs
• Checklist: Reviewing tables -
https://dimewiki.worldbank.org/Checklist:_Submit_Table

33
External resources

• Grids of Numbers (Butterick):


https://practicaltypography.com/grids-of-numbers.html
• Common Issues in Exhibits (JCE):
https://www.jclinepi.com/content/checklist_for_tables_and_figures
• Data Visualization Checklist (Evergreen):
https://stephanieevergreen.com/data-visualization-checklist
• Accessible Data Visualization (Organ):
https://towardsdatascience.com/an-incomplete-guide-to-accessible-data-visualization-33f15bfcc400

34