Preparing Data Science Career

Title: A Three-Step Guide to Training Computational Social Science Ph.D. Students for Academic and Non-Academic Careers

Aniket Kesari, Research Fellow, Information Law Institute, New York University,

aniket.kesari@nyu.edu

Corresponding author, Jae Yeon Kim, Assistant Professor, KDI School of Public Policy

Management, jaeyeonkim@kdis.ac.kr

Sono Shah, Computational Social Scientist, Data Labs, Pew Research Center,

sshah@pewresearch.org

Taylor Brown, Research Scientist, Computational Social Science Group, Meta,

taylor.w.brown@duke.edu

Tiago Ventura, Postdoctoral Associate, Center for Social Media and Politics, New York

University, venturat@umd.edu

Tina Law, Ph.D. Candidate, Department of Sociology, Northwestern University,

tina.law@u.northwestern.edu

The authors are listed alphabetically, as all contributed equally to this article.

Abstract

In recent years, social scientists with data science skills have gained positions in academic and

non-academic organizations as computational social scientists who blend skillsets from data

science and social science. Yet as this trend is relatively new in the social sciences, navigating

these emerging and diverse career paths remains ambiguous. We formalize this hidden

curriculum by providing a step-by-step guide to graduate students based on our collective

experiences as computational social scientists working in academic, public, and private sector

organizations. Specifically, we break down the computational social science (CSS)

professionalization process into three steps: (1) learning data science skills; (2) building a

portfolio that focuses on using data science to answer social science questions; and (3)

connecting with other computational social scientists. For each step, we identify and elaborate on

core competencies and additional useful skills that are specific to the academic and

non-academic job markets. Although this article is not exhaustive, it provides a much-needed

guide for graduate students, as well as their faculty advisors and departments, to navigate the

growing field of computational social science. By sharing this guide, we hope to help make

computational social science professionalization more systematic and accessible.

Keywords

data science, computational social science, academic job market, non-academic job market

Media summary

This article is an accessible guide for members of the social science community who are

interested in the growing trend in “Computational Social Science.” Our primary audience is

early-stage social science doctoral students who wish to tailor their graduate training toward

computational social science, but the guide should also be relevant to pre-doctoral and later-stage

students, faculty, and members of the social science community more broadly. This guide should

be accessible for students with little prior technical training and students with more substantial

training. We emphasize three key steps for pursuing a computational social science career: (1)

learning data science skills, (2) building a data science portfolio, and (3) connecting with other

computational social scientists.

We begin by providing an overview of the technical skills we believe computational

social science Ph.D. students, who blend skillsets from data science and social science, should

acquire during their programs. We survey some of the latest developments in computational

social science including working with novel data sources such as text and images, manipulating

“big data” through databases, and using machine learning to drive new social science insights.

Throughout this article we make two main points about computational social science training.

The first is that we encourage students to invest early in developing programming skills in

languages such as R and Python because these foundations enable further advanced applications

later in their graduate careers. The second is that we surface how a combination of coursework,

internships, and independent research can prepare students to become computational social

scientists. We do not prescribe any one set path for students, but rather highlight options and the

tradeoffs involved with different approaches to graduate training.

Beyond skills acquisition, we also turn to practical advice about career preparation. We

discuss how to build a data science portfolio for contributing to open science and building a

demonstrable record of work in this area. We conclude with concrete recommendations about

how departments may adapt their current curricula to give students the flexibility needed to

pursue this exciting new field. Overall, we aim to keep this article general enough that it

is helpful to students across the social science disciplines and gives them a starting place

for how to define themselves as computational social scientists.

1. Introduction

Computational social science is a rapidly growing field that engages the social sciences

and data science (Salganik 2019: xviii; Edelmann et al. 2020). The field applies novel digital

data and computational methods to advance social scientific understanding of human behaviors

(Edelmann et al. 2020). As more and more social scientists have gained training and experience in

data science during their graduate studies, an increasing number of them have gained positions as

computational social scientists in academic and non-academic organizations. Both academic and

industry research organizations increasingly see the value in core social science skills covering

the analysis of human behavior, institutions, and policy, and clamor for individuals who can

bring these foundations to the study of massive data. We define computational social scientists as

a type of data scientist who comes from a social science background, but is more invested in

programming and other data science skills than is required in most social science Ph.D.

programs. Places of employment for computational social scientists now include: academic

departments; professional schools; nonprofits (e.g., Code for America, Pew Research Center’s

Data Labs, Urban Institute); tech companies (e.g., Meta, Twitter, Google, Amazon, Microsoft,

among others); international organizations (e.g., The World Bank, UN Global Pulse Labs); and

government agencies (e.g., US Federal Reserve, Census Bureau, Office of Evaluation Sciences,

The Lab at DC).1

1 For a more comprehensive list of potential opportunities for students and graduates in this field, see these helpful resources curated by Ben Green: https://www.benzevgreen.com/jobs/

Indeed, there appears to be growing interest among social scientists in pursuing careers in

data science. The 2016 Moore-Sloan Data Science Environment Survey collected responses on

data science careers from students, researchers, staff, and faculty connected to the University of

California, Berkeley, New York University, and the University of Washington, Seattle. The

results show that the percentage of respondents working in data science careers who identify

social sciences as their primary field is now ranked second (17.9%), only after physical sciences

(19.6%) (Geiger et al. 2018: 8-9). Of course, these three universities are far from representative

samples of US research universities or academic institutions in general. Yet they still matter:

these universities are key members of the data science field in the US and the world, and

the survey responses are therefore likely signals of broader trends.

Despite this growing demand to engage in data science among social scientists, building

and navigating data science education remains elusive for many social science Ph.D. students

because training to be a computational social scientist still feels like a “hidden curriculum”

(Calarco 2020; Barham and Wood 2021). The notion of pursuing data science as a career is

relatively new and there are diverse paths to becoming a data scientist, which means that

advising on this subject goes beyond the traditional capacity and resources of many academic

departments. The data science hidden curriculum exists in two ways. First, there is a lack of

formal training. Even in the U.S., where data science programs and courses are increasingly

available on university and college campuses, most Ph.D. programs in social sciences do not

offer systematic and formal training in data science or dedicated advising to help students to

navigate data science careers in the academic and non-academic job markets. Second, even when

such training exists, barriers for historically underrepresented groups in data science persist as

society has told them this kind of analytic work is “not for them.” Therefore, we need to

increase formal training in data science in social science Ph.D. programs and make this training

relevant for and accessible to all (Lue 2019; Kim and Ng 2022).

Within this evolving context, some innovative solutions have emerged, but they remain

ad hoc. Namely, computational social science or data science summer institutes, such as the Summer

Institutes in Computational Social Science (SICSS) and Data Science for Social Good (DSSG),

have gained popularity among early-career scholars. Both programs provide social scientists with

additional data science training, networking, and professionalization opportunities.

The Inter-university Consortium for Political and Social Research (ICPSR) has also offered more data science

courses, such as computational text analysis, in recent years. Nevertheless, the limitations of these

alternative programs are apparent. These are intensive summer programs and are not set up for

long-term training, networking, and professionalization of computational social scientists. While

these programs aim to be inclusive, they are also highly selective and therefore can only support

limited numbers of computational social scientists for a short period of time.

To initiate conversations about more formal and long-term training, professionalization,

and support of computational social scientists, this article seeks to make explicit the informal

knowledge of navigating computational social science for Ph.D. students in the social sciences.

We provide a step-by-step guide to social science graduate students on building computational

social science careers based on our collective experiences as computational social scientists

working in academic, public, and private sector organizations. We leverage our diverse

experiences to identify common themes in the computational social science professionalization

process that interested students may face and point them toward a variety of useful resources at

their home institutions and beyond.

Many of these recommendations overlap with the areas of data acumen highlighted in the

“Undergraduate Data Science” report (NASEM 2018) in that we emphasize core foundations,

experience working with real datasets, and ethical considerations. We add to this existing

conversation by considering the needs of social science Ph.D. students specifically. Graduate

social science programs differ from undergraduate data science majors in several important

ways, and we tailor our framework to highlight the synergies between data science and graduate

programs’ emphasis on domain knowledge, research skills, and scientific collaboration. In

addition, as social science graduates are increasingly interested in non-academic careers, we

focus on providing information on the differences that computational social science Ph.D.

students can expect in terms of preparing for academic versus non-academic careers.

Specifically, we break down the CSS professionalization process into three steps: (1)

learning data science skills; (2) building a computational social science portfolio; and (3)

connecting with other computational social scientists. For each step, we identify and elaborate on

core competencies and additional useful skills that are specific to the academic and

non-academic job markets (see Table 1). We conclude by briefly proposing some curricular

changes that departments can consider adopting to better support their emerging computational

social scientists. The guide we provide here is not exhaustive and is just one perspective on how

graduate students, as well as their faculty advisors and departments, can navigate the rapidly

growing field of computational social science. We hope to start a conversation on how the

computational social science professionalization process can be more systematic—putting the

scaffolding in place to make it accessible.

Table 1: Computational Social Science Professionalization Process

Step 1: Learning Data Science Skills

Core Competencies
● Engagement with ethical concerns that arise when working with such data and paradigms (e.g., privacy protection, algorithmic bias, etc.)
● Programming fluency in R and/or Python
● Experience with data management and teamwork
● Facility for working with large, messy, and sometimes unstructured data
● Solid grasp of common inferential methods, such as hypothesis testing, statistical modeling, etc.
● Understanding of machine learning. Students should develop an “index” in their heads regarding what specific tools they need to solve specific problems and their limitations
● Knowledge of different frameworks for framing problems, as well as their limitations. Students in social science disciplines might encounter the causal inference framework, machine learning for prediction policy problems, machine learning for developing social science measurements, etc., and should be aware of the debates around how these approaches frame scientific questions.

Additional Market-Specific Skills
● Specialization in one or more computational methods (e.g., natural language processing) or statistical methods (e.g., survey methodology and field experimentation)
● Domain expertise (e.g., behavioral science, online trust and safety, population and area studies)
● Proficiency with SQL and cloud-based databases (both academic and non-academic)

Step 2: Building a Data Science Portfolio

Core Competencies
● Engagement with social aspects of a data science project via problem definition, hypothesis generation, data and outcome selection, etc.
● Reproducible, efficient, and communicable code via GitHub
● Journal publications/conference proceedings

Additional Market-Specific Skills
● Experience with teaching programming and computational methods (academic)
● Publicly available data science projects documented from end (problem definition) to end (value communication)

Step 3: Connecting with Other Computational Social Scientists

Core Competencies
● Learning with other computational social scientists through interdisciplinary (and often cross-organizational) workshops and trainings
● Sharing research with other computational social scientists at interdisciplinary conferences or disciplinary pre-conferences focused on computational social science

Additional Market-Specific Skills
● Working with other computational scientists through internships (non-academic)

2. Step 1: Learning Data Science Skills

Methodological training for graduate students in the social sciences traditionally consists

of coursework on research design, statistics, and qualitative methods, with the exact mix varying

by discipline. There is a clear structure to how one deepens their training. For example, graduate

students interested in conducting research with quantitative methods can supplement their

introductory coursework on model-based statistical inference with advanced coursework on

causal inference.

Computational methods are a new frontier. We argue that in addition to the traditional

methods curriculum offered by most social science departments, students interested in

computational social science should develop programming fluency in R and/or Python, gain

experience with data management and teamwork, work with large and unstructured data,

understand machine learning as a social science framework, and engage with ethical concerns

that arise when working with big data and prediction modeling. When approaching these topics,

students may choose a mixture of coursework and independent social science projects. Students

should balance between in-house coursework in their own departments and coursework in other

departments. The advantage of in-house coursework is that it will often be more tailored toward

preparing students to conduct computational social science research within their own discipline,

and may provide a smoother on-ramp for students who are new to methods. External coursework

can be advantageous because students will be exposed to students from other disciplines within

and outside the social sciences, and they may benefit from instructors whose research careers are

squarely focused on these topics (e.g., taking a machine learning course from a statistics or

computer science professor). Coursework should be complemented with projects, either as a solo

endeavor or in collaboration with faculty and other students. This advice is especially relevant

after roughly the midway point of a Ph.D. program, when most students transition to conducting

independent research. This framework provides the comprehensive and systematic

foundation that one needs to be an effective self-learner both in the early stages of a graduate

program, while choosing coursework, and in the later stages, when pursuing independent

projects.

2.1 Proficiency in Programming, Data Management, and Teamwork

Programming fluency. Regardless of the diversity in learning paths, programming

fluency is the core skill. Programming fluency means the ability to program in popular languages

used for data science, such as R and Python. One can still conduct statistical analysis using

proprietary commercial software such as Excel, SPSS, and Stata. However, these proprietary

software programs raise problems for transitioning from social science to computational social

science. First, these tools are not well suited for harnessing automation: the force behind

large-scale data collection and analysis. The scalability of general programming languages like R and

Python makes knowledge of them a requirement of the field, particularly in settings like the tech

industry. Second, one cannot take advantage of advancements in modern data science such as

machine learning and natural language processing using these limited tools. This is not

simply a matter of tools like SPSS and Stata needing to catch up to integrate machine

learning: the flexibility and scalability of general programming languages are crucial

for properly implementing these methods (Kim and Ng 2022). In addition, learning R and/or

Python is useful because one can build an open source project using these languages (e.g., R

package or Python library), which can be a major component of data science portfolios. Finally,

investing in these languages helps students join the data science communities built around R (e.g.,

#rstats, RStudio Conference) and Python (e.g., PyData, The Python Conference). Which language

students start with does not matter much, especially for beginners. A language that gets the

job done is the right tool for the job. If needed, students can learn another language after

becoming fluent in at least one. Moreover, the ecosystems of these two languages have

become more integrated, as evident in RStudio changing its name to Posit in 2022 to broaden its

focus to include Python and VS Code.

Data management. As the popularity and scale of computational social science research

grows, there is an increasing need for familiarity with managing the vast quantities of data that

are often collected and stored throughout the research process. At a certain point, the scale of

data requires researchers to implement data management practices to allow for ease of access and

analysis. Many students are likely already familiar with the basics of spreadsheet software such

as Excel. Students will generally have an intuitive understanding of data being stored in rows and

columns, with rows containing observations and columns containing information about them.

They may even have some practice with summarizing, aggregating, and manipulating data to

perform some basic calculations and visualizations. This knowledge is easily translated to the use

of “dataframes” in R and Python and students can learn the basics of working with spreadsheet

data in these languages. Eventually, students may find that they need to work with increasingly

large datasets, or with data structures other than Excel spreadsheets, flat files (e.g., csv, tsv, dta, etc.),

and other traditional data formats. In many cases, this involves using cloud computing resources,

such as AWS, Microsoft's Azure, or Google’s BigQuery for storing data or conducting analysis.
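
As a minimal sketch of the rows-and-columns intuition described above, the following Python snippet reads a small flat file and aggregates one column by another using only the standard library. The file contents, the column names, and the `mean_by_group` helper are hypothetical and made up for illustration; dataframe libraries in R and Python offer the same kind of operation more concisely.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical flat file: one row per survey respondent (observation),
# one column per variable. The values are made up for illustration.
RAW_CSV = """respondent_id,region,age,trust_score
1,north,34,3.5
2,south,51,2.0
3,north,29,4.5
4,south,46,3.0
5,west,38,4.0
"""


def mean_by_group(csv_text, group_col, value_col):
    """Return {group: mean of value_col} for the rows of a CSV string."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups[row[group_col]].append(float(row[value_col]))
    return {group: mean(values) for group, values in groups.items()}


# Aggregate the trust score by region, as one might with a pivot table.
summary = mean_by_group(RAW_CSV, "region", "trust_score")
for region, avg in sorted(summary.items()):
    print(f"{region}: {avg:.2f}")
```

The same grouping logic carries over when the data no longer fit in a spreadsheet; only the storage layer changes.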

Computational social science research also often makes extensive use of observational

data, ranging from digital forms of “big” data, such as tweets and other social media feeds, to

large-scale administrative data such as tax records, medical claims, 311 complaints, and

student-level educational records. Beyond storage and access challenges, this kind of data

frequently requires extensive cleaning and manipulation, often involving linking and

de-identification, before it can be used in analysis. In keeping with established best practices for

scientific computing, researchers should ideally preserve any data collected in its original form

and explicitly record all the steps used to process it. These steps are critical to ensuring that

research is reproducible. Here, researchers would benefit from familiarity with script-based data

workflow management tools such as targets, snakemake, or make. Students should also become

familiar with tools for preserving the privacy of data subjects. Concepts such as differential

privacy, encryption, and federated learning may be relevant, and implementing these concepts in

real-world algorithmic systems is an active area of research.
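
To make the practices above concrete, here is a minimal Python sketch, under stated assumptions, of a processing step that leaves the raw records untouched, records each step it takes, and replaces a direct identifier with a keyed-hash pseudonym. The record fields, the `SECRET_KEY` value, and the `pseudonymize` and `process` helpers are hypothetical. Keyed hashing is only one simple form of pseudonymization; it is not differential privacy and does not by itself guarantee anonymity.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored apart from the data.
SECRET_KEY = b"replace-with-a-project-secret"


def pseudonymize(identifier, key=SECRET_KEY):
    """Map an identifier to a stable pseudonym using HMAC-SHA256."""
    digest = hmac.new(key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def process(raw_records):
    """Return (cleaned records, processing log) without mutating the input."""
    log = []
    # Step 1: drop records with a missing outcome.
    kept = [r for r in raw_records if r["outcome"] is not None]
    log.append(f"dropped {len(raw_records) - len(kept)} records with missing outcome")
    # Step 2: replace the direct identifier with a pseudonym.
    cleaned = [
        {"person": pseudonymize(r["name"]), "outcome": r["outcome"]} for r in kept
    ]
    log.append("replaced 'name' with keyed-hash pseudonym 'person'")
    return cleaned, log


# Made-up records standing in for an administrative dataset.
raw = [
    {"name": "Ada Example", "outcome": 1},
    {"name": "Ben Example", "outcome": None},
    {"name": "Ada Example", "outcome": 0},
]
cleaned, log = process(raw)
print(log)
```

Because the same name always maps to the same pseudonym under a given key, records for one person can still be linked after the name is removed.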

Another key skill is learning how to document datasets and accompanying code. Code

and data documentation (e.g., written information about coding decisions and the variables used

in the dataset) serve several purposes. They ease communication with other users of the data, as well

as with one’s future self. They maximize the usefulness of a dataset by making it readable and usable for

further analyses. Relatedly, learning version control ensures that data is not easily destroyed or

lost in a long project.

Teamwork. Computational social science research is rarely carried out individually and

is instead often a collaborative effort involving researchers, data scientists, and engineers, among

others. Such collaborative environments are standard both in industry and in

academic research centers working on computational social science. Here, familiarity with

version control tools such as Git and GitHub is essential for working on larger-scale

projects. Git and GitHub are now effectively a prerequisite for collaborative computational

social science research projects, as these tools allow users to work collaboratively on multiple

branches of the same projects, publicly share their code, efficiently track changes and move back

and forth across different versions of a particular project.

Based on our academic and professional experiences, academic and industry employers

vary in the degree to which they employ version control (through Git) in their day-to-day

activities. Therefore, we do not advise students to overprepare and become Git wizards. But

we do advise students to learn the basics of Git (init, commit, push, clone, and branch) as

part of their training to pursue a computational social science career. There are several intuitive

ways to gain this basic facility. GitHub offers a desktop graphical user interface that allows users

to point-and-click their way through the basic version control workflow. RStudio similarly

provides a lightweight graphical interface for working with Git. While the full capabilities of

Git require use of the command line, these tools can be useful as students are starting out. For

students who want to develop intuitions around Git and GitHub, there are many tutorials on this

subject on the GitHub Learning Lab (https://lab.github.com/) and there is even a game for

learning Git (https://ohmygit.org/).

2.2 Facility for Working with Large and Unstructured Data

Put plainly, learning how to query and manage

databases using Structured Query Language (SQL) is a fundamental skill. Most industry data is

big and proprietary. Therefore, as a computational social scientist, your data is likely to live in a

cloud database, not on a personal computer or laptop. In this scenario, knowing how to communicate

with a database, and specifically how to query it using SQL, becomes a required skill.

Fortunately, one can learn SQL without setting up a

database. For instance, SQL practice can be supplemented with experience in R and “dbplyr,” which is a

database backend for “dplyr,” a popular data manipulation tool in the R ecosystem.
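
One way to practice SQL without setting up a database server is an in-memory SQLite database, sketched below with Python's built-in `sqlite3` module. This is an illustration under assumptions rather than the dbplyr workflow mentioned above: the `complaints` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory SQLite database: nothing is installed or written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE complaints (id INTEGER PRIMARY KEY, borough TEXT, category TEXT)"
)

# Made-up rows standing in for a large administrative table.
rows = [
    (1, "Bronx", "noise"),
    (2, "Bronx", "heat"),
    (3, "Queens", "noise"),
    (4, "Queens", "noise"),
    (5, "Brooklyn", "heat"),
]
conn.executemany("INSERT INTO complaints VALUES (?, ?, ?)", rows)

# Count noise complaints per borough, largest first.
query = """
    SELECT borough, COUNT(*) AS n
    FROM complaints
    WHERE category = ?
    GROUP BY borough
    ORDER BY n DESC, borough ASC
"""
noise_counts = conn.execute(query, ("noise",)).fetchall()
print(noise_counts)
conn.close()
```

The `SELECT ... WHERE ... GROUP BY ... ORDER BY` pattern is the part that transfers to cloud databases; the connection setup will differ by provider.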

2.3. Nuanced Understanding of Machine Learning and Predictive Paradigms

Both graduate students and their departments should also be cognizant of the fact that

computational social science involves not just a set of tools, but also a reframing of what

constitutes a social science question. Particularly in quantitative social science, social science

disciplines place a strong emphasis on causal inference. Computational methods offer many

innovations in this space such as synthetic control methods (e.g., Abadie, Diamond, and

Hainmueller 2013), double robust machine learning (Chernozhukov 2016), and sensitivity

analysis (VanderWeele and Ding 2017). Computational social science in many ways provides a

natural evolution of existing methods training.

That being said, machine learning also opens up predictive questions. Take the example

of regression methods. Traditional regression-based approaches to causal inference require that

the analyst approach the question with a theory of how different variables might affect an

outcome of interest and construct a model that controls for these variables in order

to identify the causal effect of one of those variables on the outcome. A machine learning

implementation of regression eschews this approach, and instead the analyst lets a computer

algorithm identify the best model by trying different combinations of variables and models to get

the best predictions of an outcome. Machine learning is therefore well suited to prediction policy

problems, though this is only one dimension of policy problems and has clear limitations. These

questions by no means replace quantitative social science’s traditional focus on causation, and

indeed machine learning and causal inference are being increasingly integrated (Athey 2015), but

students and departments should be aware of this conceptual shift. Departments should

encourage students who wish to pursue these new types of innovative questions in their

dissertations and other work, and recognize that these types of studies will become more

common.
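
The predictive workflow described above can be sketched in a few lines of Python. This is a deliberately simplified illustration under assumptions: the data are synthetic, the variable names `income` and `noise` are hypothetical, and each candidate model uses a single predictor, whereas real applications would compare richer models and typically use cross-validation. The key point is that the model is chosen by how well it predicts held-out data, not by a prior theory of which variable matters.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(a, b, xs, ys):
    """Mean squared prediction error of the fitted line."""
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data (an assumption for illustration): the outcome depends on
# "income" but not on "noise".
rng = random.Random(0)
n = 200
data = {
    "income": [rng.gauss(0, 1) for _ in range(n)],
    "noise": [rng.gauss(0, 1) for _ in range(n)],
}
y = [2.0 * x + rng.gauss(0, 0.5) for x in data["income"]]

# Hold out the last quarter of the sample to judge predictions.
split = int(0.75 * n)
scores = {}
for name, xs in data.items():
    a, b = fit_line(xs[:split], y[:split])          # fit on training rows
    scores[name] = mse(a, b, xs[split:], y[split:])  # score on held-out rows

# The "best" model is simply the one with the lowest held-out error.
best = min(scores, key=scores.get)
print(best, scores)
```

Note that nothing in this procedure identifies a causal effect; it only ranks candidate models by predictive accuracy.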

Although these prediction policy problems represent an exciting frontier, the big data

revolution also prompts a critical examination of the role of theory in computational social

science. Whereas many computer science applications emphasize the prediction framework, we

also want to emphasize the role that social science theory and practice play in shaping good

quantitative research. In particular, we emphasize the critical need to theorize measurement in

quantitative studies. Before applying a machine learning or statistical model to a problem, social

scientists must first contemplate what the social problem is, and how they want to operationalize

that concept in a quantitative measure. In concrete terms, they should be able to answer the

simple question: what is the estimand (Lundberg, Johnson, and Stewart 2021)? Once a social

scientist is able to properly conceptualize the theoretical and social problem they are interested

in, machine learning becomes a tool to uncover new concepts, facilitate causal inference, and

derive new measurements (Grimmer, Roberts, and Stewart 2021). Importantly, we emphasize

that social scientists should not lose sight of the fact that they aim to solve social problems, and

not simply optimize the internal mechanics of a given model.

2.5. Engagement with Ethical Constraints that Arise When Working with Big and Complex

Data and Predictive Paradigms

Social scientists will likely play an important role in driving conversations around ethics

and privacy in machine learning and artificial intelligence. For example, the COVID-19

pandemic raised questions about how to balance individual privacy interests against public health

regulators’ access to information for end uses such as contact tracing. Several states are

considering using predictive algorithms for making bail decisions, and the debates about

implementing these systems implicate race and criminal justice (Angwin 2016). Social scientists

are often domain experts in these areas, and will therefore be essential not only in engineering

such systems but also in thinking about how to manage tradeoffs, particularly with respect to impacts

on vulnerable populations.

2.6. A Computational Social Science Curriculum?

At this point, graduate students reading this article may wonder: so what courses should I

take? Here, we outline the basic building blocks of what courses might constitute a

“computational social science curriculum.” We must emphasize that most students will not take

all of these courses, or even most of them. Insofar as students can take data science courses that

cover the fundamentals of these concepts, they would find it advantageous to do so.

First, students should be aware of the mathematical foundations of computational social

science. Specifically, students should have some familiarity with probability, mathematical

statistics, calculus, linear algebra, discrete math, and optimization more broadly. Probability

courses will generally cover concepts like probability distributions, frequentist and Bayesian

thinking, and basic rules of probability. These building blocks are important for understanding

how statistical inference and certain machine learning models work. Calculus and optimization

introduce key concepts for understanding how machine learning “finds” the best solution to a

problem. Linear algebra also deals with this idea and is essential for understanding regression

methods. Some graduate departments, especially those with strong quantitative programs, may

already cover these topics in their own methods sequence. In other cases, students may find that

taking the corresponding undergraduate coursework, or graduate work aimed at beginners, will

be the best path forward.
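
To illustrate how calculus and optimization let an algorithm “find” the best solution, the following Python sketch fits a straight line by gradient descent on the mean squared error. The toy data, the learning rate, and the number of steps are assumptions chosen so that this small example converges; they are not general recommendations.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = a + b*x by minimizing mean squared error with gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [a + b * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to a and b.
        grad_a = 2 * sum(residuals) / n
        grad_b = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        # Step downhill, against the gradient.
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


# Toy data generated exactly from y = 1 + 2x (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x for x in xs]
a, b = gradient_descent(xs, ys)
print(round(a, 4), round(b, 4))
```

For linear regression the same answer is available in closed form through linear algebra; gradient descent matters because it extends to models where no closed form exists.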

Second, and less commonly taught in social science graduate methods sequences, are

concepts drawn from computer science. Oftentimes, computer science majors will start with

courses covering algorithms and data structures, and familiarity with these topics is essential for

making the most of computational approaches to social science. Students need not take these

courses explicitly, but should look for courses that allow them to learn the basics.

Finally, we recommend that computational social science students distinguish themselves

by learning about data visualization, data wrangling, and ethics in machine learning. These areas

are going to be where graduate social science students will have a comparative advantage

relative to computer science or statistics graduate students. Data visualization and storytelling are

key communication skills that Ph.D. students are encouraged to develop to share and promote

their work regardless of discipline. Social science students are also familiar with the challenges

of working with real-world data that are often messy and not generated in a controlled

environment, and thus are well-positioned to become expert data wranglers. Perhaps most

importantly, social science students are likely already well trained in thinking about the ethical

dimensions of their work, and the context surrounding various social problems. Fairness,

accountability, and transparency in machine learning is a rapidly growing field that is clamoring

for social science insights to help guide the development of ethical machine learning systems.

Social science students, particularly those who study topics like education, criminal justice,

public health, and other areas that implicate race, gender, or immigration, should seek out

coursework that hones their knowledge of these topics in the context of machine learning.

Within these broad categories, students may further specialize in various ways. Database

management, network science, computational linguistics, and computer vision are examples of

concepts that can enable working with complex data like datasets with millions of rows, text, and

images. Different universities and departments may cover these concepts in various ways, and

graduate students should chart their own path by taking advantage of courses, workshops, and

external opportunities to fill gaps in their home departments’ curricula. For example, some

departments may expect students to learn R or Python alongside introductory statistics materials

in the first-year curriculum, whereas others explicitly offer something like a “computational tools

course.” To illustrate, one of the authors of this piece developed and taught a year-long

“Computational Social Science” course designed for second-year Ph.D. students to develop

computational skills alongside applied machine learning and causal inference work. Another one

of us taught a computational tools course that emphasized developing skills in things like web

scraping, processing scanned documents, and managing file systems. Another option is to focus

on statistical and causal inference during semesters and host a programming bootcamp that

focuses on literate, efficient, and reproducible code over summers.

Beyond these core competencies, we observe that students specialize in a variety of ways.

Some focus on applying new methods to a pre-existing domain or subfield. Others become

experts on particular types of data like social network data, text data, or image data and work

across a variety of areas. Students may also find themselves leaning more toward the social

sciences or toward data science venues depending on their interests and aspirations.

Again, we emphasize in the strongest terms that no one person will be an expert in all of

these things, or even most of them. Because data science is so interdisciplinary, there is a

temptation to attempt to become a “unicorn” data scientist who is an expert in computer science,

statistics, and their own domain. This is especially a challenge for graduate students because they

often try to optimize for multiple career paths. However, this is likely not a realistic goal.

Instead, students should strive to develop the core competencies we mentioned, which are the

foundations for self-directed study, and then figure out what mixture of advanced skills best suits them.

Their deepest expertise remains in social science (their domain), but they are also competent in

inferential thinking (traditional methods training) and computational thinking (complementary training in

data science) and have real-world data experience (part of their research experience). Ideally,

they partner with others who complement these skills and engage in collaborative learning. We

also emphasize that students will likely not take dozens of courses to cover all of these topics.

Students should strive to take courses that help them build strong foundations, and learn how to

teach themselves challenging new material. Because computational social science is fast

evolving, no student will be done learning after two years of coursework and instead should

focus on developing the capacity and confidence to learn new concepts and skills throughout

graduate school and afterward.

In closing, we want to emphasize that in addition to the importance of acquiring a

particular set of computational skills, we are firm believers that computational social science

training should be driven by and based on students’ own domain knowledge. Learning

computational methods does not replace but rather supplements the traditional social science

training in theory and methods. First, computational methods are a vast and quickly growing

subject. No one can learn everything. Which computational methods one should learn first and

with how much depth depends on the problems one needs to solve. Second, expertise in data, not

only models, matters. Garbage in (low-quality data input), garbage out (low-quality data output).

Bias in (biased data input), bias out (biased data output). A high-quality and trustworthy model

comes from a deep understanding of the data generation process. Social scientific understanding

of human behaviors and institutions helps to develop data products (broadly construed) that

avoid unintentional harms and are aligned with human values (Engler 2022). Third, domain

knowledge is a competitive edge for social scientists. Core data science skills, such as basic

programming proficiency, have become commoditized thanks to the popularity of entry-level

data science education programs at both the undergraduate level and professional “boot camps.”

Therefore, social scientists will always face challenges competing with statisticians and software

engineers in the methods- or tools-focused jobs and positions. However, there are many data

science jobs where domain knowledge is highly valued, including but not limited to online trust

and safety, quantitative UX (user experience), quantitative marketing, and people analytics. In

these roles, domain knowledge is of key interest to hiring committees and is a valuable commodity

for social scientists navigating the data science world.

3. Step 2: Building a Data Science Portfolio

3.1. What is a Data Science Portfolio?

A data science portfolio includes projects and outputs (Robinson and Nolis 2020). Such

a portfolio is an integral part of data science education (Nolan and Stoudt 2021) and

career-building (Dabbish 2013; Craig et al. 2018; Scolere 2019). Preparing a portfolio is

especially critical for pursuing a non-academic data science career, where publication is not the

only, and is far from the most important, metric of performance. Students do not need to take

courses in everything related to data science, and at some point, building a portfolio that

demonstrates applied knowledge will be a better use of their time and effort.

Below, we will explain this concept by comparing a data science portfolio with a

curriculum vitae (CV), the standard document used to summarize job market candidates’

accomplishments and qualifications in an academic job market.

First, a data science portfolio defines outcomes more broadly than a CV. It takes

non-publication outputs seriously. For instance, developing data products, such as open-source

software, interactive maps, and dashboards, is a strong indicator of performance for a data

science portfolio. They signal strong programming skills and public engagement.

Second, a data science portfolio is deeply concerned about processes, not only outputs. In

an academic setting, as long as one produces a correct figure or table for a manuscript, few will

scrutinize the exact process. However, applied settings have tight deadlines and other resource

constraints. There is also a strong norm of open source code and reproducibility–that is, someone

else may want to adapt code into their own project. Therefore, it is critical to produce the same

outputs more efficiently and with as little communication cost as possible. Modern data science

projects are often large-scale and require collaboration. A data science portfolio offers the chance

to demonstrate some of the core technical skills such as writing legible code.
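
As a hedged illustration of what legible and reproducible code can look like in a portfolio, the sketch below computes a percentile bootstrap interval for a mean. It is our own minimal example: the function name, the default seed, and the toy data are assumptions, and a real project would document its own choices in the same way.

```python
# Illustrative sketch only: the function name, defaults, and data are
# assumptions made for this example.
import random
import statistics


def bootstrap_mean_interval(values, n_resamples=1000, seed=42):
    """Return an approximate 95% percentile bootstrap interval for the mean.

    A fixed seed makes the resampling repeatable, so a collaborator who
    reruns the code obtains exactly the same interval.
    """
    rng = random.Random(seed)  # local generator; global state is untouched
    means = []
    for _ in range(n_resamples):
        # Resample the data with replacement and record the mean.
        resample = rng.choices(values, k=len(values))
        means.append(statistics.mean(resample))
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return lower, upper


# Hypothetical data: weekly hours volunteered by eight respondents.
hours = [2.0, 3.5, 4.0, 1.5, 5.0, 3.0, 2.5, 4.5]
first_run = bootstrap_mean_interval(hours)
second_run = bootstrap_mean_interval(hours)
print(first_run, first_run == second_run)
```

The docstring, the descriptive names, and the explicit seed are small habits, but together they let someone else rerun and adapt the analysis without having to ask the author how it works.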

3.2. The Base Data Science Portfolio

How does one construct and advertise an effective data science portfolio? At a minimum,

graduate students should become familiar with version control tools, and make frequent use of

open-source coding platforms such as GitHub. Git is a tool for managing and tracking changes to

a codebase, thus allowing users to exercise control over different versions of their work. GitHub

is an online platform built on git and offers additional features for collaboration, tracking, and

hosting code repositories. This last function is especially important for developing a data science

portfolio–whenever possible, graduate students should open-source their code on a platform like

GitHub to signal their ability to work on different computational problems and contribute to the

community. Also, these GitHub repositories show that they can write good code in a common

data science language and use modern collaboration tools.

We advise that graduate students begin working with these tools and learning best

practices (good habits) as early as possible. There is a learning curve at the beginning as these

tools require some familiarity with command-line tools, but the payoff is enormous if a graduate

student develops a varied and extensive portfolio across several years of their Ph.D. program.

The problem defined in this basic data science portfolio does not need to be very novel

or deeply impactful. The point of having a portfolio is to show and share your work and

its value even when the project is not fully mature.

Finally, since this is a social science Ph.D. student’s data science portfolio, it should

demonstrate not only technical skills but also knowledge of some topic, policy

area, realm of social behavior, or institution that they can bring to the table. This knowledge gives them a

competitive edge over candidates who have not received such extensive training in

the social sciences.

3.3. Portfolio Plus

So far, we have emphasized hard skills. However, soft skills like communication are

equally, if not more, crucial. In applied settings, data scientists work in a team and solve a

problem the team faces using data. The analysis does not speak for itself. Data scientists need to

communicate with (their and other) team members, organizational leadership, and non-technical

audiences to explain the team’s problems and find solutions that address these issues given the

resource constraints. This challenge is also present in academia. Suppose that one gives a job talk

based on computational social science research. Using computational methods does not make the

research attractive. The candidate needs to appeal to an audience that may be unfamiliar with, or

even skeptical of, these new methods in social science research. In either case, data

communication is a core competency. Data visualization serves the same goal as the art and

science of turning statistical information into a visual narrative.

One way of honing and displaying effective communication skills is to write data science

blog posts. This exercise is similar to teaching in its effect. It helps one to gain skills in

communicating computational tools and techniques for a wider audience. Yet another valuable

way of honing and displaying effective communication skills is by doing presentations. The

DSSG program emphasizes frequently presenting work to a variety of audiences–other fellows,

stakeholders and government partners, and the general public–and with different time

constraints. Being able to pitch the same project in 5 minutes or 20 minutes, and to both lay and

expert audiences is a key element of working across disciplinary lines, and within industry and

non-profit organizations. While publishing blog posts is most valuable, for students who are new

to data communication, just writing them (and not sharing them) can be a valuable experience.

4. Step 3: Connecting with Other Computational Social Scientists

Being a computational social scientist means not only learning data science skills and

building a data science portfolio, but also being engaged in the field of computational social

science. Being engaged in computational social science activities can be immensely helpful for

your research by keeping you updated on the state-of-the-art in this constantly evolving field and

connecting you with collaborators. Interacting regularly with other computational social

scientists can also help Ph.D. students to find an intellectual community, which is sometimes

more challenging for computational social science Ph.D. students to do given that their interests

are more interdisciplinary compared to other social science Ph.D. students. This engagement can

take many forms. Here, we share three main forms of engagement: workshops and trainings,

conferences, and internships. Even students who are new to computational social science have

value to add to the community, and being engaged in the community gives you the power to connect

others and ensure people feel welcome in the community (Kim, Lebovits, and Shugars 2021).

Many of the coauthors of this article participated in programs like SICSS and DSSG and have

become community builders in their own institutions and beyond.

4.1. Workshops and Trainings

Workshops and trainings provide important opportunities for computational social

science Ph.D. students to learn and hone their skills in programming and computational methods.

Given that many social science Ph.D. programs do not yet provide courses on programming and

computational methods as part of their curriculum, it is not uncommon for computational social

science Ph.D. students to seek out additional workshops and trainings to supplement their

training. Workshops and trainings also provide computational social science Ph.D. students with

great opportunities to meet other students, faculty, and industry professionals in the field. Even

for early career computational social scientists, workshops and trainings can serve as useful

opportunities for continuous learning to keep up with the state-of-the-art in this rapidly evolving

field and to connect with new collaborators.

Fortunately, there are currently many options available for computational social science

workshops and training. A popular one is to participate in a computational social science summer

institute. Some examples of summer institutes include SICSS and the Santa Fe Institute’s

Complex Systems Summer School. These summer institutes are typically held every year, and

they provide participants with an in-person, cohort-based training experience over several weeks.

The summer institutes usually emphasize breadth of computational social science topics and

attract participants across different disciplines. The summer institutes are great immersive

experiences, but they require some advance planning, as they are usually quite selective and

only some are provided free of charge. For example, those interested in SICSS in an

upcoming summer should expect to prepare and submit an application package consisting of a

CV, statement of interest, and writing sample by late February of that year.

Another way that computational social science Ph.D. students can pursue additional

training is to take topic-specific short courses and workshops. Short courses and workshops

typically focus on a specific topic, and they can range from a few hours to a few days. The

ICPSR provides many short courses in the winter and summer on computational social science

topics like network analysis, machine learning, and agent-based models, though they do charge

fees. Some universities offer free short courses and workshops on computational social science

topics to their students through their libraries and/or research and computing departments.

More recently, computational social scientists have taken the initiative to organize online

tutorial series. For example, the NLP+CSS 201 tutorial series, organized by Ian Stewart and

Katie Keith, provided Ph.D. students with a virtual opportunity to learn about different topics in

natural language processing, such as word embedding models and BERT models. In fact, many

of the tutorials were taught by computational social science Ph.D. students. The recorded lectures

and notebooks were also shared on their website, providing additional opportunities to share

learning. This is a particularly promising model for organizing computational social science

workshops and trainings that emphasizes community and accessibility. Obviously, students do

not need to attend all the training and should prioritize the ones that fit their needs and draw their

interests.

4.2. Conferences

While social science Ph.D. programs typically train and encourage students to attend and

participate in key academic conferences within their disciplines, computational social science

Ph.D. students can benefit from taking part in a broader range of conferences. In particular, there

are many interdisciplinary conferences focused on computational social science topics where

computational social science Ph.D. students can share their work and make new connections.

One key conference is the International Conference on Computational Social Science, which

draws computational social science researchers from various disciplines in academia and

industry. Other more method- and topic-specific conferences that computational social science

Ph.D. students should consider include: the Text as Data Conference, NetSci, the International

Social Networks Conference (also known as Sunbelt), and the Politics and Computational Social

Science Conference.

In recent years, many disciplinary conferences have added preconferences focused on

showcasing research that uses computational methods. These preconferences provide great

opportunities for computational social science Ph.D. students to present their work to a highly

tailored audience that can provide particularly useful substantive and methodological feedback.

For example, the American Sociological Association has organized a Computational Sociology

Preconference ahead of its annual conference in recent years. The Political Networks Section of

the American Political Science Association also hosts PolNet, an annual meeting that

consists of workshops and panels.

4.3. Internships

Social science Ph.D. students often use their summers to prepare for program milestones

and conduct research. For computational social science Ph.D. students, summers may serve as

especially opportune periods to pursue additional training, conduct research, and engage in

career exploration through internships. While summer internships have long been a common part

of the undergraduate experience, they are also increasingly an important part of the graduate

experience for computational social science Ph.D. students. From a student's perspective,

participating in an internship helps them understand whether they want to be in academia or

outside of it. Similarly, from an employer's perspective, the internship helps convince them that

the person can be an effective part of a team and embrace non-academic standards, goals, and

workflows. An internship is thus beneficial for both sides.

Many organizations now offer summer internships to computational social science Ph.D.

students. In particular, many tech and social media companies like Meta, Twitter, Google,

Amazon, and Microsoft Research have summer internship programs. Some public agencies and

nonprofit organizations also have summer internship programs. The Civic Digital Fellowship [2]

provides opportunities to work with local, state, and federal governments across the U.S. through

a 10-week internship. DSSG [3] has a similar public interest goal and straddles the line between

training (workshops covering both technical and non-technical data science skills) and internship

(building a real world project with a partner). While summer internships may vary considerably

in their structure, pay, and length, they typically provide opportunities to engage in research in an

applied and often team-based setting. Some summer internships are resident-based, and others

are remote. They tend to pay reasonably well, oftentimes more than a summer stipend as a

graduate student. And summer internships can provide opportunities for additional internships

and perhaps long-term employment.

[2] https://www.codingitforward.com/summer-fellowships

[3] Here we are mainly referring to the DSSG program hosted by Carnegie Mellon University (previously at the University of Chicago) and international partners (https://www.dssgfellowship.org/). There are similar programs carrying the same name as well, for example at the University of Washington (https://escience.washington.edu/dssg/).

While summer internships can provide many benefits to computational social science

Ph.D. students, they do require significant advance planning and preparation. In terms of

timing, computational social science Ph.D. students should consider applying for internship

positions after completing their coursework and developing a strong data science portfolio.

Computational social science Ph.D. students should also discuss their interest in taking on

summer internships early on with their advisors, department administrators, and potentially their

financial aid departments. These summer internships are competitive, and most of them will

require more than one interview/test in areas like programming, algorithms, statistics, and

research design. In-depth coverage of how to apply for internships and prepare for interviews is

beyond the scope of this article, but interested computational social science Ph.D. students

should be aware that applying for a summer internship will likely involve advance planning

(e.g., turning a CV into a skill-oriented resume) and preparation (e.g., technical interview). In

addition, some of these internship programs advertise that they look for computational social

scientists but exclusively hire computer scientists. Therefore, not getting an internship shouldn’t

always be taken as a slight on candidates’ skills. It could be a matching issue (fit) between hiring

organizations and candidates.

5. Conclusion

Data science opens up many new and exciting career opportunities for social scientists.

This article provides a step-by-step guide to graduate students on how to navigate these emerging

career paths in academic and non-academic job markets. In particular, we encourage Ph.D.

students interested in pursuing academic and non-academic careers as a computational social

scientist to (1) learn data science skills, (2) build a data science portfolio, and (3) connect with

other computational social scientists, and we share some advice and resources for accomplishing

each of these steps.

In this section, we propose some curricular changes that departments may consider

adopting to better support Ph.D. students interested in computational social science. First, we

encourage social science departments to integrate data science skills into their curriculum. For

example, introductory statistics courses that are often part of social science methods sequences

can support Ph.D. students to gain data science skills by offering the course in R or Python, or at

least including some modules in these programming languages. Social science departments can

also better support their Ph.D. students by offering new courses that directly teach data science skills or

how to integrate quantitative and computational methods to conduct social science research. One

of the authors of this piece developed and taught a year-long “Computational Social Science”

course designed for second-year Ph.D. students to develop computational skills alongside

applied machine learning and causal inference work. Another one of us taught a computational

tools course that emphasized developing skills in things like web scraping, processing scanned

documents, and managing file systems.

Second, we propose that departments recognize credits received from other departments

if they are relevant to students’ data science training. This will give students more flexibility and

ownership over their coursework, especially methods training. Departments will need to be

mindful about how to balance between prescribing an exact set of courses that define the field

and leaving students free to embark on a “choose your own adventure” curriculum with little

guidance. The appeal of prescribing an exact set of courses is that it guarantees every student

will have the same preparation, but with the downside of being unresponsive to new

developments in other disciplines and depriving students of the opportunity to integrate data

science into a domain expertise. On the other hand, leaving students free to decide the entire

curriculum for themselves may see some students craft a unique, challenging, and integrative

curriculum, but other students who are less well resourced might suffer important skills gaps

because of unawareness. To deal with this tension, we advise that departments look at their

curricula in light of the major concepts we have surfaced, identify gaps in their own offerings,

and encourage students to help find appropriate training on campus or elsewhere. Departments

should be mindful not just about how to create a computational social science program, but how

to maintain it. We expect that even the concepts we surface here will change in the coming years,

and new ideas about what constitutes computational social science will develop. Avoiding the

trap of optimizing for today at the expense of flexibility tomorrow will be key to setting students

up for success.

Third, we propose that departments limit the number of field exams students should take

(ideally one) and provide them with information on summer internships–even as early as during

program orientation. This is a far more effective method to help graduate students navigate

non-academic careers than offering a workshop on this subject when they are near to being on

the job market. One of the challenges of training students in computational social science is that

many of the concepts and skills need to be taught in addition to the core curricula of their Ph.D.

programs. This approach can easily add a year or more to a typical graduate student's workload.

The standard model of two years of coursework followed by several field exams before moving

on to dissertation writing should therefore be reexamined. One potential change would be

trimming down the number of field exams to one and allowing students to substitute advanced

computational training.

Fourth, we propose that departments recruit computational social scientists back into

academia from industry and non-profit positions in order to have more faculty on staff who can

advise and mentor students who are interested in computational social science. Bringing these

talents into a department is a good way to design an applied data science curriculum and support

students with broad professional development as well. These roles do not need to be tenure-track

and they could be visiting positions (a long sabbatical) from industry or non-profit. Professional

schools such as law and medicine traditionally hire many faculty members with a few years of

practice experience (i.e., professor of practice). Social science departments can similarly signal

that they value computational social science by hiring candidates with professional experience

who can help students navigate diverse career paths. Such candidates may also enrich the

disciplines by energizing them with new ideas and research agendas drawn from these

non-academic experiences.

Finally, we have intentionally aimed to provide information on academic and

non-academic job markets and balance them because we believe students should define their

success on their terms–and we hope that departments in this changing landscape will also do the

same. Given students’ career goals and life circumstances, they might prefer non-academic jobs

to academic ones. Regardless of field, good jobs are good jobs if they align with students’ own

priorities and values. Respecting these career decisions will give students more agency over and

satisfaction with their graduate training when mental stress is a pervasive problem in graduate

school (Almasri, Read, and Vandeweerdt 2021). An important aspect of supporting students’

career aspirations is to allow students the flexibility to spend time on endeavors not directly

related to their dissertation research such as internships and many of the summer programs we

mentioned. While social science departments have a long history of supporting students’

methods training through venues such as ICPSR, Northwestern’s Workshop on Causal Inference,

the Empirical Implications of Theoretical Models Institute (https://eitminstitute.org/), and the

Berkeley Initiative for Transparency in the Social Sciences (https://www.bitss.org/), they should

also encourage students to explore options outside of academia. Internships at private, non-profit,

and government organizations can help students explore different career paths, develop diverse

networks, and hone skills that are not well covered in academic settings. In addition, these

experiences help students develop confidence that non-academic jobs, which they usually regard

as unknown and scary, could be a good fit for them. Importantly, we emphasize that this flexibility

is important for students interested in both academic and non-academic careers. The experience

of working with real-world, big, messy data is valuable regardless of what job market a student

eventually enters. Collaborating with diverse team members is similarly an asset that is hard to

develop solely as a Ph.D. student working on a dissertation alone. One of the biggest benefits of

non-academic work is greater teamwork and more frequent feedback from colleagues, although

dependence on others is also higher. In addition, it is worth highlighting as a point of negotiation and/or

selection the possibility for interns to publish during the internship and/or have access to internal

data for research after the internship. This makes it easier for social science graduate students to

thread the needle of finding summer opportunities that are valuable to both their graduate

programs (academic standards of productivity) and non-academic job opportunities later.

Departments can also support students pursuing these strategies by allowing internship papers as

part of their dissertation. Some departments do not allow co-authored papers or have a limit to

how many co-authored papers one can have, but these rules should reflect students’ diverse and

evolving career paths. Overall, we hope this article provides a useful starting point for graduate

students and departments interested in computational social science. Surfacing the hidden

curriculum is an important step toward democratizing access to computational social science and

helping any and all students interested in data science and social science to think of themselves

as computational social scientists. We do not at all claim that we have the final word on what

makes a computational social scientist or computational social science curriculum, but instead

hope that this starts important and necessary conversations about the long-term development of

this field.

Disclosure Statement

Aniket Kesari, Jae Yeon Kim, Sono Shah, Taylor Brown, Tiago Ventura, and Tina Law have no

financial or non-financial disclosures to share for this article.

Acknowledgment

We thank Eric Giannella, Milan de Vries, Brian Heseung Kim, Rebecca Johnson, and Sarah

Shugars for their constructive comments on an earlier draft.

References

Abadie, Alberto, Alexis Diamond, and Jens Hainmueller. 2015. “Comparative Politics and the

Synthetic Control Method.” American Journal of Political Science 59(2): 495-510.

Almasri, N., Read, B., and Vandeweerdt, C. (2021). “Mental Health and the PhD: Insights and

Implications for Political Science.” PS: Political Science & Politics, 1-7.

Angwin, J., Larson, J., Mattu, S., and Kirchner, L. (2016, May 23). “Machine Bias: There’s Software

Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks.”

ProPublica. URL:

www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed

August 6, 2022)

Athey, S., (2015). “Machine Learning and Causal Inference for Policy Evaluation.” In

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and

Data Mining. 5-6.

Barham, E. and Wood, C. (2021). “Teaching the Hidden Curriculum in Political Science.” PS:

Political Science & Politics, 1-5.

Calarco, J.M. (2020). A Field Guide to Grad School: Uncovering the Hidden Curriculum.

Princeton, NJ: Princeton University Press.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins,

J., 2016. “Double Machine Learning for Treatment and Causal Parameters.” arXiv preprint

arXiv:1608.00060.

Craig, M., Conrad, P., Lynch, D., Lee, N., & Anthony, L. (2018). “Listening to Early Career

Software Developers.” Journal of Computing Sciences in Colleges, 33(4), 138–149.

Edelmann, A., Wolff, T., Montagne, D., & Bail, C. A. (2020). “Computational Social Science

and Sociology.” Annual Review of Sociology, 46(1), 61.

Engler, A. (2022, April 4). “Uncommon Advice on Becoming a Data Scientist in the Public

Interest.” Hertie School Centre for Digital Governance. URL:

https://www.hertie-school.org/en/digital-governance/research/blog/detail/content/uncommon-advice-on-becoming-a-data-scientist-in-the-public-interest

(accessed June 23, 2022)

Geiger, R.S., Mazel-Cabasse, C., Cullens, C.Y., Noren, L., Fiore-Gartland, B., Das, D. and

Brady, H. (2018). “Career Paths and Prospects in Academic Data Science.” Report of the

Moore-Sloan Data Science Environments Survey. Berkeley, California: UC-Berkeley Institute for

Data Science. URL: https://osf.io/preprints/socarxiv/xe823/ (accessed August 17, 2022)

Grimmer, J., Roberts, M., and Stewart, B. (2021). “Machine Learning for Social Science: An Agnostic

Approach.” Annual Review of Political Science. URL:

https://www.annualreviews.org/doi/abs/10.1146/annurev-polisci-053119-015921 (accessed

August 17, 2022)

Lundberg, I., Johnson, R. and Stewart, B.M., 2021. “What Is Your Estimand? Defining the

Target Quantity Connects Statistical Evidence to Theory.” American Sociological Review, 86(3),

pp.532-565.

Kim, J.Y. and Ng, Y.M.M. (2022). “Teaching Computational Social Science for All.” PS:

Political Science & Politics, 1-5.

Kim, S.Y.S., Lebovits, H. and Shugars, S. (2021). “Networking 101 for Graduate Students:

Building a Bigger Table.” PS: Political Science & Politics, 1-6.

Lue, R. A. (2019). “Data Science as a Foundation for Inclusive Learning.” Harvard Data Science

Review, 1(2).

Marlow, J., & Dabbish, L. (2013). “Activity Traces and Signals in Software Developer

Recruitment and Hiring.” Proceedings of the Conference on Computer Supported Cooperative

Work, 145–156. URL: https://dl.acm.org/doi/10.1145/2441776.2441794 (accessed August 17,

2022)

National Academies of Sciences, Engineering, and Medicine. (2018). Data Science for

Undergraduates: Opportunities and Options. Washington, DC: National Academies Press.

Nolan, D. and Stoudt, S. (2021). “The Promise of Portfolios: Training Modern Data Scientists.”

Harvard Data Science Review, 3(3). URL: https://doi.org/10.1162/99608f92.3c097160

Robinson, E. and Nolis, J. (2020). Build a Career in Data Science. Shelter Island, NY: Manning

Publications.

Salganik, Matthew J. (2019). Bit By Bit: Social Research in the Digital Age. Princeton, NJ:

Princeton University Press.

Scolere, L. (2019). “Brand Yourself, Design Your Future: Portfolio-building in the Digital Age.”

New Media & Society, 21(9), 1891–1909.

VanderWeele, T.J. and Ding, P., 2017. “Sensitivity Analysis in Observational Research:

Introducing the E-value.” Annals of Internal Medicine, 167(4), pp.268-274.
