Preparing Data Science Career
Students for
Aniket Kesari, Research Fellow, Information Law Institute, New York University,
aniket.kesari@nyu.edu
Corresponding author, Jae Yeon Kim, Assistant Professor, KDI School of Public Policy
Management, jaeyeonkim@kdis.ac.kr
Sono Shah, Computational Social Scientist, Data Labs, Pew Research Center,
sshah@pewresearch.org
taylor.w.brown@duke.edu
Tiago Ventura, Postdoctoral Associate, Center for Social Media and Politics, New York
University, venturat@umd.edu
tina.law@u.northwestern.edu
The authors are listed alphabetically, as all contributed equally to this article.
Abstract
In recent years, social scientists with data science skills have gained positions in academic and
non-academic organizations as computational social scientists who blend skillsets from data
science and social science. Yet as this trend is relatively new in the social sciences, navigating
these emerging and diverse career paths remains ambiguous. Drawing on our experiences as
computational social scientists working in academic, public, and private sector organizations, we
formalize this hidden professionalization process into three steps: (1) learning data science skills; (2) building a
portfolio that focuses on using data science to answer social science questions; and (3)
connecting with other computational social scientists. For each step, we identify and elaborate on
core competencies and additional useful skills that are specific to the academic and
non-academic job markets. Although this article is not exhaustive, it provides a much-needed
guide for graduate students, as well as their faculty advisors and departments, to navigate the
growing field of computational social science. By sharing this guide, we hope to help make
Keywords
data science, computational social science, academic job market, non-academic job market
Media summary
This article is an accessible guide for members of the social science community who are
interested in the growing trend in “Computational Social Science.” Our primary audience is
early-stage social science doctoral students who wish to tailor their graduate training toward
computational social science, but the guide should also be relevant for pre-doctoral and later stage
students, faculty, and members of the social science community more broadly. This guide should
be accessible for students with little prior technical training and students with more substantial
training. We emphasize three key steps for pursuing a computational social science career: 1)
Learning data science skills, 2) Building a data science portfolio, and 3) Connecting with other
social science PhD students, who blend skillsets from data science and social science, should
acquire during their programs. We overview some of the latest developments in computational
social science including working with novel data sources such as text and images, manipulating
“big data” through databases, and using machine learning to drive new social science insights.
Throughout this article we make two main points about computational social science training.
The first is that we encourage students to invest early in developing programming skills in
languages such as R and Python because these foundations enable further advanced applications
later in their graduate careers. The second is that we surface how a combination of coursework,
internships, and independent research can prepare students to become computational social
scientists. We do not prescribe any one set path for students, but rather highlight options and the
Beyond skills acquisition, we also turn to practical advice about career preparation. We
discuss how to build a data science portfolio for contributing to open science and building a
demonstrable record of work in this area. We conclude with concrete recommendations about
how departments may adapt their current curricula to give students the flexibility needed to
pursue this exciting new field. Overall, we aim to keep this article at a level of generality that it
should be helpful to students across the social science disciplines and give them a starting place
1. Introduction
Computational social science is a rapidly growing field that engages the social sciences
and data science (Salganik 2019: xviii; Edelmann et al. 2020). The field applies novel digital
data and computational methods to advance social scientific understanding of human behaviors
(Edelmann et al. 2020). As more and more social scientists have gained training and experience in
data science during their graduate studies, an increasing number of them have gained positions as
computational social scientists in academic and non-academic organizations. Both academic and
industry research organizations increasingly see the value in core social science skills covering
the analysis of human behavior, institutions, and policy, and clamor for individuals who can
bring these foundations to the study of massive data. We define computational social scientists as
a type of data scientist who comes from a social science background, but is more invested in
programming and other data science skills than is required in most social science Ph.D.
programs. Places of employment for computational social scientists now include: academic
departments; professional schools; nonprofits (e.g., Code for America, Pew Research Center’s
Data Labs, Urban Institute); tech companies (e.g., Meta, Twitter, Google, Amazon, Microsoft,
among others); international organizations (e.g., The World Bank, UN Global Pulse Labs); and
government agencies (e.g., US Federal Reserve, Census Bureau, Office of Evaluation Sciences,
Indeed, there appears to be growing interest among social scientists in pursuing careers in
data science. The 2016 Moore-Sloan Data Science Environment Survey collected responses on
data science careers from students, researchers, staff, and faculty connected to the University of
1
For a more comprehensive list of potential opportunities for students and graduates in this field, see these helpful
resources curated by Ben Green: https://www.benzevgreen.com/jobs/
California, Berkeley, New York University, and the University of Washington, Seattle. The
results show that the percentage of respondents working in data science careers who identify
social sciences as their primary field is now ranked second (17.9%), only after physical sciences
(19.6%) (Geiger et al. 2018: 8-9). Of course, these three universities are far from representative
samples of US research universities or academic institutions in general. Yet, they still matter as
these universities are key members of the field of data science in the US and the world and
Despite this growing demand to engage in data science among social scientists, building
and navigating data science education remains elusive for many social science Ph.D. students
because training to be a computational social scientist still feels like a “hidden curriculum”
(Calarco 2020; Barham and Wood 2021). The notion of pursuing data science as a career is
relatively new and there are diverse paths to becoming a data scientist, which means that
advising on this subject goes beyond the traditional capacity and resources of many academic
departments. The data science hidden curriculum exists in two ways. First, there is a lack of
formal training. Even in the U.S., where data science programs and courses are increasingly
available on university and college campuses, most Ph.D. programs in social sciences do not
offer systematic and formal training in data science or dedicated advising to help students to
navigate data science careers in the academic and non-academic job markets. Second, even when
such training exists, barriers for historically underrepresented groups in data science persist as
society has told them this kind of analytic work is "not for them." Therefore, we need to
increase formal training in data science in social science Ph.D. programs and make this training
relevant for and accessible to all (Lue 2019; Kim and Ng 2022).
Within this evolving context, some innovative solutions have emerged, but they remain
ad-hoc. Namely, computational social science or data science summer institutes, such as the Summer
Institute in Computational Social Science (SICSS) and Data Science for Social Good (DSSG),
have gained popularity among early-career scholars. Both programs provide social scientists with
Inter-university Consortium for Political and Social Research (ICPSR) offers more data science
courses, such as computational text analysis, in recent years. Nevertheless, the limitation of these
alternative programs is apparent. These are intensive summer programs and are not set up for
these programs aim to be inclusive, they are also highly selective and therefore can only support
and support of computational social scientists, this article seeks to make explicit the informal
knowledge of navigating computational social science for Ph.D. students in the social sciences.
social science careers based on our collective experiences as computational social scientists
working in academic, public, and private sector organizations. We leverage our diverse
process that interested students may face and point them toward a variety of useful resources at
Many of these recommendations overlap with the areas of data acumen highlighted in the
“Undergraduate Data Science” report (NASEM 2018) in that we emphasize core foundations,
experience working with real datasets, and ethical considerations. We add to this existing
conversation by considering the needs of social science Ph.D. students specifically. Graduate
social science programs differ from undergraduate data science majors in several important
ways, and we tailor our framework to highlight the synergies between data science and graduate
focus on providing information on the differences that computational social science Ph.D.
students can expect in terms of preparing for academic versus non-academic careers.
Specifically, we break down the CSS professionalization process into three steps: (1)
learning data science skills; (2) building a computational social science portfolio; and (3)
connecting with other computational social scientists. For each step, we identify and elaborate on
core competencies and additional useful skills that are specific to the academic and
non-academic job markets (see Table 1). We conclude by briefly proposing some curricular
changes that departments can consider adopting to better support their emerging computational
social scientists. The guide we provide here is not exhaustive and is just one perspective on how
graduate students, as well as their faculty advisors and departments, can navigate the rapidly
growing field of computational social science. We hope to start a conversation on how the
Core Competencies
● Engagement with ethical concerns that arise when working with such data and paradigms (e.g. privacy protection, algorithmic bias, etc.)
● Programming fluency in R and/or Python
● Experience with data management and teamwork
● Facility for working with large, messy, and sometimes unstructured data
● Solid grasp of common inferential methods, such as hypothesis testing, statistical modeling, etc.
● Understanding of machine learning. Students should develop an “index” in their heads regarding what specific tools they need to solve specific problems and their limitations
● Knowledge of different frameworks for framing problems, as well as their limitations. Students in social science disciplines might encounter the causal inference framework, machine learning for prediction policy problems, machine learning for developing social science measurements, etc. and should be aware of the debates around how these approaches frame scientific questions.
Core Competencies
● Engagement with social aspects of a data science project via problem definition, hypothesis generation, data and outcome selection, etc.
● Reproducible, efficient, and communicable code via GitHub
● Journal publications/conference proceedings
Methodological training for graduate students in the social sciences traditionally consists
of coursework on research design, statistics, and qualitative methods, with the exact mix varying
by discipline. There is a clear structure to how one deepens their training. For example, graduate
students interested in conducting research with quantitative methods can supplement their
introductory coursework on model-based statistical inference with advanced coursework on
causal inference.
Computational methods are a new frontier. We argue that in addition to the traditional
computational social science should develop programming fluency in R and/or Python, gain
experience with data management and teamwork, work with large and unstructured data,
understand machine learning as a social science framework, and engage with ethical concerns
that arise when working with big data and prediction modeling. When approaching these topics,
students may choose a mixture of coursework and independent social science projects. Students
should balance between in-house coursework in their own departments and coursework in other
departments. The advantage of in-house coursework is that it will often be more tailored toward
preparing students to conduct computational social science research within their own discipline,
and may provide a smoother on-ramp for students who are new to methods. External coursework
can be advantageous because students will be exposed to students from other disciplines within
and outside the social sciences, and they may benefit from instructors whose research careers are
squarely focused on these topics (e.g., taking a machine learning course from a statistics or
computer science professor). Coursework should be complemented with projects, either as a solo
endeavor or in collaboration with faculty and other students. This advice is especially prudent
after around the midway point of a PhD program when most students transition to conducting
foundation that one needs to be an effective self-learner in both the early stages of a graduate
program while choosing coursework, and in the later stages when pursuing independent
projects.
fluency is the core skill. Programming fluency means the ability to program in popular languages
used for data science, such as R and Python. One can still conduct statistical analysis using
proprietary commercial software such as Excel, SPSS, and Stata. However, these proprietary
software programs raise problems for transitioning from social science to computational social
science. First, these tools are not well suited for harnessing automation: the force behind
large-scale data collection and analysis. The scalability of general programming languages like R and
Python makes knowledge of them a requirement of the field, particularly in settings like the tech
industry. Second, one cannot take advantage of advancements in modern data science such as
machine learning and natural language processing using these limited tools. This fact is not
simply a matter of tools like SPSS and Stata needing to catch up to integrate machine
for properly implementing these methods (Kim and Ng 2022). In addition, learning R and/or
Python is useful because one can build an open source project using these languages (e.g., R
package or Python library), which can be a major component of data science portfolios. Finally,
investing in these languages helps students become part of the data science communities built around R (e.g.,
#rstats, RStudio Conference) and Python (e.g., PyData, The Python Conference). Which language
students start with does not matter substantially, especially for beginners. A language that gets a
job done is the right tool for the job. If needed, students can learn another language after
becoming fluent in at least one. Moreover, the ecosystems of these two languages have
become more integrated, as evident in RStudio changing its name to Posit to expand its focus on
Data management. As the popularity and scale of computational social science research
grows, there is an increasing need for familiarity with managing the vast quantities of data that
are often collected and stored throughout the research process. At a certain point, the scale of
data requires researchers to implement data management practices to allow for ease of access and
analysis. Many students are already likely familiar with the basics of spreadsheet software such
as Excel. Students will generally have an intuitive understanding of data being stored in rows and
columns, with rows containing observations and columns containing information about them.
They may even have some practice with summarizing, aggregating, and manipulating data to
perform some basic calculations and visualizations. This knowledge is easily translated to the use
of “dataframes” in R and Python and students can learn the basics of working with spreadsheet
data in these languages. Eventually, students may find that they need to work with increasingly
larger datasets, or data structures other than Excel spreadsheets, flat files (e.g., csv, tsv, dta, etc.),
and other traditional data formats. In many cases, this involves using cloud computing resources,
such as AWS, Microsoft's Azure, or Google’s BigQuery for storing data or conducting analysis.
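To make the spreadsheet-to-dataframe translation concrete, the following is a minimal sketch in Python, assuming the pandas library is installed. The survey data, column names, and values are hypothetical and chosen only for illustration: each row is one observation and each column is a variable, just as in a spreadsheet.

```python
import pandas as pd

# Hypothetical survey data: each row is one respondent (an observation),
# and each column holds one piece of information about that respondent.
survey = pd.DataFrame(
    {
        "respondent_id": [1, 2, 3, 4, 5, 6],
        "region": ["North", "South", "North", "West", "South", "North"],
        "age": [34, 51, 27, 45, 38, 62],
        "trust_score": [3.0, 4.5, 2.5, 4.0, 3.5, 5.0],
    }
)

# Filter rows, much like applying a filter in a spreadsheet.
older = survey[survey["age"] >= 40]

# Summarize by group, much like building a pivot table.
by_region = (
    survey.groupby("region")
    .agg(n=("respondent_id", "count"), mean_trust=("trust_score", "mean"))
    .reset_index()
)

print(older)
print(by_region)
```

Because every step is written down as code, the same filtering and aggregation can be rerun unchanged when the data grow from six rows to six million.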
Computational social science research also often makes extensive use of observational
data, ranging from digital forms of “big” data, such as tweets and other social media feeds, to
large-scale administrative data such as tax records, medical claims, 311 complaints, and
student-level educational records. Beyond storage and access challenges, this kind of data
de-identification–before it can be used in analysis. In keeping with established best practices for
scientific computing, researchers should ideally preserve any data collected in its original form
and explicitly record all the steps used to process it. These steps are critical to ensuring that
research is reproducible. Here, researchers would benefit from familiarity with script-based data
workflow management tools such as targets, snakemake, or make. Students should also become
familiar with tools for preserving the privacy of data subjects. Concepts such as differential
privacy, encryption, and federated learning may be relevant, and implementing these concepts in
Another key skill is learning how to document datasets and accompanying code. Code
and data documentation (e.g., written information about coding decisions and the variables used
in the dataset) serve several purposes. Documentation eases communication with other users of the data, as well
as one’s future self. It maximizes the usefulness of a dataset by making it readable and usable for
further analyses. Relatedly, learning version control ensures that data is not easily destroyed or
Teamwork. Computational social science research is rarely carried out individually and
is instead often a collaborative effort involving researchers, data scientists, and engineers, among
others. Such collaborative environments are standard both in industry and in
academic research centers working on computational social science. Here, familiarity with
version control tools such as git and GitHub is essential for working on larger-scale
projects. Git and GitHub are now effectively a prerequisite for collaborative computational
social science research projects, as these tools allow users to work collaboratively on multiple
branches of the same projects, publicly share their code, efficiently track changes and move back
Based on our academic and professional experiences, academic and industry employers
vary on the degree to which they employ version control (through git) in their day-to-day
activities. Therefore, we do not advise students to overprepare and become git wizards. But
we do advise students to learn the basics of git (init, clone, commit, push, and branch) as
part of their training to pursue a computational social science career. There are several intuitive
ways to gain this basic facility. GitHub offers a desktop graphical user interface that allows users
to point-and-click their way through the basic version control workflow. RStudio similarly
provides a lightweight graphical interface for working with GitHub. While the full capabilities of
git require use of the command line, these tools can be useful as students are starting out. For
students who want to develop intuitions around Git and GitHub, there are many tutorials on this
subject on the GitHub Learning Lab (https://lab.github.com/) and there is even a game for
databases using Structured Query Language (SQL) is a fundamental skill. Most industry data is
big and proprietary. Therefore, for computational social scientists in industry, data is likely to live in a
cloud database, not a personal computer or laptop. In this scenario, learning how to communicate
with a database becomes a required skill. Therefore, one needs to know how to query a database
using Structured Query Language (SQL). Fortunately, one can learn SQL without setting up a
database. For instance, SQL can be supplemented with experience in R and “dbplyr,” which is a
database backend for “dplyr”—a popular data manipulation tool in the R ecosystem.
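For readers working in Python, one way to practice SQL without setting up a database server is the sqlite3 module in the standard library, which can create a temporary database in memory. The sketch below is illustrative only: the table, column names, and values are hypothetical, loosely modeled on the 311 complaint records mentioned earlier.

```python
import sqlite3

# An in-memory SQLite database: nothing to install or configure.
conn = sqlite3.connect(":memory:")

# Hypothetical table of service complaints (names and values are made up).
conn.execute(
    "CREATE TABLE complaints ("
    "id INTEGER PRIMARY KEY, borough TEXT, category TEXT, days_open INTEGER)"
)
rows = [
    (1, "Brooklyn", "Noise", 2),
    (2, "Queens", "Noise", 5),
    (3, "Brooklyn", "Heating", 9),
    (4, "Bronx", "Heating", 4),
    (5, "Brooklyn", "Noise", 1),
]
conn.executemany("INSERT INTO complaints VALUES (?, ?, ?, ?)", rows)

# Count complaints and average the days open per borough.
query = """
    SELECT borough, COUNT(*) AS n_complaints, AVG(days_open) AS avg_days_open
    FROM complaints
    GROUP BY borough
    ORDER BY n_complaints DESC, borough ASC
"""
results = conn.execute(query).fetchall()
for borough, n_complaints, avg_days_open in results:
    print(borough, n_complaints, avg_days_open)

conn.close()
```

The same SELECT, GROUP BY, and ORDER BY clauses carry over, with minor dialect differences, to the cloud databases that students are likely to encounter in industry settings.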
Both graduate students and their departments should also be cognizant of the fact that
computational social science involves not just a set of tools, but also a reframing of what
constitutes a social science question. Particularly in quantitative social science, social science
disciplines place a strong emphasis on causal inference. Computational methods offer a lot of
innovations in this space such as synthetic control methods (e.g., Abadie, Diamond, and
Hainmueller 2013), double robust machine learning (Chernozhukov 2016), and sensitivity
analysis (VanderWeele and Ding 2017). Computational social science in many ways provides a
That being said, machine learning also opens up predictive questions. Take the example
the analyst approaches the question with a theory of how different variables might affect an
outcome of interest, and involves constructing a model that controls for these variables in order
to identify the causal effect of one of those variables on the outcome. A machine learning
implementation of regression eschews this approach, and instead the analyst lets a computer
algorithm identify the best model by trying different combinations of variables and models to get
the best predictions of an outcome. Machine learning is therefore well suited to prediction policy
problems, though this is only one dimension of policy problems and has clear limitations. These
questions by no means replace quantitative social science’s traditional focus on causation, and
indeed machine learning and causal inference are being increasingly integrated (Athey 2015), but
students and departments should be aware of this conceptual shift. Departments should
encourage students who wish to pursue these new types of innovative questions in their
dissertations and other work, and recognize that these types of studies will become more
common.
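The contrast between the two uses of regression can be sketched in a few lines of Python, assuming NumPy is installed. Everything here is an illustrative assumption rather than a recommended workflow: the data are simulated, the variable names are hypothetical, and the search over variable subsets is a deliberately simple stand-in for the model selection that machine learning libraries automate. Note also that a regression coefficient has a causal interpretation only under assumptions that the simulation satisfies by construction.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the outcome depends on x1 and x2, but not on x3.
n = 400
X = rng.normal(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)
names = ["x1", "x2", "x3"]


def fit_ols(features, target):
    """Return least-squares coefficients, intercept first."""
    design = np.column_stack([np.ones(len(features)), features])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict(coef, features):
    """Predict outcomes from coefficients that include an intercept."""
    design = np.column_stack([np.ones(len(features)), features])
    return design @ coef


# Theory-driven use: the analyst fixes the specification in advance
# (outcome on x1, controlling for x2) and interprets the x1 coefficient.
theory_coef = fit_ols(X[:, [0, 1]], y)
print("Coefficient on x1:", round(float(theory_coef[1]), 2))

# Prediction-driven use: try every subset of variables and keep the one
# with the lowest mean squared error on held-out observations.
train, test = slice(0, 300), slice(300, n)
best_subset, best_mse = None, float("inf")
for k in range(1, 4):
    for subset in itertools.combinations(range(3), k):
        cols = list(subset)
        coef = fit_ols(X[train][:, cols], y[train])
        errors = y[test] - predict(coef, X[test][:, cols])
        mse = float(np.mean(errors ** 2))
        if mse < best_mse:
            best_subset, best_mse = subset, mse

print("Best predictive subset:", [names[i] for i in best_subset])
```

In the first use the object of interest is a coefficient; in the second it is the held-out prediction error, and the chosen variables matter only insofar as they improve it.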
Although these prediction policy problems represent an exciting frontier, the big data
revolution also prompts a critical examination of the role of theory in computational social
science. Whereas many computer science applications emphasize the prediction framework, we
also want to emphasize the role that social science theory and practice play in shaping good
quantitative studies. Before applying a machine learning or statistical model to a problem, social
scientists must first contemplate what the social problem is, and how they want to operationalize
that concept in a quantitative measure. In concrete terms, they should be able to answer the
simple question: what is the estimand? (Lundberg, Johnson, and Stewart 2021). Once a social
scientist is able to properly conceptualize the theoretical and social problem they are interested
in, machine learning becomes a tool to uncover new concepts, facilitate causal inference, and
derive new measurements (Grimmer, Roberts, and Stewart 2021). Importantly, we emphasize
that social scientists should not lose sight of the fact that they aim to solve social problems, and
2.5. Engagement with Ethical Constraints that Arise When Working with Big and Complex
Social scientists will likely play an important role in driving conversations around ethics
and privacy in machine learning and artificial intelligence. For example, the COVID-19
pandemic raised questions about how to balance individual privacy interests and public health
regulator access to information for end uses such as contact tracing. Several states are
considering using predictive algorithms for making bail decisions, and the debates about
implementing these systems implicate race and criminal justice (Angwin 2016). Social scientists
are often domain experts in these areas, and will therefore be essential in not only engineering
such systems, but also thinking about how to manage tradeoffs and do so with respect to impacts
on vulnerable populations.
At this point, graduate students reading this article may wonder: so what courses should I
take? Here, we outline the basic building blocks of what courses might constitute a
“computational social science curriculum.” We must emphasize that most students will not take
all of these courses, or even most of them. Insofar as students can take data science courses that
cover the fundamentals of these concepts, they would find it advantageous to do so.
science. Specifically, students should have some familiarity with probability, mathematical
statistics, calculus, linear algebra, discrete math, and optimization more broadly. Probability
courses will generally cover concepts like probability distributions, frequentist and Bayesian
thinking, and basic rules of probability. These building blocks are important for understanding
how statistical inference and certain machine learning models work. Calculus and optimization
introduce key concepts for understanding how machine learning “finds” the best solution to a
problem. Linear algebra also deals with this idea and is essential for understanding regression
methods. Some graduate departments, especially those with strong quantitative programs, may
already cover these topics in their own methods sequence. In other cases, students may find that
taking the corresponding undergraduate coursework, or graduate work aimed at beginners, will
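As a small illustration of how calculus and optimization let a model "find" the best solution, the sketch below fits a one-parameter regression by gradient descent and compares the result with the closed-form least-squares answer. The toy data, learning rate, and number of steps are arbitrary choices made for illustration.

```python
# Toy data (illustrative): y is roughly 3 times x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.1, 5.9, 9.2, 11.8, 15.1]


def mse(slope):
    """Mean squared error of the model y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_gradient(slope):
    """Derivative of the mean squared error with respect to the slope."""
    return sum(-2 * x * (y - slope * x) for x, y in zip(xs, ys)) / len(xs)


# Gradient descent: start from a guess and repeatedly step downhill.
slope = 0.0
learning_rate = 0.01
for _ in range(500):
    slope -= learning_rate * mse_gradient(slope)

# Closed-form least-squares slope for a regression through the origin.
closed_form = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

print("Gradient descent slope:", slope)
print("Closed-form slope:", closed_form)
print("Final mean squared error:", mse(slope))
```

The two slopes agree closely: the derivative tells the algorithm which direction reduces the error, while the closed-form expression is the kind of result that linear algebra delivers directly for regression.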
Second, and less commonly taught in social science graduate methods sequences, are
concepts drawn from computer science. Oftentimes, computer science majors will start with
courses covering algorithms and data structures, and familiarity with these topics is essential for
making the most of computational approaches to social science. Students need not take these
courses explicitly, but should look for courses that allow them to learn the basics.
by learning about data visualization, data wrangling, and ethics in machine learning. These areas
are going to be where graduate social science students will have a comparative advantage
relative to computer science or statistics graduate students. Data visualization and storytelling are
key communication skills that Ph.D. students are encouraged to develop to share and promote
their work regardless of discipline. Social science students are also familiar with the challenges
of working with real-world data that are often messy and not generated in a controlled
environment, and thus are well-positioned to become expert data wranglers. Perhaps most
importantly, social science students are likely already well trained in thinking about the ethical
dimensions of their work, and the context surrounding various social problems. Fairness,
accountability, and transparency in machine learning is a rapidly growing field that is clamoring
for social science insights to help guide the development of ethical machine learning systems.
Social science students, particularly those who study topics like education, criminal justice,
public health, and other topics that implicate race, gender, immigration, etc. should seek out
coursework that hones their knowledge of these topics in the context of machine learning.
Within these broad categories, students may further specialize in various ways. Database
management, network science, computational linguistics, and computer vision are examples of
concepts that can enable working with complex data like datasets with millions of rows, text, and
images. Different universities and departments may cover these concepts in various ways, and
graduate students should chart their own path by taking advantage of courses, workshops, and
external opportunities to fill gaps in their home departments’ curricula. For example, some
departments may expect students to learn R or Python alongside introductory statistics materials
in the first-year curriculum, whereas others explicitly offer something like a “computational tools”
course. To illustrate, one of the authors of this piece developed and taught a year-long
“Computational Social Science” course designed for second-year Ph.D. students to develop
computational skills alongside applied machine learning and causal inference work. Another one
of us taught a computational tools course that emphasized developing skills in things like web
scraping, processing scanned documents, and managing file systems. Another option is to focus
on statistical and causal inference during semesters and host a programming bootcamp that
Beyond these core competencies, we observe that students specialize in a variety of ways.
Some focus on applying new methods to a pre-existing domain or subfield. Others become
experts on particular types of data like social network data, text data, or image data and work
across a variety of areas. Students may also find themselves leaning more toward the social
sciences or toward data science venues depending on their interests and aspirations.
Again, we emphasize in the strongest terms that no one person will be an expert in all of
these things, or even most of them. Because data science is so interdisciplinary, there is a
temptation to attempt to become a “unicorn” data scientist who is an expert in computer science,
statistics, and their own domain. This is especially a challenge for graduate students because they
often try to optimize for multiple career paths. However, this is likely not a realistic goal.
Instead, students should strive to develop the core competencies we mentioned, which are the foundations for self-directed
study, and then adapt to figure out what mixture of advanced skills best suits them.
The depth in the expertise is still social science (their domain), but they are competent in
data science) and have real-world data experience (part of their research experience). Ideally,
they partner with others who complement these skills and engage in collaborative learning. We
also emphasize that students will likely not take dozens of courses to cover all of these topics.
Students should strive to take courses that help them build strong foundations, and learn how to
teach themselves challenging new material. Because computational social science is fast
evolving, no student will be done learning after two years of coursework and instead should
focus on developing the capacity and confidence to learn new concepts and skills throughout
In closing, we want to emphasize that in addition to the importance of acquiring a
particular set of computational skills, we are firm believers that computational social science
training should be driven by and based on students’ own domain knowledge. Learning
computational methods does not replace but rather supplements the traditional social science
training in theory and methods. First, computational methods are a vast and quickly growing
subject. No one can learn everything. Which computational methods one should learn first and
with how much depth depends on the problems one needs to solve. Second, expertise in data, not
only models, matters. Garbage in (low-quality data input), garbage out (low-quality data output).
Bias in (biased data input), bias out (biased data output). A high-quality and trustworthy model
comes from a deep understanding of the data generation process. Social scientific understanding
of human behaviors and institutions helps to develop data products (broadly construed) that
avoid unintentional harms and are aligned with human values (Engler 2022). Third, domain
knowledge is a competitive edge for social scientists. Core data science skills, such as basic
data science education programs at both the undergraduate level and professional “boot camps.”
Therefore, social scientists will always face challenges competing with statisticians and software
engineers in the methods- or tools-focused jobs and positions. However, there are many data
science jobs where domain knowledge is highly valued, including but not limited to online trust
and safety, quantitative UX (user experience), quantitative marketing, and people analytics. In
these roles, domain knowledge is of key interest to hiring committees, and valuable commodities
3. Step 2: Building a Data Science Portfolio
A data science portfolio includes projects and outputs (Robinson and Nolis 2020). Such
a portfolio is an integral part of data science education (Nolan and Stoudt 2021) and
career-building (Marlow and Dabbish 2013; Craig et al. 2018; Scolere 2019). Preparing a portfolio is
especially critical for pursuing a non-academic data science career, where publication is not the
only, and is far from the most important, metric of performance. Students do not need to take
courses in everything related to data science, and at some point, building a portfolio that
demonstrates applied knowledge will be a better use of their time and effort.
Below, we will explain this concept by comparing a data science portfolio with a
curriculum vitae (CV), the standard document used to summarize job market candidates’
First, a data science portfolio defines outcomes more broadly than a CV. It takes
non-publication outputs seriously. For instance, developing data products, such as open-source
software, interactive maps, and dashboards, is a strong indicator of performance in a data
science portfolio. Such products signal strong programming skills and public engagement.
Second, a data science portfolio is deeply concerned about processes, not only outputs. In
an academic setting, as long as one produces a correct figure or table for a manuscript, few will
scrutinize the exact process. However, applied settings have tight deadlines and other resource
constraints. There is also a strong norm of open source code and reproducibility–that is, someone
else may want to adapt code into their own project. Therefore, it is critical to produce the same
outputs more efficiently with as little communication cost as possible. Modern data science
projects are often large-scale and require collaboration. A data science portfolio offers the chance
to demonstrate some of the core technical skills such as writing legible code.
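As a minimal sketch of what legible, reproducible code can look like in practice, consider the following Python example. The function name, the toy survey data, and the choice of a simple percentile bootstrap are assumptions made purely for illustration and are not drawn from any particular project.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_resamples=1000, alpha=0.05, seed=42):
    """Return a (lower, upper) percentile bootstrap interval for the mean.

    The seed is fixed so that anyone who reruns the script gets the
    same interval from the same data.
    """
    rng = random.Random(seed)  # local generator, no hidden global state
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical toy data: ten survey responses on a 1-5 scale.
responses = [3, 4, 2, 5, 4, 3, 4, 1, 5, 4]
first_run = bootstrap_mean_ci(responses)
second_run = bootstrap_mean_ci(responses)
print(first_run == second_run)  # identical runs give identical intervals
```

Because the random generator is seeded inside the function, a collaborator who reruns the script on the same data obtains the same interval. This is the kind of process transparency that a portfolio repository can display.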
How does one construct and advertise an effective data science portfolio? At a minimum,
graduate students should become familiar with version control tools, and make frequent use of
open-source coding platforms such as GitHub. Git is a tool for managing and tracking changes to
a codebase, thus allowing users to exercise control over different versions of their work. GitHub
is an online platform built on Git that offers additional features for collaboration, tracking, and
hosting code repositories. This last function is especially important for developing a data science
portfolio–whenever possible, graduate students should open-source their code on a platform like
GitHub to signal their ability to work on different computational problems and contribute to the
community. Also, these GitHub repositories show that they can write good code in a common
We advise that graduate students begin working with these tools and learning best
practices (good habits) as early as possible. There is a learning curve at the beginning as these
tools require some familiarity with command-line tools, but the payoff is enormous if a graduate
student develops a varied and extensive portfolio across several years of their Ph.D. program.
The problem defined in this basic data science portfolio does not need to be very novel
or deeply impactful. The point of having a portfolio is about showing and sharing your work and
Finally, since this is a social science Ph.D. student’s data science portfolio, it should
not only demonstrate technical skills but also illustrate knowledge of some topic, policy
area, realm of social behavior, or institution that they can bring to the table. This knowledge
gives them a competitive edge over candidates who have not received such extensive training in
the social sciences.
So far, we have emphasized hard skills. However, soft skills like communication are
equally, if not more, crucial. In applied settings, data scientists work in a team and solve a
problem the team faces using data. The analysis does not speak for itself. Data scientists need to
communicate with (their and other) team members, organizational leadership, and non-technical
audiences to explain the team’s problems and find solutions that address these issues given the
resource constraints. This challenge is also present in academia. Suppose that one gives a job talk
based on computational social science research. Using computational methods does not by itself
make the research attractive. The candidate needs to appeal to an audience that may be unfamiliar
with, and even skeptical about, using these new methods in social science research. In either case,
data communication is a core competency. Data visualization serves the same goal as an art and
One way of honing and displaying effective communication skills is to write data science
blog posts. This exercise is similar to teaching in its effect. It helps one to gain skills in
communicating computational tools and techniques to a wider audience. Another valuable
way of honing and displaying effective communication skills is by giving presentations. The
DSSG program emphasizes frequently presenting work to a variety of audiences–other fellows,
stakeholders and government partners, and the general public–and with different time
constraints. Being able to pitch the same project in 5 minutes or 20 minutes, and to both lay and
expert audiences is a key element of working across disciplinary lines, and within industry and
non-profit organizations. While publishing such posts is most valuable, for students who are new
to data communication, just writing them (and not sharing them) can be a valuable experience.
4. Step 3: Connecting With Other Computational Social Scientists
Being a computational social scientist means not only learning data science skills and
building a data science portfolio, but also being engaged in the field of computational social
science. Being engaged in computational social science activities can be immensely helpful for
your research by keeping you updated on the state-of-the-art in this constantly evolving field and
connecting you with collaborators. Interacting regularly with other computational social
scientists can also help Ph.D. students find an intellectual community, which can be harder
for computational social science Ph.D. students because their interests tend to be more
interdisciplinary than those of other social science Ph.D. students. This engagement can
take many forms. Here, we share three main forms of engagement: workshops and trainings,
conferences, and internships. Even students who are new to computational social science have
value to add to the community, and being engaged in the community gives you the power to connect
others and ensure people feel welcome in the community (Kim, Lebovits, and Shugars 2021).
Many of the coauthors of this article participated in programs like SICSS and DSSG and have
4.1. Workshops and Trainings
science Ph.D. students to learn and hone their skills in programming and computational methods.
Given that many social science Ph.D. programs do not yet provide courses on programming and
computational methods as part of their curriculum, it is not uncommon for computational social
science Ph.D. students to seek out additional workshops and trainings to supplement their
training. Workshops and trainings also provide computational social science Ph.D. students with
great opportunities to meet other students, faculty, and industry professionals in the field. Even
for early career computational social scientists, workshops and trainings can serve as useful
opportunities for continuous learning to keep up with the state-of-the-art in this rapidly evolving
Fortunately, there are currently many options available for computational social science
workshops and trainings. A popular option is to participate in a computational social science summer
institute. Some examples of summer institutes include SICSS and the Santa Fe Institute’s
Complex Systems Summer School. These summer institutes are typically held every year, and
they provide participants with an in-person, cohort-based training experience over several weeks.
The summer institutes usually emphasize breadth of computational social science topics and
attract participants across different disciplines. The summer institutes are great immersive
experiences, but they require some advance planning as they are usually quite selective and
some (but not all) are provided free of charge. For example, those interested in SICSS in an
upcoming summer should expect to prepare and submit an application package consisting of a
CV, statement of interest, and writing sample by late February of that year.
Another way that computational social science Ph.D. students can pursue additional
training is to take topic-specific short courses and workshops. Short courses and workshops
typically focus on a specific topic, and they can range from a few hours to a few days. The
ICPSR provides many short courses in the winter and summer on computational social science
topics like network analysis, machine learning, and agent-based models, though they do charge
fees. Some universities offer free short courses and workshops on computational social science
topics to their students through their libraries and/or research and computing departments.
More recently, computational social scientists have taken the initiative to organize online
tutorial series. For example, the NLP+CSS 201 tutorial series, organized by Ian Stewart and
Katie Keith, provided Ph.D. students with a virtual opportunity to learn about different topics in
natural language processing, such as word embedding models and BERT models. In fact, many
of the tutorials were taught by computational social science Ph.D. students. The recorded lectures
and notebooks were also shared on their website, providing additional opportunities to share
learning. This is a particularly promising model for organizing computational social science
workshops and trainings that emphasizes community and accessibility. Students do not need
to attend every training and should prioritize those that fit their needs and
interests.
4.2. Conferences
While social science Ph.D. programs typically train and encourage students to attend and
participate in key academic conferences within their disciplines, computational social science
Ph.D. students can benefit from taking part in a broader range of conferences. In particular, there
are many interdisciplinary conferences focused on computational social science topics where
computational social science Ph.D. students can share their work and make new connections.
One key conference is the International Conference on Computational Social Science, which
draws computational social science researchers from various disciplines in academia and
industry. Other more method- and topic-specific conferences that computational social science
Ph.D. students should consider include: the Text as Data Conference, NetSci, the International
Social Networks Conference (also known as Sunbelt), and the Politics and Computational Social
Science Conference.
showcasing research that uses computational methods. These preconferences provide great
opportunities for computational social science Ph.D. students to present their work to a highly
tailored audience that can provide particularly useful substantive and methodological feedback.
For example, the American Sociological Association has organized a Computational Sociology
Preconference ahead of its annual conference in recent years. The Political Networks Section of
the American Political Science Association also convenes PolNet, an annual convening that
4.3. Internships
Social science Ph.D. students often use their summers to prepare for program milestones
and conduct research. For computational social science Ph.D. students, summers may serve as
especially opportune periods to pursue additional training, conduct research, and engage in
career exploration through internships. While summer internships have long been a common part
of the undergraduate experience, they are also increasingly an important part of the graduate
experience for computational social science Ph.D. students. From a student's perspective,
outside of it. Similarly, from an employer's perspective, the internship helps convince them that
the person can be an effective part of a team and embrace non-academic standards, goals, and
Many organizations now offer summer internships to computational social science Ph.D.
students. In particular, many tech and social media companies like Meta, Twitter, Google,
Amazon, and Microsoft Research have summer internship programs. Some public agencies and
nonprofit organizations also have summer internship programs. The Civic Digital Fellowship²
provides opportunities to work with local, state, and federal governments across the U.S. through
a 10-week internship. DSSG³ has a similar public interest goal and straddles the line between
training (workshops covering both technical and non-technical data science skills) and internship
(building a real-world project with a partner). While summer internships may vary considerably
in their structure, pay, and length, they typically provide opportunities to engage in research in an
applied and often team-based setting. Some summer internships are resident-based, and others
are remote. They tend to pay reasonably well, often more than a graduate student's summer
stipend. And summer internships can provide opportunities for additional internships

² https://www.codingitforward.com/summer-fellowships
³ Here we are mainly referring to the DSSG program hosted by Carnegie Mellon University (previously at the
University of Chicago) and international partners (https://www.dssgfellowship.org/). There are similar programs
carrying the same name as well, for example at the University of Washington
(https://escience.washington.edu/dssg/).
While summer internships can provide many benefits to computational social science
Ph.D. students, they do require significant advance planning and preparation. In terms of
timing, computational social science Ph.D. students should consider applying for internship
positions after completing their coursework and developing a strong data science portfolio.
Computational social science Ph.D. students should also discuss their interest in taking on
summer internships early on with their advisors, department administrators, and potentially their
financial aid departments. These summer internships are competitive, and most of them will
require more than one interview/test in areas like programming, algorithms, statistics, and
research design. In-depth coverage of how to apply for internships and prepare for interviews is
beyond the scope of this article, but interested computational social science Ph.D. students
should be aware that applying for a summer internship will likely involve advance planning
(e.g., turning a CV into a skill-oriented resume) and preparation (e.g., technical interview). In
addition, some of these internship programs advertise that they look for computational social
scientists but exclusively hire computer scientists. Therefore, not getting an internship shouldn’t
always be taken as a slight on candidates’ skills. It could be a matching issue (fit) between hiring
5. Conclusion
Data science opens up many new and exciting career opportunities for social scientists.
This article provides a step-by-step guide to graduate students on how to navigate these emerging
career paths in academic and non-academic job markets. In particular, we encourage Ph.D.
scientist to (1) learn data science skills, (2) build a data science portfolio, and (3) connect with
other computational social scientists, and we share some advice and resources for accomplishing
In this section, we propose some curricular changes that departments may consider
adopting to better support Ph.D. students interested in computational social science. First, we
encourage social science departments to integrate data science skills into their curriculum. For
example, introductory statistics courses that are often part of social science methods sequences
can support Ph.D. students to gain data science skills by offering the course in R or Python, or at
least including some modules in these programming languages. Social science departments can
also better support their Ph.D. students by offering new courses that directly teach data science skills or
how to integrate quantitative and computational methods to conduct social science research. One
of the authors of this piece developed and taught a year-long “Computational Social Science”
course designed for second-year Ph.D. students to develop computational skills alongside
applied machine learning and causal inference work. Another one of us taught a computational
tools course that emphasized developing skills in things like web scraping, processing scanned
Second, we propose that departments recognize credits received from other departments
if they are relevant to students’ data science training. This will give students more flexibility and
ownership over their coursework, especially methods training. Departments will need to be
mindful about how to balance between prescribing an exact set of courses that define the field
and leaving students free to embark on a “choose your own adventure” curriculum with little
guidance. The appeal of prescribing an exact set of courses is that it guarantees every student
will have the same preparation, but with the downside of being unresponsive to new
developments in other disciplines and depriving students of the opportunity to integrate data
science into their domain expertise. On the other hand, leaving students free to decide the entire
curriculum for themselves may see some students craft a unique, challenging, and integrative
curriculum, while other students who are less well resourced might be left with important skills
gaps because they are unaware of what they need to learn. To deal with this tension, we advise that departments look at their
curricula in light of the major concepts we have surfaced, identify gaps in their own offerings,
and encourage students to help find appropriate training on campus or elsewhere. Departments
should be mindful not just about how to create a computational social science program, but how
to maintain it. We expect that even the concepts we surface here will change in the coming years,
and new ideas about what constitutes computational social science will develop. Avoiding the
trap of optimizing for today at the expense of flexibility tomorrow will be key to setting students
up for success.
Third, we propose that departments limit the number of field exams students should take
(ideally one) and provide them with information on summer internships–even as early as during
program orientation. This is a far more effective method to help graduate students navigate
non-academic careers than offering a workshop on this subject when they are about to enter
the job market. One of the challenges of training students in computational social science is that
many of the concepts and skills need to be taught in addition to the core curricula of their Ph.D.
programs. This approach can easily add a year or more to a typical graduate student's workload.
The standard model of two years of coursework followed by several field exams before moving
on to dissertation writing should therefore be reexamined. One potential change would be
trimming the number of field exams to one and allowing students to substitute advanced
computational training.
Fourth, we propose that departments recruit computational social scientists back into
academia from industry and non-profit positions in order to have more faculty on staff who can
advise and mentor students who are interested in computational social science. Bringing these
talents into a department is a good way to design an applied data science curriculum and support
students with broad professional development as well. These roles do not need to be tenure-track
and they could be visiting positions (a long sabbatical) from industry or non-profit. Professional
schools such as law and medicine traditionally hire many faculty members with a few years of
practice experience (i.e., professor of practice). Social science departments can similarly signal
that they value computational social science by hiring candidates with professional experience
who can help students navigate diverse career paths. Such candidates may also enrich the
disciplines by energizing them with new ideas and research agendas drawn from these
non-academic experiences.
non-academic job markets and balance them because we believe students should define their
success on their terms–and we hope that departments in this changing landscape will also do the
same. Given students’ career goals and life circumstances, they might prefer non-academic jobs
to academic ones. Regardless of field, good jobs are good jobs if they align with students’ own
priorities and values. Respecting these career decisions will give students more agency over and
satisfaction with their graduate training when mental stress is a pervasive problem in graduate
school (Almasri, Read, and Vandeweerdt 2021). An important aspect of supporting students’
career aspirations is to allow students the flexibility to spend time on endeavors not directly
related to their dissertation research such as internships and many of the summer programs we
mentioned. While social science departments have a long history of supporting students’
methods training through venues such as ICPSR, Northwestern’s Workshop on Causal Inference, and the
Berkeley Initiative for Transparency in the Social Sciences (https://www.bitss.org/), they should
also encourage students to explore options outside of academia. Internships at private, non-profit,
and government organizations can help students explore different career paths, develop diverse
networks, and hone skills that are not well covered in academic settings. In addition, these
experiences help students develop confidence that non-academic jobs, which they often regard
as unknown and scary, could be a good fit for them. Importantly, we emphasize that this flexibility
is important for students interested in both academic and non-academic careers. The experience
of working with real-world, big, messy data is valuable regardless of what job market a student
eventually enters. Collaborating with diverse team members is similarly an asset that is hard to
develop solely as a Ph.D. student working on a dissertation alone. One of the biggest benefits of
non-academic work is the greater teamwork, more frequent feedback from colleagues, although
selection the possibility for interns to publish during the internship and/or have access to internal
data for research after the internship. This makes it easier for social science graduate students to
thread the needle of finding summer opportunities that are valuable to both their graduate
Departments can also support students in pursuing these strategies by allowing internship papers as
part of their dissertation. Some departments do not allow co-authored papers or have a limit to
how many co-authored papers one can have, but these rules should reflect students’ diverse and
evolving career paths. Overall, we hope this article provides a useful starting point for graduate
students and departments interested in computational social science. Surfacing the hidden
curriculum is an important step toward democratizing access to computational social science and
helping any and all students interested in data science and social science to think of themselves
as computational social scientists. We do not at all claim that we have the final word on what
makes a computational social scientist or computational social science curriculum, but instead
hope that this starts important and necessary conversations about the long-term development of
this field.
Disclosure Statement
Aniket Kesari, Jae Yeon Kim, Sono Shah, Taylor Brown, Tiago Ventura, and Tina Law have no
Acknowledgment
We thank Eric Giannella, Milan de Vries, Brian Heseung Kim, Rebecca Johnson, and Sarah
References
Abadie, A., Diamond, A., and Hainmueller, J. (2014). “Comparative politics and the
Almasri, N., Read, B., and Vandeweerdt, C. (2021). “Mental Health and the PhD: Insights and
Implications for Political Science.” PS: Political Science & Politics, 1-7.
Angwin, J., Larson, J., Mattu, S., and Kirchner, L. (2016, May 23). “Machine Bias: There’s Software
Used Across the Country to Predict Future Criminals. And It’s Biased Against Blacks,”
ProPublica URL:
www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing (accessed
August 6, 2022)
Athey, S. (2015). “Machine Learning and Causal Inference for Policy Evaluation.” In
Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and
Barham, E. and Wood, C. (2021). “Teaching the Hidden Curriculum in Political Science.” PS:
Calarco, J.M. (2020). A Field Guide to Grad School: Uncovering the Hidden Curriculum.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W. and Robins,
J. (2016). “Double Machine Learning for Treatment and Causal Parameters.” arXiv preprint
arXiv:1608.00060.
Craig, M., Conrad, P., Lynch, D., Lee, N., & Anthony, L. (2018). “Listening to Early Career
Edelmann, A., Wolff, T., Montagne, D., & Bail, C. A. (2020). “Computational Social Science
Engler, A. (2022, April 4). “Uncommon Advice on Becoming a Data Scientist in the Public
https://www.hertie-school.org/en/digital-governance/research/blog/detail/content/uncommon-adv
Geiger, R.S., Mazel-Cabasse, C., Cullens, C.Y., Noren, L., Fiore-Gartland, B., Das, D. and
Brady, H. (2018). “Career Paths and Prospects in Academic Data Science.” Report of the
Moore-Sloan Data Science Environments Survey. Berkeley, California: UC-Berkeley Institute for
Grimmer, J., Roberts, M., Stewart, B. “Machine Learning for Social Science: An Agnostic
https://www.annualreviews.org/doi/abs/10.1146/annurev-polisci-053119-015921 (accessed
Lundberg, I., Johnson, R. and Stewart, B.M. (2021). “What Is Your Estimand? Defining the
Target Quantity Connects Statistical Evidence to Theory.” American Sociological Review, 86(3),
pp.532-565.
Kim, J.Y. and Ng, Y.M.M. (2022). “Teaching Computational Social Science for All.” PS:
Kim, S.Y.S., Lebovits, H. and Shugars, S. (2021). “Networking 101 for Graduate Students:
Lue, R. A. (2019). “Data Science as a Foundation for Inclusive Learning.” Harvard Data Science
Review, 1(2).
Marlow, J., & Dabbish, L. (2013). “Activity Traces and Signals in Software Developer
2022)
National Academies of Sciences, Engineering, and Medicine. (2018). Data Science for
Nolan, D. and Stoudt, S. (2021). “The Promise of Portfolios: Training Modern Data Scientists.”
Robinson, E. and Nolis, J. (2020). Build a Career in Data Science. Shelter Island, NY: Manning
Publications.
Salganik, M.J. (2019). Bit By Bit: Social Research in the Digital Age. Princeton, NJ:
Scolere, L. (2019). Brand Yourself, Design Your Future: Portfolio-building in the Digital Age.
VanderWeele, T.J. and Ding, P. (2017). “Sensitivity Analysis in Observational Research: