Behavior Research Methods

https://doi.org/10.3758/s13428-023-02161-x

DataPipe: Born‑open data collection for online experiments


Joshua R. de Leeuw¹
jdeleeuw@vassar.edu

¹ Department of Cognitive Science, Vassar College, Poughkeepsie, NY, USA

Accepted: 2 June 2023


© The Psychonomic Society, Inc. 2023

Abstract
DataPipe (https://pipe.jspsych.org) is a tool that allows researchers to save data from a behavioral experiment directly to the Open Science Framework. Researchers can configure data storage options for an experiment on the DataPipe website and then use the DataPipe API to send data to the Open Science Framework from any Internet-connected experiment. DataPipe is free to use and open-source. This paper describes the design of DataPipe and how it can help researchers adopt the practice of born-open data collection.

Keywords Open data · Open science · Data sharing

Behavioral experiments conducted over the Internet create unique challenges for managing research data. Unlike laboratory experiments, online experiments must save data on storage devices that are not directly connected to the computer that is running the experiment. For example, a typical online experiment uses a web server to send experiment code to a participant's browser. The browser, on the participant's computer, then runs the experiment and sends the data back to the web server. Configuring a web server is not a task that most researchers are familiar with. To make this easier for researchers, several open-source tools (e.g., JATOS, Lange et al., 2015; PsiTurk, Gureckis et al., 2016; Pushkin, Hartshorne et al., 2019; NivTurk, Zorowitz & Bennett, 2022) and commercial platforms (e.g., Pavlovia, https://www.pavlovia.org; Gorilla, https://www.gorilla.sc; Cognition, https://cognition.run; Open Lab, https://open-lab.online) have been developed. While these tools solve the basic data management problem, they also typically mimic the setup of a laboratory-based experiment. Data are stored in a database or folder that only the research team has access to. If a researcher wants to later publish the raw data, they must transfer the files to a public repository.

Given that many researchers rely on this kind of tool, there is an opportunity to help researchers improve their data management by designing tools that support best practices. While there are many facets of good data management (Van der Eynden et al., 2011), the focus of this article is open data. Open data means that the data from an experiment are posted in a repository that anyone can access (Murray-Rust, 2008). There are many tools for sharing data in this way, including the Open Science Framework (https://osf.io), ResearchBox (https://researchbox.org), figshare (https://figshare.com), Dataverse (https://dataverse.org), Databrary (https://databrary.org), and more. These tools are typically free to use and available to any researcher, yet many research articles are published without making the data open. While there are no discipline-wide estimates of the proportion of articles that have open data, there are a few estimates from specific contexts in psychology that range from < 1 to 66% (Vanpaemel et al., 2015; Obels et al., 2020; Hardwicke et al., 2022).

There are costs and benefits to researchers for making data available; achieving widespread adoption of the practice requires getting the balance of these competing factors right. The factors that encourage and discourage researchers to make their data open include a diverse set of themes ranging from institutional norms to individual beliefs (Zuiderwijk et al., 2020). At the institutional level, many journals have adopted data sharing policies that require open data (Nosek et al., 2015)—though the enforcement of and follow-through on such policies is sometimes lax (Alsheikh-Ali et al., 2011; Hussey, 2023)—and/or recognition for articles that elect to include open data (e.g., Kidwell et al., 2016). Some research funders also incentivize data transparency by requiring funded researchers to make their data open. The National Institutes of Health in the United States, for example, recently adopted such a policy (National Institutes of Health, 2020). There are also calls for collective action, such as researchers insisting on a minimal level of openness as a precondition for peer review (Morey et al., 2016).


These institutional requirements are certainly one way to increase data availability, but they don't apply to all circumstances. Some research is published in journals that do not have (or do not effectively enforce) open data policies, some research has no funder-based requirements, and not all peer reviewers insist on making data open. If we want to push towards broader adoption, we should think carefully about why researchers choose not to make data available when they are not required to. One of the most straightforward reasons is that publishing a dataset in an open repository is an extra step of work for researchers. Sometimes this step is quite simple and other times it can be very time-consuming. It might be reasonable to argue that this shouldn't be considered an extra step—that it should be considered a foundational part of the research—but existing incentives don't always align with this view. So, in addition to further shifts in incentive structures, the practice can be encouraged by making it easier for researchers to share their data, so that as little additional effort as possible is required to make data open.

One way to do this is to create tools that automatically publish all new data to a repository that is, or can be easily made, public. Rouder (2016) called this born-open data and described such a system used in his laboratory. In his implementation, data files are initially stored on a laboratory machine. These files are saved in a directory that is configured as a Git repository, which allows Git to track any changes made to the directory and synchronize the changes with a remote copy of the repository. A scheduled script automatically synchronizes the repositories once per day. A similar outcome could be achieved by saving data files directly to a folder that synchronizes with a cloud-based storage provider like Dropbox (Klein et al., 2018). For data collected over the Internet, the same general approach can be implemented by sending the data to a server that automatically synchronizes the data with a public repository. Cousineau (2021) documents an approach along these lines using PHP scripts and Git, and provides a plugin for jsPsych that can connect to these scripts.

While these are all excellent examples of ways to achieve born-open data, they require some effort and technical comfort on the part of the researcher to set up. Any researcher could certainly follow the guide outlined by Rouder (2016) or use the scripts published by Cousineau (2021), but this would require additional time to set up relative to simply not collecting data in this way. For a researcher who is already committed to publishing all of their data in a public repository, the initial investment in configuring these tools will save time. However, for a researcher who is not planning to publish their data in a repository, this setup is an extra task and does not save the researcher time. This is particularly true for online data collection because the simplest options (e.g., saving all data files in a Dropbox folder) are not viable when the data are collected on remote computers. To maximize adoption of born-open data, the goal should be to make it easier to collect born-open data than it is to collect traditional closed data.

This paper describes DataPipe, a free and open-source tool that allows researchers to send data directly from an experiment to the Open Science Framework (OSF). DataPipe is a hosted service that does not require the researcher to configure their own system. Instead, researchers create an account with DataPipe, configure the experiment using the DataPipe website, and add a few lines of code to their existing experiment. This makes the setup as easy as or easier than most existing methods for managing online data collection. DataPipe is not only free; because of its design, it also allows researchers to host their online experiments using free website providers. This means that a researcher can run an online experiment entirely for free, aside from any payments that are made to participants. DataPipe can be used with any experiment software that can make HTTP requests over the Internet. This includes any experiment software that runs online, plus many lab-based software packages as well.

DataPipe

DataPipe was developed with the goal of making born-open data collection an attractive default option for researchers running online experiments. I tried to accomplish this goal in two ways. First, using DataPipe requires no experience with managing data storage for online experiments. Hopefully, researchers who are new to online data collection will view DataPipe as an accessible solution for data management and thus be encouraged to adopt born-open data collection. Second, experiments using DataPipe can be run entirely for free—not only is the DataPipe service free, but using DataPipe allows researchers to replace paid hosting providers with free hosting providers because DataPipe handles the functionality that a paid hosting provider would offer. My hope is that the cost savings that DataPipe can provide will attract users who have a working data management solution but can save money by using DataPipe, and thus also be encouraged to adopt born-open data collection.

In this section, I describe the design and implementation of DataPipe with a focus on how it meets these two goals. I do not provide a full tutorial of how to use DataPipe in this paper, as specific details may change as the application develops and updating this paper is not possible after publication. Instead, a "getting started" guide is available at https://pipe.jspsych.org/getting-started and detailed code examples are provided inside the DataPipe experiment dashboard after an experiment is created. Examples of using DataPipe for a variety of experiment software (including jsPsych, lab.js, and PsychoPy) are available at https://github.com/jspsych/datapipe-examples.


Fig. 1 Overview of DataPipe's design. Researchers interact with the DataPipe web application to configure an experiment. The parameters of the experiment (e.g., which OSF component it connects to) are stored in DataPipe's experiment configuration database. When an experiment sends a data file to the DataPipe API, the API checks the configuration of the experiment, processes the request, and sends the file to the OSF.

Design

DataPipe is a service that operates between a researcher's experiment and a researcher's project on the Open Science Framework. There are two ways to interact with DataPipe: via a web application that allows researchers to create and configure experiments and via an API (Application Programming Interface) that experiments can use to communicate with DataPipe. To use DataPipe, a researcher logs in to the web application at https://pipe.jspsych.org, configures a DataPipe experiment, and then adds a few lines of code to their experiment software to communicate with the API. When an experiment makes an API request, DataPipe processes the request, performs the appropriate action (e.g., uploading data to the OSF), and returns a message indicating whether the request succeeded and, if it failed, why. Figure 1 visualizes the components of DataPipe and how they interact with each other, experiments, and the OSF.

DataPipe currently supports three distinct interactions: saving text-based data files, saving binary data (e.g., an image) encoded in base64, and sequential condition assignment. These features are accessed through calls to DataPipe's API.

Text-based data storage Most experiment data is recorded in a text-based format, such as CSV or JSON. To save text-based data using DataPipe, a researcher can send an API request with the data and a filename. DataPipe will create a new file containing the text on the OSF. In addition to this simple pass-through storage mechanism, DataPipe also supports basic data validation. Researchers can require that any data sent to this endpoint be a valid CSV and/or JSON file, and can also provide a list of variable names that must appear in the data. If the validation fails, the file is rejected by DataPipe. DataPipe also rejects a request if the filename already exists on the OSF. This prevents research participants from overwriting existing data.

Binary file storage Occasionally experiments need to store non-text-based data, such as audio or video recordings. To save this kind of data using DataPipe, a researcher can send the data encoded in base64 format¹, as well as a filename. DataPipe will then decode the base64 representation into the original filetype (e.g., .webm for a video recording from the participant's camera) and post the resulting file on the OSF. Converting to base64 format is a straightforward task in most programming languages and some experiment software (e.g., jsPsych) will automatically produce base64 representations of media recordings. One advantage of using DataPipe to save these data is that the file conversion will happen automatically prior to posting the data on the OSF, so users accessing the OSF repository will be able to interact directly with the media content.

¹ Base64 format encodes the data as a string using a vocabulary of 64 characters. We use base64 for non-text-based data because it is easy to encode and decode the format in a variety of programming languages and it enables transmission of the data as a text stream.
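As an illustration of the binary workflow, the sketch below converts a recorded media Blob to base64 in the browser and sends it to the base64 endpoint. The request body mirrors the text-based request described above (an experimentID, a filename, and the data); the exact endpoint path and field names should be checked against the DataPipe documentation, and blobToBase64 is a helper written for this example.

// Convert a recorded media Blob (e.g., from a webcam or microphone) to a
// base64 string, then send it to DataPipe's /api/base64 endpoint. DataPipe
// decodes the string and posts the original file (e.g., a .webm recording)
// to the experiment's OSF component.
async function blobToBase64(blob) {
  const dataUrl = await new Promise((resolve) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result);
    reader.readAsDataURL(blob);
  });
  return dataUrl.split(",")[1]; // strip the "data:<mime>;base64," prefix
}

async function saveRecording(experimentID, filename, blob) {
  const base64 = await blobToBase64(blob);
  await fetch("https://pipe.jspsych.org/api/base64/", {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "*/*" },
    body: JSON.stringify({ experimentID, filename, data: base64 }),
  });
}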

Condition assignment DataPipe supports sequential condition assignment. Researchers can configure the number of conditions in an experiment using the web application.


When DataPipe is queried for the condition, it returns the next condition number in sequential order, resetting to 0 when the maximum number of conditions is reached. While this feature is not directly relevant to using DataPipe as a tool for born-open data, it does fill a need for researchers who are trying to host their experiments without setting up their own server. Any attempt at balanced condition assignment in an online experiment requires some kind of server to record which conditions participants have been assigned to, because participants are usually completing the experiment in parallel. The feature is included in DataPipe because it meets this common need for researchers who want to use DataPipe but require a simple form of condition assignment.
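To make the querying step concrete, the following sketch requests a condition assignment from DataPipe's condition endpoint (described in the Implementation section below) and reads the assigned condition from the JSON response. The response is assumed here to be an object with a condition field; the exact shape should be checked against the DataPipe documentation.

// Ask DataPipe for the next condition in the sequence (0, 1, ..., N-1, then
// wrapping back to 0). The experiment ID comes from the DataPipe dashboard.
async function getCondition(experimentID) {
  const response = await fetch("https://pipe.jspsych.org/api/condition/", {
    method: "POST",
    headers: { "Content-Type": "application/json", Accept: "*/*" },
    body: JSON.stringify({ experimentID }),
  });
  const result = await response.json();
  return result.condition; // assumed field name for the integer condition index
}

// Example: choose which version of the task to run based on the assignment.
getCondition("MY_EXPERIMENT_ID").then((condition) => {
  console.log(`Assigned to condition ${condition}`);
});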
and large files may require a long upload. Most text-based
Experiment configuration In order to use the features data is not large and even medium-sized files like audio or
described above, researchers must configure an experiment video recordings work just fine. The OSF API, at the time of
using the DataPipe web application (https://​pipe.​jspsyc​ h.o​ rg). writing, has a limit of 10,000 requests per day. This limit is
When creating a new experiment in DataPipe, the researcher account-specific, and if it were exceeded DataPipe would be
specifies the OSF project that will be used to store the data. unable to store additional files until the following day. OSF
DataPipe then generates a new OSF data component inside components have a storage limit of 5 GB for private com-
that project and uses this new component to store data. ponents and 50 GB for public components, and a maximum
Once an experiment is created on DataPipe, the researcher file size of 5 GB. DataPipe creates a new component for
can use the website dashboard to manage access to the dif- each experiment, so this limit is experiment-specific when
ferent features. Each of the three features (text-based data using DataPipe.
storage, base64-encoded data storage, and sequential condi-
tion assignment) can be enabled or disabled for an experi- Why is this design better than interacting with the OSF API
ment, giving the researcher full control over what kinds of directly? The OSF API can be directly accessed by any-
data can be written to the OSF component. Disabling unused one with an OSF account, so a researcher could send data
features and disabling all features when data collection is directly to the OSF rather than going through DataPipe.
complete can reduce the likelihood of malicious use (see What's the benefit of using DataPipe?
section on risks, below). Disabling data collection is also a Communication with the OSF API requires authoriza-
way to host an experiment online for demonstration purposes tion via a personal access token. The token is a long string
without sending new data to the OSF. of numbers and letters, and functions like a password to the
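The required-variable rule can be pictured as a simple header check. The function below is a sketch of that kind of check for CSV data, written purely for illustration; it is not DataPipe's actual server-side code.

// A sketch of a required-variable check for CSV data: the header row must
// contain every variable name the researcher listed in the validation rules.
function hasRequiredColumns(csvString, requiredVariables) {
  const header = csvString
    .split("\n")[0]
    .split(",")
    .map((name) => name.trim().replace(/^"|"$/g, "")); // strip whitespace and quotes
  return requiredVariables.every((name) => header.includes(name));
}

// Example: a file missing the "response" column would be rejected.
hasRequiredColumns("trial,rt\n1,532\n", ["rt", "response"]); // false
hasRequiredColumns("trial,rt,response\n1,532,left\n", ["rt", "response"]); // true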
Researchers who want to use the condition assignment feature can configure the number of conditions in the experiment. Currently, only sequential condition assignment is supported (e.g., the first participant gets condition 0, the second gets condition 1, etc.). This is better than pure random assignment if the goal is to balance participants across conditions, but it is not ideal. A future development goal is to record completions by condition in the API so that condition assignment can be adaptive to dropouts. This is a tricky problem to solve in online experiments because participants are often completing the experiment at the same time, and condition assignments have to be made before completion information is known.

Are there limitations on how many files or how much data can be transferred using DataPipe? DataPipe has no restrictions on the number of files or the size of the data. Practically speaking, it may be unrealistic to transfer very large files using DataPipe because participants must keep their browser window open while the files are being uploaded, and large files may require a long upload. Most text-based data is not large, and even medium-sized files like audio or video recordings work just fine. The OSF API, at the time of writing, has a limit of 10,000 requests per day. This limit is account-specific, and if it were exceeded DataPipe would be unable to store additional files until the following day. OSF components have a storage limit of 5 GB for private components and 50 GB for public components, and a maximum file size of 5 GB. DataPipe creates a new component for each experiment, so this limit is experiment-specific when using DataPipe.

Why is this design better than interacting with the OSF API directly? The OSF API can be directly accessed by anyone with an OSF account, so a researcher could send data directly to the OSF rather than going through DataPipe. What, then, is the benefit of using DataPipe?

Communication with the OSF API requires authorization via a personal access token. The token is a long string of numbers and letters, and functions like a password to the OSF account. In order to use the OSF API directly from a JavaScript experiment, the OSF token would need to be publicly viewable by the participant (since it is the participant's computer that would be communicating with the OSF). This would be risky. A malicious participant could view the token and then make API calls to the OSF as if they were the researcher.

DataPipe adds a layer of security between the OSF token and the participant. The researcher using DataPipe provides DataPipe with the OSF token, and the participant sends data to DataPipe. The only information that the participant can access is the ID of the experiment on DataPipe; the OSF token is securely stored in DataPipe's database.

By routing requests through DataPipe, the researcher has much more control over what actions a participant can perform on the OSF repository. Tokens on the OSF can be assigned certain permissions, or actions that can be performed using the token. At the time of writing this article, the options for permissions are to give a token full read-only access to all OSF projects in an account, full write access to all OSF projects in an account, read access for personal profile data, and read access for the email address. This means that, if a researcher were using just the OSF API to write data, the participant would have full write access to all OSF projects in the researcher's account. This is much less granular access control than DataPipe can provide. DataPipe will only write data to the single OSF component that is set in the configuration for an experiment.


How does DataPipe's design make online experiments free? Setting up a web server to host an experiment is one of the more technically challenging aspects of configuring an online experiment. While there are open-source tools available that make this process much easier (e.g., JATOS and PsiTurk), most solutions still require some technical skills and access to a computer that can function as a web server. The exception to this is a handful of commercial platforms for research, which greatly reduce the technical challenges but with the tradeoff of costing money to use.

Hosting an online experiment does not need to be expensive. The (relatively) expensive parts are the storage and bandwidth required to host the experiment files (e.g., potentially a large number of media files) and the infrastructure for managing the data (e.g., some kind of permanent database or file storage). Fortunately, there are platforms that are willing to provide these services for free. Many web hosting providers offer free hosting for websites and media files that are "static", meaning that the files are the same for every user and there is no communication with a database. (An example of a non-static site would be a social media page, where the website needs to reference stored information in order to render the correct content to the user.) Usually, online experiments are not static because a database is needed to store the data generated by the experiment. DataPipe solves this problem by providing a way for a static website to send data to the OSF.

Does this mean that DataPipe is subsidizing the cost? Not significantly. The main organizations that are incurring a cost are the hosting provider (e.g., GitHub) and the data storage provider (OSF). DataPipe serves as a bridge between these services and, since it does not store any experiment data directly, is a very low-cost service to run. The cost, based on cloud computing costs at the time of writing, is about 1 cent per 500 data files processed. This isn't nothing, and if DataPipe becomes very popular then it may require some fundraising to support, but a monthly budget of $100 USD via a donation model could handle roughly 5 million data files.

Implementation

This section describes, at a high level, how DataPipe is built and how to communicate with the API. This section is aimed at researchers who have some technical familiarity with web applications. Researchers who are interested in simply using DataPipe can follow the guides on the DataPipe website for a non-technical overview of how to get started.

There are three major components to DataPipe: (1) the API, (2) the experiment configuration database, and (3) the front-end web application (Fig. 1). The DataPipe codebase is open source and is available at https://github.com/jspsych/datapipe.

The API There are currently three API endpoints, one for each of the features described in the previous section. The /api/data endpoint is used for saving text-based data, /api/base64 is used for saving base64-encoded data, and /api/condition is used for sequential condition assignment. The API uses standard HTTP REST-style requests. Each API function returns standardized status codes (e.g., 201 on a successful create request and 400 on an invalid request) and informative error messages in the case of missing or invalid information. An example API request is shown in Code Block 1. JavaScript-based code samples for all of the API endpoints are provided in the DataPipe experiment dashboard.
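A minimal version of the request described in Code Block 1 is sketched below, assuming the /api/data endpoint and the three body fields named in the caption (experimentID, filename, and data); the placeholder values are illustrative.

// Send a CSV file to DataPipe, which writes it to the experiment's OSF component.
// dataAsString holds the CSV text, e.g., the output of jsPsych.data.get().csv().
fetch("https://pipe.jspsych.org/api/data/", {
  method: "POST",
  headers: { "Content-Type": "application/json", Accept: "*/*" },
  body: JSON.stringify({
    experimentID: "MY_EXPERIMENT_ID", // provided when the experiment is created on DataPipe
    filename: "participant_001.csv", // must be unique; duplicate filenames are rejected
    data: dataAsString,
  }),
});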
Code Block 1. Example API request using JavaScript's fetch method to create a CSV file on the OSF. There are three parameters that are sent in the body of the request. The experimentID identifies which DataPipe experiment the data are connected to, and is provided by DataPipe when an experiment is created. The filename must be a unique identifier. The data are sent as a text string. In this example, it is assumed that the dataAsString variable contains data that is already formatted as a CSV — for example, the output from jsPsych.data.get().csv().
The DataPipe API can be used by any experiment software that can make HTTP requests, which includes all online experiments and most offline experiments. API calls can be made by Python code (PsychoPy; OpenSesame), MATLAB code (Psychtoolbox), and most other languages. While DataPipe can be used by offline experiments, it is most useful for online experiments. When experiments are being conducted offline it is safe to use the OSF API directly because the OSF access token will not be exposed². However, some researchers may choose to use DataPipe anyway because it simplifies the configuration process.

² JavaScript code, used in all online experiments, can always be viewed, so it is not too difficult for someone to find the access token in the code. With offline experiments, the code is not easily viewable by the research participant and there is little risk to using the OSF API directly.


The DataPipe API is hosted using a set of Google Cloud Functions. When an experiment makes a request to one of the DataPipe API endpoints, the function is run on Google Cloud servers. The computing resources allocated to these functions automatically scale in response to increased load, so even in the case of high demand for the DataPipe API there should be no performance problems. The API can easily handle hundreds of concurrent requests, and given that the API functions take only a few seconds to run, it is unlikely that usage will ever go much beyond this. If it does, additional resources can be provisioned to cover the usage.

Experiment configuration database The experiment configuration database sits in between the web application and the API. It stores information for each experiment on DataPipe, including which OSF component each experiment connects to, which features are enabled for an experiment, and validation rules. When an API request is made, the cloud function accesses the database to ensure the request is allowed. The database does not store any research data.

The database itself, as well as the rest of the backend features of DataPipe, is implemented using Google's Firebase application framework. Firebase includes features for hosting, account management, and document-based database storage. The web application and all associated user account data are hosted on Google Cloud, so DataPipe gets the benefits of Google's infrastructure for managing and securing a web application.

Front-end web application The website at https://pipe.jspsych.org is built using NextJS, Chakra UI, and React. The website allows users to create and configure experiments, which results in changes to the experiment configuration database. In theory, these changes to the database could be made without interacting with the website. A future version of DataPipe might therefore support programmatically setting up experiments, e.g., through a command line interface or through experiment creation software.

jsPsych plugin

To help researchers who are using jsPsych (de Leeuw, 2015; de Leeuw et al., 2023), there is a jsPsych plugin, jsPsych-pipe, that implements all of the standard API calls to DataPipe. The plugin is available in the jsPsych community contributions repository (https://github.com/jspsych/jspsych-contrib/tree/main/packages/plugin-pipe). Using this plugin requires adding just a few lines of code to a jsPsych experiment, specifying the DataPipe experiment ID, the filename, and the data to be saved. The plugin can be inserted onto a timeline to run at a specific point. This is a convenient way to save all of the data at the end of an experiment, as sketched below.
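As an illustration, the sketch below adds a save trial to the end of a timeline so that all accumulated data are written as a CSV when the experiment finishes. It assumes an initialized jsPsych 7 experiment with a timeline array; the parameter names (action, experiment_id, filename, data_string) follow the plugin-pipe package at the time of writing and should be checked against the repository linked above, and the experiment ID is a placeholder.

// Generate an ID used to build a unique filename for this participant.
const subject_id = jsPsych.randomization.randomID(10);

// A trial that sends all data collected so far to the OSF via DataPipe.
// Place it at the end of the timeline so it runs after the last experimental trial.
const save_data = {
  type: jsPsychPipe,
  action: "save",
  experiment_id: "MY_EXPERIMENT_ID",
  filename: `${subject_id}.csv`,
  data_string: () => jsPsych.data.get().csv(),
};
timeline.push(save_data);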
There are also static methods in the plugin that can be called separately from being added to an experiment timeline. One example use case for the static methods is to save audio or video recordings. If an experiment records a lot of media from a user, it may require a lot of browser memory to store all of the data until the end of the experiment. Transmitting all of the data at once will also cause a longer delay. A good alternative is to send a request to DataPipe at the end of a trial that records audio data, as sketched below. The static method of the pipe plugin can be called in the audio recording trial's on_finish event. Once the file is on the server, the copy that is stored in the participant's browser can be deleted.
Risks of using DataPipe

There are some unavoidable risks of using a service that can automatically write data that is gathered remotely over the Internet. DataPipe is designed to mitigate these risks, but it is not risk free. These risks include:

1. The researcher must provide DataPipe with an OSF token that grants write access to their OSF account. In the event that DataPipe's database is breached, these tokens could become public and allow users to make direct writes (through the OSF API) to a user's account. Access to such a token would not cause a researcher to lose access to their OSF account like a leaked password might, but it would still be a serious security vulnerability.

   This risk can be mitigated by a researcher disabling any unused OSF tokens through the OSF token settings. A token can be revoked by the researcher through the OSF at any point. DataPipe's documentation recommends that users disable tokens as soon as data collection is complete to reduce the risk of malicious use.

2. DataPipe creates an open connection between the Internet and an OSF component. A technically savvy user could write fake data to the OSF component by making additional requests to the DataPipe service or modifying the requests that the experiment generates.


   This type of risk is nearly always present with online experiments because the data are usually measured and recorded on the participant's computer before being sent to the server. A malicious user could change the data in between when it is measured and when it is sent to the server, or could create fake data files and send them to the server.

   The typical ways to mitigate this risk are to validate incoming data files and to ensure that any fake data cannot interfere with legitimate data. DataPipe does both of these things. First, DataPipe will never overwrite existing files on the OSF, so a malicious user cannot delete existing data. If a user sends a request with a filename that already exists, DataPipe will refuse the request. Second, the researcher can specify a set of custom validation criteria that DataPipe will use to check incoming data files.

3. Experiments using DataPipe are dependent on DataPipe being available and functional. If DataPipe is down or otherwise not working, an experiment cannot collect data and researchers will be dependent on the DataPipe team to fix the problem. Furthermore, DataPipe is not a commercial venture with a dedicated user support team, so there is some risk that support may be slow.

   Unfortunately, individual researchers have little control over this risk. However, researchers should still feel confident using DataPipe. First, DataPipe is hosted on Google Cloud, so it benefits from the robustness of Google's infrastructure to keep the service secure and running. Second, DataPipe is entirely open source, so researchers can inspect the codebase to see exactly how DataPipe works. Third, DataPipe has a robust set of continuous integration tests, including a separate testing/staging environment for running a suite of automated tests on the API. This should reduce the likelihood of bugs making it into the production service.

   Finally, because DataPipe is open source, a researcher could also choose to operate their own instance of DataPipe or operate a more limited set of features based on the DataPipe source code. For example, a researcher could use just the code that runs the API to launch their own lab-specific copy. This would eliminate the risk of being dependent on the DataPipe team and would also avoid the risk of giving an OSF token to DataPipe.

Privacy

One kind of privacy concern that a researcher might have about using DataPipe is whether it is possible to collect data without making the data immediately public on the Internet. Some researchers may want to use DataPipe to collect data without making their data born-open. Others may want to make the data born-open but only after a short delay to ensure no personally identifying information is included in the data. For example, this could be a concern in an experiment with open-ended questions that may lead some participants to reveal information that should not be included in the open data set.

DataPipe can be used to collect private data. By default, the OSF component that DataPipe creates to store data in a project is private, so only members of the OSF project can access it. Researchers can choose to make this component public for truly born-open data, or can keep data private until they are ready to release it. One of the advantages of DataPipe, in terms of encouraging researchers to adopt born-open practices, is that even with private data collection the barrier to going public is only changing the visibility of the repository.

A second kind of privacy concern is whether DataPipe itself stores or accesses the data during the transfer to the OSF. This is an especially relevant concern for researchers who are cautious about or explicitly forbidden from using US-based servers due to differences in data privacy regulations in the US and, for example, the EU.

DataPipe does not store any of the data it is sent. When the API is invoked, the data are processed by the Google Cloud function (which could include validating the content of the data or converting the format) and then sent to the OSF API. Once this function is complete, there is no stored record of the data. Researchers who need or want to host their data files on a server located outside the US can set the server location of their OSF project to a non-US locale.

DataPipe does keep some basic logging information about the requests generated. Specifically, it tracks the number and type of requests associated with each DataPipe experiment. Additionally, Google Cloud functions record a temporary log of each request made. This log includes the URL of the site that the request originated from (the experiment URL, in most use cases), the time of the request, the user agent string, and the IP address that the request originated from. These logs are automatically deleted after 30 days.


Conclusions

Born-open data collection makes it easier for researchers to publish all of their data because it makes the process automatic (Rouder, 2016). The goal of DataPipe is to make born-open data collection an easy default option for researchers. Internet-based data collection is one of the main ways that behavioral science is done, and it is a method where data management is challenging. DataPipe makes online data collection easy by providing a simple set of API endpoints that can take data and write it directly to an OSF component. This removes the need for a researcher to set up or provision their own backend server, which is one of the most complicated parts of setting up an online experiment. As an added benefit, servers—and services that provide servers—are one of the main costs of hosting an online experiment, so DataPipe also makes online experiments much cheaper. By reducing the cost and complexity of running an online experiment while simultaneously encouraging users to adopt born-open data collection, DataPipe may be able to shift the incentives enough for more researchers to adopt this important practice.

Declarations

Conflict of interest The author declares that he has no conflicts of interest or competing interests.

References

Alsheikh-Ali, A. A., Qureshi, W., Al-Mallah, M. H., & Ioannidis, J. P. (2011). Public availability of published research data in high-impact journals. PLoS ONE, 6(9), e24357.

Cousineau, D. (2021). Born-open data for jsPsych. https://doi.org/10.31234/osf.io/rkhng

de Leeuw, J. R. (2015). jsPsych: A JavaScript library for creating behavioral experiments in a web browser. Behavior Research Methods, 47, 1–12.

de Leeuw, J. R., Gilbert, R. A., & Luchterhandt, B. (2023). jsPsych: Enabling an open-source collaborative ecosystem of behavioral experiments. Journal of Open Source Software, 8(85), 5351. https://joss.theoj.org/papers/10.21105/joss.05351

Gureckis, T. M., Martin, J., McDonnell, J., Rich, A. S., Markant, D., Coenen, A., … Chan, P. (2016). psiTurk: An open-source framework for conducting replicable behavioral experiments online. Behavior Research Methods, 48, 829–842.

Hardwicke, T. E., Thibault, R. T., Kosie, J. E., Wallach, J. D., Kidwell, M. C., & Ioannidis, J. P. (2022). Estimating the prevalence of transparency and reproducibility-related research practices in psychology (2014–2017). Perspectives on Psychological Science, 17(1), 239–251.

Hartshorne, J. K., de Leeuw, J. R., Goodman, N. D., Jennings, M., & O'Donnell, T. J. (2019). A thousand studies for the price of one: Accelerating psychological science with Pushkin. Behavior Research Methods, 51, 1782–1803.

Hussey, I. (2023). Data is not available upon request. https://doi.org/10.31234/osf.io/jbu9r

Kidwell, M. C., Lazarević, L. B., Baranski, E., Hardwicke, T. E., Piechowski, S., Falkenberg, L. S., … Nosek, B. A. (2016). Badges to acknowledge open practices: A simple, low-cost, effective method for increasing transparency. PLoS Biology, 14(5), e1002456.

Klein, O., Hardwicke, T. E., Aust, F., Breuer, J., Danielsson, H., Hofelich Mohr, A., … Frank, M. C. (2018). A practical guide for transparency in psychological science. Collabra: Psychology, 4(1), 20. https://doi.org/10.1525/collabra.158

Lange, K., Kühn, S., & Filevich, E. (2015). Just Another Tool for Online Studies (JATOS): An easy solution for setup and management of web servers supporting online studies. PLoS ONE, 10(6), e0130834.

Morey, R. D., Chambers, C. D., Etchells, P. J., Harris, C. R., Hoekstra, R., Lakens, D., … Zwaan, R. A. (2016). The peer reviewers' openness initiative: Incentivizing open research practices through peer review. Royal Society Open Science, 3(1), 150547. https://doi.org/10.1098/rsos.150547

Murray-Rust, P. (2008). Open data in science. Nature Precedings. https://doi.org/10.1038/npre.2008.1526.1

National Institutes of Health. (2020). Final NIH policy for data management and sharing (NIH Identifier NOT-OD-21-013). U.S. Department of Health and Human Services, National Institutes of Health. Available from https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html

Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., … Yarkoni, T. (2015). Promoting an open research culture. Science, 348(6242), 1422–1425. https://doi.org/10.1126/science.aab2374

Obels, P., Lakens, D., Coles, N. A., Gottfried, J., & Green, S. A. (2020). Analysis of open data and computational reproducibility in registered reports in psychology. Advances in Methods and Practices in Psychological Science, 3(2), 229–237.

Rouder, J. N. (2016). The what, why, and how of born-open data. Behavior Research Methods, 48, 1062–1069.

Van der Eynden, V., Corti, L., Woollard, M., Bishop, L., & Horton, L. (2011). Managing and sharing data: Best practices for researchers. University of Essex.

Vanpaemel, W., Vermorgen, M., Deriemaecker, L., & Storms, G. (2015). Are we wasting a good crisis? The availability of psychological research data after the storm. Collabra, 1(1). https://doi.org/10.1525/collabra.13

Zorowitz, S., & Bennett, D. (2022). NivTurk (v1.2-prolific). Zenodo. https://doi.org/10.5281/zenodo.6609218

Zuiderwijk, A., Shinde, R., & Jeng, W. (2020). What drives and inhibits researchers to share and use open research data? A systematic literature review to analyze factors influencing open research data adoption. PLoS ONE, 15(9), e0239283. https://doi.org/10.1371/journal.pone.0239283

Open practices statement DataPipe is open source. The code is available at https://github.com/jspsych/datapipe

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
