We are offering up to 10 awards of $200 to research teams that use data from our first study (PSA001) to follow an analysis pipeline meeting our specifications. We describe the background, rationale, and details below.
Psychology datasets contain a wealth of information, including dozens, hundreds, and sometimes even thousands of variables. Datasets that are well-documented can be even richer, as appropriate documentation can allow these datasets to be merged with secondary information (or meta-data), greatly expanding the universe of possible analyses.
Although some researchers use publicly posted data in their research, we believe the potential of secondary analyses is, as yet, untapped. Some of this untapped potential may result from the typical structure of a psychology dataset release. In the best case, the dataset is described in an article in a journal (such as the Journal of Open Psychology Data or Scientific Data). In the worst, the dataset is undocumented and only available on request (if at all). We believe we can do better to make our datasets maximally informative.
Phased dataset release (with incentives)
Our test case for improving the data release process is PSA001, a project to test whether the valence-dominance model of face perception generalizes across world regions. The primary dataset contains ratings from over 11,000 participants across 11 world regions, 48 countries, and 28 languages. Each participant rated 120 faces twice on one of 13 traits. In addition to these ratings, we have access to datasets containing various meta-data. These include datasets of participant characteristics (such as race and gender, available for some locations only), site characteristics (such as world region and institutional affiliation), and characteristics of the faces that were rated (such as the gender of the face, picture luminance, and the size of various facial features).
In our release of this dataset, we are following the lead of other high-quality data releases by carefully curating and documenting our datasets. However, we are adding an extra innovation: we are structuring the release in a way that we think will maximize the value of the resulting secondary analyses. Specifically, we are releasing separate exploratory and confirmatory segments of the data and incentivizing the use of these separate segments by offering up to 10 awards of $200 to research teams who complete the analysis pipeline of exploring with the exploratory segment, confirming with the confirmatory segment, and sharing the results on PsyArXiv.
The data release plan for this project consists of three phases: release of a simulated dataset (to allow people not directly involved in the project time to understand the variables we collected), release of an exploratory segment (⅓ of the full dataset), and release of a confirmatory segment (the full dataset). We will stratify by lab when creating our exploratory and confirmatory segments; in other words, we will randomly sample ⅓ of the participants within each lab that contributed data to create the exploratory segment. The full dataset will demarcate the exploratory and confirmatory segments. All data drops will occur at randomly selected UTC times between 12am and 11pm.
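The stratified split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the script the PSA001 team used: the `(participant_id, lab_id)` layout, the function name `flag_exploratory`, the fixed seed, and the rule of rounding each lab's share to the nearest whole participant are all hypothetical.

```python
import random
from collections import defaultdict


def flag_exploratory(participants, fraction=1 / 3, seed=0):
    """Randomly flag a fraction of participants within each lab as exploratory.

    participants: iterable of (participant_id, lab_id) pairs (hypothetical layout).
    Returns a dict mapping participant_id -> True if the participant belongs
    to the exploratory segment, False otherwise.
    """
    rng = random.Random(seed)  # fixed seed so the split can be reproduced

    # Group participant IDs by the lab that collected them.
    by_lab = defaultdict(list)
    for pid, lab in participants:
        by_lab[lab].append(pid)

    flags = {}
    for lab in sorted(by_lab):
        ids = sorted(by_lab[lab])
        # Assumption: round each lab's share to the nearest whole participant.
        n_explore = round(len(ids) * fraction)
        chosen = set(rng.sample(ids, n_explore))
        for pid in ids:
            flags[pid] = pid in chosen
    return flags


# Toy data: 3 labs with 30 participants each.
participants = [(f"p{lab}_{i}", f"lab{lab}") for lab in range(3) for i in range(30)]
flags = flag_exploratory(participants)
```

Sampling within each lab, rather than across the pooled sample, keeps every lab represented in the exploratory segment in proportion to its size. The returned flag plays the role of the marker that separates the two segments in the full dataset.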
We will provide up to 10 awards of $200 each for research teams that make secondary contributions from the exploratory and confirmatory datasets. If more than 10 teams submit contributions, the winners will be chosen at random. To be eligible, a research team must:
- Write a computationally reproducible script that analyzes the exploratory dataset. The script may be written in any data analysis software, but we strongly encourage the use of open-source software such as R.
- Post the script to a project on the Open Science Framework and create a date-stamped preregistration of the script using OSF preregistrations. The proposing teams can use a preregistration template, such as this one for secondary data analysis, or they can use an open-ended preregistration that only contains the script that the team will use to analyze the confirmatory segment and a date stamp. At the top of the script, the proposing team should write their names and the following text: “I commit to analyzing the confirmatory segment of PSA001 Social Faces using this script upon the project’s release”. The date stamp of the preregistration must be before 12pm UTC, November 30, 2019, which is the point at which the confirmatory segment will be released. The script will be checked for computational reproducibility by a member of the PSA’s Data and Methods Committee.
- After the release of the confirmatory segment, post a preprint to PsyArXiv detailing the results of the analyses of the exploratory and confirmatory segments. To be eligible for the award, the preprint must be date-stamped by 12pm UTC, January 31, 2020. For the purposes of winning the award, the preprint may be very brief: tables or figures illustrating the results along with some descriptive text are sufficient. However, if the research team wishes, the preprint may be more detailed. The PsyArXiv preprint should be tagged with the study code for this project: “PSA001”.
Before issuing the awards, members of the Data and Methods Committee will verify that these steps have been followed.
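The eligibility rules and the random draw can be summarized as a short Python sketch. It is an illustration only, not the committee's actual procedure: the team records, their field names (`script_reproducible`, `prereg_stamp`, `preprint_stamp`, `tagged_psa001`), and the function names are hypothetical, and the deadlines are the ones stated in the list above.

```python
import random
from datetime import datetime, timezone

# Deadlines as stated in the eligibility list (12pm UTC).
PREREG_DEADLINE = datetime(2019, 11, 30, 12, 0, tzinfo=timezone.utc)
PREPRINT_DEADLINE = datetime(2020, 1, 31, 12, 0, tzinfo=timezone.utc)
MAX_AWARDS = 10


def is_eligible(team):
    """Check one team record (a dict with hypothetical field names)."""
    return (
        team["script_reproducible"]
        and team["prereg_stamp"] < PREREG_DEADLINE  # must be before the deadline
        and team["preprint_stamp"] <= PREPRINT_DEADLINE  # must be stamped by the deadline
        and team["tagged_psa001"]
    )


def draw_winners(teams, seed=None):
    """Return every eligible team, or a random 10 if more than 10 qualify."""
    eligible = [t["name"] for t in teams if is_eligible(t)]
    if len(eligible) <= MAX_AWARDS:
        return eligible
    return random.Random(seed).sample(eligible, MAX_AWARDS)


# Toy data: 12 on-time teams plus one whose preregistration was late.
on_time = datetime(2019, 11, 29, 9, 0, tzinfo=timezone.utc)
late = datetime(2019, 11, 30, 13, 0, tzinfo=timezone.utc)
preprint = datetime(2020, 1, 15, 9, 0, tzinfo=timezone.utc)
teams = [
    {"name": f"team{i}", "script_reproducible": True, "prereg_stamp": on_time,
     "preprint_stamp": preprint, "tagged_psa001": True}
    for i in range(12)
]
teams.append({"name": "late_team", "script_reproducible": True, "prereg_stamp": late,
              "preprint_stamp": preprint, "tagged_psa001": True})
winners = draw_winners(teams, seed=1)
```

Timestamps are kept timezone-aware in UTC so that comparisons against the deadlines do not depend on where the check is run.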
Below are the key dates of this data release plan:
- The simulated dataset, along with a codebook, will be released (posted on OSF, tweeted, Facebooked, and blogged) on August 31, 2019, 24:00 UTC, 8pm EST (so it’s available today!!). It is available, along with detailed documentation of the dataset, at this OSF page.
- The exploratory segment can be found here, and was posted on October 31, 2019, concurrent with the submission of this project’s Stage 2 Registered Report.
- The preregistered analyses should be submitted via this form by November 30, 2019, 12pm UTC.
- The confirmatory segment will be released concurrently with the publication of the Stage 2 paper at Nature Human Behaviour.
- The preprint should be posted within one month of the release of the confirmatory segment. Once posted, the preprint can be submitted to the PSA001 team via this form.
If you have questions about this process, or the data that we have available, contact the PSA001 data manager (Patrick S. Forscher) at firstname.lastname@example.org.
We hope this project can serve as an exemplar of how the details of data release can add value to the scientific knowledge generated from a particular dataset. We hope you consider participating in our Secondary Analysis Challenge so we can see if this is indeed the case.