Scientific researchers are instructed from the very beginning of their training about the importance of citing previously published literature and carefully documenting methods. We are taught that this corroborating information forms a solid foundation on which to rest our claims and conclusions. Citing data sets in the same manner, however, is another story. Although researchers have enthusiastically embraced digital data archives that can be shared and updated easily, they do not always cite data sets from these resources in their publications in ways that facilitate verification and replication of their results—or the assembly of metrics to gauge how the data sets are used.
Benefits of data set citation include improved reproducibility (particularly when the exact version of the data used is indicated) and credibility of research, as well as clarification about the provenance and use of—and the proper credit for—data. Readers of scientific literature, including researchers, funding agencies, and promotion committees, rely on data set citations for information about data set usage. These metrics are extremely important in assessing the impact of a given body of work and of the facilities that publish and deliver data to the scientific community. Improper, incorrect, and incomplete data set citation hinders such assessments.
Publishers and data repositories have made significant progress over the past decade in increasing awareness of data set citation, and science policy bodies such as AGU have recommended the practice. Despite extensive awareness efforts by such groups, however, we observe a shortage of clear references to cited data in published scientific literature. We need a renewed approach to promote, educate about, and enforce citation of data sets in manuscripts—and to improve the specificity with which they are described.
Collecting Data Set Metrics
Data centers like NASA’s Earth Observing System Data and Information System Distributed Active Archive Centers (EOSDIS DAACs), which archive and distribute large amounts of environmental data, have been advocating the adoption of data citation for many years. EOSDIS DAACs use data set citation metrics, such as the number of times a data set has been cited each year and data sets that are commonly used together, to assess the use and impact of data sets they supply, consistent with the purposes of indices that track author citations [Cook et al., 2016].
The Goddard Earth Sciences Data and Information Services Center (GES DISC), the National Snow and Ice Data Center DAAC (NSIDC DAAC), the Oak Ridge National Laboratory DAAC (ORNL DAAC), the Physical Oceanography DAAC (PO.DAAC), and the Socioeconomic Data and Applications Center (SEDAC) have developed processes for routinely collecting data set citation metrics by searching bibliographic databases and manuscripts.
Often, librarians at a data center’s host institution collect data set citation metrics. Processes for this collection vary across data centers, but they all involve considerable and time-consuming manual effort, including searching for keywords and assembling citation records. Today a computer-extracted citation approach (using Scopus or DataCite) typically yields less than 60% of the record matches listed in a manual, librarian-assembled benchmark citation database. Automated approaches are improving, but their coverage still falls well short of what manual searches achieve.
The lack of specificity in descriptions of data sets mentioned in published literature, the absence of data set references like digital object identifiers (DOIs), and the unavailability of robust open-source application programming interfaces (APIs) for scanning journal articles limit the adoption of automated approaches for collecting data set citation metrics. Scholarly literature search engines, such as Google Scholar, and the adoption of data set metadata standards for search engine optimization (such as the one offered by schema.org) have reduced the burden on data centers of collecting citation metrics for data archives. Machine learning techniques, open journal searches, wide adoption of data set DOI assignments, and improved DOI and data set specification in manuscripts will expedite the automated approach to data set citation metric collection.
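One building block for such automation already exists: DataCite's public REST API reports a citation count for each registered data set DOI. The sketch below assumes the current shape of that API's JSON response (a `data.attributes` object); the sample record and its DOI are illustrative placeholders, not real metrics.

```python
import json
import urllib.request

DATACITE_API = "https://api.datacite.org/dois/"


def citation_count(record: dict) -> int:
    """Read the citationCount attribute from a DataCite DOI record,
    defaulting to 0 when the attribute is absent."""
    return record.get("data", {}).get("attributes", {}).get("citationCount", 0)


def fetch_record(doi: str) -> dict:
    """Fetch one DOI record from the DataCite REST API (network required)."""
    with urllib.request.urlopen(DATACITE_API + doi) as resp:
        return json.load(resp)


# Offline illustration with a trimmed-down record of the same shape;
# the DOI and count are made up for the example.
sample = {"data": {"attributes": {"doi": "10.5067/EXAMPLE", "citationCount": 12}}}
print(citation_count(sample))  # -> 12
```

A production harvester would page through all of a center's DOIs and also pull the linking events behind each count, but even this minimal lookup only works when authors cite the data set's DOI rather than mentioning it informally.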
Discovering Trends and Connections
The importance of data set citations has been widely reported in scholarly publications [Piwowar et al., 2007; Baggerly, 2010]. In 2015, AGU released a statement affirming this importance: “Connecting scholarly publication more firmly with data facilities thus has many advantages for science in the 21st century and is essential in meeting the aspirations of open, available, and useful data” [Hanson et al., 2015].
Data set citation metrics offer several advantages for data centers and producers as well as for data users and sponsors of data collection. By manually searching scholarly records and quality checking results, DAACs have assembled a time series database of data citation metrics, extending from 1997 to the present. The information collected includes DOIs and journal names of citing articles, as well as data set DOIs, the collection in which a data set is stored, and whether the data set was formally cited or simply acknowledged. Collecting citation information allows data centers not only to report metrics per data set but also to derive those metrics for project, data source, discipline, and other parameters.
Sponsors and members of research projects occasionally request citation metrics from data centers to gauge the impact of their research investments. Data centers also use citation information to understand linkages between various research domains. SEDAC, for example, has used data citation metrics to understand the interdisciplinary use of socioeconomic data with remotely sensed geospatial data in published studies. SEDAC uses this information to understand the impact of their published data sets in different application areas and sectors of society [Downs et al., 2017]. Citation metrics can also improve discovery of data sets by augmenting search indexes to highlight data sets related by citation.
When references to all citations for a particular data set are available, end users can readily see common data-processing methodologies for those data, which may expedite and inform their own research. Citation records provide data centers with insights into patterns of usage and applications of data sets. This information is valuable in enabling data centers to provide improved services, such as data set–specific download services and increased relevancy of data set search results that best meet the needs of the user community.
The challenges of collecting citation metrics notwithstanding, the advantages and importance of data citations are clear and well publicized. Unfortunately, data citation metrics are still not having the desired impacts.
As evidenced from data citation metrics starting from 1997, data sets are too often listed within the acknowledgments of published papers instead of as precise citations within a bibliography. When we analyzed the full catalog of publications from 1997 to 2019 included in the DAACs’ citation metrics database, we observed that for any given year, 25%–80% of data sets used within published scientific studies are listed as citations. Although the range is not uniform across DAACs and years, the target is to improve the lower end of the citation range and come closer to our goal of 100% citation.
Practical limitations, such as word count, can prohibit detailed descriptions of data sets in manuscripts. However, we observe a lack of specificity about, and references to, data sets even when DOIs and robust citations are available. Mentions of data sets in manuscripts commonly lack information about the version of the data set used, the date it was accessed, the access end point (the location or URL from which the end user downloaded the data set), and spatiotemporal details (ranges in time and space to which the data apply). Such vagueness in data set mentions impedes provenance tracking and undermines the ability to reproduce results, which is a foundational principle of science.
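The elements listed above lend themselves to a simple completeness check that a data center or journal could run over structured data references before publication. The field names in this sketch are hypothetical, chosen only to mirror the four elements named in the text; they are not part of any existing standard.

```python
# Elements a data set mention needs for provenance tracking, per the text:
# a resolvable identifier, the version used, the access date, and the
# access end point. Field names here are illustrative, not a standard.
REQUIRED_ELEMENTS = ("doi", "version", "access_date", "endpoint")


def missing_elements(reference: dict) -> list:
    """Return the required citation elements absent from a data reference."""
    return [field for field in REQUIRED_ELEMENTS if not reference.get(field)]


# A vague mention (placeholder DOI) that omits the access date and end point:
vague = {"doi": "10.5067/EXAMPLE", "version": "6"}
print(missing_elements(vague))  # -> ['access_date', 'endpoint']
```

Flagging incomplete references at submission time would catch the “data were obtained from NSIDC”-style mentions described below before they reach print.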
In our analysis, we observed only gradual adoption of data set DOIs and found that less than 40% of the data citations provided the needed specificity about the data set used. In many cases, references to data sets were provided as generic references to a satellite, sensor, or project instead of to the location of the data. Statements such as “temperature data were downloaded from PO.DAAC” and “MODIS–Terra data were obtained from NSIDC,” which do not provide sufficient details to pinpoint a data set, are distressingly common throughout the citation metrics database. Referencing data as being from the Terra satellite or the Moderate Resolution Imaging Spectroradiometer (MODIS) sensor, or from the Salinity Processes in the Upper Ocean Regional Study (SPURS) project, makes it very hard in most cases to trace the research methodology and analysis back to the source data sets. In this example, Terra, MODIS, and SPURS each have tens, if not hundreds, of data sets associated with them.
Without more specificity, librarians collecting citation metrics may have to read an entire article to determine the exact data set used, and even then, some ambiguity may remain. Also, librarians or other personnel may be able to associate data sets with an article only because of their specialized knowledge about the data sets, meaning a reader of the article will not be able to identify with specificity or determine the provenance of the data set used in the article. The issue of association increases in severity for older publications.
Critical Fixes to Ensure Sustainable Data Citation
Two factors have brought us to an important crossroads with respect to data citation. First, data sets aren’t being cited often enough and with enough specificity. Second, manual approaches for scanning, quality checking, and rigorously associating articles to data sets are becoming unsustainable. In light of the observed patterns described above, we need to act now to improve the citation of data in published literature.
We must maintain linkages between studies and the data sets upon which they are based to clarify research provenance and ensure that work is credible and reproducible. Without these linkages, many data sets will be undervalued and underused, and recognition of their contributions to science will be limited.
Clearly, the research community must acknowledge these issues and work to find immediate fixes. For starters, enforcement is needed for policies already in place, and new policies should be implemented. Only about half of journals provide a style manual for data citation. Journal publishers should expand the use of data citation to make it common practice and push more strongly for data citations to be comprehensive rather than concise [Mooney and Newton, 2012]. Journal publishers, editors, and reviewers should insist on this specificity and periodically audit their publications for data citation completeness. Such audits will help gradually shift current cultural norms. Enforcement can also come from scientists themselves as they promote expanded use of data set citations and call out instances of inadequate data attribution.
In addition, data centers must enhance their outreach and adopt additional tools and services to make data set citations prominent. For example, citation transformation service software that exports standardized citations from data sets to publishing tools like EndNote or LaTeX could simplify data citation. Recommended citation text should be made available alongside data sets at the point of data delivery, much in the same way that many journals provide “cite this article” links.
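A minimal version of such a transformation service could emit a BibTeX `@misc` entry ready for LaTeX workflows. The sketch below is an assumption about what such a service might generate, not an existing DAAC tool; the field values in the usage example are placeholders, and the `version` field is recognized by biblatex but may be ignored by plain BibTeX styles.

```python
def dataset_bibtex(key, creator, title, version, publisher, year, doi, accessed):
    """Format a data set reference as a BibTeX @misc entry that carries the
    version and access date needed to pin down the exact data used."""
    return "\n".join([
        "@misc{%s," % key,
        "  author    = {%s}," % creator,
        "  title     = {%s}," % title,
        "  version   = {%s}," % version,
        "  publisher = {%s}," % publisher,
        "  year      = {%s}," % year,
        "  doi       = {%s}," % doi,
        "  note      = {Accessed %s}," % accessed,
        "}",
    ])


# Illustrative values only (hypothetical data set and DOI):
entry = dataset_bibtex(
    "example_sst_v41", "Doe, J.", "Example Sea Surface Temperature Data Set",
    "4.1", "PO.DAAC", "2019", "10.5067/EXAMPLE", "2020-06-01",
)
print(entry)
```

Offering such ready-made entries at the point of data delivery removes the main excuse for vague mentions: the author never has to compose the citation by hand.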
Acknowledging data set publication as a key research performance metric in research organizations and academic institutions will help increase awareness, use, and recognition of data sets. Such acknowledgments should, in turn, promote use of data set citations.
Research sponsors typically require regular status checks on the impact of data sets they fund. Increased use of data citations within publications should simplify the creation of metrics to evaluate the impact of data sets, similar to journal and author citation metrics, which could help facilitate status checks.
We are confident that with concerted efforts by the research community to adopt data set citations, we can very quickly observe a positive trend in the proper acknowledgment of data sets and in the specificity with which they are described in scholarly literature. In doing so, scientists will contribute to improving the provenance and reproducibility of research and thus to increasing its credibility and value.
We thank our sponsor, the Earth Science Data and Information System (ESDIS) Project; librarians at our host institutions; and data center personnel who helped assemble and study this information.