Antarctica comprises numerous unique environments, from the high Antarctic plateau to the deep subglacial bed, iceberg-congested coastal waters, and sparse rock outcrops and soils. Overall, the region is a key barometer of global environmental change: Interactions among the ice, ocean, atmosphere, biosphere, and lithosphere have implications for global sea level, ocean-atmosphere circulation, and biodiversity. Research in Antarctica spans efforts to study everything from particle physics at the South Pole to the extremes of life in the frozen ground, and this work is crucial for building scientific knowledge at scales from the cellular to the universal.
Understanding the processes that govern Antarctica’s interacting systems requires that data characterizing often fast-changing environmental conditions be placed into context with longer-term records and that point data from the field be integrated with larger-scale airborne and satellite observations. Given the unique and challenging conditions of the polar regions, physical, chemical, and biological data collected there, including temporal snapshots of environmental states that cannot be reproduced, are typically acquired with substantial logistical effort and financial expense. Preservation of these data is thus a critical need for the present and the future.
A Disjointed Approach, Historically
In the spirit of supporting collective stewardship of Antarctica, signatory countries of the Antarctic Treaty in 1959 agreed to make scientific observations from Antarctica open and freely available to everyone around the world. In 1998, the Scientific Committee on Antarctic Research (SCAR), which represents the international research community, adopted NASA’s Global Change Master Directory (GCMD) to serve as a central catalog of information about Antarctic research data sets. Each country is responsible for hosting its own data resources, but all countries contribute basic metadata describing their collected data to the Antarctic Master Directory (AMD) portals of the GCMD [SCAR, 2011].
In most signatory countries, Antarctic research is conducted by researchers concentrated at dedicated national research centers, such as the British Antarctic Survey, Germany’s Alfred Wegener Institute Helmholtz Centre for Polar and Marine Research, and the Korea Polar Research Institute. These centers also provide national-scale data management, including registration within the AMD. In the United States, however, Antarctic research is conducted by researchers at universities and national laboratories across the country. This work is coordinated through the U.S. Antarctic Program (USAP) with funding from multiple agencies, including the National Science Foundation (NSF).
Historically, individual scientists in the United States were responsible for ensuring that their data were publicly accessible and registered within the AMD. Some Antarctic scientists can archive their data in disciplinary data repositories that serve the broader science community, such as the Incorporated Research Institutions for Seismology (IRIS) for seismology data or GenBank for DNA sequencing, but few appropriate disciplinary repositories exist. This lack of repositories often resulted in data being publicly hosted only on individual scientists’ websites or not at all.
Further, AMD registrations from the U.S. academic community were highly heterogeneous. Often, these registrations were completed as part of final funding reports, before publications and data archiving were complete, and scientists lacked incentives to go back and update records at a later date. These limitations led to significant gaps in the preservation and archiving of Antarctic research data sets produced by the U.S. academic community and to incomplete cataloging of these data sets. The resulting information gaps made it difficult for Antarctic researchers to reliably search for, find, and use these data.
Managing Antarctic Research Data
Antarctic research spanning a wide range of disciplines is supported by a variety of field observations, as well as sample collection, laboratory measurements, remotely sensed observations, and model experiments. Most of the resulting data are diverse, unique products tied to individual researchers or projects. This diversity contrasts with the large-volume, more standardized data collections that form key observational infrastructure for some disciplinary communities, such as seismometer data managed by IRIS. Standards for heterogeneous researcher-based data are minimal, and managing these kinds of data is challenging.
Current best practices for research data stewardship center on a life cycle approach lasting from experiment design through data acquisition, processing, archiving, and publication and extending to data reuse and archiving of derivative products. Adopting this life cycle perspective helps ensure that provenance and integrity of data are maintained, which is essential for supporting the reproducibility and reuse of published research.
Information about the context of a project for which data are acquired is relevant for many aspects of the data life cycle, especially for researcher-based data products, but this information is often missing in data archives. For example, the underlying science goals of a project motivate the types of data that are acquired and how they are processed. Preserving information about the original goals and motivations informs whether data may be suitable for applications other than those for which they were originally intended.
Many Antarctic science projects are conducted as multidisciplinary investigations. Researchers work within shared field camps or on shared research cruises to best leverage the logistically complex planning and high costs of working in Antarctica. The project context of these multidisciplinary efforts links the resulting complementary data sets, describing connections between data types collected and informing understanding of the temporal aspects of the data (e.g., what investigator X measured during snowmobile transect Y) that are relevant for their future reuse.
A Central Home for U.S. Antarctic Research Data
Since its beginnings as a data coordination center in 2007, the USAP Data Center (USAP-DC) has evolved to provide a comprehensive suite of data management services for the NSF-supported U.S. Antarctic research community. It is the only repository supporting the full spectrum of research conducted by NSF’s U.S. Antarctic Program, and it is designed specifically to host researcher-based data products of all sizes and disciplines, as well as to preserve links to other NSF-supported, disciplinary-focused repositories. Hosted data sets at USAP-DC include data from Antarctic studies spanning biological, atmospheric, space, ocean, and solid Earth science research, as well as the collection of glaciology data assembled by the Antarctic Glaciological Data Center, which operated from 1999 to 2016.
USAP-DC assists investigators in life cycle data management (Figure 1) through services that support the following:
- data management planning during NSF proposal creation
- data set submission tools used to gather standardized metadata and detailed information about data acquisition and processing methods
- long-term preservation and data publication with the assignment of digital object identifiers (DOIs) through the DataCite system
- project registration within the AMD and tools to help update information about data sets, field programs, publications, and more
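As a rough illustration of what gathering standardized metadata at submission can involve, the following minimal Python sketch checks a draft record for required fields before a DOI is requested. The field names are hypothetical, loosely modeled on the mandatory properties of the DataCite metadata schema; the checks USAP-DC actually applies are not described here.

```python
# Hypothetical required fields, loosely modeled on DataCite's mandatory
# properties; the real USAP-DC submission form may differ.
REQUIRED_FIELDS = (
    "identifier",
    "creators",
    "title",
    "publisher",
    "publication_year",
    "resource_type",
)


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


# Placeholder record for illustration only, not a real data set.
draft = {
    "title": "Example snow density profiles",
    "creators": ["Investigator X"],
    "publisher": "USAP-DC",
    "publication_year": 2019,
}

# Lists the fields still needed before the record is complete.
print(missing_fields(draft))
```

A check of this kind would let a submission tool prompt investigators for missing information up front rather than after archiving.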
The data center offers Web interfaces to browse, search, and retrieve data, as well as application programming interfaces that enable others to design their own tools to access data of interest. USAP-DC data sets can also be discovered through external registries, including the AMD and DataONE, and through Google data set searches facilitated by Web-accessible metadata and Schema.org markup.
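To make the idea of Web-accessible metadata concrete, here is a minimal Python sketch that builds a Schema.org Dataset description in JSON-LD, the kind of record that data set search engines can read from a landing page. The function name, the choice of properties, and all values are assumptions for illustration; the markup USAP-DC actually publishes may differ.

```python
import json


def dataset_jsonld(name, description, doi, landing_page, keywords):
    """Build a Schema.org Dataset description as a JSON-LD dictionary."""
    return {
        "@context": "https://schema.org/",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "identifier": f"https://doi.org/{doi}",
        "url": landing_page,
        "keywords": list(keywords),
    }


# Placeholder values for illustration only; not a real data set or DOI.
record = dataset_jsonld(
    name="Example ice core chemistry",
    description="Placeholder description of an Antarctic data set.",
    doi="10.0000/example",
    landing_page="https://example.org/dataset/0000",
    keywords=["Antarctica", "ice core"],
)

# A landing page would embed this text inside a
# <script type="application/ld+json"> element.
print(json.dumps(record, indent=2))
```

Because the description is plain JSON, the same record can serve both human-facing landing pages and machine harvesting.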
In response to Antarctic science community and funding agency management needs, USAP-DC recently designed and launched a new project catalog. This registry of USAP projects and research products is designed to further support life cycle data management. Scientists are encouraged to register their projects when their award is initiated, and project pages are updated throughout their duration as data are archived and as publications become available. Data sets may be submitted to the USAP-DC or to external disciplinary repositories; in the latter case, links to externally hosted data sets are provided on the USAP-DC project pages.
Information on data sets and publications associated with research projects is also harvested through automated tools and Web services whenever possible, for example, using the Crossref system. This automation minimizes burdens on investigators by enabling USAP-DC to identify relevant publications long after a funding award ends without scientists needing to provide publication information themselves.
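The harvesting described above might look something like the following Python sketch, which builds a Crossref REST API query for works that cite a given award number and extracts DOIs and titles from a response. It assumes Crossref's `award.number` filter and the standard `message`/`items` response layout; the helper names and the sample response are hypothetical, and how USAP-DC actually queries Crossref is not specified here.

```python
from urllib.parse import urlencode

CROSSREF_WORKS = "https://api.crossref.org/works"


def award_query_url(award_number, rows=20):
    """Build a Crossref works query filtered by a funder award number."""
    params = {"filter": f"award.number:{award_number}", "rows": rows}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def extract_publications(response):
    """Pull the DOI and first title from a Crossref works response dict."""
    items = response.get("message", {}).get("items", [])
    return [
        {"doi": item.get("DOI"), "title": (item.get("title") or [None])[0]}
        for item in items
    ]


# Hand-written placeholder response, not real Crossref output.
sample = {
    "status": "ok",
    "message": {
        "items": [
            {"DOI": "10.0000/example", "title": ["Placeholder title"]},
            {"DOI": "10.0000/untitled"},
        ]
    },
}

print(award_query_url("0000000"))
print(extract_publications(sample))
```

Keeping URL construction separate from response parsing means the parsing step can be exercised offline, and a scheduled job could rerun the query long after an award ends.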
The consolidation of information about USAP research products over the life cycle of a project will make it simpler for individual researchers, especially in the context of large collaborative research groups, to keep track of data sets and related information produced during their projects.
All projects are registered by USAP-DC within the AMD by the end of the project funding periods so that USAP project information is fully integrated with this registry of other Antarctic data from the broader international community (Figure 2). Although data submission and project registration within the USAP-DC are designed for the NSF-funded U.S. academic community, access to the project catalog and data is open to the entire Antarctic research community.
A Growing Resource
In recent years, the data stewardship community—and, increasingly, journal publishers like AGU—has embraced findable, accessible, interoperable, and reusable (FAIR) data principles [Wilkinson et al., 2016]. These principles are intended as guidelines for data repositories to enhance reusability of their data collections, for example, by designing new infrastructure to optimize data usage, thereby facilitating new scientific insights from existing data.
New approaches for supporting reusability of researcher-based data collections like the USAP-DC Antarctic collection are particularly important, given the great diversity of data types, documentation, and file formats and the uniqueness of much of the content. USAP-DC is further enhancing reusability by ensuring that rich descriptive information about project contexts is also preserved. Another focus is to automate data documentation whenever possible to ease burdens of data submission on scientists and to grow the collection, which increases its value for new science applications.
In the future, more automated data harmonization and synthesis using researcher-based data collections will be possible with emerging approaches for interoperability of nonstandardized data and for automated generation of metadata from text documentation [e.g., Wilkinson et al., 2017]. With ongoing contributions from the Antarctic scientific community, the growing resource of multidisciplinary research data hosted at USAP-DC will be available for these new applications and to support new areas of discovery about this critical region in our rapidly changing world.
Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of NSF.