Connecting data archives and physical samples
Last year, Earth and environmental science informaticians met to discuss the challenges of creating and maintaining links between data archives and the physical samples from which the data are extracted. Here a collection of rock samples is stored in hand-labeled cardboard boxes at Geoscience Australia. Credit: Simon J. D. Cox

Physical samples play an important role in Earth and environmental sciences. Data derived from samples are the basis of the interpretations published in the scientific literature. The vast numbers of samples, data sets, and publications collected in the past are a trove of scientific information, but their volume and variety make systematic interpretation challenging.

“Semantic Web” and “linked data” strategies have been proposed to support linking samples, data, interpretations, and reports. But are those the right tools to use at the scale of large specimen collections? There are also conceptual challenges: What does a sample represent? What do data from samples represent? Can a uniform approach to data representation be applied when different disciplines and projects have very different approaches to sampling itself? These approaches range from collecting small numbers of key specimens to large numbers for representative statistics.

A group of Earth and environmental science informaticians tackled these questions at a symposium last summer. More than 70 participants (including 18 from outside Australia and 20 early-career researchers) represented the solid Earth sciences, marine science, oceanography, ecosystems, biodiversity, soil science, and remote sensing.

The week kicked off with a field trip to the rock store at Geoscience Australia and continued on to Australia’s National Botanic Garden, the National Herbarium, and the National Insect Collection to view examples of the problem at hand. The realities of large collections include practical concerns around identifiers and metadata that put the linked data theory to the test.

The formal meeting started with attendees from specific science disciplines explaining their motivations for linking samples and data. Technical sessions then focused on Web linking and identifiers, emerging semantic tools, data delivery services, and data publication. Standard approaches are necessary to be able to easily move from, for example, a figure in a paper to the underlying data in a repository to an unambiguous representation of the sample on which the observations were made.
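The path from a figure to its data to the sample can be pictured as a chain of identifiers that software follows link by link. The sketch below is only an illustration under stated assumptions: the `example.org` identifiers, the `derived_from` field, and the `trace` helper are hypothetical and do not come from the symposium or from any particular standard.

```python
# Hypothetical records keyed by identifier. The example.org URIs and the
# "derived_from" field name are placeholders chosen for this illustration.
RECORDS = {
    "https://example.org/paper/1#fig2": {
        "type": "figure",
        "derived_from": "https://example.org/dataset/42",
    },
    "https://example.org/dataset/42": {
        "type": "dataset",
        "derived_from": "https://example.org/sample/ABC123",
    },
    "https://example.org/sample/ABC123": {
        "type": "sample",
        "derived_from": None,  # the physical sample ends the chain
    },
}


def trace(identifier, records):
    """Follow 'derived_from' links and return the identifiers visited, in order."""
    chain = []
    seen = set()
    current = identifier
    while current is not None:
        if current in seen:
            raise ValueError(f"link cycle detected at {current}")
        if current not in records:
            raise KeyError(f"unresolvable identifier: {current}")
        seen.add(current)
        chain.append(current)
        current = records[current]["derived_from"]
    return chain


chain = trace("https://example.org/paper/1#fig2", RECORDS)
kinds = [RECORDS[identifier]["type"] for identifier in chain]
print(" -> ".join(kinds))
```

The chain only works if every identifier along the way resolves, which is why an unresolvable link raises an error here instead of being skipped silently.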

A conference field trip to a sample archive confronts this informatician with the realities of large sample collections. Credit: Simon J. D. Cox

Attendees concurred that relationships between samples, parts of samples, and sampling artifacts such as drill holes and cores must also be recorded to allow the resulting observations to be related back to the world. They noted that incentives for the adoption of common standards vary between disciplines and between sectors (e.g., researchers versus agencies). Disciplines that rely on shared platforms, such as marine science and oceanography, tend to embrace standardization, but others that would benefit, like long-term ecosystem studies, may be tied to competitive identification cultures (e.g., taxonomy) that challenge standardization.
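One way to picture the relationships between samples, parts of samples, and sampling artifacts is as parent links that let a subsample inherit context from the drill hole it ultimately came from. The following is a minimal sketch, assuming a simple parent-pointer model; the `SampleRecord` class, its fields, the identifiers, and the coordinates are all hypothetical and are not a schema proposed at the meeting.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SampleRecord:
    """Hypothetical record for a sample, a part of a sample, or a sampling artifact."""
    identifier: str
    kind: str  # e.g. "drill hole", "core", "subsample"
    parent: Optional["SampleRecord"] = None
    location: Optional[Tuple[float, float]] = None  # (latitude, longitude)


def lineage(record):
    """Return the record followed by each of its ancestors, nearest first."""
    chain = []
    while record is not None:
        chain.append(record)
        record = record.parent
    return chain


def resolve_location(record):
    """Return the location of the nearest record in the lineage that has one."""
    for ancestor in lineage(record):
        if ancestor.location is not None:
            return ancestor.location
    return None


# Made-up example: a subsample cut from a core taken from a drill hole.
hole = SampleRecord("hole-1", "drill hole", location=(-35.3, 149.1))
core = SampleRecord("core-1", "core", parent=hole)
chip = SampleRecord("chip-1", "subsample", parent=core)

print([r.kind for r in lineage(chip)])
print(resolve_location(chip))
```

In this sketch only the drill hole carries coordinates, so an observation made on the subsample can be related back to the world only because the parent links were recorded.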

Formal presentations took less than half of the time at the symposium. The remainder of the time was spent on “unconference” sessions. Some of these tackled topics in response to the presentations earlier that day.

For example, one session explored whether the concept of “sample” is actually shared across the different communities. Participants devised a cross-disciplinary definition that encompasses material samples or specimens, along with sampling stations and statistical samples from populations. This definition can be used as an anchor for data, concepts, and interpretations via hyperlinks embedded in reports, data sets, and publications.

An “anticonference” session allowed participants to describe project failures, providing complementary insights to the usual boasts about successes. The meeting closed with a session in the Up-Goer Five Challenge format, consisting of four science presentations using only the 1,000 most used words in English. This feat requires considerable creativity when the topic is ontologies, diamond exploration, or remote sensing of polar ice.

The meeting’s theme of “linking” was realized by bringing multiple disciplines together and improving our understanding of how the theory of linked data can work in practice for environmental data and samples.

—Simon Cox (@dr_shorthair), CSIRO Land and Water, Clayton, Vic, Australia; Jens Klump (@snet_jklump), CSIRO Minerals, Kensington, WA, Australia; and Kerstin Lehnert, Lamont-Doherty Earth Observatory, Palisades, N.Y.


Cox, S., Klump, J., and Lehnert, K. (2018), Connecting scientific data and real-world samples, Eos, 99, Published on 16 January 2018.

Text © 2018. The authors. CC BY 3.0
Except where otherwise noted, images are subject to copyright. Any reuse without express permission from the copyright owner is prohibited.