Vector illustration of people examining documents
Credit: Bro Vector

Database Updates


In the waning days of August 2017, Hurricane Harvey dumped more than 30 trillion gallons of water on Texas’s Gulf Coast. At least 68 people died. Hundreds of thousands of structures were flooded, and tens of thousands of people had to leave their homes. All told, the storm inflicted $125 billion in damages.

In the wake of the disaster, state and federal officials were forced to reckon with a familiar problem: access to information. “They always struggle with the data aspects of the problems that they’re confronted with,” said hydrogeologist Suzanne Pierce of the Texas Advanced Computing Center. During the storm, modelers would have been able to produce accurate predictions more easily if they’d had more information at their fingertips, Pierce said. Afterward, some rural communities struggled to apply for recovery funding because they couldn’t easily provide a full picture of how Harvey had affected their areas.

Harvey brought Texas’s data infrastructure problems into sharp focus and spurred the creation of the Texas Disaster Information System (TDIS), with Pierce as the director. The group is collecting data and models from a wide range of sources—including the National Weather Service, insurance providers, and the U.S. Army Corps of Engineers—so that these resources are readily available to disaster managers. Their first product, a centralized location for flood risk data and models, launched in September 2022 in collaboration with the Texas Water Development Board.

Texas’s data management struggles are emblematic of a problem throughout the geosciences: Researchers are buried by data served up by myriad tools, from seismometers to satellites to social media. Put together, this information could reveal untold truths about our planet and improve the lives of the people living on it. But too often those data languish in personal computers or filing cabinets. Without appropriate databases and smooth submission and search processes, scientists struggle to share and access information.

Fortunately, a common set of best practices, clever computing tricks, and dedicated data experts are helping scientists overcome these barriers and make sure newly produced data enter the public realm. Meanwhile, data librarians and volunteers are making the most of scant resources to preserve data collected before the digital age.

These efforts are making it easier for scientists to synthesize their work with information from other disciplines. “In some ways, there’s a loosening of the boundaries between projects so that we can all learn together,” Pierce said. And that’s the way science should work, “in its most idyllic form.”

FAIR Standards Extend the Shelf Life of Data

“FAIR, to me, is sort of like taking everything that we’ve been traditionally doing to the next level.”

In 2014, several dozen scientists—from disciplines ranging from biology to computer science—gathered in Leiden, Netherlands, to figure out how to give data a life span beyond the project for which they were generated. Information must be Findable, Accessible, Interoperable, and Reusable, or FAIR, they wrote in a summary of the proceedings.

For Audrey Mickle, a data librarian at Woods Hole Oceanographic Institution (WHOI), the mindset underlying FAIR is nothing new. She and other librarians have always striven to make sure scientists can retrieve information quickly and easily. But FAIR codifies this mindset and presents a strategy for maximizing the usefulness of data. “FAIR, to me, is sort of like taking everything that we’ve been traditionally doing to the next level,” Mickle said.

FAIR was on Pierce’s mind when she and her colleagues designed TDIS’s infrastructure. Data and corresponding metadata are logged into the system, she said, and query tools ensure that they’re easy to track down. Intuitive organization makes data easily downloadable and therefore accessible. TDIS’s designers encourage depositors to name their data’s attributes using the same terminology found in models so that future users can easily connect the two, making the data interoperable. And if users add the results of simulations to TDIS, they must include a description of their methods, allowing users to reproduce the work; the methods are therefore reusable.

TDIS is far from the only organization thinking in FAIR terms; the Deep-time Digital Earth (DDE) program is a massive effort to promote the FAIR framework in the geosciences. “My ambition is very much to see all the kinds of data about our planet integrated into one source, with open access,” said mineralogist and astrobiologist Robert Hazen of the Carnegie Institution, one of the scientists behind DDE.

To achieve this goal, DDE scientists will link and expand existing databases and harmonize their structures. Sometimes they’ll need to create whole new databases.

Sedimentologist Chengshan Wang of the China University of Geosciences, the president of DDE’s executive committee, sees the project as a way of helping scientists from disparate fields and geographic areas communicate with one another. Right now, many discoveries are described in terms of “local knowledge,” he said. For example, he was involved in a publication about the Tibetan Plateau that describes the region only in terms of local place names—an impediment for outsiders trying to understand the work.

At age 71, Wang continues to steer the direction of the collaboration he started in 2018. “The biggest project for me is right now,” he said. “I want to enjoy my retirement, but [I have] no time to be retired!”

Being FAIR is not always easy. Bringing ongoing long-term projects in line with new standards can be a headache, because scientists must tweak their data collection practices partway through their efforts, said Jennifer Bayer, the coordinator of the Pacific Northwest Aquatic Monitoring Partnership for the U.S. Geological Survey (USGS). Inequity is also an issue. For example, USGS employs people who help scientists apply and pay for digital object identifiers, or DOIs, that uniquely identify information, making it easy to find. Bayer said that many organizations, such as Indigenous tribes, may not have the same level of support. There’s a need to “level the playing field with access to those kinds of resources,” she said.

A complementary framework known as CARE (Collective benefit, Authority to control, Responsibility, and Ethics) aims to ensure that the shift toward data accessibility does not compromise the rights of Indigenous Peoples to control data about their people, lands, and resources. The intersection between FAIR and CARE is “the sweet spot that we’re looking for,” Bayer said.

Rescuing Data

FAIR assumes that data are digital, said USGS data specialist Frances Lightsom. But tucked in the back corner of a USGS equipment warehouse in Falmouth, Mass., is a treasure trove of data, most of which have never seen the inside of a computer. Ten collapsible shelves, designed to slide so that only two shelves have an aisle between them at any time, line the Woods Hole Coastal and Marine Science Center Data Library. Reams of paper, film, CDs, VHS tapes, and punch cards fill the shelves, which reach 10 feet above the floor. Asked how much data the library holds, Lightsom, who is the library supervisor, said pensively, “I don’t think we have ever added it all up.”

These days, Lightsom and her colleagues usually add to the library by rescuing nondigital data from the nooks and crannies of retiring USGS scientists’ offices. Researchers can search the library’s catalog, then request that data librarian Linda McCarthy digitize resources that are relevant to their work.

If nobody requests the data, they remain in their original format. Cataloging and preserving materials are more than enough work for the library’s three staff members, and money seldom becomes available just to support digitization.

Seismological records are among the most requested, Lightsom said. Seafloor data are difficult to collect, and the techniques used to seismically image the subsurface can be harmful to marine mammals, so the data that exist are precious.

As a result, Lightsom estimated that she and her colleagues have digitized only about 1% of the library.

Down the road at the WHOI Data Library and Archives, library codirector Lisa Raymond has noticed over her 30-year career at WHOI that researchers have become less eager to use nondigital data. “When I first worked here, people would come here all the time,” she said. “They just have different expectations now.”

Retired information technology professional Thomas Pilitz volunteers his time to digitize the library’s resources, helping prevent them from slipping into the past. He’s digitized close to 5,000 cards carrying information that WHOI scientists collected from the 1930s to 1960s, during research cruises on the vessel Atlantis.

Each card is about 6 by 8 inches and carries such information as water temperature, oxygen content, and salinity. Pilitz scans the cards, creates comma-separated values (CSV) files containing the data, and compiles documents that record additional details about the research cruises. After more than 250 hours of work, he’s about halfway through the collection.
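The transcription step Pilitz performs amounts to turning each card into one structured row. Below is a minimal sketch of that step using Python's standard csv module; the cruise identifier and field names are hypothetical illustrations, not WHOI's actual schema.

```python
import csv

# Hypothetical transcription of one Atlantis hydrographic card into a
# structured row; real cards carry more fields and cruise context.
card = {
    "cruise": "Atlantis-42",       # hypothetical cruise identifier
    "station": "7",
    "depth_m": "50",
    "temperature_c": "12.4",
    "oxygen_ml_per_l": "5.8",
    "salinity_psu": "35.1",
}

# Write the row to a CSV file that downstream tools can filter and merge.
with open("atlantis_cards.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(card))
    writer.writeheader()
    writer.writerow(card)
```

Once thousands of cards are rows in a shared file, a century of shipboard measurements becomes searchable in seconds.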

Despite the efforts of volunteers such as Pilitz, only “a teeny percentage” of the library’s physical resources have been digitized, said Raymond. She and Lightsom both worry that as time goes by, the nondigital data will be lost. Some media degrade, and accidents can happen. Some rooms at the WHOI Data Library and Archives are climate controlled and protected by waterless fire suppressant systems, but the USGS library has no such protection. Worse still, the metadata that describe some resources live only in researchers’ memories. “It’s scary, because you rely on their longevity,” McCarthy said.

Making Metadata Manageable

Fortunately, almost all data that are collected today are digital, making them easier to manage. “If [they] kept coming in on paper, we’d be out of luck,” Lightsom said. But capturing the metadata that researchers need to understand a digital data set can be time consuming—a big deterrent for overworked scientists.

When data systems architect Chris Jordan and his colleagues designed the stable isotope database IsoBank, they wanted to make metadata entry as easy as possible. But to serve researchers from a wide variety of fields, they needed to capture “an extraordinarily complex and interrelated set of metadata,” Jordan said. He and his colleagues created a choose-your-own-adventure-style system.

Scientists enter some preliminary information about their data, which kicks off an iterative process: The database prompts the scientist to enter more information—some required, some only recommended—then adjusts its follow-up questions based on the answers. Jordan estimated that the system saves researchers anywhere from 1 hour to several hours, compared with wading through every metadata field that could possibly be entered and deciding which were relevant.
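The branching entry flow can be sketched as a function that decides which prompts to show next based on earlier answers. The field names and branching rules below are hypothetical illustrations, not IsoBank's actual schema.

```python
# A minimal sketch of a conditional ("choose-your-own-adventure") metadata
# prompt flow. Field names and branching rules are hypothetical.

def next_prompts(answers):
    """Return the metadata fields still worth asking about,
    given what the depositor has entered so far."""
    if "sample_type" not in answers:
        return [("sample_type", "required")]  # everything branches off this

    prompts = []
    if answers["sample_type"] == "animal_tissue":
        prompts += [("species", "required"),
                    ("tissue", "required"),
                    ("diet_notes", "recommended")]
    elif answers["sample_type"] == "water":
        prompts += [("collection_depth_m", "required"),
                    ("salinity_psu", "recommended")]

    # Only ask for fields the depositor hasn't supplied yet.
    return [(field, level) for field, level in prompts if field not in answers]

answers = {"sample_type": "water", "collection_depth_m": 30}
print(next_prompts(answers))  # → [('salinity_psu', 'recommended')]
```

The depositor never sees fields irrelevant to their sample, which is what keeps a "complex and interrelated" metadata schema from overwhelming them.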

Computer scientist Yolanda Gil of the University of Southern California described another iterative process that yielded a robust metadata framework called the Paleoclimate Community Reporting Standard (PaCTS).

Instead of bringing researchers together in person for hours of meetings and whiteboarding, Gil and their colleagues crowdsourced the framework online. First, one scientist described the kind of metadata that should be included. Then another scientist took that description and added more metadata terms that would be valuable, and so on.

An algorithm developed by Gil’s group aided the scientists by suggesting terms they might want to use—similar to Google Search’s autocomplete feature—and organized the terms into an ontology. An editorial board made final decisions about which metadata terms would be included. Gil is very proud of their role in developing PaCTS. Without this metadata framework, “I don’t know that today [the paleoclimate community] would have a good way to make their data more integrated,” they said.

Helping Scientists Use Data More Responsibly

Even meticulously documented data can become “the Wild West” once scientists begin analyzing them on their personal computers, said artificial intelligence practitioner Ziheng Sun of George Mason University. Sun and his colleagues designed and developed a piece of software called Geoweaver, which allows scientists to compose and share analysis workflows so they can standardize high-quality protocols.

Geoweaver is built around FAIR principles, such as encouraging users to share their entire workflows, including how they prepared the data and produced their results, to make sure other users have everything they need to reuse the methods. Sun hopes that making standardized workflows easily available will allow Earth scientists to process data quickly, which could move scientists closer to analyzing extreme weather events such as hurricanes and tornadoes in real time.

Community is also key to making data accessible, said geochemist and IsoBank cofounder Gabriel Bowen of the University of Utah.

“If you’re working with data that [come] from outside your core area, how do you ensure that you’re doing the right thing with [them]?”

“If you’re working with data that [come] from outside your core area, how do you ensure that you’re doing the right thing with [them]?” he asked. Sometimes scientists need to connect with one another and pool their knowledge to work with data responsibly. Early IsoBank design workshops forged many such connections. Bowen said he would like to see the next stage of IsoBank involve the development of computational tools—and communities around those tools—so that scientists can easily make use of the data “in standardized, robust ways.”

Reaching Beyond the Typical Sources

Some scientists are looking outside the usual realms of academic and government data to advance their research.

When hydrologist Kai Schröter of Technische Universität Braunschweig wanted to assess how vulnerable residential buildings were to flooding and estimate their potential for economic loss, he and his colleagues turned to OpenStreetMap, a crowdsourced tool that captures local knowledge about roads, trails, buildings, notable landmarks, and more. Registered users can edit OpenStreetMap directly; municipalities and companies also contribute data. Anyone with an Android phone can contribute by completing quests, during which they visit locations in search of information that’s missing from the map.

The dimensions of houses, cafés, schools, and other buildings are described in a clear, structured way, which sparked Schröter’s interest in OpenStreetMap’s research potential. Because of this clarity, “you can very easily handle large amounts of data, and you can filter the data, and you can process [them] for other applications,” he said.

OpenStreetMap easily checks three of the four FAIR boxes, Schröter said. Finding and accessing the data simply require perusing the organization’s website; the data are clearly documented, making them interoperable. Reusability is where things get a little trickier: OpenStreetMap changes constantly as contributors make updates, and there’s no readily accessible archive.

“What you need to do is record a snapshot of the data that you have used,” Schröter said. Otherwise, other scientists may get different results when they try to replicate a study using a later version.
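Schröter's advice amounts to pairing the filtered data with provenance metadata. A minimal sketch follows, using toy stand-ins for OpenStreetMap elements; real data would come from an OSM extract or the Overpass API, and while the tag names follow OSM conventions, the snapshot format is an assumption for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

# Toy stand-ins for OpenStreetMap building elements; tag names follow
# OSM conventions, but the records themselves are invented.
elements = [
    {"id": 1, "tags": {"building": "residential", "building:levels": "2"}},
    {"id": 2, "tags": {"building": "school"}},
    {"id": 3, "tags": {"building": "residential", "building:levels": "1"}},
]

# Filter to residential buildings, as a flood-exposure study might.
residential = [e for e in elements if e["tags"].get("building") == "residential"]

# Record a snapshot: the exact data used, when they were retrieved, and a
# content hash, so the study can be replicated against this exact version
# even after the live map changes.
snapshot = {
    "retrieved_utc": datetime.now(timezone.utc).isoformat(),
    "sha256": hashlib.sha256(
        json.dumps(residential, sort_keys=True).encode()
    ).hexdigest(),
    "data": residential,
}
print(len(snapshot["data"]))  # → 2
```

Archiving the hash alongside the published study lets anyone verify later that they are rerunning the analysis on the same data.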

Crowdsourced databases come in all forms; some researchers are finding meaning in the public’s off-the-cuff social media comments.

Computer scientist Barbara Poblete of the University of Chile and her colleagues turned to Twitter to reveal how residents of Chile perceived earthquakes. “It takes just a few seconds for people to start tweeting,” Poblete said, and their comments can help seismologists and first responders understand shaking throughout a region.

Twitter data have historically been quite easy to find and access, Poblete said. But many algorithms used to analyze human language require humans to indicate the meaning of a subset of the language sample (also known as annotating) before machines can interpret the rest. This is where issues of interoperability arise.

There’s no standard format for language annotation, Poblete said. Each research group develops annotations that fit its needs. Annotation is also much more common in English than in other languages, putting researchers studying countries such as Chile, where Spanish is the dominant language, at a disadvantage.

Poblete and her colleagues are working around the second problem by creating a system that automatically collects ground motion information when numerous people in a particular area tweet about shaking. Because the system doesn't rely on annotated data, it can be used in any language.
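One way to detect such a burst of activity, sketched below with toy data, is to flag a region when its message count in a short time window jumps well above a baseline—no language-specific annotation needed. The thresholds and the region/window scheme here are illustrative assumptions, not details of Poblete's system.

```python
from collections import Counter

# Toy burst detector: flag a region when the number of messages posted
# there within one time window far exceeds its usual baseline.
def detect_bursts(messages, baseline_per_window=2, factor=5):
    """Return regions whose per-window message count is at least
    `factor` times the baseline."""
    counts = Counter((m["region"], m["window"]) for m in messages)
    return sorted(
        {region for (region, _), n in counts.items()
         if n >= factor * baseline_per_window}
    )

messages = (
    [{"region": "Valparaíso", "window": 0}] * 12   # sudden burst of tweets
    + [{"region": "Santiago", "window": 0}] * 2    # normal chatter
)
print(detect_bursts(messages))  # → ['Valparaíso']
```

Because the detector counts messages rather than interpreting them, the same code works whether the tweets are in Spanish, English, or any other language.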

Back in Texas, Pierce is also working toward using natural human language to complement structured data in descriptions of events such as Hurricane Harvey. She and her collaborators have funding to record residents’ memories of disasters, then look for trends in these stories that can help answer questions such as where storm-related flooding is likely to occur, how deep the water will get, and how long it will take to subside.

Information collected by eyes and ears can become “a new knowledge layer,” complementing information collected by mechanical sensors in a comprehensive data ecosystem, Pierce said. After all, lived experiences are the ultimate reflection of how humans interact with Earth.

—Saima May Sidik (@saimamaysidik), Science Writer

Citation: Sidik, S. M. (2023), Welcome to a new era in geosciences data management, Eos, 104. Published on 27 March 2023.
Text © 2023. The authors. CC BY-NC-ND 3.0
Except where otherwise noted, images are subject to copyright. Any reuse without express permission from the copyright owner is prohibited.