Climate Change Editors' Vox

Challenges in Assembling and Managing Environmental Data Sets

Pulling together long-term data is increasingly important in assessing environmental changes, whether regionally or globally.

By Catherine O’Reilly, Stephanie E. Hampton, Sapna Sharma, Derek Gray, Jordan S. Read, John D. Lenters, and Philipp Schneider

A recent paper in GRL discussed warming of lake surface waters around the globe. The authors' analyses show that surface water warming rates depend on combinations of climate and local characteristics, rather than just lake location, leading to the counterintuitive result that regional consistency in lake warming is the exception, rather than the rule.

The data set used in the study was described and published in a related paper in Scientific Data, following a workshop in 2012 to bring together the investigators and data sources (see related article in Eos). Pulling together such long-term data is increasingly important in assessing environmental changes, whether regionally or globally, and maintaining and building upon these data will be critical going forward. AGU asked the authors to highlight further how they pulled the data sets together and some of the important challenges they see going forward. (Individual authors’ answers are identified by their initials.)

Q1: Can you describe the major challenges you faced in assembling the data set? How easy or difficult was it to find and pull together these data from different regions and countries? 

Assembling the dataset and preparing it for analysis was definitely the greatest challenge and took the most time. For in situ lakes, people were often very willing to collaborate and provide raw data, so much of our time was then spent processing these into a consistent metric for the annual summer surface water temperature. For the satellite dataset, the main challenge was to ensure that the resulting time series represented true geophysical variability in lake surface temperatures and were not affected by shifting observation times, calibration drift of the instruments, sampling biases, and similar issues that have the potential to mask the true underlying trends. We spent a surprising amount of time compiling metadata and information on the general sampling approach for the in-situ sampled lakes, which we needed to fully describe the dataset for publication. (COR)

Assembling the data set was the greatest challenge – most researchers who have done this sort of data synthesis would say that preparing data for analysis accounts for the vast majority of time spent on the project. (SH)

The limnological community has not converged on time series data standards, so our group spent time converting various formats from data contributors into the final published dataset. (JR)
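As a rough illustration of the kind of harmonization described above, the following sketch collapses dated readings into one summer mean per lake and year. The record layout, the function name `annual_summer_means`, and the July to September window are assumptions made for illustration; the interview does not specify the contributors' formats or how summer was defined.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Hypothetical summer window (month numbers); not taken from the source.
SUMMER_MONTHS = {7, 8, 9}

def annual_summer_means(records, summer_months=SUMMER_MONTHS):
    """Collapse (lake, date, temp_c) records into one mean per (lake, year)."""
    buckets = defaultdict(list)
    for lake, day, temp_c in records:
        # Keep only readings that fall inside the assumed summer window.
        if day.month in summer_months:
            buckets[(lake, day.year)].append(temp_c)
    return {key: mean(values) for key, values in buckets.items()}

# Made-up readings in degrees Celsius.
records = [
    ("Lake A", date(2001, 7, 15), 20.0),
    ("Lake A", date(2001, 8, 15), 22.0),
    ("Lake A", date(2001, 12, 1), 4.0),  # outside the window, ignored
    ("Lake A", date(2002, 8, 1), 21.0),
]
result = annual_summer_means(records)
for (lake, year), value in sorted(result.items()):
    print(lake, year, value)
```

In practice each contributor's format would first need its own small parser to produce records in this shape, which is where much of the effort described above would go.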

Ensuring high data quality for such a large number of records and maintaining a database that kept growing as we met new collaborators was challenging. (SS)

The main challenge with regard to the satellite data was to ensure that the resulting time series represent true geophysical variability in lake surface temperatures and are not affected by shifting observation times, calibration drift of the instruments, sampling biases, and similar issues that have the potential to mask the true underlying trends. (PS)

We were fortunate to collaborate with investigators in many different countries and time zones who spoke different languages. This made timely communication a challenge. (DG)

We were fortunate to have two great starting points for this project: 1) Previous work from the remote sensing community that had already analyzed surface warming trends from 169 lakes globally, and 2) Existing lake networks, such as GLEON, that have in situ measurements and can bring in long-term field data from lakes around the world (particularly those that might not have already been in the satellite database). To bring these two communities of investigators together, we hosted a special session at the 2011 IAGLR conference in Duluth, MN, where 15-20 of us gathered to have our first face-to-face meeting. At that meeting, it was proposed that we secure funding from NSF and NASA to host a broader workshop to bring together all investigators and begin synthesizing our datasets. This was eventually accomplished in Lincoln, Nebraska in June of 2012, and the workshop (with 40+ investigators) is described in Eos. (JDL)

Q2: Publishing the data separately is something that is being increasingly encouraged, in part to help provide credit and visibility for important data sets.  Can you describe how this came about and any thoughts you have on the process or improvements?

The idea of publishing the dataset came up relatively early in the project. Publishing the dataset provided the most straightforward way for us to ensure that all the data providers would be credited for their hard work, through citation, in later manuscripts that use the data. It was also a way to make the dataset broadly available for new research – we know that some other scientists started working with the dataset almost as soon as it became accessible. Actually compiling the information that we needed to incorporate into the publication took a surprising amount of time. A data paper is really a very different type of manuscript than what we are used to writing, and we were fortunate to work with the information manager for one of the long-term ecological research sites. (COR)

Publishing the data set separately provided the most straightforward way for us to ensure that all the data providers would be credited for their hard work, through citation, in later manuscripts that use the data, and also that the data set will be broadly available for inspiring new collaborations. We had several requests for the data immediately after publishing the paper in GRL, and it was great to be able to say that the data set was already peer reviewed, published, and open access, and to point people directly to the data paper. (SH)

In addition, it provided an opportunity for our data sets, sampling description, and validation to be peer-reviewed and a venue for all of the sampling methodologies to be summarized in one location so that others can get a clear idea of exactly how the data were collected for each lake. (SS)

Some of the contributors were nervous about publishing the data before our analysis paper was accepted for publication, as there was a worry that others would “beat us to the punch,” but the advantages described above outweighed any concerns we had about publication priority. (DG)

We were excited about sharing this dataset publicly, as it provides opportunities for others to explore different research questions in the future. (JR)

One of the challenges of publishing such a large data paper with so many investigators is the wide variation in detail, accuracy, and level of participation. As the 2012 workshop organizers and lead authors of the data paper can attest, wrangling (and organizing) responses from more than 40 people is a significant challenge. Some individuals respond quickly, with good attention to detail, while other people can be difficult to pin down. We had originally started by having each individual/research group process and send their data in a standardized form, but some of the groups did not follow instructions and/or simply sent the raw data. So in the end, we had to compile and (re-)process all the raw data, which of course increased the time to project completion. (JDL)

Q3: Overall, how well are we systematically monitoring lakes? Your study looked at 291 lakes, which is large compared to previous efforts but still represents only a sampling of lakes globally.

The data paper includes more lakes than those that we actually used for the analyses published in GRL, which only involved 235 lakes. Some of the lakes that were initially incorporated ended up not meeting requirements that we later agreed upon for the analysis (for example, the number of missing years of data that were allowed). As might be expected, the lakes with the best long-term in-situ measurements are in the northern hemisphere, and even those are clustered in Europe and eastern North America. The satellite measurements help provide data on lakes that researchers cannot visit in person regularly, but these satellites could only ‘see’ relatively large lakes. So in many regions of the world we still have spotty knowledge of freshwater resources – for some of our remote satellite-sampled lakes, we could not even find information on how deep they are. Given that there are over 100 million lakes in the world, one could argue that we still have a long way to go! (COR)
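The screening step mentioned above (dropping lakes with too many missing years) can be sketched as follows. This is only an illustration: the threshold `max_missing`, the function names, and the toy data are hypothetical, and the actual criteria the authors agreed upon are not given in the interview.

```python
def count_missing_years(series, start_year, end_year):
    """Count years in [start_year, end_year] with no temperature value."""
    return sum(
        1
        for year in range(start_year, end_year + 1)
        if series.get(year) is None
    )

def screen_lakes(lakes, start_year, end_year, max_missing):
    """Keep lakes whose number of missing years does not exceed max_missing."""
    return {
        name: series
        for name, series in lakes.items()
        if count_missing_years(series, start_year, end_year) <= max_missing
    }

# Toy data: year -> summer surface temperature in degrees Celsius (made up).
lakes = {
    "Lake A": {2000: 20.1, 2001: 20.4, 2002: 20.3, 2003: 20.9},
    "Lake B": {2000: 15.2, 2003: 15.8},  # 2001 and 2002 missing
    "Lake C": {2000: 18.0, 2001: None, 2002: 18.4, 2003: 18.6},
}

# Hypothetical rule: allow at most one missing year in the study period.
kept = screen_lakes(lakes, 2000, 2003, max_missing=1)
print(sorted(kept))  # ['Lake A', 'Lake C']
```

A rule of this kind is one way a published data set can contain more lakes than the analysis built on it.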

More lakes in the Northern Hemisphere and in close proximity to limnology labs are being monitored. It would be great to have a more systematic approach where more lakes in continents such as South America and Africa are supported by long-term ecological research networks. (SS)

Satellite measurements have helped us to get data on lakes that researchers have not been able to visit in person regularly, but right now satellites only help us “see” pretty large lakes, and more than 90% of the world’s lakes are small and shallow. So in regions where researchers don’t have long-term in situ data, and lakes are too small for satellite data, we end up with spotty knowledge of the status of freshwater resources. In addition, we still need more research on interpreting satellite-derived estimates relative to in situ measurements, so that we can take better advantage of this technology, particularly as it continues to evolve. (SH)

Relative to the total number of lakes in the world, of course, 291 lakes is still a very small number. And with only 25 years of data for many of these lakes, we are just beginning to be able to distinguish the “signal from the noise” in terms of global lake warming patterns. As an initial step, though, I would say that we’re doing a reasonably good job of monitoring the surface temperatures of lakes globally. Satellites can only see the large lakes, but the technology is improving such that even small lakes can now be “seen” by remote sensing. Field monitoring of lakes is always subject to the vagaries of funding, but a few long-term monitoring stations have records that go back as far as 100 years. One of the challenges with this type of work is having good geographic coverage, and only about 26% of the lakes in the GLTC database are located in the northern, high latitudes (50-70°N), compared to 68% in the actual global distribution of lakes. So we significantly under-represent Arctic lakes in our database, while over-representing northern, midlatitude lakes (30-50°N; 55% compared to 12%). Coverage of lakes in the tropics and southern midlatitudes – while low (19% combined) – is actually quite good considering the small percentage of lakes that actually exist in those regions (11%). (JDL)
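The coverage comparison above can be made concrete by dividing each region's share of the database by its share of the global lake distribution. The sketch below uses only the percentages quoted in the answer; the ratio itself and the name `representation_ratio` are an illustrative choice, not a measure the authors report. The quoted global shares sum to 91%, and the interview does not say where the remaining lakes fall.

```python
# (share of GLTC database in %, share of global lake distribution in %),
# as quoted in the answer above.
SHARES = {
    "northern high latitudes (50-70N)": (26, 68),
    "northern midlatitudes (30-50N)": (55, 12),
    "tropics and southern midlatitudes": (19, 11),
}

def representation_ratio(database_pct, global_pct):
    """Ratio above 1 means over-represented; below 1 means under-represented."""
    return database_pct / global_pct

ratios = {
    band: round(representation_ratio(db, gl), 2)
    for band, (db, gl) in SHARES.items()
}

for band, ratio in ratios.items():
    label = "over-represented" if ratio > 1 else "under-represented"
    print(f"{band}: {ratio} ({label})")
# northern high latitudes (50-70N): 0.38 (under-represented)
# northern midlatitudes (30-50N): 4.58 (over-represented)
# tropics and southern midlatitudes: 1.73 (over-represented)
```

By this measure, northern midlatitude lakes appear in the database at more than four times their global share, while high-latitude lakes appear at well under half of theirs.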

Q4: What is needed to maintain and expand on these or other important measurements in lakes going forward?

It’s critical to continue existing long-term research programs, both for in-situ research and to sustain operational satellite missions. Satellite data will probably always be necessary to measure lake surface temperatures in remote regions of the world, and current and future advances in satellite technology allow us to sample smaller lakes. Of course, it would be great to have more lakes in continents such as South America and Africa supported by long-term ecological research networks. Practically speaking, a dedicated data manager would be vital for expanding on this database in the future, both with respect to updates as lakes continue to be monitored as well as for adding additional lakes. (COR)

I believe that a data manager would be vital to expand on these important measurements. We need a system in place to update data as lakes continue to be monitored and add additional lakes as we learn of more lakes that have collected long-term time series. (SS)

This study also highlights the importance of funding long-term research programs. Since grants often operate on short time frames (3-5 years), it is difficult to maintain long-term records that can document changes in the environment. For example, the Scripps Institution of Oceanography has had a really hard time maintaining funding for the oldest direct source of CO2 monitoring on Mauna Loa in Hawaii. The fact that we were able to obtain so many long-term records of lake surface temperature highlights the tenacity of many of the investigators that contributed to this project. (DG)

The use of satellite data will always be necessary to monitor lake surface temperatures in those regions of the world where frequent direct ground-based measurements are not possible. Current and future satellite instruments are able to spatially resolve many more small lakes at high sampling frequency, so the number of lakes that can be studied by satellite will grow significantly in the future. To ensure the highest possible quality of satellite-based lake surface temperatures, it will be necessary to sustain long-term funding of operational satellite missions carrying instruments with similar characteristics and thus providing stable time series. (PS)

As with most research and monitoring programs that look for decadal-scale changes, the key ingredient is long-term support and funding. We were able to pull together this initial dataset and analysis using workshop funding and in-kind support from partner institutions, but that is not nearly enough to sustain a long-term program. At a minimum, this would require a full-time data manager to continue the process of collecting, standardizing, and maintaining the lake temperature data year after year so that we can track the long-term health of the world’s lakes. (JDL)

—Catherine O’Reilly, Stephanie E. Hampton, Sapna Sharma, Derek Gray, Jordan S. Read, John D. Lenters, and Philipp Schneider; email: [email protected]