Geology & Geophysics Editors' Vox

“Do You Expect Me to Just Give Away My Data?”

The Editor-in-Chief of JGR: Oceans explains why the new AGU data policy is important for the rigor and long-term security of scientific research.

As Editor-in-Chief of a major AGU journal I frequently deal with inquiries from authors about the process of having their papers accepted for publication. Nothing in the recent past has generated more correspondence than the matter of data reporting requirements, particularly the fact that we no longer permit the statement of “Data available by contacting the author.” I would like to describe, using some personal examples, why AGU’s new data policy is both necessary and a benefit to all.

Why the numbers are important

When you publish a research paper, you are also simultaneously publishing the data that supports your work. The readers of your article have equal rights to see both the words and the numbers – they are inseparable.

But what constitutes “data”? What about the raw instrument readings? What about the calibration runs? What about all the model code, and so on? Those data may well be deposited on local institutional servers and serve as backup for the team. Many fields have developed standards of practice for what is generally reported and stored as a single result from a run or measurement; in some cases these include the raw readings, in others a derived measurement with information on how it was obtained or processed from the raw data, or both. These standards are often set through repositories. Thus, for publication in JGR: Oceans and all the other AGU journals, at least the data covered by these standards need to be available in a publicly accessible repository.

We need the numbers – all the numbers – behind the published figures, graphs, contour plots, etc. And these need to be specific; that is, the individual values, not averages within regions and so on. Your readers may well wish to re-plot these data to test a pet theory, or to assign them as a class problem, or to combine the results in a major review article. That is what your work is for.

How it came about

In May 2013, there was an Executive Order signed by President Obama and an Open Data Policy that applied to all US Federal Departments and Agencies, which declared that all data generated by the government be made open and accessible to the public.

Implementation of this policy in the context of the Earth Sciences was carefully reviewed at a workshop in May 2015 convened by then Editor-in-Chief of Science, Marcia McNutt. In fact, the US system was playing “catch up” as the routine deposition of data essential to support scientific publication has been far more common in Europe, where the PANGAEA data repository for Earth and environmental science is well established.

Representatives of all the major associations and publishing companies in the Earth Sciences were there, as was a senior NSF representative. Brooks Hanson, Senior Vice President for Publications, and I represented AGU. The outcome of this workshop was published in Science [McNutt et al., 2016].

Partly as a result, AGU revised and updated its data policy and the policy's implementation, adopting new data acknowledgement requirements for publication in all AGU journals.

Basically, the commonly used statement “Data available by contacting the corresponding author” is no longer acceptable for our journals. Data must be stored in a domain repository that is accessible to all, and this information must be explicitly stated in the Acknowledgements section of the paper. Further information is available in the AGU data policy.

Problems and perils of the old system

I would like to share two examples from my personal experience of why the old system just doesn’t work.

In September 2016 colleagues and I organized a UK Royal Society Discussion Meeting on Ocean Ventilation and Deoxygenation in a Warming World. My own contribution to this meeting was to provide a synthesis of tracer and model-based estimates of deep ocean oxygen consumption rates, almost all of which have been published in AGU journals; for some of these contributions I personally served as Editor.

In order to do this, I wrote to the authors of the papers asking for copies of the data files accompanying their published work. Not once did I receive a “Sure, I am happy to do that” reply: delay, obfuscation, pleas of being busy, and so on were the rule. In many cases these were colleagues I knew well. These are not big data files; typical profiles cover only a dozen or so depths. These are simple requests, but help was not at hand. Eventually I was able to put together a compilation and the work has now been published [Brewer and Peltzer, 2017]. But the ugly truth is that in some cases I was reduced to blowing up figures on the copier and drawing pencil lines across the image!

A compilation of tracer based ocean oxygen utilization rates plotted as an Arrhenius function. The Mediterranean data set stands out as near orthogonal to all others. The original data table was not published and could not be recovered; it is likely a misprint occurred. Credit: Brewer and Peltzer, 2017, Fig 6

Far more problematic is the matter of job mobility or human frailty. In Figure 6 that we plotted for Brewer and Peltzer [2017] (shown above), one ocean profile ran almost exactly orthogonal to all the others. It was a hugely anomalous result and no reasonable explanation could be found.

The problem was definitely not measurement quality as the lab and the principal investigator had been known for superior work for decades.

However, the results we used did not come from a published data table as that was not given in the original paper; instead they were taken from a simple published equation written as a function of depth that reportedly represented the observations well.

It appeared to us that a misprint might have occurred and we contacted the corresponding author.

Alas, too much time had gone by since the paper was published in 2001. Our colleague was now suffering from badly failing eyesight; the student who executed the work and provided the data equation had moved on, and contact had been lost. The original data are now not discoverable and we have no way of resolving this puzzle.

These are not isolated examples. My experience has been sobering. We need to do better than this. Your data are valuable and need to be shared – this will bring you nothing but credit and honor.

A positive example of the new system

I have had more than a few irate messages along the lines of “Do you expect me to just give my data away so that anyone can use it?” The answer is “yes”. For such skeptics, let me share an active example of large scale data archiving and sharing in the ocean sciences that works well.

The Gulf of Mexico Research Initiative established after the tragic April 2010 Deepwater Horizon oil spill had a requirement for open data access from the outset. This is now a large-scale ocean science program supporting hundreds of scientists around the world, and all are required to submit their data so that it is directly available at the time of publication. A great many of these publications are in AGU journals, as shown in the example below from Tomàs et al. [2017].

On their website you can search for data from 246 research groups and 2,546 people! That is quite a large subset of the ocean science community. There have been no cases of data theft, nor of job mobility, the vicissitudes of aging, or human tragedy causing loss of results. In short, the system works well.

It is AGU policy and a requirement in JGR: Oceans, as well as all the other AGU journals, that all our publications have behind them the data to back up the published findings. That’s the right thing to do.

—Peter Brewer, Editor-in-Chief of JGR: Oceans and Monterey Bay Aquarium Research Institute

  • dafice

    This is such a large move, it should be referred to the membership in a referendum.

  • Robert Link

    Serious question: what do you plan to do about research for which the data is larger than the limit allowed by public repositories? For example, data hosted by Zenodo is limited to 50 GB per data set. ESM output can easily run into many TB. There aren’t many institutions that can provide reliable long-term storage for multi-TB data sets. Will researchers working with huge sets be allowed to provide their data starting at the first analysis stage that reduces the data to a size that fits in public repositories, or will such research be limited to institutions that can afford to host their own large public repositories?

  • Chris Mebane

    Kudos to AGU for this move. It’s good policy and when major players such as AGU stop publishing data-free papers, it helps other societies and publishers to stiffen their spines on this too. However, this post and so many others emphasized responsibilities for authors and data producers, with little or no discussion on best or acceptable practices for data reuse. I often see synthesis papers that cite the repository or aggregator rather than taking the trouble to cite those who actually generated the data. Data curators and aggregators provide valuable services, but groups like AGU would do well to tighten up the expectations for authors of secondary analyses to demonstrate traceable provenance and give appropriate citation credits to the data originators.

  • Ann Schenk

    Metadata is often ignored or missing in smaller databases and, unfortunately, in data analysis. Detection limits of instruments change over time and between brands of instruments. Data curation requires both knowledge of the data and its methods of collection, and the continually changing methods of database creation and use. Not all data requests include the information on how the data were collected, and erroneous conclusions can be drawn.

    Mention metadata to many data analysts, and out come the crosses and silver bullets and “Delete” keys. Having dealt with large GIS oriented datasets, I ALWAYS included a “ReadMe” that stated the datum of all datasets, and if needed, a statement on detection limits in plain English when sending out requested data before the organization had web-enabled data downloads. When web downloads were common, it was decided to provide all location information using the same GIS datum as a separate data set, and to provide parameter data as data sets based on collection methods and detection limits. This made the “metadata” explicit, not buried in the details of the ESRI data.

    What concerns me most is data availability and curation from NGOs. Much data is collected to meet a multitude of Federal program requirements, but very little of that data is publicly available. The current US policies are tending toward private data collection due to budget cuts. Private data is not compelled to be publicly available, nor is its existence required to be known. Imagine weather data being handled the same as drug or chemical company data.

    Open data must include and acknowledge metadata.

    • Ted Habermann

      Peter and Ann both make very important points about metadata that are required to help users understand and trust shared data. They mention instrument data, calibration runs, and model code, just a few of the items that are critical to making shared data trustworthy. Let’s not forget processing histories, algorithm and software documentation, data quality tests/results, and user feedback including identified limitations and fixes…

      We all know that metadata are important for discovery. In the scientific community, discovering data is the first step in a complicated process of access, use, analysis, understanding and, ultimately, trust. Metadata must support this entire process. Sharing data is the first step, but let’s keep in mind that sharing understanding is the ultimate target.

  • Wayne Thogmartin

    I’m a proponent of making data publicly available, regardless of how annoying or distracting it may be (and it is both annoying and distracting, to be sure). Nevertheless, I have three concerns, one practical, the others conceptual. My practical concern is a ramification of data. B uses data from A, which is then in turn combined with that of C. D comes along, thinks to use C and A, not realizing A is already within the set of C. Maybe if data curation is done correctly, whereby C keeps the provenance clear, and people pay attention to the data they have at hand, this won’t be a problem. As data grow in size and complexity, I have my doubts these issues of provenance will be correctly attended to. My other concern, conceptual in nature, is the role of data in science. I’ve made my career using data collected by others, and for that I am grateful. But, when ideas are tested with data, it doesn’t behoove us to argue over the inferences a researcher may draw by poring over his or her data but, rather, to collect new data, perhaps in a way that specifically addresses the area of disagreement. The current fascination with curation of data is fine and will, when combined with others collecting similar data, provide insight at scales not available to individual teams of data collectors, but the process of science leaps farthest when new data are brought to bear, not when old data are agonized over ad nauseam. My last conceptual concern is this. Researcher A looks at his data, tests every model under the sun. We call that data dredging, data fishing, data snooping, or p-hacking. When a slew of different researchers divvy up the hypotheses and test their favorite model against the data, we call that data sharing.

  • tptdac

    I note other problems. Many times an investigator has put up data or software for download on institutional servers (think universities, national labs, etc.). Later this material is moved to another page, leaving broken links, or it is deleted. This happens when an effort is made to “clean up” the web pages or free up server space. If the original investigator is no longer around, his material is likely to vanish. Only large organizations such as AGU or ACS or large publishers such as Elsevier seem likely to be able to avoid this problem. Government bodies including the USGS and NIST should have important roles to play here, but they are subject to the vagaries of government funding, and data preservation is only one of many priorities. Also, they often charge user fees that can be a problem for researchers with low budgets.

    Possible misinterpretation should be dealt with in normal scientific review and in open debate. As for dealing with the public, time is an important part of the process (like it or not).

  • Weftage

    Open data is a good policy. Two areas concern me most.

    One: How do we armor the repositories against tampering, hacking, or other attacks?

    Two, and even more serious: How can scientists protect their research from crankish or malicious misinterpretation? It is not a simple matter to get from raw data to a valid conclusion. It is easy for antagonists (I’ll mention Steve McIntyre) to misrepresent the data in public. Mr. McIntyre and his friends can easily manipulate a compliant press with lies. It is often difficult for laypeople (including scientists in other fields) to separate the truth from denialist chaff.

    (I realize these issues have no easy solution, and are the focus of ongoing effort.)

    • Robert McGuinn

      To address your second point: In my opinion, it is far more productive to go ahead and release the data in its rawest form and then allow the public debate to be about data interpretation methods. Not releasing the data creates the wrong perception. I have faith that intelligent public scientific debate, at least among the honest brokers, can handle the interpretive nuance. One will never be able to shut down those who exploit uncertainty for personal gain, so that will happen no matter what. Those types of folks will be revealed for what they are in the long run anyway. So, I say, release the data and let the fun begin!

  • JonFrum

    No mention of Canadian statistician Steve McIntyre, who made a second career out of making polite and reasonable requests to climate scientists for their data and/or code. Or the climate scientist who responded ‘Why should I give you my data so that you can prove me wrong?’ Yeah, that’s science in action – the search for the truth, buried under careerism.