Students, activists, and volunteers gathered on the Massachusetts Institute of Technology’s campus to figure out ways to download mountains of government data. Credit: Renee Bell and Chi Feng, CC BY SA 4.0

On a sunny Sunday in mid-February, 150 people sat around a large room at circular tables, hunching over laptops and paper plates soaked in the grease of free pizza. Ambient chatter and keyboard clacking filled the Walker Memorial building on the Massachusetts Institute of Technology’s (MIT) campus as attendees urgently surfed through government websites.

These dedicated volunteers gathered at MIT that weekend for the same reason hundreds of volunteers have mobilized in similar hackathons across the country—to hunt for any crumbs of government data that can be downloaded, copied, and archived.

Data Doubts

With the election of Donald Trump, a new fear swept over scientists in the United States: Would an administration led by Trump, who has consistently downplayed the overwhelming scientific evidence of human-driven climate change, attempt to remove or delete government data from public websites? Moments after he was inaugurated, the climate change page of WhiteHouse.gov disappeared.

Even before Trump’s inauguration, scientists were worried. Only a month after the election, scientists at the National Oceanic and Atmospheric Administration (NOAA) started to copy data onto independent servers, concerned that the Trump administration would remove it.

Many thought of Canadian scientists, who faced their own round of government-led muzzling of communication. During his 2006–2015 tenure, then prime minister Stephen Harper barred government scientists from speaking with the public or media and ordered entire libraries full of data to be destroyed, particularly libraries that housed environmental data.

In 2013, the Professional Institute of the Public Service of Canada found that 90% of federal scientists felt that they were not allowed to speak freely to the media. Additionally, 24% of those surveyed had been asked to “exclude or alter information for non-scientific reasons,” and 37% “had been prevented…from responding to questions from the public and media.”

Repeating History

History tends to repeat itself, said Bethany Wiggin, a literary historian at the University of Pennsylvania. Wiggin is a cofounder of DataRefuge, which jump-started local data rescue events, like the MIT hackathon, in cities including Philadelphia, New York, San Francisco, and Seattle. Wiggin also founded and directs the Penn Program in the Environmental Humanities, which challenges its students to create projects that combine science and humanities.

Soon after the election, Wiggin and her students started discussing the problem of vulnerable government data. After weeks of conversations with librarians, archivists, and scientists, Wiggin, her students, and her colleagues decided to spearhead an effort to secure those data.

A two-pronged approach emerged, she said. The first strategy, a “top-down” approach, involved circulating a survey across American universities and scientific organizations, asking scientists to weigh in on what they thought were the most important—and vulnerable—data. Data sets were triaged, and those that got multiple flags were sent to the top of the to-download list.
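The triage step described above amounts to counting flags per data set and sorting. A minimal sketch of that logic, with hypothetical data set names standing in for real survey responses:

```python
from collections import Counter

def triage(flags):
    """Rank data sets by how many survey respondents flagged them.

    `flags` is a list of data set names, one entry per flag received;
    data sets with the most flags rise to the top of the to-download list.
    """
    counts = Counter(flags)
    # Sort by flag count (descending), breaking ties alphabetically.
    return [name for name, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# Hypothetical survey responses: each string is one scientist's flag.
responses = ["NOAA tides", "EPA emissions", "NOAA tides",
             "NASA GISTEMP", "NOAA tides", "EPA emissions"]
print(triage(responses))  # "NOAA tides" first (3 flags), then "EPA emissions" (2)
```

The actual DataRefuge survey and ranking process was more involved, but the principle — multiple flags push a data set up the queue — is the same.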

The second strategy was recruiting volunteers to download any and all government data they could find. With the help of a partner organization, the Environmental Data and Governance Initiative (EDGI), the group came up with a “bottom-up” approach, Wiggin said, which resulted in the data rescue events. The volunteers at these events, rather than targeting specific data sets, start by learning everything they can about each government website before downloading any data.

These events take on two important roles, Wiggin said. “One, they shine light on the problem” of vulnerable data on government websites. Simultaneously, data rescue events “also start to solve the problem.”

Data Rescue Boston

The MIT hackathon consisted of three different groups: surveyors, seeders, and harvesters. The first group, the surveyors, combed through government websites to figure out their architecture. Following a strategy developed by EDGI, this group created a map of each site, identifying pages that could hold important data sets; these maps were compiled into “primers” that were then passed on to the seeders.
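Mapping a site’s architecture starts with harvesting the links on each page. A toy sketch of that first step using only Python’s standard library (the page content and URL here are invented stand-ins, not part of EDGI’s actual tooling):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkMapper(HTMLParser):
    """Collect every hyperlink on a page, resolved against the page's URL."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links so every entry is a full URL.
                    self.links.append(urljoin(self.base_url, value))

# A toy page standing in for a government site's index.
page = '<a href="/data/precip.csv">Precipitation</a> <a href="reports/2016.html">Report</a>'
mapper = LinkMapper("https://example.gov/climate/")
mapper.feed(page)
print(mapper.links)
```

Running a mapper like this over every page, then sorting the results into “plain pages” versus “pages that expose data files,” is essentially what the surveyors did by hand.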

“The last election did it for me,” said surveyor Paula Ehler, a retired Harvard University secretary and current artist, when asked why she joined the event.

“I marched in the sixties,” she continued. “And I’m doing it again. Instead of feeling so helpless, at least I feel like I’m doing something.”

Using the primers, the seeders took over. Like “human Web crawlers,” these volunteers started looking at the sites and pages identified by the surveyors, determining what data could be simply archived online in the Internet Archive’s Wayback Machine and flagging data that need special attention, said Jeffrey Liu, an MIT Ph.D. student in civil engineering and co-organizer of the event.
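Pages that only need straightforward archiving can be submitted to the Wayback Machine’s “Save Page Now” endpoint, which takes a snapshot of any URL appended to it. A minimal sketch (the EPA URL is an illustrative example, not necessarily one seeded at the event):

```python
def wayback_save_url(page_url):
    """Build the Internet Archive 'Save Page Now' URL for a page.

    Requesting the returned URL (e.g., with urllib) asks the Wayback
    Machine to take a fresh snapshot of `page_url`.
    """
    return "https://web.archive.org/save/" + page_url

print(wayback_save_url("https://www.epa.gov/climate-indicators"))
```

Seeding at scale is just this, looped over every URL the surveyors marked as safe for simple archiving; data behind search forms or download scripts gets flagged for the harvesters instead.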

Finally, the harvesters, supervised by volunteers with expertise in the various government websites, puzzled out how to download all the data. The harvesters wrote programs and built tools that captured the data and downloaded them into an easily accessible format like a Microsoft Word document. The harvesters also worked on data sets that had been nominated by scientists. For instance, one harvester, who goes by “Fuzzy,” was building a program that could compile decades of precipitation, wind, and water level data from hundreds of NOAA stations across the United States.
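Compiling per-station records into one table is a typical harvester task. A toy sketch of the merge step, using invented station files rather than real NOAA downloads:

```python
import csv
import io

def merge_station_csvs(named_csvs):
    """Combine per-station CSV text blobs into one list of rows,
    tagging each row with the station it came from."""
    rows = []
    for station, text in named_csvs.items():
        for row in csv.DictReader(io.StringIO(text)):
            row["station"] = station
            rows.append(row)
    return rows

# Two toy station files standing in for downloaded NOAA records.
files = {
    "BOS": "date,precip_mm\n2016-01-01,5.2\n",
    "NYC": "date,precip_mm\n2016-01-01,3.1\n",
}
merged = merge_station_csvs(files)
print(len(merged), merged[0]["station"])
```

Fuzzy’s actual program had to handle hundreds of stations, multiple variables, and NOAA’s own download interfaces, but the end product — one unified, citable table — follows this shape.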

By the end of the day, 35 gigabytes of data had been rescued from the Department of Energy, the Environmental Protection Agency, NASA, and NOAA, said Liu. Surveyors at this event also started tackling non–climate science–related data from the departments of Labor, Justice, Health and Human Services, and Housing and Urban Development and the Federal Communications Commission, Liu added.

Future of Data

Data rescue events serve the movement’s short-term goal of saving and copying as much government data as possible, but DataRefuge and its partner organizations have long-term goals as well, Wiggin said. These include curating the data in a way that is useful to researchers. Specifically, Wiggin said, she and her colleagues hope all these efforts will culminate in a repository of expert-approved data run by the open data community and research libraries, so the data can be used and cited by scientists and researchers.

“This is something I can do that makes a difference,” said a scientist at the MIT data rescue event who wished to remain anonymous. “Science requires previous work to flourish. If we do not have access to research data that came before, we cannot make a difference in health and climate [research].”

By participating in events like these, “I can help ensure the work that I and my colleagues have done does not go in the dark,” she continued.

—JoAnna Wendel (@JoAnnaScience), Staff Writer


Wendel, J. (2017), Activists set out to save data, one byte at a time, Eos, 98, Published on 07 March 2017.

Text © 2017. The authors. CC BY-NC-ND 3.0
Except where otherwise noted, images are subject to copyright. Any reuse without express permission from the copyright owner is prohibited.