A stack of yellowing datasheets sitting unused in the back corner of a lab, shorthand field notes decipherable only by the individual who took them, long-term records taken without standardized measurements and stored somewhere useless to anyone but the original data collectors – these are all too familiar sights to scientists, examples of the susceptibility of data to a slow and quiet death.
In the environmental sciences, which often rely on long-term data to explain changes in the environment, data death is especially problematic. Now, there is a growing movement to rescue decades-old (or even older) and poorly documented datasets to sharpen our understanding of the causes and consequences of environmental change.
Data rescue is the act of salvaging lost or forgotten information – a task that is becoming easier as the digital age, the discipline of environmental data science, and the open science movement mature, yielding more tools for wrangling and tidying data efficiently and effectively.
“The collective programming intelligence amongst ecologists is increasing, and with that growing intelligence comes an increased capacity to deal with data,” said Jeanette Clark, a projects data coordinator for the National Center for Ecological Analysis and Synthesis (NCEAS). “Because we are becoming more efficient, I think scientists are finding the time and interest to dig back into old datasets to figure out what can be done with them.”
In practice, data rescue can take many forms, and data can mean many things – from numerical measurements to audio recordings. It can be as simple as logging information from a few handwritten data sheets into a digital database, or as complex as standardizing years’ worth of data so they are accessible in a data repository and decipherable by any scientist.
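Even the simple end of that spectrum involves a standardization step. As a hypothetical sketch – the sites, dates, and temperature values below are invented for illustration, not drawn from any real rescue project – transcribed handwritten records with inconsistent formats might be tidied like this:

```python
# Hypothetical sketch of a basic data-rescue step: standardizing
# transcribed field records that use mixed date formats and mixed
# temperature units into one tidy, consistent table.
from datetime import datetime

# Raw transcriptions, as they might appear on handwritten datasheets
# (all values invented for illustration).
raw_records = [
    {"date": "7/3/92", "site": "Bay A", "temp": "68F"},
    {"date": "1992-07-04", "site": "Bay A", "temp": "21C"},
]

def standardize(record):
    """Return a record with an ISO 8601 date and temperature in Celsius."""
    date = None
    for fmt in ("%m/%d/%y", "%Y-%m-%d"):  # try each known date format
        try:
            date = datetime.strptime(record["date"], fmt).date()
            break
        except ValueError:
            continue
    value = float(record["temp"][:-1])
    if record["temp"].endswith("F"):  # convert Fahrenheit to Celsius
        value = round((value - 32) * 5 / 9, 1)
    return {"date": date.isoformat(), "site": record["site"], "temp_c": value}

tidy = [standardize(r) for r in raw_records]
# Every record now shares one date format and one unit.
```

The point is not the specific code but the pattern: once records share one format and one unit, they can be combined with other datasets and deposited in a repository.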
“The incentive for rescuing data is data reuse,” said Clark. “When it is accessible, you can recombine old data to conduct new analyses, answering new questions. Data rescue allows for more creativity and innovation in science.”
Not only that: with tremendous application to synthesis science and inherent ties to open science, data rescue and reuse, Clark explains, leverage resources more efficiently than collecting new data.
“It could save quite literally millions of dollars, especially for more intensive data collection methods, if you could access an existing dataset that contains the information you’re after,” she said.
The idea of data rescue is emerging from the long wake of historical wariness among scientists about sharing data – and there are many reasons why scientists would be protective of their data. Potential misuse by outside parties and failure to give proper credit were, and continue to be, major concerns that hold scientists back from making their data publicly available.
This attitude is slowly changing, especially as the next generation of researchers begins to take the stage.
“New researchers that grew up in the computer age grew up sharing – just look at social media and collaborative working platforms like Google Docs. This sharing culture has initiated a generational shift from the idea that ‘sharing data makes you vulnerable,’ to ‘sharing data makes us all stronger and gives back to the scientific community,’” said Julie Stewart Lowndes, a project scientist for the Ocean Health Index, which relies heavily on data reuse to develop its assessments.
Earlier this year, Lowndes and Clark helped a group of coral reef scientists get their feet wet in the practice of data rescue through a workshop they co-led at NCEAS for the Coral Reef Science & Cyberinfrastructure Network.
For example, two groups of researchers from separate institutions – and generations – worked together to rescue different sets of photographs of one of the world’s best-studied coral reef communities, Jamaica’s Discovery Bay. A pair of junior researchers from Scripps Institution of Oceanography tried to unify their set of recent digital photographs with a photographic time series taken by a group of senior researchers several decades prior – many of the images captured before the dawn of digital photography.
“Late-career researchers frequently still have non-digitized data artifacts, and these researchers are starting to retire. During the workshop, there was this real feeling that we have to digitize and make sense of these data before they retire, or we might lose them,” said Clark.
While the work was tedious, the multigenerational team of reef scientists was able to generate a long-term record of coral reef imagery, creating a valuable resource that ecologists will be able to use to understand changes over time in Discovery Bay and beyond.
“A big part of data rescue is reformatting and describing data in an organized fashion to make it accessible to more people than just you and your lab,” said Jesse Goldstein, projects data coordinator for the Arctic Data Center, the primary data and software repository for Arctic research funded by the National Science Foundation.
Goldstein compared one of Leonardo da Vinci’s anatomical sketches to a page of handwritten observations from a modern lab notebook to illustrate the point. Both contain a wealth of information, but accessing and utilizing that information in a meaningful way can be challenging without essential context called metadata – an additional set of information that describes what the data contain; in other words, where, when, and why the data were recorded.
“Da Vinci’s notes are incredibly detailed, but they aren’t in a format that is widely accessible,” explained Goldstein.
Indeed, da Vinci’s notes were once confined to a single physical location in his journal, and even though they can now be seen by anyone via the internet, many of the figures and ratios from his studies are not suited for reuse because they lack the where, when, and why – the metadata.
“Sadly, we haven’t improved much in five hundred years. A lot of the submissions that we get [to the Arctic Data Center] look like this,” he said, motioning to the excerpt of modern lab notes, full of shorthand abbreviations and no explanations. “It may come to us digitally in an Excel spreadsheet, but we still don’t actually know what it means without context. We have to ask a lot of questions to ensure that we interpret the data correctly.”
Part of the reason for this challenge is a lingering sense, left over from the historical aversion to data sharing, that there is little incentive to store or format data in a way that would make sense to anyone else – a mindset the Arctic Data Center is working to combat.
One recent data rescue by the Arctic Data Center was of a set of thousands of aerial photographs of North American glaciers taken between 1958 and 1999 by the late Austin Post, a photographer and glaciologist for the US Geological Survey. The rescue involved not just scanning and uploading thousands of analog photos (a feat in itself), but also the careful creation and curation of metadata for each photograph, including Post’s own comments.
The resulting collection provides a decades-long record of glacial melt and retreat that will be invaluable to climate scientists today, enabling them to ask and answer new questions about glaciers and climate over time.
“With the internet and the technology we have now, we have the ability to share things more quickly than ever,” said Goldstein. “This gives us a lot more opportunity – and almost a responsibility – to rescue whatever we can.”
Amanda Kelley has been an NCEAS E-Connect Communications Fellow since Spring 2017.