A look at some challenges and solutions to working with big data
By Tia Kordell
It’s no secret that big data drives the modern world. It advances the security of our bank accounts, dictates the timing of traffic lights and tracks weather patterns.
Ecologists generate their own big data by sharing and combining datasets from different sources to analyze trends across time, space and disciplines, but this process is not always simple. They can encounter three kinds of challenges: finding the right data, using the data and easily reproducing the whole process.
To help researchers overcome these hurdles, open source, open access tools have emerged as a solution. “Open source” programs have viewable and changeable source code, and “open access” programs are free for anyone to obtain and use, no matter the institution, country or funding.
NCEAS has been on the frontlines of the open data movement, supporting the creation of tools that enable researchers, practitioners and decision-makers to locate data, analyze them and repeat those analyses. For example, we are a founding partner of DataONE, a network of experts, institutes and data centers that are working together to make environmental data more accessible.
Here’s a brief look at tools that have emerged from our work - and from those three challenges as experienced by our own researchers.
The Searchability Challenge: Elusive Data
It can be dizzying to find the right data in the ocean of available information - whether it’s sifting through papers for a literature review or searching data repositories for specific datasets.
Raw data collection and literature reviews have one thing in common: finding answers takes time. And since informed ecological decisions typically depend on data collection, this challenge can limit researchers, policy makers and practitioners alike.
NCEAS postdoctoral researcher Halley Froehlich knows the frustrations of synthesizing raw data from multiple sources. For her research on aquaculture and food systems, she pulls data on sustainable agriculture and climate change from several different data banks, but the lack of a universal labeling language or system makes it difficult to compile data across disciplines.
“Each model has its own platform, and each platform has its own procedures,” she said.
Literature reviews can also lead down an irritating path, a frustration with which Samantha Cheng, another NCEAS postdoctoral researcher, is all too familiar. In one project alone, she sifted through about 38,000 papers to find the right information, a hunt that helped her find the true meaning of the words “time consuming.”
“I spent hours screening through titles and abstracts to find what’s relevant and what’s not,” she said.
NCEAS Solution: Colandr
In fact, Cheng was part of a team that created a tool in answer to these limitations. As part of her work with the Evidence-based Conservation Working Group of the Science for Nature and People Partnership, she worked with peers at Conservation International, which relies on literature reviews to inform its decisions, and pro-bono developer group DataKind to create Colandr, an open-access app that automates and accelerates the literature review process.
Colandr helps users gather large numbers of scientific papers more efficiently, and its machine-learning technology helps them find what they are looking for more quickly and precisely. Colandr also has a unique “deduplication” feature, which automatically eliminates duplicate studies when downloading from multiple sources.
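To make the idea of deduplication concrete, here is a minimal sketch in Python. It is not Colandr's actual method, which this article does not describe. It assumes each search result is a simple record with a title and an optional DOI, and the function names, sources and records are all made up for illustration.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and spacing so near-identical copies match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def deduplicate(records):
    """Keep the first record seen for each DOI, or for each normalized title when no DOI is given."""
    seen = set()
    unique = []
    for record in records:
        doi = (record.get("doi") or "").strip().lower()
        key = ("doi", doi) if doi else ("title", normalize_title(record["title"]))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical search results from two sources; titles and DOIs are invented.
source_a = [
    {"title": "Mangrove restoration outcomes", "doi": "10.0000/example.1"},
    {"title": "Community forestry and livelihoods", "doi": None},
]
source_b = [
    {"title": "Mangrove Restoration Outcomes.", "doi": "10.0000/EXAMPLE.1"},
    {"title": "Community forestry, and livelihoods", "doi": None},
    {"title": "Marine protected areas and fish biomass", "doi": None},
]

unique = deduplicate(source_a + source_b)
print(len(unique))  # 3 unique papers out of 5 downloaded records
```

A simple rule like this misses a duplicate when one copy has a DOI and the other does not, which is one reason a real tool would need more careful matching.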
“Colandr takes down a barrier and gives everyone a step forward,” said Cheng.
By simplifying and speeding up the sifting process, Colandr also helps researchers and practitioners alike determine science-based solutions more quickly and efficiently.
The Analysis Challenge: Cannot Compute
Once the data are compiled, researchers face the daunting task of analyzing them. It can take massive computational power to turn large datasets, such as lists of species behaviors or long-term climate trends, into meaningful results or decision-making tools.
These days, this caliber of computing increasingly involves a skill most ecologists never get formal training in: coding.
Before coming to NCEAS, Julia Stewart Lowndes, a scientist with the Ocean Health Index, was a largely self-taught coder, a skill she picked up when more familiar software like Excel couldn’t process the large datasets needed for her doctoral research on squids.
“Coding made smaller, repetitive tasks more efficient and broadened the kinds of scientific questions I could ask,” said Lowndes, who has been able to enhance her coding knowledge and practice alongside her fellow coding colleagues at NCEAS.
Overcoming this quasi “analysis paralysis” is crucial for extracting the most results and value out of data, as well as for enabling those who lack advanced computational knowledge, such as politicians or practitioners, to find meaning in the data on their own.
NCEAS Solution: Circuitscape
An exemplary case is connectivity modeling, an ecological analysis used to predict how populations of animals move across landscapes. Connectivity modeling is important for prioritizing habitat conservation, but involves a great deal of data and careful analyses.
To make the analysis easier, NCEAS and The Nature Conservancy paired up to create Circuitscape, a data-analysis program that uses electrical circuit theory, or the idea that electricity moves along a circuit, to understand how wildlife moves across landscapes. With this program, users can gain meaningful results quickly, freeing up their time for other important analyses and environmental decision-making.
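The circuit analogy can be illustrated with a toy calculation. The Python sketch below is not Circuitscape's code. It assumes habitat patches are treated as nodes in a network of resistors, where a higher resistance stands for land that is harder to cross, and it computes the effective resistance between two patches. The function names and the tiny example landscapes are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        total = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - total) / aug[r][r]
    return x


def effective_resistance(n_nodes, edges, source, target):
    """Effective resistance between two nodes; edges are (node_a, node_b, resistance) tuples."""
    # Build the graph Laplacian from conductances (1 / resistance).
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for node_a, node_b, resistance in edges:
        g = 1.0 / resistance
        lap[node_a][node_a] += g
        lap[node_b][node_b] += g
        lap[node_a][node_b] -= g
        lap[node_b][node_a] -= g
    # Ground the target node (voltage 0) by dropping its row and column.
    keep = [i for i in range(n_nodes) if i != target]
    reduced = [[lap[i][j] for j in keep] for i in keep]
    # Inject one unit of current at the source; its voltage then equals the resistance.
    current = [1.0 if i == source else 0.0 for i in keep]
    voltages = solve(reduced, current)
    return voltages[keep.index(source)]


# One corridor: patch 0 -> patch 1 -> patch 2.
one_route = [(0, 1, 1.0), (1, 2, 1.0)]
# Two parallel corridors between patch 0 and patch 3.
two_routes = [(0, 1, 1.0), (1, 3, 1.0), (0, 2, 1.0), (2, 3, 1.0)]

print(effective_resistance(3, one_route, 0, 2))   # about 2.0
print(effective_resistance(4, two_routes, 0, 3))  # about 1.0
```

In this toy example, adding a second corridor halves the resistance between the two end patches. That is the intuition behind the analogy: more available pathways mean easier movement.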
A brainchild and legacy of the late Brad McRae, a former NCEAS postdoctoral researcher, Circuitscape has become the most widely used connectivity analysis package in the world. It has helped organizations understand the movements of nature and make more informed decisions accordingly, aiding efforts such as designing habitat corridors for big cats, prioritizing land for gibbon conservation and developing strategies to prevent wildfires.
The Reproducibility Challenge: Reinventing the Wheel
Reproducibility, the scientific tenet that analyses can be redone with consistent results, presents our final data dilemma. According to Lowndes, the project scientist for the Ocean Health Index (OHI), without code that can be rerun easily, researchers often face re-doing analyses from scratch with each new set of data, essentially “recreating the wheel” every time.
For years, the OHI team has annually assembled and analyzed data for a suite of indicators to evaluate the health of the world’s oceans. With the yearly influx of new data, the team quickly realized that, instead of spending hours tracking down emails and Excel sheets to make sense of their past methods, they needed a system for coding collaboratively.
Many researchers find themselves forming “homegrown” analysis strategies to deal with large datasets, only to discover that repeating their own analyses costs them time and that keeping the strategies in their own heads inhibits others from reproducing them. Because of this, researchers can end up backtracking instead of building on past progress.
NCEAS Solution: Ocean Health Index coding toolbox
To solve their reproducibility problem, the OHI team looked to Silicon Valley. Tech companies use coding software to quickly and easily run analyses on sales and marketing data, and the team thought, why not apply that approach to science?
With the decision to “embrace coding,” Lowndes and the OHI team created a data analysis toolbox that uses open access, open source software, like RStudio and GitHub, to run analyses for them. The toolbox also includes resources on how to make ocean health assessments from raw marine data, and it is available to anyone who wants to recreate the process on their own.
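The payoff of rerunnable code can be shown in a few lines. The sketch below is not the OHI toolbox or its scoring method. It is written in Python purely for illustration, and the scoring rule, region names and values are all invented. It assumes only that the same documented function is applied to each year's data.

```python
def region_score(values):
    """Average a region's indicator values (0 to 100), rounded to one decimal place."""
    return round(sum(values) / len(values), 1)


def assess(year_data):
    """Apply the same scoring method to every region in one year's data."""
    return {region: region_score(values) for region, values in year_data.items()}


# Made-up indicator values for two hypothetical regions.
data_by_year = {
    2016: {"region_a": [70, 80, 90], "region_b": [50, 60]},
    2017: {"region_a": [72, 84, 90], "region_b": [55, 65]},
}

# When a new year of data arrives, add it above and rerun; the method stays the same.
results = {year: assess(data) for year, data in data_by_year.items()}
print(results[2017])  # {'region_a': 82.0, 'region_b': 60.0}
```

Because the method lives in code rather than in someone's memory or inbox, anyone with the script and the data can repeat the assessment and get the same numbers.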
According to NCEAS executive director Ben Halpern, open science tools like these are carving the future of evidence-based environmental solutions.
“NCEAS believes in open and transparent science not only for the process but for decision-making, as well,” he said. “Open source, open access tools allow decisions based on that science to be more robust and widespread.”
Tia Kordell was an NCEAS E-Connect Fellow for the summer of 2017 and will complete her Master's degree at UC Santa Barbara's Bren School of Environmental Science and Management in the summer of 2018.