Data Science Research
Our research in data science and informatics makes environmental data easier to locate, access, interpret, and analyze.
Often done in partnership with other scientific institutions, our data science research generates solutions for the following:
- Storage and management of environmental data
- Discovery and preparation of data for further analysis and synthesis
- Automated machine processing of information and models
- Making these capabilities available to practicing scientists
Current Projects
ABC Tracker is a tool that enables scientists to track and study animal behavior by video. NCEAS is working in partnership with the project leads, who are based at the University of North Carolina, to help them archive the data generated by the tool and make that data accessible to other scientists.
CodeMeta is an effort to standardize the exchange of software metadata across repositories and organizations through a common vocabulary and schema that will connect code and data services such as GitHub, figshare, and Zenodo. This work supports shareable and reproducible data and methods.
Ecologists gather long-term data at multiple scales, necessitating tools that measure patterns and rates of change in plant and animal communities in response to the many factors that affect them. This toolbox makes analyses of ecological communities more accessible and usable, and is intended to minimize data preparation efforts and foster collaboration.
This project is gathering metrics of ecological dynamics into one toolbox that will allow ecologists to quantify how communities change over time. It is funded by the National Science Foundation and includes collaborators from the University of New Mexico and the University of Wisconsin-Madison’s Center for Limnology.
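As one illustration of the kind of metric such a toolbox might gather, the sketch below computes total species turnover between two surveys: the number of species gained or lost, divided by the total number of species observed across both. The function name and the sample species lists are hypothetical and are not taken from the project's software.

```python
def species_turnover(before, after):
    """Proportion of species gained or lost between two surveys.

    Hypothetical helper for illustration only; not the project's API.
    """
    before, after = set(before), set(after)
    all_species = before | after
    if not all_species:
        # No species observed in either survey, so nothing changed.
        return 0.0
    gained = after - before   # species present only in the later survey
    lost = before - after     # species present only in the earlier survey
    return (len(gained) + len(lost)) / len(all_species)


# Made-up example data: species recorded at one plot in two years.
year_1 = ["bluestem", "grama", "sagebrush", "yucca"]
year_2 = ["bluestem", "grama", "cheatgrass"]

turnover = species_turnover(year_1, year_2)
print(f"Turnover: {turnover:.2f}")
```

In this made-up example, one species appears and two disappear out of five observed in total, so the turnover is 0.60. A value of 0 means the community composition did not change, and a value of 1 means no species were shared between the two surveys.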
Data provenance involves clarifying where data came from and how scientists have previously used them, which is critical for scientific reproducibility and data reuse.
Our research team is building cyberinfrastructure that will collect and produce information about data provenance to improve researchers’ capacity to share their data along with the processes used to create them, known as scientific workflows. The models and software this team is building will allow detailed descriptions of the journeys of environmental data, including their “retrospective” and “prospective” provenance, that is, their past and possible future uses in scientific workflows.
A growing collection of standard protocols, formats, and vocabularies, often characterized as the Semantic Web ("Web of Data"), offers a powerful approach for publishing research data online. The GeoLink project brings together experts from geoscience, computer science, and library science in an effort to develop Semantic Web components that support discovery and reuse of geoscience data and knowledge.
Participating repositories include content from field expeditions, laboratory analyses, journal publications, conference presentations, theses/reports, and funding awards from many disciplines, ranging from marine geology to paleoclimate.
The U.S. Department of Energy’s (DOE) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) is a repository for Earth and environmental science data, specifically data obtained from observational, experimental, and modeling research funded by DOE’s Office of Science under its Subsurface Biogeochemical Research and Terrestrial Ecosystem Science programs within the Environmental Systems Science activity.
NCEAS data scientists helped build ESS-DIVE in collaboration with scientists from Lawrence Berkeley National Laboratory and the National Energy Research Scientific Computing Center. The project is funded by the Data Management program within the Climate and Environmental Science Division under the DOE’s Office of Biological and Environmental Research program, and is maintained by Lawrence Berkeley National Laboratory.
In recognition that research impact includes the generation of data, Making Data Count is an effort to collect usage and citation metrics for data objects and develop a service that collates and shares these metrics with the scientific community.
This project is working with the research community to develop a clear set of guidelines for defining data usage and to create a central hub for data metrics, including the number of data views, downloads, citations, saves, and social media mentions.
Making Data Count is a partnership between NCEAS, DataONE, the California Digital Library, and DataCite, and is funded by the Alfred P. Sloan Foundation.
MetaDIG provides quality analysis tools for researchers to assess metadata and data records against community recommendations.
A computational engine allows researchers to write discrete metadata checks in multiple languages, including R, Python, and Java. These checks can also operate on data available from the DataONE federation of data repositories, and they return results in a standard format.
MetaDIG supports multiple metadata dialects, including Ecological Metadata Language (EML), ISO 19115, and the Biological Data Profile, among others. This project is supported by the National Science Foundation.
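A discrete metadata check of the kind described above might look like the following minimal sketch. It assumes the metadata record has already been parsed into a plain dictionary; the function name, field names, word threshold, and result format are all hypothetical and do not reflect MetaDIG's actual API or output format.

```python
def check_abstract_length(metadata, min_words=20):
    """Hypothetical check: does the record have a reasonably long abstract?

    `metadata` is assumed to be a dict parsed from a metadata document.
    Returns a small result dict in an illustrative, made-up format.
    """
    abstract = metadata.get("abstract") or ""
    word_count = len(abstract.split())
    passed = word_count >= min_words
    return {
        "check": "abstract_length",
        "status": "SUCCESS" if passed else "FAILURE",
        "message": f"Abstract has {word_count} words (minimum {min_words}).",
    }


# Made-up record used only to exercise the check.
record = {
    "title": "Stream temperature, 2015-2018",
    "abstract": "Hourly stream temperature.",
}

result = check_abstract_length(record)
print(result["status"], "-", result["message"])
```

Keeping each check small and returning the same result shape from every check is what would let an engine run many checks, possibly written in different languages, and combine their output into one report.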
Improving scientists’ ability to find, understand, and integrate data is necessary for synthesis and other large-scale analyses. As more data become shareable via Web-based platforms, these exchanges increasingly rely on aligned terms for describing scientific measurements and metadata.
This research team is creating tools and approaches for describing scientific measurements in standardized ways to optimize information exchange over the Web. This includes building controlled vocabularies, or semantics, of measurements that have been vetted by the ecological and environmental community. This work will facilitate more efficient discovery, interpretation, and reuse of environmental data by promoting greater clarity and consistency in descriptions of scientific measurements.
Whole Tale is a multi-institutional collaborative effort that helps researchers improve the reproducibility of their research. By offering tools and guidance, it enables researchers to develop and share “living publications” that integrate data, code, and scholarly articles.
Our collaborators include the University of Illinois at Urbana-Champaign, University of Chicago, University of Texas at Austin, and the University of Notre Dame.