Discovering existing Earth and environmental science data that can be applied to answering new questions or testing new hypotheses is ever more important in the era of “big data.” Data Observation Network for Earth (DataONE) is building a cyberinfrastructure that facilitates data discovery and access and is working to foster a culture of data sharing and sound data management. As part of this effort, DataONE’s Community Engagement and Education working group has contributed to the development of a data management education curriculum.
A formal evaluation of the curriculum, as implemented in a 2-day data management workshop, highlighted the need for infusion of more ‘real-life’ stories into the education materials and inspired the launch of the Data Stories project. For this project, Jessica Bragg and Stacy Rebich Hespanha are conducting interviews with researchers and data managers to collect success stories and cautionary tales related to data management and sharing. As the project progresses, they will be integrating these stories into the curriculum and publishing them online as a resource for the data management education community.
Stacy is also collaborating with members of the DataONE Data Integration and Semantics working group and the SONet project to enhance environmental and earth science ontologies (sets of concepts and relationships between concepts that have been formally described) and apply them in the context of DataONE query tools. By integrating these formal ontologies and taxonomies with statistical representations derived from keywords and descriptions researchers use to describe their data, they will facilitate searching for data by offering suggestions for query refinement through an interactive interface. The background image is based on the outputs of a statistical topic model (Latent Dirichlet Allocation), and illustrates the thematic breadth of data currently available for discovery through DataONE.
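To make the idea of query refinement concrete, the following is a minimal sketch of how suggestions might be derived from keyword co-occurrence. It is a simplified stand-in, not the DataONE implementation: it uses plain co-occurrence counts rather than a topic model or an ontology, and the keyword lists and the function names `build_cooccurrence` and `suggest_refinements` are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical keyword lists, one per metadata record (illustrative only).
RECORDS = [
    ["soil", "moisture", "temperature"],
    ["soil", "carbon", "nitrogen"],
    ["soil", "moisture", "precipitation"],
    ["stream", "temperature", "discharge"],
    ["stream", "discharge", "precipitation"],
]


def build_cooccurrence(records):
    """Count how often each pair of keywords appears in the same record."""
    counts = Counter()
    for keywords in records:
        # Sorting makes each pair appear in one canonical order.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts


def suggest_refinements(term, counts, limit=3):
    """Return up to `limit` keywords that most often co-occur with `term`."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == term:
            related[b] += n
        elif b == term:
            related[a] += n
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(related.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:limit]]


counts = build_cooccurrence(RECORDS)
print(suggest_refinements("soil", counts))
```

A search for "soil" would then surface the keywords that most often accompany it in the toy records, which a user could add to narrow the query. In a fuller system, terms related through an ontology could presumably be merged into the same ranked list.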
DataONE currently holds a collection of over 125,000 metadata records linked to data sets in 10 environmental and earth science data repositories. This graphic shows the key themes represented in those data sets; the themes were identified with a Latent Dirichlet Allocation topic model, which is based on statistical patterns of word co-occurrence in the data descriptions. Circles represent groups of very similar data sets, and the size of each circle represents the number of data sets in that group. Labels indicate terms commonly used to describe data sets. Colors represent “clusters” of similar data sets; the descriptions of data sets in circles of the same color are more similar to each other than to the descriptions of data sets in circles of other colors. Circle locations are assigned by a Self-Organizing Map algorithm that attempts to keep similar data sets as close together as possible.
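As a rough illustration of how a Self-Organizing Map can place similar items near each other, here is a minimal pure-Python sketch. It is not the code used to produce the graphic. The input vectors are hypothetical topic proportions for six imaginary data sets, and the grid size, learning rate, and radius schedule are assumptions chosen only to keep the example small.

```python
import math
import random

# Hypothetical topic-proportion vectors (three topics) for six data sets.
SAMPLES = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.0],
    [0.1, 0.9, 0.0], [0.0, 0.8, 0.2],
    [0.0, 0.1, 0.9], [0.1, 0.0, 0.9],
]


def best_cell(grid, x):
    """Return the grid coordinate whose weight vector is closest to x."""
    return min(
        grid,
        key=lambda cell: sum((wi - xi) ** 2 for wi, xi in zip(grid[cell], x)),
    )


def train_som(samples, rows=3, cols=3, epochs=200, seed=0):
    """Train a small Self-Organizing Map and return its grid of weights."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # One weight vector per grid cell, initialised at random in [0, 1).
    grid = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        # Learning rate and neighbourhood radius both shrink over time.
        frac = 1.0 - epoch / epochs
        lr = 0.5 * frac
        radius = max(rows, cols) / 2.0 * frac + 0.5
        for x in rng.sample(samples, len(samples)):
            br, bc = best_cell(grid, x)
            for (r, c), w in grid.items():
                # Cells close to the winning cell are pulled harder toward x.
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                pull = lr * math.exp(-grid_dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += pull * (x[i] - w[i])
    return grid


som = train_som(SAMPLES)
for sample in SAMPLES:
    print(sample, "->", best_cell(som, sample))
```

Each vector is assigned to the grid cell whose weights it most resembles. Because neighbouring cells are updated together during training, vectors with similar topic proportions tend to land in the same or adjacent cells, which is the property the circle layout relies on.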