NCEAS Product 2000

Nottrott, Rudolf; Jones, Matthew B.; Schildhauer, Mark P. 1999. Using XML-Structured Metadata to automate quality assurance processing for ecological data. Proceedings of the Third IEEE Computer Society Metadata Conference. IEEE. Bethesda, MD.

Abstract

In an attempt to mine the rich collection of ecological information, ecology and related disciplines have recently increased their emphasis on post hoc synthesis and analysis of existing data. Consequently, investigators must integrate large numbers of data sets with varying schemas from diverse sources. These investigators generally have little or no knowledge of the history, quality, and reliability of the data. Such data integration efforts are frequently necessary precursors to more cogent ecological analyses using the newly synthesized data.

We describe a prototype data validation system that uses structured metadata expressed in eXtensible Markup Language (XML) as the basis for automating quality assurance processing and thereby simplifying the process of data integration. Using a Java-based metadata editor, investigators produce metadata documents that describe the data set schema, variable naming conventions, range and type information for each of the variables in the data set, and details about data anomalies and errors. The validation system generates code from the metadata description of the data schema to load the data set into a 4GL analysis package, in this case SAS, and then check that each observation in the data set conforms to the assertions described in the metadata. A number of generic quality assurance analyses are then performed on the data based on values that have been extracted from the metadata. For example, the validation system automatically checks the type and range of all variables, checks that relationships among data tables are consistent and complete, and produces summaries of the data for inspection by the scientific user. This type of validation is critical to evaluating the reliability of data sets that are unfamiliar to investigators, and reveals areas where the investigator's metadata-encoded expectations about data set content differ from reality.
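The type, range, and table-relationship checks described above can be illustrated with a minimal sketch. It is not the prototype itself: the prototype generates SAS code from the metadata, whereas this sketch validates records directly in Python. The XML element and attribute names (`dataset`, `variable`, `type`, `min`, `max`), the sample variables, and the function names `load_assertions`, `validate`, and `orphaned_keys` are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical metadata layout; the element and attribute names are
# illustrative assumptions, not the schema used by the prototype.
METADATA_XML = """
<dataset name="plots">
  <variable name="plot_id" type="integer" min="1" max="500"/>
  <variable name="species" type="string"/>
  <variable name="biomass_g" type="float" min="0" max="10000"/>
</dataset>
"""

# Map the declared type names to Python conversion functions.
CASTS = {"integer": int, "float": float, "string": str}


def load_assertions(xml_text):
    """Return {variable name: (cast function, min or None, max or None)}."""
    root = ET.fromstring(xml_text)
    assertions = {}
    for var in root.findall("variable"):
        cast = CASTS[var.get("type")]
        lo = var.get("min")
        hi = var.get("max")
        assertions[var.get("name")] = (
            cast,
            cast(lo) if lo is not None else None,
            cast(hi) if hi is not None else None,
        )
    return assertions


def validate(rows, assertions):
    """Check each observation; return a list of (row index, variable, problem)."""
    problems = []
    for i, row in enumerate(rows):
        for name, (cast, lo, hi) in assertions.items():
            raw = row.get(name)
            if raw is None or raw == "":
                problems.append((i, name, "missing value"))
                continue
            try:
                value = cast(raw)
            except ValueError:
                problems.append((i, name, "wrong type"))
                continue
            if lo is not None and value < lo:
                problems.append((i, name, "below minimum"))
            elif hi is not None and value > hi:
                problems.append((i, name, "above maximum"))
    return problems


def orphaned_keys(child_rows, parent_rows, key):
    """Return key values in the child table with no matching parent record."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)
```

Calling `validate(rows, load_assertions(METADATA_XML))` on a list of dictionaries, one per observation, returns the observations that contradict the metadata. This is the kind of mismatch between expectation and content that the abstract says the validation system reveals.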

Currently, this type of quality assurance processing is performed manually for each data set, which is feasible only for small numbers of uniformly structured data sets. An automated data set validation engine is therefore a substantial advantage when researchers try to integrate large numbers of unfamiliar data sets, as is typical of the data-driven synthetic investigations now emerging in ecology.