National Center for Ecological Analysis and Synthesis

Whither Data?

In the first essay in this series (J. H. Brown, 1997, EcoEssay 1), Jim Brown bemoaned the perceived lack of fundamental progress in ecology, at least since the mid-1970s. While his general conclusions are certainly debatable, one comment caught my eye: "There has not been conceptual progress in organizing and synthesizing existing information..." Despite increasing numbers of conferences, journals, and compendia devoted to reviews and syntheses, most publications in these outlets are little more than puff pieces that allow authors to highlight their own recent research findings or to advance their own points of view. Because existing information, i.e., data, is hard to find or to collate reliably into forms suitable for comparative analysis, these publications rarely are true organizations or syntheses of existing data. We know that data underlie all aspects of ecological research. They are needed to test hypotheses, models, and theories, and they provide uncomfortable prods in new directions when they fail to support our pet ideas (see, for example, R. Hilborn & M. Mangel, The Ecological Detective, Princeton University Press, 1997). Yet good data are hard to come by. The purpose of this essay is to discuss some reasons that good data are hard to find, and to draw attention to mechanisms that can increase data accessibility and sharing within the ecological community.

Funding agencies disburse millions of dollars annually to scientists who collect data bearing on basic and applied research projects. Results from these data-gathering exercises, a.k.a. experiments, are compiled, synthesized, and published, but the raw data themselves are usually stored in the equivalent of a locked file drawer in the basement of an abandoned building on a small, empty planet somewhere in the vicinity of Alpha Centauri. It may come as a surprise to many readers of this essay that usable ecological data are hard to find, especially in these days of instant global communication, increasing numbers of interdisciplinary, collaborative enterprises, NCEAS, and the promises of the world-wide-web. Yet think back to preparing your last grant proposal, or your most recent manuscript, when you read a paper whose conclusions seemed relevant to your ideas and you wanted to compare the published results with your own. With a looming deadline, you surfed the web and found the author's e-mail address. Off went the note asking for the raw data. Hope turned to despair when you received (choose one or more): (a) no response; (b) a response that the data were hand-written in a yellowing notebook that was too brittle to fax or photocopy; (c) a uuencoded Excel file that you had no way of opening; (d) a flat ASCII file with columns labeled in unintelligible code; (e) a formal letter from the investigator's lawyers threatening you with a civil suit for intellectual piracy. Undeterred, you spent a few extra hours digitizing figures and simulating distributions from published means and variances in order to recover an approximation of the dataset. If all this seems too much of a stretch, try reanalyzing a dataset that you collected 15 years ago, haven't looked at since, and that is stored on a deck of punch cards or (if you're lucky) an 8" floppy disk on your bookshelf. Ironically, the paper you read, or the paper you published 15 years ago, acknowledged the NSF for support of the research, and NSF guidelines explicitly state that all data collected using government funds reside in the public domain, freely available to all.

Not all scientists are as tight with their data as ecologists. But then, most other scientific disciplines rely on experimental replication before publication. Because the phenomena that we study are spatially and temporally contingent, we rarely truly repeat experiments and are much more dependent on the comparative method to test general theories. We also routinely invest years in learning detailed natural history and developing our own unique model systems, the payoff for which is publications leading to recognition and job promotion. Sharing unique data, before or after publication, could preclude subsequent publication of those data, and it also treads heavily on our idiosyncratic individualism. However, raw data often are more valuable than the published analyses, since other investigators, and especially "synthesizers," rarely ask the same questions for which the data were originally gathered. Further advancement of our discipline, especially in areas dependent on broad comparisons (e.g., Jim Brown's "macroecology") and in those focused on time-sensitive issues, such as endangered species or site conservation and management, requires access to existing data in usable form. To facilitate such access, I suggest two changes in the way we think about data. First, we need to publish the data themselves, not simply the statistical analyses, attendant data reduction, and interpretation. Second, and perhaps more importantly, we need to fundamentally change the culture of ecology from one based primarily on individuals studying "obscure organisms in pristine sites" to one based on investigators who see themselves as part of an ecological community working towards a common understanding of the functioning of the biosphere. That understanding will derive from a common and evolving ecological dataset, which I refer to as the tapestry of nature.

Data should be published

One way to make raw data available is to publish them. Where limited journal pages and page charges once precluded publication of raw data, data now can be published electronically with much less regard for cost or length. Appendices to articles, containing raw data or detailed summary tables, can be posted on official web sites. Many disciplines require submission or computer archiving of raw data prior to publication of the articles with which they are associated (e.g., gene sequences, x-ray crystallography coordinates, statistical code or software). Individuals and groups of investigators should be encouraged, through standard academic reward mechanisms, to collate, organize, and synthesize data, and, through peer-reviewed publication of such collations and syntheses, to make others aware of the importance of these datasets. Beginning in 1998, the ESA family of journals will support electronic publication of appendices, supplemental datasets, and full, peer-reviewed data papers. The form of the latter likely will differ substantially from existing publications of experimental or theoretical results, but the scope will be broadly the same: collation, presentation, and synthesis of existing data of demonstrable importance to the ecological community.

But what does it really mean to compile and publish a dataset? One of the most difficult and time-consuming tasks is producing adequate documentation of a dataset (the so-called "metadata"; see Michener et al., Ecological Applications 7: 330-342, 1997). A careful reading of that paper suggests that compiling metadata sufficient to permit investigators unfamiliar with the system to use the associated dataset is a task at least equivalent to producing a "normal" publication. Michener et al. identify a number of different levels of metadata, each appropriate to different uses, ranging from informal sharing of data among close colleagues to use of a dataset by individuals who are completely unfamiliar with the study system and have no way to contact the person who collected the data. Proposals such as those of Michener et al. also illustrate the need for metadata "standards": items of information that are recorded by all investigators regardless of study system. Such standard items range from the apparently simple (latitude, longitude, elevation relative to sea level, time and date of data collection) to the more complex (quality-control procedures, data-acquisition methods, history of dataset usage). Good examples of standardized data collection can be found in datasets produced by long-term ecological research (LTER) sites. Well-documented compiled datasets (e.g., 5715 time-series of annual abundances of 447 species of moths and aphids in the UK from the Rothamsted Insect Survey: I. P. Woiwod & I. Hanski, J. Anim. Ecol. 61: 619-629, 1992) and on-going, maintained datasets (e.g., the North American Breeding Bird Survey) continue to provide new theoretical insights into ecological processes. As ESA moves into the realm of publishing data, we hope slowly to change the ecological culture from one biased in favor of idiosyncratic styles of data collection and collation to one dominated by an awareness of the need to standardize data collection to the greatest extent possible.
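
As a purely illustrative sketch of the kind of standard items described above, the following Python fragment defines a minimal dataset-level metadata record. The field names and example values are hypothetical, invented for this essay rather than taken from the Michener et al. proposal; they simply combine the simple items (location, elevation, dates) with the more complex ones (acquisition methods, quality control, usage history) mentioned here.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class DatasetMetadata:
        """Minimal dataset-level metadata record; field names are illustrative."""
        title: str
        investigator: str
        latitude_dd: float        # decimal degrees, positive north
        longitude_dd: float       # decimal degrees, positive east
        elevation_m: float        # elevation relative to sea level, in meters
        collection_start: str     # ISO 8601 date, e.g. "1990-05-01"
        collection_end: str
        acquisition_methods: str  # how the data were gathered
        quality_control: str      # checks applied before archiving
        usage_history: List[str] = field(default_factory=list)  # prior analyses or publications

    # A hypothetical record for an invented moth-abundance dataset
    record = DatasetMetadata(
        title="Annual moth abundances at a hypothetical site",
        investigator="A. Ecologist",
        latitude_dd=42.53,
        longitude_dd=-72.19,
        elevation_m=340.0,
        collection_start="1990-05-01",
        collection_end="1995-09-30",
        acquisition_methods="Nightly light-trap counts, pooled by year",
        quality_control="Double data entry; species names checked against a reference list",
    )

Even a skeletal record like this makes clear why complete documentation approaches the effort of a conventional publication: every field must be defined, its units stated, and its provenance recorded before a stranger can use the data with confidence.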

In addition, investigators and funding agencies need to recognize that there are real costs, in terms of money and personnel time, to data management and sharing. Just as collecting data is difficult, time-consuming, and costly, so are data management, archiving, and sharing. Discussions among participants in the ESA's Special Committee on Data Archiving and Sharing suggest that costs for data management (i.e., entry, complete documentation, and long-term archiving and curation) are on the order of 30% of normally budgeted direct costs. While these costs may decline with practice or improving technology, they will never disappear entirely. Acknowledging these costs carries with it the responsibility for investigators to budget for them, and for funding agencies to appropriate funds for them.

Ecological culture needs to change

I assert that, in order to better serve not only the current community of ecologists and environmental policy- and decision-makers, but also our students and our students' students, we need to change the prevailing "culture" of ecology. A glance through any recent ESA annual meeting program reveals that most ecological studies are apparently unconnected with any others. Sitting through a contributed-paper session or wandering through the posters only reinforces this perception. Most ecologists work on their own isolated systems and pay only lip service to others working on similar problems. Data rarely are shared, and usually only after the maximum number of least publishable units has been dredged from them. Most importantly, explicit quantitative or statistical comparisons with other studies are unusual, although such comparisons can lead to new insights and research directions.

I do not mean to imply that every ecologist should be working at an LTER site, or that only large, multi-investigator efforts are worthwhile (although the development of ecological "model systems" should be encouraged). Rather, it would be a good first step if individual investigators not only recognized that others are working in the same conceptual area (as indicated by the numerous citations in paper introductions) but also recognized that those others are generating useful comparative data. The increasing awareness and use of meta-analysis (e.g., G. Arnqvist & D. Wooster, Trends Ecol. Evol. 10: 236-240, 1995) and Bayesian statistical techniques (e.g., Hilborn & Mangel, op. cit.) illustrate the utility of building quantitatively on existing data, and NCEAS provides a valuable nexus for such synthetic activities. However, only a small fraction of ecologists have training or experience in meta-analysis or Bayesian inference, and fewer still have worked or will work at NCEAS. More generally, education and training in the use and synthesis of existing data, and in collegial sharing of data, should occur wherever ecology is taught. Examples include networks of environmental monitoring projects at the pre-college level (e.g., NOAA's GLOBE project), increasing emphasis on collaborative activities in undergraduate ecology classes (and elsewhere in science curricula), and training at the undergraduate and graduate levels not only in critiquing, but also in constructively using, the primary literature. For new master's and doctoral students, literature reviews for thesis proposals should include data synthesis from the assembled bibliography. Such bibliographies should extend beyond the most recent decade's publications. "Preliminary data" sections of research proposals could include meta-analyses or Bayesian prior probability distributions derived from comparable studies in order to support the need for further experimental work. All of these activities would reinforce the notion that published data are valuable in their own right, not just for the conclusions drawn from them, and would further encourage individuals to weave their own data into a common warp and weft.
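
To make the last suggestion concrete, here is a minimal sketch, with invented numbers, of how a prior probability distribution might be derived from comparable published studies. Each hypothetical study reports a mean effect and its standard error, and simple inverse-variance (fixed-effect) pooling yields a normal prior for the effect of interest. Both the values and the pooling rule are illustrative assumptions, not a prescription; a real proposal would justify the studies chosen and the weighting scheme.

    import math

    # Hypothetical published estimates of the same effect, as (mean, standard error);
    # the numbers are invented purely for illustration.
    studies = [
        (0.42, 0.15),
        (0.31, 0.10),
        (0.55, 0.20),
    ]

    # Fixed-effect, inverse-variance pooling: weight each study by 1 / SE^2.
    weights = [1.0 / se ** 2 for _, se in studies]
    pooled_mean = sum(w * m for (m, _), w in zip(studies, weights)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))

    # The pooled estimate could serve as a normal prior, N(pooled_mean, pooled_se^2),
    # for the effect to be measured in the proposed experiment.
    print(f"prior mean = {pooled_mean:.3f}, prior sd = {pooled_se:.3f}")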

Why bother?

It is a common assertion among sociologists of science that scientific conclusions depend on prevailing cultural norms (e.g., B. Latour, Science in Action, Harvard University Press, 1987). The scientific community's response to, and critique of, this strongly relativist position (e.g., P. R. Gross & N. Levitt, Higher Superstition: The Academic Left and its Quarrels with Science, Johns Hopkins University Press, 1994) is rooted in the notion that there is an objective "real world" whose predictability is governed by rules discovered using the scientific method. The resolution of the debate between objectivist scientists and relativist sociologists of science hinges on data. As scientists, we argue that our conclusions depend on, and should reflect, the available data. Those conclusions may be influenced by intellectual fashion and available bandwagons, but data collected using sound, clearly described procedures, and available to all, adjudicate decisively among competing hypotheses. Only those individuals who contend that there is no "real world" can deny the utility of data used in this way to discern the laws of nature.

On a more mundane level, data should be the basis of decisions made to manage, conserve, restore, and protect the organisms and ecosystems that we study. If the data that we have collected are not made available in usable fashion to those to whom we have entrusted the power to make such decisions, then we not only limit their ability to work in a reasoned way but also reduce opportunities for future ecological study. In either case, we diminish the tapestry of nature.

Citation format: Ellison, Aaron M. 1998. Data: the tapestry of nature. EcoEssay Series Number 2. National Center for Ecological Analysis and Synthesis. Santa Barbara, CA.
