Alroy in review, Paleobiology
Methods for removing sampling biases from diversity curves

John Alroy

RRH: SAMPLING METHODS AND DIVERSITY CURVES
LRH: JOHN ALROY


Abstract.--Interest in sampling problems has increased as paleobiologists have recognized the dangers of undersampled age-range and diversity data. Although there now are several important methods of correcting for sampling bias, they have not been tested thoroughly against real data sets with clearly defined sampling regimes. Here I compare seven major methods by using them to remove sampling effects from a single diversity curve. The data are Cenozoic age-ranges of North American mammal species. Even the more complicated methods can be applied because 1) the age-ranges are tied to occurrences in faunal lists; 2) each of the faunal lists has its own age estimate; and 3) the data can be binned into uniform 1 m.y. time intervals. The major criterion for evaluating the methods is the degree to which their correction of the diversity curve corresponds with known variation in sampling intensity, as measured by the number of faunal lists or taxonomic records in each interval. Other parameters such as preservation of the curve's original variance, serial correlation, overall trend, and net changes also are considered. The simplest method is to eliminate species that only range through a single time interval. Unfortunately, this appears to have little effect. The next three methods directly compute a diversity estimate for each interval, making it impossible to specify origination and extinction rates. The Chao-2 equation is extraordinarily sensitive to binomial error and therefore essentially uninformative. The FreqRat equation has similar problems, but even when these are addressed by smoothing the data the results are still idiosyncratic. A third equation based on counting Lazarus taxa performs reasonably well. The remaining three methods modify age-ranges and therefore do provide turnover rates, but each requires higher-quality data than often are available. 
Confidence limits appear to be insensitive to long-term trends in sampling, but are otherwise somewhat informative. A method based on counts of simulated ghost lineages overcorrects low-diversity intervals, but again does generally improve the curve. Finally, rarefaction vastly outperforms all the other methods, but violation of its assumptions seems to cause some distortion. The results suggest that although some methods are highly misleading and none of them are perfect, most are worth applying. Paleobiological diversity curves generally should be corrected using rarefaction when taxonomic lists are available; when they are not, the Lazarus approach provides a simple and reasonable alternative.
John Alroy. Department of Paleobiology, Smithsonian Institution, MRC 121, Washington, DC 20560


Introduction

..........In recent years, paleobiologists have become much more concerned about the impact of undersampling on age-ranges and counts of taxa. New methods for removing sampling biases have proliferated as a result. These approaches seem to attack different problems: fixing the temporal origin of sister taxa (Paul 1982; Norrell 1992), setting confidence intervals on age-ranges (Paul 1982; Strauss and Sadler 1989; Marshall 1990, 1994), identifying mass extinction horizons (Marshall 1995; Marshall and Ward 1996; Solow 1996), determining the number of missing species in a single sample (Colwell and Coddington 1994; Wing and DiMichele 1995; Anderson et al. 1996) or a wide-ranging data set (Foote and Raup 1996), assessing sampling completeness in different intervals or groups (Paul 1982; Holman 1985; Krause and Maas 1990; Maas et al. 1995; Erwin 1996), comparing diversity levels in different intervals (Wing et al. 1995; Wing and DiMichele 1995; Miller and Foote 1996), and correcting diversity curves (Sepkoski 1993b; Alroy 1996, in press; Miller and Foote 1996).
..........This cavalcade of problems and proposed solutions is baffling because almost every new approach makes use of a fundamentally different data source in a seemingly distinct way. But in fact, at least seven of the most popular methods can be applied to a single important problem: reconstructing diversity histories on long time scales. The purpose of this paper is to apply all seven methods to a single, well-constrained data set that illustrates the Cenozoic diversification of North American mammals (Alroy 1996, in press). A suite of evaluative statistics will show how each method changes the shape of the diversity curve, and whether these changes conform with sampling intensity curves based on direct tabulations of faunal lists.
..........Each of the methods has potential use when all others are simply inapplicable, or when their own assumptions are well met. However, I will argue that certain approaches fail so completely in this particular example that they must be treated generally with considerable caution. Furthermore, sampling biases in real data are indeed quite severe: to erase these biases, the best methods have to modify the apparent diversity pattern considerably. Even worse, none of these better methods are applicable to widely studied, large-scale data sets like those of Sepkoski (1982, 1993a), Niklas et al. (1980), and Benton (1995) that only document first and last appearances of taxa.
..........All of this raises the question of how much we really know about large-scale diversity patterns in the fossil record. Do we really understand the tempo and frequency of mass extinctions, the interactions among diversity histories of different groups, the interplay between diversification and environmental change, and the relative pace of morphological and taxonomic diversification? Or are all of these patterns side-effects of undersampling? Perhaps the time has come to address such concerns in a serious way. Any solution to the problem will have to involve additional locality-specific distributional data sets, such as those employed by Bambach (1977), Sepkoski and Sheehan (1983), Sepkoski and Miller (1985), Sepkoski (1988), Lidgard and Crane (1988, 1990), Alroy (1992, 1994, 1996, in press), Lidgard et al. (1993), Markwick (1994), Wing et al. (1995), Miller and Foote (1996), and Lupia et al. (in press).

Data set

..........The North American mammalian paleofaunal database employed here has been discussed previously (Alroy 1992, 1996, in press; Wing et al. 1995). In summary, the data consist of 4015 genus- and species-level faunal lists for all North American mammals except the marine clades Cetacea and Pinnipedia and the volant order Chiroptera. The lists are mostly up-to-date, having a median publication date of 1981. They span the Aquilan (early Campanian: about 84 Ma) through Sangamonian (last interglacial: 0.1 Ma) land-mammal ages, include 1195 genera and 3182 species, and have been standardized taxonomically using a database of 451 invalid genera, 2692 invalid species, and 1197 invalid genus-species combinations. Temporal information is provided by 186 stratigraphic sections that include 2499 of the lists, as well as 155 radioisotopic and paleomagnetic age estimates. Together, the faunal, taxonomic, stratigraphic, and geochronologic data are tied to a set of 2415 references. The data are available on the World Wide Web at http://homebrew.si.edu/nampfd.html.
.......... Although there is a well-established system of North American land-mammal ages (Woodburne 1987), this and previous studies proceed directly from the faunal list database to a temporally calibrated diversity curve (Fig. 1: see Alroy 1996, in press; Wing et al. 1995). The procedures are detailed in Alroy (1992, 1994, 1996, in press). The steps can be summarized very broadly as follows. First, the lists are arrayed in a relative sequence using a multivariate method called appearance event ordination. This in turn defines a relative sequence of first and last appearance events, with one event of each type for each genus or species. The events are numbered from oldest to youngest, the concurrent range-zones of the faunal lists are converted to these numbers, and the concurrent range-zones are then calibrated to numerical time using geochronological age estimates. Because the relationship is strongly monotonic but fundamentally non-linear, a method called "hinge" regression is used to define separate linear regression segments.
.......... The temporally calibrated sequence is equivalent to a set of age-ranges and therefore implies a diversity pattern and turnover rates. Separate age-ranges for the genera and species are combined using a set of "species-lineage" rules that merge and fill gaps in the species-level age-ranges on the basis of their generic allocations. For a study like the current one that is not concerned with turnover rates, ranges do not need to be merged but gaps do need to be filled: in any interval within the range of a genus but outside the range of any included species, one lineage of that genus must be declared present.
.......... Although an analysis could proceed directly from the calibrated curve, this would present two major problems. First, turnover rates cannot be defined if each appearance event is treated as a separate entity; calculating rates instead requires lumping series of successive events into bins. Second, many sampling correction methods also require binning of appearance events and/or faunal lists for computational purposes. For both of these reasons, the data are best analyzed by imposing a uniform sampling interval on the curve, with one data point taken every fixed number of years. This has the side-benefit of avoiding the many difficulties created by traditional, uneven time intervals, e.g., distortion of extinction metrics (Foote 1994). I previously derived an optimality criterion that suggests a sampling interval of 0.7 m.y. (Alroy 1996, in press), but opted to use an interval of 1.0 m.y. instead. This approach is not only conservative, but accords with errors around the geochronological calibration curve and with unpublished results suggesting that no more than 40 - 50 named Cenozoic mammal biochrons can be distinguished statistically.
.......... A final concern is geographic sampling. The Cretaceous and Paleogene localities are very narrowly distributed, with almost no mammal fossils of this age having been found outside of the foreland and intermontane basins of the Western Interior (Alroy 1996, in press). The pattern changes abruptly in the Miocene, and by the Pliocene there are many localities in the Gulf Coast, Great Basin, and West Coast. Because beta diversity becomes significant at this geographic scale for mammals, any naive comparison of late Neogene and earlier diversity will have a built-in bias that might defeat any sampling correction method. Therefore, the present study follows earlier ones in excluding all regions outside of the western U.S., Mexico, and western Canada from the diversity data. Even with this restriction, the number of faunal lists fluctuates wildly through the time series (Fig. 2). The extraordinary degree of temporal variation in sampling is the motivation for this paper.

Sampling correction methods

..........Seven important paleobiological methods have been used to estimate either age-ranges, sampling intensities, or standing diversities: removal of singleton taxa; the Chao-2 estimator of local diversity; the FreqRat estimator of sampling completeness; counts of Lazarus taxa; computation of confidence intervals on age ranges; counts of ghost taxa; and rarefaction of taxonomic lists. Any of these estimates can be used to correct a diversity curve. This section outlines the methods, their previous applications, their potential biases, and the exact means of using them to construct diversity curves.
.......... Several other methods are not analyzed here. In most cases this is due to difficulty in specifying how they might be used to correct a diversity curve, but four important exceptions are discussed by Raup (1976), Pease (1985, 1992), Signor (1978, 1985), and Nichols and Pollock (1983).
.......... Raup's approach is perhaps the simplest and most appealing of any that have been proposed: apparent diversity for each interval is regressed against a measure of sampling such as outcrop area, and the residual values are then computed. His method has been followed by a few later workers (e.g., Niklas 1978; Raymond and Metz 1995), but it suffers from a serious flaw. Time series such as diversity curves and outcrop area curves are by their nature autocorrelated. Therefore, spurious cross-correlations between such time series will arise even when there is no underlying causal connection between the data sets (McKinney 1990). On the other hand, a true relationship might be expressed in subtle variation that is swamped out by the features resulting from autocorrelation. So taking residuals will destroy legitimate patterns when the sampling data are irrelevant, and potentially fail to correct for the true problem when they are not.
.......... It is possible to modify Raup's approach by first eliminating autocorrelation in the diversity and sampling curves using standard methods (McKinney 1990). However, the resulting "diversity curve" would be so far removed from the data - first logged, then detrended, then differenced, then transformed into residuals - that it would be hard to interpret. Such an approach might be worth pursuing, but will be put aside for now in the interest of economy.
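.......... To make the modified residual approach concrete, the following is a minimal sketch, not a procedure used in this study. It assumes that logging and first-differencing are an adequate stand-in for the "standard methods" of removing autocorrelation, and it skips a separate detrending step. The function name and the example numbers are hypothetical.

```python
import math

def residuals_after_differencing(diversity, sampling):
    """Regress differenced log diversity on differenced log sampling.

    Assumption: logging plus first differences is enough to remove
    autocorrelation. Returns one residual per pair of adjacent intervals.
    """
    # First differences of the logged series (one value per interval pair)
    dx = [math.log(b) - math.log(a) for a, b in zip(sampling, sampling[1:])]
    dy = [math.log(b) - math.log(a) for a, b in zip(diversity, diversity[1:])]
    n = len(dx)
    mean_x, mean_y = sum(dx) / n, sum(dy) / n
    # Ordinary least-squares slope and intercept
    sxx = sum((x - mean_x) ** 2 for x in dx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dx, dy))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(dx, dy)]

# Hypothetical counts of species and faunal lists in five intervals
diversity = [10, 20, 15, 30, 25]
sampling = [5, 12, 6, 20, 9]
res = residuals_after_differencing(diversity, sampling)
```

.......... The output is a series of residual changes between intervals rather than a diversity curve in the usual sense, which illustrates why such results would be hard to interpret.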
.......... The equations of Pease (1985, 1992) assume that fossilization occurs at a constant rate and that fossils are destroyed by geological processes at another constant rate, which results in a fossil record that steadily improves through time. Fig. 2 shows that the supply of fossils does not improve monotonically; it fluctuates, with some intervals being an order of magnitude better sampled than others, and the best-sampled interval occurring early in the time series. Therefore, Pease's approach is definitely inappropriate for the current data set and probably just as inappropriate for any other paleobiological data set.
.......... The method of Signor (1978, 1985) makes some very strong assumptions about randomness of sampling, constancy of turnover, and the log-normal nature of the underlying dominance-diversity pattern. It also requires the user to choose arbitrarily among relative sampling measures; all other methods are intrinsically tied to one measure or another. Sepkoski (1994) showed that these assumptions are far from trivial. In particular, variation in randomness of sampling can produce extraordinary changes in the Phanerozoic marine diversity pattern that Signor (1985) was trying to reconstruct. Therefore, the model is too sensitive to unknown and possibly indeterminable parameters to be informative, and I believe that it does not require further investigation.
.......... The Jolly-Seber capture-recapture estimator of Nichols and Pollock (1983) is very similar to the Lazarus taxon method described below; for example, both of them rely upon counts of range-through taxa that fail to be sampled. However, it can be shown algebraically that the two methods only yield identical results when all taxa sampled in the interval of concern have been sampled in either a previous or a later interval. The peculiarity of this requirement suggests that the Jolly-Seber method makes unusual assumptions about the constancy of either sampling probabilities or turnover rates. Because of space constraints, and because the Lazarus method certainly does not make such assumptions, the present analysis will focus on the Lazarus method alone. However, the Jolly-Seber method may merit investigation in the future.
Lumping by interval.--It is important to note that after sampling the curve every 1.0 m.y., no age ranges have been lumped: all the counted taxa are known to cross a particular time plane, not merely to range partially into a broad interval. Therefore, the data still represent absolutely minimal estimates of diversity that apply to individual moments in time. Some methods, however, cannot proceed from this kind of "single time-plane" diversity data and do require counting all the species that range anywhere into each 1.0 m.y. interval (lumping). Because there is some concern that the time-plane and lumped data may present very different patterns, lumping will be analyzed as a "correction" method.
.......... It seems likely that lumping will exacerbate any variation in the diversity curve that is due to sampling. Additionally, it should exaggerate diversity in time periods that have high background turnover rates. Previous studies showed that the Paleocene witnessed much higher mammalian origination and extinction rates than the remaining Cenozoic epochs (Alroy 1996, in press), making this concern acute. Therefore, lumping is essentially an "anti-correction" method, and one would prefer it to have minor effects if any.
.......... Lumping can be avoided not just when using age-ranges based on ordination like the current one, but when using more generalized age-range data sets such as global genus/stage-level compilations. One need merely replace counts of taxa ranging into intervals with counts of taxa crossing interval boundaries. By analogy with the ordination-based curve, taxa would be counted only if they ranged from some interval i to the next interval i+1; by definition, all of these taxa must have existed at the time plane fixing the boundary between i and i+1. Despite the apparent advantages of this approach, I am not aware of its previous application in the literature.
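.......... The difference between lumped counts and boundary-crossing counts can be illustrated with a short sketch. It assumes that each taxon is represented by a hypothetical (first, last) pair of integer interval indices in temporal order; the function names and toy ranges are illustrative only.

```python
def lumped_diversity(ranges, n_intervals):
    """Count every taxon whose range enters each interval (lumping)."""
    counts = [0] * n_intervals
    for first, last in ranges:
        for i in range(first, last + 1):
            counts[i] += 1
    return counts

def boundary_diversity(ranges, n_intervals):
    """Count taxa that cross the boundary between interval i and i+1."""
    counts = [0] * (n_intervals - 1)
    for first, last in ranges:
        # A taxon crosses boundary i|i+1 when first <= i < last
        for i in range(first, last):
            counts[i] += 1
    return counts

# Five hypothetical taxa; two of them are confined to a single interval
ranges = [(0, 0), (0, 2), (1, 2), (2, 2), (1, 3)]
lumped = lumped_diversity(ranges, 4)      # [2, 3, 4, 1]
crossers = boundary_diversity(ranges, 4)  # [1, 3, 1]
```

.......... Taxa confined to one interval contribute to the lumped counts but never to the boundary counts, so the boundary counts are minimal estimates that apply to single time planes.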
Non-singletons.--Taxa that occur only in one time interval (singletons) are known to have unusual sampling properties, and usually are over-represented relative to other taxa (Foote and Raup 1996). Sepkoski (1993b) suggested that removing these taxa from diversity curves will systematically improve the data. He argued that this would be most appropriate when the taxa have high turnover rates relative to the length of the sampling intervals, as with marine invertebrate genera and marine stages, or, in the current case, mammal species and 1 m.y. intervals. The same method was applied by Raup and Boyajian (1988) in a study of extinction rates, where it had minimal effects; its effect on diversity patterns has not yet been explored. In order to apply it to the current data set, the ordination-based age-ranges first must be converted from numerical values into the interval-by-interval age-ranges that are generated by lumping (see preceding section).
.......... Singletons have the advantage of being identifiable from any age-range data no matter what the resolution, and it is hard to imagine how removing them might distort the shape of a diversity curve. However, it remains to be seen if the differences between singletons and non-singletons are great enough to result in a detectable improvement.
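.......... Singleton removal is simple enough to sketch in a few lines. The example assumes lumped, interval-by-interval age-ranges stored as hypothetical (first, last) pairs of integer interval indices; it is an illustration, not the software used for this study.

```python
def non_singleton_diversity(ranges, n_intervals):
    """Lumped diversity after dropping taxa confined to one interval."""
    counts = [0] * n_intervals
    for first, last in ranges:
        if first == last:
            continue  # singleton: occurs in only one interval
        for i in range(first, last + 1):
            counts[i] += 1
    return counts

# Five hypothetical taxa, two of which are singletons
ranges = [(0, 0), (0, 2), (1, 2), (2, 2), (1, 3)]
corrected = non_singleton_diversity(ranges, 4)  # [1, 3, 3, 1]
```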
Chao-2.--After many years of wariness, ecologists have recently developed a suite of methods to extrapolate total diversity from various features of frequency distributions. The most robust of these is believed to be the Chao-2 equation (Colwell and Coddington 1994); it has been applied to paleobotanical data sets by Wing and DiMichele (1995). Anderson et al. (1996) applied other, related extrapolation formulas, but gave no reason for favoring these over Chao-2.
.......... The basic idea of extrapolation measures is that total diversity is a function of the relative number of species in each sampling frequency class. For the Chao-2 equation,

$$D_{CH} = D_S + \frac{N_{SF_1}^2}{2 N_{SF_2}}$$

where $D_{CH}$ = the Chao-2 diversity estimate, $D_S$ = the total number of sampled species within an interval, $N_{SF_1}$ = the number of species sampled exactly once, and $N_{SF_2}$ = the number sampled exactly twice.
.......... The availability of precisely correlated faunal lists makes adapting the equation to the current data set easy, except for the issue of species-lineages. In intervals that include more than one species in a genus, specifically indeterminate records of that genus are uninformative because of their ambiguity. In other intervals that are outside the range of any named, included species, indeterminate genus-level records are informative because they demonstrate the presence of a species-lineage that must have existed. These considerations have been built in to the software used to implement the equation.
.......... The Chao-2 equation and similar formulas make important assumptions that may not be met by paleobiological data. First, the underlying frequency distribution must allow only a finite number of species in the total species pool. Distributions like the broken stick and the "veiled" version of the canonical log-normal do make this assumption (May 1975), and it seems like a good one for continent-wide data sets. However, distributions like the geometric and log series that assume no finite species pool are often better descriptors of local communities (Magurran 1988). Second, individual samples must be of roughly the same size. However, in paleobiological data sets there often are a few very large samples and many small samples. Third, the number of one- and two-locality species must be sufficiently large to eliminate most of the variation created by binomial error. However, this condition is not easily met.
.......... Diversity curves can be modified with extrapolation equations like Chao-2 by using available taxonomic lists to make separate estimates for each time interval. Thus, extrapolation absolutely requires a well-correlated set of taxonomic lists. Such methods generate a diversity curve but not a set of age-ranges, making it impossible to estimate origination or extinction rates per se.
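.......... A minimal sketch of a per-interval Chao-2 calculation follows. It assumes that the faunal lists for one interval are available as sets of species names, and it ignores the species-lineage rules described above. The function name and example lists are hypothetical.

```python
from collections import Counter

def chao2(faunal_lists):
    """Chao-2 estimate for the faunal lists of a single interval.

    faunal_lists: iterable of sets of species names.
    Returns None when no species occurs in exactly two lists, because
    the equation as written is then undefined.
    """
    occurrences = Counter()
    for fauna in faunal_lists:
        occurrences.update(set(fauna))  # count each species once per list
    d_s = len(occurrences)  # sampled species
    n_sf1 = sum(1 for n in occurrences.values() if n == 1)  # in one list
    n_sf2 = sum(1 for n in occurrences.values() if n == 2)  # in two lists
    if n_sf2 == 0:
        return None
    return d_s + n_sf1 ** 2 / (2 * n_sf2)

# Three hypothetical faunal lists from one interval
lists = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "e"}]
estimate = chao2(lists)  # 5 + 3**2 / (2 * 1) = 9.5
```

.......... With only three one-list species and one two-list species, adding or removing a single record changes the estimate substantially, which is the sensitivity to binomial error noted above.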
FreqRat.--Foote and Raup (1996) developed a new method for determining sampling completeness in the fossil record. Their goal was to estimate the fraction of species preserved across an entire data set, not within particular intervals. However, it is easy to show how their FreqRat (frequency ratio) method may be modified for that purpose. Like ecological extrapolation methods, FreqRat is based on comparisons among sampling frequency classes. Unlike the ecological methods, it assumes that the data are age-ranges across successive time units, and instead of computing frequencies of occurrence within faunal lists, it computes frequencies of duration across intervals. FreqRat depends on the fact that the singleton frequency class (i.e., that of species with durations of one interval) behaves differently from other duration frequency classes whenever samples are lumped into finite time intervals. For most classes,

$$N_{DF_{i+1}}^2 = N_{DF_i} N_{DF_{i+2}}$$

where $N_{DF_i}$ = the number of species in the ith duration frequency class. Thus,

$$N_{DF_{i+1}}^2 / (N_{DF_i} N_{DF_{i+2}}) = 1$$

for well-behaved data. When preservation is incomplete, however, the singleton class becomes systematically overrepresented, such that

$$N_{DF_2}^2 / (N_{DF_1} N_{DF_3}) < 1$$

Foote and Raup (1996) determined that this ratio, called FreqRat, is actually an exact measure of the proportion of all species in the "sampling universe" that actually have been sampled at least once in any one of the time intervals.
.......... Adapting FreqRat to the problem of correcting diversity curves may seem difficult. After all, the method generates one completeness estimate for the entire data set, and it depends upon computing frequency classes across time intervals, not across individual taxonomic lists. One simple approach is to restrict the analysis to sets of three consecutive intervals. Species are placed in frequency classes based on their occurrences within the intervals: singletons are those species found only in the first (focal) interval of a set, doubletons are those found in the focal and succeeding intervals only, and tripletons range through exactly three intervals starting with the focal interval. Note that longer ranges are not truncated for the purpose of making these counts. So, for example, if a species is found in the first of the three intervals being examined but also in preceding intervals it is not considered a "singleton."
.......... Frequency classes are counted moving in both directions through the time series, with each ith interval becoming the focus in turn. So in the forward pass doubletons are found in intervals i and i+1, whereas in the backward pass they are found in i and i-1. Finally a separate FreqRat is computed for each focal interval, and the estimates are used to back-compute diversity:

$$D_{FR} = D_S / (N_{DF_2}^2 / [N_{DF_1} N_{DF_3}])$$

where $D_{FR}$ = the FreqRat estimate of diversity and $D_S$ = the number of species present in an interval, as in the Chao-2 equation.
.......... The new formula does create a problem: sampling completeness is now an average estimate for several neighboring intervals, not for a particular interval. However, in most data sets sampling intensity is highly autocorrelated to start with, so the variation among neighboring intervals should be small enough to allow reasonable accuracy.
.......... Unfortunately, precision is a greater problem here than accuracy: there is substantial binomial error due to the small number of species in the second and third frequency classes. In fact, the problem is so great for the North American mammal data set that it results in impossible FreqRat values of > 1.0 for many intervals. Such ratios imply that more species should have been sampled than actually did exist. To avoid this problem I modified the equation by adding data from neighboring intervals to each frequency class (smoothing). The procedure eliminates a large majority of > 1.0 FreqRat values only when the smoothing window is extended to five intervals on either side. Smoothing might have the undesired side-effect of creating unnaturally strong autocorrelation in the corrected curve. But as discussed below, the main features of the curve do not appear to result from this.
.......... A minor problem is that FreqRat cannot be adapted to the easiest method of computing species-lineages. Unlike any of the other methods discussed here, it instead requires explicitly merging species into lineages in order to compute durations. Because there are often multiple ways to merge species, and because these options can affect the shape of the frequency distribution, using species-lineage data seems unwise. The frequency counts are therefore based on species durations, although the final correction is applied to a count of species-lineages (DS). This inconsistency should have little effect on the results.
.......... Like the other extrapolation methods, FreqRat generates point diversity estimates instead of age-ranges and attendant origination and extinction rates. This shortcoming is somewhat balanced by the fact that FreqRat's preservation probabilities are computed from the type of age-range data that can be computed from almost every synoptic paleontological database. Unfortunately, using the probabilities to reconstruct a diversity curve requires not just simple age-ranges, but presence-absence data for each time interval. Therefore, FreqRat's data requirements in this context are not at all trivial.
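.......... The per-interval FreqRat correction can be sketched as follows. This is a simplified illustration under stated assumptions: it covers only the forward pass, applies no smoothing across neighboring intervals, and takes species durations as hypothetical (first, last) pairs of integer interval indices. The function name and example data are not from the original analysis.

```python
def freqrat_diversity(ranges, focal, d_s):
    """Forward-pass FreqRat and diversity estimate for one focal interval.

    ranges: (first, last) interval indices, one pair per species.
    d_s: number of species present in the focal interval.
    Returns (ratio, estimate), or None when any frequency class is empty.
    """
    # Only species that begin in the focal interval enter the three classes
    n_df1 = sum(1 for f, l in ranges if f == focal and l == focal)
    n_df2 = sum(1 for f, l in ranges if f == focal and l == focal + 1)
    n_df3 = sum(1 for f, l in ranges if f == focal and l == focal + 2)
    if 0 in (n_df1, n_df2, n_df3):
        return None  # ratio undefined or zero
    ratio = n_df2 ** 2 / (n_df1 * n_df3)
    return ratio, d_s / ratio

# Hypothetical data: 4 singletons, 2 doubletons, 2 tripletons from interval 0
ranges = [(0, 0)] * 4 + [(0, 1)] * 2 + [(0, 2)] * 2
ratio, estimate = freqrat_diversity(ranges, focal=0, d_s=10)  # 0.5, 20.0
```

.......... Because the class counts are small, a change of one species in the doubleton class moves the ratio considerably, and values above 1.0 are easy to obtain.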
Lazarus taxa.--Batten (1973) and Paul (1982) recognized that during poorly sampled intervals, many taxa known before and afterwards are not recorded even though they must have existed. These "gap" (Paul 1982), unsampled "spanning" (Holman 1985), "Lazarus" (Jablonski 1986), or "unsampled range-through" (Maas et al. 1995) taxa are indicators of sampling intensity. Counts of Lazarus taxa have been used to quantify differences in preservability among higher taxa (Paul 1982; Holman 1985), changes in sampling intensity through time (Paul 1982; Erwin 1996), and the "parsimony debt" of phylogenetic hypotheses (Fisher 1982, 1994).
.......... Paul (1982) was the first to use Lazarus taxa as the basis for an index of sampling intensity. Unfortunately, both he and Krause and Maas (1990), who independently derived the idea, failed to recognize that only range-through taxa should figure in such an equation; taxa whose age-ranges begin and/or end in a sampling interval provide no direct information about sampling intensity. This problem was corrected by Holman (1985), and later Maas et al. (1995), who defined the function

$$C_L = N_{SRT} / (N_{URT} + N_{SRT})$$

where $C_L$ = the Lazarus taxon estimate of completeness, $N_{SRT}$ = the number of sampled range-through taxa (i.e., all sampled taxa except first appearing and last appearing taxa), and $N_{URT}$ = the number of unsampled range-through taxa.
.......... None of the preceding authors used Lazarus taxon counts or indices to compute extrapolated diversity estimates. However, this can be done quite easily by dividing the count of sampled taxa by the completeness index, as was done previously with the FreqRat measure:

$$D_L = D_S (N_{URT} + N_{SRT}) / N_{SRT}$$

where $D_L$ = the Lazarus estimate of diversity.
.......... Species-lineage data present a minor book-keeping problem for computing Lazarus taxa: in rare instances, the first or last appearance of a species may be interpretable as a range-through. These instances occur when a terminal appearance is abutted by a gap in the species-level ranges for the genus, even though the genus itself is known to range through. In such cases the terminal appearance is interpreted as a pseudo-extinction or pseudo-origination instead of a true terminus. Only one such range-through can be declared for each interval, further complicating the algorithm.
.......... The Lazarus estimate is expected to be robust because it makes few assumptions. However, some of these assumptions are dangerous. The most important is that range-through taxa are as preservable as first- and last-appearing taxa; the opposite assumption is exactly what justifies the singleton-removal and FreqRat methods. If singletons instead are systematically but variably over-represented, the Lazarus method might distort the data. Secondly, the Lazarus method assumes that sampling intensity does not swing wildly between intervals. But Fig. 2 shows that it does, so obtaining useful estimates of the number of range-through taxa might prove difficult. Finally, if turnover rates are high relative to the length of sampling intervals, there simply may be too few Lazarus taxa to fix a precise completeness index. This turns out not to be a problem for the current data set.
.......... Like the preceding two methods, the Lazarus approach can be used to make separate diversity estimates for each interval; no age-ranges are generated, and origination and extinction rates cannot be determined. Unlike the ecological methods and like FreqRat, the Lazarus method does not require taxonomic lists. It does, however, require presence-absence data for all intervals. Unfortunately, these data have not been recorded for several important synoptic data sets like those of Sepkoski (1982, 1993a) and Benton (1995).
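.......... The Lazarus calculation for a single interval can be sketched as below. The example assumes presence-absence data stored as a hypothetical mapping from each taxon to the set of interval indices in which it was sampled, and it omits the species-lineage book-keeping described above.

```python
def lazarus_diversity(sampled_in, interval):
    """Lazarus completeness and diversity estimate for one interval.

    sampled_in: dict mapping taxon name -> set of sampled interval indices.
    Returns (completeness, estimate), or None when no range-through taxon
    is sampled in the interval.
    """
    d_s = n_srt = n_urt = 0
    for intervals in sampled_in.values():
        first, last = min(intervals), max(intervals)
        sampled = interval in intervals
        if sampled:
            d_s += 1  # every taxon sampled in the interval
        if first < interval < last:  # range-through taxon
            if sampled:
                n_srt += 1
            else:
                n_urt += 1  # Lazarus taxon
    if n_srt == 0:
        return None
    completeness = n_srt / (n_urt + n_srt)
    return completeness, d_s / completeness

# Six hypothetical taxa; B and F range through interval 1 unsampled
sampled_in = {
    "A": {0, 1, 2}, "B": {0, 2}, "C": {0, 1, 3},
    "D": {1}, "E": {1, 2}, "F": {0, 3},
}
completeness, estimate = lazarus_diversity(sampled_in, 1)  # 0.5, 8.0
```

.......... Taxa D and E count toward the sampled total but not toward the completeness index, because their ranges begin or end in the focal interval.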
Confidence intervals.--In his landmark paper on sampling in the fossil record, Paul (1982) also developed a basic method for computing the confidence interval around the ends of a taxon's local stratigraphic range. Paul's probability model was replaced with a more realistic one by Strauss and Sadler (1989), and Paul's suggestion of a non-parametric approach was developed into a full equation by Marshall (1994). Since then, confidence intervals have become highly important; they not only are interesting in and of themselves, but can be used to test for mass extinction events (Marshall 1995; Marshall and Ward 1996) and constrain phylogenetic hypotheses (Wagner 1995a).
.......... Although the extension seems obvious, I am not aware of earlier studies explicitly attempting to correct lengthy diversity curves using confidence intervals. However, this is easily done; one simply replaces the original, observed stratigraphic ranges with new ranges augmented by confidence intervals. There are only three complications. First, one has to choose between using a fixed confidence limit value to stretch the ends of each age-range to a specific point, or converting confidence interval probabilities into fractional "presences" that smear out the known ends of each age-range. Because the latter approach is complex and blurs the definition of originations and extinctions, the former seems preferable. However, it does require the arbitrary choice of a confidence interval value that will fix the range ends. In this study I will use the 50% confidence limit instead of other values such as 90% or 95%. This is because overestimating half the age-ranges but underestimating the other half should give a statistically unbiased estimate of total diversity.
.......... A second problem is that the parametric equations of Strauss and Sadler (1989) are not appropriate for data that do not apply to a single stratigraphic section, such as the current appearance event sequence. The non-parametric equation of Marshall (1994) does seem appropriate; it equates the 50% confidence interval with the median-sized gap within an age-range. Unfortunately, this method makes a key assumption that is not borne out by the data: gaps within age-ranges should be randomly distributed. In fact, gaps at either the beginnings or ends of age-ranges are substantially longer than median-sized gaps (Alroy 1996). In the absence of any better approach, the current analysis deals with the problem by substituting the first or last gap for the median gap, as appropriate. This yields an estimate more prone to random error, but one that on average probably more closely approaches the true value.
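.......... The end-gap substitution can be illustrated with a minimal sketch. It is not the code used in this study: the function name extend_range, the input format (a list of occurrence ages in Ma for one taxon), and the decision to stop extensions at the present day are all assumptions made for illustration.

```python
def extend_range(occurrences):
    """Stretch one age-range by its first and last internal gaps.

    occurrences: ages in Ma (larger = older) at which the taxon is sampled.
    Returns (oldest, youngest) after extension. Hypothetical helper.
    """
    ages = sorted(set(occurrences), reverse=True)  # oldest first
    if len(ages) < 2:
        # A taxon known from a single horizon has no gaps to measure,
        # so its range cannot be extended by this approach.
        return ages[0], ages[0]
    first_gap = ages[0] - ages[1]    # gap adjoining the first appearance
    last_gap = ages[-2] - ages[-1]   # gap adjoining the last appearance
    oldest = ages[0] + first_gap
    # Assumption: an extension cannot run past the present day (0 Ma).
    youngest = max(ages[-1] - last_gap, 0.0)
    return oldest, youngest


print(extend_range([30, 27, 26, 24]))  # first gap = 3, last gap = 2
```

.......... In this toy case the observed range of 30 - 24 Ma is stretched to 33 - 22 Ma, whereas the median gap (2 m.y.) would have added less at the older end.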
.......... The final computational problem involves species-lineages. It would not be appropriate to use presences for merged sets of species to compute gaps in their joint species-lineage range, because the sampling properties of those species may have changed along with their morphology. Furthermore, doing so would create the same problems with equally-optimal solutions that were seen with FreqRat. A simpler, if still not completely justified approach is to compute separate confidence intervals for genera and species, then compute separate ranges across the sampling intervals, and then use the genus-level ranges to fill in the species-level ranges using the standard algorithm.
.......... Like the following two methods, confidence intervals have the advantage of preserving age-ranges instead of proceeding directly to point estimates of diversity. Unlike other extrapolation methods, this one has the disadvantage of never predicting the existence of species that are as yet completely unsampled: instead, it focuses only on extending age-ranges of known species. Also, the method does not address species known only from a single horizon or locality, whose age-ranges almost by definition are badly truncated.
Ghost lineages.--Norell (1992) amplified an argument made by Hennig (1966), and independently by Paul (1982), that all sister taxa must originate simultaneously. Hence, if two sister taxa first appear at different times in the fossil record, the origination of the younger-appearing taxon can be pushed back safely to equal that of the older-appearing taxon. The resulting "ghost lineage" hypothesis can be used to increase the estimate of standing diversity. Hence,

DGL = DA + NG,

where DGL = the ghost lineage estimate of diversity, DA = apparent diversity at a particular time plane, including range-through taxa, and NG = the number of ghost lineages. The time plane is best set at the midpoint of the time interval, not at one of its ends.
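.......... As a rough illustration of the counting involved, the sketch below tallies DA and NG at one time plane. It assumes a deliberately simplified case in which each taxon belongs to at most one sister pair; real cladograms nest sister groups, which this does not handle. The function name, the dictionary of ranges, and the example taxa are hypothetical.

```python
def ghost_lineage_diversity(ranges, sister_pairs, time_plane):
    """Return DGL = DA + NG at a single time plane.

    ranges: dict mapping taxon -> (first_appearance, last_appearance) in Ma,
            with larger values being older.
    sister_pairs: list of (taxon_a, taxon_b); each taxon in at most one pair
                  (a simplifying assumption).
    """
    # DA: taxa whose observed range spans the time plane (range-throughs count).
    d_a = sum(1 for first, last in ranges.values()
              if first >= time_plane >= last)
    # NG: ghost lineages, i.e. the stretch between the first appearances of
    # two sister taxa, assigned to the younger-appearing one.
    n_g = 0
    for a, b in sister_pairs:
        older = max(ranges[a][0], ranges[b][0])
        younger = min(ranges[a][0], ranges[b][0])
        if older >= time_plane > younger:
            n_g += 1
    return d_a + n_g


ranges = {"A": (40, 30), "B": (34, 28), "C": (36, 35)}
pairs = [("A", "B")]
print(ghost_lineage_diversity(ranges, pairs, 35.5))  # A and C observed, B as a ghost
```

.......... The time plane of 35.5 Ma is the midpoint of a 1 m.y. interval, following the recommendation above.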
.......... It should be noted that the ghost lineage method is not the same as the phylogenetically-based "Lazarus taxon" correction applied by Smith (1988), which granted that relatively primitive taxa might evolve via anagenesis into relatively derived taxa. Thus, Smith (1988) drew the ranges of sister taxa back to the same point only when the more derived taxon appeared first in the fossil record. When the more primitive taxon appeared first, he merely filled in gaps to guarantee that the lineage would not disappear and then reappear as a Lazarus taxon.
.......... The ghost lineage method requires a complete phylogeny for the group in question. Such a phylogeny is not available for the current data set. Instead, I will use a hybrid simulation approach that is intended to increase the method's chances of success. The basic idea is to make use of all the available species-level age-ranges; constrain relationships among species so that named genera are never polyphyletic; and maximize the age-rank/clade-rank correlation (Norell and Novacek 1992) within each genus. Because there are multiple ways to meet the latter criterion whenever there is a non-trivial number of species within a genus, a particular phylogeny must be selected by Monte Carlo simulation. The procedure should minimize problems with overamplification of diversification episodes (see below).
.......... One minor computational problem is that the basic simulation algorithm ignores phylogenetic relationships among genera. A large majority of these genera are believed to be native to the continent (Woodburne and Swisher 1995), so if the relationships among their basal species were known, they often would imply relevant ghost lineages. This oversight is easily corrected with a ratio that depends on knowing how much the diversity curve has been inflated for those taxa that could possibly have been matched to ghost lineages in the first place:

DGL' = DA(DA - NB + NG)/(DA - NB),

where DGL' = the corrected ghost lineage diversity estimate and NB = number of "basal" species, i.e., species that are the oldest known to occur within a genus. What drives the equation is recognizing that basal species cannot have been extended to make still more ancient ghost lineages within the same genus, but if we had had a complete phylogeny they might have been extended to create ghost lineages in other, paraphyletic genera.
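.......... The arithmetic of the correction is easy to check with made-up numbers. The values below (DA = 100, NB = 20, NG = 16) are purely illustrative and are not drawn from the data set; the function name is hypothetical.

```python
def corrected_ghost_diversity(d_a, n_b, n_g):
    """Return DGL' = DA(DA - NB + NG)/(DA - NB).

    d_a: apparent diversity at the time plane (DA)
    n_b: number of basal species (NB), which cannot receive ghost lineages
    n_g: number of ghost lineages found among the remaining species (NG)
    """
    eligible = d_a - n_b  # species that could have been matched to a ghost lineage
    if eligible <= 0:
        raise ValueError("no non-basal species, so the ratio is undefined")
    # Inflation observed among eligible species, applied to all of DA.
    return d_a * (eligible + n_g) / eligible


print(corrected_ghost_diversity(100, 20, 16))  # 100 * 96 / 80
```

.......... Here the 80 non-basal species are inflated by 20% (96/80), and applying the same ratio to all 100 species gives 120, as opposed to the uncorrected DGL of 116.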
.......... Obviously, computing species-lineages is not compatible with computing ghost lineages. Not only do the ghost lineages already imply that all gaps before the end of a genus' range will be filled, but they assume a completely different model of evolution in which species never directly give rise to each other. This point of incompatibility is unavoidable, and if it results in any major difference in the pattern as compared to the species-lineage data, this has to be interpreted on its own merits and not as a computational oversight.
.......... The ghost lineage method makes such strong assumptions about evolution that many researchers will reject it outright. For example, its assumption that ancestors never are sampled cannot be justified under any model (Foote 1996a), and several paleontological systematists have rejected the idea explicitly (Smith 1988; Wagner 1995a). Apart from this, the ghost lineage argument is valid for setting a minimum date on the origination of particular lineages, but the method still suffers from numerous fundamental problems with respect to estimating overall diversity. Chief among them is the fact that only first appearances are corrected. Thus, the method applies an asymmetrical correction that can accentuate diversification episodes while leaving prolonged extinction events untouched (Foote 1996b).
.......... A second problem is the method's potential vulnerability to inaccuracies in phylogenetic reconstruction. Both simulation studies (Sepkoski and Kendrick 1993; Wagner 1996b) and empirical work (Wagner 1995b) suggested that accurately defined monophyletic higher taxa and ghost lineages do well at recovering origination and extinction rates. However, it stands to reason that if the phylogeny is highly inaccurate, ghost lineages might obscure severe, sudden extinction events by creating numerous "Elvis" taxa (Erwin and Droser 1994). Poor phylogenies also might amplify the method's built-in tendency to exaggerate explosive logistic diversification episodes, for example, by placing a late-appearing and actually highly derived taxon in a basal portion of the tree.
.......... A third problem is that ghost lineages are most likely to be created in poorly sampled intervals immediately preceding well-sampled intervals. If just one member of a diverse clade is sampled in one such undersampled interval, many of that clade's lineages in the following interval will be drawn back. Thus, there is no a priori reason to think that the correction works against sampling biases; it may work in tandem with such biases if variation in sampling is extreme. All of this suggests that ghost lineage corrections have the potential to be highly biased (Wagner 1996b); one possible case is that of MacFadden and Hulbert (1988).
.......... Finally, true ghost lineages can be created only when robust phylogenies are available for all of the species that are being studied. Very few data sets meet this criterion. However, the results discussed later suggest that the hybrid generic allocation/Monte Carlo simulation method may provide information about sampling even when no phylogeny is available at all. Additionally, ghost lineages share the desirable property of modifying age-ranges instead of computing point diversity estimates, allowing the "corrected" data to be used in studies of turnover rates.
Rarefaction.--Although ecologists had developed the idea earlier, Raup (1975) was the first to apply rarefaction to paleobiological diversity data. Rarefaction is extremely powerful because it applies to any type of data where items (e.g., taxa) are arrayed in samples (e.g., taxonomic lists) and where the number of samples differs between two or more data partitions (e.g., time intervals). In such cases, one can compare the diversity of taxa in any two intervals by equalizing the number of lists. This is done either with an equation, or by randomly drawing samples without replacement. Previous applications in paleobiology mostly have not focused on taxonomic lists (e.g., Foote 1992; McKinney 1995), but both Alroy (1996) and Miller and Foote (1996) have done so recently.
.......... One complication is that taxonomic lists per se might not be the best sampling units. It is widely known that individual fossil localities vary wildly in the number of identifiable specimens they preserve. Therefore, it is not strictly fair to rigidly equate lists with each other for the purpose of sampling. I previously proposed that the number of taxonomic records might be a better proxy (Alroy 1996, in press). Records are defined as the number of distinguishable taxa (species plus genera that include no identified species) in each taxonomic list. For example, lists of 2, 5, and 7 distinguishable taxa sum to 14 taxonomic records. In this study records are used instead of lists as the unit of sampling because the results independently show that taxonomic records are indeed a superior indicator of sample size.
.......... Although one could compute a series of rarefaction trials by fixing the numbers of lists to varying levels, in the current analysis I will focus on only a single sampling level: that defined by the size of the most under-represented time intervals. Hence, the "better" intervals are randomly subsampled until the number of taxonomic records drawn in each equals the total in the "worst" intervals.
.......... I noted earlier (Alroy 1996, in press) that the current data set includes a handful of extremely undersampled intervals, and therefore that rarefying to the very worst sampling level would be extreme. As before, I will instead use a compromise sampling level. With the most recent version of the data set the best compromise is reached at 100 records per m.y. Many intervals just barely surpass this cutoff, and the two intervals that fall short of it are simply too poorly sampled to figure in setting the standard: 40 - 39 Ma (50 records) and 22 - 21 Ma (65 records).
.......... A final complication is that independent rarefaction analyses of successive time intervals do not fully depict the sampling process that determines a set of age-ranges. This is because real age-ranges are based not just on examining lists, but on applying the range-through criterion: a species present before and after an interval is considered to be present within it regardless of whether it is actually sampled. Because separate rarefaction runs would not incorporate these range-throughs, they would not reflect the same sampling process as the original diversity curve.
.......... Although this problem was not recognized by Miller and Foote (1996), I have dealt with it in previous studies (Alroy 1996, in press). The conceptually simple but computationally burdensome solution is to perform exactly one rarefaction trial in each sampling interval, determine the implied presence and absence of each taxon in the intervals, and then compute range-throughs for all taxa and intervals. The procedure is then repeated numerous (in this case 100) times, with a new diversity curve and accompanying turnover rates being stored with every iteration.
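.......... The procedure can be sketched as follows. This is an illustration and not the program used here: it assumes that individual taxonomic records are drawn at random without replacement until the quota is met, that intervals falling short of the quota contribute all of their records, and that the input is a list of intervals (oldest first), each holding its faunal lists. All function and variable names are hypothetical.

```python
import random


def rarefied_curve(lists_by_interval, quota, rng):
    """One rarefaction trial per interval, followed by range-through filling."""
    sampled = []
    for lists in lists_by_interval:
        # Every taxon on every list counts as one taxonomic record.
        records = [taxon for fauna in lists for taxon in fauna]
        k = min(quota, len(records))  # short intervals keep all their records
        sampled.append(set(rng.sample(records, k)))  # draw without replacement
    # Find the first and last interval in which each taxon was drawn.
    first, last = {}, {}
    for i, taxa in enumerate(sampled):
        for taxon in taxa:
            first.setdefault(taxon, i)
            last[taxon] = i
    # Range-through criterion: present in every interval between first and last.
    return [sum(1 for t in first if first[t] <= i <= last[t])
            for i in range(len(sampled))]


def mean_rarefied_curve(lists_by_interval, quota, trials=100, seed=0):
    """Average the diversity curves from repeated trials."""
    rng = random.Random(seed)
    totals = [0.0] * len(lists_by_interval)
    for _ in range(trials):
        for i, d in enumerate(rarefied_curve(lists_by_interval, quota, rng)):
            totals[i] += d
    return [t / trials for t in totals]


toy = [[["a", "b"]], [["b"]], [["a", "c"], ["c"]]]
print(mean_rarefied_curve(toy, quota=2, trials=100))
```

.......... Because the range-throughs are recomputed inside every trial, taxon "a" is counted in the middle interval only in those trials where it happens to be drawn both before and after it.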
.......... Rarefaction presents no special problems with respect to computing species-lineages, although they must be recomputed with every iteration of the algorithm.
.......... Rarefaction has obvious strengths, and it would be a surprise to find that it performs poorly. Chief among them is that it is an interpolation method, not an extrapolation method like all the others. Although this prevents rarefaction from reconstructing the absolute magnitude of diversity, it maximizes the odds of accurately reconstructing the relative magnitude of diversity because it avoids the strong assumptions needed for extrapolation.
.......... The method's main drawbacks are that it requires very detailed knowledge of taxonomic lists, and that the underlying relationships between sample size and proxies like taxonomic record counts are unknown. The first property, however, can be seen as a strength, because the method preserves not just estimated origination and extinction data, but distributional information at the locality level that can be crucial in evaluating corrected diversity patterns. For example, corrected curves based on individual rarefaction runs can be matched to taphonomic and geographic data concerning the localities that were sampled.

Evaluative methods

.......... We do not know the true diversity history of North American fossil mammals. However, we can easily predict the differences between the true diversity history and the raw diversity curve that is based on all the available data. Therefore, we also can predict the differences between the raw curve and any accurate, statistically corrected curves. Each predicted difference maps to a simple descriptive statistic that can be used to evaluate the competing methods.
.......... Many of these descriptive statistics are correlations. Because the data are time series that typically show strong autocorrelation, standard significance tests would be far too liberal. However, significance per se generally is not a concern because the statistics are intended to summarize patterns, not to test for non-randomness. Another issue is normality; being autocorrelated, many of the distributions are highly skewed, which would give unfair weight to outliers in computing a standard correlation. Therefore, a rank-order transformation is applied to each of the variables before computing the following correlations.
.......... The predictive statistics used in the analysis are as follows:
.......... 1) The true curve must have a greater average magnitude than the uncorrected curve. Therefore, if the extrapolated curves are accurate estimates, they should have similar and relatively high mean values. If they differ from one another in magnitude, at least one of them must be inaccurate. This criterion does not apply to interpolation methods like singleton removal and rarefaction.
.......... 2) The true curve is probably less variable than, or as variable as, the uncorrected curve. If variation in sampling is a completely stochastic process, it should increase the apparent variation in diversity. However, we know that sampling does not vary completely at random, so we do not expect the increase in variation to have been very great. This prediction can be measured by converting the diversity curves to a log scale (Sepkoski 1991) and computing coefficients of variation for the resulting time series. The coefficients for the original and corrected data should be similar.
.......... 3) The corrected curves all should show the same long-term trend through time. Regardless of details, the successful methods should converge on a similar estimate for the average rate of increase in diversity across the whole time series. The best way to estimate this statistic is to log-transform the data, compute a least-squares regression against time, and convert the resulting slope coefficient into a rate of increase. Although the pattern is known to be logistic and not exponential (Alroy 1996, in press), this statistic still should be a useful summary of the overall trend. If the methods fail to arrive at a consensus on its value, the test is of minimal value; if they do, it should help to identify flawed methods.
.......... 4) The true curve is probably as autocorrelated or more autocorrelated than the uncorrected curve. This is because a highly stochastic sampling-bias overprint should destroy the intrinsic correlation between the diversity levels of neighboring time intervals. More predictable variation in sampling regimes still should decrease this correlation, but not greatly. The simplest test of the prediction is to compute rank-order serial correlations of diversity levels in each pair of intervals i and i+1 (= autocorrelations with lags of one interval) for the uncorrected and corrected time series. Proper serial correlation requires that a variable's mean not change through time, but diversity increases dramatically in almost every curve. Therefore, the diversity values first must be transformed by taking the residuals of a least-squares regression of their natural logarithms against time (McKinney 1990). The correlations should be strong and positive.
.......... 5) The true curve is probably different from the uncorrected curve, but not greatly so. There is no reason to think that the uncorrected curve is completely uninformative, but at the same time sampling effects should change its overall shape relative to the true curve. We can infer that a method is probably not very reliable if it produces no perceptible change in the pattern despite substantial temporal variation in sampling. Therefore, there should be a perceptible, but not overwhelmingly strong correlation between the diversity levels in the corrected and uncorrected curves. This can be expressed most robustly with a rank-order correlation, which should be moderate and positive.
.......... 6) The true net changes in the diversity between intervals are probably quite similar to the net changes in the uncorrected diversity curve. This is due to the fact that sampling intensity is strongly autocorrelated. Hence, neighboring intervals are likely to experience similar sampling regimes, and the proportionate changes in diversity are likely to be accurate even when the absolute magnitudes of diversity are far too low. This can be tested with a rank-order correlation of the percent diversity changes in each time series, which should be strong and positive.
.......... 7) Most importantly, the proportionate differences between the true and uncorrected curves should reflect sampling intensity. For extrapolation methods, intervals with very poor sampling should be corrected to relatively much higher diversity levels; intervals with robust sampling should be subject to very little correction. For interpolation methods like rarefaction, downwards corrections on the contrary should be greatest in the best-sampled intervals. Assuming that counts of faunal lists are a good measure of sampling intensity, one can compute a rank-order correlation of the correction ratio (corrected/uncorrected diversity) against the number of lists. A strong, negative correlation should be obtained for either extrapolation or interpolation methods if the correction is valid.
.......... 8) Similarly, if counts of taxonomic records are a good measure of sampling, then the rank-order correlation of the correction ratio and the number of records per interval should be strong and negative. As a corollary, if one of the two sampling measures consistently correlates more strongly with the correction ratio than does the other, then this sampling measure is probably the more informative.
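.......... Several of these statistics can be sketched in a few lines. The code below is an illustration under stated assumptions, not the software used for the analysis: time is expressed as elapsed time increasing toward the present, tied values receive average ranks, the coefficient of variation uses the sample standard deviation, and all function names are hypothetical.

```python
import math


def ranks(values):
    """Rank-order transformation; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_correlation(x, y):
    """Correlation computed after rank-transforming both variables."""
    return pearson(ranks(x), ranks(y))


def log_cv(diversity):
    """Statistic 2: coefficient of variation of log-transformed diversity."""
    logs = [math.log(d) for d in diversity]
    mean = sum(logs) / len(logs)
    sd = math.sqrt(sum((v - mean) ** 2 for v in logs) / (len(logs) - 1))
    return sd / mean


def log_trend(times, diversity):
    """Statistic 3: least-squares slope of ln(diversity) on time, plus residuals."""
    logs = [math.log(d) for d in diversity]
    mt, ml = sum(times) / len(times), sum(logs) / len(logs)
    slope = (sum((t - mt) * (v - ml) for t, v in zip(times, logs))
             / sum((t - mt) ** 2 for t in times))
    intercept = ml - slope * mt
    return slope, [v - (intercept + slope * t) for t, v in zip(times, logs)]


def serial_correlation(times, diversity):
    """Statistic 4: lag-one rank correlation of the detrended log diversity."""
    _, resid = log_trend(times, diversity)
    return rank_correlation(resid[:-1], resid[1:])


def correction_correlation(corrected, uncorrected, sampling):
    """Statistics 7 and 8: correction ratio against a sampling measure."""
    ratio = [c / u for c, u in zip(corrected, uncorrected)]
    return rank_correlation(ratio, sampling)


# Toy case: true diversity is constant, raw diversity tracks sampling.
raw, lists = [50, 80, 60, 120], [10, 40, 20, 80]
print(correction_correlation([150] * 4, raw, lists))
```

.......... In the toy case the correction ratio falls steadily as the number of lists rises, so the rank-order correlation reaches its strongest possible negative value, which is the pattern expected of a valid correction.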

Results

Lumping by interval.--The lumped data (Fig. 3A) are very similar to the uncorrected time-plane data (Fig. 1). They have slightly lower variance (Table 1), but they almost exactly preserve the overall trend, serial correlation, relative magnitude of diversity, and relative changes in diversity (Table 2). In fact, the lumped data preserve the raw pattern of changes better than any of the sampling-correction methods discussed below - except for rarefaction. The only important difference between the curves is entirely predictable: mean diversity is about 31% higher in the lumped data (Table 1).
.......... The most interesting result is that the few changes imposed by lumping do not correspond strongly with sampling intensity. The correction ratio is indeed positively correlated with the counts of faunal lists and of taxonomic records, confirming the fact that lumping is an "anti-correction" method. But the correlation is weak: less than 12% of the variance in the ratio is explained by sampling (Table 2; Fig. 4A). This is surprising given the differences in turnover rates between the Paleocene and Eocene-Pleistocene intervals, and it suggests that the use of lumped data may not be a fatal flaw of earlier studies.
Non-singletons.--Excluding singletons from the lumped data also appears to have little effect; although a spike in the very last 1.0 m.y. interval is removed, all other features such as an enormous peak in the mid-Miocene are retained (Fig. 3B). The singleton-removal curve is still very similar to the uncorrected time-plane curve by every measure (Tables 1 and 2). It is in fact the most conservative method in terms of the inferred absolute diversity and the correlation of uncorrected and corrected diversity values. More importantly, the changes it does make bear no strong correlation with the observed variation in sampling - although there is a definite suggestion that it is positively correlated with the number of taxonomic records (Fig. 4B). Essentially, there is no evidence that this method provides any real correction at all, and it may even actively distort the pattern by selectively removing species from poorly sampled intervals. In any case, applying it is most likely to give the false impression that the original pattern was reliable to start with. The best that can be said is that the disappointing results may relate to the evenness of the sampling intervals. For time series with highly uneven intervals, such as stages, singletons may perhaps record a sampling signal.
Chao-2.--Almost every measure shows that the Chao-2 equation severely distorts the diversity curve (Tables 1 and 2; Fig. 3C). Unlike any other method it eliminates the uncorrected data's serial correlation, dramatically changes the overall shape of the curve, and destroys the net diversification signal. Worse still, the changes it does impose are very weakly correlated with observed variation in sampling intensity, measured either by the number of faunal lists or of taxonomic records per m.y. (Fig. 4C). The only encouraging results are the close similarity in overall variation between the uncorrected and Chao-2 curves, the small, but visible correlation with the raw diversity pattern, and the reasonable average rate of increase. These, however, are characteristic of all the methods except those that exaggerate the long-term trajectory of the diversity curve. In summary, the Chao-2 method acts as if it were imposing a random variate upon the raw curve, for all intents and purposes rendering the "corrected" curve far less informative than the uncorrected data.
.......... This dramatic result probably stems from two important factors. First, the Chao-2 method assumes that its two key input parameters (counts of one- and two-locality species) are large enough to minimize any binomial error. This is a stiff requirement given that the average 1.0 m.y. time interval includes 125 species-lineages and 49 faunal lists; hence, very few species in any interval will be found in only one or two lists. Second, the Chao-2 method assumes that sampling intensity within localities is approximately uniform, which, as mentioned before, is highly unrealistic for fossil assemblages that were collected using many different techniques and with no eye toward standardizing sample size.
.......... None of this argues against ecological extrapolation methods per se, or against their application to narrowly constrained paleoecological problems (e.g., Wing and DiMichele 1995; Anderson et al. 1996). It only cautions that such methods generally are not appropriate for studying even relatively high-quality compilations of taxonomic lists, and, by extension, for the problem of correcting diversity curves.
FreqRat.--Despite generous smoothing that should have avoided the binomial error problem, FreqRat performs very poorly (Tables 1 and 2; Fig. 3D). Statistically, this means an inflated coefficient of variation due to a dramatically steepened trend, a low correlation between uncorrected and corrected diversity curves, and virtually no correlation between the correction imposed by the method and the known variation in sampling intensity (Fig. 4D). But the moderately strong correlation of uncorrected and corrected diversity changes is a telling hint: it shows that FreqRat, unlike Chao-2, does little damage over short stretches of time, even though it is misleading when applied to the entire curve. FreqRat's problems are best shown by its differential treatment of the Paleocene - Eocene intervals, which it leaves almost unchanged, and the later intervals, which it greatly overcorrects.
.......... The clearcut and idiosyncratic bias seen in the curve may result from the fact that Neogene localities are much more geographically dispersed. Even though the data were restricted to the West, dispersion of localities in that area increases greatly after the early Miocene, with more consistent representation of such areas as the Great Basin and Mohave Desert. As a result, apparent beta diversity probably increases at this time. Perhaps FreqRat overcompensates when there is an increase in beta diversity, or undercompensates when beta diversity is extremely low. This is confirmed by the fact that the worst overcorrections begin at 16 Ma, just when there is a major increase in geographic dispersion.
.......... Once again, poor results in this context should not be taken as a general condemnation of the method. FreqRat was, after all, designed to compute a single preservation probability for an entire time series, not separate values for each interval (Foote and Raup 1996). Another glimmer of hope is the fact that FreqRat and three of the four other extrapolation methods all arrive at similar estimates for average standing diversity (FreqRat's value of 168 species falls well within their range of 157 - 186). Although this strongly suggests that FreqRat yields a reasonable overall sampling probability, as intended by Foote and Raup (1996), the method's inexplicably varying treatment of the time series' two halves does suggest that the method has an unresolved and possibly fundamental flaw.
Lazarus taxa.--The Lazarus taxon method is not just the most easily applied of any discussed here, but one of the better ones (Tables 1 and 2; Fig. 3E). Most of the test statistics suggest this: the variation, rate of increase, and overall shape of the uncorrected curve are preserved almost intact, and at the same time the corrections do correlate negatively with known variation in sampling intensity, if not very strongly (Fig. 4E). The predicted mean diversity value squares with those arrived at by other extrapolation methods.
.......... On the down side, the Lazarus method seems to introduce a significant amount of fine-scale noise. This is reflected by the curve's somewhat low serial correlation and the very poor preservation of the original curve's first difference pattern. The problems result from the fact that the method computes independent diversity estimates for each interval instead of modifying taxonomic age-ranges. Hence, it cannot be used to estimate origination and extinction rates even in principle and should not be used to infer such patterns in practice, all of which makes its uses limited. Studies relying on the Lazarus method should focus on long-term trends and not on short-term episodes like mass extinctions.
.......... The acceptable performance of the Lazarus method is due to its simple and persuasive logic: unless there is some systematic preservational difference between taxa that range through intervals and taxa whose ranges truncate in intervals, the proportion of range-throughs that are preserved truly must be an accurate estimate of the overall preservation probability (Paul 1982). It is hard to imagine how any such difference could be important. On the other hand, simple and persuasive logic does not necessarily translate into robust performance.
Confidence intervals.--Correcting the data by extending age-ranges to their 50% confidence intervals would seem like a highly intuitive approach, and, in fact, it does seem to improve the pattern (Tables 1 and 2). The "corrected" curve reasonably agrees with the original one in its serial correlation, relative magnitude, and net changes, and the "corrections" bear a weak but probably meaningful relationship to known variation in sampling intensity (Fig. 4F). However, the method fails to eliminate large sampling-related peaks, especially in the middle Miocene and Pleistocene (Fig. 3F). As a result, the curve's variation is greatly increased; and because the most inflated peaks are towards the end of the time series, the apparent rate of increase also is greatly over-estimated. In overall shape, however, the confidence interval curve is not so very far from the uncorrected data or from the several apparently reliable methods (rank-order correlations: vs. uncorrected, +0.860; vs. Lazarus taxa, +0.765; vs. ghost lineages, +0.721; vs. rarefaction, +0.604). Still, its difficulties with sampling peaks show that its solution is idiosyncratic insofar as it represents a real change in the first place.
.......... There is probably a very simple explanation for the peculiar features of the confidence interval curve: the method only can be applied to known taxa with non-zero age-ranges, and these taxa tend to be clustered around well-sampled intervals. Because mammals have relatively short age-ranges to begin with, the real problem is one that confidence intervals do not address - the complete absence of many taxa in the poorly-sampled intervals. Despite this, confidence intervals do push the data in the direction pointed at by sampling intensity patterns, and applying them is better than ignoring them.
.......... Still, though, with the kind of horizon-specific data needed for the method, anyone using confidence intervals presumably could apply a superior method like rarefaction instead. Therefore, the method probably should be restricted to the special problems it originally was designed for, such as determining the relative order of appearance of known species (Paul 1982) or the synchronicity of mass extinction events (Marshall 1995; Marshall and Ward 1996).
Ghost lineages.--The ghost lineage method is perhaps the most controversial one dealt with here. For example, it has been claimed that no study of diversity patterns can proceed without corrections for ghost lineages (Norell 1992; Norell and Novacek 1992; Smith 1994), which has generated heated responses (e.g., Foote 1996b). The debate has even spilled over to the contentious issues of how phylogenies might be constrained by temporal distributions and how ancestors might be recognized in the fossil record (Norell 1996; Wagner 1996a). Despite this, the advocates of ghost lineages have not presented empirical evidence that the method improves diversity curves. Apart from one early study (Smith 1988), most published results seem to show that monophyletic higher taxa (Sepkoski and Kendrick 1993) and ghost lineages (Wagner 1995a) tell much the same story as uncorrected species-level age-ranges.
.......... Therefore, it is of considerable interest that the hybrid ghost lineage algorithm used in this study does seem to have positive effects - but does not work any better than several less contentious, complicated, and cumbersome methods (Fig. 3G). It does perform as well as most methods with respect to preserving the variance, slope, serial correlation, overall trajectory, and net changes of the uncorrected curve; its predicted mean value of 186 species seems reasonable; and its corrections are detectably correlated with variation in sampling, measured either by the number of lists or of records (Tables 1 and 2; Fig. 4G). These two correlations, however, are not nearly as strong as for the rarefaction method (Fig. 4H), which shares all of the ghost taxon method's other positive features.
.......... Interestingly, the ghost taxon curve is almost as strongly correlated with the Lazarus curve as with the uncorrected curve (rank-order correlation = +0.776 vs. +0.840). These two correction methods yield very similar results, with the most important difference being a smoothing out of the late Miocene extinction event in the ghost lineage curve (about 7 - 5 Ma). This is surprising because simulation studies (Sepkoski and Kendrick 1993) suggested that monophyletic higher taxa accurately reflect mass extinction horizons. Perhaps ghost lineages and monophyletic taxa are not comparable in this respect. In any event, the feature does appear to reflect some kind of a bias because all other methods depict a sharp drop in diversity at this time.
.......... Even though the simulated ghost lineage method seems to have performed well in the present study, this cannot be taken as a general endorsement. First of all, the original, cladogram-based method remains extremely unwieldy, requiring detailed knowledge of all phylogenetic relationships. With even fundamental issues of alpha taxonomy being unresolved in most fossil groups, this seems like an unattainable goal for any but the most modest studies (but see Wagner 1995b). Secondly, several negative features of the method were not addressed by the test statistics, including its inclination to erase known extinction events, its lopsided correction of diversity peaks, and its tendency to inflate exponential diversifications and overemphasize diversity plateaus (Wagner 1996b).
.......... Finally, the "phylogeny" used in this study was a hybrid of known generic affinities and optimally randomized within-genus relationships. This "phylogeny" was designed to minimize possible problems with poor age-rank/clade-rank correlation (Norell and Novacek 1992) that might lead to the basal placement of derived taxa, and therefore the gross inflation of apparent diversity. In other words, because the algorithm was designed to minimize the impact of ghost lineages, any biases they might have created were also minimized. Together, all of this suggests that investigators should be very wary of using only the ghost lineage method to correct diversity curves: if at all possible, other approaches like the Lazarus method should also be employed.
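.......... The basic logic of ghost lineages can be sketched in a few lines, assuming a fully resolved cladogram written as nested tuples and no recognized ancestors. This is a simplified illustration of the sister-group argument only, not the hybrid algorithm with randomized within-genus relationships used in this study; the function and variable names are hypothetical.

```python
def ghost_lineages(tree, first_ma):
    """Push origins back so that sister lineages arise at the same time.

    tree: a taxon name (str) or a tuple of subtrees.
    first_ma: observed first appearance of each taxon, in Ma.
    Returns (extended, stems): corrected first appearances of the terminal
    taxa, and (older_ma, younger_ma) spans of unsampled internal branches.
    """
    extended = {}
    stems = []

    def node_age(node):
        if isinstance(node, str):
            return first_ma[node]
        child_ages = [node_age(child) for child in node]
        # A node is at least as old as its oldest descendant
        age = max(child_ages)
        for child, child_age in zip(node, child_ages):
            if isinstance(child, str):
                extended[child] = age
            elif child_age < age:
                stems.append((age, child_age))
        return age

    root_age = node_age(tree)
    if isinstance(tree, str):
        extended[tree] = root_age
    return extended, stems


extended, stems = ghost_lineages((("A", "B"), "C"),
                                 {"A": 10.0, "B": 12.0, "C": 20.0})
```

In this toy case taxon A is extended back to 12 Ma to match its sister B, and the unsampled stem of the clade A + B adds one ghost lineage from 20 to 12 Ma.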
Rarefaction.--The results show that rarefaction is unquestionably the most reliable of all the methods, although it creates a different pattern than any of the others (Fig. 3H), as indicated by its steeper and smoother trend (Table 1). It closely matches the other successful methods by the other measures (Tables 1 and 2), for example preserving net changes better than any other algorithm. And it far and away outperforms the rest in terms of the most important test: the correlation between the changes it imposes and sampling variation, as measured by faunal lists or taxonomic records (Table 2; Fig. 4H).
.......... This strong performance is entirely predictable. Rarefaction makes more direct use of sampling intensity data than any other method, and it is the only one that requires having such data except for the highly unreliable ecological extrapolation methods. It also is the only method to interpolate diversity levels instead of extrapolating them. Given the built-in difficulties with extrapolation, it is no surprise that this property ends up being a strength.
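.......... As a rough sketch of how rarefaction interpolates diversity, the following fragment draws a fixed quota of taxonomic records from each interval and then counts the species whose subsampled age-ranges span each interval. The record format, the quota, and the single random trial are assumptions made for illustration; in practice one would presumably average over many trials.

```python
import random
from collections import defaultdict


def rarefied_diversity(records, quota, seed=0):
    """Range-through diversity after subsampling records to a fixed quota.

    records: iterable of (species, interval_index) taxonomic records.
    quota: number of records drawn without replacement per interval
           (all records are used if an interval has fewer than the quota).
    """
    rng = random.Random(seed)
    by_interval = defaultdict(list)
    for species, interval in records:
        by_interval[interval].append(species)

    # Subsample each interval and track first and last sampled appearances
    first, last = {}, {}
    for interval, pool in by_interval.items():
        for species in rng.sample(pool, min(quota, len(pool))):
            first[species] = min(first.get(species, interval), interval)
            last[species] = max(last.get(species, interval), interval)

    # Count every species in each interval between its range end-points
    diversity = defaultdict(int)
    for species in first:
        for interval in range(first[species], last[species] + 1):
            diversity[interval] += 1
    return dict(diversity)


records = [("a", 0), ("a", 2), ("b", 1), ("b", 1), ("c", 2)]
full = rarefied_diversity(records, quota=10)
```

Because the quota is set by the least intensively sampled intervals of interest, the estimate for every interval is based on data that actually exist, rather than on an extrapolation beyond them.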
.......... That said, it is troublesome that the rarefaction curve does differ visibly from all the other curves in its general features, as expressed by rank-order correlations (uncorrected: r = +0.654; Lazarus taxa: r = +0.589; confidence intervals: r = +0.604; ghost lineages: r = +0.604). Considering all the correlations together, it seems that the uncorrected, Lazarus, and ghost lineage curves form a group, with the rarefaction and confidence interval methods falling at opposite ends of a spectrum and rarefaction being generally the least conservative.
.......... Only one method can be the closest to the truth. The confidence interval approach can be discounted because its distinguishing characteristic is a failure to remove some of the very highest diversity peaks that clearly are related to sampling, such as in the Pleistocene. The Lazarus and ghost lineage curves have to be treated as a unit; they differ importantly only in that the latter obscures a probable extinction episode in the latest Miocene. The key issue, then, is the treatment of the middle Tertiary. Rarefaction shows a gradual climb in diversity from the late Paleocene (about 57 Ma) all the way through to the late Eocene (about 43 Ma). After an early Oligocene low that is seen in all the curves, diversity is shown as reaching an early Miocene peak and then declining. The two other curves show a very rapid increase in diversity close to the Paleocene-Eocene boundary (about 55 Ma), followed by a decline through the rest of the epoch; and they suggest lower diversity in the first half of the Miocene than the second.
.......... These differences may be due to overcompensation by the rarefaction method. Preliminary data suggest that alpha diversity was at a peak in the early Eocene. If so, then the number of taxonomic records would be an overly generous estimate of the number of specimens at this time; 100 early Eocene records would represent fewer specimens than, say, 100 late Miocene records. Therefore, there may be a systematic undersampling of this interval in the rarefaction analysis. Such an argument, however, would fail to account for the late Oligocene - late Miocene pattern.
.......... The other possibility is that the Lazarus taxa and ghost lineages both fail to correct fully for the extreme differences in sampling intensity between the early and late Eocene, or early and late Miocene. This is suggested by the poorer match of their correction ratios to observed sampling intensity. If so, then the rarefaction curve's unexpected features are telling us that because of their conservativeness, all other methods are systematically failing to capture important patterns.
.......... Regardless of this issue, virtually every analysis shows a major drop in diversity at the end of the Eocene (typically between 35 and 34 Ma), with recovery not beginning until well into the Oligocene (typically 30 - 29 Ma). This is in direct conflict with the recently published analysis of Prothero and Heaton (1996), which purported to show stasis in mammalian diversity from the late Eocene to the early Oligocene and yet failed to address sampling biases in any way. The curves essentially agree on all other major features: an extraordinarily rapid diversification in the earliest Paleocene, a low plateau through the rest of this epoch, a higher plateau from the Eocene to the present with no large peak in the Miocene, and a rapid Pliocene diversification in the wake of the end-Miocene event.
.......... In terms of these general patterns, then, the Lazarus taxon, ghost lineage, and rarefaction methods are all in agreement. The differences are restricted to a handful of features whose significance is not certain. Despite its vulnerability to long-term changes in alpha diversity, rarefaction on the basis of taxonomic records still seems to be the best approach because the method's corrections are much more strongly related to sampling intensity than are any other's.
.......... This raises a final issue, which is the choice of taxonomic records as the standard for rarefaction instead of faunal lists per se. If the number of specimens per locality were a less biased and more constant variable than the number of specimens per taxonomic record, then faunal lists would be a better rarefaction standard. That this is not the case is suggested by the consistently stronger correlation between the number of records and various sampling correction ratios than between the number of lists and the ratios. The same inequality is seen with all three of the most reliable methods, despite the fundamental differences among them. Although the exact number of specimens would be the best indicator of all, at present the number of records appears to be the best available option. Additionally, the fact that rarefaction provides explicitly corrected age-ranges makes it the only easily applied, reliable method that generates not just a diversity curve, but origination and extinction counts.

Discussion

..........There are several possible objections to the results presented in this study. First, because the analyses are based on real, not simulated data, the "true" diversity pattern is not known and the evaluative criteria are purely descriptive. Second, the results only apply to a single data set that may not be representative of most other paleontological data. Third, little effort has been made to vary a few parameters, like the level of rarefaction, that are not fixed by the data and might have changed the results. And finally, most of the methods were not originally intended to correct for sampling effects in diversity curves, perhaps making the tests quixotic.
.......... All of these concerns have a rational basis. However, none of them cuts to the chase: for this data set, these methods do provide dramatically different results. Therefore, most of the methods must have failed to provide an optimal correction. All of these methods are plausible ways to remove sampling effects from diversity curves; a priori arguments would not have clearly predicted their behavior. For example, we could not have known beforehand that the Chao-2 method would be so vulnerable to binomial error, or that the FreqRat method would seem to show a peculiar sensitivity to the geographic dispersal of samples, or that the intuitively powerful Lazarus taxon method would be so mediocre. And because the results are tied so inextricably to complex features of the data, such as staggered and massive sampling peaks, long-term changes in turnover rates, a strongly equilibrial diversification pattern, and sporadic origination and extinction episodes, simulation studies most likely would have failed to uncover the subtle distinctions among the methods.
.......... The fundamental issue raised by these particular concerns, then, is just that more empirical research should be performed. More data sets should be analyzed, more parameters should be varied, more evaluative criteria should be considered, and more methods should be studied. The current analysis therefore should be seen only as the first step in a promising direction.

Conclusion

..........The problem of variable sampling was recognized early in the modern debate over Phanerozoic diversification patterns (e.g., Simpson 1960; Raup 1972, 1976). However, sampling eventually was dismissed as insignificant or intractable (Bambach 1977; Sepkoski et al. 1981; Sepkoski 1994). Sepkoski (1993a) maintains that his widely-studied synoptic database of marine family-level diversity probably is not subject to severe biases because large quantities of new data have confirmed patterns seen in his initial analyses (Sepkoski 1978, 1979, 1982).
.......... Despite some dissent (Signor 1978, 1985; Smith 1988), most other authors also have accepted this conclusion. For example, Niklas (1978) and Niklas et al. (1980) first followed Raup's lead by arguing persuasively that sampling intensity, even as measured by a poor proxy like rock volume, is a major determinant of apparent fossil plant diversity. But later Niklas et al. (1983, 1985) made no effort to correct for the problem. In a study of Cenozoic North American mammals, Van Valkenburgh and Janis (1993) also ignored the problem of sampling intensity, as well as that of temporal autocorrelation. But their "significantly" positive correlations between local and continental diversity, turnover and diversity, and prey and predator diversity all could have been predicted as a side-effect of sampling bias. Most recently, Benton (1995) repeated earlier arguments against the importance of sampling artifacts and used all the available raw data to depict Phanerozoic diversification patterns, drawing dangerous conclusions as a result.
.......... Far from encouraging such optimism, the results presented here should reinforce three general arguments that sampling problems always should be addressed in studies of diversity. First, the immense temporal variation in sampling intensity depicted by Fig. 2 illustrates that contrary to Sepkoski (1993a), sample size by itself is not the key problem: strong and predictable variation in sample size is. Such variation could not be dampened by binning the data into 2 m.y., 5 m.y., or even 10 m.y. intervals. There is no a priori reason to believe that sampling variation in the North American fossil mammal record is uniquely severe; probably every well-studied fossil record suffers from similar handicaps.
.......... Second, the wide discrepancy between the uncorrected diversity curve (Fig. 1) and the corrected curves, in particular the apparently most reliable one (Fig. 3H), shows that sampling effects are important even for data sets that otherwise have many advantages. Standardized taxonomy, complete literature coverage, numerous data points, independent temporal controls, careful geographic restrictions, and evenly spaced sampling intervals are no guarantee of accuracy.
.......... Finally and most importantly, the better approaches appear to be quite straightforward in their logic and application. The Lazarus method, confidence intervals, simulated ghost lineages, and rarefaction all yield similar results; the Lazarus method requires relatively minimal information about temporal distributions, and rarefaction has powerful support from the test statistics. With four justifiable methods to choose from, workers who do not wish to correct their data for sampling effects should explain why they are not able to do so. Unfortunately, most well-established global diversity databases do record only first and last appearances, making any direct corrections difficult. Therefore, it may now be time to move away from such databases and return to the locality-oriented approach pioneered by Bambach (1977). Some appropriate data sets already have been compiled, and with a bit of effort these can be transformed into large-scale diversity patterns that are mostly free from sampling effects.

Acknowledgments

..........This study was made possible by D. Raup and J. Sepkoski's pioneering studies of sampling and diversification patterns. S. Alin, M. Foote, J. Hunter, J. Huss, J. Sepkoski, P. Wagner, and P. Wilf provided extremely valuable comments on the manuscript. I also thank S. Wing and other colleagues at the Smithsonian Institution and the University of Chicago for helpful discussions. My research was supported by the Smithsonian Institution.


Literature cited

Alroy, J. 1992. Conjunction among taxonomic distributions and the Miocene
.......... mammalian biochronology of the Great Plains. Paleobiology 18:326-343.
--------. 1994. Appearance event ordination: a new biochronologic method.
.......... Paleobiology 20:191-207.
--------. 1996. Constant extinction, constrained diversification, and
.......... uncoordinated stasis in North American mammals. Palaeogeography,
.......... Palaeoclimatology, Palaeoecology 127:285-311.
--------. In press. Long-term equilibrium in North American mammalian
.......... diversity. In M. McKinney, ed. Biodiversity dynamics: turnover of populations,
.......... taxa and communities. Columbia University Press, New York.
Anderson, J., H. Anderson, P. Fatti, and H. Sichel. 1996. The Triassic
.......... Explosion(?): a statistical model for extrapolating biodiversity based on the
.......... terrestrial Molteno Formation. Paleobiology 22:318-329.
Bambach, R. K. 1977. Species richness in marine benthic habitats through the
.......... Phanerozoic. Paleobiology 3:152-167.
Batten, R. L. 1973. The vicissitudes of the gastropods during the interval of
.......... Guadalupian-Ladinian time. Pp. 596-607 in A. Logan and L. V. Hills, eds. The
.......... Permian and Triassic systems and their natural boundary. Canadian Society of
.......... Petroleum Geologists, Calgary, Alberta.
Benton, M. J. 1995. Diversification in the history of life. Science
.......... 268:52-58.
Colwell, R. K., and J. A. Coddington. 1994. Estimating terrestrial
.......... biodiversity through extrapolation. Philosophical Transactions of the Royal
.......... Society of London, Series B 345:101-118.
Erwin, D. H. 1996. Understanding biotic recoveries: extinction, survival, and
.......... preservation during the end-Permian mass extinction. Pp. 398-418 in
.......... D. Jablonski, D. H. Erwin, and J. H. Lipps, eds. Evolutionary paleobiology.
.......... University of Chicago Press, Chicago.
Erwin, D. H., and M. L. Droser. 1994. Elvis taxa. Palaios 8:623-624.
Fisher, D. C. 1982. Phylogenetic and macroevolutionary patterns within the
.......... Xiphosurida. Proceedings of the Third North American Paleontological Convention
.......... 1:175-180.
--------. 1994. Stratocladistics: morphological and temporal patterns and
.......... their relation to phylogenetic process. Pp. 133-171 in L. Grande and
.......... O. Rieppel, eds. Interpreting the hierarchy of nature. Academic Press, San Diego.
Foote, M. 1992. Rarefaction analysis of morphological and taxonomic
.......... diversity. Paleobiology 18:1-16.
--------. 1994. Temporal variation in extinction risk and temporal scaling of
.......... extinction metrics. Paleobiology 20:424-444.
--------. 1996. Perspective: evolutionary patterns in the fossil record.
.......... Evolution 50:1-11.
Foote, M., and D. M. Raup. 1996. Fossil preservation and the stratigraphic
.......... ranges of taxa. Paleobiology 22:121-140.
Hennig, W. 1966. Phylogenetic systematics. University of Illinois Press,
.......... Urbana, Illinois.
Holman, E. W. 1985. Gaps in the fossil record. Paleobiology 11:221-226.
Jablonski, D. 1986. Causes and consequences of mass extinction: a comparative
.......... approach. Pp. 183-229 in D. K. Elliott, ed. Dynamics of mass extinction. John
.......... Wiley & Sons, New York.
Krause, D. W., and M. C. Maas. 1990. The biogeographic origins of late
.......... Paleocene-early Eocene mammalian immigrants to the Western Interior of North
.......... America. Geological Society of America Special Paper 243:71-105.
Lidgard, S., and P. R. Crane. 1988. Quantitative analyses of the early
.......... angiosperm radiation. Nature 331:344-346.
--------. 1990. Angiosperm diversification and Cretaceous floristic trends: a
.......... comparison of palynofloras and leaf macrofloras. Paleobiology 16:77-93.
Lidgard, S., F. K. McKinney, and P. D. Taylor. 1993. Competition, clade
.......... replacement, and a history of cyclostome and cheilostome bryozoan diversity.
.......... Paleobiology 19:352-371.
Lupia, R., P. R. Crane, and S. Lidgard. In press. Angiosperm diversification
.......... and mid-Cretaceous environmental change. In S. J. Culver and P. F. Lawson, eds.
.......... Biotic response to global change: the last 145 million years. Chapman and Hall,
.......... London.
Maas, M. C., M. R. L. Anthony, P. D. Gingerich, G. F. Gunnell, and D. W.
.......... Krause. 1995. Mammalian generic diversity and turnover in the Late Paleocene
.......... and Early Eocene of the Bighorn and Crazy Mountains Basins, Wyoming and Montana
.......... (USA). Palaeogeography, Palaeoclimatology, Palaeoecology 115:181-208.
MacFadden, B. J., and R. C. Hulbert, Jr. 1988. Explosive speciation at the base of
.......... the adaptive radiation of Miocene grazing horses. Nature 336:466-468.
Magurran, A. E. 1988. Ecological diversity and its measurement. Princeton
.......... University Press, Princeton, New Jersey.
Markwick, P. J. 1994. "Equability," continentality and Tertiary "climate":
.......... the crocodilian perspective. Geology 22:613-616.
Marshall, C. R. 1990. Confidence intervals on stratigraphic ranges.
.......... Paleobiology 16:1-10.
--------. 1994. Confidence intervals on stratigraphic ranges: partial
.......... relaxation of the assumption of randomly distributed fossil horizons.
.......... Paleobiology 20:459-469.
--------. 1995. Distinguishing between sudden and gradual extinctions in the
.......... fossil record: predicting the position of the Cretaceous-Tertiary iridium
.......... anomaly using the ammonite fossil record on Seymour Island, Antarctica. Geology
.......... 23:731-734.
Marshall, C. R., and P. D. Ward. 1996. Sudden and gradual molluscan
.......... extinctions in the latest Cretaceous of western European Tethys. Science
.......... 274:1360-1363.
May, R. M. 1975. Patterns of species abundance and diversity. Pp. 81-120 in M. L.
.......... Cody and J. M. Diamond, eds. Ecology and evolution of communities. Belknap,
.......... Cambridge, Massachusetts.
McKinney, M. L. 1990. Classifying and analysing evolutionary trends. Pp. 28-58 in
.......... K. J. McNamara, ed. Evolutionary trends. University of Arizona Press, Tucson,
.......... Arizona.
--------. 1995. Extinction selectivity among lower taxa: gradational patterns
.......... and rarefaction error in extinction estimates. Paleobiology 21:300-313.
Miller, A. I., and M. Foote. 1996. Calibrating the Ordovician Radiation of
.......... marine life: implications for Phanerozoic diversity trends. Paleobiology
.......... 22:304-309.
Nichols, J. D., and K. H. Pollock. 1983. Estimating taxonomic diversity,
.......... extinction rates, and speciation rates from fossil data using capture-recapture
.......... models. Paleobiology 9:150-163.
Niklas, K. J. 1978. Coupled evolutionary rates and the fossil record.
.......... Brittonia 30:373-394.
Niklas, K. J., B. H. Tiffney, and A. H. Knoll. 1980. Apparent changes in the
.......... diversity of fossil plants: a preliminary assessment. Evolutionary Biology
.......... 12:1-89.
--------. 1983. Patterns in vascular plant diversification: a statistical
.......... analysis at the species level. Nature 303:614-616.
--------. 1985. Patterns in vascular plant diversification: an analysis at
.......... the species level. Pp. 97-128 in J. W. Valentine, ed. Phanerozoic diversity
.......... patterns: profiles in macroevolution. Princeton University Press, Princeton,
.......... New Jersey.
Norell, M. A. 1992. Taxic origin and temporal diversity: the effect of
.......... phylogeny. Pp. 89-118 in M. J. Novacek and Q. D. Wheeler, eds. Extinction and
.......... phylogeny. Columbia University Press, New York.
--------. 1996. Ghost taxa, ancestors, and assumptions: a comment on Wagner.
.......... Paleobiology 22:453-455.
Norell, M. A., and M. J. Novacek. 1992. The fossil record and evolution:
.......... comparing cladistic and paleontologic evidence for vertebrate history. Science
.......... 255:1690-1693.
Paul, C. R. C. 1982. The adequacy of the fossil record. Pp. 75-117 in K. A.
.......... Joysey and A. E. Friday, eds. Problems of phylogenetic reconstruction.
.......... Academic, London.
Pease, C. M. 1985. Biases in the durations and diversities of fossil taxa.
.......... Paleobiology 11:272-292.
--------. 1992. On the declining extinction and origination rates of fossil
.......... taxa. Paleobiology 18:89-92.
Prothero, D. R., and T. H. Heaton. 1996. Faunal stability during the Early
.......... Oligocene climatic crash. Palaeogeography, Palaeoclimatology, Palaeoecology
.......... 127:257-283.
Raup, D. M. 1975. Taxonomic diversity estimation using rarefaction.
.......... Paleobiology 1:333-342.
--------. 1976. Species diversity in the Phanerozoic: an interpretation.
.......... Paleobiology 2:289-297.
Raup, D. M., and G. E. Boyajian. 1988. Patterns of generic extinction in the fossil
.......... record. Paleobiology 14:109-125.
Raymond, A., and C. Metz. 1995. Laurussian land-plant diversity during the
.......... Silurian and Devonian: mass extinction, sampling bias, or both? Paleobiology
.......... 21:74-91.
Sepkoski, J. J., Jr. 1978. A kinetic model of Phanerozoic taxonomic
.......... diversity. I. Analysis of marine orders. Paleobiology 4:223-251.
--------. 1979. A kinetic model of Phanerozoic taxonomic diversity. II. Early
.......... Phanerozoic families and multiple equilibria. Paleobiology 5:222-251.
--------. 1982. A compendium of fossil marine families. Milwaukee Public
.......... Museum Contributions in Biology and Geology 51.
--------. 1988. Alpha, beta, gamma: where does all the diversity go?
.......... Paleobiology 14:221-234.
--------. 1991. Population biology models in macroevolution. Pp. 136-156 in
.......... N. L. Gilinsky and P. W. Signor, eds. Analytical paleobiology. Paleontological
.......... Society Short Courses in Paleobiology 4.
--------. 1993a. Ten years in the library: new data confirm paleontological
.......... patterns. Paleobiology 19:43-51.
--------. 1993b. Phanerozoic diversity at the genus level: problems and
.......... prospects. Geological Society of America Abstracts with Programs 25(6):A-50.
--------. 1994. Limits to randomness in paleobiologic models: the case of
.......... Phanerozoic species diversity. Acta Palaeontologica Polonica 38:175-198.
Sepkoski, J. J., Jr., R. K. Bambach, D. M. Raup, and J. W. Valentine. 1981.
.......... Phanerozoic marine diversity and the fossil record. Nature 293:435-437.
Sepkoski, J. J., Jr., and D. C. Kendrick. 1993. Numerical experiments with
.......... model monophyletic and paraphyletic taxa. Paleobiology 19:168-184.
Sepkoski, J. J., Jr., and A. I. Miller. 1985. Evolutionary faunas and the
.......... distribution of Paleozoic benthic communities in space and time. Pp. 153-190 in
.......... J. W. Valentine, ed. Phanerozoic diversity patterns: profiles in
.......... macroevolution. Princeton University Press, Princeton, New Jersey.
Sepkoski, J. J., Jr., and P. M. Sheehan. 1983. Diversification, faunal
.......... change, and community replacement during the Ordovician radiations. Pp. 673-717
.......... in M. J. S. Tevesz and P. M. McCall, eds. Biotic interactions in Recent and
.......... fossil benthic communities. Plenum, New York.
Signor, P. W., III. 1978. Species richness in the Phanerozoic: an
.......... investigation of sampling effects. Paleobiology 4:394-406.
--------. 1985. Real and apparent trends in species richness through time.
.......... Pp. 129-150 in J. W. Valentine, ed. Phanerozoic diversity patterns: profiles in
.......... macroevolution. Princeton University Press, Princeton, New Jersey.
Simpson, G. G. 1960. The history of life. Pp. 117-180 in S. Tax, ed. The evolution
.......... of life. University of Chicago Press, Chicago.
Smith, A. B. 1994. Systematics and the fossil record: documenting
.......... evolutionary patterns. Blackwell Scientific, Oxford, England.
Solow, A. R. 1996. Tests and confidence intervals for a common upper endpoint
.......... in fossil taxa. Paleobiology 22:406-410.
Strauss, D., and P. M. Sadler. 1989. Classical confidence intervals and
.......... Bayesian probability estimates for ends of local taxon ranges. Mathematical
.......... Geology 21:411-427.
Van Valkenburgh, B., and C. M. Janis. 1993. Historical diversity patterns in
.......... North American large herbivores and carnivores. Pp. 330-340 in R. E. Ricklefs
.......... and D. Schluter, eds. Species diversity in ecological communities: historical
.......... and geographical perspectives. University of Chicago Press, Chicago.
Wagner, P. J. 1995a. Stratigraphic tests of cladistic hypotheses.
.......... Paleobiology 21:153-178.
--------. 1995b. Diversity patterns among early Paleozoic gastropods:
.......... contrasting taxonomic and phylogenetic descriptions. Paleobiology
.......... 21:410-439.
--------. 1996. Ghost taxa, ancestors, assumptions, and expectations: a reply
.......... to Norell. Paleobiology 22:456-460.
Wing, S. L., J. Alroy, and L. J. Hickey. 1995. Plant and mammal diversity in
.......... the Paleocene to Early Eocene of the Bighorn Basin. Palaeogeography,
.......... Palaeoclimatology, Palaeoecology 115:117-156.
Wing, S. L., and W. A. DiMichele. 1995. Conflict between local and global
.......... changes in plant diversity through geological time. Palaios 10:551-564.
Woodburne, M. O., ed. 1987. Cenozoic mammals of North America: geochronology and
.......... biostratigraphy. University of California Press, Berkeley.
Woodburne, M. O., and C. C. Swisher, III. 1995. Land mammal high resolution
.......... geochronology, intercontinental overland dispersals, sea-level, climate, and
.......... vicariance. Pp. 329-258 in W. A. Berggren, D. V. Kent, and J. Hardenbol, eds.
.......... Geochronology, time scales and global stratigraphic correlations: a unified
.......... temporal framework for an historical geology. Society of Economic Paleontologists
.......... and Mineralogists, Special Publication 54.


Table 1. Summary of basic statistics describing sampling-corrected diversity curves. Methods are discussed in the text. mean = mean diversity, computed after a log transformation but expressed in linear units; CV = coefficient of variation on a log scale; increase = mean rate of increase, estimated by a least-squares regression of logged diversity against time; r = serial correlation, based on log- and rank order-transformed data.

Method            mean     CV      increase   r
Raw data           95.2     6.4    +0.77%     +0.731
Lumped            124.8     4.3    +0.50%     +0.728
No singletons     104.4     5.2    +0.56%     +0.785
Chao-2            176.8     5.8    +0.61%     +0.130
FreqRat           168.4     9.0    +1.88%     +0.620
Lazarus taxa      156.9     3.7    +0.55%     +0.527
Confidence        110.6    11.0    +1.33%     +0.756
Ghost lineages    185.8     5.0    +0.85%     +0.663
Rarefaction        63.1     7.2    +0.89%     +0.785
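
The first three statistics in Table 1 can be sketched as follows, under stated assumptions: the mean is taken to be a geometric mean, the CV is taken to be the standard deviation of logged diversity as a percentage of its mean, and the rate of increase is taken to be the least-squares slope of logged diversity against time, expressed as a percentage per m.y. The caption does not spell out these details, so the function below is a hypothetical reading and not a definitive reconstruction; the serial correlation is omitted.

```python
import math
from statistics import mean, stdev


def curve_summary(diversity, ages_ma):
    """Return (mean, CV, increase) for one diversity curve.

    diversity: species counts per interval (all > 0).
    ages_ma: interval ages in Ma, in the same order as diversity.
    The definitions of CV and increase are assumptions (see lead-in).
    """
    logs = [math.log(d) for d in diversity]
    geometric_mean = math.exp(mean(logs))
    cv_percent = 100.0 * stdev(logs) / mean(logs)

    # Time runs forward as age decreases, so regress on negated ages
    t = [-age for age in ages_ma]
    t_bar, y_bar = mean(t), mean(logs)
    slope = (sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, logs))
             / sum((ti - t_bar) ** 2 for ti in t))
    increase_percent = 100.0 * slope
    return geometric_mean, cv_percent, increase_percent


# Toy curve growing by about 1% per m.y. over five 1 m.y. intervals
toy = [math.exp(4.0 + 0.01 * i) for i in range(5)]
summary = curve_summary(toy, ages_ma=[4, 3, 2, 1, 0])
```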


Table 2. Summary of correlations describing sampling-corrected diversity curves. Methods are discussed in the text. All correlations were computed after rank-order transformations of the variables. Significance tests of correlations are precluded by strong autocorrelation of the data. vs. D = correlation of corrected curve with uncorrected diversity curve; vs. dD = correlation of corrected net changes in diversity with net changes in uncorrected curve; vs. # lists = correlation of correction ratio (corrected/uncorrected diversity) with number of faunal lists per interval; vs. # records = correlation of correction ratio with number of taxonomic records per interval.

Method            vs. D     vs. dD    vs. # lists   vs. # records
Lumped            +0.855    +0.690    +0.208        +0.343
No singletons     +0.896    +0.625    +0.121        +0.310
Chao-2            +0.466    +0.040    -0.230        -0.223
FreqRat           +0.644    +0.485    +0.072        +0.084
Lazarus taxa      +0.757    +0.205    -0.251        -0.300
Confidence        +0.860    +0.553    -0.144        -0.232
Ghost lineages    +0.840    +0.426    -0.141        -0.327
Rarefaction       +0.654    +0.883    -0.652        -0.806
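
The last two columns of Table 2 can be illustrated with a short sketch: the correction ratio (corrected/uncorrected diversity) is computed for each interval and then correlated with sampling intensity after a rank-order transformation. The helper names and the toy numbers are hypothetical; tied values are assigned average ranks, which is a common convention but an assumption here.

```python
import math
from statistics import mean


def ranks(values):
    """Rank-order transformation, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rank_correlation(x, y):
    """Product-moment correlation of the ranks of x and y."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def correction_vs_sampling(corrected, uncorrected, sampling):
    """Correlate the per-interval correction ratio with sampling intensity."""
    ratio = [c / u for c, u in zip(corrected, uncorrected)]
    return rank_correlation(ratio, sampling)


# Toy case: raw diversity tracks sampling, corrected diversity is flat
r = correction_vs_sampling(corrected=[100, 100, 100, 100],
                           uncorrected=[50, 80, 120, 200],
                           sampling=[10, 40, 90, 300])
```

A strongly negative value, like those reported for rarefaction, means that diversity is adjusted downward most where sampling is most intense, which is the behavior expected of a method that is actually removing sampling effects.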


Figure captions

Figure 1. Diversity curve for Cenozoic North American mammals based on appearance event ordination. No corrections for sampling effects are employed.

Figure 2. Variation in sampling intensity through the Cenozoic. Note order-of-magnitude variability, lack of monotonic trend, strong similarity of neighboring values, and match of peaks and valleys with those seen in the diversity curve (Fig. 1). A, Number of faunal lists per 1.0 m.y. interval. B, Number of taxonomic records per 1.0 m.y. interval.

Figure 3. Diversity curves after correction for sampling. Y-axes are scaled so that the mean diversity values (Table 1) fall on the Y-axis at three-eighths the length of the X-axis. A, Based on lumping of age-ranges by intervals. B, Based on removal of singletons. C, Based on the Chao-2 equation. D, Based on the FreqRat equation. E, Based on the Lazarus taxon equation. F, Based on confidence intervals. G, Based on simulated ghost lineages. H, Based on rarefaction.

Figure 4. Relationship between sampling intensity (number of taxonomic records per 1.0 m.y.) and the sampling correction ratio (corrected/uncorrected diversity). Both variables are log transformed. Y-axis spans a fourfold range except where noted. A, Based on lumping of age-ranges by intervals. B, Based on removal of singletons. C, Based on the Chao-2 equation. Tenfold range. D, Based on the FreqRat equation. Tenfold range. E, Based on the Lazarus taxon equation. Fivefold range. F, Based on confidence intervals. One point omitted. G, Based on simulated ghost lineages. H, Based on rarefaction.