|
Article Excerpt 1. INTRODUCTION
A sample of individuals is collected from a population consisting of N distinct species (or classes), and their species is identified. If we suppose that N is unknown, then estimation of N based on such a sample is often termed the "species problem" in statistics nomenclature. In this article we consider the setting in which the sampling probabilities vary over species.
This article presents a penalized nonparametric maximum likelihood estimator (NPMLE) of N with desirable properties in bias, variability, and robustness. A penalized form is used to account for an inherent instability in the likelihood. Although the scope of the article is limited to the species problem, the penalized NPMLE and the computing algorithm developed here can be easily extended to the capture-mark-recapture problem as well as to other nonparametric mixture problems. The algorithm can also be used to construct profile likelihoods.
The species problem has a wide variety of important applications covering multiple disciplines including ecology, linguistics, and numismatics (see Bunge and Fitzpatrick 1993 for an extensive review). Existing methods in this area can be loosely classified into parametric or nonparametric, depending on whether the species abundance pattern is modeled by a parametric form. We believe that the nonparametric approaches are generally more desirable because they have competitive performance while adding robustness. Popular nonparametric approaches under comparison with the proposed approach in this article include the lower-bound estimator, [^.N.sub.c.sub.0], of Chao (1984); two coverage coefficient of variation (CV)-based estimators, [^.N.sub.c.sub.1] and [^.N.sub.c.sub.2], of Chao and Lee (1992) (denoted by [^.N.sub.2] and [^.N.sub.3] in their article); the coverage-duplication based solution [^.N.sub.c.sub.3] of Chao and Bunge (2002), the jackknife solution, [^.N.sub.J] of Burnham and Overton (1978, 1979), and the unconditional NPMLE solution, [^.N.sub.UNP], of Norris and Pollock (1996, 1998).
Bunge and Fitzpatrick (1993, p. 364) commented that "there is not as yet a globally preferable estimator." Simulation studies that have appeared in the literature are generally not conclusive. The weakness of the aforementioned approaches have also been recognized in the literature. As a lower-bound estimator, [^.N.sub.c.sub.0] is stable but usually biased downward. The coverage estimators [^.N.sub.c.sub.1], [^.N.sub.c.sub.2], and [^.N.sub.c.sub.3] depend on the choice of a tuning parameter [tau], where one splits the sample into "rare" (observed no more than [tau] times) and "abundant" (seen more than [tau] times) groups. One then estimates N based on the rare species frequencies (Chao, Ma, and Yang 1993; Chao, Huang, Chen, and Kuo 2000). However, it was noticed that [^.N.sub.c.sub.1], [^.N.sub.c.sub.2], and [^.N.sub.c.sub.3] are all sensitive to [tau] (Chao and Bunge 2002). As shown in our simulation study, a bad choice of [tau] can bias the estimate substantially. The coverage-duplication estimator [^.N.sub.c.sub.3] can fail due to a negative estimate of the duplication fraction (Chao and Bunge 2002). The jackknife estimator [^.N.sub.J] is robust, but often has a positive bias (Smith and van Belle 1984). The unconditional NPMLE [^.N.sub.UNP] achieves remarkable insensitivity to [tau] but requires extremely long computing times, especially when N is large. We also demonstrate that it has an instability problem.
Two nonparametric likelihood approaches exist, the unconditional and conditional, as identified by Sanathanan (1972). In this article we unify these approaches and extend them by considering the addition of a penalty term to the conditional likelihood. The unconditional NPMLE can then be obtained, to a high degree of approximation, by maximizing the conditional likelihood under an appropriate penalty. This device provides a great reduction in computing time for the unconditional estimator.
However, both the conditional and unconditional NPMLEs have a severe instability problem. We show that by modifying the penalty, one can stabilize the resulting estimators. We discuss the statistical interpretation of the penalty in terms of Bayesian estimation, and demonstrate that the addition of an adaptive quadratic penalty term greatly improves the performance of the conditional NPMLE over a wide range of simulation settings.
2. NONPARAMETRIC MAXIMUM LIKELIHOOD ESTIMATOR SOLUTIONS
Let X = {[X.sub.1],..., [X.sub.N]} be the number of observed individuals from each distinct species, where the unobserved (and therefore unknown) species have [X.sub.i] = 0. If so, then [n.sub.j] = [[summation].sub.i=1.sup.N] l([X.sub.i]=j), for j = 1,... becomes the number of those species that had j individuals in the sample. The variable [n.sub.0] then represents the unknown number of species that were present in the population but never observed.
Assume that the [X.sub.i]'s, for i = 1,..., N, are iid observations from f(x; M), where M represents a set of unknown parameters (or an unknown mixing distribution in the nonparametric mixture case). Let t = [max.sub.i]([x.sub.i]), and let D = [n.sub.1] + ... + [n.sub.t] be the total of observed distinct species. As shown by Sanathanan (1972), the likelihood function for this model can be factored into two parts,
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII]. (1)
Here the likelihood [L.sub.m](N, M) is from the binomial marginal distribution of D, which depends on both N and M. The conditional distribution of X given D generates [L.sub.c](M), which depends on M alone.
Sanathanan (1972, 1977) discussed two likelihood-based estimation methods for this problem. The unconditional method finds the pair ([^.N], [^.M]) that maximize L(N, M; X) globally. In the second method, one finds [^.M] by first maximizing [L.sub.c](M), then calculating the MLE of N from [L.sub.m](N, M) treating M = [^.M]. If we define the odds parameter as
[theta](M) = [f(0; M)]/[1 - f(0; M)], (2)
then, for a given [theta], the conditional estimator [^.N.sub.c] equals <D(1 + [theta])> = <D/[1 - f(0; M)]>, where "<a>" means the largest integer no greater than a (see also Lindsay and Roeder 1987). In other words, we obtain the conditional MLE given f(0; M) by simply rescaling D by the non-0 probability. [Because the penalized approaches under consideration in this article all concern [theta] directly rather than f(0; M), we write the conditional estimator [^.N.sub.c] in terms of [theta] in the following context.] Sanathanan (1972, 1977) also estabilished that the unconditional and conditional MLEs of N have the same asymptotic distribution when the regularity conditions hold (e.g., in the Poisson-gamma model).
2.1 Poisson Mixture Model and Conditional NPMLE
We assume the model where f in (1) is a Q-mixture of Poisson variables, that is,
f(x; Q) = [integral][[[e.sup.-[lambda]][[lambda].sup.x]]/x!]dQ([lambda]),
where Q is an arbitrary unknown distribution on the parameter space [OMEGA]. This arises from the following hierarchical model; Suppose that the sampling count for species i is a Poisson random variable with abundance parameter [[lambda].sub.i], and that the distribution of the abundance parameters among the species is Q. Mao and Lindsay (2003) identified that the conditional log-likelihood in (1) can be reparameterized into a P-mixture of 0-truncated Poisson densities as
[l.sub.c](P) = [t.summation over (j=1)] [n.sub.j] log[g(j; P)], (3)
where g(j; P) = [integral][[[e.sup.-[lambda]][[lambda].sup.j]]/[j!(1- [e.sup.-[lambda]]]]dP([lambda]) and
dP ([lambda]) = [(1 - [e.sup.-[lambda]])dQ([lambda])]/[[integral](1 - [e.sup.-[lambda]])dQ([lambda])]. (4)
If [OMEGA] = (0, [infinity]), then Q [right arrow] P is a one-to-one transformation with inverse dQ([lambda]) = [(1 - [e.sup.-[lambda]])[.sup.-1]dP([lambda])]/[[integral](1 - [e.sup.-[lambda]])[.sup.-1]dP([lambda])].
However, if we were to include the boundary point in [OMEGA], then the map is not invertible. This corresponds to the fact that because [n.sub.0] is not observed, any mass of Q placed at is not identifiable. It is natural to exclude from [OMEGA], because any species with [lambda] = is not actually present and should not count in N. (Looking ahead, we will show that the near nonidentifiability of the mass in Q for [lambda] near still presents a severe problem.)
In what follows, we reserve the notation Q for the mixing distribution of the original Poisson mixture and P for its 0-truncated transformation. Due to the invertible relationship between Q and P on (0, [infinity]), for convenience we write [l.sub.m] and [l.sub.c] in P or Q interchangeably, that is, [l.sub.m](N, Q) [equivalent to] [l.sub.m](N, P) and [l.sub.c](Q) [equivalent to] [l.sub.c](P).
The advantage of the form (3) is that it is a standard non-parametric mixture likelihood (Lindsay 1995) of iid observations from a P-mixture of 0-truncated Poisson variables. The absence of the parameter N in [l.sub.c] (P) makes this a simpler optimization problem. One can first find [^.P] from [l.sub.c](P), a conditional MLE. Notice that because [theta](Q) from (2) has the form [theta](Q) = [f(0;Q)]/[1 - f(0; Q)], we can write [theta] as a function of P in the simple linear form,
[theta] = [integral] ([e.sup.[lambda]] - 1)[.sup.-1] dP. (5)
We can then maximize the marginal likelihood with P fixed at [^.P] to obtain an N-estimator of the form <D(1 + [theta]([^.P]))>. We call the resulting estimator the conditional NPMLE and denote it by [^.N.sub.CNP], although technically it is a two-step "conditional-then-marginal" MLE.
The conditional NPMLE of N is simple to implement and fast to compute. However, our numerical experience shows that it has a severe instability problem related to the boundary of the parameter space. In these cases, the conditional NPMLE of [^.P] from (3) contains [lambda] at or near as a support point. In particular, if one were to allow Q to have mass at [lambda] = 0, then the NPMLE would often put positive mass there. In such cases, if one restricted the parameter space to [OMEGA] = (0, [infinity]), then the maximum over Q would not be attained (and an algorithm will keep trying to put mass near 0), because the solution is on the excluded boundary.
To solve this, we could try allowing [lambda] = to be a mass point in the algorithm. But this creates two new problems. First, we cannot invert the Q to P map to find a unique [^.Q], because the mass at [lambda] = is not well defined. Second, this situation has a severe impact on N estimation. Recall that the estimator of N is <D(1 + [theta]([^.P]))>, and see that [theta]([^.P]) goes to [infinity] as any support point in [^.P] approaches 0; see (5). That is, if we allow mass at [lambda] = 0, then the estimator will be [infinity], and if we allow mass arbitrarily near 0, then [^.N.sub.CNP] will "blow up." In practice, a tiny [lambda] component is frequently fit in [^.P], which blows up [^.N.sub.CNP].
2.2 A Special Challenge
This numerical instability is closely related to the results of Mao and Lindsay (2002), who identified some challenging theoretical defects in the model. In particular, they showed that it is theoretically impossible to create an upper confidence limit for N that would hold a target confidence level across all possible abundance distributions Q (unless one uses the trivial value of [infinity] as the upper limit).
There is a simple logic behind the mathematics. Suppose that there were M species with abundance [lambda] = 0. Of course, they are never seen, and so we could not hope to determine their number. Even though we have excluded [lambda] = from [OMEGA], this does not exclude the possibility that there are many species with vanishingly small [lambda], small enough that none of them would ever appear in a reasonable sample. To put this in another way, there is less and less statistical information about the distribution Q as [lambda] [right arrow] 0, and it is these small [lambda]'s that can have a large effect on N estimation.
There exist a substantial number of nonparametric estimators of N, and they are often accompanied by a method for constructing standard errors. How can this be if upper confidence limits are impossible? It is because these estimators might better be called proxy estimators, because they consistently estimate proxy functionals of the model that equal N only for some subset of all distributions Q. Outside this subset, the estimators display nonvanishing bias at all sample sizes.
The ramifications of this can be seen in our simulations. For many simulation settings, the 95% central portion of the estimators' sampling distributions will not even cover the true value of N. On the other hand, the NPMLE has a sampling distribution that typically covers the true value, but it does so by frequently creating very large values due to its aforementioned instability.
In the face of this, our goal in this research was to see whether one could penalize the nonparametric likelihood in such a way that the resulting NPMLE retained much of its flexibility relative to other methods, but also behaved more stably.
2.3 Penalized Likelihood
Let l(P) be a log-likelihood function, where in our case P is an unknown probability measure. The penalized likelihood corresponding to penalty parameter [gamma] and penalty function h(P) is defined as
[l.sup.[gamma]] (P) = l(P) - [gamma]h(P). (6)
The penalized conditional likelihood corresponding to the conditional likelihood [l.sub.c] in (3) is then
[l.sub.c.sup.[gamma]] (P) = [l.sub.c](P) - [gamma]h(P). (7)
The resulting penalized full log-likelihood is
[l.sup.[gamma]](P) = [l.sub.m](N, P) + [l.sub.c.sup.[gamma]](P).
We write h in terms of P rather than Q, because [l.sub.c] is more naturally parameterized by P in the conditional likelihood form (3).
Remark 1. For [gamma] > and h(P) > 0, clearly a maximizer of (7) tends to avoid P with large values of h(P). Note, however, that the form (7) also arises when one maximizes [l.sub.c] subject to the constraint h(P) = [h.sub.0] using the method of Lagrange multipliers, with the [gamma] being the Lagrange parameter. For this reason, our methods are also relevant for profile likelihood construction.
To estimate N, we first find the penalized NPMLE of P, [^.P.sup.[gamma]], from [l.sub.c.sup.[gamma]](P), then define the penalized conditional NPMLE of N to be
[^.N.sup.[gamma]] = <D(1 + [theta]([^.P.sup.[gamma]]))>. (8)
The penalized likelihood form in (7) provides rich possibilities for controlling the behavior of estimators of N. However, one important issue in construction of the penalty is that computing the NPMLE from [l.sub.c.sup.[gamma]](P) may not be trivial. It is known that if the functional h(P) is linear [i.e., of the form h(P) = [integral] h([lambda]) dP([lambda])], then the NPMLE can be found using a simple extension of standard NPMLE methods (Lindsay 1995). For example, the odds parameter [theta](P) would be such a linear functional, with h([lambda]) = ([e.sup.[lambda]] - 1)[.sup.-1]. In this article we consider three choices of penalty function h(P) all involving the odds parameter [theta]. One of the innovations of this article is a fast extension of the standard computing algorithm to the situation where the penalizing function is a differentiable function of a linear functional like [theta](P), thereby expanding our possibilities.
2.4 Unconditional NPMLE via Penalized Likelihood
The first choice of the penalizing function is
[h.sub.1](P) = log[[[theta](P)]/[1 + [theta](P)]].
Note that log[[[theta](P)]/[1+[theta](P)]] = log[f(0, P)] [less than or equal to] 0. A larger f(0, P) causes more penalty (or, strictly, less reward) if [gamma] >...
|