1. INTRODUCTION
Given a normally distributed $p$-dimensional sample $X_1, \ldots, X_n$, the use of Mahalanobis distances or radii (Mahalanobis 1936), defined as
$$r_i^n = \sqrt{(X_i - \bar{x})' S^{-1} (X_i - \bar{x})},$$
where $\bar{x}$ and $S$ denote the sample mean and sample covariance matrix, has a long history in statistics. For instance, Healy (1968) proposed a multivariate normal plot based on those radii that generalizes the quantile plot. Andrews, Gnanadesikan, and Warner (1973) showed that the $(r_i^n)^2$'s approximately follow a $\chi^2_p$ distribution. Gnanadesikan and Kettenring (1972) determined that the exact distribution of $(r_i^n)^2$ is actually a multiple of the beta distribution, but that the difference from the $\chi^2_p$ approximation is practically negligible. Koziol (1982) introduced a Cramér-von Mises test based on the just-mentioned quantile plot. Malkovich and Afifi (1973) presented a Shapiro-Wilk test statistic, and Hawkins (1981) presented an Anderson-Darling type test.
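The radii defined above can be computed directly from a data matrix. The following is a minimal pure-Python sketch, not the authors' implementation; the helper names `solve`, `sample_mean_cov`, and `mahalanobis_radii` are hypothetical, and the sample covariance is assumed to use the usual divisor n - 1.

```python
import math

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(A[i]) + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def sample_mean_cov(X):
    """Sample mean and sample covariance (divisor n - 1, an assumption)."""
    n, p = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(p)]
    cov = [[sum((row[j] - mean[j]) * (row[k] - mean[k]) for row in X) / (n - 1)
            for k in range(p)] for j in range(p)]
    return mean, cov

def mahalanobis_radii(X, center, scatter):
    """Radius sqrt((x - center)' scatter^{-1} (x - center)) for each row x."""
    radii = []
    for row in X:
        d = [xi - ci for xi, ci in zip(row, center)]
        z = solve(scatter, d)  # z = scatter^{-1} d, without forming the inverse
        radii.append(math.sqrt(sum(di * zi for di, zi in zip(d, z))))
    return radii

X = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
mean, cov = sample_mean_cov(X)
radii = mahalanobis_radii(X, mean, cov)
```

Because `mahalanobis_radii` takes the center and scatter as arguments, any location vector and positive definite scatter matrix can be passed in place of the sample mean and covariance.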
These approaches, however, suffer from a serious lack of robustness: outliers distort $\bar{x}$ and $S$, which leads to poor behavior under contamination models. To remove this drawback, the nonrobust estimators $\bar{x}$ and $S$ can be replaced by robust versions, leading to "generalized" radii,
$$r_i^n = \sqrt{(X_i - m_n)' \Sigma_n^{-1} (X_i - m_n)},$$
where $m_n$ and $\Sigma_n$ are robust location and scatter estimators. This approach has been considered by Rousseeuw and van Zomeren (1990) as an aid for determining outliers and influence points in regression and by Rocke and Woodruff (1996) for detecting outliers in multivariate datasets. Robust Mahalanobis distances also appear when applying "forward-search" procedures for detecting multiple multivariate and regression outliers and in the construction of graphical displays to visualize these (see, e.g., Hadi 1992; Atkinson 1993, 1994; Atkinson, Riani, and Cerioli 2004). Kosinski (1999) and Billor, Hadi, and Velleman (2000) proposed modifications of this "forward-search" method. These kinds of robust distances have also been considered by Willems, Pison, Rousseeuw, and van Aelst (2002) for robustifying Hotelling's $T^2$ test.
Because the observations with the largest $r_i^n$'s are the most remote ones in terms of $m_n$ and $\Sigma_n$, if we wish to trim a proportion $\alpha$ of observations, then it is natural to trim these first. If we consider the radii arranged in decreasing order, then we can define a generalized radius process that maps the trimming proportion $\alpha$ onto the radius $r_i^n$ corresponding to the observation just trimmed at that trimming level. In this article we derive a Gaussian limit law for this process when sampling from elliptically contoured distributions and with radii defined from general estimators $m_n$ and $\Sigma_n$ satisfying certain mild assumptions on the rates of convergence. The convergence is obtained for the process as a whole, which turns out to be useful for inferential purposes. The main result demonstrates that the limit law depends only (apart from the elliptical family considered) on how $\Sigma_n$ serves to estimate the "scale" factor through its determinant; it does not depend on how $m_n$ estimates the location or how $\Sigma_n$ estimates the "shape" of the distribution.
The adjustment of the empirical process when parameters are estimated goes back to the work of Durbin (1973). Here the approach is based on adjusting (when parameters are estimated) a "simple" multivariate quantile process. Other notions of multivariate quantiles have been given by Chaudhuri (1996).
Section 3 derives the associated influence functions, providing a rich infinitesimal robustness description and a tool for obtaining an explicit expression for the asymptotic limit law. Section 4 is devoted to obtaining critical values for finite sample sizes for the multivariate normal family and radii defined from minimum covariance determinant (MCD) estimators. Finally, Section 5 presents two applications designed to graphically test the goodness of fit of a dataset to a given elliptical family and whether we can ensure this fit apart from a proportion $\epsilon^*$ of outlying observations.
1.1 Generalized Radius Process
Suppose that the sample $X_1, \ldots, X_n$ was generated by a member $E(\mu, \Sigma)$ from a family of elliptically contoured distributions admitting a density function of the form
$$f(x) = |\Sigma|^{-1/2}\, h\big((x - \mu)' \Sigma^{-1} (x - \mu)\big),$$
where $\mu \in \mathbb{R}^p$ and $\Sigma$ is a positive definite symmetric (pds) $p \times p$ matrix.
Let us define a generalized radius variable $R = \|Y\|$, where $Y$ follows the canonical or standardized distribution of the elliptically contoured family considered, $Y \sim E(0, I)$; that is, if $X \sim E(\mu, \Sigma)$, then $Y = \Sigma^{-1/2}(X - \mu) \sim E(0, I)$. The density function of $R$ has the form
$$f_R(r) = \frac{2\pi^{p/2}}{\Gamma(p/2)}\, r^{p-1}\, h(r^2),$$
and we can use the inverse of the distribution function, $F_R^{-1}$, to define a generalized radius $r_\alpha$ for every trimming size $\alpha$ as $r_\alpha = F_R^{-1}(1 - \alpha)$. $F_R^{-1}$ is known or tabulated for well-known distributions, such as the normal and the multivariate $t$. Note also that if $P_X$ is the law induced in $\mathbb{R}^p$ by the random variable $X$, then
$$r_\alpha = \inf\{r : P_X(\{x : (x - \mu)' \Sigma^{-1} (x - \mu) \le r^2\}) \ge 1 - \alpha\}. \qquad (1)$$
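As a worked illustration of the theoretical radius, consider the bivariate standard normal case. This choice is an assumption made here for concreteness, not an example singled out in the text: $p = 2$ and $h(t) = (2\pi)^{-1} e^{-t/2}$. Substituting into the density of $R$ gives closed forms for $F_R$ and $r_\alpha$:

```latex
\begin{align*}
f_R(r) &= \frac{2\pi^{2/2}}{\Gamma(1)}\, r^{2-1}\, \frac{1}{2\pi}\, e^{-r^2/2} = r\, e^{-r^2/2}, \\
F_R(r) &= \int_0^r s\, e^{-s^2/2}\, ds = 1 - e^{-r^2/2}, \\
r_\alpha &= F_R^{-1}(1 - \alpha) = \sqrt{-2 \log \alpha}, \\
r_{0.05} &= \sqrt{-2 \log 0.05} \approx 2.448 .
\end{align*}
```

For general $p$ in the normal family, $R^2$ follows a $\chi^2_p$ distribution, so $r_\alpha$ is the square root of the corresponding $(1 - \alpha)$-quantile.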
Because $\mu$ and $\Sigma$ are unknown, we may consider consistent estimators $m_n \to \mu$ and $\Sigma_n \to \Sigma$ to define a sample analog of $r_\alpha$ as
$$r_\alpha^n = \inf\{r : P_n(\{x : (x - m_n)' \Sigma_n^{-1} (x - m_n) \le r^2\}) \ge 1 - \alpha\}. \qquad (2)$$
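Assuming that $P_n$ denotes the empirical measure of the sample (the excerpt does not define it explicitly), the infimum in (2) reduces to an order statistic of the generalized radii. The sketch below illustrates this; the function name `sample_radius` is hypothetical, and the radii are assumed to have been computed beforehand from some $m_n$ and $\Sigma_n$.

```python
import math

def sample_radius(radii, alpha):
    """Smallest r such that at least a fraction 1 - alpha of the radii are <= r."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    n = len(radii)
    # Number of observations that must fall inside the ellipsoid; the small
    # tolerance guards against floating-point error when alpha = i/n.
    k = math.ceil(n * (1.0 - alpha) - 1e-9)
    if k <= 0:
        return 0.0  # alpha = 1: every observation is trimmed
    return sorted(radii)[k - 1]  # k-th smallest radius

radii = [3.0, 0.5, 1.0, 2.0, 1.5]
r_20 = sample_radius(radii, 0.2)  # trims the single most remote observation
```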
From the theoretical and sample radii $r_\alpha$ and $r_\alpha^n$ given in (1) and (2), we construct the following process.
Definition 1. Define the (generalized) radius process as
$$\alpha \mapsto n^{1/2} f_R(r_\alpha)\,(r_\alpha^n - r_\alpha) \quad \text{for } \alpha \in (0, 1].$$
For practical purposes, this process is evaluated only at $\alpha = i/n$, $i = 1, \ldots, n$.
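The evaluation on the grid $\alpha = i/n$ can be sketched as follows. This is an illustration under strong assumptions rather than the authors' procedure: the reference family is taken to be the bivariate standard normal, for which $r_\alpha = \sqrt{-2 \log \alpha}$ and $f_R(r_\alpha) = \alpha\, r_\alpha$, and $P_n$ is read as the empirical measure, so that $r_{i/n}^n$ is the $(n - i)$-th smallest radius. The function name `radius_process_bivariate_normal` is hypothetical.

```python
import math

def radius_process_bivariate_normal(radii):
    """Return (alpha, process value) pairs at alpha = i/n, i = 1..n (p = 2, normal)."""
    n = len(radii)
    ordered = sorted(radii)  # increasing order statistics
    values = []
    for i in range(1, n + 1):  # i observations trimmed
        alpha = i / n
        k = n - i  # observations kept inside the ellipsoid
        r_sample = ordered[k - 1] if k > 0 else 0.0
        r_theory = math.sqrt(max(0.0, -2.0 * math.log(alpha)))
        density = alpha * r_theory  # f_R(r) = r * exp(-r^2 / 2) at r = r_alpha
        values.append((alpha, math.sqrt(n) * density * (r_sample - r_theory)))
    return values

process = radius_process_bivariate_normal([0.5, 1.0, 1.5, 2.0])
```

At $\alpha = 1$ both radii are zero, so the process vanishes there, as does a Brownian bridge at the end of its time interval.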
In the unrealistic case where $\mu$ and $\Sigma$ were known [i.e., $m_n = \mu$ and $\Sigma_n = \Sigma$ in (2)], it is not difficult to see that the radius process converges to a standard Brownian bridge. This follows from the analogy between the radius process and the classical quantile process based on samples generated from the distribution of $R$. This fact leads us to consider the Brownian bridge $B$ as a candidate limit law when $\mu$ and $\Sigma$ are estimated by $m_n$ and $\Sigma_n$. If the convergence of the radius process to the Brownian bridge also holds in this case, then we can use the laws of some functionals of the Brownian bridge for inferential purposes. For instance, if $b_{.05}$ represents the .95-quantile of...