1. INTRODUCTION
Identification of outliers is one of the oldest problems in statistics. In recent years, outliers caused by the measurement error of high-speed automated measuring systems have posed a new problem for quality engineers in manufacturing. If these outliers are not identified and removed, they result in unnecessary false alarms, which undermine the effectiveness of charting and quality control procedures. The detection of outliers based on formal statistical hypothesis testing procedures has received considerable attention from the statistical research community. Comprehensive discussions of formal outlier detection tests were given in books by Hawkins (1980) and Barnett and Lewis (1994). Exploratory data analysis techniques based on the resistant rules of the boxplot have been used by Hoaglin, Iglewicz, and Tukey (1986), Kimber (1990), Davies and Gather (1993), and Iglewicz and Banerjee (2001) to identify possible outliers in univariate data. Brant (1990) and Carey, Walters, Wager, and Rosner (1997) gave extensive power comparisons between the boxplot and various outlier labeling techniques.
There are various ways of defining outliers. We define outliers in a set of data to be a subset of observations that appear to be inconsistent with the remaining observations that follow a hypothesized distribution (see, e.g., Carey et al. 1997). In other words, a sample of size $n$ has $n_0$ outliers if there are $n - n_0$ regular observations that are compatible with the hypothesized distribution with center location $\theta$, and the remaining $n_0$ observations yield residuals from $\theta$ that are larger in absolute value than those of the regular observations. Following the discussions of Davies and Gather (1993), the outlier labeling problem can be formulated for an asymmetric population as follows. Given a random sample $X_1, X_2, \ldots, X_n$, one is required to identify those observations, if any, that lie in an outlier region defined by
$$\mathrm{out}(\alpha_n, \theta, \sigma^2) = \{x : x < \theta - g_l(n, \alpha_n)\,\sigma \ \text{ or } \ x > \theta + g_u(n, \alpha_n)\,\sigma\},$$
where $g_l(n, \alpha_n)$ and $g_u(n, \alpha_n)$ are normalizing constants with respect to the hypothesized distribution $F_{\theta,\sigma}(x)$ such that $\Pr[X \in \mathrm{out}(\alpha_n, \theta, \sigma^2)] = \alpha_n$, and $\alpha_n$ is the rate at which an observation from a random sample of $n$ regular observations is falsely labeled as an outlier. An observation $x$ is called an $\alpha_n$-outlier with respect to $F_{\theta,\sigma}$ if $x \in \mathrm{out}(\alpha_n, \theta, \sigma^2)$.
The error rate per observation $\alpha_n$ can be either given or determined from the requirement that, for an outlier-free random sample, the probability that one or more observations in the sample will be wrongly classified as outliers is equal to a given small value $\alpha$. This means that one can choose a value of $\alpha_n$ satisfying
$$\Pr[\text{one or more of } X_1, X_2, \ldots, X_n \in \mathrm{out}(\alpha_n, \theta, \sigma^2)] = \alpha, \tag{1}$$
which, because the $n$ observations are independent and each falls in the outlier region with probability $\alpha_n$, gives $(1 - \alpha_n)^n = 1 - \alpha$ and hence $\alpha_n = 1 - (1 - \alpha)^{1/n}$. The idea of using a fixed value $\alpha$ is analogous to using a fixed significance level, to which users of hypothesis testing are accustomed. Hoaglin et al. (1986) termed $\alpha_n$ the outside rate per observation and $\alpha$ the some-outside rate per sample.
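The conversion between the two rates can be sketched in a few lines of Python. This is only an illustration under the known-parameter setting of (1); the function names `per_observation_rate` and `some_outside_rate` are hypothetical and do not come from the article.

```python
def per_observation_rate(alpha: float, n: int) -> float:
    """Outside rate per observation alpha_n implied by a some-outside rate alpha.

    Assumes n independent regular observations and a fixed (non-random)
    outlier region, as in requirement (1).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1.0 - (1.0 - alpha) ** (1.0 / n)


def some_outside_rate(alpha_n: float, n: int) -> float:
    """Inverse conversion: probability that at least one of n observations is flagged."""
    return 1.0 - (1.0 - alpha_n) ** n
```

For example, with a some-outside rate of .05 and n = 20, `per_observation_rate(0.05, 20)` is about 0.00256, slightly larger than the value .05/20 = .0025 that simple division by n would give.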
In this article we focus our discussion on the identification of outliers with boxplots whose lower fence (LF) and upper fence (UF) either (a) satisfy the requirement given in (1) with the outlier region given by $\mathrm{out}_{\mathrm{boxplot}}(\alpha_n, \theta, \sigma^2) = \{x : x \in (-\infty, \mathrm{LF}) \cup (\mathrm{UF}, \infty)\}$, or (b) are taken to be the tolerance limits obtained from a random sample, such that a prescribed proportion $1 - \alpha_n$ of the population values falls inside the random interval (LF, UF) with a given probability $\gamma$. The proposed boxplot outlier rules are applicable to samples taken from the family of location-scale distributions.
2. BOXPLOT OUTLIER LABELING RULE WITH GIVEN ERROR RATE
The boxplot, developed by Tukey (1977), is one of the most popular techniques in exploratory data analysis. A boxplot graphically displays information about the center, spread, whiskers, and outliers of a dataset. Observations that fall below the LF or above the UF of the boxplot are termed outliers. In this article the asymmetric lower and upper fences of the boxplot are given as $\mathrm{LF} = Q_l - k_l(Q_u - Q_l)$ and $\mathrm{UF} = Q_u + k_u(Q_u - Q_l)$, where $Q_l$ and $Q_u$ are the lower and upper quartiles and $k_l$ and $k_u$ are standardized constants. The definition of finite-sample quartiles is not unique in the literature and is discussed further later in this section. Note that $k_l = k_u$ if the hypothesized distribution is symmetric, and that both constants are customarily taken to be 1.5 or 3 irrespective of the population distribution. In this section, for a random sample of size $n$ taken from a family of location-scale distributions, we derive exact expressions for $k_l$ and $k_u$ satisfying (1) with the outlier region defined as
$$\mathrm{out}_{\mathrm{boxplot}}(\alpha_n, \theta, \sigma^2) = \{x : x \in (-\infty, \mathrm{LF}) \cup (\mathrm{UF}, \infty)\}. \tag{2}$$
2.1 The Boxplot Outlier Labeling Rule With Known Parameters
We first consider the case when the location parameter $\theta$ and scale parameter $\sigma$ of the hypothesized distribution $F_{\theta,\sigma}(x)$ are both known. In this case the lower and upper quartiles can be evaluated as $\theta + \sigma F_{0,1}^{-1}(.25)$ and $\theta + \sigma F_{0,1}^{-1}(.75)$, where $F_{0,1}(x)$ is the standardized distribution of $F_{\theta,\sigma}(x)$ and will be abbreviated as $F(x)$ in the sequel. By taking $\Pr[X \in (-\infty, \mathrm{LF})] = \Pr[X \in (\mathrm{UF}, \infty)]$ in (1) with the outlier region as given in (2), so that each tail carries probability $\alpha_n/2$, the values of $k_l$ and $k_u$ that result in a probability $\alpha$ of wrongly detecting one or more outliers in a random sample of $n$ regular observations are given by
$$k_l = \frac{F^{-1}(.25) - F^{-1}(\alpha_n/2)}{F^{-1}(.75) - F^{-1}(.25)} \tag{3}$$
and
$$k_u = \frac{F^{-1}(1 - \alpha_n/2) - F^{-1}(.75)}{F^{-1}(.75) - F^{-1}(.25)}. \tag{4}$$
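As a rough illustration of (3) and (4), the following Python sketch evaluates the two constants from any standardized quantile function. The helper name `fence_constants` and the choice of the standard normal distribution (through the standard library's `statistics.NormalDist`) are assumptions made for this example only.

```python
from statistics import NormalDist
from typing import Callable, Tuple


def fence_constants(quantile: Callable[[float], float],
                    alpha_n: float) -> Tuple[float, float]:
    """Return (k_l, k_u) from (3) and (4) for a standardized quantile function F^{-1}."""
    q_lower, q_upper = quantile(0.25), quantile(0.75)
    iqr = q_upper - q_lower
    # Each tail of the outlier region carries probability alpha_n / 2.
    k_l = (q_lower - quantile(alpha_n / 2.0)) / iqr
    k_u = (quantile(1.0 - alpha_n / 2.0) - q_upper) / iqr
    return k_l, k_u


# Example (assumed values): n = 20 observations, some-outside rate alpha = .05.
alpha_n = 1.0 - (1.0 - 0.05) ** (1.0 / 20)
k_l, k_u = fence_constants(NormalDist().inv_cdf, alpha_n)
```

For the symmetric normal case the two constants coincide, and with these assumed values they come out at roughly 1.7, somewhat above the customary 1.5.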
For instance, suppose that one wishes to detect $\alpha_n$-outliers in a given sample of size $n$ with respect to the exponential distribution function $F_{\theta,\sigma}(x) = 1 - \exp[-(x - \theta)/\sigma]$, where $x > \theta$ and $\sigma > 0$. Then, assuming that the parameters $\theta$ and $\sigma$ are known, we have $\mathrm{LF} = \theta + (\log_e(4/3) - k_l \log_e 3)\sigma$ and $\mathrm{UF} = \theta + (\log_e 4 + k_u \log_e 3)\sigma$, in which $k_l = \log_e[(4/3)(1 - \alpha_n/2)]/\log_e 3$ and $k_u = -\log_e(2\alpha_n)/\log_e 3$. Note that $k_l$ is close to the value $\log_e(4/3)/\log_e 3$ even when the sample size $n$ is as small as 10. Thus the LF of the boxplot for an exponential sample can be taken as $\mathrm{LF} = \theta$, and its outlier region reduces to the one-sided outlier region $\mathrm{out}_{\mathrm{boxplot}}(\alpha_n, \theta, \sigma^2) = \{x : x \in (\mathrm{UF}, \infty)\}$. Results analogous to those shown in (3) and (4) have been obtained by Iglewicz and Banerjee (2001).
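The exponential example can be checked numerically with a short Python sketch. The function name `exponential_fences` and the parameter values in the usage note are hypothetical; the formulas are the closed forms quoted in the preceding paragraph.

```python
import math
from typing import Tuple


def exponential_fences(theta: float, sigma: float,
                       alpha_n: float) -> Tuple[float, float, float, float]:
    """Return (k_l, k_u, LF, UF) for an exponential population with known theta, sigma."""
    log3 = math.log(3.0)  # interquartile range of the standard exponential
    k_l = math.log((4.0 / 3.0) * (1.0 - alpha_n / 2.0)) / log3
    k_u = -math.log(2.0 * alpha_n) / log3
    lf = theta + (math.log(4.0 / 3.0) - k_l * log3) * sigma
    uf = theta + (math.log(4.0) + k_u * log3) * sigma
    return k_l, k_u, lf, uf
```

Taking, say, theta = 0, sigma = 1, n = 10, and a some-outside rate of .05, the per-observation rate is about .0051, `k_l` is roughly 0.26, and the resulting LF lies within about 0.003 of theta, which is consistent with replacing LF by theta.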
2.2 Construction of Boxplot With Unknown Parameters
We next consider the case when the parameters $\theta$ and $\sigma$ of the hypothesized distribution $F_{\theta,\sigma}(x)$ are both unknown. In this case the lower and upper quartiles of $F_{\theta,\sigma}(x)$ can be estimated by the lower-fourth $X_{l:n}$ and upper-fourth $X_{u:n}$ of a sample of $n$ observations taken from $F_{\theta,\sigma}(x)$. We define the asymmetric LF and UF of the boxplot as
$$\mathrm{LF}_{\mathrm{sample}} = X_{l:n} - k_l (X_{u:n} - X_{l:n}) \tag{5}$$
and
$$\mathrm{UF}_{\mathrm{sample}} = X_{u:n} + k_u (X_{u:n} - X_{l:n}). \tag{6}$$
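A minimal Python sketch of the sample fences in (5) and (6) follows. Because the definition of the fourths is not unique, the sketch assumes one common depth-based convention, with depth ((n + 1)//2 + 1)/2 counted from each end of the ordered sample and adjacent order statistics averaged when the depth is a half-integer; this convention, the function names `fourths` and `label_outliers`, and the data in the usage note are all illustrative assumptions rather than the article's own choices. The constants `k_l` and `k_u` must be supplied by the user.

```python
from typing import List, Sequence, Tuple


def fourths(data: Sequence[float]) -> Tuple[float, float]:
    """Lower and upper fourths under an assumed depth-based definition."""
    xs = sorted(data)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    depth = ((n + 1) // 2 + 1) / 2      # may be an integer or a half-integer
    lo, hi = int(depth), int(depth + 0.5)
    lower = (xs[lo - 1] + xs[hi - 1]) / 2
    upper = (xs[n - lo] + xs[n - hi]) / 2
    return lower, upper


def label_outliers(data: Sequence[float], k_l: float,
                   k_u: float) -> Tuple[float, float, List[float]]:
    """Return (LF_sample, UF_sample, flagged observations) per (5) and (6)."""
    lower, upper = fourths(data)
    spread = upper - lower
    lf = lower - k_l * spread
    uf = upper + k_u * spread
    flagged = [x for x in data if x < lf or x > uf]
    return lf, uf, flagged
```

For the made-up sample 1, 2, ..., 8, 30 with the customary k_l = k_u = 1.5, the fourths are 3 and 7, the fences are -3 and 13, and only the observation 30 is labeled as an outlier.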
Construction of the boxplot satisfying the requirement given in (1) with the outlier region defined as $\mathrm{out}_{\mathrm{boxplot}}(\alpha_n, \theta, \sigma^2) = \{x : x \in (-\infty, \mathrm{LF}_{\mathrm{sample}}) \cup (\mathrm{UF}_{\mathrm{sample}}, \infty)\}$ was discussed by Hoaglin et al. (1986). These authors proposed a multistep approximation procedure that uses $\alpha_n$ with fixed $\alpha$ to guide choices of $k$ ($= k_l = k_u$) in the boxplot for samples taken from the Gaussian population. The disadvantage of their approximation procedure is that statistical expertise and judgment are needed at each step, which in turn leads to the propagation of approximation errors. Hoaglin and Iglewicz (1987) used simulations to obtain the values of $k$ for normal samples.
We now derive an exact expression that can be used to evaluate [k.sub.l] and [k.sub.u] for the family of location-scale distributions. Note that the event {one or more [X.sub.1],...,[X.sub.n] [member of] [out.sub.boxplot] ([[alpha].sub.n], [theta], [[sigma].sup.2])} is the union of the disjoint compound events {[X.sub.n:n] > [UF.sub.sample]} and {[X.sub.l:n] [less than or equal to] [LF.sub.sample], [X.sub.n:n] [less...