Publication: IIE Transactions
Publication Date: 01-JUN-07
Authors: Hwang, Wookyeon; Runger, George; Tuv, Eugene
Article Excerpt
1. Introduction
Statistical process control (SPC) is used to detect changes from normal operating conditions. In the simplest case of a single variable, the typical three-sigma limits bound the common variation in a process, and the interval they define can be considered the region of normal operation. Points outside this region signal potentially unusual process behavior. In multivariate SPC, a $p \times 1$ observation vector $x$ is obtained at each sample time from a process with mean vector $\mu$ and covariance matrix $\Sigma$. The familiar Hotelling statistic $X^2 = (x - \mu)' \Sigma^{-1} (x - \mu) \sim \chi^2_p$ (Hotelling, 1947) (or $T^2$) performs a similar function. The set of points such that $X^2 < c$ for a constant $c$ defines a region of normal operation, and the chart signals for points outside this region. A univariate Shewhart control chart can be justified through bounds based on common-cause variation or, more analytically, through a normal distribution assumption and a probability argument. Similarly, the $X^2$ region can be justified as a reasonable measure based on Mahalanobis distance or through a probability argument based on an assumed multivariate normal distribution. The key point, however, is that a region of normal operation is defined through an analytical expression such as $\mu \pm 3\sigma$, and parameter estimates are then used to finalize the bounds. Usually, the preliminary sample size is considered to be sufficiently large that estimation errors for the parameters are ignored.
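As a minimal sketch of the Hotelling chart described above, the following computes the statistic for a two-variable process and compares it with a chi-square limit. The function names, the restriction to p = 2 (where the chi-square limit has a closed form), and the false alarm probability of 0.0027 (chosen to mirror univariate three-sigma limits) are assumptions made for illustration and are not taken from the article.

```python
import math

def hotelling_x2(x, mu, sigma):
    """X^2 = (x - mu)' Sigma^{-1} (x - mu) for a 2 x 2 covariance matrix."""
    (a, b), (c, d) = sigma
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("sigma must be positive definite")
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (d * d0 * d0 - (b + c) * d0 * d1 + a * d1 * d1) / det

def chi2_limit_p2(alpha):
    """Limit c with P(X^2 > c) = alpha when X^2 is chi-square with 2 d.o.f."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    # With 2 degrees of freedom, P(X^2 > c) = exp(-c / 2).
    return -2.0 * math.log(alpha)

def signals(x, mu, sigma, alpha=0.0027):
    """True when the observation falls outside the region X^2 < c."""
    return hotelling_x2(x, mu, sigma) > chi2_limit_p2(alpha)
```

For other values of p the limit would come from a chi-square quantile function rather than the closed form used here.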
In a modern environment with much more data, this basic approach can be extended not only to estimate the parameters but also to learn the form (shape) of an appropriate region of normal operation. That is the objective of this work. As is appropriate for a data mining method, it is assumed that much more data is available (several thousand cases or records) than would be typical for a traditional control chart. This preliminary data is referred to as training data, and in this work it is used to learn the normal operating region of the process. The learned region is then used to signal points that are considered to be unusual.
If the training data were neatly partitioned into "on-target" and "off-target" sets, this would be a traditional discriminant or classification (supervised learning) problem. However, the off-target conditions are not usually fully represented in the training data. Instead, one has a preponderance of on-target data, and the goal is to detect any points that appear unusual relative to this data. This leads to two comments. First, the region of normal operation must be learned without the corresponding off-target data. Second, the traditional control chart design provides not only the shape of a control region but also, under an assumed multivariate normal distribution, a boundary that sets the false alarm probability. That is, $c$ is selected to fix $P(X^2 > c)$. These are important characteristics of the traditional method, and an extension needs to provide a similar feature. Although a generalized method cannot provide the simplified solution for the error rate that is obtained from the restrictive multivariate normal assumptions, transforming the problem into one of supervised learning enables the full set of tools that have been developed to estimate generalization error (e.g., McLachlan, 1992) to be applied within the multivariate SPC problem. We illustrate with an ensemble classifier (derived from a decision tree) that provides an intrinsic error estimator based on out-of-bag data. Similarly, bootstrap and other generalization error estimates become available upon the transform to a supervised problem. The more traditional view of multivariate SPC as a clustering problem does not lead directly to such error estimation. Also, the supervised problem focuses on the boundary between the classes, and this can be a much simpler problem than an alternative such as a density estimate.
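The idea of an out-of-bag error estimate can be sketched as follows. Each member of a bagged ensemble is fit to a bootstrap sample, and every point is scored only by the members whose bootstrap sample did not contain it. The article's learner is an ensemble derived from a decision tree; this sketch swaps in a nearest-centroid base learner purely to stay short, and the data, function names, and parameter values are hypothetical.

```python
import random

def fit_centroids(points, labels):
    """Nearest-centroid learner: store the mean vector of each class."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        s = sums.setdefault(y, [0.0] * len(p))
        for j, v in enumerate(p):
            s[j] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def predict(centroids, p):
    """Assign p to the class whose centroid is closest (squared Euclidean)."""
    return min(
        centroids,
        key=lambda y: sum((a - b) ** 2 for a, b in zip(p, centroids[y])),
    )

def oob_error(points, labels, n_models=50, seed=0):
    """Misclassification rate using only out-of-bag votes for each point."""
    rng = random.Random(seed)
    n = len(points)
    votes = [{} for _ in range(n)]
    for _ in range(n_models):
        # Bootstrap sample: n indices drawn with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
        model = fit_centroids([points[i] for i in in_bag],
                              [labels[i] for i in in_bag])
        # Points never drawn are out-of-bag for this ensemble member.
        for i in set(range(n)) - set(in_bag):
            y_hat = predict(model, points[i])
            votes[i][y_hat] = votes[i].get(y_hat, 0) + 1
    scored = [(i, max(v, key=v.get)) for i, v in enumerate(votes) if v]
    if not scored:
        raise ValueError("no point was ever out-of-bag; increase n_models")
    return sum(y_hat != labels[i] for i, y_hat in scored) / len(scored)

# Hypothetical two-class data: well-separated bivariate normal clouds.
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(100)]
points += [(data_rng.gauss(4, 1), data_rng.gauss(4, 1)) for _ in range(100)]
labels = [0] * 100 + [1] * 100
err = oob_error(points, labels)
```

Because no separate test set is held out, the estimate comes essentially for free with the ensemble, which is the property the text refers to as intrinsic.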
A density estimate would not focus on the difference between the training data and the random data we generate in the manner that is forced upon the supervised learner used in the method proposed here. The difference can be dramatic for some cases, although the details are not explored in this work.
Tuv and Runger (2003) provided an illustration of the method studied here in a conference paper. However, they did not quantify performance, nor did they provide more than a simple graph. Here we provide detailed error rate calculations for higher dimensions, consider alternative supervised learners, illustrate examples where artificial contrasts automate the calculation of a complex control boundary, and present a brief introduction to sample size. Furthermore, the method can be easily extended to tune for specific shifts. This is a useful benefit and the procedure is briefly mentioned, but a full discussion is beyond the scope of this work. Rather than presenting an established methodology, this work begins a research path that can use new data mining methods to extend the SPC technique.
2. Supervised learning
Although the off-target case is often not represented in the training data, there is a relationship between the multivariate SPC problem and supervised learning. If one had a control region, it could be considered to define a solution to a classification (supervised learning) problem. Points inside the control region are classified into the "normal" or "on-target" class, while those outside the region are classified as "off-target". Therefore, a supervised learner would be useful for multivariate SPC if an appropriate classification problem were developed.
One can consider the following novel technique (considered to be "statistical folklore") to transform a density estimation problem into one of function approximation (Hastie et al., 2001) and classification. Let $f(x)$ be an unknown probability density function over $\mathbb{R}^p$ to be estimated, and let $f_0(x)$...
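The transform described above, together with the randomly generated contrast data mentioned in the Introduction, might be sketched as follows. The excerpt breaks off before the reference density is specified, so this sketch assumes a uniform density over the bounding box of the training data. It also uses a k-nearest-neighbor vote as a stand-in for the article's tree-based ensemble. All names, sample sizes, and parameter values are hypothetical.

```python
import random

def make_contrast(train, rng):
    """Draw one uniform artificial point per training point over the
    bounding box of the training data (assumed reference density)."""
    dims = list(zip(*train))
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    return [
        tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
        for _ in train
    ]

def knn_on_target(x, train, contrast, k=25):
    """True when most of the k nearest labeled points are real training
    data (label 1) rather than artificial contrast data (label 0)."""
    labeled = [(p, 1) for p in train] + [(p, 0) for p in contrast]
    labeled.sort(key=lambda item: sum((a - b) ** 2 for a, b in zip(x, item[0])))
    votes = sum(label for _, label in labeled[:k])
    return votes > k / 2

# Hypothetical on-target training data: a standard bivariate normal process.
rng = random.Random(7)
train = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
contrast = make_contrast(train, rng)

center_ok = knn_on_target((0.0, 0.0), train, contrast)   # near the process mean
corner_ok = knn_on_target((2.8, -2.8), train, contrast)  # far from the mean
```

Under these assumptions, a point is treated as on-target where the training data outnumber the contrast data locally, so the classifier's decision boundary plays the role of the control limit without any off-target observations being required.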
NOTE: All illustrations and photos have been removed from this article.
