Publication: IIE Transactions. Publication Date: 01-JUN-07. Authors: Kim, Hyunjoong; Loh, Wei-Yin; Shih, Yu-Shan; Chaudhuri, Probal
1. Introduction
Box (1979) wrote, "All models are wrong but some are useful". This statement is unquestionably true, but it raises the question: useful for what? There are two ways in which a model can be useful: (i) it can improve our understanding of the system generating the data; or (ii) it can make accurate predictions of future observations. For example, linear models for designed factorial experiments are useful because the terms they contain may be interpreted as main and interaction effects. On the other hand, accurate weather prediction models are useful even if they are hard to interpret.
There are many applications, however, where traditional statistical models are useless for prediction and for interpretation. An example is the study on house prices in the greater Boston area in 1970 reported in Harrison and Rubinfeld (1978) and made famous by Belsley et al. (1980). There are 506 observations on a variety of variables, with each observation pertaining to one census tract. The goal of the study was to build a regression model for the median house price (MEDV) and to use it to estimate the "marginal-willingness-to-pay for clean air," namely, the effect of nitrogen oxide concentration (NOX). Table 1 lists the predictor variables. After transforming some variables to satisfy normal-theory assumptions, Harrison and Rubinfeld (1978) obtained the fitted model shown in Table 2. Note that because the whole population is represented in the data, there is nothing to predict. In particular, the t-statistics do not have their usual statistical meaning.
We may hope that the model can explain the effects of the predictor variables on the response. For example, the sign associated with the coefficient for $\mathrm{NOX}^2$ suggests that it has a negative effect on MEDV. Similarly, the negative coefficient for log(DIS) leads to the conclusion that MEDV is negatively associated with DIS. Table 2 shows, however, that the correlation between log(DIS) and log(MEDV) is positive! Another example is RAD, which has a positive regression coefficient but a negative correlation with MEDV. Of course, these apparent contradictions are easy to explain. First, a regression coefficient quantifies the residual effect of the predictor after the linear effects of the other predictors in the model have been accounted for. Second, the correlation between a predictor and the response measures their linear association, ignoring the other predictors. Nevertheless, the contradictions in signs are not intuitive.
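The sign reversal described above is easy to reproduce on a toy example. The following minimal sketch uses made-up numbers, not the Boston data; the names x1, x2, and y are hypothetical. The response is built as y = -x1 + 2*x2 with x2 tracking x1 closely, so the fitted coefficient of x1 is negative even though x1 and y are positively correlated.

```python
# Toy illustration (hypothetical data, not the Boston housing data):
# a predictor can have a negative regression coefficient and a positive
# marginal correlation with the response.

def mean(v):
    return sum(v) / len(v)

def centered_cross(u, v):
    """Sum of cross-products of u and v after centering each at its mean."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v))

def correlation(u, v):
    return centered_cross(u, v) / (centered_cross(u, u) * centered_cross(v, v)) ** 0.5

def two_predictor_slopes(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (with intercept),
    obtained by solving the 2 x 2 centered normal equations."""
    s11, s22, s12 = centered_cross(x1, x1), centered_cross(x2, x2), centered_cross(x1, x2)
    s1y, s2y = centered_cross(x1, y), centered_cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

x1 = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
wiggle = [0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]
x2 = [a + d for a, d in zip(x1, wiggle)]      # x2 closely tracks x1
y = [-a + 2.0 * b for a, b in zip(x1, x2)]    # true coefficient of x1 is -1

b1, b2 = two_predictor_slopes(x1, x2, y)
r1 = correlation(x1, y)
print("coefficient of x1:", b1)
print("correlation of x1 with y:", r1)
```

The coefficient measures the effect of x1 with x2 held fixed, whereas the correlation ignores x2, which is why the two can disagree in sign.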
Can we construct models that are more interpretable and that also fit the data well? Since a model that involves a single predictor variable is easiest to interpret because the fitted function can be graphed, one solution is to employ the best single-predictor model. Unfortunately, because such a model does not incorporate the information contained in the other predictors, it may not fit the data as well as a model that uses more than one predictor. Furthermore, a single-predictor model reveals nothing about the joint effect of all the predictors.
The goal of this paper is to study an alternative approach that: (i) retains the clarity and ease of interpretation of relatively simple models; (ii) allows expression of the joint effect of several predictors; and (iii) yields models with a higher average prediction accuracy than the traditional multiple linear regression model. We accomplish this by fitting simple models to partitions of the dataset and sample space. One such model for the Boston data is shown by the tree structure in Fig. 1. The predictor space is split into three rectangular partitions. Within each partition, the best single predictor variable is selected to fit a linear model to MEDV. Notice that, unlike the Harrison-Rubinfeld model, we can directly model MEDV in terms of the original predictors without needing any transformations.
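To make the idea of a piecewise single-predictor model concrete, here is a minimal sketch. It assumes the partition is already given and, as a further assumption, takes "best" to mean smallest residual sum of squares within each partition; the variable names s, u, v, y and the data are hypothetical. It does not show how the splits themselves are chosen.

```python
# Sketch: within each given partition, fit a simple linear model on every
# candidate predictor and keep the one with the smallest residual sum of squares.
# Data and variable names are hypothetical.

def simple_fit(x, y):
    """Return (intercept, slope, sse) for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, sse

def best_single_predictor(rows, predictors, response):
    """Return (name, intercept, slope, sse) of the best one-predictor fit."""
    y = [r[response] for r in rows]
    best = None
    for name in predictors:
        intercept, slope, sse = simple_fit([r[name] for r in rows], y)
        if best is None or sse < best[3]:
            best = (name, intercept, slope, sse)
    return best

rows = [
    # left partition (s <= 0.5): y = 1 + 2*u
    {"s": 0.0, "u": 1.0, "v": 2.0, "y": 3.0},
    {"s": 0.1, "u": 2.0, "v": 1.0, "y": 5.0},
    {"s": 0.2, "u": 3.0, "v": 2.0, "y": 7.0},
    {"s": 0.3, "u": 4.0, "v": 1.0, "y": 9.0},
    # right partition (s > 0.5): y = 10 - 3*v
    {"s": 0.6, "u": 2.0, "v": 1.0, "y": 7.0},
    {"s": 0.7, "u": 1.0, "v": 2.0, "y": 4.0},
    {"s": 0.8, "u": 1.0, "v": 3.0, "y": 1.0},
    {"s": 0.9, "u": 2.0, "v": 4.0, "y": -2.0},
]

left = [r for r in rows if r["s"] <= 0.5]
right = [r for r in rows if r["s"] > 0.5]
left_model = best_single_predictor(left, ["u", "v"], "y")
right_model = best_single_predictor(right, ["u", "v"], "y")
print("left partition:", left_model)
print("right partition:", right_model)
```

Each partition ends up with its own regressor and its own line, which is what allows the fitted functions to be graphed one panel at a time.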
[FIGURE 1 OMITTED]
Figure 2 displays the data and fitted functions in the three partitions. The graphs indicate that LSTAT has a large negative effect on house price, except in census tracts with large houses (right panel) where PT is a stronger linear predictor. As expected, MEDV tends to increase with RM. These conclusions are consistent with the signs of the coefficients of log(LSTAT) and $\mathrm{RM}^2$ in the Harrison-Rubinfeld model.
Besides a piecewise single-regressor model, a piecewise two-regressor model can also be used to reveal more insight into the data. The tree structure for the latter is presented in Fig. 3, with the selected regressors printed beneath the leaf nodes. By utilizing only two regressor variables in each node of the tree, we can employ shaded contour plots to display the fitted functions and the data points. These plots are shown in Fig. 4, with lighter shades corresponding to higher values of MEDV. Note that some of the contour lines are not parallel; this is due to truncation of the predicted values, as explained by the algorithm in Section 2. We observe that the higher-priced census tracts tend to have high values of RM and low values of LSTAT. The lowest-priced tracts are mostly concentrated in one leaf node (bottom left panel in dark gray) with below average values of RM and DIS, and above average values of RAD, LSTAT, and CRIM. Although the regression coefficients in each leaf node model suffer from the problems of interpretation noted earlier, we do not need their values for a qualitative analysis. The contour plots convey all the essential information.
How well do the tree models fit the data compared to the Harrison-Rubinfeld model? Figure 5 plots the fitted versus observed values of MEDV. The piecewise two-regressor model clearly fits best of all. Notice the lines of points on the right edges of the graphs for the Harrison-Rubinfeld and the one-regressor tree models. They are due to the observed MEDV values being truncated at $50000 (Gilley and Pace, 1996) and the inability of these two models to fit them satisfactorily. Our two-regressor model has no trouble with these points.
The rest of this article is organized as follows. Section 2 describes our regression tree algorithm. Section 3 analyzes another well-known dataset and compares the results with those of human experts. We take the opportunity there to highlight the important problem of selection bias. In Section 4 we compare the prediction accuracy of 27 algorithms from the statistical and machine learning literature on 52 real datasets. The results show that some machine learning methods have very good accuracy and that our methods are quite competitive. We prove an asymptotic consistency result in Section 5 to lend theoretical support to the empirical findings and close with some remarks in Section 6.
[FIGURE 2 OMITTED]
2. Regression tree method
Our algorithm is an extension of the GUIDE algorithm (Loh, 2002), which fits a constant or a multiple linear model at each node of a tree. The only difference is that we now use stepwise linear regression instead. The number of linear predictors permitted at each node may be restricted or unrestricted, subject to the standard F-to-enter and F-to-remove thresholds of 4.0 (Miller, 2002). A one- or two-regressor tree model is obtained by restricting the number of linear predictors to one or two, respectively. We present here the recursive sequence of operations for a two-regressor tree model; the method for a one-regressor tree model is similar.
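As a rough illustration of restricting the number of linear predictors, the sketch below performs forward selection with an F-to-enter threshold of 4.0 and stops after at most two regressors. It is a simplification and not the authors' implementation: the F-to-remove step of full stepwise regression is omitted, and the function names and toy data are hypothetical.

```python
# Sketch: forward selection of at most `max_terms` regressors using the
# partial F statistic with an F-to-enter threshold (F-to-remove is omitted).

def ols_rss(columns, y):
    """Residual sum of squares of y regressed on an intercept plus `columns`."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    # Normal equations A beta = b.
    A = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    b = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    # Gaussian elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        b[c], b[pivot] = b[pivot], b[c]
        for r in range(c + 1, p):
            factor = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= factor * A[c][k]
            b[r] -= factor * b[c]
    # Back substitution.
    beta = [0.0] * p
    for r in reversed(range(p)):
        beta[r] = (b[r] - sum(A[r][k] * beta[k] for k in range(r + 1, p))) / A[r][r]
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for row, yi in zip(design, y))

def forward_select(data, response, max_terms=2, f_to_enter=4.0):
    """Return the names of the regressors entered, in order of entry."""
    y = data[response]
    n = len(y)
    chosen = []
    rss_current = ols_rss([], y)
    while len(chosen) < max_terms:
        best = None  # (F, name, rss)
        for name in data:
            if name == response or name in chosen:
                continue
            rss_new = ols_rss([data[c] for c in chosen + [name]], y)
            df = n - len(chosen) - 2  # n minus (intercept + slopes in new model)
            f = float("inf") if rss_new == 0 else (rss_current - rss_new) / (rss_new / df)
            if best is None or f > best[0]:
                best = (f, name, rss_new)
        if best is None or best[0] < f_to_enter:
            break
        chosen.append(best[1])
        rss_current = best[2]
    return chosen

x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
x2 = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
x3 = [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]
toy = {"x1": x1, "x2": x2, "x3": x3,
       "y": [2 * a + 3 * b + e for a, b, e in zip(x1, x2, noise)]}

selected = forward_select(toy, "y", max_terms=2)
print("selected regressors:", selected)
```

Setting max_terms to 1 gives the one-regressor analogue; leaving it large lets the F threshold alone decide how many predictors enter.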
[FIGURE 3 OMITTED]
Step 1. Let t denote the current node. Use stepwise regression to choose two quantitative predictor variables to fit a linear model to the data in t.
Step 2. Do not split a node if its model $R^2$ > 0.99 or if the number of observations is less than $2n_0$, where $n_0$ is a small user-specified constant. Otherwise, go to the next step.
Step 3. For each observation, define the class variable Z = 1 if it is associated with a positive residual. Otherwise, define Z = 0.
Step 4. For each predictor variable X:
(a) Construct a 2 x m cross-classification table. The rows of the table are formed by the values of Z. If...
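Steps 2 and 3, and the row structure of the table in Step 4(a), can be sketched as follows. The excerpt breaks off before stating how the m columns are formed, so the sketch takes the column label of each observation as a caller-supplied input; that argument, the function names, and the example values are all hypothetical.

```python
# Sketch of Steps 2, 3 and the row layout of Step 4(a).
# How the m columns are defined is not given in the excerpt, so `groups`
# is a caller-supplied list of column labels (an assumption of this sketch).

def should_stop(r_squared, n_obs, n0):
    """Step 2: do not split if R^2 > 0.99 or the node has fewer than 2*n0 cases."""
    return r_squared > 0.99 or n_obs < 2 * n0

def residual_classes(residuals):
    """Step 3: Z = 1 for a positive residual, Z = 0 otherwise."""
    return [1 if r > 0 else 0 for r in residuals]

def cross_table(z, groups):
    """Step 4(a): 2 x m table of counts; row 0 is Z = 0, row 1 is Z = 1."""
    labels = sorted(set(groups))
    table = [[0] * len(labels) for _ in range(2)]
    for zi, g in zip(z, groups):
        table[zi][labels.index(g)] += 1
    return labels, table

residuals = [0.5, -1.2, 0.0, 2.1, -0.3, 0.7]   # hypothetical node residuals
groups = ["a", "a", "a", "b", "b", "b"]        # hypothetical column labels

z = residual_classes(residuals)
labels, table = cross_table(z, groups)
print("Z:", z)
print("columns:", labels)
print("table:", table)
```

Note that a residual of exactly zero is assigned Z = 0, following the wording of Step 3.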
NOTE: All illustrations and photos have been removed from this article.
