|
...the block coordinate relaxation optimization technique. A rule for the automatic selection of the smoothing parameters, suitable for data mining of large datasets, is derived. The wavelet-based method is then extended to estimate generalized additive models. A primal-dual log-barrier interior point algorithm is proposed to solve the corresponding convex programming problem. Based on an asymptotic analysis, a rule for selecting the smoothing parameters is derived, enabling the estimator to be fully automated in practice. We illustrate the finite sample property with a Gaussian and a Poisson simulation.
Key Words: (Generalized) additive models; Convex programming; Data mining; Interior-point algorithm; Regularized likelihood; Relaxation algorithm; Signal denoising; Universal smoothing parameter.
1. INTRODUCTION
We first recall the standard regression problem. Available is a set of responses Y = ([Y.sub.1], ..., [Y.sub.N])' measuring a function f(*) at N locations [X.sub.1], ..., [X.sub.N] sampled from a distribution [F.sub.X]. Bold denotes a vector and X' denotes the transpose of the vector X; for instance X' = ([X.sub.1], ..., [X.sub.Q])' is the column vector of Q explanatory variables. Based on the training sample T = [{([Y.sub.n], [X.sub.n])}.sub.n=1, ..., N], the primary goal of regression is to estimate the multivariate function f(*) which explains best the association between the predictors X and the noisy measurement Y. Assuming the noise is additive and unbiased conditioned on x, that is,
Y = f(X) + [epsilon] with E([epsilon] | x) = 0,
the predictive performance of an estimate [??](*) is often measured by its mean squared error
(1.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.],
where the inside expectation is taken over a new explanatory variable [X.sup.*] with the same distribution [F.sub.X] as that of X in the training sample T. Although prediction is the primary goal, interpretability of the estimated function [??](*) is also desirable.
A well-studied model for such a regression problem is the parametric additive linear model f(X) = [[alpha].sub.0] + [[SIGMA].sub.q][[alpha].sub.q][X.sub.q], where the coefficients [alpha] are typically estimated by least squares. If the parametric model is correct, then the [L.sub.2] rate of convergence is in [N.sup.-1. But, what if the true function f(*) is far from being linear in each covariate? To avoid bias, nonparametric techniques do not assume a straight line but let the data fit themselves. The price to pay is a slower optimal [L.sub.2] rate of convergence typically in [N.sup.-2[tau]/(2[tau] + Q)] > [N.sup.-1], where [tau] is a measure of the smoothness of f(*) and Q is the number of covariates. Many scatterplot smoothers (Q = 1) have been developed and achieve an optimal rate of convergence for an appropriate selection of the smoothing parameter. To fit general high-dimensional surfaces when the number of predictors is larger (Q > 1), the number of observations N must grow exponentially with Q to achieve the same rate of convergence as when Q = 1. This is known as the curse of dimensionality.
The nonparametric additive model is a compromise between the rigid parametric linear model and the too flexible general high-dimensional nonparametric model. The nonparametric additive model looks for the closest approximation to f(*) of the form
(1.2) f(x) = [Q.summation over (q=1)][f.sub.q]([x.sub.q],
where each [f.sub.q](*) is a general univariate function. The advantage of nonparametric additive modeling is to avoid the curse of dimensionality, a result of Stone (1985) who provided the striking result that the optimal [L.sub.2] rate of convergence of each additive component [[??].sub.q](*) to [f.sub.q](*) is the same as in the univariate case (Q = 1), namely of the form [N.sup.-2[tau]/(2[tau] + 1)], when the multivariate function is indeed additive. Another advantage of the additive model is the ability to visualize the univariate trends in each covariate and provide a simple interpretation of the fitted model. For estimating additive models from data, two main approaches can be distinguished:
* The backfitting approach of Buja, Hastie, and Tibshirani (1989) iteratively applies a univariate linear smoother until convergence.
Properties of the estimate have been studied in many papers when the univariate smoother used is linear. Buja, Hastie, and Tibshirani (1989) established the existence of a solution and convergence of a backfitting algorithm for linear smoothers having a symmetric "hat" matrix with eigenvalues in [0, 1]. They also showed that uniqueness of solution is not guaranteed because of potential "concurvity." These results were extended to other linear smoothers by Opsomer and Ruppert (1997) with local polynomials; by Mammen, Linton, and Nielsen (1999) with local polynomials and the Nadaraya-Watson kernel smoother; and by Amato and Antoniadis (2001) with a linear wavelet-based smoother.
Conveniently, the backfitting approach uses the univariate smoothing technology, but, because the smoothers must be linear, spatially heterogeneous functions cannot be efficiently estimated. The selection of the smoothing parameters also remains an open question (Cantoni and Hastie 2002).
* The Turbo knot selection approach of Friedman and Silverman (1989) is nonlinear. It assumes that each [f.sub.q](*) can be written as a parsimonious expansion on a few truncated power functions defined by their knot location. Turbo then performs a stepwise search for the optimal knot location. If the knots are located optimally, the procedure is efficient even when estimating spatially heterogeneous functions. However, the knot selection procedure is known to suffer from an instability that many practitioners have observed and that Breiman (1996) has studied.
In this article, we propose to use the good features of both approaches by using a nonlinear smoother and a nonlinear backfitting algorithm. In particular, we consider the recently developed Waveshrink univariate smoother of Donoho and Johnstone (1994) that enjoys nice theoretical and computational properties (Donoho, Johnstone, Kerkyacharian, and Picard 1995). As in Turbo, Waveshrink models each [f.sub.q](*) as a parsimonious expansion on a set of basis functions, but uses wavelets instead of splines. The selection of which wavelets to use is then carried out by nonlinear shrinkage.
Additive models assume that, not only the underlying multivariate function, but also the noise is additive. To handle a wider class of noise distributions, Hastie and Tibshirani (1986) proposed the generalized additive models in the spirit of the generalized linear model of Nelder and Wedderburn (1972). Under mild conditions, Stone (1986) extended his convergence results obtained for additive models, showing that generalized additive models do not suffer from the curse of dimensionality either. In this article, we also generalize our wavelet-based estimator to estimate the components of generalized additive models.
The article is organized as follows. Section 2 reviews Waveshrink, the nonlinear wavelet-based univariate smoother of Donoho and Johnstone (1994). Section 3 defines the AMlet wavelet-based estimator of additive models and studies the convergence of a block coordinate relaxation (nonlinear backfitting) algorithm. Section 4 defines the GAMlet estimator that generalizes AMlet to a wide class of noise distributions. Both AMlet and GAMlet require smoothing parameters as input. This is based on Donoho and Johnstone's (1994) idea of universal threshold. Section 5 proposes a practical rule for choosing the smoothing parameters automatically, making AMlet and GAMlet fully implementable in practice. A Ganssian and a Poisson simulation show how well the universal rule works in practice. Section 6 discusses our results and suggests some further areas of research. For clarity, we postpone some mathematical derivations to the Appendix.
2. REVIEW OF WAVESHRINK
We consider the univariate situation when only one predictor is available (Q = 1). Waveshrink is a wavelet-based smoother and, as such, belongs to the class of expansion-based estimators: it assumes that the univariate function f(*) can be well represented by a linear combination of approximation [phi](*) and fine scale [psi](*) wavelets. The standard univariate wavelets are multiresolution functions that are locally supported and indexed by a location parameter k and a scale parameter j. A father wavelet [phi](*) such that [[intergral].sup.1.sub.0] [phi](x)dx = 1 generates [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.] approximation wavelets by means of the dilation and translation relation
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.];
they capture the coarse features of the signal. Similarly, a mother wavelet [psi](*) such that [[intergral].sup.1.sub.0][psi](x)dx = generates N - [p.sub.0] fine scale wavelets
[[psi].sub.j,k](x) = [2.sup.j/2][psi]([2.sup.j]x - k), j = [j.sub.0], ..., J; k = 0, 1, ..., [2.sup.j] - 1,
where J = [log.sub.2](N) - 1. Because they are locally supported, the fine scale wavelets capture the local features of the signal. Unless otherwise stated, we will assume that the function we want to estimate has the following expansion
(2.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.],
where the wavelet functions [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.] are orthonormal with respect to the [L.sub.2] norm. An orthonormal matrix [PHI] can be extracted from these functions that is appropriate for a fixed equispaced design for the sampling locations [x.sub.1], ..., [x.sub.N]. Hence the function f(*) calculated at the design points f = (f([x.sub.1]), ..., f([x.sub.N])) has the following wavelet decomposition
[MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.],
where [PHI}.sub.0] is the N x [p.sub.0] matrix of approximation wavelets, [PSI] is the N x (N - [p.sub.0]) matrix of fine scale wavelets, and [beta], [gamma] are the corresponding coefficients. To simplify notation, for...
NOTE: All illustrations and photos
have been removed from this article.

More articles from Journal of Computational & Graphical Statistics
A new, fast algorithm to find the regions of possible support for biva..., June 01, 2004 Sampling schemes for Bayesian variable selection in generalized linear..., June 01, 2004 Automatic smoothing with wavelets for a wide class of distributions., June 01, 2004 Optimal pair matching with two control groups., June 01, 2004 Variable length Markov chains: methodology, computing, and software., June 01, 2004
Looking for additional articles?
Search our database of over 3 million articles.
Looking for more in-depth information on this industry?
Search our complete database of Industry & Market reports by text, subject, publication
name or publication date.
About Goliath
Whether you're looking for sales prospects, competitive information, company
analysis or best practices in managing your organization,
Goliath can help you meet your business needs.
Our extensive business information databases empower business
professionals with both the breadth and depth of credible,
authoritative information they need to support their business
goals. Whether it be strategic planning, sales prospecting,
company research or defining management best practices -
Goliath is your leading source for accurate information.
|