|
...built kernel logistic regression (KLR). We show that the IVM not only performs as well as the SVM in two-class classification, but also can naturally be generalized to the multiclass case. Furthermore, the IVM provides an estimate of the underlying probability. Similar to the support points of the SVM, the IVM model uses only a fraction of the training data to index kernel basis functions, typically a much smaller fraction than the SVM. This gives the IVM a potential computational advantage over the SVM.
Key Words: Classification; Kernel methods; Multiclass learning; Radial basis; Reproducing kernel Hilbert space (RKHS); Support vector machines.
1. INTRODUCTION
In standard classification problems, we are given a set of training data ([x.sub.1], [y.sub.1]), ([x.sub.2], [y.sub.2]), ..., ([x.sub.n], [y.sub.n]), where the input [x.sub.i] [member of] [R.sup.p] and the output [y.sub.i] is qualitative and assumes values in a finite set C, for example, C = {1,2, ..., C}. We wish to find a classification rule from the training data, so that when given a new input x, we can assign a class e from C to it. Usually it is assumed that the training data are an independently and identically distributed sample from an unknown probability distribution P(X, Y).
The support vector machine (SVM) works well in two-class classification, that is, y [member of] {- 1, 1}, but its appropriate extension to the multiclass case is still an ongoing research issue (e.g., Vapnik 1998; Weston and Watkins 1999; Bredensteiner and Bennett 1999; Lee, Lin, and Wahba 2002). Another property of the SVM is that it only estimates sign [p(x) - 1/2] (Lin 2002), while the probability p(x) is often of interest itself, where p(x) = P(Y = 1|X = x) is the conditional probability of a point being in class 1 given X = x. In this article, we propose a new approach, called the import vector machine (IVM), to address the classification problem. We show that the IVM not only performs as well as the SVM in two-class classification, but also can naturally be generalized to the multiclass case. Furthermore, the IVM provides an estimate of the probability p(x). Similar to the support points of the SVM, the IVM model uses only a fraction of the training data to index the kernel basis functions. We call these training data import points. The computational cost of the SVM is O([n.sup.2][n.sub.s])(e.g., Kaufman 1999), where [n.sub.s] is the number of support points and [n.sub.s] usually increases linearly with n, while the computational cost of the IVM is O([n.sup.2][m.sup.2]), where m is the number of import points. Because m does not tend to increase as n increases, the IVM can be faster than the SVM. Empirical results show that the number of import points is usually much less than the number of support points.
In Section 2, we briefly review some results of the SVM for two-class classification and compare it with kernel logistic regression (KLR). In Section 3, we propose our IVM algorithm. In Section 4, we show some numerical results. In Section 5, we generalize the IVM to the multiclass case.
2. SUPPORT VECTOR MACHINES AND KERNEL LOGISTIC REGRESSION
The standard SVM produces a nonlinear classification boundary in the original input space by constructing a linear boundary in a transformed version of the original input space. The dimension of the transformed space can be very large, even infinite in some cases. This seemingly prohibitive computation is achieved through a positive definite reproducing kernel K(x, x), which gives the inner product in the transformed space.
Many people have noted the relationship between the SVM and regularized function estimation in the reproducing kernel Hilbert spaces (RKHS). An overview can be found in Burges (1998), Evgeniou, Pontil, and Poggio (1999), Wahba (1999), and Hastie, Tibshirani, and Friedman (2001). Fitting an SVM is equivalent to
(2.1) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.],
where [H.sub.K] is the RKHS generated by the kernel K(x, x). The classification rule is given by sign[f(x)]. For the purpose of simple notation, we omit the constant term in f(x).
By the representer theorem (Kimeldorf and Wahba 1971), the optimal f(x) has the form:
(2.2) f(x) = [n.summation over i=1][a.sub.i]K(x,[x.sub.i]).
It often happens that a sizeable fraction of the n values of [a.sub.i] can be zero. This is a consequence of the truncation property of the first part of criterion (2.1). This seems to be an attractive property, because only the points on the wrong side of the classification boundary, and those on the right side but near the boundary have an influence in determining the position of the boundary, and hence have nonzero [a.sub.i]'s. The corresponding xi's are called support points.
Notice that (2.1) has the form loss + penalty. The loss function [(1 - y f).sub.+] is plotted in Figure 1, along with the negative log-likelihood (NLL) of the binomial distribution. As we can see, the NLL of the binomial distribution has a similar shape to that of the SVM: both increase linearly as yf gets very small (negative) and both encourage y and f to have the same sign. If we replace [(1 - y f).sub.+] in (2.1) with ln(1 + [e.sup.-yf]), the NLL of the binomial distribution, the problem becomes a kernel logistic regression (KLR) problem:
(2.3) [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII.].
[FIGURE 1 OMITTED]
Because of the similarity between the two loss functions, we expect that the fitted function performs similarly to the SVM for two-class classification.
There are two immediate advantages of making such a replacement: (a) Besides giving a classification rule, KLR also offers a natural estimate of the probability p(x) = [e.sup.f(x)] / ( 1 + [e.sup.f(x)]), while the SVM only estimates sign[p(x) - 1/2] (Lin 2002); (b) KLR can naturally be generalized to the multiclass case through kernel multi-logit regression. However, because KLR compromises the hinge loss function of the SVM, it no longer has the support points property; in other words, all the [a.sub.i]'s in (2.2) are nonzero.
KLR is a well-studied problem; see Green and Yandell (1985), Hastie and Tibshirani (1990), Wahba, Gu, Wang, and Chappell (1995)...
NOTE: All illustrations and photos
have been removed from this article.

More articles from Journal of Computational & Graphical Statistics
Regression tree analysis using TARGET., March 01, 2005 Multicategory [psi]-learning and support vector machine: computational..., March 01, 2005 Generalized partially linear measurement error models., March 01, 2005
Looking for additional articles?
Search our database of over 3 million articles.
Looking for more in-depth information on this industry?
Search our complete database of Industry & Market reports by text, subject, publication
name or publication date.
About Goliath
Whether you're looking for sales prospects, competitive information, company
analysis or best practices in managing your organization,
Goliath can help you meet your business needs.
Our extensive business information databases empower business
professionals with both the breadth and depth of credible,
authoritative information they need to support their business
goals. Whether it be strategic planning, sales prospecting,
company research or defining management best practices -
Goliath is your leading source for accurate information.
|