|
...modeling engine. Although virtually any predictive modeling technique can be implemented within ProbE's software environment, ProbE's application programming interfaces (APIs) are particularly well-suited for implementing segmentation-based modeling techniques, wherein sets of data records are partitioned into segments and separate predictive models are developed for each segment.
This style of modeling is popular among data analysts and applied statisticians, and it is usually approached as a sequential process in which data are first segmented (using, for example, unsupervised clustering algorithms), and predictive models are then developed for those segments. The drawback of this sequential approach is that it ignores the strong influence that segmentation exerts on the predictive accuracies of the models within each segment. Good segmentations tend to be obtained only through trial and error by varying the segmentation criteria.
ProbE performs segmentation and predictive modeling within each segment simultaneously, thereby optimizing the segmentation so as to maximize overall predictive accuracy and thus produce better models. Currently, ProbE includes a top-down tree-based algorithm for constructing segmentations, as well as a collection of algorithms for constructing segment models. The latter include stepwise linear regression and stepwise naive Bayes algorithms for general-purpose modeling, and a joint Poisson/lognormal algorithm for insurance risk modeling.
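To make the contrast with the sequential approach concrete, the following is a minimal sketch in Python; it is not ProbE's actual algorithm. It assumes a single numeric input, a straight-line model in each segment, and a single binary split, and the function names fit_line and best_split are hypothetical. The split point is chosen by fitting the segment models on each side and keeping the threshold that minimizes their combined squared error, so the segmentation is driven by segment-model accuracy rather than by a separate clustering step.

```python
def fit_line(points):
    """Least-squares line through non-empty (x, y) points.

    Returns (intercept, slope, sum_of_squared_errors).
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx if sxx > 0 else 0.0  # constant model if x does not vary
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    return intercept, slope, sse


def best_split(points, min_size=2):
    """Find the threshold on x that minimizes the summed SSE of two segment lines.

    Returns (threshold, total_sse). The threshold is None when no split
    beats a single line fitted to all points.
    """
    pts = sorted(points)
    best_threshold, best_sse = None, fit_line(pts)[2]
    for i in range(min_size, len(pts) - min_size + 1):
        if pts[i - 1][0] == pts[i][0]:
            continue  # cannot separate records that share the same x value
        total = fit_line(pts[:i])[2] + fit_line(pts[i:])[2]
        if total < best_sse:
            best_threshold = (pts[i - 1][0] + pts[i][0]) / 2
            best_sse = total
    return best_threshold, best_sse
```

A top-down tree-based procedure could apply a search of this kind recursively to each resulting segment, with any other segment model substituted for the straight line.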
IBM ATM-SE (Advanced Targeted Marketing for Single Events) is an application built on top of ProbE for mining high-dimensional customer interaction and promotion history data in order to construct customer profitability and response likelihood models for the retail industry. (1) An evaluation of ATM-SE was recently conducted with Fingerhut Inc., a leading U.S. direct-mail retailer and a sophisticated user of predictive analytics in its targeted marketing efforts. The segmentation-based response models produced by ProbE either equaled or slightly outperformed Fingerhut's proprietary models. The outcome of this evaluation is significant because numerous vendors and consultants have attempted to beat Fingerhut's in-house modeling capability in the past, but none had previously succeeded. Moreover, ProbE achieved this result in a fully automated mode of operation with no manual intervention. Although further development and testing are still needed, early indications are that ProbE will be able to consistently produce high-quality models for this application on a fully automated basis, without requiring costly manual adjustments of the models or the mining parameters by data mining experts. Such automation is a necessary step in making data mining attractive to medium-sized businesses.
A key feature of ProbE is that it can be readily extended so as to construct a wide range of predictive models within a segment. For example, in the IBM UPA (Underwriting Profitability Analysis) application, (2) a joint Poisson/lognormal statistical model is used to simultaneously model both the frequency with which insurance claims are filed and the amounts (i.e., severities) of those claims for each segment. Using this class of segment models, the segments identified by ProbE would thus correspond to distinct risk groups whose loss characteristics (i.e., claim frequency and severity) are estimated in accordance with standard actuarial practices.
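As a rough illustration of what a frequency/severity segment model estimates, the sketch below computes textbook maximum-likelihood estimates for one segment in Python. It is a simplification that treats frequency and severity separately and is not the joint model used in UPA; the function name segment_loss_estimate and its arguments are hypothetical. Claim frequency is taken as the Poisson rate (claims per unit of exposure), and mean severity follows from the lognormal mean formula exp(mu + sigma^2 / 2).

```python
import math


def segment_loss_estimate(claim_counts, exposures, claim_amounts):
    """Estimate loss characteristics for one segment (illustrative only).

    claim_counts  -- number of claims filed per policy
    exposures     -- exposure per policy (e.g., policy-years), same order
    claim_amounts -- positive amounts of the individual claims in the segment
    """
    # Poisson maximum-likelihood rate: total claims per unit of exposure.
    frequency = sum(claim_counts) / sum(exposures)

    # Lognormal maximum-likelihood parameters from the log claim amounts.
    logs = [math.log(amount) for amount in claim_amounts]
    mu = sum(logs) / len(logs)
    sigma2 = sum((v - mu) ** 2 for v in logs) / len(logs)

    # Mean of a lognormal distribution with parameters mu and sigma2.
    mean_severity = math.exp(mu + sigma2 / 2)

    return {
        "frequency": frequency,
        "mean_severity": mean_severity,
        # Expected loss per unit of exposure.
        "pure_premium": frequency * mean_severity,
    }
```

Comparing these estimates across candidate segments is one way to see whether a proposed split separates records into groups with genuinely different risk.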
A second example is found in the ATM-SE application for predicting customer response to promotional mailings. To predict the expected revenues that would be generated by a customer targeted in such mailings, segment models were constructed using least-squares linear regression with forward stepwise feature selection to select the variables that appear in the regression equations. Using this class of segment models, ProbE would construct piecewise-linear models in which the segments correspond to regions of the response surface that are approximately linear and the boundaries between segments correspond to nonlinearities detected in that surface.
To predict the probability of a customer responding to a promotional mailing, segment models were constructed using naive Bayes methods with forward stepwise feature selection to select the variables that appear in the conditional probability equations. Using this class of segment models, ProbE would construct piecewise naive Bayes models in which the segments correspond to regions of the response surface in which the naive Bayes independence assumptions are locally valid and the boundaries between segments correspond to interactions among features detected in the response surface that violate the naive Bayes assumptions.
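Both the regression and the naive Bayes segment models described above rely on forward stepwise feature selection. The Python sketch below shows the generic greedy loop under stated assumptions: the function name forward_stepwise is hypothetical, and the caller supplies a score function (higher is better, for example the negative validation error of a model fitted on the chosen features). It is not ProbE's implementation, and the article does not specify ProbE's stopping rule.

```python
def forward_stepwise(candidates, score, min_gain=0.0):
    """Greedy forward selection of features.

    candidates -- iterable of feature names
    score      -- function mapping a list of features to a number (higher is better)
    min_gain   -- stop when the best addition improves the score by no more than this

    Returns (selected_features, final_score).
    """
    selected = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Score every one-feature extension of the current selection.
        trials = [(score(selected + [f]), f) for f in remaining]
        top_score, top_feature = max(trials, key=lambda t: t[0])
        if top_score - best <= min_gain:
            break  # no remaining feature improves the model enough
        selected.append(top_feature)
        remaining.remove(top_feature)
        best = top_score
    return selected, best
```

With a toy score that simply sums per-feature values, say 3 for "a", 2 for "b", and -1 for "c", the loop selects "a" and then "b", and stops before adding "c" because doing so would lower the score.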
In addition to being extensible with respect to segment models, ProbE also permits extensions to be made to its segmentation algorithms. This degree of extensibility was achieved through careful design of ProbE's APIs. In particular, a single API is used to implement all predictive modeling algorithms, including segmentation algorithms. This model API is general enough to permit a very wide range of predictive modeling techniques to be implemented within ProbE. No matter what kind of predictive models are used within each segment, the same segmentation algorithms are used in ProbE to optimize the predictive accuracies of the resulting ensemble of models independent of their internal details.
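The idea of a single model interface shared by segment models and segmentation algorithms can be sketched as follows. ProbE's real API is not shown in this article, so every name here (Model, MeanModel, SegmentedModel, fit, predict) is a hypothetical Python stand-in chosen for illustration. The point of the sketch is that the segmented model is itself a Model and never inspects the internals of the segment models it builds.

```python
from abc import ABC, abstractmethod


class Model(ABC):
    """Hypothetical common interface for every predictive model."""

    @abstractmethod
    def fit(self, records):
        """Train on a list of (features, target) pairs and return self."""

    @abstractmethod
    def predict(self, features):
        """Return a prediction for one feature dictionary."""


class MeanModel(Model):
    """Trivial segment model: always predicts the mean training target."""

    def fit(self, records):
        self.mean = sum(target for _, target in records) / len(records)
        return self

    def predict(self, features):
        return self.mean


class SegmentedModel(Model):
    """Segmentation that works with any Model supplied by make_model."""

    def __init__(self, segment_of, make_model):
        self.segment_of = segment_of  # maps features to a segment key
        self.make_model = make_model  # zero-argument factory returning a Model
        self.models = {}

    def fit(self, records):
        groups = {}
        for features, target in records:
            groups.setdefault(self.segment_of(features), []).append((features, target))
        # One independently trained model per segment.
        self.models = {key: self.make_model().fit(group) for key, group in groups.items()}
        return self

    def predict(self, features):
        # Raises KeyError for a segment that was not seen during training.
        return self.models[self.segment_of(features)].predict(features)
```

Because SegmentedModel depends only on fit and predict, swapping MeanModel for a regression or naive Bayes implementation would require no change to the segmentation code, which is the kind of independence the paragraph above describes.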
ProbE is also designed to be an embedded system that can be incorporated into industry-specific application environments. For example, ProbE does not have a graphical user interface (GUI) of its own; instead, one would have to be supplied by the host application if so desired, as is done in the UPA and ATM-SE solutions. The interface to ProbE has been kept as simple as possible. Host applications provide ProbE with specifications of data mining tasks to be performed, and ProbE returns the results of those tasks upon completion. At present, communication is conducted through specification and results files; however, future extensions to ProbE will permit full integration with relational database systems, with task specifications and mining results communicated through database tables.
Another consideration in the design of ProbE is scalability. ProbE is designed to work with very large, out-of-core data sets. Work is also underway to develop a data-partition parallelized version of ProbE that would allow large data sets to be partitioned across multiple processors, with each processor accessing data only in the partition assigned to it and with only statistical summary information being exchanged among processors. Because this approach would minimize the amount of...
NOTE: All illustrations and photos
have been removed from this article.

|