|
...paper, narrow our attention to the realm of computer systems; we demonstrate how predictive algorithms enable us to anticipate the occurrence of events of interest related to system failures, such as CPU overload, threshold violations, and low response time.
Predictive algorithms can play a crucial role in systems management. The ability to predict service problems in computer networks, and to respond to those warnings by applying corrective actions, brings multiple benefits. First, detecting system failures on a few servers can prevent the spread of those failures to the entire network. For example, low response time on a server may gradually escalate to technical difficulties on all nodes attempting to communicate with that server. Second, prediction can be used to ensure continuous provision of network services through the automatic implementation of corrective actions. For example, prediction of high CPU demand on a server can initiate a process to balance the CPU load by rerouting new demands to a back-up server.
Several types of questions are often raised in the area of computer systems:
* What will be the disk utilization or CPU utilization next month (next year)?
* What will be the server workload in the next hour (n minutes)?
* Can we predict a severe system event (e.g., router failure) in the next n minutes?
The questions above differ in two main aspects: time horizon and object of prediction. The former characterizes our ability to perform short-term or long-term predictions and has a direct bearing on the kind of corrective actions one can apply. Any action requiring human intervention requires at least several hours, but if actions are automated, minutes or even seconds may suffice. The latter relates to the outcome of a prediction and can be either a numeric variable (e.g., amount of disk utilization) or a categorical event (e.g., router failure).
Both time horizon and object of prediction are important factors in deciding which predictive algorithm to use. In this paper, we present three major predictive algorithms addressing the following problems: (1) long-term prediction of performance variables (e.g., disk utilization), (2) short-term prediction of abnormal behavior (e.g., threshold violations), and (3) short-term prediction of system events (e.g., router failure). The first problem is solved using a regression-based approach. A salient characteristic of a regression algorithm is the ability to form a piecewise model of the time series that can capture patterns occurring at different points in time. The second problem employs time-series analysis to predict abnormal behavior (e.g., threshold violations); prediction is achieved through a form of hypothesis testing. The third problem predicts critical events by using data-mining techniques to search for patterns frequently occurring before these events.
Our goal in this paper is to provide some criteria in the selection of predictive algorithms. We proceed by matching problem characteristics (e.g., time horizon and object of prediction) with the right predictive algorithm. We use our selection criteria in three case studies corresponding to the problems described above.
Extensive work has been conducted in the past trying to predict computer performance. For example, work is reported in the prediction of network performance to support dynamic scheduling, (1) in the prediction of traffic network, (2) and in the production of a branch predictor to improve the performance of a deep pipelined micro-architecture. (3) Other studies reported in the literature (4-7) focus on predicting at the instruction level, whereas we focus on predicting at the system and event level (e.g., response time, CPU utilization, network node down, etc.). A common approach to performance prediction proceeds analytically, by relying on specific performance models; one example is in the study of prediction models at the source code level, which plays an important role for compiler optimization, programming environments, and debugging tools. (8) Our view of the prediction problem is mainly driven by historical data (i.e., is data-based). Many studies have tried to bridge the gap between a model-based approach versus a data-based approach. (9)
The rest of the paper is organized as follows. First we provide a general view of prediction algorithms and describe our approach to selecting an algorithm for the problem at hand. In the following section we discuss algorithms for long-term prediction of computer performance. Next we discuss an algorithm for detecting threshold violations of workload demands, and then we describe our approach to the prediction of system events. We list our conclusions in the last section.
Prediction in computer networks
We begin by giving a general view of the prediction problem. We then provide some criteria for selecting a predictive algorithm to use, based on the characteristics of the problem at hand.
A formulation of the prediction problem. To make predictions, one needs access to historical information. We define historical information as an ordered collection of data, [S.sub.i], that starts at a point in time [t.sub.1], covering events up to a final time [t.sub.i]. Specifically, [S.sub.i] : {[s.sub.j]}, 1 [less than or equal to] j [less than or equal to] i, where the jth element is a pair [s.sub.j] = ([v.sub.j], [t.sub.j]). The first element of the pair, [v.sub.j], indicates the value of one or more variables of interest, whereas the second element of the pair, [t.sub.j], indicates its occurrence time. The elements of [S.sub.i] are ordered, that is, [t.sub.j] [less than or equal to] [t.sub.k] for any j < k.
As an example, assume monitoring systems capture the disk utilization on a server at five-minute intervals during an experiment of one hour. In this case, the historical information is the collection of pairs [MATHEMATICAL EXPRESSION NOT REPRODUCIBLE IN ASCII], where [v.sub.j] is the disk utilization at time [t.sub.j], and time increases in five-minute steps. In some cases we want to capture the values associated with several variables at time [t.sub.j], where the first element of each event pair is a vector [v.sub.j]. For example, [v.sub.j] = ([v.sub.j1], [v.sub.j2], [v.sub.j3]), where the values on the right represent the disk utilization, the memory utilization, and the CPU utilization at time [t.sub.j], respectively. We collect data up to a point in time [t.sub.j], and the goal is to predict the disk utilization at a time [t.sub.i+k] (i.e., k steps in the future).
A prediction is an estimation of the value of a variable [v.sub.i+k] occurring at time [t.sub.i+k] in the future conditioned on historical information [S.sub.i]. Hence, a prediction is the output of a generic function conditioned on [S.sub.i], [v.sub.i+k] = g([S.sub.i]) + [[epsilon].sub.i], in which g(.) is a function capturing the predictable component and [[epsilon].sub.i] models the possible noise.
Normally, the further out our prediction, the less accurate the result. Hence, a predicted value is ideally accompanied by a probability term that reflects our degree of confidence. This confidence can be measured by a conditional probability, P([v.sub.i+k]|[S.sub.i]). Although determining the conditional probability P([v.sub.i+k]|[S.sub.i]) is always desirable, it is not always possible.
Data characteristics. Variables of interest whose values we might want to predict include the memory utilization or disk utilization on a host or group of hosts, the number of HyperText Transfer Protocol (HTTP) operations per second in a Web server, and the status of a network node (up or down) at a given time. In all these cases we rely on historical information to anticipate future behavior. We wish to emphasize the temporal component of this information: knowing when an event occurred...
NOTE: All illustrations and photos
have been removed from this article.

More articles from IBM Systems Journal
Discovering actionable patterns in event data., September 01, 2002 Machine learning in a multimedia document retrieval framework., September 01, 2002 Applying machine learning to automated information graphics generation..., September 01, 2002 Machine intelligence and the Turing Test. (Technical forum)., September 01, 2002 An architecture of diversity for commonsense reasoning. (Technical for..., September 01, 2002
Looking for additional articles?
Search our database of over 3 million articles.
Looking for more in-depth information on this industry?
Search our complete database of Industry & Market reports by text, subject, publication
name or publication date.
About Goliath
Whether you're looking for sales prospects, competitive information, company
analysis or best practices in managing your organization,
Goliath can help you meet your business needs.
Our extensive business information databases empower business
professionals with both the breadth and depth of credible,
authoritative information they need to support their business
goals. Whether it be strategic planning, sales prospecting,
company research or defining management best practices -
Goliath is your leading source for accurate information.
|