|
...al. (2002) and Berman al. (2003). The real and specific problem that underlies the grid concept is coordinated resource sharing and problem solving in dynamic, multiinstitutional virtual organizations (Foster et al., 2001). The sharing is controlled by the Resource Management System (RMS), see e.g., Krauter et al. (2002) and Nabrzyski et al. (2003).
When the RMS receives a service request it divides the task into a set of subtasks and assigns those subtasks to available resources for execution. After the resources finish the assigned jobs, they return the results back to the RMS and then the RMS integrates the received results into the entire task output which is requested by the user. This process can be approximated by a structure with a star topology, as depicted by Fig. 1, where the center is the RMS directly connected with the resources through respective communication channels.
The performance of grid computing is of great concern (Abramson et al., 2000). Usually the measure of grid performance is the task execution time (service time). This index can be significantly improved by parallel execution of subtasks by multiple online resources.
The problem of evaluating the entire task execution time for a given distribution of subtasks among resources has been extensively studied in multiprocessor systems, and algorithms for subtask distribution minimizing the execution time have been developed (Wu et al., 2004). However, in distributed grid systems where the availability of elements can be much lower than in multiprocessor systems the service time cannot be evaluated without considering the reliability aspect.
It is observed in many grid projects that the service time is a random variable that is affected by many factors (Hamscher et al. 2000). First, the resources available online have different task processing speeds. Thus, the task execution time can vary depending on which resource is assigned to execute the task/subtasks. Second, some resources can fail when running the jobs, so the execution time is also affected by the resource reliability. Similarly, the communication links in a grid service can fail during the data transmission. Thus, the communication reliability influences the service time as well as data transmission speed (bandwidth) of the communication channels, see e.g., Deeter and Smith (1998) and Levitin and Xie (2006).
[FIGURE 1 OMITTED]
Two typical types of failures are usually observed in grid systems on both resources and communication channels: permanent failures and transient failures. The permanent failures make the equipment unavailable and unable to recover itself without maintenance. The transient failures do not make the equipment unavailable, but prevent the completion of the performed actions. For example, a communication channel may be disconnected when transmitting data, but it can often be reconnected to redo the job. Some grid projects such as Globus, Condor and IU-Grid (Fox, 2003) suggest repeating failed jobs/transmissions, to tackle the transient failures that are very common in grid services. Many conventional models (Dai et al., 2002; Xie et al., 2004) only consider permanent failures. Obviously, a realistic model should combine both types of failures (e.g., Dai et al. (2005)).
Dai et al. (2002) were the first to study grid reliability, but their work considered only the reliability without taking the performance into account. Since the performance is affected by component/resource failures, it is necessary to develop a combined model of performance and reliability. Dai and Levitin (2006) studied the grid service performance considering the reliability factors as well. However, this model assumes that all subtasks allocated to different resources can start simultaneously without considering the precedence constraints, which is inapplicable for many grid services with data dependencies. Subsequently, Levitin et al. (2006) studied a grid computing model with a data dependency caused by the fact that some subtasks cannot be executed until they have received data produced by other subtasks. This model considered only permanent failures.
To further improve the model, this paper investigates a more general problem by focusing on the transient failures. A novel model and an algorithm based on the UGF (Universal Generating Function) are presented to obtain the distribution of service times in a grid with a star topology taking the subtask precedence constraints and transient and permanent failures in the resources and communication channels into account. Performance/reliability measures (performability and expected execution time) are derived from this distribution (Lyle et al., 1999).
The suggested algorithm obtains the grid service reliability/performance indices for an arbitrary service task partition into subtasks and arbitrary subtask distribution among the resources. The optimal partition and distribution problem is beyond the scope of this paper. However, the suggested model and the algorithm can be used for solution quality evaluation in optimization procedures.
The rest of the paper is organized as follows. Section 2 presents the grid service performance and reliability model. Section 3 describes a computational approach for determining service reliability indices based on the UGF technique. Section 4 provides an illustrative example.
Notation
1(x) = unity function: 1(TRUE) = 1, 1(FALSE) = 0;
Pr{e} = probability of event e;
m = number of subtasks;
[[omega].sub.i] = set of resources processing subtask i;
[[phi].sub.i] = set of immediate predecessors of subtask i;
[c.sub.i] = computational complexity (number of operations) of subtask i;
[a.sub.i] = amount of input data for subtask i;
[b.sub.i] = amount of output data for subtask i;
[x.sub.j] = task processing speed of resource j (number of operations per second);
[s.sub.j] = data transmission speed (bandwidth) of communication channel j;
[p.sub.j](t) = probability that a permanent failure does not occur during time t in resource j or in channel j: p(t) = exp(-[[lambda].sub.j]t), where [[lambda].sub.j] is the sum of permanent failure rates of these channel and resource;
[q.sub.j](t) = probability that a transient failure does not occur in channel j during time t: q(t) = exp(-[[rho].sub.j]t), where [[rho].sub.j] is the transient failure rate of the channel;
[g.sub.j](t) = probability that a transient failure does not occur in resource j during time t: g(t) = exp(-[[epsilon].sub.j]t), where [[epsilon].sub.j] is the transient failure rate of the resource;
[theta] = maximum allowed service time;
R([theta]) = probability that service time is less than [theta];
W = conditional expected system execution time;
[[DELTA].sub.ij.sup.inp] = random time of subtask i input data transmission to resource j;
[[DELTA].sub.ij.sup.pr] = random time of subtask i processing by resource j;
[[DELTA].sub.ij.sup.out] = random time of subtask i output data transmission from resource j;
[[tau].sub.ij.sup.[v]] = minimal possible realization of [[DELTA].sub.ij.sup.[v]] ([v] stands for inp, pr or out) corresponding to the case when no transient failures occur;
[[tau].sub.ij.sup.c] = the control time (the resource management system checks the status of the subtask i after time [[tau].sub.ij.sup.c] from the beginning of its processing by resource j). [[tau].sub.ij.sup.c] [greater than or equal to] [[tau].sub.ij.sup.pr];
[t.sub.ij] = random time of subtask i execution by resource j;
[^.t.sub.ij,d] = dth realization of [t.sub.ij] (realization is a fixed value that a random variable can take);
[B.sub.ij,d] = Pr {[t.sub.ij] = [^.t.sub.ij,d]};
[D.sub.ij] = number of realizations of [t.sub.ij];
[~.t.sub.ij] = random completion time for subtask i executed by resource j;
[~.T.sub.i] = random completion time for subtask i;
[T.sub.i] = random start time of subtask i;
[^.T.sub.i,l] = lth realization of [T.sub.i];
[u.sub.ij.sup.[v]] = the universal generating function (u-function) representing the conditional probability mass function (pmf) of [[DELTA].sub.ij.sup.[v]](z) ([v] stands for inp, pr or out);
[u.sub.ij](z) = the u-function representing the conditional pmf of [t.sub.ij];
[~.u.sub.ij](z, t) = the u-function representing the conditional pmf of [~.t.sub.ij] given the execution of subtask i...
NOTE: All illustrations and photos
have been removed from this article.

More articles from IIE Transactions
A data mining approach to process optimization without an explicit qua..., August 01, 2007
Looking for additional articles?
Search our database of over 3 million articles.
Looking for more in-depth information on this industry?
Search our complete database of Industry & Market reports by text, subject, publication
name or publication date.
About Goliath
Whether you're looking for sales prospects, competitive information, company
analysis or best practices in managing your organization,
Goliath can help you meet your business needs.
Our extensive business information databases empower business
professionals with both the breadth and depth of credible,
authoritative information they need to support their business
goals. Whether it be strategic planning, sales prospecting,
company research or defining management best practices -
Goliath is your leading source for accurate information.
|