MPI microtask for programming the Cell Broadband Engine[TM] processor.(Message Passing Interface)
Publication Date: 01-MAR-06
Publication Title: IBM Systems Journal
Format: Online
Author: Ohara, M. ; Inoue, H. ; Sohda, Y. ; Komatsu, H. ; Nakatani, T.

You can view this article PLUS...

  • Hundreds of the most trusted magazines, newspapers, newswires, and journals (see list)
  • Business news from North America and around the World
  • More than 10 years of article archives
  • Unlimited Access at any time - ONLINE and all in ONE place

Now for a Limited Time, try Goliath Business News - Free for 7 Days!
Tell Me More   Terms and Conditions

Purchase this article for $4.95
Already a subscriber?
Log in to view full article

Description

The Cell Broadband Engine[TM] processor employs multiple accelerators, called synergistic processing elements (SPEs), for high performance. Each SPE has a high-speed local store attached to the main memory through direct memory access (DMA), but a drawback of this design is that the local store is not large enough for the entire application code or data. It must be decomposed into pieces small enough to fit into local memory, and they must be replaced through the DMA without losing the performance gain of multiple SPEs. We propose a new programming model, MPI microtask, based on the standard Message Passing Interface (MPI) programming model for distributed-memory parallel machines. In our new model, programmers do not need to manage the local store as long as they partition their application into a collection of small microtasks that fit into the local store. Furthermore, the preprocessor and runtime in our microtask system optimize the execution of microtasks by exploiting explicit communications in the MPI model. We have created a prototype that includes a novel static scheduler for such optimizations. Our initial experiments have shown some encouraging results.

INTRODUCTION

The Cell Broadband Engine ** (BE) processor (1) is an asymmetric multicore processor that combines a general-purpose IBM PowerPC* processor element (PPE) and eight synergistic processor elements (SPEs). (2) From an architectural standpoint, this processor has a high peak performance because the SPE is simpler and more efficient than general-purpose processors in terms of the micro and memory architecture. (3) One architectural aspect is the small high-speed local store at each SPE. Because the size of the local store is limited to a range of L2-cache sizes--256 KB for the first-generation Cell BE processor--many real-world applications do not fit in the local store. While conventional microprocessors have a hardware cache to manage such a small local store, the Cell BE processor must rely on a software mechanism to manage it. This requirement for software management could impose significant challenges to programmers, but at the same time it offers significant opportunities for the software to take advantage of the raw performance of the Cell BE processor.

The microtask we propose here provides a programming model that frees programmers from local-store management and enables the preprocessor and runtime system to optimize the scheduling of computations and communications by taking advantage of the explicit communication model in the Message Passing Interface (MPI). (4,5) In the microtask model, programmers are still responsible for partitioning the application into multiple microtasks. Each microtask is essentially a virtualized SPE that uses the MPI to communicate with other microtasks.

We have chosen MPI as a communication application programming interface (API) for the following two reasons. First, the Cell BE processor adopts a distributed-memory model; the PPE and SPEs use direct memory access (DMA) operations for communications. Thus, the overhead due to a message-passing layer can be inherently small because of the commonality between the native hardware and the message-passing model. The model, moreover, can hide hardware details from programmers. Second, and perhaps more important, the message-passing model allows us to analyze the dependency between microtasks by examining message APIs. Such dependency information is essential for various optimizations in task and communication management. Among existing message-passing interfaces, we selected MPI because it is widely used as a standard interface.

Our microtask system provides a preprocessor that transforms a microtask program in the message-passing model to one in a streaming model (2) that the Cell BE processor can execute efficiently. To do this, the preprocessor first divides each microtask into a collection of basic tasks, each of which represents a unit of computation that causes communication only at its beginning and end. Thus, each basic task corresponds to a computation kernel in stream programming languages (6,7) in the sense that the concept of the basic task separates computation from communication. This separation allows the preprocessor to schedule basic tasks in such a way that data streams through SPEs over high-speed, on-chip DMA channels.

To make the streaming model effective, the preprocessor then puts basic tasks with strong dependencies together as a cluster and applies a heuristic algorithm to schedule clusters. The cluster-scheduling algorithm creates a precedence graph of clusters in a series-parallel form (8) and then applies a dynamic programming algorithm. The nested structure of the series-parallel graph allows the dynamic programming algorithm to reuse partially scheduled results to reduce scheduling time. The preprocessor statically computes runtime parameters, such as the message buffer address, for each message-passing operation so that the runtime system can avoid the overhead of computing them.

While a number of scheduling algorithms for distributed memory systems have been studied,9 they are not directly applicable to the Cell BE processor. This is because of key differences in the architectural characteristics; that is, existing algorithms assume loosely coupled coarse-grain multiprocessors, where each processor has a large local memory but the communication latency between processors is very large. The Cell BE processor, on the other hand, is a tightly coupled fine-grain multicore processor where each SPE has a small local memory but the communication latency between SPEs is very small. These differences have led us to a new clustering approach in our static scheduling algorithm.

The contribution of this paper is twofold. First, we propose a microtask model for the Cell BE processor. It frees programmers from explicit local-store management, which could be a significant burden for them. Second, we propose a novel scheduling algorithm that converts a microtask program into one for a streaming model which the Cell BE processor can execute efficiently.

RELATED WORK

The microtask model is compared with other programming models proposed for the Cell BE processor and similar architectures, and related work in static scheduling algorithms is discussed.

PPE-centric versus SPE-centric programming models

Kahle et al. (2) proposed two approaches to map application programs to the Cell BE processor: function offload and computational acceleration models. The function offload model is a PPE-centric approach in which the main application program runs on the PPE and performance-critical functions in the program are offloaded to SPEs. Typically, programmers are responsible for identifying functions to be offloaded to SPEs and for modifying the original program in such a way that the main program on the PPE uses SPE functions through a remote procedure call. This model has been popular because programmers can usually port existing single-thread programs without changing their main application logic.

The computational acceleration model, on the other hand, is an SPE-centric approach, where the main program runs on SPEs in parallel, and service functions for SPEs are performed on the PPE. Thus, this model uses SPEs in a more integrated fashion than the function offload model. Typically programmers parallelize the program manually to map it to the PPE and SPEs.

The microtask model can be considered as an SPE-centric computational acceleration model; it allows programmers to parallelize the main program by using MPI. It extends the computational acceleration model through an asymmetric thread runtime model, (2) which allows multiple SPE threads to run on a physical SPE. The preprocessor of the microtask program, furthermore, converts microtask applications to those for a streaming model, which allows the multiple SPEs to execute the computation kernels through which the data flows as a stream. In particular, the static scheduler, a part of the preprocessor, optimizes such a conversion.

Shared-memory versus message-passing programming models

Application programs on the Cell BE processor can use a shared memory or message-passing programming model. When they use a message-passing model, each of the SPEs and the PPE has, from the perspective of the application, a separate memory space. The message-passing model makes sense because DMA operations can efficiently transfer messages between two local stores and also between the system memory and a local store. For example, an SPE can transfer a message directly between user-level buffers on a different local store. Historically, the IBM System/390 * coupling facility applied similar techniques to implement message passing. (10)

When applications use a shared-memory model, the PPE and SPEs share the off-chip system memory. SPEs can access the system memory by using DMA operations in a cache-coherent way. (2) The shared-memory approach makes sense because the memory access latency via DMA is comparable with that of L2-cache misses of conventional shared memory multicore processors and also because the PPE and SPEs can share a common effective-address space. To run conventional shared memory programs on SPEs, however, one must modify them because SPEs can access the shared memory only by using DMA operations but not by using load/store operations.

Eichenberger et al. (11) implemented an OpenMP compiler and...



More articles from IBM Systems Journal
A marketing maturity model for IT: building a customer-centric IT orga..., March 01, 2006
Online games and e-business: architecture for integrating business mod..., March 01, 2006
A context-aware smart-call-center solution: improving customer service..., March 01, 2006
Content protection for games., March 01, 2006
High-performance server systems and the next generation of online game..., March 01, 2006

Looking for additional articles?
Click here to search our database of over 3 million articles.