|
...memory requirements, and scalar code for functions such as game artificial intelligence, for which fast response times and a full-featured programming environment are critical. The Cell Broadband Engine[TM] architecture targets such applications, providing both flexibility and high performance by utilizing a 64-bit multithreaded PowerPC[R] processor element (PPE) with two levels of globally coherent cache and eight synergistic processor elements (SPEs), each consisting of a processor designed for streaming workloads, a local memory, and a globally coherent DMA (direct memory access) engine. Growth in processor complexity is driving a parallel need for sophisticated compiler technology. In this paper, we present a variety of compiler techniques designed to exploit the performance potential of the SPEs and to enable the multilevel heterogeneous parallelism found in the Cell Broadband Engine architecture. Our goal in developing this compiler has been to enhance programmability while continuing to provide high performance. We review the Cell Broadband Engine architecture and present the results of our compiler techniques, including SPE optimization, automatic code generation, single source parallelization, and partitioning.
INTRODUCTION
The Cell Broadband Engine (BE) processor provides both flexibility and high performance. The first-generation Cell BE processor includes a 64-bit multithreaded PowerPC processor element (PPE) with two levels of globally coherent cache. For additional performance, the Cell BE processor includes eight synergistic processor elements (SPEs), each containing a synergistic processing unit (SPU). Each SPE consists of a processor designed for streaming workloads, a local memory, and a globally coherent DMA engine. Computations are performed by 128-bit-wide single instruction multiple data (SIMD) functional units. An integrated high-bandwidth bus connects the nine processors and their ports to external memory and I/O.
The intricacy of the Cell BE processor spans multiple dimensions, each presenting its own set of challenges for both the highly skilled application developer and a highly optimizing compiler. At the elementary level, the Cell BE system has two distinct processor types, each with its own application-level instruction-set architecture (ISA). One ISA (for the PPE) is the familiar 64-bit PowerPC with a vector multimedia extension unit (VMX); the other (for the SPEs) is a new 128-bit SIMD instruction set for multimedia and general floating-point processing. The first Cell BE releases consist of one PPE and eight SPEs, each SPE with its own 256-KB local memory to accommodate both program instructions and data. Typical applications on the Cell BE processor consist of a variety of code that exploits both processor types.
The most basic level of programming support for the Cell BE platforms consists of two separate compilers, one targeting the PPE and the other targeting the SPEs, along with a set of utilities and runtime support for loading and running code on the SPEs and transferring data between the system memory and the local stores of the SPEs. It has been demonstrated that very competitive performance can be achieved with the deployment of a low-level programming model, but to make the architecture interesting and accessible to a more general user community, it is useful to abstract the details and present a higher-level view of the system. This issue is addressed by providing a highly optimized compiler for the Cell BE architecture.
IBM has long provided state-of-the-art compiler support for the PowerPC platform, including automatic and user-directed exploitation of shared-memory parallelism. We use this same compiler technology to exploit the performance potential of the Cell BE architecture. The prototype compiler that we have developed for the Cell BE platform generates code, within a single compilation and under option control, for either the PPE or the SPEs, or both. The PPE path of the prototype is essentially the existing PowerPC compiler, complete with VMX support and tuned for the PPE pipeline. For the SPEs, a new path has been developed to target the specific architectural features of this attached processor, including automatic exploitation of the four-way SIMD units. The prototype compiler innovatively takes advantage of and extends existing parallelization technology to enable partitioning and parallelization across multiple heterogeneous processing elements from within a single compilation process. We also draw on the large body of existing research on program restructuring techniques to automate and optimize data transfer between the multiple processing elements of the system. Our work extends previous research in taking into account not only the heterogeneity of the multiple processing elements but also the nature of the small attached local memories, which are designed to handle both code and data.
When compiling for the most elementary level of the Cell BE architecture, the pipelines of both processors must be taken into account. The SPEs present several challenges not seen in the PPE, chief among them instruction prefetch capabilities and the significant branch miss penalties resulting from the lack of hardware branch prediction. To achieve high rates of computation at moderate costs in power and area, functions that are traditionally handled in hardware, such as memory realignment, branch prediction, and instruction fetches, have been partially offloaded to the compiler. Our techniques address these new demands on the compiler. In the section "Optimized SPE code generation," we discuss in detail the following optimizations: generating scalar code on SIMD units, optimizing language-dictated conversions (i.e., those required by a particular programming language) to increase computations on subwords (i.e., data that is smaller than a word), reducing the performance impact of branches through branch hinting and branch elimination, and scheduling instructions in the presence of limited hardware support for dual issuing and instruction fetching.
At the next level of complexity, the SPE is a short SIMD or multimedia processor, which was not designed for high performance with scalar code. Although the compiler does support explicit programming of the SIMD engine by means of intrinsics (i.e., functions that are built into the compiler as opposed to those contained in libraries), it also provides the novel auto-SIMDization functionality, which generates vector instructions from scalar source code for the SPEs and the VMX units of the PPE. Auto-SIMDization is the process of extracting SIMD parallelism from scalar loops. In the section "Generation of SIMD code," we describe auto-SIMDization in some detail, including how it minimizes overhead due to misaligned data streams and how it is tailored to handle many of the code structures found in multimedia and gaming applications.
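As a rough illustration of what extracting SIMD parallelism from a scalar loop means, the following sketch models the transformation in Python. Lists of four elements stand in for 128-bit registers holding four 32-bit values. All function names are hypothetical, and the sketch deliberately ignores the misaligned-data handling mentioned above; it is not the compiler's actual algorithm.

```python
VECTOR_WIDTH = 4  # four 32-bit elements fit in one 128-bit register


def add_scalar(b, c):
    """Original scalar loop: one element per iteration."""
    a = [0] * len(b)
    for i in range(len(b)):
        a[i] = b[i] + c[i]
    return a


def vadd(x, y):
    """Stand-in for a single SIMD add that handles all four lanes at once."""
    return [x[lane] + y[lane] for lane in range(VECTOR_WIDTH)]


def add_simdized(b, c):
    """The same loop after SIMDization: four elements per iteration,
    followed by a scalar epilogue for the leftover elements."""
    n = len(b)
    a = [0] * n
    main_end = n - n % VECTOR_WIDTH
    for i in range(0, main_end, VECTOR_WIDTH):
        a[i:i + VECTOR_WIDTH] = vadd(b[i:i + VECTOR_WIDTH],
                                     c[i:i + VECTOR_WIDTH])
    for i in range(main_end, n):  # leftover elements
        a[i] = b[i] + c[i]
    return a
```

Both functions compute the same result; the SIMDized form simply performs roughly one quarter as many loop iterations.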
Using the parallelism of the Cell BE processor when deploying applications across all its processing elements, our compiler enhances its programmability by parallelizing and partitioning a single source program across the PPE and the eight SPEs, guided by user directives. The compiler also efficiently uses the complex memory system that ties all these processors together on the chip and interfaces with the external storage. While the PPE makes use of a conventional two-level cache, each SPE draws data and instructions from its own small memory, internal to the chip. Data transfers to and from the local stores must be explicitly managed by using a DMA engine. Within the compiler, we have developed techniques to generate and optimize the code that accomplishes data transfer, allowing a single SPE to process data that far exceeds the local store's capacity, using code that also exceeds the size of its local store, and scheduling the necessary transfers so that they overlap ongoing computation to the extent that this is achievable. In the section "Optimized SPE code generation," we discuss the compiler's generation of parallel code and describe our code-partitioning techniques.
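The idea of overlapping data transfer with computation can be sketched with a double-buffering loop. In this Python model, a list slice stands in for a nonblocking DMA transfer, and `process_streamed`, `dma_get`, and `compute` are hypothetical names introduced only for illustration. The point is the ordering: the transfer for the next chunk is issued before the current chunk is processed, which on real hardware would let the two proceed concurrently.

```python
def process_streamed(main_memory, chunk_size, compute):
    """Process data that exceeds the local store, one chunk at a time,
    alternating between two local buffers."""

    def dma_get(start):
        # Stand-in for a nonblocking DMA transfer into a local-store buffer.
        return main_memory[start:start + chunk_size]

    results = []
    if not main_memory:
        return results

    buffers = [None, None]
    buffers[0] = dma_get(0)  # prefetch the first chunk
    for k, start in enumerate(range(0, len(main_memory), chunk_size)):
        current = k % 2
        next_start = start + chunk_size
        if next_start < len(main_memory):
            # Issue the next transfer first, so that it can overlap
            # with the computation on the current buffer.
            buffers[1 - current] = dma_get(next_start)
        results.extend(compute(buffers[current]))
    return results
```

For example, `process_streamed(list(range(10)), 4, lambda chunk: [2 * x for x in chunk])` processes three chunks of at most four elements each while never holding more than two chunks locally.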
Our goal in developing this compiler has been to enhance the programmability of the architecture while continuing to provide respectable performance. On suitable benchmarks, we currently demonstrate average speedup factors of 1.3, 9.9, and 6.8 for our SPE, SIMD, and parallelization compilation techniques, respectively, indicating some initial success with our approach. In the section "Measurements," we briefly review our current performance measurements, and we conclude in the following section.
CELL BE ARCHITECTURE
The implementation of the first-generation Cell BE processor (1) includes a Power Architecture processor and eight attached processor elements connected by an internal, high-bandwidth Element Interconnect Bus (EIB). Figure 1 shows the organization of the Cell BE elements.
[FIGURE 1 OMITTED]
The PPE consists of a 64-bit, multithreaded Power Architecture processor with two levels of on-chip cache. The cache preserves global coherence across the system. The processor also supports IBM's VMX (2) to accelerate multimedia applications by using VMX SIMD units.
A major source of computing power is provided by the eight on-chip SPEs. (3) An SPE consists of a new processor designed to accelerate media and streaming workloads, its local noncoherent memory, and its globally coherent DMA engine. The units of an SPE and key bandwidths are shown in Figure 1.
Most instructions operate in a SIMD fashion on 128 bits of data representing either two 64-bit double-precision floating-point numbers or longer integers, four 32-bit single-precision floating-point numbers or integers, eight 16-bit subwords, or sixteen 8-bit characters. The 128-bit operands are stored in a 128-entry unified register file. Instructions may take up to three operands and produce one result. The register file has a total of six read and two write ports.
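The different ways of viewing one 128-bit operand can be made concrete with a short Python sketch that reinterprets the same 16 bytes under each of the element widths listed above. The function name `lanes` and the dictionary keys are illustrative choices, and big-endian byte order is assumed.

```python
import struct


def lanes(register_bytes):
    """Reinterpret one 128-bit register image under each SIMD element width."""
    assert len(register_bytes) == 16
    return {
        "2 x 64-bit double": struct.unpack(">2d", register_bytes),
        "4 x 32-bit float": struct.unpack(">4f", register_bytes),
        "4 x 32-bit integer": struct.unpack(">4i", register_bytes),
        "8 x 16-bit subword": struct.unpack(">8h", register_bytes),
        "16 x 8-bit character": struct.unpack(">16B", register_bytes),
    }


# The same 16 bytes, built here from four 32-bit integers.
register = struct.pack(">4i", 1, 2, 3, 4)
views = lanes(register)
```

Here `views["4 x 32-bit integer"]` is `(1, 2, 3, 4)`, while `views["8 x 16-bit subword"]` exposes the same bits as eight halfwords.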
The memory instructions also access 128 bits of data, with the additional constraint that the accessed data must reside at addresses that are multiples of 16 bytes. Thus, when addressing memory with vector load or store instructions, the lower four bits of the byte addresses are simply ignored. To facilitate the loading and storing of individual values, such as a character or an integer, there is additional support to extract or merge an individual value from or into a 128-bit register.
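The following Python sketch models these memory semantics on a `bytearray` standing in for the local store. Quadword accesses drop the low four address bits, a scalar load extracts a value from the enclosing quadword, and a scalar store becomes a read-merge-write sequence. The function names are hypothetical; the sketch assumes big-endian 32-bit integers that are 4-byte aligned, so that a value never straddles two quadwords.

```python
QUADWORD = 16  # bytes accessed by every load or store


def load_quadword(local_store, address):
    base = address & ~0xF  # the low four address bits are ignored
    return bytes(local_store[base:base + QUADWORD])


def store_quadword(local_store, address, quadword):
    base = address & ~0xF
    local_store[base:base + QUADWORD] = quadword


def load_word(local_store, address):
    """Scalar load: fetch the enclosing quadword, then extract the value."""
    quadword = load_quadword(local_store, address)
    offset = address & 0xF
    return int.from_bytes(quadword[offset:offset + 4], "big", signed=True)


def store_word(local_store, address, value):
    """Scalar store: fetch the quadword, merge the value in, write it back."""
    quadword = bytearray(load_quadword(local_store, address))
    offset = address & 0xF
    quadword[offset:offset + 4] = value.to_bytes(4, "big", signed=True)
    store_quadword(local_store, address, quadword)
```

Note that a single scalar store touches all 16 bytes of the quadword, which is why the neighboring bytes must be read first and written back unchanged.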
An SPE can dispatch up to two instructions per cycle to seven execution units that are organized into even and odd instruction pipes. Instructions are issued in order and routed to their corresponding even or odd pipe by the issue logic, that is, a component which examines the instructions and determines how they are to be executed, based on a number of constraints. Independent instructions are detected by the issue logic and are dual-issued (i.e., dispatched two per cycle) provided they satisfy the following condition: the first instruction must come from an even word address and use the even pipe, and the second instruction must come from an odd word address and use the odd pipe. When this condition is not satisfied, the two instructions are executed sequentially. The instruction latencies and their pipe assignments are shown in Table 1.
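The dual-issue condition can be expressed as a small predicate. In this Python sketch, `Instr` is a hypothetical record, and the `independent` check is a deliberately simplified assumption (it only compares register names); the real issue logic considers more constraints than are modeled here.

```python
from collections import namedtuple

# Hypothetical instruction record: word address, assigned pipe ("even" or
# "odd"), destination register, and tuple of source registers.
Instr = namedtuple("Instr", "word_address pipe dest sources")


def independent(first, second):
    """Simplified dependence test: the second instruction neither reads
    nor overwrites the register written by the first."""
    return first.dest not in second.sources and first.dest != second.dest


def can_dual_issue(first, second):
    """Condition from the text: even address on the even pipe, followed by
    odd address on the odd pipe, with no dependence between the two."""
    return (independent(first, second)
            and first.word_address % 2 == 0 and first.pipe == "even"
            and second.word_address % 2 == 1 and second.pipe == "odd")
```

A pair that fails the predicate is still executed correctly; it is simply issued sequentially, which is why the compiler's instruction scheduling and code placement matter for throughput.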
The SPE's 256-KB local memory supports fully pipelined 16-byte accesses (for memory instructions) and 128-byte accesses (for instruction fetches and DMA transfers). Because the memory has a single port, instruction fetches, DMA, and memory instructions compete for the same port. Instruction fetches occur during idle memory cycles, and up to 3.5 fetches may be buffered in the instruction fetch buffer to better tolerate bursty peak memory usage. Because each 128-byte fetch holds 32 instructions, the maximum capacity of the buffer is thus 3.5 x 32 = 112 32-bit instructions. An explicit instruction can be used to initiate an inline instruction fetch.
The SPE hardware assumes that branches are not taken, but the architecture allows for a "branch hint" instruction to override the default branch prediction policy. In addition, the branch hint instruction causes a prefetch of up to 32 instructions, starting from the branch target, so that a branch taken according to the correct branch hint incurs no penalty. One of the instruction fetch buffers is reserved for the branch-hint mechanism. In addition, there is extended support for eliminating short branches by using select instructions.
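To show how a select operation can replace a short branch, the sketch below computes a maximum in two ways. It is a Python model operating on unsigned 32-bit values; `select_bits` is a hypothetical stand-in for a bitwise select instruction, and the comparison is assumed to produce an all-ones or all-zeros mask.

```python
MASK32 = 0xFFFFFFFF


def select_bits(a, b, mask):
    """Bitwise select: take the bits of b where mask is 1, else the bits of a."""
    return (a & ~mask & MASK32) | (b & mask)


def max_branching(x, y):
    """Version with a short conditional branch."""
    if x > y:
        return x
    return y


def max_branchless(x, y):
    """Same result without a branch: compare into a mask, then select."""
    mask = -(x > y) & MASK32  # all ones when x > y, all zeros otherwise
    return select_bits(y, x, mask)
```

Both paths of the original conditional are evaluated in the branchless form, so this trade is attractive mainly when the branch bodies are short and a mispredicted branch is expensive.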
Data is transferred between the local memory and the DMA engine in units of 128 bytes. The DMA engine can support up to 16 concurrent requests of up to 16 KB originating either locally or remotely. The DMA engine is part of the globally coherent memory address space; addresses of local DMA requests are translated by an MMU (memory management unit) before being sent on the bus. Bandwidth between the DMA engine and the EIB is 8 bytes per cycle in each direction. Programs interface with the DMA unit through a channel interface and may initiate blocking as well as nonblocking requests.
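Because a single request is limited to 16 KB, a larger transfer has to be broken into several requests. The following Python sketch shows one straightforward way to do the splitting. The name `split_transfer` is hypothetical, and the sketch does not model alignment requirements or the limit on concurrent requests.

```python
MAX_DMA_BYTES = 16 * 1024  # largest single request, per the text


def split_transfer(address, size):
    """Split one large transfer into (address, size) requests of at most 16 KB."""
    requests = []
    offset = 0
    while offset < size:
        chunk = min(MAX_DMA_BYTES, size - offset)
        requests.append((address + offset, chunk))
        offset += chunk
    return requests
```

For a 40-KB transfer this yields two 16-KB requests followed by one 8-KB request.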
OPTIMIZED SPE CODE GENERATION
In this section, we describe the current compiler optimization techniques that address key architectural features of the SPE. A user interested in SPE code generation may observe that our SPE compiler produces the high quality code normally associated with the XL compiler suite.
Scalar code on SIMD units
As mentioned in the section "Cell BE architecture," most SPE instructions are SIMD instructions operating on 128 bits of data at a time, including all memory instructions. One notable exception is the conditional branch instruction, which branches on nonzero values from the primary slot (i.e., the highest order or leftmost 32 bits) of a 128-bit register. The address fields are also expected by memory instructions to reside in primary slots.
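The role of the primary slot can be illustrated with a small Python model of a 128-bit register as 16 bytes. The byte rotation used here to move a value into the primary slot is an assumption made for illustration, as are all function names; the sketch only shows why a scalar that sits elsewhere in the register must be repositioned before a conditional branch can test it.

```python
QUADWORD = 16


def rotate_left_bytes(quadword, count):
    """Rotate a 16-byte register image left by the given number of bytes."""
    count %= QUADWORD
    return quadword[count:] + quadword[:count]


def primary_slot(quadword):
    """The leftmost 32 bits of the register, as an unsigned integer."""
    return int.from_bytes(quadword[:4], "big")


def move_to_primary_slot(quadword, address):
    """Bring the word found at byte offset (address mod 16) to the primary slot."""
    return rotate_left_bytes(quadword, address & 0xF)


def branch_taken(quadword):
    """A conditional branch tests only the primary slot for a nonzero value."""
    return primary_slot(quadword) != 0
```

A register whose only nonzero word sits at byte offset 8 does not trigger the branch until that word has been rotated into the leftmost position.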
When scalar code is generated on an SPE, it is critical that the SIMD nature of the processor does not get in the way of program correctness. For example, an a = b + c integer computation on a scalar processor simply requires two scalar loads, one add, and one store instruction. When executing on the SPE, a load of b...
NOTE: All illustrations and photos
have been removed from this article.

|