...for quality of service, and is compatible with the standards being developed in the technical community. We describe how we are implementing this vision in IBM today and how we expect the implementation to evolve in the future.
Grid computing offers the power to address some of the world's most challenging problems; for example, struggles to prevent cancer and cure smallpox, to reliably predict earthquakes and global warming, and many others. Computationally intensive analytic applications could also benefit: accurate risk computations could help investment companies minimize losses; insurance companies could more rapidly detect fraud. Two key benefits of grid computing would enable these advances. First, grids harness heterogeneous systems together into a megacomputer, and hence, can apply greater computational power to a task. Second, a grid virtualizes these heterogeneous resources, so that applications for the grid can be written as if for a single, local computer, vastly simplifying the development needed for such powerful applications.
Of course, these wonderful applications depend not only on computing power, but also on data--and often on vast volumes of heterogeneous, distributed data, collected or generated by various groups, and stored in diverse systems. The data sources might be files, databases, or applications, and the data might be structured (e.g., relational), semi-structured (e.g., Extensible Markup Language--XML--documents) or unstructured content (e.g., images). For the promise of grid computing to be fulfilled, not only must we harness and virtualize multiple computing resources, but we must also abstract and hide the diversity and distribution of these various information sources to provide applications with a single, powerful virtual-information store for their virtual computer.
In this paper, we propose an information infrastructure for grid computing that will meet this lofty goal. This information infrastructure will have three key characteristics; in particular, it will be:
* Virtualized--allowing a collection of distributed information resources to be shared and managed as if they were a single information store, although they may in fact remain fully distributed.
* Autonomic--ensuring that the interconnected information systems can be managed effectively and efficiently through self-management, just like the human autonomic nervous system.
* Open--utilizing open interfaces and agreed-upon standards to enable highly interoperable systems and processes.
© Copyright 2004 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
These characteristics will enable the information infrastructure to produce information on demand, in the spirit of IBM's on demand initiative. More specifically, they will help customers derive the benefits they expect from a grid. Customer motivations for using grid computing, besides the hope of meeting some of the grand challenges outlined above, include such pragmatic desires as:
* Enhanced collaboration within and across enterprises by rapid integration and sharing of distributed heterogeneous information
* Scalability by adding new copies of data to offload swamped servers
* Faster response times by efficient processing of data-intensive queries
* Better availability through transparently exploiting alternative copies of information
* Lower cost through leveraging existing resources and easier administration
* Faster time to value by simplifying the task of application development
These motivations have been echoed by numerous customers in a range of industries, including life science, finance, manufacturing and the scientific community. Our proposed infrastructure for information will be the backbone the grid needs to meet these demands.
An information infrastructure to meet these needs requires a flexible architecture rich in function. We believe these needs will best be met by a service-oriented architecture in which each service is self-configuring and self-maintaining, guided by user-provided policies. The services will, together and individually, provide certain transparencies, or abstractions, that shield applications from the complexities of a distributed, heterogeneous environment. Core services of this architecture include data access and integration, discovery, meta-data, data placement, change-publish (for publishing changes to data), replication, and caching. Services can build on (use) other core services as well as other grid services such as registration, billing, policy management, and so on. The architecture will be an open one, supporting the existing and evolving grid standards (1) so that many implementations of individual services and of the collective infrastructure will be possible.
This paper proposes a service-oriented, policy-driven information infrastructure for the grid. We detail our vision and the services required in the next section, and illustrate how these services could be used to facilitate data-intensive applications on the grid. In the third section, we look at how these services are being implemented today. The following section focuses on the likely future evolution of our implementation. Standards are critical to a service-oriented architecture, and we describe the relevant standards efforts in the section "The role of standards."
An information infrastructure for the grid
For a large-scale, distributed grid to be successful, the grid infrastructure should make application development easy. Ideally, all the resources used in computing--processors, storage, databases, applications--should be virtualized in such a way that the application developers, administrators, and users are shielded from the details and dynamics of how the necessary services are provided. Specifically, a requestor of grid services should not be affected by the number of data and computing resources, their locations, their failures, and their specific hardware and software configurations.
The grid infrastructure should transparently provision the right data and computing resources for each application. We also want application programmers to be able to specify end-to-end goals for the quality of the provisioning. Those goals, often referred to as quality of service (QoS) goals, may include goals for system availability, response time, throughput, number of concurrent users supported, currency or accuracy of the data, and so on. The grid infrastructure needs to support the definition of policies that set QoS goals and define the conditions under which they must be met.
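As an illustration only, a QoS policy of the kind described above could be represented as a small record of goals that a service checks against what it observes. The class names, field names, and units below are hypothetical; the paper does not define a policy format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class QoSPolicy:
    """Hypothetical policy record; a goal left as None is not constrained."""
    name: str
    max_response_ms: Optional[float] = None    # response-time goal
    min_availability: Optional[float] = None   # fraction of time available, 0..1
    max_staleness_s: Optional[float] = None    # currency-of-data goal


@dataclass(frozen=True)
class Observation:
    """Hypothetical measurements a service might collect about itself."""
    response_ms: float
    availability: float
    staleness_s: float


def violations(policy: QoSPolicy, obs: Observation) -> List[str]:
    """Return the names of the goals in `policy` that `obs` does not meet."""
    failed = []
    if policy.max_response_ms is not None and obs.response_ms > policy.max_response_ms:
        failed.append("response_time")
    if policy.min_availability is not None and obs.availability < policy.min_availability:
        failed.append("availability")
    if policy.max_staleness_s is not None and obs.staleness_s > policy.max_staleness_s:
        failed.append("currency")
    return failed


# Assumed example: a response-time and availability goal, no currency goal.
policy = QoSPolicy("surgery-planning", max_response_ms=2000, min_availability=0.999)
slow = Observation(response_ms=3500, availability=0.9995, staleness_s=60)
print(violations(policy, slow))
```

A self-managing service could run such a check periodically and react to any goal it reports as missed, for example by requesting an additional replica.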
Thus, our vision for the information infrastructure consists of a set of services that individually and collectively support a set of transparencies. By upholding these transparencies, the services virtualize the underlying resources, simplifying application development. A set of policies governs the functioning of these services. This vision extends the work done in the Global Grid Forum (GGF) on the Open Grid Services Architecture (2) (OGSA) to the information infrastructure, building on the work being done in the data-oriented working groups of GGF (e.g., References 3 and 4). In this section, we first provide an example of a grid scenario that we want the information infrastructure to handle. We then identify the set of services we believe are required for the information infrastructure, followed by a description of the transparencies we expect our services to provide to applications and users. A fourth subsection provides a brief overview of policies and illustrates their use in our example scenario. Finally, we illustrate how the services will interact to maintain transparencies and meet policy goals.
An example. To understand the extent and power of our information infrastructure, consider a worldwide grid of hospital information systems, containing patient records such as hospital visits, medication history, doctor reports, x-rays, symptoms history, genetic information, and so on. Transparent access to such a grid with QoS guarantees could enable a variety of useful tasks. We outline a few examples below.
Patient Health Overview: Many health-related applications would benefit from an integrated view of medical records for individual patients. Today these records may be scattered across various hospitals and doctors' offices. For example, a doctor planning surgery could use such views to provide better, safer care. If some of the records for the patient were unavailable at a given time, the doctor would still like to get as many as possible to continue to plan the surgery to the extent possible.
Computer-Aided Diagnostics: To diagnose diseases, a doctor could compare a given patient's symptoms with those of other patients around the world. This would be especially helpful for diseases that are uncommon in a region and therefore unfamiliar to the local doctor. Again, a partial result set would be better than no information, although the doctor might want even the partial results to span a representative subset of the data. Further, when certain symptoms are found, they may be propagated to the Centers for Disease Control and Prevention (CDC) to allow tracking of potential epidemics, or disseminated to other physicians in the area to alert them to the increased likelihood of a particular disease.
Pharmaceutical Research: A researcher could study patients with common characteristics to assess the efficacy of various treatments on classes of people. The analysis would be both computation- and data-intensive, but the data and computation would be dynamically distributed among multiple nodes on a grid. Further, the researcher would need to link data about patients from hospital records with the pharmaceutical companies' own experimental results. As several researchers in a company may be working on related areas, the researcher would also like to be informed when new results on particular biological or chemical substances are made available.
The challenge in performing these tasks is that medical information systems are distributed, heterogeneous, and autonomously administered. Patient information is independently entered at different hospitals, which bear responsibility for the security and privacy of this data. Data sources may come and go, due to events such as new medical centers joining, hardware and software failures, or even password expiration; thus, it is very difficult for application developers to program these tasks directly against data sources. Because the proposed information infrastructure presents a unified view of these diverse data sources, application developers can write their programs as if all the data were centrally located and always available.
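The Patient Health Overview task gives a feel for what such a unified view has to do: ask every known source, and return whatever is reachable instead of failing outright. The following is a minimal sketch under stated assumptions. The function name, the record layout, and the use of `ConnectionError` to signal an unavailable source are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]
Source = Callable[[str], List[Record]]  # maps a patient id to that source's records


def patient_overview(patient_id: str,
                     sources: Dict[str, Source]) -> Tuple[List[Record], List[str]]:
    """Collect records from every reachable source.

    Returns the records found plus the names of the sources that could not
    be reached, so the caller knows the result may be partial.
    """
    records: List[Record] = []
    unavailable: List[str] = []
    for name, fetch in sources.items():
        try:
            records.extend(fetch(patient_id))
        except ConnectionError:
            unavailable.append(name)
    return records, unavailable


# Two stand-in sources: one answers, one is offline.
def hospital_a(patient_id: str) -> List[Record]:
    return [{"patient": patient_id, "type": "x-ray"}]


def hospital_b(patient_id: str) -> List[Record]:
    raise ConnectionError("source offline")


records, missing = patient_overview(
    "p-17", {"hospital_a": hospital_a, "hospital_b": hospital_b})
print(records, missing)
```

Reporting the unavailable sources alongside the partial result lets the doctor in the example continue planning while knowing which records are still outstanding.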
A set of services for the information infrastructure. Figure 1 illustrates the set of services we imagine for the information infrastructure for the grid. While no system provides all of these capabilities today, prototypes and even commercial versions of some services do exist. Standards activities in GGF will ensure that as more of the pieces are created, they can be put together to form the information infrastructure we envision here.
[FIGURE 1 OMITTED]
The information infrastructure provides applications, such as those shown at the top of Figure 1, with transparent access to heterogeneous, dispersed information sources, like those that appear at the bottom of the figure. To add an information source explicitly, it is first registered via the Registration Services. This step provides information pertaining to that source to the Meta-data Services, which know about all available sources and how they ought to be represented within a unified view to the consuming applications. Discovery Services can be used to automatically identify possible information sources and to help knit them into a unified view by depositing the required meta-data into the Meta-data Repository (not shown) using Meta-data Services. A Discovery Service might use a Registration Service to enter sources it has found in the Meta-data Repository.
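The relationship among registration, meta-data, and discovery can be sketched as follows. This is a simplified illustration, not the interfaces defined by the paper or by GGF: all class names, method names, and source locations are made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class SourceDescription:
    """Hypothetical description of one information source."""
    name: str
    kind: str                      # e.g. "relational", "xml", "file"
    location: str
    attributes: List[str] = field(default_factory=list)


class MetadataService:
    """Knows every available source (stands in for the Meta-data Repository)."""

    def __init__(self) -> None:
        self._repository: Dict[str, SourceDescription] = {}

    def store(self, description: SourceDescription) -> None:
        self._repository[description.name] = description

    def sources_with(self, attribute: str) -> List[str]:
        """Names of the sources that expose the given attribute."""
        return sorted(name for name, d in self._repository.items()
                      if attribute in d.attributes)


class RegistrationService:
    """Explicit entry point: validates a source and passes it to meta-data."""

    def __init__(self, metadata: MetadataService) -> None:
        self._metadata = metadata

    def register(self, description: SourceDescription) -> None:
        if not description.location:
            raise ValueError("a source must state where it can be reached")
        self._metadata.store(description)


class DiscoveryService:
    """Enters sources it has found by going through the Registration Service."""

    def __init__(self, registration: RegistrationService) -> None:
        self._registration = registration

    def enter_found_sources(self, found: Iterable[SourceDescription]) -> int:
        count = 0
        for description in found:
            self._registration.register(description)
            count += 1
        return count


metadata = MetadataService()
registration = RegistrationService(metadata)
registration.register(SourceDescription(
    "oracle_visits", "relational", "db://hospital-a/visits",
    ["patient_id", "visit_date"]))
discovery = DiscoveryService(registration)
discovery.enter_found_sources([SourceDescription(
    "xray_files", "file", "file://hospital-b/xrays", ["patient_id", "image"])])
print(metadata.sources_with("patient_id"))
```

Routing discovery through registration keeps a single validation path for sources, whether they are added explicitly or found automatically.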
Arguably the most essential of the services shown are Data Services. Data Services handle requests for information from applications or from other services. A particular Data Service may represent one specific information source, for example, a file (mydata.xls) or a relational database (a single Oracle** instance). However, it could also be implemented by middleware that encapsulates and translates among several data sources. Distributed file systems, gateways, mediators, and federated systems are examples of such middleware. In this case the Data Service represents the collection of information accessible by the middleware that implements it. By using the other services of the information infrastructure, it is possible to build a very sophisticated Data Service that can handle complex queries, locate relevant information sources for a query, and ensure that QoS goals are met.
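The point that a Data Service may stand for a single source or for middleware spanning several sources can be illustrated with one shared interface. The sketch below assumes a deliberately simple query model (a row predicate) and hypothetical class names; a real Data Service would accept a richer query language and handle translation between source formats.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


class DataService(ABC):
    """Common interface: callers cannot tell one source from many."""

    @abstractmethod
    def query(self, predicate: Predicate) -> List[Row]:
        ...


class SingleSourceDataService(DataService):
    """Represents one specific source, here an in-memory list of rows."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows = list(rows)

    def query(self, predicate: Predicate) -> List[Row]:
        return [row for row in self._rows if predicate(row)]


class FederatedDataService(DataService):
    """Represents the collection of information reachable through its members."""

    def __init__(self, members: Iterable[DataService]) -> None:
        self._members = list(members)

    def query(self, predicate: Predicate) -> List[Row]:
        results: List[Row] = []
        for member in self._members:
            results.extend(member.query(predicate))
        return results


clinic = SingleSourceDataService([{"patient": "p-1", "age": 71}])
hospital = SingleSourceDataService([{"patient": "p-2", "age": 34},
                                    {"patient": "p-3", "age": 68}])
grid_view = FederatedDataService([clinic, hospital])
over_65 = grid_view.query(lambda row: row["age"] > 65)
print([row["patient"] for row in over_65])
```

Because the federated service implements the same interface as its members, federations can themselves be nested inside larger federations.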
Depending on the access patterns and locality of the consuming applications, a Placement Management Service can improve response time or availability by creating caches or replicas. In effect, a Placement Management Service automatically distributes copies of the data to optimize performance. The Placement Management Service provides intelligence to determine what data to copy to meet the QoS goals. It relies upon Replication or Cache Services to actually do the work, creating the data copies, registering their existence with a Meta-data Service, and then populating the copies. Of course, Replication and Cache Services can also be used independently of the Placement Management Service. For example, an administrator might use a Replication Service to ensure the availability of certain data for a disaster recovery scenario. Change-Publish Services can detect changes in data and deliver them to a consumer, providing the ability to "publish" changes. A Replication Service might use a Change-Publish Service to know when to propagate and apply changes for a replica, or an application might subscribe to changes in data it particularly cares about.
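A rough sketch of how these three services might cooperate is given below. It makes several simplifying assumptions that are not in the paper: the placement decision is a plain count of remote reads against a threshold, a replica is modeled as a list of applied changes, and registering the replica with a Meta-data Service and populating it initially are omitted. All names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, DefaultDict, Dict, List, Tuple


class ChangePublishService:
    """Delivers each published change to the subscribers of that data set."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, dataset: str, callback: Callable[[str], None]) -> None:
        self._subscribers[dataset].append(callback)

    def publish(self, dataset: str, change: str) -> None:
        for callback in self._subscribers[dataset]:
            callback(change)


class ReplicationService:
    """Creates replicas and keeps them current via the Change-Publish Service."""

    def __init__(self, changes: ChangePublishService) -> None:
        self._changes = changes
        self.replicas: Dict[Tuple[str, str], List[str]] = {}

    def create_replica(self, dataset: str, site: str) -> None:
        applied: List[str] = []
        self.replicas[(dataset, site)] = applied
        # Every later change to the data set is applied to this replica.
        self._changes.subscribe(dataset, applied.append)


class PlacementManagementService:
    """Decides what to copy; delegates the copying to the Replication Service."""

    def __init__(self, replication: ReplicationService, threshold: int) -> None:
        self._replication = replication
        self._threshold = threshold
        self._remote_reads: Counter = Counter()

    def record_remote_read(self, dataset: str, site: str) -> None:
        key = (dataset, site)
        self._remote_reads[key] += 1
        if (self._remote_reads[key] >= self._threshold
                and key not in self._replication.replicas):
            self._replication.create_replica(dataset, site)


changes = ChangePublishService()
replication = ReplicationService(changes)
placement = PlacementManagementService(replication, threshold=3)
for _ in range(3):
    placement.record_remote_read("patients", "zurich")
changes.publish("patients", "update row 42")
print(replication.replicas)
```

The Replication Service remains usable on its own: an administrator could call `create_replica` directly for a disaster recovery copy without involving the placement logic.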
Finally, the services provided by the information infrastructure can use other grid services, as indicated on the right-hand side of Figure 1. For example, the overall grid architecture (2) provides a Notification Service, which could be used to inform an autonomic Meta-data Service of relevant changes in the state, meta-data, or location of information sources. Another important grid service is the Policy Management Service, which is used by most of the information infrastructure services to discover what QoS goals they must sustain.
We believe this set of services, working together as described above, is an information infrastructure that can fully support the needs of grid computing and its users. In the next two sections we first discuss the transparencies that we expect these services to provide, and then how policies are used by the services. With that background, we present a scenario using several of the services, elaborating on...