Trends in Cluster Computing, Exclusive Gartner Analyst Interview with Carl Claunch

Clustered systems have been on the rise in the High Performance Computing community. Can you comment on this trend? Why have clustered systems become so popular?

Clusters are the fruit of decades of labor in the HPC community striving towards two goals – increased performance and lowered cost. Our primary tool in seeking ever-higher pinnacles of performance has been maximizing the use of parallelism. The most effective fundamental strategy to drive down the cost of HPC computing has been the employment of standardized, high-volume, and near-commodity technology wherever possible.

Performance
Parallelism efforts aiming at setting new records for performance have required the HPC world to develop suitable algorithms, implement effective parallel codes, and construct systems that exploit many resources. Often cost was no object in this race for speed.

The earliest mainstream uses of parallelism in HPC, vector machines and high-performance SMP and MPP systems, were expensive compared to general purpose computing systems but able to reach the heights of performance sought by users. Standard, high-volume technologies are more capable today, allowing excellent levels of performance to be attained without resorting to the exotic and expensive approaches of the past.

The prior machines depended upon purpose-built circuits, communications paths and memories, appropriate given the cost-is-no-object mentality of the race for pure performance, but elevating these machines out of the reach of many potential users. The broader market for HPC also has an interest in more affordable costs.

Cost
The second grand objective of the HPC world, lowering the cost of HPC systems, matters even to the performance-obsessed few. Funding limits, budget constraints and competing needs are all practical issues of importance. For the wider potential market that either did not need the dizzying heights of maximum performance or that simply could not afford the price, the most important priority was the drive to lower the costs of HPC equipment.

A cluster, a solution built with a maximum of cost-effective, standard building blocks and a minimum of expensive or custom components, provides parallelism in much the same way as an MPP machine. In addition, it makes use of widely available general networking technology for the communications paths between processors.

Clusters are now gaining a bigger share of the HPC market, being used where previously only a more expensive alternative like MPP would have sufficed. This newfound popularity stems from several sources. First, the continued invention of more parallel algorithms, and the diffusion into the market of more programs implementing them, have increased the number of problems for which parallel acceleration is possible. Second, standard high-volume networking technologies have become very high performance, supporting effective scaling for parallel execution for more classes of problems than in the past. Third, innovations have been applied to clustering that increase scaling, reduce delays and improve manageability. The sum of these trends yields a far wider applicability for clusters in the HPC market than had existed previously.

Would all clusters be considered equal? Do users have options when implementing cluster solutions?
There is a wide variety among the different clusters available in the market today. These vary in the type and speed of the processors, in the size of the shared-memory nodes, in the interconnects used to enable communications between parallel threads of execution, in the programming models and in the programming interfaces available. The results that can be achieved are absolutely not the same on all systems. In order to achieve the results they expect, businesses looking to solve problems with a cluster solution need to fit the type of system they employ to the intended use.

Processors
Clusters are created with processing nodes of many types. As examples, among the clusters available are systems based on Intel® Itanium®, Intel® Xeon™, Intel Xeon with 64-bit extensions, AMD Opteron™, IBM POWER, and Sun™ SPARC™. Many of these solutions provide a 64-bit memory space for programs requiring large addressability, but some do not. Some provide the larger addressability mainly for real storage but not for programmer use, while most have directly addressable 64-bit memory spaces. These processor types vary in performance; for example, some have advantages in floating point computations while others might lead the pack in memory bandwidth. Even within any processor architecture, the manufacturer typically offers a range of chips with different cycle times, cache sizes and other performance-related characteristics.
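To put the addressability difference in numbers, here is a minimal sketch of the arithmetic. It assumes a flat, byte-addressed memory space; the function name is illustrative and does not come from any vendor tool.

```python
def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a flat, byte-addressed pointer of the given width."""
    return 2 ** address_bits


GIB = 2 ** 30  # one gibibyte

# A 32-bit space tops out at 4 GiB; a 64-bit space is 2**32 times larger.
print(addressable_bytes(32) // GIB, "GiB with 32-bit addresses")
print(addressable_bytes(64) // GIB, "GiB with 64-bit addresses")
```

Real processors of either width may expose less than the theoretical maximum, so these figures are upper bounds rather than usable capacities.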

Node Size
Compounding these fairly basic differences, one cluster may use nodes with at most two processors, while other cluster designs may couple 8, 16 or more processors into each node. Machines with larger nodes, having more processors, may provide a better fit for workloads where at least some of the tasks run better on a single big node than when spread across several smaller ones; this may be because the tasks make intensive use of shared memory or because they are not structured to run on more than one node at a time.

Networking Requirements
The performance of the network connecting the nodes may be a critical factor if the application requires maximum speed. One application may scale well over parallel systems with a modest-bandwidth, high-delay network between the parallel threads because its algorithm does not demand as constant an interchange of values as another application, which will scale only on the fastest of cluster networks. You may be unfortunate enough to run one of the applications that run well only on an MPP system, for example, because they will not scale up without the extreme network performance delivered by the special switches inside that system. The network behavior of the tasks to be accomplished determines how stringent the network performance must be in a proposed cluster solution.
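A rough way to see this effect is a toy cost model in which compute time divides across nodes while each node pays a latency and transfer cost for every message. The model, the function names, and every number below are assumptions made for illustration; they are not measurements of any real network or application.

```python
def parallel_time(nodes, compute_s, messages_per_node, bytes_per_message,
                  latency_s, bandwidth_bytes_per_s):
    """Estimated run time: compute splits evenly across nodes, and each node
    pays latency plus transfer time for every message it exchanges."""
    compute = compute_s / nodes
    if nodes == 1:
        return compute  # a single node exchanges no messages
    per_message = latency_s + bytes_per_message / bandwidth_bytes_per_s
    return compute + messages_per_node * per_message


def speedup(nodes, **params):
    """Speedup over a single node under the toy model above."""
    return parallel_time(1, **params) / parallel_time(nodes, **params)


# Hypothetical networks and workloads, chosen only to show the contrast.
commodity = {"latency_s": 100e-6, "bandwidth_bytes_per_s": 100e6}
fast = {"latency_s": 2e-6, "bandwidth_bytes_per_s": 1e9}
quiet_app = {"compute_s": 1000.0, "messages_per_node": 100,
             "bytes_per_message": 1024}
chatty_app = {"compute_s": 1000.0, "messages_per_node": 1_000_000,
              "bytes_per_message": 1024}

for app_name, app in (("quiet", quiet_app), ("chatty", chatty_app)):
    for net_name, net in (("commodity", commodity), ("fast", fast)):
        print(app_name, net_name, round(speedup(64, **app, **net), 1))
```

Under these assumed figures the quiet application scales almost linearly on either network, while the chatty one gains little on the slower network and recovers most of its scaling only on the faster one.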

Application Type
Applications whose parallel threads of execution each touch only a small subset of the total data, with strong locality of reference, will scale well on a cluster by shipping the data associated with each thread to the node where it will execute. Other applications have unpredictable reference patterns, perhaps even tap-dancing across the entirety of the data pool from each parallel thread. In this latter type of application, the cluster must have excellent accessibility to all data in order to achieve decent scaling. The nature of the storage, the access paths and performance for cross-node input-output operations, can all be key selection factors in that case.
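The contrast between the two access patterns can be sketched by counting how many of a thread's accesses land on data held by another node. The block-wise partitioning, the sizes, and all names here are hypothetical choices for illustration.

```python
import random

NODES = 8
BLOCKS_PER_NODE = 1000


def owner_of(block):
    """Node that holds a block under simple block-wise partitioning."""
    return block // BLOCKS_PER_NODE


def remote_fraction(accesses, home_node):
    """Fraction of a thread's accesses that touch data on another node."""
    remote = sum(1 for block in accesses if owner_of(block) != home_node)
    return remote / len(accesses)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
home = 3

# Strong locality: the thread touches only the blocks shipped to its node.
local_pattern = [
    rng.randrange(home * BLOCKS_PER_NODE, (home + 1) * BLOCKS_PER_NODE)
    for _ in range(10_000)
]
# Unpredictable references: the thread wanders over the whole data pool.
scattered_pattern = [
    rng.randrange(NODES * BLOCKS_PER_NODE) for _ in range(10_000)
]

print("local pattern, remote share:", remote_fraction(local_pattern, home))
print("scattered pattern, remote share:",
      round(remote_fraction(scattered_pattern, home), 2))
```

With eight nodes, a uniformly scattered pattern sends roughly seven of every eight accesses off-node, which is why the storage and cross-node access paths become the deciding factors for that class of application.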

Memory Accessibility
Many clusters do not provide a shared memory across the nodes; the constraints on building large numbers of nodes in an SMP configuration led many designers to build shared-nothing machines depending solely upon networking connections between nodes. Some added shared-disk but still kept the memories separate between nodes, and a separate OS would run on each node in the system.

However, a relatively modern technique for sharing memory called Non-Uniform Memory Access (NUMA) broke through the SMP barriers, allowing larger numbers of nodes to share a single memory. These NUMA-based shared memory configurations may be the right solution for classes of applications whose scaling would be limited on shared-nothing designs.

Even with clusters having shared memory, there can be differences in how it is implemented. One type may have separate memory in each node, with a separate copy of the operating system running on each node. Others may implement a fully shared, common memory for all nodes.

One class of application that may demand a full shared memory model would have unpredictable, wide-ranging and high-intensity access to data. Because the shared memory system executes just a single copy of the operating system across all the nodes, the in-memory buffer pool is accessible by all. As soon as a node updates some part of the data, that changed data is held in the shared buffer for all to read. Rather than having each node issue a separate read to place the data in the buffer pool in its memory, one read by any node places the data where every node can then access it in memory.
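The saving from a common buffer pool can be sketched by counting the reads that must go to storage when every node eventually wants the same data. This is a simplified model: it assumes an unbounded pool with no eviction and ignores updates and invalidation, and the function name is made up for the example.

```python
def storage_reads(requests, shared_pool):
    """Count the reads that must go to storage.

    requests: iterable of (node, block) pairs in arrival order.
    shared_pool=True models one buffer pool visible to every node;
    shared_pool=False models a private buffer pool in each node.
    """
    cached = set()
    reads = 0
    for node, block in requests:
        key = block if shared_pool else (node, block)
        if key not in cached:
            cached.add(key)  # first touch: fetch from storage, then cache
            reads += 1
    return reads


# Four nodes each reference the same 100 blocks.
requests = [(node, block) for node in range(4) for block in range(100)]

print("shared pool:", storage_reads(requests, shared_pool=True))    # 100
print("private pools:", storage_reads(requests, shared_pool=False))  # 400
```

In this toy case the private pools issue one read per node per block, while the shared pool issues one read per block regardless of how many nodes ask for it.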

Programming Approaches
The programming approach for building an application that operates in parallel threads of execution is in some ways simplest in shared memory, using common queues, semaphores and other techniques to synchronize and coordinate the threads. Programming for a shared-nothing machine may involve different schemes and more complex mechanisms. When the applications come from a software vendor already prepared for parallel execution, the difficulty or ease of programming is not a factor, but when considering the changes needed to make homegrown codes more parallel, the complexity of the approach is a valid concern. The performance scaling may differ between partial and full shared-memory implementations if the application requires considerable locking and synchronization support from the OS, because the partial shared-memory cluster is implementing this across many discrete OS instances.
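As a small illustration of the shared-memory style, the sketch below has several threads pull work from one common queue and update one shared total under a lock. It uses Python's standard `queue` and `threading` modules; the function name and the workload (summing squares) are made up for the example. Python threads here show the coordination pattern only, not a performance gain.

```python
import queue
import threading


def run_workers(items, worker_count=4):
    """Sum the squares of items using threads that share a queue and a total."""
    work = queue.Queue()
    for item in items:
        work.put(item)

    total = 0
    total_lock = threading.Lock()  # guards the shared total

    def worker():
        nonlocal total
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # no work left for this thread
            with total_lock:
                total += item * item

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return total


print(run_workers(range(10)))  # 285
```

On a shared-nothing cluster the same job would instead have to partition the items, send them to each node explicitly, and gather partial sums back over the network, which is the extra mechanism the paragraph above refers to.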

Software Availability
Moving beyond the hardware itself, clusters vary considerably in the software available to program and operate the system. A cluster requiring a different operating system than other systems in the business will be less desirable than one that exploits existing skills and staffing. Different mechanisms exist for the control of parallel threads, particularly on shared-nothing clusters, and the appropriate libraries and tools must be available for a given cluster to be suitable to run applications dependent on that methodology. Even the software for managing and administering the cluster can be an important factor, as a more proficient system may be able to deliver high levels of service to the users and require fewer resources to do so than a more primitive implementation.

So, no, not all clusters are equal. A business should look closely at its needs, the intended applications to be run, and the specific characteristics of the candidate systems before choosing one to purchase.

What kind of challenges might a user experience with different approaches to clusters?
Because different types of clusters will behave differently, some applications can disappoint by scaling performance only slightly or not at all in spite of being dispatched over more and more nodes. For a given approach to clustering, the solution may require inventing a new algorithm, substantially recoding the application, or wastefully provisioning huge numbers of nodes to reap the meager incremental performance.
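One common way to reason about such diminishing returns is Amdahl's law, which the interview does not name; it is offered here only as an illustrative model. It assumes a fixed fraction of the work can run in parallel and ignores communication costs entirely, so real results may be worse.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Ideal speedup when only part of the work can be spread across nodes."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


# Hypothetical application in which 90% of the work parallelizes.
for nodes in (16, 64, 256, 1024):
    print(nodes, "nodes:", round(amdahl_speedup(0.9, nodes), 2))
```

In this model an application that is 90 percent parallel can never exceed a tenfold speedup, however many nodes are provisioned, which is the "meager incremental performance" scenario described above.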

A cluster whose design or software complement does not match the tools, programming languages, programming interfaces or other aspects of the existing application may require serious changes to support a given application. If the application is procured from a software vendor, it may not even be possible to get the application to run on the existing cluster machine you own. For example, an off-the-shelf application compiled to run on SPARC processors will not execute without recompilation on Itanium-based machines.

A large cluster installation running a complex mix of different applications requires very good tools to manage them, to ensure good service for the applications and high utilization of the entire complex. If the tools available with a given cluster system are inadequate, users may experience frequent delays, aborted runs, erratic operation, extended outages and many idle nodes. Attempting to substitute people to overcome the defects of the management software can be a very expensive band-aid.

Because the cluster design can rule out certain programming approaches, users may have codes that are almost unsalvageable on those clusters, due to the radical changes that would be necessary.

If a cluster forces a business to adopt new operating systems, middleware, or programming languages, the impact can be high. Productivity goes down, training requirements push aside productive tasks, errors increase due to inexperience, and costs often balloon as well.

How do you see clusters evolving in the next 5 years?
We expect to see the pool of applications suitable for cluster execution grow steadily as improved algorithms, rewritten software, better cluster implementations, and continued market growth all impel the designers and vendors of these applications to aim them at clusters as a platform.

The high-volume, standardized, near-commodity networking technologies will continue to grow more performant, permitting those applications that today are limited to MPP machines to run well on more cost-effective clusters. A few years' progress following Moore's and Gilder's laws (that the capabilities of chips and bandwidth, respectively, double in specific short periods) will give us extremely fast and wide pipes to connect the many nodes in clusters.

Scaling approaches across general networks to machines with multiple owners, grid in other words, will drive improvements in management and control software that will be immediately applicable to clusters as well. Since clusters will not involve the issues of priority, funding, security and so forth that occur in a grid setting, clusters will remain a more widespread solution than grid.

The market momentum of Linux®, already the most popular OS for cluster systems, will bring with it a larger pool of applications that can be run on clusters with little or no modification.

The processors themselves and the memory technology in the nodes will also compound as suggested by Moore's law, allowing the same sized task to be accomplished across fewer and fewer nodes as those nodes get faster. Clusters with modest numbers of nodes will provide staggering levels of computing power, while the scalability of cluster technology with those fast future nodes will reach new Olympian peaks.
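The compounding effect can be sketched with a little arithmetic. The doubling period used below (two years) is an assumption chosen for illustration, since the interview does not state one, and the sketch further assumes the task scales perfectly with per-node speed.

```python
import math


def nodes_needed(nodes_today, years, doubling_period_years):
    """Nodes required for the same task if per-node speed doubles every
    period and the task scales perfectly (both simplifying assumptions)."""
    speed_factor = 2 ** (years / doubling_period_years)
    return max(1, math.ceil(nodes_today / speed_factor))


# Hypothetical task that needs 256 nodes today, with an assumed
# two-year doubling period.
for years in (0, 2, 4, 6):
    print(years, "years:", nodes_needed(256, years, 2), "nodes")
```

Under these assumptions the node count for a fixed task halves with each doubling period, which is the sense in which modest clusters inherit the power of much larger ones.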

Source: Gartner



©2004 Silicon Graphics, Inc. All rights reserved. Silicon Graphics, SGI, Altix, the SGI logo and the SGI cube are registered trademarks and The Source of Innovation and Discovery are trademarks of Silicon Graphics, Inc., in the U.S. and/or other countries worldwide. Linux is a registered trademark of Linus Torvalds in several countries. Intel and Itanium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. All other trademarks mentioned herein are the property of their respective owners.

Technology Insight is published by SGI. Editorial supplied by SGI is independent of Gartner analysis. All Gartner research is © 2004 by Gartner, Inc. and/or its Affiliates. All rights reserved. All Gartner materials are used with Gartner's permission and in no way does the use or publication of Gartner research indicate Gartner's endorsement of SGI's products and/or strategies. Reproduction of this publication in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable. Gartner disclaims all warranties as to the accuracy, completeness or adequacy of such information. Gartner shall have no liability for errors, omissions or inadequacies in the information contained herein or for interpretations thereof. The reader assumes sole responsibility for the selection of these materials to achieve its intended results. The opinions expressed herein are subject to change without notice.

 


Winter 2005
Trends in Cluster Computing