Cluster's Latest Stand
Professional-grade tools for Linux and new attention from Microsoft make high-performance clusters an increasingly attractive alternative to expensive SMP systems.
December 1, 2005
High Performance Computing (HPC) usually brings one of two responses from enterprise IT architects. Most commonly, IT big thinkers see the annual list of the world's 500 fastest computers and marvel at the achievement, while simultaneously breathing a sigh of relief that their organization isn't tasked with managing such a beast.
But an increasing minority of architects is also looking at the other end of the list and thinking about the competitive advantage their company might achieve with a computing resource capable of cranking out calculations by the teraflops. We're not just talking traditional HPC applications such as simulations; nowadays data mining and even transaction processing can be handled by clusters.
The biggest names in the industry are now working to bring the cost of HPC down and joining forces with open-source projects that have long led the state of the art in clustering. Factor in the commercialization of open-source projects, a new offering from Microsoft, and the roaring performance of Intel and AMD's 64-bit dual-core chips, and clusters become a resource that belongs on any IT shortlist. 200Gflops clusters can cost as little as $100,000 to build, whereas proprietary RISC-based Symmetrical Multiprocessors (SMPs) with similar performance typically cost at least five times as much.
To be sure, the need for a massively parallel processing compute resource doesn't start with IT. Line-of-business requirements will likely dictate the need, but how IT responds is another matter. Historically, clusters have been used primarily in government institutions and university research programs, and sparingly throughout high-tech industries ranging from telecommunications to life sciences, geosciences, finance, and various realms of manufacturing. Rather than building their own cluster, even in coordination with OEMs, enterprise buyers often settle for commercially available SMP systems.
The reason for the enterprise's distaste for clusters has been fairly clear. Building and managing them requires expertise and a willingness to use open-source tools with few commercial alternatives. The result is a resource highly dependent on key staff, both to run the system and to build the applications that run on it. Meanwhile, SMPs are seen not only as more manageable, but also as possessing a larger software library. What's more, in most cases that software library has been developed on 64-bit RISC systems, so building a cluster based on commodity 32-bit x86 hardware is something of a step backward for applications.
All of these are valid concerns, but several drivers are coming together to make clusters much more attractive for the enterprise. Here's what you should know before making the decision to upgrade an existing SMP system.
COMMODITY MUSCLE
Perhaps the most obvious pro-cluster development is the availability of economically priced 64-bit x86 multicore systems. Now that Intel and AMD are shipping dual-core 64-bit chips, creating a cluster that performs at teraflops levels doesn't require a lot of nodes (self-contained computers that may have one or more CPU cores). Nor does it necessarily require fancy interconnecting network technology. AMD, for instance, claims that a single Opteron with two CPU cores runs at about 7.8Gflops using the Linpack benchmark. The company further claims that a four-node cluster with eight cores turns in a Linpack score of 58Gflops (the scaling isn't linear because of parallel application overhead). A relatively modest 64-node (128-core) cluster of these systems should turn in a Linpack number of at least 0.5Tflops. Until just recently, that sort of 64-bit performance required specially configured RISC systems that came with commensurately larger price tags.
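Back-of-envelope math shows how such projections work. The C sketch below scales AMD's claimed four-node score up to 64 nodes using an assumed scaling-efficiency factor; the 60 percent figure is an illustrative assumption, not a vendor benchmark.

/* Rough cluster Linpack projection from a small-cluster baseline.
 * The scaling-efficiency figure is an illustrative assumption,
 * not a measured or vendor-published number.
 */
#include <stdio.h>

int main(void)
{
    double baseline_gflops = 58.0;  /* AMD's claimed 4-node, 8-core Linpack score */
    int    baseline_nodes  = 4;
    int    target_nodes    = 64;    /* 128 cores */
    double efficiency      = 0.60;  /* assumed fraction of linear scaling */

    double estimate = baseline_gflops
                      * ((double)target_nodes / baseline_nodes)
                      * efficiency;

    printf("Projected Linpack for %d nodes: %.0f Gflops (%.2f Tflops)\n",
           target_nodes, estimate, estimate / 1000.0);
    return 0;
}

Under those assumptions the projection comes out to roughly 0.56Tflops, in line with the half-teraflops estimate above.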
It should come as no surprise that Intel is a huge proponent of x86-based clusters (as well as custom-built massively parallel systems). To that end, it has made significant investments in compilers, math libraries, and parallel debuggers that allow ISVs and anyone else to port their applications to Intel-based clusters.
[Chart: Pentium Floating Point Performance]
PathScale is another development tool vendor with cluster-optimized tools. It sees a good bit of its business focused on AMD systems, for which AMD also provides its own math libraries. PathScale's approach is unique because it's also a leading supplier of InfiniBand hardware. That tight integration can simplify such tasks as kernel bypass for message passing.
The open-source community has made parallel enhancements to the GNU compilers, and somewhat surprisingly, Microsoft is enhancing its Visual Studio environment to support cluster development. All these environments at least support the Message Passing Interface 2 (MPI-2) specification, which is commonly used for developing parallel applications.
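To make the programming model concrete, here's a minimal MPI program in C. It's a generic sketch rather than any vendor's sample code; it would typically be compiled with an MPI wrapper compiler such as mpicc and launched across the cluster with mpirun or mpiexec.

/* Minimal MPI program: each process in the parallel job reports its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* join the parallel job */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's ID within the job */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}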
Taken together, x86 performance and the availability of high-quality development tools that support parallel processing standards mean ISV and homegrown applications should port fairly easily to x86 clusters. But getting the application running is only part of the problem. Managing clusters is another pain point that has given enterprise decision makers pause.
USE MANY, MANAGE ONE
While cluster management would appear to be more complex than managing an SMP system, cluster computing has from its origin recognized the management dilemma and tried to alleviate it. While the focus here is primarily on systems with 64 nodes or fewer, clusters have always been envisioned as containing hundreds or thousands of nodes. So while policing configurations by hand on 16- or 32-node systems might be possible, automation is clearly required when thousands of nodes are in play.
Management systems therefore now focus on job scheduling and recovery as well as system configuration. Clusters are managed from a single master node, with processes appearing to the operator as though they're running on that node even when they're running out on the cluster. What tends to differentiate management systems is their ability to monitor the performance of the cluster; manage the distribution of system images, updates, and patches; and determine how the master node handles the insertion or deletion of compute nodes.
The last of these is also affected by the programming methodology used. Programs using the MPI standard need to know details about the cluster at startup, such as the number of nodes in the cluster. Once an MPI application is running, those parameters can't change. That's fine for many applications, but some run indefinitely, chewing up data and spitting out results constantly. For these applications, the Parallel Virtual Machine (PVM) programming methodology makes more sense because it can recover from a missing node or, conversely, take advantage of new nodes as they come online.
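As a rough illustration of PVM's dynamic model, a master process can grow the virtual machine and spawn work on a node that wasn't present at startup. The sketch below is generic; the host name and worker binary are hypothetical placeholders.

/* PVM sketch: add a host to the running virtual machine and spawn a
 * worker on it. "node17" and "worker" are hypothetical placeholders.
 */
#include <pvm3.h>
#include <stdio.h>

int main(void)
{
    char *new_hosts[] = { "node17" };   /* hypothetical node joining the cluster */
    int   infos[1];
    int   tids[1];

    int mytid = pvm_mytid();            /* enroll this process in PVM */

    /* Grow the virtual machine at run time; an MPI job fixed at startup cannot. */
    pvm_addhosts(new_hosts, 1, infos);

    /* Start one instance of the (hypothetical) worker program on the new host. */
    pvm_spawn("worker", NULL, PvmTaskHost, "node17", 1, tids);

    printf("Master tid %d spawned worker tid %d\n", mytid, tids[0]);
    pvm_exit();
    return 0;
}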
NOT JUST FOR OEMS ANYMORE
For architects, choosing a system--whether hardware or management--has largely been a matter of comfort with a particular OEM, giant or tiny. HP, IBM, SGI, and Sun Microsystems all offer RISC-based cluster systems as well as management software, oftentimes requiring the vendor's proprietary version of Unix, too. That's now changing. IBM has long offered Linux support through its Cluster Systems Management products. In February, SGI teamed up with cluster management vendor Scali for Linux support. Sun supports Linux through its N1 Grid software, though its Sun Cluster product still only supports Solaris. HP offers its XC Cluster software and has gained important system management capabilities through its purchase of RLX Technologies in October.
On the hardware side, all the vendors, including Dell, offer configurations specifically intended for cluster computing. Sun in particular is making a lot of noise with its Opteron-based Sun Fire servers. Sun offers a quarter-teraflops single-rack configuration based on diskless Sun Fire 2100 servers with dual-core chips, costing just less than $120,000 including Solaris and the Sun N1 grid management software. A half-teraflops system takes two racks and costs less than $200,000. The Sun Fire 2100 is Sun's entry-level 1U single-chip server and starts at about $1,000--the total system cost is mostly for the software and the extra memory normally required by cluster applications.
For those who want to build their own clusters, Scali and Scyld (pronounced "skilled") offer commercial management systems, while the open-source community offers the popular Rocks management tools distribution. It should be noted that Scyld is essentially a commercialized version of Beowulf (see "Beowulf: First HPC Cluster"). In mid-2003, Scyld merged with Penguin Computing, which continues to sell Scyld products separately and as part of turnkey clusters. One of the more interesting systems in Penguin's cluster offerings is its cluster on wheels. The 12-slot blade system can house up to 48 cores using dual-core technology. It provides a cool 200Gflops for about $100,000, and it'll all fit under your desk--or more importantly, in a lab next to test equipment.
But perhaps even more unexpected is Microsoft's recent foray into HPC. In November, attendees of the Supercomputing 2005 (SC|05) show, which probably not coincidentally was held in Seattle, were not only treated to a keynote address from Bill Gates, but were surprised to find that the largest booth on the show floor was Microsoft's. At the show, Microsoft released beta 2 of its Compute Cluster Solution (CCS) for Windows Server 2003. Its booth was stocked with working demos running on hardware from all the major x86 vendors.
Microsoft's interest and efforts in HPC, though recent (it's only been at the last two SC shows, after a decade-long hiatus), are quite serious. The mover and shaker behind Microsoft's HPC interest appears to be none other than Gates himself, who recently also toured five university campuses encouraging students to consider careers in computer science and engineering. Along with the CCS product and Microsoft's enhancements to its development environment, Microsoft is sponsoring 10 "Institutes for HPC" around the world. What better way to prime the pump for CCS?
CCS compute nodes run a bare-bones but otherwise unmodified version of Server 2003. The master node serves as the basis for the image that's distributed to all compute nodes. CCS compute nodes are licensed differently than a typical server; Microsoft's license agreement essentially forbids the compute node from being used for any other purpose. The system is managed through snap-ins to the Microsoft Management Console (MMC), and users' jobs are run in the submitting user's context, with all the privileges and restrictions typical for that user.
Microsoft points to its integration with Active Directory, full implementation of MPI-2, and the use of MMC as the management interface as substantially reducing the complexity of managing a cluster computing environment. Whether Microsoft really got cluster management right on its first try remains to be seen, but the company's spokespeople are certainly saying all the right words. General availability of CCS is scheduled for the first half of 2006.
ETHERNET vs. INFINIBAND
While the past year has seen a flurry of activity on the part of the software and systems hardware vendors, the interconnect technology for clusters has followed a slow and predictable evolution--that is, until Cisco Systems announced in April that it intended to purchase Topspin Communications. Topspin is a maker of InfiniBand switches with integrated Ethernet and Fibre Channel connectivity, with its sweet spot in clusters of 64 or fewer nodes. Before Cisco's purchase, InfiniBand was headed toward the more esoteric end of cluster computing--those clusters with many hundreds or thousands of nodes. In fact, that's still where Voltaire, Topspin's primary competitor, does its best business.
Cisco's purchase of Topspin was something of a shock because Intel, AMD, Microsoft, and all the major OEMs had dropped their InfiniBand plans in the post-bubble days of 2001 and 2002. That left a handful of start-ups to fill the void in what amounted to proprietary ways. That too is just now being addressed with the formation late last year of the Open InfiniBand Alliance, which is promoting the development of interoperable InfiniBand protocol stacks. The culmination of the group's efforts was a plugfest at SC|05. For its part, Cisco has announced new switches and new management software, but these are more grid-oriented than cluster-oriented.
The question, then, is whether InfiniBand is necessary for a high-performance cluster. The answer depends on the application at hand. While InfiniBand is a far less expensive 10Gbps medium than Ethernet, what's at least as important is its low latency. PathScale claims that process-to-process latency is as low as 1.4usec with its 10Gbps hardware tied directly to AMD's HyperTransport. Gigabit Ethernet sees latencies in the tens of microseconds, and 10 Gigabit Ethernet just under 10usec. If TCP/IP is used as the transport protocol, latencies are considerably higher, particularly if no kernel bypass technology is at work--typically about 50usec for Gigabit Ethernet.
Latency is the key measure because messages between cluster processes tend to be very short, essentially mimicking local memory accesses. Ideally, accessing memory on another node should incur latency as close as possible to accessing local memory. The more internode traffic an application requires, the more important it is that the interconnection network not introduce significant latency. In the case where little or no internode communication is required, Gigabit Ethernet works just fine, and in fact some of the systems on the list of 500 fastest computers use it for the interconnection network.
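Latency figures like those above are typically measured with a simple ping-pong test. The C/MPI sketch below is a generic version of that benchmark, not any vendor's published code: two ranks bounce a small message back and forth and report the average one-way time.

/* Generic MPI ping-pong latency sketch: run with at least two ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const int iters = 10000;
    char msg[8] = {0};                  /* small message, so timing is latency-dominated */
    int rank, i;
    MPI_Status status;
    double start, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();

    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, sizeof msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            MPI_Recv(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(msg, sizeof msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    elapsed = MPI_Wtime() - start;
    if (rank == 0)
        printf("Average one-way latency: %.2f usec\n",
               elapsed / (2.0 * iters) * 1e6);

    MPI_Finalize();
    return 0;
}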
THE WEAK LINK
If there's a weak link in the cluster story, it's that cluster file systems are maturing more slowly than other cluster technologies. For clusters with fewer than 100 nodes, existing cluster file system technology is fine; however, as clusters get very large, keeping data fed to them can be a challenge, as can finding a place to store results. Much of the research going on today surrounds the issue of getting data to where it's needed, when it's needed. Veritas Software and PolyServe both offer cluster file system technology worth noting. PolyServe in particular got a boost last year when HP added PolyServe software to its price list. On the open-source side, Lustre is a project with particular promise for extremely data-intensive applications. Like Beowulf, it has a commercial version from Cluster File Systems, Inc.
Editor-in-Chief Art Wittmann can be reached at [email protected].
Beowulf: First HPC Cluster
In 1994, NASA was looking for a way to let scientists test their code before running it on super-expensive supercomputers. The original project used 16 systems sporting Intel 486 DX4 chips (the 486 DX was Intel's first x86 chip with a built-in floating-point unit) and plain old Ethernet connections (no Ethernet switch, however--those were too exotic and too expensive). Despite the modest performance of the component systems, Beowulf became an extremely useful computing resource in its own right, particularly when faster Pentium chips appeared along with Fast Ethernet switches.
By 1997, Beowulf clusters with roughly 200 systems clocked in at 10Gflops, an impressive feat for a system that could be built for under $1 million. That was the same year the world's fastest computers broke the 3Tflops mark. Those systems came with price tags in the nine-digit range and, interestingly enough, were built with a lot of Intel CPUs.
Today, Beowulf development continues to benefit from advancements in commodity CPU performance as well as high-throughput, low-latency networking technologies.