File Systems That Fly
The superfast input-output speeds of server cluster file systems could change the way companies approach storage.
June 20, 2005
Building powerful supercomputers from off-the-shelf PCs, disk drives, and Ethernet cables running on the open-source Linux operating system has become more than a way to get high-performance computing on the cheap. Those clusters have upended the market for large systems over the last decade. But the ability to shuttle data between the computers and disks hasn't kept pace with advances in microprocessor and memory speeds, adding time and costs to important projects. Now an emerging class of file-system software for clusters stands to change the way companies buy storage.
Cluster file systems, including the open-source Lustre technology developed by the Department of Energy and commercially backed by Hewlett-Packard, speed input-output operations. The technology already is making a difference at universities, national labs, and supercomputing research centers, and it could make inroads into general business computing in coming years.
"In terms of raw performance, it's incredibly fast," says Scott Studham, chief technology officer for the National Center For Computational Sciences at Oak Ridge National Laboratory and president of a Lustre user group. With Lustre, I/O speeds range from hundreds of megabytes of data per second to or from disk to 2 Gbytes per second per computer. And since results increase nearly in lockstep with the number of workstations attached, aggregate speeds in a cluster can reach dozens of gigabytes per second while reading from disk.
"Enterprise-class file systems won't do this," says Greg Brandeau, VP of technology at Pixar Animation Studios, which runs a cluster file system from startup Ibrix Inc. The system serves up 240 billion data requests a day from Pixar's 2,400-CPU rendering farm for the computer-animated film Cars, due next year. Pixar is for the first time using "ray tracing" techniques that lend its characters reflective chrome and more realistic shadows, but which place massive demands on CPUs and networks. "We've realized over the past six months that we're not doing enterprise-class computing anymore--we're a high-performance computing shop," Brandeau says.
This week, HP plans to release a second version of its Scalable File Share, a server and software package launched in December that uses Lustre to distribute storage serving in a cluster, much as IT shops have been doing with computing servers for the better part of a decade. Scalable File Share lets Linux machines in a cluster read data at up to 35 Gbytes per second and allows for up to 512 terabytes of total storage, double its previous capacity. "One of the keys is you now build the storage system using cluster technology," says Kent Koeninger, HP's high-performance computing products marketing manager.

Problems with scaling up traditional file systems have to do with the way computers manage data on disk. Instead of being cohesive wholes, computer files consist of blocks of data scattered across disks. File systems keep track of the blocks, assigning free ones to files as they need more space. When multiple computers vie for access to data, most file systems will lock a block in use by one computer, even if others are requesting it. When that machine is done, the block again becomes available to other nodes in the cluster. But as organizations add more machines to a cluster--sometimes hundreds or thousands--managing those data blocks takes up more of the system's CPU and networking bandwidth.
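A toy calculation makes the bottleneck plain. In the Python sketch below, both the per-node lock-request rate and the single manager's capacity are invented figures for illustration; the point is only that lock traffic grows with node count while one manager's capacity stays fixed.

    LOCK_REQUESTS_PER_NODE_PER_SEC = 500   # assumed workload per node
    MANAGER_CAPACITY_PER_SEC = 50_000      # assumed limit of a single lock/metadata server

    for nodes in (16, 64, 256, 1024):
        demand = nodes * LOCK_REQUESTS_PER_NODE_PER_SEC
        print(f"{nodes:5d} nodes: {demand:7,d} lock requests/sec "
              f"({demand / MANAGER_CAPACITY_PER_SEC:.0%} of one manager's capacity)")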
"At the end of the day, it translates into less application performance," says David Freund, an analyst at IT research firm Illuminata. "You've got a scaling problem." Lustre solves this problem by letting hundreds or thousands of servers share a file system by spreading management of blocks of data over multiple computers. Even though dozens of machines may be handling I/O chores, they look like one file server to the rest of the cluster. That translates into much higher I/O speeds than are possible using business-computing standards such as storage area networks or network-attached storage.
"Lustre solves a technology hurdle happening in the high-performance computing market that will happen in normal markets: Disk drives aren't getting faster at the rate that CPU and memory bandwidth are going up," Studham says. As users deploy their applications across many CPUs in clusters, reading data from disk, or writing it there, chokes performance. The problem has become so bad, he says, that his discussions with storage vendors focus on data speed, not size. "For the past 10 years, we've been negotiating dollar per gigabyte from our storage vendor," he says. "This year and next, it will be more about cost per bandwidth. This is the first time I've bought storage and said, 'I don't care how much you give me; I care about dollar per gigabyte per second.' We've just met that inflection point."
Clusters are becoming more important in science and business. According to a closely watched list of the world's 500 fastest supercomputers released in November by the University of Tennessee and the University of Mannheim in Germany, 296 of those systems are clusters. Storage also is getting more attention inside businesses, as federal regulations meant to prevent fraud compel companies to save more data. Sun Microsystems earlier this month said it would acquire Storage Technology Corp. for $4.1 billion in cash in a move meant to help it capitalize on that trend. If Lustre and competing technologies, now emerging at universities, national labs, and a small number of ambitious businesses, take hold more broadly, they could change the storage-buying equation.
"Lustre has gotten a remarkable amount of traction," says Chuck Seitz, CEO and chief technology officer at Myricom Inc., a maker of specialized networking equipment for clusters. The technology's speed and low cost have helped it carve a niche at sites such as Lawrence Livermore National Laboratory, Pacific Northwest National Laboratory, and the National Center for Supercomputing Applications.The NCSA runs Lustre on its 1,240-node, 9.8-teraflop cluster called Tungsten, which runs programs for atmospheric science, astronomy, and other applications. "You don't want an $8 million computer sitting there on an I/O wait," says Michelle Butler, a technical program manager for storage at the supercomputing center. Keeping wait times short also means scientists working on grants from the National Science Foundation get charged for less computing time. "With apps five or 10 years ago, no one ever did I/O because of the wait times," she says. "Now, data is everything." The NCSA's archive server grows by 40 to 60 terabytes per month because of data from National Science Foundation jobs being run on its computers. As recently as the late '90s, computer scientists learned how to use memory in their programs to avoid reading and writing data to disk. "Today that's no longer taught," Butler says. "The practices of computer scientists have changed so much."
Traditionally, there have been a couple of ways to make storage systems scale up. Highly standardized network-attached storage systems use popular protocols for file sharing on a LAN, such as Microsoft's CIFS or the Network File System that's a standard on Unix and Linux systems. Those let users attach many computers to a server and share a virtual disk that lives on the network. NAS uses inexpensive Ethernet connections between computers but transmits data at a relatively slow 1 Gbps for most applications. That causes logjams, since the speed of talking to local disk drives is faster than communicating with those on the network.
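The logjam is easy to quantify. In the Python sketch below, the assumption that the NAS server exposes two Gigabit Ethernet links is invented for illustration; the conversion from gigabits to megabytes per second is standard arithmetic.

    def mbytes_per_sec(gbps: float) -> float:
        """Gigabits per second -> Mbytes per second (decimal units, 8 bits per byte)."""
        return gbps * 1000 / 8

    NAS_SERVER_LINKS_GBPS = 2 * 1.0   # assume the NAS server has two Gigabit Ethernet ports

    for clients in (8, 64, 256):
        share = mbytes_per_sec(NAS_SERVER_LINKS_GBPS) / clients
        print(f"{clients:4d} clients sharing the NAS server -> about {share:.1f} Mbytes/sec each")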
Storage area networks deliver higher speeds than NAS, moving data at 2 to 4 Gbps, but they need expensive Fibre Channel switches and require adapter boards for each computer that can cost as much as $1,000 apiece. The iSCSI protocol, which lets disks and computers in a SAN talk directly over Ethernet, is gaining popularity in shared storage networks as well.
The fast-disk communication speeds of cluster file systems could appeal to companies in industries that run highly I/O-dependent software, including banking, oil exploration, microchip manufacturing, automaking, aerospace, and Hollywood computer animation.
The small companies selling cluster file systems are starting to land some marquee customers. Cluster File Systems Inc., owner of the intellectual property associated with the Lustre software, counts Chevron Corp. among its customers. Startup Panasas Inc. sells a cluster file system called ActiveScale on its own hardware used by customers including Walt Disney Co. Larger vendors also are exploring the technology: IBM is considering adding object-based storage to its GPFS file system, according to Illuminata. Dell, meanwhile, has a deal with Ibrix to sell Ibrix's Fusion file system.

Demand for servers running Ibrix technology is causing Dell to re-examine how it bundles networking with computing. Dell historically has figured on about a gigabyte per second of bandwidth for every teraflop of computing power it sells a customer, says Victor Mashayekhi, a senior manager in Dell's scalable systems group. "Over time, we'll see that ratio increase," he says. "You'll see computing nodes become hungrier. Driving that is the amount of data that needs to be consumed."
Large data requirements compelled the Texas Advanced Computing Center at the University of Texas in Austin to bring in Ibrix's Fusion. The center, which provides computing to the university's researchers, uses the file system to speed up a computational fluid-dynamics application, used by about 1,700 scientists, that simulates turbulence for aerodynamics. Each process in the program needs to write a file that's between 300 and 400 Mbytes in size. Using NFS, writing all the data required--about 20 Gbytes--took about 50 minutes, says Tommy Minyard, the center's high-performance computing group manager. That write needed to occur every hour, so for each hour the application ran, it computed for just 10 minutes. With Ibrix, the center's I/O write time for that program has shrunk to five minutes per hour. "It's benefiting all users of the system," Minyard says.
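Run through the figures quoted above (treating gigabytes and megabytes as decimal units, which is an assumption), the before-and-after throughput works out roughly as follows:

    data_gbytes = 20.0   # total data the application writes each hour, per the article

    for system, minutes in (("NFS", 50.0), ("Ibrix Fusion", 5.0)):
        rate = data_gbytes * 1000 / (minutes * 60)   # Mbytes per second
        print(f"{system}: {data_gbytes:.0f} Gbytes in {minutes:.0f} minutes "
              f"= roughly {rate:.0f} Mbytes/sec")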
That's the good news. On the other hand, standards for cluster file systems are just emerging, and the software is owned by small companies without a proven track record. The systems also are limited in the types of computers on which they can run. Lustre is available only for Linux and Catamount, an obscure operating system from supercomputer maker Cray Inc. It's also hard to install and not well documented. That's causing some business-technology managers to wait and see.
"Lustre looks interesting, but we haven't deployed anything," says Andy Hendrickson, head of production technology at DreamWorks Animation SKG. "The big question is, does it do what it says and is it reliable enough to base production on? We have hard deadlines." And, perhaps summarizing the concerns for many companies considering cluster file systems, Hendrickson asks, "How much would we have to change our process?" Whether the benefits of this emerging technology outweigh the risks for businesses remains to be seen.