Supercomputing: The Next Generation

A look at future supercomputers and how they will be built.

Jim O'Reilly

June 3, 2016


Supercomputing is at a crossroads. The push for petaflop operations and beyond has stressed existing architectural approaches, from the compute engines themselves out through networking to storage. The pressure is increased by the need to compute closer to the data sources, which are often in remote locations such as, in the case of the Square Kilometre Array (SKA), South Africa's Karoo desert and the Australian outback.

SKA is an extreme example of the challenges in supercomputing. Planned for commissioning in less than a decade, the array will be the ultimate data gatherer. Raw-data generation is expected to run to a few exabytes per day, all of which requires heavy mathematical computation. The processed output is expected to be around 1,500 petabytes per year, which not only needs to be stored but also transmitted to scientists on other continents.
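To put those figures in perspective, here is a quick back-of-envelope conversion into sustained transfer rates. This is only a sketch based on the numbers quoted above, taking one exabyte per day as a lower bound for the raw stream and using decimal units throughout:

```python
# Back-of-envelope sustained data rates from the figures quoted above.
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY

raw_eb_per_day = 1             # "a few exabytes per day"; 1 EB taken as a lower bound
archived_pb_per_year = 1_500   # processed output to be stored and distributed

raw_tb_per_s = raw_eb_per_day * 1_000_000 / SECONDS_PER_DAY             # EB -> TB
archived_gb_per_s = archived_pb_per_year * 1_000_000 / SECONDS_PER_YEAR # PB -> GB

print(f"Raw capture:     ~{raw_tb_per_s:.1f} TB/s sustained, per exabyte/day")
print(f"Archived output: ~{archived_gb_per_s:.1f} GB/s sustained")
```

Even the archived output alone implies shipping tens of gigabytes per second across continents, around the clock.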

Compare SKA with CERN's Large Hadron Collider, which outputs just 15 petabytes per year and has most of its scientists nearby in Europe. CERN uses relatively well-understood solutions. It even stores that data in a Ceph object storage cluster, making it one of the largest storage farms outside of certain government agencies.

The largest supercomputing facility in the Southern Hemisphere, Australia's Pawsey Centre in Perth, runs two supercomputer clusters that are crunching data from a prototype SKA project as well as from the Murchison Widefield Array. About half of the computing is done at the Murchison site itself, where "correlators" shrink down the raw data. This setup creates around three petabytes of computed data annually.

SKA's challenges are attracting a lot of multinational innovation. For example, IBM has partnered with ASTRON, a research center in the Netherlands, to devise an approach for a candidate SKA supercomputer, a project that is in itself worth $1.5 billion but promises valuable technology gains. IBM is already working on a 120 petabyte storage array, using 200,000 1 TB hard drives and built on IBM's General Parallel File System (GPFS) for highly parallel access. Extrapolate this to the 30 TB SSDs expected by 2024 and we would have as much as three exabytes in the array, which is in the range needed for SKA's output, especially if high-density tape is used to cold-store older data.

Even with such a huge store, keeping the raw data and transmitting it to the data center is problematic. The answer is clearly to pre-process the data in "mast-head" computers near the antennae, splitting the compute load and reducing the data size enormously. This creates a logistics challenge: we would need a cloud-sized IT operation out in a desert. Power is the issue, both generating it and then cooling the systems; deserts can get hot!
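A rough sketch shows how aggressive that pre-processing has to be, again using only the figures quoted earlier (at least one exabyte per day of raw data in, 1,500 petabytes per year of output):

```python
# End-to-end data-reduction factor implied by the figures quoted earlier.
# Assumes a lower bound of 1 EB/day of raw data ("a few exabytes per day").
raw_eb_per_year = 1 * 365              # at least 365 EB of raw data per year
stored_eb_per_year = 1_500 / 1_000     # 1,500 PB/year of archived output = 1.5 EB

reduction_factor = raw_eb_per_year / stored_eb_per_year
print(f"Mast-head plus data-center processing must shrink the data by ~{reduction_factor:.0f}x")
```

In other words, well over 99 percent of the raw signal has to be reduced before anything reaches the archive.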

Nvidia NVLink

What's needed are the most efficient and powerful machines available. Fortunately, the computation required is amenable to parallel processing, which fits the current trend of using GPUs in high-end supercomputers. Nvidia's next-generation "Pascal" GPU can cross-link to other GPUs at very high speed. NVLink, as this technology is called, is much faster than InfiniBand, with four links at 40 gigabytes per second each on the P100 chip.
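For a sense of scale, here is a rough peak-bandwidth comparison. The per-link NVLink figure is the one quoted above; the PCIe 3.0 x16 and EDR InfiniBand numbers are commonly quoted theoretical peaks added for reference rather than figures from the article, and all values are bidirectional:

```python
# Rough peak-bandwidth comparison, bidirectional, in decimal GB/s.
# NVLink per-link figure is from the article; PCIe and InfiniBand peaks are
# commonly quoted theoretical values, included here only for scale.
nvlink_links_per_p100 = 4
nvlink_gb_s_per_link = 40                      # GB/s per link

nvlink_total = nvlink_links_per_p100 * nvlink_gb_s_per_link   # 160 GB/s per P100
pcie3_x16 = 32                                 # ~16 GB/s each direction
edr_infiniband = 25                            # 100 Gb/s each direction ~= 12.5 GB/s

print(f"NVLink (4 links): {nvlink_total} GB/s")
print(f"PCIe 3.0 x16:     {pcie3_x16} GB/s ({nvlink_total / pcie3_x16:.0f}x less)")
print(f"EDR InfiniBand:   {edr_infiniband} GB/s ({nvlink_total / edr_infiniband:.1f}x less)")
```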

The Nvidia architecture takes advantage of High-Bandwidth Memory, stacking memory and GPU dies onto a common substrate to boost the chip's performance. The company projects that Pascal's successor, the "Volta" chip due in late 2017, will deliver around a terabyte per second of memory bandwidth per GV100 module. Volta will also support cache coherency across NVLink, and Nvidia expects it to have a much faster NVLink 2 as well.

Nvidia has partnered with IBM on interfacing its POWER CPUs to GPUs, and the result is that POWER will support NVLink connections, speeding CPU-oriented operations. These operations are mainly housekeeping, but a substantial part of the CPU task is communicating with the number-crunching arrays, where the data from many antennae are merged together. SKA is fortunate in one sense: the data travels in streams in one direction, so link latency is not an issue.

Since fiber length isn't a major problem for streaming, the number-crunching data center can be sited wherever there is water and sufficient power to feed masses of hungry GPU nodes. That will mean a lot of fiber: IBM estimates 80,000 km of it to connect the antennae to the data center. Even so, one architectural bottleneck remains. Connections to the fiber are still made through PCIe-connected NICs, which are substantially slower than NVLink. This is a problem both for the antenna-to-data-center connections and, once inside the data center, for the links between nodes in the compute cluster. The latter is the more painful problem, since the cluster performs array calculations across many machines, and those machines need fast, low-latency fabrics to do their job efficiently.

It's possible that we'll see PCIe 3.0 or 4.0 fabrics carrying the interconnect, though the fact that NVLink uses PCIe technology for its physical layer may give rise to an inter-server link as well.

The huge boost in compute performance pays off both in electrical power consumption and in capital cost. A dozen 8-node Pascal systems drawing around 40 kW could replace 40 racks of CPU-based servers doing molecular dynamics calculations, for instance, which clearly helps SKA's logistics problems, too.
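A very rough sketch of that power trade-off: the GPU-side numbers come from the example above, while the rack density and per-server power draw are assumptions included only to show the order of magnitude, not measured figures:

```python
# Hypothetical power comparison for the example above. The GPU-side number
# (a dozen 8-node Pascal systems at ~40 kW) and the 40-rack count come from
# the text; the servers-per-rack and watts-per-server values are assumptions.
gpu_cluster_kw = 40

cpu_racks = 40
servers_per_rack = 20        # assumption
watts_per_server = 400       # assumption: typical two-socket node under load

cpu_cluster_kw = cpu_racks * servers_per_rack * watts_per_server / 1_000
print(f"CPU racks:      ~{cpu_cluster_kw:.0f} kW")
print(f"Pascal systems: ~{gpu_cluster_kw} kW (~{cpu_cluster_kw / gpu_cluster_kw:.0f}x less power)")
```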

Other technologies are being considered for SKA as well. Some parts of the network could use dedicated hardware based on FPGAs as an alternative to GPUs, and that debate still rages. It's also possible, perhaps probable, that the design and fabrication of the SKA computing will be spread over a number of the contributing companies. We should know the answer in the next few months.

Other supercomputing projects and components

On the broader front, the Piz Daint supercomputer, Europe's fastest at 7.8 petaflops, is being upgraded with 4,500 Pascal GPUs. The system analyzes CERN data as well as handling meteorology and other workloads. The upgrade will pair Intel CPUs with the GPUs, in line with the existing system.

Berkeley Lab is installing a Cray supercomputer, at around 1.28 petaflops, based on Intel Haswell CPUs. The innovation here is a layer of NVRAM cache between memory and disk, coupled with Cray's Aries high-speed interconnect. Cray currently has five of the top 10 supercomputers, including the two newest entries, the Department of Energy's Trinity at Los Alamos and Hazel Hen at HLRS in Stuttgart.

Image: Shyh Wang Hall, home of the Berkeley Lab supercomputer.

Intel is pushing hard for a place in the supercomputer ranks, with its Knights Landing chip promising a big horsepower gain. With 72 cores and Intel's own version of high-bandwidth memory (MCDRAM), it is a powerful chip. However, Intel doesn't have a network fabric in the class of NVLink, and that is likely to be a major issue for many applications.

On the storage front, Ceph-based object stores have a good hold at CERN, though the supercomputing space remains largely loyal to Lustre and GPFS. CERN's success, coupled with the enhancements Red Hat and SanDisk have made for SSD storage, is making Ceph more attractive for hyperscale storage.

WDLabs recently prototyped a storage cluster built from Ethernet hard drives, porting the Ceph storage node (OSD) function onto the drives themselves and, in effect, massively parallelizing OSD access. This could become the densest storage available as SSDs reach 30 TB in a 2.5-inch form factor, and it has the advantage of low power. It's also only a matter of time before NVDRAM technology migrates to High-Bandwidth Memory modules, bringing Cray-style caches closer in and making them much faster.
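To illustrate why per-drive OSDs help, here is a hypothetical comparison of aggregate network bandwidth for a 60-drive pool. The drive count and link speeds below are illustrative assumptions, not figures from the WDLabs prototype:

```python
# Hypothetical bandwidth comparison for a 60-drive Ceph pool. A conventional
# storage server funnels every OSD's traffic through one shared NIC, while
# Ethernet drives give each OSD its own network port. All values are assumptions.
drives = 60

server_nic_gbit = 10      # assumption: one 10 GbE NIC on a conventional OSD server
drive_nic_gbit = 1        # assumption: one 1 GbE port per Ethernet drive

conventional_gbit = server_nic_gbit
per_drive_osd_gbit = drives * drive_nic_gbit

print(f"Conventional OSD server: {conventional_gbit} Gbit/s into the fabric")
print(f"Per-drive OSDs:          {per_drive_osd_gbit} Gbit/s aggregate "
      f"({per_drive_osd_gbit / conventional_gbit:.0f}x)")
```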

Taken together, these developments illustrate a challenging period for supercomputing, but solutions are on the drawing board. The technologies developed for supercomputers will spill over into commercial computing over time, shrinking the cost, power, and size of cloud computing even further.

About the Author

Jim O'Reilly

President

Jim O'Reilly was Vice President of Engineering at Germane Systems, where he created ruggedized servers and storage for the US submarine fleet. He has also held senior management positions at SGI/Rackable and Verari; was CEO at startups Scalant and CDS; headed operations at PC Brand and Metalithic; and led major divisions of Memorex-Telex and NCR, where his team developed the first SCSI ASIC, now in the Smithsonian. Jim is currently a consultant focused on storage and cloud computing.
