Laying a Foundation for Distributed Computing's Next Generation

See how a project leveraging recent advances in a form of machine intelligence could mark a breakthrough in how systems maintain themselves in the face of errors and attacks.

November 1, 2004


Distributed computing won't evolve to the next level without new architectures for reliability and security, according to David A. Patterson, who has taught computer science at the University of California-Berkeley since 1977. So far, government researchers have not fully stepped up to this new mandate. Berkeley and Stanford have teamed up, however, on one important project leveraging recent advances in statistical learning theory — a form of machine intelligence that could mark a breakthrough in how systems maintain themselves in the face of errors and attacks.

Distributed computing is core to a number of advances in electronics as we head into the consumer era, including peer-to-peer networks, grid computing, distributed sensor networks and ad hoc mesh networks of many varieties. Patterson has long been a proponent of distributed computing. He led the design of RISC I, the first reduced-instruction-set computer, which became the basis for Sun Microsystems' Sparc CPU. He was also involved in Berkeley's 1997 Network of Workstations Project, which helped spark the move to large systems built from clusters of commodity machines. EE Times' Rick Merritt sat down with Patterson recently to talk about his work and get his take on where distributed computing is headed.

EE Times: What's the future for distributed computing?

Dave Patterson: What people complain about in computing today is not that computers aren't cheaper or faster; they talk about all the hazards of security and viruses and spam.

I worry we may be jeopardizing our whole infrastructure. Suppose consumers decide the Internet is filled with people trying to steal their money and the best thing is to just avoid it. In a couple of years, that could happen. So it would be wise for us to make security and privacy first-class citizens in computing.

EET: Is the government stepping up to its role here?

Patterson: Various committees are looking at it, but from what I can see as a researcher they are pretty small efforts so far, especially in nonclassified programs. They represent very modest amounts of money.

EET: What's the problem?

Patterson: Darpa [the Defense Advanced Research Projects Agency], which has funded a lot of great work, has decided to make a lot of its network security work classified. That's a decision that leaves universities and most startup companies out of the picture.

For example, Darpa's current Broad Agency Announcements look like fine research agendas in this area. But if you look at the end, it often says the work is top secret. I don't know how that's going to help the rest of the country if it's classified. I think you will find other people concerned about this direction.

EET: What should be done?

Patterson: I certainly hope the U.S. government will begin funding security and privacy as seriously in the future as it used to fund projects for making computers faster and cheaper. Our society and economy hugely benefited from those investments, in major industries and jobs. Society certainly needs similar improvements in security and privacy.

If we don't come up with some great ideas, [security and privacy could become] a downside of distributed computing. We have seen episodes where psychological anxiety about technology has slowed its growth — for instance the European reaction against genetic engineering of food. People are even talking about whether the fear of self-replicating devices could affect the development of nanotechnology. And some people are concerned about privacy issues with distributed sensor networks.

EET: You helped set a direction in distributed computing with your work here on the Network of Workstations Project, linking low-cost systems together into one big system. What did you learn from that?

Patterson: We came into it building a low-cost supercomputer based on off-the-shelf components. The project set a database sort record, but to get that record we had to run it at 3 a.m. What was interesting was why we couldn't get that performance in the middle of the day.

The technology was used by Inktomi and lots of other dot-com startups. After that, cluster computing became pretty standard, at least in the first tier of computing, where you have what are called stateless nodes.

In the end, we wound up building something that was pretty impressive, but only for short periods of time. We'd get these incredible cost/performance demonstrations, but if everything wasn't perfect it wasn't such a wonderful system to use. We came out of that much more interested in dependability, reliability and ease of use as opposed to peak performance.

EET: So what are you working on now?

Patterson: We are focused on building smart tools to help operators run computers, especially systems that recover quickly. I'm trying to get companies to think of reboot time as something they can compare and compete on.

We've been building, with people at Stanford, prototypes for an undo/redo mechanism to let the operator of a large system go back in time, repair something and replay what happened during that time. That's relatively easy to do in a word processor but hard to do in a large server and storage system.

As part of this project on recovery-oriented computing, Stanford researchers have done work on tools to pinpoint why a system has failed. They have also done work on micro-reboots to restart components of a system rather than a whole system.
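The micro-reboot idea can be sketched as a supervisor that restarts only the component that failed while the rest of the system keeps running. This is a minimal illustration of the concept, not the Stanford implementation; the class and component names are hypothetical.

```python
class Component:
    """A hypothetical service component that can be restarted in isolation."""
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def micro_reboot(self):
        # Restart just this component; its peers are untouched.
        self.restarts += 1
        self.healthy = True


class Supervisor:
    """Watches components and micro-reboots only the failed ones,
    instead of rebooting the whole system."""
    def __init__(self, components):
        self.components = components

    def heal(self):
        rebooted = []
        for c in self.components:
            if not c.healthy:
                c.micro_reboot()
                rebooted.append(c.name)
        return rebooted


# One component fails; only it is rebooted.
parts = [Component("auth"), Component("cart"), Component("search")]
parts[1].healthy = False
sup = Supervisor(parts)
print(sup.heal())  # ['cart']
```

The payoff is recovery time: restarting one component is far cheaper than the whole-system reboot time Patterson wants vendors to compete on.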

These ideas are being put into a version of the Java J2EE framework to evaluate them. We are getting some uptake from companies that are trying to put some of this into their products.

EET: What's next?

Patterson: The next phase of our project will look at ways to make distributed systems more reliable and adaptive. One of the things we are trying to do is get more data on why systems crash. We think it would be good if people knew why their systems were crashing. We are trying to collect our own set of data we can publicize. Then we will use statistical learning theory to try to analyze the large amount of systems-monitoring data.

This is the state of the art for people providing online services using distributed computers that millions of people depend on: They have a big network operations center with lots of human beings watching monitors to see what's going on. If something goes wrong, they react.

We are interested to see if recent advances in statistical learning theory will allow us to make interesting insights into systems behavior more rapidly and accurately than human operators do now.
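A toy stand-in for this kind of statistical monitoring is flagging metric samples that deviate sharply from the norm, so a machine scans the data stream instead of a human watching a monitor. This is a deliberately simple sketch, not the project's method; the function name, threshold, and latency numbers are illustrative.

```python
import statistics

def flag_anomalies(samples, threshold=2.5):
    """Return indices of samples more than `threshold` standard
    deviations from the mean of the window."""
    mean = statistics.mean(samples)
    stdev = statistics.pstdev(samples)
    if stdev == 0:
        return []  # no variation, nothing stands out
    return [i for i, x in enumerate(samples)
            if abs(x - mean) / stdev > threshold]

# Response times hovering around 100 ms, with one spike at index 5.
latencies = [101, 99, 100, 102, 98, 400, 100, 101, 99, 100]
print(flag_anomalies(latencies))  # [5]
```

Real statistical learning methods go far beyond a z-score, but the operational shape is the same: the algorithm watches phenomenal amounts of monitoring data and surfaces only the statistically significant situations.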

We have a faculty member here I refer to as the Michael Jordan of statistical learning theory — his name is Michael Jordan. In his view, statistical learning theory has made great strides in the last decade. [Statistical learning theorists] have proved theorems that allow [developers] to do things they have never done before. Thanks to these proofs, they can do in seconds amazingly complicated things that would have taken years to compute before.

They have been using these proofs to control chemical processes and other fields. We want to see if we can use these theorems so computers can help run computer systems.

EET: How does this relate to the growing complexity of systems? For instance, I heard a recent presentation from an executive who said the company analyzes a terabyte of Web transaction data every day to do what he called computational marketing.

Patterson: Statistical learning theory is making dramatic strides that are potentially much greater than the complexity growth of systems. What's interesting about statistical learning theory is that it is at its best when you have phenomenal amounts of data. It has its advantage when there is too much data for a human being to analyze.

EET: How does statistical learning theory relate to artificial intelligence and neural networks?

Patterson: Statistical learning theory is sort of the new AI. It is based on statistics rather than logic.

Back in the 1980s, AI made all these promises based on expert systems and logic that just didn't come true. The idea was that you would write rules that would make a synthetic expert in a narrow area, to look at certain situations and figure out what to do with a given set of inputs.

The statistics model is that we can look at mountains of data and find the needle: a statistically significant situation.

By contrast, neural nets seem to be more fragile. A bad input could mess up how a neural net gets trained.

Statistical learning theory seems to be more resistant to bad inputs, so it's hard to screw up. That ties in with the need for security. If you are throwing bad data in, it will be harder to ruin the system. So it has a nice ruggedness.

EET: What's the downside of this approach?

Patterson: We know machine learning has false positives. They may occur, say, 20 percent of the time. So what we will try to do is define actions that won't hurt the system if they do something that wasn't needed. We have to build compensating actions that are fast, predictable and not incredibly damaging.

It turns out, that's not such a terrible design constraint. In fact, that would probably be the basis of a really good system.
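One way to frame this design constraint is as an expected-cost comparison: act automatically only when the cost of acting, including false alarms, is below the cost of missing a real fault. The function and the cost numbers below are illustrative assumptions, not figures from the project.

```python
def worth_acting(false_positive_rate, cost_if_unneeded, cost_of_missed_fault):
    """Decide whether an automatic compensating action is worth taking.
    Acting on a false positive wastes `cost_if_unneeded`; waiting when the
    alarm was real incurs `cost_of_missed_fault`."""
    expected_cost_act = false_positive_rate * cost_if_unneeded
    expected_cost_wait = (1 - false_positive_rate) * cost_of_missed_fault
    return expected_cost_act < expected_cost_wait

# With a 20 percent false-positive rate, a cheap micro-reboot (cost 1)
# is worth taking to head off a costly outage (cost 100):
print(worth_acting(0.2, 1, 100))     # True
# A destructive action (cost 5000) is not:
print(worth_acting(0.2, 5000, 100))  # False
```

This is why fast, predictable, low-damage compensating actions make a 20 percent false-positive rate tolerable: they push the cost of acting unnecessarily toward zero.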

For instance, some people are working on ideas like mutating protocols. You can change the protocol being used by a system to avoid security attacks. If you were wrong, it would not be that bad; you just would go to the next protocol.

That's a different way to architect systems from what we see with something like IPv6, where you need to get hundreds of people involved in committees to design the next protocol.
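The mutating-protocol idea can be sketched as a channel that cycles through alternative protocol variants whenever an attack is suspected; a wrong guess just advances to the next variant. This is a minimal sketch of the concept, and the variant names are made up.

```python
import itertools

class MutatingChannel:
    """Rotates a connection through alternative protocol variants.
    Mutating on a false alarm costs only a renegotiation, so the
    compensating action is cheap and safe."""
    def __init__(self, variants):
        self._cycle = itertools.cycle(variants)
        self.current = next(self._cycle)

    def mutate(self):
        # Suspected attack (or a false positive): move to the next variant.
        self.current = next(self._cycle)
        return self.current


chan = MutatingChannel(["proto-a", "proto-b", "proto-c"])
print(chan.current)   # proto-a
chan.mutate()
print(chan.current)   # proto-b
```

Contrast this with a committee-designed successor protocol: here the set of variants is fixed in advance and switching between them is a local, reversible decision.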

EET: What's the outlook for this project using statistical learning?

Patterson: This is the early, enthusiastic phase, and things look good so far. It has all the right features of an important new research direction. But it could take us three to five years to find out how significant this might become.
