In HAL's Footsteps

Server, software, and systems vendors are making real progress in developing IT systems that do a better job of monitoring, analyzing, and fixing problems without human intervention.

October 10, 2005

The film "2001: A Space Odyssey" gained cinematic notoriety by introducing HAL, a self-aware, independent-thinking, murderous computer that became a sci-fi icon. In the movie's namesake year, IBM engineers launched an effort to develop technology that helps computers monitor, diagnose, and heal their own problems.

IBM isn't trying to create a real-life HAL, but it does want to make computers smart enough to heal themselves. The promise of autonomic computing--systems that function automatically, without external intervention, much like reflexive bodily functions such as breathing--remains formative. Developing these capabilities often requires multiple vendors to work together toward a long-term vision, building networkwide capabilities piece by piece.

"We realize that autonomic computing isn't about building any one specific product," says Alan Ganek, chief technical officer and VP of the autonomic computing software group in IBM's Tivoli software unit. "It's about making all products exhibit these behaviors to the extent they can, and then integrating them to work more cohesively with others."

Several server, software, and services vendors have made intelligent systems--and the equipment and software to go with them--the underpinning of their enterprise-management development programs. So far, most have stopped short of creating full-fledged autonomic platforms. Companies such as Hewlett-Packard, Sun Microsystems, EDS, and, most recently, Cisco Systems have created platform-level programs intended to simplify management across the enterprise.

Many companies are avoiding the term autonomic computing, which IBM has promoted, but they aim to improve system management with emerging technologies such as virtualization. "Autonomic computing is an interesting long-term vision, but it's so far out that it's really hard to argue with," says Shane Robison, executive VP and chief strategy and technology officer at HP. "We're more interested in the interim steps leading up to a vision where the focus is on service-oriented architectures and grid computing."

The autonomic-computing effort grew out of advancements that produced ever-faster chips and lower-priced networks, giving businesses increasingly powerful IT infrastructures. But the complexity created by combining new technology with legacy hardware and software systems meant escalating costs and management challenges. "We were heading for a crisis, and a business' ability to absorb new technology would be lessened if we didn't address the issue head-on," Ganek says.

The challenge of managing an expanding and complex infrastructure was eating into many IT managers' budgets. As recently as five years ago, they split their funds between new technology and the management and maintenance of existing systems; today, some spend as much as 90% on merely keeping systems running.

Alleviating that complexity required tying together hardware, software, and networks, Ganek says. Finding approaches to creating autonomic features within and across those elements involved small steps and the integration of efforts from multiple parties. "We're not talking about magic here but about taking a very pragmatic and evolutionary approach," he says.

For IBM, that included establishing its autonomic-computing initiative at the corporate level, allowing it to tap into resources across the company, including hardware, software, and services. Working within the Organization for the Advancement of Structured Information Standards, known as Oasis, IBM and about 60 other technology vendors have been creating standard components that can be used in software and hardware to describe functions such as event management. These components can be processed and analyzed automatically, allowing businesses to isolate and complete problem-determination cycles in about half the normal time, Ganek says.

Much of the underlying architecture is based on the Information Technology Infrastructure Library, a set of common practices used across areas such as change management, configuration management, and release management.

The deployment of autonomic-computing capabilities over the past year has let Carey Capaldi cut by 40% the time he spends manually digging through system-failure logs to understand why a problem happened. It also has let the product manager for the content-management system at Technicolor Creative Services create an automatic way to redeploy jobs that otherwise would be stalled for hours.

Technicolor Creative Services provides content-management capabilities to other business units within Technicolor, a major manufacturer and distributor of video tapes and DVDs, and for resale externally. Technicolor Creative Services--a subsidiary of Thomson, which provides technology and services to the entertainment and media industries--offers services such as managing media files like reels of film, encoding pay-per-view movies, and creating DVDs.

When Capaldi assigns jobs, a variety of events can trigger a failure and, historically, that has resulted in suspension of the job. In the majority of cases, once the failure is detected, the job can be restarted manually from the suspended queue and finished without further incident. However, many of the jobs had to run overnight, and if there was a disruption then, they could remain suspended until the problem was discovered the following day.

IBM contacted Capaldi and asked him to be a guinea pig in its autonomic-computing effort, using IBM's Autonomic Management Engine framework and Common Base Event format. The engine monitors system resources, correlates information from various infrastructure components concurrently, and automatically determines the root causes of failures.
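The core idea behind a common event format is that every component reports problems in one shared shape, so tools can process them automatically instead of a person reading each vendor's log dialect. A minimal illustrative sketch in Python follows; the field names and the adapter are assumptions for illustration, not the actual Common Base Event schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CommonEvent:
    """Normalized event record, loosely modeled on the idea behind
    IBM's Common Base Event. Field names here are illustrative,
    not the real CBE schema."""
    source: str          # component that reported the event
    severity: int        # e.g., 10 = info .. 60 = fatal
    situation: str       # standardized category of what happened
    message: str         # original human-readable text
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

def from_web_server_log(line: str) -> CommonEvent:
    """Adapter: translate one raw log line into the common shape.
    Each legacy log format would get its own small adapter like this."""
    level = "error" if "[error]" in line else "info"
    return CommonEvent(
        source="webserver",
        severity=50 if level == "error" else 10,
        situation="ConnectSituation",
        message=line.strip(),
    )

event = from_web_server_log("[error] client denied: /admin")
```

Once every source emits `CommonEvent` records, a single analysis engine can sort, filter, and correlate them without knowing anything about the original log formats.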

Using a log and trace analyzer tool, Capaldi can instantly gain access to custom logs that provide a detailed look at why failures happened. Taking such a look across his jobs to see specific points of failure saves time, he says. The real autonomic feature, however, is that the system can now resubmit a stalled job under specific criteria without Capaldi or his staff intervening.

Technicolor Creative Services traditionally has written a lot of in-house software to aid in its effort to archive and manage the large amounts of digital content it handles, Capaldi says. In the future, he plans to create specific log files in new software that can be optimized to work with IBM's autonomic tools.
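The resubmission behavior Capaldi describes amounts to a policy loop: scan the suspended-job queue and automatically requeue jobs whose failures look transient, rather than leaving them stalled until morning. A sketch of that logic, with all names (`Job`, `TRANSIENT_ERRORS`, the retry limit) invented for illustration rather than taken from IBM's tools:

```python
# Which failure reasons are safe to retry automatically, and how often.
TRANSIENT_ERRORS = {"network_timeout", "resource_busy"}
MAX_RETRIES = 3

class Job:
    def __init__(self, job_id, failure_reason, retries=0):
        self.job_id = job_id
        self.failure_reason = failure_reason
        self.retries = retries

def should_resubmit(job):
    """Policy: only retry transient failures, and only a few times."""
    return job.failure_reason in TRANSIENT_ERRORS and job.retries < MAX_RETRIES

def sweep(suspended_queue, run_queue):
    """One pass over the suspended queue; requeue eligible jobs,
    leave the rest for a human to inspect."""
    still_suspended = []
    for job in suspended_queue:
        if should_resubmit(job):
            job.retries += 1
            run_queue.append(job)
        else:
            still_suspended.append(job)
    return still_suspended

suspended = [Job("encode-42", "network_timeout"),
             Job("archive-7", "bad_input")]
run_queue = []
suspended = sweep(suspended, run_queue)
```

A scheduler would run `sweep` on a timer overnight; the retry cap and the whitelist of failure reasons are what make the automation safe enough to run unattended.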

"Right now, a lot of this has to be tailored to exactly how you work as a company, and it would be nice if it was more off-the-shelf," he says.

Capaldi is ready to move further down the autonomic path. "In a heartbeat," he says. "I think there's a ton of potential that hasn't been tapped yet. Over the years, I've worked with a lot of bleeding-edge technology that eventually just didn't go anywhere, but this is an industry where you need to push the envelope."

Victor Kellan, president and chief executive of LAN Solutions Inc., agrees. The company, which provides network-management services, saw growing opportunities to provide remote monitoring to customers as a managed service. Through trial and error, it built a network operations center. But as LAN Solutions grew, it had difficulty quickly scaling the center to handle the growing amount of data going through the system.

When a problem happened, depending on its type, location, and complexity, it could take experts from several different areas to parse through thousands of log entries from databases, applications, Web servers, operating systems, or other network devices to find the problem's starting point and then determine a course of action. Typically, problem resolution was a time-consuming task accomplished by several people, each familiar with a specific type of log file.

Another challenge was proactive security. Waiting for a security vulnerability or an exposure to be discovered proved costly. Once the damage was done, recovery could be a long and complex process.

LAN Solutions went to work with Singlestep Technologies Corp. and IBM's autonomic-computing group to implement a system with robust event-correlation and network-event-response automation, Kellan says. The companies created a bundled product using Singlestep's Unity software and IBM's autonomic-computing toolkit.

The project team created the necessary correlations within IBM's Autonomic Management Engine framework to give LAN Solutions' staff methods of detecting early symptoms of problems and help them get to the root cause of resulting network issues, he says. The platform uses Unity's ability to send and receive IBM's Common Base Events to the Autonomic Management Engine, which then monitors system resources, correlates information from several components concurrently, and determines the root causes of failures.
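The correlation step described above can be reduced to a simple idea: cluster events from different components into short time windows, then treat the earliest event in a cluster as the probable root cause, since downstream symptoms usually follow the originating failure. Real correlation engines also use topology and rules; this Python sketch, with an invented 30-second window, shows only the time-clustering core.

```python
from datetime import datetime, timedelta

# Events closer together than this are treated as one incident
# (illustrative value, not a setting from any real product).
WINDOW = timedelta(seconds=30)

def correlate(events):
    """events: list of (timestamp, component, message), any order.
    Returns clusters of related events, oldest first; each cluster's
    first event is the likely root cause."""
    events = sorted(events)          # order by timestamp
    clusters = []
    for ev in events:
        if clusters and ev[0] - clusters[-1][-1][0] <= WINDOW:
            clusters[-1].append(ev)  # within the window: same incident
        else:
            clusters.append([ev])    # gap too large: new incident
    return clusters

t0 = datetime(2005, 10, 10, 2, 0, 0)
evts = [
    (t0 + timedelta(seconds=12), "webserver", "503 errors spiking"),
    (t0, "database", "connection pool exhausted"),
    (t0 + timedelta(minutes=10), "backup", "nightly run started"),
]
incidents = correlate(evts)
root_cause = incidents[0][0]   # earliest event of the first incident
```

Here the database exhaustion and the Web-server errors fall into one incident, with the database event flagged as the root cause; the unrelated backup job forms its own cluster, which is exactly the triage that previously took several log-file specialists.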

The Museum of Modern Art is testing autonomic computing to solve problems and complete work faster. Photo by Museum of Modern Art

The new system has been deployed to about a quarter of LAN Solutions' customer base, Kellan says, letting them save between 20% and 40% of their monitoring costs. Although he's pleased with the platform, additional advances are needed. "I look at this as the first generation," he says. It's getting closer to being self-healing and self-aware, he says. "We've got the brain, but now we need the arms and legs to make this truly a self-realizing network."

Singlestep CTO Ophir Ronen says the company already is working on the next step, which it plans to introduce this month. Singlestep's autonomic platform, in conjunction with IBM, will be able to match a series of symptoms with specific resolutions that can be implemented with an automated self-healing policy.

"This is not just pie in the sky. These autonomic capabilities exist now and are helping customers get a handle on the cost and complexity associated with delivering IT services," Ronen says.

This summer, New York's Museum of Modern Art began testing an autonomic platform that combines network-discovery technology from nLayers Ltd. with IBM's autonomic engine. "Like everyone, our big challenge is to do more with less," MoMA CIO Steve Peltzman says. "Anything that can make my four folks act like a staff of 10 or 12 is great."

By combining nLayers' InSight discovery platform with IBM's Autonomic Management Engine, MoMA reduced the time involved in its problem-resolution process by 10% to 20%, Peltzman says. When a failure occurs, the autonomic platform is able to assemble the appropriate logs needed for resolution and provide specific remedies automatically.

It's difficult at this stage to attribute a cost savings to the platform, Peltzman says, but "if you have a product that does the first 15 minutes of your job automatically, you'll save money. It may make an outage last 15 minutes instead of two hours, and that certainly relates to money."

The platform "is a first shot at trying to do this," he says. "It's not a mature product, [but] there are lots of encouraging signs, and we want to stay ahead of the curve rather than behind it."

Most business-technology managers are eager for smarter IT systems that can predict and resolve problems without human intervention. They hope those systems will free up staff time and budget dollars for new and innovative technology, reducing the need for IT departments to spend most of their time keeping systems running. But they probably won't want technology that's as smart or as independent as HAL, which, after all, didn't heal itself.
