App Monitoring Minus False Alarms

Kaiser Permanente is using Heroix's RoboMon software to monitor its applications, ensuring real problems get resolved.

February 3, 2003

Network Computing

The high-stakes applications, including Kaiser's laboratory, medical records and pharmacy tools, are the reason the nonprofit health-care organization puts so much stock in managing its applications--and doing it efficiently. If one application in the organization's laboratory system chain fails, for instance, it affects the medical staff and patients, too.

"We know within a minute if an application is having a problem," says Chip Gauthier, manager of SASS for the Oakland, Calf.-based HMO. "Instead of waiting for users to call in, we're more proactive."

Like most businesses, Kaiser operates with an especially tight IT budget these days. There is little wiggle room to hire more labor. So the organization relies heavily on monitoring tools to help track the company's growing population of servers and applications.

Kaiser's SASS group recently cut some of its overhead by adding a new management console, Heroix Corp.'s RoboCentral, to centralize alarm monitoring. The system eliminated the need for a full-time staffer to gather and consolidate alarms, Gauthier says.

Code Red

One of Kaiser's key applications, Gauthier says, is its family of OpenVMS-based lab applications. Kaiser's lab instruments are connected to the applications, so test results are sent to the system automatically. When a nurse orders a blood test using Kaiser's medical-record-management application, that order feeds into the lab-management system. If the lab application suffers a disk crash, the nurse can't get the test results for a patient, and that patient's treatment could be delayed.

So Kaiser runs application-monitoring tools to detect problems before or right as they occur. "In a network as large and complex as ours, no one person can understand what's going on," says Ralph Wagenet, a senior technician for Kaiser who handles day-to-day application-monitoring duties for the SASS group.

Kaiser's technicians write interfaces for these applications and load the Heroix RoboMon software-based monitors onto the application server. When an application's response time exceeds a predefined threshold, the monitor generates an alarm. The alarm is automatically fed into RoboCentral, and from there, the alarms and information are sent to the organization's Tivoli Enterprise Console management system, which forwards alarms to Kaiser's helpdesk application. There a trouble ticket is opened and the appropriate technician is paged.
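The alarm path described above can be sketched in a few lines of Python. This is only an illustration of the idea, not RoboMon's or Tivoli's actual interface: the `Alarm` class, the function names, the stage labels in `PIPELINE` and the threshold values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm record; the field names are illustrative only."""
    application: str
    response_seconds: float
    threshold_seconds: float


def check_response_time(application: str,
                        response_seconds: float,
                        threshold_seconds: float) -> Optional[Alarm]:
    """Return an Alarm when the measured response time exceeds the threshold."""
    if response_seconds > threshold_seconds:
        return Alarm(application, response_seconds, threshold_seconds)
    return None


# Stages named in the article, in the order an alarm passes through them.
PIPELINE = ("RoboCentral", "Tivoli Enterprise Console", "helpdesk ticket and page")


def route_alarm(alarm: Alarm, stages: Sequence[str] = PIPELINE) -> List[str]:
    """Describe each hop the alarm takes; a real system would call each tool."""
    return [
        f"{stage}: {alarm.application} took {alarm.response_seconds}s "
        f"(limit {alarm.threshold_seconds}s)"
        for stage in stages
    ]


if __name__ == "__main__":
    alarm = check_response_time("lab-system", 7.5, 5.0)
    if alarm is not None:
        for hop in route_alarm(alarm):
            print(hop)
```

Keeping the threshold check separate from the routing means a monitor can be tuned without touching the path an alarm takes to the helpdesk.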

The monitors track various conditions. One set of monitors Kaiser built checks whether transactions between its primary and backup databases are working. The monitor is set to restart the transaction or to page a technician if the problem escalates.

But there's a catch: To get the monitors to work with the applications, Kaiser has to write custom interfaces because many applications don't come with performance-monitoring APIs. And, Wagenet says, you need a way to detect specific activities, such as when an interface within an application fails. "It's hard to find a hook that gets you the data you really want, like the status of an interface inside an application," he says.

In addition to RoboMon, RoboCentral and Tivoli, SASS uses Fortel's SightLine, a management tool that tracks application performance. RoboMon and SightLine aren't integrated, but SightLine sends its performance analysis to Tivoli.

Wagenet says most problems with an application aren't performance-related. They are more about the application's availability, such as when a disk crashes or a controller fails.

And, as Kaiser has learned, redundancy built into the network and systems architecture can create problems of its own. Kaiser's large number of backup servers, databases and network switches makes tracking applications more complicated. Software bugs or a disk failure in a primary or secondary server can go undetected because the network automatically shifts traffic to the backup device. "Sometimes you don't know something's broken until another piece you're backing up breaks," Wagenet says. "It's important to have monitors detect these failures or else a second failure can result" and wipe out the application.
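One way to picture the kind of monitor Wagenet describes is a check that looks at every member of a redundant pair, not just at whether the service as a whole still answers. The sketch below is a minimal illustration under assumed names: the `degraded_pairs` function and the shape of the `status` dictionary are hypothetical and do not come from Kaiser's monitors.

```python
from typing import Dict, List


def degraded_pairs(status: Dict[str, Dict[str, bool]]) -> List[str]:
    """Report services that have lost some or all of their redundant members.

    `status` maps a service name to its members (for example "primary" and
    "backup"), each flagged True when healthy.
    """
    alarms = []
    for service, members in sorted(status.items()):
        failed = sorted(name for name, healthy in members.items() if not healthy)
        if not failed:
            continue
        if len(failed) < len(members):
            # The service is still up, so users see nothing; this is the
            # silent failure that a ping of the service alone would miss.
            alarms.append(
                f"{service}: redundancy lost, failed member(s): {', '.join(failed)}"
            )
        else:
            alarms.append(f"{service}: OUTAGE, all members failed")
    return alarms


if __name__ == "__main__":
    snapshot = {
        "lab-db": {"primary": True, "backup": False},
        "pharmacy": {"primary": True, "backup": True},
    }
    for line in degraded_pairs(snapshot):
        print(line)
```

The point of the first branch is that the alarm fires while the application is still available, which leaves time to repair the failed member before a second failure takes the service down.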

False Alarm

Yet even with Kaiser's careful planning and selection of what it watches in its applications, it still isn't immune to false alarms. Wagenet and other SASS technicians often chase down what prove to be extraneous alarms, like those caused when a threshold for a dial-up connection isn't set correctly. One way to stem this particular type of alarm, Wagenet says, is to build in automatic retries for a dial-up connection rather than sounding an alarm every time the connection fails or drops.
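The retry idea is simple enough to sketch. In the hypothetical Python below, `connect` stands in for whatever actually dials the connection, and the attempt count and delay are assumed values chosen only for illustration; an alarm would be raised only when the function returns False.

```python
import time
from typing import Callable


def connect_with_retries(connect: Callable[[], bool],
                         max_attempts: int = 3,
                         delay_seconds: float = 0.0) -> bool:
    """Try the connection up to max_attempts times.

    Returns True as soon as one attempt succeeds. Returns False only after
    every attempt has failed, which is the point to raise an alarm.
    """
    for attempt in range(max_attempts):
        if connect():
            return True
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # pause before the next attempt
    return False


if __name__ == "__main__":
    # Simulated line that drops twice and then connects.
    outcomes = iter([False, False, True])
    if connect_with_retries(lambda: next(outcomes)):
        print("connected, no alarm")
    else:
        print("ALARM: dial-up connection failed after retries")
```

A single dropped call no longer pages anyone, but a line that stays down still produces an alarm after the last retry.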

The frequency of alarms is inconsistent, too. Some RoboMon monitors trigger alarms every five minutes, but others do not fire frequently enough, Wagenet says. Sometimes the team gets multiple alarms for the same problem. The good news is that you can keep customizing the monitors to adjust alarm overkill or frequency, he says. But it's not possible to eliminate false alarms entirely without the risk of missing relevant ones, Wagenet says.
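Repeated alarms for the same problem are commonly handled by suppressing duplicates inside a time window. The sketch below shows that general technique; it is not a description of how RoboMon is configured, and the `AlarmSuppressor` class, its key strings and the five-minute window are assumptions for the example.

```python
from typing import Dict


class AlarmSuppressor:
    """Forward the first alarm for a key, then mute repeats for a time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._last_sent: Dict[str, float] = {}

    def should_send(self, key: str, now: float) -> bool:
        """Return True if an alarm for `key` should be forwarded at time `now`."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # same problem reported again inside the window
        self._last_sent[key] = now
        return True


if __name__ == "__main__":
    suppressor = AlarmSuppressor(window_seconds=300)
    for seconds in (0, 120, 240, 300):
        sent = suppressor.should_send("lab-db: disk error", now=seconds)
        print(seconds, "sent" if sent else "suppressed")
```

The window length is the trade-off Wagenet points to: a longer window cuts more duplicate pages, but it also delays the next reminder if the problem is still unresolved.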

Next for Kaiser is a big database upgrade for its laboratory system in the organization's data center in Corona, Calif. Kaiser is replacing its old MUMPS-based database with InterSystems' Caché for storing different types of patient data. Bottom line: The SASS group will have to retire some of its old application monitors and create new ones that support the new database.

It still will take some sleuthing to pinpoint problems the monitors identify in the new or newly tweaked applications. Even though Kaiser can customize its monitors to watch specific functions, managing applications is not a science. "There's a lot of trial and error," Wagenet says. "We look at what breaks and try to figure out why it's breaking and how to reduce the probability of it breaking again."

Tell us about your network and we may profile it in a future issue. Send e-mail to [email protected] or call (516) 562-5914.

It came down to sleep deprivation. Kaiser Permanente's Sun Alpha Support Services (SASS) technicians were getting paged at all hours of the day and night for everything from major outages to a printer on the fritz, so the group decided to do something about it.

Fast forward nine years: Not only does the SASS group get more shut-eye, but it's still using the latest version of the same application-monitoring solution the health-care organization purchased back then to keep tabs on its sensitive medical, laboratory and other applications. Kaiser spends about $46,000 per year in upgrades to the Heroix RoboMon software--now in version 7.0C--and its new management console, RoboCentral.

What sold Kaiser's senior management on the tools then, and now, is SASS's estimate that the application monitors save the organization's labs and other clients about $12,000 per hour in application-outage costs, says Chip Gauthier, manager of SASS for the Oakland, Calif.-based company. "Once I showed my clients and lab technicians they were going to get improved application availability, the implementation got easy approval" within the organization, he says.
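The two figures quoted here, about $46,000 per year in upgrades and about $12,000 per hour in outage costs, allow a rough break-even check. The calculation below is an illustration derived from those two numbers only; it is not SASS's own cost analysis and it ignores staff time and other costs.

```python
ANNUAL_UPGRADE_COST = 46_000    # dollars per year, as quoted in the article
OUTAGE_COST_PER_HOUR = 12_000   # dollars per hour, SASS's estimate in the article


def break_even_hours(annual_cost: float, hourly_outage_cost: float) -> float:
    """Hours of outage the tools must prevent each year to cover their cost."""
    return annual_cost / hourly_outage_cost


if __name__ == "__main__":
    hours = break_even_hours(ANNUAL_UPGRADE_COST, OUTAGE_COST_PER_HOUR)
    print(f"Break-even: about {hours:.2f} hours of avoided outage per year")
```

On these numbers, preventing fewer than four hours of outage a year would cover the annual upgrade spend.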

Sticking with the same application-monitoring tools has also been a necessity given the rules and interfaces SASS has developed for the tools over the years and because it's just plain expensive to make a change. "It does the job, so there's no reason to go out and replace it," Gauthier says.

It also doesn't make sense to invest in new tools for the aging Digital Alpha OpenVMS and Tru64 Unix platforms. Kaiser won't install any new applications on these servers, which run the organization's key laboratory, medical and pharmacy applications, Gauthier says. That doesn't mean Kaiser will tear out the Alpha platforms running its key applications any time soon. "It takes a lot of money to migrate to a new radiology or lab system," Gauthier says. "As long as the software vendors support the platform, it's not going away."

Meanwhile, Gauthier says health-care organizations need to do more than just ping their network devices and servers. It's crucial to give users service-level agreements like Kaiser's that guarantee uptime, he says. "My job is to have the systems and applications available 24x7 at any cost," he says.

Chip Gauthier: Manager, Sun Alpha Support Services (SASS), Kaiser Permanente, Oakland, Calif.

Chip Gauthier, 47, has spent 20 of his 24 years in IT with Kaiser Permanente. He is responsible for the healthcare organization's HP Alpha and Sun server platforms, which run Kaiser's massive laboratory information system and other mission-critical applications. Gauthier's team uses monitoring tools to keep tabs on hundreds of OpenVMS and Tru64 Unix systems and applications.

Education: B.S. in Hospital/Health Care Business Management, California State University, Dominguez Hills

If I Knew Then What I Know Now: I would have gotten a lot more sleep. We knew coding and/or scripting our own monitoring, however crude, would save the company capital dollars. But we couldn't keep up with the demand of proactive problem resolution and high availability requirements that came out of our implementation. We found it was nice to have an escalation procedure for anything that died--from a printer to whatever.

Next Time I Will: Install and roll out the application monitoring tools a lot faster across all my platforms. Our proactive application and system monitoring provides high availability to the clinical applications and the lab, radiology and other technicians who use them.

What Sealed the Deal: The deciding factor for the application monitoring tool we picked was a detailed cost analysis that included the savings we would get from reduced outages to our clinical application systems.

Biggest Mistake Made in Technology Circles Today: Not thinking outside the box and asking "What if?"

Just for Fun: My name is Chip and I'm a golf-a-holic. I also like to water ski off the back of my three-man, personal watercraft, Sea-Doo.

Wheels: Pontiac Grand Am. My wife thought it looked cool.

Biggest Bet Ever Made: That I wouldn't still be working for Kaiser Permanente after 20 years. Kaiser has always remained ahead of the IT curve, and the ever-changing technical environment and challenges here have kept me on board.
