Yes, You Can Automate IT. No, You Can't Avoid It.
September 3, 2010
It's been coming for a long time now. Automating your data center is emerging in a big way. You can't avoid it; you can only put it off for a while, but sooner or later, it's going to happen. I am not talking about piddly automation functions like distributing software and patches or moving a virtual machine from one hypervisor to another. I'm talking about event-based automation, where the actions you would have initiated manually are done automatically. Many of the objections I hear are that automation is too risky, too complex, too time-consuming, and can't handle errors well. Those are reasonable objections, but they are also easily overcome.
If you want to take advantage of server, storage and network virtualization, you are going to have to engage in event-based automation. If you are just dipping your toe into virtualization, then you are already reaping the benefits of advanced hypervisor features and add-ons like fault tolerance, backup and mobility that were unavailable to first adopters. If the news out of VMworld is any indication, data center automation and orchestration is going to be a hot area in the coming years, for good reason. The operational savings that can be achieved reach into the hundreds, if not thousands, of hours per year, not to mention better and faster service to your customers.
A typical example of automation is demand scaling of an application, such as a web application that has periods of high demand offset by periods of low demand. A static application is typically designed for 75 percent load, with performance falling off as use increases. The cost is slower responses during periods of high demand and wasted resources during periods of low demand. Neither situation is desirable.
To automate scaling, you have to find out what the bottleneck is, determine how to alleviate it and then take action. There could be many reasons why bottlenecks occur, including programming errors, but with automation you can typically address common issues like a spike in demand or a failed server faster than you could manually. The steps to increase capacity in a web application are typically to bring up a new server in that tier, provision it, make server connections and add it to the pool. Reducing capacity follows the steps in reverse after bleeding off connections.
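Those scale-up and scale-down steps can be sketched as a simple control loop. This is a hypothetical sketch, not any particular product's API--the provisioning functions and pool structure are stand-ins for whatever your tooling exposes:

```python
# Hypothetical sketch of event-based capacity scaling for a web tier.
# provision_server/decommission_server stand in for real provisioning tools.

def provision_server(pool):
    """Bring up a new server in the tier and add it to the pool."""
    server = "web-%02d" % (len(pool) + 1)  # stand-in for a real build-out
    pool.append(server)                    # add it to the load-balancer pool
    return server

def decommission_server(pool):
    """Bleed off connections, then remove a server (the steps in reverse)."""
    if len(pool) > 1:                      # never scale below one server
        return pool.pop()
    return None

def scale(pool, load_pct, high=75, low=25):
    """Add capacity past the design load; shed it when demand falls off."""
    if load_pct > high:
        return provision_server(pool)
    if load_pct < low:
        return decommission_server(pool)
    return None                            # within normal range: do nothing

pool = ["web-01", "web-02"]
scale(pool, load_pct=90)  # high demand: a third server joins the pool
scale(pool, load_pct=10)  # demand drops: the extra server is removed
```

In a real system the load figure would come from monitoring, and the thresholds would be tuned per application rather than hard-coded.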
If you manage capacity today, you already know the steps to add or remove capacity. You might even do it manually. Automating means taking each step of the process, no matter how minor, and plotting it out on a flow chart or an outline. You start with initial conditions that you can reliably expect, then you go step by step through the process. If you can't articulate the initial conditions, then you don't have a handle on your applications or your data center, and you are heading for bigger problems anyway. You address unknown initial conditions by standardizing them. Fact is, you don't have to spend months and months agonizing over these processes. You just have to document what you already do. Now is also a good time to make improvements to the process. Here is one example:

Many moons ago, I was involved in a project where I had to gather and process information from a number of remote offices and present it in a fixed format for another IT system to process. The company I was consulting with paid someone to do it manually. Some parts were scripted, but the scripts had a lot of hard-coded variables that made the system brittle. I asked them if they wanted an automated system, and after I described what I had in mind, they gave me the green light.
The first thing I noticed was the hard-coded variables. This company was a national retail chain, and their stores were mostly the same, but like any large IT shop, they had different versions of software at different locations, and sometimes the software directories varied. Lots of little things. This whole mess was laid out in the source code, and it was nasty. The person running the system spent more time managing the mess than actually doing any work. I focused on three things:
Replace as many fixed variables as I could with logic that would fill them in dynamically.
Assess and deal with common variations like product versions and locations.
Determine common failures and either handle them in code or generate a *useful* error message.
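A minimal sketch of the first and third points might look like this--the program name and directories are invented for illustration:

```python
import os

def find_program(name, search_dirs):
    """Fill in a variable dynamically instead of hard-coding it."""
    for directory in search_dirs:
        path = os.path.join(directory, name)
        if os.path.exists(path):
            return path
    # A *useful* error: say what was expected and where we looked.
    raise FileNotFoundError(
        "expected %s in one of %s; found nothing" % (name, search_dirs))

# Different locations ran different versions in different directories,
# so check each known variation instead of assuming one fixed layout.
try:
    path = find_program("report.exe", ["/opt/pos/v1", "/opt/pos/v2"])
except FileNotFoundError as err:
    print(err)
```

The point is that the script adapts to common variations on its own, and when it can't, the error tells you exactly what it was looking for and where.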
No one likes failure, but you have to balance the time it takes to build a fully automated, fault-tolerant system against the time it takes to get an adequately designed, fault-tolerant system running. Adequate is good enough to start with. You can always improve it later. While in the planning phase, identify critical points of failure and address them. Ideally, you handle failure gracefully and automatically. If a server could come up with a conflicting IP address, solve that as part of the automation. If a hypervisor could become resource-starved, figure out how to find one that isn't. Those are the likely problems. It is unlikely that a VM image will be corrupted, and there may not be an easy way to automatically recover if it is, although you could probably restore from backup automatically. In your first pass, focus on the likely problems; in subsequent improvements you can get fancier.
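The resource-starved hypervisor case, for instance, can be handled by filtering on free capacity before placing the VM. A toy illustration, with made-up host names and numbers:

```python
def pick_host(hosts, needed_mem_gb):
    """Return the host with the most free memory that can fit the VM,
    or None if every host is resource-starved."""
    candidates = [h for h in hosts if h["free_mem_gb"] >= needed_mem_gb]
    if not candidates:
        return None  # a likely failure, handled gracefully instead of crashing
    return max(candidates, key=lambda h: h["free_mem_gb"])

hosts = [
    {"name": "hv-01", "free_mem_gb": 2},   # resource-starved
    {"name": "hv-02", "free_mem_gb": 48},
    {"name": "hv-03", "free_mem_gb": 16},
]
best = pick_host(hosts, needed_mem_gb=8)  # hv-02 has the most headroom
```

The `None` return is the adequate-for-now part: the automation declines to place the VM and surfaces the condition, rather than crashing or overcommitting a starved host.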
For the failures that you can't address, make sure you create meaningful error messages for the user and IT so that they can easily figure out what happened and fix it. Obviously, you want to clean up anything that needs undoing. For example, if you provision a network port and then find you can't use it, you probably want to deprovision the network port. Better yet, don't provision anything until you have assembled all the components and verified they are ready to go. Further illustration:
After a few weeks, I had a pretty good working system. It was stable and reliable, and once it had been running for a while, I was down to tweaking little bits of code to make it better. When the system went into production, the company was thrilled. The admin running the system went from managing the mess to doing very little other than dealing with the few exceptions that I couldn't reasonably deal with in code. They actually got to work on more interesting stuff.
Then I got a call from the manager telling me that IT was going to bring up a new point-of-sale system based on Windows 3.1 (it was that long ago). The program wasn't going to change, but the locations on the POS machines were going to be radically different, and he had forgotten to let me know. The new POS was being rolled out, and could I make sure the reporting system still worked? He'd pay me lots of cash. I asked if he had tried it out yet. He hadn't, because his IT people were convinced that due to the directory changes, the reporting system would break.

I convinced him to take 20 minutes to set up a test system and give it a shot. It would save him lots of cash and time. I'd even stay on the phone while he did it. Twenty minutes later, I swear I heard his jaw hit the floor when the reports started flowing off this new system with no changes to the reporting system. All I had done was write some code to find the programs, find the version and populate the variables with the right information. But if something went wrong--if the files couldn't be found or the software version broke the scripts--I created log messages writing out what the script expected and what it found. Everything I needed to troubleshoot was in the event log.
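That expected-versus-found logging might have looked something like this--the file names, directories and logger setup are invented for illustration:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("reporting")

def resolve(name, expected_dirs, found_path):
    """Log what the script expected and what it actually found, so a
    failure can be diagnosed straight from the event log."""
    if found_path is None:
        log.error("expected %s under one of %s; found nothing",
                  name, expected_dirs)
    else:
        log.info("expected %s under one of %s; found it at %s",
                 name, expected_dirs, found_path)
    return found_path

path = resolve("pos-report.exe", ["C:\\POS", "C:\\NEWPOS"],
               "C:\\NEWPOS\\pos-report.exe")
```

Whether the lookup succeeds or fails, the log records both sides of the comparison, which is what makes troubleshooting from the event log possible.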
Setting up an automated system is going to take time initially. There are many commercial orchestration packages from the likes of Citrix, Gale Technologies, HP, IBM and Novell, as well as open source packages like Puppet and Chef, that will get you pretty far along the path to automation and at least provide the framework to implement your rules. But you have to spend the time planning out what needs to get done, and don't get hung up on unimportant details.