Getting out of the trap is one of those chicken-and-egg situations. If you spent more time proactively monitoring system health and then planning and implementing strategies to keep it that way, you wouldn't need to spend as much time handling emergencies. But you have too many emergencies needing immediate attention to do all the monitoring and managing you should.
Most companies have more than enough data to do an adequate job of managing their IT resources.
To begin with, there is the wealth of data that is stored in logs. When there is a crash, an administrator must go hunting through logs looking for the sequence of errors which occurred earlier. Now, if the administrator had looked earlier, he would have known there was trouble looming and could have taken preventive measures, but who has time to continually monitor all the dozens, hundreds or thousands of logs a company has?
''Our admins would go in after a crash and look at the logs to see what happened right before a server locked up,'' says Steve Luciano, network administrator for New Pig Corp., which provides products for liquid management, industrial safety and plant maintenance to more than 170,000 customers in more than 40 countries. ''But no one was checking their boxes on a regular basis. It was difficult to do considering how many servers they were responsible for and everything else they had to do.''
Zurich Life Insurance in Schaumburg, Ill., faced a similar problem.
''It was clear that the IT organization was in a reactionary mode,'' says Tim Hagn, Zurich's vice president of IT Operations and Engineering, describing what he found when he arrived on the job. ''We were addressing problems after the customer base had been affected or had called with a problem, which is not a successful mode to be in.''
In both cases, the answer was to gather the information that already existed throughout the network and present it to the administrators in a single console. For Hagn, the solution was to install Hewlett-Packard Co.'s HP OpenView and set it up to assemble the information from lower-level monitoring software and present it in a single console.
''Tools like IBM's Tivoli or BMC's Patrol can't monitor Cisco's devices as well as Cisco's tools do,'' he explains. ''It works best to let the vendors' tools monitor their own devices and then dump the information into a central console.''
For Luciano, this meant buying and installing a log monitoring tool. He selected Logalot from Sanford, Me.-based Somix Technologies, Inc., which collects entries from Syslog, SnmpTrap and Windows Eventlogs and puts them into a combined database. He set it up to assemble the data from all his servers, switches and routers.
At that point he was able to establish policies on what to do with each of those entries. The vast majority are just informational and get archived. But others require immediate action, so the admins get alerted. This allows the staff to address potential problems before they cause a delivery outage. In doing so, they have been able to fix the underlying issues so they don't keep happening.
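As a rough illustration of that kind of policy, the sketch below splits a combined stream of log entries into ones that alert an admin and ones that are simply archived. It is not how Logalot works internally; the LogEntry fields, the route function, the host names and the alert threshold are all assumptions made up for the example. The only borrowed convention is the standard syslog severity scale, where 0 is most severe and 6 is informational.

```python
from dataclasses import dataclass

# Assumed cutoff: syslog severity 4 (warning) or worse triggers an alert.
# Syslog severities run from 0 (emergency) to 7 (debug).
ALERT_THRESHOLD = 4


@dataclass
class LogEntry:
    """Hypothetical normalized record for an entry from any source."""
    source: str    # e.g. "syslog", "snmptrap", "eventlog"
    host: str
    severity: int  # syslog-style: lower number means more severe
    message: str


def route(entries, threshold=ALERT_THRESHOLD):
    """Split entries into (alerts, archive) according to severity."""
    alerts, archive = [], []
    for entry in entries:
        if entry.severity <= threshold:
            alerts.append(entry)   # needs an admin's attention now
        else:
            archive.append(entry)  # informational, kept for later reference
    return alerts, archive


# Made-up sample entries from three kinds of sources.
entries = [
    LogEntry("syslog", "web01", 6, "session opened for user backup"),
    LogEntry("eventlog", "db02", 3, "disk controller reported an error"),
    LogEntry("snmptrap", "switch07", 4, "link state changing repeatedly"),
]

alerts, archive = route(entries)
for entry in alerts:
    print(f"ALERT {entry.host} [{entry.source}]: {entry.message}")
```

Keeping the threshold as a parameter reflects the point Luciano makes: once the underlying causes are fixed, the policy can be tightened or relaxed without touching how the entries are collected.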
''Whenever they see the alert, they take the corrective action so the number of alerts has decreased,'' says Luciano.
And this is the real answer to freeing up staff time: presenting the information to the IT staff in a comprehensible fashion before a problem becomes a crisis. That can never be done if they have too many places to scan to get a complete picture of what is going on.
''I don't want to look at 17 different consoles,'' says Hagn. ''I want it all tying into one central location. The criteria now for every additional tool or utility is how well it can tie into OpenView.''