Why IT Operations is Like an Action TV Series

Just as on TV, crisis arrives in the datacenter and in the heat of action you need the right tools.
Posted August 26, 2009
By Nir Livni


(Page 1 of 2)

I like watching the series "24," though I can't really explain why. Every time they nearly get the bad guys, something goes wrong: there's some sort of twist in the plot, and they need to start all over again.

For example, I’m sure that you are familiar with the following classic scene: The CTU (Counter Terrorism Unit) chopper is following a suspect who is driving a black van. The suspect’s van enters a tunnel but never comes out. Instead, a number of different vehicles leave the tunnel at the same time, and the suspect is probably in one of them. By the time the team figures out that the black van has been left empty in the tunnel, they have already lost the suspect.

They shout “We have lost visual!!!” and are back to looking for the bad guys all over again - then they call Jack Bauer…

IT Operations is just like the CTU. The CTU is responsible for making sure that life goes on without any unpleasant surprises. IT Operations needs to do the same in its own space: make sure that the business keeps running and that business transactions are executed properly and on time.

When something is about to go wrong, the CTU and IT Operations are expected to prevent it before it affects anyone. So they set up the war room, call everyone in, and start doing their detective work to find the needle in the haystack. If they don’t find it and something goes wrong, the consequences are serious: either people get hurt (in the CTU's case), or the business is impacted.

IT Operations' War Chest

So which tools could IT Operations use to find out that there is a problem, identify its root cause and resolve the issue?

For example, IT Operations could use HTTP network appliances that see every HTTP transaction and measure its response time. These network appliances are just like the CTU's choppers: they do not have adequate visibility into the datacenter. They can indicate that something is wrong with the response time of a transaction, but they cannot show why the response time is high and cannot provide the visibility needed for resolution.
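The chopper's blind spot can be pictured with a small Python sketch. It is only an illustration: the class name, the threshold, the tier names and every timing value below are invented for this example and do not come from any real appliance or product. The edge measurement is enough to say that a transaction is slow, while naming the slow tier requires per-tier timings that an appliance at the edge does not have.

```python
from dataclasses import dataclass


@dataclass
class EdgeObservation:
    """What an appliance at the network edge could record for one HTTP transaction."""
    url: str
    response_ms: float


def is_slow(obs: EdgeObservation, threshold_ms: float = 2000.0) -> bool:
    """The edge view can only answer one question: was the total too high?"""
    return obs.response_ms > threshold_ms


def slowest_tier(tier_ms: dict[str, float]) -> str:
    """With per-tier timings (not visible from the edge) the slow tier can be named."""
    return max(tier_ms, key=tier_ms.get)


# Hypothetical numbers, chosen only so that the tiers add up to the edge total.
obs = EdgeObservation(url="/transfer", response_ms=4200.0)
tier_ms = {"web": 150.0, "app": 350.0, "database": 3700.0}

print(is_slow(obs))           # the appliance can raise the alarm
print(slowest_tier(tier_ms))  # only the inside view can say where the time went
```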

IT Operations also uses Event Correlation and Analysis (ECA) tools. ECA tools are like CSI detectives (yes… that’s another one I watch…), and rely on other tools to collect information for them, just like the CSI detective who collects evidence from a crime scene.

ECA tools are only as effective as the products they rely on to provide them with data. The issue with ECA tools is that, just as at a crime scene, the thief does not usually leave his ID behind, so all you are left with are clues and no accurate data to work with.

Additional tools that IT Operations relies on include dashboards that monitor server resource consumption, J2EE/.NET tools that can perform drill-down diagnostics in the application and database layers, synthetic transaction tools, and Real User Measurement (RUM) tools.

Even with all of these monitoring tools, IT Operations still finds itself in a situation where all lights are green while users are complaining about bad response times. In spite of all the investment in monitoring tools, the infrastructure that IT Operations is accountable for is still unpredictable. Why?

A Simple Example

Perhaps it’s best to take a look at this classic example: a firm had a problem with a wire-transfer transaction. The liability for the problem kept going back and forth between Operations and Applications, who were pointing fingers at each other over who was responsible. “All lights are green,” said Operations. “We tested the application and it works just fine,” said Applications. Simply put, no existing monitoring tool could point out the problem.
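How can every light be green while a transaction is failing its users? A toy Python sketch makes the gap visible. All names, thresholds and readings here are hypothetical, and the sketch says nothing about the actual cause in this firm's case. It shows only that a dashboard built on per-server resource checks can report green while the transactions themselves miss their response-time target.

```python
CPU_LIMIT_PCT = 80.0      # assumed alert threshold for the resource dashboard
TARGET_MS = 2000.0        # assumed response-time target for a wire transfer

# Hypothetical CPU readings per server, all comfortably below the threshold.
server_cpu_pct = {"web01": 35.0, "app01": 55.0, "db01": 60.0}

# Hypothetical end-to-end response times for three wire transfers.
transfer_ms = [300.0, 4500.0, 5200.0]


def light(cpu_pct: float) -> str:
    """Dashboard rule: green unless the server's CPU crosses the limit."""
    return "green" if cpu_pct < CPU_LIMIT_PCT else "red"


lights = {name: light(cpu) for name, cpu in server_cpu_pct.items()}
all_green = all(color == "green" for color in lights.values())

# The transaction view asks a different question: did each transfer finish on time?
late_transfers = [ms for ms in transfer_ms if ms > TARGET_MS]

print(all_green)            # Operations' view of the world
print(len(late_transfers))  # what the users experience
```

Both statements are true at once, which is why each team can honestly claim that its side is fine.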





Tags: database, data, server, IT, datacenter

