The true test of an IT organization’s maturity is how it reacts when a significant issue occurs. By significant issue, I mean that the ability of a large portion of the organization (or even the entire organization) to function is severely impaired by an IT-related failure.
What happens isn’t a test of any single group or process – it is a test of all IT groups and three tightly integrated processes: Incident, Problem, and Change.
Let me give you an example of how the perfect storm can break a business. Once upon a time, back in the ’90s, I got up very early every morning to be the first person in our office so I could take calls at the Service Desk. On this particular morning, it was evident that something was wrong. Some people couldn’t get to fileshares or email. Some users couldn’t get to the internet. Some people were reporting that client/server applications weren’t working correctly. Between calls, I launched my browser and pulled up CNN. It wasn’t just our organization. Others were reporting similar issues.
I immediately called my wife and told her that something big was going on and to cut the internet lines to her office. At the time, she was in charge of her company’s IT functions, and when I explained the situation, she went to the network closet and unplugged the cable to the external service provider. She then put a sign on the entry doors to her office telling everyone not to open emails from any external source.
It was Melissa, a mass-mailing macro virus that created massive amounts of network traffic by sending itself out from everyone’s Outlook client.
Of course, we didn’t know that at the time.
All we knew was that large parts of the network were unreachable. So what did my organization do? The server team obviously thought there was a faulty card on a server sending out massive amounts of packets. The network team assumed that there was a defect in the network routing. The desktop team thought that a software update was being pushed out and that all the traffic was killing the routers. So everyone made countless, uncontrolled changes in their own little world for about six hours.
This was pure Incident Management at its worst. “Get it up and running as quickly as possible” is only the first part of Incident Management’s goal. Unfortunately, the second part was completely ignored. The full goal of Incident Management is “Get it up and running as quickly as possible, while doing as little harm as possible.”
To say that recovery was challenging is an understatement. Our IT groups did more damage to our infrastructure in six hours than the worm could have ever done. Once the news got out that many organizations were being affected by this worm, our Desktop team spent about eight hours canvassing the organization and physically touching every PC to remove the virus.
Our network and servers weren’t completely back to normal for three days.
What Change Management was done? None. What Problem Management was done? None.
Let’s just say that it was an excellent learning experience.
From that point forward, whenever there was a severe issue, the Service Desk Manager (me) started a conference line and then notified all the other IT managers (by email, phone or fax) to dial into the conference line immediately.
There was a formal structure to the conference call:
- Quorum called
- Problem statement defined
- Roll call made of each manager identifying how the problem was affecting their organization
- Roll call made of each manager identifying what team members were available to work on the problem
- Decision as to who owns the problem
- Decision as to who owns the communication
The problem owner would be the person responsible for:
- Organizing the response to the issue
- Engaging IT resources as needed
- Managing the issue to resolution
The person who owned the communication (usually me) would stay on the conference line and take any updates verbally from whoever called in. The communication owner would be responsible for:
- Updating the tracking ticket
- Ensuring that all parts of IT were made aware of any updates
- Ensuring that communications to the users were sent out in a timely fashion
- Informing the company’s upper management of the issue and its effect on the organization
- Being a buffer between the problem owner and everyone who wasn’t directly involved in the resolution of the issue
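The bridge-call structure and the two roles above can be sketched as a simple checklist-driven object. This is purely my own illustration of the process, not code from any ITSM tool; the agenda items and role names come from the lists above, and everything else is assumption:

```python
from dataclasses import dataclass, field

# The formal agenda of the conference call, as described above.
BRIDGE_AGENDA = [
    "quorum called",
    "problem statement defined",
    "impact roll call",
    "resource roll call",
    "problem owner assigned",
    "communication owner assigned",
]

@dataclass
class MajorIncidentBridge:
    """Hypothetical model of the major-incident conference bridge."""
    problem_statement: str = ""
    problem_owner: str = ""        # organizes the response, engages resources
    communication_owner: str = ""  # updates the ticket, shields the problem owner
    completed_steps: list = field(default_factory=list)
    updates: list = field(default_factory=list)

    def complete(self, step: str) -> None:
        # Only recognized agenda items may be checked off.
        if step not in BRIDGE_AGENDA:
            raise ValueError(f"unknown agenda step: {step}")
        self.completed_steps.append(step)

    def ready_to_work(self) -> bool:
        # The bridge hands off to the problem owner only once every
        # agenda item has been covered and both roles are filled.
        return (set(BRIDGE_AGENDA) <= set(self.completed_steps)
                and bool(self.problem_owner)
                and bool(self.communication_owner))

    def post_update(self, text: str) -> None:
        # All status flows through the communication owner, keeping
        # the problem owner buffered from status requests.
        self.updates.append((self.communication_owner, text))
```

One design point this makes explicit: nobody starts "working the problem" until the whole agenda is complete and ownership is unambiguous, and every status update is attributed to the communication owner rather than the problem owner.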
The problem owner has to be shielded from all the users, managers, customers, etc. who want status updates or otherwise make the problem owner’s life miserable (it’s amazing how many people want to pile on pain during a painful situation). It is a delicate balancing act for the communication owner (who is being pounded from all sides for the latest status) to judge how frequently to go to the problem owner for an update, knowing that every request for a status update delays the resolution.
As the issue is worked towards resolution, it is the problem owner’s responsibility to ensure that proper controls are in place for critical infrastructure.
- Documentation updates noted
- Change permissions obtained
- Cross-functional activities balanced
One of the hardest things for the problem owner to do is to negotiate between the teams implementing quick fixes (get the users up and running as quickly as possible) and the teams responsible for determining root cause (who typically need the users to be left in a broken state in order to do their analysis).
Depending on the nature of the event, and how well the action plan reacted to it, there were formal and informal reviews. If things went well, it was usually an informal review. If things went badly, it was more formal.
How well IT can work across functional boundaries, maintain its processes, and keep sufficient controls in place during a serious outage is often the quickest way to uncover departmental and process deficiencies.
The real lesson here is to never let a disaster go to waste.
BTW: My wife’s boss gave us a very expensive bottle of wine when they came through the day unscathed.