In most of my posts I use shorthand and equate Incidents with a “Service Outage”, but truthfully, an Incident is defined more broadly than just a disruption in Services that end users notice.
So what conditions should be logged as Incidents?
There are four conditions that should be the basis for entering Incident records in your IT Service Management system:
- A Service outage
- A Service degradation
- An Event that increases the risk of a Service outage
- An Event that increases the risk of a Service degradation
Service outage – Obviously this is the common understanding of what defines an Incident. When end-users’ Services are disrupted, people who use ITIL terminology call this an Incident.
Service degradation – When a user’s Services are in a degraded state (slow performance, critical functions not working, etc.) an Incident should be logged. What level of degradation triggers an Incident is something I’ve written about in a previous post.
An Event occurs that increases the risk of a Service outage – Let’s say you have a server with 3 drives configured for RAID 5 and one of those drives fails. The risk of a Service outage has significantly increased. If another drive in that array fails, you will have significant data loss. Hopefully you have sufficient monitoring to alert you to the event rather than relying on someone noticing the red light on the array as they walk by, but regardless of how it is detected, an Incident should be logged.
An Event occurs that increases the risk of a Service degradation – Let’s take a scenario where you have an FDDI ring with an ISDN fallback between two sites. Again, hopefully you have monitoring to tell you when your primary FDDI ring has failed. The secondary ring should be able to handle the users’ volume, but what if that secondary ring goes down too and you have to fail over to the ISDN connection? The users’ Services will be seriously degraded. Even though the risk of a complete Service outage is very low with this triple redundancy, the risk of Service degradation rose dramatically the moment the primary ring failed, so an Incident should be logged.
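To make the four conditions concrete, here is a minimal sketch in Python of how a condition might be mapped to an Incident category. The names `IncidentType` and `classify`, the boolean inputs, and the rule that the most severe matching condition wins are all my own assumptions for illustration; no ITSM tool is being described here.

```python
from enum import Enum
from typing import Optional


class IncidentType(Enum):
    """The four conditions that warrant an Incident record (hypothetical names)."""
    SERVICE_OUTAGE = "Service outage"
    SERVICE_DEGRADATION = "Service degradation"
    OUTAGE_RISK = "Event that increases the risk of a Service outage"
    DEGRADATION_RISK = "Event that increases the risk of a Service degradation"


def classify(service_down: bool,
             service_degraded: bool,
             raises_outage_risk: bool,
             raises_degradation_risk: bool) -> Optional[IncidentType]:
    """Return the matching Incident type, or None if no Incident is warranted.

    Assumption: when several conditions hold, the most severe one is reported.
    """
    if service_down:
        return IncidentType.SERVICE_OUTAGE
    if service_degraded:
        return IncidentType.SERVICE_DEGRADATION
    if raises_outage_risk:
        return IncidentType.OUTAGE_RISK
    if raises_degradation_risk:
        return IncidentType.DEGRADATION_RISK
    return None


# One failed drive in a RAID 5 array: users see nothing, but outage risk is up.
raid_drive_failure = classify(False, False, True, False)

# Primary FDDI ring down, secondary ring carrying the load: degradation risk is up.
primary_ring_failure = classify(False, False, False, True)
```

In both examples the users notice nothing, yet `classify` still returns an Incident type rather than `None`, which is the whole point of the broader definition.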
Many IT technicians don’t understand that all of these conditions should be captured and recorded as Incidents. That gap significantly affects downstream processes like Problem, Config, Change, Availability, etc.
How can you plan for high Availability if you don’t capture events that fall short of a Service outage? How can you identify Problems if you don’t record Incidents that don’t directly affect the users’ perception of the Service?
Many tools try to automate the recording of Incidents when non-user-affecting events occur, but most of them generate so many spurious events that the volume of invalid Incidents created makes the feature not worth using. Only with strong correlation rules would I trust automated Incident creation.
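As a rough illustration of what a correlation rule can look like, the sketch below opens an Incident only when the same event repeats from the same source several times within a time window, and it never opens a second Incident for a source that already has one. The `Correlator` class, the threshold of 3, the 300-second window, and the event names are hypothetical values chosen for the example; real rules would come from your own monitoring data and Service Level documentation.

```python
from collections import defaultdict, deque


class Correlator:
    """Hypothetical correlation rule: N matching events inside a window open one Incident."""

    def __init__(self, threshold: int = 3, window_seconds: int = 300):
        self.threshold = threshold          # assumed value, tune per event type
        self.window = window_seconds        # assumed value, tune per event type
        self._seen = defaultdict(deque)     # (source, event_type) -> recent timestamps
        self._open = set()                  # keys that already have an open Incident

    def observe(self, source: str, event_type: str, timestamp: float) -> bool:
        """Record an event and return True if it should open a new Incident."""
        key = (source, event_type)
        times = self._seen[key]
        times.append(timestamp)
        # Forget events that have aged out of the correlation window.
        while times and timestamp - times[0] > self.window:
            times.popleft()
        if key in self._open:
            return False                    # suppress duplicates of an open Incident
        if len(times) >= self.threshold:
            self._open.add(key)
            return True
        return False


correlator = Correlator(threshold=3, window_seconds=300)
# Three warnings from the same link within five minutes, then a fourth.
decisions = [correlator.observe("wan-link-01", "latency_warning", t)
             for t in (0, 100, 200, 250)]
```

Here only the third warning opens an Incident; the first two are treated as noise and the fourth is suppressed as a duplicate. A hard failure such as a dead drive would more sensibly use a threshold of 1.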
The best thing to do is to train all the IT technicians to understand that Incidents are not just for Service outages and to have good Service Level documentation to inform IT when Service degradation should trigger Incident creation.