During a cross-functional meeting today, an interesting question was raised: what, in fact, is an outage?
Sometimes, the answer is black & white. A service is completely unavailable because somebody shut a server down or some such event.
Other times, however, we have a grey area. Suppose, for example, that a server is experiencing a “memory leak” due to a particular application. The server is up and running, the application is up and running, but response is so slow that the system is unusable. Although the switch is on, nobody can use the system to do their job – or maybe 1% of the people can – or 5% – or 25%?
Or suppose that an application is up and running, but for some reason a popular module of the application is non-responsive. The application (service) isn’t down, but for all intents and purposes it’s disrupted.
We expect our technicians to tell us whether or not a service is experiencing an outage. When that call is left to their own judgment, we get inconsistency across the enterprise (this is a Fortune 500 company).
What black & white criteria can we present to our technicians to define what constitutes an outage?
I think you may be allowing the Technical side of the house to define an outage. According to ITIL, the Business is the group that defines an outage. This is done through the Service Level Management process and encompasses both interruptions of service and performance degradation.
Also, ITIL doesn’t use the term Outage except when the Availability Management process is calculating availability. ITIL uses the term Incident. An Incident is defined as any event outside the normal delivery of a service that causes (or may cause) an interruption or degradation in the service.
Per this definition, a server in a cluster that fails would meet the criteria of an Incident. The increased risk of an outage and the increased risk of performance degradation are grounds for it to be considered an Incident, even though there was no perceptible effect on the delivery of the service.
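To make that definition concrete, here is a minimal Python sketch of the test it implies. The `Event` fields and the `is_incident` function are hypothetical names of my own, not ITIL terminology or any tool's API; the only thing taken from the definition is that an event counts if it causes, or may cause, an interruption or degradation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical record of something outside normal service delivery."""
    description: str
    interrupts_service: bool  # users cannot use the service right now
    degrades_service: bool    # service works, but below its normal level
    raises_risk: bool         # no visible effect yet, but one may follow


def is_incident(event: Event) -> bool:
    """Apply the 'causes (or may cause)' test from the Incident definition."""
    return (
        event.interrupts_service
        or event.degrades_service
        or event.raises_risk
    )


# A failed cluster node: users notice nothing, but redundancy is reduced.
failed_node = Event(
    description="One server in the cluster failed",
    interrupts_service=False,
    degrades_service=False,
    raises_risk=True,
)

# Planned, normal activity with no actual or potential effect on the service.
routine_check = Event(
    description="Scheduled health check completed",
    interrupts_service=False,
    degrades_service=False,
    raises_risk=False,
)
```

With these assumed fields, `is_incident(failed_node)` is true even though nobody noticed a problem, while `is_incident(routine_check)` is false.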
If you are measuring Availability and need to know the times when services are unavailable, then it is up to Service Level Management to define what levels of service delivery are acceptable to the business.
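As a rough illustration of how business-defined levels turn the grey areas into a yes/no answer, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the `ServiceLevel` and `Sample` types, the two thresholds (response time and share of successful users), and the sample values are hypothetical, and the real criteria would come out of your own Service Level Management agreements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ServiceLevel:
    """Hypothetical thresholds agreed with the business for one service."""
    max_response_seconds: float  # slower than this counts as unusable
    min_success_ratio: float     # fraction of users who must be able to work


@dataclass(frozen=True)
class Sample:
    """One hypothetical measurement interval for the service."""
    response_seconds: float
    success_ratio: float


def is_acceptable(sample: Sample, level: ServiceLevel) -> bool:
    """An interval counts as 'up' only if it meets every agreed threshold."""
    return (
        sample.response_seconds <= level.max_response_seconds
        and sample.success_ratio >= level.min_success_ratio
    )


def availability(samples: Iterable[Sample], level: ServiceLevel) -> float:
    """Fraction of measured intervals in which service was acceptable."""
    measured = list(samples)
    if not measured:
        raise ValueError("at least one sample is required")
    acceptable = sum(1 for s in measured if is_acceptable(s, level))
    return acceptable / len(measured)


# Assumed agreement: responses within 2 seconds for at least 95% of users.
agreed_level = ServiceLevel(max_response_seconds=2.0, min_success_ratio=0.95)

measurements = [
    Sample(response_seconds=0.5, success_ratio=1.0),   # normal
    Sample(response_seconds=8.0, success_ratio=1.0),   # up, but too slow
    Sample(response_seconds=0.6, success_ratio=0.75),  # a module is failing
    Sample(response_seconds=0.4, success_ratio=0.99),  # normal
]

measured_availability = availability(measurements, agreed_level)
```

The point of the design is that the technician no longer decides what an outage is. The technician only records the measurements, and the thresholds agreed with the business decide whether each interval counts as available.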