IT prides itself on ensuring that our services are highly available. We measure availability to ridiculously high levels of precision. Indeed, we have a magical number that represents availability nirvana. We call it “Five Nines”.
The term Five Nines indicates that our systems are available for 99.999% of the agreed time.
Notice that I said “agreed time”. If our users agree to a planned outage of 120 hours for system maintenance every week, meeting that Five Nines of availability becomes much easier to achieve. Of course, our users generally don’t give us the privilege of taking systems down for extended periods on a frequent basis, so we have to eke out a few hours here and there to do our periodic housekeeping.
To give you a sense of how reliable a system that meets Five Nines of availability is: for a service agreed to be up 24 hours a day, 7 days a week, the total unplanned outage time over the course of a year would need to be less than 5 minutes and 15 seconds. That level of availability is seriously expensive to achieve.
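The arithmetic behind that figure is straightforward. A minimal sketch (assuming the 24x7 agreed service time described above):

```python
# Allowed unplanned downtime per year at "Five Nines" (99.999%) availability,
# assuming the service is agreed to be up 24 hours a day, 7 days a week.
minutes_per_year = 365 * 24 * 60               # 525,600 minutes in a year
allowed_downtime = minutes_per_year * (1 - 0.99999)

print(f"{allowed_downtime:.2f} minutes per year")  # ~5.26 minutes
# 5.26 minutes is roughly 5 minutes and 15 seconds of downtime per year.
```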
Some organizations don’t allow for even a single moment of downtime because their systems are so heavily relied upon. They go to extreme measures to ensure the highest availability that money can buy. I know of one organization that claims it loses a million dollars in revenue for every minute their most critical system goes down. That company is American Express, and the system would be their credit card transaction processing service.
Fortunately, most of us don’t work for companies that have such stringent standards for high availability, but that doesn’t mean we don’t feel the heat when the services do fail.
ITIL teaches us that there are two aspects we need to measure when addressing the business’ requirements:
- Availability or Uptime – Typically expressed as a % of time (e.g., “We were up 99.5% of the agreed service time!”)
- Reliability or Frequency of Outage – Rarely measured or even mentioned in the Service Level documentation
And the sad thing is – users care more about Reliability. Let’s look at two examples. Let’s assume that we have an agreed service time over the course of a month of 43,200 minutes (30*24*60).
In January we had 2 outages:
- 1/15 – Down for 24 minutes
- 1/22 – Down for 36 minutes
In February we had 10 outages:
- 2/3 – Down for 4 minutes
- 2/5 – Down for 8 minutes
- 2/7 – Down for 6 minutes
- 2/12 – Down for 2 minutes
- 2/15 – Down for 9 minutes
- 2/20 – Down for 8 minutes
- 2/22 – Down for 5 minutes
- 2/24 – Down for 3 minutes
- 2/26 – Down for 9 minutes
- 2/28 – Down for 6 minutes
The Availability calculations for the two months are exactly equal: each month adds up to 60 minutes of downtime, for 99.861% Uptime.
But the Reliability for the two months is dramatically different: 2 outages vs 10 outages.
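The two measures can be checked with a short sketch, using the outage durations and the 43,200-minute agreed service time from the examples above:

```python
# Availability vs. Reliability for the two example months.
agreed_minutes = 30 * 24 * 60  # 43,200 minutes of agreed service time

january = [24, 36]                          # 2 outages
february = [4, 8, 6, 2, 9, 8, 5, 3, 9, 6]   # 10 outages

for name, outages in [("January", january), ("February", february)]:
    downtime = sum(outages)                 # 60 minutes in both months
    availability = (agreed_minutes - downtime) / agreed_minutes * 100
    print(f"{name}: {availability:.3f}% uptime, {len(outages)} outages")
# Identical 99.861% uptime -- but 2 outages vs. 10.
```

Availability alone can't distinguish the two months; only the outage count (Reliability) captures what the users actually experienced.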
Which month do you think would most upset the users?
- January User: Well, there were a couple of outages this month, but life went on.
- February User: G@#DA&M, MOTHER#$$^ING COMPUTER DEPARTMENT SUCKS! I WILL KILL YOU ALL! YOU SHOULD ALL BE FIRED YOU INCOMPETENT BAS%#RDS!
Yet for some reason we only measure and report on the aspect that is of lower importance to the users. Interesting.