Very Long Post Ahead

March 28th, 2014 | Posted by Don Boylan in Change Management | Favorite | Incident Management | Problem Management

This post can’t be summarized and includes multiple follow-up questions which contain (I think) some of my best responses. So here it is in full:

Q1:

I just found this very useful forum and I want to ask the experts here if I understood the ITIL processes correctly. I will give one example and map the steps to the ITIL processes. It would be great if you could confirm whether the mapping is correct and, if not, name the correct process with the reason. I struggle a bit with step 3/4.

Thanks a lot in advance!

Example

1. Service Desk of a company (selling e.g. PCs) receives a call from
a customer/user who bought a PC. PC is still under warranty
(checked in CMDB). The customer says that the PC “beeps” many
times during booting and then “hangs”, does nothing.
PROCESS/FUNCTION MAPPING: Function of Service Desk and
use of Configuration Management to check the warranty.

2. Service Desk employee creates a “Service Incident Ticket”. A
quick search in the problem/solution DB gave the result that
the most common reason is a defective RAM module (known error).
PROCESS MAPPING: Incident management which uses functionalities
from the problem management.

3. The Service Desk employee informs the customer that a RAM module
must be exchanged and that a service technician will be sent out. Date
and time are fixed. The Service Desk employee changes the status of
the Service Ticket into “Repair Required”. PROCESS MAPPING:
Incident Management.

4. The Service Ticket was automatically assigned to a free technician and
additionally it was checked in the inventory if a replacement RAM exists.
PROCESS MAPPING: Now that’s where I struggle a bit. Does this
(sending out a technician) still belong to the Incident Management or is
this Problem Management? I assume the second one because it is not
100% clear if the RAM is really the root cause of the problem and the
technician must check that On Site. On the other hand the incident
management must ensure the incident is solved as quickly as
possible, even with workarounds. But sending out the technician in
this case is the fastest way… Any thoughts on this are really welcome.

5. The technician is On Site, replaces the RAM. He confirms (electronically)
that the repair was successfully done.
PROCESS MAPPING: As in step 4 this could be a final step of the
incident management because this incident is solved or the final steps
of the problem management (maybe both). Additionally the
serial number of the RAM in the CMDB is updated automatically.
I assume that this part belongs to change management even if
this is done automatically from the electronic confirmation during
the incident/problem management process. A separate RFC for this
change is not created because that is a standard change and nobody
(also no change manager) will check if the technician has entered
the correct serial number.

A1:

Step four is pure Incident Management. You have done no Problem Management because you are not doing any root cause analysis. And remember:

* Involving Problem Management is a decision made by IT to start spending money to prevent the Incident from reoccurring
* The output of Problem Management is an RFC. I argue that even Standard Changes get RFCs. How else are you going to know to update the CMDB?
* Problem Management doesn’t “fix” anything. That’s Release Management’s job.

Now if you had a rash of bad memory (multiple Incidents with similar symptoms), then you might decide to spend the money to invoke Problem Management. Problem Management may determine that the brand is unreliable and open an RFC requesting that the vendor be changed. Or Problem Management may determine that excessive solar flares are causing cosmic rays, and the RFC would be to shield the building with a lead dome. Whatever the case may be, Problem Management doesn’t do any work. Change Management approves and then Release Management implements.
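
As a rough illustration of the "rash of Incidents" trigger described above, the following Python sketch groups Incidents by symptom and flags any symptom that recurs often enough to be worth a Problem investigation. The record fields, the function name, and the threshold of three are assumptions made for this example, not values defined by ITIL.

```python
from collections import Counter

# Hypothetical threshold: how many similar Incidents justify spending
# money on root cause analysis. Each organization picks its own value.
PROBLEM_THRESHOLD = 3

def problem_candidates(incidents, threshold=PROBLEM_THRESHOLD):
    """Return the symptoms that occur at least `threshold` times, sorted."""
    counts = Counter(incident["symptom"] for incident in incidents)
    return sorted(s for s, n in counts.items() if n >= threshold)

incidents = [
    {"id": 1, "symptom": "beeps and hangs on boot"},
    {"id": 2, "symptom": "beeps and hangs on boot"},
    {"id": 3, "symptom": "no display"},
    {"id": 4, "symptom": "beeps and hangs on boot"},
]
print(problem_candidates(incidents))
```

A single bad RAM module never reaches the threshold, so it stays a plain Incident. Only the repeated pattern becomes a candidate for Problem Management.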

Step 4 is Incident Management, but is not the end of the Incident Management process. That may Resolve the Incident, but Closure requires communication between IT and the user that the issue has been resolved to their satisfaction. Since it is an IT/user communication, it needs to come from the Service Desk. After that communication has occurred, then the Incident can be Closed.
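
The difference between Resolving and Closing an Incident can be sketched as a small state model. This is only an illustration: the class, the status names, and the `user_confirmed` flag are assumptions for the example, not part of any particular service desk tool.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    number: str
    status: str = "Open"
    user_confirmed: bool = False

    def resolve(self):
        """The technician reports that the fix is in place."""
        self.status = "Resolved"

    def confirm_with_user(self):
        """The Service Desk records that the user is satisfied."""
        self.user_confirmed = True

    def close(self):
        """Closure is only allowed after resolution and user confirmation."""
        if self.status != "Resolved" or not self.user_confirmed:
            raise ValueError("Incident must be Resolved and confirmed by the user")
        self.status = "Closed"

incident = Incident("INC-1")
incident.resolve()            # RAM replaced on site
incident.confirm_with_user()  # Service Desk talks to the user
incident.close()
print(incident.status)
```

Calling `close()` straight after `resolve()` raises an error in this sketch, which mirrors the point that the technician's confirmation alone does not close the Incident.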

Just as a side note, if you think that a CMDB can track details of every PC’s RAM serial number, you might be thinking a little too ambitiously. IF you are tracking every desktop (and that’s a big if) as a CI, then you would simply associate the CI record to the Incident record so that it is noted that the issue occurred. Keep in mind that the CMDB is not a replacement for Asset or Inventory databases. If your organization needs to track assets to that level of detail, then I would suggest using an Inventory or Asset repository.

Q2:

Thank you very much for your detailed explanations, much appreciated. Made some things a lot clearer.
Just one follow-on question concerning step 5. Let’s assume for the minute (even if it’s too ambitious) that I want to track the serial number of the RAM in the CMDB (could also be any other spare part in a bigger installation, need not be a PC). When the technician returns the info that the replacement took place (e.g. via an electronic confirmation) would this still be considered as a part of the incident management process? In my theory this would be a functionality/process of the configuration management which is triggered from the incident mgmt. Is this wrong?

A2:

On the assumption that you were going to track the serial number of RAM as an Attribute of a CI, then you must submit an RFC. The definition of a Change is any change in status or attribute of a CI.

The technician would not be allowed to make a change to the attribute of the CI (new RAM serial) until the Change had been approved. If it is a Standard Change, then filling out the RFC in essence grants approval. The technician would then be performing the role of Release Management when he implements the approved Change. Change would then review the Release to see if it was successful and then Configuration Management could update the CI serial number attribute.

So the way it would work is Incident Event > RFC > Change Approval > Release Mgt Implementing > Change Review > Incident Resolution > CI Serial Number Attribute Update.

(although I don’t think it matters whether the Incident is resolved before, after, or at the same time as the CI Attribute Update)

The CI might get updated quite a few times in the above process (associating the Incident #, RFC #, noting that the CI Status is “off line” or “under repair”) but the CI Serial Number Attribute update is driven by the successful implementation of a Change.

Also remember that any process, IT functional group, or even the business can open an RFC. It is the approved and implemented Changes that drive the updates to the CMDB.
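
The sequence above can be sketched in Python as a minimal model in which the CMDB refuses an attribute update unless it comes from a Change that was approved, implemented, and reviewed. Every class, function, status name, and identifier here (`RFC`, `CMDB`, `implement`, `review`, `PC-0042`, `SN-NEW`) is hypothetical and only serves the illustration.

```python
from dataclasses import dataclass, field

@dataclass
class RFC:
    ci_id: str
    attribute: str
    new_value: str
    standard: bool = False   # a Standard Change is approved on submission
    status: str = "Submitted"

    def __post_init__(self):
        if self.standard:
            self.status = "Approved"

@dataclass
class CMDB:
    records: dict = field(default_factory=dict)

    def apply(self, rfc):
        """Update a CI attribute only for a Change that passed review."""
        if rfc.status != "Reviewed":
            raise PermissionError("CI update requires a reviewed Change")
        self.records.setdefault(rfc.ci_id, {})[rfc.attribute] = rfc.new_value

def implement(rfc):
    """Release Management role: only approved Changes may be implemented."""
    if rfc.status != "Approved":
        raise PermissionError("Change is not approved")
    rfc.status = "Implemented"

def review(rfc, successful):
    """Change Management role: review the implemented Release."""
    if rfc.status != "Implemented":
        raise ValueError("Nothing to review")
    rfc.status = "Reviewed" if successful else "Failed"

cmdb = CMDB()
rfc = RFC("PC-0042", "ram_serial", "SN-NEW", standard=True)
implement(rfc)         # the technician swaps the RAM
review(rfc, True)      # Change reviews the Release
cmdb.apply(rfc)        # Configuration Management updates the CI
print(cmdb.records)
```

In this sketch a Standard Change skips the approval wait but still leaves an RFC record behind, which is what drives the CMDB update.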

Q3:

I agree 100% with this process as you explained!
If you don’t mind I want to get one level deeper in the last discussed process. Step 3 is the important one.

1. Service Desk creates the Incident
2. Incident is assigned to the technician because only he can solve
the issue. You can think about this like an escalation because
the Service Desk Agent cannot solve it, only the technician can.
3. Technician drives to the customer and replaces the RAM. In this
step there is no formal RFC (more or less the Incident itself is the RFC)
because the technician decides On Site that the replacement must be
done. It is more an implicit RFC. He owns the repair and approves it
by himself and also does the change review after the implementation
because he tests the machine after that.
So more or less he implements the change immediately and
confirms the change electronically. This then automatically updates
the CMDB and also closes the incident. Based on the incident we send
out an invoice for the service which basically is the last step of the
process.
The customer does not expect more information than the invoice,
so no further mail from the Service Desk (ok, to stay ITIL close
e.g. an automated eMail could be generated to inform the customer
that the machine was repaired and if there are further issues he should
call again).

Now I see that this doesn’t really fit a pure ITIL process because a lot is automated in this process, missing e.g. RFC or approvals. I would say some of the process steps like RFC, Approval, Implementation are done by the technician but not really documented because it’s a standard predefined repair process. He only confirms the successful implementation which then triggers the update of the CMDB. Creating more documents in the system or having additional reviews / approvals would slow down the repair process way too much.

Sorry if this all sounds a bit confusing. Would you completely disagree on this process in terms of ITIL compliance or would you agree that this is a process based on ITIL but shortened to work in the best way for the business?

A3:

You are getting closer to the concept of what an Incident is, but I don’t think you are understanding when to, and when not to, invoke an update to a CMDB record.

Your explanation above would be perfectly compatible with ITIL if you took out any mention of a CMDB. Could the technician be updating an inventory database? Yes, most definitely. But he cannot make a change to an attribute of a CI without an approved RFC. If having an approved RFC is too much overhead, then this equipment (or this attribute of the equipment) should not be tracked as a CI.

The question is “Why would ITIL require that any change to a CI have an approved RFC?”. Because in the past, IT had infrastructure that was in no way controlled. It was bedlam. The best-intentioned inventory systems were hopelessly out of date almost before physical inventories were completed. Even with automated tools and strict controls on what users were allowed to do, it was simply impossible.

So ITIL came along and said “Yes, it is impossible to keep an accurate inventory, but there are some pieces of IT that are too important for ad hoc changes”. These pieces of equipment we shall promote to the status of Configuration Items, and we shall track them in a Configuration Management Database, and under no circumstances shall any change be made to them without prior approval. And so it was done.

Did that invalidate the need for inventory systems? No. And most organizations that have a CMDB also have inventory systems. These are for tracking IT equipment that is outside the scope of the CMDB. The inventory DB also tracks equipment that exists in the CMDB. This may seem like a duplication of effort but it is because the CMDB only tracks certain attributes, and the inventory DB might need to track additional attributes (such as the serial number of RAM chips). But it is understood that the inventory DB is, at best, unreliable.
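
One way to picture the split between the two repositories is a lookup that decides where an attribute update belongs. The scope table, the item types, and the function name below are assumptions invented for this sketch. A real CMDB scope would come from the organization's own Scope and CI Level definitions.

```python
# Hypothetical scope definition: which attributes of which item types
# are under Change control in the CMDB.
CMDB_SCOPE = {
    "server": {"status", "os_version"},
    "xray_machine": {"status", "ram_serial"},
}

def target_store(item_type, attribute, scope=CMDB_SCOPE):
    """Return which repository owns an update to this attribute."""
    if attribute in scope.get(item_type, set()):
        return "cmdb"       # needs an approved RFC first
    return "inventory"      # tracked, but outside Change control

print(target_store("desktop_pc", "ram_serial"))
print(target_store("xray_machine", "ram_serial"))
```

Under these assumed scope rules, the RAM serial of an ordinary desktop goes to the inventory database, while the same attribute on a controlled machine is a CI attribute and requires an RFC.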

Let me give you a scenario. The same one you proposed, but the memory chip is going into a medical X-ray machine. The same Incident is raised. The tech arrives on site and discovers the issue is a bad chip. He has the chip available, but the instrument is maintained in a Controlled State. Would you want them adding a memory chip to it without having an approved back-out plan? Are you absolutely sure that the tech understands all the risks associated with swapping out a component in a machine that can kill? Would you want the procedure done on a “pre-approved” basis without having been through an approval process many times before with no incidents (deaths) reported?

I would hope that you see the value of the technician not installing the chip. They would open an RFC so that the Change could be reviewed to ensure that the risk was small enough that the next person who got shot with X-rays didn’t die. Or, it could be pre-approved if the procedure had been through the Change Management process enough times without any incidents (deaths), and the procedures, risks, back-out plans, etc. were all understood and well defined (this is the definition of what is acceptable as a potential Standard Change). But the tech must still open an RFC because the approved Change is what will allow the machine to stay in a controlled state.
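
The criteria for a potential Standard Change described above can be written down as a simple check. This is a sketch under stated assumptions: the field names and the minimum of ten clean runs are hypothetical policy values, and each organization would set its own.

```python
from dataclasses import dataclass

# Hypothetical policy value: how many clean runs through normal Change
# Management are needed before a procedure can be pre-approved.
MIN_CLEAN_RUNS = 10

@dataclass
class ChangeHistory:
    procedure: str
    approved_runs: int       # times it went through full approval
    incidents_caused: int    # Incidents attributed to those runs
    has_backout_plan: bool
    risks_documented: bool

def eligible_as_standard_change(history, min_runs=MIN_CLEAN_RUNS):
    """True only if the procedure is proven, clean, and fully documented."""
    return (
        history.approved_runs >= min_runs
        and history.incidents_caused == 0
        and history.has_backout_plan
        and history.risks_documented
    )

proven = ChangeHistory("replace RAM module", 25, 0, True, True)
untested = ChangeHistory("replace RAM module", 2, 0, True, True)
print(eligible_as_standard_change(proven))
print(eligible_as_standard_change(untested))
```

Even when the check passes, the sketch only says the RFC can be pre-approved. It does not remove the need to open one.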

When the governing authorities come in and ask for a history of the machine, they will demand to see the Change Log. And, they will probably want to see the machine’s Status Accounting, which is a report that is obtained directly from the CMDB.

You may say that’s all well and good for pieces of medical equipment, but when could that apply to servers, or routers, or other critical pieces of infrastructure? I used to work for a Fortune 500 Pharma and I can tell you that the FDA does come in and audit infrastructure.

If you are in an organization that gets its Change Management records audited then it is equally important to keep your non-critical changes out of the CMDB. You don’t really want to expose every single tiny Incident that a server has had if that Incident didn’t affect the Status of the server or any of the CI tracked attributes.

And why do some organizations track Changes and Configuration Items so closely? Because it is the best way to perform IT. You could even call it best practices. It is true that it isn’t feasible to do this depth of control for all IT assets. That’s why the CMDB has such well defined Scope and CI Level definitions. Anything outside of the CMDB can (and perhaps should) be tracked, but in an inventory or asset system.
