Just because you rebuild the track doesn’t mean the train won’t derail again.
In past articles we have been looking at the tragic events in little Cherry Valley, Illinois, in 2009. One person died and several more were seriously injured when a train-load of ethanol derailed at a level crossing. We talked about the resulting Incident Management, which focused on customers, trains and cargo – ensuring the services still operated, employing workarounds. Then we considered the Problem Management: the injured people and the wreck and the broken track – removing the causes of service disruption, restoring normal service.
A Problem is a problem, whether it has caused an Incident yet or not
In a previous article I said ITIL has an odd definition of Problem. ITIL says a Problem is the cause of “one or more incidents”. ITIL promotes proactive (better called pre-emptive) Problem Management, and yet apparently we need to wait until something causes at least one Incident before we can start treating it as a Problem. I think the washout in Cherry Valley was a problem long before train U70691-18 barrelled into town. A Problem is in fact the cause of zero or more Incidents. A Problem is a problem, whether it has caused an Incident yet or not.
We talked about how I try to stick to a nice crisp simple model of Incident vs. Problem. To me, an incident is an interruption to service and a problem is an underlying (potential) cause of incidents. Incident Management is concerned with the restoration of expected levels of service to the users. Problem Management is concerned with removing the underlying causes.
ITIL doesn’t see it that crisply delineated: the two concepts are muddied together. ITIL – and many readers – would say that putting out the fires, clearing the derailed tankers, rebuilding the roadbed, and relaying the rails can be regarded as part of the Incident resolution process because the service isn’t “really” restored until the track is back.
Problems can be resolved with urgency
In the last article I said this thinking may arise because of the weird way ITIL defines a Problem. I have a hunch that there is a second reason: people consider removing the cause of the incident to be part of the incident because they see Incident=Urgent, Problem=Slow. They want the Incident Manager and the Service Desk staff to hustle until the cause is removed. This is just silly. There is no reason why Problems can’t be resolved with urgency. Problems should be categorised by severity, priority and impact, just as Incidents are. The Problem team should go into urgent mode when necessary to mobilise resources, and the Service Desk can hustle the Problem along just as they would an Incident.
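To make that concrete, here is a minimal sketch of one way to give Problems and Incidents the same prioritisation. It assumes a simple three-level impact and urgency scale and an arithmetic stand-in for a priority matrix; the names `Level`, `priority_for` and `Record`, and the sample records, are hypothetical illustrations, not anything defined by ITIL or used by Canadian National.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-step scale, used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority_for(impact: Level, urgency: Level) -> int:
    """Return a priority from 1 (act now) to 5 (schedule it).

    The arithmetic is an assumed stand-in for whatever matrix you use.
    """
    return 7 - (int(impact) + int(urgency))


@dataclass
class Record:
    kind: str       # "incident" or "problem"
    title: str
    impact: Level
    urgency: Level

    @property
    def priority(self) -> int:
        # The same rule applies whatever the kind of record.
        return priority_for(self.impact, self.urgency)


# Hypothetical sample records, loosely based on the Cherry Valley story.
queue = [
    Record("incident", "Customer deliveries need rerouting", Level.HIGH, Level.HIGH),
    Record("problem", "Track washed out at level crossing", Level.HIGH, Level.HIGH),
    Record("problem", "Drainage behind embankment is faulty", Level.HIGH, Level.MEDIUM),
    Record("incident", "Single customer query about a schedule", Level.LOW, Level.LOW),
]

# One merged work order: an urgent problem outranks a minor incident.
work_order = sorted(queue, key=lambda record: record.priority)
```

The point of the sketch is only that nothing in the prioritisation depends on whether a record is an Incident or a Problem.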
This inclusion of cause-removal over-burdens and de-focuses the Incident Management process. Incident Management should have a laser focus on the user and by implication the customer. It should be performed by people who are expert at serving the user. Its goal is to meet the user’s needs. Canadian National’s incident managers were focused on getting deliveries to customers despite a missing bit of track.
Problem Management is about fixing faults. It is performed by people expert at fixing technology. The Canadian National incident managers weren’t directing clean-up operations in Cherry Valley: they left that to the track engineers and the emergency services.
Problem management is a mess
But the way ITIL has it, some causes are removed as part of Incident resolution and some are categorised as Problems, with the distinction being unclear (“For some incidents, it will be appropriate…” ITIL Service Operation 2011 126.96.36.199). The moment you make Incident Management responsible for sometimes fixing the fault as well as meeting the user’s needs, you have a mashup of two processes, with two sometimes-conflicting goals, and performed by two very different types of people. No wonder it is a mess.
It is a mess from a management point of view when we get a storm of incidents. Instead of linking all related incidents to an underlying Problem, we relate them to some “master incident” (this isn’t actually in ITIL but it is common practice).
It is a mess from a prioritisation point of view. The poor teams who fix things are now serving two processes: Incident and Problem. In order to prioritise their work they need to track a portfolio of faults that are currently being handled as incidents and faults that are being handled as problems, and somehow merge a holistic picture of both. Of course they don’t. The Problem Manager doesn’t have a complete view of all faults nor does the Incident Manager, and the technical teams are answerable to both.
It is a mess from a data modelling point of view as well. If you want to determine all the times that a certain asset broke something, you need to look for both the incidents it caused and the problems it caused.
Every cause of a service impact (or potential impact) should be recorded immediately as a problem, so we can report and manage them in one place.
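A rough sketch of what that single place might look like as a data model follows. It assumes every incident is linked to a problem record, and that a problem can have zero or more incidents attached. The class names, fields and record IDs are all hypothetical, invented for illustration rather than taken from ITIL or any particular tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Problem:
    """An underlying (potential) cause of incidents."""
    problem_id: str
    asset: str
    description: str
    incident_ids: List[str] = field(default_factory=list)  # zero or more


@dataclass
class Incident:
    """An interruption to service, linked to the problem behind it."""
    incident_id: str
    summary: str
    problem_id: str


class FaultRegister:
    """Hypothetical register that keeps every cause as a problem record."""

    def __init__(self) -> None:
        self.problems: Dict[str, Problem] = {}
        self.incidents: Dict[str, Incident] = {}

    def record_problem(self, problem_id: str, asset: str, description: str) -> Problem:
        problem = Problem(problem_id, asset, description)
        self.problems[problem_id] = problem
        return problem

    def record_incident(self, incident_id: str, summary: str, problem_id: str) -> Incident:
        # Raises KeyError if the cause has not been recorded as a problem first.
        problem = self.problems[problem_id]
        incident = Incident(incident_id, summary, problem_id)
        self.incidents[incident_id] = incident
        problem.incident_ids.append(incident_id)
        return incident

    def faults_for_asset(self, asset: str) -> List[Problem]:
        """Everything a given asset broke, found with one lookup."""
        return [p for p in self.problems.values() if p.asset == asset]


# Illustrative records only; the IDs and summaries are made up.
register = FaultRegister()
register.record_problem("PRB-1", "roadbed", "Washout under the track")
register.record_problem("PRB-2", "dispatch process", "Weather alert not passed to train crew")
register.record_incident("INC-1", "Line closed to traffic", "PRB-1")
register.record_incident("INC-2", "Customer delivery delayed", "PRB-1")

roadbed_faults = register.faults_for_asset("roadbed")
```

With this shape, a storm of incidents hangs off one problem instead of a “master incident”, and a problem such as PRB-2 can sit in the register with no incidents against it at all.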
All that tirade is by way of introducing the idea of reactive and proactive Problem Management.
Cherry Valley needed Reactive Problem Management
Reactive Problem Management responds to an incident to remove the cause of the disruption to service. The ITIL definition is more tortuous because it treats “restoring the service” as Incident Management’s job, but it ends up saying a similar thing: “Reactive problem management is concerned with solving problems in response to one or more incidents” (SO 2011 4.4.2).
Proactive Problem Management fixes problems that aren’t currently causing an incident, to prevent them from causing incidents (ITIL says “further” incidents).
So cleaning up the mess in Cherry Valley and rebuilding the track was reactive Problem Management.
Once the trains were rolling they didn’t stop there. Clearly there were some other problems to address. What caused the roadbed to be washed away in the first place? Why did a train thunder into the gap at normal track speed? Why did the tank-cars rupture and how did they catch fire?
Find the problems that need fixing
In Cherry Valley, the drainage was faulty. Water was able to accumulate behind the railway roadbed embankment, causing flooding and eventually overflowing the roadbed, washing out below the track, leaving rails dangling in the air. The next time there was torrential rain, it would break again. That’s a problem to fix.
Canadian National’s communication processes were broken. The dispatchers failed to notify the train crew of a severe weather alert, which they were supposed to do. If they had, the train would have operated at reduced speed. That’s a problem to fix.
The CN track maintenance processes worked, perhaps lackadaisically but they worked as designed. The processes could have been a lot better, but were they broken? No.
The tank cars were approved for transporting ethanol. Such cars were not required to be equipped with head shields (extra protection at the ends of the tank to resist puncturing), jackets, or thermal protection. In March 2012 the US National Transportation Safety Board (NTSB) recommended (R-12-5) “that all newly manufactured and existing general service tank cars authorized for transportation of denatured fuel ethanol … have enhanced tank head and shell puncture resistance systems”. The tank-cars weren’t broken (before the crash). This is not fixing a problem; it is improving safety to mitigate the risk of rupture.
Proactive Problem Management prevents the recurrence of Incidents
I don’t think proactive Problem Management is about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality. That is once again over-burdening a process. If you delve too far into preventing future problems, you cross over into Availability and Capacity and Risk Management and Service Improvement (and Change Management!), not Problem Management.
ITIL agrees: “Proactive problem management is concerned with identifying and solving problems and known errors before further incidents related to them can occur again”. Proactive Problem Management prevents the recurrence of Incidents, not Problems.
In order to ensure that incidents will not recur, we need to dig down to find all the underlying causes. In many methodologies we go after that mythical beast, the Root Cause. We will talk about that next time.