Most readers have got the story now from my recent articles: Cherry Valley, Illinois, 2009, rain bucketing down, huge train-load of ethanol derails, fire, death, destruction.
Eventually Canadian National’s (CN) crews and the state’s emergency services cleaned up the mess, CN rebuilt the track-bed and the track, and trains rolled regularly through Cherry Valley again.
Then the authorities moved in to find out what went wrong and to try to prevent it happening again. In this case the relevant authority is the US National Transportation Safety Board (NTSB).
Keep asking why?
Every organisation should have a review process for requests, incidents, problems and changes, with some criteria for triggering a review.
In this case it was serious enough that an external agency reviewed the incident. The NTSB had a good look and issued a report. Read it as an example of what a superb post-incident review looks like. Some of our major IT incidents involve as much financial loss as this one and sadly some also involve loss of life.
IT has a fascination with “root cause”. Root Cause Analysis (RCA) is a whole discipline in its own right. The Kepner-Tregoe technique (ITIL Service Operation 2011, Appendix C) calls it “true cause”.
The rule of thumb is to keep asking “Why?” until the answers aren’t useful any more, then that – supposedly – is your root cause.
This belief in a single underlying cause of things going wrong is a misguided one. The world doesn’t work that way – it is always more complex.
The NTSB found a multitude of causes for the Cherry Valley disaster. Here are just some of them:
- It was extreme weather
- The CN central rail traffic controller (RTC) didn’t issue a weather warning to the train crew, which would have made them slow down, although he was required to do so and was in radio communication with the crew
- The RTC did not notify track crews
- The track inspector checked the area at 3pm and observed no water build-up
- Floodwater washed out a huge hole in the track-bed under the tracks, leaving the rails hanging in the air.
- Railroads normally post their contact information at all grade crossings but the first citizen reporting the washout could not find the contact information at the crossing where the washout was, so he called 911
- The police didn’t communicate well with CN about the washed out track: they first alerted two other railroads
- There wasn’t a well-defined protocol for such communication between police and CN
- Once CN learned of the washout they couldn’t tell the RTC to stop trains because his phone was busy
- Although the train crew saw water up to the tops of the rails in some places they did not slow down of their own accord
- There was a litany of miscommunication between many parties in the confusion after the accident
- The federal standard for ethanol cars didn’t require them to be double-skinned or to have puncture-proof bulkheads (it will soon: this tragedy triggered changes)
- There had been a previous washout at the site and a 36” pipe was installed as a relief drain for flooding. Nobody calculated what size pipe was needed and nobody investigated where the flood water was coming from. After the washout the pipe was never found.
- The county’s storm-water retention pond upstream breached in the storm. The storm retention pond was only designed to handle a “ten year storm event”.
- Local residents produced photographic evidence that the berm and outlet of the pond had been deteriorating for several years beforehand.
OK you tell me which is the “root cause”
Causes don’t normally arrange themselves in a nice tree all leading back to one. There are several fundamental contributing causes. Anyone who watches Air Crash Investigation knows it takes more than one thing to go wrong before we have an accident.
Sometimes one of them stands out like the proverbial. So instead of calling it root cause I’m going to call it primary cause. Sure the other causes contributed but this was the biggest contributor, the primary.
Ian Clayton once told me that root cause …er… primary cause analysis is something you do after the fact as part of the review and wash-up. In the heat of the crisis who gives a toss what the primary cause is? Remove the most accessible of the causes. Any disaster is based on multiple causes, so removing any one cause of an incident will likely restore service. Then when we have time to consider what happened and take steps to prevent a recurrence, we should probably try to address all causes. Don’t do Root Cause Analysis, just do Cause Analysis, seeking the multiple contributing causes. If we need to focus our efforts, focus them on the primary cause, which implies that a key factor in deciding primacy is how broad a cause’s potential is for causing more incidents.
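For those who think better in code, here is a minimal sketch of that two-phase idea. It is only an illustration. The `Cause` record, the `effort` and `breadth` scores, and the three placeholder causes are all hypothetical, invented here to show the shape of the reasoning. None of it comes from the NTSB report or from any ITSM tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    name: str
    effort: int   # hypothetical score: how hard it is to remove right now (lower = more accessible)
    breadth: int  # hypothetical score: how many other kinds of incident it could feed


def quickest_fix(causes):
    """In the heat of the crisis: remove the most accessible cause."""
    return min(causes, key=lambda c: c.effort)


def primary_cause(causes):
    """In the review: the primary cause has the broadest potential for more incidents."""
    return max(causes, key=lambda c: c.breadth)


def review_order(causes):
    """Address all causes, broadest potential first."""
    return sorted(causes, key=lambda c: c.breadth, reverse=True)


# Placeholder causes with made-up scores, purely for illustration.
causes = [
    Cause("cause A", effort=1, breadth=2),
    Cause("cause B", effort=4, breadth=5),
    Cause("cause C", effort=3, breadth=1),
]

print("Fix now:", quickest_fix(causes).name)
print("Primary:", primary_cause(causes).name)
print("Review order:", [c.name for c in review_order(causes)])
```

Notice that the cause you remove first to restore service need not be the primary one. The two questions use different criteria, which is why cause analysis belongs in the review rather than in the middle of the crisis.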
All complex systems are broken
It is not often you read something that completely changes the way you look at IT. This paper, How Complex Systems Fail, rocked me. Reading it made me completely rethink ITSM, especially Root Cause Analysis, Major Incident Reviews, and Change Management. You must read it. Now. I’ll wait.
It says that all complex systems are broken. It is only when the broken bits line up in the right way that the system fails.
It dates from 1998! Richard Cook is a doctor, an MD. He seemingly knocked this paper off on his own. It is a whole four pages long, and he wrote it with medical systems in mind. But that doesn’t matter: it is deeply profound in its insight into any complex system and it applies head-on to our delivery and support of IT services.
“Complex systems run as broken systems”
“Change introduces new forms of failure”
“Views of ‘cause’ limit the effectiveness of defenses against future events… likelihood of an identical accident is already extraordinarily low because the pattern of latent failures changes constantly.”
“Failure free operations require experience with failure.”
Many times the person “to blame” for a primary cause was just doing their job. All complex systems are broken. Every day the operators make value judgements and risk calls. Sometimes they don’t get away with it. There is a fine line between considered risks and incompetence – we have to keep that line in mind. Just because they caused the incident doesn’t mean it is their fault. Think of the word “fault” – what they did may not have been faulty, it may just be what they have to do every day to get the job done. Too often, when they get away with it they are considered to have made a good call; when they don’t they get crucified.
That’s not to say negligence doesn’t happen. We should keep an eye out for it, and deal with it when we find it. Equally we should not set out on cause analysis with the intent of allocating blame. We do cause analysis for the purpose of preventing a recurrence of a similar Incident by removing the existing Problems that we find.
I will close by once again disagreeing with ITIL’s idea of Problem Management. As I said in my last article, pro-active Problem Management is not about preventing problems from occurring, or preventing causes of service disruption from entering the environment, or making systems more robust or higher quality.
It is overloading Problem Management to also make it deal with “How could this happen?” and “How do we prevent it from happening again?” That is dealt with by Risk Management (an essential practice that ITIL does not even recognise) feeding into Continual Service Improvement to remove the risk. The NTSB were not doing Problem Management at Cherry Valley.
Next time we will look at continual improvement and how it relates to problem prevention.
Image credit - © Van Truan – Fotolia.com