Those who have worked in IT Operations have a strong affinity with the skills of problem solving and troubleshooting. Although a huge amount of effort goes into improving the resiliency and redundancy of IT systems, the ability to quickly diagnose the root cause of problems has never been more important or relevant.
IT Service Management has gone a long way towards making practices standardised and repeatable. For example, you don’t want individual creative input when executing standard changes or fulfilling requests. Standard Operating Procedures and process manuals mean that we expect our engineers and practitioners to behave in predictable ways. Those reluctant to participate in these newly implemented processes might even complain that all the fun has gone out of IT support.
A Home for Creative and Inquiring Minds?
However, there is still a place for creative and inquiring minds in ITSM-driven organisations. Complex systems are adept at finding new and interesting ways to break and stop functioning. Problem analysis still needs some creative input.
When I recruited infrastructure engineers into my team I was always keen to find good problem solvers. I’d find that some people were more naturally inclined to troubleshooting than others.
Some people would absolutely relish the pursuit of the cause of a difficult network or storage issue… thinking of possible causes, testing theories, hitting dead ends and starting again. They tackled problems with the mindset of a stereotypical criminal detective… finding clues, getting closer to the murder weapon, pulling network cables, tailing through the system log.
These kinds of engineers would rather puzzle over the debug output from their core switch than get stuck into the daily crossword. I’m sure if my HR manager let me medically examine these engineers I’d find that the underlying psychological brain activity and feeling of satisfaction would be very similar to crossword puzzlers and sudoku players. I was paying these guys to do the equivalent of the Guardian crossword 5 days a week.
Others would shy away from troubleshooting sticky problems. They didn’t like the uncertainty of being responsible for fixing a situation they knew little about. Or making decisions based on the loosest of facts.
They felt comfortable in executing routine tasks but lacked the capability to logically think through sets of symptoms and errors and work towards the root cause.
The problem I never solved
Working in a previous organisation I remember a particularly tricky problem. Apple computers running Microsoft PowerPoint would find that on a regular basis their open presentation would lock and stop them saving. Users would have to save a new version and rename the file back to its original name.
It was a typical niggling problem that rumbled on for ages. We investigated different symptoms, spent a huge amount of time running tests and debugging network traces. We rebuilt computers, tried moving data to different storage devices and found the root cause elusive. We even moved affected users between floors to rule out network switch problems.
We dedicated very talented people to resolving the problem and made endless promises of progress to our customers. All of which proved false as we remained unable to find the root cause of the problem.
Our credibility ran thin with that customer and we were alarmed to discover that our previous good record of creatively solving problems in our infrastructure was under threat.
What’s wrong with creative troubleshooting?
The best troubleshooters in your organisation share some common traits.
- They troubleshoot based on their own experiences
- They (probably) aren’t always able to rationalise the root cause before attempting to fix it
Making assumptions based on your experiences is a natural thing to do – of course as you learn skills and go through cycles of problem solving you are able to apply your learnings to new situations. This isn’t a negative trait at all.
However it does mean that engineers approach new problems with a potentially limited set of skills and experiences. To network engineers all problems look like a potentially loose cable.
Rationalising the root cause is a balance between intuition on one side and evidence and research on the other. Your troubleshooter will work towards the root cause and will sometimes have hard evidence to confirm it.
“I can see this in the log… this is definitely the cause!”
But in some cases the cause might be suspected, but you aren’t able to prove anything until the fix is deployed.
Wrong decisions can be costly
Attempting the wrong fix is expensive in many ways, not least financially. It’s expensive in terms of time, user patience and most critically the credibility of IT to fix problems quickly.
Expert troubleshooters are able to provide rational evidence that confirms their root cause before a fix is attempted.
A framework is needed
As with a lot of other activities in IT a process or framework can aid troubleshooters to identify the root cause of problems quickly. In addition to providing quick isolation of the root cause, the framework I’m going to discuss can provide evidence as to why we are suggesting this as the root cause.
Using a common framework has other benefits. For example:
- To allow collaboration between teams – Complex infrastructure problems can span multiple functional areas. You would expect to find subject matter experts from across the IT organisation working together to resolve problems. Using a common framework in your organisation allows teams to collaborate on problems in a repeatable way. Should the network team have a different methodology for troubleshooting than the application support team?
- To bring additional resources into a situation – Often ownership of Problems will be handed between teams in functional or hierarchical escalation. External resources may be brought in to assist with the problem. Having a common framework allows individuals to quickly get an appraisal of the situation and understand the progress that has already been made.
- To provide a common language for problem solvers – Structured problem analysis techniques have their own terminology. Having a shared understanding of “Problem Area”, “Root cause” and “Probable cause” will prevent misunderstandings and confusion during critical moments.
The Kepner Tregoe Problem Analysis process
Kepner-Tregoe is a global management consultancy firm specialising in improving the efficiency of its clients.
The founders, Chuck Kepner and Ben Tregoe, were social scientists living in California in the 1950s. Chuck and Ben studied the methods of problem solvers and managers and consolidated their research into process definitions.
Their history is an interesting one and a biography of the organisation is outside the scope of this blog post – but definitely worth researching.
One of the processes developed, curated and owned by Kepner-Tregoe, is Structured Problem Analysis, known as KT-PA.
KT-PA is used by hundreds of organisations to isolate problems and discover the root cause. It’s a framework used by problem solvers and troubleshooters to resolve issues and provide rational evidence that the investigation has discovered the correct cause.
Quick overview of the process
1. State the Problem
KT-PA begins with a clear definition of the Problem. A common mistake in problem analysis is a poor description of the problem, often leading to resources dedicated to researching symptoms of the problem rather than the issue itself.
Having a clear and accurate Problem Statement is critical to finding the root cause quickly. KT-PA provides guidance on identifying the correct object and its deviation.
A typical Problem Statement might be:
Users of MyAccountingApplication are experiencing up to 2 second delays entering ledger information
This problem statement is explicit about the object (“Users of MyAccountingApplication”) and the deviation from normal service (“2 second delays entering ledger information”).
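As a minimal sketch of that object/deviation split, the statement can be held as two separate fields so that neither is left out. The class and field names here are my own illustration and are not KT-PA terminology or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProblemStatement:
    """Hypothetical container: one object, one deviation."""
    obj: str        # the thing that has the problem
    deviation: str  # how it departs from normal service

    def __str__(self) -> str:
        return f"{self.obj} are experiencing {self.deviation}"


statement = ProblemStatement(
    obj="Users of MyAccountingApplication",
    deviation="up to 2 second delays entering ledger information",
)
print(statement)
```

Keeping the two parts apart makes it harder to write a statement that describes only a symptom without naming the object it affects.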
2. Specify the Problem
The process then defines how to specify the problem into Problem Areas. A Problem is specified in 4 dimensions and all should be considered. What, Where, When, Extent:
- What object has the deviation
- What is the deviation
- Where is the deviation on the object
- When did the deviation occur
- Extent of the deviation (How many deviations are occurring, What is the size of one deviation, Are the number of deviations increasing or decreasing)
The problem owner inspects the issue from these dimensions and documents the results. Results are recorded in the format of IS and IS NOT. Using the IS/IS NOT logical comparison starts to build a profile of the problem. Even at this early stage certain causes might become more apparent or less likely.
Already troubleshooters will be getting benefit from the process. The fact that the 2 second delay in the problem dimension of Where “IS Users in London” but “IS NOT Users in New York” is hugely relevant.
The fact that the delay occurs in entering ledger information but not reading ledger information is also going to help subject matter experts think about possible causes.
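One way to picture the IS/IS NOT record is as a small table with one row per dimension. The sketch below is only an illustration: the `Observation` type and `format_specification` helper are hypothetical names, and the WHEN row borrows the dates used in the example later in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    dimension: str  # WHAT, WHERE, WHEN or EXTENT
    is_: str        # where the deviation IS observed
    is_not: str     # where it could be observed but IS NOT


specification = [
    Observation("WHAT", "Delay entering ledger information",
                "Delay reading ledger information"),
    Observation("WHERE", "Users in London", "Users in New York"),
    Observation("WHEN", "First noticed in August 2012",
                "Reported before 30th July"),
]


def format_specification(rows):
    """Render the specification as a plain-text IS / IS NOT table."""
    lines = [f"{'DIMENSION':<10}{'IS':<36}IS NOT"]
    for row in rows:
        lines.append(f"{row.dimension:<10}{row.is_:<36}{row.is_not}")
    return "\n".join(lines)


print(format_specification(specification))
```

Each row pairs a fact with its closest comparison, which is what makes the later steps possible.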
3. Distinctions and Changes
Having specified the problem and made logical comparisons as to where the problem IS and IS NOT in each problem area, the next step is to examine Distinctions and Changes.
Each answer to a specifying question is examined for Distinctions and Changes.
- What is distinct about users in London when compared to users in New York? What is different about their network, connectivity, workstation build?
- What has changed for users in London?
- What is distinct about August 2012 when compared to July?
- What changed around the 30th July?
As these questions are asked and discussed possible root causes should become apparent. These are logged for testing in the next step.
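The questions above follow a pattern, so they can be generated from each IS/IS NOT pair. This is a rough sketch with a hypothetical helper name. The wording of the questions is taken from the examples in this post, and answering them still needs people who know the systems.

```python
def distinction_and_change_questions(is_value, is_not_value):
    """Build the two prompts for one IS / IS NOT pair."""
    return [
        f"What is distinct about {is_value} when compared to {is_not_value}?",
        f"What has changed for {is_value}?",
    ]


pairs = [
    ("users in London", "users in New York"),
    ("entering ledger information", "reading ledger information"),
]

for is_value, is_not_value in pairs:
    for question in distinction_and_change_questions(is_value, is_not_value):
        print(question)
```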
4. Testing the cause
The stage of testing the cause before confirmation is, for me, the most valuable step in the KT-PA process. It isn’t particularly hard to think of possible root causes to a problem. Thinking back to the “problem I never solved” we had many opinions on what the cause might be from different technical experts.
If we had used KT-PA with that problem we could have tested the cause against the problem specification to see how probable it is.
As an example, let’s imagine that during the Distinctions and Changes stage with our problem above, 3 possible root causes were suggested:
- LAN connection issue with the switch the application server is connected to
- The new anti-virus installation installed across the company in August is causing issues
- Internet bandwidth in the London office is saturated
When each possible root cause is evaluated against the problem specification you are able to test it using the following question:
“If ‘LAN connection issue with the switch the application server is connected to’ is the true cause of the problem, then how does it explain why users in London experience the issue and users in New York do not?”
This possible root cause doesn’t sound like a winner. If there were network connectivity issues with the server wouldn’t all users be affected?
“If ‘The new anti-virus installation installed across the company in August is causing issues’ is the true cause of the problem, then how does it explain why users in London experience the issue and users in New York do not?”
We came to this root cause because of a distinction and change in the WHEN problem dimension. In August a new version of anti-virus was deployed across the company. But this isn’t a probable root cause for the same reason: New York users aren’t affected.
“If ‘Internet bandwidth in the London office is saturated’ is the true cause of the problem, then how does it explain why users in London experience the issue and users in New York do not?”
So far this possible root cause sounds most probable. The cause can explain the dimension of WHERE. Does it also explain the other dimensions of the problem?
“If ‘Internet bandwidth in the London office is saturated’ is the true cause of the problem, then how does it explain why the problem was first noticed in August 2012 and not reported before 30th July?”
Perhaps now we’d be researching Internet monitoring charts to see if the possible root cause can be confirmed.
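The elimination above can be sketched as a simple record of judgements. This is an illustrative assumption about how you might track the test, not part of KT-PA itself. The names `PossibleCause`, `build_test_question` and `surviving_causes` are hypothetical, and the True/False values are human judgements copied from the reasoning in this post, not something the code works out.

```python
from dataclasses import dataclass, field


@dataclass
class PossibleCause:
    description: str
    # dimension -> True if the cause explains both the IS and the IS NOT
    explains: dict = field(default_factory=dict)


def build_test_question(cause, is_value, is_not_value):
    """Phrase the test question for one cause and one IS / IS NOT pair."""
    return (f"If '{cause.description}' is the true cause of the problem, "
            f"then how does it explain why {is_value} and {is_not_value}?")


def surviving_causes(causes):
    """Keep causes that no recorded judgement has ruled out."""
    return [c for c in causes if all(c.explains.values())]


causes = [
    PossibleCause("LAN connection issue with the application server switch",
                  {"WHERE": False}),
    PossibleCause("New anti-virus installation deployed in August",
                  {"WHEN": True, "WHERE": False}),
    # WHEN is not judged yet: it still needs the monitoring charts.
    PossibleCause("Internet bandwidth in the London office is saturated",
                  {"WHERE": True}),
]

print(build_test_question(
    causes[2],
    "users in London experience the issue",
    "users in New York do not",
))
for cause in surviving_causes(causes):
    print("Still probable:", cause.description)
```

A cause with no failed dimension stays on the list only until the remaining dimensions are tested, so surviving this filter is a reason to gather evidence and not a confirmation.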
The New Rational Manager
You might be incredulous that one of the most relevant Problem Management books I’ve read was published in 1965.
But I’m recommending it anyway.
The New Rational Manager, written by Charles H Kepner and Benjamin B Tregoe, is a must read for anyone who needs to solve problems, be they manufacturing, industrial, business or Information Technology.
It explains the process above in a readable way with great examples. I think the word “Computer” is mentioned once – this is not a book about modern technology – but it teaches the reader a process that can be applied to complex IT problems.
Problem Management and troubleshooting are critical skills in ITSM and Infrastructure and Operations roles. Many talented troubleshooters make their reputation by applying creative, technical knowledge to a problem and finding the root cause.
Your challenge is harnessing that creativity into a process to make their success repeatable in your organisation and to reduce the risk of fixing the wrong root cause.