I was wondering – do you have a Known Error Database? And are you getting the maximum value out of it?
The concept of a KEDB is interesting to me because it is easy to see how it benefits end users. Also because it is dynamic and constantly updated.
Most of all because it makes the job of the Servicedesk easier.
It is true to say that an effective KEDB can both increase the quality and decrease the time for Incident resolution.
The Aim of Problem Management and the Definition of “The System”
One of the aims of Problem Management is to identify and manage the root causes of Incidents. Once we have identified the causes we could decide to remove these problems to prevent further users being affected.
Obviously this might be a lengthy process – replacing a storage device that has an intermittent fault might take some scheduling. In the meantime Problem Managers should be investigating temporary resolutions or measures to reduce the impact of the Problem for users. This is known as the Workaround.
When talking about Problem Management it helps to have a good definition of “Your System”. There are many possible causes of Incidents that could affect your users including:
- Hardware components
- Software components
- Networks, connectivity, VPN
- Services – in-house and outsourced
- Policies, procedures and governance
- Security controls
- Documentation and Training materials
Any of these components could cause Incidents for a user. Consider the idea that incorrect or misleading documentation would cause an Incident. A user may rely on this documentation and make assumptions on how to use a service, discover they can’t and contact the Servicedesk.
This documentation component has caused an Incident and would be considered the root cause of the Problem
Where the KEDB fits into the Problem Management process
The Known Error Database is a repository of information that describes all of the conditions in your IT systems that might result in an incident for your customers and users.
As users report issues support engineers would follow the normal steps in the Incident Management process. Logging, Categorisation, Prioritisation. Soon after that they should be on the hunt for a resolution for the user.
This is where the KEDB steps in.
The engineer would interact with the KEDB in a very similar fashion to any Search engine or Knowledgebase. They search (using the “Known Error” field) and retrieve information to view the “Workaround” field.
The “Known Error”
The Known Error is a description of the Problem as seen from the users point of view. When users contact the Servicedesk for help they have a limited view of the entire scope of the root cause. We should use screenshot of error messages, as well as the text of the message to aid searching. We should also include accurate descriptions of the conditions that they have experienced. These are the types of things we should be describing in the Known Error field A good example of a Known Error would be:
When accessing the Timesheet application using Internet Explorer 6 users experience an error message when submitting the form.
The Known Error should be written in terms reflecting the customers experience of the Problem.
The Workaround is a set of steps that the Servicedesk engineer could take in order to either restore service to the user or provide temporary relief. A good example of a Workaround would be:
To workaround this issue add the timesheet application to the list of Trusted sites
1. Open Internet Explorer 2. Tools > Options > Security Settings [ etc etc ]
The Known Error is a search key. A Workaround is what the engineer is hoping to find – a search result. Having a detailed Workaround, a set of technical actions the Servicedesk should take to help the user, has multiple benefits – some more obvious than others.
Seven Benefits of Using a Known Error Database (KEDB)
- Faster restoration of service to the user - The user has lost access to a service due to a condition that we already know about and have seen before. The best possible experience that the user could hope for is an instant restoration of service or a temporary resolution. Having a good Known Error which makes the Problem easy to find also means that the Workaround should be quicker to locate. All of the time required to properly understand the root cause of the users issue can be removed by allowing the Servicedesk engineer quick access to the Workaround.
- Repeatable Workarounds - Without a good system for generating high-quality Known Errors and Workarounds we might find that different engineers resolve the same issue in different ways. Creativity in IT is absolutely a good thing, but repeatable processes are probably better. Two users contacting the Servicedesk for the same issue wouldn’t expect a variance in the speed or quality of resolution. The KEDB is a method of introducing repeatable processes into your environment.
- Avoid Re-work – Without a KEDB we might find that engineers are often spending time and energy trying to find a resolution for the same issue. This would be likely in distributed teams working from different offices, but I’ve also seen it commonly occur within a single team. Have you ever asked an engineer if they know the solution to a users issue to be told “Yes, I fixed this for someone else last week!”. Would you have prefered to have found that information in an easier way?
- Avoid skill gaps - Within a team it is normal to have engineers at different levels of skill. You wouldn’t want to employ a team that are all experts in every functional area and it’s natural to have more junior members at a lower skill level. A system for capturing the Workaround for complex Problems allows any engineer to quickly resolve issues that are affecting users.Teams are often cross-functional. You might see a centralised application support function in a head-office with users in remote offices supported by their local IT teams. A KEDB gives all IT engineers a single place to search for customer facing issues.
- Avoid dangerous or unauthorised Workarounds - We want to control the Workarounds that engineers give to users. I’ve had moments in the past when I chatted to engineers and asked how they fixed issues and internally winced at the methods they used. Disabling antivirus to avoid unexpected behaviour, upgrading whole software suites to fix a minor issue. I’m sure you can relate to this. Workarounds can help eliminate dangerous workarounds.
- Avoid unnecessary transfer of Incidents - A weak point in the Incident Management process is the transfer of ownership between teams. This is the point where a customer issue goes to the bottom of someone else queue of work. Often with not enough detailed context or background information. Enabling the Servicedesk to resolve issues themselves prevents transfer of ownership for issues that are already known.
- Get insights into the relative severity of Problems - Well written Known Errors make it easier to associate new Incidents to existing Problems. Firstly this avoids duplicate logging of Problems. Secondly it gives better metrics about how severe the Problem is. Consider two Problems in your system. A condition that affects a network switch and causes it to crash once every 6 months. A transactional database that is running slowly and adding 5 seconds to timesheet entry You would expect that the first Problem would be given a high priority and the second a lower one. It stands to reason that a network outage on a core switch would be more urgent that a slowly running timesheet system But which would cause more Incidents over time? You might be associating 5 new Incidents per month against the timesheet problem whereas the switch only causes issues irregularly. Being able to quickly associate Incidents against existing Problems allows you to judge the relative impact of each one.
The KEDB implementation
Technically when we talk about the KEDB we are really talking about the Problem Management database rather than a completely separate store of data. At least a decent implementation would have it setup that way.
There is a one-to-one mapping between Known Error and Problem so it makes sense that your standard data representation of a Problem (with its number, assignment data, work notes etc) also holds the data you need for the KEDB.
It isn’t incorrect to implement this in a different way – storing the Problems and Known Errors in seperate locations, but my own preference is to keep it all together.
Known Error and Workaround are both attributes of a Problem
Is the KEDB the same as the Knowledge Base?
This is a common question. There are a lot of similarities between Known Errors and Knowledge articles.
I would argue that although your implementation of the KEDB might store its data in the Knowledgebase they are separate entities.
Consider the lifecycle of a Problem, and therefore the Known Error which is, after all, just an attribute of that Problem record.
The Problem should be closed when it has been removed from the system and can no longer affect users or be the cause of Incidents. At this stage we could retire the Known Error and Workaround as they are no longer useful – although we would want to keep them for reporting so perhaps we wouldn’t delete them.
Knowledgebase articles have a more permanent use. Although they too might be retired, if they refer to an application due to be decommissioned, they don’t have the same lifecycle as a Known Error record.
Knowledge articles refer to how systems should work or provide training for users of the system. Known Errors document conditions that are unexpected.
There is benefit in using the Knowledgebase as a repository for Known Error articles however. Giving Incident owners a single place to search for both Knowledge and Known Errors is a nice feature of your implementation and typically your Knowledge tools will have nice authoring, linking and commenting capabilities.
What if there is no Workaround
Sometimes there just won’t be a suitable Workaround to provide to customers.
I would use an example of a power outage to provide a simple illustration. With power disrupted to a location you could imagine that there would be disruption to services with no easy workaround.
It is perhaps a lazy example as it doesn’t allow for many nuances. Having power is a normally a binary state – you either have adequate power or not.
A better and more topical example can be found in the Cloud. As organisations take advantage of the resource charging model of the Cloud they also outsource control.
If you rely on a Cloud SaaS provider for your email and they suffer an outage you can imagine that your Servicedesk will take a lot of calls. However there might not be a Workaround you can offer until your provider restores service.
Another example would be the February 29th Microsoft Azure outage. I’m sure a lot of customers experienced a Problem using many different definitions of the word but didn’t have a viable alternative for their users.
In this case there is still value to be found in the Known Error Database. If there really is no known workaround it is still worth publishing to the KEDB.
Firstly to aid in associating new Incidents to the Problem (using the Known Error as a search key) and to stop engineers in wasting time in searching for an answer that doesn’t exist.
You could also avoid engineers trying to implement potentially damaging workarounds by publishing the fact that the correct action to take is to wait for the root cause of the Problem to be resolved.
Lastly with a lot of Problems in our system we might struggle to prioritise our backlog. Having the Known Error published to help routing new Incidents to the right Problem will bring the benefit of being able to prioritise your most impactful issues.
A users Known Error profile
With a populated KEDB we now have a good understanding of the possible causes of Incidents within our system.
Not all Known Errors will affect all users – a network switch failure in one branch office would be very impactful for the local users but not for users in another location.
If we understand our users environment through systems such as the Configuration Management System (CMS) or Asset Management processes we should be able to determine a users exposure to Known Errors.
For example when a user phones the Servicedesk complaining of an interruption to service we should be able to quickly learn about her configuration. Where she is geographically, which services she connects to. Her personal hardware and software environment.
With this information, and some Configuration Item matching the Servicedesk engineer should have a view of all of the Known Errors that the user is vulnerable to.
Measuring the effectiveness of the KEDB.
As with all processes we should take measurements and ensure that we have a healthy process for updating and using the KEDB.
Here are some metrics that would help give your KEDB a health check.
Number of Problems opened with a Known Error
Of all the Problem records opened in the last X days how many have published Known Error records?
We should be striving to create as many high quality Known Errors as possible.
The value of a published Known Error is that Incidents can be easily associated with Problems avoiding duplication.
Number of Problems opened with a Workaround
How many Problems have a documented Workaround?
The Workaround allows for the customer Incident to be resolved quickly and using an approved method.
Number of Incidents resolved by a Workaround
How many Incidents are resolved using a documented Workaround. This measures the value provided to users of IT services and confirms the benefits of maintaining the KEDB.
Number of Incidents resolved without a Workaround or Knowledge
Conversely, how many Incidents are resolved without using a Workaround or another form of Knowledge.
If we see Servicedesk engineers having to research and discover their own solutions for Incidents does that mean that there are Known Errors in the system that we aren’t aware of?
Are there gaps in our Knowledge Management meaning that customers are contacting the Servicedesk and we don’t have an answer readily available.
A high number in our reporting here might be an opportunity to proactively improve our Knowledge systems.
want to ensure that Known Errors are quickly written and published in order to allow Servicedesk engineers to associate incoming Incidents to existing Problems.
One method of measuring how quickly we are publishing Known Errors is to use Organisational Level Agreements (or SLAs if your ITSM tool does’t define OLAs).
We should be using performance measurements to ensure that our Problem Management function is publishing Known Errors in a timely fashion.
You could consider tracking Time to generate Known Error and Time to generate Workaround as performance metrics for your KEDB process.
Additionally we could also measure how quickly Workarounds are researched, tested and published. If there is no known Workaround that is still valuable information to the Servicedesk as it eliminates effort in trying to find one so an OLA would be appropriate here.