Modern software systems and IT infrastructures are complex - incidents are bound to arise. The occurrence of errors and outages is not a question of "if", but of "when" and "for how long". To ensure that incidents are recognized systematically, effectively addressed, and followed up on in a structured manner, it is important to have IT Service Management in place. This requires a modern incident management approach, ideally consisting of five stages: Preparation, Detection and Alerting, Containment, Recovery, and Analysis.
Organizations that neglect parts of the full incident management cycle take unnecessary risks: teams may respond incorrectly, which can prolong incidents and drive up costs.
Let's take a closer look at the five stages of incident management.
1. Preparation
Preparation is an essential yet often neglected part of incident management. In fact, some teams start preparing only after an incident has caught them off guard and plunged them into chaos.
Experienced ITSM teams take preparation seriously. They explore what-if scenarios and define processes in advance. They pack, so to speak, an always-accessible rescue “jump bag” with critical information on how to deal with an unexpected event or problem. Centralizing this data enables quick action to be taken right away, and saves the team a lot of time.
Depending on the team structure and systems in use, information in this centralized repository could include incident response plans, contact lists, preparedness plans, escalation guidelines, link lists to key communication tools, access codes, compliance regulations, technical documentation and manuals.
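As a rough illustration, the contents of such a jump bag can be kept as a simple checklist that is verified regularly. The sketch below is only one possible approach: the category keys mirror the list above, while the example locations and the `missing_items` helper are made-up names for illustration. Note that the entries point to where material lives; secrets such as access codes belong in a proper secrets store, not in the checklist itself.

```python
# Required categories, taken from the list above; the key names are illustrative.
REQUIRED_ITEMS = (
    "incident_response_plan",
    "contact_list",
    "preparedness_plan",
    "escalation_guidelines",
    "communication_tool_links",
    "access_codes",
    "compliance_regulations",
    "technical_documentation",
)


def missing_items(jump_bag: dict[str, str]) -> list[str]:
    """Return the required categories that have no location recorded."""
    return [item for item in REQUIRED_ITEMS if not jump_bag.get(item, "").strip()]


# Hypothetical example: each entry records where the material can be found.
jump_bag = {
    "incident_response_plan": "wiki page 'Incident Response Plan'",
    "contact_list": "wiki page 'On-call Contacts'",
    "escalation_guidelines": "",  # entry exists but was never filled in
}
gaps = missing_items(jump_bag)
print(gaps)
```

Running a check like this on a schedule helps a team notice gaps during calm periods rather than in the middle of an incident.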
2. Detection and Alerting
When an incident occurs, a team should identify it before tickets start pouring in from the customer portal and service phones start ringing off the hook. An important question is: how can teams be alerted?
With the increasing availability of high-quality IT monitoring tools, teams are now better informed about abnormalities and incidents. However, the use of multiple tools can also result in an excessive number of alarms, including false positives, which may hinder the reaction process. It is therefore sensible to add an extra layer to the monitoring process that centralizes alerting. That is where modern incident management tools such as Opsgenie from Atlassian come in.
Opsgenie makes it possible to streamline the process significantly. It offers automated alerting workflows based on alert types, team plans and escalation policies, so that human error and delays can be largely eliminated.
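To make the idea of a centralized alerting layer more concrete, here is a minimal sketch of how duplicate alerts from several monitoring tools might be collapsed and routed through an escalation policy. This is not Opsgenie's API; the class names, the tool and service names, and the timing values are all assumptions chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Alert:
    source: str   # monitoring tool that raised the alert (hypothetical name)
    service: str  # affected service
    kind: str     # alert type, e.g. "latency" or "disk_full"


@dataclass
class EscalationPolicy:
    # Ordered steps: (minutes without acknowledgement, who gets notified).
    steps: list[tuple[int, str]]

    def notify_targets(self, minutes_unacknowledged: int) -> list[str]:
        """Return everyone who should have been notified by this point."""
        return [who for after, who in self.steps if minutes_unacknowledged >= after]


@dataclass
class AlertRouter:
    policies: dict[str, EscalationPolicy]  # alert kind -> policy
    default: EscalationPolicy
    _open: set[tuple[str, str]] = field(default_factory=set)

    def ingest(self, alert: Alert) -> EscalationPolicy | None:
        """Deduplicate by (service, kind); return a policy only for new problems."""
        key = (alert.service, alert.kind)
        if key in self._open:
            return None  # the same problem was already reported by another tool
        self._open.add(key)
        return self.policies.get(alert.kind, self.default)


router = AlertRouter(
    policies={"disk_full": EscalationPolicy([(0, "on-call engineer"), (15, "team lead")])},
    default=EscalationPolicy([(0, "on-call engineer")]),
)
first = router.ingest(Alert("tool-a", "checkout", "disk_full"))
duplicate = router.ingest(Alert("tool-b", "checkout", "disk_full"))
```

In this sketch the second alert is dropped because it describes a problem that is already open, and the team lead is only brought in if nobody acknowledges the alert within 15 minutes. A real tool would also close incidents, track acknowledgements, and respect on-call schedules.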
In addition, it can make sense - especially on a large scale - to monitor the monitoring itself, as monitoring tools are not immune to incidents. A holistic alerting process ensures that both the systems and the monitoring tools are continuously checked.
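One common way to monitor the monitoring is a heartbeat check: every monitoring tool reports in at regular intervals, and silence itself triggers an alarm. The following is a minimal sketch under that assumption; the `HeartbeatWatchdog` class, the tool names, and the five-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta


class HeartbeatWatchdog:
    """Tracks the last check-in time of each monitoring tool."""

    def __init__(self, max_silence: timedelta):
        self.max_silence = max_silence
        self.last_seen: dict[str, datetime] = {}

    def check_in(self, tool: str, at: datetime) -> None:
        """Record that a tool reported in at the given time."""
        self.last_seen[tool] = at

    def silent_tools(self, now: datetime) -> list[str]:
        """Return tools whose last check-in is older than the allowed silence."""
        return sorted(
            tool for tool, seen in self.last_seen.items()
            if now - seen > self.max_silence
        )


start = datetime(2024, 1, 1, 12, 0)
watchdog = HeartbeatWatchdog(max_silence=timedelta(minutes=5))
watchdog.check_in("metrics-collector", start)
watchdog.check_in("log-shipper", start)
watchdog.check_in("metrics-collector", start + timedelta(minutes=8))

# Ten minutes after the start, only the log shipper has been silent too long.
overdue = watchdog.silent_tools(now=start + timedelta(minutes=10))
```

The watchdog should run independently of the tools it observes, otherwise a shared failure could silence both at once.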
3. Containment
Triage, a preliminary assessment, works in incident management much as it does in the medical field. The first step is to identify the extent of the incident. Then the incident must be contained and, if possible, isolated to prevent the situation from getting worse. All activities in this phase revolve around limiting damage and avoiding further impact.
The focus here is on finding temporary solutions, for example isolating a network, resetting a build, or restarting the servers. Ideally, this will solve the problem, but again, the focus is on limiting the scope of the incident. All further steps and measures, i.e. efforts to fully recover, follow later.
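The triage step can be supported by a simple, agreed-upon rule for rating the extent of an incident. The sketch below shows one hypothetical scheme; the severity levels, the input signals, and the thresholds are assumptions that each team would need to define for itself.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "critical"
    SEV2 = "major"
    SEV3 = "minor"


def triage(affected_users_pct: float, core_service_down: bool) -> Severity:
    """Rate an incident by its extent; thresholds are illustrative only."""
    if core_service_down or affected_users_pct >= 50:
        return Severity.SEV1
    if affected_users_pct >= 5:
        return Severity.SEV2
    return Severity.SEV3


rating = triage(affected_users_pct=12.0, core_service_down=False)
print(rating.name, rating.value)
```

A fixed rule like this does not replace judgment, but it gives responders a shared starting point when deciding how aggressively to contain an incident.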
Even in the event of a problem, open communication should be maintained with customers. Of course, this is easier said than done in the heat of an IT crash. Transparency is important because it builds up a valuable asset, namely trust. So, in addition to the emergency “jump bag”, you should have a communication plan ready for the event of a serious incident.
A status page, Twitter and user forums are suitable platforms for sharing relevant information. The team should also maintain this open communication in subsequent incident management phases.
4. Recovery
In this stage, the team implements long-term solutions to ensure that the incident is fully and effectively resolved. The aim is to understand the causes that led to the problems and to correct the underlying conditions so that similar incidents can be prevented in the future.
The goal of this phase is not merely to return the system to its original stable state, but to make it even better and safer. It should have the same operational capabilities, but provide additional protection against similar incidents.
5. Analysis
The workflows in professional incident management do not end when the dust has settled and the system is running smoothly and safely again. This is when the next stage, analysis, should begin. The purpose of this “post-mortem stage” is both to clearly understand the systemic causes of the incident and to critically examine the individual steps that led to its resolution.
Based on this, the incident team identifies opportunities for improving systems and processes. The evaluation of this information helps to develop new workflows that support greater system resilience and faster incident responses.
A good post-mortem looks at the entire incident and answers the who, what, why and how questions - but without blaming individuals or holding them personally liable: IT is always a team sport! At its core, the analysis revolves around learning from the incident in order to optimize the team's performance and compile reference material for future incident scenarios.
Experienced ITSM teams conduct post-mortem analyses after every incident - not only after major outages. This way, they avoid the risk of overlooking lingering effects of smaller incidents. A detailed report is probably not necessary for every single incident, but there should always be time for a review. Awareness of specific situations promotes the further development of collective knowledge and a culture of continuous improvement.
Change is normal
In modern IT environments, change is guaranteed. This means that systems and infrastructures are constantly put under stress by new factors. As mentioned at the beginning, it is only a matter of time before any system experiences a failure.
Experienced teams have this awareness in their DNA. They are thoroughly prepared for incidents and have access to critical tools and information at all times to help them isolate, troubleshoot and communicate problems quickly and effectively.
Any organization that embraces IT service management is introduced to a new way of thinking and working within a team. Development, operations and support are no longer clearly separated, but go hand in hand.