Or: Why ITSM no longer separates development, operations and support
In ITSM, development, operations and support are no longer clearly separated, but go hand in hand. As a result, organizations need to ensure that ITSM teams are actually operational when on-call. We provide 7 tips to help you do just that.
Before ITSM (IT Service Management), IT management approaches focused on the technologies themselves. Traditionally, there were developers on one side who created software products, services and infrastructures. On the other side were the system or IT administrators who maintained these products, services and infrastructures. Each IT team was focused on achieving its own goals.
But as we learned in the last blog article, ITSM focuses on delivering IT services in a well-organized way. In this approach, IT is no longer primarily aligned with its own goals, but with the business goals of its organization. In this context, the ITSM idea can also be implemented within a team in such a way that it is oriented toward ITIL practices and influenced by DevOps concepts.
In this case, there is usually no longer a separation between development, operations and support. Instead, all stakeholders are responsible for ensuring that the systems function reliably. To ensure this, team members have to be able to rush to the rescue when problems and real emergencies occur during system operation. Accordingly, it's important to prepare ITSM teams to be ready to go when on call. We have 7 tips for you on how to prepare technicians who are new to the team for on-call duty!
1. Explain the basics of on-call plans and escalations
This first tip is probably a fairly obvious one, but it's still critically important. After all, without knowing how on-call duty and escalations are organized in your company, new team members cannot contribute effectively.
For example, does your company have different rotas for day and night shifts? Do you have something like a primary and secondary on-call schedule? Do you have appropriate escalation procedures in place in case your technician can't be there for some reason? These are all examples of basic information that your new team members should know.
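To make the idea of a primary and secondary on-call schedule with escalation more concrete, here is a minimal sketch in Python. The responder names and the waiting times (10 and 30 minutes) are purely illustrative assumptions, not values from any particular tool; your own policy will differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EscalationStep:
    responder: str      # who is notified at this step (hypothetical names)
    after_minutes: int  # minutes the alert has gone unacknowledged


# Hypothetical policy: primary first, then secondary, then the team lead.
# The delays are assumptions chosen for illustration only.
POLICY = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 10),
    EscalationStep("team-lead", 30),
]


def responders_to_notify(minutes_unacknowledged: int,
                         policy: List[EscalationStep] = POLICY) -> List[str]:
    """Return everyone who should have been notified by this point."""
    return [step.responder for step in policy
            if step.after_minutes <= minutes_unacknowledged]


if __name__ == "__main__":
    for minutes in (0, 15, 45):
        print(minutes, responders_to_notify(minutes))
```

Walking new team members through a table like `POLICY` answers the questions above in one place: who is called first, who is the backup, and how long an alert may stay unacknowledged before it moves on.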
A practical example is provided by Opsgenie, Atlassian's central incident management platform: Below you can see the schedule with which Opsgenie organizes itself and the escalation guidelines that are applied.
2. Establish rules for notifying technicians in case of an incident
As the term "on-call technician" implies, your team members must always be available and ready for action in case of an incident. So work with them to establish rules for notifying them in the event of an incident.
For example, it is best practice to classify incidents: Depending on the urgency level, different notification methods can be used. While high-urgency incidents require a combination of mobile push and voice notifications, for less urgent or purely informational incidents an email, text message, mobile push or voice notification is sufficient.
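Such a classification can be written down as a simple mapping from urgency level to notification channels. The following Python sketch is only an illustration: the urgency names, the channel identifiers and the choice of a single channel for the lower levels are assumptions, not the configuration of any specific product.

```python
from enum import Enum
from typing import List


class Urgency(Enum):
    HIGH = "high"
    LOW = "low"
    INFO = "info"


# Assumed rules: high urgency combines mobile push and voice,
# lower levels use one channel each (the concrete picks are illustrative).
CHANNELS = {
    Urgency.HIGH: ["mobile_push", "voice"],
    Urgency.LOW: ["mobile_push"],
    Urgency.INFO: ["email"],
}


def channels_for(urgency: Urgency) -> List[str]:
    """Return a copy of the channel list configured for this urgency."""
    return list(CHANNELS[urgency])


if __name__ == "__main__":
    for level in Urgency:
        print(level.value, "->", channels_for(level))
```

Keeping the rules in one table makes them easy to review together with the technicians who will be woken up by them.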
To give you a better idea of what this might look like in practice, here's a look at Opsgenie's defined rules for notification in the case of an incident:
3. Make sure everyone has the right tools and access rights
In a real emergency, speed is of the essence: to ensure that your technicians are not slowed down by insufficient tool knowledge or denied access rights, it is essential that everyone is familiar with the necessary commands and has the relevant access rights for the appropriate environments.
So check that everyone on the team is familiar with the following:
- SSH credentials
- sudo access permissions
- ChatOps commands
- Link to runbooks
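If you want to track this checklist per person, a small script is enough. The sketch below assumes a hypothetical set of item keys that mirrors the list above and simply reports what a new team member has not yet confirmed.

```python
from typing import Iterable, List

# Hypothetical keys mirroring the checklist above.
REQUIRED_ITEMS = {
    "ssh_credentials",
    "sudo_access",
    "chatops_commands",
    "runbook_links",
}


def missing_items(confirmed: Iterable[str]) -> List[str]:
    """Return the checklist items not yet confirmed, in alphabetical order."""
    return sorted(REQUIRED_ITEMS - set(confirmed))


if __name__ == "__main__":
    print(missing_items({"ssh_credentials", "sudo_access"}))
```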
4. Make sure everyone knows your infrastructure and technology portfolio
Problems in the system are similar to health problems: To get rid of them, you have to find the cause. Therefore, it is extremely important that all technicians know your organization's infrastructure in order to understand the cause of a problem and ultimately solve it.
Pass on your knowledge of the infrastructure and technology portfolio to new team members and ensure that the associated documentation is always complete and up-to-date.
5. Train your technicians to use relevant diagnostic tools
Depending on the team, different diagnostic tools are used to track operational integrity, application performance, or resource utilization. For this reason, it's important for your team to look at several such tools and learn how to use them.
For example, you could train your on-call technicians on the following diagnostic tools:
- Icinga or Grafana: In most scenarios, pinpoint even a complex problem in the shortest possible time by querying the right metrics for the incident in question.
- Amazon CloudWatch: Use CloudWatch to monitor almost all of your AWS services.
- Telegraf and InfluxDB: If you store customer logs or performance data in a TIG stack (Telegraf, InfluxDB, Grafana), your engineers should know the different types of logs and how they are used and analyzed.
6. Set rules for notifications to the on-call schedule
This tip is as simple as it is logical: Your on-call colleagues need to know when their next shift is coming up. So make sure that all team members have configured rules for notifications regarding the on-call schedule.
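As a rough sketch of what such a rule amounts to, the following Python snippet computes reminder times ahead of a shift. The lead times (one day and one hour before the shift starts) are assumptions for illustration; choose whatever suits your team.

```python
from datetime import datetime, timedelta
from typing import List, Sequence

# Assumed lead times: remind one day and one hour before the shift starts.
REMINDER_LEAD_TIMES = (timedelta(days=1), timedelta(hours=1))


def reminder_times(shift_start: datetime, now: datetime,
                   lead_times: Sequence[timedelta] = REMINDER_LEAD_TIMES
                   ) -> List[datetime]:
    """Return the reminder timestamps still in the future, earliest first."""
    candidates = [shift_start - lead for lead in lead_times]
    return sorted(t for t in candidates if t > now)


if __name__ == "__main__":
    shift = datetime(2024, 1, 10, 8, 0)
    print(reminder_times(shift, now=datetime(2024, 1, 8, 0, 0)))
```

Reminders whose time has already passed are dropped, so a technician who is added to the schedule at short notice only receives the ones that are still useful.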
This is what it looks like in the implementation at Opsgenie:
7. Define responsibilities for first responders to incidents
What is the specific incident process for your on-call technicians? This process should be clearly defined (and documented) to avoid misunderstandings and frustration.
Here are a few sample questions that should be clarified in this context:
- When should an incident be acknowledged?
- How is an incident prioritized and classified?
- When should it be escalated to team members with more experience or to other teams?
- When are the appropriate stakeholders (such as managers and customer support) informed?
- What should be done when on-call technicians are away from the computer for a short period of time?
- How are incidents documented for post-mortem analysis?
Ready to get started? Train your ITSM team!
Every organization that embraces ITSM demands a new way of thinking and working from its teams. Development, operations and support are no longer clearly separated, but go hand in hand. At the same time, it is important to ensure that everyone on the team is actually operational during the on-call period. With the tips presented here, you can prepare your new on-call technicians in the best possible way.