Disorder and Swarming in Incident Management

This quick post is inspired by the recent discussions about using swarming for incident management, most recently in Learning from DevOps: Why complex systems necessitate new ITSM thinking by Jon Hall.

It all starts with Cynefin. It’s actually rather difficult not to start with Cynefin, because this sense-making framework just … makes so much sense.

Firstly, I would state that many if not most of the IT-related approaches we have been lucky (or not so lucky) to get exposed to have been attempts to deal with the complexity of the systems we are working in. DevOps, SRE, Agile, SAFe, Lean IT, ITIL, … I’m not saying that the attempts have necessarily been successful, or that they really do what it says on the tin, but still. We can debate the merits of all of these, or have pseudo-religious wars over whose wrong approach is the best wrong approach, but that’s for another day.

The constraints (both governing and enabling) we have had to deal with have changed over time, and many of the practices we have in place today are the ‘best we can do under these circumstances’ responses to daily challenges from yesterday.

Making sense

Different types of systems come with different approaches to ‘sense-making in order to act’. What works in the ordered world differs from what is needed in the unordered world. Sometimes we can simply categorize the observable phenomenon and choose our next steps based on our previous knowledge. Sometimes we need expert analysis before we can move ahead. And sometimes, no matter how much we analyze, nothing seems to make sense, which is probably an indication that we are dealing with a complex situation.

Our initial ‘diagnosis’ of a situation can be incorrect, we can understand the situation differently from our colleagues, and situations can move around rather than stay categorically stuck in any one domain, so it’s all rather … tricky (phew) but still very useful.

Augmentation of practices

I will leave aside, and perhaps for another time, the notion of replacing X with Y and its sad and painful history in IT. I’m a strong believer in continual improvement and the everlasting search for better ways of working, and I strongly oppose any attempts to claim that ‘we have now found that thing that works’. This is how scaffolding becomes accepted as the building we intended to raise.

But often, ‘replace’ means ‘do it in a better way’, which I believe is the case here. And we certainly can do much better when it comes to incident management practices.

In the same way that discussions about AI would currently perhaps be more fruitful if the A stood for ‘augmented’, we’re seeing the augmentation and evolution of practices over time. The eras we like to use to describe the prevalence of certain types of practices are an artificial categorization. For someone who compares current practices to those from 10–15 years ago, the difference can seem revolutionary, whereas for someone who has experienced the transition first-hand, it’s all fluid.

As a somewhat simplistic analogy here — think about how people age, and how different the feeling (and noticing) of someone you see on a daily basis getting older is from seeing an old acquaintance for the first time after 10 years. Or how, quite suddenly, you realise your kids have grown up without you even knowing how and when that happened.

Incident management in two stages

Initial triage is what I would see through the lens of Disorder in the Cynefin framework. Something happened and we don’t yet know how to approach it. Is it something we might have a runbook for? Is it something that seems to require expert analysis? Is it something where it looks like we need to stabilize the situation first, and only then look at solutions? Or is it something where we feel lost and need to gather some people to figure out what’s happening?

That is to say: you can (and do) have incidents in complex systems that can be addressed using approaches from Obvious and Complicated domains.

The triage can be manual or (semi-)automated, strictly scripted or heuristics-based, but it must finish with a decision about what to do next. Often, this triage is done by the L1 team sitting at the service desk. For the purposes of this discussion, we can characterize it as an L0 task.
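To make that routing decision concrete, here is a minimal sketch in Python. The incident fields, the levels, and the rules are entirely hypothetical: an illustration of triage as a decision function, not a prescription.

```python
# A minimal, hypothetical sketch of triage as an L0 routing decision.
# The fields, levels, and rules below are illustrative, not a prescription.
from dataclasses import dataclass

@dataclass
class Incident:
    has_runbook: bool            # a known, scripted response exists
    needs_expert_analysis: bool  # symptoms are recognisable, the cause is not yet clear
    symptoms_make_sense: bool    # can anyone form a coherent hypothesis at all?

def triage(incident: Incident) -> str:
    """Map what we can observe about an incident to the next step."""
    if incident.has_runbook:
        return "L1: execute the runbook (Obvious)"
    if incident.needs_expert_analysis:
        return "L2/L3: hand over to experts for analysis (Complicated)"
    if not incident.symptoms_make_sense:
        return "L4: assemble a swarm and run safe-to-fail probes (Complex)"
    return "stay in triage: gather more information (Disorder)"

print(triage(Incident(has_runbook=False, needs_expert_analysis=True, symptoms_make_sense=True)))
```

Whether the rules live in a script, a chatbot, or a human head matters less than the fact that triage always ends with a routing decision.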

The five levels of support

I’m using a very simplified L0-L1-L2-L3-L4 structure here; in the real world, we often have a much more complicated structure with different expert teams, etc. Also, over time, more and more of what would usually be handled by L1 is automated. Even (human) interactions are, to an extent, automated with chatbots and the like. But automating empathy is still somewhat challenging.

The L2 team will then use their expertise to try and solve the incident. The structure of the L2 team can be fixed or ad hoc, based on the nature of the incident. There might not even be an L2 team, just specialists in various technical teams who also carry the role of an L2 analyst, invoked as required.

The L3 team is what I would describe as ‘more specialist expertise is required’. Any technical team or a combination of teams can become a virtual L3 team. It is expected that a solution can be found by combining their expertise. An incident can move from L2 to L3 with a simple ‘guys, I looked at it and I am not sure whether to choose path A or path B; can you please advise’ request, or it could be time-bound or use any other rule that makes sense.
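As one illustration of such a rule, here is a small sketch of a time-boxed escalation check. The function name, the fields, and the two-hour window are assumptions made for the example, not a recommendation.

```python
# A hypothetical time-boxed escalation rule: raise an incident from L2 to a
# virtual L3 team on an explicit 'please advise' request, or when the agreed
# time box expires without a resolution. Names and thresholds are illustrative.
from datetime import datetime, timedelta, timezone

L2_TIME_BOX = timedelta(hours=2)  # illustrative value, not a recommendation

def should_escalate_to_l3(opened_at: datetime, resolved: bool, advice_requested: bool) -> bool:
    """opened_at is expected to be timezone-aware."""
    if resolved:
        return False
    return advice_requested or (datetime.now(timezone.utc) - opened_at > L2_TIME_BOX)
```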

The L4 team would most likely be a virtual team by nature, assembled from specialists across (and outside) the organization when we are dealing with a complex incident. For these, no one knows what the solution could be, or we have several people who think they do know, but their guesses conflict. This is where we need to run probes (when dealing with high ambiguity) or experiments (when dealing with somewhat lower ambiguity) to learn more about the system so that we can spot patterns and causal links, and find solutions.

Here, we will try several things at once (the parallel safe-to-fail approach), monitor the impact of our interventions, and attempt to amplify or dampen the effects based on what seems to work and what doesn’t.
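A sketch of what that could look like in code, with placeholder probe objects and an entirely illustrative amplify-or-dampen decision:

```python
# A hypothetical sketch of the parallel safe-to-fail approach: run several
# small, reversible probes at once, observe the impact of each, then amplify
# what seems to help and dampen (roll back) what does not. The probe objects
# and the impact measure are placeholders.
from concurrent.futures import ThreadPoolExecutor

def run_probe(probe):
    """Apply one small intervention and report its observed impact."""
    probe["apply"]()                                  # e.g. shift a little traffic, toggle a flag
    return probe["name"], probe["measure_impact"]()   # positive means things improved

def swarm_probes(probes):
    with ThreadPoolExecutor() as pool:
        results = dict(pool.map(run_probe, probes))
    for name, impact in results.items():
        action = "amplify" if impact > 0 else "dampen"
        print(f"{action} {name}")
    return results
```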

I want to stress here that the discussion is about solving incidents — stopping the negative impacts of the incident on customers, not about ‘finding root causes’, etc. That’s yet another discussion for yet another day.

So where does swarming fit in?

Using swarming can make a lot of sense for complex incidents, but should not be used for all incidents in complex systems.

One of the key questions is about the flow of work — how incidents move between support levels, how much context is passed on to the next level, and what delays (and resulting costs and additional issues) the process introduces. Being stuck with the wrong team (and in the wrong domain, based on what we already know) can be costly.

If we want to avoid losing context, then we could have the same people involved with every incident from the beginning, bringing in more people as required. Very soon, we will have to deal with the question of efficiency.

Not all incidents are alike even in complex systems. We’re back to granularity.

Does it make sense to use the same approach to all types of incidents? Does it make sense to manually solve incidents where we could leverage automation? There are many trade-offs that need to be made, and it remains a question of what works best in a specific context, for the specific organization, and the specific value expected by their specific customers.

There is no need to use techniques for navigating the Complex domain when dealing with incidents that seem to fall into the Obvious or Complicated domains. Conversely, using runbooks and categorization matrices for complex incidents will lead to rather unhappy customers and possibly the need to update one’s CV.
