Disorder and Swarming in Incident Management

Kaimar Karu
7 min read · Apr 8, 2019

This quick post is inspired by the recent discussions about using swarming for incident management, most recently in Learning from DevOps: Why complex systems necessitate new ITSM thinking by Jon Hall.

It all starts with Cynefin. It’s actually rather difficult to not start with Cynefin, because this sense-making framework just … makes so much sense.

Firstly, I would state that many, if not most, of the IT-related approaches we have been lucky (or not so lucky) to get exposed to have been attempts to deal with the complexity of the systems we are working in. DevOps, SRE, Agile, SAFe, Lean IT, ITIL, … I’m not saying that the attempts have necessarily been successful, or that these approaches really do what it says on the tin, but still. We can debate the merits of all of them, or have pseudo-religious wars over whose wrong approach is the best wrong approach, but that’s for another day.

The constraints (both governing and enabling) we have had to deal with have changed over time, and many of the practices we have in place today are the ‘best we can do under these circumstances’ responses to daily challenges from yesterday.

Making sense

Overall, when we encounter a new situation, we start by ‘making sense’ in (inauthentic) Disorder, the only phenomenological domain of the five in Cynefin. We don’t yet know what kind of situation or system we’re dealing with, so based on our initial observations of the reality around us, and the characteristics of the different types of systems we can spot, we take a guess about whether we’re dealing with something ordered, complex, or chaotic. The concept of granularity is important here: at too high a level, the system might look unapproachably complex, and at too low a level, we get fooled into expecting observable (or imaginable) causality in everything. And then there is subjectivity.

Different types of systems come with different approaches to ‘sense-making in order to act’. What works in the ordered world differs from what is needed in the unordered world. Sometimes, we can simply categorize the observable phenomenon and choose our next steps based on our previous knowledge. Sometimes we need expert analysis before we can move ahead. And sometimes, no matter how much we analyze, nothing seems to make sense, which is probably an indication that we are dealing with a complex situation.

Our initial ‘diagnosis’ of a situation can be incorrect, we can understand the situation differently from our colleagues, and situations can move around rather than be categorically stuck in any one domain, so it’s all rather … tricky (phew), but still very useful.

Augmentation of practices

The context for the linked article was incident management, and specifically replacing the traditional tiered support structure with swarming, because swarming works much better for complex systems.

I will leave aside, perhaps for another time, the notion of replacing X with Y and its sad and painful history in IT. I’m a strong believer in continual improvement and the everlasting search for better ways of working, and I strongly oppose any attempt to claim that ‘we have now found the thing that works’. This is how scaffolding becomes accepted as the building we intended to raise.

But often, ‘replace’ means ‘do it in a better way’, which I believe is the case here. And we certainly can do much better when it comes to incident management practices.

In the same way that discussions about AI would currently perhaps be more fruitful if the A stood for ‘augmented’, we’re seeing the augmentation and evolution of practices over time. The eras we like to use to describe the prevalence of certain types of practices are an artificial categorization. For someone who compares current practices to those from 10–15 years ago, the difference can seem revolutionary, whereas for someone who has experienced the transition first-hand, it’s all fluid.

As a somewhat simplistic analogy: think about how people age, and how different the feeling (and noticing) of someone you see on a daily basis getting older is from seeing an old acquaintance for the first time after 10 years. Or how, quite suddenly, you realise your kids have grown up without you even knowing how and when that happened.

Incident management in two stages

I would separate incident management into two distinct steps for the purposes of this article: initial triage and incident resolution.

Initial triage is what I would see through the lens of Disorder in the Cynefin framework. Something happened and we don’t yet know how to approach it. Is it something we might have a runbook for? Is it something that seems to require expert analysis? Is it something where it looks like we need to stabilize the situation first, and only then look at solutions? Or is it something where we feel lost and need to gather some people just to figure out what’s happening?

That is to say: you can (and do) have incidents in complex systems that can be addressed using approaches from Obvious and Complicated domains.

The triage can be manual or (semi-)automated, strictly scripted or heuristics-based, but it must end with a decision on what to do next. Often, this triage is done by the L1 team at the service desk. For the purposes of this discussion, we can characterize it as an L0 task.
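To make this concrete, here is a minimal sketch of what such an L0 routing decision could look like. Everything in it is illustrative: the Domain enum, the heuristics inside triage(), and the routing table are assumptions made up for this post, not a recommended classification scheme or a reference to any specific tool.

```python
from enum import Enum, auto

class Domain(Enum):
    OBVIOUS = auto()      # a runbook exists
    COMPLICATED = auto()  # expert analysis is needed
    COMPLEX = auto()      # we need to probe and learn
    CHAOTIC = auto()      # stabilize first, look for solutions later

def triage(incident: dict, runbooks: set) -> Domain:
    """Hypothetical L0 triage: a heuristic guess at the Cynefin domain."""
    if incident.get("category") in runbooks:
        return Domain.OBVIOUS
    if incident.get("customer_impact") == "severe" and not incident.get("known_cause"):
        return Domain.CHAOTIC
    if incident.get("matches_known_component"):
        return Domain.COMPLICATED
    return Domain.COMPLEX

# Route the incident to an initial support level based on the guess.
ROUTING = {
    Domain.OBVIOUS: "L1",
    Domain.COMPLICATED: "L2",
    Domain.COMPLEX: "L4 (swarm)",
    Domain.CHAOTIC: "L4 (stabilize first)",
}

incident = {"category": "password-reset", "customer_impact": "low"}
print(ROUTING[triage(incident, runbooks={"password-reset"})])  # -> L1
```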

The five levels of support

After the L0 work has been completed, hopefully very quickly, incidents that feel Obvious are routed to the L1 team, which attempts to use existing knowledge to solve them. Sometimes they are successful, and sometimes, after the initial diagnosis, they need to escalate the incident to the L2 team because the known solution is not working.

I’m using a very simplified L0-L1-L2-L3-L4 structure here; in the real world, we often have a much more complicated structure with different expert teams, etc. Also, over time, more and more of what would usually be handled by L1 is automated. Even (human) interactions, to an extent, with chatbots and the like. But, automating empathy is still somewhat challenging.

The L2 team will then use their expertise to try and solve the incident. The structure of the L2 team can be fixed or ad hoc, based on the nature of the incident. There might not even be an L2 team, just specialists in various technical teams who also carry the role of an L2 analyst, invoked as required.

The L3 team is what I would describe as ‘more specialist expertise is required’. Any technical team, or a combination of teams, can become a virtual L3 team, with the expectation that a solution can be found by combining their expertise. An incident can move from L2 to L3 with a simple ‘guys, I looked at it and I am not sure whether to choose path A or path B; can you please advise’ request, or the escalation could be time-bound or follow any other rule that makes sense.
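As a side note on that ‘any other rule that makes sense’: a time-bound escalation rule is easy to sketch. The time budgets and level names below are made-up examples, not recommendations.

```python
from datetime import datetime, timedelta, timezone

# Illustrative escalation rules only: an incident moves up a level if the
# current team explicitly asks for advice, or if it has sat at the current
# level longer than a (made-up) time budget.
TIME_BUDGET = {
    "L1": timedelta(minutes=30),
    "L2": timedelta(hours=2),
    "L3": timedelta(hours=8),
}
NEXT_LEVEL = {"L1": "L2", "L2": "L3", "L3": "L4"}

def should_escalate(level: str, picked_up_at: datetime, advice_requested: bool) -> bool:
    elapsed = datetime.now(timezone.utc) - picked_up_at
    return advice_requested or elapsed > TIME_BUDGET.get(level, timedelta.max)

def escalate(level: str) -> str:
    return NEXT_LEVEL.get(level, level)  # L4 is the last stop in this sketch

# Example: an L2 incident picked up three hours ago, no advice requested yet.
picked_up = datetime.now(timezone.utc) - timedelta(hours=3)
if should_escalate("L2", picked_up, advice_requested=False):
    print(escalate("L2"))  # -> L3
```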

The L4 team would most likely be a virtual team by nature, assembled from specialists across the organization (and outside it) when we are dealing with a complex incident. For these incidents, no one knows what the solution could be, or we have several people who think they know, but their guesses conflict. This is where we need to run probes (when dealing with high ambiguity) or experiments (when dealing with somewhat lower ambiguity) to learn more about the system, so that we can spot patterns and causal links, and find solutions.

Here, we will be trying several things at once (the parallel safe-to-fail approach), monitoring the impact of our interventions, and attempting to amplify or dampen the effects based on what seems to work and what doesn’t.
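A toy sketch of that loop might look like the following. The probe names and the ‘error rate delta’ signal are invented for illustration; in reality the interventions, the signals, and the amplify/dampen decisions would be specific to your system.

```python
import random

def run_probe(name: str) -> float:
    """Pretend to apply a small, safe-to-fail intervention and return an
    observed change in error rate (negative = improvement). Stand-in only."""
    return random.uniform(-0.2, 0.2)

# Run several probes in parallel (conceptually), then review the signals.
probes = ["throttle-retries", "shift-traffic-region-b", "restart-cache-layer"]
observations = {name: run_probe(name) for name in probes}

for name, delta in observations.items():
    if delta < -0.05:
        print(f"amplify: {name} (error rate change {delta:+.2f})")
    elif delta > 0.05:
        print(f"dampen / roll back: {name} ({delta:+.2f})")
    else:
        print(f"keep watching: {name} ({delta:+.2f})")
```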

I want to stress here that the discussion is about solving incidents — stopping the negative impacts of the incident on customers, not about ‘finding root causes’, etc. That’s yet another discussion for yet another day.

So where does swarming fit in?

In the more traditional and simplified tiered support model described above, swarming is what would usually happen at L4. It can also be used at L3 if appropriate, but it might not be required there. Using swarming for L1 and L2 might be useful, but it could also be rather wasteful, because the solutions either are known or can be known without assembling a cross-functional (virtual) team. You could also use swarming for L0, but again, it might make more sense to equip L1 to perform those tasks than to turn to swarming.

Using swarming can make a lot of sense for complex incidents, but should not be used for all incidents in complex systems.
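If it helps to see the rule of thumb spelled out, here is one possible (and deliberately crude) encoding of the paragraphs above; the rule itself is a sketch, not a policy.

```python
# Illustrative only: where swarming tends to pay off in the simplified
# L0-L4 model. Level names and domain labels follow the text above.
def use_swarming(level: str, domain: str) -> bool:
    if domain == "complex":   # complex incidents: swarm
        return True
    if level == "L4":         # L4 is the cross-functional virtual team
        return True
    if level == "L3":         # optional at L3, depending on the incident
        return domain not in {"obvious", "complicated"}
    return False              # L0-L2: solutions are known or knowable

assert use_swarming("L4", "complex")
assert not use_swarming("L1", "obvious")
assert not use_swarming("L2", "complicated")
```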

One of the key questions is about the flow of work: how incidents move between support levels, how much context is handed over to the next level, and what delays (and resulting costs and additional issues) the process introduces. Being stuck with the wrong team (and in the wrong domain, based on what we already know) can be costly.

If we want to avoid losing context, then we could have the same people involved with every incident from the beginning, bringing in more people as required. Very soon, we will have to deal with the question of efficiency.

Not all incidents are alike even in complex systems. We’re back to granularity.

Does it make sense to use the same approach to all types of incidents? Does it make sense to manually solve incidents where we could leverage automation? There are many trade-offs that need to be made, and it remains a question of what works best in a specific context, for the specific organization, and the specific value expected by their specific customers.

There is no need to use techniques for navigating the complex domain when dealing with incidents that seem to fall into the Obvious or Complicated domains. Conversely, using runbooks and categorization matrices for complex incidents will lead to rather unhappy customers, and possibly to the need to update one’s CV.

--

Kaimar Karu

Former Minister of IT and Foreign Trade, Republic of Estonia. Digital innovation, GovTech, sustainability, sense-making, decision-making. Navigating complexity.