Tuesday, February 27, 2024

Root Cause Analysis

Time for the post mortem, as you know, some take it seriously
The need to revisit, the urge to explain what happened previously
For an ounce of prevention is said to be worth a pound of cure
Let's get to the bottom of things and figure out the root cause

Oftentimes it's a simple mistake, a moment of inattention
Or sometimes it's really just idle exploration
One minute you wonder, what does this button do?
Then the thing happens that you can't recant
Oops, you realize you just shut down the power plant

Human fallibility tends to be the root cause

Ah right, power. It's quite fitting in this era of modernity
That one can't sing too highly of the virtues of Electricity
So essential, we almost always overlook this august substance
We only rue its wonder when confronted by its absence
And now we've lost power and everything must stop
Oops, lights out. In Ghana we call it dumsor

Power failures are prime candidates for the root cause

The next affliction, sadly, is all too common
Like ants, human beings just like to burrow
When in a mad rush to lay down some pipes, it's nigh inevitable
So busy that we never checked to see what could be an obstacle
Dig: bureaucracy got in the way, they were moving too slow
Oops, the contractor cut the critical cable with his backhoe

All too often, cable cuts tend to be the root cause

Things fall apart, they say,
   equipment sputters, machines fail
They blow hot and cold, or crack when used,
   there's wear and tear
Material scientists make a roaring trade
   as do structural engineers
That, sadly, alchemists never overcame nature's challenge
   is the lesson learned
Oops, the widget broke,
   a reminder that no condition is permanent

In this industrial age, hardware failures are a likely root cause

Sometimes you're just too popular,
   so crowded no one can get in or breathe
Congestion is the operative word,
   in matters of scale, a crowd changes things
Your service is the flavor of the month,
   and now you've become essential
Oops, you're completely unprepared for when you go viral

Woe is me, lack of capacity is frequently the root cause

And then we come to the bad actors,
   forever on the attack
Always probing for an opening,
   for vulnerabilities in your stack
And that's even before we consider
   the gremlins and parasites
Iconoclastic beasts with distinctive manners
   and singular appetites
Every complex ecosystem in history
   has had to deal with grifters
Oops, your hospital is held to ransom
   by a band of sneaky hackers

Always protect yourself, a lapse in security is invariably the root cause

There's more in this vein,
   mankind has never built a system without error
From the Tower of Babel to that fancy car,
   or even that blasted word processor
The raw materials of life,
   whether it's the design or the initial conception
Imposing one's will,
   it might be a flaw in the ultimate implementation

You probably have your own experience and area of expertise
Your own rules of thumb about these puzzling mysteries
Let me tell you something
   from my profession of software engineer
If you only knew,
   to defend a system in depth is an exercise in fear
How close we come to catastrophe,
   partial or complete, every day
Trust me, you really don't want to see how the sausage is made

Now one could argue about the order
   of this short list of failure modes
It is only in retrospect that one is truly able to diagnose
The human burden is to keep moving
   in the face of systemic error
To mitigate the worst,
   to build the fail-safes and systematic procedures

Spare a thought for the moron in a hurry,
   for one day it could be you
That, through omission or commission,
   will be blamed for the miscue
We haven't scratched the surface
   of how much the human factor has an impact
Fall back to folk wisdom,
   suffice to say that curiosity killed the cat

And what of the wider world,
   say failed love affairs, or even wars?
It's only human to search for simple answers
   and the root cause
Our prophets and philosophers have long emphasized
   moral suasion and the golden rule
You could hardly do worse than social living
   and the mosquito principle

Focus on best practices, usability, and layers of protection
Try to put a process in place and make it official!
Make sure that it takes many, many big red buttons
   to launch that nuclear missile
If there's any moral to this tall tale of root cause analysis
Take heed, and wherever possible, make use of the checklists


wide load coming through


After
  • How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety) by Richard I. Cook
  • a quip about network outages from Sean Donelan

Root Cause, a playlist


A soundtrack for this note (spotify version)
See also: The Dining Philosophers Problem, Resilience and Adaptability, and Version Hell Revisited

This belated entry on failure modes is part of the Toli Technology Series

Writing log: April 3, 2022
