Root Cause Analysis
Time for the post mortem, as you know, some take it seriously
The need to revisit, the urge to explain what happened previously
For an ounce of prevention is said to be worth a pound of cure
Let's get to the bottom of things and figure out the root cause
Oftentimes it's a simple mistake, a moment of inattention
Or sometimes it's really just idle exploration
One minute you wonder, what does this button do?
Then the thing happens that you can't recant
Oops, you realize you just shut down the power plant
Human fallibility tends to be the root cause
Ah right, power. It's quite fitting in this era of modernity
That one can't sing too highly of the virtues of Electricity
So essential, we almost always overlook this august substance
We only rue its wonder when confronted by its absence
And now we've lost power and everything must stop
Oops, lights out. In Ghana we call it dumsor
Power failures are prime candidates for the root cause
The next affliction, sadly, is all too common
Like ants, human beings just like to burrow
When in a mad rush to lay down some pipes, it's nigh inevitable
So busy that we never checked to see what could be an obstacle
Dig: bureaucracy got in the way, they were moving too slow
Oops, the contractor cut the critical cable with his backhoe
All too often, cable cuts tend to be the root cause
Things fall apart, they say,
equipment sputters, machines fail
They blow hot and cold, or crack when used,
there's wear and tear
Material scientists make a roaring trade
as do structural engineers
That, sadly, alchemists never overcame nature's challenge
is the lesson learned
Oops, the widget broke,
a reminder that no condition is permanent
In this industrial age, hardware failures are a likely root cause
Sometimes you're just too popular,
so crowded no one can get in or breathe
Congestion is the operative word,
in matters of scale, a crowd changes things
Your service is the flavor of the month,
and now you've become essential
Oops, you're completely unprepared for when you go viral
Woe is me, lack of capacity is frequently the root cause
And then we come to the bad actors,
forever on the attack
Always probing for an opening,
for vulnerabilities in your stack
And that's even before we consider
the gremlins and parasites
Iconoclastic beasts with distinctive manners
and singular appetites
Every complex ecosystem in history
has had to deal with grifters
Oops, your hospital is held to ransom
by a band of sneaky hackers
Always protect yourself, a lapse in security is invariably the root cause
There's more in this vein,
mankind has never built a system without error
From the Tower of Babel to that fancy car,
or even that blasted word processor
The raw materials of life,
whether it's the design or the initial conception
Imposing one's will,
it might be a flaw in the ultimate implementation
You probably have your own experience and area of expertise
Your own rules of thumb about these puzzling mysteries
Let me tell you something
from my profession of software engineer
If you only knew,
to defend a system in depth is an exercise in fear
How close we come to catastrophe,
partial or complete, every day
Trust me, you really don't want to see how the sausage is made
Now one could argue about the order
of this short list of failure modes
It is only in retrospect that one is truly able to diagnose
The human burden is to keep moving
in the face of systemic error
To mitigate the worst,
to build the fail-safes and systematic procedures
Spare a thought for the moron in a hurry,
for one day it could be you
That, through omission or commission,
will be blamed for the miscue
We haven't scratched the surface
of how much the human factor has an impact
Fall back to folk wisdom,
suffice to say that curiosity killed the cat
And what of the wider world,
say failed love affairs, or even wars?
It's only human to search for simple answers
and the root cause
Our prophets and philosophers have long emphasized
moral suasion and the golden rule
You could hardly do worse than social living
and the mosquito principle
Focus on best practices, usability, and layers of protection
Try to put a process in place and make it official!
Make sure that it takes many, many big red buttons
to launch that nuclear missile
If there's any moral to this tall tale of root cause analysis
Take heed, and wherever possible, make use of the checklists
After
- How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety) by Richard I. Cook
- a quip about network outages from Sean Donelan
Root Cause, a playlist
A soundtrack for this note (Spotify version)
- Oops (Oh My) by Tweet ft. Missy Elliott
- My Mistake (Was to Love You) by Marvin Gaye
- Reasons by Earth, Wind & Fire
- Mistake by Fela Kuti
- Knockin' at the Wrong Door by The Rollers
- A Few Reasons by Dwele
- Who Can We Blame by Mica Paris
- Lessons by Eric Roberson
See also: The Dining Philosophers Problem, Resilience and Adaptability, and Version Hell Revisited
This belated entry on failure modes is part of the Toli Technology Series
File under: failure, systems, resilience, best practices, human factors, networks, operations, error, technology, software, hardware, culture, observation, perception, design, humour, strategy, Observers are worried, poetry, toli
Writing log: April 3, 2022