In a recent article in ACM Queue, Thomas Limoncelli explains how a DevOps mindset and blameless postmortems can help teams celebrate and learn from outages.
Limoncelli recounts an incident that occurred while he was working as a system administrator at Bell Labs in the 1990s. At the time, a simple change resulted in an outage and, although he successfully reverted the change and restored service, Limoncelli says he learned the wrong lesson from the event.
“I panicked. I hid in my office. Prayed that nobody would say anything or notice. And guess what? Nobody did,” he says.
This type of reaction can have unintended consequences, Limoncelli notes, and can lead to larger problems, such as ignored issues and closed lines of communication:
“If IT workers fear they will be punished for outages, they will adopt behavior that leads to even larger outages. Instead, we should celebrate our outages: Document them blamelessly, discuss what we've learned from them openly, and spread that knowledge generously.”
Ultimately, Limoncelli says, an outage can be considered as an investment in the people who have learned from it. “Managed correctly, every outage makes the organization smarter. In short, the goal should be to create a learning culture—one that seeks to make only new mistakes.”
Read the complete article at ACM Queue.