Chaos Engineering Explained

By Amber Ankerholz, 20 April, 2022

Chaos engineering “may sound like an oxymoron or the title of a bad science-fiction movie,” says Fredric Paul at New Relic, but “it’s actually an increasingly popular approach to improving the resiliency of complex, modern technology architectures.”

In this article, we’ll look at the practice of chaos engineering and explain the methodology and tools involved in this testing approach.

What Is Chaos Engineering?

Chaos engineering is “the discipline of experimenting on a system in order to build confidence in the system’s capability to withstand turbulent conditions in production,” per the official definition from the Principles of Chaos Engineering. The goal, as TechTarget explains, “is to identify weakness in a system through controlled experiments that introduce random and unpredictable behavior.”

The practice was originally developed at Netflix as a way to better understand and observe the behavior of distributed systems and identify points of failure. As Paul explains, the move to distributed architectures, such as that built by Netflix to replace its monolithic stack, “introduced new types of complexity that required significantly more reliable and fault-tolerant systems.”

Thus, Netflix engineers created Chaos Monkey. The tool, which works by randomly shutting down virtual machine instances and containers running inside your production environment, “gave the company a way to proactively test everyone’s resilience to a failure, and do it during business hours so that people could respond to any potential fallout when they had the resources to do so, rather than at 3 a.m. when pagers typically go off,” says Casey Rosenthal, who built the chaos engineering team at Netflix back in 2015.
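To make the idea concrete, here is a minimal, hypothetical sketch of random instance termination. It is not Chaos Monkey's actual code and calls no cloud API: the `Instance` class, the `terminate_random_instance` and `service_is_healthy` helpers, and the three-replica fleet are all assumptions made for illustration, and everything runs as an in-memory simulation.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a VM or container; not a real cloud API object."""
    name: str
    running: bool = True


def terminate_random_instance(instances, rng):
    """Pick one running instance at random and mark it stopped.

    Returns the stopped instance, or None if nothing is running.
    """
    candidates = [i for i in instances if i.running]
    if not candidates:
        return None
    victim = rng.choice(candidates)
    victim.running = False
    return victim


def service_is_healthy(instances, min_running):
    """Assumed health rule: the service is fine while enough replicas stay up."""
    return sum(i.running for i in instances) >= min_running


rng = random.Random(42)  # seeded so the simulation is repeatable
fleet = [Instance(f"web-{n}") for n in range(3)]

victim = terminate_random_instance(fleet, rng)
healthy = service_is_healthy(fleet, min_running=2)
print(f"terminated {victim.name}; service healthy: {healthy}")
```

In this toy model the service survives the loss of one replica because it was provisioned with a spare, which is the kind of resilience that random termination is meant to verify.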

Chaos in Control

“It’s a misnomer to think of chaos engineering as actually chaotic,” Paul says. In fact, “chaos engineering involves thoughtful, planned, and controlled experiments designed to demonstrate how your systems behave in the face of failure.”

The process typically involves the following four steps, according to TechTarget:

1. Establish a baseline. Testers must identify how the system should operate under optimal conditions and specify what constitutes a normal working state.
2. Create a hypothesis. Consider one or more potential weaknesses and formulate a hypothesis about the effects of those weaknesses. For example, what happens if a large traffic spike occurs?
3. Test. Conduct experiments to gauge the consequences of a particular event. For example, a traffic spike simulation might reveal a storage performance issue.
4. Evaluate. Measure and evaluate how your hypothesis holds up and determine which problems to fix.
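The four steps above can be sketched as a small script. This is an illustrative simulation only: the `simulated_latency_ms` model, the 100 requests-per-second capacity, and the "within twice the baseline" threshold are assumptions chosen for the example, not figures from TechTarget or from any real system.

```python
import statistics


def simulated_latency_ms(requests_per_second, capacity_rps=100):
    """Toy model (an assumption): latency is flat up to capacity, then grows linearly."""
    base = 50.0
    overload = max(0, requests_per_second - capacity_rps)
    return base + 2.0 * overload


def measure(load_levels):
    """Return the median simulated latency across a list of load levels."""
    return statistics.median(simulated_latency_ms(rps) for rps in load_levels)


# 1. Establish a baseline under normal load.
baseline_ms = measure([40, 50, 60])

# 2. Create a hypothesis: during a traffic spike, latency stays within 2x baseline.
max_allowed_ms = 2 * baseline_ms

# 3. Test: take the same measurement under a simulated spike.
spike_ms = measure([150, 180, 200])

# 4. Evaluate: compare the observation with the hypothesis.
hypothesis_holds = spike_ms <= max_allowed_ms
print(f"baseline={baseline_ms} ms, spike={spike_ms} ms, hypothesis holds: {hypothesis_holds}")
```

Here the hypothesis fails, which is a useful result: the experiment has exposed a weakness under load that the team can now decide whether to fix.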

“It’s important not to rush into the practice of chaos engineering without proper planning and designing of experiments,” cautions Biswajit Mohapatra. “Every chaos experiment should begin with a hypothesis. The test should be designed with a small scope with a focused team working on the same. Every organization should focus on controlled chaos promoted by observability to improve system resiliency.”

Various chaos engineering tools can also “help automate the process of temporarily disabling or throttling specific components of infrastructure to assess its effect on applications in production,” notes TechTarget.

Fixing Not Breaking

Chaos engineering is sometimes described as “breaking stuff in production,” Rosenthal notes. But, “while this might sound cool, it doesn’t appeal to enterprises running at scale and other complex system operators who can most benefit from the practice,” he says.

The idea of “fixing stuff in production” more effectively captures the value of chaos engineering, he says, because it is the “only major discipline in software that focuses solely on proactively improving safety in complex systems.”