丁久
Originally published at dingjiu1989-hue.github.io

Chaos Engineering: Principles and Practical Tools

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.

Introduction

Chaos engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions in production. Unlike traditional testing, chaos experiments proactively inject failures to uncover weaknesses before they cause customer-impacting incidents. This article covers the principles and practical tools for implementing chaos engineering.

Core Principles

The practice of chaos engineering builds on the Principles of Chaos. That document lists minimizing blast radius as a principle of its own; here it is folded into the third of four points:

1. Build a hypothesis around steady-state behavior: Define measurable indicators that your system is healthy.
2. Vary real-world events: Inject failures that mirror actual production incidents.
3. Run experiments in production: Use a small blast radius and automated rollback.
4. Automate experiments to run continuously: Chaos should be a regular part of operations.
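The principles above can be read as a control loop: check the steady state, inject a fault, check again, and always roll back. The following is a minimal sketch of that loop, not the API of any particular chaos tool. The `Experiment` class, `run_experiment` function, and the toy replica dictionary are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Experiment:
    """Hypothetical container for one chaos experiment (not a real library type)."""
    title: str
    steady_state: Callable[[], bool]  # returns True while the system looks healthy
    inject: Callable[[], None]        # starts the fault (keep the blast radius small)
    rollback: Callable[[], None]      # undoes the fault


def run_experiment(exp: Experiment) -> str:
    """Run one experiment and report the outcome as a short status string."""
    # Principle 1: verify the steady state before touching anything.
    if not exp.steady_state():
        return "aborted: system unhealthy before injection"
    try:
        # Principle 2: introduce a failure that mirrors a real incident.
        exp.inject()
        # Re-check the hypothesis while the fault is active.
        held = exp.steady_state()
    finally:
        # Principle 3: roll back automatically, even if a check raised.
        exp.rollback()
    return "hypothesis held" if held else "weakness found"


if __name__ == "__main__":
    # Toy in-memory "system": healthy while at least one replica is up.
    replicas = {"a": True, "b": True}
    exp = Experiment(
        title="service survives loss of one replica",
        steady_state=lambda: any(replicas.values()),
        inject=lambda: replicas.update(a=False),
        rollback=lambda: replicas.update(a=True),
    )
    print(run_experiment(exp))
```

For the fourth principle, a function like `run_experiment` would be called from a scheduler or CI job rather than by hand. Putting the rollback in `finally` is the important design choice: the fault is removed even when the health check itself fails.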

Steady-State Hypothesis

Define measurable metrics that represent healthy behavior before and after experiments:

# steady-state.yml
steady_state_hypothesis:
  title: "Payment service remains available during node failure"
  probes:
    - name: payment-api-health
      type: http
      provider:
        url: "https://api.example.com/health"
        expected_status: 200
        timeout: 5
    - name: payment-latency-p99
      type: promql
      provider:
        query: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{
              service="payment", status="200"
            }[5m])) by (le))
      expected_value:
        max: 0.5  # p99 under 500 ms (the metric is recorded in seconds)
    - name: error-rate
      type: promql
      provider:
        query: |
          sum(rate(http_requests_total{
            service="payment", status=~"5.."
          }[5m])) / sum(rate(http_requests_total{
            service="payment"
          }[5m]))
      expected_value:
        max: 0.01  # Error rate under 1%
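To make the pass/fail logic behind such a hypothesis concrete, here is a small sketch of how observed probe values might be compared against their upper bounds. It assumes the probe values have already been fetched (for example from Prometheus) and that latency is expressed in seconds. `ProbeResult`, `error_rate`, and `steady_state_holds` are hypothetical helpers, not part of any chaos framework.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ProbeResult:
    """One observed probe value and the upper bound it must stay under."""
    name: str
    value: float
    max_allowed: float


def error_rate(errors: float, total: float) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic at all."""
    return errors / total if total > 0 else 0.0


def steady_state_holds(results: List[ProbeResult]) -> Tuple[bool, List[str]]:
    """Return (ok, names of probes that exceeded their bound)."""
    failed = [r.name for r in results if r.value > r.max_allowed]
    return (not failed, failed)


if __name__ == "__main__":
    # Illustrative numbers only: p99 of 0.73 s breaks the 0.5 s bound.
    during_fault = [
        ProbeResult("payment-latency-p99", 0.73, 0.5),
        ProbeResult("error-rate", error_rate(12, 2000), 0.01),
    ]
    ok, failed = steady_state_holds(during_fault)
    print(ok, failed)
```

Returning the names of the failing probes, rather than a bare boolean, makes it easier to tell which part of the hypothesis a fault actually violated.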

Chaos Monkey and Simian Army

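Netflix's Chaos Monkey randomly terminates production instances, originally EC2 virtual machines, so that services are built to survive instance failure. The same idea can be applied inside a single application. The configuration below uses the property format of Chaos Monkey for Spring Boot, a separate open-source project that attacks Spring beans rather than whole instances. It enables latency assaults of 3 to 10 seconds on every watched bean type:

# Chaos Monkey for Spring Boot configuration
chaos.monkey:
  enabled: true
  assaults:
    level: 3  # attack every 3rd request (a frequency, not an intensity scale)
    latency-active: true
    latency-range-start: 3000   # milliseconds
    latency-range-end: 10000    # milliseconds
  watcher:
    controller: true
    restController: true
    service: true
    component: true
    repository: true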
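The effect of a latency assault can be illustrated outside Spring with a short sketch. The decorator below delays every `level`-th call by a random duration within a configured range. It is an illustration of the idea under the assumption that `level` means "every Nth request", not a port of the library's implementation, and `latency_assault` and `get_payment` are hypothetical names.

```python
import random
import time
from functools import wraps
from typing import Callable, Optional


def latency_assault(level: int, start_ms: int, end_ms: int,
                    sleep: Callable[[float], None] = time.sleep,
                    rng: Optional[random.Random] = None):
    """Delay every `level`-th call by a random time in [start_ms, end_ms]."""
    if level < 1:
        raise ValueError("level must be >= 1")
    rng = rng or random.Random()
    calls = 0  # shared counter; not thread-safe, which is fine for a sketch

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal calls
            calls += 1
            if calls % level == 0:
                # Convert milliseconds to seconds for the sleep function.
                sleep(rng.randint(start_ms, end_ms) / 1000.0)
            return func(*args, **kwargs)
        return wrapper
    return decorator


if __name__ == "__main__":
    delays = []  # record delays instead of really sleeping

    @latency_assault(level=3, start_ms=3000, end_ms=10000, sleep=delays.append)
    def get_payment(payment_id):
        return {"id": payment_id, "status": "ok"}

    for i in range(6):
        get_payment(i)
    print(len(delays))  # calls 3 and 6 were delayed
```

Passing the `sleep` function in as a parameter keeps the sketch testable: the demo records the delays it would have applied instead of blocking for several seconds.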

For Spring Boot applications, integrate Chaos Monkey directly:

# application.yml
spring:
  application:
    name: payment-service
chaos:
  monkey:
    enabled: true
    watcher:
      controller: true
    assaults:
      exceptions-active: true
      kill-application-active: false
      memory-active: false

LitmusChaos on Kubernetes

LitmusChaos provides declarative chaos experiments as Kubernetes CRDs:

apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-chaos
spec:
  appinfo:
    appns: "production"
    applabel: "app=payment"
    appkind: "deployment"
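The three `appinfo` fields together narrow down which workload an experiment may touch: a namespace, a `key=value` label, and a resource kind. The sketch below shows that matching logic on plain dictionaries. It does not call the Kubernetes API and is not how LitmusChaos resolves targets internally; `parse_label`, `select_targets`, and the sample workloads are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_label(selector: str) -> Tuple[str, str]:
    """Split an equality selector such as 'app=payment' into (key, value)."""
    key, sep, value = selector.partition("=")
    if not sep or not key.strip():
        raise ValueError(f"expected key=value, got {selector!r}")
    return key.strip(), value.strip()


def select_targets(workloads: List[Dict], appns: str,
                   applabel: str, appkind: str) -> List[str]:
    """Return names of workloads matching namespace, label, and kind."""
    key, value = parse_label(applabel)
    return [
        w["name"] for w in workloads
        if w["namespace"] == appns
        and w["kind"].lower() == appkind.lower()
        and w["labels"].get(key) == value
    ]


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    workloads = [
        {"name": "payment-api", "namespace": "production",
         "kind": "Deployment", "labels": {"app": "payment"}},
        {"name": "payment-api", "namespace": "staging",
         "kind": "Deployment", "labels": {"app": "payment"}},
        {"name": "orders-api", "namespace": "production",
         "kind": "Deployment", "labels": {"app": "orders"}},
    ]
    print(select_targets(workloads, "production", "app=payment", "deployment"))
```

Requiring all three fields to match is what keeps the blast radius small: the staging copy and the unrelated orders service are both excluded.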


Read the full article on AI Study Room for complete code examples, comparison tables, and related resources.
