14.6 Chaos Engineering for Supply Chain Resilience¶
On March 22, 2016, npm experienced a cascade of failures after Azer Koçulu unpublished his packages, including the widely used left-pad. Builds broke worldwide. Companies discovered their deployment pipelines had no fallback—they simply assumed npm would always be available. Those who had tested for registry unavailability recovered quickly; those who hadn't were left scrambling to understand why nothing would build. Chaos engineering—deliberately introducing failures to test system resilience—would have revealed this vulnerability before it became a crisis.
This section applies chaos engineering principles to supply chain security, covering how to simulate dependency failures, registry outages, and compromises to improve your organization's preparedness.
Chaos Engineering Fundamentals¶
Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Netflix pioneered the approach with Chaos Monkey, which randomly terminates production instances to ensure services handle failures gracefully.
Core Principles:
- Build a hypothesis about steady state: Define what "normal" looks like (see the baseline sketch after this list)
- Vary real-world events: Introduce realistic failures
- Run experiments in production: Test where it matters (safely)
- Automate experiments: Continuous validation, not one-time tests
- Minimize blast radius: Contain damage from experiments
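Applied to a supply chain, the first principle means measuring your build pipeline under normal conditions before injecting anything. A minimal sketch, assuming a Node project; swap in your own install and build commands:

```bash
# Record a steady-state baseline (duration and success) to compare experiment runs against.
for i in 1 2 3; do
  start=$SECONDS
  if npm ci && npm run build; then result=ok; else result=fail; fi
  echo "baseline run $i: $((SECONDS - start))s ($result)"
done
```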
Applied to Supply Chains:
Supply chain chaos engineering tests resilience to:
- Dependency unavailability
- Registry outages
- Compromised packages
- Build infrastructure failures
- Certificate and signing failures
Unlike traditional chaos engineering (which focuses on runtime), supply chain chaos spans build time, deployment, and runtime phases.
Many organizations discover their critical infrastructure dependencies only during outages. A public registry outage that halts all builds reveals systemic fragility. Regular chaos engineering—intentionally disrupting registry availability—identifies and hardens these single points of failure before real outages occur.
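A quick way to surface those dependencies before an outage does it for you is to list every host your lockfile resolves packages from. A rough sketch for npm projects; other ecosystems' lockfiles can be mined the same way:

```bash
# Count the registry/CDN hosts referenced by resolved package URLs in the lockfile.
grep -oE 'https?://[^/"]+' package-lock.json | sort | uniq -c | sort -rn
```

Each host in the output is a candidate single point of failure, and a candidate chaos experiment.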
Supply Chain-Specific Chaos Experiments¶
Design experiments around realistic supply chain failure scenarios.
Experiment Categories:
| Category | Failure Type | Impact |
|---|---|---|
| Dependency | Package removed/corrupted | Build failure, runtime error |
| Registry | npm/PyPI unavailable | Build blocked |
| CDN | jsDelivr/unpkg down | Frontend failures |
| Build system | CI/CD outage | Deployment blocked |
| Signing | Certificate expired/revoked | Verification failures |
| Network | Registry unreachable | Build/deploy blocked |
Example Experiments:
Experiment 1: Critical Dependency Unavailable
## Experiment: lodash Unavailability
**Hypothesis**: Application builds successfully without lodash available
**Method**:
- Block the npm registry endpoints that serve the lodash package (metadata and tarball URLs)
- Attempt build
- Measure: Does build succeed from cache? How long until failure?
**Expected Outcome**: Build succeeds from local cache
**Actual Outcome**: [Record results]
**Learning**: [Document gaps]
Experiment 2: Registry Timeout
## Experiment: npm Registry Slow Response
**Hypothesis**: Builds complete within SLA despite registry latency
**Method**:
- Inject 30-second delay on registry responses
- Run standard build pipeline
- Measure: Build time, timeout behavior, retry logic
**Expected Outcome**: Build completes with extended time, no failure
**Actual Outcome**: [Record results]
Experiment 3: Package Integrity Failure
## Experiment: Checksum Mismatch
**Hypothesis**: Build fails safely when package integrity check fails
**Method**:
- Intercept package download, modify contents
- Attempt install with integrity checking enabled
- Verify: Build fails with clear error, no partial install
**Expected Outcome**: Clean failure, no corrupted installation
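One way to run this experiment without standing up an intercepting proxy is to corrupt the integrity hash recorded in the lockfile rather than the download itself, which exercises the same verification path. A sketch, assuming a Node project with a lockfileVersion 2+ package-lock.json and jq installed; "node_modules/lodash" is a hypothetical target entry:

```bash
# Corrupt one recorded integrity hash, then confirm `npm ci` fails closed.
cp package-lock.json package-lock.json.bak
jq '.packages["node_modules/lodash"].integrity = "sha512-notTheRealHash"' \
  package-lock.json > package-lock.tmp && mv package-lock.tmp package-lock.json

if npm ci; then
  echo "✗ install succeeded despite an integrity mismatch"
  STATUS=1
else
  echo "✓ install failed closed (an EINTEGRITY-style error is the expected outcome)"
  STATUS=0
fi

mv package-lock.json.bak package-lock.json
exit $STATUS
```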
Dependency Failure Injection Techniques¶
Several techniques enable controlled dependency failures.
Network-Level Injection:
# Block specific registry endpoints (the hostname resolves to its current IPs when the rule is added)
iptables -A OUTPUT -d registry.npmjs.org -j DROP
# Inject latency on all outbound traffic on eth0
tc qdisc add dev eth0 root netem delay 5000ms
# Simulate packet loss instead (remove the previous qdisc first: tc qdisc del dev eth0 root)
tc qdisc add dev eth0 root netem loss 50%
Proxy-Based Injection:
# Toxiproxy-style chaos proxy for the npm registry (illustrative; toxics are typically attached at runtime via the Toxiproxy API or toxiproxy-cli)
proxies:
- name: npm-registry
listen: localhost:8080
upstream: registry.npmjs.org:443
toxics:
- type: latency
attributes:
latency: 10000
jitter: 5000
- type: timeout
attributes:
timeout: 30000
Application-Level Injection:
// Simulate a dependency whose module fails to load entirely
jest.mock('critical-dependency', () => {
  throw new Error('Simulated dependency failure');
});
// Or return degraded functionality (fallbackBehavior is an application-provided stand-in)
jest.mock('feature-library', () => ({
  feature: () => fallbackBehavior()
}));
Build-Time Injection:
# CI pipeline with chaos stage
chaos-test:
stage: test
script:
# Block external registries
- echo "127.0.0.1 registry.npmjs.org" >> /etc/hosts
# Attempt build from cache only
- npm ci --offline
# Verify build succeeds
- npm run build
Graceful Degradation Patterns¶
Design systems to degrade gracefully when dependencies fail.
Pattern 1: Cached Fallback
async function fetchData() {
try {
// Try external service
const data = await externalService.fetch();
cache.set('data', data);
return data;
} catch (error) {
// Fall back to cached data
const cached = cache.get('data');
if (cached) {
metrics.increment('fallback.cache_hit');
return cached;
}
throw new DegradedServiceError('No cached data available');
}
}
Pattern 2: Feature Flags for Dependencies
// Disable feature when dependency unavailable
const featureEnabled = config.get('features.analytics');
if (featureEnabled) {
try {
await analytics.track(event);
} catch (error) {
// Disable feature, continue without
config.set('features.analytics', false);
logger.warn('Analytics disabled due to failure');
}
}
Pattern 3: Circuit Breaker
// Wrap the external call in a circuit breaker (the options shown match libraries such as opossum)
const CircuitBreaker = require('opossum');
const circuitBreaker = new CircuitBreaker(externalCall, {
timeout: 3000,
errorThresholdPercentage: 50,
resetTimeout: 30000
});
circuitBreaker.fallback(() => fallbackResponse);
circuitBreaker.on('open', () => {
alerts.send('Circuit breaker opened for external service');
});
Degradation Testing Matrix:
| Dependency | Degraded Mode | User Impact |
|---|---|---|
| Analytics | Disabled | None visible |
| Search | Cached results | Stale results |
| Authentication | Cached sessions | Existing sessions only; no new logins |
| Payment | Queue for retry | Delayed processing |
| CDN assets | Self-hosted fallback | Slower loading |
Registry and CDN Unavailability Testing¶
Test resilience to infrastructure your builds depend on.
Registry Unavailability:
#!/bin/bash
# Test build resilience to registry outage
echo "=== Registry Unavailability Test ==="
# Save current state
cp /etc/hosts /etc/hosts.backup
# Block registries
echo "127.0.0.1 registry.npmjs.org" >> /etc/hosts
echo "127.0.0.1 registry.yarnpkg.com" >> /etc/hosts
# Attempt offline build
npm ci --offline
BUILD_RESULT=$?
# Restore
mv /etc/hosts.backup /etc/hosts
if [ $BUILD_RESULT -eq 0 ]; then
echo "✓ Build succeeded offline"
else
echo "✗ Build failed without registry access"
exit 1
fi
CDN Failover Testing:
// Test CDN fallback in browser
describe('CDN Resilience', () => {
beforeEach(() => {
// Mock CDN failure
cy.intercept('https://cdn.jsdelivr.net/**', { forceNetworkError: true });
});
it('loads application when primary CDN fails', () => {
cy.visit('/');
cy.get('#app').should('be.visible');
// Verify fallback CDN or self-hosted assets loaded
});
});
Testing Private Registry Resilience:
# Test Artifactory/Nexus failover
chaos-registry:
script:
# Stop primary registry
- docker stop artifactory-primary
# Verify builds use secondary
- npm ci --registry https://artifactory-secondary.internal
# Verify caching layer handles failover
- npm ci # Should use cached packages
# Restore
- docker start artifactory-primary
Build System Resilience Testing¶
CI/CD systems are critical supply chain infrastructure.
CI/CD Chaos Experiments:
| Experiment | Method | Tests |
|---|---|---|
| Runner unavailability | Terminate runners mid-build | Job recovery, retry logic |
| Cache corruption | Clear/corrupt build cache | Build without cache |
| Secret unavailability | Revoke vault access | Graceful secret failure |
| Artifact storage failure | Block artifact upload | Build completion without artifacts |
| Database failure | Stop CI database | Job queue resilience |
Build Cache Chaos:
# GitHub Actions: Test build without cache
name: Cache Resilience Test
on:
schedule:
- cron: '0 3 * * 0' # Weekly
jobs:
no-cache-build:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
# Explicitly skip cache
- name: Install without cache
run: |
rm -rf ~/.npm
npm ci
- name: Build
run: npm run build
- name: Record timing
run: echo "Build completed in $SECONDS seconds"
Parallel Build Failure:
# Test handling of partial build failures (GitHub Actions)
matrix-chaos:
  strategy:
    fail-fast: false
    matrix:
      os: [ubuntu, macos, windows]
      node: [18, 20]
  runs-on: ${{ matrix.os }}-latest
  steps:
    - name: Random failure injection
      shell: bash  # the matrix includes Windows, where the default shell is PowerShell
      run: |
        if [ $((RANDOM % 5)) -eq 0 ]; then
          echo "Injecting chaos failure"
          exit 1
        fi
Experiment Design and Safety¶
Run chaos experiments safely to avoid unintended impact.
Safety Principles:
- Start in non-production: Test staging/dev first
- Limit blast radius: Affect minimal scope
- Have rollback ready: Quick recovery mechanism (see the harness sketch after this list)
- Monitor closely: Watch for unexpected cascades
- Communicate: Inform affected teams
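These principles can be baked into the experiment harness itself. A minimal sketch around the hosts-file blocking technique shown earlier: the trap guarantees rollback however the script exits, and `timeout` caps the experiment's duration (values are illustrative):

```bash
set -u
cp /etc/hosts /etc/hosts.bak
trap 'mv /etc/hosts.bak /etc/hosts' EXIT   # rollback runs no matter how the script exits

echo "127.0.0.1 registry.npmjs.org" >> /etc/hosts

# Hard stop after 10 minutes so a hung build cannot extend the blast radius.
timeout 600 npm ci --offline && timeout 600 npm run build \
  || echo "Gap found: build does not survive registry unavailability"
```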
Experiment Template:
## Chaos Experiment: [Name]
### Metadata
- **Owner**: [Team/Person]
- **Environment**: [Staging/Production]
- **Duration**: [Time limit]
- **Rollback**: [How to stop]
### Hypothesis
[What you expect to happen]
### Steady State
[Metrics that define "normal"]
### Method
1. [Step-by-step injection process]
2. [Observation points]
3. [Success/failure criteria]
### Abort Conditions
- [ ] Error rate exceeds X%
- [ ] Latency exceeds Xms
- [ ] Customer impact detected
### Results
[Recorded after experiment]
### Actions
[Improvements identified]
Graduated Rollout:
| Stage | Experiment cadence |
|---|---|
| Development | Always |
| Staging | Weekly |
| Production (limited) | Monthly |
| Production (full) | Quarterly |
Learning and Improvement Loop¶
Chaos experiments are valuable only if they drive improvement.
Post-Experiment Process:
- Document findings: What broke? What held?
- Identify gaps: Where are we vulnerable?
- Prioritize fixes: Risk-based remediation
- Implement improvements: Code, config, process changes
- Re-test: Verify improvements work
- Automate: Add to regular testing
Learning Documentation:
## Chaos Experiment Results: npm Registry Unavailability
### Date: 2024-01-15
### Findings
- Build failed after 60 seconds when registry unreachable
- No fallback to cache despite cache being populated
- Error messages unhelpful ("ETIMEDOUT")
### Root Causes
1. npm ci doesn't use cache when registry unreachable
2. No offline fallback configured
3. Timeout too short for retry
### Remediation
1. [x] Switch to npm ci --prefer-offline
2. [x] Implement local registry mirror
3. [x] Increase timeout to 5 minutes
4. [ ] Add alerting for registry latency
### Verification
- Re-ran experiment 2024-01-22
- Build completed in 45 seconds from cache
- Status: IMPROVED
Metrics to Track:
| Metric | Purpose |
|---|---|
| Mean time to detect failure | How fast do we notice? |
| Mean time to recover | How fast do we restore? |
| Blast radius | How much was affected? |
| Experiments run | Are we testing enough? |
| Issues found | Are experiments valuable? |
| Issues fixed | Are we improving? |
Recommendations¶
For SRE Teams:
- **Start with registry unavailability.** It's common, impactful, and easy to test. Block your package registries and see what breaks.
- **Test your caches.** Verify that cached dependencies actually enable offline builds; many teams assume caching works but never verify (see the sketch after this list).
- **Automate experiments.** Run chaos tests regularly, not just once. Supply chain resilience degrades as dependencies change.
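The cache check above can be as small as this sketch: warm the cache, wipe the installed tree, and prove the cache alone can satisfy a reinstall.

```bash
npm ci                      # warms the local npm cache
rm -rf node_modules
if npm ci --offline; then
  echo "✓ cached dependencies support offline builds"
else
  echo "✗ offline install failed; the cache is not a usable fallback"
fi
```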
For DevOps Engineers:
- **Design for degradation.** Build systems should handle dependency failures gracefully—fall back to caches, skip non-essential steps, provide clear errors.
- **Implement circuit breakers.** Don't let one slow registry block all builds. Time out, fail fast, and fall back (see the sketch after this list).
- **Document recovery procedures.** When chaos reveals vulnerabilities, document how to recover. Runbooks save time during real incidents.
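For npm-based builds, one concrete fail-fast lever is tightening npm's own network timeouts and retries so a slow registry surfaces quickly instead of stalling the pipeline. The values below are illustrative; tune them to your build SLA:

```bash
npm config set fetch-timeout 60000            # per-request timeout in ms
npm config set fetch-retries 2
npm config set fetch-retry-maxtimeout 10000   # cap the backoff between retries in ms
```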
For Organizations:
- **Include supply chain in chaos programs.** If you do chaos engineering for production services, extend it to build and deployment infrastructure.
- **Budget for resilience.** Local mirrors, redundant registries, and robust caching cost money but pay off during outages (see the sketch after this list).
- **Learn from experiments.** Chaos engineering that doesn't drive improvement is just chaos. Close the loop from findings to fixes.
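As one example of a low-cost mirror, Verdaccio (Artifactory and Nexus fill the same role) proxies the public registry and keeps a copy of every package it serves, so builds keep working through upstream outages:

```bash
docker run -d --name npm-mirror -p 4873:4873 verdaccio/verdaccio
npm config set registry http://localhost:4873/
```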
Supply chain chaos engineering reveals vulnerabilities that only manifest during failures—exactly when you can least afford surprises. The 2016 npm incident, the 2024 PyPI outages, and countless smaller disruptions demonstrate that registry unavailability isn't hypothetical. Organizations that test their resilience before crises recover faster when crises occur. Those that don't discover their vulnerabilities at the worst possible time.