
14.6 Chaos Engineering for Supply Chain Resilience

On March 22, 2016, npm experienced a cascade of failures after Azer Koçulu unpublished his packages, including the widely used left-pad. Builds broke worldwide. Companies discovered their deployment pipelines had no fallback—they simply assumed npm would always be available. Those who had tested for registry unavailability recovered quickly; those who hadn't scrambled to understand why nothing would build. Chaos engineering—deliberately introducing failures to test system resilience—would have revealed this vulnerability before it became a crisis.

This section applies chaos engineering principles to supply chain security, covering how to simulate dependency failures, registry outages, and compromises to improve your organization's preparedness.

Chaos Engineering Fundamentals

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Netflix pioneered the approach with Chaos Monkey, which randomly terminates production instances to ensure services handle failures gracefully.

Core Principles:

  1. Build a hypothesis about steady state: Define what "normal" looks like
  2. Vary real-world events: Introduce realistic failures
  3. Run experiments in production: Test where it matters (safely)
  4. Automate experiments: Continuous validation, not one-time tests
  5. Minimize blast radius: Contain damage from experiments
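
These principles map directly onto a small experiment loop: measure the steady state, inject a fault, measure again, and clean up. The shell sketch below illustrates the shape of that loop; `inject_failure`, `remove_failure`, and the time budget are placeholders for your own fault injection and thresholds.

# Minimal hypothesis/steady-state loop (illustrative; placeholder commands)

# 1. Establish steady state: the baseline build succeeds within a known budget
STEADY_STATE_BUDGET=300   # seconds (assumed threshold for illustration)
START=$SECONDS
npm run build || { echo "No steady state: baseline build fails"; exit 1; }
BASELINE=$((SECONDS - START))

# 2. Vary a real-world event, re-measure, then minimize blast radius by cleaning up
inject_failure            # placeholder: iptables rule, proxy toxic, hosts entry, ...
START=$SECONDS
if npm run build; then RESULT="survived"; else RESULT="failed"; fi
DURATION=$((SECONDS - START))
remove_failure            # placeholder: undo whatever inject_failure did

echo "Baseline: ${BASELINE}s; under chaos: ${RESULT} in ${DURATION}s (budget: ${STEADY_STATE_BUDGET}s)"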

Applied to Supply Chains:

Supply chain chaos engineering tests resilience to:

  • Dependency unavailability
  • Registry outages
  • Compromised packages
  • Build infrastructure failures
  • Certificate and signing failures

Unlike traditional chaos engineering (which focuses on runtime), supply chain chaos spans build time, deployment, and runtime phases.

Many organizations discover their critical infrastructure dependencies only during outages. A public registry outage that halts all builds reveals systemic fragility. Regular chaos engineering—intentionally disrupting registry availability—identifies and hardens these single points of failure before real outages occur.

Supply Chain-Specific Chaos Experiments

Design experiments around realistic supply chain failure scenarios.

Experiment Categories:

| Category | Failure Type | Impact |
|---|---|---|
| Dependency | Package removed/corrupted | Build failure, runtime error |
| Registry | npm/PyPI unavailable | Build blocked |
| CDN | jsDelivr/unpkg down | Frontend failures |
| Build system | CI/CD outage | Deployment blocked |
| Signing | Certificate expired/revoked | Verification failures |
| Network | Registry unreachable | Build/deploy blocked |

Example Experiments:

Experiment 1: Critical Dependency Unavailable

## Experiment: lodash Unavailability

**Hypothesis**: Application builds successfully without lodash available

**Method**: 
- Block the registry endpoints that serve lodash (package metadata and tarball URLs)
- Attempt build
- Measure: Does build succeed from cache? How long until failure?

**Expected Outcome**: Build succeeds from local cache

**Actual Outcome**: [Record results]

**Learning**: [Document gaps]

Experiment 2: Registry Timeout

## Experiment: npm Registry Slow Response

**Hypothesis**: Builds complete within SLA despite registry latency

**Method**:
- Inject 30-second delay on registry responses
- Run standard build pipeline
- Measure: Build time, timeout behavior, retry logic

**Expected Outcome**: Build completes with extended time, no failure

**Actual Outcome**: [Record results]

Experiment 3: Package Integrity Failure

## Experiment: Checksum Mismatch

**Hypothesis**: Build fails safely when package integrity check fails

**Method**:
- Intercept package download, modify contents
- Attempt install with integrity checking enabled
- Verify: Build fails with clear error, no partial install

**Expected Outcome**: Clean failure, no corrupted installation
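
A hedged shell sketch of this experiment for an npm project: it corrupts one integrity hash recorded in package-lock.json and checks that `npm ci` refuses to install rather than proceeding with the tampered package. The sed expression and file names assume a standard lockfile layout.

# Experiment 3 sketch: force a checksum mismatch and verify the build fails safely
cp package-lock.json package-lock.json.backup

# Corrupt the first recorded integrity hash (insert an extra character after "sha512-")
sed -i '0,/"integrity": "sha512-/s//"integrity": "sha512-X/' package-lock.json

if npm ci; then
  echo "✗ Install succeeded despite a corrupted integrity hash"
  RESULT=1
else
  echo "✓ Install failed on integrity mismatch as expected"
  RESULT=0
fi

# Restore the original lockfile and remove any partial install state
mv package-lock.json.backup package-lock.json
rm -rf node_modules
exit $RESULT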

Dependency Failure Injection Techniques

Several techniques enable controlled dependency failures.

Network-Level Injection:

# Block specific registry endpoints (iptables resolves the hostname
# to its current IPs when the rule is added)
iptables -A OUTPUT -d registry.npmjs.org -j DROP

# Inject latency (netem attaches a root qdisc; run one netem variant at a time)
tc qdisc add dev eth0 root netem delay 5000ms

# Simulate packet loss
tc qdisc add dev eth0 root netem loss 50%

# Clean up after the experiment
iptables -D OUTPUT -d registry.npmjs.org -j DROP
tc qdisc del dev eth0 root netem

Proxy-Based Injection:

# Toxiproxy for chaos testing: create a proxy in front of the registry,
# then attach failure "toxics" to it (point your client at localhost:8080)
toxiproxy-cli create -l localhost:8080 -u registry.npmjs.org:443 npm-registry

# Add 10s latency with 5s jitter to registry responses
toxiproxy-cli toxic add -t latency -a latency=10000 -a jitter=5000 npm-registry

# Or stop responding entirely after 30 seconds
toxiproxy-cli toxic add -t timeout -a timeout=30000 npm-registry

Application-Level Injection:

// Mock a dependency so that importing it fails outright
jest.mock('critical-dependency', () => {
  // The factory runs when the module is first required, so the import throws
  throw new Error('Simulated dependency failure');
});

// Or replace it with degraded functionality
jest.mock('feature-library', () => ({
  feature: () => fallbackBehavior()  // fallbackBehavior: your degraded code path
}));

Build-Time Injection:

# CI pipeline with chaos stage
chaos-test:
  stage: test
  script:
    # Block external registries
    - echo "127.0.0.1 registry.npmjs.org" >> /etc/hosts

    # Attempt build from cache only
    - npm ci --offline

    # Verify build succeeds
    - npm run build

Graceful Degradation Patterns

Design systems to degrade gracefully when dependencies fail.

Pattern 1: Cached Fallback

async function fetchData() {
  try {
    // Try external service
    const data = await externalService.fetch();
    cache.set('data', data);
    return data;
  } catch (error) {
    // Fall back to cached data
    const cached = cache.get('data');
    if (cached) {
      metrics.increment('fallback.cache_hit');
      return cached;
    }
    throw new DegradedServiceError('No cached data available');
  }
}

Pattern 2: Feature Flags for Dependencies

// Disable feature when dependency unavailable
const featureEnabled = config.get('features.analytics');

if (featureEnabled) {
  try {
    await analytics.track(event);
  } catch (error) {
    // Disable feature, continue without
    config.set('features.analytics', false);
    logger.warn('Analytics disabled due to failure');
  }
}

Pattern 3: Circuit Breaker

// Circuit breaker around an external call (opossum-style API)
const circuitBreaker = new CircuitBreaker(externalCall, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

circuitBreaker.fallback(() => fallbackResponse);

circuitBreaker.on('open', () => {
  alerts.send('Circuit breaker opened for external service');
});

Degradation Testing Matrix:

| Dependency | Degraded Mode | User Impact |
|---|---|---|
| Analytics | Disabled | None visible |
| Search | Cached results | Stale results |
| Authentication | Cached sessions | New logins unavailable |
| Payment | Queue for retry | Delayed processing |
| CDN assets | Self-hosted fallback | Slower loading |

Registry and CDN Unavailability Testing

Test resilience to infrastructure your builds depend on.

Registry Unavailability:

#!/bin/bash
# Test build resilience to registry outage

echo "=== Registry Unavailability Test ==="

# Save current state
cp /etc/hosts /etc/hosts.backup

# Block registries
echo "127.0.0.1 registry.npmjs.org" >> /etc/hosts
echo "127.0.0.1 registry.yarnpkg.com" >> /etc/hosts

# Attempt offline install and build
npm ci --offline && npm run build
BUILD_RESULT=$?

# Restore
mv /etc/hosts.backup /etc/hosts

if [ $BUILD_RESULT -eq 0 ]; then
  echo "✓ Build succeeded offline"
else
  echo "✗ Build failed without registry access"
  exit 1
fi

CDN Failover Testing:

// Test CDN fallback in browser
describe('CDN Resilience', () => {
  beforeEach(() => {
    // Mock CDN failure
    cy.intercept('https://cdn.jsdelivr.net/**', { forceNetworkError: true });
  });

  it('loads application when primary CDN fails', () => {
    cy.visit('/');
    cy.get('#app').should('be.visible');
    // Verify fallback CDN or self-hosted assets loaded
  });
});

Testing Private Registry Resilience:

# Test Artifactory/Nexus failover
chaos-registry:
  script:
    # Stop primary registry
    - docker stop artifactory-primary

    # Verify builds use secondary
    - npm ci --registry https://artifactory-secondary.internal

    # Verify caching layer handles failover
    - npm ci  # Should use cached packages

    # Restore
    - docker start artifactory-primary

Build System Resilience Testing

CI/CD systems are critical supply chain infrastructure.

CI/CD Chaos Experiments:

| Experiment | Method | Tests |
|---|---|---|
| Runner unavailability | Terminate runners mid-build | Job recovery, retry logic |
| Cache corruption | Clear/corrupt build cache | Build without cache |
| Secret unavailability | Revoke vault access | Graceful secret failure |
| Artifact storage failure | Block artifact upload | Build completion without artifacts |
| Database failure | Stop CI database | Job queue resilience |

Build Cache Chaos:

# GitHub Actions: Test build without cache
name: Cache Resilience Test
on:
  schedule:
    - cron: '0 3 * * 0'  # Weekly

jobs:
  no-cache-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Explicitly skip cache
      - name: Install without cache
        run: |
          rm -rf ~/.npm
          npm ci

      - name: Build and record timing
        run: |
          START=$(date +%s)
          npm run build
          echo "Build completed in $(( $(date +%s) - START )) seconds"

Parallel Build Failure:

# Test handling of partial build failures (GitHub Actions matrix job)
matrix-chaos:
  runs-on: ${{ matrix.os }}-latest
  strategy:
    fail-fast: false
    matrix:
      os: [ubuntu, macos, windows]
      node: [18, 20]
  steps:
    - name: Random failure injection
      shell: bash  # $RANDOM needs bash, including on Windows runners
      run: |
        if [ $((RANDOM % 5)) -eq 0 ]; then
          echo "Injecting chaos failure"
          exit 1
        fi

Experiment Design and Safety

Run chaos experiments safely to avoid unintended impact.

Safety Principles:

  1. Start in non-production: Test staging/dev first
  2. Limit blast radius: Affect minimal scope
  3. Have rollback ready: Quick recovery mechanism
  4. Monitor closely: Watch for unexpected cascades
  5. Communicate: Inform affected teams

Experiment Template:

## Chaos Experiment: [Name]

### Metadata
- **Owner**: [Team/Person]
- **Environment**: [Staging/Production]
- **Duration**: [Time limit]
- **Rollback**: [How to stop]

### Hypothesis
[What you expect to happen]

### Steady State
[Metrics that define "normal"]

### Method
1. [Step-by-step injection process]
2. [Observation points]
3. [Success/failure criteria]

### Abort Conditions
- [ ] Error rate exceeds X%
- [ ] Latency exceeds Xms
- [ ] Customer impact detected

### Results
[Recorded after experiment]

### Actions
[Improvements identified]
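
Abort conditions are most useful when they are enforced automatically rather than watched by hand. A hedged sketch of a watchdog loop follows; the metrics endpoint, threshold, and rollback script are assumptions to adapt to your own monitoring stack.

# Watchdog: poll an error-rate metric and roll the experiment back if it trips
ERROR_RATE_URL="https://metrics.internal/api/error-rate"  # hypothetical endpoint returning a plain number
THRESHOLD=5                       # abort if error rate exceeds 5%
ROLLBACK="./rollback-chaos.sh"    # hypothetical cleanup/rollback script

for _ in $(seq 1 60); do          # observe for up to 10 minutes
  RATE=$(curl -sf "$ERROR_RATE_URL" || echo 100)   # treat metric outages as worst case
  RATE=${RATE:-100}
  if [ "${RATE%.*}" -gt "$THRESHOLD" ]; then
    echo "Abort condition hit: error rate ${RATE}% > ${THRESHOLD}%"
    "$ROLLBACK"
    exit 1
  fi
  sleep 10
done
echo "Experiment window closed without tripping abort conditions"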

Graduated Rollout:

Development → Staging → Production (limited) → Production (full)
     ↓           ↓              ↓                    ↓
  Always      Weekly        Monthly             Quarterly

Learning and Improvement Loop

Chaos experiments are valuable only if they drive improvement.

Post-Experiment Process:

  1. Document findings: What broke? What held?
  2. Identify gaps: Where are we vulnerable?
  3. Prioritize fixes: Risk-based remediation
  4. Implement improvements: Code, config, process changes
  5. Re-test: Verify improvements work
  6. Automate: Add to regular testing

Learning Documentation:

## Chaos Experiment Results: npm Registry Unavailability

### Date: 2024-01-15

### Findings
- Build failed after 60 seconds when registry unreachable
- No fallback to cache despite cache being populated
- Error messages unhelpful ("ETIMEDOUT")

### Root Causes
1. npm ci doesn't use cache when registry unreachable
2. No offline fallback configured
3. Timeout too short for retry

### Remediation
1. [x] Switch to npm ci --prefer-offline
2. [x] Implement local registry mirror
3. [x] Increase timeout to 5 minutes
4. [ ] Add alerting for registry latency

### Verification
- Re-ran experiment 2024-01-22
- Build completed in 45 seconds from cache
- Status: IMPROVED

Metrics to Track:

| Metric | Purpose |
|---|---|
| Mean time to detect failure | How fast do we notice? |
| Mean time to recover | How fast do we restore? |
| Blast radius | How much was affected? |
| Experiments run | Are we testing enough? |
| Issues found | Are experiments valuable? |
| Issues fixed | Are we improving? |

Recommendations

For SRE Teams:

  1. Start with registry unavailability. It's common, impactful, and easy to test. Block your package registries and see what breaks.

  2. Test your caches. Verify that cached dependencies actually enable offline builds. Many teams assume caching works but never verify.

  3. Automate experiments. Run chaos tests regularly, not just once. Supply chain resilience degrades as dependencies change.

For DevOps Engineers:

  1. Design for degradation. Build systems should handle dependency failures gracefully—fall back to caches, skip non-essential steps, provide clear errors.

  2. Implement circuit breakers. Don't let one slow registry block all builds. Time out, fail fast, and fall back.

  3. Document recovery procedures. When chaos reveals vulnerabilities, document how to recover. Runbooks save time during real incidents.

For Organizations:

  1. Include supply chain in chaos programs. If you do chaos engineering for production services, extend it to build and deployment infrastructure.

  2. Budget for resilience. Local mirrors, redundant registries, and robust caching cost money but pay off during outages.

  3. Learn from experiments. Chaos engineering that doesn't drive improvement is just chaos. Close the loop from findings to fixes.

Supply chain chaos engineering reveals vulnerabilities that only manifest during failures—exactly when you can least afford surprises. The 2016 npm incident, the 2024 PyPI outages, and countless smaller disruptions demonstrate that registry unavailability isn't hypothetical. Organizations that test their resilience before crises recover faster when crises occur. Those that don't discover their vulnerabilities at the worst possible time.