
14.6 Chaos Engineering for Supply Chain Resilience

On March 22, 2016, npm experienced a cascade of failures after Azer Koçulu unpublished his packages, including the widely used left-pad. Builds broke worldwide. Companies discovered their deployment pipelines had no fallback—they simply assumed npm would always be available. Those who had tested for registry unavailability recovered quickly; those who hadn't scrambled to understand why nothing would build. Chaos engineering—deliberately introducing failures to test system resilience—would have revealed this vulnerability before it became a crisis.

This section applies chaos engineering principles to supply chain security, covering how to simulate dependency failures, registry outages, and compromises to improve your organization's preparedness.

Chaos Engineering Fundamentals

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Netflix pioneered the approach with Chaos Monkey, which randomly terminates production instances to ensure services handle failures gracefully.

Core Principles:

  1. Build a hypothesis about steady state: Define what "normal" looks like
  2. Vary real-world events: Introduce realistic failures
  3. Run experiments in production: Test where it matters (safely)
  4. Automate experiments: Continuous validation, not one-time tests
  5. Minimize blast radius: Contain damage from experiments
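
These principles map directly onto a small experiment loop: measure the steady state, inject a fault, measure again, and clean up. The shell sketch below illustrates the shape of that loop; `inject_failure`, `remove_failure`, and the time budget are placeholders for your own fault injection and thresholds.

# Minimal hypothesis/steady-state loop (illustrative; placeholder commands)

# 1. Establish steady state: the baseline build succeeds within a known budget
STEADY_STATE_BUDGET=300   # seconds (assumed threshold for illustration)
START=$SECONDS
npm run build || { echo "No steady state: baseline build fails"; exit 1; }
BASELINE=$((SECONDS - START))

# 2. Vary a real-world event, re-measure, then minimize blast radius by cleaning up
inject_failure            # placeholder: iptables rule, proxy toxic, hosts entry, ...
START=$SECONDS
if npm run build; then RESULT="survived"; else RESULT="failed"; fi
DURATION=$((SECONDS - START))
remove_failure            # placeholder: undo whatever inject_failure did

echo "Baseline: ${BASELINE}s; under chaos: ${RESULT} in ${DURATION}s (budget: ${STEADY_STATE_BUDGET}s)"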

Applied to Supply Chains:

Supply chain chaos engineering tests resilience to:

  • Dependency unavailability
  • Registry outages
  • Compromised packages
  • Build infrastructure failures
  • Certificate and signing failures

Unlike traditional chaos engineering (which focuses on runtime), supply chain chaos spans build time, deployment, and runtime phases.

Many organizations discover their critical infrastructure dependencies only during outages. A public registry outage that halts all builds reveals systemic fragility. Regular chaos engineering—intentionally disrupting registry availability—identifies and hardens these single points of failure before real outages occur.

Supply Chain-Specific Chaos Experiments

Design experiments around realistic supply chain failure scenarios.

Experiment Categories:

| Category | Failure Type | Impact |
|---|---|---|
| Dependency | Package removed/corrupted | Build failure, runtime error |
| Registry | npm/PyPI unavailable | Build blocked |
| CDN | jsDelivr/unpkg down | Frontend failures |
| Build system | CI/CD outage | Deployment blocked |
| Signing | Certificate expired/revoked | Verification failures |
| Network | Registry unreachable | Build/deploy blocked |

Example Experiments:

Experiment 1: Critical Dependency Unavailable

## Experiment: lodash Unavailability

**Hypothesis**: Application builds successfully without lodash available

**Method**: 
- Block the registry endpoints that serve lodash (package metadata and tarball URLs)
- Attempt build
- Measure: Does build succeed from cache? How long until failure?

**Expected Outcome**: Build succeeds from local cache

**Actual Outcome**: [Record results]

**Learning**: [Document gaps]

Experiment 2: Registry Timeout

## Experiment: npm Registry Slow Response

**Hypothesis**: Builds complete within SLA despite registry latency

**Method**:
- Inject 30-second delay on registry responses
- Run standard build pipeline
- Measure: Build time, timeout behavior, retry logic

**Expected Outcome**: Build completes with extended time, no failure

**Actual Outcome**: [Record results]

Experiment 3: Package Integrity Failure

## Experiment: Checksum Mismatch

**Hypothesis**: Build fails safely when package integrity check fails

**Method**:
- Intercept package download, modify contents
- Attempt install with integrity checking enabled
- Verify: Build fails with clear error, no partial install

**Expected Outcome**: Clean failure, no corrupted installation
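
A hedged shell sketch of this experiment for an npm project: it corrupts one integrity hash recorded in package-lock.json and checks that `npm ci` refuses to install rather than proceeding with the tampered package. The sed expression and file names assume a standard lockfile layout.

# Experiment 3 sketch: force a checksum mismatch and verify the build fails safely
cp package-lock.json package-lock.json.backup

# Corrupt the first recorded integrity hash (insert an extra character after "sha512-")
sed -i '0,/"integrity": "sha512-/s//"integrity": "sha512-X/' package-lock.json

if npm ci; then
  echo "✗ Install succeeded despite a corrupted integrity hash"
  RESULT=1
else
  echo "✓ Install failed on integrity mismatch as expected"
  RESULT=0
fi

# Restore the original lockfile and remove any partial install state
mv package-lock.json.backup package-lock.json
rm -rf node_modules
exit $RESULT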

Dependency Failure Injection Techniques

Several techniques enable controlled dependency failures.

Network-Level Injection:

# Block specific registry endpoints (iptables resolves the hostname
# to its current IPs when the rule is added)
iptables -A OUTPUT -d registry.npmjs.org -j DROP

# Inject latency (netem attaches a root qdisc; run one netem variant at a time)
tc qdisc add dev eth0 root netem delay 5000ms

# Simulate packet loss
tc qdisc add dev eth0 root netem loss 50%

# Clean up after the experiment
iptables -D OUTPUT -d registry.npmjs.org -j DROP
tc qdisc del dev eth0 root netem

Proxy-Based Injection:

# Toxiproxy for chaos testing: create a proxy in front of the registry,
# then attach failure "toxics" to it (point your client at localhost:8080)
toxiproxy-cli create -l localhost:8080 -u registry.npmjs.org:443 npm-registry

# Add 10s latency with 5s jitter to registry responses
toxiproxy-cli toxic add -t latency -a latency=10000 -a jitter=5000 npm-registry

# Or stop responding entirely after 30 seconds
toxiproxy-cli toxic add -t timeout -a timeout=30000 npm-registry

Application-Level Injection:

// Mock a dependency so that importing it fails outright
jest.mock('critical-dependency', () => {
  // The factory runs when the module is first required, so the import throws
  throw new Error('Simulated dependency failure');
});

// Or replace it with degraded functionality
jest.mock('feature-library', () => ({
  feature: () => fallbackBehavior()  // fallbackBehavior: your degraded code path
}));

Build-Time Injection:

# CI pipeline with chaos stage
chaos-test:
  stage: test
  script:
    # Block external registries
    - echo "127.0.0.1 registry.npmjs.org" >> /etc/hosts

    # Attempt build from cache only
    - npm ci --offline

    # Verify build succeeds
    - npm run build

Graceful Degradation Patterns

Design systems to degrade gracefully when dependencies fail.

Pattern 1: Cached Fallback

async function fetchData() {
  try {
    // Try external service
    const data = await externalService.fetch();
    cache.set('data', data);
    return data;
  } catch (error) {
    // Fall back to cached data
    const cached = cache.get('data');
    if (cached) {
      metrics.increment('fallback.cache_hit');
      return cached;
    }
    throw new DegradedServiceError('No cached data available');
  }
}

Pattern 2: Feature Flags for Dependencies

// Disable feature when dependency unavailable
const featureEnabled = config.get('features.analytics');

if (featureEnabled) {
  try {
    await analytics.track(event);
  } catch (error) {
    // Disable feature, continue without
    config.set('features.analytics', false);
    logger.warn('Analytics disabled due to failure');
  }
}

Pattern 3: Circuit Breaker

// Circuit breaker around an external call (opossum-style API)
const circuitBreaker = new CircuitBreaker(externalCall, {
  timeout: 3000,
  errorThresholdPercentage: 50,
  resetTimeout: 30000
});

circuitBreaker.fallback(() => fallbackResponse);

circuitBreaker.on('open', () => {
  alerts.send('Circuit breaker opened for external service');
});

Degradation Testing Matrix:

| Dependency | Degraded Mode | User Impact |
|---|---|---|
| Analytics | Disabled | None visible |
| Search | Cached results | Stale results |
| Authentication | Cached sessions | New logins unavailable |
| Payment | Queue for retry | Delayed processing |
| CDN assets | Self-hosted fallback | Slower loading |

Registry and CDN Unavailability Testing

Test resilience to infrastructure your builds depend on.

Registry Unavailability:

#!/bin/bash
# Test build resilience to registry outage

echo "=== Registry Unavailability Test ==="

# Save current state
cp /etc/hosts /etc/hosts.backup

# Block registries
echo "127.0.0.1 registry.npmjs.org" >> /etc/hosts
echo "127.0.0.1 registry.yarnpkg.com" >> /etc/hosts

# Attempt offline install and build
npm ci --offline && npm run build
BUILD_RESULT=$?

# Restore
mv /etc/hosts.backup /etc/hosts

if [ $BUILD_RESULT -eq 0 ]; then
  echo "✓ Build succeeded offline"
else
  echo "✗ Build failed without registry access"
  exit 1
fi

CDN Failover Testing:

// Test CDN fallback in browser
describe('CDN Resilience', () => {
  beforeEach(() => {
    // Mock CDN failure
    cy.intercept('https://cdn.jsdelivr.net/**', { forceNetworkError: true });
  });

  it('loads application when primary CDN fails', () => {
    cy.visit('/');
    cy.get('#app').should('be.visible');
    // Verify fallback CDN or self-hosted assets loaded
  });
});

Testing Private Registry Resilience:

# Test Artifactory/Nexus failover
chaos-registry:
  script:
    # Stop primary registry
    - docker stop artifactory-primary

    # Verify builds use secondary
    - npm ci --registry https://artifactory-secondary.internal

    # Verify caching layer handles failover
    - npm ci  # Should use cached packages

    # Restore
    - docker start artifactory-primary

Build System Resilience Testing

CI/CD systems are critical supply chain infrastructure.

CI/CD Chaos Experiments:

| Experiment | Method | Tests |
|---|---|---|
| Runner unavailability | Terminate runners mid-build | Job recovery, retry logic |
| Cache corruption | Clear/corrupt build cache | Build without cache |
| Secret unavailability | Revoke vault access | Graceful secret failure |
| Artifact storage failure | Block artifact upload | Build completion without artifacts |
| Database failure | Stop CI database | Job queue resilience |

Build Cache Chaos:

# GitHub Actions: Test build without cache
name: Cache Resilience Test
on:
  schedule:
    - cron: '0 3 * * 0'  # Weekly

jobs:
  no-cache-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      # Explicitly skip cache
      - name: Install without cache
        run: |
          rm -rf ~/.npm
          npm ci

      - name: Build and record timing
        run: |
          START=$(date +%s)
          npm run build
          echo "Build completed in $(( $(date +%s) - START )) seconds"

Parallel Build Failure:

# Test handling of partial build failures (GitHub Actions matrix job)
matrix-chaos:
  runs-on: ${{ matrix.os }}-latest
  strategy:
    fail-fast: false
    matrix:
      os: [ubuntu, macos, windows]
      node: [18, 20]
  steps:
    - name: Random failure injection
      shell: bash  # $RANDOM needs bash, including on Windows runners
      run: |
        if [ $((RANDOM % 5)) -eq 0 ]; then
          echo "Injecting chaos failure"
          exit 1
        fi

Experiment Design and Safety

Run chaos experiments safely to avoid unintended impact.

Safety Principles:

  1. Start in non-production: Test staging/dev first
  2. Limit blast radius: Affect minimal scope
  3. Have rollback ready: Quick recovery mechanism
  4. Monitor closely: Watch for unexpected cascades
  5. Communicate: Inform affected teams

Experiment Template:

## Chaos Experiment: [Name]

### Metadata
- **Owner**: [Team/Person]
- **Environment**: [Staging/Production]
- **Duration**: [Time limit]
- **Rollback**: [How to stop]

### Hypothesis
[What you expect to happen]

### Steady State
[Metrics that define "normal"]

### Method
1. [Step-by-step injection process]
2. [Observation points]
3. [Success/failure criteria]

### Abort Conditions
- [ ] Error rate exceeds X%
- [ ] Latency exceeds Xms
- [ ] Customer impact detected

### Results
[Recorded after experiment]

### Actions
[Improvements identified]
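
Abort conditions are most useful when they are enforced automatically rather than watched by hand. A hedged sketch of a watchdog loop follows; the metrics endpoint, threshold, and rollback script are assumptions to adapt to your own monitoring stack.

# Watchdog: poll an error-rate metric and roll the experiment back if it trips
ERROR_RATE_URL="https://metrics.internal/api/error-rate"  # hypothetical endpoint returning a plain number
THRESHOLD=5                       # abort if error rate exceeds 5%
ROLLBACK="./rollback-chaos.sh"    # hypothetical cleanup/rollback script

for _ in $(seq 1 60); do          # observe for up to 10 minutes
  RATE=$(curl -sf "$ERROR_RATE_URL" || echo 100)   # treat metric outages as worst case
  RATE=${RATE:-100}
  if [ "${RATE%.*}" -gt "$THRESHOLD" ]; then
    echo "Abort condition hit: error rate ${RATE}% > ${THRESHOLD}%"
    "$ROLLBACK"
    exit 1
  fi
  sleep 10
done
echo "Experiment window closed without tripping abort conditions"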

Graduated Rollout:

Development → Staging → Production (limited) → Production (full)
     ↓           ↓              ↓                    ↓
  Always      Weekly        Monthly             Quarterly

Learning and Improvement Loop

Chaos experiments are valuable only if they drive improvement.

Post-Experiment Process:

  1. Document findings: What broke? What held?
  2. Identify gaps: Where are we vulnerable?
  3. Prioritize fixes: Risk-based remediation
  4. Implement improvements: Code, config, process changes
  5. Re-test: Verify improvements work
  6. Automate: Add to regular testing

Learning Documentation:

## Chaos Experiment Results: npm Registry Unavailability

### Date: 2024-01-15

### Findings
- Build failed after 60 seconds when registry unreachable
- No fallback to cache despite cache being populated
- Error messages unhelpful ("ETIMEDOUT")

### Root Causes
1. npm ci doesn't use cache when registry unreachable
2. No offline fallback configured
3. Timeout too short for retry

### Remediation
1. [x] Switch to npm ci --prefer-offline
2. [x] Implement local registry mirror
3. [x] Increase timeout to 5 minutes
4. [ ] Add alerting for registry latency

### Verification
- Re-ran experiment 2024-01-22
- Build completed in 45 seconds from cache
- Status: IMPROVED

Metrics to Track:

| Metric | Purpose |
|---|---|
| Mean time to detect failure | How fast do we notice? |
| Mean time to recover | How fast do we restore? |
| Blast radius | How much was affected? |
| Experiments run | Are we testing enough? |
| Issues found | Are experiments valuable? |
| Issues fixed | Are we improving? |

Recommendations

For SRE Teams:

  1. Start with registry unavailability. It's common, impactful, and easy to test. Block your package registries and see what breaks.

  2. Test your caches. Verify that cached dependencies actually enable offline builds. Many teams assume caching works but never verify.

  3. Automate experiments. Run chaos tests regularly, not just once. Supply chain resilience degrades as dependencies change.

For DevOps Engineers:

  1. Design for degradation. Build systems should handle dependency failures gracefully—fall back to caches, skip non-essential steps, provide clear errors.

  2. Implement circuit breakers. Don't let one slow registry block all builds. Time out, fail fast, and fall back.

  3. Document recovery procedures. When chaos reveals vulnerabilities, document how to recover. Runbooks save time during real incidents.

For Organizations:

  1. Include supply chain in chaos programs. If you do chaos engineering for production services, extend it to build and deployment infrastructure.

  2. Budget for resilience. Local mirrors, redundant registries, and robust caching cost money but pay off during outages.

  3. Learn from experiments. Chaos engineering that doesn't drive improvement is just chaos. Close the loop from findings to fixes.

Supply chain chaos engineering reveals vulnerabilities that only manifest during failures—exactly when you can least afford surprises. The 2016 npm incident, the 2024 PyPI outages, and countless smaller disruptions demonstrate that registry unavailability isn't hypothetical. Organizations that test their resilience before crises recover faster when crises occur. Those that don't discover their vulnerabilities at the worst possible time.