19.3 Recovery and Remediation¶
Containment stops the bleeding; recovery heals the wound. After the Codecov breach, affected organizations faced a daunting remediation task: identify every secret that had transited through compromised CI/CD pipelines over two months, rotate every one of those credentials, rebuild affected systems from clean sources, and verify that no backdoors or persistence mechanisms remained. HashiCorp, one of the affected companies, published a security advisory describing their remediation work, which included rotating and replacing their GPG signing key used to verify software releases.
Recovery from a supply chain compromise requires more than simply updating a dependency version. The compromised component may have installed persistence mechanisms, exfiltrated credentials now in attacker hands, or modified other system components. Effective recovery addresses all potential impacts: removing the malicious code, validating its replacement, rebuilding affected systems, rotating exposed credentials, and verifying that the recovery is complete. Rushing this process or missing components leads to reinfection or continued exploitation through stolen credentials.
Removing Compromised Dependencies¶
The first recovery step eliminates the compromised component from your environment. This sounds straightforward but involves several considerations.
Direct dependency removal:
For direct dependencies, update your package manifest to a known-good version or remove the package entirely:
// package.json - before
{
  "dependencies": {
    "compromised-package": "^1.2.3"
  }
}

// package.json - after (pinned to last known-good version)
{
  "dependencies": {
    "compromised-package": "1.2.2"
  }
}
# Regenerate lockfile after manifest change
rm package-lock.json
npm install
# Verify the compromised version is gone
npm list compromised-package
Transitive dependency removal:
When the compromised package is a transitive dependency, you cannot simply remove it from your manifest. Options include:
- Override the transitive dependency: Most package managers support version overrides (a Cargo example follows; an npm equivalent appears after this list):
# Cargo patch
[patch.crates-io]
compromised-crate = { git = "https://github.com/owner/compromised-crate", branch = "safe" }
- Update the parent dependency: If the parent package has released a version that removes or fixes the compromised transitive dependency, update to that version.
- Replace the parent dependency: If the parent package remains compromised or unmaintained, find an alternative that provides similar functionality.
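As a companion to the Cargo example, npm (8.3 and later) supports a top-level overrides field in package.json for forcing a transitive dependency to a safe version; a minimal sketch, assuming 1.2.4 is a known-safe release:
// package.json - force a safe version of a transitive dependency (npm 8.3+)
{
  "overrides": {
    "compromised-package": "1.2.4"
  }
}
Regenerate the lockfile after adding the override so the new resolution actually takes effect.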
Cached and vendored copies:
Dependencies often exist in multiple locations beyond your manifest:
- CI/CD caches (GitHub Actions cache, npm cache, pip cache)
- Container layer caches in registries
- Vendored dependencies in source repositories
- Local developer machine caches
- Build artifact storage
Clear all caches that might contain the compromised version:
# Clear npm cache
npm cache clean --force
# Clear pip cache
pip cache purge
# Invalidate GitHub Actions cache
# (requires creating new cache key or manually deleting via API)
# Clear Docker build cache
docker builder prune --all
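For the GitHub Actions cache specifically, recent versions of the GitHub CLI (2.32 and later) can delete caches directly rather than waiting on a new cache key; a sketch, with owner/repo standing in for your repository:
# List existing caches, then delete them all (GitHub CLI 2.32+)
gh cache list --repo owner/repo
gh cache delete --all --repo owner/repo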
Validating the Integrity of Replacements¶
Before deploying replacement components, verify they are legitimate and uncompromised. The replacement version might also be malicious if the attacker maintained access to the project or registry.
Validation steps:
- Verify publisher identity: Confirm the replacement was published by the legitimate maintainer, not the attacker. Check whether the maintainer account shows signs of compromise (recent password changes, new MFA enrollment, unusual publishing patterns). See the sketch after this list.
- Review the changes: Examine the differences between the compromised version and the replacement. A legitimate fix should show minimal, targeted changes that remove the malicious code.
- Check signatures and provenance: If the package supports Sigstore or other signing, verify signatures come from expected identities:
# Verify npm provenance
npm audit signatures
# Verify container image signature (using cosign: https://github.com/sigstore/cosign)
cosign verify myregistry.io/image:tag \
  --certificate-identity "expected-identity" \
  --certificate-oidc-issuer "expected-issuer"
- Scan the replacement: Run security scanners against the replacement to ensure it does not contain known malicious patterns:
# Scan with multiple tools for confidence
# Trivy: https://trivy.dev/
trivy fs ./node_modules/replacement-package
# Socket: https://socket.dev/
socket scan ./node_modules/replacement-package
- Test in isolation: Deploy the replacement in an isolated environment and monitor for suspicious behavior before production deployment.
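For the first two steps, registry metadata and tarball diffs are often enough to spot anomalies; a sketch using the npm registry, assuming 1.2.2 is the last known-good version and 1.2.4 the proposed replacement:
# Inspect publish timestamps and maintainer list for anomalies
npm view compromised-package time --json
npm view compromised-package maintainers
# Download both tarballs and diff the unpacked contents
npm pack compromised-package@1.2.2 compromised-package@1.2.4
tar -xzf compromised-package-1.2.2.tgz && mv package v1.2.2
tar -xzf compromised-package-1.2.4.tgz && mv package v1.2.4
diff -r v1.2.2 v1.2.4
Note that a published tarball can differ from the repository source, so diff what the registry actually serves, not just the Git history.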
When no safe replacement exists:
Sometimes no uncompromised version is available: the attacker may have compromised multiple versions, or the maintainer may be unresponsive. Options include:
- Fork and fix: Clone the last known-good source, apply necessary security fixes, and host internally (see the manifest sketch after this list)
- Find alternatives: Identify alternative packages that provide similar functionality
- Implement internally: For simple functionality, implement it directly rather than depending on external code
- Accept temporary degradation: Disable the functionality that required the compromised dependency until a safe option exists
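For the fork-and-fix route, most manifests can point a dependency at an internal fork instead of the public registry; a minimal npm sketch, with the Git host URL and branch name as placeholders:
// package.json - consume an internally hosted fork
{
  "dependencies": {
    "compromised-package": "git+https://git.internal.example/forks/compromised-package.git#security-fix"
  }
}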
Rebuilding from Known-Good Sources¶
Systems that executed the compromised code may have been modified beyond simply loading the malicious dependency. Rebuilding from known-good sources ensures no persistence mechanisms remain.
Identifying known-good sources:
The challenge is determining what constitutes "known-good." Consider:
- Source code: Git commits from before the compromise window should be unaffected, assuming the compromise did not include repository access (see the git example after this list)
- Container images: Images built before the compromise, verified by build timestamps and build logs
- Infrastructure as code: Terraform state, Kubernetes manifests, and configuration from before the compromise
- Backups: System backups from before the compromise, though these may lack recent legitimate changes
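To pin a rebuild to a pre-compromise commit, git can resolve the last commit before the compromise window opened; a sketch, assuming the window started on 2021-01-31 and the default branch is main:
# Find the last commit on main before the compromise window
GOOD_COMMIT=$(git rev-list -1 --before="2021-01-31" main)
# Verify its signature if your team signs commits
git verify-commit "$GOOD_COMMIT" 2>/dev/null || echo "commit is not signed"
# Build from that commit in a clean worktree
git worktree add /tmp/clean-build "$GOOD_COMMIT"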
Rebuild strategies:
- Container-based workloads: Rebuild container images from source using clean base images and verified dependencies. Do not simply update the running container:
# Don't just repoint the deployment at a new tag (reuses cached, potentially compromised layers)
kubectl set image deployment/app app=newimage:tag
# Do this instead (full rebuild from clean sources, then roll out)
docker build --no-cache -t myregistry.io/app:clean .
docker push myregistry.io/app:clean
kubectl set image deployment/app app=myregistry.io/app:clean
kubectl rollout status deployment/app # Confirm fresh pods are running
- Virtual machines and servers: For systems where full rebuild is impractical, consider:
  - Restore from pre-compromise backup, then apply legitimate changes
  - Reinstall operating system and redeploy application
  - Use configuration management to enforce known-good state
- Developer workstations: If developer machines were affected (through compromised development tools), consider reprovisioning. At minimum, clear all dependency caches and reinstall development tools from verified sources.
- CI/CD infrastructure: Build systems are high-value targets. Rebuild CI/CD runners, clear all caches, rotate all secrets, and verify runner configurations:
# Force fresh runner environment (GitHub Actions)
runs-on: ubuntu-latest
env:
  # New cache key forces fresh cache
  CACHE_VERSION: v2-post-incident
Credential Rotation and Access Review¶
Compromised components often have access to credentials through environment variables, configuration files, mounted secrets, or cloud provider metadata services. All potentially exposed credentials must be rotated.
Credential rotation checklist:
| Credential Type | Rotation Method | Verification |
|---|---|---|
| API keys | Generate new key, update consumers, revoke old | Confirm old key returns 401 |
| Database passwords | Update password, update connection strings | Test database connectivity |
| Cloud IAM keys | Create new key, deploy, delete old | Audit log shows only new key |
| JWT signing keys | Generate new keys, deploy, invalidate old tokens | Old JWTs fail validation |
| SSH keys | Generate new keypairs, update authorized_keys | Old keys cannot authenticate |
| TLS certificates | Issue new certificates, deploy, revoke old | OCSP/CRL shows revocation |
| OAuth tokens | Revoke all tokens, force reauthentication | Users must re-authenticate |
| Encryption keys | Rotate keys, re-encrypt data | Old keys cannot decrypt new data |
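As a concrete instance of the cloud IAM row, the AWS CLI follows the create-deploy-delete pattern; a sketch, with app-deployer and OLD_KEY_ID as placeholders:
# 1. Create the replacement access key (IAM users can hold at most two)
aws iam create-access-key --user-name app-deployer
# 2. Deploy the new key to all consumers, then deactivate the old key
aws iam update-access-key --user-name app-deployer --access-key-id OLD_KEY_ID --status Inactive
# 3. After confirming nothing breaks, delete the old key
aws iam delete-access-key --user-name app-deployer --access-key-id OLD_KEY_ID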
Rotation process:
- Inventory exposed credentials: Based on scope assessment, list all credentials the compromised component could have accessed.
- Prioritize by sensitivity: Rotate production credentials before development; external-facing before internal; privileged before standard.
- Coordinate rotation: Many credentials require coordinated updates across multiple systems. Plan rotation to minimize outage windows:
# Example coordinated rotation for a database password
# 1. Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)
# 2. Update the database to accept both old and new passwords
#    (MySQL 8.0.14+ dual passwords; databases without this feature need a brief cutover window)
ALTER USER 'app_user' IDENTIFIED BY 'new_password' RETAIN CURRENT PASSWORD;
# 3. Update all application configurations
kubectl create secret generic db-creds --from-literal=password=$NEW_PASSWORD --dry-run=client -o yaml | kubectl apply -f -
# 4. Restart applications to pick up the new secret
kubectl rollout restart deployment/app
# 5. Verify applications are working with the new password
# 6. Remove the old password from the database
ALTER USER 'app_user' DISCARD OLD PASSWORD;
- Document rotations: Record what was rotated, when, and by whom. This supports incident timeline reconstruction and compliance requirements.
Access review:
Beyond rotating credentials, review what access the compromised component had and whether that access was appropriate:
- Did the application need access to all those environment variables?
- Were secrets scoped appropriately, or did broad access expose unrelated credentials?
- Could least-privilege principles reduce future exposure?
Use the incident as an opportunity to tighten access controls for the future; an audit like the sketch below is one starting point.
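In Kubernetes environments, one such audit enumerates which secrets each workload injects as environment variables; a sketch using kubectl and jq (it covers envFrom references, not per-variable secretKeyRef entries):
# For each deployment, list the secrets injected via envFrom
kubectl get deployments --all-namespaces -o json | jq -r '
  .items[]
  | "\(.metadata.namespace)/\(.metadata.name): " +
    ([.spec.template.spec.containers[].envFrom[]?.secretRef.name]
     | map(select(. != null)) | join(", "))'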
Patching Affected Systems at Scale¶
Enterprise environments may have hundreds or thousands of systems requiring remediation. Manual patching does not scale; automated approaches are essential.
Mass patching strategies:
- Dependency update automation: Tools like Dependabot and Renovate can create pull requests across multiple repositories simultaneously. Configure them to prioritize the compromised package:
# Renovate: Force immediate update for security issue
{
  "packageRules": [
    {
      "matchPackageNames": ["compromised-package"],
      "matchCurrentVersion": "1.2.3",
      "allowedVersions": ">=1.2.4",
      "schedule": ["at any time"],
      "automerge": true,
      "ignoreTests": true
    }
  ]
}
- Centralized build system updates: If you use a monorepo or centralized build system, update the dependency once and rebuild all affected components:
# Monorepo: Update and rebuild all affected packages
npm update compromised-package --workspaces
npm run build --workspaces
- Container base image updates: If the compromise is in a base image, update the base and rebuild all derived images:
# Find all Dockerfiles using compromised base
grep -r "FROM compromised-base:1.2.3" --include="Dockerfile" .
# Update and rebuild
find . -name Dockerfile -exec sed -i 's/compromised-base:1.2.3/compromised-base:1.2.4/g' {} \;
- Configuration management: Use Ansible, Puppet, Chef, or similar tools to push updates across server fleets:
# Ansible: Update package across all hosts
- hosts: affected_servers
tasks:
- name: Remove compromised package
pip:
name: compromised-package
version: "1.2.3"
state: absent
- name: Install clean version
pip:
name: compromised-package
version: "1.2.4"
state: present
Tracking remediation progress:
Maintain visibility into remediation status across your environment:
- Dashboard showing affected systems and their remediation status
- Automated scanning to detect remaining instances of compromised versions
- Regular status reports to incident commander and leadership
# Script to check remediation progress
for repo in $(cat affected-repos.txt); do
  version=$(gh api repos/$repo/contents/package-lock.json --jq '.content' | base64 -d | jq -r '.packages["node_modules/compromised-package"].version')
  echo "$repo: $version"
done | tee remediation-status.txt
Verifying Recovery¶
How do you know recovery is complete? Verification ensures that remediation was effective and no compromise remnants remain.
Recovery verification methods:
- Dependency verification: Confirm no instance of the compromised version remains:
# Scan all repositories for the compromised version (lockfile v2+ layout)
for repo in $(cat all-repos.txt); do
  if jq -e '.packages["node_modules/compromised-package"].version == "1.2.3"' \
      "$repo/package-lock.json" >/dev/null 2>&1; then
    echo "STILL AFFECTED: $repo"
  fi
done
# Scan all deployed container images
for image in $(cat deployed-images.txt); do
  trivy image --list-all-pkgs $image 2>/dev/null | grep -q "compromised-package.*1.2.3" && echo "STILL AFFECTED: $image"
done
- Credential verification: Confirm old credentials no longer work:
# Test that old API key is rejected
curl -H "Authorization: Bearer $OLD_API_KEY" https://api.service.example.com/test
# Should return 401 Unauthorized
# Verify old database password fails
psql "postgresql://app:$OLD_PASSWORD@db.example.com/app" -c "SELECT 1"
# Should fail authentication
- Behavioral verification: Monitor recovered systems for signs of continued compromise (see the sketch after this list):
  - Unexpected network connections
  - Unusual process execution
  - Unauthorized access attempts using rotated credentials
- Integrity verification: Compare recovered systems against known-good baselines:
# Compare the deployed container against the known-good image by image ID
docker pull myregistry.example.io/app:known-good
docker inspect --format '{{.Id}}' myregistry.example.io/app:known-good
docker inspect --format '{{.Id}}' myregistry.example.io/app:current
# The two image IDs should match (docker save tarballs embed tag names, so hashing them differs even for identical images)
- Independent review: Have someone not involved in remediation verify the recovery. Fresh eyes catch things the remediation team may have overlooked.
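For the behavioral checks, even a simple snapshot-and-diff of outbound connections can surface unexpected destinations; a minimal sketch for a Linux host, assuming a reviewed, sorted allowlist in known-destinations.txt (hypothetical file):
# Snapshot established outbound TCP peers
ss -tn state established | awk 'NR>1 {print $4}' | sort -u > current-destinations.txt
# Anything not on the allowlist deserves investigation
comm -23 current-destinations.txt known-destinations.txt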
Recovery milestones and timeline:
Document expected milestones and track progress:
| Milestone | Criteria | Target |
|---|---|---|
| Containment complete | No new systems can be infected | T+4 hours |
| Scope fully identified | All affected systems catalogued | T+24 hours |
| Critical credentials rotated | Production, external-facing, privileged | T+48 hours |
| All credentials rotated | Complete rotation per checklist | T+1 week |
| Systems rebuilt | All affected production systems clean | T+2 weeks |
| Verification complete | No compromised versions detected | T+2 weeks |
| Recovery declared | Leadership sign-off | T+3 weeks |
Timelines vary based on incident scope; major supply chain compromises may take months to fully remediate.
Regression Prevention¶
Recovery should leave you more secure than before. Implement controls to prevent similar incidents:
- Dependency monitoring: If not already in place, deploy tools like Socket, Snyk, or Dependabot to alert on suspicious dependency changes
- Lockfile enforcement: Require lockfiles and enforce their use in CI/CD (see the example after this list)
- SBOM generation: Generate SBOMs for all releases to accelerate future scope assessment
- Reduced dependency footprint: Remove unnecessary dependencies identified during the incident
- Improved secret hygiene: Reduce secrets exposed to build processes; use short-lived credentials where possible
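For lockfile enforcement, the simplest control is an install command that fails when the lockfile and manifest drift; a sketch for an npm project in a GitHub Actions workflow (step name illustrative):
# CI step: npm ci refuses to run without a lockfile that matches package.json
- name: Install with locked dependencies
  run: npm ci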
Document lessons learned and update incident response playbooks based on what worked and what did not during this recovery. The goal is not just to recover from this incident but to be better prepared for the next one.