19.3 Recovery and Remediation¶
Containment stops the bleeding; recovery heals the wound. After the Codecov breach, affected organizations faced a daunting remediation task: identify every secret that had transited through compromised CI/CD pipelines over two months, rotate every one of those credentials, rebuild affected systems from clean sources, and verify that no backdoors or persistence mechanisms remained. HashiCorp, one of the affected companies, published a security advisory describing their remediation work, which included rotating and replacing their GPG signing key used to verify software releases.
Recovery from a supply chain compromise requires more than simply updating a dependency version. The compromised component may have installed persistence mechanisms, exfiltrated credentials now in attacker hands, or modified other system components. Effective recovery addresses all potential impacts: removing the malicious code, validating its replacement, rebuilding affected systems, rotating exposed credentials, and verifying that the recovery is complete. Rushing this process or missing components leads to reinfection or continued exploitation through stolen credentials.
Removing Compromised Dependencies¶
The first recovery step eliminates the compromised component from your environment. This sounds straightforward but involves several considerations.
Direct dependency removal:
For direct dependencies, update your package manifest to a known-good version or remove the package entirely:
// package.json - before
{
  "dependencies": {
    "compromised-package": "^1.2.3"
  }
}

// package.json - after (pinned to last known-good version)
{
  "dependencies": {
    "compromised-package": "1.2.2"
  }
}
# Regenerate lockfile after manifest change
rm package-lock.json
npm install
# Verify the compromised version is gone
npm list compromised-package
Transitive dependency removal:
When the compromised package is a transitive dependency, you cannot simply remove it from your manifest. Options include:
- Override the transitive dependency: Most package managers support version overrides (a Cargo example follows; an npm equivalent appears after this list):
# Cargo patch
[patch.crates-io]
compromised-crate = { git = "https://github.com/owner/compromised-crate", branch = "safe" }
- Update the parent dependency: If the parent package has released a version that removes or fixes the compromised transitive dependency, update to that version.
- Replace the parent dependency: If the parent package remains compromised or unmaintained, find an alternative that provides similar functionality.
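As a companion to the Cargo example, npm (8.3 and later) supports a top-level overrides field in package.json for forcing a transitive dependency to a safe version; a minimal sketch, assuming 1.2.4 is a known-safe release:
// package.json - force a safe version of a transitive dependency (npm 8.3+)
{
  "overrides": {
    "compromised-package": "1.2.4"
  }
}
Regenerate the lockfile after adding the override so the new resolution actually takes effect.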
Cached and vendored copies:
Dependencies often exist in multiple locations beyond your manifest:
- CI/CD caches (GitHub Actions cache, npm cache, pip cache)
- Container layer caches in registries
- Vendored dependencies in source repositories
- Local developer machine caches
- Build artifact storage
Clear all caches that might contain the compromised version:
# Clear npm cache
npm cache clean --force
# Clear pip cache
pip cache purge
# Invalidate GitHub Actions cache
# (requires creating new cache key or manually deleting via API)
# Clear Docker build cache
docker builder prune --all
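For the GitHub Actions cache specifically, recent versions of the GitHub CLI (2.32 and later) can delete caches directly rather than waiting on a new cache key; a sketch, with owner/repo standing in for your repository:
# List existing caches, then delete them all (GitHub CLI 2.32+)
gh cache list --repo owner/repo
gh cache delete --all --repo owner/repo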
Validating the Integrity of Replacements¶
Before deploying replacement components, verify they are legitimate and uncompromised. The replacement version might also be malicious if the attacker maintained access to the project or registry.
Validation steps:
- Verify publisher identity: Confirm the replacement was published by the legitimate maintainer, not the attacker. Check whether the maintainer account shows signs of compromise (recent password changes, new MFA enrollment, unusual publishing patterns). See the sketch after this list.
- Review the changes: Examine the differences between the compromised version and the replacement. A legitimate fix should show minimal, targeted changes that remove the malicious code.
- Check signatures and provenance: If the package supports Sigstore or other signing, verify signatures come from expected identities:
# Verify npm provenance
npm audit signatures
# Verify container image signature (using cosign: https://github.com/sigstore/cosign)
cosign verify myregistry.io/image:tag \
  --certificate-identity "expected-identity" \
  --certificate-oidc-issuer "expected-issuer"
- Scan the replacement: Run security scanners against the replacement to ensure it does not contain known malicious patterns:
# Scan with multiple tools for confidence
# Trivy: https://trivy.dev/
trivy fs ./node_modules/replacement-package
# Socket: https://socket.dev/
socket scan ./node_modules/replacement-package
- Test in isolation: Deploy the replacement in an isolated environment and monitor for suspicious behavior before production deployment.
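For the first two steps, registry metadata and tarball diffs are often enough to spot anomalies; a sketch using the npm registry, assuming 1.2.2 is the last known-good version and 1.2.4 the proposed replacement:
# Inspect publish timestamps and maintainer list for anomalies
npm view compromised-package time --json
npm view compromised-package maintainers
# Download both tarballs and diff the unpacked contents
npm pack compromised-package@1.2.2 compromised-package@1.2.4
tar -xzf compromised-package-1.2.2.tgz && mv package v1.2.2
tar -xzf compromised-package-1.2.4.tgz && mv package v1.2.4
diff -r v1.2.2 v1.2.4
Note that a published tarball can differ from the repository source, so diff what the registry actually serves, not just the Git history.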
When no safe replacement exists:
Sometimes no uncompromised version is available: the attacker may have compromised multiple versions, or the maintainer may be unresponsive. Options include:
- Fork and fix: Clone the last known-good source, apply necessary security fixes, and host internally (see the manifest sketch after this list)
- Find alternatives: Identify alternative packages that provide similar functionality
- Implement internally: For simple functionality, implement it directly rather than depending on external code
- Accept temporary degradation: Disable the functionality that required the compromised dependency until a safe option exists
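For the fork-and-fix route, most manifests can point a dependency at an internal fork instead of the public registry; a minimal npm sketch, with the Git host URL and branch name as placeholders:
// package.json - consume an internally hosted fork
{
  "dependencies": {
    "compromised-package": "git+https://git.internal.example/forks/compromised-package.git#security-fix"
  }
}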
Rebuilding from Known-Good Sources¶
Systems that executed the compromised code may have been modified beyond simply loading the malicious dependency. Rebuilding from known-good sources ensures no persistence mechanisms remain.
Identifying known-good sources:
The challenge is determining what constitutes "known-good." Consider:
- Source code: Git commits from before the compromise window should be unaffected, assuming the compromise did not include repository access (see the git example after this list)
- Container images: Images built before the compromise, verified by build timestamps and build logs
- Infrastructure as code: Terraform state, Kubernetes manifests, and configuration from before the compromise
- Backups: System backups from before the compromise, though these may lack recent legitimate changes
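To pin a rebuild to a pre-compromise commit, git can resolve the last commit before the compromise window opened; a sketch, assuming the window started on 2021-01-31 and the default branch is main:
# Find the last commit on main before the compromise window
GOOD_COMMIT=$(git rev-list -1 --before="2021-01-31" main)
# Verify its signature if your team signs commits
git verify-commit "$GOOD_COMMIT" 2>/dev/null || echo "commit is not signed"
# Build from that commit in a clean worktree
git worktree add /tmp/clean-build "$GOOD_COMMIT"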
Rebuild strategies:
- Container-based workloads: Rebuild container images from source using clean base images and verified dependencies. Do not simply update the running container:
# Don't just repoint the deployment at a new tag (reuses cached, potentially compromised layers)
kubectl set image deployment/app app=newimage:tag
# Do this instead (full rebuild from clean sources, then roll out)
docker build --no-cache -t myregistry.io/app:clean .
docker push myregistry.io/app:clean
kubectl set image deployment/app app=myregistry.io/app:clean
kubectl rollout status deployment/app # Confirm fresh pods are running
- Virtual machines and servers: For systems where full rebuild is impractical, consider:
  - Restore from pre-compromise backup, then apply legitimate changes
  - Reinstall operating system and redeploy application
  - Use configuration management to enforce known-good state
- Developer workstations: If developer machines were affected (through compromised development tools), consider reprovisioning. At minimum, clear all dependency caches and reinstall development tools from verified sources.
- CI/CD infrastructure: Build systems are high-value targets. Rebuild CI/CD runners, clear all caches, rotate all secrets, and verify runner configurations:
# Force fresh runner environment (GitHub Actions)
runs-on: ubuntu-latest
env:
  # New cache key forces fresh cache
  CACHE_VERSION: v2-post-incident
Credential Rotation and Access Review¶
Compromised components often have access to credentials through environment variables, configuration files, mounted secrets, or cloud provider metadata services. All potentially exposed credentials must be rotated.
Credential rotation checklist:
| Credential Type | Rotation Method | Verification |
|---|---|---|
| API keys | Generate new key, update consumers, revoke old | Confirm old key returns 401 |
| Database passwords | Update password, update connection strings | Test database connectivity |
| Cloud IAM keys | Create new key, deploy, delete old | Audit log shows only new key |
| JWT signing keys | Generate new keys, deploy, invalidate old tokens | Old JWTs fail validation |
| SSH keys | Generate new keypairs, update authorized_keys | Old keys cannot authenticate |
| TLS certificates | Issue new certificates, deploy, revoke old | OCSP/CRL shows revocation |
| OAuth tokens | Revoke all tokens, force reauthentication | Users must re-authenticate |
| Encryption keys | Rotate keys, re-encrypt data | Old keys cannot decrypt new data |
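As a concrete instance of the cloud IAM row, the AWS CLI follows the create-deploy-delete pattern; a sketch, with app-deployer and OLD_KEY_ID as placeholders:
# 1. Create the replacement access key (IAM users can hold at most two)
aws iam create-access-key --user-name app-deployer
# 2. Deploy the new key to all consumers, then deactivate the old key
aws iam update-access-key --user-name app-deployer --access-key-id OLD_KEY_ID --status Inactive
# 3. After confirming nothing breaks, delete the old key
aws iam delete-access-key --user-name app-deployer --access-key-id OLD_KEY_ID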
Rotation process:
- Inventory exposed credentials: Based on scope assessment, list all credentials the compromised component could have accessed.
- Prioritize by sensitivity: Rotate production credentials before development; external-facing before internal; privileged before standard.
- Coordinate rotation: Many credentials require coordinated updates across multiple systems. Plan rotation to minimize outage windows:
# Example coordinated rotation for a database password
# 1. Generate new password
NEW_PASSWORD=$(openssl rand -base64 32)
# 2. Update the database to accept both old and new passwords
#    (MySQL 8.0.14+ dual passwords; databases without this feature need a brief cutover window)
ALTER USER 'app_user' IDENTIFIED BY 'new_password' RETAIN CURRENT PASSWORD;
# 3. Update all application configurations
kubectl create secret generic db-creds --from-literal=password=$NEW_PASSWORD --dry-run=client -o yaml | kubectl apply -f -
# 4. Restart applications to pick up the new secret
kubectl rollout restart deployment/app
# 5. Verify applications are working with the new password
# 6. Remove the old password from the database
ALTER USER 'app_user' DISCARD OLD PASSWORD;
- Document rotations: Record what was rotated, when, and by whom. This supports incident timeline reconstruction and compliance requirements.
Access review:
Beyond rotating credentials, review what access the compromised component had and whether that access was appropriate:
- Did the application need access to all those environment variables?
- Were secrets scoped appropriately, or did broad access expose unrelated credentials?
- Could least-privilege principles reduce future exposure?
Use the incident as an opportunity to tighten access controls for the future; an audit like the sketch below is one starting point.
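In Kubernetes environments, one such audit enumerates which secrets each workload injects as environment variables; a sketch using kubectl and jq (it covers envFrom references, not per-variable secretKeyRef entries):
# For each deployment, list the secrets injected via envFrom
kubectl get deployments --all-namespaces -o json | jq -r '
  .items[]
  | "\(.metadata.namespace)/\(.metadata.name): " +
    ([.spec.template.spec.containers[].envFrom[]?.secretRef.name]
     | map(select(. != null)) | join(", "))'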
Patching Affected Systems at Scale¶
Enterprise environments may have hundreds or thousands of systems requiring remediation. Manual patching does not scale; automated approaches are essential.
Mass patching strategies:
- Dependency update automation: Tools like Dependabot and Renovate can create pull requests across multiple repositories simultaneously. Configure them to prioritize the compromised package:
# Renovate: Force immediate update for security issue
{
  "packageRules": [
    {
      "matchPackageNames": ["compromised-package"],
      "matchCurrentVersion": "1.2.3",
      "allowedVersions": ">=1.2.4",
      "schedule": ["at any time"],
      "automerge": true,
      "ignoreTests": true
    }
  ]
}
- Centralized build system updates: If you use a monorepo or centralized build system, update the dependency once and rebuild all affected components:
# Monorepo: Update and rebuild all affected packages
npm update compromised-package --workspaces
npm run build --workspaces
- Container base image updates: If the compromise is in a base image, update the base and rebuild all derived images:
# Find all Dockerfiles using compromised base
grep -r "FROM compromised-base:1.2.3" --include="Dockerfile" .
# Update and rebuild
find . -name Dockerfile -exec sed -i 's/compromised-base:1.2.3/compromised-base:1.2.4/g' {} \;
- Configuration management: Use Ansible, Puppet, Chef, or similar tools to push updates across server fleets:
# Ansible: Update package across all hosts
- hosts: affected_servers
tasks:
- name: Remove compromised package
pip:
name: compromised-package
version: "1.2.3"
state: absent
- name: Install clean version
pip:
name: compromised-package
version: "1.2.4"
state: present
Tracking remediation progress:
Maintain visibility into remediation status across your environment:
- Dashboard showing affected systems and their remediation status
- Automated scanning to detect remaining instances of compromised versions
- Regular status reports to incident commander and leadership
# Script to check remediation progress
for repo in $(cat affected-repos.txt); do
  version=$(gh api repos/$repo/contents/package-lock.json --jq '.content' | base64 -d | jq -r '.packages["node_modules/compromised-package"].version')
  echo "$repo: $version"
done | tee remediation-status.txt
Verifying Recovery¶
How do you know recovery is complete? Verification ensures that remediation was effective and no compromise remnants remain.
Recovery verification methods:
- Dependency verification: Confirm no instance of the compromised version remains:
# Scan all repositories for the compromised version (lockfile v2+ layout)
for repo in $(cat all-repos.txt); do
  if jq -e '.packages["node_modules/compromised-package"].version == "1.2.3"' \
      "$repo/package-lock.json" >/dev/null 2>&1; then
    echo "STILL AFFECTED: $repo"
  fi
done
# Scan all deployed container images
for image in $(cat deployed-images.txt); do
  trivy image --list-all-pkgs $image 2>/dev/null | grep -q "compromised-package.*1.2.3" && echo "STILL AFFECTED: $image"
done
- Credential verification: Confirm old credentials no longer work:
# Test that old API key is rejected
curl -H "Authorization: Bearer $OLD_API_KEY" https://api.service.example.com/test
# Should return 401 Unauthorized
# Verify old database password fails
psql "postgresql://app:$OLD_PASSWORD@db.example.com/app" -c "SELECT 1"
# Should fail authentication
- Behavioral verification: Monitor recovered systems for signs of continued compromise (see the sketch after this list):
  - Unexpected network connections
  - Unusual process execution
  - Unauthorized access attempts using rotated credentials
- Integrity verification: Compare recovered systems against known-good baselines:
# Compare the deployed container against the known-good image by image ID
docker pull myregistry.example.io/app:known-good
docker inspect --format '{{.Id}}' myregistry.example.io/app:known-good
docker inspect --format '{{.Id}}' myregistry.example.io/app:current
# The two image IDs should match (docker save tarballs embed tag names, so hashing them differs even for identical images)
- Independent review: Have someone not involved in remediation verify the recovery. Fresh eyes catch things the remediation team may have overlooked.
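For the behavioral checks, even a simple snapshot-and-diff of outbound connections can surface unexpected destinations; a minimal sketch for a Linux host, assuming a reviewed, sorted allowlist in known-destinations.txt (hypothetical file):
# Snapshot established outbound TCP peers
ss -tn state established | awk 'NR>1 {print $4}' | sort -u > current-destinations.txt
# Anything not on the allowlist deserves investigation
comm -23 current-destinations.txt known-destinations.txt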
Recovery milestones and timeline:
Document expected milestones and track progress:
| Milestone | Criteria | Target |
|---|---|---|
| Containment complete | No new systems can be infected | T+4 hours |
| Scope fully identified | All affected systems catalogued | T+24 hours |
| Critical credentials rotated | Production, external-facing, privileged | T+48 hours |
| All credentials rotated | Complete rotation per checklist | T+1 week |
| Systems rebuilt | All affected production systems clean | T+2 weeks |
| Verification complete | No compromised versions detected | T+2 weeks |
| Recovery declared | Leadership sign-off | T+3 weeks |
Timelines vary based on incident scope; major supply chain compromises may take months to fully remediate.
Regression Prevention¶
Recovery should leave you more secure than before. Implement controls to prevent similar incidents:
- Dependency monitoring: If not already in place, deploy tools like Socket, Snyk, or Dependabot to alert on suspicious dependency changes
- Lockfile enforcement: Require lockfiles and enforce their use in CI/CD (see the example after this list)
- SBOM generation: Generate SBOMs for all releases to accelerate future scope assessment
- Reduced dependency footprint: Remove unnecessary dependencies identified during the incident
- Improved secret hygiene: Reduce secrets exposed to build processes; use short-lived credentials where possible
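For lockfile enforcement, the simplest control is an install command that fails when the lockfile and manifest drift; a sketch for an npm project in a GitHub Actions workflow (step name illustrative):
# CI step: npm ci refuses to run without a lockfile that matches package.json
- name: Install with locked dependencies
  run: npm ci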
Document lessons learned and update incident response playbooks based on what worked and what did not during this recovery. The goal is not just to recover from this incident but to be better prepared for the next one.