# Runbook: Bad Bundle Rollback

- Canonical URL: https://docs.fairvisor.com/docs/cookbook/bad-bundle-rollback-runbook/
> Deterministic rollback procedure when a newly applied bundle causes regressions.


## Purpose / When to use

Use this runbook when a recent policy bundle introduces severe regressions: widespread rejects, incorrect matching, or unstable behavior.

## Blast radius & risk level

- Risk level: high
- Primary risk: prolonged customer impact if the rollback is delayed or breaks monotonic versioning

## Signals / symptoms

- Reject spike immediately after bundle deployment
- Unexpected reject reasons or `RateLimit*` behavior on routes the bundle change was not meant to affect
- Operators report that the active policy version differs from the expected version

## Detection queries

```promql
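# Reject rate per reason over the last 5 minutes
sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
# Rejects caused by an edge having no bundle loaded
rate(fairvisor_decisions_total{action="reject",reason="no_bundle_loaded"}[5m])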
```
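
When no dashboard is at hand, the queries above can also be run from a shell through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch: `PROM_URL`, `prom_query`, and `reject_rate_by_reason` are hypothetical names, and the default address is only the stock Prometheus port, not something this runbook specifies.

```bash
# Assumption: PROM_URL points at your Prometheus server (9090 is just the Prometheus default port).
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query via the Prometheus HTTP API and print the raw JSON response.
prom_query() {
  curl -sS -G --data-urlencode "query=$1" "$PROM_URL/api/v1/query"
}

# Reject rate broken down by reason (the first detection query above).
reject_rate_by_reason() {
  prom_query 'sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))'
}
```

Calling `reject_rate_by_reason` prints the JSON result, which can be piped to a JSON formatter if one is installed.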

Operational checks:

```bash
fairvisor status --edge-url=http://localhost:8080
curl -sS http://localhost:8080/readyz
```
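
During an incident it can help to poll readiness rather than check it once. The sketch below wraps the `/readyz` check in a retry loop; `wait_ready` is a hypothetical helper, and it assumes `/readyz` answers with a non-2xx status while the edge is not ready, which should be confirmed for your deployment.

```bash
# Poll an edge's /readyz endpoint until it succeeds or the attempts run out.
# Arguments: <edge base URL> [max attempts, default 30] [seconds between attempts, default 2]
# Assumption: /readyz returns a non-2xx status while the edge is not ready.
wait_ready() {
  local edge_url="$1" attempts="${2:-30}" delay="${3:-2}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP 4xx/5xx responses.
    if curl -fsS -o /dev/null "$edge_url/readyz"; then
      echo "ready after $i attempt(s): $edge_url"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not ready after $attempts attempt(s): $edge_url" >&2
  return 1
}
```

For example, `wait_ready http://localhost:8080 15 2` gives the edge roughly 30 seconds to report ready before returning a failure.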

## Triage checklist

1. Confirm incident correlates with latest bundle version.
2. Identify last known-good version/hash.
3. Freeze further policy changes until rollback completes.
4. Assign rollback owner and verifier.

## Mitigation playbook

SaaS mode rollback:

1. Revert control plane bundle to last known-good content.
2. Increment the bundle version: the monotonic flow requires the rollback bundle to carry a newer version than the faulty one.
3. Confirm all edges pull and apply rollback bundle.
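
Step 3 can be partly scripted by running the status check from "Detection queries" against every edge. This is a sketch under assumptions: `check_edges` is a hypothetical helper, the edge URLs are placeholders, and it assumes `fairvisor status` exits non-zero when an edge cannot be checked. The output format of `fairvisor status` is not parsed here, so the reported version still has to be compared against the rollback target by eye.

```bash
# Run `fairvisor status` against each edge URL and count the edges whose check failed.
# Assumption: a non-zero exit code from `fairvisor status` means the edge could not be checked.
check_edges() {
  local failed=0 url
  for url in "$@"; do
    echo "== $url"
    if ! fairvisor status --edge-url="$url"; then
      echo "status check failed: $url" >&2
      failed=$((failed + 1))
    fi
  done
  echo "edges checked: $#, failed: $failed"
  # Succeed only if every edge answered the status check.
  [ "$failed" -eq 0 ]
}
```

Usage might look like `check_edges http://edge-1:8080 http://edge-2:8080`, with the real edge addresses substituted.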

Standalone rollback:

```bash
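# Restore the last known-good policy over the active policy file
cp /etc/fairvisor/policy.backup.json /etc/fairvisor/policy.json
# Send SIGHUP to nginx so it reloads and picks up the restored policy
kill -HUP $(pidof nginx)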
```
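
A slightly more defensive variant of the copy step is sketched below. It refuses to restore a missing or empty backup, keeps a timestamped copy of the faulty bundle for the post-incident diff, and swaps the file in with a rename so a reader never sees a half-written policy. `restore_policy` and the `.faulty.<timestamp>` suffix are hypothetical conventions, not part of Fairvisor.

```bash
# Restore a backup policy file over the active one, keeping the faulty file for later analysis.
# Arguments: <active policy path> <backup policy path>
restore_policy() {
  local active="$1" backup="$2"
  # Refuse to roll back to a missing or empty backup.
  if [ ! -s "$backup" ]; then
    echo "backup missing or empty: $backup" >&2
    return 1
  fi
  # Preserve the faulty bundle so it can be diffed after the incident.
  if [ -f "$active" ]; then
    cp -p "$active" "$active.faulty.$(date +%Y%m%dT%H%M%S)"
  fi
  # Write to a temporary file in the same directory, then rename it into place.
  cp "$backup" "$active.tmp" && mv "$active.tmp" "$active"
}
```

After `restore_policy /etc/fairvisor/policy.json /etc/fairvisor/policy.backup.json` succeeds, the `kill -HUP` reload from the block above still has to be sent.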

## Verification checklist

1. Active bundle version/hash matches the expected rollback target.
2. Reject distribution returns near baseline.
3. Critical endpoints validate with synthetic checks.
4. No `no_bundle_loaded` rejects during recovery.
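
For standalone mode, the first check can be approximated by comparing file digests of the active policy and the rollback target. This is a sketch only: `file_sha256` and `verify_rollback` are hypothetical helpers, it assumes `sha256sum` from GNU coreutils is installed, and a file digest is not necessarily the same value as the bundle hash Fairvisor itself reports.

```bash
# Print the SHA-256 digest of a file (assumes GNU coreutils `sha256sum` is available).
file_sha256() {
  if [ ! -r "$1" ]; then
    echo "cannot read: $1" >&2
    return 1
  fi
  # Reading from stdin keeps the file name out of the output.
  sha256sum < "$1" | cut -d' ' -f1
}

# Succeed only when the active policy file has the same digest as the rollback target.
# Arguments: <active policy path> <rollback target path>
verify_rollback() {
  local active_hash target_hash
  active_hash=$(file_sha256 "$1") || return 2
  target_hash=$(file_sha256 "$2") || return 2
  echo "active=$active_hash target=$target_hash"
  [ "$active_hash" = "$target_hash" ]
}
```

Running `verify_rollback /etc/fairvisor/policy.json /etc/fairvisor/policy.backup.json` prints both digests, which can be pasted into the incident record as validation evidence.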

## Exit criteria

- Traffic has stabilized
- Rollback bundle is confirmed across the fleet
- Incident bridge agrees the policy state is safe

## Rollback / recovery path

1. Keep the rollout freeze in place until the root cause is documented.
2. Add a regression test that reproduces the failure pattern.
3. Reattempt the rollout only through a shadow-first path.

## Post-incident notes

Record:

- faulty bundle diff summary
- exact rollback version/hash
- validation evidence and timestamps
- test gap that allowed regression

## Do not

- Do not deploy a "quick fix" bundle before restoring known-good state.
- Do not bypass monotonic version discipline.
- Do not close the incident without adding a regression test.

## Related docs

- [Runbook: Reject Spike](/docs/cookbook/reject-spike-runbook/)
- [Shadow Mode Rollout](/docs/cookbook/shadow-mode-rollout/)
- [Policy Lint Checklist](/docs/reference/policy-lint-checklist/)

