# Runbook: Reject Spike

- Canonical URL: https://docs.fairvisor.com/docs/cookbook/reject-spike-runbook/
- Section: docs
- Last updated: n/a
> Decision-tree triage and mitigation for sudden reject-rate spikes.


## Purpose / When to use

Use this runbook when reject volume rises sharply and customers report increased 429 responses.

## Blast radius & risk level

- Risk level: high
- Primary risk: customer-impacting traffic denial and cascading retries

## Signals / symptoms

- Sharp increase in `fairvisor_decisions_total{action="reject"}`
- Elevated `Retry-After` values
- API clients report sustained 429s

## Detection queries

```promql
sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject"}[5m])
rate(fairvisor_retry_after_bucket_total[5m])
rate(fairvisor_descriptor_missing_total[5m])
```

## Triage checklist

1. Identify top reject reason bucket.
2. Map reason to policy/rule using logs/metrics; for per-request attribution use debug session headers.
3. Confirm whether a bundle rollout preceded the spike.
4. Check descriptor-missing regressions for identity keys.
5. Check if kill-switch recently changed.

Reason map:

- `token_bucket_exceeded` -> burst/rate pressure
- `budget_exceeded` -> quota period exhaustion
- `kill_switch` -> active block entry
- `circuit_breaker_open` -> spend velocity trip

## Mitigation playbook

Safe-first options:

1. Move impacted policy to shadow temporarily.
2. Increase conservative thresholds for hot path.
3. Narrow selectors/scope for over-broad rules.

Aggressive options:

1. Roll back to known-good bundle.
2. Use global shadow incident mode with short TTL.
3. Add targeted kill-switch only if abuse is confirmed.

## Verification checklist

1. Reject rate trends back to baseline.
2. Top reject reason aligns with intended control.
3. No new latency or allow-path regressions.
4. Retry-after distribution returns to expected range.

## Exit criteria

- Reject spike resolved
- Policy delta documented and peer-reviewed
- Follow-up hardening tasks created

## Rollback / recovery path

1. Reapply last known-good bundle version.
2. Confirm `readyz` and policy version consistency.
3. Re-run triage queries for 15-30 minutes.

## Post-incident notes

Record:

- top reason and impacted policy/rule (from logs/debug session)
- mitigation chosen and timing
- customer impact window
- permanent prevention action

## Do not

- Do not tune multiple unrelated policies at once.
- Do not keep emergency overrides without TTL/owner.
- Do not ignore descriptor-missing spikes while tuning limits.

## Related docs

- [Decision Tracing](/docs/reference/decision-tracing/)
- [Runbook: Global Shadow + Kill-Switch Override](/docs/cookbook/global-shadow-bypass-runbook/)
- [Runbook: Bad Bundle Rollback](/docs/cookbook/bad-bundle-rollback-runbook/)

