# Runbook: Kill-Switch Incident Response

- Canonical URL: https://docs.fairvisor.com/docs/cookbook/kill-switch-incident-response/
- Section: docs
- Last updated: n/a
> Activate, validate, and safely retire kill-switch entries during active abuse incidents.


## Purpose / When to use

Use this runbook to rapidly block a specific abusive actor, token, tenant, or route when immediate containment is required.

## Blast radius & risk level

- Risk level: high
- Primary risk: over-broad scope causing collateral 429s

## Signals / symptoms

- An active abuse pattern is tied to an identifiable descriptor value
- Containment is needed within minutes, not hours, which calls for a sharp increase in rejects
- Existing throttles are too slow or too permissive

## Detection queries

```promql
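# Reject rate over the last 5 minutes, broken down by reason
sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
# Reject rate attributed to kill-switch entries only
rate(fairvisor_decisions_total{action="reject",reason="kill_switch"}[5m])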
```

Header verification (replay a decision request with the suspect key and inspect the response status and headers):

```bash
curl -i -X POST http://localhost:8080/v1/decision \
  -H 'X-Original-Method: POST' \
  -H 'X-Original-URI: /v1/critical' \
  -H 'x-api-key: <suspect-key>'
```

## Triage checklist

1. Identify the smallest possible scope key/value pair.
2. Decide whether a route-scoped switch is sufficient.
3. Set an explicit incident ID in `reason`.
4. Set a short TTL first (`expires_at`).
5. Assign a rollback owner before deploying.
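
The checklist above can be sketched as a pre-deploy check. This is a minimal illustration, not part of Fairvisor: the function name `triage_problems` and the 4-hour `MAX_FIRST_PASS_TTL` are assumptions, and the field names are taken from the example entry in this runbook. It also assumes `expires_at` carries an explicit UTC offset or `Z`.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical guardrails; tune them to your own incident policy.
MAX_FIRST_PASS_TTL = timedelta(hours=4)
REQUIRED_FIELDS = ("scope_key", "scope_value", "reason", "expires_at")


def triage_problems(entry: dict, now: datetime) -> list[str]:
    """Return the triage-checklist violations found in a kill-switch entry."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if not entry.get(name)]
    if problems:
        return problems

    if "route" not in entry:
        problems.append("no route set: confirm a route-scoped switch is not sufficient")

    # Replace the trailing "Z" so the timestamp parses on Python versions before 3.11.
    expires_at = datetime.fromisoformat(entry["expires_at"].replace("Z", "+00:00"))
    if expires_at <= now:
        problems.append("expires_at is already in the past")
    elif expires_at - now > MAX_FIRST_PASS_TTL:
        problems.append("TTL exceeds the first-pass maximum")
    return problems


entry = {
    "scope_key": "header:x-api-key",
    "scope_value": "key_compromised_123",
    "route": "/v1/critical",
    "reason": "inc-2026-02-20-abuse",
    "expires_at": "2026-02-20T19:00:00Z",
}
print(triage_problems(entry, now=datetime(2026, 2, 20, 17, 0, tzinfo=timezone.utc)))  # prints []
```

An empty list means the entry passes these particular checks; it does not replace a human decision on scope.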

## Mitigation playbook

Minimal scoped entry:

```json
{
  "scope_key": "header:x-api-key",
  "scope_value": "key_compromised_123",
  "route": "/v1/critical",
  "reason": "inc-2026-02-20-abuse",
  "expires_at": "2026-02-20T19:00:00Z"
}
```

Execution order:

1. Add the kill-switch entry to the bundle.
2. Increment `bundle_version`.
3. Deploy the bundle.
4. Verify containment and check for collateral rejects.
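
Steps 1 and 2 might look like the sketch below when the bundle is edited programmatically. The bundle layout is an assumption made for illustration: a top-level `kill_switches` list and an integer `bundle_version`. Check the actual bundle schema before adapting it; `add_kill_switch` is a hypothetical helper, not a Fairvisor API.

```python
import copy


def add_kill_switch(bundle: dict, entry: dict) -> dict:
    """Return a copy of the bundle with the entry appended and the version bumped."""
    updated = copy.deepcopy(bundle)
    # "kill_switches" is an assumed key name; confirm it against your bundle schema.
    updated.setdefault("kill_switches", []).append(dict(entry))
    # Assumes bundle_version is a monotonically increasing integer.
    updated["bundle_version"] = int(bundle["bundle_version"]) + 1
    return updated


bundle = {"bundle_version": 41, "kill_switches": []}
entry = {
    "scope_key": "header:x-api-key",
    "scope_value": "key_compromised_123",
    "route": "/v1/critical",
    "reason": "inc-2026-02-20-abuse",
    "expires_at": "2026-02-20T19:00:00Z",
}
new_bundle = add_kill_switch(bundle, entry)
print(new_bundle["bundle_version"])  # prints 42
```

Working on a copy keeps the previous bundle available unchanged, which simplifies rollback if the deploy has to be reverted.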

## Verification checklist

1. `X-Fairvisor-Reason: kill_switch` appears on responses for the targeted traffic.
2. Reject volume rises only for the targeted scope.
3. Critical non-target flows remain healthy.
4. No descriptor-missing regression on the switch key.
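
Checks 1 to 3 can be approximated by probing once with the targeted key and once with a known-good control key, then comparing the response headers. The sketch below only evaluates headers that have already been captured (for example from the `curl -i` probe in this runbook); `is_kill_switch_reject` and `containment_failures` are hypothetical helper names.

```python
def is_kill_switch_reject(headers: dict) -> bool:
    """True when the response headers carry X-Fairvisor-Reason: kill_switch."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value.strip() for name, value in headers.items()}
    return normalized.get("x-fairvisor-reason") == "kill_switch"


def containment_failures(target_headers: dict, control_headers: dict) -> list[str]:
    """Compare a targeted probe against a control probe and list what failed."""
    failures = []
    if not is_kill_switch_reject(target_headers):
        failures.append("target traffic is not rejected by the kill-switch")
    if is_kill_switch_reject(control_headers):
        failures.append("non-target traffic is hit by the kill-switch (collateral)")
    return failures


target = {"X-Fairvisor-Reason": "kill_switch"}
control = {"Content-Type": "application/json"}
print(containment_failures(target, control))  # prints []
```

A single probe pair is only a spot check; the reject-rate queries in this runbook remain the primary signal for collateral impact.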

## Exit criteria

- Abuse contained
- Downstream systems stable
- Safe permanent policy fix prepared

## Rollback / recovery path

1. Remove the switch or let the TTL expire.
2. Deploy a new bundle version.
3. Confirm the `kill_switch` reject rate returns to its expected baseline.
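
Removing the switch can be sketched as the inverse of adding it: drop every entry whose `reason` matches the incident ID, then bump the version. As before, the `kill_switches` key and the integer `bundle_version` are assumptions about the bundle layout, and `retire_kill_switch` is a hypothetical helper.

```python
import copy


def retire_kill_switch(bundle: dict, incident_reason: str) -> dict:
    """Return a copy of the bundle without entries for the given incident."""
    updated = copy.deepcopy(bundle)
    # "kill_switches" is an assumed key name; confirm it against your bundle schema.
    current = updated.get("kill_switches", [])
    remaining = [e for e in current if e.get("reason") != incident_reason]
    if len(remaining) == len(current):
        # Fail loudly so a typo in the incident ID does not look like a rollback.
        raise ValueError(f"no kill-switch entry found for {incident_reason!r}")
    updated["kill_switches"] = remaining
    updated["bundle_version"] = int(bundle["bundle_version"]) + 1
    return updated


bundle = {
    "bundle_version": 42,
    "kill_switches": [
        {"scope_key": "header:x-api-key", "reason": "inc-2026-02-20-abuse"},
        {"scope_key": "header:x-api-key", "reason": "inc-other"},
    ],
}
rolled_back = retire_kill_switch(bundle, "inc-2026-02-20-abuse")
print(rolled_back["bundle_version"], len(rolled_back["kill_switches"]))  # prints 43 1
```

Matching on `reason` relies on the triage step of putting an explicit incident ID there.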

## Post-incident notes

Record:

- exact scope and TTL used
- collateral findings
- time-to-containment
- permanent control added after incident
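
The scope, TTL, and time-to-containment fields can be derived from the entry and three timestamps. The sketch below is one possible shape for such a record; the function and key names are hypothetical, and it assumes the TTL is measured from the deploy time to `expires_at`.

```python
from datetime import datetime


def parse_utc(timestamp: str) -> datetime:
    """Parse an RFC 3339 timestamp; the replace keeps Python before 3.11 working."""
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def incident_summary(entry: dict, detected_at: str, deployed_at: str, contained_at: str) -> dict:
    """Build the numeric part of the post-incident record."""
    ttl = parse_utc(entry["expires_at"]) - parse_utc(deployed_at)
    time_to_containment = parse_utc(contained_at) - parse_utc(detected_at)
    return {
        "scope": f'{entry["scope_key"]}={entry["scope_value"]}',
        "route": entry.get("route"),
        "ttl_minutes": ttl.total_seconds() / 60,
        "time_to_containment_minutes": time_to_containment.total_seconds() / 60,
    }


summary = incident_summary(
    {
        "scope_key": "header:x-api-key",
        "scope_value": "key_compromised_123",
        "route": "/v1/critical",
        "expires_at": "2026-02-20T19:00:00Z",
    },
    detected_at="2026-02-20T17:00:00Z",
    deployed_at="2026-02-20T17:10:00Z",
    contained_at="2026-02-20T17:15:00Z",
)
print(summary["ttl_minutes"], summary["time_to_containment_minutes"])  # prints 110.0 15.0
```

Collateral findings and the permanent control still need to be written up by hand.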

## Do not

- Do not start with global scope unless incident severity justifies it.
- Do not deploy kill-switches without an expiry on the first pass.
- Do not skip verification of non-target traffic.

## Related docs

- [Kill Switches](/docs/policy/kill-switches/)
- [Runbook: Reject Spike](/docs/cookbook/reject-spike-runbook/)
- [Runbook: Global Shadow + Kill-Switch Override](/docs/cookbook/global-shadow-bypass-runbook/)

