# LLM Token Limiter

- Canonical URL: https://docs.fairvisor.com/docs/algorithms/llm-limiter/
- Section: docs
> TPM/TPD token limiter with pessimistic reservation, reconciliation, and optional streaming enforcement.


`token_bucket_llm` is the algorithm for **token-based LLM governance**: it enforces per-minute (TPM) and optional per-day (TPD) token budgets per limiter key, reserves tokens pessimistically at request time, and refunds unused tokens after the response. Use it on OpenAI-compatible endpoints to control cost by tokens rather than request count.

## How it works

At decision time, the limiter runs in this order:

1. Estimate prompt tokens
2. Determine completion reservation
3. Compute `estimated_total = prompt_tokens + reserved_completion`
4. Enforce prompt/request caps
5. Consume TPM budget
6. Consume TPD budget (if configured)
7. Allow or reject
8. Reconcile later (refund unused tokens)

<div class="callout callout-note">
  <span class="callout-icon">ℹ️</span>
  <div><p><strong>Ordering detail:</strong> TPM is checked first. If TPM passes but TPD fails, the TPM reservation is refunded immediately — you do not leak minute budget on a TPD rejection.</p></div>
</div>
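
The decision order above, including the TPM-then-TPD refund detail from the callout, can be sketched as a single function. This is a minimal Python model with a toy in-memory `Budget` class standing in for the nginx shared dict; the names `Budget` and `decide` are illustrative, not the runtime's actual API.

```python
class Budget:
    """Toy in-memory token budget; the real limiter uses a shared dict."""
    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def try_consume(self, n: int) -> bool:
        if self.used + n > self.limit:
            return False
        self.used += n
        return True

    def refund(self, n: int) -> None:
        self.used = max(0, self.used - n)


def decide(prompt_tokens, reserved_completion, tpm, tpd=None,
           max_prompt_tokens=None, max_tokens_per_request=None):
    estimated_total = prompt_tokens + reserved_completion
    # Per-request caps are enforced before any budget is touched.
    if max_prompt_tokens is not None and prompt_tokens > max_prompt_tokens:
        return "prompt_tokens_exceeded"
    if (max_tokens_per_request is not None
            and estimated_total > max_tokens_per_request):
        return "max_tokens_per_request_exceeded"
    # TPM first; refund it if TPD then fails, so no minute budget leaks.
    if not tpm.try_consume(estimated_total):
        return "tpm_exceeded"
    if tpd is not None and not tpd.try_consume(estimated_total):
        tpm.refund(estimated_total)
        return "tpd_exceeded"
    return "allowed"
```

Note how a `tpd_exceeded` outcome leaves the TPM counter exactly where it started.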

## Reservation model

The limiter is intentionally pessimistic before upstream generation starts.

```text
reserved_completion = request.max_tokens (if present and > 0)
                      else default_max_completion

if max_completion_tokens is set:
  reserved_completion = min(reserved_completion, max_completion_tokens)

estimated_total = prompt_tokens + reserved_completion
```

Why this matters:

- You block runaway requests before they hit the model
- You keep budgets safe even when actual completion size is unknown
- Reconciliation later corrects over-reservation
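
The reservation pseudocode above translates directly into Python; the function name `reserved_completion` is illustrative.

```python
def reserved_completion(request_max_tokens=None, default_max_completion=1000,
                        max_completion_tokens=None):
    # Pessimistic reservation: trust the request's max_tokens if usable,
    # otherwise fall back to the configured default.
    if request_max_tokens is not None and request_max_tokens > 0:
        reserved = request_max_tokens
    else:
        reserved = default_max_completion
    # A configured max_completion_tokens clamps whatever was requested.
    if max_completion_tokens is not None:
        reserved = min(reserved, max_completion_tokens)
    return reserved
```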

## Prompt token estimation

Supported estimators:

- `simple_word` (default)
- `header_hint`

### `simple_word`

- Fast heuristic: approximately `ceil(chars / 4)`
- If request body has a `"messages"` array, it prefers `content` fields
- Body scan is capped at 1 MiB for hot-path performance

### `header_hint`

- Reads `X-Token-Estimate` from request headers (case-insensitive)
- If header is absent/invalid, falls back to `simple_word`
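
Both estimators can be sketched in a few lines of Python. This is a simplified model: the real hot path avoids a full JSON decode, whereas this sketch uses `json.loads` for clarity. The 1 MiB cap mirrors the documented bound.

```python
import json
import math

BODY_SCAN_CAP = 1 << 20  # 1 MiB, matching the documented scan bound


def simple_word(body: bytes) -> int:
    """Heuristic: roughly ceil(chars / 4), preferring message content."""
    body = body[:BODY_SCAN_CAP]
    try:
        messages = json.loads(body).get("messages")
        if isinstance(messages, list):
            text = "".join(str(m.get("content", "")) for m in messages)
            return math.ceil(len(text) / 4)
    except (ValueError, AttributeError):
        pass  # not JSON (or not an object): fall through to raw length
    return math.ceil(len(body) / 4)


def header_hint(headers: dict, body: bytes) -> int:
    """Use X-Token-Estimate when present and valid, else simple_word."""
    for name, value in headers.items():
        if name.lower() == "x-token-estimate":
            try:
                n = int(value)
                if n > 0:
                    return n
            except ValueError:
                break  # invalid header value: fall back
    return simple_word(body)
```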

## State

State lives in `ngx.shared.fairvisor_counters`.

Key format:

- TPM: `tpm:{limit_key}`
- TPD: `tpd:{limit_key}:{YYYYMMDD}` (UTC date key)

Reset behavior:

- TPM refills continuously (token-bucket semantics)
- TPD resets at midnight UTC
- On TPD rejection, `Retry-After` is seconds until next UTC midnight
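
The TPD key scheme and its `Retry-After` computation can be expressed as follows; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def tpd_key(limit_key: str, now: datetime) -> str:
    # Day bucket keyed by UTC date: tpd:{limit_key}:{YYYYMMDD}
    return f"tpd:{limit_key}:{now.astimezone(timezone.utc):%Y%m%d}"


def seconds_until_utc_midnight(now: datetime) -> int:
    # Retry-After for a TPD rejection: time left in the current UTC day.
    now = now.astimezone(timezone.utc)
    midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return int((midnight - now).total_seconds())
```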

## Configuration

```json
{
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 120000,
    "tokens_per_day": 2000000,
    "burst_tokens": 120000,
    "max_tokens_per_request": 8192,
    "max_prompt_tokens": 4096,
    "max_completion_tokens": 4096,
    "default_max_completion": 1000,
    "token_source": {
      "estimator": "header_hint"
    },
    "streaming": {
      "enabled": true,
      "enforce_mid_stream": true,
      "buffer_tokens": 100,
      "on_limit_exceeded": "graceful_close",
      "include_partial_usage": true
    }
  }
}
```

| Field | Required | Default | Rules / notes |
|---|---|---|---|
| `tokens_per_minute` | yes | - | Positive number |
| `tokens_per_day` | no | unset | Positive number |
| `burst_tokens` | no | `tokens_per_minute` | Must be `>= tokens_per_minute` |
| `max_tokens_per_request` | no | unset | Positive number |
| `max_prompt_tokens` | no | unset | Positive number |
| `max_completion_tokens` | no | unset | Positive number; clamps requested `max_tokens` |
| `default_max_completion` | no | `1000` | Used when request has no usable `max_tokens` |
| `token_source.estimator` | no | `simple_word` | `simple_word`, `header_hint` |
| `streaming.*` | no | enabled defaults | Mid-stream SSE enforcement controls |

## Reconciliation (refund unused tokens)

After response completion, Fairvisor computes:

```text
refund = estimated_total - actual_total
```

If `refund > 0`, it credits tokens back to TPM and TPD.
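
The refund rule is a one-liner; per the rule above, a credit is applied only when the reservation exceeded actual usage.

```python
def compute_refund(estimated_total: int, actual_total: int) -> int:
    """Tokens to credit back to TPM (and TPD, if configured)."""
    # Only positive over-reservation is refunded; nothing happens
    # when actual usage met or exceeded the estimate.
    return max(0, estimated_total - actual_total)
```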

### Non-streaming responses

Actual usage is extracted from response JSON (typically `usage.*`).

If usage cannot be extracted (parse errors, oversized body, missing usage path), runtime falls back safely and marks fallback in internal accounting instead of failing requests.
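
A safe extraction routine might look like the sketch below. The `usage.total_tokens` path and the response-size bound are assumptions for illustration (the document specifies the 1 MiB cap only for the request-side body scan); returning `None` signals the fallback path rather than a request failure.

```python
import json

MAX_BODY = 1 << 20  # assumed bound, mirroring the request-side 1 MiB cap


def extract_actual_total(response_body: bytes):
    """Return usage.total_tokens, or None to signal safe fallback."""
    if len(response_body) > MAX_BODY:
        return None  # oversized body: skip parsing, mark fallback
    try:
        usage = json.loads(response_body).get("usage") or {}
        total = usage.get("total_tokens")
        return total if isinstance(total, int) and total >= 0 else None
    except (ValueError, AttributeError):
        return None  # parse error: never fail the request over accounting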

### Streaming responses

For SSE flows, reconciliation occurs in streaming body-filter completion logic. See [Streaming Enforcement](/docs/reference/streaming/).

## Rejection reasons

- `prompt_tokens_exceeded` — prompt estimate is above `max_prompt_tokens`
- `max_tokens_per_request_exceeded` — prompt + reserved completion exceeds `max_tokens_per_request`
- `tpm_exceeded` — per-minute budget exhausted
- `tpd_exceeded` — per-day budget exhausted

## Response headers

On allowed requests (in `enforce` mode):

```
RateLimit-Limit: 120000
RateLimit-Remaining: 87432
RateLimit-Reset: 23
```

On rejection:

```
HTTP 429 Too Many Requests
Retry-After: 23
X-Fairvisor-Reason: tpm_exceeded
```

For TPD rejection, `Retry-After` is seconds until next UTC midnight:

```
HTTP 429 Too Many Requests
Retry-After: 86400
X-Fairvisor-Reason: tpd_exceeded
```
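
On the client side, the two rejection classes can be told apart via `X-Fairvisor-Reason`. This hedged sketch shows one possible policy (function name and the 300-second threshold are arbitrary choices, not part of Fairvisor): retry on short TPM waits, but fail fast on a TPD rejection that would mean sleeping until the next UTC midnight.

```python
def backoff_seconds(status: int, headers: dict):
    """Return seconds to wait before retrying, or None to give up."""
    if status != 429:
        return None
    retry_after = int(headers.get("Retry-After", "1"))
    reason = headers.get("X-Fairvisor-Reason", "")
    # A TPD rejection can mean hours of waiting; many callers prefer
    # to surface the error rather than sleep until UTC midnight.
    if reason == "tpd_exceeded" and retry_after > 300:
        return None
    return retry_after
```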

## Performance notes

Hot-path design points:

- No external dependency required (shared dict only)
- O(1) budget checks per request
- No background refill timers (lazy refill on check)
- Prompt estimation avoids full JSON decode in default path
- Body scan is bounded to 1 MiB to cap CPU cost

## Failure behavior

If the shared dict increment fails for TPM or TPD counters (for example, under dict memory pressure), the algorithm fails open:

- the request is allowed
- the failed budget check is skipped
- the failure is logged for metrics

Traffic is never blocked due to storage failure.
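
The fail-open pattern can be sketched as follows. `dict_incr` is a hypothetical stand-in for the shared-dict increment; rollback of an over-limit increment is elided to keep the sketch short.

```python
def consume_fail_open(dict_incr, key: str, n: int, limit: int, log) -> bool:
    """Attempt a budget increment; on storage failure, allow the request."""
    try:
        new_value = dict_incr(key, n)  # stand-in for the shared-dict incr
    except Exception as exc:  # broad on purpose: any storage error fails open
        log(f"budget incr failed for {key}: {exc}; failing open")
        return True  # storage failure never blocks traffic
    return new_value <= limit  # over-limit rollback elided in this sketch
```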

## Tuning

1. Start with realistic TPM from provider plan and expected concurrency
2. Keep `burst_tokens` near TPM unless you need short spikes
3. Set `max_tokens_per_request` as hard safety rail for prompt-injection/runaway tools
4. Set `max_completion_tokens` to cap tail latency and cost
5. Use `header_hint` only if trusted upstream provides reliable token estimate
6. Monitor the distribution of rejection reasons (`tpm_exceeded` vs `tpd_exceeded`) and adjust budgets
7. Validate estimator quality by comparing reserved vs actual usage over production traffic
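
For step 7, a simple aggregate metric is the ratio of reserved to actual tokens over a sample of requests; the function below is an illustrative monitoring helper, not part of Fairvisor.

```python
def over_reservation_ratio(samples):
    """samples: iterable of (reserved_total, actual_total) pairs.
    Values well above 1.0 suggest default_max_completion (or clients'
    max_tokens) is reserving far more than requests actually use."""
    reserved = sum(r for r, _ in samples)
    actual = sum(a for _, a in samples)
    return reserved / actual if actual else float("inf")
```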

## Example

```json
{
  "name": "chat-llm-budget",
  "limit_keys": ["jwt:org_id"],
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 60000,
    "tokens_per_day": 1200000,
    "burst_tokens": 60000,
    "max_prompt_tokens": 12000,
    "max_completion_tokens": 1500,
    "max_tokens_per_request": 13000,
    "default_max_completion": 800,
    "token_source": { "estimator": "simple_word" }
  }
}
```

This gives each org:

- steady 60k TPM
- 1.2M TPD ceiling
- hard per-request cap to protect from extreme prompts/completions

