# Streaming Enforcement

- Canonical URL: https://docs.fairvisor.com/docs/reference/streaming/
- Section: docs
- Last updated: n/a
> How Fairvisor Edge enforces token limits on SSE streaming responses.


Fairvisor Edge can enforce token budgets mid-stream on Server-Sent Events (SSE) responses — the standard format used by OpenAI-compatible LLM APIs. This page covers how streaming enforcement works and how to configure it.

## Overview

Standard rate limiting happens at request time: the edge reserves tokens and either allows or rejects the request before the response begins. For streaming responses, the actual completion length is unknown until the `[DONE]` event arrives.

Fairvisor Edge handles this in two phases:

1. **Reservation** — at request time, reserve `prompt_tokens + max_completion_tokens`; reject if the budget is insufficient
2. **Mid-stream enforcement** — as SSE chunks arrive in the body filter phase, count completion tokens; truncate the stream if the limit is exceeded
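The reservation arithmetic can be sketched as follows (a minimal Python illustration with hypothetical names; the actual edge runs as Lua inside nginx):

```python
def reserve(bucket_tokens: int, prompt_tokens: int, max_completion_tokens: int):
    """Phase 1: pessimistically reserve prompt + max completion tokens.

    Returns (remaining_bucket, reserved) on success, or None to reject."""
    reserved = prompt_tokens + max_completion_tokens
    if reserved > bucket_tokens:
        return None  # budget insufficient -> request rejected with 429
    return bucket_tokens - reserved, reserved
```

The reservation is deliberately pessimistic; the Reconciliation step below refunds the unused portion once the stream ends.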

## Detection

The streaming path activates when any of the following is true:

- The request includes `Accept: text/event-stream`
- The JSON body contains `"stream": true`
- The `request_context.stream` flag is set by the rule engine
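As a sketch, the detection logic amounts to an OR over those three signals (hypothetical helper; header names are assumed lowercased here):

```python
import json

def is_streaming(headers: dict, body: bytes, request_context: dict) -> bool:
    """Return True if any streaming signal is present on the request."""
    if "text/event-stream" in headers.get("accept", ""):
        return True
    try:
        if json.loads(body).get("stream") is True:
            return True
    except (ValueError, TypeError):
        pass  # non-JSON or empty body: fall through to the rule-engine flag
    return bool(request_context.get("stream"))
```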

## Body filter pipeline

```
upstream SSE chunk
  └─ body_filter_by_lua_block
       └─ buffer until complete SSE event detected
       └─ parse delta.content from JSON
       └─ accumulate token count (ceil(chars / 4))
       └─ check: tokens_used > max_completion_tokens?
           ├─ No: forward chunk unchanged
           └─ Yes: send close event; suppress remaining chunks
  └─ client receives filtered stream
```

The body filter runs once per nginx chunk; a chunk may contain several SSE events, or only a fragment of one. Chunks are buffered until a complete `data: ...\n\n` event is detected.
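The buffering step can be sketched like this (an illustrative Python version of the Lua body filter; `split_events` is a hypothetical name):

```python
def split_events(buffer: bytes, chunk: bytes):
    """Append an nginx chunk to the carry-over buffer and split out every
    complete SSE event (terminated by a blank line, i.e. b"\\n\\n").

    Returns (complete_events, remaining_buffer)."""
    buffer += chunk
    events = []
    while b"\n\n" in buffer:
        event, buffer = buffer.split(b"\n\n", 1)
        events.append(event)
    return events, buffer
```

An event split across two chunks stays in the buffer until its terminator arrives in a later chunk.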

## Token counting in the stream

Completion tokens are estimated as `ceil(content_chars / 4)`, the same rule of thumb the `simple_word` estimator uses for prompt estimation. The count accumulates across all chunks until the stream ends or is truncated.

Checks run every `buffer_tokens` accumulated tokens (default 100) to avoid per-chunk overhead.
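Accumulation with periodic checks can be sketched as a small counter (hypothetical class; the real filter does this in Lua):

```python
import math

class StreamCounter:
    """Accumulate completion tokens as ceil(chars / 4), checking the budget
    only every `buffer_tokens` accumulated tokens to limit per-chunk cost."""

    def __init__(self, max_completion_tokens: int, buffer_tokens: int = 100):
        self.max = max_completion_tokens
        self.step = buffer_tokens
        self.tokens = 0
        self.next_check = buffer_tokens

    def add(self, content: str) -> bool:
        """Count one delta's content; return True if a check fired
        and the budget is now exceeded."""
        self.tokens += math.ceil(len(content) / 4)
        if self.tokens < self.next_check:
            return False  # not time to check yet
        self.next_check = self.tokens + self.step
        return self.tokens > self.max
```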

## Truncation

When `tokens_used > max_completion_tokens`, the stream is truncated. Remaining chunks are suppressed and a synthetic final event is injected:

### Graceful close (`on_limit_exceeded: "graceful_close"`)

```
data: {"choices":[{"delta":{},"finish_reason":"length"}],"usage":{"prompt_tokens":52,"completion_tokens":100,"total_tokens":152}}

data: [DONE]

```

The client receives a well-formed stream with `finish_reason: "length"` — identical to what a model returns when it reaches its `max_tokens` limit naturally.

### Error chunk (`on_limit_exceeded: "error_chunk"`)

```
data: {"error":{"message":"max completion tokens exceeded","type":"rate_limit_error","code":"completion_tokens_exceeded"},"usage":{...}}

data: [DONE]

```

Use `error_chunk` when you want the client to distinguish a policy-enforced truncation from a natural length limit.
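The two close modes can be sketched as a single builder (illustrative Python; the payload shapes follow the examples above):

```python
import json

def close_events(mode: str, usage: dict) -> bytes:
    """Build the synthetic final SSE events injected on truncation."""
    if mode == "graceful_close":
        payload = {"choices": [{"delta": {}, "finish_reason": "length"}],
                   "usage": usage}
    else:  # "error_chunk"
        payload = {"error": {"message": "max completion tokens exceeded",
                             "type": "rate_limit_error",
                             "code": "completion_tokens_exceeded"},
                   "usage": usage}
    return b"data: " + json.dumps(payload).encode() + b"\n\ndata: [DONE]\n\n"
```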

## Configuration

Streaming enforcement is configured in the `algorithm_config` of a `token_bucket_llm` rule:

```json
{
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 120000,
    "tokens_per_day": 2000000,
    "burst_tokens": 120000,
    "max_completion_tokens": 4096,
    "streaming": {
      "enabled": true,
      "enforce_mid_stream": true,
      "buffer_tokens": 100,
      "on_limit_exceeded": "graceful_close",
      "include_partial_usage": true
    }
  }
}
```

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `true` | Enable streaming body filter. |
| `enforce_mid_stream` | bool | `true` | Truncate the stream when the limit is exceeded. Set `false` to count only (useful with shadow mode). |
| `buffer_tokens` | int | `100` | How often to check the token budget (every N accumulated tokens). |
| `on_limit_exceeded` | string | `"graceful_close"` | `"graceful_close"` or `"error_chunk"`. |
| `include_partial_usage` | bool | `true` | Append a running `usage` object to streamed events and to the final close event. |

## Reconciliation

When the stream ends (either naturally at `[DONE]` or via truncation), the actual token count is reconciled against the reservation:

```
refund = reserved_tokens - actual_tokens_used
```

The TPM and TPD buckets are refunded the difference. This keeps the running total accurate over time even though reservations are pessimistic.

For streaming, reconciliation happens when `[DONE]` is received in the body filter. For non-streaming, it happens immediately after the full response body is received.
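As a sketch, reconciliation is a single subtraction, clamped defensively at zero (hypothetical helper name):

```python
def reconcile(reserved_tokens: int, actual_tokens_used: int) -> int:
    """Refund the unused part of a pessimistic reservation to the
    TPM and TPD buckets when the stream ends."""
    return max(reserved_tokens - actual_tokens_used, 0)
```

Using the numbers from the example flow below: a reservation of 580 tokens against an actual usage of 500 refunds 80 tokens.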

## Shadow mode

In shadow mode, streaming enforcement is fully simulated:

- Token counts accumulate normally
- Truncation is logged as `would_truncate` with the token count
- The stream is **not** interrupted — the client receives the full response
- Reconciliation still runs

This lets you validate your `max_completion_tokens` limits against real traffic before enabling enforcement.
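The shadow-mode decision reduces to one branch (illustrative sketch; names are hypothetical):

```python
def on_limit(shadow: bool, tokens_used: int, max_completion_tokens: int) -> str:
    """Decide what to do when a budget check fires mid-stream."""
    if tokens_used <= max_completion_tokens:
        return "forward"  # within budget: pass the chunk through
    # Over budget: shadow mode only logs; enforcement truncates.
    return "would_truncate" if shadow else "truncate"
```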

## Partial usage headers

If `include_partial_usage: true`, each SSE event in the stream has a `usage` field appended:

```
data: {"id":"chatcmpl-...","choices":[...],"usage":{"prompt_tokens":52,"completion_tokens":43,"total_tokens":95}}
```

This is useful for client-side progress tracking and debugging.
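Appending the running usage can be sketched as rewriting each event's JSON payload (hypothetical helper; assumes the standard `data: ` prefix):

```python
import json

def append_usage(event: bytes, prompt_tokens: int, completion_tokens: int) -> bytes:
    """Parse a `data: ...` SSE event and append a running `usage` object."""
    payload = json.loads(event[len(b"data: "):])
    payload["usage"] = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    return b"data: " + json.dumps(payload).encode()
```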

## Example: streaming request flow

```
POST /v1/chat/completions
Authorization: Bearer eyJ...
Content-Type: application/json

{"model":"gpt-4","messages":[...],"stream":true,"max_tokens":500}
```

1. Edge receives request; estimates prompt = 80 tokens
2. max_completion = min(500, 4096) = 500
3. Reserve 580 tokens from TPM; check TPD
4. If denied: `429 tpm_exceeded` (before stream starts)
5. If allowed: forward request to upstream; start streaming
6. Body filter accumulates completion tokens chunk by chunk
7. At token 500: inject graceful close event, suppress upstream chunks
8. Reconcile: actual = 500, reserved = 580, refund = 80 tokens to TPM/TPD
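The arithmetic in steps 1–8 can be checked directly (a worked sketch using the numbers above):

```python
prompt = 80                          # step 1: estimated prompt tokens
max_completion = min(500, 4096)      # step 2: request max_tokens capped by the rule
reserved = prompt + max_completion   # step 3: 580 tokens reserved from TPM
actual = prompt + 500                # step 7: stream truncated at 500 completion tokens
refund = reserved + prompt - actual  # step 8: unused completion tokens returned

assert (max_completion, reserved, refund) == (500, 580, 80)
```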

