# Prompt caching

Reuse a stable prompt prefix across calls to cut both cost and time-to-first-token. Caching works on every surface and needs no special endpoint.

## How caching works

When consecutive requests share an identical leading prefix — a long system prompt, a tool catalog, a document you keep asking about — that prefix can be served from a cache instead of reprocessed. The first request writes the cache; later requests read it, paying a fraction of the input cost and starting to respond sooner. Caches are short-lived and scoped to your account.

## Marking content to cache

Add a `cache_control` breakpoint to the last content block you want cached. Everything up to and including that block becomes the cacheable prefix. The Anthropic surface takes it natively; the OpenAI-compatible surface accepts it on content parts too.

```python
from anthropic import Anthropic

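client = Anthropic(api_key="llm_live_...", base_url="https://app.directinference.com/di")

# Placeholder: substitute your own long, unchanging system prompt.
LONG_STABLE_INSTRUCTIONS = "..."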

msg = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,        # reused on every call
            "cache_control": {"type": "ephemeral"},  # mark the prefix as cacheable
        },
    ],
    messages=[{"role": "user", "content": "What changed in section 4?"}],
)

print(msg.usage)   # cache_creation_input_tokens / cache_read_input_tokens
```

```typescript
import Anthropic from "@anthropic-ai/sdk";

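const client = new Anthropic({ apiKey: "llm_live_...", baseURL: "https://app.directinference.com/di" });

// Placeholder: substitute your own long, unchanging system prompt.
const LONG_STABLE_INSTRUCTIONS = "...";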

const msg = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  system: [
    {
      type: "text",
      text: LONG_STABLE_INSTRUCTIONS,        // reused on every call
      cache_control: { type: "ephemeral" }, // mark the prefix as cacheable
    },
  ],
  messages: [{ role: "user", content: "What changed in section 4?" }],
});

console.log(msg.usage); // cache_creation_input_tokens / cache_read_input_tokens
```

```bash
# cache_control is also accepted on the OpenAI-compatible surface.
curl https://app.directinference.com/di/v1/chat/completions \
  -H "Authorization: Bearer llm_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.5-mini",
    "messages": [
      { "role": "system", "content": [
        { "type": "text", "text": "<long stable instructions>",
          "cache_control": { "type": "ephemeral" } }
      ]},
      { "role": "user", "content": "What changed in section 4?" }
    ]
  }'
```

:::tip[Order matters: stable content first]
Put the parts that never change (instructions, schemas, documents) at the front and mark the breakpoint after them; keep the volatile user turn last. A prefix only hits the cache when it is byte-for-byte identical to a previous request, so anything that varies per call must come after the breakpoint.
:::
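
To make the ordering rule concrete, the sketch below counts how many leading bytes two serialized prompts share. It is a toy model only: `shared_prefix_len`, `stable_first`, and `volatile_first` are hypothetical helpers written for this illustration, and the real prefix matching happens server-side in a way this page does not describe.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Return how many leading UTF-8 bytes the two strings have in common."""
    count = 0
    for x, y in zip(a.encode("utf-8"), b.encode("utf-8")):
        if x != y:
            break
        count += 1
    return count


# Stand-in for a long, unchanging system prompt (hypothetical content).
INSTRUCTIONS = "You are a contract reviewer. Follow the policy below.\n" * 20


def stable_first(question: str) -> str:
    """Stable content first, volatile user turn last."""
    return INSTRUCTIONS + "\nUser: " + question


def volatile_first(question: str, request_id: str) -> str:
    """Anti-pattern: a per-request value placed ahead of the stable content."""
    return f"Request {request_id}\n" + INSTRUCTIONS + "\nUser: " + question


good = shared_prefix_len(
    stable_first("What changed in section 4?"),
    stable_first("Summarize section 2."),
)
bad = shared_prefix_len(
    volatile_first("What changed in section 4?", "a1"),
    volatile_first("Summarize section 2.", "b2"),
)

print(f"stable-first prompts share {good} leading bytes")
print(f"volatile-first prompts share {bad} leading bytes")
```

With the stable text first, the two prompts stay identical through the whole instruction block, so a breakpoint placed after it can match. With a per-request ID at the front, the prompts diverge almost immediately and nothing useful is left to reuse.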

## Reading cache stats

Every response reports what the cache did, in the native shape of the surface you called. On the OpenAI-compatible surface, `usage.prompt_tokens_details.cached_tokens` and `cache_write_tokens` carry the read/write split. On the Anthropic surface, that split is `cache_read_input_tokens` and `cache_creation_input_tokens`.
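
If you call both surfaces, a small adapter can fold the two shapes into one. The following is a minimal sketch, not part of any SDK: `cache_stats` is a hypothetical helper that takes the usage payload as a plain dict. It assumes `cache_write_tokens` sits next to `cached_tokens` inside `prompt_tokens_details` and falls back to the top level of `usage` if it is not there; check a real response to confirm the exact location.

```python
from typing import Any, Mapping


def cache_stats(usage: Mapping[str, Any]) -> dict[str, int]:
    """Normalize a usage payload from either surface to {'read': n, 'write': n}."""
    # Anthropic surface: the split lives directly on the usage object.
    if "cache_read_input_tokens" in usage or "cache_creation_input_tokens" in usage:
        return {
            "read": int(usage.get("cache_read_input_tokens") or 0),
            "write": int(usage.get("cache_creation_input_tokens") or 0),
        }

    # OpenAI-compatible surface: reads are under prompt_tokens_details.
    details = usage.get("prompt_tokens_details") or {}
    # Assumption: cache_write_tokens is a sibling of cached_tokens;
    # otherwise look for it at the top level of usage.
    write = details.get("cache_write_tokens", usage.get("cache_write_tokens"))
    return {
        "read": int(details.get("cached_tokens") or 0),
        "write": int(write or 0),
    }


anthropic_usage = {
    "input_tokens": 12,
    "cache_read_input_tokens": 2048,
    "cache_creation_input_tokens": 0,
}
openai_usage = {
    "prompt_tokens": 2060,
    "prompt_tokens_details": {"cached_tokens": 2048, "cache_write_tokens": 0},
}

print(cache_stats(anthropic_usage))
print(cache_stats(openai_usage))
```

The token counts in the two sample payloads are invented for illustration. SDK clients return usage as an object rather than a dict, so convert it to a plain mapping before passing it in.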

## How caching is billed

A cached request is billed in four buckets. Reads are far cheaper than fresh input, which is where the savings come from. A write costs slightly more than fresh input, but you pay it only once per prefix, and it pays for itself across subsequent reads.

| Bucket | What it is |
| --- | --- |
| Uncached input | Input tokens read fresh — your normal input rate. |
| Cache read | Input tokens served from a prior write — billed at a steep discount to input. |
| Cache write | The one-time cost of storing a new prefix, slightly above the input rate. |
| Output | Generated tokens — your normal output rate, never cached. |
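
As a rough worked example of the four buckets, the sketch below prices a request that writes a prefix, a follow-up that reads it, and the same request with no caching at all. Every number is a made-up placeholder chosen for easy arithmetic, not a published rate, and `Rates` and `request_cost` are hypothetical names. It also assumes the buckets do not overlap, so each input token is counted in exactly one of them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per million tokens. Placeholder values only, not real pricing."""
    input: float
    cache_read: float
    cache_write: float
    output: float


def request_cost(rates: Rates, *, uncached: int, cache_read: int,
                 cache_write: int, output: int) -> float:
    """Sum the four billing buckets for one request, in USD."""
    total = (
        uncached * rates.input
        + cache_read * rates.cache_read
        + cache_write * rates.cache_write
        + output * rates.output
    )
    return total / 1_000_000


# Hypothetical rates: reads at a steep discount, writes slightly above input.
RATES = Rates(input=4.0, cache_read=0.4, cache_write=5.0, output=20.0)
PREFIX, QUESTION, ANSWER = 10_000, 100, 500  # token counts, also invented

first = request_cost(RATES, uncached=QUESTION, cache_read=0,
                     cache_write=PREFIX, output=ANSWER)
later = request_cost(RATES, uncached=QUESTION, cache_read=PREFIX,
                     cache_write=0, output=ANSWER)
no_cache = request_cost(RATES, uncached=PREFIX + QUESTION, cache_read=0,
                        cache_write=0, output=ANSWER)

print(f"first call (write): ${first:.4f}")
print(f"later call (read):  ${later:.4f}")
print(f"without caching:    ${no_cache:.4f}")
```

Under these placeholder rates the writing call costs more than an uncached one, but one write plus one read is already cheaper than two uncached calls. Where the break-even point falls for you depends on your actual rates and prefix size.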

## Cache savings in the dashboard

You do not have to add up tokens by hand. The portal shows cache activity and the money it saved, both per request and in aggregate.

:::note[Where to look]
The [Overview](https://app.directinference.com/overview) shows a cache savings strip and timeline when there is cache activity, and each row in [Usage & analytics](https://docs.directinference.com/usage/) breaks down cached vs. uncached tokens and the savings.
:::