Prompt caching

Reuse a stable prompt prefix across calls to cut both cost and time-to-first-token. Caching works on every surface and needs no special endpoint.

How caching works

When consecutive requests share an identical leading prefix — a long system prompt, a tool catalog, a document you keep asking about — that prefix can be served from a cache instead of reprocessed. The first request writes the cache; later requests read it, paying a fraction of the input cost and starting to respond sooner. Caches are short-lived and scoped to your account.
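
To make the "identical leading prefix" requirement concrete, here is a minimal sketch that measures how much of two prompts match from the start. It compares characters for simplicity, whereas a real cache presumably matches on tokens; shared_prefix_len and the sample prompts are hypothetical and not part of any SDK.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Return the number of leading characters that a and b have in common."""
    n = 0
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            break
        n += 1
    return n


# Illustrative stand-in for a long, stable system prompt.
INSTRUCTIONS = "You are a contract analyst. Answer only from the document.\n"

# Stable content first, variable content last: the whole instruction block is shared.
good_1 = INSTRUCTIONS + "Question: What changed in section 4?"
good_2 = INSTRUCTIONS + "Question: Who signed the amendment?"

# Variable content first: the two prompts diverge before the instructions begin.
bad_1 = "Request 0001\n" + INSTRUCTIONS
bad_2 = "Request 0002\n" + INSTRUCTIONS

print("stable-first shares the instructions:",
      shared_prefix_len(good_1, good_2) >= len(INSTRUCTIONS))
print("variable-first shared characters:", shared_prefix_len(bad_1, bad_2))
```

The practical takeaway is ordering: keep anything that changes per request (IDs, timestamps, the user's question) after the stable content so the shared prefix stays as long as possible.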

Marking content to cache

Add a cache_control breakpoint at the end of the content you want cached. Everything up to that point becomes the cacheable prefix. The Anthropic surface supports it natively; the OpenAI-compatible surface also accepts it on content parts.

from anthropic import Anthropic

# Placeholder: substitute your own long, stable system prompt.
LONG_STABLE_INSTRUCTIONS = "..."

client = Anthropic(api_key="llm_live_...", base_url="https://app.directinference.com/di")

msg = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,        # reused on every call
            "cache_control": {"type": "ephemeral"},  # mark the prefix as cacheable
        },
    ],
    messages=[{"role": "user", "content": "What changed in section 4?"}],
)

print(msg.usage)   # cache_creation_input_tokens / cache_read_input_tokens

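import Anthropic from "@anthropic-ai/sdk";

// Placeholder: substitute your own long, stable system prompt.
const LONG_STABLE_INSTRUCTIONS = "...";

const client = new Anthropic({ apiKey: "llm_live_...", baseURL: "https://app.directinference.com/di" });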

const msg = await client.messages.create({
  model: "claude-sonnet-4-6",
  max_tokens: 512,
  system: [
    {
      type: "text",
      text: LONG_STABLE_INSTRUCTIONS,        // reused on every call
      cache_control: { type: "ephemeral" }, // mark the prefix as cacheable
    },
  ],
  messages: [{ role: "user", content: "What changed in section 4?" }],
});

console.log(msg.usage); // cache_creation_input_tokens / cache_read_input_tokens

# cache_control is also accepted on the OpenAI-compatible surface.
curl https://app.directinference.com/di/v1/chat/completions \
  -H "Authorization: Bearer llm_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-5.5-mini",
    "messages": [
      { "role": "system", "content": [
        { "type": "text", "text": "<long stable instructions>",
          "cache_control": { "type": "ephemeral" } }
      ]},
      { "role": "user", "content": "What changed in section 4?" }
    ]
  }'

Reading cache stats

Every response reports what the cache did, in the native shape of the surface you called. On the OpenAI-compatible surface, usage.prompt_tokens_details.cached_tokens reports tokens read from the cache and cache_write_tokens reports tokens written to it. On the Anthropic surface, the same split appears as usage.cache_read_input_tokens (read) and usage.cache_creation_input_tokens (write).
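
If you call both surfaces, it can help to normalize the two shapes into one. The following is a minimal sketch that works on plain dicts; cache_stats is a hypothetical helper, and the exact nesting of cache_write_tokens on the OpenAI-compatible surface is an assumption (the helper checks under prompt_tokens_details first, then at the top level of usage).

```python
from typing import Any, Mapping


def cache_stats(usage: Mapping[str, Any]) -> dict[str, int]:
    """Normalize a usage payload from either surface into read/write token counts."""
    if "cache_read_input_tokens" in usage or "cache_creation_input_tokens" in usage:
        # Anthropic surface: both counters sit directly on usage.
        read = usage.get("cache_read_input_tokens") or 0
        write = usage.get("cache_creation_input_tokens") or 0
    else:
        # OpenAI-compatible surface. Assumption: cache_write_tokens sits next to
        # cached_tokens under prompt_tokens_details; fall back to the top level.
        details = usage.get("prompt_tokens_details") or {}
        read = details.get("cached_tokens") or 0
        write = details.get("cache_write_tokens") or usage.get("cache_write_tokens") or 0
    return {"read": read, "write": write}


# Made-up payloads for illustration only.
anthropic_usage = {
    "input_tokens": 50,
    "cache_read_input_tokens": 10_000,
    "cache_creation_input_tokens": 0,
}
openai_usage = {
    "prompt_tokens": 10_050,
    "prompt_tokens_details": {"cached_tokens": 0, "cache_write_tokens": 10_000},
}

print(cache_stats(anthropic_usage))
print(cache_stats(openai_usage))
```

SDK clients usually return a usage object, not a dict, so convert it first; for pydantic-based SDK objects, usage.model_dump() is the usual route.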

How caching is billed

A cached request is billed in four buckets. Reads are far cheaper than fresh input, which is where the savings come from; a write costs a little more than input once, then pays for itself across subsequent reads.

Bucket	What it is
Uncached input	Input tokens read fresh — your normal input rate.
Cache read	Input tokens served from a prior write — billed at a steep discount to input.
Cache write	The one-time cost of storing a new prefix, slightly above the input rate.
Output	Generated tokens — your normal output rate, never cached.
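
To see how the four buckets combine, here is a worked sketch of the arithmetic. Every number in it is a placeholder chosen for illustration: the per-million-token rates and the read and write multipliers are assumptions, not published prices, and Rates and request_cost are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    input_per_mtok: float          # normal input rate, per million tokens
    output_per_mtok: float         # normal output rate, per million tokens
    cache_read_multiplier: float   # fraction of the input rate charged for reads
    cache_write_multiplier: float  # multiple of the input rate charged for writes


def request_cost(rates: Rates, uncached: int, cache_read: int,
                 cache_write: int, output: int) -> float:
    """Sum the four billing buckets for one request."""
    per_in = rates.input_per_mtok / 1_000_000
    per_out = rates.output_per_mtok / 1_000_000
    return (
        uncached * per_in
        + cache_read * per_in * rates.cache_read_multiplier
        + cache_write * per_in * rates.cache_write_multiplier
        + output * per_out
    )


# Placeholder rates: NOT real prices.
rates = Rates(input_per_mtok=3.00, output_per_mtok=15.00,
              cache_read_multiplier=0.10, cache_write_multiplier=1.25)

# A 10,000-token stable prefix, a 50-token question, a 200-token answer.
first_call = request_cost(rates, uncached=50, cache_read=0, cache_write=10_000, output=200)
later_call = request_cost(rates, uncached=50, cache_read=10_000, cache_write=0, output=200)
no_cache = request_cost(rates, uncached=10_050, cache_read=0, cache_write=0, output=200)

print(f"first call (write): {first_call:.5f}")
print(f"later call (read):  {later_call:.5f}")
print(f"without caching:    {no_cache:.5f}")
print("two cached calls cheaper than two uncached:",
      first_call + later_call < 2 * no_cache)
```

Under these assumed multipliers the first call costs more than an uncached call, and the saving appears from the second call onward, which matches the write-then-read pattern described above.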

Cache savings in the dashboard

You do not have to add up tokens by hand. The portal shows cache activity and the resulting savings, both per request and in aggregate.