Prompt caching

Reuse a stable prompt prefix across calls to cut both cost and time-to-first-token. Caching works on every surface and needs no special endpoint.

When consecutive requests share an identical leading prefix — a long system prompt, a tool catalog, a document you keep asking about — that prefix can be served from a cache instead of reprocessed. The first request writes the cache; later requests read it, paying a fraction of the input cost and starting to respond sooner. Caches are short-lived and scoped to your account.

Add a cache_control breakpoint to the end of the content you want cached. Everything up to that point becomes the cacheable prefix. The Anthropic surface takes it natively; the OpenAI-compatible surface accepts it on content parts too.

    from anthropic import Anthropic

    # Placeholder: substitute the long, unchanging instructions you send on every call.
    LONG_STABLE_INSTRUCTIONS = "..."

    client = Anthropic(api_key="llm_live_...", base_url="https://app.directinference.com/di")

    msg = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=[
            {
                "type": "text",
                "text": LONG_STABLE_INSTRUCTIONS,        # reused on every call
                "cache_control": {"type": "ephemeral"},  # mark the prefix as cacheable
            },
        ],
        messages=[{"role": "user", "content": "What changed in section 4?"}],
    )

    # Usage includes cache_creation_input_tokens and cache_read_input_tokens.
    print(msg.usage)
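On the OpenAI-compatible surface, the same breakpoint goes on a content part rather than on a system block. The sketch below only builds the message payload. The helper name `build_cached_messages` is illustrative and not part of any SDK, and because this page does not give the endpoint path or model name for that surface, the request itself is left as a comment.

```python
from typing import Any


def build_cached_messages(stable_text: str, question: str) -> list[dict[str, Any]]:
    """Build chat messages whose system prompt ends in a cache breakpoint."""
    return [
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": stable_text,
                    # Everything up to and including this part is the cacheable prefix.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        # The part that changes per call goes after the breakpoint.
        {"role": "user", "content": question},
    ]


messages = build_cached_messages("You are a contracts analyst.", "What changed in section 4?")
# Assumed usage with an OpenAI-style client pointed at the OpenAI-compatible surface:
# client.chat.completions.create(model=..., messages=messages)
```

Keeping the stable text first and the per-call question last is what lets consecutive requests share an identical leading prefix.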

Every response reports what the cache did, in the native shape of the surface you called. On the OpenAI-compatible surface, usage.prompt_tokens_details.cached_tokens and cache_write_tokens carry the read/write split. On the Anthropic surface, that split is cache_read_input_tokens and cache_creation_input_tokens.
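If you call both surfaces, it can help to normalize the two usage shapes into one. The following is a minimal sketch that works on plain dictionaries. `cache_split` is a hypothetical helper, and it assumes `cache_write_tokens` sits next to `cached_tokens` under `prompt_tokens_details`, falling back to the top level of `usage` if it is not found there.

```python
from typing import Any, Mapping


def cache_split(usage: Mapping[str, Any]) -> dict[str, int]:
    """Return {'read': ..., 'write': ...} from either surface's usage dict."""
    if "cache_read_input_tokens" in usage or "cache_creation_input_tokens" in usage:
        # Anthropic-shaped usage.
        return {
            "read": usage.get("cache_read_input_tokens") or 0,
            "write": usage.get("cache_creation_input_tokens") or 0,
        }
    # OpenAI-shaped usage.
    details = usage.get("prompt_tokens_details") or {}
    # Assumption: look beside cached_tokens first, then at the top level.
    write = details.get("cache_write_tokens", usage.get("cache_write_tokens")) or 0
    return {"read": details.get("cached_tokens") or 0, "write": write}
```

A first request would normally show a non-zero `write` and a zero `read`, and later requests with the same prefix would show the reverse. SDK response objects can usually be converted to a dictionary first, for example with `msg.usage.model_dump()`.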

A cached request is billed in four buckets. Reads are far cheaper than fresh input, which is where the savings come from; a write costs a little more than input once, then pays for itself across subsequent reads.

| Bucket | What it is |
| --- | --- |
| Uncached input | Input tokens read fresh, billed at your normal input rate. |
| Cache read | Input tokens served from a prior write, billed at a steep discount to input. |
| Cache write | The one-time cost of storing a new prefix, slightly above the input rate. |
| Output | Generated tokens, billed at your normal output rate and never cached. |
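To see how the four buckets combine, here is a worked sketch with made-up prices. The rates and the read and write ratios in `Rates` are placeholders chosen for illustration, not this service's actual pricing, and `request_cost` is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Price per million tokens. All values are illustrative placeholders."""
    input: float = 3.00
    cache_read: float = 0.30   # hypothetical: 10% of the input rate
    cache_write: float = 3.75  # hypothetical: 125% of the input rate
    output: float = 15.00


def request_cost(uncached: int, read: int, write: int, output: int,
                 r: Rates = Rates()) -> float:
    """Sum the four billing buckets and return the cost in currency units."""
    total = (uncached * r.input + read * r.cache_read
             + write * r.cache_write + output * r.output)
    return total / 1_000_000


# A 10,000-token stable prefix, a 50-token question, and 500 output tokens.
first = request_cost(uncached=50, read=0, write=10_000, output=500)     # writes the cache
later = request_cost(uncached=50, read=10_000, write=0, output=500)     # reads the cache
no_cache = request_cost(uncached=10_050, read=0, write=0, output=500)   # no caching at all
```

With these placeholder rates, the first request costs about 0.0452, each later request about 0.0107, and the same request without caching about 0.0377. The write premium is therefore recovered by the first cache read: two cached requests total about 0.0558 against about 0.0753 uncached.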

You do not have to add up tokens by hand. The portal shows cache activity and the resulting savings, both per request and in aggregate.