Prompt caching
Reuse a stable prompt prefix across calls to cut both cost and time-to-first-token. Caching works on every surface and needs no special endpoint.
How caching works
Section titled “How caching works”When consecutive requests share an identical leading prefix — a long system prompt, a tool catalog, a document you keep asking about — that prefix can be served from a cache instead of reprocessed. The first request writes the cache; later requests read it, paying a fraction of the input cost and starting to respond sooner. Caches are short-lived and scoped to your account.
Marking content to cache
Section titled “Marking content to cache”Add a cache_control breakpoint to the end of the content you want cached. Everything up to that point becomes the cacheable prefix. The Anthropic surface takes it natively; the OpenAI-compatible surface accepts it on content parts too.
from anthropic import Anthropic
client = Anthropic(api_key="llm_live_...", base_url="https://app.directinference.com/di")
msg = client.messages.create( model="claude-sonnet-4-6", max_tokens=512, system=[ { "type": "text", "text": LONG_STABLE_INSTRUCTIONS, # reused on every call "cache_control": {"type": "ephemeral"}, # mark the prefix as cacheable }, ], messages=[{"role": "user", "content": "What changed in section 4?"}],)
print(msg.usage) # cache_creation_input_tokens / cache_read_input_tokensimport Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic({ apiKey: "llm_live_...", baseURL: "https://app.directinference.com/di" });
const msg = await client.messages.create({ model: "claude-sonnet-4-6", max_tokens: 512, system: [ { type: "text", text: LONG_STABLE_INSTRUCTIONS, // reused on every call cache_control: { type: "ephemeral" }, // mark the prefix as cacheable }, ], messages: [{ role: "user", content: "What changed in section 4?" }],});
console.log(msg.usage); // cache_creation_input_tokens / cache_read_input_tokens# cache_control is also accepted on the OpenAI-compatible surface.curl https://app.directinference.com/di/v1/chat/completions \ -H "Authorization: Bearer llm_live_..." \ -H "Content-Type: application/json" \ -d '{ "model": "gpt-5.5-mini", "messages": [ { "role": "system", "content": [ { "type": "text", "text": "<long stable instructions>", "cache_control": { "type": "ephemeral" } } ]}, { "role": "user", "content": "What changed in section 4?" } ] }'Reading cache stats
Section titled “Reading cache stats”Every response reports what the cache did, in the native shape of the surface you called. On the OpenAI-compatible surface, usage.prompt_tokens_details.cached_tokens and cache_write_tokens carry the read/write split. On the Anthropic surface, that split is cache_read_input_tokens and cache_creation_input_tokens.
How caching is billed
Section titled “How caching is billed”A cached request is billed in four buckets. Reads are far cheaper than fresh input, which is where the savings come from; a write costs a little more than input once, then pays for itself across subsequent reads.
| Bucket | What it is |
|---|---|
| Uncached input | Input tokens read fresh — your normal input rate. |
| Cache read | Input tokens served from a prior write — billed at a steep discount to input. |
| Cache write | The one-time cost of storing a new prefix, slightly above the input rate. |
| Output | Generated tokens — your normal output rate, never cached. |
Cache savings in the dashboard
Section titled “Cache savings in the dashboard”You do not have to add up tokens by hand. The portal surfaces cache activity and the money it saved per request and in aggregate.