Skip to content

Effort

One optional knob biases any call toward latency, cost, or quality — no model swap and no code rewrite.

The preferred form is the X-DI-Effort header, which you can set once on the client so it applies to every request. A ?effort= query parameter works too.

client = OpenAI(
api_key="llm_live_...",
base_url="https://app.directinference.com/di/v1",
default_headers={"X-DI-Effort": "high"},
)

Lower levels bias toward cost and latency; higher levels bias toward quality — model order, retries, and repair budget. Effort tunes the chosen request type; it does not pick the type.

fastminimallowmediumhighxhighmax

cheaper / faster → higher quality

LevelBias
autoInferred from model intent and use case (the default).
fastLowest latency and cost; can keep simple work on cheaper models.
minimalMinimal spend; trims optional steps.
lowLeans cheaper and faster.
mediumBalanced cost and quality.
highLeans toward quality — model order and retries.
xhighStronger quality bias; more repair budget.
maxMaximum quality bias regardless of cost.

If several effort signals are present, the first one found wins, in this order:

  1. X-DI-Effort request header
  2. ?effort= query parameter
  3. OpenAI reasoning.effort or reasoning_effort
  4. Gemini thinking config
  5. Anthropic thinking.budget_tokens
  6. Model suffix, e.g. claude-sonnet-4-6@xhigh or gpt-5:fast
  7. Auto inference from model intent and use case