
Prompt Caching

Reduce latency and cost by caching repeated prompt prefixes.

Prompt caching allows DEVUP AI to reuse the KV (key-value) cache from previous requests when the beginning of your prompt is identical. This reduces both latency and cost for workloads that repeatedly send the same prefix — such as a long system prompt, a large document, or a fixed set of few-shot examples.

How it works

When you send a request, DEVUP AI checks whether the beginning of your prompt matches a cached prefix from a recent request on the same model. If it does, the cached KV state is reused instead of recomputing it, which:

  • Reduces time-to-first-token — the model skips processing the cached portion
  • Lowers cost — cached input tokens are billed at a reduced rate
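
As a rough mental model (illustration only: the real match happens server-side against stored KV state, not in client code), the lookup behaves like a longest-common-prefix comparison over the tokenized prompt:

# Illustration only: a longest-common-prefix check over token IDs.
# The production cache compares against server-side KV state.

def common_prefix_len(prev_tokens: list[int], curr_tokens: list[int]) -> int:
    """Number of leading tokens two requests have in common."""
    n = 0
    for a, b in zip(prev_tokens, curr_tokens):
        if a != b:
            break
        n += 1
    return n

# If 4,800 of 5,000 prompt tokens match a cached prefix, only the
# remaining 200 tokens need fresh prefill computation.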

Usage

Prompt caching is automatic — no extra parameters required. Just structure your prompts so that the reused content appears at the beginning.

import os

from openai import OpenAI

client = OpenAI(
    # Read the key from the environment rather than hardcoding it
    api_key=os.environ["DEVUP_API_KEY"],
    base_url="https://api.devupai.com/v1",
)

# Long system prompt that stays the same across requests
SYSTEM_PROMPT = """You are a helpful AI assistant with deep expertise in Python.
[... thousands of tokens of instructions or context ...]
"""

# First request - full processing
response1 = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How do I use list comprehensions?"},
    ],
)

# Second request - cached prefix reused, faster and cheaper
response2 = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What are Python generators?"},
    ],
)

Best practices

  • Put stable content first: Place static instructions, documents, or few-shot examples at the very top of the prompt sequence, and append per-request content after them (see the sketch after this list).
  • Keep the prefix identical: Any differing character breaks the cache match, including trailing whitespace or a changed message role.
  • Longer prefixes save more: Savings scale with the length of the matched prefix, so front-loading the largest static content yields the biggest gains.
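
As a minimal sketch of the first two points (reusing SYSTEM_PROMPT from the example above; variable names here are illustrative), keep per-request values such as timestamps out of the shared prefix:

from datetime import datetime, timezone

now = datetime.now(timezone.utc).isoformat()

# Cache-unfriendly: the timestamp changes on every request, so the very
# first characters differ and no prefix can be reused.
bad_system = f"Current time: {now}\n\n{SYSTEM_PROMPT}"

# Cache-friendly: the long static prompt comes first; per-request
# details are appended after it, leaving the prefix intact.
good_system = f"{SYSTEM_PROMPT}\n\nCurrent time: {now}"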

Common use cases

Use case                                Cached prefix
Chatbot with a long system prompt       System prompt
RAG / document Q&A                      Retrieved documents
Few-shot classification                 Examples
Code assistant with a large codebase    Codebase context
Multi-turn conversation                 Previous turns
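
For the multi-turn case, append each new turn to the end of the message list so everything before it stays byte-identical. A sketch reusing the client and SYSTEM_PROMPT defined earlier:

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "How do I use list comprehensions?"},
]
first = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=messages,
)

# Appending the assistant reply and the next user turn preserves the
# existing prefix, so the earlier turns can be served from cache.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Can you show a nested example?"})
second = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=messages,
)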

Checking cache usage

The response usage object indicates how many tokens were served from cache:

{
  "usage": {
    "prompt_tokens": 5000,
    "completion_tokens": 50,
    "total_tokens": 5050,
    "prompt_tokens_details": {
      "cached_tokens": 4800
    }
  }
}

In this example, 4800 of the 5000 input tokens were cached.
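
With the OpenAI Python SDK, the same fields are exposed on the response object (reusing response2 from the usage example above; prompt_tokens_details can be None on some responses, so guard for it):

usage = response2.usage
details = usage.prompt_tokens_details
cached = details.cached_tokens if details else 0
print(f"{cached} of {usage.prompt_tokens} prompt tokens were served from cache")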

Explicit cache keys

While prompt caching matches prefixes automatically, you can explicitly guarantee KV cache separation by passing a custom prompt_cache_key. For multi-tenant applications, we strongly recommend a user- or session-scoped key to prevent accidental prefix collisions across user contexts.

import os

from openai import OpenAI

client = OpenAI(
    # Read the key from the environment rather than hardcoding it
    api_key=os.environ["DEVUP_API_KEY"],
    base_url="https://api.devupai.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I use async/await?"},
    ],
    extra_body={"prompt_cache_key": "user123-chat456"},
)

Requests with the same prompt_cache_key and model will share a KV cache, completely isolated from other requests.

Parameter          Type      Description
prompt_cache_key   string    An optional custom identifier that strictly scopes the KV cache to a specific user, session, or application tenant.

Notes

  • Model Support: Prompt caching is currently enabled by default across all DeepSeek, Llama 3, and Qwen models on DEVUP AI.
  • Cache Expiration: KV caches are maintained dynamically based on cluster demand, typically persisting for 5-10 minutes after the last request.
  • Scoping: Without an explicit cache key, caching is keyed on the requested model and the exact prompt prefix; only requests whose prefixes match byte-for-byte can share a cache entry.
  • Billing: Cached tokens do not count towards active inference rate limits and are billed at a discounted "Cached Input" rate.
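
As an illustration only (rates here are hypothetical; actual pricing varies by model): if regular input were billed at $1.00 per million tokens and cached input at $0.50, the 5,000-token request above would cost 200 × $1.00/M + 4,800 × $0.50/M = $0.0026 in input tokens instead of $0.0050, a 48% saving.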