# Prompt Caching

Reduce latency and cost by caching repeated prompt prefixes.
Prompt caching allows DEVUP AI to reuse the KV (key-value) cache from previous requests when the beginning of your prompt is identical. This reduces both latency and cost for workloads that repeatedly send the same prefix — such as a long system prompt, a large document, or a fixed set of few-shot examples.
## How it works
When you send a request, DEVUP AI checks whether the beginning of your prompt matches a cached prefix from a recent request on the same model. If it does, the cached KV state is reused instead of recomputing it, which:
- Reduces time-to-first-token — the model skips processing the cached portion
- Lowers cost — cached input tokens are billed at a reduced rate
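Conceptually, the match is a longest-common-prefix comparison over the tokenized prompt. The toy function below illustrates the idea only; it is not DEVUP AI's actual matching logic, which operates on KV-cache blocks server-side.

```python
def common_prefix_len(cached: list[int], incoming: list[int]) -> int:
    """Toy illustration: length of the shared token prefix between two prompts."""
    n = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        n += 1
    return n

# Two prompts that share a long prefix reuse that much cached computation.
prev = [101, 7592, 2088, 999, 42]   # e.g. system prompt + first question
curr = [101, 7592, 2088, 999, 77]   # same prefix, different final tokens
assert common_prefix_len(prev, curr) == 4
```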
## Usage
Prompt caching is automatic — no extra parameters required. Just structure your prompts so that the reused content appears at the beginning.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DEVUP_API_KEY"),
    base_url="https://api.devupai.com/v1",
)

# Long system prompt that stays the same across requests
SYSTEM_PROMPT = """You are a helpful AI assistant with deep expertise in Python.
[... thousands of tokens of instructions or context ...]
"""

# First request - full processing
response1 = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "How do I use list comprehensions?"},
    ],
)

# Second request - cached prefix reused, faster and cheaper
response2 = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "What are Python generators?"},
    ],
)
```

## Best practices
- **Put stable content first:** Place static instructions, documents, or few-shot examples at the very top of the prompt sequence; anything that varies should come after them.
- **Keep the prefix identical:** A match ends at the first difference, so even a trailing space, a reordered document, or a changed message role shortens or eliminates the cached portion.
- **Longer prefixes save more:** Savings scale with the length of the matched prefix, so front-load the heaviest token payloads (large documents, example sets). The sketch below shows one way to structure messages accordingly.
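As a concrete illustration of these rules, the sketch below builds a message list with the static portion first and the per-request question last. `STATIC_INSTRUCTIONS` and `build_messages` are hypothetical names, not part of the DEVUP AI API.

```python
# Illustrative sketch; the names here are placeholders for your own code.
STATIC_INSTRUCTIONS = "You are a helpful assistant. Answer using the documents below."

def build_messages(documents: list[str], question: str) -> list[dict]:
    """Place static content first so repeated requests share a cached prefix."""
    context = "\n\n".join(documents)  # keep document order stable, too
    return [
        # Identical across requests -> eligible for prefix caching
        {"role": "system", "content": f"{STATIC_INSTRUCTIONS}\n\n{context}"},
        # Varies per request -> placed last so it never shortens the match
        {"role": "user", "content": question},
    ]
```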
## Common use cases
| Use case | Cached prefix |
|---|---|
| Chatbot with a long system prompt | System prompt |
| RAG / document Q&A | Retrieved documents |
| Few-shot classification | Examples |
| Code assistant with a large codebase | Codebase context |
| Multi-turn conversation | Previous turns |
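The multi-turn case works because appending each new turn to the end of the message list leaves every earlier turn unchanged, so each request extends the prefix cached by the previous one. A minimal sketch, reusing the `client` configured above (the questions are placeholders):

```python
history = [{"role": "system", "content": "You are a helpful assistant."}]

for question in ["What is a closure?", "Show an example.", "When are they useful?"]:
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",
        messages=history,  # earlier turns form an ever-growing cached prefix
    )
    history.append(
        {"role": "assistant", "content": response.choices[0].message.content}
    )
```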
## Checking cache usage
The response usage object indicates how many tokens were served from cache:
```json
{
  "usage": {
    "prompt_tokens": 5000,
    "completion_tokens": 50,
    "total_tokens": 5050,
    "prompt_tokens_details": {
      "cached_tokens": 4800
    }
  }
}
```

In this example, 4800 of the 5000 input tokens were served from the cache.
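With the OpenAI Python client, the same figures are exposed on the response object. In recent versions of the `openai` package the details field may be absent or `None` for uncached requests, so guard for it:

```python
usage = response2.usage
details = getattr(usage, "prompt_tokens_details", None)
cached = (details.cached_tokens or 0) if details else 0
print(f"{cached} of {usage.prompt_tokens} prompt tokens were served from cache")
```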
## Explicit cache keys
While prompt caching matches prefixes automatically, you can pass a custom `prompt_cache_key` to explicitly separate KV caches. For multi-tenant applications, we highly recommend a session-scoped key to prevent accidental prefix collisions across user contexts.
```python
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ.get("DEVUP_API_KEY"),
    base_url="https://api.devupai.com/v1",
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "How do I use async/await?"},
    ],
    # prompt_cache_key is not a standard OpenAI parameter,
    # so it is passed through extra_body
    extra_body={"prompt_cache_key": "user123-chat456"},
)
```

Requests with the same `prompt_cache_key` and model will share a KV cache, completely isolated from other requests.
| Parameter | Type | Description |
|---|---|---|
| prompt_cache_key | string | An optional custom identifier to strictly scope the KV cache to a specific user, session, or application tenant. |
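A common pattern is to derive the key from stable user and session identifiers. The helper below is an illustrative sketch, not part of the SDK; the hashing is optional and simply avoids sending raw identifiers verbatim.

```python
import hashlib

def derive_cache_key(user_id: str, session_id: str) -> str:
    """Session-scoped key: one session shares a cache, tenants stay isolated."""
    return hashlib.sha256(f"{user_id}:{session_id}".encode()).hexdigest()[:32]

extra_body = {"prompt_cache_key": derive_cache_key("user123", "chat456")}
```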
## Notes
- **Model support:** Prompt caching is currently enabled by default across all DeepSeek, Llama 3, and Qwen models on DEVUP AI.
- **Cache expiration:** KV caches are maintained dynamically based on cluster demand and typically persist for 5-10 minutes after the last request.
- **Scoping:** Without an explicit cache key, caches are scoped to the requested model and shared only between requests with identical prompt prefixes, so one request can never read another's differing content.
- **Billing:** Cached tokens do not count toward active inference rate limits and are billed at a discounted "Cached Input" rate; see the arithmetic sketch below.
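To estimate what caching saves you, price cached and uncached input tokens separately. The rates below are hypothetical placeholders, not DEVUP AI's actual pricing; check the pricing page for real figures.

```python
# Hypothetical per-million-token rates; substitute real pricing.
INPUT_RATE = 0.50    # USD per 1M uncached input tokens
CACHED_RATE = 0.05   # USD per 1M cached input tokens

def input_cost(prompt_tokens: int, cached_tokens: int) -> float:
    """Estimated input cost from the usage figures in an API response."""
    uncached = prompt_tokens - cached_tokens
    return (uncached * INPUT_RATE + cached_tokens * CACHED_RATE) / 1_000_000

# Using the usage example above: 4800 of 5000 prompt tokens were cached.
print(f"cached:   ${input_cost(5000, 4800):.6f}")   # $0.000340
print(f"uncached: ${input_cost(5000, 0):.6f}")      # $0.002500
```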