
Chat Completions

DEVUP AI offers an OpenAI-compatible chat completions API for every LLM we serve, at the best prices. Whether you are using Llama 3, DeepSeek, or Qwen, our gateway strictly follows the OpenAI schema, so existing OpenAI client code works against our endpoint unchanged.

POST
https://api.devupai.com/v1/chat/completions

Install the SDK

pip install openai

Basic chat completion

By setting base_url and passing your DEVUP AI API key, you can route requests directly to our optimized infrastructure.

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEVUP_API_KEY"],  # read the key from the environment
    base_url="https://api.devupai.com/v1",
)

chat_completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

Multi-turn conversations

The models themselves have no memory between requests. To maintain conversational state, pass the entire conversation history in the messages array on every call.

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEVUP_API_KEY"],
    base_url="https://api.devupai.com/v1",
)

chat_completion = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "Respond like a Michelin-starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form..."},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
)

print(chat_completion.choices[0].message.content)
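In practice, you rarely hard-code the assistant turns. Keep the history in a list and append each reply as it arrives. A minimal sketch, reusing the client configured above (the follow-up question is purely illustrative):

# Keep the history in a list, append each turn, and resend the full list.
messages = [
    {"role": "system", "content": "Respond like a Michelin-starred chef."},
    {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
]

reply = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=messages,
)
messages.append({"role": "assistant", "content": reply.choices[0].message.content})
messages.append({"role": "user", "content": "Tell me more about the second method."})

follow_up = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=messages,
)
print(follow_up.choices[0].message.content)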

Supported parameters

We support the majority of standard OpenAI chat completion parameters.

model
string · required · ID of the model to use. See the model library for all available options.
messages
array · required · A list of messages comprising the conversation so far.
max_tokens
integer · optional · The maximum number of tokens that can be generated in the chat completion.
stream
boolean · optional · default: false · If set, partial message deltas will be sent as data-only server-sent events.
temperature
number · optional · default: 1 · The sampling temperature to use, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic.
top_p
number · optional · default: 1 · An alternative to sampling with temperature, called nucleus sampling. We generally recommend altering this or temperature but not both.
stop
string / array · optional · Up to 4 sequences where the API will stop generating further tokens.
n
integer · optional · default: 1 · How many chat completion choices to generate for each input message.
presence_penalty
number · optional · default: 0 · Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far.
frequency_penalty
number · optional · default: 0 · Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far.
response_format
object · optional · An object specifying the format that the model must output. Compatible with JSON mode and structured outputs.
tools
array · optional · A list of tools the model may call.
tool_choice
string / object · optional · Controls which (if any) tool is called by the model.
service_tier
string · optional · Specifies the latency tier to use. See the Service tier section below for details.
reasoning_effort
string · optional · Constrains effort on reasoning for reasoning models. Currently supports 'low', 'medium', and 'high'.
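
For example, here is a request that tunes sampling and streams tokens back as they are generated. A minimal sketch reusing the client configured earlier; the parameter values are purely illustrative:

stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=128,     # cap the completion length
    temperature=0.2,    # lower temperature: more focused output
    stop=["\n\n"],      # stop at the first blank line
    stream=True,        # receive partial deltas as server-sent events
)

for chunk in stream:
    # Each chunk carries a partial delta; guard against empty or final chunks.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)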

We may not be 100% compatible with all OpenAI parameters. If something you need is missing, let us know through support.

Service tier

DEVUP AI offers an optional priority service tier that guarantees the highest quality of service by routing your inference requests to our fastest, dedicated GPU clusters.

⚠ Priority inference incurs a 20% surcharge on top of the model's standard per-token price.

# Reuses the client configured in the basic example above.
response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"service_tier": "priority"},
)
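
The extra_body argument merges additional fields into the request JSON verbatim, so gateway-specific options like service_tier reach our API regardless of whether your installed SDK version models them as typed parameters.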

Max output tokens

Due to hardware limits, there is a hard cap on how many output tokens a model can generate in a single request. For models like DeepSeek V3, this cap is typically 8192 tokens.
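
You can detect truncation by inspecting finish_reason on the returned choice. A minimal sketch, reusing chat_completion from the basic example above:

choice = chat_completion.choices[0]
if choice.finish_reason == "length":
    # The model stopped at the output-token cap rather than finishing naturally.
    print("Response was truncated at the max output token limit.")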

Continuing responses beyond the limit

If a model stops generating because it reached the maximum output tokens (finish_reason will be length), you can send the truncated response back as the final assistant message and the model will continue where it left off.

curl -X POST https://api.devupai.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEVUP_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [
      {
        "role": "user",
        "content": "Write a 5,000 word essay on quantum physics."
      },
      {
        "role": "assistant",
        "content": "<previous truncated response>"
      }
    ]
  }'
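
To automate this, loop until the model finishes for a reason other than the token cap. A minimal Python sketch, reusing the client configured earlier and relying on the trailing-assistant-message behavior described above:

# Keep requesting until the model stops for a reason other than the token cap.
messages = [{"role": "user", "content": "Write a 5,000 word essay on quantum physics."}]
full_text = ""

while True:
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-V3",
        messages=messages,
    )
    choice = response.choices[0]
    full_text += choice.message.content

    if choice.finish_reason != "length":
        break  # finished naturally (or hit a stop sequence)

    # Truncated: send everything generated so far back as the trailing
    # assistant message so the model picks up where it left off.
    if messages[-1]["role"] == "assistant":
        messages[-1]["content"] = full_text
    else:
        messages.append({"role": "assistant", "content": full_text})

print(full_text)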
