
Chat Completions

Vision & OCR

Send images to multimodal models for visual understanding and text extraction.

DEVUP AI hosts multimodal models that accept both images and text as input and produce text output. These models use the standard OpenAI vision API format and cover two major use cases:

  • Visual understanding — describe images, answer questions about visual content, compare images, analyze charts
  • OCR (Optical Character Recognition) — extract text from scanned documents, receipts, invoices, screenshots, handwritten notes, and PDFs

Available vision models

  • Qwen/Qwen2.5-VL-32B-Instruct
  • Qwen/Qwen2.5-VL-7B-Instruct
  • meta-llama/Llama-3.2-11B-Vision-Instruct

Available OCR models

We host a growing set of OCR-specialized models for high-accuracy text extraction. Browse the full OCR model catalog.

OCR models currently use the same vision API format below. A dedicated OCR endpoint optimized for document processing is coming soon.

Quick start

There are two ways to pass images:

  1. URL — pass a link to a publicly accessible image
  2. Base64 — encode the image and include it directly in the request
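Both forms use the same image_url content-item shape; only the url value differs. A sketch of the two shapes as Python dicts (the URL and bytes below are placeholders, not real data):

```python
import base64

# Form 1: a publicly accessible URL
url_item = {
    "type": "image_url",
    "image_url": {"url": "https://example.com/sample-image.webp"},
}

# Form 2: the raw image bytes embedded as a base64 data URL
raw_bytes = b"<image bytes here>"  # placeholder, not a real image
b64_item = {
    "type": "image_url",
    "image_url": {
        "url": "data:image/png;base64," + base64.b64encode(raw_bytes).decode("utf-8")
    },
}
```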

Image URL

curl "https://api.devupai.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEVUP_API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-32B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/sample-image.webp"
            }
          },
          {
            "type": "text",
            "text": "What is in this image?"
          }
        ]
      }
    ]
  }'

Base64 encoded image

from openai import OpenAI
import base64
import os
import requests

client = OpenAI(
    api_key=os.environ["DEVUP_API_KEY"],
    base_url="https://api.devupai.com/v1",
)

# Download the image and base64-encode it
image_url = "https://example.com/sample-image.webp"
image_response = requests.get(image_url)
image_response.raise_for_status()
base64_image = base64.b64encode(image_response.content).decode("utf-8")

chat_completion = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                },
                {
                    "type": "text",
                    "text": "What is in this image?"
                }
            ]
        }
    ]
)

print(chat_completion.choices[0].message.content)
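Hardcoding the data: prefix risks declaring a different format than the image actually has. A small helper (illustrative, not part of the SDK) that derives the MIME type from the filename before building the data URL:

```python
import base64
import mimetypes

# Formats accepted by the vision API
SUPPORTED_MIME_TYPES = {"image/jpeg", "image/png", "image/webp"}

def to_data_url(filename: str, data: bytes) -> str:
    """Build a data: URL whose MIME type matches the file extension."""
    mime, _ = mimetypes.guess_type(filename)
    if mime not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported image type for {filename!r}: {mime}")
    return f"data:{mime};base64,{base64.b64encode(data).decode('utf-8')}"
```

The resulting string can be passed directly as the url value of an image_url content item.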

OCR example

Extract all text from a document image:

from openai import OpenAI
import base64
import os

client = OpenAI(
    api_key=os.environ["DEVUP_API_KEY"],
    base_url="https://api.devupai.com/v1",
)

with open("invoice.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-32B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{base64_image}"}
                },
                {
                    "type": "text",
                    "text": "Extract all text from this document. Preserve the structure and layout as much as possible."
                }
            ]
        }
    ]
)

print(response.choices[0].message.content)

Common OCR prompts:

  • Extract all text from this image.
  • Extract all text from this document. Preserve the structure and layout as much as possible.
  • Convert this image to a markdown table.

Multiple images

You can pass multiple images in a single request by including multiple image_url content items:

curl "https://api.devupai.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEVUP_API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-VL-32B-Instruct",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What are the differences between these two images?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/page1.jpg"
            }
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "https://example.com/page2.jpg"
            }
          }
        ]
      }
    ]
  }'
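When attaching several images, building the content list programmatically keeps requests consistent. A sketch (the build_multi_image_content helper is our own, not an SDK function):

```python
def build_multi_image_content(prompt: str, *image_urls: str) -> list:
    """Build a user-message content list: one text item followed by
    one image_url item per image."""
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    return content

content = build_multi_image_content(
    "What are the differences between these two images?",
    "https://example.com/page1.jpg",
    "https://example.com/page2.jpg",
)
```

The resulting list is passed as the content field of a user message, exactly as in the curl example above.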

Pricing and token counting

Images are automatically rescaled and converted into tokens by the vision model. You are billed for the total number of tokens produced from both your text prompt and the encoded images.

The exact number of tokens consumed by your images is included in the {"usage": {"prompt_tokens": ...}} object of the API response.
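For example, reading the token counts from a parsed JSON response body (the numbers here are illustrative, not real pricing data):

```python
# A typical parsed response body (values illustrative)
response_body = {
    "choices": [{"message": {"role": "assistant", "content": "..."}}],
    "usage": {"prompt_tokens": 1432, "completion_tokens": 58, "total_tokens": 1490},
}

# prompt_tokens covers both the text prompt and the encoded image(s)
prompt_tokens = response_body["usage"]["prompt_tokens"]
completion_tokens = response_body["usage"]["completion_tokens"]
```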

Limitations

  • Supported formats: jpg, png, webp
  • Max size: 20MB per image
  • The detail parameter is not currently supported
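These limits can be checked client-side before uploading. A minimal pre-flight sketch of the constraints above (the helper is our own; .jpeg is treated as an alias for jpg):

```python
import os

# Constraints from the limitations list above
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}
MAX_BYTES = 20 * 1024 * 1024  # 20MB per image

def validate_image(path: str) -> None:
    """Raise ValueError if the file would be rejected by the API."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported format: {ext}")
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f"image exceeds 20MB limit: {path}")
```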