All inference endpoints are on port 8000 and follow the OpenAI API spec exactly.

Chat completions

POST http://localhost:8000/v1/chat/completions
{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "temperature": 0.7,
  "max_tokens": 512
}
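
The same request can be sent from a script. The following is a minimal sketch using only the Python standard library; the helper names `build_chat_request` and `chat` are illustrative, and it assumes the response follows the OpenAI shape, with the reply at `choices[0].message.content`.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # host and port taken from this page


def build_chat_request(user_text, model="Qwen/Qwen2.5-7B-Instruct"):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_text},
        ],
        "stream": False,
        "temperature": 0.7,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_text):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    # Assumption: OpenAI-style body with the reply at choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply.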

Streaming

Set "stream": true. The response is then delivered as server-sent events, in the same format OpenAI uses.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Hi"}], "stream": true}'
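
On the client side, the event stream has to be reassembled into text. Here is a rough sketch of that step, assuming the OpenAI streaming format: each event is a `data: {...}` line carrying a chunk whose `choices[0].delta.content` holds the next text fragment, and the stream ends with `data: [DONE]`. The function name `iter_stream_text` is made up for this example.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from an iterable of SSE lines (str or bytes)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # Some chunks carry only a role or a null content; skip those.
            text = choice.get("delta", {}).get("content")
            if text:
                yield text
```

The response object returned by `urllib.request.urlopen` can be iterated line by line, so it can be passed directly as `lines`, and each fragment printed as it arrives.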

Embeddings

POST http://localhost:8000/v1/embeddings
{
  "model": "Qwen/Qwen2.5-7B-Instruct",
  "input": "The quick brown fox"
}
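
A common next step is to compare the returned vectors. The sketch below assumes the OpenAI response shape, where each entry in `data` has an `index` and an `embedding` list; the helpers `extract_vectors` and `cosine_similarity` are illustrative and not part of the server.

```python
import math


def extract_vectors(response_body):
    """Return embedding vectors in input order from an OpenAI-style response."""
    items = sorted(response_body["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Passing a list of strings as `"input"` returns one vector per string, which `extract_vectors` puts back in request order before comparison.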

List models

GET http://localhost:8000/v1/models
Returns the currently loaded model(s).
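
This endpoint is handy for discovering the model name to put in other requests. A small sketch, assuming the OpenAI list format in which each entry under `data` has an `id`; the function names are hypothetical.

```python
import json
import urllib.request

MODELS_URL = "http://localhost:8000/v1/models"  # URL taken from this page


def model_ids(response_body):
    """Extract model identifiers from an OpenAI-style model list."""
    return [entry["id"] for entry in response_body.get("data", [])]


def fetch_model_ids(url=MODELS_URL):
    """Query the server and return the ids of the loaded model(s)."""
    with urllib.request.urlopen(url) as resp:
        return model_ids(json.load(resp))
```

Calling `fetch_model_ids()` against a running server returns a list of names that can be used as the `"model"` field in the requests above.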

Compatible clients

  • Python: openai SDK, langchain, litellm
  • Node.js: openai npm package
  • Go: sashabaranov/go-openai
  • Open WebUI: set base URL to http://localhost:8000/v1
  • Any other client that speaks the OpenAI API
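
For the Python openai SDK, switching to this server mostly comes down to overriding the base URL. The sketch below assumes version 1.x of the `openai` package and assumes the server does not check the API key; the SDK still requires a non-empty value, so a placeholder is passed. `client_kwargs` and `ask` are illustrative names.

```python
BASE_URL = "http://localhost:8000/v1"  # same base URL as the Open WebUI entry
MODEL = "Qwen/Qwen2.5-7B-Instruct"


def client_kwargs(api_key="not-needed"):
    """Constructor arguments that point the SDK at the local server."""
    # Assumption: the key is a placeholder and is not validated by the server.
    return {"base_url": BASE_URL, "api_key": api_key}


def ask(prompt):
    """Send one user message through the openai SDK and return the reply."""
    # Imported here so client_kwargs() works even without the SDK installed.
    from openai import OpenAI  # third-party: pip install openai

    client = OpenAI(**client_kwargs())
    reply = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content
```

The other listed clients generally expose an equivalent base URL setting, so the same override applies there.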