Create chat completion

Creates a model response for the given chat conversation. Supports streaming via SSE when stream: true.

Response headers

Inference-Id — Unique ID for this request. Include this when contacting support.

Error codes

Status	Meaning
`402`	Insufficient credits
`429`	Rate limited
`504`	No backend available
`529`	No healthy backends are available for the requested model

Authorization

BearerAuth

AuthorizationBearer <token>

API key passed as Bearer token

In: header

Request Body

application/json

TypeScript Definitions

Use the request body type in TypeScript.

Response Body

`application/json`

curl -X POST "https://example.com/v1/chat/completions" \  -H "Content-Type: application/json" \  -d '{    "model": "string",    "messages": [      {}    ]  }'

{
  "id": "string",
  "object": "string",
  "created": 0,
  "model": "string",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "string",
        "content": "string",
        "tool_calls": [
          {}
        ],
        "tool_call_id": "string"
      },
      "finish_reason": "string"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

Creates a model response for the given chat conversation. Supports streaming via SSE when stream: true.

Response headers

Inference-Id — Unique ID for this request. Include this when contacting support.

Error codes

Status	Meaning
`402`	Insufficient credits
`429`	Rate limited
`504`	No backend available
`529`	No healthy backends are available for the requested model

Authorization

BearerAuth

AuthorizationBearer <token>

API key passed as Bearer token

In: header

Request Body

application/json

TypeScript Definitions

Use the request body type in TypeScript.

model*string

The model to use for inference, e.g. meta-llama/Llama-3.3-70B-Instruct.

messages*array<>

A list of messages comprising the conversation so far.

temperature?number

Sampling temperature between 0 and 2. Higher values produce more random output; lower values are more deterministic. Defaults to 1.

max_tokens?integer

Maximum number of tokens to generate. Deprecated in favor of max_completion_tokens.

max_completion_tokens?integer

Upper bound on tokens generated, including reasoning tokens. Preferred over max_tokens.

top_p?number

Nucleus sampling: only tokens comprising the top top_p probability mass are considered. Defaults to 1.

frequency_penalty?number

Number between -2.0 and 2.0. Penalizes tokens based on their frequency so far, reducing repetition.

presence_penalty?number

Number between -2.0 and 2.0. Penalizes tokens that have appeared at all so far, encouraging new topics.

stop?|array<string>

One or more sequences where generation stops. The stop string itself is not included in the output.

stream?boolean

If true, the response is streamed back as Server-Sent Events (SSE). Each chunk is a data: line containing a partial JSON object. The stream ends with data: [DONE].

stream_options?|

Options for streaming responses. Only valid when stream is true.

tools?array<>

A list of tools the model may call. Each tool should follow the OpenAI function-calling schema.

tool_choice?|

Controls which tool the model calls. none disables tools, auto lets the model decide, required forces a tool call. Pass {"type": "function", "function": {"name": "..."}} to force a specific function.

parallel_tool_calls?boolean

Whether to allow the model to make multiple tool calls in parallel. Defaults to true.

response_format?object

Output format. Use {"type": "json_schema", "json_schema": {...}} for Structured Outputs, {"type": "json_object"} for legacy JSON mode, or {"type": "text"} for plain text.

seed?integer

If specified, the system will attempt deterministic sampling so repeated requests with the same seed and parameters return the same result.

n?integer

How many chat completion choices to generate per message. Defaults to 1.

user?string

A unique identifier representing the end-user. Useful for abuse monitoring.

logprobs?boolean

Whether to return log probabilities for each output token. Defaults to false.

top_logprobs?integer

Number of top token log probabilities to return per position (0–20). Requires logprobs: true.

logit_bias?object

Map of token IDs to bias values (-100 to 100). Positive values increase, negative values decrease the likelihood of the token being selected.

service_tier?string

Latency tier for the request. Use auto or default.

modalities?array<string>

Output types to generate. Defaults to ["text"]. Use ["text", "audio"] for audio output.

reasoning_effort?string

Controls how much reasoning the model does before responding. Common values: low, medium, high. Supported by reasoning models (o-series, etc.).

function_call?|

Deprecated. Use tool_choice instead.

functions?array<>

Deprecated. Use tools instead.

reasoning?

Akash-specific reasoning control parameters.

[key: string]?any

Response Body

`application/json`

curl -X POST "https://example.com/v1/chat/completions" \  -H "Content-Type: application/json" \  -d '{    "model": "string",    "messages": [      {}    ]  }'

{
  "id": "string",
  "object": "string",
  "created": 0,
  "model": "string",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "string",
        "content": "string",
        "tool_calls": [
          {}
        ],
        "tool_call_id": "string"
      },
      "finish_reason": "string"
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 0
  }
}

Documentation

Create chat completion

Authorization

Request Body

Response Body

`application/json`

Documentation

Create chat completion

Authorization

Request Body

Response Body

`application/json`

Documentation

Create chat completion

Authorization

Request Body

Response Body

200application/json

Documentation

Create chat completion

Authorization

Request Body

Response Body

200application/json

`application/json`

`application/json`