
API-Dokumentation LLMaaS

Base URL

https://api.ai.cloud-temple.com/v1

Authentication

All requests require an Authorization header containing your API token:

Authorization: Bearer YOUR_API_TOKEN

Rate Limiting and Billing

The Tier Principle: Access Tier, Budget, and Capacity

Our tier system is designed around complete service envelopes that define three key aspects of your usage:

  1. An Access Tier (Purchase Credit): For Tiers 1 to 4, this is an amount paid upfront to activate the service and unlock the technical and budgetary capabilities of the selected tier.
  2. A Monthly Budget Limit: This is the cap on your monthly consumption, ensuring full control over your costs.
  3. Technical Capacity: These are the throughput limits (tokens per day and per hour) that guarantee stable and predictable performance for your call volume.

Choosing a tier therefore involves balancing your initial investment, your projected monthly budget, and the required technical capacity. Your consumption within this envelope is then billed according to the current pricing rates.

Tier Table

| Tier | Purchase Credit | Monthly Limit | Tokens Output/Hour | Tokens Output/Day | Description |
|---|---|---|---|---|---|
| Tier 1 | 200 € | 1,000 € | 150,000 | 3,600,000 | Standard usage |
| Tier 2 | 500 € | 3,000 € | 300,000 | 7,200,000 | Professional use |
| Tier 3 | 1,000 € | 5,000 € | 450,000 | 10,800,000 | High volume |
| Tier 4 | 4,000 € | 10,000 € | 600,000 | 14,400,000 | Enterprise |
| Monthly Billing | N/A | Unlimited | High priority | High priority | Contact sales |

Note: Rate limits are calculated based on output tokens. Token pricing varies by usage:

  • Input tokens: 1.90 € / million
  • Output tokens (standard): 8.00 € / million
  • Output tokens (reasoner): 8.00 € / million (applies to most advanced models for complex agent-like or reasoning tasks)

Audio Billing

  • Audio Transcription: 0.01 € / minute (every started minute is billable)

Rate Limit Headers

Responses include informative headers:

X-RateLimit-Limit-Requests: 1000
X-RateLimit-Remaining-Requests: 999
X-RateLimit-Reset-Requests: 1640995200
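
These headers can be read from any HTTP client's response object; a minimal sketch of parsing them into usable values (the helper name is an illustration, not part of the API):

```python
def rate_limit_info(headers: dict) -> dict:
    """Parse the informative rate-limit headers shown above into integers."""
    return {
        "limit": int(headers.get("X-RateLimit-Limit-Requests", 0)),
        "remaining": int(headers.get("X-RateLimit-Remaining-Requests", 0)),
        "reset": int(headers.get("X-RateLimit-Reset-Requests", 0)),  # Unix timestamp
    }
```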

Error 429 - Limit Reached

{
  "error": {
    "message": "Rate limit exceeded. Please upgrade your plan or try again later.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

Endpoints

POST /v1/chat/completions

Generates conversational responses.

Request

curl -X POST "https://api.ai.cloud-temple.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "model": "granite3.3:8b",
    "messages": [
      {
        "role": "user",
        "content": "Explain photosynthesis"
      }
    ],
    "max_tokens": 200,
    "temperature": 0.7
  }'

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | ID of the model (see catalog) |
| messages | array | Yes | Conversation (role: system/user/assistant) |
| stream | boolean | No | Enables streaming (default: false) |
| temperature | float | No | Creativity 0.0-2.0 (default: 0.7) |
| max_tokens | integer | No | Token limit (default: 1024) |
| top_p | float | No | Nucleus sampling 0.0-1.0 (default: 1.0) |
| presence_penalty | float | No | Presence penalty -2.0 to 2.0 (default: 0) |
| frequency_penalty | float | No | Frequency penalty -2.0 to 2.0 (default: 0) |
| user | string | No | Unique user ID |
| tools | array | No | List of tools the model can call. |
| tool_choice | string/object | No | Controls whether the model should call a tool: "none", "auto", or {"type": "function", "function": {"name": "my_function"}}. |

Standard Response

{
  "id": "chatcmpl-bc52de347f2e4068b7bde380c0f8db37",
  "object": "chat.completion",
  "created": 1749114814,
  "model": "granite3.3:8b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Photosynthesis is a biological process..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 42,
    "total_tokens": 57
  }
}

Response with Tool Call

If the model decides to call a tool, the response will have a finish_reason of tool_calls and the message will contain a tool_calls array.

{
  "id": "chatcmpl-9f27a53f52b44a9693753f2a5e1f7a73",
  "object": "chat.completion",
  "created": 1749115200,
  "model": "qwen3:14b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": null,
        "tool_calls": [
          {
            "id": "call_abc123",
            "type": "function",
            "function": {
              "name": "get_current_weather",
              "arguments": "{\n  \"location\": \"Paris, France\",\n  \"unit\": \"celsius\"\n}"
            }
          }
        ]
      },
      "finish_reason": "tool_calls"
    }
  ],
  "usage": {
    "prompt_tokens": 82,
    "completion_tokens": 18,
    "total_tokens": 100
  }
}

After receiving a tool_calls response, you must execute the tool on your side and then send the result back to the model using a message with role: "tool".

{
  "model": "qwen3:14b",
  "messages": [
    {
      "role": "user",
      "content": "What is the weather like in Paris?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "id": "call_abc123",
          "type": "function",
          "function": {
            "name": "get_current_weather",
            "arguments": "{\"location\": \"Paris, France\", \"unit\": \"celsius\"}"
          }
        }
      ]
    },
    {
      "role": "tool",
      "tool_call_id": "call_abc123",
      "content": "{\"temperature\": \"22\", \"unit\": \"celsius\", \"description\": \"Sunny\"}"
    }
  ]
}
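
The round trip above can be sketched in Python. This is a minimal sketch, not the full client: `get_current_weather` is a local stub, and a real implementation would dispatch by function name and then POST the updated `messages` back to /v1/chat/completions.

```python
import json

def get_current_weather(location: str, unit: str = "celsius") -> str:
    """Local stub standing in for a real weather lookup (illustration only)."""
    return json.dumps({"temperature": "22", "unit": unit, "description": "Sunny"})

def handle_tool_calls(assistant_message: dict, messages: list) -> list:
    """Execute each requested tool and append role:"tool" result messages."""
    messages.append(assistant_message)  # echo the tool_calls message back first
    for call in assistant_message.get("tool_calls", []):
        args = json.loads(call["function"]["arguments"])
        result = get_current_weather(**args)  # dispatch by call["function"]["name"] in real code
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": result,
        })
    return messages  # send this list back to the model for the final answer
```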

Streaming (SSE)

With "stream": true, the response is delivered token by token:

Response headers:

Content-Type: text/event-stream
Cache-Control: no-cache

Event format:

data: {"choices":[{"delta":{"content":"Photo"},"finish_reason":null,"index":0}],"created":1749114814,"id":"chatcmpl-bc52de347f2e4068b7bde380c0f8db37","model":"granite3.3:8b","object":"chat.completion.chunk"}

data: {"choices":[{"delta":{"content":"synthesis"},"finish_reason":null,"index":0}],"created":1749114814,"id":"chatcmpl-bc52de347f2e4068b7bde380c0f8db37","model":"granite3.3:8b","object":"chat.completion.chunk"}

data: {"choices":[{"delta":{"content":""},"finish_reason":"stop","index":0}],"created":1749114814,"id":"chatcmpl-bc52de347f2e4068b7bde380c0f8db37","model":"granite3.3:8b","object":"chat.completion.chunk"}

data: [DONE]

Chunk structure:

  • choices[].delta.content: incremental content
  • finish_reason: null while streaming, then "stop"
  • End signal: data: [DONE]

Multimodal Queries (Vision)

To analyze images, you can send a request where the content field of a user message is an array containing both text and images.

The format for an image is an object with type: "image_url" and an image_url field containing the image URL in data URI (base64) format.

Compatibility Note

Although the standard and recommended format is {"type": "image_url", "image_url": {"url": "data:..."}}, the API also supports, for flexibility, a simplified format {"type": "image", "image": "data:..."}. However, we recommend using the standard image_url format for better compatibility with the OpenAI ecosystem.

OCR and Document Analysis

For specific document analysis tasks (PDFs, scans, tables), we recommend using the specialized model DeepSeek-OCR. See the dedicated documentation.

Example Vision Request

curl -X POST "https://api.ai.cloud-temple.com/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "model": "gemma3:27b",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "What do you see in this image?"
          },
          {
            "type": "image_url",
            "image_url": {
              "url": "data:image/jpeg;base64,..."
            }
          }
        ]
      }
    ],
    "max_tokens": 500
  }'
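
Building the base64 data URI shown in the "url" field can be done with the standard library; a minimal sketch (the MIME type must match your actual image format):

```python
import base64

def image_to_data_uri(data: bytes, mime: str = "image/jpeg") -> str:
    """Encode raw image bytes as the data-URI string expected in image_url.url."""
    b64 = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{b64}"

# In practice: image_to_data_uri(open("photo.jpg", "rb").read())
```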

POST /v1/completions

warning

Note: The /v1/completions endpoint uses the same format as /v1/chat/completions, with messages.
For simple text completion, use a single user message containing your prompt.

Text completion via the chat format.

Request

curl -X POST "https://api.ai.cloud-temple.com/v1/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "model": "granite3.3:8b",
    "messages": [
      {
        "role": "user",
        "content": "Complete this sentence: Artificial intelligence is"
      }
    ],
    "max_tokens": 100,
    "temperature": 0.7
  }'

Parameters

Identical to /v1/chat/completions – see the previous section.

Response

Same format as /v1/chat/completions.

POST /v1/audio/transcriptions

Transcribes audio to text (Whisper).

Request

curl -X POST "https://api.ai.cloud-temple.com/v1/audio/transcriptions" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -F "file=@audio.wav" \
  -F "language=fr" \
  -F "response_format=json"

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| file | binary | Yes | Audio file (wav, mp3, m4a). |
| language | string | No | ISO 639-1 language code (e.g., "fr"). Automatic detection if not provided. |
| initial_prompt | string | No | Context or specific words to improve transcription accuracy. |
| task | string | No | Task to perform: transcribe (default) or translate (translate to English). |
| response_format | string | No | json (default, equivalent to verbose_json). The text, srt, and vtt formats are currently not supported. |

Response (json)

{
"text": "Hello, this is a test of audio transcription.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 0.0,
"end": 4.0,
"text": " Hello, this is a test of audio transcription.",
"tokens": [ 50364, 40365, 33, 2373, 359, 456, 2373, 323, 1330, 2373, 2264, 50564 ],
"temperature": 0.0,
"avg_logprob": -0.25,
"compression_ratio": 1.5,
"no_speech_prob": 0.05
}
],
"language": "en"
}
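
Since the srt and vtt response formats are not currently supported server-side, subtitles can be assembled client-side from the segments array above; a minimal sketch:

```python
def to_srt(segments: list) -> str:
    """Build an SRT string from Whisper-style segments (start/end in seconds)."""
    def ts(sec: float) -> str:
        # SRT timestamps look like 00:00:04,000
        ms = int(round(sec * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg['start'])} --> {ts(seg['end'])}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"
```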

POST /v1/embeddings

Creates an embedding vector representing the input text.

Request

curl -X POST "https://api.ai.cloud-temple.com/v1/embeddings" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -d '{
    "model": "granite-embedding:278m",
    "input": "The text to vectorize"
  }'

Parameters

| Parameter | Type | Required | Description |
|---|---|---|---|
| model | string | Yes | ID of the embedding model (see catalog) |
| input | string or array of strings | Yes | The text or list of texts to vectorize. |

Response

{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "index": 0,
      "embedding": [
        0.018902843818068504,
        -0.023282647132873535,
        ...
        -0.016484618186950684
      ]
    }
  ],
  "model": "granite-embedding:278m",
  "usage": {
    "prompt_tokens": 5,
    "total_tokens": 5
  }
}
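
Embedding vectors are typically compared with cosine similarity (e.g., for semantic search or RAG); a minimal sketch using only the standard library:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```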

GET /v1/models

List of available models.

Request

curl -X GET "https://api.ai.cloud-temple.com/v1/models" \
  -H "Authorization: Bearer YOUR_API_TOKEN"

Response

{
  "object": "list",
  "data": [
    {
      "id": "granite3.3:8b",
      "object": "model",
      "created": 1749110897,
      "owned_by": "CloudTemple",
      "root": "granite3.3:8b",
      "aliases": ["granite3.3:8b"],
      "max_model_len": 60000,
      "permission": [
        {
          "id": "modelperm-granite3.3:8b-1749110897",
          "object": "model_permission",
          "allow_sampling": true,
          "allow_view": true,
          "allow_fine_tuning": false
        }
      ]
    }
  ]
}

Error Codes

400 - Invalid Request

{
  "error": {
    "message": "Invalid parameter 'temperature': must be between 0 and 2",
    "type": "invalid_request_error",
    "param": "temperature"
  }
}

401 - Unauthorized

{
  "error": {
    "message": "Invalid API key provided",
    "type": "authentication_error"
  }
}

404 - Model Not Found

{
  "error": {
    "message": "Model 'unknown-model' does not exist",
    "type": "invalid_request_error",
    "param": "model",
    "code": "model_not_found"
  }
}

429 - Rate Limit

{
  "error": {
    "message": "Rate limit exceeded. Please upgrade your plan or try again later.",
    "type": "rate_limit_error",
    "code": "rate_limit_exceeded"
  }
}

500 - Server Error

{
  "error": {
    "message": "Internal server error",
    "type": "server_error"
  }
}

503 - Service Unavailable

{
  "error": {
    "message": "Service temporarily unavailable",
    "type": "service_unavailable_error"
  }
}

Examples by Language

Python with requests

import requests
import json

# Configuration
# It is recommended to protect your API key by using environment variables.
# Example: API_KEY = os.getenv("LLMAAS_API_KEY")
API_KEY = "YOUR_API_TOKEN"
BASE_URL = "https://api.ai.cloud-temple.com/v1"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# Chat completion
payload = {
    "model": "granite3.3:8b",
    "messages": [
        {"role": "user", "content": "Hello!"}
    ],
    "max_tokens": 100
}

try:
    response = requests.post(
        f"{BASE_URL}/chat/completions",
        headers=headers,
        json=payload,
        timeout=30  # request timeout
    )

    response.raise_for_status()  # raises on HTTP errors (4xx, 5xx)
    result = response.json()
    print(result["choices"][0]["message"]["content"])

except requests.exceptions.HTTPError as e:
    print(f"HTTP error: {e.response.status_code} - {e.response.text}")
except requests.exceptions.RequestException as e:
    print(f"Network error: {e}")
except json.JSONDecodeError:
    print(f"JSON decoding error: {response.text}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Python with Streaming

import requests
import json

def stream_chat(message, model="granite3.3:8b"):
    # It is recommended to protect your API key by using environment variables.
    # Example: API_KEY = os.getenv("LLMAAS_API_KEY")
    API_KEY = "YOUR_API_TOKEN"
    BASE_URL = "https://api.ai.cloud-temple.com/v1"

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}"
    }

    payload = {
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": True,
        "max_tokens": 200
    }

    try:
        response = requests.post(
            f"{BASE_URL}/chat/completions",
            headers=headers,
            json=payload,
            stream=True,
            timeout=30  # request timeout
        )

        response.raise_for_status()  # raises on HTTP error codes (4xx, 5xx)

        for line in response.iter_lines():
            if line:
                line = line.decode('utf-8')
                if line.startswith('data: '):
                    data = line[6:]  # strip the 'data: ' prefix
                    if data == '[DONE]':
                        break
                    try:
                        chunk = json.loads(data)
                        content = chunk['choices'][0]['delta'].get('content', '')
                        if content:
                            print(content, end='', flush=True)
                    except json.JSONDecodeError:
                        print(f"JSON decoding error in stream: {data}")
                        continue
        print()  # newline after the stream
    except requests.exceptions.HTTPError as e:
        print(f"HTTP error: {e.response.status_code} - {e.response.text}")
    except requests.exceptions.RequestException as e:
        print(f"Network error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

Usage

stream_chat("Explain quantum physics")

JavaScript/Node.js

const axios = require('axios');

// Configuration
// It is recommended to protect your API key by using environment variables.
// Example: const API_KEY = process.env.LLMAAS_API_KEY;
const API_KEY = 'YOUR_API_TOKEN';
const BASE_URL = 'https://api.ai.cloud-temple.com/v1';

async function chatCompletion(message) {
  try {
    const response = await axios.post(
      `${BASE_URL}/chat/completions`,
      {
        model: 'granite3.3:8b',
        messages: [
          { role: 'user', content: message }
        ],
        max_tokens: 100
      },
      {
        headers: {
          'Content-Type': 'application/json',
          'Authorization': `Bearer ${API_KEY}`
        },
        timeout: 30000 // request timeout (30 seconds)
      }
    );

    return response.data.choices[0].message.content;
  } catch (error) {
    console.error('Error:', error.response?.data || error.message);
    // More detailed error handling can be added here if needed,
    // e.g. if (error.response?.status === 429) { console.error("Rate limit exceeded"); }
  }
}

// Usage
chatCompletion('Hello!').then(response => {
  if (response) {
    console.log(response);
  }
});

JavaScript mit Fetch (Browser)

async function fetchCompletion(message) {
  // API_KEY is assumed to be defined elsewhere; never ship a real key in browser code.
  const response = await fetch('https://api.ai.cloud-temple.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${API_KEY}`
    },
    body: JSON.stringify({
      model: 'granite3.3:8b',
      messages: [
        { role: 'user', content: message }
      ],
      max_tokens: 100
    })
  });

  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  }

  const data = await response.json();
  return data.choices[0].message.content;
}

Best Practices

Error Handling

import requests
import time

def safe_api_call(payload):
    # url and headers are assumed to be defined as in the previous examples
    try:
        response = requests.post(url, headers=headers, json=payload)
        response.raise_for_status()
        return response.json()
    except requests.exceptions.HTTPError as e:
        if response.status_code == 429:
            print("Rate limit reached, waiting...")
            time.sleep(60)  # wait 1 minute
            return safe_api_call(payload)  # retry
        else:
            print(f"HTTP error: {e}")
    except requests.exceptions.RequestException as e:
        print(f"Network error: {e}")

Cost Optimization

  1. Use appropriate models: use smaller models for testing
  2. Limit max_tokens: avoid overly long responses
  3. Reuse conversations: make efficient use of the context window
  4. Monitoring: track your usage in the console
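
Using the rates from the pricing section above (1.90 € per million input tokens, 8.00 € per million output tokens), the cost of a single completion can be estimated from its usage block; a minimal sketch:

```python
# Prices in EUR per million tokens, taken from the pricing table above.
INPUT_PRICE = 1.90
OUTPUT_PRICE = 8.00

def estimate_cost(usage: dict) -> float:
    """Estimate the cost in EUR of one completion from its `usage` block."""
    return (usage["prompt_tokens"] * INPUT_PRICE
            + usage["completion_tokens"] * OUTPUT_PRICE) / 1_000_000
```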

Security

  1. Protect your token: Environment variables
  2. Regular rotation: Change your keys periodically
  3. Input validation: Sanitize user data
  4. Client rate limiting: Implement your own limits
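
Point 4 (client-side rate limiting) can be as simple as a token bucket in front of your API calls; a minimal sketch (the class and its parameters are illustrative, not part of the API):

```python
import time

class TokenBucket:
    """Simple client-side rate limiter: allow `rate` calls per `per` seconds."""
    def __init__(self, rate: int, per: float):
        self.capacity = rate
        self.tokens = float(rate)
        self.refill_rate = rate / per  # tokens added per second
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, then try to spend one token.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Call `bucket.allow()` before each request and wait (or fail fast) when it returns False.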

SDK and Integrations

The LLMaaS API is compatible with existing OpenAI SDKs by simply changing the base URL:

OpenAI Python SDK

from openai import OpenAI

# It is recommended to protect your API key by using environment variables.
# Example: api_key=os.getenv("LLMAAS_API_KEY")
client = OpenAI(
    api_key="YOUR_API_TOKEN",
    base_url="https://api.ai.cloud-temple.com/v1"
)

try:
    response = client.chat.completions.create(
        model="granite3.3:8b",
        messages=[
            {"role": "user", "content": "Hello!"}
        ],
        max_tokens=50  # added for consistency with the tests
    )

    print(response.choices[0].message.content)

except Exception as e:
    print(f"OpenAI SDK error: {e}")

LangChain

from langchain_openai import ChatOpenAI
from langchain.schema import HumanMessage

# Chat Model Configuration (LLMaaS-compatible)
# It is recommended to protect your API key by using environment variables.
# Example: api_key=os.getenv("LLMAAS_API_KEY")
chat = ChatOpenAI(
    api_key="YOUR_API_TOKEN",
    base_url="https://api.ai.cloud-temple.com/v1",
    model="granite3.3:8b",
    # Note: Parameters like max_tokens are passed via model_kwargs
    # to ensure compatibility across LangChain versions.
    model_kwargs={"max_tokens": 200}
)

try:
    # Usage with messages
    messages = [HumanMessage(content="Explain AI in 3 sentences")]
    response = chat.invoke(messages)
    print(response.content)

    # Or with a simple string
    response = chat.invoke("Hello, how are you?")
    print(response.content)

except Exception as e:
    print(f"LangChain error: {e}")

Using Embeddings

Incompatibility with standard LangChain clients

Currently, using the embedding endpoint with LangChain's standard classes (langchain_openai.OpenAIEmbeddings or langchain_community.OllamaEmbeddings) is incompatible with our API.

  • OpenAIEmbeddings sends pre-computed tokens instead of raw text, which is rejected.
  • OllamaEmbeddings does not handle the required Bearer Token authentication.

Until a permanent solution is available, it is recommended to create a custom embedding class or call the API directly, as demonstrated in the examples/simple-rag-demo example.

from langchain.embeddings.base import Embeddings
from typing import List
import httpx

class LLMaaSEmbeddings(Embeddings):
    """
    Custom embedding class to interact with Cloud Temple's LLMaaS API.
    This class is designed to be compatible with LangChain's `Embeddings` interface,
    enabling its use within LangChain pipelines while calling our specific API.
    """
    def __init__(self, api_key: str, base_url: str = "https://api.ai.cloud-temple.com/v1", model_name: str = "granite-embedding:278m"):
        self.api_key = api_key
        self.base_url = base_url
        self.model_name = model_name
        self.headers = {
            "Authorization": f"Bearer {self.api_key}",
            "Content-Type": "application/json",
        }

    def _embed(self, texts: List[str]) -> List[List[float]]:
        payload = {"input": texts, "model": self.model_name}
        try:
            with httpx.Client(timeout=30.0) as client:
                response = client.post(f"{self.base_url}/embeddings", headers=self.headers, json=payload)
                response.raise_for_status()
                data = response.json()['data']
                # Sort embeddings by index to ensure correct order
                data.sort(key=lambda e: e['index'])
                return [item['embedding'] for item in data]
        except httpx.HTTPStatusError as e:
            print(f"HTTP error during embedding retrieval: {e.response.status_code}")
            print(f"Response: {e.response.text}")
            return []

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        return self._embed(texts)

    def embed_query(self, text: str) -> List[float]:
        return self._embed([text])[0]

# Usage
# embeddings = LLMaaSEmbeddings(
#     api_key="YOUR_API_TOKEN",
#     base_url="https://api.ai.cloud-temple.com/v1",
#     model_name="granite-embedding:278m"
# )
# vector = embeddings.embed_query("My text to vectorize")

Support

  • Documentation : Quickstart Guide
  • Model Catalog : Complete List
  • Console : Management and monitoring via Cloud Temple Console
  • Support : Via Cloud Temple Console