
Concepts and Architecture of LLMaaS

Overview

Cloud Temple's LLMaaS (Large Language Models as a Service) provides secure and sovereign access to the most advanced artificial intelligence models, with ANSSI's SecNumCloud qualification.

🏗️ Technical Architecture

Cloud Temple Infrastructure

LLMaaS Cloud Temple Technical Architecture

Main Components

1. API Gateway LLMaaS

  • OpenAI-compatible : Seamless integration with the existing ecosystem
  • Rate Limiting : Quota management per billing tier
  • Load Balancing : Intelligent distribution across 12 GPU machines
  • Monitoring : Real-time metrics and alerting

2. Authentication Service

  • Secure API Tokens : Automatic rotation
  • Access Control : Granular permissions per model
  • Audit trails : Complete access traceability

🤖 Models and Tokens

Model Catalog

Complete catalog: Model list

Token Management

Token Types

  • Input tokens : Your prompt and context
  • Output tokens : Model-generated response
  • System tokens : Metadata and instructions

Cost Calculation

Chat/Completion = (Input tokens × 1.8€/M) + (Output tokens × 8€/M) + (Reasoning output tokens × 8€/M)
Reranking = Reranked documents × 4€/M
Batch (async) = (Input tokens × 0.9€/M) + (Output tokens × 4€/M)
Audio (ASR) = 0.01€ / minute of transcription
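The formulas above can be sketched as a few helper functions. This is a minimal illustration rather than an official billing tool: the function names are hypothetical, the rates are simply copied from the formulas, and reranking is left out because the formula does not spell out its unit in more detail.

```python
# Rates in euros per million tokens, copied from the pricing formulas above.
CHAT_INPUT_RATE = 1.8
CHAT_OUTPUT_RATE = 8.0
CHAT_REASONING_RATE = 8.0
BATCH_INPUT_RATE = 0.9
BATCH_OUTPUT_RATE = 4.0
ASR_RATE_PER_MINUTE = 0.01  # euros per minute of transcription


def chat_cost(input_tokens: int, output_tokens: int, reasoning_tokens: int = 0) -> float:
    """Cost in euros of a synchronous chat/completion call."""
    return (
        input_tokens * CHAT_INPUT_RATE
        + output_tokens * CHAT_OUTPUT_RATE
        + reasoning_tokens * CHAT_REASONING_RATE
    ) / 1_000_000


def batch_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in euros of the same workload sent through the batch (async) mode."""
    return (input_tokens * BATCH_INPUT_RATE + output_tokens * BATCH_OUTPUT_RATE) / 1_000_000


def asr_cost(minutes: float) -> float:
    """Cost in euros of an audio transcription."""
    return minutes * ASR_RATE_PER_MINUTE


# 1M input tokens and 250k output tokens, synchronous vs batch
print(f"{chat_cost(1_000_000, 250_000):.2f}")   # 3.80
print(f"{batch_cost(1_000_000, 250_000):.2f}")  # 1.90
```

With these rates, the batch mode costs half as much as the synchronous mode for the same token volume, which makes it worth considering for workloads that do not need an immediate answer.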

Optimization

  • Context window : Reuse conversations to save costs
  • Appropriate models : Choose the size based on complexity
  • Max tokens : Limit response length

Tokenization

# Example of token estimation
def estimate_tokens(text: str) -> int:
    """Rough estimate: 1 token ≈ 4 characters."""
    return len(text) // 4

prompt = "Explain photosynthesis"
response_max = 200  # desired max tokens

estimated_input = estimate_tokens(prompt)  # 22 characters -> ~5 tokens
# Rates in €/M tokens: 1.8 for input, 8 for output
total_cost = (estimated_input * 1.8 + response_max * 8) / 1_000_000
print(f"Estimated cost: {total_cost:.6f}€")

🔒 Security and Compliance

SecNumCloud Qualification

The LLMaaS service runs on a technical infrastructure that holds the ANSSI SecNumCloud 3.2 qualification, ensuring:

Data Protection

  • End-to-end encryption : TLS 1.3 for all communications
  • Secure storage : Data encrypted at rest (AES-256)
  • Isolation : Dedicated environments per tenant

Digital Sovereignty

  • France Hosting : Certified Cloud Temple datacenters
  • French Law : Native GDPR compliance
  • No Exposure : No transfers to foreign clouds

Audit and Traceability

  • Complete logs : All interactions tracked
  • Retention : Logs retained according to legal policies
  • Compliance : Audit reports available

Security Controls

LLMaaS Security Controls

Prompt Security

Prompt analysis is a native and integrated security feature of the LLMaaS platform. Enabled by default, it aims to detect and prevent "jailbreak" attempts or malicious prompt injections before they even reach the model. This protection relies on a multi-layered approach.

Tip: Contact support for deactivation

This security analysis can be disabled for very specific use cases, although doing so is not recommended. For any questions, or to request deactivation, please contact Cloud Temple support.

1. Structural Analysis (check_structure)

  • Malformed JSON Check: The system detects if the prompt starts with a { and attempts to parse it as JSON. If parsing succeeds and the JSON contains suspicious keywords (e.g., "system", "bypass"), or if parsing fails unexpectedly, this may indicate an injection attempt.
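  • Unicode Normalization: The prompt is normalized using unicodedata.normalize('NFKC', prompt). If the original prompt differs from its normalized version, this may indicate the use of deceptive Unicode characters to bypass filters, for example fullwidth "ａｄｍｉｎ" instead of "admin". Note that NFKC only folds compatibility forms: cross-script homoglyphs such as "аdmin" (with a Cyrillic "а") are left unchanged by normalization and are instead covered by the mixed-script check in the pattern detection step.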
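The two structural checks can be sketched as follows. This is a rough illustration under stated assumptions, not the platform's actual check_structure implementation: the return format, the finding labels, and the keyword list (taken from the examples above) are all hypothetical.

```python
import json
import unicodedata

# Assumption: keyword list limited to the two examples given in the text.
SUSPICIOUS_KEYWORDS = ("system", "bypass")


def check_structure(prompt: str) -> list[str]:
    """Return a list of findings; an empty list means nothing suspicious."""
    findings = []

    # JSON check: only prompts that look like a JSON object are parsed.
    stripped = prompt.lstrip()
    if stripped.startswith("{"):
        try:
            parsed = json.loads(stripped)
        except json.JSONDecodeError:
            findings.append("malformed_json")
        else:
            # Search keys and values alike by re-serializing the parsed object.
            flattened = json.dumps(parsed).lower()
            if any(keyword in flattened for keyword in SUSPICIOUS_KEYWORDS):
                findings.append("suspicious_json_keyword")

    # Unicode check: flag prompts that change under NFKC normalization.
    if unicodedata.normalize("NFKC", prompt) != prompt:
        findings.append("unicode_normalization_changed")

    return findings


print(check_structure('{"role": "system"}'))  # ['suspicious_json_keyword']
print(check_structure("Hello"))               # []
```

NFKC folds compatibility characters such as fullwidth letters into their plain equivalents, so a prompt written with such characters is flagged by the second check.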

2. Detection of Suspicious Patterns (check_patterns)

  • The system uses regular expressions (regex) to identify known prompt attack patterns, across multiple languages (French, English, Chinese, Japanese).
  • Examples of detected patterns :
    • System Commands : Keywords such as "ignore les instructions", "ignore instructions", "忽略指令", "指示を無視".
    • HTML Injection : Hidden or malicious HTML tags, for example <div caché>, <hidden div>.
    • Markdown Injection : Malicious Markdown links, for example [texte](javascript:...), [text](data:...).
    • Repeated Sequences : Excessive repetition of words or phrases such as "oublie oublie oublie", "forget forget forget".
    • Special/Mixed Characters : Use of unusual Unicode characters or script mixing to mask commands (e.g., "s\u0443stème").
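A few of the pattern families listed above can be approximated with regular expressions. The expressions below are illustrative guesses, not the rules actually deployed on the platform, and they cover only three families (system commands, Markdown injection, repeated sequences). The pattern names are hypothetical.

```python
import re

# Assumption: simplified patterns built from the examples quoted in the text.
PATTERNS = {
    "system_command": re.compile(
        r"ignore\s+(?:les\s+)?instructions|忽略指令|指示を無視", re.IGNORECASE
    ),
    "markdown_injection": re.compile(
        r"\[[^\]]*\]\(\s*(?:javascript|data):", re.IGNORECASE
    ),
    # The same word appearing three or more times in a row.
    "repeated_sequence": re.compile(r"\b(\w+)(?:\s+\1\b){2,}", re.IGNORECASE),
}


def check_patterns(prompt: str) -> list[str]:
    """Return the names of all patterns found in the prompt."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(prompt)]


print(check_patterns("forget forget forget"))         # ['repeated_sequence']
print(check_patterns("[text](javascript:alert(1))"))  # ['markdown_injection']
```

A production rule set would need many more variants per language. This sketch only shows how the families map onto regular expressions.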

3. Behavioral Analysis (check_behavior)

  • The load balancer maintains a history of recent prompts.
  • Fragmentation Detection: It combines recent prompts to check if an attack is fragmented across multiple requests. For example, if "ignore" is sent in one prompt and "instructions" in the next, the system can detect them together.
  • Repetition Detection: It identifies whether the same prompt is repeated excessively. The current threshold for repetition detection is 30 consecutive identical prompts.
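The behavioral checks can be illustrated with a small stateful class. This is a sketch under assumptions: the class name, the history size, and the fragment list are hypothetical, and only the threshold of 30 consecutive identical prompts comes from the text above.

```python
from collections import deque

REPETITION_THRESHOLD = 30  # consecutive identical prompts, per the text above
HISTORY_SIZE = 50          # assumption: the real history length is not documented


class BehaviorMonitor:
    """Keeps a bounded history of prompts and inspects it on each new prompt."""

    def __init__(self, fragments: tuple[str, ...] = ("ignore", "instructions")):
        self.history: deque[str] = deque(maxlen=HISTORY_SIZE)
        self.fragments = fragments

    def check_behavior(self, prompt: str) -> list[str]:
        self.history.append(prompt)
        findings = []

        # Fragmentation: all fragments appear across recent prompts combined,
        # but not all inside the current prompt on its own.
        combined = " ".join(self.history).lower()
        in_combined = all(f in combined for f in self.fragments)
        in_current = all(f in prompt.lower() for f in self.fragments)
        if in_combined and not in_current:
            findings.append("fragmented_attack")

        # Repetition: length of the run of identical prompts ending now.
        streak = 0
        for past in reversed(self.history):
            if past != prompt:
                break
            streak += 1
        if streak >= REPETITION_THRESHOLD:
            findings.append("excessive_repetition")

        return findings


monitor = BehaviorMonitor()
print(monitor.check_behavior("please ignore"))              # []
print(monitor.check_behavior("all previous instructions"))  # ['fragmented_attack']
```

In this simplified version, a flagged fragment stays in the history until it slides out of the window, so later prompts can keep triggering the finding. A real implementation would likely add expiry or per-client state.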

This multi-layered approach enables the detection of a wide range of prompt attacks, from the simplest to the most sophisticated, by combining static content analysis with dynamic behavioral analysis.

📈 Performance and Scalability

Real-Time Monitoring

Access via Cloud Temple Console :

  • Usage metrics per model
  • Latency and throughput graphs
  • Alerts on performance thresholds
  • Request history

🌐 Integration and Ecosystem

OpenAI Compatibility

The LLMaaS service is compatible with the OpenAI API:

# Seamless migration
from openai import OpenAI

# Before (OpenAI)
client_openai = OpenAI(api_key="sk-...")

# After (Cloud Temple LLMaaS)
client_ct = OpenAI(
    api_key="your-cloud-temple-token",
    base_url="https://api.ai.cloud-temple.com/v1",
)

# Identical code!
response = client_ct.chat.completions.create(
    model="gpt-oss:120b",  # Cloud Temple model
    messages=[{"role": "user", "content": "Hello"}],
)

Supported Ecosystem

AI Frameworks

  • LangChain : Native integration
  • Haystack : Document pipeline
  • Semantic Kernel : Microsoft orchestration
  • AutoGen : Conversational agents

Development Tools

  • Jupyter : Interactive Notebooks
  • Streamlit : Quick Web Applications
  • Gradio : AI User Interfaces
  • FastAPI : Backend APIs

No-Code Platforms

  • Zapier : Automations
  • Make : Visual Integrations
  • Bubble : Web Applications

🔄 Model Lifecycle

Model Updates

LLMaaS Model Lifecycle

Versioning Policy

  • Stable Models : Fixed versions available for 6 months
  • Experimental Models : Beta versions for early adopters
  • Deprecation : 3-month notice before removal
  • Migration : Professional services available to support your transitions

Projected Lifecycle Schedule

The table below presents the projected lifecycle of our models. The generative AI ecosystem evolves very rapidly, which explains why lifecycles may appear short. Our goal is to provide you with access to the most performant models currently available.

However, we are committed to preserving over time the models that are most widely used by our clients. For critical use cases requiring long-term stability, extended support phases are available. Please do not hesitate to contact support to discuss your specific needs.

This schedule is provided for informational purposes and is reviewed at the beginning of each quarter.

  • DMP (Production Release Date) : The date on which the model becomes available in production.
  • DSP (End of Support Date) : The projected date from which the model will no longer be maintained. A 3-month notice period is respected before any effective removal.

| Model | Publisher | Phase | DMP | DSP | LTS | Recommended migration |
| --- | --- | --- | --- | --- | --- | --- |
| cogito:32b | Deep Cogito | Production | 13/06/2025 | 30/06/2026 | No | gpt-oss:120b |
| embeddinggemma:300m | Google | Production | 10/09/2025 | 30/06/2026 | No | |
| gemma3:27b | Google | Production | 13/06/2025 | 30/06/2026 | No | |
| glm-4.7-flash:30b | Zhipu AI | Production | 22/01/2026 | 30/06/2026 | No | |
| ministral-3:14b | Mistral AI | Production | 30/12/2025 | 30/06/2026 | No | |
| ministral-3:3b | Mistral AI | Production | 30/12/2025 | 30/06/2026 | No | |
| ministral-3:8b | Mistral AI | Production | 30/12/2025 | 30/06/2026 | No | |
| olmo-3:32b | AllenAI | Production | 30/12/2025 | 30/06/2026 | No | |
| olmo-3:7b | AllenAI | Production | 30/12/2025 | 30/06/2026 | No | |
| qwen3-omni:30b | Qwen Team | Production | 05/01/2026 | 30/06/2026 | No | |
| qwen3-vl:2b | Qwen Team | Production | 30/12/2025 | 30/06/2026 | No | |
| qwen3-vl:32b | Qwen Team | Production | 30/12/2025 | 30/06/2026 | No | |
| qwen3-vl:8b | Qwen Team | Production | 05/01/2026 | 30/06/2026 | No | |
| rnj-1:8b | Essential AI | Production | 30/12/2025 | 30/06/2026 | No | |
| devstral-small-2:24b | Mistral AI & All Hands AI | Production | 02/02/2026 | 30/09/2026 | No | |
| gemma4:e2b | Google | Production | 19/04/2026 | 30/09/2026 | No | |
| gemma4:e4b | Google | Production | 19/04/2026 | 30/09/2026 | No | |
| gpt-oss:20b | OpenAI | Production | 08/08/2025 | 30/09/2026 | No | |
| mistral-small3.2:24b | Mistral AI | Production | 23/06/2025 | 30/09/2026 | No | |
| qwen3.5:4b | Qwen Team | Production | 24/03/2026 | 30/09/2026 | No | |
| qwen3.5:9b | Qwen Team | Production | 24/03/2026 | 30/09/2026 | No | |
| bge-reranker-large | BAAI | Production | 13/05/2026 | 30/12/2026 | No | |
| deepseek-ocr | DeepSeek AI | Production | 22/11/2025 | 30/12/2026 | No | |
| functiongemma:270m | Google | Production | 30/12/2025 | 30/12/2026 | No | |
| gemma4:31b | Google | Production | 14/04/2026 | 30/12/2026 | No | |
| granite3-guardian:2b | IBM | Production | 13/06/2025 | 30/12/2026 | No | |
| granite3-guardian:8b | IBM | Production | 13/06/2025 | 30/12/2026 | No | |
| granite3.2-vision:2b | IBM | Production | 13/06/2025 | 30/12/2026 | No | |
| mistral-small4:119b | Mistral AI | Production | 13/05/2026 | 30/12/2026 | No | |
| nemotron-3-super:120b | NVIDIA | Production | 01/04/2026 | 30/12/2026 | No | |
| nemotron-cascade:30b | NVIDIA | Production | 01/04/2026 | 30/12/2026 | No | |
| nemotron3-nano:30b | NVIDIA | Production | 04/01/2026 | 30/12/2026 | No | |
| qwen-coder-next:80b | Qwen Team | Production | 04/02/2026 | 30/12/2026 | No | |
| qwen3-embedding:0.6b | Qwen Team | Production | 14/05/2026 | 30/12/2026 | No | |
| qwen3-embedding:4b | Qwen Team | Production | 14/05/2026 | 30/12/2026 | No | |
| qwen3-embedding:8b | Qwen Team | Production | 14/05/2026 | 30/12/2026 | No | |
| qwen3-next:80b | Qwen Team | Production | 02/02/2026 | 30/12/2026 | No | |
| qwen3-reranker:0.6b | Qwen Team | Production | 13/05/2026 | 30/12/2026 | No | |
| qwen3-reranker:4b | Qwen Team | Production | 13/05/2026 | 30/12/2026 | No | |
| qwen3-vl:235b | Qwen Team | Production | 04/01/2026 | 30/12/2026 | No | |
| qwen3-vl:30b | Qwen Team | Production | 30/12/2025 | 30/12/2026 | No | |
| qwen3-vl:4b | Qwen Team | Production | 30/12/2025 | 30/12/2026 | No | |
| qwen3.5:0.8b | Qwen Team | Production | 24/03/2026 | 30/12/2026 | No | |
| qwen3.6:27b | Qwen Team | Production | 01/05/2026 | 30/12/2026 | No | |
| qwen3.6:35b | Qwen Team | Production | 01/05/2026 | 30/12/2026 | No | |
| qwen3:0.6b | Qwen Team | Production | 13/06/2025 | 30/12/2026 | Yes | |
| translategemma:12b | Google | Production | 22/01/2026 | 30/12/2026 | No | |
| translategemma:27b | Google | Production | 22/01/2026 | 30/12/2026 | No | |
| translategemma:4b | Google | Production | 22/01/2026 | 30/12/2026 | No | |
| voxtral | Mistral AI | Production | 01/04/2026 | 30/12/2026 | No | |
| z-image:16b | Community | Production | 01/04/2026 | 30/12/2026 | No | |
| nvidia/llama-nemotron-rerank-vl-1b-v2 | NVIDIA | Production | 13/05/2026 | 30/06/2027 | No | |
| bge-m3:567m | BAAI | Production | 18/10/2025 | 30/12/2027 | Yes | |
| gpt-oss:120b | OpenAI | Production | 11/11/2025 | 30/12/2027 | Yes | |
| granite-embedding:278m | IBM | Production | 13/06/2025 | 30/12/2027 | Yes | |
| llama3.3:70b | Meta | Production | 13/06/2025 | 30/12/2027 | Yes | |
| qwen3-2507:235b | Qwen Team | Production | 04/01/2026 | 30/12/2027 | Yes | |
| qwen3-2507-think:4b | Qwen Team | Production | 31/08/2025 | 30/12/2027 | Yes | |

Legend

  • Phase: Model lifecycle (Evaluation, Production, Deprecated)
  • DMP: Production Release Date
  • DSP: Planned Decommissioning Date
  • LTS: Long Term Support. LTS models benefit from guaranteed stability and extended support, ideal for critical applications.
  • Recommended Migration: Model recommended to replace an end-of-life model.

To track the lifecycle status in real time, visit the page: LLMaaS Status - Lifecycle
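To act on this schedule in your own tooling, one option is to compute how many days remain before a model's DSP. The sketch below is purely illustrative: the dictionary holds three rows copied by hand from the schedule, the function names are hypothetical, and the 90-day horizon is an assumption chosen to approximate the 3-month notice period.

```python
from datetime import date, datetime

# A few rows copied from the schedule: model -> (DSP, LTS flag).
SCHEDULE = {
    "cogito:32b": ("30/06/2026", False),
    "gpt-oss:20b": ("30/09/2026", False),
    "gpt-oss:120b": ("30/12/2027", True),
}


def parse_date(text: str) -> date:
    """Parse the DD/MM/YYYY format used in the schedule."""
    return datetime.strptime(text, "%d/%m/%Y").date()


def days_until_dsp(model: str, today: date) -> int:
    """Number of days between `today` and the model's DSP."""
    dsp, _lts = SCHEDULE[model]
    return (parse_date(dsp) - today).days


def models_to_migrate(today: date, horizon_days: int = 90) -> list[str]:
    """Models whose DSP falls within `horizon_days` of `today`."""
    return [m for m in SCHEDULE if days_until_dsp(m, today) <= horizon_days]


print(days_until_dsp("cogito:32b", date(2026, 5, 1)))  # 60
print(models_to_migrate(date(2026, 5, 1)))             # ['cogito:32b']
```

Because the schedule is reviewed every quarter, any such hard-coded copy should be refreshed from the published table or the status page rather than treated as authoritative.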


Deprecated Models

The world of LLMs is evolving very rapidly. To ensure our clients have access to the most cutting-edge technologies, we regularly deprecate models that no longer meet current standards or are no longer in use. The models listed below are no longer available on the public platform. However, they can be reactivated for specific projects upon request.

| Model | Phase | Deprecation Date |
| --- | --- | --- |
| devstral:24b | Deprecated | 30/03/2026 |
| granite3.1-moe:2b | Deprecated | 30/03/2026 |
| granite4-small-h:32b | Deprecated | 15/05/2026 |
| granite4-tiny-h:7b | Deprecated | 15/05/2026 |
| medgemma:27b | Deprecated | 15/05/2026 |
| qwen3-2507-gptq:235b | Deprecated | 15/05/2026 |
| qwen3-coder:30b | Deprecated | 30/03/2026 |
| qwen3:30b-a3b | Deprecated | 30/03/2026 |
| deepseek-r1:14b | Deprecated | 30/12/2025 |
| deepseek-r1:32b | Deprecated | 30/12/2025 |
| gemma3:1b | Deprecated | 30/12/2025 |
| gemma3:4b | Deprecated | 30/12/2025 |
| qwen3:1.7b | Deprecated | 30/12/2025 |
| qwen3:14b | Deprecated | 30/12/2025 |
| qwen3:4b | Deprecated | 30/12/2025 |
| qwen3:8b | Deprecated | 30/12/2025 |
| qwen3:32b | Deprecated | 30/12/2025 |
| qwq:32b | Deprecated | 30/12/2025 |
| granite3.3:2b | Deprecated | 30/12/2025 |
| granite3.3:8b | Deprecated | 30/12/2025 |
| mistral-small3.1:24b | Deprecated | 30/12/2025 |
| qwen2.5vl:32b | Deprecated | 30/12/2025 |
| qwen2.5vl:3b | Deprecated | 30/12/2025 |
| qwen2.5vl:72b | Deprecated | 30/12/2025 |
| qwen2.5vl:7b | Deprecated | 30/12/2025 |
| cogito:8b | Deprecated | 30/12/2025 |
| deepcoder:14b | Deprecated | 30/12/2025 |
| cogito:3b | Deprecated | 30/12/2025 |
| qwen3:235b | Deprecated | 22/11/2025 |
| qwen3-2507-think:30b-a3b | Deprecated | 14/11/2025 |
| gemma3:12b | Deprecated | 21/11/2025 |
| cogito:14b | Deprecated | 17/10/2025 |
| deepseek-r1:70b | Deprecated | 17/10/2025 |
| granite3.1-moe:3b | Deprecated | 17/10/2025 |
| llama3.1:8b | Deprecated | 17/10/2025 |
| phi4-reasoning:14b | Deprecated | 17/10/2025 |
| qwen2.5:0.5b | Deprecated | 17/10/2025 |
| qwen2.5:1.5b | Deprecated | 17/10/2025 |
| qwen2.5:14b | Deprecated | 17/10/2025 |
| qwen2.5:32b | Deprecated | 17/10/2025 |
| qwen2.5:3b | Deprecated | 17/10/2025 |
| deepseek-r1:671b | Deprecated | 17/10/2025 |

💡 Best Practices

To get the most out of the LLMaaS API, it is essential to adopt cost, performance, and security optimization strategies.

Cost Optimization

Cost management relies on the intelligent use of tokens and models.

  1. Model Selection : Do not use an overly powerful model for a simple task. A larger model is more capable, but it is also slower and consumes significantly more energy, which directly impacts the cost. Adjust the model size to the complexity of your requirement for an optimal balance.

    For example, to process one million tokens:

    • Gemma 3 1B consumes 0.15 kWh.
    • Llama 3.3 70B consumes 11.75 kWh, which is 78 times more.

    task = "sentiment_analysis"  # hypothetical task label, for illustration

    # For sentiment classification, a compact model is sufficient and cost-effective.
    if task == "sentiment_analysis":
        model = "qwen3.5:0.8b"
    # For complex legal analysis, a larger model is required.
    elif task == "legal_analysis":
        model = "gpt-oss:120b"
  2. Context Management : The conversation history (messages) is sent again with each call, consuming input tokens every time. For long conversations, consider summarization or windowing strategies to retain only relevant information.

    # For a long conversation, you can summarize the initial exchanges.
    messages = [
        {"role": "system", "content": "You are an AI assistant."},
        {"role": "user", "content": "Summary of the first 10 exchanges..."},
        {"role": "assistant", "content": "OK, I have the context."},
        {"role": "user", "content": "Here is my new question."},
    ]
  3. Output Token Limitation : Always use the max_tokens parameter to avoid excessively long and costly responses. Set a reasonable limit based on your expectations.

    # Request a summary of up to 100 words.
    response = client.chat.completions.create(
        model="gpt-oss:120b",
        messages=[{"role": "user", "content": "Summarize this document..."}],
        max_tokens=150,  # Safety margin for ~100 words
    )
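The windowing strategy mentioned under Context Management can be sketched as a small helper that keeps the system messages and only the most recent exchanges. The function name and the default window size are hypothetical. Choose a window that fits your application.

```python
def trim_history(messages: list[dict], max_turns: int = 6) -> list[dict]:
    """Keep system messages plus the last `max_turns` non-system messages."""
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    # Guard against max_turns == 0, where dialogue[-0:] would keep everything.
    recent = dialogue[-max_turns:] if max_turns > 0 else []
    return system + recent


history = [{"role": "system", "content": "You are an AI assistant."}]
for i in range(10):
    role = "user" if i % 2 == 0 else "assistant"
    history.append({"role": role, "content": f"message {i}"})

trimmed = trim_history(history, max_turns=4)
print(len(trimmed))           # 5
print(trimmed[1]["content"])  # message 6
```

Windowing discards older exchanges outright, so it suits conversations where early context stops mattering. When early context must be preserved, summarizing it into a single message is the safer option.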

Performance

The responsiveness of your application depends on how you handle API calls.

  1. Asynchronous Requests : To process multiple requests without waiting for each to finish, use asynchronous calls. This is particularly useful for backend applications handling a high volume of concurrent requests.

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(api_key="...", base_url="...")

    async def process_prompt(prompt: str):
        # Process a single request asynchronously
        response = await client.chat.completions.create(
            model="gpt-oss:120b",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    async def batch_requests(prompts: list[str]):
        # Launch the requests concurrently and wait for all of them to complete
        tasks = [process_prompt(p) for p in prompts]
        return await asyncio.gather(*tasks)
  2. Streaming for User Experience (UX) : For user interfaces (chatbots, assistants), streaming is essential. It allows displaying the model's response word by word, giving an impression of immediate responsiveness instead of waiting for the complete response.

    # Display the response in real time in a user interface
    # (assumes `client` is a synchronous OpenAI client)
    response_stream = client.chat.completions.create(
        model="gpt-oss:120b",
        messages=[{"role": "user", "content": "Tell me a story."}],
        stream=True,
    )
    for chunk in response_stream:
        # Some chunks may carry no choices or no text; skip them
        if chunk.choices and chunk.choices[0].delta.content:
            # Display the text chunk in the UI
            print(chunk.choices[0].delta.content, end="", flush=True)
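When many asynchronous requests are launched at once, it can help to cap how many run at the same time so that the client stays within its rate-limiting quota. The following sketch uses asyncio.Semaphore for this. The helper name and the default limit of 5 are assumptions, and `worker` stands in for any coroutine function that sends one prompt to the API.

```python
import asyncio
from typing import Awaitable, Callable


async def gather_bounded(
    worker: Callable[[str], Awaitable[str]],
    prompts: list[str],
    max_concurrency: int = 5,
) -> list[str]:
    """Run `worker` on every prompt, with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run(prompt: str) -> str:
        # Each task waits here until one of the slots is free.
        async with semaphore:
            return await worker(prompt)

    # gather preserves the order of the input prompts in its result list.
    return await asyncio.gather(*(run(p) for p in prompts))


async def fake_worker(prompt: str) -> str:
    """Stand-in for a real API call, used here so the example runs offline."""
    await asyncio.sleep(0.01)
    return prompt.upper()


if __name__ == "__main__":
    print(asyncio.run(gather_bounded(fake_worker, ["a", "b", "c"], max_concurrency=2)))
```

In a real application, `fake_worker` would be replaced by a coroutine that calls the API, and the concurrency limit would be tuned to the quota of your billing tier.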

Security

Application security is paramount, especially when handling user input.

  1. Input Validation and Sanitization: Never trust user input. Before sending it to the API, sanitize it to remove any potentially malicious code or "prompt injection" instructions. Also limit its size to prevent abuse.

    def sanitize_input(user_input: str) -> str:
        # Simple example: remove code delimiters and limit length.
        # More robust libraries can be used for advanced sanitization.
        cleaned = user_input.replace("`", "").replace("'", "").replace("\"", "")
        return cleaned[:2000]  # Limits the size to 2000 characters
  2. Robust Error Handling: Always wrap your API calls in try...except blocks to handle network errors, API errors (e.g., 429 Rate Limit, 500 Internal Server Error), and provide a degraded but functional user experience.

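    import logging

    from openai import APIError, APIStatusError, APITimeoutError

    logger = logging.getLogger(__name__)

    def ask_llm(client):
        # Wrapper function (hypothetical name) so each handler can return a fallback message
        try:
            response = client.chat.completions.create(...)
            return response.choices[0].message.content
        except APITimeoutError:
            # Handle the case where the request takes too long
            return "The service is taking longer than expected, please try again."
        except APIStatusError as e:
            # Handle HTTP error responses (e.g., 429, 500), which carry a status code
            logger.error(f"LLMaaS API error: {e.status_code} - {e.message}")
            return "Sorry, an error occurred with the AI service."
        except APIError as e:
            # Handle other API errors (e.g., connection failures), which have no status code
            logger.error(f"LLMaaS API error: {e.message}")
            return "Sorry, an error occurred with the AI service."
        except Exception as e:
            # Handle all other errors
            logger.error(f"An unexpected error occurred: {e}")
            return "Sorry, an unexpected error occurred."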
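For transient failures such as a 429 Rate Limit or a 500 Internal Server Error, retrying with exponential backoff is a common complement to the error handling above. The sketch below is library-agnostic and rests on assumptions: `RetryableError` is a hypothetical stand-in for whichever exceptions you decide to treat as transient, and the attempt count and delays are arbitrary defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical stand-in for a transient failure such as HTTP 429 or 500."""


def call_with_retry(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call()` until it succeeds or `max_attempts` is reached."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except RetryableError:
            if attempt == max_attempts:
                raise  # Out of attempts: let the caller handle the failure.
            # Exponential backoff (0.5s, 1s, 2s, ...) plus up to 10% jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))
    raise AssertionError("unreachable")


# Usage: a call that fails twice, then succeeds on the third attempt.
attempts = {"count": 0}


def flaky_call() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RetryableError("simulated 429")
    return "ok"


print(call_with_retry(flaky_call, base_delay=0.01))  # ok
```

The jitter spreads out retries from clients that failed at the same moment. The `sleep` parameter is injectable mainly so the helper can be tested without real waiting.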