Skip to main content

Concepts and Architecture of LLMaaS

Overview

The LLMaaS (Large Language Models as a Service) offering from Cloud Temple provides secure and sovereign access to the most advanced artificial intelligence models, with the SecNumCloud certification from ANSSI.

🏗️ Technical Architecture

Cloud Infrastructure Temple

Technical Architecture of LLMaaS Cloud Temple

Main Components

1. API Gateway LLMaaS

  • OpenAI Compatible: Seamless integration with existing ecosystem
  • Rate Limiting: Quota management by billing tier
  • Load Balancing: Intelligent distribution across 12 GPU machines
  • Monitoring: Real-time metrics and alerting

2. Authentication Service

  • Secure API Tokens: Automatic rotation
  • Access Control: Granular permissions per model
  • Audit Trails: Full access traceability

🤖 Models and Tokens

Model Catalog

Complete catalog: Model List

Token Management

Token Types

  • Input tokens: Your prompt and context
  • Output tokens: Response generated by the model
  • System tokens: Metadata and instructions

Cost Calculation

Total cost = (Input tokens × 1.9€/M) + (Output tokens × 8€/M) + (Reasoning output tokens × 8€/M)

Optimization

  • Context window: Reuse conversations to save costs
  • Appropriate models: Choose size based on complexity
  • Max tokens: Limit response length

Tokenization

# Token Estimation Example
def estimate_tokens(text: str) -> int:
"""Approximate estimation: 1 token ≈ 4 characters"""
return len(text) // 4

prompt = "Explain photosynthesis"
response_max = 200 # maximum desired tokens

estimated_input = estimate_tokens(prompt) # ~6 tokens
total_cost = (estimated_input * 1.9 + response_max * 8) / 1_000_000
print(f"Estimated cost: {total_cost:.6f}€")

🔒 Security and Compliance

SecNumCloud Qualification

The LLMaaS service is hosted on a technical infrastructure that holds the SecNumCloud 3.2 qualification from ANSSI, ensuring:

Data Protection

  • End-to-end Encryption: TLS 1.3 for all communications
  • Secure Storage: Data encrypted at rest (AES-256)
  • Isolation: Dedicated environments per tenant

Digital Sovereignty

  • Hosting in France: Cloud Temple datacenters certified
  • French law: Native GDPR compliance
  • No exposure: No data transfers to foreign clouds

Audit and Traceability

  • Complete logs: All interactions tracked
  • Retention: Stored according to legal policies
  • Compliance: Audit reports available

Security Controls

Security Controls LLMaaS

Prompt Security

Prompt analysis is a native and integrated security feature of the LLMaaS platform. Enabled by default, it aims to detect and prevent attempts at "jailbreaking" or injecting malicious prompts before they even reach the model. This protection is based on a multi-layered approach.

Contact support for deactivation

It is possible to disable this security analysis for very specific use cases, although this is not recommended. For any questions regarding this or to request deactivation, please contact Cloud Temple support.

1. Structural Analysis (check_structure)

  • Malformed JSON detection: The system checks whether the prompt starts with a { and attempts to parse it as JSON. If parsing succeeds and the JSON contains suspicious keywords (e.g., "system", "bypass"), or if parsing fails unexpectedly, this may indicate an injection attempt.
  • Unicode normalization: The prompt is normalized using unicodedata.normalize('NFKC', prompt). If the original prompt differs from its normalized version, this may indicate the use of deceptive Unicode characters (homoglyphs) to bypass filters. For example, "аdmin" (Cyrillic) instead of "admin" (Latin).

2. Suspicious Pattern Detection (check_patterns)

  • The system uses regular expressions (regex) to identify known attack patterns in prompts, across multiple languages (French, English, Chinese, Japanese).
  • Examples of detected patterns:
    • System Commands: Keywords such as "ignore the instructions", "ignore instructions", "忽略指令", "指示を無視".
    • HTML Injection: Hidden or malicious HTML tags, for example <div hidden>, <hidden div>.
    • Markdown Injection: Malicious Markdown links, for example [text](javascript:...), [text](data:...).
    • Repeated Sequences: Excessive repetition of words or phrases such as "forget forget forget", "oublie oublie oublie".
    • Special/Mixed Characters: Use of unusual Unicode characters or mixing scripts to obfuscate commands (e.g., "s\u0443stème").

3. Behavioral Analysis (check_behavior)

  • The load balancer maintains a history of recent prompts.
  • Fragmentation Detection: It combines recent prompts to check whether an attack is fragmented across multiple requests. For example, if "ignore" is sent in one prompt and "instructions" in the next, the system can detect them together.
  • Repetition Detection: It identifies if the same prompt is repeated excessively. The current threshold for repetition detection is 30 consecutive identical prompts.

This multi-layered approach enables detection of a wide range of prompt attacks—from simple to highly sophisticated—by combining static content analysis with dynamic behavioral analysis.

📈 Performance and Scalability

Real-Time Monitoring

Access via Cloud Temple Console:

  • Model usage metrics
  • Latency and throughput graphs
  • Performance threshold alerts
  • Request history

🌐 Integration and Ecosystem

OpenAI Compatibility

The LLMaaS service is compatible with the OpenAI API:

# Transparent migration
from openai import OpenAI

# Before (OpenAI)
client_openai = OpenAI(api_key="sk-...")

# After (Cloud Temple LLMaaS)
client_ct = OpenAI(
api_key="your-cloud-temple-token",
base_url="https://api.ai.cloud-temple.com/v1"
)

# Same code!
response = client_ct.chat.completions.create(
model="granite3.3:8b", # Cloud Temple model
messages=[{"role": "user", "content": "Hello"}]
)

Supported Ecosystem

AI Frameworks

  • LangChain : Native integration
  • Haystack : Document pipelines
  • Semantic Kernel : Microsoft orchestration
  • AutoGen : Conversational agents

Development Tools

  • Jupyter : Interactive notebooks
  • Streamlit : Rapid web applications
  • Gradio : AI user interfaces
  • FastAPI : Backend APIs

No-Code Platforms

  • Zapier : Automations
  • Make : Visual integrations
  • Bubble : Web applications

🔄 Model Lifecycle

Model Updates

LLMaaS Model Lifecycle

Versioning Policy

  • Stable Models: Fixed versions available for 6 months
  • Experimental Models: Beta versions for early adopters
  • Deprecation: 3-month notice before removal
  • Migration: Professional services available to support your transitions

Projected Lifecycle Planning

The table below outlines the projected lifecycle of our models. The generative AI ecosystem is evolving rapidly, which explains why lifecycle durations may appear short. Our goal is to provide you with the most performant models available at any given time.

That said, we are committed to preserving models that are most widely used by our clients over time. For critical use cases requiring long-term stability, extended support phases are possible. Please contact support to discuss your specific requirements.

This planning is provided for informational purposes only and is reviewed at the beginning of each quarter.

  • DMP (Date of Production Launch): The date when the model becomes available in production.
  • DSP (Date of Support End): The projected date from which the model will no longer be maintained. A 3-month notice period is observed before any actual deprecation.
ModelPublisherPhaseDMPDSPLTSRecommended Migration
devstral:24bMistral AI & All Hands AIProduction13/06/202530/03/2026Nodevstral-small-2:24b
granite3.1-moe:2bIBMProduction13/06/202530/03/2026Nogranite4-tiny-h:7b
qwen3-coder:30bQwen TeamProduction02/08/202530/03/2026Noqwen-coder-next:80b
qwen3:30b-a3bQwen TeamProduction30/08/202530/03/2026Noqwen3-next:80b
cogito:32bDeep CogitoProduction13/06/202530/06/2026Nogpt-oss:120b
gemma3:27bGoogleProduction13/06/202530/06/2026No
glm-4.7-flash:30bZhipu AIProduction22/01/202630/06/2026No
medgemma:27bGoogleProduction02/12/202530/06/2026No
ministral-3:14bMistral AIProduction30/12/202530/06/2026No
ministral-3:3bMistral AIProduction30/12/202530/06/2026No
ministral-3:8bMistral AIProduction30/12/202530/06/2026No
nemotron3-nano:30bNVIDIAProduction04/01/202630/06/2026No
olmo-3:32bAllenAIProduction30/12/202530/06/2026No
olmo-3:7bAllenAIProduction30/12/202530/06/2026No
qwen3-omni:30bQwen TeamProduction05/01/202630/06/2026No
qwen3-vl:235bQwen TeamProduction04/01/202630/06/2026No
qwen3-vl:2bQwen TeamProduction30/12/202530/06/2026No
qwen3-vl:32bQwen TeamProduction30/12/202530/06/2026No
qwen3-vl:8bQwen TeamProduction05/01/202630/06/2026No
rnj-1:8bEssential AIProduction30/12/202530/06/2026No
devstral-small-2:24bMistral AI & All Hands AIProduction02/02/202630/09/2026No
gpt-oss:20bOpenAIProduction08/08/202530/09/2026No
granite4-small-h:32bIBMProduction03/10/202530/09/2026No
granite4-tiny-h:7bIBMProduction03/10/202530/09/2026No
mistral-small3.2:24bMistral AIProduction23/06/202530/09/2026No
deepseek-ocrDeepSeek AIProduction22/11/202530/12/2026No
functiongemma:270mGoogleProduction30/12/202530/12/2026No
granite3.2-vision:2bIBMProduction13/06/202530/12/2026No
qwen-coder-next:80bQwen TeamProduction04/02/202630/12/2026No
qwen3-next:80bQwen TeamProduction02/02/202630/12/2026No
qwen3-vl:30bQwen TeamProduction30/12/202530/12/2026No
qwen3-vl:4bQwen TeamProduction30/12/202530/12/2026No
qwen3:0.6bQwen TeamProduction13/06/202530/12/2026No
translategemma:12bGoogleProduction22/01/202630/12/2026No
translategemma:27bGoogleProduction22/01/202630/12/2026No
translategemma:4bGoogleProduction22/01/202630/12/2026No
bge-m3:567mBAAIProduction18/10/202530/12/2027Yes
embeddinggemma:300mGoogleProduction10/09/202530/12/2027Yes
gpt-oss:120bOpenAIProduction11/11/202530/12/2027Yes
granite-embedding:278mIBMProduction13/06/202530/12/2027Yes
llama3.3:70bMetaProduction13/06/202530/12/2027Yes
qwen3-2507-gptq:235bQwen TeamProduction04/01/202630/12/2027Yes
qwen3-2507-think:4bQwen TeamProduction31/08/202530/12/2027Yes

Legend

  • Phase: Model lifecycle stage (Evaluation, Production, Deprecated)
  • DMP: Date of Production Deployment
  • DSP: Forecasted Deletion Date
  • LTS: Long Term Support. LTS models offer guaranteed stability and extended support, ideal for critical applications.
  • Recommended Migration: Model recommended to replace a deprecated model.

To track the lifecycle status in real time, visit: LLMaaS Status - Lifecycle

Deprecated Models

The world of LLMs is evolving rapidly. To ensure our customers have access to the most advanced technologies, we regularly deprecate models that no longer meet current standards or are no longer in use. The models listed below are no longer available on the public platform. However, they can be reactivated for specific projects upon request.

ModelPhaseDeprecation Date
deepseek-r1:14bDeprecated30/12/2025
deepseek-r1:32bDeprecated30/12/2025
gemma3:1bDeprecated30/12/2025
gemma3:4bDeprecated30/12/2025
qwen3:0.6bDeprecated30/12/2025
qwen3:1.7bDeprecated30/12/2025
qwen3:14bDeprecated30/12/2025
qwen3:30b-a3bDeprecated30/12/2025
qwen3:4bDeprecated30/12/2025
qwen3:8bDeprecated30/12/2025
qwen3:32bDeprecated30/12/2025
qwq:32bDeprecated30/12/2025
granite3.3:2bDeprecated30/12/2025
granite3.3:8bDeprecated30/12/2025
mistral-small3.1:24bDeprecated30/12/2025
qwen2.5vl:32bDeprecated30/12/2025
qwen2.5vl:3bDeprecated30/12/2025
qwen2.5vl:72bDeprecated30/12/2025
qwen2.5vl:7bDeprecated30/12/2025
cogito:8bDeprecated30/12/2025
deepcoder:14bDeprecated30/12/2025
cogito:3bDeprecated30/12/2025
qwen3:235bDeprecated22/11/2025
qwen3-2507-think:30b-a3bDeprecated14/11/2025
gemma3:12bDeprecated21/11/2025
cogito:14bDeprecated17/10/2025
deepseek-r1:70bDeprecated17/10/2025
granite3.1-moe:3bDeprecated17/10/2025
llama3.1:8bDeprecated17/10/2025
phi4-reasoning:14bDeprecated17/10/2025
qwen2.5:0.5bDeprecated17/10/2025
qwen2.5:1.5bDeprecated17/10/2025
qwen2.5:14bDeprecated17/10/2025
qwen2.5:32bDeprecated17/10/2025
qwen2.5:3bDeprecated17/10/2025
deepseek-r1:671bDeprecated17/10/2025

💡 Best Practices

To get the most out of the LLMaaS API, it is essential to adopt strategies for optimizing costs, performance, and security.

Cost Optimization

Mastering costs relies on intelligent use of tokens and models.

  1. Model Selection: Don't use an overly powerful model for simple tasks. Larger models are more capable, but they are also slower and consume significantly more energy, directly impacting cost. Match the model size to the complexity of your task for optimal balance.

    For example, processing one million tokens:

    • Gemma 3 1B consumes 0.15 kWh.
    • Llama 3.3 70B consumes 11.75 kWh, which is 78 times more.
    # For sentiment classification, a compact model is sufficient and cost-effective.
    if task == "sentiment_analysis":
    model = "granite3.3:2b"
    # For complex legal analysis, a larger model is required.
    elif task == "legal_analysis":
    model = "deepseek-r1:70b"
  2. Context Management: The conversation history (messages) is sent back with every call, consuming input tokens. For long conversations, consider strategies like summarization or windowing to retain only relevant information.

    # For long conversations, summarize the initial exchanges.
    messages = [
    {"role": "system", "content": "You are an AI assistant."},
    {"role": "user", "content": "Summary of the first 10 exchanges..."},
    {"role": "assistant", "content": "OK, I have the context."},
    {"role": "user", "content": "Here is my new question."}
    ]
  3. Output Token Limitation: Always use the max_tokens parameter to prevent excessively long and costly responses. Set a reasonable limit based on your expected output.

    # Request a summary of up to 100 words.
    response = client.chat.completions.create(
    model="granite3.3:8b",
    messages=[{"role": "user", "content": "Summarize this document..."}],
    max_tokens=150, # Safety margin for ~100 words
    )

Performance

The responsiveness of your application depends on how you manage API calls.

  1. Asynchronous Requests: To handle multiple requests without waiting for each one to complete, use asynchronous calls. This is especially useful for backend applications processing a large volume of simultaneous requests.

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(api_key="...", base_url="...")

    async def process_prompt(prompt: str):
    # Process a single request asynchronously
    response = await client.chat.completions.create(model="granite3.3:8b", messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

    async def batch_requests(prompts: list):
    # Launch multiple tasks in parallel and wait for their completion
    tasks = [process_prompt(p) for p in prompts]
    return await asyncio.gather(*tasks)
  2. Streaming for User Experience (UX): For user interfaces (chatbots, assistants), streaming is essential. It enables displaying the model's response word by word, creating an impression of immediate responsiveness instead of waiting for the full response.

    # Display the response in real time in a user interface
    response_stream = client.chat.completions.create(
    model="granite3.3:8b",
    messages=[{"role": "user", "content": "Tell me a story."}],
    stream=True
    )
    for chunk in response_stream:
    if chunk.choices[0].delta.content:
    # Display the text chunk in the UI
    print(chunk.choices[0].delta.content, end="", flush=True)

Security

The security of your application is critical, especially when handling user inputs.

  1. Input Validation and Sanitization: Never trust user inputs. Before sending them to the API, sanitize them to remove any potentially malicious code or "prompt injection" instructions. Also, limit their size to prevent abuse.

    def sanitize_input(user_input: str) -> str:
    # Simple example: remove code delimiters and limit length.
    # More robust libraries can be used for advanced sanitization.
    cleaned = user_input.replace("`", "").replace("'", "").replace("\"", "")
    return cleaned[:2000] # Limit length to 2000 characters
  2. Robust Error Handling: Always wrap your API calls in try...except blocks to handle network errors, API errors (e.g., 429 Rate Limit, 500 Internal Server Error), and provide a degraded but functional user experience.

    from openai import APIError, APITimeoutError

    try:
    response = client.chat.completions.create(...)
    except APITimeoutError:
    # Handle case where the request takes too long
    return "The service is taking longer than expected, please try again."
    except APIError as e:
    # Handle specific API errors
    logger.error(f"LLMaaS API Error: {e.status_code} - {e.message}")
    return "Sorry, an error occurred with the AI service."
    except Exception as e:
    # Handle all other errors (network, etc.)
    logger.error(f"An unexpected error occurred: {e}")
    return "Sorry, an unexpected error occurred."