The local model service must expose an internal OpenAI-compatible API that Aperium calls through the dedicated local provider. The provider must already be implemented and verified before your deployment begins.
Documentation Index
Fetch the complete documentation index at: https://docs.aperium.apps.hillspire.com/llms.txt
Use this file to discover all available pages before exploring further.
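If you are scripting discovery rather than browsing, a minimal fetch of that index might look like the following sketch; it uses only the Python standard library and implies no Aperium tooling.

```python
# Sketch: download the documentation index and print its entries so the
# available pages can be discovered programmatically.
import urllib.request

INDEX_URL = "https://docs.aperium.apps.hillspire.com/llms.txt"

with urllib.request.urlopen(INDEX_URL, timeout=10) as resp:
    print(resp.read().decode("utf-8"))
```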
Model-serving requirements
- The model is pinned by exact artifact version, image digest, and serving configuration.
- The serving endpoint is internal-only.
- Readiness must mean the model is loaded and able to answer a small chat request (a probe sketch follows this list).
- Health checks must fail when the model is unloaded, the GPU is unavailable, or the inference runtime cannot allocate the required memory.
- The model must support reliable structured tool/function calls with the tool schemas that Aperium sends for every enabled MCP connector (a smoke-test sketch follows this list).
- Context length must be sufficient for the system prompt, conversation context, and the active MCP tool schema set.
- GPU memory, batching, concurrency, and max-token settings must be sized from load tests, not defaults.
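Both the readiness requirement and the tool-calling requirement can be exercised with small scripts against the OpenAI-compatible endpoint. The sketches below assume only that endpoint convention; the base URL, model name, timeouts, and tool schema are placeholders, not Aperium-defined values. First, a readiness probe that counts the model as ready only when a small chat request comes back with a choice:

```python
# readiness_probe.py - sketch of a readiness check that verifies the model
# is loaded and can answer a small chat request. Base URL, model name, and
# timeout are illustrative placeholders, not Aperium-defined values.
import json
import sys
import urllib.request

BASE_URL = "http://llm.internal:8000/v1"  # placeholder internal-only endpoint
MODEL = "local-model"                     # placeholder pinned model name

def ready() -> bool:
    payload = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 4,
    }).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            body = json.load(resp)
        # Loaded and answering: a 2xx response with at least one choice.
        return bool(body.get("choices"))
    except Exception:
        # An unloaded model, missing GPU, or failed memory allocation should
        # surface here as a connection error, timeout, or non-2xx status.
        return False

if __name__ == "__main__":
    sys.exit(0 if ready() else 1)
```

For the tool-calling requirement, a smoke test can send one representative tool definition and assert that the model returns a structured tool call whose arguments parse as JSON. The get_weather tool here is a hypothetical stand-in, not an actual Aperium MCP connector schema.

```python
# toolcall_smoketest.py - sketch: verify the model emits a structured tool
# call for a simple, unambiguous prompt. The tool schema is a hypothetical
# stand-in for the schemas Aperium sends per enabled MCP connector.
import json
import urllib.request

BASE_URL = "http://llm.internal:8000/v1"  # placeholder internal-only endpoint
MODEL = "local-model"                     # placeholder pinned model name

payload = json.dumps({
    "model": MODEL,
    "messages": [{"role": "user", "content": "What is the weather in Oslo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}).encode()

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    message = json.load(resp)["choices"][0]["message"]

# A passing run yields a tool_calls entry whose arguments parse as JSON.
calls = message.get("tool_calls") or []
assert calls, "model did not produce a structured tool call"
args = json.loads(calls[0]["function"]["arguments"])
assert "city" in args, f"unexpected arguments: {args}"
print("tool-call smoke test passed:", calls[0]["function"]["name"], args)
```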
Required env profile
Use the dedicated local OpenAI-compatible provider with an internal base_url. These values map to the LLM providers section of the env reference:
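A sketch of such a profile is below. Only ENABLE_LLM_FALLBACK is named in this guide; every other key and value is an illustrative stand-in, so copy the exact names from the env reference rather than from this example.

```env
# Illustrative profile: variable names other than ENABLE_LLM_FALLBACK are
# placeholders; copy the exact keys from the env reference.
# Hypothetical key: selects the dedicated local OpenAI-compatible provider.
LLM_PROVIDER=local-openai-compatible
# Internal-only base_url; must not resolve outside your network boundary.
LLM_BASE_URL=http://llm.internal:8000/v1
# Hypothetical key: the pinned model artifact.
LLM_MODEL=local-model
# Documented in this guide: cloud fallback stays off on-prem.
ENABLE_LLM_FALLBACK=false
```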
Why fallback is off
The on-prem deployment shape exists to keep inference inside your network boundary. Re-enabling cloud fallback silently violates that contract. Treat any change to ENABLE_LLM_FALLBACK as a security review item: a release that flips it on must be accompanied by an explicit, documented exception from your security model.
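One way to make that review boundary mechanical, as a sketch: a release-gate check that fails the pipeline whenever the flag drifts from false. The env-file path and the CI wiring are assumptions to adapt to your release process.

```python
# check_fallback_off.py - sketch of a release-gate check that fails if
# ENABLE_LLM_FALLBACK is anything other than "false". The env-file path is
# an assumption; point it at your deployment's profile.
import sys
from pathlib import Path

ENV_FILE = Path(".env.production")  # placeholder path

def fallback_disabled(env_text: str) -> bool:
    for line in env_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line.startswith("ENABLE_LLM_FALLBACK="):
            return line.split("=", 1)[1].strip().lower() == "false"
    return False  # a missing flag is treated as a failure, not a pass

if __name__ == "__main__":
    if not fallback_disabled(ENV_FILE.read_text()):
        print("ENABLE_LLM_FALLBACK must be false; see security review policy.")
        sys.exit(1)
```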