LLM Gateway Architecture
Status: v1 foundation implemented; first backend pilot is code-on by default with application-level fallback. Implemented so far: separate FastAPI service skeleton, unauthenticated/health, service-authenticated/ready, service-authenticated/v1/chat/completions, repo-local config schemas/loader, checked-inomi:auto:chat-structuredlane artifacts, internal service-auth helpers, request-level credential context helpers, typed gateway errors, OpenAI chat-completions request validation, route resolution for the v1 auto lane, the non-streaming provider executor, fake providers for deterministic tests, a live OpenAI-compatible provider adapter, local service URL configuration for backend callers, and code-on structured-output caller pilots forchat_extraction.requires_context,conversation_discard.should_discard, andconversation_action_items.extract.shadowwith application-level fallback.
Decision
Build a separate LLM Gateway service in this repo. Backend, pusher, and later desktop-facing relay surfaces call it through an OpenAI-compatible HTTP surface instead of choosing providers directly for auto-routed work. The gateway is intentionally narrow:- Accept explicit model IDs exactly as before, or
omi:auto:*lane IDs. - Resolve an auto lane to a versioned route artifact.
- Validate request capabilities before execution.
- Execute the primary provider/model and compatible fallbacks.
- Return a normal provider-compatible response while recording route/fallback/cost metrics.
/pick endpoint for product clients.
The existing realtime backend/routers/auto_model.py and desktop AutoModelSelector path are legacy context, not the template for this service. New gateway work must not add a public picker endpoint, must not fetch benchmark data in the request path, and must not expand desktop-side routing caches.
First Pilot
The first lane is:/v2/messages, backed by the retrieval and agentic chat graph documented in Chat System Architecture.
chat-structured is for non-streaming structured extraction and classification work, such as memory extraction, chat extraction helpers, conversation post-processing structure, and schema-bound feature decisions. It is the safest first lane because it has clear success checks: JSON/schema validity, parser repair rate, extraction precision/recall, latency, and cost.
The first concrete backend pilot is utils.llm.chat.requires_context(question: str) -> bool. Non-BYOK requests call the gateway first and fall back to the existing provider path if the gateway misses or fails, so this is an infrastructure change with no intended user-visible behavior change.
Non-BYOK requests call /v1/chat/completions first with model: "omi:auto:chat-structured", text-only messages, and a JSON schema response format for RequiresContext. The caller sends low-cardinality metadata identifying chat_extraction.requires_context, not raw user text in logs or telemetry. Service auth uses the existing bearer-token contract; backend callers read OMI_LLM_GATEWAY_SERVICE_TOKEN first and then LLM_GATEWAY_SERVICE_TOKEN for local compatibility.
BYOK requests skip this pilot because gateway BYOK forwarding is not implemented yet. The application-level fallback is the safety mechanism: any gateway HTTP error, timeout, unsupported capability response, malformed JSON, missing or invalid content, Pydantic validation failure, or unexpected exception falls back to the existing get_llm('chat_extraction').with_structured_output(...) path and returns that result. The pilot does not depend on gateway route/LKG fallback; if the gateway internally returns an error after trying its configured policy, the caller still treats that as a miss and uses the existing path.
The backend exposes the Prometheus counter llm_gateway_chat_extraction_requests_total{feature,outcome,reason} for this pilot. Current outcomes are success, fallback, and skipped; reasons are bounded classes such as ok, timeout, request_error, http_503, invalid_json, schema_validation, unexpected_error, and byok.
Primary user chat should follow later as a separate lane, likely:
Why A Separate Service
The current backend keeps model/provider routing inbackend/utils/llm/model_config.py, constructs clients in backend/utils/llm/providers.py, and exposes callers through backend/utils/llm/clients.py. That remains the source of truth for feature routing, but the new auto-route execution brain should be isolated as a service because:
- backend and pusher both run LLM workloads and should share one runtime policy;
- route artifacts need deploy-time validation and runtime fallback independent of product code;
- route observability needs one consistent set of labels;
- future surfaces can call the same OpenAI-compatible gateway without learning provider details;
- rollback must be config-only and not require desktop or mobile releases.
Language And Framework
Use Python 3.11 + FastAPI + Pydantic + httpx. Reasons:- The backend is already Python 3.11 and FastAPI.
- Existing auth, logging, sanitizer, executor, test, and deployment patterns are Python.
- Existing LLM provider knowledge lives in Python modules.
- Pydantic is already the natural schema/validation tool for FastAPI.
httpx.AsyncClientmatches backend async I/O rules and avoids blocking the event loop.
chat-structured lane.
Current service layout:
utils/auto_router package under the main backend and do not wire a public task picker into backend/main.py.
Implemented internal service auth uses Authorization: Bearer <LLM_GATEWAY_SERVICE_TOKEN> plus X-Omi-Service-Caller. The default allowlist is low-cardinality service names backend and pusher; optional X-Omi-User-Uid and X-Omi-Tenant-Id populate request context only. /health remains unauthenticated. /ready and /v1/chat/completions depend on these helpers instead of Firebase user auth.
Credential context is request-level metadata. It records credential mode, caller, and provider-key presence or approved key references without exposing raw key material in model dumps or repr output. BYOK credential failures such as missing key, invalid auth, quota, rate-limit, and unsupported provider are visible errors and are not fallback-eligible by default.
Deployment Shape
For v1, build the gateway as a separate FastAPI app in the backend tree, using the same Python toolchain and dependency-lock workflow as the main backend. Preferred rollout:- Add the app entrypoint and unit tests without any production traffic.
- Add a local run command and Docker/Helm wiring for a separately named service.
- Add
/readystartup/readiness validation that fails closed unless active and LKG route config is valid. - Add service-to-service auth from backend and pusher to the gateway.
- Route one structured feature through the gateway in shadow mode.
- Promote only after the route artifact has an Omi eval report and rollback has been tested.
backend/main.py. That would make the gateway look separate in code while still sharing the main backend process, lifecycle, scaling, and blast radius.
Local development runs the gateway as its own process:
scripts/dev-instance.sh and defaults to PYTHON_PORT + 1000. Override with LLM_GATEWAY_PORT=<port> when needed. Backend callers use repo-local configuration through OMI_LLM_GATEWAY_URL; utils.llm.gateway_client.get_llm_gateway_base_url() defaults to http://127.0.0.1:9080 for the primary checkout and strips trailing slashes from explicit values.
Deployment Configuration
The deployed gateway is a separate internal GKE service, not a public client endpoint. The Helm release name is:- gateway chart:
backend/charts/llm-gateway/ - manual workflow:
.github/workflows/gcp_llm_gateway.yml - backend caller env:
backend/charts/backend-listen/{dev,prod}_omi_backend_listen_values.yaml - shared secret mapping:
backend/charts/backend-secrets/{dev,prod}_omi_backend_secrets_values.yaml
| Env | Value |
|---|---|
OMI_LLM_GATEWAY_URL | Internal service URL above |
OMI_LLM_GATEWAY_SERVICE_TOKEN | *-omi-backend-secrets key OMI_LLM_GATEWAY_SERVICE_TOKEN |
| Env | Value |
|---|---|
OMI_LLM_GATEWAY_PROD | true |
OMI_LLM_GATEWAY_SERVICE_TOKEN | same *-omi-backend-secrets key OMI_LLM_GATEWAY_SERVICE_TOKEN |
LLM_GATEWAY_ALLOWED_CALLERS | backend |
OPENAI_API_KEY | existing *-omi-backend-secrets key OPENAI_API_KEY |
METRICS_SECRET | existing *-omi-backend-secrets key METRICS_SECRET |
OMI_LLM_GATEWAY_SERVICE_TOKEN in both projects:
ClusterIP service. /health is unauthenticated. Kubernetes liveness/startup/readiness probes call /health; the deployment workflow then runs an in-cluster authenticated smoke test against /ready and /v1/chat/completions with Authorization: Bearer ... plus X-Omi-Service-Caller: backend.
Before any deploy, validate the values files:
- Create or verify
OMI_LLM_GATEWAY_SERVICE_TOKENinbased-hardware-devSecret Manager. - Run the values validation command above.
- Run focused gateway unit tests and preflight checks.
- Manually dispatch
gcp_llm_gateway.ymltodevelopment. - Run
/readyand chat-completions smoke checks. - Deploy backend-listen values only after gateway smoke passes, so backend callers do not point at a missing service.
- Create or verify an independent
OMI_LLM_GATEWAY_SERVICE_TOKENinbased-hardwareSecret Manager. - Confirm DEV smoke passed with the same image/commit.
- Run the prod values validation command.
- Manually dispatch
gcp_llm_gateway.ymltoprod. - Run prod
/readyand chat-completions smoke checks from the prod GKE network. - Deploy backend-listen prod values after gateway prod smoke passes.
Major Library Choices
Use:- FastAPI for HTTP endpoints and service lifecycle.
- Pydantic for lane, artifact, request, and validation schemas.
- httpx.AsyncClient for provider HTTP calls and gateway-to-provider calls.
- OpenAI Python SDK only where it materially reduces request/stream parsing risk. Prefer direct
httpxfor the gateway core so we preserve request/response metadata, timeouts, headers, and streaming behavior consistently. - Prometheus-compatible metrics following existing backend observability conventions.
- pytest for deterministic resolver, validator, and executor tests.
- LangChain inside the gateway execution path. Existing product code can keep using LangChain, but the gateway should speak provider HTTP contracts directly so it can preserve OpenAI-compatible request/response shapes and streaming semantics.
- Firestore/Redis as routing config stores. All lane and route artifacts live in repo files for v1.
- LiteLLM, Portkey, Envoy AI Gateway, or Kong as the gateway foundation.
- Runtime benchmark fetching, including Artificial Analysis calls, inside request handling.
Open Source Position
We should learn from existing gateways, but not build on top of them for v1.| Project | Use For | Do Not Use For |
|---|---|---|
| LiteLLM | Provider normalization ideas, error mapping, cost accounting examples, OpenAI-compatible proxy behavior | Canonical gateway foundation, admin dashboard, virtual keys, spend product, broad registry |
| Portkey Gateway | Config-driven fallback/routing patterns and composable fallback examples | User prefs, hosted-control-plane assumptions, its config DSL as our source of truth |
| Envoy AI Gateway | Edge gateway inspiration if Omi later standardizes on Envoy/Kubernetes AI traffic policy | Application route brain or first implementation substrate |
| Kong AI Gateway | Mature edge/platform concepts | Omi route artifacts or feature-to-lane semantics |
model_config.py as the feature route source and promotes route artifacts through Omi evals.
Reference links checked while writing this spec:
API Surface
MVP endpoint:model_config.py. A bare model string is not enough because Omi’s source of truth is (provider, model), not model name alone.
Or an Omi lane:
Config Model
All config is checked into this repo for v1. Implemented config files:backend/llm_gateway/gateway/config_loader.py loads those files by default, validates cross-file references, rejects duplicate route IDs, validates active/LKG compatibility, rejects dev/mock evidence in prod mode, and validates route artifact digests.
Lane config:
route_artifact_id values and exposes a content digest for every artifact. Checked-in artifacts should include artifact_digest; if the artifact content changes without changing the digest, validation fails. Once an artifact ID has shipped, changing its content is treated as invalid operational behavior; create a new artifact ID instead.
Integration With Existing Backend Code
backend/utils/llm/model_config.py remains the feature routing source. Add typed route refs behind it:
get_model(feature)get_provider(feature)get_llm(feature)get_route_options(feature, model, provider)
get_provider() expecting a concrete provider. The initial migration should keep existing tuple behavior for all current features, then add explicit new helpers:
get_route_ref(feature) currently returns an ExplicitRouteRef for every existing feature by default, including pinned routes, profile routes, fallback routes, and provider/model construction options from get_route_options(feature, model, provider). AutoLaneRouteRef is available behind an intentionally empty feature-to-lane mapping in model_config.py so a later ticket can map selected features to omi:auto:chat-structured without changing the legacy helpers.
Backend callers that use get_llm(feature) can be migrated feature by feature. Route refs do not change return values from get_model(feature), get_provider(feature), or get_llm(feature). The first pilot routes only non-BYOK utils.llm.chat.requires_context calls through the gateway first; all other chat_extraction helpers remain on the existing explicit path.
BYOK Policy
BYOK failures fail visibly by default. If a request is using a user-provided provider key and that key is invalid, rate-limited, out of quota, or rejected by the provider, the gateway returns a clear typed error. It must not silently fall back to an Omi-paid provider route unless a route artifact explicitly allows that behavior and the product owner has approved it. The gateway must not reuse current backend BYOK fallback behavior where unsupported BYOK chat clients or failed embedding calls can fall back to Omi-paid credentials. That behavior may remain in legacy callers until migrated, but the gateway contract is stricter. Gateway requests carry aCredentialContext owned by service-authenticated backend/pusher callers. Desktop and mobile clients never call the gateway directly and never send raw BYOK credentials directly to it. The initial implementation must choose one of these internal patterns before live traffic:
- backend forwards a short-lived BYOK credential envelope to the gateway over service-authenticated transport;
- gateway receives a key reference and resolves it through an approved internal secret path.
omi_paid or byok, whether BYOK-to-Omi-paid fallback is allowed, and which failure classes are fallback eligible.
Default behavior:
| Failure | Behavior |
|---|---|
| BYOK invalid key | visible auth error |
| BYOK quota/rate limit | visible provider/key error |
| BYOK unsupported provider for route | capability/config error |
| Omi-paid primary timeout before output | retry/fallback if route policy allows |
| Omi-paid primary schema invalid | repair attempt or compatible fallback if route policy allows |
Service Auth
/health may be unauthenticated.
/ready and /v1/chat/completions require internal service authentication. The first allowed callers are backend and pusher. Requests must propagate enough caller, tenant, user, BYOK, and usage context for accounting and policy enforcement, but must not expose provider keys in logs or metrics.
Desktop, mobile, and third-party product clients must not call the gateway directly in v1.
Request Validation And Route Resolution
Implemented v1 route resolution is intentionally narrow:is_auto_lane_id(model)recognizes only theomi:auto:namespace.omi:auto:chat-structuredis the only supported auto lane.- unknown
omi:auto:*lanes return a typed model-not-found error. - bare provider model names such as
gpt-4o-miniare not direct gateway routes in v1 and return a typed unsupported-model error. - the resolver maps the supported lane to its configured
active_routeandlast_known_goodroute artifact. - runtime route checks defensively reject active/LKG lane, surface, capability, and credential-mode mismatches as invalid config.
model: "omi:auto:chat-structured";- non-empty text
messages; streamabsent or false;- no tool use;
response_format.type: "json_schema";- a
response_format.json_schema.schemaobject.
HTTP Surface
GET /ready is implemented as a service-authenticated readiness check. It loads the same repo-local gateway config used by the route dependency, validates active/LKG artifacts through load_gateway_config(prod_mode=True), and returns lane IDs plus route artifact count. Config validation failures return 503 with a generic message.
POST /v1/chat/completions is implemented as an OpenAI-compatible non-streaming route for internal services:
- requires service auth;
- accepts the same request shape validated by the resolver;
- builds an Omi-managed request credential context for the current v1 lane;
- resolves
model: "omi:auto:chat-structured"to the checked-in active route; - calls the executor and returns the provider response payload with
modelrewritten to the requested lane ID; - forwards supported OpenAI chat-completions controls such as
temperature,max_tokens,max_completion_tokens,seed,top_p, penalties,stop, anduser; - maps typed gateway exceptions to OpenAI-style error envelopes with
error.message,error.type,error.param, anderror.code; - rejects unknown auto lanes, bare provider model names, streaming, tools, unsupported capabilities, and unknown top-level request parameters before provider execution.
conversation_discard.should_discard through omi:auto:chat-structured and falls back to the legacy conv_discard LLM path if the gateway returns invalid output, times out, or fails. Action-item extraction also sends conversation_action_items.extract.shadow through the gateway as shadow traffic, then still uses the legacy conv_action_items result for product behavior. These routes give the rollout real backend-originated gateway traffic without adding latency to the foreground desktop chat path.
The default provider registry includes the OpenAI-compatible adapter for the checked-in openai provider route. It is cached per process, closed during FastAPI lifespan shutdown, uses OPENAI_API_KEY, optional OPENAI_BASE_URL, and optional OPENAI_MAX_RESPONSE_BYTES, posts to /chat/completions with httpx.AsyncClient, and fails closed with invalid_route_config when the managed key is absent or rejected. Tests can still override the registry with fake providers.
Provider Execution
Implemented executor behavior is non-streaming only:- the caller passes a resolved route plus a request-level
CredentialContext; - the executor sends an OpenAI-compatible chat-completions payload to the active route primary provider first;
- the provider-facing payload replaces the lane model with the selected provider model and forces
stream: false; - the caller-facing response payload keeps the normal provider response shape but reports
modelas the requested lane ID, such asomi:auto:chat-structured; - selected route artifact, provider, provider model, fallback reason, and LKG usage are returned only on the executor result metadata, not embedded into the response payload.
- active route fallbacks are attempted only when the route policy allows the normalized failure class;
- LKG is attempted only through
select_lkg_route_for_failure, so active route policy remains the single gate for runtime LKG; - BYOK missing key, auth, quota, rate limit, unsupported provider, capability mismatch, and invalid config failures fail visibly and do not fall back by default;
- Omi-paid timeout before output, provider 429, and provider 5xx may fall back only when the route artifact allows those classes;
- streaming-after-first-token recovery is unsupported for MVP. The executor assumes no partial output exists and does not try to recover or replay partial responses.
OpenAICompatibleChatCompletionProviderfor live non-streaming OpenAI-compatible calls;- in-memory fake providers for unit tests.
Observability
Gateway-side Prometheus metrics are exposed on/metrics with METRICS_SECRET bearer auth. The main gateway request metrics are:
lane_id, route_artifact_id, provider, model, used_lkg, fallback_used, fallback_reason, outcome, and error_class.
The backend caller also exposes llm_gateway_chat_extraction_requests_total{feature,outcome,reason} for app-level success/fallback/skipped behavior.
Do not log raw prompts, transcripts, screenshots, memory contents, provider response bodies, or BYOK keys. Use existing sanitizer patterns for error bodies and user text.
Build Status
- Add this architecture doc and temporary implementation tickets. Done.
- Add
llm_gatewayservice skeleton with health endpoint. Done. - Add Pydantic schemas for lane config, route artifacts, feature bundles, capability declarations, retries, timeouts, rollout, evidence, credential policy, fallback policy, and LKG. Done.
- Add service-auth and credential-policy contracts. Done.
- Add resolver tests with no network calls. Done.
- Add capability validator tests for
chat-structured. Done. - Add provider executor with fake providers and fallback/error normalization tests. Done.
- Add
/v1/chat/completionsnon-streaming support behind service auth. Done. - Add
/ready, local run command, service URL configuration, and separate-process deployment wiring before shadow traffic. Done. - Add route refs and gateway URL helpers in main backend so selected
model_config.pyfeatures can call the gateway later. Done. - Route one low-risk structured feature to
omi:auto:chat-structuredin shadow mode. Done for non-BYOKchat_extraction.requires_context, protected by application-level fallback. - Add gateway-side metrics and authenticated metrics endpoint. Done.
- Add Omi eval reports and promote only after schema validity, extraction quality, latency, and cost gates pass.
Shadow Safety
Shadow mode must be explicitly bounded before any live user content goes through the gateway:- feature-owner/privacy approval for the feature being shadowed;
- sampling limits and a kill switch;
- cost cap;
- no BYOK shadow by default;
- no provider expansion beyond the current production provider class without approval;
- no persistence of raw prompts or raw responses;
- metrics-only comparison unless an approved eval store exists.
Explicit Non-Goals
- No desktop Settings UI.
- No quality/latency/cost sliders.
- No per-user routing preferences.
- No Firestore or Redis routing prefs.
- No desktop-side model/route cache.
- No public
/v1/auto-router/pick. - No new public
/v1/auto/model-pick. - No request-path benchmark fetching.
- No production route from mock benchmark data.
- No benchmark-only promotion.
- No wholesale LiteLLM, Portkey, Envoy AI Gateway, or Kong adoption.
Maintainer Checklist
Before broad production traffic:- explicit model routing remains backward compatible;
model_config.pystill owns feature-to-route mapping;- gateway config validation fails closed on invalid prod config;
- active route has valid LKG;
- route artifacts are immutable;
- BYOK failure does not silently fall back to Omi-paid traffic;
/v1/chat/completionsrequires internal service auth;- LKG/fallback is limited to artifact-approved failure classes;
- mock benchmarks cannot load in prod;
- Omi eval report exists;
- shadow/canary completed;
- rollback is config-only;
- observability includes route, fallback, latency, errors, and cost.