Back

Operating LLMs at scale with AI Matic Model Gateway

Akash Chandrasekar

June 8, 2026
Table of contents

Reliability becomes a major concern as organizations deploy AI applications in production. Rate limits, provider outages and API changes can disrupt critical workflows, while supporting multiple providers often leads to duplicate integration work.

The AI Matic Model Gateway provides a combined interface for OpenAI compatible applications. It connects applications to approved models through a configurable routing layer that handles logging, monitoring, retries, failover, circuit breaking, token tracking with provider selection. Teams can also add model providers or update routing rules through configuration rather than code changes.

The paper outlines the architecture of the gateway, the current features, the security and monitoring controls, the planned enhancements with the questions that the engineering teams are asking most frequently when evaluating a large language model (LLM) infrastructure layer.

4 

Provider types supported  

3 

Routing strategies implemented 

3 

Observability backends supported 

1 

Interface for all providers 

1. The problem of Model Gateway with direct provider integration

Direct integration with an LLM provider can work well for a single application and a stable use case. As systems grow, teams face a different reality. They work with multiple providers, changing model versions also higher reliability expectations. What starts as a simple integration gradually becomes harder to maintain.

Different teams will implement authentication methods, error-handling techniques and monitoring differently on separate applications. Rate limits and outages from third-party providers can interrupt services when retry or fallback mechanisms are unavailable. Inconsistent monitoring practices, coupled with weak controls for credentials and sensitive information, increase operational risk.

Many teams address these challenges independently, leading to different approaches across systems. AI Matic Model Gateway centralizes these controls within a shared layer, applying them consistently to every LLM request. This reduces maintenance effort and helps create a more stable operating environment.

The gateway does not add new model features. It supplies the operational foundation needed for production use, making LLM integrations easier to observe, more resistant to service disruptions, and less demanding to maintain as providers, models, and application needs change.

2. AI Matic Model Gateway architecture

The gateway is a Python library that provides a single interface for interacting with LLM providers. Applications call the gateway instead of provider-specific SDKs. The gateway selects the appropriate provider, applies routing rules, enforces retry and circuit breaker policies, captures operational metrics, and returns a standardized response. The diagram below illustrates the current architecture.

2.1 The unified interface

The consistent set of operations available across supported models is exposed through the gateway. This applies to chat completions (both synchronous and asynchronous) that utilize similar parameters as well as embedding, counting tokens, and legacy prompt formatting, converted behind the scenes to be compliant with the standard chat format.  

The interface is similar to common LLM SDK designs, allowing for easy migration of existing applications. Application developers can choose their provider directly or via routing aliases, effectively decoupling application code from the choice of model and provider.

2.2 Provider abstraction

The gateway isolates provider-specific behaviur by creating an abstraction layer using different adapters to provide a unified public interface for all supported providers. Furthermore, by managing authentication, timeouts, parsing of responses, streaming of responses, and handling of errors in an adapter layer, applications can expect to receive a standardized format regardless of which underlying provider model is being used. Some providers require additional handling due primarily to the disparate nature in which models or APIs provide support for response formatting(s).  

An example of this is that the Bedrock adapter validates the structured output for the response stream returned from the provider this is required because different models and/or API versions provide varying levels of support for generating structured output. When the adapter detects there is no native structured output for the model API, it generates, parses and validates a JSON object before returning a response to the application. This isolates any provider specific logic from the applications and reduces the ongoing cost of maintenance for development teams.

3. Routing

The router uses a structured configuration supplied through an environment variable or configuration file. Teams define aliases, provider groups, and model candidates within that configuration. Applications reference an alias instead of a specific model. The router resolves the alias, evaluates the available candidates using the configured routing rules and selects the provider and model for the request.

3.1 Routing strategies

Strategy 

Mechanism 

Use Case 

Weighted hash 

Deterministic selection based on a hash of the user identifier and configured weights 

Consistent routing for a given user; supports A/B allocation and gradual migration 

Round robin 

Sequential rotation through candidates in defined order 

Even distribution of load across providers with no per-user consistency requirement 

Random 

Probabilistic selection weighted by the candidates' configured weight values 

Simple load distribution where consistency is not required 

Deterministic routing is particularly useful in production environments because it assigns the same provider to the same routing context on every request. This consistency makes provider comparisons more reliable and supports controlled traffic shifts between models without introducing random variation. The routing context can be any stable identifier, such as a user ID, team name, or session ID.

3.2 Retry and circuit breaker

One retry policy is used by the gateway when making a call. Each team can specify attempts limit, base delay, maximum delay, and whether to use a jitter type. Jitter settings include full jitter which causes the gateway to select random time delays between 0 and the calculated back off value (based on the backoff time) so that many calls are not made at once when there is a short connection loss to the provider.  

The retry policy is applicable only for errors that can be resolved by trying again, including timeouts, rate limiting, and HTTP 5xx errors. Failures to authenticate or those that result from malformed requests will not retry on a future attempt as they will produce the same error again. The Circuit Breaker tracks repeated failures for each provider.  

Once these failings exceed a certain threshold, the provider will not receive any additional calls from the Gateway until the configured Cooldown period has ended. This prevents the application from having excessive delays or re-trying costs associated when there is an outage with a specific provider.

4. Observability

Through configuration settings that are defined in the environment, observability can be done automatically for every request made to a gateway by the application without needing any additional instrumentation from the application itself. Application deployments must specify one observability backend via configuration, and all of the telemetry being sent from the gateway to the observability back end is done through the same configuration.

4.1 Observability backends

Backend 

Description 

Best For 

Open Telemetry 

Vendor-neutral spans exported to any compatible backend: Jaeger, Grafana Tempo, Datadog, AWS X-Ray 

Standard enterprise observability infrastructure: Jaeger Search and Monitor support included in the provided Docker stack 

Langfuse 

Trace and generation records written to Langfuse Cloud or a self-hosted Langfuse instance 

LLM-specific tracing with prompt visibility, generation timelines, and evaluation tooling 

goML Tracer 

Self-contained trace storage with a query API and a visual dashboard UI 

Private, offline, or self-hosted tracing with no external service dependency 

4.2 What is captured

Requests are recorded in the observability layer of the system as follows:

  • The model chosen to fulfill the request.  
  • The provider used to fulfill the request.
  • The routing decision that directed the request.  
  • Token usage for the request.  
  • Request duration is associated.
  • The response status for this request.
  • All error detail associated with this request.

The correlation ID, which allows connecting structured logs to traceable data.  

The prompt and completion content captured will also be logged or traced, depending on your configuration. If configured for PII redaction, the gateway will process prompt and completion items prior to being logged. This processing of prompt and completion records is done for all back-end observability solutions supported.

The custom provider interface is also the recommended pattern for testing. A mock provider that returns deterministic responses can be registered also used in test suites without network calls or API keys. The call count and inputs to the mock are inspectable, which supports assertion-based testing of prompt construction and response handling logic.

5. The AI Matic Model Gateway advantage

Most LLM infrastructure tools address a single concern. AI Matic Model Gateway brings provider abstraction, routing, resilience controls, observability, and security into one production layer. Teams can manage LLM calls through one gateway instead of combining separate tools for each function.

The gateway is part of the AI Matic platform and integrates with existing services without requiring major architecture changes.

5.1 What the bundle delivers

Step 

Description 

Provider Connectivity 

Out-of-the-box connectivity to all supported providers. Credentials are managed securely at the deployment level, isolated from application code. 

Unified Interface 

A single, consistent interface replaces all provider-specific SDK calls. Application code is fully decoupled from the underlying provider or model in use. 

Enterprise Observability 

Full observability is active from the first call with no additional instrumentation. Choose from three enterprise-grade tracing backends to match existing infrastructure. 

Intelligent Routing 

Intelligent routing with deterministic, round-robin, and probabilistic strategies. Aliases decouple model selection from application code entirely, enabling provider migration without code changes. 

Extensible Provider Registry 

Any model including internal or private deployments can be registered as a provider. Custom providers participate in routing, retry, and observability on equal footing with built-in providers. 

Safe, Testable by Design 

Built-in support for deterministic mock providers activate thorough testing of AI-integrated services without live API calls, credentials, or network access. 

AI Matic Model Gateway brings provider abstraction, routing, resilience controls, security, and observability together in a single platform component designed for production AI workloads. Rather than assembling and maintaining separate tools, teams gain a unified infrastructure layer that is ready to integrate with existing applications. This approach reduces operational overhead, simplifies provider management, and establishes consistent controls across every LLM request.

6. Questions from engineering teams on model gateway

The following addresses the questions raised most consistently during technical evaluations. They reflect concerns from platform engineers, AI infrastructure leads, and security-focused architects.

On provider and model coverage

We use a model hosted on our own infrastructure. Can the gateway call it?

Yes. The extension interface supports any model that can be called from within the platform. Implement a provider that makes the call and returns a standard response structure. Register it and it is immediately available as a routing candidate with full retry, circuit breaker, and observability support.

We want to test a new model against a subset of traffic without committing to a full migration. How does routing support that?

Configure a group with two candidates: the current model at a high weight and the new model at a low weight. Using weighted hash routing, the allocation is deterministic per user, so each user consistently reaches the same model for the duration of the test. Observability records which provider served each call, making the comparison measurable. Adjusting the weights changes the allocation without changing application code.

On reliability and failover

A common question is whether the gateway can automatically switch from OpenAI to Amazon Bedrock when OpenAI becomes unavailable.

In the current implementation, the gateway retries transient failures on the selected provider. If a provider exceeds the configured failure threshold, the circuit breaker temporarily removes it from routing. When a routing group contains multiple candidates, requests are directed to the remaining available providers. Support for ordered fallback chains, where requests automatically move through a predefined list of providers after a failure, is planned for a future release.

Our provider occasionally returns 429 rate limit responses under load. How does the gateway handle that?

Rate limit is a supported retry on error class. The gateway retries the call with exponential backoff and jitter up to the configured max attempts. Full jitter randomises the retry delay to prevent synchronized retry storms across multiple service instances. If the provider continues returning 429 responses and the failure count reaches the circuit breaker threshold, the circuit opens and the provider is bypassed for the cooldown period.

On observability and security

We need trace data to stay within our environment. Can we run observability without external services?

Yes. The GoML Tracer backend stores traces locally with no external service dependency. It provides a query API for analysis and a visual dashboard for inspection. The storage location is configurable. This is fully in-environment with no network egress for trace data.

Prompts in our system contain PII. How does the gateway prevent that from appearing in logs?

Turn on the PII redaction feature in your deployment configuration. The redaction layer runs over prompt and completion content before any text is written to structured logs or trace spans. The redaction applies regardless of which observability backend is active. Note that redaction applies to the observability layer only; the prompt reaches the model provider unmodified. For environments where prompt content must not leave the perimeter at all, AWS Bedrock within the client’s own account is the appropriate provider choice.

Multiple teams in our organization use different provider accounts. Can the gateway handle per-team credentials?

The gateway supports per-request credential overrides when deployments require them. For environments that need complete separation of provider credentials between teams or tenants, support for independently managed and injected provider keys is planned for a future release. Today, the recommended approach is to run separate gateway deployments with credentials scoped to each team.

On testing and development

We want unit tests that do not make real API calls. How do we test code that uses the gateway?

Register a mock provider before the test runs. The mock returns deterministic responses, tracks call counts inputs, and requires no credentials or network access. The gateway routes calls to the mock through the same code path as real providers, meaning the test exercises routing, retry logic, and observability instrumentation with no external dependencies. Mocks can be swapped or removed between tests.

Conclusion

Connecting directly to LLM providers may seem straightforward in the early stages, but maintaining those integrations across providers and business units introduces recurring operational challenges. Error handling, retries, monitoring, credential administration, and protection of sensitive information often evolve differently across systems, increasing maintenance demands over time.

AI Matic Model Gateway brings these responsibilities together within a shared infrastructure layer. Applications interact with a single interface while the gateway handles provider selection, routing, retries, circuit breaking, observability, PII redaction, and credential management. Provider choices, model updates, and routing policies can change without requiring updates to application integrations.

Today, the gateway includes support for four provider types, multiple routing methods, configurable retry and circuit breaker policies, three observability backends with PII redaction, and an extension framework for custom providers. Planned additions focus on broader security, administrative controls, and operational functions for larger deployments.

GoML partners with engineering organizations to integrate and configure the gateway within existing environments. Engagements start with a review of provider usage patterns and operational requirements, followed by a configuration approach tailored to the environment and objectives of the organization.