Operating LLMs at scale with AI Matic Model Gateway

Table of contents

A model gateway typically sits between your applications and LLM providers, handling routing, retries, observability, and resilience. As AI applications move into production, maintaining reliability has become increasingly difficult. Rate limits, service outages, API changes, and the use of multiple providers can disrupt workflows and create unnecessary integration overhead.

The AI Matic Model Gateway by GoML addresses these challenges by providing a single interface for connecting OpenAI-compatible applications to approved AI models. Its routing layer manages provider selection and operational functions such as logging, monitoring, retries, failover, circuit breaking, and token tracking through configuration rather than code changes.

This document outlines the architecture of the Model Gateway, its security and monitoring controls and the capabilities it provides to engineering teams. It also answers common questions that arise when evaluating infrastructure layers for large language model applications.

Provider types supported

Routing strategies implemented

Observability backends supported

Interface for all providers

The problem of Model Gateway with direct provider integration

Direct integration with an LLM provider may work for a single application, but growing AI systems often require multiple providers, newer models, and stronger reliability. As integrations expand, teams can end up handling authentication, error management, and monitoring differently across applications, creating inconsistency and operational risk.

The AI Matic Model Gateway addresses these challenges through a shared layer that applies the same controls to every LLM request. It supports retries, failover, monitoring, and provider management without requiring changes to application code.

The gateway does not add new model capabilities. Instead, it provides the operational foundation needed to run LLMs in production, improving visibility, resilience, and maintainability as requirements evolve.

AI Matic Model Gateway architecture

The Gateway is a Python-based client library that provides a single interface for interacting with LLM providers such as Google and OpenAI. Instead of using each provider's SDK directly, developers integrate with the Gateway, which handles provider selection based on predefined routing rules. It also manages retries, circuit breakers, and operational metrics, while returning a consistent response format across all providers

The unified interface

A key feature of the Gateway is its ability to present a single, consistent interface across all supported models. It supports both synchronous and asynchronous chat completions, along with embeddings, token counting, and legacy prompt formats. These older prompt formats are automatically converted into the standard chat format behind the scenes.

The interface is designed to feel familiar to developers who have worked with other LLM SDKs, making application migration simpler. Developers can either select a specific provider or use routing aliases, which keeps application code independent of model and provider choices.

Provider abstraction

At the core of the Gateway are adapters that provide a single interface for applications using multiple LLM providers. They abstract provider-specific behaviour and manage authentication, timeouts, response parsing, streaming, and error handling. As a result, applications receive responses in a consistent format regardless of which provider is being used.

Certain providers require additional processing because support for features such as structured outputs varies across models and API versions. The Bedrock adapter, for example, validates structured responses returned by the provider. When native structured output is unavailable, the adapter generates, parses, and validates a JSON response before sending it back to the application.

This approach removes provider-specific logic from application code, simplifies integrations, and reduces the ongoing maintenance effort for development teams.

Routing

The router uses a structured configuration supplied through an environment variable or configuration file. Teams define aliases, provider groups, and model candidates within that configuration. Applications reference an alias instead of a specific model. The router resolves the alias, evaluates the available candidates using the configured routing rules and selects the provider and model for the request.

Routing strategies

Strategy	Mechanism	Use Case
Weighted hash	Deterministic selection based on a hash of the user identifier and configured weights	Consistent routing for a given user; supports A/B allocation and gradual migration
Round robin	Sequential rotation through candidates in defined order	Even distribution of load across providers with no per-user consistency requirement
Random	Probabilistic selection weighted by the candidates' configured weight values	Simple load distribution where consistency is not required

Deterministic routing is particularly useful in production environments because it assigns the same provider to the same routing context on every request. This consistency makes provider comparisons more reliable and supports controlled traffic shifts between models without introducing random variation. The routing context can be any stable identifier, such as a user ID, team name, or session ID. All three strategies are configurable through the AI Matic platform without any code changes.

Retry and circuit breaker

To improve reliability, the Gateway includes configurable retry and circuit breaker mechanisms. Teams can define how many times a request should be retried, the initial and maximum delay between retries, and whether to apply jitter.

The Gateway supports full jitter, where retry attempts are delayed by a random interval rather than occurring at the same time. This helps spread requests during short service disruptions and prevents a large number of retries from reaching the provider simultaneously.

Retries are only applied to temporary failures that may succeed on a subsequent attempt, such as timeouts, rate limits, and HTTP 5XX errors. Errors caused by invalid credentials or malformed requests are not retried because repeating the request would produce the same result.

The circuit breaker tracks failed requests for each provider. When failures exceed a predefined threshold, the Gateway temporarily stops sending requests to that provider until a cooldown period ends. This reduces time spent waiting on repeated failures and helps avoid unnecessary costs from retrying requests against an unavailable service.

Observability

Through configuration settings that are defined in the environment, observability can be done automatically for every request made to a gateway by the application without needing any additional instrumentation from the application itself. Application deployments must specify one observability backend via configuration, and all of the telemetry being sent from the gateway to the observability back end is done through the same configuration.

Observability backends

Backend	Description	Best For
Open Telemetry	Vendor-neutral spans exported to any compatible backend: Jaeger, Grafana Tempo, Datadog, AWS X-Ray	Standard enterprise observability infrastructure: Jaeger Search and Monitor support included in the provided Docker stack
Langfuse	Trace and generation records written to Langfuse Cloud or a self-hosted Langfuse instance	LLM-specific tracing with prompt visibility, generation timelines, and evaluation tooling
goML Tracer	Self-contained trace storage with a query API and a visual dashboard UI	Private, offline, or self-hosted tracing with no external service dependency

‍What is captured

Requests are recorded in the observability layer of the system as follows:

The model chosen to fulfill the request.

The provider used to fulfill the request.

The routing decision that directed the request.

Token usage for the request.

Request duration is associated.

The response status for this request.

All error detail associated with this request.

The correlation ID, which allows connecting structured logs to traceable data.

When prompt and content logging is enabled, the Gateway removes personally identifiable information (PII) before any data is stored or sent to observability platforms. All supported back-end monitoring solutions process prompts and related content only after this sanitization step.

For testing, the recommended approach is to use a custom interface provider. Teams can also register a mock provider that returns deterministic responses, allowing test cases to run without network calls or API keys. The mock provider exposes request counts and input data, making it possible to use assertions to verify both prompt construction and application logic.

‍

The AI Matic Model Gateway advantage

Most LLM infrastructure tools address a single concern. AI Matic Model Gateway brings provider abstraction, routing, resilience controls, observability, and security into one production layer. Teams can manage LLM calls through one gateway instead of combining separate tools for each function.

The gateway is part of the AI Matic platform and integrates with existing services without requiring major architecture changes.

"The model gateway layer is what separates teams that successfully scale LLMs in production from those that are constantly firefighting provider issues."

—Prashanna Hanumantha Rao, VP of Engineering, GoML

What the bundle delivers

Step	Description
Provider Connectivity	Out-of-the-box connectivity to all supported providers. Credentials are managed securely at the deployment level, isolated from application code.
Unified Interface	A single, consistent interface replaces all provider-specific SDK calls. Application code is fully decoupled from the underlying provider or model in use.
Enterprise Observability	Full observability is active from the first call with no additional instrumentation. Choose from three enterprise-grade tracing backends to match existing infrastructure.
Intelligent Routing	Intelligent routing with deterministic, round-robin, and probabilistic strategies. Aliases decouple model selection from application code entirely, enabling provider migration without code changes.
Extensible Provider Registry	Any model including internal or private deployments can be registered as a provider. Custom providers participate in routing, retry, and observability on equal footing with built-in providers.
Safe, Testable by Design	Built-in support for deterministic mock providers activate thorough testing of AI-integrated services without live API calls, credentials, or network access.

AI Matic Model Gateway brings provider abstraction, routing, resilience controls, security, and observability together in a single platform component designed for production AI workloads. Rather than assembling and maintaining separate tools, teams gain a unified infrastructure layer that is ready to integrate with existing applications. This approach reduces operational overhead, simplifies provider management, and establishes consistent controls across every LLM request.

Questions from engineering teams on model gateway

The following addresses the questions raised most consistently during technical evaluations. They reflect concerns from platform engineers, AI infrastructure leads, and security-focused architects.

On provider and model coverage

We use a model hosted on our own infrastructure. Can the gateway call it?

Yes. The extension interface supports any model that can be called from within the platform. Implement a provider that makes the call and returns a standard response structure. Register it and it is immediately available as a routing candidate with full retry, circuit breaker, and observability support.

We want to test a new model against a subset of traffic without committing to a full migration. How does routing support that?

Configure a group with two candidates: the current model at a high weight and the new model at a low weight. Using weighted hash routing, the allocation is deterministic per user, so each user consistently reaches the same model for the duration of the test. Observability records which provider served each call, making the comparison measurable. Adjusting the weights changes the allocation without changing application code.

On reliability and failover

A common question is whether the gateway can automatically switch from OpenAI to Amazon Bedrock when OpenAI becomes unavailable.

In the current implementation, the gateway retries transient failures on the selected provider. If a provider exceeds the configured failure threshold, the circuit breaker temporarily removes it from routing. When a routing group contains multiple candidates, requests are directed to the remaining available providers. Support for ordered fallback chains, where requests automatically move through a predefined list of providers after a failure, is planned for a future release.

Our provider occasionally returns 429 rate limit responses under load. How does the gateway handle that?

Rate limit is a supported retry on error class. The gateway retries the call with exponential backoff and jitter up to the configured max attempts. Full jitter randomizes the retry delay to prevent synchronized retry storms across multiple service instances. If the provider continues returning 429 responses and the failure count reaches the circuit breaker threshold, the circuit opens and the provider is bypassed for the cooldown period.

On observability and security

We need trace data to stay within our environment. Can we run observability without external services?

Yes. The GoML Tracer backend stores traces locally with no external service dependency. It provides a query API for analysis and a visual dashboard for inspection. The storage location is configurable. This is fully in-environment with no network egress for trace data.

Prompts in our system contain PII. How does the gateway prevent that from appearing in logs?

Turn on the PII redaction feature in your deployment configuration. The redaction layer runs over prompt and completion content before any text is written to structured logs or trace spans. The redaction applies regardless of which observability backend is active. Note that redaction applies to the observability layer only; the prompt reaches the model provider unmodified. For environments where prompt content must not leave the perimeter at all, AWS Bedrock within the client’s own account is the appropriate provider choice.

Multiple teams in our organization use different provider accounts. Can the gateway handle per-team credentials?

The gateway supports per-request credential overrides when deployments require them. For environments that need complete separation of provider credentials between teams or tenants, support for independently managed and injected provider keys is planned for a future release. Today, the recommended approach is to run separate gateway deployments with credentials scoped to each team.

On testing and development

We want unit tests that do not make real API calls. How do we test code that uses the gateway?

Register a mock provider before the test runs. The mock returns deterministic responses, tracks call counts inputs, and requires no credentials or network access. The gateway routes calls to the mock through the same code path as real providers, meaning the test exercises routing, retry logic, and observability instrumentation with no external dependencies. Mocks can be swapped or removed between tests.

Conclusion

A model gateway is no longer optional for teams running LLMs in production. While direct provider integrations may work initially, managing multiple providers, retries, monitoring, credentials, and security quickly becomes difficult at scale.

The AI Matic Model Gateway brings these responsibilities into a single infrastructure layer. It handles routing, provider selection, observability, retries, circuit breaking, and PII redaction while allowing provider and model changes without updating application code.

GoML helps organizations integrate and configure the gateway based on their operational needs and existing environments. As AI adoption grows, this is precisely what a model gateway solves at enterprise scale.

‍

Operating LLMs at scale with AI Matic Model Gateway

Akash Chandrasekar

The problem of Model Gateway with direct provider integration