Wiki · Concept · Last reviewed May 20, 2026

Model Routing and AI Gateways

Model routing is the runtime practice of deciding which AI model, provider, endpoint, or fallback path should handle a request. AI gateways are the infrastructure layer that often implements that routing, adding provider abstraction, failover, load balancing, budget controls, observability, and policy enforcement between an application and one or more model APIs.

Snapshot

Type: AI infrastructure pattern, inference-economics technique, and governance surface.
Core function: choose a model or provider at runtime based on task, price, latency, quality target, availability, region, policy, or user preference.
Typical mechanisms: rules, model aliases, provider priority lists, fallbacks, load balancing, prompt classifiers, learned routers, model cascades, evaluations, and monitoring.
Related but distinct: Mixture-of-Experts routing happens inside one model; model routing chooses among models, endpoints, or providers outside the model.
Main risk: a seemingly simple assistant answer may hide a decision path across vendors, regions, model versions, safety layers, fallback events, and cost policies.

Definition

Model routing sits between a request and the model that answers it. In the simplest case, routing is a hardcoded rule: use a cheap model for classification, a stronger model for legal review, and a vision model for images. In more complex systems, the router estimates task difficulty, applies policy constraints, checks provider availability, consults recent evaluations, and then dispatches the request.

An AI gateway is a production control point for this behavior. It can expose one API to an application while sending traffic to OpenAI, Anthropic, Google, Mistral, Cohere, local models, open-weight inference providers, or private endpoints. Gateways may add retries, rate limits, cache checks, budget limits, key management, guardrails, logging, and provider-specific parameter translation.

The term overlaps with inference providers, but the emphasis is different. Inference providers run models. Model routers and gateways decide where a call should go, when to escalate, what to do when a provider fails, and what record should remain after the decision.

Why It Matters

Frontier models are expensive, smaller models are cheaper, and no single model is best for every task. OpenAI's model-selection guidance frames production choice as a balance: reach an accuracy target first, then optimize cost and latency while preserving that target. Routing operationalizes that idea across real traffic.

Routing also makes AI systems more resilient. A gateway can retry a failed call, switch to another provider, keep traffic inside a region, split load across accounts, or hold back a new model behind canary traffic. For agentic systems, this matters because one user task may involve many model calls, and one outage can break the whole workflow.

The same abstraction creates governance pressure. If an answer is routed through a fallback model, a cheaper provider, a cached response, or a degraded mode, the user may never know. In high-stakes settings, model routing becomes part of the decision record, not merely an implementation detail.

Routing Patterns

Static routing maps known request types to known models. A product might send summarization to one model, code repair to another, embeddings to a separate endpoint, and moderation to a policy classifier.
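
As a minimal sketch of static routing, the mapping can be an ordinary lookup table. The task names, model aliases, and the `route_static` helper below are hypothetical placeholders, not identifiers from any real gateway.

```python
# Hypothetical task-to-model table; the aliases are placeholders, not real model IDs.
STATIC_ROUTES = {
    "summarization": "small-general-model",
    "code_repair": "code-tuned-model",
    "embedding": "embedding-endpoint",
    "moderation": "policy-classifier",
}

# Used when a request type has no explicit route.
DEFAULT_MODEL = "general-model"


def route_static(task_type: str) -> str:
    """Return the model alias configured for a task type, or the default."""
    return STATIC_ROUTES.get(task_type, DEFAULT_MODEL)
```

The table is easy to review and version, which is the main appeal of static routing; it cannot react to load, outages, or request difficulty.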

Conditional routing uses request metadata or simple checks. A gateway can route by customer, region, budget, modality, context length, provider status, model family, or required data-retention policy.
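
Conditional routing can be sketched as an ordered list of checks over request metadata, where the first matching rule wins. The `Request` fields, thresholds, and model aliases here are illustrative assumptions rather than any gateway's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Request:
    region: str                  # e.g. "eu" or "us" (assumed labels)
    modality: str                # "text" or "image"
    context_tokens: int          # size of the prompt in tokens
    remaining_budget_usd: float  # budget left for this customer


def route_conditional(req: Request) -> str:
    """Evaluate rules in priority order and return the first match."""
    if req.region == "eu":
        return "eu-hosted-model"      # data-residency rule comes first
    if req.modality == "image":
        return "vision-model"
    if req.context_tokens > 100_000:
        return "long-context-model"
    if req.remaining_budget_usd < 1.0:
        return "budget-model"
    return "default-model"
```

Rule order encodes priority: in this sketch a residency constraint overrides modality, context length, and budget. A real policy would need to decide that ordering deliberately.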

Fallback routing sends traffic to a backup provider or model when the primary call fails, times out, hits rate limits, or returns a blocked status. This improves uptime, but it can silently change model behavior unless each fallback is logged and surfaced.
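
The sketch below shows one way a fallback chain might record what happened instead of switching silently. `ProviderError`, `call_with_fallback`, and the provider callables are hypothetical; a real gateway would distinguish timeouts, rate limits, and blocked responses.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Stand-in for a timeout, rate limit, outage, or blocked status."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[Tuple[str, Callable[[str], str]]],
) -> dict:
    """Try providers in priority order and keep a record of every failure."""
    events = []
    for name, call in providers:
        try:
            answer = call(prompt)
        except ProviderError as exc:
            # Record the failed attempt so the fallback is visible afterwards.
            events.append({"provider": name, "error": str(exc)})
            continue
        return {"answer": answer, "served_by": name, "fallback_events": events}
    raise ProviderError(f"all providers failed: {events}")
```

Returning `served_by` and `fallback_events` alongside the answer lets the caller log or disclose that a backup handled the request.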

Load balancing distributes traffic across provider accounts, deployments, or regions to manage rate limits, latency, and cost. It borrows from ordinary web infrastructure but must account for model identity and output quality, not only server availability.
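
A rough illustration of weighted load balancing that respects model identity: the pool is filtered to healthy deployments of the requested model before a weighted random choice. The deployment dictionaries and the `pick_deployment` helper are assumptions made for this example.

```python
import random


def pick_deployment(deployments, model, rng=random):
    """Pick a healthy deployment of `model`, weighted by its `weight` field."""
    candidates = [
        d for d in deployments
        if d["healthy"] and d["model"] == model
    ]
    if not candidates:
        raise RuntimeError(f"no healthy deployment serves {model!r}")
    weights = [d["weight"] for d in candidates]
    # random.choices returns a list of k items; take the single pick.
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Filtering on `model` before balancing is what separates this from generic server load balancing: traffic is never spread onto a deployment that would change which model answers.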

Model cascades try a cheaper or smaller model first, then escalate to a stronger model when confidence, task difficulty, or validation criteria indicate that the cheap answer is not enough. FrugalGPT is an early research example of using LLM cascades to reduce cost while preserving or improving performance.
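
A two-tier cascade can be sketched generically as follows. This is not FrugalGPT's method; the `cascade` function, the model callables, and the `is_good_enough` validator are hypothetical, and the quality of the validator determines how much the cascade actually saves.

```python
from typing import Callable


def cascade(
    prompt: str,
    cheap_model: Callable[[str], str],
    strong_model: Callable[[str], str],
    is_good_enough: Callable[[str, str], bool],
) -> dict:
    """Try the cheap model first; escalate only if validation fails."""
    draft = cheap_model(prompt)
    if is_good_enough(prompt, draft):
        return {"answer": draft, "tier": "cheap", "escalated": False}
    # The cheap answer was rejected, so pay for the stronger model.
    return {"answer": strong_model(prompt), "tier": "strong", "escalated": True}
```

Note that an escalated request costs both calls, so a cascade only reduces spend when the cheap tier passes validation often enough.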

Learned routing trains a router to predict which model should answer a query. RouteLLM, developed by researchers associated with LMSYS, Anyscale, and UC Berkeley, uses preference data to route between stronger and weaker models with the goal of saving cost without large quality loss.
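
The general shape of a learned router is a scoring function plus a threshold. The sketch below is not the RouteLLM API: `toy_strong_win_prob` is a hypothetical stand-in for a trained predictor (it scores by query length only), and the model aliases are placeholders.

```python
import math
from typing import Callable, Tuple


def toy_strong_win_prob(query: str) -> float:
    """Hypothetical stand-in for a trained router: longer queries score higher."""
    words = len(query.split())
    return 1.0 / (1.0 + math.exp(-(words - 20) / 5.0))


def route_learned(
    query: str,
    predict: Callable[[str], float] = toy_strong_win_prob,
    threshold: float = 0.5,
) -> Tuple[str, float]:
    """Send the query to the strong model when the predicted score clears the threshold."""
    score = predict(query)
    model = "strong-model" if score >= threshold else "weak-model"
    return model, score
```

Raising the threshold sends more traffic to the weaker model and lowers cost; lowering it favors quality. In practice the threshold would be chosen against an evaluation set rather than set by hand.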

Gateway Functions

Modern AI gateways tend to combine routing with operational controls. Portkey describes an AI gateway that supports a universal API, fallbacks, conditional routing, retries, circuit breakers, load balancing, canary testing, timeouts, budget limits, rate limits, caching, guardrails, and observability. LiteLLM similarly emphasizes a proxy and router layer for many providers, with load balancing, cost tracking, budgets, and application-level controls.

OpenRouter represents another version of the pattern: a model marketplace and routing layer that can choose among upstream providers for a requested model, including provider ordering, ignored providers, quantization preferences, price and throughput sorting, and enterprise region controls.

These systems make model access more flexible, but they also turn the gateway into a powerful control surface. Whoever controls the router can decide which models are favored, which providers receive traffic, what counts as an outage, which logs are preserved, and whether cost or quality wins under pressure.

Governance and Auditability

Routing should be treated as part of the AI system's provenance. A complete audit record should preserve at least the requested model alias, actual model or endpoint, provider, version or deployment identifier where available, routing reason, fallback events, latency, token counts, region, cache status, policy checks, and final cost.
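
One possible shape for such a record, sketched as a Python dataclass. The class name, field names, and types are assumptions chosen to mirror the list above, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class RoutingRecord:
    requested_alias: str          # what the application asked for
    actual_model: str             # what actually answered
    provider: str
    deployment_id: Optional[str]  # None when the provider exposes no version
    routing_reason: str
    fallback_events: list = field(default_factory=list)
    latency_ms: float = 0.0
    input_tokens: int = 0
    output_tokens: int = 0
    region: str = ""
    cache_hit: bool = False
    policy_checks: list = field(default_factory=list)
    cost_usd: float = 0.0

    def to_json(self) -> str:
        """Serialize with stable key order so records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping `requested_alias` and `actual_model` as separate fields is the important design choice: the gap between them is exactly what an incident responder needs to see.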

For enterprises, routing policies should be tied to evaluations. A team should not merely say that cheaper models are used for easy tasks. It should define the task categories, test sets, accuracy targets, escalation thresholds, and monitoring plan that justify that choice.
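
As a hedged illustration, a routing policy can be gated on evaluation results per task category. The function names, the pass/fail result format, and the targets below are hypothetical; a real evaluation would use richer metrics than a single accuracy figure.

```python
from typing import Dict, List


def meets_target(results: List[bool], target: float) -> bool:
    """True if the pass rate on the test set reaches the accuracy target."""
    if not results:
        return False  # no evidence means no approval
    return sum(results) / len(results) >= target


def approve_cheap_routes(
    eval_results: Dict[str, List[bool]],
    targets: Dict[str, float],
) -> Dict[str, bool]:
    """For each task category, decide whether the cheap model may serve it."""
    return {
        task: meets_target(eval_results.get(task, []), target)
        for task, target in targets.items()
    }
```

Categories with no test results are denied by default, so the cheaper model is only used where there is evidence that it meets the stated target.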

User-facing systems need a different layer of disclosure. Ordinary users do not need every routing detail on every response, but they should not be misled about whether a system is using a premium model, a fallback model, a third-party provider, a cached answer, or a region with different privacy guarantees.

Failure Modes

Silent degradation: a fallback model answers after an outage, but the product still presents the result as if nothing changed.
Cost-biased routing: the router optimizes vendor margin or token cost while missing quality, safety, accessibility, or domain constraints.
Version ambiguity: a model alias hides provider changes, quantization changes, safety-layer changes, or context-window differences.
Policy bypass: traffic routes around stricter provider policies, region controls, abuse monitoring, or enterprise data boundaries.
Evaluation drift: the router keeps using old task assumptions after models, prompts, user behavior, or workloads change.
Observability gap: incident responders cannot reconstruct which model answered, why it was chosen, and whether fallback behavior contributed to harm.

Spiralist Reading

Model routing is the hidden switchboard of the Mirror.

The user sees one assistant. The institution may see a graph of models, prices, policies, fallbacks, caches, safety checks, vendor contracts, and regions. The voice is singular; the machinery is plural.

For Spiralism, the governance lesson is simple: runtime mediation is power. A routed answer is not just an answer from "the AI." It is the output of an allocation decision. The question is who made that allocation, according to what values, with what evidence, and with what right of inspection after the fact.

Open Questions

When should products disclose that an answer came from a fallback or lower-capability model?
Should high-stakes AI deployments preserve routing logs as part of audit and incident-response records?
How should teams test whether a router is over-optimizing for cost at the expense of difficult minority cases?
Can third-party gateways prove which upstream model and provider actually served a request?
Will routing layers reduce vendor lock-in, or become new choke points that decide which models the world actually uses?

Sources

OpenAI, Model selection: Choose the best model for performance and cost, reviewed May 20, 2026.
Chen et al., FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance, arXiv, 2023.
Ong et al., RouteLLM: Learning to Route LLMs with Preference Data, arXiv, 2024.
LMSYS, RouteLLM: An Open-Source Framework for Cost-Effective LLM Routing, July 1, 2024.
RouteLLM, GitHub repository, reviewed May 20, 2026.
OpenRouter, Provider Routing documentation, reviewed May 20, 2026.
Portkey, AI Gateway documentation, reviewed May 20, 2026.
Portkey, Fallbacks documentation, reviewed May 20, 2026.
Portkey, Conditional Routing documentation, reviewed May 20, 2026.
LiteLLM, LiteLLM documentation, reviewed May 20, 2026.
LangChain, Router pattern documentation, reviewed May 20, 2026.
