Published May 21, 2026

What Is an LLM Gateway? Routing, Cost Control, and the Data Access Gap

May 21, 2026

An LLM gateway is an infrastructure layer that sits between your application and multiple model APIs — OpenAI, Anthropic, Google Gemini, Mistral, and others. Teams use gateways to manage model routing, automatic failover, observability, latency, and AI spend across providers.

But an LLM gateway does not solve the harder problem behind most production AI features: how the model gets authorized access to current customer data across CRM, ATS, ticketing, HRIS, accounting, and file storage integrations.

This post breaks down what a gateway actually does, where it fits in a complete AI stack, and what the rest of that stack needs to look like if you're building something that operates on real customer data.

What an LLM Gateway Actually Does

At its core, an LLM gateway is a proxy layer between your application and one or more model APIs. Instead of calling OpenAI directly, you call the gateway. The gateway handles the rest.

Here's what the major players — LiteLLM, Portkey, Helicone, Merge Gateway — typically include:

Multi-provider routing. You get an OpenAI-compatible API that works across many different models. Switching from GPT-4o to Claude Sonnet is a config change, not a code rewrite.

Intelligent failover. If a model API degrades or hits a rate limit, the gateway routes the request to your next preferred option automatically. Your application stays live.

Cost controls. You can set budget limits by team, project, or feature. Some gateways offer semantic caching, which reuses responses for similar requests to reduce token spend. A few add context compression to shrink inputs before they hit the model.

Observability. Every request is logged with routing decisions, latency, cost, and response metadata. This is the piece most teams retrofit after their first production incident, and a good gateway handles it from day one.

These are real problems, and gateways solve them well. If your AI spend is out of control, if a model API outage has knocked your product offline, or if your team has no visibility into what your models are actually doing in production — an LLM gateway is the right fix.

It's worth being direct about something: gateway vendors aren't trying to solve the data problem. They're solving a specific networking and infrastructure problem, and they do it well. Some gateways are also starting to move up-stack — adding prompt management, evals, and lightweight orchestration features — so the boundaries between the routing layer and the orchestration layer are blurring at the edges. But the core data access problem remains outside the scope of what any current gateway is designed to solve. The issue isn't that gateways overpromise. The issue is that teams sometimes mistake solving the gateway problem for having solved the AI infrastructure problem. Those are different things.

The Problem a Gateway Was Never Designed to Touch

A gateway optimizes the pipe. It doesn't change what flows through it.

Think about what most AI features actually need to do something useful. A sales copilot needs to know what deals are currently open, who the contacts are, and what happened on the last call. A recruiting assistant needs to see which candidates are in which stage, what their resumes say, and what the hiring manager noted in the scorecard. A customer support agent needs the full ticket history, the customer's account status, and what SLA tier they're on.

None of that information lives in the model. It lives in your customers' CRM integrations, their ATS integrations, their helpdesk APIs, their HRIS — whichever tools they run their business on.

A gateway can route that prompt to the cheapest model that meets your latency target, retry it if a model API fails, log the response for your compliance team, and keep you within a weekly budget. What it cannot do is fetch the Salesforce opportunity record, the Greenhouse candidate profile, or the Zendesk ticket history, normalize that data, and put the right slice into the context window before the model responds.

Most LLM gateways are not designed to solve SaaS integration access. They assume your application already has the authorization, normalization, and data access layer required to retrieve customer records before the model runs. That assumption is load-bearing — and it's where most AI product complexity actually lives.

The Two Data Problems (And Why They're Different)

When engineers start thinking seriously about getting data into their models, they usually run into two distinct problems that are easy to conflate.

The first is unstructured internal data: PDFs, internal wikis, Slack threads, call transcripts, codebases, support email archives. This data doesn't live in a SaaS integration with an API — it lives in documents and conversations. Getting it into a model requires a RAG (retrieval-augmented generation) pipeline: chunk the content, generate vector embeddings, store them in a vector database, and retrieve the most relevant pieces at inference time based on semantic similarity. This is a well-established pattern with a rich tooling ecosystem around it.

The second is structured operational data: the live state of your customers' business integrations. Deals in a CRM. Candidates in an ATS. Employees in an HRIS. Open tickets in a helpdesk. This data is structured, it changes constantly, and it needs to be current at the moment the model generates a response. A RAG pipeline built on stale exports doesn't solve this — you need real-time reads directly from source APIs, and often real-time writes back to them when the agent takes action.

These are different problems with different solutions. If your primary data challenge is unstructured internal content, a RAG pipeline is the right starting point. If it's structured operational data from your customers' SaaS integrations, you need an integration access layer — one that handles authorization, schema normalization, and direct API execution against source systems at request time.

This post focuses on the second problem, because it's the one most directly relevant to teams building B2B AI features on top of their customers' tool stacks.

Solving the Structured Data Problem

There are two common approaches teams take when they realize their AI features need live operational data from customer integrations.

The first is to build it themselves. You write custom integration logic to pull data from your customers' tools, normalize it into a schema your models can consume, keep it up to date, and wire it into your prompt construction. For one or two integrations, this is manageable. For anything broader, it becomes a maintenance burden that compounds — every new customer brings new tool combinations, every model API change breaks something, and the integration layer starts consuming engineering capacity that was supposed to go toward the product.

A second option — one that works well when your feature targets one or two systems deeply — is to build direct integrations for exactly those systems and skip the abstraction layer entirely. This trades breadth for control, and for vertical AI products focused on a specific integration like Salesforce or Zendesk, it's often the right call.

For teams that need to cover many integrations across a diverse customer base, the third approach is to use a unified integration layer that does this for you.

A real-time unified API handles OAuth, token refresh, schema normalization, and live reads and writes across hundreds of SaaS integrations. Instead of building a Salesforce integration and a HubSpot integration and a Greenhouse integration separately, you build once against a normalized schema. A "contact" follows a common schema whether it comes from Salesforce, HubSpot, Zoho, or Pipedrive. A "candidate" follows a common schema whether it comes from Greenhouse, Lever, or Workday Recruiting.

This approach has real tradeoffs worth naming. You're trading custom integration logic for vendor dependency. Your normalized schema is someone else's design decisions. If a customer uses an integration not in the catalog, you're back to building custom logic anyway. And there's a cost premium for the abstraction layer.

When that integration layer is also wired into your AI stack — through a GenAI API or an MCP server — you get something a gateway alone can never give you: a model that can reason over live customer data and take action in connected integrations as part of the same request.

What This Looks Like in Practice

Here's a concrete example. Say you're building a sales copilot for a B2B SaaS product. Your customers use half a dozen different CRM integrations — Salesforce, HubSpot, Pipedrive, and others. They want the copilot to review their open pipeline, suggest next steps on stalled deals, and log activity notes back into the CRM automatically.

With a pure LLM gateway, your stack looks like this:

Your application → Gateway → Model API

The gateway routes the request, manages cost, handles failover. But your application still has to figure out how to pull the right deals from each CRM integration, normalize that data into something the model can use, and write the notes back after the model responds. You're building and maintaining that custom integration logic yourself, for every CRM your customers use.

With a unified integration layer and MCP support, your agent can call authorized integration actions across your customers' CRM integrations directly. Unified handles authorization, schema normalization, and direct API execution against the source integration. The model can retrieve current deal records, reason over them, draft next steps, and write updates back through the same authorized connection — without your team building or maintaining a Salesforce integration or a HubSpot integration separately.

Note: the flow above is simplified. In practice, your agent runtime orchestrates when to call integration actions, what context to pass to the model, and how to handle the model's responses. The integration layer handles authorization and execution; your application logic handles the orchestration.

The gateway still has a role here — routing that model call efficiently, managing cost, handling failover. But it's the third layer of the stack, not the first.

Where the Market Currently Stands

The LLM gateway market is crowded: LiteLLM, Portkey, Helicone, Braintrust, Kong AI Gateway, OpenRouter, and Merge Gateway all offer routing, cost controls, observability, and reliability as table stakes — but several also compete on governance, security, prompt management, and higher-level AI control-plane features. None of them are designed to be a unified SaaS integration access layer — their scope is the model side of the stack, not normalized access to your customers' business integrations.

The unified API market is a separate category: Merge, Unified, Apideck, Nango, Paragon, and others. Merge has built three separate products — Merge Unified for data access, Agent Handler for MCP-style tool execution, and Merge Gateway for LLM routing — that together cover similar ground when used in combination. But those are separate products with separate implementation, evaluation, and pricing paths.

Some managed model platforms are also starting to add routing and cost optimization features natively — Amazon Bedrock's Intelligent Prompt Routing and Azure AI Foundry's model router are early signals of where first-party platforms are headed. What's still relatively uncommon is a single platform that treats real-time integration data access, MCP tool execution, and model routing as one cohesive product rather than three separate purchasing decisions.

Where This Gets Complicated in Practice

Real-time API reads at inference time sound ideal on paper. In production they introduce tradeoffs worth understanding before you architect around them.

The first is latency. Fetching live data from multiple SaaS APIs at request time means your agent's response time depends on the slowest integration in the chain. A multi-system fanout — pull deals from Salesforce, candidates from Greenhouse, tickets from Zendesk — can add meaningful latency that compounds with model inference time.

The second is cascading failure risk. If one integration is slow or unavailable, it can degrade the entire agent response. Writes introduce additional complexity around retries, idempotency, and partial failure states.

The third is cost. Integration API calls have their own rate limits and sometimes their own per-call costs. In high-volume production systems, the cost of integration calls can rival or exceed the cost of model inference itself.

In practice, most production teams end up with hybrid approaches: precomputed or cached state for frequently accessed data, selective real-time fetches for data that must be current, and write queues with retry logic for actions. Pure real-time reads on every request is the right starting point for development; it's rarely what ships at scale unchanged.

None of this invalidates the architecture — it just means the data access layer is harder to operate than it looks in a demo. Understanding these tradeoffs upfront is part of evaluating any integration platform seriously.

The Right Order of Operations

If you're building an AI feature that needs to operate on real customer data, the infrastructure decision order matters.

Start with the data question. What kind of data does your feature need — structured operational data from customer integrations, unstructured internal content, or both? The answer determines which infrastructure you're actually shopping for. If it's unstructured content, start with your RAG pipeline and vector storage strategy. If it's structured SaaS data, start with your integration access layer — ideally one that gives you authorized, real-time reads and safe writes across the integrations your customers actually use.

Then think about the tool execution layer. How will your agent actually invoke reads and writes at request time? An MCP server that maps your integrations to callable, authorized actions is the emerging standard. It matters whether that layer is tightly coupled to your integration catalog or assembled separately.

The routing and cost control layer — the LLM gateway — comes third. Once you know what data you're working with and how your agents will access it, optimizing the model routing layer is relatively straightforward.

Gateways are easy to evaluate and easy to sell because cost control and reliability are legible problems with clear metrics. The data access problem is harder to scope, harder to price, and harder to demo — but it's the one that determines whether your AI feature is actually useful.

What to Look For in a Complete AI Infrastructure Stack

If you're evaluating tools for a production AI feature, here are the questions worth asking beyond "which gateway has the best routing logic":

What kind of data does your feature actually need? Structured operational data from customer integrations and unstructured internal content require different infrastructure. Know which problem you're solving before you start evaluating tools.

How does your AI feature get authorized access to live customer data? If the answer is "we're building that ourselves," understand what that means at scale — custom integration logic per customer integration, maintained indefinitely as APIs change.

Does your integration layer support real-time reads and writes directly from source APIs, or is it a batch sync layer? Batch sync layers introduce data freshness problems that matter significantly in AI contexts. A model reasoning over yesterday's CRM data is a meaningfully different thing than a model reasoning over the current state of the pipeline.

Does your MCP or tool execution layer cover writes, not just reads? An AI agent that can read a ticket but can't update it, log a note, or close it is a read-only reporting tool. The write surface is what makes an agent useful.

The Bottom Line

An LLM gateway is a necessary part of a production AI stack. The major players do it well. Get one. Use it.

But don't mistake it for the hard part. The hard part is getting the right data — current, structured, and actionable — into the context window before the model responds. For teams building B2B AI features on top of their customers' integration stacks, that means solving the data access layer first. For teams building on their own internal content, it means solving the RAG pipeline first.

Either way, the gateway comes after.

Unified gives AI product teams the data access layer that gateways don't provide: authorized connections, unified schemas, direct API execution against source integrations, MCP support, and 450+ integrations across CRM, ATS, HRIS, accounting, ticketing, storage, and more. If your AI product needs to read and write customer data across multiple integrations, start with the docs or book a demo.

All articles