Published June 10, 2026

What Is a RAG Pipeline?


A RAG pipeline is the full sequence of infrastructure that takes data from a source system, processes it for retrieval, and delivers relevant context to a language model at query time. It is the implementation of retrieval-augmented generation — the concrete system, not the concept.

The stages are: ingest → chunk → embed → store → retrieve → generate.

Most discussions of RAG focus on the retrieval and generation ends: which embedding model, which vector database, how to rerank results, how to evaluate response quality. That focus is understandable, but it is not where most production RAG pipelines break.

The stage that breaks first, and most quietly, is ingestion.

The six stages of a RAG pipeline

Ingest

The pipeline begins by connecting to data sources and pulling content into the processing flow. In B2B SaaS environments, those sources typically include file storage APIs (Google Drive, SharePoint, Dropbox), knowledge management platforms (Confluence, Notion), CRM records, ticketing threads, ATS candidate profiles, and accounting objects.

Each source exposes data differently. Some support native webhooks that push change events in real time. Others require polling against timestamps to detect what changed. Some return full content on list endpoints; others return metadata only and require a separate fetch per object to retrieve body text.

Ingestion is not a one-time operation. It requires a first full load (backfill) followed by a continuous stream of incremental updates as source data changes. A pipeline that handles the initial load but not ongoing sync will degrade silently — the index grows stale while the application continues answering queries with outdated context.
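The backfill-then-incremental pattern can be sketched with a timestamp cursor. This is a minimal illustration, assuming a source that exposes an `updated_at` field on every record. `SOURCE`, `fetch_updated_since`, and `sync` are hypothetical names invented for the example, not part of any real API.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a source API.
SOURCE = [
    {"id": "doc-1", "updated_at": datetime(2026, 1, 1, tzinfo=timezone.utc), "body": "v1"},
    {"id": "doc-2", "updated_at": datetime(2026, 1, 5, tzinfo=timezone.utc), "body": "v1"},
]

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def fetch_updated_since(cursor):
    """Return records changed strictly after the cursor (assumed source behavior)."""
    return [r for r in SOURCE if r["updated_at"] > cursor]


def sync(cursor=EPOCH):
    """One sync pass: a full backfill when cursor is EPOCH, incremental otherwise."""
    changed = fetch_updated_since(cursor)
    # Advance the cursor to the newest timestamp seen; keep it if nothing changed.
    new_cursor = max((r["updated_at"] for r in changed), default=cursor)
    return changed, new_cursor


# First pass loads everything; the second pass returns only what changed since.
backfill, cursor = sync()
SOURCE[0] = {**SOURCE[0], "updated_at": datetime(2026, 1, 9, tzinfo=timezone.utc), "body": "v2"}
incremental, cursor = sync(cursor)
```

A strict greater-than comparison can miss records that share the cursor's exact timestamp, so a real implementation would likely need an overlap window plus deduplication. How much overlap is needed depends on the source's timestamp resolution.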

Chunk

Raw content from source APIs is rarely in a form that embeds and retrieves well. A 50-page contract, a 200-message Slack thread, or a 10,000-word Confluence article needs to be divided into retrievable units before embedding.

Chunking decisions affect retrieval quality significantly. Chunks that are too large return more context than the model needs, diluting relevance. Chunks that are too small lose the surrounding context that makes a passage meaningful. Splitting at fixed character counts rather than semantic boundaries produces chunks that cut through arguments, split table rows, or break code blocks mid-statement.

Each chunk should carry stable identifying attributes: the source object's ID, the integration it came from, the object type, and the timestamp of the last update. These attributes enable targeted re-embedding when a source record changes and tenant-scoped filtering at retrieval time.
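As a rough sketch of boundary-aware chunking with attached metadata, the function below packs whole paragraphs into chunks instead of cutting at a fixed character count. The `Chunk` fields mirror the attributes listed above. The function name, the `max_chars` budget, and the blank-line paragraph delimiter are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    object_id: str    # ID of the source object
    integration: str  # which integration the object came from
    object_type: str
    updated_at: str   # last-update timestamp of the source object
    position: int     # order of the chunk within the object


def chunk_by_paragraph(text, max_chars, **meta):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole rather than cut.
    """
    pieces, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            pieces.append(current)  # close the current chunk at a paragraph boundary
            current = para
        else:
            current = candidate
    if current:
        pieces.append(current)
    return [Chunk(text=t, position=i, **meta) for i, t in enumerate(pieces)]


doc = "Alpha paragraph.\n\nBeta paragraph.\n\nGamma."
chunks = chunk_by_paragraph(
    doc, 35,
    object_id="page-42", integration="confluence",
    object_type="page", updated_at="2026-06-01T00:00:00Z",
)
```

Paragraph breaks are only one possible semantic boundary. Tables, code blocks, and message threads would each need their own splitting rule.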

Embed

Chunks are converted to vector representations using an embedding model. The resulting vectors encode semantic meaning in a form the vector store can search by similarity.

Embedding model choice affects retrieval quality for domain-specific content — general-purpose models may not cluster technical or industry-specific terminology well. Switching embedding models after an index is built requires re-embedding the entire corpus, since new vectors are not comparable to old ones.
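One way to make a model switch visible before it degrades retrieval is to record the embedding model name next to every vector and check it at query time. The sketch below assumes each stored row carries an `embedding_model` field. The model names and function names are hypothetical.

```python
def assert_same_model(index_model, query_model):
    """Refuse to compare vectors produced by different embedding models."""
    if index_model != query_model:
        raise RuntimeError(
            f"index built with {index_model!r}, query embedded with {query_model!r}; "
            "re-embed the corpus before querying"
        )


def chunks_needing_reembed(rows, current_model):
    """List chunk IDs whose vectors were produced by a different model."""
    return [r["chunk_id"] for r in rows if r.get("embedding_model") != current_model]


rows = [
    {"chunk_id": "a", "embedding_model": "embed-v1"},
    {"chunk_id": "b", "embedding_model": "embed-v2"},
    {"chunk_id": "c"},  # no model recorded, so it is treated as needing re-embedding
]
pending = chunks_needing_reembed(rows, "embed-v2")
```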

Store

Embeddings are stored in a vector database alongside the chunk metadata. The index schema determines what retrieval filters are available at query time: by tenant, by object type, by source integration, by recency.

A chunk stored without a tenant identifier cannot be reliably filtered to prevent cross-customer data exposure. A chunk stored without an object ID cannot be targeted for replacement when the source record updates.
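The point about missing identifiers can be enforced at write time. The following is an illustrative in-memory index, not a real vector database client, that rejects chunks lacking required metadata and always filters by tenant. `InMemoryIndex`, `REQUIRED_FIELDS`, and the field names are assumptions for the sketch.

```python
REQUIRED_FIELDS = ("tenant_id", "object_id", "object_type", "integration")


class InMemoryIndex:
    """Toy stand-in for a vector store that validates metadata on write."""

    def __init__(self):
        self.rows = []

    def add(self, vector, metadata):
        # Fail loudly at ingestion instead of storing an unfilterable chunk.
        missing = [f for f in REQUIRED_FIELDS if not metadata.get(f)]
        if missing:
            raise ValueError(f"chunk rejected, missing metadata: {missing}")
        self.rows.append({"vector": vector, **metadata})

    def filter_rows(self, tenant_id, **equals):
        """Return rows for one tenant, optionally narrowed by other metadata fields."""
        return [
            r for r in self.rows
            if r["tenant_id"] == tenant_id
            and all(r.get(k) == v for k, v in equals.items())
        ]


index = InMemoryIndex()
index.add([0.1, 0.2], {"tenant_id": "t1", "object_id": "o1",
                       "object_type": "ticket", "integration": "zendesk"})
index.add([0.3, 0.4], {"tenant_id": "t2", "object_id": "o2",
                       "object_type": "page", "integration": "notion"})
tickets_for_t1 = index.filter_rows("t1", object_type="ticket")
```

Making the tenant a required positional argument of the filter, rather than an optional one, is a deliberate choice here: a caller cannot forget to scope a query.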

Retrieve

At query time, the user's query is embedded and compared against stored vectors. The top matching chunks are selected and assembled into the context passed to the model.
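The retrieval step can be illustrated with a brute-force cosine similarity search over a small list of rows. This is a sketch only: production vector stores use approximate nearest-neighbor indexes, and the row shape and function names here are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k=3):
    """Return the k rows whose vectors are most similar to the query vector."""
    ranked = sorted(rows, key=lambda r: cosine(query_vec, r["vector"]), reverse=True)
    return ranked[:k]


def build_context(chunks):
    """Join the selected chunk texts into the context string passed to the model."""
    return "\n\n".join(c["text"] for c in chunks)


rows = [
    {"id": "a", "vector": [1.0, 0.0], "text": "Refund policy: 30 days."},
    {"id": "b", "vector": [0.0, 1.0], "text": "Office hours: 9 to 5."},
    {"id": "c", "vector": [0.9, 0.1], "text": "Refunds need a receipt."},
]
hits = top_k([1.0, 0.0], rows, k=2)
context = build_context(hits)
```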

Retrieval quality depends directly on index quality. Stale chunks, duplicated content from multiple ingestion runs, and chunks with lost permission metadata all degrade what the retrieval step can return — regardless of how well the similarity search itself is implemented.

Generate

The language model receives the retrieved chunks as context and generates a response grounded in that content. At this stage, the pipeline's earlier decisions surface as either clean context or accumulated errors.

A model given stale context produces a response that sounds correct but reflects state that no longer exists in the source system. A model given context from an unauthorized document produces a response that should never have been generated. These failures originate in ingestion and indexing — they are invisible by the time generation runs.

Where RAG pipelines break in production

Post-mortems on production RAG systems point to a consistent pattern: most failures are attributed to the model or to retrieval quality, but trace back to ingestion and change detection.

Silent connector failures. An integration to Confluence, Salesforce, or a ticketing API breaks after a schema change or an OAuth token expires. The ingestion job reports success but returns zero records. The index stops updating. The pipeline continues answering queries from increasingly stale context — with no error surfaced to the application layer.
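A simple guard against this failure mode is to alert on how long it has been since a connector last delivered any records, rather than on job status alone. The sketch below assumes a run log with `connector`, `finished_at`, `status`, and `records` fields. The log shape, the function name, and the 24-hour threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def quiet_connectors(runs, now, max_quiet=timedelta(hours=24)):
    """Return connectors whose last non-empty successful run is older than max_quiet.

    A run that reports success with zero records does not count as a delivery.
    """
    last_delivery = {}
    for run in runs:
        name = run["connector"]
        last_delivery.setdefault(name, None)
        if run["status"] == "success" and run["records"] > 0:
            prev = last_delivery[name]
            if prev is None or run["finished_at"] > prev:
                last_delivery[name] = run["finished_at"]
    return sorted(
        name for name, ts in last_delivery.items()
        if ts is None or now - ts > max_quiet
    )


now = datetime(2026, 6, 10, 12, 0, tzinfo=timezone.utc)
runs = [
    {"connector": "confluence", "finished_at": now - timedelta(days=3),
     "status": "success", "records": 40},
    {"connector": "confluence", "finished_at": now - timedelta(hours=1),
     "status": "success", "records": 0},  # "successful" but empty
    {"connector": "crm", "finished_at": now - timedelta(hours=2),
     "status": "success", "records": 12},
]
alerts = quiet_connectors(runs, now)
```

Some sources legitimately go quiet for long stretches, so the threshold would need tuning per connector to avoid false alarms.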

No stable chunk IDs. When a source record updates, the pipeline re-embeds it and inserts new vectors — but without deterministic chunk IDs tied to object IDs, the old vectors remain in the index. Retrieval returns both old and new versions of the same content, producing contradictory or redundant context.
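To make the fix concrete, here is a minimal sketch of deterministic chunk IDs combined with replace-on-update. It assumes the index can be modeled as a dictionary keyed by chunk ID. The ID scheme (a truncated SHA-256 of tenant, object, and position) and the function names are illustrative choices, not a prescribed format.

```python
import hashlib


def chunk_id(tenant_id, object_id, position):
    """Derive the same ID every time for the same tenant, object, and position."""
    raw = f"{tenant_id}:{object_id}:{position}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]


def upsert_object(index, tenant_id, object_id, chunk_texts):
    """Replace every chunk belonging to one source object.

    index is a dict mapping chunk ID to a row dict.
    """
    # Delete first: an updated record may produce fewer chunks than before,
    # and overwriting by ID alone would leave the trailing ones behind.
    stale = [
        cid for cid, row in index.items()
        if row["tenant_id"] == tenant_id and row["object_id"] == object_id
    ]
    for cid in stale:
        del index[cid]
    for pos, text in enumerate(chunk_texts):
        index[chunk_id(tenant_id, object_id, pos)] = {
            "tenant_id": tenant_id, "object_id": object_id, "text": text,
        }


index = {}
upsert_object(index, "t1", "ticket-7", ["old part 1", "old part 2", "old part 3"])
upsert_object(index, "t1", "ticket-7", ["new part 1", "new part 2"])
```

After the second call the index holds only the two new chunks, so retrieval cannot return a mix of old and new versions of the same ticket.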

Periodic full rebuilds instead of incremental updates. Nightly or weekly full re-indexing jobs miss intra-day changes. For slowly changing content this may be acceptable; for CRM records, ticket threads, or candidate profiles that change multiple times per day, hours of drift between source state and index state create a meaningful correctness gap.

Permission loss at ingestion. Content is indexed without capturing the ownership and access-control attributes from the source API. At retrieval time, there is nothing to filter against. Documents surface to users who were never authorized to see them.
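If access-control attributes are captured at ingestion, the retrieval side can filter on them. The sketch below assumes each chunk carries an `acl` dictionary with `users` and `groups` lists. That shape is hypothetical, since real source APIs expose permissions in many different forms.

```python
def authorized_chunks(chunks, user_id, user_groups):
    """Keep only chunks the user may read.

    Chunks with no ACL metadata are dropped (fail closed) rather than returned.
    """
    groups = set(user_groups)
    allowed = []
    for chunk in chunks:
        acl = chunk.get("acl")
        if not acl:
            continue  # permission data was lost, so do not expose the chunk
        if user_id in acl.get("users", ()) or groups & set(acl.get("groups", ())):
            allowed.append(chunk)
    return allowed


chunks = [
    {"id": "c1", "acl": {"users": ["alice"], "groups": []}},
    {"id": "c2", "acl": {"users": [], "groups": ["finance"]}},
    {"id": "c3"},  # indexed without permission metadata
]
visible = authorized_chunks(chunks, "alice", ["engineering"])
```

Failing closed trades recall for safety: a chunk whose permissions were never captured becomes invisible, which tends to expose the ingestion gap quickly instead of leaking content.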

Embedding drift. When the embedding model is updated, the new model's vector space is not comparable to the old one. Queries using new embeddings return poor results against an index built with old embeddings. Fixing this requires re-embedding the full corpus — often discovered only after retrieval quality degrades noticeably in production.

The consistent finding across production audits: teams that treat ingestion as a solved problem and focus optimization effort on retrieval and generation are tuning the wrong stage.

The two distinct problems in a RAG pipeline

A useful way to think about a RAG pipeline is as two systems with different operational characteristics:

The ingestion system — connects to source APIs, detects changes, fetches updated content, and delivers it to the chunk/embed/store pipeline. Its job is completeness, freshness, and reliability. It must handle API rate limits, pagination, partial failures, OAuth token refresh, and the difference between sources that support native webhooks and those that require managed polling. It needs a checkpoint mechanism so a failed run resumes from where it stopped rather than restarting from zero.

The retrieval system — receives a query, searches the vector index, applies filters, and assembles context for the model. Its job is relevance, speed, and authorization enforcement. It operates on whatever the ingestion system has delivered.

The retrieval system can only be as good as the ingestion system allows. A well-tuned retrieval layer returning stale, duplicated, or permission-stripped content will still produce unreliable answers.

These two systems have different build-vs-buy trade-offs. The retrieval system — vector database selection, embedding model choice, reranking logic — is specific to the application being built and is typically owned by the team building the product. The ingestion system — connecting to dozens of SaaS APIs, handling their individual change detection mechanisms, managing backfill and incremental sync — is largely generic infrastructure that every team building on SaaS data has to solve the same way.

What the ingestion layer requires

A production-grade RAG pipeline ingestion layer needs:

Backfill on first load. Before any incremental sync runs, the full corpus of existing records must be loaded into the pipeline. Without a managed backfill mechanism, teams write custom list-endpoint loops per API, handle pagination and rate limits manually, and manage their own checkpoints to resume interrupted runs.

Incremental updates on change. After backfill completes, the pipeline must stay current as source data changes. For APIs with native webhooks, change events arrive in real time. For APIs without native webhooks, a managed polling layer must detect what changed since the last run and emit events only for affected records — not trigger full re-ingestion.

Stable object IDs for targeted re-embedding. When a source record updates, only the chunks derived from that record should be re-embedded. This requires deterministic chunk IDs tied to source object IDs, so the pipeline can identify and replace exactly the affected vectors without touching the rest of the index.

Checkpoint and resume on failure. If an ingestion run fails partway through — due to a rate limit, a transient API error, or an application deployment — the pipeline should resume from the last successful position, not restart from scratch. Without this, failed runs produce partial updates and require manual intervention to correct.

Tenant scoping from the start. Each ingested object must carry a tenant identifier (in Unified's model, a connection_id) from the moment it enters the pipeline. This identifier travels with every chunk into the vector store and enables filtering at retrieval time to prevent cross-customer data exposure.
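The checkpoint-and-resume requirement above can be sketched as a paginated backfill that persists its position after each page. This is an illustration under simple assumptions: pages are addressed by integer, an empty page marks the end, and the checkpoint lives in a local JSON file. `fetch_page`, `handle`, and the file layout are hypothetical.

```python
import json
import os


def load_checkpoint(path):
    """Return the next page to fetch, or 0 if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_page"]
    return 0


def save_checkpoint(path, next_page):
    """Write the checkpoint via a temp file and rename so it is never half-written."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_page": next_page}, f)
    os.replace(tmp, path)


def backfill(fetch_page, handle, path):
    """Process pages from the last checkpoint until an empty page is returned."""
    page = load_checkpoint(path)
    while True:
        records = fetch_page(page)  # may raise on rate limits or transient errors
        if not records:
            break
        handle(records)
        page += 1
        save_checkpoint(path, page)  # only advance after the page was handled
    return page
```

Because the checkpoint is saved after `handle` runs, a crash between the two steps means one page is processed again on resume. That gives at-least-once delivery, which is safe only if the downstream write is idempotent, for example through deterministic chunk IDs.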

Where Unified fits in a RAG pipeline

Unified handles the ingestion layer — not the embedding model, vector database, or retrieval logic.

Across CRM, ATS, ticketing, accounting, file storage, knowledge management, and additional categories, Unified provides:

Authorized reads from source APIs — normalized object schemas across 460+ integrations, fetched directly from the source without intermediate storage
Backfill on first subscription — existing records delivered to the pipeline endpoint in pages until the initial load is complete, then automatically transitioning to incremental change events on the same subscription
Change detection via native and virtual webhooks — for APIs with native webhook support, events arrive in real time; for APIs without it, Unified manages polling and emits events only when changes are detected
Checkpointed delivery — failed runs resume from the last successful position; Unified tracks state so the pipeline doesn't need to

The boundary is explicit: Unified delivers normalized objects and change events to whatever endpoint the team has built. Chunking, embedding, vector storage, and retrieval are the team's responsibility.

For a deeper look at how the ingestion layer connects to retrieval strategy — specifically when to index content vs. when to read from source APIs at query time — see Index-Time RAG vs Real-Time RAG. For the specifics of keeping an index current after the initial load, see Keeping Your RAG Index in Sync with Live SaaS Data.

When to build the ingestion layer vs. when to use existing infrastructure

The ingestion layer is the part of a RAG pipeline that looks deceptively simple to build and is consistently more expensive to maintain than anticipated.

Connecting to a single SaaS API — handling its authentication flow, pagination, rate limits, and change detection — takes a few days. Connecting to twenty, maintaining those connections as APIs change, handling token expiry, managing backfill and incremental sync across all of them, and building the observability to detect when a connector has silently stopped delivering data — that is a different order of magnitude.

Teams that build the ingestion layer from scratch tend to discover this at the point where the third or fourth integration breaks in production. At that point, the choice is between investing ongoing engineering time in integration maintenance or replacing the bespoke ingestion layer with infrastructure designed for that problem.

The retrieval system is worth building. The ingestion layer is worth evaluating carefully before committing to maintaining it.

