Published June 4, 2026

Keeping Your RAG Index in Sync with Live SaaS Data


Most RAG content is about retrieval: which embedding model, which vector database, how to chunk, how to rerank. Almost none of it is about the part that breaks first in production — keeping the index aligned with source data that never stops changing.

This post is about that layer, and only that layer. It is not about vector database design, and it does not try to replace your vector store. It is about getting high-integrity change events and a reliable backfill out of the SaaS systems your customers use, and into whatever vector infrastructure you already trust. The boundary is the whole point:

Unified delivers the change signal and the backfill; your pipeline decides what to re-embed and how to store it.

Index drift is the default, not the exception

A RAG index is a snapshot. The moment you finish embedding a corpus, the source data starts moving away from it. A contract gets a new version in Google Drive. A deal changes stage in Salesforce. A ticket gets reassigned. A permission gets revoked. None of that reaches your vector store unless something is actively detecting the change and feeding it back in.

When that machinery is missing or lagging, the index keeps answering — confidently, and wrong. The failure is quiet: there is no error, just a model grounding its answer in last week's state. The common ways it happens are familiar to anyone who has built one of these:

New records that never make it in because an ingestion run failed silently halfway through.
Updates that land hours or days late because re-indexing is a nightly batch job.
Deleted or access-revoked records that stay retrievable because nothing told the index they were gone.

Teams pour attention into model choice and almost none into treating ingestion as a first-class system with its own freshness targets. Ingestion is where RAG quality actually degrades.

The boundary: what Unified handles, what your stack owns

The clearest way to think about this is as a contract between two systems.

Unified handles: authorization to each SaaS API, change detection across providers, the initial backfill, retries, and tracking the last successful position so a failed run resumes instead of restarting.

You handle: chunking, embeddings, vector database choice, index schema, and all query-time logic.

Unified is the source of what changed and the mechanism for the first complete load. It does not own where or how you index, and it does not manage your vector schema or your embeddings. Its focus is getting you a complete first load and a reliable stream of changes on top of each integration.

Backfill: the first complete load

Every index starts empty. Backfill is how you populate it once, from zero, pulling every relevant object out of the source API and into your pipeline.

With Unified, you enable this on a webhook subscription with include_all=true. Instead of writing list-endpoint loops, handling pagination, and storing your own checkpoints, your endpoint receives the existing records in pages until the backfill completes — and then the same subscription transitions automatically to incremental change events. Pagination, rate limits, and backoff are handled underneath.

The important property is that backfill and live updates arrive through one path. You build the ingestion handler once. Historical data and ongoing changes hit the same endpoint, in the same shape, so there is no separate sync job to maintain alongside your event handler.
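A minimal sketch of that single handler, under stated assumptions: the payload shape (a `data` list of records, each carrying `id` and `updated_at`) is hypothetical and not Unified's documented schema, the timestamps are assumed to be ISO 8601 strings in one consistent format and timezone so they compare correctly as text, and a plain dict stands in for the vector store.

```python
from typing import Any

# Stand-in for a vector store: record id -> stored record (hypothetical).
Index = dict[str, dict[str, Any]]


def handle_delivery(payload: dict[str, Any], index: Index) -> int:
    """Upsert every record in one delivery and return how many were written.

    The same function serves backfill pages and later change events,
    because both are assumed to arrive in the same shape.
    """
    written = 0
    for record in payload.get("data", []):
        record_id = record["id"]
        existing = index.get(record_id)
        # Ignore redeliveries and out-of-order events: keep the newest version.
        # Assumes updated_at values are ISO 8601 strings in a single format.
        if existing is not None and existing["updated_at"] >= record["updated_at"]:
            continue
        index[record_id] = record
        written += 1
    return written
```

Because the handler is idempotent, a backfill page that is delivered twice writes nothing the second time, and an incremental event for a record already loaded during backfill simply replaces it.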

Incremental: re-embed only what changed

Once backfill completes, the subscription delivers changes. On each sync you receive the objects that have changed since the last run, identified by their IDs and updated_at timestamps, scoped to the event types you subscribed to (created, updated). Your pipeline looks up the affected records, re-chunks if needed, re-embeds, and upserts them into your vector store.

The win is cost and latency: you avoid full re-indexes and pay the embedding cost only for objects that actually changed. You can also constrain payloads to the fields you care about, so you're not moving — or reprocessing — data your index doesn't use.
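One way to keep that cost down on your side of the boundary is to skip re-embedding when the fields you index have not changed. The sketch below is an illustration, not part of Unified: `INDEXED_FIELDS`, the record shape, the dict-based store, and the `embed` callable are all hypothetical placeholders for your own pipeline.

```python
import hashlib
import json
from typing import Any, Callable

# Hypothetical: the only fields this index embeds.
INDEXED_FIELDS = ("title", "body")


def content_hash(record: dict[str, Any]) -> str:
    """Hash only the indexed fields, so unrelated edits do not trigger work."""
    subset = {field: record.get(field) for field in INDEXED_FIELDS}
    encoded = json.dumps(subset, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def upsert_if_changed(
    record: dict[str, Any],
    store: dict[str, dict[str, Any]],
    embed: Callable[[str], list[float]],
) -> bool:
    """Re-embed and store the record only if its indexed content changed."""
    digest = content_hash(record)
    current = store.get(record["id"])
    if current is not None and current["hash"] == digest:
        return False  # same indexed content: no embedding call
    text = "\n".join(str(record.get(field) or "") for field in INDEXED_FIELDS)
    store[record["id"]] = {"hash": digest, "vector": embed(text)}
    return True
```

With this check in place, a ticket that is only reassigned produces a change event but no embedding call, while an edit to its body does.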

One failure mode worth handling explicitly: a stale embedding for a deleted or access-revoked record. Delete handling depends on the source. Where a provider emits native delete events, those come through the change stream and your pipeline can tombstone or remove the corresponding embedding. Where a provider does not expose deletes, no normalization layer can invent them — you'll need a reconciliation strategy, such as a periodic full sync with "missing means deleted" logic, to keep deleted content from surfacing in results.
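The "missing means deleted" reconciliation can be sketched as a set difference between what the index holds and what a full sync just returned. This is a minimal illustration with hypothetical names; the `sync_complete` flag is an assumption about how your pipeline signals that the full listing finished, and it matters because a partial listing would otherwise look like a mass deletion.

```python
from typing import Any


def reconcile(
    index: dict[str, Any],
    source_ids: set[str],
    sync_complete: bool,
) -> set[str]:
    """Remove indexed records absent from a full source listing.

    Returns the ids that were removed. Does nothing unless the full
    sync finished, so a failed or partial run never deletes anything.
    """
    if not sync_complete:
        return set()
    stale = set(index) - source_ids
    for record_id in stale:
        del index[record_id]
    return stale
```

In a real vector store the `del` would be a delete or tombstone call keyed by the same record id, covering every chunk derived from that record.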

Freshness is a budget you set

How current the index needs to be is your decision, not a fixed property of the system. The mental model is simple:

Native webhooks, where the provider supports them: changes are pushed as they happen.
Virtual webhooks, where the provider doesn't: Unified polls the source on an interval you configure and emits an event only when it detects a real change.

The interval is a knob: as short as roughly one minute on paid plans, versus about 60 minutes on the free tier. A one-minute interval keeps an index close to live; a longer interval accepts bounded staleness in exchange for fewer reads. Either way, you set the staleness budget per use case, and Unified's job is to reliably deliver whatever change stream you've asked for. Virtual webhooks bill only for intervals in which data is actually retrieved, so polling frequently for data that rarely changes does not incur charges for the empty checks.
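To turn the budget into a number, a rough back-of-the-envelope check helps. The sketch below assumes a simple model that is not a Unified guarantee: a change made just after a poll waits one full interval to be detected, then your pipeline's own processing time before it is searchable, with delivery latency treated as negligible.

```python
from datetime import timedelta


def worst_case_staleness(poll_interval: timedelta, processing: timedelta) -> timedelta:
    """Upper bound on index lag under the simple polling model above."""
    return poll_interval + processing


def within_budget(
    poll_interval: timedelta,
    processing: timedelta,
    budget: timedelta,
) -> bool:
    """True if the worst-case lag fits the staleness budget for this use case."""
    return worst_case_staleness(poll_interval, processing) <= budget


# A one-minute interval with 30 seconds of re-embedding fits a 5-minute budget;
# a 60-minute interval with the same processing time does not.
fits = within_budget(timedelta(minutes=1), timedelta(seconds=30), timedelta(minutes=5))
too_slow = within_budget(timedelta(minutes=60), timedelta(seconds=30), timedelta(minutes=5))
```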

Reliability: no silent skips

The reason ingestion fails quietly is that most homegrown pipelines guess where they were when something breaks. Unified doesn't guess. It tracks the last successful position, and on failure — whether the source API rate-limits it, or your own endpoint is down — it backs off and resumes from that checkpoint rather than skipping ahead or replaying from the start.

This works in both directions. If a source returns a 429, Unified backs off and remembers where it was. If your server can't accept delivery, the same applies on the outbound side. What your pipeline gets is a checkpointed change stream that resumes after a failure without losing updates.
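The general technique, checkpoint plus backoff, can be sketched in a few lines. This illustrates the pattern only and is not Unified's implementation; the function name, the `send` callable, and the retry parameters are hypothetical.

```python
import time
from typing import Callable, Sequence


def deliver_from_checkpoint(
    items: Sequence[str],
    send: Callable[[str], None],
    checkpoint: int = 0,
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Send items starting at `checkpoint` and return the new checkpoint.

    The checkpoint advances only after a successful send, so a failure
    never skips an item and a later run never replays confirmed ones.
    """
    position = checkpoint
    while position < len(items):
        for attempt in range(max_attempts):
            try:
                send(items[position])
                break
            except ConnectionError:
                sleep(base_delay * 2 ** attempt)  # exponential backoff
        else:
            # Every attempt failed: stop here so the next run resumes
            # from the same item instead of skipping ahead.
            return position
        position += 1
    return position
```

Persisting the returned position between runs is what makes the resume possible. On the receiving side, it is still worth keeping your handler idempotent, since a retried delivery may repeat an event you already processed.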

The handoff to your vector infrastructure

What Unified hands you is two things: a backfill mechanism that loads your index once, and a reliable, checkpointed change stream on top of each SaaS integration after that. That's the output contract.

From there, it plugs into the vector pipeline you've already chosen — a homegrown service on pgvector, a managed vector database, or a search engine with hybrid retrieval. Unified delivers the change signal and the backfill; your pipeline decides what to re-embed and how to store it.

Two adjacent decisions sit just past this boundary, and both are covered elsewhere:

Turning these events into chunks and embeddings, and why normalizing SaaS objects before embedding matters: see RAG with SaaS Data: Files, Tickets, CRM, and Why Normalization Matters.
Deciding when to embed and index versus when to read from the source at query time: see Index-Time RAG vs Real-Time RAG: Choosing the Right Retrieval Strategy.

Get the ingestion layer right and the rest of the RAG stack has something stable to stand on. Get it wrong and no amount of model tuning will keep your answers true.
