Unified.to

How to Train AI on ATS Data with Unified's ATS API


February 2, 2026

Training AI on ATS data is rarely blocked by model choice.

It's blocked by data reality.

Across ATS systems, the same concepts are represented inconsistently:

  • Resumes may be downloadable in one system and unavailable in another.
  • Skills might exist in a schema but be empty for most providers.
  • Outcome fields (hire, reject, offer acceptance) are often missing or partially recorded.
  • Stage progression history is usually not available as an event timeline.

Unified's ATS API gives you normalized objects and consistent retrieval semantics. It does not invent missing fields, reconstruct stage history, or guarantee that any single provider exposes complete outcomes.

This guide shows how to train AI on ATS data using Unified's ATS API in a way that is defensible: build a training dataset from what's observable, handle provider variability explicitly, and keep the dataset fresh over time.

What 'train AI on ATS data' means here

This article focuses on building an AI-ready dataset and retraining loop using ATS primitives.

We will:

  • Build a dataset from normalized ATS objects (jobs, candidates, applications, interviews, documents)
  • Extract structured and unstructured features safely
  • Construct labels when the provider exposes them
  • Implement incremental refresh for continuous training

We will not:

  • Assume resumes are always retrievable
  • Assume skills are always populated
  • Assume hires/rejections/offers are universally available
  • Claim stage transition history exists as a timeline

The mental model: AI training datasets are built from three layers

1) Inputs (features)

  • Job requirements and context (AtsJob)
  • Candidate qualifications (AtsCandidate)
  • Application context and answers (AtsApplication.answers)
  • Interview cadence and status (AtsInterview)
  • Resume text, when available (AtsDocument)

2) Relationships (joins)

  • Candidates and jobs are linked through applications:
    • AtsApplication.candidate_id
    • AtsApplication.job_id

3) Outcomes (labels)

When available:

  • hired_at
  • rejected_at
  • offers[] (nested inside application, and provider-dependent)

If these are missing for a provider, you cannot pretend they exist. Train on what you have or use alternative labels (for example, 'advanced to interview stage' based on current status, if your product accepts that limitation).
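If you accept that limitation, the weak-label fallback can be sketched as a small helper driven by the current normalized status. The status strings below are illustrative; verify them against the stage vocabulary your connection actually returns:

```typescript
// Weak labels derived from the current normalized application status.
// Status values here are illustrative assumptions, not a documented enum.
type WeakLabel = 'REACHED_INTERVIEW' | 'ACTIVE' | 'UNKNOWN';

function weakLabel(app: { status?: string }): WeakLabel {
  const s = (app.status ?? '').toUpperCase();
  // Any interview-like stage counts as "advanced to interview".
  if (s.includes('INTERVIEW')) return 'REACHED_INTERVIEW';
  if (s) return 'ACTIVE';
  return 'UNKNOWN';
}
```

Record which labeling path produced each label so you can evaluate the weak labels separately from true outcomes.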

Objects you'll use

Jobs (AtsJob)

Useful fields:

  • id
  • name
  • description
  • status
  • minimum_experience_years
  • minimum_degree
  • employment_type
  • remote
  • questions[]
  • metadata[]

Note: skills[] exists in the normalized schema but is not reliably populated by most ATS integrations.

Candidates (AtsCandidate)

Useful fields:

  • id
  • first_name, last_name
  • title
  • company_name
  • experiences[]
  • education[]
  • tags[]
  • metadata[]
  • link_urls[]

Note: skills[] exists in the normalized schema but is only exposed by a small subset of integrations.

Applications (AtsApplication)

Useful fields:

  • id
  • candidate_id
  • job_id
  • status (current normalized stage)
  • original_status (provider raw stage)
  • applied_at
  • answers[]
  • rejected_at (provider-dependent)
  • hired_at (provider-dependent)
  • offers[] (provider-dependent, nested, slow when present)

Interviews (AtsInterview)

Useful fields:

  • application_id
  • status
  • start_at, end_at
  • user_ids[]

Documents (AtsDocument)

Useful fields:

  • id
  • type (RESUME, COVER_LETTER, etc.)
  • filename
  • document_url (short-lived when present)
  • document_data (rarely returned on read; provider-dependent)
  • candidate_id, application_id, job_id

Important constraints:

  • document_url is not returned by all providers.
  • Some providers return document metadata only (no URL, no base64).
  • Some providers do not support the document object at all.
  • type: RESUME exists but resume classification is not guaranteed to be consistent across providers.

Step 1: Build your base dataset from jobs, candidates, and applications

A safe training dataset starts with a clean join table:

  • candidate ↔ application ↔ job

Fetch open jobs (or all jobs)

import { UnifiedTo } from '@unified-api/typescript-sdk';

const sdk = new UnifiedTo({
  security: { jwt: process.env.UNIFIED_API_KEY! },
});

async function listOpenJobs(connectionId: string) {
  const out: any[] = [];
  let offset = 0;
  const limit = 100; // documented maximum per request

  while (true) {
    const page = await sdk.ats.listAtsJobs({
      connectionId,
      status: 'OPEN',
      limit,
      offset,
      sort: 'updated_at',
      order: 'asc',
    });

    if (!page || page.length === 0) break;
    out.push(...page);
    if (page.length < limit) break; // short page: no more records
    offset += limit;
  }

  return out;
}

Fetch candidates

async function listCandidates(connectionId: string) {
  const out: any[] = [];
  let offset = 0;
  const limit = 100;

  while (true) {
    const page = await sdk.ats.listAtsCandidates({
      connectionId,
      limit,
      offset,
      sort: 'updated_at',
      order: 'asc',
    });

    if (!page || page.length === 0) break;
    out.push(...page);
    if (page.length < limit) break; // short page: no more records
    offset += limit;
  }

  return out;
}

Fetch applications (join layer)

async function listApplications(connectionId: string) {
  const out: any[] = [];
  let offset = 0;
  const limit = 100;

  while (true) {
    const page = await sdk.ats.listAtsApplications({
      connectionId,
      limit,
      offset,
      sort: 'updated_at',
      order: 'asc',
    });

    if (!page || page.length === 0) break;
    out.push(...page);
    if (page.length < limit) break; // short page: no more records
    offset += limit;
  }

  return out;
}

Join locally

Store everything in your database and build:

  • candidates_by_id
  • jobs_by_id
  • applications_by_id

Then create training rows by joining each application:

function buildTrainingRows(applications: any[], candidatesById: Map<string, any>, jobsById: Map<string, any>) {
  const rows: any[] = [];

  for (const app of applications) {
    const candidate = app.candidate_id ? candidatesById.get(app.candidate_id) : null;
    const job = app.job_id ? jobsById.get(app.job_id) : null;
    if (!candidate || !job) continue;

    rows.push({ application: app, candidate, job });
  }

  return rows;
}

This gives you a stable foundation for feature engineering and labeling.

Step 2: Feature engineering without relying on fields that aren't reliably populated

Use structured signals that are broadly available

Good candidates (no pun intended):

  • Job description text (job.description)
  • Candidate experiences (candidate.experiences[])
  • Candidate education (candidate.education[])
  • Candidate title + company name (candidate.title, candidate.company_name)
  • Application answers (application.answers[])
  • Metadata (metadata[]) when your customers use it consistently

Treat skills[] as optional enrichment

Even though skills[] exists in the normalized schema for candidates and jobs, it is not universally returned. Only a small subset of integrations expose it. Do not build your model assuming it exists.
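One way to sketch this is a feature extractor that treats skills[] as nullable enrichment and records its absence explicitly, rather than defaulting to an empty list the model cannot distinguish from "no skills". The output shape here is our own, not part of the normalized schema:

```typescript
// Build a flat feature record from a joined training row
// (application + candidate + job, as produced in Step 1).
// skills[] is optional enrichment: absence is flagged, not faked.
function extractFeatures(row: { candidate: any; job: any; application: any }) {
  const { candidate, job, application } = row;
  const skills: string[] | null =
    Array.isArray(candidate.skills) && candidate.skills.length > 0
      ? candidate.skills
      : null;

  return {
    job_text: [job.name, job.description].filter(Boolean).join('\n'),
    experience_count: Array.isArray(candidate.experiences) ? candidate.experiences.length : 0,
    education_count: Array.isArray(candidate.education) ? candidate.education.length : 0,
    current_title: candidate.title ?? null,
    answer_count: (application.answers ?? []).length,
    skills,
    has_skills: skills !== null, // missingness flag for the model/eval
  };
}
```

Missingness flags like has_skills let you train and evaluate per-provider without silently mixing "no skills" and "skills not exposed".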

Step 3: Resume ingestion for embeddings (only when the provider supports it)

You cannot assume resumes are universally retrievable; availability depends on the provider.

What to do instead

  1. List documents for a candidate or application
  2. Attempt to retrieve document_url (and/or document_data when available)
  3. Download within the URL lifetime (commonly short-lived)
  4. Parse text → generate embeddings → store in your own system

List documents (application-level example):

async function listApplicationDocuments(connectionId: string, applicationId: string) {
  const out: any[] = [];
  let offset = 0;
  const limit = 100;

  while (true) {
    const page = await sdk.ats.listAtsDocuments({
      connectionId,
      application_id: applicationId,
      limit,
      offset,
      sort: 'updated_at',
      order: 'asc',
    });

    if (!page || page.length === 0) break;
    out.push(...page);
    if (page.length < limit) break; // short page: no more records
    offset += limit;
  }

  return out;
}

Then, for each document:

  • If document_url exists, download immediately.
  • If document_data is present on read (provider-dependent), decode base64.
  • If neither is present, treat resume ingestion as unavailable for that provider.

Also: type: RESUME is a valid enum, but documentation does not guarantee every provider uses it consistently. Treat it as a useful hint, not a guarantee.
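A hedged retrieval sketch covering those three cases, using standard Node.js `fetch` and `Buffer` (the null return is our convention for "ingestion unavailable"):

```typescript
// Resolve raw resume bytes from an AtsDocument, if the provider exposes them.
// Returns null when neither a URL nor inline base64 data is available.
async function fetchDocumentBytes(doc: {
  document_url?: string;
  document_data?: string;
}): Promise<Uint8Array | null> {
  if (doc.document_url) {
    // URLs are commonly short-lived: download immediately, never store the URL.
    const res = await fetch(doc.document_url);
    if (!res.ok) return null;
    return new Uint8Array(await res.arrayBuffer());
  }
  if (doc.document_data) {
    // Some providers return the file inline as base64.
    return Uint8Array.from(Buffer.from(doc.document_data, 'base64'));
  }
  // Metadata-only: treat resume ingestion as unavailable for this provider.
  return null;
}
```

Downstream, parse the bytes (PDF/DOCX extraction is your own pipeline), embed the text, and store the embedding in your own system keyed by candidate_id.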

Step 4: Labels and outcomes (train on what you can actually observe)

The normalized application schema includes:

  • hired_at
  • rejected_at
  • offers[] with accepted_at

But providers differ widely:

  • Some don't return hired_at at all.
  • Some don't return rejected_at (or return only rejected_reason).
  • offers[] is only supported by certain integrations.
  • When offers are supported, they are often marked as slow fields.
  • offers[].accepted_at is not guaranteed to be populated even when offers exist.

Practical labeling strategy

Build labels per-integration capability and treat missing values as normal.

Example label logic:

function labelFromApplication(app: any) {
  if (app.hired_at) return 'HIRED';
  if (app.rejected_at) return 'REJECTED';
  // Optional: if offers exist and accepted_at exists, treat as HIRED-like
  if (Array.isArray(app.offers) && app.offers.some((o: any) => o.accepted_at)) return 'OFFER_ACCEPTED';
  return 'IN_PROGRESS';
}

Then store:

  • label
  • label_source (which field produced it)
  • missingness flags (e.g., has_hired_at, has_rejected_at, has_offers)

This makes model evaluation honest and prevents silent bias.
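A sketch of that stored record, folding the label logic above into one self-contained function (the record shape and label_source values are our own conventions):

```typescript
// Wrap the label with its provenance and missingness flags so evaluation
// can condition on what each provider actually exposed.
function labeledRecord(app: any) {
  const hasHired = Boolean(app.hired_at);
  const hasRejected = Boolean(app.rejected_at);
  const offerAccepted =
    Array.isArray(app.offers) && app.offers.some((o: any) => o.accepted_at);

  return {
    application_id: app.id,
    label: hasHired ? 'HIRED'
      : hasRejected ? 'REJECTED'
      : offerAccepted ? 'OFFER_ACCEPTED'
      : 'IN_PROGRESS',
    label_source: hasHired ? 'hired_at'
      : hasRejected ? 'rejected_at'
      : offerAccepted ? 'offers.accepted_at'
      : 'none',
    has_hired_at: hasHired,
    has_rejected_at: hasRejected,
    has_offers: Array.isArray(app.offers) && app.offers.length > 0,
  };
}
```

At evaluation time you can then slice metrics by label_source and exclude providers whose flags show outcomes were never observable.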

Step 5: Do not assume stage history exists

Unified exposes:

  • AtsApplication.status (current normalized stage)
  • AtsApplication.original_status (raw provider stage)
  • AtsStatus via GET /ats/{connection_id}/applicationstatus (stage vocabulary)

It does not document:

  • an ordered stage progression model
  • a stage transition timeline endpoint
  • historical stage transitions

So 'sequence modeling' based on transitions is not something you can claim as universally supported. If you need detailed stage history, you must rely on provider-native data via passthrough (if the provider exposes it), or accept that you're training on current state + timestamps.

Step 6: Dataset scale and sync strategy for training

Pagination constraints

ATS list endpoints paginate with:

  • limit (default 100)
  • offset (0-based)

Documentation indicates:

  • limit cannot exceed 100 records per request
  • no total record limit is documented

What this implies for training

For any non-trivial dataset, you should:

  1. Sync into your own database (don't train directly off live list calls)
  2. Use incremental updates and/or webhooks to keep the dataset current
  3. Train from your DB snapshots

Step 7: Incremental refresh and retraining loop

All ATS list endpoints support:

  • updated_gte

So a safe retraining loop looks like:

  1. Initial backfill (paginate until exhaustion)
  2. Store a watermark per object type (last updated_at)
  3. Periodically fetch deltas using updated_gte
  4. Update your DB
  5. Recompute features for changed entities
  6. Retrain or refresh your model on schedule (or when enough data changes)

Example incremental application pull:

async function listUpdatedApplications(connectionId: string, updatedSince: string) {
  const out: any[] = [];
  let offset = 0;
  const limit = 100;

  // A delta can exceed one page of 100 records, so paginate here too.
  while (true) {
    const page = await sdk.ats.listAtsApplications({
      connectionId,
      updated_gte: updatedSince,
      sort: 'updated_at',
      order: 'asc',
      limit,
      offset,
    });

    if (!page || page.length === 0) break;
    out.push(...page);
    if (page.length < limit) break;
    offset += limit;
  }

  return out;
}
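The watermark bookkeeping in steps 2 and 3 can be sketched as a pure helper. This assumes ISO-8601 UTC timestamps, which compare correctly as strings; how you persist the watermark is up to your own storage layer:

```typescript
// Advance the per-object-type watermark from a batch of updated records.
// The next sync passes the returned value as updated_gte.
// Assumes updated_at is ISO-8601 UTC, so lexicographic comparison is valid.
function advanceWatermark(current: string, batch: { updated_at?: string }[]): string {
  let max = current;
  for (const rec of batch) {
    if (rec.updated_at && rec.updated_at > max) max = rec.updated_at;
  }
  return max;
}
```

Keep one watermark per object type (jobs, candidates, applications, documents), since each syncs independently.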

For near-real-time workflows, ATS webhooks exist, but:

  • filter keys vary by integration and object
  • filters are only supported for virtual webhooks
  • native webhooks don't accept filters

Together, updated_gte deltas and (where suitable) webhooks are enough to build a continuous learning loop without polling every few minutes.

Step 8: What to surface as 'AI insights' (without overclaiming)

If you're training models, your product outputs should map to observable data:

  • Candidate-job match score (based on job description + experience/resume text when available)
  • Candidate ranking for a job
  • Similar candidate retrieval (embedding search)
  • Pipeline conversion insights (only if outcomes exist)
  • Suggested next actions (based on application status + interview cadence)
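For the match score, a minimal sketch is cosine similarity between a job-description embedding and a resume-text embedding. How you produce the vectors (any embedding model) is your own choice; this helper only assumes equal-length numeric vectors:

```typescript
// Cosine similarity between two embedding vectors, e.g. a job-description
// embedding and a resume-text embedding. Returns a value in [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) throw new Error('dimension mismatch');
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}
```

Ranking is then just sorting candidates for a job by this score, downgraded or hidden when the resume text was unavailable.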

Be explicit in UX about missing data:

  • 'Resume unavailable for this ATS integration'
  • 'Offer outcomes not returned by provider'
  • 'Hire timestamp not present'

That transparency prevents trust issues.

Closing thoughts

Training AI on ATS data is viable when you treat ATS systems as they are: incomplete, inconsistent, and provider-dependent.

Unified's ATS API gives you the right foundation:

  • normalized candidates, jobs, applications, interviews, and documents
  • consistent pagination and incremental update patterns
  • optional webhooks for continuous updates

From there, the difference between a useful model and a fragile one is whether your pipeline is honest about missing data, builds labels defensibly, and stores training-ready datasets in your own system.
