How to Train AI on ATS Data with Unified's ATS API
February 2, 2026
Training AI on ATS data is rarely blocked by model choice.
It's blocked by data reality.
Across ATS systems, the same concepts are represented inconsistently:
- Resumes may be downloadable in one system and unavailable in another.
- Skills might exist in a schema but be empty for most providers.
- Outcome fields (hire, reject, offer acceptance) are often missing or partially recorded.
- Stage progression history is usually not available as an event timeline.
Unified's ATS API gives you normalized objects and consistent retrieval semantics. It does not invent missing fields, reconstruct stage history, or guarantee that any single provider exposes complete outcomes.
This guide shows how to train AI on ATS data using Unified's ATS API in a way that is defensible: build a training dataset from what's observable, handle provider variability explicitly, and keep the dataset fresh over time.
What 'train AI on ATS data' means here
This article focuses on building an AI-ready dataset and retraining loop using ATS primitives.
We will:
- Build a dataset from normalized ATS objects (jobs, candidates, applications, interviews, documents)
- Extract structured and unstructured features safely
- Construct labels when the provider exposes them
- Implement incremental refresh for continuous training
We will not:
- Assume resumes are always retrievable
- Assume skills are always populated
- Assume hires/rejections/offers are universally available
- Claim stage transition history exists as a timeline
The mental model: AI training datasets are built from three layers
1) Inputs (features)
- Job requirements and context (AtsJob)
- Candidate qualifications (AtsCandidate)
- Application context and answers (AtsApplication.answers)
- Interview cadence and status (AtsInterview)
- Resume text, when available (AtsDocument)
2) Relationships (joins)
- Candidates and jobs are linked through applications:
- AtsApplication.candidate_id
- AtsApplication.job_id
3) Outcomes (labels)
When available:
- hired_at
- rejected_at
- offers[] (nested inside the application, and provider-dependent)
If these are missing for a provider, you cannot pretend they exist. Train on what you have or use alternative labels (for example, 'advanced to interview stage' based on current status, if your product accepts that limitation).
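To make the three layers concrete, here is a minimal sketch of a per-application training row. The type name, field shapes, and label union are illustrative, not part of the SDK; adapt them to the outcomes your providers actually return.

interface TrainingRowSketch {
  // Layer 1: inputs (features)
  job_description: string;
  candidate_experiences: unknown[];
  application_answers: string[];
  resume_text?: string; // only when the provider exposes documents

  // Layer 2: relationships (joins)
  application_id: string;
  candidate_id: string;
  job_id: string;

  // Layer 3: outcomes (labels), optional because providers differ
  label?: 'HIRED' | 'REJECTED' | 'OFFER_ACCEPTED' | 'IN_PROGRESS';
  label_source?: string;
}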
Objects you'll use
Jobs (AtsJob)
Useful fields:
id, name, description, status, minimum_experience_years, minimum_degree, employment_type, remote, questions[], metadata[]
Note: skills[] exists in the normalized schema but is not reliably populated by most ATS integrations.
Candidates (AtsCandidate)
Useful fields:
id, first_name, last_name, title, company_name, experiences[], education[], tags[], metadata[], link_urls[]
Note: skills[] exists in the normalized schema but is only exposed by a small subset of integrations.
Applications (AtsApplication)
Useful fields:
- id
- candidate_id
- job_id
- status (current normalized stage)
- original_status (provider raw stage)
- applied_at
- answers[]
- rejected_at (provider-dependent)
- hired_at (provider-dependent)
- offers[] (provider-dependent, nested, slow when present)
Interviews (AtsInterview)
Useful fields:
application_id, status, start_at, end_at, user_ids[]
Documents (AtsDocument)
Useful fields:
- id
- type (RESUME, COVER_LETTER, etc.)
- filename
- document_url (short-lived when present)
- document_data (rarely returned on read; provider-dependent)
- candidate_id, application_id, job_id
Important constraints:
- document_url is not returned by all providers.
- Some providers return document metadata only (no URL, no base64).
- Some providers do not support the document object at all.
- type: RESUME exists, but resume classification is not guaranteed to be consistent across providers.
Step 1: Build your base dataset from jobs, candidates, and applications
A safe training dataset starts with a clean join table:
- candidate ↔ application ↔ job
Fetch open jobs (or all jobs)
import { UnifiedTo } from '@unified-api/typescript-sdk';
const sdk = new UnifiedTo({
security: { jwt: process.env.UNIFIED_API_KEY! },
});
async function listOpenJobs(connectionId: string) {
const out = [];
let offset = 0;
const limit = 100;
  // Page 100 records at a time until an empty page signals exhaustion
  while (true) {
const page = await sdk.ats.listAtsJobs({
connectionId,
status: 'OPEN',
limit,
offset,
sort: 'updated_at',
order: 'asc',
});
if (!page || page.length === 0) break;
out.push(...page);
offset += limit;
}
return out;
}
Fetch candidates
async function listCandidates(connectionId: string) {
const out = [];
let offset = 0;
const limit = 100;
while (true) {
const page = await sdk.ats.listAtsCandidates({
connectionId,
limit,
offset,
sort: 'updated_at',
order: 'asc',
});
if (!page || page.length === 0) break;
out.push(...page);
offset += limit;
}
return out;
}
Fetch applications (join layer)
async function listApplications(connectionId: string) {
const out = [];
let offset = 0;
const limit = 100;
while (true) {
const page = await sdk.ats.listAtsApplications({
connectionId,
limit,
offset,
sort: 'updated_at',
order: 'asc',
});
if (!page || page.length === 0) break;
out.push(...page);
offset += limit;
}
return out;
}
Join locally
Store everything in your database and build:
- candidates_by_id
- jobs_by_id
- applications_by_id
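For example, using the results of the fetch helpers above, in-memory lookup maps are a quick stand-in for what would be database indexes in production:

const candidatesById = new Map<string, any>();
for (const c of candidates) candidatesById.set(c.id, c);

const jobsById = new Map<string, any>();
for (const j of jobs) jobsById.set(j.id, j);

const applicationsById = new Map<string, any>();
for (const a of applications) applicationsById.set(a.id, a);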
Then create training rows by joining each application:
function buildTrainingRows(applications: any[], candidatesById: Map<string, any>, jobsById: Map<string, any>) {
const rows = [];
for (const app of applications) {
const candidate = app.candidate_id ? candidatesById.get(app.candidate_id) : null;
const job = app.job_id ? jobsById.get(app.job_id) : null;
if (!candidate || !job) continue;
rows.push({ application: app, candidate, job });
}
return rows;
}
This gives you a stable foundation for feature engineering and labeling.
Step 2: Feature engineering without relying on fields that aren't reliably populated
Use structured signals that are broadly available
Good candidates (no pun intended):
- Job description text (job.description)
- Candidate experiences (candidate.experiences[])
- Candidate education (candidate.education[])
- Candidate title + company name (candidate.title, candidate.company_name)
- Application answers (application.answers[])
- Metadata (metadata[]) when your customers use it consistently
Treat skills[] as optional enrichment
Even though skills[] exists in the normalized schema for candidates and jobs, it is not universally returned. Only a small subset of integrations expose it. Do not build your model assuming it exists.
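A sketch of a feature extractor that leans only on the broadly available signals, with skills[] handled strictly as optional enrichment. The function name and output shape are illustrative, not part of the SDK:

function extractFeatures(row: { application: any; candidate: any; job: any }) {
  const { application, candidate, job } = row;
  return {
    job_description: job.description ?? '',
    candidate_title: candidate.title ?? '',
    candidate_company: candidate.company_name ?? '',
    experience_count: Array.isArray(candidate.experiences) ? candidate.experiences.length : 0,
    education_count: Array.isArray(candidate.education) ? candidate.education.length : 0,
    answer_text: Array.isArray(application.answers) ? application.answers.join(' ') : '',
    // Optional enrichment: populated by only a small subset of integrations
    skills: Array.isArray(candidate.skills) ? candidate.skills : [],
  };
}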
Step 3: Resume ingestion for embeddings (only when the provider supports it)
There is no universal 'resume always available' assumption you can make.
What to do instead
- List documents for a candidate or application
- Attempt to retrieve document_url (and/or document_data when available)
- Download within the URL lifetime (commonly short-lived)
- Parse text → generate embeddings → store in your own system
List documents (application-level example):
async function listApplicationDocuments(connectionId: string, applicationId: string) {
const out = [];
let offset = 0;
const limit = 100;
while (true) {
const page = await sdk.ats.listAtsDocuments({
connectionId,
      applicationId,
limit,
offset,
sort: 'updated_at',
order: 'asc',
});
if (!page || page.length === 0) break;
out.push(...page);
offset += limit;
}
return out;
}
Then, for each document:
- If document_url exists, download immediately.
- If document_data is present on read (provider-dependent), decode the base64 content.
- If neither is present, treat resume ingestion as unavailable for that provider (a retrieval sketch follows).
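A retrieval sketch following that order of preference. It assumes Node 18+ (global fetch) and the snake_case field names used throughout this article; check the casing your SDK version returns:

async function getResumeBytes(doc: any): Promise<Uint8Array | null> {
  // Prefer the short-lived URL when the provider returns one
  if (doc.document_url) {
    const res = await fetch(doc.document_url);
    if (!res.ok) return null;
    return new Uint8Array(await res.arrayBuffer());
  }
  // Some providers return base64 content on read instead of a URL
  if (doc.document_data) {
    return Uint8Array.from(Buffer.from(doc.document_data, 'base64'));
  }
  // Metadata-only or unsupported: no resume ingestion for this provider
  return null;
}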
Also: type: RESUME is a valid enum, but the documentation does not guarantee that every provider uses it consistently. Treat it as a useful hint, not a guarantee.
Step 4: Labels and outcomes (train on what you can actually observe)
The normalized application schema includes:
- hired_at
- rejected_at
- offers[] with accepted_at
But providers differ widely:
- Some don't return hired_at at all.
- Some don't return rejected_at (or return only rejected_reason).
- offers[] is only supported by certain integrations.
- When offers are supported, they are often marked as slow fields.
- offers[].accepted_at is not guaranteed to be populated even when offers exist.
Practical labeling strategy
Build labels per-integration capability and treat missing values as normal.
Example label logic:
function labelFromApplication(app: any) {
if (app.hired_at) return 'HIRED';
if (app.rejected_at) return 'REJECTED';
// Optional: if offers exist and accepted_at exists, treat as HIRED-like
if (Array.isArray(app.offers) && app.offers.some((o: any) => o.accepted_at)) return 'OFFER_ACCEPTED';
return 'IN_PROGRESS';
}
Then store:
- label
- label_source (which field produced it)
- missingness flags (e.g., has_hired_at, has_rejected_at, has_offers); a sketch follows
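A sketch of that stored record, reusing labelFromApplication above. The source precedence mirrors the label function:

function labelRecord(app: any) {
  const hasOfferAccepted = Array.isArray(app.offers) && app.offers.some((o: any) => o.accepted_at);
  return {
    application_id: app.id,
    label: labelFromApplication(app),
    label_source: app.hired_at ? 'hired_at'
      : app.rejected_at ? 'rejected_at'
      : hasOfferAccepted ? 'offers.accepted_at'
      : 'none',
    has_hired_at: Boolean(app.hired_at),
    has_rejected_at: Boolean(app.rejected_at),
    has_offers: Array.isArray(app.offers) && app.offers.length > 0,
  };
}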
This makes model evaluation honest and prevents silent bias.
Step 5: Do not assume stage history exists
Unified exposes:
- AtsApplication.status (current normalized stage)
- AtsApplication.original_status (raw provider stage)
- AtsStatus via GET /ats/{connection_id}/application/status (stage vocabulary)
It does not document:
- an ordered stage progression model
- a stage transition timeline endpoint
- historical stage transitions
So 'sequence modeling' based on transitions is not something you can claim as universally supported. If you need detailed stage history, you must rely on provider-native data via passthrough (if the provider exposes it), or accept that you're training on current state + timestamps.
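One defensible workaround is to build your own coarse history: on every sync, compare the stored status with the fresh one and append a transition record. This only captures changes observed between syncs, stamped with when you saw them rather than when they actually happened (a sketch; the Map stands in for your database):

const lastKnownStatus = new Map<string, string>(); // application_id -> status

function recordTransition(app: any, transitions: any[]) {
  const prev = lastKnownStatus.get(app.id);
  if (prev !== undefined && prev !== app.status) {
    transitions.push({
      application_id: app.id,
      from_status: prev,
      to_status: app.status,
      observed_at: new Date().toISOString(), // sync time, not true transition time
    });
  }
  lastKnownStatus.set(app.id, app.status);
}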
Step 6: Dataset scale and sync strategy for training
Pagination constraints
ATS list endpoints paginate with:
- limit (default 100)
- offset (0-based)
The documentation indicates:
- limit cannot exceed 100 records per request
- no total record limit is documented
What this implies for training
For any non-trivial dataset, you should:
- Sync into your own database (don't train directly off live list calls)
- Use incremental updates and/or webhooks to keep the dataset current
- Train from your DB snapshots
Step 7: Incremental refresh and retraining loop
All ATS list endpoints support:
updated_gte
So a safe retraining loop looks like:
- Initial backfill (paginate until exhaustion)
- Store a watermark per object type (last updated_at)
- Periodically fetch deltas using updated_gte
- Update your DB
- Recompute features for changed entities
- Retrain or refresh your model on schedule (or when enough data changes)
Example incremental application pull:
async function listUpdatedApplications(connectionId: string, updatedSince: string) {
  const out = [];
  let offset = 0;
  const limit = 100;
  // Keep paginating: a delta window can easily exceed one page
  while (true) {
    const page = await sdk.ats.listAtsApplications({
      connectionId,
      updatedGte: updatedSince,
      sort: 'updated_at',
      order: 'asc',
      limit,
      offset,
    });
    if (!page || page.length === 0) break;
    out.push(...page);
    offset += limit;
  }
  return out;
}
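Wrapped in a persisted watermark, the refresh loop looks like this. A sketch: loadWatermark and saveWatermark are hypothetical stand-ins for your own storage.

declare function loadWatermark(key: string): Promise<string>;
declare function saveWatermark(key: string, value: string): Promise<void>;

async function refreshApplications(connectionId: string) {
  const since = await loadWatermark('applications'); // hypothetical storage helper
  const updated = await listUpdatedApplications(connectionId, since);
  for (const app of updated) {
    // Upsert into your DB and mark features stale for recomputation
  }
  if (updated.length > 0) {
    // Results are sorted by updated_at ascending, so the last element is newest
    await saveWatermark('applications', updated[updated.length - 1].updated_at);
  }
}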
For near-real-time workflows, ATS webhooks exist, but:
- filter keys vary by integration and object
- filters are only supported for virtual webhooks
- native webhooks don't accept filters
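If you do consume webhooks, the handler can feed the same refresh path. A minimal Express sketch; the payload shape here (changed objects carried in the request body) is an assumption to verify against your webhook configuration:

import express from 'express';

const webhookApp = express();
webhookApp.use(express.json());

webhookApp.post('/webhooks/ats', (req, res) => {
  // Assumption: the body carries the changed objects; confirm the exact
  // shape for your integration before relying on it
  const objects = Array.isArray(req.body?.data) ? req.body.data : [];
  for (const obj of objects) {
    // Upsert into your DB and mark features stale for retraining
  }
  res.sendStatus(200);
});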
This is enough to build a continuous learning loop without polling every few minutes.
Step 8: What to surface as 'AI insights' (without overclaiming)
If you're training models, your product outputs should map to observable data:
- Candidate-job match score (based on job description + experience/resume text when available)
- Candidate ranking for a job
- Similar candidate retrieval (embedding search)
- Pipeline conversion insights (only if outcomes exist)
- Suggested next actions (based on application status + interview cadence)
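For example, a candidate-job match score can start as plain cosine similarity between a job-description embedding and a resume or experience embedding (a sketch; how you produce and store the embeddings is up to your stack):

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}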
Be explicit in UX about missing data:
- 'Resume unavailable for this ATS integration'
- 'Offer outcomes not returned by provider'
- 'Hire timestamp not present'
That transparency prevents trust issues.
Closing thoughts
Training AI on ATS data is viable when you treat ATS systems as they are: incomplete, inconsistent, and provider-dependent.
Unified's ATS API gives you the right foundation:
- normalized candidates, jobs, applications, interviews, and documents
- consistent pagination and incremental update patterns
- optional webhooks for continuous updates
From there, the difference between a useful model and a fragile one is whether your pipeline is honest about missing data, builds labels defensibly, and stores training-ready datasets in your own system.