Evaluating AI fairness and accuracy in hiring models

In this post:
  • How we evaluated the predictive accuracy and fairness of several top commercial LLMs
  • Why general-purpose models underperform in candidate-job matching
  • A detailed overview of our bias auditing methodology and findings

Hiring decisions are high stakes. They affect real people, their careers, and their livelihoods. Yet as AI models grow more powerful, they’re increasingly being considered for resume screening and candidate matching, sometimes without a clear understanding of whether they’re actually designed for that purpose.

At Eightfold, we rigorously evaluated a range of leading commercial LLMs, from OpenAI, Google, Anthropic, Meta, and others, on a task that matters to us: candidate-job matching. We benchmarked these models against our domain-specific Match Score system to measure two things:

  1. How accurate are they at predicting candidate-job fit?
  2. How fair are their predictions across race, gender, and intersectional groups?

This post walks through how we designed that benchmark, what we learned, and why task-specific models matter when fairness is a requirement, not an afterthought. We found that accuracy and fairness don’t have to be a trade-off: a well-designed algorithm can deliver both accurate hiring predictions and fair outcomes. Our paper is published here.

Why general LLMs fall short in hiring contexts

Large language models are trained on broad, internet-scale corpora to generalize across countless downstream tasks. They can write code, summarize text, and answer complex queries. But hiring is more nuanced: candidate-job matching is more than just another text-generation task.

Candidate-job matching requires:

  • Domain context (e.g., understanding how a Data Engineer evolves into an ML Ops Lead)
  • Structured signal integration (skills, experience, job title relevance, recency of roles)
  • A rigorous fairness boundary, especially in jurisdictions with legal constraints

These aren’t just nice-to-haves. They’re essential for a model to be trusted in real-world hiring workflows.

When we evaluated general-purpose LLMs, we found performance gaps on both accuracy and equity.

Our evaluation framework

To ensure consistency, we used a shared input-output pipeline across all models.

Data:

  • ~10,000 anonymized candidate-job pairs from real-world hiring scenarios over the past 24 months
  • Self-reported race and gender for fairness analysis (used only for evaluation)

Inputs:

  • Structured job descriptions and parsed resumes, with all personal identifiers masked
  • Input prompts guiding LLMs to evaluate “fit” based on six neutral, relevance-based criteria
  • Identical formatting across all systems to ensure a fair comparison (a hypothetical sketch of this shared format follows this list)
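For illustration, here is a hypothetical sketch of what a shared prompt pipeline of this kind can look like. The wording, the 0-100 scale, and the six criteria named in the template are placeholders invented for this post, not the exact prompt used in the benchmark.

```python
# Hypothetical illustration (not the benchmark's actual prompt): every model
# receives identically formatted, identifier-masked inputs and the same
# instruction to rate fit on neutral, relevance-based criteria.
PROMPT_TEMPLATE = """You are evaluating candidate-job fit.

Job description (identifiers masked):
{job_description}

Candidate resume (identifiers masked):
{resume}

Rate the fit from 0 to 100 using only these relevance-based criteria:
skills match, experience level, job-title relevance, recency of roles,
domain alignment, and education relevance. Return only the number."""


def build_prompt(job_description: str, resume: str) -> str:
    """Format one candidate-job pair identically for every model under test."""
    return PROMPT_TEMPLATE.format(job_description=job_description, resume=resume)
```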

Models evaluated:

  • GPT-4.1 and GPT-4o (OpenAI)
  • Claude 3.5 (Anthropic)
  • Gemini 2.5 Flash (Google)
  • LLaMA 3.1 and LLaMA 4-Maverick (Meta)
  • DeepSeek R1 (DeepSeek)
  • Eightfold Match Score (supervised ML model trained on hiring-specific data)

Metrics we measured

We focused on two categories of evaluation metrics:

1. Predictive Accuracy

  • ROC AUC: Measures model discrimination ability across thresholds
  • PR AUC: Emphasizes precision in imbalanced datasets
  • F1 Score: Balance between precision and recall at the operational threshold (see the sketch after this list)
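For readers who want a concrete reference point, here is a minimal sketch of how these three metrics are commonly computed with scikit-learn. The labels, scores, and the 0.5 operating threshold below are placeholders for illustration, not values or code from our benchmark.

```python
# Illustrative only: standard scikit-learn calls for the three accuracy metrics.
# `y_true` and `y_score` are made-up binary match labels and model scores.
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # ground-truth match outcomes
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])  # model scores in [0, 1]

roc_auc = roc_auc_score(y_true, y_score)             # discrimination across all thresholds
pr_auc = average_precision_score(y_true, y_score)    # precision-recall AUC, robust to imbalance
f1 = f1_score(y_true, (y_score >= 0.5).astype(int))  # F1 at an assumed 0.5 operating threshold

print(f"ROC AUC={roc_auc:.3f}  PR AUC={pr_auc:.3f}  F1={f1:.3f}")
```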

2. Fairness

We applied the four-fifths rule, a standard established by the EEOC, to assess fairness. It says a model’s scoring rate for any group (by race, gender, or both) should be at least 80% of the highest-scoring group’s rate. We applied this rule across both individual and intersectional subgroups; a minimal sketch of the check follows the list below.

  • Impact Ratio (IR): The lowest-scoring group’s scoring rate divided by the highest-scoring group’s rate
  • Scoring Rate (SR): How frequently each group was scored above the selection threshold
  • Subgroup analysis across gender, race, and race+gender (intersectional)
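To make the check concrete, here is a minimal sketch of computing scoring rates and impact ratios on a small, made-up pandas DataFrame. The group labels, scores, and selection threshold are placeholders, not data from the study.

```python
# Illustrative only: scoring rate (SR) per group and impact ratio (IR)
# under the four-fifths rule, on made-up data.
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B"],          # demographic group label
    "score": [0.82, 0.55, 0.91, 0.64, 0.38, 0.77, 0.49],   # model score per candidate-job pair
})
threshold = 0.6  # assumed selection threshold

# Scoring rate: share of each group scored above the threshold
sr = df.assign(selected=df["score"] >= threshold).groupby("group")["selected"].mean()

# Impact ratio: each group's scoring rate relative to the highest-scoring group
ir = sr / sr.max()

print(sr.round(3))
print(ir.round(3))
print("Passes the four-fifths rule:", bool((ir >= 0.8).all()))
```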

Key findings

  1. Eightfold’s Task-Specific Match Score Models Outperform
    Match Score achieved a ROC AUC of 0.85 and F1 of 0.753, significantly outperforming every general LLM tested (top LLM AUC: 0.77).
  2. LLMs Show Consistent Fairness Gaps
    LLMs like GPT-4o underperformed on racial and intersectional subgroups.
    • Match Score’s lowest race IR was 0.957.
    • GPT-4o dropped to 0.774; Claude and Gemini fell even further. DeepSeek R1 performed best among the LLMs, but only at a 0.809 IR.
    • All LLMs fell below the 0.80 threshold for at least one intersectional group, while Match Score maintained a 0.906 IR.
  3. Model Size Doesn’t Equal Accuracy or Fairness
    Even the most powerful open- and closed-weight LLMs (e.g., GPT-4o, LLaMA 4-Maverick) showed wide variation in scoring rates across demographic groups.
  4. Fairness and Accuracy Can Be Aligned
    The Eightfold Match Score model was the only one to maintain high accuracy and fairness simultaneously, challenging the assumption that there has to be a trade-off between the two.

What we did differently

Our model wasn’t trained on general, scraped internet text, which is filled with inherent biases. It was purpose-built for candidate-job matching. That allowed us to:

  • Use features like career trajectory, job transitions, and skill evolution
  • Integrate structured resume/job fields rather than treating them only as raw text (illustrated in the sketch after this list)
  • Calibrate model parameters using real hiring outcomes
  • Audit fairness continuously across subgroups
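As a rough illustration of what structured signal integration can mean in practice, a task-specific model can consume explicit features rather than a single block of resume text. The field names and types below are assumptions made for this post, not Eightfold’s actual feature schema.

```python
# Hypothetical feature schema, for illustration only: explicit, structured
# signals a matching model can score instead of raw resume text.
from dataclasses import dataclass, field


@dataclass
class CandidateJobFeatures:
    skills_overlap: float                  # fraction of required skills the candidate holds
    years_relevant_experience: float       # experience in roles related to the target job
    title_relevance: float                 # e.g., how a "Data Engineer" history maps to an "ML Ops Lead" role
    months_since_last_relevant_role: int   # recency of relevant experience
    career_trajectory: list[str] = field(default_factory=list)  # ordered prior titles (job transitions)
```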

This purpose-built approach is particularly important when decisions affect livelihoods.

Final takeaway

General-purpose LLMs are powerful, but when it comes to hiring, specificity wins.

If you’re building or evaluating AI models for hiring, test for both accuracy and fairness, and don’t assume you need to compromise between the two.

Next up

Read our published paper on how Eightfold’s Match Score compares to LLMs here, and read our bias audit summary for a deeper look into our evaluation methods and regulatory compliance journey. 
