Evaluating AI fairness and accuracy in hiring models

In this post:
  • How we evaluated the predictive accuracy and fairness of several top commercial LLMs
  • Why general-purpose models underperform in candidate-job matching
  • A detailed overview of our bias auditing methodology and findings

Hiring decisions are high stakes. They affect real people, their careers, and their livelihoods. Yet as AI models grow more powerful, they’re increasingly being considered for resume screening and candidate matching, sometimes without a clear understanding of whether they’re actually designed for that purpose.

At Eightfold, we rigorously evaluated a range of leading commercial LLMs, from OpenAI, Google, Anthropic, Meta, and others, on a task that matters to us: candidate-job matching. We benchmarked these models against our domain-specific Match Score system to measure two things:

  1. How accurate are they at predicting candidate-job fit?
  2. How fair are their predictions across race, gender, and intersectional groups?

This post walks through how we designed that benchmark, what we learned, and why task-specific models matter when fairness is a requirement, not an afterthought. We found that accuracy and fairness don’t have to be a trade-off: a well-designed algorithm can deliver both accurate hiring predictions and fair outcomes. Our paper is published here.

Why general LLMs fall short in hiring contexts

Large language models are trained on broad, internet-scale corpora to generalize across countless downstream tasks. They can write code, summarize text, and answer complex queries. But hiring is more nuanced: candidate-job matching is more than just another text-generation task.

Candidate-job matching requires:

  • Domain context (e.g., understanding how a Data Engineer evolves into an ML Ops Lead)
  • Structured signal integration (skills, experience, job title relevance, recency of roles)
  • A rigorous fairness boundary, especially in jurisdictions with legal constraints

These aren’t just nice-to-haves. They’re essential for a model to be trusted in real-world hiring workflows.

When we evaluated general-purpose LLMs, we found performance gaps on both accuracy and equity.

Our evaluation framework

To ensure consistency, we used a shared input-output pipeline across all models.

Data:

  • ~10,000 anonymized candidate-job pairs from real-world hiring scenarios over the past 24 months
  • Self-reported race and gender for fairness analysis (used only for evaluation)

Inputs:

  • Structured job descriptions and parsed resumes, with all personal identifiers masked
  • Input prompts guiding LLMs to evaluate “fit” based on six neutral, relevance-based criteria
  • Identical formatting across all systems to ensure a fair comparison (a hypothetical sketch of this shared format follows this list)
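For illustration, here is a hypothetical sketch of what a shared prompt pipeline of this kind can look like. The wording, the 0-100 scale, and the six criteria named in the template are placeholders invented for this post, not the exact prompt used in the benchmark.

```python
# Hypothetical illustration (not the benchmark's actual prompt): every model
# receives identically formatted, identifier-masked inputs and the same
# instruction to rate fit on neutral, relevance-based criteria.
PROMPT_TEMPLATE = """You are evaluating candidate-job fit.

Job description (identifiers masked):
{job_description}

Candidate resume (identifiers masked):
{resume}

Rate the fit from 0 to 100 using only these relevance-based criteria:
skills match, experience level, job-title relevance, recency of roles,
domain alignment, and education relevance. Return only the number."""


def build_prompt(job_description: str, resume: str) -> str:
    """Format one candidate-job pair identically for every model under test."""
    return PROMPT_TEMPLATE.format(job_description=job_description, resume=resume)
```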

Models evaluated:

  • GPT-4.1 and GPT-4o (OpenAI)
  • Claude 3.5 (Anthropic)
  • Gemini 2.5 Flash (Google)
  • LLaMA 3.1 and LLaMA 4-Maverick (Meta)
  • DeepSeek R1 (DeepSeek)
  • Eightfold Match Score (supervised ML model trained on hiring-specific data)

Metrics we measured

We focused on two categories of evaluation metrics:

1. Predictive Accuracy

  • ROC AUC: Measures model discrimination ability across thresholds
  • PR AUC: Emphasizes precision in imbalanced datasets
  • F1 Score: Balance between precision and recall at the operational threshold (see the sketch after this list)
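For readers who want a concrete reference point, here is a minimal sketch of how these three metrics are commonly computed with scikit-learn. The labels, scores, and the 0.5 operating threshold below are placeholders for illustration, not values or code from our benchmark.

```python
# Illustrative only: standard scikit-learn calls for the three accuracy metrics.
# `y_true` and `y_score` are made-up binary match labels and model scores.
import numpy as np
from sklearn.metrics import average_precision_score, f1_score, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # ground-truth match outcomes
y_score = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])  # model scores in [0, 1]

roc_auc = roc_auc_score(y_true, y_score)             # discrimination across all thresholds
pr_auc = average_precision_score(y_true, y_score)    # precision-recall AUC, robust to imbalance
f1 = f1_score(y_true, (y_score >= 0.5).astype(int))  # F1 at an assumed 0.5 operating threshold

print(f"ROC AUC={roc_auc:.3f}  PR AUC={pr_auc:.3f}  F1={f1:.3f}")
```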

2. Fairness

We applied the four-fifths rule, a standard established by the EEOC, to assess fairness. It says a model’s scoring rate for any group (by race, gender, or both) should be at least 80% of the highest-scoring group’s rate. We applied this rule across both individual and intersectional subgroups; a minimal sketch of the check follows the list below.

  • Impact Ratio (IR): The lowest-scoring group’s scoring rate divided by the highest-scoring group’s rate
  • Scoring Rate (SR): How frequently each group was scored above the selection threshold
  • Subgroup analysis across gender, race, and race+gender (intersectional)
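To make the check concrete, here is a minimal sketch of computing scoring rates and impact ratios on a small, made-up pandas DataFrame. The group labels, scores, and selection threshold are placeholders, not data from the study.

```python
# Illustrative only: scoring rate (SR) per group and impact ratio (IR)
# under the four-fifths rule, on made-up data.
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B"],          # demographic group label
    "score": [0.82, 0.55, 0.91, 0.64, 0.38, 0.77, 0.49],   # model score per candidate-job pair
})
threshold = 0.6  # assumed selection threshold

# Scoring rate: share of each group scored above the threshold
sr = df.assign(selected=df["score"] >= threshold).groupby("group")["selected"].mean()

# Impact ratio: each group's scoring rate relative to the highest-scoring group
ir = sr / sr.max()

print(sr.round(3))
print(ir.round(3))
print("Passes the four-fifths rule:", bool((ir >= 0.8).all()))
```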

Key findings

  1. Eightfold’s Task-Specific Match Score Models Outperform
    Match Score achieved a ROC AUC of 0.85 and F1 of 0.753, significantly outperforming every general LLM tested (top LLM AUC: 0.77).
  2. LLMs Show Consistent Fairness Gaps
    LLMs like GPT-4o underperformed on racial and intersectional subgroups.
    • Match Score’s lowest race IR was 0.957.
    • GPT-4o dropped to 0.774; Claude and Gemini fell even further. DeepSeek R1 performed best among the LLMs, but only at a 0.809 IR.
    • All LLMs fell below the 0.80 threshold for at least one intersectional group, while Match Score maintained a 0.906 IR.
  3. Model Size Doesn’t Equal Accuracy or Fairness
    Even the most powerful open- and closed-weight LLMs (e.g., GPT-4o, LLaMA 4-Maverick) showed wide variation in scoring rates across demographic groups.
  4. Fairness and Accuracy Can Be Aligned
    The Eightfold Match Score model was the only one to maintain high accuracy and fairness simultaneously, challenging the assumption that there has to be a trade-off between the two.

What we did differently

Our model wasn’t trained on general, scraped internet text, which is filled with inherent biases. It was purpose-built for candidate-job matching. That allowed us to:

  • Use features like career trajectory, job transitions, and skill evolution
  • Integrate structured resume/job fields rather than treating them only as raw text (illustrated in the sketch after this list)
  • Calibrate model parameters using real hiring outcomes
  • Audit fairness continuously across subgroups
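As a rough illustration of what structured signal integration can mean in practice, a task-specific model can consume explicit features rather than a single block of resume text. The field names and types below are assumptions made for this post, not Eightfold’s actual feature schema.

```python
# Hypothetical feature schema, for illustration only: explicit, structured
# signals a matching model can score instead of raw resume text.
from dataclasses import dataclass, field


@dataclass
class CandidateJobFeatures:
    skills_overlap: float                  # fraction of required skills the candidate holds
    years_relevant_experience: float       # experience in roles related to the target job
    title_relevance: float                 # e.g., how a "Data Engineer" history maps to an "ML Ops Lead" role
    months_since_last_relevant_role: int   # recency of relevant experience
    career_trajectory: list[str] = field(default_factory=list)  # ordered prior titles (job transitions)
```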

This purpose-built approach is particularly important when decisions affect livelihoods.

Final takeaway

General-purpose LLMs are powerful, but when it comes to hiring, specificity wins.

If you’re building or evaluating AI models for hiring, test for both accuracy and fairness, and don’t assume you need to compromise between the two.

Next up

Read our published paper on how Eightfold’s Match Score compares to LLMs here, and read our bias audit summary for a deeper look into our evaluation methods and regulatory compliance journey. 
