General-purpose LLMs fall short in fair, accurate hiring — here’s what to use instead

General-purpose LLMs may impress, but fall short in fairness and accuracy in hiring. See why purpose-built models, like our Match Score, deliver the compliance and equal opportunity your organization needs.

5 min read
  • General-purpose LLMs, like GPT-4 and Claude, aren’t trained to deliver hiring outcomes, which limits these tools’ accuracy in predicting candidate-job matches.
  • Fairness isn’t baked into general-purpose LLMs, and our study found they can systematically underserve underrepresented groups.
  • A purpose-built model, like Eightfold Match Score, delivers superior accuracy and fairness, making it the safe choice for hiring decisions.

General-purpose large language models (LLMs) like GPT-4 and Claude are powerful. These tools can summarize articles, write code, and even reason. 

But does that make these LLMs suitable for hiring decisions? 

Our latest research, “Evaluating the promise and pitfalls of LLMs in hiring decisions,” shows the answer is: not without significant risks.

We tested how well general-purpose LLMs match candidates to jobs by comparing these tools to our specialized Match Score model. Using 10,000 real candidate-job pairs, we stripped personal details from résumés, standardized the data, and asked each model to rate the match. We then measured each model’s accuracy in ranking candidates and its fairness in treating underrepresented groups.

Our findings showed our domain-specific Match Score model — which also includes hiring outcomes data across organizations, industries, roles, and languages — achieves superior predictive accuracy and more equitable outcomes across diverse demographic groups, outperforming top general-purpose LLMs on every metric.

But there’s even more to the story. Let’s break down our findings with these takeaways.

[Ed’s note: While Eightfold Match Score includes its own custom LLM, for the sake of clarity in this article, LLM will refer only to general-purpose LLMs used in the study.]

Related content: Explore the crucial impact of AI in hiring decisions, from candidate screening to talent acquisition.

LLMs aren’t trained on hiring outcomes

General-purpose LLMs are trained on enormous amounts of text from the internet — everything from Reddit posts and Wikipedia articles to books and news stories. This breadth gives LLMs incredible general knowledge and language understanding that is still evolving.

But here’s the catch: general-purpose LLMs have never been trained on hiring outcomes.

That means while these models have “read” résumés, job descriptions, and even career advice, general-purpose LLMs have never been taught what a successful hire looks like. These models struggle to distinguish among a candidate who was hired, one who only made it to the interview stage, and one who was rejected.

In contrast, the purpose-built Eightfold Match Score model is trained on:

  • Real résumés with verified skills, experiences, and qualifications.
  • Real job descriptions reflecting actual role requirements across industries.
  • Real outcomes, such as data indicating which candidates got interviews, offers, or hires.

Fine-tuned models, like Match Score, have task-specific feedback baked in. General-purpose LLMs don’t, and that differentiator is critical.

General-purpose LLMs lack supervised, labeled training data mapping candidate features to actual hiring success, while our custom-built Eightfold Match Score model uses millions of data points to prioritize skills match, industry fit, and more.

LLMs are not optimized for fairness

A critical misunderstanding in applying general-purpose LLMs to hiring is the assumption that these models will be inherently fair. They are not, and that’s because fairness isn’t part of what these models are optimized to do.

General-purpose LLMs, like GPT-4, Claude, and Gemini, are trained to be fluent, useful, and safe — not fair decision-makers. These models’ primary objective is to predict the next word in a sentence based on patterns learned from massive internet-scale datasets, and the results they deliver can vary with the prompt given.

While fine-tuning steps like reinforcement learning from human feedback are applied to make general-purpose LLMs more helpful or less toxic, those steps don’t teach these models how to avoid bias in structured decision-making.

In hiring, fairness isn’t a byproduct — it’s a requirement. 

A model can sound inclusive while still reinforcing inequities behind the scenes. For example, it may favor polished résumés or elite institutions, even if those aren’t the best predictors of job success. It may also replicate subtle language-based signals that correlate with race, gender, or socioeconomic status — without even “knowing” it’s doing so.

Because general-purpose LLMs aren’t trained on labeled hiring outcomes or explicitly penalized for demographic bias, they have no mechanism to learn equitable treatment. LLMs reflect the world as it is, not how it should be.

In contrast, purpose-built hiring models, like Match Score, are trained with fairness objectives in mind. These models are explicitly evaluated and optimized for equitable outcomes across race, gender, and intersectional groups. 

Fairness isn’t an afterthought — it’s engineered into the core.

Here is how Eightfold Match Score measures against general-purpose LLMs in fairness metrics.

Fairness metrics fall short using LLMs

One of the most important findings from our research came from digging deeper into the metrics behind fairness. General-purpose LLMs, when unmodified, show significant fairness gaps — particularly across intersectional dimensions — because they are trained on uncurated public information.

For example: 

  • The best general-purpose LLM we tested had a lowest intersectional impact ratio (IR) of 0.773, below the 0.8 threshold of the “four-fifths rule” commonly used to flag adverse impact.
  • Our custom-built Match Score model achieved an IR of 0.906, a much narrower gap. 
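For readers unfamiliar with the metric: an impact ratio compares each group’s selection rate to that of the most-selected group, so a value of 1.0 means perfect parity. Here is a minimal sketch of that standard calculation, using invented numbers for illustration only (not data from our study):

```python
# Illustrative impact-ratio (IR) calculation: each group's selection
# rate divided by the highest group's selection rate. An IR below 0.8
# is the conventional "four-fifths rule" red flag for adverse impact.
# The candidate counts below are invented for illustration.

def impact_ratios(selections):
    """selections: dict mapping group -> (num_selected, num_candidates)."""
    rates = {g: sel / total for g, (sel, total) in selections.items()}
    best = max(rates.values())
    return {g: rate / best for g, rate in rates.items()}

example = {
    "group_a": (45, 100),  # 45% selection rate
    "group_b": (35, 100),  # 35% selection rate
}
print(impact_ratios(example))
```

Here group_b’s IR is 35/45 ≈ 0.778 — in the same territory as the 0.773 we measured for the best general-purpose LLM, and below the four-fifths threshold, whereas Match Score’s 0.906 clears it.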

In practical terms, this means that even when general-purpose LLMs are close on overall accuracy, they can systematically under-score candidates from certain minority groups. This happens because general-purpose LLMs reflect the biases present in the vast, uncurated internet text they’re trained on, and their output can shift depending on the prompt given.

Even if you mask names or remove explicit gender markers, general-purpose LLMs can still pick up on details, such as certain schools, job titles, or even sentence structure, that correlate with demographic attributes.

Fairness isn’t a nice-to-have — it’s critical. Bias mitigation must be baked into the model your recruiters use.

Why a purpose-built model matters

While it might seem logical that a purpose-built model would outperform a general-purpose LLM, additional findings in our research show that it’s not only a matter of performance. 

It’s about the deep risks of relying on general-purpose LLMs in high-stakes decisions like hiring. 

When we talk about AI in hiring, it’s not just about getting good results. It’s about getting fair, consistent, and auditable results that support your business case and build trust with hiring managers.

In high-stakes areas like hiring, where decisions impact people’s lives and livelihoods, “good enough” isn’t enough. The fairness and accuracy provided by a purpose-built, domain-specific model isn’t just better — it’s essential to the hiring process.

Read the full research paper, “Evaluating the promise and pitfalls of LLMs in hiring decisions.”

Varun Kacholia is CTO and Co-founder of Eightfold AI.  

