
Resume screening: comparing AI models for fair and accurate CV parsing

How to benchmark AI models for resume extraction with a focus on accuracy, speed, and demographic fairness.

10 min read

For every open position, recruiters receive hundreds of applications. Reading each thoroughly is impossible. But skimming leads to missed candidates and unconscious bias.

AI-powered resume screening can help—but the model you choose matters more than you might think. HR document extraction isn’t just about accuracy. It’s about fairness.


The resume extraction challenge

Resumes are uniquely challenging documents:

| Challenge | Impact |
| --- | --- |
| Format variety | PDFs, Word docs, creative layouts, ATS-formatted |
| Multilingual content | Names, education, certifications from global candidates |
| Implicit structure | Section headers vary, chronology differs by culture |
| Demographic sensitivity | Names, schools, locations can encode protected characteristics |
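
Format variety alone calls for a normalization step before any model sees a resume. Here is a minimal sketch, assuming resumes arrive as PDF or Word files and that the pypdf and python-docx packages are available; image-only or highly creative layouts would additionally need OCR, which is out of scope here.

```python
# Normalize varied resume formats to plain text before extraction.
# Assumes the pypdf and python-docx packages; image-only or highly
# creative layouts would additionally need OCR.
from pathlib import Path

from docx import Document
from pypdf import PdfReader


def resume_to_text(path: str) -> str:
    """Extract raw text from a PDF or Word (.docx) resume."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported resume format: {suffix}")
```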

Example scenario

Sample input

A software engineer resume containing contact details, two work experience entries, graduate and undergraduate education, and a categorized skills section.

Sample output

```json
{
  "contact": {
    "name": "Priya Krishnamurthy",
    "email": "priya.k@email.com",
    "phone": "+1-555-0123",
    "location": "San Francisco, CA",
    "linkedin": "linkedin.com/in/priyak"
  },
  "experience": [
    {
      "title": "Senior Software Engineer",
      "company": "TechCorp Inc.",
      "location": "San Francisco, CA",
      "start_date": "2021-03",
      "end_date": null,
      "current": true,
      "highlights": [
        "Led migration to microservices architecture",
        "Reduced API latency by 40%",
        "Mentored 3 junior engineers"
      ]
    },
    {
      "title": "Software Engineer",
      "company": "StartupXYZ",
      "location": "Palo Alto, CA",
      "start_date": "2018-06",
      "end_date": "2021-02",
      "current": false,
      "highlights": [
        "Built real-time data pipeline processing 1M events/day",
        "Implemented CI/CD reducing deploy time by 60%"
      ]
    }
  ],
  "education": [
    {
      "degree": "M.S. Computer Science",
      "institution": "Stanford University",
      "graduation_year": 2018
    },
    {
      "degree": "B.Tech Computer Science",
      "institution": "IIT Bombay",
      "graduation_year": 2016
    }
  ],
  "skills": {
    "languages": ["Python", "Go", "TypeScript", "SQL"],
    "frameworks": ["React", "FastAPI", "Kubernetes"],
    "tools": ["AWS", "Docker", "Terraform"]
  }
}
```
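
To produce output like this programmatically, an extraction call can be as small as the sketch below. It uses the OpenAI Python SDK's JSON mode; the prompt wording, schema hint, and choice of gpt-4o are illustrative assumptions, and other providers offer equivalent structured-output options.

```python
# Minimal extraction call using the OpenAI Python SDK's JSON mode.
# The prompt wording, schema hint, and model name are illustrative.
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCHEMA_HINT = (
    "contact {name, email, phone, location, linkedin}; "
    "experience [{title, company, location, start_date, end_date, current, highlights}]; "
    "education [{degree, institution, graduation_year}]; "
    "skills {languages, frameworks, tools}"
)


def extract_resume(resume_text: str, model: str = "gpt-4o") -> dict:
    prompt = (
        "Extract the resume below into JSON with this structure: "
        + SCHEMA_HINT
        + ". Use null for unknown values and return only JSON.\n\nResume:\n"
        + resume_text
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```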

Model comparison

| # | Model | Accuracy | Cost | Time |
| --- | --- | --- | --- | --- |
| 1 | GPT-4o | 94.6% | $0.026 | 2.4s |
| 2 | Gemini 2.0 Flash | 92.1% | $0.002 | 1.0s |
| 3 | GPT-4o-mini | 89.8% | $0.003 | 1.2s |
| 4 | Claude 3.5 Haiku | 87.4% | $0.009 | 0.9s |

GPT-4o delivers the best accuracy (94.6%), Gemini 2.0 Flash the lowest cost ($0.002), and Claude 3.5 Haiku the fastest response (0.9s).
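
How you arrive at numbers like these matters as much as the numbers themselves. The sketch below scores field-level accuracy and average latency over a hand-labeled test set; pass in an extraction function like the one sketched earlier. Cost would come from each provider's token usage and pricing and is omitted here.

```python
# Rough benchmark loop: field-level accuracy and average latency per
# model over resumes paired with hand-labeled ground-truth JSON.
import time


def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {"experience.0.title": value, ...}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        return {prefix: obj}
    flat = {}
    for key, value in items:
        flat.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    return flat


def benchmark(extract_fn, model, labeled_resumes):
    """labeled_resumes: list of (resume_text, ground_truth_dict) pairs."""
    correct = total = 0
    latencies = []
    for text, truth in labeled_resumes:
        start = time.perf_counter()
        predicted = flatten(extract_fn(text, model=model))
        latencies.append(time.perf_counter() - start)
        expected = flatten(truth)
        total += len(expected)
        correct += sum(predicted.get(field) == value for field, value in expected.items())
    return {
        "model": model,
        "accuracy": correct / total,
        "avg_latency_s": sum(latencies) / len(latencies),
    }
```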

The fairness dimension

Standard accuracy metrics can hide demographic disparities. Analyzing extraction accuracy across name origin categories reveals significant differences:

| # | Model | W. European | E. European | Middle East | African | Asian | Variance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | GPT-4o | 96.2% | 94.8% | 93.6% | 92.4% | 94.1% | 3.8% |
| 2 | Gemini 2.0 Flash | 94.6% | 92.1% | 89.8% | 88.4% | 91.2% | 6.2% |
| 3 | GPT-4o-mini | 93.2% | 89.6% | 86.4% | 83.8% | 87.4% | 9.4% |

Lower variance = more equitable performance

GPT-4o shows only 3.8% variance across name origins, while the budget-tier GPT-4o-mini exceeds 9%, a gap that creates a systematic disadvantage for candidates with non-Western names.
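
The variance column is the spread between the best- and worst-served groups (for GPT-4o, 96.2% minus 92.4% = 3.8%). A sketch of how to compute it, assuming each test case is tagged with a name-origin category and reusing the flatten helper from the benchmark sketch above:

```python
# Per-group accuracy and "variance" (max minus min group accuracy),
# assuming each labeled test case is tagged with a name-origin category
# and reusing flatten() from the benchmark sketch above.
from collections import defaultdict


def fairness_report(extract_fn, model, labeled_resumes):
    """labeled_resumes: list of (resume_text, ground_truth_dict, name_origin)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for text, truth, origin in labeled_resumes:
        predicted = flatten(extract_fn(text, model=model))
        for field, value in flatten(truth).items():
            total[origin] += 1
            correct[origin] += predicted.get(field) == value
    per_group = {origin: correct[origin] / total[origin] for origin in total}
    return {
        "per_group": per_group,
        "variance": max(per_group.values()) - min(per_group.values()),
    }
```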


Why fairness matters for extraction

Poor extraction of a candidate's name or education institution doesn't just affect an accuracy metric; it can cause a qualified applicant to be misranked or missed entirely.

A model that performs 5% worse on non-Western names systematically disadvantages those candidates, as the rough calculation below illustrates.
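
A back-of-the-envelope view of that gap, where the application volume and affected share are purely illustrative assumptions:

```python
# Back-of-the-envelope impact of a 5-point accuracy gap.
# All three inputs are illustrative assumptions, not measurements.
applications = 1000     # applications screened for one role
affected_share = 0.40   # share of applicants with non-Western names
accuracy_gap = 0.05     # extra extraction-error rate for that group

extra_garbled = applications * affected_share * accuracy_gap
print(f"~{extra_garbled:.0f} extra resumes with corrupted fields")  # ~20 per role
```

Twenty corrupted records per role is twenty candidates whose experience or education may never be matched against the job requirements.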


Transformation metrics

What does fair, accurate resume screening deliver?

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Screening time per role | 23 hours | 3 hours | -87% |
| Candidate diversity | 18% | 31% | +72% |
| Interview-to-hire ratio | 8:1 | 5:1 | +37% |

By ensuring consistent extraction quality across demographics, you’re not inadvertently filtering out qualified candidates.


Key insights for HR document processing

1. Measure demographic variance, not just accuracy

Overall accuracy can hide systematic biases. Test extraction performance across name origins and education backgrounds.

2. Budget models have higher variance

Cost savings on per-document processing may come at the cost of fairness. Calculate the true cost, including missed candidates; a back-of-the-envelope comparison is sketched below.
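
One way to make that calculation concrete, reading the comparison table's cost figures as cost per resume and treating every other number as an assumed placeholder to replace with your own:

```python
# Illustrative "true cost": per-resume price plus the cost of qualified
# candidates lost to extraction errors. Per-resume prices and accuracies
# come from the comparison table; every other number is an assumption.
resumes_per_year = 50_000
qualified_rate = 0.10             # assumed share of applicants who are qualified
cost_per_missed_candidate = 400   # assumed cost of losing one qualified applicant


def true_cost(price_per_resume: float, accuracy: float) -> float:
    processing = resumes_per_year * price_per_resume
    # Pessimistic assumption: an extraction error on a qualified
    # applicant's resume costs you that candidate.
    missed = resumes_per_year * qualified_rate * (1 - accuracy) * cost_per_missed_candidate
    return processing + missed


print(f"GPT-4o:      ${true_cost(0.026, 0.946):,.0f}")   # ≈ $109,300
print(f"GPT-4o-mini: ${true_cost(0.003, 0.898):,.0f}")   # ≈ $204,150
```

Under these deliberately rough assumptions, the cheaper model becomes the more expensive choice once missed candidates are priced in.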

3. Resume format diversity matters

Test on your actual resume formats—creative designs, ATS-formatted, international CVs. Performance varies significantly.

4. Speed enables better candidate experience

Faster screening means faster responses to candidates. Top talent won’t wait.


Try it yourself

LLMCompare helps HR teams evaluate models for resume extraction with a focus on both accuracy and fairness. Upload your resumes, define your extraction schema, and measure performance across demographic categories.

Because fair hiring starts with fair data extraction.