
Contract analysis: finding the right AI model for legal documents

How to benchmark AI models for extracting key terms from contracts, with a focus on consistency across document complexity.


When a private equity firm acquires a company, the legal team reviews every contract. A mid-market M&A deal might have 2,000-5,000 contracts to review. Traditional review takes 4-6 weeks with 8-10 attorneys, costing $500,000+.

AI-assisted review can reduce that to 4-6 days. But choosing the wrong model means missed risks and potentially failed deals.

This guide explores how to benchmark AI models for contract analysis, with a focus on performance consistency across document complexity.


Example scenario

Sample input

A commercial software license agreement containing clauses on the parties, term, fees, liability, and change of control.

Sample output

{
  "parties": {
    "licensor": "TechCorp Solutions Inc.",
    "licensee": "Acme Enterprises LLC",
    "effective_date": "2024-01-15"
  },
  "term": {
    "initial_period": "3 years",
    "renewal": "Auto-renewal for 1-year periods",
    "termination_notice": "90 days prior to renewal"
  },
  "financial": {
    "license_fee": 150000,
    "payment_terms": "Annual, due within 30 days of invoice",
    "price_escalation": "3% annually"
  },
  "liability": {
    "cap": "12 months of fees paid",
    "exclusions": ["IP indemnification", "gross negligence", "willful misconduct"],
    "consequential_damages": "Excluded except for IP claims"
  },
  "change_of_control": {
    "trigger": "50% ownership change",
    "consequence": "Termination right for non-changing party",
    "notice_period": "30 days"
  }
}
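To benchmark extractions like the one above, each model output has to be compared against a hand-labeled reference. The following is a minimal sketch of one way to do that, assuming exact-match scoring per field; the function names `flatten` and `field_accuracy` and the shortened example records are illustrative, not part of any particular tool.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into {"a.b": value}; lists stay as leaf values."""
    if isinstance(obj, dict):
        flat: dict[str, Any] = {}
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
        return flat
    return {prefix.rstrip("."): obj}


def field_accuracy(predicted: dict, gold: dict) -> float:
    """Share of reference fields that the prediction matches exactly."""
    gold_flat = flatten(gold)
    pred_flat = flatten(predicted)
    if not gold_flat:
        raise ValueError("reference annotation has no fields")
    correct = sum(
        1
        for path, value in gold_flat.items()
        if path in pred_flat and pred_flat[path] == value
    )
    return correct / len(gold_flat)


# Shortened, illustrative records based on the sample output above.
gold = {
    "parties": {
        "licensor": "TechCorp Solutions Inc.",
        "licensee": "Acme Enterprises LLC",
    },
    "financial": {"license_fee": 150000},
    "change_of_control": {"notice_period": "30 days"},
}
predicted = {
    "parties": {
        "licensor": "TechCorp Solutions Inc.",
        "licensee": "Acme Enterprises LLC",
    },
    "financial": {"license_fee": 150000},
    "change_of_control": {"notice_period": "60 days"},  # deliberate error
}

print(field_accuracy(predicted, gold))  # 3 of 4 fields match: 0.75
```

Exact matching is strict: "90 days" and "ninety days" would count as a mismatch, so free-text fields may need normalization or a looser comparison before scoring.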

Model comparison

#  Model             Accuracy  Cost    Time
1  GPT-4o            91.4%     $0.048  3.8s
2  Gemini 2.0 Flash  87.6%     $0.005  1.8s
3  GPT-4o-mini       84.2%     $0.006  2.1s
4  Claude 3.5 Haiku  82.8%     $0.016  1.6s

Best accuracy: 91.4% (GPT-4o)
Lowest cost: $0.005 (Gemini 2.0 Flash)
Fastest: 1.6s (Claude 3.5 Haiku)
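To put the Cost column in deal-level terms, the figures can be scaled to the 2,000-5,000 contracts of a mid-market deal. This sketch assumes the Cost column is the cost of processing one contract, which the table does not state explicitly; `projected_cost` is an illustrative helper.

```python
# Costs copied from the model comparison table.
# ASSUMPTION: each figure is the cost of processing a single contract.
COST_PER_CONTRACT = {
    "GPT-4o": 0.048,
    "Gemini 2.0 Flash": 0.005,
    "GPT-4o-mini": 0.006,
    "Claude 3.5 Haiku": 0.016,
}


def projected_cost(cost_per_contract: float, contract_count: int) -> float:
    """Total model cost in dollars for a deal, rounded to cents."""
    return round(cost_per_contract * contract_count, 2)


# Contract counts from the 2,000-5,000 range cited for a mid-market deal.
for model, cost in COST_PER_CONTRACT.items():
    low = projected_cost(cost, 2_000)
    high = projected_cost(cost, 5_000)
    print(f"{model}: ${low:,.2f} to ${high:,.2f}")
```

Under that assumption, even the most expensive model comes to about $240 for 5,000 contracts, so accuracy is likely to weigh more heavily than per-document cost in the choice.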

The complexity factor

Contract complexity varies dramatically. A simple NDA poses a very different extraction problem than a 50-page acquisition agreement with exhibits, so model performance should be tested across complexity levels:

#  Model             Simple  Medium  Complex  Delta (simple to complex)
1  GPT-4o            94.8%   91.2%   86.4%    8.4 pts
2  Gemini 2.0 Flash  92.4%   87.6%   81.2%    11.2 pts
3  GPT-4o-mini       89.6%   83.8%   76.4%    13.2 pts

Most consistent: GPT-4o
Lowest delta: 8.4 pts

GPT-4o shows the best consistency: its accuracy drops only 8.4 percentage points from simple to complex contracts, versus 11.2 and 13.2 points for Gemini 2.0 Flash and GPT-4o-mini.

For legal work, this consistency matters more than peak performance. You need to trust the model on your most complex documents.
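The consistency figure is the accuracy on simple contracts minus the accuracy on complex ones, in percentage points. A short sketch that recomputes it from the per-tier accuracies reported above (the dictionary layout and the name `complexity_delta` are illustrative):

```python
# Accuracy (%) by complexity tier, copied from the table above.
RESULTS = {
    "GPT-4o": {"simple": 94.8, "medium": 91.2, "complex": 86.4},
    "Gemini 2.0 Flash": {"simple": 92.4, "medium": 87.6, "complex": 81.2},
    "GPT-4o-mini": {"simple": 89.6, "medium": 83.8, "complex": 76.4},
}


def complexity_delta(scores: dict[str, float]) -> float:
    """Accuracy drop from simple to complex contracts, in percentage points."""
    return round(scores["simple"] - scores["complex"], 1)


deltas = {model: complexity_delta(scores) for model, scores in RESULTS.items()}
most_consistent = min(deltas, key=deltas.get)

for model, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(f"{model}: {delta} pts")
print(f"Most consistent: {most_consistent}")
```

Running the same calculation on your own benchmark results, with your own definition of the complexity tiers, gives a single number per model to compare.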


Clause-level accuracy

Different clause types have different extraction difficulty:

Clause type        GPT-4o  Gemini 2.0 Flash  GPT-4o-mini
Party names        96.4%   93.8%             91.2%
Dates              95.2%   92.6%             89.8%
Payment terms      92.8%   88.4%             84.6%
Liability caps     89.4%   84.2%             79.8%
Change of control  86.2%   79.6%             74.2%
IP assignment      83.8%   76.4%             71.6%

Complex clauses like change of control and IP assignment need more careful review, regardless of model choice.
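One way to put this into practice is a routing rule that sends certain clause types to an attorney no matter what the model reports. The sketch below is hypothetical: the clause-type labels, the `confidence` score, and the 0.9 threshold are assumptions for illustration, not outputs of any specific model or tool.

```python
# Clause types that always go to a human reviewer (hypothetical labels).
ALWAYS_REVIEW = {"change_of_control", "ip_assignment", "indemnification"}


def needs_human_review(
    clause_type: str,
    confidence: float,
    threshold: float = 0.9,  # assumed cut-off; tune on your own benchmark
) -> bool:
    """Return True if an extracted clause should be checked by an attorney.

    High-risk clause types are always reviewed; other clauses are reviewed
    only when the model's self-reported confidence is below the threshold.
    """
    return clause_type in ALWAYS_REVIEW or confidence < threshold


extractions = [
    ("party_names", 0.97),
    ("payment_terms", 0.82),
    ("change_of_control", 0.99),
]
for clause_type, confidence in extractions:
    flag = "review" if needs_human_review(clause_type, confidence) else "accept"
    print(f"{clause_type}: {flag}")
```

The key design choice is that the high-risk check comes first, so a confident but wrong change-of-control extraction still reaches a reviewer.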


M&A due diligence results

What can AI-assisted contract review deliver?

Metric         Change
Review time    -80%
Review cost    -76%
Missed issues  -75%

The reduction in missed issues is particularly valuable. AI catches patterns that human reviewers miss when fatigued from reading thousands of pages.


1. Consistency across complexity levels is critical

A model that performs well on simple documents but degrades on complex ones creates risk. Test specifically for complexity variance.

2. High-risk clauses need human review

Change of control, IP assignment, and indemnification clauses should always have human oversight, regardless of model confidence.

3. Benchmark with your actual contract types

Commercial leases, software licenses, and employment agreements all differ from one another. Test on your actual document mix.

4. Speed enables better outcomes

Faster review means more time for negotiation and issue resolution, not just cost savings.
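For the third point, benchmarking on your actual document mix, it helps to report accuracy per contract type rather than as a single average. A minimal sketch, assuming each benchmark record holds a contract type plus the number of correct and total extracted fields; the record layout and the sample numbers are made up for illustration.

```python
from collections import defaultdict


def accuracy_by_type(records: list[tuple[str, int, int]]) -> dict[str, float]:
    """Aggregate field-level accuracy per contract type.

    Each record is (contract_type, correct_fields, total_fields).
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])
    for contract_type, correct, total in records:
        totals[contract_type][0] += correct
        totals[contract_type][1] += total
    return {
        contract_type: correct / total
        for contract_type, (correct, total) in totals.items()
        if total > 0
    }


# Illustrative records only, not real benchmark results.
records = [
    ("software_license", 18, 20),
    ("software_license", 17, 20),
    ("commercial_lease", 14, 20),
    ("employment", 19, 20),
]

for contract_type, accuracy in accuracy_by_type(records).items():
    print(f"{contract_type}: {accuracy:.1%}")
```

A per-type breakdown like this makes it visible when a model that looks strong on average is weak on the document type that dominates a particular deal.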


Try it yourself

LLMCompare helps legal teams evaluate models for contract review. Upload your contracts, define your extraction schema, and get the accuracy data you need for confident deployment.

Because in legal work, missed clauses mean missed risks.