# MCP Server

Talk to your benchmarks

Connect LLMCompare to Claude, ChatGPT, Cursor, and other AI assistants. Run benchmarks, query results, and automate workflows using natural language.

## What is MCP?

The open standard for AI tool integrations

Model Context Protocol (MCP) is an open standard that lets AI assistants connect to external tools and data sources. Our MCP server exposes LLMCompare's full capabilities—so you can benchmark models, analyze results, and manage projects entirely through conversation.

- **Natural language interface**: Ask "Run a benchmark comparing GPT-4o vs Claude on my invoices" and watch it happen. No clicking through menus or writing code.
- **Works with your AI tools**: Claude Desktop, Claude Code, Cursor, Windsurf, and any MCP-compatible client. Use whichever AI assistant you prefer.
- **Workflow automation**: Let AI assistants orchestrate entire benchmark workflows. Upload documents, run tests, analyze results, all in one conversation.
- **Secure API access**: Uses your existing LLMCompare API key. All operations are authenticated and scoped to your workspace.

## Available tools

Actions the AI can perform on your behalf

- `run_benchmark`: Trigger a document extraction benchmark with the specified models and parameters.
- `get_benchmark_status`: Check the progress of a running benchmark job in real time.
- `cancel_benchmark`: Stop a running benchmark job if you need to abort.
- `create_project`: Create a new benchmark project with a custom extraction schema.
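
These tools are normally invoked by your AI assistant, but any MCP client can call them directly. The sketch below uses the official TypeScript SDK (`@modelcontextprotocol/sdk`) to launch the server over stdio, list the available tools, and call `run_benchmark`. The argument names (`project_id`, `models`) and the project id are illustrative assumptions, since the tool's input schema isn't documented here; check the schemas returned by `listTools()` for the actual parameters.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import {
  StdioClientTransport,
  getDefaultEnvironment,
} from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Launch the LLMCompare MCP server as a subprocess and connect over stdio,
  // passing the API key the same way the Quick setup configuration below does.
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["@llmcompare/mcp"],
    env: { ...getDefaultEnvironment(), LLMCOMPARE_API_KEY: "llmc_your_key_here" },
  });

  const client = new Client(
    { name: "llmcompare-tools-example", version: "1.0.0" },
    { capabilities: {} }
  );
  await client.connect(transport);

  // Discover the tools and input schemas advertised by the server.
  const { tools } = await client.listTools();
  console.log(tools.map((tool) => tool.name));

  // Start a benchmark. The argument names here are assumptions for illustration;
  // inspect the inputSchema from listTools() for the actual parameters.
  const result = await client.callTool({
    name: "run_benchmark",
    arguments: {
      project_id: "proj_123", // hypothetical project id
      models: ["claude-3.5-sonnet", "gpt-4o"],
    },
  });
  console.log(result);

  await client.close();
}

main().catch(console.error);
```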

## Available resources

Data the AI can read and query

- `llmcompare://models`: List of 50+ vision-capable LLM models available for benchmarking
- `llmcompare://projects`: Your benchmark projects with documents and prompts
- `llmcompare://projects/:id`: Detailed project info with extraction schema and documents
- `llmcompare://runs`: Recent benchmark runs with summary metrics
- `llmcompare://runs/:id`: Full benchmark results with per-model accuracy, cost, and extractions
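
Resources are fetched with standard MCP read requests, so the same data your assistant sees is available to any MCP client. Here is a minimal TypeScript sketch along the same lines as the tools example above, again using `@modelcontextprotocol/sdk`; the run id `run_xyz` is a placeholder borrowed from the example conversation below.

```typescript
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import {
  StdioClientTransport,
  getDefaultEnvironment,
} from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["@llmcompare/mcp"],
    env: { ...getDefaultEnvironment(), LLMCOMPARE_API_KEY: "llmc_your_key_here" },
  });
  const client = new Client(
    { name: "llmcompare-resources-example", version: "1.0.0" },
    { capabilities: {} }
  );
  await client.connect(transport);

  // List recent runs, then fetch full results for a specific run.
  const runs = await client.readResource({ uri: "llmcompare://runs" });
  console.log(runs.contents);

  // "run_xyz" is a placeholder; use an id returned by the listing above.
  const details = await client.readResource({ uri: "llmcompare://runs/run_xyz" });
  console.log(details.contents);

  await client.close();
}

main().catch(console.error);
```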

## Example conversations

See what's possible with natural language

A Claude session with the `llmcompare` MCP server connected:

**User:** What models can I benchmark?

*Claude reads `llmcompare://models`*

**Claude:** You can benchmark 47 vision-capable models including Claude 3.5 Sonnet, GPT-4o, Gemini 1.5 Pro, and many more. Would you like me to filter by provider or capability?

**User:** Run a benchmark comparing Claude and GPT-4o on my EPC project

*Claude calls `run_benchmark` with `claude-3.5-sonnet` and `gpt-4o`*

**Claude:** Started benchmark job `job_abc123`. Processing 5 documents across 2 models.

*Claude calls `get_benchmark_status`*

**Claude:** 40% complete: running GPT-4o on document 2/5.

**User:** Show me the results of my last benchmark

*Claude reads `llmcompare://runs` and `llmcompare://runs/run_xyz`*

**Claude:**

| Model | Accuracy | Cost | Time |
| --- | --- | --- | --- |
| Claude 3.5 Sonnet | 94.2% | $0.12 | 2.3s |
| GPT-4o | 91.8% | $0.18 | 3.1s |

Claude performed better on address extraction (98% vs 89%), while GPT-4o was stronger on numeric fields.

## Supported platforms

Works with your favorite AI tools

- Claude Desktop
- Claude Code
- Cursor
- Windsurf
- Any MCP client

## Quick setup

Get started in under a minute

1. **Get your API key.** Log into LLMCompare and create an API key from your account settings. Keys start with `llmc_`.

2. **Configure your client.** Add the MCP server to your Claude Desktop or IDE configuration:

   ```json
   {
     "mcpServers": {
       "llmcompare": {
         "command": "npx",
         "args": ["@llmcompare/mcp"],
         "env": {
           "LLMCOMPARE_API_KEY": "llmc_your_key_here"
         }
       }
     }
   }
   ```
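
   For Claude Desktop, this goes in `claude_desktop_config.json` (on macOS, under `~/Library/Application Support/Claude/`); other clients such as Cursor and Windsurf keep their MCP settings in their own configuration files, so check your client's documentation for the exact location.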

3. **Start talking to your benchmarks.** Open a conversation and ask Claude about your models, projects, or results. Try "What benchmark projects do I have?"

Ready to talk to your benchmarks?

Install the MCP server and start using natural language to manage your LLM benchmarks.