AI Benchmarks

Measure what actually matters

Faithfulness, relevancy, and hallucination scores across every major AI model — tested on real document Q&A, research, and extraction tasks.

Faithfulness

What fraction of the AI's answer is supported by the retrieved context? High faithfulness = low hallucination risk.
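As a rough illustration of that ratio, here is a minimal sketch. It assumes the answer can be split into sentence-level claims and that a claim counts as "supported" when most of its words appear in the context. Both `split_claims` and `is_supported` are hypothetical helpers, and the word-overlap check is only a crude stand-in for the LLM judge or entailment model a real evaluator would likely use.

```python
import re


def split_claims(answer: str) -> list[str]:
    """Split an answer into sentence-level claims (crude stand-in for claim extraction)."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def is_supported(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Hypothetical support check: share of the claim's tokens that occur in the context."""
    claim_tokens = set(re.findall(r"[a-z0-9]+", claim.lower()))
    context_tokens = set(re.findall(r"[a-z0-9]+", context.lower()))
    if not claim_tokens:
        return False
    overlap = len(claim_tokens & context_tokens) / len(claim_tokens)
    return overlap >= threshold


def faithfulness(answer: str, context: str) -> float:
    """Fraction of claims in the answer that the context supports."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    supported = sum(is_supported(c, context) for c in claims)
    return supported / len(claims)


context = "The contract was signed on 3 March 2021. The term is five years."
answer = "The contract was signed on 3 March 2021. It renews automatically every decade."
print(faithfulness(answer, context))  # one of two claims is supported -> 0.5
```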

Relevancy

Does the answer actually address the question asked? Measures semantic alignment between question and response.
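One common way to approximate this kind of alignment is cosine similarity between a vector for the question and a vector for the answer. The sketch below assumes simple word-count vectors so that it runs without any model. A real relevancy metric would more plausibly use sentence embeddings, and `relevancy` here is an illustrative name, not a documented scoring function.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts, used here in place of a learned embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def relevancy(question: str, answer: str) -> float:
    """Illustrative relevancy score: similarity of question and answer vectors."""
    return cosine(bag_of_words(question), bag_of_words(answer))


question = "What is the notice period?"
on_topic = "The notice period is 30 days."
off_topic = "Payment is due in euros."
print(relevancy(question, on_topic) > relevancy(question, off_topic))  # True
```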

BERTScore F1

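Semantic similarity between the generated answer and a gold-standard expected answer, computed by matching their tokens in transformer embedding space. F1 is the harmonic mean of the resulting precision and recall.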
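The core of BERTScore is a greedy match: each token embedding in one text is paired with its most similar token embedding in the other. The sketch below shows that step on hand-made 2-D vectors, which are purely illustrative. In practice the embeddings come from a transformer, and options such as IDF weighting and baseline rescaling are omitted here.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_f1(candidate: list[list[float]], reference: list[list[float]]) -> float:
    """Greedy-matching F1 over token embeddings (simplified, no IDF weighting)."""
    if not candidate or not reference:
        return 0.0
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in reference) for c in candidate) / len(candidate)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in candidate) for r in reference) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy embeddings: the candidate repeats one reference token and misses the other.
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [1.0, 0.0]]
print(round(greedy_f1(candidate, reference), 3))  # precision 1.0, recall 0.5 -> 0.667
```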

Hallucination Risk

The complement of faithfulness (1 minus the faithfulness score): the fraction of the answer that is NOT grounded in the source documents.
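Assuming the score is the complement of faithfulness (1 minus faithfulness, not the reciprocal), it follows directly from the same claim counts. The function name and the claim-count inputs below are illustrative assumptions.

```python
def hallucination_risk(supported_claims: int, total_claims: int) -> float:
    """Fraction of claims NOT supported by the source documents."""
    if total_claims <= 0:
        raise ValueError("total_claims must be positive")
    if not 0 <= supported_claims <= total_claims:
        raise ValueError("supported_claims must be between 0 and total_claims")
    faithfulness = supported_claims / total_claims
    return 1.0 - faithfulness


# 3 of 4 claims are grounded, so faithfulness is 0.75 and the risk is 0.25.
print(hallucination_risk(3, 4))
```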

Benchmark reports (publishing soon)

GPT-4o vs Claude 3.5 Sonnet: RAG Q&A (coming soon)

DeepSeek vs Gemini: Document Summarization (coming soon)

Best Models for Contract Analysis 2026 (coming soon)

RAG Faithfulness Benchmark: Top 10 Models (coming soon)

Run benchmarks on your own data

Upload a golden Q&A dataset, run it against any model, and get faithfulness, relevancy, and BERTScore — all inside SevenLLM's evaluation service.