Benchmarks¶
RAGTruth (English)¶
Evaluated on the RAGTruth test set. This benchmark measures how well models detect hallucinations in LLM-generated text across QA, summarization, and data-to-text tasks.
Example-Level Detection¶
Binary classification: does the answer contain any hallucination?
| Model | Type | Overall F1 |
|---|---|---|
| GPT-4 | LLM (zero-shot) | 63.4% |
| Luna | Encoder | 65.4% |
| lettucedetect-base-v1 | Encoder (149M) | 76.8% |
| Llama-2-13B (fine-tuned) | LLM | 78.7% |
| lettucedetect-large-v1 | Encoder (395M) | 79.2% |
| RAG-HAT (Llama-3-8B) | LLM | 83.9% |
Span-Level Detection¶
LettuceDetect achieves state-of-the-art span-level results among models that report this metric, outperforming fine-tuned Llama-2-13B. Span-level evaluation measures how precisely the model can locate the exact hallucinated text within an answer.
What These Numbers Mean¶
- Example F1 — Can the model tell if an answer has any hallucination? Higher is better.
- Span F1 — Can the model point to exactly which parts are hallucinated? This is the harder task and where LettuceDetect excels relative to its size.
LettuceDetect models are 50-500x smaller than LLM-based detectors while achieving competitive or better accuracy.