Architecture Research: Detection Models for Code Hallucination¶
Research notes on model architectures for training on the code hallucination dataset. We compare four approaches ranging from fast encoder-based classifiers to generative span detectors.
Approach A: Token Classification (Encoder)¶
Architecture: ModernBERT/EuroBERT + linear classification head
The current LettuceDetect approach. Each answer token gets a binary label (0=supported, 1=hallucinated). Consecutive hallucinated tokens are merged into spans at inference.
Input: [CLS] context [SEP] question [SEP] answer [SEP]
Output: [-100, -100, ..., 0, 0, 1, 1, 1, 0, 0, ...]
^^^^^^^^^ hallucinated span
| Property | Value |
|---|---|
| Models | ModernBERT-base (149M), ModernBERT-large (395M), EuroBERT (210M-2.1B) |
| Context | 8K tokens |
| Inference | Single forward pass, 30-60 samples/sec on A100 |
| Training | Standard token classification, CrossEntropyLoss |
| Validated by | LettuceDetect (79.2% F1), HaluGate (vLLM), PsiloQA (EMNLP 2025) |
Strengths: Fast, simple, production-ready. Handles long contiguous spans well. Weaknesses: No code-specific pretraining. Cannot explain why something is hallucinated.
Approach B: Token Classification (Decoder LLM)¶
Architecture: Qwen3.5-2B + bidirectional attention (LLM2Vec) + linear head
Use a decoder LLM pretrained on massive code corpora, convert to bidirectional encoder via LLM2Vec, then add a token classification head.
Step 1: Load Qwen3.5-2B base (2B params, code-heavy pretraining)
Step 2: Enable bidirectional attention (remove causal mask)
Step 3: Short MNTP adaptation (masked next token prediction with LoRA)
Step 4: Add linear head (hidden_dim=2048 → 2 classes)
Step 5: Fine-tune on code hallucination dataset with LoRA
| Property | Value |
|---|---|
| Model | Qwen3.5-2B (2B params) |
| Context | 262K native (practically limited by GPU memory) |
| Inference | Single forward pass, ~5-15 samples/sec |
| VRAM | ~5-8GB in bf16 |
| Reference | Looking Right is Sometimes Right (ACL 2024) — 0.947 F1 on NER with mask removal |
Strengths: Deep code understanding from pretraining. Bidirectional attention after conversion. Weaknesses: 5x larger than ModernBERT. Requires LLM2Vec conversion step. Novel (unvalidated for hallucination detection).
Key insight: The ACL 2024 paper showed decoder LLMs with causal mask removal reach 0.947 F1 on NER, significantly above RoBERTa-large (0.900). The gains come from combining rich pretrained representations with bidirectional context.
Approach C: Chunk Verification (Reranker-style)¶
Architecture: Qwen3.5-2B or Qwen3-0.6B, reranker-style yes/no scoring
Inspired by Qwen3-Reranker. Split the answer into chunks (lines, statements), then ask the model for each chunk: "Is this code correct given the context?"
Input: "Given this source code, is this line correct? yes/no"
Output: P(yes) = 0.12 → hallucinated
P(yes) = 0.95 → supported
No architectural modifications. Uses the LLM's native next-token prediction to classify.
| Property | Value |
|---|---|
| Models | Qwen3-0.6B (tiny, fast) or Qwen3.5-2B |
| Inference | N forward passes per sample (one per chunk) |
| Training | Standard SFT with yes/no labels |
| Reference | MiniCheck (EMNLP 2024) — GPT-4-level at 400x lower cost |
Strengths: No architecture changes. Uses LLM code reasoning directly. Can work with tiny models. Weaknesses: Slowest inference (N passes per sample). Chunk boundary sensitivity. No sub-chunk granularity.
Approach D: Generative Span Detection¶
Architecture: Qwen3.5-2B, standard SFT, generates JSON with hallucinated spans
The model directly outputs which spans are hallucinated and why. This is the reverse of the hallucination injection process.
Input: "Given the source code and answer, identify hallucinated spans."
Output: {
"hallucinated_spans": [
{"text": "response.json_decode()", "explanation": "method is json(), not json_decode()"}
]
}
| Property | Value |
|---|---|
| Models | Qwen3.5-2B or larger |
| Inference | Single generation (autoregressive, slower than forward pass) |
| Training | Standard SFT with LoRA |
| SOTA | RL4HS (Oct 2025) — 58.3 F1 on RAGTruth, beats GPT-5 (42.2) and o3 (51.2) |
Strengths:
- No architecture changes — pure text generation
- Free explanations alongside span detection
- Naturally handles variable span counts
- Can leverage the LLM's code knowledge ("this API doesn't exist")
- Training data format already matches (reverse of injection pipeline)
- Current SOTA approach (RL4HS)
Weaknesses: Autoregressive generation is slower. Risk of hallucinating in the detector itself. String matching needed to map spans back to character offsets.
RL enhancement: RL4HS shows that adding reinforcement learning (GRPO with span-level rewards) on top of SFT dramatically improves performance. SFT alone is a strong baseline; RL pushes it to SOTA.
Comparison¶
| A. Encoder token | B. LLM token | C. Chunk verifier | D. Generative span | |
|---|---|---|---|---|
| Base model | ModernBERT-large | Qwen3.5-2B | Qwen3-0.6B | Qwen3.5-2B |
| Parameters | 395M | 2B | 0.6B | 2B |
| Architecture mods | None | Mask removal | None | None |
| Inference speed | Fastest | Medium | Slowest | Medium-slow |
| Explainable | No | No | No | Yes |
| Code understanding | Limited | Deep | Deep | Deep |
| Training complexity | Simple | LLM2Vec + LoRA | Simple SFT | Simple SFT |
| SOTA reference | LettuceDetect, HaluGate | ACL 2024 paper | MiniCheck | RL4HS |
Recommended Experiments¶
-
A vs D — Token classification (ModernBERT) vs generative span detection (Qwen3.5-2B). The core comparison: fast encoder vs reasoning LLM, both trained on the same dataset.
-
A vs B — Does code pretraining help token classification? Same task, different backbone.
-
D with RL — If SFT results are promising, add GRPO with span-overlap rewards (following RL4HS).
Key References¶
- LettuceDetect (arXiv:2502.17125) — Encoder token classification baseline
- HaluGate (vLLM, Dec 2025) — Production ModernBERT + NLI pipeline
- RL4HS (arXiv:2510.02173) — SOTA generative span detection with RL
- FAVA (COLM 2024) — Generative hallucination editing
- PsiloQA (EMNLP 2025) — Multilingual encoder-based span detection
- Looking Right is Sometimes Right (ACL 2024) — Decoder LLMs for token classification
- LLM2Vec (2024) — Converting decoders to bidirectional encoders
- MiniCheck (EMNLP 2024) — Sentence-level fact checking
- Qwen3-Reranker — LLM-based yes/no classification
- CodeMirage (2024) — Code hallucination taxonomy (snippet-level only)