Code Hallucination Dataset Pipeline¶
A modular 9-phase pipeline for generating span-level code hallucination detection datasets from SWE-bench. Produces training data where each sample contains source code context, a code answer, and character-level annotations marking hallucinated spans.
Why This Dataset?¶
Existing hallucination detection datasets (RAGTruth, RAGBench) focus on text — question answering, summarization, data-to-text. There is no established span-level code hallucination dataset. CodeMirage classifies entire snippets but doesn't localize where the hallucination is.
This pipeline generates samples where an LLM coding assistant answers a developer's question about a real codebase, and we know exactly which character spans in the answer are hallucinated — enabling training of both token-level classifiers (ModernBERT) and generative span detectors (decoder LLMs).
Dataset Overview¶
| Property | Value |
|---|---|
| Source | SWE-bench (all splits) |
| Total instances | ~21,500 (19k train + 225 dev + 2.3k test) |
| Repos | 53 unique repos, zero overlap between splits |
| Clean/hallucinated ratio | ~60% clean / ~40% hallucinated |
| Hallucination types | Structural, behavioral, semantic |
| Answer formats | Code with explanation, complete function, fragment, edit-style |
| Annotation granularity | Character-level spans |
Quick Start¶
Test with a few examples¶
# Using Groq (fast, free tier available)
OPENAI_API_KEY=your_groq_key \
python -m scripts.code_hallucination.pipeline --test 5
# Using any OpenAI-compatible API
OPENAI_API_KEY=your_key \
API_BASE_URL=https://api.example.com/v1 \
MODEL=your-model-name \
python -m scripts.code_hallucination.pipeline --test 10
Run the full pipeline¶
# Run all 9 phases
python -m scripts.code_hallucination.pipeline --all
# Run specific phases
python -m scripts.code_hallucination.pipeline --phase 1 2 3
# Override LLM settings via CLI
python -m scripts.code_hallucination.pipeline --all \
--api-key YOUR_KEY \
--base-url https://api.groq.com/openai/v1 \
--model moonshotai/kimi-k2-instruct-0905
Run with local vLLM (recommended for bulk generation)¶
# Terminal 1: Start vLLM
vllm serve Qwen/Qwen3.5-2B --port 8000
# Terminal 2: Run pipeline with batch processing
BATCH_SIZE=16 \
API_BASE_URL=http://localhost:8000/v1 \
OPENAI_API_KEY=dummy \
MODEL=Qwen/Qwen3.5-2B \
python -m scripts.code_hallucination.pipeline --all
BATCH_SIZE>1 enables async concurrent requests — no rate limiting, full GPU saturation.
CLI Options¶
| Flag | Description |
|---|---|
--test N |
Test mode: run pipeline on N random test instances using GitHub API (no repo cloning) |
--all |
Run all 9 phases |
--phase 1 2 3 |
Run specific phases (1-9) |
--api-key |
LLM API key (or set OPENAI_API_KEY env var) |
--base-url |
LLM API base URL (or set API_BASE_URL env var) |
--model |
LLM model name (or set MODEL env var) |
| Environment Variable | Description |
|---|---|
BATCH_SIZE |
Number of concurrent requests (default: 1). Set >1 for local vLLM |
OPENAI_API_KEY |
API key for the LLM provider |
API_BASE_URL |
OpenAI-compatible API endpoint |
MODEL |
Model name |
CONTEXT7_API_KEY |
API key for Context7 documentation service |
Pipeline Architecture¶
graph TD
A[Phase 1: Load SWE-bench] --> B[Phase 2: Fetch Sources]
B --> C[Phase 3: Rewrite Queries]
B --> D[Phase 4: Fetch Docs]
B --> E[Phase 5: Assign Formats]
A --> F[Phase 8: Select Targets]
C --> G[Phase 6: Inject Hallucinations]
E --> G
F --> G
D --> H[Phase 7: Assemble Samples]
G --> H
E --> H
C --> H
H --> I[Phase 9: Validate]
Phases 3 and 4 can run in parallel. Phase 8 (target selection) runs before Phase 6 (injection).
Output Format¶
Each sample follows the HallucinationSample format used by LettuceDetect:
{
"prompt": "File: django/http/response.py\n```python\n...\n```\n\nUser request: How do I fix the Content-Type header issue?",
"answer": "def fix_content_type(self):\n self.headers['Content-Type'] = response.get_type()\n ...",
"labels": [
{"start": 55, "end": 75, "label": "structural"}
],
"split": "test",
"task_type": "code_generation",
"dataset": "swebench_code",
"language": "en"
}
prompt: Source code files + documentation + user queryanswer: Code in one of four formats (code with explanation, complete function, fragment, edit-style)labels: Character-level span annotations (empty for clean samples)split: train/dev/test (inherited from SWE-bench, zero repo overlap)
Supported LLM Providers¶
The pipeline works with any OpenAI-compatible API. Tested with:
| Provider | Model | Notes |
|---|---|---|
| Groq | openai/gpt-oss-120b |
Best quality, recommended |
| Groq | moonshotai/kimi-k2-instruct-0905 |
Fast |
| Novita AI | qwen/qwen3.5-27b |
Good for bulk generation |
| Local (vLLM/Ollama) | Any model | Free, best for large runs |
Directory Structure¶
scripts/code_hallucination/
├── __init__.py # Package declaration
├── pipeline.py # CLI orchestrator
├── config.py # Constants, paths, API settings
├── swebench_loader.py # Phase 1: Load SWE-bench instances
├── source_fetcher.py # Phase 2: Clone repos, fetch source files
├── query_rewriter.py # Phase 3: LLM query rewriting
├── context7_docs.py # Phase 4: Fetch library documentation
├── format_builder.py # Phase 5: Assign answer formats
├── hallucination_injector.py # Phase 6: LLM hallucination injection
├── sample_assembler.py # Phase 7: Assemble final samples
├── splitter.py # Phase 8: Select hallucination targets
└── validator.py # Phase 9: Quality validation
Output data:
data/code_hallucination/
├── swebench_instances.json # Phase 1 output
├── repos/ # Phase 2: cloned repos (bare)
├── source_cache/ # Phase 2: per-instance source data
│ └── {instance_id}.json
├── queries.jsonl # Phase 3 output
├── documentation.jsonl # Phase 4 output
├── formats.jsonl # Phase 5 output
├── hallucinated_samples.jsonl # Phase 6 output
├── code_hallucination_data.json # Phase 7: final dataset
├── code_hallucination_metadata.json # Phase 7: metadata
└── validation_report.txt # Phase 9 output
Design Decisions¶
One sample per instance¶
Each SWE-bench instance produces exactly one sample — either clean (gold patch answer) or hallucinated (LLM-injected). No instance appears in both classes. This avoids the artificial pairing problem where models learn to distinguish the specific instance rather than the hallucination.
JSON-based span annotations¶
Hallucination spans are extracted from the LLM's structured JSON response, not from difflib character-level diffs. The LLM returns {"hallucinated_code": "...", "changes": [{"original": "...", "hallucinated": "..."}]} and spans are found by string matching. Quality controls enforce 2-3 spans per sample (avg 2.8), minimum 15 chars per span, and total coverage under 60%.
20% documentation split¶
A subset of instances (20%) include Context7 library documentation, filtered to only the repo's primary library. Documentation is also passed to the hallucination injector, enabling semantic hallucinations that contradict documented API behavior.
Zero repo overlap between splits¶
SWE-bench's train/dev/test splits naturally have zero repository overlap across 53 unique repos. This means test performance measures generalization to completely unseen codebases.
Training on This Dataset¶
Once you've generated the dataset, there are two training approaches. See Architecture Research for the full rationale behind each.
Approach A: Token Classification (ModernBERT)¶
The standard LettuceDetect approach — a lightweight encoder that labels each token as supported or hallucinated.
# Code data only
python scripts/train_code_hallucination.py \
--code-data-path data/code_hallucination/code_hallucination_data.json \
--model-name answerdotai/ModernBERT-base \
--output-dir output/code_hallucination_detector \
--batch-size 4 \
--epochs 6 \
--learning-rate 1e-5
# Code + RAGTruth combined (better generalization across text and code)
python scripts/train_code_hallucination.py \
--code-data-path data/code_hallucination/code_hallucination_data.json \
--ragtruth-path data/ragtruth/ragtruth_data.json \
--model-name answerdotai/ModernBERT-base \
--output-dir output/code_hallucination_detector \
--batch-size 4 \
--epochs 6
The training script uses SWE-bench splits directly — train for training, dev for validation, test held out. Zero repository overlap between splits.
Approach D: Generative Span Detection (Qwen SFT)¶
Fine-tune a decoder LLM to read context + answer and generate a JSON list of hallucinated spans with explanations. This is the reverse of the injection pipeline.
# Requires: pip install peft
python scripts/train_generative_detector.py \
--code-data-path data/code_hallucination/code_hallucination_data.json \
--model-name Qwen/Qwen3.5-2B \
--output-dir output/generative_detector \
--batch-size 2 \
--epochs 3 \
--lora-r 16
The model learns to output:
{"hallucinated_spans": [{"text": "response.json_decode()", "explanation": "method is json(), not json_decode()"}]}
For clean samples it outputs {"hallucinated_spans": []}.
Training uses LoRA for memory efficiency (~5-8GB VRAM). Only the model's response tokens contribute to the loss — context and prompt tokens are masked.
| Parameter | Default | Description |
|---|---|---|
--model-name |
Qwen/Qwen3.5-2B |
Any HuggingFace causal LM |
--lora-r |
16 | LoRA rank (higher = more capacity, more memory) |
--lora-alpha |
32 | LoRA scaling factor |
--batch-size |
2 | Training batch size |
--epochs |
3 | Number of training epochs |
--learning-rate |
2e-4 | Learning rate |
--gradient-accumulation-steps |
4 | Accumulate gradients over N steps (simulates larger batch) |
Evaluate¶
python scripts/evaluate_code_hallucination.py \
--model_path output/code_hallucination_detector \
--data_path data/code_hallucination/code_hallucination_data.json \
--evaluation_type example_level
End-to-End Example¶
# 1. Generate dataset (local vLLM for speed)
vllm serve Qwen/Qwen3.5-2B --port 8000 # in another terminal
BATCH_SIZE=16 \
API_BASE_URL=http://localhost:8000/v1 \
OPENAI_API_KEY=dummy \
MODEL=Qwen/Qwen3.5-2B \
python -m scripts.code_hallucination.pipeline --all
# 2. Train
python scripts/train_code_hallucination.py \
--code-data-path data/code_hallucination/code_hallucination_data.json \
--model-name answerdotai/ModernBERT-base \
--output-dir output/code_hallucination_detector \
--batch-size 4 --epochs 6
# 3. Evaluate
python scripts/evaluate_code_hallucination.py \
--model_path output/code_hallucination_detector \
--data_path data/code_hallucination/code_hallucination_data.json \
--evaluation_type example_level
The pipeline is fully resumable — every slow phase saves results incrementally to JSONL. If it crashes, re-run the same command and it picks up where it left off.