Pipeline Phases¶
Detailed documentation for each of the 9 pipeline phases.
Phase 1: Load SWE-bench¶
Module: swebench_loader.py
Loads all SWE-bench splits from HuggingFace and tags each instance with split and Lite membership.
| Split | Instances | Repos | Source |
|---|---|---|---|
| Train | 19,008 | 35 | princeton-nlp/SWE-bench train |
| Dev | 225 | 6 | princeton-nlp/SWE-bench dev |
| Test | 2,294 | 12 | princeton-nlp/SWE-bench test |
| Lite | 300 | 12 | princeton-nlp/SWE-bench_Lite (subset of test) |
Key function: load_all_splits() -> list[dict]
Each instance includes: instance_id, repo, base_commit, patch, problem_statement, split, is_lite.
Output: data/code_hallucination/swebench_instances.json
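The split/Lite tagging can be sketched as below. The helper name and argument shapes are illustrative, not the actual `swebench_loader.py` API; real instances come from the princeton-nlp/SWE-bench datasets on HuggingFace.

```python
def tag_instances(split_instances, lite_ids):
    """Tag each instance with its split name and Lite membership.

    split_instances: dict mapping split name -> list of instance dicts
    lite_ids: set of instance_ids in SWE-bench_Lite (a subset of test)
    """
    tagged = []
    for split, instances in split_instances.items():
        for inst in instances:
            inst = dict(inst)  # copy so the caller's data is not mutated
            inst["split"] = split
            inst["is_lite"] = inst["instance_id"] in lite_ids
            tagged.append(inst)
    return tagged

# Stub data for illustration only:
instances = tag_instances(
    {"test": [{"instance_id": "django__django-11001"}]},
    lite_ids={"django__django-11001"},
)
```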
Phase 2: Fetch Sources¶
Module: source_fetcher.py
Clones repositories and extracts source code at the base commit for each instance. Builds three answer format variants.
Strategy¶
- Default: Clone repos as bare git repos to `data/code_hallucination/repos/`. Use `git show {commit}:{path}` for instant file access.
- Test mode: Use the GitHub raw API (`raw.githubusercontent.com`) — slower but no cloning needed.
- Fallback: If cloning fails, automatically falls back to the GitHub API.
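The raw-API fallback only needs a URL of a fixed shape. A minimal sketch (the function name is hypothetical, not the module's actual helper):

```python
def raw_github_url(repo: str, commit: str, path: str) -> str:
    """Build a raw.githubusercontent.com URL for a file at a specific commit."""
    return f"https://raw.githubusercontent.com/{repo}/{commit}/{path}"

url = raw_github_url("django/django", "abc123", "django/http/response.py")
```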
What it extracts per instance¶
| Field | Description |
|---|---|
| changed_files | File paths modified by the gold patch |
| source_files | Original source code at base commit |
| patch_code | Added/changed lines from the diff (fragment format) |
| edit_style | "In file X, replace Y with Z" format |
| modified_functions | AST-extracted functions that changed (complete function format) |
Key functions¶
- extract_changed_files(patch) — Parse unified diff for file paths (anchored regex, not lstrip("b/"))
- clone_repo(repo) — git clone --bare with 30-minute timeout
- fetch_file_at_commit(repo_dir, commit, filepath) — git show for file contents
- apply_patch_and_get_file(repo_dir, commit, patch, filepath) — Apply patch in a temp worktree
- extract_modified_functions(original, patched) — AST-based function diff
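The anchored-regex approach to extract_changed_files can be sketched as follows; the actual implementation may differ, but the point of anchoring at line start is that lstrip("b/") would strip leading "b" and "/" characters from the path itself:

```python
import re

def extract_changed_files(patch: str) -> list[str]:
    """Extract file paths from a unified diff by matching '+++ b/<path>'
    anchored at the start of a line."""
    return re.findall(r"^\+\+\+ b/(.+)$", patch, flags=re.MULTILINE)

patch = """\
diff --git a/django/http/response.py b/django/http/response.py
--- a/django/http/response.py
+++ b/django/http/response.py
@@ -1 +1 @@
-old
+new
"""
files = extract_changed_files(patch)
```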
Output: data/code_hallucination/source_cache/{instance_id}.json
Phase 3: Rewrite Queries¶
Module: query_rewriter.py
Transforms raw GitHub issue problem_statement fields into natural developer queries using an LLM.
Example¶
Before (raw issue):
BUG: DataFrame.groupby with as_index=False gives wrong result when grouping by single column with duplicate name. Steps to reproduce: ...
After (rewritten):
I'm getting wrong results when using DataFrame.groupby with as_index=False on a column that has a duplicate name. How do I fix this?
Prompt strategy¶
The LLM is instructed to:
- Write conversational, natural language
- Extract the core technical ask
- Remove GitHub formatting, reproduction steps, tracebacks
- Keep to 1-3 sentences
Resumability¶
Writes results to JSONL incrementally. On restart, skips already-processed instance_ids.
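A minimal sketch of this skip-if-done pattern (helper names are illustrative, not the module's actual API):

```python
import json
import os
import tempfile

def load_done_ids(path):
    """Return the set of instance_ids already written to the JSONL file."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {json.loads(line)["instance_id"] for line in f if line.strip()}

def append_result(path, record):
    """Append one result; the file stays valid JSONL after every write."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Demo with a throwaway file (the real pipeline writes queries.jsonl):
path = os.path.join(tempfile.mkdtemp(), "queries.jsonl")
append_result(path, {"instance_id": "django__django-11001", "query": "How do I fix this?"})
done = load_done_ids(path)
```

On restart, any instance whose id is in `done` is skipped before calling the LLM.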
Output: data/code_hallucination/queries.jsonl
Phase 4: Fetch Documentation¶
Module: context7_docs.py
Fetches library documentation via the Context7 API for 20% of instances (configurable via DOCS_RATIO).
Library detection¶
Maps the instance's GitHub repo to its primary library (e.g., django/django → django, scikit-learn/scikit-learn → scikit-learn). Only fetches docs for the matching library — not for random imports like sys or re.
Why 20%?¶
A minority of samples include documentation context, while most don't. This creates training variety — models learn to detect hallucinations both with and without documentation support. Documentation is also passed to the hallucination injector (Phase 6), enabling SEMANTIC hallucinations that contradict documented API behavior.
Instances not selected for docs still get an entry written with empty docs (by design, not failure).
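One way to implement a stable DOCS_RATIO selection is to hash the instance_id; this is a sketch under that assumption, and the module may instead use random sampling:

```python
import hashlib

DOCS_RATIO = 0.20  # configurable in the real pipeline

def selected_for_docs(instance_id: str, ratio: float = DOCS_RATIO) -> bool:
    """Deterministically select ~ratio of instances by hashing the id.

    Hash-based selection is stable across reruns, which suits a
    resumable pipeline.
    """
    digest = hashlib.sha256(instance_id.encode()).hexdigest()
    return int(digest, 16) % 10_000 < ratio * 10_000

picked = sum(selected_for_docs(f"instance-{i}") for i in range(1000))
```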
Output: data/code_hallucination/documentation.jsonl
Phase 5: Assign Answer Formats¶
Module: format_builder.py
Each instance gets exactly one answer format, chosen by weighted random selection from available options. Uses LLM calls for code_with_explanation format.
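The weighted selection can be sketched with the format weights listed below (the helper name is illustrative):

```python
import random

FORMAT_WEIGHTS = {
    "code_with_explanation": 0.40,
    "complete_function": 0.25,
    "fragment": 0.20,
    "edit_style": 0.15,
}

def choose_format(available: list[str], rng: random.Random) -> str:
    """Pick one answer format by weighted random choice among the
    formats actually available for this instance."""
    weights = [FORMAT_WEIGHTS[f] for f in available]
    return rng.choices(available, weights=weights, k=1)[0]

rng = random.Random(0)
fmt = choose_format(list(FORMAT_WEIGHTS), rng)
```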
Format types¶
**Code with explanation** (weight: 0.40)

The issue is that `process_data` uses `dict.items()` instead of iterating
over the sorted keys, which causes non-deterministic output.

```python
def process_data(data):
    for key in sorted(data.keys()):
        yield key, data[key]
```

This ensures consistent ordering regardless of insertion order.
Natural AI assistant response with prose explanation + code block. Generated by wrapping one of the base code formats with an LLM-generated explanation. This is the most realistic format — it matches how Claude, Cursor, and other AI coding assistants actually respond.
**Complete function** (weight: 0.25)

```python
def validate_response(self, response):
    if response.status_code != 200:
        raise ValidationError(f"Unexpected status: {response.status_code}")
    return response.json()
```
**Fragment** (weight: 0.20)

```python
if max_age is not None:
    self.cookies[key]["max-age"] = max_age
    self.cookies[key]["expires"] = http_date(time.time() + max_age)
```
**Edit-style** (weight: 0.15)

```
In file django/http/response.py, replace:

def set_cookie(self, key, value=""):
    self.cookies[key] = value

with:

def set_cookie(self, key, value="", max_age=None):
    self.cookies[key] = value
    if max_age is not None:
        self.cookies[key]["max-age"] = max_age
```
Output: data/code_hallucination/formats.jsonl
Phase 6: Inject Hallucinations¶
Module: hallucination_injector.py
Uses an LLM to inject realistic hallucinations into selected instances (determined by Phase 8). Returns structured JSON with span annotations.
Hallucination types (round-robin)¶
| Type | Description | Example |
|---|---|---|
| Structural | Non-existent APIs, wrong methods, invented parameters | response.json_decode() instead of response.json() |
| Behavioral | Wrong values, logic errors, off-by-one, swapped conditions | if status >= 200 instead of if status == 200 |
| Semantic | Code that looks right but does something subtly different | Sorting ascending instead of descending |
JSON-based span extraction¶
The LLM returns structured output:
```json
{
  "hallucinated_code": "def fix(self):\n    self.data = response.json_decode()\n    ...",
  "changes": [
    {
      "original": "response.json()",
      "hallucinated": "response.json_decode()",
      "explanation": "json_decode() is not a valid method on Response objects"
    }
  ]
}
```
Spans are found by string-matching each change["hallucinated"] in hallucinated_code. This produces clean, meaningful spans (minimum 15 chars) with zero noise.
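The string-matching step can be sketched like this (a simplified version: the real module may handle repeated matches and overlaps differently):

```python
def spans_from_changes(hallucinated_code: str, changes: list) -> list:
    """Locate each change's hallucinated text in the final answer and
    return character-level (start, end) spans. Matches shorter than 15
    characters are dropped as noise."""
    spans = []
    for change in changes:
        needle = change["hallucinated"]
        start = hallucinated_code.find(needle)
        if start != -1 and len(needle) >= 15:
            spans.append((start, start + len(needle)))
    return spans

answer = "self.data = response.json_decode()"
spans = spans_from_changes(answer, [{"hallucinated": "response.json_decode()"}])
```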
For answers containing both code and prose (code_with_explanation format), the injector places errors in both parts — e.g., wrong API in code + misleading description in text.
Quality controls¶
- Each span must be 20-150 characters (enforced by prompt)
- Total hallucinated coverage must be < 40% of the answer (enforced by prompt)
- _validate_labels() rejects samples with coverage > 60% or spans < 15 chars
- Failed validation triggers up to 3 retries before skipping
- No comment data leaks (prompt explicitly forbids # wrong, # error, etc.)
Quality metrics (from 100-sample test runs)¶
| Metric | Value |
|---|---|
| Noise-only samples | 0% |
| Min span length | 15 chars |
| Avg span length | 71 chars |
| Avg spans per sample | 2.8 |
| Coverage range | 2.8-43% |
| Mean coverage | 19.5% |
Output: data/code_hallucination/hallucinated_samples.jsonl
Phase 7: Assemble Samples¶
Module: sample_assembler.py
Combines all intermediate data into the final HallucinationSample format.
Prompt construction¶
The prompt prepends any fetched documentation to the rewritten query, e.g.:

```
Documentation for django:
...

User request: ...
```

Sample types¶
**Clean samples** (~60%): Gold patch answer, empty labels, from instances NOT selected for injection.
**Hallucinated samples** (~40%): LLM-modified answer with character-level span annotations.
Outputs¶
- `data/code_hallucination/code_hallucination_data.json` — List of samples
- `data/code_hallucination/code_hallucination_metadata.json` — Metadata (instance_id, repo, format_type, hallucination_type, injector_model, is_hallucinated)
Phase 8: Select Hallucination Targets¶
Module: splitter.py
Selects which instances receive hallucination injection. Applies the hallucination ratio (default 40%) uniformly within each split to maintain consistent class distribution.
Note: Phase 8 runs before Phase 6 in the pipeline (target selection must happen before injection).
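The per-split selection can be sketched as follows (function name and seed handling are illustrative, not the actual splitter.py API):

```python
import random

def select_targets(instances: list, ratio: float = 0.40, seed: int = 0) -> set:
    """Select ~ratio of instance_ids for hallucination injection,
    applied uniformly within each split so the class distribution is
    consistent across train/dev/test."""
    rng = random.Random(seed)
    by_split = {}
    for inst in instances:
        by_split.setdefault(inst["split"], []).append(inst["instance_id"])
    targets = set()
    for ids in by_split.values():
        k = round(len(ids) * ratio)
        targets.update(rng.sample(ids, k))
    return targets

# Stub data: 10 train + 10 dev instances.
instances = [
    {"instance_id": f"i{n}", "split": "train" if n < 10 else "dev"}
    for n in range(20)
]
targets = select_targets(instances)
```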
Output: Set of instance_ids (used in-memory by Phase 6 and Phase 7)
Phase 9: Validate¶
Module: validator.py
Runs automated quality checks and generates a report.
Checks performed¶
| Check | Description |
|---|---|
| Span validity | No negative offsets, empty spans, or out-of-bounds |
| Span coverage | Distribution of hallucinated text ratio; flags <2% or >80% |
| Distributions | Format type, hallucination type, injector model, repo, split |
| Near-duplicates | Jaccard similarity >0.95 on sampled answer pairs |
| AST parseability | For complete_function format, checks if answer parses as valid Python |
| Length statistics | Prompt and answer character length ranges |
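The near-duplicate check is based on Jaccard similarity; a minimal sketch over whitespace tokens (the validator's actual tokenization may differ):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens. Answer pairs scoring
    above 0.95 are flagged as near-duplicates."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

sim = jaccard("def f(x): return x + 1", "def f(x): return x + 2")
```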
Output: data/code_hallucination/validation_report.txt