Pipeline Phases¶
Detailed documentation for each of the 9 pipeline phases.
Phase 1: Load SWE-bench¶
Module: swebench_loader.py
Loads all SWE-bench splits from HuggingFace and tags each instance with split and Lite membership.
| Split | Instances | Repos | Source |
|---|---|---|---|
| Train | 19,008 | 35 | princeton-nlp/SWE-bench train |
| Dev | 225 | 6 | princeton-nlp/SWE-bench dev |
| Test | 2,294 | 12 | princeton-nlp/SWE-bench test |
| Lite | 300 | 12 | princeton-nlp/SWE-bench_Lite (subset of test) |
Key function: load_all_splits() -> list[dict]
Each instance includes: instance_id, repo, base_commit, patch, problem_statement, split, is_lite.
Output: data/code_hallucination/swebench_instances.json
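The split/Lite tagging can be sketched as below. The helper name and argument shapes are illustrative, not the actual `swebench_loader.py` API; real instances come from the princeton-nlp/SWE-bench datasets on HuggingFace.

```python
def tag_instances(split_instances, lite_ids):
    """Tag each instance with its split name and Lite membership.

    split_instances: dict mapping split name -> list of instance dicts
    lite_ids: set of instance_ids in SWE-bench_Lite (a subset of test)
    """
    tagged = []
    for split, instances in split_instances.items():
        for inst in instances:
            inst = dict(inst)  # copy so the caller's data is not mutated
            inst["split"] = split
            inst["is_lite"] = inst["instance_id"] in lite_ids
            tagged.append(inst)
    return tagged

# Stub data for illustration only:
instances = tag_instances(
    {"test": [{"instance_id": "django__django-11001"}]},
    lite_ids={"django__django-11001"},
)
```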
Phase 2: Fetch Sources¶
Module: source_fetcher.py
Clones repositories and extracts source code at the base commit for each instance. Builds three answer format variants.
Strategy¶
- Default: Clone repos as bare git repos to `data/code_hallucination/repos/`. Use `git show {commit}:{path}` for instant file access.
- Test mode: Use the GitHub raw API (`raw.githubusercontent.com`) — slower but no cloning needed.
- Fallback: If cloning fails, automatically falls back to the GitHub API.
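The raw-API fallback only needs a URL of a fixed shape. A minimal sketch (the function name is hypothetical, not the module's actual helper):

```python
def raw_github_url(repo: str, commit: str, path: str) -> str:
    """Build a raw.githubusercontent.com URL for a file at a specific commit."""
    return f"https://raw.githubusercontent.com/{repo}/{commit}/{path}"

url = raw_github_url("django/django", "abc123", "django/http/response.py")
```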
What it extracts per instance¶
| Field | Description |
|---|---|
| changed_files | File paths modified by the gold patch |
| source_files | Original source code at base commit |
| patch_code | Added/changed lines from the diff (fragment format) |
| edit_style | "In file X, replace Y with Z" format |
| modified_functions | AST-extracted functions that changed (complete function format) |
Key functions¶
- extract_changed_files(patch) — Parse unified diff for file paths (anchored regex, not lstrip("b/"))
- clone_repo(repo) — git clone --bare with 30-minute timeout
- fetch_file_at_commit(repo_dir, commit, filepath) — git show for file contents
- apply_patch_and_get_file(repo_dir, commit, patch, filepath) — Apply patch in a temp worktree
- extract_modified_functions(original, patched) — AST-based function diff
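The anchored-regex approach to extract_changed_files can be sketched as follows; the actual implementation may differ, but the point of anchoring at line start is that lstrip("b/") would strip leading "b" and "/" characters from the path itself:

```python
import re

def extract_changed_files(patch: str) -> list[str]:
    """Extract file paths from a unified diff by matching '+++ b/<path>'
    anchored at the start of a line."""
    return re.findall(r"^\+\+\+ b/(.+)$", patch, flags=re.MULTILINE)

patch = """\
diff --git a/django/http/response.py b/django/http/response.py
--- a/django/http/response.py
+++ b/django/http/response.py
@@ -1 +1 @@
-old
+new
"""
files = extract_changed_files(patch)
```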
Output: data/code_hallucination/source_cache/{instance_id}.json
Phase 3: Rewrite Queries¶
Module: query_rewriter.py
Transforms raw GitHub issue problem_statement fields into natural developer queries using an LLM.
Example¶
Before (raw issue):
BUG: DataFrame.groupby with as_index=False gives wrong result when grouping by single column with duplicate name. Steps to reproduce: ...
After (rewritten):
I'm getting wrong results when using DataFrame.groupby with as_index=False on a column that has a duplicate name. How do I fix this?
Prompt strategy¶
The LLM is instructed to:
- Write conversational, natural language
- Extract the core technical ask
- Remove GitHub formatting, reproduction steps, tracebacks
- Keep to 1-3 sentences
Resumability¶
Writes results to JSONL incrementally. On restart, skips already-processed instance_ids.
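A minimal sketch of this skip-if-done pattern (helper names are illustrative, not the module's actual API):

```python
import json
import os
import tempfile

def load_done_ids(path):
    """Return the set of instance_ids already written to the JSONL file."""
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return {json.loads(line)["instance_id"] for line in f if line.strip()}

def append_result(path, record):
    """Append one result; the file stays valid JSONL after every write."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

# Demo with a throwaway file (the real pipeline writes queries.jsonl):
path = os.path.join(tempfile.mkdtemp(), "queries.jsonl")
append_result(path, {"instance_id": "django__django-11001", "query": "How do I fix this?"})
done = load_done_ids(path)
```

On restart, any instance whose id is in `done` is skipped before calling the LLM.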
Output: data/code_hallucination/queries.jsonl
Phase 4: Fetch Documentation¶
Module: context7_docs.py
Fetches library documentation via the Context7 API for 20% of instances (configurable via DOCS_RATIO).
Library detection¶
Maps the instance's GitHub repo to its primary library (e.g., django/django → django, scikit-learn/scikit-learn → scikit-learn). Only fetches docs for the matching library — not for random imports like sys or re.
Why 20%?¶
A minority of samples include documentation context, while most don't. This creates training variety — models learn to detect hallucinations both with and without documentation support. Documentation is also passed to the hallucination injector (Phase 6), enabling SEMANTIC hallucinations that contradict documented API behavior.
Instances not selected for docs still get an entry written with empty docs (by design, not failure).
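One way to implement a stable DOCS_RATIO selection is to hash the instance_id; this is a sketch under that assumption, and the module may instead use random sampling:

```python
import hashlib

DOCS_RATIO = 0.20  # configurable in the real pipeline

def selected_for_docs(instance_id: str, ratio: float = DOCS_RATIO) -> bool:
    """Deterministically select ~ratio of instances by hashing the id.

    Hash-based selection is stable across reruns, which suits a
    resumable pipeline.
    """
    digest = hashlib.sha256(instance_id.encode()).hexdigest()
    return int(digest, 16) % 10_000 < ratio * 10_000

picked = sum(selected_for_docs(f"instance-{i}") for i in range(1000))
```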
Output: data/code_hallucination/documentation.jsonl
Phase 5: Assign Answer Formats¶
Module: format_builder.py
Each instance gets exactly one answer format, chosen by weighted random selection from available options. Uses LLM calls for code_with_explanation format.
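The weighted selection can be sketched with the format weights listed below (the helper name is illustrative):

```python
import random

FORMAT_WEIGHTS = {
    "code_with_explanation": 0.40,
    "complete_function": 0.25,
    "fragment": 0.20,
    "edit_style": 0.15,
}

def choose_format(available: list[str], rng: random.Random) -> str:
    """Pick one answer format by weighted random choice among the
    formats actually available for this instance."""
    weights = [FORMAT_WEIGHTS[f] for f in available]
    return rng.choices(available, weights=weights, k=1)[0]

rng = random.Random(0)
fmt = choose_format(list(FORMAT_WEIGHTS), rng)
```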
Format types¶
**Code with explanation** (weight: 0.40)

The issue is that `process_data` uses `dict.items()` instead of iterating
over the sorted keys, which causes non-deterministic output.

```python
def process_data(data):
    for key in sorted(data.keys()):
        yield key, data[key]
```

This ensures consistent ordering regardless of insertion order.
Natural AI assistant response with prose explanation + code block. Generated by wrapping one of the base code formats with an LLM-generated explanation. This is the most realistic format — it matches how Claude, Cursor, and other AI coding assistants actually respond.
**Complete function** (weight: 0.25)

```python
def validate_response(self, response):
    if response.status_code != 200:
        raise ValidationError(f"Unexpected status: {response.status_code}")
    return response.json()
```
**Fragment** (weight: 0.20)

```python
if max_age is not None:
    self.cookies[key]["max-age"] = max_age
    self.cookies[key]["expires"] = http_date(time.time() + max_age)
```
**Edit-style** (weight: 0.15)

```
In file django/http/response.py, replace:

def set_cookie(self, key, value=""):
    self.cookies[key] = value

with:

def set_cookie(self, key, value="", max_age=None):
    self.cookies[key] = value
    if max_age is not None:
        self.cookies[key]["max-age"] = max_age
```
Output: data/code_hallucination/formats.jsonl
Phase 6: Inject Hallucinations¶
Module: hallucination_injector.py
Uses an LLM to inject realistic hallucinations into selected instances (determined by Phase 8). Returns structured JSON with span annotations.
Hallucination types (round-robin)¶
| Type | Description | Example |
|---|---|---|
| Structural | Non-existent APIs, wrong methods, invented parameters | response.json_decode() instead of response.json() |
| Behavioral | Wrong values, logic errors, off-by-one, swapped conditions | if status >= 200 instead of if status == 200 |
| Semantic | Code that looks right but does something subtly different | Sorting ascending instead of descending |
JSON-based span extraction¶
The LLM returns structured output:
```json
{
  "hallucinated_code": "def fix(self):\n    self.data = response.json_decode()\n    ...",
  "changes": [
    {
      "original": "response.json()",
      "hallucinated": "response.json_decode()",
      "explanation": "json_decode() is not a valid method on Response objects"
    }
  ]
}
```
Spans are found by string-matching each change["hallucinated"] in hallucinated_code. This produces clean, meaningful spans (minimum 15 chars) with zero noise.
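The string-matching step can be sketched like this (a simplified version: the real module may handle repeated matches and overlaps differently):

```python
def spans_from_changes(hallucinated_code: str, changes: list) -> list:
    """Locate each change's hallucinated text in the final answer and
    return character-level (start, end) spans. Matches shorter than 15
    characters are dropped as noise."""
    spans = []
    for change in changes:
        needle = change["hallucinated"]
        start = hallucinated_code.find(needle)
        if start != -1 and len(needle) >= 15:
            spans.append((start, start + len(needle)))
    return spans

answer = "self.data = response.json_decode()"
spans = spans_from_changes(answer, [{"hallucinated": "response.json_decode()"}])
```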
For answers containing both code and prose (code_with_explanation format), the injector places errors in both parts — e.g., wrong API in code + misleading description in text.
Quality controls¶
- Each span must be 20-150 characters (enforced by prompt)
- Total hallucinated coverage must be < 40% of the answer (enforced by prompt)
- _validate_labels() rejects samples with coverage > 60% or spans < 15 chars
- Failed validation triggers up to 3 retries before skipping
- No comment data leaks (prompt explicitly forbids # wrong, # error, etc.)
Quality metrics (from 100-sample test runs)¶
| Metric | Value |
|---|---|
| Noise-only samples | 0% |
| Min span length | 15 chars |
| Avg span length | 71 chars |
| Avg spans per sample | 2.8 |
| Coverage range | 2.8-43% |
| Mean coverage | 19.5% |
Output: data/code_hallucination/hallucinated_samples.jsonl
Phase 7: Assemble Samples¶
Module: sample_assembler.py
Combines all intermediate data into the final HallucinationSample format.
Prompt construction¶
The prompt prepends any fetched documentation to the rewritten query, e.g.:

```
Documentation for django:
...

User request: ...
```

Sample types¶
**Clean samples** (~60%): Gold patch answer, empty labels, from instances NOT selected for injection.
**Hallucinated samples** (~40%): LLM-modified answer with character-level span annotations.
Outputs¶
- `data/code_hallucination/code_hallucination_data.json` — List of samples
- `data/code_hallucination/code_hallucination_metadata.json` — Metadata (instance_id, repo, format_type, hallucination_type, injector_model, is_hallucinated)
Phase 8: Select Hallucination Targets¶
Module: splitter.py
Selects which instances receive hallucination injection. Applies the hallucination ratio (default 40%) uniformly within each split to maintain consistent class distribution.
Note: Phase 8 runs before Phase 6 in the pipeline (target selection must happen before injection).
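The per-split selection can be sketched as follows (function name and seed handling are illustrative, not the actual splitter.py API):

```python
import random

def select_targets(instances: list, ratio: float = 0.40, seed: int = 0) -> set:
    """Select ~ratio of instance_ids for hallucination injection,
    applied uniformly within each split so the class distribution is
    consistent across train/dev/test."""
    rng = random.Random(seed)
    by_split = {}
    for inst in instances:
        by_split.setdefault(inst["split"], []).append(inst["instance_id"])
    targets = set()
    for ids in by_split.values():
        k = round(len(ids) * ratio)
        targets.update(rng.sample(ids, k))
    return targets

# Stub data: 10 train + 10 dev instances.
instances = [
    {"instance_id": f"i{n}", "split": "train" if n < 10 else "dev"}
    for n in range(20)
]
targets = select_targets(instances)
```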
Output: Set of instance_ids (used in-memory by Phase 6 and Phase 7)
Phase 9: Validate¶
Module: validator.py
Runs automated quality checks and generates a report.
Checks performed¶
| Check | Description |
|---|---|
| Span validity | No negative offsets, empty spans, or out-of-bounds |
| Span coverage | Distribution of hallucinated text ratio; flags <2% or >80% |
| Distributions | Format type, hallucination type, injector model, repo, split |
| Near-duplicates | Jaccard similarity >0.95 on sampled answer pairs |
| AST parseability | For complete_function format, checks if answer parses as valid Python |
| Length statistics | Prompt and answer character length ranges |
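The near-duplicate check is based on Jaccard similarity; a minimal sketch over whitespace tokens (the validator's actual tokenization may differ):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens. Answer pairs scoring
    above 0.95 are flagged as near-duplicates."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

sim = jaccard("def f(x): return x + 1", "def f(x): return x + 2")
```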
Output: data/code_hallucination/validation_report.txt