Data Generation

The training dataset is built from real tool execution against SWE-bench repositories.

Pipeline overview

Phase 1: Load SWE-bench instances
Phase 2: Clone repos (bare git clones)
Phase 3: Generate tool calls (3-7 per instance)
Phase 4: Execute tool calls (real commands against repos)
Phase 5: Auto-label (heuristic relevance scoring)
Phase 6: LLM distillation (teacher selects relevant spans)
Phase 7: Assemble into train/eval JSONL
Phase 8: Validate dataset quality
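The phases run in ascending order, and the CLI accepts any subset via `--phase`. A minimal sketch of that dispatch pattern (the phase names and registry here are hypothetical, not squeez's actual internals):

```python
# Hypothetical phase registry mirroring the eight phases above.
PHASES = {
    1: "load_instances",
    2: "clone_repos",
    3: "generate_tool_calls",
    4: "execute_tool_calls",
    5: "auto_label",
    6: "llm_distill",
    7: "assemble_jsonl",
    8: "validate",
}

def run_phases(selected, registry):
    """Run the selected phase numbers in ascending order."""
    results = []
    for n in sorted(selected):
        results.append(registry[PHASES[n]]())
    return results
```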

Running the full pipeline

```shell
squeez pipeline --phase 1 2 3 4 5 6 7 8 \
    --output-dir data \
    --github-token $GITHUB_TOKEN \
    --teacher-api-key $GROQ_API_KEY \
    --teacher-base-url https://api.groq.com/openai/v1
```

Running individual phases

```shell
# Just phase 4 (execute tool calls)
squeez pipeline --phase 4 --output-dir data

# Phase 6 (LLM distillation) with custom concurrency
squeez pipeline --phase 6 --output-dir data --concurrency 5
```

Key design decisions

Real tool execution

All tool calls are executed as real commands (git grep, git blame, git log, pytest, ruff, python, etc.) against bare clones of the SWE-bench repos, pinned to each instance's correct base commit. Bare clones have no working tree, so file contents are read directly from the object store (e.g. via git show) rather than from a checkout. No output is simulated.

Zero-hallucination extraction

The teacher model (gpt-oss-120b) returns JSON spans ({"spans": [{"start": N, "end": M}]}), which are matched against the original output to extract actual text lines. The student never sees generated text — only real lines from the original output.
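The extraction step can be sketched like this. It assumes 1-based, inclusive line ranges (an assumption for this sketch); the key property is that output is only ever indexed out of the original text, so the teacher cannot inject lines:

```python
import json

def extract_spans(teacher_json, original_output):
    """Keep only real lines from the original tool output.

    The teacher returns {"spans": [{"start": N, "end": M}]}. Any text it
    might hallucinate is ignored: we index into the original, nothing else.
    """
    lines = original_output.splitlines()
    spans = json.loads(teacher_json)["spans"]
    kept = []
    for s in spans:
        # Clamp to the valid range so a malformed span cannot invent lines.
        start = max(1, s["start"])
        end = min(len(lines), s["end"])
        kept.extend(lines[start - 1:end])
    return "\n".join(kept)
```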

Repo-based train/eval split

Train and eval splits have zero repo overlap. Eval repos (xarray, flask) are entirely held out.
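A minimal sketch of such a split, assuming each instance dict carries a repo field as SWE-bench instances do (the repo identifiers below are illustrative):

```python
def split_by_repo(instances, eval_repos):
    """Partition instances so train and eval share no repositories."""
    eval_repos = set(eval_repos)
    train = [i for i in instances if i["repo"] not in eval_repos]
    eval_ = [i for i in instances if i["repo"] in eval_repos]
    # Sanity check: zero repo overlap between the two splits.
    assert not {i["repo"] for i in train} & {i["repo"] for i in eval_}
    return train, eval_
```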

Tool types

| Tool type    | Weight | Description                       |
| ------------ | ------ | --------------------------------- |
| read_file    | 25%    | Source file contents via git show |
| grep         | 15%    | Code search via git grep          |
| python       | 10%    | Python command execution          |
| git_log      | 10%    | Commit history                    |
| test_output  | 10%    | Test runner output                |
| git_diff     | 5%     | Diff output                       |
| git_blame    | 5%     | Line-level attribution            |
| ls           | 5%     | Directory listings                |
| lint_output  | 5%     | Linter warnings (ruff)            |
| build_output | 5%     | Build/compile output              |
| curl         | 5%     | HTTP responses                    |
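The weights sum to 100%, so tool types can be drawn with a straightforward weighted sample. A sketch (how squeez actually applies these weights is not specified here; this just shows the table in executable form):

```python
import random

# Weights from the table above, in percent.
TOOL_WEIGHTS = {
    "read_file": 25, "grep": 15, "python": 10, "git_log": 10,
    "test_output": 10, "git_diff": 5, "git_blame": 5, "ls": 5,
    "lint_output": 5, "build_output": 5, "curl": 5,
}

def sample_tools(n, rng=None):
    """Draw n tool types, with replacement, according to the table's weights."""
    rng = rng or random.Random()
    return rng.choices(list(TOOL_WEIGHTS), weights=list(TOOL_WEIGHTS.values()), k=n)
```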