Data Generation

The training dataset is built from real tool execution against SWE-bench repositories.

Pipeline overview

Phase 1: Load SWE-bench instances
Phase 2: Clone repos (bare git clones)
Phase 3: Generate tool calls (3-7 per instance)
Phase 4: Execute tool calls (real commands against repos)
Phase 5: Auto-label (heuristic relevance scoring)
Phase 6: LLM distillation (teacher selects relevant spans)
Phase 7: Assemble into train/eval JSONL
Phase 8: Validate dataset quality
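The phases run in ascending order, and the CLI accepts any subset via `--phase`. A minimal sketch of that dispatch pattern (the phase names and registry here are hypothetical, not squeez's actual internals):

```python
# Hypothetical phase registry mirroring the eight phases above.
PHASES = {
    1: "load_instances",
    2: "clone_repos",
    3: "generate_tool_calls",
    4: "execute_tool_calls",
    5: "auto_label",
    6: "llm_distill",
    7: "assemble_jsonl",
    8: "validate",
}

def run_phases(selected, registry):
    """Run the selected phase numbers in ascending order."""
    results = []
    for n in sorted(selected):
        results.append(registry[PHASES[n]]())
    return results
```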

Running the full pipeline

```shell
squeez pipeline --phase 1 2 3 4 5 6 7 8 \
    --output-dir data \
    --github-token $GITHUB_TOKEN \
    --teacher-api-key $GROQ_API_KEY \
    --teacher-base-url https://api.groq.com/openai/v1
```

Running individual phases

```shell
# Just phase 4 (execute tool calls)
squeez pipeline --phase 4 --output-dir data

# Phase 6 (LLM distillation) with custom concurrency
squeez pipeline --phase 6 --output-dir data --concurrency 5
```

Key design decisions

Real tool execution

All tool calls are executed as real commands (git grep, git blame, git log, pytest, ruff, python, etc.) against bare clones of the SWE-bench repos, pinned to each instance's correct base commit. Bare clones have no working tree, so file contents are read directly from the object store (e.g. via git show) rather than from a checkout. No output is simulated.

Zero-hallucination extraction

The teacher model (gpt-oss-120b) returns JSON spans ({"spans": [{"start": N, "end": M}]}), which are matched against the original output to extract actual text lines. The student never sees generated text — only real lines from the original output.
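The extraction step can be sketched like this. It assumes 1-based, inclusive line ranges (an assumption for this sketch); the key property is that output is only ever indexed out of the original text, so the teacher cannot inject lines:

```python
import json

def extract_spans(teacher_json, original_output):
    """Keep only real lines from the original tool output.

    The teacher returns {"spans": [{"start": N, "end": M}]}. Any text it
    might hallucinate is ignored: we index into the original, nothing else.
    """
    lines = original_output.splitlines()
    spans = json.loads(teacher_json)["spans"]
    kept = []
    for s in spans:
        # Clamp to the valid range so a malformed span cannot invent lines.
        start = max(1, s["start"])
        end = min(len(lines), s["end"])
        kept.extend(lines[start - 1:end])
    return "\n".join(kept)
```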

Repo-based train/eval split

Train and eval splits have zero repo overlap. Eval repos (xarray, flask) are entirely held out.
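A minimal sketch of such a split, assuming each instance dict carries a repo field as SWE-bench instances do (the repo identifiers below are illustrative):

```python
def split_by_repo(instances, eval_repos):
    """Partition instances so train and eval share no repositories."""
    eval_repos = set(eval_repos)
    train = [i for i in instances if i["repo"] not in eval_repos]
    eval_ = [i for i in instances if i["repo"] in eval_repos]
    # Sanity check: zero repo overlap between the two splits.
    assert not {i["repo"] for i in train} & {i["repo"] for i in eval_}
    return train, eval_
```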

Tool types

| Tool type    | Weight | Description                       |
| ------------ | ------ | --------------------------------- |
| read_file    | 25%    | Source file contents via git show |
| grep         | 15%    | Code search via git grep          |
| python       | 10%    | Python command execution          |
| git_log      | 10%    | Commit history                    |
| test_output  | 10%    | Test runner output                |
| git_diff     | 5%     | Diff output                       |
| git_blame    | 5%     | Line-level attribution            |
| ls           | 5%     | Directory listings                |
| lint_output  | 5%     | Linter warnings (ruff)            |
| build_output | 5%     | Build/compile output              |
| curl         | 5%     | HTTP responses                    |
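The weights sum to 100%, so tool types can be drawn with a straightforward weighted sample. A sketch (how squeez actually applies these weights is not specified here; this just shows the table in executable form):

```python
import random

# Weights from the table above, in percent.
TOOL_WEIGHTS = {
    "read_file": 25, "grep": 15, "python": 10, "git_log": 10,
    "test_output": 10, "git_diff": 5, "git_blame": 5, "ls": 5,
    "lint_output": 5, "build_output": 5, "curl": 5,
}

def sample_tools(n, rng=None):
    """Draw n tool types, with replacement, according to the table's weights."""
    rng = rng or random.Random()
    return rng.choices(list(TOOL_WEIGHTS), weights=list(TOOL_WEIGHTS.values()), k=n)
```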