Data Generation¶
Squeez supports one public dataset-generation workflow:
- fresh generation from SWE-bench + synthetic generation
Canonical format¶
The source of truth is:
- `query`
- `tool_output`
- `gold_spans`
Qwen and encoder datasets are derived from that canonical representation.
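Concretely, a canonical row can be pictured as a small record. The field names below are the three canonical fields above; the values and the span convention (1-based, inclusive line ranges) are illustrative assumptions, not guaranteed details of the real format:

```python
# Illustrative canonical row. Field names match the canonical format;
# the values are made up, and gold_spans are assumed to be 1-based,
# inclusive (start, end) line ranges over tool_output.
canonical_row = {
    "query": "Where is the retry limit enforced?",
    "tool_output": "def retry(task):\n    if task.attempts > MAX_RETRIES:\n        raise RetryError()",
    "gold_spans": [(2, 3)],
}

def span_text(row):
    """Materialize each gold span as the lines it covers."""
    lines = row["tool_output"].splitlines()
    return [lines[start - 1:end] for start, end in row["gold_spans"]]
```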
Positive samples should produce non-empty gold_spans. If a task-derived
query fails on a positive sample, the relabeler retries with a
tool-content-first query. If that still yields no spans, the sample is
dropped rather than kept as an empty label. Empty rows are reserved for
explicit negatives.
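The retry-then-drop policy can be sketched as follows; `label_spans` and the query field names are hypothetical stand-ins for the real relabeler:

```python
def relabel_positive(sample, label_spans):
    """Label a positive sample, preferring the task-derived query.

    Sketch of the policy described above: retry with a
    tool-content-first query, then drop (return None) rather than
    keep an empty label. `label_spans(sample, query)` is a
    hypothetical stand-in for the teacher relabeler.
    """
    spans = label_spans(sample, query=sample["task_query"])
    if spans:
        return spans
    # Fallback: query driven primarily by the tool content itself.
    spans = label_spans(sample, query=sample["content_query"])
    # Empty rows are reserved for explicit negatives, so drop instead.
    return spans or None
```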
Fresh build overview¶
SWE-bench instances
│
▼
Phase 1: Load SWE-bench instances
│
▼
Phase 2: Clone repos (bare git clones)
│
▼
Phase 3: Generate tool calls (3-7 per instance)
│
▼
Phase 4: Execute tool calls (real commands against repos)
│
▼
Phase 5: Auto-label (heuristic relevance scoring)
│
▼
Phase 6: LLM relabeling/distillation (teacher writes focused query + selects relevant spans)
│
▼
Phase 7: Assemble canonical/Qwen/encoder splits
│
▼
Phase 8: Validate dataset quality
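Each phase consumes the previous phase's output, so the whole build can be pictured as a simple fold; the phase functions themselves are hypothetical stand-ins for the real implementation:

```python
def run_pipeline(instances, phases):
    """Run each phase over the evolving dataset state, in order.

    `phases` is a list of (name, fn) pairs; each fn takes the output
    of the previous phase. The concrete phase functions are
    hypothetical placeholders, not the real pipeline API.
    """
    state = instances
    for name, phase in phases:
        state = phase(state)
    return state
```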
Build from scratch¶
```shell
python scripts/build_full_dataset.py \
  --output-dir data/v3 \
  --teacher-model openai/gpt-oss-120b \
  --teacher-base-url http://localhost:8000/v1
```
Key design decisions¶
Real tool execution¶
SWE tool calls are executed as real commands (git grep, git blame, git log, pytest, ruff, python, etc.) against bare-cloned repos checked out at the correct SWE-bench base commit. Synthetic samples produce realistic raw tool output, but the final labels are still grounded as spans over that raw output.
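A sketch of how one tool call might be issued against a bare clone. The `git` invocations are standard (`git grep` and `git show` both accept a commit and work without a worktree); the surrounding functions and their names are assumptions, not the real pipeline API:

```python
import subprocess

def build_command(repo_dir, base_commit, tool, args):
    """Build the real command for a tool call (sketch: two tools only)."""
    if tool == "grep":
        # `git grep <pattern> <commit>` searches a commit; no worktree needed.
        return ["git", "-C", repo_dir, "grep", "-n", *args, base_commit]
    if tool == "read_file":
        # `git show <commit>:<path>` reads file contents from a bare clone.
        return ["git", "-C", repo_dir, "show", f"{base_commit}:{args[0]}"]
    raise ValueError(f"tool not covered by this sketch: {tool}")

def run_tool_call(repo_dir, base_commit, tool, args):
    """Execute the command and return its raw stdout as the tool output."""
    proc = subprocess.run(
        build_command(repo_dir, base_commit, tool, args),
        capture_output=True, text=True,
    )
    return proc.stdout
```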
Grounded labels¶
The teacher model writes a focused extraction query and returns contiguous spans over a numbered view of the output. Those spans are mapped back onto the original raw output. Canonical labels never store synthetic line numbers.
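The numbered view and the mapping back to raw output can be sketched like this (1-based numbering and inclusive spans are assumptions of the sketch; the point is that the canonical row keeps only the raw text the spans cover):

```python
def numbered_view(raw):
    """Render raw tool output with 1-based line numbers for the teacher."""
    return "\n".join(f"{i}: {line}" for i, line in enumerate(raw.splitlines(), 1))

def spans_to_raw(raw, spans):
    """Map (start, end) line spans from the numbered view back onto the
    original raw output. The line numbers themselves are never stored."""
    lines = raw.splitlines()
    return ["\n".join(lines[start - 1:end]) for start, end in spans]
```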
Content-first fallback¶
The preferred query is still derived from the original task, but only when the tool output can actually answer it. If the task-derived query yields no spans for a positive sample, Squeez retries with a query driven primarily by the tool content itself. This is especially important for reused raw outputs where the original tool/file selection may have been noisy.
One source of truth¶
Canonical rows are converted into:
- Qwen SFT rows (`prompt`, XML `response`)
- encoder rows (`task`, `tool_output`, `relevant_lines`)
so training, evaluation, and QA all derive from the same grounded spans.
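For example, the encoder view might be derived from a canonical row like so (field names come from the lists above; the 1-based inclusive span convention and the function itself are assumptions of this sketch):

```python
def to_encoder_row(canonical):
    """Derive an encoder row (task, tool_output, relevant_lines) from a
    canonical row. gold_spans assumed to be 1-based inclusive line ranges."""
    relevant = sorted({
        line
        for start, end in canonical["gold_spans"]
        for line in range(start, end + 1)
    })
    return {
        "task": canonical["query"],
        "tool_output": canonical["tool_output"],
        "relevant_lines": relevant,
    }
```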
Tool types¶
| Tool Type | Description |
|-----------|-------------|
| read_file | Source file contents via git show |
| grep | Code search via git grep |
| python | Python command execution |
| git_log | Commit history |
| test_output | Test runner output |
| git_diff | Diff output |
| git_blame | Line-level attribution |
| ls | Directory listings |
| lint_output | Linter warnings (ruff) |
| build_output | Build/compile output |
| curl | HTTP responses |