Data Generation

Squeez supports one public dataset-generation workflow:

  • fresh generation from SWE-bench instances plus synthetic generation

Canonical format

The source of truth is:

  • query
  • tool_output
  • gold_spans

Qwen and encoder datasets are derived from that canonical representation.

Positive samples should produce non-empty gold_spans. If a task-derived query fails on a positive sample, the relabeler retries with a tool-content-first query. If that still yields no spans, the sample is dropped rather than kept as an empty label. Empty rows are reserved for explicit negatives.
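
For illustration, a canonical row might look like the sketch below. The query, file paths, and span encoding are hypothetical; in particular, treating gold_spans as inclusive 1-based line ranges over the raw tool_output is an assumption, not the documented schema.

canonical_row = {
    "query": "Which call sites pass a timeout to fetch_url?",
    "tool_output": (
        "src/client.py:42: fetch_url(url, timeout=30)\n"
        "src/cli.py:18: fetch_url(url)\n"
    ),
    # Assumed encoding: inclusive 1-based line ranges over tool_output.
    "gold_spans": [[1, 1]],
}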

Fresh build overview

Phase 1: Load SWE-bench instances
Phase 2: Clone repos (bare git clones)
Phase 3: Generate tool calls (3-7 per instance)
Phase 4: Execute tool calls (real commands against repos)
Phase 5: Auto-label (heuristic relevance scoring)
Phase 6: LLM relabeling/distillation (teacher writes focused query + selects relevant spans)
Phase 7: Assemble canonical/Qwen/encoder splits
Phase 8: Validate dataset quality

Build from scratch

python scripts/build_full_dataset.py \
    --output-dir data/v3 \
    --teacher-model openai/gpt-oss-120b \
    --teacher-base-url http://localhost:8000/v1

Key design decisions

Real tool execution

SWE tool calls are executed as real commands (git grep, git blame, git log, pytest, ruff, python, etc.) against bare-cloned repos checked out at the correct SWE-bench base commit. Synthetic samples produce realistic raw tool output, but the final labels are still grounded as spans over that raw output.
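
As an illustration, a grep-style tool call can be executed by pointing git at the bare clone and at the SWE-bench base commit. This is a minimal sketch, not the builder's actual helper; the function name and error handling are assumptions.

import subprocess

def run_git_grep(repo_dir: str, pattern: str, base_commit: str) -> str:
    """Run git grep inside a bare clone at a given commit and return the raw
    text output that later gets labeled with spans."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "grep", "-n", pattern, base_commit],
        capture_output=True,
        text=True,
        check=False,  # git grep exits 1 when nothing matches
    )
    return result.stdout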

Grounded labels

The teacher model writes a focused extraction query and returns contiguous spans over a numbered view of the output. Those spans are mapped back onto the original raw output. Canonical labels never store synthetic line numbers.
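
A minimal sketch of that round trip, assuming inclusive 1-based line spans (the helper names are hypothetical):

def numbered_view(raw_output: str) -> str:
    """Render raw tool output with 1-based line numbers for the teacher to cite."""
    lines = raw_output.splitlines()
    return "\n".join(f"{i}: {line}" for i, line in enumerate(lines, start=1))

def span_text(raw_output: str, start: int, end: int) -> str:
    """Map an inclusive (start, end) span cited against the numbered view back
    onto the original raw output, so stored labels carry no synthetic numbers."""
    return "\n".join(raw_output.splitlines()[start - 1:end])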

Content-first fallback

The preferred query is still derived from the original task, but only when the tool output can actually answer it. If the task-derived query yields no spans for a positive sample, Squeez retries with a query driven primarily by the tool content itself. This is especially important for reused raw outputs where the original tool/file selection may have been noisy.
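
Sketched as pseudologic (the teacher-client methods and sample fields are assumptions, not the real interface):

def relabel_positive(sample: dict, teacher) -> tuple[str, list] | None:
    """Try the task-derived query first, then a content-first retry, else drop."""
    spans = teacher.select_spans(sample["task_query"], sample["tool_output"])
    if spans:
        return sample["task_query"], spans

    # Retry with a query written primarily from the tool content itself.
    content_query = teacher.write_content_query(sample["tool_output"])
    spans = teacher.select_spans(content_query, sample["tool_output"])
    if spans:
        return content_query, spans

    # Positive samples must have non-empty gold_spans, so drop the sample
    # rather than keep an empty label (empty rows are explicit negatives only).
    return None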

One source of truth

Canonical rows are converted into:

  • Qwen SFT rows (prompt, XML response)
  • encoder rows (task, tool_output, relevant_lines)

so training, evaluation, and QA all derive from the same grounded spans.
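
A minimal sketch of those conversions, assuming the canonical row shape illustrated earlier; the relevant_lines derivation and the XML tag name in the Qwen response are assumptions:

def to_encoder_row(row: dict) -> dict:
    """Derive an encoder row: relevant_lines are the 1-based line indices
    covered by gold_spans (assumed span encoding)."""
    relevant = sorted({
        i
        for start, end in row["gold_spans"]
        for i in range(start, end + 1)
    })
    return {
        "task": row["query"],
        "tool_output": row["tool_output"],
        "relevant_lines": relevant,
    }

def to_qwen_row(row: dict) -> dict:
    """Derive a Qwen SFT row: the prompt pairs the query with the raw tool
    output, and the response wraps the gold span text in XML-style tags."""
    lines = row["tool_output"].splitlines()
    kept = "\n".join("\n".join(lines[s - 1:e]) for s, e in row["gold_spans"])
    prompt = f"Query: {row['query']}\n\nTool output:\n{row['tool_output']}"
    return {"prompt": prompt, "response": f"<relevant>\n{kept}\n</relevant>"}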

Tool types

| Tool Type | Description |
|-----------|-------------|
| read_file | Source file contents via git show |
| grep | Code search via git grep |
| python | Python command execution |
| git_log | Commit history |
| test_output | Test runner output |
| git_diff | Diff output |
| git_blame | Line-level attribution |
| ls | Directory listings |
| lint_output | Linter warnings (ruff) |
| build_output | Build/compile output |
| curl | HTTP responses |