# Data Quality & Review Process
The dataset powering squeez goes through a multi-stage quality assurance pipeline. This page documents the process, the issues found, and the corrections applied.
## Pipeline Overview

```
Raw SWE-bench tool output
            │
            ▼
Teacher distillation (gpt-oss-120b selects spans)
            │
            ▼
Zero-hallucination extraction (spans → actual text lines)
            │
            ▼
Automated QA (structural validation)
            │
            ▼
Manual review (test split only)
            │
            ▼
Final dataset on HuggingFace
```
## Automated QA

Every sample passes structural checks:

- JSON validity: the response parses as `{"relevant_lines": [...]}`
- Grounding: every line in `relevant_lines` exists in the prompt (the raw tool output)
- Ordering: selected lines appear in the same order as in the original output
- Metadata consistency: `num_relevant_lines` matches the actual count in the response
- Deduplication: no duplicate lines in `relevant_lines`
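The checks above can be sketched as a single validation function. This is a minimal illustration, not the project's actual QA code: the field names (`prompt`, `response`, `metadata`) follow the sample layout described on this page, and exact comparison semantics are assumptions.

```python
import json

def qa_check(sample: dict) -> dict:
    """Sketch of the structural QA checks described above (assumed field names)."""
    results = {}
    # JSON validity: the response must parse as {"relevant_lines": [...]}
    try:
        lines = json.loads(sample["response"])["relevant_lines"]
        results["response_json_valid"] = isinstance(lines, list)
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"response_json_valid": False}

    prompt_lines = sample["prompt"].splitlines()
    # Grounding: every selected line must exist verbatim in the prompt
    results["relevant_lines_present_in_prompt"] = all(l in prompt_lines for l in lines)
    # Ordering: selected lines must appear in the same order as in the prompt
    positions = [prompt_lines.index(l) for l in lines if l in prompt_lines]
    results["relevant_lines_in_prompt_order"] = positions == sorted(positions)
    # Metadata consistency: num_relevant_lines must match the actual count
    results["num_relevant_lines_consistent"] = (
        sample["metadata"].get("num_relevant_lines") == len(lines)
    )
    # Deduplication: count repeated lines in the selection
    results["duplicate_relevant_line_count"] = len(lines) - len(set(lines))
    return results
```

A sample passes when every boolean check is true and the duplicate count is zero.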
## Metadata Fix (v0.1.1)
The initial release had stale metadata inherited from the distillation phase. Specifically:
- `num_relevant_lines` was computed before post-processing (filtering of `...` separators and empty lines)
- `num_total_lines` came from upstream span counts, not from the actual tool output
- `compression_ratio` was derived from the stale counts
Impact: 3,755 / 7,148 train samples and 230 / 436 test samples had incorrect num_relevant_lines. The actual prompt and response content (what the model trains on) was always correct — only the metadata was wrong.
Fix: All three metadata fields are now recomputed from the actual assembled content. The assembler (`sample_assembler.py`) was also patched to prevent this in future regenerations.
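The recomputation amounts to deriving all three fields from the assembled sample rather than from upstream spans. A minimal sketch, with two explicit assumptions: the prompt is treated as the raw tool output, and `compression_ratio` is taken to be the fraction of output lines kept.

```python
import json

def recompute_metadata(sample: dict) -> dict:
    """Recompute the three stale fields from the assembled content (a sketch;
    field names follow this page, the ratio definition is an assumption)."""
    relevant = json.loads(sample["response"])["relevant_lines"]
    total_lines = sample["prompt"].splitlines()
    meta = dict(sample["metadata"])
    meta["num_relevant_lines"] = len(relevant)       # actual selected count
    meta["num_total_lines"] = len(total_lines)       # actual output length
    # assumed definition: fraction of output lines kept in the label
    meta["compression_ratio"] = len(relevant) / max(len(total_lines), 1)
    return meta
```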
## Manual Review (Test Split)
The full 436-sample test split was manually reviewed in 5 batches. Each sample was checked against:
- The task statement (SWE-bench issue)
- The raw tool output embedded in the prompt
- The selected `relevant_lines`
Corrections were only applied when the selected span was clearly wrong or clearly too incomplete to be useful.
### Results
| Metric | Count |
|---|---|
| Total samples reviewed | 436 / 436 |
| Corrected | 55 (12.6%) |
| Unchanged | 381 (87.4%) |
### Batch Breakdown
| Batch | Range | Reviewed | Corrected | Unchanged |
|---|---|---|---|---|
| 1 | 0–99 | 100 | 12 | 88 |
| 2 | 100–199 | 100 | 11 | 89 |
| 3 | 200–299 | 100 | 20 | 80 |
| 4 | 300–399 | 100 | 8 | 92 |
| 5 | 400–435 | 36 | 4 | 32 |
## Main Error Classes
The dominant issue was not random annotation noise. It was a repeated failure mode from the teacher distillation:
- Empty `python` labels despite clear tracebacks: the teacher returned no spans when the output was a full traceback, even though the traceback is the most relevant signal for debugging
- Overly truncated traceback labels: only the tail of a traceback was selected, missing the root cause
- Missed blocking failures: labels selected unrelated context instead of the actual error
- Short error spans: `test_output` samples where the `ImportError` or `E ...` failure lines were omitted
Most corrections involved Python tracebacks from environment/runtime failures (import-time errors that prevented the issue-specific reproduction from running).
## QA Metadata

Every test sample now carries per-sample QA metadata in `metadata.qa`:
```json
{
  "status": "pass",
  "response_json_valid": true,
  "relevant_lines_present_in_prompt": true,
  "relevant_lines_in_prompt_order": true,
  "duplicate_relevant_line_count": 0,
  "manual_review_batch1": {
    "batch": 1,
    "status": "reviewed_no_change",
    "rationale": "..."
  }
}
```
Corrected samples include the review status and rationale explaining what was changed and why.
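Downstream tooling can read this metadata to separate reviewed-unchanged samples from corrected ones. A small sketch, assuming each reviewed sample carries exactly one `manual_review_batchN` entry keyed as in the example above:

```python
def review_status(sample: dict):
    """Return (batch, status) from the per-sample QA metadata, or
    (None, None) if the sample carries no manual-review entry.
    A sketch; key names follow the example JSON above."""
    qa = sample["metadata"]["qa"]
    for key, value in qa.items():
        if key.startswith("manual_review_batch"):
            return value["batch"], value["status"]
    return None, None
```

Filtering on the returned status (e.g. keeping only `"reviewed_no_change"` samples) then gives the unchanged subset.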
## Training Split — Traceback Curation
After the test split review revealed a pattern of missing/truncated traceback labels, a targeted curation pass was applied to the training split. This was not a full manual review — it specifically fixed the highest-confidence traceback-related label errors.
### Scope
Only three categories were touched:
| Category | Before | After | Corrected |
|---|---|---|---|
| Empty `python` labels with clear traceback/error block | 64 | 0 | 64 |
| Short `python` spans with clear traceback | 52 | 0 | 52 |
| Short `test_output` spans with clear error block | 7 | 0 | 7 |
| **Total** | **123** | **0** | **123** |
Out of 6,790 train samples, 123 (1.8%) were corrected. The remaining 6,667 samples were screened by the heuristic and left unchanged.
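A candidate-mining heuristic of this kind can be sketched as follows. This is an illustration under assumed field names and patterns, not the logic of the actual curation script:

```python
import json
import re

# Signals for a "clear traceback/error block" (assumed patterns, not the
# exact ones used by the curation pass)
TRACEBACK_RE = re.compile(
    r"Traceback \(most recent call last\):|^E\s+\S*Error", re.MULTILINE
)

def is_traceback_candidate(sample: dict, min_span: int = 3) -> bool:
    """Flag samples whose label is empty or suspiciously short even though
    the raw tool output contains a traceback or pytest error line."""
    relevant = json.loads(sample["response"])["relevant_lines"]
    if not TRACEBACK_RE.search(sample["prompt"]):
        return False  # no error block in the output: nothing to recover
    return len(relevant) < min_span  # empty, or too short for the root cause
```

Flagged samples are then fixed by hand (or by a targeted rule), which keeps the pass high-confidence rather than a blanket relabeling.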
### What Was Fixed
Corrected samples now properly include the traceback or error block that was previously omitted or truncated. These were dominated by:
- Missing compiled extensions (`ImportError` from C extensions)
- Missing optional dependencies
- Import failures from Python/package version incompatibilities
- pytest configuration/import failures with clear traceback output
### What Was Not Fixed
This pass deliberately did not touch:
- Subtle semantic selection mistakes in non-traceback samples
- Ambiguous cases where multiple spans could plausibly be selected
- Empty labels for `read_file`, `grep`, or `git_log` tools, where emptiness may be correct
- General semantic relabeling across all tool types
### QA Metadata

Every train sample now carries curation metadata in `metadata.qa.traceback_train_curation_v1`, recording whether the sample was corrected by this pass or left unchanged.
### Artifacts

- Curation script: `scripts/curate_train_tracebacks.py`
- Curation notes: `data/train_curated_tracebacks_v1_notes.json`
- Full report: `data/train_curated_tracebacks_v1_report.md`
## Summary of All Data Quality Passes
| Split | Pass | Samples | Corrected | Method |
|---|---|---|---|---|
| Both | Metadata fix | 7,584 | 3,985 metadata fields | Automated recomputation |
| Test | Manual review | 436 | 55 (12.6%) | Human review, 5 batches |
| Train | Traceback curation | 6,790 | 123 (1.8%) | Heuristic + targeted fix |
## Recommendations for Contributors
If you want to further improve the dataset:
- Structural fixes: run the automated QA and fix any structural issues
- Candidate mining: use heuristics to find high-confidence correction candidates
- Manual review: review candidates and apply corrections where the label is clearly wrong
- Do not blindly auto-relabel the entire training set — the judgment calls on what constitutes "relevant" output require context about the task