# Data Quality & Review Process
The dataset powering squeez goes through a multi-stage quality assurance pipeline. This page documents the process, the issues found, and the corrections applied.
## Pipeline Overview

```
Raw SWE-bench tool output
            │
            ▼
Teacher distillation (gpt-oss-120b selects spans)
            │
            ▼
Zero-hallucination extraction (spans → actual text lines)
            │
            ▼
Automated QA (structural validation)
            │
            ▼
Manual review (test split only)
            │
            ▼
Final dataset on HuggingFace
```
## Automated QA

Every sample passes structural checks:

- JSON validity: the response parses as `{"relevant_lines": [...]}`
- Grounding: every line in `relevant_lines` exists in the prompt (the raw tool output)
- Ordering: selected lines appear in the same order as in the original output
- Metadata consistency: `num_relevant_lines` matches the actual count in the response
- Deduplication: no duplicate lines in `relevant_lines`
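The checks above can be sketched as a single validation function. This is a minimal illustration, not the project's actual QA code: the field names (`prompt`, `response`, `metadata`) follow the sample layout described on this page, and exact comparison semantics are assumptions.

```python
import json

def qa_check(sample: dict) -> dict:
    """Sketch of the structural QA checks described above (assumed field names)."""
    results = {}
    # JSON validity: the response must parse as {"relevant_lines": [...]}
    try:
        lines = json.loads(sample["response"])["relevant_lines"]
        results["response_json_valid"] = isinstance(lines, list)
    except (json.JSONDecodeError, KeyError, TypeError):
        return {"response_json_valid": False}

    prompt_lines = sample["prompt"].splitlines()
    # Grounding: every selected line must exist verbatim in the prompt
    results["relevant_lines_present_in_prompt"] = all(l in prompt_lines for l in lines)
    # Ordering: selected lines must appear in the same order as in the prompt
    positions = [prompt_lines.index(l) for l in lines if l in prompt_lines]
    results["relevant_lines_in_prompt_order"] = positions == sorted(positions)
    # Metadata consistency: num_relevant_lines must match the actual count
    results["num_relevant_lines_consistent"] = (
        sample["metadata"].get("num_relevant_lines") == len(lines)
    )
    # Deduplication: count repeated lines in the selection
    results["duplicate_relevant_line_count"] = len(lines) - len(set(lines))
    return results
```

A sample passes when every boolean check is true and the duplicate count is zero.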
## Metadata Fix (v0.1.1)
The initial release had stale metadata inherited from the distillation phase. Specifically:
- `num_relevant_lines` was computed before post-processing (filtering of `...` separators and empty lines)
- `num_total_lines` came from upstream span counts, not from the actual tool output
- `compression_ratio` was derived from the stale counts
Impact: 3,755 / 7,148 train samples and 230 / 436 test samples had incorrect num_relevant_lines. The actual prompt and response content (what the model trains on) was always correct — only the metadata was wrong.
Fix: All three metadata fields are now recomputed from the actual assembled content. The assembler (`sample_assembler.py`) was also patched to prevent this in future regenerations.
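The recomputation amounts to deriving all three fields from the assembled sample rather than from upstream spans. A minimal sketch, with two explicit assumptions: the prompt is treated as the raw tool output, and `compression_ratio` is taken to be the fraction of output lines kept.

```python
import json

def recompute_metadata(sample: dict) -> dict:
    """Recompute the three stale fields from the assembled content (a sketch;
    field names follow this page, the ratio definition is an assumption)."""
    relevant = json.loads(sample["response"])["relevant_lines"]
    total_lines = sample["prompt"].splitlines()
    meta = dict(sample["metadata"])
    meta["num_relevant_lines"] = len(relevant)       # actual selected count
    meta["num_total_lines"] = len(total_lines)       # actual output length
    # assumed definition: fraction of output lines kept in the label
    meta["compression_ratio"] = len(relevant) / max(len(total_lines), 1)
    return meta
```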
## Manual Review (Test Split)
The full 436-sample test split was manually reviewed in 5 batches. Each sample was checked against:
- The task statement (SWE-bench issue)
- The raw tool output embedded in the prompt
- The selected `relevant_lines`
Corrections were only applied when the selected span was clearly wrong or clearly too incomplete to be useful.
### Results
| Metric | Count |
|---|---|
| Total samples reviewed | 436 / 436 |
| Corrected | 55 (12.6%) |
| Unchanged | 381 (87.4%) |
### Batch Breakdown
| Batch | Range | Reviewed | Corrected | Unchanged |
|---|---|---|---|---|
| 1 | 0–99 | 100 | 12 | 88 |
| 2 | 100–199 | 100 | 11 | 89 |
| 3 | 200–299 | 100 | 20 | 80 |
| 4 | 300–399 | 100 | 8 | 92 |
| 5 | 400–435 | 36 | 4 | 32 |
## Main Error Classes
The dominant issue was not random annotation noise. It was a repeated failure mode from the teacher distillation:
- Empty `python` labels despite clear tracebacks: the teacher returned no spans when the output was a full traceback, even though the traceback is the most relevant signal for debugging
- Overly truncated traceback labels: only the tail of a traceback was selected, missing the root cause
- Missed blocking failures: labels selected unrelated context instead of the actual error
- Short error spans: `test_output` samples where the `ImportError` or `E ...` failure lines were omitted
Most corrections involved Python tracebacks from environment/runtime failures (import-time errors that prevented the issue-specific reproduction from running).
## QA Metadata

Every test sample now carries per-sample QA metadata in `metadata.qa`:
```json
{
  "status": "pass",
  "response_json_valid": true,
  "relevant_lines_present_in_prompt": true,
  "relevant_lines_in_prompt_order": true,
  "duplicate_relevant_line_count": 0,
  "manual_review_batch1": {
    "batch": 1,
    "status": "reviewed_no_change",
    "rationale": "..."
  }
}
```
Corrected samples include the review status and rationale explaining what was changed and why.
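Downstream tooling can read this metadata to separate reviewed-unchanged samples from corrected ones. A small sketch, assuming each reviewed sample carries exactly one `manual_review_batchN` entry keyed as in the example above:

```python
def review_status(sample: dict):
    """Return (batch, status) from the per-sample QA metadata, or
    (None, None) if the sample carries no manual-review entry.
    A sketch; key names follow the example JSON above."""
    qa = sample["metadata"]["qa"]
    for key, value in qa.items():
        if key.startswith("manual_review_batch"):
            return value["batch"], value["status"]
    return None, None
```

Filtering on the returned status (e.g. keeping only `"reviewed_no_change"` samples) then gives the unchanged subset.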
## Training Split — Traceback Curation
After the test split review revealed a pattern of missing/truncated traceback labels, a targeted curation pass was applied to the training split. This was not a full manual review — it specifically fixed the highest-confidence traceback-related label errors.
### Scope
Only three categories were touched:
| Category | Before | After | Corrected |
|---|---|---|---|
| Empty `python` labels with clear traceback/error block | 64 | 0 | 64 |
| Short `python` spans with clear traceback | 52 | 0 | 52 |
| Short `test_output` spans with clear error block | 7 | 0 | 7 |
| **Total** | **123** | **0** | **123** |
Out of 6,790 train samples, 123 (1.8%) were corrected. The remaining 6,667 samples were screened by the heuristic and left unchanged.
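A candidate-mining heuristic of this kind can be sketched as follows. This is an illustration under assumed field names and patterns, not the logic of the actual curation script:

```python
import json
import re

# Signals for a "clear traceback/error block" (assumed patterns, not the
# exact ones used by the curation pass)
TRACEBACK_RE = re.compile(
    r"Traceback \(most recent call last\):|^E\s+\S*Error", re.MULTILINE
)

def is_traceback_candidate(sample: dict, min_span: int = 3) -> bool:
    """Flag samples whose label is empty or suspiciously short even though
    the raw tool output contains a traceback or pytest error line."""
    relevant = json.loads(sample["response"])["relevant_lines"]
    if not TRACEBACK_RE.search(sample["prompt"]):
        return False  # no error block in the output: nothing to recover
    return len(relevant) < min_span  # empty, or too short for the root cause
```

Flagged samples are then fixed by hand (or by a targeted rule), which keeps the pass high-confidence rather than a blanket relabeling.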
### What Was Fixed
Corrected samples now properly include the traceback or error block that was previously omitted or truncated. These were dominated by:
- Missing compiled extensions (`ImportError` from C extensions)
- Missing optional dependencies
- Import failures from Python/package version incompatibilities
- pytest configuration/import failures with clear traceback output
### What Was Not Fixed
This pass deliberately did not touch:
- Subtle semantic selection mistakes in non-traceback samples
- Ambiguous cases where multiple spans could plausibly be selected
- Empty labels for `read_file`, `grep`, or `git_log` tools, where emptiness may be correct
- General semantic relabeling across all tool types
### QA Metadata

Every train sample now carries curation metadata in `metadata.qa.traceback_train_curation_v1`, recording whether the sample was corrected by this pass or left unchanged.
### Artifacts

- Curation script: `scripts/curate_train_tracebacks.py`
- Curation notes: `data/train_curated_tracebacks_v1_notes.json`
- Full report: `data/train_curated_tracebacks_v1_report.md`
## Summary of All Data Quality Passes
| Split | Pass | Samples | Corrected | Method |
|---|---|---|---|---|
| Both | Metadata fix | 7,584 | 3,985 metadata fields | Automated recomputation |
| Test | Manual review | 436 | 55 (12.6%) | Human review, 5 batches |
| Train | Traceback curation | 6,790 | 123 (1.8%) | Heuristic + targeted fix |
## Recommendations for Contributors
If you want to further improve the dataset:
- Structural fixes: run the automated QA and fix any structural issues
- Candidate mining: use heuristics to find high-confidence correction candidates
- Manual review: review candidates and apply corrections where the label is clearly wrong
- Do not blindly auto-relabel the entire training set — the judgment calls on what constitutes "relevant" output require context about the task