Data Quality & Review Process
The dataset powering Squeez now uses canonical rows (`query`, `tool_output`, `gold_spans`) as the source of truth. This page documents the quality checks that matter for that format.
Pipeline Overview
Raw tool output
│
▼
Teacher relabeling (focused query + grounded spans)
│
▼
Canonical rows (`query`, `tool_output`, `gold_spans`)
│
▼
Derived Qwen / encoder views
│
▼
Automated QA + manual review
Automated QA
Every canonical sample should pass:
- Grounding: every span lands inside the raw `tool_output`
- Ordering: spans are sorted and non-overlapping
- Canonical derivation: Qwen XML and encoder `relevant_lines` are both derived from the same spans
- No synthetic numbering leakage: final labels must match raw output, not temporary numbered views
- Metadata consistency: counts and compression are recomputed from the derived content
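The grounding and ordering checks above can be sketched in a few lines. This is a minimal illustration, not the project's actual validator: it assumes each entry in `gold_spans` is a dict with 1-indexed, inclusive `start_line` and `end_line` keys, and the function name `check_spans` is hypothetical.

```python
from typing import Dict, List


def check_spans(tool_output: str, gold_spans: List[Dict[str, int]]) -> List[str]:
    """Return a list of problems; an empty list means the row passes.

    Assumption: spans are line-based, 1-indexed and inclusive, stored under
    the hypothetical keys "start_line" and "end_line".
    """
    n_lines = len(tool_output.splitlines())
    problems = []
    prev_end = 0  # last line covered by any earlier span
    for i, span in enumerate(gold_spans):
        start, end = span["start_line"], span["end_line"]
        # Grounding: the span must fall inside the raw output.
        if not (1 <= start <= end <= n_lines):
            problems.append(f"span {i}: lines {start}-{end} fall outside 1-{n_lines}")
        # Ordering: spans must be sorted and must not overlap.
        if start <= prev_end:
            problems.append(f"span {i}: starts at {start}, not after line {prev_end}")
        prev_end = max(prev_end, end)
    return problems
```

An empty `gold_spans` list returns no problems under this sketch, so explicit negatives pass the structural checks and are left to manual review.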
Earlier dataset versions mixed canonical truth with model-specific wrappers and temporary numbering views. The current system is designed specifically to avoid that class of bug by keeping `gold_spans` canonical and deriving all other representations from them.
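One way to picture "derive everything from the spans" is a single function that builds both views and the metadata from one canonical row. The sketch below rests on several assumptions: spans are 1-indexed inclusive line ranges under hypothetical `start_line` / `end_line` keys, the `<extracted>` tag is a placeholder rather than the real Qwen wrapper, and `kept_fraction` stands in for whatever compression metric the dataset actually records.

```python
from typing import Dict, List


def derive_views(tool_output: str, gold_spans: List[Dict[str, int]]) -> Dict[str, object]:
    """Build both views and the metadata from one set of canonical spans."""
    lines = tool_output.splitlines()
    # Encoder-style view: 0-indexed ids of every line covered by a span.
    relevant_lines = [
        idx
        for span in gold_spans
        for idx in range(span["start_line"] - 1, span["end_line"])
    ]
    # Generative-style view: the same lines copied verbatim from the raw
    # output, so no temporary numbering can leak into the label.
    kept = [lines[idx] for idx in relevant_lines]
    # "<extracted>" is a hypothetical tag name used only for illustration.
    xml = "<extracted>\n" + "\n".join(kept) + "\n</extracted>"
    # Metadata is recomputed from the derived content, never carried over.
    total = len(lines)
    return {
        "relevant_lines": relevant_lines,
        "xml": xml,
        "kept_lines": len(kept),
        "total_lines": total,
        "kept_fraction": len(kept) / total if total else 0.0,
    }
```

Because both views read from the same `relevant_lines` list, they cannot disagree about which lines were selected.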
Review priorities
When reviewing held-out samples, the main questions are:
- Is the `query` a realistic next-step extraction request?
- Do the `gold_spans` point to the smallest useful evidence block?
- Would an agent plausibly want this exact block next?
- If the sample is empty/irrelevant, is it an explicit negative rather than a failed positive?
Recommendations for Contributors
If you want to further improve the dataset:
- Keep canonical rows clean: fix `query`/`gold_spans`, not only derived XML.
- Check grounding first: labels must map back to raw output exactly.
- Review negatives carefully: empty labels should be deliberate explicit negatives, not artifacts of bad task generation.
- Prefer focused extraction subgoals over full issue statements.
- Use content-first fallback for positives: if a task-derived query is not answerable from a tool output, retry with a tool-aware query driven by the output itself; if that still fails, drop the sample.
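The content-first fallback in the last recommendation amounts to a two-attempt loop. The sketch below is illustrative only: `label` and `make_tool_aware_query` are hypothetical callables standing in for the teacher relabeling step and the output-driven query generator, and the span shape is an assumption.

```python
from typing import Callable, Dict, List, Optional

Span = Dict[str, int]  # assumed shape, e.g. {"start_line": 1, "end_line": 3}


def label_positive(
    task_query: str,
    tool_output: str,
    label: Callable[[str, str], List[Span]],       # hypothetical teacher labeler
    make_tool_aware_query: Callable[[str], str],   # hypothetical query generator
) -> Optional[dict]:
    """Return a canonical row, or None if the sample should be dropped."""
    # First attempt: the query derived from the task.
    spans = label(task_query, tool_output)
    if spans:
        return {"query": task_query, "tool_output": tool_output, "gold_spans": spans}
    # Fallback: a query driven by the tool output itself.
    fallback_query = make_tool_aware_query(tool_output)
    spans = label(fallback_query, tool_output)
    if spans:
        return {"query": fallback_query, "tool_output": tool_output, "gold_spans": spans}
    # Both attempts failed: drop rather than keep an accidental empty label.
    return None
```

Returning `None` instead of an empty-span row keeps failed positives out of the negative pool, so any negatives that remain are deliberate ones.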