Dataset

Training data: KRLabsOrg/tool-output-extraction-swebench
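
If the identifier above is a Hugging Face Hub dataset ID (this page does not say so explicitly), loading it with the `datasets` library might look like the sketch below. The helper name `load_extraction_dataset` is hypothetical, and the split names exposed by the Hub copy should be inspected rather than assumed.

```python
DATASET_ID = "KRLabsOrg/tool-output-extraction-swebench"


def load_extraction_dataset(dataset_id: str = DATASET_ID):
    """Load every available split of the dataset.

    Assumptions: the ID resolves on the Hugging Face Hub and the
    `datasets` package is installed. The import is kept local so that
    defining this helper does not require the package.
    """
    from datasets import load_dataset

    return load_dataset(dataset_id)


# Usage (needs network access):
#   splits = load_extraction_dataset()
#   for name, split in splits.items():
#       print(name, len(split))
```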

Statistics

Statistic	Count
Train samples	8,241
Dev samples	252
Test samples	557
Total	9,050
SWE-bench real data	5,936
Synthetic multi-ecosystem	2,039
Synthetic SWE-style	1,075
Tool types	30
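
The table gives two breakdowns of the same 9,050 samples, one by split and one by source. A quick arithmetic check, with the counts copied from the table and dictionary keys chosen here purely for illustration:

```python
# Counts copied from the statistics table; key names are illustrative.
split_counts = {"train": 8_241, "dev": 252, "test": 557}
source_counts = {
    "swe_bench_real": 5_936,
    "synthetic_multi_ecosystem": 2_039,
    "synthetic_swe_style": 1_075,
}
total = 9_050

# Both breakdowns should partition the same set of samples.
assert sum(split_counts.values()) == total
assert sum(source_counts.values()) == total

# Share of each split as a percentage of all samples, rounded to one decimal.
split_share = {name: round(100 * n / total, 1) for name, n in split_counts.items()}
print(split_share)
```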

Data sources

The dataset combines three sources:

SWE-bench real data: Tool calls executed on cloned repository checkouts for the 2,294 SWE-bench task instances (django, scikit-learn, sympy, etc.). The outputs are real git grep, pytest, pip install, and mypy results, labeled by a teacher LLM that selects relevant line spans grounded in the original output.
Synthetic multi-ecosystem: LLM-generated tool output for ecosystems beyond Python, including npm, TypeScript, Rust, Go, Java, Docker, Terraform, kubectl, and more. Generation runs in two passes: pass 1 generates the task and the tool output, and pass 2 picks the relevant lines.
Synthetic SWE-style: LLM-generated samples for Python tool types whose real-data output was of poor quality (high noise rates caused by environment failures).
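
"Grounded in the original output" suggests that every labeled line should occur verbatim in the tool output. The labeling pipeline itself is not described here, so the following is only a sketch of one way such a check could be written. The function name and the sample strings are hypothetical.

```python
def ungrounded_lines(tool_output: str, relevant_lines: list[str]) -> list[str]:
    """Return the selected lines that do not occur verbatim in the tool output."""
    available = set(tool_output.splitlines())
    return [line for line in relevant_lines if line not in available]


# Made-up grep-style output, for illustration only.
output = "src/a.py:10: def foo():\nsrc/b.py:3: import foo\n"

# A line copied from the output is grounded, so nothing is reported.
grounded = ungrounded_lines(output, ["src/a.py:10: def foo():"])

# A line the labeler invented is reported back.
invented = ungrounded_lines(output, ["src/c.py:1: not in output"])
```

A stricter variant could also require the selected lines to appear in their original order.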

Sample format

Each sample has three fields:

prompt

System prompt + task description + tool output, formatted with Qwen ChatML tokens:

<|im_start|>system
You extract relevant lines from tool output for a coding task. Return the relevant lines inside <relevant_lines> tags.
<|im_end|>
<|im_start|>user
<task>
Fix the CSRF validation bug in django...
</task>
<tool_output>
class CsrfViewMiddleware(MiddlewareMixin):
    def _check_referer(self, request):
        ...
</tool_output>
<|im_end|>
<|im_start|>assistant
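
A prompt in the layout shown above could be assembled as in the sketch below. The system prompt text is copied from the example. The exact whitespace (a newline before each `<|im_end|>` and a trailing newline after the assistant header) is inferred from the rendered example and is an assumption. `build_prompt` is a hypothetical helper name.

```python
SYSTEM_PROMPT = (
    "You extract relevant lines from tool output for a coding task. "
    "Return the relevant lines inside <relevant_lines> tags."
)


def build_prompt(task: str, tool_output: str) -> str:
    """Assemble a ChatML-style prompt ending at the open assistant turn."""
    return (
        "<|im_start|>system\n"
        f"{SYSTEM_PROMPT}\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        "<task>\n"
        f"{task}\n"
        "</task>\n"
        "<tool_output>\n"
        f"{tool_output}\n"
        "</tool_output>\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_prompt("Fix the failing import", "ImportError: no module named foo")
```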

response

Relevant lines wrapped in XML tags:

<relevant_lines>
class CsrfViewMiddleware(MiddlewareMixin):
    def _check_referer(self, request):
        referer = request.META.get('HTTP_REFERER')
</relevant_lines>

Or, when the tool output is not relevant to the task, an empty block:

<relevant_lines>
</relevant_lines>
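
Both response shapes above can be read back with a small parser. This is a minimal sketch that takes the first `<relevant_lines>` block and preserves indentation. `parse_relevant_lines` is a hypothetical helper, not part of any published tooling.

```python
import re

# Non-greedy match so only the first block is captured. DOTALL lets "."
# span newlines.
_RELEVANT_RE = re.compile(r"<relevant_lines>\n?(.*?)</relevant_lines>", re.DOTALL)


def parse_relevant_lines(response: str) -> list[str]:
    """Return the lines inside the first <relevant_lines> block, or [] if absent."""
    match = _RELEVANT_RE.search(response)
    if match is None:
        return []
    return match.group(1).splitlines()


filled = parse_relevant_lines(
    "<relevant_lines>\nclass A:\n    def f(self):\n</relevant_lines>"
)
empty = parse_relevant_lines("<relevant_lines>\n</relevant_lines>")
```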

metadata

{
    "instance_id": "django__django-11099",
    "tool_type": "read_file",
    "source": "swe",
    "num_total_lines": 42,
    "num_relevant_lines": 8,
    "compression_ratio": 0.81
}

The source field is one of swe, synthetic, or synthetic_negative (hard negatives where the task is intentionally mismatched with the tool output).
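
The `compression_ratio` in the example is consistent with one minus the fraction of lines kept (1 - 8/42 ≈ 0.81). That formula is inferred from this single example and is not stated on this page, so the sketch below should be read as an assumption.

```python
def compression_ratio(num_total_lines: int, num_relevant_lines: int) -> float:
    """Fraction of tool-output lines dropped, rounded to two decimals.

    Assumed definition: 1 - relevant / total. Returns 0.0 for empty output.
    """
    if num_total_lines <= 0:
        return 0.0
    return round(1 - num_relevant_lines / num_total_lines, 2)


# Values taken from the metadata example above.
example = compression_ratio(num_total_lines=42, num_relevant_lines=8)
```

Under this definition, a sample with an empty `<relevant_lines>` block has a ratio of 1.0.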

Splits

SWE-bench data is split by repository, so no instance appears in more than one split:

Test: pydata/xarray, pallets/flask
Dev: psf/requests
Train: all others (django, sympy, scikit-learn, sphinx, matplotlib, pytest, astropy, pylint, seaborn)
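
The repository-level assignment above can be sketched from `instance_id` alone. The sketch assumes every ID follows the `owner__repo-number` pattern seen in `django__django-11099`. The helper names are hypothetical.

```python
TEST_REPOS = {"pydata/xarray", "pallets/flask"}
DEV_REPOS = {"psf/requests"}


def repo_from_instance_id(instance_id: str) -> str:
    """Map 'owner__repo-1234' to 'owner/repo' (assumed ID pattern)."""
    # Strip the trailing '-<number>'. rpartition keeps hyphens inside repo names.
    owner_repo, _, _number = instance_id.rpartition("-")
    return owner_repo.replace("__", "/", 1)


def split_for_instance(instance_id: str) -> str:
    """Return 'test', 'dev', or 'train' based on the instance's repository."""
    repo = repo_from_instance_id(instance_id)
    if repo in TEST_REPOS:
        return "test"
    if repo in DEV_REPOS:
        return "dev"
    return "train"


split = split_for_instance("django__django-11099")
```

Because the split is decided per repository, every instance of a held-out project lands in the same split.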

Synthetic data is split per tool type: 10% test, 5% dev, 85% train. Hard negatives are capped at roughly 10% per tool type in the test split.
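
A per-tool-type split with these proportions might be implemented as below. This is a sketch under assumptions: the seed, the rounding rule, and the function name are illustrative, and the cap on hard negatives is not implemented.

```python
import random
from collections import defaultdict


def split_per_tool_type(samples, seed=0, test_frac=0.10, dev_frac=0.05):
    """Shuffle each tool type separately, then cut it into test/dev/train."""
    rng = random.Random(seed)
    by_tool = defaultdict(list)
    for sample in samples:
        by_tool[sample["tool_type"]].append(sample)

    splits = {"train": [], "dev": [], "test": []}
    for tool_type in sorted(by_tool):  # sorted for reproducibility
        group = list(by_tool[tool_type])
        rng.shuffle(group)
        n_test = round(len(group) * test_frac)
        n_dev = round(len(group) * dev_frac)
        splits["test"].extend(group[:n_test])
        splits["dev"].extend(group[n_test:n_test + n_dev])
        splits["train"].extend(group[n_test + n_dev:])
    return splits


# Toy input: 100 curl samples and 40 tsc samples.
toy = [{"tool_type": "curl", "id": i} for i in range(100)]
toy += [{"tool_type": "tsc", "id": i} for i in range(40)]
toy_splits = split_per_tool_type(toy)
```

Splitting within each tool type keeps rare tools represented in every split instead of leaving that to chance.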

The held-out test set was manually curated: 61 overly broad annotations were excluded.

Tool types

Ecosystem	Tool types
Python	read_file, grep, python, test_output, pip_install, type_check, coverage, lint_output, build_output
Git	git_log, git_diff, git_blame, ls
JavaScript/TypeScript	npm_install, npm_build, tsc, eslint
Rust	cargo_build
Go	go_build
Java	mvn_gradle
C/C++	make_cmake
Infrastructure	docker_build, docker_logs, terraform, kubectl
HTTP	curl
Python (synthetic)	grep_output, git_log_output, git_diff_output, python_output, mypy_pyright