Config

Pipeline configuration and constants.

PipelineConfig dataclass

Fields and defaults:

    output_dir = Path('data')
    source_cache_dir = Path('data/source_cache')
    repos_dir = Path('data/repos')
    github_token = ''
    openai_api_key = ''
    distillation_model = 'gpt-5.4'
    distillation_base_url = None
    swebench_dataset = 'princeton-nlp/SWE-bench'
    splits = ['test']  (default factory)
    max_instances = None
    min_tools_per_instance = 3
    max_tools_per_instance = 7
    max_tool_output_lines = MAX_TOOL_OUTPUT_LINES
    distillation_max_concurrent = 50
    distillation_temperature = 0.3
    generate_queries_with_teacher = True
    command_timeout = 30

Configuration for the data generation pipeline.

Constants

config

Configuration for the data generation pipeline.

SYSTEM_PROMPT = 'You prune verbose tool output for a coding agent. Given a focused extraction query and one tool output, return only the smallest verbatim evidence block(s) the agent should read next. Return the kept text inside <relevant_lines> tags. Do not rewrite, summarize, or invent lines.'
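The prompt tells the model to wrap kept text in `<relevant_lines>` tags, so a downstream parser only needs to pull those blocks out of the raw completion. A small sketch (the `reply` text is invented for illustration; the pipeline's actual parser is not shown on this page):

```python
import re

# SYSTEM_PROMPT asks for kept text inside <relevant_lines> tags.
# DOTALL lets the capture span multiple lines; the non-greedy .*?
# keeps separate tag pairs from merging into one match.
TAG_RE = re.compile(r'<relevant_lines>(.*?)</relevant_lines>', re.DOTALL)

def extract_relevant(response: str) -> list[str]:
    """Return the verbatim evidence blocks the model chose to keep."""
    return [block.strip() for block in TAG_RE.findall(response)]

reply = ("Some preamble the agent should ignore.\n"
         "<relevant_lines>def foo():\n    return 1</relevant_lines>")
```

A response with no tags simply yields an empty list, which a caller can treat as "nothing relevant found".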

TOOL_WEIGHTS = {
    'read_file': 0.28,
    'grep': 0.18,
    'python': 0.08,
    'git_log': 0.08,
    'test_output': 0.08,
    'git_diff': 0.05,
    'git_blame': 0.04,
    'ls': 0.04,
    'lint_output': 0.02,
    'build_output': 0.02,
    'curl': 0.03,
    'pip_install': 0.04,
    'type_check': 0.04,
    'coverage': 0.02,
}
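The weights sum to 1.0, which suggests they act as a sampling distribution over tool types. One plausible use, sketched with `random.choices` (how the pipeline actually draws tools is not shown on this page):

```python
import random

# Weights copied verbatim from the module constant above; they sum to 1.0.
TOOL_WEIGHTS = {
    'read_file': 0.28, 'grep': 0.18, 'python': 0.08, 'git_log': 0.08,
    'test_output': 0.08, 'git_diff': 0.05, 'git_blame': 0.04, 'ls': 0.04,
    'lint_output': 0.02, 'build_output': 0.02, 'curl': 0.03,
    'pip_install': 0.04, 'type_check': 0.04, 'coverage': 0.02,
}

rng = random.Random(0)  # seeded for reproducibility
# Draw 5 tool types with replacement, weighted by TOOL_WEIGHTS.
sample = rng.choices(list(TOOL_WEIGHTS), weights=list(TOOL_WEIGHTS.values()), k=5)
```

`random.choices` samples with replacement, so the same tool (most often `read_file`) can appear multiple times in one draw.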

MIN_RELEVANT_RATIO = 0.02
MAX_RELEVANT_RATIO = 0.4
MIN_RELEVANT_LINES = 3
MIN_TOTAL_LINES = 10
MAX_TOOL_OUTPUT_LINES = 500
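These thresholds read like a quality gate on (relevant lines, total lines) pairs. A sketch of how they might combine into a single predicate; the actual filtering logic is an assumption, not shown on this page:

```python
# Constants copied from the module documentation above.
MIN_RELEVANT_RATIO = 0.02
MAX_RELEVANT_RATIO = 0.4
MIN_RELEVANT_LINES = 3
MIN_TOTAL_LINES = 10

def passes_thresholds(relevant: int, total: int) -> bool:
    """Hypothetical gate: keep an example only if the pruned output is
    neither trivially small nor so large that pruning did nothing."""
    if total < MIN_TOTAL_LINES or relevant < MIN_RELEVANT_LINES:
        return False
    ratio = relevant / total
    return MIN_RELEVANT_RATIO <= ratio <= MAX_RELEVANT_RATIO
```

Under this reading, keeping 3 of 10 lines passes (ratio 0.3), while keeping 50 of 100 fails because more than 40% of the output survived pruning.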