Dataset
Supervised fine-tuning (SFT) dataset for training.
ExtractionSFTDataset(data_path, tokenizer, max_length=4096)
Bases: Dataset
SFT dataset for tool output extraction.
Loads JSONL files with 'prompt' and 'response' fields, tokenizes them, and masks prompt tokens in labels so that the loss is computed only on response tokens.
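The prompt masking described above can be sketched in plain Python. This is an illustration under stated assumptions, not the class's actual implementation: `toy_tokenize`, `build_example`, and `load_jsonl` are hypothetical names, the mask value `-100` is assumed because it is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, and truncating from the right at `max_length` is also an assumption.

```python
import json

# Assumption: masked label positions use -100, the default ignore_index
# of torch.nn.CrossEntropyLoss. The real class may use a different value.
IGNORE_INDEX = -100


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one id per whitespace token."""
    return [sum(map(ord, tok)) % 1000 + 1 for tok in text.split()]


def build_example(record, tokenize=toy_tokenize, max_length=4096):
    """Tokenize one record and mask the prompt positions in the labels."""
    prompt_ids = tokenize(record["prompt"])
    response_ids = tokenize(record["response"])
    # Assumption: sequences longer than max_length are truncated on the right.
    input_ids = (prompt_ids + response_ids)[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}


def load_jsonl(lines):
    """Parse JSONL lines (one JSON object per line), skipping blank lines."""
    return [build_example(json.loads(line)) for line in lines if line.strip()]


examples = load_jsonl(['{"prompt": "List files:", "response": "a.txt b.txt"}'])
```

Here the two prompt tokens receive the ignore value in `labels`, while the response positions keep their token ids, so only the response contributes to the loss.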
collate_fn(batch, pad_token_id=0)
Collate function that left-pads sequences with pad_token_id for causal LM training.
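A minimal sketch of left-padding, using plain lists instead of tensors so it runs without extra dependencies. The name `collate_left_pad`, the `attention_mask` output, padding to the longest sequence in the batch, and the `-100` label padding are all assumptions made for illustration; the actual `collate_fn` may differ.

```python
def collate_left_pad(batch, pad_token_id=0, ignore_index=-100):
    """Left-pad every example to the longest sequence in the batch.

    Hypothetical helper: pads input_ids with pad_token_id, labels with
    ignore_index, and builds an attention mask (0 = padding, 1 = token).
    """
    max_len = max(len(ex["input_ids"]) for ex in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in batch:
        n_tokens = len(ex["input_ids"])
        pad = max_len - n_tokens
        # Padding goes on the left, so real tokens end at the same position.
        out["input_ids"].append([pad_token_id] * pad + ex["input_ids"])
        out["attention_mask"].append([0] * pad + [1] * n_tokens)
        out["labels"].append([ignore_index] * pad + ex["labels"])
    return out


batch = collate_left_pad([
    {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
    {"input_ids": [8], "labels": [8]},
])
```

In a PyTorch training loop the three lists would typically be converted with `torch.tensor(...)` before being passed to the model.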