Dataset

Supervised fine-tuning (SFT) dataset for training.

ExtractionSFTDataset(data_path, tokenizer, max_length=4096)

Bases: Dataset

SFT dataset for tool output extraction.

Loads JSONL files with 'prompt' and 'response' fields, tokenizes them, and masks prompt tokens in labels.
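The prompt masking described above can be sketched roughly as follows. This is a minimal illustration, not the class's actual implementation: `toy_tokenize`, `build_example`, and the `IGNORE_INDEX = -100` mask value are assumptions made for the example (-100 is the default `ignore_index` of PyTorch's cross-entropy loss, but the source does not say which value the dataset uses), and truncation to `max_length` is assumed to cut from the right.

```python
import json

# Assumption: masked label positions use -100, the default ignore_index of
# PyTorch's CrossEntropyLoss. The real dataset may use a different value.
IGNORE_INDEX = -100


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one id (the word length) per word."""
    return [len(word) for word in text.split()]


def build_example(line, tokenize=toy_tokenize, max_length=4096):
    """Turn one JSONL line into input_ids and prompt-masked labels."""
    record = json.loads(line)
    prompt_ids = tokenize(record["prompt"])
    response_ids = tokenize(record["response"])

    # The model sees prompt + response, truncated to max_length.
    input_ids = (prompt_ids + response_ids)[:max_length]
    # Prompt positions are masked so the loss covers response tokens only.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + response_ids)[:max_length]
    return {"input_ids": input_ids, "labels": labels}


example = build_example('{"prompt": "extract the id", "response": "id 42"}')
print(example["input_ids"])  # [7, 3, 2, 2, 2]
print(example["labels"])     # [-100, -100, -100, 2, 2]
```

With a real tokenizer the ids would come from the tokenizer passed to `ExtractionSFTDataset`; only the masking pattern is the point here.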

collate_fn(batch, pad_token_id=0)

Collate function with left-padding for causal LM training.
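As a rough sketch of what left-padding means for a batch, the hypothetical `left_pad_collate` below pads every sequence on the left up to the longest length in the batch. It is an assumption-laden illustration rather than the documented `collate_fn`: it works on plain Python lists instead of tensors, and it assumes the batch items carry `input_ids` and `labels`, that an `attention_mask` is produced, and that padded label positions are filled with -100.

```python
# Assumption: padded label positions use -100 so they are ignored by the loss.
IGNORE_INDEX = -100


def left_pad_collate(batch, pad_token_id=0):
    """Left-pad each item to the longest sequence in the batch."""
    max_len = max(len(item["input_ids"]) for item in batch)
    out = {"input_ids": [], "attention_mask": [], "labels": []}
    for item in batch:
        length = len(item["input_ids"])
        pad = max_len - length
        # Padding goes in front, so real tokens stay right-aligned.
        out["input_ids"].append([pad_token_id] * pad + item["input_ids"])
        out["attention_mask"].append([0] * pad + [1] * length)
        out["labels"].append([IGNORE_INDEX] * pad + item["labels"])
    return out


batch = left_pad_collate(
    [
        {"input_ids": [5, 6, 7], "labels": [-100, 6, 7]},
        {"input_ids": [8], "labels": [8]},
    ]
)
print(batch["input_ids"])       # [[5, 6, 7], [0, 0, 8]]
print(batch["attention_mask"])  # [[1, 1, 1], [0, 0, 1]]
print(batch["labels"])          # [[-100, 6, 7], [-100, -100, 8]]
```

In a training loop the three lists would typically be converted to tensors before being passed to the model.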