Skip to content

Splitting

Stratified train/development holdout used by the refinement loop to decide patch acceptance on data the rules were not tuned on. See Holdout Acceptance for the conceptual overview.

split_dataset

split_dataset(dataset, holdout_fraction=0.2, seed=42, min_dev_size=5)

Split a dataset into train and held-out dev portions.

Examples are split stratified by class signature. Corrections always stay in train: they are explicit user fixes that must drive patching, and holding them out would hide the highest-value signal from the learner. Feedback and existing rules are shared with the train split.

Parameters:

Name Type Description Default
dataset Dataset

Source dataset (not mutated).

required
holdout_fraction float

Fraction of examples to hold out (0 < f < 1).

0.2
seed int

Random seed for the shuffle within each stratum.

42
min_dev_size int

If the resulting dev set would be smaller than this, no split is performed and (dataset, None) is returned.

5

Returns:

Type Description
Dataset

Tuple of (train_dataset, dev_dataset). dev_dataset is None when the

Dataset | None

dataset is too small to split safely.