Splitting¶
Stratified train/development holdout used by the refinement loop to decide patch acceptance on data the rules were not tuned on. See Holdout Acceptance for the conceptual overview.
split_dataset¶
split_dataset(dataset, holdout_fraction=0.2, seed=42, min_dev_size=5)
¶
Split a dataset into train and held-out dev portions.
Examples are split stratified by class signature. Corrections always stay in train: they are explicit user fixes that must drive patching, and holding them out would hide the highest-value signal from the learner. Feedback and existing rules are shared with the train split.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
dataset
|
Dataset
|
Source dataset (not mutated). |
required |
holdout_fraction
|
float
|
Fraction of examples to hold out (0 < f < 1). |
0.2
|
seed
|
int
|
Random seed for the shuffle within each stratum. |
42
|
min_dev_size
|
int
|
If the resulting dev set would be smaller than this, no split is performed and (dataset, None) is returned. |
5
|
Returns:
| Type | Description |
|---|---|
Dataset
|
Tuple of (train_dataset, dev_dataset). dev_dataset is None when the |
Dataset | None
|
dataset is too small to split safely. |