Evaluation & Feedback¶
RuleChef includes built-in evaluation with entity-level precision, recall, and F1 metrics.
Dataset Evaluation¶
Dataset evaluation runs every rule against the dataset and computes:
- Exact match accuracy — percentage of inputs where output matches exactly
- Micro precision/recall/F1 — aggregated across all predictions
- Macro F1 — averaged across classes (for multi-class tasks)
- Per-class breakdown — precision, recall, F1 per label
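As a rough illustration of how the metrics above relate to one another, the sketch below computes them from per-input sets of (text, label) pairs. The function names `prf` and `summarize` and the data layout are assumptions made for this example; they are not RuleChef's API, and the library's own computation may differ in detail.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def summarize(gold, predicted):
    """gold / predicted: one set of (text, label) pairs per input."""
    tp, fp, fn = Counter(), Counter(), Counter()
    exact = 0
    for g, p in zip(gold, predicted):
        exact += g == p                      # exact match: whole output identical
        for _, label in g & p:
            tp[label] += 1
        for _, label in p - g:
            fp[label] += 1
        for _, label in g - p:
            fn[label] += 1
    labels = sorted(set(tp) | set(fp) | set(fn))
    per_class = {l: prf(tp[l], fp[l], fn[l]) for l in labels}
    # Micro: pool the counts first, then compute one score.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute a score per class, then average the scores.
    macro_f1 = sum(f1 for _, _, f1 in per_class.values()) / len(labels) if labels else 0.0
    return {
        "exact_match": exact / len(gold) if gold else 0.0,
        "micro": micro,
        "macro_f1": macro_f1,
        "per_class": per_class,
    }


gold = [{("aspirin", "DRUG"), ("10mg", "DOSE")}, {("ibuprofen", "DRUG")}]
predicted = [{("aspirin", "DRUG")}, {("ibuprofen", "DRUG"), ("daily", "DOSE")}]
report = summarize(gold, predicted)
```

In this toy data the DRUG class is perfect and the DOSE class is entirely wrong, so micro F1 (pooled counts) and macro F1 (average of per-class scores) come out differently. That gap is why both are reported.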
Per-Rule Evaluation¶
Per-rule evaluation shows which rules are helping and which are hurting. Each rule gets individual metrics:
- TP / FP / FN — true positives, false positives, false negatives
- Sample matches — which examples each rule matched
- Dead rules — rules that never fire on any example
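The per-rule metrics listed above can be sketched as follows. This is a minimal illustration that assumes each rule's output is available as one set of predictions per example; `per_rule_metrics` and the dictionary keys are hypothetical names, not the library's report format.

```python
def per_rule_metrics(rule_outputs, gold):
    """Score each rule in isolation.

    rule_outputs: {rule_id: [set of predictions, one per example]}
    gold:         [set of expected items, one per example]
    """
    report = {}
    for rule_id, outputs in rule_outputs.items():
        tp = fp = fn = 0
        matched = []                      # indices of examples the rule fired on
        for i, (pred, truth) in enumerate(zip(outputs, gold)):
            if pred:
                matched.append(i)
            tp += len(pred & truth)
            fp += len(pred - truth)
            fn += len(truth - pred)
        report[rule_id] = {
            "tp": tp,
            "fp": fp,
            "fn": fn,
            "sample_matches": matched,
            "dead": not matched,          # never fired on any example
        }
    return report


gold = [{"aspirin"}, {"ibuprofen"}, set()]
rule_outputs = {
    "drug_suffix": [{"aspirin"}, set(), {"protein"}],
    "never_fires": [set(), set(), set()],
}
rule_report = per_rule_metrics(rule_outputs, gold)
```

Here `drug_suffix` has one true positive, one false positive and one false negative, while `never_fires` is flagged as dead.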
Rule Trust and Conflict Resolution¶
When learning completes, every rule is evaluated in isolation and stamped with a validated precision and support (the number of predictions behind the estimate). When a holdout is active these are measured on the dev split — data the rule was never tuned on.
For ranking and routing, precision is discounted by a Wilson lower bound, so a rule that was right 2/2 does not outrank one that was right 95/100. A rule that memorized a training lexicon flags itself this way: it transfers poorly to dev and ends up with a conspicuously low validated precision — you can spot it without reading the pattern.
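To see why the discount matters, here is a small sketch of the standard Wilson score lower bound for a binomial proportion. The function name and the choice of z = 1.96 (a 95% interval) are assumptions for illustration; the confidence level RuleChef uses internally is not stated here.

```python
import math


def wilson_lower_bound(correct, total, z=1.96):
    """Lower end of the Wilson score interval for correct/total."""
    if total == 0:
        return 0.0                      # no evidence, no trust
    p = correct / total
    z2 = z * z
    centre = p + z2 / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total))
    return (centre - margin) / (1 + z2 / total)


small_sample = wilson_lower_bound(2, 2)      # raw precision 1.00, support 2
large_sample = wilson_lower_bound(95, 100)   # raw precision 0.95, support 100
```

Under these assumptions the 2/2 rule is discounted to roughly 0.34, while the 95/100 rule keeps roughly 0.89, so the better-supported rule ranks higher despite its lower raw precision.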
Conflict resolution¶
When several rules produce overlapping or conflicting matches, the executor orders them deterministically by priority, then validated precision (falling back to confidence for rules with no validated estimate). The higher-trust rule wins.
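The ordering described above can be pictured as a sort key. The sketch below is not the executor's code: `RuleInfo` is a hypothetical stand-in for a rule, a larger `priority` number is assumed to win, and the rule name is added as a final tiebreaker purely to keep the example deterministic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleInfo:
    """Hypothetical stand-in for a learned rule (not RuleChef's class)."""
    name: str
    priority: int
    confidence: float
    validated_precision: Optional[float] = None


def trust_key(rule):
    # Fall back to confidence when the rule has no validated estimate.
    trust = rule.validated_precision
    if trust is None:
        trust = rule.confidence
    # Negate so that min() prefers higher priority, then higher trust.
    return (-rule.priority, -trust, rule.name)


def pick_winner(conflicting):
    """Return the rule that wins among rules with conflicting matches."""
    return min(conflicting, key=trust_key)


rules = [
    RuleInfo("a", priority=1, confidence=0.9, validated_precision=0.60),
    RuleInfo("b", priority=1, confidence=0.5, validated_precision=0.95),
    RuleInfo("c", priority=1, confidence=0.8),   # no validated estimate
]
winner = pick_winner(rules)
```

Rule `b` wins here: all three share a priority, and its validated precision beats both rule `a`'s validated precision and rule `c`'s fallback confidence.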
Ranking and pruning¶
rank_rules() reports each rule's solo F1 and its leave-one-out marginal contribution to the ensemble, and can prune rules whose removal improves overall F1.
See the Ranking API for the full report structure.
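The idea behind leave-one-out ranking can be sketched in a few lines. This is a simplified illustration, not the `rank_rules()` implementation: it assumes the ensemble output is the plain union of every rule's predictions (ignoring conflict resolution), and the names `f1_score` and `leave_one_out` are hypothetical.

```python
def f1_score(predicted, gold):
    """F1 between two sets of predictions."""
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def leave_one_out(rule_predictions, gold):
    """rule_predictions: {rule_id: set of (example_id, value) predictions}."""

    def ensemble(exclude=None):
        sets = [p for r, p in rule_predictions.items() if r != exclude]
        return set().union(*sets)

    full = f1_score(ensemble(), gold)
    report = {}
    for rule_id, preds in rule_predictions.items():
        without = f1_score(ensemble(exclude=rule_id), gold)
        report[rule_id] = {
            "solo_f1": f1_score(preds, gold),
            "marginal": full - without,   # positive: the rule helps
            "prune": without > full,      # removing it improves F1
        }
    return report


gold = {(0, "aspirin"), (1, "ibuprofen"), (2, "codeine")}
rule_predictions = {
    "a": {(0, "aspirin"), (1, "ibuprofen")},
    "b": {(2, "codeine")},
    "noisy": {(0, "take"), (1, "with"), (2, "daily")},
}
ranking = leave_one_out(rule_predictions, gold)
```

In this toy data, removing `noisy` lifts the ensemble F1 from about 0.67 to 1.0, so it is the only pruning candidate.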
Repairing Rules with Feedback¶
Because rules are readable, their defects are too — and they can be fixed in plain English without re-synthesizing from scratch. Attach rule-level feedback and run one incremental round:
chef.add_feedback(
    "Never match number/number patterns like '1432/03' — those are case numbers.",
    level="rule",
    target_id=quantity_rule.id,
)
chef.learn_rules(incremental_only=True, holdout_fraction=0.2)
The targeted rules are patched while untouched rules are preserved. Human-written and LLM-generated (critic) feedback flow through the same channel.
Corrections¶
Corrections are the highest-value training signal. They show exactly where current rules fail:
result = chef.extract({"text": "some input"})
# Result was wrong — correct it
chef.add_correction(
    {"text": "some input"},
    model_output=result,
    expected_output={"label": "correct_label"},
    feedback="The rule matched too broadly",
)
chef.learn_rules() # Re-learns with corrections prioritized
Feedback¶
Feedback provides guidance at different levels:
Task-Level Feedback¶
General guidance for the entire task:
chef.add_feedback("Drug names always follow 'take' or 'prescribe'")
chef.add_feedback("Ignore mentions in parentheses")
Rule-Level Feedback¶
Guidance targeted at a specific rule:
chef.add_feedback(
    "This rule is too broad — it matches common words",
    level="rule",
    target_id="rule_123",
)
Feedback is included in synthesis prompts during the next learn_rules() call.
Matching Modes¶
Extraction¶
task = Task(
    ...,
    type=TaskType.EXTRACTION,
    matching_mode="text",  # Compare span text only (default)
)

task = Task(
    ...,
    type=TaskType.EXTRACTION,
    matching_mode="exact",  # Compare text + start/end offsets
)
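The difference between the two modes shows up when the same string occurs at more than one position. The comparison below is a hedged sketch rather than the library's matcher; the span dictionary keys (`text`, `start`, `end`) and the function name `spans_match` are assumptions.

```python
def spans_match(expected, actual, mode="text"):
    """Compare two spans under the given matching mode."""
    if mode not in ("text", "exact"):
        raise ValueError(f"unknown matching mode: {mode!r}")
    if expected["text"] != actual["text"]:
        return False
    if mode == "exact":
        # Exact mode additionally requires identical character offsets.
        return (expected["start"], expected["end"]) == (actual["start"], actual["end"])
    return True


expected = {"text": "aspirin", "start": 5, "end": 12}
actual = {"text": "aspirin", "start": 40, "end": 47}   # same word, later mention

text_mode = spans_match(expected, actual, mode="text")
exact_mode = spans_match(expected, actual, mode="exact")
```

The two spans match in `"text"` mode but not in `"exact"` mode, which makes `"exact"` the stricter choice when the position of a mention matters.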
NER¶
Entity matching checks both text and type: two entities match only if they have the same text and the same entity type.
Classification¶
Label matching is case-insensitive and strips whitespace.
Transformation¶
Dict matching compares values recursively. Array elements are matched order-independently.
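One way to picture recursive, order-independent comparison is the sketch below. It is an illustration under stated assumptions, not RuleChef's matcher: `values_match` is a hypothetical name, both dicts are assumed to need the same key set, and list elements are paired greedily.

```python
def values_match(expected, actual):
    """Recursively compare values, ignoring the order of list elements."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        if expected.keys() != actual.keys():
            return False
        return all(values_match(v, actual[k]) for k, v in expected.items())
    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return False
        remaining = list(actual)
        for item in expected:
            # Pair each expected element with one unused actual element.
            for i, candidate in enumerate(remaining):
                if values_match(item, candidate):
                    del remaining[i]
                    break
            else:
                return False            # no partner found for this element
        return True
    return expected == actual


same = values_match(
    {"items": [{"a": 1}, {"b": [1, 2]}]},
    {"items": [{"b": [2, 1]}, {"a": 1}]},
)
different = values_match([1, 1, 2], [1, 2, 2])
```

Each actual element is consumed at most once, so duplicates are counted properly: `[1, 1, 2]` does not match `[1, 2, 2]`.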
Custom Matchers¶
Override the default matching logic:
def my_matcher(expected, actual):
    # Custom comparison logic
    return expected["label"].lower() == actual["label"].lower()

task = Task(
    ...,
    output_matcher=my_matcher,
)
Stats¶
Get a summary of the current state: