Evaluation & Feedback¶
RuleChef includes built-in evaluation with entity-level precision, recall, and F1 metrics.
Dataset Evaluation¶
This runs all rules against the dataset and computes:
- Exact match accuracy — percentage of inputs where output matches exactly
- Micro precision/recall/F1 — aggregated across all predictions
- Macro F1 — averaged across classes (for multi-class tasks)
- Per-class breakdown — precision, recall, F1 per label
Per-Rule Evaluation¶
Find which rules are helping and which are hurting:
Each rule gets individual metrics:
- TP / FP / FN — true positives, false positives, false negatives
- Sample matches — which examples each rule matched
- Dead rules — rules that never fire on any example
Corrections¶
Corrections are the highest-value training signal. They show exactly where current rules fail:
result = chef.extract({"text": "some input"})
# Result was wrong — correct it
chef.add_correction(
{"text": "some input"},
model_output=result,
expected_output={"label": "correct_label"},
feedback="The rule matched too broadly"
)
chef.learn_rules() # Re-learns with corrections prioritized
Feedback¶
Feedback provides guidance at different levels:
Task-Level Feedback¶
General guidance for the entire task:
chef.add_feedback("Drug names always follow 'take' or 'prescribe'")
chef.add_feedback("Ignore mentions in parentheses")
Rule-Level Feedback¶
Guidance targeted at a specific rule:
chef.add_feedback(
"This rule is too broad — it matches common words",
level="rule",
target_id="rule_123"
)
Feedback is included in synthesis prompts during the next learn_rules() call.
Matching Modes¶
Extraction¶
task = Task(
...,
type=TaskType.EXTRACTION,
matching_mode="text", # Compare span text only (default)
)
task = Task(
...,
type=TaskType.EXTRACTION,
matching_mode="exact", # Compare text + start/end offsets
)
NER¶
Entity matching checks both text and type. Entities match if they have the same text and entity type.
Classification¶
Label matching is case-insensitive and strips whitespace.
Transformation¶
Dict matching compares values recursively. Array elements are matched order-independently.
Custom Matchers¶
Override the default matching logic:
def my_matcher(expected, actual):
# Custom comparison logic
return expected["label"].lower() == actual["label"].lower()
task = Task(
...,
output_matcher=my_matcher,
)
Stats¶
Get a summary of the current state: