Evaluation¶
Evaluation functions and result types.
EvalResult¶
EvalResult(micro_precision=0.0, micro_recall=0.0, micro_f1=0.0, macro_f1=0.0, per_class=list(), exact_match=0.0, total_tp=0, total_fp=0, total_fn=0, total_docs=0, failures=list())
dataclass
¶
Rich evaluation result across a dataset.
Attributes:

| Name | Type | Description |
|---|---|---|
| micro_precision | float | Entity-level micro-averaged precision. |
| micro_recall | float | Entity-level micro-averaged recall. |
| micro_f1 | float | Entity-level micro-averaged F1 score. |
| macro_f1 | float | Macro F1 (unweighted average of per-class F1 scores). |
| per_class | list[ClassMetrics] | Per-class precision/recall/F1 breakdown. |
| exact_match | float | Fraction of documents with perfect output (0.0-1.0). |
| total_tp | int | Total true positive count across all classes. |
| total_fp | int | Total false positive count across all classes. |
| total_fn | int | Total false negative count across all classes. |
| total_docs | int | Number of documents evaluated. |
| failures | list[dict] | List of failure dicts with keys 'input', 'expected', 'got', 'is_correction'. Used by the refinement loop to generate patches. |
to_dict()
¶
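For reference, the micro-averaged fields follow directly from the pooled counts. A minimal sketch of that relationship (the helper name `micro_metrics` is illustrative, not part of this module):

```python
def micro_metrics(total_tp: int, total_fp: int, total_fn: int) -> tuple[float, float, float]:
    """Derive micro precision/recall/F1 from pooled entity counts."""
    # Precision: correct predictions over all predictions made.
    precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) else 0.0
    # Recall: correct predictions over all expected entities.
    recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```

Because the counts are pooled across classes before dividing, frequent classes dominate the micro scores; `macro_f1` exists to weight every class equally instead.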
ClassMetrics¶
ClassMetrics(label, tp=0, fp=0, fn=0)
dataclass
¶
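A sketch of how per-class counts roll up into the macro F1 reported by EvalResult. The `f1` property below is an assumption for illustration; the real class stores only the raw counts:

```python
from dataclasses import dataclass

@dataclass
class ClassMetrics:
    label: str
    tp: int = 0
    fp: int = 0
    fn: int = 0

    @property
    def f1(self) -> float:
        # Per-class F1 computed from this class's own counts only.
        p = self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0
        r = self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0
        return 2 * p * r / (p + r) if (p + r) else 0.0

def macro_f1(per_class: list[ClassMetrics]) -> float:
    # Unweighted mean of per-class F1: every class counts equally,
    # regardless of how many entities it contributes.
    return sum(c.f1 for c in per_class) / len(per_class) if per_class else 0.0
```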
RuleMetrics¶
RuleMetrics(rule_id, rule_name, precision=0.0, recall=0.0, f1=0.0, matches=0, true_positives=0, false_positives=0, covered_expected=0, total_expected=0, per_class=list(), sample_matches=list())
dataclass
¶
Evaluation of a single rule in isolation.
Attributes:

| Name | Type | Description |
|---|---|---|
| rule_id | str | Unique identifier of the evaluated rule. |
| rule_name | str | Human-readable name of the rule. |
| precision | float | Precision of this rule alone (TP / (TP + FP)). |
| recall | float | Recall of this rule alone (covered / total expected entities). |
| f1 | float | F1 score derived from precision and recall. |
| matches | int | Total number of entities this rule produced. |
| true_positives | int | Entities that matched an expected entity. |
| false_positives | int | Entities that did not match any expected entity. |
| covered_expected | int | How many expected entities this rule correctly finds. |
| total_expected | int | Total expected entities across the full dataset. |
| per_class | list[ClassMetrics] | Per-class breakdown of TP/FP/FN for this rule. |
| sample_matches | list[dict] | Up to 10 sample match dicts showing rule behavior. |
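Note the asymmetry in the definitions above: precision is computed over the rule's own output, while recall is measured against the entire dataset's expected entities, so a narrow but accurate rule can score high precision with very low recall. A hedged sketch (the helper name is illustrative):

```python
def rule_precision_recall(true_positives: int, false_positives: int,
                          covered_expected: int, total_expected: int) -> tuple[float, float, float]:
    # Precision: of everything this rule produced, how much matched
    # an expected entity.
    denom = true_positives + false_positives
    precision = true_positives / denom if denom else 0.0
    # Recall: how much of the full expected set this rule alone covers.
    recall = covered_expected / total_expected if total_expected else 0.0
    # F1: harmonic mean, zero if either component is zero.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```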
Functions¶
evaluate_dataset¶
evaluate_dataset(rules, dataset, apply_rules_fn, mode='text')
¶
Evaluate rules against a full dataset, producing entity-level metrics.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| rules | list[Rule] | Rules to evaluate. | required |
| dataset | Dataset | Dataset with examples and corrections. | required |
| apply_rules_fn |  | Callable(rules, input_data, task_type, text_field) -> output_dict. | required |
| mode | str | 'text' (match by text+type) or 'exact' (match by text+type+start+end). | 'text' |
Returns:

| Type | Description |
|---|---|
| EvalResult | EvalResult with micro/macro metrics, per-class breakdown, exact match rate, and a list of failure dicts for refinement. |
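The two `mode` values change only the identity used to match predicted entities against expected ones. A sketch under the assumption that entities are dicts with 'text', 'type', 'start', and 'end' keys (the real field names may differ):

```python
from collections import Counter

def entity_key(ent: dict, mode: str = "text") -> tuple:
    # 'exact' mode additionally requires the character offsets to line up.
    if mode == "exact":
        return (ent["text"], ent["type"], ent["start"], ent["end"])
    return (ent["text"], ent["type"])

def match_counts(predicted: list[dict], expected: list[dict],
                 mode: str = "text") -> tuple[int, int, int]:
    # Multiset intersection so repeated entities are counted fairly.
    pred = Counter(entity_key(e, mode) for e in predicted)
    exp = Counter(entity_key(e, mode) for e in expected)
    tp = sum((pred & exp).values())
    fp = sum((pred - exp).values())  # predicted but not expected
    fn = sum((exp - pred).values())  # expected but missed
    return tp, fp, fn
```

In 'text' mode a prediction with the right surface form and type counts as correct even at the wrong position, which is the more forgiving default; 'exact' mode is the stricter span-level match.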
evaluate_rules_individually¶
evaluate_rules_individually(rules, dataset, apply_rules_fn, mode='text', max_samples=10)
¶
Evaluate each rule in isolation against the dataset.
For each rule, runs it alone and computes how many expected entities it produces correctly (TP), how many spurious entities it produces (FP), and how many expected entities it misses (recall denominator).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| rules | list[Rule] | Rules to evaluate individually. | required |
| dataset | Dataset | Dataset with examples and corrections. | required |
| apply_rules_fn |  | Callable(rules, input_data, task_type, text_field) -> output_dict. | required |
| mode | str | 'text' or 'exact'. | 'text' |
| max_samples | int | Max sample matches to store per rule. | 10 |
Returns:

| Type | Description |
|---|---|
| list[RuleMetrics] | List[RuleMetrics], one entry per rule, with per-rule precision/recall/F1, match counts, per-class breakdown, and sample matches. |
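The isolation loop amounts to re-running the pipeline with a one-rule rule set per rule and tallying counts. A simplified sketch: `apply_fn`, the doc dict shape, and 'text'/'type' keys are stand-ins for the real `apply_rules_fn` and Dataset types, and set semantics are used so `covered_expected` coincides with `tp`:

```python
def evaluate_each_rule(rules: list, docs: list[dict], apply_fn) -> list[dict]:
    """Run each rule alone over every doc and tally its counts."""
    results = []
    for rule in rules:
        tp = fp = total_expected = 0
        for doc in docs:
            expected = {(e["text"], e["type"]) for e in doc["expected"]}
            got = {(e["text"], e["type"]) for e in apply_fn([rule], doc["input"])}
            tp += len(got & expected)  # entities matching an expected one
            fp += len(got - expected)  # spurious entities
            total_expected += len(expected)
        results.append({"rule": rule, "tp": tp, "fp": fp,
                        "covered_expected": tp, "total_expected": total_expected})
    return results
```

Sorting the result by per-rule precision is a common way to surface rules that mostly produce spurious entities and are candidates for refinement or removal.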