
Evaluation

Evaluation functions and result types.

EvalResult

EvalResult(micro_precision=0.0, micro_recall=0.0, micro_f1=0.0, macro_f1=0.0, per_class=list(), exact_match=0.0, total_tp=0, total_fp=0, total_fn=0, total_docs=0, failures=list()) dataclass

Rich evaluation result across a dataset.

Attributes:

micro_precision (float): Entity-level micro-averaged precision.
micro_recall (float): Entity-level micro-averaged recall.
micro_f1 (float): Entity-level micro-averaged F1 score.
macro_f1 (float): Macro F1 (unweighted average of per-class F1 scores).
per_class (list[ClassMetrics]): Per-class precision/recall/F1 breakdown.
exact_match (float): Fraction of documents with perfect output (0.0-1.0).
total_tp (int): Total true positive count across all classes.
total_fp (int): Total false positive count across all classes.
total_fn (int): Total false negative count across all classes.
total_docs (int): Number of documents evaluated.
failures (list[dict]): Failure dicts with keys 'input', 'expected', 'got', and 'is_correction'. Used by the refinement loop to generate patches.

to_dict()
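The exact_match and failures attributes are derived from per-document comparisons. The sketch below shows one plausible way such values could be computed from (input, expected, got, is_correction) tuples; it is an illustration of the documented semantics, not the library's actual implementation, and collect_failures is a hypothetical helper name.

```python
def collect_failures(docs):
    """docs: list of (input_text, expected, got, is_correction) tuples.

    Returns (exact_match, failures) following the EvalResult semantics:
    exact_match is the fraction of documents whose output matches the
    expected entities exactly; failures holds one dict per mismatch.
    """
    failures, perfect = [], 0
    for inp, expected, got, is_correction in docs:
        # Order-insensitive comparison of expected vs. produced entities
        if sorted(expected) == sorted(got):
            perfect += 1
        else:
            failures.append({"input": inp, "expected": expected,
                             "got": got, "is_correction": is_correction})
    exact_match = perfect / len(docs) if docs else 0.0
    return exact_match, failures
```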

ClassMetrics

ClassMetrics(label, tp=0, fp=0, fn=0) dataclass

Precision / recall / F1 for a single entity type or key.

Attributes:

label (str): The class/entity type name.
tp (int): True positive count.
fp (int): False positive count.
fn (int): False negative count.

precision property

recall property

f1 property

to_dict()
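Because ClassMetrics stores only raw counts, the precision, recall, and f1 properties can be derived on demand. A minimal self-contained sketch of this pattern (the real class may differ in details such as zero-division handling):

```python
from dataclasses import dataclass


@dataclass
class ClassMetrics:
    """Precision / recall / F1 for a single entity type, from raw counts."""
    label: str
    tp: int = 0
    fp: int = 0
    fn: int = 0

    @property
    def precision(self) -> float:
        # TP / (TP + FP); 0.0 when nothing was produced for this class
        return self.tp / (self.tp + self.fp) if (self.tp + self.fp) else 0.0

    @property
    def recall(self) -> float:
        # TP / (TP + FN); 0.0 when nothing was expected for this class
        return self.tp / (self.tp + self.fn) if (self.tp + self.fn) else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0

    def to_dict(self) -> dict:
        return {"label": self.label, "tp": self.tp, "fp": self.fp,
                "fn": self.fn, "precision": self.precision,
                "recall": self.recall, "f1": self.f1}
```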

RuleMetrics

RuleMetrics(rule_id, rule_name, precision=0.0, recall=0.0, f1=0.0, matches=0, true_positives=0, false_positives=0, covered_expected=0, total_expected=0, per_class=list(), sample_matches=list()) dataclass

Evaluation of a single rule in isolation.

Attributes:

rule_id (str): Unique identifier of the evaluated rule.
rule_name (str): Human-readable name of the rule.
precision (float): Precision of this rule alone (TP / (TP + FP)).
recall (float): Recall of this rule alone (covered / total expected entities).
f1 (float): F1 score derived from precision and recall.
matches (int): Total number of entities this rule produced.
true_positives (int): Entities that matched an expected entity.
false_positives (int): Entities that did not match any expected entity.
covered_expected (int): Number of expected entities this rule correctly finds.
total_expected (int): Total expected entities across the full dataset.
per_class (list[ClassMetrics]): Per-class breakdown of TP/FP/FN for this rule.
sample_matches (list[dict]): Up to 10 sample match dicts showing rule behavior.

Functions

evaluate_dataset

evaluate_dataset(rules, dataset, apply_rules_fn, mode='text')

Evaluate rules against a full dataset, producing entity-level metrics.

Parameters:

rules (list[Rule], required): Rules to evaluate.
dataset (Dataset, required): Dataset with examples and corrections.
apply_rules_fn (required): Callable(rules, input_data, task_type, text_field) -> output_dict.
mode (str, default 'text'): 'text' (match by text+type) or 'exact' (match by text+type+start+end).

Returns:

EvalResult: Result with micro/macro metrics, per-class breakdown, exact match rate, and a list of failure dicts for refinement.
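The micro/macro distinction matters when classes are imbalanced: micro averaging pools TP/FP/FN counts across all classes before computing one score, while macro averaging computes F1 per class and takes the unweighted mean. A standalone sketch with hypothetical counts (not library code):

```python
def micro_macro_f1(per_class):
    """per_class: list of (tp, fp, fn) count tuples, one per entity class.

    Returns (micro_f1, macro_f1) as defined for EvalResult.
    """
    # Micro: pool the raw counts across classes, then compute P/R/F1 once
    tp = sum(c[0] for c in per_class)
    fp = sum(c[1] for c in per_class)
    fn = sum(c[2] for c in per_class)
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    micro_f1 = 2 * p * r / (p + r) if (p + r) else 0.0

    # Macro: compute F1 per class, then take the unweighted mean
    f1s = []
    for ctp, cfp, cfn in per_class:
        cp = ctp / (ctp + cfp) if (ctp + cfp) else 0.0
        cr = ctp / (ctp + cfn) if (ctp + cfn) else 0.0
        f1s.append(2 * cp * cr / (cp + cr) if (cp + cr) else 0.0)
    macro_f1 = sum(f1s) / len(f1s) if f1s else 0.0
    return micro_f1, macro_f1
```

With a frequent class scoring well and a rare class scoring poorly, the micro score is dominated by the frequent class while the macro score weights both equally.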

evaluate_rules_individually

evaluate_rules_individually(rules, dataset, apply_rules_fn, mode='text', max_samples=10)

Evaluate each rule in isolation against the dataset.

For each rule, runs it alone and computes how many expected entities it produces correctly (TP), how many spurious entities it produces (FP), and how many expected entities it misses (recall denominator).

Parameters:

Name Type Description Default
rules list[Rule]

Rules to evaluate individually.

required
dataset Dataset

Dataset with examples and corrections.

required
apply_rules_fn

Callable(rules, input_data, task_type, text_field) -> output_dict.

required
mode str

'text' or 'exact'.

'text'
max_samples int

Max sample matches to store per rule.

10

Returns:

list[RuleMetrics]: One entry per rule, with per-rule precision/recall/F1, match counts, per-class breakdown, and sample matches.
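The per-rule counting described above can be sketched as follows. This is an illustrative reimplementation of the counting logic only, with a hypothetical eval_rule_alone helper and rules modeled as plain callables; the real function instead runs each rule through apply_rules_fn and returns full RuleMetrics objects.

```python
def eval_rule_alone(rule_fn, docs):
    """rule_fn: callable mapping text -> list of (text, type) entity pairs.
    docs: list of (text, expected_entities) pairs, matched in 'text' mode.

    Counts TP/FP for this rule alone and computes precision/recall/F1,
    with recall measured against all expected entities in the dataset.
    """
    tp = fp = covered = total_expected = 0
    for text, expected in docs:
        expected_set = set(expected)       # (text, type) pairs
        produced = set(rule_fn(text))
        tp += len(produced & expected_set)       # correct entities
        fp += len(produced - expected_set)       # spurious entities
        covered += len(produced & expected_set)  # expected entities found
        total_expected += len(expected_set)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = covered / total_expected if total_expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Running each rule in isolation this way shows which rules pull their weight: a rule with high precision but near-zero coverage may be redundant with other rules, while a low-precision rule is a patching candidate.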