Ranking

Per-rule and ensemble evaluation, validated-trust statistics, and pruning.

After learning, each rule is evaluated in isolation and in the context of the full ruleset so that overlapping rules can be ordered by measured trust and unhelpful rules can be removed. See Rule Trust and Conflict Resolution for the conceptual overview.

rank_rules

rank_rules(rules, dataset, apply_rules_fn, mode='text', compute_marginal=True, stamp_validated_stats=True)

Evaluate rules alone and as an ensemble, producing a ranking.

For each rule this computes standalone precision/recall/F1, and optionally its marginal contribution: how much the ensemble micro F1 drops when the rule is removed. The ranking answers "which rule should win a conflict" (solo precision) and "which rules earn their place" (marginal F1).
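The leave-one-out idea behind marginal F1 can be sketched in a few lines. This is a toy illustration, not the library's implementation: here a "rule" is just a hypothetical function from text to a set of labels, the ensemble prediction is assumed to be the union of all rule outputs, and the names `micro_f1` and `marginal_f1` are made up for the example.

```python
# Toy sketch of leave-one-out marginal F1 (assumptions: rules are plain
# functions returning label sets; the ensemble is the union of their outputs).

def micro_f1(rules, examples):
    """Micro F1 of the union of all rule predictions over (text, gold) pairs."""
    tp = fp = fn = 0
    for text, gold in examples:
        predicted = set().union(*(rule(text) for rule in rules))
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def marginal_f1(rules, examples):
    """Full-ensemble F1 minus the F1 obtained with each rule removed in turn."""
    full = micro_f1(list(rules.values()), examples)
    return {
        name: full - micro_f1([r for n, r in rules.items() if n != name], examples)
        for name in rules
    }


examples = [
    ("refund please", {"billing"}),
    ("app crashes", {"bug"}),
    ("great app", set()),
]
rules = {
    "refund": lambda t: {"billing"} if "refund" in t else set(),
    "crash": lambda t: {"bug"} if "crash" in t else set(),
    "app": lambda t: {"bug"} if "app" in t else set(),  # over-broad on purpose
}

full_f1 = micro_f1(list(rules.values()), examples)
scores = marginal_f1(rules, examples)
```

On this toy data the `refund` rule has a positive marginal F1 (about +0.3), `crash` is redundant with `app` (0.0), and `app` is harmful (about -0.2) because it fires on "great app". Each entry costs one extra pass over the dataset, which is why the ablation pass scales with the number of rules.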

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| rules | list[Rule] | Rules to rank. | required |
| dataset | Dataset | Dataset to evaluate against. Use a held-out dev set when available so validated stats measure generalization, not memorization. | required |
| apply_rules_fn | Callable | Callable(rules, input_data, task_type, text_field) -> output dict. | required |
| mode | str | Matching mode ('text', 'exact', or 'partial'). | 'text' |
| compute_marginal | bool | If True, run the leave-one-out ablation pass. Costs one full-dataset evaluation per rule; disable for very large rule sets or datasets. | True |
| stamp_validated_stats | bool | If True, write each rule's solo precision and support onto rule.validated_precision / rule.validated_support. The executor uses validated_precision to order rules within the same priority, so the empirically more precise rule wins conflicts. | True |

Returns:

| Type | Description |
| --- | --- |
| RankingReport | RankingReport with ensemble metrics and per-rule rankings sorted most valuable first. |
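The effect of stamping validated stats on conflict resolution can be pictured as a two-level sort. The sketch below is an assumption-laden illustration, not the executor's code: `ToyRule` is a stand-in class, the `priority` attribute name and the "higher priority first" direction are guesses, and treating a missing `validated_precision` as 0.0 is a choice made only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRule:  # hypothetical stand-in, not the library's Rule class
    name: str
    priority: int  # assumed attribute: higher value is tried first
    validated_precision: Optional[float] = None


def conflict_order(rules):
    """Order by priority first, then by validated precision within a priority."""
    return sorted(
        rules,
        key=lambda r: (-r.priority, -(r.validated_precision or 0.0)),
    )


rules = [
    ToyRule("broad_keyword", priority=1, validated_precision=0.71),
    ToyRule("strict_regex", priority=1, validated_precision=0.94),
    ToyRule("manual_override", priority=2),
]
ordered = [r.name for r in conflict_order(rules)]
```

Within priority 1, `strict_regex` comes before `broad_keyword` because its measured precision is higher, so it would win a conflict between the two.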

prune_harmful_rules

prune_harmful_rules(rules, report, min_marginal_f1=0.0)

Split rules into (kept, dropped) based on marginal contribution.

A rule is dropped when its marginal F1 is known and below min_marginal_f1. With the default threshold of 0.0, that means the ensemble measurably does better without it. Rules without ablation data (marginal_f1 is None) are always kept.

Returns:

| Type | Description |
| --- | --- |
| tuple[list[Rule], list[Rule]] | Tuple of (kept_rules, dropped_rules). |
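The keep/drop decision described above amounts to a simple partition. The following is a minimal sketch under stated assumptions: it works on rule ids rather than Rule objects, `ToyRanking` is a stand-in carrying only the two fields it needs, and `split_by_marginal` is a hypothetical name, not part of the library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRanking:  # stand-in with only the fields this sketch needs
    rule_id: str
    marginal_f1: Optional[float] = None


def split_by_marginal(rule_ids, rankings, min_marginal_f1=0.0):
    """Partition rule ids into (kept, dropped) by their marginal F1."""
    marginal = {r.rule_id: r.marginal_f1 for r in rankings}
    kept, dropped = [], []
    for rule_id in rule_ids:
        score = marginal.get(rule_id)
        # Drop only when a score exists AND it is strictly below the threshold.
        if score is not None and score < min_marginal_f1:
            dropped.append(rule_id)
        else:
            kept.append(rule_id)
    return kept, dropped


rankings = [
    ToyRanking("refund", 0.3),
    ToyRanking("crash", 0.0),
    ToyRanking("app", -0.2),
    ToyRanking("new_rule", None),  # no ablation data
]
kept, dropped = split_by_marginal(["refund", "crash", "app", "new_rule"], rankings)
```

Note that a rule with a marginal F1 of exactly 0.0 survives the default threshold because the comparison is "below", and a rule with no ablation data is kept regardless.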

wilson_lower_bound

wilson_lower_bound(precision, support, z=1.96)

Wilson score lower bound for a measured precision.

Raw precision is unreliable at low support: 1/1 correct is "100% precision" but tells you almost nothing. The Wilson lower bound discounts the estimate by sample size — use it instead of raw validated_precision when deciding whether to trust a rule for routing.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| precision | float | Observed precision (0.0-1.0). | required |
| support | int | Number of predictions the estimate is based on. | required |
| z | float | Confidence z-score (1.96 = 95% confidence). | 1.96 |

Returns:

| Type | Description |
| --- | --- |
| float | Lower bound on the true precision; 0.0 when support is 0. |
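For reference, the standard Wilson score lower bound can be written directly from its textbook formula. This is a sketch of that formula with the same argument names as above, under a different function name to make clear it is not taken from the library's source; the library's own implementation may differ in details such as input validation.

```python
import math


def wilson_lower_bound_sketch(precision, support, z=1.96):
    """Textbook Wilson score lower bound for a proportion with `support` trials."""
    if support == 0:
        return 0.0
    n = support
    z2 = z * z
    centre = precision + z2 / (2 * n)
    margin = z * math.sqrt(precision * (1 - precision) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


one_of_one = wilson_lower_bound_sketch(1.0, 1)        # roughly 0.21
hundred_of_hundred = wilson_lower_bound_sketch(1.0, 100)  # roughly 0.96
no_support = wilson_lower_bound_sketch(0.5, 0)        # 0.0 by convention
```

Both rules have a raw precision of 1.0, but the bound separates them clearly: a single correct prediction supports a lower bound of only about 0.21, while a hundred correct predictions support about 0.96.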

RankingReport

RankingReport(ensemble_precision=0.0, ensemble_recall=0.0, ensemble_f1=0.0, rankings=list()) dataclass

Result of rank_rules(): ensemble metrics plus per-rule rankings.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| ensemble_precision | float | Micro precision of all rules together. |
| ensemble_recall | float | Micro recall of all rules together. |
| ensemble_f1 | float | Micro F1 of all rules together. |
| rankings | list[RuleRanking] | Per-rule rankings, sorted most valuable first (by marginal F1 when available, then solo precision). |

to_dict()
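The "most valuable first" ordering of rankings can be approximated with a composite sort key. This sketch is one plausible reading of "by marginal F1 when available, then solo precision", not the library's code: `ToyRanking` is a stand-in, and placing entries without a marginal F1 last is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRanking:  # stand-in with only the fields used for ordering
    rule_name: str
    solo_precision: float
    marginal_f1: Optional[float] = None


def sort_key(ranking):
    """Marginal F1 first (missing values sort last), then solo precision."""
    marginal = ranking.marginal_f1 if ranking.marginal_f1 is not None else float("-inf")
    return (marginal, ranking.solo_precision)


rankings = [
    ToyRanking("app", 0.50, -0.2),
    ToyRanking("crash", 1.00, 0.0),
    ToyRanking("refund", 1.00, 0.3),
    ToyRanking("dup_crash", 0.80, 0.0),
]
ordered = [r.rule_name for r in sorted(rankings, key=sort_key, reverse=True)]
```

Here `crash` and `dup_crash` tie on marginal F1, so solo precision breaks the tie in favour of `crash`.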

RuleRanking

RuleRanking(rule_id, rule_name, solo_precision=0.0, solo_recall=0.0, solo_f1=0.0, support=0, marginal_f1=None) dataclass

Ranking entry for one rule.

Attributes:

| Name | Type | Description |
| --- | --- | --- |
| rule_id | str | Unique identifier of the rule. |
| rule_name | str | Human-readable rule name. |
| solo_precision | float | Precision of the rule evaluated in isolation. |
| solo_recall | float | Recall of the rule evaluated in isolation. |
| solo_f1 | float | F1 of the rule evaluated in isolation. |
| support | int | Number of predictions the rule made alone (TP + FP). |
| marginal_f1 | float \| None | Ensemble micro F1 minus the ensemble micro F1 without this rule. Positive means the rule helps the ensemble; negative means the ensemble is better off without it. None when the ablation pass was skipped. |

to_dict()

print_ranking_report

print_ranking_report(report)

Pretty-print a RankingReport.