Ranking
Per-rule and ensemble evaluation, validated-trust statistics, and pruning.
After learning, each rule is evaluated in isolation and in the context of the full ruleset so that overlapping rules can be ordered by measured trust and unhelpful rules can be removed. See Rule Trust and Conflict Resolution for the conceptual overview.
rank_rules
rank_rules(rules, dataset, apply_rules_fn, mode='text', compute_marginal=True, stamp_validated_stats=True)
Evaluate rules alone and as an ensemble, producing a ranking.
For each rule this computes standalone precision/recall/F1, and optionally its marginal contribution: how much the ensemble micro F1 drops when the rule is removed. The ranking answers "which rule should win a conflict" (solo precision) and "which rules earn their place" (marginal F1).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `rules` | `list[Rule]` | Rules to rank. | required |
| `dataset` | `Dataset` | Dataset to evaluate against. Use a held-out dev set when available so validated stats measure generalization, not memorization. | required |
| `apply_rules_fn` | `Callable` | `Callable(rules, input_data, task_type, text_field) -> output dict`. | required |
| `mode` | `str` | Matching mode (`'text'`, `'exact'`, or `'partial'`). | `'text'` |
| `compute_marginal` | `bool` | If True, run the leave-one-out ablation pass. Costs one full-dataset evaluation per rule; disable for very large rule sets or datasets. | `True` |
| `stamp_validated_stats` | `bool` | If True, write each rule's solo precision and support onto `rule.validated_precision` / `rule.validated_support`. The executor uses `validated_precision` to order rules within the same priority, so the empirically more precise rule wins conflicts. | `True` |
Returns:
| Type | Description |
|---|---|
| `RankingReport` | `RankingReport` with ensemble metrics and per-rule rankings sorted most valuable first. |
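The leave-one-out pass behind the marginal contribution can be illustrated without the library. The sketch below is a minimal, self-contained approximation and not the actual implementation: rules are modeled as plain functions returning a set of labels, and `micro_f1` and `marginal_f1` are hypothetical helpers invented for this example.

```python
def micro_f1(rules, examples):
    """Micro F1 of the union of all rule outputs over (text, gold_labels) pairs."""
    tp = fp = fn = 0
    for text, gold in examples:
        predicted = set()
        for rule in rules:
            predicted |= rule(text)
        tp += len(predicted & gold)
        fp += len(predicted - gold)
        fn += len(gold - predicted)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def marginal_f1(rules, examples):
    """Ensemble micro F1 minus the micro F1 with each rule left out in turn."""
    full = micro_f1(rules, examples)
    return [
        full - micro_f1(rules[:i] + rules[i + 1:], examples)
        for i in range(len(rules))
    ]


# Toy data (assumed for illustration): three texts with gold labels.
examples = [
    ("refund my order", {"billing"}),
    ("app crashes on login", {"bug"}),
    ("please fix the crash", {"bug"}),
]
toy_rules = [
    lambda t: {"billing"} if "refund" in t else set(),  # precise
    lambda t: {"bug"} if "crash" in t else set(),       # precise
    lambda t: {"billing"} if "please" in t else set(),  # noisy
]
scores = marginal_f1(toy_rules, examples)
```

In this toy setup the first two entries of `scores` are positive and the third is negative: removing the noisy rule raises the ensemble F1. Each entry requires one full pass over the examples, which is the cost that `compute_marginal=False` avoids.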
prune_harmful_rules
prune_harmful_rules(rules, report, min_marginal_f1=0.0)
Split rules into (kept, dropped) based on marginal contribution.
A rule is dropped when its marginal F1 is known and below min_marginal_f1. With the default threshold of 0.0, that means the ensemble measurably does better without it. Rules without ablation data are always kept.
Returns:
| Type | Description |
|---|---|
| `tuple[list[Rule], list[Rule]]` | Tuple of (kept_rules, dropped_rules). |
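The keep/drop decision can be sketched as a simple partition. This is an illustration of the documented behavior under assumptions, not the library's code: `ToyRanking` and `split_by_marginal` are hypothetical names, and "below" is read as a strict comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRanking:
    """Hypothetical stand-in for a ranking entry; only the fields used here."""
    rule_id: str
    marginal_f1: Optional[float] = None


def split_by_marginal(rankings, min_marginal_f1=0.0):
    """Return (kept_ids, dropped_ids) following the rule described above."""
    kept, dropped = [], []
    for r in rankings:
        # Drop only when ablation data exists and falls below the threshold.
        if r.marginal_f1 is not None and r.marginal_f1 < min_marginal_f1:
            dropped.append(r.rule_id)
        else:
            kept.append(r.rule_id)
    return kept, dropped


kept, dropped = split_by_marginal([
    ToyRanking("r1", 0.19),
    ToyRanking("r2", -0.14),
    ToyRanking("r3", None),  # ablation skipped: always kept
    ToyRanking("r4", 0.0),   # not below the threshold: kept
])
```

Here only `r2` ends up in `dropped`. Raising `min_marginal_f1` above zero would also drop rules that help, but by less than the chosen margin.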
wilson_lower_bound
wilson_lower_bound(precision, support, z=1.96)
Wilson score lower bound for a measured precision.
Raw precision is unreliable at low support: 1/1 correct is "100% precision" but tells you almost nothing. The Wilson lower bound discounts the estimate by sample size — use it instead of raw validated_precision when deciding whether to trust a rule for routing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `precision` | `float` | Observed precision (0.0-1.0). | required |
| `support` | `int` | Number of predictions the estimate is based on. | required |
| `z` | `float` | Confidence z-score (1.96 = 95% confidence). | `1.96` |
Returns:
| Type | Description |
|---|---|
| `float` | Lower bound on the true precision; 0.0 when support is 0. |
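To make the discounting concrete, here is a minimal sketch of the standard Wilson score lower bound for a proportion. It is written from the textbook formula and assumes the library does not apply extra corrections (such as a continuity correction); `wilson_lower_bound_sketch` is a hypothetical name used to avoid implying it is the library function.

```python
import math


def wilson_lower_bound_sketch(precision, support, z=1.96):
    """Lower end of the Wilson score interval for an observed proportion."""
    if support == 0:
        return 0.0
    n = support
    z2 = z * z
    centre = precision + z2 / (2 * n)
    margin = z * math.sqrt(precision * (1 - precision) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


one_of_one = wilson_lower_bound_sketch(1.0, 1)            # about 0.21
hundred_of_hundred = wilson_lower_bound_sketch(1.0, 100)  # about 0.96
```

Both inputs have a raw precision of 1.0, but the bound separates them sharply: a single correct prediction supports only about 0.21, while a hundred support about 0.96.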
RankingReport
RankingReport(ensemble_precision=0.0, ensemble_recall=0.0, ensemble_f1=0.0, rankings=list())
dataclass
Result of rank_rules(): ensemble metrics plus per-rule rankings.
Attributes:
| Name | Type | Description |
|---|---|---|
| `ensemble_precision` | `float` | Micro precision of all rules together. |
| `ensemble_recall` | `float` | Micro recall of all rules together. |
| `ensemble_f1` | `float` | Micro F1 of all rules together. |
| `rankings` | `list[RuleRanking]` | Per-rule rankings, sorted most valuable first (by marginal F1 when available, then solo precision). |
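One way to realize the "marginal F1 when available, then solo precision" ordering is a tuple sort key. The sketch below uses plain dicts as hypothetical stand-ins for ranking entries, and it assumes that entries without ablation data sort after those with it; the source does not say where such entries land.

```python
rows = [
    {"rule_id": "a", "marginal_f1": 0.05, "solo_precision": 0.80},
    {"rule_id": "b", "marginal_f1": 0.05, "solo_precision": 0.95},
    {"rule_id": "c", "marginal_f1": -0.02, "solo_precision": 0.99},
    {"rule_id": "d", "marginal_f1": None, "solo_precision": 1.00},
]


def value_key(row):
    """Sort key: has ablation data, then marginal F1, then solo precision."""
    has_marginal = row["marginal_f1"] is not None
    marginal = row["marginal_f1"] if has_marginal else 0.0
    return (has_marginal, marginal, row["solo_precision"])


# reverse=True puts the most valuable entry first.
ordered = [r["rule_id"] for r in sorted(rows, key=value_key, reverse=True)]
```

With this key, `b` precedes `a` because solo precision breaks the tie on marginal F1. When the ablation pass is skipped for every rule, the key reduces to sorting by solo precision alone.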
to_dict()
RuleRanking
RuleRanking(rule_id, rule_name, solo_precision=0.0, solo_recall=0.0, solo_f1=0.0, support=0, marginal_f1=None)
dataclass
Ranking entry for one rule.
Attributes:
| Name | Type | Description |
|---|---|---|
| `rule_id` | `str` | Unique identifier of the rule. |
| `rule_name` | `str` | Human-readable rule name. |
| `solo_precision` | `float` | Precision of the rule evaluated in isolation. |
| `solo_recall` | `float` | Recall of the rule evaluated in isolation. |
| `solo_f1` | `float` | F1 of the rule evaluated in isolation. |
| `support` | `int` | Number of predictions the rule made alone (TP + FP). |
| `marginal_f1` | `float \| None` | Ensemble micro F1 minus the ensemble micro F1 without this rule. Positive means the rule helps the ensemble; negative means the ensemble is better off without it. None when the ablation pass was skipped. |
to_dict()
print_ranking_report
print_ranking_report(report)
Pretty-print a RankingReport.