Benchmarks
Banking77 Intent Classification
To measure how well RuleChef performs on a real task, we benchmarked it on a five-class subset of Banking77, an intent classification dataset with 77 banking customer-service intent classes and ~13K examples.
Setup
- 5 classes pinned: `beneficiary_not_allowed`, `card_arrival`, `disposable_card_limits`, `exchange_rate`, `pending_cash_withdrawal`
- 5-shot per class (25 training examples total)
- Dev set: remaining ~660 unused training examples (for refinement)
- Test set: 200 held-out examples from the official test split (never seen during learning)
- Regex-only rules (no code, no spaCy)
- Agentic coordinator guiding 15 refinement iterations
- Model: Kimi K2 via Groq API
Results on Held-Out Test Set
| Metric | Value |
|---|---|
| Accuracy (exact match) | 60.5% |
| Micro Precision | 100% |
| Micro Recall | 60.5% |
| Micro F1 | 75.4% |
| Macro F1 | 71.7% |
| Coverage | 60.5% (121/200) |
| Rules learned | 108 |
| Learning time | ~144s |
| Per-query latency | 0.19ms |
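The headline numbers follow from two counts in the table: 121 of the 200 test queries matched a rule, and none of those matches was wrong. Below is a minimal sketch of that arithmetic, assuming unmatched queries are treated as abstentions (they lower recall but are not false positives). The function name `micro_metrics` is illustrative and not part of RuleChef.

```python
def micro_metrics(n_total: int, n_predicted: int, n_correct: int) -> dict:
    """Micro-averaged metrics for a classifier that may abstain.

    n_total:     number of test queries
    n_predicted: queries for which some rule fired
    n_correct:   fired predictions that matched the gold label
    """
    precision = n_correct / n_predicted if n_predicted else 0.0
    recall = n_correct / n_total
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": n_correct / n_total,   # exact match over all queries
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "coverage": n_predicted / n_total,
    }


# Counts taken from the results table: 121 of 200 covered, zero false positives.
m = micro_metrics(n_total=200, n_predicted=121, n_correct=121)
for name, value in m.items():
    print(f"{name}: {value:.1%}")
```

With perfect precision, accuracy, recall, and coverage all collapse to the same ratio (121/200), which is why the table reports 60.5% three times.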
Per-Class Breakdown
| Class | Precision | Recall | F1 |
|---|---|---|---|
| exchange_rate | 100% | 95% | 97% |
| pending_cash_withdrawal | 100% | 82% | 90% |
| card_arrival | 100% | 62% | 77% |
| disposable_card_limits | 100% | 40% | 57% |
| beneficiary_not_allowed | 100% | 22% | 37% |
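Macro F1 is the unweighted mean of the per-class F1 scores. The sketch below recomputes it from the precision and recall columns of the table. Because those inputs are rounded to whole percentages, the result lands a few tenths of a point away from the reported 71.7% rather than matching it exactly.

```python
# (precision, recall) per class, copied from the per-class table.
per_class = {
    "exchange_rate": (1.00, 0.95),
    "pending_cash_withdrawal": (1.00, 0.82),
    "card_arrival": (1.00, 0.62),
    "disposable_card_limits": (1.00, 0.40),
    "beneficiary_not_allowed": (1.00, 0.22),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


f1_by_class = {name: f1(p, r) for name, (p, r) in per_class.items()}
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

for name, score in f1_by_class.items():
    print(f"{name}: {score:.0%}")
print(f"macro F1 (from rounded inputs): {macro_f1:.1%}")
```

Since every class contributes equally to the mean, the two low-recall classes pull Macro F1 below Micro F1.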
Sample Rules
Here are a few of the 108 regex rules RuleChef learned (full set in benchmarks/results_banking77.json):
exchange_rate_keywords (?i)\bexchange\s+rates?\b
track_card_delivery (?i)\b(?:track|delivery|status|arrival|come|received).*\bcard\b
cash_withdrawal_pending (?i)\b(?:cash|withdrawal|atm).*\b(?:pending|still|waiting)\b
disposable_limit_keywords (?i)\bdisposable\s+cards?\b(?=.*\b(?:maximum|limit|how many)\b)
beneficiary_ultra_broad (?i)\bbeneficiar(?:y|ies)\b.*\b(?:not allowed|fail|denied|can't)\b
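To show how rules like these behave at query time, here is a small sketch that applies the five patterns above with Python's `re` module. Several things are assumptions made for illustration: the mapping from each rule name to an intent label is inferred from the name, the first-match-wins strategy may differ from how RuleChef actually combines rules, and the sample queries are made up rather than drawn from Banking77.

```python
import re
from typing import Optional

# (rule name, assumed intent label, pattern). Patterns are the sample rules above;
# the label assigned to each rule is an inference from its name.
RULES = [
    ("exchange_rate_keywords", "exchange_rate",
     r"(?i)\bexchange\s+rates?\b"),
    ("track_card_delivery", "card_arrival",
     r"(?i)\b(?:track|delivery|status|arrival|come|received).*\bcard\b"),
    ("cash_withdrawal_pending", "pending_cash_withdrawal",
     r"(?i)\b(?:cash|withdrawal|atm).*\b(?:pending|still|waiting)\b"),
    ("disposable_limit_keywords", "disposable_card_limits",
     r"(?i)\bdisposable\s+cards?\b(?=.*\b(?:maximum|limit|how many)\b)"),
    ("beneficiary_ultra_broad", "beneficiary_not_allowed",
     r"(?i)\bbeneficiar(?:y|ies)\b.*\b(?:not allowed|fail|denied|can't)\b"),
]

# Compile once so each query only pays for matching.
COMPILED = [(name, label, re.compile(pattern)) for name, label, pattern in RULES]


def classify(query: str) -> Optional[str]:
    """Return the label of the first rule that fires, or None to abstain."""
    for _name, label, pattern in COMPILED:
        if pattern.search(query):
            return label
    return None


# Hypothetical queries for illustration only.
for q in [
    "What is the exchange rate for EUR?",
    "My ATM withdrawal is still pending",
    "Do disposable cards have a maximum limit?",
    "I lost my phone yesterday",
]:
    print(f"{q!r} -> {classify(q)}")
```

Returning `None` when nothing matches is what keeps precision high at the expense of coverage: a query with no matching rule gets no answer instead of a guess. Note that the lookahead in `disposable_limit_keywords` only fires when the limit wording comes after "disposable card(s)", which hints at why recall for that class is low.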
Key Takeaways
- Precision is perfect on this test set: zero false positives across all five classes. In production, a wrong answer is usually worse than no answer, and here every query that matched a rule received the correct label.
- Recall falls as phrasing gets more varied. Simple keyword patterns (`exchange_rate` at 95%) are easy; nuanced paraphrases (`beneficiary_not_allowed` at 22%) need more examples or refinement iterations.
- No per-query model cost. After learning, every query is a regex match: no API calls, no tokens, no network latency. At 0.19ms per query, you can process ~5K queries per second on a single CPU.
- The agentic coordinator matters. With a simple heuristic coordinator and 3 iterations instead, accuracy drops to ~49% and Macro F1 to ~60%. The agentic coordinator's per-class guidance lifts Macro F1 from ~60% to 71.7%.
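The throughput figure is the reciprocal of the per-query latency. The sketch below shows that conversion together with a simple way to time regex matching locally. The helper names and the sample queries are hypothetical, and the timing loop uses a single pattern, so its output is not comparable to the 0.19ms figure measured with all 108 rules.

```python
import re
import time

# One of the sample rules; the benchmark itself evaluates the full rule set.
PATTERN = re.compile(r"(?i)\bexchange\s+rates?\b")

# Made-up queries, repeated to get a measurable total duration.
QUERIES = ["What is the exchange rate for EUR?", "Where is my card?"] * 1000


def mean_latency_ms(pattern: re.Pattern, queries: list) -> float:
    """Average wall-clock time per query, in milliseconds."""
    start = time.perf_counter()
    for q in queries:
        pattern.search(q)
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / len(queries)


def queries_per_second(latency_ms: float) -> float:
    """Single-threaded throughput implied by a per-query latency."""
    return 1000.0 / latency_ms


print(f"reported 0.19 ms/query -> {queries_per_second(0.19):.0f} queries/s")
print(f"local single-pattern latency: {mean_latency_ms(PATTERN, QUERIES):.5f} ms")
```

The first line prints roughly 5263 queries per second, which is the "~5K" quoted above; the second varies by machine.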
Reproduce
pip install "rulechef[benchmark]"
python benchmarks/benchmark_banking77.py \
--classes beneficiary_not_allowed,card_arrival,disposable_card_limits,exchange_rate,pending_cash_withdrawal \
--shots 5 --max-iterations 15 --agentic \
--base-url https://api.groq.com/openai/v1 \
--model moonshotai/kimi-k2-instruct-0905