Evaluation¶
Metrics for evaluating model quality.
evaluate_model(model_path, eval_file, max_samples=None, max_new_tokens=1024)
¶
Evaluate the model on the eval set.
Args:
    model_path: Path to the trained model.
    eval_file: Path to eval.jsonl.
    max_samples: Maximum number of samples to evaluate.
    max_new_tokens: Maximum number of tokens to generate.

Returns:
    Dict with aggregate metrics.
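The evaluation loop might look like the sketch below. The `generate_fn` hook, the `prompt`/`completion` field names in eval.jsonl, and the exact-match stand-in metric are assumptions for illustration; the real function loads the model from `model_path` and aggregates the metric functions documented below.

```python
import json

def evaluate_model(model_path, eval_file, max_samples=None, max_new_tokens=1024,
                   generate_fn=None):
    """Run generation over eval.jsonl and aggregate per-example metrics.

    generate_fn(prompt, max_new_tokens) -> str is a hypothetical stand-in for
    model inference; the real function would load the model from model_path.
    """
    results = []
    with open(eval_file) as f:
        for i, line in enumerate(f):
            if max_samples is not None and i >= max_samples:
                break
            example = json.loads(line)  # assumed fields: "prompt", "completion"
            predicted = generate_fn(example["prompt"], max_new_tokens)
            # Exact match used here as a placeholder aggregate metric.
            results.append(float(predicted.strip() == example["completion"].strip()))
    n = len(results)
    return {"n_samples": n, "exact_match": sum(results) / n if n else 0.0}
```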
compute_span_metrics(predicted, reference)
¶
Compute span-level precision, recall, F1 using set overlap on normalized lines.
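A minimal sketch of the set-overlap computation; the exact normalization (here, lowercasing and stripping each non-empty line) is an assumption:

```python
def _normalize_lines(text):
    """Lowercase and strip each line; drop empty lines. Normalization is assumed."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}

def compute_span_metrics(predicted, reference):
    """Precision, recall, and F1 over the sets of normalized lines."""
    pred, ref = _normalize_lines(predicted), _normalize_lines(reference)
    overlap = len(pred & ref)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```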
compute_partial_overlap(predicted, reference)
¶
Compute partial overlap ratio using character-level intersection.
For each reference line, find the best-matching predicted line (substring match) and compute the fraction of reference characters covered.
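One way to realize this, sketched under the assumption that "substring match" means one line containing the other and coverage is averaged over reference lines:

```python
def compute_partial_overlap(predicted, reference):
    """Average, over reference lines, of the best character coverage
    achieved by any predicted line (substring containment assumed)."""
    pred_lines = [l.strip() for l in predicted.splitlines() if l.strip()]
    ref_lines = [l.strip() for l in reference.splitlines() if l.strip()]
    if not ref_lines:
        return 1.0 if not pred_lines else 0.0
    total = 0.0
    for ref in ref_lines:
        best = 0.0
        for pred in pred_lines:
            if ref in pred:
                best = 1.0  # reference line fully covered
            elif pred in ref:
                best = max(best, len(pred) / len(ref))
        total += best
    return total / len(ref_lines)
```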
compute_empty_accuracy(predicted, reference)
¶
Check whether the model correctly predicts empty vs. non-empty output.
Returns category (true_positive, true_negative, false_positive, false_negative) and whether correct.
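A sketch of the confusion-matrix bookkeeping; treating non-empty output as the "positive" class is an assumption:

```python
def compute_empty_accuracy(predicted, reference):
    """Classify the empty/non-empty prediction into a confusion-matrix category."""
    pred_empty = not predicted.strip()
    ref_empty = not reference.strip()
    if not pred_empty and not ref_empty:
        category = "true_positive"   # both non-empty
    elif pred_empty and ref_empty:
        category = "true_negative"   # both empty
    elif not pred_empty and ref_empty:
        category = "false_positive"  # predicted content, reference empty
    else:
        category = "false_negative"  # predicted empty, reference has content
    return {"category": category, "correct": pred_empty == ref_empty}
```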
compute_rouge_l(predicted, reference)
¶
Compute ROUGE-L F1 score.
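ROUGE-L scores the longest common subsequence of tokens between prediction and reference. A sketch, assuming whitespace tokenization:

```python
def _lcs_length(a, b):
    """Longest common subsequence length via dynamic programming (rolling row)."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0  # dp[j-1] from the previous row
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]

def compute_rouge_l(predicted, reference):
    """ROUGE-L F1 over whitespace-separated tokens (tokenization assumed)."""
    pred, ref = predicted.split(), reference.split()
    if not pred or not ref:
        return 0.0
    lcs = _lcs_length(pred, ref)
    precision, recall = lcs / len(pred), lcs / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```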
compute_compression_ratio(original, filtered)
¶
Compute compression ratio (1 - filtered/original).
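This reduces to a one-liner; measuring size in characters (rather than tokens or lines) is an assumption:

```python
def compute_compression_ratio(original, filtered):
    """Fraction of the original removed by filtering: 1 - len(filtered)/len(original).
    Character-level length is assumed as the size measure."""
    if len(original) == 0:
        return 0.0
    return 1 - len(filtered) / len(original)
```

A ratio of 0.75 means the filtered text is a quarter the size of the original.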