Training¶
Squeez supports two model architectures:
- Generative (Qwen 3.5 2B + LoRA) — high-quality extraction via JSON generation
- Encoder (mmBERT 307M) — fast line-level binary classification with sliding window
Both use the same dataset and produce comparable metrics for direct comparison.
1. Download the dataset¶
This pulls the tool output extraction dataset into data/train.jsonl (8,241 samples), data/dev.jsonl (252 samples), and data/test.jsonl (557 samples).
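Each split is plain JSON Lines (one JSON object per line). A quick sanity check after downloading, as a minimal sketch (the `load_jsonl` helper is illustrative, not part of the squeez API):

```python
import json

def load_jsonl(path):
    """Read one JSON object per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# e.g. samples = load_jsonl("data/train.jsonl")  # expect 8,241 records
```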
2. Generative model (Qwen + LoRA)¶
Known-good environment¶
This repo currently has a known-good training stack pinned in requirements-train.txt.
Install it with:
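Presumably a standard pip install into your active environment:

```shell
pip install -r requirements-train.txt
```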
Pinned versions:
unsloth==2026.3.4
unsloth_zoo==2026.3.2
trl==0.24.0
transformers==5.2.0
peft==0.18.1
torch==2.9.0
datasets==3.4.1
If training is already working on your machine, do not upgrade these packages casually.
Train¶
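With the pinned environment installed, training with the defaults below presumably reduces to the `squeez train` command shown in the override example, pointed at the downloaded splits:

```shell
squeez train \
  --train-file data/train.jsonl \
  --eval-file data/dev.jsonl
```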
Configuration¶
Training hyperparameters are in configs/default.yaml:
model: "Qwen/Qwen3.5-2B"
max_length: 16384
batch_size: 8
gradient_accumulation_steps: 4
learning_rate: 2.0e-4
num_epochs: 3
lora_r: 16
lora_alpha: 32
lora_dropout: 0
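With these defaults the effective batch size is batch_size × gradient_accumulation_steps. A back-of-the-envelope optimizer-step count, using the 8,241 training samples from step 1 (actual trainers may differ slightly, e.g. by dropping the last partial batch):

```python
import math

batch_size = 8
grad_accum = 4
num_epochs = 3
train_samples = 8241  # data/train.jsonl

effective_batch = batch_size * grad_accum            # samples per optimizer step
steps_per_epoch = math.ceil(train_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs
print(effective_batch, steps_per_epoch, total_steps)
```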
Override via CLI:
squeez train \
--train-file data/train.jsonl \
--eval-file data/dev.jsonl \
--base-model Qwen/Qwen3.5-2B \
--batch-size 4 \
--lr 1e-4 \
--epochs 5 \
--lora-r 32
LoRA targets¶
LoRA adapters are applied to all attention and FFN layers:
- q_proj, k_proj, v_proj, o_proj (attention)
- gate_proj, up_proj, down_proj (FFN)
With r=16, this trains ~0.5% of total parameters.
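The trainable fraction follows from the LoRA parameter count: rank-r adapters on a d_out × d_in linear layer add r × (d_in + d_out) parameters (an A factor of shape r × d_in and a B factor of shape d_out × r). A sketch with illustrative dimensions, not the real Qwen 3.5 2B shapes:

```python
def lora_params(d_in, d_out, r):
    """Parameters added by a rank-r LoRA adapter on a d_out x d_in linear layer."""
    return r * (d_in + d_out)

# Illustrative hidden sizes only -- not the actual model config.
hidden, ffn, r = 2048, 5632, 16
attn = 4 * lora_params(hidden, hidden, r)                 # q,k,v,o_proj
mlp = 2 * lora_params(hidden, ffn, r) + lora_params(ffn, hidden, r)  # gate,up,down
per_block = attn + mlp
print(per_block)
```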
3. Encoder model (mmBERT)¶
The encoder approach uses mmBERT-base (307M parameters) with a linear classification head. It classifies each token as relevant (1) or irrelevant (0), then aggregates per line via max-pooling.
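The token-to-line aggregation can be sketched as follows (hypothetical helpers, assuming per-token relevance probabilities and a parallel token-to-line index map):

```python
def line_scores(token_probs, token_line_ids):
    """Max-pool per-token relevance probabilities into per-line scores."""
    scores = {}
    for p, line_id in zip(token_probs, token_line_ids):
        scores[line_id] = max(scores.get(line_id, 0.0), p)
    return scores

def relevant_lines(token_probs, token_line_ids, threshold=0.5):
    """Line indices whose max token score clears the threshold."""
    scores = line_scores(token_probs, token_line_ids)
    return sorted(l for l, s in scores.items() if s >= threshold)
```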
Known-good environment¶
Prepare data¶
The encoder uses a different input format than the generative model. Convert the ChatML training data:
This produces data/encoder_{train,dev,test}.jsonl with {task, tool_output, relevant_lines, tool_type}.
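A record in this format looks roughly like the following (field values invented for illustration; zero-based line indices assumed):

```python
record = {
    "task": "Fix the bug",
    "tool_output": "line 1\nline 2\nline 3",
    "relevant_lines": [0, 2],   # assumed zero-based indices into tool_output's lines
    "tool_type": "shell",       # illustrative value
}
```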
Train¶
python -m squeez.encoder.train \
--train-file data/encoder_train.jsonl \
--eval-file data/encoder_dev.jsonl \
--base-model jhu-clsp/mmBERT-base \
--output-dir output/squeez_encoder
Configuration¶
Encoder hyperparameters in configs/default.yaml:
encoder_base_model: "jhu-clsp/mmBERT-base"
encoder_max_length: 8192
encoder_batch_size: 16
encoder_learning_rate: 2.0e-5
encoder_num_epochs: 5
encoder_warmup_ratio: 0.1
Override via CLI:
python -m squeez.encoder.train \
--train-file data/encoder_train.jsonl \
--eval-file data/encoder_dev.jsonl \
--batch-size 8 \
--learning-rate 1e-5 \
--num-epochs 10
Input format¶
- [LINE_SEP] is a special token added to the tokenizer vocabulary
- Task tokens are masked (label = -100) during training
- Line tokens receive binary labels: 0 (irrelevant) or 1 (relevant)
- Long samples are split into overlapping sliding windows so every line is supervised
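Putting those rules together, label construction for one window might look like this hypothetical sketch (tokenizer and [LINE_SEP] handling simplified to per-line token counts):

```python
IGNORE = -100  # ignored by the cross-entropy loss

def build_labels(n_task_tokens, line_token_counts, line_relevance):
    """Mask task tokens with -100; give every token of a line its binary label."""
    labels = [IGNORE] * n_task_tokens
    for count, relevant in zip(line_token_counts, line_relevance):
        labels.extend([1 if relevant else 0] * count)
    return labels
```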
Sliding window¶
Both training and inference use sliding windows for tool outputs that exceed the 8K context:
- Lines are split into windows that fit within the token budget
- Windows overlap by 2 lines
- Per-line scores are aggregated via max across windows (if any window says relevant, it's relevant)
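The steps above can be sketched as a greedy packer plus a max-aggregator (hypothetical code, taking per-line token counts as input; the project's actual windowing may differ in details):

```python
def split_windows(line_token_counts, budget, overlap=2):
    """Greedily pack lines into (start, end) windows under a token budget,
    with consecutive windows overlapping by `overlap` lines."""
    windows, start, n = [], 0, len(line_token_counts)
    while start < n:
        used, end = 0, start
        while end < n and used + line_token_counts[end] <= budget:
            used += line_token_counts[end]
            end += 1
        end = max(end, start + 1)  # always make progress, even on oversized lines
        windows.append((start, end))
        if end >= n:
            break
        start = max(end - overlap, start + 1)
    return windows

def aggregate(window_scores, n_lines):
    """Max across windows: relevant in any window => relevant overall."""
    best = [0.0] * n_lines
    for (start, _end), scores in window_scores:
        for i, s in enumerate(scores):
            best[start + i] = max(best[start + i], s)
    return best
```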
4. Evaluate¶
Both models produce the same metrics format for direct comparison:
# Generative model
squeez eval \
--extractor-model output/squeez_qwen \
--eval-file data/test.jsonl
# Encoder model
python -m squeez.encoder.evaluate \
--model-path output/squeez_encoder \
--eval-file data/encoder_test.jsonl
Metrics computed:
- Line-level F1 — precision/recall against ground truth relevant lines
- ROUGE-L — token-level overlap with reference output
- Compression ratio — how much output was filtered
- Empty accuracy — correctly predicting empty vs non-empty
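For reference, the first and third metrics reduce to a few lines (hypothetical helpers; the project's evaluation code may handle edge cases differently):

```python
def line_f1(predicted, gold):
    """F1 over sets of relevant line indices."""
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0  # both empty: count as a perfect match
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def compression_ratio(n_kept, n_total):
    """Fraction of the tool output that was filtered away."""
    return 1 - n_kept / n_total if n_total else 0.0
```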
Results are saved to eval_results.json / eval_results_encoder.json.
5. Use the trained model¶
Both model types work through the same API:
# Either model type — auto-detected
export SQUEEZ_LOCAL_MODEL=./output/squeez_encoder
cat file.py | squeez "Fix the bug"
Or in Python:
from squeez.inference.extractor import ToolOutputExtractor
# Generative
extractor = ToolOutputExtractor(model_path="./output/squeez_qwen")
# Encoder (auto-detected from config.json)
extractor = ToolOutputExtractor(model_path="./output/squeez_encoder")
result = extractor.extract(task="Fix the bug", tool_output=raw)