---
library_name: peft
license: cc-by-nc-4.0
language:
- en
tags:
- peft
- safetensors
- lora
- complexity-classification
- llm-routing
- query-difficulty
- brick
- text-classification
- semantic-router
- inference-optimization
- cost-reduction
- reasoning-budget
datasets:
- regolo/brick-complexity-extractor
base_model: Qwen/Qwen3.5-0.8B
pipeline_tag: text-classification
model-index:
- name: brick-complexity-extractor
  results:
  - task:
      type: text-classification
      name: Query Complexity Classification
    dataset:
      name: brick-complexity-extractor
      type: regolo/brick-complexity-extractor
      split: test
    metrics:
    - type: accuracy
      value: 0.89
      name: Accuracy (3-class)
    - type: f1
      value: 0.87
      name: Weighted F1
---
## Model Details
| Property | Value |
|---|---|
| **Model type** | LoRA adapter (PEFT) |
| **Base model** | [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) |
| **Trainable parameters** | ~2M (LoRA rank 16, alpha 32) |
| **Total parameters** | ~875M (base + adapter) |
| **Output classes** | 3 (`easy`, `medium`, `hard`) |
| **Language** | English |
| **License** | CC BY-NC 4.0 |
| **Developed by** | [Regolo.ai](https://regolo.ai) (Seeweb S.r.l.) |
| **Release date** | April 2026 |
## Architecture
The adapter applies LoRA to the query and value projection matrices (`q_proj`, `v_proj`) across all attention layers of Qwen3.5-0.8B, with a classification head on top of the last hidden state.
```
Qwen3.5-0.8B (frozen)
├── Attention Layers × 24
│   ├── q_proj ← LoRA(r=16, α=32)
│   └── v_proj ← LoRA(r=16, α=32)
└── Last Hidden State
    └── Classification Head (3 classes)
```
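As a rough sanity check on the ~2M trainable-parameter figure, the sketch below counts the LoRA weights added to `q_proj` and `v_proj`. The layer count and rank come from this card; the hidden size and projection widths are assumptions chosen for illustration and should be replaced with the values from the base model's config.

```python
# Rough LoRA parameter count for q_proj and v_proj.
# NOTE: HIDDEN, Q_OUT and V_OUT are assumed values for illustration only;
# read the real dimensions from the base model's config.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds A (rank x d_in) and B (d_out x rank) to each adapted matrix."""
    return rank * (d_in + d_out)

NUM_LAYERS = 24   # from the architecture diagram
RANK = 16         # LoRA rank from the model details
ALPHA = 32        # LoRA alpha from the model details
HIDDEN = 1024     # assumed hidden size
Q_OUT = 2048      # assumed q_proj output width
V_OUT = 1024      # assumed v_proj output width

per_layer = lora_params(HIDDEN, Q_OUT, RANK) + lora_params(HIDDEN, V_OUT, RANK)
total = NUM_LAYERS * per_layer
scaling = ALPHA / RANK  # LoRA scales the low-rank update by alpha / r

print(f"adapter parameters: {total:,}")
print(f"LoRA scaling factor: {scaling}")
```

Under these assumed dimensions the count comes out close to 2M. The classification head adds a further `HIDDEN × 3` weights, which is negligible by comparison.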
## Label Definitions
| Label | Reasoning Steps | Description | Example |
|---|---|---|---|
| **easy** | 1–2 | Surface knowledge, factual recall, simple lookups | "What is the capital of Italy?" |
| **medium** | 3–5 | Domain familiarity, multi-step reasoning, comparison | "Compare REST and GraphQL for a mobile app backend" |
| **hard** | 6+ | Deep expertise, multi-constraint optimization, creative synthesis | "Design a distributed cache eviction policy that minimizes tail latency under bursty traffic" |
Labels were generated by **Qwen3.5-122B** acting as an LLM judge on 76,831 diverse user prompts. See the [dataset card](https://huggingface.co/datasets/regolo/brick-complexity-extractor) for full labeling methodology.
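The step thresholds in the table can be written as a small helper. This is only a sketch of the thresholds as listed above; `label_for_steps` is a hypothetical name, and the judge model's actual prompt and decision procedure are described in the dataset card, not here.

```python
def label_for_steps(reasoning_steps: int) -> str:
    """Map an estimated number of reasoning steps to a complexity label.

    Thresholds follow the label table: 1-2 easy, 3-5 medium, 6+ hard.
    """
    if reasoning_steps < 1:
        raise ValueError("reasoning_steps must be at least 1")
    if reasoning_steps <= 2:
        return "easy"
    if reasoning_steps <= 5:
        return "medium"
    return "hard"


for steps in (1, 4, 8):
    print(steps, label_for_steps(steps))
```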
## Performance
### Classification Metrics (Test Set — 3,841 samples)
| Metric | Value |
|---|---|
| **Accuracy** | 89.2% |
| **Weighted F1** | 87.4% |
| **Macro F1** | 85.1% |
### Per-Class Performance
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| easy | 0.92 | 0.94 | 0.93 | 1,057 |
| medium | 0.88 | 0.90 | 0.89 | 1,660 |
| hard | 0.84 | 0.79 | 0.81 | 519 |
### Latency
| Setup | Inference Time (p50) | Inference Time (p99) |
|---|---|---|
| NVIDIA A100 (bf16) | 8ms | 14ms |
| NVIDIA L4 (fp16) | 12ms | 22ms |
| CPU (Intel Xeon, fp32) | 45ms | 78ms |
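To reproduce p50/p99 figures on your own hardware, a minimal timing harness along these lines may help. It is a sketch, not the benchmark used for the table: `measure_latency_ms` is a hypothetical helper, and the lambda at the bottom is a stand-in workload that you would replace with a call to the classifier.

```python
import statistics
import time
from typing import Callable, Tuple


def measure_latency_ms(
    fn: Callable[[], object], runs: int = 200, warmup: int = 20
) -> Tuple[float, float]:
    """Return (p50, p99) wall-clock latency of fn in milliseconds."""
    for _ in range(warmup):  # warm caches before timing
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 98 is p99
    cuts = statistics.quantiles(samples, n=100)
    return cuts[49], cuts[98]


# Stand-in workload; swap in the model call when benchmarking for real.
p50, p99 = measure_latency_ms(lambda: sum(range(10_000)))
print(f"p50={p50:.3f} ms  p99={p99:.3f} ms")
```

When timing on a GPU, call `torch.cuda.synchronize()` inside the timed function so that asynchronous kernels are included in the measurement.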
## Quick Start
### Installation
```bash
pip install peft transformers torch
```
### Inference
```python
import torch
from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the base model with a 3-class head, then attach the LoRA adapter
base_model_id = "Qwen/Qwen3.5-0.8B"
adapter_id = "regolo/brick-complexity-extractor"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    base_model_id, num_labels=3
)
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

# Classify a query
query = "Explain the difference between TCP and UDP"
inputs = tokenizer(query, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Logit index order: 0 = easy, 1 = medium, 2 = hard
labels = ["easy", "medium", "hard"]
predicted = labels[outputs.logits.argmax(dim=-1).item()]
print(f"Complexity: {predicted}")
# Output: Complexity: medium
```
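The `argmax` above discards how confident the prediction is. If a router needs a confidence value, one option is to apply a softmax to the three logits. The sketch below uses plain Python and made-up logits; `classify_with_confidence` is a hypothetical helper, not part of the adapter's API.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ["easy", "medium", "hard"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_confidence(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the top label and its softmax probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up logits for illustration, not real model output.
label, confidence = classify_with_confidence([0.2, 2.1, -0.5])
print(f"{label} ({confidence:.2f})")
```

With the model loaded as in the inference snippet, the input would be `outputs.logits[0].tolist()`.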
### Using with vLLM (recommended for production)
```python
# The adapter can be loaded as a LoRA module in vLLM
# See Brick SR1 documentation for full integration guide
# https://github.com/regolo-ai/brick-SR1
```
## GGUF Quantized Models
Pre-built GGUF files are available for inference with llama.cpp, Ollama, LM Studio, vLLM, and other GGUF-compatible runtimes. Each quantization is published as a separate model:
| Model | Quant | Size | BPW | Notes |
|---|---|---|---|---|
| [brick-complexity-extractor-BF16-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-BF16-GGUF) | BF16 | 1.5 GB | 16.0 | Full precision |
| [brick-complexity-extractor-Q8_0-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-Q8_0-GGUF) | Q8_0 | 775 MB | 8.0 | Recommended |
| [brick-complexity-extractor-Q4_K_M-GGUF](https://huggingface.co/regolo/brick-complexity-extractor-Q4_K_M-GGUF) | Q4_K_M | 494 MB | 5.5 | Best size/quality ratio |
See the [brick-complexity-extractor collection](https://huggingface.co/collections/regolo/brick-complexity-extractor-69dcc2dec2fe3b54a70b3415) for all available formats.
## Integration with Brick Semantic Router
Brick Complexity Extractor is designed to work as a signal within the **Brick Semantic Router** pipeline. In a typical deployment:
1. **Query arrives** at the Brick router endpoint
2. **Parallel signal extraction** runs complexity classification alongside keyword matching, domain detection, and reasoning estimation
3. **Routing decision** combines all signals to select the optimal model from the pool
4. **Query forwarded** to the chosen model (e.g., Qwen 7B for easy, Llama 70B for medium, Claude for hard)
```yaml
# Brick router configuration example (brick-config.yaml)
signals:
  complexity:
    model: regolo/brick-complexity-extractor
    weight: 0.35
  domain:
    model: regolo/brick-domain-classifier  # coming soon
    weight: 0.25
  keyword:
    type: rule-based
    weight: 0.20
  reasoning:
    type: heuristic
    weight: 0.20

model_pools:
  easy:
    - qwen3.5-7b
    - llama-3.3-8b
  medium:
    - qwen3.5-32b
    - llama-3.3-70b
  hard:
    - claude-sonnet-4-20250514
    - deepseek-r1
```
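One way to picture the routing decision is as a weighted vote over tiers. The sketch below is an assumption about how signals could be combined, not a description of Brick's actual algorithm: it reuses the weights and pools from the example configuration, while the `route` function, the per-signal score format, and the choice of the first model in a pool are all hypothetical.

```python
from typing import Dict

# Weights and pools copied from the example configuration above.
WEIGHTS = {"complexity": 0.35, "domain": 0.25, "keyword": 0.20, "reasoning": 0.20}
MODEL_POOLS = {
    "easy": ["qwen3.5-7b", "llama-3.3-8b"],
    "medium": ["qwen3.5-32b", "llama-3.3-70b"],
    "hard": ["claude-sonnet-4-20250514", "deepseek-r1"],
}
TIERS = ("easy", "medium", "hard")


def route(signal_scores: Dict[str, Dict[str, float]]) -> str:
    """Combine per-signal tier scores by weight and pick a model.

    Assumes every signal emits a score per tier in [0, 1]; this format is
    hypothetical and chosen only to illustrate weighted combination.
    """
    combined = {tier: 0.0 for tier in TIERS}
    for name, scores in signal_scores.items():
        for tier in TIERS:
            combined[tier] += WEIGHTS[name] * scores.get(tier, 0.0)
    best_tier = max(TIERS, key=lambda tier: combined[tier])
    return MODEL_POOLS[best_tier][0]  # simplest policy: first model in the pool


# Made-up signal outputs for a query that most signals consider hard.
chosen = route({
    "complexity": {"easy": 0.1, "medium": 0.2, "hard": 0.7},
    "domain": {"medium": 1.0},
    "keyword": {"hard": 1.0},
    "reasoning": {"hard": 1.0},
})
print(chosen)
```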
## Intended Uses
### ✅ Primary Use Cases
- **LLM routing**: Classify query complexity to route to the optimal model tier, reducing inference cost by 30–60% compared to always-frontier routing
- **Reasoning budget allocation**: Decide how many reasoning tokens to allocate before inference begins
- **Traffic shaping**: Balance GPU load across model pools based on real-time complexity distribution
- **Cost monitoring**: Track complexity distribution over time to optimize fleet sizing
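The cost effect of tiered routing depends on the traffic mix and on the price gap between tiers. The arithmetic can be sketched as follows; the traffic mix and relative prices are hypothetical placeholders, so substitute your own measurements before drawing conclusions.

```python
from typing import Dict


def expected_cost(mix: Dict[str, float], price: Dict[str, float]) -> float:
    """Average cost per query given a class mix and a per-tier price."""
    return sum(share * price[tier] for tier, share in mix.items())


# Hypothetical values for illustration only.
traffic_mix = {"easy": 0.40, "medium": 0.45, "hard": 0.15}   # share of queries
relative_price = {"easy": 2.0, "medium": 5.0, "hard": 10.0}  # cost units per query

routed = expected_cost(traffic_mix, relative_price)
always_frontier = relative_price["hard"]  # every query sent to the top tier
savings = 1.0 - routed / always_frontier

print(f"routed cost per query: {routed:.2f}")
print(f"saving vs always-frontier: {savings:.1%}")
```

This ignores the classifier's own inference cost and any quality loss from misrouted queries, both of which reduce the net benefit.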
### ⚠️ Out-of-Scope Uses
- **Content moderation or safety filtering**: This model classifies cognitive difficulty, not content safety
- **Non-English queries**: The model was trained on English data only; accuracy degrades significantly on other languages
- **Direct use as a chatbot or generative model**: This is a classification adapter, not a generative model
## Limitations
- **Label noise**: The training labels were generated by Qwen3.5-122B, not human annotators. While LLM-as-judge achieves high inter-annotator agreement on complexity, systematic biases may exist (e.g., overweighting mathematical content as "hard")
- **Class imbalance**: The "hard" class represents only 13.5% of training data, which may lead to lower recall on genuinely hard queries
- **Domain coverage**: The training set covers general-purpose user prompts. Specialized domains (medical, legal, financial) may exhibit different complexity distributions
- **English only**: No multilingual support in this version
- **Adversarial robustness**: The model has not been tested against adversarial prompt manipulation designed to fool the complexity classifier
## Training Details
| Hyperparameter | Value |
|---|---|
| **Base model** | Qwen/Qwen3.5-0.8B |
| **LoRA rank (r)** | 16 |
| **LoRA alpha (α)** | 32 |
| **LoRA dropout** | 0.05 |
| **Target modules** | q_proj, v_proj |
| **Learning rate** | 2e-4 |
| **Batch size** | 32 |
| **Epochs** | 3 |
| **Optimizer** | AdamW |
| **Scheduler** | Cosine with warmup (5% steps) |
| **Max sequence length** | 512 tokens |
| **Training samples** | 65,307 |
| **Validation samples** | 7,683 |
| **Test samples** | 3,841 |
| **Training hardware** | 1× NVIDIA A100 80GB |
| **Training time** | ~2 hours |
| **Framework** | PyTorch + HuggingFace PEFT |
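The table implies a concrete step budget. The sketch below derives it and evaluates a cosine schedule with warmup, assuming no gradient accumulation, a kept final partial batch, linear warmup, and decay to zero; none of these details are stated in the table, so treat them as assumptions.

```python
import math

# Values from the hyperparameter table.
TRAIN_SAMPLES = 65_307
BATCH_SIZE = 32
EPOCHS = 3
PEAK_LR = 2e-4
WARMUP_FRACTION = 0.05

# Assumes the last partial batch is kept and there is no gradient accumulation.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(total_steps * WARMUP_FRACTION)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(steps_per_epoch, total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```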
## Environmental Impact
Regolo.ai is committed to sustainable AI. This model was trained on GPU infrastructure powered by [Seeweb](https://www.seeweb.it/)'s data centers in Italy, which run on certified renewable energy.
| Metric | Value |
|---|---|
| **Hardware** | 1× NVIDIA A100 80GB |
| **Training duration** | ~2 hours |
| **Estimated CO₂** | < 0.5 kg CO₂eq |
| **Energy source** | Renewable (certified) |
| **Location** | Italy (EU) |
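An estimate of this kind is typically energy multiplied by grid carbon intensity. The sketch below shows the arithmetic under assumed inputs: the GPU power draw, the data-center overhead (PUE), and the carbon intensity are illustrative placeholders, not figures published by Regolo.ai or Seeweb. With certified renewable supply, the effective intensity would be lower than the value used here.

```python
# Illustrative CO2 estimate: energy (kWh) x carbon intensity (kg CO2eq / kWh).
# All three inputs below are assumptions for the sake of the arithmetic.
GPU_POWER_KW = 0.4        # assumed average draw of one A100 80GB
TRAINING_HOURS = 2.0      # from the table above (approximate)
PUE = 1.2                 # assumed data-center power usage effectiveness
CARBON_INTENSITY = 0.35   # assumed grid intensity, kg CO2eq per kWh

energy_kwh = GPU_POWER_KW * TRAINING_HOURS * PUE
co2_kg = energy_kwh * CARBON_INTENSITY

print(f"energy: {energy_kwh:.2f} kWh")
print(f"estimated emissions: {co2_kg:.2f} kg CO2eq")
```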
## Citation
```bibtex
@misc{regolo2026brick-complexity,
title = {Brick Complexity Extractor: A LoRA Adapter for Query Complexity Classification in LLM Routing},
author = {Regolo.ai Team},
year = {2026},
url = {https://huggingface.co/regolo/brick-complexity-extractor}
}
```
## About Regolo.ai
[Regolo.ai](https://regolo.ai) is the EU-sovereign LLM inference platform built on [Seeweb](https://www.seeweb.it/) infrastructure. We provide zero-data-retention, GDPR-native AI inference for enterprises that need privacy, compliance, and performance, all from European data centers powered by renewable energy.
**Brick** is our open-source semantic routing system that intelligently distributes queries across model pools, optimizing for cost, latency, and quality.