Qwen3-Embedding-8B EU AI Act NL

Fine-tuned Qwen3-Embedding-8B with LoRA for Dutch/English retrieval on EU AI Act documentation. Supports Matryoshka embeddings (4096, 1024, 768, 512, 256, 128 dimensions) for flexible speed/quality tradeoffs.

Model Details

Base model: Qwen/Qwen3-Embedding-8B
Architecture: Decoder-based (Qwen3), last-token pooling, left padding
Training approach: Two-stage fine-tuning with LoRA (r=16, alpha=32)
- Stage 1: CachedMNRL + Matryoshka on synthetic query-chunk pairs
- Stage 2: CachedMNRL + Matryoshka with hard negatives mined from Stage 1 model
Dataset: 1,944 synthetic queries generated from EU AI Act chunks (Dutch/English)
Hardware: NVIDIA RTX 5090 (32GB VRAM, Blackwell)
Precision: bf16 + flash_attention_2

Performance

Evaluated on 340 held-out queries across 85 chunks. All metrics measured with cosine similarity.

NDCG@10 across Matryoshka dimensions

Dim	Zero-shot	Stage 1	Stage 2	Best
4096	0.8962	0.9682	0.9675	0.9682 (S1)
1024	0.8836	0.9625	0.9625	0.9625 (tie)
768	0.8825	0.9607	0.9629	0.9629 (S2)
512	0.8774	0.9577	0.9587	0.9587 (S2)
256	0.8704	0.9524	0.9535	0.9535 (S2)
128	0.8369	0.9238	0.9253	0.9253 (S2)

Fine-tuning on ~2,000 synthetic pairs improves retrieval quality by +7–9 NDCG@10 points across all Matryoshka dimensions.

Stage 2 analysis

Stage 2 (hard negative mining) produced marginal improvements at lower dimensions but a slight regression at the primary dim=4096 metric. With 127 in-batch negatives from GradCache already providing a strong contrastive signal, explicit hard negatives offer diminishing returns — consistent with findings on the smaller 0.6B and 4B models. The uploaded model uses the Stage 1 checkpoint (best at the primary dim=4096 metric).

Usage

Installation

pip install sentence-transformers>=2.7.0 transformers>=4.51.0

Basic usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("danielnoumon/qwen3-embedding-8b-ai-act-nl")

# Qwen3 uses instruct prompts for queries, no prefix for documents
queries = model.encode(
    ["What are the obligations for high-risk AI systems?"],
    prompt="Instruct: Given a question about EU AI regulation, retrieve the most relevant passage\nQuery:",
)
passages = model.encode([
    "High-risk AI systems must comply with requirements in Chapter III...",
    "The AI Act defines prohibited practices in Article 5...",
])

# Compute similarity
from sentence_transformers.util import cos_sim
scores = cos_sim(queries, passages)

Matryoshka embeddings (dimension truncation)

# Encode with full 4096 dimensions
embeddings_4096 = model.encode(queries)

# Truncate to 256 dimensions for faster search
embeddings_256 = embeddings_4096[:, :256]

# Or specify dimension at encoding time
model.truncate_dim = 256
embeddings_256 = model.encode(queries)

Speed vs quality tradeoff:

dim=4096: Full quality (NDCG@10 = 0.968)
dim=1024: 75% fewer dimensions, 99.4% of full quality (NDCG@10 = 0.963)
dim=256: ~94% fewer dimensions, 98.4% of full quality (NDCG@10 = 0.952)
dim=128: ~97% fewer dimensions, 95.4% of full quality (NDCG@10 = 0.924)

Important: Use instruct prompts

Qwen3 uses instruction-based prompting. Queries need the instruct prefix, documents do not:

# Queries: use instruct prompt
query_emb = model.encode(
    ["your question here"],
    prompt="Instruct: Given a question about EU AI regulation, retrieve the most relevant passage\nQuery:",
)

# Documents: no prefix needed
doc_emb = model.encode(["your document here"])

Training Details

LoRA Configuration

Parameter	Value
Rank (r)	16
Alpha	32
Dropout	0.05
Target modules	q_proj, k_proj, v_proj, o_proj
Trainable params	15.3M / 7583M (0.20%)

Stage 1: CachedMNRL + Matryoshka

Loss: MatryoshkaLoss(CachedMultipleNegativesRankingLoss)
Matryoshka dims: [4096, 1024, 768, 512, 256, 128]
Batch size: 128 (GradCache), mini-batch 1
Learning rate: 1e-4 (typical for LoRA)
Epochs: 3
Negatives: 127 in-batch negatives per query (via GradCache)
Precision: bf16 + flash_attention_2

Stage 2: Hard negatives

Starting point: Stage 1 LoRA checkpoint (merged into base)
Hard negative mining: Top-1 most similar wrong chunk per query (using Stage 1 model)
Learning rate: 1e-5 (10× lower to prevent catastrophic forgetting)
Epochs: 2
Batch size: 128 (GradCache), mini-batch 1
Negatives: 1 explicit hard negative + 127 in-batch negatives

Dataset

Dataset: danielnoumon/eu-ai-act-nl-queries
Train: 1,944 synthetic query-chunk pairs
Eval: 340 queries × 85 chunks
Split strategy: Chunk-level (no chunk appears in both train and eval)
Query generation: Azure OpenAI GPT-4o-mini with structured prompts

Hardware Notes

bf16 works on Blackwell (RTX 5090) with Qwen3
Qwen3's RMSNorm upcasts to fp32 internally, preventing gradient instability
flash_attention_2 used for O(N) attention memory
CachedMNRL (GradCache) essential: 8B model fills 16GB of 32GB VRAM with base weights alone
LoRA keeps trainable params at 0.20% — full fine-tuning impossible on 32GB
mini_batch_size=1 required due to extreme VRAM pressure

Limitations

Domain-specific: Fine-tuned on EU AI Act documentation. Performance on other domains may vary.
Language: Optimized for Dutch and English. Other languages supported by the base model may work but are not evaluated.
Chunk size: Trained on chunks up to 512 tokens. Very long documents should be chunked.
VRAM: Requires ~16GB VRAM for inference in bf16. Use quantization or CPU offloading for smaller GPUs.

License

Apache 2.0

Citation

@misc{qwen3embedding,
    title={Qwen3-Embedding: Advancing Text Embeddings with Qwen3},
    author={Qwen Team},
    year={2025},
    url={https://huggingface.co/Qwen/Qwen3-Embedding-8B}
}

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

@misc{hu2022lora,
    title={LoRA: Low-Rank Adaptation of Large Language Models},
    author={Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen},
    year={2022},
    eprint={2106.09685},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Downloads last month: 31

Safetensors

Model size

8B params

Tensor type

BF16

Model tree for danielnoumon/qwen3-embedding-8b-ai-act-nl

Base model

Qwen/Qwen3-8B-Base

Finetuned

Qwen/Qwen3-Embedding-8B

Adapter

(5)

this model

Dataset used to train danielnoumon/qwen3-embedding-8b-ai-act-nl

Papers for danielnoumon/qwen3-embedding-8b-ai-act-nl