Qwen3-Embedding-8B EU AI Act NL

Fine-tuned Qwen3-Embedding-8B with LoRA for Dutch/English retrieval on EU AI Act documentation. Supports Matryoshka embeddings (4096, 1024, 768, 512, 256, 128 dimensions) for flexible speed/quality tradeoffs.

Model Details

  • Base model: Qwen/Qwen3-Embedding-8B
  • Architecture: Decoder-based (Qwen3), last-token pooling, left padding
  • Training approach: Two-stage fine-tuning with LoRA (r=16, alpha=32)
    • Stage 1: CachedMNRL + Matryoshka on synthetic query-chunk pairs
    • Stage 2: CachedMNRL + Matryoshka with hard negatives mined from Stage 1 model
  • Dataset: 1,944 synthetic queries generated from EU AI Act chunks (Dutch/English)
  • Hardware: NVIDIA RTX 5090 (32GB VRAM, Blackwell)
  • Precision: bf16 + flash_attention_2

Performance

Evaluated on 340 held-out queries across 85 chunks. All metrics measured with cosine similarity.
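Because each evaluation query is paired with exactly one relevant chunk (an assumption based on the query-chunk pair setup described below), NDCG@10 reduces to a log-discounted hit score. A minimal sketch of the metric under that assumption:

```python
import math

def ndcg_at_10(ranked_chunk_ids, relevant_chunk_id):
    """NDCG@10 for a query with exactly one relevant chunk.

    With a single relevant document, the ideal DCG is 1/log2(2) = 1,
    so NDCG@10 is just the discounted gain at the chunk's rank.
    """
    for rank, chunk_id in enumerate(ranked_chunk_ids[:10], start=1):
        if chunk_id == relevant_chunk_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0  # relevant chunk not in the top 10

# Relevant chunk at rank 1 -> 1.0; at rank 2 -> 1/log2(3) ~ 0.631
print(ndcg_at_10(["c7", "c2", "c9"], "c7"))
print(round(ndcg_at_10(["c2", "c7", "c9"], "c7"), 3))
```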

NDCG@10 across Matryoshka dimensions

Dim    Zero-shot  Stage 1  Stage 2  Best
4096   0.8962     0.9682   0.9675   0.9682 (S1)
1024   0.8836     0.9625   0.9625   0.9625 (tie)
768    0.8825     0.9607   0.9629   0.9629 (S2)
512    0.8774     0.9577   0.9587   0.9587 (S2)
256    0.8704     0.9524   0.9535   0.9535 (S2)
128    0.8369     0.9238   0.9253   0.9253 (S2)

Fine-tuning on ~2,000 synthetic pairs improves retrieval quality by +7–9 NDCG@10 points across all Matryoshka dimensions.

Stage 2 analysis

Stage 2 (hard negative mining) produced marginal improvements at the lower dimensions but a slight regression at dim=4096. With 127 in-batch negatives from GradCache already providing a strong contrastive signal, explicit hard negatives offer diminishing returns, consistent with findings on the smaller 0.6B and 4B models. The uploaded model therefore uses the Stage 1 checkpoint, which scores best at the primary dim=4096 metric.

Usage

Installation

pip install "sentence-transformers>=2.7.0" "transformers>=4.51.0"

Basic usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("danielnoumon/qwen3-embedding-8b-ai-act-nl")

# Qwen3 uses instruct prompts for queries, no prefix for documents
queries = model.encode(
    ["What are the obligations for high-risk AI systems?"],
    prompt="Instruct: Given a question about EU AI regulation, retrieve the most relevant passage\nQuery:",
)
passages = model.encode([
    "High-risk AI systems must comply with requirements in Chapter III...",
    "The AI Act defines prohibited practices in Article 5...",
])

# Compute similarity
from sentence_transformers.util import cos_sim
scores = cos_sim(queries, passages)

Matryoshka embeddings (dimension truncation)

# Encode with full 4096 dimensions
texts = ["What are the obligations for high-risk AI systems?"]
embeddings_4096 = model.encode(texts)

# Truncate to 256 dimensions for faster search
embeddings_256 = embeddings_4096[:, :256]

# Or specify the dimension at encoding time
model.truncate_dim = 256
embeddings_256 = model.encode(texts)

Speed vs quality tradeoff:

  • dim=4096: Full quality (NDCG@10 = 0.968)
  • dim=1024: 75% fewer dimensions, 99.4% of full quality (NDCG@10 = 0.963)
  • dim=256: ~94% fewer dimensions, 98.4% of full quality (NDCG@10 = 0.952)
  • dim=128: ~97% fewer dimensions, 95.4% of full quality (NDCG@10 = 0.924)
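One caveat when truncating manually: a prefix of a unit-norm embedding is no longer unit-length. `cos_sim` normalizes internally, but raw dot-product indexes (as in many ANN libraries) need an explicit re-normalization step. A minimal numpy sketch, using random vectors in place of real embeddings:

```python
import numpy as np

# Unit-norm "full" embeddings stand in for normalized model outputs.
rng = np.random.default_rng(0)
full = rng.normal(size=(3, 4096)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

# Truncate to the first 256 dims, then re-normalize: the prefix alone
# has norm < 1, so a plain dot product would understate similarity.
trunc = full[:, :256]
trunc /= np.linalg.norm(trunc, axis=1, keepdims=True)

sims = trunc @ trunc.T  # cosine similarity in the 256-dim space
```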

Important: Use instruct prompts

Qwen3 uses instruction-based prompting. Queries need the instruct prefix; documents do not:

# Queries: use instruct prompt
query_emb = model.encode(
    ["your question here"],
    prompt="Instruct: Given a question about EU AI regulation, retrieve the most relevant passage\nQuery:",
)

# Documents: no prefix needed
doc_emb = model.encode(["your document here"])

Training Details

LoRA Configuration

Parameter Value
Rank (r) 16
Alpha 32
Dropout 0.05
Target modules q_proj, k_proj, v_proj, o_proj
Trainable params 15.3M / 7583M (0.20%)
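The 15.3M trainable-parameter figure can be sanity-checked from the LoRA shapes. A back-of-the-envelope sketch, assuming the published Qwen3-8B config (36 layers, hidden size 4096, 32 query heads and 8 KV heads with head_dim 128; these shapes are assumptions, not stated in this card):

```python
# Each LoRA adapter on a (d_in -> d_out) linear layer adds
# r * (d_in + d_out) parameters: A is (r x d_in), B is (d_out x r).
r = 16
hidden = 4096
q_out = o_in = 32 * 128   # 4096 (query / output projections)
kv_out = 8 * 128          # 1024 (GQA key / value projections)

per_layer = (
    r * (hidden + q_out)     # q_proj
    + r * (hidden + kv_out)  # k_proj
    + r * (hidden + kv_out)  # v_proj
    + r * (o_in + hidden)    # o_proj
)
total = 36 * per_layer
print(f"{total:,}")  # 15,335,424 ~ 15.3M
```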

Stage 1: CachedMNRL + Matryoshka

  • Loss: MatryoshkaLoss(CachedMultipleNegativesRankingLoss)
  • Matryoshka dims: [4096, 1024, 768, 512, 256, 128]
  • Batch size: 128 (GradCache), mini-batch 1
  • Learning rate: 1e-4 (typical for LoRA)
  • Epochs: 3
  • Negatives: 127 in-batch negatives per query (via GradCache)
  • Precision: bf16 + flash_attention_2
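The combined objective can be sketched in numpy: (Cached)MNRL is in-batch InfoNCE over a scaled cosine-similarity matrix, where each query's own chunk is the positive and the other 127 chunks in the batch are negatives, and MatryoshkaLoss averages that loss over truncated prefixes of the same embeddings. The scale factor of 20 is the sentence-transformers default for MNRL, assumed here:

```python
import numpy as np

def mnrl_loss(q, p, scale=20.0):
    """In-batch InfoNCE: positives on the diagonal, all other chunks
    in the batch act as negatives."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    s = scale * (q @ p.T)                        # (B, B) scaled cosine scores
    m = s.max(axis=1, keepdims=True)             # stable log-sum-exp
    lse = (m + np.log(np.exp(s - m).sum(axis=1, keepdims=True))).squeeze(1)
    return float(np.mean(lse - np.diag(s)))      # cross-entropy, diagonal targets

rng = np.random.default_rng(0)
B, D = 128, 512
q = rng.normal(size=(B, D))
p = q + 0.5 * rng.normal(size=(B, D))            # positives correlated with queries

# MatryoshkaLoss: average the same loss over truncated prefixes
dims = [512, 256, 128]
matryoshka = sum(mnrl_loss(q[:, :d], p[:, :d]) for d in dims) / len(dims)
```

GradCache changes how gradients for this loss are computed (chunked, so the full 128-way batch fits in VRAM), not the loss itself.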

Stage 2: Hard negatives

  • Starting point: Stage 1 LoRA checkpoint (merged into base)
  • Hard negative mining: Top-1 most similar wrong chunk per query (using Stage 1 model)
  • Learning rate: 1e-5 (10× lower to prevent catastrophic forgetting)
  • Epochs: 2
  • Batch size: 128 (GradCache), mini-batch 1
  • Negatives: 1 explicit hard negative + 127 in-batch negatives
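The mining step reduces to an argmax over a query-chunk similarity matrix with each query's gold chunk masked out. A toy sketch with a hypothetical 2×3 similarity matrix standing in for Stage 1 model scores:

```python
import numpy as np

# Row i = query i's similarity to each chunk; gold[i] is its true chunk.
sims = np.array([
    [0.9, 0.8, 0.1],   # query 0: gold = chunk 0, hardest wrong chunk = 1
    [0.7, 0.6, 0.75],  # query 1: gold = chunk 1, hardest wrong chunk = 2
])
gold = np.array([0, 1])

# Mask out the positive, then take the top-1 most similar wrong chunk.
masked = sims.copy()
masked[np.arange(len(gold)), gold] = -np.inf
hard_negatives = masked.argmax(axis=1)
print(hard_negatives)  # [1 2]
```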

Dataset

  • Dataset: danielnoumon/eu-ai-act-nl-queries
  • Train: 1,944 synthetic query-chunk pairs
  • Eval: 340 queries × 85 chunks
  • Split strategy: Chunk-level (no chunk appears in both train and eval)
  • Query generation: Azure OpenAI GPT-4o-mini with structured prompts
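The chunk-level split can be sketched as: collect the distinct chunk ids, hold out whole chunks, and route every query to the side its source chunk landed on. The `pairs` data below is hypothetical, for illustration only:

```python
import random

# Hypothetical (query, chunk_id) pairs: 50 queries over 10 chunks.
pairs = [(f"q{i}", f"chunk{i % 10}") for i in range(50)]

# Split at the chunk level so no chunk appears in both train and eval.
chunk_ids = sorted({c for _, c in pairs})
random.seed(42)
random.shuffle(chunk_ids)
eval_chunks = set(chunk_ids[:2])  # hold out 2 of the 10 chunks

train = [(q, c) for q, c in pairs if c not in eval_chunks]
eval_split = [(q, c) for q, c in pairs if c in eval_chunks]
```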

Hardware Notes

  • bf16 works on Blackwell (RTX 5090) with Qwen3
  • Qwen3's RMSNorm upcasts to fp32 internally, preventing gradient instability
  • flash_attention_2 used for O(N) attention memory
  • CachedMNRL (GradCache) essential: 8B model fills 16GB of 32GB VRAM with base weights alone
  • LoRA keeps trainable params at 0.20% — full fine-tuning impossible on 32GB
  • mini_batch_size=1 required due to extreme VRAM pressure

Limitations

  • Domain-specific: Fine-tuned on EU AI Act documentation. Performance on other domains may vary.
  • Language: Optimized for Dutch and English. Other languages supported by the base model may work but are not evaluated.
  • Chunk size: Trained on chunks up to 512 tokens. Very long documents should be chunked.
  • VRAM: Requires ~16GB VRAM for inference in bf16. Use quantization or CPU offloading for smaller GPUs.

License

Apache 2.0

Citation

@misc{qwen3embedding,
    title={Qwen3-Embedding: Advancing Text Embeddings with Qwen3},
    author={Qwen Team},
    year={2025},
    url={https://huggingface.co/Qwen/Qwen3-Embedding-8B}
}

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    year = "2019",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

@misc{hu2022lora,
    title={LoRA: Low-Rank Adaptation of Large Language Models},
    author={Edward J. Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen},
    year={2022},
    eprint={2106.09685},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}