m2v-bge-m3-european

A Model2Vec static embedding model distilled from BAAI/bge-m3, pruned to European languages only.

What is this?

This model was created by:

  1. Distilling the BGE-M3 transformer (568M params) into a static token embedding lookup table using model2vec
  2. Pruning all tokens written in non-European scripts (CJK, Hangul, Hiragana and Katakana, Arabic, Hebrew, Thai, Devanagari, etc.)

The result is a lightweight embedding model whose vocabulary keeps only tokens in the scripts used by European languages (Latin, Greek, Cyrillic, etc.).
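To make the idea of a static token embedding lookup table concrete, here is a minimal toy sketch. It assumes a tiny hand-written table, whitespace tokenization, and mean pooling over the token vectors; the names `TOY_TABLE`, `encode`, and `prune` are hypothetical and are not part of model2vec, which uses the base model's tokenizer and its own storage format.

```python
from typing import Callable, Dict, List

# Hypothetical toy table: token -> vector (the real table has 256 dimensions).
TOY_TABLE: Dict[str, List[float]] = {
    "shower": [1.0, 0.0],
    "gel": [0.0, 1.0],
    "中": [0.5, 0.5],
}


def encode(text: str, table: Dict[str, List[float]], dim: int = 2) -> List[float]:
    """Embed text by averaging the vectors of its known tokens."""
    # Assumption: whitespace tokenization; unknown tokens are simply skipped.
    vectors = [table[tok] for tok in text.lower().split() if tok in table]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def prune(table: Dict[str, List[float]], keep: Callable[[str], bool]) -> Dict[str, List[float]]:
    """Drop every row whose token is rejected by the `keep` predicate."""
    return {tok: vec for tok, vec in table.items() if keep(tok)}


# Toy predicate: keep ASCII-only tokens (a stand-in for a real script check).
pruned = prune(TOY_TABLE, str.isascii)
print(len(TOY_TABLE), "->", len(pruned), "tokens")
print(encode("shower gel", pruned))
```

Because encoding is only a lookup followed by an average, removing rows for unused scripts shrinks the table without changing the vectors of the tokens that remain.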

Stats

                 Before pruning    After pruning
Vocabulary       249,999 tokens    158,843 tokens
Model size       ~244 MB           ~78 MB
Embedding dim    256               256

Pruning removed 91,156 tokens (36.5% of the vocabulary), all from non-European scripts.
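The figures above can be cross-checked with a few lines of arithmetic. This is a sketch under two assumptions that are not stated in the card: that the unpruned table is stored as 32-bit floats and the pruned one as 16-bit floats, and that the listed sizes are in units of 1024² bytes. These assumptions were chosen because they reproduce the listed sizes; they would also explain why the file shrinks by more than the 36.5% token reduction.

```python
VOCAB_BEFORE = 249_999
VOCAB_AFTER = 158_843
DIM = 256
MIB = 1024 ** 2

# Token reduction from pruning.
removed = VOCAB_BEFORE - VOCAB_AFTER
removed_pct = 100 * removed / VOCAB_BEFORE

# Parameter count of the pruned lookup table (rows x columns).
params_after = VOCAB_AFTER * DIM

# Assumption: 4 bytes per value before pruning, 2 bytes per value after.
size_before_mib = VOCAB_BEFORE * DIM * 4 / MIB
size_after_mib = VOCAB_AFTER * DIM * 2 / MIB

print(f"removed tokens: {removed:,} ({removed_pct:.1f}%)")
print(f"parameters after pruning: {params_after / 1e6:.1f}M")
print(f"estimated size: {size_before_mib:.0f} MiB -> {size_after_mib:.0f} MiB")
```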

Usage

from model2vec import StaticModel

model = StaticModel.from_pretrained("flipbitsnotburgers/m2v-bge-m3-european")
embeddings = model.encode(["deodorant", "Duschgel", "shower gel"])
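A common next step is to compare the returned vectors. The sketch below uses only the standard library and assumes that `embeddings` is a sequence with one vector per input text; the helper names `cosine_similarity` and `pairwise_similarities` are hypothetical and not part of model2vec.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(
    texts: Sequence[str], embeddings: Sequence[Sequence[float]]
) -> List[Tuple[str, str, float]]:
    """Return (text_i, text_j, similarity) for every unordered pair of texts."""
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            sim = cosine_similarity(embeddings[i], embeddings[j])
            pairs.append((texts[i], texts[j], sim))
    return pairs
```

With the usage snippet above, `pairwise_similarities(["deodorant", "Duschgel", "shower gel"], embeddings)` would list the three pairwise scores, which is one way to check whether the German and English terms land close together.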

Pruned scripts

The following scripts were removed:

  • CJK (Chinese, Japanese Kanji)
  • Hangul (Korean)
  • Hiragana & Katakana (Japanese)
  • Arabic
  • Hebrew
  • Thai, Lao
  • Devanagari, Bengali, Tamil, Telugu, and other Indic scripts
  • Myanmar, Ethiopic, Tibetan, Khmer
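One possible way to detect tokens in the scripts listed above is to look at Unicode character names. This is an illustrative sketch only, not the script that was used to build the model: the prefix list and the function name `uses_pruned_script` are assumptions, and the prefixes cover only the scripts named explicitly in the list, not the "other Indic scripts".

```python
import unicodedata

# Assumption: Unicode character names start with the script name,
# e.g. "HANGUL SYLLABLE HAN" or "CJK UNIFIED IDEOGRAPH-4E2D".
PRUNED_SCRIPT_PREFIXES = (
    "CJK", "HANGUL", "HIRAGANA", "KATAKANA", "ARABIC", "HEBREW",
    "THAI", "LAO", "DEVANAGARI", "BENGALI", "TAMIL", "TELUGU",
    "MYANMAR", "ETHIOPIC", "TIBETAN", "KHMER",
)


def uses_pruned_script(token: str) -> bool:
    """True if any character in the token belongs to a pruned script."""
    return any(
        unicodedata.name(ch, "").startswith(PRUNED_SCRIPT_PREFIXES)
        for ch in token
    )


# Example vocabulary; "▁" is the word-boundary marker used by some tokenizers.
vocab = ["▁Duschgel", "▁shower", "Ωμέγα", "жук", "中文", "한국", "שלום", "ไทย"]
kept = [tok for tok in vocab if not uses_pruned_script(tok)]
print(kept)
```

Tokens made only of Latin, Greek, or Cyrillic letters, digits, or punctuation pass this check, so they would stay in the vocabulary under this sketch.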

License

MIT (same as base model)

Safetensors: 40.7M params, tensor type F16