m2v-bge-m3-european
A Model2Vec static embedding model distilled from BAAI/bge-m3, pruned to European languages only.
What is this?
This model was created in two steps:
- Distilling the BGE-M3 transformer (568M parameters) into a static token embedding lookup table using model2vec
- Pruning all tokens written in non-European scripts (CJK, Hangul, Hiragana and Katakana, Arabic, Hebrew, Thai, Devanagari, etc.)
The result is a lightweight embedding model that only contains tokens relevant to European languages (Latin, Greek, Cyrillic, etc.).
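To make "static token embedding lookup table" concrete, here is a minimal toy sketch of how such a model turns text into a vector: look up one row per token and average the rows. The table values, the 3-dimensional rows, the whitespace tokenizer, and the names `TOY_TABLE` and `encode_static` are all hypothetical and chosen for illustration; the real model uses the BGE-M3 tokenizer and 256-dimensional rows.

```python
from typing import Dict, List

# Hypothetical 3-dimensional lookup table (the real model stores 256-dim rows).
TOY_TABLE: Dict[str, List[float]] = {
    "shower": [0.9, 0.1, 0.0],
    "gel": [0.7, 0.3, 0.0],
    "deodorant": [0.6, 0.2, 0.2],
}


def encode_static(text: str, table: Dict[str, List[float]]) -> List[float]:
    """Embed text by averaging the table rows of its known tokens."""
    dim = len(next(iter(table.values())))
    # Assumption: whitespace tokenization stands in for the real subword tokenizer.
    rows = [table[tok] for tok in text.lower().split() if tok in table]
    if not rows:
        # No known token (e.g. text in a pruned script): return a zero vector.
        return [0.0] * dim
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


vector = encode_static("shower gel", TOY_TABLE)
```

No transformer forward pass is involved at inference time, which is why the model is so small and fast. It also shows why pruning matters: a token that is not in the table simply contributes nothing.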
Stats
| | Before pruning | After pruning |
|---|---|---|
| Vocabulary | 249,999 tokens | 158,843 tokens |
| Model size | ~244 MB | ~78 MB |
| Embedding dim | 256 | 256 |
Pruning removed 91,156 tokens (36.5% of the vocabulary), all from non-European scripts.
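The figures in the table can be checked with a little arithmetic. The sketch below recomputes the removed fraction and estimates the size of the embedding table as vocabulary × dimension × bytes per value. The bytes-per-value settings are an assumption, not something the model card states: 4 bytes (float32) before pruning and 2 bytes (float16) after pruning happen to reproduce the listed sizes when measured in MiB.

```python
VOCAB_BEFORE = 249_999
VOCAB_AFTER = 158_843
EMBEDDING_DIM = 256


def table_size_mib(vocab: int, dim: int, bytes_per_value: int) -> float:
    """Size of a dense vocab x dim embedding table in MiB."""
    return vocab * dim * bytes_per_value / 2**20


removed = VOCAB_BEFORE - VOCAB_AFTER
removed_pct = 100 * removed / VOCAB_BEFORE

# Assumption: float32 before pruning, float16 after pruning.
size_before = table_size_mib(VOCAB_BEFORE, EMBEDDING_DIM, 4)
size_after = table_size_mib(VOCAB_AFTER, EMBEDDING_DIM, 2)

print(f"removed tokens: {removed} ({removed_pct:.1f}%)")
print(f"estimated size before: {size_before:.0f} MiB")
print(f"estimated size after:  {size_after:.0f} MiB")
```

Under that assumption, the size drops by more than the 36.5% vocabulary reduction because the stored precision also changes; with a fixed precision the size would shrink in proportion to the vocabulary.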
Usage
from model2vec import StaticModel
model = StaticModel.from_pretrained("flipbitsnotburgers/m2v-bge-m3-european")
embeddings = model.encode(["deodorant", "Duschgel", "shower gel"])
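A common next step is to compare the resulting vectors with cosine similarity. The helper below is a minimal sketch using only the standard library; the name `cosine_similarity` and the toy vectors are illustrative and not part of model2vec.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors standing in for rows of `embeddings`.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

With the model loaded as above, `cosine_similarity(embeddings[1], embeddings[2])` would compare "Duschgel" with "shower gel", since the rows of `embeddings` follow the order of the input list.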
Pruned scripts
The following scripts were removed:
- CJK (Chinese, Japanese Kanji)
- Hangul (Korean)
- Hiragana & Katakana (Japanese)
- Arabic
- Hebrew
- Thai, Lao
- Devanagari, Bengali, Tamil, Telugu, and other Indic scripts
- Myanmar, Ethiopic, Tibetan, Khmer
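One way to implement this kind of script-based pruning is to inspect the Unicode name of every character in a token and drop the token if any character belongs to a blocked script. The sketch below illustrates that idea with the standard library only; it is not the procedure actually used to build this model, and `BLOCKED_PREFIXES`, `is_pruned`, and `prune_vocab` are hypothetical names. The prefix list covers only the scripts named explicitly above, not the "other Indic scripts".

```python
import unicodedata
from typing import List

# Unicode character-name prefixes for the scripts listed above.
BLOCKED_PREFIXES = (
    "CJK", "HANGUL", "HIRAGANA", "KATAKANA", "ARABIC", "HEBREW",
    "THAI", "LAO", "DEVANAGARI", "BENGALI", "TAMIL", "TELUGU",
    "MYANMAR", "ETHIOPIC", "TIBETAN", "KHMER",
)


def is_pruned(token: str) -> bool:
    """True if any character's Unicode name starts with a blocked prefix."""
    return any(
        unicodedata.name(ch, "").startswith(BLOCKED_PREFIXES) for ch in token
    )


def prune_vocab(vocab: List[str]) -> List[str]:
    """Keep only tokens with no character from a blocked script."""
    return [tok for tok in vocab if not is_pruned(tok)]


sample = ["▁hello", "▁Привет", "▁γεια", "▁café", "123", "▁你好", "한국", "▁שלום"]
kept = prune_vocab(sample)
```

A blocklist keeps digits, punctuation, and the tokenizer's word-boundary marker without special handling, whereas an allowlist of Latin, Greek, and Cyrillic would need explicit exceptions for them.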
License
MIT (same as base model)