---
language:
- as
- brx
- en
- grt
- hi
- kha
- trp
- mni
- lus
- njz
- njo
tags:
- language-identification
- fasttext
- northeast-india
- low-resource
- multilingual
license: cc-by-4.0
metrics:
- accuracy
- f1
library_name: fasttext
pipeline_tag: text-classification
model-index:
- name: NE-LID
  results:
  - task:
      type: text-classification
      name: Language Identification
    metrics:
    - type: accuracy
      value: 99.09
      name: Test Accuracy
    - type: f1
      value: 99
      name: Macro F1-Score
---

# NE-LID: Northeast Language Identification

![License](https://img.shields.io/badge/License-CC%20BY%204.0-blue.svg)
![Accuracy](https://img.shields.io/badge/Accuracy-99.09%25-brightgreen)

NE-LID is a **sentence-level language identification model** for low-resource languages of **Northeast India**, trained using a **character n-gram fastText classifier**.

The model achieves **near-ceiling accuracy (99.1%)** and is designed to be **fast, robust, and reproducible**, especially for script-diverse and low-resource settings.

---

## Supported Languages (11)

| Language | Family | Script |
|----------|--------|--------|
| Assamese | Indo-Aryan | Bengali-Assamese |
| Bodo | Tibeto-Burman | Devanagari |
| English | Germanic | Latin |
| Garo | Tibeto-Burman | Latin |
| Hindi | Indo-Aryan | Devanagari |
| Khasi | Austroasiatic | Latin |
| Kokborok | Tibeto-Burman | Latin |
| Meitei | Tibeto-Burman | Bengali |
| Mizo | Tibeto-Burman | Latin |
| Naga | Tibeto-Burman | Latin |
| Nyishi | Tibeto-Burman | Latin |

---

## Model Details

- **Model type**: fastText supervised classifier
- **Architecture**: Character n-grams (2–5)
- **Task**: Sentence-level Language Identification (LID)
- **Training data**: 22,000 sentences (2,000 per language)
- **Train / Dev / Test split**: 70% / 15% / 15% (stratified)
- **Evaluation accuracy**: **99.09%** (macro-F1: 0.99)
- **Model size**: ~10 MB
- **Inference speed**: <5 ms per sentence

---

## Why fastText?

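
To make the "character n-grams (2–5)" entry in Model Details concrete, here is a minimal sketch of the kind of subword features a fastText-style model derives from each word. The function name `char_ngrams` is hypothetical and this is not the library's internal code: fastText wraps each word in `<` and `>` boundary markers and additionally hashes the n-grams into buckets, a step omitted here.

```python
def char_ngrams(word, minn=2, maxn=5):
    """Return the character n-grams of `word` for lengths minn..maxn.

    Illustrative only: mirrors the boundary-marker convention used by
    fastText (`<word>`), but skips the hashing step the library applies.
    """
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(minn, maxn + 1)
        for i in range(len(wrapped) - n + 1)
    ]


# Small example with a reduced range so the output stays readable
print(char_ngrams("ka", minn=2, maxn=3))
# ['<k', 'ka', 'a>', '<ka', 'ka>']
```

Because the features are built from characters rather than a learned subword vocabulary, words in scripts or orthographies the model has rarely seen still produce usable n-grams.
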
Extensive experiments show that **character-level models outperform transformer-based language models** (e.g., NE-BERT, XLM-R) for Northeast Indian LID.

**Key findings:**

- Transformer models (NE-BERT, XLM-R) achieved only 9–37% accuracy on challenging samples
- fastText maintained 99%+ accuracy even on script-diverse, low-resource languages
- Character n-grams capture orthographic patterns better than subword tokenization for these languages

This model therefore prioritizes:

- ✅ Script awareness
- ✅ Orthographic cues
- ✅ Low-resource robustness

---

## Performance

| Language | Precision | Recall | F1-Score | Support |
|----------|-----------|--------|----------|---------|
| Assamese | 1.00 | 1.00 | 1.00 | 300 |
| Bodo | 0.99 | 0.98 | 0.99 | 300 |
| English | 0.96 | 0.99 | 0.98 | 300 |
| Garo | 0.99 | 1.00 | 1.00 | 300 |
| Hindi | 0.96 | 0.97 | 0.97 | 300 |
| Khasi | 1.00 | 0.99 | 0.99 | 300 |
| Kokborok | 1.00 | 0.99 | 1.00 | 300 |
| Meitei | 1.00 | 0.99 | 1.00 | 300 |
| Mizo | 0.99 | 0.99 | 0.99 | 300 |
| Naga | 1.00 | 1.00 | 1.00 | 300 |
| Nyishi | 1.00 | 0.99 | 0.99 | 300 |
| **Overall** | **0.99** | **0.99** | **0.99** | **3,300** |

**Test Accuracy: 99.09%**

---

## Benchmark Comparison

NE-LID outperforms the existing language identification systems listed below on Northeast Indian languages:

| Model | Overall Accuracy | Coverage (11 languages) |
|-------|-----------------|-------------------------|
| **NE-LID (Ours)** | **99.09%** | 11/11 ✅ |
| GlotLID | 73.12% | 9/11 (missing Garo, Naga) |
| OpenLID (Meta) | 42.03% | 5/11 |
| IndicLID (AI4Bharat) | 39.30% | 4/11 |
| LangDetect (Google) | 24.33% | 3/11 |

![Benchmark Comparison](ne_lid_benchmark.png)

**Key Findings:**

- NE-LID's accuracy is about 1.36× that of the best competitor, GlotLID (99.09% vs. 73.12%, a gain of roughly 26 percentage points)
- The other systems cover only 3 to 9 of the 11 languages, so each misses between 2 and 8 of them
- Character n-gram approach outperforms transformer-based models for script-diverse, low-resource languages

---

## Installation

```bash
pip install fasttext
```

---

## Usage

### Basic Usage (Python)

```python
import fasttext

# Load the model
model = fasttext.load_model("ne_lid.bin")

# Predict language (input must be a single line, without newline characters)
text = "Ki paidbah shnong ki la ia shim bynta ha ka jingïalang"
labels, probs = model.predict(text)

print(f"Language: {labels[0].replace('__label__', '')}")
print(f"Confidence: {probs[0]:.4f}")
```

**Output:**

```
Language: khasi
Confidence: 0.9999
```

### Batch Prediction

```python
# Reuses `model` from Basic Usage
texts = [
    "Ka sngi ka lieh",
    "আজি মই বজাৰলৈ গৈছিলোঁ",
    "Mizo tawng hi a ṭha hle"
]

# For a list input, predict() returns one list of labels and one array of
# probabilities per text, so take the first (top-1) entry of each
labels, probs = model.predict(texts)

for text, label, prob in zip(texts, labels, probs):
    lang = label[0].replace('__label__', '')
    print(f"{text[:30]:30} → {lang:10} ({prob[0]:.3f})")
```

### Get Top-K Predictions

```python
# Reuses `model` and `text` from Basic Usage
# Get top 3 language predictions
labels, probs = model.predict(text, k=3)

for label, prob in zip(labels, probs):
    lang = label.replace('__label__', '')
    print(f"{lang}: {prob:.4f}")
```

---

## Limitations

- **Designed for monolingual sentences**: not optimized for code-mixed text
- **Sentence-level only**: not designed for word-level or document-level LID
- **Performance may degrade** on extremely short inputs (≤2 tokens)
- **English and Hindi score lowest**: precision is 0.96 for both, with F1 of 0.98 and 0.97 respectively (expected due to loanwords and script overlap)

---

## Model Files

- `ne_lid.bin` - Main fastText model (binary format)
- `ne_lid.ftz` - Compressed model (optional, for smaller deployments)

---

## Training Details

**Data Sources:**

- Training corpus derived from NE-BERT dataset
- 2,000 sentences per language, stratified by length and script
- Balanced across language families (Austroasiatic, Tibeto-Burman, Indo-Aryan)

**Hyperparameters:**

- Learning rate: 0.1
- Epochs: 25
- Word n-grams: 1–3
- Character n-grams: 2–5
- Loss function: Softmax

---

## License

This model is released under **Creative Commons Attribution 4.0 International (CC BY 4.0)**.

You are free to:

- ✅ Share: copy and redistribute the material
- ✅ Adapt: remix, transform, and build upon the material

Under the following terms:

- Attribution: you must give appropriate credit to MWire Labs

---

## Citation

If you use NE-LID in your research or applications, please cite:

```bibtex
@misc{mwirelabs2025nelid,
  title={NE-LID: Northeast Language Identification},
  author={MWire Labs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWirelabs/ne-lid}}
}
```

---

## About MWire Labs

**MWire Labs** is an AI research organization based in Shillong, Meghalaya, India, specializing in language technology for Northeast India's indigenous languages.

**Repository:** [MWirelabs/ne-lid](https://huggingface.co/MWirelabs/ne-lid)

**Contact:** [MWire Labs](https://mwirelabs.com)

---

## Acknowledgments

We thank the open-source community and contributors to the NE-BERT corpus that made this work possible.

---

**Last Updated:** January 2026

**Version:** 1.0.0
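
---

As a rough sketch of how the hyperparameters listed under Training Details could map onto fastText's Python API, the snippet below collects them as keyword arguments for `fasttext.train_supervised`. This is not the original training script: the helper names (`to_fasttext_line`, `train_ne_lid`), the file paths, and the reading of "Word n-grams: 1–3" as `wordNgrams=3` are assumptions made for illustration.

```python
# Assumed mapping from the Training Details list to fastText arguments
TRAIN_ARGS = {
    "lr": 0.1,          # learning rate
    "epoch": 25,        # number of epochs
    "wordNgrams": 3,    # assumption: "1-3" means word n-grams up to length 3
    "minn": 2,          # shortest character n-gram
    "maxn": 5,          # longest character n-gram
    "loss": "softmax",
}


def to_fasttext_line(label, sentence):
    """Format one example as '__label__<lang> <text>' on a single line."""
    text = " ".join(sentence.split())  # collapse newlines and extra spaces
    return f"__label__{label} {text}"


def train_ne_lid(train_path, model_path="ne_lid_retrained.bin"):
    """Train and save a classifier; both paths are hypothetical."""
    import fasttext  # imported lazily so the helpers above run without it

    model = fasttext.train_supervised(input=train_path, **TRAIN_ARGS)
    model.save_model(model_path)
    return model


print(to_fasttext_line("khasi", "Ka sngi ka lieh"))
# __label__khasi Ka sngi ka lieh
```

The training file is expected to hold one such line per sentence; the `__label__` prefix matches the labels that the Usage examples strip from the model's predictions.
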