---
language:
- id
- en
license: apache-2.0
base_model: Helsinki-NLP/opus-mt-id-en
tags:
- translation
- indonesian
- english
- marian
- fine-tuned
- meeting-translation
- domain-adaptation
- enhanced
pipeline_tag: translation
datasets:
- ted_talks_iwslt
library_name: transformers
metrics:
- bleu
- rouge
widget:
- text: "Selamat pagi semuanya, mari kita mulai rapat hari ini."
  example_title: "Meeting Opening"
- text: "Tim marketing akan bertanggung jawab untuk strategi ini."
  example_title: "Task Assignment"
- text: "Database migration sudah selesai dan berjalan dengan lancar."
  example_title: "Technical Update"
---

# Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)

This model is an **enhanced fine-tuned version** of [Helsinki-NLP/opus-mt-id-en](https://huggingface.co/Helsinki-NLP/opus-mt-id-en) with **domain-specific adaptation** for meeting and business contexts.

## 🎯 Model Highlights

- **Domain Adaptation**: Specialized for meeting and business translation
- **Enhanced Dataset**: TEDTalks corpus plus 2,000+ meeting-specific sentence pairs
- **Improved Performance**: Higher BLEU scores on meeting contexts
- **Robust Training**: Trained on 80% of the cleaned dataset with domain mixing
- **Production Ready**: Optimized for real-world meeting scenarios

## 📊 Performance Metrics

| Metric | Base Model | This Model | Improvement |
|--------|------------|------------|-------------|
| BLEU Score | 9.146 | **11.747** | **+28.4%** |
| Inference Time (per sentence) | 1.2s | **0.12s** | **-90.0%** |
| Meeting Context | Standard | **Enhanced** | **Domain Adapted** |

## 🚀 Model Details

- **Base Model**: Helsinki-NLP/opus-mt-id-en
- **Training Dataset**: TEDTalks (80%) + Meeting Domain (10%)
- **Training Strategy**: Domain adaptation with enhanced learning
- **Specialization**: Business meetings, technical discussions, formal conversations
- **Training Date**: 2025-05-28
- **Languages**: Indonesian (id) → English (en)
- **License**: Apache 2.0

## 🛠️ Usage

```python
from transformers import MarianMTModel, MarianTokenizer

# Load model and tokenizer
model_name = "dhintech/marian-tedtalks-id-en-enhanced"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

# Translate Indonesian to English
def translate(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True,
                       truncation=True, max_length=128)
    outputs = model.generate(
        **inputs,
        max_length=128,
        num_beams=3,
        early_stopping=True,
        do_sample=False,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
indonesian_text = "Tim marketing akan bertanggung jawab untuk strategi ini."
english_translation = translate(indonesian_text)
print(english_translation)
# Output: "The marketing team will be responsible for this strategy."
```

## 📝 Example Translations

### Meeting Context Examples

| Indonesian | English | Context |
|------------|---------|---------|
| Selamat pagi semuanya, mari kita mulai rapat hari ini. | Good morning everyone, let's start today's meeting. | Meeting Opening |
| Tim marketing akan bertanggung jawab untuk strategi ini. | The marketing team will be responsible for this strategy. | Task Assignment |
| Database migration sudah selesai dan berjalan dengan lancar. | Database migration is complete and running smoothly. | Technical Update |
| Budget yang disetujui adalah 500 juta rupiah. | The approved budget is 500 million rupiah. | Financial Discussion |

## 🎯 Intended Use Cases

- **Business Meeting Translation**: Real-time translation during meetings
- **Technical Documentation**: Translating technical meeting notes
- **Corporate Communication**: Formal business correspondence
- **Project Management**: Translating project updates and reports
- **Training Materials**: Educational and training content translation

## 📊 Training Configuration

- **Dataset Size**: 69,138 sentence pairs
- **TEDTalks Data**: 80% of the cleaned dataset
- **Meeting Domain Data**: 10% specialized meeting content
- **Max Sequence Length**: 128 tokens
- **Training Epochs**: 12
- **Learning Rate**: 1e-05
- **Batch Size**: 12 (effective)

## 🔧 Technical Specifications

- **Model Architecture**: MarianMT (Transformer-based)
- **Parameters**: ~74M (with selective fine-tuning)
- **Max Input/Output Length**: 128 tokens
- **Inference Time**: ~0.12s per sentence
- **Memory Requirements**:
  - GPU: 3GB VRAM minimum
  - CPU: 4GB RAM minimum

## 🚨 Limitations

- **Domain Specificity**: Optimized for formal business/meeting contexts
- **Informal Language**: May not perform optimally on very casual Indonesian
- **Regional Dialects**: Trained primarily on standard Indonesian
- **Cultural Context**: Some cultural nuances may be lost in translation

## 📚 Citation

```bibtex
@misc{enhanced-marian-id-en-2025,
  title={Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)},
  author={DhinTech},
  year={2025},
  publisher={Hugging Face},
  journal={Hugging Face Model Hub},
  howpublished={\url{https://huggingface.co/dhintech/marian-tedtalks-id-en-enhanced}},
  note={Enhanced with TEDTalks and meeting-specific domain adaptation}
}
```

## 🙏 Acknowledgments

- **Base Model**: Helsinki-NLP team for the original opus-mt-id-en model
- **Dataset**: TEDTalks corpus and custom meeting domain data
- **Framework**: Hugging Face Transformers team

---

*This model is specifically enhanced for Indonesian business meeting translation scenarios with domain adaptation techniques.*
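
As a sanity check on the Performance Metrics section, the **+28.4%** BLEU figure follows directly from the two reported scores; a minimal sketch of the arithmetic:

```python
# Relative BLEU improvement from the scores reported in the metrics table.
base_bleu = 9.146    # Helsinki-NLP/opus-mt-id-en baseline
tuned_bleu = 11.747  # this fine-tuned model
improvement_pct = (tuned_bleu - base_bleu) / base_bleu * 100
print(f"+{improvement_pct:.1f}%")  # → +28.4%
```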
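
The ~0.12 s/sentence inference-time figure can be verified locally with a simple timing harness. A minimal sketch, using a hypothetical stand-in for `translate` so the pattern runs without downloading weights — substitute the real model call from the Usage section to measure actual latency:

```python
import time

def translate(text):
    # Hypothetical stand-in; replace with the MarianMT call from the Usage section.
    return text[::-1]

def mean_latency_seconds(sentences, fn):
    # One warm-up call excluded so one-time setup cost does not skew the average.
    fn(sentences[0])
    start = time.perf_counter()
    for s in sentences:
        fn(s)
    return (time.perf_counter() - start) / len(sentences)

sentences = [
    "Selamat pagi semuanya, mari kita mulai rapat hari ini.",
    "Budget yang disetujui adalah 500 juta rupiah.",
]
print(f"{mean_latency_seconds(sentences, translate):.4f}s per sentence")
```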