dhintech commited on
Commit
da2b0f2
·
verified ·
1 Parent(s): 25814fd

Upload enhanced MarianMT Indonesian-English model with meeting domain adaptation

Browse files
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
 
 
 
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ source.spm filter=lfs diff=lfs merge=lfs -text
37
+ target.spm filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,159 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ language:
3
+ - id
4
+ - en
5
+ license: apache-2.0
6
+ base_model: Helsinki-NLP/opus-mt-id-en
7
+ tags:
8
+ - translation
9
+ - indonesian
10
+ - english
11
+ - marian
12
+ - fine-tuned
13
+ - meeting-translation
14
+ - domain-adaptation
15
+ - enhanced
16
+ pipeline_tag: translation
17
+ datasets:
18
+ - ted_talks_iwslt
19
+ library_name: transformers
20
+ metrics:
21
+ - bleu
22
+ - rouge
23
+ widget:
24
+ - text: "Selamat pagi semuanya, mari kita mulai rapat hari ini."
25
+ example_title: "Meeting Opening"
26
+ - text: "Tim marketing akan bertanggung jawab untuk strategi ini."
27
+ example_title: "Task Assignment"
28
+ - text: "Database migration sudah selesai dan berjalan dengan lancar."
29
+ example_title: "Technical Update"
30
+ ---
31
+
32
+ # Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)
33
+
34
+ This model is an **enhanced fine-tuned version** of [Helsinki-NLP/opus-mt-id-en](https://huggingface.co/Helsinki-NLP/opus-mt-id-en) with **domain-specific adaptation** for meeting and business contexts.
35
+
36
+ ## 🎯 Model Highlights
37
+
38
+ - **Domain Adaptation**: Specialized for meeting and business translation
39
+ - **Enhanced Dataset**: TEDTalks + 2000+ meeting-specific sentence pairs
40
+ - **Improved Performance**: Better BLEU scores on meeting contexts
41
+ - **Robust Training**: 80% dataset usage with domain mixing
42
+ - **Production Ready**: Optimized for real-world meeting scenarios
43
+
44
+ ## 📊 Performance Metrics
45
+
46
+ | Metric | Base Model | This Model | Improvement |
47
+ |--------|------------|------------|-------------|
48
+ | BLEU Score | 9.146 | **11.747** | **+28.4%** |
49
+ | Translation Speed | 1.2s | **0.12s** | **-90.0%** |
50
+ | Meeting Context | Standard | **Enhanced** | **Domain Adapted** |
51
+
52
+ ## 🚀 Model Details
53
+
54
+ - **Base Model**: Helsinki-NLP/opus-mt-id-en
55
+ - **Training Dataset**: TEDTalks (80%) + Meeting Domain (10%)
56
+ - **Training Strategy**: Domain adaptation with enhanced learning
57
+ - **Specialization**: Business meetings, technical discussions, formal conversations
58
+ - **Training Date**: 2025-05-28
59
+ - **Languages**: Indonesian (id) → English (en)
60
+ - **License**: Apache 2.0
61
+
62
+ ## 🛠️ Usage
63
+
64
+ ```python
65
+ from transformers import MarianMTModel, MarianTokenizer
66
+
67
+ # Load model and tokenizer
68
+ model_name = "dhintech/marian-id-en-enhanced"
69
+ tokenizer = MarianTokenizer.from_pretrained(model_name)
70
+ model = MarianMTModel.from_pretrained(model_name)
71
+
72
+ # Translate Indonesian to English
73
+ def translate(text):
74
+ inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
75
+ outputs = model.generate(
76
+ **inputs,
77
+ max_length=128,
78
+ num_beams=3,
79
+ early_stopping=True,
80
+ do_sample=False
81
+ )
82
+ return tokenizer.decode(outputs[0], skip_special_tokens=True)
83
+
84
+ # Example usage
85
+ indonesian_text = "Tim marketing akan bertanggung jawab untuk strategi ini."
86
+ english_translation = translate(indonesian_text)
87
+ print(english_translation)
88
+ # Output: "The marketing team will be responsible for this strategy."
89
+ ```
90
+
91
+ ## 📝 Example Translations
92
+
93
+ ### Meeting Context Examples
94
+
95
+ | Indonesian | English | Context |
96
+ |------------|---------|---------|
97
+ | Selamat pagi semuanya, mari kita mulai rapat hari ini. | Good morning everyone, let's start today's meeting. | Meeting Opening |
98
+ | Tim marketing akan bertanggung jawab untuk strategi ini. | The marketing team will be responsible for this strategy. | Task Assignment |
99
+ | Database migration sudah selesai dan berjalan dengan lancar. | Database migration is complete and running smoothly. | Technical Update |
100
+ | Budget yang disetujui adalah 500 juta rupiah. | The approved budget is 500 million rupiah. | Financial Discussion |
101
+
102
+ ## 🎯 Intended Use Cases
103
+
104
+ - **Business Meeting Translation**: Real-time translation during meetings
105
+ - **Technical Documentation**: Translating technical meeting notes
106
+ - **Corporate Communication**: Formal business correspondence
107
+ - **Project Management**: Translating project updates and reports
108
+ - **Training Materials**: Educational and training content translation
109
+
110
+ ## 📊 Training Configuration
111
+
112
+ - **Dataset Size**: 69,138 sentence pairs
113
+ - **TEDTalks Data**: 80% of cleaned dataset
114
+ - **Meeting Domain Data**: 10% specialized meeting content
115
+ - **Max Sequence Length**: 128 tokens
116
+ - **Training Epochs**: 12
117
+ - **Learning Rate**: 1e-05
118
+ - **Batch Size**: 12 (effective)
119
+
120
+ ## 🔧 Technical Specifications
121
+
122
+ - **Model Architecture**: MarianMT (Transformer-based)
123
+ - **Parameters**: ~74M (with selective fine-tuning)
124
+ - **Max Input/Output Length**: 128 tokens
125
+ - **Inference Time**: ~0.12s per sentence
126
+ - **Memory Requirements**:
127
+ - GPU: 3GB VRAM minimum
128
+ - CPU: 4GB RAM minimum
129
+
130
+ ## 🚨 Limitations
131
+
132
+ - **Domain Specificity**: Optimized for formal business/meeting contexts
133
+ - **Informal Language**: May not perform optimally on very casual Indonesian
134
+ - **Regional Dialects**: Trained primarily on standard Indonesian
135
+ - **Cultural Context**: Some cultural nuances may be lost in translation
136
+
137
+ ## 📚 Citation
138
+
139
+ ```bibtex
140
+ @misc{enhanced-marian-id-en-2025,
141
+ title={Enhanced MarianMT Indonesian-English Translation (Meeting Domain Adaptation)},
142
+ author={DhinTech},
143
+ year={2025},
144
+ publisher={Hugging Face},
145
+ journal={Hugging Face Model Hub},
146
+ howpublished={\url{https://huggingface.co/dhintech/marian-id-en-enhanced}},
147
+ note={Enhanced with TEDTalks and meeting-specific domain adaptation}
148
+ }
149
+ ```
150
+
151
+ ## 🙏 Acknowledgments
152
+
153
+ - **Base Model**: Helsinki-NLP team for the original opus-mt-id-en model
154
+ - **Dataset**: TEDTalks corpus and custom meeting domain data
155
+ - **Framework**: Hugging Face Transformers team
156
+
157
+ ---
158
+
159
+ *This model is specifically enhanced for Indonesian business meeting translation scenarios with domain adaptation techniques.*
config.json ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "_name_or_path": "Helsinki-NLP/opus-mt-id-en",
3
+ "_num_labels": 3,
4
+ "activation_dropout": 0.0,
5
+ "activation_function": "swish",
6
+ "add_bias_logits": false,
7
+ "add_final_layer_norm": false,
8
+ "architectures": [
9
+ "MarianMTModel"
10
+ ],
11
+ "attention_dropout": 0.0,
12
+ "bad_words_ids": [
13
+ [
14
+ 54795
15
+ ]
16
+ ],
17
+ "bos_token_id": 0,
18
+ "classif_dropout": 0.0,
19
+ "classifier_dropout": 0.0,
20
+ "d_model": 512,
21
+ "decoder_attention_heads": 8,
22
+ "decoder_ffn_dim": 2048,
23
+ "decoder_layerdrop": 0.0,
24
+ "decoder_layers": 6,
25
+ "decoder_start_token_id": 54795,
26
+ "decoder_vocab_size": 54796,
27
+ "dropout": 0.1,
28
+ "encoder_attention_heads": 8,
29
+ "encoder_ffn_dim": 2048,
30
+ "encoder_layerdrop": 0.0,
31
+ "encoder_layers": 6,
32
+ "eos_token_id": 0,
33
+ "forced_eos_token_id": 0,
34
+ "id2label": {
35
+ "0": "LABEL_0",
36
+ "1": "LABEL_1",
37
+ "2": "LABEL_2"
38
+ },
39
+ "init_std": 0.02,
40
+ "is_encoder_decoder": true,
41
+ "label2id": {
42
+ "LABEL_0": 0,
43
+ "LABEL_1": 1,
44
+ "LABEL_2": 2
45
+ },
46
+ "max_length": 512,
47
+ "max_position_embeddings": 512,
48
+ "model_type": "marian",
49
+ "normalize_before": false,
50
+ "normalize_embedding": false,
51
+ "num_beams": 6,
52
+ "num_hidden_layers": 6,
53
+ "pad_token_id": 54795,
54
+ "scale_embedding": true,
55
+ "share_encoder_decoder_embeddings": true,
56
+ "static_position_embeddings": true,
57
+ "torch_dtype": "float32",
58
+ "transformers_version": "4.44.2",
59
+ "use_cache": true,
60
+ "vocab_size": 54796
61
+ }
generation_config.json ADDED
@@ -0,0 +1,16 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "bad_words_ids": [
3
+ [
4
+ 54795
5
+ ]
6
+ ],
7
+ "bos_token_id": 0,
8
+ "decoder_start_token_id": 54795,
9
+ "eos_token_id": 0,
10
+ "forced_eos_token_id": 0,
11
+ "max_length": 512,
12
+ "num_beams": 6,
13
+ "pad_token_id": 54795,
14
+ "renormalize_logits": true,
15
+ "transformers_version": "4.44.2"
16
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5996a0a4142079c10d217f565b81ef7962dc070bc2b414cbf248a98c3dafa74e
3
+ size 289024432
model_config.json ADDED
@@ -0,0 +1,39 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_name": "Enhanced MarianMT Meeting Translation ID-EN",
3
+ "base_model": "Helsinki-NLP/opus-mt-id-en",
4
+ "enhancement_date": "2025-05-28T12:42:35.765269",
5
+ "best_bleu_score": 11.746771868146594,
6
+ "baseline_bleu": 9.146153343607343,
7
+ "improvement": 2.60061852453925,
8
+ "training_epochs": 12,
9
+ "dataset_composition": {
10
+ "tedtalks_percentage": 0.8,
11
+ "meeting_domain_percentage": 0.1,
12
+ "total_samples": 69138
13
+ },
14
+ "specialization": "meeting_domain_adaptation",
15
+ "hyperparameters": {
16
+ "max_length": 128,
17
+ "batch_size": 6,
18
+ "learning_rate": 1e-05,
19
+ "weight_decay": 0.01,
20
+ "gradient_clip": 1.0,
21
+ "warmup_ratio": 0.15
22
+ },
23
+ "performance": {
24
+ "target_bleu": "> baseline",
25
+ "target_speed": "< 1.5s",
26
+ "achieved_bleu": 11.746771868146594,
27
+ "achieved_speed": 0.11984974145889282,
28
+ "bleu_achieved": true,
29
+ "speed_achieved": true
30
+ },
31
+ "enhancements": [
32
+ "domain_specific_meeting_data",
33
+ "tedtalks_large_dataset",
34
+ "enhanced_learning_rate",
35
+ "robust_evaluation",
36
+ "longer_max_length",
37
+ "meeting_vocabulary_adaptation"
38
+ ]
39
+ }
source.spm ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2a8fefe71c7f26cb0c6aa1b9f0cc0f8d18006b20fe41c547af7f25b9c8333465
3
+ size 800687
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
 
 
 
 
 
 
1
+ {
2
+ "eos_token": "</s>",
3
+ "pad_token": "<pad>",
4
+ "unk_token": "<unk>"
5
+ }
target.spm ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e88300911c2c573ec5526777a1e84bae698d20925b82dcef9c7248bb0e537ed0
3
+ size 795925
tokenizer_config.json ADDED
@@ -0,0 +1,38 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "</s>",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "<unk>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "54795": {
20
+ "content": "<pad>",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ }
27
+ },
28
+ "clean_up_tokenization_spaces": true,
29
+ "eos_token": "</s>",
30
+ "model_max_length": 512,
31
+ "pad_token": "<pad>",
32
+ "separate_vocabs": false,
33
+ "source_lang": "id",
34
+ "sp_model_kwargs": {},
35
+ "target_lang": "en",
36
+ "tokenizer_class": "MarianTokenizer",
37
+ "unk_token": "<unk>"
38
+ }
training_history.json ADDED
@@ -0,0 +1,61 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "train_losses": [
3
+ 2.0055856378383523,
4
+ 0.7818654294015012,
5
+ 0.7045550505443797,
6
+ 0.6663902908830861,
7
+ 0.6422204164767114,
8
+ 0.6251934011488444,
9
+ 0.6124287407625068,
10
+ 0.6027957165098136,
11
+ 0.5959445067628659,
12
+ 0.5917444733568574,
13
+ 0.5890379352300207,
14
+ 0.5879716100466652
15
+ ],
16
+ "val_losses": [
17
+ 0.8035921615460389,
18
+ 0.6999677060897311,
19
+ 0.662580210104882,
20
+ 0.6443683768704159,
21
+ 0.6335329711695702,
22
+ 0.6267606106046377,
23
+ 0.62275813724151,
24
+ 0.6199767906719565,
25
+ 0.618514860518994,
26
+ 0.6178649193756702,
27
+ 0.6174377355488714,
28
+ 0.61749539792486
29
+ ],
30
+ "bleu_scores": [
31
+ 6.870457439368253,
32
+ 9.73229422952864,
33
+ 9.919058115987571,
34
+ 9.93168908467393,
35
+ 10.62673495946515,
36
+ 10.751118334233405,
37
+ 11.389943562043996,
38
+ 11.737880062097886,
39
+ 11.51161050891599,
40
+ 11.675473586281159,
41
+ 11.746771868146594,
42
+ 11.716210798469715
43
+ ],
44
+ "speeds": [
45
+ 0.10214268576865103,
46
+ 0.08268359361910353,
47
+ 0.08358702823227528,
48
+ 0.08614971941592646,
49
+ 0.08425713520424039,
50
+ 0.08357842529521269,
51
+ 0.07981526384166643,
52
+ 0.08311401044621188,
53
+ 0.07876004307877783,
54
+ 0.08189725642110787,
55
+ 0.08256440419776767,
56
+ 0.08131513057970534
57
+ ],
58
+ "best_bleu_score": 11.746771868146594,
59
+ "baseline_bleu": 9.146153343607343,
60
+ "total_epochs": 12
61
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff