---
library_name: transformers
language:
- en
- de
- fr
license: other
base_model: straker/tiri-tahi-3b-base-pt-bf16
tags:
- translation
- sft
- seq2seq
- fuzzy-match
pipeline_tag: translation
---

# Tiri Tahi 3B - Genesis SFT (EN-DE/FR)

A supervised fine-tuned version of [straker/tiri-tahi-3b-base-pt-bf16](https://huggingface.co/straker/tiri-tahi-3b-base-pt-bf16) for machine translation with translation memory (fuzzy match) augmentation.

## Model Details

- **Base model:** Tiri Tahi 3B (MADLAD-400 architecture, T5-based encoder-decoder)
- **Task:** Machine translation with fuzzy match context
- **Language pairs:** English-German (EN-DE), English-French (EN-FR)
- **Parameters:** ~3B

## Training Data

The model was fine-tuned on 72,230 translation pairs, with a further 4,012 pairs held out for validation:

| Language Pair | Training Samples |
|---|---|
| EN-DE | 44,592 |
| EN-FR | 27,638 |

Each training example includes up to 2 fuzzy matches from translation memory, providing the model with reference translations at varying similarity scores to improve output quality.

### Input Format

The model uses the MADLAD-400 `<2xx>` prefix format, with fuzzy match context prepended to the source text:

```
<2de>source text to translate
```

When fuzzy matches are available, they are prepended as context to help guide the translation; a hedged construction sketch is given under Example Code below.

## Training Procedure

### Hyperparameters

| Parameter | Value |
|---|---|
| Learning rate | 1e-4 |
| LR scheduler | Cosine |
| Warmup steps | 50 |
| Batch size | 32 |
| Epochs | 5 |
| Weight decay | 0.01 |
| Label smoothing | 0.05 |
| Max source length | 1024 tokens |
| Max target length | 256 tokens |
| Precision | bf16 |
| Gradient checkpointing | Enabled |
| Optimizer | AdamW (fused) |

### Training Results

| Metric | Value |
|---|---|
| Final train loss | 0.49 |
| Training time | ~2.5 hours (across resumed runs) |
| Train samples/sec | 79.18 |

## Intended Uses

- Machine translation for the EN-DE and EN-FR language pairs
- Translation memory-augmented machine translation (leveraging fuzzy matches)
- CAT (Computer-Assisted Translation) tool integration

## Limitations

- Trained only on EN-DE and EN-FR; other language pairs may produce lower-quality output
- Performance depends on the quality and relevance of the provided fuzzy matches
- Not evaluated on standard MT benchmarks (BLEU, COMET) in this release

## Framework Versions

- Transformers 4.57.6
- PyTorch 2.11.0+cu128
- Datasets 4.8.4
- Tokenizers 0.22.2
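## Example Code

The exact repository id of this fine-tuned checkpoint and the delimiter used to join fuzzy matches to the source segment are not stated above, so the sketches below mark both as assumptions.

First, a minimal input-construction sketch. Only the `<2xx>` prefix and the "up to 2 fuzzy matches prepended as context" behaviour are confirmed by this card; the `build_input` helper and its newline-separated layout are hypothetical.

```python
# Hypothetical helper: the newline-separated fuzzy-match layout is an
# assumption; only the <2xx> prefix itself is documented in this card.
def build_input(source: str, fuzzy_matches: list[str], target_lang: str = "de") -> str:
    prefix = f"<2{target_lang}>"
    context = fuzzy_matches[:2]  # the card states up to 2 matches per example
    if context:
        return "\n".join(context) + "\n" + prefix + source
    return prefix + source

# With no fuzzy matches this reduces to the documented format:
print(build_input("The contract ends in June.", []))
# -> <2de>The contract ends in June.
```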
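For inference, a standard Transformers seq2seq generation sketch. `straker/tiri-tahi-3b-genesis-sft` is a placeholder model id, not a name confirmed by this card, and the length settings simply mirror the 1024/256 limits from the hyperparameter table.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "straker/tiri-tahi-3b-genesis-sft"  # placeholder id; substitute the real repository
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

text = "<2fr>The invoice is due at the end of the month."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
outputs = model.generate(**inputs, max_new_tokens=256, num_beams=4)  # beam search is a common choice, not documented here
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```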
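Finally, the hyperparameter table maps onto standard `Seq2SeqTrainingArguments` roughly as follows. This is an illustrative reconstruction under stated assumptions, not the authors' published training script.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="tiri-tahi-3b-genesis-sft",  # hypothetical output path
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    per_device_train_batch_size=32,  # assumes "batch size 32" is per-device on one GPU
    num_train_epochs=5,
    weight_decay=0.01,
    label_smoothing_factor=0.05,
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)
```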