language:
  - en
  - hi
  - ta
  - te
  - kn
  - bn
  - ml
  - es
  - fr
  - ja
  - zh
license: cc-by-nc-4.0
tags:
  - translation
  - news
  - multilingual
  - nllb
  - journalism
  - media
pipeline_tag: translation

🌍 Multilingual News Translator

Translate English news articles from any source into 10 languages.

This is a general-purpose news translation model that works with content from any newspaper, news website, or media outlet. It was not trained on content from any specific news organization; it is a pre-trained multilingual model suited to translating journalistic text.

✨ Key Features

  • 🌐 Universal: Works with ANY news source (BBC, Reuters, local newspapers, blogs, etc.)
  • 🚀 Fast: Instant translations
  • 🎯 Accurate: Optimized for formal news language
  • 📰 Journalistic: Handles news terminology well
  • 🆓 Free: Open for non-commercial use

🎯 Supported Languages

  • 🇮🇳 Hindi (हिन्दी)
  • 🇮🇳 Telugu (తెలుగు)
  • 🇮🇳 Tamil (தமிழ்)
  • 🇮🇳 Kannada (ಕನ್ನಡ)
  • 🇮🇳 Bengali (বাংলা)
  • 🇮🇳 Malayalam (മലയാളം)
  • 🇪🇸 Spanish (Español)
  • 🇫🇷 French (Français)
  • 🇯🇵 Japanese (日本語)
  • 🇨🇳 Chinese (中文)

🚀 Quick Start

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model
model_name = "YOUR_USERNAME/multilingual-news-translator"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

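# Translate to Hindi
text = "Global markets showed strong growth today"
tokenizer.src_lang = "eng_Latn"  # source language: English
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Look up the target-language token ID. convert_tokens_to_ids also works
    # on transformers releases that removed tokenizer.lang_code_to_id.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hin_Deva"),
    max_length=512
)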
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)

📖 Language Codes Reference

| Language | Code | Script |
|---|---|---|
| English | eng_Latn | Latin (source language) |
| Hindi | hin_Deva | देवनागरी |
| Telugu | tel_Telu | తెలుగు |
| Tamil | tam_Taml | தமிழ் |
| Kannada | kan_Knda | ಕನ್ನಡ |
| Bengali | ben_Beng | বাংলা |
| Malayalam | mal_Mlym | മലയാളം |
| Spanish | spa_Latn | Latin |
| French | fra_Latn | Latin |
| Japanese | jpn_Jpan | 日本語 |
| Chinese | zho_Hans | 简体中文 |
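
The codes above can be wrapped in a small lookup so callers pass a language name instead of remembering each code. This is a minimal sketch: the names `NLLB_CODES` and `resolve_lang_code` are hypothetical helpers introduced here for illustration, not part of the model or of transformers.

```python
# Language name -> NLLB code, copied from the reference table above.
NLLB_CODES = {
    "english": "eng_Latn",
    "hindi": "hin_Deva",
    "telugu": "tel_Telu",
    "tamil": "tam_Taml",
    "kannada": "kan_Knda",
    "bengali": "ben_Beng",
    "malayalam": "mal_Mlym",
    "spanish": "spa_Latn",
    "french": "fra_Latn",
    "japanese": "jpn_Jpan",
    "chinese": "zho_Hans",
}


def resolve_lang_code(language: str) -> str:
    """Return the NLLB code for a language name (hypothetical helper)."""
    key = language.strip().lower()
    if key not in NLLB_CODES:
        supported = ", ".join(sorted(NLLB_CODES))
        raise ValueError(
            f"Unsupported language {language!r}; choose one of: {supported}"
        )
    return NLLB_CODES[key]
```

The returned code can then be passed to `tokenizer.convert_tokens_to_ids(...)` to obtain the `forced_bos_token_id` used in the Quick Start.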

💡 Complete Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class NewsTranslator:
    def __init__(self, model_name):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
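        # All ten target languages from the Language Codes Reference
        self.languages = {
            'hindi': 'hin_Deva',
            'telugu': 'tel_Telu',
            'tamil': 'tam_Taml',
            'kannada': 'kan_Knda',
            'bengali': 'ben_Beng',
            'malayalam': 'mal_Mlym',
            'spanish': 'spa_Latn',
            'french': 'fra_Latn',
            'japanese': 'jpn_Jpan',
            'chinese': 'zho_Hans'
        }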
    
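    def translate(self, text, target_lang):
        self.tokenizer.src_lang = "eng_Latn"  # source language: English
        # truncation=True silently cuts inputs that exceed the model's limit
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        # convert_tokens_to_ids also works on transformers releases that
        # removed tokenizer.lang_code_to_id
        target_id = self.tokenizer.convert_tokens_to_ids(self.languages[target_lang])
        outputs = self.model.generate(
            **inputs,
            forced_bos_token_id=target_id,
            max_length=512,
            num_beams=5
        )
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)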

# Usage
translator = NewsTranslator("YOUR_USERNAME/multilingual-news-translator")
result = translator.translate("Breaking news from around the world", "hindi")
print(result)

🎯 Use Cases

  • News Aggregators: Translate content from multiple sources
  • Media Monitoring: Track news in multiple languages
  • Research: Analyze global news coverage
  • Personal Use: Read international news in your language
  • Journalism: Cross-language reporting
  • Education: Study comparative journalism

📊 Model Information

  • Base Model: NLLB-200 (600M parameters)
  • Architecture: Transformer-based sequence-to-sequence
  • Training: Pre-trained on multilingual web data
  • Languages: 200+ languages (10 optimized for news)
  • Framework: PyTorch / TensorFlow compatible
  • Size: ~2.5GB
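
The listed size is roughly what the parameter count implies. As a back-of-the-envelope sketch, assuming the nominal 600M parameters stored as 32-bit floats (the function name is hypothetical, and real checkpoints add some overhead):

```python
def estimate_weight_size_gb(num_params: int, bytes_per_param: int) -> float:
    """Approximate size of the raw weights in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Nominal parameter count taken from the model card above.
NUM_PARAMS = 600_000_000

fp32_gb = estimate_weight_size_gb(NUM_PARAMS, 4)  # 4 bytes per float32 weight
fp16_gb = estimate_weight_size_gb(NUM_PARAMS, 2)  # 2 bytes per float16 weight
print(f"float32: ~{fp32_gb:.1f} GB, float16: ~{fp16_gb:.1f} GB")
```

Under these assumptions, loading the weights in half precision would need about half the memory of the default 32-bit weights.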

⚠️ Limitations

  • Optimized for formal news content and journalistic language
  • Best with complete sentences and proper grammar
  • May not handle extreme slang or very informal language well
  • Inputs are limited to 512 tokens; split longer articles into paragraphs and translate each one separately
  • Translation quality depends on content complexity
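
One way to respect the 512-token limit is to split an article before translating it. The sketch below is one possible approach, not the author's method: it keeps paragraphs whole when they fit and otherwise packs sentences greedily. `split_for_translation` and `whitespace_token_count` are hypothetical names, and the whitespace counter is only a stand-in for a real tokenizer count.

```python
import re
from typing import Callable, List


def split_for_translation(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 512,
) -> List[str]:
    """Split text into chunks that each stay within max_tokens.

    Paragraphs (separated by blank lines) are kept whole when they fit.
    Longer paragraphs are split greedily at sentence boundaries. A single
    sentence longer than max_tokens is emitted as-is and may be truncated.
    """
    chunks: List[str] = []
    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = " ".join(paragraph.split())  # normalize whitespace
        if not paragraph:
            continue
        if count_tokens(paragraph) <= max_tokens:
            chunks.append(paragraph)
            continue
        # Naive sentence split on ., !, ? followed by whitespace.
        # Assumption: good enough for English news copy; abbreviations
        # such as "U.S. officials" may split early.
        current = ""
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            candidate = f"{current} {sentence}".strip()
            if current and count_tokens(candidate) > max_tokens:
                chunks.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


def whitespace_token_count(s: str) -> int:
    # Stand-in counter for the demo; subword token counts are usually higher.
    return len(s.split())
```

In practice the counter would come from the model's tokenizer, for example `lambda s: len(tokenizer(s)["input_ids"])`, and each returned chunk would be translated separately and the results joined.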

📜 License & Legal

  • License: CC-BY-NC-4.0 (Non-commercial use)
  • Base Model: Meta's NLLB-200 (weights released under CC-BY-NC-4.0)
  • Data: Pre-trained on public multilingual web data
  • Usage: Free for research, personal, and non-commercial applications

⚠️ Important: This model does NOT contain data from any specific news organization. It is a general-purpose translation model trained on public multilingual data. Users are responsible for respecting copyright when translating content from specific sources.

🙏 Credits

Built using Meta's NLLB-200 (No Language Left Behind) model


Made with ❤️ for the global news community