---
license: mit
language:
- gl
---

**English text [here](https://huggingface.co/proxectonos/Nos_MT-CT2-gl-en/edit/main/README_English.md)**

**Model Description**

Model built with OpenNMT-py 3.5.2 for the Galician-English pair using a transformer architecture. The model was converted to the CTranslate2 format.

**How to translate with this model**

+ Install [Python](https://www.python.org/downloads/release/python-390/)
+ Install [CTranslate2](https://github.com/OpenNMT/CTranslate2)
+ Translate an input_text with the model using the following commands:

```bash
perl tokenizer.perl < input.txt > input.tok
```
```bash
subword_nmt.apply_bpe -c ./bpe/gl.code < input.tok > input.bpe
```
```bash
python3 translate.py model_name input.bpe > output.txt
```
```bash
sed -i 's/@@ //g' output.txt
```
```bash
perl detokenizer.perl < output.txt > final_output.txt
```

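The `sed` step above removes the `@@ ` continuation markers that subword-nmt inserts between BPE segments. If `sed` is not available, the same post-processing can be sketched in plain Python (the example string is illustrative):

```python
import re

def undo_bpe(line: str) -> str:
    """Rejoin subword units by deleting the '@@ ' continuation
    markers that subword-nmt places between BPE segments."""
    return re.sub(r"@@ ", "", line)

print(undo_bpe("trans@@ la@@ tion quality"))  # -> translation quality
```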
### Example translate.py file: Running CTranslate2 from Python
<details>
<summary>Show code</summary>

```python
import ctranslate2
import sys

model = sys.argv[1]      # path to the CTranslate2 model directory
file_name = sys.argv[2]  # tokenized, BPE-segmented input file

translator = ctranslate2.Translator(model, device="cuda")

with open(file_name, 'r') as file:
    for line in file:
        line = line.strip()
        # Translate one sentence at a time; tokens are space-separated.
        r = translator.translate_batch(
            [line.split()], replace_unknowns=True, beam_size=5, batch_type='examples'
        )
        # Keep the best hypothesis and rejoin its tokens.
        print(' '.join(r[0].hypotheses[0]))
```
</details>


**Training**

For training, we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations produced directly by human translators. It is worth noting that, although these texts were produced by humans, they are not free of linguistic errors. The latter are Spanish-Portuguese translation corpora, which we converted into Spanish-Galician via Portuguese-Galician machine translation with Opentrad/Apertium, plus transliteration for out-of-vocabulary words.

**Training procedure**

+ Tokenization of the datasets was done with the [Linguakit](https://github.com/citiususc/Linguakit) tokenizer (tokenizer.pl), modified to avoid the one-line-per-token output of the original file.

+ The BPE vocabulary for the models was generated with the OpenNMT [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) script.
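As an illustration of what a BPE learner computes (this is a toy sketch, not the OpenNMT `learn_bpe.py` script itself), the core loop repeatedly merges the most frequent adjacent symbol pair in the training vocabulary:

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Toy byte-pair encoding: start from characters and repeatedly
    merge the most frequent adjacent symbol pair."""
    # Each word is a tuple of symbols; start with single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges

print(learn_bpe(["low", "low", "lower", "newest", "newest"], 2))
```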

**Evaluation**

The BLEU evaluation of the models uses a mix of internally developed test sets (gold1, gold2, test-suite) and other datasets available in Galician (Flores).
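For reference, the core of the BLEU computation can be sketched with the standard library alone (a single-reference toy version; real evaluations should use a standard implementation such as sacreBLEU):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Minimal single-reference BLEU: geometric mean of modified
    n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat is on the mat", "the cat is on the mat"))  # identical -> 1.0
```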

**Model License**

MIT License

Copyright (c) 2023 Proxecto Nós

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

**Funding**

This model was developed within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU – NextGenerationEU, within the framework of the [ILENIA project](https://proyectoilenia.es/) with reference 2022/TL22/00215336.

**Citing this work**

If you use this model in your work, please cite it as follows:

Daniel Bardanca Outeirinho, Pablo Gamallo Otero, Iria de-Dios-Flores, and José Ramom Pichel Campos. 2024.
Exploring the effects of vocabulary size in neural machine translation: Galician as a target language.
In Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 600–604,
Santiago de Compostela, Galiza. Association for Computational Linguistics.