imdbo committed on
Commit
15a4b4a
·
verified ·
1 Parent(s): ddabcff

Update README.md

  1. README.md +111 -3
README.md CHANGED
@@ -1,3 +1,111 @@
---
license: mit
language:
- gl
---
**English text [here](https://huggingface.co/proxectonos/Nos_MT-CT2-gl-en/edit/main/README_English.md)**

**Model Description**

Model built with OpenNMT-py 3.5.2 for the Galician-English pair using a transformer architecture. The model was converted to the CTranslate2 format.

**How to translate with this Model**

+ Install [Python](https://www.python.org/downloads/release/python-390/)
+ Install [CTranslate2](https://github.com/OpenNMT/CTranslate2)
+ Translate an input text with the model by running the following commands in order:
```bash
perl tokenizer.perl < input.txt > input.tok
```
```bash
subword-nmt apply-bpe -c ./bpe/gl.code < input.tok > input.bpe
```
```bash
python3 translate.py model_name input.bpe > output.txt
```
```bash
sed -i 's/@@ //g' output.txt
```
```bash
perl detokenizer.perl < output.txt > final_output.txt
```
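
The `sed` step above undoes the BPE segmentation by deleting every `@@ ` joiner. As a minimal sketch, for setups where `sed` is not available, the same post-processing could be done in Python. The helper name `remove_bpe` is hypothetical and not part of the model's tooling; unlike the plain `sed` command, it also drops a dangling `@@` at the end of a line.

```python
import re

# Matches the "@@ " joiner that BPE leaves between sub-word pieces,
# plus a dangling "@@" at the very end of a line.
_BPE_JOINER = re.compile(r"@@ |@@ ?$")


def remove_bpe(line: str) -> str:
    """Merge BPE sub-word pieces back into whole tokens."""
    return _BPE_JOINER.sub("", line)


if __name__ == "__main__":
    print(remove_bpe("the trans@@ lation mod@@ el"))
```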
### Example translate.py file: running CTranslate2 from Python
<details>
<summary>Show code</summary>

```python
import sys

import ctranslate2

# Usage: python3 translate.py <model_dir> <bpe_input_file>
model = sys.argv[1]
file_name = sys.argv[2]

# device="cuda" requires a GPU; use device="cpu" to run without one.
translator = ctranslate2.Translator(model, device="cuda")

# Translate the BPE-encoded input one line at a time.
with open(file_name, "r", encoding="utf-8") as file:
    for line in file:
        tokens = line.strip().split()
        r = translator.translate_batch(
            [tokens], replace_unknowns=True, beam_size=5, batch_type="examples"
        )
        # Keep the best hypothesis and join its tokens back into a line.
        results = " ".join(r[0].hypotheses[0])
        print(results)
```
</details>

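
The script above sends one sentence per `translate_batch` call. As a rough sketch, assuming throughput matters more than streaming output, the lines could instead be grouped and translated several at a time. The names `translate_lines` and `_translate` and the `batch_size` default are illustrative, not part of the original script; the code relies only on the `translate_batch` call and the `hypotheses` attribute already used above.

```python
from typing import Iterable, Iterator, List


def translate_lines(translator, lines: Iterable[str], batch_size: int = 32) -> Iterator[str]:
    """Yield one translated line per input line, translating in groups."""
    batch: List[List[str]] = []
    for line in lines:
        batch.append(line.strip().split())
        if len(batch) == batch_size:
            yield from _translate(translator, batch)
            batch = []
    if batch:  # translate the final, possibly smaller, group
        yield from _translate(translator, batch)


def _translate(translator, batch: List[List[str]]) -> Iterator[str]:
    """Translate one group of tokenized lines and join the best hypotheses."""
    results = translator.translate_batch(
        batch, replace_unknowns=True, beam_size=5, batch_type="examples"
    )
    for result in results:
        yield " ".join(result.hypotheses[0])
```

With a real model, `translator` would be the `ctranslate2.Translator` object from the script above and `lines` the open BPE-encoded input file.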

**Training**

For training, we used authentic and synthetic corpora from [ProxectoNós](https://github.com/proxectonos/corpora). The former are corpora of translations produced directly by human translators. Note that, although these texts were written by humans, they are not free of linguistic errors. The latter are corpora of Spanish-Portuguese translations, which we converted into Spanish-Galician by means of Portuguese-Galician machine translation with Opentrad/Apertium and transliteration of out-of-vocabulary words.

**Training procedure**

+ The datasets were tokenized with the [linguakit](https://github.com/citiususc/Linguakit) tokenizer (tokenizer.pl), modified so that it no longer emits one token per line, as the original script does.

+ The BPE vocabulary for the models was generated with OpenNMT's [learn_bpe.py](https://github.com/OpenNMT/OpenNMT-py/blob/master/tools/learn_bpe.py) script.

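
To illustrate what the BPE step learns, here is a toy sketch of the merge-learning loop in plain Python. It is not the `learn_bpe.py` script itself: the function names, the alphabetical tie-breaking, and the tiny word-frequency table are assumptions made for illustration, and a real run learns many more merges from the full training corpus.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Vocab = Dict[Tuple[str, ...], int]


def count_pairs(vocab: Vocab) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair: Pair, vocab: Vocab) -> Vocab:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged: Vocab = {}
    for symbols, freq in vocab.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Pair]:
    """Return up to `num_merges` pair merges, in the order they were learned."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab: Vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges: List[Pair] = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        # Highest count wins; ties are broken alphabetically for determinism.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges


if __name__ == "__main__":
    toy = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    print(learn_bpe(toy, 3))
```

The learned merge list plays the role of the codes file (`./bpe/gl.code` in the commands above) that is later applied to new text.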

**Evaluation**

The BLEU evaluation of the models is carried out on a mix of internally developed test sets (gold1, gold2, test-suite) and other datasets available for Galician (Flores).
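
This README does not say which BLEU implementation or settings were used. Purely to illustrate what the metric measures, the sketch below computes an unsmoothed corpus-level BLEU over whitespace tokens with one reference per sentence. The function name `corpus_bleu` and its defaults are assumptions; reported scores should come from a standard tool rather than from this simplified function.

```python
import math
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Count the n-grams of length `n` in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Unsmoothed corpus BLEU in [0, 1] with one reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives BLEU = 0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than their references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

The function expects detokenized output lines and their references in the same order, one sentence per list entry.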

**Model Licenses**

MIT License

Copyright (c) 2023 Proxecto Nós

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

**Funding**

This model was developed within the Nós Project, funded by the Ministerio para la Transformación Digital y de la Función Pública and by the European Union (NextGenerationEU), within the framework of the [project ILENIA](https://proyectoilenia.es/) with reference 2022/TL22/00215336.

**Citing this work**

If you use this model in your work, please cite it as follows:

Daniel Bardanca Outeirinho, Pablo Gamallo Otero, Iria de-Dios-Flores, and José Ramom Pichel Campos. 2024.
Exploring the effects of vocabulary size in neural machine translation: Galician as a target language.
In Proceedings of the 16th International Conference on Computational Processing of Portuguese, pages 600–604,
Santiago de Compostela, Galiza. Association for Computational Linguistics.