Rhpan tomasBernal committed on
Commit
39d774d
·
1 Parent(s): d342103

Create README.md (#1)

- Create README.md (23e4666c22b3c9af2f9222011c6507a8c3a43a05)


Co-authored-by: Tomás Bernal Beltrán <tomasBernal@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +233 -0
README.md ADDED
@@ -0,0 +1,233 @@
+ ---
+ language:
+ - es
+ license: mit
+ library_name: transformers
+ pipeline_tag: audio-classification
+ tags:
+ - emotion-recognition
+ - speech-emotion-recognition
+ - multimodal-learning
+ - audio-classification
+ - speech-processing
+ - text-processing
+ - spanish
+ - affective-computing
+ - umuteam
+ datasets:
+ - NLP-UMUTeam/Spanish-MEACorpus-2023
+ metrics:
+ - accuracy
+ - f1
+
+ model-index:
+ - name: UMUTeam/w2v-bert-beto-concat-emotion-es
+   results:
+   - task:
+       type: audio-classification
+       name: Multimodal Speech Emotion Recognition
+     dataset:
+       name: Spanish MEACorpus 2023
+       type: custom
+     metrics:
+     - type: accuracy
+       value: 90.0682
+       name: Accuracy
+     - type: weighted-f1
+       value: 90.0642
+       name: Weighted F1
+     - type: macro-f1
+       value: 87.7455
+       name: Macro F1
+ ---
+
+ # UMUTeam/w2v-bert-beto-concat-emotion-es
+
+ ## Model description
+
+ `UMUTeam/w2v-bert-beto-concat-emotion-es` is a Spanish multimodal emotion recognition model developed as part of **speech-emotion**, an open-source multilingual and multimodal toolkit for emotion recognition from speech, text, and multimodal inputs.
+
+ This model performs **multimodal emotion classification from Spanish speech and text inputs**.
+
+ The model combines acoustic representations extracted with Wav2Vec2-BERT and linguistic representations generated with BETO using a concatenation-based multimodal fusion strategy.
+
+ It is designed to jointly exploit complementary emotional information from speech and text in order to improve emotion recognition performance compared to unimodal approaches.
+
+ The model predicts one of the following emotion labels:
+
+ - `anger`
+ - `disgust`
+ - `fear`
+ - `joy`
+ - `neutral`
+ - `sadness`
+
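As a rough illustration of the concatenation-based fusion described above, the sketch below joins a pooled acoustic vector and a pooled text vector end to end and passes the result to a linear classification head. The function names, the toy 2-dimensional embeddings, and the hand-set weights are assumptions made for this example; the real model's embedding sizes, pooling, and classifier layers are not specified in this card. A mean-fusion variant is included only for contrast.

```python
from typing import List, Sequence

LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness"]


def concat_fusion(audio_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Join the two modality vectors end to end (size = len(audio) + len(text))."""
    return list(audio_emb) + list(text_emb)


def mean_fusion(audio_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Element-wise average; only defined when both vectors have the same size."""
    if len(audio_emb) != len(text_emb):
        raise ValueError("mean fusion needs embeddings of equal size")
    return [(a + t) / 2.0 for a, t in zip(audio_emb, text_emb)]


def classify(fused: Sequence[float], weights: Sequence[Sequence[float]],
             biases: Sequence[float]) -> str:
    """Apply a linear head (one weight row per label) and return the top label."""
    scores = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(weights, biases)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return LABELS[best]


# Toy embeddings standing in for pooled encoder outputs (assumed values).
audio_emb = [0.2, -0.1]
text_emb = [0.7, 0.4]
fused = concat_fusion(audio_emb, text_emb)  # 4-dimensional fused vector

# Hypothetical head: row i scores LABELS[i]; only "joy" reacts to the text part.
weights = [[0.0] * len(fused) for _ in LABELS]
weights[LABELS.index("joy")] = [0.0, 0.0, 1.0, 1.0]
biases = [0.0] * len(LABELS)

print(fused, classify(fused, weights, biases))
```

Concatenation keeps every feature of both modalities and lets the classifier weight them separately, at the cost of a larger input; averaging halves the input size but requires equally sized embeddings and mixes the two signals before classification.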
+ ## Intended use
+
+ This model is intended for research and applied scenarios involving multimodal emotion recognition in Spanish, such as:
+
+ - multimodal conversational analysis
+ - speech and text emotion analysis
+ - affective computing research
+ - emotion-aware conversational systems
+ - human-computer interaction
+ - multimodal AI research
+
+ The model is particularly useful in scenarios where both speech audio and transcribed text are available.
+
+ It can be used through the `speech-emotion` toolkit.
+
+ ## Out-of-scope use
+
+ This model should not be used as the sole basis for high-stakes decisions, including but not limited to:
+
+ - clinical diagnosis
+ - mental health assessment
+ - employment, legal, or educational decisions
+ - biometric profiling or surveillance
+ - automated decisions affecting individuals without human oversight
+
+ Emotion recognition is inherently uncertain and context-dependent. Predictions should be interpreted as model estimates, not as definitive assessments of a person's emotional state.
+
+ ## Training data
+
+ The model was trained on the Spanish portion of the datasets used in the `speech-emotion` project, primarily based on the **Spanish MEACorpus 2023** dataset.
+
+ Spanish MEACorpus 2023 is a multimodal speech-text emotion corpus for Spanish emotion analysis collected from natural environments. The dataset contains aligned speech and textual information for emotion recognition tasks.
+
+ The emotion labels were harmonized into the following six-class taxonomy:
+
+ - `anger`
+ - `disgust`
+ - `fear`
+ - `joy`
+ - `neutral`
+ - `sadness`
+
+ For the Spanish multimodal emotion recognition setup, the same aligned speech-text samples were used for both the acoustic and textual modalities:
+
+ - Training samples: 3,692
+ - Validation samples: 410
+ - Test samples: 1,027
+
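For orientation, the split sizes listed above work out to roughly a 72/8/20 partition. The short snippet below only redoes that arithmetic from the three reported counts; the dictionary keys are names chosen for this example.

```python
# Sample counts as reported in this card.
splits = {"train": 3692, "validation": 410, "test": 1027}

total = sum(splits.values())
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name:<10} {count:>5}  {shares[name]:.1%}")
print(f"{'total':<10} {total:>5}")
```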
+ More details about the dataset and preprocessing pipeline are available in the project repository:
+
+ https://github.com/NLP-UMUTeam/umuteam-speech-emotion
+
+ ## Evaluation
+
+ The model was evaluated on the Spanish held-out test set used in the `speech-emotion` toolkit.
+
+ ### Performance comparison on Spanish emotion recognition
+
+ All scores are percentages.
+
+ | Configuration | Accuracy | Weighted Precision | Weighted F1 | Macro F1 |
+ |---|---:|---:|---:|---:|
+ | Speech-only | 88.1207 | 88.3244 | 88.1357 | 84.4829 |
+ | Text-only | 77.0204 | 77.0449 | 76.8367 | 69.3886 |
+ | Multimodal (Concat) | **90.0682** | **90.2048** | **90.0642** | **87.7455** |
+ | Multimodal (Mean) | 88.5102 | 88.6163 | 88.5011 | 84.1653 |
+ | Multimodal (Multihead) | 82.6680 | 82.3820 | 82.4600 | 75.5606 |
+
+ The concatenation-based model outperforms both unimodal baselines on every reported metric, which suggests that acoustic and linguistic representations carry complementary information. The gain depends on the fusion strategy, however: mean fusion improves only marginally on speech-only accuracy (88.5102 vs. 88.1207) and has a slightly lower Macro F1 (84.1653 vs. 84.4829), and multihead fusion falls below the speech-only model on all four metrics.
+
+ Among the evaluated fusion strategies, the concatenation-based approach achieved the best performance on all reported metrics.
+
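To make the reported metrics concrete, the sketch below computes accuracy, weighted F1, and macro F1 from a list of true and predicted labels in plain Python. It is a minimal illustration under stated assumptions, not the toolkit's evaluation code: the function names and the six toy predictions are invented for the example, and the label set is taken from the true labels only.

```python
from collections import Counter
from typing import Dict, List, Sequence


def f1_for_label(y_true: Sequence[str], y_pred: Sequence[str], label: str) -> float:
    """F1 for one class: 2*TP / (2*TP + FP + FN), or 0.0 if the class never appears."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Return accuracy, support-weighted F1, and unweighted (macro) F1."""
    labels: List[str] = sorted(set(y_true))
    support = Counter(y_true)
    f1 = {label: f1_for_label(y_true, y_pred, label) for label in labels}
    n = len(y_true)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / n,
        "weighted_f1": sum(f1[label] * support[label] for label in labels) / n,
        "macro_f1": sum(f1.values()) / len(labels),
    }


# Toy data (assumed): the frequent class is perfect, one rare class is missed.
y_true = ["joy", "joy", "joy", "joy", "fear", "neutral"]
y_pred = ["joy", "joy", "joy", "joy", "neutral", "neutral"]

scores = summarize(y_true, y_pred)
print({name: round(value, 4) for name, value in scores.items()})
```

In this toy case macro F1 ends up well below weighted F1 because every class counts equally in the macro average, so one missed rare class weighs heavily. A gap of this kind between the two columns of the table is consistent with weaker performance on less frequent emotions, although the card does not report per-class scores.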
+ ## How to use
+
+ ```bash
+ pip install speech-emotion
+ ```
+
+ ### Multimodal emotion recognition using audio and text
+
+ ```python
+ from speech_emotion import predict_emotion
+
+ # Classify one utterance from its audio file and a manual transcription.
+ emotion = predict_emotion(
+     audio_path="audio.wav",
+     text="Estoy muy feliz de verte de nuevo.",
+     language="es",
+     mode="concat",
+     model_config_path="model.json"
+ )
+
+ print("Detected emotion:", emotion)
+ ```
+
+ ### Multimodal emotion recognition using automatic transcription (Whisper)
+
+ If no transcription is provided, the toolkit can automatically generate it using Whisper before performing emotion recognition.
+
+ ```python
+ from speech_emotion import predict_emotion
+
+ # Omit `text` so the toolkit transcribes the audio with Whisper first.
+ emotion = predict_emotion(
+     audio_path="audio.wav",
+     language="es",
+     mode="concat",
+     model_config_path="model.json"
+ )
+
+ print("Detected emotion:", emotion)
+ ```
+
+ Repository:
+
+ https://github.com/NLP-UMUTeam/umuteam-speech-emotion
+
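When several recordings need to be classified, the two call patterns above can be wrapped in a small loop. The helper below is a hypothetical convenience function, not part of the toolkit: `classify_batch` and `stub_predict` are names invented for this sketch, and the prediction function is passed in as an argument so the example runs without the package installed. It forwards only the keyword arguments shown in the examples above and omits `text` when no transcription is available.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple


def classify_batch(
    items: Iterable[Tuple[str, Optional[str]]],
    predict: Callable[..., object],
    language: str = "es",
    mode: str = "concat",
    model_config_path: str = "model.json",
) -> Dict[str, object]:
    """Run `predict` once per (audio_path, text) pair and collect results by path."""
    results: Dict[str, object] = {}
    for audio_path, text in items:
        kwargs = {
            "audio_path": audio_path,
            "language": language,
            "mode": mode,
            "model_config_path": model_config_path,
        }
        if text is not None:
            # Pass the transcription only when one exists; otherwise leave it
            # out so the callee can fall back to automatic transcription.
            kwargs["text"] = text
        results[audio_path] = predict(**kwargs)
    return results


def stub_predict(**kwargs: object) -> str:
    """Stand-in for the real predictor, used only to make this sketch runnable."""
    return "joy" if "text" in kwargs else "neutral"


results = classify_batch(
    [("a.wav", "Estoy muy feliz de verte de nuevo."), ("b.wav", None)],
    predict=stub_predict,
)
print(results)
```

With the toolkit installed, `predict_emotion` imported from `speech_emotion` could be passed as `predict` in place of the stub.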
+ ## Limitations
+
+ - The model is designed for Spanish multimodal emotion recognition and may not generalize reliably to other languages.
+ - It predicts a single label from a fixed set of six emotions.
+ - Emotion expression is subjective and highly context-dependent.
+ - Performance may decrease with noisy audio, inaccurate transcriptions, overlapping speakers, or domain shifts.
+ - The model assumes that audio and text inputs are semantically aligned.
+ - Errors in automatic speech transcription may negatively affect multimodal performance.
+
+ ## Bias and ethical considerations
+
+ Emotion recognition systems may reflect biases present in their training data, including differences related to accents, speaking styles, demographics, recording conditions, or annotation subjectivity.
+
+ Users should avoid interpreting predictions as objective truths about a person's internal emotional state. The model should be used with transparency, appropriate consent, and human oversight, especially in sensitive contexts.
+
+ ## Citation
+
+ If you use this model in your research, please cite the following works:
+
+ ### speech-emotion toolkit
+
+ ```bibtex
+ @article{PAN2026102677,
+   title = {speech-emotion: A multilingual and multimodal toolkit for emotion recognition from speech},
+   journal = {SoftwareX},
+   volume = {34},
+   pages = {102677},
+   year = {2026},
+   issn = {2352-7110},
+   doi = {https://doi.org/10.1016/j.softx.2026.102677},
+   url = {https://www.sciencedirect.com/science/article/pii/S235271102600169X},
+   author = {Ronghao Pan and Tomás Bernal-Beltrán and José Antonio García-Díaz and Rafael Valencia-García},
+ }
+ ```
+
+ ### Spanish MEACorpus 2023
+
+ ```bibtex
+ @article{PAN2024103856,
+   title = {Spanish MEACorpus 2023: A multimodal speech–text corpus for emotion analysis in Spanish from natural environments},
+   journal = {Computer Standards \& Interfaces},
+   volume = {90},
+   pages = {103856},
+   year = {2024},
+   issn = {0920-5489},
+   doi = {https://doi.org/10.1016/j.csi.2024.103856},
+   url = {https://www.sciencedirect.com/science/article/pii/S0920548924000254},
+   author = {Ronghao Pan and José Antonio García-Díaz and Miguel Ángel Rodríguez-García and Rafael Valencia-García},
+ }
+ ```
+
+ ## Acknowledgments
+
+ This work is part of the research project LaTe4PoliticES (PID2022-138099OB-I00), funded by MICIU/AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF/EU - FEDER/UE), “A way of making Europe”.
+
+ Mr. Tomás Bernal-Beltrán is supported by the University of Murcia through the predoctoral programme.