---
language:
- pt
license: apache-2.0
library_name: scikit-learn
pipeline_tag: text-classification
tags:
- mlp
- tfidf
- scikit-learn
- portuguese
- pt
- fake-news
- binary-classification
metrics:
- accuracy
- precision
- recall
- f1-score
datasets: vzani/corpus-faketrue-br
model-index:
- name: portuguese-fake-news-classifier-mlp-tfidf-faketrue-br
  results:
  - task:
      type: text-classification
    dataset:
      name: FakeTrue.Br
      type: vzani/corpus-faketrue-br
      split: test
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.95258
    - name: precision_macro
      type: precision
      value: 0.952597
      args:
        average: macro
    - name: recall_macro
      type: recall
      value: 0.952576
      args:
        average: macro
    - name: f1_macro
      type: f1
      value: 0.952579
      args:
        average: macro
    - name: precision_weighted
      type: precision
      value: 0.952594
      args:
        average: weighted
    - name: recall_weighted
      type: recall
      value: 0.95258
      args:
        average: weighted
    - name: f1_weighted
      type: f1
      value: 0.95258
      args:
        average: weighted
    - name: n_test_samples
      type: num
      value: 717
---
# MLP (TF-IDF) for Fake News Detection (Portuguese)

## Model Overview

This repository contains an **MLP classifier trained on TF-IDF features** for **fake news detection in Portuguese**.
The model is trained and evaluated on a corpus derived from the Brazilian Portuguese dataset **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.

- **Architecture**: Multi-Layer Perceptron (scikit-learn)
- **Features**: TF-IDF over unigrams/bigrams/trigrams
- **Task**: Binary text classification (Fake vs. True)
- **Language**: Portuguese (`pt`)
- **Framework**: scikit-learn
- **Training source code**: https://github.com/viniciuszani/portuguese-fake-new-classifiers

---

## Available Variants

- [**mlp-tfidf-combined**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-combined)
  Trained on the [combined dataset](https://huggingface.co/datasets/vzani/corpus-combined), which merges Fake.br and FakeTrue.Br.

- [**mlp-tfidf-fake-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br)
  Trained on the [Fake.br dataset](https://huggingface.co/datasets/vzani/corpus-fake-br).

- [**mlp-tfidf-faketrue-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br)
  Trained on the [FakeTrue.Br dataset](https://huggingface.co/datasets/vzani/corpus-faketrue-br).

Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.

---

## Training Details

```python
{
    "n_layers": 2,
    "first_layer_size": 128,
    "second_layer_size": 64,
    "ngram_upper": 3,
    "min_df": 5,
    "max_df": 0.991954939032491,
    "activation": "relu",
    "solver": "lbfgs",
    "alpha": 0.00014375816817663168,
    "learning_rate_init": 0.005261446157045498,
}
```
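
As a rough illustration, the hyperparameters above could map onto a scikit-learn pipeline as sketched below. This is an assumption about how the values are wired together, not the author's training code: the `build_pipeline` helper, the step names `tfidf` and `mlp`, and the reading of `n_layers` as "number of hidden layers to keep" are all hypothetical. Note also that scikit-learn documents `learning_rate_init` as used only by the `sgd` and `adam` solvers, so with `lbfgs` it is accepted but has no effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hyperparameters copied from the block above.
PARAMS = {
    "n_layers": 2,
    "first_layer_size": 128,
    "second_layer_size": 64,
    "ngram_upper": 3,
    "min_df": 5,
    "max_df": 0.991954939032491,
    "activation": "relu",
    "solver": "lbfgs",
    "alpha": 0.00014375816817663168,
    "learning_rate_init": 0.005261446157045498,
}


def build_pipeline(params: dict) -> Pipeline:
    """Hypothetical mapping from the hyperparameters to a pipeline."""
    # Assumption: n_layers selects how many of the layer sizes are used.
    layer_sizes = (params["first_layer_size"], params["second_layer_size"])
    hidden_layer_sizes = layer_sizes[: params["n_layers"]]

    vectorizer = TfidfVectorizer(
        ngram_range=(1, params["ngram_upper"]),  # unigrams up to trigrams
        min_df=params["min_df"],
        max_df=params["max_df"],
    )
    classifier = MLPClassifier(
        hidden_layer_sizes=hidden_layer_sizes,
        activation=params["activation"],
        solver=params["solver"],
        alpha=params["alpha"],
        # Ignored by lbfgs; kept only to mirror the reported configuration.
        learning_rate_init=params["learning_rate_init"],
    )
    # Step names are illustrative, not taken from the published model.
    return Pipeline([("tfidf", vectorizer), ("mlp", classifier)])


pipeline = build_pipeline(PARAMS)
```

Bundling the vectorizer and the classifier in one `Pipeline` would be consistent with the usage example in this card, where raw strings are passed directly to `predict_proba`.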

---

## Evaluation Results

Evaluation metrics are stored in the repo as:
- `confusion_matrix.png`
- `final_classification_report.parquet`
- `final_predictions.parquet`

These files provide per-class performance and prediction logs for reproducibility.
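
The Parquet artifacts can be inspected with pandas. The sketch below is a minimal, hedged example: `load_artifact` is a hypothetical helper, and it makes no assumption about the column names inside the files, which are not documented here. It requires `pandas`, `huggingface_hub`, and a Parquet engine such as `pyarrow`.

```python
import pandas as pd
from huggingface_hub import hf_hub_download

REPO_ID = "vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br"


def load_artifact(filename: str, repo_id: str = REPO_ID) -> pd.DataFrame:
    """Download a Parquet artifact from the model repo and load it."""
    # hf_hub_download caches the file locally and returns its path.
    local_path = hf_hub_download(repo_id=repo_id, filename=filename)
    return pd.read_parquet(local_path)
```

For example, `load_artifact("final_classification_report.parquet")` returns the classification report as a DataFrame, and `load_artifact("final_predictions.parquet").head()` shows the first logged predictions.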

---

## How to Use

This model is stored as `final_model.joblib`.

```python
import joblib
from huggingface_hub import hf_hub_download

repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br"  # or fake-br / combined
filename = "final_model.joblib"

model_path = hf_hub_download(repo_id=repo_id, filename=filename)
clf = joblib.load(model_path)


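def predict(text: str) -> tuple[bool, float]:
    # Column 1 of predict_proba is the probability that the text is true news.
    prob = clf.predict_proba([text])[0][1]
    pred = prob >= 0.5

    # Report the probability of the predicted class: for a "fake" prediction,
    # convert P(true) into P(fake).
    prob = prob if pred else 1 - prob
    return bool(pred), float(prob)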


if __name__ == "__main__":
    text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
    print(predict(text))

```

The output is a tuple: the first entry is the classification (`True` for true news, `False` for fake news) and the second is the probability assigned to the predicted class, between 0 and 1.
```
(False, 1.0)
```
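
When classifying many texts, a single `predict_proba` call over the whole list avoids repeated per-text calls. The sketch below is a hypothetical `predict_batch` helper, not part of the published code. It assumes, as in `predict` above, that column 1 of `predict_proba` holds the probability of true news, and it takes the classifier as an argument so it works with any object exposing `predict_proba`.

```python
from typing import Sequence


def predict_batch(
    clf, texts: Sequence[str], threshold: float = 0.5
) -> list[tuple[bool, float]]:
    """Classify several texts at once; returns (is_true, confidence) pairs."""
    # One call for the whole batch; each row is [P(fake), P(true)] (assumed).
    probabilities = clf.predict_proba(list(texts))

    results = []
    for row in probabilities:
        p_true = float(row[1])
        is_true = p_true >= threshold
        # Confidence is the probability of whichever class was predicted.
        confidence = p_true if is_true else 1.0 - p_true
        results.append((is_true, confidence))
    return results
```

With the model loaded as `clf`, calling `predict_batch(clf, ["first text", "second text"])` returns one tuple per input, in the same order and in the same format as `predict`.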

## Source code

You can find the source code that produced this model in the repository below:
- https://github.com/viniciuszani/portuguese-fake-new-classifiers

The source covers every step: data collection, evaluation, hyperparameter tuning, final model training, and publishing to Hugging Face.
If you use it, please remember to credit the author and/or cite the work.

## License

- Model (scikit-learn MLP on TF-IDF features, no pretrained base model): [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- Trained models and corpora: Released under the same license for academic and research use.

## Citation

```bibtex
@misc{zani2025portuguesefakenews,
  author       = {ZANI, Vinícius Augusto Tagliatti},
  title        = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
  year         = {2025},
  pages        = {61},
  address      = {São Carlos},
  school       = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
  type         = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
  note         = {Orientador: Prof. Dr. Ivandre Paraboni}
}
```