---
language:
- pt
license: apache-2.0
library_name: scikit-learn
pipeline_tag: text-classification
tags:
- mlp
- tfidf
- scikit-learn
- portuguese
- pt
- fake-news
- binary-classification
metrics:
- accuracy
- precision
- recall
- f1-score
datasets:
- vzani/corpus-faketrue-br
model-index:
- name: portuguese-fake-news-classifier-mlp-tfidf-faketrue-br
  results:
  - task:
      type: text-classification
    dataset:
      name: FakeTrue.Br
      type: vzani/corpus-faketrue-br
      split: test
    metrics:
    - name: accuracy
      type: accuracy
      value: 0.95258
    - name: precision_macro
      type: precision
      value: 0.952597
      args:
        average: macro
    - name: recall_macro
      type: recall
      value: 0.952576
      args:
        average: macro
    - name: f1_macro
      type: f1
      value: 0.952579
      args:
        average: macro
    - name: precision_weighted
      type: precision
      value: 0.952594
      args:
        average: weighted
    - name: recall_weighted
      type: recall
      value: 0.95258
      args:
        average: weighted
    - name: f1_weighted
      type: f1
      value: 0.95258
      args:
        average: weighted
    - name: n_test_samples
      type: num
      value: 717
---
# MLP (TF-IDF) for Fake News Detection (Portuguese)
## Model Overview
This repository contains an **MLP classifier trained on TF-IDF features** for **fake news detection in Portuguese**.
The model is trained and evaluated on a corpus derived from the Brazilian Portuguese dataset **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.
- **Architecture**: Multi-Layer Perceptron (scikit-learn)
- **Features**: TF-IDF over unigrams/bigrams/trigrams
- **Task**: Binary text classification (Fake vs. True)
- **Language**: Portuguese (`pt`)
- **Framework**: scikit-learn
- **Training source code**: https://github.com/viniciuszani/portuguese-fake-new-classifiers
---
## Available Variants
- [**mlp-tfidf-combined**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-combined)
Trained on the [combined dataset](https://huggingface.co/datasets/vzani/corpus-combined) built from Fake.br and FakeTrue.Br.
- [**mlp-tfidf-fake-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-fake-br)
Trained on the [Fake.br dataset](https://huggingface.co/datasets/vzani/corpus-fake-br).
- [**mlp-tfidf-faketrue-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br)
Trained on the [FakeTrue.Br dataset](https://huggingface.co/datasets/vzani/corpus-faketrue-br).
Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.
---
## Training Details
```python
{
    "n_layers": 2,
    "first_layer_size": 128,
    "second_layer_size": 64,
    "ngram_upper": 3,
    "min_df": 5,
    "max_df": 0.991954939032491,
    "activation": "relu",
    "solver": "lbfgs",
    "alpha": 0.00014375816817663168,
    "learning_rate_init": 0.005261446157045498,
}
```
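Assuming the standard scikit-learn stack named above, these hyperparameters would map onto a `TfidfVectorizer` + `MLPClassifier` pipeline roughly as follows. This is a sketch, not the exact training code; see the linked repository for the authoritative version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Sketch of how the reported hyperparameters map onto scikit-learn objects.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1, 3),        # "ngram_upper": 3 -> unigrams to trigrams
        min_df=5,                  # "min_df"
        max_df=0.991954939032491,  # "max_df"
    )),
    ("mlp", MLPClassifier(
        hidden_layer_sizes=(128, 64),  # "n_layers": 2, sizes 128 and 64
        activation="relu",
        solver="lbfgs",
        alpha=0.00014375816817663168,
        # Note: learning_rate_init is only used by the "sgd" and "adam"
        # solvers, so it has no effect under lbfgs.
        learning_rate_init=0.005261446157045498,
    )),
])
```

After `pipe.fit(texts, labels)`, a pipeline like this can be serialized with `joblib.dump`, which would match the `final_model.joblib` artifact shipped in this repository.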
---
## Evaluation Results
Evaluation metrics are stored in the repo as:
- `confusion_matrix.png`
- `final_classification_report.parquet`
- `final_predictions.parquet`
These files provide per-class performance and prediction logs for reproducibility.
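The metadata above reports both macro and weighted averages for precision, recall, and F1. In scikit-learn terms, the difference is only the `average` argument; a minimal illustration on toy labels (not the model's actual predictions):

```python
from sklearn.metrics import f1_score

# Toy binary labels, purely to illustrate the two averaging modes.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]

# "macro": unweighted mean of the per-class scores.
macro = f1_score(y_true, y_pred, average="macro")
# "weighted": per-class scores weighted by class support.
weighted = f1_score(y_true, y_pred, average="weighted")
```

On a near-balanced test split such as this one (717 samples), the two averages are close, which is why the reported macro and weighted values differ only in the fifth decimal place.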
---
## How to Use
This model is stored as `final_model.joblib`.
```python
import joblib
from huggingface_hub import hf_hub_download

repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br"  # or fake-br / combined
filename = "final_model.joblib"

model_path = hf_hub_download(repo_id=repo_id, filename=filename)
clf = joblib.load(model_path)


def predict(text: str) -> tuple[bool, float]:
    # Probability of the positive ("true news") class
    prob = clf.predict_proba([text])[0][1]
    pred = prob >= 0.5
    # Report the probability of the predicted class
    prob = prob if pred else 1 - prob
    return bool(pred), float(prob)


if __name__ == "__main__":
    text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
    print(predict(text))
```
The output is a tuple whose first entry is the classification (`True` for true news, `False` for fake news) and whose second entry is the probability assigned to the predicted class (between 0.0 and 1.0):
```
(False, 1.0)
```
## Source code
You can find the source code that produced this model in the repository below:
- https://github.com/viniciuszani/portuguese-fake-new-classifiers
The repository covers every step: data collection, evaluation, hyperparameter tuning, final model training, and publishing to Hugging Face.
If you use it, please credit the author and/or cite the work.
## License
- Model: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- Corpora: released under the same license for academic and research use.
## Citation
```bibtex
@misc{zani2025portuguesefakenews,
  author  = {ZANI, Vinícius Augusto Tagliatti},
  title   = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
  year    = {2025},
  pages   = {61},
  address = {São Carlos},
  school  = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
  type    = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
  note    = {Orientador: Prof. Dr. Ivandre Paraboni}
}
```
|