vzani/portuguese-fake-news-classifier-mlp-tfidf-faketrue-br

+---
+language:
+- pt
+license: apache-2.0
+library_name: scikit-learn
+pipeline_tag: text-classification
+tags:
+- mlp
+- tfidf
+- scikit-learn
+- portuguese
+- pt
+- fake-news
+- binary-classification
+metrics:
+- accuracy
+- precision
+- recall
+- f1-score
+datasets:
+- vzani/corpus-faketrue-br
+model-index:
+- name: portuguese-fake-news-classifier-mlp-tfidf-faketrue-br
+  results:
+  - task:
+      type: text-classification
+    dataset:
+      name: FakeTrue.Br
+      type: vzani/corpus-faketrue-br
+      split: test
+    metrics:
+    - name: accuracy
+      type: accuracy
+      value: 0.95258
+    - name: precision_macro
+      type: precision
+      value: 0.952597
+      args:
+        average: macro
+    - name: recall_macro
+      type: recall
+      value: 0.952576
+      args:
+        average: macro
+    - name: f1_macro
+      type: f1
+      value: 0.952579
+      args:
+        average: macro
+    - name: precision_weighted
+      type: precision
+      value: 0.952594
+      args:
+        average: weighted
+    - name: recall_weighted
+      type: recall
+      value: 0.95258
+      args:
+        average: weighted
+    - name: f1_weighted
+      type: f1
+      value: 0.95258
+      args:
+        average: weighted
+    - name: n_test_samples
+      type: num
+      value: 717
+---
+# MLP (TF-IDF) for Fake News Detection (Portuguese)
+## Model Overview
+This repository contains **MLP classifiers trained on TF-IDF features** for **fake news detection in Portuguese**.
+Models are trained and evaluated on corpora derived from Brazilian Portuguese datasets **[Fake.br](https://github.com/roneysco/Fake.br-Corpus)** and **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.
+- **Architecture**: Multi-Layer Perceptron (scikit-learn)
+- **Features**: TF-IDF over n-grams (the upper n-gram size is a tuned hyperparameter; see `ngram_upper` under Training Details)
+- **Task**: Binary text classification (Fake vs. True)
+- **Language**: Portuguese (`pt`)
+- **Framework**: scikit-learn
+---
+## Available Variants
+- **mlp-tfidf-combined**
+  Trained on the aligned combined corpus.
+- **mlp-tfidf-fake-br**
+  Trained on **Fake.br**.
+- **mlp-tfidf-faketrue-br**
+  Trained on **FakeTrue.Br**.
+  Includes aligned splits and the original CSV when available.
+Each variant ships with:
+- `final_model.joblib`
+- `confusion_matrix.png`
+- `final_classification_report.parquet`
+- `final_predictions.parquet`
+---
+## Training Details
+```python
+{
+    "n_layers": 2,
+    "first_layer_size": 128,
+    "second_layer_size": 64,
+    "ngram_upper": 3,
+    "min_df": 5,
+    "max_df": 0.991954939032491,
+    "activation": "relu",
+    "solver": "lbfgs",
+    "alpha": 0.00014375816817663168,
+    "learning_rate_init": 0.005261446157045498,
+}
+```
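As a rough illustration of how these hyperparameters might map onto scikit-learn objects, the sketch below builds an unfitted pipeline. The mapping is an assumption, not the author's training code: `first_layer_size`/`second_layer_size` are taken as `hidden_layer_sizes`, `ngram_upper` as the upper bound of `ngram_range`, and the step names `tfidf` and `mlp` are hypothetical. Note that scikit-learn only uses `learning_rate_init` with the `sgd` and `adam` solvers, so it has no effect with `lbfgs`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hyperparameters copied from the model card.
params = {
    "n_layers": 2,
    "first_layer_size": 128,
    "second_layer_size": 64,
    "ngram_upper": 3,
    "min_df": 5,
    "max_df": 0.991954939032491,
    "activation": "relu",
    "solver": "lbfgs",
    "alpha": 0.00014375816817663168,
    "learning_rate_init": 0.005261446157045498,
}


def build_pipeline(p: dict) -> Pipeline:
    """Build an unfitted TF-IDF + MLP pipeline (assumed mapping, see lead-in)."""
    # Assumption: the per-layer sizes describe the hidden layers, in order.
    layer_keys = ["first_layer_size", "second_layer_size"][: p["n_layers"]]
    hidden = tuple(p[key] for key in layer_keys)
    vectorizer = TfidfVectorizer(
        ngram_range=(1, p["ngram_upper"]),  # assumption: lower bound is 1
        min_df=p["min_df"],
        max_df=p["max_df"],
    )
    mlp = MLPClassifier(
        hidden_layer_sizes=hidden,
        activation=p["activation"],
        solver=p["solver"],
        alpha=p["alpha"],
        learning_rate_init=p["learning_rate_init"],  # ignored by lbfgs
    )
    return Pipeline([("tfidf", vectorizer), ("mlp", mlp)])


pipeline = build_pipeline(params)
```

Calling `pipeline.fit(texts, labels)` on a training split would then train both steps together; settings not listed in the card (for example `max_iter`) fall back to scikit-learn defaults here.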
+---
+## Evaluation Results
+Evaluation metrics are stored in the repo as:
+- `confusion_matrix.png`
+- `final_classification_report.parquet`
+- `final_predictions.parquet`
+These files provide per-class performance and prediction logs for reproducibility.
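A prediction log like this could be re-scored along the following lines. This is only a sketch: the schema of `final_predictions.parquet` is not documented in this card, so the column names `y_true` and `y_pred` are hypothetical, and a small in-memory frame stands in for the real file.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def summarize_predictions(
    df: pd.DataFrame, true_col: str = "y_true", pred_col: str = "y_pred"
) -> dict[str, float]:
    """Compute accuracy and macro-averaged precision/recall/F1 from a prediction log."""
    y_true, y_pred = df[true_col], df[pred_col]
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": float(accuracy_score(y_true, y_pred)),
        "precision_macro": float(precision),
        "recall_macro": float(recall),
        "f1_macro": float(f1),
    }


# With the real file this would be: df = pd.read_parquet("final_predictions.parquet")
# (column names would need to be adjusted to the actual schema).
example = pd.DataFrame({"y_true": [0, 1, 1, 0], "y_pred": [0, 1, 0, 0]})
summary = summarize_predictions(example)
```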
+---
+## Corpus
+The corpora used for training and evaluation are provided in the `corpus/` folder.
+- **Combined (root folder)**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
+- **Fake.br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
+- **FakeTrue.Br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet` and `FakeTrueBr_corpus.csv`.
+---
+## How to Use
+This model is a **scikit-learn** model stored as `final_model.joblib`.
+```python
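+import joblib
+from huggingface_hub import hf_hub_download
+
+# Download the serialized model from the Hub (cached locally after the first call).
+repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-combined"  # or fake-br / faketrue-br
+filename = "final_model.joblib"
+model_path = hf_hub_download(repo_id=repo_id, filename=filename)
+# Pipeline or bare estimator; predict() below passes raw text,
+# so the loaded object must include the TF-IDF step.
+clf = joblib.load(model_path)
+
+
+def predict(text: str) -> tuple[bool, float]:
+    """Return (is_true, probability assigned to the predicted class)."""
+    # Column 1 of predict_proba is taken as the probability of the "True" class.
+    prob = clf.predict_proba([text])[0][1]
+    pred = prob >= 0.5
+    # When the prediction is Fake, report the probability of Fake instead.
+    prob = prob if pred else 1 - prob
+    return bool(pred), float(prob)
+
+
+if __name__ == "__main__":
+    text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
+    print(predict(text))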
+```
+The function returns a tuple: the first entry is the classification (`True` for true news, `False` for fake news) and the second is the probability assigned to the predicted class (between 0 and 1).
+```
+(False, 1.0)
+```
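For scoring several texts at once, a variant of `predict` could look up the "True" column through `classes_` instead of hard-coding index 1. This is a sketch under assumptions: the label encoding of the published models is not stated in this card, so `true_label=1` is a guess, `predict_batch` is a hypothetical helper, and a toy logistic-regression pipeline stands in for the downloaded model so the example runs offline.

```python
from typing import Any, Sequence

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def predict_batch(
    clf: Any, texts: Sequence[str], true_label: Any = 1
) -> list[tuple[bool, float]]:
    """Return (is_true, probability of the predicted class) for each text."""
    # Locate the column of predict_proba that corresponds to the "True" label.
    true_idx = list(clf.classes_).index(true_label)
    probs = clf.predict_proba(list(texts))[:, true_idx]
    results = []
    for p in probs:
        is_true = bool(p >= 0.5)
        # Report the probability of whichever class was predicted.
        results.append((is_true, float(p if is_true else 1.0 - p)))
    return results


# Toy stand-in for the downloaded model (0 = fake, 1 = true); not the real classifier.
toy_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
toy_clf.fit(
    [
        "fake bomba urgente",
        "governo publica relatorio",
        "bomba fake chocante",
        "relatorio oficial publicado",
    ],
    [0, 1, 0, 1],
)
results = predict_batch(toy_clf, ["bomba fake", "relatorio oficial"])
```

With the real model, `toy_clf` would be replaced by the object returned from `joblib.load`, after checking its `classes_` attribute to confirm which label means "True".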
+## License
+[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
+## Citation
+Coming soon.