vzani committed · Commit 7253aeb · verified · 1 parent: 6dc5ff1

Update README.md

Files changed (1): README.md (+29 −21)

README.md CHANGED
@@ -75,21 +75,20 @@ The models are trained and evaluated on corpora derived from Brazilian Portuguese
 - **Task**: Binary text classification (Fake vs. True news)
 - **Language**: Portuguese (`pt`)
 - **Framework**: 🤗 Transformers
+- **Training source code**: https://github.com/viniciuszani/portuguese-fake-new-classifiers
 
 ---
 
 ## Available Variants
 
-- **bertimbau-combined**
-  Fine-tuned on the aligned corpus (`data/corpus_train_df.parquet`, etc.).
+- [**bertimbau-combined**](https://huggingface.co/vzani/portuguese-fake-news-classifier-bertimbau-combined)
+  Fine-tuned on the [combined dataset](https://huggingface.co/datasets/vzani/corpus-combined) built from Fake.br and FakeTrue.Br.
 
-- **bertimbau-fake-br**
-  Fine-tuned on the **Fake.br** dataset.
-  Corpus is available in [`corpus/`](./corpus) with preprocessed and size-normalized versions.
+- [**bertimbau-fake-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-bertimbau-fake-br)
+  Fine-tuned on the [Fake.br dataset](https://huggingface.co/datasets/vzani/corpus-fake-br).
 
-- **bertimbau-faketrue-br**
-  Fine-tuned on the **FakeTrue.Br** dataset.
-  Includes both raw CSV and aligned corpus partitions.
+- [**bertimbau-faketrue-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-bertimbau-faketrue-br)
+  Fine-tuned on the [FakeTrue.Br dataset](https://huggingface.co/datasets/vzani/corpus-faketrue-br).
 
 Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.
@@ -107,7 +106,7 @@ Each variant has its own confusion matrix, classification report, and predictions
 ```
 
 - **Base model**: `neuralmind/bert-base-portuguese-cased`
-- **Fine-tuning**: 3–5 epochs, batch size 16, AdamW optimizer
+- **Fine-tuning**: 7 epochs, batch size 16, AdamW optimizer, 4 layers tuned
 - **Sequence length**: 512
 - **Loss function**: Cross-entropy
 - **Evaluation metrics**: Accuracy, Precision, Recall, F1-score
@@ -117,7 +116,7 @@ Each variant has its own confusion matrix, classification report, and predictions
 
 ## Evaluation Results
 
-Evaluation metrics are stored in the repo as:
+Evaluation metrics are stored in the repo's `Files and Versions` section as:
 - `confusion_matrix.png`
 - `final_classification_report.parquet`
 - `final_predictions.parquet`
@@ -126,16 +125,6 @@ These files provide per-class performance and prediction logs for reproducibility
 
 ---
 
-## Corpus
-
-The corpora used for training and evaluation are provided in the `corpus/` folder.
-
-- **Combined (root folder)**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
-- **Fake.br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
-- **FakeTrue.Br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet` and `FakeTrueBr_corpus.csv`.
-
----
-
 ## How to Use
 
 ```python
@@ -170,6 +159,14 @@ The expected output is a Tuple where the first entry represents the classification
 (False, 0.9999247789382935)
 ```
 
+## Source code
+
+You can find the source code that produced this model in the repository below:
+- https://github.com/viniciuszani/portuguese-fake-new-classifiers
+
+The source covers every step, from data collection and evaluation to hyperparameter tuning, final model training, and publishing to Hugging Face.
+If you use it, please credit the author and/or cite the work.
+
 ## License
 
 - Base model BERTimbau: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
@@ -177,4 +174,15 @@ The expected output is a Tuple where the first entry represents the classification
 
 ## Citation
 
-Coming soon.
+```bibtex
+@misc{zani2025portuguesefakenews,
+  author  = {ZANI, Vinícius Augusto Tagliatti},
+  title   = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
+  year    = {2025},
+  pages   = {61},
+  address = {São Carlos},
+  school  = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
+  type    = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
+  note    = {Orientador: Prof. Dr. Ivandre Paraboni}
+}
+```
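The "How to Use" snippet itself falls outside this diff's context lines, so the following is only a minimal sketch of loading one variant with the standard 🤗 Transformers `pipeline` API. The repo id is taken from the variant links above; the `"True"` label string used to build the `(classification, confidence)` tuple is an assumption, since the actual label names come from the model config and may differ (e.g. `LABEL_0`/`LABEL_1`).

```python
def to_tuple(result):
    # Map one pipeline output dict to the (classification, confidence)
    # tuple shown in the README, e.g. (False, 0.9999247789382935).
    # NOTE: the "True" label string is an assumption; check the model
    # config for the actual label names.
    return (result["label"] == "True", result["score"])

def main():
    # Heavy import kept local so the helper above stays dependency-free.
    from transformers import pipeline  # requires `pip install transformers`

    clf = pipeline(
        "text-classification",
        # Repo id from the bertimbau-combined variant link above.
        model="vzani/portuguese-fake-news-classifier-bertimbau-combined",
    )
    print(to_tuple(clf("Texto de uma notícia em português.")[0]))

if __name__ == "__main__":
    main()
```

The pipeline returns a list of `{"label": ..., "score": ...}` dicts, one per input text; `to_tuple` reshapes the first of them into the tuple format the README documents.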
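The diff lists the evaluation metrics (accuracy, precision, recall, F1-score) without spelling out how they relate to the stored prediction artifacts. As a sketch, assuming `final_predictions.parquet` yields (true label, predicted label) pairs and that the classes are named "Fake" and "True", the per-class numbers in a classification report reduce to:

```python
# Per-class metrics from (true_label, predicted_label) pairs, as they
# would appear in a classification report. The "Fake"/"True" class names
# are assumptions based on the task description above.

def per_class_metrics(pairs, positive):
    # Count true positives, false positives, and false negatives
    # with respect to the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(pairs):
    # Fraction of pairs where the prediction matches the true label.
    return sum(1 for t, p in pairs if t == p) / len(pairs)
```

On a toy set of four pairs with one error, `accuracy` gives 0.75, and for the "True" class precision is 2/3, recall 1.0, and F1 0.8 — the same quantities `final_classification_report.parquet` records per class.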