vzani committed · Commit 7253aeb · verified · 1 parent: 6dc5ff1

Update README.md

Files changed (1): README.md (+29 −21)

README.md CHANGED
@@ -75,21 +75,20 @@ The models are trained and evaluated on corpora derived from Brazilian Portuguese
 - **Task**: Binary text classification (Fake vs. True news)
 - **Language**: Portuguese (`pt`)
 - **Framework**: 🤗 Transformers
+- **Training source code**: https://github.com/viniciuszani/portuguese-fake-new-classifiers
 
 ---
 
 ## Available Variants
 
-- **bertimbau-combined**
-  Fine-tuned on the aligned corpus (`data/corpus_train_df.parquet`, etc.).
+- [**bertimbau-combined**](https://huggingface.co/vzani/portuguese-fake-news-classifier-bertimbau-combined)
+  Fine-tuned on the [combined dataset](https://huggingface.co/datasets/vzani/corpus-combined) built from Fake.br and FakeTrue.Br.
 
-- **bertimbau-fake-br**
-  Fine-tuned on the **Fake.br** dataset.
-  Corpus is available in [`corpus/`](./corpus) with preprocessed and size-normalized versions.
+- [**bertimbau-fake-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-bertimbau-fake-br)
+  Fine-tuned on the [Fake.br dataset](https://huggingface.co/datasets/vzani/corpus-fake-br).
 
-- **bertimbau-faketrue-br**
-  Fine-tuned on the **FakeTrue.Br** dataset.
-  Includes both raw CSV and aligned corpus partitions.
+- [**bertimbau-faketrue-br**](https://huggingface.co/vzani/portuguese-fake-news-classifier-bertimbau-faketrue-br)
+  Fine-tuned on the [FakeTrue.Br dataset](https://huggingface.co/datasets/vzani/corpus-faketrue-br).
 
 Each variant has its own confusion matrix, classification report, and predictions stored as artifacts.
@@ -107,7 +106,7 @@ Each variant has its own confusion matrix, classification report, and predictions
 ```
 
 - **Base model**: `neuralmind/bert-base-portuguese-cased`
-- **Fine-tuning**: 3–5 epochs, batch size 16, AdamW optimizer
+- **Fine-tuning**: 7 epochs, batch size 16, AdamW optimizer, 4 layers tuned
 - **Sequence length**: 512
 - **Loss function**: Cross-entropy
 - **Evaluation metrics**: Accuracy, Precision, Recall, F1-score
@@ -117,7 +116,7 @@ Each variant has its own confusion matrix, classification report, and predictions
 
 ## Evaluation Results
 
-Evaluation metrics are stored in the repo as:
+Evaluation metrics are stored in the repo's `Files and Versions` section as:
 - `confusion_matrix.png`
 - `final_classification_report.parquet`
 - `final_predictions.parquet`
@@ -126,16 +125,6 @@ These files provide per-class performance and prediction logs for reproducibility
 
 ---
 
-## Corpus
-
-The corpora used for training and evaluation are provided in the `corpus/` folder.
-
-- **Combined (root folder)**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
-- **Fake.br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
-- **FakeTrue.Br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet` and `FakeTrueBr_corpus.csv`.
-
----
-
 ## How to Use
 
 ```python
@@ -170,6 +159,14 @@ The expected output is a Tuple where the first entry represents the classification
 (False, 0.9999247789382935)
 ```
 
+## Source code
+
+You can find the source code that produced this model in the repository below:
+- https://github.com/viniciuszani/portuguese-fake-new-classifiers
+
+The source covers every step, from data collection and evaluation to hyperparameter tuning, final model training, and publishing to Hugging Face.
+If you use it, please credit the author and/or cite the work.
+
 ## License
 
 - Base model BERTimbau: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
@@ -177,4 +174,15 @@ The expected output is a Tuple where the first entry represents the classification
 
 ## Citation
 
-Coming soon.
+```bibtex
+@misc{zani2025portuguesefakenews,
+  author  = {ZANI, Vinícius Augusto Tagliatti},
+  title   = {Avaliação comparativa de técnicas de processamento de linguagem natural para a detecção de notícias falsas em Português},
+  year    = {2025},
+  pages   = {61},
+  address = {São Carlos},
+  school  = {Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo},
+  type    = {Trabalho de Conclusão de Curso (MBA em Inteligência Artificial e Big Data)},
+  note    = {Orientador: Prof. Dr. Ivandre Paraboni}
+}
+```
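The "How to Use" snippet itself falls outside this diff's context lines, so the following is only a minimal sketch of loading one variant with the standard 🤗 Transformers `pipeline` API. The repo id is taken from the variant links above; the `"True"` label string used to build the `(classification, confidence)` tuple is an assumption, since the actual label names come from the model config and may differ (e.g. `LABEL_0`/`LABEL_1`).

```python
def to_tuple(result):
    # Map one pipeline output dict to the (classification, confidence)
    # tuple shown in the README, e.g. (False, 0.9999247789382935).
    # NOTE: the "True" label string is an assumption; check the model
    # config for the actual label names.
    return (result["label"] == "True", result["score"])

def main():
    # Heavy import kept local so the helper above stays dependency-free.
    from transformers import pipeline  # requires `pip install transformers`

    clf = pipeline(
        "text-classification",
        # Repo id from the bertimbau-combined variant link above.
        model="vzani/portuguese-fake-news-classifier-bertimbau-combined",
    )
    print(to_tuple(clf("Texto de uma notícia em português.")[0]))

if __name__ == "__main__":
    main()
```

The pipeline returns a list of `{"label": ..., "score": ...}` dicts, one per input text; `to_tuple` reshapes the first of them into the tuple format the README documents.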
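The diff lists the evaluation metrics (accuracy, precision, recall, F1-score) without spelling out how they relate to the stored prediction artifacts. As a sketch, assuming `final_predictions.parquet` yields (true label, predicted label) pairs and that the classes are named "Fake" and "True", the per-class numbers in a classification report reduce to:

```python
# Per-class metrics from (true_label, predicted_label) pairs, as they
# would appear in a classification report. The "Fake"/"True" class names
# are assumptions based on the task description above.

def per_class_metrics(pairs, positive):
    # Count true positives, false positives, and false negatives
    # with respect to the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(pairs):
    # Fraction of pairs where the prediction matches the true label.
    return sum(1 for t, p in pairs if t == p) / len(pairs)
```

On a toy set of four pairs with one error, `accuracy` gives 0.75, and for the "True" class precision is 2/3, recall 1.0, and F1 0.8 — the same quantities `final_classification_report.parquet` records per class.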