vzani committed on
Commit
bd71d46
·
verified ·
1 Parent(s): 90363fa

Add model card (with dataset reference)

Files changed (1)
  1. README.md +184 -0
README.md ADDED
@@ -0,0 +1,184 @@
1
+ ---
2
+ language:
3
+ - pt
4
+ license: apache-2.0
5
+ library_name: scikit-learn
6
+ pipeline_tag: text-classification
7
+ tags:
8
+ - mlp
9
+ - tfidf
10
+ - scikit-learn
11
+ - portuguese
12
+ - pt
13
+ - fake-news
14
+ - binary-classification
15
+ metrics:
16
+ - accuracy
17
+ - precision
18
+ - recall
19
+ - f1-score
20
+ datasets: vzani/corpus-faketrue-br
21
+ model-index:
22
+ - name: portuguese-fake-news-classifier-mlp-tfidf-faketrue-br
23
+   results:
24
+   - task:
25
+       type: text-classification
26
+     dataset:
27
+       name: FakeTrue.Br
28
+       type: vzani/corpus-faketrue-br
29
+       split: test
30
+     metrics:
31
+     - name: accuracy
32
+       type: accuracy
33
+       value: 0.95258
34
+     - name: precision_macro
35
+       type: precision
36
+       value: 0.952597
37
+       args:
38
+         average: macro
39
+     - name: recall_macro
40
+       type: recall
41
+       value: 0.952576
42
+       args:
43
+         average: macro
44
+     - name: f1_macro
45
+       type: f1
46
+       value: 0.952579
47
+       args:
48
+         average: macro
49
+     - name: precision_weighted
50
+       type: precision
51
+       value: 0.952594
52
+       args:
53
+         average: weighted
54
+     - name: recall_weighted
55
+       type: recall
56
+       value: 0.95258
57
+       args:
58
+         average: weighted
59
+     - name: f1_weighted
60
+       type: f1
61
+       value: 0.95258
62
+       args:
63
+         average: weighted
64
+     - name: n_test_samples
65
+       type: num
66
+       value: 717
67
+ ---
68
+ # MLP (TF-IDF) for Fake News Detection (Portuguese)
69
+
70
+ ## Model Overview
71
+
72
+ This repository contains **MLP classifiers trained on TF-IDF features** for **fake news detection in Portuguese**.
73
+ Models are trained and evaluated on corpora derived from the Brazilian Portuguese datasets **[Fake.br](https://github.com/roneysco/Fake.br-Corpus)** and **[FakeTrue.Br](https://github.com/jpchav98/FakeTrue.Br/)**.
74
+
75
+ - **Architecture**: Multi-Layer Perceptron (scikit-learn)
76
+ - **Features**: TF-IDF over n-grams up to length 3 (`ngram_upper` in Training Details)
77
+ - **Task**: Binary text classification (Fake vs. True)
78
+ - **Language**: Portuguese (`pt`)
79
+ - **Framework**: scikit-learn
80
+
81
+ ---
82
+
83
+ ## Available Variants
84
+
85
+ - **mlp-tfidf-combined**
86
+ Trained on the aligned combined corpus.
87
+
88
+ - **mlp-tfidf-fake-br**
89
+ Trained on **Fake.br**.
90
+
91
+ - **mlp-tfidf-faketrue-br**
92
+ Trained on **FakeTrue.Br**.
93
+ Includes aligned splits and the original CSV when available.
94
+
95
+ Each variant ships with:
96
+ - `final_model.joblib`
97
+ - `confusion_matrix.png`
98
+ - `final_classification_report.parquet`
99
+ - `final_predictions.parquet`
100
+
101
+ ---
102
+
103
+ ## Training Details
104
+
105
+ ```python
106
+ {
107
+     "n_layers": 2,
108
+     "first_layer_size": 128,
109
+     "second_layer_size": 64,
110
+     "ngram_upper": 3,
111
+     "min_df": 5,
112
+     "max_df": 0.991954939032491,
113
+     "activation": "relu",
114
+     "solver": "lbfgs",
115
+     "alpha": 0.00014375816817663168,
116
+     "learning_rate_init": 0.005261446157045498,
117
+ }
118
+ ```
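The training script is not included here, so the following is only a sketch of how these hyperparameters might map onto scikit-learn objects. The function name `build_pipeline`, the step names `tfidf` and `mlp`, and the n-gram lower bound of 1 are assumptions made for illustration. One point that is documented scikit-learn behavior: `learning_rate_init` is only used by the `sgd` and `adam` solvers, so with `solver="lbfgs"` it has no effect.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Hyperparameters copied from the dictionary above.
params = {
    "n_layers": 2,
    "first_layer_size": 128,
    "second_layer_size": 64,
    "ngram_upper": 3,
    "min_df": 5,
    "max_df": 0.991954939032491,
    "activation": "relu",
    "solver": "lbfgs",
    "alpha": 0.00014375816817663168,
    "learning_rate_init": 0.005261446157045498,
}


def build_pipeline(p: dict) -> Pipeline:
    """Assumed mapping from the listed hyperparameters to a scikit-learn Pipeline."""
    # Keep only as many hidden layers as n_layers asks for.
    layer_sizes = (p["first_layer_size"], p["second_layer_size"])[: p["n_layers"]]
    vectorizer = TfidfVectorizer(
        ngram_range=(1, p["ngram_upper"]),  # assumption: lower bound is 1
        min_df=p["min_df"],
        max_df=p["max_df"],
    )
    mlp = MLPClassifier(
        hidden_layer_sizes=layer_sizes,
        activation=p["activation"],
        solver=p["solver"],
        alpha=p["alpha"],
        # Accepted by the constructor but unused when solver="lbfgs".
        learning_rate_init=p["learning_rate_init"],
    )
    return Pipeline([("tfidf", vectorizer), ("mlp", mlp)])


pipeline = build_pipeline(params)
print(pipeline)
```

The pipeline above is unfitted; the published `final_model.joblib` should be preferred for inference.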
119
+
120
+ ---
121
+
122
+ ## Evaluation Results
123
+
124
+ Evaluation metrics are stored in the repo as:
125
+ - `confusion_matrix.png`
126
+ - `final_classification_report.parquet`
127
+ - `final_predictions.parquet`
128
+
129
+ These files provide per-class performance and prediction logs for reproducibility.
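The metadata reports both macro and weighted averages. As a minimal sketch of how the two differ, the example below uses made-up per-class scores and supports; it does not read `final_classification_report.parquet`, whose schema is not documented here, and the helper names are hypothetical.

```python
def macro_average(scores: dict[str, float]) -> float:
    """Unweighted mean of per-class scores: every class counts equally."""
    return sum(scores.values()) / len(scores)


def weighted_average(scores: dict[str, float], support: dict[str, int]) -> float:
    """Mean of per-class scores weighted by the number of samples in each class."""
    total = sum(support.values())
    return sum(scores[label] * support[label] for label in scores) / total


# Hypothetical values, chosen only to make the difference visible.
f1_by_class = {"fake": 0.90, "true": 0.80}
support = {"fake": 300, "true": 100}

macro = macro_average(f1_by_class)
weighted = weighted_average(f1_by_class, support)
print(f"macro={macro:.3f} weighted={weighted:.3f}")
```

With imbalanced classes the two averages diverge. The nearly identical macro and weighted values reported for this model are consistent with a roughly balanced test split or with very similar per-class scores.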
130
+
131
+ ---
132
+
133
+ ## Corpus
134
+
135
+ The corpora used for training and evaluation are provided in the `corpus/` folder.
136
+
137
+ - **Combined (root folder)**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
138
+ - **Fake.br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet`.
139
+ - **FakeTrue.Br**: `corpus_train_df.parquet`, `corpus_test_df.parquet`, `corpus_df.parquet`, `corpus_alinhado_df.parquet` and `FakeTrueBr_corpus.csv`.
140
+
141
+ ---
142
+
143
+ ## How to Use
144
+
145
+ This model is a **scikit-learn** model stored as `final_model.joblib`.
146
+
147
+ ```python
148
+ import joblib
149
+ from huggingface_hub import hf_hub_download
150
+
151
+ repo_id = "vzani/portuguese-fake-news-classifier-mlp-tfidf-combined" # or fake-br / faketrue-br
152
+ filename = "final_model.joblib"
153
+
154
+ model_path = hf_hub_download(repo_id=repo_id, filename=filename)
155
+ clf = joblib.load(model_path) # Pipeline or bare estimator
156
+
157
+
158
+ def predict(text: str) -> tuple[bool, float]:
159
+     prob = clf.predict_proba([text])[0][1]
160
+     pred = prob >= 0.5
161
+
162
+     # For a Fake prediction, report the probability of the Fake class instead
163
+     prob = prob if pred else 1 - prob
164
+     return bool(pred), float(prob)
165
+
166
+
167
+ if __name__ == "__main__":
168
+     text = "BOMBA! A Dilma vai taxar ainda mais os pobres!"
169
+     print(predict(text))
170
+
171
+ ```
172
+
173
+ The expected output is a tuple whose first entry is the classification (`True` for true news and `False` for fake news) and whose second entry is the probability assigned to the predicted class (between 0 and 1).
174
+ ```
175
+ (False, 1.0)
176
+ ```
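The conversion from class probability to reported confidence can be checked offline. The sketch below replaces the downloaded model with a hypothetical `StubClassifier` that returns fixed probabilities, and it assumes, as the `predict` function above does, that column 1 of `predict_proba` is the "true news" class.

```python
class StubClassifier:
    """Hypothetical stand-in returning fixed [P(class 0), P(class 1)] rows."""

    def __init__(self, p_class1: float) -> None:
        self.p_class1 = p_class1

    def predict_proba(self, texts: list[str]) -> list[list[float]]:
        return [[1 - self.p_class1, self.p_class1] for _ in texts]


def predict_with(clf, text: str) -> tuple[bool, float]:
    """Same logic as predict() above, with the classifier passed in explicitly."""
    prob = clf.predict_proba([text])[0][1]
    pred = prob >= 0.5
    # Report the probability of whichever class was predicted.
    confidence = prob if pred else 1 - prob
    return bool(pred), float(confidence)


likely_true = predict_with(StubClassifier(0.75), "exemplo")
likely_fake = predict_with(StubClassifier(0.25), "exemplo")
print(likely_true)  # (True, 0.75)
print(likely_fake)  # (False, 0.75)
```

Because the second entry always refers to the predicted class, it never falls below 0.5.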
177
+
178
+ ## License
179
+
180
+ [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
181
+
182
+ ## Citation
183
+
184
+ Coming soon.