ngocthanhdoan committed
Commit 0a8904e · 0 Parent(s)

chore: squash history — keep latest version only
.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,112 @@
+ ---
+ language: vi
+ tags:
+ - ner
+ - phobert
+ - vietnamese
+ - document-ai
+ - cccd
+ - synthetic-data
+ license: mit
+ base_model: vinai/phobert-base
+ ---
+
+ # VietNerm - Căn cước công dân NER Model
+
+ PhoBERT-based Named Entity Recognition model for Vietnamese **Căn cước công dân** (citizen identity card, CCCD) documents.
+
+ ## ⚠️ DISCLAIMER: SYNTHETIC / MOCKUP DATA
+
+ > **This model was trained entirely on synthetic/mockup data; NO real personal data was used.**
+
+ - All training data is **auto-generated** by a template + generator system
+ - **No** real documents, real personal information, or data collected from users
+ - Identification numbers (ID, CCCD, ...) are randomly generated and designed **not to collide** with real data
+ - OCR noise is injected into the data to simulate real-world conditions
+ - Purpose: **AI research, Document AI, and OCR/NER pipelines**
+ - **Must not** be used to forge documents, create fake papers, or commit fraud or deception
+
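+ For illustration, a minimal sketch of the kind of OCR-noise injection applied to the synthetic data (the actual generator lives in the VietNerm repository; the confusion pairs and rate below are assumptions):
+
+ ```python
+ import random
+
+ # Illustrative look-alike pairs; the real generator's confusion
+ # table and noise rate are not reproduced here.
+ OCR_CONFUSIONS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}
+
+ def inject_ocr_noise(text: str, rate: float = 0.05) -> str:
+     """Randomly swap characters with look-alikes to mimic OCR errors."""
+     return "".join(
+         OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and random.random() < rate else ch
+         for ch in text
+     )
+ ```
+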
+ ## Model Description
+
+ This model is fine-tuned from [`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base) for token-level NER on Vietnamese administrative documents, specifically Căn cước công dân. It extracts structured fields from OCR text output.
+
+ - **Base model**: vinai/phobert-base
+ - **Task**: Token Classification (NER)
+ - **Language**: Vietnamese (vi)
+ - **Document type**: Căn cước công dân
+ - **Number of labels**: 13
+ - **Training data**: Synthetic/Mockup (not real personal data)
+
+ ## Labels
+
+ The model uses a BIO tagging scheme with the following 13 labels:
+
+ - `O`
+ - `B-date_of_birth`
+ - `B-date_of_expiry`
+ - `B-full_name`
+ - `B-gender`
+ - `B-id_number`
+ - `B-nationality`
+ - `B-place_of_origin`
+ - `B-place_of_residence`
+ - `I-full_name`
+ - `I-nationality`
+ - `I-place_of_origin`
+ - `I-place_of_residence`
+
+ For example, a name such as "Nguyễn Văn A" is tagged `B-full_name I-full_name I-full_name`.
+
+ ## Usage
+
+ ### With VietNerm SDK
+
+ ```python
+ from vietnerm import VietNerm
+
+ ner = VietNerm(doc_type="cccd", model_path="ngocthanhdoan/phobert-cccd-ner")
+ result = ner.extract("your document text here")
+ print(result)
+ ```
+
+ ### With Transformers
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForTokenClassification
+ import torch
+
+ tokenizer = AutoTokenizer.from_pretrained("ngocthanhdoan/phobert-cccd-ner")
+ model = AutoModelForTokenClassification.from_pretrained("ngocthanhdoan/phobert-cccd-ner")
+
+ text = "your document text here"
+ inputs = tokenizer(text, return_tensors="pt")
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     predictions = torch.argmax(outputs.logits, dim=-1)
+ ```
+
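+ The predicted ids can be mapped back to BIO labels via `model.config.id2label`. A minimal sketch of that step, continuing the snippet above (note that PhoBERT expects word-segmented Vietnamese input, so raw OCR text should first be segmented, e.g. with VnCoreNLP):
+
+ ```python
+ # Align each sub-word token with its predicted BIO label,
+ # skipping special tokens such as <s> and </s>.
+ tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
+ labels = [model.config.id2label[p.item()] for p in predictions[0]]
+ for token, label in zip(tokens, labels):
+     if token not in tokenizer.all_special_tokens:
+         print(f"{token}\t{label}")
+ ```
+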
+ ## Training
+
+ - **Dataset**: Synthetically generated (mockup data) with OCR noise simulation
+ - **Data source**: Auto-generated from Jinja2 templates + random generators (no real personal data)
+ - **Framework**: HuggingFace Transformers + Trainer API (see the sketch below)
+ - **Optimizer**: AdamW (lr=2e-5)
+ - **Epochs**: 5-7 (with early stopping)
+
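+ A minimal sketch of this training setup with the Trainer API (dataset loading and label alignment are omitted; `train_ds` and `eval_ds` are assumed to be pre-tokenized datasets with aligned labels):
+
+ ```python
+ from transformers import (AutoModelForTokenClassification, Trainer,
+                           TrainingArguments, EarlyStoppingCallback)
+
+ model = AutoModelForTokenClassification.from_pretrained(
+     "vinai/phobert-base", num_labels=13)
+
+ args = TrainingArguments(
+     output_dir="phobert-cccd-ner",
+     learning_rate=2e-5,           # AdamW is the Trainer default optimizer
+     num_train_epochs=7,
+     eval_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,  # required by EarlyStoppingCallback
+ )
+
+ trainer = Trainer(
+     model=model,
+     args=args,
+     train_dataset=train_ds,       # assumption: tokenized, labels aligned, padded
+     eval_dataset=eval_ds,
+     callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
+ )
+ trainer.train()
+ ```
+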
+ ## Ethical Use
+
+ This model is built for **research and development purposes only**:
+
+ - ✅ AI/NLP research
+ - ✅ Document AI development
+ - ✅ OCR/NER pipeline prototyping
+ - ✅ Educational purposes
+ - ❌ Forging documents
+ - ❌ Creating fake identity papers
+ - ❌ Fraud or deception
+
+ ## About VietNerm
+
+ VietNerm is a Document AI Factory for Vietnamese documents. It provides a complete pipeline from template-based synthetic data generation to model training and deployment.
+
+ - **Repository**: [Devhub-Solutions/VietNerm](https://github.com/Devhub-Solutions/VietNerm)
+ - **Training dataset**: [ngocthanhdoan/vietnerm-cccd-dataset](https://huggingface.co/datasets/ngocthanhdoan/vietnerm-cccd-dataset)
+ - **SDK**: `pip install vietnerm`
+ - **License**: MIT — Copyright (c) 2026 Devhub Solutions
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "<mask>": 64000
+ }
bpe.codes ADDED
The diff for this file is too large to render. See raw diff
 
config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "architectures": [
+     "RobertaForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "O",
+     "1": "B-date_of_birth",
+     "2": "B-date_of_expiry",
+     "3": "B-full_name",
+     "4": "B-gender",
+     "5": "B-id_number",
+     "6": "B-nationality",
+     "7": "B-place_of_origin",
+     "8": "B-place_of_residence",
+     "9": "I-full_name",
+     "10": "I-nationality",
+     "11": "I-place_of_origin",
+     "12": "I-place_of_residence"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "B-date_of_birth": 1,
+     "B-date_of_expiry": 2,
+     "B-full_name": 3,
+     "B-gender": 4,
+     "B-id_number": 5,
+     "B-nationality": 6,
+     "B-place_of_origin": 7,
+     "B-place_of_residence": 8,
+     "I-full_name": 9,
+     "I-nationality": 10,
+     "I-place_of_origin": 11,
+     "I-place_of_residence": 12,
+     "O": 0
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 258,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "tokenizer_class": "PhobertTokenizer",
+   "torch_dtype": "float32",
+   "transformers_version": "4.51.3",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 64001
+ }
label_map.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "id2label": {
+     "0": "O",
+     "1": "B-date_of_birth",
+     "2": "B-date_of_expiry",
+     "3": "B-full_name",
+     "4": "B-gender",
+     "5": "B-id_number",
+     "6": "B-nationality",
+     "7": "B-place_of_origin",
+     "8": "B-place_of_residence",
+     "9": "I-full_name",
+     "10": "I-nationality",
+     "11": "I-place_of_origin",
+     "12": "I-place_of_residence"
+   },
+   "label2id": {
+     "O": 0,
+     "B-date_of_birth": 1,
+     "B-date_of_expiry": 2,
+     "B-full_name": 3,
+     "B-gender": 4,
+     "B-id_number": 5,
+     "B-nationality": 6,
+     "B-place_of_origin": 7,
+     "B-place_of_residence": 8,
+     "I-full_name": 9,
+     "I-nationality": 10,
+     "I-place_of_origin": 11,
+     "I-place_of_residence": 12
+   }
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0955f8d22ca7c2ceee23423c13259eadc80bd78976685fbd8817d0bc0e09f269
+ size 537694636
special_tokens_map.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "64000": {
+       "content": "<mask>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "PhobertTokenizer",
+   "unk_token": "<unk>"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff