Commit 0a8904e · Parent(s): none (root commit)

chore: squash history — keep latest version only

Files changed:

- .gitattributes (+35 -0)
- README.md (+112 -0)
- added_tokens.json (+3 -0)
- bpe.codes (+0 -0)
- config.json (+58 -0)
- label_map.json (+32 -0)
- model.safetensors (+3 -0)
- special_tokens_map.json (+9 -0)
- tokenizer_config.json (+55 -0)
- vocab.txt (+0 -0)
.gitattributes ADDED
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,112 @@
---
language: vi
tags:
- ner
- phobert
- vietnamese
- document-ai
- cccd
- synthetic-data
license: mit
base_model: vinai/phobert-base
---

# VietNerm - Căn cước công dân NER Model

PhoBERT-based Named Entity Recognition model for Vietnamese **Căn cước công dân** (citizen ID card) documents.

## ⚠️ DISCLAIMER: SYNTHETIC / MOCKUP DATA

> **This model was trained entirely on synthetic/mockup data; NO real personal data was used.**

- All training data was **generated automatically** by a template + generator system
- **No** real documents, real personal information, or data collected from users were used
- Identification numbers (ID, CCCD, ...) are generated at random and designed **not to match** real data
- OCR noise is injected into the data to simulate real-world conditions
- Purpose: **AI research, Document AI, and OCR/NER pipelines**
- **Not** to be used to forge documents, create fake papers, or commit fraud or deception

## Model Description

This model is fine-tuned from [`vinai/phobert-base`](https://huggingface.co/vinai/phobert-base) for token-level NER on Vietnamese administrative documents. It extracts structured fields from OCR text output.

- **Base model**: vinai/phobert-base
- **Task**: Token Classification (NER)
- **Language**: Vietnamese (vi)
- **Document type**: Căn cước công dân (citizen ID card)
- **Number of labels**: 13
- **Training data**: Synthetic/mockup (not real personal data)

## Labels

The 13 labels are the BIO tags below plus `O` for non-entity tokens; a sketch for decoding tagged tokens into entity spans follows the list.

- `B-date_of_birth`
- `B-date_of_expiry`
- `B-full_name`
- `B-gender`
- `B-id_number`
- `B-nationality`
- `B-place_of_origin`
- `B-place_of_residence`
- `I-full_name`
- `I-nationality`
- `I-place_of_origin`
- `I-place_of_residence`
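The `B-`/`I-` prefixes follow the standard BIO scheme, so raw per-token predictions have to be grouped into spans before they become usable document fields. Below is a minimal decoding sketch; it is not part of the released SDK, and only the tag names above are taken from this card.

```python
# Minimal BIO decoder: group (token, tag) pairs into (field, text) spans.
# Assumes tokens join with spaces; real OCR output may need smarter joining.
def bio_to_spans(tokens, tags):
    spans, field, buf = [], None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if field:
                spans.append((field, " ".join(buf)))
            field, buf = tag[2:], [token]
        elif tag.startswith("I-") and field == tag[2:]:
            buf.append(token)
        else:  # "O", or an I- tag that does not continue the open span
            if field:
                spans.append((field, " ".join(buf)))
            field, buf = None, []
    if field:
        spans.append((field, " ".join(buf)))
    return spans

tokens = ["Nguyễn", "Văn", "A", "Nam", "01/01/1990"]
tags = ["B-full_name", "I-full_name", "I-full_name", "B-gender", "B-date_of_birth"]
print(bio_to_spans(tokens, tags))
# [('full_name', 'Nguyễn Văn A'), ('gender', 'Nam'), ('date_of_birth', '01/01/1990')]
```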
## Usage

### With VietNerm SDK

```python
from vietnerm import VietNerm

ner = VietNerm(doc_type="cccd", model_path="ngocthanhdoan/phobert-cccd-ner")
result = ner.extract("your document text here")
print(result)
```

### With Transformers

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ngocthanhdoan/phobert-cccd-ner")
model = AutoModelForTokenClassification.from_pretrained("ngocthanhdoan/phobert-cccd-ner")

text = "your document text here"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)
```
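The block above stops at raw label ids. One way to make the output readable is to map the ids through `model.config.id2label`, the same mapping stored in this repo's `config.json`; a small continuation sketch:

```python
# Continues from the previous block: map label ids back to tag strings.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tags = [model.config.id2label[i] for i in predictions[0].tolist()]
for token, tag in zip(tokens, tags):
    print(token, tag)
# Note: <s> and </s> also receive predictions and should be skipped, and
# PhoBERT's BPE emits subword pieces ending in "@@" that need re-merging.
```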
## Training

- **Dataset**: Synthetically generated (mockup data) with OCR noise simulation
- **Data source**: Auto-generated from Jinja2 templates + random generators (no real personal data)
- **Framework**: Hugging Face Transformers + Trainer API
- **Optimizer**: AdamW (lr=2e-5)
- **Epochs**: 5-7 (with early stopping)

A minimal `Trainer` setup consistent with these settings is sketched below.
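The training script itself is not part of this commit, so the following only illustrates a plausible configuration under the hyperparameters above; `train_ds` and `eval_ds` are hypothetical pre-tokenized NER datasets, and the patience value is an assumption.

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("vinai/phobert-base")
model = AutoModelForTokenClassification.from_pretrained(
    "vinai/phobert-base", num_labels=13)

args = TrainingArguments(
    output_dir="phobert-cccd-ner",
    learning_rate=2e-5,               # AdamW is the Trainer's default optimizer
    num_train_epochs=7,               # upper bound; early stopping may end sooner
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,      # required for early stopping
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,           # hypothetical pre-tokenized datasets
    eval_dataset=eval_ds,
    data_collator=DataCollatorForTokenClassification(tokenizer),
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```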
## Ethical Use

This model is built for **research and development purposes only**:

- ✅ AI/NLP research
- ✅ Document AI development
- ✅ OCR/NER pipeline prototyping
- ✅ Educational purposes
- ❌ Forging documents
- ❌ Creating fake identity papers
- ❌ Fraud or deception

## About VietNerm

VietNerm is a Document AI Factory for Vietnamese documents. It provides a complete pipeline,
from template-based synthetic data generation to model training and deployment.

- **Repository**: [Devhub-Solutions/VietNerm](https://github.com/Devhub-Solutions/VietNerm)
- **Training dataset**: [ngocthanhdoan/vietnerm-cccd-dataset](https://huggingface.co/datasets/ngocthanhdoan/vietnerm-cccd-dataset)
- **SDK**: `pip install vietnerm`
- **License**: MIT — Copyright (c) 2026 Devhub Solutions
added_tokens.json ADDED
@@ -0,0 +1,3 @@
{
  "<mask>": 64000
}
bpe.codes ADDED
The diff for this file is too large to render. See raw diff.
config.json ADDED
@@ -0,0 +1,58 @@
{
  "architectures": [
    "RobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "O",
    "1": "B-date_of_birth",
    "2": "B-date_of_expiry",
    "3": "B-full_name",
    "4": "B-gender",
    "5": "B-id_number",
    "6": "B-nationality",
    "7": "B-place_of_origin",
    "8": "B-place_of_residence",
    "9": "I-full_name",
    "10": "I-nationality",
    "11": "I-place_of_origin",
    "12": "I-place_of_residence"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "B-date_of_birth": 1,
    "B-date_of_expiry": 2,
    "B-full_name": 3,
    "B-gender": 4,
    "B-id_number": 5,
    "B-nationality": 6,
    "B-place_of_origin": 7,
    "B-place_of_residence": 8,
    "I-full_name": 9,
    "I-nationality": 10,
    "I-place_of_origin": 11,
    "I-place_of_residence": 12,
    "O": 0
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 258,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "tokenizer_class": "PhobertTokenizer",
  "torch_dtype": "float32",
  "transformers_version": "4.51.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 64001
}
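One practical detail in this config: `max_position_embeddings` is 258, which, as in the base PhoBERT model, leaves 256 usable positions once the RoBERTa-style position offset is accounted for, special tokens included. Long OCR dumps should therefore be truncated explicitly; a hedged sketch:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ngocthanhdoan/phobert-cccd-ner")
# 258 position embeddings minus the RoBERTa padding offset (2) leaves
# 256 usable positions, counting <s> and </s>.
inputs = tokenizer("a very long OCR dump ...", return_tensors="pt",
                   truncation=True, max_length=256)
print(inputs["input_ids"].shape)
```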
label_map.json ADDED
@@ -0,0 +1,32 @@
{
  "id2label": {
    "0": "O",
    "1": "B-date_of_birth",
    "2": "B-date_of_expiry",
    "3": "B-full_name",
    "4": "B-gender",
    "5": "B-id_number",
    "6": "B-nationality",
    "7": "B-place_of_origin",
    "8": "B-place_of_residence",
    "9": "I-full_name",
    "10": "I-nationality",
    "11": "I-place_of_origin",
    "12": "I-place_of_residence"
  },
  "label2id": {
    "O": 0,
    "B-date_of_birth": 1,
    "B-date_of_expiry": 2,
    "B-full_name": 3,
    "B-gender": 4,
    "B-id_number": 5,
    "B-nationality": 6,
    "B-place_of_origin": 7,
    "B-place_of_residence": 8,
    "I-full_name": 9,
    "I-nationality": 10,
    "I-place_of_origin": 11,
    "I-place_of_residence": 12
  }
}
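`label_map.json` duplicates the `id2label`/`label2id` mappings already embedded in `config.json` above. A quick consistency check, assuming both files sit in a local clone of the repo:

```python
import json

with open("config.json") as f:
    config = json.load(f)
with open("label_map.json") as f:
    label_map = json.load(f)

# Both files should describe the same 13-label scheme.
assert config["id2label"] == label_map["id2label"]
assert config["label2id"] == label_map["label2id"]
print("label maps agree:", len(label_map["id2label"]), "labels")
```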
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0955f8d22ca7c2ceee23423c13259eadc80bd78976685fbd8817d0bc0e09f269
size 537694636
special_tokens_map.json ADDED
@@ -0,0 +1,9 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
tokenizer_config.json ADDED
@@ -0,0 +1,55 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "64000": {
      "content": "<mask>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "extra_special_tokens": {},
  "mask_token": "<mask>",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "PhobertTokenizer",
  "unk_token": "<unk>"
}
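The huge `model_max_length` value is the Transformers 'unset' sentinel rather than a real limit, so this tokenizer will not truncate on its own; pair it with the 256-token budget implied by `config.json`. A quick check:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("ngocthanhdoan/phobert-cccd-ner")
print(tok.model_max_length)  # ~1e30 sentinel: no implicit truncation
# Pass truncation=True, max_length=256 explicitly for this model.
```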
vocab.txt ADDED
The diff for this file is too large to render. See raw diff.