VietNerm - Căn cước công dân NER Model

PhoBERT-based Named Entity Recognition model for Vietnamese Căn cước công dân (CCCD, citizen identity card) documents.

⚠️ DISCLAIMER: SYNTHETIC / MOCKUP DATA

This model was trained entirely on synthetic (mockup) data and does NOT use any real personal data.

All training data is generated automatically by a template + generator system
No real documents, real personal information, or data collected from users are used
Identification numbers (ID, CCCD, ...) are randomly generated and designed not to collide with real data
OCR noise is injected into the data to simulate real-world conditions
Purpose: research on AI, Document AI, and OCR/NER pipelines
Must not be used to forge documents, create fake papers, or commit fraud or deception

Model Description

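This model is fine-tuned from vinai/phobert-base for token-level NER on Vietnamese administrative documents, specifically Căn cước công dân (citizen identity cards). It takes OCR text output as input and extracts structured fields such as full name, ID number, dates, and addresses.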

Base model: vinai/phobert-base
Task: Token Classification (NER)
Language: Vietnamese (vi)
Document type: Căn cước công dân
Number of labels: 13 (the 12 entity tags listed under Labels, plus presumably the O tag for non-entity tokens)
Training data: Synthetic/Mockup (not real personal data)

Labels

B-date_of_birth
B-date_of_expiry
B-full_name
B-gender
B-id_number
B-nationality
B-place_of_origin
B-place_of_residence
I-full_name
I-nationality
I-place_of_origin
I-place_of_residence
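
The labels follow the BIO scheme: a B- tag opens a field and I- tags continue it. A minimal sketch of how a tagged token sequence could be collapsed into one value per field is shown below. The function name `group_bio`, the sample tokens, and the choice to keep only the first occurrence of each field are illustrative assumptions, not part of the VietNerm SDK.

```python
from typing import Dict, List, Optional


def group_bio(tokens: List[str], tags: List[str]) -> Dict[str, str]:
    """Collapse parallel token/tag lists into a {field: text} mapping.

    Assumption: if a field appears more than once, the first span wins.
    """
    fields: Dict[str, str] = {}
    current_field: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Store the span collected so far, then reset the collector.
        nonlocal current_field, current_tokens
        if current_field is not None and current_field not in fields:
            fields[current_field] = " ".join(current_tokens)
        current_field, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        prefix, _, field = tag.partition("-")
        if prefix == "B":
            flush()
            current_field, current_tokens = field, [token]
        elif prefix == "I" and field == current_field:
            current_tokens.append(token)
        else:
            # "O" tag, or an I- tag that does not continue the open span.
            flush()
    flush()
    return fields


# Made-up example input (not real personal data).
example_tokens = ["Họ", "tên", ":", "NGUYỄN", "VĂN", "A", "Giới", "tính", ":", "Nam"]
example_tags = ["O", "O", "O", "B-full_name", "I-full_name", "I-full_name",
                "O", "O", "O", "B-gender"]
print(group_bio(example_tokens, example_tags))
```

Fields such as gender or id_number have only a B- tag in the label list, so each is expected to span a single token.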

Usage

With VietNerm SDK

from vietnerm import VietNerm

ner = VietNerm(doc_type="cccd", model_path="ngocthanhdoan/phobert-cccd-ner")
result = ner.extract("your document text here")
print(result)

With Transformers

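from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ngocthanhdoan/phobert-cccd-ner")
model = AutoModelForTokenClassification.from_pretrained("ngocthanhdoan/phobert-cccd-ner")

# Note: PhoBERT was pre-trained on word-segmented text; whether this
# checkpoint expects segmented input is not stated in this card.
text = "your document text here"
# PhoBERT accepts at most 256 subword tokens, so truncate longer inputs
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.argmax(outputs.logits, dim=-1)

# Map each predicted class id to its label name from the model config
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
labels = [model.config.id2label[i] for i in predictions[0].tolist()]
for token, label in zip(tokens, labels):
    print(token, label)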
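
The predictions above are per subword token, not per word. The sketch below shows one way to merge subword pieces back into words, assuming the tokenizer marks non-final pieces with a trailing `@@` (as PhoBERT's BPE tokenizer does) and that each word takes the label of its first piece. The helper name `merge_subwords` and the sample tokens are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def merge_subwords(
    tokens: Sequence[str],
    labels: Sequence[str],
    special: Sequence[str] = ("<s>", "</s>", "<pad>"),
) -> Tuple[List[str], List[str]]:
    """Merge '@@'-suffixed BPE pieces into words; keep the first piece's label."""
    words: List[str] = []
    word_labels: List[str] = []
    buffer = ""
    buffer_label: Optional[str] = None

    for token, label in zip(tokens, labels):
        if token in special:
            continue  # skip sentence markers and padding
        if buffer_label is None:
            buffer_label = label  # label of the first piece of this word
        if token.endswith("@@"):
            buffer += token[:-2]  # non-final piece: strip marker, keep collecting
        else:
            words.append(buffer + token)
            word_labels.append(buffer_label)
            buffer, buffer_label = "", None

    if buffer_label is not None:
        # Trailing unfinished word, e.g. when the input was truncated mid-word.
        words.append(buffer)
        word_labels.append(buffer_label)
    return words, word_labels


# Made-up example (not real model output).
sample_tokens = ["<s>", "NGUY@@", "ỄN", "VĂN", "A", "</s>"]
sample_labels = ["O", "B-full_name", "I-full_name", "I-full_name", "I-full_name", "O"]
print(merge_subwords(sample_tokens, sample_labels))
```

Taking the first piece's label is a common convention; a majority vote over the pieces is an alternative if the model's subword predictions disagree often.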

Training

Dataset: Synthetically generated (mockup data) with OCR noise simulation
Data source: Auto-generated from Jinja2 templates + random generators (no real personal data)
Framework: HuggingFace Transformers + Trainer API
Optimizer: AdamW (lr=2e-5)
Epochs: 5-7 (with early stopping)
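
To illustrate the idea of template-based synthetic data with OCR noise, here is a minimal, dependency-free sketch. It does not reproduce the actual VietNerm generator: the real pipeline uses Jinja2 templates, while this sketch uses a plain Python list, and the template layout, name syllables, and confusion table are all assumptions. It also does not implement the safeguard against collisions with real ID numbers mentioned in the disclaimer.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical template: (literal text, None) or (None, field name).
TEMPLATE: List[Tuple[Optional[str], Optional[str]]] = [
    ("Số:", None),
    (None, "id_number"),
    ("Họ và tên:", None),
    (None, "full_name"),
    ("Giới tính:", None),
    (None, "gender"),
]

# Hypothetical OCR confusions (digit misread as a similar-looking letter).
OCR_CONFUSIONS = {"0": "O", "1": "l", "5": "S", "8": "B"}


def random_record(rng: random.Random) -> Dict[str, str]:
    """Generate one fake record; all values are random placeholders."""
    return {
        "id_number": "".join(rng.choice("0123456789") for _ in range(12)),
        "full_name": " ".join(rng.choice(["AN", "BÌNH", "CHI", "DŨNG"]) for _ in range(3)),
        "gender": rng.choice(["Nam", "Nữ"]),
    }


def render(template, record: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Fill the template and emit whitespace tokens with aligned BIO tags."""
    tokens: List[str] = []
    tags: List[str] = []
    for text, field in template:
        words = (record[field] if field else text).split()
        for i, word in enumerate(words):
            tokens.append(word)
            if field is None:
                tags.append("O")
            else:
                tags.append(("B-" if i == 0 else "I-") + field)
    return tokens, tags


def add_ocr_noise(tokens: List[str], rng: random.Random, rate: float = 0.1) -> List[str]:
    """Swap confusable characters with probability `rate`.

    Only character substitutions are applied, so the token count (and
    therefore the tag alignment) is preserved.
    """
    noisy = []
    for token in tokens:
        chars = [OCR_CONFUSIONS.get(c, c) if rng.random() < rate else c for c in token]
        noisy.append("".join(chars))
    return noisy


rng = random.Random(0)  # fixed seed for reproducibility
tokens, tags = render(TEMPLATE, random_record(rng))
for token, tag in zip(add_ocr_noise(tokens, rng), tags):
    print(token, tag)
```

Because the tags are produced at render time, every sample is labeled for free, which is the main appeal of generating training data from templates.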

Ethical Use

This model is built for research and development purposes only:

✅ AI/NLP research
✅ Document AI development
✅ OCR/NER pipeline prototyping
✅ Educational purposes
❌ Forging documents
❌ Creating fake identity papers
❌ Fraud or deception

About VietNerm

VietNerm is a Document AI Factory for Vietnamese documents. It provides a complete pipeline from template-based synthetic data generation to model training and deployment.

Repository: Devhub-Solutions/VietNerm
Training dataset: ngocthanhdoan/vietnerm-cccd-dataset
SDK: pip install vietnerm
License: MIT — Copyright (c) 2026 Devhub Solutions

Model size: 0.1B params (Safetensors)
Tensor type: F32
