Legal-BERT-LGPD 🇧🇷
Legal-BERT-LGPD is a domain-adapted transformer model for Brazilian legal texts, with a focus on the Lei Geral de Proteção de Dados (LGPD), Brazil's General Data Protection Law. It is designed to support token classification (NER) in the legal domain.
Overview
This model is based on the BERT architecture and was further pre-trained and fine-tuned on Brazilian legal corpora, with an emphasis on privacy and data protection regulations. Its development is grounded in academic research on legal NLP applied to the LGPD.
Training Background
The model was developed by applying domain adaptation techniques to Portuguese-language legal corpora. The methodology follows established approaches in legal NLP, leveraging:
- Continued pre-training on domain-specific corpora
- Fine-tuning for downstream legal tasks
The underlying research highlights the importance of contextual embeddings for capturing the nuances of legal language, especially in highly specialized domains such as privacy law.
Limitations
- The model is specialized in Brazilian Portuguese legal language and may not generalize well to other domains or languages.
- It should not be used as a substitute for professional legal advice.
- Performance may vary depending on task-specific fine-tuning.
Personal Data Labels and Examples
| Label | Entity Example |
|---|---|
| NAME | Francis Pantele |
| DATE | January 12, 2013 |
| ADDRESS | Campo Grande, MS |
| CPF | 049.567.041-22 |
| PHONE | (61) 9412 3333 |
| EMAIL | fran@bol.com |
| MONEY | 5,534.00 |
| ZIPCODE | 59123-222 |
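
The table uses English label names, whereas the sample output further down reports Portuguese `entity_group` values (`NOME`, `DATA`, `ENDERECO`, and so on). A minimal sketch of a lookup between the two follows. The mapping is inferred from that sample output rather than read from the model configuration, and `PT_TO_EN` and `to_english` are illustrative names, not part of the model or of `transformers`.

```python
# Mapping inferred from the sample pipeline output in this card (an assumption,
# not read from the model's config). Keys are the Portuguese `entity_group`
# values; values are the English label names used in the label table.
PT_TO_EN = {
    "NOME": "NAME",
    "DATA": "DATE",
    "ENDERECO": "ADDRESS",
    "CPF": "CPF",
    "TELEFONE": "PHONE",
    "EMAIL": "EMAIL",
    "DINHEIRO": "MONEY",
    "CEP": "ZIPCODE",
}


def to_english(entity):
    """Return a copy of a pipeline entity dict with an English label.

    Labels missing from the mapping are passed through unchanged.
    """
    label = entity["entity_group"]
    return {**entity, "entity_group": PT_TO_EN.get(label, label)}


example = {"entity_group": "DINHEIRO", "word": "R $ 82. 662, 00", "start": 589, "end": 601}
print(to_english(example)["entity_group"])  # MONEY
```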
Model Performance Results
| Entity | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| NAME | 0.95 | 0.95 | 0.95 | 1743 |
| DATE | 0.98 | 0.97 | 0.98 | 2024 |
| ADDRESS | 0.80 | 0.84 | 0.82 | 1323 |
| CPF | 0.98 | 1.00 | 0.99 | 144 |
| PHONE | 0.96 | 0.97 | 0.97 | 983 |
| EMAIL | 0.96 | 0.98 | 0.97 | 567 |
| MONEY | 0.95 | 0.95 | 0.95 | 348 |
| ZIPCODE | 0.98 | 0.98 | 0.98 | 447 |
| Average | 0.94 | 0.96 | 0.95 | 947 |
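
The Average row appears to be an unweighted (macro) mean of the per-entity rows, with Support averaged the same way. The sketch below recomputes it from the rounded table values and also derives a support-weighted F1 for comparison. The weighted figure is not reported in the source and is shown only as an illustration; the row with support 567 is treated as EMAIL.

```python
# Per-entity (precision, recall, f1, support), copied from the results table.
# Assumption: the row with support 567 is the EMAIL entity.
RESULTS = {
    "NAME": (0.95, 0.95, 0.95, 1743),
    "DATE": (0.98, 0.97, 0.98, 2024),
    "ADDRESS": (0.80, 0.84, 0.82, 1323),
    "CPF": (0.98, 1.00, 0.99, 144),
    "PHONE": (0.96, 0.97, 0.97, 983),
    "EMAIL": (0.96, 0.98, 0.97, 567),
    "MONEY": (0.95, 0.95, 0.95, 348),
    "ZIPCODE": (0.98, 0.98, 0.98, 447),
}


def macro_average(results, index):
    """Unweighted mean of one column across entity types."""
    return sum(row[index] for row in results.values()) / len(results)


def weighted_f1(results):
    """F1 averaged with each entity type weighted by its support."""
    total = sum(row[3] for row in results.values())
    return sum(row[2] * row[3] for row in results.values()) / total


macro_p = macro_average(RESULTS, 0)
macro_r = macro_average(RESULTS, 1)
macro_f1 = macro_average(RESULTS, 2)
mean_support = macro_average(RESULTS, 3)
print(f"macro P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
print(f"mean support={mean_support:.1f}, weighted F1={weighted_f1(RESULTS):.3f}")
```

The macro means land within rounding distance of the reported 0.94 / 0.96 / 0.95, and the mean support rounds to 947. Because the inputs are already rounded to two decimals, exact agreement in the last digit is not expected. The support-weighted F1 comes out slightly lower, since ADDRESS, the weakest entity type, is also one of the most frequent.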
How to Use
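import torch
from transformers import pipeline, AutoTokenizer

MODEL_NAME = "celiudos/legal-bert-lgpd"

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_NAME,
    model_max_length=512,  # maximum number of tokens per chunk
)

pipe = pipeline(
    "ner",
    tokenizer=tokenizer,
    model=MODEL_NAME,
    stride=100,  # tokens of overlap between chunks, so long texts are processed in full
    aggregation_strategy="first",  # each word takes the label of its first sub-token
    device=0 if torch.cuda.is_available() else -1,  # first GPU if available, else CPU
)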
pipe(
"Anotação de Responsabilidade Técnica Nº 1055330634101 de 12 de janeiro de 2013 relativa à Lei Federal Nº 531. Trata-se de representação referente a possível falsificação documentação técnica registrada pelo CREA-SP, feita pelo senhor Francis Pantele da Cozzi, CPF: 412.612.341-32, telefone (31) 951358433, email fran@bol.com, atinente à sua contratação pela senhora Marinalva Bete Raz, CPF: 049.567.041-22, telefone (61) 9412 3333, mulher branca, opinião política conservadora, religião evangélica. Marinalva Bete Raz reclama por indenização por danos morais no dia 14.05.2013 no valor de R$ 82.662,00 (Oitenta e dois mil, seiscentos e sessenta e dois reais) relacionado ao endereço IP 192.168.01 e ao endereço constante no CEP 59123-222, Rua dos Pioneiros, nº 450, Jardim Esmeralda, Campo Grande, MS."
)
Output
[
{
"entity_group": "DATA",
"score": 0.9828296,
"word": "12 de janeiro de 2013",
"start": 57,
"end": 78
},
{
"entity_group": "NOME",
"score": 0.95766664,
"word": "Francis Pantele da Cozzi",
"start": 234,
"end": 258
},
{
"entity_group": "CPF",
"score": 0.9954297,
"word": "412. 612. 341 - 32",
"start": 265,
"end": 279
},
{
"entity_group": "TELEFONE",
"score": 0.5634508,
"word": "31 )",
"start": 291,
"end": 294
},
{
"entity_group": "EMAIL",
"score": 0.9973985,
"word": "fran @ bol. com",
"start": 312,
"end": 324
},
{
"entity_group": "NOME",
"score": 0.96683884,
"word": "Marinalva Bete Raz",
"start": 366,
"end": 384
},
{
"entity_group": "CPF",
"score": 0.99713326,
"word": "049. 567. 041 - 22",
"start": 391,
"end": 405
},
{
"entity_group": "TELEFONE",
"score": 0.90854883,
"word": "( 61 ) 9412 3333",
"start": 416,
"end": 430
},
{
"entity_group": "NOME",
"score": 0.9364093,
"word": "Marinalva Bete Raz",
"start": 499,
"end": 517
},
{
"entity_group": "DATA",
"score": 0.9986375,
"word": "14",
"start": 566,
"end": 568
},
{
"entity_group": "DATA",
"score": 0.9968226,
"word": "05",
"start": 569,
"end": 571
},
{
"entity_group": "DATA",
"score": 0.9992943,
"word": "2013",
"start": 572,
"end": 576
},
{
"entity_group": "DINHEIRO",
"score": 0.99847966,
"word": "R $ 82. 662, 00",
"start": 589,
"end": 601
},
{
"entity_group": "CEP",
"score": 0.9977593,
"word": "59123 - 222",
"start": 728,
"end": 737
},
{
"entity_group": "ENDERECO",
"score": 0.9711078,
"word": "Rua dos Pioneiros",
"start": 739,
"end": 756
},
{
"entity_group": "ENDERECO",
"score": 0.9741938,
"word": "Jardim Esmeralda",
"start": 766,
"end": 782
},
{
"entity_group": "ENDERECO",
"score": 0.9352198,
"word": "Campo Grande, MS",
"start": 784,
"end": 800
}
]
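
In the output above, the date `14.05.2013` comes back as three separate `DATA` entities and the street address as three `ENDERECO` fragments, while the `word` field carries tokenizer spacing (`412. 612. 341 - 32`). One possible post-processing step merges such fragments and reads the surface text back from the original string via `start`/`end`. The sketch assumes that same-label entities separated by at most a few characters belong together. The helper name `merge_adjacent` and the `max_gap` threshold are illustrative, not part of the model or of `transformers`.

```python
def merge_adjacent(entities, text, max_gap=2):
    """Merge same-label entities whose spans are at most `max_gap` chars apart.

    `entities` is a list of dicts with `entity_group`, `score`, `start`, `end`
    (the shape returned by an aggregated NER pipeline). The merged `word` is
    sliced from `text`, and the merged score is the minimum of its parts
    (a conservative choice made here, not something the pipeline defines).
    """
    merged = []
    for ent in sorted(entities, key=lambda e: e["start"]):
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev["entity_group"] == ent["entity_group"]
            and ent["start"] - prev["end"] <= max_gap
        ):
            # Extend the previous span instead of starting a new entity.
            prev["end"] = max(prev["end"], ent["end"])
            prev["score"] = min(prev["score"], float(ent["score"]))
        else:
            merged.append(
                {
                    "entity_group": ent["entity_group"],
                    "score": float(ent["score"]),
                    "start": ent["start"],
                    "end": ent["end"],
                }
            )
    # Recover the exact surface text, free of tokenizer spacing.
    for ent in merged:
        ent["word"] = text[ent["start"] : ent["end"]]
    return merged


# Toy input mimicking the fragmented date in the sample output.
sample_text = "no dia 14.05.2013 no valor"
fragments = [
    {"entity_group": "DATA", "score": 0.9986, "start": 7, "end": 9},
    {"entity_group": "DATA", "score": 0.9968, "start": 10, "end": 12},
    {"entity_group": "DATA", "score": 0.9993, "start": 13, "end": 17},
]
print(merge_adjacent(fragments, sample_text))
```

With the real pipeline this would be called as `merge_adjacent(pipe(text), text)`. On the sample output, the default `max_gap=2` would join `Jardim Esmeralda` with `Campo Grande, MS` but would not bridge the `, nº 450, ` between `Rua dos Pioneiros` and `Jardim Esmeralda`. A larger gap merges more aggressively, at the risk of joining unrelated entities.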
Custom Input Usage
import gradio as gr


def ner(text):
    # Reuses the `pipe` object created in "How to Use".
    return {"text": text, "entities": pipe(text)}


gr.Interface(
    ner,
    gr.Textbox(placeholder="Enter sentence here..."),
    gr.HighlightedText(),
    live=True,
    examples=[
        "Anotação de Responsabilidade Técnica Nº 1055330634101 de 12 de janeiro de 2013 relativa à Lei Federal Nº 531. Trata-se de representação referente a possível falsificação documentação técnica registrada pelo CREA-SP, feita pelo senhor Francis Pantele da Cozzi, CPF: 412.612.341-32, telefone (31) 951358433, email fran@bol.com.",
    ],
).launch()
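
The thesis cited below concerns pseudonymization of personal data. As a rough illustration of how entity offsets could feed such a step, the sketch below replaces each detected span with a bracketed label placeholder. This is an assumption about one simple masking scheme, not the method described in the thesis. `mask_entities` is a hypothetical helper, and replacing spans with labels alone is closer to redaction than to reversible pseudonymization.

```python
def mask_entities(text, entities):
    """Replace each entity span in `text` with a `[LABEL]` placeholder.

    Spans are processed right to left so that earlier offsets stay valid
    after each replacement. Overlapping spans are not handled.
    """
    masked = text
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        placeholder = f"[{ent['entity_group']}]"
        masked = masked[: ent["start"]] + placeholder + masked[ent["end"] :]
    return masked


# Toy example with hand-written offsets in the pipeline's output shape.
sample = "CPF: 049.567.041-22, email fran@bol.com"
found = [
    {"entity_group": "CPF", "start": 5, "end": 19},
    {"entity_group": "EMAIL", "start": 27, "end": 39},
]
print(mask_entities(sample, found))  # CPF: [CPF], email [EMAIL]
```

With the real pipeline the call would be `mask_entities(text, pipe(text))`. Anything the model misses, such as the partially detected phone number in the sample output, would remain in the text, so the result would still need review before being treated as anonymized.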
Citation
@mastersthesis{souza_filho_2025_lgpd,
author = {Souza Filho, Marcelo Anselmo de},
title = {Inteligência Artificial no MPF: Uma Solução Baseada em IA para Pseudonimização de Dados Pessoais},
school = {Universidade de Brasília},
year = {2025},
address = {Brasília},
type = {Dissertação (Mestrado profissional em Computação Aplicada)},
pages = {74}
}