ByT5 Abbreviation Expansion for Early Modern Texts

This model expands abbreviations in early modern Spanish and Latin printed texts: brevigraphs, macrons, tildes, and other typographic shorthand conventions used in 16th–17th century printing. It is also intended to expand some abbreviations that carry no typographic signal. It is the second stage in a two-model pipeline, following a boundary classifier (mpilhlt/canine-salamanca-boundary-classifier) that identifies nonbreaking line boundaries, i.e. line breaks that fall inside a word.

Model description

The model is a fine-tuned ByT5-base, a byte-level sequence-to-sequence Transformer that operates directly on UTF-8 bytes without tokenization. This makes it well suited to historical text where abbreviation markers (macrons, tildes, superscripts) are individual Unicode codepoints that subword tokenizers would split unpredictably.
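To make the byte-level input concrete, the sketch below maps a string to the ids a ByT5-style byte vocabulary would assign. It assumes the usual ByT5 convention of three reserved special ids (pad, end-of-sequence, unknown), so each id is the byte value plus 3. `byte_ids` is a hypothetical helper for illustration, not part of this model's code.

```python
# Illustrative only: mimics a ByT5-style byte-to-id mapping without loading a tokenizer.
# Assumption: ids 0-2 are reserved for special tokens, so id = byte value + 3.
SPECIAL_TOKEN_OFFSET = 3


def byte_ids(text: str) -> list[int]:
    """Return one id per UTF-8 byte of `text` (no end-of-sequence id appended)."""
    return [b + SPECIAL_TOKEN_OFFSET for b in text.encode("utf-8")]


# "c" is one byte in UTF-8; U+014D (o with macron) is two bytes,
# so two characters become three ids.
print(byte_ids("c\u014d"))
```

Because every codepoint decomposes into a short, fixed byte sequence, a rare abbreviation sign never falls outside the vocabulary.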

Input: one or more lines of transcribed text, joined with the line separator character ¬ (U+00AC) for nonbreaking boundaries or ↵ (U+21B5) for normal line boundaries. Context lines (preceding and following) improve disambiguation.

Output: the same text with abbreviations expanded. Unchanged characters are copied through verbatim; only abbreviated spans are modified.
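A minimal sketch of how such an input string might be assembled from separate lines. The helper `join_lines` and its boundary flags are hypothetical; in particular, inserting the separators without surrounding whitespace is an assumption that should be checked against the dataset.

```python
NONBREAKING = "\u00ac"  # ¬: the word continues on the next line
NORMAL = "\u21b5"       # ↵: ordinary line boundary


def join_lines(lines: list[str], nonbreaking: list[bool]) -> str:
    """Join lines into one model input string.

    nonbreaking[i] describes the boundary between lines[i] and lines[i + 1].
    Assumption: separators are inserted without extra whitespace.
    """
    if len(nonbreaking) != len(lines) - 1:
        raise ValueError("need exactly one boundary flag per adjacent line pair")
    parts = [lines[0]]
    for flag, line in zip(nonbreaking, lines[1:]):
        parts.append(NONBREAKING if flag else NORMAL)
        parts.append(line)
    return "".join(parts)


# A word split across two lines is joined with the nonbreaking separator.
print(join_lines(["absque paro", "cho, & testibus"], [True]))
```

In the full pipeline the boundary flags would come from the boundary classifier rather than being supplied by hand.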

Marker dropout training

The training data mark the beginning and end of each abbreviation with delimiters (⦃ and ⦄). During training, a random 50% of examples have these delimiters stripped. This teaches the model both to expand pre-marked abbreviations (when markers are present) and to detect and expand abbreviations from orthographic cues alone (when markers are absent). At inference time, input is always unmarked: the model relies on the distinctive Unicode characters of abbreviation markers (macrons, tildes, brevigraphs) or on typical character sequences to identify spans requiring expansion.
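The dropout step can be sketched as follows. This is an illustration under stated assumptions, not the project's training code: the function name `apply_marker_dropout` is hypothetical, and the delimiters are taken to be U+2983 and U+2984, stripped per example rather than per abbreviation.

```python
import random

OPEN, CLOSE = "\u2983", "\u2984"  # ⦃ and ⦄ (assumed codepoints of the delimiters)


def apply_marker_dropout(text: str, rng: random.Random, p: float = 0.5) -> str:
    """With probability p, strip all abbreviation delimiters from one example."""
    if rng.random() < p:
        return text.replace(OPEN, "").replace(CLOSE, "")
    return text


rng = random.Random(0)  # seeded so a run is reproducible
marked = "ciuitatis \u2983c\u014dsensus\u2984 qui"
for _ in range(4):
    # Each call independently keeps or drops the delimiters.
    print(apply_marker_dropout(marked, rng))
```

Only the delimiters are removed; the abbreviated form itself stays in the input, so the unmarked case still carries the orthographic cue.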

Training

The model was trained on approximately 1.5 million multi-line examples constructed from TEI XML transcriptions of early modern legal and theological texts in Latin and Spanish, published as part of the School of Salamanca Digital Collection.

Training data

Source: 2.4 million transcribed lines from the Salamanca corpus
Examples: ~1.54 million training examples (sliding windows of 3–5 lines)
Context: 1 line of context on each side of the central line
Abbreviation oversampling: 2× oversampling of examples containing abbreviations
Document-level splits: no document appears in more than one split
Dataset: https://huggingface.co/datasets/mpilhlt/salamanca-abbr
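The window construction and oversampling listed above could look roughly like this. Both helpers are hypothetical. How the 3–5 line window sizes arise is not specified here, so the `context` parameter is a stand-in, and the abbreviation test is passed in as a predicate.

```python
from typing import Callable


def make_windows(lines: list[str], context: int = 1) -> list[list[str]]:
    """One window per central line, with up to `context` lines on each side."""
    windows = []
    for i in range(len(lines)):
        start = max(0, i - context)
        end = min(len(lines), i + context + 1)
        windows.append(lines[start:end])
    return windows


def oversample(
    windows: list[list[str]],
    has_abbreviation: Callable[[list[str]], bool],
    factor: int = 2,
) -> list[list[str]]:
    """Repeat each window that contains an abbreviation `factor` times."""
    out = []
    for window in windows:
        out.extend([window] * (factor if has_abbreviation(window) else 1))
    return out


lines = ["line one", "line \u2983tw\u014d\u2984", "line three"]
windows = make_windows(lines)
# Assumption: an abbreviation is recognised by its opening delimiter.
examples = oversample(windows, lambda w: any("\u2983" in line for line in w))
print(len(windows), len(examples))
```

Windows at the start and end of a document are shorter because there is no context line to borrow; windows are built per document, which keeps the document-level splits intact.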

Training configuration

Base model: google/byt5-base (582M parameters)
Training hardware: 2× AMD Instinct MI300A (MPCDF Viper HPC)
Training duration: approximately 45 hours for 6 epochs (early stopping triggered at epoch 6), excluding the final test evaluation, which took another 17 hours
Optimizer: AdamW, learning rate 1e-4 with linear warmup and cosine decay
Effective batch size: 128 (batch 32 per GPU × 2 GPUs × 2 gradient accumulation steps)
Max input length: 512 bytes
Max target length: 384 bytes
Precision: bf16
Marker dropout: 0.5
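For orientation, the learning-rate schedule and the effective batch size can be written out in plain Python. This is a sketch of a generic linear-warmup, cosine-decay schedule; the warmup length and total step count used in the example are placeholders, since neither is stated in the configuration.

```python
import math


def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float = 1e-4) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Effective batch size from the configuration list.
per_gpu_batch, gpus, grad_accum = 32, 2, 2
effective_batch = per_gpu_batch * gpus * grad_accum

# Placeholder step counts, chosen only to show the shape of the curve.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, warmup_steps=100))
```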

Training results

Loss and evaluation metrics per epoch (early stopping triggered at epoch 6 with patience 3):

Epoch	Train Loss	Eval Loss	Eval Span CER	Eval Span Exact Match	Eval Full-line CER
1	0.0059	0.0029	0.0312	87.89%	0.0013
2	0.0035	0.0032	0.0404	85.23%	0.0016
3	0.0022	0.0043	0.0404	84.99%	0.0017
4	0.0015	0.0039	0.0348	86.92%	0.0014
5	0.0010	0.0048	0.0340	86.92%	0.0014
6	0.0006	0.0058	0.0369	86.68%	0.0015

Note: span-level eval metrics are measured on unmarked input (abbreviation delimiters stripped), so the model must both detect and expand abbreviations. The eval set was capped at 1,000 examples for efficiency during training.
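As a reference point for reading the table, character error rate (CER) is conventionally the edit distance between prediction and reference divided by the reference length. The sketch below implements that convention; whether the reported span CER is pooled over all spans (as here) or averaged per span is an assumption, and `span_scores` is a hypothetical helper.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def span_scores(pairs: list[tuple[str, str]]) -> tuple[float, float]:
    """pairs holds (predicted_span, reference_span); returns (CER, exact-match rate).

    Assumption: CER is pooled, i.e. total edits divided by total reference characters.
    """
    edits = sum(levenshtein(pred, ref) for pred, ref in pairs)
    chars = sum(len(ref) for _, ref in pairs)
    exact = sum(pred == ref for pred, ref in pairs) / len(pairs)
    return edits / chars, exact


# One correct expansion and one m/n confusion.
print(span_scores([("absque", "absque"), ("an", "am")]))
```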

Test set evaluation

Evaluated on a held-out document-level test set (20,000 examples from documents not seen during training):

Metric	Value
Full-line CER	0.00005
Span CER	0.008
Span exact match	95.54%
Spans evaluated	1,459

Interpretation:

Full-line CER of 0.00005 means approximately one wrong character per 21,000 characters
Span exact match of 95.5% means about 1 in 22 abbreviations has any error (assuming the average length of abbreviations is 7 characters), but most errors are false negatives (abbreviation left unchanged) rather than incorrect expansions
The rate of incorrect expansions (false positives) is approximately 1 in 36 abbreviations (same assumption about abbreviation length)
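The first two figures can be checked with a few lines of arithmetic on the values in the table. Note that the rounded CER of 0.00005 gives one error per 20,000 characters; the 21,000 quoted above presumably comes from the unrounded value, which is an assumption here. The 1-in-36 figure cannot be reproduced from the table alone and is left out.

```python
full_line_cer = 0.00005     # as reported (rounded)
span_exact_match = 0.9554
spans_evaluated = 1459

chars_per_error = 1 / full_line_cer          # characters per wrong character
span_error_rate = 1 - span_exact_match       # share of spans with any error
spans_per_error = 1 / span_error_rate        # "1 in N abbreviations"
spans_with_errors = round(span_error_rate * spans_evaluated)

print(round(chars_per_error), round(spans_per_error, 1), spans_with_errors)
```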

Common error patterns

The most frequent error types in the test set:

m/n confusion: the tilde (~) over vowels is genuinely ambiguous between m and n in Latin abbreviation conventions (e.g., ã → am vs an)
False negatives: the model leaves some abbreviations unexpanded, particularly less frequent forms
OCR/transcription artifacts: the model cannot correct upstream transcription errors (e.g., ſ/f confusion: plufquã instead of pluſquã)

For details, see the breakdown by abbreviation, which lists counts and erroneous predictions for each abbreviation, or the test_breakdown_errors file, which is filtered to only those abbreviations that produced errors in the test set evaluation.

Intended use

This model is intended for post-processing transcribed early modern Latin and Spanish printed texts. It works best as part of the full svsal-poco pipeline, which chains boundary detection with abbreviation expansion.

It is not intended as a general-purpose text normalizer and has not been evaluated on other corpora, time periods, or languages.

How to use

Installation

pip install transformers torch
# or, to install the full pipeline:
pip install git+https://github.com/digicademy/svsal-poco

Direct use with transformers

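from transformers import AutoTokenizer, T5ForConditionalGeneration

model_repo = "mpilhlt/byt5-salamanca-abbr"
# ByT5 uses a fixed byte-level vocabulary, so the base model's tokenizer is loaded as-is
tokenizer  = AutoTokenizer.from_pretrained("google/byt5-base")
# tie_word_embeddings=False overrides the config so input and output embeddings stay separate
model      = T5ForConditionalGeneration.from_pretrained(
    model_repo, tie_word_embeddings=False
)
model.eval()  # inference mode: disables dropout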

# Single line with abbreviation
text = "Lex est communis ciuitatis cōsensus qui"

enc = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
out = model.generate(**enc, max_length=384)
result = tokenizer.decode(out[0], skip_special_tokens=True)
print(result)
# → "Lex est communis ciuitatis consensus qui"

Multi-line input with context

The model was trained on multi-line windows using two separator characters:

¬ (U+00AC): nonbreaking boundary — word continues across the line break
↵ (U+21B5): normal line boundary

# Two lines with a nonbreaking boundary (word split across lines)
text = "ex epicheia poſſet occultè, absq́; paro¬cho, & teſtibus celebrari."

enc = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
out = model.generate(**enc, max_length=384)
result = tokenizer.decode(out[0], skip_special_tokens=True)
print(result)
# → "ex epicheia poſſet occultè, absque paro¬cho, & teſtibus celebrari."
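Since the separators are carried through to the output, as in the example above, the expanded text can be split back into lines afterwards. A minimal sketch, assuming the model reproduces every separator it was given; `split_lines` is a hypothetical helper, not part of svsal-poco.

```python
import re

# Capture group so that re.split keeps the separators in its result.
BOUNDARY = re.compile("([\u00ac\u21b5])")  # ¬ or ↵


def split_lines(text: str) -> tuple[list[str], list[str]]:
    """Return (lines, separators); separators[i] sits between lines[i] and lines[i + 1]."""
    parts = BOUNDARY.split(text)
    return parts[0::2], parts[1::2]


lines, separators = split_lines("absque paro\u00accho, & testibus\u21b5celebrari.")
print(lines)
print(separators)
```

Comparing the number of separators in the output with the number in the input is a cheap sanity check before mapping expanded lines back onto the original transcription.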

Use in the full pipeline

When using the full svsal-poco pipeline, the ByT5 model runs automatically as the second stage after boundary detection. See the svsal-poco repository for the complete pipeline documentation.

python -m infer \
  --input             texts.jsonl \
  --output            expanded.jsonl \
  --boundary_model_dir ./canine-salamanca-boundary-classifier \
  --byt5_model_dir    mpilhlt/byt5-salamanca-abbr \
  --batch_size        32

Interactive demo

Try the model in the SvSal PoCo Hugging Face Space, which provides a web interface for both plain text and TEI XML input.

Limitations

Trained exclusively on texts from the School of Salamanca corpus; performance on other early modern corpora is unknown and may be lower
The m/n ambiguity in tilde abbreviations is inherent to the notation system and cannot be fully resolved without semantic understanding
The model cannot correct upstream OCR or transcription errors
Cross-line abbreviations (where the abbreviated word straddles a line break) require correct boundary detection as a prerequisite
Very rare abbreviation forms seen fewer than ~5 times in training may not be recognized
The model has no explicit language identification; Latin and Spanish abbreviations are handled jointly

Citation

If you use this model, please cite the School of Salamanca project and this repository:

@misc{svsal-poco,
  author    = {Wagner, Andreas and others},
  title     = {svsal-poco: Abbreviation Expansion Pipeline for Early Modern
               Spanish and Latin Printed Texts},
  year      = {2026},
  publisher = {GitHub},
  url       = {https://github.com/digicademy/svsal-poco}
}
