---
library_name: transformers
tags:
- deberta-v3
- regression
- text-evaluation
- multilingual
- trust-remote-code
---

# OmniScore DeBERTa-v3

`QCRI/OmniScore-deberta-v3` is a multi-output regression model for automatic text quality evaluation. It predicts four scalar scores in the range `[1, 5]`:

- `informativeness`
- `clarity`
- `plausibility`
- `faithfulness`

The model is built on top of `microsoft/deberta-v3-base` and published with custom model code (`AutoModel` + `trust_remote_code=True`).

## Model Details

- Base model: `microsoft/deberta-v3-base`
- Architecture: `ScorePredictorModel` (custom `transformers` model)
- Model type: encoder-only text regression
- Max sequence length: 512
- Number of outputs: 4
- Output range: `[1, 5]` (sigmoid-scaled in model head)
- Backbone hidden size: 768
- Saved dtype: `float32`

## Quick Access

Model page: `https://huggingface.co/QCRI/OmniScore-deberta-v3`

```python
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-deberta-v3"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)
```

## What Input To Provide

The model takes a single text string and returns four quality scores. For best results, keep a consistent prompt/input format during inference.

Recommended flat format:

```text
Task:
Source:
Reference:
Candidate:
```

Chat-style input can be flattened as:

```text
System: ...
User: ...
Assistant: ...
```

## Usage Examples

Install dependencies:

```bash
pip install -U torch transformers sentencepiece
```

### 1) Single Text Example

```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-deberta-v3"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

text = """Task: headline_evaluation
Source: Full article text goes here.
Candidate: Microsoft releases detailed model documentation."""

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

scores = {
    name: float(outputs.predictions[0, i])
    for i, name in enumerate(model.config.score_names)
}
print(scores)
```

### 2) Batch Example (GPU/CPU)

```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-deberta-v3"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).to(device).eval()

texts = [
    "Task: summarization\nSource: ...\nCandidate: ...",
    "Task: translation_evaluation\nSource: ...\nReference: ...\nCandidate: ...",
]

batch = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=512)
batch = {k: v.to(device) for k, v in batch.items()}

with torch.no_grad():
    pred = model(**batch).predictions

results = []
for row in pred.cpu():
    results.append({name: float(row[i]) for i, name in enumerate(model.config.score_names)})
print(results)
```

### 3) Chat Messages Helper

```python
import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "QCRI/OmniScore-deberta-v3"
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True).eval()

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a concise summary of this article."},
    {"role": "assistant", "content": "Here is a short summary..."},
]

# Join with newlines to match the recommended flattened chat format.
flat_text = "\n".join(f"{m['role'].capitalize()}: {m['content']}" for m in messages)

inputs = tokenizer(flat_text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)

print({name: float(outputs.predictions[0, i]) for i, name in enumerate(model.config.score_names)})
```

### Programmatic Download (Optional)

```python
from huggingface_hub import snapshot_download

local_dir = snapshot_download("QCRI/OmniScore-deberta-v3")
print(local_dir)
```

## Data and Task Coverage

This checkpoint is for multi-task text quality scoring and is evaluated on a test set covering:

- Chat evaluation
- Headline evaluation
- Paraphrase evaluation
- QA evaluation
- Summarization evaluation
- Translation evaluation

The underlying project data is multilingual and multi-domain.

## Intended Use

Use this model to score generated-text quality (or response quality) as a supporting signal in:

- evaluation dashboards
- ranking experiments
- offline model comparison
- human-in-the-loop workflows

It is not intended as the sole decision maker in high-stakes or safety-critical settings.

## Limitations

- Scores are continuous estimates and should not be treated as absolute truth.
- Performance differs by task, language, and domain.
- The model can inherit annotation noise and dataset biases.
- Inputs longer than 512 tokens are truncated.
- Low correlation metrics on some dimensions indicate that rank ordering can be weak for certain subsets.

## Responsible Use

Recommended:

- Use as a decision-support signal, not as a sole decision maker.
- Calibrate thresholds on your own validation set before production use.
- Monitor by language/task slices for fairness and reliability.

Not recommended:

- High-stakes automated decisions without human oversight.
- Out-of-domain deployment without re-validation.

## Reproducibility Notes

Published artifacts include:

- `model.safetensors`
- `config.json`
- `configuration_score_predictor.py`
- `modeling_score_predictor.py`
- tokenizer files
- `metrics_final.json`
- `predictions.jsonl`

Load with `trust_remote_code=True` because the architecture is custom.
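When preparing inputs in the recommended flat `Task:`/`Source:`/`Reference:`/`Candidate:` format, a small string-building helper keeps formatting consistent across calls. The sketch below is illustrative only: `build_flat_input` is a hypothetical name, not part of the published repository.

```python
from typing import Optional


def build_flat_input(
    task: str,
    source: str,
    reference: Optional[str] = None,
    candidate: str = "",
) -> str:
    """Assemble the flat Task/Source/Reference/Candidate input format.

    The Reference line is omitted when no reference text is available,
    e.g. for tasks without a gold reference.
    """
    lines = [f"Task: {task}", f"Source: {source}"]
    if reference is not None:
        lines.append(f"Reference: {reference}")
    lines.append(f"Candidate: {candidate}")
    return "\n".join(lines)


# Example: a summarization input with no reference text.
text = build_flat_input(
    task="summarization",
    source="Full article text goes here.",
    candidate="A short candidate summary.",
)
print(text)
```

The resulting string can be passed directly to the tokenizer as in the usage examples above.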
## Citation

If you use this model, please cite the project/repository and this model URL:

```bibtex
@misc{qcri_omniscore_deberta_v3,
  title = {OmniScore DeBERTa-v3},
  author = {QCRI},
  year = {2026},
  howpublished = {\url{https://huggingface.co/QCRI/OmniScore-deberta-v3}}
}
```