# kiselyovd/grnti-text-classifier
Production-grade Russian scientific-text classifier: 28 top-level GRNTI codes.
Main model: XLM-RoBERTa-base fine-tuned on ai-forever/ru-scibench-grnti-classification.
## Metrics (test split, n = 2772, 28 classes)
| Model | Top-1 | Top-5 | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| FacebookAI/xlm-roberta-base | 72.4% | 96.8% | 72.3% | 72.3% |
| DeepPavlov/rubert-base-cased | 72.9% | 95.9% | 72.8% | 72.8% |
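The reported metrics can be reproduced from raw class scores with plain Python. A minimal sketch on toy data (the scores and labels below are made up for illustration, not model output):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, y in zip(scores, labels):
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += y in topk
    return hits / len(labels)

def macro_f1(preds, labels, n_classes):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    f1s = []
    for c in range(n_classes):
        tp = sum(p == c and y == c for p, y in zip(preds, labels))
        fp = sum(p == c and y != c for p, y in zip(preds, labels))
        fn = sum(p != c and y == c for p, y in zip(preds, labels))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / n_classes

# Toy example: 3 classes, 4 samples.
scores = [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4], [0.3, 0.35, 0.4], [0.6, 0.3, 0.1]]
labels = [0, 1, 1, 2]
preds = [max(range(3), key=lambda c: row[c]) for row in scores]
print(top_k_accuracy(scores, labels, 1))  # 0.5
print(top_k_accuracy(scores, labels, 2))  # 0.75
print(macro_f1(preds, labels, 3))
```

Weighted F1 differs from macro F1 only in averaging per-class F1 by class support, which is why the two coincide here when classes are roughly balanced.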
## Usage

```python
from transformers import pipeline

clf = pipeline("text-classification", model="kiselyovd/grnti-text-classifier", top_k=5)
# Input (Russian): "A study of quantum electrodynamics in crystals."
clf("Исследование квантовой электродинамики в кристаллах.")
```
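With `top_k=5`, the pipeline returns one list of label/score dicts per input, sorted by score. A minimal post-processing sketch (the labels and scores below are illustrative stand-ins, not real model output; actual labels depend on the model's `id2label` mapping):

```python
# Illustrative shape of one pipeline result with top_k=5 (values are made up).
result = [
    {"label": "29", "score": 0.61},
    {"label": "47", "score": 0.18},
    {"label": "30", "score": 0.09},
    {"label": "31", "score": 0.07},
    {"label": "27", "score": 0.05},
]

# Keep only predictions above a confidence threshold, best first.
confident = [
    p for p in sorted(result, key=lambda p: p["score"], reverse=True)
    if p["score"] >= 0.1
]
print(confident)  # two entries survive the 0.1 threshold here
```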
## Intended use
This model is trained for Russian-language top-level GRNTI section classification (State Rubricator of Scientific and Technical Information). It has not been evaluated outside Russian scientific text and should not be used for generic multilingual classification.
Do not rely on this model for high-stakes decisions. Outputs are probabilistic and subject to training-data biases.
## Training
- Dataset: ai-forever/ru-scibench-grnti-classification (MIT, 28 476 train + 2 772 test).
- Base model: FacebookAI/xlm-roberta-base.
- Baseline: DeepPavlov/rubert-base-cased.
- Precision: bf16-mixed on CUDA.
- Optimizer: AdamW + linear warmup/decay.
- Optuna 10-trial sweep for lr/weight_decay/warmup_ratio, then 5-epoch final training with best params.
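The linear warmup/decay schedule above matches the shape produced by `transformers.get_linear_schedule_with_warmup`: the learning-rate multiplier ramps from 0 to 1 over the warmup steps, then decays linearly back to 0. A plain-Python sketch (step counts are illustrative; the real values come from `warmup_ratio` times the total number of training steps):

```python
def linear_warmup_decay(step, total_steps, warmup_steps):
    """LR multiplier: ramps 0 -> 1 over warmup_steps, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total, warmup = 1000, 100  # illustrative, not the repo's actual step counts
print(linear_warmup_decay(50, total, warmup))   # mid-warmup -> 0.5
print(linear_warmup_decay(100, total, warmup))  # end of warmup -> 1.0
print(linear_warmup_decay(550, total, warmup))  # halfway through decay -> 0.5
```

The per-step learning rate is this multiplier times the peak `lr` found by the Optuna sweep.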
Source: https://github.com/kiselyovd/grnti-text-classifier
## License

MIT.