kiselyovd/grnti-text-classifier

Production-grade classifier for Russian scientific text, covering the 28 top-level GRNTI codes. Main model: XLM-RoBERTa-base fine-tuned on ai-forever/ru-scibench-grnti-classification.

Metrics (test split, n = 2772, 28 classes)

| Model | Top-1 | Top-5 | Macro F1 | Weighted F1 |
|---|---|---|---|---|
| FacebookAI/xlm-roberta-base | 72.4% | 96.8% | 72.3% | 72.3% |
| DeepPavlov/rubert-base-cased | 72.9% | 95.9% | 72.8% | 72.8% |
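The table's metrics are standard definitions; a minimal pure-Python sketch of how top-k accuracy and macro F1 are computed (the toy data below is illustrative, not from the benchmark):

```python
from collections import defaultdict

def top_k_accuracy(ranked_preds, labels, k):
    """ranked_preds[i] is a list of class ids sorted by descending score."""
    hits = sum(1 for preds, y in zip(ranked_preds, labels) if y in preds[:k])
    return hits / len(labels)

def macro_f1(preds, labels, num_classes):
    """Unweighted mean of per-class F1 scores (each class counts equally)."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for p, y in zip(preds, labels):
        if p == y:
            tp[y] += 1
        else:
            fp[p] += 1
            fn[y] += 1
    f1s = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / num_classes

# Toy example with 3 classes; top-1 uses only the first ranked prediction.
ranked = [[0, 1, 2], [2, 0, 1], [1, 2, 0], [0, 2, 1]]
labels = [0, 2, 2, 1]
print(top_k_accuracy(ranked, labels, k=1))
print(top_k_accuracy(ranked, labels, k=5))
print(macro_f1([r[0] for r in ranked], labels, num_classes=3))
```

Weighted F1 differs from macro F1 only in that per-class scores are averaged weighted by class support; on this benchmark's test split the two are nearly identical.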

Usage

from transformers import pipeline

# Returns the 5 highest-scoring GRNTI sections for the input text.
clf = pipeline("text-classification", model="kiselyovd/grnti-text-classifier", top_k=5)
clf("Исследование квантовой электродинамики в кристаллах.")  # "A study of quantum electrodynamics in crystals."

Intended use

This model is trained for Russian-language top-level GRNTI section classification (State Rubricator of Scientific and Technical Information). It has not been evaluated outside Russian scientific text and should not be used for generic multilingual classification.

Do not rely on this model for high-stakes decisions. Outputs are probabilistic and subject to training-data biases.

Training

  • Dataset: ai-forever/ru-scibench-grnti-classification (MIT, 28 476 train + 2 772 test).
  • Base model: FacebookAI/xlm-roberta-base.
  • Baseline: DeepPavlov/rubert-base-cased.
  • Precision: bf16-mixed on CUDA.
  • Optimizer: AdamW + linear warmup/decay.
  • A 10-trial Optuna sweep over learning rate, weight decay, and warmup ratio, followed by a final 5-epoch training run with the best parameters.
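The linear warmup/decay schedule used with AdamW ramps the learning rate from 0 to its peak over the warmup steps, then decays it linearly back to 0. A minimal sketch of that shape (the step count and peak rate below are illustrative, not the sweep's actual values):

```python
def linear_warmup_decay(step, total_steps, warmup_ratio, base_lr):
    """Learning rate at `step`: linear ramp during warmup, linear decay to 0 after."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr at the end of warmup to 0 at total_steps.
    return base_lr * (total_steps - step) / max(1, total_steps - warmup_steps)

total, warmup_ratio, lr = 1000, 0.1, 3e-5  # hypothetical values
schedule = [linear_warmup_decay(s, total, warmup_ratio, lr) for s in range(total + 1)]
print(schedule[0], schedule[100], schedule[1000])  # 0 at start, peak after warmup, 0 at end
```

This mirrors the behavior of a linear scheduler with warmup as commonly configured in transformers training loops.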

Source: https://github.com/kiselyovd/grnti-text-classifier

License

MIT.
