ModernGENA base

ModernGENA is a DNA foundation model based on ModernBERT (a modernized BERT-style encoder architecture) adapted for genomic sequence modeling.
ModernGENA base is the 377M-parameter version introduced in the paper Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models.

For an example of loading the pre-trained model and fine-tuning it on a classification task, see the GENA_LM repository.

Technical features

  • ModernBERT-based encoder architecture
  • RoPE positional embeddings
  • hybrid local/global attention
  • pre-norm transformer blocks
  • GeGLU feed-forward layers
  • end-to-end unpadding
  • FlashAttention-based efficient inference on compatible hardware
  • torch.compile support
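The hybrid local/global attention above can be pictured as alternating attention masks across layers: most layers restrict each token to a sliding window of neighbors, while periodic layers attend globally. A minimal sketch in NumPy (the window size and layer period here are illustrative, not ModernGENA's actual configuration):

```python
import numpy as np

def local_attention_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask: token i may attend to token j only if
    |i - j| <= window // 2 (sliding-window local attention)."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window // 2

def global_attention_mask(seq_len: int) -> np.ndarray:
    """Full attention: every token attends to every token."""
    return np.ones((seq_len, seq_len), dtype=bool)

# In a ModernBERT-style stack, most layers use the cheap local mask and
# every k-th layer uses the global mask (window=4, k=3 are made up here).
masks = [
    global_attention_mask(8) if layer % 3 == 0 else local_attention_mask(8, 4)
    for layer in range(6)
]
```

Local layers cost O(n·w) rather than O(n²), which is what makes long genomic contexts affordable while the periodic global layers preserve long-range information flow.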

Model strengths

  • strong efficiency-quality trade-off
  • higher inference throughput with FlashAttention-based implementations
  • competitive downstream performance on the Nucleotide Transformer benchmark
  • intended to support long genomic contexts

This makes it a practical baseline for genomic modeling experiments and future architectural comparisons.

Tokenization

ModernGENA uses the 32k BPE vocabulary (AIRI-Institute/gena-lm-bert-base-t2t) introduced in GENA-LM, built over the DNA alphabet symbols A/T/G/C/N, with special tokens [CLS], [SEP], [PAD], [UNK], and [MASK].
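The real 32k vocabulary comes from the GENA-LM tokenizer; as an illustration only of how BPE builds multi-nucleotide tokens out of the single-symbol DNA alphabet, here is a minimal merge-learning sketch (the toy corpus and merge count are made up, and this omits the special tokens):

```python
from collections import Counter

def bpe_merges(sequences, num_merges):
    """Learn BPE merges over single-nucleotide symbols.
    Returns merged tokens, most frequent pair first."""
    # Start with each sequence split into single characters.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the winning pair with one token.
        new_corpus = []
        for toks in corpus:
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus.append(out)
        corpus = new_corpus
    return merges

# Frequent motifs in the toy corpus become single tokens.
print(bpe_merges(["ATGATGATG", "ATGCCATG"], num_merges=3))
```

Trained on a genome-scale corpus with a 32k target vocabulary, this procedure yields tokens spanning several nucleotides each, which is what lets the model cover long contexts with fewer positions.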

Pretraining corpus

  • 443 vertebrate genome assemblies
  • 353,574,093,776 bp total
  • Includes both forward strand and reverse complement sequences
  • Excludes sequences containing symbols other than A/C/G/T (e.g., ambiguity codes such as N)
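Including both strands means each training region is paired with its reverse complement. A minimal sketch of that operation (not the authors' preprocessing code):

```python
# Complement table over the DNA alphabet; N maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCN"))  # -> "NGCAT"
```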

To reduce overrepresentation of simple repeats and enrich biologically informative regions, training intervals were sampled around transcription start sites:

  • window: [-16 kbp, +8 kbp] around each unique TSS
  • overlapping intervals merged with BEDTools
  • both strands included for each resulting region
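The paper uses BEDTools for the merging step; a pure-Python sketch of the same sampling logic, under the window stated above (the TSS coordinates below are illustrative):

```python
def tss_windows(tss_positions, upstream=16_000, downstream=8_000):
    """[-16 kbp, +8 kbp] window around each TSS, clipped at 0."""
    return [(max(0, t - upstream), t + downstream) for t in tss_positions]

def merge_intervals(intervals):
    """Merge overlapping intervals, as `bedtools merge` does."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Two nearby TSSs collapse into one region; the distant one stays separate.
windows = tss_windows([20_000, 30_000, 120_000])
print(merge_intervals(windows))  # -> [(4000, 38000), (104000, 128000)]
```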

Load pretrained model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("AIRI-Institute/gena-lm-bert-base-t2t")
model = AutoModel.from_pretrained(
    "AIRI-Institute/moderngena-base",
    trust_remote_code=True,
    # FlashAttention 2 requires a compatible GPU; drop this argument
    # to fall back to the default attention implementation.
    attn_implementation="flash_attention_2",
)

Evaluation

For evaluation results, see the paper Back to BERT in 2026: ModernGENA as a Strong, Efficient Baseline for DNA Foundation Models.

