metadata
library_name: transformers
language:
  - en
model-index:
  - name: Mellum2 Base
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: humaneval
          name: HumanEval
        metrics:
          - name: pass@1
            type: pass@1
            value: 41.46
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: humaneval_plus
          name: HumanEval+
        metrics:
          - name: pass@1
            type: pass@1
            value: 37.2
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: mbpp
          name: MBPP
        metrics:
          - name: pass@1
            type: pass@1
            value: 62.4
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: mbpp_plus
          name: MBPP+
        metrics:
          - name: pass@1
            type: pass@1
            value: 61.4
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: multipl-e
          name: MultiPL-E HumanEval, 7 languages
        metrics:
          - name: pass@1
            type: pass@1
            value: 20.97
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: cruxeval
          name: CRUXEval-I
        metrics:
          - name: pass@1
            type: pass@1
            value: 45.38
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: cruxeval
          name: CRUXEval-O
        metrics:
          - name: pass@1
            type: pass@1
            value: 43.88
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: cais/mmlu
          name: MMLU
        metrics:
          - name: accuracy
            type: acc
            value: 70.87
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: mmlu-pro
          name: MMLU-Pro
        metrics:
          - name: exact match
            type: exact_match
            value: 59.31
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: bbh
          name: BBH
        metrics:
          - name: exact match
            type: exact_match
            value: 74.9
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: ai2_arc
          name: ARC-Challenge
        metrics:
          - name: normalized accuracy
            type: acc_norm
            value: 53.5
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: hellaswag
          name: HellaSwag
        metrics:
          - name: normalized accuracy
            type: acc_norm
            value: 73.72
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: winogrande
          name: WinoGrande
        metrics:
          - name: accuracy
            type: acc
            value: 65.51
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: truthful_qa
          name: TruthfulQA MC2
        metrics:
          - name: MC2
            type: mc2
            value: 44.51
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: gsm8k
          name: GSM8K
        metrics:
          - name: exact match
            type: exact_match
            value: 81.73
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: hendrycks_math
          name: MATH
        metrics:
          - name: exact match
            type: exact_match
            value: 9.96
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: gpqa
          name: GPQA Diamond
        metrics:
          - name: accuracy
            type: acc
            value: 31.31
            verified: false
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          type: gpqa
          name: GPQA Main
        metrics:
          - name: accuracy
            type: acc
            value: 35.04
            verified: false
license: apache-2.0

Mellum2 Base

Use this checkpoint as the starting point for your own fine-tuning, alignment, or domain adaptation on top of the long-context base. For instruction-following or reasoning tasks out of the box, use the Instruct or Thinking checkpoints instead.

Mellum2 Base Highlights

Mellum2 Base is a long-context pretrained causal language model trained by JetBrains.

The model uses a Mixture-of-Experts architecture with 64 experts and activates 8 experts per token. It uses a combination of sliding-window and full attention layers, with a context length of 131,072 tokens.
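
As a rough illustration of what "64 experts, 8 activated per token" means, the sketch below routes one token by picking the 8 highest-scoring experts and renormalizing their weights. The softmax gate, the renormalization step, and the function name `route_token` are assumptions made for illustration; the source does not describe the actual router.

```python
import math

NUM_EXPERTS = 64   # from the model description
TOP_K = 8          # experts activated per token


def route_token(router_logits, top_k=TOP_K):
    """Return [(expert_index, weight)] for the top_k experts of one token.

    Assumption: softmax gating with renormalized top-k weights.
    """
    # Numerically stable softmax over all expert logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the top_k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Renormalize so the selected weights sum to 1.
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


# Hypothetical logits for a single token, one per expert.
logits = [0.1 * i for i in range(NUM_EXPERTS)]
routed = route_token(logits)
```

Only the selected experts run for that token, which is why the active parameter count (2.5B) is far below the total (12B).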

This is the long-context base, produced from Mellum2-12B-A2.5B-Base-Pretrain by a layer-selective YaRN extension stage that re-maps RoPE frequencies on the global-attention layers only. It is the shared starting point for the released Instruct and Thinking variants.
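
To make the idea concrete, here is a minimal sketch of YaRN-style frequency re-mapping applied only to layers flagged as global attention. It follows the published "NTK-by-parts" interpolation. The head dimension, RoPE base, original context length, scale factor, and the `alpha`/`beta` ramp bounds are illustrative assumptions, not values from the Mellum2 configuration, and all function names are hypothetical. YaRN's attention-temperature adjustment is omitted for brevity.

```python
import math


def rope_inv_freq(head_dim, base=10000.0):
    """Standard RoPE inverse frequencies, one per pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def yarn_inv_freq(inv_freq, scale, orig_ctx, alpha=1.0, beta=32.0):
    """Blend interpolated and original frequencies per dimension.

    Dimensions that complete many rotations within orig_ctx are left
    unchanged; dimensions that complete less than one are divided by scale.
    """
    remapped = []
    for theta in inv_freq:
        wavelength = 2.0 * math.pi / theta
        rotations = orig_ctx / wavelength
        # Linear ramp from 0 (fully interpolated) to 1 (unchanged).
        gamma = min(1.0, max(0.0, (rotations - alpha) / (beta - alpha)))
        remapped.append((1.0 - gamma) * theta / scale + gamma * theta)
    return remapped


def inv_freq_for_layer(is_global, head_dim, scale, orig_ctx):
    """Layer-selective variant: only global-attention layers are re-mapped."""
    freqs = rope_inv_freq(head_dim)
    return yarn_inv_freq(freqs, scale, orig_ctx) if is_global else freqs


# Hypothetical numbers: extend an 8,192-token model by a factor of 16.
base_freqs = rope_inv_freq(64)
global_freqs = inv_freq_for_layer(True, 64, scale=16.0, orig_ctx=8192)
local_freqs = inv_freq_for_layer(False, 64, scale=16.0, orig_ctx=8192)
```

Sliding-window layers only ever attend over a short span, so leaving their frequencies untouched keeps their behavior identical to the pre-extension checkpoint.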

Mellum2 Model Family

This repository contains one checkpoint from the Mellum2 family.

| Checkpoint | Description |
|---|---|
| Base Pretrain | Base checkpoint before long-context extension |
| Base | Final base model |
| Instruct SFT | Supervised instruction-tuned checkpoint |
| Thinking SFT | Supervised thinking checkpoint |
| Instruct | RL-tuned instruction model |
| Thinking | RL-tuned thinking model |

Model Overview

Mellum2 Base has the following features:

  • Number of Layers: 28
  • Hidden Size: 2304
  • Intermediate Size: 7168
  • MoE Intermediate Size: 896
  • Number of Experts: 64
  • Number of Activated Experts: 8
  • Number of Attention Heads (GQA): 32 for Q and 4 for KV
  • Context Length: 131,072
  • Sliding Window: 1,024
  • Vocabulary Size: 98,304
  • Precision: bfloat16
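
Two of the figures above can be read as simple index arithmetic. The sketch below shows how 32 query heads might share 4 KV heads, and which key positions a query can see on a sliding-window layer. The contiguous head grouping and a window that includes the current token are common conventions assumed here, not details confirmed by the source; the helper names are hypothetical.

```python
NUM_Q_HEADS = 32       # from the feature list
NUM_KV_HEADS = 4       # from the feature list
SLIDING_WINDOW = 1024  # from the feature list


def kv_head_for(q_head):
    """Map a query head to the KV head it shares (assumed contiguous groups)."""
    group_size = NUM_Q_HEADS // NUM_KV_HEADS  # 8 query heads per KV head
    return q_head // group_size


def visible_keys(q_pos, window=SLIDING_WINDOW):
    """Key positions a causal sliding-window layer lets position q_pos attend to.

    Assumption: the window counts the current token.
    """
    start = max(0, q_pos - window + 1)
    return range(start, q_pos + 1)
```

On a full-attention layer the same query would see every position from 0 to `q_pos`, which is what lets the model use the whole 131,072-token context.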

Serving with vLLM

vllm serve JetBrains/Mellum2-12B-A2.5B-Base --max-model-len 131072

Quickstart

Text-Only Input (base model — use the completions endpoint, not chat)

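from openai import OpenAI

# The client reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment;
# point OPENAI_BASE_URL at the vLLM server (for example http://localhost:8000/v1).
client = OpenAI()

completion = client.completions.create(
    model="JetBrains/Mellum2-12B-A2.5B-Base",
    prompt="def fibonacci(n):\n    ",
    max_tokens=81920,
    temperature=0.6,
    top_p=0.95,
    # top_k is not an OpenAI API parameter; vLLM accepts it through extra_body.
    extra_body={
        "top_k": 20,
    },
)
# Print only the generated continuation rather than the whole response object.
print("Completion:", completion.choices[0].text)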
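
Because the prompt and the generated tokens share the 131,072-token context, a long prompt leaves less room for `max_tokens`. The helper below is a small sketch of that budgeting. The name `clamp_max_tokens` is hypothetical, and it assumes you already know the prompt's token count from the model's tokenizer.

```python
CONTEXT_LENGTH = 131_072  # matches --max-model-len in the serve command


def clamp_max_tokens(prompt_tokens, requested, context_length=CONTEXT_LENGTH):
    """Cap the generation budget so prompt + output fits in the context."""
    if prompt_tokens >= context_length:
        raise ValueError("prompt alone fills or exceeds the context window")
    return min(requested, context_length - prompt_tokens)


# A short prompt keeps the full requested budget.
short_budget = clamp_max_tokens(prompt_tokens=10, requested=81_920)
# A 100,000-token prompt leaves room for at most 31,072 new tokens.
long_budget = clamp_max_tokens(prompt_tokens=100_000, requested=81_920)
```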

Evaluation

Mellum2 Base pretraining results compared with similarly sized open base models. All values are self-reported by JetBrains.

| Benchmark | Mellum2 (12B-A2.5B) | OLMo-3 (7B) | Qwen2.5 (7B) | Qwen3 (4B) | Qwen3.5 (4B) |
|---|---|---|---|---|---|
| Code Generation | | | | | |
| HumanEval | 41.5 | 45.1 | 55.5 | 57.3 | 50.0 |
| HumanEval+ | 37.2 | 39.6 | 47.0 | 51.2 | 43.9 |
| MBPP | 62.4 | 50.6 | 63.6 | 67.0 | 52.2 |
| MBPP+ | 61.4 | 52.9 | 64.0 | 64.5 | 55.0 |
| MultiPL-E (7 langs) | 21.0 | 10.0 | 19.2 | 26.0 | 12.1 |
| CRUXEval-I | 45.4 | 38.8 | 44.0 | 44.6 | 49.1 |
| CRUXEval-O | 43.9 | 36.6 | 42.9 | 43.5 | 43.2 |
| Knowledge & Reasoning | | | | | |
| MMLU | 70.9 | 62.1 | 71.8 | 71.1 | 74.2 |
| MMLU-Pro | 59.3 | 34.5 | 48.6 | 51.5 | 52.4 |
| BBH | 74.9 | 63.6 | 69.0 | 71.3 | 80.2 |
| ARC-Challenge | 53.5 | 53.6 | 51.3 | 51.2 | 54.9 |
| HellaSwag | 73.7 | 74.2 | 78.9 | 73.7 | 75.3 |
| WinoGrande | 65.5 | 69.5 | 73.3 | 71.2 | 70.8 |
| TruthfulQA MC2 | 44.5 | 47.0 | 56.4 | 53.5 | 52.1 |
| Math & Science | | | | | |
| GSM8K | 81.7 | 73.5 | 81.9 | 82.0 | 80.1 |
| MATH | 10.0 | 18.7 | 24.6 | 27.7 | 25.3 |
| GPQA Diamond | 31.3 | 28.8 | 32.8 | 36.9 | 41.4 |
| GPQA Main | 35.0 | 27.9 | 34.2 | 36.8 | 40.2 |
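
The code-generation rows report pass@1. For readers unfamiliar with the metric, the sketch below implements the widely used unbiased pass@k estimator, where `n` samples are drawn per problem and `c` of them pass the tests. This is a generic illustration; the source does not state which harness or sampling settings JetBrains used, and the per-problem results shown are made up.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of the chance that at least one of k samples passes."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so one must pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average pass@k over problems, as a percentage.

    results: list of (n, c) pairs, one per problem.
    """
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical run: 4 problems, one greedy sample each, 2 solved.
score = benchmark_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 0)], k=1)
```

With a single sample per problem, pass@1 reduces to the fraction of problems solved.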

For more details, see the Mellum2 Technical Report.

License

Released under the Apache 2.0 license.