Want a smaller model? Download Sarvam-30B!

Index

  1. Introduction
  2. Architecture
  3. Inference
  4. Citation

Introduction

Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for strong performance across a wide range of demanding tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.

This repository provides FP8 quantized weights for Sarvam-105B, enabling efficient deployment with reduced memory footprint and faster inference while preserving model quality. The weights are quantized using FP8 (E4M3) format via NVIDIA's ModelOpt toolkit.
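FP8 (E4M3) stores each value in 8 bits with a 4-bit exponent and 3-bit mantissa; the largest finite magnitude it can represent is 448, so per-tensor quantization typically scales a tensor by `amax / 448` before casting. Below is a minimal pure-Python sketch of that scaling step, not the actual ModelOpt pipeline: it clips to the E4M3 range but deliberately skips mantissa rounding, and the function names are ours.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp8_e4m3_scale(weights):
    """Per-tensor scale mapping the largest weight onto the E4M3 max value."""
    amax = max(abs(w) for w in weights)
    return amax / E4M3_MAX if amax > 0 else 1.0

def quantize_dequantize(weights):
    """Simulate FP8 round-trip: scale into E4M3 range, clip, scale back.

    Simplified: real E4M3 also rounds each value to a 3-bit mantissa grid,
    which is where the actual quantization error comes from.
    """
    scale = fp8_e4m3_scale(weights)
    out = []
    for w in weights:
        q = max(-E4M3_MAX, min(E4M3_MAX, w / scale))
        out.append(q * scale)
    return out, scale

# Example: the largest-magnitude weight maps exactly onto the E4M3 max.
w = [0.5, -1.25, 2.0, -0.875]
deq, scale = quantize_dequantize(w)
```

Because the sketch omits mantissa rounding, the round-trip here is lossless; in the real format, values between representable E4M3 points pick up small rounding error.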

Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.

A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.

Sarvam-105B is open-sourced under the Apache License. For more details, see our blog.

Architecture

The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (q_head_dim=192, split into RoPE and non-RoPE components; v_head_dim=128) and a large head_dim of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 and a 128K context window). The dense intermediate_size is 16384 and the moe_intermediate_size is 2048; combined with top-8 routing over 128 experts, this increases per-token active capacity while keeping activation cost manageable. The model has one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
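The routing described above (128 routed experts, top-8 selection per token, one always-on shared expert, routed outputs scaled by 2.5) can be sketched as follows. This is an illustrative reimplementation, not Sarvam's code: the function name `route` is ours, and the exact scoring scheme (softmax over all experts with top-k renormalization, rather than, say, sigmoid scoring) is an assumption.

```python
import math
import random

NUM_EXPERTS = 128     # routed experts
TOP_K = 8             # experts activated per token
ROUTED_SCALING = 2.5  # routed scaling factor from the description above

def route(router_logits):
    """Return (expert_id, weight) pairs for one token's top-8 experts.

    Assumed scheme: softmax over all experts, keep the top-k, renormalize
    their probabilities, then scale by the routed scaling factor. The shared
    expert runs unconditionally alongside these with weight 1.0.
    """
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]  # stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:TOP_K]
    denom = sum(probs[i] for i in topk)
    return [(i, ROUTED_SCALING * probs[i] / denom) for i in topk]

# Example: one token's router logits over all 128 experts.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
selected = route(logits)
```

Note that auxiliary-loss-free balancing (mentioned above) would additionally adjust expert selection with learned bias terms during training; that part is omitted here.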

Inference

SGLang

Install latest SGLang from source

git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

Launch Server

sglang serve --model-path sarvamai/sov_105b_fp8 \
  --port 3002 --host 0.0.0.0 \
  --mem-fraction-static 0.80 \
  --trust-remote-code \
  --tp 2 \
  --moe-runner-backend flashinfer_cutedsl \
  --prefill-attention-backend fa3 \
  --decode-attention-backend flashmla \
  --quantization modelopt_fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-mixed-chunk \
  --enable-dp-attention \
  --tool-call-parser glm45 \
  --reasoning-parser glm45
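Once the server is up, it exposes an OpenAI-compatible API on the configured port. Below is a minimal client sketch using only the standard library: the endpoint path and payload shape follow the OpenAI chat-completions convention, and the model name and port match the launch command above — adjust both to your deployment.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3002/v1"  # matches --port 3002 above

def build_chat_request(prompt, model="sarvamai/sov_105b_fp8",
                       max_tokens=256, temperature=0.2):
    """Build an OpenAI-style chat-completion payload for the server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def send_chat_request(payload):
    """POST the payload to the server and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request("Explain MoE routing in two sentences.")
# response = send_chat_request(payload)  # requires a running server
# print(response["choices"][0]["message"]["content"])
```

The same client works against the vLLM server described below, since both expose the OpenAI-compatible `/v1/chat/completions` route.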

vLLM

Note: a PR adding native support for the Sarvam models to vLLM is currently open (link). Until it lands, there are two options.

Option 1: install from source (hard)

  • Use the custom fork here: link
  • Follow the instructions here to install from source: link

Option 2: hot-patch (easy)

  • Run hotpatch_vllm.py
  • This will do the following:
    • install vllm==0.15.0
    • add 2 model entries to registry.py
    • download the model executors for sarvam-105b and sarvam-30b

Once this is done, you can launch the vLLM server.

Important: You must set VLLM_USE_FLASHINFER_MOE_FP8=0 as an environment variable; otherwise the server will hang during compilation and eventually crash.

VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-105b-fp8 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --port 3002

Citation

@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}