Want a smaller model? Download Sarvam-30B!
Introduction
Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for superior performance across a wide range of complex tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.
This repository provides FP8 quantized weights for Sarvam-105B, enabling efficient deployment with reduced memory footprint and faster inference while preserving model quality. The weights are quantized using FP8 (E4M3) format via NVIDIA's ModelOpt toolkit.
Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.
A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.
Sarvam-105B is open-sourced under the Apache License. For more details, see our blog.
Architecture
The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (q_head_dim=192, split into RoPE and NoPE components; v_head_dim=128) and a large head_dim of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This approach improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 for a 128K context). A dense intermediate_size of 16384 and a moe_intermediate_size of 2048, combined with top-8 routing over 128 experts, increase per-token active capacity while keeping activation cost manageable. The model also uses one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
Inference
SGLang
Install the latest SGLang from source:

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
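After installing, a quick way to confirm the editable install is importable (an optional sanity check, assuming a working Python environment):

```shell
# Verify that the sglang package imports cleanly and print its version.
python -c "import sglang; print(sglang.__version__)"
```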
Launch Server
```shell
sglang serve --model-path sarvamai/sov_105b_fp8 \
  --port 3002 --host 0.0.0.0 \
  --mem-fraction-static 0.80 \
  --trust-remote-code \
  --tp 2 \
  --moe-runner-backend flashinfer_cutedsl \
  --prefill-attention-backend fa3 \
  --decode-attention-backend flashmla \
  --quantization modelopt_fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-mixed-chunk \
  --enable-dp-attention \
  --tool-call-parser glm45 \
  --reasoning-parser glm45
```
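Once the server is up, you can send a request to the OpenAI-compatible chat endpoint it exposes. A minimal sketch, assuming the port and model path from the launch command above and a running server:

```shell
# Send a chat completion request to the SGLang server started above.
# The server must be running and reachable on localhost:3002.
curl http://localhost:3002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvamai/sov_105b_fp8",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "max_tokens": 256
  }'
```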
vLLM
Note: a PR adding native support for the Sarvam models to vLLM is currently open (link), so there are two options:

Option 1: install vLLM from source (hard)

Option 2: hot-patch an existing install (easy)
- Run `hotpatch_vllm.py`. This will do the following:
  - install `vllm==0.15.0`
  - add 2 model entries to `registry.py`
  - download the model executors for `sarvam-105b` and `sarvam-30b`
Once this is done, you can launch the vLLM server.
Important: You must set `VLLM_USE_FLASHINFER_MOE_FP8=0` as an environment variable, otherwise the server will get stuck during compilation and crash.
```shell
VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-105b-fp8 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --port 3002
```
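As a quick sanity check after launch, you can list the served models on vLLM's OpenAI-compatible API. A sketch assuming the port from the command above and a running server:

```shell
# List the models served by the vLLM server started above.
# The server must be running and reachable on localhost:3002.
curl -s http://localhost:3002/v1/models
```

If the server launched correctly, the response should include the served model ID.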
Citation
```bibtex
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```
