Want a smaller model? Download Sarvam-30B!
Introduction
Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for superior performance across a wide range of complex tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.
This repository provides FP8 quantized weights for Sarvam-105B, enabling efficient deployment with reduced memory footprint and faster inference while preserving model quality. The weights are quantized using FP8 (E4M3) format via NVIDIA's ModelOpt toolkit.
Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.
A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.
Sarvam-105B is open-sourced under the Apache License. For more details, see our blog.
Architecture
The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (q_head_dim=192, split into RoPE and NoPE components; v_head_dim=128) and a large head_dim of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This approach improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 for a 128K context). A dense intermediate_size of 16384 and a moe_intermediate_size of 2048, combined with top-8 routing over 128 experts, increase per-token active capacity while keeping activation cost manageable. The model also uses one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
Inference
SGLang
Install the latest SGLang from source:

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
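After installing, a quick way to confirm the editable install is importable (an optional sanity check, assuming a working Python environment):

```shell
# Verify that the sglang package imports cleanly and print its version.
python -c "import sglang; print(sglang.__version__)"
```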
Launch Server
```shell
sglang serve --model-path sarvamai/sov_105b_fp8 \
  --port 3002 --host 0.0.0.0 \
  --mem-fraction-static 0.80 \
  --trust-remote-code \
  --tp 2 \
  --moe-runner-backend flashinfer_cutedsl \
  --prefill-attention-backend fa3 \
  --decode-attention-backend flashmla \
  --quantization modelopt_fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --enable-mixed-chunk \
  --enable-dp-attention \
  --tool-call-parser glm45 \
  --reasoning-parser glm45
```
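Once the server is up, you can send a request to the OpenAI-compatible chat endpoint it exposes. A minimal sketch, assuming the port and model path from the launch command above and a running server:

```shell
# Send a chat completion request to the SGLang server started above.
# The server must be running and reachable on localhost:3002.
curl http://localhost:3002/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvamai/sov_105b_fp8",
    "messages": [{"role": "user", "content": "What is 17 * 23?"}],
    "max_tokens": 256
  }'
```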
vLLM
Note: a PR adding native support for the Sarvam models to vLLM is currently open (link), so there are two options:

Option 1: install vLLM from source (hard)

Option 2: hot-patch an existing install (easy)
- Run `hotpatch_vllm.py`. This will do the following:
  - install `vllm==0.15.0`
  - add 2 model entries to `registry.py`
  - download the model executors for `sarvam-105b` and `sarvam-30b`
Once this is done, you can launch the vLLM server.
Important: You must set `VLLM_USE_FLASHINFER_MOE_FP8=0` as an environment variable, otherwise the server will get stuck during compilation and crash.
```shell
VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-105b-fp8 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --port 3002
```
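As a quick sanity check after launch, you can list the served models on vLLM's OpenAI-compatible API. A sketch assuming the port from the command above and a running server:

```shell
# List the models served by the vLLM server started above.
# The server must be running and reachable on localhost:3002.
curl -s http://localhost:3002/v1/models
```

If the server launched correctly, the response should include the served model ID.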
Citation
```bibtex
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```
