Was this built on DeepSeek V3 0324?

#6
by nova434431 - opened

From the model card: "trained from the ground up"

From our family of large models, Mistral Large 3 is a state-of-the-art general-purpose Multimodal granular Mixture-of-Experts model with 41B active parameters and 675B total parameters trained from the ground up with 3000 H200s.

and because they said it, it has to be true, right?

@evewashere Would you like some help learning how to inspect the model architecture or look at the vLLM commits to see how inference works?

Mistral AI_ org

If you compare the configurations of this model and DS V3 0324, you can see that this model has fewer but "fatter" experts:

Compare:
https://huggingface.co/deepseek-ai/DeepSeek-V3-0324/blob/main/config.json#L23

"moe_intermediate_size": 2048,
"n_routed_experts": 256,

with:
https://huggingface.co/mistralai/Mistral-Large-3-675B-Instruct-2512/blob/main/params.json

"expert_hidden_dim": 4096,
"num_experts": 128,

This should make it clear that ML3 is not trained/built on top of DS3. In addition, we use different rope scaling, select fewer experts per token and have an integrated vision encoder.
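To make the "fewer but fatter" point concrete, here is a minimal sketch comparing the two sets of expert hyperparameters quoted above (only the values cited in this thread; a full parameter-count comparison would also need each model's hidden size, which isn't quoted here):

```python
# Expert-layer hyperparameters as quoted from each repo's config file.
ds_v3 = {"n_routed_experts": 256, "moe_intermediate_size": 2048}
ml3 = {"num_experts": 128, "expert_hidden_dim": 4096}

# ML3 routes over half as many experts...
expert_count_ratio = ml3["num_experts"] / ds_v3["n_routed_experts"]
# ...but each expert's FFN intermediate dimension is twice as wide.
expert_width_ratio = ml3["expert_hidden_dim"] / ds_v3["moe_intermediate_size"]

print(f"expert count ratio (ML3 / DS-V3): {expert_count_ratio}")  # 0.5
print(f"expert width ratio (ML3 / DS-V3): {expert_width_ratio}")  # 2.0
```

If the two models shared a lineage, you would expect these shapes to match (or differ by a fine-tune-compatible transformation), not to be structurally different like this.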

It is, though, obviously heavily inspired by DS3 (the model architecture is more or less the same, as you can see from this file: https://github.com/vllm-project/vllm/blob/83319b44c26af45de4753c74f55a07df8c637a25/vllm/model_executor/models/mistral_large_3.py#L11), similar to Kimi.

patrickvonplaten changed discussion status to closed