New IQ4_XS quant fails to load in llama.cpp (b9050) with a segfault
The previous version of the IQ4_XS quant works fine with llama.cpp (b9050).
Hi, can you give me any further details about this? Log lines or stack trace please?
In GDB:
srv load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
[New Thread 0x7fffc0d46000 (LWP 868754)]
[New Thread 0x7fffb5fff000 (LWP 868755)]
[New Thread 0x7fffb57fe000 (LWP 868756)]
[New Thread 0x7fffb4ffd000 (LWP 868757)]
[New Thread 0x7fffaffff000 (LWP 868758)]
[New Thread 0x7fffaf7fe000 (LWP 868759)]
[New Thread 0x7fffaeffd000 (LWP 868760)]
[New Thread 0x7fffae7fc000 (LWP 868761)]
[New Thread 0x7fffadffb000 (LWP 868762)]
[New Thread 0x7fffad7fa000 (LWP 868763)]
[New Thread 0x7fffacff9000 (LWP 868764)]
[New Thread 0x7fffa7fff000 (LWP 868765)]
[New Thread 0x7fffa77fe000 (LWP 868766)]
[New Thread 0x7fffa6ffd000 (LWP 868767)]
[New Thread 0x7fffa67fc000 (LWP 868768)]
[New Thread 0x7fffa5ffb000 (LWP 868769)]
[New Thread 0x7fffa57fa000 (LWP 868770)]
[New Thread 0x7fffa4ff9000 (LWP 868771)]
[New Thread 0x7fffa47f8000 (LWP 868772)]
Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x00007ffff795dd9e in ggml_mul_mat () from /home/uid/_llama_cpp/llama.cpp-b9050/build/bin/libggml-base.so.0
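The GDB output above stops at the top frame (`ggml_mul_mat`), while the request was for a stack trace. A minimal sketch of one way to capture full backtraces non-interactively is shown below. It only assembles a GDB command line from standard GDB options (`-batch`, `-ex`, `--args`); the helper name `gdb_backtrace_argv` is hypothetical and the shortened server command is an example.

```python
import shlex

def gdb_backtrace_argv(server_cmd: str) -> list:
    """Wrap a command line so GDB runs it and prints backtraces if it crashes."""
    return [
        "gdb", "-batch",               # run the -ex commands, then exit
        "-ex", "run",                  # start the program under GDB
        "-ex", "bt",                   # backtrace of the thread that stopped
        "-ex", "thread apply all bt",  # backtraces of every thread
        "--args", *shlex.split(server_cmd),
    ]

# Shortened example command; use the full llama-server invocation in practice.
cmd = "./llama-server -m ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf -fa 1 -c 96000"
argv = gdb_backtrace_argv(cmd)
print(shlex.join(argv))
```

Running the printed command in a shell should give the frames below `ggml_mul_mat`, which are usually what a maintainer needs to locate the caller.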
$ ./llama-server -m ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf -fa 1 -c 96000 --jinja -t 20 -ncmoe 0 --host 192.168.1.69 --batch-size 4096 --no-mmproj --temp 1.0 --top_p 0.95 --cache-type-k q8_0 --cache-type-v q8_0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b0-unknown
system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
Segmentation fault (core dumped)
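The log line above suggests reproducing the crash with `-fit off`. As a rough sketch, the helper below rewrites an argument list so that it ends with `-fit off`, dropping any earlier `-fit` value. The flag spelling is taken from the log message itself; the helper name `with_fit_off` and the shortened command are illustrative assumptions.

```python
import shlex

def with_fit_off(argv):
    """Return a copy of argv that ends with '-fit off', removing any prior -fit value."""
    out = []
    skip_next = False
    for arg in argv:
        if skip_next:          # this is the value that followed an earlier -fit
            skip_next = False
            continue
        if arg == "-fit":
            skip_next = True
            continue
        out.append(arg)
    return out + ["-fit", "off"]

original = shlex.split(
    "./llama-server -m ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf "
    "-fa 1 -c 96000 --jinja -t 20 -ncmoe 0"
)
variant = with_fit_off(original)
print(shlex.join(variant))
```

If the segfault disappears with the variant, the parameter-fitting step is the likely place to look; if it persists, the problem is probably elsewhere in model loading.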
$ ./llama-server -m ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf -fa 1 -c 96000 --jinja -t 20 -ncmoe 0 --host 192.168.1.69 --batch-size 4096 --no-mmproj --temp 1.0 --top_p 0.95 --verbose
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b0-unknown
system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 54 key-value pairs and 472 tensors from ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mimo2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 3: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 4: general.name str = MiMo V2.5
llama_model_loader: - kv 5: general.version str = V2.5
llama_model_loader: - kv 6: general.basename str = MiMo
llama_model_loader: - kv 7: general.size_label str = 256x7.2B
llama_model_loader: - kv 8: general.license str = mit
llama_model_loader: - kv 9: general.tags arr[str,6] = ["multimodal", "vision-language", "au...
llama_model_loader: - kv 10: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 11: mimo2.block_count u32 = 48
llama_model_loader: - kv 12: mimo2.context_length u32 = 1048576
llama_model_loader: - kv 13: mimo2.embedding_length u32 = 4096
llama_model_loader: - kv 14: mimo2.feed_forward_length u32 = 16384
llama_model_loader: - kv 15: mimo2.attention.head_count u32 = 64
llama_model_loader: - kv 16: mimo2.attention.head_count_kv arr[i32,48] = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, ...
llama_model_loader: - kv 17: mimo2.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: mimo2.rope.freq_base_swa f32 = 10000.000000
llama_model_loader: - kv 19: mimo2.expert_used_count u32 = 8
llama_model_loader: - kv 20: mimo2.expert_group_count u32 = 1
llama_model_loader: - kv 21: mimo2.expert_group_used_count u32 = 1
llama_model_loader: - kv 22: mimo2.expert_gating_func u32 = 2
llama_model_loader: - kv 23: mimo2.attention.key_length u32 = 192
llama_model_loader: - kv 24: mimo2.attention.value_length u32 = 128
llama_model_loader: - kv 25: mimo2.attention.sliding_window u32 = 128
llama_model_loader: - kv 26: mimo2.attention.sliding_window_pattern arr[i32,48] = [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, ...
llama_model_loader: - kv 27: mimo2.expert_count u32 = 256
llama_model_loader: - kv 28: mimo2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: mimo2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: mimo2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 31: mimo2.attention.value_scale f32 = 0.707000
llama_model_loader: - kv 32: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 33: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 34: tokenizer.ggml.tokens arr[str,152576] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 35: tokenizer.ggml.token_type arr[i32,152576] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 36: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 37: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 38: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 39: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 40: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
llama_model_loader: - kv 41: general.quantization_version u32 = 2
llama_model_loader: - kv 42: general.file_type u32 = 7
llama_model_loader: - kv 43: MoE_Quantization.ffn_up_exps str = IQ3_S
llama_model_loader: - kv 44: MoE_Quantization.ffn_gate_exps str = IQ3_S
llama_model_loader: - kv 45: MoE_Quantization.ffn_down_exps str = IQ4_XS
llama_model_loader: - kv 46: MoE_Quantization.type_default str = Q8_0
llama_model_loader: - kv 47: quantize.imatrix.file str = /mnt/srv/snowdrift/fp16/MiMo-V2.5/ima...
llama_model_loader: - kv 48: quantize.imatrix.dataset str = /mnt/srv/host/resources/KLD/calibrati...
llama_model_loader: - kv 49: quantize.imatrix.entries_count u32 = 287
llama_model_loader: - kv 50: quantize.imatrix.chunks_count u32 = 51
llama_model_loader: - kv 51: split.no u16 = 0
llama_model_loader: - kv 52: split.tensors.count i32 = 472
llama_model_loader: - kv 53: split.count u16 = 4
llama_model_loader: - type f32: 230 tensors
llama_model_loader: - type q8_0: 101 tensors
llama_model_loader: - type iq3_s: 94 tensors
llama_model_loader: - type iq4_xs: 47 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 136.78 GiB (3.80 BPW)
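The loader summary above can be sanity-checked with a little arithmetic. The sketch below uses only numbers from this log: the per-type tensor counts, the 472-tensor total, the 136.78 GiB file size, the parameter count printed further down (`model params = 308.78 B`), and the free VRAM reported by the next line (23845 MiB). Variable names are illustrative.

```python
GIB = 1024 ** 3

# Per-type tensor counts from the llama_model_loader lines.
tensor_counts = {"f32": 230, "q8_0": 101, "iq3_s": 94, "iq4_xs": 47}
total_tensors = sum(tensor_counts.values())

file_size_gib = 136.78   # "file size = 136.78 GiB"
n_params = 308.78e9      # "model params = 308.78 B" (printed later in the log)

# Bits per weight = total file bits / parameter count.
bpw = file_size_gib * GIB * 8 / n_params

# How many times larger the file is than the free VRAM on CUDA0.
free_vram_gib = 23845 / 1024
size_ratio = file_size_gib / free_vram_gib

print(total_tensors)          # matches the 472 tensors reported for this split
print(f"{bpw:.3f} BPW")
print(f"{size_ratio:.1f}x free VRAM")
```

The computed value lands within rounding of the logged 3.80 BPW (both inputs are themselves rounded), and the file is several times larger than the free VRAM, so most of the weights cannot be resident on the GPU at once.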
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23845 MiB free
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 151673 '<|mimo_audio_start|>' is not marked as EOG
load: control token: 151669 '<|audio_pad|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control-looking token: 128247 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control token: 151674 '<|mimo_audio_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151672 '<|mimo_audio_eod|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151670 '<|mimo_video_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151671 '<|mimo_video_end|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: printing all EOG tokens:
load: - 128247 ('</s>')
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 0.9312 MB
print_info: arch = mimo2
print_info: vocab_only = 0
print_info: no_alloc = 1
print_info: n_ctx_train = 1048576
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 48
print_info: n_head = 64
print_info: n_head_kv = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4]
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = [16, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16]
print_info: n_embd_k_gqa = [768, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768]
print_info: n_embd_v_gqa = [512, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 16384
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: freq_base_swa = 10000.0
print_info: freq_scale_swa = 1
print_info: n_embd_head_k_swa = 192
print_info: n_embd_head_v_swa = 128
print_info: n_rot_swa = 64
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 310B.A15B
print_info: model params = 308.78 B
print_info: general.name = MiMo V2.5
print_info: vocab type = BPE
print_info: n_vocab = 152576
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 128247 '</s>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
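The per-layer attention arrays in the `print_info` block above are related to each other, which makes them easy to cross-check. The sketch below assumes the usual grouped-query relations (`n_gqa = n_head / n_head_kv`, and the K/V widths being the head size times the number of KV heads) and rebuilds the arrays from the scalar values in the log.

```python
n_head = 64
n_embd_head_k = 192
n_embd_head_v = 128

# n_head_kv as printed: 4 on layers 0, 5, 11, ..., 47 and 8 elsewhere (48 layers).
n_head_kv = [4, 8, 8, 8, 8] + [4, 8, 8, 8, 8, 8] * 7 + [4]

n_gqa = [n_head // kv for kv in n_head_kv]               # query heads per KV head
n_embd_k_gqa = [n_embd_head_k * kv for kv in n_head_kv]  # K width per layer
n_embd_v_gqa = [n_embd_head_v * kv for kv in n_head_kv]  # V width per layer

print(len(n_head_kv))
print(n_gqa[:6])
print(n_embd_k_gqa[:6])
print(n_embd_v_gqa[:6])
```

The derived values (16 or 8 for `n_gqa`, 768 or 1536 for K, 512 or 1024 for V) agree with the logged arrays, so the attention metadata in this quant looks internally consistent.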
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: layer 0 assigned to device CUDA0, is_swa = 0
load_tensors: layer 1 assigned to device CUDA0, is_swa = 1
load_tensors: layer 2 assigned to device CUDA0, is_swa = 1
load_tensors: layer 3 assigned to device CUDA0, is_swa = 1
load_tensors: layer 4 assigned to device CUDA0, is_swa = 1
load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
load_tensors: layer 6 assigned to device CUDA0, is_swa = 1
load_tensors: layer 7 assigned to device CUDA0, is_swa = 1
load_tensors: layer 8 assigned to device CUDA0, is_swa = 1
load_tensors: layer 9 assigned to device CUDA0, is_swa = 1
load_tensors: layer 10 assigned to device CUDA0, is_swa = 1
load_tensors: layer 11 assigned to device CUDA0, is_swa = 0
load_tensors: layer 12 assigned to device CUDA0, is_swa = 1
load_tensors: layer 13 assigned to device CUDA0, is_swa = 1
load_tensors: layer 14 assigned to device CUDA0, is_swa = 1
load_tensors: layer 15 assigned to device CUDA0, is_swa = 1
load_tensors: layer 16 assigned to device CUDA0, is_swa = 1
load_tensors: layer 17 assigned to device CUDA0, is_swa = 0
load_tensors: layer 18 assigned to device CUDA0, is_swa = 1
load_tensors: layer 19 assigned to device CUDA0, is_swa = 1
load_tensors: layer 20 assigned to device CUDA0, is_swa = 1
load_tensors: layer 21 assigned to device CUDA0, is_swa = 1
load_tensors: layer 22 assigned to device CUDA0, is_swa = 1
load_tensors: layer 23 assigned to device CUDA0, is_swa = 0
load_tensors: layer 24 assigned to device CUDA0, is_swa = 1
load_tensors: layer 25 assigned to device CUDA0, is_swa = 1
load_tensors: layer 26 assigned to device CUDA0, is_swa = 1
load_tensors: layer 27 assigned to device CUDA0, is_swa = 1
load_tensors: layer 28 assigned to device CUDA0, is_swa = 1
load_tensors: layer 29 assigned to device CUDA0, is_swa = 0
load_tensors: layer 30 assigned to device CUDA0, is_swa = 1
load_tensors: layer 31 assigned to device CUDA0, is_swa = 1
load_tensors: layer 32 assigned to device CUDA0, is_swa = 1
load_tensors: layer 33 assigned to device CUDA0, is_swa = 1
load_tensors: layer 34 assigned to device CUDA0, is_swa = 1
load_tensors: layer 35 assigned to device CUDA0, is_swa = 0
load_tensors: layer 36 assigned to device CUDA0, is_swa = 1
load_tensors: layer 37 assigned to device CUDA0, is_swa = 1
load_tensors: layer 38 assigned to device CUDA0, is_swa = 1
load_tensors: layer 39 assigned to device CUDA0, is_swa = 1
load_tensors: layer 40 assigned to device CUDA0, is_swa = 1
load_tensors: layer 41 assigned to device CUDA0, is_swa = 0
load_tensors: layer 42 assigned to device CUDA0, is_swa = 1
load_tensors: layer 43 assigned to device CUDA0, is_swa = 1
load_tensors: layer 44 assigned to device CUDA0, is_swa = 1
load_tensors: layer 45 assigned to device CUDA0, is_swa = 1
load_tensors: layer 46 assigned to device CUDA0, is_swa = 1
load_tensors: layer 47 assigned to device CUDA0, is_swa = 0
load_tensors: layer 48 assigned to device CUDA0, is_swa = 0
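The layer assignment lines above can be summarized with a small parser. This is a sketch that assumes the exact line format shown in this log; the function name `parse_layer_assignments` is hypothetical, and the sample holds only the first seven lines.

```python
import re

# Matches lines such as:
#   load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
LAYER_RE = re.compile(
    r"load_tensors: layer\s+(\d+) assigned to device (\S+), is_swa = ([01])"
)

def parse_layer_assignments(log_text):
    """Map layer index -> (device, is_swa) for every assignment line found."""
    layers = {}
    for m in LAYER_RE.finditer(log_text):
        layers[int(m.group(1))] = (m.group(2), m.group(3) == "1")
    return layers

sample = """\
load_tensors: layer 0 assigned to device CUDA0, is_swa = 0
load_tensors: layer 1 assigned to device CUDA0, is_swa = 1
load_tensors: layer 2 assigned to device CUDA0, is_swa = 1
load_tensors: layer 3 assigned to device CUDA0, is_swa = 1
load_tensors: layer 4 assigned to device CUDA0, is_swa = 1
load_tensors: layer 5 assigned to device CUDA0, is_swa = 0
load_tensors: layer 6 assigned to device CUDA0, is_swa = 1
"""

layers = parse_layer_assignments(sample)
full_attention = [i for i, (_, swa) in sorted(layers.items()) if not swa]
devices = {dev for dev, _ in layers.values()}
print(full_attention, devices)
```

Applied to the full log, the `is_swa = 0` entries are layers 0, 5, 11, 17, 23, 29, 35, 41 and 47, plus entry 48, which is presumably the output layer. Every entry is assigned to CUDA0 at this stage.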
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_qkv.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_qkv.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_sinks.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.exp_probs_b.bias
create_tensor: loading tensor blk.2.attn_qkv.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_sinks.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate_inp.weight
create_tensor: loading tensor blk.2.ffn_gate_exps.weight
create_tensor: loading tensor blk.2.ffn_down_exps.weight
create_tensor: loading tensor blk.2.ffn_up_exps.weight
create_tensor: loading tensor blk.2.exp_probs_b.bias
create_tensor: loading tensor blk.3.attn_qkv.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_sinks.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate_inp.weight
create_tensor: loading tensor blk.3.ffn_gate_exps.weight
create_tensor: loading tensor blk.3.ffn_down_exps.weight
create_tensor: loading tensor blk.3.ffn_up_exps.weight
create_tensor: loading tensor blk.3.exp_probs_b.bias
create_tensor: loading tensor blk.4.attn_qkv.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_sinks.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate_inp.weight
create_tensor: loading tensor blk.4.ffn_gate_exps.weight
create_tensor: loading tensor blk.4.ffn_down_exps.weight
create_tensor: loading tensor blk.4.ffn_up_exps.weight
create_tensor: loading tensor blk.4.exp_probs_b.bias
create_tensor: loading tensor blk.5.attn_qkv.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate_inp.weight
create_tensor: loading tensor blk.5.ffn_gate_exps.weight
create_tensor: loading tensor blk.5.ffn_down_exps.weight
create_tensor: loading tensor blk.5.ffn_up_exps.weight
create_tensor: loading tensor blk.5.exp_probs_b.bias
create_tensor: loading tensor blk.6.attn_qkv.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_sinks.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate_inp.weight
create_tensor: loading tensor blk.6.ffn_gate_exps.weight
create_tensor: loading tensor blk.6.ffn_down_exps.weight
create_tensor: loading tensor blk.6.ffn_up_exps.weight
create_tensor: loading tensor blk.6.exp_probs_b.bias
create_tensor: loading tensor blk.7.attn_qkv.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_sinks.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate_inp.weight
create_tensor: loading tensor blk.7.ffn_gate_exps.weight
create_tensor: loading tensor blk.7.ffn_down_exps.weight
create_tensor: loading tensor blk.7.ffn_up_exps.weight
create_tensor: loading tensor blk.7.exp_probs_b.bias
create_tensor: loading tensor blk.8.attn_qkv.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_sinks.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate_inp.weight
create_tensor: loading tensor blk.8.ffn_gate_exps.weight
create_tensor: loading tensor blk.8.ffn_down_exps.weight
create_tensor: loading tensor blk.8.ffn_up_exps.weight
create_tensor: loading tensor blk.8.exp_probs_b.bias
create_tensor: loading tensor blk.9.attn_qkv.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_sinks.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate_inp.weight
create_tensor: loading tensor blk.9.ffn_gate_exps.weight
create_tensor: loading tensor blk.9.ffn_down_exps.weight
create_tensor: loading tensor blk.9.ffn_up_exps.weight
create_tensor: loading tensor blk.9.exp_probs_b.bias
create_tensor: loading tensor blk.10.attn_qkv.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_sinks.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate_inp.weight
create_tensor: loading tensor blk.10.ffn_gate_exps.weight
create_tensor: loading tensor blk.10.ffn_down_exps.weight
create_tensor: loading tensor blk.10.ffn_up_exps.weight
create_tensor: loading tensor blk.10.exp_probs_b.bias
create_tensor: loading tensor blk.11.attn_qkv.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate_inp.weight
create_tensor: loading tensor blk.11.ffn_gate_exps.weight
create_tensor: loading tensor blk.11.ffn_down_exps.weight
create_tensor: loading tensor blk.11.ffn_up_exps.weight
create_tensor: loading tensor blk.11.exp_probs_b.bias
create_tensor: loading tensor blk.12.attn_qkv.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_sinks.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate_inp.weight
create_tensor: loading tensor blk.12.ffn_gate_exps.weight
create_tensor: loading tensor blk.12.ffn_down_exps.weight
create_tensor: loading tensor blk.12.ffn_up_exps.weight
create_tensor: loading tensor blk.12.exp_probs_b.bias
create_tensor: loading tensor blk.13.attn_qkv.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_sinks.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate_inp.weight
create_tensor: loading tensor blk.13.ffn_gate_exps.weight
create_tensor: loading tensor blk.13.ffn_down_exps.weight
create_tensor: loading tensor blk.13.ffn_up_exps.weight
create_tensor: loading tensor blk.13.exp_probs_b.bias
create_tensor: loading tensor blk.14.attn_qkv.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_sinks.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate_inp.weight
create_tensor: loading tensor blk.14.ffn_gate_exps.weight
create_tensor: loading tensor blk.14.ffn_down_exps.weight
create_tensor: loading tensor blk.14.ffn_up_exps.weight
create_tensor: loading tensor blk.14.exp_probs_b.bias
create_tensor: loading tensor blk.15.attn_qkv.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_sinks.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate_inp.weight
create_tensor: loading tensor blk.15.ffn_gate_exps.weight
create_tensor: loading tensor blk.15.ffn_down_exps.weight
create_tensor: loading tensor blk.15.ffn_up_exps.weight
create_tensor: loading tensor blk.15.exp_probs_b.bias
create_tensor: loading tensor blk.16.attn_qkv.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_sinks.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate_inp.weight
create_tensor: loading tensor blk.16.ffn_gate_exps.weight
create_tensor: loading tensor blk.16.ffn_down_exps.weight
create_tensor: loading tensor blk.16.ffn_up_exps.weight
create_tensor: loading tensor blk.16.exp_probs_b.bias
create_tensor: loading tensor blk.17.attn_qkv.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate_inp.weight
create_tensor: loading tensor blk.17.ffn_gate_exps.weight
create_tensor: loading tensor blk.17.ffn_down_exps.weight
create_tensor: loading tensor blk.17.ffn_up_exps.weight
create_tensor: loading tensor blk.17.exp_probs_b.bias
create_tensor: loading tensor blk.18.attn_qkv.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_sinks.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate_inp.weight
create_tensor: loading tensor blk.18.ffn_gate_exps.weight
create_tensor: loading tensor blk.18.ffn_down_exps.weight
create_tensor: loading tensor blk.18.ffn_up_exps.weight
create_tensor: loading tensor blk.18.exp_probs_b.bias
create_tensor: loading tensor blk.19.attn_qkv.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_sinks.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate_inp.weight
create_tensor: loading tensor blk.19.ffn_gate_exps.weight
create_tensor: loading tensor blk.19.ffn_down_exps.weight
create_tensor: loading tensor blk.19.ffn_up_exps.weight
create_tensor: loading tensor blk.19.exp_probs_b.bias
create_tensor: loading tensor blk.20.attn_qkv.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_sinks.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate_inp.weight
create_tensor: loading tensor blk.20.ffn_gate_exps.weight
create_tensor: loading tensor blk.20.ffn_down_exps.weight
create_tensor: loading tensor blk.20.ffn_up_exps.weight
create_tensor: loading tensor blk.20.exp_probs_b.bias
create_tensor: loading tensor blk.21.attn_qkv.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_sinks.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate_inp.weight
create_tensor: loading tensor blk.21.ffn_gate_exps.weight
create_tensor: loading tensor blk.21.ffn_down_exps.weight
create_tensor: loading tensor blk.21.ffn_up_exps.weight
create_tensor: loading tensor blk.21.exp_probs_b.bias
create_tensor: loading tensor blk.22.attn_qkv.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_sinks.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate_inp.weight
create_tensor: loading tensor blk.22.ffn_gate_exps.weight
create_tensor: loading tensor blk.22.ffn_down_exps.weight
create_tensor: loading tensor blk.22.ffn_up_exps.weight
create_tensor: loading tensor blk.22.exp_probs_b.bias
create_tensor: loading tensor blk.23.attn_qkv.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate_inp.weight
create_tensor: loading tensor blk.23.ffn_gate_exps.weight
create_tensor: loading tensor blk.23.ffn_down_exps.weight
create_tensor: loading tensor blk.23.ffn_up_exps.weight
create_tensor: loading tensor blk.23.exp_probs_b.bias
create_tensor: loading tensor blk.24.attn_qkv.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_sinks.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate_inp.weight
create_tensor: loading tensor blk.24.ffn_gate_exps.weight
create_tensor: loading tensor blk.24.ffn_down_exps.weight
create_tensor: loading tensor blk.24.ffn_up_exps.weight
create_tensor: loading tensor blk.24.exp_probs_b.bias
create_tensor: loading tensor blk.25.attn_qkv.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_sinks.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate_inp.weight
create_tensor: loading tensor blk.25.ffn_gate_exps.weight
create_tensor: loading tensor blk.25.ffn_down_exps.weight
create_tensor: loading tensor blk.25.ffn_up_exps.weight
create_tensor: loading tensor blk.25.exp_probs_b.bias
create_tensor: loading tensor blk.26.attn_qkv.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_sinks.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate_inp.weight
create_tensor: loading tensor blk.26.ffn_gate_exps.weight
create_tensor: loading tensor blk.26.ffn_down_exps.weight
create_tensor: loading tensor blk.26.ffn_up_exps.weight
create_tensor: loading tensor blk.26.exp_probs_b.bias
create_tensor: loading tensor blk.27.attn_qkv.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_sinks.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate_inp.weight
create_tensor: loading tensor blk.27.ffn_gate_exps.weight
create_tensor: loading tensor blk.27.ffn_down_exps.weight
create_tensor: loading tensor blk.27.ffn_up_exps.weight
create_tensor: loading tensor blk.27.exp_probs_b.bias
create_tensor: loading tensor blk.28.attn_qkv.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_sinks.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate_inp.weight
create_tensor: loading tensor blk.28.ffn_gate_exps.weight
create_tensor: loading tensor blk.28.ffn_down_exps.weight
create_tensor: loading tensor blk.28.ffn_up_exps.weight
create_tensor: loading tensor blk.28.exp_probs_b.bias
create_tensor: loading tensor blk.29.attn_qkv.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate_inp.weight
create_tensor: loading tensor blk.29.ffn_gate_exps.weight
create_tensor: loading tensor blk.29.ffn_down_exps.weight
create_tensor: loading tensor blk.29.ffn_up_exps.weight
create_tensor: loading tensor blk.29.exp_probs_b.bias
create_tensor: loading tensor blk.30.attn_qkv.weight
create_tensor: loading tensor blk.30.attn_output.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.attn_sinks.weight
create_tensor: loading tensor blk.30.ffn_norm.weight
create_tensor: loading tensor blk.30.ffn_gate_inp.weight
create_tensor: loading tensor blk.30.ffn_gate_exps.weight
create_tensor: loading tensor blk.30.ffn_down_exps.weight
create_tensor: loading tensor blk.30.ffn_up_exps.weight
create_tensor: loading tensor blk.30.exp_probs_b.bias
create_tensor: loading tensor blk.31.attn_qkv.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.attn_sinks.weight
create_tensor: loading tensor blk.31.ffn_norm.weight
create_tensor: loading tensor blk.31.ffn_gate_inp.weight
create_tensor: loading tensor blk.31.ffn_gate_exps.weight
create_tensor: loading tensor blk.31.ffn_down_exps.weight
create_tensor: loading tensor blk.31.ffn_up_exps.weight
create_tensor: loading tensor blk.31.exp_probs_b.bias
create_tensor: loading tensor blk.32.attn_qkv.weight
create_tensor: loading tensor blk.32.attn_output.weight
create_tensor: loading tensor blk.32.attn_norm.weight
create_tensor: loading tensor blk.32.attn_sinks.weight
create_tensor: loading tensor blk.32.ffn_norm.weight
create_tensor: loading tensor blk.32.ffn_gate_inp.weight
create_tensor: loading tensor blk.32.ffn_gate_exps.weight
create_tensor: loading tensor blk.32.ffn_down_exps.weight
create_tensor: loading tensor blk.32.ffn_up_exps.weight
create_tensor: loading tensor blk.32.exp_probs_b.bias
create_tensor: loading tensor blk.33.attn_qkv.weight
create_tensor: loading tensor blk.33.attn_output.weight
create_tensor: loading tensor blk.33.attn_norm.weight
create_tensor: loading tensor blk.33.attn_sinks.weight
create_tensor: loading tensor blk.33.ffn_norm.weight
create_tensor: loading tensor blk.33.ffn_gate_inp.weight
create_tensor: loading tensor blk.33.ffn_gate_exps.weight
create_tensor: loading tensor blk.33.ffn_down_exps.weight
create_tensor: loading tensor blk.33.ffn_up_exps.weight
create_tensor: loading tensor blk.33.exp_probs_b.bias
create_tensor: loading tensor blk.34.attn_qkv.weight
create_tensor: loading tensor blk.34.attn_output.weight
create_tensor: loading tensor blk.34.attn_norm.weight
create_tensor: loading tensor blk.34.attn_sinks.weight
create_tensor: loading tensor blk.34.ffn_norm.weight
create_tensor: loading tensor blk.34.ffn_gate_inp.weight
create_tensor: loading tensor blk.34.ffn_gate_exps.weight
create_tensor: loading tensor blk.34.ffn_down_exps.weight
create_tensor: loading tensor blk.34.ffn_up_exps.weight
create_tensor: loading tensor blk.34.exp_probs_b.bias
create_tensor: loading tensor blk.35.attn_qkv.weight
create_tensor: loading tensor blk.35.attn_output.weight
create_tensor: loading tensor blk.35.attn_norm.weight
create_tensor: loading tensor blk.35.ffn_norm.weight
create_tensor: loading tensor blk.35.ffn_gate_inp.weight
create_tensor: loading tensor blk.35.ffn_gate_exps.weight
create_tensor: loading tensor blk.35.ffn_down_exps.weight
create_tensor: loading tensor blk.35.ffn_up_exps.weight
create_tensor: loading tensor blk.35.exp_probs_b.bias
create_tensor: loading tensor blk.36.attn_qkv.weight
create_tensor: loading tensor blk.36.attn_output.weight
create_tensor: loading tensor blk.36.attn_norm.weight
create_tensor: loading tensor blk.36.attn_sinks.weight
create_tensor: loading tensor blk.36.ffn_norm.weight
create_tensor: loading tensor blk.36.ffn_gate_inp.weight
create_tensor: loading tensor blk.36.ffn_gate_exps.weight
create_tensor: loading tensor blk.36.ffn_down_exps.weight
create_tensor: loading tensor blk.36.ffn_up_exps.weight
create_tensor: loading tensor blk.36.exp_probs_b.bias
create_tensor: loading tensor blk.37.attn_qkv.weight
create_tensor: loading tensor blk.37.attn_output.weight
create_tensor: loading tensor blk.37.attn_norm.weight
create_tensor: loading tensor blk.37.attn_sinks.weight
create_tensor: loading tensor blk.37.ffn_norm.weight
create_tensor: loading tensor blk.37.ffn_gate_inp.weight
create_tensor: loading tensor blk.37.ffn_gate_exps.weight
create_tensor: loading tensor blk.37.ffn_down_exps.weight
create_tensor: loading tensor blk.37.ffn_up_exps.weight
create_tensor: loading tensor blk.37.exp_probs_b.bias
create_tensor: loading tensor blk.38.attn_qkv.weight
create_tensor: loading tensor blk.38.attn_output.weight
create_tensor: loading tensor blk.38.attn_norm.weight
create_tensor: loading tensor blk.38.attn_sinks.weight
create_tensor: loading tensor blk.38.ffn_norm.weight
create_tensor: loading tensor blk.38.ffn_gate_inp.weight
create_tensor: loading tensor blk.38.ffn_gate_exps.weight
create_tensor: loading tensor blk.38.ffn_down_exps.weight
create_tensor: loading tensor blk.38.ffn_up_exps.weight
create_tensor: loading tensor blk.38.exp_probs_b.bias
create_tensor: loading tensor blk.39.attn_qkv.weight
create_tensor: loading tensor blk.39.attn_output.weight
create_tensor: loading tensor blk.39.attn_norm.weight
create_tensor: loading tensor blk.39.attn_sinks.weight
create_tensor: loading tensor blk.39.ffn_norm.weight
create_tensor: loading tensor blk.39.ffn_gate_inp.weight
create_tensor: loading tensor blk.39.ffn_gate_exps.weight
create_tensor: loading tensor blk.39.ffn_down_exps.weight
create_tensor: loading tensor blk.39.ffn_up_exps.weight
create_tensor: loading tensor blk.39.exp_probs_b.bias
create_tensor: loading tensor blk.40.attn_qkv.weight
create_tensor: loading tensor blk.40.attn_output.weight
create_tensor: loading tensor blk.40.attn_norm.weight
create_tensor: loading tensor blk.40.attn_sinks.weight
create_tensor: loading tensor blk.40.ffn_norm.weight
create_tensor: loading tensor blk.40.ffn_gate_inp.weight
create_tensor: loading tensor blk.40.ffn_gate_exps.weight
create_tensor: loading tensor blk.40.ffn_down_exps.weight
create_tensor: loading tensor blk.40.ffn_up_exps.weight
create_tensor: loading tensor blk.40.exp_probs_b.bias
create_tensor: loading tensor blk.41.attn_qkv.weight
create_tensor: loading tensor blk.41.attn_output.weight
create_tensor: loading tensor blk.41.attn_norm.weight
create_tensor: loading tensor blk.41.ffn_norm.weight
create_tensor: loading tensor blk.41.ffn_gate_inp.weight
create_tensor: loading tensor blk.41.ffn_gate_exps.weight
create_tensor: loading tensor blk.41.ffn_down_exps.weight
create_tensor: loading tensor blk.41.ffn_up_exps.weight
create_tensor: loading tensor blk.41.exp_probs_b.bias
create_tensor: loading tensor blk.42.attn_qkv.weight
create_tensor: loading tensor blk.42.attn_output.weight
create_tensor: loading tensor blk.42.attn_norm.weight
create_tensor: loading tensor blk.42.attn_sinks.weight
create_tensor: loading tensor blk.42.ffn_norm.weight
create_tensor: loading tensor blk.42.ffn_gate_inp.weight
create_tensor: loading tensor blk.42.ffn_gate_exps.weight
create_tensor: loading tensor blk.42.ffn_down_exps.weight
create_tensor: loading tensor blk.42.ffn_up_exps.weight
create_tensor: loading tensor blk.42.exp_probs_b.bias
create_tensor: loading tensor blk.43.attn_qkv.weight
create_tensor: loading tensor blk.43.attn_output.weight
create_tensor: loading tensor blk.43.attn_norm.weight
create_tensor: loading tensor blk.43.attn_sinks.weight
create_tensor: loading tensor blk.43.ffn_norm.weight
create_tensor: loading tensor blk.43.ffn_gate_inp.weight
create_tensor: loading tensor blk.43.ffn_gate_exps.weight
create_tensor: loading tensor blk.43.ffn_down_exps.weight
create_tensor: loading tensor blk.43.ffn_up_exps.weight
create_tensor: loading tensor blk.43.exp_probs_b.bias
create_tensor: loading tensor blk.44.attn_qkv.weight
create_tensor: loading tensor blk.44.attn_output.weight
create_tensor: loading tensor blk.44.attn_norm.weight
create_tensor: loading tensor blk.44.attn_sinks.weight
create_tensor: loading tensor blk.44.ffn_norm.weight
create_tensor: loading tensor blk.44.ffn_gate_inp.weight
create_tensor: loading tensor blk.44.ffn_gate_exps.weight
create_tensor: loading tensor blk.44.ffn_down_exps.weight
create_tensor: loading tensor blk.44.ffn_up_exps.weight
create_tensor: loading tensor blk.44.exp_probs_b.bias
create_tensor: loading tensor blk.45.attn_qkv.weight
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_sinks.weight
create_tensor: loading tensor blk.45.ffn_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.45.exp_probs_b.bias
create_tensor: loading tensor blk.46.attn_qkv.weight
create_tensor: loading tensor blk.46.attn_output.weight
create_tensor: loading tensor blk.46.attn_norm.weight
create_tensor: loading tensor blk.46.attn_sinks.weight
create_tensor: loading tensor blk.46.ffn_norm.weight
create_tensor: loading tensor blk.46.ffn_gate_inp.weight
create_tensor: loading tensor blk.46.ffn_gate_exps.weight
create_tensor: loading tensor blk.46.ffn_down_exps.weight
create_tensor: loading tensor blk.46.ffn_up_exps.weight
create_tensor: loading tensor blk.46.exp_probs_b.bias
create_tensor: loading tensor blk.47.attn_qkv.weight
create_tensor: loading tensor blk.47.attn_output.weight
create_tensor: loading tensor blk.47.attn_norm.weight
create_tensor: loading tensor blk.47.ffn_norm.weight
create_tensor: loading tensor blk.47.ffn_gate_inp.weight
create_tensor: loading tensor blk.47.ffn_gate_exps.weight
create_tensor: loading tensor blk.47.ffn_down_exps.weight
create_tensor: loading tensor blk.47.ffn_up_exps.weight
create_tensor: loading tensor blk.47.exp_probs_b.bias
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors: CUDA0 model buffer size = 0.00 MiB
load_tensors: CUDA_Host model buffer size = 0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max = 4
llama_context: n_ctx = 96000
llama_context: n_ctx_seq = 96000
llama_context: n_batch = 4096
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = enabled
llama_context: kv_unified = true
llama_context: freq_base = 10000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (96000) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: CUDA_Host output buffer size = 2.33 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 96000 cells
llama_kv_cache: layer 0: dev = CUDA0
llama_kv_cache: layer 1: filtered
llama_kv_cache: layer 2: filtered
llama_kv_cache: layer 3: filtered
llama_kv_cache: layer 4: filtered
llama_kv_cache: layer 5: dev = CUDA0
llama_kv_cache: layer 6: filtered
llama_kv_cache: layer 7: filtered
llama_kv_cache: layer 8: filtered
llama_kv_cache: layer 9: filtered
llama_kv_cache: layer 10: filtered
llama_kv_cache: layer 11: dev = CUDA0
llama_kv_cache: layer 12: filtered
llama_kv_cache: layer 13: filtered
llama_kv_cache: layer 14: filtered
llama_kv_cache: layer 15: filtered
llama_kv_cache: layer 16: filtered
llama_kv_cache: layer 17: dev = CUDA0
llama_kv_cache: layer 18: filtered
llama_kv_cache: layer 19: filtered
llama_kv_cache: layer 20: filtered
llama_kv_cache: layer 21: filtered
llama_kv_cache: layer 22: filtered
llama_kv_cache: layer 23: dev = CUDA0
llama_kv_cache: layer 24: filtered
llama_kv_cache: layer 25: filtered
llama_kv_cache: layer 26: filtered
llama_kv_cache: layer 27: filtered
llama_kv_cache: layer 28: filtered
llama_kv_cache: layer 29: dev = CUDA0
llama_kv_cache: layer 30: filtered
llama_kv_cache: layer 31: filtered
llama_kv_cache: layer 32: filtered
llama_kv_cache: layer 33: filtered
llama_kv_cache: layer 34: filtered
llama_kv_cache: layer 35: dev = CUDA0
llama_kv_cache: layer 36: filtered
llama_kv_cache: layer 37: filtered
llama_kv_cache: layer 38: filtered
llama_kv_cache: layer 39: filtered
llama_kv_cache: layer 40: filtered
llama_kv_cache: layer 41: dev = CUDA0
llama_kv_cache: layer 42: filtered
llama_kv_cache: layer 43: filtered
llama_kv_cache: layer 44: filtered
llama_kv_cache: layer 45: filtered
llama_kv_cache: layer 46: filtered
llama_kv_cache: layer 47: dev = CUDA0
llama_kv_cache: CUDA0 KV buffer size = 0.00 MiB
llama_kv_cache: size = 2109.38 MiB ( 96000 cells, 9 layers, 4/1 seqs), K (f16): 1265.62 MiB, V (f16): 843.75 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 192
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
llama_kv_cache_iswa: creating SWA KV cache, size = 1024 cells
llama_kv_cache: layer 0: filtered
llama_kv_cache: layer 1: dev = CUDA0
llama_kv_cache: layer 2: dev = CUDA0
llama_kv_cache: layer 3: dev = CUDA0
llama_kv_cache: layer 4: dev = CUDA0
llama_kv_cache: layer 5: filtered
llama_kv_cache: layer 6: dev = CUDA0
llama_kv_cache: layer 7: dev = CUDA0
llama_kv_cache: layer 8: dev = CUDA0
llama_kv_cache: layer 9: dev = CUDA0
llama_kv_cache: layer 10: dev = CUDA0
llama_kv_cache: layer 11: filtered
llama_kv_cache: layer 12: dev = CUDA0
llama_kv_cache: layer 13: dev = CUDA0
llama_kv_cache: layer 14: dev = CUDA0
llama_kv_cache: layer 15: dev = CUDA0
llama_kv_cache: layer 16: dev = CUDA0
llama_kv_cache: layer 17: filtered
llama_kv_cache: layer 18: dev = CUDA0
llama_kv_cache: layer 19: dev = CUDA0
llama_kv_cache: layer 20: dev = CUDA0
llama_kv_cache: layer 21: dev = CUDA0
llama_kv_cache: layer 22: dev = CUDA0
llama_kv_cache: layer 23: filtered
llama_kv_cache: layer 24: dev = CUDA0
llama_kv_cache: layer 25: dev = CUDA0
llama_kv_cache: layer 26: dev = CUDA0
llama_kv_cache: layer 27: dev = CUDA0
llama_kv_cache: layer 28: dev = CUDA0
llama_kv_cache: layer 29: filtered
llama_kv_cache: layer 30: dev = CUDA0
llama_kv_cache: layer 31: dev = CUDA0
llama_kv_cache: layer 32: dev = CUDA0
llama_kv_cache: layer 33: dev = CUDA0
llama_kv_cache: layer 34: dev = CUDA0
llama_kv_cache: layer 35: filtered
llama_kv_cache: layer 36: dev = CUDA0
llama_kv_cache: layer 37: dev = CUDA0
llama_kv_cache: layer 38: dev = CUDA0
llama_kv_cache: layer 39: dev = CUDA0
llama_kv_cache: layer 40: dev = CUDA0
llama_kv_cache: layer 41: filtered
llama_kv_cache: layer 42: dev = CUDA0
llama_kv_cache: layer 43: dev = CUDA0
llama_kv_cache: layer 44: dev = CUDA0
llama_kv_cache: layer 45: dev = CUDA0
llama_kv_cache: layer 46: dev = CUDA0
llama_kv_cache: layer 47: filtered
llama_kv_cache: CUDA0 KV buffer size = 0.00 MiB
llama_kv_cache: size = 195.00 MiB ( 1024 cells, 39 layers, 4/1 seqs), K (f16): 117.00 MiB, V (f16): 78.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 192
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 3776
Segmentation fault (core dumped)
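The KV cache sizes reported in the log above can be cross-checked by hand, which helps rule out a mis-sized cache as the cause of the crash. The following is a minimal sketch, not llama.cpp's own accounting code: it assumes 4 KV heads on the 9 full-attention layers and 8 KV heads on the 39 sliding-window layers (the per-layer `n_head_kv` values llama.cpp prints for this model), head sizes of 192 for K and 128 for V as shown in the log, and 2 bytes per f16 element. The function name `kv_cache_mib` is hypothetical.

```python
MIB = 1024 * 1024
BYTES_F16 = 2  # the log reports both K and V caches as f16


def kv_cache_mib(cells, layers, n_head_kv, head_dim_k=192, head_dim_v=128):
    """Return (K, V) cache sizes in MiB for a group of identical layers."""
    k_bytes = cells * layers * n_head_kv * head_dim_k * BYTES_F16
    v_bytes = cells * layers * n_head_kv * head_dim_v * BYTES_F16
    return k_bytes / MIB, v_bytes / MIB


# Non-SWA (full attention) cache: 96000 cells, 9 layers, assumed 4 KV heads
full_k, full_v = kv_cache_mib(cells=96000, layers=9, n_head_kv=4)

# SWA cache: 1024 cells, 39 layers, assumed 8 KV heads
swa_k, swa_v = kv_cache_mib(cells=1024, layers=39, n_head_kv=8)

print(f"non-SWA: K = {full_k} MiB, V = {full_v} MiB, total = {full_k + full_v} MiB")
print(f"SWA:     K = {swa_k} MiB, V = {swa_v} MiB, total = {swa_k + swa_v} MiB")
```

Under these assumptions the computed values agree with the `llama_kv_cache: size = ...` lines to within the log's two-decimal rounding, so the cache sizing itself looks consistent and the segmentation fault occurs later, during `sched_reserve`.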
I'll try to look at this tomorrow, thanks. I just finished uploading newly converted quants now that the PR has been merged into master, so those are perhaps worth trying?
I can confirm that even the previous version of MiMo-V2.5-Q4_K_M (before your update of 05/05/26) is NOT working with the latest llama.cpp, version: 9072 (6d57a49a7).
P.S. I am downloading the latest quants now and will report back later.
Sadly, the same behavior occurs with the latest "IQ4_XS" quant on llama.cpp version: 9075 (58e68df0f):
/home/vik/llms/llama.cpp/build/bin/llama-server \
--model ~/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf \
--alias "aessedai/MiMo-V2.5-IQ4_XS" \
-c 131072 --fit off --threads 24 -fa off --no-mmap --jinja \
--temp 0.8 --top-p 0.95 --seed 1976 \
--host 0.0.0.0 --port 5005 \
-b 4096 -ub 1024 -ctxcp 24 \
-ctk q8_0 -ctv q8_0 -cb \
--chat-template-kwargs '{"reasoning_effort": "normal"}'
==========================================
ggml_cuda_init: found 8 CUDA devices (Total VRAM: 192990 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24121 MiB
Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9075-58e68df0f
system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 47 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
llama_init_from_model: V cache quantization requires flash_attn
common_fit_params: encountered an error while trying to fit params to free device memory: failed to create llama_context from model
common_fit_params: fitting params to free memory took 2.02 seconds
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 55 key-value pairs and 508 tensors from /home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mimo2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 3: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 4: general.name str = MiMo V2.5
llama_model_loader: - kv 5: general.version str = V2.5
llama_model_loader: - kv 6: general.basename str = MiMo
llama_model_loader: - kv 7: general.size_label str = 256x8.2B
llama_model_loader: - kv 8: general.license str = mit
llama_model_loader: - kv 9: general.tags arr[str,6] = ["multimodal", "vision-language", "au...
llama_model_loader: - kv 10: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 11: mimo2.block_count u32 = 51
llama_model_loader: - kv 12: mimo2.context_length u32 = 1048576
llama_model_loader: - kv 13: mimo2.embedding_length u32 = 4096
llama_model_loader: - kv 14: mimo2.feed_forward_length u32 = 16384
llama_model_loader: - kv 15: mimo2.attention.head_count u32 = 64
llama_model_loader: - kv 16: mimo2.attention.head_count_kv arr[i32,51] = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, ...
llama_model_loader: - kv 17: mimo2.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: mimo2.rope.freq_base_swa f32 = 10000.000000
llama_model_loader: - kv 19: mimo2.expert_used_count u32 = 8
llama_model_loader: - kv 20: mimo2.expert_group_count u32 = 1
llama_model_loader: - kv 21: mimo2.expert_group_used_count u32 = 1
llama_model_loader: - kv 22: mimo2.expert_gating_func u32 = 2
llama_model_loader: - kv 23: mimo2.attention.key_length u32 = 192
llama_model_loader: - kv 24: mimo2.attention.value_length u32 = 128
llama_model_loader: - kv 25: mimo2.attention.sliding_window u32 = 128
llama_model_loader: - kv 26: mimo2.attention.sliding_window_pattern arr[i32,51] = [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, ...
llama_model_loader: - kv 27: mimo2.expert_count u32 = 256
llama_model_loader: - kv 28: mimo2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: mimo2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: mimo2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 31: mimo2.attention.value_scale f32 = 0.707000
llama_model_loader: - kv 32: mimo2.nextn_predict_layers u32 = 3
llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 34: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,152576] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,152576] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 41: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - kv 44: MoE_Quantization.ffn_up_exps str = IQ3_S
llama_model_loader: - kv 45: MoE_Quantization.ffn_gate_exps str = IQ3_S
llama_model_loader: - kv 46: MoE_Quantization.ffn_down_exps str = IQ4_XS
llama_model_loader: - kv 47: MoE_Quantization.type_default str = Q8_0
llama_model_loader: - kv 48: quantize.imatrix.file str = /mnt/srv/snowdrift/fp16/MiMo-V2.5/ima...
llama_model_loader: - kv 49: quantize.imatrix.dataset str = /mnt/srv/host/resources/KLD/calibrati...
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 287
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 51
llama_model_loader: - kv 52: split.no u16 = 0
llama_model_loader: - kv 53: split.tensors.count i32 = 508
llama_model_loader: - kv 54: split.count u16 = 4
llama_model_loader: - type f32: 248 tensors
llama_model_loader: - type q8_0: 119 tensors
llama_model_loader: - type iq3_s: 94 tensors
llama_model_loader: - type iq4_xs: 47 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 137.75 GiB (3.82 BPW)
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA2 (NVIDIA GeForce RTX 3090) (0000:81:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA3 (NVIDIA GeForce RTX 3090) (0000:82:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA4 (NVIDIA GeForce RTX 3090) (0000:83:00.0) - 23840 MiB free
llama_prepare_model_devices: using device CUDA5 (NVIDIA GeForce RTX 3090) (0000:84:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA6 (NVIDIA GeForce RTX 3090) (0000:c1:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA7 (NVIDIA GeForce RTX 3090) (0000:c2:00.0) - 23859 MiB free
load: 0 unused tokens
load: control-looking token: 128247 '' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 128247 ('')
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 0.9312 MB
print_info: arch = mimo2
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 51
print_info: n_head = 64
print_info: n_head_kv = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8]
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = [16, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8]
print_info: n_embd_k_gqa = [768, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536]
print_info: n_embd_v_gqa = [512, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: f_attn_value_scale = 0.7070
print_info: n_ff = 16384
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: freq_base_swa = 10000.0
print_info: freq_scale_swa = 1
print_info: n_embd_head_k_swa = 192
print_info: n_embd_head_v_swa = 128
print_info: n_rot_swa = 64
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 310B.A15B
print_info: model params = 309.77 B
print_info: general.name = MiMo V2.5
print_info: vocab type = BPE
print_info: n_vocab = 152576
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 128247 ''
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
model has unused tensor blk.48.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.48.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.48.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.48.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.49.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.49.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.49.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.50.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.50.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.50.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.layer_output_norm.weight (size = 16384 bytes) -- ignoring
load_tensors: offloading output layer to GPU
load_tensors: offloading 50 repeating layers to GPU
load_tensors: offloaded 52/52 layers to GPU
load_tensors: CUDA0 model buffer size = 17974.98 MiB
load_tensors: CUDA1 model buffer size = 20628.29 MiB
load_tensors: CUDA2 model buffer size = 17680.63 MiB
load_tensors: CUDA3 model buffer size = 20628.29 MiB
load_tensors: CUDA4 model buffer size = 17680.63 MiB
load_tensors: CUDA5 model buffer size = 17680.63 MiB
load_tensors: CUDA6 model buffer size = 20628.29 MiB
load_tensors: CUDA7 model buffer size = 6708.14 MiB
load_tensors: CUDA_Host model buffer size = 633.25 MiB
....................................................................................................
common_init_result: added logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_init_from_model: V cache quantization requires flash_attn
common_init_result: failed to create context with model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_from_params: failed to create context with model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
./llama-srv_MiMo-V2.5-IQ4_XS.sh: line 12: 35811 Segmentation fault (core dumped)
Thanks, will try to load it up on my system and see if I can spot the problem
@dehnhaide can you try with the mimo-v2.5-fattn branch?
Edit: also -fa off isn't compatible with -ctk q8_0 -ctv q8_0, you need FA to quant the KV cache.
@dehnhaide I was able to reproduce your issue with -fa off -ctk q8_0 -ctv q8_0; it works with -fa on, or without the -ctk q8_0 -ctv q8_0. I'd still recommend using the mimo-v2.5-fattn branch I linked in the readme + -fa on.
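For anyone scripting their launch commands, the rule behind the "V cache quantization requires flash_attn" line in the log can be sketched as a small pre-flight check. This is only an illustration: the function name `kv_cache_flags_ok` and the set of "unquantized" type names are assumptions, not part of llama.cpp, and it encodes nothing beyond what the log message states.

```python
# Hypothetical pre-flight check (not a llama.cpp API): mirrors the log message
# "V cache quantization requires flash_attn" seen in the failed start above.

# Assumption: these cache types count as "not quantized" for this check.
UNQUANTIZED_TYPES = {"f32", "f16", "bf16"}


def kv_cache_flags_ok(flash_attn: str, cache_type_v: str = "f16") -> bool:
    """Return False when a quantized V cache is requested with flash attention off."""
    v_is_quantized = cache_type_v not in UNQUANTIZED_TYPES
    return not (v_is_quantized and flash_attn == "off")


# The combination from the failing script: -fa off -ctk q8_0 -ctv q8_0
print(kv_cache_flags_ok("off", "q8_0"))
# The combination that starts: -fa on -ctk q8_0 -ctv q8_0
print(kv_cache_flags_ok("on", "q8_0"))
```

Running a check like this before launching the server would surface the flag mismatch as a clear message instead of a failed context creation.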
Damn! I am so sorry for the misreport! :( Indeed, after toggling "-fa on/off" while playing with the initial version (in mainline), I left it set to "off".
Happy to report that it now starts up (with version: 9079 (f9cd456ea)) BUT it's broken: the speed is abysmal! For example, a simple prompt in OpenCode:
prompt eval time = 391869.08 ms / 19154 tokens ( 20.46 ms per token, 48.88 tokens per second)
eval time = 17018.09 ms / 126 tokens ( 135.06 ms per token, 7.40 tokens per second)
total time = 408887.17 ms / 19280 tokens
slot release: id 2 | task 2 | stop processing: n_tokens = 19279, truncated = 0
With https://github.com/AesSedai/llama.cpp/tree/mimo-v2.5-fattn the new IQ4_XS quant works beautifully (about 40 tok/s in OpenCode) with "-fa on":
prompt eval time = 19402.71 ms / 19154 tokens ( 1.01 ms per token, 987.18 tokens per second)
eval time = 2592.45 ms / 102 tokens ( 25.42 ms per token, 39.35 tokens per second)
total time = 21995.15 ms / 19256 tokens
slot release: id 2 | task 2 | stop processing: n_tokens = 19255, truncated = 0
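To put the two runs side by side, here is a minimal sketch that recomputes tokens per second from the elapsed times and token counts pasted above and derives the speedup of the mimo-v2.5-fattn branch over mainline. The dictionary layout and helper names are just for illustration; the numbers are copied from the two timing dumps.

```python
# Timings copied from the two llama-server dumps above: (elapsed ms, tokens).
mainline = {"prompt": (391869.08, 19154), "generation": (17018.09, 126)}
fattn_branch = {"prompt": (19402.71, 19154), "generation": (2592.45, 102)}


def tokens_per_second(elapsed_ms: float, n_tokens: int) -> float:
    """Throughput as reported by llama-server: tokens divided by elapsed seconds."""
    return n_tokens / (elapsed_ms / 1000.0)


def speedup(phase: str) -> float:
    """How many times faster the fattn branch is than mainline for one phase."""
    return tokens_per_second(*fattn_branch[phase]) / tokens_per_second(*mainline[phase])


for phase in ("prompt", "generation"):
    print(
        f"{phase}: "
        f"{tokens_per_second(*mainline[phase]):.2f} -> "
        f"{tokens_per_second(*fattn_branch[phase]):.2f} tok/s "
        f"({speedup(phase):.1f}x)"
    )
```

On these numbers the branch is roughly 20x faster at prompt processing and roughly 5x faster at generation, so the prompt phase is where the fix matters most.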
Cool. I think this PR is close to merging, so the FA fixes will be on master after that: https://github.com/ggml-org/llama.cpp/pull/22812
The speed issue is known; I didn't want to pollute the initial support PR with the FA fixes. Basically, MiMo uses an asymmetric 192/128 head size (192 for K, 128 for V, as in the n_embd_head_k_swa / n_embd_head_v_swa lines of the log), which wasn't templated yet since no other model has needed it.
Appreciate all this. Downloading to test, but would y'all say this model has good pop culture knowledge? Decent prose? There's so much focus on agentic stuff, it's hard to tell if they threw everything else out (like Qwen).
@Downtown-Case I've mostly been using Pro, but it has become my favorite RP model due to its prose / natural language / intuition. I've only used non-Pro a little bit and it's probably 65% of the way there. It's not bad but there isn't really a replacement for tripling the total and active params. Too bad Pro doesn't support multimodality :(