New IQ4_XS quant fails to load in llama.cpp (b9050) with a segfault

#8
by smalinin - opened

The previous version of the IQ4_XS quant works fine with llama.cpp (b9050).

Owner

Hi, can you give me any further details about this? Log lines or a stack trace, please?

Output when run under GDB:

srv    load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
[New Thread 0x7fffc0d46000 (LWP 868754)]
[New Thread 0x7fffb5fff000 (LWP 868755)]
[New Thread 0x7fffb57fe000 (LWP 868756)]
[New Thread 0x7fffb4ffd000 (LWP 868757)]
[New Thread 0x7fffaffff000 (LWP 868758)]
[New Thread 0x7fffaf7fe000 (LWP 868759)]
[New Thread 0x7fffaeffd000 (LWP 868760)]
[New Thread 0x7fffae7fc000 (LWP 868761)]
[New Thread 0x7fffadffb000 (LWP 868762)]
[New Thread 0x7fffad7fa000 (LWP 868763)]
[New Thread 0x7fffacff9000 (LWP 868764)]
[New Thread 0x7fffa7fff000 (LWP 868765)]
[New Thread 0x7fffa77fe000 (LWP 868766)]
[New Thread 0x7fffa6ffd000 (LWP 868767)]
[New Thread 0x7fffa67fc000 (LWP 868768)]
[New Thread 0x7fffa5ffb000 (LWP 868769)]
[New Thread 0x7fffa57fa000 (LWP 868770)]
[New Thread 0x7fffa4ff9000 (LWP 868771)]
[New Thread 0x7fffa47f8000 (LWP 868772)]

Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x00007ffff795dd9e in ggml_mul_mat () from /home/uid/_llama_cpp/llama.cpp-b9050/build/bin/libggml-base.so.0
$ ./llama-server -m ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf  -fa 1 -c 96000 --jinja  -t 20 -ncmoe 0 --host 192.168.1.69 --batch-size 4096 --no-mmproj --temp 1.0 --top_p 0.95 --cache-type-k q8_0 --cache-type-v q8_0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b0-unknown
system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Running without SSL
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
Segmentation fault (core dumped)
$ ./llama-server -m ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf  -fa 1 -c 96000 --jinja  -t 20 -ncmoe 0 --host 192.168.1.69 --batch-size 4096 --no-mmproj --temp 1.0 --top_p 0.95 --verbose
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b0-unknown
system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Running without SSL
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 54 key-value pairs and 472 tensors from ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mimo2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   3:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   4:                               general.name str              = MiMo V2.5
llama_model_loader: - kv   5:                            general.version str              = V2.5
llama_model_loader: - kv   6:                           general.basename str              = MiMo
llama_model_loader: - kv   7:                         general.size_label str              = 256x7.2B
llama_model_loader: - kv   8:                            general.license str              = mit
llama_model_loader: - kv   9:                               general.tags arr[str,6]       = ["multimodal", "vision-language", "au...
llama_model_loader: - kv  10:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv  11:                          mimo2.block_count u32              = 48
llama_model_loader: - kv  12:                       mimo2.context_length u32              = 1048576
llama_model_loader: - kv  13:                     mimo2.embedding_length u32              = 4096
llama_model_loader: - kv  14:                  mimo2.feed_forward_length u32              = 16384
llama_model_loader: - kv  15:                 mimo2.attention.head_count u32              = 64
llama_model_loader: - kv  16:              mimo2.attention.head_count_kv arr[i32,48]      = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, ...
llama_model_loader: - kv  17:                       mimo2.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  18:                   mimo2.rope.freq_base_swa f32              = 10000.000000
llama_model_loader: - kv  19:                    mimo2.expert_used_count u32              = 8
llama_model_loader: - kv  20:                   mimo2.expert_group_count u32              = 1
llama_model_loader: - kv  21:              mimo2.expert_group_used_count u32              = 1
llama_model_loader: - kv  22:                   mimo2.expert_gating_func u32              = 2
llama_model_loader: - kv  23:                 mimo2.attention.key_length u32              = 192
llama_model_loader: - kv  24:               mimo2.attention.value_length u32              = 128
llama_model_loader: - kv  25:             mimo2.attention.sliding_window u32              = 128
llama_model_loader: - kv  26:     mimo2.attention.sliding_window_pattern arr[i32,48]      = [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, ...
llama_model_loader: - kv  27:                         mimo2.expert_count u32              = 256
llama_model_loader: - kv  28:           mimo2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  29:                 mimo2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:     mimo2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  31:                mimo2.attention.value_scale f32              = 0.707000
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,152576]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,152576]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  37:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  40:                    tokenizer.chat_template str              = {%- if not add_generation_prompt is d...
llama_model_loader: - kv  41:               general.quantization_version u32              = 2
llama_model_loader: - kv  42:                          general.file_type u32              = 7
llama_model_loader: - kv  43:               MoE_Quantization.ffn_up_exps str              = IQ3_S
llama_model_loader: - kv  44:             MoE_Quantization.ffn_gate_exps str              = IQ3_S
llama_model_loader: - kv  45:             MoE_Quantization.ffn_down_exps str              = IQ4_XS
llama_model_loader: - kv  46:              MoE_Quantization.type_default str              = Q8_0
llama_model_loader: - kv  47:                      quantize.imatrix.file str              = /mnt/srv/snowdrift/fp16/MiMo-V2.5/ima...
llama_model_loader: - kv  48:                   quantize.imatrix.dataset str              = /mnt/srv/host/resources/KLD/calibrati...
llama_model_loader: - kv  49:             quantize.imatrix.entries_count u32              = 287
llama_model_loader: - kv  50:              quantize.imatrix.chunks_count u32              = 51
llama_model_loader: - kv  51:                                   split.no u16              = 0
llama_model_loader: - kv  52:                        split.tensors.count i32              = 472
llama_model_loader: - kv  53:                                split.count u16              = 4
llama_model_loader: - type  f32:  230 tensors
llama_model_loader: - type q8_0:  101 tensors
llama_model_loader: - type iq3_s:   94 tensors
llama_model_loader: - type iq4_xs:   47 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 136.78 GiB (3.80 BPW) 
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23845 MiB free
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 151673 '<|mimo_audio_start|>' is not marked as EOG
load: control token: 151669 '<|audio_pad|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control-looking token: 128247 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control token: 151674 '<|mimo_audio_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151672 '<|mimo_audio_eod|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151670 '<|mimo_video_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151671 '<|mimo_video_end|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 128247 ('</s>')
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 0.9312 MB
print_info: arch                  = mimo2
print_info: vocab_only            = 0
print_info: no_alloc              = 1
print_info: n_ctx_train           = 1048576
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 48
print_info: n_head                = 64
print_info: n_head_kv             = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4]
print_info: n_rot                 = 64
print_info: n_swa                 = 128
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 192
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = [16, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16]
print_info: n_embd_k_gqa          = [768, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768]
print_info: n_embd_v_gqa          = [512, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512]
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 16384
print_info: n_expert              = 256
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 1
print_info: n_group_used          = 1
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_embd_head_k_swa     = 192
print_info: n_embd_head_v_swa     = 128
print_info: n_rot_swa             = 64
print_info: n_ctx_orig_yarn       = 1048576
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 310B.A15B
print_info: model params          = 308.78 B
print_info: general.name          = MiMo V2.5
print_info: vocab type            = BPE
print_info: n_vocab               = 152576
print_info: n_merges              = 151387
print_info: BOS token             = 11 ','
print_info: EOS token             = 151645 '<|im_end|>'
print_info: EOT token             = 151645 '<|im_end|>'
print_info: PAD token             = 151643 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151659 '<|fim_prefix|>'
print_info: FIM SUF token         = 151661 '<|fim_suffix|>'
print_info: FIM MID token         = 151660 '<|fim_middle|>'
print_info: FIM PAD token         = 151662 '<|fim_pad|>'
print_info: FIM REP token         = 151663 '<|repo_name|>'
print_info: FIM SEP token         = 151664 '<|file_sep|>'
print_info: EOG token             = 128247 '</s>'
print_info: EOG token             = 151643 '<|endoftext|>'
print_info: EOG token             = 151645 '<|im_end|>'
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 1
load_tensors: layer   2 assigned to device CUDA0, is_swa = 1
load_tensors: layer   3 assigned to device CUDA0, is_swa = 1
load_tensors: layer   4 assigned to device CUDA0, is_swa = 1
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 1
load_tensors: layer   7 assigned to device CUDA0, is_swa = 1
load_tensors: layer   8 assigned to device CUDA0, is_swa = 1
load_tensors: layer   9 assigned to device CUDA0, is_swa = 1
load_tensors: layer  10 assigned to device CUDA0, is_swa = 1
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 1
load_tensors: layer  13 assigned to device CUDA0, is_swa = 1
load_tensors: layer  14 assigned to device CUDA0, is_swa = 1
load_tensors: layer  15 assigned to device CUDA0, is_swa = 1
load_tensors: layer  16 assigned to device CUDA0, is_swa = 1
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 1
load_tensors: layer  19 assigned to device CUDA0, is_swa = 1
load_tensors: layer  20 assigned to device CUDA0, is_swa = 1
load_tensors: layer  21 assigned to device CUDA0, is_swa = 1
load_tensors: layer  22 assigned to device CUDA0, is_swa = 1
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 1
load_tensors: layer  25 assigned to device CUDA0, is_swa = 1
load_tensors: layer  26 assigned to device CUDA0, is_swa = 1
load_tensors: layer  27 assigned to device CUDA0, is_swa = 1
load_tensors: layer  28 assigned to device CUDA0, is_swa = 1
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 1
load_tensors: layer  31 assigned to device CUDA0, is_swa = 1
load_tensors: layer  32 assigned to device CUDA0, is_swa = 1
load_tensors: layer  33 assigned to device CUDA0, is_swa = 1
load_tensors: layer  34 assigned to device CUDA0, is_swa = 1
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 1
load_tensors: layer  37 assigned to device CUDA0, is_swa = 1
load_tensors: layer  38 assigned to device CUDA0, is_swa = 1
load_tensors: layer  39 assigned to device CUDA0, is_swa = 1
load_tensors: layer  40 assigned to device CUDA0, is_swa = 1
load_tensors: layer  41 assigned to device CUDA0, is_swa = 0
load_tensors: layer  42 assigned to device CUDA0, is_swa = 1
load_tensors: layer  43 assigned to device CUDA0, is_swa = 1
load_tensors: layer  44 assigned to device CUDA0, is_swa = 1
load_tensors: layer  45 assigned to device CUDA0, is_swa = 1
load_tensors: layer  46 assigned to device CUDA0, is_swa = 1
load_tensors: layer  47 assigned to device CUDA0, is_swa = 0
load_tensors: layer  48 assigned to device CUDA0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_qkv.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_qkv.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_sinks.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.exp_probs_b.bias
create_tensor: loading tensor blk.2.attn_qkv.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_sinks.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate_inp.weight
create_tensor: loading tensor blk.2.ffn_gate_exps.weight
create_tensor: loading tensor blk.2.ffn_down_exps.weight
create_tensor: loading tensor blk.2.ffn_up_exps.weight
create_tensor: loading tensor blk.2.exp_probs_b.bias
create_tensor: loading tensor blk.3.attn_qkv.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_sinks.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate_inp.weight
create_tensor: loading tensor blk.3.ffn_gate_exps.weight
create_tensor: loading tensor blk.3.ffn_down_exps.weight
create_tensor: loading tensor blk.3.ffn_up_exps.weight
create_tensor: loading tensor blk.3.exp_probs_b.bias
create_tensor: loading tensor blk.4.attn_qkv.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_sinks.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate_inp.weight
create_tensor: loading tensor blk.4.ffn_gate_exps.weight
create_tensor: loading tensor blk.4.ffn_down_exps.weight
create_tensor: loading tensor blk.4.ffn_up_exps.weight
create_tensor: loading tensor blk.4.exp_probs_b.bias
create_tensor: loading tensor blk.5.attn_qkv.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate_inp.weight
create_tensor: loading tensor blk.5.ffn_gate_exps.weight
create_tensor: loading tensor blk.5.ffn_down_exps.weight
create_tensor: loading tensor blk.5.ffn_up_exps.weight
create_tensor: loading tensor blk.5.exp_probs_b.bias
create_tensor: loading tensor blk.6.attn_qkv.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_sinks.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate_inp.weight
create_tensor: loading tensor blk.6.ffn_gate_exps.weight
create_tensor: loading tensor blk.6.ffn_down_exps.weight
create_tensor: loading tensor blk.6.ffn_up_exps.weight
create_tensor: loading tensor blk.6.exp_probs_b.bias
create_tensor: loading tensor blk.7.attn_qkv.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_sinks.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate_inp.weight
create_tensor: loading tensor blk.7.ffn_gate_exps.weight
create_tensor: loading tensor blk.7.ffn_down_exps.weight
create_tensor: loading tensor blk.7.ffn_up_exps.weight
create_tensor: loading tensor blk.7.exp_probs_b.bias
create_tensor: loading tensor blk.8.attn_qkv.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_sinks.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate_inp.weight
create_tensor: loading tensor blk.8.ffn_gate_exps.weight
create_tensor: loading tensor blk.8.ffn_down_exps.weight
create_tensor: loading tensor blk.8.ffn_up_exps.weight
create_tensor: loading tensor blk.8.exp_probs_b.bias
create_tensor: loading tensor blk.9.attn_qkv.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_sinks.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate_inp.weight
create_tensor: loading tensor blk.9.ffn_gate_exps.weight
create_tensor: loading tensor blk.9.ffn_down_exps.weight
create_tensor: loading tensor blk.9.ffn_up_exps.weight
create_tensor: loading tensor blk.9.exp_probs_b.bias
create_tensor: loading tensor blk.10.attn_qkv.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_sinks.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate_inp.weight
create_tensor: loading tensor blk.10.ffn_gate_exps.weight
create_tensor: loading tensor blk.10.ffn_down_exps.weight
create_tensor: loading tensor blk.10.ffn_up_exps.weight
create_tensor: loading tensor blk.10.exp_probs_b.bias
create_tensor: loading tensor blk.11.attn_qkv.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate_inp.weight
create_tensor: loading tensor blk.11.ffn_gate_exps.weight
create_tensor: loading tensor blk.11.ffn_down_exps.weight
create_tensor: loading tensor blk.11.ffn_up_exps.weight
create_tensor: loading tensor blk.11.exp_probs_b.bias
create_tensor: loading tensor blk.12.attn_qkv.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_sinks.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate_inp.weight
create_tensor: loading tensor blk.12.ffn_gate_exps.weight
create_tensor: loading tensor blk.12.ffn_down_exps.weight
create_tensor: loading tensor blk.12.ffn_up_exps.weight
create_tensor: loading tensor blk.12.exp_probs_b.bias
create_tensor: loading tensor blk.13.attn_qkv.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_sinks.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate_inp.weight
create_tensor: loading tensor blk.13.ffn_gate_exps.weight
create_tensor: loading tensor blk.13.ffn_down_exps.weight
create_tensor: loading tensor blk.13.ffn_up_exps.weight
create_tensor: loading tensor blk.13.exp_probs_b.bias
create_tensor: loading tensor blk.14.attn_qkv.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_sinks.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate_inp.weight
create_tensor: loading tensor blk.14.ffn_gate_exps.weight
create_tensor: loading tensor blk.14.ffn_down_exps.weight
create_tensor: loading tensor blk.14.ffn_up_exps.weight
create_tensor: loading tensor blk.14.exp_probs_b.bias
create_tensor: loading tensor blk.15.attn_qkv.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_sinks.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate_inp.weight
create_tensor: loading tensor blk.15.ffn_gate_exps.weight
create_tensor: loading tensor blk.15.ffn_down_exps.weight
create_tensor: loading tensor blk.15.ffn_up_exps.weight
create_tensor: loading tensor blk.15.exp_probs_b.bias
create_tensor: loading tensor blk.16.attn_qkv.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_sinks.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate_inp.weight
create_tensor: loading tensor blk.16.ffn_gate_exps.weight
create_tensor: loading tensor blk.16.ffn_down_exps.weight
create_tensor: loading tensor blk.16.ffn_up_exps.weight
create_tensor: loading tensor blk.16.exp_probs_b.bias
create_tensor: loading tensor blk.17.attn_qkv.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate_inp.weight
create_tensor: loading tensor blk.17.ffn_gate_exps.weight
create_tensor: loading tensor blk.17.ffn_down_exps.weight
create_tensor: loading tensor blk.17.ffn_up_exps.weight
create_tensor: loading tensor blk.17.exp_probs_b.bias
create_tensor: loading tensor blk.18.attn_qkv.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_sinks.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate_inp.weight
create_tensor: loading tensor blk.18.ffn_gate_exps.weight
create_tensor: loading tensor blk.18.ffn_down_exps.weight
create_tensor: loading tensor blk.18.ffn_up_exps.weight
create_tensor: loading tensor blk.18.exp_probs_b.bias
create_tensor: loading tensor blk.19.attn_qkv.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_sinks.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate_inp.weight
create_tensor: loading tensor blk.19.ffn_gate_exps.weight
create_tensor: loading tensor blk.19.ffn_down_exps.weight
create_tensor: loading tensor blk.19.ffn_up_exps.weight
create_tensor: loading tensor blk.19.exp_probs_b.bias
create_tensor: loading tensor blk.20.attn_qkv.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_sinks.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate_inp.weight
create_tensor: loading tensor blk.20.ffn_gate_exps.weight
create_tensor: loading tensor blk.20.ffn_down_exps.weight
create_tensor: loading tensor blk.20.ffn_up_exps.weight
create_tensor: loading tensor blk.20.exp_probs_b.bias
create_tensor: loading tensor blk.21.attn_qkv.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_sinks.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate_inp.weight
create_tensor: loading tensor blk.21.ffn_gate_exps.weight
create_tensor: loading tensor blk.21.ffn_down_exps.weight
create_tensor: loading tensor blk.21.ffn_up_exps.weight
create_tensor: loading tensor blk.21.exp_probs_b.bias
create_tensor: loading tensor blk.22.attn_qkv.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_sinks.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate_inp.weight
create_tensor: loading tensor blk.22.ffn_gate_exps.weight
create_tensor: loading tensor blk.22.ffn_down_exps.weight
create_tensor: loading tensor blk.22.ffn_up_exps.weight
create_tensor: loading tensor blk.22.exp_probs_b.bias
create_tensor: loading tensor blk.23.attn_qkv.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate_inp.weight
create_tensor: loading tensor blk.23.ffn_gate_exps.weight
create_tensor: loading tensor blk.23.ffn_down_exps.weight
create_tensor: loading tensor blk.23.ffn_up_exps.weight
create_tensor: loading tensor blk.23.exp_probs_b.bias
create_tensor: loading tensor blk.24.attn_qkv.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_sinks.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate_inp.weight
create_tensor: loading tensor blk.24.ffn_gate_exps.weight
create_tensor: loading tensor blk.24.ffn_down_exps.weight
create_tensor: loading tensor blk.24.ffn_up_exps.weight
create_tensor: loading tensor blk.24.exp_probs_b.bias
create_tensor: loading tensor blk.25.attn_qkv.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_sinks.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate_inp.weight
create_tensor: loading tensor blk.25.ffn_gate_exps.weight
create_tensor: loading tensor blk.25.ffn_down_exps.weight
create_tensor: loading tensor blk.25.ffn_up_exps.weight
create_tensor: loading tensor blk.25.exp_probs_b.bias
create_tensor: loading tensor blk.26.attn_qkv.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_sinks.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate_inp.weight
create_tensor: loading tensor blk.26.ffn_gate_exps.weight
create_tensor: loading tensor blk.26.ffn_down_exps.weight
create_tensor: loading tensor blk.26.ffn_up_exps.weight
create_tensor: loading tensor blk.26.exp_probs_b.bias
create_tensor: loading tensor blk.27.attn_qkv.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_sinks.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate_inp.weight
create_tensor: loading tensor blk.27.ffn_gate_exps.weight
create_tensor: loading tensor blk.27.ffn_down_exps.weight
create_tensor: loading tensor blk.27.ffn_up_exps.weight
create_tensor: loading tensor blk.27.exp_probs_b.bias
create_tensor: loading tensor blk.28.attn_qkv.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_sinks.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate_inp.weight
create_tensor: loading tensor blk.28.ffn_gate_exps.weight
create_tensor: loading tensor blk.28.ffn_down_exps.weight
create_tensor: loading tensor blk.28.ffn_up_exps.weight
create_tensor: loading tensor blk.28.exp_probs_b.bias
create_tensor: loading tensor blk.29.attn_qkv.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate_inp.weight
create_tensor: loading tensor blk.29.ffn_gate_exps.weight
create_tensor: loading tensor blk.29.ffn_down_exps.weight
create_tensor: loading tensor blk.29.ffn_up_exps.weight
create_tensor: loading tensor blk.29.exp_probs_b.bias
create_tensor: loading tensor blk.30.attn_qkv.weight
create_tensor: loading tensor blk.30.attn_output.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.attn_sinks.weight
create_tensor: loading tensor blk.30.ffn_norm.weight
create_tensor: loading tensor blk.30.ffn_gate_inp.weight
create_tensor: loading tensor blk.30.ffn_gate_exps.weight
create_tensor: loading tensor blk.30.ffn_down_exps.weight
create_tensor: loading tensor blk.30.ffn_up_exps.weight
create_tensor: loading tensor blk.30.exp_probs_b.bias
create_tensor: loading tensor blk.31.attn_qkv.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.attn_sinks.weight
create_tensor: loading tensor blk.31.ffn_norm.weight
create_tensor: loading tensor blk.31.ffn_gate_inp.weight
create_tensor: loading tensor blk.31.ffn_gate_exps.weight
create_tensor: loading tensor blk.31.ffn_down_exps.weight
create_tensor: loading tensor blk.31.ffn_up_exps.weight
create_tensor: loading tensor blk.31.exp_probs_b.bias
create_tensor: loading tensor blk.32.attn_qkv.weight
create_tensor: loading tensor blk.32.attn_output.weight
create_tensor: loading tensor blk.32.attn_norm.weight
create_tensor: loading tensor blk.32.attn_sinks.weight
create_tensor: loading tensor blk.32.ffn_norm.weight
create_tensor: loading tensor blk.32.ffn_gate_inp.weight
create_tensor: loading tensor blk.32.ffn_gate_exps.weight
create_tensor: loading tensor blk.32.ffn_down_exps.weight
create_tensor: loading tensor blk.32.ffn_up_exps.weight
create_tensor: loading tensor blk.32.exp_probs_b.bias
create_tensor: loading tensor blk.33.attn_qkv.weight
create_tensor: loading tensor blk.33.attn_output.weight
create_tensor: loading tensor blk.33.attn_norm.weight
create_tensor: loading tensor blk.33.attn_sinks.weight
create_tensor: loading tensor blk.33.ffn_norm.weight
create_tensor: loading tensor blk.33.ffn_gate_inp.weight
create_tensor: loading tensor blk.33.ffn_gate_exps.weight
create_tensor: loading tensor blk.33.ffn_down_exps.weight
create_tensor: loading tensor blk.33.ffn_up_exps.weight
create_tensor: loading tensor blk.33.exp_probs_b.bias
create_tensor: loading tensor blk.34.attn_qkv.weight
create_tensor: loading tensor blk.34.attn_output.weight
create_tensor: loading tensor blk.34.attn_norm.weight
create_tensor: loading tensor blk.34.attn_sinks.weight
create_tensor: loading tensor blk.34.ffn_norm.weight
create_tensor: loading tensor blk.34.ffn_gate_inp.weight
create_tensor: loading tensor blk.34.ffn_gate_exps.weight
create_tensor: loading tensor blk.34.ffn_down_exps.weight
create_tensor: loading tensor blk.34.ffn_up_exps.weight
create_tensor: loading tensor blk.34.exp_probs_b.bias
create_tensor: loading tensor blk.35.attn_qkv.weight
create_tensor: loading tensor blk.35.attn_output.weight
create_tensor: loading tensor blk.35.attn_norm.weight
create_tensor: loading tensor blk.35.ffn_norm.weight
create_tensor: loading tensor blk.35.ffn_gate_inp.weight
create_tensor: loading tensor blk.35.ffn_gate_exps.weight
create_tensor: loading tensor blk.35.ffn_down_exps.weight
create_tensor: loading tensor blk.35.ffn_up_exps.weight
create_tensor: loading tensor blk.35.exp_probs_b.bias
create_tensor: loading tensor blk.36.attn_qkv.weight
create_tensor: loading tensor blk.36.attn_output.weight
create_tensor: loading tensor blk.36.attn_norm.weight
create_tensor: loading tensor blk.36.attn_sinks.weight
create_tensor: loading tensor blk.36.ffn_norm.weight
create_tensor: loading tensor blk.36.ffn_gate_inp.weight
create_tensor: loading tensor blk.36.ffn_gate_exps.weight
create_tensor: loading tensor blk.36.ffn_down_exps.weight
create_tensor: loading tensor blk.36.ffn_up_exps.weight
create_tensor: loading tensor blk.36.exp_probs_b.bias
create_tensor: loading tensor blk.37.attn_qkv.weight
create_tensor: loading tensor blk.37.attn_output.weight
create_tensor: loading tensor blk.37.attn_norm.weight
create_tensor: loading tensor blk.37.attn_sinks.weight
create_tensor: loading tensor blk.37.ffn_norm.weight
create_tensor: loading tensor blk.37.ffn_gate_inp.weight
create_tensor: loading tensor blk.37.ffn_gate_exps.weight
create_tensor: loading tensor blk.37.ffn_down_exps.weight
create_tensor: loading tensor blk.37.ffn_up_exps.weight
create_tensor: loading tensor blk.37.exp_probs_b.bias
create_tensor: loading tensor blk.38.attn_qkv.weight
create_tensor: loading tensor blk.38.attn_output.weight
create_tensor: loading tensor blk.38.attn_norm.weight
create_tensor: loading tensor blk.38.attn_sinks.weight
create_tensor: loading tensor blk.38.ffn_norm.weight
create_tensor: loading tensor blk.38.ffn_gate_inp.weight
create_tensor: loading tensor blk.38.ffn_gate_exps.weight
create_tensor: loading tensor blk.38.ffn_down_exps.weight
create_tensor: loading tensor blk.38.ffn_up_exps.weight
create_tensor: loading tensor blk.38.exp_probs_b.bias
create_tensor: loading tensor blk.39.attn_qkv.weight
create_tensor: loading tensor blk.39.attn_output.weight
create_tensor: loading tensor blk.39.attn_norm.weight
create_tensor: loading tensor blk.39.attn_sinks.weight
create_tensor: loading tensor blk.39.ffn_norm.weight
create_tensor: loading tensor blk.39.ffn_gate_inp.weight
create_tensor: loading tensor blk.39.ffn_gate_exps.weight
create_tensor: loading tensor blk.39.ffn_down_exps.weight
create_tensor: loading tensor blk.39.ffn_up_exps.weight
create_tensor: loading tensor blk.39.exp_probs_b.bias
create_tensor: loading tensor blk.40.attn_qkv.weight
create_tensor: loading tensor blk.40.attn_output.weight
create_tensor: loading tensor blk.40.attn_norm.weight
create_tensor: loading tensor blk.40.attn_sinks.weight
create_tensor: loading tensor blk.40.ffn_norm.weight
create_tensor: loading tensor blk.40.ffn_gate_inp.weight
create_tensor: loading tensor blk.40.ffn_gate_exps.weight
create_tensor: loading tensor blk.40.ffn_down_exps.weight
create_tensor: loading tensor blk.40.ffn_up_exps.weight
create_tensor: loading tensor blk.40.exp_probs_b.bias
create_tensor: loading tensor blk.41.attn_qkv.weight
create_tensor: loading tensor blk.41.attn_output.weight
create_tensor: loading tensor blk.41.attn_norm.weight
create_tensor: loading tensor blk.41.ffn_norm.weight
create_tensor: loading tensor blk.41.ffn_gate_inp.weight
create_tensor: loading tensor blk.41.ffn_gate_exps.weight
create_tensor: loading tensor blk.41.ffn_down_exps.weight
create_tensor: loading tensor blk.41.ffn_up_exps.weight
create_tensor: loading tensor blk.41.exp_probs_b.bias
create_tensor: loading tensor blk.42.attn_qkv.weight
create_tensor: loading tensor blk.42.attn_output.weight
create_tensor: loading tensor blk.42.attn_norm.weight
create_tensor: loading tensor blk.42.attn_sinks.weight
create_tensor: loading tensor blk.42.ffn_norm.weight
create_tensor: loading tensor blk.42.ffn_gate_inp.weight
create_tensor: loading tensor blk.42.ffn_gate_exps.weight
create_tensor: loading tensor blk.42.ffn_down_exps.weight
create_tensor: loading tensor blk.42.ffn_up_exps.weight
create_tensor: loading tensor blk.42.exp_probs_b.bias
create_tensor: loading tensor blk.43.attn_qkv.weight
create_tensor: loading tensor blk.43.attn_output.weight
create_tensor: loading tensor blk.43.attn_norm.weight
create_tensor: loading tensor blk.43.attn_sinks.weight
create_tensor: loading tensor blk.43.ffn_norm.weight
create_tensor: loading tensor blk.43.ffn_gate_inp.weight
create_tensor: loading tensor blk.43.ffn_gate_exps.weight
create_tensor: loading tensor blk.43.ffn_down_exps.weight
create_tensor: loading tensor blk.43.ffn_up_exps.weight
create_tensor: loading tensor blk.43.exp_probs_b.bias
create_tensor: loading tensor blk.44.attn_qkv.weight
create_tensor: loading tensor blk.44.attn_output.weight
create_tensor: loading tensor blk.44.attn_norm.weight
create_tensor: loading tensor blk.44.attn_sinks.weight
create_tensor: loading tensor blk.44.ffn_norm.weight
create_tensor: loading tensor blk.44.ffn_gate_inp.weight
create_tensor: loading tensor blk.44.ffn_gate_exps.weight
create_tensor: loading tensor blk.44.ffn_down_exps.weight
create_tensor: loading tensor blk.44.ffn_up_exps.weight
create_tensor: loading tensor blk.44.exp_probs_b.bias
create_tensor: loading tensor blk.45.attn_qkv.weight
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_sinks.weight
create_tensor: loading tensor blk.45.ffn_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.45.exp_probs_b.bias
create_tensor: loading tensor blk.46.attn_qkv.weight
create_tensor: loading tensor blk.46.attn_output.weight
create_tensor: loading tensor blk.46.attn_norm.weight
create_tensor: loading tensor blk.46.attn_sinks.weight
create_tensor: loading tensor blk.46.ffn_norm.weight
create_tensor: loading tensor blk.46.ffn_gate_inp.weight
create_tensor: loading tensor blk.46.ffn_gate_exps.weight
create_tensor: loading tensor blk.46.ffn_down_exps.weight
create_tensor: loading tensor blk.46.ffn_up_exps.weight
create_tensor: loading tensor blk.46.exp_probs_b.bias
create_tensor: loading tensor blk.47.attn_qkv.weight
create_tensor: loading tensor blk.47.attn_output.weight
create_tensor: loading tensor blk.47.attn_norm.weight
create_tensor: loading tensor blk.47.ffn_norm.weight
create_tensor: loading tensor blk.47.ffn_gate_inp.weight
create_tensor: loading tensor blk.47.ffn_gate_exps.weight
create_tensor: loading tensor blk.47.ffn_down_exps.weight
create_tensor: loading tensor blk.47.ffn_up_exps.weight
create_tensor: loading tensor blk.47.exp_probs_b.bias
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =     0.00 MiB
load_tensors:    CUDA_Host model buffer size =     0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 96000
llama_context: n_ctx_seq     = 96000
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (96000) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     2.33 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 96000 cells
llama_kv_cache: layer   0: dev = CUDA0
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: filtered
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: dev = CUDA0
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: filtered
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = CUDA0
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: filtered
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: dev = CUDA0
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: filtered
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = CUDA0
llama_kv_cache: layer  24: filtered
llama_kv_cache: layer  25: filtered
llama_kv_cache: layer  26: filtered
llama_kv_cache: layer  27: filtered
llama_kv_cache: layer  28: filtered
llama_kv_cache: layer  29: dev = CUDA0
llama_kv_cache: layer  30: filtered
llama_kv_cache: layer  31: filtered
llama_kv_cache: layer  32: filtered
llama_kv_cache: layer  33: filtered
llama_kv_cache: layer  34: filtered
llama_kv_cache: layer  35: dev = CUDA0
llama_kv_cache: layer  36: filtered
llama_kv_cache: layer  37: filtered
llama_kv_cache: layer  38: filtered
llama_kv_cache: layer  39: filtered
llama_kv_cache: layer  40: filtered
llama_kv_cache: layer  41: dev = CUDA0
llama_kv_cache: layer  42: filtered
llama_kv_cache: layer  43: filtered
llama_kv_cache: layer  44: filtered
llama_kv_cache: layer  45: filtered
llama_kv_cache: layer  46: filtered
llama_kv_cache: layer  47: dev = CUDA0
llama_kv_cache:      CUDA0 KV buffer size =     0.00 MiB
llama_kv_cache: size = 2109.38 MiB ( 96000 cells,   9 layers,  4/1 seqs), K (f16): 1265.62 MiB, V (f16):  843.75 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 192
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
llama_kv_cache_iswa: creating     SWA KV cache, size = 1024 cells
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: dev = CUDA0
llama_kv_cache: layer   2: dev = CUDA0
llama_kv_cache: layer   3: dev = CUDA0
llama_kv_cache: layer   4: dev = CUDA0
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: dev = CUDA0
llama_kv_cache: layer   7: dev = CUDA0
llama_kv_cache: layer   8: dev = CUDA0
llama_kv_cache: layer   9: dev = CUDA0
llama_kv_cache: layer  10: dev = CUDA0
llama_kv_cache: layer  11: filtered
llama_kv_cache: layer  12: dev = CUDA0
llama_kv_cache: layer  13: dev = CUDA0
llama_kv_cache: layer  14: dev = CUDA0
llama_kv_cache: layer  15: dev = CUDA0
llama_kv_cache: layer  16: dev = CUDA0
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: dev = CUDA0
llama_kv_cache: layer  19: dev = CUDA0
llama_kv_cache: layer  20: dev = CUDA0
llama_kv_cache: layer  21: dev = CUDA0
llama_kv_cache: layer  22: dev = CUDA0
llama_kv_cache: layer  23: filtered
llama_kv_cache: layer  24: dev = CUDA0
llama_kv_cache: layer  25: dev = CUDA0
llama_kv_cache: layer  26: dev = CUDA0
llama_kv_cache: layer  27: dev = CUDA0
llama_kv_cache: layer  28: dev = CUDA0
llama_kv_cache: layer  29: filtered
llama_kv_cache: layer  30: dev = CUDA0
llama_kv_cache: layer  31: dev = CUDA0
llama_kv_cache: layer  32: dev = CUDA0
llama_kv_cache: layer  33: dev = CUDA0
llama_kv_cache: layer  34: dev = CUDA0
llama_kv_cache: layer  35: filtered
llama_kv_cache: layer  36: dev = CUDA0
llama_kv_cache: layer  37: dev = CUDA0
llama_kv_cache: layer  38: dev = CUDA0
llama_kv_cache: layer  39: dev = CUDA0
llama_kv_cache: layer  40: dev = CUDA0
llama_kv_cache: layer  41: filtered
llama_kv_cache: layer  42: dev = CUDA0
llama_kv_cache: layer  43: dev = CUDA0
llama_kv_cache: layer  44: dev = CUDA0
llama_kv_cache: layer  45: dev = CUDA0
llama_kv_cache: layer  46: dev = CUDA0
llama_kv_cache: layer  47: filtered
llama_kv_cache:      CUDA0 KV buffer size =     0.00 MiB
llama_kv_cache: size =  195.00 MiB (  1024 cells,  39 layers,  4/1 seqs), K (f16):  117.00 MiB, V (f16):   78.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 192
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 3776
Segmentation fault (core dumped)
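
As a sanity check on the two `llama_kv_cache: size = ...` lines in the log above, the reported sizes can be reproduced by hand. This is only a sketch: it assumes f16 storage (2 bytes per element, as the log states), head sizes of 192 for K and 128 for V, and per-layer KV head counts taken from the `head_count_kv` metadata printed later in this thread (4 on the non-SWA layers, 8 on the SWA layers). The helper name `kv_mib` is made up for illustration.

```python
def kv_mib(cells, layers, kv_heads, head_dim, bytes_per_elem=2):
    """Size in MiB of one K or V cache: cells x layers x heads x head_dim elements."""
    return cells * layers * kv_heads * head_dim * bytes_per_elem / 2**20

# Non-SWA cache: 96000 cells, 9 layers, assumed 4 KV heads per layer.
non_swa_k = kv_mib(96000, 9, 4, 192)
non_swa_v = kv_mib(96000, 9, 4, 128)

# SWA cache: 1024 cells, 39 layers, assumed 8 KV heads per layer.
swa_k = kv_mib(1024, 39, 8, 192)
swa_v = kv_mib(1024, 39, 8, 128)

print(f"non-SWA: K = {non_swa_k} MiB, V = {non_swa_v} MiB, total = {non_swa_k + non_swa_v} MiB")
print(f"SWA:     K = {swa_k} MiB, V = {swa_v} MiB, total = {swa_k + swa_v} MiB")
```

Under those assumptions the numbers agree with the log to within rounding, so the KV cache sizing itself looks consistent at the point where the crash happens.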
Owner
edited May 8

I'll try to look at this tomorrow, thanks. I just finished uploading newly converted quants now that the PR has been merged into master; perhaps those are worth trying?

I can confirm that even the previous version of MiMo-V2.5-Q4_K_M (before your update of 05/05/26) does NOT work with the latest llama.cpp, version 9072 (6d57a49a7).

P.S. I am downloading the latest quants now and will report back later.

Sadly, the latest "IQ4_XS" shows the same behavior with llama.cpp version 9075 (58e68df0f):

/home/vik/llms/llama.cpp/build/bin/llama-server \
--model ~/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf \
--alias "aessedai/MiMo-V2.5-IQ4_XS" \
-c 131072 --fit off --threads 24 -fa off --no-mmap --jinja \
--temp 0.8 --top-p 0.95 --seed 1976 \
--host 0.0.0.0 --port 5005 \
-b 4096 -ub 1024 -ctxcp 24 \
-ctk q8_0 -ctv q8_0 -cb \
--chat-template-kwargs '{"reasoning_effort": "normal"}'
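
One detail in this command: it passes `-fa off` together with `-ctv q8_0`, and the log that follows reports `V cache quantization requires flash_attn` during the parameter-fitting step. Whether that is related to the crash is not established here. A minimal sketch of a pre-flight check for that combination is shown next. `v_cache_needs_flash_attn` is a hypothetical helper, not part of llama.cpp. It only recognizes the flag spellings seen in this thread plus the `--flash-attn` long form, and it assumes that any V cache type other than `f16`, `f32` or `bf16` counts as quantized.

```python
import shlex

# Flag spellings taken from the commands in this thread; this is an
# illustrative check, not llama.cpp's own argument parser.
FA_FLAGS = {"-fa", "--flash-attn"}
CTV_FLAGS = {"-ctv", "--cache-type-v"}
UNQUANTIZED = {"f16", "f32", "bf16"}  # assumption: everything else is quantized


def v_cache_needs_flash_attn(command: str) -> bool:
    """Return True if the command quantizes the V cache while flash attention is off."""
    args = shlex.split(command)
    fa_off = False
    v_quantized = False
    # Walk over (flag, value) pairs of adjacent arguments.
    for flag, value in zip(args, args[1:]):
        if flag in FA_FLAGS and value in {"off", "0"}:
            fa_off = True
        if flag in CTV_FLAGS and value not in UNQUANTIZED:
            v_quantized = True
    return fa_off and v_quantized


cmd = "llama-server -c 131072 --fit off -fa off -ctk q8_0 -ctv q8_0 -cb"
print(v_cache_needs_flash_attn(cmd))
```

Re-running with flash attention enabled, or with the V cache left unquantized, would at least take that error message out of the picture before looking further at the segfault.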

==========================================
ggml_cuda_init: found 8 CUDA devices (Total VRAM: 192990 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24121 MiB
Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9075-58e68df0f
system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 47 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
llama_init_from_model: V cache quantization requires flash_attn
common_fit_params: encountered an error while trying to fit params to free device memory: failed to create llama_context from model
common_fit_params: fitting params to free memory took 2.02 seconds
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 55 key-value pairs and 508 tensors from /home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mimo2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 3: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 4: general.name str = MiMo V2.5
llama_model_loader: - kv 5: general.version str = V2.5
llama_model_loader: - kv 6: general.basename str = MiMo
llama_model_loader: - kv 7: general.size_label str = 256x8.2B
llama_model_loader: - kv 8: general.license str = mit
llama_model_loader: - kv 9: general.tags arr[str,6] = ["multimodal", "vision-language", "au...
llama_model_loader: - kv 10: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 11: mimo2.block_count u32 = 51
llama_model_loader: - kv 12: mimo2.context_length u32 = 1048576
llama_model_loader: - kv 13: mimo2.embedding_length u32 = 4096
llama_model_loader: - kv 14: mimo2.feed_forward_length u32 = 16384
llama_model_loader: - kv 15: mimo2.attention.head_count u32 = 64
llama_model_loader: - kv 16: mimo2.attention.head_count_kv arr[i32,51] = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, ...
llama_model_loader: - kv 17: mimo2.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: mimo2.rope.freq_base_swa f32 = 10000.000000
llama_model_loader: - kv 19: mimo2.expert_used_count u32 = 8
llama_model_loader: - kv 20: mimo2.expert_group_count u32 = 1
llama_model_loader: - kv 21: mimo2.expert_group_used_count u32 = 1
llama_model_loader: - kv 22: mimo2.expert_gating_func u32 = 2
llama_model_loader: - kv 23: mimo2.attention.key_length u32 = 192
llama_model_loader: - kv 24: mimo2.attention.value_length u32 = 128
llama_model_loader: - kv 25: mimo2.attention.sliding_window u32 = 128
llama_model_loader: - kv 26: mimo2.attention.sliding_window_pattern arr[i32,51] = [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, ...
llama_model_loader: - kv 27: mimo2.expert_count u32 = 256
llama_model_loader: - kv 28: mimo2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: mimo2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: mimo2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 31: mimo2.attention.value_scale f32 = 0.707000
llama_model_loader: - kv 32: mimo2.nextn_predict_layers u32 = 3
llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 34: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,152576] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,152576] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 41: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - kv 44: MoE_Quantization.ffn_up_exps str = IQ3_S
llama_model_loader: - kv 45: MoE_Quantization.ffn_gate_exps str = IQ3_S
llama_model_loader: - kv 46: MoE_Quantization.ffn_down_exps str = IQ4_XS
llama_model_loader: - kv 47: MoE_Quantization.type_default str = Q8_0
llama_model_loader: - kv 48: quantize.imatrix.file str = /mnt/srv/snowdrift/fp16/MiMo-V2.5/ima...
llama_model_loader: - kv 49: quantize.imatrix.dataset str = /mnt/srv/host/resources/KLD/calibrati...
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 287
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 51
llama_model_loader: - kv 52: split.no u16 = 0
llama_model_loader: - kv 53: split.tensors.count i32 = 508
llama_model_loader: - kv 54: split.count u16 = 4
llama_model_loader: - type f32: 248 tensors
llama_model_loader: - type q8_0: 119 tensors
llama_model_loader: - type iq3_s: 94 tensors
llama_model_loader: - type iq4_xs: 47 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 137.75 GiB (3.82 BPW)
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA2 (NVIDIA GeForce RTX 3090) (0000:81:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA3 (NVIDIA GeForce RTX 3090) (0000:82:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA4 (NVIDIA GeForce RTX 3090) (0000:83:00.0) - 23840 MiB free
llama_prepare_model_devices: using device CUDA5 (NVIDIA GeForce RTX 3090) (0000:84:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA6 (NVIDIA GeForce RTX 3090) (0000:c1:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA7 (NVIDIA GeForce RTX 3090) (0000:c2:00.0) - 23859 MiB free
load: 0 unused tokens
load: control-looking token: 128247 '' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 128247 ('')
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 0.9312 MB
print_info: arch = mimo2
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 51
print_info: n_head = 64
print_info: n_head_kv = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8]
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = [16, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8]
print_info: n_embd_k_gqa = [768, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536]
print_info: n_embd_v_gqa = [512, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: f_attn_value_scale = 0.7070
print_info: n_ff = 16384
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: freq_base_swa = 10000.0
print_info: freq_scale_swa = 1
print_info: n_embd_head_k_swa = 192
print_info: n_embd_head_v_swa = 128
print_info: n_rot_swa = 64
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 310B.A15B
print_info: model params = 309.77 B
print_info: general.name = MiMo V2.5
print_info: vocab type = BPE
print_info: n_vocab = 152576
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 128247 ''
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
model has unused tensor blk.48.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.48.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.48.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.48.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.49.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.49.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.49.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.50.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.50.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.50.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.layer_output_norm.weight (size = 16384 bytes) -- ignoring
load_tensors: offloading output layer to GPU
load_tensors: offloading 50 repeating layers to GPU
load_tensors: offloaded 52/52 layers to GPU
load_tensors: CUDA0 model buffer size = 17974.98 MiB
load_tensors: CUDA1 model buffer size = 20628.29 MiB
load_tensors: CUDA2 model buffer size = 17680.63 MiB
load_tensors: CUDA3 model buffer size = 20628.29 MiB
load_tensors: CUDA4 model buffer size = 17680.63 MiB
load_tensors: CUDA5 model buffer size = 17680.63 MiB
load_tensors: CUDA6 model buffer size = 20628.29 MiB
load_tensors: CUDA7 model buffer size = 6708.14 MiB
load_tensors: CUDA_Host model buffer size = 633.25 MiB
....................................................................................................
common_init_result: added logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_init_from_model: V cache quantization requires flash_attn
common_init_result: failed to create context with model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_from_params: failed to create context with model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
./llama-srv_MiMo-V2.5-IQ4_XS.sh: line 12: 35811 Segmentation fault (core dumped)

Owner

Thanks, will try to load it up on my system and see if I can spot the problem

Owner
•
edited May 8

@dehnhaide can you try with the mimo-v2.5-fattn branch?

Edit: also -fa off isn't compatible with -ctk q8_0 -ctv q8_0, you need FA to quant the KV cache.
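The constraint above (and the `V cache quantization requires flash_attn` log line) can be caught before launching the server. The following is a minimal sketch of such a pre-flight check. `check_kv_cache_args` is a hypothetical helper, not part of llama.cpp, and both the set of values treated as "off" and the `f16` default cache type are assumptions. It checks only the V cache, since that is what the log line names.

```python
# Hypothetical pre-flight check for a llama-server command line; not llama.cpp code.
FA_FLAGS = {"-fa", "--flash-attn"}
V_CACHE_FLAGS = {"-ctv", "--cache-type-v"}
UNQUANTIZED = {"f32", "f16", "bf16"}
FA_OFF_VALUES = {"off", "0", "false"}  # assumption: spellings that disable FA


def check_kv_cache_args(argv):
    """Return a warning string if a quantized V cache is combined with FA off, else None."""
    fa_value = None
    v_type = "f16"  # assumption: default cache type when -ctv is not given
    # Walk the arguments as (flag, value) pairs.
    for flag, value in zip(argv, argv[1:]):
        if flag in FA_FLAGS:
            fa_value = value.lower()
        elif flag in V_CACHE_FLAGS:
            v_type = value.lower()
    if fa_value in FA_OFF_VALUES and v_type not in UNQUANTIZED:
        return f"V cache type {v_type!r} needs flash attention, but -fa is {fa_value!r}"
    return None


if __name__ == "__main__":
    print(check_kv_cache_args(["-fa", "off", "-ctk", "q8_0", "-ctv", "q8_0"]))
    print(check_kv_cache_args(["-fa", "on", "-ctk", "q8_0", "-ctv", "q8_0"]))
```

The first call returns a warning and the second returns `None`, matching the behaviour described in this thread.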

Owner

@dehnhaide I was able to reproduce your issue with -fa off -ctk q8_0 -ctv q8_0; it works with -fa on, or without -ctk q8_0 -ctv q8_0. I'd still recommend using the mimo-v2.5-fattn branch I linked in the readme, plus -fa on.

@dehnhaide I was able to reproduce your issue with -fa off -ctk q8_0 -ctv q8_0; it works with -fa on, or without -ctk q8_0 -ctv q8_0. I'd still recommend using the mimo-v2.5-fattn branch I linked in the readme, plus -fa on.

Damn! I am so sorry for the misreport! :( Indeed, I had been toggling "-fa on/off" while testing the initial version (in mainline), and it was left at "off".
Happy to report that it now starts up (with version: 9079 (f9cd456ea)), BUT the speed is abysmal! A simple prompt in OpenCode gives:

prompt eval time = 391869.08 ms / 19154 tokens ( 20.46 ms per token, 48.88 tokens per second)
eval time = 17018.09 ms / 126 tokens ( 135.06 ms per token, 7.40 tokens per second)
total time = 408887.17 ms / 19280 tokens
slot release: id 2 | task 2 | stop processing: n_tokens = 19279, truncated = 0

With https://github.com/AesSedai/llama.cpp/tree/mimo-v2.5-fattn the new IQ4_XS quant works beautifully (about 40 tok/s in OpenCode) with "-fa on":

prompt eval time = 19402.71 ms / 19154 tokens ( 1.01 ms per token, 987.18 tokens per second)
eval time = 2592.45 ms / 102 tokens ( 25.42 ms per token, 39.35 tokens per second)
total time = 21995.15 ms / 19256 tokens
slot release: id 2 | task 2 | stop processing: n_tokens = 19255, truncated = 0
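To put a number on the difference between the two runs, here is a small sketch that parses the timing lines quoted above and compares throughput. The regular expression and the helper name `tokens_per_second` are illustrative assumptions; only the log values come from this thread.

```python
import re

# Matches "prompt eval time = X ms / N tokens" and "eval time = X ms / N tokens".
TIMING_RE = re.compile(r"^\s*(prompt eval|eval) time\s*=\s*([\d.]+) ms\s*/\s*(\d+) tokens")


def tokens_per_second(log_text):
    """Return {"prompt eval": tok/s, "eval": tok/s} computed from llama-server timing lines."""
    rates = {}
    for line in log_text.splitlines():
        m = TIMING_RE.match(line)
        if m:
            ms, tokens = float(m.group(2)), int(m.group(3))
            rates[m.group(1)] = tokens / (ms / 1000.0)
    return rates


MAINLINE = """\
prompt eval time = 391869.08 ms / 19154 tokens ( 20.46 ms per token, 48.88 tokens per second)
eval time = 17018.09 ms / 126 tokens ( 135.06 ms per token, 7.40 tokens per second)
"""

FATTN = """\
prompt eval time = 19402.71 ms / 19154 tokens ( 1.01 ms per token, 987.18 tokens per second)
eval time = 2592.45 ms / 102 tokens ( 25.42 ms per token, 39.35 tokens per second)
"""

mainline = tokens_per_second(MAINLINE)
fattn = tokens_per_second(FATTN)
for phase in ("prompt eval", "eval"):
    speedup = fattn[phase] / mainline[phase]
    print(f"{phase}: {mainline[phase]:.2f} -> {fattn[phase]:.2f} tok/s ({speedup:.1f}x)")
```

On these two runs the branch is roughly 20x faster at prompt processing and roughly 5x faster at generation.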

Owner
•
edited May 8

Cool, I think this PR is close to merging, so the FA fixes will be on master after that: https://github.com/ggml-org/llama.cpp/pull/22812

The speed issue is known; I didn't want to pollute the initial support PR with the FA fixes. Basically MiMo uses a 192/128 asymmetric head, which wasn't templated yet since no other model has needed it.
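For readers unfamiliar with asymmetric heads, here is a naive pure-Python sketch of single-head attention where the K and V head dimensions differ. It assumes "192/128" means a K (and Q) head dimension of 192 and a V head dimension of 128; the thread does not spell out which is which. This is an illustration of the shapes only, not llama.cpp's flash-attention kernel.

```python
import math


def attention_single_head(q, k, v):
    """Naive softmax attention for one head.

    q: [n_q][d_k], k: [n_kv][d_k], v: [n_kv][d_v]  ->  output: [n_q][d_v]
    """
    d_k = len(q[0])
    d_v = len(v[0])
    scale = 1.0 / math.sqrt(d_k)
    out = []
    for q_row in q:
        # The Q.K dot product runs over the K head dimension (d_k).
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # The weighted sum of V rows has the V head dimension (d_v).
        out.append([sum(w * v_row[j] for w, v_row in zip(weights, v)) / total
                    for j in range(d_v)])
    return out


D_K, D_V = 192, 128  # assumption: K/Q head dim 192, V head dim 128
n_q, n_kv = 2, 3
q = [[0.01 * (i + j) for j in range(D_K)] for i in range(n_q)]
k = [[0.01 * (i - j) for j in range(D_K)] for i in range(n_kv)]
v = [[float(i) for _ in range(D_V)] for i in range(n_kv)]

out = attention_single_head(q, k, v)
print(len(out), len(out[0]))  # prints "2 128"
```

The scores are computed over 192 elements while the output rows have 128, so a kernel specialised for a single shared head size cannot be reused as-is for this shape.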

Appreciate all this. Downloading to test, but would y'all say this model has good pop culture knowledge? Decent prose? There's so much focus on agentic stuff, it's hard to tell if they threw everything else out (like Qwen).

Owner

@Downtown-Case I've mostly been using Pro, but it has become my favorite RP model due to its prose / natural language / intuition. I've only used non-Pro a little bit and it's probably 65% of the way there. It's not bad but there isn't really a replacement for tripling the total and active params. Too bad Pro doesn't support multimodality :(
