New IQ4_XS quant fails to load in llama.cpp (b9050) with a segfault

by smalinin - opened May 7

Discussion

smalinin

May 7

The previous version of the IQ4_XS quant works fine with llama.cpp (b9050).

AesSedai

Owner May 7

Hi, can you give me any further details about this? Log lines or a stack trace, please?

smalinin

May 8

Under GDB:

srv    load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
[New Thread 0x7fffc0d46000 (LWP 868754)]
[New Thread 0x7fffb5fff000 (LWP 868755)]
[New Thread 0x7fffb57fe000 (LWP 868756)]
[New Thread 0x7fffb4ffd000 (LWP 868757)]
[New Thread 0x7fffaffff000 (LWP 868758)]
[New Thread 0x7fffaf7fe000 (LWP 868759)]
[New Thread 0x7fffaeffd000 (LWP 868760)]
[New Thread 0x7fffae7fc000 (LWP 868761)]
[New Thread 0x7fffadffb000 (LWP 868762)]
[New Thread 0x7fffad7fa000 (LWP 868763)]
[New Thread 0x7fffacff9000 (LWP 868764)]
[New Thread 0x7fffa7fff000 (LWP 868765)]
[New Thread 0x7fffa77fe000 (LWP 868766)]
[New Thread 0x7fffa6ffd000 (LWP 868767)]
[New Thread 0x7fffa67fc000 (LWP 868768)]
[New Thread 0x7fffa5ffb000 (LWP 868769)]
[New Thread 0x7fffa57fa000 (LWP 868770)]
[New Thread 0x7fffa4ff9000 (LWP 868771)]
[New Thread 0x7fffa47f8000 (LWP 868772)]

Thread 1 "llama-server" received signal SIGSEGV, Segmentation fault.
0x00007ffff795dd9e in ggml_mul_mat () from /home/uid/_llama_cpp/llama.cpp-b9050/build/bin/libggml-base.so.0

$ ./llama-server -m ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf  -fa 1 -c 96000 --jinja  -t 20 -ncmoe 0 --host 192.168.1.69 --batch-size 4096 --no-mmproj --temp 1.0 --top_p 0.95 --cache-type-k q8_0 --cache-type-v q8_0
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b0-unknown
system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Running without SSL
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
Segmentation fault (core dumped)

$ ./llama-server -m ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf  -fa 1 -c 96000 --jinja  -t 20 -ncmoe 0 --host 192.168.1.69 --batch-size 4096 --no-mmproj --temp 1.0 --top_p 0.95 --verbose
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24124 MiB):
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b0-unknown
system_info: n_threads = 20 (n_threads_batch = 20) / 20 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
Running without SSL
init: using 19 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model './models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 54 key-value pairs and 472 tensors from ./models/MiMo-V25/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = mimo2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   3:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   4:                               general.name str              = MiMo V2.5
llama_model_loader: - kv   5:                            general.version str              = V2.5
llama_model_loader: - kv   6:                           general.basename str              = MiMo
llama_model_loader: - kv   7:                         general.size_label str              = 256x7.2B
llama_model_loader: - kv   8:                            general.license str              = mit
llama_model_loader: - kv   9:                               general.tags arr[str,6]       = ["multimodal", "vision-language", "au...
llama_model_loader: - kv  10:                          general.languages arr[str,2]       = ["en", "zh"]
llama_model_loader: - kv  11:                          mimo2.block_count u32              = 48
llama_model_loader: - kv  12:                       mimo2.context_length u32              = 1048576
llama_model_loader: - kv  13:                     mimo2.embedding_length u32              = 4096
llama_model_loader: - kv  14:                  mimo2.feed_forward_length u32              = 16384
llama_model_loader: - kv  15:                 mimo2.attention.head_count u32              = 64
llama_model_loader: - kv  16:              mimo2.attention.head_count_kv arr[i32,48]      = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, ...
llama_model_loader: - kv  17:                       mimo2.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  18:                   mimo2.rope.freq_base_swa f32              = 10000.000000
llama_model_loader: - kv  19:                    mimo2.expert_used_count u32              = 8
llama_model_loader: - kv  20:                   mimo2.expert_group_count u32              = 1
llama_model_loader: - kv  21:              mimo2.expert_group_used_count u32              = 1
llama_model_loader: - kv  22:                   mimo2.expert_gating_func u32              = 2
llama_model_loader: - kv  23:                 mimo2.attention.key_length u32              = 192
llama_model_loader: - kv  24:               mimo2.attention.value_length u32              = 128
llama_model_loader: - kv  25:             mimo2.attention.sliding_window u32              = 128
llama_model_loader: - kv  26:     mimo2.attention.sliding_window_pattern arr[i32,48]      = [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, ...
llama_model_loader: - kv  27:                         mimo2.expert_count u32              = 256
llama_model_loader: - kv  28:           mimo2.expert_feed_forward_length u32              = 2048
llama_model_loader: - kv  29:                 mimo2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:     mimo2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  31:                mimo2.attention.value_scale f32              = 0.707000
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,152576]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,152576]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  37:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  38:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  39:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  40:                    tokenizer.chat_template str              = {%- if not add_generation_prompt is d...
llama_model_loader: - kv  41:               general.quantization_version u32              = 2
llama_model_loader: - kv  42:                          general.file_type u32              = 7
llama_model_loader: - kv  43:               MoE_Quantization.ffn_up_exps str              = IQ3_S
llama_model_loader: - kv  44:             MoE_Quantization.ffn_gate_exps str              = IQ3_S
llama_model_loader: - kv  45:             MoE_Quantization.ffn_down_exps str              = IQ4_XS
llama_model_loader: - kv  46:              MoE_Quantization.type_default str              = Q8_0
llama_model_loader: - kv  47:                      quantize.imatrix.file str              = /mnt/srv/snowdrift/fp16/MiMo-V2.5/ima...
llama_model_loader: - kv  48:                   quantize.imatrix.dataset str              = /mnt/srv/host/resources/KLD/calibrati...
llama_model_loader: - kv  49:             quantize.imatrix.entries_count u32              = 287
llama_model_loader: - kv  50:              quantize.imatrix.chunks_count u32              = 51
llama_model_loader: - kv  51:                                   split.no u16              = 0
llama_model_loader: - kv  52:                        split.tensors.count i32              = 472
llama_model_loader: - kv  53:                                split.count u16              = 4
llama_model_loader: - type  f32:  230 tensors
llama_model_loader: - type q8_0:  101 tensors
llama_model_loader: - type iq3_s:   94 tensors
llama_model_loader: - type iq4_xs:   47 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 136.78 GiB (3.80 BPW) 
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23845 MiB free
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 151673 '<|mimo_audio_start|>' is not marked as EOG
load: control token: 151669 '<|audio_pad|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151648 '<|box_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151649 '<|box_end|>' is not marked as EOG
load: control-looking token: 128247 '</s>' was not control-type; this is probably a bug in the model. its type will be overridden
load: control token: 151674 '<|mimo_audio_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151672 '<|mimo_audio_eod|>' is not marked as EOG
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151670 '<|mimo_video_start|>' is not marked as EOG
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151671 '<|mimo_video_end|>' is not marked as EOG
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 128247 ('</s>')
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 0.9312 MB
print_info: arch                  = mimo2
print_info: vocab_only            = 0
print_info: no_alloc              = 1
print_info: n_ctx_train           = 1048576
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 48
print_info: n_head                = 64
print_info: n_head_kv             = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4]
print_info: n_rot                 = 64
print_info: n_swa                 = 128
print_info: is_swa_any            = 1
print_info: n_embd_head_k         = 192
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = [16, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16]
print_info: n_embd_k_gqa          = [768, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768]
print_info: n_embd_v_gqa          = [512, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512]
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 16384
print_info: n_expert              = 256
print_info: n_expert_used         = 8
print_info: n_expert_groups       = 1
print_info: n_group_used          = 1
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 2
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: freq_base_swa         = 10000.0
print_info: freq_scale_swa        = 1
print_info: n_embd_head_k_swa     = 192
print_info: n_embd_head_v_swa     = 128
print_info: n_rot_swa             = 64
print_info: n_ctx_orig_yarn       = 1048576
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 310B.A15B
print_info: model params          = 308.78 B
print_info: general.name          = MiMo V2.5
print_info: vocab type            = BPE
print_info: n_vocab               = 152576
print_info: n_merges              = 151387
print_info: BOS token             = 11 ','
print_info: EOS token             = 151645 '<|im_end|>'
print_info: EOT token             = 151645 '<|im_end|>'
print_info: PAD token             = 151643 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 151659 '<|fim_prefix|>'
print_info: FIM SUF token         = 151661 '<|fim_suffix|>'
print_info: FIM MID token         = 151660 '<|fim_middle|>'
print_info: FIM PAD token         = 151662 '<|fim_pad|>'
print_info: FIM REP token         = 151663 '<|repo_name|>'
print_info: FIM SEP token         = 151664 '<|file_sep|>'
print_info: EOG token             = 128247 '</s>'
print_info: EOG token             = 151643 '<|endoftext|>'
print_info: EOG token             = 151645 '<|im_end|>'
print_info: EOG token             = 151662 '<|fim_pad|>'
print_info: EOG token             = 151663 '<|repo_name|>'
print_info: EOG token             = 151664 '<|file_sep|>'
print_info: max token length      = 256
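
The derived values in the print_info block above can be cross-checked from the metadata keys alone. The following is a minimal sketch, not llama.cpp's own code: the helper names `derived` and `bits_per_weight` are made up for illustration, and the inputs are the rounded figures printed in the log, so the bits-per-weight result only agrees with the reported 3.80 BPW to within rounding.

```python
# Inputs copied from the log above (rounded as printed there).
N_HEAD = 64            # mimo2.attention.head_count
KEY_LENGTH = 192       # mimo2.attention.key_length
VALUE_LENGTH = 128     # mimo2.attention.value_length
FILE_SIZE_GIB = 136.78 # print_info: file size
N_PARAMS = 308.78e9    # print_info: model params


def derived(n_head_kv: int) -> dict:
    """Recompute the per-layer values print_info reports for a given KV head count."""
    return {
        "n_gqa": N_HEAD // n_head_kv,
        "n_embd_k_gqa": n_head_kv * KEY_LENGTH,
        "n_embd_v_gqa": n_head_kv * VALUE_LENGTH,
    }


def bits_per_weight(size_gib: float, params: float) -> float:
    """Average bits per weight: file size in bits divided by parameter count."""
    return size_gib * 2**30 * 8 / params


# The log lists two KV head counts across the 48 layers: 4 and 8.
for n_head_kv in (4, 8):
    print(n_head_kv, derived(n_head_kv))

print(f"{bits_per_weight(FILE_SIZE_GIB, N_PARAMS):.3f} bits per weight")
```

With 4 KV heads this gives n_gqa = 16, n_embd_k_gqa = 768 and n_embd_v_gqa = 512; with 8 KV heads it gives 8, 1536 and 1024, matching the arrays in the log.
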
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 1
load_tensors: layer   2 assigned to device CUDA0, is_swa = 1
load_tensors: layer   3 assigned to device CUDA0, is_swa = 1
load_tensors: layer   4 assigned to device CUDA0, is_swa = 1
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer   6 assigned to device CUDA0, is_swa = 1
load_tensors: layer   7 assigned to device CUDA0, is_swa = 1
load_tensors: layer   8 assigned to device CUDA0, is_swa = 1
load_tensors: layer   9 assigned to device CUDA0, is_swa = 1
load_tensors: layer  10 assigned to device CUDA0, is_swa = 1
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
load_tensors: layer  12 assigned to device CUDA0, is_swa = 1
load_tensors: layer  13 assigned to device CUDA0, is_swa = 1
load_tensors: layer  14 assigned to device CUDA0, is_swa = 1
load_tensors: layer  15 assigned to device CUDA0, is_swa = 1
load_tensors: layer  16 assigned to device CUDA0, is_swa = 1
load_tensors: layer  17 assigned to device CUDA0, is_swa = 0
load_tensors: layer  18 assigned to device CUDA0, is_swa = 1
load_tensors: layer  19 assigned to device CUDA0, is_swa = 1
load_tensors: layer  20 assigned to device CUDA0, is_swa = 1
load_tensors: layer  21 assigned to device CUDA0, is_swa = 1
load_tensors: layer  22 assigned to device CUDA0, is_swa = 1
load_tensors: layer  23 assigned to device CUDA0, is_swa = 0
load_tensors: layer  24 assigned to device CUDA0, is_swa = 1
load_tensors: layer  25 assigned to device CUDA0, is_swa = 1
load_tensors: layer  26 assigned to device CUDA0, is_swa = 1
load_tensors: layer  27 assigned to device CUDA0, is_swa = 1
load_tensors: layer  28 assigned to device CUDA0, is_swa = 1
load_tensors: layer  29 assigned to device CUDA0, is_swa = 0
load_tensors: layer  30 assigned to device CUDA0, is_swa = 1
load_tensors: layer  31 assigned to device CUDA0, is_swa = 1
load_tensors: layer  32 assigned to device CUDA0, is_swa = 1
load_tensors: layer  33 assigned to device CUDA0, is_swa = 1
load_tensors: layer  34 assigned to device CUDA0, is_swa = 1
load_tensors: layer  35 assigned to device CUDA0, is_swa = 0
load_tensors: layer  36 assigned to device CUDA0, is_swa = 1
load_tensors: layer  37 assigned to device CUDA0, is_swa = 1
load_tensors: layer  38 assigned to device CUDA0, is_swa = 1
load_tensors: layer  39 assigned to device CUDA0, is_swa = 1
load_tensors: layer  40 assigned to device CUDA0, is_swa = 1
load_tensors: layer  41 assigned to device CUDA0, is_swa = 0
load_tensors: layer  42 assigned to device CUDA0, is_swa = 1
load_tensors: layer  43 assigned to device CUDA0, is_swa = 1
load_tensors: layer  44 assigned to device CUDA0, is_swa = 1
load_tensors: layer  45 assigned to device CUDA0, is_swa = 1
load_tensors: layer  46 assigned to device CUDA0, is_swa = 1
load_tensors: layer  47 assigned to device CUDA0, is_swa = 0
load_tensors: layer  48 assigned to device CUDA0, is_swa = 0
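
The layer assignment lines above show which layers use sliding-window attention. A small sketch for pulling the full-attention layers (is_swa = 0) out of such a log follows; the regular expression and the function name `full_attention_layers` are assumptions written for this log format, and only a few sample lines are embedded so the snippet runs on its own.

```python
import re

# A few lines copied from the load_tensors output above.
# Paste the whole block into LOG to check every layer.
LOG = """\
load_tensors: layer   0 assigned to device CUDA0, is_swa = 0
load_tensors: layer   1 assigned to device CUDA0, is_swa = 1
load_tensors: layer   4 assigned to device CUDA0, is_swa = 1
load_tensors: layer   5 assigned to device CUDA0, is_swa = 0
load_tensors: layer  11 assigned to device CUDA0, is_swa = 0
"""

# Captures: layer index, device name, is_swa flag.
LINE = re.compile(
    r"load_tensors: layer\s+(\d+) assigned to device (\S+), is_swa = ([01])"
)


def full_attention_layers(log: str) -> list[int]:
    """Return the indices of layers logged with is_swa = 0."""
    return [int(m.group(1)) for m in LINE.finditer(log) if m.group(3) == "0"]


print(full_attention_layers(LOG))
```

Applied to the full block above, the is_swa = 0 entries are 0, 5, 11, 17, 23, 29, 35, 41 and 47, which are the same positions where n_head_kv is 4 in the print_info output, plus the extra entry 48, which presumably corresponds to the output layer.
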
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_qkv.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_qkv.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_sinks.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.exp_probs_b.bias
create_tensor: loading tensor blk.2.attn_qkv.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_sinks.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate_inp.weight
create_tensor: loading tensor blk.2.ffn_gate_exps.weight
create_tensor: loading tensor blk.2.ffn_down_exps.weight
create_tensor: loading tensor blk.2.ffn_up_exps.weight
create_tensor: loading tensor blk.2.exp_probs_b.bias
create_tensor: loading tensor blk.3.attn_qkv.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_sinks.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate_inp.weight
create_tensor: loading tensor blk.3.ffn_gate_exps.weight
create_tensor: loading tensor blk.3.ffn_down_exps.weight
create_tensor: loading tensor blk.3.ffn_up_exps.weight
create_tensor: loading tensor blk.3.exp_probs_b.bias
create_tensor: loading tensor blk.4.attn_qkv.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_sinks.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate_inp.weight
create_tensor: loading tensor blk.4.ffn_gate_exps.weight
create_tensor: loading tensor blk.4.ffn_down_exps.weight
create_tensor: loading tensor blk.4.ffn_up_exps.weight
create_tensor: loading tensor blk.4.exp_probs_b.bias
create_tensor: loading tensor blk.5.attn_qkv.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate_inp.weight
create_tensor: loading tensor blk.5.ffn_gate_exps.weight
create_tensor: loading tensor blk.5.ffn_down_exps.weight
create_tensor: loading tensor blk.5.ffn_up_exps.weight
create_tensor: loading tensor blk.5.exp_probs_b.bias
create_tensor: loading tensor blk.6.attn_qkv.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_sinks.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate_inp.weight
create_tensor: loading tensor blk.6.ffn_gate_exps.weight
create_tensor: loading tensor blk.6.ffn_down_exps.weight
create_tensor: loading tensor blk.6.ffn_up_exps.weight
create_tensor: loading tensor blk.6.exp_probs_b.bias
create_tensor: loading tensor blk.7.attn_qkv.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_sinks.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate_inp.weight
create_tensor: loading tensor blk.7.ffn_gate_exps.weight
create_tensor: loading tensor blk.7.ffn_down_exps.weight
create_tensor: loading tensor blk.7.ffn_up_exps.weight
create_tensor: loading tensor blk.7.exp_probs_b.bias
create_tensor: loading tensor blk.8.attn_qkv.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_sinks.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate_inp.weight
create_tensor: loading tensor blk.8.ffn_gate_exps.weight
create_tensor: loading tensor blk.8.ffn_down_exps.weight
create_tensor: loading tensor blk.8.ffn_up_exps.weight
create_tensor: loading tensor blk.8.exp_probs_b.bias
create_tensor: loading tensor blk.9.attn_qkv.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_sinks.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate_inp.weight
create_tensor: loading tensor blk.9.ffn_gate_exps.weight
create_tensor: loading tensor blk.9.ffn_down_exps.weight
create_tensor: loading tensor blk.9.ffn_up_exps.weight
create_tensor: loading tensor blk.9.exp_probs_b.bias
create_tensor: loading tensor blk.10.attn_qkv.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_sinks.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate_inp.weight
create_tensor: loading tensor blk.10.ffn_gate_exps.weight
create_tensor: loading tensor blk.10.ffn_down_exps.weight
create_tensor: loading tensor blk.10.ffn_up_exps.weight
create_tensor: loading tensor blk.10.exp_probs_b.bias
create_tensor: loading tensor blk.11.attn_qkv.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate_inp.weight
create_tensor: loading tensor blk.11.ffn_gate_exps.weight
create_tensor: loading tensor blk.11.ffn_down_exps.weight
create_tensor: loading tensor blk.11.ffn_up_exps.weight
create_tensor: loading tensor blk.11.exp_probs_b.bias
create_tensor: loading tensor blk.12.attn_qkv.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_sinks.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate_inp.weight
create_tensor: loading tensor blk.12.ffn_gate_exps.weight
create_tensor: loading tensor blk.12.ffn_down_exps.weight
create_tensor: loading tensor blk.12.ffn_up_exps.weight
create_tensor: loading tensor blk.12.exp_probs_b.bias
create_tensor: loading tensor blk.13.attn_qkv.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_sinks.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate_inp.weight
create_tensor: loading tensor blk.13.ffn_gate_exps.weight
create_tensor: loading tensor blk.13.ffn_down_exps.weight
create_tensor: loading tensor blk.13.ffn_up_exps.weight
create_tensor: loading tensor blk.13.exp_probs_b.bias
create_tensor: loading tensor blk.14.attn_qkv.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_sinks.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate_inp.weight
create_tensor: loading tensor blk.14.ffn_gate_exps.weight
create_tensor: loading tensor blk.14.ffn_down_exps.weight
create_tensor: loading tensor blk.14.ffn_up_exps.weight
create_tensor: loading tensor blk.14.exp_probs_b.bias
create_tensor: loading tensor blk.15.attn_qkv.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_sinks.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate_inp.weight
create_tensor: loading tensor blk.15.ffn_gate_exps.weight
create_tensor: loading tensor blk.15.ffn_down_exps.weight
create_tensor: loading tensor blk.15.ffn_up_exps.weight
create_tensor: loading tensor blk.15.exp_probs_b.bias
create_tensor: loading tensor blk.16.attn_qkv.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_sinks.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate_inp.weight
create_tensor: loading tensor blk.16.ffn_gate_exps.weight
create_tensor: loading tensor blk.16.ffn_down_exps.weight
create_tensor: loading tensor blk.16.ffn_up_exps.weight
create_tensor: loading tensor blk.16.exp_probs_b.bias
create_tensor: loading tensor blk.17.attn_qkv.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate_inp.weight
create_tensor: loading tensor blk.17.ffn_gate_exps.weight
create_tensor: loading tensor blk.17.ffn_down_exps.weight
create_tensor: loading tensor blk.17.ffn_up_exps.weight
create_tensor: loading tensor blk.17.exp_probs_b.bias
create_tensor: loading tensor blk.18.attn_qkv.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_sinks.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate_inp.weight
create_tensor: loading tensor blk.18.ffn_gate_exps.weight
create_tensor: loading tensor blk.18.ffn_down_exps.weight
create_tensor: loading tensor blk.18.ffn_up_exps.weight
create_tensor: loading tensor blk.18.exp_probs_b.bias
create_tensor: loading tensor blk.19.attn_qkv.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_sinks.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate_inp.weight
create_tensor: loading tensor blk.19.ffn_gate_exps.weight
create_tensor: loading tensor blk.19.ffn_down_exps.weight
create_tensor: loading tensor blk.19.ffn_up_exps.weight
create_tensor: loading tensor blk.19.exp_probs_b.bias
create_tensor: loading tensor blk.20.attn_qkv.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_sinks.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate_inp.weight
create_tensor: loading tensor blk.20.ffn_gate_exps.weight
create_tensor: loading tensor blk.20.ffn_down_exps.weight
create_tensor: loading tensor blk.20.ffn_up_exps.weight
create_tensor: loading tensor blk.20.exp_probs_b.bias
create_tensor: loading tensor blk.21.attn_qkv.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_sinks.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate_inp.weight
create_tensor: loading tensor blk.21.ffn_gate_exps.weight
create_tensor: loading tensor blk.21.ffn_down_exps.weight
create_tensor: loading tensor blk.21.ffn_up_exps.weight
create_tensor: loading tensor blk.21.exp_probs_b.bias
create_tensor: loading tensor blk.22.attn_qkv.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_sinks.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate_inp.weight
create_tensor: loading tensor blk.22.ffn_gate_exps.weight
create_tensor: loading tensor blk.22.ffn_down_exps.weight
create_tensor: loading tensor blk.22.ffn_up_exps.weight
create_tensor: loading tensor blk.22.exp_probs_b.bias
create_tensor: loading tensor blk.23.attn_qkv.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate_inp.weight
create_tensor: loading tensor blk.23.ffn_gate_exps.weight
create_tensor: loading tensor blk.23.ffn_down_exps.weight
create_tensor: loading tensor blk.23.ffn_up_exps.weight
create_tensor: loading tensor blk.23.exp_probs_b.bias
create_tensor: loading tensor blk.24.attn_qkv.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_sinks.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate_inp.weight
create_tensor: loading tensor blk.24.ffn_gate_exps.weight
create_tensor: loading tensor blk.24.ffn_down_exps.weight
create_tensor: loading tensor blk.24.ffn_up_exps.weight
create_tensor: loading tensor blk.24.exp_probs_b.bias
create_tensor: loading tensor blk.25.attn_qkv.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_sinks.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate_inp.weight
create_tensor: loading tensor blk.25.ffn_gate_exps.weight
create_tensor: loading tensor blk.25.ffn_down_exps.weight
create_tensor: loading tensor blk.25.ffn_up_exps.weight
create_tensor: loading tensor blk.25.exp_probs_b.bias
create_tensor: loading tensor blk.26.attn_qkv.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_sinks.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate_inp.weight
create_tensor: loading tensor blk.26.ffn_gate_exps.weight
create_tensor: loading tensor blk.26.ffn_down_exps.weight
create_tensor: loading tensor blk.26.ffn_up_exps.weight
create_tensor: loading tensor blk.26.exp_probs_b.bias
create_tensor: loading tensor blk.27.attn_qkv.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_sinks.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate_inp.weight
create_tensor: loading tensor blk.27.ffn_gate_exps.weight
create_tensor: loading tensor blk.27.ffn_down_exps.weight
create_tensor: loading tensor blk.27.ffn_up_exps.weight
create_tensor: loading tensor blk.27.exp_probs_b.bias
create_tensor: loading tensor blk.28.attn_qkv.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_sinks.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate_inp.weight
create_tensor: loading tensor blk.28.ffn_gate_exps.weight
create_tensor: loading tensor blk.28.ffn_down_exps.weight
create_tensor: loading tensor blk.28.ffn_up_exps.weight
create_tensor: loading tensor blk.28.exp_probs_b.bias
create_tensor: loading tensor blk.29.attn_qkv.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate_inp.weight
create_tensor: loading tensor blk.29.ffn_gate_exps.weight
create_tensor: loading tensor blk.29.ffn_down_exps.weight
create_tensor: loading tensor blk.29.ffn_up_exps.weight
create_tensor: loading tensor blk.29.exp_probs_b.bias
create_tensor: loading tensor blk.30.attn_qkv.weight
create_tensor: loading tensor blk.30.attn_output.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.attn_sinks.weight
create_tensor: loading tensor blk.30.ffn_norm.weight
create_tensor: loading tensor blk.30.ffn_gate_inp.weight
create_tensor: loading tensor blk.30.ffn_gate_exps.weight
create_tensor: loading tensor blk.30.ffn_down_exps.weight
create_tensor: loading tensor blk.30.ffn_up_exps.weight
create_tensor: loading tensor blk.30.exp_probs_b.bias
create_tensor: loading tensor blk.31.attn_qkv.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.attn_sinks.weight
create_tensor: loading tensor blk.31.ffn_norm.weight
create_tensor: loading tensor blk.31.ffn_gate_inp.weight
create_tensor: loading tensor blk.31.ffn_gate_exps.weight
create_tensor: loading tensor blk.31.ffn_down_exps.weight
create_tensor: loading tensor blk.31.ffn_up_exps.weight
create_tensor: loading tensor blk.31.exp_probs_b.bias
create_tensor: loading tensor blk.32.attn_qkv.weight
create_tensor: loading tensor blk.32.attn_output.weight
create_tensor: loading tensor blk.32.attn_norm.weight
create_tensor: loading tensor blk.32.attn_sinks.weight
create_tensor: loading tensor blk.32.ffn_norm.weight
create_tensor: loading tensor blk.32.ffn_gate_inp.weight
create_tensor: loading tensor blk.32.ffn_gate_exps.weight
create_tensor: loading tensor blk.32.ffn_down_exps.weight
create_tensor: loading tensor blk.32.ffn_up_exps.weight
create_tensor: loading tensor blk.32.exp_probs_b.bias
create_tensor: loading tensor blk.33.attn_qkv.weight
create_tensor: loading tensor blk.33.attn_output.weight
create_tensor: loading tensor blk.33.attn_norm.weight
create_tensor: loading tensor blk.33.attn_sinks.weight
create_tensor: loading tensor blk.33.ffn_norm.weight
create_tensor: loading tensor blk.33.ffn_gate_inp.weight
create_tensor: loading tensor blk.33.ffn_gate_exps.weight
create_tensor: loading tensor blk.33.ffn_down_exps.weight
create_tensor: loading tensor blk.33.ffn_up_exps.weight
create_tensor: loading tensor blk.33.exp_probs_b.bias
create_tensor: loading tensor blk.34.attn_qkv.weight
create_tensor: loading tensor blk.34.attn_output.weight
create_tensor: loading tensor blk.34.attn_norm.weight
create_tensor: loading tensor blk.34.attn_sinks.weight
create_tensor: loading tensor blk.34.ffn_norm.weight
create_tensor: loading tensor blk.34.ffn_gate_inp.weight
create_tensor: loading tensor blk.34.ffn_gate_exps.weight
create_tensor: loading tensor blk.34.ffn_down_exps.weight
create_tensor: loading tensor blk.34.ffn_up_exps.weight
create_tensor: loading tensor blk.34.exp_probs_b.bias
create_tensor: loading tensor blk.35.attn_qkv.weight
create_tensor: loading tensor blk.35.attn_output.weight
create_tensor: loading tensor blk.35.attn_norm.weight
create_tensor: loading tensor blk.35.ffn_norm.weight
create_tensor: loading tensor blk.35.ffn_gate_inp.weight
create_tensor: loading tensor blk.35.ffn_gate_exps.weight
create_tensor: loading tensor blk.35.ffn_down_exps.weight
create_tensor: loading tensor blk.35.ffn_up_exps.weight
create_tensor: loading tensor blk.35.exp_probs_b.bias
create_tensor: loading tensor blk.36.attn_qkv.weight
create_tensor: loading tensor blk.36.attn_output.weight
create_tensor: loading tensor blk.36.attn_norm.weight
create_tensor: loading tensor blk.36.attn_sinks.weight
create_tensor: loading tensor blk.36.ffn_norm.weight
create_tensor: loading tensor blk.36.ffn_gate_inp.weight
create_tensor: loading tensor blk.36.ffn_gate_exps.weight
create_tensor: loading tensor blk.36.ffn_down_exps.weight
create_tensor: loading tensor blk.36.ffn_up_exps.weight
create_tensor: loading tensor blk.36.exp_probs_b.bias
create_tensor: loading tensor blk.37.attn_qkv.weight
create_tensor: loading tensor blk.37.attn_output.weight
create_tensor: loading tensor blk.37.attn_norm.weight
create_tensor: loading tensor blk.37.attn_sinks.weight
create_tensor: loading tensor blk.37.ffn_norm.weight
create_tensor: loading tensor blk.37.ffn_gate_inp.weight
create_tensor: loading tensor blk.37.ffn_gate_exps.weight
create_tensor: loading tensor blk.37.ffn_down_exps.weight
create_tensor: loading tensor blk.37.ffn_up_exps.weight
create_tensor: loading tensor blk.37.exp_probs_b.bias
create_tensor: loading tensor blk.38.attn_qkv.weight
create_tensor: loading tensor blk.38.attn_output.weight
create_tensor: loading tensor blk.38.attn_norm.weight
create_tensor: loading tensor blk.38.attn_sinks.weight
create_tensor: loading tensor blk.38.ffn_norm.weight
create_tensor: loading tensor blk.38.ffn_gate_inp.weight
create_tensor: loading tensor blk.38.ffn_gate_exps.weight
create_tensor: loading tensor blk.38.ffn_down_exps.weight
create_tensor: loading tensor blk.38.ffn_up_exps.weight
create_tensor: loading tensor blk.38.exp_probs_b.bias
create_tensor: loading tensor blk.39.attn_qkv.weight
create_tensor: loading tensor blk.39.attn_output.weight
create_tensor: loading tensor blk.39.attn_norm.weight
create_tensor: loading tensor blk.39.attn_sinks.weight
create_tensor: loading tensor blk.39.ffn_norm.weight
create_tensor: loading tensor blk.39.ffn_gate_inp.weight
create_tensor: loading tensor blk.39.ffn_gate_exps.weight
create_tensor: loading tensor blk.39.ffn_down_exps.weight
create_tensor: loading tensor blk.39.ffn_up_exps.weight
create_tensor: loading tensor blk.39.exp_probs_b.bias
create_tensor: loading tensor blk.40.attn_qkv.weight
create_tensor: loading tensor blk.40.attn_output.weight
create_tensor: loading tensor blk.40.attn_norm.weight
create_tensor: loading tensor blk.40.attn_sinks.weight
create_tensor: loading tensor blk.40.ffn_norm.weight
create_tensor: loading tensor blk.40.ffn_gate_inp.weight
create_tensor: loading tensor blk.40.ffn_gate_exps.weight
create_tensor: loading tensor blk.40.ffn_down_exps.weight
create_tensor: loading tensor blk.40.ffn_up_exps.weight
create_tensor: loading tensor blk.40.exp_probs_b.bias
create_tensor: loading tensor blk.41.attn_qkv.weight
create_tensor: loading tensor blk.41.attn_output.weight
create_tensor: loading tensor blk.41.attn_norm.weight
create_tensor: loading tensor blk.41.ffn_norm.weight
create_tensor: loading tensor blk.41.ffn_gate_inp.weight
create_tensor: loading tensor blk.41.ffn_gate_exps.weight
create_tensor: loading tensor blk.41.ffn_down_exps.weight
create_tensor: loading tensor blk.41.ffn_up_exps.weight
create_tensor: loading tensor blk.41.exp_probs_b.bias
create_tensor: loading tensor blk.42.attn_qkv.weight
create_tensor: loading tensor blk.42.attn_output.weight
create_tensor: loading tensor blk.42.attn_norm.weight
create_tensor: loading tensor blk.42.attn_sinks.weight
create_tensor: loading tensor blk.42.ffn_norm.weight
create_tensor: loading tensor blk.42.ffn_gate_inp.weight
create_tensor: loading tensor blk.42.ffn_gate_exps.weight
create_tensor: loading tensor blk.42.ffn_down_exps.weight
create_tensor: loading tensor blk.42.ffn_up_exps.weight
create_tensor: loading tensor blk.42.exp_probs_b.bias
create_tensor: loading tensor blk.43.attn_qkv.weight
create_tensor: loading tensor blk.43.attn_output.weight
create_tensor: loading tensor blk.43.attn_norm.weight
create_tensor: loading tensor blk.43.attn_sinks.weight
create_tensor: loading tensor blk.43.ffn_norm.weight
create_tensor: loading tensor blk.43.ffn_gate_inp.weight
create_tensor: loading tensor blk.43.ffn_gate_exps.weight
create_tensor: loading tensor blk.43.ffn_down_exps.weight
create_tensor: loading tensor blk.43.ffn_up_exps.weight
create_tensor: loading tensor blk.43.exp_probs_b.bias
create_tensor: loading tensor blk.44.attn_qkv.weight
create_tensor: loading tensor blk.44.attn_output.weight
create_tensor: loading tensor blk.44.attn_norm.weight
create_tensor: loading tensor blk.44.attn_sinks.weight
create_tensor: loading tensor blk.44.ffn_norm.weight
create_tensor: loading tensor blk.44.ffn_gate_inp.weight
create_tensor: loading tensor blk.44.ffn_gate_exps.weight
create_tensor: loading tensor blk.44.ffn_down_exps.weight
create_tensor: loading tensor blk.44.ffn_up_exps.weight
create_tensor: loading tensor blk.44.exp_probs_b.bias
create_tensor: loading tensor blk.45.attn_qkv.weight
create_tensor: loading tensor blk.45.attn_output.weight
create_tensor: loading tensor blk.45.attn_norm.weight
create_tensor: loading tensor blk.45.attn_sinks.weight
create_tensor: loading tensor blk.45.ffn_norm.weight
create_tensor: loading tensor blk.45.ffn_gate_inp.weight
create_tensor: loading tensor blk.45.ffn_gate_exps.weight
create_tensor: loading tensor blk.45.ffn_down_exps.weight
create_tensor: loading tensor blk.45.ffn_up_exps.weight
create_tensor: loading tensor blk.45.exp_probs_b.bias
create_tensor: loading tensor blk.46.attn_qkv.weight
create_tensor: loading tensor blk.46.attn_output.weight
create_tensor: loading tensor blk.46.attn_norm.weight
create_tensor: loading tensor blk.46.attn_sinks.weight
create_tensor: loading tensor blk.46.ffn_norm.weight
create_tensor: loading tensor blk.46.ffn_gate_inp.weight
create_tensor: loading tensor blk.46.ffn_gate_exps.weight
create_tensor: loading tensor blk.46.ffn_down_exps.weight
create_tensor: loading tensor blk.46.ffn_up_exps.weight
create_tensor: loading tensor blk.46.exp_probs_b.bias
create_tensor: loading tensor blk.47.attn_qkv.weight
create_tensor: loading tensor blk.47.attn_output.weight
create_tensor: loading tensor blk.47.attn_norm.weight
create_tensor: loading tensor blk.47.ffn_norm.weight
create_tensor: loading tensor blk.47.ffn_gate_inp.weight
create_tensor: loading tensor blk.47.ffn_gate_exps.weight
create_tensor: loading tensor blk.47.ffn_down_exps.weight
create_tensor: loading tensor blk.47.ffn_up_exps.weight
create_tensor: loading tensor blk.47.exp_probs_b.bias
load_tensors: offloading output layer to GPU
load_tensors: offloading 47 repeating layers to GPU
load_tensors: offloaded 49/49 layers to GPU
load_tensors:        CUDA0 model buffer size =     0.00 MiB
load_tensors:    CUDA_Host model buffer size =     0.00 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 4
llama_context: n_ctx         = 96000
llama_context: n_ctx_seq     = 96000
llama_context: n_batch       = 4096
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = true
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (96000) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context:  CUDA_Host  output buffer size =     2.33 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 96000 cells
llama_kv_cache: layer   0: dev = CUDA0
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: filtered
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: dev = CUDA0
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: filtered
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = CUDA0
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: filtered
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: dev = CUDA0
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: filtered
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = CUDA0
llama_kv_cache: layer  24: filtered
llama_kv_cache: layer  25: filtered
llama_kv_cache: layer  26: filtered
llama_kv_cache: layer  27: filtered
llama_kv_cache: layer  28: filtered
llama_kv_cache: layer  29: dev = CUDA0
llama_kv_cache: layer  30: filtered
llama_kv_cache: layer  31: filtered
llama_kv_cache: layer  32: filtered
llama_kv_cache: layer  33: filtered
llama_kv_cache: layer  34: filtered
llama_kv_cache: layer  35: dev = CUDA0
llama_kv_cache: layer  36: filtered
llama_kv_cache: layer  37: filtered
llama_kv_cache: layer  38: filtered
llama_kv_cache: layer  39: filtered
llama_kv_cache: layer  40: filtered
llama_kv_cache: layer  41: dev = CUDA0
llama_kv_cache: layer  42: filtered
llama_kv_cache: layer  43: filtered
llama_kv_cache: layer  44: filtered
llama_kv_cache: layer  45: filtered
llama_kv_cache: layer  46: filtered
llama_kv_cache: layer  47: dev = CUDA0
llama_kv_cache:      CUDA0 KV buffer size =     0.00 MiB
llama_kv_cache: size = 2109.38 MiB ( 96000 cells,   9 layers,  4/1 seqs), K (f16): 1265.62 MiB, V (f16):  843.75 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 192
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
llama_kv_cache_iswa: creating     SWA KV cache, size = 1024 cells
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: dev = CUDA0
llama_kv_cache: layer   2: dev = CUDA0
llama_kv_cache: layer   3: dev = CUDA0
llama_kv_cache: layer   4: dev = CUDA0
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: dev = CUDA0
llama_kv_cache: layer   7: dev = CUDA0
llama_kv_cache: layer   8: dev = CUDA0
llama_kv_cache: layer   9: dev = CUDA0
llama_kv_cache: layer  10: dev = CUDA0
llama_kv_cache: layer  11: filtered
llama_kv_cache: layer  12: dev = CUDA0
llama_kv_cache: layer  13: dev = CUDA0
llama_kv_cache: layer  14: dev = CUDA0
llama_kv_cache: layer  15: dev = CUDA0
llama_kv_cache: layer  16: dev = CUDA0
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: dev = CUDA0
llama_kv_cache: layer  19: dev = CUDA0
llama_kv_cache: layer  20: dev = CUDA0
llama_kv_cache: layer  21: dev = CUDA0
llama_kv_cache: layer  22: dev = CUDA0
llama_kv_cache: layer  23: filtered
llama_kv_cache: layer  24: dev = CUDA0
llama_kv_cache: layer  25: dev = CUDA0
llama_kv_cache: layer  26: dev = CUDA0
llama_kv_cache: layer  27: dev = CUDA0
llama_kv_cache: layer  28: dev = CUDA0
llama_kv_cache: layer  29: filtered
llama_kv_cache: layer  30: dev = CUDA0
llama_kv_cache: layer  31: dev = CUDA0
llama_kv_cache: layer  32: dev = CUDA0
llama_kv_cache: layer  33: dev = CUDA0
llama_kv_cache: layer  34: dev = CUDA0
llama_kv_cache: layer  35: filtered
llama_kv_cache: layer  36: dev = CUDA0
llama_kv_cache: layer  37: dev = CUDA0
llama_kv_cache: layer  38: dev = CUDA0
llama_kv_cache: layer  39: dev = CUDA0
llama_kv_cache: layer  40: dev = CUDA0
llama_kv_cache: layer  41: filtered
llama_kv_cache: layer  42: dev = CUDA0
llama_kv_cache: layer  43: dev = CUDA0
llama_kv_cache: layer  44: dev = CUDA0
llama_kv_cache: layer  45: dev = CUDA0
llama_kv_cache: layer  46: dev = CUDA0
llama_kv_cache: layer  47: filtered
llama_kv_cache:      CUDA0 KV buffer size =     0.00 MiB
llama_kv_cache: size =  195.00 MiB (  1024 cells,  39 layers,  4/1 seqs), K (f16):  117.00 MiB, V (f16):   78.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 192
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 128
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 3776
Segmentation fault (core dumped)
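
For reference, the two KV cache sizes reported above can be reproduced by hand, which suggests the cache setup itself completed normally before the crash in `sched_reserve`. The sketch below is only a sanity check under stated assumptions: the head sizes (192 for K, 128 for V) come from the log lines above, while the KV head counts (4 for the full-attention layers, 8 for the SWA layers) are taken from the `n_head_kv` array printed in the model metadata later in this thread. The helper name `kv_cache_mib` is made up for illustration.

```python
# Sanity check of the KV cache sizes printed in the log (f16 cache).
# Assumption: n_head_kv is 4 on full-attention layers and 8 on SWA layers,
# as listed in the n_head_kv metadata array; head sizes are 192 (K) and 128 (V).
BYTES_F16 = 2
MIB = 1024 * 1024


def kv_cache_mib(cells, layers, n_head_kv, head_k=192, head_v=128):
    """Return (K, V, total) cache sizes in MiB for an f16 KV cache."""
    k = cells * layers * n_head_kv * head_k * BYTES_F16 / MIB
    v = cells * layers * n_head_kv * head_v * BYTES_F16 / MIB
    return k, v, k + v


# Non-SWA cache: 96000 cells across 9 layers.
full = kv_cache_mib(cells=96000, layers=9, n_head_kv=4)
# SWA cache: 1024 cells across 39 layers.
swa = kv_cache_mib(cells=1024, layers=39, n_head_kv=8)

print("non-SWA: K=%.3f V=%.3f total=%.3f MiB" % full)
print("SWA:     K=%.3f V=%.3f total=%.3f MiB" % swa)
```

Both results agree with the `llama_kv_cache: size = ...` lines in the log, up to the two-decimal rounding the log applies.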

AesSedai

Owner May 8

edited May 8

I'll try to look at this tomorrow, thanks. I just finished uploading newly converted quants now that the PR has been merged into master. Perhaps those are worth trying?

dehnhaide

May 8

edited May 8

I can confirm that even the previous version of MiMo-V2.5-Q4_K_M (before your update of 05/05/26) does NOT work with the latest llama.cpp, version 9072 (6d57a49a7).

P.S. I am downloading the latest quants now and will report back later.

dehnhaide

May 8

Sadly, the same behavior occurs with the latest "IQ4_XS" quant on llama.cpp version 9075 (58e68df0f):

/home/vik/llms/llama.cpp/build/bin/llama-server \
--model ~/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf \
--alias "aessedai/MiMo-V2.5-IQ4_XS" \
-c 131072 --fit off --threads 24 -fa off --no-mmap --jinja \
--temp 0.8 --top-p 0.95 --seed 1976 \
--host 0.0.0.0 --port 5005 \
-b 4096 -ub 1024 -ctxcp 24 \
-ctk q8_0 -ctv q8_0 -cb \
--chat-template-kwargs '{"reasoning_effort": "normal"}'
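
One detail of this command is worth flagging: the log that follows reports `V cache quantization requires flash_attn`, which lines up with `-fa off` being combined with `-ctv q8_0`. The sketch below is a small, hypothetical pre-flight check for that combination. The function name `check_kv_flags` and the assumed defaults (`auto` for `-fa`, `f16` for `-ctv`) are illustrative assumptions, not part of llama.cpp, and the check only covers the short flag spellings used in this command. It says nothing about whether the segfault itself is related.

```python
import shlex

# Cache types treated as unquantized here (an assumption for this sketch).
UNQUANTIZED = {"f32", "f16", "bf16"}


def check_kv_flags(cmdline):
    """Return a warning if a quantized V cache is requested with -fa off."""
    args = shlex.split(cmdline)

    def value_of(flag, default):
        # Value following the flag, or the assumed default if the flag is absent.
        return args[args.index(flag) + 1] if flag in args else default

    fa = value_of("-fa", "auto")
    ctv = value_of("-ctv", "f16")
    if fa == "off" and ctv not in UNQUANTIZED:
        return "V cache type %s needs flash attention, but -fa is off" % ctv
    return None


# Relevant subset of the flags from the command above.
print(check_kv_flags("llama-server -c 131072 -fa off -ctk q8_0 -ctv q8_0 -cb"))
```

Running it on the flags above yields a warning, while the same flags with `-fa on` pass the check.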

==========================================
ggml_cuda_init: found 8 CUDA devices (Total VRAM: 192990 MiB):
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 3: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 4: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24121 MiB
Device 5: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 6: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
Device 7: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes, VRAM: 24124 MiB
main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true
build_info: b9075-58e68df0f
system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 47 threads for HTTP server
start: binding port with default address family
main: loading model
srv load_model: loading model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
llama_init_from_model: V cache quantization requires flash_attn
common_fit_params: encountered an error while trying to fit params to free device memory: failed to create llama_context from model
common_fit_params: fitting params to free memory took 2.02 seconds
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 55 key-value pairs and 508 tensors from /home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mimo2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.sampling.top_p f32 = 0.950000
llama_model_loader: - kv 3: general.sampling.temp f32 = 1.000000
llama_model_loader: - kv 4: general.name str = MiMo V2.5
llama_model_loader: - kv 5: general.version str = V2.5
llama_model_loader: - kv 6: general.basename str = MiMo
llama_model_loader: - kv 7: general.size_label str = 256x8.2B
llama_model_loader: - kv 8: general.license str = mit
llama_model_loader: - kv 9: general.tags arr[str,6] = ["multimodal", "vision-language", "au...
llama_model_loader: - kv 10: general.languages arr[str,2] = ["en", "zh"]
llama_model_loader: - kv 11: mimo2.block_count u32 = 51
llama_model_loader: - kv 12: mimo2.context_length u32 = 1048576
llama_model_loader: - kv 13: mimo2.embedding_length u32 = 4096
llama_model_loader: - kv 14: mimo2.feed_forward_length u32 = 16384
llama_model_loader: - kv 15: mimo2.attention.head_count u32 = 64
llama_model_loader: - kv 16: mimo2.attention.head_count_kv arr[i32,51] = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, ...
llama_model_loader: - kv 17: mimo2.rope.freq_base f32 = 10000000.000000
llama_model_loader: - kv 18: mimo2.rope.freq_base_swa f32 = 10000.000000
llama_model_loader: - kv 19: mimo2.expert_used_count u32 = 8
llama_model_loader: - kv 20: mimo2.expert_group_count u32 = 1
llama_model_loader: - kv 21: mimo2.expert_group_used_count u32 = 1
llama_model_loader: - kv 22: mimo2.expert_gating_func u32 = 2
llama_model_loader: - kv 23: mimo2.attention.key_length u32 = 192
llama_model_loader: - kv 24: mimo2.attention.value_length u32 = 128
llama_model_loader: - kv 25: mimo2.attention.sliding_window u32 = 128
llama_model_loader: - kv 26: mimo2.attention.sliding_window_pattern arr[i32,51] = [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, ...
llama_model_loader: - kv 27: mimo2.expert_count u32 = 256
llama_model_loader: - kv 28: mimo2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 29: mimo2.rope.dimension_count u32 = 64
llama_model_loader: - kv 30: mimo2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 31: mimo2.attention.value_scale f32 = 0.707000
llama_model_loader: - kv 32: mimo2.nextn_predict_layers u32 = 3
llama_model_loader: - kv 33: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 34: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 35: tokenizer.ggml.tokens arr[str,152576] = ["!", """, "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 36: tokenizer.ggml.token_type arr[i32,152576] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 37: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 38: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 39: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 40: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 41: tokenizer.chat_template str = {%- if not add_generation_prompt is d...
llama_model_loader: - kv 42: general.quantization_version u32 = 2
llama_model_loader: - kv 43: general.file_type u32 = 7
llama_model_loader: - kv 44: MoE_Quantization.ffn_up_exps str = IQ3_S
llama_model_loader: - kv 45: MoE_Quantization.ffn_gate_exps str = IQ3_S
llama_model_loader: - kv 46: MoE_Quantization.ffn_down_exps str = IQ4_XS
llama_model_loader: - kv 47: MoE_Quantization.type_default str = Q8_0
llama_model_loader: - kv 48: quantize.imatrix.file str = /mnt/srv/snowdrift/fp16/MiMo-V2.5/ima...
llama_model_loader: - kv 49: quantize.imatrix.dataset str = /mnt/srv/host/resources/KLD/calibrati...
llama_model_loader: - kv 50: quantize.imatrix.entries_count u32 = 287
llama_model_loader: - kv 51: quantize.imatrix.chunks_count u32 = 51
llama_model_loader: - kv 52: split.no u16 = 0
llama_model_loader: - kv 53: split.tensors.count i32 = 508
llama_model_loader: - kv 54: split.count u16 = 4
llama_model_loader: - type f32: 248 tensors
llama_model_loader: - type q8_0: 119 tensors
llama_model_loader: - type iq3_s: 94 tensors
llama_model_loader: - type iq4_xs: 47 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 137.75 GiB (3.82 BPW)
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:01:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:02:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA2 (NVIDIA GeForce RTX 3090) (0000:81:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA3 (NVIDIA GeForce RTX 3090) (0000:82:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA4 (NVIDIA GeForce RTX 3090) (0000:83:00.0) - 23840 MiB free
llama_prepare_model_devices: using device CUDA5 (NVIDIA GeForce RTX 3090) (0000:84:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA6 (NVIDIA GeForce RTX 3090) (0000:c1:00.0) - 23859 MiB free
llama_prepare_model_devices: using device CUDA7 (NVIDIA GeForce RTX 3090) (0000:c2:00.0) - 23859 MiB free
load: 0 unused tokens
load: control-looking token: 128247 '' was not control-type; this is probably a bug in the model. its type will be overridden
load: printing all EOG tokens:
load: - 128247 ('')
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 0.9312 MB
print_info: arch = mimo2
print_info: vocab_only = 0
print_info: no_alloc = 0
print_info: n_ctx_train = 1048576
print_info: n_embd = 4096
print_info: n_embd_inp = 4096
print_info: n_layer = 51
print_info: n_head = 64
print_info: n_head_kv = [4, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8, 8, 8, 4, 8, 8, 8]
print_info: n_rot = 64
print_info: n_swa = 128
print_info: is_swa_any = 1
print_info: n_embd_head_k = 192
print_info: n_embd_head_v = 128
print_info: n_gqa = [16, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8, 8, 8, 16, 8, 8, 8]
print_info: n_embd_k_gqa = [768, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536, 1536, 1536, 768, 1536, 1536, 1536]
print_info: n_embd_v_gqa = [512, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024, 1024, 1024, 512, 1024, 1024, 1024]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: f_attn_value_scale = 0.7070
print_info: n_ff = 16384
print_info: n_expert = 256
print_info: n_expert_used = 8
print_info: n_expert_groups = 1
print_info: n_group_used = 1
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 2
print_info: rope scaling = linear
print_info: freq_base_train = 10000000.0
print_info: freq_scale_train = 1
print_info: freq_base_swa = 10000.0
print_info: freq_scale_swa = 1
print_info: n_embd_head_k_swa = 192
print_info: n_embd_head_v_swa = 128
print_info: n_rot_swa = 64
print_info: n_ctx_orig_yarn = 1048576
print_info: rope_yarn_log_mul = 0.0000
print_info: rope_finetuned = unknown
print_info: model type = 310B.A15B
print_info: model params = 309.77 B
print_info: general.name = MiMo V2.5
print_info: vocab type = BPE
print_info: n_vocab = 152576
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 128247 ''
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
model has unused tensor blk.48.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.48.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.48.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.48.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.48.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.49.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.49.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.49.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.49.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_output.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.50.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.50.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.ffn_gate.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.ffn_down.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.ffn_up.weight (size = 71303168 bytes) -- ignoring
model has unused tensor blk.50.nextn.eh_proj.weight (size = 35651584 bytes) -- ignoring
model has unused tensor blk.50.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.layer_output_norm.weight (size = 16384 bytes) -- ignoring
load_tensors: offloading output layer to GPU
load_tensors: offloading 50 repeating layers to GPU
load_tensors: offloaded 52/52 layers to GPU
load_tensors: CUDA0 model buffer size = 17974.98 MiB
load_tensors: CUDA1 model buffer size = 20628.29 MiB
load_tensors: CUDA2 model buffer size = 17680.63 MiB
load_tensors: CUDA3 model buffer size = 20628.29 MiB
load_tensors: CUDA4 model buffer size = 17680.63 MiB
load_tensors: CUDA5 model buffer size = 17680.63 MiB
load_tensors: CUDA6 model buffer size = 20628.29 MiB
load_tensors: CUDA7 model buffer size = 6708.14 MiB
load_tensors: CUDA_Host model buffer size = 633.25 MiB
....................................................................................................
common_init_result: added logit bias = -inf
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_init_from_model: V cache quantization requires flash_attn
common_init_result: failed to create context with model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
common_init_from_params: failed to create context with model '/home/vik/models/gguf/aessedai/MiMo-V2.5-IQ4_XS/MiMo-V2.5-IQ4_XS-00001-of-00004.gguf'
./llama-srv_MiMo-V2.5-IQ4_XS.sh: line 12: 35811 Segmentation fault (core dumped)
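
As a sanity check on the sizes in the log above, here is a minimal sketch that recomputes the bits per weight from the reported file size and parameter count, and adds up the per-device buffers plus the ignored `blk.48` to `blk.50` tensors. All numbers are copied from the log; the variable names are made up for illustration, and the reading that "buffers + ignored tensors = file size" is an assumption that happens to agree with the log to within rounding.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

# Values copied from the print_info lines in the log.
file_size_gib = 137.75
model_params = 309.77e9

# Bits per weight = total bits in the file / number of parameters.
bpw = file_size_gib * GIB * 8 / model_params

# Model buffer sizes in MiB: CUDA0..CUDA7, then CUDA_Host.
buffers_mib = [17974.98, 20628.29, 17680.63, 20628.29,
               17680.63, 17680.63, 20628.29, 6708.14, 633.25]

# Sizes in bytes of the tensors reported as unused for ONE block.
unused_per_block = [35651584, 16384, 256, 16384,
                    71303168, 71303168, 71303168,
                    35651584, 16384, 16384, 16384]
unused_mib = 3 * sum(unused_per_block) / MIB  # blk.48, blk.49, blk.50

total_gib = (sum(buffers_mib) + unused_mib) * MIB / GIB

if __name__ == "__main__":
    print(f"bits per weight: {bpw:.2f}")
    print(f"buffers + ignored tensors: {total_gib:.2f} GiB")
```

Both figures line up with the `3.82 BPW` and `137.75 GiB` that llama.cpp prints, so the weights themselves loaded in full and the failure comes later, at context creation.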

AesSedai

Owner May 8

Thanks, will try to load it up on my system and see if I can spot the problem

AesSedai

Owner May 8

edited May 8

@dehnhaide can you try with the mimo-v2.5-fattn branch?

Edit: also, -fa off isn't compatible with -ctk q8_0 -ctv q8_0; you need FA to quantize the KV cache.

AesSedai

Owner May 8

@dehnhaide I was able to reproduce your issue with -fa off -ctk q8_0 -ctv q8_0. It works with -fa on, or without -ctk q8_0 -ctv q8_0. I'd still recommend using the mimo-v2.5-fattn branch I linked in the readme, plus -fa on.
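
The failing and working combinations can be written down as a small pre-flight check for a launch script. This is only a sketch: `check_kv_cache_flags` is a hypothetical helper, not part of llama.cpp, and it follows the log message ("V cache quantization requires flash_attn"), so it only looks at the V cache type. Treating `f32`, `f16` and `bf16` as the unquantized types is an assumption.

```python
# Hypothetical helper for a wrapper script; llama.cpp itself does not ship this.
UNQUANTIZED_TYPES = {"f32", "f16", "bf16"}  # assumption: everything else is quantized


def check_kv_cache_flags(flash_attn: str, cache_type_v: str = "f16") -> None:
    """Raise ValueError for the flag combination that failed in this thread.

    flash_attn   -- value passed to -fa ("on" or "off")
    cache_type_v -- value passed to -ctv (for example "f16" or "q8_0")
    """
    if flash_attn == "off" and cache_type_v not in UNQUANTIZED_TYPES:
        raise ValueError(
            f"-ctv {cache_type_v} needs flash attention; "
            "use -fa on or drop the cache quantization flags"
        )


if __name__ == "__main__":
    check_kv_cache_flags("on", "q8_0")   # works: FA on with a quantized cache
    check_kv_cache_flags("off", "f16")   # works: FA off without quantization
    try:
        check_kv_cache_flags("off", "q8_0")  # the combination from the report
    except ValueError as err:
        print(f"rejected: {err}")
```

Running a check like this before starting the server turns the segfault into a readable error message.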

dehnhaide

May 8

> @dehnhaide I was able to reproduce your issue with -fa off -ctk q8_0 -ctv q8_0. It works with -fa on, or without -ctk q8_0 -ctv q8_0. I'd still recommend using the mimo-v2.5-fattn branch I linked in the readme, plus -fa on.

Damn! I am so sorry for the misreport! :( Indeed, after toggling "-fa on/off" while playing with the initial version (in mainline), I had left it set to "off".
Happy to report that it now starts up (with version: 9079 (f9cd456ea)), BUT it's still broken: the speed is abysmal! A simple prompt in OpenCode gives:

prompt eval time = 391869.08 ms / 19154 tokens ( 20.46 ms per token, 48.88 tokens per second)
eval time = 17018.09 ms / 126 tokens ( 135.06 ms per token, 7.40 tokens per second)
total time = 408887.17 ms / 19280 tokens
slot release: id 2 | task 2 | stop processing: n_tokens = 19279, truncated = 0

With https://github.com/AesSedai/llama.cpp/tree/mimo-v2.5-fattn and "-fa on", the new IQ4_XS quant works beautifully (about 40 tok/s in OpenCode):

prompt eval time = 19402.71 ms / 19154 tokens ( 1.01 ms per token, 987.18 tokens per second)
eval time = 2592.45 ms / 102 tokens ( 25.42 ms per token, 39.35 tokens per second)
total time = 21995.15 ms / 19256 tokens
slot release: id 2 | task 2 | stop processing: n_tokens = 19255, truncated = 0
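
To put a number on the difference, here is a small sketch that recomputes tokens per second from the elapsed time and token count in the two sets of timing lines above, then compares the runs. The regular expression and function names are my own and only cover the two line shapes shown here.

```python
import re

# Timing lines copied from the two runs above.
MAINLINE_LOG = """\
prompt eval time = 391869.08 ms / 19154 tokens ( 20.46 ms per token, 48.88 tokens per second)
eval time = 17018.09 ms / 126 tokens ( 135.06 ms per token, 7.40 tokens per second)
"""
FATTN_LOG = """\
prompt eval time = 19402.71 ms / 19154 tokens ( 1.01 ms per token, 987.18 tokens per second)
eval time = 2592.45 ms / 102 tokens ( 25.42 ms per token, 39.35 tokens per second)
"""

# Captures the phase name, elapsed milliseconds and token count.
TIMING = re.compile(
    r"^(prompt eval|eval) time = ([\d.]+) ms / (\d+) tokens", re.MULTILINE
)


def tokens_per_second(log: str) -> dict[str, float]:
    """Recompute throughput per phase from elapsed ms and token count."""
    return {
        phase: int(tokens) / (float(ms) / 1000.0)
        for phase, ms, tokens in TIMING.findall(log)
    }


def speedup(before: str, after: str) -> dict[str, float]:
    """Ratio of throughput in `after` to throughput in `before`, per phase."""
    slow, fast = tokens_per_second(before), tokens_per_second(after)
    return {phase: fast[phase] / slow[phase] for phase in slow}


if __name__ == "__main__":
    for phase, factor in speedup(MAINLINE_LOG, FATTN_LOG).items():
        print(f"{phase}: {factor:.1f}x faster")
```

On these two runs that is roughly a 20x gain in prompt processing and a 5x gain in generation.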

AesSedai

Owner May 8

edited May 8

Cool. I think this PR is close to merging, so the FA fixes will be on master after that: https://github.com/ggml-org/llama.cpp/pull/22812

The speed issue is known; I didn't want to pollute the initial support PR with the FA fixes. Basically, MiMo uses asymmetric 192/128 attention heads (K head size 192, V head size 128), a combination that wasn't templated yet since no other model has needed it.
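
The asymmetry is visible in the load log earlier in the thread (`n_embd_head_k = 192`, `n_embd_head_v = 128`). A minimal sketch, using only values from that log, shows how the per-layer K and V row widths follow from the KV head count; the function name is made up for illustration.

```python
# Head sizes as printed by print_info in the log.
N_EMBD_HEAD_K = 192  # size of each K head
N_EMBD_HEAD_V = 128  # size of each V head


def kv_row_widths(n_head_kv: int) -> tuple[int, int]:
    """Return (K width, V width) per token for a layer with n_head_kv KV heads."""
    return n_head_kv * N_EMBD_HEAD_K, n_head_kv * N_EMBD_HEAD_V


if __name__ == "__main__":
    # The log lists layers with either 4 or 8 KV heads.
    for n_head_kv in (4, 8):
        k_width, v_width = kv_row_widths(n_head_kv)
        print(f"n_head_kv={n_head_kv}: K={k_width}, V={v_width}")
```

The results match the `n_embd_k_gqa` (768 / 1536) and `n_embd_v_gqa` (512 / 1024) entries in the log, so K rows are always 1.5 times as wide as V rows.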

Downtown-Case

May 8

Appreciate all this. Downloading to test, but would y'all say this model has good pop culture knowledge? Decent prose? There's so much focus on agentic stuff, it's hard to tell if they threw everything else out (like Qwen).

AesSedai

Owner May 8

@Downtown-Case I've mostly been using Pro, but it has become my favorite RP model due to its prose / natural language / intuition. I've only used non-Pro a little bit and it's probably 65% of the way there. It's not bad but there isn't really a replacement for tripling the total and active params. Too bad Pro doesn't support multimodality :(
