Create GGUFs if possible?

#3
by InfernalDread - opened

Hello,

Thank you guys for the work that you do! I was wondering if it would be possible to release various sized GGUF quants for people to run under llamacpp, as it would be a great way to test these models?

Thank you.

I tried a quant I made myself for llamacpp, but I received this error at the beginning of model loading. Would you guys know of a solution?

llama_model_load: error loading model: missing tensor 'blk.60.attn_norm.weight'

While converting, I noticed that the layers went from 0-59, but llamacpp is oddly expecting an extra layer 60.

I had the same issue with nex-agi/Nex-N2-mini and vibecoded a fix, so I can't tell you exactly what my agent did, but you can try the same approach to get a working GGUF.

Did you have to regenerate the GGUF, or just patch the llamacpp project to get your current GGUF to work? If it's the patch, do you think you could upload your version of llamacpp to GitHub for me to try as well?

Thank you!

config.json advertises mtp_num_hidden_layers: 1, but the uploaded model does not ship the corresponding MTP (Multi-Token Prediction) tensors. Try calling convert_hf_to_gguf.py with the --no-mtp parameter.
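
If you want to confirm this before converting, here is a minimal Python sketch that compares config.json against the tensor names in the checkpoint index. It rests on assumptions not stated in this thread: that the checkpoint has a model.safetensors.index.json with a weight_map, that layer tensors are named like "...layers.<N>....", and that MTP tensors, if shipped, would either use a layer index past the main stack or carry an "mtp" prefix. The function names are made up for illustration.

```python
import json
import re
from pathlib import Path

# Assumption: layer tensors look like "model.layers.12.self_attn.q_proj.weight".
LAYER_RE = re.compile(r"(?:^|\.)layers\.(\d+)\.")


def layer_indices(tensor_names):
    """Collect every layer index that appears in the given tensor names."""
    found = set()
    for name in tensor_names:
        match = LAYER_RE.search(name)
        if match:
            found.add(int(match.group(1)))
    return found


def mtp_tensors_missing(config, tensor_names):
    """Return True if config advertises MTP layers but no MTP tensors are found."""
    # Assumption: some multimodal configs nest the text settings under "text_config".
    cfg = config.get("text_config", config)
    n_main = cfg["num_hidden_layers"]
    n_mtp = cfg.get("mtp_num_hidden_layers", 0)
    if n_mtp == 0:
        return False
    names = list(tensor_names)
    extra_layers = {i for i in layer_indices(names) if i >= n_main}
    has_mtp_prefix = any(n.startswith("mtp.") or ".mtp." in n for n in names)
    return not extra_layers and not has_mtp_prefix


def check_model_dir(model_dir):
    """Run the check against a local Hugging Face checkpoint directory."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text())
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    return mtp_tensors_missing(config, index["weight_map"].keys())
```

If check_model_dir(...) returns True for your local copy, the "missing tensor 'blk.60...'" error above is the expected symptom, and --no-mtp is the workaround.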

Guys, we need IQ3_XXS and IQ4_XS <3

There are more issues: The chat_template.jinja does not define the tags in the form expected by llama.cpp. This file needs to be patched.

Diff:
103c103
< {{- '<|im_start|>' + message.role + '\n' + content }}
---
> {{- '<|im_start|>' + message.role + '\n<think></think>' + content }}
150c150
< {{- '<think>\n\n</think>\n\n' }}
---
> {{- '<think></think>' }}
152c152
< {{- '<think>' }}
---
> {{- '<think>\n' }}

OK, progress update: I did what J8son91 suggested and used --no-mtp to get rid of the extra layer error, and it worked, the model loaded. I also made the chat template changes from the diff posted by the same user. However, when I talk to the model, gibberish is all that comes out. I am not sure what is missing.

Try this chat template https://pastebin.com/ay9hkyNc that worked for me with the Mini version

Convert to GGUF:

python3 convert_hf_to_gguf.py /storage/models/nex-agi/Nex-N2-Pro/ \
    --outtype f16 \
    --outfile /storage/models/nex-agi/Nex-N2-Pro_f16.gguf \
    --no-mtp

I have only tried the Q8_0 quantization because quality is more important than speed for me (batch processing):

./llama-quantize /storage/models/nex-agi/Nex-N2-Pro_f16.gguf /storage/models/nex-agi/Nex-N2-Pro_Q8_0.gguf Q8_0

Server:

./llama-server \
     --model /storage/models/nex-agi/Nex-N2-Pro_Q8_0.gguf \
     --port 39507 \
     --host 127.0.0.1 \
     --no-webui \
     --offline \
     -c 262144 \
     -ctk q8_0 -ctv q8_0 \
     --reasoning on \
     --reasoning-format deepseek \
     --chat-template-file /storage/models/nex-agi/Nex-N2-Pro/chat_template.llamacpp-thinking.jinja \
     --parallel 1 \
     --threads 64 \
     --threads-batch 64 \
     --log-verbosity 4 \
     --flash-attn on \
     -ub 4096 -b 4096 \
     --no-mmap \
     -ngl 6
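
To check quickly whether the server produces sane output rather than gibberish, something like the following Python sketch can query it. It assumes the host and port from the command above, llama-server's OpenAI-compatible /v1/chat/completions endpoint, and that with --reasoning-format deepseek the thinking text comes back in a separate reasoning_content field of the message. The model name in the payload is just a placeholder and the helper names are made up.

```python
import json
import urllib.request

# Host and port taken from the llama-server command above.
BASE_URL = "http://127.0.0.1:39507/v1"


def build_chat_request(prompt, max_tokens=256):
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": "llama-server",  # placeholder name, an assumption
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def split_reply(response):
    """Return (reasoning, answer) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    # Assumption: reasoning arrives in "reasoning_content" when the server
    # extracts it; fall back to an empty string if the field is absent.
    return message.get("reasoning_content") or "", message.get("content") or ""


def ask(prompt, base_url=BASE_URL, timeout=600):
    """Send one prompt to the server and return (reasoning, answer)."""
    data = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return split_reply(json.load(resp))
```

Calling ask("Say hello in one sentence.") should return a short answer. If the answer never ends or the reasoning leaks into the answer, the chat template or EOS settings are the first things to check.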

GitHub Copilot Chat:
~/.config/Code/User/chatLanguageModels.json

[
    {
        "name": "Custom Endpoint",
        "vendor": "customendpoint",
        "apiType": "chat-completions",
        "models": [
            {
                "id": "llama-server",
                "name": "llama-server",
                "url": "http://localhost:39507/v1",
                "toolCalling": true,
                "vision": true,
                "maxInputTokens": 128000,
                "maxOutputTokens": 16000
            }
        ]
    }
]

Here is the gguf version: https://hugston.com/models/hugston-nex-agi-nex-n2-proq4-k-m

Try it and let me know. If it is any good, I will upload it to HF.

Is it possible for you to make a 2-bit quant? With this quant, on my system, I can barely fit any usable context, but within 2-bit range, I can fit full context to use in agents. That's what I used for the original Qwen 397B as well.

Thank you for your help by the way!

Is it possible for you to make a 2-bit quant?

A 2-bit quant would need an imatrix, which degrades the model quality (according to our research), so we skip that. As soon as time allows, we can try a q3xxs or similar quant that does not need an imatrix. It would be great to have a bit of feedback on the current one first.

I have found another bug and these bugs need to be fixed before we can create useful quantizations of this model.

In chat_template.jinja, every message ends with <|im_end|> and tokenizer_config.json declares:

"eos_token": "<|im_end|>"

Now according to tokenizer.json, <|im_end|> = 248046 but when you look into config.json you find

"eos_token_id": 248044,

which is a contradiction (BUG!). This means that even if the model signals that it is done talking (with the <|im_end|> token), the inference server will ignore this and continue generating tokens, producing weird nonsense.
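
The mismatch can be checked mechanically. Below is a minimal Python sketch, assuming the usual layout of the three files: tokenizer.json lists special tokens under added_tokens (with a fallback to model.vocab), tokenizer_config.json stores eos_token as a string or as a dict with a content field, and config.json has a top-level eos_token_id that may be a single ID or a list. The function names are made up for illustration.

```python
import json
from pathlib import Path


def token_id_from_tokenizer(tokenizer, token):
    """Look up a token's ID in a parsed tokenizer.json."""
    for entry in tokenizer.get("added_tokens", []):
        if entry.get("content") == token:
            return entry["id"]
    # Fallback for tokens that only live in the base vocabulary.
    return tokenizer.get("model", {}).get("vocab", {}).get(token)


def eos_mismatch(config, tokenizer_config, tokenizer):
    """Return None if the EOS settings agree, else (token, tokenizer_id, config_id)."""
    eos = tokenizer_config["eos_token"]
    if isinstance(eos, dict):  # some configs store the token as a dict
        eos = eos["content"]
    expected = token_id_from_tokenizer(tokenizer, eos)
    declared = config.get("eos_token_id")
    declared_ids = declared if isinstance(declared, list) else [declared]
    if expected in declared_ids:
        return None
    return (eos, expected, declared)


def check_model_dir(model_dir):
    """Load the three JSON files from a checkpoint directory and compare them."""
    model_dir = Path(model_dir)
    load = lambda name: json.loads((model_dir / name).read_text())
    return eos_mismatch(
        load("config.json"), load("tokenizer_config.json"), load("tokenizer.json")
    )
```

With the values quoted above, eos_mismatch reports <|im_end|> as 248046 in the tokenizer against 248044 in config.json.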

The download really sucks there, hope you will upload it to HF.

@gopi87

Thank you for the feedback. Can you be more specific about the issues you encountered: was it low speed, a broken link, timeouts, or limits?
It is supposed to be over 100mb/s going by other users' reports, so you should be able to download the whole 225 gb file in about 25 mins.
OFC we can try to upload it here (today 10-36-23), but as it is a single file we get timeouts and cutoffs many times, so we thought users could test it first, and if it is good enough it will be uploaded to HF.

Please let us know what the real issue was and thanks again for your feedback.
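
For anyone estimating their own download time, here is a small Python sketch of the arithmetic. It assumes "mb/s" means megabytes per second and uses decimal units (1 GB = 1000 MB); if the figure is actually megabits per second, the times are 8 times longer. The function names are made up.

```python
def transfer_minutes(size_gb, rate_mb_per_s):
    """Minutes needed to move size_gb gigabytes at rate_mb_per_s megabytes/s."""
    return size_gb * 1000 / rate_mb_per_s / 60


def required_rate_mb_per_s(size_gb, minutes):
    """Sustained rate in megabytes/s needed to move size_gb in the given minutes."""
    return size_gb * 1000 / (minutes * 60)


if __name__ == "__main__":
    print(transfer_minutes(225, 100))       # 225 GB at exactly 100 MB/s
    print(required_rate_mb_per_s(225, 25))  # rate needed for a 25-minute download
```

Under those assumptions, 225 GB at exactly 100 MB/s takes 37.5 minutes, and finishing in 25 minutes needs a sustained 150 MB/s.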

Thanks for that. The reason why HF is better is that I can download at around 70mb/sec, and a resume function is also available.

Uploading large files to HF is very painful and time-consuming (like everywhere else). As you can see, ~3 hours later we are at ~35%, hoping that it will not break or time out.
We will do everything to support Huggingface because it deserves it and is a place of great value. Still, we would like to skip this pain if possible; we simply do not have the time and resources to afford it.
So unless we learn of some new solution, after the upload of this model we will be avoiding uploads of models over 50gb.
12-51-43
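
One possible way around single-file timeouts, offered as a sketch rather than a tested recipe: split the GGUF into smaller shards first (llama.cpp ships a llama-gguf-split tool for this, see its --help for the options), then upload the folder with huggingface_hub's upload_large_folder, which is designed to resume after interruptions. The folder layout, repo name, and helper names below are hypothetical, and huggingface_hub is a third-party package that must be installed and logged in.

```python
from pathlib import Path


def list_shards(folder):
    """Return the .gguf file names in a folder, sorted so shard order is stable."""
    return sorted(p.name for p in Path(folder).glob("*.gguf"))


def upload_folder_resumable(folder, repo_id):
    """Upload every file in the folder to a Hugging Face model repo."""
    # Imported here so the helper above works without the package installed.
    from huggingface_hub import HfApi

    # upload_large_folder tracks progress locally, so re-running the same call
    # after a timeout should continue instead of starting over.
    HfApi().upload_large_folder(
        repo_id=repo_id,
        folder_path=str(folder),
        repo_type="model",
    )
```

Usage would be along the lines of upload_folder_resumable("/path/to/shards", "your-user/your-repo"), with both arguments replaced by your own values.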

Yep, I think 50gb files will be perfect. Hope the long upload doesn't break.

@mradermacher Can you create a GGUF quantization for this model?

Update 6 hours later:
17-35-14
10 hours later:

20-20-31
Reminds me of the 9-27-56 kb/s modems. Whoa, 10 hours and still at 75%. What is this, eMule? Shall we torrent again?
We pay for 1000/1000 mb/s, what is happening...

edit: Done and available for download after 13 hours of uploading time here: https://huggingface.co/Trilogix1/Hugston-Nex-N2-Pro-gguf.
