The generation falls into constant repetition without any good result

#2
by ddd2r2 - opened

Ollama locally tested with Nvidia RTX 5060Ti 16GB on two GGUF models:

  • GLM-4.7-Flash-REAP-23B-A3B-UD-IQ3_XXS.gguf
  • GLM-4.7-Flash-REAP-23B-A3B-UD-Q2_K_XL.gguf

Poor results with the request "Write simple rust program with GTK4": code generation either runs for a very long time or loops indefinitely, constantly repeating itself, and in the end never produces a usable result.

Ran an extra test with GLM-4.7-Flash-REAP-23B-A3B-UD-IQ2_XXS.gguf and got the same poor result: meaningless code generation with infinite repetition.


Unsloth AI org


Do not use Ollama; it's stated in our docs multiple times that it doesn't work!! https://unsloth.ai/docs/models/glm-4.7-flash#run-glm-4.7-flash


To get better results in Ollama (v15.0), use Unsloth's recommended settings from the Jan 21st release: temp 1.0, top-p 0.95, repeat-penalty 1.0, min-p 0.01. Then, in ollama/models/blobs, change the start of the file that looks like the below (as of Jan 25th, the file starting sha256-eba9):

From
"model_format":"gguf","model_family":"deepseek2","model_families":["deepseek2"],"model_type":"23B","file_type":"unknown"

To
"model_format":"gguf","model_family":"glm4moelite","model_families":["glm4moelite"],"model_type":"23B","file_type":"unknown","renderer":"glm-4.7","parser":"glm-4.7"

This is probably a bit of a hack rather than a fix but the results are decent.
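The JSON edit above can also be scripted rather than done by hand. A minimal sketch, assuming the blob is a single JSON object whose keys match the From/To lines quoted in this thread (the exact blob file name is truncated in the post and is not filled in here):

```python
# Patch the Ollama model-config blob so the model is served with the
# glm4moelite family and the glm-4.7 renderer/parser, per the thread above.
import json


def patch_config(text: str) -> str:
    """Rewrite the deepseek2 family fields and add renderer/parser keys."""
    cfg = json.loads(text)
    cfg["model_family"] = "glm4moelite"
    cfg["model_families"] = ["glm4moelite"]
    cfg["renderer"] = "glm-4.7"
    cfg["parser"] = "glm-4.7"
    # Insertion order is preserved, so the new keys land at the end,
    # matching the "To" line quoted in the thread.
    return json.dumps(cfg, separators=(",", ":"))
```

Back up the original blob before overwriting it; if Ollama re-pulls or re-verifies the model, the edit will be lost.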

Hate to break it to you, but any quantization 3-bit or under will produce little-to-no intelligence. You're lucky that you get natural language! Hopefully the results are better outside of Ollama.

I tried 4-bit and it benchmarked surprisingly well. I recommend using it with PowerInfer if you have low VRAM.

The fact that this model could be REAPed and quantized anyway is amazing. We're on the path to smarter tiny models, and I can't wait to see it all happen.


I managed to turn the hack into a JSON that I was able to use and get working in LM Studio. The Q4 and Q3 quants seem to work very well, and don't appear to be much less intelligent than the non-REAP variants at the same quant levels. I need to test more, but so far the results have been quite satisfactory for me as a chatbot. As an agent source, however, I've only tested the Q4 REAP quant and the non-REAP variants, and the amount of context needed to run the model at useful speeds is still a bottleneck. I'm now testing Q3 REAP to see whether I can get the context high enough at sufficient speed, weighed against how much intelligence is lost. Maybe it could still work, but I have my doubts; and all of this on 16GB of VRAM and 32GB of DDR4, mind you.

Unsloth AI org


We wrote that it doesn't work in Ollama in our glm-4.7-flash docs.

Unsloth AI org

If it helps, I'm reuploading them and redoing them as we speak - hope they're better!

Can confirm good results with the latest versions and revised model file! Thank you.

My experience with an RTX 3090 and llama.cpp b7898 (2026/02/01), temp=0.7, top-p=1, min-p=0.01, c=100000, with all .gguf files no older than 3 days:

  • Q4_0: good results, fast
  • Q4_K_M: worse results, getting stuck in thinking loop occasionally -> unusable
  • Q4_UD_K_XL: worse results than Q4_K_M, and also getting stuck in thinking loop more often -> unusable

It's counterintuitive given what quantization benchmarks say (Q4_0 should be worst), but here it's the reverse.

The model is otherwise excellent (when it works) for its size.

I tried Q2_K_XL and IQ3_XXS with various inference settings, both recommended and not. The model generates a landing page, but makes mistakes like a non-existent input name or '[a href]' as a CSS selector, and it gets into loops from time to time when trying to generate SVG icons:

<svg class="icon" viewBox="0 0 24 24"><path d="M19 3h-1V2h-1V1h-4V2h-4V1h-4V2h-1V3h-1V4h-1V5h-1V6h-1V7h-1V8h-1V9h-1V10h-1V11h-1V12h-1V13h-1V14h-1V15h-1V16h-1V17h-1V18h-1V19h-1V20h-1V21h-1V22h-1V23h-1V24h-1V25h-1V26h-1V27h-1V28h-1V29h-1V30h-1V31h-1V32h-1V33h-1V34h-1V35h-1V36h-1V37h-1V38h-1V39h-1V40h-1V41h-1V42h-1V43h-1V44h-1V45h-1V46h-1V47h-1V48h-1V49h-1V50h-1V51h-1V52h-1V53h-1V54h-1V55h-1V56h-1V57h-1V58h-1V59h-1V60h-1V61h-1V62h-1V63h-1V64h-1V65h-1V66h-1V67h-1V68h-1V69h-1V70h-1V71h-1V72h-1V73h-1V74h-1V75h-1V76h-1V77h-1V78h-1V79h-1V80h-1V81h-1V82h-1V83h-1V84h-1V85h-1V86h-1V87h-1V88h-1V89h-1V90h

Downloading Q4_UD_K_XL for now. I thought UD versions were supposed to be much more stable than regular Q4/Q_K quants.

Can confirm Q4_K_M is busted: it either thinks for a very long time before eventually succeeding at output, or it gets permanently stuck thinking. I set the tuning options mentioned in this thread.

Q4_K_S works OK. Just set temperature to 1 (not 0 or 0.6), min_p 0.01, top_p 0.9, top_k 40, and disable the repetition penalty.
It seems that low-bit quants at low temperature tend to produce the same tokens in loops.
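For anyone wondering why min_p matters here: min-p sampling keeps only tokens whose probability is at least min_p times the top token's probability, so at temperature 1 it trims the noisy low-probability tail that quantized models can loop on. A toy illustration of the filtering rule (made-up probabilities, not llama.cpp's actual code path):

```python
# Toy illustration of min-p filtering: keep tokens whose probability is
# at least min_p * max(probabilities), then renormalize what's left.

def min_p_filter(probs: dict[str, float], min_p: float = 0.01) -> dict[str, float]:
    threshold = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= threshold}
    total = sum(kept.values())
    return {tok: p / total for tok, p in kept.items()}

# A peaked distribution: implausible junk falls below the threshold.
probs = {"fn": 0.70, "let": 0.25, "banana": 0.0005}
filtered = min_p_filter(probs, min_p=0.01)
# "banana" (0.0005 < 0.01 * 0.70) is dropped; "fn" and "let" survive.
```

With a flatter (more uncertain) distribution, the threshold drops too, so min-p adapts per step instead of always cutting a fixed number of tokens like top_k does.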

I refined the previous "hack" into a proper Ollama format, and the results are better. I have since switched to the standard GLM-4.7-Flash, as I find that model at a smaller quant renders better results than the REAP model of the same size, but the method is the same.

  1. Download the model from "Files and versions" using something like wget. I saved it to a folder called "customModelfiles", but it can be saved anywhere Ollama can access.
    Example: wget https://huggingface.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF/resolve/main/GLM-4.7-Flash-REAP-23B-A3B-UD-Q4_K_XL.gguf?download=true

Then rename the file to something more usable, as it will have ?download=true at the end.

  2. Create a Modelfile using this template:

Model Location

FROM ~/.ollama/customModelfiles/GLM-4.7-Flash-REAP-23B-A3B-UD-Q4_K_XL.gguf

Model Template

TEMPLATE {{ .Prompt }}
RENDERER glm-4.7
PARSER glm-4.7

Model Parameters

PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0
PARAMETER min_p 0.01

  3. In the Ollama CLI, run:

ollama create GLM-4.7-REAP:23b -f modelfile

Hope this helps.
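The three steps above can be scripted end to end. A minimal sketch, assuming the same file locations as the walkthrough (the ollama invocation is only attempted when the binary is actually on PATH):

```python
# Sketch of the walkthrough above: strip the ?download=true suffix,
# render the Modelfile, then run `ollama create`. Paths and the model
# tag follow the post and are assumptions, not requirements.
import shutil
import subprocess
from pathlib import Path

MODELFILE_TEMPLATE = """\
FROM {gguf_path}
TEMPLATE {{{{ .Prompt }}}}
RENDERER glm-4.7
PARSER glm-4.7
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.0
PARAMETER min_p 0.01
"""


def clean_name(downloaded: str) -> str:
    """Drop a trailing '?download=true' left over from the download URL."""
    return downloaded.split("?", 1)[0]


def render_modelfile(gguf_path: str) -> str:
    """Fill in the Modelfile template from the thread for a given .gguf."""
    return MODELFILE_TEMPLATE.format(gguf_path=gguf_path)


def create_model(tag: str, modelfile: Path) -> None:
    """Register the model with Ollama, e.g. tag 'GLM-4.7-REAP:23b'."""
    if shutil.which("ollama") is None:
        raise RuntimeError("ollama not found on PATH")
    subprocess.run(["ollama", "create", tag, "-f", str(modelfile)], check=True)
```

This just automates the manual steps; it does nothing the walkthrough doesn't already describe.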

You can delete the ?download=true from your wget invocation and save a step

