Thought Loop

#6
by jukingjack1 - opened

https://github.com/XiaomiMiMo/MiMo-V2-Flash/issues/17

I think we are experiencing the same issues seen in the previous model where it just keeps thinking over and over.

Has there been any progress made towards this?

If you tell the model to not over think, then it should get a result much quicker

If you tell the model to not over think, then it should get a result much quicker

image

Even with thinking disabled it enters a loop and exhausts the context. I told it not to overthink and it still did this.

If you tell the model to not over think, then it should get a result much quicker

image

Even with thinking disabled it enters a loop and exhausts the context. I told it not to overthink and it still did this.

Increase the repetition penalty to 1.2 will mitigate this. here is my vLLM command:

  --data-parallel-size 2 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.96 \
  --max-model-len auto \
  --reasoning-parser mimo \
  --tool-call-parser mimo \
  --served-model-name MiMo-V2.5 \
  --generation-config "model_hub/MiMo-V2.5" \
  --override-generation-config '{"repetition_penalty":1.2, "top_p":0.95, "temperature":0.6}' \
  --disable-hybrid-kv-cache-manager \
  --max-num-seqs 8 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

This works in my 8*A800 80GB gpus. But this model is still hard to use, the overthink problem is blocking the model functionality. Seems the XiaoMi did not try to deploy the model with vLLM/SGLANG at all.

Besides, I have used the xiaomi's mimo API, the MiMo-V2.5/V2.5 Pro behaves normally, and the think process can be parsed by claude and opencode. So, my felling is that Xiaomi's official internal version differs from the open-source version—at least in terms of the generation_config or system_prompt.

https://aistudio.xiaomimimo.com/#/share/d9add9d5c1f37461347ab73c52f1c0da

I have just tested the model on the web-ui and this is the chat, the thought looping seems to be an issue with the actual model?

Is this a known issue internally?

https://aistudio.xiaomimimo.com/#/share/d9add9d5c1f37461347ab73c52f1c0da

I have just tested the model on the web-ui and this is the chat, the thought looping seems to be an issue with the actual model?

Is this a known issue internally?

Yes, I hightly suspect this is an internal issue, even with the official API, the model sometimes outputs repeated chain of thought. I have sent an Email to the XiaoMi and I have not got an response, maybe they are fixing this issue. Both the MiMo-V2.5 and its pro version have the same problem. But I mitigate this by setting repetition penalty to 1.2 in local deployment.
1

2

Just chiming in: I am also experiencing a lot of overthinking. Even the model itself says: "I think I am overthinking this".

It will actually generate the solution correctly during the thought process, but then say, "Wait, let me consider X" and then start over again.

It can generate responses without overthinking, so it is a bit of a crapshoot if it will overthink or not.

I'll try increasing the repetition penalty to 1.2 and see what happens.

after spending many hours with a lot of people, we think we have solved the looping, feel free to try and and let me know if it still loops

  --gpus '"device=0,1,2,3"' \
  --ipc=host --network host --shm-size=32g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -e SAFETENSORS_FAST_GPU=1 \
  -e CUTE_DSL_ARCH="sm_120a" \
  -e B12X_ENABLE_DYNAMIC_DOWN_SCALE=1 \
  -e SGLANG_PREVENT_THOUGHT_LOOPS=1 \
  -e B12X_MOE_FORCE_A16=0 \
  -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
  -e NCCL_P2P_LEVEL=SYS \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_ALLOC_P2P_NET_LL_BUFFERS=1 \
  -e NCCL_MIN_NCHANNELS=8 \
  -e NCCL_CUMEM_HOST_ENABLE=0 \
  -e NCCL_NET_GDR_LEVEL=SYS \
  -v /models/MiMo-M2.5-NVFP4:/models/MiMo-M2.5-NVFP4:ro \
  -v /models/.cache/huggingface:/root/.cache/huggingface \
  -v /models/.vllm_cache/triton:/root/.triton \
  -v /models/.vllm_cache/sglang-generated:/root/.cache/sglang-generated \
  lukealonso/sglang-cuda13-b12x:noloop \
  python -m sglang.launch_server \
    --model-path /models/MiMo-M2.5-NVFP4 \
    --served-model-name mimo-v2.5 \
    --tp-size 4 \
    --page-size 64 \
    --host 0.0.0.0 \
    --port 8001 \
    --kv-cache-dtype fp8_e4m3 \
    --quantization modelopt_fp4 \
    --mem-fraction-static 0.85 \
    --swa-full-tokens-ratio 0.3 \
    --chunked-prefill-size 8192 \
    --enable-multi-layer-eagle \
    --reasoning-parser mimo \
    --tool-call-parser mimo \
    --max-running-requests 8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --moe-runner-backend b12x \
    --attention-backend b12x \
    --mm-attention-backend b12x \
    --fp4-gemm-backend b12x \
    --fp8-gemm-backend flashinfer_cutlass \
    --cuda-graph-max-bs 8 \
    --sleep-on-idle \
    --enable-metrics


genration_config.json

{
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151643,
    151645,
    151672
  ],
  "temperature": 1.0,
  "top_p": 0.95,
  "max_new_tokens": 2048,
  "transformers_version": "4.37.0",
  "repetition_penalty": 1.05
}

Thanks for all the testing! I can't seem to find anything where SGLANG_PREVENT_THOUGHT_LOOPS=1 exists though, how did you determine these parameters?

Sign up or log in to comment