Is it possible to release a version with low-bit quantization?
#11
by lan0004 - opened
It works really well with OpenClaw, so a low-bit quantized version would be great for those who want to run it locally.
Are you asking for a 2-bit or even a 1.58-bit version?
For example, to run it on a machine with an AI MAX 395, the total footprint of the INT4 model weights plus context space and system overhead currently exceeds 128 GB.
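As a rough sketch of where that kind of number comes from (all figures below are placeholders, not the model's published specs), INT4 stores about half a byte per weight:

```bash
# Back-of-envelope memory estimate. All numbers here are hypothetical
# placeholders, not Step-3.5-Flash's real figures.
PARAMS_B=240                      # hypothetical parameter count, in billions
WEIGHTS_GB=$(( PARAMS_B / 2 ))    # INT4: 4 bits = 0.5 bytes per weight
KV_GB=8                           # hypothetical KV cache at long context
OVERHEAD_GB=4                     # hypothetical runtime/system overhead
echo "~$(( WEIGHTS_GB + KV_GB + OVERHEAD_GB )) GB total"   # prints ~132 GB
```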
Sorry for the late reply. On the AI MAX 395, I can run the int4 model with up to around a 64K context. Setup instructions: https://github.com/stepfun-ai/Step-3.5-Flash/blob/main/llama.cpp/docs/step3.5-flash.md
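For anyone wanting to reproduce this, a minimal sketch of the kind of command involved, assuming you have built llama.cpp per the doc above and have a GGUF export of the INT4 weights (the filename below is a placeholder, not an official release artifact):

```bash
# Hypothetical invocation; the GGUF filename is a placeholder, and the exact
# build steps and flags are in the step3.5-flash.md doc linked above.
# -c 65536 requests a ~64K context; -ngl 99 offloads all layers to the GPU.
./llama-cli -m step3.5-flash-int4.gguf -c 65536 -ngl 99 -p "Hello"
```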