How to use from
vLLM
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ZeroWw/Phi-4-mini-reasoning-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ZeroWw/Phi-4-mini-reasoning-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
Use Docker
docker model run hf.co/ZeroWw/Phi-4-mini-reasoning-GGUF:
Quick Links

My own (ZeroWw) quantizations. output and embed tensors quantized to f16. all other tensors quantized to q5_k or q6_k.

Result: both f16.q6 and f16.q5 are smaller than q8_0 standard quantization and they perform as well as the pure f16.

Updated on: Thu May 01, 18:04:42

Downloads last month
51
GGUF
Model size
4B params
Architecture
phi3
Hardware compatibility
Log In to add your hardware

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including ZeroWw/Phi-4-mini-reasoning-GGUF