Qwen3-4B-Thinking-2507 Text-to-SQL Agent FT GGUF

This repository contains GGUF exports of the fine-tuned Qwen3-4B-Thinking-2507 Text-to-SQL agent model. It is intended for local inference with llama.cpp, LM Studio, or another OpenAI-compatible local server.

Code and reproducibility repository:

https://github.com/Boakpe/distilled-slms-for-text-to-sql-pt-br

Related collection:

https://huggingface.co/collections/Boakpe/distilled-slms-for-text-to-sql-pt-br

Recommended File

Use the Q8_0 GGUF for most local runs. It is the practical default: at about 4.28 GB it is roughly half the size of the BF16 export (8.05 GB) while preserving strong behavior on this task.

Available variants on the model page:

Quantization    Approx. size    Suggested use
Q8_0            4.28 GB         Recommended local default
BF16            8.05 GB         Higher precision, more memory

Run with llama.cpp

Install or build llama.cpp:

https://github.com/ggml-org/llama.cpp

Download the Q8_0 model:

uvx --from huggingface-hub hf download \
  Boakpe/Qwen3-4B-Thinking-2507-Text-to-SQL-Agent-FT-GGUF \
  qwen-3-4b-thinking-2004-checkpoint-1500-merged-Q8_0.gguf \
  --local-dir models
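
An interrupted or redirected download can leave a file that is not a valid model. As a quick sanity check, the sketch below tests whether a file starts with the 4-byte ASCII magic "GGUF" that opens every GGUF file. The function name is_gguf is hypothetical and not part of the repository; the check only inspects the header and does not validate the tensors.

```shell
# Return 0 if the given file begins with the ASCII magic "GGUF", else 1.
# is_gguf is a hypothetical helper name used only for this sketch.
is_gguf() {
  [ -f "$1" ] || return 1
  # Read the first 4 bytes; drop NUL bytes so the shell does not warn on binary input.
  magic="$(head -c 4 "$1" | tr -d '\0')"
  [ "$magic" = "GGUF" ]
}

# Example use after the download step:
#   is_gguf models/qwen-3-4b-thinking-2004-checkpoint-1500-merged-Q8_0.gguf \
#     || echo "not a GGUF file, try downloading again" >&2
```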

Start an OpenAI-compatible server:

SERVER_BIN="${SERVER_BIN:-llama.cpp/build/bin/llama-server}"
MODEL_PATH="${MODEL_PATH:-models/qwen-3-4b-thinking-2004-checkpoint-1500-merged-Q8_0.gguf}"
CTX_SIZE="${CTX_SIZE:-37000}"
N_GPU_LAYERS="${N_GPU_LAYERS:-999}"
N_PARALLEL="${N_PARALLEL:-1}"
THREADS="$(nproc --all 2>/dev/null || sysctl -n hw.logicalcpu 2>/dev/null || echo 8)"

"$SERVER_BIN" \
  --model "$MODEL_PATH" \
  --ctx-size "$CTX_SIZE" \
  --n-gpu-layers "$N_GPU_LAYERS" \
  --threads "$THREADS" \
  --threads-batch "$THREADS" \
  --parallel "$N_PARALLEL" \
  --flash-attn on \
  --mlock \
  --no-mmap \
  --cont-batching \
  --batch-size 512 \
  --ubatch-size 512 \
  --host 127.0.0.1 \
  --port 8080 \
  --alias text2sql \
  --temp 0.6 \
  --top-p 0.95 \
  --top-k 20
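
Loading the model with --no-mmap and --mlock can take a while, so it may help to wait until the server reports ready before sending requests. The following is a minimal sketch, assuming a llama.cpp build whose server answers GET /health with a non-error status only once the model is loaded. The function name wait_for_server and the 120-second default timeout are assumptions for illustration.

```shell
# Poll the health endpoint once per second until it succeeds or the timeout expires.
# wait_for_server and the 120 s default are assumptions made for this sketch.
wait_for_server() {
  base_url="${1:-http://127.0.0.1:8080}"
  timeout_s="${2:-120}"
  waited=0
  while [ "$waited" -lt "$timeout_s" ]; do
    # -f makes curl fail on HTTP errors such as 503 while the model is still loading.
    if curl -fsS -o /dev/null "$base_url/health" 2>/dev/null; then
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 1
}

# Example use from a second terminal:
#   wait_for_server http://127.0.0.1:8080 300 && echo "server ready"
```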

Then run the agent from the GitHub repository with a model entry in agent/config/models.yaml pointing to the server:

provider: openai
model_name: text2sql
api_key: lmstudio
base_url: http://localhost:8080/v1
tool_choice: auto
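
The model_name in this entry has to match the --alias passed to the server. One way to confirm that, sketched under the assumption that the server exposes the OpenAI-style GET /v1/models listing, is to look for the alias in the response. The function name model_is_served is hypothetical, and the match is a plain text search rather than real JSON parsing.

```shell
# Succeed if the server's model listing mentions the given alias in quotes.
# model_is_served is a hypothetical helper; it greps the raw JSON text.
model_is_served() {
  base_url="${1:-http://localhost:8080/v1}"
  alias_name="${2:-text2sql}"
  curl -fsS "$base_url/models" | grep -q "\"$alias_name\""
}

# Example use:
#   model_is_served http://localhost:8080/v1 text2sql \
#     || echo "alias not found, check --alias and model_name" >&2
```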

Easier Setup with LM Studio

LM Studio can also serve GGUF models through a local OpenAI-compatible API:

https://lmstudio.ai/

Load the Q8_0 GGUF, start the local server, and set the model name and port in agent/config/models.yaml.

Results

These are the same model weights as the safetensors model, exported to GGUF.

Primary environmental-registry benchmark (all values in %):

Model                          Overall  Strict SQL  Relaxed SQL  Non-SQL  Clarification  Unanswerable
Qwen3-4B-Thinking-2507 base    56.1     28.9        36.7         75.6     71.1           80.0
Qwen3-4B-Thinking FT           78.9     34.4        70.0         87.8     86.7           88.9

Pass@5 for the fine-tuned model reached 91.7% overall, 87.8% relaxed SQL, and 95.6% non-SQL.

On rede_saude_publica, the fine-tuned model reached 75.0% overall, 72.0% SQL, and 78.0% non-SQL.

Notes

  • Use a recent llama.cpp build. Tool-calling and chat-template handling matter for this agent.
  • If you run CPU-only, set N_GPU_LAYERS=0.
  • Increase --parallel only if you need concurrent requests and have enough memory for the KV cache.
  • The model is intended for the released agent protocol, not standalone production database access.

License

Apache 2.0.
