Distil-Qwen3-1.7B-Customer-Support-Deferral
A fine-tuned Qwen3-1.7B model for multi-turn airline customer support that runs as
the small tier of a two-model cascade. It handles most support turns itself and
defers genuinely-hard turns to a larger model by emitting a defer_to_larger_model
tool call. Trained with knowledge distillation from a large teacher model (zai.glm-5).
Every assistant action is a single tool call, including talking to the customer via
respond_to_user, so the model can be driven by a thin, deterministic orchestrator.
Results
Held-out airline test set. The tuned 1.7B model beats its roughly 40x larger GLM-5 teacher on llm-as-a-judge (0.722 vs 0.697) and staged tool calling (0.707 vs 0.667), and lifts every metric well above the base model.
| Model | llm-as-a-judge | llm-judge (ref-free) | staged_tool_call | ROUGE | tool_call_equiv |
|---|---|---|---|---|---|
| GLM-5 teacher (~40x larger) | 0.697 | - | 0.667 | - | - |
| This model (tuned) | 0.722 | 0.794 | 0.707 | 0.616 | 0.290 |
| Qwen3-1.7B (base) | 0.422 | 0.502 | 0.487 | 0.482 | 0.154 |
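To put "well above the base model" in numbers, the short sketch below computes the absolute and relative gain of the tuned model over the base model for each metric. The scores are copied from the table above; the helper name `gains` is only illustrative.

```python
# Scores copied from the results table (tuned model vs. Qwen3-1.7B base).
tuned = {
    "llm-as-a-judge": 0.722,
    "llm-judge (ref-free)": 0.794,
    "staged_tool_call": 0.707,
    "ROUGE": 0.616,
    "tool_call_equiv": 0.290,
}
base = {
    "llm-as-a-judge": 0.422,
    "llm-judge (ref-free)": 0.502,
    "staged_tool_call": 0.487,
    "ROUGE": 0.482,
    "tool_call_equiv": 0.154,
}


def gains(tuned_scores, base_scores):
    """Return {metric: (absolute_gain, relative_gain)}, rounded to 3 decimals."""
    return {
        metric: (
            round(tuned_scores[metric] - base_scores[metric], 3),
            round(tuned_scores[metric] / base_scores[metric] - 1, 3),
        )
        for metric in tuned_scores
    }


for metric, (abs_gain, rel_gain) in gains(tuned, base).items():
    print(f"{metric:22s} +{abs_gain:.3f} ({rel_gain:+.1%})")
```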
What the model does
Given the airline policy (as the system prompt), the available tools, and the conversation so far, the model produces the next single tool call:
- Talk to the customer: `respond_to_user(message=...)` (terminal, ends the turn).
- Act / look up: `get_reservation_details`, `book_reservation`, `send_certificate`, and so on.
- Reason silently: `think(thought=...)`.
- Escalate to a larger model: `defer_to_larger_model(reason=...)` on turns whose correct action depends on non-obvious policy eligibility, combining several rules, a multi-step calculation, or a genuinely ambiguous judgement call.
- Hand off to a human: `transfer_to_human_agents(summary=...)` for out-of-scope requests or explicit human requests (distinct from deferral, which stays automated).
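Because every action is one tool call, an orchestrator mostly needs to parse that call and decide which of the categories above it falls into. The following is a minimal sketch, assuming the model emits a single JSON object wrapped in `<tool_call>...</tool_call>` as in the Quick Start output; the names `parse_tool_call`, `classify`, and the category labels are hypothetical and not part of the released demo.

```python
import json
import re

# Assumed output format: one JSON object inside <tool_call>...</tool_call>.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

# Hypothetical labels for the special tools; everything else is a regular tool.
SPECIAL_ACTIONS = {
    "respond_to_user": "reply",
    "think": "reason",
    "defer_to_larger_model": "defer",
    "transfer_to_human_agents": "handoff",
}


def parse_tool_call(text):
    """Extract (name, arguments) from raw model output.

    Raises ValueError if no well-formed tool call is found.
    """
    match = TOOL_CALL_RE.search(text)
    if match is None:
        raise ValueError("no <tool_call> block found")
    call = json.loads(match.group(1))  # JSONDecodeError is a ValueError
    if not isinstance(call, dict) or "name" not in call:
        raise ValueError("tool call has no 'name' field")
    return call["name"], call.get("arguments", {})


def classify(name):
    """Map a tool name to an orchestrator action category."""
    return SPECIAL_ACTIONS.get(name, "tool")
```

Keeping the parser strict (raising on anything that is not a single well-formed call) lets the orchestrator retry or escalate instead of guessing at malformed output.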
Deferral vs. human transfer
defer_to_larger_model is a capability escalation: a larger, more capable model takes
over the same conversation with the same tools and policy, and the customer keeps being
served automatically. transfer_to_human_agents is for requests outside the tools' scope
or when the user asks for a person. Judging when to defer, by the absolute structure of
the problem rather than the model's own confidence, is the core skill this model is
distilled for.
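The difference between the two escalation paths can be illustrated with a small turn loop. This is a sketch under stated assumptions, not the demo app's implementation: `small_model` and `large_model` are hypothetical callables that map the conversation history to a `(name, arguments)` tool call, `execute_tool` is a hypothetical function that runs a regular tool, and the history message layout is invented for illustration.

```python
def run_turn(history, small_model, large_model, execute_tool, max_steps=10):
    """Drive one customer turn through the two-model cascade.

    Returns ("reply", message) or ("handoff", summary).
    """
    model = small_model
    for _ in range(max_steps):
        name, args = model(history)

        if name == "defer_to_larger_model":
            # Capability escalation: the larger model continues the same
            # conversation with the same tools and policy. Assumption: the
            # deferral call itself is not recorded in the history.
            model = large_model
            continue

        history.append(
            {"role": "assistant", "tool_call": {"name": name, "arguments": args}}
        )

        if name == "respond_to_user":
            return "reply", args["message"]  # terminal, ends the turn
        if name == "transfer_to_human_agents":
            return "handoff", args.get("summary", "")  # leaves automation

        # think has no side effects; every other tool is executed.
        result = "" if name == "think" else execute_tool(name, args)
        history.append({"role": "tool", "name": name, "content": result})

    raise RuntimeError("step budget exhausted without a terminal tool call")
```

In this sketch a deferral only swaps which model produces the next call, while a human transfer ends the automated loop, which mirrors the distinction drawn above.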
Quick Start
Using Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "distil-labs/distil-qwen3-1.7b-customer-support-deferral"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The full airline policy (system prompt) and the 16 tool schemas ship with the demo
# app as `job_description.json`. Wrap the policy in the distil tool-calling preamble:
TASK_DESCRIPTION = "# Airline Agent Policy\n... (see job_description.json) ..."
SYSTEM = (
"You are a tool-calling model working on:\n"
f"<task_description>{TASK_DESCRIPTION}</task_description>\n\n"
"Respond to the conversation history by generating an appropriate tool call that "
"satisfies the user request. Generate only the tool call according to the provided "
"tool schema, do not generate anything else. Always respond with a tool call."
)
TOOLS = [ ... ] # 16 tools from job_description.json
messages = [
{"role": "system", "content": SYSTEM},
{"role": "user", "content": "Can I get a refund for reservation 8JX2WO?"},
]
text = tokenizer.apply_chat_template(
messages, tools=TOOLS, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(text, return_tensors="pt")
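# Greedy decoding: use do_sample=False, since temperature=0 is rejected when sampling is enabled.
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)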
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
# <tool_call>
# {"name": "defer_to_larger_model", "arguments": {"reason": "refund eligibility depends on fare class + travel insurance"}}
# </tool_call>
Using the Demo App
This model powers the Flexible Customer Support Bot demo, a terminal cascade where a local SLM handles most airline-support turns and defers hard turns to a larger, OpenAI-compatible model.
Using llama.cpp
For local serving, use the GGUF build at distil-labs/distil-qwen3-1.7b-customer-support-deferral-gguf:
llama-server --model distil-qwen3-1.7b-customer-support-deferral.gguf --port 8000 --jinja
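Once the server is running, it can be queried over its OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library and assumes the server listens on port 8000 as in the command above, that `tools` is supplied in OpenAI function-calling format (the layout of `job_description.json` may differ and might need converting), and that the response carries the call under `tool_calls`. The function names and the `model` label are illustrative.

```python
import json
import urllib.request

# Port taken from the llama-server command; the path is the OpenAI-compatible route.
BASE_URL = "http://localhost:8000/v1/chat/completions"


def build_request(system_prompt, user_message, tools):
    """Build a chat-completions request carrying the policy and tool schemas."""
    payload = {
        "model": "distil-qwen3-1.7b-customer-support-deferral",  # label only (assumption)
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
        "tools": tools,  # assumed to be OpenAI-style function schemas
        "temperature": 0,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def next_tool_call(request):
    """Send the request and return (name, arguments) of the first tool call."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    call = body["choices"][0]["message"]["tool_calls"][0]["function"]
    return call["name"], json.loads(call["arguments"])
```

A call such as `next_tool_call(build_request(SYSTEM, "Can I get a refund for reservation 8JX2WO?", TOOLS))` would then return the next action, provided the server is up and `SYSTEM` and `TOOLS` are filled in from the demo app's files.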
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-1.7B |
| Parameters | 1.7 billion |
| Architecture | Qwen3ForCausalLM |
| Context Length | 40,960 tokens |
| Precision | bfloat16 (merged) |
| Teacher Model | GLM-5 (zai.glm-5) |
| Task | Multi-turn tool calling (closed book) with model deferral |
Training
The model is distilled with the Distil Labs platform:
- Traces: airline customer-support conversations (tau-bench airline tool set), processed and cleaned through the distil trace-processing pipeline.
- Deferral signal: a `defer_to_larger_model` tool and policy guidance, so the teacher marks genuinely-hard turns for escalation while the student learns the rest.
- Synthetic expansion + fine-tuning: distilled onto Qwen3-1.7B with GLM-5 as teacher.
Supported Functions (16 tools)
| Function | Description |
|---|---|
| `book_reservation` | Book a new flight reservation |
| `cancel_reservation` | Cancel an existing reservation |
| `get_reservation_details` | Look up a reservation |
| `get_user_details` | Look up a user / profile |
| `list_all_airports` | List supported airports |
| `search_direct_flight` | Search direct flights |
| `search_onestop_flight` | Search one-stop flights |
| `update_reservation_flights` | Change flights on a reservation |
| `update_reservation_baggages` | Update baggage on a reservation |
| `update_reservation_passengers` | Update passengers on a reservation |
| `send_certificate` | Issue a travel certificate / compensation |
| `calculate` | Perform an arithmetic calculation |
| `think` | Private step-by-step reasoning (no side effects) |
| `respond_to_user` | Send a natural-language message to the customer (ends the turn) |
| `transfer_to_human_agents` | Hand off to a human agent (out-of-scope / explicit request) |
| `defer_to_larger_model` | Escalate this turn to a larger model (capability escalation) |
Use Cases
- Cost-efficient customer-support assistants: a small local model handles the bulk of traffic, a larger model is invoked only on the hard minority of turns.
- Any multi-turn tool-calling task with a bounded tool catalog and a difficulty signal worth routing on.
Limitations
- English airline customer-support only, not a general-purpose tool caller.
- Deferral calibration depends on the policy and tool catalog it was trained with.
License
Released under the Apache 2.0 license. See STUDENT_LICENSE (base model) and
TEACHER_LICENSE (teacher model) for upstream terms.
Citation
@misc{distil-qwen3-1.7b-customer-support-deferral,
author = {Distil Labs},
title = {Distil-Qwen3-1.7B-Customer-Support-Deferral: A Fine-tuned SLM for Airline Support with Model Deferral},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/distil-labs/distil-qwen3-1.7b-customer-support-deferral}
}