lvj committed on Dec 16, 2025

Commit 106ffe2 (verified) · 1 Parent(s): dab5c66

Upload README.md with huggingface_hub

Files changed (1)

README.md +18 -11

@@ -17,6 +17,14 @@ base_model:
 pipeline_tag: text-generation
 ---
 # Quantization Recipe
 Install `uv` by following https://docs.astral.sh/uv/getting-started/installation
@@ -30,9 +38,7 @@ uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126
 ## QAT Finetuning with PARQ
-We apply QAT with a torchao optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq).
-The following script finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT. Set `dataset_name` to your desired dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets).
 ```bash
 source ~/.uv-hf/bin/activate
@@ -40,6 +46,8 @@ source ~/.uv-hf/bin/activate
 SEED=$RANDOM
 SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}
 ngpu=8
 device_batch_size=4
 grad_accum_steps=2
@@ -60,9 +68,8 @@ TOKENIZERS_PARALLELISM=$(( ngpu == 1 ))  \
   --dataset_name $dataset_name \
   --dataloader_num_workers 4 \
   --max_length 4096 \
-  --save_total_limit 1 \
   --report_to tensorboard \
-  --logging_steps 2 \
   --learning_rate $lr \
   --lr_scheduler_type linear \
   --warmup_ratio 0.0 \
@@ -74,17 +81,15 @@ TOKENIZERS_PARALLELISM=$(( ngpu == 1 ))  \
   --embed_pat '(lm_head|embed_tokens)'
 ```
 ## Generation from Quantized Model
 ```py
 import os
 from huggingface_hub import whoami, get_token
-from transformers import (
-  AutoModelForCausalLM,
-  AutoTokenizer,
-  set_seed,
-)
 set_seed(0)
 model_path = f"{SAVE_DIR}"
@@ -113,6 +118,8 @@ print(output_text)
 # Model Quality
 Evaluation command for the table below:
 ```bash
 lm_eval \
@@ -123,7 +130,7 @@ lm_eval \
   --batch_size auto \
   --trust_remote_code
 ```
-Note that exact numbers may vary slightly based on your machine's chosen batch size.
 | | [bf16](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
 | --- | :---: | :---: | :---: |

 pipeline_tag: text-generation
 ---
+Phi-4-mini-instruct is quantized by the PyTorch team using a torchao algorithm called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The model has 2-bit linear-layer weights, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
+We provide the quantized `.pte` file for direct use in ExecuTorch. (The provided `.pte` file is exported with a `max_context_length` of 1024. To change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
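To make the bit widths above concrete, here is a minimal sketch of low-bit quantization with one scale and offset per row. It is a generic uniform affine quantizer written in plain Python for illustration. It is not PARQ's algorithm or torchao's API, and the function names `quantize_per_row` and `dequantize_per_row` are hypothetical.

```python
def quantize_per_row(matrix, bits):
    """Uniform affine quantization with one (scale, offset) pair per row.

    Illustrative only: this is not PARQ's quantizer.
    """
    levels = 2 ** bits - 1  # highest integer code, e.g. 3 for 2 bits
    codes, params = [], []
    for row in matrix:
        low, high = min(row), max(row)
        # Guard against a constant row, which would give a zero step size.
        scale = (high - low) / levels if high > low else 1.0
        codes.append([round((x - low) / scale) for x in row])
        params.append((scale, low))
    return codes, params


def dequantize_per_row(codes, params):
    """Map integer codes back to floats using each row's (scale, offset)."""
    return [[q * scale + low for q in row]
            for row, (scale, low) in zip(codes, params)]


# Toy 2x4 "weight matrix"; the two rows have very different ranges.
weights = [[-1.0, -0.2, 0.3, 1.0], [0.0, 0.4, 1.6, 3.0]]
codes, params = quantize_per_row(weights, bits=2)
restored = dequantize_per_row(codes, params)
print(codes)
print(restored)
```

With `bits=2` every code falls in the range 0 to 3, and because each row carries its own scale, the narrow-range row is not forced onto the coarser grid of the wide-range row. Passing `bits=4` gives the 16-level grid that would correspond to the embedding setting.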
+# Running in a Mobile App
+The `.pte` file can be run with ExecuTorch on a mobile phone. See the [instructions](https://docs.pytorch.org/executorch/0.7/llm/llama-demo-ios.html) for doing this on iOS. On iPhone 15 Pro, the model runs at 27 tokens/sec and uses 1453 MB of memory.
 # Quantization Recipe
 Install `uv` by following https://docs.astral.sh/uv/getting-started/installation
 ## QAT Finetuning with PARQ
+We apply quantization-aware training (QAT) with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The following script finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT, both at per-row granularity. Set `dataset_name` to your desired dataset from the [Hugging Face datasets hub](https://huggingface.co/datasets) and `max_steps` to the number of training steps to run.
 ```bash
 source ~/.uv-hf/bin/activate
 SEED=$RANDOM
 SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}
+dataset_name=<TODO>
+max_steps=<TODO>
 ngpu=8
 device_batch_size=4
 grad_accum_steps=2
   --dataset_name $dataset_name \
   --dataloader_num_workers 4 \
   --max_length 4096 \
+  --max_steps $max_steps \
   --report_to tensorboard \
   --learning_rate $lr \
   --lr_scheduler_type linear \
   --warmup_ratio 0.0 \
   --embed_pat '(lm_head|embed_tokens)'
 ```
+To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`, where `{SAVE_STEP}` is the step number of the checkpoint to export. The exported model is saved to `${SAVE_DIR}/quant_converted`.
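If you do not remember which step was saved, a small helper can locate a checkpoint directory to pass to `--resume_from_checkpoint`. The sketch below assumes only the `checkpoint-<step>` directory naming shown above. The helper name `latest_checkpoint` is hypothetical, and picking the highest step is an assumption: you may want a specific earlier step instead.

```python
import re
import tempfile
from pathlib import Path


def latest_checkpoint(save_dir):
    """Return the `checkpoint-<step>` subdirectory with the largest step, or None."""
    pattern = re.compile(r"checkpoint-(\d+)")
    best_step, best_path = -1, None
    for child in Path(save_dir).iterdir():
        match = pattern.fullmatch(child.name)
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Demo on a throwaway directory that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("checkpoint-50", "checkpoint-200", "quant_converted"):
        (Path(tmp) / name).mkdir()
    found = latest_checkpoint(tmp)
    print(f"--resume_from_checkpoint {found}")
```

Steps are compared as integers, so `checkpoint-200` is ranked above `checkpoint-50` even though it sorts lower as a string.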
 ## Generation from Quantized Model
 ```py
 import os
 from huggingface_hub import whoami, get_token
+from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
 set_seed(0)
 model_path = f"{SAVE_DIR}"
 # Model Quality
+We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
 Evaluation command for the table below:
 ```bash
 lm_eval \
   --batch_size auto \
   --trust_remote_code
 ```
+Note: exact numbers may vary slightly depending on the batch size that `--batch_size auto` selects on your machine.
 | | [bf16](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
 | --- | :---: | :---: | :---: |