
---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---

Phi-4-mini-instruct is quantized by the PyTorch team using [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq), an algorithm in torchao ([paper](https://openreview.net/pdf?id=8PCxOlwbIn)). The model uses 2-bit weights in its linear layers, 4-bit embeddings, and 8-bit dynamic activation quantization. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
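
To make the bit widths concrete, here is a minimal sketch of per-row uniform quantization in plain Python. It is not PARQ's actual quantizer or training procedure (see the linked repository for that); the min-max grid and the helper names `quantize_row` and `quantize_per_row` are assumptions chosen for illustration. It only shows that a 2-bit row can take at most 4 distinct values and a 4-bit row at most 16.

```python
def quantize_row(row, bits):
    """Snap one row of floats onto 2**bits evenly spaced levels (min-max grid)."""
    levels = 2 ** bits
    lo, hi = min(row), max(row)
    # Step between adjacent levels; fall back to 1.0 for a constant row.
    scale = (hi - lo) / (levels - 1) or 1.0
    codes = [round((x - lo) / scale) for x in row]  # integers in [0, levels - 1]
    dequantized = [lo + c * scale for c in codes]   # values used at inference
    return codes, dequantized


def quantize_per_row(matrix, bits):
    """Quantize every row independently, so each row gets its own grid."""
    return [quantize_row(row, bits) for row in matrix]


# Illustrative weights only, not taken from the model.
weights = [
    [-1.0, -0.2, 0.3, 1.0],
    [0.05, 0.10, -0.40, 0.25],
]
for codes, dequantized in quantize_per_row(weights, bits=2):
    print(codes, [round(v, 3) for v in dequantized])
```

Calling `quantize_per_row(weights, bits=4)` instead gives the finer 16-level grid that corresponds to the embedding bit width.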

We provide the [quantized `.pte` file](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) for direct use in ExecuTorch. The provided file is exported with a `max_context_length` of 1024; to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).

# Running in a Mobile App

The [`.pte` file](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) for running it on iOS. On an iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.

# Quantization Recipe

Install `uv` by following the [installation guide](https://docs.astral.sh/uv/getting-started/installation), then set up the environment:

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```

## QAT Finetuning with PARQ

We apply QAT with [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq), an optimizer-only package. The script below finetunes Phi-4-mini-instruct using QAT with 2-bit weight quantization and 4-bit embedding quantization, both at per-row granularity. Before running it:

1. Download the training script: `curl -L -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py`
2. Set `dataset_name` to a dataset from the [Hugging Face datasets hub](https://huggingface.co/datasets), and set `max_steps` to the number of training steps.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5
TOKENIZERS_PARALLELISM=$(( ngpu == 1 ))  \
  PYTORCH_ALLOC_CONF=expandable_segments:True \
  torchrun \
  --nproc-per-node $ngpu \
  --rdzv-id $SEED \
  --rdzv-backend c10d \
  --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
  -m qat_sft \
  --model_name_or_path microsoft/Phi-4-mini-instruct \
  --bf16 true \
  --num_train_epochs 1 \
  --per_device_train_batch_size $device_batch_size \
  --gradient_accumulation_steps $grad_accum_steps \
  --dataset_name $dataset_name \
  --dataloader_num_workers 4 \
  --max_length 4096 \
  --max_steps $max_steps \
  --report_to tensorboard \
  --learning_rate $lr \
  --lr_scheduler_type linear \
  --warmup_ratio 0.0 \
  --seed $SEED \
  --output_dir $SAVE_DIR \
  --weight_bits 2 \
  --linear_pat 'proj\.weight$' \
  --embed_bits 4 \
  --embed_pat '(lm_head|embed_tokens)'
```

To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.
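
Two settings in the training command can be sanity-checked offline. The sketch below assumes that `qat_sft.py` matches `--linear_pat` and `--embed_pat` against parameter names with `re.search` (an assumption; check the script itself), and the parameter names listed are illustrative examples rather than a dump from the checkpoint. It also computes the number of sequences per optimizer step implied by `ngpu`, `device_batch_size`, and `grad_accum_steps`.

```python
import re

LINEAR_PAT = r"proj\.weight$"          # value passed to --linear_pat
EMBED_PAT = r"(lm_head|embed_tokens)"  # value passed to --embed_pat

# Hypothetical parameter names, for illustration only.
param_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.qkv_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
    "model.layers.0.input_layernorm.weight",
]


def bits_for(name, weight_bits=2, embed_bits=4):
    """Return the bit width a pattern would assign, or None if nothing matches."""
    if re.search(EMBED_PAT, name):
        return embed_bits
    if re.search(LINEAR_PAT, name):
        return weight_bits
    return None


def sequences_per_step(ngpu, device_batch_size, grad_accum_steps):
    """Sequences contributing to one optimizer step under data parallelism."""
    return ngpu * device_batch_size * grad_accum_steps


for name in param_names:
    print(f"{name}: {bits_for(name)}")
print(sequences_per_step(ngpu=8, device_batch_size=4, grad_accum_steps=2))  # 64
```

If you change `ngpu`, adjusting `grad_accum_steps` so that the product stays the same keeps the number of sequences per optimizer step unchanged.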

## Generation from Quantized Model

```py
import os

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)

# Path to the quantized model directory, e.g. the SAVE_DIR used during finetuning.
# Export SAVE_DIR in your shell before running this script.
model_path = os.environ["SAVE_DIR"]
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

# Extra keyword arguments for generate() (e.g. sampling settings); empty uses the defaults.
generate_kwargs = {}

# Generate, then keep only the newly generated tokens (drop the prompt).
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256, **generate_kwargs)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
```

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command for the table below:
```bash
lm_eval \
  --model hf \
  --model_args pretrained=$SAVE_DIR,dtype=auto \
  --tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
  --output_path ${SAVE_DIR}/eval_results.json \
  --batch_size auto \
  --trust_remote_code
```
Note: exact numbers may vary slightly depending on the batch size that `--batch_size auto` selects on your machine.

| Task | [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |
| arc_easy | 80.30 | 74.28 | 68.98 |
| arc_challenge | 58.45 | 52.65 | 43.17 |
| boolq | 83.46 | 69.11 | 71.50 |
| hellaswag | 72.76 | 68.97 | 62.10 |
| mathqa | 41.27 | 38.12 | 32.76 |
| openbookqa | 41.80 | 39.80 | 38.40 |
| piqa | 78.29 | 76.22 | 73.83 |
| social_iqa | 49.64 | 45.55 | 46.93 |
| winogrande | 71.51 | 68.67 | 64.48 |
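
As a quick way to summarize the table, the sketch below computes the unweighted mean score per column and each model's gap to the unquantized baseline. The scores are copied from the table; the unweighted mean is a summary chosen here for illustration, not a metric reported by lm-evaluation-harness.

```python
# Scores copied from the table above: (baseline, 4-bit PTQ, 2-bit QAT).
scores = {
    "arc_easy": (80.30, 74.28, 68.98),
    "arc_challenge": (58.45, 52.65, 43.17),
    "boolq": (83.46, 69.11, 71.50),
    "hellaswag": (72.76, 68.97, 62.10),
    "mathqa": (41.27, 38.12, 32.76),
    "openbookqa": (41.80, 39.80, 38.40),
    "piqa": (78.29, 76.22, 73.83),
    "social_iqa": (49.64, 45.55, 46.93),
    "winogrande": (71.51, 68.67, 64.48),
}
columns = ("Phi-4-mini-instruct", "4-bit PTQ", "2-bit QAT")


def column_means(table):
    """Unweighted mean of each column across all tasks."""
    n = len(table)
    return [sum(row[i] for row in table.values()) / n for i in range(len(columns))]


means = column_means(scores)
for name, mean in zip(columns, means):
    print(f"{name}: mean {mean:.2f}, gap to baseline {means[0] - mean:.2f}")

# Tasks where the 2-bit QAT model scores higher than the 4-bit PTQ model.
print([task for task, (_, ptq, qat) in scores.items() if qat > ptq])
```

The last line lists `boolq` and `social_iqa`, the two tasks in the table where the 2-bit QAT model scores above the 4-bit PTQ model.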

# Exporting to ExecuTorch

⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the 2-bit quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch), the PyTorch solution for mobile deployment.

To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:
```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
```

The commands above work on an Arm-based Mac. On Arm-based Linux, set the following environment variables before pip-installing `third-party/ao`: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.

Now we export the model to ExecuTorch, using the TorchAO lowbit kernel backend.
(Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup, or Python will import from the local repo paths instead of the installed package.)

```bash
# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})

# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin

# 3. Download model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json

# 4. Export the model to ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint pytorch_model_converted.bin \
  --params phi_4_mini_config.json \
  --output_name phi4_model_2bit.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  --use-torchao-kernels \
  --max_context_length 1024 \
  --max_seq_length 256 \
  --dtype fp32 \
  --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'

# 5. (Optional) Upload the pte file to Hugging Face
# hf upload ${HF_DIR} phi4_model_2bit.pte
```
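
The two length flags in step 4 bound how the exported model can be used. The sketch below is a rough reading, not ExecuTorch code: it assumes `--max_context_length` caps the total number of prompt plus generated tokens, and `--max_seq_length` caps how many tokens are fed in a single forward pass, so a longer prompt is prefilled in chunks. The helper `plan_prompt` is hypothetical.

```python
def plan_prompt(prompt_tokens, max_seq_length=256, max_context_length=1024):
    """Split a prompt into prefill chunks and report the tokens left to generate."""
    if prompt_tokens >= max_context_length:
        raise ValueError("prompt does not fit in the exported context window")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        step = min(remaining, max_seq_length)  # at most max_seq_length per pass
        chunks.append(step)
        remaining -= step
    return chunks, max_context_length - prompt_tokens


chunks, budget = plan_prompt(600)
print(chunks, budget)
```

Under these assumptions, a prompt longer than the context window requires re-exporting with a larger `--max_context_length`.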

Once you have the `.pte` file, you can run it in our [iOS demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) in a [few easy steps](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple#build-and-run).