---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---
Phi-4-mini-instruct is quantized by the PyTorch team using a torchao algorithm called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq) ([paper](https://openreview.net/pdf?id=8PCxOlwbIn)). The model uses 2-bit weights in its linear layers, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
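To make the bit widths concrete, here is a minimal sketch of per-row quantization (the recipe below quantizes at per-row granularity). It is a generic round-to-nearest illustration in plain Python, not PARQ itself: PARQ arrives at the quantized values during QAT finetuning, and the helper names `quantize_row` and `quantize_per_row` are hypothetical.

```py
def quantize_row(row, bits):
    """Uniformly quantize one row to 2**bits levels with its own scale.

    Illustrative only: a generic min-max scheme, not the PARQ algorithm.
    """
    levels = 2 ** bits
    lo, hi = min(row), max(row)
    # Fall back to 1.0 so a constant row does not divide by zero.
    scale = (hi - lo) / (levels - 1) or 1.0
    codes = [round((x - lo) / scale) for x in row]
    dequantized = [lo + c * scale for c in codes]
    return codes, dequantized


def quantize_per_row(matrix, bits):
    """Quantize each row independently (per-row granularity)."""
    return [quantize_row(row, bits) for row in matrix]


# Two rows with very different ranges; each gets its own scale.
weights = [
    [-0.9, -0.2, 0.1, 0.9],
    [0.0, 0.012, 0.019, 0.03],
]
quantized = quantize_per_row(weights, bits=2)
for codes, dequantized in quantized:
    print(codes, [round(v, 3) for v in dequantized])
```

With 2 bits every weight is stored as one of four codes per row, which is why the second row keeps its resolution even though its values are much smaller than the first row's.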
We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) for direct use in ExecuTorch. (The provided pte file is exported with a `max_context_length` of 1024. If you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
# Running in a Mobile App
The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) for doing this on iOS. On an iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.
# Quantization Recipe
Install `uv` by following https://docs.astral.sh/uv/getting-started/installation
```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```
## QAT Finetuning with PARQ
We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The script below finetunes Phi-4-mini-instruct using QAT with 2-bit weight quantization and 4-bit embedding quantization, both at per-row granularity. Do the following before running it:
1. `curl -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py`
2. Set `dataset_name` to a dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets), and set `max_steps` to the number of training steps you want.
```bash
source ~/.uv-hf/bin/activate
SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}
dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5
TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
PYTORCH_ALLOC_CONF=expandable_segments:True \
torchrun \
--nproc-per-node $ngpu \
--rdzv-id $SEED \
--rdzv-backend c10d \
--rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
-m qat_sft \
--model_name_or_path microsoft/Phi-4-mini-instruct \
--bf16 true \
--num_train_epochs 1 \
--per_device_train_batch_size $device_batch_size \
--gradient_accumulation_steps $grad_accum_steps \
--dataset_name $dataset_name \
--dataloader_num_workers 4 \
--max_length 4096 \
--max_steps $max_steps \
--report_to tensorboard \
--learning_rate $lr \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
--seed $SEED \
--output_dir $SAVE_DIR \
--weight_bits 2 \
--linear_pat 'proj\.weight$' \
--embed_bits 4 \
--embed_pat '(lm_head|embed_tokens)'
```
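As a quick sanity check on the settings above, the sketch below works out how many sequences contribute to each optimizer step. It assumes plain data parallelism, with one process per GPU and every process consuming `device_batch_size` sequences per micro-batch; the variable names `sequences_per_step` and `max_tokens_per_step` are illustrative only.

```py
# Values copied from the training script above.
ngpu = 8
device_batch_size = 4
grad_accum_steps = 2
max_length = 4096

# Assumption: data-parallel training, so each GPU sees its own micro-batch
# and gradients are accumulated grad_accum_steps times before an update.
sequences_per_step = ngpu * device_batch_size * grad_accum_steps

# Upper bound only: sequences shorter than max_length contribute fewer tokens.
max_tokens_per_step = sequences_per_step * max_length

print(f"sequences per optimizer step: {sequences_per_step}")
print(f"max tokens per optimizer step: {max_tokens_per_step}")
```

If you change `ngpu`, adjusting `grad_accum_steps` in the opposite direction keeps the number of sequences per step the same.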
To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.
## Generation from Quantized Model
```py
import os

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)

# Directory of the finetuned model; export SAVE_DIR in your shell first.
model_path = os.environ["SAVE_DIR"]
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

# Remember the prompt length so only newly generated tokens are decoded.
start_idx = len(inputs.input_ids[0])

# Optional extra arguments for generate(); empty keeps the default settings.
kwargs = {}
response_ids = model.generate(**inputs, max_new_tokens=256, **kwargs)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
```
# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
Evaluation command for the table below:
```bash
lm_eval \
--model hf \
--model_args pretrained=$SAVE_DIR,dtype=auto \
--tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
--output_path ${SAVE_DIR}/eval_results.json \
--batch_size auto \
--trust_remote_code
```
Note: exact numbers may vary slightly based on your machine's chosen batch size.
| Task | [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |
| arc_easy | 80.30 | 74.28 | 68.98 |
| arc_challenge | 58.45 | 52.65 | 43.17 |
| boolq | 83.46 | 69.11 | 71.50 |
| hellaswag | 72.76 | 68.97 | 62.10 |
| mathqa | 41.27 | 38.12 | 32.76 |
| openbookqa | 41.80 | 39.80 | 38.40 |
| piqa | 78.29 | 76.22 | 73.83 |
| social_iqa | 49.64 | 45.55 | 46.93 |
| winogrande | 71.51 | 68.67 | 64.48 |
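To compare the columns at a glance, the sketch below averages each column and lists the tasks where the 2-bit QAT model scores above the 4-bit PTQ model. The unweighted mean across tasks is an illustrative summary chosen here, not a metric reported by lm-evaluation-harness.

```py
# Scores copied from the table above: (baseline, 4-bit PTQ, 2-bit QAT).
scores = {
    "arc_easy": (80.30, 74.28, 68.98),
    "arc_challenge": (58.45, 52.65, 43.17),
    "boolq": (83.46, 69.11, 71.50),
    "hellaswag": (72.76, 68.97, 62.10),
    "mathqa": (41.27, 38.12, 32.76),
    "openbookqa": (41.80, 39.80, 38.40),
    "piqa": (78.29, 76.22, 73.83),
    "social_iqa": (49.64, 45.55, 46.93),
    "winogrande": (71.51, 68.67, 64.48),
}
columns = ("Phi-4-mini-instruct", "4-bit PTQ", "2-bit QAT")

# Unweighted mean of each column across the nine tasks.
means = {
    name: sum(row[i] for row in scores.values()) / len(scores)
    for i, name in enumerate(columns)
}

# Tasks where the 2-bit QAT model scores higher than the 4-bit PTQ model.
qat_beats_ptq = [task for task, (_, ptq, qat) in scores.items() if qat > ptq]

for name, mean in means.items():
    print(f"{name}: {mean:.2f}")
print("2-bit QAT ahead of 4-bit PTQ on:", qat_beats_ptq)
```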
# Exporting to ExecuTorch
⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.
We can run the 2-bit quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch), the PyTorch solution for mobile deployment.
To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:
```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
```
(The commands above work on Arm-based Macs. On Arm-based Linux, set the following environment variables before pip installing third-party/ao: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.)
Now we export the model to ExecuTorch, using the TorchAO lowbit kernel backend.
(Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup, or Python will import from the local repo paths instead of the installed package.)
```bash
# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})
# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin
# 3. Download model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json
# 4. Export the model to ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
--model "phi_4_mini" \
--checkpoint pytorch_model_converted.bin \
--params phi_4_mini_config.json \
--output_name phi4_model_2bit.pte \
-kv \
--use_sdpa_with_kv_cache \
--use-torchao-kernels \
--max_context_length 1024 \
--max_seq_length 256 \
--dtype fp32 \
--metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'
# 5. (optional) Upload pte file to HuggingFace
# hf upload ${HF_DIR} phi4_model_2bit.pte
```
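If you re-export with different token IDs, hand-editing the JSON passed to `--metadata` is easy to get wrong. The sketch below shows one way to build that argument with the standard library so the JSON and the shell quoting are both valid. The IDs are the ones used in the export command above; the variable names are illustrative.

```py
import json
import shlex

# Token IDs copied from the export command above.
metadata = {"get_bos_id": 199999, "get_eos_ids": [200020, 199999]}

# Compact JSON, then shell-quote it so it survives as a single argument.
metadata_arg = shlex.quote(json.dumps(metadata, separators=(",", ":")))

print(f"--metadata {metadata_arg}")
```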
Once you have the `.pte` file, you can run it inside our [iOS demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) in a [few easy steps](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple#build-and-run).