---
library_name: transformers
tags:
- torchao
- phi
- phi4
- nlp
- code
- math
- chat
- conversational
license: mit
language:
- multilingual
base_model:
- microsoft/Phi-4-mini-instruct
pipeline_tag: text-generation
---

Phi-4-mini-instruct was quantized by the PyTorch team using [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq) ([paper](https://openreview.net/pdf?id=8PCxOlwbIn)), an algorithm in torchao. The model uses 2-bit weights in its linear layers, 4-bit embeddings, and 8-bit dynamic activation quantization. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the [quantized pte](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) for direct use in ExecuTorch. (The provided pte file is exported with a `max_context_length` of 1024. If you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a Mobile App

The [pte file](https://huggingface.co/pytorch/Phi-4-mini-instruct-parq-2w-4e-shared/blob/main/phi4_model_2bit.pte) can be run with ExecuTorch on a mobile phone. See the [instructions](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) for running it on iOS. On an iPhone 15 Pro, the model runs at 27 tokens/second and uses 1453 MB of memory.

# Quantization Recipe

Install `uv` by following https://docs.astral.sh/uv/getting-started/installation

```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```

## QAT Finetuning with PARQ

We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The script below finetunes Phi-4-mini-instruct using QAT with 2-bit weight quantization and 4-bit embedding quantization, both at per-row granularity. Before running it:

1. `curl -O https://huggingface.co/datasets/pytorch/parq-sft/resolve/main/qat_sft.py`
2. Set `dataset_name` to a dataset from the [Hugging Face datasets hub](https://huggingface.co/datasets), and set `max_steps` to the number of training steps.

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}

dataset_name=<TODO>
max_steps=<TODO>
ngpu=8
device_batch_size=4
grad_accum_steps=2
lr=3e-5
TOKENIZERS_PARALLELISM=$(( ngpu == 1 ))  \
  PYTORCH_ALLOC_CONF=expandable_segments:True \
  torchrun \
  --nproc-per-node $ngpu \
  --rdzv-id $SEED \
  --rdzv-backend c10d \
  --rdzv-endpoint localhost:$(shuf -i 29000-29500 -n 1) \
  -m qat_sft \
  --model_name_or_path microsoft/Phi-4-mini-instruct \
  --bf16 true \
  --num_train_epochs 1 \
  --per_device_train_batch_size $device_batch_size \
  --gradient_accumulation_steps $grad_accum_steps \
  --dataset_name $dataset_name \
  --dataloader_num_workers 4 \
  --max_length 4096 \
  --max_steps $max_steps \
  --report_to tensorboard \
  --learning_rate $lr \
  --lr_scheduler_type linear \
  --warmup_ratio 0.0 \
  --seed $SEED \
  --output_dir $SAVE_DIR \
  --weight_bits 2 \
  --linear_pat 'proj\.weight$' \
  --embed_bits 4 \
  --embed_pat '(lm_head|embed_tokens)'
```

To export the finetuned model, rerun the above script on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`. The exported model will be saved to `${SAVE_DIR}/quant_converted`.
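To make "2-bit weights at per-row granularity" concrete, here is a minimal pure-Python sketch. It uses a plain uniform min-max quantizer, which is an assumption chosen for illustration and is not PARQ's own quantizer or training procedure; the function names `quantize_row` and `quantize_per_row` are hypothetical.

```python
def quantize_row(row, bits=2):
    """Uniform min-max quantization of one row to 2**bits levels."""
    levels = 2 ** bits - 1                 # highest integer code (3 for 2 bits)
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels or 1.0      # fall back to 1.0 for constant rows
    codes = [round((w - lo) / scale) for w in row]
    dequant = [lo + c * scale for c in codes]
    return codes, dequant


def quantize_per_row(matrix, bits=2):
    """Quantize each row independently, so every row gets its own scale and offset."""
    return [quantize_row(row, bits) for row in matrix]


# Two rows with very different magnitudes.
weights = [
    [-0.9, -0.3, 0.3, 0.9],
    [0.00, 0.01, 0.02, 0.03],
]
for codes, dequant in quantize_per_row(weights):
    print(codes, [round(v, 3) for v in dequant])
```

Because each row has its own scale, the small-magnitude second row still uses all four integer codes. With a single scale shared across the whole matrix, its values would collapse onto one or two levels.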

## Generation from Quantized Model

```py
import os

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(0)
# SAVE_DIR is the finetuning output directory; export it in your shell first
model_path = os.environ["SAVE_DIR"]
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [{"role": "user", "content": prompt}]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)

start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)
```

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

Evaluation command for the table below:
```bash
lm_eval \
  --model hf \
  --model_args pretrained=$SAVE_DIR,dtype=auto \
  --tasks arc_easy,arc_challenge,boolq,hellaswag,mathqa,openbookqa,piqa,social_iqa,winogrande \
  --output_path ${SAVE_DIR}/eval_results.json \
  --batch_size auto \
  --trust_remote_code
```
Note: exact numbers may vary slightly depending on the batch size that `--batch_size auto` selects on your machine.

| | [Phi-4-mini-instruct](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
| --- | :---: | :---: | :---: |
| arc_easy | 80.30 | 74.28 | 68.98 |
| arc_challenge | 58.45 | 52.65 | 43.17 |
| boolq | 83.46 | 69.11 | 71.50 |
| hellaswag | 72.76 | 68.97 | 62.10 |
| mathqa | 41.27 | 38.12 | 32.76 |
| openbookqa | 41.80 | 39.80 | 38.40 |
| piqa | 78.29 | 76.22 | 73.83 |
| social_iqa | 49.64 | 45.55 | 46.93 |
| winogrande | 71.51 | 68.67 | 64.48 |
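As a rough single-number summary, the sketch below computes the unweighted mean of each column in the table above. The averaging is our own illustration rather than a metric reported by lm-evaluation-harness, and the helper name `column_means` is hypothetical; the scores are copied directly from the table.

```python
# Per-task scores copied from the table: (baseline, 4-bit PTQ, 2-bit QAT)
scores = {
    "arc_easy": (80.30, 74.28, 68.98),
    "arc_challenge": (58.45, 52.65, 43.17),
    "boolq": (83.46, 69.11, 71.50),
    "hellaswag": (72.76, 68.97, 62.10),
    "mathqa": (41.27, 38.12, 32.76),
    "openbookqa": (41.80, 39.80, 38.40),
    "piqa": (78.29, 76.22, 73.83),
    "social_iqa": (49.64, 45.55, 46.93),
    "winogrande": (71.51, 68.67, 64.48),
}
columns = ("Phi-4-mini-instruct", "4-bit PTQ", "2-bit QAT")


def column_means(table):
    """Unweighted mean over tasks for each model column."""
    n = len(table)
    return [sum(row[i] for row in table.values()) / n for i in range(len(columns))]


means = column_means(scores)
for name, mean in zip(columns, means):
    print(f"{name}: {mean:.2f}")
```

An unweighted mean treats every task equally regardless of its size, so read it as a coarse comparison only.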

# Exporting to ExecuTorch

⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the 2-bit quantized model on a mobile phone using [ExecuTorch](https://github.com/pytorch/executorch), the PyTorch solution for mobile deployment.

To set up ExecuTorch with TorchAO lowbit kernels, run the following commands:
```bash
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
USE_CPP=1 TORCHAO_BUILD_KLEIDIAI=1 pip install third-party/ao
popd
```

(The above commands work on Arm-based Macs. On Arm-based Linux, set the following environment variables before running `pip install third-party/ao`: `BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP`.)

Now we export the model to ExecuTorch, using the TorchAO lowbit kernel backend.
(Do not run these commands from a directory containing the ExecuTorch repo you cloned during setup, or Python will import from the local paths in the repo instead of the installed package.)

```bash
# 1. Download QAT'd weights from HF
HF_DIR=pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
WEIGHT_DIR=$(hf download ${HF_DIR})

# 2. Rename the weight keys to ones that ExecuTorch expects
python -m executorch.examples.models.phi_4_mini.convert_weights $WEIGHT_DIR pytorch_model_converted.bin

# 3. Download model config from the ExecuTorch repo
curl -L -o phi_4_mini_config.json https://raw.githubusercontent.com/pytorch/executorch/main/examples/models/phi_4_mini/config/config.json

# 4. Export the model to ExecuTorch pte file
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint pytorch_model_converted.bin \
  --params phi_4_mini_config.json \
  --output_name phi4_model_2bit.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  --use-torchao-kernels \
  --max_context_length 1024 \
  --max_seq_length 256 \
  --dtype fp32 \
  --metadata '{"get_bos_id":199999, "get_eos_ids":[200020,199999]}'

# 5. (Optional) Upload the pte file to Hugging Face
# hf upload ${HF_DIR} phi4_model_2bit.pte
```
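The `--metadata` flag in step 4 takes a JSON string, which is easy to mis-quote in a shell. As a small sketch, the snippet below builds that string with Python's standard library and prints it shell-quoted. The token IDs are the ones used in the export command above; the variable names are our own.

```python
import json
import shlex

# BOS/EOS token IDs as passed to export_llama in the command above
metadata = {"get_bos_id": 199999, "get_eos_ids": [200020, 199999]}

# Serialize once and confirm it round-trips as valid JSON
metadata_arg = json.dumps(metadata)
assert json.loads(metadata_arg) == metadata

# shlex.quote wraps the JSON so it can be pasted after --metadata
print("--metadata", shlex.quote(metadata_arg))
```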

Once you have the `.pte` file, you can run it in our [iOS demo app](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple) in a [few easy steps](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple#build-and-run).