Model repository: pytorch/Phi-4-mini-instruct-parq-2w-4e-shared
Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -17,6 +17,14 @@ base_model:
|
|
| 17 |
pipeline_tag: text-generation
|
| 18 |
---
|
| 19 |
|
| 20 |
# Quantization Recipe
|
| 21 |
|
| 22 |
Install `uv` by following https://docs.astral.sh/uv/getting-started/installation
|
|
@@ -30,9 +38,7 @@ uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126
|
|
| 30 |
|
| 31 |
## QAT Finetuning with PARQ
|
| 32 |
|
| 33 |
-
We apply QAT with
|
| 34 |
-
|
| 35 |
-
The following script finetunes Phi-4-mini-instruct with 2-bit weight quantization and 4-bit embedding quantization using QAT. Set `dataset_name` to your desired dataset from the [HuggingFace datasets hub](https://huggingface.co/datasets).
|
| 36 |
|
| 37 |
```bash
|
| 38 |
source ~/.uv-hf/bin/activate
|
|
@@ -40,6 +46,8 @@ source ~/.uv-hf/bin/activate
|
|
| 40 |
SEED=$RANDOM
|
| 41 |
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}
|
| 42 |
|
| 43 |
ngpu=8
|
| 44 |
device_batch_size=4
|
| 45 |
grad_accum_steps=2
|
|
@@ -60,9 +68,8 @@ TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
|
|
| 60 |
--dataset_name $dataset_name \
|
| 61 |
--dataloader_num_workers 4 \
|
| 62 |
--max_length 4096 \
|
| 63 |
-
--
|
| 64 |
--report_to tensorboard \
|
| 65 |
-
--logging_steps 2 \
|
| 66 |
--learning_rate $lr \
|
| 67 |
--lr_scheduler_type linear \
|
| 68 |
--warmup_ratio 0.0 \
|
|
@@ -74,17 +81,15 @@ TOKENIZERS_PARALLELISM=$(( ngpu == 1 )) \
|
|
| 74 |
--embed_pat '(lm_head|embed_tokens)'
|
| 75 |
```
|
| 76 |
|
| 77 |
## Generation from Quantized Model
|
| 78 |
|
| 79 |
```py
|
| 80 |
import os
|
| 81 |
|
| 82 |
from huggingface_hub import whoami, get_token
|
| 83 |
-
from transformers import
|
| 84 |
-
AutoModelForCausalLM,
|
| 85 |
-
AutoTokenizer,
|
| 86 |
-
set_seed,
|
| 87 |
-
)
|
| 88 |
|
| 89 |
set_seed(0)
|
| 90 |
model_path = f"{SAVE_DIR}"
|
|
@@ -113,6 +118,8 @@ print(output_text)
|
|
| 113 |
|
| 114 |
# Model Quality
|
| 115 |
|
| 116 |
Evaluation command for the table below:
|
| 117 |
```bash
|
| 118 |
lm_eval \
|
|
@@ -123,7 +130,7 @@ lm_eval \
|
|
| 123 |
--batch_size auto \
|
| 124 |
--trust_remote_code
|
| 125 |
```
|
| 126 |
-
Note
|
| 127 |
|
| 128 |
| | [bf16](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
|
| 129 |
| --- | :---: | :---: | :---: |
|
|
|
|
| 17 |
pipeline_tag: text-generation
|
| 18 |
---
|
| 19 |
|
| 20 |
+
Phi-4-mini is quantized by the PyTorch team using [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq), an algorithm in torchao. The model uses 2-bit weights in its linear layers, 4-bit embeddings, and 8-bit dynamic activations. It is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).
|
| 21 |
+
|
| 22 |
+
We provide the quantized `.pte` file for direct use in ExecuTorch. (The provided `.pte` file was exported with a `max_context_length` of 1024. To change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)
|
| 23 |
+
|
| 24 |
+
# Running in a Mobile App
|
| 25 |
+
|
| 26 |
+
The `.pte` file can be run with ExecuTorch on a mobile phone. See the [instructions](https://docs.pytorch.org/executorch/0.7/llm/llama-demo-ios.html) for running it on iOS. On an iPhone 15 Pro, the model runs at 27 tokens/sec and uses 1453 MB of memory.
|
| 27 |
+
|
| 28 |
# Quantization Recipe
|
| 29 |
|
| 30 |
Install `uv` by following https://docs.astral.sh/uv/getting-started/installation
|
|
|
|
| 38 |
|
| 39 |
## QAT Finetuning with PARQ
|
| 40 |
|
| 41 |
+
We apply QAT with an optimizer-only package called [PARQ](https://github.com/pytorch/ao/tree/main/torchao/prototype/parq). The following script finetunes Phi-4-mini-instruct using QAT with 2-bit weight quantization and 4-bit embedding quantization, both at per-row granularity. Before running it, set `dataset_name` to a dataset from the [Hugging Face datasets hub](https://huggingface.co/datasets) and set `max_steps` to the number of training steps.
|
| 42 |
|
| 43 |
```bash
|
| 44 |
source ~/.uv-hf/bin/activate
|
|
|
|
| 46 |
SEED=$RANDOM
|
| 47 |
SAVE_DIR=checkpoints/phi-4-mini-2wei-4emb-${SEED}
|
| 48 |
|
| 49 |
+
dataset_name=<TODO>
|
| 50 |
+
max_steps=<TODO>
|
| 51 |
ngpu=8
|
| 52 |
device_batch_size=4
|
| 53 |
grad_accum_steps=2
|
|
|
|
| 68 |
--dataset_name $dataset_name \
|
| 69 |
--dataloader_num_workers 4 \
|
| 70 |
--max_length 4096 \
|
| 71 |
+
--max_steps $max_steps \
|
| 72 |
--report_to tensorboard \
|
|
|
|
| 73 |
--learning_rate $lr \
|
| 74 |
--lr_scheduler_type linear \
|
| 75 |
--warmup_ratio 0.0 \
|
|
|
|
| 81 |
--embed_pat '(lm_head|embed_tokens)'
|
| 82 |
```
|
| 83 |
|
| 84 |
+
To export the finetuned model, rerun the script above on a single GPU with `--resume_from_checkpoint ${SAVE_DIR}/checkpoint-{SAVE_STEP}`, where `{SAVE_STEP}` is a placeholder for the step of the checkpoint to export. The exported model is saved to `${SAVE_DIR}/quant_converted`.
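As a rough illustration of what "per-row granularity" means for the 2-bit weights and 4-bit embeddings in the recipe above, the sketch below quantizes each row of a small matrix with its own scale. It is a plain symmetric round-to-nearest quantizer written for this note, not PARQ's training-time algorithm or torchao's API; the function names `quantize_per_row` and `dequantize_per_row` are hypothetical.

```python
def quantize_per_row(weights, bits):
    """Symmetric per-row quantization of a 2-D list of floats (illustrative only)."""
    half = 2 ** (bits - 1)          # 2 for 2-bit, 8 for 4-bit
    qmin, qmax = -half, half - 1    # signed integer range, e.g. -2..1 for 2-bit
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # One scale per row; fall back to 1.0 for an all-zero row.
        scale = max_abs / half if max_abs > 0 else 1.0
        # Round to the nearest level, then clip into the signed range.
        q_rows.append([min(qmax, max(qmin, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_per_row(q_rows, scales):
    """Map integer codes back to floats using each row's scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


# Assumed toy weights: two rows with very different magnitudes.
weights = [[0.5, -1.0, 0.375, 0.0], [4.0, -2.0, 0.5, -4.0]]
q, scales = quantize_per_row(weights, bits=2)
print(q, scales)
print(dequantize_per_row(q, scales))
```

Because each row gets its own scale, the small-magnitude first row is not crushed by the large values in the second row. Note that the signed 2-bit range is asymmetric, so in this sketch the largest positive value in a row is clipped to one level below its magnitude.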
|
| 85 |
+
|
| 86 |
## Generation from Quantized Model
|
| 87 |
|
| 88 |
```py
|
| 89 |
import os
|
| 90 |
|
| 91 |
from huggingface_hub import whoami, get_token
|
| 92 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
|
| 93 |
|
| 94 |
set_seed(0)
|
| 95 |
model_path = f"{SAVE_DIR}"
|
|
|
|
| 118 |
|
| 119 |
# Model Quality
|
| 120 |
|
| 121 |
+
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
|
| 122 |
+
|
| 123 |
Evaluation command for the table below:
|
| 124 |
```bash
|
| 125 |
lm_eval \
|
|
|
|
| 130 |
--batch_size auto \
|
| 131 |
--trust_remote_code
|
| 132 |
```
|
| 133 |
+
Note: exact numbers may vary slightly based on your machine's chosen batch size.
|
| 134 |
|
| 135 |
| | [bf16](https://huggingface.co/microsoft/Phi-4-mini-instruct) | [4-bit PTQ](https://huggingface.co/pytorch/Phi-4-mini-instruct-INT8-INT4) | 2-bit QAT |
|
| 136 |
| --- | :---: | :---: | :---: |
|