# openPangu-R-72B-2512 Deployment Guide for Omni-Infer

## Hardware Environment and Deployment Mode

Mixed prefill/decode (PD) deployment; only 4 dies of a single Atlas 800T A3 machine are required.

## Code and Image

- Omni-Infer code version: release_v0.7.0
- Companion image: see the v0.7.0 images at https://gitee.com/omniai/omniinfer/releases. Taking A3 hardware and the arm architecture as an example, pull with `docker pull swr.cn-east-4.myhuaweicloud.com/omni/omniinfer-a3-arm:release_v0.7.0-vllm`.

## Deployment

### 1. Start the Image

```bash
IMAGE=swr.cn-east-4.myhuaweicloud.com/omni/omniinfer-a3-arm:release_v0.7.0-vllm
NAME=omniinfer-v0.7.0  # Custom docker name
NPU_NUM=16             # Number of dies on an A3 node
DEVICE_ARGS=$(for i in $(seq 0 $((NPU_NUM-1))); do echo -n "--device /dev/davinci${i} "; done)

# Run the container using the variables defined above.
# Note: if you are running docker with a bridge network, expose the ports
# needed for multi-node communication in advance.
# To prevent device interference from other docker containers, add the
# argument "--privileged".
docker run -itd \
    --name=${NAME} \
    --network host \
    --privileged \
    --ipc=host \
    $DEVICE_ARGS \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    -v /usr/local/sbin/npu-smi:/usr/local/sbin/npu-smi \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /mnt/:/mnt/ \
    -v /data:/data \
    -v /home/work:/home/work \
    --entrypoint /bin/bash \
    ${IMAGE}
```

Make sure the model weights and this project's code are accessible inside the container. Enter the container:

```bash
docker exec -it $NAME /bin/bash
```

### 2. Copy examples/start_serving_openpangu_r_72b_2512.sh into omniinfer/tools/scripts and run it

```bash
git clone -b release_v0.7.0 https://gitee.com/omniai/omniinfer.git
cd omniinfer/tools/scripts
# Before running, edit the serving script: set model-path (model weights path),
# master-ip (machine IP address), and PYTHONPATH.
bash start_serving_openpangu_r_72b_2512.sh
```

### 3. Send a Test Request

Once the service is up, send a test request:

```bash
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openpangu_r_72b_2512",
    "messages": [
      {
        "role": "user",
        "content": "Who are you?"
      }
    ],
    "temperature": 1.0,
    "top_p": 0.8,
    "top_k": -1,
    "vllm_xargs": {"top_n_sigma": 0.05},
    "chat_template_kwargs": {"think": true, "reasoning_effort": "low"}
  }'
```

```bash
# Tool use
curl http://0.0.0.0:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openpangu_r_72b_2512",
    "messages": [
      {"role": "system", "content": "You are the Pangu model developed by Huawei.\nToday is July 30, 2025"},
      {"role": "user", "content": "What will the weather be like in Shenzhen tomorrow?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_current_weather",
          "description": "Get the current weather for a given city, including temperature, humidity, wind speed, and other data.",
          "parameters": {
            "type": "object",
            "properties": {
              "location": {
                "type": "string",
                "description": "City name, e.g. Beijing, Shenzhen. Chinese or pinyin input is supported."
              },
              "date": {
                "type": "string",
                "description": "Query date in YYYY-MM-DD format (ISO 8601), e.g. 2023-10-01."
              }
            },
            "required": ["location", "date"],
            "additionalProperties": false
          }
        }
      }
    ],
    "temperature": 1.0,
    "top_p": 0.8,
    "top_k": -1,
    "vllm_xargs": {"top_n_sigma": 0.05},
    "chat_template_kwargs": {"think": true, "reasoning_effort": "high"}
  }'
```

The model defaults to slow-thinking mode. In this mode it supports tiered chain-of-thought: setting "reasoning_effort" to "high" or "low" in the request-body field "chat_template_kwargs": {"think": true, "reasoning_effort": "high"} trades model accuracy against efficiency.

Slow-thinking mode itself can be switched on or off via the request-body field "chat_template_kwargs": {"think": true/false}.
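As a minimal sketch of switching slow thinking off, the snippet below builds a request body with "think": false and validates that it is well-formed JSON before sending. The endpoint, model name, and sampling parameters follow the examples above; the commented-out `curl` line assumes the service from the earlier steps is listening on 0.0.0.0:8000.

```shell
# Build a fast-thinking ("think": false) request body and sanity-check it
# as JSON before sending it to the service.
BODY='{
  "model": "openpangu_r_72b_2512",
  "messages": [{"role": "user", "content": "1+1=?"}],
  "temperature": 1.0,
  "top_p": 0.8,
  "chat_template_kwargs": {"think": false}
}'

# Validate the payload; prints "payload OK" if it parses as JSON.
echo "$BODY" | python3 -m json.tool > /dev/null && echo "payload OK"

# Then send it (requires the serving process from step 2 to be running):
# curl http://0.0.0.0:8000/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$BODY"
```

Keeping the body in a shell variable makes it easy to toggle "think" or "reasoning_effort" in one place when comparing the two modes.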