## Deployment Guide for openPangu-VL-7B on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)

### Deployment Environment Requirements

openPangu-VL-7B can be deployed on an Atlas 800T A2 (64 GB) server using 1, 2, 4, or 8 NPUs.

### Building and Launching the Image

Use the vllm-ascend community image v0.9.1. Pull it as follows:

```bash
docker pull quay.io/ascend/vllm-ascend:v0.9.1
```

Perform the following steps on every node.

Start the container:

```bash
# vllm-ascend image to use
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1  # use the correct image tag
export NAME=vllm-ascend  # custom container name

# Run the container using the variables defined above
# Note: if you run Docker with a bridge network, expose the ports needed for
# multi-node communication in advance
# To prevent device interference from other Docker containers, add "--privileged"
docker run --rm \
--name $NAME \
--network host \
--ipc=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```

The model weights and this project's code must be accessible inside the container. If you are not already inside the container, enter it as the root user:

```bash
docker exec -itu root $NAME /bin/bash
```

### PD Co-located Inference

Example launch script: `LOAD_CKPT_DIR=xxx bash examples/start_serving_openpangu_vl_7b.sh`. This script runs 8-NPU inference (it sets `TENSOR_PARALLEL_SIZE_LOCAL=8`). Once the service is up, send requests to the first (master) node.

### Sending Test Requests

After the service has started, you can send test requests. Using the system prompt shown in the examples is recommended.

Inference example: image + text

```python
import base64
import json
import os

import requests


def encode_image_to_base64(img_path, img_name):
    """Load an image file and encode it as a base64 string."""
    try:
        with open(os.path.join(img_path, img_name), 'rb') as img_file:
            img_data = img_file.read()
        return base64.b64encode(img_data).decode('utf-8')
    except Exception as e:
        print(f"image load failed: {e}")
        return None


base64_image = encode_image_to_base64("/image_path", "image_name.jpg")

payload_image_example = json.dumps({
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "你是华为公司开发的多模态大模型,名字是openPangu-VL-7B。你能够处理文本和视觉模态的输入,并给出文本输出。"},
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{base64_image}"}},
                {"type": "text", "text": "Please describe this picture."},
            ]
        }
    ],
    "model": "pangu_vl",
    "max_tokens": 500,
    "temperature": 1.0,
    "stream": False,
})

url = "http://127.0.0.1:8000/v1/chat/completions"
headers = {'Content-Type': 'application/json'}

response_image_example = requests.post(url, headers=headers, data=payload_image_example)
print(f"the response of image example is {response_image_example.text}")
```

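The request body above follows the OpenAI-compatible chat-completions schema served by vLLM. As a minimal sketch (the helper name `build_image_payload` is ours for illustration, not part of vllm-ascend), the payload construction can be factored into a reusable function:

```python
import base64
import json


def build_image_payload(image_bytes, prompt, system_prompt,
                        model="pangu_vl", max_tokens=500):
    """Build an OpenAI-style chat-completions request body with an inline
    base64-encoded image. Hypothetical helper, for illustration only."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({
        "messages": [
            {"role": "system",
             "content": [{"type": "text", "text": system_prompt}]},
            {"role": "user",
             "content": [
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/jpg;base64,{b64}"}},
                 {"type": "text", "text": prompt},
             ]},
        ],
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "stream": False,
    })
```

Passing the raw bytes keeps the base64 step in one place, so the same helper serves both test scripts.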
Inference example: video + text

```python
import base64
import json
import os

import requests


def encode_video_to_base64(video_path, video_name):
    """Load a video file and encode it as a base64 string."""
    try:
        with open(os.path.join(video_path, video_name), 'rb') as video_file:
            video_data = video_file.read()
        return base64.b64encode(video_data).decode('utf-8')
    except Exception as e:
        print(f"video load failed: {e}")
        return None


base64_video = encode_video_to_base64("/video_path", "video_name.mp4")

payload_video_example = json.dumps({
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "你是华为公司开发的多模态大模型,名字是openPangu-VL-7B。你能够处理文本和视觉模态的输入,并给出文本输出。"},
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{base64_video}"}},
                {"type": "text", "text": "Please describe this video."},
            ]
        }
    ],
    "model": "pangu_vl",
    "max_tokens": 500,
    "temperature": 1.0,
    "stream": False,
})

url = "http://127.0.0.1:8000/v1/chat/completions"
headers = {'Content-Type': 'application/json'}

response_video_example = requests.post(url, headers=headers, data=payload_video_example)
print(f"the response of video example is {response_video_example.text}")
```

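Both examples print the raw response body. The service answers with a standard chat-completions JSON document, so the generated text can be pulled out of `response.text` with a small sketch like the one below (field names follow the OpenAI response schema; `extract_reply` is a hypothetical helper):

```python
import json


def extract_reply(response_text):
    """Return the assistant message text from an OpenAI-style
    chat-completions response body."""
    data = json.loads(response_text)
    return data["choices"][0]["message"]["content"]
```

Usage: `print(extract_reply(response_video_example.text))`.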
### 128k Long-Sequence Video Inference

Add the following fields to /preprocessor_config.json; the input video will then be sampled to 768 frames:

```
"num_frames": 768,
"sample_fps": -1.0
```

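With `num_frames: 768` and a negative `sample_fps`, a fixed number of frames is drawn from the video regardless of its native frame rate. The preprocessor's exact sampling strategy is internal to the model code; purely as an illustration of fixed-count sampling, uniform index selection can be sketched as follows (`uniform_frame_indices` is our name, not an API of this project):

```python
def uniform_frame_indices(total_frames, num_frames=768):
    """Pick `num_frames` indices spread evenly over a video with
    `total_frames` frames. If the video is shorter than requested,
    every frame is kept. Illustrative sketch only; the actual
    preprocessor logic may differ.
    """
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]
```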
| |
Set the following parameters inside the launch script (/inference/vllm_ascend/examples/start_serving_openpangu_vl_7b.sh):

```
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MAX_MODEL_LEN=128000
MAX_NUM_BATCHED_TOKENS=100000
GPU_MEMORY_UTILIZATION=0.7

--no-enable-chunked-prefill \
--no-enable-prefix-caching \
```

### Int8 Inference

#### ModelSlim Quantization

openPangu-VL-7B supports the open-source quantization framework ModelSlim; see [[ModelSlim_openPangu-VL-7B]](https://gitcode.com/Ascend/msit/blob/msModelslim_Pangu_VL/msmodelslim/example/multimodal_vlm/openPangu-VL/ReadMe.md). The model currently supports W8A8 weight-activation quantization.

##### openPangu-VL-7B W8A8 Dynamic Quantization

```bash
export QUANT_PATH=your_quant_save_dir
export MODEL_PATH=your_model_ckpt_dir
export CALI_DATASET=your_cali_dataset_dir
python quant_pangu_vl.py \
    --model_path $MODEL_PATH --calib_images $CALI_DATASET \
    --save_directory $QUANT_PATH --w_bit 8 --a_bit 8 --device_type npu \
    --trust_remote_code True --anti_method m2 --act_method 3 --is_dynamic True
```

Compared with the BF16 model, the int8 quantized model's config.json contains the following additional field:

```
"quantize": "w8a8_dynamic",
```

The ModelSlim quantization script appends this field to config.json automatically after generating the quantized model.

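ModelSlim writes the field for you; if you ever need to patch a config.json by hand (for example, for weights quantized with another pipeline), a minimal sketch could look like this (`add_quantize_field` is a hypothetical helper, not part of ModelSlim):

```python
import json


def add_quantize_field(config_path, value="w8a8_dynamic"):
    """Add the `quantize` field to a model's config.json in place,
    leaving all other fields untouched. Hypothetical helper."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    config["quantize"] = value
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
```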
#### Running Int8 Inference

Int8 quantized model inference uses the same launch script as BF16 inference; you only need to:

* reduce the number of nodes and NPUs;
* change the model checkpoint path.