## Deployment Guide for openPangu-VL-7B on [vllm-ascend](https://github.com/vllm-project/vllm-ascend)

### Deployment Environment Requirements

openPangu-VL-7B can be deployed on an Atlas 800T A2 (64 GB) server using 1, 2, 4, or 8 NPUs.

### Building and Launching the Image

Use the vllm-ascend community image v0.9.1. Pull it as follows:

```bash
docker pull quay.io/ascend/vllm-ascend:v0.9.1
```

Perform the following steps on every node.

Start the container:

```bash
# vllm-ascend image to use
export IMAGE=quay.io/ascend/vllm-ascend:v0.9.1  # use the correct image tag
export NAME=vllm-ascend  # custom container name

# Run the container using the variables defined above
# Note: if you run Docker with a bridge network, expose the ports needed for
# multi-node communication in advance
# To prevent device interference from other Docker containers, add "--privileged"
docker run --rm \
--name $NAME \
--network host \
--ipc=host \
--device /dev/davinci0 \
--device /dev/davinci1 \
--device /dev/davinci2 \
--device /dev/davinci3 \
--device /dev/davinci4 \
--device /dev/davinci5 \
--device /dev/davinci6 \
--device /dev/davinci7 \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/Ascend/driver/tools/hccn_tool:/usr/local/Ascend/driver/tools/hccn_tool \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /mnt/sfs_turbo/.cache:/root/.cache \
-it $IMAGE bash
```

The model weights and this project's code must be accessible inside the container. If you are not already inside the container, enter it as the root user:

```bash
docker exec -itu root $NAME /bin/bash
```

### PD Co-located Inference

Example launch script: `LOAD_CKPT_DIR=xxx bash examples/start_serving_openpangu_vl_7b.sh`. This script runs 8-NPU inference (it sets `TENSOR_PARALLEL_SIZE_LOCAL=8`). Once the service is up, send requests to the first (master) node.

### Sending Test Requests

After the service has started, you can send test requests. Using the system prompt shown in the examples is recommended.

Inference example: image + text

```python
import base64
import json
import os

import requests


def encode_image_to_base64(img_path, img_name):
    """Load an image file and encode it as a base64 string."""
    try:
        with open(os.path.join(img_path, img_name), 'rb') as img_file:
            img_data = img_file.read()
        return base64.b64encode(img_data).decode('utf-8')
    except Exception as e:
        print(f"image load failed: {e}")
        return None


base64_image = encode_image_to_base64("/image_path", "image_name.jpg")

payload_image_example = json.dumps({
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "你是华为公司开发的多模态大模型,名字是openPangu-VL-7B。你能够处理文本和视觉模态的输入,并给出文本输出。"},
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/jpg;base64,{base64_image}"}},
                {"type": "text", "text": "Please describe this picture."},
            ]
        }
    ],
    "model": "pangu_vl",
    "max_tokens": 500,
    "temperature": 1.0,
    "stream": False,
})

url = "http://127.0.0.1:8000/v1/chat/completions"
headers = {'Content-Type': 'application/json'}

response_image_example = requests.post(url, headers=headers, data=payload_image_example)
print(f"the response of image example is {response_image_example.text}")
```

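The request body above follows the OpenAI-compatible chat-completions schema served by vLLM. As a minimal sketch (the helper name `build_image_payload` is ours for illustration, not part of vllm-ascend), the payload construction can be factored into a reusable function:

```python
import base64
import json


def build_image_payload(image_bytes, prompt, system_prompt,
                        model="pangu_vl", max_tokens=500):
    """Build an OpenAI-style chat-completions request body with an inline
    base64-encoded image. Hypothetical helper, for illustration only."""
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return json.dumps({
        "messages": [
            {"role": "system",
             "content": [{"type": "text", "text": system_prompt}]},
            {"role": "user",
             "content": [
                 {"type": "image_url",
                  "image_url": {"url": f"data:image/jpg;base64,{b64}"}},
                 {"type": "text", "text": prompt},
             ]},
        ],
        "model": model,
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "stream": False,
    })
```

Passing the raw bytes keeps the base64 step in one place, so the same helper serves both test scripts.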
Inference example: video + text

```python
import base64
import json
import os

import requests


def encode_video_to_base64(video_path, video_name):
    """Load a video file and encode it as a base64 string."""
    try:
        with open(os.path.join(video_path, video_name), 'rb') as video_file:
            video_data = video_file.read()
        return base64.b64encode(video_data).decode('utf-8')
    except Exception as e:
        print(f"video load failed: {e}")
        return None


base64_video = encode_video_to_base64("/video_path", "video_name.mp4")

payload_video_example = json.dumps({
    "messages": [
        {
            "role": "system",
            "content": [
                {"type": "text", "text": "你是华为公司开发的多模态大模型,名字是openPangu-VL-7B。你能够处理文本和视觉模态的输入,并给出文本输出。"},
            ]
        },
        {
            "role": "user",
            "content": [
                {"type": "video_url", "video_url": {"url": f"data:video/mp4;base64,{base64_video}"}},
                {"type": "text", "text": "Please describe this video."},
            ]
        }
    ],
    "model": "pangu_vl",
    "max_tokens": 500,
    "temperature": 1.0,
    "stream": False,
})

url = "http://127.0.0.1:8000/v1/chat/completions"
headers = {'Content-Type': 'application/json'}

response_video_example = requests.post(url, headers=headers, data=payload_video_example)
print(f"the response of video example is {response_video_example.text}")
```

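Both examples print the raw response body. The service answers with a standard chat-completions JSON document, so the generated text can be pulled out of `response.text` with a small sketch like the one below (field names follow the OpenAI response schema; `extract_reply` is a hypothetical helper):

```python
import json


def extract_reply(response_text):
    """Return the assistant message text from an OpenAI-style
    chat-completions response body."""
    data = json.loads(response_text)
    return data["choices"][0]["message"]["content"]
```

Usage: `print(extract_reply(response_video_example.text))`.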
### 128k Long-Sequence Video Inference

Add the following fields to /preprocessor_config.json; the input video will then be sampled to 768 frames:

```
"num_frames": 768,
"sample_fps": -1.0
```

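With `num_frames: 768` and a negative `sample_fps`, a fixed number of frames is drawn from the video regardless of its native frame rate. The preprocessor's exact sampling strategy is internal to the model code; purely as an illustration of fixed-count sampling, uniform index selection can be sketched as follows (`uniform_frame_indices` is our name, not an API of this project):

```python
def uniform_frame_indices(total_frames, num_frames=768):
    """Pick `num_frames` indices spread evenly over a video with
    `total_frames` frames. If the video is shorter than requested,
    every frame is kept. Illustrative sketch only; the actual
    preprocessor logic may differ.
    """
    if total_frames <= num_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]
```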
| |
Set the following parameters inside the launch script (/inference/vllm_ascend/examples/start_serving_openpangu_vl_7b.sh):

```
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
MAX_MODEL_LEN=128000
MAX_NUM_BATCHED_TOKENS=100000
GPU_MEMORY_UTILIZATION=0.7

--no-enable-chunked-prefill \
--no-enable-prefix-caching \
```

### Int8 Inference

#### ModelSlim Quantization

openPangu-VL-7B supports the open-source quantization framework ModelSlim; see [[ModelSlim_openPangu-VL-7B]](https://gitcode.com/Ascend/msit/blob/msModelslim_Pangu_VL/msmodelslim/example/multimodal_vlm/openPangu-VL/ReadMe.md). The model currently supports W8A8 weight-activation quantization.

##### openPangu-VL-7B W8A8 Dynamic Quantization

```bash
export QUANT_PATH=your_quant_save_dir
export MODEL_PATH=your_model_ckpt_dir
export CALI_DATASET=your_cali_dataset_dir
python quant_pangu_vl.py \
    --model_path $MODEL_PATH --calib_images $CALI_DATASET \
    --save_directory $QUANT_PATH --w_bit 8 --a_bit 8 --device_type npu \
    --trust_remote_code True --anti_method m2 --act_method 3 --is_dynamic True
```

Compared with the BF16 model, the int8 quantized model's config.json contains the following additional field:

```
"quantize": "w8a8_dynamic",
```

The ModelSlim quantization script appends this field to config.json automatically after generating the quantized model.

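ModelSlim writes the field for you; if you ever need to patch a config.json by hand (for example, for weights quantized with another pipeline), a minimal sketch could look like this (`add_quantize_field` is a hypothetical helper, not part of ModelSlim):

```python
import json


def add_quantize_field(config_path, value="w8a8_dynamic"):
    """Add the `quantize` field to a model's config.json in place,
    leaving all other fields untouched. Hypothetical helper."""
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    config["quantize"] = value
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, ensure_ascii=False)
```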
#### Running Int8 Inference

Int8 quantized model inference uses the same launch script as BF16 inference; you only need to:

* reduce the number of nodes and NPUs;
* change the model checkpoint path.