---
license: apache-2.0
datasets:
- imperfect-follow/qwen3-compress-256
language:
- en
base_model:
- Qwen/Qwen3-14B
tags:
- OpenVino
---

Data-aware int4 compression of Qwen3-14B using a custom dataset targeted at protecting the activations that are important for the functioning of the Qwen3 models.

Pull with ovms:

```
ovms --pull --source_model imperfect-follow/qwen3-14b-int4-asym-awq-ov \
  --model_repository_path [your_local_path] \
  --target_device [CPU/GPU] \
  --task text_generation \
  --tool_parser hermes3 \
  --reasoning_parser qwen3
```

Compression configuration:

```
optimum-cli export openvino \
  --model Qwen/Qwen3-14B \
  --task text-generation-with-past \
  --weight-format int4 \
  --trust-remote-code \
  --dataset imperfect \
  --num-samples 64 \
  --ratio 0.85 \
  --sensitivity-metric hessian_input_activation \
  --group-size 128 \
  --awq \
  --scale-estimation \
  --lora-correction \
  qwen3-14B-int4-asym-awq-ov
```

Use the recommended Qwen3 sampling parameters:

```
# --- thinking ---
# Temperature=0.6 TopP=0.95 TopK=20 MinP=0
# --- non-thinking ---
# Temperature=0.7 TopP=0.8 TopK=20 MinP=0
```

The following modifications were made to optimum-intel to enable the use of the targeted custom dataset.

Tell `utils.py` that "imperfect" is a valid option.

optimum/intel/openvino/utils.py (153):

```
PREDEFINED_CAUSAL_LANGUAGE_DATASETS = {"wikitext2", "c4", "c4-new", "auto", "gsm8k", "imperfect"}
```

Give `quantization.py` the ability to use the dataset.
optimum/intel/openvino/quantization.py (688):

```
elif config.dataset == "imperfect":
    seq_len = seq_len or 4096
    dataset = self.load_dataset(
        "imperfect-follow/qwen3-compress-256",
        dataset_split="train",
        num_samples=config.num_samples,
    )
    calibration_dataset = []
    for i, _text in enumerate(dataset["text"], 1):
        # Tokenize once without truncation to check the full length
        full_tokenized = tokenizer(_text, return_tensors="pt")
        full_length = full_tokenized.input_ids.shape[1]  # sequence dimension
        if full_length > seq_len:
            logger.warning(
                f"Sample {i} truncated: length {full_length} exceeds "
                f"maximum {seq_len} by {full_length - seq_len}"
            )
            # Slice the existing tensors instead of re-tokenizing;
            # this trims input_ids, attention_mask, etc. to seq_len
            item = {k: v[:, :seq_len] for k, v in full_tokenized.items()}
            calibration_dataset.append(item)
        else:
            calibration_dataset.append(full_tokenized)
```
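The slicing step in the patch above can be sketched in isolation. In this minimal, self-contained version, plain nested lists stand in for the tokenizer's PyTorch tensors, and the sample token IDs are made up for illustration:

```python
# Standalone sketch of the truncation logic from the quantization.py patch.
# A tokenized sample is modeled as a dict of 2-D lists (batch x sequence),
# mirroring the fields of a Hugging Face BatchEncoding.

def truncate_sample(tokenized: dict, seq_len: int) -> dict:
    """Slice every field (input_ids, attention_mask, ...) to seq_len tokens."""
    full_length = len(tokenized["input_ids"][0])
    if full_length > seq_len:
        # Slice existing rows instead of re-tokenizing with truncation enabled
        return {k: [row[:seq_len] for row in v] for k, v in tokenized.items()}
    return tokenized

# Hypothetical sample: 6 tokens against a 4-token budget
sample = {
    "input_ids": [[101, 7592, 2088, 2003, 2307, 102]],
    "attention_mask": [[1, 1, 1, 1, 1, 1]],
}
clipped = truncate_sample(sample, seq_len=4)
# Every field is now 4 tokens long; a sample already within budget
# would be returned unchanged.
```

Because the sample is tokenized once and then sliced, all fields stay aligned with each other, which is what the patch relies on when building the calibration dataset.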