---
license: apache-2.0
datasets:
- imperfect-follow/qwen3-compress-256
language:
- en
base_model:
- Qwen/Qwen3-14B
tags:
- OpenVino
---
Data-aware int4 compression of Qwen3-14B using a custom dataset targeted at protecting the activations that are most important to the functioning of the Qwen3 models.
Pull with ovms:

```shell
ovms --pull --source_model imperfect-follow/qwen3-14b-int4-asym-awq-ov \
    --model_repository_path [your_local_path] \
    --target_device [CPU/GPU] \
    --task text_generation \
    --tool_parser hermes3 \
    --reasoning_parser qwen3
```
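Once the server is running, the model can be queried through OVMS's OpenAI-compatible chat endpoint. A minimal client sketch, assuming the server listens on `localhost:8000` and serves chat completions under `/v3` (host, port, and path are assumptions; check your deployment):

```python
import json
import urllib.request

# Assumed endpoint; OVMS versions that support the text_generation task
# expose an OpenAI-compatible API, but verify the URL for your setup.
OVMS_URL = "http://localhost:8000/v3/chat/completions"
MODEL = "imperfect-follow/qwen3-14b-int4-asym-awq-ov"

def build_payload(prompt: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
    }

def chat(prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    req = urllib.request.Request(
        OVMS_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```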
Compression configuration:

```shell
optimum-cli export openvino \
    --model Qwen/Qwen3-14B \
    --task text-generation-with-past \
    --weight-format int4 \
    --trust-remote-code \
    --dataset imperfect \
    --num-samples 64 \
    --ratio 0.85 \
    --sensitivity-metric hessian_input_activation \
    --group-size 128 \
    --awq \
    --scale-estimation \
    --lora-correction \
    qwen3-14B-int4-asym-awq-ov
```
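As a rough illustration of what `--weight-format int4` with `--group-size 128` does per group of weights, here is a simplified sketch of asymmetric int4 quantization. This is not optimum-intel's actual implementation; AWQ, scale estimation, and LoRA correction all refine the scales and compensate errors beyond this naive rounding:

```python
def quantize_group_int4_asym(weights: list[float]) -> tuple[list[int], float, float]:
    """Naive asymmetric int4 quantization of one weight group
    (with --group-size 128, each group would hold 128 weights).
    Maps the group's float range onto the unsigned 4-bit range [0, 15]."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0          # guard against an all-equal group
    zero_point = round(-lo / scale)        # integer offset so lo maps near 0
    q = [max(0, min(15, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q: list[int], scale: float, zero_point: float) -> list[float]:
    """Recover approximate float weights from the int4 codes."""
    return [(v - zero_point) * scale for v in q]
```

The per-group scale and zero point are what make the scheme asymmetric: the int range is anchored to the group's own min/max rather than centered at zero.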
Use the recommended Qwen3 sampling parameters:

```
# --- thinking --- #
Temperature=0.6
TopP=0.95
TopK=20
MinP=0

# --- non-thinking --- #
Temperature=0.7
TopP=0.8
TopK=20
MinP=0
```
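When calling the model through an OpenAI-style API, the two parameter sets above can be wrapped in a small helper (a hypothetical convenience function, not part of any library) and merged into the request body:

```python
def qwen3_sampling(thinking: bool) -> dict:
    """Return the recommended Qwen3 sampling parameters for the
    chosen mode, keyed by the usual OpenAI-style field names."""
    if thinking:
        return {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0}
    return {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0}
```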
The following modifications were made to optimum-intel to enable use of the targeted custom dataset.

Tell utils.py that "imperfect" is a valid option.

optimum/intel/openvino/utils.py (153):

```python
PREDEFINED_CAUSAL_LANGUAGE_DATASETS = {"wikitext2", "c4", "c4-new", "auto", "gsm8k", "imperfect"}
```
Give quantization.py the ability to use the dataset.

optimum/intel/openvino/quantization.py (688):

```python
elif config.dataset == "imperfect":
    seq_len = seq_len or 4096
    dataset = self.load_dataset(
        "imperfect-follow/qwen3-compress-256",
        dataset_split="train",
        num_samples=config.num_samples,
    )
    calibration_dataset = []
    for i, _text in enumerate(dataset["text"], 1):
        # Tokenize once without truncation to check the full length
        full_tokenized = tokenizer(_text, return_tensors="pt")
        full_length = full_tokenized.input_ids.shape[1]  # index 1 is the sequence dimension
        if full_length > seq_len:
            logger.warning(
                f"Sample {i} truncated: length {full_length} exceeds "
                f"maximum {seq_len} by {full_length - seq_len}"
            )
            # Slice the existing tensors instead of re-tokenizing;
            # this trims input_ids, attention_mask, etc. to seq_len
            item = {k: v[:, :seq_len] for k, v in full_tokenized.items()}
            calibration_dataset.append(item)
        else:
            calibration_dataset.append(full_tokenized)
```
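The truncation branch above can be exercised in isolation with plain nested lists standing in for the tokenizer's `[batch, seq]` tensors. This toy stand-in slices the sequence dimension the same way `v[:, :seq_len]` does on a tensor:

```python
def truncate_sample(sample: dict, seq_len: int) -> dict:
    """Trim every field of a tokenized sample (dicts of [batch, seq]
    nested lists here, tensors in the real code) to seq_len tokens."""
    full_length = len(sample["input_ids"][0])
    if full_length <= seq_len:
        return sample
    # Equivalent of {k: v[:, :seq_len] for k, v in sample.items()}
    return {k: [row[:seq_len] for row in v] for k, v in sample.items()}

# A fake 10-token sample with batch size 1
sample = {
    "input_ids": [list(range(10))],
    "attention_mask": [[1] * 10],
}
out = truncate_sample(sample, 4)
```

Slicing every field in lockstep matters: if `input_ids` were trimmed but `attention_mask` were not, their lengths would disagree and the calibration forward pass would fail.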