---
license: apache-2.0
datasets:
  - imperfect-follow/qwen3-compress-256
language:
  - en
base_model:
  - Qwen/Qwen3-14B
tags:
  - openvino
---

Data-aware int4 compression of Qwen3-14B, using a custom dataset targeted at protecting the activations that are most important to the functioning of the Qwen3 models.

Pull with OpenVINO Model Server (`ovms`):

```shell
ovms --pull --source_model imperfect-follow/qwen3-14b-int4-asym-awq-ov \
  --model_repository_path [your_local_path] \
  --target_device [CPU/GPU] \
  --task text_generation \
  --tool_parser hermes3 \
  --reasoning_parser qwen3
```
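Once the server is up, the model can be queried over its OpenAI-style chat-completions API. A minimal Python sketch; the base URL and the `/v3/chat/completions` path are assumptions, so adjust them to match your deployment:

```python
# Sketch: build a request against an OpenAI-style chat-completions
# endpoint. The port and path below are assumptions about the ovms
# deployment, not guarantees.
import json
from urllib import request

MODEL_ID = "imperfect-follow/qwen3-14b-int4-asym-awq-ov"

def build_chat_request(prompt: str, base_url: str = "http://localhost:8000/v3"):
    """Build a POST request for the chat/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With a running server:
# with request.urlopen(build_chat_request("Hello")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```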

Compression configuration:

```shell
optimum-cli export openvino \
  --model Qwen/Qwen3-14B \
  --task text-generation-with-past \
  --weight-format int4 \
  --trust-remote-code \
  --dataset imperfect \
  --num-samples 64 \
  --ratio 0.85 \
  --sensitivity-metric hessian_input_activation \
  --group-size 128 \
  --awq \
  --scale-estimation \
  --lora-correction \
  qwen3-14B-int4-asym-awq-ov
```

Use the recommended Qwen3 sampling parameters.

```
# --- thinking --- #
Temperature=0.6
TopP=0.95
TopK=20
MinP=0

# --- non-thinking --- #
Temperature=0.7
TopP=0.8
TopK=20
MinP=0
```
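The two parameter sets above can be wrapped in a small helper for request-building code. A sketch; the dict keys follow common OpenAI-style field names (an assumption), and whether a given server honors `top_k` and `min_p` depends on the deployment:

```python
# Recommended Qwen3 sampling parameters from the table above, keyed by
# reasoning mode. The field names are OpenAI-style assumptions.
QWEN3_SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non-thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

def sampling_params(thinking: bool) -> dict:
    """Return a fresh copy of the recommended parameters for the given mode."""
    return dict(QWEN3_SAMPLING["thinking" if thinking else "non-thinking"])
```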

The following modifications were made to `optimum-intel` to enable use of the targeted custom dataset.

Tell `utils.py` that `"imperfect"` is a valid option.

`optimum/intel/openvino/utils.py` (line 153):

```python
PREDEFINED_CAUSAL_LANGUAGE_DATASETS = {"wikitext2", "c4", "c4-new", "auto", "gsm8k", "imperfect"}
```

Give `quantization.py` the ability to use the dataset.

`optimum/intel/openvino/quantization.py` (line 688):

```python
elif config.dataset == "imperfect":
    seq_len = seq_len or 4096
    dataset = self.load_dataset(
        "imperfect-follow/qwen3-compress-256",
        dataset_split="train",
        num_samples=config.num_samples,
    )
    calibration_dataset = []

    for i, _text in enumerate(dataset["text"], 1):
        # Tokenize once without truncation to check the full length
        full_tokenized = tokenizer(_text, return_tensors="pt")
        full_length = full_tokenized.input_ids.shape[1]  # dim 1 is sequence length

        if full_length > seq_len:
            logger.warning(
                f"Sample {i} truncated: length {full_length} exceeds "
                f"maximum {seq_len} by {full_length - seq_len}"
            )
            # Slice the existing tensors instead of re-tokenizing;
            # this trims input_ids, attention_mask, etc. to seq_len
            item = {k: v[:, :seq_len] for k, v in full_tokenized.items()}
            calibration_dataset.append(item)
        else:
            calibration_dataset.append(full_tokenized)
```
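The truncation branch above can be exercised in isolation. A small sketch using NumPy arrays as stand-ins for the tokenizer's `[1, seq_len]` tensors; the helper name is illustrative, not part of the patch:

```python
import numpy as np

def truncate_sample(tokenized: dict, seq_len: int) -> dict:
    """Mirror the truncation logic above: slice every field (input_ids,
    attention_mask, ...) down to at most seq_len tokens."""
    full_length = tokenized["input_ids"].shape[1]
    if full_length <= seq_len:
        return tokenized
    return {k: v[:, :seq_len] for k, v in tokenized.items()}

sample = {
    "input_ids": np.arange(10).reshape(1, 10),
    "attention_mask": np.ones((1, 10), dtype=np.int64),
}
short = truncate_sample(sample, seq_len=4)   # sliced to 4 tokens
kept = truncate_sample(sample, seq_len=16)   # returned unchanged
```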