Instructions to use AbarthJoe/Qwopus3.6-27B-v2-oQ4-fp16-mtp with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use AbarthJoe/Qwopus3.6-27B-v2-oQ4-fp16-mtp with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir Qwopus3.6-27B-v2-oQ4-fp16-mtp AbarthJoe/Qwopus3.6-27B-v2-oQ4-fp16-mtp
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
Qwopus3.6-27B-v2-oQ4-fp16-mtp
This repository contains an oMLX/oQ 4-bit quantized MLX version of Jackrong/Qwopus3.6-27B-v2.
This variant uses float16 / fp16 for non-quantized weights and preserves MTP weights.
Model lineage
- Original model:
Jackrong/Qwopus3.6-27B-v2 - Quantized model:
AbarthJoe/Qwopus3.6-27B-v2-oQ4-fp16-mtp - Quantization format: MLX / oMLX oQ
- Relationship: Quantized derivative of the original model
Quantization details
- Quantization tool: oMLX / oQ
- Quantization level: oQ4
- Preserve MTP weights: Yes
- Non-quant weight dtype: float16 / fp16
- Output format: MLX
- Target platform: Apple Silicon
About this fp16 variant
This version keeps non-quantized weights in float16 instead of the default bfloat16 path.
The goal is to test whether fp16 improves prefill / prompt processing speed on Apple Silicon while keeping oQ4 generation speed high.
Compared with the default oQ4 MTP version:
- May provide faster prefill in some environments
- May be less numerically stable than bfloat16
- Should be tested carefully with long-context prompts
- May behave differently depending on Apple Silicon generation and MLX/oMLX version
If this fp16 variant shows unstable output, repeated text, degraded reasoning, or unusual long-context behavior, use the default oQ4-mtp version instead.
Expected use case
This is an experimental speed-focused fp16 variant.
It is mainly useful for users who want to benchmark:
- oQ4 generation speed
- fp16 non-quant weight behavior
- MTP-preserved inference
- Apple Silicon local model performance
Benchmark
Tested on MacBook Pro M3 Max 40-core GPU.
| Model | Context | Prompt processing | Token generation |
|---|---|---|---|
| oQ4 fp16 + MTP | 1k | 218.2 tok/s | 19.7 tok/s |
| oQ4 fp16 + MTP | 4k | 210.9 tok/s | 20.9 tok/s |
Benchmark results may vary depending on hardware, software version, prompt type, context length, and runtime settings.
Related models
Other oMLX/oQ quantized versions are available in this collection:
Qwopus oMLX oQ Quantized Models for Apple Silicon
Credits
Original model by Jackrong.
This quantized MLX/oQ fp16 variant was created and uploaded by AbarthJoe.
License
The original model is licensed under Apache-2.0.
This quantized version follows the same Apache-2.0 license where applicable.
Disclaimer
This is a community quantized model for research, experimentation, and local inference testing.
It has not been fully safety-evaluated or benchmarked across all tasks.
Please test carefully before using it for production, sensitive, or high-stakes use cases.
- Downloads last month
- 537
4-bit
Model tree for AbarthJoe/Qwopus3.6-27B-v2-oQ4-fp16-mtp
Base model
Jackrong/Qwopus3.6-27B-v2