Instructions to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16", dtype="auto")

llama-cpp-python

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16",
	filename="qwen3_omni_f16.gguf",
)

output = llm(
	"Once upon a time,",
	max_tokens=512,
	echo=True
)
print(output)
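
`print(output)` above prints the whole response dictionary. Below is a minimal sketch for pulling out just the generated text and the token count, assuming the OpenAI-style completion layout that llama-cpp-python returns (`choices[0]["text"]` plus a `usage` block). `summarize_completion` is a hypothetical helper name, and the sample dictionary is hand-written for illustration, not real model output.

```python
from typing import Any, Dict


def summarize_completion(output: Dict[str, Any]) -> Dict[str, Any]:
    """Extract the generated text and token count from a completion dict."""
    choice = output["choices"][0]
    usage = output.get("usage", {})
    return {
        "text": choice["text"],
        "finish_reason": choice.get("finish_reason"),
        "completion_tokens": usage.get("completion_tokens"),
    }


# Hand-written stand-in for the dict returned by `llm(...)`.
sample_output = {
    "choices": [{"text": "Once upon a time, ...", "finish_reason": "length"}],
    "usage": {"prompt_tokens": 5, "completion_tokens": 512, "total_tokens": 517},
}
summary = summarize_completion(sample_output)
print(summary["text"], summary["completion_tokens"])
```

Reading `completion_tokens` from `usage` gives the tokenizer's own count, which is more reliable than counting words in the text.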

Local Apps

llama.cpp

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16
# Run inference directly in the terminal:
llama-cli -hf vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16
# Run inference directly in the terminal:
llama-cli -hf vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16
# Run inference directly in the terminal:
./llama-cli -hf vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16

Use Docker

docker model run hf.co/vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16


vLLM

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
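
The curl call above can also be issued from Python. Below is a minimal sketch using only the standard library, assuming the server was started with `vllm serve` on the default port 8000 as shown. `build_completion_request` and `complete` are hypothetical helper names, not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16"


def build_completion_request(
    prompt: str,
    base_url: str = "http://localhost:8000",
    max_tokens: int = 512,
    temperature: float = 0.5,
) -> urllib.request.Request:
    """Build the same POST request as the curl command."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, **kwargs) -> str:
    """Send the request and return the text of the first completion."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `complete("Once upon a time,")` returns the generated text. Passing `base_url="http://localhost:30000"` targets a server listening on port 30000 instead.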

Use Docker

docker model run hf.co/vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16

SGLang

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Ollama

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with Ollama:

```
ollama run hf.co/vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16
```
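
Besides the interactive `ollama run` command, Ollama serves a REST API on port 11434, the same endpoint `example_usage.py` calls. A streamed `/api/generate` reply arrives as one JSON object per line, each with a `response` fragment, and ends with an object whose `done` field is true. Below is a minimal sketch of joining those fragments. `collect_stream` is a hypothetical helper, and the sample lines are hand-written, not real model output.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the `response` fragments of a streamed /api/generate reply."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip empty lines between chunks
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


sample_lines = [
    b'{"response": "Once", "done": false}',
    b'{"response": " upon a time", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(sample_lines))  # prints "Once upon a time"
```

Against a live server, the same function can be fed `response.iter_lines()` from a `requests.post(..., stream=True)` call whose JSON payload sets `"stream": True`.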

Unsloth Studio

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 to start chatting

Docker Model Runner

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with Docker Model Runner:

```
docker model run hf.co/vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16
```

Lemonade

How to use vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16:F16

Run and chat with the model

lemonade run user.Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16-F16

List all available models

lemonade list

Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 / example_usage.py

vito95311

Initial GGUF release: Qwen3-Omni quantized models with Ollama support

d4ef36e 9 months ago


	#!/usr/bin/env python3
	"""
	Qwen3-Omni GGUF usage examples

	This script shows how to run the GGUF build of Qwen3-Omni on a variety of
	tasks, either through the Ollama API or directly with llama-cpp-python.
	"""

	import time
	from pathlib import Path

	import requests

	try:
	    from llama_cpp import Llama
	    LLAMA_CPP_AVAILABLE = True
	except ImportError:
	    LLAMA_CPP_AVAILABLE = False
	    print("⚠️ llama-cpp-python not installed. Install with: pip install llama-cpp-python")


	class QwenGGUFRunner:
	    """Runner for Qwen models in GGUF format."""

	    def __init__(self, model_path: str = "qwen3_omni_quantized.gguf"):
	        self.model_path = model_path
	        self.llm = None

	    def load_with_llama_cpp(self, **kwargs):
	        """Load the model with llama-cpp-python."""
	        if not LLAMA_CPP_AVAILABLE:
	            raise ImportError("llama-cpp-python not available")

	        default_params = {
	            'n_gpu_layers': 35,  # layers offloaded to the GPU
	            'n_ctx': 4096,  # context length
	            'n_batch': 512,  # batch size
	            'verbose': False,  # quiet mode
	            'n_threads': 8,  # CPU threads
	        }
	        default_params.update(kwargs)

	        print(f"🚀 Loading GGUF model: {self.model_path}")
	        start_time = time.time()

	        self.llm = Llama(model_path=self.model_path, **default_params)

	        load_time = time.time() - start_time
	        print(f"✅ Model loaded in {load_time:.2f}s")
	        return self.llm

	    def generate_with_llama_cpp(self, prompt: str, **kwargs) -> str:
	        """Generate text with llama-cpp-python."""
	        if not self.llm:
	            raise ValueError("Model not loaded. Call load_with_llama_cpp() first.")

	        default_params = {
	            'max_tokens': 256,
	            'temperature': 0.7,
	            'top_p': 0.8,
	            'top_k': 50,
	            'repeat_penalty': 1.1,
	            'stop': ["</s>", "<|endoftext|>"]
	        }
	        default_params.update(kwargs)

	        print("💭 Generating response...")
	        start_time = time.time()

	        response = self.llm(prompt, **default_params)

	        gen_time = time.time() - start_time
	        # Rough estimate: counts whitespace-separated words, not model tokens
	        tokens = len(response['choices'][0]['text'].split())
	        speed = tokens / gen_time if gen_time > 0 else 0

	        print(f"⚡ Generated {tokens} tokens in {gen_time:.2f}s ({speed:.1f} tok/s)")

	        return response['choices'][0]['text']


	class OllamaAPI:
	    """Thin client for the Ollama API."""

	    def __init__(self, base_url: str = "http://localhost:11434"):
	        self.base_url = base_url
	        self.model_name = "qwen3-omni-quantized"

	    def check_connection(self) -> bool:
	        """Check whether the Ollama server is reachable."""
	        try:
	            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
	            return response.status_code == 200
	        except requests.RequestException:
	            return False

	    def is_model_available(self) -> bool:
	        """Check whether the model is registered in Ollama."""
	        try:
	            response = requests.get(f"{self.base_url}/api/tags", timeout=5)
	            models = response.json().get("models", [])
	        except (requests.RequestException, ValueError):
	            return False
	        # Ollama lists models with a tag (e.g. "name:latest"), so accept
	        # either the bare name or the name with the default tag.
	        wanted = (self.model_name, f"{self.model_name}:latest")
	        return any(model.get("name") in wanted for model in models)

	    def generate(self, prompt: str, **kwargs) -> str:
	        """Generate text through the Ollama API."""
	        if not self.check_connection():
	            raise ConnectionError("Cannot connect to Ollama API")

	        if not self.is_model_available():
	            raise ValueError(f"Model {self.model_name} not found in Ollama")

	        payload = {
	            "model": self.model_name,
	            "prompt": prompt,
	            "stream": False,
	            "options": {
	                "temperature": kwargs.get("temperature", 0.7),
	                "top_p": kwargs.get("top_p", 0.8),
	                "top_k": kwargs.get("top_k", 50),
	                "repeat_penalty": kwargs.get("repeat_penalty", 1.1),
	                "num_predict": kwargs.get("max_tokens", 256),
	            }
	        }

	        print("💭 Sending request to Ollama...")
	        start_time = time.time()

	        response = requests.post(
	            f"{self.base_url}/api/generate",
	            json=payload,
	            timeout=60
	        )

	        if response.status_code != 200:
	            raise RuntimeError(f"Ollama API error: {response.text}")

	        result = response.json()
	        gen_time = time.time() - start_time

	        # Estimate tokens and speed (whitespace-separated words, not model tokens)
	        output_text = result["response"]
	        tokens = len(output_text.split())
	        speed = tokens / gen_time if gen_time > 0 else 0

	        print(f"⚡ Generated {tokens} tokens in {gen_time:.2f}s ({speed:.1f} tok/s)")

	        return output_text


	def run_examples():
	    """Run the example prompts."""

	    examples = [
	        {
	            "name": "🌟 Creative writing",
	            "prompt": "Write a short story about AI and humans working together to explore the universe, with a science-fiction feel and some philosophical reflection.",
	            "params": {"temperature": 0.8, "max_tokens": 400}
	        },
	        {
	            "name": "💻 Code generation",
	            "prompt": "Write a quicksort implementation in Python, with detailed comments and a time-complexity analysis.",
	            "params": {"temperature": 0.3, "max_tokens": 500}
	        },
	        {
	            "name": "🧮 Mathematical reasoning",
	            "prompt": "A circle has a radius of 5 cm. Calculate its area and circumference, and explain the calculation.",
	            "params": {"temperature": 0.2, "max_tokens": 300}
	        },
	        {
	            "name": "🌐 Multilingual translation",
	            "prompt": "Please translate this English text to Chinese: 'Artificial Intelligence is revolutionizing the way we interact with technology, making it more intuitive and human-friendly.'",
	            "params": {"temperature": 0.3, "max_tokens": 200}
	        },
	        {
	            "name": "🤔 Logical reasoning",
	            "prompt": "If every A is a B, every B is a C, and some X is an A, what is X? Explain the reasoning.",
	            "params": {"temperature": 0.1, "max_tokens": 250}
	        }
	    ]

	    # Check whether Ollama is usable
	    ollama = OllamaAPI()
	    ollama_available = ollama.check_connection() and ollama.is_model_available()

	    # Check whether the GGUF file is usable
	    gguf_available = LLAMA_CPP_AVAILABLE and Path("qwen3_omni_quantized.gguf").exists()

	    print("=" * 80)
	    print("🔥 Qwen3-Omni GGUF usage examples")
	    print("=" * 80)
	    print(f"💾 Ollama API available: {'✅' if ollama_available else '❌'}")
	    print(f"📁 GGUF file available: {'✅' if gguf_available else '❌'}")
	    print()

	    # If neither backend is usable, print a setup guide
	    if not ollama_available and not gguf_available:
	        print("⚠️ Set up Ollama or download the GGUF file first:")
	        print()
	        print("🚀 Ollama setup:")
	        print(" 1. ollama create qwen3-omni-quantized -f Qwen3OmniQuantized.modelfile")
	        print(" 2. ollama serve")
	        print()
	        print("📁 GGUF file download:")
	        print(" huggingface-cli download vito95311/Qwen3-Omni-30B-A3B-Thinking-GGUF-INT8FP16 qwen3_omni_quantized.gguf")
	        return

	    # Prefer Ollama because it is simpler
	    if ollama_available:
	        print("🎯 Running inference through the Ollama API")
	        runner_type = "ollama"
	        api = ollama
	    else:
	        print("🎯 Running inference with llama-cpp-python")
	        runner_type = "llama_cpp"
	        runner = QwenGGUFRunner()
	        runner.load_with_llama_cpp()

	    print("=" * 80)

	    # Run the examples
	    for i, example in enumerate(examples, 1):
	        print(f"\n📝 Example {i}: {example['name']}")
	        print(f"💬 Prompt: {example['prompt'][:100]}...")
	        print("-" * 40)

	        try:
	            if runner_type == "ollama":
	                response = api.generate(example['prompt'], **example['params'])
	            else:
	                response = runner.generate_with_llama_cpp(example['prompt'], **example['params'])

	            print(f"🤖 Response: {response.strip()}")

	        except Exception as e:
	            print(f"❌ Error: {str(e)}")

	        print("-" * 40)

	        # Pause briefly to avoid overloading the backend
	        time.sleep(1)


	def benchmark_performance():
	    """Run a simple performance benchmark."""

	    print("\n🏆 Performance benchmark")
	    print("=" * 50)

	    test_prompts = [
	        "Explain what machine learning is",
	        "Write a Python function that computes the Fibonacci sequence",
	        "Describe the basic principles of quantum computing",
	        "What are the benefits of renewable energy?",
	        "How can the performance of a deep learning model be optimized?"
	    ]

	    ollama = OllamaAPI()

	    if ollama.check_connection() and ollama.is_model_available():
	        print("📊 Testing Ollama API performance...")

	        total_time = 0
	        total_tokens = 0

	        for i, prompt in enumerate(test_prompts, 1):
	            print(f" Test {i}/{len(test_prompts)}: ", end="", flush=True)

	            start_time = time.time()
	            response = ollama.generate(prompt, max_tokens=100, temperature=0.7)
	            end_time = time.time()

	            test_time = end_time - start_time
	            # Whitespace-separated words, used as a rough token estimate
	            tokens = len(response.split())
	            speed = tokens / test_time if test_time > 0 else 0

	            total_time += test_time
	            total_tokens += tokens

	            print(f"{speed:.1f} tok/s")

	        avg_speed = total_tokens / total_time if total_time > 0 else 0
	        print(f"\n📈 Average speed: {avg_speed:.1f} tokens/s")
	        print(f"⏱️ Total time: {total_time:.2f}s")
	        print(f"📝 Total tokens: {total_tokens}")

	    else:
	        print("⚠️ Ollama is not available; skipping the benchmark")


	def main():
	    """Entry point."""
	    print("🔥 Qwen3-Omni GGUF usage examples")
	    print("This script shows how to use the GGUF model for a variety of AI tasks")

	    # Run the usage examples
	    run_examples()

	    # Optional benchmark
	    user_input = input("\n🤔 Run the performance benchmark? (y/n): ")
	    if user_input.lower() in ['y', 'yes']:
	        benchmark_performance()

	    print("\n✨ Examples finished!")
	    print("💡 See README.md for more usage options")


	if __name__ == "__main__":
	    main()
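
The script estimates token counts by splitting the reply on whitespace. That does not match the model's tokenizer, and it undercounts text with few spaces, such as Chinese. A non-streaming `/api/generate` reply from Ollama also carries `eval_count` (generated tokens) and `eval_duration` (nanoseconds), so a tighter figure can be computed from those fields. Below is a minimal sketch. `tokens_per_second` is a hypothetical helper and the sample values are invented.

```python
from typing import Any, Dict, Optional


def tokens_per_second(result: Dict[str, Any]) -> Optional[float]:
    """Decode speed from Ollama's eval_count / eval_duration (ns), if present."""
    count = result.get("eval_count")
    duration_ns = result.get("eval_duration")
    if not count or not duration_ns:
        return None  # fields missing: fall back to the wall-clock estimate
    return count / (duration_ns / 1e9)


# Invented sample: 120 tokens generated in 4 seconds.
sample_result = {"response": "...", "eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(sample_result):.1f} tok/s")  # prints "30.0 tok/s"
```

Inside `OllamaAPI.generate`, the parsed `result` dictionary could be passed to this helper. Note that the figure covers decoding only, not model load or prompt processing time.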