Read our How to Run DiffusionGemma Guide!
See Unsloth Dynamic 2.0 GGUFs for our quantization benchmarks.
DiffusionGemma
Hugging Face | GitHub | Launch Blog | Documentation
License: Apache 2.0 | Authors: Google DeepMind
Run with llama.cpp (GGUF)
These GGUFs require the DiffusionGemma build of llama.cpp (PR ggml-org/llama.cpp#24423).
Because DiffusionGemma is a block-diffusion architecture, it needs that branch plus the dedicated
llama-diffusion-cli runner; the standard llama-cli and llama-server cannot generate from it yet.
1. Build the branch
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# check out the DiffusionGemma PR (#24423)
gh pr checkout 24423
# or, without the gh CLI:
# git fetch origin pull/24423/head:diffusiongemma && git checkout diffusiongemma
# build with CUDA (drop -DGGML_CUDA=ON for a CPU-only build)
cmake -B build -DGGML_CUDA=ON
cmake --build build -j --config Release --target llama-diffusion-cli
2. Download a GGUF
pip install -U "huggingface_hub[cli]"
hf download unsloth/diffusiongemma-26B-A4B-it-GGUF \
--local-dir unsloth/diffusiongemma-26B-A4B-it-GGUF \
--include "*Q8_0*" # Use "*Q4_K_M*" for a smaller 16 GB download
| Quant | Size | Notes |
|---|---|---|
| BF16 | 47 GB | full precision reference |
| Q8_0 | 25 GB | near-lossless, recommended |
| Q6_K | 21 GB | |
| Q5_K_M | 18 GB | |
| Q4_K_M | 16 GB | smallest, fits a single 24 GB GPU |
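As a rough aid for choosing among the files above, the table can be encoded as a small lookup. This is only a sketch: `pick_quant` and `include_pattern` are hypothetical helper names, and the 4 GB headroom default is an arbitrary placeholder for KV cache and activations, not a measured requirement.

```python
# File sizes in GB, copied from the table above.
QUANT_SIZES_GB = {
    "BF16": 47,
    "Q8_0": 25,
    "Q6_K": 21,
    "Q5_K_M": 18,
    "Q4_K_M": 16,
}


def pick_quant(memory_gb, headroom_gb=4.0):
    """Return the largest quant whose file fits in memory_gb minus headroom_gb, or None.

    headroom_gb is an assumed allowance for KV cache and activations.
    """
    budget = memory_gb - headroom_gb
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None


def include_pattern(quant_name):
    """Glob pattern in the style used by the --include option of the download command."""
    return f"*{quant_name}*"


print(pick_quant(24), include_pattern("Q4_K_M"))
```

Real memory use also grows with context length, so the headroom should be adjusted to the workload rather than taken from this sketch.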
3. Run it (chat)
-n / --n-predict is the only length knob you need: it auto-derives the diffusion block count and
grows the batch/context to fit. On a single GPU the prompt prefix KV cache turns on automatically.
./build/bin/llama-diffusion-cli \
-m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
-ngl 99 -cnv -n 2048
4. Watch the canvas denoise live (--diffusion-visual)
Add --diffusion-visual to see each 256-token canvas being denoised in place (flicker-free, keeps
your scrollback):
./build/bin/llama-diffusion-cli \
-m unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf \
-ngl 99 -cnv -n 2048 --diffusion-visual
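Since each canvas holds 256 tokens, the relationship between -n and the number of diffusion blocks can be pictured as a ceiling division. This is an assumption about how the runner rounds, not a description of its actual code, and `blocks_for` is a hypothetical name.

```python
import math

CANVAS_LENGTH = 256  # tokens per canvas, as stated above


def blocks_for(n_predict, canvas_length=CANVAS_LENGTH):
    """Number of canvases needed to cover n_predict tokens (assumed ceiling division)."""
    if n_predict <= 0:
        raise ValueError("n_predict must be positive")
    return math.ceil(n_predict / canvas_length)


for n in (256, 300, 2048):
    print(n, blocks_for(n))
```

Under this assumption, -n 2048 corresponds to 8 canvases.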
Flags
- -n, --n-predict N - target tokens; derives --diffusion-blocks and grows -ub / -b / -c.
- -ngl 99 - offload all layers to the GPU (-ngl 0 for CPU-only).
- -cnv - multi-turn conversation mode.
- --diffusion-visual - live canvas denoising view.
- The Entropy-Bound sampler is on by default (--diffusion-eb auto). Tune it with --diffusion-eb-max-steps (default 48), --diffusion-eb-t-max / --diffusion-eb-t-min (0.8 -> 0.4), --diffusion-eb-entropy-bound (0.1), and --diffusion-eb-confidence (0.005).
- --diffusion-kv-cache {auto,on,off} - prompt prefix KV cache (auto = on for single GPU).
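For scripted runs, the flags above can be assembled into an argument list. The sketch below uses only flags named in this section; the function name `diffusion_cli_args` and its defaults are illustrative, and the binary path assumes the build layout from step 1.

```python
import shlex


def diffusion_cli_args(model_path, n_predict=2048, gpu_layers=99,
                       conversation=True, visual=False, eb_max_steps=None):
    """Assemble an argument list for llama-diffusion-cli from the flags listed above."""
    args = ["./build/bin/llama-diffusion-cli", "-m", model_path,
            "-ngl", str(gpu_layers), "-n", str(n_predict)]
    if conversation:
        args.append("-cnv")                 # multi-turn conversation mode
    if visual:
        args.append("--diffusion-visual")   # live canvas denoising view
    if eb_max_steps is not None:
        args += ["--diffusion-eb-max-steps", str(eb_max_steps)]
    return args


args = diffusion_cli_args(
    "unsloth/diffusiongemma-26B-A4B-it-GGUF/diffusiongemma-26B-A4B-it-Q8_0.gguf",
    visual=True,
)
print(shlex.join(args))  # pass `args` to subprocess.run(args) to launch the runner
```

Building a list rather than a single string avoids shell-quoting problems when the model path contains spaces.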
DiffusionGemma is a generative model built by Google DeepMind. Based on the 26B A4B Mixture-of-Experts (MoE) Gemma 4 architecture, DiffusionGemma generates tokens using discrete diffusion. This open-weights model is multimodal, handling text, image, and video inputs to generate text output.
Built on a MoE foundation, DiffusionGemma is designed to improve generation speed (tokens per second) while remaining deployable across various hardware environments. DiffusionGemma builds upon the architectural and capability advancements of Gemma 4, introducing several core features:
- Discrete Text Diffusion – Shifts from token-by-token autoregression to block-autoregressive multi-canvas sampling. It generates text by iteratively denoising blocks of tokens (a 'canvas') in parallel, significantly increasing decoding speed.
- Multimodal Input Processing – Processes interleaved text, image (with variable aspect ratio and resolution support), and video inputs to generate text outputs.
- Encoder-Decoder Architecture – Utilizes an autoregressive encoder to process and cache the prompt context, paired with a decoder that applies bidirectional attention over the generation canvas.
- Mixture-of-Experts (MoE) Efficiency – Leverages a sparse MoE design (8 active experts out of 128 total) to provide strong reasoning capabilities while maintaining a low memory footprint suitable for local execution.
- Thinking Mode (Reasoning) – Designed as a highly capable reasoner, with configurable thinking modes.
- Optimized for Small Batch Size Inference – Specifically engineered for low-latency, high-speed generation on a single capable accelerator.
- Native System Prompt Support – As with Gemma 4, it supports updating the system role, enabling more structured and controllable conversations.
Model Overview
DiffusionGemma is engineered to reduce the sequential bottlenecks of standard causal language models. It employs an encoder-decoder architecture specifically optimized for inference speed.
The encoder operates in a prefill capacity, processing the initial prompt and generating the KV cache. The decoder then utilizes bidirectional attention to process an input block (a 'canvas') of tokens, accessing the cached context via cross-attention.
During inference, DiffusionGemma leverages multi-canvas sampling. Rather than generating one token at a time, the model iteratively denoises a full block of tokens using a diffusion sampler. Once a canvas is fully denoised, it is processed by the encoder and appended to the KV cache, after which the model generates the next canvas. This block-autoregressive approach facilitates text generation at higher speeds.
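The control flow described above can be sketched as a toy loop. Everything here is a stand-in: `toy_denoise` replaces the real decoder and sampler with a random unmasking rule, the `MASK` id is hypothetical, and the "KV cache" is just a Python list of token ids. It illustrates only the order of operations (prefill, denoise a canvas, append, repeat), not the model's actual computation.

```python
import random

MASK = -1  # placeholder id for a not-yet-denoised canvas position (hypothetical)


def toy_denoise(context, canvas, rng):
    """Stand-in for one decoder pass: fill a random subset of masked positions."""
    masked = [i for i, tok in enumerate(canvas) if tok == MASK]
    k = max(1, len(masked) // 2)               # fill about half per pass, at least one
    for i in rng.sample(masked, k):
        canvas[i] = (len(context) + i) % 100   # fake token id derived from the position
    return canvas


def generate(prompt_ids, n_canvases, canvas_length=256, seed=0):
    """Block-autoregressive loop: denoise one canvas at a time, then cache it."""
    rng = random.Random(seed)
    cache = list(prompt_ids)                   # "prefill": the prompt enters the cache
    passes_per_canvas = []
    for _ in range(n_canvases):
        canvas = [MASK] * canvas_length        # start from a fully masked canvas
        passes = 0
        while MASK in canvas:                  # iterative, parallel denoising
            canvas = toy_denoise(cache, canvas, rng)
            passes += 1
        cache.extend(canvas)                   # the finished canvas joins the context
        passes_per_canvas.append(passes)
    return cache[len(prompt_ids):], passes_per_canvas


tokens, passes = generate([1, 2, 3], n_canvases=2)
print(len(tokens), passes)
```

Even in this toy, each canvas finishes in far fewer passes than it has tokens, which is the source of the speedup over token-by-token decoding.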
| Property | DiffusionGemma |
|---|---|
| Total Parameters | 25.2B |
| Active Parameters | 3.8B |
| Layers | 30 |
| Sliding Window | 1024 tokens |
| Context Length | Up to 256K tokens |
| Canvas Length | 256 |
| Vocabulary Size | 262K |
| Expert Count | 8 active / 128 total and 1 shared |
| Supported Modalities | Text, Image |
| Vision Encoder Parameters | ~550M |
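The sparsity implied by the table can be checked with two divisions. The snippet only restates the numbers above; the variable names are illustrative.

```python
total_params_b = 25.2    # total parameters, in billions
active_params_b = 3.8    # active parameters per token, in billions
active_experts, total_experts = 8, 128

param_fraction = active_params_b / total_params_b    # share of weights used per token
expert_fraction = active_experts / total_experts     # share of routed experts used per token

print(f"active parameters: {param_fraction:.1%}")
print(f"active routed experts: {expert_fraction:.2%}")
```

The parameter share comes out higher than the expert share, presumably because non-expert weights (such as attention layers and the shared expert) run for every token.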
Benchmark Results
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation. Evaluation results marked in the table are for instruction-tuned models, with the recommended Entropy Bound (EB) sampler (see Best Practices below).
| Benchmark | DiffusionGemma 26B A4B | Gemma 4 26B A4B |
|---|---|---|
| MMLU Pro | 77.6% | 82.6% |
| AIME 2026 no tools | 69.1% | 88.3% |
| LiveCodeBench v6 | 69.1% | 77.1% |
| Codeforces ELO | 1429 | 1718 |
| GPQA Diamond | 73.2% | 82.3% |
| Tau2 (average over 3) | 56.2% | 68.2% |
| HLE no tools | 11.0% | 8.7% |
| HLE with search | 11.9% | 17.2% |
| BigBench Extra Hard | 47.6% | 64.8% |
| MMMLU | 81.5% | 86.3% |
| Vision | ||
| MMMU Pro | 54.3% | 73.8% |
| OmniDocBench 1.5 (average edit distance, lower is better) | 0.319 | 0.149 |
| MATH-Vision | 70.5% | 82.4% |
| MedXPertQA MM | 49.0% | 58.1% |
| Long Context | ||
| MRCR v2 8 needle 128k (average) | 32.0% | 44.1% |
Core Capabilities
DiffusionGemma handles a broad range of tasks across text and vision. Key capabilities include:
- High-Speed Generation – Parallel denoising of a 256-token canvas via diffusion sampling generates 15-20 tokens per forward pass, enabling per-user generation speeds exceeding 1100 tokens per second at low batch sizes (H100, FP8).
- Adaptive Inference-Time Computation – Simpler prompts and structured tasks like code require fewer denoising steps, so tokens-per-second throughput varies with task complexity.
- Thinking – Built-in reasoning mode that lets the model think step-by-step before answering.
- Long Context – Context windows of up to 256K tokens.
- Image Understanding – Object detection, Document/PDF parsing, screen and UI understanding, chart comprehension, OCR (including multilingual), handwriting recognition, and pointing. Images can be processed at variable aspect ratios and resolutions.
- Video Understanding – Analyzes and describes video content by processing sequences of frames.
- Interleaved Multimodal Input – Mix images, video, and text within a single prompt for context-heavy reasoning.
- Function Calling – Native support for structured tool use, enabling agentic workflows.
- Coding & Reasoning – Capable of code generation, completion, and step-by-step logical reasoning.
- Multilingual – Out-of-the-box support for 35+ languages, pre-trained on 140+ languages.
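The throughput figures in the High-Speed Generation item can be related with simple arithmetic. The sketch below assumes every forward pass finalizes the same number of tokens and ignores prefill and per-canvas encoding, so the results are order-of-magnitude only. Both function names are illustrative.

```python
import math


def passes_per_canvas(tokens_per_pass, canvas_length=256):
    """Forward passes needed to finish one canvas at a constant tokens-per-pass rate."""
    return math.ceil(canvas_length / tokens_per_pass)


def implied_pass_latency_ms(tokens_per_second, tokens_per_pass):
    """Per-pass latency implied by a throughput figure, in milliseconds."""
    return 1000.0 * tokens_per_pass / tokens_per_second


for tokens_per_pass in (15, 20):
    print(
        tokens_per_pass,
        passes_per_canvas(tokens_per_pass),
        round(implied_pass_latency_ms(1100, tokens_per_pass), 1),
    )
```

Under these assumptions a canvas needs roughly 13 to 18 passes, well below the 48-step cap of the recommended sampler.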
Getting Started
You can use DiffusionGemma with the latest version of Transformers. To get started, install the necessary dependencies in your environment:
pip install -U transformers torch accelerate
Once you have everything installed, you can proceed to load the model with the code below:
from transformers import DiffusionGemmaForBlockDiffusion, AutoProcessor
MODEL_ID = "google/diffusiongemma-26B-A4B-it"
# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = DiffusionGemmaForBlockDiffusion.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto",
)
Once the model is loaded, you can start generating output:
# Prompt
messages = [
    {"role": "user", "content": "Why is the sky blue?"}
]
# Process input (returns a dict of tensors, moved to the model's device)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
# Parse output (special tokens are kept)
text = processor.decode(output[0], skip_special_tokens=False)
print(text)
Best Practices
For the best performance, use these configurations and best practices:
1. Diffusion Sampling Settings
Use the following standardized sampling configuration across all use cases:
- Method: Diffusion sampling with Entropy-Bounded Denoising and Adaptive Stopping.
- Sampling Configuration:
  - Maximum number of denoising steps = 48
  - Temperature schedule (for logit shaping): Linear decay from 0.8 → 0.4
  - Token Selection: At each step, the sampler selects the lowest-entropy tokens such that their mutual information bound stays below entropy bound = 0.1
  - Token Renoising: The sampler fully renoises the non-selected tokens
  - Adaptive Stopping: Sampling terminates early if and only if both of the following conditions are met simultaneously:
    - Confident predictions: The average model entropy over the canvas is below the entropy threshold = 0.005
    - Stable predictions: The highest-probability token predictions remain identical across two consecutive denoising steps
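The settings above can be illustrated with a minimal pure-Python sketch. It is not the reference sampler. The card describes the selection rule only as keeping a "mutual information bound" below 0.1, so the sketch assumes one common formulation: keep the lowest-entropy positions while the sum of their entropies minus the largest one stays within the bound. Entropy in nats is also an assumption, and all function names are hypothetical.

```python
import math


def temperature(step, max_steps=48, t_max=0.8, t_min=0.4):
    """Linear decay from t_max at step 0 to t_min at the last step."""
    if max_steps <= 1:
        return t_min
    return t_max + (t_min - t_max) * step / (max_steps - 1)


def entropy(probs):
    """Shannon entropy (nats, assumed) of one position's token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_positions(entropies, bound=0.1):
    """Lowest-entropy positions whose (sum of entropies - max entropy) <= bound.

    Assumed formulation; at least one position is selected when any exist.
    """
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    selected, total = [], 0.0
    for i in order:
        # Positions arrive in ascending entropy, so position i would be the max
        # of the set; "sum minus max" is then the running total so far.
        if selected and total > bound:
            break
        selected.append(i)
        total += entropies[i]
    return selected


def should_stop(entropies, prev_top, cur_top, threshold=0.005):
    """Adaptive stopping: confident (low mean entropy) AND stable (same top tokens).

    Expects a non-empty list of per-position entropies.
    """
    mean_entropy = sum(entropies) / len(entropies)
    return mean_entropy < threshold and prev_top == cur_top


print(select_positions([0.5, 0.01, 0.03, 0.2, 0.05]))
```

Positions that are not selected would be renoised and predicted again on the next step, at the temperature given by the schedule.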
2. Thinking Mode Configuration
Similar to Gemma 4 models, we use standard system, assistant, and user roles. To properly manage the thinking process, use the following control tokens:
- Trigger Thinking: Thinking is enabled by including the <|think|> token at the start of the system prompt. To disable thinking, remove the token (note that an empty thinking channel might still be emitted).
- Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure: <|channel>thought\n[Internal reasoning]<channel|>.
- Disabled Thinking Behavior: If thinking is disabled, the model will still generate the tags but with an empty thought block: <|channel>thought\n<channel|>[Final answer].
Note that many libraries like transformers handle the complexities of the chat template for you.
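If you build prompts by hand, toggling the think token can be wrapped in a small helper. This is a sketch under two assumptions: messages are plain dicts with string `content`, and the system turn, when present, comes first. The name `with_thinking` is hypothetical, and a chat template may offer its own switch for this.

```python
THINK_TOKEN = "<|think|>"


def with_thinking(messages, enabled=True):
    """Return a copy of messages whose system prompt starts with THINK_TOKEN iff enabled."""
    messages = [dict(m) for m in messages]           # shallow copies; input is untouched
    if not messages or messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    system = messages[0]
    text = system["content"]
    if text.startswith(THINK_TOKEN):                 # normalize before re-adding
        text = text[len(THINK_TOKEN):]
    system["content"] = THINK_TOKEN + text if enabled else text
    if system["content"] == "":
        messages.pop(0)                              # skip an empty system turn
    return messages


chat = [{"role": "user", "content": "Why is the sky blue?"}]
print(with_thinking(chat, enabled=True))
```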
3. Multi-Turn Conversations
- No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.
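One way to follow this rule is to split the thought block from the raw output before storing the turn. The sketch below relies only on the `<|channel>thought\n...<channel|>` structure described in the thinking section; the helper names are hypothetical, and other special tokens in the raw output (for example end-of-turn markers) are not handled.

```python
import re

# Matches the thought block: <|channel>thought\n ... <channel|>
THOUGHT_RE = re.compile(r"<\|channel>thought\n(.*?)<channel\|>", re.DOTALL)


def split_thought(raw_output):
    """Return (thought, answer); thought is '' when the block is empty or absent."""
    match = THOUGHT_RE.search(raw_output)
    if match is None:
        return "", raw_output.strip()
    answer = raw_output[:match.start()] + raw_output[match.end():]
    return match.group(1).strip(), answer.strip()


def append_assistant_turn(history, raw_output):
    """Return a new history with only the final answer added, without the thought."""
    _, answer = split_thought(raw_output)
    return history + [{"role": "assistant", "content": answer}]


raw = "<|channel>thought\nThe user greets me.<channel|>Hello!"
history = append_assistant_turn([{"role": "user", "content": "Hi"}], raw)
print(history)
```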
4. Modality order
- For optimal performance with multimodal inputs, place image content before the text in your prompt.
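When message content is assembled programmatically, the ordering can be enforced with a small helper. The sketch assumes content is a list of parts carrying a `type` key, in the style of multimodal chat templates. That layout, the `url` key in the example, and the name `images_first` are assumptions, not something this card specifies.

```python
def images_first(content):
    """Return the parts with image parts moved ahead of the rest, order otherwise kept."""
    images = [part for part in content if part.get("type") == "image"]
    others = [part for part in content if part.get("type") != "image"]
    return images + others


content = [
    {"type": "text", "text": "What is shown here?"},
    {"type": "image", "url": "photo.jpg"},   # hypothetical image reference
]
print(images_first(content))
```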
5. Variable Image Resolution
Aside from variable aspect ratios, DiffusionGemma supports variable image resolution through a configurable visual token budget, which controls how many tokens are used to represent an image. A higher token budget preserves more visual detail at the cost of additional compute, while a lower budget enables faster inference for tasks that don't require fine-grained understanding.
- The supported token budgets are: 70, 140, 280, 560, and 1120.
- Use lower budgets for classification, captioning, or video understanding, where faster inference and processing many frames outweigh fine-grained detail.
- Use higher budgets for tasks like OCR, document parsing, or reading small text.
6. Video Length
DiffusionGemma supports image inputs and processes video as a sequence of frames. Video input is limited to 60 seconds, assuming frames are sampled at one frame per second.
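The cost of a visual prompt can be estimated from the token budget and the frame count. The sketch treats the budget as an upper bound per image, which is an assumption, and both function names are hypothetical.

```python
SUPPORTED_BUDGETS = (70, 140, 280, 560, 1120)  # token budgets listed above
MAX_VIDEO_SECONDS = 60                         # at one frame per second


def visual_tokens(num_images, budget):
    """Upper bound on visual tokens, assuming every image uses the full budget."""
    if budget not in SUPPORTED_BUDGETS:
        raise ValueError(f"unsupported budget {budget}; choose from {SUPPORTED_BUDGETS}")
    return num_images * budget


def video_tokens(seconds, budget):
    """Upper bound for a video sampled at one frame per second."""
    if seconds > MAX_VIDEO_SECONDS:
        raise ValueError(f"video is limited to {MAX_VIDEO_SECONDS} seconds")
    return visual_tokens(seconds, budget)


for budget in SUPPORTED_BUDGETS:
    print(budget, video_tokens(60, budget))
```

The spread between the smallest and largest budget is 16x, which is why lower budgets are suggested for video and other many-frame workloads.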
Model Data
Data used for model training and how the data was processed.
Training Dataset
Our pre-training dataset is a large-scale, diverse collection of data spanning a wide range of domains and modalities, including web documents, code, images, and audio, with a cutoff date of January 2025. Here are the key components:
- Web Documents: A diverse collection of web text ensures the model is exposed to a broad range of linguistic styles, topics, and vocabulary. The training dataset includes content in over 140 languages.
- Code: Exposing the model to code helps it to learn the syntax and patterns of programming languages, which improves its ability to generate code and understand code-related questions.
- Mathematics: Training on mathematical text helps the model learn logical reasoning, symbolic representation, and address mathematical queries.
- Images: A wide range of images enables the model to perform image analysis and visual data extraction tasks.
The combination of these diverse data sources is crucial for training a powerful multimodal model that can handle a wide variety of different tasks and data formats.
Data Preprocessing
Here are the key data cleaning and filtering methods applied to the training data:
- CSAM Filtering: Rigorous CSAM (Child Sexual Abuse Material) filtering was applied at multiple stages in the data preparation process to ensure the exclusion of harmful and illegal content.
- Sensitive Data Filtering: As part of making Gemma pre-trained models safe and reliable, automated techniques were used to filter out certain personal information and other sensitive data from training sets.
- Additional methods: Filtering based on content quality and safety in line with our policies.
Ethics and Safety
As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, DiffusionGemma undergoes the same rigorous safety evaluations as our proprietary Gemini models.
Evaluation Approach
DiffusionGemma was developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with Google’s AI principles, as well as safety policies, which aim to prevent our generative AI models from generating harmful content, including:
- Content related to child sexual abuse material and exploitation
- Dangerous content (e.g., promoting suicide, or instructing in activities that could cause real-world harm)
- Sexually explicit content
- Hate speech (e.g., dehumanizing members of protected groups)
- Harassment (e.g., encouraging violence against people)
Evaluation Results
For all areas of safety testing, we saw major improvements in all categories of content safety relative to previous generations of Gemma models. Overall, DiffusionGemma, like Gemma 4 models, significantly outperforms Gemma 3 and 3n models in improving safety, while keeping unjustified refusals low. All testing was intentionally conducted without safety filters to evaluate the model’s raw capabilities and baseline behaviors. For both text-to-text and image-to-text, and across all model sizes, the model produced minimal policy violations, and showed significant improvements over previous Gemma models.
Usage and Limitations
These models have certain limitations that users should be aware of.
Intended Usage
Multimodal models (capable of processing vision, language, and/or audio) have a wide range of applications across various industries and domains. The following list of potential uses is not comprehensive. The purpose of this list is to provide contextual information about the possible use-cases that the model creators considered as part of model training and development.
- Content Creation and Communication
  - Text Generation: Generates creative text formats such as poems, scripts, code, marketing copy, and email drafts.
  - Chatbots and Conversational AI: Powers conversational interfaces for customer service, virtual assistants, or interactive applications.
  - Text Summarization: Generates concise summaries of a text corpus, research papers, or reports.
  - Image Data Extraction: Extracts, interprets and summarizes visual data for text communications.
- Research and Education
  - Natural Language Processing (NLP) and VLM Research: Serves as a foundation for researchers to experiment with VLM and NLP techniques, develop algorithms, and contribute to the advancement of the field.
  - Language Learning Tools: Supports interactive language learning experiences, aiding in grammar correction or providing writing practice.
  - Knowledge Exploration: Assists researchers in exploring large bodies of text by generating summaries or answering questions about specific topics.
Limitations
- Training Data
  - The quality and diversity of the training data significantly influence the model's capabilities. Biases or gaps in the training data can lead to limitations in the model's responses.
  - The scope of the training dataset determines the subject areas the model can handle effectively.
- Context and Task Complexity
  - The model performs well on tasks that can be framed with clear prompts and instructions. Open-ended or highly complex tasks might be challenging.
  - The model's performance can be influenced by the amount of context provided (longer context generally leads to better outputs, up to a certain point).
- Language Ambiguity and Nuance
  - Natural language is inherently complex. The model might struggle to grasp subtle nuances, sarcasm, or figurative language.
- Factual Accuracy
  - The model generates responses based on information learned from its training datasets, but it is not a knowledge base. It may generate incorrect or outdated factual statements.
- Common Sense
  - The model relies on statistical patterns in language. It might lack the ability to apply common sense reasoning in certain situations.
Ethical Considerations and Risks
In creating an open, vision-language model, we have carefully considered the following:
- Bias and Fairness
  - VLMs trained on large-scale, real-world text and image data can reflect socio-cultural biases embedded in the training material. DiffusionGemma underwent careful scrutiny, input data pre-processing, and post-training evaluations as reported in this card to help mitigate the risk of these biases.
- Misinformation and Misuse
  - VLMs can be misused to generate text that is false, misleading, or harmful.
  - Guidelines for responsible use of the model are provided; see the Responsible Generative AI Toolkit.
- Transparency and Accountability
  - This model card summarizes details on the model’s architecture, capabilities, limitations, and evaluation processes.
  - A responsibly developed open model offers the opportunity to share innovation by making VLM technology accessible to developers and researchers across the AI ecosystem.
Risks identified and mitigations:
- Generation of harmful content: Mechanisms and guidelines for content safety are essential. Developers are encouraged to exercise caution and implement appropriate content safety safeguards based on their specific product policies and application use cases.
- Misuse for malicious purposes: Technical limitations and developer and end-user education can help mitigate against malicious applications of VLMs. Educational resources and reporting mechanisms for users to flag misuse are provided.
- Privacy violations: Models were trained on data filtered for removal of certain personal information and other sensitive data. Developers are encouraged to adhere to privacy regulations with privacy-preserving techniques.
- Perpetuation of biases: Continuous monitoring (using evaluation metrics and human review) and the exploration of de-biasing techniques are encouraged during model training, fine-tuning, and other use cases.
Benefits
At the time of release, this is a low-latency, high-performance open vision-language model that provides a compelling option for developers and those interested in researching diffusion language models. The model is designed from the ground up for responsible AI development compared to similarly sized models.
