---
base_model:
- openai/gpt-oss-20b
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- openai
- unsloth
---
# Read our [How to Run gpt-oss Guide here!](https://docs.unsloth.ai/models/gpt-oss-how-to-run-and-fine-tune)

<div>
<p style="margin-bottom: 0; margin-top: 0;">
<strong>See <a href="https://huggingface.co/collections/unsloth/gpt-oss-6892433695ce0dee42f31681">our collection</a> for all versions of gpt-oss including GGUF, 4-bit & 16-bit formats.</strong>
</p>
<p style="margin-bottom: 0;">
<em>Learn to run gpt-oss correctly - <a href="https://docs.unsloth.ai/basics/gpt-oss">Read our Guide</a>.</em>
</p>
<p style="margin-top: 0;margin-bottom: 0;">
<em>See <a href="https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-gguf">Unsloth Dynamic 2.0 GGUFs</a> for our quantization benchmarks.</em>
</p>
<div style="display: flex; gap: 5px; align-items: center; ">
<a href="https://github.com/unslothai/unsloth/">
<img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="133">
</a>
<a href="https://discord.gg/unsloth">
<img src="https://github.com/unslothai/unsloth/raw/main/images/Discord%20button.png" width="173">
</a>
<a href="https://docs.unsloth.ai/basics/gpt-oss">
<img src="https://raw.githubusercontent.com/unslothai/unsloth/refs/heads/main/images/documentation%20green%20button.png" width="143">
</a>
</div>
<h1 style="margin-top: 0rem;">✨ Read our gpt-oss Guide <a href="https://docs.unsloth.ai/basics/gpt-oss">here</a>!</h1>
</div>

- Fine-tune gpt-oss-20b for free using our [Google Colab notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/gpt-oss-(20B)-Fine-tuning.ipynb)
- Read our blog post about gpt-oss support: [unsloth.ai/blog/gpt-oss](https://unsloth.ai/blog/gpt-oss)
- View the rest of our notebooks in our [docs here](https://docs.unsloth.ai/get-started/unsloth-notebooks).
- Thank you to the [llama.cpp](https://github.com/ggml-org/llama.cpp) team for their work on supporting this model. We wouldn't be able to release quants without them!

The F32 quant is MXFP4 upcast to BF16 for every single layer and is unquantized.

# gpt-oss-20b Details

<p align="center">
<img alt="gpt-oss-20b" src="https://raw.githubusercontent.com/openai/gpt-oss/main/docs/gpt-oss-20b.svg">
</p>
<p align="center">
<a href="https://gpt-oss.com"><strong>Try gpt-oss</strong></a> ·
<a href="https://cookbook.openai.com/topic/gpt-oss"><strong>Guides</strong></a> ·
<a href="https://openai.com/index/gpt-oss-model-card"><strong>System card</strong></a> ·
<a href="https://openai.com/index/introducing-gpt-oss/"><strong>OpenAI blog</strong></a>
</p>
<br>

Welcome to the gpt-oss series, [OpenAI’s open-weight models](https://openai.com/open-models) designed for powerful reasoning, agentic tasks, and versatile developer use cases.

We’re releasing two flavors of the open models:

- `gpt-oss-120b`: for production, general-purpose, high-reasoning use cases; it fits on a single H100 GPU (117B parameters with 5.1B active parameters)
- `gpt-oss-20b`: for lower latency and for local or specialized use cases (21B parameters with 3.6B active parameters)

Both models were trained on our [harmony response format](https://github.com/openai/harmony) and should only be used with the harmony format, as they will not work correctly otherwise.

> [!NOTE]
> This model card is dedicated to the smaller `gpt-oss-20b` model. Check out [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) for the larger model.

# Highlights

* **Permissive Apache 2.0 license:** Build freely without copyleft restrictions or patent risk, which makes the models well suited to experimentation, customization, and commercial deployment.
* **Configurable reasoning effort:** Easily adjust the reasoning effort (low, medium, high) based on your specific use case and latency needs.
* **Full chain-of-thought:** Gain complete access to the model’s reasoning process, facilitating easier debugging and increased trust in outputs. The chain-of-thought is not intended to be shown to end users.
* **Fine-tunable:** Fully customize models to your specific use case through parameter fine-tuning.
* **Agentic capabilities:** Use the models’ native capabilities for function calling, [web browsing](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#browser), [Python code execution](https://github.com/openai/gpt-oss/tree/main?tab=readme-ov-file#python), and Structured Outputs.
* **Native MXFP4 quantization:** The models are trained with native MXFP4 precision for the MoE layer, allowing `gpt-oss-120b` to run on a single H100 GPU and `gpt-oss-20b` to run within 16GB of memory.
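
To see why MXFP4 matters for the 16GB figure, here is a rough back-of-envelope estimate of weight memory. It is a sketch under loud assumptions: it treats all 21B parameters as MXFP4 (the card says only the MoE layer is), uses the MXFP4 layout from the OCP Microscaling spec (4-bit elements with one shared 8-bit scale per block of 32), and ignores the KV cache and activations. Treat the MXFP4 number as a lower bound on weight memory, not a measurement.

```python
# Back-of-envelope weight-memory estimate; all names here are illustrative.
MXFP4_BLOCK = 32   # elements sharing one scale (OCP Microscaling layout)
ELEMENT_BITS = 4   # one FP4 element
SCALE_BITS = 8     # one shared scale per block


def bits_per_param_mxfp4() -> float:
    """Effective bits per parameter once the shared scale is amortized."""
    return ELEMENT_BITS + SCALE_BITS / MXFP4_BLOCK


def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


total_params = 21e9  # parameter count quoted for gpt-oss-20b

mxfp4_gb = weight_gb(total_params, bits_per_param_mxfp4())
bf16_gb = weight_gb(total_params, 16)

print(f"MXFP4 (if every weight were quantized): {mxfp4_gb:.1f} GB")  # 11.2 GB
print(f"BF16 (unquantized): {bf16_gb:.1f} GB")                        # 42.0 GB
```

Because the non-MoE layers are kept at higher precision, the real footprint sits above the MXFP4 estimate, which is consistent with the 16GB claim but does not prove it.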

---

# Inference examples

## Transformers

You can use `gpt-oss-120b` and `gpt-oss-20b` with Transformers. If you use the Transformers chat template, it will automatically apply the [harmony response format](https://github.com/openai/harmony). If you use `model.generate` directly, you need to apply the harmony format manually using the chat template or use our [openai-harmony](https://github.com/openai/harmony) package.

To get started, install the necessary dependencies to set up your environment:

```
pip install -U transformers kernels torch
```

Once set up, you can run the model with the snippet below:

```py
from transformers import pipeline
import torch

model_id = "openai/gpt-oss-20b"

pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = pipe(
    messages,
    max_new_tokens=256,
)
# The last entry of the returned conversation is the assistant's reply.
print(outputs[0]["generated_text"][-1])
```

Alternatively, you can run the model via [`Transformers Serve`](https://huggingface.co/docs/transformers/main/serving) to spin up an OpenAI-compatible webserver:

```
transformers serve
transformers chat localhost:8000 --model-name-or-path openai/gpt-oss-20b
```

[Learn more about how to use gpt-oss with Transformers.](https://cookbook.openai.com/articles/gpt-oss/run-transformers)
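
For the `model.generate` path mentioned above, a minimal sketch might look like the following. The helper name `generate_reply` is hypothetical; it assumes the tokenizer ships a chat template that renders the harmony format, and it takes the model and tokenizer as arguments so the loading step stays separate.

```python
def generate_reply(model, tokenizer, messages, max_new_tokens=256):
    """Apply the chat template, generate, and decode only the new tokens.

    `model` and `tokenizer` are expected to be Transformers objects, e.g.:

        from transformers import AutoModelForCausalLM, AutoTokenizer
        tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")
        model = AutoModelForCausalLM.from_pretrained(
            "openai/gpt-oss-20b", torch_dtype="auto", device_map="auto"
        )
    """
    # The chat template turns the message list into model-ready token IDs.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the generated continuation is decoded.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(outputs[0][prompt_length:])
```

Calling `generate_reply(model, tokenizer, [{"role": "user", "content": "Hello"}])` returns the raw decoded continuation, which may include the model's reasoning channel as well as the final answer.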

## vLLM

vLLM recommends using [uv](https://docs.astral.sh/uv/) for Python dependency management. You can use vLLM to spin up an OpenAI-compatible webserver. The following command will automatically download the model and start the server.

```bash
uv pip install --pre vllm==0.10.1+gptoss \
    --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
    --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
    --index-strategy unsafe-best-match

vllm serve openai/gpt-oss-20b
```

[Learn more about how to use gpt-oss with vLLM.](https://cookbook.openai.com/articles/gpt-oss/run-vllm)
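
Once the server is running, any OpenAI-compatible client can talk to it. The sketch below uses only the Python standard library; the helper names `build_chat_request` and `chat` are hypothetical, and the base URL assumes vLLM's default host and port (`localhost:8000`), so adjust it if you started the server differently.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM's default host and port


def build_chat_request(prompt: str, model: str = "openai/gpt-oss-20b") -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the assistant's text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` sends one user message and prints the reply.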

## PyTorch / Triton

To learn about how to use this model with PyTorch and Triton, check out our [reference implementations in the gpt-oss repository](https://github.com/openai/gpt-oss?tab=readme-ov-file#reference-pytorch-implementation).

## Ollama

If you are trying to run gpt-oss on consumer hardware, you can use Ollama by running the following commands after [installing Ollama](https://ollama.com/download).

```bash
# gpt-oss-20b
ollama pull gpt-oss:20b
ollama run gpt-oss:20b
```

[Learn more about how to use gpt-oss with Ollama.](https://cookbook.openai.com/articles/gpt-oss/run-locally-ollama)

## LM Studio

If you are using [LM Studio](https://lmstudio.ai/), you can use the following command to download the model.

```bash
# gpt-oss-20b
lms get openai/gpt-oss-20b
```

Check out our [awesome list](https://github.com/openai/gpt-oss/blob/main/awesome-gpt-oss.md) for a broader collection of gpt-oss resources and inference partners.

---

# Download the model

You can download the model weights from the [Hugging Face Hub](https://huggingface.co/collections/openai/gpt-oss-68911959590a1634ba11c7a4) using the Hugging Face CLI, then chat with the model through the `gpt-oss` package:

```shell
# gpt-oss-20b
huggingface-cli download openai/gpt-oss-20b --include "original/*" --local-dir gpt-oss-20b/
pip install gpt-oss
# Point the chat script at the downloaded checkpoint directory.
python -m gpt_oss.chat gpt-oss-20b/original/
```

# Reasoning levels

You can choose among three reasoning levels to suit your task:

* **Low:** Fast responses for general dialogue.
* **Medium:** Balanced speed and detail.
* **High:** Deep and detailed analysis.

The reasoning level can be set in the system prompt, e.g., "Reasoning: high".
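
As an illustration, the helper below prepends (or extends) a system message with the reasoning directive before the conversation is sent to the model. The function name `with_reasoning` is hypothetical; only the `Reasoning: <level>` wording comes from this card.

```python
VALID_LEVELS = ("low", "medium", "high")


def with_reasoning(messages, level="medium"):
    """Return a copy of `messages` whose system prompt sets the reasoning level."""
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")

    directive = f"Reasoning: {level}"

    # If a system message already exists, append the directive to it.
    if messages and messages[0]["role"] == "system":
        merged = {**messages[0], "content": f"{messages[0]['content']}\n{directive}"}
        return [merged, *messages[1:]]

    # Otherwise add a new system message in front of the conversation.
    return [{"role": "system", "content": directive}, *messages]


chat = with_reasoning(
    [{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    level="high",
)
print(chat[0])  # {'role': 'system', 'content': 'Reasoning: high'}
```

The resulting list can be passed as `messages` to a Transformers pipeline or to an OpenAI-compatible chat endpoint.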

# Tool use

The gpt-oss models are excellent for:

* Web browsing (using built-in browsing tools)
* Function calling with defined schemas
* Agentic operations like browser tasks
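
To make "function calling with defined schemas" concrete, here is a sketch of a tool definition in the OpenAI-style chat completions format, plus a helper that decodes a returned tool call. The `get_weather` tool, its parameters, and the helper names are all hypothetical, and whether your serving stack returns tool calls in exactly this shape is an assumption to verify against its documentation.

```python
import json

# Hypothetical tool: the schema under "parameters" is ordinary JSON Schema.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
            },
            "required": ["city"],
        },
    },
}


def build_tool_request(prompt, tools, model="openai/gpt-oss-20b"):
    """Build a chat completions payload that advertises the given tools."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
    }


def parse_tool_call(tool_call):
    """Return (function name, decoded arguments) from an OpenAI-style tool call."""
    function = tool_call["function"]
    return function["name"], json.loads(function["arguments"])


payload = build_tool_request("What's the weather in Paris?", [get_weather_tool])

# Example of the shape a server might send back (illustrative, not real output).
example_call = {
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Paris"}'},
}
name, arguments = parse_tool_call(example_call)
print(name, arguments)  # get_weather {'city': 'Paris'}
```

Your application then runs the named function with the decoded arguments and sends the result back to the model as a follow-up message.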

# Fine-tuning

Both gpt-oss models can be fine-tuned for a variety of specialized use cases.

The smaller `gpt-oss-20b` model can be fine-tuned on consumer hardware, whereas the larger [`gpt-oss-120b`](https://huggingface.co/openai/gpt-oss-120b) can be fine-tuned on a single H100 node.