Libraries

How to use cesp99/qwen3-sussurro with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="cesp99/qwen3-sussurro",
	filename="qwen3-sussurro-f16.gguf",
)

llm.create_chat_completion(
	messages=[
		{"role": "user", "content": "so, uh, I was thinking like maybe we could, you know, meet up on Saturday?"}
	]
)

Local Apps

llama.cpp

How to use cesp99/qwen3-sussurro with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf cesp99/qwen3-sussurro:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf cesp99/qwen3-sussurro:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf cesp99/qwen3-sussurro:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf cesp99/qwen3-sussurro:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf cesp99/qwen3-sussurro:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf cesp99/qwen3-sussurro:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf cesp99/qwen3-sussurro:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf cesp99/qwen3-sussurro:Q4_K_M

Use Docker

docker model run hf.co/cesp99/qwen3-sussurro:Q4_K_M
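
Once `llama-server` is running, any OpenAI-compatible client can send it a transcript. The following is a minimal sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8080/v1` (the address used in the Pi configuration on this page). The helper names `build_payload` and `correct_transcript`, the shortened system prompt, and the `max_tokens` value are illustrative choices, not part of the model card.

```python
import json
import urllib.request

# Assumption: llama-server is reachable at this local address.
BASE_URL = "http://localhost:8080/v1"

# Shortened, illustrative system prompt; the full prompt appears in the Usage section.
SYSTEM_PROMPT = (
    "You are a speech-to-text correction specialist. Convert raw speech "
    "transcriptions into clean, written text while preserving meaning."
)


def build_payload(raw_speech: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": raw_speech},
        ],
        "max_tokens": 256,
    }


def correct_transcript(raw_speech: str, base_url: str = BASE_URL) -> str:
    """POST the transcript to the local server and return the model's reply."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(raw_speech)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `correct_transcript("so, uh, I was thinking like maybe we could, you know, meet up on Saturday?")` should return the cleaned-up sentence as a string.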

Ollama
How to use cesp99/qwen3-sussurro with Ollama:
```
ollama run hf.co/cesp99/qwen3-sussurro:Q4_K_M
```

Unsloth Studio

How to use cesp99/qwen3-sussurro with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for cesp99/qwen3-sussurro to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for cesp99/qwen3-sussurro to start chatting

Use Hugging Face Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for cesp99/qwen3-sussurro to start chatting

Pi

How to use cesp99/qwen3-sussurro with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf cesp99/qwen3-sussurro:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "cesp99/qwen3-sussurro:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use cesp99/qwen3-sussurro with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf cesp99/qwen3-sussurro:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default cesp99/qwen3-sussurro:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use cesp99/qwen3-sussurro with Docker Model Runner:
```
docker model run hf.co/cesp99/qwen3-sussurro:Q4_K_M
```

Lemonade

How to use cesp99/qwen3-sussurro with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull cesp99/qwen3-sussurro:Q4_K_M

Run and chat with the model

lemonade run user.qwen3-sussurro-Q4_K_M

List all available models

lemonade list

Qwen3-1.7B Sussurro - v1.0

A fine-tuned version of Qwen/Qwen3-1.7B for speech-to-text transcription correction.

Model Description

This model converts raw speech transcriptions into clean, written-quality text by:

Removing filler words: um, uh, like, you know, I mean, actually, literally, right, you see
Fixing stuttering: the the → the, we we → we, I I → I
Eliminating false starts: "I was- actually, I mean..." → clean phrasing
Converting conversational speech to written style: transforming spoken-language patterns into formal written text
Organizing rambling speech: Convert stream-of-consciousness to structured sentences
Preserving meaning: Maintain all important content and intent
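
To make the stuttering transformation concrete, here is a minimal rule-based sketch in Python. It is not how the model works (the model learns these rewrites from data); it only illustrates the `the the → the` mapping with a regular expression. The function name `collapse_stutter` is hypothetical.

```python
import re

# Matches a word followed by one or more repeats of itself,
# e.g. "the the" or "we we we" (case-insensitive).
_REPEAT = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)


def collapse_stutter(text: str) -> str:
    """Collapse immediately repeated words into a single occurrence."""
    return _REPEAT.sub(r"\1", text)


print(collapse_stutter("the the budget report is ready and we we just need to finalize"))
# the budget report is ready and we just need to finalize
```

A fixed rule like this also collapses legitimate repeats such as "had had", which is the kind of case where a context-aware model may do better.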

Training Details

Base Model: Qwen/Qwen3-1.7B
Training Method: QLoRA (4-bit quantization + LoRA adapters)
Training Data: 3,997 speech transcription pairs
Hardware: AMD Radeon RX 7800 XT (16GB VRAM) with ROCm
Training Duration: ~4 hours

Training Configuration

Quantization: 4-bit NF4 with double quantization
LoRA: rank=64, alpha=128, targeting all attention and MLP layers
Batch Size: 2 per device, 32 gradient accumulation steps (effective batch size = 64)
Learning Rate: 2e-4 with cosine schedule
Epochs: 3
Optimizer: paged_adamw_8bit
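
The effective batch size and the LoRA scaling factor follow directly from the values above. A short worked check in Python, assuming a single GPU (as listed under Hardware) and the standard LoRA scaling of `alpha / rank`:

```python
# Values taken from the Training Configuration list.
per_device_batch_size = 2
gradient_accumulation_steps = 32
num_gpus = 1  # assumption: a single RX 7800 XT, as listed under Hardware

lora_rank = 64
lora_alpha = 128

# Number of samples contributing to each optimizer update.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus

# Standard LoRA scales the adapter update by alpha / rank.
lora_scaling = lora_alpha / lora_rank

print(effective_batch_size)  # 64
print(lora_scaling)          # 2.0
```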

Evaluation Results

BLEU-4: 0.461
ROUGE-1: 0.785
ROUGE-2: 0.652
ROUGE-L: 0.748
Test Samples: 401
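
For readers unfamiliar with these metrics, ROUGE-1 measures unigram overlap between the model output and a reference text. The sketch below is a simplified F1 variant using whitespace tokenization. It is not the evaluation script behind the numbers above, and standard ROUGE implementations apply their own tokenization and optional stemming, so their scores will differ. The candidate sentence is a made-up example.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate and a reference string."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "The budget report is almost ready and we just need to finalize it."
candidate = "The budget report is almost ready and we need to finalize it."  # hypothetical output
print(round(rouge1_f1(candidate, reference), 3))
# 0.96
```

Here the candidate drops one of the 13 reference words, giving precision 12/12 and recall 12/13, hence an F1 of 0.96.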

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "cesp99/qwen3-sussurro"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# System prompt
system_prompt = """You are a speech-to-text correction specialist. Your task is to convert raw speech transcriptions into clean, written text by:
- Removing all filler words (um, uh, like, you know, I mean, actually, literally, right, you see)
- Fixing stuttering and repeated words (the the → the, we we → we)
- Eliminating false starts and self-corrections
- Converting conversational speech patterns to formal written language
- Organizing rambling thoughts into clear, structured sentences
- Preserving all important meaning and content"""

# Example correction
raw_speech = "so, uh, I was thinking like maybe we could, you know, meet up on Saturday?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": raw_speech},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    do_sample=True,
)

corrected_text = tokenizer.decode(
    outputs[0][inputs['input_ids'].shape[1]:],
    skip_special_tokens=True
)

print(corrected_text)
# Example output (sampling is enabled, so results can vary): "I was thinking maybe we could meet up on Saturday?"

Example Corrections

Example 1: Filler Words

Input: "so, uh, I was thinking like maybe we could, you know, meet up on Saturday?"
Output: "I was thinking maybe we could meet up on Saturday?"

Example 2: Stuttering

Input: "the the budget report is, uh, almost ready and we we just need to finalize"
Output: "The budget report is almost ready and we just need to finalize it."

Example 3: False Starts

Input: "I mean, actually, uh, we should probably, like, you know, consider all the options"
Output: "We should probably consider all the options before making a decision."

Use Cases

Meeting Transcripts: Clean up recorded meeting transcriptions
Podcast/Interview Processing: Convert conversational speech to publishable text
Voice Notes: Transform voice memos into written format
Content Creation: Prepare speech-to-text data for articles or documentation
Data Cleaning: Pre-process speech datasets for downstream NLP tasks

Limitations

Trained primarily on English speech patterns
May occasionally over-correct or change intended meaning
Best suited for conversational speech patterns (not formal presentations)
Requires careful review for critical applications
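
One lightweight way to support that review is to flag words in the corrected text that never appeared in the raw transcript, since those are candidates for added or altered meaning. This is a heuristic sketch, not part of the model or its tooling; the function names `_tokens` and `added_words` are hypothetical.

```python
import re


def _tokens(text: str) -> list[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def added_words(raw: str, corrected: str) -> list[str]:
    """Return words in `corrected` that do not occur anywhere in `raw`."""
    seen = set(_tokens(raw))
    return [word for word in _tokens(corrected) if word not in seen]


raw = "I mean, actually, uh, we should probably, like, you know, consider all the options"
corrected = "We should probably consider all the options before making a decision."
print(added_words(raw, corrected))
# ['before', 'making', 'a', 'decision']
```

An empty result does not prove the meaning is unchanged (word order and deletions matter too), so this only narrows down what a human reviewer should look at first.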

Technical Requirements

GPU: Recommended 8GB+ VRAM for inference
Framework: PyTorch with Transformers library
Precision: BF16 recommended (FP16 also supported)
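
As a rough sanity check on the VRAM recommendation, the memory needed for the weights alone can be estimated from the parameter count and the bytes per parameter. The sketch assumes the nominal 1.7B parameter count from the base model's name and ignores activations, the KV cache, and quantization overhead, so real usage will be higher.

```python
# Assumption: nominal parameter count taken from the model name (Qwen3-1.7B).
NUM_PARAMS = 1.7e9

# Approximate storage cost per parameter for each precision.
BYTES_PER_PARAM = {"bf16/fp16": 2.0, "8-bit": 1.0, "4-bit": 0.5}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Estimate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


for precision, size in BYTES_PER_PARAM.items():
    print(f"{precision}: ~{weight_memory_gb(NUM_PARAMS, size):.2f} GB")
# bf16/fp16: ~3.40 GB
# 8-bit: ~1.70 GB
# 4-bit: ~0.85 GB
```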

License

GNU General Public License v3.0 (GPL-3.0)

This fine-tuned model is licensed under GPL-3.0. Note that the base model (Qwen3-1.7B) is Apache 2.0 licensed.

Citation

If you use this model, please cite:

@misc{qwen3-sussurro,
  title={Qwen3-1.7B Sussurro},
  author={Carlo Esposito},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/cesp99/qwen3-sussurro}
}

Acknowledgments

Base model: Qwen/Qwen3-1.7B
Training framework: Hugging Face Transformers + PEFT
Quantization: BitsAndBytes

Training Repository

Full training pipeline and code: github.com/cesp99/qwen3-sussurro

GGUF

Model size: 2B params
Architecture: qwen3
Quantization levels: 4-bit, 5-bit, 8-bit, 16-bit

Model tree for cesp99/qwen3-sussurro

Base model: Qwen/Qwen3-1.7B-Base
Finetuned: Qwen/Qwen3-1.7B
Quantized: this model