Qwen3-1.7B Sussurro - v1.0

A fine-tuned version of Qwen/Qwen3-1.7B for speech-to-text transcription correction.

Model Description

This model converts raw speech transcriptions into clean, written-quality text by:

  • Removing filler words: um, uh, like, you know, I mean, actually, literally, right, you see
  • Fixing stuttering: the the → the, we we → we, I I → I
  • Eliminating false starts: "I was- actually, I mean..." → clean phrasing
  • Converting conversational speech to written text: Transform spoken-language patterns into formal written text
  • Organizing rambling speech: Convert stream-of-consciousness to structured sentences
  • Preserving meaning: Maintain all important content and intent
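
For contrast, the first two items above can be roughly approximated with regular expressions. The sketch below is a naive rule-based baseline, not how this model works; the function name `naive_cleanup` and both patterns are hypothetical. It handles only the unambiguous fillers "um" and "uh" plus immediate word repetitions. Fillers such as "like" or "right" are also ordinary words, so a fixed pattern cannot remove them safely.

```python
import re

# Immediate repetitions of the same word: "the the" -> "the", "we we" -> "we".
STUTTER = re.compile(r"\b(\w+)(?:\s+\1\b)+", flags=re.IGNORECASE)
# Only the unambiguous fillers, together with the commas that set them off.
FILLER = re.compile(r"(?:,\s*)?\b(?:um|uh)\b,?", flags=re.IGNORECASE)

def naive_cleanup(text: str) -> str:
    """Collapse repeated words, drop "um"/"uh", and normalize whitespace."""
    text = STUTTER.sub(r"\1", text)
    text = FILLER.sub("", text)
    return re.sub(r"\s+", " ", text).strip()

print(naive_cleanup(
    "the the budget report is, uh, almost ready and we we just need to finalize"
))
# the budget report is almost ready and we just need to finalize
```

Capitalization, sentence-final punctuation, and false starts are left untouched by this baseline.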

Training Details

  • Base Model: Qwen/Qwen3-1.7B
  • Training Method: QLoRA (4-bit quantization + LoRA adapters)
  • Training Data: 3,997 speech transcription pairs
  • Hardware: AMD Radeon RX 7800 XT (16GB VRAM) with ROCm
  • Training Duration: ~4 hours

Training Configuration

  • Quantization: 4-bit NF4 with double quantization
  • LoRA: rank=64, alpha=128, targeting all attention and MLP layers
  • Batch Size: 2 per device with 32 gradient accumulation steps (effective batch size = 64)
  • Learning Rate: 2e-4 with cosine schedule
  • Epochs: 3
  • Optimizer: paged_adamw_8bit
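
The settings above can be written down as keyword arguments. This is a minimal sketch, not the author's training script: the dictionary keys mirror argument names of `transformers.BitsAndBytesConfig`, `peft.LoraConfig`, and `transformers.TrainingArguments`, while the `target_modules` names and the single-device count are assumptions, since the card only says "all attention and MLP layers". Plain dictionaries are used so the arithmetic can be checked without GPU libraries installed.

```python
# Keys mirror BitsAndBytesConfig arguments; values come from the list above.
quantization = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}

# Keys mirror LoraConfig arguments.
lora = {
    "r": 64,
    "lora_alpha": 128,
    # Assumption: the usual projection names for Qwen-style attention and MLP blocks.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "task_type": "CAUSAL_LM",
}

# Keys mirror TrainingArguments arguments.
training = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 32,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 3,
    "optim": "paged_adamw_8bit",
}

num_devices = 1  # assumption: the single GPU listed under Training Details

# Samples contributing to each optimizer step.
effective_batch = (training["per_device_train_batch_size"]
                   * training["gradient_accumulation_steps"]
                   * num_devices)
# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora["lora_alpha"] / lora["r"]

print(effective_batch)  # 64
print(lora_scale)       # 2.0
```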

Evaluation Results

  • BLEU-4: 0.461
  • ROUGE-1: 0.785
  • ROUGE-2: 0.652
  • ROUGE-L: 0.748
  • Test Samples: 401
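
The card does not say which implementation produced these scores. To make the metrics concrete, the sketch below computes simplified ROUGE-N and ROUGE-L F1 scores over lowercased whitespace tokens. The function names are hypothetical, and evaluation libraries may tokenize or stem differently, so the numbers will not necessarily match a library's output.

```python
from collections import Counter

def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def _f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing overlaps."""
    if overlap == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(reference, candidate, n=1):
    """N-gram overlap F1 between a reference and a candidate string."""
    ref = _ngrams(reference.lower().split(), n)
    cand = _ngrams(candidate.lower().split(), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return _f1(overlap, sum(cand.values()), sum(ref.values()))

def rouge_l(reference, candidate):
    """F1 based on the longest common subsequence of tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    table = [[0] * (len(cand) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, c in enumerate(cand, 1):
            if r == c:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return _f1(table[-1][-1], len(cand), len(ref))

reference = "we should probably consider all the options"
candidate = "we should consider all options"
print(round(rouge_n(reference, candidate, 1), 3))  # 0.833
print(round(rouge_n(reference, candidate, 2), 3))  # 0.4
print(round(rouge_l(reference, candidate), 3))     # 0.833
```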

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "cesp99/qwen3-sussurro"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# System prompt
system_prompt = """You are a speech-to-text correction specialist. Your task is to convert raw speech transcriptions into clean, written text by:
- Removing all filler words (um, uh, like, you know, I mean, actually, literally, right, you see)
- Fixing stuttering and repeated words (the the → the, we we → we)
- Eliminating false starts and self-corrections
- Converting conversational speech patterns to formal written language
- Organizing rambling thoughts into clear, structured sentences
- Preserving all important meaning and content"""

# Example correction
raw_speech = "so, uh, I was thinking like maybe we could, you know, meet up on Saturday?"

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": raw_speech},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    do_sample=True,
)

corrected_text = tokenizer.decode(
    outputs[0][inputs['input_ids'].shape[1]:],
    skip_special_tokens=True
)

print(corrected_text)
# Example output (sampling is enabled, so results may vary):
# "I was thinking maybe we could meet up on Saturday?"
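
Because the example above caps generation at max_new_tokens=256, a long transcript cannot be corrected in a single call without the output being cut off. One possible workaround, sketched here under assumptions, is to split the raw text into fixed-size word windows and correct each window separately. The helper name `chunk_transcript` and the default of 120 words are hypothetical; the right window size depends on the tokenizer and should be checked against real inputs.

```python
def chunk_transcript(text: str, max_words: int = 120) -> list[str]:
    """Split a raw transcript into consecutive windows of at most max_words words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]

chunks = chunk_transcript("so uh I was thinking we could meet up", max_words=4)
print(chunks)
# ['so uh I was', 'thinking we could meet', 'up']
```

Word windows can cut a sentence in half, so the corrected chunks may need a light review where they join.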

Example Corrections

Example 1: Filler Words

Input: "so, uh, I was thinking like maybe we could, you know, meet up on Saturday?"
Output: "I was thinking maybe we could meet up on Saturday?"

Example 2: Stuttering

Input: "the the budget report is, uh, almost ready and we we just need to finalize"
Output: "The budget report is almost ready and we just need to finalize it."

Example 3: False Starts

Input: "I mean, actually, uh, we should probably, like, you know, consider all the options"
Output: "We should probably consider all the options before making a decision."

Use Cases

  • Meeting Transcripts: Clean up recorded meeting transcriptions
  • Podcast/Interview Processing: Convert conversational speech to publishable text
  • Voice Notes: Transform voice memos into written format
  • Content Creation: Prepare speech-to-text data for articles or documentation
  • Data Cleaning: Pre-process speech datasets for downstream NLP tasks

Limitations

  • Trained primarily on English speech patterns
  • May occasionally over-correct or change intended meaning
  • Best suited for conversational speech patterns (not formal presentations)
  • Output should be reviewed by a human before use in critical applications

Technical Requirements

  • GPU: 8GB+ VRAM recommended for inference
  • Framework: PyTorch with Transformers library
  • Precision: BF16 recommended (FP16 also supported)
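
As a rough sanity check on the VRAM recommendation, the memory taken by the weights alone can be estimated from the parameter count and the bytes per parameter. This is a back-of-the-envelope sketch: the 1.7 billion figure is an assumption taken from the base model's name, and the estimate ignores the KV cache, activations, and framework overhead, which all add to the real footprint.

```python
PARAMS = 1.7e9  # assumption: parameter count implied by the name "Qwen3-1.7B"

# Bytes per parameter; the 4-bit figure ignores quantization constants.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "4-bit": 0.5}

def weight_memory_gb(params: float, precision: str) -> float:
    """Approximate size of the weights alone, in gigabytes (1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9

for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(PARAMS, precision):.2f} GB")
# bf16: 3.40 GB
# fp16: 3.40 GB
# 4-bit: 0.85 GB
```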

License

GNU General Public License v3.0 (GPL-3.0)

This fine-tuned model is licensed under GPL-3.0. The base model (Qwen3-1.7B) is licensed under Apache 2.0.

Citation

If you use this model, please cite:

@misc{qwen3-sussurro,
  title={Qwen3-1.7B Sussurro},
  author={Carlo Esposito},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/cesp99/qwen3-sussurro}
}

Acknowledgments

  • Base model: Qwen/Qwen3-1.7B
  • Training framework: Hugging Face Transformers + PEFT
  • Quantization: BitsAndBytes

Training Repository

Full training pipeline and code: github.com/cesp99/qwen3-sussurro
