Liujgoj Cantonese LLM (V1.0-Alpha)
🌟 Introduction
Liujgoj Cantonese LLM is a fine-tuned language model built for the Liujgoj (溜歌粵語) romanization system.
Liujgoj is more than a phonetic transcription method. It is an experimental writing system designed to represent spoken Cantonese through:
- word-based orthography
- Latin script writing
- efficient tone encoding
- native Cantonese expression
This model is trained to convert natural Cantonese Chinese text into standardized Liujgoj Romanization.
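Introduction (translated from Chinese)
Liujgoj Cantonese LLM is a language model fine-tuned specifically for the Liujgoj (溜歌粵語) romanization system.
Liujgoj is not merely a romanization scheme. It is a writing system centered on spoken Cantonese that emphasizes word-based spelling and the Latin alphabet.
The model is mainly intended to:
- convert Cantonese sentences written in Chinese characters into Liujgoj romanization
- learn idiomatic colloquial word combinations
- handle Liujgoj's distinctive spelling rules
- advance AI writing technology for Cantonese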
🚀 Key Features
✅ Word-Based Orthography
Learns merged word forms instead of character-by-character output.
Examples:
- 食咗 → sikhzor
- 靚仔 → lengzair
- 做咩 → zouh mej
✅ Tone-as-Letter System
Supports Liujgoj tone letters:
j, r, x, q, h
This allows compact tone representation without numbers.
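As a rough illustration of tone letters in practice, the sketch below reports the last letter of a word when it is one of the five tone letters. It assumes that a tone letter, when present, closes the word, which the examples in this README suggest but do not confirm. The helper name `final_tone_letter` is hypothetical, and because this README gives no mapping from letters to tone numbers, the sketch does not attempt one.

```python
from typing import Optional

# The five tone letters listed above (assumed to be five separate letters).
TONE_LETTERS = set("jrxqh")

def final_tone_letter(word: str) -> Optional[str]:
    """Return the trailing tone letter of a word, or None if it has none.

    Assumption: the tone letter is the last alphabetic character of the word.
    """
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if letters and letters[-1] in TONE_LETTERS:
        return letters[-1]
    return None

# Inspect each word of the sample output used elsewhere in this README.
for word in "Neiq sikhzor faanh meih aa?".split():
    print(word, final_tone_letter(word))
```

A merged word such as sikhzor also contains a tone letter inside the word. This helper only looks at the final one, so it is not a syllable splitter.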
✅ Native Cantonese Fluency
Training data is derived from authentic Hong Kong Cantonese dialogue, preserving:
- colloquial speech
- sentence particles
- slang usage
- real spoken rhythm
📊 Training Data
Dataset Size
30,082 high-quality instruction pairs
Sources
Curated from 60+ Hong Kong movie subtitle files (SRT) and converted to Liujgoj through manual and semi-automatic annotation.
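For readers who want to prepare similar input text, here is a minimal sketch of pulling dialogue lines out of an SRT file. It is not the author's actual pipeline, which this README does not describe. The function name `srt_to_lines` and the sample cues are hypothetical.

```python
import re
from typing import List

# Matches an SRT timing row such as "00:00:01,000 --> 00:00:02,500".
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}")
# Matches simple formatting tags such as <i> or </b>.
TAG = re.compile(r"<[^>]+>")

def srt_to_lines(srt_text: str) -> List[str]:
    """Return one dialogue string per subtitle cue, without indices, timings, or tags."""
    lines = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        seen_timestamp = False
        text_rows = []
        for row in block.split("\n"):
            row = row.strip()
            if TIMESTAMP.search(row):
                seen_timestamp = True
                continue
            # Only rows after the timing row are dialogue; the row before it is the cue index.
            if seen_timestamp and row:
                text_rows.append(TAG.sub("", row))
        text = " ".join(t for t in text_rows if t)
        if text:
            lines.append(text)
    return lines

sample = """1
00:00:01,000 --> 00:00:02,500
你食咗飯未啊?

2
00:00:03,000 --> 00:00:04,000
<i>做咩</i>
"""
print(srt_to_lines(sample))
```

Each extracted line would still need to be annotated with its Liujgoj form before it could serve as a training pair.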
Format Example
{
  "instruction": "Convert Cantonese Chinese into Liujgoj Romanization",
  "input": "你食咗飯未啊?",
  "output": "Neiq sikhzor faanh meih aa?"
}
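A record in this format can be checked before use with a few lines of Python. The sketch below assumes only the three field names shown in the example above. The `validate_pair` helper is hypothetical and not part of the released dataset tooling.

```python
import json

# Field names taken from the format example above.
REQUIRED_KEYS = ("instruction", "input", "output")

def validate_pair(record: dict) -> dict:
    """Raise ValueError unless every required field is a non-empty string."""
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"missing or empty field: {key!r}")
    return record

raw = (
    '{"instruction": "Convert Cantonese Chinese into Liujgoj Romanization", '
    '"input": "你食咗飯未啊?", '
    '"output": "Neiq sikhzor faanh meih aa?"}'
)
pair = validate_pair(json.loads(raw))
print(pair["input"], "->", pair["output"])
```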
🧠 Base Model
unsloth/Qwen2.5-7B-bnb-4bit
Fine-tuning Method
- LoRA
- Supervised Fine-Tuning (SFT)
- Unsloth optimized training pipeline
🛠️ Usage
Python Example
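from unsloth import FastLanguageModel
import torch

# Load the fine-tuned model and its tokenizer in 4-bit precision.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "yvthyvq/liujgoj-cantonese-lora",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode

# Prompt template; {} is filled with the Cantonese input sentence.
prompt = """Below is an instruction that describes a task.
### Instruction:
Convert Cantonese Chinese into Liujgoj Romanization.
### Input:
{}
### Response:
"""

# Tokenize the filled-in prompt and move the tensors to the GPU.
inputs = tokenizer(
    [prompt.format("你食咗飯未啊?")],
    return_tensors="pt"
).to("cuda")

# Generate up to 64 new tokens. The decoded text contains the prompt followed by the response.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs))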
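Because the decoded text includes the prompt as well as the generated romanization, a small post-processing step can isolate the answer. The sketch below is one way to do that. The `extract_response` helper is hypothetical, and the default end-of-sequence string `<|endoftext|>` is an assumption about the base model's tokenizer; passing `tokenizer.eos_token` avoids relying on it.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded: str, eos_token: str = "<|endoftext|>") -> str:
    """Return only the text generated after the last response marker."""
    # Everything after the final marker is the model's answer.
    _, _, answer = decoded.rpartition(RESPONSE_MARKER)
    # Drop the end-of-sequence token and anything after it (assumed token string).
    answer = answer.split(eos_token, 1)[0]
    return answer.strip()

# A decoded string shaped like the output of the usage example above.
decoded = (
    "Below is an instruction that describes a task.\n"
    "### Instruction:\n"
    "Convert Cantonese Chinese into Liujgoj Romanization.\n"
    "### Input:\n"
    "你食咗飯未啊?\n"
    "### Response:\n"
    "Neiq sikhzor faanh meih aa?<|endoftext|>"
)
print(extract_response(decoded))
```

With the usage example above, this could be called as `extract_response(tokenizer.batch_decode(outputs)[0], tokenizer.eos_token)`.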
💬 Example
Input
你食咗飯未啊?
Output
Neiq sikhzor faanh meih aa?
⚠️ Limitations
This is an Alpha release.
Known limitations may include:
- errors on unseen slang words
- instability on long inputs
- occasional spelling inconsistency
- hallucination on tasks unrelated to romanization
Recommended primarily for:
- Cantonese romanization
- Liujgoj experiments
- linguistic research
- niche Cantonese NLP tasks
🗺️ Roadmap
V1.0 Alpha
- Initial public release
- Cantonese Hanzi → Liujgoj conversion
V2.0 Planned
- improved accuracy
- better segmentation
- stronger instruction following
- broader vocabulary coverage
Future Goals
- Liujgoj chat assistant
- speech alignment
- grammar tools
- full Cantonese-native LLM ecosystem
🏷️ Tags
cantonese yue romanization liujgoj lora qwen unsloth linguistics
🙌 Author
Created by Yvthyvq
Building language technology for Cantonese and Liujgoj.