Liujgoj Cantonese LLM (V1.0-Alpha)

🌟 Introduction

Liujgoj Cantonese LLM is a fine-tuned language model built for the Liujgoj (溜歌粵語) romanization system.

Liujgoj is more than a phonetic transcription method. It is an experimental writing system designed to represent spoken Cantonese through:

  • word-based orthography
  • Latin script writing
  • efficient tone encoding
  • native Cantonese expression

This model is trained to convert natural Cantonese Chinese text into standardized Liujgoj Romanization.


Introduction (translated from Chinese)

Liujgoj Cantonese LLM is a language model fine-tuned specifically for the Liujgoj (溜歌粵語) romanization system.

Liujgoj is not merely a phonetic transcription scheme; it is a writing system built around spoken Cantonese, emphasizing word-level orthography and Latin-script writing.

The model's main uses are:

  • converting Cantonese Hanzi sentences into Liujgoj romanization
  • learning authentic colloquial vocabulary and collocations
  • handling Liujgoj's distinctive spelling rules
  • advancing AI writing technology for Cantonese

🚀 Key Features

✅ Word-Based Orthography

Learns merged word forms instead of character-by-character output.

Examples:

  • 食咗 → sikhzor
  • 靚仔 → lengzair
  • 做咩 → zouh mej
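The merged word forms above can be read as entries in a lookup table. As a minimal sketch, here is a rule-based converter over exactly those three forms; the table and the fallback behavior are illustrative only — the model itself learns such forms from data rather than from a fixed dictionary.

```python
# Illustrative lookup table built from the three examples in this card.
WORD_FORMS = {
    "食咗": "sikhzor",
    "靚仔": "lengzair",
    "做咩": "zouh mej",
}

def romanize_word(hanzi: str) -> str:
    # Fall back to the input unchanged when a word is not in the table.
    return WORD_FORMS.get(hanzi, hanzi)

print(romanize_word("食咗"))  # sikhzor
```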

✅ Tone-as-Letter System

Supports Liujgoj tone letters:

  • j
  • r
  • x
  • q
  • h

This allows compact tone representation without numbers.
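In the examples shown in this card (sikhzor, zouh mej, Neiq, faanh, meih), the tone letter appears at the end of a syllable. A minimal sketch of detecting that trailing letter, assuming the syllable-final position holds in general (this card does not specify which letter encodes which tone, so no tone mapping is attempted here):

```python
# The five Liujgoj tone letters listed above.
TONE_LETTERS = set("jrxqh")

def tone_letter(syllable):
    """Return the trailing tone letter of a romanized syllable, or None."""
    last = syllable[-1].lower()
    return last if last in TONE_LETTERS else None

print(tone_letter("mej"))   # j
print(tone_letter("faanh")) # h
```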


✅ Native Cantonese Fluency

Training data is derived from authentic Hong Kong Cantonese dialogue, preserving:

  • colloquial speech
  • sentence particles
  • slang usage
  • real spoken rhythm

📊 Training Data

Dataset Size

30,082 high-quality instruction pairs

Sources

Curated from 60+ Hong Kong movie subtitle files (SRT) and converted through manual / semi-automatic annotation.

Format Example

{
  "instruction": "Convert Cantonese Chinese into Liujgoj Romanization",
  "input": "你食咗飯未啊?",
  "output": "Neiq sikhzor faanh meih aa?"
}
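A record in this format can be validated with a small helper before training. The three required keys come from the example above; the validator itself is a sketch, not part of the released pipeline.

```python
# Keys taken from the instruction-pair format shown in this card.
REQUIRED_KEYS = {"instruction", "input", "output"}

def validate_pair(record: dict) -> bool:
    """Check that a record has all three fields as non-empty strings."""
    return REQUIRED_KEYS <= record.keys() and all(
        isinstance(record[k], str) and record[k] for k in REQUIRED_KEYS
    )

example = {
    "instruction": "Convert Cantonese Chinese into Liujgoj Romanization",
    "input": "你食咗飯未啊?",
    "output": "Neiq sikhzor faanh meih aa?",
}
print(validate_pair(example))  # True
```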

🧠 Base Model

  • unsloth/Qwen2.5-7B-bnb-4bit

Fine-tuning Method

  • LoRA
  • Supervised Fine-Tuning (SFT)
  • Unsloth optimized training pipeline

🛠️ Usage

Python Example

from unsloth import FastLanguageModel

# Load the LoRA adapter together with its 4-bit quantized base model.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "yvthyvq/liujgoj-cantonese-lora",
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Switch Unsloth into its optimized inference mode.
FastLanguageModel.for_inference(model)

# Alpaca-style prompt template matching the fine-tuning format.
prompt = """Below is an instruction that describes a task.

### Instruction:
Convert Cantonese Chinese into Liujgoj Romanization.

### Input:
{}

### Response:
"""

inputs = tokenizer(
    [prompt.format("你食咗飯未啊?")],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])

💬 Example

Input

你食咗飯未啊?

Output

Neiq sikhzor faanh meih aa?

⚠️ Limitations

This is an Alpha release.

Current limitations may include:

  • inconsistent handling of unseen slang words
  • instability on long inputs
  • occasional spelling inconsistency
  • hallucination on tasks unrelated to romanization

Recommended primarily for:

  • Cantonese romanization
  • Liujgoj experiments
  • linguistic research
  • niche Cantonese NLP tasks

🗺️ Roadmap

V1.0 Alpha

  • Initial public release
  • Cantonese Hanzi → Liujgoj conversion

V2.0 Planned

  • improved accuracy
  • better segmentation
  • stronger instruction following
  • broader vocabulary coverage

Future Goals

  • Liujgoj chat assistant
  • speech alignment
  • grammar tools
  • full Cantonese-native LLM ecosystem

🏷️ Tags

cantonese yue romanization liujgoj lora qwen unsloth linguistics


🙌 Author

Created by Yvthyvq

Building language technology for Cantonese and Liujgoj.

