Liujgoj Cantonese LLM (V1.0-Alpha)

🌟 Introduction

Liujgoj Cantonese LLM is a fine-tuned language model built for the Liujgoj (溜歌粵語) romanization system.

Liujgoj is more than a phonetic transcription method. It is an experimental writing system designed to represent spoken Cantonese through:

word-based orthography
Latin script writing
efficient tone encoding
native Cantonese expression

The model is trained to convert natural Cantonese text written in Chinese characters into standardized Liujgoj romanization.

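Chinese Introduction (translated)

Liujgoj Cantonese LLM is a language model fine-tuned specifically for the 溜歌粵語 (Liujgoj) romanization system.

Liujgoj is not just a phonetic spelling scheme. It is a writing system centered on spoken Cantonese that emphasizes word-based spelling and the Latin alphabet.

The model's main uses are:

converting Cantonese sentences written in Chinese characters into Liujgoj romanization
learning idiomatic colloquial word combinations
handling Liujgoj's distinctive spelling rules
advancing AI writing technology for Cantonese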

🚀 Key Features

✅ Word-Based Orthography

Learns merged word forms instead of character-by-character output.

Examples:

食咗 → sikhzor
靚仔 → lengzair
做咩 → zouh mej
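
To make the difference between word-based and character-by-character output concrete, here is a minimal sketch that looks up whole words with a greedy longest-match rule. It is only an illustration, not how the model works internally: the `LEXICON` table and the `romanize_by_word` helper are hypothetical, and the table contains nothing beyond the three pairs listed above.

```python
# Hypothetical mini-lexicon: only the three word pairs listed above.
LEXICON = {
    "食咗": "sikhzor",
    "靚仔": "lengzair",
    "做咩": "zouh mej",
}

def romanize_by_word(text, lexicon=LEXICON):
    """Greedy longest-match lookup; unknown characters pass through unchanged."""
    max_len = max(len(word) for word in lexicon)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so multi-character words win.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in lexicon:
                pieces.append(lexicon[chunk])
                i += size
                break
        else:
            # No lexicon entry starts here: keep the character as-is.
            pieces.append(text[i])
            i += 1
    return " ".join(pieces)

print(romanize_by_word("靚仔食咗"))  # lengzair sikhzor
```

A fixed table like this cannot cover unseen words or context-dependent spellings, which is the gap the fine-tuned model is meant to fill.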

✅ Tone-as-Letter System

Supports Liujgoj tone letters, which mark tone with letters in the spelling itself. This allows compact tone representation without numbers.

✅ Native Cantonese Fluency

Training data is derived from authentic Hong Kong Cantonese dialogue, preserving:

colloquial speech
sentence particles
slang usage
real spoken rhythm

📊 Training Data

Dataset Size

30,082 high-quality instruction pairs

Sources

Curated from more than 60 Hong Kong movie subtitle files (SRT) and converted to Liujgoj through manual and semi-automatic annotation.
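
As a rough illustration of the first step in such a pipeline, the sketch below pulls the dialogue text out of an SRT file's contents. It assumes the standard SRT layout (a counter line, a timing line, then one or more text lines, with cues separated by blank lines). The `srt_dialogue_lines` helper is hypothetical and is not the author's actual preprocessing code; joining multi-line cues with a space is also an assumption.

```python
import re

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:04,000".
TIMING = re.compile(
    r"\d{2}:\d{2}:\d{2}[,.]\d{3}\s*-->\s*\d{2}:\d{2}:\d{2}[,.]\d{3}"
)

def srt_dialogue_lines(srt_text):
    """Return the subtitle text of each cue, dropping counters and timings."""
    lines = []
    normalized = srt_text.replace("\r\n", "\n").strip()
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", normalized):
        rows = [row.strip() for row in block.split("\n") if row.strip()]
        for idx, row in enumerate(rows):
            if TIMING.match(row):
                # Everything after the timing line is dialogue text.
                text = " ".join(rows[idx + 1:])
                if text:
                    lines.append(text)
                break
    return lines

sample = """1
00:00:01,000 --> 00:00:02,500
你食咗飯未啊？

2
00:00:03,000 --> 00:00:04,000
做咩
"""
print(srt_dialogue_lines(sample))
```

Each extracted line would still need annotation into Liujgoj before it could serve as a training pair.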

Format Example

{
  "instruction": "Convert Cantonese Chinese into Liujgoj Romanization",
  "input": "你食咗飯未啊？",
  "output": "Neiq sikhzor faanh meih aa?"
}
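
Records in this shape are easy to sanity-check before training. The sketch below assumes the pairs are stored as a single JSON array, which the card does not state. `load_pairs` and `REQUIRED_KEYS` are hypothetical names used for illustration.

```python
import json

REQUIRED_KEYS = ("instruction", "input", "output")

def load_pairs(json_text):
    """Parse a JSON array of records and keep only the well-formed ones."""
    clean = []
    for record in json.loads(json_text):
        if not isinstance(record, dict):
            continue
        # Every field must be present and be a non-empty string.
        if all(
            isinstance(record.get(key), str) and record[key].strip()
            for key in REQUIRED_KEYS
        ):
            clean.append(record)
    return clean

SAMPLE = json.dumps([
    {
        "instruction": "Convert Cantonese Chinese into Liujgoj Romanization",
        "input": "你食咗飯未啊？",
        "output": "Neiq sikhzor faanh meih aa?",
    },
    # Malformed on purpose: the output field is empty.
    {"instruction": "Convert", "input": "做咩", "output": ""},
])

pairs = load_pairs(SAMPLE)
print(len(pairs))  # 1
```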

🧠 Base Model

unsloth/Qwen2.5-7B-bnb-4bit

Fine-tuning Method

LoRA
Supervised Fine-Tuning (SFT)
Unsloth optimized training pipeline

🛠️ Usage

Python Example

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "yvthyvq/liujgoj-cantonese-lora",
    max_seq_length = 2048,
    load_in_4bit = True,
)

FastLanguageModel.for_inference(model)

prompt = """Below is an instruction that describes a task.

### Instruction:
Convert Cantonese Chinese into Liujgoj Romanization.

### Input:
{}

### Response:
"""

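# Fill the template with the sentence to convert, tokenize it, and move the tensors to the GPU.
inputs = tokenizer(
    [prompt.format("你食咗飯未啊？")],
    return_tensors="pt"
).to("cuda")

# Generate up to 64 new tokens after the prompt.
outputs = model.generate(**inputs, max_new_tokens=64)

# The decoded string contains the prompt followed by the model's response.
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))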
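
Because `generate` returns the prompt tokens together with the new ones, the decoded string repeats the whole template. One possible way to keep only the romanized answer is sketched below. `extract_response` is a hypothetical helper, and the default `eos_token` value is an assumption about the base tokenizer; passing `tokenizer.eos_token` explicitly is safer.

```python
RESPONSE_MARKER = "### Response:"

def extract_response(decoded, eos_token="<|endoftext|>"):
    """Return only the text generated after the response marker."""
    # Keep what follows the last marker; if it is absent, keep the whole string.
    answer = decoded.rsplit(RESPONSE_MARKER, 1)[-1]
    # Drop anything from the end-of-sequence token onward, if one is present.
    if eos_token:
        answer = answer.split(eos_token, 1)[0]
    return answer.strip()

decoded = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\nConvert Cantonese Chinese into Liujgoj Romanization.\n\n"
    "### Input:\n你食咗飯未啊？\n\n"
    "### Response:\nNeiq sikhzor faanh meih aa?<|endoftext|>"
)
print(extract_response(decoded))  # Neiq sikhzor faanh meih aa?
```

With the snippet above, this could be applied as `[extract_response(text, tokenizer.eos_token) for text in tokenizer.batch_decode(outputs)]`.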

💬 Example

Input

你食咗飯未啊？

Output

Neiq sikhzor faanh meih aa?

⚠️ Limitations

This is an Alpha release.

Current limitations may include:

unseen slang words
long-context instability
occasional spelling inconsistency
hallucination on unrelated tasks

Recommended primarily for:

Cantonese romanization
Liujgoj experiments
linguistic research
niche Cantonese NLP tasks

🗺️ Roadmap

V1.0 Alpha

Initial public release
Cantonese Hanzi → Liujgoj conversion

V2.0 Planned

improved accuracy
better segmentation
stronger instruction following
broader vocabulary coverage

Future Goals

Liujgoj chat assistant
speech alignment
grammar tools
full Cantonese-native LLM ecosystem

🏷️ Tags

cantonese yue romanization liujgoj lora qwen unsloth linguistics

🙌 Author

Created by Yvthyvq

Building language technology for Cantonese and Liujgoj.
