---
license: gemma
language:
- en
- zh
base_model: twinkle-ai/gemma-3-4B-T1-it
library_name: transformers
tags:
- Taiwan
- SLM
- GGUF
- agent
datasets:
- lianghsun/tw-reasoning-instruct
- lianghsun/tw-contract-review-chat
- minyichen/tw-instruct-R1-200k
- minyichen/tw_mm_R1
- minyichen/LongPaper_multitask_zh_tw_R1
- nvidia/Nemotron-Instruction-Following-Chat-v1
metrics:
- accuracy
model-index:
- name: gemma-3-4B-T1-it
results:
- task:
type: question-answering
name: Single Choice Question
dataset:
name: tmmlu+
type: ikala/tmmluplus
config: all
split: test
revision: c0e8ae955997300d5dbf0e382bf0ba5115f85e8c
metrics:
- type: accuracy
value: 47.44
name: single choice
- task:
type: question-answering
name: Single Choice Question
dataset:
name: mmlu
type: cais/mmlu
config: all
split: test
revision: c30699e
metrics:
- type: accuracy
value: 59.13
name: single choice
- task:
type: question-answering
name: Single Choice Question
dataset:
name: tw-legal-benchmark-v1
type: lianghsun/tw-legal-benchmark-v1
config: all
split: test
revision: 66c3a5f
metrics:
- type: accuracy
value: 44.18
name: single choice
pipeline_tag: text-generation
---
# Gemma 3 4B T1-it GGUF Collection

GGUF quantized models converted from [twinkle-ai/gemma-3-4B-T1-it](https://huggingface.co/twinkle-ai/gemma-3-4B-T1-it) for use with [llama.cpp](https://github.com/ggerganov/llama.cpp).
## About
Gemma 3 4B T1-it is a small language model fine-tuned on Taiwan-focused datasets, supporting both English and Traditional Chinese. This repository provides multiple quantization formats optimized for different use cases.
## Available Models

| Model | Size | Use Case |
|---|---|---|
| `twinkle-ai-gemma-3-4B-T1-it-BF16.gguf` | Largest | Best quality, highest precision |
| `twinkle-ai-gemma-3-4B-T1-it-F16.gguf` | Large | High quality, good precision |
| `twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf` | Medium | Balanced quality and speed |
| `twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf` | Smallest | Fastest inference, lower memory |
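To keep a local copy instead of streaming the file on first run, a quantization can also be downloaded up front with `huggingface-cli` (a minimal sketch; the repository and file names follow the Quick Start example below and may differ for the other quantizations):

```bash
# Download one GGUF file into the current directory
huggingface-cli download thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  gemma-3-4b-t1-it-q8_0.gguf \
  --local-dir .
```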
## Quick Start
### Option 1: Using Hugging Face Hub (Recommended)
Install llama.cpp via Homebrew:

```bash
brew install llama.cpp
```
Run inference directly from Hugging Face:

```bash
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"
```
Start as a server:

```bash
llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
```
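Once the server is running, it exposes an OpenAI-compatible HTTP API. A minimal request sketch, assuming the default bind address of `127.0.0.1:8080`:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Summarize the key clauses in this contract: ..."}
    ],
    "temperature": 0.7
  }'
```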
### Option 2: Build from Source
#### Step 1: Clone llama.cpp repository

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
#### Step 2: Build llama.cpp

Basic build (CPU only):

```bash
LLAMA_CURL=1 make
```

Hardware-specific build options:

**NVIDIA GPU (Linux):**

```bash
LLAMA_CUDA=1 LLAMA_CURL=1 make
```

**Apple Silicon (Mac):**

```bash
LLAMA_METAL=1 LLAMA_CURL=1 make
```

**AMD GPU (ROCm):**

```bash
LLAMA_HIPBLAS=1 LLAMA_CURL=1 make
```
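Note that recent llama.cpp releases have replaced the Makefile with a CMake build, so the `make` invocations above apply only to older checkouts. A rough CMake equivalent (the exact `-D` option names vary by version, so treat this as a sketch):

```bash
cmake -B build -DGGML_CUDA=ON   # drop the flag for a CPU-only build
cmake --build build --config Release
```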
#### Step 3: Run inference

```bash
./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"
```
#### Step 4: Start server (optional)

```bash
./llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
```
## Advanced Usage
### Choosing the Right Model
Select a model based on your needs:

- **Best Quality**: Use the `BF16` or `F16` versions (requires more memory)
- **Balanced**: Use the `Q8_0` version (recommended for most users)
- **Resource Constrained**: Use the `q4_k_m` version (suitable for devices with limited memory)
### Common Parameters

- `-p "prompt"`: Your input text for the model to respond to
- `-c 2048`: Context length (maximum number of tokens that can be processed)
- `--hf-repo`: Hugging Face repository name
- `--hf-file`: Model file name to use
### Adjusting Generation Parameters

```bash
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1
```
Parameter explanations:

- `--temp`: Temperature (0.0-2.0); higher values produce more random output
- `--top-p`: Nucleus sampling parameter (0.0-1.0)
- `--repeat-penalty`: Repetition penalty to avoid repetitive content
## Model Information
- Base Model: twinkle-ai/gemma-3-4B-T1-it
- Languages: English, Traditional Chinese
- License: Gemma
- Format: GGUF (converted via GGUF-my-repo)
## Training Data
- Taiwan reasoning and instruction datasets
- Contract review and legal documents
- Multimodal and long-form content
- Instruction-following examples
## Benchmarks
- TMMLU+: 47.44% accuracy
- MMLU: 59.13% accuracy
- TW Legal Benchmark: 44.18% accuracy
## Troubleshooting

### Common Issues
**Q: Getting out-of-memory errors?**

A: Try a smaller quantized version such as `q4_k_m`, or reduce the context length with the `-c` parameter.
**Q: How can I speed up inference?**

A:

- Use GPU acceleration (add hardware-specific flags during compilation and offload layers at run time, as in the sketch below)
- Choose a smaller quantized model (like `q4_k_m`)
- Reduce the context length
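If llama.cpp was built with GPU support, layers can be offloaded at run time with `-ngl` (`--n-gpu-layers`); a minimal sketch:

```bash
# Offload as many layers as fit on the GPU
./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -ngl 99 \
  -p "Your prompt here"
```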
**Q: What prompt format does the model support?**

A: This is an instruction-tuned model. Use a clear instruction format, for example:

```
Please analyze the main clauses of the following contract: [contract content]
```
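With `llama-cli`, conversation mode (`-cnv`) applies the chat template stored in the GGUF metadata automatically. If you construct raw prompts yourself, Gemma-family models expect the Gemma turn format; a sketch, assuming this fine-tune keeps the base Gemma 3 template:

```
<start_of_turn>user
Please analyze the main clauses of the following contract: [contract content]<end_of_turn>
<start_of_turn>model
```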
## Links

- Base model: [twinkle-ai/gemma-3-4B-T1-it](https://huggingface.co/twinkle-ai/gemma-3-4B-T1-it)
- llama.cpp: https://github.com/ggerganov/llama.cpp
## Contributing
If you have any questions or suggestions, please feel free to open a discussion in the Hugging Face repository.
**Note:** On first run, llama.cpp will automatically download the model file from Hugging Face. Please ensure you have a stable internet connection.
