---
license: gemma
language:
- en
- zh
base_model: twinkle-ai/gemma-3-4B-T1-it
library_name: transformers
tags:
- Taiwan
- SLM
- GGUF
- agent
datasets:
- lianghsun/tw-reasoning-instruct
- lianghsun/tw-contract-review-chat
- minyichen/tw-instruct-R1-200k
- minyichen/tw_mm_R1
- minyichen/LongPaper_multitask_zh_tw_R1
- nvidia/Nemotron-Instruction-Following-Chat-v1
metrics:
- accuracy
model-index:
- name: gemma-3-4B-T1-it
results:
- task:
type: question-answering
name: Single Choice Question
dataset:
name: tmmlu+
type: ikala/tmmluplus
config: all
split: test
revision: c0e8ae955997300d5dbf0e382bf0ba5115f85e8c
metrics:
- type: accuracy
value: 47.44
name: single choice
- task:
type: question-answering
name: Single Choice Question
dataset:
name: mmlu
type: cais/mmlu
config: all
split: test
revision: c30699e
metrics:
- type: accuracy
value: 59.13
name: single choice
- task:
type: question-answering
name: Single Choice Question
dataset:
name: tw-legal-benchmark-v1
type: lianghsun/tw-legal-benchmark-v1
config: all
split: test
revision: 66c3a5f
metrics:
- type: accuracy
value: 44.18
name: single choice
pipeline_tag: text-generation
---
# Gemma 3 4B T1-it GGUF Collection

GGUF quantized models converted from [twinkle-ai/gemma-3-4B-T1-it](https://huggingface.co/twinkle-ai/gemma-3-4B-T1-it) for use with [llama.cpp](https://github.com/ggerganov/llama.cpp).
## About
Gemma 3 4B T1-it is a small language model fine-tuned on Taiwan-focused datasets, supporting both English and Traditional Chinese. This repository provides multiple quantization formats optimized for different use cases.
## Available Models

| Model | Size | Use Case |
|---|---|---|
| `twinkle-ai-gemma-3-4B-T1-it-BF16.gguf` | Largest | Best quality, highest precision |
| `twinkle-ai-gemma-3-4B-T1-it-F16.gguf` | Large | High quality, good precision |
| `twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf` | Medium | Balanced quality and speed |
| `twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf` | Smallest | Fastest inference, lower memory |
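To keep a local copy instead of streaming the file on first run, a quantization can also be downloaded up front with `huggingface-cli` (a minimal sketch; the repository and file names follow the Quick Start example below and may differ for the other quantizations):

```bash
# Download one GGUF file into the current directory
huggingface-cli download thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  gemma-3-4b-t1-it-q8_0.gguf \
  --local-dir .
```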
## Quick Start
### Option 1: Using Hugging Face Hub (Recommended)
Install llama.cpp via Homebrew:

```bash
brew install llama.cpp
```
Run inference directly from Hugging Face:

```bash
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"
```
Start as a server:

```bash
llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
```
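Once the server is running, it exposes an OpenAI-compatible HTTP API. A minimal request sketch, assuming the default bind address of `127.0.0.1:8080`:

```bash
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Summarize the key clauses in this contract: ..."}
    ],
    "temperature": 0.7
  }'
```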
### Option 2: Build from Source
#### Step 1: Clone llama.cpp repository

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
#### Step 2: Build llama.cpp

Basic build (CPU only):

```bash
LLAMA_CURL=1 make
```

Hardware-specific build options:

**NVIDIA GPU (Linux):**

```bash
LLAMA_CUDA=1 LLAMA_CURL=1 make
```

**Apple Silicon (Mac):**

```bash
LLAMA_METAL=1 LLAMA_CURL=1 make
```

**AMD GPU (ROCm):**

```bash
LLAMA_HIPBLAS=1 LLAMA_CURL=1 make
```
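Note that recent llama.cpp releases have replaced the Makefile with a CMake build, so the `make` invocations above apply only to older checkouts. A rough CMake equivalent (the exact `-D` option names vary by version, so treat this as a sketch):

```bash
cmake -B build -DGGML_CUDA=ON   # drop the flag for a CPU-only build
cmake --build build --config Release
```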
#### Step 3: Run inference

```bash
./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"
```
#### Step 4: Start server (optional)

```bash
./llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
```
## Advanced Usage
### Choosing the Right Model
Select a model based on your needs:

- **Best Quality**: Use the `BF16` or `F16` versions (requires more memory)
- **Balanced**: Use the `Q8_0` version (recommended for most users)
- **Resource Constrained**: Use the `q4_k_m` version (suitable for devices with limited memory)
### Common Parameters

- `-p "prompt"`: Your input text for the model to respond to
- `-c 2048`: Context length (maximum number of tokens that can be processed)
- `--hf-repo`: Hugging Face repository name
- `--hf-file`: Model file name to use
### Adjusting Generation Parameters

```bash
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1
```
Parameter explanations:

- `--temp`: Temperature (0.0-2.0); higher values produce more random output
- `--top-p`: Nucleus sampling parameter (0.0-1.0)
- `--repeat-penalty`: Repetition penalty to avoid repetitive content
## Model Information
- Base Model: twinkle-ai/gemma-3-4B-T1-it
- Languages: English, Traditional Chinese
- License: Gemma
- Format: GGUF (converted via GGUF-my-repo)
## Training Data
- Taiwan reasoning and instruction datasets
- Contract review and legal documents
- Multimodal and long-form content
- Instruction-following examples
## Benchmarks
- TMMLU+: 47.44% accuracy
- MMLU: 59.13% accuracy
- TW Legal Benchmark: 44.18% accuracy
## Troubleshooting

### Common Issues
**Q: Getting out-of-memory errors?**

A: Try a smaller quantized version such as `q4_k_m`, or reduce the context length with the `-c` parameter.
**Q: How can I speed up inference?**

A:

- Use GPU acceleration (add hardware-specific flags during compilation and offload layers at run time, as in the sketch below)
- Choose a smaller quantized model (like `q4_k_m`)
- Reduce the context length
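If llama.cpp was built with GPU support, layers can be offloaded at run time with `-ngl` (`--n-gpu-layers`); a minimal sketch:

```bash
# Offload as many layers as fit on the GPU
./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -ngl 99 \
  -p "Your prompt here"
```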
**Q: What prompt format does the model support?**

A: This is an instruction-tuned model. Use a clear instruction format, for example:

```
Please analyze the main clauses of the following contract: [contract content]
```
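With `llama-cli`, conversation mode (`-cnv`) applies the chat template stored in the GGUF metadata automatically. If you construct raw prompts yourself, Gemma-family models expect the Gemma turn format; a sketch, assuming this fine-tune keeps the base Gemma 3 template:

```
<start_of_turn>user
Please analyze the main clauses of the following contract: [contract content]<end_of_turn>
<start_of_turn>model
```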
## Links

- Base model: [twinkle-ai/gemma-3-4B-T1-it](https://huggingface.co/twinkle-ai/gemma-3-4B-T1-it)
- llama.cpp: https://github.com/ggerganov/llama.cpp
## Contributing
If you have any questions or suggestions, please feel free to open a discussion in the Hugging Face repository.
**Note:** On first run, llama.cpp will automatically download the model file from Hugging Face. Please ensure you have a stable internet connection.
