---
license: gemma
language:
- en
- zh
base_model: twinkle-ai/gemma-3-4B-T1-it
library_name: transformers
tags:
- Taiwan
- SLM
- GGUF
- agent
datasets:
- lianghsun/tw-reasoning-instruct
- lianghsun/tw-contract-review-chat
- minyichen/tw-instruct-R1-200k
- minyichen/tw_mm_R1
- minyichen/LongPaper_multitask_zh_tw_R1
- nvidia/Nemotron-Instruction-Following-Chat-v1
metrics:
- accuracy
model-index:
- name: gemma-3-4B-T1-it
  results:
  - task:
      type: question-answering
      name: Single Choice Question
    dataset:
      name: tmmlu+
      type: ikala/tmmluplus
      config: all
      split: test
      revision: c0e8ae955997300d5dbf0e382bf0ba5115f85e8c
    metrics:
    - type: accuracy
      value: 47.44
      name: single choice
  - task:
      type: question-answering
      name: Single Choice Question
    dataset:
      name: mmlu
      type: cais/mmlu
      config: all
      split: test
      revision: c30699e
    metrics:
    - type: accuracy
      value: 59.13
      name: single choice
  - task:
      type: question-answering
      name: Single Choice Question
    dataset:
      name: tw-legal-benchmark-v1
      type: lianghsun/tw-legal-benchmark-v1
      config: all
      split: test
      revision: 66c3a5f
    metrics:
    - type: accuracy
      value: 44.18
      name: single choice
pipeline_tag: text-generation
---

# Gemma 3 4B T1-it GGUF Collection
GGUF quantized models converted from [twinkle-ai/gemma-3-4B-T1-it](https://huggingface.co/twinkle-ai/gemma-3-4B-T1-it) for use with llama.cpp.

![Gemma3-4B-T1-it](https://cdn-uploads.huggingface.co/production/uploads/62b085e6a14cbd643867d561/wqX8cTImAyfmdLxrrGW4L.png)

## About

Gemma 3 4B T1-it is a small language model fine-tuned on Taiwan-focused datasets, supporting both English and Traditional Chinese. This repository provides multiple quantization formats optimized for different use cases.

## Available Models

| Model | Size | Use Case |
| ----- | ---- | -------- |
| `twinkle-ai-gemma-3-4B-T1-it-BF16.gguf` | Largest | Best quality, highest precision |
| `twinkle-ai-gemma-3-4B-T1-it-F16.gguf` | Large | High quality, good precision |
| `twinkle-ai-gemma-3-4B-T1-it-Q8_0.gguf` | Medium | Balanced quality and speed |
| `twinkle-ai-gemma-3-4b-t1-it-q4_k_m.gguf` | Smallest | Fastest inference, lowest memory use |

## Quick Start

### Option 1: Using Hugging Face Hub (Recommended)

Install llama.cpp via Homebrew:

```bash
brew install llama.cpp
```

Run inference directly from Hugging Face:

```bash
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"
```

Start as a server:

```bash
llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
```

Once the server is running, you can query it over HTTP (see "Querying the Server" below).

### Option 2: Build from Source

#### Step 1: Clone the llama.cpp repository

```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```

#### Step 2: Build llama.cpp

Basic build (CPU only):

```bash
LLAMA_CURL=1 make
```

**Hardware-specific build options:**

- **NVIDIA GPU (Linux)**:

  ```bash
  LLAMA_CUDA=1 LLAMA_CURL=1 make
  ```

- **Apple Silicon (Mac)**:

  ```bash
  LLAMA_METAL=1 LLAMA_CURL=1 make
  ```

- **AMD GPU (ROCm)**:

  ```bash
  LLAMA_HIPBLAS=1 LLAMA_CURL=1 make
  ```

#### Step 3: Run inference

```bash
./llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here"
```

#### Step 4: Start the server (optional)

```bash
./llama-server --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -c 2048
```

## Advanced Usage

### Choosing the Right Model

Select a model based on your needs:

- **Best quality**: Use the `BF16` or `F16` version (requires more memory)
- **Balanced**: Use the `Q8_0` version (recommended for most users)
- **Resource constrained**: Use the `q4_k_m` version (suitable for devices with limited memory)

### Common Parameters

- `-p "prompt"`: The input text for the model to respond to
- `-c 2048`: Context length (the maximum number of tokens that can be processed)
- `--hf-repo`: Hugging Face repository name
- `--hf-file`: Model file name to use

### Adjusting Generation Parameters

```bash
llama-cli --hf-repo thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  --hf-file gemma-3-4b-t1-it-q8_0.gguf \
  -p "Your prompt here" \
  --temp 0.7 \
  --top-p 0.9 \
  --repeat-penalty 1.1
```

Parameter explanations:

- `--temp`: Temperature (0.0-2.0); higher values produce more random output
- `--top-p`: Nucleus sampling threshold (0.0-1.0)
- `--repeat-penalty`: Repetition penalty to discourage repetitive output
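### Querying the Server

llama-server exposes an OpenAI-compatible HTTP API. The following is a minimal sketch, assuming the server was started as shown above and is listening on its default port 8080; adjust the host and port if you passed different values:

```bash
# Send a chat request to llama-server's OpenAI-compatible endpoint.
# Assumes the server is running locally on the default port 8080.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Briefly introduce the night market culture of Taiwan."}
    ],
    "temperature": 0.7,
    "max_tokens": 256
  }'
```

Because the endpoint follows the OpenAI schema, existing OpenAI client libraries can generally be pointed at `http://localhost:8080/v1` without code changes.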
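### Downloading for Offline Use

If you would rather download a model file once and run it locally, `huggingface-cli` (from the `huggingface_hub` package) can fetch a single GGUF file, which `llama-cli` then loads with `-m`. A minimal sketch, using the Q8_0 repository referenced in the commands above:

```bash
# Fetch one GGUF file into the current directory.
# Requires the huggingface_hub package: pip install -U huggingface_hub
huggingface-cli download thliang01/gemma-3-4B-T1-it-Q8_0-GGUF \
  gemma-3-4b-t1-it-q8_0.gguf --local-dir .

# Run against the local file; no network access is needed after the download.
llama-cli -m ./gemma-3-4b-t1-it-q8_0.gguf -p "Your prompt here"
```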
## Model Information

- **Base Model**: twinkle-ai/gemma-3-4B-T1-it
- **Languages**: English, Traditional Chinese
- **License**: Gemma
- **Format**: GGUF (converted via [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo))

### Training Data

- Taiwan reasoning and instruction datasets
- Contract review and legal documents
- Multimodal and long-form content
- Instruction-following examples

### Benchmarks

- **TMMLU+**: 47.44% accuracy
- **MMLU**: 59.13% accuracy
- **TW Legal Benchmark**: 44.18% accuracy

## Troubleshooting

### Common Issues

**Q: Getting out-of-memory errors?**

A: Try a smaller quantized version such as `q4_k_m`, or reduce the context length with the `-c` parameter.

**Q: How can I speed up inference?**

A:
1. Use GPU acceleration (build with the hardware-specific flags shown above)
2. Choose a smaller quantized model (such as `q4_k_m`)
3. Reduce the context length

**Q: What prompt format does the model support?**

A: This is an instruction-tuned model. Use a clear instruction format, for example:

```text
Please analyze the main clauses of the following contract:
[contract content]
```

## Links

- [Original Model](https://huggingface.co/twinkle-ai/gemma-3-4B-T1-it)
- [llama.cpp Documentation](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)

## Contributing

If you have any questions or suggestions, please feel free to open a discussion in this Hugging Face repository.

---

**Note**: On first run, llama.cpp will automatically download the model file from Hugging Face. Please ensure you have a stable internet connection.