How to use with the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="worthdoing/TinyLlama-1.1B-Chat-v1.0-GGUF",
	filename="tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf",
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
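
`create_chat_completion` returns an OpenAI-style completion dict. The sketch below shows how the assistant's reply can be pulled out of it; the response is mocked here so the snippet runs without the model file.

```python
# Mocked OpenAI-style response, shaped like what create_chat_completion returns.
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "The capital of France is Paris."}}
    ]
}

# The reply text lives under choices[0].message.content.
reply = response["choices"][0]["message"]["content"]
print(reply)
```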

Author: Simon-Pierre Boucher


TinyLlama-1.1B-Chat-v1.0 - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

About

This is a GGUF quantized version of TinyLlama-1.1B-Chat-v1.0, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

An ultra-small Llama variant with minimal resource usage, suited to basic tasks.

Available Quantizations

| File | Quant | BPW | Size | Use Case |
|------|-------|-----|------|----------|
| tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf | Q4_K_M | 4.58 | ~0.6 GB | Recommended: best quality/size ratio |
| tinyllama-1.1b-chat-v1.0-Q5_K_M-worthdoing.gguf | Q5_K_M | 5.33 | ~0.7 GB | Higher quality, still fast |
| tinyllama-1.1b-chat-v1.0-Q8_0-worthdoing.gguf | Q8_0 | 7.96 | ~1.0 GB | Near-original quality |
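
The sizes in the table follow directly from the bits-per-weight figures: file size ≈ parameter count × BPW / 8, plus a small metadata overhead. A quick sanity check (assuming 1.1B parameters):

```python
PARAMS = 1.1e9  # TinyLlama-1.1B parameter count

def approx_size_gb(bpw: float, params: float = PARAMS) -> float:
    """Approximate GGUF file size in GB for a given bits-per-weight."""
    return params * bpw / 8 / 1e9

# Compare against the table above (~0.6, ~0.7, ~1.0 GB).
for quant, bpw in [("Q4_K_M", 4.58), ("Q5_K_M", 5.33), ("Q8_0", 7.96)]:
    print(f"{quant}: ~{approx_size_gb(bpw):.2f} GB")
```

The estimates (0.63, 0.73, 1.09 GB) line up with the listed file sizes once metadata and tensor-alignment overhead are accounted for.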

How to Use

With Ollama

# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create tinyllama-1.1b-chat-v1.0 -f Modelfile
ollama run tinyllama-1.1b-chat-v1.0

With llama.cpp

llama-cli -m tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
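When passing a raw prompt with `-p`, results are usually better if the prompt is wrapped in the model's chat template. A minimal sketch, assuming TinyLlama-1.1B-Chat-v1.0's Zephyr-style template:

```python
def tinyllama_prompt(user: str, system: str = "You are a helpful assistant.") -> str:
    """Build a Zephyr-style chat prompt (as used by TinyLlama-1.1B-Chat-v1.0)."""
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{user}</s>\n"
        f"<|assistant|>\n"
    )

print(tinyllama_prompt("What is the capital of France?"))
```

The resulting string can be passed directly to `llama-cli -p`; alternatively, llama-cli's conversation mode applies the template for you.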

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:

Step 1 – Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved

Step 2 – Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This lossless baseline preserves the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents

Step 3 – K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
| Method | Bits per Weight | Strategy |
|--------|-----------------|----------|
| Q4_K_M | ~4.58 bpw | Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size. |
| Q5_K_M | ~5.33 bpw | Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase. |
| Q8_0 | ~7.96 bpw | Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size. |
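
The mixed-precision idea above can be sketched as a simple lookup. This is purely illustrative (not llama.cpp internals): it encodes the per-role quant assignments from the table, where the role names are my own shorthand.

```python
# Illustrative mapping of tensor role -> quant type for each scheme,
# following the table above. Not how llama.cpp represents this internally.
MIXED_SCHEMES = {
    "Q4_K_M": {"attention": "Q5_K", "output": "Q5_K", "ffn": "Q4_K"},
    "Q5_K_M": {"attention": "Q6_K", "output": "Q6_K", "ffn": "Q5_K"},
    "Q8_0":   {"attention": "Q8_0", "output": "Q8_0", "ffn": "Q8_0"},
}

def quant_for(scheme: str, tensor_role: str) -> str:
    """Return the quant type a scheme assigns to a given tensor role."""
    return MIXED_SCHEMES[scheme][tensor_role]

print(quant_for("Q4_K_M", "ffn"))        # FFN layers compressed more aggressively
print(quant_for("Q4_K_M", "attention"))  # attention layers keep higher precision
```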

Step 4 – Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file

Tools & Environment

  • llama.cpp: Used for both conversion and quantization – a widely adopted open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime

Recommended Hardware

| Quant | Min RAM | Recommended |
|-------|---------|-------------|
| Q4_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q5_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q8_0 | 4 GB | Mac with 8 GB+ RAM |

Tags

general, ultra-lightweight, edge


Quantized with corelm-model pipeline by worthdoing on 2026-04-17
