How to use with the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="worthdoing/TinyLlama-1.1B-Chat-v1.0-GGUF",
	filename="tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf",
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
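
`create_chat_completion` returns an OpenAI-style completion dict. The sketch below shows how the assistant's reply can be pulled out of it; the response is mocked here so the snippet runs without the model file.

```python
# Mocked OpenAI-style response, shaped like what create_chat_completion returns.
response = {
    "choices": [
        {"message": {"role": "assistant", "content": "The capital of France is Paris."}}
    ]
}

# The reply text lives under choices[0].message.content.
reply = response["choices"][0]["message"]["content"]
print(reply)
```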

Author: Simon-Pierre Boucher


TinyLlama-1.1B-Chat-v1.0 - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

About

This is a GGUF quantized version of TinyLlama-1.1B-Chat-v1.0, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

An ultra-small Llama variant with minimal resource usage, suited to basic tasks.

Available Quantizations

| File | Quant | BPW | Size | Use Case |
|------|-------|-----|------|----------|
| tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf | Q4_K_M | 4.58 | ~0.6 GB | Recommended: best quality/size ratio |
| tinyllama-1.1b-chat-v1.0-Q5_K_M-worthdoing.gguf | Q5_K_M | 5.33 | ~0.7 GB | Higher quality, still fast |
| tinyllama-1.1b-chat-v1.0-Q8_0-worthdoing.gguf | Q8_0 | 7.96 | ~1.0 GB | Near-original quality |
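
The sizes in the table follow directly from the bits-per-weight figures: file size ≈ parameter count × BPW / 8, plus a small metadata overhead. A quick sanity check (assuming 1.1B parameters):

```python
PARAMS = 1.1e9  # TinyLlama-1.1B parameter count

def approx_size_gb(bpw: float, params: float = PARAMS) -> float:
    """Approximate GGUF file size in GB for a given bits-per-weight."""
    return params * bpw / 8 / 1e9

# Compare against the table above (~0.6, ~0.7, ~1.0 GB).
for quant, bpw in [("Q4_K_M", 4.58), ("Q5_K_M", 5.33), ("Q8_0", 7.96)]:
    print(f"{quant}: ~{approx_size_gb(bpw):.2f} GB")
```

The estimates (0.63, 0.73, 1.09 GB) line up with the listed file sizes once metadata and tensor-alignment overhead are accounted for.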

How to Use

With Ollama

# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create tinyllama-1.1b-chat-v1.0 -f Modelfile
ollama run tinyllama-1.1b-chat-v1.0

With llama.cpp

llama-cli -m tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
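When passing a raw prompt with `-p`, results are usually better if the prompt is wrapped in the model's chat template. A minimal sketch, assuming TinyLlama-1.1B-Chat-v1.0's Zephyr-style template:

```python
def tinyllama_prompt(user: str, system: str = "You are a helpful assistant.") -> str:
    """Build a Zephyr-style chat prompt (as used by TinyLlama-1.1B-Chat-v1.0)."""
    return (
        f"<|system|>\n{system}</s>\n"
        f"<|user|>\n{user}</s>\n"
        f"<|assistant|>\n"
    )

print(tinyllama_prompt("What is the capital of France?"))
```

The resulting string can be passed directly to `llama-cli -p`; alternatively, llama-cli's conversation mode applies the template for you.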

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:

Step 1 – Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved

Step 2 – Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This lossless baseline preserves the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents

Step 3 – K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
| Method | Bits per Weight | Strategy |
|--------|-----------------|----------|
| Q4_K_M | ~4.58 bpw | Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size. |
| Q5_K_M | ~5.33 bpw | Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase. |
| Q8_0 | ~7.96 bpw | Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size. |
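
The mixed-precision idea above can be sketched as a simple lookup. This is purely illustrative (not llama.cpp internals): it encodes the per-role quant assignments from the table, where the role names are my own shorthand.

```python
# Illustrative mapping of tensor role -> quant type for each scheme,
# following the table above. Not how llama.cpp represents this internally.
MIXED_SCHEMES = {
    "Q4_K_M": {"attention": "Q5_K", "output": "Q5_K", "ffn": "Q4_K"},
    "Q5_K_M": {"attention": "Q6_K", "output": "Q6_K", "ffn": "Q5_K"},
    "Q8_0":   {"attention": "Q8_0", "output": "Q8_0", "ffn": "Q8_0"},
}

def quant_for(scheme: str, tensor_role: str) -> str:
    """Return the quant type a scheme assigns to a given tensor role."""
    return MIXED_SCHEMES[scheme][tensor_role]

print(quant_for("Q4_K_M", "ffn"))        # FFN layers compressed more aggressively
print(quant_for("Q4_K_M", "attention"))  # attention layers keep higher precision
```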

Step 4 – Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file

Tools & Environment

  • llama.cpp: Used for both conversion and quantization – a widely adopted open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime

Recommended Hardware

| Quant | Min RAM | Recommended |
|-------|---------|-------------|
| Q4_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q5_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q8_0 | 4 GB | Mac with 8 GB+ RAM |

Tags

general, ultra-lightweight, edge


Quantized with corelm-model pipeline by worthdoing on 2026-04-17
