How to use from
llama.cpp
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
Use Docker
docker model run hf.co/pavel-tolstyko/ggml-model-Q4_K_M:Q4_K_M
Quick Links

Model Card for TinyLlama-1.1B-Chat-v1.0 (Quantized)

This is a quantized version of TinyLlama-1.1B-Chat-v1.0.

Performance Evaluation

The quantized model was tested on the hellaswag dataset with the following results:

Metric Base Model Quantized Model Change
hellaswag accuracy 0.456 0.462 unchanged
hellaswag normalized accuracy 0.64 0.64 unchanged
eval time (GPU) - seconds 219.67 209.34 4.70% decrease

The quantized version of TinyLlama-1.1B-Chat-v1.0 maintains similar accuracy while achieving a 4.7% reduction in evaluation time. This evaluation was conducted using GPU resources on a subset of 100 hellaswag samples for expediency. For production purposes, it is recommended to perform a full evaluation.

Quantization Approach
The model was quantized to 4-bits using the Q4_K_M method with llama.cpp, specifically designed for optimized GPU performance. The following steps were used:

  1. Convert the original model to GGUF format:

    python ./llama.cpp/convert_hf_to_gguf.py ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/
    
  2. Quantize the GGUF model to 4-bit Q4_K_M:

./llama.cpp/build/bin/llama-quantize ./llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/ggml-model-Q4_K_M.gguf q4_k_m

Downloads last month
6
GGUF
Model size
1B params
Architecture
llama
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for pavel-tolstyko/ggml-model-Q4_K_M

Quantized
(149)
this model