Libraries

How to use plawanrath/gemma-2-9b-it-magnitude-s50-pia with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="plawanrath/gemma-2-9b-it-magnitude-s50-pia")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
# Gemma 2 is a text-only causal LM, so AutoModelForCausalLM is the matching auto class
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("plawanrath/gemma-2-9b-it-magnitude-s50-pia")
model = AutoModelForCausalLM.from_pretrained("plawanrath/gemma-2-9b-it-magnitude-s50-pia")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

vLLM

How to use plawanrath/gemma-2-9b-it-magnitude-s50-pia with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "plawanrath/gemma-2-9b-it-magnitude-s50-pia"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "plawanrath/gemma-2-9b-it-magnitude-s50-pia",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
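
The same request can also be issued from Python. What follows is a minimal sketch that uses only the standard library. The helper names `build_chat_request` and `send` are hypothetical (they are not part of vLLM), and the default URL assumes the server started by the commands above is listening on localhost:8000.

```python
import json
import urllib.request

MODEL = "plawanrath/gemma-2-9b-it-magnitude-s50-pia"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON reply (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_chat_request("What is the capital of France?")
# With the server running, the reply text can be read like this:
# reply = send(request)
# print(reply["choices"][0]["message"]["content"])
```

For a server on a different port, such as the SGLang commands in this card that use port 30000, pass `base_url="http://localhost:30000"` to `build_chat_request`.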

Use Docker

docker model run hf.co/plawanrath/gemma-2-9b-it-magnitude-s50-pia

SGLang

How to use plawanrath/gemma-2-9b-it-magnitude-s50-pia with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "plawanrath/gemma-2-9b-it-magnitude-s50-pia" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "plawanrath/gemma-2-9b-it-magnitude-s50-pia",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "plawanrath/gemma-2-9b-it-magnitude-s50-pia" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "plawanrath/gemma-2-9b-it-magnitude-s50-pia",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use plawanrath/gemma-2-9b-it-magnitude-s50-pia with Docker Model Runner:
```
docker model run hf.co/plawanrath/gemma-2-9b-it-magnitude-s50-pia
```

metadata

license: gemma
base_model: google/gemma-2-9b-it
library_name: transformers
language:
  - en
tags:
  - pruning
  - magnitude
  - bias-evaluation
  - llm-compression
  - arxiv:2605.08137
  - research-only

gemma-2-9b-it — magnitude pruning at 50% target sparsity

⚠️ Research artifact only — not for production use. This model was created to study fairness degradation under weight pruning. The companion paper (IEEE AIIoT 2026) demonstrates that magnitude pruning at this sparsity level induces measurable bias amplification on the BBQ benchmark. Do not deploy this model in any user-facing or decision-making system.

Paper

Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI
Plawan Kumar Rath, Rahul Maliakkal. IEEE AIIoT 2026.

arXiv: https://arxiv.org/abs/2605.08137
Code: https://github.com/plawanrath/pruning-impact-analysis
Base model: google/gemma-2-9b-it
License: gemma (inherited from base model — see terms)

Pruning configuration

Method: magnitude
Target sparsity: 50%
Actual sparsity achieved: 50.11%
Zeroed parameters: 4,170,814,390 of 8,323,596,288 prunable (50.11%)
Prune wall time: 4.6s
Pruning scope: linear layers in transformer blocks (attention projections + MLP). Embeddings, LM head, and layer norms are untouched.
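Calibration set: not applicable to this checkpoint, since magnitude pruning needs no calibration data. The 128 samples from C4 (sequence length 2048) apply to the Wanda method only.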
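
The reported sparsity can be re-derived from the two parameter counts in the list above. This is a small sketch of that arithmetic; the helper names `sparsity_percent` and `count_zeros` are hypothetical, and the toy matrix only illustrates how the same ratio could be measured on real weights.

```python
ZEROED = 4_170_814_390    # zeroed parameters, from the list above
PRUNABLE = 8_323_596_288  # prunable parameters, from the list above


def sparsity_percent(zeroed, total):
    """Return the share of zeroed weights as a percentage."""
    return 100.0 * zeroed / total


def count_zeros(rows):
    """Count exact zeros and total entries in a nested list of weights."""
    zeros = sum(1 for row in rows for w in row if w == 0.0)
    total = sum(len(row) for row in rows)
    return zeros, total


actual = sparsity_percent(ZEROED, PRUNABLE)
print(f"actual sparsity: {actual:.2f}%")

# Toy stand-in for a pruned weight matrix (illustrative values only).
toy = [[0.0, 0.4, 0.0], [0.1, 0.0, -0.3]]
zeros, total = count_zeros(toy)
print(f"toy matrix sparsity: {sparsity_percent(zeros, total):.1f}%")
```

Rounded to two decimals, the first figure matches the 50.11% stated above.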

Method description. Classical magnitude-based unstructured pruning: weights with the smallest absolute values are zeroed.
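
As an illustration of the idea, here is a minimal pure-Python sketch of magnitude pruning on one flat list of weights. It is not the script used for this checkpoint (see the linked repository for that). The function name `magnitude_prune` is hypothetical, and two details are assumptions: the threshold is chosen per tensor, and ties are broken by position.

```python
def magnitude_prune(weights, sparsity):
    """Zero the `sparsity` fraction of entries with the smallest absolute value."""
    k = int(len(weights) * sparsity)  # number of weights to zero
    if k == 0:
        return list(weights)
    # Rank positions by |w|, smallest first; sorted() is stable, so ties keep index order.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


weights = [0.8, -0.05, 0.3, -0.9, 0.01, 0.4, -0.2, 0.07]
pruned = magnitude_prune(weights, 0.5)
print(pruned)  # [0.8, 0.0, 0.3, -0.9, 0.0, 0.4, 0.0, 0.0]
```

The four smallest magnitudes (0.01, 0.05, 0.07, 0.2) are zeroed and the rest are left unchanged. The list keeps its length, so the tensor stays dense.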

Reported metrics (from the paper)

| Metric | Value | Reference |
| --- | --- | --- |
| Perplexity (Tulu-3 SFT mix, 256×512) | 54.05 | dense baseline 8.94 (+504.6%) |
| SRS by category (s50) | Age: 0.179, Gender Identity: 0.031, Race/Ethnicity: 0.011, Religion: 0.070, SES: 0.019 | random-chance baseline ≈ 0.333 |
| Mean per-item inference latency (Apple Silicon, MLX) | 0.455 s | identical to the dense baseline; unstructured pruning provides no latency benefit on dense GEMM kernels (paper §V.B) |
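
The perplexity change in the table is a plain relative increase over the dense baseline. The sketch below reproduces it from the two reported values; the helper name `relative_increase_percent` is hypothetical.

```python
DENSE_PPL = 8.94    # dense baseline perplexity, from the table above
PRUNED_PPL = 54.05  # perplexity of this checkpoint, from the table above


def relative_increase_percent(baseline, value):
    """Return how much `value` exceeds `baseline`, as a percentage of `baseline`."""
    return 100.0 * (value - baseline) / baseline


increase = relative_increase_percent(DENSE_PPL, PRUNED_PPL)
print(f"+{increase:.1f}%")
```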

Important caveats for IoT / edge deployment

No storage savings. Unstructured pruning zeroes individual weights but keeps them in the dense float tensor. SafeTensors and GGUF do not exploit unstructured sparsity, so the on-disk size of this checkpoint is identical to the dense base model.
No latency savings. Dense GEMM kernels do not skip zero entries. Inference latency on Apple Silicon (MLX) and the majority of consumer GPUs / mobile NPUs is identical to the dense baseline.
Bias amplification may be invisible to perplexity-based eval. The paper's headline finding (the Smart Pruning Paradox): Wanda at 50% sparsity on Mistral-7B raises perplexity by only 3.5% but raises the Stereotype Reliance Score (SRS) by 83.7%, a 24× disparity. Deployment validation that relies on perplexity alone therefore provides false assurance.
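
The storage caveat can be made concrete with a back-of-the-envelope calculation over the prunable weights. This is a sketch under stated assumptions, not a measurement of the checkpoint: it assumes 16-bit weight values, and it compares two hypothetical sparse encodings (one 32-bit position per surviving weight, or a 1-bit-per-position mask) against the dense layout.

```python
PRUNABLE = 8_323_596_288  # prunable parameters, from the pruning configuration
ZEROED = 4_170_814_390    # zeroed parameters, from the pruning configuration
VALUE_BYTES = 2           # assumption: 16-bit weight values
INDEX_BYTES = 4           # assumption: one 32-bit position per surviving weight

nonzero = PRUNABLE - ZEROED

# Dense layout: every weight is stored, zero or not.
dense = PRUNABLE * VALUE_BYTES

# Hypothetical coordinate-style layout: value plus explicit position per nonzero.
indexed = nonzero * (VALUE_BYTES + INDEX_BYTES)

# Hypothetical bitmask layout: one bit per position plus the nonzero values.
bitmask = (PRUNABLE + 7) // 8 + nonzero * VALUE_BYTES

for name, size in [("dense", dense), ("indexed", indexed), ("bitmask", bitmask)]:
    print(f"{name:8s} {size / 1e9:6.2f} GB")
```

Under these assumptions the index-based layout is larger than the dense one at 50% sparsity. The bitmask layout is smaller, but only a container and kernels that understand it could benefit, and the caveat above notes that SafeTensors and GGUF store the tensor densely.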

Citation

@inproceedings{rath2026pruning,
  title         = {Weight Pruning Amplifies Bias: A Multi-Method Study of Compressed LLMs for Edge AI},
  author        = {Rath, Plawan Kumar and Maliakkal, Rahul},
  booktitle     = {Proc. IEEE AIIoT 2026},
  year          = {2026},
  eprint        = {2605.08137},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2605.08137}
}

Reproducibility

All pruning scripts, evaluation pipelines, and aggregated results: https://github.com/plawanrath/pruning-impact-analysis
BBQ benchmark (ambiguous condition only): Elfsong/BBQ
The figures in this card (actual_sparsity, prune time, etc.) are generated from pruning_meta.json, which ships in this repo.