Tags: Text Generation · GGUF · English · Hindi · legal · unsloth · turboquant · edge-ai · slm · conversational

⚖️ Vidhik AI: Sovereign Legal SLM (1B)


📌 Model Summary

Vidhik AI is a highly optimized, domain-specific Small Language Model (SLM) engineered for the Indian judiciary and the MSME sector. Fine-tuned from a 1B-parameter base model, it specializes in drafting formal legal notices (e.g., for delayed payments under the MSMED Act), analyzing case law, and navigating complex Indian officialese ("Babu-speak").

Built with a focus on Edge Compute, this model is designed to run locally on highly constrained hardware (like a 4GB GTX 1050) while retaining the ability to process massive context windows using Google TurboQuant.

  • Developer: Gaurav / Bhishaj Technologies
  • Base Model: Llama-3.2-1B-Instruct
  • Language(s): English, Hindi (Indic Legal Terminology)
  • License: Llama 3.2 Community License

🛠️ Training & MLOps Architecture

To work around the local hardware constraint (4GB of VRAM), the model was trained using a hybrid cloud-edge pipeline:

1. Data Engineering

  • Corpus: Curated and filtered Indian Legal QA datasets (Techmaestro369/indian-legal-texts-finetuning) and multilingual judiciary data (coild-aikosh/Judiciary_v2).
  • Formatting: Converted raw unstructured legal texts into strict Alpaca/ShareGPT instruction formats for deterministic instruction following.
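The raw-to-instruction conversion can be sketched as follows. This is a minimal stdlib-only illustration; the actual dataset schemas and field names (`question`, `context`, `answer`) are assumptions, not the real column names of the datasets above:

```python
# Hypothetical sketch: map a raw legal QA record into the Alpaca
# instruction format. Field names are illustrative assumptions.
def to_alpaca(record: dict) -> dict:
    """Convert a raw {question, context, answer} record to Alpaca fields."""
    return {
        "instruction": record["question"].strip(),
        "input": record.get("context", "").strip(),
        "output": record["answer"].strip(),
    }

raw = {
    "question": "What does Section 15 of the MSMED Act require?",
    "context": "",
    "answer": "Payment to the supplier within 45 days of acceptance.",
}
example = to_alpaca(raw)
print(example["instruction"])
```

Keeping every record in one strict schema is what makes instruction following deterministic: the model only ever sees one prompt shape at training time.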

2. Fine-Tuning Setup

  • Compute: Kaggle Dual T4 GPUs (32GB VRAM combined).
  • Optimization: Utilized Unsloth for a 70% VRAM reduction during fine-tuning, accelerating the training process by 2x.
  • Methodology: Parameter-Efficient Fine-Tuning (PEFT) using QLoRA.
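The VRAM savings of QLoRA come from its low-rank structure: the 4-bit-quantized base weights stay frozen, and only two small adapter matrices are trained per layer. A stdlib-only toy sketch of the adapter arithmetic (not the actual training code; dimensions are tiny for clarity):

```python
# Toy QLoRA/LoRA illustration: the frozen base weight W is never updated;
# training learns low-rank matrices A (r x d) and B (d x r), giving the
# effective weight W + (alpha / r) * B @ A.
d, r, alpha = 4, 2, 4  # hidden size, LoRA rank, LoRA scaling (illustrative)

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A = [[0.1] * d for _ in range(r)]   # trainable, r x d
B = [[0.0] * r for _ in range(d)]   # trainable, d x r (initialised to zero)

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

delta = matmul(B, A)  # d x d low-rank update
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
         for i in range(d)]

# With B initialised to zero, the adapter starts as an exact no-op:
print(W_eff == W)  # True
```

Only `2 * r * d` values per adapted matrix are trainable instead of `d * d`, which is why the full fine-tune fits comfortably on the dual-T4 setup.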

3. Guardrails & Alignment

  • Trained with strict negative stop-sequences and deterministic decoding parameters (Temperature = 0.0) to cure the base model of MCQ-loop hallucinations.
  • Aligned to a "Senior Advocate, Supreme Court of India" persona for formal, zero-fluff document generation.
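The stop-sequence guardrail can be enforced at inference time as well. A hedged stdlib sketch (the stop strings below are examples, not the actual trained sequences):

```python
# Illustrative guardrail: truncate a generation at the earliest stop
# sequence, preventing the MCQ-loop drift described above.
STOP_SEQUENCES = ["\nQ1.", "\n(a)", "###"]  # example stop strings

def truncate_at_stop(text: str, stops=STOP_SEQUENCES) -> str:
    """Cut the generated text at the earliest stop sequence, if any."""
    cut = len(text)
    for s in stops:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

draft = "NOTICE under Section 15, MSMED Act...\nQ1. Which of the following"
print(truncate_at_stop(draft))  # -> "NOTICE under Section 15, MSMED Act..."
```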

⚡ Edge Deployment & Google TurboQuant (2026)

This model is specifically compiled to run on legacy/constrained hardware.

By utilizing Google TurboQuant, the model compresses the KV-cache to 3 bits at runtime. This allows for 128k context windows (essential for processing long Indian government gazettes and Supreme Court rulings) without triggering out-of-memory (OOM) crashes on a 4GB GPU, while maintaining a throughput of ~24.5 tokens/sec.
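To see why cache compression matters at this scale, a quick back-of-the-envelope estimate. The attention dimensions below are assumed from the public Llama-3.2-1B config (16 layers, 8 KV heads, head dim 64), and the 3-bit figure follows the TurboQuant claim above:

```python
# KV-cache sizing for a 128k context, assuming Llama-3.2-1B attention dims.
layers, kv_heads, head_dim = 16, 8, 64
seq_len = 128 * 1024  # 131072 tokens

def kv_cache_gib(bits_per_value: float) -> float:
    # K and V each store layers * kv_heads * head_dim values per token
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * bits_per_value / 8 / 1024**3

print(f"FP16 cache:  {kv_cache_gib(16):.2f} GiB")  # 4.00 GiB -- alone fills a 4GB GPU
print(f"3-bit cache: {kv_cache_gib(3):.2f} GiB")   # 0.75 GiB
```

Under these assumptions, an uncompressed FP16 cache at full context would by itself consume the entire 4GB card before weights and activations, which is exactly the OOM scenario the compressed cache avoids.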

💻 Usage: Running Locally (TurboQuant Enabled)

To run with the 4-bit compressed KV-cache on your local machine:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

repo_id = "Bhishaj/Vidhik-Llama-1B-GGU"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,  # half precision to fit in 4GB VRAM
    device_map="cuda",
)

# Initialize the TurboQuant 4-bit KV-cache for 4GB VRAM support
tq_cache = TurboQuantCache(bits=4, compute_device="cuda")

prompt = "TASK: Draft a formal legal notice for my client 'M/s Vidhik Electronics' under MSMED Act Sections 15 & 16."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate with the compressed context
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        past_key_values=tq_cache,  # inject the TurboQuant cache
        max_new_tokens=512,
        do_sample=False,           # greedy decoding (deterministic, per the guardrails above)
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))