⚖️ Vidhik AI: Sovereign Legal SLM (1B)
📌 Model Summary
Vidhik AI is a highly optimized, domain-specific Small Language Model (SLM) engineered for the Indian Judiciary and MSME sector. Fine-tuned on a 1B parameter base, it specializes in drafting formal legal notices (e.g., MSMED Act delayed payments), analyzing case law, and navigating complex Indian officialese ("Babu-speak").
Built with a focus on Edge Compute, this model is designed to run locally on highly constrained hardware (like a 4GB GTX 1050) while retaining the ability to process massive context windows using Google TurboQuant.
- Developer: Gaurav / Bhishaj Technologies
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Language(s): English, Hindi (Indic Legal Terminology)
- License: Llama 3.2 Community License
🛠️ Training & MLOps Architecture
To work around local hardware constraints (4GB VRAM), the model was trained via a hybrid cloud-edge pipeline:
1. Data Engineering
- Corpus: Curated and filtered Indian Legal QA datasets (Techmaestro369/indian-legal-texts-finetuning) and multilingual judiciary data (coild-aikosh/Judiciary_v2).
- Formatting: Converted raw, unstructured legal texts into strict Alpaca/ShareGPT instruction formats for deterministic instruction following.
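The conversion step can be sketched as follows. The raw field names ("question", "answer", "context") are assumptions for illustration; the actual schemas of the datasets named above may differ.

```python
# Sketch: mapping one raw legal QA record into the Alpaca instruction schema.
# NOTE: the raw field names ("question", "answer", "context") are assumed,
# not the documented schema of the datasets listed above.

def to_alpaca(record):
    """Convert one raw QA pair into Alpaca's instruction/input/output triple."""
    return {
        "instruction": record["question"].strip(),
        "input": record.get("context", ""),  # optional supporting passage
        "output": record["answer"].strip(),
    }

raw = {
    "question": "What interest applies to delayed payments under the MSMED Act?",
    "answer": "Compound interest at three times the RBI-notified bank rate (Section 16).",
}
print(to_alpaca(raw))
```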
2. Fine-Tuning Setup
- Compute: Kaggle Dual T4 GPUs (32GB VRAM combined).
- Optimization: Utilized Unsloth for a 70% VRAM reduction during fine-tuning, accelerating the training process by 2x.
- Methodology: Parameter-Efficient Fine-Tuning (PEFT) using QLoRA.
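To see why QLoRA fits comfortably in the T4 budget, here is a back-of-envelope count of the trainable parameters LoRA adds. The layer shapes follow the published Llama-3.2-1B attention geometry (hidden size 2048, 16 layers, 8 KV heads of dimension 64); rank 16 and the q/k/v/o target-module list are illustrative assumptions, not the exact Vidhik AI recipe.

```python
# Back-of-envelope: extra trainable parameters added by LoRA adapters of rank r.
# Layer shapes approximate Llama-3.2-1B (hidden 2048, 16 layers, GQA with
# 8 KV heads x head_dim 64); rank and target modules are assumed, not published.

def lora_params(shapes, r=16):
    """LoRA adds two low-rank matrices, A (r x d_in) and B (d_out x r), per module."""
    return sum(r * (d_in + d_out) for d_out, d_in in shapes)

# q/k/v/o projection shapes for one decoder layer (KV projections are 512-wide)
layer = [(2048, 2048), (512, 2048), (512, 2048), (2048, 2048)]
total = lora_params(layer * 16, r=16)
print(f"{total / 1e6:.1f}M trainable LoRA params")  # → 3.4M, a tiny fraction of 1B
```

At roughly 3.4M trainable parameters, only the adapters (plus optimizer state) need gradients, while the 4-bit base weights stay frozen.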
3. Guardrails & Alignment
- Trained with strict negative stop-sequences and deterministic decoding parameters (temperature = 0.0) to cure the base model of MCQ-loop hallucinations.
- Aligned to a "Senior Advocate, Supreme Court of India" persona for formal, zero-fluff document generation.
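The decoding guardrails above translate into a small set of Hugging Face `model.generate` keyword arguments. This is a sketch: the concrete stop strings below are placeholders, since the trained negative stop-sequences are not listed in this card.

```python
# Deterministic decoding setup per the guardrails above.
# The stop strings are hypothetical placeholders, not the trained sequences.
gen_kwargs = {
    "do_sample": False,            # greedy decoding == deterministic ("temperature 0")
    "max_new_tokens": 512,
    "repetition_penalty": 1.15,    # discourages the MCQ-loop degeneration noted above
    "stop_strings": ["\nQ.", "\nOption A"],  # hypothetical negative stop-sequences
}
print(gen_kwargs["do_sample"])  # → False
```

In `transformers`, greedy decoding via `do_sample=False` is the idiomatic way to express "temperature 0"; note that `stop_strings` requires passing `tokenizer=tokenizer` to `model.generate`.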
⚡ Edge Deployment & Google TurboQuant (2026)
This model is specifically compiled to run on legacy/constrained hardware.
By utilizing Google TurboQuant, the model compresses the KV-cache to 3-bits during runtime. This allows for 128k context windows (essential for processing long Indian government gazettes and supreme court rulings) without triggering OOM (Out of Memory) crashes on a 4GB GPU, maintaining a throughput of ~24.5 tokens/sec.
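The arithmetic behind the 4GB claim is worth making explicit. The sketch below uses the published Llama-3.2-1B attention geometry (16 layers, 8 KV heads, head dimension 64) and estimates the KV cache alone, ignoring model weights and activations.

```python
# Back-of-envelope KV-cache sizing for Llama-3.2-1B at a 128k context,
# showing why 3-bit cache quantization matters on a 4GB card.
# Geometry: 16 layers, 8 KV heads (GQA), head_dim 64 -- the published config.

def kv_cache_bytes(seq_len, layers=16, kv_heads=8, head_dim=64, bits=16):
    # Two tensors (K and V) per layer, one head_dim vector per KV head per token
    return 2 * layers * kv_heads * head_dim * seq_len * bits // 8

fp16 = kv_cache_bytes(128_000, bits=16)
q3 = kv_cache_bytes(128_000, bits=3)
print(f"fp16: {fp16 / 2**30:.1f} GiB, 3-bit: {q3 / 2**30:.2f} GiB")
# → fp16: 3.9 GiB, 3-bit: 0.73 GiB  (the fp16 cache alone fills a 4GB GPU)
```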
💻 Usage: Running Locally (TurboQuant Enabled)
To enable the 3-bit KV-cache compression described above (roughly a 5x memory reduction versus an FP16 cache) on your local machine:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

repo_id = "Bhishaj/Vidhik-Llama-1B-GGU"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="cuda")

# Initialize the TurboQuant 3-bit KV cache for 4GB VRAM support
tq_cache = TurboQuantCache(bits=3, compute_device="cuda")

prompt = "TASK: Draft a formal legal notice for my client 'M/s Vidhik Electronics' under MSMED Act Sections 15 & 16."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate with the compressed context
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        past_key_values=tq_cache,  # inject the TurboQuant cache
        max_new_tokens=512,
        do_sample=False,  # greedy decoding: deterministic ("temperature 0.0")
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```