Tags: Text Generation · GGUF · English · Hindi · legal · unsloth · turboquant · edge-ai · slm · conversational

⚖️ Vidhik AI: Sovereign Legal SLM (1B)


📌 Model Summary

Vidhik AI is a highly optimized, domain-specific Small Language Model (SLM) engineered for the Indian judiciary and the MSME sector. Fine-tuned from a 1B-parameter base model, it specializes in drafting formal legal notices (e.g., for delayed payments under the MSMED Act), analyzing case law, and navigating complex Indian officialese ("Babu-speak").

Built with a focus on Edge Compute, this model is designed to run locally on highly constrained hardware (like a 4GB GTX 1050) while retaining the ability to process massive context windows using Google TurboQuant.

  • Developer: Gaurav / Bhishaj Technologies
  • Base Model: Llama-3.2-1B-Instruct
  • Language(s): English, Hindi (Indic Legal Terminology)
  • License: Llama 3.2 Community License

🛠️ Training & MLOps Architecture

To work around the local hardware constraint (4GB of VRAM), the model was trained using a hybrid cloud-edge pipeline:

1. Data Engineering

  • Corpus: Curated and filtered Indian Legal QA datasets (Techmaestro369/indian-legal-texts-finetuning) and multilingual judiciary data (coild-aikosh/Judiciary_v2).
  • Formatting: Converted raw unstructured legal texts into strict Alpaca/ShareGPT instruction formats for deterministic instruction following.
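The raw-to-instruction conversion can be sketched as follows. This is a minimal stdlib-only illustration; the actual dataset schemas and field names (`question`, `context`, `answer`) are assumptions, not the real column names of the datasets above:

```python
# Hypothetical sketch: map a raw legal QA record into the Alpaca
# instruction format. Field names are illustrative assumptions.
def to_alpaca(record: dict) -> dict:
    """Convert a raw {question, context, answer} record to Alpaca fields."""
    return {
        "instruction": record["question"].strip(),
        "input": record.get("context", "").strip(),
        "output": record["answer"].strip(),
    }

raw = {
    "question": "What does Section 15 of the MSMED Act require?",
    "context": "",
    "answer": "Payment to the supplier within 45 days of acceptance.",
}
example = to_alpaca(raw)
print(example["instruction"])
```

Keeping every record in one strict schema is what makes instruction following deterministic: the model only ever sees one prompt shape at training time.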

2. Fine-Tuning Setup

  • Compute: Kaggle Dual T4 GPUs (32GB VRAM combined).
  • Optimization: Utilized Unsloth for a 70% VRAM reduction during fine-tuning, accelerating the training process by 2x.
  • Methodology: Parameter-Efficient Fine-Tuning (PEFT) using QLoRA.
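The VRAM savings of QLoRA come from its low-rank structure: the 4-bit-quantized base weights stay frozen, and only two small adapter matrices are trained per layer. A stdlib-only toy sketch of the adapter arithmetic (not the actual training code; dimensions are tiny for clarity):

```python
# Toy QLoRA/LoRA illustration: the frozen base weight W is never updated;
# training learns low-rank matrices A (r x d) and B (d x r), giving the
# effective weight W + (alpha / r) * B @ A.
d, r, alpha = 4, 2, 4  # hidden size, LoRA rank, LoRA scaling (illustrative)

W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A = [[0.1] * d for _ in range(r)]   # trainable, r x d
B = [[0.0] * r for _ in range(d)]   # trainable, d x r (initialised to zero)

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

delta = matmul(B, A)  # d x d low-rank update
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
         for i in range(d)]

# With B initialised to zero, the adapter starts as an exact no-op:
print(W_eff == W)  # True
```

Only `2 * r * d` values per adapted matrix are trainable instead of `d * d`, which is why the full fine-tune fits comfortably on the dual-T4 setup.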

3. Guardrails & Alignment

  • Trained with strict negative stop-sequences and deterministic decoding parameters (Temperature = 0.0) to cure the base model of MCQ-loop hallucinations.
  • Aligned to a "Senior Advocate, Supreme Court of India" persona for formal, zero-fluff document generation.
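The stop-sequence guardrail can be enforced at inference time as well. A hedged stdlib sketch (the stop strings below are examples, not the actual trained sequences):

```python
# Illustrative guardrail: truncate a generation at the earliest stop
# sequence, preventing the MCQ-loop drift described above.
STOP_SEQUENCES = ["\nQ1.", "\n(a)", "###"]  # example stop strings

def truncate_at_stop(text: str, stops=STOP_SEQUENCES) -> str:
    """Cut the generated text at the earliest stop sequence, if any."""
    cut = len(text)
    for s in stops:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

draft = "NOTICE under Section 15, MSMED Act...\nQ1. Which of the following"
print(truncate_at_stop(draft))  # -> "NOTICE under Section 15, MSMED Act..."
```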

⚡ Edge Deployment & Google TurboQuant (2026)

This model is specifically compiled to run on legacy/constrained hardware.

By utilizing Google TurboQuant, the model compresses the KV-cache to 3 bits at runtime. This allows for 128k context windows (essential for processing long Indian government gazettes and Supreme Court rulings) without triggering out-of-memory (OOM) crashes on a 4GB GPU, while maintaining a throughput of ~24.5 tokens/sec.
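To see why cache compression matters at this scale, a quick back-of-the-envelope estimate. The attention dimensions below are assumed from the public Llama-3.2-1B config (16 layers, 8 KV heads, head dim 64), and the 3-bit figure follows the TurboQuant claim above:

```python
# KV-cache sizing for a 128k context, assuming Llama-3.2-1B attention dims.
layers, kv_heads, head_dim = 16, 8, 64
seq_len = 128 * 1024  # 131072 tokens

def kv_cache_gib(bits_per_value: float) -> float:
    # K and V each store layers * kv_heads * head_dim values per token
    values = 2 * layers * kv_heads * head_dim * seq_len
    return values * bits_per_value / 8 / 1024**3

print(f"FP16 cache:  {kv_cache_gib(16):.2f} GiB")  # 4.00 GiB -- alone fills a 4GB GPU
print(f"3-bit cache: {kv_cache_gib(3):.2f} GiB")   # 0.75 GiB
```

Under these assumptions, an uncompressed FP16 cache at full context would by itself consume the entire 4GB card before weights and activations, which is exactly the OOM scenario the compressed cache avoids.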

💻 Usage: Running Locally (TurboQuant Enabled)

To run with the 4-bit compressed KV-cache on your local machine:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache

repo_id = "Bhishaj/Vidhik-Llama-1B-GGU"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,  # half precision to fit in 4GB VRAM
    device_map="cuda",
)

# Initialize the TurboQuant 4-bit KV-cache for 4GB VRAM support
tq_cache = TurboQuantCache(bits=4, compute_device="cuda")

prompt = "TASK: Draft a formal legal notice for my client 'M/s Vidhik Electronics' under MSMED Act Sections 15 & 16."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate with the compressed context
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        past_key_values=tq_cache,  # inject the TurboQuant cache
        max_new_tokens=512,
        do_sample=False,           # greedy decoding (deterministic, per the guardrails above)
    )

print(tokenizer.decode(outputs[0], skip_special_tokens=True))