Azure Cloud Solution Architect - Qwen 3.5 0.8B (GRPO Merged)

A fully merged Qwen 3.5 0.8B model trained with GRPO (Group Relative Policy Optimization) to act as an Azure Cloud Solution Architect with structured reasoning capabilities. This release has the LoRA adapters merged into the base weights — ready for deployment with no adapter loading needed.

What This Model Does

  • Answers multiple-choice Azure architecture questions with structured reasoning
  • Produces output in <REASONING>...</REASONING> and <SOLUTION>...</SOLUTION> format
  • References specific Azure services in its reasoning
  • Trained with 4 reward signals: format compliance, answer correctness, Azure relevance, reasoning quality

Example Output

Question: Which Azure service handles global load balancing?
A. Azure Load Balancer  B. Azure Front Door  C. Traffic Manager  D. Application Gateway

<REASONING>
Azure Front Door provides global HTTP/HTTPS load balancing with built-in CDN, 
WAF, and SSL offloading. It operates at Layer 7 and routes traffic to the 
closest healthy backend across regions. Azure Load Balancer is regional (Layer 4), 
Traffic Manager is DNS-based (slower failover), and Application Gateway is 
regional Layer 7. For global load balancing with low latency, Front Door is ideal.
</REASONING>
<SOLUTION>B</SOLUTION>
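Downstream code can pull the reasoning and answer letter out of this tag format with a small regex; a minimal sketch (the `parse_response` helper name is illustrative, not part of the model or its tooling):

```python
import re

def parse_response(text: str):
    """Split a model response into (reasoning, answer_letter); None if a part is missing."""
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", text)
    return (
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1) if solution else None,
    )

example = "<REASONING>Front Door is global Layer 7.</REASONING>\n<SOLUTION>B</SOLUTION>"
reasoning, answer = parse_response(example)
print(answer)  # B
```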

Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | unsloth/Qwen3.5-0.8B |
| Method | SFT → GRPO with GSPO variant (`loss_type=dr_grpo`), then merged |
| LoRA Rank | 16 |
| SFT Dataset | thegovind/azure-architecture-vqa (1,678 train examples) |
| GRPO Dataset | thegovind/azure-architecture-grpo-benchmark (200 train / 51 eval) |
| SFT Training | 42.6 min, 210 steps, final loss 0.6517 |
| GRPO Training | ~4 hr 40 min, 200 steps, peak reward 5.5/7.0 |
| Hardware | 1× NVIDIA RTX 4090 (24 GB) |
| Total Cost | ~$1.88 on vast.ai |

Reward Functions (Rubric)

| Signal | Max Score | What It Measures |
|--------|-----------|------------------|
| R1 — Format Compliance | +2.0 | Proper `<REASONING>` and `<SOLUTION>` XML tags |
| R2 — Answer Correctness | +3.0 | Exact match on A/B/C/D answer letter |
| R3 — Azure Relevance | +1.0 | Mentions relevant Azure services in reasoning |
| R4 — Reasoning Quality | +1.0 | Substantive reasoning (50–500 words) |
| **Total** | **7.0** | |
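The four signals can be approximated with simple string checks; a hedged sketch of how such a rubric might look (function names, regexes, and thresholds are illustrative — the actual training rewards may differ):

```python
import re

def reward_format(completion: str) -> float:
    """R1: +2.0 if both tag pairs appear in order."""
    ok = re.search(r"<REASONING>.*?</REASONING>\s*<SOLUTION>.*?</SOLUTION>",
                   completion, re.DOTALL)
    return 2.0 if ok else 0.0

def reward_correctness(completion: str, gold: str) -> float:
    """R2: +3.0 for an exact match on the answer letter."""
    m = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", completion)
    return 3.0 if m and m.group(1) == gold else 0.0

def reward_azure_relevance(completion: str) -> float:
    """R3: +1.0 if the reasoning names an Azure service."""
    return 1.0 if re.search(r"\bAzure\s+\w+", completion) else 0.0

def reward_reasoning_quality(completion: str) -> float:
    """R4: +1.0 for substantive reasoning (50-500 words)."""
    m = re.search(r"<REASONING>(.*?)</REASONING>", completion, re.DOTALL)
    n = len(m.group(1).split()) if m else 0
    return 1.0 if 50 <= n <= 500 else 0.0
```

A completion that satisfies all four checks against the gold letter scores the full 7.0.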

How to Use

!pip install -q --upgrade transformers accelerate torch

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "thegovind/azure-architect-qwen35-0.8b-grpo-merged",
    torch_dtype=torch.float16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-architect-qwen35-0.8b-grpo-merged")

SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)

question = """Which Azure service is best for real-time fraud detection at scale?
A. Azure Batch
B. Azure Stream Analytics with Event Hubs
C. Azure SQL Database
D. Azure Blob Storage"""

prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<|im_start|>user\n{question}\n\nProvide reasoning in <REASONING></REASONING> tags and answer in <SOLUTION></SOLUTION> tags.<|im_end|>\n<|im_start|>assistant\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
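The hand-built ChatML prompt above can be factored into a small helper so multiple questions reuse the same layout (the `build_prompt` name is illustrative; `tokenizer.apply_chat_template` should produce an equivalent layout for this model family):

```python
def build_prompt(system: str, question: str) -> str:
    """Assemble a ChatML prompt matching the format used during training."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}\n\n"
        "Provide reasoning in <REASONING></REASONING> tags "
        "and answer in <SOLUTION></SOLUTION> tags.<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
```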

When to Use This vs. the LoRA Version

| Version | Use When |
|---------|----------|
| This (merged) | Deployment, inference servers, GGUF conversion, Foundry Local, vLLM |
| LoRA | Further fine-tuning, experimentation, saving storage (43 MB vs 1.6 GB) |

Two-Stage Training Pipeline

Stage 1: SFT — "Learn Azure knowledge from 1,678 Q&A pairs"
    → Supervised Fine-Tuning on Azure Architecture Center content

Stage 2: GRPO — "Learn to reason through problems via RL with 4 reward signals"
    → Reinforcement Learning with structured output format
    
Merge — LoRA adapters merged into base weights for easy deployment
    → This model
