# Azure Cloud Solution Architect - Qwen 3.5 0.8B (GRPO Merged)
A fully merged Qwen 3.5 0.8B model trained with GRPO (Group Relative Policy Optimization) to act as an Azure Cloud Solution Architect with structured reasoning capabilities. The LoRA adapters have been merged into the base weights, so the model is ready for deployment with no adapter loading needed.
## What This Model Does
- Answers multiple-choice Azure architecture questions with structured reasoning
- Produces output in `<REASONING>...</REASONING>` and `<SOLUTION>...</SOLUTION>` format
- References specific Azure services in its reasoning
- Trained with 4 reward signals: format compliance, answer correctness, Azure relevance, reasoning quality
## Example Output
**Question:** Which Azure service handles global load balancing?

A. Azure Load Balancer
B. Azure Front Door
C. Traffic Manager
D. Application Gateway
```
<REASONING>
Azure Front Door provides global HTTP/HTTPS load balancing with built-in CDN,
WAF, and SSL offloading. It operates at Layer 7 and routes traffic to the
closest healthy backend across regions. Azure Load Balancer is regional (Layer 4),
Traffic Manager is DNS-based (slower failover), and Application Gateway is
regional Layer 7. For global load balancing with low latency, Front Door is ideal.
</REASONING>
<SOLUTION>B</SOLUTION>
```
## Training Details
| Parameter | Value |
|---|---|
| Base Model | unsloth/Qwen3.5-0.8B |
| Method | SFT → GRPO with GSPO variant (`loss_type=dr_grpo`), then merged |
| LoRA Rank | 16 |
| SFT Dataset | thegovind/azure-architecture-vqa (1,678 train examples) |
| GRPO Dataset | thegovind/azure-architecture-grpo-benchmark (200 train / 51 eval) |
| SFT Training | 42.6 min, 210 steps, loss 0.6517 |
| GRPO Training | ~4 hr 40 min, 200 steps, peak reward 5.5/7.0 |
| Hardware | 1x NVIDIA RTX 4090 (24 GB) |
| Total Cost | ~$1.88 on vast.ai |
## Reward Functions (Rubric)
| Signal | Max Score | What It Measures |
|---|---|---|
| R1 — Format Compliance | +2.0 | Proper `<REASONING>` and `<SOLUTION>` XML tags |
| R2 — Answer Correctness | +3.0 | Exact match on A/B/C/D answer letter |
| R3 — Azure Relevance | +1.0 | Mentions relevant Azure services in reasoning |
| R4 — Reasoning Quality | +1.0 | Substantive reasoning (50–500 words) |
| **Total** | **7.0** | |
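The rubric above can be sketched as a simple scoring function. This is an illustrative reconstruction of the four signals, not the actual training code; the function name, the regex patterns, and the "mentions the word Azure" heuristic for R3 are assumptions.

```python
import re

def score_completion(completion: str, correct_letter: str) -> float:
    """Toy rubric scorer mirroring the four reward signals above."""
    score = 0.0
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", completion, re.DOTALL)
    solution = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", completion)
    # R1 — format compliance: both tag pairs present (+2.0)
    if reasoning and solution:
        score += 2.0
    # R2 — answer correctness: exact letter match (+3.0)
    if solution and solution.group(1) == correct_letter:
        score += 3.0
    # R3 — Azure relevance: reasoning mentions an Azure service (+1.0)
    if reasoning and "Azure" in reasoning.group(1):
        score += 1.0
    # R4 — reasoning quality: substantive length, 50–500 words (+1.0)
    if reasoning and 50 <= len(reasoning.group(1).split()) <= 500:
        score += 1.0
    return score

sample = (
    "<REASONING>" + " ".join(["Azure Front Door handles global routing."] * 15)
    + "</REASONING><SOLUTION>B</SOLUTION>"
)
print(score_completion(sample, "B"))  # 7.0
```

Because the signals are additive, a completion can earn partial credit (e.g. correct format and answer but thin reasoning scores 5.0), which gives GRPO a smoother gradient than a single pass/fail reward.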
## How to Use
```bash
pip install -q --upgrade transformers accelerate torch
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "thegovind/azure-architect-qwen35-0.8b-grpo-merged",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-architect-qwen35-0.8b-grpo-merged")

SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)

question = """Which Azure service is best for real-time fraud detection at scale?
A. Azure Batch
B. Azure Stream Analytics with Event Hubs
C. Azure SQL Database
D. Azure Blob Storage"""

prompt = (
    f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    f"<|im_start|>user\n{question}\n\n"
    "Provide reasoning in <REASONING></REASONING> tags and answer in "
    "<SOLUTION></SOLUTION> tags.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
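Downstream consumers usually need just the answer letter, so the generated text can be parsed with a small helper. This is a sketch (the helper name is ours); it relies only on the tag format the model was trained to emit:

```python
import re

def parse_response(text: str):
    """Extract (reasoning, answer letter) from a model completion.

    Returns None for either field if its tags are missing or malformed.
    """
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", text)
    return (
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1) if solution else None,
    )

reasoning, answer = parse_response(
    "<REASONING>Stream Analytics with Event Hubs ingests and scores events "
    "in real time.</REASONING>\n<SOLUTION>B</SOLUTION>"
)
print(answer)  # B
```

Since sampling is enabled (`temperature=0.7`), the model may occasionally drop a tag; checking for `None` before using the answer is safer than assuming well-formed output.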
## When to Use This vs. the LoRA Version
| Version | Use When |
|---|---|
| This (merged) | Deployment, inference servers, GGUF conversion, Foundry Local, vLLM |
| LoRA | Further fine-tuning, experimentation, saving storage (43 MB vs 1.6 GB) |
## Two-Stage Training Pipeline
```
Stage 1: SFT — "Learn Azure knowledge from 1,678 Q&A pairs"
  → Supervised fine-tuning on Azure Architecture Center content

Stage 2: GRPO — "Learn to reason through problems via RL with 4 reward signals"
  → Reinforcement learning with structured output format

Merge — LoRA adapters merged into base weights for easy deployment
  → This model
```
## Related Models & Resources