# Azure Cloud Solution Architect - Qwen 3.5 0.8B (GRPO Merged)
A fully merged Qwen 3.5 0.8B model trained with GRPO (Group Relative Policy Optimization) to act as an Azure Cloud Solution Architect with structured reasoning capabilities. The LoRA adapters have been merged into the base weights, so the model is ready for deployment with no adapter loading needed.
## What This Model Does
- Answers multiple-choice Azure architecture questions with structured reasoning
- Produces output in `<REASONING>...</REASONING>` and `<SOLUTION>...</SOLUTION>` format
- References specific Azure services in its reasoning
- Trained with 4 reward signals: format compliance, answer correctness, Azure relevance, reasoning quality
## Example Output
**Question:** Which Azure service handles global load balancing?

A. Azure Load Balancer
B. Azure Front Door
C. Traffic Manager
D. Application Gateway
```
<REASONING>
Azure Front Door provides global HTTP/HTTPS load balancing with built-in CDN,
WAF, and SSL offloading. It operates at Layer 7 and routes traffic to the
closest healthy backend across regions. Azure Load Balancer is regional (Layer 4),
Traffic Manager is DNS-based (slower failover), and Application Gateway is
regional Layer 7. For global load balancing with low latency, Front Door is ideal.
</REASONING>
<SOLUTION>B</SOLUTION>
```
## Training Details
| Parameter | Value |
|---|---|
| Base Model | unsloth/Qwen3.5-0.8B |
| Method | SFT → GRPO with GSPO variant (`loss_type=dr_grpo`), then merged |
| LoRA Rank | 16 |
| SFT Dataset | thegovind/azure-architecture-vqa (1,678 train examples) |
| GRPO Dataset | thegovind/azure-architecture-grpo-benchmark (200 train / 51 eval) |
| SFT Training | 42.6 min, 210 steps, loss 0.6517 |
| GRPO Training | ~4 hr 40 min, 200 steps, peak reward 5.5/7.0 |
| Hardware | 1x NVIDIA RTX 4090 (24 GB) |
| Total Cost | ~$1.88 on vast.ai |
## Reward Functions (Rubric)
| Signal | Max Score | What It Measures |
|---|---|---|
| R1 — Format Compliance | +2.0 | Proper `<REASONING>` and `<SOLUTION>` XML tags |
| R2 — Answer Correctness | +3.0 | Exact match on A/B/C/D answer letter |
| R3 — Azure Relevance | +1.0 | Mentions relevant Azure services in reasoning |
| R4 — Reasoning Quality | +1.0 | Substantive reasoning (50–500 words) |
| **Total** | **7.0** | |
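The rubric above can be sketched as a simple scoring function. This is an illustrative reconstruction of the four signals, not the actual training code; the function name, the regex patterns, and the "mentions the word Azure" heuristic for R3 are assumptions.

```python
import re

def score_completion(completion: str, correct_letter: str) -> float:
    """Toy rubric scorer mirroring the four reward signals above."""
    score = 0.0
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", completion, re.DOTALL)
    solution = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", completion)
    # R1 — format compliance: both tag pairs present (+2.0)
    if reasoning and solution:
        score += 2.0
    # R2 — answer correctness: exact letter match (+3.0)
    if solution and solution.group(1) == correct_letter:
        score += 3.0
    # R3 — Azure relevance: reasoning mentions an Azure service (+1.0)
    if reasoning and "Azure" in reasoning.group(1):
        score += 1.0
    # R4 — reasoning quality: substantive length, 50–500 words (+1.0)
    if reasoning and 50 <= len(reasoning.group(1).split()) <= 500:
        score += 1.0
    return score

sample = (
    "<REASONING>" + " ".join(["Azure Front Door handles global routing."] * 15)
    + "</REASONING><SOLUTION>B</SOLUTION>"
)
print(score_completion(sample, "B"))  # 7.0
```

Because the signals are additive, a completion can earn partial credit (e.g. correct format and answer but thin reasoning scores 5.0), which gives GRPO a smoother gradient than a single pass/fail reward.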
## How to Use
```bash
pip install -q --upgrade transformers accelerate torch
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "thegovind/azure-architect-qwen35-0.8b-grpo-merged",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("thegovind/azure-architect-qwen35-0.8b-grpo-merged")

SYSTEM_PROMPT = (
    "You are an expert Azure Cloud Solution Architect. "
    "Provide reasoning in <REASONING></REASONING> tags, "
    "then your answer in <SOLUTION></SOLUTION> tags."
)

question = """Which Azure service is best for real-time fraud detection at scale?
A. Azure Batch
B. Azure Stream Analytics with Event Hubs
C. Azure SQL Database
D. Azure Blob Storage"""

prompt = (
    f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    f"<|im_start|>user\n{question}\n\n"
    "Provide reasoning in <REASONING></REASONING> tags and answer in "
    "<SOLUTION></SOLUTION> tags.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
print(tokenizer.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
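Downstream consumers usually need just the answer letter, so the generated text can be parsed with a small helper. This is a sketch (the helper name is ours); it relies only on the tag format the model was trained to emit:

```python
import re

def parse_response(text: str):
    """Extract (reasoning, answer letter) from a model completion.

    Returns None for either field if its tags are missing or malformed.
    """
    reasoning = re.search(r"<REASONING>(.*?)</REASONING>", text, re.DOTALL)
    solution = re.search(r"<SOLUTION>\s*([A-D])\s*</SOLUTION>", text)
    return (
        reasoning.group(1).strip() if reasoning else None,
        solution.group(1) if solution else None,
    )

reasoning, answer = parse_response(
    "<REASONING>Stream Analytics with Event Hubs ingests and scores events "
    "in real time.</REASONING>\n<SOLUTION>B</SOLUTION>"
)
print(answer)  # B
```

Since sampling is enabled (`temperature=0.7`), the model may occasionally drop a tag; checking for `None` before using the answer is safer than assuming well-formed output.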
## When to Use This vs. the LoRA Version
| Version | Use When |
|---|---|
| This (merged) | Deployment, inference servers, GGUF conversion, Foundry Local, vLLM |
| LoRA | Further fine-tuning, experimentation, saving storage (43 MB vs 1.6 GB) |
## Two-Stage Training Pipeline
```
Stage 1: SFT — "Learn Azure knowledge from 1,678 Q&A pairs"
  → Supervised fine-tuning on Azure Architecture Center content

Stage 2: GRPO — "Learn to reason through problems via RL with 4 reward signals"
  → Reinforcement learning with structured output format

Merge — LoRA adapters merged into base weights for easy deployment
  → This model
```
## Related Models & Resources