When Better Gradients Hurt: The Gradient-Architecture Interaction in Recursive Reasoning
Two HRM model variants trained on Sudoku-Extreme, completing the 2x2 factorial design (architecture x gradient method) from the disentanglement study. Counterintuitive result: full BPTT degrades the hierarchical architecture.
Models
| Variant | Gradient Method | Token Accuracy | Exact Accuracy |
|---|---|---|---|
| hrm_1step | 1-step gradient (O(1) memory) | 68.8% | 9.2% |
| hrm_fullbp | Full BPTT (O(T) memory) | 69.3% | 8.7% |
Full BPTT lowers exact accuracy on the hierarchical architecture from 9.2% to 8.7%, even though token accuracy rises slightly (68.8% to 69.3%). This is in sharp contrast to the 8.6x improvement seen on flat architectures.
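To make the O(1) versus O(T) memory distinction concrete, here is a minimal sketch on a toy scalar recurrence, `z_{t+1} = tanh(w * z_t + x)` with a squared-error loss on the final state. This is an illustration only, not HRM's training code: the recurrence, the loss, and all function names are assumptions chosen for the example. Full BPTT stores every intermediate state and backpropagates through all of them; the 1-step gradient keeps only the latest state and differentiates the final update alone.

```python
import math

def rollout(w, x, z0, T):
    """Iterate z_{t+1} = tanh(w * z_t + x); return all states z_0..z_T."""
    states = [z0]
    for _ in range(T):
        states.append(math.tanh(w * states[-1] + x))
    return states

def loss(w, x, z0, T, y):
    """Squared error between the final state z_T and the target y."""
    return (rollout(w, x, z0, T)[-1] - y) ** 2

def grad_full_bptt(w, x, z0, T, y):
    """Exact dL/dw. Needs all T+1 stored states, hence O(T) memory."""
    states = rollout(w, x, z0, T)
    dL_dz = 2.0 * (states[-1] - y)             # dL/dz_T
    grad = 0.0
    for t in range(T, 0, -1):
        local = 1.0 - states[t] ** 2           # tanh' at step t
        grad += dL_dz * local * states[t - 1]  # direct effect of w at step t
        dL_dz = dL_dz * local * w              # propagate to z_{t-1}
    return grad

def grad_one_step(w, x, z0, T, y):
    """1-step approximation: z_{T-1} is treated as a constant, so only the
    last update is differentiated. Keeps one state, hence O(1) memory."""
    z_prev = z0
    for _ in range(T - 1):
        z_prev = math.tanh(w * z_prev + x)     # history is discarded
    z_last = math.tanh(w * z_prev + x)
    return 2.0 * (z_last - y) * (1.0 - z_last ** 2) * z_prev

if __name__ == "__main__":
    args = (0.7, 0.3, 0.1, 8, 0.5)             # w, x, z0, T, y (arbitrary values)
    print("full BPTT:", grad_full_bptt(*args))
    print("1-step   :", grad_one_step(*args))
```

For `T = 1` the two functions agree exactly; for longer rollouts the 1-step gradient drops every term that flows through earlier states, which is the approximation being compared against full BPTT in the table above.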
Architecture
- Base: sapientinc/HRM (Hierarchical Reasoning Model)
- Type: Dual-network hierarchy (H-module + L-module)
- H_layers / L_layers: 4 / 4
- Hidden size: 384
- Attention heads: 8
- H_cycles / L_cycles: 2 / 2
- ACT halt steps: 16
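The settings above can be collected into a small configuration object. The sketch below is hypothetical: the class and field names are chosen for illustration and are not guaranteed to match the keys used by the HRM codebase. The derived quantity `l_updates_per_segment` additionally assumes that each H cycle runs `l_cycles` L-module updates, which is how the nested-cycle description reads but is not stated explicitly in this card.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HRMConfigSketch:
    """Illustrative container for the architecture settings listed above.
    Field names are assumptions, not the HRM codebase's actual config keys."""
    h_layers: int = 4
    l_layers: int = 4
    hidden_size: int = 384
    num_heads: int = 8
    h_cycles: int = 2
    l_cycles: int = 2
    act_halt_steps: int = 16

    @property
    def head_dim(self) -> int:
        # Hidden size must split evenly across attention heads.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    @property
    def l_updates_per_segment(self) -> int:
        # Assumption: every H cycle contains l_cycles L-module updates.
        return self.h_cycles * self.l_cycles

cfg = HRMConfigSketch()
print(cfg.head_dim, cfg.l_updates_per_segment)
```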
Training
- Dataset: Sudoku-Extreme -- 500 puzzles + 500 augmentations
- Epochs: 10,000
- Batch size: 512
- Optimizer: AdamW (lr=1e-4, beta1=0.9, beta2=0.95)
- Precision: bfloat16
- Hardware: 2x NVIDIA Tesla T4 (16GB)
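As a rough consistency check on these settings, the number of optimizer steps can be estimated from the epoch count and batch size. The sketch below assumes that each epoch consumes 500 examples (one per base puzzle) and that only full batches are counted; both are assumptions, not something this card states. Under that reading the result lines up with the `step_9765` checkpoint name used in the Usage section. The function name is hypothetical.

```python
def optimizer_steps(examples_per_epoch: int, epochs: int, batch_size: int) -> int:
    """Full batches seen when all epochs are streamed back to back."""
    return (examples_per_epoch * epochs) // batch_size

# Assumed: 500 examples per epoch; epochs and batch size are from the list above.
steps = optimizer_steps(examples_per_epoch=500, epochs=10_000, batch_size=512)
print(steps)  # 9765
```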
Usage
This model requires the HRM codebase from sapientinc/HRM.
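import torch
from utils.functions import load_model_class  # provided by the HRM codebase
# Load the checkpoint onto the GPU if one is available, otherwise onto the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = torch.load("hrm_1step/step_9765", map_location=device)
# Load into HRM model (requires HRM codebase)
# model = load_model_class(config)
# model.load_state_dict(ckpt["model"])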
Paper
When Better Gradients Hurt: The Gradient-Architecture Interaction in Recursive Reasoning
Key finding: While full BPTT improves flat architectures by 8.6x (2.2% to 18.9%), it degrades HRM's hierarchical architecture (9.2% to 8.7%). Architecture and gradient method have a strong interaction effect. Hierarchy compensates for poor gradients, but fails to benefit from full BPTT.
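The interaction can be read directly off the four exact-accuracy figures quoted above. The sketch below works through the arithmetic: the effect of switching to full BPTT within each architecture, and the difference between those two effects. The variable and function names are illustrative only.

```python
# Exact accuracy (%) for each cell of the 2x2 design, as quoted above.
exact = {
    ("flat", "1-step"): 2.2,
    ("flat", "full BPTT"): 18.9,
    ("hierarchical", "1-step"): 9.2,
    ("hierarchical", "full BPTT"): 8.7,
}

def bptt_effect(arch: str) -> float:
    """Change in exact accuracy (percentage points) from 1-step to full BPTT."""
    return exact[(arch, "full BPTT")] - exact[(arch, "1-step")]

flat_effect = round(bptt_effect("flat"), 1)          # 16.7 points gained
hier_effect = round(bptt_effect("hierarchical"), 1)  # -0.5 points (a loss)
interaction = round(hier_effect - flat_effect, 1)    # -17.2 points
flat_ratio = round(exact[("flat", "full BPTT")] / exact[("flat", "1-step")], 2)  # 8.59

print(flat_effect, hier_effect, interaction, flat_ratio)
```

With no interaction, the two effects would be roughly equal and their difference near zero; here the difference is about 17 percentage points, and the hierarchical effect has the opposite sign.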
Citation
@article{jj2026interaction,
title={When Better Gradients Hurt: The Gradient--Architecture Interaction in Recursive Reasoning},
author={Jani, Jatin},
journal={arXiv preprint},
year={2026}
}