When Better Gradients Hurt: The Gradient-Architecture Interaction in Recursive Reasoning

This repository contains two HRM variants trained on Sudoku-Extreme. Together they complete the 2x2 factorial design (architecture x gradient method) from the disentanglement study. The counterintuitive result: full BPTT lowers exact accuracy on the hierarchical architecture.

Models

Variant	Gradient Method	Token Accuracy	Exact Accuracy
`hrm_1step`	1-step gradient (O(1) memory)	68.8%	9.2%
`hrm_fullbp`	Full BPTT (O(T) memory)	69.3%	8.7%

On the hierarchical architecture, full BPTT lowers exact accuracy from 9.2% to 8.7%, even though token accuracy rises slightly (68.8% to 69.3%). This is in sharp contrast to the 8.6x improvement seen on flat architectures.

Architecture

Base: sapientinc/HRM (Hierarchical Reasoning Model)
Type: Dual-network hierarchy (H-module + L-module)
H_layers / L_layers: 4 / 4
Hidden size: 384
Attention heads: 8
H_cycles / L_cycles: 2 / 2
ACT halt steps: 16
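
To relate these settings to the two gradient methods, the sketch below counts module updates per forward segment. It assumes the nested schedule of the upstream HRM implementation (each H cycle runs `L_cycles` L-module updates followed by one H-module update) and that the 1-step gradient keeps only the final L update and the final H update in the autograd graph. `HRMConfig` and `updates_per_segment` are hypothetical names used for illustration; they are not part of the HRM codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HRMConfig:
    """Hypothetical container for the settings listed above."""
    h_layers: int = 4
    l_layers: int = 4
    hidden_size: int = 384
    num_heads: int = 8
    h_cycles: int = 2
    l_cycles: int = 2
    halt_max_steps: int = 16


def updates_per_segment(cfg: HRMConfig) -> dict:
    """Count module updates in one forward segment (assumed nested schedule)."""
    l_updates = cfg.h_cycles * cfg.l_cycles  # L-module runs l_cycles times per H cycle
    h_updates = cfg.h_cycles                 # H-module runs once per H cycle
    total = l_updates + h_updates
    return {
        "l_updates": l_updates,
        "h_updates": h_updates,
        "total": total,
        # Assumption: the 1-step gradient backpropagates through the last
        # L update and the last H update only, whatever the cycle counts are.
        "in_graph_1step": 2,
        # Full BPTT keeps every update of the segment in the graph.
        "in_graph_fullbp": total,
    }


cfg = HRMConfig()
counts = updates_per_segment(cfg)
head_dim = cfg.hidden_size // cfg.num_heads          # 384 / 8 = 48
max_updates = counts["total"] * cfg.halt_max_steps   # upper bound over ACT segments
print(counts, head_dim, max_updates)
```

Under these assumptions a segment has 6 module updates, of which the 1-step gradient differentiates through 2 and full BPTT through all 6, which is where the O(1) versus O(T) memory difference comes from.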

Training

Dataset: Sudoku-Extreme -- 500 puzzles + 500 augmentations
Epochs: 10,000
Batch size: 512
Optimizer: AdamW (lr=1e-4, beta1=0.9, beta2=0.95)
Precision: bfloat16
Hardware: 2x NVIDIA Tesla T4 (16GB)
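
As a rough consistency check, the number of optimizer steps can be estimated from these settings. The sketch assumes one sampled example per base puzzle per epoch (so augmentations do not add to the per-epoch count) and a step count of epochs x puzzles / batch size, rounded down. Both are assumptions about the training loop, not something this card states, and `estimate_total_steps` is a hypothetical helper.

```python
def estimate_total_steps(epochs: int, base_puzzles: int, batch_size: int) -> int:
    """Estimated optimizer steps, assuming one sample per base puzzle per epoch."""
    return (epochs * base_puzzles) // batch_size


steps = estimate_total_steps(epochs=10_000, base_puzzles=500, batch_size=512)
print(steps)
```

Under these assumptions the estimate is 9765, which matches the step number in the checkpoint name `step_9765`.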

Usage

This model requires the HRM codebase from sapientinc/HRM.

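import torch
from utils.functions import load_model_class  # provided by the HRM codebase

# Load the checkpoint onto the GPU if one is available, otherwise onto the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = torch.load("hrm_1step/step_9765", map_location=device)

# Load into the HRM model (requires the HRM codebase and the training config)
# model = load_model_class(config)
# model.load_state_dict(ckpt["model"])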
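
If a checkpoint was saved from a model wrapped by `torch.compile`, its parameter names carry an `_orig_mod.` prefix and will not match an uncompiled model. Whether that applies to these checkpoints is an assumption. The helper below (`strip_compile_prefix`, a hypothetical name) is a minimal sketch that normalizes the keys of any state-dict-like mapping before `load_state_dict` is called.

```python
from typing import Any, Dict

COMPILE_PREFIX = "_orig_mod."  # prefix torch.compile adds to wrapped-module keys


def strip_compile_prefix(state_dict: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of state_dict with a leading '_orig_mod.' removed from each key."""
    return {
        (key[len(COMPILE_PREFIX):] if key.startswith(COMPILE_PREFIX) else key): value
        for key, value in state_dict.items()
    }


# Toy mapping standing in for a real checkpoint; the key names are made up.
example = {"_orig_mod.model.inner.weight": 1, "model.inner.bias": 2}
cleaned = strip_compile_prefix(example)
print(sorted(cleaned))
```

Keys without the prefix pass through unchanged, so the helper is safe to apply whether or not the model was compiled.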

Paper

When Better Gradients Hurt: The Gradient-Architecture Interaction in Recursive Reasoning

Key finding: full BPTT improves flat architectures by 8.6x (2.2% to 18.9%) but degrades HRM's hierarchical architecture (9.2% to 8.7% exact accuracy), so architecture and gradient method interact strongly. Hierarchy compensates for poor gradients but does not benefit from full BPTT.
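
The interaction can be quantified from the four figures quoted above as a difference of differences: the gain from full BPTT on the hierarchical architecture minus the gain on the flat one. This is a worked arithmetic sketch of the 2x2 design, not the paper's analysis code. It assumes the flat-architecture figures are exact accuracy, like the HRM ones, and the variable names are illustrative.

```python
# Accuracy (%) for each cell of the 2x2 design, as quoted above.
exact_acc = {
    ("flat", "1step"): 2.2,
    ("flat", "fullbp"): 18.9,
    ("hrm", "1step"): 9.2,
    ("hrm", "fullbp"): 8.7,
}

# Simple effect of full BPTT within each architecture (percentage points).
gain_flat = exact_acc[("flat", "fullbp")] - exact_acc[("flat", "1step")]
gain_hrm = exact_acc[("hrm", "fullbp")] - exact_acc[("hrm", "1step")]

# Interaction contrast: how much the gradient effect changes with architecture.
interaction = gain_hrm - gain_flat

# Relative improvement on the flat architecture.
ratio_flat = exact_acc[("flat", "fullbp")] / exact_acc[("flat", "1step")]

print(f"flat gain: {gain_flat:+.1f} pp, HRM gain: {gain_hrm:+.1f} pp")
print(f"interaction: {interaction:+.1f} pp, flat ratio: {ratio_flat:.1f}x")
```

A contrast of zero would mean the gradient method has the same effect on both architectures; here the flat gain is about 16.7 points against a 0.5 point loss for HRM.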

Citation

@article{jj2026interaction,
  title={When Better Gradients Hurt: The Gradient--Architecture Interaction in Recursive Reasoning},
  author={Jani, Jatin},
  journal={arXiv preprint},
  year={2026}
}
