When Better Gradients Hurt: The Gradient-Architecture Interaction in Recursive Reasoning

This repository contains two HRM variants trained on Sudoku-Extreme. They complete the 2x2 factorial design (architecture x gradient method) from the disentanglement study. The result is counterintuitive: full BPTT lowers exact accuracy on the hierarchical architecture.

Models

Variant    | Gradient method               | Token accuracy | Exact accuracy
-----------|-------------------------------|----------------|---------------
hrm_1step  | 1-step gradient (O(1) memory) | 68.8%          | 9.2%
hrm_fullbp | Full BPTT (O(T) memory)       | 69.3%          | 8.7%

Full BPTT lowers exact accuracy on the hierarchical architecture (9.2% to 8.7%) even though token accuracy rises slightly (68.8% to 69.3%). This is in sharp contrast to the 8.6x improvement seen on flat architectures.
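The memory figures in the table follow from how far the backward pass reaches. The sketch below illustrates the difference on a toy scalar recurrence z_{t+1} = tanh(w * z_t + x). It is not the HRM implementation; the function names and the toy recurrence are assumptions made for illustration only. Full BPTT stores every state and differentiates through all T steps, while the 1-step approximation treats the state before the final step as a constant and differentiates through the last step only.

```python
import math


def unroll(w, x, z0, steps):
    """Run the toy recurrence and keep every state (O(T) memory)."""
    states = [z0]
    for _ in range(steps):
        states.append(math.tanh(w * states[-1] + x))
    return states


def grad_full_bptt(w, x, z0, steps):
    """d z_T / d w, backpropagating through all stored states."""
    states = unroll(w, x, z0, steps)
    grad_w = 0.0
    upstream = 1.0  # d z_T / d z_T
    for t in range(steps - 1, -1, -1):
        local = 1.0 - states[t + 1] ** 2  # derivative of tanh at step t
        grad_w += upstream * local * states[t]  # direct dependence on w
        upstream *= local * w  # pass the gradient back to z_t
    return grad_w


def grad_one_step(w, x, z0, steps):
    """1-step approximation: only the final step is differentiated.

    Only the most recent state is kept, so memory is O(1) in `steps`.
    """
    z_prev = z0
    for _ in range(steps - 1):
        z_prev = math.tanh(w * z_prev + x)  # treated as a constant
    z_last = math.tanh(w * z_prev + x)
    return (1.0 - z_last ** 2) * z_prev


w, x, z0, steps = 0.5, 0.3, 0.1, 8
full = grad_full_bptt(w, x, z0, steps)
approx = grad_one_step(w, x, z0, steps)
print(f"full BPTT: {full:.6f}  1-step: {approx:.6f}")
```

The 1-step value equals the last term of the full BPTT sum, so the two agree exactly when there is a single step and diverge as earlier steps contribute more to the true gradient.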

Architecture

  • Base: sapientinc/HRM (Hierarchical Reasoning Model)
  • Type: Dual-network hierarchy (H-module + L-module)
  • H_layers / L_layers: 4 / 4
  • Hidden size: 384
  • Attention heads: 8
  • H_cycles / L_cycles: 2 / 2
  • ACT halt steps: 16
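The settings above can be collected in a small config object for quick sanity checks. This is a sketch only: the class and field names are hypothetical and do not mirror the HRM codebase's config schema. The derived update counts assume the nested schedule described for HRM, in which the L-module runs L_cycles updates inside each of the H_cycles H-module updates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HRMConfigSketch:
    """Hypothetical container for the hyperparameters listed above."""

    h_layers: int = 4
    l_layers: int = 4
    hidden_size: int = 384
    num_heads: int = 8
    h_cycles: int = 2
    l_cycles: int = 2
    halt_max_steps: int = 16

    @property
    def head_dim(self) -> int:
        # Standard multi-head attention splits the hidden size evenly.
        if self.hidden_size % self.num_heads != 0:
            raise ValueError("hidden_size must be divisible by num_heads")
        return self.hidden_size // self.num_heads

    @property
    def l_updates_per_segment(self) -> int:
        # Assumption: L-module updates are nested inside each H cycle.
        return self.h_cycles * self.l_cycles

    @property
    def h_updates_per_segment(self) -> int:
        # Assumption: one H-module update per H cycle.
        return self.h_cycles


cfg = HRMConfigSketch()
print(cfg.head_dim, cfg.l_updates_per_segment, cfg.h_updates_per_segment)
```

Under these assumptions each attention head has 48 dimensions, and a run that uses all 16 ACT steps performs at most 16 x 4 = 64 L-module updates.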

Training

  • Dataset: Sudoku-Extreme -- 500 puzzles + 500 augmentations
  • Epochs: 10,000
  • Batch size: 512
  • Optimizer: AdamW (lr=1e-4, beta1=0.9, beta2=0.95)
  • Precision: bfloat16
  • Hardware: 2x NVIDIA Tesla T4 (16GB)
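The checkpoint name used in the Usage section, step_9765, is consistent with one reading of the numbers above. The sketch below is an inference from this card, not something it states: it assumes each epoch draws one example per base puzzle (with augmentations sampled rather than added to the epoch), and that 512 is the global batch size. The function name is hypothetical.

```python
def total_optimizer_steps(num_puzzles: int, epochs: int, batch_size: int) -> int:
    """Number of full batches seen over training, rounded down."""
    return (num_puzzles * epochs) // batch_size


steps = total_optimizer_steps(num_puzzles=500, epochs=10_000, batch_size=512)
print(steps)  # 9765
```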

Usage

These checkpoints require the HRM codebase from sapientinc/HRM.

import torch
from utils.functions import load_model_class

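# Load the checkpoint onto the GPU if one is available, otherwise onto the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
ckpt = torch.load("hrm_1step/step_9765", map_location=device)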

# Load into HRM model (requires HRM codebase)
# model = load_model_class(config)
# model.load_state_dict(ckpt["model"])

Paper

When Better Gradients Hurt: The Gradient-Architecture Interaction in Recursive Reasoning

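Key finding: Full BPTT improves flat architectures by 8.6x in exact accuracy (2.2% to 18.9%), but it lowers exact accuracy on HRM's hierarchical architecture (9.2% to 8.7%). Architecture and gradient method therefore interact strongly. With 1-step gradients, the hierarchy outperforms the flat model (9.2% vs. 2.2%), so it compensates for poor gradients. With full BPTT, the flat model overtakes it (18.9% vs. 8.7%), so the hierarchy fails to benefit from better gradients.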
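The interaction can be read directly from the four exact-accuracy figures quoted above. The following sketch computes the effect of full BPTT within each architecture and the difference between those effects (a difference-in-differences contrast). The dictionary keys and function name are illustrative; the numbers are the ones reported in this card.

```python
# Exact accuracy (%) for each cell of the 2x2 design, as reported above.
exact_acc = {
    ("flat", "1step"): 2.2,
    ("flat", "fullbp"): 18.9,
    ("hrm", "1step"): 9.2,
    ("hrm", "fullbp"): 8.7,
}


def bptt_effect(arch: str) -> float:
    """Change in exact accuracy (percentage points) from 1-step to full BPTT."""
    return exact_acc[(arch, "fullbp")] - exact_acc[(arch, "1step")]


flat_effect = bptt_effect("flat")  # about +16.7 points
hrm_effect = bptt_effect("hrm")  # about -0.5 points
interaction = hrm_effect - flat_effect  # about -17.2 points
flat_ratio = exact_acc[("flat", "fullbp")] / exact_acc[("flat", "1step")]

print(f"flat: {flat_effect:+.1f} pts ({flat_ratio:.1f}x)")
print(f"hrm:  {hrm_effect:+.1f} pts")
print(f"interaction contrast: {interaction:+.1f} pts")
```

A contrast near zero would mean full BPTT helps both architectures equally; here the two effects differ by roughly 17 percentage points and have opposite signs.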

Citation

@article{jj2026interaction,
  title={When Better Gradients Hurt: The Gradient--Architecture Interaction in Recursive Reasoning},
  author={Jani, Jatin},
  journal={arXiv preprint},
  year={2026}
}