# RTH-LM: A Fractal Temporal Convolutional Language Model

**Author:** Christian Quintino De Luca
**Affiliation:** RTH Italia (Research & Technology Hub), Milan, Italy
**Email:** [info@rthitalia.com](mailto:info@rthitalia.com)
**Date:** February 2026
**License:** CC BY-NC 4.0 (Research) / Commercial License (RTH Italia)
**Repository:** [https://github.com/rthgit/ZetaGrid](https://github.com/rthgit/ZetaGrid)

---

## Abstract

We introduce **RTH-LM**, a **Fractal Gated Causal Temporal Convolutional Network (TCN)** for language modeling, designed as an alternative to attention-centric architectures. RTH-LM targets linear-time inference in sequence length and improved data/compute efficiency under constrained training regimes. The model family is organized around a modular separation between a compact shared frozen core (the **Genome**) and trainable low-rank adapters (the **Soul**), enabling rapid domain specialization with minimal update artifacts.

This paper presents: (i) the Fractal TCN backbone and scaling strategy, (ii) the Genome/Soul modular deployment design, and (iii) initial training signals showing stable convergence under deliberately restricted data. We also provide a conservative hardware feasibility analysis indicating that a **120B-parameter scaled variant**, deployed with **2-bit weight-only quantization**, can fit on **80GB-class GPUs**, depending on runtime state and inference-engine overhead. Finally, we outline a practical pathway to a **1T-parameter vision** using a compact **8–9× H100 80GB cluster** for sharded inference and incremental expansion workflows.

---

## 1. Introduction

Transformer architectures dominate modern language modeling, but their operational profile often becomes the limiting factor in real deployments: quadratic attention cost, high memory pressure at long context, and a prevailing reliance on extremely large pretraining corpora.
These constraints raise barriers for independent research groups and small companies, and they increase energy requirements at scale.

**RTH-LM** explores a different axis of scaling: **architectural structure over brute-force data scaling**. The core hypothesis is that long-range dependency modeling can be achieved with deep causal temporal convolutions combined with gating and a fractal block expansion strategy, reducing reliance on explicit attention mechanisms.

### 1.1 Contributions

1. **Fractal Gated Causal TCN Backbone**: a deep, dilated, causal convolutional stack with gating and residual routing for autoregressive language modeling, designed for linear-time inference in sequence length.
2. **Genome/Soul Modularity**: a deployable separation between a shared frozen Genome core and trainable Soul adapters (LoRA-style), enabling fast specialization with minimal retraining and small update artifacts.
3. **Constrained-Regime Training Signals**: training is intentionally performed on a small curated dataset to emphasize architectural learning dynamics and feasibility under limited compute/data.
4. **Conservative Memory & Hardware Feasibility**: a planning-grade VRAM model for a 120B scaled variant under 2-bit weight-only quantization, with explicit assumptions and bounded estimates.

### 1.2 Scope and Non-Claims

This paper focuses on feasibility, training stability, and modular deployment design. It does not claim parity with frontier-scale instruction-tuned Transformer systems trained on trillions of tokens. Instead, it addresses a narrower question: **how far can capacity and usability be pushed under tight data/compute constraints using a non-attention backbone?**

---

## 2. Background and Motivation

### 2.1 Why Replace Attention?

Attention provides flexible token-to-token routing, but it incurs costs that become dominant at long context and under high-throughput serving.
Many real-world deployments are constrained not by FLOPs alone but by memory bandwidth, allocator fragmentation, and context-state growth. **RTH-LM** aims to relax these constraints by performing temporal mixing with convolution and gating, relying on depth/dilation schedules rather than all-pairs interactions.

### 2.2 Temporal Convolutions for Sequence Modeling

Causal dilated convolutions can cover long receptive fields using dilation schedules. With sufficient depth, the model can integrate information across large contexts with predictable compute, which is attractive for streaming and on-prem inference.

---

## 3. RTH-LM Architecture

### 3.1 Overview

RTH-LM consists of: (i) a tokenizer and embeddings, (ii) a Fractal Gated Causal TCN backbone, (iii) an output head, and (iv) optional modular adapters (the **Soul**) for domain specialization. The backbone is organized as a stack of Fractal Blocks, each composed of multiple gated causal convolutional layers with residual pathways and normalization.

### 3.2 Gated Causal Convolutional Layer

Let $x \in \mathbb{R}^{T \times d}$ be the sequence representation. Each layer performs:

1. $h = \text{Conv1D}_{\text{causal, dilated}}(x)$
2. $[h_g, h_v] = \text{split}(W h)$
3. $y = \sigma(h_g) \odot \phi(h_v)$
4. $x' = x + \text{Dropout}(W_o y)$
5. $\hat{x} = \text{Norm}(x')$

### 3.3 Fractal Block Expansion

The fractal property refers to a scalable block composition strategy:

* A base model is built from repeated block templates (a repeated micro-architecture).
* Larger models are formed by mirroring/replicating block groups and re-initializing only minimal routing/scaling parameters.
* Scaling is followed by brief stabilization training and/or adapter re-optimization.

### 3.4 The Frozen Genome

RTH-LM defines a compact shared core parameter artifact, called the **Genome**, that is reused across instances.
In the reference configuration motivating this paper, the Genome is engineered to be storage-efficient (single-digit-GB class, depending on serialization and quantization).

### 3.5 The Liquid Soul (Adapters)

Domain adaptation is performed via trainable low-rank adapters, called the **Soul**. Souls are small relative to the full model footprint and can be swapped to change domain behavior (coding, technical writing, creative) without full retraining.

---

## 4. Training Methodology

### 4.1 Objective

RTH-LM is trained with the standard autoregressive next-token prediction objective:

$$\mathcal{L} = - \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$
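The objective above can be illustrated numerically. The sketch below is illustrative only: the `next_token_nll` function and its dictionary-of-probabilities interface are hypothetical toy constructs, not part of RTH-LM's actual training code, which would compute the same cross-entropy over logits in batched tensor form.

```python
import math

def next_token_nll(step_probs, tokens):
    """Autoregressive loss: sum over t of -log p(x_t | x_<t).

    step_probs : list where step_probs[t] is a dict mapping each
                 vocabulary token to the model's probability
                 p_theta(. | x_<t) at step t (toy interface).
    tokens     : the target sequence x_1 .. x_T.
    """
    return -sum(math.log(step_probs[t][tok]) for t, tok in enumerate(tokens))

# Toy 3-token vocabulary: a model that assigns high probability to
# each correct next token accumulates a small total loss.
step_probs = [
    {"a": 0.8, "b": 0.1, "c": 0.1},  # p(. | <bos>)
    {"a": 0.1, "b": 0.7, "c": 0.2},  # p(. | a)
    {"a": 0.2, "b": 0.2, "c": 0.6},  # p(. | a b)
]
loss = next_token_nll(step_probs, ["a", "b", "c"])
print(round(loss, 4))  # -(ln 0.8 + ln 0.7 + ln 0.6) ≈ 1.0906
```

Because the loss decomposes as a sum over positions, a causal backbone (such as the dilated TCN) can compute all T terms in a single parallel forward pass during training, while inference remains strictly left-to-right.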