Stuff I'm going to read - a gmongaras Collection

gmongaras 's Collections

2Mamba2Furious: Linear in Complexity...

Stuff I'm going to read

Stable Diffusion 3 Checkpoints

Cosine Attention (Cottention)

Stuff I'm going to read

updated 2 days ago

LTX-2: Efficient Joint Audio-Visual Foundation Model

Paper • 2601.03233 • Published Jan 6 • 176
MHLA: Restoring Expressivity of Linear Attention via Token-Level Multi-Head

Paper • 2601.07832 • Published Jan 12 • 52
Motion Attribution for Video Generation

Paper • 2601.08828 • Published Jan 13 • 72
Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

Paper • 2601.19895 • Published Jan 27 • 27
Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Paper • 2601.17367 • Published Jan 24 • 34
Advancing Open-source World Models

Paper • 2601.20540 • Published Jan 28 • 135
Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Paper • 2601.21709 • Published Jan 29 • 3
ERNIE 5.0 Technical Report

Paper • 2602.04705 • Published Feb 4 • 267
FASA: Frequency-aware Sparse Attention

Paper • 2602.03152 • Published Feb 3 • 154
LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Paper • 2602.08676 • Published Feb 9 • 70
MOVA: Towards Scalable and Synchronized Video-Audio Generation

Paper • 2602.08794 • Published Feb 9 • 159
OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Paper • 2602.05400 • Published Feb 5 • 352
When to Memorize and When to Stop: Gated Recurrent Memory for Long-Context Reasoning

Paper • 2602.10560 • Published Feb 11 • 31
Towards Autonomous Mathematics Research

Paper • 2602.10177 • Published Feb 10 • 36
MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models

Paper • 2602.10934 • Published Feb 11 • 49
Experiential Reinforcement Learning

Paper • 2602.13949 • Published Feb 15 • 72
BitDance: Scaling Autoregressive Generative Models with Binary Tokens

Paper • 2602.14041 • Published Feb 15 • 53
STAPO: Stabilizing Reinforcement Learning for LLMs by Silencing Rare Spurious Tokens

Paper • 2602.15620 • Published Feb 17 • 3
SLA2: Sparse-Linear Attention with Learnable Routing and QAT

Paper • 2602.12675 • Published Feb 13 • 58
VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training

Paper • 2602.10693 • Published Feb 11 • 220
Avey-B

Paper • 2602.15814 • Published Feb 17 • 3
Decoding as Optimisation on the Probability Simplex: From Top-K to Top-P (Nucleus) to Best-of-K Samplers

Paper • 2602.18292 • Published Feb 20 • 13
Test-Time Training with KV Binding Is Secretly Linear Attention

Paper • 2602.21204 • Published Feb 24 • 31
Memory Caching: RNNs with Growing Memory

Paper • 2602.24281 • Published Feb 27 • 10
Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling

Paper • 2603.04791 • Published Mar 5 • 20
LoGeR: Long-Context Geometric Reconstruction with Hybrid Memory

Paper • 2603.03269 • Published Mar 3 • 63
ReMix: Reinforcement routing for mixtures of LoRAs in LLM finetuning

Paper • 2603.10160 • Published Mar 10 • 26
V_{0.5}: Generalist Value Model as a Prior for Sparse RL Rollouts

Paper • 2603.10848 • Published Mar 11 • 14
Video-Based Reward Modeling for Computer-Use Agents

Paper • 2603.10178 • Published Mar 10 • 43
Attention Residuals

Paper • 2603.15031 • Published Mar 16 • 180
Efficient Exploration at Scale

Paper • 2603.17378 • Published Mar 18 • 14
MinerU-Diffusion: Rethinking Document OCR as Inverse Rendering via Diffusion Decoding

Paper • 2603.22458 • Published 27 days ago • 135
Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Paper • 2603.24472 • Published 25 days ago • 54
Voxtral TTS

Paper • 2603.25551 • Published 24 days ago • 59
Out of Sight but Not Out of Mind: Hybrid Memory for Dynamic Video World Models

Paper • 2603.25716 • Published 24 days ago • 155
LongCat-Next: Lexicalizing Modalities as Discrete Tokens

Paper • 2603.27538 • Published 21 days ago • 143
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

Paper • 2604.04921 • Published 14 days ago • 108
Think in Strokes, Not Pixels: Process-Driven Image Generation via Interleaved Reasoning

Paper • 2604.04746 • Published 12 days ago • 70
ELT: Elastic Looped Transformers for Visual Generation

Paper • 2604.09168 • Published 10 days ago • 19
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

Paper • 2604.08995 • Published 10 days ago • 46
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

Paper • 2604.10098 • Published 9 days ago • 74
Continuous Adversarial Flow Models

Paper • 2604.11521 • Published 7 days ago • 8
Efficient RL Training for LLMs with Experience Replay

Paper • 2604.08706 • Published 11 days ago • 17
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Paper • 2604.15308 • Published 4 days ago • 25