# RAG vs. Fine-Tuning + Continual Learning
## Why RAG Falls Short — And What Happens When You Bake Knowledge Into the Model

I spent six months building RAG pipelines before I threw them all out. This is what I learned.

## Everyone Says Use RAG

Every tutorial, every Discord channel, every AI influencer says the same thing: store your documents in a vector database, retrieve relevant chunks at query time, feed them to the model. Simple.

It's simple. It also breaks in ways nobody warns you about.

## Where It Fell Apart

My first problem was chunking. Domain-specific documents are full of conditional logic and cross-references. Split them into chunks and you lose the context that makes them useful. I tried every chunk size. Every approach broke something else.

My second problem was retrieval. When I added a second document type alongside the first, the retriever started pulling the wrong context. The embedding model couldn't distinguish between similar terms used in different domains.

My third problem was the one that made me quit: the system never got smarter. Every query started from zero. The model didn't learn from the thousands of questions it had answered. It didn't build understanding. It just kept looking things up, over and over. I spent more time debugging retrieval failures than solving the actual problem.

## The Real Problem With RAG

RAG doesn't teach the model anything. It gives the model a cheat sheet to reference during the exam. Sometimes the cheat sheet has the right page open. Sometimes it doesn't. And the student never actually studies.

**Silent failures.** When retrieval misses the right document, the model doesn't say "I don't know." It confidently generates an answer from whatever it did retrieve. In regulated industries, that's not a minor inconvenience.

**Infrastructure that grows forever.** Every new document type means re-tuning your chunking, re-evaluating your embeddings, expanding your vector DB. The system gets more fragile as it gets bigger.

**Latency that compounds.**
Embedding, vector search, reranking, context injection, generation. That pipeline adds hundreds of milliseconds per query. At scale, that's real money.

**No actual understanding.** The model processes text. It doesn't understand your domain. It can't reason about your specific patterns because it has never learned them.

## What I Built Instead

I asked a simple question: what if the model just knew the answer? Not "looked it up." Knew it. The way a specialist knows their field — because they've internalized the knowledge, not because they're flipping through a reference manual.

That's what fine-tuning does. You train the model on your data, and the knowledge becomes part of the model's weights. No retrieval pipeline. No vector database. No chunking.

But fine-tuning has always had a fatal flaw: catastrophic forgetting. Train on domain A, and the model learns it. Train on domain B next, and it forgets A. Every new domain erases the last one. This is why RAG took over. Not because RAG was better — but because fine-tuning was destructive.

I spent a year solving the forgetting problem.

## Zero Forgetting. For Real.

I developed a cumulative-learning algorithm that constrains the fine-tuning process so new knowledge gets added without overwriting existing knowledge. I trained a single Mistral 7B model across five sequential domains. After all five, I tested it on every prior domain. The results:
- Retention BERTScores of 0.82–0.86 on all prior domains
- Backbone drift of -0.16% (3-seed average) — the model actually got slightly better on earlier domains
- Stable gradient norms throughout — no explosions, no collapse
- No replay buffers. No frozen parameters. No growing memory banks.

One model. Five domains. Everything retained.

Standard LoRA on the same data showed +43% forgetting on prior domains. My approach showed -0.16% drift. That's not an incremental improvement. That's a different category.

## The Comparison Nobody Is Making
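The starkest line in the comparison is the inference path itself. Here is a minimal sketch of the two paths; the stage names and per-stage millisecond figures are illustrative assumptions for the sketch, not benchmarks of any particular stack:

```python
# Illustrative RAG inference path vs. a fine-tuned model's path.
# Stage names and latency numbers are assumptions, not measurements.

RAG_STAGES_MS = {
    "embed_query": 20,     # encode the query into a vector
    "vector_search": 30,   # nearest-neighbor lookup in the vector DB
    "rerank": 80,          # cross-encoder rescoring of candidates
    "inject_context": 5,   # splice retrieved chunks into the prompt
}

def rag_overhead_ms(stages: dict) -> int:
    """Latency added *before* generation even starts."""
    return sum(stages.values())

def finetuned_overhead_ms() -> int:
    """A fine-tuned model goes straight to generation."""
    return 0

print(f"RAG pre-generation overhead: {rag_overhead_ms(RAG_STAGES_MS)} ms")
print(f"Fine-tuned pre-generation overhead: {finetuned_overhead_ms()} ms")
```

Every stage in that dictionary is also a component you have to deploy, tune, and monitor; the fine-tuned path has none of them.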
| RAG | ModelBrew |
| --- | --- |
| Looks up the answer every time | The model just knows it |
| Embed → search → rerank → generate | Generate. That's it. |
| Vector DB + embeddings + retriever + chunking | Just the model |
| Add a second domain, retrieval gets confused | Add a second domain, model knows both |
| 10,000 queries later, learned nothing | 5 domains later, remembers everything |
| Silent retrieval failures = confident wrong answers | Trained on verified data, -0.16% drift |
| Costs grow with every query | One-time training cost |

## When RAG Still Wins

RAG is the right tool when:
- Your data changes by the hour. Stock prices, live inventory, breaking news. You can't fine-tune fast enough for real-time data.
- You need to cite sources. RAG can point to the exact document. A fine-tuned model generates from patterns — it can't tell you which paragraph it learned from.
- You have millions of documents you just need searchable. Not every document needs to be memorized. Some just need to be findable.

But for domain expertise — protocols, institutional knowledge, product details, regulatory knowledge — the model should know it. Not look it up every time.

## The Architecture That Actually Makes Sense

Use both. But use them for what they're good at.

Fine-tuning with continual learning = long-term memory. The stuff that doesn't change often but matters deeply. Train it into the weights.

RAG = short-term memory. Today's updates. This week's tickets. The latest regulatory filing. Retrieve it at query time.

That's how your brain works. You don't Google how to do your job every morning. You know it. But you do check your inbox for today's updates.

## Why This Matters Now

RAG became the default because fine-tuning was broken. Catastrophic forgetting made it impossible to build a model that accumulated knowledge over time. The only option was to keep the model static and bolt on a retrieval layer.

That constraint doesn't exist anymore. The platform I built — https://modelbrew.ai — lets you train across unlimited domains sequentially with zero forgetting. It's live, it's deployed, and it's based on a year of independent research with a patent pending.

The question isn't "RAG or fine-tuning?" anymore. The question is: what should the model know permanently, and what should it look up?

That's the question worth asking.
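As a closing illustration, the long-term/short-term split described above can be wired with a thin router in front of the two paths. This is a minimal sketch, assuming a crude keyword heuristic; the hint list and function names are hypothetical, not a production design:

```python
# Route volatile, time-sensitive queries through retrieval; send stable
# domain questions straight to the fine-tuned model. The keyword list
# and functions are illustrative assumptions, not a production router.

VOLATILE_HINTS = ("today", "latest", "this week", "breaking", "current price")

def needs_retrieval(query: str) -> bool:
    """Crude short-term-memory check: does the query depend on fresh data?"""
    q = query.lower()
    return any(hint in q for hint in VOLATILE_HINTS)

def answer(query: str) -> str:
    if needs_retrieval(query):
        return f"[RAG path] retrieve fresh context, then generate: {query!r}"
    return f"[fine-tuned path] generate directly: {query!r}"

print(answer("What changed in the latest regulatory filing?"))
print(answer("Walk me through our refund protocol."))
```

In practice the router itself could be a classifier or even the model's own judgment; the point is that only the short-term path pays the retrieval tax.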