Production deployment at scale
#363
by O96a - opened
Still the workhorse for many production RAG pipelines after all this time. We've been running Llama-3.1-8B on Oracle Free Tier with 4-bit quantization for async agent tasks — the balance between context length and latency is hard to beat.

One thing we've noticed: instruction-following degrades noticeably when you push past 60-70% context utilization. Has anyone benchmarked the sweet spot between packing context and maintaining reasoning quality? Also curious whether the instruct variant has better tool-calling alignment than the base model for agentic workflows.
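For concreteness, here's a minimal sketch of how we budget retrieved context against that threshold. The helper name, the 128K window figure, and the 0.65 cap are our own assumptions from internal runs, not official guidance:

```python
# Hypothetical helper: cap how much retrieved context we pack,
# staying under the ~60-70% utilization zone where we saw
# instruction-following degrade (our observation, not a benchmark).
MAX_CONTEXT = 128_000      # Llama-3.1 advertised window
UTILIZATION_CAP = 0.65     # conservative cap below the degradation zone

def retrieval_token_budget(prompt_tokens: int, reserved_output: int = 1024) -> int:
    """Tokens left for retrieved chunks after the system/user prompt
    and a reserved output allowance, under the utilization cap."""
    budget = int(MAX_CONTEXT * UTILIZATION_CAP) - prompt_tokens - reserved_output
    return max(budget, 0)
```

We then greedily pack ranked chunks until the budget is exhausted; would be great to see real benchmarks confirming (or refuting) where the cap should actually sit.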