Bug Report: Conversation history silently resets to a previous checkpoint

#40
by rdjarbeng - opened

Description

The ML Intern space occasionally reverts to a previous point in conversation history, showing an older message as the latest one with no visible error on screen. All messages sent after that checkpoint appear lost. When the agent is asked what it last remembers, it confirms the old message shown in the UI, meaning its context has rolled back too, not just the display.

I've experienced this more than once. Strangely, returning to the chat after a few days sometimes restores the missing messages, which makes the root cause harder to pin down.

Steps to reproduce

  1. Have a long, ongoing conversation in the ML Intern space
  2. Continue working (in my case, I was on v38 of a project)
  3. The page refreshes (or is revisited)
  4. The conversation appears reset to an earlier checkpoint and recent messages are gone, with the agent's context matching the old state

What I see in the browser console (DevTools)

Two errors appear that are likely related:

Failed to load resource: the server responded with a status of 404 ()
/api/session/a65b23f6-aa70-44de-95b7-4dc4716dc1f0/messages

Failed to persist backend messages: QuotaExceededError: Failed to execute 'setItem' on 'Storage': 
Setting the value of 'hf-agent-backend-messages' exceeded the quota.
    at Bb (index-CTDlGvua.js:248:61489)
    at Rf (index-CTDlGvua.js:248:61748)
    at index-CTDlGvua.js:304:3898

Analysis

The QuotaExceededError suggests that localStorage is being used to persist conversation history and that the storage limit is hit as conversations grow long, though the devs will be better placed to confirm this. It seems like a foreseeable edge case for long sessions.
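
To illustrate what I mean, here is a minimal sketch of a persist step that reports a quota failure to the caller instead of only logging it. I have not seen the ML Intern source, so `KeyValueStore`, `PersistResult`, and `persistMessages` are hypothetical names; only the `setItem` call and the `QuotaExceededError` name come from the console output above.

```typescript
// Hypothetical minimal storage shape; the browser's localStorage satisfies it.
interface KeyValueStore {
  setItem(key: string, value: string): void;
}

// Hypothetical result type so the UI can react instead of failing silently.
type PersistResult =
  | { ok: true; chars: number }
  | { ok: false; reason: "quota" | "other"; chars: number };

function isQuotaError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const name = (err as { name?: unknown }).name;
  // "QuotaExceededError" is the name shown in the console trace above.
  return name === "QuotaExceededError";
}

function persistMessages(
  store: KeyValueStore,
  key: string,
  messages: unknown[],
): PersistResult {
  const payload = JSON.stringify(messages);
  const chars = payload.length; // serialized size in characters, for reporting
  try {
    store.setItem(key, payload);
    return { ok: true, chars };
  } catch (err) {
    // Return the failure so the caller can warn the user.
    return { ok: false, reason: isQuotaError(err) ? "quota" : "other", chars };
  }
}
```

In the browser this would be called as `persistMessages(localStorage, "hf-agent-backend-messages", messages)`, and a `reason` of `"quota"` could trigger the "start a new conversation" prompt.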

The 404 on the messages endpoint looks like a separate issue, likely a stale or expired session ID. Both failures appear to be silent, though, giving the user no feedback.
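
As a rough sketch of how the 404 could be surfaced rather than swallowed, the loader below separates "session not found" from other failures. The URL pattern is taken from the console error; everything else (`FetchLike`, `LoadResult`, `loadSessionMessages`, and the assumption that the endpoint returns a JSON array) is hypothetical and not confirmed by the actual code.

```typescript
// Minimal response shape so the sketch does not depend on DOM typings.
interface MinimalResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (url: string) => Promise<MinimalResponse>;

// Hypothetical outcomes the UI could render differently.
type LoadResult =
  | { kind: "loaded"; messages: unknown[] }
  | { kind: "session-missing" }
  | { kind: "failed"; detail: string };

async function loadSessionMessages(
  fetchFn: FetchLike,
  sessionId: string,
): Promise<LoadResult> {
  let res: MinimalResponse;
  try {
    res = await fetchFn(`/api/session/${encodeURIComponent(sessionId)}/messages`);
  } catch (err) {
    // Network-level failure (offline, timeout, blocked request).
    return { kind: "failed", detail: String(err) };
  }
  if (res.status === 404) return { kind: "session-missing" };
  if (!res.ok) return { kind: "failed", detail: `HTTP ${res.status}` };
  const body = await res.json();
  // Assumption: the endpoint returns a JSON array of messages.
  if (!Array.isArray(body)) {
    return { kind: "failed", detail: "unexpected response shape" };
  }
  return { kind: "loaded", messages: body };
}
```

With a result like `session-missing`, the page could tell the user that the server copy was not found before falling back to whatever is cached locally.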

Expected behavior

  • Users should be notified when conversation history is approaching storage limits, with a clear prompt to start a new conversation
  • In either case, messages already sent should not be silently lost. At minimum, the user should know they were
rdjarbeng changed discussion title from Failed to persist backend messages -Sudden memory loss on ml intern to Bug Report: Conversation history silently resets to a previous checkpoint

Update: it happened again, but this time I was shown the error in the screenshot below.

[screenshot]

LLM Provider Unreachable
bedrock/us.anthropic.claude-opus-4-6-v1 — litellm.Timeout: Connection timed out. Timeout passed=Timeout(timeout=10.0), time taken=10.002 seconds

I'm unable to debug this further. I've spent more time on it than I wanted, and I don't seem to be getting a response either. I hope it can be fixed, whether it's a network issue or something else.

smolagents org

Hello @rdjarbeng, does the issue persist?

Yes, @lewtun, the issue persists. However, it may not be a local storage issue after all. I think I have narrowed it down to a network problem. On one network (a wifi network, let's call it network 1) I get a previous checkpoint, and on another (my mobile network, let's call it network 2) I get the full conversation.

Steps I took to diagnose:

Network 1

  • Connect to network 1
  • Go to existing discussion in ml-intern which is currently on version-38 (v38)
  • Refresh the page
  • Conversation is now set to previous checkpoint where experiment was on version-17 (v17)
  • Observations: The model is still promptable and I can continue without error. Nothing in the UI indicates that later messages failed to load. The errors shown above appear in the dev console

Network 2

  • Connect to network 2
  • Go to same existing discussion in ml-intern which is currently on version-38 (v38)
  • Refresh the page
  • Conversation is now set to latest checkpoint on v38
  • Observations: The model is still promptable and I can continue without error

I usually use network 1. My current fear is that if I continue prompting with the earlier checkpoint I might lose the progress made with the current checkpoint, as if the project has diverged into 2 different branches.

I don't know why this happens because I'm usually fine on network 1, in fact that's what I've been using for most of my ml-intern sessions.

Possible solutions:

I suppose it would be good to display an error or some kind of indicator that the full context didn't load in case it's actually a network issue.
If it's specific to ml-intern I guess the devs would know better how to handle this.
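
One way such an indicator could work, as a sketch only: the client keeps a tiny marker with the ID of the last message it displayed (far smaller than the full history, so it should not hit the storage quota) and compares it with whatever history loads after a refresh. The names `Message`, `HistoryCheck`, and `checkHistory`, and the assumption that messages carry a stable `id`, are all hypothetical.

```typescript
// Assumption: every message has a stable unique id.
interface Message {
  id: string;
}

// "missing-recent": the last message the user saw is absent from what loaded.
// "ahead": the loaded history contains newer messages than the marker.
type HistoryCheck = "in-sync" | "ahead" | "missing-recent";

function checkHistory(lastSeenId: string | null, loaded: Message[]): HistoryCheck {
  if (lastSeenId === null) return "in-sync"; // nothing to compare against
  const idx = loaded.findIndex((m) => m.id === lastSeenId);
  if (idx === -1) return "missing-recent";
  return idx === loaded.length - 1 ? "in-sync" : "ahead";
}
```

On a `missing-recent` result the UI could show a banner and hold off on sending new prompts, which would also avoid the two-branch situation I was worried about.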
