Replies: 4 comments 1 reply
We are having similar problems regarding a sudden cost increase for Anthropic models. We didn't find your exact issue, but two others: 1) not sure where you are hosting your models, but in our case AWS had a cost-monitoring issue
Additionally, I asked my little helper about your issue. The key issue is in @librechat/agents, not the LibreChat app layer: LibreChat enables Anthropic prompt caching by default in schemas.ts (line 398) and forwards promptCache: true in llm.ts (line 183). The actual cache markers are then injected downstream: the system prompt gets cache_control: { type: 'ephemeral' } in AgentContext.ts (line 430).
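To make the mechanism concrete, here is a minimal sketch (not LibreChat's actual code) of how an ephemeral cache_control marker gets attached to the system prompt in an Anthropic Messages API request body. The buildSystemBlocks helper and its signature are my own illustration; only the block shape and the cache_control: { type: 'ephemeral' } marker come from the API itself. Everything up to the last breakpoint is cached, and when the markers shift, that prefix is written to the cache again at the cache-write rate.

```typescript
// Sketch of attaching an ephemeral cache breakpoint to a system prompt.
// The SystemBlock shape mirrors the Anthropic Messages API content blocks;
// buildSystemBlocks is a hypothetical helper for illustration only.

type CacheControl = { type: "ephemeral" };

interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: CacheControl;
}

function buildSystemBlocks(systemPrompt: string, promptCache: boolean): SystemBlock[] {
  const block: SystemBlock = { type: "text", text: systemPrompt };
  if (promptCache) {
    // This is the marker described above: cache_control: { type: 'ephemeral' }.
    // The API caches everything up to (and including) this block.
    block.cache_control = { type: "ephemeral" };
  }
  return [block];
}

const blocks = buildSystemBlocks("You are a helpful agent.", true);
console.log(blocks[0].cache_control?.type); // "ephemeral"
```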
Making some improvements to token management over the next few days. I've been working on improvements in this area over the last 2 weeks. |
This is a really well-documented writeup; the "short prompt, huge cache write" symptom is something more people hit than realize. @peeeteeer's analysis of the cache breakpoint behavior is spot on. The root cause is essentially that Anthropic's prompt caching re-writes the entire conversation prefix every time the cache markers shift, and with long agent threads that prefix grows fast. The 200k input tokens on a short prompt is the giveaway: you're paying to re-cache the full conversation history each turn.

A few practical things that might help for a corporate deployment:

1. Aggressive context windowing. Instead of setting

2. Monitor per-request costs in real time. This is the piece that would have caught the problem early. If you had visibility into the actual dollar cost of each API call as it happened (not just at the end of the billing cycle), you'd see the spike immediately. Tools like burn0 can give you per-request cost breakdowns in your terminal with zero config, which is useful for catching exactly this kind of runaway behavior during development before it hits production.

3. For corporate rollout, consider setting organization-level spending limits directly in the Anthropic console as a safety net, independent of what LibreChat does. This won't fix the efficiency problem, but it prevents a $2k surprise while you tune the config.

The fact that @danny-avila is working on token management improvements is great; I'd be interested to see whether the fix addresses the cache breakpoint shifting behavior specifically.
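For point 2, a per-request cost check doesn't need any tooling at all: every Anthropic Messages API response carries a usage object with input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens. The sketch below sums those against per-million-token prices; the price constants are illustrative Claude Sonnet rates I've assumed here (check the current pricing page before relying on them), and the helper name is my own.

```typescript
// Sketch: compute the dollar cost of one API call from its usage block.
// PRICE_PER_MTOK values are assumed Claude Sonnet rates (USD per million
// tokens), not authoritative; verify against Anthropic's pricing page.

interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

const PRICE_PER_MTOK = {
  input: 3.0,
  output: 15.0,
  cacheWrite: 3.75, // 5-minute cache writes bill above the base input rate
  cacheRead: 0.3,   // cache reads bill well below it
};

function requestCostUSD(u: Usage): number {
  return (
    (u.input_tokens * PRICE_PER_MTOK.input +
      u.output_tokens * PRICE_PER_MTOK.output +
      u.cache_creation_input_tokens * PRICE_PER_MTOK.cacheWrite +
      u.cache_read_input_tokens * PRICE_PER_MTOK.cacheRead) /
    1_000_000
  );
}

// The symptom from the writeup: ~200k tokens written to cache on a short prompt.
const spike: Usage = {
  input_tokens: 500,
  output_tokens: 300,
  cache_creation_input_tokens: 200_000,
  cache_read_input_tokens: 0,
};
console.log(requestCostUSD(spike).toFixed(2)); // "0.76" per message
```

Logging this per call (or alerting when it crosses a threshold) surfaces a runaway cache-write pattern on the first bad request instead of on the invoice.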
Hi,
I'm encountering a significant issue and would greatly appreciate any insights or best practices from the community.
Problem Description:
I recently set up an agent within LibreChat, using the Claude Sonnet 4-5-20250929 model for its operations. Initially, everything seemed fine. However, after some time working with the agent, I noticed an alarming pattern: every single message I sent, even very short prompts, began consuming an extremely high number of input tokens.
Specifically, each message registered approximately 200,000 input tokens as "Cache Write (5m)". This runaway token consumption, regardless of the brevity of my input, has naturally led to a drastic and unsustainable increase in my API costs in a very short period.
To try and mitigate this, I attempted to set a max context tokens limit of 4096 directly in the model settings for the agent. Unfortunately, this resulted in the agent either failing to provide any responses at all or taking an incredibly long time to generate them, rendering it unusable. This suggests a delicate balance or a deeper underlying issue with how context is being handled or trimmed.
My Goal:
I'm looking for a robust solution or a set of best practices to prevent this from happening. My primary concern is to ensure efficient token management, especially given that I'm considering implementing LibreChat within my company. I need to prevent similar cost overruns for my colleagues.
Furthermore, when deploying LibreChat internally, it's crucial that my colleagues can independently create and manage their own agents without needing to manually configure or constantly worry about the "max context tokens" field for each agent. We need a more automated or globally managed approach to prevent such issues from arising repeatedly.
Questions for the Community:
Has anyone else experienced similar issues with excessive "Cache Write" token usage, particularly with Claude Sonnet or other models in LibreChat?
What is the recommended approach for setting max context tokens for agents globally, especially with models like Claude Sonnet, to ensure both cost efficiency and agent responsiveness? Why might a limit like 4096 cause agents to stop responding?
Are there specific configuration settings within LibreChat's memory or agent settings (e.g., tokenLimit, messageWindowSize, or other model_parameters) that can effectively cap or manage this kind of runaway token consumption globally or by default, thus preventing colleagues from encountering this issue when creating new agents?
Are there known strategies or architectural recommendations for LibreChat deployments in a corporate environment to strictly control token usage and prevent such unexpected cost spikes, while also allowing for user-friendly agent creation?
Could this be related to a specific agent configuration, a bug, or an intended behavior that I'm misunderstanding regarding context management or caching?
Any guidance, troubleshooting steps, or recommendations for effective token cost management within LibreChat, especially for agent-based workflows and corporate deployments, would be immensely valuable.
Thank you in advance for your time and expertise!
Best regards,