Replies: 4 comments 1 reply
We are having similar problems regarding a sudden cost increase for Anthropic models. We didn't find your exact issue, but two others: 1) not sure where you are hosting your models, but in our case AWS had a cost-monitoring issue
Additionally, I asked my little helper about your issue. The key issue is in @librechat/agents, not the LibreChat app layer: LibreChat enables Anthropic prompt caching by default in schemas.ts (line 398) and forwards promptCache: true in llm.ts (line 183). The actual cache markers are then injected downstream: the system prompt gets cache_control: { type: 'ephemeral' } in AgentContext.ts (line 430).
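To make the mechanism concrete, here is a minimal sketch (not LibreChat's actual code) of how an ephemeral cache_control marker gets attached to the system prompt in an Anthropic Messages API request body. The buildSystemBlocks helper and its signature are my own illustration; only the block shape and the cache_control: { type: 'ephemeral' } marker come from the API itself. Everything up to the last breakpoint is cached, and when the markers shift, that prefix is written to the cache again at the cache-write rate.

```typescript
// Sketch of attaching an ephemeral cache breakpoint to a system prompt.
// The SystemBlock shape mirrors the Anthropic Messages API content blocks;
// buildSystemBlocks is a hypothetical helper for illustration only.

type CacheControl = { type: "ephemeral" };

interface SystemBlock {
  type: "text";
  text: string;
  cache_control?: CacheControl;
}

function buildSystemBlocks(systemPrompt: string, promptCache: boolean): SystemBlock[] {
  const block: SystemBlock = { type: "text", text: systemPrompt };
  if (promptCache) {
    // This is the marker described above: cache_control: { type: 'ephemeral' }.
    // The API caches everything up to (and including) this block.
    block.cache_control = { type: "ephemeral" };
  }
  return [block];
}

const blocks = buildSystemBlocks("You are a helpful agent.", true);
console.log(blocks[0].cache_control?.type); // "ephemeral"
```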
Making some improvements to token management over the next few days. I've been working on improvements in this area over the last 2 weeks. |
This is a really well-documented writeup; the "short prompt, huge cache write" symptom is something more people hit than realize. @peeeteeer's analysis of the cache breakpoint behavior is spot on. The root cause is essentially that Anthropic's prompt caching re-writes the entire conversation prefix every time the cache markers shift, and with long agent threads that prefix grows fast. The 200k input tokens on a short prompt is the giveaway: you're paying to re-cache the full conversation history each turn.

A few practical things that might help for a corporate deployment:

1. Aggressive context windowing. Instead of setting

2. Monitor per-request costs in real time. This is the piece that would have caught the problem early. If you had visibility into the actual dollar cost of each API call as it happened (not just at the end of the billing cycle), you'd see the spike immediately. Tools like burn0 can give you per-request cost breakdowns in your terminal with zero config, which is useful for catching exactly this kind of runaway behavior during development before it hits production.

3. For corporate rollout, consider setting organization-level spending limits directly in the Anthropic console as a safety net, independent of what LibreChat does. This won't fix the efficiency problem, but it prevents a $2k surprise while you tune the config.

The fact that @danny-avila is working on token management improvements is great; I'd be interested to see whether the fix addresses the cache breakpoint shifting behavior specifically.
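For point 2, a per-request cost check doesn't need any tooling at all: every Anthropic Messages API response carries a usage object with input_tokens, output_tokens, cache_creation_input_tokens, and cache_read_input_tokens. The sketch below sums those against per-million-token prices; the price constants are illustrative Claude Sonnet rates I've assumed here (check the current pricing page before relying on them), and the helper name is my own.

```typescript
// Sketch: compute the dollar cost of one API call from its usage block.
// PRICE_PER_MTOK values are assumed Claude Sonnet rates (USD per million
// tokens), not authoritative; verify against Anthropic's pricing page.

interface Usage {
  input_tokens: number;
  output_tokens: number;
  cache_creation_input_tokens: number;
  cache_read_input_tokens: number;
}

const PRICE_PER_MTOK = {
  input: 3.0,
  output: 15.0,
  cacheWrite: 3.75, // 5-minute cache writes bill above the base input rate
  cacheRead: 0.3,   // cache reads bill well below it
};

function requestCostUSD(u: Usage): number {
  return (
    (u.input_tokens * PRICE_PER_MTOK.input +
      u.output_tokens * PRICE_PER_MTOK.output +
      u.cache_creation_input_tokens * PRICE_PER_MTOK.cacheWrite +
      u.cache_read_input_tokens * PRICE_PER_MTOK.cacheRead) /
    1_000_000
  );
}

// The symptom from the writeup: ~200k tokens written to cache on a short prompt.
const spike: Usage = {
  input_tokens: 500,
  output_tokens: 300,
  cache_creation_input_tokens: 200_000,
  cache_read_input_tokens: 0,
};
console.log(requestCostUSD(spike).toFixed(2)); // "0.76" per message
```

Logging this per call (or alerting when it crosses a threshold) surfaces a runaway cache-write pattern on the first bad request instead of on the invoice.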
Hi,
I'm encountering a significant issue and would greatly appreciate any insights or best practices from the community.
Problem Description:
I recently set up an agent within LibreChat, using the Claude Sonnet 4-5-20250929 model for its operations. Initially, everything seemed fine. However, after some time working with the agent, I noticed an alarming pattern: every single message I sent, even very short prompts, began consuming an extremely high number of input tokens.
Specifically, each message registered approximately 200,000 input tokens as "Cache Write (5m)". This runaway token consumption, regardless of the brevity of my input, has naturally led to a drastic and unsustainable increase in my API costs in a very short period.
To try and mitigate this, I attempted to set a max context tokens limit of 4096 directly in the model settings for the agent. Unfortunately, this resulted in the agent either failing to provide any responses at all or taking an incredibly long time to generate them, rendering it unusable. This suggests a delicate balance or a deeper underlying issue with how context is being handled or trimmed.
My Goal:
I'm looking for a robust solution or a set of best practices to prevent this from happening. My primary concern is to ensure efficient token management, especially given that I'm considering implementing LibreChat within my company. I need to prevent similar cost overruns for my colleagues.
Furthermore, when deploying LibreChat internally, it's crucial that my colleagues can independently create and manage their own agents without needing to manually configure or constantly worry about the "max context tokens" field for each agent. We need a more automated or globally managed approach to prevent such issues from arising repeatedly.
Questions for the Community:
Has anyone else experienced similar issues with excessive "Cache Write" token usage, particularly with Claude Sonnet or other models in LibreChat?
What is the recommended approach for setting max context tokens for agents globally, especially with models like Claude Sonnet, to ensure both cost efficiency and agent responsiveness? Why might a limit like 4096 cause agents to stop responding?
Are there specific configuration settings within LibreChat's memory or agent settings (e.g., tokenLimit, messageWindowSize, or other model_parameters) that can effectively cap or manage this kind of runaway token consumption globally or by default, thus preventing colleagues from encountering this issue when creating new agents?
Are there known strategies or architectural recommendations for LibreChat deployments in a corporate environment to strictly control token usage and prevent such unexpected cost spikes, while also allowing for user-friendly agent creation?
Could this be related to a specific agent configuration, a bug, or an intended behavior that I'm misunderstanding regarding context management or caching?
Any guidance, troubleshooting steps, or recommendations for effective token cost management within LibreChat, especially for agent-based workflows and corporate deployments, would be immensely valuable.
Thank you in advance for your time and expertise!
Best regards,