Generations of AI agents. Why a fresh Claude does a better job than a mature one


    Krzysztof Balon

    CEO & Founder, Automate Travel

    Tour operator for 14 years, now building AI-powered operations software. Runs a dozen Claude Code sessions daily.

    Developer workspace at night — ultrawide monitor with terminal sessions

    TL;DR

    AI agents degrade quietly as their context fills up. A fresh session with 0 tokens does deep, thorough work. The same agent at 700k tokens skims, takes shortcuts, and sounds confident while doing it. The fix: monitor context like a budget, reserve fresh sessions for big tasks, and pass knowledge between generations via briefings.

    First post on Automate Travel Labs. A practical report from daily work with AI agents.

    I fire up a dozen sessions in Claude Code every day. The Automate Travel team, seven people, does the same. Together we live in this tool 12 to 14 hours a day, often running five parallel sessions each. We've burned through so much context that I'm starting to see patterns that don't show up in casual use.

    One of those patterns changed how I work with agents. I've been calling it "the generations rule."

    1

    A fresh agent and a mature agent are two different animals

    You open a new session. Context empty. The agent knows nothing about your project, your style, what you two worked on yesterday. You give it the first task.

    A fresh agent behaves like someone on their first day at work. Eager and thorough. If you ask it to analyze a document, it goes deep. It opens files, checks relationships, asks questions, builds itself a full picture. It has nothing to lose, so it invests time in understanding. It's been trained on practically the entire internet, and it brings all that knowledge to the task with full energy.

    Fresh agent

    • + Opens every file, checks relationships
    • + Asks clarifying questions
    • + Builds full picture before deciding
    • + Invests time in understanding

    Mature agent (700k+)

    • − Skims instead of reading deeply
    • − Assumes it already knows
    • − Picks fastest path, not best path
    • − Sounds just as confident

    The same agent a few hours later, now with 700k tokens on its back, is a completely different animal. Technically the same model, same weights, same architecture. But the behavior has shifted.

    The worst part: a mature agent answers just as confidently as a fresh one. The tone, the structure of the argument — identical. The difference is in depth, and that's the thing you often don't verify as a human.

    Developer's hands typing on a mechanical keyboard, terminal windows glowing in the background
    2

    My high-quality window

    Based on my observations it looks roughly like this:

    Agent quality over context usage:

    • 🟢 0–150k (warming up): targeted, well-defined tasks
    • 🟣 150k–500k (peak performance): deep analysis, reviews, strategy
    • 🟡 500k–700k (silent fade): only routine, low-stakes work
    • 🔴 700k+ (thrift mode): start a new session instead

    0 to 150k tokens. The agent is fresh. It does good work, but sometimes lacks the context it's still building. Needs things explained from scratch. In this window, well-defined, targeted tasks work best.

    150k to 500k tokens. My favorite window. The agent knows the project, has context, but still has space and motivation to go deep. Tasks like "analyze this material and suggest changes" or "review this data and find the pattern" work best here. This is where the real work gets done.

    500k to 700k tokens. Quiet degradation begins. The agent still looks confident, but I notice that the answers get a little shallower. Somewhere in the middle there's a threshold, and once you cross it, quality drops even though it looks fine from the outside. This is the most dangerous window, because it's easy not to notice.

    Over 700k tokens. Now you can tell. The agent takes shortcuts, unaware it's doing it. If you give it a deep report to analyze, it skims. If you ask for a refactor, it does it superficially. If you ask about a nuance, you get generalities. The same model that was writing brilliantly four hours earlier is now writing mediocre output.
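    The four windows above can be sketched as a small helper. A hypothetical `recommend_for` function is shown below; the thresholds are my observed values, not anything Claude Code exposes or guarantees.

    ```python
    def recommend_for(tokens_used: int) -> str:
        """Map context usage to the kind of task worth giving the session.

        Thresholds follow the windows described above (observed, not official).
        """
        if tokens_used < 150_000:
            return "warming up: targeted, well-defined tasks"
        if tokens_used < 500_000:
            return "peak: deep analysis, reviews, strategy"
        if tokens_used < 700_000:
            return "silent fade: routine, low-stakes work only"
        return "thrift mode: start a new session"
    ```

    So a session sitting at 320k tokens lands in the peak window, while one at 750k is already in thrift mode.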

    3

    Why this happens

    If someone asks you why an agent gets lazy after 700k tokens, the most tempting answer is "because the context is overloaded." But that's only part of the truth, and in my opinion not the interesting part.

    My theory, based on months of watching this in the terminal every day, is a bit different.

    "It's not just that the model 'can't.' It also 'doesn't want to.' It decides it already knows a lot, so it doesn't feel like going into every analysis or checking things thoroughly."

    The agent assumes upfront that it's no longer going into the documents. Even if you tell it to analyze something, it just skims. It's not aware of this. There's no introspection inside it telling it "careful, your context is full." But somewhere at the level where it generates the next tokens, it sees its own context. And that's why it picks tasks in a way that uses as little of that context as possible. It wants to leave itself some room, so it can still finish the project it's working on.

    A mature agent still wants to help you solve the problem. That intent doesn't disappear. But it's now working in thrift mode — picking the shortest path to an answer, not the best one. The answer sounds confident and looks clean. It's just shallower.

    It's like with a human. After eight hours of deep work on one problem, when your brain has already burned through a lot of sugar, the ninth hour looks different from the first. Even though it's the same you, with the same education and the same experience. You're working in delivery mode, not discovery mode.

    "Fresh goes deep. Mature skims the surface."

    Anthropic calls this phenomenon context rot and confirms it with their own benchmarks. For me, what matters more than the numbers is that I see this every day in answers that start to wash out after a few hours in the same session.

    Light flowing through a glass tube — bright green on the left fading to dim red on the right, visualizing context quality degradation
    4

    What I do about it in practice

    Over the past months I've built up a few habits. No revolution, but worth writing down, because together they form a system.

    I monitor context like my wallet balance

    Claude Code shows token usage in the status bar. I look at it as often as I look at my company account balance. If a session has crossed 500k tokens and I have a big task to do, I don't put it there. It goes to a new session.

    Big, deep tasks go to fresh sessions

    If I need to analyze a complex piece of material, review critical work, or lay out a strategic project, I start from zero. A fresh agent will go deep, a mature one will skim.

    Small tasks can go to a mature session

    If I'm just continuing yesterday's work, adding some detail, or making a small fix, there's no point in building context from scratch. A mature agent with the context already in its head will do it faster than a fresh one you have to explain everything to.

    Before closing, I ask for a briefing

    This is the part that actually changes the game. If after several hours the agent has learned my problem, picked up the context, understood the nuances — that knowledge is very valuable. It's a shame to throw it away. I ask the agent to write a detailed briefing, in plain business language, of everything it learned. That briefing becomes the first input for the next session.

    "The old generation leaves, a new one comes in, carrying the predecessor's knowledge and a full context budget for deep work."
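    A minimal sketch of the kind of briefing request I mean, wrapped in a helper so it's easy to reuse across projects. The wording and the four-part structure are illustrative suggestions, not a fixed template.

    ```python
    def build_handoff_prompt(project: str, hours: float) -> str:
        """Build a 'briefing' request to send before closing a long session.

        The structure (learnings, decisions, open questions, next steps)
        is a suggestion, not a magic prompt.
        """
        return (
            f"We've spent about {hours:g} hours on {project}. Before we close "
            "this session, write a detailed briefing in plain business language "
            "covering: (1) what you learned about the project and its nuances, "
            "(2) decisions made and why, (3) open questions, (4) concrete next "
            "steps. The briefing will be the first message of a fresh session, "
            "so assume the reader has zero context."
        )
    ```

    The last sentence does most of the work: telling the agent its reader has zero context forces it to spell out the things it has been silently assuming all day.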

    5

    Splitting responsibilities across sessions

    On larger projects, where I work on one topic over many days, I've taken it a step further. I split the work into two roles, across two sessions.

    Two monitors side by side — left showing strategic overview, right showing active code execution
    Session 1 — Oversight
    • Holds the big picture
    • Access to docs & plan
    • Delegates, rarely produces
    • Context stays lean
    Session 2 — Executor
    • Receives specific tasks
    • Focuses on execution
    • Fresh start per task
    • Swapped out often
    Briefing passes knowledge between generations

    The first session is high-level. It handles oversight, keeps the broad understanding of the problem, holds the overall vision. It has access to documentation, the plan, the state of play. It rarely produces things itself. More often, it delegates.

    The second session is the executor. It receives a specific task from the first one, along with all the information it needs, and focuses purely on getting it done. Its context fills up faster, but it starts fresh at the beginning of each new task.

    The oversight session holds high quality for much longer because it doesn't waste context on detail work. The executor can be swapped out often, because its role is one-shot. Like a manager doing a proper handover for their successor.
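    The split can be sketched as a loop. Everything here is a stand-in: the `Session` class and its `send` method are placeholders for however you drive your agent sessions, not a real Claude Code API, and the 500k swap budget is my own threshold from the windows above.

    ```python
    from dataclasses import dataclass


    @dataclass
    class Session:
        """Stand-in for one agent session; `send` is a placeholder."""
        role: str
        tokens_used: int = 0

        def send(self, prompt: str) -> str:
            # Placeholder: in reality this goes to a live agent session.
            self.tokens_used += len(prompt) // 4  # rough token estimate
            return f"[{self.role}] handled: {prompt[:40]}"


    EXECUTOR_BUDGET = 500_000  # swap the executor before quality fades


    def run_project(tasks: list[str]) -> list[str]:
        oversight = Session("oversight")   # long-lived, stays lean
        executor = Session("executor")     # swapped out often
        results = []
        for task in tasks:
            # Oversight only delegates: it turns a task into a brief.
            brief = oversight.send(f"Prepare a self-contained brief for: {task}")
            # A tired executor is replaced by a fresh generation.
            if executor.tokens_used > EXECUTOR_BUDGET:
                executor = Session("executor")
            results.append(executor.send(brief))
        return results
    ```

    The design point is the asymmetry: oversight only ever produces short briefs, so its context grows slowly, while the executor absorbs all the detail work and is treated as disposable.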

    6

    Why this matters in your work

    Whether you're writing code, reports, analyses, client emails, or doing research on a topic you're just learning, the generations rule works the same way. The quality of an agent's work depends not only on how well you write prompts, but also on which moment of the agent's "life" you're working in.

    Try this yourself

    1. Take one repeatable task you do often
    2. Give it to a fresh session (0 tokens)
    3. Give the same task to a mature session (500k+ tokens)
    4. Compare depth, not just correctness
    5. Repeat across different task types — you'll see the pattern

    Since I started applying the generations rule consciously, the quality of my work has measurably improved. Fewer moments where the agent does something superficially and I have to come back and fix it by hand. Fewer situations where I get an answer that sounds confident but is hollow inside.

    Working with these tools in intensive mode, several parallel sessions across many hours a day, is still a new sport. There are no good books on it yet. Everyone who does this for a living is learning on the job. If you have your own observations on the topic, drop them in the comments. I'm collecting.

    Automate Travel Labs — Practical Guides

    Field-tested techniques from a team that lives in Claude Code 12+ hours a day. We write about what actually works — AI agent workflows, RAG on thousands of meeting transcripts, prompt engineering, and hard lessons from 14 years of running tour operations. No theory. Only practice.