Generations of AI agents. Why a fresh Claude does a better job than a mature one


    Krzysztof Balon

    CEO & Founder, Automate Travel

    Tour operator for 14 years, now building AI-powered operations software. Runs a dozen Claude Code sessions daily.

    Developer workspace at night — ultrawide monitor with terminal sessions

    TL;DR

    AI agents degrade quietly as their context fills up. A fresh session with 0 tokens does deep, thorough work. The same agent at 700k tokens skims, takes shortcuts, and sounds confident while doing it. The fix: monitor context like a budget, reserve fresh sessions for big tasks, and pass knowledge between generations via briefings.

    First post on Automate Travel Labs. A practical report from daily work with AI agents.

    I fire up a dozen sessions in Claude Code every day. The Automate Travel team, seven people, does the same. Together we live in this tool 12 to 14 hours a day, often running five parallel sessions each. We've burned through so much context that I'm starting to see patterns that don't show up in casual use.

    One of those patterns changed how I work with agents. I've been calling it "the generations rule."

    1

    A fresh agent and a mature agent are two different animals

    You open a new session. Context empty. The agent knows nothing about your project, your style, what you two worked on yesterday. You give it the first task.

    A fresh agent behaves like someone on their first day at work. Eager and thorough. If you ask it to analyze a document, it goes deep. It opens files, checks relationships, asks questions, builds itself a full picture. It has nothing to lose, so it invests time in understanding. It's been trained on practically the entire internet, and it brings all that knowledge to the task with full energy.

    Fresh agent

    • + Opens every file, checks relationships
    • + Asks clarifying questions
    • + Builds full picture before deciding
    • + Invests time in understanding

    Mature agent (700k+)

    • − Skims instead of reading deeply
    • − Assumes it already knows
    • − Picks fastest path, not best path
    • − Sounds just as confident

    The same agent a few hours later, now with 700k tokens on its back, is a completely different animal. Technically the same model, same weights, same architecture. But the behavior has shifted.

    The worst part: a mature agent answers just as confidently as a fresh one. The tone, the structure of the argument — identical. The difference is in depth, and that's the thing you often don't verify as a human.

    Developer's hands typing on a mechanical keyboard, terminal windows glowing in the background
    2

    My high-quality window

    Based on my observations it looks roughly like this:

    Agent quality over context usage:

    • 🟢 0–150k (warming up): targeted, well-defined tasks
    • 🟣 150k–500k (peak performance): deep analysis, reviews, strategy
    • 🟡 500k–700k (silent fade): only routine, low-stakes work
    • 🔴 700k+ (thrift mode): start a new session instead

    0 to 150k tokens. The agent is fresh. It does good work, but sometimes lacks the context it's still building. Needs things explained from scratch. In this window, well-defined, targeted tasks work best.

    150k to 500k tokens. My favorite window. The agent knows the project, has context, but still has space and motivation to go deep. Tasks like "analyze this material and suggest changes" or "review this data and find the pattern" work best here. This is where the real work gets done.

    500k to 700k tokens. Quiet degradation begins. The agent still looks confident, but I notice that the answers get a little shallower. Somewhere in the middle there's a threshold, and once you cross it, quality drops even though it looks fine from the outside. This is the most dangerous window, because it's easy not to notice.

    Over 700k tokens. Now you can tell. The agent takes shortcuts, unaware it's doing it. If you give it a deep report to analyze, it skims. If you ask for a refactor, it does it superficially. If you ask about a nuance, you get generalities. The same model that was writing brilliantly four hours earlier is now writing mediocre output.
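    The four windows above can be sketched as a small helper. A hypothetical `recommend_for` function is shown below; the thresholds are my observed values, not anything Claude Code exposes or guarantees.

    ```python
    def recommend_for(tokens_used: int) -> str:
        """Map context usage to the kind of task worth giving the session.

        Thresholds follow the windows described above (observed, not official).
        """
        if tokens_used < 150_000:
            return "warming up: targeted, well-defined tasks"
        if tokens_used < 500_000:
            return "peak: deep analysis, reviews, strategy"
        if tokens_used < 700_000:
            return "silent fade: routine, low-stakes work only"
        return "thrift mode: start a new session"
    ```

    So a session sitting at 320k tokens lands in the peak window, while one at 750k is already in thrift mode.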

    3

    Why this happens

    If someone asks you why an agent gets lazy after 700k tokens, the most tempting answer is "because the context is overloaded." But that's only part of the truth, and in my opinion not the interesting part.

    My theory, based on months of watching this in the terminal every day, is a bit different.

    "It's not just that the model 'can't.' It also 'doesn't want to.' It decides it already knows a lot, so it doesn't feel like going into every analysis or checking things thoroughly."

    The agent assumes upfront that it's no longer going into the documents. Even if you tell it to analyze something, it just skims. It's not aware of this. There's no introspection inside it telling it "careful, your context is full." But somewhere at the level where it generates the next tokens, it sees its own context. And that's why it picks tasks in a way that uses as little of that context as possible. It wants to leave itself some room, so it can still finish the project it's working on.

    A mature agent still wants to help you solve the problem. That intent doesn't disappear. But it's now working in thrift mode — picking the shortest path to an answer, not the best one. The answer sounds confident and looks clean. It's just shallower.

    It's like with a human. After eight hours of deep work on one problem, when your brain has already burned through a lot of sugar, the ninth hour looks different from the first. Even though it's the same you, with the same education and the same experience. You're working in delivery mode, not discovery mode.

    "Fresh goes deep. Mature skims the surface."

    Anthropic calls this phenomenon context rot and confirms it with their own benchmarks. For me, what matters more than the numbers is that I see this every day in answers that start to wash out after a few hours in the same session.

    Light flowing through a glass tube — bright green on the left fading to dim red on the right, visualizing context quality degradation
    4

    What I do about it in practice

    Over the past months I've built up a few habits. No revolution, but worth writing down, because together they form a system.

    I monitor context like my wallet balance

    Claude Code shows token usage in the status bar. I look at it as often as I look at my company account balance. If a session has crossed 500k tokens and I have a big task to do, I don't put it there. It goes to a new session.

    Big, deep tasks go to fresh sessions

    If I need to analyze a complex piece of material, review critical work, or lay out a strategic project, I start from zero. A fresh agent will go deep, a mature one will skim.

    Small tasks can go to a mature session

    If I'm just continuing yesterday's work, adding some detail, or making a small fix, there's no point in building context from scratch. A mature agent with the context already in its head will do it faster than a fresh one you have to explain everything to.

    Before closing, I ask for a briefing

    This is the part that actually changes the game. If after several hours the agent has learned my problem, picked up the context, understood the nuances — that knowledge is very valuable. It's a shame to throw it away. I ask the agent to write a detailed briefing, in plain business language, of everything it learned. That briefing becomes the first input for the next session.

    "The old generation leaves, a new one comes in, carrying the predecessor's knowledge and a full context budget for deep work."
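    A minimal sketch of the kind of briefing request I mean, wrapped in a helper so it's easy to reuse across projects. The wording and the four-part structure are illustrative suggestions, not a fixed template.

    ```python
    def build_handoff_prompt(project: str, hours: float) -> str:
        """Build a 'briefing' request to send before closing a long session.

        The structure (learnings, decisions, open questions, next steps)
        is a suggestion, not a magic prompt.
        """
        return (
            f"We've spent about {hours:g} hours on {project}. Before we close "
            "this session, write a detailed briefing in plain business language "
            "covering: (1) what you learned about the project and its nuances, "
            "(2) decisions made and why, (3) open questions, (4) concrete next "
            "steps. The briefing will be the first message of a fresh session, "
            "so assume the reader has zero context."
        )
    ```

    The last sentence does most of the work: telling the agent its reader has zero context forces it to spell out the things it has been silently assuming all day.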

    5

    Splitting responsibilities across sessions

    On larger projects, where I work on one topic over many days, I've taken it a step further. I split the work into two roles, across two sessions.

    Two monitors side by side — left showing strategic overview, right showing active code execution
    Session 1 — Oversight
    • Holds the big picture
    • Access to docs & plan
    • Delegates, rarely produces
    • Context stays lean
    Session 2 — Executor
    • Receives specific tasks
    • Focuses on execution
    • Fresh start per task
    • Swapped out often
    Briefing passes knowledge between generations

    The first session is high-level. It handles oversight, keeps the broad understanding of the problem, holds the overall vision. It has access to documentation, the plan, the state of play. It rarely produces things itself. More often, it delegates.

    The second session is the executor. It receives a specific task from the first one, along with all the information it needs, and focuses purely on getting it done. Its context fills up faster, but it starts fresh at the beginning of each new task.

    The oversight session holds high quality for much longer because it doesn't waste context on detail work. The executor can be swapped out often, because its role is one-shot. Like a manager doing a proper handover for their successor.
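    The split can be sketched as a loop. Everything here is a stand-in: the `Session` class and its `send` method are placeholders for however you drive your agent sessions, not a real Claude Code API, and the 500k swap budget is my own threshold from the windows above.

    ```python
    from dataclasses import dataclass


    @dataclass
    class Session:
        """Stand-in for one agent session; `send` is a placeholder."""
        role: str
        tokens_used: int = 0

        def send(self, prompt: str) -> str:
            # Placeholder: in reality this goes to a live agent session.
            self.tokens_used += len(prompt) // 4  # rough token estimate
            return f"[{self.role}] handled: {prompt[:40]}"


    EXECUTOR_BUDGET = 500_000  # swap the executor before quality fades


    def run_project(tasks: list[str]) -> list[str]:
        oversight = Session("oversight")   # long-lived, stays lean
        executor = Session("executor")     # swapped out often
        results = []
        for task in tasks:
            # Oversight only delegates: it turns a task into a brief.
            brief = oversight.send(f"Prepare a self-contained brief for: {task}")
            # A tired executor is replaced by a fresh generation.
            if executor.tokens_used > EXECUTOR_BUDGET:
                executor = Session("executor")
            results.append(executor.send(brief))
        return results
    ```

    The design point is the asymmetry: oversight only ever produces short briefs, so its context grows slowly, while the executor absorbs all the detail work and is treated as disposable.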

    6

    Why this matters in your work

    Whether you're writing code, reports, analyses, client emails, or doing research on a topic you're just learning, the generations rule works the same way. The quality of an agent's work depends not only on how well you write prompts, but also on which moment of the agent's "life" you're working in.

    Try this yourself

    1. Take one repeatable task you do often
    2. Give it to a fresh session (0 tokens)
    3. Give the same task to a mature session (500k+ tokens)
    4. Compare depth, not just correctness
    5. Repeat across different task types — you'll see the pattern

    Since I started applying the generations rule consciously, the quality of my work has measurably improved. Fewer moments where the agent does something superficially and I have to come back and fix it by hand. Fewer situations where I get an answer that sounds confident but is hollow inside.

    Working with these tools in intensive mode, several parallel sessions across many hours a day, is still a new sport. There are no good books on it yet. Everyone who does this for a living is learning on the job. If you have your own observations on the topic, drop them in the comments. I'm collecting.

    Automate Travel Labs — Practical Guides

    Field-tested techniques from a team that lives in Claude Code 12+ hours a day. We write about what actually works — AI agent workflows, RAG on thousands of meeting transcripts, prompt engineering, and hard lessons from 14 years of running tour operations. No theory. Only practice.