For most of 2024 and 2025, building an AI agent felt like Groundhog Day. You'd start a session, feed it context, watch it reason through a task—and then the session would end. Every lesson learned, every workaround discovered, every preference inferred: gone. The next session started from zero.
Models were getting smarter, but agents themselves were essentially stateless. They could act, but they couldn't learn. Existing memory systems were narrow—sticky notes, not accumulated expertise.
That changed on May 6, 2026.
At its Code with Claude developer conference, Anthropic introduced dreaming for Claude Managed Agents—a scheduled, asynchronous process that reviews past sessions, extracts patterns, and curates memory so agents genuinely improve over time. Alongside it, the company moved outcomes (self-evaluation against rubrics) and multi-agent orchestration into public beta. Together, these three features form the most significant upgrade to agent infrastructure since the category existed.
Here's what they do, why they matter, and what they mean for anyone building agent skills today.
The Stateless Problem
To understand why dreaming matters, you have to understand what agents were missing.
A typical agent session: user provides a goal, agent reasons step-by-step, calls tools, delivers output. If you're lucky, the platform retains a memory file or a few preferences across sessions. But that's persistence, not learning. The agent doesn't get better at its job. It doesn't notice it made the same mistake three times last week. It doesn't consolidate the workaround for a finicky API into a reusable pattern.
For short tasks, this is fine. For long-running workflows—customer support, legal review, log analysis—it's a hard ceiling. Performance plateaus on day one.
Anthropic's own research confirmed a deeper problem: agents asked to evaluate their own work "tend to respond by confidently praising the work—even when, to a human observer, the quality is obviously mediocre." Self-evaluation in the same context window doesn't work. The model is too invested in its own reasoning to see the flaws.
The fix isn't more prompting. It's a different architecture.
What Is Dreaming?
Dreaming is a scheduled background process—think of it as an agent's overnight review session. Here's how it works in practice:
- Input: A batch of prior session transcripts and the current memory store.
- Processing: Claude reviews the sessions, identifies recurring mistakes, notes workflows multiple agents converged on independently, detects team preferences shared across users, and spots outdated or contradictory memory entries.
- Output: A curated, reorganized memory layer—structured playbooks, plain-text notes, and refined preferences.
- Review: Teams can approve, reject, or modify the updates before the agent adopts them.
Crucially, dreaming never modifies the original session transcripts. The raw history stays intact for audit purposes. Only the memory store gets updated. And the updates are written as human-readable notes, not opaque weight adjustments, so the entire process remains inspectable.
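That review step can be modeled as an approval gate over staged memory edits. The sketch below is illustrative only, not Anthropic's API: ProposedUpdate and the approve callback are hypothetical names, but the flow (stage updates, review them, adopt the approved ones into memory, leave transcripts alone) mirrors the process described above.

```python
# Hypothetical review gate for dreamed memory updates. Proposed edits
# are staged for approval before the agent adopts them; raw session
# transcripts are never modified. NOT Anthropic's actual API.
from dataclasses import dataclass, field

@dataclass
class ProposedUpdate:
    kind: str      # e.g. "playbook", "note", or "preference"
    content: str   # human-readable memory text, inspectable by reviewers
    evidence: list = field(default_factory=list)  # session IDs it came from

def apply_reviewed_updates(proposals, memory_store, approve):
    """Adopt only the proposals the reviewer approves into the memory layer."""
    adopted = []
    for p in proposals:
        if approve(p):                      # human (or policy) decision
            memory_store.append(p.content)  # only the memory store changes
            adopted.append(p)
    return adopted

# Usage: auto-approve preference updates, hold everything else for review
proposals = [
    ProposedUpdate("preference", "Team prefers .docx output for legal drafts", ["s1", "s7"]),
    ProposedUpdate("playbook", "Retry uploads over 50MB with chunked transfer", ["s3", "s9"]),
]
memory = []
adopted = apply_reviewed_updates(proposals, memory, approve=lambda p: p.kind == "preference")
```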
Anthropic describes what dreaming surfaces as "patterns that a single agent can't see on its own." A single session might show a bug. Ten sessions show that the bug only happens with .docx files over 50MB uploaded after 5 PM. That kind of cross-session pattern detection is what transforms an agent from a stateless assistant into a system that genuinely accumulates expertise.
Early results are striking. Legal AI company Harvey reported task completion rates increasing roughly 6x after implementing dreaming. The system allowed their agents to retain filetype preferences and tool-specific workarounds between sessions, improving consistency across long-running legal drafting workflows.
Outcomes: Agents That Grade Their Own Work
Dreaming handles between-session improvement. Outcomes handles within-session improvement.
The concept is simple but architecturally subtle. A developer defines a rubric—a structured definition of what success looks like. When the working agent completes its task, a separate grader agent evaluates the output against that rubric in its own independent context window.
This separation matters. The grader operates with fresh context—no accumulated biases from the working agent's reasoning trail. When it finds gaps, it pinpoints exactly what needs to change, and the working agent takes another pass. The loop continues until the rubric is satisfied.
Anthropic's internal testing showed that combining outcomes with dreaming improved task success rates by up to 10 percentage points compared with standard prompting. The improvement was especially pronounced on complex file-generation tasks involving .docx and .pptx outputs.
Medical document review company Wisedocs cut review time by 50% using outcomes. Netflix is using multi-agent orchestration (which pairs naturally with outcomes) to process logs from hundreds of builds simultaneously, surfacing only actionable anomalies for human review.
Multi-Agent Orchestration at Scale
The third piece of Anthropic's announcement was the public beta of multi-agent orchestration for Claude Managed Agents, supporting up to 20 parallel specialist agents.
This isn't just running multiple agents at once. It's a structured delegation pattern where a lead agent breaks complex tasks into subtasks, assigns each to a specialist with its own context window and tools, and coordinates the results.
Anthropic demonstrated this on stage with a fictional aerospace startup called "Lumara." The task: autonomously land drones on the moon. The system used three specialist agents—a commander for mission success, a detector for landing site identification, and a navigator for flight and landing—coordinated by a lead agent with a success rubric.
The real power emerged when dreaming was layered on top. After initial simulations produced imperfect results, a dreaming session reviewed all past mission runs overnight and wrote a detailed descent playbook—heuristics extracted from patterns across multiple simulations. The next morning, with the playbook loaded into memory, the agents performed meaningfully better on previously underperforming sites.
All without a human writing new logic.
Why This Is a Bigger Deal Than It Sounds
At first glance, dreaming might look like a fancy memory cleanup tool. It's not. It represents a philosophical shift in how we architect agent systems.
Until now, the dominant mental model for agents was the stateless request-response loop. You send a prompt, you get output. Even "memory" was just stuffing context into the next prompt. The agent had no genuine lifecycle, no accumulation of expertise, no identity that persisted and matured.
Dreaming, outcomes, and orchestration together introduce a stateful, continuous-improvement model. Agents now have:
- A lifecycle (session → review → memory update → next session)
- Self-evaluation (grader agents with independent context)
- Specialization (multi-agent teams with coordinated goals)
- Auditability (human-readable memory updates, not opaque weight changes)
This is the difference between a chatbot and a coworker. A chatbot answers questions. A coworker learns how your team works, remembers what went wrong last time, and gets better at their job every week.
The market is responding accordingly. Anthropic CEO Dario Amodei disclosed at the conference that the company saw 80x annualized growth in revenue and usage in Q1 2026—eight times the 10x growth they had planned for. The average developer using Claude Code now spends 20 hours per week in the tool. These aren't casual users. These are people whose workflows have become genuinely dependent on agents that improve over time.
Practical Patterns for Builders
You don't need Claude Managed Agents to apply these concepts. The underlying patterns are portable. Here are three you can implement in any agent system today.
Pattern 1: Session Logging and Pattern Extraction
The simplest version of dreaming is just logging sessions and periodically reviewing them. Store structured session data (goals, actions, outcomes, errors) and run a periodic batch job that feeds summaries back into your agent's system prompt or memory store.
```python
# Minimal session logging for pattern extraction
import json
import time

class SessionLogger:
    def __init__(self, store_path="sessions.jsonl"):
        self.store_path = store_path

    def log(self, goal, actions, outcome, errors=None):
        entry = {
            "timestamp": time.time(),
            "goal": goal,
            "actions": actions,
            "outcome": outcome,
            "errors": errors or [],
        }
        with open(self.store_path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def recent_sessions(self, limit=100):
        """Return the most recently logged sessions."""
        with open(self.store_path) as f:
            lines = f.readlines()
        return [json.loads(line) for line in lines[-limit:]]

    def extract_patterns(self, limit=100):
        """Feed recent sessions to an LLM for pattern extraction."""
        sessions = self.recent_sessions(limit)
        prompt = f"""Review these agent sessions and extract:
1. Recurring errors or failure modes
2. Successful workarounds or patterns
3. User preferences that appear repeatedly

Sessions:
{json.dumps(sessions, indent=2)}
"""
        return llm.generate(prompt)  # llm: whatever LLM client you use
```
Run extract_patterns() nightly, append the results to your agent's system prompt, and you have a crude but functional version of dreaming.
Pattern 2: Separate Grader for Self-Evaluation
The outcomes pattern—having a separate agent grade the work—is equally portable. Instead of asking the working agent "did you do this right?", hand the output to a fresh agent with a rubric.
```python
# Separate grader pattern
def evaluate_output(output, rubric, task_context):
    grader_prompt = f"""You are an objective grader. Evaluate the following output against the rubric.
Do not consider the reasoning process—only the final output.

Rubric:
{rubric}

Task Context:
{task_context}

Output to Evaluate:
{output}

Respond with:
- PASS or FAIL for each rubric criterion
- Specific feedback on what needs to change
- A rewritten version if FAIL"""
    return llm.generate(grader_prompt)
```
If the grader returns FAIL with specific feedback, feed that back to the working agent for another pass. This separation is what makes the loop work.
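The loop itself is short. In this sketch, produce stands in for your working agent and evaluate for the grader above; it also assumes the grader's reply begins with PASS once every criterion passes, a simplification of the rubric format.

```python
# Grade-and-retry loop: the working agent revises until the grader
# passes the output or attempts run out. `produce` is a stand-in for
# your working agent; `evaluate` is a grader like evaluate_output above.
def run_with_grading(task, rubric, produce, evaluate, max_attempts=3):
    feedback = None
    output = None
    for attempt in range(max_attempts):
        output = produce(task, feedback)          # working agent's pass
        verdict = evaluate(output, rubric, task)  # fresh-context grader
        if verdict.startswith("PASS"):            # simplified convention
            return output, attempt + 1
        feedback = verdict                        # targeted fixes for next pass
    return output, max_attempts                   # best effort after retries
```

Capping max_attempts matters: without it, a rubric the agent can never satisfy turns into an infinite loop.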
Pattern 3: Specialist Delegation
For complex tasks, break them into subtasks and delegate to specialist agents rather than asking one agent to do everything. This reduces context pressure and lets you optimize prompts and tools per subtask.
```python
# Minimal specialist delegation
def run_multi_agent_workflow(task, specialists):
    # Step 1: Lead agent breaks the task into subtasks
    plan = lead_agent.plan(task, available_specialists=list(specialists.keys()))

    results = {}
    for subtask in plan.subtasks:
        specialist = specialists[subtask.agent_type]
        results[subtask.id] = specialist.run(
            goal=subtask.description,
            context=task.context,
            dependencies=[results[dep] for dep in subtask.depends_on],
        )

    # Step 2: Lead agent synthesizes the final output
    return lead_agent.synthesize(results, original_task=task)
```
Even a simple two-agent split—one to execute, one to verify—can dramatically improve reliability on complex outputs.
The Road Ahead
Anthropic's announcements are part of a broader industry shift. OpenAI's decoupled agent harness (open-sourced in May 2026) separates the "brain" from the "hands" of an agent, allowing multiple models and tool backends to plug into a single orchestration layer. Pinecone's Nexus knowledge engine is designed specifically to give agents structured, queryable long-term memory at scale. NVIDIA and ServiceNow's Project Arc is building autonomous enterprise agents that can operate across ServiceNow workflows without human intervention.
What all of these have in common is a move away from the "one big prompt" model of agent design and toward modular, persistent, self-improving systems. The agent is no longer just the model. It's the model plus memory plus evaluation plus orchestration plus feedback loops.
The implication for builders is clear: if you're designing agent skills today, design for statefulness from the start. Don't assume each session is independent. Build logging. Build review loops. Build rubrics. The platforms are catching up to this architecture, and the builders who already think in terms of continuous improvement will have a significant head start.
Bottom Line
Self-improving agents are no longer a research concept. They're shipping.
Anthropic's dreaming system, outcomes grading, and multi-agent orchestration give us the first production-grade toolkit for agents that learn from their own history. The results from early adopters—6x completion rate improvements, 50% time reductions, parallel processing at Netflix scale—suggest this isn't incremental progress. It's a category shift.
For anyone building agent skills, the takeaway is practical: stop treating agents like stateless functions. Start treating them like systems with a lifecycle—session, review, memory update, repeat. The patterns are portable, the infrastructure is arriving, and the gap between prototype and production-grade agent is closing faster than anyone expected.
The agents of the next 12 months won't just be smarter models. They'll be agents that remember what you taught them last month, notice what went wrong last week, and show up today better than they were yesterday.
That's not a chatbot. That's a teammate.