Everybody’s talking about agents

You hear it everywhere right now: agents managing teams of agents, agents replacing whole departments, an agent for every workflow, a future where you manage software the way you used to manage people. The word has stretched to mean almost anything, or even anyone, and most of what it brings to mind doesn’t really exist yet. What’s actually running is smaller and a lot more ordinary: coding assistants like Claude Code and Cursor working next to developers, chat tools writing and summarizing inside companies, and customer-service bots that most teams buy off the shelf instead of building. Those are the agents actually doing work today.

What matters more than the label is the gap between the agent everyone’s describing and the ones actually in production, because that’s where most of the confusion lives. This issue is about what’s really running, why the rest hasn’t caught up, and the key things holding us back.

What’s actually in production

Spend any time around AI right now and the same forecasts keep surfacing: agents taking over customer service, engineering, and operations, with a dedicated one coming for every workflow. Some of that will be true eventually. Almost none of it is true today, and the gap between the pitch and what’s actually deployed is the most useful thing to get straight, whether you’re running a company, building one of these systems, or just trying to read where the work is going.

LangChain’s State of Agent Engineering, its 2026 report built on a late-2025 survey of more than 1,300 practitioners, found that 57 percent of respondents have agents in production. That sounds like the revolution showed up, until you see who answered: close to two-thirds work at tech companies, so the survey captures the front of the pack more than the typical company. Even there, about 30 percent are still just building, and the report itself admits the whole agents-everywhere idea is still early. What people actually run day to day is narrower still. Asked to name it, they listed coding assistants first (Claude Code, Cursor, and GitHub Copilot), with research and customer service close behind. The two use cases that make up more than half of all deployments are customer service, at about 27 percent, and research and data analysis, at about 24 percent. The real map is small: code, customer service, and research, mostly internal, mostly helping a person rather than replacing one. Customer service is worth a caveat, because it was already heavily automated long before this wave of AI: the phone trees and scripted bots people notoriously try to escape by asking for a human. What’s new isn’t that the work is being automated for the first time; it’s that there’s finally a real chance to make that automated experience less of the thing everyone dreads.

The underlying capability is climbing fast. METR, a nonprofit that runs independent evaluations of how well frontier AI systems can carry out complex tasks on their own and what risks that creates, put out an updated estimate in January 2026. It shows that the length of task a frontier model can finish without help has been doubling about every seven months for six years, and lately faster than that, closer to every four months. But the raw numbers are still modest: the strongest models now clear tasks that take a skilled person well over ten hours only about half the time, METR cautions that estimates past sixteen hours aren’t yet reliable, and five-day tasks remain out of reach. So the agent that runs a whole project on its own, the one doing an actual employee’s job, is on the curve; it’s just not what’s shipping this quarter.
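To get a feel for what those doubling times imply, here is a back-of-the-envelope sketch. It is a naive extrapolation, not a METR forecast. The starting point of 10 hours and the 40-hour stand-in for a "five-day task" (five 8-hour days) are assumptions chosen for illustration, and the function name is hypothetical.

```python
import math


def months_to_reach(current_hours: float, target_hours: float,
                    doubling_months: float) -> float:
    """Months for the task horizon to grow from current_hours to target_hours,
    assuming it keeps doubling every doubling_months (a big assumption)."""
    doublings = math.log2(target_hours / current_hours)
    return doublings * doubling_months


# Illustrative inputs only: ~10-hour tasks today, a 40-hour working week as the target.
slow = months_to_reach(10, 40, doubling_months=7)   # long-run pace from the text
fast = months_to_reach(10, 40, doubling_months=4)   # recent, faster pace from the text

print(f"At a 7-month doubling time: about {slow:.0f} months")
print(f"At a 4-month doubling time: about {fast:.0f} months")
```

Going from 10 to 40 hours is two doublings, so the answer lands somewhere between roughly 8 and 14 months under these assumptions, and only at the same coin-flip success rate. That is consistent with "on the curve, not shipping this quarter."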

What’s stopping the rest is quality. In that same LangChain survey, the biggest single barrier to getting agents into production was the quality of their output, named by about a third of people, while cost worries kept dropping. Quality here means the basics: getting it right, staying consistent, staying on task, keeping the right tone. And among the biggest organizations, the ones closest to running this at scale, there are still major issues. They pointed to hallucinations, output that won’t stay consistent, and the ongoing headache of context engineering and managing context at scale.

This is the context problem seen from the production side. Output gets shaky as the context window fills up. The industry’s answer was to build bigger windows, and by 2026 that race is basically settled, with more than a dozen frontier models now shipping windows of a million tokens or more, but the bigger-window fix didn’t hold. What the benchmarks track now is the gap between a model’s advertised window and its effective context, the span it can actually use reliably, and the two aren’t close. The pattern NVIDIA’s RULER benchmark established, that effective context runs well short of the advertised window, has held across every model generation since, and 2026’s harder tests make it starker still. On Google DeepMind’s MRCR v2, which asks a model to find and combine several facts spread across a million tokens, accuracy falls off sharply for most of the field, and even the strongest models land well below where they sit on short inputs. More tokens, in other words, can mean worse answers rather than better ones.

Dex Horthy, who wrote the 12-Factor Agents guide in early 2025, has a working name for where this starts to bite: the dumb zone, which he puts at around 40 percent of the context window, the point where a model starts agreeing with your mistakes instead of catching them and forgetting things you told it a few thousand tokens back. He offers it as a rule of thumb, and it lines up with what the benchmarks keep showing: degradation tends to set in well before the window is full, so the number of tokens you can fit is rarely the number you should use. The fix is curation: deciding what the model sees and, just as much, what it never sees, handing it the three documents that matter instead of the thirty that might, and clearing out the dead ends it would otherwise read as live signal. Anthropic’s September 2025 guidance on context engineering sums up the whole job as finding the smallest set of high-signal tokens that still get the work done, and by 2026 that discipline was being called out explicitly across most engineering teams and agent stacks.
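For builders, the curation idea can be sketched in a few lines. This is a minimal illustration, not anyone's published method: the Doc fields, the relevance scores, the 0.5 cutoff, and the curate function are all hypothetical, and the 40 percent budget simply borrows the rule of thumb above. How you score relevance and count tokens is up to your own stack.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    tokens: int        # token count, measured however your stack measures it
    relevance: float   # hypothetical score in [0, 1]; higher means more useful


def curate(docs, window_tokens, budget_percent=40, min_relevance=0.5):
    """Pick the most relevant docs that fit inside a fraction of the window.

    Anything below min_relevance is never shown to the model, even if
    there is room for it.
    """
    budget = window_tokens * budget_percent // 100
    chosen, used = [], 0
    for doc in sorted(docs, key=lambda d: d.relevance, reverse=True):
        if doc.relevance < min_relevance:
            break  # everything after this point is weaker signal
        if used + doc.tokens <= budget:
            chosen.append(doc)
            used += doc.tokens
    return chosen, used


docs = [
    Doc("spec", 300, 0.9),
    Doc("old-thread", 200, 0.8),
    Doc("dead-end-notes", 100, 0.2),
    Doc("api-summary", 50, 0.6),
]
chosen, used = curate(docs, window_tokens=1000)
print([d.name for d in chosen], used)
```

With a 1,000-token window the budget is 400 tokens, so the sketch keeps the spec and the API summary, skips the thread that would blow the budget, and never shows the dead-end notes at all. The design choice worth noticing is that leaving things out is a deliberate step, not a side effect of running out of room.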

The through-line is simple. Everybody’s talking about agents; the ones actually in production are narrower and more human-assisted than the talk suggests; and what’s holding the rest back comes from more than one place. Output quality is the largest barrier, but it isn’t the only one. In LangChain’s 2026 survey, latency comes next, named by about a fifth of teams as agents move into customer-facing work. The harder problem underneath both is keeping output reliable as these systems scale. Context is the most controllable lever on quality, which is why context engineering is becoming a central gate to production rather than a technical footnote, the unglamorous work that decides whether an agent is a demo or something you can trust. The companies, and the people, who get real value over the next year will be the ones who put their effort there, instead of waiting for the model to get big enough to make the problem disappear.

What You Should Read

Picks for this issue, with the highlights worth your time and a link to each.

1. Anthropic shuts down its Mythos models under a US government order 

Wired, June 12, 2026 · policy, national security, governance 

Days after putting its first Mythos-class model in front of the public, Anthropic disabled both Claude Fable 5 and Mythos 5 to comply with an export-control directive from the US government citing national security. The order told the company to cut off any foreign national inside or outside the US, including its own foreign-national staff, and because that reaches everyone, the only way to comply was to turn the models off for all customers.

  • The order arrived Friday at 5:21pm ET and was framed as an export control on foreign-national access; every other Claude model stayed up.
  • The government's stated concern is a jailbreak of Fable 5, which Anthropic calls narrow, amounting to asking the model to read a codebase and fix the flaws it finds, vulnerabilities other public models can turn up too.
  • It is the latest break with the Trump administration, which earlier labeled Anthropic a supply-chain risk over limits the company set on military use, and Anthropic is complying while publicly disputing the call. 

Why read it: It is the first time a US order has pulled a frontier model off the market, and the clearest sign yet that model access is becoming a political lever rather than a purely commercial decision. Read it

2. Initial impressions of Claude Fable 5

Simon Willison, June 9, 2026 · model release, capability

Willison spent about five and a half hours with Anthropic’s first publicly released Mythos-class model and came away calling it a beast: slow, expensive, and able to work through almost anything he gave it. It is the clearest hands-on read on where the frontier sits right now.

  • Fable 5 is priced at twice Opus: $10 per million input tokens and $50 per million output tokens.
  • He describes the model as feeling big, not only in speed and cost but in how much it knows.
  • Fable’s stricter guardrails trip often enough that the Claude API added new ways to flag a refusal, plus an option to fall back to another model automatically when something gets rejected.

Why read it: It is the fastest way to understand what the new top-tier model actually changes in practice (if/once access is restored). Read it

3. Open and closed models are on different exponentials

Nathan Lambert, Interconnects, June 1, 2026 · economics, strategy

Lambert argues the open versus closed debate is really an economic one, resting on whether buyers keep paying a large premium for the best closed models. He makes the case that coding agents are the first market proving they will.

  • He frames early 2026 as a seminal moment because coding agents are the first big market paying a real premium for better intelligence.
  • He expects the API businesses at the top labs to decay as they protect their best models to guard token supply and avoid distillation.
  • His read for operators is a mix: the frontier stays an oligopoly of integrated closed labs, while the open ecosystem captures more total value, spread thin across many companies at commodity prices.

Why read it: It gives leaders a clean economic lens for the model decisions they are already making. Read it

4. Co-Existence and the End of Co-Intelligence

Ethan Mollick, One Useful Thing, June 4, 2026 · future of work, leadership

Mollick retires his own co-intelligence framing and replaces it with co-existence, arguing the working relationship with AI is now a negotiation you keep re-running as models improve. He moves the leadership question from how to collaborate with AI to when to refuse it and when to hand it the keys.

  • He says the new questions are when to refuse AI’s help even when it is offered, and when to hand over the keys entirely.
  • He predicts AIs will read your work and decide whether to recommend it to their users, acting as reader, critic, and gatekeeper.
  • The piece also announces his next book, Co-Existence, out October 20.

Why read it: It is a useful new frame for how leaders should think about working alongside models that are sometimes better than they are. Read it

5. State of the software engineering job market in 2026, part 2

Gergely Orosz and Jessica Salmon, The Pragmatic Engineer, June 9, 2026 · careers, labor markets, compensation

A data-grounded read on where technical hiring actually sits in 2026, built from never-before-shared numbers across TrueUp, SignalFire, Interviewing.io, and Live Data Technologies. The picture is a reshaped market: AI-engineering roles and pay are pulling away from general software work, management layers are thinning, and the top AI labs have passed Big Tech as the most-wanted employers.

  • AI engineers are now in higher demand and command higher offers than software engineers, equity included, and at the 80th percentile in the US a $300K-plus base salary is the norm for senior engineers.
  • The great flattening continues, with fewer engineering managers for every engineer across the industry.
  • Top AI labs have overtaken Big Tech as the most sought-after employers, with Anthropic the single most in-demand name among candidates preparing to interview.
  • Frontend, native iOS, and Android titles are shrinking fastest, while AI and forward-deployed-engineer roles surge.

Why read it: A credible, numbers-first picture of how AI is reshaping technical careers and pay, useful whether you are hiring, planning a team, or advising someone early in theirs. Read it

6. Apple rebuilds Siri on Google’s Gemini

TechCrunch, June 9, 2026 · platform strategy, build vs. buy (some numbers from CNBC)

At Tim Cook’s final WWDC, Apple unveiled a rebuilt Siri running on a custom version of Google’s Gemini, deciding the fastest path to a capable assistant was to license one from a direct rival rather than build it. The useful read is not that Apple fell behind. It is that capability was never the real constraint; speed was. And Apple kept the layers it actually differentiates on: privacy, device integration, and the experience of using the thing.

  • Apple licensed a custom version of Google’s Gemini to power the rebuilt Siri, a multi-year deal first reported in January at roughly $1 billion a year, after it had evaluated OpenAI and Anthropic.
  • It kept what it differentiates on, with queries running on-device or in its Private Cloud Compute and the contract barring Google from training future models on Apple user data.
  • Even with more than 160,000 employees and over $34 billion in R&D last fiscal year, Apple judged that licensing the best model and owning the experience was the pragmatic call, not an admission of failure.

Why read it: It is the clearest recent case of a build-versus-buy call going the way speed and differentiation point, useful for anyone weighing which layers of their own stack are worth owning. Read it

Who You Should Follow

Ethan Mollick, a Wharton professor who writes One Useful Thing, and the clearest voice going on what AI actually changes about how we work; he shows his evidence rather than just asserting it. LinkedIn

Grant Lee, co-founder and CEO of Gamma, who took an AI product from zero to millions of users and offers a builder’s-eye view of where consumer AI is heading. LinkedIn

Kirill Eremenko, founder of SuperDataScience, worth following for the technical side; he has spent years teaching data science and AI to people who want to build with it, not just talk about it. LinkedIn

Alex Lieberman, co-founder of Morning Brew and now building Storyarb and Tenex, with honest, unvarnished founder writing on building audiences and companies, now pointed squarely at the AI era. LinkedIn

Andrew Templeton, who runs AI and operations at CSC Generation and posts in-the-weeds, practitioner-level notes on getting real work out of LLMs, down to the prompts and frameworks he uses day to day. LinkedIn

What’s Moving in Philly

Five rooms worth being in around Philly over the next two weeks, from a low-key happy hour to a newsroom coffee chat.

(powered by Kynra)

Research mentioned in this issue

  • LangChain, “State of Agent Engineering,” 2026 report (survey run November to December 2025).
  • METR, “Time Horizon 1.1,” January 2026.
  • NVIDIA, RULER long-context benchmark, and Google DeepMind, MRCR v2 multi-fact retrieval benchmark (2026 results), which together show frontier models using well under their advertised context windows.
  • Dex Horthy, “12-Factor Agents,” HumanLayer, 2025 (github.com/humanlayer/12-factor-agents). The “dumb zone” comes from his 2025 AI Engineer Code Summit talk, “No Vibes Allowed: Solving Hard Problems in Complex Codebases.”
  • Anthropic, “Effective Context Engineering for AI Agents,” September 2025.

Copyright 2026 - Christie Mealo