The “Country of Geniuses” Needs a Babysitter
Agentic AI, a $380 Billion IPO, and the $2 Trillion Problem Nobody Wants to Name

It is a dazzling vision of rainbows, unicorns, and seamless automation — perfectly timed, purely by coincidence, to fend off the FUD storm during an IPO year pushing a $380 billion valuation. Who wouldn’t want to invest in a datacenter full of Einsteins?
⚠️ Current Conditions — AI Market Weather Advisory
Visibility: Near zero between valuation and fundamentals.
Pressure system: $30B high-pressure capital event centred over San Francisco, February 12, 2026. Six times oversubscribed.
Cold front: Enterprise ROI data moving in from the east. MIT Media Lab (August 2025): 95% of organisations reporting zero return on generative AI investment.
Forecast: IPO season. Partly cloudy. $800B–$1.96T annual revenue shortfall unresolved. Public markets expected to absorb what private credit and drying VC funds no longer can.
This is not a technology forecast. This is a prospectus warning.
Before the Pitch Deck: The $380 Billion Rainforest
On February 12, 2026, Anthropic closed a $30 billion Series G round at a $380 billion post-money valuation — the second-largest private tech financing deal in history. The round was six times oversubscribed: they asked for $10 billion and took $30 billion. Led by Singapore’s sovereign wealth fund GIC and Coatue, it drew in Sequoia, Founders Fund, Microsoft, Nvidia, and a who’s who of institutional capital.
At $380 billion against $14 billion in annualised revenue and no profitability, that is a 27× revenue multiple on a money-losing company. The IPO, targeted for late 2026, will invite the scrutiny that private markets have so far mercifully withheld.
| Metric | Figure | Context |
|---|---|---|
| Anthropic valuation | $380B | 27× revenue multiple, pre-profit |
| Capital raised (Series G) | $30B | 2nd largest VC deal in history |
| Annualised revenue | $14B | Revenue grew 10× each year since launch |
| Claude Code ARR | $2.5B | Fastest-growing product line |
| Global AI spending 2026 | $2.53T | Source: Gartner. Over half: infrastructure |
| Revenue needed by 2030 | $2T / yr | To justify capex (Bain). Best-case: $1.2T |
| Current enterprise AI ROI | 5% | 95% of orgs reporting zero returns (MIT, 2025) |
| OpenAI projected losses 2028 | $74B | Operating losses in a single year |
According to Morningstar, AI revenues must grow from roughly $20 billion to $2 trillion annually by 2030 just to justify the infrastructure already committed. Bain puts the required run rate at $2 trillion; best-case forecasts land at $1.2 trillion. That is an $800 billion gap on the optimistic scenario. As one analyst put it: “Everyone knows. Nobody cares. The story’s too good.”
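The two headline numbers in this section are simple arithmetic, and it can help to see them computed rather than asserted. The sketch below only uses figures quoted in this article; the function and constant names are illustrative and do not come from any source.

```python
# Figures quoted in this article, in billions of US dollars.
VALUATION_B = 380.0                 # Series G post-money valuation
ANNUALISED_REVENUE_B = 14.0         # annualised revenue
REQUIRED_REVENUE_2030_B = 2000.0    # annual revenue needed by 2030 (Bain)
BEST_CASE_REVENUE_2030_B = 1200.0   # best-case forecast for 2030


def revenue_multiple(valuation_b: float, revenue_b: float) -> float:
    """Valuation divided by annualised revenue."""
    return valuation_b / revenue_b


def revenue_gap(required_b: float, forecast_b: float) -> float:
    """Shortfall between required and forecast annual revenue."""
    return required_b - forecast_b


multiple = revenue_multiple(VALUATION_B, ANNUALISED_REVENUE_B)
gap_b = revenue_gap(REQUIRED_REVENUE_2030_B, BEST_CASE_REVENUE_2030_B)

print(f"Revenue multiple: {multiple:.1f}x")       # Revenue multiple: 27.1x
print(f"Best-case gap: ${gap_b:.0f}B per year")   # Best-case gap: $800B per year
```

Swapping in a different revenue or forecast figure shows how sensitive both numbers are to the inputs.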
Gartner confirms AI is officially in the Trough of Disillusionment in 2026. The 95% enterprise ROI failure rate has not dimmed the enthusiasm of capital markets. It has merely relocated it — from VC funds (drying up by mid-2025 as LPs demanded returns) to sovereign wealth, private credit, and, imminently, public markets.
The Davos speech and the $380B raise did not happen in separate rooms. The “country of geniuses” is not just a philosophical claim — it is a prospectus argument. The keynote is the roadshow. The pitch deck slide reads: autonomous agents replacing software engineers end-to-end, six to twelve months from now.
2023: AGI hype. UBI talks. Job panic. Existential dread. Capitol chaos.
2026: IPO parade. Old fairy tales. Labs bleed. ROI ghosted. FUD incoming.
What the Documentation Actually Says
If AI is on the verge of replacing software engineers, the official documentation for the leading agentic coding tools should read like a handoff guide. Instead, read the technical docs for Claude Code and Cursor — the two most prominent players — and you find a checklist that looks like this:
Plan everything in detail upfront.
Be ultra-specific in every prompt.
Add your own verifiers and tests.
Review every single diff before accepting it.
Reset the chat when context starts to drift.
That is not autonomous behaviour. That is a tool demanding an ever-expanding load of supervision labour from the human who was supposedly about to be replaced. A February 2026 Google DeepMind paper on intelligent AI delegation describes current delegation protocols as relying on “static, opaque heuristics that would likely fail in open-ended agentic economies.”
Claude Code’s own documentation carries a direct warning: jump straight into coding without extensive upfront specification, and the model is liable to “solve the wrong problem.” A system that routinely misidentifies the problem it is supposed to solve — unless micromanaged at every stage — is not a replacement intelligence. It is confident autocomplete.
“AI-generated code can look right while being subtly wrong. You become the only feedback loop.”
— Official Cursor documentation
The word only is doing extraordinary work in that sentence. These tools are sophisticated generation engines — fast, fluent, and impressively wide in their apparent knowledge. What they structurally lack is any internal verification mechanism. They produce plausible output and rely entirely on you to determine whether plausible means correct.
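To make "you become the only feedback loop" concrete, here is a minimal sketch of a human-written verification gate. Everything in it is hypothetical: `passes_checks`, the leap-year functions, and the check list are invented for illustration and are not taken from the Cursor or Claude Code documentation. The "generated" function stands in for output that looks right while being subtly wrong.

```python
from typing import Callable, Iterable, Tuple

# A check pairs the arguments to call with the result a human expects.
Check = Tuple[tuple, object]


def passes_checks(candidate: Callable, checks: Iterable[Check]) -> bool:
    """Return True only if the candidate matches every human-written check."""
    for args, expected in checks:
        try:
            if candidate(*args) != expected:
                return False
        except Exception:
            # A crash counts as a failed check, not as a pass.
            return False
    return True


def generated_is_leap_year(year: int) -> bool:
    """Hypothetical generated code: plausible, but ignores the century rule."""
    return year % 4 == 0


def reviewed_is_leap_year(year: int) -> bool:
    """Version corrected after human review."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


# Human-written checks, including the edge cases the generated code misses.
CHECKS = [((2024,), True), ((2023,), False), ((1900,), False), ((2000,), True)]

print(passes_checks(generated_is_leap_year, CHECKS))  # False
print(passes_checks(reviewed_is_leap_year, CHECKS))   # True
```

The gate is only as good as the checks a person writes for it, which is the point of the quote: the judgment lives outside the generator.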
The Paradox of Leverage
Here is the part the “year of agents” narrative consistently elides: these systems meet you exactly where you are. They do not elevate capability uniformly. They multiply the judgment you bring to them — which makes them very good news for people who already have strong judgment, and a liability for people who do not.
A senior engineer with deep architectural instincts, security awareness, and debugging experience gets genuine leverage. The model handles repetitive scaffolding while they direct the meaningful decisions. But hand the same tools to a beginner without that foundation, and you get what Andrej Karpathy accurately diagnosed as “slop” — code that looks finished, passes surface inspection, and quietly accumulates structural debt.
| | Senior Engineer | Beginner / Hobbyist |
|---|---|---|
| Brings | Architecture, security instincts, systems thinking, debugging depth | Enthusiasm and prompts |
| Gets | Leverage | Confident technical debt |
True intelligence lifts all boats. What we currently have is a multiplier — and multipliers are indifferent to the sign of what they amplify.
"We love AI agents. All of our developers are on the Claude Max plan and barely writing code themselves at this point. But the fact is, at this stage, AI is a tool to augment humans. It's not an employee. These things mess up all the time. Our developers are constantly putting them back on track."
The Real-World Audit: What Happens When Theory Meets Invoice
The most damning evidence does not come from AI sceptics. It comes from the industry itself, attempting to prove the thesis and failing publicly.
The Remote Labor Index (October 2025)
Researchers from the Center for AI Safety and Scale AI took 240 real Upwork freelance projects — already completed by humans for pay, average value $630, total potential earnings of $143,991 — and gave six leading AI agents the exact same job briefs, files, and requirements. The task categories covered graphic design, video editing, game development, data analysis, architecture, product design, and software engineering: the precise domains where AI is supposedly strongest.
Results:
| Metric | Figure |
|---|---|
| Best-performing AI (Manus) completion rate | 2.5–3.75% to professional standard |
| All other agents (Claude, GPT, Gemini, Grok variants) | ~2.1% or lower |
| Overall failure rate | ~96–97% |
| Total AI “earnings” across all agents | ~$1,810 of $143,991 possible |
The country of geniuses, tested on the work it was supposedly born to do, completed roughly 3% of it to professional standard. The remaining 97% needed a human. Review RLI results at Scale AI.
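For readers who want to check the scale of these results, the sketch below recomputes two quantities from the figures quoted above: the share of possible earnings the agents captured, and how many of the 240 projects a 2.5–3.75% completion rate implies. The variable names are illustrative, and the project counts are derived here from the quoted percentages rather than reported by the study.

```python
# Figures as quoted from the Remote Labor Index summary above.
PROJECTS = 240
TOTAL_POSSIBLE_USD = 143_991
TOTAL_AI_EARNED_USD = 1_810              # approximate, across all agents
BEST_AGENT_COMPLETION = (0.025, 0.0375)  # 2.5% to 3.75%


def share(part: float, whole: float) -> float:
    """Return part as a fraction of whole."""
    return part / whole


earned_share = share(TOTAL_AI_EARNED_USD, TOTAL_POSSIBLE_USD)

# Implied project counts, derived from the quoted rates (not reported directly).
projects_completed = tuple(round(rate * PROJECTS) for rate in BEST_AGENT_COMPLETION)

print(f"Share of possible earnings captured: {earned_share:.2%}")
print(f"Implied projects completed by the best agent: "
      f"{projects_completed[0]} to {projects_completed[1]} of {PROJECTS}")
```

On these inputs the best agent's rate corresponds to roughly six to nine projects out of 240, and total agent earnings come to a little over 1% of what was on offer.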
Two things about this result are worth holding simultaneously. First, these were digital-only, remote, standardised tasks — the favourable case for AI, not the hard one. Second, the 96% failure rate is not an argument that the technology is useless. It is an argument about where it works and where it doesn’t — a distinction the replacement narrative systematically refuses to make.
One anticipated counter-argument: specialised agents perform better than general ones. This is true in high-coverage domains. It is the entire point. We will return to it.
The Human+Agent Productivity Index (November 2025)
Upwork’s own research arm tested AI agents on over 300 real client projects from its platform — low-complexity tasks under $500, representing less than 6% of total work volume by value, pre-selected for having “a reasonable chance of success.” These are not the hard cases. These are the easy ones.
| Condition | Completion rate |
|---|---|
| Standalone AI agents | 17–64% depending on category |
| Human + AI collaboration | Up to 93% (data science category) |
| Improvement from adding humans | Up to 70 percentage points |
Upwork’s own CTO Andrew Rabinovich drew the conclusion the data demanded: “AI agents aren’t that agentic, meaning they aren’t that good.”
Even on the simplest, most pre-selected slice of digital work, standalone agents failed between 36% and 83% of the time. When humans stayed in the loop, completion rates rose as high as 93%. That is not a replacement story. That is a collaboration story. The hype is selling the wrong product.
The HAPI data also validates the leverage paradox precisely: the gains are real, but they accrue to human-guided workflows, not autonomous deployment. The tools work. Autonomy doesn’t.
What Happens When You Scale It
The Davos vision implies not one agent but an army of them — a whole country. So it is worth asking what the empirical record says about scaling agentic systems in practice, rather than in demo conditions.
Google’s December 2025 paper Towards a Science of Scaling Agent Systems tested exactly this across 180 controlled configurations, spanning three LLM families and four agentic benchmarks. When multi-agent architectures were deployed on complex, sequential, multi-step tasks — the kind of work that constitutes real software engineering — performance did not scale linearly. It degraded.
| Finding | Result |
|---|---|
| Sequential reasoning tasks (e.g. planning) | −39% to −70% across all multi-agent variants |
| Parallelisable tasks (e.g. financial analysis) | +80.8% with centralised coordination |
| Independent agents: error amplification | 17.2× through unchecked propagation |
| Centralised coordination: error amplification | 4.4× via supervised aggregation |
| Capability saturation effect (>45% accuracy) | Adding agents yields negative returns (β = −0.408, p < 0.001) |
The mechanism is straightforward. Because individual agents cannot independently verify truth, one agent’s confident hallucination becomes the factual foundation for the next agent’s output. Errors do not cancel out. They compound. The paper identifies a capability saturation effect: once a single-agent baseline exceeds roughly 45% accuracy, adding more agents yields diminishing or negative returns. The coordination overhead costs more than the marginal gain.
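A toy model shows why unchecked errors compound rather than cancel. It is a deliberate simplification and not the model used in the paper: it assumes every step in a sequential chain is correct with the same independent probability, and that a single uncaught error spoils everything downstream. The function name and the 95% figure are illustrative assumptions.

```python
def chain_success_probability(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a sequential chain is correct.

    Toy assumption: steps fail independently and no step verifies its input,
    so the chain succeeds only if all steps succeed.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


# Even a step that is right 95% of the time decays quickly over a long chain.
for steps in (1, 5, 10, 20):
    probability = chain_success_probability(0.95, steps)
    print(f"{steps:>2} steps: {probability:.3f}")
```

Under these assumptions a ten-step chain of 95%-accurate steps finishes correctly only about 60% of the time, which is the intuition behind the error-amplification rows in the table. Real systems are messier, since errors are correlated and some are caught, so treat the numbers as illustration only.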
Notably, software engineering is a fundamentally sequential reasoning task — placing it squarely in the degradation zone.
The paper also explains the long-session drift that Cursor’s documentation acknowledges and works around by recommending a restart. Multi-agent systems that appear to improve on static benchmarks “exhibit fundamentally different scaling behaviour when evaluated on tasks requiring sustained environmental interaction, where coordination overhead and error propagation dynamics dominate.” As tool count grows beyond roughly sixteen, the coordination tax scales disproportionately. The fresh-conversation workaround is not a UX quirk. It is an architectural admission.
Why the Next Model Won’t Fix It: The Coverage Constraint
Every technical failure documented above has a common root. Understanding it is what separates informed scepticism from vague unease — and it is what pre-empts the standard counter: “Sure, but these are early days. Models will keep improving.”
They will. And it won’t be enough. Here is why.
The kitchen and the pantry
AI is a kitchen. Data is the ingredients.
Abundant, clean ingredients → strong dishes.
Sparse, noisy ingredients → inconsistent meals.
Private supply chains → inaccessible recipes.
The chef cannot cook what isn’t in the pantry.
AI performance does not scale with model size alone. It scales with the density, quality, and structure of data in a domain — what the Coverage Constraint framework calls distribution mass. Where a domain has high repetition, structured representation, consistent terminology, and deep public documentation, the model acquires stable learned patterns. Where it does not, representation is shallow, outputs are unstable, and the model interpolates from adjacent domains it knows better. This is not a bug. It is the mechanism.
What each layer of the stack can and cannot do
The job-replacement narrative implicitly assumes four things about how AI systems work. Each assumption is wrong in a specific, architectural way:
| Layer | Can it expand coverage? | Can it amplify illusion? |
|---|---|---|
| Pre-training | Yes: this is where distribution mass is defined | Yes: token gravity bias |
| RL Fine-Tuning (RLHF) | No: reshapes output probability, doesn’t inject knowledge | Yes: confidence smoothing |
| Chain-of-Thought / Reasoning | No: reorganises existing learned transitions, creates no new facts | Yes: structured hallucination |
| Agents / Tool Use | Conditionally: only if a high-quality external corpus exists | Yes: iterative compounding |
The core insight: only pre-training and real retrieval add substrate. Everything else reshapes presentation.
This is the Confidence Amplification Illusion — high fluency over shallow geometry. RLHF makes the model sound more certain precisely as the domain gets thinner. The failure mode and the success signal are indistinguishable from the outside. This is why the “confidently wrong” outputs Salesforce documented in production were not edge cases. They were the expected behaviour of a well-aligned model operating in a thin zone.
The Transformer limitations that don’t disappear between model releases
The “just wait for GPT-6 / Claude 5 / Gemini Ultra” counter relies on the audience not knowing what specifically has not changed across generations:
No internal verification mechanism. The model cannot distinguish confident-correct from confident-wrong. It produces the most statistically plausible completion. Plausible and correct are different properties.
Context coherence degrades under sustained sequential load. This is architectural. It is why Cursor recommends fresh conversations and why the Kim et al. paper documents coordination overhead scaling disproportionately with task length. Wider context windows delay the problem; they do not eliminate it.
RLHF confidence smoothing is not fixable at the fine-tuning layer. If the pre-training manifold is thin for a domain, RLHF increases the model’s expressed confidence without increasing its structural accuracy. The DeepMind delegation paper names this directly: current systems “lack robust mechanisms for the dynamic assessment of competence, reliability, and intent.” Moving beyond reputation scores toward genuine competence assessment would require “real-time resource availability, current load, projected task duration, and the specific sub-delegation chains in operation” — none of which are architectural features of current systems.
Pre-training defines the convex hull. No amount of downstream fine-tuning expands that hull meaningfully. It reshapes weightings within it. You cannot prompt-engineer your way to a larger pantry.
Coverage is patchy, and expensive to expand where it matters
The most important asymmetry is this: the domains where agentic AI performs well in demos — public repositories, Stack Overflow, documented APIs, standardised frameworks — are exactly the high-coverage, high-distribution-mass domains. The domains where enterprise clients need it most — proprietary architecture decisions, internal business logic, regulated workflows, institutional tacit knowledge — are structurally thin zones.
This is also the answer to the specialisation defence. Every specialised agent that actually works — GitHub Copilot for boilerplate, Midjourney for stock imagery — operates in a high-coverage zone. The moment you need it to understand your internal architecture, compliance requirements, or customer context, you are back in thin territory, building the training data yourself. Enterprise value lives in proprietary business logic, not public repositories. Specialisation works where the pantry already exists. It does not conjure one.
Expanding coverage into enterprise domains requires the very human expertise the narrative claims AI is replacing. Every time an engineer documents their architecture decisions, security reasoning, and debugging heuristics in enough granular detail to make an agent useful, they are building the training substrate the agent needs. The knowledge transfer runs from the human to the machine, not the other way around. The direction of dependency is the opposite of what the pitch deck implies.
Coverage is real. Coverage is patchy. And coverage is expensive to expand in the domains that matter most. The technology works where the pantry is stocked. The hype assumes the pantry is universal. It isn’t.
Multipliers do not rescue zero.
Exhibit A: The Salesforce Autopsy
If you need a recent case study in what happens when the hype narrative meets production reality at enterprise scale, Salesforce provides the most instructive corpse.
In October 2025, CEO Marc Benioff launched Agentforce 360 with a promise of “autonomous agents that resolve customer issues end-to-end without needing to be micromanaged.” Salesforce claimed to be automating the equivalent work of 4,000 employees via AI agents.
By January 2026 — three months later — Salesforce had quietly pivoted to “deterministic controls.” SVP of Product Marketing Sanjna Parulekar stated that confidence in LLMs had declined over the past year: “All of us were more confident about large language models a year ago.”
The failure modes were textbook thin-zone behaviour:
| Failure Mode | What Happened in Production |
|---|---|
| Instruction omission | LLMs began dropping directives when given more than 8 instructions, which is critical for precision business tasks |
| AI ‘drift’ | Agents lost focus on primary objectives when users asked unrelated questions |
| Confidently wrong outputs | Even low-frequency inaccuracies were unacceptable when agents responded directly to customers |
| Doom-prompting cycle | Engineers rewrote prompts continuously. The underlying issues could not be resolved at the prompt level |
Vivint, a home security company using Agentforce for customer support, reported that despite clear instructions, the system intermittently failed to send satisfaction surveys after interactions. The fix required “deterministic triggers” — which is to say: code. The autonomous agent needed a human-written workflow to do the one thing it was supposed to do reliably.
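What a "deterministic trigger" means in practice can be sketched in a few lines. This is not Salesforce's or Vivint's implementation and uses no Agentforce API; `Interaction`, `close_interaction`, and the stub agent are hypothetical names invented for illustration. The idea is that the model still writes the reply, while ordinary code guarantees the follow-up step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Interaction:
    """Hypothetical support ticket; not a real Agentforce object."""
    customer_id: str
    closed: bool = False
    survey_sent: bool = False


def close_interaction(
    interaction: Interaction,
    generate_reply: Callable[[Interaction], str],
    send_survey: Callable[[str], None],
) -> str:
    """Close a ticket, letting the model write the reply but not own the survey."""
    reply = generate_reply(interaction)  # non-deterministic, model-driven part
    interaction.closed = True
    # Deterministic trigger: the survey is sent by code on every close,
    # whether or not the model "remembered" the instruction.
    if not interaction.survey_sent:
        send_survey(interaction.customer_id)
        interaction.survey_sent = True
    return reply


def forgetful_agent(interaction: Interaction) -> str:
    """Stub standing in for an agent that ignores the survey instruction."""
    return "Glad that is resolved."


sent: List[str] = []
ticket = Interaction(customer_id="cust-001")
close_interaction(ticket, forgetful_agent, sent.append)
close_interaction(ticket, forgetful_agent, sent.append)  # second call sends nothing new
print(sent)  # ['cust-001']
```

The design choice is to move the guarantee out of the prompt and into the control flow, so that reliability no longer depends on the model following an instruction.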
Industry analyst Sanchit Vir Gogia framed the pivot precisely: “Agentforce was pitched as self-directed… What Salesforce is now saying is that autonomy without guardrails is unscalable.” That is not a criticism of the technology. It is a precise technical description of inserting boundary conditions around a thin-zone deployment — the Coverage Constraint conclusion, arrived at through production pain rather than architecture documentation.
The market response was unambiguous. Salesforce stock fell 43% over the twelve months to February 17, 2026, after peaking in December 2024. The head of Agentforce departed. The autonomous AI workforce became “workflows.”
AI hype burns cash faster than AI hallucinates. Agentforce AI workforce → workflow cogs. Reputation hit. Investors replied: −43%.
What Actual Autonomy Would Require
The February 2026 DeepMind delegation paper makes clear just how far current systems are from genuine autonomy. Intelligent delegation, it argues, requires not just task execution but “transfer of authority, responsibility, accountability, clear specifications regarding roles and boundaries, clarity of intent, and mechanisms for establishing trust between the two (or more) parties.” None of those things are baked into current agentic toolchains as defaults. They are left entirely to the human.
A genuinely autonomous system would detect its own errors before surfacing them. It would refuse structurally unsafe output rather than completing it confidently and waiting for a human to notice. It would escalate uncertainty rather than resolving it through statistical smoothing. It would enforce constraints by default rather than by permission.
What genuine autonomy would specifically require — and what current architectures do not provide — is what the DeepMind paper calls dynamic competence assessment: “real-time resource availability, current load, projected task duration, and the specific sub-delegation chains in operation.” The model would need to know not just what it knows, but how well it knows it, in real time, for the specific task at hand. That is a fundamentally different capability from next-token prediction, however sophisticated.
The paper also flags a risk that rarely makes it into keynotes: de-skilling. As AI handles increasingly routine work, human operators lose the situational awareness required to catch failures when it matters. The engineer who spends six months prompt-engineering agents rather than debugging systems loses the debugging instincts that make them the only reliable feedback loop. The knowledge transfer is not neutral.
“2025 was supposed to be the year of agents. Instead, it’s the year of cleaning up their messes.”
— Gary Marcus
We are still running into the same Transformer-architecture bottlenecks visible in 2023. The inference is faster. The context windows are wider. The marketing is considerably louder. But until these models reliably know when they are wrong — until they possess some structural mechanism for self-verification rather than confident generation — the “country of geniuses” is autocomplete with a Super Bowl advertising budget.
A Satirical Illustration: The 24/7 AI Workforce
A startup licenses an autonomous AI workforce.
Week 1. The senior engineer spends his entire morning onboarding the AI that was purchased to replace him.
Week 2. The AI builds a beautifully documented, perfectly formatted authentication system. It is, unfortunately, the wrong one entirely.
“Can’t it just remember our architecture?”
“No.”
The founder upgrades to Pro. They immediately run out of tokens.
Week 3. A swarm of twelve agentic geniuses is now confidently generating React components inside a Go backend. The AI does not detect its own errors. It autocompletes authentication leaks silently and with absolute statistical confidence.
“If I’d hired a junior, they’d at least get better over time.”
The engineer now supervises twelve agents that cannot function without him. He is, officially, the only irreplaceable part.
The country of geniuses did not replace the engineer. It gave him twelve new direct reports who will never learn, never improve, and always need him to be the feedback loop that the documentation quietly warned was unavoidable from the start.
That is not a revolution in software engineering. That is a productivity tool with extraordinary branding. The distinction matters — and the people being asked to make career and business decisions based on the hype deserve to know which one they are actually buying.
Closing: Rainbows, Unicorns, and the 10-K
The technology is real. The productivity gains for high-coverage domains are real. Claude Code does accelerate senior engineers. The parallelisable task gains are real. When humans stay in the loop, completion rates climb as high as 93%. None of this is in dispute.
What is in dispute is the extrapolation — from genuine gains in high-coverage, high-distribution-mass domains with humans guiding the work, to civilisational replacement of software engineering within six to twelve months. That extrapolation requires the pantry to be universal. It isn’t. It requires RLHF to eliminate brittleness. It doesn’t. It requires agents to fabricate missing substrate. They can’t. And it requires autonomous deployment to match what human-guided collaboration already achieves. The RLI and HAPI data show it doesn’t come close.
The most structurally ironic data point in the current landscape: Claude Code — Anthropic’s fastest-growing revenue line and the primary justification for a $380 billion valuation — is also the product whose own official documentation most clearly demonstrates why the agents-replace-engineers thesis is premature. The product being used to justify the IPO contains, in its own docs, the refutation of the marketing claim justifying the valuation.
When the 10-K filings land and the quarterly earnings calls begin, the questions will not be about rainbows or datacenters full of Einsteins. They will be about margin, retention, churn, and the gap between $14 billion in revenue and $380 billion in implied value.
The rainbows are real. The unicorns are real. The seamless automation is the part still waiting for its roadmap to profitability.
The country of geniuses still needs a babysitter. The babysitter is now being asked to buy the house.
Sources & References
Tomašev, N., Franklin, M., & Osindero, S. (2026). Intelligent AI Delegation. Google DeepMind. arXiv:2602.11865
Kim, Y., Gu, K., Park, C., et al. (2025). Towards a Science of Scaling Agent Systems. Google Research / DeepMind / MIT. arXiv:2512.08296v2
Center for AI Safety / Scale AI (2025). Remote Labor Index. October 2025.
Upwork (2025). Human+Agent Productivity Index (HAPI). November 2025.
Rabinovich, A. (2025). CTO, Upwork. Quoted in HAPI press materials.
MIT Media Lab (2025). Enterprise Generative AI ROI Report. August 2025.
Bain & Company (2025). AI infrastructure revenue requirements analysis.
Gartner (2026). Worldwide AI Spending Forecast. January 2026.
Morningstar / Sparkline Capital (2025). Surviving the AI Capex Boom. October 2025.
CNBC (2026). Anthropic closes $30 billion funding round at $380 billion valuation. February 12, 2026.
Fortune (2026). Anthropic’s $380 billion valuation vaults it next to OpenAI, SpaceX. February 13, 2026.
CIO.com / Salesforce (2026). Agentforce pivot to deterministic automation. January 2026.
Parulekar, S. (2026). SVP Product Marketing, Salesforce. Quoted in CIO.com coverage.
Gogia, S. V. (2026). Analyst commentary on Agentforce pivot.
MacroTrends / FinanceCharts (2026). Salesforce CRM stock price history. Peak: $365.07, December 4, 2024. 12-month trailing decline: −43.14% as of February 17, 2026.
Marcus, G. (2026). Quoted commentary on agentic AI failures.
Karpathy, A. (2025). Commentary on AI-generated “slop.”




