Gemini CLI: You’re Using AI as a Runtime. Stop.

The “soft execution” trap is seducing developers into building systems that look powerful and collapse quietly.


I help developers succeed in Artificial Intelligence and Web3; Former AWS Amplify Developer Advocate. I am very excited about the future of the Web and JavaScript. Always happy Computer Science Engineer and humble Google Developer Expert. I love sharing my knowledge by speaking, training and writing about cool technologies. I love running communities and meetups such as Web3 London, GraphQL London, GraphQL San Francisco, mentoring students and giving back to the community.

There’s a moment every developer has — staring at a 400-line prompt and thinking: this is basically a program. The AI understands the steps. It reads state. It calls tools. It even branches conditionally. Why write actual code when the model just… does it?

This is the “soft execution” trap, and it’s one of the most subtle ways to build something that feels robust until it catastrophically isn’t.

The Illusion Is Real (That’s the Problem)

“Prompt-as-runtime” — sometimes called soft execution — is the practice of encoding your entire workflow logic into natural language, then trusting the AI to interpret and execute it faithfully across multiple steps, state changes, and tool calls.

The appeal is obvious. You describe the workflow, iterate on the description, and the AI handles the rest. No orchestration code. No state machine. Just vibes and prompts.

“Just describe the workflow, and the AI handles the rest.” This is the sentence that will haunt your production logs.

The problem isn’t that AI models are dumb. It’s that they’re the wrong kind of smart for this job.

Non-Determinism Is a Feature — Until You Need Precision

Traditional software runs on deterministic runtimes. A check like if x > 5 takes the same branch every time for the same x. os.path.exists("file.txt") returns True or False based on the actual filesystem, not on probability.

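Large Language Models are inherently probabilistic in how they are used: the model produces a distribution over next tokens, and decoding typically samples from it. The same input can produce meaningfully different outputs. That’s a superpower for creative tasks. It’s a liability when you’re managing state.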
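A minimal sketch of that contrast, assuming a toy next-step distribution. The step names and probabilities in NEXT_STEP_PROBS are invented for illustration and do not come from any real model.

```python
import random


def deterministic_branch(x: int) -> str:
    # A conventional runtime: same input, same branch, every time.
    return "big" if x > 5 else "small"


def sample_next_token(probs: dict[str, float], rng: random.Random) -> str:
    # Toy stand-in for LLM decoding: draw one token from a probability table.
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical distribution over "what to do next" in a workflow.
NEXT_STEP_PROBS = {"check_file": 0.7, "skip_check": 0.2, "retry": 0.1}

if __name__ == "__main__":
    rng = random.Random()  # unseeded on purpose, so runs differ
    # The deterministic branch collapses to a single value.
    print({deterministic_branch(7) for _ in range(1000)})
    # The sampled step almost certainly does not.
    print({sample_next_token(NEXT_STEP_PROBS, rng) for _ in range(1000)})
```

Even when the most likely step is the right one, a workflow that takes the 20% or 10% path once in a while is a workflow with a bug you cannot reproduce on demand.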

When you ask an LLM to be your runtime, you’re asking it to do two incompatible things at once: reason flexibly and execute precisely. It will do both imperfectly.

Two Failure Modes That Will Eat Your Workflow

1. Context Drift — The Fading Blueprint

Imagine handing someone a 12-page instruction manual, then interrupting them every five minutes with new data. After two hours, ask them to reference page 3. Watch what happens.

Context drift is what happens when your workflow prompt — the entire behavioral spec for your AI runtime — gets diluted by accumulating tool outputs, intermediate thoughts, and conversational noise. The original instructions become a smaller and smaller fraction of the model’s attention.

The AI doesn’t forget. It deprioritizes. Steps get reordered. Constraints get softened. Critical conditions get glossed over.

Think of it as an un-refactored 2,000-line function where global state is mutated from twelve different places. You can’t reason about it. Neither can the model.

2. Entropy Saturation — The Signal Drowns

Every interaction adds noise. Tool outputs. Partial results. Retry logs. Conversational filler. In a long-running workflow, the signal-to-noise ratio collapses.

The model starts hallucinating states that don’t exist, because reconstructing reality from a saturated context window is genuinely hard — for the same reason debugging is hard when every variable change is printed to stdout.

The model isn’t failing. It’s doing its best with an increasingly incoherent input. You built the incoherence in.

The Conductor Case Study: Ambitious, Instructive, Fragile

The Gemini CLI Conductor extension is a useful illustration — not because it’s badly designed, but because it’s well-intentioned in exactly the wrong direction.

Conductor uses TOML files as natural language pseudo-code for the AI to execute. It manages state via setup_state.json, tracks last_successful_step, branches between Greenfield and Brownfield project setups, and orchestrates file system checks and user prompts. The architecture is thoughtful. The intent is real.

But the AI is doing the job of a state machine.

Every file existence check — “verify that product.md exists” — is less reliable than os.path.exists("product.md"). Not slightly less reliable. Categorically less reliable. The AI might hallucinate a missing file. It might misparse the directory structure. It might simply fail to format the underlying tool call correctly on step 7 of 14.

And every piece of that logic — every state variable, every conditional branch, every intermediate result — must live in the context window, consuming tokens, adding entropy, and increasing the blast radius of any drift.
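For comparison, here is a rough sketch of the same bookkeeping done in code. The file name setup_state.json and the last_successful_step key come from the description above; everything else (the function names, the default state, the temp-file write) is an assumption for illustration and is not Conductor’s actual implementation.

```python
import json
import os

STATE_FILE = "setup_state.json"  # file name as described in the post


def load_state(path: str = STATE_FILE) -> dict:
    # A missing file means a fresh run; otherwise the state is read from disk as-is.
    if not os.path.exists(path):
        return {"last_successful_step": None}
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def save_state(state: dict, path: str = STATE_FILE) -> None:
    # Write to a temp file, then rename, so a crash cannot leave a half-written state.
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def mark_step_done(step: str, path: str = STATE_FILE) -> dict:
    # Record progress on disk instead of in the model's context window.
    state = load_state(path)
    state["last_successful_step"] = step
    save_state(state, path)
    return state


def product_doc_exists(project_dir: str) -> bool:
    # The code-level equivalent of "verify that product.md exists".
    return os.path.exists(os.path.join(project_dir, "product.md"))
```

None of this consumes tokens, and none of it can be talked out of the right answer by a noisy context.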

The irony: the more structured your prompt-as-runtime, the more it costs when it fails. You’ve built a fragile thing that looks robust.

This Isn’t Theoretical: Salesforce Just Lived It

If the Conductor extension feels too niche, consider what happened at the other end of the resource spectrum.

In October 2025, Salesforce launched Agentforce 360 at Dreamforce. Marc Benioff on stage, autonomous AI agents resolving customer issues end-to-end, the whole cinematic pitch. The “fastest-growing product in Salesforce history.” A $200B company, 25 years of CRM domain expertise, controlled problem space, direct customer data access, world-class implementation teams, and enterprise customers motivated enough to pay a premium.

By January 2026, CIO.com ran the post-mortem.

The failure mode was precise and familiar:

“Agent behavior varied from session to session, with identical customer scenarios triggering different execution paths based on how the model interpreted intent in the moment.”

That’s context drift. That’s entropy saturation. That’s soft execution at enterprise scale — except now it’s going directly to customers, and the failure mode, as one implementation partner put it, is “confidently wrong,” which creates reputational and legal exposure, not just a bad developer experience.

Salesforce engineers had a name for what they were doing to compensate: a “doom-prompting cycle” — continuously rewriting prompts trying to fix behavior that couldn’t be fixed at the prompt level. Sound familiar? It’s the natural end state of treating a prompt as a runtime. You keep refining the spec for a machine that doesn’t reliably execute specs.

Their own engineering blog put it plainly: “LLM reasoning alone cannot carry enterprise load… The future isn’t endless prompt refinement. It’s structured, auditable workflows.”

The solution they shipped was called Agent Script: mandatory deterministic workflows that enforce strict business policies and provide auditable, step-by-step control. Their current product page now leads with:

“Eliminate the inherent randomness of Large Language Models, guaranteeing that your critical business workflows follow the exact same steps every single time.”

That’s not an agent. That’s a workflow engine with really good text generation bolted on. Which is exactly the hybrid architecture this post is arguing for — they just arrived there the expensive way.

The epilogue: stock down 43% over twelve months. The head of Agentforce departed. Nearly 1,000 layoffs, including, notably, members of the Agentforce team itself.

The lesson isn’t that Salesforce failed. The lesson is that with every conceivable advantage — resources, domain knowledge, controlled scope, motivated customers — they still couldn’t make “autonomous agent as runtime” reliable in production. Not because they didn’t try hard enough. Because stochastic text generation and deterministic workflow execution are architecturally incompatible when precision is required. That’s not a problem you prompt your way out of.

Why Prompts Can’t Save You: The Architecture Beneath the Failure

Here’s what the doom-prompting cycle never confronts: the failure isn’t in the prompt. It’s in the physics.

Understanding why requires a brief descent into how transformers actually work — not the marketing version, but the architectural reality. Because once you see it, you can’t unsee why soft execution is doomed from the first token.

The Stochastic Funnel

Every transformer forward pass begins with your prompt becoming an initial embedding — call it h₀. This is the anchor. Every subsequent layer reads from a residual stream, adds its small contribution, and writes back:

h₁ = h₀ + attention_layer_1 + MLP_layer_1
h₂ = h₁ + attention_layer_2 + MLP_layer_2
...
h_N = h_{N-1} + attention_layer_N + MLP_layer_N
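The update rule above can be sketched with toy sublayers. This is an illustration under loud assumptions: toy_sublayer is a random placeholder rather than real attention or a real MLP, and the MLP here reads the stream after the attention update, which is how most implementations order the two steps.

```python
import random


def toy_sublayer(h: list[float], rng: random.Random, scale: float = 0.1) -> list[float]:
    # Placeholder for an attention or MLP block: a small update that depends on h.
    return [scale * (rng.uniform(-1.0, 1.0) + 0.1 * x) for x in h]


def forward(h0: list[float], n_layers: int, seed: int = 0):
    # Returns the final stream h_N and the per-layer updates that were added to it.
    rng = random.Random(seed)
    h = list(h0)
    updates = []
    for _ in range(n_layers):
        attn = toy_sublayer(h, rng)
        h_mid = [x + a for x, a in zip(h, attn)]      # h + attention(h)
        mlp = toy_sublayer(h_mid, rng)
        h = [x + m for x, m in zip(h_mid, mlp)]       # h_mid + MLP(h_mid)
        updates.append([a + m for a, m in zip(attn, mlp)])
    return h, updates


if __name__ == "__main__":
    h0 = [1.0, -2.0, 0.5]
    h_final, updates = forward(h0, n_layers=6)
    # Nothing is overwritten: h_N equals h0 plus the sum of every layer's update.
    rebuilt = [x + sum(u[i] for u in updates) for i, x in enumerate(h0)]
    print(h_final)
    print(rebuilt)
```

The point of the sketch is the additive structure: no layer replaces the stream, each one only adds to what is already there.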

The residual stream accumulates momentum. Early layers have wide latitude to adjust trajectory. Later layers are increasingly bound by the accumulated signal of everything before them. By the final layers, you’re not steering; you’re riding.

This is the stochastic funnel: the possibility space narrows with every layer, every token, every tool call response added to context. The model doesn’t “drift” because it got confused. It drifts because redirecting accumulated residual stream momentum is architecturally expensive. Your corrective prompt at turn 12 is fighting the inertial weight of turns 1 through 11.

This is why experienced users discover that starting a fresh conversation with a refined prompt beats trying to course-correct an existing one. You’re not resetting the conversation. You’re resetting h₀ — the only moment in the forward pass where the highway is truly open.

Salesforce’s doom-prompting cycle was engineers discovering the stochastic funnel empirically, without a name for it, in production, at enterprise scale. They kept rewriting the prompt trying to fight momentum that could only be reset, never redirected.

Attention’s Zero-Sum Budget

Compound this with how attention itself works. For each query position, each attention head distributes exactly one unit of attention mass across the tokens it can attend to, because softmax weights sum to one. It’s not a spotlight that can brighten; it’s a fixed pool of water that redistributes when you pour more in.

As your workflow context grows (tool outputs, state logs, retry results, intermediate reasoning), those tokens consume attention budget. Task-critical tokens from your original prompt get diluted. Once a token’s attention weight becomes very small (roughly 0.01 is used here as a rule-of-thumb cutoff, not a constant defined by the architecture), its contribution is effectively negligible. The architecture stops seeing it.

This is entropy saturation, mechanistically explained. It’s not that the model gets overwhelmed in some vague cognitive sense. It’s that softmax is zero-sum and your workflow is continuously pouring new tokens into a fixed-capacity pool. The original spec drowns. Literally.
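The dilution can be shown with a few lines of arithmetic. This is a sketch under simplifying assumptions: a single head, a single query, and made-up scores (2.0 for every original spec token, 0.0 for every token added later). Real scores are learned and vary per token, so the exact numbers mean nothing; the direction of the effect is what matters.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    # Numerically stable softmax: the outputs are positive and sum to one.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def spec_attention_share(
    n_spec: int,
    n_noise: int,
    spec_score: float = 2.0,   # assumed score for original prompt tokens
    noise_score: float = 0.0,  # assumed score for accumulated context tokens
) -> float:
    # Fraction of one head's attention that still lands on the original spec tokens.
    weights = softmax([spec_score] * n_spec + [noise_score] * n_noise)
    return sum(weights[:n_spec])


if __name__ == "__main__":
    # 50 spec tokens, with a growing pile of tool output around them.
    for n_noise in (0, 100, 1_000, 10_000):
        print(n_noise, round(spec_attention_share(50, n_noise), 4))
```

Even with every spec token scored well above the noise, the share going to the spec shrinks as context accumulates, because the total is pinned at one.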

The Five Structural Absences No Prompt Can Address

But even beyond the funnel and the attention budget, there are things transformers simply don’t have — not limitations to be improved, but structural absences that no prompting technique can conjure into existence:

1. No persistent state between forward passes. The residual stream exists only during a single forward pass. When it’s done, it’s gone. There is no persistent knowledge structure. Every tool call, every new turn, every “continue from step 7” starts with frozen weights and whatever you put in the context window. The model isn’t resuming — it’s reconstructing from text. These are not the same thing.

2. No grounding mechanisms. Attention routes information between tokens. Nothing anchors those tokens to external reality. When the model “checks” whether product.md exists, it’s not querying the filesystem — it’s generating a statistically plausible description of what checking a filesystem looks like, then calling a tool, then interpreting the output as text, then generating the next statistically plausible token. Each of these is a lossy, probabilistic step. Chain enough of them and you have a game of telephone with your workflow state.

3. No causal reasoning engine. Transformers learn correlations from co-occurrence patterns. They don’t distinguish “X appears with Y” from “X causes Y.” This is why models can hallucinate states confidently — from the architecture’s perspective, a hallucinated state that fits the textual pattern is indistinguishable from a real one. There’s no causal graph being maintained. There’s no model of the world being updated. There’s next-token prediction, running on a highway built during pretraining, navigating by pattern recognition.

4. No hypothesis testing. A transformer generates a single output through a one-way forward pass. It cannot propose multiple execution paths, test them, discard failures, and consolidate the successful one. When it makes a wrong turn in your workflow, it doesn’t know it made a wrong turn. It generates the next token as if the wrong turn were the correct one, because confidence is baked into the sampling process, not derived from ground truth.

5. Local coherence, global contradiction. The model is extraordinarily good at making the next token fit the previous tokens. It is not good at maintaining consistent state across hundreds of tool calls, because that requires comparing the current context to a global invariant — and there is no global invariant. There’s only the accumulated residual stream, and it’s already committed to a direction.
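The global invariant the model lacks is cheap to hold in code. A minimal sketch, assuming a hypothetical set of workflow steps loosely modelled on the Greenfield/Brownfield branching described earlier (the step names and the ALLOWED table are invented for illustration): the model may propose the next step, but code decides whether that step is legal.

```python
# Hypothetical workflow graph: each step maps to the steps allowed to follow it.
ALLOWED = {
    "init": {"detect_project_type"},
    "detect_project_type": {"greenfield_setup", "brownfield_setup"},
    "greenfield_setup": {"write_product_doc"},
    "brownfield_setup": {"write_product_doc"},
    "write_product_doc": {"done"},
    "done": set(),
}


class IllegalTransition(Exception):
    """Raised when a proposed step violates the workflow graph."""


def advance(current: str, proposed: str) -> str:
    # The invariant lives here, outside the context window, and never drifts.
    if current not in ALLOWED:
        raise KeyError(f"unknown step: {current!r}")
    if proposed not in ALLOWED[current]:
        raise IllegalTransition(f"{current!r} -> {proposed!r} is not allowed")
    return proposed


if __name__ == "__main__":
    step = advance("init", "detect_project_type")
    print(step)
    try:
        advance(step, "done")  # skipping setup is rejected, however confident the proposal
    except IllegalTransition as err:
        print("rejected:", err)
```

A wrong turn still happens, but it is caught at the boundary instead of being silently absorbed into the next token.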

The Medieval Alchemy Problem

The field has developed an entire vocabulary for fighting these limitations from the outside: chain-of-thought, tree-of-thought, reflexion, self-consistency sampling, constitutional prompting, agent scaffolding. Each technique is genuinely clever. Each one is also, architecturally speaking, a workaround for a missing component.

We are writing increasingly elaborate spells — longer, more precise, more carefully structured — to summon consistent behavior from a system that is stateless between passes, has no causal model, cannot ground its outputs in reality, and narrows its own possibility space with every token it generates.

The prompt is not the program. The prompt is a request to a probabilistic highway system that will navigate as best it can toward something that resembles your destination — using roads built during pretraining, carrying the momentum of every prior turn, running out of attention budget as it goes.

You can’t prompt your way to new roads. You can’t prompt your way to persistent memory. You can’t prompt your way out of zero-sum attention. The “doom-prompting cycle” isn’t a sign of insufficient creativity. It’s what hitting a structural wall looks like from the outside.

The fix isn’t a better prompt. The fix is knowing which job to give the AI and which job to keep in code.

The Fix: Stop Fighting the Architecture

The lesson isn’t “don’t use AI in workflows.” The lesson is don’t use AI as the workflow engine.

A hybrid architecture gives you the best of both worlds:

Deterministic layer (Python, Go, Node.js — your choice):

  • State management: load/save reliably, no hallucinations

  • File system operations: exists() is not a prompt

  • Conditional logic: if/else doesn’t drift

  • Tool orchestration: call the AI with precise inputs, handle outputs explicitly

AI layer (invoked, not running):

  • Interpret complex natural language requests

  • Generate content: product.md, code snippets, documentation

  • High-level reasoning: help plan the workflow, not execute it

The AI becomes a powerful specialist, not a fragile generalist runtime.

The key insight: Move state and control logic out of the AI’s context and into code. The AI is extraordinary at many things. Being a state machine is not one of them.
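Here is a rough sketch of what “invoked, not running” can look like. The generate parameter is a hypothetical callable standing in for whatever model client you use; it is not an API from Gemini CLI or any specific SDK, and the prompt text is a placeholder.

```python
import os
from typing import Callable


def run_setup(project_dir: str, generate: Callable[[str], str]) -> list[str]:
    """Deterministic control flow; the model is called only to produce text."""
    log: list[str] = []
    product_path = os.path.join(project_dir, "product.md")

    # Real filesystem check: no prompt involved, no tokens spent.
    if os.path.exists(product_path):
        log.append("product.md present, skipping generation")
        return log

    # The AI layer does the one thing it is good at here: writing content.
    draft = generate("Write a one-page product description for this project.")

    # Model output is validated in code before it touches disk.
    if not draft.strip():
        raise ValueError("model returned empty content")

    with open(product_path, "w", encoding="utf-8") as f:
        f.write(draft)
    log.append("product.md written")
    return log


if __name__ == "__main__":
    import tempfile

    def fake_model(prompt: str) -> str:
        # Stub so the sketch runs without any model access.
        return "# Product\nA placeholder description."

    with tempfile.TemporaryDirectory() as d:
        print(run_setup(d, fake_model))
        print(run_setup(d, fake_model))
```

The model never sees the workflow state and never decides what happens next. It receives a precise request and returns text, and the code owns everything else.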

Conclusion: Vibe Coding Has an Architecture Problem

“Vibe coding” — letting AI handle more and more of the development process — is genuinely exciting. Agentic workflows are genuinely powerful. But there’s a version of this that quietly breaks in production and a version that scales.

The difference is whether you’ve given the AI a job or a job description that includes its own job description.

The transformer is not lazy. It’s not undertrained. It’s not waiting for GPT-6 to become reliable. It is a zero-sum attention system running on a momentum-carrying residual stream, stateless between passes, without grounding, without causal modeling, without persistent memory, funneling possibilities down to a single output per forward pass. It is extraordinarily good at what it does. What it does is not state management.

Use AI for what it’s brilliant at. Use code for what code is brilliant at. The separation isn’t a concession to AI’s limitations — it’s engineering literacy about what kind of machine you’re actually working with.

Your runtime should not have hallucinations. Your state machine should not drift. And no prompt, however elaborate, will change the architecture underneath.

Build accordingly.


Training for Engineers Who’ve Hit the Wall

If your team is building serious agentic systems and you’re starting to suspect the problem isn’t the model — it’s the architecture — I offer specialised training that bridges transformer mathematics with production engineering practice.

Attention mechanics. Residual stream dynamics. Deterministic scaffolding. Where the model ends and your code begins.

Designed for high-performance environments where wasted tokens and unpredictable agents are a real cost.

Gerard Sans — Founder, Axiom · Google Developer Expert in AI