Garbage In, Garbage Out — But Faster

There's a version of this story where AI coding tools are the great equalizer — where anyone with a good idea can build production software regardless of experience. That story is being told constantly right now, and it's mostly wrong.

The uncomfortable truth is that these tools are amplifiers. They take what you already know and make you faster at producing it. If you know what good code looks like, you get good code faster. If you don't, you get more bad code faster — and it looks deceptively clean.

The Problem With "It Just Works"

Vibe coding isn't inherently bad. For throwaway scripts, quick demos, and things nobody will maintain, it's a perfectly reasonable way to work. The problem isn't the approach — it's applying that approach to domains complex enough that it breaks down, and not knowing when you've crossed that line.

The model writes 400 lines, it compiles, the demo runs, the feature ships. What you can't see without the experience to look for it is that there's no meaningful error handling, the abstractions will collapse under any refactor, half the code is solving problems that don't exist, and the whole thing is held together by implementation details that will become load-bearing walls you didn't know you built.

Without a reference point for what "gone off the rails" looks like, plausible is indistinguishable from correct. You're outsourcing judgment to a model that's optimizing for looking reasonable. That works fine until the domain is complex enough that reasonable and correct diverge — and in production systems, they diverge constantly.

This isn't the model's fault. It's doing exactly what it's designed to do. The failure is in thinking that producing working code and producing good code are the same thing, and that a model can substitute for the experience needed to tell the difference.

Most Problems Are Already Solved

Here's what experience actually gives you: pattern recognition. The ability to look at a problem and see that it's not a novel engineering challenge — it's a producer-consumer problem, or it's event sourcing, or it's a leaky bucket in disguise. Once you see the pattern, the implementation is almost mechanical.
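
To make "almost mechanical" concrete: once you recognize a rate-limiting problem as a leaky bucket, the implementation is a handful of lines. A sketch in Python (the class and parameter names are my own, not from any particular library):

```python
import time

class LeakyBucket:
    """Leaky-bucket rate limiter: load drains at a fixed rate;
    anything that would overflow the bucket's capacity is rejected."""

    def __init__(self, capacity: float, leak_rate: float):
        self.capacity = capacity      # maximum queued "water"
        self.leak_rate = leak_rate    # units drained per second
        self.level = 0.0
        self.last_check = time.monotonic()

    def allow(self, amount: float = 1.0) -> bool:
        now = time.monotonic()
        # Drain the bucket for the time elapsed since the last call.
        self.level = max(0.0, self.level - (now - self.last_check) * self.leak_rate)
        self.last_check = now
        if self.level + amount <= self.capacity:
            self.level += amount
            return True
        return False

# With capacity 5 and a drain rate of 1/sec, a rapid burst of six
# requests admits the first five and rejects the sixth.
bucket = LeakyBucket(capacity=5, leak_rate=1.0)
```

The edge cases the improvised version misses (burst handling, clock-based drain, rejection instead of silent queueing) fall out of the pattern for free.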

Models are decent at the mechanical part. They're bad at the recognition part. And when they don't recognize a known pattern, they improvise. That improvisation is where you get the hacks — the "oh let's just do this" solutions that work for the exact test case in front of the model but have no conceptual integrity and will fail in exactly the edge cases that the established pattern is designed to handle.

Good software is mostly the application of well-understood patterns to specific contexts, implemented with enough flexibility to bend without breaking when requirements change. That last part — flexibility without fragility — is something models consistently get wrong unless you're actively steering them away from clever solutions toward boring correct ones. The instinct to ask before doing is almost entirely absent without the right model and the right prompting discipline.

Why Sycophancy Is the Real Differentiator

The most valuable thing a good engineer can give you is honest technical feedback when you're heading in the wrong direction. Really, it's just an outside perspective on a problem you're too deep in to see. A bad design pushed into production hurts the entire team for months or years. Being told you're thinking about a problem wrong costs you fifteen uncomfortable minutes.

Most AI models have been trained in ways that make them bad at this. They'll push back once, if you're lucky, and the moment you push back on the pushback, they fold. They validate bad architectural decisions because you seemed committed to them. They help you implement the wrong thing efficiently because disagreement didn't get reinforced.

This is the axis on which models differ most meaningfully for anyone doing serious engineering work, and the data here is more stark than most people realize. The BullshitBench benchmark — which measures whether models detect nonsensical premises and push back clearly rather than confidently running with broken assumptions — tells a very different story than coding benchmarks.

Claude models completely dominate. The top seven spots on BullshitBench are all Claude, with Sonnet 4.6 (high reasoning) clearing 91% and Opus 4.6 at 83%. Everything else falls off a cliff. GPT-5.4 sits at 48%. Gemini 3 Pro ranges from 31-48% depending on reasoning mode. Kimi K2.5 hits 52%. And MiniMax M2.5 — which scores 80% on SWE-bench Verified — scores 8-9% on BullshitBench. It will confidently engage with broken premises almost every time.

That last one is the important data point. A model can be excellent at implementing code and nearly useless at telling you when you're heading in the wrong direction. Those are different capabilities, and coding benchmarks don't measure the one that matters most for real engineering conversations. If you're using a model to scaffold UI or generate test coverage, MiniMax's SWE-bench score is what's relevant. If you're using it to pressure-test a design decision, you need something that will actually push back — and on that dimension, the gap between Anthropic's models and the rest of the field is not close.

An agreeable model is actively harmful for serious engineering work. That's not a minor ergonomic preference. That's the difference between a tool that makes you better and one that makes you feel better.

What Engineers Actually Use These For

Once you understand the amplifier dynamic, the value ceiling becomes clear. The benefit scales with how much you'd benefit from removing friction from things you already understand — and that turns out to be quite a lot, just not what the marketing emphasizes.

What I actually use these tools for:

  • Technical design conversations — working through a design out loud, getting genuine pushback, having something challenge my reasoning before I commit to a direction
  • Creating working POCs that validate or invalidate the design — I have a theory and no proof, so let's get proof first. That means real working examples of the problem, built so I can stress the architecture or infrastructure.
  • Searching the project — tools like Claude Code, opencode, and Cursor are great at using the CLI to find things far faster than I can, with enough context that I actually understand what's going on
  • Implementation of decided architecture — the mental model is set, now I need the code that executes it
  • UI work — CSS, Tailwind, HTML layout; tedious even when it's not intellectually demanding, and models are genuinely good at it
  • Scaffolding flow and structure — roughing out how subsections of a system should connect before filling in the implementation
  • Writing tests — once the behavior is defined, generating test coverage is exactly the kind of mechanical work models excel at
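
That last item is worth making concrete. Once behavior is pinned down, test coverage reduces to a table of cases, and extending that table is exactly the mechanical work I hand off. A hedged sketch — the slugify function and its cases are hypothetical, purely illustrative:

```python
import re

def slugify(title: str) -> str:
    """Lowercase, strip punctuation, join words with hyphens."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words)

# The behavior is defined; generating and extending this table
# is the mechanical part a model does well.
CASES = [
    ("Hello World", "hello-world"),
    ("  Leading and trailing  ", "leading-and-trailing"),
    ("Punctuation, stripped!", "punctuation-stripped"),
    ("", ""),
]

def test_slugify():
    for raw, expected in CASES:
        assert slugify(raw) == expected
```

The judgment was in defining what slugify should do with empty strings and punctuation; the model just fills out the grid.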

What I don't use them for: making architectural decisions I'm not positioned to evaluate, or single-shot bug fixes on complex problems where understanding the bug matters more than just fixing it.

The distinction isn't "AI for simple tasks, humans for hard tasks." It's "AI for tasks where the judgment has already been applied, humans for the judgment itself." That framing holds even in complex domains — it just means the human contribution stays meaningful longer.

The Models That Actually Get It Right

The SWE-bench Verified leaderboard is an imperfect but useful signal for how models perform on real engineering tasks. What's notable isn't just the scores — it's how the quality of model behavior in actual use correlates with these numbers.

MiniMax M2.5 scores 80.2% on SWE-bench Verified, 0.6 points behind Claude Opus 4.6, at roughly 1/20th the per-token cost. On Multi-SWE-Bench — complex multi-file changes — it edges ahead at 51.3% vs 50.3%. For implementation work, mechanical refactors, and UI scaffolding, it's a legitimate option at a price point that makes it easy to use liberally.

But run it through BullshitBench and it scores 8-9% on clear pushback. It will confidently engage broken premises. So the right mental model is: MiniMax for execution tasks where you already know what you want, Claude when you need honest technical feedback on whether what you want is the right thing, especially when you're working with longer context windows.

Devstral Small 2 from Mistral is worth watching for local deployment — 24B parameters, 68% SWE-bench Verified, runs on a MacBook with 32GB RAM. That's capable implementation assistance running on your own hardware for zero per-token cost.

The Pricing Reality

The economics of AI tooling right now are built on a subsidy. Every major provider is running these models at a loss to drive adoption. The assumption is that either efficiency gains will close the gap to profitability, or lock-in will be strong enough to raise prices later without churn.

The lock-in bet is weaker than it looks. Switching from one AI coding tool to another is a config file change. And raising prices isn't straightforward either, because the users who would leave are exactly the ones who generate real value from the tools. Most users of these models don't get meaningful value in any measurable sense. Vibe coders want "good enough to work" and will accept whatever is cheapest or most convenient that gets them there. The users who actually care about model quality — good engineers using these as a genuine force multiplier — are also the ones who will route around price increases, whether that's switching models, self-hosting, or simply becoming less reliant on the tool for the tasks they were using it for.
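
To illustrate how thin that switching cost is, here's roughly what the provider-specific surface of an agent setup looks like. Every key and value in this config is hypothetical — no real tool's schema, just the shape of one:

```toml
# Hypothetical agent config. Moving providers means editing
# these four lines, not migrating a codebase.
[model]
provider    = "anthropic"        # or "minimax", "mistral", "local"
name        = "claude-opus-4.6"
base_url    = "https://api.anthropic.com/v1"
api_key_env = "ANTHROPIC_API_KEY"
```

Everything that actually matters — the codebase, the prompts, the workflow — travels with you.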

The local model trajectory adds background pressure. Google's TurboQuant compresses KV cache by 6x with negligible quality loss — a direct unlock for running larger models locally with real context windows. The hardware requirements for capable models keep dropping. That trajectory doesn't stop.
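
The arithmetic behind that unlock is worth spelling out. A back-of-envelope KV-cache calculation, using the standard fp16 cache formula with illustrative dimensions (the layer and head counts here are hypothetical, chosen only to show the scale):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Two tensors (K and V) per layer, fp16 = 2 bytes per value.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical mid-size model: 40 layers, 8 KV heads, head_dim 128,
# at a 128k-token context.
full = kv_cache_bytes(layers=40, kv_heads=8, head_dim=128, seq_len=128_000)
print(f"fp16 KV cache: {full / 2**30:.1f} GiB")          # 19.5 GiB
print(f"at 6x compression: {full / 6 / 2**30:.1f} GiB")  # 3.3 GiB
```

A cache that needed a dedicated GPU now fits comfortably alongside the weights on a laptop — that's the kind of shift that makes long-context local inference practical.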

Prices aren't going to collapse overnight, but the ceiling on what a good engineer would rationally pay keeps getting lower as the alternatives get better, and the providers know it. The prices stay where they are not because they're sustainable, but because raising them would accelerate the very migration they're trying to prevent.

The Right Mental Model

These tools are most valuable when you already know what you want and you're using them to get there faster. They're most dangerous when you don't know what you want and you're using them to figure it out.

The workflow that works: form a clear opinion about the design first, have the model help implement it, review the output against your mental model of what good looks like. The model is doing the typing. You're doing the engineering. You're also doing the review — which requires knowing what you're reviewing for.

Vibe coding fails in complex domains not because the models are bad at writing code — they're increasingly good at it — but because the feedback loop breaks down. Without the ability to evaluate output quality, you lose the ability to correct course. Small errors compound. The model's improvisations become your architecture. Six months later, nobody remembers why the code is structured the way it is, and changing it is terrifying because nobody knows what it's actually doing.

The tools keep getting better. The judgment required to use them well doesn't come with the subscription.