Designing with AI

Vibe Coding ... With Engineering Discipline: Building Real Apps with AI Agents

#54 | Five practices to make your AI-assisted coding more production-ready

Victor Dibia, PhD
Dec 19, 2025

AI coding agents have gotten remarkably capable. METR’s research [1] shows that the length of tasks AI agents can complete autonomously has doubled every 7 months over the past 6 years. OpenAI’s recently announced GPT-5-Codex [2] can run autonomously for over 7 hours, handling complex refactoring and code reviews without human intervention.

How To Build Agents from Scratch
Interested in how coding agents like Claude Code work, or how to build them from scratch for your own tasks? I wrote a book - Designing Multi-Agent Systems - that covers building agents from scratch (Chapter 4), evaluation frameworks (Chapter 10), optimization strategies for the 10 common failure modes (Chapter 11), and implementing software engineering agents like Claude Code (Chapter 15).

These advances have enabled new kinds of experimentation. “Vibe coding” - the practice of describing what you want and letting the AI generate code based on intuition and feel - works great for prototypes and quick experiments.

And yet, many teams - especially those with early-career engineers - still hit roadblocks as soon as they try to build complex real-world applications that matter. A game app that’s deployed and distributed with auto-scaling, auth, multi-user support, and streaming. A pipeline for coherent music generation. A document processing system for enterprise-grade legal review. Today, end-to-end results for tasks like these rarely come from handing the entire thing to your coding agent in one shot.


Why does this happen? Agents, even capable ones, make common mistakes. They gloss over security vulnerabilities. They may select suboptimal architectural designs. They may fail to consider system and deployment constraints - e.g., your deployment system only works with Node.js, compliance requires a specific encryption standard, or your CI pipeline has particular requirements. And as tasks get longer and more complex, agents may lose track of earlier instructions as the context fills up.

This often leads to confusion and disappointment for teams who see the success stories but frequently run into these issues.

TLDR; The good news is that with the right management and engineering discipline, you can get the right results (with real productivity benefits) from AI agents.

This post highlights five practices that get you closer to this goal. This is how I use coding agents like Claude Code today, despite their quirks.

P.S. I wrote about how to build software with AI models [3] a while ago. This post is an extension that acknowledges the autonomy that agents like Claude Code and OpenAI Codex bring - sophisticated agentic harnesses that allow them to tackle medium- to large-scale tasks.




1. Share Context and Constraints

If you hired a new engineer - irrespective of their expertise - you would still need to onboard them. You’d tell them about your build system, tacit and tribal knowledge, tech stack, deployment stack, and more. For agents, you have to do the same thing; they will not magically gain this knowledge. Onboarding is your opportunity to give the agent context - what exists: your codebase structure, tech stack, design patterns, how things work today - as well as important constraints - what MUST or CANNOT happen: hard rules, compliance requirements, deployment limits, non-negotiables.

Tools like CLAUDE.md [4], and Spec-Driven Development [5] are designed to support this.

CLAUDE.md is a special markdown file that Claude Code automatically loads into every conversation - essentially an onboarding document for your agent. It’s where you codify tacit knowledge: build commands, environment setup, repo conventions, and hard-won lessons about your codebase. Keep it under 300 lines and use hierarchical files in subdirectories so the agent sees only the most relevant context.

GitHub’s Spec Kit [5] takes a more structured approach with four gated phases: Specify → Plan → Tasks → Implement. Instead of coding first and writing docs later, you start with a specification that captures intent, then translate it into technical decisions, break it into implementable pieces, and only then let the agent code. This solves the “scattered requirements” problem - security policies, compliance rules, and design constraints get baked into the spec where the AI can actually use them, rather than living in someone’s head or buried in a wiki. The philosophy: treat coding agents like literal-minded pair programmers who excel at pattern recognition but need unambiguous instructions.

Building a Practice of Iterative Context Enrichment

As a team, consider having an architect (or senior engineer) first initialize your CLAUDE.md (the agent scans the codebase to draft it), then add useful context and update it iteratively as they work with the agent. For example, you ask the agent to build the frontend and it reaches for npm run build, but your repo is set up to use bun - you correct it and ask it to record the bun instruction in CLAUDE.md, as sketched below.
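For illustration, the resulting CLAUDE.md entry might look something like this (the commands and constraints are assumptions about a hypothetical repo, not prescriptions):

```markdown
# CLAUDE.md (excerpt)

## Build & test
- This repo uses bun, not npm: `bun install`, `bun run build`, `bun test`.
- Never edit files under `dist/`; they are generated.

## Constraints
- New endpoints must go through the existing auth middleware.
- Do not add new runtime dependencies without flagging them in the PR description.
```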

Context and constraints live across the lifetime of a project (or even across projects). Teams should build the muscle to manage them as practices - where design docs live, how they are organized, where specs go.

CLAUDE.md files [4] can be hierarchical - you can have a project-level file and nested ones in subdirectories. Claude prioritizes the most specific, most nested file when relevant.

For more structured approaches, GitHub’s Spec Kit [5] provides a toolkit of slash commands - /specify to generate specifications, /plan to produce technical plans, and /tasks to derive actionable task lists - ahead of implementation. It’s agent-agnostic and works with Claude Code, GitHub Copilot, Gemini, and others.


2. Co-design with the Agent

I often explore a problem in detail in a design session with the agent first. I ask a bunch of questions. I ask the agent to start a running doc. I iterate and improve that design doc.

For example: “I am interested in this feature - adding query filtering in the UI. Review the current implementation and propose a design that honors the design system, has minimal code changes, and reuses components. Change nothing in the codebase yet.”

Sometimes it’s useful to signal to the agent directly “change nothing” to indicate brainstorm mode (some agents have specific toggles for this).

Some plan comes back. I offer corrections. I often have a running directory of such design interactions that I refine very carefully and make into specs. These design docs are referenced as specs for features, and the agent can benchmark its work against those.
And then when things are clear and well defined, I tell the agent - “hit it!”



3. Test-Driven Development

As agents take on more tasks, the chance increases that changes land in your codebase that you know nothing about. From my experience, this can be a really bad thing. It’s generally poor form to show up at a meeting where there’s a feature that works, but nobody can explain how it works.

In my previous post [3], I advocated for knowledge parity - read everything. But I get it, the temptation to skip this is high. It also kind of defeats the productivity promise if you are sitting there watching the agent go on a 7-hour bender. Also, the bosses want to see more work done in a fraction of the time.

At minimum, if you can’t know everything, you should know - and define - exactly how things should work, as captured in tests.

Try to author and fully understand your tests. Add instructions to CLAUDE.md requiring permission for test changes (so the agent doesn’t delete a test to declare the task done - which it can do).
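As a concrete sketch of “tests as the definition of done”, here is what a human-authored spec might look like for the query-filtering feature mentioned earlier. The function name and behavior are illustrative assumptions; a stand-in implementation is included so the example runs with pytest.

```python
# query_filter_spec.py - human-authored "spec as tests" for a hypothetical feature.
# In practice the agent edits the implementation; the tests stay under human control.
from typing import List

import pytest


def filter_queries(queries: List[str], needle: str) -> List[str]:
    """Stand-in implementation: case-insensitive substring match; empty needle keeps all."""
    if not isinstance(needle, str):
        raise TypeError("needle must be a string")
    return [q for q in queries if needle.lower() in q.lower()]


def test_filter_matches_substring_case_insensitively():
    queries = ["Revenue by region", "Active users", "Churn rate"]
    assert filter_queries(queries, "revenue") == ["Revenue by region"]


def test_empty_filter_returns_everything():
    assert filter_queries(["a", "b"], "") == ["a", "b"]


def test_filter_rejects_non_string_input():
    with pytest.raises(TypeError):
        filter_queries(["a"], None)
```

The point is less the specific assertions and more that a human wrote them, understands them, and gates any change to them.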


4. Checkpointing

I wrote about checkpointing in my previous article [3], mostly focused on writing continuation summaries of what went well in long sessions. This still applies to agents - research like “Lost in the Middle” [6] shows that LLMs struggle with information in the middle of long contexts; performance forms a U-shape, with the best results at the beginning and end.

Checkpointing takes on even more importance in agent land. An agent working for an hour can change hundreds of files, which is painful to debug when 1 out of 100 changes has issues.

I have learned to carefully isolate any working sessions with an AI agent via frequent git commits. Once a feature is done, tested, and in good condition, it’s best to check it in and version it.

Agents are good at version-based rollbacks using git. They struggle with rollbacks within a session - manually reverting individual changes across hundreds of files. Frequent commits are also great for your sanity (and security): roll back on error. Good old engineering practice.
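To make the habit concrete, here is a small sketch of a checkpoint helper you might run (or have the agent run) after each feature: it only commits when the test suite is green. The script name and the pytest command are assumptions; substitute your repo’s own test command.

```python
#!/usr/bin/env python3
"""checkpoint.py - commit the working tree only if the test suite passes (illustrative sketch)."""
import subprocess
import sys


def run(cmd):
    # Echo the command, then run it and return its exit code.
    print("$ " + " ".join(cmd))
    return subprocess.call(cmd)


def main():
    message = sys.argv[1] if len(sys.argv) > 1 else "checkpoint: agent session"
    # Gate the checkpoint on a green test run (swap in `bun test`, `make test`, etc.).
    if run(["pytest", "-q"]) != 0:
        print("Tests failing - not committing. Fix forward or `git restore .` to roll back.")
        return 1
    run(["git", "add", "-A"])
    return run(["git", "commit", "-m", message])


if __name__ == "__main__":
    sys.exit(main())
```

Run it as `python checkpoint.py "feat: query filtering"` once the agent reports a feature done; if something breaks later, `git log` plus `git revert` gets you back to the last green checkpoint.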


5. Code Review is for Humans

We should not need to rehash the obvious here, but code review is key.

You can have AI agent reviews - GitHub Copilot does leave decent suggestions - but those should be treated as FYI. We should still keep the engineering practice of at least one or two human reviewers gating merges.

Importantly, research on code review shows it serves dual purposes [7][8]: defect detection - catching bugs, security issues, and logic errors - and knowledge dissemination - ensuring multiple engineers understand the code.
