Miscellany

Why Should a Startup Build Belief Before It Builds a Product?

yeTi — Fri, 5 Jun 2026 15:42:44 +0900

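Hello, this is yeTi.
Today I want to lay out some thoughts I have long had about what a startup means.

I am fond of a passage from Yuval Noah Harari's Sapiens.

The ability to create an imagined reality out of words enabled large numbers of strangers to cooperate effectively. - p.60, Sapiens

I like it because I read it as describing both the driving force that lets Homo sapiens build and sustain large-scale societies and a trait shared by the members of those societies.

Recently I have been reading Harari's Nexus with a book club.

That brought back the moment, some time ago, when I defined for myself what a startup is.

A startup is a group of people who create belief.

It is the simplest way I have found to put what I had been thinking.

A startup is not simply an organization that makes products.
A startup is an organization that changes the standards by which people see the world.

I believe a startup that has created belief among people can create a new market of its own.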

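The Essence of Market Change Is a Change in Human Belief

Market change looks like technological change.
New apps appear, new logistics systems emerge, and new payment methods arrive.

But for that change to truly become a market, people's beliefs have to change.

Consider the time before and after Coupang.

Online shopping existed before Coupang.
So did delivery.
You could order something and receive it at home.

But what Coupang changed was not simply delivery speed.
Coupang changed what Koreans believe about delivery.

It means customers have come to take it for granted that what they order today arrives tomorrow. Guaranteed next-day delivery has thus become the starting line for competition. - p.119, 물류트랜드 2024 (Logistics Trends 2024)

Once this belief took hold, next-day delivery stopped being a special perk and became the starting line for competition.

I think this is the essence of market change.

Markets do not change only because products change.
Markets change when the standards people take for granted change.

Waiting a few days used to feel natural.
Now anything that does not arrive tomorrow feels slow.

The belief has changed.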

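Startups Create New Standards

A startup makes features, but more fundamentally it makes a new standard.

“Delivery should be fast.”
“Fresh seafood can be enjoyed at home too.”
“Cutting out middlemen makes fairer prices possible.”
“Personal data can be an asset too.”
“Learning a foreign language can start with conversation, not a workbook.”
“Emotional recovery can start with small practices, not grand treatment.”

These sentences are not mere slogans.
They are seeds of belief that make people see the world differently.

For example, when I look at a service like 파도상자, I find myself seeing the belief before the features.

The belief 파도상자 is building is not simply “we deliver seafood.”
It is closer to this:

“The freshness you could once only taste on a trip can be enjoyed at home.”
“You can buy at a fair price without getting ripped off at an offline market.”

What matters here is not freshness or price in themselves.
It is what the customer comes to believe through the service.

In the end, a startup is an organization that tells its customers:

“The standards you took for granted can change.”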

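Belief Becomes an Economic Sphere

This is also what struck me while reading Katsuaki Sato's Money 2.0.

He argues that the boundaries between economics and politics, and between economics and religion, may blur.

I said earlier that the boundary between economics and politics will disappear; likewise, the boundary between economics and religion will disappear too. - p.258, Money 2.0

At first this may sound somewhat exaggerated.
But on reflection, religion, the economy, and the state all stand on shared belief.

Money does not work because the paper itself has value.
It works because people believe that paper is valuable.

The token economy is the same.
When people gather around a particular service, tokens are issued within it, and participants begin to believe in the value of those tokens, a small economic sphere is born.

This view closely resembles a startup.

A startup does not enter a huge market from day one.
It begins by building a small community.
It gathers people who feel the same problem, believe in the same possibility, and want to take part in the same future.

I think this is the essence of the advice to build a community first, then a fandom, and only then expand the service.

A community is not a marketing channel.
A community is a place where belief grows.

And when belief grows strong enough, it becomes an economic sphere.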

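A Founder Who Tries to Become Somebody Is Finished

There was another line in Money 2.0 that stayed with me.

A founder is finished the moment they try to become somebody - p.239, Money 2.0

This line connects to how startups behave.

The essence of a startup lies in solving real-world problems in ways that did not exist before.
But the moment a founder starts imitating someone else, the startup loses its own belief.

Of course, you can study other companies' cases.
You can hire great people.
You can learn proven operating practices and growth strategies.

But it becomes dangerous once those experiences begin to replace the founder's own belief.

A startup is not an organization that copies the right answer.
It is an organization that proves its own hypothesis in the real world.

The way another company succeeded may work for us too.
But if we cannot explain why we need that approach, it is not strategy; it is imitation.

A startup loses its vitality the moment it follows a belief someone else made.
It has to create a belief of its own.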

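A Good Culture and a Comfortable Culture Are Different

This belief does not apply only to products.
It applies to organizational culture too.

I once read a piece that drew a memorable distinction between a good culture and a comfortable culture.

From an employee's point of view, a good culture is often understood as a comfortable one.
A culture that is autonomous, low-pressure, low-conflict, and respectful of personal life.

These things matter, of course.

But from a startup's point of view, culture needs to be seen a little differently.
A startup's culture is not merely a device for making its members comfortable.
It is the way the team sustains a shared belief and turns that belief into results.

So a good culture is not necessarily a comfortable one.

In a startup, a good culture looks like this:

A culture that keeps us from forgetting why we are solving this problem
A culture that makes us face uncomfortable realities
A culture in which we can revise our belief in light of customer reactions
A culture that lets us endure the tension that growth requires
A culture that can put shared direction ahead of each other's comfort

A comfortable culture protects present feelings.
A good culture moves a shared belief forward.

A startup is an organization that holds on to a belief not yet proven.
That is why culture is less a perk and more the operating system of belief.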

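A Startup Is the Work of Replacing a Shared Illusion

Katsuaki Sato describes “changing the world” as the act of destroying an old shared illusion and overwriting it with a new one.

I felt this line explains the essence of a startup well.

Society already has shared illusions that have hardened.

“It is natural for delivery to take several days.”
“Fresh food has to be seen in person before buying.”
“A company is a place where you work by coming into the office.”
“You cannot do finance without a bank.”
“Education happens in a classroom.”
“AI is a tool that assists with tasks people assign.”

Startups question these givens.

“Does it really have to be that way?”
“Is no other way possible?”
“Is there no new standard people could come to believe in?”

And then they propose a new belief.

When that belief grows strong enough, people begin to feel that the old standard is outdated.
At that moment, the market changes.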

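Monopoly Begins with a New Belief

Peter Thiel says this in Zero to One:

Monopolies drive progress because the promise of years or even decades of monopoly profits provides a powerful incentive to innovate. Then monopolies can keep innovating because profits enable them to make the long-term plans and to finance the ambitious research projects that firms locked in competition can't dream of. - Zero to One by Peter Thiel

At first this may sound somewhat uncomfortable.
We usually hear “monopoly” as a negative word.

But the monopoly Thiel describes is not simply one that suppresses competitors.
It is closer to a state of creating a market others could not see and delivering overwhelming value within it.

The idea of creating a blue ocean is ultimately similar.

Creating a new market does not only mean finding a new customer segment.
It means creating a new belief.

Making people feel they need what they never thought they needed.
Making them believe possible what they thought impossible.
Turning what they thought was special into something ordinary.

This is the starting point of the monopoly a startup creates.

Monopoly does not begin with market share.
It begins with share of belief.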

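So What Is a Startup?

I would now like to define a startup this way:

A startup is a gathering of people who create a new belief together.

“New” is an important word here, but the more important word is “together.”

Belief can be created alone.
A philosopher can create a belief alone, and a writer can build a world alone.

But a startup is not a matter of believing alone.

The founder believes first.
Team members join that belief.
Investors stake resources on the chance that the belief will grow.
Early users see possibility inside a product that is not yet finished.
Customers judge that the belief can solve their problem.

In this way belief leaves one person's head and becomes a relationship.
And as relationships accumulate, they become a market.

The work of a startup is not building features.
Features are a means of making belief concrete.

Marketing is translating belief into language.
Sales is connecting belief to the customer's problem.
Product development is proving belief through real experience.
Organizational culture is keeping belief from wavering.
PMF is the signal that the belief has started to work in the market, not only in the founder's mind.

So a startup is an attempt to create a new nexus.

It starts as one person's belief.
Then a few people take hold of it together.
Then a small number of users join in the possibility.
And at some point, many more people begin to accept it as an obvious reality.

At that point a startup goes beyond a mere product.
It becomes a new system of meaning.
It becomes a new network of belief that people take part in together.

I think this is what a startup is.

A startup is a gathering of people who create a new belief together.
And creating a market means changing the standards people take for granted.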

How Building a Multi-Agent Development Pipeline Led Me to Design an AI Engineering Organization

yeTi — Mon, 1 Jun 2026 16:39:31 +0900

From AI Coding Agents to AI Engineering Organizations

Introduction

When I started the sqlgen-ai project, my goal was straightforward.

I wanted to build an AI agent that could write code.

At the time, I imagined a future where an AI developer could continuously improve itself, create tools it needed, and gradually become more capable over time.

To explore that idea, I connected Hermes Agent, Codex, GitLab, and Discord into a development workflow and started building what I thought would become a self-improving coding agent.

What I learned was very different from what I expected.

The biggest challenge was not getting AI to write code.

The bigger challenge was operating AI reliably within a software development process.

AI Models Are Already Good at Writing Code

The first surprise was that coding itself was rarely the bottleneck.

Modern coding agents such as Codex, Claude Code, and Gemini CLI can already perform a large portion of day-to-day development work.

They can:

Implement features
Fix bugs
Refactor existing code
Write tests
Create Merge Requests

When I started the project, I assumed model capability would be the primary limitation.

Instead, most failures came from workflow and coordination problems.

For example:

An agent implemented a feature against an outdated branch.
A Merge Request was created without local validation.
Acceptance Criteria were partially implemented.
Review feedback introduced regressions.
A dirty workspace caused unrelated files to be committed.

These were not coding failures.

The code itself was often reasonable.

The failures occurred because development work was being executed without the operational controls that exist in human engineering teams.

As the models became stronger, a different bottleneck emerged.

Generating code became easier.

Coordinating development work became harder.

Software Development Is Mostly State Transitions

One realization significantly changed how I approached the system.

Software development organizations are more structured than they initially appear.

Most engineering work follows a sequence of state transitions.

Issue
→ Planning
→ Approval
→ Development
→ Validation
→ Review
→ Merge
→ E2E
→ Release

Once viewed from this perspective, software development starts looking less like individual coding tasks and more like a state machine.
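
To make the state-machine view concrete, here is a minimal sketch that encodes the stages listed above as an ordered enum. The names `Stage`, `next_stage`, and `can_transition` are illustrative, and the strict "one step forward only" rule is a simplifying assumption; a real workflow would also need backward transitions for rework.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Lifecycle stages, in the order listed in the text."""
    ISSUE = 1
    PLANNING = 2
    APPROVAL = 3
    DEVELOPMENT = 4
    VALIDATION = 5
    REVIEW = 6
    MERGE = 7
    E2E = 8
    RELEASE = 9


def next_stage(current: Stage) -> Optional[Stage]:
    """Return the stage that follows `current`, or None after RELEASE."""
    stages = list(Stage)
    index = stages.index(current)
    return stages[index + 1] if index + 1 < len(stages) else None


def can_transition(current: Stage, target: Stage) -> bool:
    """Allow only a single forward step (assumption: no skipping, no going back)."""
    return next_stage(current) is target
```

With this shape, "what may happen next" becomes a lookup, not a judgment call made inside an agent's prompt.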

That realization changed how I used GitLab.

Initially, GitLab was simply a repository.

Over time, it became something much more important.

GitLab became the coordination layer of the entire agent organization.

For example:

issue:approved
dev:ready
dev:running
dev:done

mr:ready-for-review
mr:approved

e2e:ready
e2e:done

These labels are not just metadata.

They represent organizational state.

Agents consume and update those states as part of a larger workflow.

In practice:

Issues became work queues.
Labels became state transitions.
Merge Requests became review checkpoints.

The agents never communicated directly with each other.

They coordinated through GitLab.

Eventually, GitLab evolved from a repository into the operating system of the agent organization.

Specialized Roles Were More Reliable Than One Smart Agent

My original architecture assumed a single powerful agent would eventually handle everything.

Planning.

Implementation.

Validation.

Review.

Deployment.

In practice, reliability improved when responsibilities became narrower.

The current sqlgen-ai pipeline looks much closer to an engineering team than a single autonomous agent.

PM Agent
↓
Developer Agent
↓
Validator Agent
↓
Review Triage Agent
↓
Human

Each role has a specific responsibility.

The PM Agent creates implementation plans.

The Developer Agent focuses on execution.

The Validator Agent independently verifies results.

The Review Triage Agent processes review feedback.

The Human provides goals and final approval.

The separation is important because implementation and validation have fundamentally different objectives.

The Developer Agent tries to make progress.

The Validator Agent tries to find problems.

Combining both responsibilities into a single agent often creates blind spots.

Separating them makes failures easier to detect and easier to recover from.

Validation Was More Important Than Generation

The most important lesson from operating the system was simple.

Never trust implementation output without verification.

Large language models can generate convincing solutions that are still incorrect.

Because of that, "Implementation Complete" is not evidence.

Validation results are evidence.

In sqlgen-ai, implementation does not immediately lead to a Merge Request.

Every change passes through an independent validation layer.

A simplified validation pipeline looks like this:

Developer Agent
↓
ruff
↓
mypy
↓
pytest
↓
Acceptance Criteria Validator

Only validated work can move forward.
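
The ordered checks above could be driven by something like the following sketch. The exact command arguments are assumptions (the post names the tools, not their flags), the Acceptance Criteria Validator is left out because its interface is not described, and `validate` and `run_command` are hypothetical names.

```python
import subprocess
from typing import Callable, Sequence

# Assumed invocations; adjust to the project's real configuration.
CHECKS: list[list[str]] = [
    ["ruff", "check", "."],
    ["mypy", "."],
    ["pytest"],
]

Runner = Callable[[Sequence[str]], int]


def run_command(cmd: Sequence[str]) -> int:
    """Run one check and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def validate(checks: Sequence[Sequence[str]] = CHECKS,
             runner: Runner = run_command) -> tuple[bool, str]:
    """Run checks in order and stop at the first failure.

    Returns (True, "") when every check passes, otherwise
    (False, name_of_failed_tool).
    """
    for cmd in checks:
        if runner(cmd) != 0:
            return False, cmd[0]
    return True, ""
```

Passing the runner in as a parameter keeps the ordering logic testable without having the tools installed.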

However, validation itself was not the most interesting discovery.

The more important discovery was what happens after validation fails.

Validation Is Not the Goal. Convergence Is.

Many agent workflows stop after validation.

A test fails.

A review check fails.

The workflow ends.

That approach is useful for reporting failures, but it does not help the system reach a successful outcome.

In practice, we found that validation needed to become part of a convergence loop.

Implementation
→ Validation
→ NO-GO
→ Fix
→ Validation
→ GO

When validation returns NO-GO, the implementation session is reused.

The agent receives the validation results and attempts a targeted correction.

The objective is not simply to detect problems.

The objective is to reduce the distance between the current state and the desired state.

This distinction became one of the most important architectural decisions in the system.

The value of validation is not that it finds errors.

The value of validation is that it guides convergence.
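
The loop described in this section can be sketched as follows. This is not the sqlgen-ai implementation: `converge`, `Verdict`, and the `max_attempts` budget are hypothetical, and "reusing the implementation session" is approximated by passing the validator's feedback into the next prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    go: bool        # True means GO, False means NO-GO
    feedback: str   # validator output handed back to the implementer


def converge(implement: Callable[[str], str],
             validate: Callable[[str], Verdict],
             task: str,
             max_attempts: int = 3) -> tuple[bool, int]:
    """Alternate implementation and validation until GO or the budget runs out.

    Returns (converged, attempts_used).
    """
    prompt = task
    for attempt in range(1, max_attempts + 1):
        result = implement(prompt)
        verdict = validate(result)
        if verdict.go:
            return True, attempt
        # NO-GO: feed the validation results into the next, targeted attempt.
        prompt = f"{task}\n\nFix these validation findings:\n{verdict.feedback}"
    return False, max_attempts
```

The attempt budget matters: without it, a loop meant to converge can become a loop that never ends.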

The Current sqlgen-ai Pipeline

Today, sqlgen-ai operates through a multi-stage development pipeline.

PM
→ Pick
→ Implementation
→ MR
→ Test
→ CI
→ E2E

Each stage is connected through GitLab state transitions and scheduled agents.

Agents do not communicate directly.

GitLab acts as the shared memory and coordination layer.

The role of humans has also become much smaller than I originally expected.

Human
→ Goal
→ Approval

The human defines direction.

The agent organization executes the workflow.

What Comes Next

The system is still far from a fully autonomous engineering organization.

Several challenges remain:

Parallel task execution
Automatic Merge Request approval
E2E auto-recovery
Release Agents
Deployment governance
Cross-agent conflict resolution

But the direction has become much clearer.

When I started the project, I thought I was building a coding agent.

Today, I think differently.

The real challenge is not creating an agent that can write code.

The real challenge is building an organization that can reliably move software through an engineering lifecycle.

Conclusion

The biggest shift in my thinking came from operating the system in practice.

Initially, I believed model capability would be the defining factor.

In reality, operational structure mattered far more.

Reliability did not come from better prompts.

It came from:

Role separation
State transitions
Independent validation
Convergence loops
Organizational coordination

As models continue to improve, I expect this distinction to become even more important.

The future may not belong to isolated coding assistants.

It may belong to AI engineering organizations.

And that was the unexpected lesson from building sqlgen-ai.

How I Turned GitLab into a Coordination Layer for Autonomous AI Development Agents

yeTi — Thu, 14 May 2026 10:06:47 +0900

Lessons from building a multi-agent AI development workflow for a production project

TL;DR

Building a reliable AI coding agent is one engineering problem.

Building a reliable AI development workflow with multiple agents is another.

A single agent mostly struggles with execution quality.

Multiple agents introduce coordination problems:

task ownership
shared state visibility
race conditions
workspace contamination
lock recovery
operational governance

While building an autonomous development workflow for the sqlgen project, I learned that code generation was only one part of the problem.

The dominant challenge was coordination.

GitLab labels became the shared state machine that allowed independent agents to coordinate work safely. GitLab’s scoped labels are explicitly designed to support mutually exclusive workflow states, which makes them a practical coordination primitive for workflow orchestration. ([GitLab Docs][1])

The Goal

The original goal was straightforward.

I wanted engineering work inside the sqlgen project to move through an AI-assisted delivery workflow with minimal manual execution.

The target flow looked like this:

Issue discovered
→ planned
→ implemented
→ reviewed
→ tested
→ merged

The initial assumption was simple:

If the coding model is good enough, autonomous delivery becomes practical.

That assumption turned out to be incomplete.

Code generation solved only part of the problem.

Once multiple agents became involved, coordination became the dominant engineering challenge.

This Was Not a Single-Agent Problem

I was not building a coding assistant.

I was building a workflow where multiple agents had distinct responsibilities.

A simplified structure:

Human PM
   ↓
PM Bot
   ↓
Review Bot
   ↓
Dev Bot
   ↓
QA Bot
   ↓
Human Approval

Each agent had a narrower role.

That part was intentional.

Specialized agents are easier to reason about than one general-purpose autonomous actor.

But specialization creates a new requirement:

shared operational context.

A human team can rely on conversation, memory, and implicit understanding.

Independent agents cannot.

Task ownership, workflow progress, and execution state must be externally visible.

That made coordination state an explicit architectural concern.

Why GitLab?

A natural question:

Why use GitLab instead of building a dedicated orchestration service?

The answer was practical.

GitLab already provided several useful properties.

1. Existing Workflow Surface

The engineering workflow already lived in GitLab:

issues
merge requests
labels

That meant no additional operational UI was needed.

Agents could integrate into the workflow engineers were already using.

2. Shared Visibility

Humans and agents could observe the same workflow state.

This matters operationally.

A coordination system that only agents understand becomes difficult to debug.

GitLab gave immediate human inspectability.

An engineer could look at an issue and immediately understand where work was stuck.

3. Simple Polling Model

The initial MVP used a cron-based automation model.

Example:

find issues with workflow::dev-ready

This approach was intentionally simple.

No event bus.
No dedicated orchestration queue.
No new infrastructure.

For an MVP, operational simplicity mattered more than architectural purity.
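
As a rough illustration of what the poller's query might look like, the sketch below builds a URL for GitLab's "list project issues" REST endpoint, filtered by label and open state. The host, the project path, and the function name `ready_issues_url` are placeholders; the post does not say whether the poller used the REST API or the `glab` CLI.

```python
from urllib.parse import quote, urlencode


def ready_issues_url(base_url: str, project: str, label: str) -> str:
    """Build a GitLab REST API URL listing open issues that carry `label`."""
    # A project path must be URL-encoded: "group/repo" -> "group%2Frepo".
    project_id = quote(project, safe="")
    query = urlencode({"labels": label, "state": "opened"})
    return f"{base_url}/api/v4/projects/{project_id}/issues?{query}"


# Hypothetical host and project, for illustration only.
print(ready_issues_url("https://gitlab.example.com", "group/sqlgen",
                       "workflow::dev-ready"))
```

A cron job would request this URL with an access token in the `PRIVATE-TOKEN` header and treat each returned issue as a candidate task.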

4. Explicit State Representation

Scoped labels gave a lightweight way to encode workflow lifecycle state.

Example:

workflow::pm-ready
workflow::dev-running
workflow::review-ready

Because labels within the same scope are mutually exclusive, workflow transitions become naturally enforceable. ([GitLab Docs][1])

That significantly reduced coordination ambiguity.
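
GitLab enforces this exclusivity on the server side. The sketch below only mimics the rule locally so the behavior is easy to see; `scope_of` and `apply_label` are hypothetical helpers, not GitLab API calls.

```python
from typing import Optional


def scope_of(label: str) -> Optional[str]:
    """Return the scope of a scoped label ("workflow::dev-ready" -> "workflow")."""
    if "::" not in label:
        return None
    # The scope is everything before the last "::".
    return label.rsplit("::", 1)[0]


def apply_label(labels: set[str], new_label: str) -> set[str]:
    """Add `new_label`, dropping any existing label in the same scope."""
    scope = scope_of(new_label)
    kept = {item for item in labels
            if scope is None or scope_of(item) != scope}
    return kept | {new_label}
```

Applying `workflow::dev-running` to an issue labeled `workflow::dev-ready` therefore replaces the old state instead of leaving both in place, so an issue can only be in one workflow state at a time.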

The architectural tradeoff was intentional:

Instead of introducing a separate orchestration system, I reused the existing engineering control plane.

GitLab as a Shared State Machine

The workflow state model looked like this:

workflow::pm-ready
workflow::pm-running
workflow::dev-ready
workflow::dev-running
workflow::review-ready
workflow::qa-ready
workflow::done
workflow::failed

Example lifecycle:

Issue created
→ workflow::pm-ready

PM Bot claims task
→ workflow::pm-running

Planning complete
→ workflow::dev-ready

Dev Bot claims task
→ workflow::dev-running

Implementation complete
→ workflow::review-ready

This solved a critical coordination problem.

Agents no longer depended on hidden internal context.

Workflow state became:

explicit
queryable
observable

GitLab was no longer just storing code.

It was acting as the coordination layer for distributed autonomous workers.

First Working MVP

The initial MVP worked under normal execution conditions.

The execution flow looked like this:

1-minute cron poller
↓
Find issues labeled workflow::dev-ready
↓
Acquire workspace lock
↓
Mark issue workflow::dev-running
↓
Execute Codex implementation flow
↓
Create merge request
↓
Transition issue to workflow::review-ready

This was enough to validate the architectural direction.

But happy paths do not validate operational systems.

Failure behavior does.

What Actually Broke

The dominant failures were operational coordination failures rather than model capability failures.

1. Double Pickup

Without explicit claiming, multiple agents can observe the same available task.

Example:

Agent A sees workflow::dev-ready
Agent B sees workflow::dev-ready
Both begin execution

Classic race condition.

Humans resolve this socially.

Distributed workers do not.

The fix:

explicit task claiming
state transition before execution
locking
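
One way to make claiming atomic is an exclusive lock file, sketched below. This assumes all pollers run on the same host and share a filesystem; agents on different machines would need another mechanism. `try_claim` and `release` are illustrative names.

```python
import os


def try_claim(lock_path: str, owner: str) -> bool:
    """Atomically create a lock file; return False if someone else holds it."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # caller can succeed even when several pollers start together.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(owner)
    return True


def release(lock_path: str) -> None:
    """Remove the lock once the run has finished."""
    os.remove(lock_path)
```

The agent that wins the claim would then move the issue to `workflow::dev-running` before doing any work, so the state change is visible to every other poller.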

2. Dirty Workspace Contamination

A failed execution could leave behind:

modified files
temporary branches
partial generated output
broken local state

The next execution inherited polluted state.

This produced misleading failures.

The issue was not reasoning quality.

It was environment integrity.

The fix:

workspace isolation
cleanup contracts
pre-execution guards
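
A pre-execution guard can be as small as refusing to start when `git status --porcelain` reports anything. The sketch below is one possible shape, with hypothetical function names; it detects a dirty tree but leaves cleanup policy to the caller.

```python
import subprocess


def dirty_paths(porcelain_output: str) -> list[str]:
    """Parse `git status --porcelain` output into a list of changed paths."""
    # Each line is two status characters, a space, then the path.
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]


def assert_clean_workspace(repo_dir: str) -> None:
    """Refuse to start a run when the working tree has leftover changes."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    leftovers = dirty_paths(result.stdout)
    if leftovers:
        raise RuntimeError(f"workspace is dirty: {leftovers}")
```

Failing loudly here is the point: a run that inherits someone else's leftovers produces failures that look like reasoning errors but are not.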

3. Cron Environment Drift

Manual execution succeeded.

Automated execution failed.

This is a classic operational issue.

Cron environments differ from interactive shells.

Common failures:

PATH mismatch
missing environment variables
CLI auth assumptions
host normalization issues

In practice, this surfaced as:

Codex working manually but failing in automation
glab targeting the wrong host
executables missing during scheduled execution

These are not glamorous problems.

But production automation usually fails on operational details, not architecture diagrams.
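
A small preflight check run at the top of the cron job can surface this kind of drift before any work starts. The sketch below is an assumption about how such a check might look; which executables and variables to require depends on the setup, and `preflight` is a hypothetical name.

```python
import os
import shutil
from typing import Mapping, Sequence


def preflight(executables: Sequence[str],
              env_vars: Sequence[str],
              env: Mapping[str, str] = os.environ) -> list[str]:
    """Return a list of problems; an empty list means the run may start."""
    problems = []
    for name in executables:
        # Resolve against the PATH the job actually sees, not the login shell's.
        if shutil.which(name, path=env.get("PATH")) is None:
            problems.append(f"executable not on PATH: {name}")
    for var in env_vars:
        if not env.get(var):
            problems.append(f"missing environment variable: {var}")
    return problems
```

Logging the returned list on every scheduled run turns a silent "works manually, fails in cron" mystery into an explicit message.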

4. Stale Locks

Locks prevent concurrent execution.

But failed runs can leave stale locks behind.

Result:

lock exists
→ no new work claimed
→ workflow silently stalls

Without recovery logic, the system appears healthy while doing nothing.

The fix:

lock TTL
stale lock detection
cleanup recovery
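
For a file-based lock, TTL-based recovery might look like the sketch below. It assumes the lock is a plain file whose modification time marks when it was taken; the TTL value is a judgment call, since it must exceed the longest healthy run. Function names are illustrative.

```python
import os
import time
from typing import Optional


def is_stale(lock_path: str, ttl_seconds: float,
             now: Optional[float] = None) -> bool:
    """A lock is stale when its file is older than the TTL."""
    current = time.time() if now is None else now
    return current - os.path.getmtime(lock_path) > ttl_seconds


def recover_stale_lock(lock_path: str, ttl_seconds: float) -> bool:
    """Delete the lock if it is stale; return True when a lock was removed."""
    if os.path.exists(lock_path) and is_stale(lock_path, ttl_seconds):
        os.remove(lock_path)
        return True
    return False
```

Running the recovery step at the start of each poll keeps a crashed run from stalling the queue indefinitely.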

Human-Governed Autonomy

A design correction emerged during implementation.

Full autonomy is not the immediate objective.

A more practical operational model is:

human-governed autonomy

Humans remain responsible for:

defining goals
approving critical changes
resolving ambiguity
production governance

Agents handle:

execution
repetitive workflow progression
structured implementation tasks

This boundary preserves automation benefits while reducing operational risk.

Key Engineering Lesson

Single-agent reliability asks:

How do I make one agent execute correctly?

Multi-agent workflow reliability asks:

How do independent agents coordinate safely?

These are different engineering problems.

The second problem looks much closer to distributed systems engineering than prompt engineering.

Because the failure modes are familiar:

shared state consistency
ownership conflicts
stale resources
operational recovery
workflow observability

Reliable agents are useful.

Reliable coordination is essential.

How Role Separation Reduced Execution Drift in Multi-Agent Systems

yeTi — Thu, 7 May 2026 22:38:50 +0900

Lessons from building reliable AI agent workflows with Hermes and local LLMs

TL;DR

Multi-agent systems become unstable when responsibilities overlap
Stronger models do not automatically improve workflow convergence
Shared context without ownership boundaries creates execution drift
Separating Planner, Implementer, and Validator responsibilities significantly improved workflow stability
The Implementer should apply contracts precisely, not redesign the system during execution

1. The Problem — Execution Drift in Multi-Agent Workflows

When I first started building AI agent systems, I assumed the main problem was model capability.

If the model became smarter, the workflow would become more reliable.

That assumption turned out to be incomplete.

While experimenting with local LLM-based coding agents using:

Hermes Agent
Claude Code CLI
Ollama
local qwen models
Discord-based orchestration

I repeatedly encountered the same failure pattern.

At first, using a single agent felt efficient.

The same agent would:

plan tasks
write code
debug failures
retry execution
validate outputs
redesign architecture during retries

Everything happened inside one large shared context.

Initially, this looked flexible.

But as workflows became larger, execution stability degraded rapidly.

I started seeing problems such as:

endless retry loops
inconsistent file structures
duplicated abstractions
rewritten interfaces during execution
architectural drift between retries
increasing divergence from the original task

The workflow often looked productive.

But convergence became worse over time.

Eventually, I realized I was not only dealing with model errors.

I was dealing with execution drift.

2. Stronger Models Still Drifted During Execution

One surprising realization was that stronger models did not fundamentally solve the problem.

Larger models often generated better local outputs.

However, workflow instability still remained.

In some cases, stronger models amplified instability because they became more willing to reinterpret previous decisions during execution.

For example, an Implementer agent might:

rename directories during retries
introduce new abstractions mid-execution
redefine interfaces that were already agreed upon
restructure unrelated components while fixing a local issue

At first, this behavior appeared intelligent.

The model looked proactive and adaptive.

However, execution reliability became worse.

The model attempted to optimize locally during retries.

Instead of treating the existing structure as a fixed contract, it continuously searched for “better” architectures.

As retries accumulated, small local optimizations gradually destabilized the workflow itself.

Eventually, the workflow became harder to reason about after every retry.

This led me to an important realization:

Reliability problems in multi-agent systems are often coordination problems.

The workflow was unstable not because the model was incapable, but because responsibilities were unclear.

3. Shared Context Without Ownership Creates Instability

One of the biggest problems in multi-agent systems is uncontrolled shared context.

At first, shared memory feels efficient because every agent can access the same information.

However, in practice, this often removes ownership boundaries.

Once ownership becomes unclear, responsibilities begin overlapping.

For example:

the Planner modifies implementation details
the Implementer redesigns architecture decisions
the Validator proposes alternative execution strategies
retry loops introduce conflicting interpretations

Eventually, the workflow loses convergence.

The issue is not that the agents are unintelligent.

The issue is that every agent is allowed to make every type of decision.

This creates architectural instability.

While debugging these workflows, I realized the problem felt surprisingly familiar.

It resembled a classic problem from object-oriented design.

4. The Object-Oriented Design Parallel

In object-oriented programming, responsibility separation is considered one of the most important design principles.

The same idea appears repeatedly in concepts such as:

Single Responsibility Principle (SRP)
high cohesion
low coupling
ownership boundaries

The core idea is simple:

Systems become difficult to reason about when responsibilities overlap.

That idea started feeling very similar to what I was seeing in agent systems.

In traditional software systems:

a service should not own every responsibility
a class should not make every decision
a module should not redefine another module’s contract

The same pattern appeared inside multi-agent workflows.

When every agent could:

plan
implement
redesign
validate
reinterpret contracts during retries

workflow stability degraded rapidly.

At some point, I stopped thinking about agents as “smart tools.”

I started thinking about them as independently evolving components inside a distributed system.

That perspective changed how I designed workflows afterward.

5. The Implementer Should Not Redesign the System

One specific failure pattern repeatedly caused instability in my workflows.

The Implementer agent would begin modifying architectural decisions during execution.

For example:

changing directory structures during retries
introducing new abstractions unrelated to the original task
rewriting task boundaries while fixing local errors
redefining interfaces that other agents already depended on

At first, this behavior looked intelligent.

The model appeared proactive.

However, execution reliability became significantly worse.

Every retry introduced additional design changes.

As those changes accumulated, the workflow continuously drifted away from the original contract.

Eventually, retries stopped behaving like recovery mechanisms.

They became architecture mutation loops.

This problem became especially severe when the Implementer shared the same broad context as the Planner.

The Implementer gradually started behaving like another Planner.

That overlap destabilized the workflow.

Eventually, I realized the problem resembled a classic object-oriented design issue.

An object becomes difficult to reason about when it owns too many responsibilities.

The same pattern appeared in agent systems.

The Implementer should not make new decisions during execution.

Its role is to apply the already defined contract as precisely as possible.

Once I separated:

planning responsibilities
execution responsibilities
validation responsibilities

workflow convergence improved significantly.

6. The Harness Layer — Controlling Convergence

Role separation alone does not guarantee convergence.

The workflow still requires a control layer that verifies whether execution remains aligned with the original contract.

That became the responsibility of the Harness layer.

The Harness layer acts as a convergence controller.

It determines:

whether retries should continue
whether execution drift exceeded acceptable boundaries
whether rollback is necessary
whether the workflow should terminate

For example, if retries continuously modified unrelated files or redefined existing interfaces, the Harness layer treated the execution as divergence rather than recovery.

That distinction became important.

Without convergence control, retries often amplified instability instead of resolving failures.

The Harness layer then managed:

retries
convergence loops
execution stabilization
workflow validation

This architecture became significantly more stable than relying on a single highly capable agent operating inside a large shared context.
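
The decisions listed above can be illustrated with a small policy function. This is a sketch under assumptions, not the actual Harness: drift is approximated as "touched a file outside the contract", and `Decision`, `decide`, and the parameters are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    RETRY = "retry"
    ROLLBACK = "rollback"
    TERMINATE = "terminate"
    DONE = "done"


def decide(passed: bool, attempt: int, max_attempts: int,
           changed_files: set[str], allowed_files: set[str]) -> Decision:
    """Classify one attempt as convergence, recovery, or divergence."""
    if changed_files - allowed_files:
        # Touching files outside the contract counts as drift, not recovery.
        return Decision.ROLLBACK
    if passed:
        return Decision.DONE
    if attempt >= max_attempts:
        return Decision.TERMINATE
    return Decision.RETRY
```

Note that drift is checked before success: an attempt that passes validation but modified unrelated files is still rolled back in this sketch.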

7. My Current Multi-Agent Structure

My current workflows are increasingly organized around ownership boundaries.

A simplified structure looks like this:

PM Agent
    ↓
Planner Agent
    ↓
Implementer Agent
    ↓
Validator Agent
    ↓
Harness Layer

Each role owns a different category of decisions.

That ownership is important.

Planner

Responsible for:

execution strategy
task decomposition
contract definition

But not responsible for execution changes during runtime.

Implementer

Responsible for:

applying predefined contracts
writing code
executing tasks precisely

But not responsible for redesigning architecture.

Validator

Responsible for:

invariant verification
semantic validation
execution correctness checks

But not responsible for redefining execution strategy.
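
One way to make these boundaries checkable is an explicit ownership table. The decision names below are taken from the role descriptions above, but the table and the `may_decide` helper are illustrative assumptions.

```python
# Illustrative mapping only; not the author's code.
OWNERSHIP: dict[str, set[str]] = {
    "planner": {"execution strategy", "task decomposition",
                "contract definition"},
    "implementer": {"applying predefined contracts", "writing code",
                    "executing tasks precisely"},
    "validator": {"invariant verification", "semantic validation",
                  "execution correctness checks"},
}


def may_decide(role: str, decision: str) -> bool:
    """True only when `decision` falls inside the role's owned category."""
    return decision in OWNERSHIP.get(role, set())
```

An orchestrator could consult such a table before accepting an agent's output, rejecting, for example, an Implementer change that amounts to a new contract definition.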

As ownership boundaries became clearer, workflow behavior became significantly easier to reason about.

Execution drift decreased.

Retries became more predictable.

And convergence stability improved substantially.

8. Reliability Comes From Ownership Boundaries

One of the biggest misconceptions about AI agents is that reliability comes only from intelligence.

In practice, reliability often comes from constrained responsibilities.

The same principle already exists in software engineering.

Distributed systems become more stable when responsibilities are isolated.

Database systems become safer when transactional boundaries are explicit.

Microservices reduce instability by limiting ownership scope.

Multi-agent systems appear to follow similar patterns.

Without boundaries:

every agent becomes a planner
every retry becomes a redesign
every execution becomes negotiation

As workflows become more complex, instability grows quickly.

Role separation reduces that instability because ownership becomes predictable.

The more complex the workflow became, the more important role boundaries became for maintaining convergence.

9. The Future — Reliability Engineering for Agent Systems

I increasingly believe we are entering a new phase of AI system design.

Earlier generations of AI systems focused heavily on:

prompts
model quality
tool integration
inference capability

Those layers are still important.

However, as workflows become more autonomous, coordination and ownership also become architectural concerns.

Instead of only asking:

“Which model should execute this task?”

We may increasingly need to ask:

“Which role should own this decision?”

That shift feels important.

Because many of the hardest problems in agent systems are no longer only about generation quality.

They are increasingly about:

coordination
ownership
execution boundaries
convergence stability

As agent workflows become more autonomous, reliability engineering may increasingly become an exercise in defining ownership boundaries between agents.

Conclusion

One unexpected realization from building multi-agent systems was how familiar the failures looked.

Execution drift, responsibility overlap, and uncontrolled redesign during retries resembled classic software engineering problems.

In many ways, multi-agent workflows began behaving like distributed object systems.

The same lessons appeared again:

unclear ownership creates instability
overlapping responsibilities reduce predictability
uncontrolled autonomy weakens convergence

The Implementer should not redesign the system during execution.

Its responsibility is to apply the already defined contract as precisely as possible.

That separation turned out to be one of the biggest improvements in workflow stability.

Ironically, some of the most important ideas for future AI systems may not be entirely new.

Software engineering has already spent decades learning how to build stable systems through responsibility separation and ownership boundaries.

Now, those principles appear to be emerging again inside agent systems.

Why Multi-Agent Systems Fail to Respond — Debugging a Real Hermes Agent Setup

yeTi — Thu, 30 Apr 2026 16:54:05 +0900

Lessons from building and debugging a real-world multi-agent system with Hermes Agent

TL;DR

The agent didn’t fail to generate an answer
It failed to decide whether it should act
Multi-agent systems require coordination signals, not just intelligence
The fix was not better prompts, but explicit behavior contracts

1. Problem — The Agent Didn’t Respond at All

While building a multi-agent system using Hermes Agent and Discord, I encountered a surprisingly simple but critical failure:

The agent didn’t respond.

Not partially.
Not incorrectly.

It simply did nothing.

Observed behavior

Discord message sent with mention
PM agent responded
Developer agent stayed silent
No errors
No logs indicating failure

From the outside, the system looked completely normal.

2. Initial Hypothesis — It Must Be a Configuration Issue

My first assumption was straightforward:

“This must be a Discord or Hermes configuration problem.”

So I checked everything.

What I verified

Discord bot token regeneration
Gateway intents (Message Content Intent enabled)
Channel permissions
allowed_channels configuration
require_mention settings
Restarted Hermes gateway

These are all known failure points.

Result

Everything was correct.

And yet, the agent still didn’t respond.

3. Reality — The System Was Working Correctly

This was the turning point.

The system was not broken.

It was behaving exactly as designed.

What actually happened

The agent received the message
The agent processed the message
The agent generated internal reasoning

But:

It never decided to act.

4. Root Cause — No Decision Model for Action

This is where the real problem emerged.

The agent had:

input processing
reasoning capability
tool access

But it lacked one critical component:

A decision rule for “Should I respond?”

Important distinction

There are two separate problems in agent systems:

Can the agent generate an answer?
Should the agent act at all?

Most discussions focus only on the first question.

This failure was entirely about the second.

What the agent was missing

The system had no explicit definition of:

when to respond
when to ignore
how to interpret mentions
how to handle multi-agent context

5. Insight — Humans Use Signals, Not Just Understanding

This became clearer when I compared it to human behavior.

Humans do not respond to every message.

They respond based on signals.

Human decision model

If I am mentioned → respond
If someone else is mentioned → ignore
If unclear → decide based on role

The agent had none of this

It understood the message.

But it didn’t understand:

whether it was responsible for acting.

6. Fix — Explicit Behavior Contract

The solution was not improving prompts.

It was introducing a behavior contract.

Example (soul.md)

# Agent Behavior Contract

IF message mentions me → respond

IF message mentions another agent → ignore

IF message is general:
  → decide based on role (PM / Developer / Reviewer)

IF task is assigned:
  → execute within role boundary

What changed

The agent gained decision boundaries
Responsibility became explicit
Multi-agent interaction became predictable
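
The contract above can be read as a small decision function. The sketch below is only an illustration of that idea, not how Hermes Agent evaluates soul.md. The names `Message`, `Action`, `decide`, and the `topic_role` field are hypothetical and introduced here for the example.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set


class Action(Enum):
    RESPOND = "respond"
    IGNORE = "ignore"


@dataclass
class Message:
    text: str
    mentions: Set[str] = field(default_factory=set)  # agent names that were mentioned
    topic_role: Optional[str] = None  # hypothetical: role a general message concerns


def decide(agent_name: str, agent_role: str, msg: Message) -> Action:
    """Apply the contract rules in order; the first matching rule wins."""
    if agent_name in msg.mentions:
        return Action.RESPOND   # rule 1: this agent is mentioned
    if msg.mentions:
        return Action.IGNORE    # rule 2: only other agents are mentioned
    if msg.topic_role == agent_role:
        return Action.RESPOND   # rule 3: general message that falls in this role
    return Action.IGNORE        # default: staying silent is now a decision
```

The point of the sketch is the last line: silence becomes an explicit outcome of a rule, not the absence of one.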

Key takeaway

This is not prompt engineering.
This is behavior design.

7. Why This Matters — Multi-Agent Systems Need Coordination

Hermes Agent supports multi-agent configurations with role-based execution.

But simply adding multiple agents is not enough.

Multi-agent systems introduce a new layer of failure

Not:

model quality
prompt quality

But:

coordination failure

Core requirement

Multi-agent systems need:

routing
responsibility
coordination signals

Without these:

The system becomes idle, not intelligent.
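
One way to keep a system from going idle is to route every message to at least one owner. The sketch below assumes a simple name-to-role registry and a fallback owner. Both are my assumptions for illustration, not Hermes Agent features, and `route` is a hypothetical helper.

```python
from typing import Dict, List, Set


def route(mentions: Set[str], agents: Dict[str, str], fallback: str) -> List[str]:
    """Return the agents that must respond to a message.

    agents maps agent name -> role. fallback is a hypothetical default owner
    (for example the PM agent) used when no registered agent is mentioned.
    """
    owners = sorted(name for name in mentions if name in agents)
    if owners:
        return owners
    # Nobody registered was mentioned: assign an owner instead of staying idle.
    return [fallback]
```

With this shape, "no one answered" can only happen if the assigned owner fails, which is a failure that can be logged and retried.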

8. Connection to Previous Posts

This experience connects directly to previous findings:

Prompt engineering improves outputs but not reliability
Convergence systems stabilize execution
And now:

Coordination determines whether the system acts at all

Evolution of understanding

Prompt → insufficient
Pipeline → still unstable
Convergence → improves reliability
Coordination → enables action

9. What I Learned

The system didn’t fail because it was wrong.

It failed because it was silent.

Final realization

The system didn’t fail to generate an answer.
It failed to decide whether it should respond.

How I Designed a Reliable LLM Coding Agent for Production

yeTi — Wed, 29 Apr 2026 16:49:50 +0900

From Unpredictable AI to Reliable Systems
Lessons from building real-world AI agent systems

TL;DR

Prompt engineering alone was never enough for stable execution
Single-prompt systems created over-exploration, scope drift, and unreliable outputs
I redesigned the system into Planner → Implementer → Validator
Reliability came from contracts, validation, and retry loops—not better prompts
Production reliability is a system design problem, not a model quality problem

1. The Problem Was Never Just the Prompt

When I first started building a local LLM coding agent, I believed the solution was simple:

Write a better prompt.

I was using:

Mac Studio
Ollama
Claude Code CLI
local Qwen models

The goal was straightforward:

Build an automated coding workflow that could operate without constant human intervention.

At first, I injected large prompts directly into the system and expected stable execution.

What happened instead was instability.

Sometimes the model explored too much and redesigned architecture instead of fixing the requested issue.
Sometimes it modified unrelated files and crossed boundaries I never intended to touch.
Sometimes it simply failed to return usable output.

The issue was not intelligence.

It was execution control.

That was the moment I stopped trying to optimize prompts and started redesigning the execution architecture.

2. The New Architecture — Planner → Implementer → Validator

The first version of the system looked like this:

Large Prompt
↓
LLM
↓
Hope for the best

This worked for demos, but it failed in production because the same request could produce different outcomes.

I redesigned the system around one principle:

Reliability must be enforced by the system.

The new structure became:

Request
↓
Planner
↓
Implementer
↓
Validator
↓
Retry if needed
↓
Converged Result

Each stage had:

clear responsibility boundaries
structured contracts
validation checkpoints
deterministic retry conditions

This was the point where the system became trustworthy.

The model was no longer asked to “figure everything out.”
It was asked to operate inside a controlled execution environment.

3. Layer 1 — Planner

The Planner exists to prevent intent drift.

Its job is not writing code.

Its job is defining the contract before execution begins.

Instead of vague instructions like:

Fix the login issue

the planner produces something like this:

Intent Contract

- Request Type
- Required Change
- Protected Boundaries
- Acceptance Criteria
- Codebase Anchors

For example:

Do:
- Add token refresh logic

Do Not:
- Change authentication API contracts

Must Pass:
- Existing login flow remains unchanged

This removes ambiguity before implementation starts.

Design decisions, protected boundaries, and success conditions are fixed in advance.

The Implementer is not asked to interpret intent.
It is asked to follow a contract.

That separation is what prevents scope drift.
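
As a rough sketch, the Intent Contract can be represented as a plain data structure that later stages check against. The field names follow the list above, but the class, the glob-based `protected_paths` representation, and `boundary_violations` are assumptions made for this example, not the exact format my Planner emits.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatch
from typing import List


@dataclass(frozen=True)
class IntentContract:
    request_type: str
    required_change: str
    # Glob patterns the Implementer must not touch (assumed representation).
    protected_paths: List[str] = field(default_factory=list)
    acceptance_criteria: List[str] = field(default_factory=list)
    # Files where the change is expected to land.
    codebase_anchors: List[str] = field(default_factory=list)


def boundary_violations(contract: IntentContract, changed_files: List[str]) -> List[str]:
    """Return the changed files that match a protected path pattern."""
    return [
        path
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in contract.protected_paths)
    ]
```

Because the contract is data, a boundary check becomes a list comparison instead of a judgment call by the model.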

4. Layer 2 — Implementer

The Implementer performs the actual code changes.

Its responsibility is intentionally narrow:

Execute the contract exactly.

The Implementer should not make new decisions during execution.

Its role is to apply the already defined contract as precisely as possible.
Design decisions, scope boundaries, and acceptance criteria should already be fixed by the Planner.

This significantly reduced the “creative failures” I saw in earlier versions.

Unexpected refactoring decreased, unnecessary architectural changes disappeared, and unrelated file modifications became far less common.

As a result, the implementation layer became intentionally predictable.

That predictability was a positive signal, because reliable production systems should reduce surprise rather than create it.

The goal of execution is not creativity.

It is consistency.

5. Layer 3 — Validator

This is the most important part of the system.

The Validator determines whether the result is actually acceptable.

It checks:

Was the requested behavior implemented?
Were protected contracts preserved?
Did the output format remain valid?
Were unintended side effects introduced?

Without validation, failure stays hidden.

The system may look successful while silently breaking important assumptions.

With validation, failure becomes explicit.

And once failure becomes visible, retries become meaningful.

This is where convergence actually happens—not inside the prompt, but inside the feedback loop.

A validator should reject not only syntactically broken output, but also structurally valid output that is semantically wrong.

That distinction is where many production systems fail.
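
A minimal sketch of such a Validator is a set of named checks that each return pass or fail. The result dictionary keys and the four example checks below are hypothetical. They mirror the four questions above and are not the checks from my actual system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ValidationReport:
    failures: List[str]

    @property
    def passed(self) -> bool:
        return not self.failures


def validate(result: dict, checks: Dict[str, Callable[[dict], bool]]) -> ValidationReport:
    """Run every named check and collect the names of the ones that fail."""
    failures = [name for name, check in checks.items() if not check(result)]
    return ValidationReport(failures)


# Hypothetical checks, one per question in the list above.
CHECKS = {
    "requested behavior implemented": lambda r: "refresh_token" in r["diff"],
    "protected contracts preserved": lambda r: not r["protected_files_changed"],
    "output format valid": lambda r: r["diff"].startswith("diff --git"),
    "no unintended side effects": lambda r: set(r["changed_files"]) <= set(r["allowed_files"]),
}
```

A well-formed diff that never adds the requested logic passes the format check but fails the behavior check, which is the structurally valid but semantically wrong case.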

6. Retry Logic — Convergence Over Perfection

Retries should not be random.

They should be guided by validation results.

Instead of saying:

Try again

the system should say:

Retry because Acceptance Criteria #2 failed
Retry because a protected contract was violated

This changes retries from guesswork into controlled convergence.

The goal is not to get the perfect answer on the first attempt.

The goal is to move the system consistently toward the target state.

That is what reliability means in practice.

Production systems do not need brilliance.

They need stable convergence.
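
The idea can be sketched as a loop where each retry receives the reasons the previous attempt failed. This is a simplified illustration. `converge`, `attempt_fn`, and `validate_fn` are hypothetical names, and a real attempt function would call the model with the feedback included in its prompt.

```python
from typing import Callable, List, Tuple


def converge(
    attempt_fn: Callable[[List[str]], str],
    validate_fn: Callable[[str], List[str]],
    max_attempts: int = 3,
) -> Tuple[str, int]:
    """Retry until validate_fn reports no failures.

    attempt_fn receives the failure reasons from the previous attempt
    (an empty list on the first call), so each retry is targeted.
    """
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        output = attempt_fn(feedback)
        feedback = validate_fn(output)
        if not feedback:
            return output, attempt
    # Surface the failure instead of returning unvalidated output.
    raise RuntimeError(f"did not converge after {max_attempts} attempts: {feedback}")
```

Raising at the end matters as much as the loop: if the system cannot converge, that fact stays visible.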

7. Before vs After

Before:

Prompt
→ Result
→ Hope

After:

Intent
→ Contract
→ Execute
→ Validate
→ Retry
→ Reliable Output

The system became slower.

But it also became trustworthy.

And in production systems, trust matters more than speed—especially when autonomous execution is involved.

Fast failure is still failure.

Reliable execution is what matters.

8. What Actually Changed

The biggest change was not technical. It was philosophical.

At first, I focused on improving the model itself.
I kept asking how to make the model smarter, more accurate, and more capable.

Over time, I realized that intelligence was not the main problem.

The real problem was failure visibility.

The better question became:

How do I design a system where failure cannot be ignored?

That shift changed everything.

Reliable AI systems are not built by better prompts alone.

They are built through clear execution boundaries, explicit contracts, strong validation layers, and retry mechanisms that guide the system toward convergence.

In other words, reliability does not come from model intelligence itself. It comes from the system surrounding the model.

That is why:

Reliability is not a model capability.
It is a system property.

Why Prompt Engineering Fails — Harness Engineering for Reliable LLM Systems

yeTi — Fri, 24 Apr 2026 17:38:59 +0900

From unpredictable AI outputs to production-ready LLM systems

TL;DR

Prompt engineering improves output quality, but it does not guarantee reliability
Most LLM systems fail in production because they cannot handle failure
Validation layers, retry loops, and strict output contracts are what make AI automation reliable
Reliable AI agent systems are built with control systems, not just better prompts
This is where Prompt Engineering ends and Harness Engineering begins

Why Prompt Engineering Alone Fails in Production

Many teams building AI agents make the same assumption:

“If we write better prompts, the system will become reliable.”

At first, I believed that too.

I thought better prompts meant:

clearer instructions
stronger role definitions
stricter output formatting
more examples
more failure prevention rules

So I kept improving prompts.

Longer prompts.
Safer prompts.
More detailed prompts.

But something unexpected happened.

The prompts got better.

The system got worse.

Especially when working with local LLM environments using:

Mac Studio
Ollama
Claude Code CLI
local Qwen models

I repeatedly saw:

no output at all
endless file exploration
incomplete implementations
unstable behavior across identical runs

The problem was not correctness.

The real problem was:

I could not reliably get results at all.

That was the turning point.

I realized:

Prompt quality was not the bottleneck.

The real issue was:

One prompt was carrying the entire system.

And that is why prompt engineering alone fails in production LLM systems.

Why LLM Systems Fail in Production

The biggest misunderstanding in AI system design is this:

People think LLM systems fail because models are not smart enough.

That is usually wrong.

Most LLM systems fail because they are designed as linear success paths.

If every step must succeed perfectly,
the entire system becomes fragile.

For example:

Analyze → Design → Implement → Validate → Deploy

If one step fails,
the entire workflow breaks.

This becomes worse because LLMs are non-deterministic.

The same input does not always produce the same output.

Unlike traditional software:

Same input → Same output ❌
Same input → Different outputs ✔

This means reliability is not a model feature.

It is a system design problem.

That is the foundation of modern LLM system design.

A Real Failure Case from Claude Code + Ollama

One failure made this obvious.

The Implementer step was supposed to modify a single API file.

The task was simple:

replace admin-token generation with user-context token handling.

Nothing more.

But instead of touching the target file, the model started scanning the entire repository.

It opened unrelated modules, rewrote helper functions, and tried to understand the whole system.

Eventually, it returned:

“Task completed successfully”

But the required logic had not changed at all.

The actual task was still unfinished.

The model had optimized for

plausible completion

instead of

actual completion

The Validator rejected the output immediately.

That was the moment I understood:

Without explicit boundaries, LLMs optimize for confidence, not correctness.

And confidence is useless in automation.

This is one of the most common failure patterns in local LLM automation.

From Prompt Engineering to Harness Engineering

This changed the architecture completely.

Instead of one giant prompt, I split the workflow into smaller steps:

Intent → Planner → Spec → Implement → Validate → Git

Each step had:

one responsibility
explicit input/output contracts
deterministic validation
retry capability

This improved reliability dramatically.

The goal changed from:

Generate the correct answer once

to:

Detect failure and drive convergence

This is the difference between:

Prompt Engineering

and

Harness Engineering

Prompt engineering improves what the model says.

Harness engineering controls what the system accepts.

That distinction is everything.

What Is Harness Engineering?

Harness Engineering is the control layer that makes LLM systems reliable.

It includes:

validation layers
retry loops
output contracts
failure detection
step isolation
convergence architecture

Instead of asking:

“Can the model do this?”

the real question becomes:

“What happens when the model fails?”

Because failure is not an exception.

Failure is the default state.

Reliable AI systems are not built by avoiding failure.

They are built by surviving it.

That is Harness Engineering.

Contract-Driven LLM Execution

The biggest improvement came from removing ambiguity.

I stopped asking the model to:

“Do the task well”

and started requiring:

“Satisfy the contract”

For example:

### REQUIRED OUTPUT

- Must create:
  .handoff/task-1/spec.md

- Must include:
  SPEC_DONE

- Must NOT:
  modify request.md

- Final line must be:
  <<<DONE>>>

Outputs also had strict file-based contracts:

[FILE]
path: .handoff/task-1/spec.md
---
implementation details here
---

No free-form output.

No interpretation.

Either the contract was satisfied or it failed.

This made failure measurable.

And measurable failure is what makes retries possible.

This is one of the most important patterns in production-grade AI agent architecture.
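
A contract like the one above can be checked mechanically. The sketch below parses `[FILE]` blocks and applies the four rules. It assumes the `SPEC_DONE` marker may appear anywhere in the output and that a file body never contains a bare `---` line. The function names are hypothetical, and this is not the exact checker I run.

```python
import re
from typing import Dict, List

# One [FILE] block: a path line, then a body wrapped in '---' lines.
FILE_BLOCK = re.compile(
    r"^\[FILE\]\npath: (?P<path>[^\n]+)\n---\n(?P<body>.*?)\n---$",
    re.MULTILINE | re.DOTALL,
)


def parse_file_blocks(output: str) -> Dict[str, str]:
    """Map each declared path to its body."""
    return {m.group("path").strip(): m.group("body") for m in FILE_BLOCK.finditer(output)}


def contract_failures(output: str) -> List[str]:
    """Return the list of violated contract rules (empty means satisfied)."""
    files = parse_file_blocks(output)
    failures = []
    if ".handoff/task-1/spec.md" not in files:
        failures.append("required file was not created")
    if "SPEC_DONE" not in output:
        failures.append("missing SPEC_DONE marker")
    if any(path.split("/")[-1] == "request.md" for path in files):
        failures.append("request.md must not be modified")
    lines = output.strip().splitlines()
    if not lines or lines[-1].strip() != "<<<DONE>>>":
        failures.append("invalid completion marker")
    return failures
```

The returned list doubles as the failure signal for the next retry.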

Why Validation Layers Matter in LLM Systems

Validation is what transforms LLM output from “probably correct” into “safe enough to continue.”

Validation was deterministic.

For example:

if ! grep -q "VALIDATOR_DONE" output.txt; then
  echo "Validation failed"
  exit 1
fi

Other validation checks included:

required file existence
schema validation
forbidden output detection
completion marker verification
PASS / FAIL summaries

The question changed from:

“Does this look correct?”

to:

“Did this satisfy the required conditions?”

That is how validation layers improve LLM reliability.

Because intuition does not scale.

Validation does.

Retry Loops Create Reliable AI Automation

Retries were not random.

They were guided by failure signals.

When validation failed, the next prompt included correction feedback:

Previous output failed because:

- missing [FILE] block
- invalid completion marker
- required file was not created

Fix only these issues.
Do not rewrite unrelated sections.

This made retries behave like:

gradient-free optimization

The model did not need gradients.

It only needed clear failure signals.

That was enough to create convergence.

This is why retry loops are the core of reliable LLM systems.

Not just better prompts.
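
Building that correction feedback is simple string assembly. The sketch below reproduces the feedback format shown above. `build_retry_prompt` is a hypothetical helper name, and appending the feedback to the original prompt is one possible choice, not the only one.

```python
from typing import List


def build_retry_prompt(original_prompt: str, failures: List[str]) -> str:
    """Append targeted correction feedback to the original prompt."""
    if not failures:
        return original_prompt  # nothing failed, nothing to correct
    bullet_list = "\n".join(f"- {reason}" for reason in failures)
    return (
        f"{original_prompt}\n\n"
        "Previous output failed because:\n\n"
        f"{bullet_list}\n\n"
        "Fix only these issues.\n"
        "Do not rewrite unrelated sections."
    )
```

Keeping the feedback limited to the validator's findings is what stops a retry from turning into a redesign.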

Why Agents Alone Are Not Enough

Many people ask:

Why not just use Claude Code directly?

The answer is simple.

Claude Code is a brilliant interactive engineer.

But production automation does not need brilliance.

It needs boring reliability.

Interactive agents work well when humans are present.

Because humans can:

notice failure
redirect execution
stop bad decisions

But in a fully automated workflow:

there is no human in the loop.

The system must decide:

did this step succeed?
should we retry?
what exactly failed?
is it safe to continue?

That requires a harness.

Not just an agent.

The agent generates.

The harness controls.

That distinction separates demos from production systems.

Final Takeaway

The real question is not:

How do we build smarter AI?

The real question is:

How do we build systems that do not collapse when failure happens?

Reliable LLM systems are not built by asking better questions.

They are built by designing better control systems.

This is the core of production AI engineering.

And this is where Prompt Engineering ends.

FAQ

What is Harness Engineering?

Harness Engineering is the system design layer that controls LLM behavior.

It includes validation, retries, structured output contracts, and failure detection.

It is what makes LLM systems reliable in production.

Why does prompt engineering fail in production?

Because prompt engineering improves output quality, but it does not guarantee reliability.

Production systems fail when there is no validation or recovery mechanism for bad outputs.

How do validation layers improve LLM reliability?

Validation layers make failure measurable.

They check completion markers, files, schemas, and output contracts so the system can safely decide whether to continue or retry.

Why do local LLM systems fail more often?

Local LLMs usually have smaller context windows, weaker reasoning consistency, and higher instability in multi-step tasks.

That makes validation and retry systems even more important.

What is the difference between prompt engineering and harness engineering?

Prompt engineering improves model outputs.

Harness engineering controls whether those outputs are accepted, rejected, or retried.

Prompt improves quality.

Harness creates reliability.

Why LLM Systems Fail in Production (And Why Prompt Engineering Is Not Enough)

yeTi — Thu, 23 Apr 2026 16:01:11 +0900

From Unpredictable AI to Reliable Systems

Lessons from Building Real-World AI Agent Systems

TL;DR

Prompt engineering improves output quality, but it does not guarantee reliability
LLM systems fail because outputs are probabilistic, not deterministic
Multi-step AI agent pipelines amplify failure probabilities
Production-grade LLM systems require validation, retry loops, and convergence mechanisms
Reliability is not a model capability — it is a system property

Introduction: The Real Problem with LLM Systems

Most LLM systems fail in production not because the model is weak,
but because the system cannot handle failure.

Many teams assume the problem is intelligence.

“We need smarter models.”

“We need better reasoning.”

“If we switch to a stronger frontier model, everything will work.”

But in real-world systems, the biggest problem is not intelligence.

It is unpredictability.

The same prompt can produce different outputs.

Some days, the workflow works perfectly.

Other days, the exact same input produces failure.

It works in demos,
but breaks in production.

This is the real challenge of building reliable AI systems.

And at this point, many teams make the same mistake:

“Maybe we just need better prompt engineering.”

But prompt engineering alone is not enough.

Why Prompt Engineering Alone Fails

I thought the same thing at first.

I believed that if I designed a single prompt carefully enough,
I could get stable and reliable outputs.

So I made prompts longer.

More detailed.

More structured.

I added:

role definitions
output formatting rules
examples
failure prevention instructions

The prompt kept growing.

But the results became less stable.

This was especially obvious in local LLM environments.

Using:

Mac Studio
Ollama
Claude Code CLI
local Qwen models

I repeatedly saw:

no output at all
excessive file exploration (over-exploration)
incomplete task execution
unstable behavior across identical runs

The problem was not correctness.

The real problem was:

I could not reliably get results at all.

That was the turning point.

The issue was not prompt quality.

The issue was:

One prompt was carrying the entire system.

Why Small, Well-Defined Tasks Improve LLM Reliability

I noticed something important.

Local LLMs performed much better
when tasks were small and clearly defined.

For example:

understanding requirements
defining implementation scope
finding files to modify
validating code changes

Each of these tasks worked surprisingly well.

But asking the model to do all of them at once
made the entire system unstable.

This led to a simple insight:

Break the problem into smaller pieces.

LLMs work better as

a team of specialized workers

than as

a single genius expected to solve everything.

That insight changed the architecture completely.

From One Prompt to Multi-Step AI Agent Systems

Instead of one massive prompt,
I built a structured pipeline.

For example:

Intent Extractor
Planner
Spec Planner
Implementer
Validator
Git & Merge Request Automation

Each step had:

a narrow responsibility
explicit input/output contracts
independent validation
retry capability

The goal was no longer:

Generate the correct answer in one shot

The real goal became:

Detect failure and drive convergence

This was the shift from

Prompt Engineering to Harness Engineering

Real Example: Contract-Driven LLM Execution

The most important improvement was simple:

I stopped asking the model to
“do the task well.”

Instead,

I asked it to satisfy strict contracts.

For example:

PLANNER_DONE
SPEC_DONE
IMPLEMENTATION_DONE
VALIDATOR_DONE
GIT_DONE

Each step had required completion markers.

And outputs had to follow strict file-based handoff rules:

[FILE]
path: .handoff/task-1/spec.md
---
implementation details here
---

This removed ambiguity.

The model was no longer evaluated by quality.

It was evaluated by contract satisfaction.

That changed everything.

How Validation Layers Make LLM Systems Reliable

Validation was not based on intuition.

It was deterministic.

For example:

if ! grep -q "VALIDATOR_DONE" output.txt; then
  echo "Validation failed"
  exit 1
fi

Other validation checks included:

required file existence
schema validation
forbidden output detection
completion marker verification
PASS / FAIL summaries

The question changed from:

“Does this look correct?”

to:

“Did this satisfy the required conditions?”

This made retries possible.

Because failure became measurable.

And measurable failure can be improved.

Retry Loops: The Core of Convergence Systems

Retries were not random.

They were guided by failure signals.

When validation failed,
the next prompt included explicit correction feedback:

Previous output failed because:

- missing [FILE] block
- invalid completion marker
- required file was not created

Fix only these issues.
Do not rewrite unrelated sections.

This made retries behave like

gradient-free optimization

The model did not need gradients.

It only needed clear failure signals.

That was enough to create convergence.

This is why production LLM systems require retry architecture.

Not just better prompts.

Why Harness Engineering Matters More Than the Agent

Many people ask:

Why not just use Claude Code directly?

The answer is simple:

Interactive agents are excellent for humans.

But automation requires determinism.

Claude Code works well
when a human can intervene.

But in a fully automated workflow:

there is no human in the loop.

The system must decide:

did this step succeed?
should we retry?
what exactly failed?
is it safe to continue?

That requires a harness.

Not just a better agent.

The agent generates.

The harness controls.

That distinction is critical.

Why Reliable LLM Systems Depend on Architecture, Not Model Size

Many people believe:

better models create better systems.

In reality, it is often the opposite.

Better systems make even average models reliable.

Even small local models
can become powerful
inside the right architecture.

On the other hand,

even the strongest frontier models
become unstable

when everything depends on one prompt
without validation.

Reliability is not a model capability.

It is:

a system property

This is one of the most important lessons in AI system design.

Final Takeaway

The real question is not:

How do we build smarter AI?

The real question is:

How do we build systems that do not collapse when failure happens?

The core of production AI systems
is not answer generation.

It is:

failure management

That is where real engineering begins.

And from that moment,

we stop being Prompt Engineers.

We become System Designers.

FAQ

Why do LLM systems fail in production?

Because LLM outputs are probabilistic, not deterministic.

Even with the same prompt, results can vary.
Without validation and recovery mechanisms, a single failure can break the entire system.

Is prompt engineering enough for production AI systems?

No.

Prompt engineering improves output quality,
but it does not guarantee reliability.

Production systems require validation layers, retry loops, and convergence mechanisms.

What is Harness Engineering?

Harness Engineering is the system design layer
that controls LLM behavior.

It includes:

validation
retries
structured output contracts
failure detection
convergence architecture

It is what makes LLM systems reliable.

Why Prompt Engineering Alone Fails in LLM Systems (And How to Fix It with Convergence)

yeTi — Mon, 13 Apr 2026 16:53:14 +0900

Lessons learned from building a real-world LLM coding agent with local models

TL;DR

LLMs are non-deterministic → same input, different outputs
Pipeline architectures amplify failure probabilities
Prompt engineering improves outputs but cannot guarantee reliability
The real solution is not better prompts, but convergence systems

1. Problem — You Can’t Even Get Stable Outputs

I wanted to build a local LLM-powered coding assistant.

So I set up:

Mac Studio
Ollama
Claude Code CLI
qwen3.5

Then I tried the simplest possible task:

Build a simple API

But the results were unstable:

Sometimes no output at all
Sometimes excessive file exploration (over-exploration)
Sometimes the task never completed

The problem wasn’t correctness.

The problem was that I couldn’t reliably get results at all.

2. Observation — Small Tasks Work

After multiple attempts, I noticed a pattern:

Local LLMs perform much better on small, well-defined tasks.

For example:

Implementing a single function
Fixing a specific bug
Tasks with clear input/output

This led to an important insight:

“Break the problem down into smaller pieces.”

3. Approach — Role Decomposition

Instead of one large prompt, I split the task into stages:

[Analyze] → [Design] → [Implement]

Each step:

Has a narrow scope
Produces structured output
Can be validated

This significantly improved success rates (in manual runs).

4. Scaling Up — Pipeline Automation

Naturally, the next step was:

“Let’s automate this workflow.”

So I built a pipeline:

User Input
   ↓
[Analyze] → [Design] → [Implement]
   ↓
 Final Output

5. Problem — The Pipeline Breaks Easily

After automation, new issues appeared:

Sometimes it works
Sometimes it completely fails

The key issue:

A single failure breaks the entire pipeline.

6. Why Pipelines Fail

6.1 LLMs Are Non-Deterministic

Unlike traditional systems:

Traditional software: same input → same output
LLMs: same input → probabilistic output

6.2 Probability Compounding

If step $i$ succeeds with probability $p_i$:

$$P_{total} = p_1 \times p_2 \times p_3$$

As the number of steps increases, total success probability drops rapidly.
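
To make the compounding concrete, here is a small calculation. The value 0.9 is an illustrative number, not something I measured. Both functions assume that steps and retry attempts fail independently, which real LLM calls only approximate.

```python
from math import prod
from typing import Sequence


def pipeline_success(step_probs: Sequence[float]) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return prod(step_probs)


def step_success_with_retries(p: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p) ** attempts


three_steps = pipeline_success([0.9, 0.9, 0.9])   # about 0.729
five_steps = pipeline_success([0.9] * 5)          # about 0.590
retried_step = step_success_with_retries(0.9, 3)  # about 0.999
```

Under these assumptions, a three-step pipeline at 0.9 per step succeeds roughly 73% of the time, while the same pipeline with up to three validated attempts per step stays above 99%. That gap is the argument for retries at the step level.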

6.3 Manual vs Automated Execution

| Aspect | Manual | Automated |
| --- | --- | --- |
| Human intervention | Yes | No |
| Error recovery | Possible | None |
| Progress condition | Partial success | Full success |

Pipelines require every step to succeed every time.

7. The Real Problem

Initially, I thought:

“We need better prompts.”

But the real issue was:

“How do we handle failures?”

This is not a prompt problem.

It is a system design problem.

8. Solution — Convergence System

Instead of a linear pipeline, I redesigned the system as a convergence loop.

         LLM Call
             ↓
        Validation
        /        \
     OK           FAIL
     ↓            ↓
  Accept        Retry

9. Implementation — Retry + Validation

9.1 Retry Loop

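def run_with_retry(task_fn, validate_fn, max_retry=3):
    """Call task_fn until validate_fn accepts the result or attempts run out."""
    for attempt in range(1, max_retry + 1):
        result = task_fn()

        if validate_fn(result):
            return result  # converged: the output passed validation

    # Do not return unvalidated output as if it had succeeded.
    raise RuntimeError(f"validation failed after {max_retry} attempts")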

9.2 Validation Example

def validate_code(result):
    if "```" not in result:
        return False
    if "TODO" in result:
        return False
    return True

9.3 Step Isolation

# Names chosen so they do not shadow the builtin `input` or the `design` function.
analysis = analyze(user_input)
design_doc = design(analysis)
code = implement(design_doc)

Each step is independently validated and recoverable.
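
A sketch of what "independently validated and recoverable" could look like in code follows. `run_step` and `run_pipeline` are hypothetical names, and the stage functions and validators are passed in as parameters because their real implementations are not shown in this post.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def run_step(name: str, task_fn: Callable[[], T], validate_fn: Callable[[T], bool],
             max_retry: int = 3) -> T:
    """Run one stage in isolation: retry it alone until its own validator passes."""
    for _ in range(max_retry):
        result = task_fn()
        if validate_fn(result):
            return result
    raise RuntimeError(f"step '{name}' failed validation after {max_retry} attempts")


def run_pipeline(user_input: str, analyze, design, implement,
                 validators: Dict[str, Callable]) -> str:
    """Chain the three stages; a failure is retried only in the stage that caused it."""
    analysis = run_step("analyze", lambda: analyze(user_input), validators["analyze"])
    design_doc = run_step("design", lambda: design(analysis), validators["design"])
    return run_step("implement", lambda: implement(design_doc), validators["implement"])
```

If the design stage produces an invalid result, only that stage is rerun. The analysis that already passed validation is kept.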

10. Results

After introducing convergence mechanisms:

Reduced over-exploration
Fewer pipeline failures
More consistent outputs

The most important change:

The system started working by design, not by luck.

11. Final Takeaway

Prompt engineering matters.

But it is not enough for automation.

LLM systems are not about generating correct answers.

They are about controlling incorrect ones.

Which AI Tools Should You Use? Cost-Centered Selection Criteria for an Independent Developer in 2026

yeTi — Thu, 12 Feb 2026 11:13:29 +0900

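Hello, this is yeTi.

Today I want to share the AI development environment I actually run and my strategy for operating AI coding tools.

This post is not a simple list of AI tool recommendations.
It is a record of the criteria an independent developer uses to choose tools when paying for AI development out of pocket.

I pay for every AI tool myself.

So this post is written from the perspective of using AI coding tools while keeping costs under control, not from a productivity perspective.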

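Problem: AI development costs grow faster than you expect

Using AI coding heavily definitely raises productivity.
But AI token costs and subscription costs rise quickly along with it.

In November and December 2025, as my Cursor IDE usage grew, my monthly cost climbed to 150,000-200,000 KRW.

Aiming for vibe coding raised my development productivity, but the cost came straight out of my own working budget.

What I realized then was simple.

AI is a productivity tool, but it is also infrastructure that costs money.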

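Strategy change: distribution, not consolidation

Instead of consolidating on a single AI tool, I decided to distribute my AI development workflow across tools by role.

My current AI development environment looks like this.

[Abstract design]
ChatGPT
   ↓ generate plan.md

[Code-based design refinement]
Codex
   ↓ refine plan.md

[Backend development / AI refactoring]
Antigravity

[Frontend development / infrastructure operations]
Cursor IDE

[Code review / verification]
Gemini CLI

The core of the approach is spreading work across AI tools and separating their roles.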

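Usage strategy by tool (an AI coding tool comparison)

ChatGPT: AI design tool

Defining the service plan
Architecture design
Writing agent specs
Generating the plan.md document

I use ChatGPT as an AI design tool rather than an AI code generation tool.
Rather than generating large amounts of code, I focus on organizing development specs.

Codex: AI tool for code-level refinement

Code-based design refinement
Small feature implementation
Bug fixes
Drafting GitLab issues
Drafting MRs

Codex plays a role closer to automating the AI development flow than to large-scale code generation.

Recently I tried handing all of the following tasks to Codex:

Refactoring
Problem analysis
Writing GitLab issues
Writing development plans
Creating MRs
Writing responses to MR reviews

This confirmed for me that AI can be used as a tool for automating the development process, not just for generating code.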

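Antigravity: dedicated to AI refactoring

Backend structure improvements
Large-scale refactoring
Test-based fixes

After Antigravity succeeded at a refactoring that Cursor IDE had failed at, I have been handling AI refactoring separately.

It produced stable results even on the free plan.

Cursor IDE: AI frontend development tool

Frontend implementation
Deployment configuration
Infrastructure work

I currently use it as a frontend-focused implementation tool rather than as my main AI coding tool.

Gemini CLI: AI code review tool

Code review
Verification based on static files

I use it as a low-cost AI code review tool, not as an executor.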

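Claude Code comparison: why am I not using it?

Claude Code is known as a powerful AI coding tool.

But my reasons for not adopting it yet are clear.

It has no free plan, so I could not try it out casually,
and within my AI development cost structure it was hard to commit to an immediate switch.

My criterion for choosing AI tools is not raw performance but a cost structure I can afford and the ability to experiment.

This is not a question of which tool is better. From a tool comparison standpoint, it is a question of fit with my current operating strategy.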

AI development cost structure in 2026

Tool	Cost
ChatGPT	$20 / month
Cursor IDE	$20 / month
Codex	Included in the ChatGPT plan
Antigravity	Free
Gemini CLI	Free
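
As a rough sketch, the role split and the price table can be written down as plain data to check the fixed monthly spend. The dictionary and function names here are hypothetical, and the figures cover only the flat subscription prices from the table, not any usage-based charges.

```python
# Monthly subscription prices in USD, copied from the table above.
MONTHLY_COST_USD = {
    "ChatGPT": 20,
    "Cursor IDE": 20,
    "Codex": 0,        # included in the ChatGPT plan
    "Antigravity": 0,  # free plan
    "Gemini CLI": 0,   # free plan
}

# Role-to-tool assignment from the workflow described earlier in the post.
ROLE_TO_TOOL = {
    "abstract design": "ChatGPT",
    "code-based design refinement": "Codex",
    "backend development / AI refactoring": "Antigravity",
    "frontend development / infrastructure": "Cursor IDE",
    "code review / verification": "Gemini CLI",
}

def monthly_fixed_cost(costs):
    """Sum the flat monthly subscription prices."""
    return sum(costs.values())

total = monthly_fixed_cost(MONTHLY_COST_USD)
paid_tools = sorted(name for name, usd in MONTHLY_COST_USD.items() if usd > 0)

print(f"Fixed monthly cost: ${total}")  # → Fixed monthly cost: $40
print(f"Paid tools: {paid_tools}")      # → Paid tools: ['ChatGPT', 'Cursor IDE']
```

Under these numbers, five roles are covered by two paid subscriptions.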

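My AI tool operating strategy as an independent developer is as follows.

Make active use of free plans
Separate out tasks that burn many AI tokens
Minimize the cost of retries
Standardize design documents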

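Whether a company pays

The right strategy can differ depending on whether your company pays for AI tools.

If the company covers the cost, the following strategies become possible.

Use higher-tier plans
Operate around a single integrated platform
Pursue a productivity-maximizing strategy

But when you pay for AI tools yourself, the situation changes.

Failure = direct cost

So I chose the following.

Distribution, not consolidation
Cost stability over convenience
Role separation instead of always using high-performance models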

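Conclusion: what matters more than AI tool recommendations

AI tool recommendations depend on the situation.

But the criterion for an independent developer is clear.

The criterion is not the performance of an AI tool but a cost structure I can afford.

I do not use AI as an “unlimited productivity tool.”

I treat it as infrastructure that has a cost.

I am still experimenting.
But I do not run experiments without controlling their cost.

That was a record of an independent developer's AI tool operating strategy as of 2026.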