<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>잡동사니</title>
    <link>https://yeti.tistory.com/</link>
    <description>A space for keeping development-related logs.</description>
    <language>ko</language>
    <pubDate>Mon, 8 Jun 2026 08:06:47 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>yeTi</managingEditor>
    <image>
      <title>잡동사니</title>
      <url>https://tistory1.daumcdn.net/tistory/1942493/attach/3770a3fc4050446481ab2cdbe30af004</url>
      <link>https://yeti.tistory.com</link>
    </image>
    <item>
      <title>Why Should a Startup Build Belief Before Its Product?</title>
      <link>https://yeti.tistory.com/420</link>
      <description>&lt;p&gt;Hello, this is yeTi.&lt;br&gt;Today I would like to share some thoughts I have held for a long time about what a startup means.&lt;/p&gt;
&lt;p&gt;I am fond of a passage from Yuval Noah Harari's &lt;a href=&quot;https://product.kyobobook.co.kr/detail/S000000597165&quot;&gt;Sapiens&lt;/a&gt;.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;The ability to create an imagined reality out of words enabled large numbers of strangers to cooperate effectively. - p.60, Sapiens&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;I like it because I read it as the driving force that lets Homo sapiens build and sustain large-scale societies, and as a trait shared by the members of those societies.&lt;/p&gt;
&lt;p&gt;Recently I have been reading Harari's &lt;a href=&quot;https://product.kyobobook.co.kr/detail/S000214465905&quot;&gt;Nexus&lt;/a&gt; with a book club.&lt;/p&gt;
&lt;p&gt;That reminded me of a moment, some time ago, when I defined a startup in my own terms.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;&lt;strong&gt;A startup is people who create belief.&lt;/strong&gt;&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;It is the simplest way I can put what I have been thinking all this time.&lt;/p&gt;
&lt;p&gt;A startup is not simply an organization that makes products.&lt;br&gt;A startup is an organization that changes the standards by which people see the world.&lt;/p&gt;
&lt;p&gt;I believe a startup that has created belief in people can create a new market of its own.&lt;/p&gt;
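&lt;h2&gt;The Essence of Market Change Is a Change in Human Belief&lt;/h2&gt;
&lt;p&gt;Market change looks like technological change.&lt;br&gt;New apps appear, new logistics systems emerge, and new payment methods arrive.&lt;/p&gt;
&lt;p&gt;But for that change to truly become a market, people's beliefs have to change.&lt;/p&gt;
&lt;p&gt;Consider the time before and after Coupang appeared.&lt;/p&gt;
&lt;p&gt;Online shopping existed before Coupang.&lt;br&gt;So did delivery.&lt;br&gt;You could order something and receive it at home.&lt;/p&gt;
&lt;p&gt;But what Coupang changed was not merely delivery speed.&lt;br&gt;Coupang changed what Koreans believe about delivery.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;It means customers have come to take it for granted that what they order today arrives tomorrow. Guaranteed next-day delivery has thus become the starting line for competition. - p.119, 물류트랜드 2024&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The moment this belief took hold, next-day delivery stopped being a special perk and became the starting line for competition.&lt;/p&gt;
&lt;p&gt;I think this is the essence of market change.&lt;/p&gt;
&lt;p&gt;Markets do not change only because the goods change.&lt;br&gt;They change when the standards people take for granted change.&lt;/p&gt;
&lt;p&gt;Waiting a few days used to feel natural.&lt;br&gt;Now anything that does not arrive tomorrow feels slow.&lt;/p&gt;
&lt;p&gt;The belief has changed.&lt;/p&gt;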
&lt;h2&gt;Startups Create New Standards&lt;/h2&gt;
&lt;p&gt;A startup builds features, but more fundamentally it builds a new standard.&lt;/p&gt;
&lt;p&gt;“Delivery should be fast.”&lt;br&gt;“Fresh seafood can be enjoyed at home too.”&lt;br&gt;“Cutting out middlemen makes fairer prices possible.”&lt;br&gt;“Personal data can be an asset too.”&lt;br&gt;“Learning a foreign language can start with conversation, not workbooks.”&lt;br&gt;“Emotional recovery can start with small practices, not grand therapy.”&lt;/p&gt;
&lt;p&gt;These sentences are not mere slogans.&lt;br&gt;They are seeds of belief that lead people to see the world differently.&lt;/p&gt;
&lt;p&gt;For example, when I look at a service like 파도상자, I find myself seeing the belief before the features.&lt;/p&gt;
&lt;p&gt;The belief 파도상자 is building is not simply “we deliver seafood.”&lt;br&gt;It is closer to something like this.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“The freshness you could once only taste on a trip can be enjoyed at home.”&lt;br&gt;“You can buy at a fair price without getting ripped off at an offline market.”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;What matters here is not freshness or price in themselves.&lt;br&gt;It is what the customer comes to believe through the service.&lt;/p&gt;
&lt;p&gt;In the end, a startup is an organization that says this to its customers.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“The standards you have taken for granted can change.”&lt;/p&gt;
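&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;Belief Becomes an Economic Sphere&lt;/h2&gt;
&lt;p&gt;This was also the point that struck me while reading Katsuaki Sato's Money 2.0.&lt;/p&gt;
&lt;p&gt;He says the boundaries between economics and politics, and between economics and religion, may blur.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;I said earlier that the boundary between economics and politics will disappear; likewise, the boundary between economics and religion will disappear. - p.258, Money 2.0&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;At first this may sound somewhat exaggerated.&lt;br&gt;But on reflection, religion, the economy, and the state all stand on shared belief.&lt;/p&gt;
&lt;p&gt;Money does not work because the paper itself has value.&lt;br&gt;It works because people believe the paper has value.&lt;/p&gt;
&lt;p&gt;The same is true of token economies.&lt;br&gt;When people gather around a particular service, tokens are issued within it, and participants begin to believe in the value of those tokens, a small economic sphere forms.&lt;/p&gt;
&lt;p&gt;This view closely resembles a startup.&lt;/p&gt;
&lt;p&gt;A startup does not enter a huge market from day one.&lt;br&gt;It first builds a small community.&lt;br&gt;It gathers people who feel the same problem, believe in the same possibility, and want to take part in the same future.&lt;/p&gt;
&lt;p&gt;I think this is the essence of the advice to build a community first, then a fandom, and only then expand the service.&lt;/p&gt;
&lt;p&gt;A community is not a marketing channel.&lt;br&gt;A community is where belief grows.&lt;/p&gt;
&lt;p&gt;And when the belief grows strong enough, it becomes an economic sphere.&lt;/p&gt;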
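&lt;h2&gt;A Founder Who Tries to Become Someone Else Is Finished&lt;/h2&gt;
&lt;p&gt;There is another line in Money 2.0 that stayed with me.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;A founder who tries to become someone else is finished - p.239, Money 2.0&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This line also connects to how startups behave.&lt;/p&gt;
&lt;p&gt;The essence of a startup lies in solving real problems in ways that did not exist before.&lt;br&gt;But once a founder starts imitating someone else, the startup loses its own belief.&lt;/p&gt;
&lt;p&gt;Of course, you can study other companies' cases.&lt;br&gt;You can hire great people.&lt;br&gt;You can learn proven operating practices and growth strategies.&lt;/p&gt;
&lt;p&gt;But it becomes dangerous when those experiences begin to replace the founder's own belief.&lt;/p&gt;
&lt;p&gt;A startup is not an organization that copies the right answer.&lt;br&gt;It is an organization that proves its own hypothesis in the real world.&lt;/p&gt;
&lt;p&gt;An approach that worked for another company may work for us too.&lt;br&gt;But if we cannot explain why we need it, it is not strategy; it is imitation.&lt;/p&gt;
&lt;p&gt;A startup loses its vitality the moment it follows a belief someone else created.&lt;br&gt;It has to create its own.&lt;/p&gt;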
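&lt;h2&gt;A Good Culture and a Comfortable Culture Are Different&lt;/h2&gt;
&lt;p&gt;This belief does not apply only to products.&lt;br&gt;It applies to organizational culture as well.&lt;/p&gt;
&lt;p&gt;I once read a piece that drew a memorable distinction between a good culture and a comfortable culture.&lt;/p&gt;
&lt;p&gt;From an employee's point of view, a good culture is often understood as a comfortable one:&lt;br&gt;autonomous, low-pressure, low-conflict, and respectful of personal life.&lt;/p&gt;
&lt;p&gt;These things matter, of course.&lt;/p&gt;
&lt;p&gt;But from a startup's point of view, culture has to be seen a little differently.&lt;br&gt;A startup's culture is not simply a device for making its members comfortable.&lt;br&gt;It is the way a shared belief is sustained and turned into results.&lt;/p&gt;
&lt;p&gt;So a good culture is not necessarily the same as a comfortable one.&lt;/p&gt;
&lt;p&gt;In a startup, a good culture looks like this.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A culture that keeps us from forgetting why we are solving this problem&lt;/li&gt;
&lt;li&gt;A culture that makes us face uncomfortable realities&lt;/li&gt;
&lt;li&gt;A culture in which beliefs can be revised in light of customer reactions&lt;/li&gt;
&lt;li&gt;A culture that helps us endure the tension that growth requires&lt;/li&gt;
&lt;li&gt;A culture that can put shared direction ahead of each other's comfort&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A comfortable culture protects present feelings.&lt;br&gt;A good culture moves the shared belief forward.&lt;/p&gt;
&lt;p&gt;A startup is an organization holding on to a belief that has not yet been proven.&lt;br&gt;That is why culture is less a benefit and more the operating system of belief.&lt;/p&gt;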
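&lt;h2&gt;A Startup Is the Work of Replacing Shared Illusions&lt;/h2&gt;
&lt;p&gt;Katsuaki Sato describes “changing the world” as the act of destroying an old shared illusion and overwriting it with a new one.&lt;/p&gt;
&lt;p&gt;I felt this sentence explains the essence of a startup well.&lt;/p&gt;
&lt;p&gt;Society already has shared illusions that have hardened in place.&lt;/p&gt;
&lt;p&gt;“It is natural for delivery to take several days.”&lt;br&gt;“Fresh food has to be seen in person before buying.”&lt;br&gt;“A company is a place where you have to come to the office to work.”&lt;br&gt;“You cannot do finance without a bank.”&lt;br&gt;“Education happens in classrooms.”&lt;br&gt;“AI is a tool that assists with tasks people assign.”&lt;/p&gt;
&lt;p&gt;Startups question these assumptions.&lt;/p&gt;
&lt;p&gt;“Does it really have to be that way?”&lt;br&gt;“Is no other way possible?”&lt;br&gt;“Is there a new standard people could come to believe in?”&lt;/p&gt;
&lt;p&gt;And then they propose a new belief.&lt;/p&gt;
&lt;p&gt;When that belief grows strong enough, people begin to feel that the old standard is outdated.&lt;br&gt;At that moment the market changes.&lt;/p&gt;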
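&lt;h2&gt;Monopoly Begins with a New Belief&lt;/h2&gt;
&lt;p&gt;Peter Thiel says this in Zero to One.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Monopolies drive progress because the promise of years or even decades of monopoly profits provides a powerful incentive to innovate. Then monopolies can keep innovating because profits enable them to make the long-term plans and to finance the ambitious research projects that firms locked in competition can't dream of. - Zero to One by Peter Thiel&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;At first this passage may sound somewhat uncomfortable.&lt;br&gt;We usually take monopoly to be a negative word.&lt;/p&gt;
&lt;p&gt;But the monopoly Peter Thiel describes is not simply one that suppresses competitors.&lt;br&gt;It is closer to a state of creating a market others did not see and delivering overwhelming value within it.&lt;/p&gt;
&lt;p&gt;Creating a blue ocean ultimately means something similar.&lt;/p&gt;
&lt;p&gt;Creating a new market does not only mean finding a new customer segment.&lt;br&gt;It means creating a new belief.&lt;/p&gt;
&lt;p&gt;Making people feel they need what they never thought they needed.&lt;br&gt;Making them believe possible what they once thought impossible.&lt;br&gt;Turning what was once considered special into something taken for granted.&lt;/p&gt;
&lt;p&gt;This is the starting point of the monopoly a startup creates.&lt;/p&gt;
&lt;p&gt;Monopoly does not begin with market share.&lt;br&gt;It begins with share of belief.&lt;/p&gt;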
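&lt;h2&gt;So What Is a Startup?&lt;/h2&gt;
&lt;p&gt;I would now like to define a startup this way.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;&lt;strong&gt;A startup is a gathering of people who build a new belief together.&lt;/strong&gt;&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The word “new” matters here, but the word “together” matters more.&lt;/p&gt;
&lt;p&gt;A belief can be created alone.&lt;br&gt;A philosopher can create a belief alone, and a writer can create a fictional world alone.&lt;/p&gt;
&lt;p&gt;But a startup is not a matter of believing alone.&lt;/p&gt;
&lt;p&gt;The founder believes first.&lt;br&gt;Team members join that belief.&lt;br&gt;Investors stake resources on the chance that the belief will grow.&lt;br&gt;Early users see potential in a product that is not yet finished.&lt;br&gt;Customers judge that the belief can solve their problem.&lt;/p&gt;
&lt;p&gt;In this way belief moves beyond one person's thinking and becomes a set of relationships.&lt;br&gt;And as relationships accumulate, they become a market.&lt;/p&gt;
&lt;p&gt;The work of a startup is not building features.&lt;br&gt;Features are a means of making belief concrete.&lt;/p&gt;
&lt;p&gt;Marketing is the work of translating belief into language.&lt;br&gt;Sales is the work of connecting belief to the customer's problem.&lt;br&gt;Product development is the work of proving belief through real experience.&lt;br&gt;Organizational culture is the work of keeping belief from wavering.&lt;br&gt;PMF is the signal that the belief has begun to work in the market, not only in the founder's mind.&lt;/p&gt;
&lt;p&gt;So a startup is an attempt to build a new nexus.&lt;/p&gt;
&lt;p&gt;It starts with one person's belief.&lt;br&gt;Then a few people hold on to that belief together.&lt;br&gt;Then a small number of users take part in the possibility.&lt;br&gt;And at some point, more people begin to accept it as an obvious reality.&lt;/p&gt;
&lt;p&gt;At that point a startup goes beyond being a mere product.&lt;br&gt;It becomes a new system of meaning.&lt;br&gt;It becomes a new network of belief that people take part in together.&lt;/p&gt;
&lt;p&gt;I think this is what a startup is.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;&lt;strong&gt;A startup is a gathering of people who build a new belief together.&lt;/strong&gt;&lt;br&gt;And creating a market means changing the standards people take for granted.&lt;/p&gt;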
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;</description>
      <category>About me</category>
      <category>pmf</category>
      <category>고객 믿음</category>
      <category>넥서스</category>
      <category>머니 2.0</category>
      <category>사피엔스</category>
      <category>스타트업</category>
      <category>시장 만들기</category>
      <category>제로 투 원</category>
      <category>조직문화</category>
      <category>창업</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/420</guid>
      <comments>https://yeti.tistory.com/420#entry420comment</comments>
      <pubDate>Fri, 5 Jun 2026 15:42:44 +0900</pubDate>
    </item>
    <item>
      <title>How Building a Multi-Agent Development Pipeline Led Me to Design an AI Engineering Organization</title>
      <link>https://yeti.tistory.com/419</link>
      <description>&lt;h3&gt;From AI Coding Agents to AI Engineering Organizations&lt;/h3&gt;
&lt;h2&gt;Introduction&lt;/h2&gt;
&lt;p&gt;When I started the sqlgen-ai project, my goal was straightforward.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;I wanted to build an AI agent that could write code.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;At the time, I imagined a future where an AI developer could continuously improve itself, create tools it needed, and gradually become more capable over time.&lt;/p&gt;
&lt;p&gt;To explore that idea, I connected Hermes Agent, Codex, GitLab, and Discord into a development workflow and started building what I thought would become a self-improving coding agent.&lt;/p&gt;
&lt;p&gt;What I learned was very different from what I expected.&lt;/p&gt;
&lt;p&gt;The biggest challenge was not getting AI to write code.&lt;/p&gt;
&lt;p&gt;The bigger challenge was operating AI reliably within a software development process.&lt;/p&gt;
&lt;h2&gt;AI Models Are Already Good at Writing Code&lt;/h2&gt;
&lt;p&gt;The first surprise was that coding itself was rarely the bottleneck.&lt;/p&gt;
&lt;p&gt;Modern coding models such as Codex, Claude Code, and Gemini CLI can already perform a large portion of day-to-day development work.&lt;/p&gt;
&lt;p&gt;They can:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implement features&lt;/li&gt;
&lt;li&gt;Fix bugs&lt;/li&gt;
&lt;li&gt;Refactor existing code&lt;/li&gt;
&lt;li&gt;Write tests&lt;/li&gt;
&lt;li&gt;Create Merge Requests&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When I started the project, I assumed model capability would be the primary limitation.&lt;/p&gt;
&lt;p&gt;Instead, most failures came from workflow and coordination problems.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An agent implemented a feature against an outdated branch.&lt;/li&gt;
&lt;li&gt;A Merge Request was created without local validation.&lt;/li&gt;
&lt;li&gt;Acceptance Criteria were partially implemented.&lt;/li&gt;
&lt;li&gt;Review feedback introduced regressions.&lt;/li&gt;
&lt;li&gt;A dirty workspace caused unrelated files to be committed.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These were not coding failures.&lt;/p&gt;
&lt;p&gt;The code itself was often reasonable.&lt;/p&gt;
&lt;p&gt;The failures occurred because development work was being executed without the operational controls that exist in human engineering teams.&lt;/p&gt;
&lt;p&gt;As the models became stronger, a different bottleneck emerged.&lt;/p&gt;
&lt;p&gt;Generating code became easier.&lt;/p&gt;
&lt;p&gt;Coordinating development work became harder.&lt;/p&gt;
&lt;h2&gt;Software Development Is Mostly State Transitions&lt;/h2&gt;
&lt;p&gt;One realization significantly changed how I approached the system.&lt;/p&gt;
&lt;p&gt;Software development organizations are more structured than they initially appear.&lt;/p&gt;
&lt;p&gt;Most engineering work follows a sequence of state transitions.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Issue
→ Planning
→ Approval
→ Development
→ Validation
→ Review
→ Merge
→ E2E
→ Release&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Once viewed from this perspective, software development starts looking less like individual coding tasks and more like a state machine.&lt;/p&gt;
&lt;p&gt;That realization changed how I used GitLab.&lt;/p&gt;
&lt;p&gt;Initially, GitLab was simply a repository.&lt;/p&gt;
&lt;p&gt;Over time, it became something much more important.&lt;/p&gt;
&lt;p&gt;GitLab became the coordination layer of the entire agent organization.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;issue:approved
dev:ready
dev:running
dev:done

mr:ready-for-review
mr:approved

e2e:ready
e2e:done&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;These labels are not just metadata.&lt;/p&gt;
&lt;p&gt;They represent organizational state.&lt;/p&gt;
&lt;p&gt;Agents consume and update those states as part of a larger workflow.&lt;/p&gt;
&lt;p&gt;In practice:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Issues became work queues.&lt;/li&gt;
&lt;li&gt;Labels became state transitions.&lt;/li&gt;
&lt;li&gt;Merge Requests became review checkpoints.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The agents never communicated directly with each other.&lt;/p&gt;
&lt;p&gt;They coordinated through GitLab.&lt;/p&gt;
&lt;p&gt;Eventually, GitLab evolved from a repository into the operating system of the agent organization.&lt;/p&gt;
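&lt;p&gt;As a rough illustration of labels acting as state transitions, the sketch below puts the labels listed earlier into a small Python transition table. The label names come from this post; the allowed edges between them and the &lt;code&gt;transition&lt;/code&gt; helper are assumptions made for the example, not the actual sqlgen-ai implementation.&lt;/p&gt;

```python
# Label names are taken from the post; the edges between them are assumed.
ALLOWED_TRANSITIONS = {
    "issue:approved": {"dev:ready"},
    "dev:ready": {"dev:running"},
    "dev:running": {"dev:done"},
    "dev:done": {"mr:ready-for-review"},
    "mr:ready-for-review": {"mr:approved"},
    "mr:approved": {"e2e:ready"},
    "e2e:ready": {"e2e:done"},
}


def transition(current: str, target: str) -> str:
    """Return the new state label, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target
```

&lt;p&gt;An agent could call a check like this before relabeling an issue, so that a skipped stage fails loudly instead of silently advancing.&lt;/p&gt;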
&lt;h2&gt;Specialized Roles Were More Reliable Than One Smart Agent&lt;/h2&gt;
&lt;p&gt;My original architecture assumed a single powerful agent would eventually handle everything.&lt;/p&gt;
&lt;p&gt;Planning.&lt;/p&gt;
&lt;p&gt;Implementation.&lt;/p&gt;
&lt;p&gt;Validation.&lt;/p&gt;
&lt;p&gt;Review.&lt;/p&gt;
&lt;p&gt;Deployment.&lt;/p&gt;
&lt;p&gt;In practice, reliability improved when responsibilities became narrower.&lt;/p&gt;
&lt;p&gt;The current sqlgen-ai pipeline looks much closer to an engineering team than a single autonomous agent.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;PM Agent
↓
Developer Agent
↓
Validator Agent
↓
Review Triage Agent
↓
Human&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each role has a specific responsibility.&lt;/p&gt;
&lt;p&gt;The PM Agent creates implementation plans.&lt;/p&gt;
&lt;p&gt;The Developer Agent focuses on execution.&lt;/p&gt;
&lt;p&gt;The Validator Agent independently verifies results.&lt;/p&gt;
&lt;p&gt;The Review Triage Agent processes review feedback.&lt;/p&gt;
&lt;p&gt;The Human provides goals and final approval.&lt;/p&gt;
&lt;p&gt;The separation is important because implementation and validation have fundamentally different objectives.&lt;/p&gt;
&lt;p&gt;The Developer Agent tries to make progress.&lt;/p&gt;
&lt;p&gt;The Validator Agent tries to find problems.&lt;/p&gt;
&lt;p&gt;Combining both responsibilities into a single agent often creates blind spots.&lt;/p&gt;
&lt;p&gt;Separating them makes failures easier to detect and easier to recover from.&lt;/p&gt;
&lt;h2&gt;Validation Was More Important Than Generation&lt;/h2&gt;
&lt;p&gt;The most important lesson from operating the system was simple.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Never trust implementation output without verification.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Large language models can generate convincing solutions that are still incorrect.&lt;/p&gt;
&lt;p&gt;Because of that, &amp;quot;Implementation Complete&amp;quot; is not evidence.&lt;/p&gt;
&lt;p&gt;Validation results are evidence.&lt;/p&gt;
&lt;p&gt;In sqlgen-ai, implementation does not immediately lead to a Merge Request.&lt;/p&gt;
&lt;p&gt;Every change passes through an independent validation layer.&lt;/p&gt;
&lt;p&gt;A simplified validation pipeline looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Developer Agent
↓
ruff
↓
mypy
↓
pytest
↓
Acceptance Criteria Validator&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Only validated work can move forward.&lt;/p&gt;
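&lt;p&gt;A minimal sketch of such a gate might look like the following. The tool names come from the pipeline above, but the exact command-line arguments, the &lt;code&gt;validate&lt;/code&gt; function, and the injectable &lt;code&gt;run&lt;/code&gt; parameter are assumptions for illustration. The Acceptance Criteria Validator is left out because this post does not describe how it works.&lt;/p&gt;

```python
import subprocess
from typing import Callable, Optional, Sequence, Tuple

# The post names the tools; these exact arguments are assumptions.
CHECKS = [
    ["ruff", "check", "."],
    ["mypy", "."],
    ["pytest", "-q"],
]


def run_command(cmd: Sequence[str]) -> bool:
    """Run one check and report whether it exited with status 0."""
    return subprocess.run(list(cmd), check=False).returncode == 0


def validate(
    checks: Sequence[Sequence[str]],
    run: Callable[[Sequence[str]], bool] = run_command,
) -> Tuple[bool, Optional[str]]:
    """Run checks in order and stop at the first one that fails."""
    for cmd in checks:
        if not run(cmd):
            return False, cmd[0]  # name of the failing tool
    return True, None
```

&lt;p&gt;Returning the name of the failing tool, rather than only a boolean, gives the next stage something concrete to act on.&lt;/p&gt;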
&lt;p&gt;However, validation itself was not the most interesting discovery.&lt;/p&gt;
&lt;p&gt;The more important discovery was what happens after validation fails.&lt;/p&gt;
&lt;h2&gt;Validation Is Not the Goal. Convergence Is.&lt;/h2&gt;
&lt;p&gt;Many agent workflows stop after validation.&lt;/p&gt;
&lt;p&gt;A test fails.&lt;/p&gt;
&lt;p&gt;A review check fails.&lt;/p&gt;
&lt;p&gt;The workflow ends.&lt;/p&gt;
&lt;p&gt;That approach is useful for reporting failures, but it does not help the system reach a successful outcome.&lt;/p&gt;
&lt;p&gt;In practice, I found that validation needed to become part of a convergence loop.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Implementation
→ Validation
→ NO-GO
→ Fix
→ Validation
→ GO&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When validation returns NO-GO, the implementation session is reused.&lt;/p&gt;
&lt;p&gt;The agent receives the validation results and attempts a targeted correction.&lt;/p&gt;
&lt;p&gt;The objective is not simply to detect problems.&lt;/p&gt;
&lt;p&gt;The objective is to reduce the distance between the current state and the desired state.&lt;/p&gt;
&lt;p&gt;This distinction became one of the most important architectural decisions in the system.&lt;/p&gt;
&lt;p&gt;The value of validation is not that it finds errors.&lt;/p&gt;
&lt;p&gt;The value of validation is that it guides convergence.&lt;/p&gt;
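&lt;p&gt;The loop can be sketched in a few lines of Python. Everything here is hypothetical: &lt;code&gt;implement&lt;/code&gt;, &lt;code&gt;validate&lt;/code&gt;, and &lt;code&gt;fix&lt;/code&gt; stand in for agent calls, and the &lt;code&gt;max_fixes&lt;/code&gt; budget is an assumption added so the loop always terminates.&lt;/p&gt;

```python
from typing import Callable, List, Tuple


def converge(
    implement: Callable[[], str],
    validate: Callable[[str], Tuple[bool, List[str]]],
    fix: Callable[[str, List[str]], str],
    max_fixes: int = 3,
) -> Tuple[str, str]:
    """Validate, feed findings back into a fix, and repeat until GO."""
    change = implement()
    fixes_used = 0
    while True:
        ok, findings = validate(change)
        if ok:
            return "GO", change
        if fixes_used >= max_fixes:
            return "NO-GO", change  # fix budget spent; the caller decides what next
        change = fix(change, findings)
        fixes_used += 1
```

&lt;p&gt;The point of the shape is that the validation findings are passed to &lt;code&gt;fix&lt;/code&gt;, so each round is a targeted correction and not a fresh attempt.&lt;/p&gt;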
&lt;h2&gt;The Current sqlgen-ai Pipeline&lt;/h2&gt;
&lt;p&gt;Today, sqlgen-ai operates through a multi-stage development pipeline.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;PM
→ Pick
→ Implementation
→ MR
→ Test
→ CI
→ E2E&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each stage is connected through GitLab state transitions and scheduled agents.&lt;/p&gt;
&lt;p&gt;Agents do not communicate directly.&lt;/p&gt;
&lt;p&gt;GitLab acts as the shared memory and coordination layer.&lt;/p&gt;
&lt;p&gt;The role of humans has also become much smaller than I originally expected.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Human
→ Goal
→ Approval&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The human defines direction.&lt;/p&gt;
&lt;p&gt;The agent organization executes the workflow.&lt;/p&gt;
&lt;h2&gt;What Comes Next&lt;/h2&gt;
&lt;p&gt;The system is still far from a fully autonomous engineering organization.&lt;/p&gt;
&lt;p&gt;Several challenges remain:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Parallel task execution&lt;/li&gt;
&lt;li&gt;Automatic Merge Request approval&lt;/li&gt;
&lt;li&gt;E2E auto-recovery&lt;/li&gt;
&lt;li&gt;Release Agents&lt;/li&gt;
&lt;li&gt;Deployment governance&lt;/li&gt;
&lt;li&gt;Cross-agent conflict resolution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But the direction has become much clearer.&lt;/p&gt;
&lt;p&gt;When I started the project, I thought I was building a coding agent.&lt;/p&gt;
&lt;p&gt;Today, I think differently.&lt;/p&gt;
&lt;p&gt;The real challenge is not creating an agent that can write code.&lt;/p&gt;
&lt;p&gt;The real challenge is building an organization that can reliably move software through an engineering lifecycle.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;The biggest shift in my thinking came from operating the system in practice.&lt;/p&gt;
&lt;p&gt;Initially, I believed model capability would be the defining factor.&lt;/p&gt;
&lt;p&gt;In reality, operational structure mattered far more.&lt;/p&gt;
&lt;p&gt;Reliability did not come from better prompts.&lt;/p&gt;
&lt;p&gt;It came from:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Role separation&lt;/li&gt;
&lt;li&gt;State transitions&lt;/li&gt;
&lt;li&gt;Independent validation&lt;/li&gt;
&lt;li&gt;Convergence loops&lt;/li&gt;
&lt;li&gt;Organizational coordination&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As models continue to improve, I expect this distinction to become even more important.&lt;/p&gt;
&lt;p&gt;The future may not belong to isolated coding assistants.&lt;/p&gt;
&lt;p&gt;It may belong to AI engineering organizations.&lt;/p&gt;
&lt;p&gt;And that was the unexpected lesson from building sqlgen-ai.&lt;/p&gt;</description>
      <category>IT/AI</category>
      <category>Agent Reliability</category>
      <category>ai agent</category>
      <category>AI Engineering</category>
      <category>ai workflow</category>
      <category>claude code</category>
      <category>codex</category>
      <category>GitLab Automation</category>
      <category>llm agent</category>
      <category>multi-agent systems</category>
      <category>Software Engineering</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/419</guid>
      <comments>https://yeti.tistory.com/419#entry419comment</comments>
      <pubDate>Mon, 1 Jun 2026 16:39:31 +0900</pubDate>
    </item>
    <item>
      <title>How I Turned GitLab into a Coordination Layer for Autonomous AI Development Agents</title>
      <link>https://yeti.tistory.com/418</link>
      <description>&lt;p&gt;&lt;em&gt;Lessons from building a multi-agent AI development workflow for a production project&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;p&gt;Building a reliable AI coding agent is one engineering problem.&lt;/p&gt;
&lt;p&gt;Building a reliable AI development workflow with multiple agents is another.&lt;/p&gt;
&lt;p&gt;A single agent mostly struggles with execution quality.&lt;/p&gt;
&lt;p&gt;Multiple agents introduce coordination problems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;task ownership&lt;/li&gt;
&lt;li&gt;shared state visibility&lt;/li&gt;
&lt;li&gt;race conditions&lt;/li&gt;
&lt;li&gt;workspace contamination&lt;/li&gt;
&lt;li&gt;lock recovery&lt;/li&gt;
&lt;li&gt;operational governance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;While building an autonomous development workflow for the sqlgen project, I learned that code generation was only one part of the problem.&lt;/p&gt;
&lt;p&gt;The dominant challenge was coordination.&lt;/p&gt;
&lt;p&gt;GitLab labels became the shared state machine that allowed independent agents to coordinate work safely. GitLab’s scoped labels are explicitly designed to support mutually exclusive workflow states, which makes them a practical coordination primitive for workflow orchestration. ([GitLab documentation][1])&lt;/p&gt;
&lt;h2&gt;The Goal&lt;/h2&gt;
&lt;p&gt;The original goal was straightforward.&lt;/p&gt;
&lt;p&gt;I wanted engineering work inside the sqlgen project to move through an AI-assisted delivery workflow with minimal manual execution.&lt;/p&gt;
&lt;p&gt;The target flow looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Issue discovered
→ planned
→ implemented
→ reviewed
→ tested
→ merged&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The initial assumption was simple:&lt;/p&gt;
&lt;p&gt;If the coding model is good enough, autonomous delivery becomes practical.&lt;/p&gt;
&lt;p&gt;That assumption turned out to be incomplete.&lt;/p&gt;
&lt;p&gt;Code generation solved only part of the problem.&lt;/p&gt;
&lt;p&gt;Once multiple agents became involved, coordination became the dominant engineering challenge.&lt;/p&gt;
&lt;h2&gt;This Was Not a Single-Agent Problem&lt;/h2&gt;
&lt;p&gt;I was not building a coding assistant.&lt;/p&gt;
&lt;p&gt;I was building a workflow where multiple agents had distinct responsibilities.&lt;/p&gt;
&lt;p&gt;A simplified structure:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Human PM
   ↓
PM Bot
   ↓
Review Bot
   ↓
Dev Bot
   ↓
QA Bot
   ↓
Human Approval&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each agent had a narrower role.&lt;/p&gt;
&lt;p&gt;That part was intentional.&lt;/p&gt;
&lt;p&gt;Specialized agents are easier to reason about than one general-purpose autonomous actor.&lt;/p&gt;
&lt;p&gt;But specialization creates a new requirement:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;shared operational context.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;A human team can rely on conversation, memory, and implicit understanding.&lt;/p&gt;
&lt;p&gt;Independent agents cannot.&lt;/p&gt;
&lt;p&gt;Task ownership, workflow progress, and execution state must be externally visible.&lt;/p&gt;
&lt;p&gt;That made coordination state an explicit architectural concern.&lt;/p&gt;
&lt;h2&gt;Why GitLab?&lt;/h2&gt;
&lt;p&gt;A natural question:&lt;/p&gt;
&lt;p&gt;Why use GitLab instead of building a dedicated orchestration service?&lt;/p&gt;
&lt;p&gt;The answer was practical.&lt;/p&gt;
&lt;p&gt;GitLab already provided several useful properties.&lt;/p&gt;
&lt;h3&gt;1. Existing Workflow Surface&lt;/h3&gt;
&lt;p&gt;The engineering workflow already lived in GitLab:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;issues&lt;/li&gt;
&lt;li&gt;merge requests&lt;/li&gt;
&lt;li&gt;labels&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That meant no additional operational UI was needed.&lt;/p&gt;
&lt;p&gt;Agents could integrate into the workflow engineers were already using.&lt;/p&gt;
&lt;h3&gt;2. Shared Visibility&lt;/h3&gt;
&lt;p&gt;Humans and agents could observe the same workflow state.&lt;/p&gt;
&lt;p&gt;This matters operationally.&lt;/p&gt;
&lt;p&gt;A coordination system that only agents understand becomes difficult to debug.&lt;/p&gt;
&lt;p&gt;GitLab gave immediate human inspectability.&lt;/p&gt;
&lt;p&gt;An engineer could look at an issue and immediately understand where work was stuck.&lt;/p&gt;
&lt;h3&gt;3. Simple Polling Model&lt;/h3&gt;
&lt;p&gt;The initial MVP used a cron-based automation model.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;find issues with workflow::dev-ready&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This approach was intentionally simple.&lt;/p&gt;
&lt;p&gt;No event bus.&lt;br&gt;No dedicated orchestration queue.&lt;br&gt;No new infrastructure.&lt;/p&gt;
&lt;p&gt;For an MVP, operational simplicity mattered more than architectural purity.&lt;/p&gt;
&lt;h3&gt;4. Explicit State Representation&lt;/h3&gt;
&lt;p&gt;Scoped labels gave a lightweight way to encode workflow lifecycle state.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;workflow::pm-ready
workflow::dev-running
workflow::review-ready&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Because labels within the same scope are mutually exclusive, an issue can hold only one workflow state at a time, and applying a new state replaces the old one. ([GitLab documentation][1])&lt;/p&gt;
&lt;p&gt;That significantly reduced coordination ambiguity.&lt;/p&gt;
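&lt;p&gt;GitLab enforces this rule on the server, so agents do not need to implement it. The Python sketch below only models the rule locally, which can help when unit-testing agent logic without a GitLab instance. The function names are hypothetical, and the sketch assumes a label's scope is the text before its last &lt;code&gt;::&lt;/code&gt;.&lt;/p&gt;

```python
from typing import Set


def scope_of(label: str) -> str:
    """Return the text before the last '::', or '' for an unscoped label."""
    scope, separator, _ = label.rpartition("::")
    return scope if separator else ""


def apply_label(labels: Set[str], new_label: str) -> Set[str]:
    """Add new_label, dropping any existing label that shares its scope."""
    scope = scope_of(new_label)
    kept = {
        label for label in labels
        if not scope or scope_of(label) != scope
    }
    return kept | {new_label}
```

&lt;p&gt;For example, applying &lt;code&gt;workflow::dev-running&lt;/code&gt; to an issue labeled &lt;code&gt;workflow::dev-ready&lt;/code&gt; and &lt;code&gt;bug&lt;/code&gt; leaves &lt;code&gt;workflow::dev-running&lt;/code&gt; and &lt;code&gt;bug&lt;/code&gt;.&lt;/p&gt;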
&lt;p&gt;The architectural tradeoff was intentional:&lt;/p&gt;
&lt;p&gt;Instead of introducing a separate orchestration system, I reused the existing engineering control plane.&lt;/p&gt;
&lt;h2&gt;GitLab as a Shared State Machine&lt;/h2&gt;
&lt;p&gt;The workflow state model looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;workflow::pm-ready
workflow::pm-running
workflow::dev-ready
workflow::dev-running
workflow::review-ready
workflow::qa-ready
workflow::done
workflow::failed&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Example lifecycle:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Issue created
→ workflow::pm-ready

PM Bot claims task
→ workflow::pm-running

Planning complete
→ workflow::dev-ready

Dev Bot claims task
→ workflow::dev-running

Implementation complete
→ workflow::review-ready&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This solved a critical coordination problem.&lt;/p&gt;
&lt;p&gt;Agents no longer depended on hidden internal context.&lt;/p&gt;
&lt;p&gt;Workflow state became:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;explicit&lt;/li&gt;
&lt;li&gt;queryable&lt;/li&gt;
&lt;li&gt;observable&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;GitLab was no longer just storing code.&lt;/p&gt;
&lt;p&gt;It was acting as the coordination layer for distributed autonomous workers.&lt;/p&gt;
&lt;h2&gt;First Working MVP&lt;/h2&gt;
&lt;p&gt;The initial MVP worked under normal execution conditions.&lt;/p&gt;
&lt;p&gt;The execution flow looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;1-minute cron poller
↓
Find issues labeled workflow::dev-ready
↓
Acquire workspace lock
↓
Mark issue workflow::dev-running
↓
Execute Codex implementation flow
↓
Create merge request
↓
Transition issue to workflow::review-ready&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This was enough to validate the architectural direction.&lt;/p&gt;
&lt;p&gt;But happy paths do not validate operational systems.&lt;/p&gt;
&lt;p&gt;Failure behavior does.&lt;/p&gt;
&lt;h2&gt;What Actually Broke&lt;/h2&gt;
&lt;p&gt;The dominant failures were operational coordination failures rather than model capability failures.&lt;/p&gt;
&lt;h3&gt;1. Double Pickup&lt;/h3&gt;
&lt;p&gt;Without explicit claiming, multiple agents can observe the same available task.&lt;/p&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Agent A sees workflow::dev-ready
Agent B sees workflow::dev-ready
Both begin execution&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Classic race condition.&lt;/p&gt;
&lt;p&gt;Humans resolve this socially.&lt;/p&gt;
&lt;p&gt;Distributed workers do not.&lt;/p&gt;
&lt;p&gt;The fix:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;explicit task claiming&lt;/li&gt;
&lt;li&gt;state transition before execution&lt;/li&gt;
&lt;li&gt;locking&lt;/li&gt;
&lt;/ul&gt;
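&lt;p&gt;A minimal sketch of claiming as a state transition before execution, assuming an in-memory board in place of the real issue tracker. &lt;code&gt;IssueBoard&lt;/code&gt;, &lt;code&gt;claim&lt;/code&gt;, and &lt;code&gt;race&lt;/code&gt; are hypothetical names. The lock here makes check-and-set atomic inside one process; a remote tracker may not offer that guarantee, which is one reason a separate lock is also on the list.&lt;/p&gt;

```python
import threading

READY = "workflow::dev-ready"
RUNNING = "workflow::dev-running"


class IssueBoard:
    """In-memory stand-in for the issue tracker (hypothetical, not the GitLab API)."""

    def __init__(self):
        self._labels = {}
        self._owners = {}
        self._guard = threading.Lock()

    def add_ready_issue(self, issue_id):
        self._labels[issue_id] = READY

    def claim(self, issue_id, agent):
        """Move READY to RUNNING atomically; return True only for the winner."""
        with self._guard:
            if self._labels.get(issue_id) != READY:
                return False  # someone else already claimed it
            self._labels[issue_id] = RUNNING
            self._owners[issue_id] = agent
            return True

    def owner(self, issue_id):
        return self._owners.get(issue_id)


def race(board, issue_id, agents):
    """Let several agents try to claim the same issue at once."""
    winners = []

    def attempt(agent):
        if board.claim(issue_id, agent):
            winners.append(agent)

    threads = [threading.Thread(target=attempt, args=(a,)) for a in agents]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners
```

&lt;p&gt;However many agents observe the same ready issue, only one &lt;code&gt;claim&lt;/code&gt; call returns true, and only that agent should begin execution.&lt;/p&gt;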
&lt;h3&gt;2. Dirty Workspace Contamination&lt;/h3&gt;
&lt;p&gt;A failed execution could leave behind:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;modified files&lt;/li&gt;
&lt;li&gt;temporary branches&lt;/li&gt;
&lt;li&gt;partial generated output&lt;/li&gt;
&lt;li&gt;broken local state&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The next execution inherited polluted state.&lt;/p&gt;
&lt;p&gt;This produced misleading failures.&lt;/p&gt;
&lt;p&gt;The issue was not reasoning quality.&lt;/p&gt;
&lt;p&gt;It was environment integrity.&lt;/p&gt;
&lt;p&gt;The fix:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;workspace isolation&lt;/li&gt;
&lt;li&gt;cleanup contracts&lt;/li&gt;
&lt;li&gt;pre-execution guards&lt;/li&gt;
&lt;/ul&gt;
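&lt;p&gt;One way to express these three ideas, sketched under the assumption that each run can work in a throwaway directory. &lt;code&gt;isolated_workspace&lt;/code&gt; and &lt;code&gt;run_task&lt;/code&gt; are hypothetical helpers; a real setup would more likely use a fresh clone or a git worktree per run.&lt;/p&gt;

```python
import contextlib
import pathlib
import shutil
import tempfile


@contextlib.contextmanager
def isolated_workspace(prefix="agent-run-"):
    """Yield a fresh directory and always delete it afterwards."""
    path = pathlib.Path(tempfile.mkdtemp(prefix=prefix))
    try:
        # Pre-execution guard: trivially true for a new temp dir,
        # but it matters as soon as a checkout is reused between runs.
        if any(path.iterdir()):
            raise RuntimeError(f"workspace not clean: {path}")
        yield path
    finally:
        # Cleanup contract: runs on success and on failure alike.
        shutil.rmtree(path, ignore_errors=True)


def run_task(task):
    """Run task(workspace) in isolation; return (ok, workspace_path)."""
    with isolated_workspace() as ws:
        try:
            task(ws)
            return True, ws
        except Exception:
            return False, ws
```

&lt;p&gt;A failed task can leave partial output inside its own workspace, but the directory is gone before the next run starts, so nothing is inherited.&lt;/p&gt;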
&lt;h3&gt;3. Cron Environment Drift&lt;/h3&gt;
&lt;p&gt;Manual execution succeeded.&lt;/p&gt;
&lt;p&gt;Automated execution failed.&lt;/p&gt;
&lt;p&gt;This is a classic operational issue.&lt;/p&gt;
&lt;p&gt;Cron environments differ from interactive shells.&lt;/p&gt;
&lt;p&gt;Common failures:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;PATH mismatch&lt;/li&gt;
&lt;li&gt;missing environment variables&lt;/li&gt;
&lt;li&gt;CLI auth assumptions&lt;/li&gt;
&lt;li&gt;host normalization issues&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In practice, this surfaced as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Codex working manually but failing in automation&lt;/li&gt;
&lt;li&gt;&lt;code&gt;glab&lt;/code&gt; targeting the wrong host&lt;/li&gt;
&lt;li&gt;executables missing during scheduled execution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are not glamorous problems.&lt;/p&gt;
&lt;p&gt;But production automation usually fails on operational details, not architecture diagrams.&lt;/p&gt;
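&lt;p&gt;A small preflight check can surface this class of failure before any work starts. The sketch below is an assumption about how such a check might look, not the setup used here; &lt;code&gt;preflight&lt;/code&gt; and its parameters are hypothetical, while &lt;code&gt;shutil.which&lt;/code&gt; is the standard library lookup.&lt;/p&gt;

```python
import os
import shutil


def preflight(required_bins, required_env, environ=None):
    """Return a list of problems; an empty list means the run may start."""
    env = os.environ if environ is None else environ
    problems = []
    search_path = env.get("PATH", "")
    for name in required_bins:
        # Resolve against the PATH this process actually has, not the login shell's.
        if shutil.which(name, path=search_path) is None:
            problems.append(f"executable not found on PATH: {name}")
    for var in required_env:
        if not env.get(var):
            problems.append(f"environment variable not set: {var}")
    return problems
```

&lt;p&gt;Run at the top of the scheduled job with the CLI names it depends on (for example &lt;code&gt;glab&lt;/code&gt;), this turns a silent cron failure into an explicit list of what is missing.&lt;/p&gt;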
&lt;h3&gt;4. Stale Locks&lt;/h3&gt;
&lt;p&gt;Locks prevent concurrent execution.&lt;/p&gt;
&lt;p&gt;But failed runs can leave stale locks behind.&lt;/p&gt;
&lt;p&gt;Result:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;lock exists
→ no new work claimed
→ workflow silently stalls&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Without recovery logic, the system appears healthy while doing nothing.&lt;/p&gt;
&lt;p&gt;The fix:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;lock TTL&lt;/li&gt;
&lt;li&gt;stale lock detection&lt;/li&gt;
&lt;li&gt;cleanup recovery&lt;/li&gt;
&lt;/ul&gt;
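&lt;p&gt;A minimal sketch of a lock with a TTL, assuming a lock file on a local filesystem and using the file's modification time as the claim timestamp. &lt;code&gt;acquire_lock&lt;/code&gt; and &lt;code&gt;release_lock&lt;/code&gt; are hypothetical helpers, and the TTL value is something each workflow would have to choose for itself.&lt;/p&gt;

```python
import os
import time


def acquire_lock(lock_path, ttl_seconds, now=None):
    """Try to take the lock; reclaim it first if it is older than the TTL."""
    now = time.time() if now is None else now
    for _ in range(2):  # the second pass runs only after a stale lock was removed
        try:
            # O_EXCL makes creation fail if the file already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            try:
                age = now - os.stat(lock_path).st_mtime
            except FileNotFoundError:
                continue  # released in the meantime; try again
            if age > ttl_seconds:
                # Stale: the holder probably crashed without releasing.
                try:
                    os.remove(lock_path)
                except FileNotFoundError:
                    pass
                continue
            return False  # lock is held and still fresh
        os.close(fd)
        return True
    return False


def release_lock(lock_path):
    """Remove the lock file; a missing file is not an error."""
    try:
        os.remove(lock_path)
    except FileNotFoundError:
        pass
```

&lt;p&gt;This sketch still has a narrow race: two workers that both judge a lock stale can interfere with each other during removal. It also assumes a healthy run always finishes within the TTL.&lt;/p&gt;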
&lt;h2&gt;Human-Governed Autonomy&lt;/h2&gt;
&lt;p&gt;A design correction emerged during implementation.&lt;/p&gt;
&lt;p&gt;Full autonomy is not the immediate objective.&lt;/p&gt;
&lt;p&gt;A more practical operational model is:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;human-governed autonomy&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Humans remain responsible for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;defining goals&lt;/li&gt;
&lt;li&gt;approving critical changes&lt;/li&gt;
&lt;li&gt;resolving ambiguity&lt;/li&gt;
&lt;li&gt;production governance&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agents handle:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;execution&lt;/li&gt;
&lt;li&gt;repetitive workflow progression&lt;/li&gt;
&lt;li&gt;structured implementation tasks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This boundary preserves automation benefits while reducing operational risk.&lt;/p&gt;
&lt;h2&gt;Key Engineering Lesson&lt;/h2&gt;
&lt;p&gt;Single-agent reliability asks:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;How do I make one agent execute correctly?&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Multi-agent workflow reliability asks:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;How do independent agents coordinate safely?&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;These are different engineering problems.&lt;/p&gt;
&lt;p&gt;The second problem looks much closer to distributed systems engineering than prompt engineering.&lt;/p&gt;
&lt;p&gt;The failure modes are the familiar ones from that field:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;shared state consistency&lt;/li&gt;
&lt;li&gt;ownership conflicts&lt;/li&gt;
&lt;li&gt;stale resources&lt;/li&gt;
&lt;li&gt;operational recovery&lt;/li&gt;
&lt;li&gt;workflow observability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Reliable agents are useful.&lt;/p&gt;
&lt;p&gt;Reliable coordination is essential.&lt;/p&gt;</description>
      <category>IT/AI</category>
      <category>Agent Orchestration</category>
      <category>ai agents</category>
      <category>ai workflow</category>
      <category>Autonomous Development</category>
      <category>codex</category>
      <category>Distributed Systems</category>
      <category>GitLab Workflow Automation</category>
      <category>LLM Engineering</category>
      <category>multi-agent systems</category>
      <category>workflow automation</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/418</guid>
      <comments>https://yeti.tistory.com/418#entry418comment</comments>
      <pubDate>Thu, 14 May 2026 10:06:47 +0900</pubDate>
    </item>
    <item>
      <title>How Role Separation Reduced Execution Drift in Multi-Agent Systems</title>
      <link>https://yeti.tistory.com/417</link>
      <description>&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Lessons from building reliable AI agent workflows with Hermes and local LLMs&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Multi-agent systems become unstable when responsibilities overlap&lt;/li&gt;
&lt;li&gt;Stronger models do not automatically improve workflow convergence&lt;/li&gt;
&lt;li&gt;Shared context without ownership boundaries creates execution drift&lt;/li&gt;
&lt;li&gt;Separating Planner, Implementer, and Validator responsibilities significantly improved workflow stability&lt;/li&gt;
&lt;li&gt;The Implementer should apply contracts precisely, not redesign the system during execution&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;1. The Problem — Execution Drift in Multi-Agent Workflows&lt;/h2&gt;
&lt;p&gt;When I first started building AI agent systems, I assumed the main problem was model capability.&lt;/p&gt;
&lt;p&gt;If the model became smarter, the workflow would become more reliable.&lt;/p&gt;
&lt;p&gt;That assumption turned out to be incomplete.&lt;/p&gt;
&lt;p&gt;While experimenting with local LLM-based coding agents using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Hermes Agent&lt;/li&gt;
&lt;li&gt;Claude Code CLI&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;local qwen models&lt;/li&gt;
&lt;li&gt;Discord-based orchestration&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I repeatedly encountered the same failure pattern.&lt;/p&gt;
&lt;p&gt;At first, using a single agent felt efficient.&lt;/p&gt;
&lt;p&gt;The same agent would:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;plan tasks&lt;/li&gt;
&lt;li&gt;write code&lt;/li&gt;
&lt;li&gt;debug failures&lt;/li&gt;
&lt;li&gt;retry execution&lt;/li&gt;
&lt;li&gt;validate outputs&lt;/li&gt;
&lt;li&gt;redesign architecture during retries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Everything happened inside one large shared context.&lt;/p&gt;
&lt;p&gt;Initially, this looked flexible.&lt;/p&gt;
&lt;p&gt;But as workflows became larger, execution stability degraded rapidly.&lt;/p&gt;
&lt;p&gt;I started seeing problems such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;endless retry loops&lt;/li&gt;
&lt;li&gt;inconsistent file structures&lt;/li&gt;
&lt;li&gt;duplicated abstractions&lt;/li&gt;
&lt;li&gt;rewritten interfaces during execution&lt;/li&gt;
&lt;li&gt;architectural drift between retries&lt;/li&gt;
&lt;li&gt;increasing divergence from the original task&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The workflow often looked productive.&lt;/p&gt;
&lt;p&gt;But convergence became worse over time.&lt;/p&gt;
&lt;p&gt;Eventually, I realized I was not only dealing with model errors.&lt;/p&gt;
&lt;p&gt;I was dealing with execution drift.&lt;/p&gt;
&lt;h2&gt;2. Stronger Models Still Drifted During Execution&lt;/h2&gt;
&lt;p&gt;One surprising realization was that stronger models did not fundamentally solve the problem.&lt;/p&gt;
&lt;p&gt;Larger models often generated better local outputs.&lt;/p&gt;
&lt;p&gt;However, workflow instability remained.&lt;/p&gt;
&lt;p&gt;In some cases, stronger models amplified instability because they became more willing to reinterpret previous decisions during execution.&lt;/p&gt;
&lt;p&gt;For example, an Implementer agent might:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;rename directories during retries&lt;/li&gt;
&lt;li&gt;introduce new abstractions mid-execution&lt;/li&gt;
&lt;li&gt;redefine interfaces that were already agreed upon&lt;/li&gt;
&lt;li&gt;restructure unrelated components while fixing a local issue&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At first, this behavior appeared intelligent.&lt;/p&gt;
&lt;p&gt;The model looked proactive and adaptive.&lt;/p&gt;
&lt;p&gt;However, execution reliability became worse.&lt;/p&gt;
&lt;p&gt;The model attempted to optimize locally during retries.&lt;/p&gt;
&lt;p&gt;Instead of treating the existing structure as a fixed contract, it continuously searched for “better” architectures.&lt;/p&gt;
&lt;p&gt;As retries accumulated, small local optimizations gradually destabilized the workflow itself.&lt;/p&gt;
&lt;p&gt;Eventually, the workflow became harder to reason about after every retry.&lt;/p&gt;
&lt;p&gt;This led me to an important realization:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Reliability problems in multi-agent systems are often coordination problems.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The workflow was unstable not because the model was incapable, but because responsibilities were unclear.&lt;/p&gt;
&lt;h2&gt;3. Shared Context Without Ownership Creates Instability&lt;/h2&gt;
&lt;p&gt;One of the biggest problems in multi-agent systems is uncontrolled shared context.&lt;/p&gt;
&lt;p&gt;At first, shared memory feels efficient because every agent can access the same information.&lt;/p&gt;
&lt;p&gt;However, in practice, this often removes ownership boundaries.&lt;/p&gt;
&lt;p&gt;Once ownership becomes unclear, responsibilities begin overlapping.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the Planner modifies implementation details&lt;/li&gt;
&lt;li&gt;the Implementer redesigns architecture decisions&lt;/li&gt;
&lt;li&gt;the Validator proposes alternative execution strategies&lt;/li&gt;
&lt;li&gt;retry loops introduce conflicting interpretations&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Eventually, the workflow loses convergence.&lt;/p&gt;
&lt;p&gt;The issue is not that the agents are unintelligent.&lt;/p&gt;
&lt;p&gt;The issue is that every agent is allowed to make every type of decision.&lt;/p&gt;
&lt;p&gt;This creates architectural instability.&lt;/p&gt;
&lt;p&gt;While debugging these workflows, I realized the problem felt surprisingly familiar.&lt;/p&gt;
&lt;p&gt;It resembled a classic problem from object-oriented design.&lt;/p&gt;
&lt;h2&gt;4. The Object-Oriented Design Parallel&lt;/h2&gt;
&lt;p&gt;In object-oriented programming, responsibility separation is considered one of the most important design principles.&lt;/p&gt;
&lt;p&gt;The same idea appears repeatedly in concepts such as:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Single Responsibility Principle (SRP)&lt;/li&gt;
&lt;li&gt;high cohesion&lt;/li&gt;
&lt;li&gt;low coupling&lt;/li&gt;
&lt;li&gt;ownership boundaries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The core idea is simple:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Systems become difficult to reason about when responsibilities overlap.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;That idea started feeling very similar to what I was seeing in agent systems.&lt;/p&gt;
&lt;p&gt;In traditional software systems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a service should not own every responsibility&lt;/li&gt;
&lt;li&gt;a class should not make every decision&lt;/li&gt;
&lt;li&gt;a module should not redefine another module’s contract&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The same pattern appeared inside multi-agent workflows.&lt;/p&gt;
&lt;p&gt;When every agent could:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;plan&lt;/li&gt;
&lt;li&gt;implement&lt;/li&gt;
&lt;li&gt;redesign&lt;/li&gt;
&lt;li&gt;validate&lt;/li&gt;
&lt;li&gt;reinterpret contracts during retries&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;workflow stability degraded rapidly.&lt;/p&gt;
&lt;p&gt;At some point, I stopped thinking about agents as “smart tools.”&lt;/p&gt;
&lt;p&gt;I started thinking about them as independently evolving components inside a distributed system.&lt;/p&gt;
&lt;p&gt;That perspective changed how I designed workflows afterward.&lt;/p&gt;
&lt;h2&gt;5. The Implementer Should Not Redesign the System&lt;/h2&gt;
&lt;p&gt;One specific failure pattern repeatedly caused instability in my workflows.&lt;/p&gt;
&lt;p&gt;The Implementer agent would begin modifying architectural decisions during execution.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;changing directory structures during retries&lt;/li&gt;
&lt;li&gt;introducing new abstractions unrelated to the original task&lt;/li&gt;
&lt;li&gt;rewriting task boundaries while fixing local errors&lt;/li&gt;
&lt;li&gt;redefining interfaces that other agents already depended on&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At first, this behavior looked intelligent.&lt;/p&gt;
&lt;p&gt;The model appeared proactive.&lt;/p&gt;
&lt;p&gt;However, execution reliability became significantly worse.&lt;/p&gt;
&lt;p&gt;Every retry introduced additional design changes.&lt;/p&gt;
&lt;p&gt;As those changes accumulated, the workflow continuously drifted away from the original contract.&lt;/p&gt;
&lt;p&gt;Eventually, retries stopped behaving like recovery mechanisms.&lt;/p&gt;
&lt;p&gt;They became architecture mutation loops.&lt;/p&gt;
&lt;p&gt;This problem became especially severe when the Implementer shared the same broad context as the Planner.&lt;/p&gt;
&lt;p&gt;The Implementer gradually started behaving like another Planner.&lt;/p&gt;
&lt;p&gt;That overlap destabilized the workflow.&lt;/p&gt;
&lt;p&gt;Eventually, I realized the problem resembled a classic object-oriented design issue.&lt;/p&gt;
&lt;p&gt;An object becomes difficult to reason about when it owns too many responsibilities.&lt;/p&gt;
&lt;p&gt;The same pattern appeared in agent systems.&lt;/p&gt;
&lt;p&gt;The Implementer should not make new decisions during execution.&lt;/p&gt;
&lt;p&gt;Its role is to apply the already defined contract as precisely as possible.&lt;/p&gt;
&lt;p&gt;Once I separated:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;planning responsibilities&lt;/li&gt;
&lt;li&gt;execution responsibilities&lt;/li&gt;
&lt;li&gt;validation responsibilities&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;workflow convergence improved significantly.&lt;/p&gt;
&lt;h2&gt;6. The Harness Layer — Controlling Convergence&lt;/h2&gt;
&lt;p&gt;Role separation alone does not guarantee convergence.&lt;/p&gt;
&lt;p&gt;The workflow still requires a control layer that verifies whether execution remains aligned with the original contract.&lt;/p&gt;
&lt;p&gt;That became the responsibility of the Harness layer.&lt;/p&gt;
&lt;p&gt;The Harness layer acts as a convergence controller.&lt;/p&gt;
&lt;p&gt;It determines:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;whether retries should continue&lt;/li&gt;
&lt;li&gt;whether execution drift exceeded acceptable boundaries&lt;/li&gt;
&lt;li&gt;whether rollback is necessary&lt;/li&gt;
&lt;li&gt;whether the workflow should terminate&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, if retries continuously modified unrelated files or redefined existing interfaces, the Harness layer treated the execution as divergence rather than recovery.&lt;/p&gt;
&lt;p&gt;That distinction became important.&lt;/p&gt;
&lt;p&gt;Without convergence control, retries often amplified instability instead of resolving failures.&lt;/p&gt;
&lt;p&gt;The Harness layer then managed:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;convergence loops&lt;/li&gt;
&lt;li&gt;execution stabilization&lt;/li&gt;
&lt;li&gt;workflow validation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This architecture became significantly more stable than relying on a single highly capable agent operating inside a large shared context.&lt;/p&gt;
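&lt;p&gt;To make the divergence-versus-recovery distinction concrete, here is a hedged sketch of the kind of decision a convergence controller might make after each retry. &lt;code&gt;Decision&lt;/code&gt;, &lt;code&gt;Attempt&lt;/code&gt;, &lt;code&gt;judge&lt;/code&gt;, the file-scope check, and the retry budget are all illustrative assumptions, not the actual Harness implementation.&lt;/p&gt;

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    ROLLBACK = "rollback"
    TERMINATE = "terminate"


@dataclass(frozen=True)
class Attempt:
    passed: bool
    touched_files: frozenset  # paths modified by this retry


def judge(attempts, allowed_files, max_retries=3):
    """Decide what to do after the latest attempt (thresholds are illustrative)."""
    latest = attempts[-1]
    if latest.passed:
        return Decision.TERMINATE  # converged, nothing left to retry
    out_of_scope = latest.touched_files - frozenset(allowed_files)
    if out_of_scope:
        # Edits outside the contract count as divergence, not recovery.
        return Decision.ROLLBACK
    if len(attempts) >= max_retries:
        return Decision.TERMINATE  # retry budget exhausted
    return Decision.CONTINUE
```

&lt;p&gt;The point of the sketch is that the retry decision is made outside the Implementer, from observable facts about what each attempt changed.&lt;/p&gt;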
&lt;h2&gt;7. My Current Multi-Agent Structure&lt;/h2&gt;
&lt;p&gt;My current workflows are increasingly organized around ownership boundaries.&lt;/p&gt;
&lt;p&gt;A simplified structure looks like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;PM Agent
    ↓
Planner Agent
    ↓
Implementer Agent
    ↓
Validator Agent
    ↓
Harness Layer&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each role owns a different category of decisions.&lt;/p&gt;
&lt;p&gt;That ownership is important.&lt;/p&gt;
&lt;h3&gt;Planner&lt;/h3&gt;
&lt;p&gt;Responsible for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;execution strategy&lt;/li&gt;
&lt;li&gt;task decomposition&lt;/li&gt;
&lt;li&gt;contract definition&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But not responsible for execution changes during runtime.&lt;/p&gt;
&lt;h3&gt;Implementer&lt;/h3&gt;
&lt;p&gt;Responsible for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;applying predefined contracts&lt;/li&gt;
&lt;li&gt;writing code&lt;/li&gt;
&lt;li&gt;executing tasks precisely&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But not responsible for redesigning architecture.&lt;/p&gt;
&lt;h3&gt;Validator&lt;/h3&gt;
&lt;p&gt;Responsible for:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;invariant verification&lt;/li&gt;
&lt;li&gt;semantic validation&lt;/li&gt;
&lt;li&gt;execution correctness checks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But not responsible for redefining execution strategy.&lt;/p&gt;
&lt;p&gt;As ownership boundaries became clearer, workflow behavior became significantly easier to reason about.&lt;/p&gt;
&lt;p&gt;Execution drift decreased.&lt;/p&gt;
&lt;p&gt;Retries became more predictable.&lt;/p&gt;
&lt;p&gt;And convergence stability improved substantially.&lt;/p&gt;
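&lt;p&gt;The ownership split above can also be written down as data and checked mechanically. The following is a sketch under the assumption that decisions can be tagged with a category; the category names, &lt;code&gt;OWNERSHIP&lt;/code&gt;, &lt;code&gt;owner_of&lt;/code&gt;, and &lt;code&gt;authorize&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
OWNERSHIP = {
    "Planner": {"execution_strategy", "task_decomposition", "contract_definition"},
    "Implementer": {"apply_contract", "write_code", "execute_task"},
    "Validator": {"invariant_check", "semantic_validation", "correctness_check"},
}


def owner_of(decision):
    """Return the single role that owns a decision category, or None."""
    for role, decisions in OWNERSHIP.items():
        if decision in decisions:
            return role
    return None


def authorize(role, decision):
    """Raise if a role tries to make a decision it does not own."""
    owner = owner_of(decision)
    if owner != role:
        raise PermissionError(f"{role} may not decide {decision}; owner is {owner}")
    return True
```

&lt;p&gt;With a table like this, an Implementer that attempts a &lt;code&gt;contract_definition&lt;/code&gt; change is rejected instead of silently drifting into the Planner role.&lt;/p&gt;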
&lt;h2&gt;8. Reliability Comes From Ownership Boundaries&lt;/h2&gt;
&lt;p&gt;One of the biggest misconceptions about AI agents is that reliability comes only from intelligence.&lt;/p&gt;
&lt;p&gt;In practice, reliability often comes from constrained responsibilities.&lt;/p&gt;
&lt;p&gt;The same principle already exists in software engineering.&lt;/p&gt;
&lt;p&gt;Distributed systems become more stable when responsibilities are isolated.&lt;/p&gt;
&lt;p&gt;Database systems become safer when transactional boundaries are explicit.&lt;/p&gt;
&lt;p&gt;Microservices reduce instability by limiting ownership scope.&lt;/p&gt;
&lt;p&gt;Multi-agent systems appear to follow similar patterns.&lt;/p&gt;
&lt;p&gt;Without boundaries:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;every agent becomes a planner&lt;/li&gt;
&lt;li&gt;every retry becomes a redesign&lt;/li&gt;
&lt;li&gt;every execution becomes negotiation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As workflows become more complex, instability grows quickly.&lt;/p&gt;
&lt;p&gt;Role separation reduces that instability because ownership becomes predictable.&lt;/p&gt;
&lt;p&gt;The more complex the workflow became, the more important role boundaries became for maintaining convergence.&lt;/p&gt;
&lt;h2&gt;9. The Future — Reliability Engineering for Agent Systems&lt;/h2&gt;
&lt;p&gt;I increasingly believe we are entering a new phase of AI system design.&lt;/p&gt;
&lt;p&gt;Earlier generations of AI systems focused heavily on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;prompts&lt;/li&gt;
&lt;li&gt;model quality&lt;/li&gt;
&lt;li&gt;tool integration&lt;/li&gt;
&lt;li&gt;inference capability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Those layers are still important.&lt;/p&gt;
&lt;p&gt;However, as workflows become more autonomous, coordination and ownership also become architectural concerns.&lt;/p&gt;
&lt;p&gt;Instead of only asking:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“Which model should execute this task?”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;We may increasingly need to ask:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“Which role should own this decision?”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;That shift feels important.&lt;/p&gt;
&lt;p&gt;Many of the hardest problems in agent systems are no longer only about generation quality.&lt;/p&gt;
&lt;p&gt;They are increasingly about:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;coordination&lt;/li&gt;
&lt;li&gt;ownership&lt;/li&gt;
&lt;li&gt;execution boundaries&lt;/li&gt;
&lt;li&gt;convergence stability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;As agent workflows become more autonomous, reliability engineering may increasingly become an exercise in defining ownership boundaries between agents.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;One unexpected realization from building multi-agent systems was how familiar the failures looked.&lt;/p&gt;
&lt;p&gt;Execution drift, responsibility overlap, and uncontrolled redesign during retries resembled classic software engineering problems.&lt;/p&gt;
&lt;p&gt;In many ways, multi-agent workflows began behaving like distributed object systems.&lt;/p&gt;
&lt;p&gt;The same lessons appeared again:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;unclear ownership creates instability&lt;/li&gt;
&lt;li&gt;overlapping responsibilities reduce predictability&lt;/li&gt;
&lt;li&gt;uncontrolled autonomy weakens convergence&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The Implementer should not redesign the system during execution.&lt;/p&gt;
&lt;p&gt;Its responsibility is to apply the already defined contract as precisely as possible.&lt;/p&gt;
&lt;p&gt;That separation turned out to be one of the biggest improvements in workflow stability.&lt;/p&gt;
&lt;p&gt;Ironically, some of the most important ideas for future AI systems may not be entirely new.&lt;/p&gt;
&lt;p&gt;Software engineering has already spent decades learning how to build stable systems through responsibility separation and ownership boundaries.&lt;/p&gt;
&lt;p&gt;Now, those principles appear to be emerging again inside agent systems.&lt;/p&gt;</description>
      <category>IT/AI</category>
      <category>AgentArchitecture</category>
      <category>aiagent</category>
      <category>AIEngineering</category>
      <category>Convergence</category>
      <category>DistributedSystems</category>
      <category>ExecutionDrift</category>
      <category>Hermes</category>
      <category>MultiAgentSystems</category>
      <category>ReliabilityEngineering</category>
      <category>systemdesign</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/417</guid>
      <comments>https://yeti.tistory.com/417#entry417comment</comments>
      <pubDate>Thu, 7 May 2026 22:38:50 +0900</pubDate>
    </item>
    <item>
      <title>Why Multi-Agent Systems Fail to Respond &amp;mdash; Debugging a Real Hermes Agent Setup</title>
      <link>https://yeti.tistory.com/416</link>
      <description>&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Lessons from building and debugging a real-world multi-agent system with Hermes Agent&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;The agent didn’t fail to generate an answer&lt;/li&gt;
&lt;li&gt;It failed to decide whether it should act&lt;/li&gt;
&lt;li&gt;Multi-agent systems require &lt;strong&gt;coordination signals&lt;/strong&gt;, not just intelligence&lt;/li&gt;
&lt;li&gt;The fix was not better prompts, but &lt;strong&gt;explicit behavior contracts&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;1. Problem — The Agent Didn’t Respond at All&lt;/h2&gt;
&lt;p&gt;While building a multi-agent system using Hermes Agent and Discord, I encountered a surprisingly simple but critical failure:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;The agent didn’t respond.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;Not partially.&lt;br&gt;Not incorrectly.&lt;/p&gt;
&lt;p&gt;It simply did nothing.&lt;/p&gt;
&lt;h3&gt;Observed behavior&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Discord message sent with mention&lt;/li&gt;
&lt;li&gt;PM agent responded&lt;/li&gt;
&lt;li&gt;Developer agent stayed silent&lt;/li&gt;
&lt;li&gt;No errors&lt;/li&gt;
&lt;li&gt;No logs indicating failure&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;From the outside, the system looked completely normal.&lt;/p&gt;
&lt;h2&gt;2. Initial Hypothesis — It Must Be a Configuration Issue&lt;/h2&gt;
&lt;p&gt;My first assumption was straightforward:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“This must be a Discord or Hermes configuration problem.”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;So I checked everything.&lt;/p&gt;
&lt;h3&gt;What I verified&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Discord bot token regeneration&lt;/li&gt;
&lt;li&gt;Gateway intents (Message Content Intent enabled)&lt;/li&gt;
&lt;li&gt;Channel permissions&lt;/li&gt;
&lt;li&gt;&lt;code&gt;allowed_channels&lt;/code&gt; configuration&lt;/li&gt;
&lt;li&gt;&lt;code&gt;require_mention&lt;/code&gt; settings&lt;/li&gt;
&lt;li&gt;Restarted Hermes gateway&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are all known failure points.&lt;/p&gt;
&lt;h3&gt;Result&lt;/h3&gt;
&lt;p&gt;Everything was correct.&lt;/p&gt;
&lt;p&gt;And yet, the agent still didn’t respond.&lt;/p&gt;
&lt;h2&gt;3. Reality — The System Was Working Correctly&lt;/h2&gt;
&lt;p&gt;This was the turning point.&lt;/p&gt;
&lt;p&gt;The system was not broken.&lt;/p&gt;
&lt;p&gt;It was behaving exactly as designed.&lt;/p&gt;
&lt;h3&gt;What actually happened&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The agent &lt;strong&gt;received the message&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The agent &lt;strong&gt;processed the message&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;The agent &lt;strong&gt;generated internal reasoning&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;It never decided to act.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;4. Root Cause — No Decision Model for Action&lt;/h2&gt;
&lt;p&gt;This is where the real problem emerged.&lt;/p&gt;
&lt;p&gt;The agent had:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;input processing&lt;/li&gt;
&lt;li&gt;reasoning capability&lt;/li&gt;
&lt;li&gt;tool access&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But it lacked one critical component:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;A decision rule for “Should I respond?”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h3&gt;Important distinction&lt;/h3&gt;
&lt;p&gt;There are two separate problems in agent systems:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Can the agent generate an answer?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Should the agent act at all?&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Most discussions focus only on the first.&lt;/p&gt;
&lt;p&gt;This failure was entirely about the second.&lt;/p&gt;
&lt;h3&gt;What the agent was missing&lt;/h3&gt;
&lt;p&gt;The system had no explicit definition of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;when to respond&lt;/li&gt;
&lt;li&gt;when to ignore&lt;/li&gt;
&lt;li&gt;how to interpret mentions&lt;/li&gt;
&lt;li&gt;how to handle multi-agent context&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;5. Insight — Humans Use Signals, Not Just Understanding&lt;/h2&gt;
&lt;p&gt;This became clearer when I compared it to human behavior.&lt;/p&gt;
&lt;p&gt;Humans do not respond to every message.&lt;/p&gt;
&lt;p&gt;They respond based on signals.&lt;/p&gt;
&lt;h3&gt;Human decision model&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;If I am mentioned → respond&lt;/li&gt;
&lt;li&gt;If someone else is mentioned → ignore&lt;/li&gt;
&lt;li&gt;If unclear → decide based on role&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;The agent had none of this&lt;/h3&gt;
&lt;p&gt;It understood the message.&lt;/p&gt;
&lt;p&gt;But it didn’t understand:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;whether it was responsible for acting.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;6. Fix — Explicit Behavior Contract&lt;/h2&gt;
&lt;p&gt;The solution was not improving prompts.&lt;/p&gt;
&lt;p&gt;It was introducing a &lt;strong&gt;behavior contract&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;Example (soul.md)&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;# Agent Behavior Contract

IF message mentions me → respond

IF message mentions another agent → ignore

IF message is general:
  → decide based on role (PM / Developer / Reviewer)

IF task is assigned:
  → execute within role boundary&lt;/code&gt;&lt;/pre&gt;
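&lt;p&gt;The contract above is written for the model to read, but the first three rules can also be expressed as a plain decision function. This is an illustrative sketch, not how Hermes Agent evaluates &lt;code&gt;soul.md&lt;/code&gt;; &lt;code&gt;Message&lt;/code&gt;, &lt;code&gt;should_respond&lt;/code&gt;, and especially &lt;code&gt;topic_role&lt;/code&gt; (how a general message gets mapped to a role) are assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    text: str
    mentions: set = field(default_factory=set)  # agent names mentioned
    topic_role: str = ""  # role a general message is about, if known


def should_respond(message, my_name, my_role):
    """Apply the contract rules in order; return True if this agent should act."""
    if my_name in message.mentions:
        return True  # mentioned directly: respond
    if message.mentions:
        return False  # someone else was mentioned: ignore
    # General message: decide based on role.
    return message.topic_role == my_role
```

&lt;p&gt;Each rule yields an explicit yes or no, so a silent agent can be traced to the rule that told it to stay quiet.&lt;/p&gt;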
&lt;h3&gt;What changed&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;The agent gained &lt;strong&gt;decision boundaries&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Responsibility became explicit&lt;/li&gt;
&lt;li&gt;Multi-agent interaction became predictable&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Key takeaway&lt;/h3&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;This is not prompt engineering.&lt;br&gt;This is behavior design.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;7. Why This Matters — Multi-Agent Systems Need Coordination&lt;/h2&gt;
&lt;p&gt;Hermes Agent supports multi-agent configurations with role-based execution.&lt;/p&gt;
&lt;p&gt;But simply adding multiple agents is not enough.&lt;/p&gt;
&lt;h3&gt;Multi-agent systems introduce a new layer of failure&lt;/h3&gt;
&lt;p&gt;Not:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;model quality&lt;/li&gt;
&lt;li&gt;prompt quality&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;coordination failure&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h3&gt;Core requirement&lt;/h3&gt;
&lt;p&gt;Multi-agent systems need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;routing&lt;/li&gt;
&lt;li&gt;responsibility&lt;/li&gt;
&lt;li&gt;coordination signals&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without this:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;The system becomes idle, not intelligent.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;8. Connection to Previous Posts&lt;/h2&gt;
&lt;p&gt;This experience connects directly to previous findings:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Prompt engineering improves outputs but not reliability&lt;/li&gt;
&lt;li&gt;Convergence systems stabilize execution&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And now:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Coordination determines whether the system acts at all&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h3&gt;Evolution of understanding&lt;/h3&gt;
&lt;ol&gt;
&lt;li&gt;Prompt → insufficient&lt;/li&gt;
&lt;li&gt;Pipeline → still unstable&lt;/li&gt;
&lt;li&gt;Convergence → improves reliability&lt;/li&gt;
&lt;li&gt;Coordination → enables action&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;9. What I Learned&lt;/h2&gt;
&lt;p&gt;The system didn’t fail because it was wrong.&lt;/p&gt;
&lt;p&gt;It failed because it was silent.&lt;/p&gt;
&lt;h3&gt;Final realization&lt;/h3&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;The system didn’t fail to generate an answer.&lt;br&gt;It failed to decide whether it should respond.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;</description>
      <category>IT/AI</category>
      <category>Agent Architecture</category>
      <category>ai agents</category>
      <category>Discord Bots</category>
      <category>harness engineering</category>
      <category>Hermes Agent</category>
      <category>LLM Systems</category>
      <category>multi-agent systems</category>
      <category>System Design</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/416</guid>
      <comments>https://yeti.tistory.com/416#entry416comment</comments>
      <pubDate>Thu, 30 Apr 2026 16:54:05 +0900</pubDate>
    </item>
    <item>
      <title>How I Designed a Reliable LLM Coding Agent for Production</title>
      <link>https://yeti.tistory.com/415</link>
      <description>&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;From Unpredictable AI to Reliable Systems&lt;br&gt;Lessons from building real-world AI agent systems&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Prompt engineering alone was never enough for stable execution&lt;/li&gt;
&lt;li&gt;Single-prompt systems created over-exploration, scope drift, and unreliable outputs&lt;/li&gt;
&lt;li&gt;I redesigned the system into Planner → Implementer → Validator&lt;/li&gt;
&lt;li&gt;Reliability came from contracts, validation, and retry loops—not better prompts&lt;/li&gt;
&lt;li&gt;Production reliability is a system design problem, not a model quality problem&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;1. The Problem Was Never Just the Prompt&lt;/h2&gt;
&lt;p&gt;When I first started building a local LLM coding agent, I believed the solution was simple:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Write a better prompt.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;I was using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mac Studio&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;Claude Code CLI&lt;/li&gt;
&lt;li&gt;local Qwen models&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal was straightforward:&lt;/p&gt;
&lt;p&gt;Build an automated coding workflow that could operate without constant human intervention.&lt;/p&gt;
&lt;p&gt;At first, I injected large prompts directly into the system and expected stable execution.&lt;/p&gt;
&lt;p&gt;What happened instead was instability.&lt;/p&gt;
&lt;p&gt;Sometimes the model explored too much and redesigned architecture instead of fixing the requested issue.&lt;br&gt;Sometimes it modified unrelated files and crossed boundaries I never intended to touch.&lt;br&gt;Sometimes it simply failed to return usable output.&lt;/p&gt;
&lt;p&gt;The issue was not intelligence.&lt;/p&gt;
&lt;p&gt;It was execution control.&lt;/p&gt;
&lt;p&gt;That was the moment I stopped trying to optimize prompts and started redesigning the execution architecture.&lt;/p&gt;
&lt;h2&gt;2. The New Architecture — Planner → Implementer → Validator&lt;/h2&gt;
&lt;p&gt;The first version of the system looked like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Large Prompt
↓
LLM
↓
Hope for the best&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This worked for demos, but it failed in production because the same request could produce different outcomes.&lt;/p&gt;
&lt;p&gt;I redesigned the system around one principle:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Reliability must be enforced by the system.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The new structure became:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Request
↓
Planner
↓
Implementer
↓
Validator
↓
Retry if needed
↓
Converged Result&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each stage had:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;clear responsibility boundaries&lt;/li&gt;
&lt;li&gt;structured contracts&lt;/li&gt;
&lt;li&gt;validation checkpoints&lt;/li&gt;
&lt;li&gt;deterministic retry conditions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This was the point where the system became trustworthy.&lt;/p&gt;
&lt;p&gt;The model was no longer asked to “figure everything out.”&lt;br&gt;It was asked to operate inside a controlled execution environment.&lt;/p&gt;
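&lt;p&gt;The control flow can be sketched in a few lines of Python. This is a simplified illustration, not the actual implementation: &lt;code&gt;run_pipeline&lt;/code&gt;, the three stage functions, and &lt;code&gt;max_attempts&lt;/code&gt; are assumed names, and the stages are plain stand-ins for LLM calls.&lt;/p&gt;

```python
def run_pipeline(request, plan, implement, validate, max_attempts=3):
    """Run Planner, Implementer, Validator, retrying only on reported failures."""
    contract = plan(request)  # Planner: fix the contract before any code changes
    failures = []
    for attempt in range(1, max_attempts + 1):
        result = implement(contract, failures)  # Implementer: follow the contract
        failures = validate(contract, result)  # Validator: list of violated rules
        if not failures:
            return {"status": "converged", "attempts": attempt, "result": result}
    return {"status": "failed", "attempts": max_attempts, "failures": failures}


# Hypothetical stand-ins for the three LLM-backed stages.
def plan(request):
    return {"required": "token refresh"}


def implement(contract, failures):
    # The first attempt misses the requirement; the retry sees the failure and fixes it.
    return "adds token refresh" if failures else "renames variables"


def validate(contract, result):
    return [] if contract["required"] in result else ["required change missing"]


outcome = run_pipeline("Fix the login issue", plan, implement, validate)
print(outcome["status"], outcome["attempts"])
```

&lt;p&gt;The loop has only two exits: an empty failure list or an exhausted attempt budget. That is what makes the retry condition deterministic, even though the stages themselves are not.&lt;/p&gt;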
&lt;h2&gt;3. Layer 1 — Planner&lt;/h2&gt;
&lt;p&gt;The Planner exists to prevent intent drift.&lt;/p&gt;
&lt;p&gt;Its job is not writing code.&lt;/p&gt;
&lt;p&gt;Its job is defining the contract before execution begins.&lt;/p&gt;
&lt;p&gt;Instead of vague instructions like:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Fix the login issue&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;the planner produces something like this:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Intent Contract

- Request Type
- Required Change
- Protected Boundaries
- Acceptance Criteria
- Codebase Anchors&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Do:
- Add token refresh logic

Do Not:
- Change authentication API contracts

Must Pass:
- Existing login flow remains unchanged&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This removes ambiguity before implementation starts.&lt;/p&gt;
&lt;p&gt;Design decisions, protected boundaries, and success conditions are fixed in advance.&lt;/p&gt;
&lt;p&gt;The Implementer is not asked to interpret intent.&lt;br&gt;It is asked to follow a contract.&lt;/p&gt;
&lt;p&gt;That separation is what prevents scope drift.&lt;/p&gt;
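&lt;p&gt;One way to make such a contract concrete is an immutable data structure. The sketch below is an assumption about how it could look in Python: the field names mirror the list above, while &lt;code&gt;render&lt;/code&gt;, the &lt;code&gt;bugfix&lt;/code&gt; value, and the anchor path are made up for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the contract cannot change once execution begins
class IntentContract:
    request_type: str
    required_change: str
    protected_boundaries: tuple = ()
    acceptance_criteria: tuple = ()
    codebase_anchors: tuple = ()

    def render(self):
        """Format the contract as the Do / Do Not / Must Pass text for the Implementer."""
        lines = ["Do:", f"- {self.required_change}", "", "Do Not:"]
        lines += [f"- {item}" for item in self.protected_boundaries]
        lines += ["", "Must Pass:"]
        lines += [f"- {item}" for item in self.acceptance_criteria]
        return "\n".join(lines)


contract = IntentContract(
    request_type="bugfix",
    required_change="Add token refresh logic",
    protected_boundaries=("Change authentication API contracts",),
    acceptance_criteria=("Existing login flow remains unchanged",),
    codebase_anchors=("auth/session.py",),  # hypothetical path, for illustration only
)
print(contract.render())
```

&lt;p&gt;Freezing the dataclass is a deliberate choice here: any attempt to modify the contract during execution raises an error instead of silently shifting the scope.&lt;/p&gt;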
&lt;h2&gt;4. Layer 2 — Implementer&lt;/h2&gt;
&lt;p&gt;The Implementer performs the actual code changes.&lt;/p&gt;
&lt;p&gt;Its responsibility is intentionally narrow:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Execute the contract exactly.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The Implementer should not make new decisions during execution.&lt;/p&gt;
&lt;p&gt;Its role is to apply the already defined contract as precisely as possible.&lt;br&gt;Design decisions, scope boundaries, and acceptance criteria should already be fixed by the Planner.&lt;/p&gt;
&lt;p&gt;This significantly reduced the “creative failures” I saw in earlier versions.&lt;/p&gt;
&lt;p&gt;Unexpected refactoring decreased, unnecessary architectural changes disappeared, and unrelated file modifications became far less common.&lt;/p&gt;
&lt;p&gt;As a result, the implementation layer became intentionally predictable.&lt;/p&gt;
&lt;p&gt;That predictability was a positive signal, because reliable production systems should reduce surprise rather than create it.&lt;/p&gt;
&lt;p&gt;The goal of execution is not creativity.&lt;/p&gt;
&lt;p&gt;It is consistency.&lt;/p&gt;
&lt;h2&gt;5. Layer 3 — Validator&lt;/h2&gt;
&lt;p&gt;This is the most important part of the system.&lt;/p&gt;
&lt;p&gt;The Validator determines whether the result is actually acceptable.&lt;/p&gt;
&lt;p&gt;It checks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Was the requested behavior implemented?&lt;/li&gt;
&lt;li&gt;Were protected contracts preserved?&lt;/li&gt;
&lt;li&gt;Did the output format remain valid?&lt;/li&gt;
&lt;li&gt;Were unintended side effects introduced?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Without validation, failure stays hidden.&lt;/p&gt;
&lt;p&gt;The system may look successful while silently breaking important assumptions.&lt;/p&gt;
&lt;p&gt;With validation, failure becomes explicit.&lt;/p&gt;
&lt;p&gt;And once failure becomes visible, retries become meaningful.&lt;/p&gt;
&lt;p&gt;This is where convergence actually happens—not inside the prompt, but inside the feedback loop.&lt;/p&gt;
&lt;p&gt;A validator should reject not only syntactically broken output, but also structurally valid output that is semantically wrong.&lt;/p&gt;
&lt;p&gt;That distinction is where many production systems fail.&lt;/p&gt;
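&lt;p&gt;A minimal sketch of that distinction, assuming the Implementer reports which files it changed. The dictionary keys and file paths are hypothetical, and comparing file lists is only a proxy for the behavioral checks a real validator would run.&lt;/p&gt;

```python
def validate(contract, result):
    """Return a list of failure reasons; an empty list means the result is accepted."""
    # Structural check: the output must be in the expected format at all.
    if not isinstance(result.get("changed_files"), list):
        return ["output format invalid: changed_files must be a list"]
    failures = []
    # Semantic checks: well-formed output can still violate the contract.
    if contract["target_file"] not in result["changed_files"]:
        failures.append("requested behavior not implemented: target file untouched")
    for path in result["changed_files"]:
        if path in contract["protected_files"]:
            failures.append(f"protected contract violated: {path} was modified")
        elif path != contract["target_file"]:
            failures.append(f"unintended side effect: {path} was modified")
    return failures


contract = {"target_file": "auth/token.py", "protected_files": ["auth/api.py"]}
well_formed_but_wrong = {"changed_files": ["auth/api.py", "utils/helpers.py"]}

for reason in validate(contract, well_formed_but_wrong):
    print(reason)
```

&lt;p&gt;The sample result passes the format check and still produces three failure reasons. Each reason is specific enough to drive the next retry.&lt;/p&gt;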
&lt;h2&gt;6. Retry Logic — Convergence Over Perfection&lt;/h2&gt;
&lt;p&gt;Retries should not be random.&lt;/p&gt;
&lt;p&gt;They should be guided by validation results.&lt;/p&gt;
&lt;p&gt;Instead of saying:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Try again&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;the system should say:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Retry because Acceptance Criteria #2 failed&lt;br&gt;Retry because a protected contract was violated&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This changes retries from guesswork into controlled convergence.&lt;/p&gt;
&lt;p&gt;The goal is not to get the perfect answer on the first attempt.&lt;/p&gt;
&lt;p&gt;The goal is to move the system consistently toward the target state.&lt;/p&gt;
&lt;p&gt;That is what reliability means in practice.&lt;/p&gt;
&lt;p&gt;Production systems do not need brilliance.&lt;/p&gt;
&lt;p&gt;They need stable convergence.&lt;/p&gt;
&lt;h2&gt;7. Before vs After&lt;/h2&gt;
&lt;p&gt;Before:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Prompt
→ Result
→ Hope&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Intent
→ Contract
→ Execute
→ Validate
→ Retry
→ Reliable Output&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The system became slower.&lt;/p&gt;
&lt;p&gt;But it also became trustworthy.&lt;/p&gt;
&lt;p&gt;And in production systems, trust matters more than speed—especially when autonomous execution is involved.&lt;/p&gt;
&lt;p&gt;Fast failure is still failure.&lt;/p&gt;
&lt;p&gt;Reliable execution is what matters.&lt;/p&gt;
&lt;h2&gt;8. What Actually Changed&lt;/h2&gt;
&lt;p&gt;The biggest change was not technical. It was philosophical.&lt;/p&gt;
&lt;p&gt;At first, I focused on improving the model itself.&lt;br&gt;I kept asking how to make the model smarter, more accurate, and more capable.&lt;/p&gt;
&lt;p&gt;Over time, I realized that intelligence was not the main problem.&lt;/p&gt;
&lt;p&gt;The real problem was failure visibility.&lt;/p&gt;
&lt;p&gt;The better question became:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;How do I design a system where failure cannot be ignored?&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;That shift changed everything.&lt;/p&gt;
&lt;p&gt;Reliable AI systems are not built by better prompts alone.&lt;/p&gt;
&lt;p&gt;They are built through clear execution boundaries, explicit contracts, strong validation layers, and retry mechanisms that guide the system toward convergence.&lt;/p&gt;
&lt;p&gt;In other words, reliability does not come from model intelligence itself. It comes from the system surrounding the model.&lt;/p&gt;
&lt;p&gt;That is why:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Reliability is not a model capability.&lt;br&gt;It is a system property.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;</description>
      <category>IT/AI</category>
      <category>Ai</category>
      <category>aiagent</category>
      <category>HarnessEngineering</category>
      <category>llm</category>
      <category>LLMPipeline</category>
      <category>LocalLLM</category>
      <category>ProductionAI</category>
      <category>promptengineering</category>
      <category>ReliabilityEngineering</category>
      <category>systemdesign</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/415</guid>
      <comments>https://yeti.tistory.com/415#entry415comment</comments>
      <pubDate>Wed, 29 Apr 2026 16:49:50 +0900</pubDate>
    </item>
    <item>
      <title>Why Prompt Engineering Fails: Harness Engineering for Reliable LLM Systems</title>
      <link>https://yeti.tistory.com/413</link>
      <description>&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;From unpredictable AI outputs to production-ready LLM systems&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;TL;DR&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Prompt engineering improves output quality, but it does not guarantee reliability&lt;/li&gt;
&lt;li&gt;Most LLM systems fail in production because they cannot handle failure&lt;/li&gt;
&lt;li&gt;Validation layers, retry loops, and strict output contracts are what make AI automation reliable&lt;/li&gt;
&lt;li&gt;Reliable AI agent systems are built with control systems, not just better prompts&lt;/li&gt;
&lt;li&gt;This is where Prompt Engineering ends and Harness Engineering begins&lt;/li&gt;
&lt;/ul&gt;
&lt;hr data-ke-style=&quot;style1&quot; /&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Why Prompt Engineering Alone Fails in Production&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Many teams building AI agents make the same assumption:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;ldquo;If we write better prompts, the system will become reliable.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;At first, I believed that too.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;I thought better prompts meant:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;clearer instructions&lt;/li&gt;
&lt;li&gt;stronger role definitions&lt;/li&gt;
&lt;li&gt;stricter output formatting&lt;/li&gt;
&lt;li&gt;more examples&lt;/li&gt;
&lt;li&gt;more failure prevention rules&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;So I kept improving prompts.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Longer prompts.&lt;br /&gt;Safer prompts.&lt;br /&gt;More detailed prompts.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;But something unexpected happened.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The prompts got better.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The system got worse.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Especially when working with local LLM environments using:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Mac Studio&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;Claude Code CLI&lt;/li&gt;
&lt;li&gt;local Qwen models&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;I repeatedly saw:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;no output at all&lt;/li&gt;
&lt;li&gt;endless file exploration&lt;/li&gt;
&lt;li&gt;incomplete implementations&lt;/li&gt;
&lt;li&gt;unstable behavior across identical runs&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The problem was not correctness.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The real problem was:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;I could not reliably get results at all.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That was the turning point.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;I realized:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Prompt quality was not the bottleneck.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The real issue was:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;One prompt was carrying the entire system.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;And that is why prompt engineering alone fails in production LLM systems.&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Why LLM Systems Fail in Production&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The biggest misunderstanding in AI system design is this:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;People think LLM systems fail because models are not smart enough.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That is usually wrong.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Most LLM systems fail because they are designed as linear success paths.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;If every step must succeed perfectly,&lt;br /&gt;the entire system becomes fragile.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;For example:&lt;/p&gt;
&lt;pre class=&quot;pgsql&quot;&gt;&lt;code&gt;Analyze &amp;rarr; Design &amp;rarr; Implement &amp;rarr; Validate &amp;rarr; Deploy&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;If one step fails,&lt;br /&gt;the entire workflow breaks.&lt;/p&gt;
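&lt;p data-ke-size=&quot;size16&quot;&gt;A back-of-the-envelope calculation shows how quickly this compounds. The numbers below are illustrative assumptions, not measurements: each step is assumed to succeed independently 90% of the time, and a retry is assumed to be triggered by a validator that always detects failure.&lt;/p&gt;

```python
def pipeline_success_rate(step_success_rate, steps):
    """Probability that every step of a linear pipeline succeeds (independent steps)."""
    return step_success_rate ** steps


def step_rate_with_retries(step_success_rate, attempts):
    """Probability that a step succeeds at least once within the allowed attempts."""
    return 1 - (1 - step_success_rate) ** attempts


# Five steps, each assumed to succeed 90% of the time.
linear = pipeline_success_rate(0.9, 5)
# Same steps, but each one may be attempted up to three times.
with_retries = pipeline_success_rate(step_rate_with_retries(0.9, 3), 5)
print(round(linear, 3), round(with_retries, 3))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;Under these assumptions the linear pipeline completes only about 59% of the time, while the version with validated retries completes about 99.5% of the time. The individual steps did not get any smarter. Only the structure around them changed.&lt;/p&gt;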
&lt;p data-ke-size=&quot;size16&quot;&gt;This becomes worse because LLMs are non-deterministic.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The same input does not always produce the same output.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Unlike traditional software, the mapping from input to output is not fixed:&lt;/p&gt;
&lt;pre class=&quot;lua&quot;&gt;&lt;code&gt;Same input &amp;rarr; Same output ❌
Same input &amp;rarr; Different outputs ✔&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This means reliability is not a model feature.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;It is a system design problem.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That is the foundation of modern LLM system design.&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;A Real Failure Case from Claude Code + Ollama&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;One failure made this obvious.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The Implementer step was supposed to modify a single API file.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The task was simple:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;replace admin-token generation with user-context token handling.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Nothing more.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;But instead of touching the target file, the model started scanning the entire repository.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;It opened unrelated modules, rewrote helper functions, and tried to understand the whole system.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Eventually, it returned:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;ldquo;Task completed successfully&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;But the required logic had not changed at all.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The actual task was still unfinished.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The model had optimized for&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;plausible completion&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;instead of&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;actual completion&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The validator rejected the result immediately.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That was the moment I understood:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Without explicit boundaries, LLMs optimize for confidence, not correctness.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;And confidence is useless in automation.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This is one of the most common failure patterns in local LLM automation.&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;From Prompt Engineering to Harness Engineering&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This changed the architecture completely.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Instead of one giant prompt, I split the workflow into smaller steps:&lt;/p&gt;
&lt;pre class=&quot;crmsh&quot;&gt;&lt;code&gt;Intent &amp;rarr; Planner &amp;rarr; Spec &amp;rarr; Implement &amp;rarr; Validate &amp;rarr; Git&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Each step had:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;one responsibility&lt;/li&gt;
&lt;li&gt;explicit input/output contracts&lt;/li&gt;
&lt;li&gt;deterministic validation&lt;/li&gt;
&lt;li&gt;retry capability&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This improved reliability dramatically.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The goal changed from:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Generate the correct answer once&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;to:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Detect failure and drive convergence&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This is the difference between Prompt Engineering and Harness Engineering.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Prompt engineering improves what the model says.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Harness engineering controls what the system accepts.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That distinction is everything.&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;What Is Harness Engineering?&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Harness Engineering is the control layer that makes LLM systems reliable.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;It includes:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;validation layers&lt;/li&gt;
&lt;li&gt;retry loops&lt;/li&gt;
&lt;li&gt;output contracts&lt;/li&gt;
&lt;li&gt;failure detection&lt;/li&gt;
&lt;li&gt;step isolation&lt;/li&gt;
&lt;li&gt;convergence architecture&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Instead of asking:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;ldquo;Can the model do this?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;the real question becomes:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;ldquo;What happens when the model fails?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Because failure is not an exception.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Failure is the default state.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Reliable AI systems are not built by avoiding failure.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;They are built by surviving it.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That is Harness Engineering.&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Contract-Driven LLM Execution&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The biggest improvement came from removing ambiguity.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;I stopped asking the model to:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;ldquo;Do the task well&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;and started requiring:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;ldquo;Satisfy the contract&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;For example:&lt;/p&gt;
&lt;pre class=&quot;markdown&quot;&gt;&lt;code&gt;### REQUIRED OUTPUT

- Must create:
  .handoff/task-1/spec.md

- Must include:
  SPEC_DONE

- Must NOT:
  modify request.md

- Final line must be:
  &amp;lt;&amp;lt;&amp;lt;DONE&amp;gt;&amp;gt;&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Outputs also had strict file-based contracts:&lt;/p&gt;
&lt;pre class=&quot;yaml&quot;&gt;&lt;code&gt;[FILE]
path: .handoff/task-1/spec.md
---
implementation details here
---&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;No free-form output.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;No interpretation.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Either the contract was satisfied or it failed.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This made failure measurable.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;And measurable failure is what makes retries possible.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This is one of the most important patterns in production-grade AI agent architecture.&lt;/p&gt;
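&lt;p data-ke-size=&quot;size16&quot;&gt;To show what &quot;either satisfied or failed&quot; can look like in code, here is a small Python sketch that parses the file block format shown above. The function name and error messages are my own for this example, and the sketch assumes the body itself never contains a line made only of three dashes.&lt;/p&gt;

```python
def parse_file_block(output):
    """Parse one [FILE] block into (path, body); raise ValueError on any violation."""
    lines = output.strip().splitlines()
    if len(lines) == 0 or lines[0].strip() != "[FILE]":
        raise ValueError("missing [FILE] header")
    if len(lines) == 1 or not lines[1].startswith("path: "):
        raise ValueError("missing path line")
    separators = [i for i, line in enumerate(lines) if line.strip() == "---"]
    # The body must sit between a --- on the third line and a --- on the last line.
    if len(separators) != 2 or separators[0] != 2 or separators[1] != len(lines) - 1:
        raise ValueError("body must be enclosed by exactly two --- lines")
    path = lines[1][len("path: "):].strip()
    body = "\n".join(lines[3:-1])
    return path, body


sample = "\n".join([
    "[FILE]",
    "path: .handoff/task-1/spec.md",
    "---",
    "implementation details here",
    "---",
])
print(parse_file_block(sample))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;Free-form text such as &quot;Sure, here is the spec&quot; raises an error on the first check, so a chatty answer can never be mistaken for a valid handoff.&lt;/p&gt;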
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Why Validation Layers Matter in LLM Systems&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Validation is what transforms LLM output from &amp;ldquo;probably correct&amp;rdquo; into &amp;ldquo;safe enough to continue.&amp;rdquo;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Validation was deterministic.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;For example:&lt;/p&gt;
&lt;pre class=&quot;awk&quot;&gt;&lt;code&gt;if ! grep -q &quot;VALIDATOR_DONE&quot; output.txt; then
  echo &quot;Validation failed&quot;
  exit 1
fi&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Other validation checks included:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;required file existence&lt;/li&gt;
&lt;li&gt;schema validation&lt;/li&gt;
&lt;li&gt;forbidden output detection&lt;/li&gt;
&lt;li&gt;completion marker verification&lt;/li&gt;
&lt;li&gt;PASS / FAIL summaries&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The question changed from:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;ldquo;Does this look correct?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;to:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;ldquo;Did this satisfy the required conditions?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That is how validation layers improve LLM reliability.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Because intuition does not scale.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Validation does.&lt;/p&gt;
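&lt;p data-ke-size=&quot;size16&quot;&gt;Several of these checks can be combined into one PASS / FAIL summary. The Python sketch below is an assumption about how that could be wired together: &lt;code&gt;run_checks&lt;/code&gt; and the file names are hypothetical, and schema validation is left out to keep it short.&lt;/p&gt;

```python
import os
import tempfile


def run_checks(output, workdir, required_files, forbidden, marker):
    """Run deterministic checks; return (passed, report) with one line per check."""
    checks = []
    for name in required_files:
        exists = os.path.isfile(os.path.join(workdir, name))
        checks.append((f"file exists: {name}", exists))
    for text in forbidden:
        checks.append((f"forbidden output absent: {text}", text not in output))
    checks.append((f"completion marker present: {marker}", marker in output))
    report = [("PASS" if ok else "FAIL") + " " + label for label, ok in checks]
    return all(ok for _, ok in checks), report


with tempfile.TemporaryDirectory() as workdir:
    open(os.path.join(workdir, "spec.md"), "w").close()
    passed, report = run_checks(
        output="spec written\nVALIDATOR_DONE",
        workdir=workdir,
        required_files=["spec.md", "plan.md"],  # plan.md is deliberately missing
        forbidden=["TODO"],
        marker="VALIDATOR_DONE",
    )

print(passed)
print("\n".join(report))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;A single failed line is enough to fail the whole step, and the report states exactly which condition was not met.&lt;/p&gt;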
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Retry Loops Create Reliable AI Automation&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Retries were not random.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;They were guided by failure signals.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;When validation failed, the next prompt included correction feedback:&lt;/p&gt;
&lt;pre class=&quot;sas&quot;&gt;&lt;code&gt;Previous output failed because:

- missing [FILE] block
- invalid completion marker
- required file was not created

Fix only these issues.
Do not rewrite unrelated sections.&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This made retries behave like:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;gradient-free optimization&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The model did not need gradients.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;It only needed clear failure signals.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That was enough to create convergence.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This is why retry loops are the core of reliable LLM systems.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Not just better prompts.&lt;/p&gt;
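&lt;p data-ke-size=&quot;size16&quot;&gt;Building that correction feedback is mechanical once the validator returns explicit reasons. A minimal Python sketch, with &lt;code&gt;build_retry_prompt&lt;/code&gt; as a hypothetical helper name:&lt;/p&gt;

```python
def build_retry_prompt(failures):
    """Turn validator failures into the correction feedback for the next attempt."""
    if not failures:
        # A retry without a concrete reason would just be guesswork.
        raise ValueError("retry requested without a failure signal")
    lines = ["Previous output failed because:", ""]
    lines += [f"- {reason}" for reason in failures]
    lines += ["", "Fix only these issues.", "Do not rewrite unrelated sections."]
    return "\n".join(lines)


print(build_retry_prompt(["missing [FILE] block", "invalid completion marker"]))
```

&lt;p data-ke-size=&quot;size16&quot;&gt;Refusing to build a prompt from an empty failure list keeps the rule explicit: every retry must be tied to something the validator actually reported.&lt;/p&gt;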
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Why Agents Alone Are Not Enough&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Many people ask:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Why not just use Claude Code directly?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The answer is simple.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Claude Code is a brilliant interactive engineer.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;But production automation does not need brilliance.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;It needs boring reliability.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Interactive agents work well when humans are present.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Because humans can:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;notice failure&lt;/li&gt;
&lt;li&gt;redirect execution&lt;/li&gt;
&lt;li&gt;stop bad decisions&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;But in a fully automated workflow:&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;there is no human in the loop.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The system must decide:&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;did this step succeed?&lt;/li&gt;
&lt;li&gt;should we retry?&lt;/li&gt;
&lt;li&gt;what exactly failed?&lt;/li&gt;
&lt;li&gt;is it safe to continue?&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That requires a harness.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Not just an agent.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The agent generates.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The harness controls.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That distinction separates demos from production systems.&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Final Takeaway&lt;/h2&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The real question is not:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;How do we build smarter AI?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;The real question is:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;How do we build systems that do not collapse when failure happens?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Reliable LLM systems are not built by asking better questions.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;They are built by designing better control systems.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;This is the core of production AI engineering.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;And this is where Prompt Engineering ends.&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;FAQ&lt;/h2&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;What is Harness Engineering?&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Harness Engineering is the system design layer that controls LLM behavior.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;It includes validation, retries, structured output contracts, and failure detection.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;It is what makes LLM systems reliable in production.&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;Why does prompt engineering fail in production?&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Because prompt engineering improves output quality, but it does not guarantee reliability.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Production systems fail when there is no validation or recovery mechanism for bad outputs.&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;How do validation layers improve LLM reliability?&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Validation layers make failure measurable.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;They check completion markers, files, schemas, and output contracts so the system can safely decide whether to continue or retry.&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;Why do local LLM systems fail more often?&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Local LLMs usually have smaller context windows, weaker reasoning consistency, and higher instability in multi-step tasks.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;That makes validation and retry systems even more important.&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size23&quot;&gt;What is the difference between prompt engineering and harness engineering?&lt;/h3&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Prompt engineering improves model outputs.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Harness engineering controls whether those outputs are accepted, rejected, or retried.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Prompt improves quality.&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;Harness creates reliability.&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Related Articles&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://yeti.tistory.com/411&quot;&gt;Why Prompt Engineering Alone Fails in LLM Systems (And How to Fix It with Convergence)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://yeti.tistory.com/412&quot;&gt;Why LLM Systems Fail in Production (And Why Prompt Engineering Is Not Enough)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;</description>
      <category>IT/AI</category>
      <category>AI agent architecture</category>
      <category>Claude Code automation</category>
      <category>harness engineering</category>
      <category>LLM system design</category>
      <category>local LLM reliability</category>
      <category>Ollama AI agent system</category>
      <category>prompt engineering limitations</category>
      <category>retry loops for AI systems</category>
      <category>validation layers for LLM</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/413</guid>
      <comments>https://yeti.tistory.com/413#entry413comment</comments>
      <pubDate>Fri, 24 Apr 2026 17:38:59 +0900</pubDate>
    </item>
    <item>
      <title>Why LLM Systems Fail in Production (And Why Prompt Engineering Is Not Enough)</title>
      <link>https://yeti.tistory.com/412</link>
      <description>&lt;h3&gt;From Unpredictable AI to Reliable Systems&lt;/h3&gt;
&lt;h3&gt;Lessons from Building Real-World AI Agent Systems&lt;/h3&gt;
&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Prompt engineering improves output quality, but it does not guarantee reliability&lt;/li&gt;
&lt;li&gt;LLM systems fail because outputs are probabilistic, not deterministic&lt;/li&gt;
&lt;li&gt;Multi-step AI agent pipelines amplify failure probabilities&lt;/li&gt;
&lt;li&gt;Production-grade LLM systems require validation, retry loops, and convergence mechanisms&lt;/li&gt;
&lt;li&gt;Reliability is not a model capability — it is a system property&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Introduction: The Real Problem with LLM Systems&lt;/h2&gt;
&lt;p&gt;Most LLM systems fail in production not because the model is weak,&lt;br&gt;but because the system cannot handle failure.&lt;/p&gt;
&lt;p&gt;Many teams assume the problem is intelligence.&lt;/p&gt;
&lt;p&gt;“We need smarter models.”&lt;/p&gt;
&lt;p&gt;“We need better reasoning.”&lt;/p&gt;
&lt;p&gt;“If we switch to a stronger frontier model, everything will work.”&lt;/p&gt;
&lt;p&gt;But in real-world systems, the biggest problem is not intelligence.&lt;/p&gt;
&lt;p&gt;It is unpredictability.&lt;/p&gt;
&lt;p&gt;The same prompt can produce different outputs.&lt;/p&gt;
&lt;p&gt;Some days, the workflow works perfectly.&lt;/p&gt;
&lt;p&gt;Other days, the exact same input produces failure.&lt;/p&gt;
&lt;p&gt;It works in demos,&lt;br&gt;but breaks in production.&lt;/p&gt;
&lt;p&gt;This is the real challenge of building reliable AI systems.&lt;/p&gt;
&lt;p&gt;And at this point, many teams make the same mistake:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“Maybe we just need better prompt engineering.”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;But prompt engineering alone is not enough.&lt;/p&gt;
&lt;h2&gt;Why Prompt Engineering Alone Fails&lt;/h2&gt;
&lt;p&gt;I thought the same thing at first.&lt;/p&gt;
&lt;p&gt;I believed that if I designed a single prompt carefully enough,&lt;br&gt;I could get stable and reliable outputs.&lt;/p&gt;
&lt;p&gt;So I made prompts longer.&lt;/p&gt;
&lt;p&gt;More detailed.&lt;/p&gt;
&lt;p&gt;More structured.&lt;/p&gt;
&lt;p&gt;I added:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;role definitions&lt;/li&gt;
&lt;li&gt;output formatting rules&lt;/li&gt;
&lt;li&gt;examples&lt;/li&gt;
&lt;li&gt;failure prevention instructions&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The prompt kept growing.&lt;/p&gt;
&lt;p&gt;But the results became less stable.&lt;/p&gt;
&lt;p&gt;This was especially obvious in local LLM environments.&lt;/p&gt;
&lt;p&gt;Using:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mac Studio&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;Claude Code CLI&lt;/li&gt;
&lt;li&gt;qwen local models&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I repeatedly saw:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;no output at all&lt;/li&gt;
&lt;li&gt;excessive file exploration (over-exploration)&lt;/li&gt;
&lt;li&gt;incomplete task execution&lt;/li&gt;
&lt;li&gt;unstable behavior across identical runs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The problem was not correctness.&lt;/p&gt;
&lt;p&gt;The real problem was:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;I could not reliably get results at all.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;That was the turning point.&lt;/p&gt;
&lt;p&gt;The issue was not prompt quality.&lt;/p&gt;
&lt;p&gt;The issue was:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;&lt;strong&gt;One prompt was carrying the entire system.&lt;/strong&gt;&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;Why Small, Well-Defined Tasks Improve LLM Reliability&lt;/h2&gt;
&lt;p&gt;I noticed something important.&lt;/p&gt;
&lt;p&gt;Local LLMs performed much better&lt;br&gt;when tasks were small and clearly defined.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;understanding requirements&lt;/li&gt;
&lt;li&gt;defining implementation scope&lt;/li&gt;
&lt;li&gt;finding files to modify&lt;/li&gt;
&lt;li&gt;validating code changes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each of these tasks worked surprisingly well.&lt;/p&gt;
&lt;p&gt;But asking the model to do all of them at once&lt;br&gt;made the entire system unstable.&lt;/p&gt;
&lt;p&gt;This led to a simple insight:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Break the problem into smaller pieces.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;LLMs work better as&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;a team of specialized workers&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;than as&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;a single genius expected to solve everything.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;That insight changed the architecture completely.&lt;/p&gt;
&lt;h2&gt;From One Prompt to Multi-Step AI Agent Systems&lt;/h2&gt;
&lt;p&gt;Instead of one massive prompt,&lt;br&gt;I built a structured pipeline.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Intent Extractor&lt;/li&gt;
&lt;li&gt;Planner&lt;/li&gt;
&lt;li&gt;Spec Planner&lt;/li&gt;
&lt;li&gt;Implementer&lt;/li&gt;
&lt;li&gt;Validator&lt;/li&gt;
&lt;li&gt;Git &amp;amp; Merge Request Automation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Each step had:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;a narrow responsibility&lt;/li&gt;
&lt;li&gt;explicit input/output contracts&lt;/li&gt;
&lt;li&gt;independent validation&lt;/li&gt;
&lt;li&gt;retry capability&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The goal was no longer:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Generate the correct answer in one shot&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The real goal became:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Detect failure and drive convergence&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This was the shift from&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Prompt Engineering&lt;/strong&gt; to &lt;strong&gt;Harness Engineering&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Real Example: Contract-Driven LLM Execution&lt;/h2&gt;
&lt;p&gt;The most important improvement was simple:&lt;/p&gt;
&lt;p&gt;I stopped asking the model to&lt;br&gt;“do the task well.”&lt;/p&gt;
&lt;p&gt;Instead,&lt;/p&gt;
&lt;p&gt;I asked it to satisfy strict contracts.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;PLANNER_DONE
SPEC_DONE
IMPLEMENTATION_DONE
VALIDATOR_DONE
GIT_DONE&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each step had required completion markers.&lt;/p&gt;
&lt;p&gt;And outputs had to follow strict file-based handoff rules:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;[FILE]
path: .handoff/task-1/spec.md
---
implementation details here
---&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This removed ambiguity.&lt;/p&gt;
&lt;p&gt;The model was no longer evaluated by quality.&lt;/p&gt;
&lt;p&gt;It was evaluated by contract satisfaction.&lt;/p&gt;
&lt;p&gt;That changed everything.&lt;/p&gt;
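&lt;p&gt;As a rough illustration, a handoff block in that format could be parsed with a few lines of Python. This is a minimal sketch, not the parser used in the actual system: the layout is inferred only from the example above, and &lt;code&gt;parse_file_blocks&lt;/code&gt; is a hypothetical helper name.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import re

# Assumed layout: a [FILE] line, a path line, then the body between two --- lines.
FILE_BLOCK = re.compile(r'\[FILE\]\npath: ([^\n]+)\n---\n(.*?)\n---', re.DOTALL)

def parse_file_blocks(output):
    '''Return a list of (path, body) pairs found in the model output.'''
    return [(path.strip(), body) for path, body in FILE_BLOCK.findall(output)]

sample = '[FILE]\npath: .handoff/task-1/spec.md\n---\nimplementation details here\n---\nSPEC_DONE'
blocks = parse_file_blocks(sample)
print(blocks)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;An empty list means the contract was not met, which the harness can detect without judging the quality of the content.&lt;/p&gt;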
&lt;h2&gt;How Validation Layers Make LLM Systems Reliable&lt;/h2&gt;
&lt;p&gt;Validation was not based on intuition.&lt;/p&gt;
&lt;p&gt;It was deterministic.&lt;/p&gt;
&lt;p&gt;For example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;if ! grep -q &amp;quot;VALIDATOR_DONE&amp;quot; output.txt; then
  echo &amp;quot;Validation failed&amp;quot;
  exit 1
fi&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Other validation checks included:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;required file existence&lt;/li&gt;
&lt;li&gt;schema validation&lt;/li&gt;
&lt;li&gt;forbidden output detection&lt;/li&gt;
&lt;li&gt;completion marker verification&lt;/li&gt;
&lt;li&gt;PASS / FAIL summaries&lt;/li&gt;
&lt;/ul&gt;
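&lt;p&gt;Several of these checks can be combined into one deterministic function. The sketch below is an assumption about how that might look, not the actual validator: &lt;code&gt;validate_step&lt;/code&gt; and its parameters are hypothetical names, and schema validation is left out to keep it short.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import os

def validate_step(output, marker, required_files, forbidden=()):
    '''Return a list of failure reasons. An empty list means PASS.'''
    failures = []
    # Completion marker verification
    if marker not in output:
        failures.append('missing completion marker ' + marker)
    # Required file existence
    for path in required_files:
        if not os.path.isfile(path):
            failures.append('required file was not created: ' + path)
    # Forbidden output detection
    for text in forbidden:
        if text in output:
            failures.append('forbidden output detected: ' + text)
    return failures

failures = validate_step('work finished\nVALIDATOR_DONE', 'VALIDATOR_DONE', [], forbidden=['TODO'])
print('PASS' if not failures else 'FAIL', failures)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Returning reasons instead of a bare boolean keeps each failure specific enough to feed back into a retry.&lt;/p&gt;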
&lt;p&gt;The question changed from:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“Does this look correct?”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;to:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“Did this satisfy the required conditions?”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This made retries possible.&lt;/p&gt;
&lt;p&gt;Because failure became measurable.&lt;/p&gt;
&lt;p&gt;And a measurable failure can be corrected.&lt;/p&gt;
&lt;h2&gt;Retry Loops: The Core of Convergence Systems&lt;/h2&gt;
&lt;p&gt;Retries were not random.&lt;/p&gt;
&lt;p&gt;They were guided by failure signals.&lt;/p&gt;
&lt;p&gt;When validation failed,&lt;br&gt;the next prompt included explicit correction feedback:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Previous output failed because:

- missing [FILE] block
- invalid completion marker
- required file was not created

Fix only these issues.
Do not rewrite unrelated sections.&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This made retries behave like&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;gradient-free optimization&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The model did not need gradients.&lt;/p&gt;
&lt;p&gt;It only needed clear failure signals.&lt;/p&gt;
&lt;p&gt;That was enough to create convergence.&lt;/p&gt;
&lt;p&gt;This is why production LLM systems require retry architecture.&lt;/p&gt;
&lt;p&gt;Not just better prompts.&lt;/p&gt;
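&lt;p&gt;A feedback-driven retry loop along these lines might look like the sketch below. It is an illustration under assumptions, not the production code: &lt;code&gt;build_retry_prompt&lt;/code&gt;, &lt;code&gt;converge&lt;/code&gt;, and the stand-in &lt;code&gt;fake_model&lt;/code&gt; are hypothetical, and the validator is expected to return a list of failure reasons.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def build_retry_prompt(base_prompt, failures):
    '''Append explicit correction feedback to the original prompt.'''
    lines = [base_prompt, '', 'Previous output failed because:', '']
    lines += ['- ' + reason for reason in failures]
    lines += ['', 'Fix only these issues.', 'Do not rewrite unrelated sections.']
    return '\n'.join(lines)

def converge(call_model, validate, base_prompt, max_attempts=3):
    '''Retry with failure feedback until validation passes or attempts run out.'''
    prompt = base_prompt
    failures = []
    for attempt in range(1, max_attempts + 1):
        output = call_model(prompt)
        failures = validate(output)
        if not failures:
            return output, attempt
        prompt = build_retry_prompt(base_prompt, failures)
    raise RuntimeError('no valid output after %d attempts: %s' % (max_attempts, failures))

# Stand-in model for the demo: it forgets the marker until the feedback mentions it.
def fake_model(prompt):
    return 'result\nSPEC_DONE' if 'missing' in prompt else 'result'

def check(output):
    return [] if 'SPEC_DONE' in output else ['missing completion marker SPEC_DONE']

output, attempts = converge(fake_model, check, 'Write the spec.')
print(attempts)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The retry prompt is always rebuilt from the original prompt plus the latest failures, so feedback does not pile up across attempts.&lt;/p&gt;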
&lt;h2&gt;Why Harness Engineering Matters More Than the Agent&lt;/h2&gt;
&lt;p&gt;Many people ask:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Why not just use Claude Code directly?&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The answer is simple:&lt;/p&gt;
&lt;p&gt;Interactive agents are excellent for humans.&lt;/p&gt;
&lt;p&gt;But automation requires deterministic control.&lt;/p&gt;
&lt;p&gt;Claude Code works well&lt;br&gt;when a human can intervene.&lt;/p&gt;
&lt;p&gt;But in a fully automated workflow:&lt;/p&gt;
&lt;p&gt;there is no human in the loop.&lt;/p&gt;
&lt;p&gt;The system must decide:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;did this step succeed?&lt;/li&gt;
&lt;li&gt;should we retry?&lt;/li&gt;
&lt;li&gt;what exactly failed?&lt;/li&gt;
&lt;li&gt;is it safe to continue?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;That requires a harness.&lt;/p&gt;
&lt;p&gt;Not just a better agent.&lt;/p&gt;
&lt;p&gt;The agent generates.&lt;/p&gt;
&lt;p&gt;The harness controls.&lt;/p&gt;
&lt;p&gt;That distinction is critical.&lt;/p&gt;
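&lt;p&gt;The four questions above can be reduced to a small decision function that the harness owns. This is only a sketch: &lt;code&gt;StepReport&lt;/code&gt; and &lt;code&gt;decide&lt;/code&gt; are hypothetical names, and a real harness would likely track more state.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;from dataclasses import dataclass, field

@dataclass
class StepReport:
    step: str
    failures: list = field(default_factory=list)  # what exactly failed?
    attempts: int = 1                             # how many tries so far

def decide(report, max_attempts=3):
    '''Return the next action for the harness: continue, retry, or halt.'''
    if not report.failures:
        return 'continue'  # the step succeeded, so it is safe to move on
    attempts_left = max(max_attempts - report.attempts, 0)
    return 'retry' if attempts_left else 'halt'

print(decide(StepReport('planner')))
print(decide(StepReport('spec', ['missing completion marker'], attempts=1)))
print(decide(StepReport('spec', ['missing completion marker'], attempts=3)))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The agent never makes this decision itself. It only produces output that the harness inspects.&lt;/p&gt;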
&lt;h2&gt;Why Reliable LLM Systems Depend on Architecture, Not Model Size&lt;/h2&gt;
&lt;p&gt;Many people believe:&lt;/p&gt;
&lt;p&gt;better models create better systems.&lt;/p&gt;
&lt;p&gt;In reality, it is often the opposite.&lt;/p&gt;
&lt;p&gt;Better systems make even average models reliable.&lt;/p&gt;
&lt;p&gt;Even small local models&lt;br&gt;can become powerful&lt;br&gt;inside the right architecture.&lt;/p&gt;
&lt;p&gt;On the other hand,&lt;/p&gt;
&lt;p&gt;even the strongest frontier models&lt;br&gt;become unstable&lt;/p&gt;
&lt;p&gt;when everything depends on one prompt&lt;br&gt;without validation.&lt;/p&gt;
&lt;p&gt;Reliability is not a model capability.&lt;/p&gt;
&lt;p&gt;It is:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;&lt;strong&gt;a system property&lt;/strong&gt;&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This is one of the most important lessons in AI system design.&lt;/p&gt;
&lt;h2&gt;Final Takeaway&lt;/h2&gt;
&lt;p&gt;The real question is not:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;How do we build smarter AI?&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The real question is:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;How do we build systems that do not collapse when failure happens?&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;The core of production AI systems&lt;br&gt;is not answer generation.&lt;/p&gt;
&lt;p&gt;It is:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;&lt;strong&gt;failure management&lt;/strong&gt;&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;That is where real engineering begins.&lt;/p&gt;
&lt;p&gt;And from that moment,&lt;/p&gt;
&lt;p&gt;we stop being Prompt Engineers.&lt;/p&gt;
&lt;p&gt;We become System Designers.&lt;/p&gt;
&lt;h2&gt;FAQ&lt;/h2&gt;
&lt;h3&gt;Why do LLM systems fail in production?&lt;/h3&gt;
&lt;p&gt;Because LLM outputs are probabilistic, not deterministic.&lt;/p&gt;
&lt;p&gt;Even with the same prompt, results can vary.&lt;br&gt;Without validation and recovery mechanisms, a single failure can break the entire system.&lt;/p&gt;
&lt;h3&gt;Is prompt engineering enough for production AI systems?&lt;/h3&gt;
&lt;p&gt;No.&lt;/p&gt;
&lt;p&gt;Prompt engineering improves output quality,&lt;br&gt;but it does not guarantee reliability.&lt;/p&gt;
&lt;p&gt;Production systems require validation layers, retry loops, and convergence mechanisms.&lt;/p&gt;
&lt;h3&gt;What is Harness Engineering?&lt;/h3&gt;
&lt;p&gt;Harness Engineering is the system design layer&lt;br&gt;that controls LLM behavior.&lt;/p&gt;
&lt;p&gt;It includes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;validation&lt;/li&gt;
&lt;li&gt;retries&lt;/li&gt;
&lt;li&gt;structured output contracts&lt;/li&gt;
&lt;li&gt;failure detection&lt;/li&gt;
&lt;li&gt;convergence architecture&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It is what makes LLM systems reliable.&lt;/p&gt;</description>
      <category>IT/AI</category>
      <category>AI agent architecture</category>
      <category>AI automation reliability</category>
      <category>convergence systems</category>
      <category>harness engineering</category>
      <category>LLM system design</category>
      <category>LLM validation layers</category>
      <category>local LLM reliability</category>
      <category>prompt engineering limitations</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/412</guid>
      <comments>https://yeti.tistory.com/412#entry412comment</comments>
      <pubDate>Thu, 23 Apr 2026 16:01:11 +0900</pubDate>
    </item>
    <item>
      <title>Why Prompt Engineering Alone Fails in LLM Systems (And How to Fix It with Convergence)</title>
      <link>https://yeti.tistory.com/411</link>
      <description>&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Lessons learned from building a real-world LLM coding agent with local models&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;TL;DR&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;LLMs are non-deterministic → same input, different outputs&lt;/li&gt;
&lt;li&gt;Pipeline architectures amplify failure probabilities&lt;/li&gt;
&lt;li&gt;Prompt engineering improves outputs but cannot guarantee reliability&lt;/li&gt;
&lt;li&gt;The real solution is not better prompts, but &lt;strong&gt;convergence systems&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;1. Problem — You Can’t Even Get Stable Outputs&lt;/h2&gt;
&lt;p&gt;I wanted to build a local LLM-powered coding assistant.&lt;/p&gt;
&lt;p&gt;So I set up:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Mac Studio&lt;/li&gt;
&lt;li&gt;Ollama&lt;/li&gt;
&lt;li&gt;Claude Code CLI&lt;/li&gt;
&lt;li&gt;qwen3.5&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then I tried the simplest possible task:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-text&quot;&gt;Build a simple API&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;But the results were unstable:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sometimes no output at all&lt;/li&gt;
&lt;li&gt;Sometimes excessive file exploration (over-exploration)&lt;/li&gt;
&lt;li&gt;Sometimes the task never completed&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The problem wasn’t correctness.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;The problem was that I couldn’t reliably get results at all.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;2. Observation — Small Tasks Work&lt;/h2&gt;
&lt;p&gt;After multiple attempts, I noticed a pattern:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Local LLMs perform much better on small, well-defined tasks.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Implementing a single function&lt;/li&gt;
&lt;li&gt;Fixing a specific bug&lt;/li&gt;
&lt;li&gt;Tasks with clear input/output&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This led to an important insight:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“Break the problem down into smaller pieces.”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;3. Approach — Role Decomposition&lt;/h2&gt;
&lt;p&gt;Instead of one large prompt, I split the task into stages:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;[Analyze] → [Design] → [Implement]&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Each step:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Has a narrow scope&lt;/li&gt;
&lt;li&gt;Produces structured output&lt;/li&gt;
&lt;li&gt;Can be validated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This significantly improved success rates (in manual runs).&lt;/p&gt;
&lt;h2&gt;4. Scaling Up — Pipeline Automation&lt;/h2&gt;
&lt;p&gt;Naturally, the next step was:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“Let’s automate this workflow.”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;So I built a pipeline:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;User Input
   ↓
[Analyze] → [Design] → [Implement]
   ↓
 Final Output&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;5. Problem — The Pipeline Breaks Easily&lt;/h2&gt;
&lt;p&gt;After automation, new issues appeared:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Sometimes it works&lt;/li&gt;
&lt;li&gt;Sometimes it completely fails&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The key issue:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;A single failure breaks the entire pipeline.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;6. Why Pipelines Fail&lt;/h2&gt;
&lt;h3&gt;6.1 LLMs Are Non-Deterministic&lt;/h3&gt;
&lt;p&gt;Unlike traditional systems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Traditional system: same input → same output&lt;/li&gt;
&lt;li&gt;LLM: same input → probabilistic output&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;6.2 Probability Compounding&lt;/h3&gt;
&lt;p&gt;If step \(i\) succeeds with probability \(p_i\), and the steps fail independently, a three-step pipeline succeeds with probability:&lt;/p&gt;
&lt;p&gt;\[P_{total} = p_1 \times p_2 \times p_3\]&lt;/p&gt;
&lt;p&gt;As the number of steps increases, total success probability drops rapidly.&lt;/p&gt;
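&lt;p&gt;To make the compounding concrete, here is a small sketch. The per-step success rate of 0.9 is an illustrative assumption, not a measurement from this system, and the steps are assumed to fail independently.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;import math

def pipeline_success(step_probs):
    '''Probability that every step succeeds, assuming independent steps.'''
    return math.prod(step_probs)

three_steps = pipeline_success([0.9, 0.9, 0.9])
ten_steps = pipeline_success([0.9] * 10)
print(round(three_steps, 3), round(ten_steps, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;With these illustrative numbers, three steps succeed together about 73% of the time, and ten steps only about 35%.&lt;/p&gt;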
&lt;h3&gt;6.3 Manual vs Automated Execution&lt;/h3&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Manual&lt;/th&gt;
&lt;th&gt;Automated&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;Human intervention&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error recovery&lt;/td&gt;
&lt;td&gt;Possible&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Progress condition&lt;/td&gt;
&lt;td&gt;Partial success&lt;/td&gt;
&lt;td&gt;Full success&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;p&gt;Pipelines require &lt;strong&gt;every step to succeed every time&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;7. The Real Problem&lt;/h2&gt;
&lt;p&gt;Initially, I thought:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“We need better prompts.”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;But the real issue was:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;“How do we handle failures?”&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;This is not a prompt problem.&lt;/p&gt;
&lt;p&gt;It is a &lt;strong&gt;system design problem&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;8. Solution — Convergence System&lt;/h2&gt;
&lt;p&gt;Instead of a linear pipeline, I redesigned the system as a &lt;strong&gt;convergence loop&lt;/strong&gt;.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;         LLM Call
             ↓
        Validation
        /        \
     OK           FAIL
     ↓            ↓
  Accept        Retry&lt;/code&gt;&lt;/pre&gt;&lt;h2&gt;9. Implementation — Retry + Validation&lt;/h2&gt;
&lt;h3&gt;9.1 Retry Loop&lt;/h3&gt;
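&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def run_with_retry(task_fn, validate_fn, max_retry=3):
    # Call task_fn up to max_retry times and return the first result that validates.
    for attempt in range(max_retry):
        result = task_fn()

        if validate_fn(result):
            return result

    # Never hand an unvalidated result to the next step; fail loudly instead.
    raise RuntimeError(f&amp;quot;no valid result after {max_retry} attempts&amp;quot;)&lt;/code&gt;&lt;/pre&gt;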
&lt;h3&gt;9.2 Validation Example&lt;/h3&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def validate_code(result):
    if &amp;quot;```&amp;quot; not in result:
        return False
    if &amp;quot;TODO&amp;quot; in result:
        return False
    return True&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;9.3 Step Isolation&lt;/h3&gt;
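&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;# Each stage runs in its own retry loop (run_with_retry from 9.1).
# validate_analysis and validate_design are placeholder validators like validate_code.
analysis = run_with_retry(lambda: analyze(user_input), validate_analysis)
design_doc = run_with_retry(lambda: design(analysis), validate_design)
code = run_with_retry(lambda: implement(design_doc), validate_code)&lt;/code&gt;&lt;/pre&gt;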
&lt;p&gt;Each step is independently validated and recoverable.&lt;/p&gt;
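&lt;p&gt;Why per-step retries help can be estimated with a short calculation. This is a sketch under two loud assumptions: each attempt succeeds independently, and the 0.9 success rate is illustrative rather than measured. &lt;code&gt;with_retries&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-python&quot;&gt;def with_retries(p, attempts):
    '''Chance that at least one of several independent attempts succeeds.'''
    return 1 - (1 - p) ** attempts

no_retry = 0.9 ** 3                   # three steps, one attempt each
retried = with_retries(0.9, 3) ** 3   # three steps, up to three attempts each
print(round(no_retry, 3), round(retried, 3))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Under these assumptions, end-to-end success rises from about 0.73 to about 0.997. Real retries against the same model are correlated, so treat this as an optimistic estimate rather than a prediction.&lt;/p&gt;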
&lt;h2&gt;10. Results&lt;/h2&gt;
&lt;p&gt;After introducing convergence mechanisms:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Reduced over-exploration&lt;/li&gt;
&lt;li&gt;Fewer pipeline failures&lt;/li&gt;
&lt;li&gt;More consistent outputs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The most important change:&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;The system started working by design, not by luck.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;11. Final Takeaway&lt;/h2&gt;
&lt;p&gt;Prompt engineering matters.&lt;/p&gt;
&lt;p&gt;But it is not enough for automation.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;LLM systems are not about generating correct answers.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;They are about &lt;strong&gt;controlling incorrect ones.&lt;/strong&gt;&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;</description>
      <category>IT/AI</category>
      <category>ai agent</category>
      <category>ai automation</category>
      <category>AI System Design</category>
      <category>claude code</category>
      <category>llm</category>
      <category>ollama</category>
      <category>Prompt Engineering</category>
      <category>qwen</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/411</guid>
      <comments>https://yeti.tistory.com/411#entry411comment</comments>
      <pubDate>Mon, 13 Apr 2026 16:53:14 +0900</pubDate>
    </item>
    <item>
      <title>Which AI Tools Should You Use? Cost-Centered Selection Criteria for an Independent Developer in 2026</title>
      <link>https://yeti.tistory.com/410</link>
      <description>&lt;p&gt;Hello, this is yeti.&lt;/p&gt;
&lt;p&gt;Today I want to share the &lt;strong&gt;AI development environment and AI coding tool strategy&lt;/strong&gt; that I actually run.&lt;/p&gt;
&lt;p&gt;This post is not a simple &lt;em&gt;AI tool recommendation&lt;/em&gt;.&lt;br&gt;It is a record of &lt;strong&gt;the criteria an independent developer uses to choose tools when paying for AI development out of pocket&lt;/strong&gt;.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;I pay for every AI tool myself.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;So this post is not about productivity. It is about &lt;strong&gt;a strategy for using AI coding tools while keeping costs under control&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Problem: AI Development Costs Grow Faster Than You Expect&lt;/h2&gt;
&lt;p&gt;Leaning heavily on AI coding clearly raises productivity.&lt;br&gt;But &lt;strong&gt;AI token costs and subscription fees&lt;/strong&gt; rise quickly at the same time.&lt;/p&gt;
&lt;p&gt;In November and December 2025, as my Cursor IDE usage grew, my monthly cost climbed to 150,000 to 200,000 KRW.&lt;/p&gt;
&lt;p&gt;Leaning into vibe coding raised my development productivity, but the cost came straight out of my own working budget.&lt;/p&gt;
&lt;p&gt;What I realized then was simple.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;AI is a productivity tool, but it is also infrastructure that costs money.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;h2&gt;Strategy Change: Distribute, Don’t Consolidate&lt;/h2&gt;
&lt;p&gt;Instead of consolidating on a single AI tool, I decided to &lt;strong&gt;distribute my AI development workflow by role&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;My current AI development setup looks like this.&lt;/p&gt;
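&lt;pre&gt;&lt;code&gt;[High-level design]
ChatGPT
   ↓ create plan.md

[Code-grounded design refinement]
Codex
   ↓ refine plan.md

[Backend development / AI refactoring]
Antigravity

[Frontend development / infrastructure operations]
Cursor IDE

[Code review / verification]
Gemini CLI&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The key is &lt;strong&gt;distributing AI tools and separating their roles&lt;/strong&gt;.&lt;/p&gt;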
&lt;h2&gt;How I Use Each Tool (An AI Coding Tool Comparison)&lt;/h2&gt;
&lt;h3&gt;ChatGPT — AI Design Tool&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Defining the service plan&lt;/li&gt;
&lt;li&gt;Architecture design&lt;/li&gt;
&lt;li&gt;Writing agent specifications&lt;/li&gt;
&lt;li&gt;Creating the &lt;code&gt;plan.md&lt;/code&gt; document&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I use ChatGPT as &lt;strong&gt;an AI design tool rather than an AI code generation tool&lt;/strong&gt;.&lt;br&gt;Rather than generating large amounts of code, I focus on organizing development specifications.&lt;/p&gt;
&lt;h3&gt;Codex — AI Code Refinement Tool&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Code-grounded design refinement&lt;/li&gt;
&lt;li&gt;Implementing small features&lt;/li&gt;
&lt;li&gt;Bug fixes&lt;/li&gt;
&lt;li&gt;Drafting GitLab issues&lt;/li&gt;
&lt;li&gt;Drafting merge requests (MRs)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Codex plays a role closer to &lt;strong&gt;AI development workflow automation&lt;/strong&gt; than to large-scale code generation.&lt;/p&gt;
&lt;p&gt;Recently I tried handing all of the following tasks to Codex:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Refactoring&lt;/li&gt;
&lt;li&gt;Problem analysis&lt;/li&gt;
&lt;li&gt;Writing GitLab issues&lt;/li&gt;
&lt;li&gt;Writing development plans&lt;/li&gt;
&lt;li&gt;Creating MRs&lt;/li&gt;
&lt;li&gt;Writing responses to MR reviews&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This confirmed that AI can serve as &lt;strong&gt;a development process automation tool&lt;/strong&gt;, not just a code generator.&lt;/p&gt;
&lt;h3&gt;Antigravity: dedicated to AI refactoring&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Improving backend structure&lt;/li&gt;
&lt;li&gt;Large-scale refactoring&lt;/li&gt;
&lt;li&gt;Test-based fixes&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;After Antigravity succeeded at a refactoring that Cursor IDE had failed, I have been &lt;strong&gt;handling AI refactoring separately&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Even on the free plan, it produced stable results.&lt;/p&gt;
&lt;h3&gt;Cursor IDE: AI frontend development tool&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Frontend implementation&lt;/li&gt;
&lt;li&gt;Deployment configuration&lt;/li&gt;
&lt;li&gt;Infrastructure work&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I now use it as a &lt;strong&gt;frontend-focused implementation tool&lt;/strong&gt; rather than as my primary AI coding tool.&lt;/p&gt;
&lt;h3&gt;Gemini CLI: AI code review tool&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Code review&lt;/li&gt;
&lt;li&gt;Verification based on static files&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I use it not as an executor but as a &lt;strong&gt;low-cost AI code review tool&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Claude Code comparison: why am I not using it?&lt;/h2&gt;
&lt;p&gt;Claude Code is known as a powerful AI coding tool.&lt;/p&gt;
&lt;p&gt;But my reason for not adopting it yet is clear.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;There was no free plan, so I could not try it out casually,&lt;br&gt;and given my AI development cost structure, it was hard to commit to switching right away.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;My criterion for choosing AI tools is not raw performance but &lt;strong&gt;a cost structure I can afford and the ability to experiment&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;It is not a question of which tool is better; from an AI tool comparison standpoint, it is a question of &lt;strong&gt;fit with my current operating strategy&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;AI development cost structure in 2026&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;$20 / month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cursor IDE&lt;/td&gt;
&lt;td&gt;$20 / month&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;Included in the ChatGPT plan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Antigravity&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
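&lt;p&gt;As a rough check on the table above, the fixed monthly spend can be added up. This is a minimal sketch: the dictionary and function names are illustrative assumptions, tools listed as free or included are counted as 0, and any usage-based charges beyond the plan quotas are not modeled.&lt;/p&gt;

```python
# Monthly subscription cost per tool in USD, taken from the table above.
MONTHLY_COST_USD = {
    "ChatGPT": 20,
    "Cursor IDE": 20,
    "Codex": 0,        # included in the ChatGPT plan
    "Antigravity": 0,  # free plan
    "Gemini CLI": 0,   # free plan
}


def monthly_total(costs):
    """Sum the fixed monthly subscription cost across all tools."""
    return sum(costs.values())


def paid_tools(costs):
    """Return the names of tools with a nonzero monthly cost, sorted by name."""
    return sorted(name for name, usd in costs.items() if usd != 0)


if __name__ == "__main__":
    total = monthly_total(MONTHLY_COST_USD)
    print(f"Paid tools: {', '.join(paid_tools(MONTHLY_COST_USD))}")
    print(f"Fixed monthly cost: ${total}")
    print(f"Annualized: ${total * 12}")
```

&lt;p&gt;Under these assumptions the fixed cost comes to $40 per month, and only two of the five tools carry a subscription fee.&lt;/p&gt;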
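&lt;p&gt;As an independent developer, my AI tool operating strategy is as follows.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Make full use of free plans&lt;/li&gt;
&lt;li&gt;Separate out token-heavy AI tasks&lt;/li&gt;
&lt;li&gt;Minimize the cost of retries&lt;/li&gt;
&lt;li&gt;Standardize design documents&lt;/li&gt;
&lt;/ul&gt;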
&lt;h2&gt;Whether a company covers the cost&lt;/h2&gt;
&lt;p&gt;The picture can differ depending on whether your company pays for AI tools.&lt;/p&gt;
&lt;p&gt;If the company covers the cost, the following strategies become possible.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Using higher-tier plans&lt;/li&gt;
&lt;li&gt;Operating around a single integrated platform&lt;/li&gt;
&lt;li&gt;Maximizing productivity&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But when you pay for AI tools out of your own pocket, things change.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;Failure = direct cost&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;So I chose the following.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Distribution over consolidation&lt;/li&gt;
&lt;li&gt;Cost stability over convenience&lt;/li&gt;
&lt;li&gt;Role separation instead of always-on use of high-performance models&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Conclusion: what matters more than AI tool recommendations&lt;/h2&gt;
&lt;p&gt;AI tool recommendations depend on the situation.&lt;/p&gt;
&lt;p&gt;But for an independent developer, the criterion is clear.&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;The criterion is not the performance of an AI tool but the cost structure I can afford.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;I do not use AI as an “unlimited productivity tool.”&lt;/p&gt;
&lt;blockquote data-ke-style=&quot;style1&quot;&gt;&lt;p data-ke-size=&quot;size16&quot;&gt;&lt;span style=&quot;font-family: 'Noto Serif KR';&quot;&gt;&lt;p&gt;I treat it as infrastructure that costs money.&lt;/p&gt;
&lt;/span&gt;&lt;/p&gt;&lt;/blockquote&gt;&lt;p&gt;I am still experimenting.&lt;br&gt;However, I do not run experiments without controlling their cost.&lt;/p&gt;
&lt;p&gt;That concludes my record of an independent developer's AI tool operating strategy as of 2026.&lt;/p&gt;</description>
      <category>IT/AI</category>
      <category>ai 개발 비용</category>
      <category>AI 개발 환경</category>
      <category>ai 도구 비교</category>
      <category>ai 도구 추천</category>
      <category>ai 코딩 도구</category>
      <category>AI 토큰 비용</category>
      <category>ChatGPT 개발 활용</category>
      <category>Codex 사용법</category>
      <category>Cursor IDE 비용</category>
      <category>독립 개발자 AI</category>
      <author>yeTi</author>
      <guid isPermaLink="true">https://yeti.tistory.com/410</guid>
      <comments>https://yeti.tistory.com/410#entry410comment</comments>
      <pubDate>Thu, 12 Feb 2026 11:13:29 +0900</pubDate>
    </item>
  </channel>
</rss>