๊ด€๋ฆฌ ๋ฉ”๋‰ด

์žก๋™์‚ฌ๋‹ˆ

Why Prompt Engineering Alone Fails in LLM Systems (And How to Fix It with Convergence) ๋ณธ๋ฌธ

IT/AI

Why Prompt Engineering Alone Fails in LLM Systems (And How to Fix It with Convergence)

yeTi 2026. 4. 13. 16:53

Lessons learned from building a real-world LLM coding agent with local models

๐Ÿ“Œ TL;DR

  • LLMs are non-deterministic โ†’ same input, different outputs
  • Pipeline architectures amplify failure probabilities
  • Prompt engineering improves outputs but cannot guarantee reliability
  • The real solution is not better prompts, but convergence systems

1. Problem โ€” You Canโ€™t Even Get Stable Outputs

I wanted to build a local LLM-powered coding assistant.

So I set up:

  • Mac Studio
  • Ollama
  • Claude Code CLI
  • qwen3.5

Then I tried the simplest possible task:

Build a simple API

But the results were unstable:

  • Sometimes no output at all
  • Sometimes excessive file exploration (over-exploration)
  • Sometimes the task never completed

The problem wasnโ€™t correctness.

The problem was that I couldnโ€™t reliably get results at all.

2. Observation โ€” Small Tasks Work

After multiple attempts, I noticed a pattern:

Local LLMs perform much better on small, well-defined tasks.

For example:

  • Implementing a single function
  • Fixing a specific bug
  • Tasks with clear input/output

This led to an important insight:

โ€œBreak the problem down into smaller pieces.โ€

3. Approach โ€” Role Decomposition

Instead of one large prompt, I split the task into stages:

[Analyze] โ†’ [Design] โ†’ [Implement]

Each step:

  • Has a narrow scope
  • Produces structured output
  • Can be validated

This significantly improved success rates (in manual runs).
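The split can be expressed as one narrow prompt template per stage. A minimal sketch — the template wording and the `build_stage_prompt` helper are hypothetical, not the exact prompts from this experiment:

```python
# One narrow, structured prompt per stage (template wording is hypothetical).
STAGE_PROMPTS = {
    "analyze":   "List the concrete requirements for this task:\n{task}",
    "design":    "Propose a file/function layout for these requirements:\n{analysis}",
    "implement": "Write the code for this design, no TODOs:\n{design}",
}

def build_stage_prompt(stage, **context):
    """Fill in the template for a single, narrowly scoped stage."""
    return STAGE_PROMPTS[stage].format(**context)

prompt = build_stage_prompt("analyze", task="Build a simple API")
```

Keeping each template narrow is what makes each stage's output small enough to validate.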

4. Scaling Up โ€” Pipeline Automation

Naturally, the next step was:

โ€œLetโ€™s automate this workflow.โ€

So I built a pipeline:

User Input
   โ†“
[Analyze] โ†’ [Design] โ†’ [Implement]
   โ†“
 Final Output
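The wiring above can be sketched as a plain function chain. The stage functions here are toy stand-ins for LLM calls; the point to notice is that nothing catches a failed stage:

```python
def run_pipeline(user_input, stages):
    """Linear pipeline: each stage consumes the previous stage's output.
    An exception in any stage aborts the whole run -- there is no recovery."""
    result = user_input
    for stage in stages:
        result = stage(result)  # one failure here kills everything downstream
    return result

# Toy stand-in stages: a real run would call the LLM at each step.
stages = [
    lambda task: f"analysis of: {task}",
    lambda analysis: f"design from: {analysis}",
    lambda plan: f"code for: {plan}",
]
output = run_pipeline("Build a simple API", stages)
```

This fragility is exactly what the next section runs into.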

5. Problem โ€” The Pipeline Breaks Easily

After automation, new issues appeared:

  • Sometimes it works
  • Sometimes it completely fails

The key issue:

A single failure breaks the entire pipeline.

6. Why Pipelines Fail

6.1 LLMs Are Non-Deterministic

Unlike traditional systems:

  • Same input → same output: not guaranteed
  • Same input → a distribution of possible outputs
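To make the non-determinism concrete, here is a toy sketch of temperature sampling, the decoding step where the randomness enters. The logits are made-up scores for three candidate tokens, not output from any real model:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature=1.0):
    """Sample one token index, as an LLM decoder does at temperature > 0."""
    probs = softmax(logits, temperature)
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

# Identical input, repeated calls: the sampled token varies between runs.
logits = [2.0, 1.5, 0.5]  # made-up scores for three candidate tokens
outputs = [sample_token(logits) for _ in range(10)]
```

Every generated token rolls these dice, so a long multi-step response has many chances to drift.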

6.2 Probability Compounding

If each step succeeds independently with probability ( p_i ):

P_{total} = p_1 \times p_2 \times p_3

As the number of steps increases, total success probability drops rapidly.
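Plugging in numbers makes the compounding effect obvious. A quick sketch:

```python
def pipeline_success_probability(step_probs):
    """Overall pipeline success is the product of per-step probabilities."""
    total = 1.0
    for p in step_probs:
        total *= p
    return total

# Three stages at 90% reliability each -> only ~73% end-to-end.
three_steps = pipeline_success_probability([0.9, 0.9, 0.9])

# Ten stages at 90% each -> roughly a coin flip gone wrong (~35%).
ten_steps = pipeline_success_probability([0.9] * 10)
```

Per-step reliability that feels "pretty good" in isolation is nowhere near good enough once steps are chained.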

6.3 Manual vs Automated Execution

| Aspect | Manual | Automated |
| --- | --- | --- |
| Human intervention | Yes | No |
| Error recovery | Possible | None |
| Progress condition | Partial success | Full success |

Pipelines require every step to succeed every time.

7. The Real Problem

Initially, I thought:

โ€œWe need better prompts.โ€

But the real issue was:

โ€œHow do we handle failures?โ€

This is not a prompt problem.

It is a system design problem.

8. Solution โ€” Convergence System

Instead of a linear pipeline, I redesigned the system as a convergence loop.

         LLM Call
             โ†“
        Validation
        /        \
     OK           FAIL
     โ†“            โ†“
  Accept        Retry

9. Implementation โ€” Retry + Validation

9.1 Retry Loop

def run_with_retry(task_fn, validate_fn, max_retry=3):
    for attempt in range(max_retry):
        result = task_fn()

        if validate_fn(result):
            return result

    # All attempts failed validation: fail loudly instead of
    # silently returning the last (invalid) result.
    raise RuntimeError(f"no valid result after {max_retry} attempts")

9.2 Validation Example

def validate_code(result):
    # Reject responses that contain no fenced code block at all.
    if "```" not in result:
        return False
    # Reject incomplete implementations left as TODO stubs.
    if "TODO" in result:
        return False
    return True

9.3 Step Isolation

analysis = analyze(user_input)   # avoid shadowing the built-in input()
plan = design(analysis)          # avoid rebinding the design() function itself
code = implement(plan)

Each step is independently validated and recoverable.
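Putting the pieces together: each stage call is wrapped in its own retry-plus-validation loop, so a transient failure in one stage is retried locally instead of aborting the whole run. The stage functions and the `non_empty` validator below are hypothetical stand-ins for illustration:

```python
def run_with_retry(task_fn, validate_fn, max_retry=3):
    """Retry a step until its output validates; fail loudly otherwise."""
    for _ in range(max_retry):
        result = task_fn()
        if validate_fn(result):
            return result
    raise RuntimeError(f"no valid result after {max_retry} attempts")

# Hypothetical stand-ins for the three stages (a real run calls the LLM).
def analyze(task):
    return {"requirements": [task]}

def design(analysis):
    return {"modules": analysis["requirements"]}

def implement(plan):
    return "def handler():\n    return 'ok'"

non_empty = bool  # minimal validator: reject empty/None outputs

task = "Build a simple API"
analysis = run_with_retry(lambda: analyze(task), non_empty)
plan = run_with_retry(lambda: design(analysis), non_empty)
code = run_with_retry(lambda: implement(plan), non_empty)
```

Because each stage converges on a validated output before the next one starts, a single bad sample no longer propagates downstream.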

10. Results

After introducing convergence mechanisms:

  • Reduced over-exploration
  • Fewer pipeline failures
  • More consistent outputs

The most important change:

The system started working by design, not by luck.

11. Final Takeaway

Prompt engineering matters.

But it is not enough for automation.

LLM systems are not about generating correct answers.

They are about controlling incorrect ones.
