How Role Separation Reduced Execution Drift in Multi-Agent Systems
Lessons from building reliable AI agent workflows with Hermes and local LLMs
TL;DR
- Multi-agent systems become unstable when responsibilities overlap
- Stronger models do not automatically improve workflow convergence
- Shared context without ownership boundaries creates execution drift
- Separating Planner, Implementer, and Validator responsibilities significantly improved workflow stability
- The Implementer should apply contracts precisely, not redesign the system during execution
1. The Problem — Execution Drift in Multi-Agent Workflows
When I first started building AI agent systems, I assumed the main problem was model capability.
If the model became smarter, the workflow would become more reliable.
That assumption turned out to be incomplete.
While experimenting with local LLM-based coding agents using:
- Hermes Agent
- Claude Code CLI
- Ollama
- local Qwen models
- Discord-based orchestration
I repeatedly encountered the same failure pattern.
At first, using a single agent felt efficient.
The same agent would:
- plan tasks
- write code
- debug failures
- retry execution
- validate outputs
- redesign architecture during retries
Everything happened inside one large shared context.
Initially, this looked flexible.
But as workflows became larger, execution stability degraded rapidly.
I started seeing problems such as:
- endless retry loops
- inconsistent file structures
- duplicated abstractions
- rewritten interfaces during execution
- architectural drift between retries
- increasing divergence from the original task
The workflow often looked productive.
But convergence became worse over time.
Eventually, I realized I was not only dealing with model errors.
I was dealing with execution drift.
2. Stronger Models Still Drifted During Execution
One surprising realization was that stronger models did not fundamentally solve the problem.
Larger models often generated better local outputs.
However, workflow instability remained.
In some cases, stronger models amplified instability because they became more willing to reinterpret previous decisions during execution.
For example, an Implementer agent might:
- rename directories during retries
- introduce new abstractions mid-execution
- redefine interfaces that were already agreed upon
- restructure unrelated components while fixing a local issue
At first, this behavior appeared intelligent.
The model looked proactive and adaptive.
However, execution reliability became worse.
The model attempted to optimize locally during retries.
Instead of treating the existing structure as a fixed contract, it continuously searched for “better” architectures.
As retries accumulated, small local optimizations gradually destabilized the workflow itself.
Eventually, the workflow became harder to reason about after every retry.
This led me to an important realization:
Reliability problems in multi-agent systems are often coordination problems.
The workflow was unstable not because the model was incapable, but because responsibilities were unclear.
3. Shared Context Without Ownership Creates Instability
One of the biggest problems in multi-agent systems is uncontrolled shared context.
At first, shared memory feels efficient because every agent can access the same information.
However, in practice, this often removes ownership boundaries.
Once ownership becomes unclear, responsibilities begin overlapping.
For example:
- the Planner modifies implementation details
- the Implementer redesigns architecture decisions
- the Validator proposes alternative execution strategies
- retry loops introduce conflicting interpretations
Eventually, the workflow loses convergence.
The issue is not that the agents are unintelligent.
The issue is that every agent is allowed to make every type of decision.
This creates architectural instability.
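The difference between uncontrolled sharing and ownership can be sketched in a few lines. This is a minimal illustration, not any framework's actual API: the context fields, role names, and `scoped_view` helper are all hypothetical, but they show the shape of the boundary: every role may read the shared state, while writes are restricted to the keys that role owns.

```python
from types import MappingProxyType

# One shared context that, without boundaries, every agent could rewrite freely.
shared = {
    "plan": ["step 1", "step 2"],
    "code": {},
    "validation": [],
}

def scoped_view(context: dict, writable: set):
    """Return a read-only view of the context plus a writer restricted to owned keys."""
    def write(key, value):
        if key not in writable:
            # Writes outside the role's ownership are rejected, not merged.
            raise PermissionError(f"'{key}' is not owned by this role")
        context[key] = value
    return MappingProxyType(context), write

# Each role reads everything but writes only what it owns.
planner_read, planner_write = scoped_view(shared, writable={"plan"})
impl_read, impl_write = scoped_view(shared, writable={"code"})

impl_write("code", {"main.py": "..."})  # allowed: the Implementer owns "code"
# impl_write("plan", [])  # would raise PermissionError: "plan" is Planner-owned
```

The point is not the mechanism but the asymmetry: read access can stay broad while write access stays narrow, which is exactly the boundary that uncontrolled shared memory erases.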
While debugging these workflows, I realized the problem felt surprisingly familiar.
It resembled a classic problem from object-oriented design.
4. The Object-Oriented Design Parallel
In object-oriented programming, responsibility separation is considered one of the most important design principles.
The same idea appears repeatedly in concepts such as:
- Single Responsibility Principle (SRP)
- high cohesion
- low coupling
- ownership boundaries
The core idea is simple:
Systems become difficult to reason about when responsibilities overlap.
That idea closely matched what I was seeing in agent systems.
In traditional software systems:
- a service should not own every responsibility
- a class should not make every decision
- a module should not redefine another module’s contract
The same pattern appeared inside multi-agent workflows.
When every agent could:
- plan
- implement
- redesign
- validate
- reinterpret contracts during retries
workflow stability degraded rapidly.
At some point, I stopped thinking about agents as “smart tools.”
I started thinking about them as independently evolving components inside a distributed system.
That perspective changed how I designed workflows afterward.
5. The Implementer Should Not Redesign the System
One specific failure pattern repeatedly caused instability in my workflows.
The Implementer agent would begin modifying architectural decisions during execution.
For example:
- changing directory structures during retries
- introducing new abstractions unrelated to the original task
- rewriting task boundaries while fixing local errors
- redefining interfaces that other agents already depended on
Again, this looked intelligent and proactive at first, but execution reliability became significantly worse.
Every retry introduced additional design changes.
As those changes accumulated, the workflow continuously drifted away from the original contract.
Eventually, retries stopped behaving like recovery mechanisms.
They became architecture mutation loops.
This problem became especially severe when the Implementer shared the same broad context as the Planner.
The Implementer gradually started behaving like another Planner.
That overlap destabilized the workflow.
Eventually, I realized the problem resembled a classic object-oriented design issue.
An object becomes difficult to reason about when it owns too many responsibilities.
The same pattern appeared in agent systems.
The Implementer should not make new decisions during execution.
Its role is to apply the already defined contract as precisely as possible.
Once I separated:
- planning responsibilities
- execution responsibilities
- validation responsibilities
workflow convergence improved significantly.
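One way to make "the Implementer applies the contract, and nothing else" concrete is to freeze the Planner's decisions into a data structure and check every proposed edit against it. The following is a minimal sketch under assumptions of my own: the `Contract` fields, the file paths, and `implementer_apply` are hypothetical, not part of Hermes or any library.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Contract:
    """Decisions owned by the Planner; the Implementer may not change them."""
    task: str
    allowed_paths: frozenset                     # files the Implementer may touch
    interfaces: dict = field(default_factory=dict)  # name -> fixed signature

def implementer_apply(contract: Contract, proposed_edits: dict) -> dict:
    """Accept only edits that stay inside the contract's ownership boundary."""
    violations = [p for p in proposed_edits if p not in contract.allowed_paths]
    if violations:
        # Out-of-scope edits are rejected rather than silently applied:
        # redesign belongs to the Planner, not the Implementer.
        raise PermissionError(f"edits outside contract: {violations}")
    return proposed_edits

contract = Contract(
    task="add retry backoff to the HTTP client",
    allowed_paths=frozenset({"client/http.py", "tests/test_http.py"}),
)
edits = implementer_apply(contract, {"client/http.py": "...patched code..."})
```

Making the contract a frozen value is the design choice that matters: the Implementer can read it but has no channel through which to renegotiate it mid-execution.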
6. The Harness Layer — Controlling Convergence
Role separation alone does not guarantee convergence.
The workflow still requires a control layer that verifies whether execution remains aligned with the original contract.
That became the responsibility of the Harness layer.
The Harness layer acts as a convergence controller.
It determines:
- whether retries should continue
- whether execution drift exceeded acceptable boundaries
- whether rollback is necessary
- whether the workflow should terminate
For example, if retries continuously modified unrelated files or redefined existing interfaces, the Harness layer treated the execution as divergence rather than recovery.
That distinction became important.
Without convergence control, retries often amplified instability instead of resolving failures.
The Harness layer then managed:
- retries
- convergence loops
- execution stabilization
- workflow validation
This architecture became significantly more stable than relying on a single highly capable agent operating inside a large shared context.
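The Harness decision above can be sketched as a small classifier over retry signals. Everything here is a hypothetical simplification of my setup: the drift signals (`touched_files`, `interfaces_changed`) and the verdict names are illustrative, and a real controller would track more state than one step sees.

```python
from enum import Enum

class Verdict(Enum):
    CONTINUE = "continue"   # retry again
    ROLLBACK = "rollback"   # drift exceeded bounds; restore last good state
    STOP = "stop"           # retry budget exhausted; terminate the workflow

def harness_step(attempt: int, touched_files: set, allowed_files: set,
                 interfaces_changed: bool, max_attempts: int = 5) -> Verdict:
    """Classify one retry as convergence (keep going) or divergence (roll back)."""
    out_of_scope = touched_files - allowed_files
    if out_of_scope or interfaces_changed:
        # Edits to unrelated files or redefined interfaces are treated
        # as divergence, not recovery.
        return Verdict.ROLLBACK
    if attempt >= max_attempts:
        return Verdict.STOP
    return Verdict.CONTINUE
```

The useful property is that the drift check runs before the retry budget check: a retry that mutates the architecture is rolled back immediately, no matter how many attempts remain.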
7. My Current Multi-Agent Structure
My current workflows are increasingly organized around ownership boundaries.
A simplified structure looks like this:
PM Agent
↓
Planner Agent
↓
Implementer Agent
↓
Validator Agent
↓
Harness Layer
Each role owns a different category of decisions.
That ownership is important.
Planner
Responsible for:
- execution strategy
- task decomposition
- contract definition
But not responsible for execution changes during runtime.
Implementer
Responsible for:
- applying predefined contracts
- writing code
- executing tasks precisely
But not responsible for redesigning architecture.
Validator
Responsible for:
- invariant verification
- semantic validation
- execution correctness checks
But not responsible for redefining execution strategy.
As ownership boundaries became clearer, workflow behavior became significantly easier to reason about.
Execution drift decreased.
Retries became more predictable.
And convergence stability improved substantially.
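The pipeline above can be wired as a chain of one-way handoffs. This is a toy sketch, not my actual orchestration code: the stage callables, the stub agents, and the contract fields are all stand-ins chosen to show the ownership split.

```python
def run_workflow(goal, plan, implement, validate, harness, max_retries=3):
    """Each stage owns one category of decisions; data flows one way."""
    contract = plan(goal)                       # Planner: strategy + contract
    for attempt in range(1, max_retries + 1):
        artifact = implement(contract)          # Implementer: apply, never redesign
        ok = validate(contract, artifact)       # Validator: check invariants
        if harness(attempt, ok):                # Harness: accept or keep retrying
            return artifact
    raise RuntimeError("workflow failed to converge")

# Toy usage with stub agents standing in for LLM calls:
contract_spec = {"task": "uppercase names", "expected": ["ALICE", "BOB"]}
result = run_workflow(
    goal="uppercase names",
    plan=lambda goal: contract_spec,
    implement=lambda c: [n.upper() for n in ["alice", "bob"]],
    validate=lambda c, out: out == c["expected"],
    harness=lambda attempt, ok: ok,
)
```

Note that the Implementer never sees the goal, only the contract, and the Validator never produces an artifact, only a judgment. The boundaries live in the function signatures, not in prompt instructions.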
8. Reliability Comes From Ownership Boundaries
One of the biggest misconceptions about AI agents is that reliability comes only from intelligence.
In practice, reliability often comes from constrained responsibilities.
The same principle already exists in software engineering.
Distributed systems become more stable when responsibilities are isolated.
Database systems become safer when transactional boundaries are explicit.
Microservices reduce instability by limiting ownership scope.
Multi-agent systems appear to follow similar patterns.
Without boundaries:
- every agent becomes a planner
- every retry becomes a redesign
- every execution becomes negotiation
As workflows become more complex, instability grows quickly.
Role separation reduces that instability because ownership becomes predictable.
The more complex the workflow became, the more important role boundaries became for maintaining convergence.
9. The Future — Reliability Engineering for Agent Systems
I increasingly believe we are entering a new phase of AI system design.
Earlier generations of AI systems focused heavily on:
- prompts
- model quality
- tool integration
- inference capability
Those layers are still important.
However, as workflows become more autonomous, coordination and ownership also become architectural concerns.
Instead of only asking:
“Which model should execute this task?”
We may increasingly need to ask:
“Which role should own this decision?”
That shift feels important, because many of the hardest problems in agent systems are no longer only about generation quality.
They are increasingly about:
- coordination
- ownership
- execution boundaries
- convergence stability
As agent workflows become more autonomous, reliability engineering may increasingly become an exercise in defining ownership boundaries between agents.
Conclusion
One unexpected realization from building multi-agent systems was how familiar the failures looked.
Execution drift, responsibility overlap, and uncontrolled redesign during retries resembled classic software engineering problems.
In many ways, multi-agent workflows began behaving like distributed object systems.
The same lessons appeared again:
- unclear ownership creates instability
- overlapping responsibilities reduce predictability
- uncontrolled autonomy weakens convergence
The Implementer should not redesign the system during execution.
Its responsibility is to apply the already defined contract as precisely as possible.
That separation turned out to be one of the biggest improvements in workflow stability.
Ironically, some of the most important ideas for future AI systems may not be entirely new.
Software engineering has already spent decades learning how to build stable systems through responsibility separation and ownership boundaries.
Now, those principles appear to be emerging again inside agent systems.
