How Building a Multi-Agent Development Pipeline Led Me to Design an AI Engineering Organization

yeTi 2026. 6. 1. 16:39

From AI Coding Agents to AI Engineering Organizations

Introduction

When I started the sqlgen-ai project, my goal was straightforward.

I wanted to build an AI agent that could write code.

At the time, I imagined a future where an AI developer could continuously improve itself, create the tools it needed, and gradually become more capable.

To explore that idea, I connected Hermes Agent, Codex, GitLab, and Discord into a development workflow and started building what I thought would become a self-improving coding agent.

What I learned was very different from what I expected.

The biggest challenge was not getting AI to write code.

The bigger challenge was operating AI reliably within a software development process.

AI Models Are Already Good at Writing Code

The first surprise was that coding itself was rarely the bottleneck.

Modern coding tools such as Codex, Claude Code, and Gemini CLI can already perform a large portion of day-to-day development work.

They can:

  • Implement features
  • Fix bugs
  • Refactor existing code
  • Write tests
  • Create Merge Requests

When I started the project, I assumed model capability would be the primary limitation.

Instead, most failures came from workflow and coordination problems.

For example:

  • An agent implemented a feature against an outdated branch.
  • A Merge Request was created without local validation.
  • Acceptance Criteria were partially implemented.
  • Review feedback introduced regressions.
  • A dirty workspace caused unrelated files to be committed.

These were not coding failures.

The code itself was often reasonable.

The failures occurred because development work was being executed without the operational controls that exist in human engineering teams.
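Two of the failures listed above, the outdated branch and the dirty workspace, are the kind of thing a preflight check can catch before an agent starts work. The following is a minimal sketch under stated assumptions, not the actual sqlgen-ai code: the function names, the injectable `run` parameter, and the `origin/main` base branch are all hypothetical.

```python
import subprocess
from typing import Callable, List

# A runner takes a git argument list and returns the command's stdout.
Runner = Callable[[List[str]], str]


def run_git(args: List[str]) -> str:
    """Run a git command in the current directory and return its stdout."""
    result = subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    )
    return result.stdout


def preflight_problems(base: str = "origin/main", run: Runner = run_git) -> List[str]:
    """Return a list of reasons the workspace is not safe to work in."""
    problems = []
    # `git status --porcelain` prints one line per changed or untracked file.
    if run(["status", "--porcelain"]).strip():
        problems.append("workspace is dirty")
    # Count commits that exist on the base branch but not on HEAD.
    behind = int(run(["rev-list", "--count", f"HEAD..{base}"]).strip() or "0")
    if behind > 0:
        problems.append(f"branch is {behind} commit(s) behind {base}")
    return problems
```

An agent would refuse to start if the returned list is non-empty. The count is only meaningful after a recent `git fetch`, which this sketch leaves to the caller.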

As the models became stronger, a different bottleneck emerged.

Generating code became easier.

Coordinating development work became harder.

Software Development Is Mostly State Transitions

One realization significantly changed how I approached the system.

Software development organizations are more structured than they initially appear.

Most engineering work follows a sequence of state transitions.

Issue
→ Planning
→ Approval
→ Development
→ Validation
→ Review
→ Merge
→ E2E
→ Release

Once viewed from this perspective, software development starts looking less like individual coding tasks and more like a state machine.
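To make the state-machine view concrete, here is a small sketch that encodes the sequence above as an ordered enum. It assumes a strictly linear flow with no skipping. That is a simplification for illustration, since a real process also has backward edges, such as review sending work back to development. The names `Stage`, `advance`, and `can_move` are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    ISSUE = "Issue"
    PLANNING = "Planning"
    APPROVAL = "Approval"
    DEVELOPMENT = "Development"
    VALIDATION = "Validation"
    REVIEW = "Review"
    MERGE = "Merge"
    E2E = "E2E"
    RELEASE = "Release"


# Enum members iterate in definition order, which matches the sequence above.
ORDER = list(Stage)


def advance(current: Stage) -> Stage:
    """Return the next stage, or raise if the work item is already released."""
    index = ORDER.index(current)
    if index == len(ORDER) - 1:
        raise ValueError(f"{current.value} is the final stage")
    return ORDER[index + 1]


def can_move(current: Stage, target: Stage) -> bool:
    """Only allow a move to the immediately following stage (no skipping)."""
    index = ORDER.index(current)
    return index + 1 < len(ORDER) and ORDER[index + 1] is target
```

With this model, a jump from Development straight to Merge is rejected because Validation and Review sit in between.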

That realization changed how I used GitLab.

Initially, GitLab was simply a repository.

Over time, it became something much more important.

GitLab became the coordination layer of the entire agent organization.

For example:

issue:approved
dev:ready
dev:running
dev:done

mr:ready-for-review
mr:approved

e2e:ready
e2e:done

These labels are not just metadata.

They represent organizational state.

Agents consume and update those states as part of a larger workflow.

In practice:

  • Issues became work queues.
  • Labels became state transitions.
  • Merge Requests became review checkpoints.

The agents never communicated directly with each other.

They coordinated through GitLab.
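The idea of labels as state transitions and issues as a work queue can be sketched with an in-memory stand-in for GitLab. This is an illustration only: the `Issue` class and both functions are hypothetical, the `dev:ready` to `dev:running` ordering is inferred from the label names, and a real agent would read and update labels through the GitLab API rather than a Python list.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class Issue:
    iid: int
    labels: Set[str] = field(default_factory=set)


def transition(issue: Issue, from_label: str, to_label: str) -> bool:
    """Swap one state label for the next; refuse if the issue is not in from_label."""
    if from_label not in issue.labels:
        return False
    issue.labels.remove(from_label)
    issue.labels.add(to_label)
    return True


def claim_next(issues: Iterable[Issue]) -> Optional[Issue]:
    """Treat `dev:ready` issues as a queue and move the first one to `dev:running`."""
    for issue in issues:
        if transition(issue, "dev:ready", "dev:running"):
            return issue
    return None
```

Because a claimed issue no longer carries `dev:ready`, a second agent scanning the same list skips it. Against a real server, the claim step would also need protection against two agents updating the same issue at once, which this sketch does not address.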

Eventually, GitLab evolved from a repository into the operating system of the agent organization.

Specialized Roles Were More Reliable Than One Smart Agent

My original architecture assumed a single powerful agent would eventually handle everything.

Planning.

Implementation.

Validation.

Review.

Deployment.

In practice, reliability improved when responsibilities became narrower.

The current sqlgen-ai pipeline looks much more like an engineering team than a single autonomous agent.

PM Agent
↓
Developer Agent
↓
Validator Agent
↓
Review Triage Agent
↓
Human

Each role has a specific responsibility.

The PM Agent creates implementation plans.

The Developer Agent focuses on execution.

The Validator Agent independently verifies results.

The Review Triage Agent processes review feedback.

The Human provides goals and final approval.

The separation is important because implementation and validation have fundamentally different objectives.

The Developer Agent tries to make progress.

The Validator Agent tries to find problems.

Combining both responsibilities into a single agent often creates blind spots.

Separating them makes failures easier to detect and easier to recover from.

Validation Was More Important Than Generation

The most important lesson from operating the system was simple.

Never trust implementation output without verification.

Large language models can generate convincing solutions that are still incorrect.

Because of that, "Implementation Complete" is not evidence.

Validation results are evidence.

In sqlgen-ai, implementation does not immediately lead to a Merge Request.

Every change passes through an independent validation layer.

A simplified validation pipeline looks like this:

Developer Agent
↓
ruff
↓
mypy
↓
pytest
↓
Acceptance Criteria Validator

Only validated work can move forward.
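The tool stages of such a pipeline could be chained roughly as follows. This is a sketch under assumptions: the post only names ruff, mypy, and pytest, so the exact command lines here are guesses at a typical invocation, and `first_failure` and the injectable `run` parameter are hypothetical. The Acceptance Criteria Validator is not sketched because the post does not describe how it works.

```python
import subprocess
from typing import Callable, List, Optional, Sequence, Tuple

# Each step is (name, command). The commands are assumptions about how the
# tools are invoked; the post only names ruff, mypy, and pytest.
STEPS: List[Tuple[str, List[str]]] = [
    ("ruff", ["ruff", "check", "."]),
    ("mypy", ["mypy", "."]),
    ("pytest", ["pytest", "-q"]),
]

# A runner executes a command and returns its exit code.
Runner = Callable[[Sequence[str]], int]


def run_command(command: Sequence[str]) -> int:
    """Run the command in the current directory and return its exit code."""
    return subprocess.run(command).returncode


def first_failure(run: Runner = run_command) -> Optional[str]:
    """Run the steps in order; return the name of the first failing step, or None."""
    for name, command in STEPS:
        if run(command) != 0:
            return name  # stop early: later steps are skipped
    return None
```

A return value of `None` means every step passed and the change may move forward. Any other value names the gate that blocked it.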

However, validation itself was not the most interesting discovery.

The more important discovery was what happens after validation fails.

Validation Is Not the Goal. Convergence Is.

Many agent workflows stop after validation.

A test fails.

A review check fails.

The workflow ends.

That approach is useful for reporting failures, but it does not help the system reach a successful outcome.

In practice, I found that validation needed to become part of a convergence loop.

Implementation
→ Validation
→ NO-GO
→ Fix
→ Validation
→ GO

When validation returns NO-GO, the implementation session is reused.

The agent receives the validation results and attempts a targeted correction.

The objective is not simply to detect problems.

The objective is to reduce the distance between the current state and the desired state.

This distinction became one of the most important architectural decisions in the system.

The value of validation is not that it finds errors.

The value of validation is that it guides convergence.
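The loop described above can be sketched in a few lines. This is an illustration under assumptions rather than the sqlgen-ai implementation: `Verdict`, `converge`, and the `max_rounds` budget are hypothetical, and session reuse is modeled simply by calling the same `implement` callable again with the previous findings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    go: bool
    findings: List[str]


def converge(
    implement: Callable[[List[str]], None],
    validate: Callable[[], Verdict],
    max_rounds: int = 3,
) -> bool:
    """Alternate implementation and validation until GO or the budget runs out."""
    findings: List[str] = []  # empty on the first round: nothing to fix yet
    for _ in range(max_rounds):
        implement(findings)  # first pass, then targeted fixes
        verdict = validate()
        if verdict.go:
            return True
        findings = verdict.findings  # feed the NO-GO results back in
    return False
```

The round limit is a design choice added here so that a change which never converges is eventually handed to a human instead of looping forever.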

The Current sqlgen-ai Pipeline

Today, sqlgen-ai operates through a multi-stage development pipeline.

PM
→ Pick
→ Implementation
→ MR
→ Test
→ CI
→ E2E

Each stage is connected through GitLab state transitions and scheduled agents.

Agents do not communicate directly.

GitLab acts as the shared memory and coordination layer.

The role of humans has also become much smaller than I originally expected.

Human
→ Goal
→ Approval

The human defines direction.

The agent organization executes the workflow.

What Comes Next

The system is still far from a fully autonomous engineering organization.

Several challenges remain:

  • Parallel task execution
  • Automatic Merge Request approval
  • E2E auto-recovery
  • Release Agents
  • Deployment governance
  • Cross-agent conflict resolution

But the direction has become much clearer.

When I started the project, I thought I was building a coding agent.

Today, I think differently.

The real challenge is not creating an agent that can write code.

The real challenge is building an organization that can reliably move software through an engineering lifecycle.

Conclusion

The biggest shift in my thinking came from operating the system in practice.

Initially, I believed model capability would be the defining factor.

In reality, operational structure mattered far more.

Reliability did not come from better prompts.

It came from:

  • Role separation
  • State transitions
  • Independent validation
  • Convergence loops
  • Organizational coordination

As models continue to improve, I expect this distinction to become even more important.

The future may not belong to isolated coding assistants.

It may belong to AI engineering organizations.

And that was the unexpected lesson from building sqlgen-ai.
