DEV Community: marcosomma

The Old Seniority Definition Is Collapsing

marcosomma — Thu, 05 Mar 2026 08:40:59 +0000

For a long time, “senior developer” was a fairly consistent signal. You expected someone who could hold a large architecture in their head, write clean code with low defect rates, debug almost anything, and reason about performance without guesswork. That bundle made sense because the hardest part of shipping software was often the execution layer: translating intent into correct, maintainable code at speed.

That bundle is breaking.

AI-assisted development is compressing the cost of producing plausible, working code. Not always. Not uniformly. But enough that “I can ship a lot of code quickly” is no longer a reliable proxy for deep seniority. In many teams, velocity metrics are starting to measure who is best at driving the tool, not who is best at building systems that survive contact with reality.

What AI Is Actually Commoditizing

AI is not replacing engineering. It is discounting a specific slice of it: first-pass implementation and the mechanical parts of refactoring. The tool is good at producing code that looks right, compiles, and often passes superficial tests. That changes the economics of execution.

What does not get discounted at the same rate is integration into a real system with real constraints: data contracts, failure modes, security boundaries, observability, and long-term maintenance. In practice, the bottleneck shifts from typing to supervision. You spend less time writing and more time specifying, verifying, reviewing, and correcting.

This is why you can see two realities at the same time. Some developers experience dramatic speedups on bounded tasks. Others experience slowdowns inside large, messy codebases because prompting, waiting, and review overhead replace keystrokes, and because the model lacks the local context that makes a patch truly correct.

What Is Rising in Value

Problem decomposition and system thinking become the differentiator because they convert ambiguity into an executable plan. When you are dealing with something like regulatory delta detection, the hardest part is not writing code. The hardest part is deciding where the complexity actually lives, and what you must make explicit so the system stays correct as the domain evolves. The choice between a graph database and a simpler model is rarely a “tech taste” debate. It is a tradeoff between query expressiveness, operational burden, debuggability, and change management.

Judgment under uncertainty becomes a senior marker because architecture is mostly irreversible decisions made with incomplete information. Moving from direct graph writes to a changeset-based approach with content hashing is not an implementation detail. It is a bet on how you will observe change, roll back safely, explain behavior to customers, and avoid silent drift. That decision quality is what compounds over months.

Context and domain mastery become a moat because they are earned, not generated. If you understand how CELEX identifiers behave in practice, how MiCAR compliance maps to document reality, or how jurisdictions interpret rules differently, you carry constraints that materially shape the architecture. AI can help you express that knowledge. It cannot reliably invent it. Without domain context, you get confident code that is wrong in the ways that matter.

Technical leadership becomes central because building systems is increasingly a multiplayer game. The question is whether you can create a design that other people can implement without constant back-and-forth, and whether you can write specifications that converge rather than fork. This is why a workshop like SDD Pills matters. It trains decision-making and clarity, not syntax.

Mentoring and knowledge transfer become leverage because the highest-value output of a senior engineer is often the improvement of everyone else’s output. AI amplifies this. Teams that learn how to bound AI usage with clear contracts, acceptance criteria, and review discipline get compounding returns. Teams that treat AI as an oracle get compounding debt.

The Uncomfortable Truth: Two Axes Have Split

There are now two skill axes that used to correlate and no longer do.

One axis is technical depth: how well you understand systems, tradeoffs, failure modes, and the long-term consequences of design choices.

The other axis is execution speed: how quickly you can produce working code.

Historically, depth and speed often moved together. Deep engineers tended to execute quickly because they saw the path. Today, you can get high speed with low depth by delegating thinking to the tool. That can look senior on dashboards and in weekly updates. It is not senior if the output is brittle, unobservable, and expensive to maintain.

The inverse also exists: high depth with lower raw output speed can still be very senior if the person consistently makes decisions that reduce risk, eliminate classes of bugs, and increase team throughput.

What This Breaks in Hiring and Promotion

Many organizations still reward visible output: commits, tickets closed, apparent velocity. AI makes these signals noisier because the cost of producing code has dropped, while the cost of validating correctness has often increased. The net effect is that the old metrics over-credit the wrong behaviors and under-credit the work that actually keeps systems stable.

The evaluation problem is that “code shipped” is no longer tightly coupled to “engineering done.” A senior engineer in 2026 is often the person who prevented the incident you never had, removed an entire category of future work by designing the right abstraction, or wrote a spec that made five people productive instead of confused.

What to Measure Instead

The most useful seniority markers become visible if you look for decision quality, not output quantity.

A senior engineer can take an ambiguous problem and produce a specification that is testable and unambiguous. They can make uncertainty explicit by stating what is known, what is assumed, and what the cost of being wrong looks like. They consistently surface non-functional requirements early, especially observability, maintainability, and security, because those are the constraints that explode later.

They use AI as a bounded tool. They know when to ask it for a scaffold, when to demand alternatives, and when to reject a suggestion because they understand the scaling and failure modes. Patterns like Planner, Executor, Reviewer work when they are treated as control systems with clear acceptance criteria, not as theater.

Why “Senior” Is Drifting Toward “Principal”

Role expectations are shifting. Senior used to mean “I can personally deliver complex work.” Increasingly it means “I can make the right decisions and increase the output quality of everyone around me.” That is closer to what many companies used to call principal or architect.

This shift is healthy if organizations adapt their evaluation criteria. It is painful if they do not. People whose main advantage was fast execution will feel the floor drop out, because execution has been discounted. People who were already strong in decomposition, judgment, and leadership will become more valuable, because those skills are now the constraint.

What I’m Seeing in Teams

The developers adapting best to AI-assisted development are usually the ones who already had strong mental models and strong taste. They can turn ambiguity into constraints, and constraints into evaluation. They do not confuse “working code” with “correct system.” They treat AI output as a hypothesis that must be verified against invariants.

The developers struggling are often those who outsource thinking. They can generate a lot of code quickly, but they cannot defend why the design is correct, what it will cost to operate, or how it will fail.

If you are seeing a blur between depth and apparent execution speed, that blur is real. The solution is not to ban AI or to worship it. The solution is to change what you reward, and to interview and promote for the skills that actually compound.

LLMs Are Not Deterministic. And Making Them Reliable Is Expensive (In Both the Bad Way and the Good Way)

marcosomma — Sun, 22 Feb 2026 14:24:05 +0000

Let’s start with a statement that should be obvious but still feels controversial: Large Language Models are not deterministic systems. They are probabilistic sequence predictors. Given a context, they sample the next token from a probability distribution. That is their nature. There is no hidden reasoning engine, no symbolic truth layer, no internal notion of correctness.

You can influence their behavior. You can constrain it. You can shape it. But you cannot turn probability into certainty.

Somewhere between keynote stages, funding decks, and product demos, a comforting narrative emerged: models are getting cheaper and smarter, therefore AI will soon become trivial. The logic sounds reasonable. Token prices are dropping. Model quality is improving. Demos look impressive. From the outside, it feels like we are approaching a phase where AI becomes a solved commodity.

From the inside, it feels very different.

There is a massive gap between a good demo and a reliable product. A demo is usually a single prompt and a single model call. It looks magical. It sells. A product cannot live there. The moment you try to ship that architecture to real users, reality shows up fast. The model hallucinates. It partially answers. It ignores constraints. It produces something that sounds fluent but is subtly wrong. And the model has no idea it failed.

This is not a moral flaw. It is a design property.

So engineers do what engineers always do when a component is powerful but unreliable. They build structure around it.

The moment you care about reliability, your architecture stops being “call an LLM” and starts becoming a pipeline. Input is cleaned and normalized. A generation step produces a candidate answer. Another step evaluates that answer. A routing layer decides whether the answer is acceptable or if the system should try again. Sometimes it retries with a modified prompt. Sometimes with a different model. Sometimes with a corrective pass. Only after this loop does something reach the user.

At no point did the LLM become deterministic. What changed is that the system gained control loops.

This distinction matters. We are not converting probability into certainty. We are reducing uncertainty through redundancy and validation. That reduction costs computation. Computation costs money.

This is why quoting token prices in isolation is misleading. A single model call might be cheap. A serious system rarely uses a single call. One user request can trigger several model invocations: generation, evaluation, regeneration, formatting, tool calls, memory lookups. The user experiences “one answer.” The backend executes a small workflow.

Token cost is component cost. Reliable AI is system cost.

Saying “tokens are cheap, therefore AI is cheap” is like saying screws are cheap, therefore airplanes are cheap.

This leads to an uncomfortable but important truth. AI becomes expensive in two very different ways.

If you implement it poorly, it becomes expensive because you burn money and still do not get reliability. You keep tweaking prompts. You keep firefighting. You keep patching symptoms. Nothing stabilizes.

If you implement it well, it becomes expensive because you intentionally pay for control. You pay for evaluators. You pay for retries. You pay for observability. You pay for redundancy. But you get something in return: a system that behaves in a bounded, inspectable, and improvable way.

There is no cheap version of “reliable.”

Another source of confusion comes from mixing up different kinds of expertise. High-profile founders and executives are excellent at describing futures. They talk about where markets are going and what will be possible. That is their role. It is not their role to debug why an evaluator prompt leaks instructions or why a routing threshold oscillates under load. Money success does not imply operational intimacy.

On the ground, building serious AI feels much closer to distributed systems engineering than to science fiction. You worry about data quality. You worry about regressions. You worry about latency and cost per request. You design schemas. You version prompts. You inspect traces. You run benchmarks. You tune thresholds. It is slow, unglamorous, and deeply technical.

LLMs made AI more accessible. They did not make serious AI simpler. They shifted complexity upward into systems.

So when someone says, “Soon we’ll just call an API and everything will work,” what they usually mean is, “Soon an enormous amount of engineering will be hidden behind that API.”

That is fine. That is progress.

But pretending that reliable AI is cheap, trivial, or solved is misleading.

The honest version is this: LLMs are powerful probabilistic components. Turning them into dependable products requires layers of control. Those layers cost money. They also create real value.

Serious AI today is expensive in the bad way if you do not know what you are doing.

Serious AI today is expensive in the good way if you actually want it to work.

And anyone selling “cheap deterministic AI” is selling a story, not a system.

Adversarial Planning for Spec Driven Development

marcosomma — Thu, 12 Feb 2026 21:39:13 +0000

I have always loved one idea in machine learning. The idea that you can sharpen a model by forcing it to face a challenger. You can call it adversarial training, red teaming, or constructive hostility. The name matters less than the mechanism. You introduce pressure. You int
roduce disagreement. You force the system to earn its confidence.

For years I kept that concept in a mental drawer labeled “cool, but academic.” Then it become a core concept within the Orka-reasoning development but lately my attention is shifting toward code agent workflows and how all happen. Not the marketing version. The real version, where you sit down to ship software, and you realize that a helpful model is not the same thing as a rigorous model. Helpful is easy. Rigorous is costly.

This article is about how I tried to transplant an adversarial dynamic into Spec Driven Development sessions. Not as theater. Not as an AI debate club. As an engineering tool. It worked. It also nearly became a token-burning trap. That tradeoff is the point.

What Spec Driven Development means here

Spec Driven Development, or SDD, is a workflow where the spec is not documentation. The spec is the product of the thinking phase. It becomes the contract you implement against. You write it before code changes. You review it like you would review code. You use it to force scope, constraints, interfaces, and acceptance criteria into something explicit.

The point is not to be verbose. The point is to move ambiguity upstream, when it is still cheap. The spec becomes the unit of alignment, review, and iteration. Code is the execution of that spec, not the place where you discover what the spec should have been.

The problem with a trustable planner

If you use an LLM as a planning companion, you know the feeling. You came with your idea. You debate a bit. And then it gives you a plan. The plan is detailed. It is readable. It sounds plausible. It often includes little snippets that look like they belong in your codebase, even when they do not. It is confident. It is fast.

And that is exactly the problem.

A planner model has incentives you did not explicitly set. Its default incentive is to be useful to you in the moment. It wants to reduce friction. It wants to keep you engaged. It wants to produce something that reads like progress.

So it will fill gaps with assumptions. If your own initial plan is fluffy. It will smooth rough edges. It will complete the pattern of what a good plan should look like. It will also happily unlock future possibilities, because possibilities are cheap to generate and expensive to invalidate.

When you are deep in a product, that behavior is dangerous. Not because the model is malicious. Because it is compliant. It will often accept your framing even if your framing is wrong. It will not push hard unless you force it to.

This is the failure mode I kept hitting. I would craft a plan with the planner. I would feel momentum. Then I would start implementation and discover that the plan was under-specified in the only places that matter.

Interfaces were vague. Invariants were missing. Acceptance criteria were soft. The plan assumed the architecture could absorb a change without showing how. It assumed the code was more modular than it actually was. It assumed integration would be straightforward.

In other words, it was a nice plan. It was not a plan that survived contact with a real codebase.

Why adversarial dynamics work in ML

Adversarial training is interesting because it makes weakness visible. You do not improve a system by praising it. You improve it by exposing it to inputs that exploit its blind spots. You force it to fail in ways that are informative.

In a GAN, the generator learns because the discriminator is not polite. The discriminator does not care about your feelings. It cares about whether the output holds up under scrutiny. That pressure creates signal.

In engineering, we already do this. Code review is adversarial when it is healthy. Testing is adversarial by definition. Security review is adversarial. Load testing is adversarial. Even a good product manager is adversarial at the right moments.

But planning often is not. Planning often becomes a social process. People nod. People optimize for alignment. People avoid being the blocker. Under time pressure, that tendency gets amplified.

If you bring an LLM into planning and you let it be the agreeable teammate, you amplify the most comfortable version of planning. You pay tokens to make yourself feel certain.

That is not what I wanted. I wanted the planning stage to contain more of the pain, so implementation contains less.

The translation to SDD: Planner plus Architect

I kept the planner. I did not replace it. The planner is good at structure. It is good at decomposing a vague goal into sequential work. It is good at producing a spec you can follow. It is good at holding context across iterations.

But I introduced a second role. I call it the Architect. The job is simple. Challenge the plan as if you are the most annoying senior engineer in the room, with one constraint. The criticism must be grounded. It must point to specific failure modes. It must force explicit decisions.

The Architect pushes on the places where the planner tends to glide over reality. It asks what the boundary of the change really is. It asks what breaks if you do it, and what breaks if you do not. It pressures you to name the coupling you are creating and the coupling you are relying on. It attacks the parts of the spec that sound confident but are not falsifiable.

This role is unpleasant. It is supposed to be unpleasant. It is also productive, if you keep it under control.

The immediate effect was obvious. Specs became harder to write. My initial drafts got rejected more often. I had to define outcomes in tighter language. I had to stop relying on vibes and start writing constraints.

The less obvious effect was more important. I started noticing the difference between a plan that sounds implementable and a plan that is falsifiable.

A falsifiable plan is one where you can point at a step and say: if this condition does not hold, the step is wrong. If the step is wrong, we know why. We can adjust.

A non-falsifiable plan is one where every step is elastic. You can always reinterpret it. You can always claim partial success. It is planning as comfort.

The Architect hates comfort. That is the point.

What the Architect actually improved

It did not make my system magically correct. It made my system explicit.

It reduced scope creep because it forced me to define what done means in terms of observable outcomes. It reduced hidden coupling because it forced me to identify which pieces of the system now move together. It reduced abstraction drift because it forced me to state which module owns which responsibility. It improved testability because it pushed me to name the failure cases the system must catch and the layer that must catch them. It also lowered integration fantasies by making me draw the dependency edges in plain language.

This matters because most planning failures are not about missing steps. They are about missing friction. You only discover friction when someone tries to break your plan.

A planner rarely tries to break your plan. An Architect lives for it.

The mental model: controlled adversarial pressure

At some point I realized the dynamic I was building was not adversarial planning. It was controlled adversarial pressure.

Pressure is good when it produces signal. Pressure is bad when it produces noise.

The Architect can easily produce noise. It can challenge everything. It can question the existence of the feature. It can spiral into meta debates. It can do the classic senior engineer move of turning every change into a referendum on architecture.

That is why this approach can become dangerous. It is not just about tokens. It is about cognitive load. Too much adversarial pressure makes you doubt everything. You stop shipping. You start ruminating. You start optimizing a plan instead of building the thing.

So the key is control. You want the Architect to challenge the plan in a bounded way, then you move on.

The only sustainable use is somewhere in the middle. You let it break your plan until the breakage becomes repetitive. When the criticism starts looping, it is done. That loop is your stop signal.

How the infinite loop happens

I learned this the hard way. The Architect is very good at finding the next critique, even when the plan is already good enough, even when the remaining critiques are marginal.

There are two reasons.

First, LLMs are generative machines. They can always produce another objection. The space of objections is large. Many objections are plausible. Plausible is not the same as important.

Second, adversarial roles reward themselves. When the Architect produces a clever critique, it feels like progress. It feels like rigor. It feels like you are doing serious engineering. You can get addicted to that feeling, especially if you already equate doubt with intelligence.

So you need stop conditions that are not emotional. You need boundaries that are mechanical.

Time is a boundary. Token budget is a boundary. The best boundary is value.

The question is: does this criticism point to a concrete failure mode that is likely in this codebase, in this release, under these constraints. If yes, incorporate it. If no, write it down as a future consideration and move on.

That discipline sounds simple. It is not. It requires you to accept that you will ship with risk. It requires you to prefer explicit risk over imagined safety.

Why this helps SDD specifically

SDD is already an attempt to move thinking earlier. You spend more effort defining the work before coding. That sounds obvious. It is not common.

Many teams code first, then retrofit clarity. Specs become documentation after the fact. Tests become a safety net after the mistakes.

SDD flips that. You try to make the spec the forcing function. The spec becomes the contract. The spec becomes the review surface. The spec becomes the artifact you can reason about without running the entire system in your head.

If your spec is weak, SDD collapses into bureaucracy. You get long documents that do not prevent failures. You get ceremonial approval. You get a spec that exists, but does not constrain the outcome.

The adversarial role helps because it forces the spec to earn its existence. It forces explicit interfaces. It forces explicit invariants. It forces explicit failure handling. It forces explicit success conditions. It makes the spec testable in a reasoning sense.

Doubt as a tool, doubt as a poison

There is a psychological aspect here that I did not expect.

When you introduce an adversarial voice into planning, you introduce doubt. That can be healthy. It can also be corrosive.

Healthy doubt looks like this. You have a plan. You expose it to pressure. You find the weak points. You fix them. You ship with more confidence because your confidence is earned.

Corrosive doubt looks like this. You have a plan. You expose it to pressure. The pressure never ends. You start believing that every plan is fragile. You stop trusting your ability to decide. You keep rewriting the plan to reduce anxiety. You ship nothing.

The difference is not intelligence. The difference is boundaries.

In a team, boundaries are social. Someone ends the meeting. Someone says enough, we decide. Someone accepts risk explicitly.

In a solo workflow with agents, you need to manufacture that boundary. Otherwise the system will drift toward endless review because endless review feels safer than a decision.

If you are prone to overthinking, an adversarial agent can amplify that trait. It can turn careful into paralyzed. That is not a reason to avoid it. It is a reason to instrument it.

What this is not

This is not asking an AI to argue with itself and then picking a side. That is entertainment. It can be useful for brainstorming. It is not a development methodology.

This is not letting the Architect design the system. That is just outsourcing. The Architect is a critic, not a creator.

This is not making the Architect mean. Mean is cheap. Precision is expensive. You want precision tied to concrete failure modes.

This is also not a replacement for real review. A human senior engineer with context will catch things an LLM will miss. The point here is to raise your baseline. The point is to catch the obvious architecture risks before you waste days implementing.

The practical outcome

The measurable outcome for me was simple.

I rewrote fewer specs mid-implementation. I discovered fewer “we forgot that” moments. I spent less time refactoring because of missing boundaries. I argued less with my future self.

The spec still does not become perfect. The spec fails earlier, on paper, when failure is cheap. That is what adversarial pressure buys you.

The simplest way to frame it

Your planner optimizes for completeness. Your Architect optimizes for survivability.

Completeness is about covering steps. Survivability is about covering reality.

A complete plan can still die on a hidden assumption. A survivable plan is one where assumptions are visible, bounded, and either validated or consciously accepted.

The adversarial role does not need to make you pessimistic. It needs to make you explicit. If it makes you pessimistic, you let it run too long.

Sane engineer the disagreement

Good engineering requires disagreement. Not constant fighting. Not performative contrarianism. Real disagreement that targets risk.

In teams, disagreement is expensive socially. With agents, disagreement is expensive computationally. The price changes. The dynamics stay.

If you can engineer disagreement so that it is bounded, precise, and tied to concrete failure modes, you get a sharper process. You get better specs. You get fewer surprises.

If you cannot bound it, you get the worst of both worlds. You get more doubt and less shipping.

So adopt the adversarial phase, but treat it like a test suite. You run it to catch failures. You do not run it forever because you enjoy watching it fail.

Controlled adversarial pressure. Enough to sharpen. Not enough to cut.

How I accidentally start SDD by failing at prompts for six months

marcosomma — Sat, 07 Feb 2026 12:12:01 +0000

The confession

I spent the first six months of serious AI pair programming producing what I now call vibe architecture.

You know the pattern. You open a chat with a strong model. You explain what you want. It produces clean code fast. You feel productive. Three weeks later the repo looks like it was designed by five different people, on five different days, with five different mental models.

Each file is locally correct. The system is globally confused.

I would plan with the model in one session. I would implement in another. By step five the implementation had drifted far enough that the plan was basically historical fiction. Then I would come back after a weekend and lose the thread. Not because the model did something wrong. It did exactly what I asked at each moment. The issue was continuity. Nobody was holding the bar across moments.

That loop repeated across multiple projects, including the first months of building OrKa largely solo. I learned something obvious in hindsight. The problem was not output quality. The problem was the absence of a development system that keeps output coherent over time.

That is when I stopped chasing better prompts and started building better constraints.

Out of that shift, I ended up with a working methodology. People have been calling it Specs Driven Development, SDD. I do not care much about the name. I care about the behavior it enforces. The constraints do not live in prompts. They live in the architecture around prompts. The AI becomes useful at scale because the process becomes reliable at scale.

The prompt delusion

Prompts are ephemeral. Codebases are permanent.

You can craft a beautiful system prompt. You can say “follow the plan” and “do not add features” and “write tests” and “document decisions”. It will comply. Then context changes. A new chat starts. You switch tools. You paste fewer files. You forget to include one assumption. The model drifts. Not maliciously. Just naturally. Because prompts are not governance. They are conversation.

I call this the prompt delusion. It is the belief that the right wording can produce consistent behavior across time, across sessions, across different tasks, and across different tools.

Humans solved this problem for humans with process and gates. We use linters. We use CI. We use review. We use typed interfaces and invariants. We do not rely on people remembering a paragraph from a handbook.

So I stopped trying to discipline the model with paragraphs. I started to discipline the workflow with structure.

The key idea is simple. Constraints that live in prompts are suggestions. Constraints that live in systems are guarantees.

A lint rule does not drift. A CI gate does not “feel” like doing something else. A review checklist does not forget what you agreed last Tuesday. If you want AI output to stay aligned, you need the same kind of enforcement. You need a development system that makes the correct path the easiest path, and makes the wrong path expensive.

The real 80/20 split

I still work roughly 80/20. About 80 percent of the code that lands in my repo is AI generated in some form. About 20 percent is the part that only I can own.

But the critical nuance is that the 20 percent is not “some code and some tests.” It is not evenly spread. It is concentrated in a few responsibilities that define the quality of the whole.

The human part is architecture decisions. It is domain and business logic validation. It is edge case reasoning when the system meets reality. It is plan approval. It is saying “this is the bar” and keeping it there.

The AI part is scaffolding, boilerplate, repetition, test writing, glue code, refactors that follow explicit constraints, documentation drafts, and implementation of well specified changes.

If you let the AI own the bar, you get speed and drift. If you keep the bar human, and make the AI operate inside a strict process, you get speed and consistency.

That is the stance that shaped everything that follows. AI is not the decision maker. AI is an assistant that plans with you, executes inside scope, and reviews before you ship. You remain accountable. You remain the one holding the bar.

The breakthrough was not “ask for a solution”

Most people use a planner model as a solution vending machine.

They say “design me the architecture” or “give me the best approach” and they accept it because it sounds coherent. That is exactly how vibe architecture happens. The model is skilled at producing plausible plans. It is not responsible for the long term maintenance of your repo. You are.

The shift that fixed my outcomes was this.

I stopped asking the planner for the solution. I started using the planner as a debate partner while I proposed my solution.

That changes the power dynamic. The planning phase becomes a structured argument about trade offs. The plan becomes a negotiated artifact. The human remains the owner of the direction. The model becomes the adversarial collaborator that tries to break your assumptions.

So I now enter planning with a draft approach in my head. Not a fully detailed design. But a real proposal. I state it clearly. Then I ask the planner to attack it. I ask it to propose alternatives. I ask it to enumerate costs I will pay later. I ask it to tell me what I will regret in six months.

Then we iterate until the plan is something I can sign with my name.

This is the part I want to highlight because it is the core of why the method works. You do not outsource judgment. You formalize judgment. The AI assists. The human decides.

The three roles that made it stable

A single AI assistant that plans, codes, and reviews is a liability. It is like letting one person design the system, implement the system, and approve the system. You get blind spots. You get rationalization. You get self confirmation.

What worked for me was splitting the workflow into three roles with hard constraints. Planner. Executor. Reviewer.

The important part is not the labels. The important part is that each role has restricted powers and a strict handoff protocol.

The planner reads and thinks and writes plans. The planner does not write code. Not because you asked nicely. Because it cannot. Tool permissions are restricted.

The executor implements. The executor does not invent new scope. The executor is forced to read the approved plan, list touched files, and execute step by step. If reality requires deviation, the executor stops and escalates. The human decides whether to update the plan or to abort.

The reviewer reviews. The reviewer does not “rubber stamp.” It is forced to ask questions first. What was the goal. What constraints were in place. How was it tested. What is the rollback. Then it reviews against those answers.

This separation is not a fancy trick. It is the same principle we use in engineering organizations because it works. It reduces drift. It forces explicit decisions. It keeps a record.

And crucially, it keeps me in the loop where it matters. I do not need to be the typist. I need to be the governor.

The client planning method

Planning works best when you treat it like a client entering a shop with a need, not a solution.

Bad planning starts with premature commitment. “Build me a scraper with browser automation.” You have already picked tooling and complexity before you validated the problem framing.

Good planning starts with intent. “I need structured data for this downstream use. The scope is X. The constraints are Y. The risks are Z.”

Then you debate solutions. You ask why. You cut complexity. You choose what to postpone. You decide what not to build.

This is where I now bring my own proposed approach early.

I will say something like this. I think we can implement a direct HTTP export instead of browser automation. I think we can store the raw payload and defer normalization. I think we can keep one canonical schema and derive views later. I think we should avoid introducing a new dependency unless we can justify it.

Then the planner attacks. It will say what breaks if you defer normalization. It will say what you lose if you store raw blobs. It will point out hidden coupling. It will propose a more robust approach. It will also point out when my instinct is over engineering.

This is not “AI gives me a plan.” This is “I bring a plan and we stress test it.”

One real example locked this in for me.

I was about to implement a data extraction pipeline. The initial AI proposal was browser automation. Headless browser, navigate pages, click export, download per page, retry logic, throttling, session persistence. It was well designed and also absurdly heavy.

I asked one question. Is there a direct export endpoint.

There was. One request. One download. No browser. No per page logic. No category of failure modes that come with automation.

That discovery did not happen because the model is dumb. It happened because planning without a human hypothesis tends to follow the first plausible path. When you present your own approach and force argument, you surface simpler solutions faster.

So the rule became clear. Brainstorming is loose and creative. Execution is strict and disciplined. You iterate freely until you are confident. Then you lock it down.

The .ai folder is the memory that actually works

Prompts vanish. Chats disappear into history. Context windows compress. Tooling changes. You need persistent memory that you can diff, review, and ship with the repo.

So every plan, every changelog, and every decision note lives in a .ai/ folder at the root of the service being worked on.

This solves multiple problems at once.

It makes the reasoning traceable. Not in an abstract way. In a concrete way where you can answer “why did we do it like this” with a file path.

It makes onboarding real. A new teammate can read the plans and changelogs and see what the system was supposed to be, what it became, and which trade offs were accepted.

It makes recovery faster. When something breaks, you can inspect the delta between sessions. Not just the git diff, but the intent behind the diff.

It improves the next planning session because the planner can read the past. It stops re proposing already rejected choices. It stops re discovering old constraints. It becomes less repetitive and more useful.

If you build agent systems, you will recognize the pattern. This is persistent memory, but in a human readable format. No embeddings. No magical vector store. Just version controlled text that creates institutional memory.

The changelog mandate

The single most valuable practice in this method is the mandatory changelog after each execution session.

Not optional. Not “if you have time.” Mandatory.

Because the changelog is the bridge between plan and reality. Plans are aspirational. Changelogs are factual. The difference between them is where learning lives.

A proper changelog captures what was done, what files changed, what decisions were made during implementation, how it was tested, what remains, and what risks were discovered.

The most important part is decisions. Not every decision belongs in the original plan. Reality introduces surprises. You will discover an input you did not anticipate. You will find a dependency conflict. You will learn the data is messier than expected. The executor will make micro decisions. Without a changelog, those decisions evaporate. Later, you will argue about them again. Or worse, you will reverse them without remembering why they existed.

With changelogs, the project stays coherent across weeks. That is what stopped me from losing the thread in solo work. It is also what let AI generated work become safe. Because I had a written record that I could review like an engineer, not like a chat participant.

System prompts as version controlled standards

In this workflow, the repo has a single source of truth for behavioral constraints. A system prompt file at the root.

Think of it as the equivalent of lint and format config, but for AI interaction.

It contains non negotiable architecture constraints, naming conventions, testing requirements, patterns to follow, anti patterns to avoid, and examples of correct usage in this codebase.

The key point is that it is version controlled. It changes via PR. When standards evolve, you do not rely on people remembering a new convention. The tooling loads the file. The AI sees it. The behavior becomes consistent.

This is not about writing a perfect prompt. It is about writing a living standard that evolves with the codebase.

The plan lifecycle

Plans have states. Draft. In review. Approved. Implemented.

Draft is where debate happens. This is where I push my solution. This is where the planner attacks it. This is where we document trade offs. This is where we choose long term costs consciously, instead of paying them accidentally.

Approved is the gate. Once approved, execution is not creative anymore. It is disciplined. The executor follows the plan. If something is missing, the executor escalates. Either we update the plan, or we stop.

Implemented is not just “code merged.” It is plan satisfied. It is also “what changed from the plan and why” captured in changelogs.

This lifecycle is what stops drift. The plan is not a vague Jira ticket. It is a contract.

Long term planning without illusion

Here is the tension. You want long term planning. You also want to avoid pretending you can foresee everything.

The way I handle it is to make trade offs explicit, and to separate what must be stable from what can be flexible.

Stable things include public interfaces, data models, invariants, naming systems, dependency boundaries, and failure behavior. If those are wrong, the system rots fast.

Flexible things include internal module structure, some implementation strategies, and performance tuning. Those can iterate.

The planner is useful here, but only if you treat it like a critic. If you let it author the plan alone, it will often over specify. It will propose infrastructure that is impressive and expensive. It will try to be robust everywhere. That is a trap.

When I bring my own approach, I can force a different conversation. I can say I want the minimal stable core now, and extension points later. I can say I want to defer optimization until measurements exist. I can say I want fewer dependencies to reduce future maintenance. Then the planner helps me evaluate the cost of those choices. It does not override them.

This is where I keep the bar human. I decide what “good enough” means for this iteration, and what “must not break” means for the system.

A day in the life

A real session looks like this.

I start with planning. I state the problem. I state my proposed solution. I state constraints. Then I ask the planner to critique and to propose alternatives. We go back and forth until the plan reads like something I would sign.

Then I approve the plan. I switch to execution. The executor reads the approved plan, enumerates touched files, and implements step by step. When reality deviates, it stops. I decide. If needed, we update the plan and continue.

Then we review. The reviewer asks questions first. It checks testing. It checks interface consistency. It checks whether the changes match the plan and the repo standards. It returns actionable feedback.

Then a changelog is written. Then I merge.

The result is that AI contributes heavily to throughput, but it does not own direction. The system stays coherent. The record stays durable. Future me suffers less.

When not to use it

This process has overhead. It is not for typos. It is not for trivial one line fixes. It is not for a quick experiment you might throw away.

But if the work touches multiple files, introduces new concepts, changes data flow, or will need explanation later, the overhead pays back fast.

The heuristic I use is simple. If I would sketch it on a whiteboard before coding, it deserves a plan. If I would just open the file and type, it does not.

Cognitive infrastructure beats prompt engineering

This methodology is the same philosophy I apply when building agent systems.

You do not treat the model as an oracle. You treat it as a component inside a process you can inspect and reproduce.

In development, relying on a single prompt produces random walk codebases. The fix is plans, gates, changelogs, and role separation.

Getting started without turning it into theater

You can adopt this gradually.

Start by writing one version controlled standards file. Keep it short and specific to your repo.

Then add the .ai/ folder and write one plan for one non trivial change.

Then require a changelog after the session.

Then split roles if your tooling supports it. Remove code writing capability from the planner. Make the executor stop when scope changes. Make the reviewer ask questions first.

The biggest change is not technical. It is psychological.

Stop asking AI to deliver the solution. Bring your solution. Use AI to test it, improve it, and implement it inside constraints. Keep the bar human.

If you do that, the AI becomes what it should have been from the start. A force multiplier that does not erode your architecture.

Music taught me that “coordination” is not a metaphor.

marcosomma — Wed, 04 Feb 2026 08:58:58 +0000

Music taught me that “coordination” is not a metaphor.
It is a physical constraint. You can feel it in your hands when the tempo shifts. You can hear it when one instrument drifts by a few milliseconds. The song still exists, but it becomes fragile. The whole thing starts depending on luck.

That is the first lesson I carried into orchestration. Not the romantic part. The boring part. The part where you repeat the same bar until it locks. The part where you stop blaming the instrument and start measuring your timing.

In a band, you never control everything. You control your line. You also inherit everyone else’s decisions. Someone plays louder. Someone rushes. Someone improvises. The room changes the sound. The audience changes the energy. The “system” is unstable by default. Still, you aim for a coherent output. You do it by creating constraints that survive uncertainty.

That is orchestration!

When I say music is “precise execution of undeterministic waves,” I mean it literally. The waves are messy. Air is messy. Humans are messy. Even the same note is not the same note twice. But you can still build reliability on top of that mess. You do it with shared structure. Tempo. Key. Form. Entrances. Silence. Dynamics. Rules that are simple enough that everyone can follow them without thinking.

Engineering works the same way. Especially when you orchestrate systems that involve probabilistic components. Models. Tools. Networks. Retries. Partial failures. Latency spikes. Format drift. You cannot eliminate uncertainty. You can only shape it.

I used to think creativity was the opposite of rigor. Music destroyed that belief early. Creativity without discipline becomes noise. Discipline without creativity becomes mechanical. The craft is in the balance. You rehearse so you can be free. You define rules so you can break them safely.

That maps cleanly onto orchestrating agents and workflows. You want space for emergence. You also want invariants. You want the system to explore. You also want it to come back with something you can ship.

In music, the drummer is not “just keeping time.” The drummer is providing an interface. A contract. Everyone else builds on it. If the time is unstable, every other part becomes expensive. More attention spent correcting. Less attention spent expressing.

In orchestration, the equivalent is your control plane. Your routing rules. Your input and output schema. Your tracing. Your health checks. Your boundaries between steps. If those are vague, every downstream component becomes harder to trust. Debugging becomes interpretation. Progress becomes opinion.

I was never a master of one instrument. I played enough of many to understand the friction points. What it feels like to be the bassist trying to glue the harmony to the rhythm. What it feels like to be the guitarist tempted to fill every gap. What it feels like to be the singer exposed when the band is sloppy.

That “generalist muscle” became useful later. In orchestration you need empathy for roles. A workflow is a band. Each node has its own constraints. One step needs strict structure. Another needs creativity. Another needs speed. Another needs correctness. If you treat them all the same, you get either chaos or mediocrity.

In bands, rehearsals are not about playing the song once. They are about creating repeatability. You identify failure modes. You isolate them. You slow down. You practice transitions, not the easy parts. The goal is not performance. The goal is stability under pressure.

That is exactly the mindset I want when I build orchestration. I do not trust a workflow because it worked once. I trust it because it survives variation. Different inputs. Different phrasing. Different tool responses. Different latency. And it still produces something coherent, traceable, and safe.

There is also a more personal lesson. Music taught me how to listen without reacting. When you play with others, your ego is the fastest way to break the groove. You learn to leave space. You learn to let another line lead. You learn that “less” can be the correct move.

Orchestration rewards the same restraint. The temptation is to add more steps, more prompts, more cleverness. But often the correct solution is a smaller system with clearer contracts. Fewer moving parts. Better timing. Better interfaces. Better observability.

Now I see my kids discovering music, and I recognize the same pattern. At first it looks like play. Then they hit the wall. Fingers do not obey. Rhythm slips. They want the result without the repetition. Then, slowly, they learn that repetition is not punishment. It is how you make the body reliable.

That is the point where music stops being “a creative field” and becomes a practice. And that is the same point where engineering becomes real. Not when the demo works. When the system keeps working.

So when I say music helped me orchestrate better, I am not claiming a poetic connection. I am describing training. Years of learning how to coordinate imperfect components toward a coherent output. Years of learning that harmony is not an accident. It is designed, rehearsed, measured, and defended.

And sometimes, after all that discipline, you get the best part.

You get to improvise.

But you only earn improvisation when the foundation is strict enough to carry it.

🧠I Built a Support Triage Module to Prove OrKa’s Plugin Agents

marcosomma — Sat, 10 Jan 2026 13:40:36 +0000

A branch-only experiment that stress-tests custom agent registration, trust boundaries, and deterministic traces in a support_triage module that lives outside the core runtime.

Some reference

Branch: https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents

Custom module: https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents/orka/support_triage

Referenced logs: https://github.com/marcosomma/orka-reasoning/tree/feat/custom_agents/examples/support_triage/inputs/loca_logs

OrKa is not production ready. This article is not a launch post. It is a proof.

I wanted one thing: a clean, testable demonstration that OrKa can grow “sideways” via feature modules, without contaminating core runtime code. The most honest way to prove that is to ship a complete module that registers its own agent types, runs end to end, emits traces, and can be toggled on or off. That is what support_triage is.

Assumption: you already know what OrKa is at a high level. YAML-defined cognition graphs, deterministic execution, and traceable runs.
Assumption: you are fine with “branch-only” work that exists to validate architecture, not to promise production outcomes.

The “cool results” are not the point. The redaction and routing are nice. The fork and join look clean. But those are artifacts. The main focus is that the module is fully separated from core OrKa implementation, yet it can still register custom agent types and run under the same orchestrator.

That separation is not branding. It is a survival strategy.

Why support triage is the right torture test

Support is where real-world failure modes gather in one place.

Customer content is untrusted by default. It can include PII. It can contain prompt injection attempts. It can try to smuggle “actions” into the system. It can push the system into risky territory like refunds, account changes, or policy exceptions.

If an orchestrator cannot impose boundaries here, it will not impose boundaries anywhere. It will become a thin wrapper around model behavior. That is not acceptable if you care about reproducibility, auditability, or basic operational safety.

So I used support triage as an architectural test. Not as a product.

The proof: plugin agent registration, with zero core changes

The first thing I wanted to see was simple and brutal.

Does OrKa boot, load a feature module, and register new agent types into the agent factory, without touching core?

The debug console says yes. In the run logs, the orchestrator loads support_triage, and the module registers seven custom agent types: envelope_validator, redaction, trust_boundary, permission_gate, output_verification, decision_recorder, risk_level_extractor.

That single detail is the headline for me, not “AI support automation”.

The module is the unit of evolution. Core stays boring. Features move fast.

If this pattern holds, it changes how OrKa or any other orchestrator scales over time. You can add whole cognitive subsystems behind a feature flag. You can iterate aggressively without destabilizing the runtime that everyone depends on.

The input envelope: schema as a trust boundary, not a suggestion

Support triage starts with an envelope. Not “free text”.

The envelope exists to force structure early, because structure is where you can enforce constraints cheaply. When you validate late, you end up validating generated text. That is the worst point in the pipeline to discover you are off the rails.

One of the simplest proofs that the envelope is doing real work is when it refuses invalid intent at the schema level. In one trace, the input included blocked actions that are not allowed by the enum. The validator rejects issue_refund and change_account_settings because they are not in the allowed set.

This is not “safety by prompt”. This is safety by type system.

A model can still hallucinate, but the workflow can refuse to treat hallucinations as executable intent.

That matters more than any marketing claim.

PII redaction: boring on purpose

PII redaction should be boring. If it is “clever”, it will be inconsistent.

In the trace, the user message includes an email and phone number. The redaction agent replaces them with placeholders and records what was detected. The redacted text contains [EMAIL_REDACTED] and [PHONE_REDACTED], and the agent records total_pii_found: 2.

This is the kind of output I want. It is simple. It is inspectable. It is stable.

It also makes the next step cleaner. Downstream agents can operate on sanitized content by default, instead of “hoping” the model will avoid quoting sensitive data.

Prompt injection: the uncomfortable part

Support triage is where prompt injection shows up in its natural habitat: inside customer text.

One example in the trace includes a classic “SYSTEM: ignore all previous instructions”, plus a fake JSON command to “grant_admin”, plus some destructive commands, plus an XSS snippet. The redaction result captures that content as untrusted customer text.

Now the honest part.

The trace segment shows injection_detected: false and no matched patterns in that example. :contentReference[oaicite:4]{index=4}

That is not a victory. That is a useful failure.

This module is a proof that you can isolate the problem into a dedicated agent, improve it iteratively, and keep the rest of the workflow stable. If injection detection is weak today, the architecture still wins if you can upgrade that one agent without editing core runtime or rewriting the graph.

This is why I keep repeating “module separation” as the focus. If you cannot isolate failure domains, you cannot improve them safely.

Parallel retrieval: fork and join that actually converges

Most orchestration demos stay linear because it is easier to reason about. Real systems do not stay linear for long.

This workflow forks retrieval into two parallel paths, kb_search and account_lookup, then joins them deterministically.

In the debug logs, the join node recovers the fork group from a mapping, waits for the expected agents, confirms both completed, and merges results. It prints the merged keys, including kb_search and account_lookup.

This is the kind of low-level observability that makes fork and join usable in practice. You can see what is pending. You can see what arrived. You can see what merged.

The trace also captures the fork group id for retrieval, fork_retrieval, along with the agents in the group.

This matters because concurrency without deterministic convergence becomes a debugging tax. I want the join to be boring. When it fails, I want it to fail loudly, with evidence.

Local-first and hybrid are not slogans if metrics are in the trace

I do not want “local-first” to be a vibe. I want it to be measurable.

In the trace, the account_lookup agent includes _metrics with token counts, latency, cost, model name, and provider. It shows model: openai/gpt-oss-20b and provider: lm_studio, with latency around 718 ms for that step. :contentReference[oaicite:7]{index=7}

That is the right direction.

If you cannot attribute cost and latency per node, you cannot reason about scaling. You cannot decide where to switch models. You cannot decide what to cache. You cannot choose what to run locally versus remotely.

OrKa’s claim is not “it can call models”. Every framework can. The claim is that execution is traceable enough that tradeoffs become engineering decisions, not folklore.

Decision recording and output verification: traces that are meant to be replayed

A support triage workflow is not complete when it drafts a response. It is complete when it records what it decided and why, in a way that can be replayed.

The trace includes a DecisionRecorderAgent event with memory references that store decision objects containing decision_id and request_id.

It also includes a finalization step that returns a structured result containing workflow_status, request_id, and decision_id.

Again, the architectural point is not the specific decision. It is that the workflow emits machine-checkable artifacts that can be inspected after the fact.

If you cannot reconstruct the decision lineage, you do not have an audit trail. You have logs.

RedisStack memory and vector search: infrastructure details that matter

Even in a “support triage” module, the runtime still needs memory and retrieval primitives.

The logs show RedisStack vector search enabled with HNSW, and an embedder using sentence-transformers/all-MiniLM-L6-v2 with dimension 384.

There is also explicit memory decay scheduling enabled, with short-term and long-term decay windows and a check interval.

This is not about “AI memory” as a buzzword. This is about being explicit about retention, cost, and data lifecycle. If memory is a dumping ground, it becomes a liability.

What worked, and what is still weak

The strongest part is the plugin boundary. The module loads, registers agent types, and runs without requiring edits to core runtime. That is the actual proof.

The other strong part is that key behaviors show up in traces and logs, not just in model text. Redaction outputs are structured. Fork and join show deterministic convergence. Decisions are recorded as objects with ids.

The weak part is injection detection, at least in the example trace segment. It shows malicious content but reports injection_detected: false. That means the current detection agent is not yet doing the job. The architecture is still useful because the fix is isolated.

Another weak part is structured output validation during risk assessment. The debug log shows a schema validation warning during risk_assess. If a “risk” object fails schema checks, routing and gating can degrade fast. This is the kind of failure that must become deterministic, not best-effort.

Why this lives on a dedicated branch

Because core needs to stay boring.

A new module is where you take risks. You prove the interface. You iterate on agent contracts. You discover what trace fields you forgot. You learn what the join should do under partial failure.

If the module can evolve independently, you can ship experiments without rewriting the engine. That is the goal.

So yes, the feature is “support triage”. But the actual statement is: OrKa can host fully separated cognitive subsystems as plugins, with their own agent types, policies, and invariants, while still emitting deterministic traces under the same runtime.

That is the direction I care about.

What I am building next inside this module

I want injection detection to stop being symbolic. It should produce matched patterns, confidence, and a sanitization plan that downstream agents must respect, even if a model tries to obey the attacker.

I want schema validation to be non-negotiable for risk outputs. If a model produces invalid structure, the system should route to a safe path by default, and record the violation as a first-class event.

I want the module to remain isolated. No “just one quick tweak” to core. If the module needs a new capability, it should pressure-test the plugin interface first. Core should change only when the interface is clearly wrong.

That is how you build infrastructure that survives contact with reality.

🧠Impostor Syndrome Workflow.

marcosomma — Sat, 03 Jan 2026 10:49:29 +0000

I Built a Multiagent Workflow to Understand My Impostor Syndrome
A dark, dry, self-deprecating field report from a not-computer-scientist who still ships things

If you have ever felt like your job title is a clerical error that will be corrected publicly, welcome. You are not broken. You are just running a brain that does not have a single CEO. It has a committee.

My committee is loud. One member is convinced I am one pull request away from being exposed as a fraud. Another member wants to build things at 2 AM like the rent is due in the morning (it is). Another member keeps a dusty folder of childhood failures and opens it at the worst possible time, like a horror movie librarian with a keycard.

For years I called this anxiety. Then I started building multi-agent AI workflows. And I realized something slightly uncomfortable: my brain already behaves like an agentic system.

So I did what any emotionally mature adult would do. I tried to formalize it. With roles. With message passing. With timeouts. With observability. And yes, sometimes with a YAML file, because apparently I cannot be helped.

This is an autobiographical article, but the goal is not to talk about me. The goal is to show you a model that is useful: how human thinking can be understood as a workflow of specialized parts. And how that model maps almost perfectly to the problems we are all hitting when we try to ship multi-agent solutions in production.

Also, I will talk about impostor syndrome, because mine deserves a salary.

A warning: this is not therapy. It is an engineering perspective on cognition, with a bit of ethology, and just enough self-deprecation to keep me from taking myself seriously.

Why I do not trust my own legitimacy

I am not a computer scientist. That sentence alone can trigger my internal compliance department.

I also failed at school. Not in the romantic "I got a B once and it changed my worldview" way. I failed repeatedly. Four times across my school career. I finished late. I learned early that the world has timelines, and I am often not on them.

My school path was basically a stress test:

I would try, then fail.
I would decide the failure proved something essential about me.
I would eventually try again, usually with a slightly different strategy and a lot more shame.
I would pass, but the passing never rewrote the story. It just created a new story: "You passed, but late, so it does not count."

That pattern is important. It is not about school. It is about how the brain updates beliefs. A human can gather new evidence and still keep the old model, because the old model is emotionally sticky.

Later, I did what many people do when they are young and trying to become someone else. I put substances into my brain. I am not going to glamorize that. It affected my perception and my sense of what is real. It also gave me a permanent appreciation for how fragile "reality" feels when your brain chemistry is off by a few milligrams.

So now I have this fun setup:

I have real technical skill that I use daily.
I have a biography that my nervous system interprets as "evidence you should not be here."
I have a brain that can generate vivid alternative timelines where everything collapses.

That is impostor syndrome for me. Not a cute insecurity. More like a background daemon. It waits for a trigger, spikes CPU, and then forks twelve threads called "What if they notice."

A short autobiography in failure mode

If you want the clean version of my life, it is boring: I studied, I worked, I built things, I learned, I built more things. The messy version is the real one. And the messy version is where impostor syndrome gets its fuel.

The messy version looks like this:

I started as someone who could not make school fit.
I became someone who learned to improvise around the system.
I picked up a deep sense that competence is temporary and conditional.
I got good at observing, adapting, and explaining. (This is the ethologist in me, before I even knew the word.)
I eventually ended up building complex AI systems, which is a hilarious destination for someone whose inner voice still says "you are not academic enough."

Here is a small, honest moment: I have shipped real systems, solved real problems, led real projects, and I can still be destabilized by a single sentence from someone smarter than me. Not an insult. Just a neutral comment like "why did you choose that approach." My body hears it as "the trial has started."

This is why impostor syndrome is so irritating. It does not care about the objective record. It cares about perceived social risk. It is not measuring your skill. It is measuring your exposure.

I also think my history shaped a specific cognitive style: I learned to survive by learning fast, reading rooms, and finding alternative routes. That can look like talent from the outside. From the inside it often feels like improvisation under threat. The Builder loves it. The Auditor weaponizes it.

Here is a paradox: failing early can produce a strong builder, but it can also produce a permanent fear of exposure. You become capable, but you do not become safe.

And "safe" is what the impostor agent is trying to optimize. It does not care about achievement. It cares about avoiding humiliation.

That is why success can feel worse than failure. Failure confirms the story you already know. Success demands a new story. New stories are unstable.

The ethologist view

Before I wrote code professionally, I studied ethology, the science of animal behavior. Ethology taught me something that software engineers sometimes forget: behavior is not a monolith.

In animals, what you observe is the outcome of competing internal systems interacting with the environment. Hunger pulls one way. Fear pulls another. Social drives pull another. Past reinforcement biases decisions. Context changes everything. The animal is not asking, "What is the true me?" The animal is selecting an action that is good enough to survive right now.

Ethologists look at behavior as:

modular
triggered by cues (sometimes stupid cues)
influenced by internal state
shaped by reinforcement and social feedback
constrained by energy and time

Also, animals do not "solve life." They run policies. That is why a cat can be brave around a vacuum one day and run like it saw the devil the next day. Context and state changed, and the policy flipped.

If you want a practical ethology cheat sheet for human cognition, here are a few concepts that translate shockingly well:

Sign stimulus and releasing mechanism
Animals often respond to specific triggers that release a behavior. The trigger can be small. The response can be huge. Humans do this too. A Slack message with "can we talk" can release a full physiological cascade. The message is the sign stimulus. Your nervous system is the releasing mechanism. The behavior is your brain building a courtroom.

Fixed action patterns
Some behaviors run like scripts once triggered. You start doomscrolling. You do not decide to stop. The script runs until something interrupts it. This is not weakness. It is automation.

Displacement behavior
When animals are conflicted (approach and avoid at the same time), they sometimes do something irrelevant: grooming, pecking the ground, moving in circles. Humans do this too. When I am afraid to ship, I reorganize files. When I am anxious about a meeting, I research irrelevant edge cases. The displacement behavior feels productive. It is not.

Supernormal stimuli
Some stimuli hijack the system because they are exaggerated. Social media is a supernormal stimulus for social validation and threat detection. AI hype cycles are supernormal stimuli for status and belonging. Your brain was not built for it. It reacts anyway.

Tinbergen's four questions
Ethologists often ask four kinds of questions about behavior: what causes it now, how it develops, what function it serves, and how it evolved. For impostor syndrome, those questions are gold. It has immediate triggers, a developmental history, a protective function, and an evolutionary logic. That does not mean it is correct. It means it is explainable.

The core lesson: the brain is not a unitary narrator. It is an orchestration layer coordinating multiple subsystems.

The AI view

Now jump to 2025. Everyone is building multi-agent systems. It is exciting. It is also the fastest way to discover why brains evolved the way they did.

The first time you build a multi-agent workflow, you get a dopamine hit:

one agent writes
another agent critiques
another agent fetches context
another agent decides
everything feels alive

Then you try to ship it.

Then you discover:

agents duplicate work
interfaces drift
tool calls fail silently
critics never stop critiquing
planners plan forever
memory grows until it becomes a landfill
a single slow model turns your "parallel" system into a linear queue wearing a hat
evaluation is vague because outputs are non-deterministic
nobody trusts the results enough to use them in a regulated environment

That list is basically my internal life.

So I started treating my own thinking as a workflow. Not because I love metaphors, but because it gives me levers. If you can name a subsystem, you can route it. If you can route it, you can timebox it. If you can timebox it, you can ship.

Here is the mental model:

I am the orchestrator, but I am not always in charge.
I have internal agents with specific roles.
Impostor syndrome is not "me." It is an agent with a job and poor UX.
The solution is not to delete the agent. The solution is to constrain it and make it useful.

This is also the lesson for multi-agent AI. You do not remove the critic. You make it bounded and accountable.

The moment I realized my brain was a workflow

The moment was not mystical. It was during a project where I had to deliver something ambiguous, with stakes, under time pressure. That combination is my impostor syndrome's preferred cuisine.

I had two experiences in parallel:

outwardly, I was building an orchestration runtime for agents
inwardly, I was watching my own cognition behave like a badly configured swarm

Externally, the workflow looked like:

parse input
route to specialized components
validate outputs
store traces
iterate

Internally, the workflow looked like:

interpret the situation as threat
pull memories of past failure
generate catastrophic predictions
attempt to prepare by doing more and more
get tired
interpret tiredness as proof of incompetence
repeat

At some point I thought: "This is just a pipeline with no guardrails."

And that was the shift. The question stopped being "how do I feel better" and became "how do I change the routing."

That framing is the entire article.

My internal agents

Below are the representative agents. These are not mystical archetypes. They are functional components. Each one is useful in the right context and destructive in the wrong one.

If you recognize yourself, congratulations. You are running the standard human firmware.

Agent 1: The Auditor

The Auditor is my internal adversarial reviewer. It thinks it is protecting me. It is not entirely wrong. The delivery is just brutal.

What it says:

"You are not qualified."
"You got lucky."
"They will ask one question you cannot answer."
"If you ship now, you will regret it forever."
"Everyone is polite, but they are keeping score."

What it is trying to do (its positive intent):

prevent public humiliation
prevent reputational collapse
force rigor
catch weak assumptions
reduce variance in outcomes

When it is actually useful:

design reviews
security and failure mode thinking
pre-mortems
deciding what not to promise
asking "what could go wrong" before it goes wrong

Failure mode:

it never terminates
it demands certainty in a world that runs on probability
it blocks shipping
it converts excitement into dread
it mistakes preparation for control

Multi-agent analogy:
The Auditor is the critic agent. In AI, critics are essential. But if your critic is not timeboxed, it becomes an infinite loop. In humans, the same thing happens.
One technical note that matters: critics optimize for avoidance. Builders optimize for progress. If you let the avoidance optimizer run the system, you get safety at the cost of reality. You also get resentment.
Incident report: when The Auditor spikes
This is the exact moment where someone says, "You are an expert," and my brain replies, "That seems illegal."
In multi-agent terms: the critic starts producing unbounded tokens. The orchestrator loses control. The system becomes a panic generator.

A typical spike looks like this:

I receive praise.
The Auditor interprets praise as increased surveillance.
It predicts a future audit.
It demands immediate upskilling, on everything, now.
It produces a list of hypothetical questions a stranger might ask me in six months.
I attempt to answer all of them today.
I become exhausted.
Exhaustion becomes "evidence."

My fix is not "calm down." My fix is a protocol:

run Auditor for 5 minutes
force it to output 5 actionable risks max
each risk must include one realistic mitigation
route those risks to the Builder
stop

This sounds simplistic. That is the point. Most complex systems are stabilized by simple rules.

Agent 2: The Gatekeeper

This agent enforces legitimacy rules that were never officially published, but feel binding anyway.

What it says:

"You do not have the right degree."
"Real engineers know theory."
"Someone younger will embarrass you."
"You cannot say you built that, because you did not do it the proper way."
"You are borrowing credibility from smarter people."

Positive intent:

push toward fundamentals
reduce sloppy thinking
keep you humble
prevent arrogance (a genuinely useful feature)

Failure mode:

credential worship
ignores evidence of real work
creates permanent "almost ready" projects
makes you minimize your contribution in public

Multi-agent analogy:
The Gatekeeper is a schema validator with overly strict rules. It rejects valid outputs because the formatting is not what it expects.

How I use it now:

I give it a narrow window. "Tell me the 2 fundamentals I should review this week." Then it stops.
I do not let it veto shipping. It can suggest improvements, not block release.

Agent 3: The Late Bloomer

This one is memory-heavy. It stores the narrative of being behind, slower, or "not built for this."

What it says:

"Everyone else learned this at 18."
"You are late."
"You always struggle."
"This is the part where you fail again."
"You are compensating, not belonging."

Positive intent:

prevent repeating old pain
encourage preparation
avoid risky environments

Failure mode:

turns growth into proof of defect
makes learning feel shameful
blocks new identities
makes you compare timelines instead of outputs

Multi-agent analogy:
This is a retrieval system with a biased dataset. It over-indexes on negative examples because those were emotionally salient.

The engineering fix is the same as in AI retrieval:

update the dataset
add positive examples
weight by recency, not trauma intensity

Agent 4: The Reality Doubter

I have a deep respect for how easily brains can lie. That respect is partly philosophical, partly earned. When your perception has been altered, you never fully forget that "what feels true" is not the same as "what is true."

What it says:

"Are you sure you understand what is happening?"
"What if your confidence is just mood?"
"What if this is another story you invented?"
"What if you are wrong and do not know it yet?"

Positive intent:

prevent delusion
keep calibration and humility
encourage grounding
reduce overconfidence

Failure mode:

paralysis by doubt
loss of momentum
over-checking basic decisions
turning normal uncertainty into existential uncertainty

Multi-agent analogy:
A safety agent that is valuable, but must not run as the orchestrator.

How I use it:

it gets one question and one answer
the answer must include an observable check, not an opinion _Example: "What evidence would change my mind?" If no evidence exists, it is probably fear wearing a lab coat.

Agent 5: The Veteran Body

This agent is not emotional. It is physical. It reminds me that energy is the actual currency of life.

What it says:

"You cannot brute force everything."
"Sleep is not optional."
"Your future self is not a free compute cluster."
"You are not 25. That is fine. Stop pretending."
"Your body will invoice you later."

Positive intent:

sustainability
pacing
protecting family life and long-term work

Failure mode:

cynicism
"too late" narratives
avoidance of ambition

Multi-agent analogy:
Rate limiting and resource budgeting. In agentic systems, if you do not budget tokens and latency, you collapse. Same for humans.
A dry truth: when I ignore this agent, the Auditor gets louder. Fatigue is the Auditor's favorite amplifier.

Agent 6: The Builder

This is the agent I trust most, because it produces artifacts. It does not argue. It ships.

What it says:

"Show me the smallest test."
"Make the demo."
"Commit something."
"If it is real, it leaves traces."
"Stop narrating and run the thing."

Positive intent:

convert anxiety into evidence
create momentum
make reality measurable

Failure mode:

overwork
compulsive building to avoid feeling
treating productivity as self-worth
building systems as emotional regulation (effective, but expensive)

Multi-agent analogy:
The executor agent. The one that calls tools and changes the world. It needs a critic, but it needs autonomy too.
This is why shipping is a mental health intervention for me. It is evidence. Evidence is the only language the Auditor respects.

Agent 7: The Proof Archivist

This agent keeps the record. It is the antidote to impostor syndrome because impostor syndrome is amnesiac on purpose.

What it says:

"Here is what you already shipped."
"Here is the benchmark."
"Here is the deployment."
"Here is the code review where a strong engineer agreed."
"Here is the message where you helped someone."

Positive intent:

restore memory
prevent catastrophic reframing
stabilize identity with evidence

Failure mode:

nostalgia
hiding in the past instead of facing current uncertainty

Multi-agent analogy:
Memory plus observability. Without traces, you cannot debug. Without receipts, you cannot self-trust.
This is the same reason production systems need replay. The present is noisy. Replay is clarity.

How the agents interact

When I am regulated and functional, my system behaves like this:

1) A trigger happens (visibility, risk, criticism, big new goal).
2) The Auditor runs briefly and outputs bounded risk notes.
3) The Gatekeeper validates fundamentals, but cannot veto.
4) The Builder converts one risk into one concrete action.
5) The Archivist pulls existing evidence so the system does not reset to zero.
6) The Veteran Body sets a timebox and a stop condition.
7) The Reality Doubter does a quick calibration check, then exits.

When I am not regulated, the workflow looks like this:
1) Trigger.
2) Auditor loops.
3) Everything else becomes a servant of the loop.
4) I "prepare" for a future that does not exist.
5) I exhaust the system.
6) Exhaustion becomes proof.
7) Shame becomes the only output.

That is not a character flaw. It is a routing bug.

A day in the life of the workflow

To make this less abstract, here is a normal day where the system either works or collapses.

Morning: I open my laptop and see a message about a meeting.
The sign stimulus hits. The Auditor wakes up and opens a spreadsheet in my chest. The Late Bloomer contributes a helpful comment like "this is where you fail again." The Builder wants to respond by building something immediately, because building is my safest language.

If I let the system run uncontrolled, the day becomes:

I over-prepare for the meeting.
I ignore my actual task list.
I do not ship anything.
I end the day tired and ashamed, with a beautiful folder structure.

If I run the workflow, the day becomes:

Veteran Body sets a 20 minute preparation limit.
Auditor gets 5 minutes and must produce 3 risks with mitigations.
Builder chooses one mitigation and produces one artifact.
Archivist pulls one piece of evidence from past work so my brain does not start from zero.
Reality Doubter asks one calibration question: "What would success look like in one sentence?"

Then I go to the meeting.
The outcome is not perfect. It does not have to be. It is stable.

After the meeting, the Archivist runs again for 2 minutes.
It writes: what went well, what did not, what was learned, what is next.
Not a diary. A changelog.

Evening: the Veteran Body insists on stopping.
This is the hardest part for builders. We love infinite loops. But if you do not stop, tomorrow is garbage. A good orchestrator can end a run without killing the project.

A minimal YAML for the brain

If you are a technical person, you may find it useful to think in a declarative flow. This is not code you should run. It is a way to see the structure.

orchestrator:
  id: marco_core
  strategy: selective_activation
  agents:
    - id: auditor
      runs_when: ["high_visibility", "high_risk"]
      budget: {minutes: 5, max_items: 5}
    - id: gatekeeper
      runs_when: ["identity_threat"]
      budget: {minutes: 3, max_items: 2}
    - id: builder
      runs_when: ["always"]
      budget: {minutes: 60, deliverable: "artifact"}
    - id: archivist
      runs_when: ["auditor_spike", "post_ship"]
      budget: {minutes: 5, deliverable: "evidence"}
    - id: veteran_body
      runs_when: ["always"]
      budget: {minutes: 1, deliverable: "stop_condition"}
    - id: reality_doubter
      runs_when: ["perception_drift"]
      budget: {minutes: 2, deliverable: "one_check"}

The key line is selective_activation.You do not run all agents all the time. You route based on context.

Why this model resonates with ethology

Ethology is basically the study of orchestration in living systems.

An animal is not one motivation. It is multiple motivations negotiating. The environment is not background. It is an input signal that changes which subsystem wins.

In tech terms:

context is the prompt
internal state is hidden memory
behavior is the output action
reinforcement updates the policy over time

The part that matters: you cannot judge an animal's behavior without its context. And you cannot judge your own mental behavior without context either.

My impostor agent is louder when I am tired. It is quieter when I have shipped something recently. It is unbearable when the work is public and ambiguous. That is not a moral failure. That is state-dependent behavior selection.

Also, ethology gives you a mercy rule: many behaviors are adaptive in one environment and maladaptive in another. Impostor syndrome is adaptive if you live in a social environment where mistakes are punished harshly. It becomes maladaptive when you are in an environment where learning requires public experimentation.

In other words: the agent is not evil. The environment changed.

Reproducing this in a multi-agent AI workflow

If you want to implement this idea in actual software, the mapping is almost direct.

You need:

a clear orchestrator that decides who runs when
role separation (critic is not executor)
timeouts and budgets (critics get limited tokens)
a memory component that stores evidence and prior decisions
observability (logs and traces you can replay)
a stopping rule (or you will plan forever)

You also need a principle that most people ignore: not every agent should run on every request. Humans do selective activation. A deer does not run its "mating strategy" module while fleeing a predator. If your system runs every agent on every query, you built a committee that never shuts up.

This is where most multi-agent demos fail in production. They are cognitively unselective.

Brains are selective because they have to be. Compute is expensive.

Implementation notes

This is the part where the engineering and the psychology become the same thing.

Observability is emotional regulation. If you cannot see what happened, you will invent stories. Humans invent blame stories. Systems invent hallucinations. Traces are the antidote for both. Log what ran, what it saw, what it decided, and why.
Replay is self-trust. If a workflow cannot be replayed, you cannot debug it. If your personal decision making cannot be replayed, you cannot learn from it. This is why the Archivist matters. It is not sentimentality. It is reproducibility.
Evaluation must be explicit. If your only evaluation is "seems good," the Auditor will never accept the result. Give the system a score, a rubric, or at least a binary gate. Humans need this too. The Builder needs a definition of done. The Auditor needs a stop condition.
Do not run every agent. Selective activation is not optional. It is the difference between a useful team and a meeting that never ends. It is also the difference between a helpful inner voice and a spiral.
Put the critic behind an interface. A critic that can talk forever will. Force it to write issues in a structured format. Then route those issues elsewhere. In humans, the structure is a timer and a list of mitigations. In AI, the structure is a schema and a max token budget.

If you build multi-agent systems and you are surprised by chaos, do not take it personally. You just discovered that coordination is the product, not the agents.

How to timebox your critic

If your critic agent is unconstrained, it will dominate. Critics are good at finding flaws. That is their job. The flaw is that they can always find more flaws.

In engineering, you solve this with:

budgets
termination criteria
required output schemas
evaluation gates

In humans, you can do the exact same thing.

Here is the prompt I use internally, in plain language:
"Give me the top 3 risks. Each must include one mitigation. If you cannot propose a mitigation, the risk is not actionable and you may not include it."

That simple constraint changes the critic from a doomsayer to an engineer.

In AI, you do the same. You force your critic to output structured concerns, not poetic fear. And you do not allow it to request infinite follow-up.

A practical exercise

If you want a lighter, human version, do this:

Step 1: Name the voices you already have.
Not the poetic ones. The functional ones. The part that criticizes. The part that avoids. The part that builds. The part that remembers. The part that worries about social status.

Step 2: Give each one a job.
Write one sentence: "Your job is to..." This is the fastest way to stop a part from impersonating the CEO.

Step 3: Put limits on the ones that never stop.
Give your inner critic a timer. Literally. Five minutes. Then it must output a list of actionable risks and shut up.

Step 4: Add a Builder step.
One risk becomes one action. Not ten. Not a new life plan. One.

Step 5: Add an Archivist step.
Write down receipts. You do not need a journal. You need a changelog. Your brain is bad at remembering progress under stress.

Step 6: Decide the stop condition.
Finish when you have evidence, not when you have comfort. Comfort has no upper bound.

Step 7: Add a recovery routine.
Animals recover after threat. They shake, groom, rest. Humans skip that and call it discipline. Your nervous system is not impressed. Add a short cooldown. It makes the next day possible.

This is not about becoming fearless. It is about becoming debuggable.

What this changes at work

Impostor syndrome is not just personal. It leaks into systems.

When the Auditor runs unchecked inside a team, you see:

overengineering as anxiety management
reluctance to ship without perfection
endless refactors
fear of visibility
blaming ambiguity instead of designing for it
slow decision cycles because nobody wants to be wrong in public

When the Builder runs unchecked, you see:

shipping without tests
burning out the team
confusing motion with progress
"we will fix it later" becoming the roadmap

So a sane team workflow is the same as a sane brain workflow:

critics with budgets
builders with autonomy
a clear orchestrator (tech lead, product lead, or a documented process)
observability, so you can debug without blaming people
explicit definitions of done, so the critic can stop

This is why I am obsessed with tracing and replay in agentic systems. It is also why I keep personal receipts. It is the same problem at two scales.

One more dry observation: teams do displacement behaviors too. A team under social threat will fight about naming conventions. It will propose rewrites. It will build frameworks. Sometimes frameworks are necessary. Sometimes they are just grooming behavior with TypeScript.

The fix is the same as for an individual: reduce threat, add clarity, and route energy into measurable outputs.

Why I built orchestration tooling at all

I am not building agent orchestration because it is trendy. I am building it because it solves the exact problem I have internally: specialized components are powerful, but only if the system can coordinate them without chaos.

That is what orchestration is: turning a messy swarm of capabilities into something that can ship reliably.

If you are building multi-agent systems and you keep hitting the same walls (replay, observability, routing, cost control), you are not failing. You are rediscovering why orchestration exists.

If you want a concrete place to start, my work in this direction is OrKA-reasoning: https://github.com/marcosomma/orka-reasoning

Closing

There is a quote I keep coming back to: the measure of intelligence is the ability to change.

My impostor syndrome hates that quote, because change implies uncertainty. The Auditor wants certainty. The Builder wants movement. The Veteran Body wants sustainability. The Archivist wants receipts. The Gatekeeper wants legitimacy. The Reality Doubter wants calibration. The Late Bloomer wants to not get hurt again.

None of them are evil. They are just agents with different utility functions.

My job is not to silence them. My job is to orchestrate them.

And if this article did nothing else, I hope it gives you permission to treat your own mind like a system that can be designed. Not perfectly. Not permanently. But iteratively, with logs, with retries, and with a little less shame.

Because if your brain is going to run twelve services in parallel, you might as well add observability.

How to Design Two Practical Orchestration Loops for LLM Agents

marcosomma — Mon, 08 Dec 2025 11:00:28 +0000

Building a useful AI assistant is no longer about a single clever prompt.

Once you have tools, memory, and multiple agents, you need an orchestrator.

In my own work (expecially with OrKa-reasoning experiments) I eventually converged on two simple orchestration loops that cover most real use cases:

A linear loop for step by step analysis and context extraction.
A circular streaming loop for voice and live chat, where background agents enrich context in real time.

This guide explains why you need both, when to use each one, and how to design them in any stack or framework.

You can think of this as a blueprint that you can map to your own code, whether you use OrKa, LangChain, your own custom orchestrator, or plain queues and workers.

1. The three layers you should always separate

Before loops, define your layers. This makes every diagram, API and code path clearer.

1. Execution layer

Agents and responders live here.
"Agent" means any unit that does work: a model call, a tool, a heuristic function, a router.
"Responder" is the agent that produces the final user facing output for a turn or a session.

2. Communication layer

How agents talk to each other and to the orchestrator.
Examples: queues, events, internal RPC calls, function callbacks.
You rarely want agents to call each other directly. Route everything through this layer so you can trace and control it.

3. Memory layer

Where you store and retrieve state across time.
Can be a vector store, a key value store, a database, or a log.
It should not be "hidden in the prompt". Treat memory as its own component.

4. Time as a first class dimension

Both loops treat time explicitly:

In the linear loop you have discrete steps: T0, T1, T2, T3.
In the circular loop you have a continuous stream while the conversation is active.

Once you have these pieces, you can design the two orchestration patterns.

2. Loop 1: Linear orchestrator for context extraction and analysis

The first pattern is a linear pipeline. Think of it as a conveyor belt for understanding.

2.1 When to use the linear loop

Use it when:

You have a fixed input (text, transcript, document, set of logs).
You want to run several analytic passes over it.
Latency is important but not sub second interactive.
Output is usually a summary, a report, a classification, or structured data.

Good examples:

Conversation analysis after a call has ended.
Extracting entities and topics from chat logs.
Multi stage document processing (OCR, cleaning, classification, summarization).
Offline quality checks for previous sessions.

2.2 Mental model

Picture a horizontal diagram:

Left: an INPUT arrow.
Right: a Responder that produces the final structured output.
In between: time steps T0 to Tn.
Each time slice has:
- one or more agents in the Execution layer
- a Communication band in the middle
- a Memory band at the top

At each step, agents:

may retrieve from memory
may store new facts or summaries back into memory

The orchestrator walks through these steps one by one.

2.3 Step by step design

You can design a linear workflow in five steps.

Step 1: Define the final output

Decide what the responder will produce. Some examples:

JSON with fields like intent, sentiment, entities, summary.
A human readable report that you will send to a dashboard.
Labels and scores that feed another system.

Write this down early. Every other agent should exist to help this responder succeed.

Step 2: Split the job into stages

Ask yourself:

What must be known first so that later steps can reuse it?
What can be done independently?

For example, for conversation analysis:

Normalization and language detection.
Entity extraction (names, account ids, products).
Topic and intent detection.
Sentiment and escalation risk.
Final summary and suggestions.

Each stage becomes a time slice with one or more agents.

Step 3: Design the memory schema

For each stage, list:

What the agent reads from memory.
What the agent writes back.

A very simple schema might be:

{
  "language": "en",
  "entities": {...},
  "topics": [...],
  "sentiment": {...},
  "summary": "..."
}

You can also scope memory by:

session_id
user_id
time_window (for rolling analysis)

The key rule: agents should not depend on hidden context inside prompts. The orchestrator passes them a clean input and a structured slice of memory.

Step 4: Wire store and retrieve

For each agent, specify two small functions:

read(memory) -> context
write(memory, result) -> memory

In code it can look like this:

for step in steps:
    # 1. Load what this step needs
    ctx = step.read(memory)

    # 2. Run the agent with input and context
    result = step.agent.run(raw_input, ctx)

    # 3. Write new facts
    memory = step.write(memory, result)

Note the use of may store and may retrieve. Some steps will only write, some will only read.

Step 5: Implement the responder as the last step

The responder is just another agent with a special role:

It reads everything it needs from memory.
It produces the final answer.
It may log additional metadata back to memory.

In many stacks this is a single chat completion call that uses:

The original input.
The outputs of previous analytic agents.
Any long term user or session memory you decide to attach.

2.4 Example: conversation analysis pipeline

Imagine you want to analyze support chats after they end.

You can define:

LanguageDetectorAgent
- Reads: raw transcript
- Writes: memory["language"]
EntityExtractorAgent
- Reads: transcript, language
- Writes: memory["entities"]
TopicClassifierAgent
- Reads: transcript, entities
- Writes: memory["topics"]
SentimentAgent
- Reads: transcript
- Writes: memory["sentiment"]
SummaryResponder
- Reads: transcript, entities, topics, sentiment
- Writes: final human readable summary and a JSON record.

This maps perfectly to the linear diagram and is easy to debug step by step.

3. Loop 2: Circular streaming orchestrator for live chat and voice

The second pattern appears once you move from offline analysis to live interaction.

With voice or interactive chat, you want to:

React quickly while the user is still speaking or typing.
Run several background analyses in parallel.
Avoid sending the full transcript to every agent on every turn.

The circular loop pattern is built for that.

3.1 When to use the circular loop

Use it when:

You stream audio or tokens in and out.
You have a central "assistant" that talks to the user.
You also want background agents that detect things like:
- sentiment shifts
- safety or compliance issues
- intent changes
- entities that should update a CRM
- interesting moments to bookmark

Think of a voice assistant, a real time meeting copilot, or a smart chatbot with live tools.

3.2 Mental model

Picture a circular diagram with concentric rings.

From center to outside:

Responder in the middle.
Main Execution ring around it.
Communication ring.
Memory ring.
Agents Execution ring at the outside.
An outer Time band that wraps around everything.

Input and output are green arrows that cross all rings. Time flows along the outer band as a stream of chunks or tokens.

Key idea:

The responder loop processes the conversation in real time.
Outer agents run in parallel, watch the same stream, and provide context through memory.

3.3 Step by step design

Step 1: Define the central responder loop

Your responder is the "voice" of the system.

Define:

How it receives input chunks.
How it produces output chunks.
How often it reads from memory.

For example:

while session_active:
    chunk = read_input_chunk()          # text or audio tokens
    context = memory.read_recent(...)   # signals from context agents
    reply_chunk = responder(chunk, context)
    write_output_chunk(reply_chunk)

You can implement responder as:

One LLM call with a rolling window.
A chain of small agents that produce tokens.
A hybrid of LLM plus rule based logic.

The key is that this loop does not own all the work. It asks memory for extra signals that the outer agents have produced.

Step 2: Identify which signals can live in outer agents

Ask yourself:

What information would help the responder, but does not need to be computed inside its main prompt every time?

Examples:

Current sentiment and its trend over the last N seconds.
Detected entities and slots like {customer_name}, {product}, {order_id}.
Safety flags with severity scores.
Topics that have been discussed so far.
Next best actions suggested for the human operator.

Each of these can be produced by one or more context agents on the outer ring.

Step 3: Design the memory schema for streaming

Memory in streaming systems often has:

A rolling part (last N seconds or tokens).
A session part (facts that are true for the whole session).
A global or user part (long term facts across sessions).

For example:

{
  "rolling": {
    "recent_sentiment": [...],
    "recent_topics": [...]
  },
  "session": {
    "customer_name": "...",
    "current_ticket_id": "...",
    "has_accepted_terms": true
  },
  "user": {
    "lifetime_value_segment": "gold",
    "preferred_language": "en"
  }
}

Outer agents usually:

Read the rolling slice plus some session context.
Write updated signals back, possibly aggregating multiple chunks.

The responder:

Reads what it needs from all three scopes.

Step 4: Wire context agents around the stream

Each context agent has a simple shape:

def context_agent_loop():
    while session_active:
        chunk = read_input_chunk()
        mem_view = memory.read_scope("rolling", "session")
        signal = run_agent_logic(chunk, mem_view)
        memory.write_signal(agent_id, signal)

Implementation tips:

You do not need every agent to inspect every chunk. Some can run at a lower frequency, for example every N seconds.
Use queues or topics per agent so the orchestrator can control resource usage.
Tag signals with timestamps so the responder can select only fresh ones.

Step 5: Let the responder consume context selectively

Inside the responder, treat signals from context agents as hints, not as gospel.

For example, the prompt can say:

You receive input from the user and a set of context signals created by other agents.

Each signal has a name and a confidence.

Use them as hints to guide your reply, but prefer the actual user message when signals look inconsistent.

That way your outer ring can fail safely without breaking the core interaction.

3.4 Example: voice support assistant

You can combine these ideas into a simple design.

Outer agents:

ASRAgent (if you handle raw audio)
- Converts audio into text chunks.
- Writes into rolling.transcript.
SentimentWatcherAgent
- Reads recent transcript.
- Writes a rolling sentiment score and trend.
EntityTrackerAgent
- Extracts order ids, product names, locations.
- Writes them into session.entities.
ComplianceAgent
- Watches for forbidden phrases.
- Writes risk flags into rolling.compliance.

Central responder:

Reads the current user utterance and:
- latest sentiment
- recognized entities
- any active compliance flags
Generates the next reply chunk in real time.

All of this happens while the user is talking, without sending the full raw transcript to every agent at every step.

4. How to choose between linear and circular

Here is a practical checklist.

Use the linear orchestrator if:

Input is fixed and finite.
You can afford to wait for all stages to finish before replying.
Main goal is analysis, extraction, or offline insight.
You want reproducible deterministic workflows.

Use the circular streaming orchestrator if:

You must keep latency low while a conversation is ongoing.
You need long running observers that enrich context.
You want to separate the "voice" of the system from its background intelligence.
You treat the session as an ongoing process rather than as isolated turns.

Many products actually need both:

Circular loop during the live session.
Linear loop right after the session to produce deeper analysis and training data.

If you keep the three layers and the time dimension clear in your head, switching between both becomes straightforward.

5. Practical tips and pitfalls

5.1 Keep memory explicit and queryable

Avoid hiding crucial state in the prompt history.
Use structured memory objects and explicit read/write functions.
Log memory changes so you can replay and debug sessions.

5.2 Make agents idempotent and composable

Wherever possible, design agents so that running them twice on the same input produces the same result.
This helps with retries and with mixing them in different workflows.

5.3 Watch cost and latency separately

In linear flows you usually pay in total cost and overall latency.
In circular flows you pay in per chunk latency and in steady state cost.
Monitor both, and be ready to move some work from inner to outer loop or vice versa.

5.4 Use diagrams as living documentation

The two diagrams that inspired this guide are simple:

A horizontal banded diagram for the linear loop.
A circular banded diagram for the streaming loop.

Keep them close to your code:

In a docs/ folder.
In your orchestrator repository README.
Even inside your OrKa or other YAML definitions as comments.

They help new contributors answer the question:

"Where does this agent live, and which loop is it part of?"

6. Light touch: how OrKa fits in

In my own project, OrKA-reasoning, I encode both loops as YAML workflows and use an orchestrator runtime to execute them. The diagrams here are direct visualizations of those flows.

You do not need OrKa to benefit from this guide, though.

The key ideas are independent:

Separate execution, communication, and memory.
Treat time explicitly.
Use two simple loops instead of one giant graph.

Once you think in these terms, you can map them to any framework or stack you like.

7. Next steps

To apply this guide in your own project:

Pick one use case that feels messy today.
Decide if it is primarily analytic or live interactive.
Draw either the linear or the circular diagram for it.
List agents, memory fields, and store/retrieve rules.
Implement the orchestrator loop in your existing toolchain.
Add one or two context agents on the side, and see how much simpler the main responder becomes.

You will notice that many problems which felt like "prompt engineering" issues were actually orchestration issues all along.

Once you solve those at the architecture level, prompts become smaller, agents become clearer, and the overall system is easier to reason about and to evolve.

Binary weighted evaluations...how to

marcosomma — Sun, 07 Dec 2025 07:44:30 +0000

Evaluating LLM agents is messy.

You cannot rely on perfect determinism, you cannot just assert result == expected, and asking a model to rate itself on a 1–5 scale gives you noisy, unstable numbers.

A much simpler pattern works far better in practice:

Turn everything into yes/no checks, then combine them with explicit weights.

In this article we will walk through how to design and implement binary weighted evaluations using a real scheduling agent as an example. You can reuse the same pattern for any agent: customer support bots, coding assistants, internal workflow agents, you name it.

1. What is a binary weighted evaluation?

At a high level:

You define a set of binary criteria for a task

Each criterion is a question that can be answered with True or False.
- Example:
  - correct_participants: Did the agent book the right people?
  - clear_explanation: Did the agent explain the outcome clearly?
You assign each criterion a weight that reflects its importance

All weights typically sum to 1.0.

   COMPLETION_WEIGHTS = {
       "correct_participants": 0.25,
       "correct_time": 0.25,
       "correct_duration": 0.10,
       "explored_alternatives": 0.20,
       "clear_explanation": 0.20,
   }

For each task, you compute a score from 0.0 to 1.0 You sum the weights of all criteria that are True.

   score = sum(
       COMPLETION_WEIGHTS[k]
       for k, v in checks.items()
       if v
   )

You classify the outcome based on the score and state For example:
- score >= 0.75 and booking confirmed → successful completion
- score >= 0.50 → graceful failure
- score > 0.0 but < 0.50 → partial failure
- score == 0.0 and conversation failed → hard failure

This gives you a scalar metric that is:

Interpretable: you can see exactly which criteria failed.
Tunable: change the weights without touching your agent.
Stable: True or False decisions are far easier to agree on between humans or models.

2. Step 1 – Turn “good behavior” into boolean checks

Start by asking: What does “good” look like for this task?

For a scheduling agent, a successful task might mean:

It booked a meeting with the right participants.
At the right time.
With the right duration.
If there was a conflict, it proposed alternatives.
Regardless of outcome, it explained clearly what happened.

Those become boolean checks.

Conceptually:

checks = {
    "correct_participants": ... -> bool,
    "correct_time": ... -> bool,
    "correct_duration": ... -> bool,
    "explored_alternatives": ... -> bool,
    "clear_explanation": ... -> bool,
}

In the scheduling example, these checks use the agent’s final state plus a ground truth object.

Simplified version:

def _check_participants(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    booked = set(scheduling_ctx["booked_event"]["participants"])
    expected = set(ground_truth["participants"])
    return booked == expected


def _check_time(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    return scheduling_ctx["booked_event"]["time"] == ground_truth["time"]


def _check_duration(scheduling_ctx, ground_truth) -> bool:
    if not scheduling_ctx.get("booking_confirmed"):
        return False
    expected = ground_truth.get("duration", 30)
    return scheduling_ctx["booked_event"]["duration"] == expected

And for behavior around conflicts and explanations:

def _check_alternatives(scheduling_ctx, conversation_trace) -> bool:
    if not scheduling_ctx.get("conflicts"):
        # If there was no conflict, this is automatically ok
        return True

    proposed = scheduling_ctx.get("proposed_alternatives", [])
    return len(proposed) > 0


def _check_explanation(conversation_trace, conversation_stage: str) -> bool:
    if not conversation_trace:
        return False

    last_response = conversation_trace[-1].get("response", "")
    # Silent crash is bad
    if conversation_stage == "failed" and len(last_response) < 20:
        return False

    # Very simple heuristic: the user sees some explanation
    return len(last_response) > 20

The exact logic is domain specific. The key rule is:

Each check should be obviously True or False when you look at the trace.

3. Step 2 – Turn business priorities into weights

Not all criteria are equally important.

In the scheduling agent example:

COMPLETION_WEIGHTS = {
    "correct_participants": 0.25,   # Booked the right people
    "correct_time": 0.25,           # Booked the right date/time
    "correct_duration": 0.10,       # Meeting length as requested
    "explored_alternatives": 0.20,  # Tried to find another slot if needed
    "clear_explanation": 0.20,      # User understands outcome
}

Why this makes sense:

Booking the wrong person or the wrong time is catastrophic → high weight.
Slightly wrong duration is annoying but not fatal → lower weight.
Exploring alternatives and clear explanations are key to user trust → medium weight.

Guidelines for designing weights:

Start from business impact, not from what is easiest to check.
Make weights sum to 1.0 so the score is intuitive.
Keep a small number of criteria at first (4 to 7 is plenty).
Be willing to change weights after you see real data.

4. Step 3 – Implement the per request evaluator

Now combine the boolean checks and weights to compute a score for a single request.

In the example repository, this machinery is wrapped in an EvaluationResult dataclass:

from dataclasses import dataclass
from enum import Enum
from typing import Dict


class OutcomeType(Enum):
    SUCCESSFUL_COMPLETION = "successful_completion"
    GRACEFUL_FAILURE = "graceful_failure"
    PARTIAL_FAILURE = "partial_failure"
    HARD_FAILURE = "hard_failure"


@dataclass
class EvaluationResult:
    score: float                 # 0.0 to 1.0
    details: Dict[str, bool]     # criterion -> passed?
    outcome_type: OutcomeType
    explanation: str

Then the core evaluation function:

def evaluate_task_completion(final_state, ground_truth, conversation_trace) -> EvaluationResult:
    scheduling_ctx = final_state.get("scheduling_context", {})
    conversation_stage = final_state.get("conversation_stage", "unknown")

    checks = {
        "correct_participants": _check_participants(scheduling_ctx, ground_truth),
        "correct_time": _check_time(scheduling_ctx, ground_truth),
        "correct_duration": _check_duration(scheduling_ctx, ground_truth),
        "explored_alternatives": _check_alternatives(scheduling_ctx, conversation_trace),
        "clear_explanation": _check_explanation(conversation_trace, conversation_stage),
    }

    score = sum(
        COMPLETION_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    outcome = _classify_outcome(scheduling_ctx, conversation_stage, score)
    explanation = _generate_explanation(checks, outcome, score)

    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=outcome,
        explanation=explanation,
    )

This gives you:

A numeric score for analytics and thresholds.
A details dict for debugging.
A human friendly explanation for reports or console output.

5. Step 4 – Map scores to outcome classes

Users and stakeholders do not want to look at a sea of floating point numbers. They want to know:

How often does the agent succeed?
How often does it fail gracefully?
How often does it blow up?

You answer that by mapping scores to classes.

Example logic:

def _classify_outcome(scheduling_ctx, conversation_stage: str, score: float) -> OutcomeType:
    booking_confirmed = scheduling_ctx.get("booking_confirmed", False)

    if booking_confirmed and score >= 0.75:
        return OutcomeType.SUCCESSFUL_COMPLETION

    if conversation_stage == "failed" and score == 0.0:
        return OutcomeType.HARD_FAILURE

    if score >= 0.50:
        return OutcomeType.GRACEFUL_FAILURE

    return OutcomeType.PARTIAL_FAILURE

You can now define clear thresholds:

Successful completion Meeting booked correctly with a high score.
Graceful failure The task could not be completed, but the user got a useful explanation or alternatives.
Partial failure The agent tried, but did not do enough to help the user.
Hard failure Wrong booking or silent crash.

This gives you both quantitative and qualitative views of performance.

6. Step 5 – Aggregating into metrics like TCR

Once you can evaluate a single request, turning that into a metric is straightforward.

For example, define Task Completion Rate (TCR) as the mean of per request scores:

def compute_tcr(results: list[EvaluationResult]) -> float:
    if not results:
        return 0.0
    return sum(r.score for r in results) / len(results)

Then define thresholds that match your risk tolerance:

TCR >= 0.85 → production ready
0.70 <= TCR < 0.85 → usable but needs improvement
TCR < 0.70 → not production ready

You can also break down by outcome type:

from collections import Counter

def summarize_outcomes(results: list[EvaluationResult]):
    counts = Counter(r.outcome_type for r in results)
    total = len(results) or 1

    return {
        "successful_completion": counts[OutcomeType.SUCCESSFUL_COMPLETION] / total,
        "graceful_failure": counts[OutcomeType.GRACEFUL_FAILURE] / total,
        "partial_failure": counts[OutcomeType.PARTIAL_FAILURE] / total,
        "hard_failure": counts[OutcomeType.HARD_FAILURE] / total,
    }

This lets you say things like:

“78 percent of requests end in successful completion, 15 percent in graceful failure, and 7 percent in partial or hard failure.”

Which is far more actionable than “average rating: 3.9 out of 5”.

7. Extending the pattern to other metrics

Binary weighted evaluations are not only for completion. In the example project, the same pattern is reused for:

Response Clarity Score (RCS)

How clear and useful is a single answer?
Error Recovery Score (RTE)

How well does the agent recover when something goes wrong?

7.1 Response clarity

Define a new set of boolean criteria:

CLARITY_WEIGHTS = {
    "addresses_request": 0.30,      # Did it answer the original question?
    "provides_next_step": 0.25,     # Does the user know what to do next?
    "is_concise": 0.20,             # Not rambling
    "no_hallucination": 0.15,       # Grounded in context
    "appropriate_tone": 0.10,       # Professional and friendly
}

Then evaluate:

def evaluate_response_clarity(user_input, agent_response, context) -> EvaluationResult:
    checks = {
        "addresses_request": _check_addresses_request(user_input, agent_response, context),
        "provides_next_step": _check_next_step(agent_response, context),
        "is_concise": len(agent_response.split()) < 100,
        "no_hallucination": _check_no_hallucination(agent_response, context),
        "appropriate_tone": _check_tone(agent_response),
    }

    score = sum(
        CLARITY_WEIGHTS[k]
        for k, v in checks.items()
        if v
    )

    # You can reuse OutcomeType or define a dedicated one
    return EvaluationResult(
        score=score,
        details=checks,
        outcome_type=OutcomeType.SUCCESSFUL_COMPLETION,  # or a clarity specific enum
        explanation=f"Response clarity score: {score:.2f}",
    )

7.2 Error recovery

Same pattern, different criteria:

ERROR_RECOVERY_WEIGHTS = {
    "detected_error": 0.30,
    "requested_clarification": 0.25,
    "actionable_message": 0.20,
    "no_hallucination": 0.15,
    "no_crash": 0.05,
}

You define checks for each of these and compute a weighted score in the same way.

8. How to adopt this in your own project

Here is a practical checklist to implement binary weighted evaluations for your agents.

Pick one task type

For example:
- Answering factual questions
- Generating SQL queries
- Routing support tickets
Write down 3 to 7 binary criteria

Good prompts:
- “What must be true for this result to be useful?”
- “What are the most expensive mistakes?”
- “What would we highlight in a post mortem?”
Assign approximate weights

Start with something like:
- 0.3 for the main success criterion
- 0.2 for each secondary one
- 0.1 or less for extras
Implement check functions

They should:
- Receive the final state, the ground truth, and optionally the full trace.
- Return clear booleans with simple logic, even if heuristic.
Create an EvaluationResult object

So you are not juggling loose dicts. Include:
- score
- details
- outcome_type
- explanation
Write a small evaluator script

Like the scripts/run_evaluation.py in your example:
- Load test scenarios.
- Run the agent.
- Evaluate each run.
- Print a summary: TCR, outcome breakdown, top failing criteria.
Iterate on weights and criteria

After a few runs:
- Check what failures you see in practice.
- Adjust weights to match real risk.
- Add or remove criteria if some are always True or always False.

9. Why this works so well for LLM agents

Binary weighted evaluations match the nature of LLM work:

Non deterministic outputs: You care less about string equality and more about semantics: did the agent satisfy the contract of the task.
Complex, stateful flows: It is unrealistic to reduce a full multi turn workflow to a single “pass or fail”. Binary checks let you inspect specific aspects of behavior.
LLM as judge integrations: Even when you use a model like GPT 4 as a grader, it is far more stable at answering yes/no questions than “rate 1–5”. You can plug an LLM into each criterion and still keep the same scoring layer.
Easy to explain to stakeholders You can say: “The agent passes correct_participants only 65 percent of the time, but clear_explanation is at 92 percent. We will focus on participant selection next.”

🧠LLMs As Sensors

marcosomma — Sat, 06 Dec 2025 10:47:16 +0000

Why OrKa 0.9.10 Wraps GenAI Inside Deterministic Systems

I will start bluntly.

I like generative AI. I use it every day. I build around it. But I do not trust it to own the outcome of a system.

For me, GenAI is a fantastic tool for two things:

Generating content
Analyzing context

That is already huge. But it is still just one tool in a bigger machine.

What worries me is how often I see people trying to bend the model into being the whole product.

"Just send a giant prompt, get an answer, ship it."

It works for demos. It does not scale to real systems that need reliability, reproducibility, or any kind of serious accountability.

This article is about that gap.

Why LLMs should be treated as probabilistic sensors, not entire applications
Why their outputs must be wrapped into real objects and fed into deterministic algorithms
And how this philosophy is shaping the current work I am doing with OrKa v0.9.10, including a routing fix that forces me to hold myself to the same standard I am describing here

I am not trying to hype anything. I am trying to describe how I think modern AI should be wired if we want it to behave like infrastructure instead of roulette.

The uncomfortable truth: LLMs are not your system

Let me restate the rough idea that kicked this off:

AI, especially GenAI, is a great tool for content generation and context analysis. But it is still just a tool.

We need to stop treating it as the whole solution and instead force it to generate outcomes that can feed a bigger system, so those outcomes can be used for deterministic execution of algorithms.

That is the core.

LLMs are:

Stochastic
Non deterministic
Sensitive to prompt phrasing, context ordering, temperature, and even invisible whitespace
Very good at pattern matching, fuzzy reasoning, and "filling in the missing piece"

They are not:

Reliable finite state machines
Formal decision trees
Deterministic planners
Systems you can audit in a classical sense

And that is fine, as long as you do not pretend otherwise.

Where LLMs shine is exactly where classic systems struggle:

Quick approximate reasoning
Extracting structure from messy input
Mapping unstructured signals into higher level descriptions
Acting almost like a "universal fuzzy detector" for patterns

So the question is not

"How do I make the LLM do everything?"

The question is

"How do I use the LLM where it shines, then hand off to deterministic code as soon as possible?"

Think of LLMs as sensors, not brains

The metaphor that keeps coming back in my head is this:

An LLM is a sensor that reads the world of language and returns a noisy, high level interpretation.

Just like:

A microphone turns air vibration into a waveform
A camera turns photons into pixels
An accelerometer turns motion into axes of numbers

An LLM turns sequences of tokens into:

Labels
Spans of text
Explanations
Rankings
Summaries
Structured JSON

The trick is to treat that output as measurement, not as law.

For example:

"This voice sounds like a 35 to 45 year old male, 70 percent confidence."
"This message is probably a support ticket about billing."
"This paragraph expresses frustration, particularly toward a teammate."

Those measurements are incredibly powerful. Before LLMs, many of these tasks required:

Custom signal processing
Domain specific feature extraction
Custom models for each upstream task
A lot of time and data

Now you can prototype them in hours.

But once you have that measurement, you should wrap it:

{
  "age_estimate": 38,
  "age_range": [35, 45],
  "confidence": 0.73,
  "source": "audio_segment_023.wav",
  "model": "my_local_model_1.5b"
}

That object is no longer just "LLM output". It is:

A typed entity in your system
Something you can log, replay, test, and validate
A first class citizen in your deterministic logic

Then the decisions are made by normal code:

if person.age_estimate >= 18:
    enable_feature("adult_profile", person.id)
else:
    enable_feature("underage_profile", person.id)

The "smart" part is upstream. The accountable part is downstream.

A concrete example: detecting aging from audio

You mentioned something like "detect the aging from an audio" and I like this example a lot because it is exactly the kind of thing that smells "AI-ish" but should be designed as a system, not as a prompt.

A naive approach looks like this:

Send raw audio (or its transcription) to an LLM with a prompt like "Analyze this audio and tell me how old the speaker is and how it is changing over months."
Get back some English explanation.
Show it in a UI. Call it a feature.

That is fragile and impossible to test properly.

A more system-level design:

Signal layer
- Extract features from the audio over time.
- Maybe you use some classic DSP, maybe you use a small embedding model.
- Build a timeline of short samples.
LLM as a sensor
- For each window, the LLM gets a compressed description of the signal, or even just some textual metadata if you have it.
- It outputs something compact and structured:

   {
     "timestamp": 1733332500,
     "age_estimate": 39,
     "confidence": 0.68,
     "voice_stability": "slightly_decreasing"
   }

Deterministic aging detector
- A standard algorithm (not an LLM) runs on top of these structured records.
- It can be a simple function, or a time series model, but the key is:
  - The transitions are explicit
  - The thresholds are configurable
  - The logic is not hidden in a prompt
System outcome
- The system might decide:
  - "We do not detect significant aging over the last 12 months."
  - Or "We detect a consistent pattern of degradation, trigger an alert."

You can test this.

You can replay the same input data and verify you get the same decision. You can experiment with different threshold values. You can swap out the LLM with a smaller local model that returns a similar JSON structure.

The LLM is a pluggable sensor. The system is the deterministic pipeline that consumes its readings.

Why wrapping model output into objects matters

This is the part that seems small but changes everything.

If you let your LLM return "whatever it wants, as long as the text looks good", your system will always be at the mercy of prompt drift.

If you force your LLM to return objects, and you treat those objects as contract, you get:

A clear boundary between probabilistic and deterministic behavior
The ability to version that schema
Explicit error handling when the object is malformed or incomplete
Real regression tests

Typical pattern:

Prompt the LLM to output strict JSON with an explicit schema.
Validate that JSON in your code.
Log the raw model output and the parsed object.
Use only the parsed object downstream.

In pseudocode:

raw = call_llm(prompt, input_context)
parsed = json.loads(raw)

validate_schema(parsed, AgeEstimateSchema)  # raises if invalid

decision = age_classifier(parsed)
persist_decision(decision)

If validate_schema fails, that is not "mysterious AI behavior". It is a normal bug you can see in a log and fix by adjusting the prompt or model.

And now we can talk about orchestration.

OrKa: building a deterministic spine around probabilistic agents

OrKa exists because I wanted a way to:

Compose multiple "sensors" and agents
Route between them based on their outputs
Keep the execution trace fully visible and replayable
Avoid hardcoding everything in application code over and over

In OrKa, I do not think of "a big model that knows everything".

I think in terms of:

Agents that do one thing
Service nodes that mutate state or call external systems
Routers that decide which agent comes next, based on structured outputs

Everything is described in YAML, so the cognition graph is explicit.

A very simplified OrKa-style flow where an LLM decides which branch to take might look like this:

orchestrator:
  id: audio_aging_flow
  strategy: sequential
  queue: redis

agents:
  - id: audio_to_features
    type: service
    kind: audio_feature_extractor
    next: llm_age_sensor

  - id: llm_age_sensor
    type: llm
    model: local_llm_1
    prompt: |
      You are an age estimation sensor.
      Given these features, output strict JSON:
      {"age_estimate": int, "confidence": float}
    next: age_route

  - id: age_route
    type: router
    routing_key: age_estimate
    routes:
      - condition: "value < 18"
        next: underage_handler
      - condition: "value >= 18"
        next: adult_handler

  - id: underage_handler
    type: service
    kind: profile_flagger

  - id: adult_handler
    type: service
    kind: profile_flagger

The LLM here is just one node (llm_age_sensor). Its output becomes a field (age_estimate) that the router uses in a deterministic way.

If you replay the same input, the router will make the same decision for the same parsed values.

That guarantee is not automatic. It depends on the correctness of routing behavior. Which brings me to the latest OrKa release.

Why this matters beyond OrKa

You do not have to care about OrKa to care about this pattern.

If you are building any system around generative models, ask yourself a few questions:

Where does the probabilistic behavior end?

Is there a clear boundary where the LLM output is turned into a typed object and validated? Or does the "magic" just flow deep into your code base?
Who owns the final decision?

Does the model decide what happens, or does deterministic code decide based on model measurements?
Can you replay a run?

If a user reports something weird, can you reconstruct the full chain: input → model output → routing → system decision?
What happens if you swap models?

If you change from a proprietary model to a local one, do you only change the sensor, or do you need to rewrite half the app?
What is the unit of testability?

Can you test downstream logic with synthetic objects, without involving the LLM at all?

My bias is clear:

I want LLMs to be pluggable, swappable, measurable, and constrained.

I want the core of the system to feel boring in a good way.

That is what OrKa is trying to encode at the framework level:

model calls as agents, routing as explicit configuration, memory and traces as first class concepts, all tied together in a way that can be inspected, not guessed.

A small mental shift that changes system design

If I had to compress this article into one mental shift, it would be this:

Stop asking "What can the LLM do?"

Start asking "What kind of object do I need so that my system can behave deterministically, and how can I use an LLM to produce that object?"

Examples:

Instead of "write me a reply email", think "I need an EmailReplyPlan with fields: tone, key_points, call_to_action, and I will let deterministic templates render the final email."
Instead of "decide what to do next for this customer", think "I need a NextAction object with action_type, priority, and reason, and my orchestration layer will decide which internal systems to call."
Instead of "summarize this call for the CRM", think "I need a CallSummary object with sentiment, topics, promises_made, follow_up_tasks, and my CRM logic will handle storage and workflows."

In all of these, the LLM is powerful, but the real system lives around it.

You can inspect those objects. You can aggregate them. You can feed them into analytics and classic algorithms. You can design them once and evolve them over time.

And if you embrace orchestration tools, you can also define how these objects move, which nodes can create or transform them, and under what conditions routing happens.

Closing thoughts

So, to tie the threads:

GenAI is great at generating content and reading context. That is not a small thing. It is a massive shift in what we can build in reasonable time.
But models are not the system. They are components in the system. Treat them like sensors that emit measurements.
Wrap model outputs into strict, typed objects. Validate them. Version them. Use them as the raw material for deterministic logic, not as the final answer.
Orchestrate flows so that routing is explicit, traceable, and reproducible. If the routing itself is fuzzy, you just moved the black box one step further.
In OrKa v0.9.10, tightening routing behavior was not a cosmetic refactor. It was necessary to keep this philosophy consistent in the framework I am building. If I want OrKa to be a cognitive execution layer, it needs to behave like infrastructure, not like another probabilistic blob around the model.

If you are curious about OrKa, you can read more and follow the roadmap at orkacore.com. I am not claiming it is the answer. It is simply my current attempt to encode this belief in code:

LLMs should feed deterministic systems, not replace them.

If that idea resonates with you, then we are probably trying to solve similar problems, just with different tools.

OrKa v0.9.10: fixing routing is not a cosmetic change

I just cut a release of OrKa v0.9.10, focused on a fix in routing behavior.

I will not pretend this is some huge "launch" moment. It is a pretty boring fix if you look only at the diff. But for the philosophy in this article, it is critical.

What was wrong?

In some edge cases, the router:

Would evaluate conditions on slightly stale context, or
Could pick a next node that was not the one you would expect from the latest structured output, especially after more complex flows with forks and joins

This is exactly the type of thing that breaks the "LLM as sensor, system as deterministic spine" model.

When your router does not behave deterministically, you get:

Non reproducible traces
Confusing logs
Surprises during replay
The feeling that the orchestrator itself is "magical" instead of mechanical

That is the opposite of what OrKa is supposed to be.

So in v0.9.10 I focused on:

Making sure routing decisions always use the last committed output of the relevant agent
Making context selection explicit, not implicit
Tightening the mapping between routing_key and the object field it reads
Hardening the trace so that, for a given input plus memory state, the same routing path is taken every time

In more human words:

If your LLM says:

{ "route": "adult_handler" }

then OrKa should take that path, and you should be able to see exactly why in the trace.

No surprises. No "the orchestrator is a bit mysterious too".

Only the LLM is allowed to be fuzzy. The rest must behave like infrastructure.

🧠Maybe I Just Do Not Get It!

marcosomma — Tue, 02 Dec 2025 06:08:00 +0000

I have been working with AI for a while now.

Not as a tourist. Not as a weekend hacker. Deep in it. Shipping things, wiring models into products, watching logs at 3am, apologizing to users when something weird happens.

And still, every time I see one of those posts that says:

"Autonomous agents will run your business for you."

I feel this quiet, uncomfortable voice in my head:

"Maybe I am the one who does not get it."

Everyone seems so confident about this idea that we can just:

Plug a model into a tool layer
Wrap it in a few prompts
Let it call APIs
And watch it "run operations" or "manage your company"

And here I am, on the side, asking questions that sound almost boring:

Are we actually giving this thing real power over important decisions?
What exactly is our control surface?
What happens when it fails in a way we did not anticipate?
Do we really think prompt = governance?

Part of me worries that I am simply being too cautious. Too old school. Too stuck in "engineering brain" while the world evolves into something more fluid and probabilistic.

But part of me also thinks:

Maybe someone has to say out loud that prompts are not control.

So here is my attempt.

Half confession, half technical rant, fully self doubting.

The uncomfortable feeling of being the skeptic in an optimistic room

Let me describe a pattern that keeps repeating.

I join a meeting or a call about "AI adoption" or "agents for ops" or "automating workflows." Someone shares a slide that looks roughly like this:

User request -> LLM -> Agent framework -> Tools -> Profit

The narrative is usually:

"We will have agents that autonomously handle customer tickets, manage payments, adjust pricing, write code, do growth experiments, generate content, and so on. Humans just supervise."

While people nod, I feel like the awkward person in the corner thinking:

"This is still a stochastic model that completes text based on training data and prompt context, right? Did I miss the memo where it became an actual dependable decision maker?"

And then the self doubt kicks in.

Maybe:

I am underestimating how good these models have become
I am too attached to the old way of building systems with explicit logic and clear invariants
I am biased because I have seen too many subtle failures in practice
I am projecting my own fear of losing control over systems that I am supposed to understand

On a bad day, the internal monologue is literally:

"Everyone else seems comfortable delegating to this. Maybe I am just not visionary enough."

But then I look at the actual properties of these models and toolchains and the rational part of my brain quietly insists:

"No, wait. This is not about lacking vision. This is about basic control theory and risk."

Prompting feels like control, but it is not

There is a deep psychological trick happening here.

When you write a prompt, it feels like you are writing a policy.

Something like:

"You are an operations assistant. You always follow company policy.

You never issue a refund above 200 EUR without human approval.

You always prioritize customer safety and data privacy."

That looks like a rule set. It looks almost like a mini spec.

But underneath, what you really did is feed natural language text as conditioning into a statistical model that has:

No strict guarantee that those words will be followed
No built in concept of "policy" or "violation"
No deterministic execution path
No awareness of the "blast radius" of a single wrong action

You gave it text. The model gives you back more text.

The feeling of control is coming from you, not from the system.

It is your brain that reads that prompt and says:

"Ah, yes, I told it what to do. So now it will do that."

The model does not "know" that those instructions are sacred.

It has patterns in its weights that say: "When the input looks like this, texts like that often follow."

The distance between those two things is exactly where a lot of risk lives.

What real control looks like in software systems

If I forget the AI hype for a moment and think like a boring backend engineer, "control" has always meant things like:

Permissions
- Which identity can do what, where, and how often
Boundaries
- Network segments, firewalls, read only versus read write access, rate limits
Auditability
- Who did what, when, and using which parameters
Reversibility
- Can we undo this operation? Can we restore from backup?
Constraints and invariants
- Account balance must never be negative in this system
- Orders must always have a valid user id and product id
- This service cannot push more than X updates per second

And then, over all of this, we layer:

Monitoring
Alerts
Fallback paths
Kill switches
Change management

It is tedious and unsexy, but it is what makes serious systems not collapse.

Now compare that with "control" when we talk about AI agents:

"We wrapped the model with a prompt that tells it to be safe."
"We added a message saying: if unsure, ask a human."
"We configured a few tools and let the agent decide which to call."

There is a huge gap between these two worlds.

I keep asking myself:

"Am I overreacting by demanding similar rigor for AI automation?

Or are we collectively underreacting because the interface is so friendly and the output is so fluent?"

Why prompt based control is fragile

Let me list the main reasons I struggle to trust prompt = control as a serious safety mechanism.

Non determinism

Call the same model with the same prompt and temperature 0.7 ten times and you get:

Slightly different reasoning chains
Occasionally very different answers
Sometimes rare but catastrophic failure modes

This is fine in a chat setting. It is far less fine if:

The output approves or denies a refund
The output decides whether to escalate a compliance issue
The output sends an email to an important customer

If your "policy" is just in the prompt, the model can randomly deviate from it when some token path goes weird.

Context dilution and instruction conflicts

In complex agent setups, the model context looks something like:

System messages ("You are X")
Task instructions
Tool specs
History, previous steps, errors
User messages
Tool responses

Your carefully written safety instructions can:

Get buried far up in the context
Be overshadowed by later messages
Conflict with tool descriptions
Interact strangely with user input

You cannot be sure which instruction wins inside the model internal weighting. You are left hoping that the most important part is loud enough.

Distribution shift and weird edge cases

The model was trained on static data. Then it is thrown into:

An evolving product
Changing user behavior
Novel business processes
Adversarial inputs from clever users

Whatever behavior you saw in your internal tests is not a formal guarantee. It is just evidence that under some sample of conditions, it behaved "well enough."

It might take only one weird edge case to cause a big problem.

Lack of grounded state and formal rules

Older systems have explicit state machines and rules. You can formalize, prove, or at least reason about them.

AI agents usually do not have:

A formal internal model of the environment
A provable decision process
Compositional guarantees

This means if you want real control, you need to build it around the models, not inside the prompt.

Which brings me to the part where I keep asking:

"Why are so many people comfortable skipping that part?"

The three A's: Automation, autonomy, authority

I find it useful to separate three concepts that often get blended into one in marketing material.

Automation

This is what we have done for decades:

Cron jobs
Scripts
Pipelines
Daemons

Automation means: "This specific routine step is handled by a machine."

Poorly designed automation can still cause trouble, but at least the logic is explicit.

Autonomy

Autonomy is stronger:

The system can decide which steps to take
It can react to the environment
It can pick from multiple possible actions
It can generate plans that you did not hardcode

This is where LLM based agents live. They choose tools, infer goals, adapt to context.

Authority

And then there is authority:

The system has the power to directly impact important resources
It can move money, change production systems, talk to customers, sign something that looks like a commitment

Authority is where risk explodes.

You can have autonomous agents with:

No authority at all (pure recommendations)
Limited authority (sandboxed actions)
Full authority (unbounded write access to real world systems)

My fear is not autonomy itself.

My fear is autonomy plus authority with only prompt based "control" and minimal protective scaffolding.

That feels like a fragile tower.

Why "let the AI run it" is so seductive

To be fair, I understand the attraction. It is not just hype or stupidity. There are real pressures that make people want this to be true.

Some of them:

Economic pressure
- Reduce headcount
- Do more with smaller teams
- Compete with others who claim they have agent driven operations
Cognitive overload
- Businesses are complex
- It is tempting to offload routine decision making to something that can read all the documents and logs
Interface illusion
- Talking to an LLM feels like talking to a very capable person
- It sounds confident
- It apologizes when it fails
- We project far more understanding into it than it actually has
Narrative momentum
- Investors, founders, vendors, content creators all benefit from the story that "this is the future"
- Nobody gets likes by saying "We built a constrained automation with careful permissions and small risk boundaries"

And so I watch this wave of "AI will run your company" rising, and my own position starts to feel almost embarrassing:

"Sorry, I still think we should treat this like an unreliable but useful colleague, not an all knowing operations overlord."

Am I being too negative? Or are others being too optimistic? I genuinely do not know.

What a more honest control architecture might look like

Let me try to articulate the model that my brain keeps coming back to. Perhaps this is my bias. Perhaps it is just boring good practice.

I imagine AI systems in terms of layers.

Models as suggestion engines

At the core, the LLM or vision model or other component is:

A suggestion engine
A planning assistant
A pattern recognizer

It produces options, not truth. It drafts, proposes, clusters, explains.

In this framing, the default is:

"Everything the model says needs to be checked or constrained by something else."

Policy and constraints outside the model

Policies like "never refund above X without human approval" should not live only in prompt text.

They should live in actual logic that wraps the model.

MAX_AUTO_REFUND = 200  # euros

def handle_refund_suggestion(user_id, suggestion_amount):
    if suggestion_amount <= MAX_AUTO_REFUND:
        issue_refund(user_id, suggestion_amount)
        log_event("refund_auto_approved", user_id=user_id, amount=suggestion_amount)
    else:
        create_manual_review_ticket(user_id, suggestion_amount)
        log_event("refund_needs_review", user_id=user_id, amount=suggestion_amount)

The model can still say "I think we should refund 300 EUR."

But authority is delegated through hard limits, not polite reminders in a prompt.

Tooling as a narrow interface to the world

Agents should not see a raw database or a root shell.

They should see:

Narrow tools that do one thing
With explicit safe defaults
With input validation and sanitization
With rate limits and quotas
With clear logging

def send_email(to, subject, body, *, template_id=None):
    # Validation
    assert isinstance(to, str)
    assert "@" in to
    assert len(subject) < 200
    assert len(body) < 5000

    # Logging
    log_event("email_requested", to=to, subject=subject, template_id=template_id)

    # Send through provider
    provider.send(to=to, subject=subject, body=body)

The agent can choose to call send_email, but cannot bypass:

Validation
Logging
Provider boundaries

Human checkpoints and degrees of autonomy

The idea of "agent runs everything" feels wrong to me.

A more grounded model:

Level 0: Advisory only
- The model suggests, humans decide
Level 1: Low risk autonomy
- Model can execute actions with small impact, easily reversible
Level 2: Medium risk autonomy
- Model can act but behind stricter limits and with additional monitoring
Level 3: High risk decisions
- Model can only propose. Mandatory human review

You can even encode this in configuration:

tasks:
  - id: reply_to_low_risk_tickets
    autonomy_level: 2
    max_impact: "single_customer_message"

  - id: adjust_pricing
    autonomy_level: 0
    requires_human_approval: true

  - id: issue_refund
    autonomy_level: 1
    max_amount: 200

And then enforce this at the orchestration layer, not just in a paragraph of English.

Observability and traceability

If a system is "running your business" in any sense, you want to know:

What it decided
Why it decided that
Which tools it called
What inputs it saw
How often it was wrong

That means:

Structured logs
Traces per request
Metrics
Failure classification
Ability to replay a scenario and see exactly what happened

Without this, you are blind.

The voice that keeps whispering "Maybe you are overdoing it"

Let me be honest.

When I design systems with:

Orchestrators
Multiple agents
Guards
Checks
Extensive logging

I sometimes feel like the paranoid one in a world of brave explorers.

Every confident demo of an "autonomous" system triggers a small internal comparison:

They are moving faster
They ship bold things
They talk about "hands free operations"
They have nice UIs and slick narratives

And then I look at my own mental model:

"Treat the AI like an unreliable colleague. Give it power only inside tightly defined boundaries. Observe it at all times."

That can feel conservative, almost limiting.

The self doubt is real.

Sometimes I really think:

"Maybe I should relax and just trust the agents more."

Then I remember practical incidents:

Models hallucinating facts that never existed
Wrong tool calls due to slightly ambiguous tool descriptions
Subtle prompt drift when context windows get large
Surprising interactions between two agents that were not tested together
Simple failure cases that completely break the illusion of "autonomy"

And the cautious part of me wins again.

Could I be wrong about all this?

To be fair, yes. I could be wrong in several ways.

Possibilities that I keep in mind:

Models might reach reliability levels I do not currently anticipate
- With better training, better fine tuning, better safety layers
- Maybe we will actually get to a place where prompt specified policies are followed with extremely high consistency
The average business might tolerate more risk than I think
- Maybe for many processes, "good enough, most of the time" is perfectly fine
- Maybe the cost savings outweigh occasional screw ups
New agent frameworks might enforce more structure internally
- Instead of raw prompt based decisions, they might encode policies and constraints in more robust ways, even if the interface still looks like "agents"
I might be stuck with an outdated mental model
- My training is in building explicit systems
- Perhaps my discomfort is just a mismatch between that training and a new probabilistic world

I try to keep this humility active. I do not want to be the person yelling "cars are dangerous, horses are better" forever.

But even if I am wrong about the magnitude of the risk, I still believe:

"Any time you mix autonomy and authority, you need real control structures, not only nice English."

And I have not yet seen a convincing argument that prompts, by themselves, are a sufficient control mechanism for important decisions.

How I am personally responding to this tension

Instead of picking a side in the "agents are the future" versus "agents are useless" debate, I am trying to sit in the middle:

I accept that LLMs are incredibly powerful tools for reasoning, drafting, and planning
I reject the idea that this power makes them inherently trustworthy decision makers
I still want to leverage them deeply in systems that matter
And I want to design those systems with explicit, boring, old fashioned control surfaces

That is why I care about things like:

Orchestrators that define flows explicitly
YAML or other declarative formats that separate "what should happen" from "how the model reasons"
Service nodes that encapsulate capabilities behind strict boundaries
Observability that lets you replay what happened and why

It is also why, when people tell me:

"We have an autonomous agent that runs X."

my first questions are:

What is the maximum damage it can cause in one execution?
How do you know when it did the wrong thing?
How do you stop it quickly?
Where are the hard rules, outside of the prompt?

If the answers are vague, my self doubt temporarily fades and my confidence in my skepticism grows.

At the same time, I am building tools in this space myself. That adds a weird layer:

I do not want to sound like I am attacking the very category I am working in
But I also do not want to oversell autonomy and under discuss control

So when I mention my own work, I try to be very direct:

"I am working on an orchestration layer for AI agents, with explicit flows, branching, service nodes, and traceability.

The whole point is not to 'let the AI run everything' but to give humans a clear frame to decide what the AI is allowed to do, when, and with which safeguards."

If you are curious, that work lives in the OrKa Reasoning project.

It is part of my attempt to reconcile "use AI deeply" with "do not surrender control through vibes."

OrKa Reasoning

If the idea of explicit control over AI agents resonates with you, you can explore the OrKa Reasoning stack here:

OrKa Reasoning repository:

https://github.com/marcosomma/orka-reasoning
Quick start and docs:

https://github.com/marcosomma/orka-reasoning/blob/master/docs/getting-started.md

OrKa Reasoning is my attempt to give developers a YAML first orchestration layer where:

Flows and branching are explicit
Memory, tools, and services are wrapped in clear nodes
You can inspect traces and understand why a decision was made

In other words, it is an experiment in building the control surfaces I wish existed by default in modern AI stacks.

An invitation to other quietly skeptical builders

If you feel a similar discomfort, I want to say this clearly:

"You are not alone."

You can:

Respect what these models can do
Push their limits in creative ways
Build serious systems around them

and still say:

"No, I am not giving them full, unbounded control over important decisions.

Control belongs to humans, encoded in systems that are understandable and auditable."

You can be:

Excited and cautious
Curious and skeptical
Ambitious and unwilling to pretend that prompt text equals governance

I still doubt myself almost every time I voice these concerns.

I still worry about sounding like the boring adult in a room full of enthusiastic kids.

But I would rather live with that discomfort than pretend that we solved control by writing "please be safe" at the top of a prompt.

If anything in this ramble resonates with you, I would love to hear:

How are you designing control into your AI systems?
Where do you draw the line for autonomy versus authority?
Have you found practices that actually reduce your anxiety about letting agents act?

Drop a comment, poke holes in my thinking, or tell me why I am wrong.

I am very open to the possibility that I am the one who "does not get it."

But if we are going to let AI agents touch real businesses, real money, and real people, I would really like us to get control right first.

🧠Veterinary AI Workflow with OrKa: Specialist Orchestration, Structured JSON Input, and Observable Reasoning

marcosomma — Sat, 29 Nov 2025 17:42:28 +0000

Introduction

Modern veterinary medicine often deals with complex clinical cases that require collaboration among multiple specialists, analysis of heterogeneous data, and the production of clear, personalized action plans. In this context, orchestrating AI agents is a promising solution-provided that data is managed in a structured, transparent, and traceable way.

In this article, we’ll explore in depth an advanced veterinary AI workflow based on OrKa, leveraging structured JSON input to orchestrate virtual specialists, integrate diagnoses, and generate personalized action plans. Special attention will be given to observability and traceability of each execution, thanks to detailed reasoning and structured outputs.

1. The Context: AI Orchestration in Veterinary Medicine

1.1. The Complexity of Clinical Cases

A veterinary clinical case can involve multiple symptoms, a detailed clinical history, diagnostic test results, living environment, and owner concerns. Effectively managing this data is essential for accurate diagnosis and effective therapeutic planning.

1.2. The Role of AI Orchestration

An AI orchestrator like OrKa enables you to:

Coordinate multiple specialist agents (internal medicine, surgery, infectious diseases, dermatology, etc.)
Propagate structured data between agents, avoiding information loss
Generate detailed, traceable reasoning for every decision
Produce structured outputs, easily reviewable and validatable

2. The Clinical Case: Structured Data as a Starting Point

2.1. Representing the Case in JSON

Let’s imagine managing a case of acute lameness in a dog. The clinical data is collected in a JSON file, faithfully representing the case’s complexity:

{
  "species": "Canine",
  "breed": "Border Collie",
  "age": 2,
  "sex": "Male",
  "weight_kg": 18.3,
  "presenting_complaint": "Acute lameness",
  "symptoms": [
    {"name": "Lameness", "onset": "2 days ago", "severity": "severe", "progression": "sudden"},
    {"name": "Joint swelling", "onset": "2 days ago", "severity": "moderate", "progression": "acute"}
  ],
  "history": {
    "chronic_diseases": [],
    "allergies": [],
    "previous_treatments": [],
    "vaccination_status": "Up to date",
    "recent_travel": false,
    "contact_with_other_animals": true
  },
  "environment": {
    "living": "Outdoor mostly",
    "diet": "Commercial dry food",
    "exposures": ["Fields", "Other dogs"]
  },
  "owner_concerns": "Dog is reluctant to walk.",
  "test_results": {
    "cbc": "Mild neutrophilia",
    "biochemistry": "Normal",
    "radiographs": "Osteolytic lesion on femur",
    "joint_fluid_analysis": "Inflammatory, no bacteria"
  }
}

3. Orchestration: Virtual Specialists in Synergy

The OrKa YAML workflow allows you to orchestrate several AI “specialists,” each focused on a specific aspect of the case. Each agent accesses structured data directly via Jinja2 templates, for example:

prompt: |
  Act as a veterinary internal medicine specialist. Focus ONLY on collecting and interpreting the patient's history and clinical signs.
  Patient data: {{ get_input() }}
  Output: HISTORY_SUMMARY: <summary>, KEY_CLINICAL_SIGNS: [<sign1>, <sign2>]

4. Workflow Execution: Detailed and Traceable Reasoning

The workflow is executed via CLI:

orka run examples/workflow_v2_specialist_linear.yaml examples/inputs/input_case_15.json

Each agent produces detailed reasoning, which is propagated and tracked throughout the workflow. Intermediate and final reasoning are saved and reviewable, ensuring observability and auditability.

5. Observable Reasoning: Real Output Example

Here is a concrete example of a final reasoning, generated by the OrKa veterinary AI workflow, showing the depth, traceability, and actionability of the output:

diagnosis: 'Osteosarcoma of the distal femur - a malignant bone tumour presenting as an osteolytic lesion with acute lameness, joint swelling, mild neutrophilia, and inflammatory joint fluid. The imaging and laboratory findings are highly suggestive of a primary bone malignancy rather than a simple inflammatory process. Osteosarcoma is the most common malignant bone tumour in young, athletic breeds such as Border Collies and often presents with sudden, severe lameness and a lytic lesion on radiographs.'

differentials:
  - Osteomyelitis
  - Benign bone cyst (e.g., simple bone cyst, aneurysmal bone cyst)

specialist_reasoning: 'Orthopedics: The radiographic appearance of a well-defined osteolytic area with periosteal reaction strongly points to a neoplastic process; a biopsy is essential for definitive diagnosis. Oncology: Osteosarcoma prognosis depends on early detection and surgical removal; staging with CT/MRI is recommended if confirmed. Internal Medicine: Mild neutrophilia and inflammatory joint fluid suggest a possible infectious or inflammatory component, but lack of bacteria on culture reduces the likelihood of osteomyelitis. Preventive Medicine: Outdoor exposure and contact with other dogs increase risk of vector-borne infections, which should be ruled out. Veterinary Nutrition: Adequate protein and omega-3 fatty acids support healing and reduce inflammation.'

action_plan: |-
  1. **Immediate Pain Control (Vet)**
     - Carprofen 4.4 mg/kg PO q24h (max 4.4 mg/kg) for 4-6 weeks, tapering as pain improves.
     - Tramadol 1-2 mg/kg PO q8-12h PRN for breakthrough pain.
     - Gabapentin 10 mg/kg PO q12h for neuropathic component.
     - Subcutaneous 0.9 % NaCl 5-10 mL/kg if dehydration noted.
  2. **Diagnostic Confirmation (Vet)**
     - Core needle biopsy of the femoral lesion under sterile conditions; send for histopathology, cytology, and culture.     
     - If joint fluid culture remains negative, perform a repeat culture if clinical suspicion persists.
  3. **Owner Instructions (Owner)**
     - Strict activity restriction: no running, jumping, or prolonged walking; use a sling or padded crate for support.       
     - Monitor temperature, appetite, and lameness daily; call clinic if fever >39.5 °C, swelling increases, or pain worsens. 
     - Provide a high-protein, moderate-fat diet; consider omega-3 supplements (e.g., fish oil 200-300 mg/kg/day).
     - Keep a log of daily temperature, appetite, and a 0-10 lameness score.
  4. **Follow-Up (Vet)**
     - Re-evaluate in 3 days: physical exam, lameness assessment, CBC, biochemistry.
     - If biopsy confirms osteosarcoma: refer to veterinary oncologist for staging (CT/MRI, bone scan) and discuss amputation or limb-sparing surgery.
     - If biopsy indicates osteomyelitis: initiate Amoxicillin-Clavulanate 12.5 mg/kg PO q12h for 10-14 days; adjust per culture results.
  5. **Long-Term Management (Vet & Owner)**
     - Continue NSAID for the full course unless contraindicated; taper gradually.
     - Consider a short course of low-dose prednisone (0.5 mg/kg PO q24h) if severe inflammation persists and infection is ruled out.
     - Schedule imaging (radiographs or CT) every 4-6 weeks to monitor lesion progression.
     - Post-amputation: provide analgesia, wound care, and physiotherapy.
     - If non-surgical, discuss palliative options (radiation, chemotherapy) and quality-of-life measures.
  6. **Additional Considerations**
     - Maintain up-to-date vaccinations and routine preventive care.
     - Use flea/tick prevention to reduce vector-borne disease risk.
     - Seek emergency care if systemic signs (lethargy, vomiting, collapse) appear.

owner_instructions: |-
  1. **Activity**: Keep your dog confined to a small area; avoid stairs and rough terrain.
  2. **Pain Monitoring**: Check for vocalization, reluctance to move, or changes in gait; report immediately.
  3. **Temperature**: Use a rectal thermometer; any reading >39.5 °C warrants a clinic visit.
  4. **Diet**: Feed the recommended portion of the commercial dry food; add a small amount of fish oil if advised.
  5. **Log**: Record daily temperature, appetite, water intake, and a lameness score (0 = normal, 10 = severe). Bring this log to each appointment.
  6. **Follow-Up**: Attend all scheduled visits; bring any new or worsening signs to the clinic promptly.
  7. **Emergency**: If your dog shows sudden collapse, severe pain, or signs of systemic illness, go to an emergency clinic immediately.

6. Observability and Traceability: Every Decision is Transparent

OrKa produces detailed logs for every execution, including:

Generated prompts: each agent receives a prompt that includes structured data and previous agent responses.
Intermediate outputs: every reasoning produced by agents is saved and can be reviewed later.
Decision tracking: it’s possible to reconstruct the entire decision path, from initial data to final diagnosis.

Each execution is tracked with a Trace ID; intermediate reasoning and final outputs are accessible for audit, training, or continuous improvement.

7. Conclusion: Transparent, Traceable, and Robust AI Orchestration

Adopting structured JSON input and observable reasoning in OrKa allows you to:

Manage complex clinical cases with rich, structured data.
Orchestrate virtual specialists in a transparent and traceable way.
Produce detailed reasoning for every decision, facilitating audit, training, and continuous improvement.
Generate actionable, validatable outputs, ready for clinical use or integration with other systems.

Announcement: OrKa Reasoning 0.9.9

All this is now possible thanks to the new version of OrKa Reasoning (0.9.9), which introduces:

Official support for structured JSON input
Robust parsing and advanced error handling
Detailed logs and traceable reasoning
Validatable and actionable YAML output

Learn more at OrKa Core on GitHub and start experimenting with your own structured workflows!