← Back to blog

A council for your coding agent.

Dylan Moore
Dylan Moore
· 5 min read

Listen to this article

Narrated with a natural AI voice.

Mixing several models and fusing their answers into one is a genuinely good idea. It also breaks any agent that uses tools, which is now most of them.

I wanted my coding agent to stop betting everything on a single model. Route the easy stuff to something fast and cheap, send the hard stuff to a panel of models that argue it out, and let the best answer win. OpenRouter ships exactly this and calls it Fusion. So I built a local version, pointed Claude Code at it, and watched it die on the first message.

@dymoo/fusion is the thing I ended up with instead. A local proxy that serves the OpenAI and Anthropic APIs, sits between your coding tool and several models, and decides per request who handles it. Open source, TypeScript, MIT.


Why fusing models breaks your agent

A panel works like this: fan the prompt out to several models, then have a judge read their answers and merge them into one. The output is prose. That's the whole mechanism: it produces text.

A coding agent does not run on text. It runs on tool calls: read this file, run the tests, apply this edit, read the failure, try again. The model's job each turn isn't to write an essay, it's to emit a tool_use block the harness can execute and feed back.

Put those together and you get nothing. The judge writes a lovely paragraph about what it would do; the agent receives prose with no tool call attached, has nothing to execute, and stops. I watched Claude Code do this on every single turn: "I'll start by reading the files…", and then it just returned. Confidently. Forever.

A panel can write you an essay about the work. It can't do the work.

What the big providers actually do

This isn't a bug I could patch. It's the shape of the thing. So I went and read how OpenRouter actually does it, because they clearly have it working.

They have two separate products. Fusion (the panel-and-judge, mixture-of-agents one), which their own docs scope to "research questions, expert critique, compare-and-contrast prompts" and describe as one-shot synthesis, explicitly not iterative agents. And Auto Router, which does something much less glamorous: it picks the single best model for a prompt, and for multi-turn agent work it adds session stickiness: pin that model for the whole conversation so you don't swap brains mid-task and lose the thread.

Nobody fuses agents. The frontier providers don't either. Fusion is for answers; routing is for agents. Once that landed, the design got obvious.

Council-then-act

The fix is to stop asking the panel to answer, and start asking it to advise.

On a request the router judges hard, the other models deliberate, as text. What's the next move, which tool, what are the edge cases, what is the obvious thing the actor is about to miss. A judge condenses all of that into a short briefing. Then one strong model (the actor) runs with the real tools and that briefing dropped into its context, and emits the actual tool call.

Multi-model reasoning, one coherent set of hands. The other models think; the actor acts.

Every response carries a header showing what happened:

x-fusion-route: mode=single; tier=plan; served_by=codex;
                actor=codex; council=minimax+glm+kimi

And the logs read like a meeting:

council gate       tier=plan  source=embedding      # local model rates it "hard"
council convened   advisors = minimax, glm, kimi    # three models deliberate
council member ok
council member ok
council member ok
council briefing   judge=kimi  chars=485            # opinions condensed to a brief
council acting     actor=codex  briefed=true        # strongest model takes over…
served             servedBy=codex                   # …and writes the code, with tools

The deliberation cost about thirteen seconds; the actor (GPT-5.5 at maximum reasoning) took the rest. The agent loop never stalled, because the thing that reaches it is a real tool call, not a paragraph about one.

The router picks the model

Not every request deserves a committee. Renaming a variable doesn't need four models and a judge.

So the routing decision is made by a small, code-trained embedding model that runs in your own process: jina-embeddings-v2-base-code, 768 dimensions, quantised to 8-bit, around 150 MB, on the GPU if you have one. It scores each request and sorts it: trivial work goes to one fast model, normal work round-robins a couple, genuinely hard work convenes the council. The request never leaves your machine to make that call.

It's also the only classifier. No keyword list, no token-count heuristic quietly overriding it when the embedding model is inconvenient. That's a deliberate rule across the repo: no silent fallbacks. If the embedding model can't load, smart requests fail loudly with a 503 and /health tells you why. They don't degrade to something dumber while pretending nothing happened. A fallback you didn't notice is just a bug with good manners.

Setup

Point any Anthropic-compatible client at it:

npm i -g @dymoo/fusion
fusion start
export ANTHROPIC_BASE_URL=http://localhost:8787
export ANTHROPIC_MODEL=fusion
claude

The upstreams are yours to choose in config.yaml. I run Codex (GPT-5.5) as the actor with GLM, Kimi and MiniMax on Ollama Cloud as the council, but it's any mix of Codex, OpenAI-compatible, or Anthropic endpoints. Tool turns pin to the actor and stay there for the session; tool-free questions can still use the full panel, because that's the one place fusing actually helps.

The honest bit

The panel still can't do tools. That hasn't changed and won't. It's physics, not a TODO. Any request carrying tools routes to a single model, always. Council-then-act is the workaround, not a refutation.

It's slow on hard turns. A council plus an actor at maximum reasoning is a deep-work tool, not autocomplete. The whole point is to spend more compute where it matters; if you want a fast single model, the router already gives you that for everything else.

No benchmarks. I'm not claiming it beats any particular model on any particular eval. The claim is narrower and architectural: you can get more than one model's reasoning into a tool-using agent without breaking the agent. Whether the extra opinions are worth the latency on your work is for you to decide.

v0.1.0. It's new. I use it daily; you should still smoke-test it against your own setup before you trust it with anything that matters.

GitHub · npm

Dylan Moore

Written by Dylan Moore

Self-taught developer since age 13. Sold first software company at 16 for $60K, second for mid-six figures. Founded multiple ventures. Currently founding developer at PodFirst.

Posts you may like