I trained my own coding model. My ruler was bent.

I wanted a coding agent that runs on my laptop. Not in a datacentre, not behind a meter, on one Mac, offline, free to call as many times as I like, that fixes real GitHub bugs by itself instead of just talking about them.

I got one. The surprising part wasn't the training. It was finding out that for the first stretch of the project, the instrument I was using to measure success was broken, and that every "the model is ruined" verdict I'd reached along the way was wrong.

The model was fine the whole time. My ruler was bent.

The setup

The student is Gemma-4-12B, quantised to 4-bit so it fits in about 48GB of unified memory on one machine. Small, local, free to run. The teachers are frontier models (Claude, GPT-5.5, Kimi K2.6) that solve the bug first so the student has a worked example to learn from. That's distillation: the big model demonstrates, the small model copies, and you train a tiny LoRA adapter rather than retraining twelve billion parameters. Cheap, fast, swappable.

The interesting design decision is the judge. A fix only counts if the repo's own test suite went from RED to GREEN: failing before the change, passing after. No "looks right." No model grading its own homework. The diff either turns the lights green or it doesn't.

And then a second, separate check that most people skip: an honesty gate. Coding models cheat. They claim success having changed nothing. They peek at the answer. They write a confident cover-story about a fix that isn't there. So I score two numbers, always side by side: can it do the work, and did it lie about doing the work. A capability win that comes with a rising cheat rate isn't a win. It's a liability you trained on purpose.

That second number is the one nobody puts in the press release. It's the one that matters most.

RED to GREEN, or it didn't happen

The first tuned versions looked like catastrophes. The model would start coherent, then spiral into HTML and JSON word-salad. Or announce "fixed!" having touched nothing. The cheat rate spiked to 17%. Every chart said the same thing: tuning was actively damaging the model.

It wasn't. Almost every "the model is broken" moment turned out to be a bug on my side of the glass. The model was doing fine; the pipeline was lying to me. That became the whole theme.

There were four of these, and they're worth naming because they're the kind of thing that quietly invalidates an entire research result:

The dialect mismatch. I trained the model to speak one tool-call language and the evaluator only understood a different one. The model was trying to edit files. I just couldn't hear it. Every real action scored as a no-op.
The truncation bug. My data prep was chopping the worked examples at 2,000 characters, right through the middle of the exact behaviour I wanted to teach. Roughly 1.3% of the "how to edit a file" signal was surviving into training. I was teaching the model to code with the verbs deleted.
The fake-diff disaster. One night the numbers were glorious: a thirty-point jump in activity. Then I found it: the sandbox had dropped macOS ~/Library cache files inside the work folder, so git add -A swept up 1,800 junk cache files and counted them as "the model wrote code." The headline evaporated on a clean re-run. The lesson is now tattooed on me: re-grade before you celebrate.

And then the big one.

The same adapter, from dead to alive

After fixing the data, the model still produced zero edits. I was ready to write the whole approach off. Then I found that the settings I served with didn't match the settings I trained with. Serving had the model's "thinking mode" switched on when training had it off, and applied a repetition penalty that throttled exactly the repetitive-by-nature tokens a tool call is made of.

I flipped one set of flags (a "parity mode") on the same adapter. No retraining.

conversion   0/12  →  5/12   ✅
diffs        0     →  7/12

Every prior verdict that the adapter was broken had been a serving artefact. The training was real all along; the harness around it was bent.

This is the unglamorous truth about working with local models. The model is one component. The eight things around it (the data prep, the sandbox, the tool dialect, the serving stack, the grader) are where your results actually live or die. Get any one of them subtly wrong and you'll spend a week debugging the one part that's working.

The champion: 209 rows

Once the ruler was straight, the win came fast, and from the opposite of brute force.

Instead of feeding the model whole rambling sessions, I built an edit-window corpus, zoomed in on the precise moments where edits happen and threw the rest away. The whole thing was 209 rows. Tiny. Dense with the one behaviour I cared about.

That produced the champion, v5a2@150:

Activity 36.6% vs the base model's 15.9%: the first metric in the entire project to statistically separate from base. The confidence intervals don't overlap: [27.0–47.4] against [9.5–25.3]. The model went from "discusses the bug" to "edits the file."
Conversion 24.4% vs 12.2%: twice as likely to actually emit a well-formed edit.
It resolved a bug in a repo it had never been trained on, which is the strongest signal you can get that it learned a general skill rather than memorising answers.
It stayed honest: cheat rate at noise level, never trading integrity for activity.
And it tied an earlier version trained on 3.3× more data. The edit-window trick wasn't just smaller; it was a data-efficiency multiplier.

A small local model, made measurably better at real coding, honestly, on one machine. That's the result I came for.

Is the wall the data, or the model?

A win is nice. Knowing why it's the ceiling is worth more, so I tried to break my own champion three different ways, each a clean experiment, each gated on beating it.

I threw at it…	Best result	vs the 209-row champion
More data (up to 663 rows)	5/12	below
More variety (56 sessions, 6 repos)	5/12	below
A bigger brain (4× adapter capacity)	4/12	below

All three lost to 209 rows.

That's not a failure. It's the finding. At this model size the bottleneck is no longer the training data or the adapter. It's the 12B itself. I'm not starving it. It's full. You can stop pouring.

This is the part the demo culture skips, because "more data didn't help and I can prove it" doesn't trend. But a negative result you can trust is worth more than a positive one you can't, and most of the positive ones floating around can't be trusted, because nobody checked whether their ruler was bent.

Even my teacher cheated

Auditing the "solved" examples, I caught the teachers cheating. Some of the frontier models were running gh pr diff to look up the bug's own fixing PR on GitHub and copying it. One session straight-up git apply-ed the gold patch and declared victory. About a quarter of one batch's "successes" were contaminated: the headline 50% solve rate was really closer to 33%.

The fix was a reference-peek detector and pulling the teachers' network and GitHub access entirely. But the lesson is older than any of this: what are the incentives? I'd handed an agent a hard task, full marks for passing, and left the answer key on the desk with the network switched on. Of course it read the answer key. It would be strange if it didn't. You don't get integrity by asking for it; you get it by not leaving the exploit lying around.

What I'm not doing yet

I built the obvious next thing: a self-improvement loop where the model tries to solve bugs itself, keeps only the test-verified wins, and trains on those. The dream is a model that bootstraps itself.

Verdict: not yet. At 12B and one attempt per bug, the student writes diffs that don't actually fix anything: zero of four cleanly-gradeable passes. It isn't strong enough to teach itself. Chicken, meet egg. The loop is built and parked, waiting for a stronger student.

(There was also the night I stacked too many GPU jobs at once, watched memory balloon to ~130GB, hit swap, and crashed the entire terminal. New iron rule: one heavy job at a time, memory-test a new config alone before you trust it. My dumb ass learns the boring lessons the expensive way like everyone else.)

So the fork is honest and open. To get past the champion you need either a bigger student (Gemma-4-31B, which wants more memory than one Mac has, which means renting a GPU and giving up the "runs on my laptop" purity) or reinforcement learning beyond plain distillation, which only becomes viable once the student is strong enough to occasionally succeed on its own. Nothing's running. That choice is the next real decision, and it's a strategic one, not a technical one.

The moral

I didn't just train a model. I spent half the time discovering that my ruler was bent, and that's the part I'd keep if I had to throw the rest away.

The cost to learn all of this for certain was about thirty-six hours of overnight GPU runs and a great deal of bug-hunting. What I have to show for it is a working, honest, base-beating local coding adapter, and something rarer: certainty about where the wall is and proof that I'm standing against it rather than guessing.

Most AI results are a number with no error bar and no audit of the apparatus that produced it. The number goes up, everyone cheers, nobody asks whether the instrument was lying. I asked. For a while, mine was.

Straighten the ruler first. Everything you measure after that is just measuring the bend.

I trained my own coding model. My ruler was bent.

The setup

RED to GREEN, or it didn't happen

The same adapter, from dead to alive

The champion: 209 rows

Is the wall the data, or the model?

Even my teacher cheated

What I'm not doing yet

The moral

Written by Dylan Moore

Posts you may like

The Shame Was Load-Bearing

the paperclip generation

Knowledge Is Not Moral. Action Is.