Training per-opponent hint policies with CMA-ES (and why we deleted the whole system)
An in-game hint isn't 'what's the best move.' It's 'what move gives the player the highest win rate against this specific opponent from here.' That observation got us into twenty separate CMA-ES training runs. Then the main model got strong enough, and we threw all of it out.
Cell Division’s in-game hint system — the button that tells the player what to play when they’re stuck — has a surprisingly messy history. Today it just calls the Elite CNN. For a while it had its own trained models, one per (board size, difficulty, side) slot, optimized with CMA-ES to beat a specific opponent from a specific distribution of positions. This is the story of why we built that, why we threw it out, and why we deliberately never shipped the version that could have beaten Elite for the player.
The problem hints are actually trying to solve
A hint feels like a small feature, but the underlying question is subtle. A hint isn’t “what’s the objectively best move here.” It’s what move gives the player the highest chance of winning this game, against this specific opponent, from here. Those aren’t the same thing.
- The best move against a tactically sharp Elite AI might be a quiet positional play that denies it the opening it wants.
- Against Medium, the best move is usually the biggest immediate score, because Medium will happily give you the whole interior if you show up to take it.
A strong general-purpose player can get stuck at a local optimum against a particular opponent — one that an opponent-specific policy would beat. That’s the observation that got us into training per-opponent hint models in the first place.
Why CMA-ES, not PPO
The shipping gameplay AI (Hard, legacy Elite) uses PPO self-play. PPO is the right tool when you want a policy that plays a full game well from arbitrary positions. The hint problem is shaped differently: we want a weight vector that beats a specific opponent from a specific distribution of starting positions. That collapses into black-box fitness maximization — pick a weight vector, play 100 games against the target opponent, count wins. No gradients required.
CMA-ES (Covariance Matrix Adaptation Evolution Strategy) is the classical answer to exactly that problem:
- Population of candidate solutions, each evaluated by the fitness function.
- Update a covariance matrix describing which directions in parameter space are currently promising.
- Sample the next population from that adapted covariance.
- Repeat.
It does not need gradients, it tolerates noisy fitness, and it self-tunes the search scale. For a 13-dimensional weight vector against a noisy win-rate objective, that’s exactly the profile you want.
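The loop above can be sketched in a few dozen lines. To keep the sketch self-contained, this re-estimates the covariance from the elite candidates each generation — which is closer to the cross-entropy method than the real CMA-ES update, and skips CMA-ES’s step-size control and evolution paths — but it shows the sample / evaluate / adapt shape. All names and values here are illustrative, not the project’s actual code.

```python
import numpy as np

def evolve(fitness, dim, pop_size=30, generations=80, sigma0=0.5, seed=0):
    """Gradient-free maximization of a noisy black-box fitness function.

    Simplified CMA-ES-style loop: sample a population from a multivariate
    normal, evaluate every candidate, then re-fit the mean and covariance
    to the best candidates so the next population is drawn from the
    currently promising directions in parameter space.
    """
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    cov = np.eye(dim) * sigma0 ** 2
    n_elite = max(2, pop_size // 4)
    for _ in range(generations):
        pop = rng.multivariate_normal(mean, cov, size=pop_size)
        scores = np.array([fitness(x) for x in pop])
        elite = pop[np.argsort(scores)[-n_elite:]]  # best candidates
        mean = elite.mean(axis=0)
        centered = elite - mean
        # Small jitter keeps the covariance positive semi-definite for sampling.
        cov = centered.T @ centered / n_elite + 1e-8 * np.eye(dim)
    return mean

# Toy objective: a smooth stand-in for "win rate", peaked at `target`.
target = np.array([0.3, -0.2, 0.4])
best = evolve(lambda x: -np.sum((x - target) ** 2), dim=3)
```

For a real run you would swap the toy lambda for the games-against-opponent fitness and, ideally, use a maintained implementation (e.g. the pycma library) rather than a hand-rolled loop.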
The config we shipped with
ai/src/training/train_hint.py used:
- Population size 30. Each generation evaluates 30 candidate weight vectors. Enough to get a decent covariance estimate, small enough to fit 80 generations in an overnight run on a laptop.
- 80 generations. 2,400 total fitness evaluations per slot. Fitness stabilizes well before then — the extra generations tighten the search around the mode rather than discover a new one.
- Initial sigma 0.5. Starting spread in weight space. Too small and you never leave the initial basin; too large and early generations are pure noise.
- 100+ games per candidate, with randomized opening moves so the fitness reflects a distribution of positions rather than a single fixed start. Fewer games and the fitness signal drowns in variance — you learn which random seeds won, not which weight vectors are better.
Each trained hint model inherited the same 14-feature architecture as the gameplay AI, so the optimizer was searching over the same ~13-dimensional weight space — just with a different objective. The output was a small dictionary of opponent-specific weight vectors, keyed by (board size, difficulty, side).
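The fitness function itself is conceptually just a win count. Here is a hedged sketch of what one evaluation might look like — `play_game` is a hypothetical callable standing in for the real game engine, and the randomized-opening parameter is illustrative:

```python
import random

def win_rate_fitness(weights, play_game, n_games=100, opening_moves=2, seed=None):
    """Noisy black-box fitness for one candidate weight vector.

    `play_game(weights, rng, opening_moves)` is assumed to play one full game
    against the target opponent, starting with `opening_moves` randomized
    moves so the fitness reflects a distribution of positions, and to return
    True if the candidate won.
    """
    rng = random.Random(seed)
    wins = sum(play_game(weights, rng, opening_moves) for _ in range(n_games))
    return wins / n_games

# Dummy stand-in for illustration: the candidate wins with probability weights[0].
fitness = win_rate_fitness([0.75], lambda w, rng, k: rng.random() < w[0],
                           n_games=1000, seed=0)
```

Each candidate in each generation gets one such evaluation, which is where the 30 candidates × 80 generations × 100+ games budget comes from.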
The “~20 slots × overnight” problem
The first reason we threw it out was combinatorial.
A weight vector trained to beat Medium on a 6×6 board is nothing more than that. Change the board size, change the difficulty, or let the player start second, and you need a different vector. When I finished counting the combinations we wanted to ship — five board sizes × three opponents (Easy/Medium/Hard) × two player sides, minus a few we could skip — we were looking at around twenty separate slots, each with its own overnight CMA-ES run, each of which had to be retrained from scratch any time we tweaked the feature set or the scoring rule.
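The combinatorics are easy to reproduce. A quick sketch — the specific board sizes are hypothetical, since the post only says there were five:

```python
from itertools import product

board_sizes = [5, 6, 7, 8, 9]            # hypothetical values; the game shipped five sizes
difficulties = ["easy", "medium", "hard"]
sides = ["first", "second"]

# 5 x 3 x 2 = 30 combinations before skipping the ones we didn't need --
# each surviving slot meant one overnight CMA-ES run of its own.
slots = list(product(board_sizes, difficulties, sides))
```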
That’s a lot of machinery for a button. The CI on-ramp alone (trigger a retrain whenever feature weights change, gate the commit on a fitness floor, ship the new dictionary) would have been more code than the gameplay AI itself.
The “main model got strong enough” problem
The second reason was about the ceiling.
When gameplay Elite was the 26-weight switching-linear model, there was a real gap between “best move a linear model can find” and “best hint we could train against a specific opponent.” The opponent-specific models genuinely helped — a hint optimized against Hard could find exploitable patterns that the general-purpose Elite missed.
Once we replaced gameplay Elite with the distilled AlphaZero CNN, that gap mostly closed. The CNN isn’t an opponent-specific policy — it’s just stronger in general. And “play the objectively strongest move” turned out to be nearly indistinguishable, as a hint, from “play the move optimized against this specific opponent” for the vast majority of positions real players actually ask for help in.
So the current hint implementation is a few lines in src/engine/ai/engine.ts: call the Elite CNN from the player’s perspective and return its move. The weight dictionary that used to hold trained hint models is now an empty object. One model, zero retraining, fewer moving parts.
Replacing a whole trained subsystem with “just use the main model” is almost always the right call once the main model gets strong enough. The savings compound: less training, less code, less to explain, less to break when the rules change.
I flag this because it’s an easy trap to fall into the other way. An early-stage project builds a specialist model because the generalist isn’t good enough yet. Then the generalist improves — which is the whole point of having a generalist — and nobody goes back to audit whether the specialist is still earning its keep. You end up shipping three separate training pipelines in your mobile app because none of them was obviously deletable individually.
The hint we deliberately didn’t build
There’s a third reason we stopped chasing opponent-specific hint models, and it’s more about product design than engineering.
Elite is meant to be a challenge. That’s the point of the top difficulty. A hint specifically trained to exploit Elite’s weaknesses — a CMA-ES vector against Elite from the human side, with enough games per fitness eval — would find the seams. The player could mash the hint button every turn and grind out a win they never really earned. The game becomes a payment interface wearing a chessboard.
I could build that. I chose not to, and I don’t regret it. Hints should help you notice a strong move you missed — a tactic that’s on the board but invisible to you — not hand you a script that beats an opponent you otherwise couldn’t beat. Victories against Elite should feel like you dragged them out of the game with your own hands.
The rule I ended up with: the hint system is never stronger than the gameplay AI it’s helping you fight. On Easy / Medium / Hard the CNN is overkill, which is fine — those tiers are designed to be beatable and the hint accelerates you past a mistake. On Elite, the hint is exactly as strong as your opponent, which means a hint tells you what Elite would play in your seat. Useful, but not a shortcut.
That’s a design constraint I’d advocate for any single-player game that ships both a strong top tier and a hint button. If you can trivially beat your own top tier by spamming the hint, the tier isn’t really your top tier — it’s the difficulty of your hint model, which is the one thing no player ever sees on the difficulty picker.
What survived
Almost nothing at the code level — the CMA-ES training script still exists in the repo for any future specialist-model work, but the weight dictionary is empty and the runtime path is deleted. What survived is two pieces of load-bearing knowledge:
- When opponent-specific hints genuinely help. They help when the gap between “objectively best” and “best against this specific opponent” is large — which is exactly when your main model is mid-strength. As the main model gets stronger, the gap closes.
- CMA-ES is the right tool for small-parameter black-box fitness work. For a ~13-dim weight vector and noisy win-rate fitness, it’s faster-to-useful than anything gradient-based I tried. I’ll reach for it again — not for hint models, but for the occasional “tune these ten weights against this objective” task that doesn’t warrant a full training loop.
A lot of early AI engineering is load-bearing only until the core model gets better. The right response isn’t to treasure the scaffolding — it’s to build it so deleting it later is cheap.
For the companion posts on the model stack: the four-tier AI covers Easy/Medium/Hard/Elite and why hand-crafted features + CNN coexist; The AlphaZero detour covers how the shipping Elite was actually trained. The player-facing take on why Cell Division’s hint doesn’t beat Elite lives at Training Hint Models (and Why We Stopped) on the game blog.