One AI, four board sizes: feature normalization, a 10×10 padded CNN, and entropy-timed pacing

The Jelmata AI runs on phones with no GPU. A single linear model serves four board sizes, a distilled CNN handles the top tier with a switching-linear safety net for web builds, and a one-line entropy trick makes a sub-millisecond move feel like a considered one.

Jelmata is a small two-player board game with one mechanic that makes its AI unusual: your score is the product of your connected component sizes, not the sum. Merging two groups of 3 doesn’t give you 6 points — it multiplies them into a group of 6, which scores 6 instead of 9. That single rule changes what the AI should care about on every move, and it rules out the off-the-shelf feature vocabulary that works fine for sum-based games like Cell Division.

This post is about three Jelmata-specific engineering decisions I don’t see discussed much in indie-game-AI writeups: feature normalization that lets one linear model serve four board sizes, a fixed 10×10 padded CNN that generalizes across board sizes without retraining, and softmax entropy as a pacing signal so a sub-millisecond move selector feels like a considered opponent. The companion post on Cell Division’s model stack is at Four difficulty tiers, two model families; this one is about the parts of Jelmata that are genuinely different.

The feature vocabulary multiplicative scoring forces

If you score moves in Jelmata by looking only at “how many cells do I own,” you get a model that can’t distinguish a good move from a disastrous one. Merging 3 + 3 → 6 looks neutral by cell count and is catastrophic by score (9 → 6, a 33% cut). The feature vocabulary has to encode group topology, not just territory.
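To make the merge penalty concrete, here is a minimal sketch of product scoring via flood fill over orthogonally connected components. This is not the shipping engine's code; the `Cell` type and `productScore` name are illustrative.

```typescript
// Illustrative sketch: a player's score is the PRODUCT of the sizes of
// their orthogonally connected components, not the sum.
type Cell = 0 | 1 | 2; // 0 = empty, 1 = player one, 2 = player two

function productScore(board: Cell[][], player: Cell): number {
  const rows = board.length, cols = board[0].length;
  const seen = board.map(row => row.map(() => false));
  const deltas: [number, number][] = [[1, 0], [-1, 0], [0, 1], [0, -1]];
  let score = 1;
  let hasGroup = false;
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      if (board[r][c] !== player || seen[r][c]) continue;
      // flood-fill one connected component and measure its size
      let size = 0;
      const stack: [number, number][] = [[r, c]];
      seen[r][c] = true;
      while (stack.length) {
        const [y, x] = stack.pop()!;
        size++;
        for (const [dy, dx] of deltas) {
          const ny = y + dy, nx = x + dx;
          if (ny >= 0 && ny < rows && nx >= 0 && nx < cols &&
              board[ny][nx] === player && !seen[ny][nx]) {
            seen[ny][nx] = true;
            stack.push([ny, nx]);
          }
        }
      }
      score *= size; // multiplicative, so merging 3 and 3 into 6 LOSES points
      hasGroup = true;
    }
  }
  return hasGroup ? score : 0;
}
```

Two separate groups of 3 score 3 × 3 = 9; merge them into one group of 6 and the score drops to 6.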

The shipping model (src/engine/ai/features.ts, mirrored in ai/src/ai/features.py) uses 14 features per candidate cell:

  • Score deltas, log-scaled (2): log_score_delta_ai, log_score_delta_opp. The score is a product, so the natural unit is log-space — a jump from 12 to 24 should matter the same as 6 to 12. Linear features on raw product-scores would be dominated by the handful of moves that happen to hit the biggest group in the position.
  • Cluster topology (2): cluster_size_created (size of the component the new cell joins or creates) and cluster_count_delta (positive means splitting your territory, negative means merging). The biggest single weight in the hand-tuned Hard model is a −6.0 on cluster_size_created — Hard’s entire strategy is “never grow a group above 3” encoded as one number.
  • Breathing room (2): openness (empty orthogonal neighbors), openness_8 (all eight). Both matter because some positions let diagonals become relevant through later moves.
  • Contact (2): ai_ortho_neighbors, opp_ortho_neighbors. Friendly and enemy cells already touching the target.
  • Distance (2): nearest_ai, nearest_opp. Manhattan distances — lets the model reason about reaching its own territory without a full-board BFS.
  • Global shape (2): components_diff (AI clusters minus opponent clusters, normalized by total clusters) and second_order_openness (empty cells at Manhattan distance exactly 2 — where the AI could expand next turn).
  • Edge awareness (2): nearest_invalid (distance to the nearest board edge or blocked square) and invalid_neighbor_count (how many of the eight neighbors are off-board).

Scoring a move is then a 14-dim dot product. On a mid-game 6×6 position with ~20 legal moves, that’s about 280 multiply-adds per turn. Sub-millisecond on any phone.
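The selection step can be sketched as a plain argmax over dot products. The names here are illustrative, not the shipping API; in the real engine each candidate's 14 features come from features.ts.

```typescript
// Illustrative sketch: score every candidate cell with one dot product
// against the trained weight vector, then pick the best.
function dot(w: number[], f: number[]): number {
  let s = 0;
  for (let i = 0; i < w.length; i++) s += w[i] * f[i];
  return s;
}

function pickMove(
  weights: number[], // the 14 trained weights
  candidates: { cell: number; features: number[] }[],
): number {
  let best = candidates[0].cell;
  let bestScore = -Infinity;
  for (const c of candidates) {
    const s = dot(weights, c.features); // 14 multiply-adds per candidate
    if (s > bestScore) { bestScore = s; best = c.cell; }
  }
  return best;
}
```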

The one structural decision that makes this portable: normalization

Every one of those features is normalized to be board-size-independent. nearest_ai is divided by the board diagonal. ai_ortho_neighbors is divided by 4, because there are always at most four orthogonal neighbors regardless of board size. cluster_count_delta and components_diff are already ratios or small integers.
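One plausible reading of that normalization, sketched with hypothetical helper names (the real definitions live in features.ts; "board diagonal" is taken here as the maximum Manhattan distance on a size×size board):

```typescript
// Illustrative sketch of board-size-independent feature normalization.
function normalizedDistance(manhattan: number, size: number): number {
  const diagonal = 2 * (size - 1); // max Manhattan distance on a size×size board
  return manhattan / diagonal;     // same range on 5×5 and 8×8
}

function normalizedOrthoNeighbors(count: number): number {
  return count / 4; // at most four orthogonal neighbors on any board
}
```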

The payoff is that the same 14-weight vector trained on 6×6 plays reasonably on 5×5, 7×7, and 8×8 without ever being shown those sizes in training. Jelmata ships one set of weights for all four sizes, not four sets. When I later re-trained the CNN teacher on cross-size self-play, the linear student didn’t need to change at all — the feature space had already done that work.

This is the highest-leverage design decision in the whole AI stack, and it’s almost invisible. If I’d normalized features by the board size at read time instead of baking normalization into the feature definitions, I’d have spent a month re-tuning and shipping weight tables per size.

PPO on 14 parameters converges in minutes

The training loop is a few hundred lines of PyTorch in ai/. Standard PPO self-play: a copy of the policy plays itself, moves during rollout are sampled from a softmax over the feature scores so the policy actually explores, and at the end of each game every move gets a reward based on whether that player won. PPO’s clipped objective nudges the weight vector toward moves that led to wins.
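For reference, the clipped surrogate at the heart of PPO is small enough to state inline. This is a sketch of the per-move objective only, with illustrative names; the real trainer is the PyTorch loop in ai/.

```typescript
// Sketch of PPO's clipped objective for one (move, advantage) pair:
// ratio = π_new(a|s) / π_old(a|s), clipped so one update can't move
// the policy too far in either direction.
function ppoClippedObjective(
  newProb: number,
  oldProb: number,
  advantage: number,
  epsilon = 0.2, // illustrative clip range
): number {
  const ratio = newProb / oldProb;
  const clipped = Math.min(Math.max(ratio, 1 - epsilon), 1 + epsilon);
  // take the pessimistic (minimum) of the two surrogate terms
  return Math.min(ratio * advantage, clipped * advantage);
}
```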

With only 14 parameters, training converges absurdly fast:

  • A few thousand self-play games and the weights stop moving.
  • No GPU required. A laptop runs the whole loop in roughly the time it takes to reload it a couple of times.
  • No data pipeline. The game engine is the data. There’s nothing to label, nothing to scrape, nothing to version.
  • No instability. PPO’s clipping is load-bearing on deep networks where a single bad update can catastrophically diverge. On 14 parameters it’s almost overkill. The loop just… works.

The final weights export as a plain JSON file and are embedded into the TypeScript runtime as a constant. The JavaScript running on the player’s phone evaluates the exact same math the Python trainer did.

The CNN teacher and the 10×10 padded student

Elite goes beyond linear. The shipping Elite engine on iOS and Android is a small convolutional policy network — three input channels (AI pieces, opponent pieces, legal-move mask), followed by a handful of residual blocks, followed by a 100-way policy head. Runs through onnxruntime-react-native in a few milliseconds per move. See Shipping a distilled CNN to Expo via onnxruntime-react-native for the plumbing.

The two details specific to Jelmata are about board sizes:

  1. The input tensor is always 10×10, even on a 5×5 board. Smaller boards are centered in the 10×10 grid and the unused border cells are marked as illegal in the legal-move mask. The CNN sees every position in the same coordinate system, regardless of board size. The policy head’s 100 output logits correspond to the 100 cells of the 10×10 grid; moves on illegal cells are masked off before the argmax.

  2. Training data mixes all four supported sizes. Every game the AlphaZero-style teacher plays is on a board size chosen at random from the four supported sizes (5×5 through 8×8), rendered into the same 10×10 padded representation. The distilled student inherits a single set of weights that handles every size the app ships.
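The centering step can be sketched in a few lines. The names and single-channel layout here are assumptions for illustration; the real preprocessing feeds three channels into onnxruntime.

```typescript
// Illustrative sketch: center a size×size board inside the fixed 10×10
// input grid; every padded border cell stays illegal in the move mask.
const GRID = 10;

function padToGrid(board: number[][], legal: boolean[][]): {
  pieces: number[][]; mask: number[][];
} {
  const size = board.length;
  const off = Math.floor((GRID - size) / 2); // centering offset
  const pieces = Array.from({ length: GRID }, () => Array(GRID).fill(0));
  const mask = Array.from({ length: GRID }, () => Array(GRID).fill(0));
  for (let r = 0; r < size; r++) {
    for (let c = 0; c < size; c++) {
      pieces[r + off][c + off] = board[r][c];
      mask[r + off][c + off] = legal[r][c] ? 1 : 0;
    }
  }
  return { pieces, mask };
}
```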

This is the CNN analogue of the linear model’s feature-normalization trick: instead of shipping one model per size, shipping one padded model and letting masking handle the rest.

The switching-linear safety net

Mobile ONNX is unreliable in small, non-catastrophic ways. The runtime might fail to initialize after an OS update. The bundled model might be missing from a malformed install. The web build — Jelmata is also playable in the browser — doesn’t have a native ONNX runtime at all.

Every call to the Elite CNN is wrapped in a try/catch that falls through to a 28-weight switching-linear model: two 14-weight vectors, one tuned for the opening and one for the endgame, blended by game_progress. The switching-linear is slightly weaker than the CNN, but:

  • It runs in pure JavaScript. No native dependencies.
  • It runs in the browser. Web builds get a real Elite opponent, not a stub.
  • It’s deterministic. If the CNN is unavailable for any reason on any device, the game continues with the same difficulty selector and no error surface.
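The blend itself fits in a few lines. Linear interpolation between the two weight vectors is an assumption about how game_progress is used; the names are illustrative.

```typescript
// Illustrative sketch of the switching-linear fallback: two weight
// vectors blended by game progress (linear blend assumed here).
function blendedScore(
  wOpening: number[],
  wEndgame: number[],
  features: number[],
  gameProgress: number, // 0 at the first move, 1 when the board is full
): number {
  let s = 0;
  for (let i = 0; i < features.length; i++) {
    const w = (1 - gameProgress) * wOpening[i] + gameProgress * wEndgame[i];
    s += w * features[i];
  }
  return s;
}
```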

The cost is one extra weight table in the bundle — a few KB. The alternative is an app that sometimes has Elite and sometimes has “Elite temporarily unavailable, try Hard,” which is a nightmare to support and a terrible experience to explain.

The rule is: if your top-tier AI depends on a native runtime, you need a pure-JS fallback that’s at least as strong as your second-tier AI. Not because it will usually run — it almost never does — but because the day your runtime breaks is not the day you want your top tier to disappear.

Entropy as pacing

The uncomfortable fact about a 280-multiply-add move selector is that it’s done in well under a millisecond. If the AI just plays immediately, every turn feels like the opponent is slapping the board, and the AI feels dumber than it is. The pacing is load-bearing on the perception of the AI’s strength.

The fix is a one-liner that measures the entropy of the softmax over move scores. When one move is clearly best, the distribution is peaked, entropy is low, and the AI plays quickly — as if it saw something obvious. When several moves look roughly equal, entropy is high and the AI takes longer to decide.

```typescript
// after scoring every legal move
const probs = softmax(scores, T);
const H = -probs.reduce((acc, p) => acc + (p > 0 ? p * Math.log(p) : 0), 0);
const thinkingMs = lerp(150, 1800, H / Math.log(probs.length));
await delay(thinkingMs);
```
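The snippet assumes three small helpers. One plausible implementation, not necessarily the shipping one: a temperature softmax with max-subtraction for numerical stability, a clamped lerp, and a promise-based delay.

```typescript
// Assumed helpers for the pacing snippet (illustrative implementations).
function softmax(scores: number[], T = 1): number[] {
  const max = Math.max(...scores); // subtract max for numerical stability
  const exps = scores.map(s => Math.exp((s - max) / T));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

function lerp(a: number, b: number, t: number): number {
  return a + (b - a) * Math.min(Math.max(t, 0), 1); // clamp t to [0, 1]
}

function delay(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```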

The AI doesn’t actually think harder in the high-entropy case — the dot product is the same 280 multiply-adds. But the pacing tells a story that matches the position. Players report Elite “feels like it considers hard positions” and “plays obvious moves quickly.” Both of those are true descriptions of the pacing layer, not the model.

This is a free upgrade to perceived strength, and it costs the player maybe 1.5 extra seconds of wall-clock time per difficult turn. I think it should be a standard trick for any evaluation-based game AI.

What this stack deliberately doesn’t have

Four omissions worth naming, because each is a decision and not a gap:

  • No minimax or alpha-beta. Jelmata’s product scoring makes placements non-local — one cell can merge two groups and drop the score. The branching factor is wide (every empty cell is a candidate), the horizon effect is severe, and a pure evaluator turns out to play a stronger game than shallow search would. Search lives in the teacher’s MCTS at training time, not in the shipping runtime.
  • No opening book. First-move symmetry means an opening book saves microseconds the player would never notice and adds one more artifact to keep in sync.
  • No endgame tablebase. The state space is too large to enumerate and too small to matter. A 14-feature evaluator is already near-perfect when only a handful of squares remain.
  • No online learning. Weights are frozen at ship time. Behavior is reproducible between sessions; a player’s moves don’t leave the device; regressions can’t sneak in between app updates.

Each of these is a thing I could have built that would have made the AI worse along some axis the player can feel — bigger binary, flakier shipping, non-reproducible difficulty, your-move-data-leaves-your-phone. I think the pattern holds across most indie game AI: the things you choose not to build are as load-bearing as the things you do.


The player-facing take, with screenshots and the four difficulty personalities, is at How We Built the Jelmata AI. The ONNX-on-mobile plumbing is at Shipping a distilled CNN to Expo via onnxruntime-react-native. The companion post on Cell Division’s related-but-different stack is Four difficulty tiers, two model families.