Solving every level at build time: how 1,532 hand-crafted puzzles stay winnable
A memoized DFS solver runs over every puzzle on each build, precomputes the hint table, then a test simulates playing each one to confirm a 3-star win. Here's how it works — and the seven puzzles it killed.
JellySplit ships with 168 hand-crafted campaign levels and 1,364 daily puzzles. Every one of them was authored by a human (me) in a Python puzzle-gen tool. Hand-authored content means hand-authored bugs: a puzzle you think is solvable in six swaps but isn’t; a “hint” that leads the player into an unreachable dead end; a difficulty rating that’s off by two stars.
You cannot test that by playing. 1,532 puzzles × 2 minutes each = more than a full work week of tapping. If you regenerate the puzzle pool or tweak the scoring engine, you’d have to do it all again.
So I don’t test by playing. I ship a solver with the game.
The core move
Not the player-facing AI. A build-time solver that:
- For every puzzle, proves at least one 3-star solution exists.
- Emits the move sequence as a lookup table, keyed by board state.
- Ships that lookup table inside the app bundle.
At runtime, the “hint” button is table.get(currentBoardKey) — zero milliseconds, zero solver code on the device, zero weights to load. All the expensive work has already happened on my laptop before the App Store build.
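As a sketch (with hypothetical names — the post doesn’t show the shipped table format), the runtime side amounts to building a string key and doing one map lookup:

```typescript
// Hypothetical sketch of the runtime hint lookup. `HintTable`,
// `boardKey`, and `nextHint` are illustrative names, not the shipped
// API; the idea is that the key format matches whatever the
// build-time solver emitted.
type HintTable = Map<string, number>; // packed board key -> next move index

function boardKey(w0: number, w1: number, w2: number, w3: number): string {
  // Same four 32-bit words the solver packs the board into at build time.
  return `${w0}|${w1}|${w2}|${w3}`;
}

function nextHint(
  table: HintTable,
  w0: number, w1: number, w2: number, w3: number,
): number | undefined {
  return table.get(boardKey(w0, w1, w2, w3)); // O(1), no solver on device
}
```

The map ships as data in the bundle; a cache miss is impossible for any reachable state, because the solver enumerated them all.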
The solver doubles as a verifier. If it can’t find a 3-star solution for a puzzle, the CI fails and I know the puzzle is either impossible or requires a higher swap budget than I gave it. Same machinery, two jobs.
It’s not A*
The first instinct for “shortest path to a goal state” is A*. I reached for it; anyone would. It turned out to be the wrong fit.
The search problem isn’t “find the shortest plan to reach state G” — A*’s sweet spot. It’s “find any plan that hits a target score within a move budget”, which is a constraint-satisfaction problem with a fixed, short horizon. What wins that game isn’t a good heuristic-ordered frontier. What wins is memoization and cheap pruning, because the branching factor is high (every swap on a small board) but the distinct reachable states are much fewer than the paths to them.
The actual solver is depth-first search with heavy memoization plus two custom pruners.
// scripts/precompute-hints.ts
function solve(
  remaining: number,                              // move budget left
  score: number,                                  // running total
  w0: number, w1: number, w2: number, w3: number, // the board, packed into four u32 words
): number
Note what’s not in that signature: no board array, no heap, no game state object. The full board is packed into four unsigned 32-bit words. remaining is the move budget; score is the running total. The function returns the best score achievable from this state, packed alongside the winning flag and the next move index.
The memoization key is (remaining, w0, w1, w2, w3). Five ints, join('|') them, you’re done. The result is packed into a single 32-bit integer: score, moves used, winning bit, next-move index — all readable with bit shifts. The memo map holds strings → ints. No object allocations inside the solve loop.
This is the single most important detail for anyone writing performance-bound search: the memo hit rate is only useful if the memo key is cheap to produce. If your key-construction allocates two strings and a nested object, you spend more time building keys than you save on the hits. Pack your state into something you can stringify in one expression.
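A minimal sketch of the packing idea — the field widths below are illustrative assumptions, not the shipped layout, but the mechanics (shift fields in, mask them out) are the point:

```typescript
// Illustrative bit layout, low to high: moveIndex:8 | winning:1 |
// movesUsed:5 | score:16. The real widths depend on JellySplit's
// actual maximums; this is a sketch of the technique, not the code.
function packResult(
  score: number, movesUsed: number, winning: boolean, moveIndex: number,
): number {
  return (score << 14) | (movesUsed << 9) | ((winning ? 1 : 0) << 8) | moveIndex;
}

const unpackScore = (p: number) => p >>> 14;
const unpackMovesUsed = (p: number) => (p >>> 9) & 0x1f;
const unpackWinning = (p: number) => ((p >>> 8) & 1) === 1;
const unpackMoveIndex = (p: number) => p & 0xff;

// The memo: five ints stringified in one expression, one int stored.
const memo = new Map<string, number>();
const memoKey = (remaining: number, w0: number, w1: number, w2: number, w3: number) =>
  `${remaining}|${w0}|${w1}|${w2}|${w3}`;
```

One template literal in, one integer out — no objects allocated on either the hit path or the miss path.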
Pruning
Two heuristic pruners sit inside the solve loop:
- canReachScore(budget, remaining) — a cheap overapproximation of the max score achievable from here. If even the best-case estimate can’t pass the threshold, return immediately.
- canWinWithin(budget, remaining) — the same idea for the “did I hit the win condition” flag.
Both are fast and loose: they may say “yes you could still win” when in fact you can’t, which just means the solver continues and finds that out the slow way. What they must not do is ever say “no you can’t win” when you actually could, because that would miss real solutions and the hint table would be wrong.
The framing I’d offer is:
Good pruning heuristics are fast and loose. Bad pruning heuristics are slow and tight.
An expensive pruner that rejects 60% of branches is almost always worse than a cheap pruner that rejects 40% of branches, because you run it at every node.
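As a concrete example of the fast-and-loose style, a best-case score bound needs only one multiply — `maxSwapScore` here is an assumed precomputed constant (the most any single swap could score on this board), not something from the shipped code:

```typescript
// Sketch of an optimistic pruner. It may say "still winnable" when the
// branch is actually dead (fine: the search keeps going and finds out
// the slow way), but it never rejects a branch that could still reach
// the target -- the bound always overestimates, never underestimates.
function canReachScore(
  target: number,        // score threshold for the win
  score: number,         // running score so far
  remaining: number,     // swaps left in the budget
  maxSwapScore: number,  // assumed precomputed: best possible single-swap score
): boolean {
  return score + remaining * maxSwapScore >= target;
}

// Inside solve(), before recursing on a branch:
//   if (!canReachScore(target, score, remaining, maxSwapScore)) return LOSS;
```

The soundness requirement is one-directional: overestimating costs time, underestimating costs correctness.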
The test (hintQuality.test.ts)
Precomputing the hint table is the gameplay feature. The test is what keeps it honest.
jest.setTimeout(120_000);

describe('daily hint table quality', () => {
  it.each(puzzleCases)(
    '%s: hints achieve 3 stars',
    (_id, index, puzzle) => {
      const state = simulateHints(puzzle, index);
      expect(state.solved).toBe(true);
      expect(state.swapsRemaining).toBeGreaterThanOrEqual(2);
    },
  );
});
simulateHints loops: look up the hint for the current board, apply it, repeat until solved or out of moves. It’s not running the solver — it’s running the table the solver produced. The assertion is the 3-star condition: solved, with at least 2 swaps still in the bank.
This catches two different failure modes with one test:
- Solver regression. If I change the scoring engine or the swap mechanic and forget to re-run hint precomputation, the old table leads to 2-star wins (or losses) on the new engine. CI fails.
- Bad puzzles. If a puzzle is authored such that even the solver’s best plan only gets 2 stars, the test fails on that specific puzzle. I either re-author the puzzle, or relax its difficulty rating, or remove it.
1,532 simulated plays in about 90 seconds. Every PR. Cheap.
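The shape of that loop is simple enough to sketch. `Engine` below is a hypothetical abstraction over the real game logic (which this post doesn’t show), and `simulateHintLoop` is a stand-in with a different signature than the real `simulateHints(puzzle, index)` — only the control flow is the point:

```typescript
// Hypothetical sketch of the simulate loop. `Engine` stands in for
// the real puzzle engine; the table is the precomputed hint table.
interface Engine<S> {
  key(state: S): string;            // packed board key, same format as the table's
  apply(state: S, move: number): S; // one swap
  isSolved(state: S): boolean;
}

function simulateHintLoop<S>(
  engine: Engine<S>,
  table: Map<string, number>,
  start: S,
  swapBudget: number,
): { solved: boolean; swapsRemaining: number } {
  let state = start;
  let swapsRemaining = swapBudget;
  // Look up the hint for the current board, apply it, repeat.
  while (!engine.isSolved(state) && swapsRemaining > 0) {
    const move = table.get(engine.key(state));
    if (move === undefined) break; // hole in the table -> the test fails loudly
    state = engine.apply(state, move);
    swapsRemaining--;
  }
  return { solved: engine.isSolved(state), swapsRemaining };
}
```

Note the explicit `break` on a table miss: an incomplete table surfaces as an unsolved puzzle in CI, not a silent retry.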
The seven puzzles I had to delete
Not every puzzle was tractable for the solver. One daily-puzzle configuration — k=9 allowed swaps, on the densest board — produced a state space the memoization couldn’t fit in RAM. The solver would run for a few minutes, allocate gigabytes, and fall over. Seven daily entries fit that config.
Two options:
- Raise the memory ceiling and stretch the solver — add iterative deepening, disk-backed memoization, or lossy pruning.
- Delete the seven puzzles.
I deleted them. Commit d66be9be:
Remove 7 daily puzzles with k=9 swaps whose state spaces are too large for exhaustive BFS hint computation.
Campaign levels max out at k=6 swaps and remain solvable. No player was going to notice seven missing days from the archive. The engineering cost of a more powerful solver was a week of work; the product cost of losing seven daily puzzles out of 1,364 was approximately zero.
The pattern here is worth naming explicitly:
When your verification breaks, try “make the verifiable set smaller” before “make the verification more powerful.”
It’s the cheapest fix, and for most content problems the skewed 99%-of-players-see-less-than-1%-of-content curve makes it a rounding error.
What the pattern generalizes to
Solver-as-test-oracle works anywhere the content has a property an algorithm can evaluate cheaper than a human. Some places I’ve seen it or would consider it:
- Puzzle solvability and hint correctness (this post).
- Regex match correctness, where a parser generates test strings and you assert the regex classifies them.
- SQL query determinism, where a planner-like oracle verifies queries return stable results across equivalent plans.
- Generated prose fact-checking, where a cheap classifier or a structured-data check flags claims that can’t be grounded in your source corpus.
The common shape: you pay the oracle’s cost once at build time, not N times at inference. 1,532 puzzles × one solver pass = 90 seconds in CI. 1,532 puzzles × one player per day × however many days × however many installs = never. The asymmetry is what makes it worth writing the solver.
If you want to see the end result, the 168 campaign levels and the daily stream are both on JellySplit’s App Store page. The hint button’s lookup is the part you can’t see, which is most of the point — the expensive thinking has already happened.