Formalizing a Time-Optimal Self-Stabilizing Majority Protocol, End to End

2026-06-05

This is a technical report on a complete Lean 4 formalization of the main theorem of Time- and Space-Optimal Silent Self-Stabilizing Exact Majority in Population Protocols by Haruki Kanaya, Ryota Eguchi, Taisho Sasada, Fukuhito Ooshita, and Michiko Inoue (arXiv:2503.17652): a population protocol that solves exact majority from any initial configuration, stabilizing in expected parallel time on the order of the population size. The formalization lives in the public repository SSExactMajority, builds with zero sorry and no custom axioms, and was driven to completion by a pipeline of AI proof agents under human direction. The final pair of theorems passed an independent adversarial audit.

The report has two halves. The first explains what was proved. The second — probably the more useful half — explains what it took, because almost none of the difficulty was where one would expect.

The problem

A population protocol models a crowd of $n$ identical, anonymous finite-state agents. A scheduler repeatedly picks two agents uniformly at random; they interact and update their states by a fixed rule. Parallel time is the number of interactions divided by $n$ — intuitively, the number of rounds in which each agent participates a constant number of times on average.

In exact majority, every agent starts with an opinion, A or B, and the population must converge to the true majority opinion — exactly, not approximately, even when the margin is a single vote. The self-stabilizing version is much harsher: the protocol must work from an arbitrary initial configuration, including adversarially corrupted internal states, not just from clean inputs. Silent means the population eventually stops changing state altogether.

Kanaya, Eguchi, Sasada, Ooshita, and Inoue gave a silent self-stabilizing exact-majority protocol that stabilizes in $O(n)$ expected parallel time using $O(n)$ states per agent, and proved both bounds optimal for silent protocols. Their construction builds on the time-optimal self-stabilizing ranking machinery of Janna Burman, Ho-Lin Chen, Hsueh-Ping Chen, David Doty, Thomas Nowak, Eric Severson, and Chuan Xu (Time-Optimal Self-Stabilizing Leader Election in Population Protocols, PODC 2021, arXiv:1907.06068), which assigns the agents distinct ranks $1$ through $n$; the majority protocol then runs a timed sequence of phases — ranking, comparison, a median-based decision, an answer epidemic, and a reset mechanism that restarts everything when an inconsistency is detected.

What was proved

The formalization produces two theorems about one protocol instance — the transition function protocolPEM n 1 Rmax (rankDeltaOSSR Rmax Emax Dmax), i.e. the paper’s protocol with constant timer-rank parameter and a reset counter bounded by $R_{\max}$:

Correctness. P_EM_solves_SSEM_log_trank1: for every $n \ge 4$, every $D_{\max} > 1$, and every $R_{\max} \ge 2\lceil\log_2 n\rceil + 2$, the protocol solves silent self-stabilizing exact majority. This theorem is unconditional — no cited lemmas, no axioms beyond Lean’s three classical ones (propext, Classical.choice, Quot.sound). The paper’s choice $R_{\max} = 60\log n$ satisfies the hypothesis with room to spare.

Complexity. PEM_expectedParallelTime_On_faithful_log: on the same instance, the expected parallel time to consensus is bounded by an explicitly quadratic interaction window divided by $n$ — that is, $O(n)$ parallel time — with a constant success probability per renewal window. This theorem is conditional on a small set of explicitly named premises: the phase-window bounds inherited from the ranking protocol of Burman, Chen, Chen, Doty, Nowak, Severson, and Xu, the standard one-rumor epidemic-speed fact, and one numeric tail bound whose satisfiability is itself proved in Lean. Citing the ranking analysis rather than re-proving it was a deliberate scope decision, mirroring how the paper itself builds on the PODC 2021 result.

The whole repository — roughly 3,300 compilation jobs — builds green, and a fresh #print axioms on both theorems reports exactly the three standard axioms.

What it actually took

Compilation is not faithfulness

The first hard lesson is epistemological. A conditional theorem with an unsatisfiable hypothesis compiles perfectly and proves nothing; #print axioms cannot see this. Twice during the project, an audit found a hypothesis that was false or unreachable for the very configurations the development itself constructs — once an invariant claimed for states that a counterexample refuted, once a “fresh reset” target requiring every agent’s delay timer to sit at its maximum simultaneously, which under a random scheduler is a synchronization coincidence rather than an event with constant probability.

The remedy was procedural, not technical: every milestone was reviewed by independent auditors that did not write the proofs — one reviewer working purely inside Lean (hunting unsatisfiable premises, degenerate definitions, hidden axioms), one reviewing against the paper (hunting unfaithful statements). Both found real, distinct bugs that the authors of the proofs — human and machine alike — had stared past. Independence is load-bearing: the author of a statement is structurally the wrong person to ask whether the statement says what it should.

Faithful citation is a design problem

The complexity proof cites external facts, and stating a citation faithfully turned out to be genuinely delicate. Three examples.

First, the epidemic premise. The standard rumor-spreading fact holds for an idealized epidemic, but in the real protocol a dormant agent can wake up mid-epidemic and disrupt it. The faithful premise is not “the epidemic reaches everyone” but “the epidemic reaches everyone or someone wakes” — and the proof then subtracts the wake probability, which the formalization bounds by an explicit binomial tail

$$ n\binom{K}{d}\left(\frac{2}{n}\right)^{d} \;\le\; n\,2^{-d} \quad\text{whenever}\quad 11K \le nd, $$

proved in Lean, where $K$ is the window length and $d$ the remaining delay budget. This is exactly the regime of an $O(n\log n)$-interaction epidemic against a $\Theta(\log n)$ delay budget, so the premises are jointly satisfiable — the non-vacuity is itself a theorem, not a hope.

Second, the reset premise. An early version cited the ranking paper for an endpoint it does not deliver (the completed answer epidemic). The repaired contract cites it only for what it actually provides — a freshly reset seed — and proves the rest locally.

Third, the wake-budget weakening. The drain argument originally demanded all delay timers exactly at maximum; the faithful version charges only dormant agents and asks for a budget $d \le D_{\max}$, which the protocol’s own mechanics deliver: agents with reset fuel never decrement their timers, and the moment fuel runs out the timer is re-initialized. Reading the transition function closely repeatedly turned “too strong to be reachable” into “exactly what the mechanism guarantees.”

The parameter battle: proof artifacts versus the theorem

The most instructive episode. An earlier development proved correctness with a deterministic reset argument: one seed agent carries enough reset fuel to convert every other agent single-handedly, which forces $R_{\max} \ge n$. The paper, however, sets $R_{\max} = 60\log n$. An independent audit flagged the mismatch, and the obvious question arose: is the logarithmic reset counter actually reachable by formal proof, or does one settle for the $R_{\max} = n$ variant?

The resolution is the part worth remembering. The fuel-dominance argument was a proof artifact, not a property of the protocol. The actual reset rule shares fuel: when a fueled agent recruits a partner, both end up with the donor’s fuel minus one. So recruiters double each generation, and a seed with $\lceil\log_2 n\rceil + 2$ units of fuel converts the whole population through a balanced binary tree — deterministically. A subsequent tournament merges duplicate leaders, mutual pairings drain residual fuel, and — because the dormancy transition re-initializes timers at the moment fuel hits zero — the construction lands in an exactly fresh configuration. Every “$n \le$” hypothesis in the correctness chain fell the same way: each was an artifact of one particular constructed schedule, never of the statement. The campaign’s running tally is stark: across hundreds of lemmas, not a single blocker turned out to be mathematical. All of it was interface archaeology — hard-coded parameters, over-strong endpoint definitions, coupled variables that should have been independent.

The last layer was exactly that: the protocol’s timer-rank parameter had been syntactically coupled to $R_{\max}$ in thousands of statements. Decoupling required a $\tau$-generic clone of an entire post-reset machinery layer — about 230 lemmas — after which the final correctness theorem specializes to the constant-parameter instance in one line, the same instance the complexity theorem speaks about. The two theorems agree on the protocol definitionally, not “up to renaming.”

The pipeline

The proofs were produced by a relay of AI systems under a human director: one model family did the bulk generalization and probabilistic window proofs, another picked up each time a quota or session limit cut the first off, and the director enforced the rules that made the output trustworthy — no sorry, no new axioms, no weakening of statements, every “completed” claim re-verified from rebuilt artifacts, every milestone committed with an audit trail, and independent review before anything was called done. When the shared build servers saturated, the entire compile loop moved to a desktop Apple M4 mini, whose single-core speed turned out to outrun a loaded 32-core server by a wide margin for Lean elaboration. The handoffs were not graceful degradation; they were the plan. Roughly: one agent’s unfinished file, plus an honest note of exactly where it stopped, is a better starting point than any summary of it.

Lessons

A green build certifies reasoning, not meaning. The gap between “compiles” and “says the right thing” is where formalization projects actually live.
Independent audit is not bureaucracy. Authors — human or model — share blind spots with their proofs. Two independent reviewers with different vantage points caught every serious defect in this project.
When a true statement won’t prove, look for the artifact. Hard-coded constants, coupled parameters, and over-strong intermediate contracts masquerade as mathematical difficulty. Here, every single one dissolved on inspection.
Cite exactly what the source delivers. A premise slightly stronger than the cited theorem is a silent unsoundness; the faithful form usually costs one extra disjunct and one subtraction.

The repository is public, and the two theorems — with their complete hypothesis lists — are in Convergence/LogTrank1.lean and UpperBound/Time/LogWindows.lean.

博学之，审问之，慎思之，明辨之，笃行之。——《礼记·中庸》

Learn broadly; question closely; think carefully; discern clearly; practice earnestly. — Doctrine of the Mean