EM, composite likelihoods, and variational inference

Building on Appendix A, this appendix develops the M-step and variational-inference machinery for TKF91 / TKF92 / MixDom at full generality. First, closed-form GTR substitution M-steps in their many specialisations (JC69, K80, F81, HKY85, GTR, GY94, plus rate-rescaling, tied-equilibria, and mixture-of-GTR variants); then the stochastic variational Baum–Welch (SVI-BW) loop with its convergence theorem (ELBO derivation, natural-gradient updates, minibatch variance bounds), the linearised analysis at stationarity, Maraschino as the cherry-count distillation of TKF92, the tree-level inference algorithms it composes with (FSA, BeamASR, VarAnc, svi-VarAnc), the mixture-of-trees variational ancestral presence/absence inference, and the structural-bias analysis of the BP cumulant under a column-factorised variational field.

B.1 Substitution M-Steps for Specific Models

This appendix specializes the general GTR substitution M-step (Section A.1.8) to specific DNA and codon substitution models. In each case, we state the rate matrix $\revsub $ in terms of the model parameters, derive the closed-form MLE (or MAP) update given the bridge expectations $(W_i, U_{ij}, V_i)$, and note the spectral structure when it simplifies the endpoint-conditioned integrals (A.3).

Throughout, we work with the complete-data log-likelihood (A.1) \[ \ell _2 = \sum _i V'_i \log \eqm _i + \sum _{j>i} (U_{ij} + U_{ji}) \log \exch _{ij} - \sum _{j>i} \exch _{ij}(W_i \eqm _j + W_j \eqm _i) \] where $V'_i = V_i + \sum _{j \neq i} U_{ji}$, under the GTR parameterization $\revsub _{ij} = \exch _{ij} \eqm _j$ with symmetric exchangeability $\exch _{ij} = \exch _{ji}$.

The indel M-step (quadratic (A.7) for $\kappa $, with L’Hôpital limits in Section A.4) is orthogonal and unchanged across all substitution models.

For DNA models ($|\alphabet | = 4$), the bridge-expectation integral $I^{ab}_{ij}(\evoltime )$ (A.2) involves only $4 \times 4 = 16$ matrix entries per endpoint pair, and the $J^{kl}(\evoltime )$ matrix (A.3) is $4 \times 4$, making the endpoint-conditioned expectations cheap to compute.

B.1.1 JC69 (Jukes–Cantor)

Rate matrix. All off-diagonal rates are equal: $\revsub _{ij} = \rho /3$ for $i \neq j$ (total rate out of any state is $\rho $). Equivalently, $\eqm _i = 1/4$ for all $i$ and $\exch _{ij} = 4\rho /3$ for all $i \neq j$.

M-step. By detailed balance, $\exch _{ij} = (U_{ij} + U_{ji}) / (W_i \eqm _j + W_j \eqm _i)$ is the GTR MLE. Under JC69, all exchangeabilities are forced equal, so we pool: \begin {equation} \label {eq:jc69-mstep} \hat {\exch } = \frac {\sum _{j>i}(U_{ij} + U_{ji})} {\sum _{j>i}(W_i \eqm _j + W_j \eqm _i)} = \frac {U_{\bullet }}{3 W_{\bullet }/4} = \frac {4 U_{\bullet }}{3 W_{\bullet }} \end {equation} where $U_\bullet = \sum _{i \neq j} U_{ij}$ is the total substitution count and $W_\bullet = \sum _i W_i$ is the total dwell time. (The denominator uses $\sum _{j>i}(W_i/4 + W_j/4) = 3W_\bullet /4$ since each of the 4 states appears in 3 of the $\binom {4}{2} = 6$ pairs.) The overall rate is then $\hat {\rho } = 3\hat {\exch }/4 = U_\bullet / W_\bullet $, which is simply the total substitution count divided by total dwell time.

Spectral structure. $\symsub = (\rho /3)J - (4\rho /3)\,I$ where $J$ is the $4 \times 4$ all-ones matrix. Eigenvalues: $0$ (once, eigenvector $\mathbf {1}/2$) and $-4\rho /3$ (triply degenerate). Thus $J^{kl}(\evoltime )$ has only two distinct diagonal values ($\evoltime $ and $\evoltime e^{-4\rho \evoltime /3}$) and the off-diagonal $J^{01}$ term $(1 - e^{-4\rho \evoltime /3})/(4\rho /3)$.

B.1.2 K80 (Kimura 2-Parameter)

Rate matrix. Equal equilibrium frequencies $\eqm _i = 1/4$. Transitions (purine$\leftrightarrow $purine, pyrimidine$\leftrightarrow $pyrimidine) occur at rate $\kappa _s$ and transversions at rate $1$ (or vice versa, depending on convention). Labeling the states A, G, C, T and writing $i \sim j$ for a transition pair: \[ \revsub _{ij} = \begin {cases} \kappa _s / 4 & \text {if } i \sim j \text { (transition)} \\ 1/4 & \text {if } i \not \sim j \text { (transversion)} \end {cases} \] with total rate out of any state $(\kappa _s + 2)/4$. The single free parameter beyond rate scaling is $\kappa _s$ (the transition/transversion rate ratio).

M-step. Partition the substitution counts into transitions $U_s = \sum _{i \sim j} U_{ij}$ and transversions $U_v = \sum _{i \not \sim j} U_{ij} = U_\bullet - U_s$. Similarly partition the dwell-weighted opportunities. Under K80, the exchangeability for transition pairs is $\kappa _s$ and for transversion pairs is $1$ (up to a common scale). The MLE for each group, from the pooled GTR formula: \begin {align} \hat {\kappa }_s &= \frac {U_s / N_s}{U_v / N_v} \label {eq:k80-kappa} \end {align}

where $N_s = \sum _{i \sim j, \, i<j} (W_i \eqm _j + W_j \eqm _i) = W_\bullet / 4$ is the opportunity for transitions (2 pairs: A–G and C–T, each contributing $(W_i+W_j)/4$, summing to $W_\bullet /4$) and $N_v = \sum _{i \not \sim j, \, i<j} (W_i \eqm _j + W_j \eqm _i) = W_\bullet / 2$ is the opportunity for transversions (4 pairs, each state appearing in 2 transversion pairs, summing to $2W_\bullet /4 = W_\bullet /2$). Simplifying: \begin {equation} \label {eq:k80-mstep} \hat {\kappa }_s = \frac {2\, U_s}{U_v} \end {equation} Given $\hat {\kappa }_s$, the overall scale $\hat \sigma $ satisfies $\hat \sigma (\hat \kappa _s N_s + N_v) = U_\bullet $, giving overall rate $\hat \rho = \hat \sigma (\hat \kappa _s + 2)/4 = U_\bullet / W_\bullet $.

Spectral structure. Three distinct eigenvalues: $0$, $-(\kappa _s + 1)/2$ (doubly degenerate, within-purine and within-pyrimidine contrasts), and $-1$ (purine-vs-pyrimidine contrast).

B.1.3 F81 (Felsenstein 1981)

Rate matrix. Equal exchangeabilities, variable equilibrium: $\revsub _{ij} = \rho \, \eqm _j$ for $i \neq j$ (equivalently, $\exch _{ij} = \rho $ for all $i \neq j$).

M-step. Since $\exch _{ij} = \rho $ is constant across all pairs, the exchangeability MLE pools all pairs: \begin {equation} \label {eq:f81-rho} \hat {\rho } = \frac {U_\bullet }{\sum _{j > i}(W_i \eqm _j + W_j \eqm _i)} = \frac {U_\bullet }{\sum _i W_i (1 - \eqm _i)} \end {equation} using the identity $\sum _{j>i}(W_i \eqm _j + W_j \eqm _i) = \sum _i W_i \sum _{j \neq i} \eqm _j = \sum _i W_i(1 - \eqm _i)$.

For equilibrium frequencies: \begin {equation} \label {eq:f81-pi} \hat {\eqm }_i = \frac {V'_i}{\sum _j V'_j} \end {equation} This is the standard empirical-frequency estimator (exact MLE when $\exch $ is held fixed; approximate when $\exch $ and $\eqm $ are jointly optimized, as discussed in Section A.1.8).

In practice, iterate: fix $\hat \eqm $, solve for $\hat \rho $ (B.3), update $\hat \eqm $ (B.4), repeat. The decoupling is exact after one iteration when $W_i \propto \eqm _i$ (which holds approximately for long dwell times).

Spectral structure. $\symsub = \rho (\sqrt {\eqm }\sqrt {\eqm }^\top - I)$ where $\sqrt {\eqm }$ is the vector with entries $\sqrt {\eqm _i}$. Eigenvalues: $0$ (once) and $-\rho $ ($|\alphabet |-1$ times, degenerate). The $J^{kl}$ matrix has the same simple structure as JC69.

B.1.4 HKY85 (Hasegawa–Kishino–Yano)

Rate matrix. Variable equilibrium frequencies $\eqm $ and a transition/transversion ratio $\kappa _s$: \[ \revsub _{ij} = \begin {cases} \kappa _s\, \eqm _j & \text {if } i \sim j \text { (transition)} \\ \eqm _j & \text {if } i \not \sim j \text { (transversion)} \end {cases} \] Equivalently, $\exch _{ij} = \kappa _s$ for transition pairs and $\exch _{ij} = 1$ for transversion pairs. This is K80 + F81: the $\kappa _s$ structure of K80 with the variable-$\eqm $ structure of F81.

M-step. Pool exchangeabilities within each class: \begin {align} \hat {\kappa }_s &= \frac {U_s \,/\, N_s}{U_v \,/\, N_v} \label {eq:hky85-kappa} \end {align}

where now (unlike K80) the opportunities depend on $\eqm $: \begin {align} N_s &= \sum _{i \sim j,\, i < j} (W_i \eqm _j + W_j \eqm _i) \label {eq:hky85-Ns} \\ N_v &= \sum _{i \not \sim j,\, i < j} (W_i \eqm _j + W_j \eqm _i) \label {eq:hky85-Nv} \end {align}

With $U_s$ and $U_v$ as in K80 (summed over directed transition and transversion pairs respectively).

For $\eqm $, use $\hat {\eqm }_i = V'_i / \sum _j V'_j$ as in F81, then iterate with $\kappa _s$ if desired.

Spectral structure. Let $\eqm _R = \eqm _A + \eqm _G$ and $\eqm _Y = \eqm _C + \eqm _T$. The eigenvalues are $0$, $-(\eqm _R + \kappa _s \eqm _Y)$ (within-pyrimidine contrast), $-(\eqm _Y + \kappa _s \eqm _R)$ (within-purine contrast), and $-1$ (purine-vs-pyrimidine contrast). Generically four distinct values; when $\eqm _R = \eqm _Y = 1/2$ (i.e. K80), the two within-class eigenvalues merge to $-(\kappa _s + 1)/2$.

B.1.5 GTR (General Time-Reversible)

Rate matrix. $\revsub _{ij} = \exch _{ij}\,\eqm _j$ with $\exch _{ij} = \exch _{ji} \geq 0$ for $i \neq j$. For DNA, $\exch $ has 6 free parameters (upper triangle) and $\eqm $ has 3 free parameters (4 frequencies summing to 1).

M-step. Fixing $\eqm $: \begin {equation} \label {eq:gtr-Q} \hat {\exch }_{ij} = \frac {U_{ij} + U_{ji}}{W_i \eqm _j + W_j \eqm _i} \end {equation} Fixing $\exch $, the $\eqm $ update requires solving a nonlinear system (Section A.1.8). In practice, set $\hat {\eqm }_i \propto V'_i$ and iterate with (B.5).

Spectral structure. $\symsub _{ij} = \exch _{ij}\sqrt {\eqm _i \eqm _j}$ is real symmetric with eigenvalues $0 = \xi ^{(0)} > \xi ^{(1)} \geq \cdots \geq \xi ^{(|\alphabet |-1)}$ (generically all distinct for DNA). No further simplification beyond the general formulas (A.3).

B.1.6 GY94 (Goldman–Yang Codon Model)

Rate matrix. The state space is the 61 sense codons ($|\alphabet | = 61$, excluding stop codons). The instantaneous rate from codon $i$ to codon $j$ is: \[ \revsub _{ij} = \begin {cases} 0 & \text {if $i$ and $j$ differ at more than one position} \\ \eqm _j & \text {if synonymous transversion} \\ \kappa _s\, \eqm _j & \text {if synonymous transition} \\ \omega \, \eqm _j & \text {if nonsynonymous transversion} \\ \omega \, \kappa _s\, \eqm _j & \text {if nonsynonymous transition} \end {cases} \] where $\omega $ is the nonsynonymous/synonymous rate ratio (dN/dS), $\kappa _s$ is the transition/transversion ratio, and $\eqm _j$ is the equilibrium frequency of codon $j$. This is a reversible model with exchangeability \[ \exch _{ij} = \begin {cases} 0 & \text {(multi-nucleotide change)} \\ 1 & \text {(synonymous transversion)} \\ \kappa _s & \text {(synonymous transition)} \\ \omega & \text {(nonsynonymous transversion)} \\ \omega \kappa _s & \text {(nonsynonymous transition)} \end {cases} \]

M-step. Partition the substitution counts into four classes indexed by (synonymous/nonsynonymous) $\times $ (transition/transversion): \begin {align*} U_{S,s} &= \text {total synonymous transition counts} & U_{S,v} &= \text {total synonymous transversion counts} \\ U_{N,s} &= \text {total nonsynonymous transition counts} & U_{N,v} &= \text {total nonsynonymous transversion counts} \end {align*}

and the corresponding dwell-weighted opportunities $N_{S,s}, N_{S,v}, N_{N,s}, N_{N,v}$ defined analogously to (??)–(??) but summing over single-nucleotide codon pairs in each class.

The exchangeabilities are $\exch = 1$ (synonymous transversion), $\kappa _s$ (synonymous transition), $\omega $ (nonsynonymous transversion), $\omega \kappa _s$ (nonsynonymous transition). The complete-data log-likelihood for $(\omega , \kappa _s)$, with $\eqm $ fixed, is: \begin {align*} \ell _2(\omega ,\kappa _s) &= U_{S,v}\log 1 + U_{S,s}\log \kappa _s + U_{N,v}\log \omega + U_{N,s}\log (\omega \kappa _s) \\ &\quad - N_{S,v} - \kappa _s N_{S,s} - \omega N_{N,v} - \omega \kappa _s N_{N,s} \end {align*}

Setting $\partial \ell _2/\partial \kappa _s = 0$ and $\partial \ell _2/\partial \omega = 0$: \begin {align} \frac {U_{S,s} + U_{N,s}}{\kappa _s} &= N_{S,s} + \omega \, N_{N,s} \label {eq:gy94-kappa-eq} \\ \frac {U_{N,v} + U_{N,s}}{\omega } &= N_{N,v} + \kappa _s\, N_{N,s} \label {eq:gy94-omega-eq} \end {align}

These are two equations in two unknowns. Substituting (??) into (??) to eliminate $\omega $: \[ \omega = \frac {U_{N,v} + U_{N,s}}{N_{N,v} + \kappa _s N_{N,s}} \] \[ \frac {U_{S,s} + U_{N,s}}{\kappa _s} = N_{S,s} + \frac {(U_{N,v} + U_{N,s})\, N_{N,s}}{N_{N,v} + \kappa _s N_{N,s}} \] Multiplying both sides by $\kappa _s(N_{N,v} + \kappa _s N_{N,s})$ and rearranging: \begin {equation} \label {eq:gy94-kappa-quad} N_{S,s}\, N_{N,s}\;\kappa _s^2 + \bigl (N_{S,s}\, N_{N,v} + (U_{N,v} - U_{S,s})\,N_{N,s}\bigr )\;\kappa _s - (U_{S,s}+U_{N,s})\, N_{N,v} = 0 \end {equation} The positive root gives $\hat {\kappa }_s$, and then \begin {equation} \label {eq:gy94-omega} \hat {\omega } = \frac {U_{N,v} + U_{N,s}}{N_{N,v} + \hat {\kappa }_s\, N_{N,s}} \end {equation}

The codon equilibrium frequencies $\eqm $ are estimated as $\hat {\eqm }_j \propto V'_j$.

Spectral structure. For 61 sense codons, the symmetrized rate matrix is $61 \times 61$ with at most 61 distinct eigenvalues. No closed-form eigendecomposition is known in general for arbitrary $\eqm $. When $\eqm $ factors as $\eqm _{abc} = \eqm ^{(1)}_a \eqm ^{(2)}_b \eqm ^{(3)}_c$ (the F3x4 model), the symmetrized matrix is a sum of Kronecker products and its eigenvectors factor as tensor products of $4\times 4$ matrices, reducing the eigendecomposition to three $4\times 4$ problems. In practice, for the full GY94, numerical eigendecomposition of the $61 \times 61$ symmetric matrix is inexpensive ($O(61^3) \approx 2 \times 10^5$ flops) and need only be performed once per M-step.

B.1.7 Summary and Practical Notes

Model	Free params	$\eqm $	M-step for exchangeability
JC69	1 ($\rho $)	fixed $1/4$	$\hat \rho = U_\bullet /W_\bullet $
K80	1 ($\kappa _s$)	fixed $1/4$	$\hat \kappa _s = 2U_s/U_v$
F81	4 ($\rho , \eqm $)	$V'_i/\sum V'_j$	$\hat \rho = U_\bullet / \sum _i W_i(1-\eqm _i)$
HKY85	5 ($\kappa _s, \eqm $)	$V'_i/\sum V'_j$	$\hat \kappa _s = (U_s/N_s)/(U_v/N_v)$
GTR	9 ($\exch , \eqm $)	$V'_i/\sum V'_j$	$\hat \exch _{ij} = (U_{ij}+U_{ji})/(W_i\eqm _j+W_j\eqm _i)$
GY94	62 ($\omega ,\kappa _s,\eqm $)	$V'_j/\sum V'_k$	quadratic in $\kappa _s$ (B.6)

For all models above, the indel M-step is identical: solve the $\kappa $ quadratic (A.7) for $(\insrate , \delrate )$. The substitution and indel parameter blocks decouple completely in the M-step (they share no sufficient statistics).

MAP estimates. To obtain MAP estimates, augment the sufficient statistics before applying the MLE formulas: $U_{ij} \to U_{ij} + \alpha _\exch - 1$, $W_i \to W_i + \beta _\exch $, $V_i \to V_i + \alpha _\eqm - 1$, as in Section C.1.4. The independent Gamma$(\alpha _\exch ,\beta _\exch )$ priors on each $\exch _{ij}$ together with a Dirichlet on $\eqm $ are conjugate to the irreversible CTMC complete-data likelihood (independent $\revsub _{ij}$) but not to the reversible GTR parameterization—the symmetric coupling $\exch _{ij} = \exch _{ji}$ together with the requirement that the joint complete-data likelihood include the stationary draw $X(0) \sim \eqm $ means the proper conjugate prior is the cycle-corrected edge-flow density of Diaconis and Rolles (12), with a Kirchhoff matrix-tree determinant accounting for the cycle structure of the state graph. See the discussion in Section C.1.4.0 for the form of the Diaconis–Rolles prior and its modification to incorporate the initial-state contribution.The simpler Gamma${}\times {}$Dirichlet pseudocounts used here are non-conjugate regularizers and remain perfectly valid for MAP, with the closed-form augmented M-step above; we adopt them throughout for simplicity.

Degenerate eigenvalues. Models with symmetry (JC69, K80, F81) have degenerate eigenvalues in $\symsub $, which means the $J^{kl}(\evoltime )$ matrix has repeated diagonal entries. This reduces the number of distinct $J$ values that must be computed, but requires care to use the $\xi ^{(k)} = \xi ^{(l)}$ branch ($J^{kl} = \evoltime \, e^{\xi ^{(k)}\evoltime }$) rather than the difference quotient, which is a removable $0/0$ singularity.

B.1.8 Reversible Mixture with Per-Component GTR

We now consider a $C$-component reversible mixture in which each component is a full GTR model with its own exchangeability matrix $\exch ^{(c)}$ and equilibrium distribution $\eqm ^{(c)}$: \begin {equation} \label {eq:mixture-Q} \revsub ^{(c)}_{ij} = \exch ^{(c)}_{ij}\,\eqm ^{(c)}_j \quad (i \neq j), \qquad c \in \{1,\dots ,C\}. \end {equation} Each site is assigned (latently) to one component, and conditional on $c$, the site evolves under the reversible CTMC with rate matrix $\revsub ^{(c)}$. This is the parameterisation used by the per-(domain, fragment-type) site-class emissions of MixDom (Section C.1.4), where $c$ is the site-class drawn independently at each position from $\classdist _{\dom \frag }$.Because each component is an independent GTR model, the overall rate scale is absorbed into $\exch ^{(c)}$ itself and no separate rate multiplier is needed; this makes the parameterisation a strict generalisation of per-domain GTR (the per-domain GTR case is recovered when $\classdist _{\dom \frag ,c} = \mathbf {1}[c = \dom ]$) and of the discrete-gamma rate-across-sites model (54).

E-step counts. The component assignment is latent. From the standard E-step, each branch produces an endpoint pair $(a,b)$ with marginal posterior weight $w_{ab}$; conditional on $(a,b)$ at a site with prior class distribution $P(c \mid \text {site}) = \classdist _c$, the component posterior is \begin {equation} \label {eq:mix-post} P(c \mid a,b,\evoltime ,\text {site}) = \frac {\classdist _c\,\eqm ^{(c)}_a\,M^{(c)}_{ab}(\evoltime )} {\sum _{c'} \classdist _{c'}\,\eqm ^{(c')}_a\,M^{(c')}_{ab}(\evoltime )}, \end {equation} with $M^{(c)}(\evoltime ) = \exp (\revsub ^{(c)}\evoltime )$. Per-component bridge-expectation sufficient statistics are then accumulated by weighting the standard expressions (??)–(??) by (B.9): \begin {align} V^{(c)}_i &= \textstyle \sum _{\text {starts}} w_a\, P(c\mid a,\cdot )\,\mathbf {1}[X(0)=i] \label {eq:mix-V}\\ U^{(c)}_{ij} &= \textstyle \sum _{\text {branches}} w_{ab}\, P(c\mid a,b,\evoltime ,\cdot )\, \emcount ^U_{ij}\!\bigl (a,b,\evoltime ;\,\revsub ^{(c)}\bigr ) \label {eq:mix-U}\\ W^{(c)}_i &= \textstyle \sum _{\text {branches}} w_{ab}\, P(c\mid a,b,\evoltime ,\cdot )\, \emcount ^W_i\!\bigl (a,b,\evoltime ;\,\revsub ^{(c)}\bigr ) \label {eq:mix-W} \end {align}

and we write $V'^{(c)}_i = V^{(c)}_i + \sum _{j \neq i} U^{(c)}_{ji}$ as before. Writing $U^{(c)}_\bullet = \sum _{i\neq j} U^{(c)}_{ij}$ for the total per-component substitution count.

Complete-data log-likelihood. Because each class is an independent GTR model under this parameterisation, the complete-data log-likelihood decomposes over classes: \begin {equation} \label {eq:mix-loglik} \ell _2(\{\exch ^{(c)}\},\{\eqm ^{(c)}\}) = \sum _c \ell _2^{(c)}, \end {equation} \begin {align} \ell _2^{(c)} &= \sum _i V'^{(c)}_i \log \eqm ^{(c)}_i + \sum _{j>i}\bigl (U^{(c)}_{ij}+U^{(c)}_{ji}\bigr )\log \exch ^{(c)}_{ij} \notag \\ &\quad - \sum _{j>i} \exch ^{(c)}_{ij} \bigl (W^{(c)}_i \eqm ^{(c)}_j + W^{(c)}_j \eqm ^{(c)}_i\bigr ). \label {eq:mix-loglik-c} \end {align}

Identifiability. No gauge to fix: the rate scale of class $c$ lives inside $\exch ^{(c)}$, and the $\{\exch ^{(c)}\}$ are uncoupled in both (B.8) and (??). Every class can be updated independently.

Per-class GTR M-step. Since $\ell _2$ decomposes class by class, the update for class $c$ is exactly the GTR M-step of Section A.1.8 applied to that class’s sufficient statistics. Differentiating (??) w.r.t. $\exch ^{(c)}_{ij}$: \[ \frac {\partial \ell _2^{(c)}}{\partial \exch ^{(c)}_{ij}} = \frac {U^{(c)}_{ij}+U^{(c)}_{ji}}{\exch ^{(c)}_{ij}} - \bigl (W^{(c)}_i \eqm ^{(c)}_j + W^{(c)}_j \eqm ^{(c)}_i\bigr ) = 0, \] giving the closed-form update \begin {equation} \label {eq:mix-S-update} \hat \exch ^{(c)}_{ij} = \frac {U^{(c)}_{ij} + U^{(c)}_{ji}} {W^{(c)}_i \eqm ^{(c)}_j + W^{(c)}_j \eqm ^{(c)}_i}. \end {equation} The equilibrium update uses the empirical-frequency approximation \begin {equation} \label {eq:mix-pi-update} \hat \eqm ^{(c)}_i \propto V'^{(c)}_i, \end {equation} which is exact when $\exch ^{(c)}_{ij} W^{(c)}_j$ is independent of $i$ (F81 limit) and approximate otherwise, identical to Section A.1.8. No alternation or projection is needed; one pass through each class completes the M-step.

Connection to standard models. With $C=1$, (B.11)–(B.12) reduce exactly to the GTR M-step (B.5). With $\classdist _{\dom \frag ,c} = \mathbf {1}[c=\dom ]$ so each domain uses its own dedicated class, the model is the per-domain GTR ($\exch ^{(\dom )}$) special case; the per-class M-step here reduces to the per-domain M-step of Section A.1.8 applied once per domain. With $\eqm ^{(c)} \equiv \eqm $ and $\exch ^{(c)} = \gamma _c\,\exch $ for some shared $\exch $, the model is the discrete-gamma rate model (54); a constrained M-step that enforces $\exch ^{(c)} \propto \exch $ recovers the shared-exchangeability case as a special case. Free-rate variants (46) are the shared-$\exch $ / free-$\gamma _c$ special case.

MAP estimates. Per-class regularising priors apply independently to each class. A Dirichlet pseudocount $\alpha ^{(c)}_i$ is added to each $V^{(c)}_i$, and a Gamma$(a_S, b_S)$ prior on each $\exch ^{(c)}_{ij}$ (e.g. calibrated to LG via $a_S = b_S\,\exch ^{\mathrm {LG}}_{ij}$) adds $a_S-1$ to the numerator and $b_S$ to the denominator of (B.11). As above, this independent-Gamma${}\times {}$Dirichlet structure is conjugate only in the irreversible parameterization; it is a non-conjugate regularizer for the reversible GTR mixture component, with the proper conjugate being a per-class Diaconis–Rolles prior (12). We use the simpler form here for the same reasons as the single-component case (Section C.1.4.0). The MAP objective \begin {equation} \label {eq:mix-map-obj} \mathcal {L} = \ell _2 + \sum _{c,i}(\alpha ^{(c)}_i - 1)\log \eqm ^{(c)}_i + \sum _{c,j>i}\bigl [(a_S - 1)\log \exch ^{(c)}_{ij} - b_S \exch ^{(c)}_{ij}\bigr ] \end {equation} decomposes across classes and has a unique joint maximum at \begin {equation} \label {eq:mix-map-S} \hat \exch ^{(c)}_{ij} = \frac {U^{(c)}_{ij}+U^{(c)}_{ji} + a_S - 1} {W^{(c)}_i \eqm ^{(c)}_j + W^{(c)}_j \eqm ^{(c)}_i + b_S}, \end {equation} with the $\eqm ^{(c)}$ update obtained by the same bisection as Section A.1.8. Each class is a standalone GTR M-step; no cross-class coupling.

Hyperparameter sizing under SVI. Under SVI-BW (B.33) the EMA sufficient statistics are scaled to the full-dataset size $N$, so prior pseudocounts must live on the same scale to retain their relative weight. Parameterising by effective sample size, set $\alpha ^{(c)}_i = N_\pi \,\eqm ^{\mathrm {LG}}_i$ and $b_S = N_S$ with $a_S = N_S\,\exch ^{\mathrm {LG}}_{ij}$. Defaults $N_\pi , N_S \sim 10^2\text {--}10^3$ (data-independent) give priors that vanish in the data limit but identify the posterior in small batches.

B.1.9 Rate Rescaling: M-Step for a Global Scalar Multiplier

In some training scenarios the relative pattern of substitution rates is known (e.g. from a previously-estimated GTR matrix or from an external published matrix such as LG) and only the overall rate needs to adapt to the current data. Concretely, we hold both the exchangeability matrix $\exch $ and the equilibrium distribution $\eqm $ fixed, and seek a single scalar multiplier $\sigma > 0$ such that the rescaled rate matrix \begin {equation} \label {eq:rescale-Q} \revsub ^{\mathrm {new}}_{ij} = \sigma \,\exch _{ij}\,\eqm _j \quad (i \neq j) \end {equation} maximises the complete-data log-likelihood $\ell _2$. Equivalently, the new exchangeability is $\exch ^{\mathrm {new}}_{ij} = \sigma \,\exch _{ij}$, so all relative exchangeabilities are preserved and only the overall tempo of the model changes.

Closed-form $\sigma $. Substituting $\exch _{ij} \to \sigma \exch _{ij}$ into the GTR complete-data log-likelihood, \[ \ell _2(\sigma ) = \mathrm {const} + \log \sigma \;\sum _{j>i}(U_{ij} + U_{ji}) - \sigma \sum _{j>i}\exch _{ij}\bigl (W_i\eqm _j + W_j\eqm _i\bigr ). \] The terms not involving $\sigma $ (the $V'_i\log \eqm _i$ part and $\sum _{j>i}(U_{ij}+U_{ji})\log \exch _{ij}$) are constants under rescaling. Differentiating and setting to zero yields the unique maximiser \begin {equation} \label {eq:rescale-sigma} \hat \sigma = \frac {U_\bullet }{D}, \qquad U_\bullet = \sum _{i\neq j} U_{ij}, \qquad D = \sum _{j>i}\exch _{ij}\bigl (W_i\eqm _j + W_j\eqm _i\bigr ), \end {equation} where $U_\bullet $ is the total expected substitution count and $D$ is the total dwell-weighted opportunity for substitution under the fixed shape $(\exch , \eqm )$. The objective is strictly concave in $\sigma $ on $(0, \infty )$ (the Hessian $-U_\bullet /\sigma ^2 < 0$), so (B.16) is the global maximiser.

Mixture form (per-class). Applied class by class to the reversible mixture of Section B.1.8, each class $c$ has its own scalar $\sigma _c$ updated independently from its own per-class statistics: \begin {equation} \label {eq:rescale-sigma-c} \hat \sigma _c = \frac {U^{(c)}_\bullet }{D^{(c)}}, \qquad D^{(c)} = \sum _{j>i}\exch ^{(c)}_{ij} \bigl (W^{(c)}_i \eqm ^{(c)}_j + W^{(c)}_j \eqm ^{(c)}_i\bigr ). \end {equation} The shapes $\exch ^{(c)}$ and $\eqm ^{(c)}$ are held at their current values and only the per-class rate scale $\sigma _c$ moves; the class-decomposition (B.10) ensures classes do not couple. A common-rate variant (a single shared $\hat \sigma $ across all classes) is recovered by pooling: $\hat \sigma = \sum _c U^{(c)}_\bullet / \sum _c D^{(c)}$. The per-class form (B.17) attains a strictly higher $\ell _2$ unless all $\sigma _c$ already coincide.

Aggregation of counts. The only sufficient statistics required are $U^{(c)}_\bullet = \sum _{i\neq j} U^{(c)}_{ij}$ (a scalar per class) and the dwell–frequency contraction $D^{(c)}$, which is itself a scalar inner product of the fixed quantity $\exch ^{(c)}_{ij}(W^{(c)}_i \eqm ^{(c)}_j + W^{(c)}_j \eqm ^{(c)}_i)$ against the per-pair count. Both are linear in the underlying per-pair $(W, U)$ statistics, so under SVI-BW they accumulate via the same EMA used for the standard M-step (no special handling).

MAP estimate. With a Gamma$(a_\sigma , b_\sigma )$ prior on $\sigma $, the augmented M-step is \begin {equation} \label {eq:rescale-map} \hat \sigma = \frac {U_\bullet + a_\sigma - 1}{D + b_\sigma }. \end {equation} The Gamma prior on the global rate is conjugate (the model is linear in $\sigma $), in contrast to the non-conjugate prior on the full-shape M-step. In the rate-rescaling regime the standard $(N_\pi , N_S)$ pseudocounts on $(\exch , \eqm )$ are inactive (those parameters are frozen); $(a_\sigma , b_\sigma )$ centred at $\sigma = 1$ (e.g. $a_\sigma = b_\sigma = N_\sigma $ for some effective sample size $N_\sigma $) regularises the rate towards the current frozen scale.

When to use. Rate rescaling is a useful intermediate regime between fully-frozen substitution parameters (no M-step) and the full-shape mixture M-step of Section B.1.8. It is particularly natural when warm-starting from a published matrix and one wishes to absorb a calibration drift in the time units of the data without re-estimating the relative chemistry. In an alternating schedule (e.g. alternate $\sigma _c$ updates with tied-$\eqm $ (B.29) updates) the rescaling step contributes a single guaranteed $\ell _2$ ascent per class per iteration, and the alternation does not destabilise the joint objective because each step is an exact maximisation in its own coordinate.

B.1.10 Joint Rate Rescaling and Equilibrium

A natural extension of the rate-rescaling regime (Section B.1.9) frees the equilibrium distribution $\eqm $ jointly with the scale $\sigma $ while keeping the exchangeability shape $\exch $ fixed. This matches the substitution degrees of freedom of a Maraschino-style fit with --freeze-class-S-shape (per-class $\sigma _c$ and $\eqm ^{(c)}$ free, $\exch ^{(c)}$ shape locked at LG), and is the natural SVI–BW counterpart when warm-starting from such a fit: the previous rescaling-rates mode artificially froze $\eqm ^{(c)}$ at the warm-start values, dropping $A-1$ degrees of freedom per class.

Joint complete-data log-likelihood. With $\revsub _{ij} = \sigma \,\exch _{ij}\,\eqm _j$ and $\exch $ frozen, the complete-data log-likelihood as a function of $(\sigma , \eqm )$ is \begin {equation} \label {eq:rescale-pi-ll} \ell _2(\sigma , \eqm ) = \sum _i V'_i \log \eqm _i + U_\bullet \log \sigma - \sigma \sum _{j>i} \exch _{ij}\bigl (W_i \eqm _j + W_j \eqm _i\bigr ) + \mathrm {const}, \end {equation} where the constant collects the frozen $\sum _{j>i}(U_{ij}+U_{ji})\log \exch _{ij}$ term. Using $\exch _{ij} = \exch _{ji}$ and $\exch _{ii} = 0$, the dwell penalty contracts as \[ \sum _{j>i}\exch _{ij}\bigl (W_i\eqm _j + W_j\eqm _i\bigr ) = \sum _b \eqm _b\,r_b, \qquad r_b := \sum _{a\neq b} \exch _{ab} W_a = (\exch W)_b, \] so $r_b$ depends only on the frozen quantities $(\exch , W)$ and is constant under variation of $(\sigma , \eqm )$. Equation (B.19) simplifies to \begin {equation} \label {eq:rescale-pi-ll-simple} \ell _2(\sigma , \eqm ) = \sum _b V'_b \log \eqm _b + U_\bullet \log \sigma - \sigma \,\mathbf {r}^\top \eqm + \mathrm {const}. \end {equation}

Stationary conditions. Lagrangian for $\sum _b \eqm _b = 1$ with multiplier $\lambda $: \[ \mathcal {L}(\sigma ,\eqm ,\lambda ) = \ell _2(\sigma ,\eqm ) - \lambda \Bigl (\sum _b \eqm _b - 1\Bigr ). \] The first-order conditions are \begin {align} \frac {\partial \mathcal {L}}{\partial \sigma } &= \frac {U_\bullet }{\sigma } - \mathbf {r}^\top \eqm = 0 \;\Longrightarrow \; \sigma = \frac {U_\bullet }{\mathbf {r}^\top \eqm }, \label {eq:rescale-pi-foc-sigma}\\[2pt] \frac {\partial \mathcal {L}}{\partial \eqm _b} &= \frac {V'_b}{\eqm _b} - \sigma r_b - \lambda = 0 \;\Longrightarrow \; \eqm _b = \frac {V'_b}{\lambda + \sigma r_b}. \label {eq:rescale-pi-foc-pi} \end {align}

The Lagrange multiplier is fixed. A clean structural property obtains by multiplying (??) through by $\eqm _b$ and summing: \[ \sum _b V'_b = \lambda \sum _b \eqm _b + \sigma \, \mathbf {r}^\top \eqm = \lambda + \sigma \,\mathbf {r}^\top \eqm . \] The $\sigma $-FOC (??) sets $\sigma \,\mathbf {r}^\top \eqm = U_\bullet $, so \[ N \;=\; \lambda + U_\bullet , \qquad N := \sum _b V'_b = V_\bullet + U_\bullet , \] which gives \begin {equation} \label {eq:rescale-pi-lambda} \boxed {\;\lambda \;=\; V_\bullet ,\;} \end {equation} independent of $\sigma $: the multiplier is fully determined by the boundary count $V_\bullet = \sum _i V_i$.

Closed-form $\eqm (\sigma )$ and 1-D root in $\sigma $. Substituting (B.21) into (??), \begin {equation} \label {eq:rescale-pi-pi-of-sigma} \hat \eqm _b(\sigma ) = \frac {V'_b}{V_\bullet + \sigma \, r_b}. \end {equation} Note $\sum _b \hat \eqm _b(\sigma ) = 1$ at $\sigma $ satisfying the $\sigma $-FOC, automatically. Plugging back into the $\sigma $-FOC (??), $\sigma $ satisfies the single one-dimensional equation \begin {equation} \label {eq:rescale-pi-sigma-eq} g(\sigma ) := \sum _b \frac {\sigma \, r_b\, V'_b}{V_\bullet + \sigma \, r_b} \;-\; U_\bullet \;=\; 0, \qquad \sigma > 0. \end {equation}

Existence and uniqueness. $g$ is well-defined and smooth for $\sigma > 0$ (denominators are positive when $V_\bullet > 0$ or $r_b > 0$). Boundary behaviour: \[ g(0) = -U_\bullet \le 0, \qquad g(\sigma ) \xrightarrow [\sigma \to \infty ]{} \sum _b V'_b - U_\bullet = V_\bullet \ge 0, \] and the derivative is strictly positive, \[ g'(\sigma ) = \sum _b \frac {r_b V'_b\, V_\bullet }{(V_\bullet + \sigma r_b)^2} > 0 \quad \text {whenever some } r_b V'_b > 0. \] Hence $g$ is strictly increasing and crosses zero exactly once at a unique $\hat \sigma > 0$ (provided $V_\bullet > 0$ and $U_\bullet > 0$; otherwise the boundary $\hat \sigma = 0$ or $\hat \sigma \to \infty $ is the maximiser and the regime is degenerate). The joint stationary point $(\hat \sigma , \hat \eqm )$ is thus globally unique on the simplex.

Numerical solution: 1-D Newton. $g$ is monotone with a closed-form derivative, so a few Newton steps on $\log \sigma $ (or bisection in $\sigma $) starting from the warm $\sigma _{\mathrm {old}}$ converges to machine precision in $\le 10$ iterations. Once $\hat \sigma $ is found, $\hat \eqm $ follows from (B.22) .

Equivalent: coordinate ascent. Both individual updates (??) and (??) are exact closed-form maximisations of $\ell _2$ in their own coordinate (for fixed value of the other), so alternating \begin {align*} \sigma &\leftarrow U_\bullet \,/\, (\mathbf {r}^\top \eqm ),\\ \eqm _b &\leftarrow V'_b \,/\, (V_\bullet + \sigma \, r_b) \quad (\text {automatically normalised once $\lambda = V_\bullet $ holds}) \end {align*}

gives a monotone $\ell _2$ ascent and converges to the unique fixed point identified above. In code, $\sim 5$ alternation rounds typically suffice; the 1-D Newton on (B.23) is slightly faster and gives the same answer.

Mixture form (per-class). Applied class by class to the reversible mixture of Section B.1.8, each class $c$ has its own pair $(\sigma _c, \eqm ^{(c)})$ updated independently from its own per-class sufficient statistics $(W^{(c)}, U^{(c)}, V^{(c)})$ and its own fixed $\exch ^{(c)}$ shape: \begin {equation} \label {eq:rescale-pi-class} \hat \sigma _c\;\text {solves}\; \sum _b \frac {\hat \sigma _c\, r^{(c)}_b\, V'^{(c)}_b} {V_\bullet ^{(c)} + \hat \sigma _c\, r^{(c)}_b} = U_\bullet ^{(c)}, \qquad \hat \eqm ^{(c)}_b = \frac {V'^{(c)}_b}{V_\bullet ^{(c)} + \hat \sigma _c\, r^{(c)}_b}, \end {equation} with $r^{(c)}_b = \sum _{a\neq b}\exch ^{(c)}_{ab} W^{(c)}_a$. Classes do not couple, by the same class-decomposition (B.10) that justifies per-class M-steps in the standard mixture regime.

Aggregation of counts. The required sufficient statistics per class are $(U_\bullet ^{(c)}, V_\bullet ^{(c)}, V'^{(c)}, r^{(c)})$. All are linear contractions of the underlying per-pair $(W, U, V)$ statistics against the fixed quantities $(\exch ^{(c)}, V_\bullet ^{(c)} = $ count of $V$ entries$)$, so under SVI–BW they accumulate via the same EMA used for the standard M-step (no special handling).

Reduction to pure rescaling. Setting $\eqm = \eqm _{\mathrm {old}}$ in (??) recovers the closed form $\hat \sigma = U_\bullet / D$ of (B.16). The joint mode strictly dominates pure rescaling whenever $\hat \eqm \neq \eqm _{\mathrm {old}}$ at the joint optimum (i.e. whenever the Maraschino-fit warm $\eqm $ is not exactly the M-step optimum given the current $W, U, V$); in practice the gain is largest immediately after a warm start that fixed $\eqm $ away from its joint optimum.

When to use. Use the joint $(\sigma , \eqm )$ rescaling when warm-starting from a fit that already learnt per-class $(\sigma _c, \eqm ^{(c)})$ but the SVI–BW M-step should keep updating both with new evidence. Use the pure $\sigma $-only rescaling when the warm $\eqm ^{(c)}$ should be treated as a strong prior or when one wishes to attribute any substitution-rate drift purely to the time scale.

B.1.11 Tied Equilibria across Class Blocks

A second restricted regime ties the equilibrium distribution $\eqm ^{(c)}$ across blocks of site classes while leaving exchangeabilities $\exch ^{(c)}$ free. Partition the $C$ classes $\{1,\dots ,C\}$ into $B = C / N_{\mathrm {tied}}$ disjoint blocks $\mathcal {B}_1,\dots ,\mathcal {B}_B$ each of size $N_{\mathrm {tied}}$ (we require $N_{\mathrm {tied}} \mid C$). Within each block $b$ all classes share a common equilibrium distribution \begin {equation} \label {eq:tied-pi-tying} \eqm ^{(c)} \equiv \tilde \eqm ^{(b)} \quad \text {for all } c \in \mathcal {B}_b. \end {equation}

Block-pooled complete-data log-likelihood. Substituting (B.25) into (B.10), the log-likelihood decomposes across blocks rather than across individual classes: \begin {equation} \label {eq:tied-pi-loglik} \ell _2 = \sum _{b=1}^{B} \ell _2^{(b)}, \end {equation} \begin {align} \ell _2^{(b)} &= \sum _i \tilde V'^{(b)}_i \log \tilde \eqm ^{(b)}_i \;+\; \sum _{c\in \mathcal {B}_b}\sum _{j>i} \bigl (U^{(c)}_{ij}+U^{(c)}_{ji}\bigr )\log \exch ^{(c)}_{ij} \notag \\ &\quad - \sum _{c\in \mathcal {B}_b}\sum _{j>i} \exch ^{(c)}_{ij}\, \bigl (W^{(c)}_i\,\tilde \eqm ^{(b)}_j + W^{(c)}_j\,\tilde \eqm ^{(b)}_i\bigr ), \label {eq:tied-pi-loglik-b} \end {align}

with the pooled equilibrium-character count \begin {equation} \label {eq:tied-pi-Vprime} \tilde V'^{(b)}_i = \sum _{c\in \mathcal {B}_b} V'^{(c)}_i. \end {equation} The exchangeability parts $\exch ^{(c)}$ remain class-specific; only the equilibrium term pools.

Block M-step. Within block $b$, alternate between the per-class $\exch ^{(c)}$ update (holding $\tilde \eqm ^{(b)}$ fixed) and the pooled $\tilde \eqm ^{(b)}$ update (holding all $\exch ^{(c)}, c\in \mathcal {B}_b$ fixed). The $\exch $-step uses the standard mixture formula (B.11) with the shared $\tilde \eqm ^{(b)}$: \begin {equation} \label {eq:tied-pi-S-update} \hat \exch ^{(c)}_{ij} = \frac {U^{(c)}_{ij}+U^{(c)}_{ji}} {W^{(c)}_i\,\tilde \eqm ^{(b)}_j + W^{(c)}_j\,\tilde \eqm ^{(b)}_i} \quad (c\in \mathcal {B}_b). \end {equation} The $\tilde \eqm ^{(b)}$-step is the same Lagrange-multiplier solve as the unconstrained mixture M-step, but with pooled coefficients. Differentiating (??): \begin {equation} \label {eq:tied-pi-pi-update} \tilde \eqm ^{(b)}_i = \frac {\tilde V'^{(b)}_i}{\tilde c^{(b)}_i - \eta }, \qquad \tilde c^{(b)}_i = \sum _{c\in \mathcal {B}_b}\sum _{j\neq i}\exch ^{(c)}_{ij}\,W^{(c)}_j, \end {equation} with $\eta $ found by 1D bisection from $\sum _i \tilde V'^{(b)}_i / (\tilde c^{(b)}_i - \eta ) = 1$, exactly as in Section A.1.8. Each coordinate update is an exact maximiser within its coordinate; the joint objective on $(\exch ^{(\cdot )}, \tilde \eqm ^{(b)})$ is strictly concave on $\mathbb {R}_+^{|\mathcal {B}_b| \cdot \binom {|\alphabet |}{2}} \times \Delta ^{|\alphabet |-1}$ (the latter the simplex), so the coordinate ascent converges to the unique block MLE.

Connection to standard models. With $N_{\mathrm {tied}} = 1$ (one class per block) the tying is trivial and (B.29) reduces to the per-class Lagrange solve of Section B.1.8. With $N_{\mathrm {tied}} = C$ (one block containing all classes) all classes share a single $\tilde \eqm $ the discrete-rate-classes parameterisation of (54) when the $\exch ^{(c)}$ are also proportional to a shared shape. For intermediate $N_{\mathrm {tied}}$ the tying captures shared equilibrium chemistry across classes that nonetheless differ in their exchangeability (and hence in their substitution pattern and rate), without forcing every class to maintain its own equilibrium estimate.

Pseudocounts. The LG-informed Dirichlet pseudocount on $\tilde \eqm ^{(b)}$ adds $N_\pi \,\eqm ^{\mathrm {LG}}_i$ to $\tilde V'^{(b)}_i$ once per block (not once per class within the block); the per-class Gamma$(a_S, b_S)$ pseudocount on $\exch ^{(c)}_{ij}$ enters the class-specific update (B.28) unchanged.

Combination with rate rescaling. Tied-$\eqm $ and rate rescaling can also be combined within a single M-step: each class has its own scalar $\sigma _c$ but classes within a block share the equilibrium $\tilde \eqm ^{(b)}$. The coordinate-ascent factorisation does dispatch in that joint regime; see Section B.1.12. The frozen-$\eqm $ mode (which fixes $\eqm ^{(c)}$ at its initialisation, with no M-step update) remains an alternative restriction. Alternation across M-steps (rate rescaling on odd iterations, tied-$\eqm $ on even iterations) is valid in addition because each step is an unconditional ascent on the joint $\ell _2$.

B.1.12 Tied Equilibria with Per-Class Rate Rescaling

The two restrictions of Sections B.1.9 and B.1.11 can be applied jointly to the same mixture M-step: each class $c$ in block $\mathcal {B}_b$ has its own scalar rate multiplier $\sigma _c$, with the exchangeability shape $\exch ^{(c)}$ frozen, while all classes within the block share a common equilibrium $\tilde \eqm ^{(b)}$. This matches the substitution degrees of freedom of a Maraschino-style fit with --freeze-class-S-shape under a block-tied class_pi constraint and is strictly more constrained than the joint $(\sigma _c, \eqm ^{(c)})$ M-step of Section B.1.10.

Block-pooled complete-data log-likelihood. With $\revsub ^{(c)}_{ij} = \sigma _c\,\exch ^{(c)}_{ij}\,\tilde \eqm ^{(b)}_j$ for $c \in \mathcal {B}_b$ (with $\exch ^{(c)}$ frozen), the per-block log-likelihood is \begin {equation} \label {eq:tpr-ll} \ell _2^{(b)}(\{\sigma _c\}_{c\in \mathcal {B}_b}, \tilde \eqm ^{(b)}) = \sum _b' \tilde V'^{(b)}_{b'} \log \tilde \eqm ^{(b)}_{b'} + \sum _{c\in \mathcal {B}_b}\Bigl [ U^{(c)}_\bullet \log \sigma _c - \sigma _c\, \mathbf {r}^{(c)\top } \tilde \eqm ^{(b)} \Bigr ] + \mathrm {const}, \end {equation} where the constant collects $\sum _c \sum _{j>i}(U^{(c)}_{ij}+U^{(c)}_{ji}) \log \exch ^{(c)}_{ij}$ (frozen), $\tilde V'^{(b)}_{b'} = \sum _{c\in \mathcal {B}_b} V'^{(c)}_{b'}$ is the block-pooled boundary count, and $r^{(c)}_{b'} = \sum _{a\neq b'} \exch ^{(c)}_{a,b'} W^{(c)}_a$ depends only on the frozen $(\exch ^{(c)}, W^{(c)})$.

Stationary conditions. Lagrangian for $\sum _{b'}\tilde \eqm ^{(b)}_{b'} = 1$ with multiplier $\lambda ^{(b)}$: \begin {align} \frac {\partial }{\partial \sigma _c} &= \frac {U^{(c)}_\bullet }{\sigma _c} - \mathbf {r}^{(c)\top }\tilde \eqm ^{(b)} = 0 \;\Longrightarrow \; \sigma _c = \frac {U^{(c)}_\bullet } {\mathbf {r}^{(c)\top }\tilde \eqm ^{(b)}}, \label {eq:tpr-foc-sigma}\\[2pt] \frac {\partial }{\partial \tilde \eqm ^{(b)}_{b'}} &= \frac {\tilde V'^{(b)}_{b'}}{\tilde \eqm ^{(b)}_{b'}} - \sum _{c\in \mathcal {B}_b} \sigma _c\,r^{(c)}_{b'} - \lambda ^{(b)} = 0 \;\Longrightarrow \; \tilde \eqm ^{(b)}_{b'} = \frac {\tilde V'^{(b)}_{b'}} {\lambda ^{(b)} + \sum _{c\in \mathcal {B}_b} \sigma _c r^{(c)}_{b'}}. \label {eq:tpr-foc-pi} \end {align}

Block Lagrange multiplier is fixed. Multiplying (??) by $\tilde \eqm ^{(b)}_{b'}$ and summing over $b'$, \[ \sum _{b'} \tilde V'^{(b)}_{b'} = \lambda ^{(b)} + \sum _{c\in \mathcal {B}_b} \sigma _c\, \mathbf {r}^{(c)\top }\tilde \eqm ^{(b)}. \] Substituting (??) into the right-hand side collapses the $\sigma _c$-dependent term to $\sum _c U^{(c)}_\bullet $, giving \begin {equation} \label {eq:tpr-lambda} \boxed {\;\lambda ^{(b)} \;=\; \tilde V_\bullet ^{(b)},\;} \qquad \tilde V_\bullet ^{(b)} := \sum _{c\in \mathcal {B}_b} V_\bullet ^{(c)} = \sum _{c\in \mathcal {B}_b}\sum _i V^{(c)}_i, \end {equation} generalising the single-class result (B.21). The multiplier is fully determined by the block-pooled boundary count.

Closed-form coupled updates. Substituting (B.31) into (??), \begin {equation} \label {eq:tpr-pi-update} \hat {\tilde \eqm }^{(b)}_{b'} = \frac {\tilde V'^{(b)}_{b'}} {\tilde V_\bullet ^{(b)} + \sum _{c\in \mathcal {B}_b} \sigma _c\, r^{(c)}_{b'}}. \end {equation} Equation (B.32) together with (??) forms a closed nonlinear system in the $|\mathcal {B}_b|+A$ unknowns $(\{\sigma _c\}, \tilde \eqm ^{(b)})$ per block. Two practical solvers, both monotone in $\ell _2^{(b)}$:

Both solvers reach the same fixed point because $\ell _2^{(b)}$ is strictly concave in each block of variables given the others (the $\log $ terms ensure positivity through the FOCs).

Mixture-wide M-step. The blocks decouple in the same way the classes do under Section B.1.11: the joint $\ell _2 = \sum _b \ell _2^{(b)}$ separates over $b$, so each block can be solved independently.

Aggregation of counts. Per block, the required sufficient statistics are the per-class $(U^{(c)}_\bullet , V'^{(c)}, r^{(c)})$ for $c \in \mathcal {B}_b$ plus the block totals $\tilde V'^{(b)}, \tilde V_\bullet ^{(b)}$. All are linear in $(W, U, V)$ and accumulate via the same EMA used by the standard M-step.

Reduction to special cases. Setting $|\mathcal {B}_b| = 1$ (each class its own block) recovers the joint $(\sigma _c, \eqm ^{(c)})$ M-step of Section B.1.10; setting all $\sigma _c = 1$ (no rate rescaling within block) recovers a degenerate version of the tied-$\eqm $ M-step with frozen exchangeabilities; freezing $\tilde \eqm ^{(b)}$ at its current value recovers the rate-only rescaling of Section B.1.9. Strict dominance holds in both directions: this regime improves on Section B.1.9 by adding the $\eqm $ degree of freedom and improves on Section B.1.11 (with $\exch ^{(c)}$ frozen at LG) by adding per-class $\sigma _c$.

When to use. Use this regime when the warm-start fit (e.g. Maraschino with --freeze-class-S-shape) ties classes into blocks with shared equilibrium chemistry but distinct rates, and the SVI–BW M-step should preserve that tying while continuing to update both $(\sigma _c, \tilde \eqm ^{(b)})$ from new evidence. The block tying is a strong inductive bias when the number of classes per block is small relative to $A$ and the per-class equilibrium counts $V'^{(c)}$ would otherwise be too sparse to estimate $\eqm ^{(c)}$ reliably.

B.1.13 Stochastic Variational Baum-Welch

For large datasets it is impractical to run the E-step over all training pairs every iteration. A natural stochastic extension is to sample a minibatch of pairs, run the E-step on those, and update parameters. Naively replacing the full sufficient statistics with each batch’s statistics discards information from previous batches and results in noisy, slowly converging parameter estimates.

The framework of stochastic variational inference (21) provides a principled resolution in the conditionally conjugate case. Each M-step parameter group in TKF91 (and TKF92 and MixDom) has an exponential-family complete-data log-likelihood, regular for the multinomial groups (mixture weights, fragment-type transitions, site-class distributions) and curved for the BDI rates and the reversible-CTMC (GTR) submodel (Sections A.1.3, A.1.2). For the multinomial groups we use Dirichlet priors that are strictly conjugate; for the BDI and GTR groups we use independent Gamma${}\times {}$Dirichlet pseudocounts that are non-conjugate regularisers in the reversible / curved-family parameterisation (Section C.1.4.0). For the conjugate groups, Hoffman et al.’s natural-gradient argument applies in full and the update (B.33) below is exact natural-gradient ascent on the variational ELBO. For the non-conjugate groups, the natural-gradient interpretation no longer holds verbatim, but the same exponential-moving-average update remains a sound stochastic-MAP-EM scheme: each M-step is affine in the augmented sufficient statistics, so an EMA of statistics commutes with the M-step and converges to the MAP fixed point of full-batch EM under the standard Robbins–Monro step-size conditions. In practice both groups are updated by the same rule: \begin {equation} \label {eq:svi-update} \bar {T}^{(t+1)} = (1 - \rho _t)\,\bar {T}^{(t)} + \rho _t\!\left (\bar {T}_0 + \frac {N}{|B_t|}\,T_{B_t}\right ), \end {equation} where $\bar {T}^{(t)}$ denotes the running sufficient statistics at iteration $t$ (transition counts $\hat {n}_{ij}$, bridge-expectation dwell times $W_i$ and jump counts $U_{ij}$, character counts $V_i$, and BDI expectations $\hat {B}, \hat {D}, \hat {S}$), $\bar {T}_0$ the prior pseudocounts, $T_{B_t}$ the batch sufficient statistics from the E-step on minibatch $B_t$, $N$ the total number of training pairs, $|B_t|$ the batch size, and $\rho _t = (t + \tau )^{-\kappa }$ the Robbins–Monro step size with delay $\tau \geq 0$ and forgetting rate $\kappa \in (0.5, 1]$.

The M-step—quadratic solve for indel rates (A.7), Dirichlet MAP for mixture weights, Beta MAP for extension rates, and the iterative coordinate ascent for $(\exch , \eqm )$—is then applied to $\bar {T}^{(t+1)}$ exactly as in full-batch EM. Each M-step maximizes a (penalised) Q-function that is linear in the sufficient statistics, so averaging the statistics before solving is equivalent to finding the MAP of the averaged Q-function. For the conjugate parameter groups this maximizer is also the natural parameter of the updated variational posterior; for the non-conjugate groups it is the MAP of the regularised Q-function and not a variational posterior, but the iteration is identical.

This retains the closed-form exactness of every M-step while enabling online processing of arbitrarily large pair collections, with each training pair visited only once. The two hyperparameters $\tau $ and $\kappa $ have robust defaults ($\tau = 10$, $\kappa = 0.7$) that work across a range of dataset sizes without tuning (21).

Equivalent formulations of the prior pseudocounts. Equation (B.33) places the prior pseudocounts $\bar T_0$ inside the EMA update. An equivalent implementation accumulates only the data contribution in $\bar T$ and adds $\bar T_0$ at M-step time: \[ \tilde T^{(t+1)} = (1-\rho _t)\,\tilde T^{(t)} + \rho _t\, \tfrac {N}{|B_t|}\, T_{B_t}, \qquad \theta ^{(t+1)} = \mathrm {M\text {-}step}\bigl (\tilde T^{(t+1)} + \bar T_0\bigr ). \] For every M-step considered in this paper (Dirichlet MAP, Gamma-augmented counts, pooled GTR formula), the M-step operator is affine in its sufficient statistics, so $\mathrm {M}(\bar T^{(t)}) = \mathrm {M}(\tilde T^{(t)} + \bar T_0)$ exactly at every iteration; the two prescriptions yield numerically identical estimates. This affineness is what we actually need for the EMA-of-pseudocounts formulation, and it holds whether or not the prior is strictly conjugate to the complete-data likelihood (Section C.1.4.0 discusses the distinction). The post-hoc form is slightly more convenient for implementation because each M-step keeps the prior encapsulated in its closed-form update; see the code comment at the SVI loop for details.

B.2 Stochastic Variational Baum–Welch Convergence

B.2.1 Setup

We consider the idealized infinite-data setting: each minibatch of $B$ sequence pairs is drawn i.i.d. from the true generative model, so there is no model misspecification. Stochastic Variational Baum–Welch (SVB) alternates:

At convergence, the true parameters $\theta ^*$ are a fixed point of $\theta \mapsto M\bigl (\expect _{\theta }[T]\bigr )$.

B.2.2 BDI sufficient statistics

For a single sequence pair $(x,y)$ with ancestor length $i$, descendant length $j$, and evolutionary time $t$, the BDI sufficient statistics are $T = (\expect [B \mid x,y],\; \expect [D \mid x,y],\; \expect [S \mid x,y])$, recovered via the score function identity. The M-step is (Section A.1.8): \begin {equation} \label {eq:mstep} \hat \insrate = \frac {\bar B}{\bar S + t}, \qquad \hat \delrate = \frac {\bar D}{\bar S}, \end {equation} where $\bar B, \bar D, \bar S$ denote averages (or sums) over the minibatch.

B.2.3 Variance of Minibatch Estimates

Per-pair variance Let $\sigma ^2_B, \sigma ^2_D, \sigma ^2_S$ denote the variances of $\expect [B \mid x,y]$, $\expect [D \mid x,y]$, $\expect [S \mid x,y]$ under the data-generating distribution (randomness over $(x,y)$ pairs drawn from the model). For a minibatch of $B$ independent pairs, the sample means satisfy \[ \Var [\bar B] = \frac {\sigma ^2_B}{B}, \qquad \Var [\bar D] = \frac {\sigma ^2_D}{B}, \qquad \Var [\bar S] = \frac {\sigma ^2_S}{B}. \]

Parameter variance via the delta method By the delta method applied to (B.34), the relative error of $\hat \insrate $ after one M-step on a minibatch of size $B$ is \begin {equation} \label {eq:delta-lambda} \Var [\hat \insrate ] / \insrate ^2 \;\approx \; \frac {1}{B} \left [ \frac {\sigma ^2_B}{\expect [B]^2} + \frac {\sigma ^2_S}{(\expect [S]+t)^2} - \frac {2\,\Cov [B,S]}{\expect [B](\expect [S]+t)} \right ] \equiv \frac {v_\insrate }{B}, \end {equation} and similarly for $\hat \delrate $: \begin {equation} \label {eq:delta-mu} \Var [\hat \delrate ] / \delrate ^2 \;\approx \; \frac {1}{B} \left [ \frac {\sigma ^2_D}{\expect [D]^2} + \frac {\sigma ^2_S}{\expect [S]^2} - \frac {2\,\Cov [D,S]}{\expect [D]\,\expect [S]} \right ] \equiv \frac {v_\delrate }{B}. \end {equation} The constants $v_\insrate , v_\delrate $ depend on $(\insrate , \delrate , t)$ and the sequence length distribution. They are the diagonal entries of the inverse Fisher information matrix (per observation), expressed in the $(\insrate , \delrate )$ parameterization.

Thus the per-pair Fisher information is $O(1)$ in sequence length — longer sequences do not help much for indel parameter estimation (in contrast to substitution parameters, where Fisher information grows linearly with alignment length).

Proposition B.1 (Per-pair Fisher information, long-time limit). For $\delta = \delrate - \insrate > 0$ and $t \to \infty $: \[ \expect [B] \to \frac {\kappa }{1-\kappa }, \qquad \expect [D] \to \frac {\kappa }{1-\kappa }, \qquad \expect [S] \to \frac {2\kappa }{(1-\kappa )\delta }, \] and the coefficient of variation of $\hat \insrate $ from a single pair is $O(1)$ (bounded away from zero).

This reflects the fact that, at stationarity, each pair provides a single “observation” of the BDI process endpoint distribution $P(j \mid i)$ — the information content is fundamentally per-pair, not per-residue.

B.2.4 Pseudocount representation and its advantages

The SVB iterates can be represented equivalently in two ways: as an EMA of data-only sufficient statistics $s_k$ with priors $\alpha $ added inside each M-step, or as an EMA directly in posterior pseudocount space $\tilde {\alpha }_k = s_k + \alpha $. In the latter view, the update rule is \begin {equation} \label {eq:ema-pseudocount} \tilde {\alpha }_k \;=\; (1 - \eta _k)\,\tilde {\alpha }_{k-1} + \eta _k\bigl (\alpha + (N/|B|)\,s_{\text {batch}_k}\bigr ), \end {equation} matching (B.33) with the prior $T_0 \equiv \alpha $ absorbed into the EMA target. By induction, $\tilde {\alpha }_k = s_k + \alpha $ at every step given a matched initial state, so the two formulations produce identical parameter iterates.

For the Dirichlet-multinomial groups, $\tilde {\alpha }$ is the natural parameter of the conjugate posterior; for the BDI / GTR groups it is the augmented sufficient-statistic vector at which the regularised Q-function is maximised (the underlying priors there are non-conjugate regularisers; see Section C.1.4.0). In either case, the closed-form M-step maximises \begin {equation} \label {eq:pseudocount-mstep} J(\theta \mid \tilde {\alpha }) \;=\; \tilde {\alpha } \cdot \log \theta - Z(\theta ), \end {equation} where $Z$ is the log-partition function of the family. For a Dirichlet-multinomial component (transitions, mixture weights, class pis, classdist), $J = \sum _i (\tilde {\alpha }_i - 1)\log \theta _i$ up to a constant, with unique maximizer $\theta _i \propto \tilde {\alpha }_i - 1$. The BDI rates $(\insrate ,\delrate )$ follow the same pattern but the joint $J$ couples them through the geometric fragment length prior and the M-step is obtained from the quadratic in $\kappa = \insrate /\delrate $ of (A.7). The SVB iteration thus reduces to updating one pytree of pseudocounts per step, with M-step parameters computed on demand.

Variance via a single Jacobian. Let $f\colon \tilde {\alpha }\mapsto \theta $ denote the M-step map. The delta method gives \begin {equation} \label {eq:delta-pseudocount} \Var [\theta _k] \;\approx \; J_f(\tilde {\alpha }_k)\,\Var [\tilde {\alpha }_k]\,J_f(\tilde {\alpha }_k)^\top . \end {equation} Equations (B.35)–(B.36) are the diagonal entries of (B.39) in the $(\insrate ,\delrate )$ parameterization; in pseudocount form, the same delta method is one derivative of $f$ away regardless of parameter group.

Polyak–Ruppert averaging is built in. Expanding (B.37), \[ \tilde {\alpha }_K \;=\; \sum _{j=1}^K w_{j,K} \bigl (\alpha + (N/|B|)\, s_{\text {batch}_j}\bigr ), \qquad w_{j,K} \;=\; \eta _j \prod _{i>j}^K(1-\eta _i). \] The pseudocount state is itself a weighted average of batch-level pseudocounts, and the M-step applied to this average is the Rao–Blackwell improvement over averaging parameter iterates — which matters when $f$ is nonlinear, as for the BDI rate M-step (§C.1.4, eq. (A.7)). An effective sample size \[ \mathrm {ESS}_K \;=\; \frac {\bigl (\sum _j w_{j,K}\bigr )^2}{\sum _j w_{j,K}^2} \] quantifies how “warmed up” the EMA is and is strictly less than $K$ whenever $\eta _k < 1$. For a parameter group receiving a non-negligible contribution in only a fraction $\varepsilon $ of batches (a rare domain, a rarely-visited fragment state, a rarely-active class), the per-group effective sample size is $\varepsilon \,\mathrm {ESS}_K$, not $\mathrm {ESS}_K$ — exposing the rare-parameter bottleneck of §B.3 as a direct quantity rather than an approximation.

Complete-data Fisher information. For a Dirichlet-multinomial component, the Hessian of $J$ at its maximum is diagonal in natural-parameter coordinates with $[I_{\text {comp}}(\tilde {\alpha })]_{ii} = 1/\tilde {\alpha }_i$. Rare categories (small $\tilde {\alpha }_i$) therefore have large per-component variance, and the bottleneck for each parameter group can be read off directly as $\tilde {\alpha }_i/\mathrm {ESS}$ — a diagnostic in natural-parameter space that replaces the CV-based convergence recommendation of item 6 in §B.2.

Practical consequence. An implementation that carries $\tilde {\alpha }_k$ as its primary state — rather than separately tracking $s_k$ and re-adding $\alpha $ in each M-step — makes (B.39), the Polyak–Ruppert weights, and the per-group Fisher diagnostics all computable from the same pytree, with no additional bookkeeping.

B.2.5 SVB Convergence Rate

Stochastic approximation framework SVB is a stochastic fixed-point iteration. By the results of Cappé & Moulines (2009), under regularity conditions satisfied by exponential families, the parameter iterates $\theta _k$ satisfy \begin {equation} \label {eq:clt} \sqrt {K}(\bar \theta _K - \theta ^*) \xrightarrow {d} \mathcal {N}(0, \Sigma ) \end {equation} where $\bar \theta _K = \frac {1}{K}\sum _{k=1}^K \theta _k$ is the Polyak–Ruppert average and $\Sigma $ depends on the learning rate schedule.

Direct M-step (no averaging). With direct M-step updates (step size 1), the iterate $\theta _K$ has variance \begin {equation} \label {eq:nonavg} \Var [\theta _K] \approx \frac {\Sigma _{\text {per-pair}}}{B} \end {equation} after convergence of the transient, where $\Sigma _{\text {per-pair}}$ is the asymptotic covariance of the M-step estimator applied to a single pair. This does not decrease with $K$ — the iterates fluctuate around $\theta ^*$ with amplitude $O(1/\sqrt {B})$.

Polyak–Ruppert averaging. Averaging the iterates: \begin {equation} \label {eq:avg} \Var [\bar \theta _K] \approx \frac {\Sigma _{\text {per-pair}}}{BK}. \end {equation} This is optimal: the total number of pairs processed is $N = BK$, and we achieve variance $\Sigma / N$ regardless of how $N$ is split between $B$ and $K$.

Exponential moving average (EMA). In practice, one often uses $\theta _{k+1} = (1-\eta )\theta _k + \eta \, M(\hat T_B)$ with step size $\eta \in (0,1]$. The asymptotic variance is \[ \Var [\theta _\infty ] \approx \frac {\eta }{2-\eta } \cdot \frac {\Sigma _{\text {per-pair}}}{B} \approx \frac {\eta \, \Sigma _{\text {per-pair}}}{2B} \quad (\eta \ll 1). \] This trades convergence speed for lower asymptotic variance.

Convergence transient The EM rate matrix governs the transient. For exponential families, the EM convergence rate is $\rho = 1 - I_{\text {obs}} / I_{\text {comp}}$, where $I_{\text {obs}}$ and $I_{\text {comp}}$ are the observed and complete-data Fisher information matrices. For BDI, the fraction of missing information is typically moderate ($\rho \approx 0.3$–$0.7$ for typical protein evolution times $t \sim 0.5$–$2$), so the transient dies out in $O(10)$ iterations.

B.2.6 Practical Recommendations

Minibatch size For target relative error $\varepsilon $ on indel parameters after $K$ steps with Polyak–Ruppert averaging: \begin {equation} \label {eq:minibatch} BK \geq \frac {v_\theta }{\varepsilon ^2}, \end {equation} where $v_\theta \sim O(1)$ is the per-pair relative variance from (B.35)–(B.36).

Rule of thumb. Take $v_\theta \approx 5$ as a conservative estimate for typical protein BDI parameters ($\kappa \approx 0.9$, $t \approx 1$). Then:

Natural gradient The BDI complete-data statistics form a curved exponential family with natural parameters $(\log \insrate , \log \delrate , -(\insrate +\delrate ))$ and sufficient statistics $(B, D, S)$. The natural gradient replaces the M-step update $\Delta \theta = -I_F^{-1} \nabla _\theta Q(\theta )$ where $I_F$ is the Fisher information matrix of the complete-data family.

For the BDI M-step, the closed-form solution (B.34) already implicitly incorporates the complete-data Fisher information structure. Stochastic natural gradient (SNatGrad) would precondition the stochastic update direction using $I_F^{-1}$, but since $\hat \insrate = \bar B / (\bar S + t)$ is already an unbiased estimator of $\insrate $ (to first order), the benefit of explicit preconditioning is marginal.

Recommendation: Natural gradient is not worth the implementation complexity for SVB on TKF-family models. The closed-form M-step already provides the natural parameterization. Use Polyak–Ruppert averaging of the M-step outputs instead.

B.2.7 Bias diagnostics: Hellinger, ESS, and Fisher readouts

The convergence diagnostics above answer “have the parameter iterates stabilised?” but are silent on a distinct question that arises in practice: when validation log-likelihood regresses during training, is the regression within the noise floor predicted by the minibatch variance, or is it evidence of a structural problem (class-assignment drift, over-concentration of a mixture component, over-fitting of the training likelihood)? The pseudocount view of §B.2.4 makes several readouts available that are cheap to compute from saved checkpoints and sharply discriminate among these hypotheses.

Hellinger distance on mixture distributions. For any probability vector $p$ (e.g. a row of classdist[d, f, :], a fragment-transition row $\ext ^{(d)}_{f,\cdot }$, or a domain-weight vector), the Hellinger distance \[ H(p, q) \;=\; \tfrac {1}{\sqrt {2}}\,\|\sqrt {p} - \sqrt {q}\|_2 \;\in \; [0, 1] \] is the natural metric: bounded, symmetric (unlike KL), and locally linear in probability-mass movement. A transfer of $\delta $ mass from one category to another produces $H \approx \sqrt {\delta }/\sqrt {2}$, so a 1% reassignment registers as $H \approx 0.07$ and a clean class-swap gives $H = 1$. The drift hypothesis — that a mixture component is discretely reassigned between iterations $k$ and $k{+}1$ and that this reassignment accounts for the val-LL regression — predicts one or more iter-to-iter Hellinger values well above the EM-concentration noise floor, where the noise floor is empirically $\sim 0.005$ on training runs seeded from an informed mixture profile (e.g. IQ-TREE’s C10). A “sharp” drift signature appropriate to a $\sim 10\%$ per-pair val-LL regression would be $H > 0.15$ at the responsible iter; a flat trajectory at $H \leq 0.02$ throughout the regression window falsifies the discrete-drift hypothesis and redirects investigation to continuous alternatives (over-fitting of the training likelihood, slow limit cycles, or systematic mis-specification).

Effective sample size per parameter group. As derived in §B.2.4, the EMA effective sample size $\mathrm {ESS}_K = (\sum _j w_{j,K})^2 / \sum _j w_{j,K}^2$ is a first-class quantity in pseudocount space, strictly smaller than $K$ whenever $\eta _k < 1$. For a parameter group receiving a non-negligible batch contribution in only a fraction $\varepsilon $ of minibatches, the group-specific effective sample size is $\varepsilon \,\mathrm {ESS}_K$, turning the rare-parameter bottleneck of §B.3 from a theoretical concern into a direct quantity that can be logged per iter.

Fisher readout of remaining uncertainty. Combining the two quantities, the per-component posterior uncertainty for a Dirichlet-multinomial parameter group is proportional to $\tilde {\alpha }_i / \mathrm {ESS}$: rare categories with small posterior pseudocount and small per-group ESS are exactly those whose point estimates are least certain. This replaces the CV-of-$\hat \theta $ heuristic of item 6 above with a reading in natural-parameter space and lets one rank parameter groups by residual uncertainty without running a second pass of DP.

Suggested dashboard. At minimum, log per iter: (i) $\eta _k$ and $\mathrm {ESS}_k$; (ii) for each mixture group (classdist, ext rows, dom_weights), the max Hellinger distance to the previous iter and the entropy of the current iterate; (iii) $\|\Delta \tilde {\alpha }_k\|_1 / \|\tilde {\alpha }_k\|_1$ per parameter group as an $\tilde {\alpha }$-space convergence diagnostic; (iv) the Hungarian-matched distance of the current $\mathrm {class\_pis}$ (or analogous mixture-profile tensor) to the initial informed seed, as a “drift-from-prior” signature independent of relabelling. All are computable from saved parameters alone; no DP is required. Taken together, these give a rapid read-out on whether a val-LL regression reflects minibatch noise, a discrete drift event, or systematic over-fitting — and thus whether the response should be a larger minibatch, a frozen-component diagnostic, or a stronger prior.

B.3 Expected Statistics and Linearized Convergence

The preceding sections characterized SVB convergence in terms of abstract per-pair variances $\sigma ^2_B, \sigma ^2_D, \sigma ^2_S$. We now derive these quantities in closed form at stationarity, giving concrete convergence estimates for the BDI parameter group of any TKF-family model. The corresponding specialisation to the MixDom hierarchy of top-level, per-domain, and intra-fragment chains is collected in Appendix C.6.

B.3.1 BDI expected statistics at stationarity

Lemma B.2 (Expected BDI sufficient statistics). Consider a single BDI process with insertion rate $\insrate $, deletion rate $\delrate $, $\kappa = \insrate /\delrate < 1$, $\delta = \delrate - \insrate > 0$, and define $\alpha = e^{-\delrate \evoltime }$, $\Phi = 1 - \alpha $. For an ancestor–descendant pair $(i,j)$ with ancestor length $i$ drawn from the stationary distribution $i \sim \mathrm {Geom}(\kappa )$ (i.e. $P(i) = (1-\kappa )\kappa ^i$ for $i = 0,1,2,\ldots $), the expected BDI sufficient statistics are: \begin {align} \expect _\pi [i] &= L \;=\; \frac {\kappa }{1-\kappa } \label {eq:stat-L} \\[4pt] \expect _\pi [S] &= L\,\evoltime \label {eq:stat-ES} \\[4pt] \expect _\pi [B] &= \insrate \,(L + 1)\,\evoltime \;=\; \frac {\insrate \,\evoltime }{1 - \kappa } \label {eq:stat-EB} \\[4pt] \expect _\pi [D] &= \delrate \,L\,\evoltime \;=\; \frac {\delrate \,\kappa \,\evoltime }{1 - \kappa } \;=\; \expect _\pi [B] \label {eq:stat-ED} \end {align}

with $M = 1$ (one trajectory endpoint per pair) and $T = \evoltime $ (total observation time for a single process).

The key intermediate result is: for a BDI process starting from $N(0) = i$ mortal links, the expected number at time $s$ is \begin {equation} \label {eq:bdi-mean} \expect [N(s) \mid N(0) = i] = i\,e^{-\delta s} + L\,(1 - e^{-\delta s}), \end {equation} so the time-integrated expected count is \begin {equation} \label {eq:time-integrated} \expect [S \mid i] = \int _0^\evoltime \expect [N(s) \mid i]\,ds = (i - L)\,\frac {1 - e^{-\delta \evoltime }}{\delta } + L\,\evoltime . \end {equation} At stationarity, $\expect _\pi [i] = L$, so $\expect _\pi [S] = L\,\evoltime $. The birth and death expectations follow from $\expect [B \mid i] = \insrate \,\expect [S \mid i] + \insrate \,\evoltime $ (the $\insrate \,\evoltime $ term is the immortal link’s contribution) and $\expect [D \mid i] = \delrate \,\expect [S \mid i]$. The conservation law $\expect _\pi [B] = \expect _\pi [D]$ confirms that stationarity is maintained.

Proof. Equation (B.44) is the standard result for a linear birth-death process with immigration at rate $\insrate $ and per-capita death rate $\delrate $. The $i$ initial links each survive for $\mathrm {Exp}(\delrate )$ time (giving the $i\,e^{-\delta s}$ decay), while the immigration generates new links at rate $\insrate $ from the immortal link plus the mortal links themselves; the mean converges to the stationary value $L = \kappa /(1-\kappa )$ as $s \to \infty $.

Integrating (B.44): \begin {align*} \expect [S \mid i] &= \int _0^\evoltime \bigl [i\,e^{-\delta s} + L\,(1 - e^{-\delta s})\bigr ]\,ds \\ &= i\,\frac {1 - e^{-\delta \evoltime }}{\delta } + L\Bigl (\evoltime - \frac {1 - e^{-\delta \evoltime }}{\delta }\Bigr ) \\ &= (i - L)\,\frac {1 - e^{-\delta \evoltime }}{\delta } + L\,\evoltime . \end {align*}

Taking $\expect _\pi [i] = L$, the $(i-L)$ term vanishes, giving $\expect _\pi [S] = L\,\evoltime $.

For births: $\expect _\pi [B] = \insrate \,\expect _\pi [S] + \insrate \,\evoltime = \insrate \,(L+1)\,\evoltime = \insrate \,\evoltime /(1-\kappa )$.

For deaths: $\expect _\pi [D] = \delrate \,\expect _\pi [S] = \delrate \,L\,\evoltime = \delrate \,\kappa \,\evoltime /(1-\kappa ) = \insrate \,\evoltime /(1-\kappa ) = \expect _\pi [B]$. □

B.3.2 Per-pair variance of sufficient statistics

The randomness in $(B, D, S)$ across pairs arises from the randomness in the ancestor and descendant lengths $(i, j)$. At stationarity, $i \sim \mathrm {Geom}(\kappa )$ with $\Var [i] = \kappa /(1-\kappa )^2$.

Since $\expect [S \mid i]$ is linear in $i$ (eq. (B.45)), the variance of $\expect [S \mid i]$ across pairs is: \begin {equation} \label {eq:var-ES-outer} \Var _\pi [\expect [S \mid i]] = \left (\frac {1 - e^{-\delta \evoltime }}{\delta }\right )^{\!2} \cdot \frac {\kappa }{(1-\kappa )^2} \;\equiv \; \sigma ^2_{S,\text {outer}}. \end {equation} Similarly, since $\expect [B \mid i] = \insrate \,\expect [S \mid i] + \insrate \,\evoltime $: \begin {equation} \label {eq:var-EB-outer} \Var _\pi [\expect [B \mid i]] = \insrate ^2\,\sigma ^2_{S,\text {outer}}, \end {equation} and $\expect [D \mid i] = \delrate \,\expect [S \mid i]$: \begin {equation} \label {eq:var-ED-outer} \Var _\pi [\expect [D \mid i]] = \delrate ^2\,\sigma ^2_{S,\text {outer}}. \end {equation}

By the law of total variance, $\Var _\pi [S] = \Var _\pi [\expect [S \mid i]] + \expect _\pi [\Var [S \mid i]]$. The inner variance $\Var [S \mid i]$ captures the randomness of the BDI trajectory given a fixed starting state $i$. This is harder to compute in closed form, but for the purpose of convergence estimates, the outer variance (B.46) provides a lower bound on $\Var [S]$ and is the dominant term when $\kappa $ is near 1 (long sequences), since both sources of variance scale as $O(\kappa /(1-\kappa )^2)$.

The covariance structure is also dominated by the outer term: \begin {equation} \label {eq:cov-BS-outer} \Cov _\pi [\expect [B \mid i],\, \expect [S \mid i]] = \insrate \,\sigma ^2_{S,\text {outer}}. \end {equation}

B.3.3 Relative error and per-pair coefficients of variation

Substituting into the delta-method formulas (B.35)–(B.36) with $\expect [B] = \insrate \,\evoltime /(1-\kappa )$, $\expect [S] = L\,\evoltime $, and the outer-variance approximation:

Define \[ \psi = \frac {1 - e^{-\delta \evoltime }}{\delta \,\evoltime } \;\in \;(0,1) \] (the ratio of the “regression to mean” time scale to the total time). Then: \begin {align} \sigma ^2_{S,\text {outer}} &= \psi ^2\,\evoltime ^2\,\frac {\kappa }{(1-\kappa )^2} \label {eq:sigma-S-psi} \\[4pt] \frac {\sigma ^2_{S,\text {outer}}}{\expect _\pi [S]^2} &= \frac {\psi ^2}{\kappa \,/(1-\kappa )^2 \cdot \evoltime ^2} \cdot \frac {\kappa }{(1-\kappa )^2} \;=\; \frac {\psi ^2\,(1-\kappa )^2}{\kappa } \;\cdot \; \frac {1}{1} \notag \end {align}

The relative variance of $\hat \insrate $ from a single pair becomes (using the outer-variance approximation in (B.35), and noting that the $B$–$S$ covariance term partially cancels the $B$ and $S$ terms): \begin {equation} \label {eq:v-lambda-closed} v_\insrate \;\approx \; \frac {\psi ^2\,(1-\kappa )^2}{\kappa } \left [ 1 + \frac {\insrate ^2}{(\insrate + 1/\evoltime )^2} - \frac {2\insrate }{\insrate + 1/\evoltime } \right ] \;=\; \frac {\psi ^2\,(1-\kappa )^2}{\kappa \,(1+\insrate \evoltime )^2} \end {equation} This uses $\expect [S] + \evoltime = (L+1)\,\evoltime = \evoltime /(1-\kappa )$, so $\insrate /(\insrate + 1/\evoltime ) = \insrate \evoltime /(1+\insrate \evoltime ) = \kappa \delta \evoltime /(\delta + \delta /(\kappa \cdots ))$—more directly, substitute into (B.35): \begin {align*} v_\insrate &= \frac {\insrate ^2\sigma ^2_{S,\text {outer}}}{\expect [B]^2} + \frac {\sigma ^2_{S,\text {outer}}}{(\expect [S]+\evoltime )^2} - \frac {2\insrate \sigma ^2_{S,\text {outer}}}{\expect [B](\expect [S]+\evoltime )} \\ &= \sigma ^2_{S,\text {outer}} \left ( \frac {\insrate ^2(1-\kappa )^2}{\insrate ^2\evoltime ^2} + \frac {(1-\kappa )^2}{\evoltime ^2} - \frac {2\insrate (1-\kappa )^2}{\insrate \evoltime ^2} \right ) \\ &= \frac {\psi ^2\kappa }{(1-\kappa )^2} \cdot \frac {(1-\kappa )^2}{\evoltime ^2} \cdot \bigl (1 + 1 - 2\bigr ) \\ &= 0. \end {align*}

The outer-variance contribution to $v_\insrate $ vanishes exactly! This is because $\expect [B \mid i] = \insrate \,(\expect [S \mid i] + \evoltime )$, so the ratio $\expect [B \mid i]/(\expect [S \mid i] + \evoltime ) = \insrate $ is constant in $i$—the delta-method perturbation is zero.

This means the dominant source of variance in $\hat \insrate $ is the inner (trajectory) variance, not the length distribution. The per-pair coefficient of variation of $\hat \insrate $ is determined by the fluctuations in $(B, D, S)$ given the endpoint pair $(i, j)$, which depends on the stochastic path structure of the BDI process.

Proposition B.3 (Per-pair relative variance from inner fluctuations). For $\hat \insrate = \bar B / (\bar S + \evoltime )$ estimated from a single pair with ancestor length $i$ drawn from stationarity: \begin {equation} \label {eq:v-lambda-inner} v_\insrate \;\approx \; \frac {\expect _\pi [\Var [B \mid i, j]]}{\expect _\pi [B]^2} + \frac {\expect _\pi [\Var [S \mid i, j]]}{(\expect _\pi [S] + \evoltime )^2} - \frac {2\,\expect _\pi [\Cov [B, S \mid i, j]]} {\expect _\pi [B]\,(\expect _\pi [S] + \evoltime )}. \end {equation} The inner variances $\Var [B \mid i,j]$ and $\Var [S \mid i,j]$ are the posterior variances of the BDI trajectory statistics given the observed endpoints. These are computable from the observed Fisher information (Hessian of the log-likelihood), but do not simplify to elementary closed forms.

However, we can bound the total per-pair relative variance using the observed Fisher information. For a single BDI process observed at one timepoint, the complete-data Fisher information for $\insrate $ is $I_{\text {comp}}(\insrate ) = \expect [S + \evoltime ]/\insrate ^2 = \evoltime /(\insrate ^2(1-\kappa ))$, and the observed (incomplete-data) information satisfies $I_{\text {obs}} \leq I_{\text {comp}}$. The per-pair relative variance is $v_\insrate \geq 1/(\insrate ^2\,I_{\text {obs}}) \geq 1/(\insrate ^2\,I_{\text {comp}})$, giving: \begin {equation} \label {eq:v-lambda-bound} v_\insrate \;\geq \; \frac {1-\kappa }{\insrate \,\evoltime } \;=\; \frac {1}{\expect _\pi [B]}. \end {equation} The per-pair relative variance is at least $1/\expect _\pi [B]$. When $\expect _\pi [B]$ is small (few births per pair), the relative error is large.

Practical approximation. For moderate times ($\delta \evoltime \sim 1$), the missing-information fraction is $\rho \approx 0.3$–$0.7$, so $I_{\text {obs}} \approx (1-\rho )\,I_{\text {comp}}$ and $v_\insrate \approx (1-\kappa )/(\insrate \,\evoltime \,(1-\rho ))$. Taking $\rho \approx 0.5$ as a rough estimate: \begin {equation} \label {eq:v-lambda-approx} v_\insrate \;\approx \; \frac {2}{\expect _\pi [B]} \;=\; \frac {2(1-\kappa )}{\insrate \,\evoltime }. \end {equation} Similarly: \begin {equation} \label {eq:v-mu-approx} v_\delrate \;\approx \; \frac {2}{\expect _\pi [D]} \;=\; \frac {2(1-\kappa )}{\delrate \,\kappa \,\evoltime }. \end {equation}

B.4 Maraschino: Cherry-Counts for TKF92

We earlier introduced Maraschino as a TKF92 generalization of CherryML in Section A.2.4. We now develop it in full.

The CherryML composite likelihood replaces the joint phylogenetic likelihood with a product of pairwise sibling (“cherry”) likelihoods (40), exploiting the fact that for a fixed alignment, the alignment-substitution counts are sufficient statistics for the substitution rate parameters. This reduces the training data—which can be many gigabytes of alignments—to a fixed-shape counts tensor whose size depends only on the alphabet $\alphabet $ and the number of discretized cherry-time bins, and is independent of the number of training alignments. The resulting compact tensor fits in GPU memory and may be optimized by standard autograd-based gradient methods.

Maraschino extends CherryML to TKF92 by additionally including the alignment transition counts as sufficient statistics for the indel rates and fragment-extension probability. Concretely, for each pair of consecutive (non-empty) alignment columns we record the source column type ($\xstate \in \{\sta ,\mat ,\ins ,\del \}$), the destination column type ($\ystate \in \{\mat ,\ins ,\del ,\fin \}$), and the ancestor/descendant characters in each column. We refer to the resulting tensor of counts—binned by discretized pairwise divergence time and aggregated over training cherries—as the cherry-count tensor, and the procedure for training TKF92 from it as Maraschino.

The Maraschino composite log-likelihood for TKF92 is the sum, over divergence-time bins $b$ and adjacency types $\xstate \to \ystate $, of the cherry-count multiplied by the model log-probability of that adjacency at that bin’s representative time: \begin {multline} \mathcal {L}_{\text {cherry}}^{\mathrm {TKF92}}(\insrate ,\delrate ,\ext ,\exch ,\eqm ) = \sum _{b=1}^{n_\tau }\Bigg [ \sum _{\xstate ,\ystate } n^{(b)}_{\xstate \to \ystate }\, \log \tkftrans '_{\xstate \to \ystate }(\insrate ,\delrate ,\bar \evoltime _b,\ext ) \\ + \sum _{\anctok ,\destok } n^{\mat ,(b)}_{\anctok ,\destok } \log \big [ \eqm _\anctok \exp (\revsub \bar \evoltime _b)_{\anctok \destok } \big ] + \sum _\destok n^{\ins ,(b)}_\destok \log \eqm _\destok + \sum _\anctok n^{\del ,(b)}_\anctok \log \eqm _\anctok \Bigg ] \label {eq:maraschino-tkf92-ll} \end {multline} where $n^{(b)}_{\xstate \to \ystate }$ are the binned transition counts, $n^{\mat ,(b)}_{\anctok ,\destok }, n^{\ins ,(b)}_\destok , n^{\del ,(b)}_\anctok $ are the binned per-state emission counts, $\tkftrans '$ is the TKF92 Pair HMM transition matrix from Section A.2.2, and $\bar \evoltime _b$ is the representative time of bin $b$ (typically the geometric mean of the bin endpoints). The first term scores indel/fragment behaviour; the remaining three score the substitution model. Boundary contributions ($\sta \to \cdot $, $\cdot \to \fin $, $\sta \to \fin $ for empty alignments) are scored against the corresponding $\sta $-row and $\fin $-column entries of $\tkftrans '$.

Why this is a sufficient-statistic compression. For fixed alignment and fixed discretized cherry times, the entire collection of training cherries enters equation (??) only through the binned counts. In particular, two cherry collections with identical counts yield identical $\mathcal {L}_{\text {cherry}}^{\mathrm {TKF92}}$, identical gradients, and identical maximum-likelihood estimates. The counts tensor is therefore a sufficient statistic for the TKF92 parameters under the cherry-count composite likelihood.

Sharded count accumulation. For a corpus such as Pfam-Full, cherries are extracted from each MSA in parallel and contribute additively to the global counts tensor. A simple map-reduce over Pfam clans accumulates the counts in constant memory per worker; the global tensor is the elementwise sum of per-shard tensors and is independent of the order of accumulation.

B.4.1 Optimizing Maraschino: gradient methods vs. inner EM

The cherry-count log-likelihood $\mathcal {L}_{\text {cherry}}^{\mathrm {TKF92}}$ is a smooth function of the unconstrained parameters $(\log \insrate , \log \delrate , \log \ext , \log \exch , \log \eqm )$ (with the simplex and exchangeability constraints handled by softmax / log-Cholesky reparameterisation). Two distinct optimizers apply.

Maraschino-as-autograd. The objective is differentiable, so Adam, L-BFGS, and similar optimizers may be applied directly via autograd. Each step requires one matrix exponential per cherry-time bin, and a forward+backward pass through equation (??); both steps are highly parallel on GPU.

Maraschino-around-EM. Alternatively, the full-batch EM iterates of Section A.2.3 apply, with the cherry-count tensor playing the role of the E-step output. Concretely:

Because every M-step in Section A.2.3 is closed-form and linear in the sufficient statistics, the inner EM iterates are exact ascent on $\mathcal {L}_{\text {cherry}}^{\mathrm {TKF92}}$, with no gradient-step or learning-rate hyperparameter. In practice, EM converges in ${\sim }10$–$30$ iterations on protein cherry-count tensors and is competitive with Adam on the same objective.

Practical comparison. Adam handles arbitrary regularizers and is straightforward to combine with stochastic minibatching over time-bins or alphabet slices. Inner EM does not require a learning rate, has monotonic convergence guarantees, and produces the natural-gradient direction “for free.” When the cherry-count tensor fits in memory (it does, even for Pfam-Full at $|\alphabet |=20$ and $n_\tau =20$), inner-EM Maraschino is the recommended default.

B.4.2 EM-around-Maraschino: mixtures of TKF92

A mixture of $K$ TKF92 components, each with its own indel rates, fragment-extension probability, and substitution model, captures heterogeneity within and across protein families. We fit such a mixture by an outer EM loop with Maraschino as the inner optimizer of each component:

Algorithm 1: EM-around-Maraschino for a mixture of $K$ TKF92 components.

1:   Initialize component parameters $\theta ^{(0)}_1,\ldots ,\theta ^{(0)}_K$ and mixture weights $\pi ^{(0)}_1,\ldots ,\pi ^{(0)}_K$.
2:   for $t = 0,1,\ldots $ do
3:    E-step. For each cherry $c$ in the training set, compute the responsibility $\gamma ^{(t)}_{c,k} \propto \pi ^{(t)}_k\,P(c\mid \theta ^{(t)}_k)$ (the per-component cherry-count likelihood normalised across components). Each $P(c\mid \theta _k)$ is a single forward pass through the cherry-count likelihood at the cherry’s counts subtensor.
4:    Component-weighted count accumulation. For each component $k$, accumulate the responsibility-weighted cherry-count tensor $C_k^{(t+1)} = \sum _c \gamma ^{(t)}_{c,k}\,C_c$.
5:    Component M-step (Maraschino). For each component $k$, run Maraschino (Section B.4.1.0) on $C_k^{(t+1)}$ to obtain $\theta ^{(t+1)}_k$.
6:    Mixture-weight M-step. $\pi ^{(t+1)}_k \propto \sum _c \gamma ^{(t)}_{c,k}$.
7:   end for

The outer EM is monotone in the mixture log-likelihood $\sum _c \log \sum _k \pi _k P(c\mid \theta _k)$ because both the responsibilities (E-step) and the per-component Maraschino solves (M-step) are exact ascent on their respective Q-functions.

The decisive feature is that the outer M-step does not require re-extracting cherries or recomputing alignments per iteration: the per-component cherry-count tensor $C_k^{(t+1)}$ is a responsibility-weighted average of the same per-cherry tensors that were extracted once and for all at the start of training.

B.4.3 EM-around-CherryML: mixtures over site classes

A complementary class of mixture model places the latent variable on MSA columns rather than on whole cherries: each MSA column $c$ draws an unobserved site class $z_c \sim \catdist (\eqm _1,\ldots ,\eqm _C)$ from a shared per-class substitution model $\subproc (\exch ^{(k)}, \eqm ^{(k)})$, and every cherry in the family that has a residue pair at column $c$ evolves its match emission under the same class. Because no indel rates vary across mixture components, this is strictly a substitution-only mixture, and the corresponding fitter is CherryML rather than Maraschino: the cherry-count tensor of Section B.4 contributes only the matched substitution observations $(\anctok , \destok , t_p)$ and the indel-free composite likelihood reduces to that of (40).

Column-shared responsibilities. The defining observation is that within a single MSA, all cherries that have a match at column $c$ inherit the same latent class $z_c$. This couples cherries across the same column: the column-level posterior \[ \gamma _c(k) \;\propto \; \eqm _k \prod _{p \,:\, \mat \text { at } c} P\!\bigl (\anctok _{p,c}, \destok _{p,c} \mid k, t_p\bigr ), \qquad P(\anctok , \destok \mid k, t) = \eqm ^{(k)}_\anctok \exp (\revsub ^{(k)} t)_{\anctok \destok }, \] multiplies the per-cherry match-emission likelihoods over all cherries $p$ that contribute a residue pair to column $c$. Cherries that have a gap at $c$ contribute no factor for that column. The resulting composite log-likelihood \begin {equation} \label {eq:em-cherryml-ll} \mathcal {L}_{\text {site-mix}} \;=\; \sum _m \sum _{c \,\in \, \text {cols}(m)} \log \sum _{k=1}^{C} \eqm _k \prod _{p \,:\, \mat \text { at } c} \eqm ^{(k)}_{\anctok _{p,c}} \exp (\revsub ^{(k)} t_p)_{\anctok _{p,c}\,\destok _{p,c}} \end {equation} is a sum, over MSAs $m$ and over columns $c$ within each MSA, of a log-mixture in which cherries share the column’s class. This is in direct contrast to EM-around-Maraschino (Section B.4.2), whose responsibilities $\gamma ^{(t)}_{c,k}$ are per cherry: there each cherry is assigned to a single component independently of the columns it spans.

Outer / inner EM. We fit (B.55) by alternating an MSA-wide column-responsibility update with a per-class bridge-expectation M-step. At iteration $t$, the column responsibilities are \begin {equation} \label {eq:em-cherryml-resp} \gamma ^{(t)}_{m,c}(k) \;\propto \; \eqm ^{(t)}_k \prod _{p\,:\, \mat \text { at } c} \eqm ^{(t,k)}_{\anctok _{p,c}}\, \exp \!\bigl (\revsub ^{(t,k)}\, t_p\bigr )_{\anctok _{p,c}\,\destok _{p,c}}, \end {equation} normalised across $k$ for each $(m, c)$, and the responsibility-weighted bridge-expectation sufficient statistics for class $k$ are \begin {equation} \label {eq:em-cherryml-WUk} \begin {aligned} \hat W_i^{(t+1,k)} &= \sum _{m,c} \gamma ^{(t)}_{m,c}(k) \sum _{p\,:\, \mat \text { at } c} \emcount ^W_i\!\bigl (\anctok _{p,c}, \destok _{p,c}, t_p\bigr ), \\ \hat U_{ij}^{(t+1,k)} &= \sum _{m,c} \gamma ^{(t)}_{m,c}(k) \sum _{p\,:\, \mat \text { at } c} \emcount ^U_{ij}\!\bigl (\anctok _{p,c}, \destok _{p,c}, t_p\bigr ), \\ \hat V_i^{(t+1,k)} &= \sum _{m,c} \gamma ^{(t)}_{m,c}(k) \sum _{p\,:\, \mat \text { at } c} \delta \bigl (\anctok _{p,c} = i\bigr ), \end {aligned} \end {equation} where $\emcount ^W_i, \emcount ^U_{ij}$ are the standard endpoint-conditioned bridge expectations (Section A.1.2, eqs. (??)–(??)) at the class-$k$ rate matrix $\revsub ^{(t,k)}$. The full outer iteration is then Algorithm 2.

Algorithm 2: EM-around-CherryML for $C$ site classes.

1:   Initialize per-class GTR parameters $\theta ^{(0)}_k$ and class weights $\eqm ^{(0)}_k$.
2:   for $t = 0, 1, \ldots $ do
3:    Outer E-step. Compute column responsibilities $\gamma ^{(t)}_{m,c}(k)$ via (B.56).
4:    Sufficient-statistic accumulation. Compute $(\hat W^{(t+1,k)}, \hat U^{(t+1,k)}, \hat V^{(t+1,k)})$ via (B.57).
5:    Per-class GTR inner EM. For each class $k$, run the closed-form GTR coordinate-ascent M-step of Section A.1.8 on $(\hat W^{(t+1,k)}, \hat U^{(t+1,k)}, \hat V^{(t+1,k)})$ to obtain $\theta ^{(t+1)}_k$. Because the responsibility weights are fixed, this step is a standard CherryML rate fit on a single GTR model with weighted observations.
6:    Class-weight update. $\eqm ^{(t+1)}_k \propto \sum _{m,c} \gamma ^{(t)}_{m,c}(k)$.
7:   end for

The outer iteration is monotone in $\mathcal {L}_{\text {site-mix}}$: the column responsibilities are the exact posterior over $z_c$ and the per-class GTR M-step is exact ascent on its (responsibility-weighted) Q-function. The inner GTR EM is itself iterative because the GTR M-step couples $(\exch ^{(k)}, \eqm ^{(k)})$ nonlinearly (Section A.1.8, item 2 of the M-step list); a few inner sweeps per outer iteration suffice in practice.

Why CherryML, not Maraschino. No indel rates appear in (B.55): the model assumes the alignment is given and that all alignment columns share a single TKF92 indel process whose parameters are held fixed (or estimated elsewhere, e.g. by Maraschino on the same cherries). Mixing only the substitution model means the $(\insrate , \delrate , \ext )$-related row-normalisation and fragment-extension responsibility issues of TKF92 do not arise; the relevant sufficient statistics are exactly those of CherryML. The only structural change relative to plain CherryML is the per-MSA, per-column responsibility weighting in (B.57), which couples cherries that share columns of the same MSA.

Relation to per-class GTR mixtures elsewhere. The per-class GTR M-step of equation (B.57) is exactly the reversible-mixture update of Appendix B.1.8; the only difference between this appendix and Algorithm 2 is the provenance of the responsibilities (here: cross-cherry column sharing within an MSA; in the appendix: any user-supplied per-site posterior). Site-mixture parameters fitted by EM-around-CherryML can therefore be plugged into mixture-aware downstream methods — including Maraschino with a fixed indel-process backbone — without further re-derivation.

B.4.4 Expected sufficient statistics as fast custom VJPs

The Maraschino objective (equation (??)) is concretely a sum of log-probabilities, each of which has a closed-form gradient via the score identity and the BDI / bridge-expectation sufficient statistics. Specifically, the $(\insrate ,\delrate )$-gradient of $\log \tkftrans '_{\xstate \to \ystate }(\insrate ,\delrate ,t,\ext )$ is given exactly by the score-derivative tables of Sections A.4.2.0–A.4.3.0, and the $(\exch ,\eqm )$-gradient of $\log [\eqm _\anctok \exp (\revsub t)_{\anctok \destok }]$ is given by the bridge-expectation dwell times and jump counts (equations (??)–(??)).

For training pipelines built around stochastic-gradient optimizers such as Adam, this observation enables a substantial speedup. Rather than relying on autograd to back-propagate through a matrix exponential and an eigendecomposition, we register a custom vector-Jacobian product (VJP) that returns the score-derivative-and-bridge-expectation closed forms directly:

The forward and backward costs are both $O(|\alphabet |^2)$ per bin per emission state for the substitution part and $O(1)$ for the indel part, matching the cost of evaluating the closed-form M-step itself.

This pattern generalises beyond Maraschino. Any model whose M-step is closed-form linear in expected sufficient statistics admits a corresponding fast custom VJP that exposes those expectations as the gradient of the observed log-likelihood (this is, again, the score identity (A.5)). Concretely, this enables Adam and other gradient-based optimizers to train TKF91, TKF92, and mixture-of-TKF92 models at a per-step cost matching exact EM, while retaining Adam’s flexibility to incorporate arbitrary differentiable regularizers (e.g., weight decay on the BDI rates, KL-divergence priors on the GTR exchangeability), to interleave updates with non-EM-amenable parameters (e.g., the per-component mixture weights of Section B.4.2 when fit jointly with neural-network embeddings), and to work with the mini-batch SVI-BW pseudocount stream of Section B.1.13.

B.5 Selected Inference Algorithms for TKF92

This section describes four inference algorithms that operate on TKF91 / TKF92 models and on phylogenetic trees: the Fast Statistical Alignment (FSA) framework with a Newton-step time-optimization on the per-cherry expected log-likelihood (Section C.2.1); the Beam Search Ancestral Reconstruction (BeamASR) algorithm (Section C.2.2); the Variational Ancestral Reconstruction (VarAnc) algorithm that operates on a fixed MSA with known gap structure (Section B.5.3); and its stochastic-variational training extension svi-VarAnc (Section B.5.4).

B.5.1 Fast Statistical Alignment (FSA)

Given a set of sequences and a phylogenetic tree with branch-specific TKF91 or TKF92 pair HMMs, we construct a multiple sequence alignment using the sequence annealing approach of (6), to which we refer the reader for a full description of the algorithm.

Briefly, the method proceeds as follows. For each pair of sequences $(x,y)$ in a selected subset (either all $\binom {N}{2}$ pairs or an $O(N \log N)$ Erdős–Rényi sample), we compute pairwise residue alignment posteriors $P(x_i \sim y_j)$ by running the Forward-Backward algorithm on the pair HMM at an optimized evolutionary time $\hat {\tau }$.

Per-pair time optimization (NR step). The time $\hat {\tau }$ is found by Newton–Raphson optimization of the expected log-likelihood (the “NR step”): \begin {align} \hat {\tau } &= \operatorname *{argmax}_\tau \; \mathbb {E}_{P(\pi | x,y,\tau _0)} \bigl [ \log P(x,y,\pi \mid \tau ) \bigr ] \label {eq:fsa-nr} \end {align}

where $\pi $ ranges over alignment paths and $\tau _0$ is an initial estimate. This expectation is computed from Forward–Backward expected counts at $\tau _0$, and—because the TKF92 expected log-likelihood is quadratic-like in $\tau $ in a neighbourhood of the maximum, with analytic derivatives via the BDI score identities of Section A.1.3—the Newton iteration typically converges in 3–5 steps. This time-maximisation differs from the original FSA approach of (6), which attempts to optimize all model parameters via unregularized EM for every pair, and consequently must terminate the EM recursion early to avoid instability. Restricting the optimization to the per-pair time $\tau $, while keeping the model parameters fixed at the population estimate, is both cheaper and more stable.

Sequence annealing. The pairwise posteriors are then assembled into a multiple alignment by the greedy sequence annealing procedure of (6), which iteratively merges alignment columns to maximize a sum-of-pairs posterior objective.

B.5.2 Beam Search Ancestral Sequence Reconstruction (BeamASR)

We now describe an alternative progressive reconstruction method that finds the maximum-likelihood ancestral sequence at each internal node by beam search, without materializing the full composite automaton.

At each internal node $v$ with children $l,r$ and observed descendant sequences $c_l, c_r$, we seek \begin {align} \hat {a}_v &= \operatorname *{argmax}_{a} \bigl [ \log P(a, c_l \mid B_l) + \log P(a, c_r \mid B_r) - \log P(a \mid R) \bigr ] \label {eq:beam-ancestor} \end {align}

where $P(a, c_k \mid B_k)$ is the pair HMM forward probability on branch $k$ and $P(a \mid R)$ is the singlet probability under the root generator, subtracted to avoid double-counting the prior on $a$.

Incremental forward profiles Since the branches are conditionally independent given the ancestor, we can evaluate (??) by maintaining incremental forward profiles: for each branch $k$, a 1D forward table $F_k[i, q]$ giving the log-probability that descendant positions $1,\ldots ,i$ have been emitted and the branch machine is in state $q$, given ancestor positions $1,\ldots ,j$ processed so far.

Each ancestor character extends both profiles independently in $O(L_k)$ time per branch, where $L_k = |c_k|$.

Beam search The ancestor sequence $\hat {a}_v$ is built left-to-right by beam search. At each position $j$, the beam maintains $B$ candidate partial ancestors. For each candidate and each alphabet character $\sigma $:

The top $B$ extensions (by cumulative score) are retained. Total cost per node is $O(K \cdot B \cdot A \cdot (L_l + L_r))$ where $K = |\hat {a}_v|$ and $A$ is the alphabet size.

Insertion phase via associative scan The insertion recurrence within each branch profile has the form \begin {align} x_{i+1} &= \operatorname {logsumexp}\bigl (A_{II} \, x_i,\; b_i\bigr ) + e_i \label {eq:insert-recurrence} \end {align}

where $A_{II}$ is the I-to-I log-transition submatrix, $b_i$ collects transitions into insertion states from M and D, and $e_i$ is the emission score. This is a log-semiring affine recurrence, parallelizable via an associative scan with operator \[ (A_1, b_1) \oplus (A_2, b_2) = (A_2 \otimes A_1,\; \operatorname {logsumexp}(A_2 \, b_1,\; b_2)) \] where $\otimes $ denotes log-semiring matrix multiplication. This reduces the insertion phase from $O(L)$ sequential depth to $O(\log L)$.

Supported model types The beam search interface is generic over the pair HMM used on each branch. For the present (TKF) paper we specialize to:

The same beam-search framework extends without algorithmic change to hierarchical TKF extensions in which each branch carries a larger latent-state pair HMM (30); we do not pursue that direction here.

B.5.3 Variational Ancestral Reconstruction (VarAnc)

Given an MSA with known gap structure—i.e., for each internal node $v$ and MSA column $c$, we know whether $v$ is present (ungapped) or absent (gapped)—but unknown ancestral residue identities, we optimize a variational lower bound (ELBO) on the marginal likelihood. Our approach is a “product of trees” structured variational approximation (28).

Model structure The joint distribution over observed leaf sequences $\mathbf {y}$ and ancestral sequences $\mathbf {a}$ factorises over tree edges. With the alignment structure fixed, the log-likelihood decomposes into contributions from consecutive alignment-column pairs along each edge $e = (u,v)$: \begin {align} \log P(\mathbf {y}, \mathbf {a}) &= \sum _{e \in \mathrm {edges}} \sum _{k=1}^{K_e} \log w_e\!\bigl (\mathrm {type}_{k-1},\, \mathrm {type}_k,\, \mathrm {ctx}_{k-1},\, \mathrm {ctx}_k\bigr ) \label {eq:lik-edges-tkf} \end {align}

where $w_e$ is the TKF92 pair-HMM transition weight on edge $e$ (depending on the previous and current column types and context characters, plus the branch length). The first and last columns contribute start ($\sta $) and end ($\fin $) transition factors.

The key computational challenge is that even the order-0 TKF92 transitions couple adjacent alignment columns through the fragment-extension self-loops: the pair-HMM weight at column $c$ depends on the ancestral and descendant characters at both column $c$ and its predecessor. This makes the full graphical model a tree (phylogeny) $\times $ chain (sequence) grid with undirected cycles, and exact inference is intractable.

Product-of-trees variational distribution Following (28), we factorize the variational posterior over MSA columns while retaining the full tree joint at each column: \begin {align} q(\mathbf {a}) &= \prod _{c=1}^{L} q_c\!\bigl (\mathbf {h}_c\bigr ) \label {eq:var-q-pot-tkf} \end {align}

where $\mathbf {h}_c = \{a_{v,c} : v \in \mathrm {ungapped}(c)\}$ denotes the set of all ancestral characters at MSA column $c$, and each $q_c$ is a tree-structured distribution over the characters at the internal nodes that are ungapped at column $c$. Leaf characters are observed (clamped).

This approximation decouples columns in $q$ but preserves the full parent-child correlation structure within each column. In particular, the pairwise joint marginal $q_c^{(u,v)}(a_u, a_v)$ on each tree edge is available exactly from the Felsenstein peeling/unpeeling pass used to represent $q_c$ (Section B.5.3.0).

Free energy and ELBO The evidence lower bound is \begin {align} \mathcal {L}(q) &= \mathbb {E}_q\bigl [\log P(\mathbf {y}, \mathbf {a})\bigr ] + H(q) \;\geq \; \log P(\mathbf {y}) \label {eq:elbo-tkf} \end {align}

where $H(q) = -\sum _c \sum _{\mathbf {h}_c} q_c(\mathbf {h}_c) \log q_c(\mathbf {h}_c)$ is the entropy of the product-of-trees distribution. Since each $q_c$ is tree-structured, its entropy decomposes as \begin {align} H(q_c) &= \sum _{v \in \mathrm {ungapped}(c)} H\!\bigl (q_c^{(v)}\bigr ) - \sum _{e = (u,v)} \mathrm {MI}_c(u, v) \label {eq:tree-entropy-tkf} \end {align}

where $H(q_c^{(v)})$ is the marginal entropy at node $v$ and $\mathrm {MI}_c(u,v)$ is the mutual information between parent $u$ and child $v$ at column $c$, both computable from the Felsenstein marginals.

Potentials from neighboring columns The inter-column coupling enters through the TKF92 pair-HMM transitions. For each edge $e = (u,v)$ and MSA column $c$ where $e$ has an event of type $\tau _c$, let $c^-$ denote the predecessor column (the previous column where $e$ had an event) and $c^+$ the successor column. The potential at column $c$ for edge $e$ receives two contributions:

As-child term (from $c^-$). The transition from column $c^-$ to $c$ on edge $e$ depends on the characters at both columns. Using the pairwise marginal from $q_{c^-}$: \begin {align} \log \phi _e^{\mathrm {child}}(a_{u,c}, a_{v,c}) &= \sum _{a', b'} q_{c^-}^{(u,v)}(a', b')\; \log w_e(\tau _{c^-\!}, \tau _c, a', b', a_{u,c}, a_{v,c}) \label {eq:pot-child-tkf} \end {align}

where the sum over $(a', b')$ uses the joint pairwise marginal $q_{c^-}^{(u,v)}(a', b')$—not the product of independent marginals. This is the key advantage over mean-field: the within-tree parent-child correlation at the predecessor column is preserved exactly.

As-parent term (from $c^+$). Symmetrically, column $c$ acts as the predecessor for column $c^+$: \begin {align} \log \phi _e^{\mathrm {parent}}(a_{u,c}, a_{v,c}) &= \sum _{a'', b''} q_{c^+}^{(u,v)}(a'', b'')\; \log w_e(\tau _c, \tau _{c^+\!}, a_{u,c}, a_{v,c}, a'', b'') \label {eq:pot-parent-tkf} \end {align}

For insert transitions (only the descendant is present at $c$), the potential reduces to a per-node function $\psi _v(a_{v,c})$. For delete transitions (only the ancestor is present), it becomes $\psi _u(a_{u,c})$. For match transitions, it contributes a per-edge potential $\phi _e(a_{u,c}, a_{v,c})$. The start and end transitions contribute analogous per-node or per-edge terms.

Felsenstein coordinate ascent Each coordinate ascent step updates $q_c$ for a single MSA column $c$, holding all other columns fixed. We accumulate, for each edge $e$ and node $v$ in the tree at column $c$:

The optimal $q_c$, given the potentials, is the Gibbs distribution on the tree at column $c$: \begin {align} q_c(\mathbf {h}_c) &\propto \prod _v \psi _v(a_v) \prod _{e=(u,v)} \phi _e(a_u, a_v) \prod _{\ell \in \mathrm {leaves}} \delta (a_\ell = y_\ell ) \label {eq:q-gibbs-tkf} \end {align}

Since this is a tree-structured MRF, the normalising constant and all node and edge marginals can be computed exactly by Felsenstein peeling (postorder) and unpeeling (preorder) in $O(|E| \cdot |\alphabet |^2)$ time.

Peeling (postorder). For each node $v$ in postorder, compute the conditional likelihood: \begin {align} \mathrm {CL}_v(a) &= \psi _v(a) \prod _{\mathrm {children}\; c} \Bigl [\sum _{a_c} \phi _{(v,c)}(a, a_c)\; \mathrm {CL}_c(a_c)\Bigr ] \label {eq:peel-tkf} \end {align}

with $\mathrm {CL}_\ell (a) = \delta (a = y_\ell )$ for observed leaves. The log-partition function is $\log Z_c = \log \sum _a \pi (a)\, \mathrm {CL}_{\mathrm {root}}(a)$.

Unpeeling (preorder). Propagate top-down to obtain the posterior marginal at each node: \begin {align} q_c^{(v)}(a) &\propto \mathrm {CL}_v(a) \cdot \mathrm {msg}_{\mathrm {parent} \to v}(a) \label {eq:unpeel-marginal-tkf} \end {align}

and the pairwise marginal on each edge: \begin {align} q_c^{(u,v)}(a_u, a_v) &\propto \mathrm {msg}_{\mathrm {above}\,u}(a_u) \cdot \phi _{(u,v)}(a_u, a_v) \cdot \mathrm {CL}_v(a_v) \label {eq:unpeel-pair-tkf} \end {align}

where the “message from above $u$” combines the top-down message to $u$ with $u$’s conditional likelihood excluding child $v$.

Sweep. One iteration sweeps through all MSA columns $c = 1, \ldots , L$: for each column, recompute the potentials from the current neighbor marginals, run peeling/unpeeling, and store the updated node and edge marginals.

Properties The product-of-trees approximation enjoys the same monotonic convergence guarantee as mean-field coordinate ascent (each column update minimises the free energy in its coordinate), with the additional guarantee that the ELBO is at least as tight as the fully-factored mean-field bound. This follows because the product-of-trees family contains the mean-field family as a special case (where each $q_c$ is itself fully factored).

B.5.4 Stochastic Variational VarAnc (svi-VarAnc)

The VarAnc algorithm of Section B.5.3 computes a variational posterior over ancestral residues for a fixed TKF92 model $\theta $ on a single fixed-MSA family. We now extend it to a training algorithm that updates $\theta $ on a corpus of tree-structured MSAs. The construction parallels SVI-BW (Section B.1.13) but uses an MSA-conditioned VarAnc E-step in place of the per-pair Pair-HMM Forward–Backward.

Outer EM. The outer iteration is a stochastic-EM loop over MSAs. At each iteration:

Comparison to SVI-BW. The two algorithms differ in their E-step and converge to different fixed points:

For pairwise data (a tree of two leaves), the two algorithms coincide. For deep trees with many siblings, svi-VarAnc tightens the indel-rate posterior considerably by coupling sibling pairs through the tree. ELBO monitoring, mini-batch construction, and warm-start initialisation from cherry-trained TKF92 parameters follow the SVI-BW recipe of Section B.1.13 verbatim, replacing the Pair-HMM Forward–Backward E-step with the family-level VarAnc sweep of Section B.5.3.0.

B.6 Mixture-of-trees variational ancestral presence/absence

This appendix derives a variational lower bound on the indel log-likelihood of an MSA under the TKF92 conditional WFST $\tkfwfst '(\insrate ,\delrate ,\evoltime ,\ext )$ of Section A.3 when applied independently on each branch of a phylogenetic tree, combined with a TKF stationary HMM at the root. The bound is parameterised by a tree-structured graphical model on per-column ancestral presence/absence indicators with three irreversible states {NotYetInserted, Present, Deleted}. It is a proper lower bound on the full TKF92 marginal log-likelihood $\log p(\text {MSA} \mid \parsetree , \theta )$, with gap split into a standard $q$-dependent variational KL term and a $q$-independent “restriction” term arising because our latent space cannot represent ancestor residues that get inserted and then fully deleted on every leaf-bound lineage before reaching any leaf (Section B.6.2). For ancestral-presence inference at the observed columns the restriction term is irrelevant, since it cancels from the q-optimal posterior; for parameter learning it does not, and a tighter bound would require explicit modelling of those ghost-column histories. Special-case parameter settings recover an irreversible 3-state analogue of Felsenstein gap-CTMC reconstruction and Fitch parsimony.

B.6.1 Setting and approximation

Fix a rooted phylogenetic tree $\parsetree $ with internal-node set $\mathcal {I}$, leaf set $\mathcal {L}$, and a branch length $\branchlen _e > 0$ on each edge $e$. We are given a multiple-sequence alignment with $L$ columns whose rows are the leaf sequences of $\parsetree $; let $X^v_n \in \{0,1\}$ denote the observed presence indicator ($1$ if leaf $v \in \mathcal {L}$ has a residue in column $n$). The task is to infer per-internal-node presence indicators $\{X^v_n\}_{v \in \mathcal {I}, 1 \le n \le L}$.

We replace the full TKF92 generative model on the tree with the following per-branch conditional approximation: each branch $v \to w$ is taken to evolve its row according to the TKF92 conditional WFST $\tkfwfst '$ acting independently on the whole parent sequence, with no shared latent fragment structure across branches. A residue inserted by the WFST on one branch may therefore be partially deleted, character by character, on a later branch (unlike in the strict TKF92 process, where fragments are indivisible). The WFST $\tkfwfst '$ is itself a closed-form approximation to the finite-time transition probabilities of the General Geometric Indel (GGI) model the maximum-entropy CTMC with given expected per-site indel rates and expected inserted-fragment length and has been shown empirically to fit GGI Gillespie simulations (23) and pairwise-alignment data (30) well. Using it as the per-branch transition law of the tree-level model in this appendix is therefore a small additional approximation on top of the GGI approximation itself.

B.6.2 Restricted generative model

The model joint over leaf and internal-node presence patterns at the $L$ MSA columns is \begin {equation} \label {eq:joint} p(\text {MSA},\, \{X^v\}_{v \in \mathcal {I}} \mid \parsetree , \theta ) = p_{\text {singlet}}\!\big (X^{\text {root}} \mid \theta \big )\, \prod _{(v \to w) \in \parsetree } P\!\big (X^w \mid X^v, \branchlen _{vw}, \theta \big ), \end {equation} where $p_{\text {singlet}}$ is the TKF stationary distribution (Section B.6.3, below) and each $P(X^w \mid X^v, t)$ is the per-branch presence-pattern conditional defined by the WFST path log-likelihood (Section B.6.4). Marginalising (B.58) over internal patterns supported on the $L$ observed MSA columns gives a quantity we denote $\tilde {p}(\text {MSA} \mid \parsetree , \theta )$. The full TKF92 marginal $p(\text {MSA} \mid \parsetree , \theta )$ also sums over evolutionary histories with ancestral residues that get inserted and then fully deleted on every leaf-bound lineage before reaching any leaf ghost columns that leave no MSA trace and so are not representable as values of $X^v_n$ at observed columns $1, \ldots , L$. Since each ghost-column history adds a non-negative probability to $p$ that is absent from $\tilde {p}$, \begin {equation} \label {eq:tildep-le-p} \tilde {p}(\text {MSA} \mid \parsetree , \theta ) \;\leq \; p(\text {MSA} \mid \parsetree , \theta ). \end {equation} The gap $\log p - \log \tilde {p} \geq 0$ depends on the model parameters and the tree but not on the variational distribution $q$ introduced below. For the ancestral-presence reconstruction task fix $\theta $, optimise $q$ to recover internal-node marginals at the observed columns this gap is a $q$-independent constant and the restricted-model posterior at observed columns coincides with the full-model posterior conditioned on the same support, so all inferences below proceed unchanged.

B.6.3 Singlet HMM at the root

The TKF92 stationary HMM (Section A.2.2) emits a sequence of length $L_{\text {root}}$ with \begin {equation} \label {eq:singlet-tkf92} p_{\text {singlet}}(L_{\text {root}}) = \begin {cases} 1 - \kappa & L_{\text {root}} = 0, \\ \kappa \, p^{L_{\text {root}} - 1}\,(1-\ext )(1-\kappa ) & L_{\text {root}} \geq 1, \end {cases} \end {equation} where $\kappa = \insrate /\delrate $ and $p = \ext + (1-\ext )\kappa $. Identifying $L_{\text {root}}$ with the number of MSA columns at which the root is in state $\state {P}$ ($L_{\text {root}} = \sum _n \delta (X^{\text {root}}_n = 1)$) gives the root prior in the joint (B.58). The two-cases form arises because the TKF92 singlet HMM has distinct “from $\sta $” (probability $\kappa $ to emit, $1-\kappa $ to end) and “from emitting state” (probability $p$ to extend, $(1-\ext )(1-\kappa )$ to end) transition contexts.

For TKF91 ($\ext = 0$): $p = \kappa $ and the two cases collapse to the single form $p_{\text {singlet}}(L_{\text {root}}) = (1-\kappa )\kappa ^{L_{\text {root}}}$.

Note on a tempting shortcut. One might hope to avoid the singlet by interpreting the root as the output of a pseudo-branch with $\evoltime \to \infty $ and an all-absent pseudo-parent. For TKF91 this works the $t \to \infty $ limit of $\tkfwfst $ with an empty input does recover the singlet sequence-length distribution. For TKF92 it does not: the WFST’s $(1-\ext )$ factors are tied to the fragment-extension structure of the Pair HMM and the limiting output mass for length $L \geq 1$ is $\kappa \,p^{L-1}(1-\kappa )$ rather than the correct $\kappa \,p^{L-1}(1-\ext )(1-\kappa )$ i.e. the WFST does not normalise on an empty input under TKF92. We therefore include the singlet term explicitly.

B.6.4 Per-branch path log-likelihood

Consider a single branch $v \to w$ of length $\branchlen $. Reading the parent and child rows of the MSA column by column gives, for each $n \in \{1, \ldots , L\}$, an ordered pair $(X^v_n, X^w_n) \in \{0,1\}^2$. Map each pair to a WFST column-state $S_n \in \{\mat , \ins , \del , \mathsf {Ig}\}$ via \begin {equation} \label {eq:state-map} S_n = \begin {cases} \mat & \text {if } (X^v_n, X^w_n) = (1,1) \\ \ins & \text {if } (X^v_n, X^w_n) = (0,1) \\ \del & \text {if } (X^v_n, X^w_n) = (1,0) \\ \mathsf {Ig} & \text {if } (X^v_n, X^w_n) = (0,0), \end {cases} \end {equation} where the symbol $\mathsf {Ig}$ (“Ignore”) marks columns absent in both parent and child and so contributing no transition to the WFST path. Add boundary sentinels $S_0 = \sta $ and $S_{L+1} = \fin $. Stripping out columns with $S_n = \mathsf {Ig}$ collapses the sequence $(S_0, S_1, \ldots , S_L, S_{L+1})$ to a unique path through the WFST whose nontrivial transitions are between consecutive non-Ignore columns. The branch log-likelihood is the sum of the corresponding entries of $\log \tkfwfst '(\insrate , \delrate , \branchlen , \ext )$ from equation (??). Equivalently, summing over each column’s incoming transition, \begin {equation} \label {eq:branch-LL} \log P(X^w \mid X^v, \branchlen , \theta ) = \sum _{N=1}^{L+1} \delta (S_N \neq \mathsf {Ig}) \sum _{M=0}^{N-1} \delta (S_M \neq \mathsf {Ig}) \log \tkfwfst '_{S_M S_N} \prod _{K=M+1}^{N-1} \delta (S_K = \mathsf {Ig}), \end {equation} where $\theta = (\insrate , \delrate , \ext )$, $\delta (\cdot )$ is a Kronecker indicator, and the empty product (when $M+1 > N-1$) equals $1$. For each non-Ignore column $N$ the inner sum picks out the unique $M < N$ that is also non-Ignore with all columns strictly between $M$ and $N$ in state $\mathsf {Ig}$; that pair contributes $\log \tkfwfst '_{S_M S_N}$ and all other $M$ contribute zero.

B.6.5 Tree-structured variational family

We approximate the joint posterior $P(\{X^v_n\}_{v \in \mathcal {I}} \mid \{X^v_n\}_{v \in \mathcal {L}})$ by a product over MSA columns \begin {equation} \label {eq:q-factor} q(\{X^v_n\}_{v \in \mathcal {I}, 1 \le n \le L}) = \prod _{n=1}^{L} q_n(\{X^v_n\}_{v \in \mathcal {I}}), \end {equation} where each per-column distribution $q_n$ is a directed graphical model on the tree with node values in \begin {equation} \label {eq:Z-states} \mathcal {Z} = \{\textsc {NotYetInserted},\ \textsc {Present},\ \textsc {Deleted}\} \equiv \{\state {N}, \state {P}, \state {D}\}. \end {equation} The variable at the root is drawn from a free distribution $q_n^{\text {root}} \in \Delta ^{\mathcal {Z}}$; along each edge $v \to w$ the variable $Z^w_n \in \mathcal {Z}$ is drawn conditionally on $Z^v_n$ from a free distribution $q_n^{v \to w}(\cdot \mid \cdot )$. The 9-entry matrix $q_n^{v \to w}(z' \mid z)$ has 4 entries pinned to zero to enforce irreversibility: \begin {equation} \label {eq:irreversibility} q_n^{v \to w}(\state {D} \mid \state {N}) = q_n^{v \to w}(\state {N} \mid \state {P}) = q_n^{v \to w}(\state {N} \mid \state {D}) = q_n^{v \to w}(\state {P} \mid \state {D}) = 0. \end {equation} The two unconstrained rows ($\state {N}$ and $\state {P}$ parents) each have two surviving outcomes, contributing one free parameter per row; the $\state {D}$ row is deterministic ($q_n^{v \to w}(\state {D} \mid \state {D}) = 1$). This leaves two free parameters per (edge, column). Leaf-node states $Z^v_n$ for $v \in \mathcal {L}$ are clamped: if $X^v_n = 1$, then $Z^v_n = \state {P}$; if $X^v_n = 0$, then $Z^v_n \in \{\state {N}, \state {D}\}$ with the variational mass split between the two by $q_n$.

The presence indicator at internal node $v$ is recovered as \begin {equation} \label {eq:Z-to-X} X^v_n = \delta (Z^v_n = \state {P}), \end {equation} so that any realisation drawn from $q_n$ has the property that the nodes with $X^v_n = 1$ form a connected subtree of $\parsetree $ (or the empty set, which by hypothesis cannot arise once the leaf clamps are respected, since every column of the supplied MSA contains at least one Present leaf).

Why this support. Under the strict TKF92 process, each MSA column corresponds to a single insertion event in the history of the tree; the residue descends, and the set of Present nodes is necessarily a connected subtree containing the insertion point. A factorised mean-field $q$ with independent Bernoulli marginals at each $(v, n)$ would assign positive mass to disconnected Present-sets, which (i) are unreachable under the model, and (ii) would force the substitution likelihood to factor as a product of independent Felsenstein recursions on the two pieces corresponding to two separate insertions co-located in one MSA column, a different generative process. The 3-state irreversible $q$ rules out both pathologies by construction.

B.6.6 Expected log-likelihood under $q$

For a parent-child pair $(v, w)$ on the tree, define the per-column state probability under $q_n$: \begin {align} \label {eq:column-state-probs} P_q(S_n = \mat ) &= q_n(Z^v_n = \state {P},\, Z^w_n = \state {P}), \\ P_q(S_n = \del ) &= q_n(Z^v_n = \state {P},\, Z^w_n = \state {D}), \\ P_q(S_n = \ins ) &= q_n(Z^v_n = \state {N},\, Z^w_n = \state {P}), \\ P_q(S_n = \mathsf {Ig}) &= q_n(Z^v_n = \state {N},\, Z^w_n = \state {N}) + q_n(Z^v_n = \state {D},\, Z^w_n = \state {D}). \end {align}

The four other $(z, z')$ joint marginals are zero by equation (B.65). Each pairwise marginal $q_n(Z^v_n, Z^w_n)$ is computed by belief propagation (Section B.6.9).

The product factorisation (B.63) makes the per-branch expected log-likelihood factor across columns: \begin {equation} \label {eq:E-branch-LL} \expect _q\!\big [\log P(X^w \mid X^v, \branchlen , \theta )\big ] = \sum _{s, s'} \log \tkfwfst '_{s s'}\, W_{s s'}^{(v \to w)}, \end {equation} where $s$ ranges over $\smide \setminus \{\fin \}$ and $s'$ ranges over $\smide \setminus \{\sta \}$ (so $\sta $ enters only as a source and $\fin $ only as a destination, both via the deterministic boundary sentinels), and \begin {equation} \label {eq:W-pair} W_{s s'}^{(v \to w)} \;=\; \sum _{N=1}^{L+1} \sum _{M=0}^{N-1} P_q(S_M = s)\,P_q(S_N = s') \prod _{K=M+1}^{N-1} P_q(S_K = \mathsf {Ig}) \end {equation} is the expected number of times the path uses transition $s \to s'$ on this branch. Boundary terms have $P_q(S_0 = \sta ) = 1$ and $P_q(S_{L+1} = \fin ) = 1$ deterministically.

Root-prior contribution. Under the per-column factorisation (B.63), the expected singlet log-likelihood at the root (equation (B.60)) reduces to \begin {equation} \label {eq:E-singlet-root} \expect _q\!\big [\log p_{\text {singlet}}(X^{\text {root}})\big ] = \log p \cdot \sum _{n=1}^{L} q_n(X^{\text {root}}_n = 1) + \big (1 - P_q(L_{\text {root}}{=}0)\big )\,c_1 + P_q(L_{\text {root}}{=}0)\,\log (1 - \kappa ), \end {equation} with $c_1 = \log \!\big [\kappa (1-\ext )(1-\kappa )/p\big ]$ and $P_q(L_{\text {root}}{=}0) = \prod _{n=1}^{L} q_n(X^{\text {root}}_n = 0)$ the variational mass on an empty root sequence. For TKF91 ($\ext = 0$) the second and third terms collapse to a single $q$-independent constant $\log (1-\kappa )$ and (B.69) reduces to the textbook form $\log (1-\kappa ) + \log \kappa \cdot \sum _n q_n(X^{\text {root}}_n = 1)$. For TKF92 the $L_{\text {root}}{=}0$ special case must be retained; note that $P_q(L_{\text {root}}{=}0)$ is small but generically nonzero, since some columns may have positive variational mass on root being $\state {N}$ (the column was inserted on a strict subtree below the root).

B.6.7 ELBO

For any product-of-trees $q$ that gives joint marginals $q_n(Z^v_n, Z^w_n)$ at every (edge, column), \begin {equation} \label {eq:elbo} \boxed {\;\; \log p(\text {MSA} \mid \parsetree , \theta ) \;\geq \; \log \tilde {p}(\text {MSA} \mid \parsetree , \theta ) \;\geq \; \mathcal {L}[q] \;=\; \expect _q\!\big [\log p_{\text {singlet}}(X^{\text {root}})\big ] + \sum _{(v \to w) \in \parsetree } \expect _q\!\big [\log P(X^w \mid X^v, \branchlen _{vw}, \theta )\big ] + H\!\big [q(\{X^v\}_{v \in \mathcal {I}} \mid \text {MSA})\big ]. \;\;} \end {equation} The first two terms are given by (B.69) and (B.67) respectively; the third is the entropy of the leaf-conditioned variational distribution. The leftmost inequality is (B.59); the right inequality is the standard Jensen bound on the restricted joint (B.58).

Entropy decomposition. The variational $q$ is parameterised through a directed graphical model on the full tree (root + internal + leaf nodes) with edge conditionals $q^{v \to w}_n(\cdot \mid \cdot )$. Writing $q_{\text {joint}}$ for this joint and conditioning on the observed leaf states gives $q(\{X^v\}_{v \in \mathcal {I}} \mid \text {MSA}) = q_{\text {joint}}(\text {internals},\,\text {MSA})\,/\,Z_q$, where $Z_q = \sum _{\text {internals}} q_{\text {joint}}$ is the per-column belief-propagation partition function summed over columns. The leaf-conditioned entropy then decomposes as \begin {equation} \label {eq:entropy} H\!\big [q(\text {internals} \mid \text {MSA})\big ] = -\,\expect _q\!\big [\log q_{\text {joint}}\big ] + \log Z_q, \end {equation} where the first term factors over the directed graphical model: \begin {equation} \label {eq:cross-entropy-decomp} -\expect _q\!\big [\log q_{\text {joint}}\big ] = -\sum _{n=1}^{L}\!\!\sum _{z \in \mathcal {Z}}\! q_n^{\text {root}}(z)\,\log q_n^{\text {root}}(z) \;-\; \sum _{(v \to w) \in \parsetree } \sum _{n=1}^{L}\!\!\sum _{z, z' \in \mathcal {Z}}\! q_n(Z^v_n{=}z, Z^w_n{=}z')\,\log q_n^{v \to w}(z' \mid z). \end {equation} The first term is the per-column root entropy. Each per-edge term is a cross-entropy between the BP-derived joint marginal $q_n(Z^v_n, Z^w_n)$ and the prior edge conditional $q_n^{v \to w}$; it is not the parent-marginal-weighted entropy $\expect _{q_n(Z^v_n)}\!\big [H(q_n^{v \to w}(\cdot \mid Z^v_n))\big ]$, which only equals (B.72) when the leaves impose no evidence on the edge generically a different quantity once the BP-conditioning shifts pair marginals away from the prior factorisation. The $\log Z_q$ term in (B.71) comes out of the up-pass partition function and ensures the entropy is on the conditioned distribution rather than the joint.

Bound interpretation. Equation (B.70) is a proper lower bound on the full TKF92 marginal log-likelihood $\log p(\text {MSA})$, with the gap split into two non-negative pieces: \begin {equation} \label {eq:gap-decomp} \log p(\text {MSA}) - \mathcal {L}[q] \;=\; \underbrace {\big (\log p(\text {MSA}) - \log \tilde {p}(\text {MSA})\big )}_{\text {restriction gap}} \;+\; \underbrace {\text {KL}\!\big [q(\text {internals} \mid \text {MSA})\,\|\,\tilde {p}(\text {internals} \mid \text {MSA})\big ]}_{\text {variational gap}}. \end {equation} The variational gap is the standard Jensen slack on the restricted joint (B.58) and depends on $q$; the restriction gap is the contribution of ghost-column histories that our latent space cannot represent (Section B.6.2) and depends on the model parameters $\theta $ and the tree but not on $q$. For ancestral-presence reconstruction (fix $\theta $, optimise $q$) the restriction gap is a constant additive offset, so maximising (B.70) is equivalent to minimising the KL divergence from $q$ to the true model posterior on the support of observed-column internal-node configurations. For parameter learning over $\theta $ this equivalence breaks: the restriction gap shifts with $\theta $, so the bound is not tight up to a parameter-independent constant. A tighter bound would require augmenting $q$ to model ghost columns explicitly for instance via per-(branch, gap-region) latent counts of inserted-then-fully-deleted residues at the cost of additional machinery beyond the per-MSA-column factorisation used here. Equivalently and intuitively, one recovers the full marginal in the limit of inserting infinitely many additional “empty” columns into the alignment between (and around) every observed column, with all leaves clamped to absent at those columns and the variational $q$ free over their internal states; the present formulation simply truncates this padding to zero.

B.6.8 Stable computation: cumulant trick

The expected transition count $W_{s s'}^{(v \to w)}$ in (B.68) appears at first sight to require $O(L^2)$ work per (branch, state-pair). A standard prefix-sum reduces this to $O(L)$.

For brevity drop the branch superscript and write $P_n^{\mathsf {Ig}} = P_q(S_n = \mathsf {Ig})$, $P_n^{s} = P_q(S_n = s)$ for $s \in \{\sta , \mat , \del , \ins , \fin \}$ (with $P_0^{\sta } = 1$, $P_{L+1}^{\fin } = 1$, all other boundary values zero). Define the running log-Ignore mass \begin {equation} \label {eq:cumulant} C_N \;=\; \sum _{k=1}^{N} \log P_k^{\mathsf {Ig}}, \qquad C_0 = 0. \end {equation} Then $\prod _{K=M+1}^{N-1} P_K^{\mathsf {Ig}} = \exp (C_{N-1} - C_M)$, and (B.68) factorises as \begin {equation} \label {eq:W-prefix} W_{s s'} \;=\; \sum _{N=1}^{L+1} P_N^{s'}\,e^{C_{N-1}} \;\cdot \; \underbrace {\sum _{M=0}^{N-1} P_M^{s}\,e^{-C_M}}_{\text {prefix sum in }M}. \end {equation} Both inner factors are non-negative; the difference $C_{N-1} - C_M$ is non-positive (it is a sum of log-probabilities), so the geometric factor lies in $(0, 1]$ and (B.75) is numerically benign in float64 for any practical $L$. For very long alignments where individual exponents could underflow, one $\epsilon $-floors $P_n^{\mathsf {Ig}}$ before computing $C_n$; the floor is harmless because the product weights paths that traverse many Ignore columns, which contribute negligibly to the expectation.

The total work for the per-branch expected log-likelihood is $O(L \cdot |S|^2)$ with $|S| \le 5$ states, yielding $O(B L)$ overall for $B$ branches. This is fully vectorisable across columns and branches.

Why the inner sum is in probability space. Equation (B.62) for a fixed deterministic realisation of the indicators is purely a sum of $\log \tkfwfst '$ entries: no probabilities, no products, no exponentials. The probability-space arithmetic in (B.75) arises only when we marginalise over the variational distribution: each summand is a probability mass (product of independent column factors) times a real-valued $\log \tkfwfst '$ entry, and the sum collapses these weighted log-transitions into the expected transition count.

B.6.9 Belief propagation for pairwise marginals

The pairwise marginals $q_n(Z^v_n, Z^w_n)$ feeding equations (??) are computed by the standard Felsenstein up-down algorithm on $\parsetree $. For each column $n$:

The up and down passes are $O(|\mathcal {I}| \cdot |\mathcal {Z}|^2)$ per column and trivially vectorise across columns (same tree topology, different leaf clamps and different per-column edge conditionals). For typical protein alignments ($L \approx 360$ amino acids on average in UniProtKB/Swiss-Prot (49), with the long tail extending to a few thousand) this is unproblematic; for DNA MSAs (gene-scale regions in the few-kbp range (53), longer for syntenic blocks) the per-column independence keeps cost linear in $L$.

B.6.10 Special cases and scalability

Irreversible Felsenstein-3. Tie the per-(edge, column) variational conditionals $q_n^{v \to w}(\cdot \mid \cdot )$ across columns to a single per-edge transition matrix $\bar {q}^{v \to w}(\cdot \mid \cdot )$ equivalently, parameterise each $\bar {q}^{v \to w}$ as the transition matrix of an irreversible 3-state CTMC over $(\state {N}, \state {P}, \state {D})$ with rates fitted by maximum likelihood on the leaf data. The variational marginals $q_n$ are then exactly the Felsenstein posteriors of this CTMC at column $n$, and the ELBO reduces to the log-likelihood of the leaf clamps under that CTMC plus the (now column-independent) expected log of $\tkfwfst '$.

Fitch parsimony. The zero-temperature limit of the irreversible Felsenstein-3 posterior assigning each internal node deterministically to its most-likely state under the CTMC recovers the Fitch parsimony labelling of presence/absence under uniform priors and unit transition costs.

Scalability of stochastic ELBO optimisation. The free-parameter count scales as $2 \cdot |\mathcal {E}| \cdot L$ (plus a per-column root parameter) where $|\mathcal {E}|$ is the number of tree edges and $L$ is the number of MSA columns. The variational $q$ factorises over columns, but the ELBO is non-trivially coupled across columns through the prefix-of-Ignore factor in equation (B.75), which weights each $(M, N)$ transition contribution by the chain of intervening Ignore probabilities. Match-rich columns (where most lineages are clamped to $\state {P}$) have $P_n^{\mathsf {Ig}} \approx 0$ and so act as anchors that break the chain: contributions from $(M, N)$ pairs that straddle such an anchor column are damped to zero, decoupling the gradient on either side. For both protein-scale and DNA-scale alignments with non-trivial column coverage, ELBO optimisation should therefore scale linearly with total residue count without running into long-range coupling pathologies.

B.7 Theory: structural bias of the BP cumulant under column-factorised $q$

We give a precise derivation of the structural bias displayed empirically in Section ??. The bias is not a finite-data noise artifact it is a property of the column-factorised variational family used by SVI-VarAnc, and persists at any $q$ whose per-column marginals match the truth.

Setup. On a single branch $e$ of length $t_e$, the per-branch WFST is a chain HMM over the $5$ states $\{S, M, I, D, E\}$ with chain transition matrix $\mathbf {T}_e[s, s']$ given by the standard TKF91 / TKF92 closed forms (tkf92_wfst_T, eq. ??). The branch’s contribution to the data log-likelihood is $\sum _{n=1}^{L+1} \log T_e[\zeta _{n-1}, \zeta _n]$ where $\boldsymbol {\zeta } = (\zeta _0, \dots , \zeta _{L+1})$ is the chain trajectory across the $L+2$ columns (including boundaries). Conventionally $\zeta _n \in \{S, M, I, D, E, \mathrm {Ig}\}$ where $\mathrm {Ig}$ denotes a non-emitting "insertion-gap" column where the branch is in NYI/NYI or $D$/$D$ joint state and contributes nothing to the chain transitions; chain transitions are indexed by consecutive non-$\mathrm {Ig}$ columns.

The variational $q$ factorises across columns: $q(\boldsymbol {\zeta }) = \prod _{n=1}^L q_n(\zeta _n)$ with each $q_n$ a tree-structured distribution over edge presence/absence carrying the leaf clamps of column $n$. Define the per-column WFST-state probabilities at branch $e$: \begin {equation} P_n^{(e)}(s) = q_n(\text {branch state } e \text { is } s), \quad s \in \{M, I, D\}, \qquad P_n^{(e)}(\mathrm {Ig}) = q_n(\text {NYI},\text {NYI}) + q_n(D, D). \end {equation}

The cumulant formula and what it computes exactly. Section A.2.3 introduces the BP cumulant $\mathbf {W}_e \in \mathbb {R}^{5 \times 5}$ as the sufficient statistic the M-step ingests: \begin {equation} \label {eq:cumulant-W} W_e[s, s'] = \sum _{1 \leq M < N \leq L+1} P_M^{(e)}(s) \cdot P_N^{(e)}(s') \cdot \prod _{k = M+1}^{N-1} P_k^{(e)}(\mathrm {Ig}). \end {equation} Under the factorised $q$, $W_e[s, s']$ is the exact expected count of WFST chain transitions $s \to s'$ on branch $e$: \begin {align*} \expect _q\bigl [\#\{\text {chain transitions } s \to s'\}\bigr ] &= \sum _{M < N} q\bigl (\zeta _M = s, \zeta _N = s', \zeta _{M+1} = \cdots = \zeta _{N-1} = \mathrm {Ig}\bigr ) \\ &= \sum _{M < N} P_M(s)\,P_N(s')\,\prod _{k=M+1}^{N-1} P_k(\mathrm {Ig}) \;\;=\;\; W_e[s, s'], \end {align*}

where the second equality uses the column-factorisation of $q$. The sum over $(M, N)$ counts each chain transition once because of the explicit $\mathrm {Ig}$-padding constraint: any single trajectory realisation $\boldsymbol {\zeta }$ with non-$\mathrm {Ig}$ columns at indices $n_1 < n_2 < \cdots < n_K$ contributes weight 1 to exactly the adjacent-pair transitions $(n_0, n_1), (n_1, n_2), \dots , (n_K, n_{K+1})$ (with $n_0 = 0, n_{K+1} = L+1$); for any other $(M, N)$ pair, at least one intervening column is non-$\mathrm {Ig}$ and the indicator product is zero. This $O(L)$ prefix-sum implementation agrees with the $O(L^2)$ direct sum to machine precision ($6.7\times 10^{-16}$).

The structural bias. The factorisation assumption is not satisfied by the truth posterior $p^*(\boldsymbol {\zeta } \mid \text {data})$, even for TKF91 ($\ext = 0$). Although TKF91’s generative model is per-position i.i.d., the per-branch WFST chain has nonzero transition probabilities $T_e[D, D]$, $T_e[I, I]$, $T_e[I, D]$ and so on (closed-form expressions in (??) reduce, at $\ext = 0$, to the standard TKF91 Pair HMM weights). These chain memories propagate from data conditioning into a non-factorised joint posterior: \begin {equation} p^*\bigl (\zeta _M = s, \zeta _N = s', \zeta _{M+1..N-1} = \mathrm {Ig} \mid \text {data}\bigr ) \;\neq \; \prod _n p^*\bigl (\zeta _n \mid \text {data}\bigr ), \end {equation} even though the per-column truth marginal $p^*(\zeta _n \mid \text {data})$ is exactly recoverable by per-column Felsenstein BP. The bias of the cumulant against the truth-posterior chain expectation is the integrated difference between the factorised-q product of marginals and the truth-posterior joint: \begin {equation} \label {eq:cumulant-bias-correct} \boxed { W_e[s, s'] - \expect _{p^*}\!\bigl [\#\{s \to s'\} \bigm | \text {data}\bigr ] = \sum _{M < N}\!\Bigl [ p^*_M(s)\,p^*_N(s')\! \prod _{k=M+1}^{N-1}\! p^*_k(\mathrm {Ig}) \;-\; p^*\bigl (\zeta _M=s, \zeta _N=s', \zeta _{M+1..N-1}=\mathrm {Ig} \bigm | \text {data}\bigr ) \Bigr ]. } \end {equation} The bracketed term is the connected $(N-M+1)$-way correlation of the truth posterior over the columns $\{M, M+1, \dots , N\}$; it vanishes only when the truth posterior factorises across these columns. For TKF91, $T_e[D,D] > 0$ and $T_e[I,I] > 0$ at $\ext = 0$ alone suffice to make the right-hand side of (B.79) nonzero on the $(s, s') = (I, I)$ and $(D, D)$ entries: an opened gap is positively correlated with continued gap by the chain transition kernel itself, even before any explicit fragment-extension parameter is invoked.

Entropy reward exacerbates the bias on rare entries. The ELBO objective decomposes as $\mathrm {ELBO} = \expect _q[\log p(z, \text {data}; \theta )] - \expect _q[\log q]$ with the entropy term $-\expect _q[\log q]$ rewarding diffuse $q$. At ELBO stationarity, $q$’s per-column marginals balance the data-fit term against the entropy term; for tied or weakly-fitted logits, the resulting $q$ is slightly more diffuse than the truth posterior on individual rare states (i.e., $P_n^{(e)}(I)$ above the per-column truth marginal $p_n^*(I)$). Through the cumulant (B.77), even a small per-column inflation $P_n(s) = p_n^*(s) + \epsilon $ contributes a leading linear-in-$\epsilon $ bias on $W[s, s']$ with positive coefficient $\partial W / \partial P_n(s) = \sum _{N > n} P_N^{(e)}(s') \prod _{k=n+1}^{N-1} P_k^{(e)}(\mathrm {Ig}) > 0$, summed over $L$ source columns; the cumulant amplifies per-column diffuseness via its $O(L^2)$ pair structure.

Why the EM converges to a fixed point. The EM map $\theta \mapsto \mathrm {MLE}(\mathbf {W}(\theta ))$ is self-consistent at the biased fixed point $\theta _\infty $ when the cumulant bias on $\theta $ saturates. Each entry of the cumulant is bounded uniformly in $\theta $: $W_e[s, s'] \leq L + 1$ on a single branch (each $(M, N)$ pair contributes at most $1$, and the geometric $\mathrm {Ig}$-product is summable), giving a $\theta $-independent upper bound on the aggregated sufficient-statistic count. The exposure denominator $T_{\text {eff}} = \sum _e t_e$ in the BDI rate-MLE (tkf92_vbem.py:223) is itself $\theta $-independent — it depends only on the tree branch lengths, not on the inferred rates — and is bounded below by the smallest tree branch. Composing these, the rate-MLE $\hat \lambda (\theta ) \propto \sum _e W[\cdot , I] / T_{\text {eff}}$ (with the precise BDI quadratic-MLE form in tkf92_vbem.py:m_step_indel_quadratic) is a continuous, bounded self-map on a compact rate interval $[0, \bar \lambda ]$ with $\bar \lambda \leq (L + 1) / \min _e t_e$ (continuity follows because $W(\theta )$ depends continuously on $\theta $ via the per-edge TKF92 transition matrix (??), and the M-step quadratic root is continuous in its coefficients). Brouwer’s fixed-point theorem applies: a fixed point $\theta _\infty $ exists. Consistent with this boundedness argument, the EM trajectory of Section ?? reaches a finite biased fixed point and does not diverge.

Implications for parameter inference. The structural bias is a property of any inference scheme that uses the cumulant tensor $\mathbf {W}$ as a sufficient statistic under a column-factorised variational family. It does not affect: (i) the ELBO value itself (which is a coherent $\log p(\text {data})$ lower bound regardless of what the cumulant looks like), nor (ii) ancestral indel-presence reconstruction (Section ??: VarAnc and Fitch parsimony are statistically indistinguishable on the unified Pfam test splits, $|\Delta \text {F1}| < 0.006$ across short / hard / xhard), nor (iii) downstream MSA-conditioned rate estimation via exact-EM (Maraschino, Section ??). It does affect any rate estimate computed from the variational cumulant, including the SVI-VarAnc refinement of cherry-trained parameters (Section ??), where parameters drift toward $\theta _\infty $ rather than toward truth. We report both initial cherry-trained and $\theta _\infty $ estimates without correction; the structural bias is the principal known limitation of variational training in this paper.

Model	Free params	\(\eqm \)	M-step for exchangeability
JC69	1 (\(\rho \))	fixed \(1/4\)	\(\hat \rho = U_\bullet /W_\bullet \)
K80	1 (\(\kappa _s\))	fixed \(1/4\)	\(\hat \kappa _s = 2U_s/U_v\)
F81	4 (\(\rho , \eqm \))	\(V'_i/\sum V'_j\)	\(\hat \rho = U_\bullet / \sum _i W_i(1-\eqm _i)\)
HKY85	5 (\(\kappa _s, \eqm \))	\(V'_i/\sum V'_j\)	\(\hat \kappa _s = (U_s/N_s)/(U_v/N_v)\)
GTR	9 (\(\exch , \eqm \))	\(V'_i/\sum V'_j\)	\(\hat \exch _{ij} = (U_{ij}+U_{ji})/(W_i\eqm _j+W_j\eqm _i)\)
GY94	62 (\(\omega ,\kappa _s,\eqm \))	\(V'_j/\sum V'_k\)	quadratic in \(\kappa _s\) (B.6)

Target \(\varepsilon \)	\(BK\) needed	\(B=50, K=?\)	\(B=200, K=?\)
0.10	\(\sim 100\,v_\theta \)	2	1
0.05	\(\sim 400\,v_\theta \)	8	2
0.01	\(\sim 10{,}000\,v_\theta \)	200	50

B EM, composite likelihoods, and variational inference