The infinite Pair HMM and its MCMC sampler

The Pair HMM postprocessing methods of §D.7 are progressively faithful approximations to a single underlying object: the alignment-and-partition posterior on $\pi (A, \pi _{\!M} \mid X, Y)$, where $A$ is the TKF92 alignment and $\pi _{\!M}$ is a Chinese-restaurant-process partition of its Match cells into singletons (block of size 1, contributing the standard per-site CTMC factor) and pairs (block of size 2, contributing the joint Potts CTMC factor). This appendix develops the three-factor factorisation $\pi _{\TKF 92} \cdot \pi _{\mathrm {CRP}} \cdot \prod _b P_{\mathrm {block}}$ in full; describes the exact $F_2$-SCFG inside–outside formulation that achieves it at $O(L^4)$; re-encodes the same content as a memory-augmented Pair HMM at $O(L^2 A^2)$; and develops the Gibbs+MH+replica-exchange MCMC sampler that draws from the unbounded-edge limit at amortised $O(L^3)$ per sweep. The conceptual hierarchy at the end of the appendix lifts this construction to the full multi-branch phylogenetic SCFG, the natural fixed point of the entire ladder of approximations.

E.1 Exact 0-or-1-edge marginal posteriors via Pair-SCFG inside-outside

The deprecated mean-field correction (remark D.13 of Appendix D) and the coupled-pair greedy annealer of §D.7.0.0 are two routes to the same intractable posterior of equation D.3; the present subsection derives an exact dynamic-programming alternative under the hard restriction that the coupling partition $E$ contains zero or one edge. The result is a per-residue-pair posterior $Q'_{ij}$ that consumes the same downstream FSA assembler as the baseline, with the coupling evidence absorbed exactly into the posterior rather than fed in via a per-pair correction or via a custom annealer move set.

Three pair-HMM marginal moments

Under the no-coevolution baseline TKF92 Pair HMM with the class-mixture single-site emissions of Appendix D, let $\pi (X, Y, A)$ denote the joint probability of an alignment $A$ between sequences $X$ and $Y$. Define \begin {align} F_0(X, Y) \;&\equiv \; \sum _A \pi (X, Y, A), \label {eq:F0} \\ F_1(X, Y; i, j) \;&\equiv \; \sum _{A \,:\, X_i \sim _A Y_j} \pi (X, Y, A), \label {eq:F1} \\ F_2(X, Y; i, j; k, l) \;&\equiv \; \sum _{A \,:\, X_i \sim _A Y_j \,\wedge \, X_k \sim _A Y_l} \pi (X, Y, A), \label {eq:F2} \end {align}

where $X_i \sim _A Y_j$ denotes that the alignment $A$ pairs residue $X_i$ with $Y_j$ in a Match emission. Computationally, $F_0$ is a Forward in $O(L^2)$, $F_1$ is a Forward–Backward in $O(L^2)$, and $F_2$ is the Inside–Outside table of a non-bifurcating Pair SCFG (or equivalently a doubled Forward–Backward) in $O(L^4)$ time and memory. The conventional Pair-HMM match-state posterior is $Q_{ij} \equiv F_1(i, j) / F_0$.

The objects $F_1$ and $F_2$ stand in the relation that gradients and Hessians do to a partition function: $F_1(i, j) / F_0$ is the marginal probability that $(X_i, Y_j)$ is paired and $F_2(i, j; k, l) / F_0$ is the marginal joint probability that both $(X_i, Y_j)$ and $(X_k, Y_l)$ are paired, with the no-coevolution model serving as the unperturbed reference point.

The 0-or-1-edge Pair SCFG

Let $M(i, j; k, l; t) \equiv M((X_i, X_k), (Y_j, Y_l); t)$ denote the four-residue Potts coupling boost of equation D.5 of Appendix D, evaluated at the observed residues $(X_i, Y_j, X_k, Y_l)$ and the inferred branch length $t$. Define the (unnormalised) Pair-SCFG generating function \begin {equation} \label {eq:scfg-grammar} S \;\to \; S_0 \,\,\big |\,\, S_1, \qquad \text {with multiplicative weights $1$ and $\varepsilon = 1 / \alpha _z$,} \end {equation} where the two non-terminals expand as \begin {align*} S_0 \;&\to \; \text {TKF92 Pair HMM with zero coupled column-pairs (the baseline)}, \\ S_1 \;&\to \; \text {TKF92 Pair HMM with exactly one coupled column-pair}. \end {align*}

The 1-edge nonterminal’s parse picks one ordered column-pair $(i, j; k, l)$ with $i < k$ as the coupled pair and replaces the product of independent single-site Match emissions at columns $(i, j)$ and $(k, l)$ with the joint pair-emission $P_{\mathrm {joint}}((X_i, X_k) \to (Y_j, Y_l); t, H)$, equivalently the product of independent emissions multiplied by $M(i, j; k, l; t)$. The partition function of the grammar is \begin {equation} \label {eq:exact-01} L_{\mathrm {exact}}(X, Y) \;=\; F_0(X, Y) \,+\, \varepsilon \sum _{(i, j; k, l), \,\, i < k} F_2(X, Y; i, j; k, l) \cdot M(i, j; k, l; t). \end {equation} The first term is the no-coupling marginal likelihood; the second is the marginal contribution of all alignments that contain exactly one coupled column-pair, summed over all candidate locations $(i, j; k, l)$ for that pair. The $\varepsilon $ prefactor is the per-pair partner prior under the size-$\{1, 2\}$ Ewens partition with concentration $\alpha _z$; the assumption restricting to at most one edge per alignment is a hard-truncation of the size-$\{1, 2\}$ partition that retains the leading-order correction in $\varepsilon $.

The exact match posterior

Inside–Outside on the grammar (E.1) yields the marginal probability that residue $X_i$ is aligned to $Y_j$ in the data: \begin {equation} \label {eq:exact-Q} Q'_{ij} \;=\; \frac {F_1(X, Y; i, j) \,+\, \varepsilon \sum _{(k, l) \,:\, k \neq i} F_2(X, Y; i, j; k, l) \cdot M(i, j; k, l; t)}{L_{\mathrm {exact}}(X, Y)}. \end {equation} The numerator collects all alignment-paths through the SCFG that touch $(i, j)$ as a Match: the first term covers $S_0$ paths (no coupled pair touches $(i, j)$), the second term covers $S_1$ paths in which $(i, j)$ is part of the (unique) coupled pair, summed over the choice of partner column $(k, l)$. Both summands include the boost factor $M$ at exactly the right place. The denominator is the partition function (E.2).

In the limit $\varepsilon \to 0$ (or $H = 0$ so $M \equiv 1$ everywhere), $Q'_{ij}$ collapses to $F_1(i, j) / F_0 = Q_{ij}$, the no-coupling baseline, as required.

Cost and practical considerations

The bottleneck of the explicit $F_2$ tensor encoding is the $O(L^4)$ time and memory required to materialise $F_2$ itself: $F_2$ has $\binom {L}{2}^2 \approx L^4 / 4$ free entries, which at $L = 200$ occupies a few GB in single precision. Longer sequences require either (a) prefix/suffix chunking on the second-anchor axis $(k, l)$ to keep memory bounded, or (b) candidate pruning that drops $(i, j)$ or $(k, l)$ pairs with $F_1(\cdot ) < q_{\min }$, reducing the candidate set from $O(L^4)$ to $O(L^2)$ in practice. The reference implementation lives in src/tkfdp/f2_scfg.py; in the current codebase it is retained primarily as a cross-validation reference for the more efficient memory-augmented Pair HMM described in the next subsection (which agrees with $F_2$-SCFG to machine precision and runs at $O(L^2 A^2)$).

The available pathways to a coupled-FSA-consumable per-residue posterior, in order of fidelity:

Pathway	Cost	Approximation

Mean-field pre-correction (deprecated; rem. D.13)	$O(L^2 K_c^2 A^2)$	First-order in $\varepsilon $
Coupled-pair greedy annealer (§D.7.0.0)	$O(L^3)$	Greedy commit ordering
$F_2$-SCFG explicit enumeration	$O(L^4)$	0-or-1-edge truncation, otherwise exact
Aug-PHMM 1-edge (§E.2)	$O(L^2 A^2)$	0-or-1-edge truncation, otherwise exact
Aug-PHMM 2-edge (§E.2)	$O(L^2 A^4)$	0-or-1-or-2-edge truncation, otherwise exact
Infinite Pair HMM MCMC (§E.3)	$O(L^4) + O(L^3) \cdot N_{\mathrm {sweep}}$	Stochastic; converges to exact

The bounded-edge approximations capture progressively more of the coupling correction at progressively higher cost; the infinite Pair HMM MCMC is the principled limit, exact in expectation up to MCMC sampling error. For families dominated by a single strong coevolutionary edge per column, the 1-edge aug-PHMM at $O(L^2 A^2)$ is empirically adequate; for families with multiple competing partners per column, the 2-edge aug-PHMM (or the MCMC sampler) is necessary.

E.2 Memory-augmented Pair HMM: the same content at $O(L^2 A^2)$

The $F_2$-SCFG of equations E.1–E.3 can be re-encoded as a Pair HMM whose latent state is augmented with a small tag memory of in-progress coupled-edge endpoints. For the 0-or-1-edge case the tag set has $|\mathcal {T}| = 1 + A^2 + 1$ values: a sentinel no_edge state, $A^2$ states $(a, b)$ recording that a left endpoint with observed amino acids $(a, b)$ is awaiting its right partner, and a terminal done state recording that a coupled pair has been consumed. At any Match cell $(i, j)$ the HMM may either (a) carry the tag forward unchanged, (b) transition $\texttt {no\_edge} \to (X_i, Y_j)$ with weight $\varepsilon $ (spawn a left endpoint), or (c) transition $(a, b) \to \texttt {done}$ with weight $M(\cdot ,\cdot ;X_i,Y_j;t)$ summed over the recorded $(a, b)$ (close the pair). At all non-Match states the tag is preserved verbatim. The augmented Forward partition function equals $L_{\mathrm {exact}}$ of equation E.2, and the marginal posterior at Match-state cell $(i, j)$, summed over tag values, equals $Q'_{ij}$ of equation E.3. Cost is $O(L^2 \cdot 5 \cdot |\mathcal {T}|) = O(L^2 A^2)$ time and memory: a factor $L^2 / A^2 \approx 100$ speedup over the explicit $F_2$ tensor at typical sequence lengths, with no loss of fidelity. Implementation lives in src/tkfdp/aug_phmm.py; cross-validation against the $O(L^4)$ $F_2$-SCFG reference confirms agreement to machine precision (max $|Q'_{ij} - Q'^{\mathrm {SCFG}}_{ij}| \approx 10^{-14}$ on tested pairs).

The same construction generalises to allow up to $k$ coupled column-pairs per alignment by enlarging the tag set to record (i) the multiset of in-flight left endpoints (size up to $k$), (ii) the number of resolved pairs (in $\{0, 1, \ldots , k\}$), and (iii) optionally the in-flight endpoints alongside the resolved-pair counter. For $k = 2$ the tag set has $|\mathcal {T}_2| = 1 + A^2 + A^2(A^2+1)/2 + A^2 + 1 \approx 81{,}000$ values for $A = 20$: a sentinel no_edge, $A^2$ singleton states (one in-flight, none resolved), an unordered-multiset pair state of size $A^2(A^2+1)/2$ (two in-flight, none resolved; same-element multisets are valid distinct states), $A^2$ closed-singleton states (one resolved, one in-flight), and a terminal closed_done. The match-cell transitions add a spawn from singleton$\to $pair with weight $\varepsilon $ (multiset construction), and three closure transitions: pair$\to $closed-singleton with weight $M(\cdot ;X_i,Y_j;t)$ over the closed multiset element, singleton$\to $closed-done with weight $M$, and closed-singleton$\to $closed-done with weight $M$. Cost is $O(L^2 \cdot 5 \cdot |\mathcal {T}_2|) = O(L^2 A^4)$ time and memory.

E.3 The principled formulation: three-factor model and MCMC

The bounded-edge truncations (0-or-1-edge in §E.1; 0-or-1-or-2-edge as a natural extension) are dynamic-programming approximations to a more principled underlying model whose joint posterior factorises cleanly into three independent ingredients: the alignment path $A$ under the TKF92 Pair HMM, a partition $\pi _{\!M}$ of its Match cells under a Chinese Restaurant Process (CRP) prior, and the per-block substitution likelihood under the substitution CTMC: \begin {equation} \label {eq:infinite-hmm-factorisation} \pi (A, \pi _{\!M} \mid X, Y) \;\propto \; \pi _{\mathrm {TKF92}}(A \mid X, Y) \;\cdot \; \pi _{\mathrm {CRP}}(\pi _{\!M} \mid \alpha _z, |\mathrm {Match}(A)|) \;\cdot \; \prod _{b \,\in \, \pi _{\!M}} P_{\mathrm {block}}(b \mid t, H). \end {equation} Blocks of size 1 (uninvolved Match cells) contribute the standard single-site CTMC factor $P_{\mathrm {singlet}}$ of equation ?? of Appendix D; blocks of size 2 (coupled-edge endpoints) contribute the joint Potts CTMC factor $P_{\mathrm {doublet}}$ of equation ??. The CRP prior with concentration $\alpha _z$ on a partition of $N = |\mathrm {Match}(A)|$ Match cells into $K$ blocks of sizes $\{n_b\}_{b=1}^K$ is the canonical Ewens partition prior (15, 39) \begin {equation} \label {eq:crp-prior} \pi _{\mathrm {CRP}}(\pi _{\!M} \mid \alpha _z, N) \;=\; \frac {\alpha _z^K \, \prod _{b=1}^K (n_b - 1)!}{\alpha _z (\alpha _z + 1) \cdots (\alpha _z + N - 1)}; \end {equation} large $\alpha _z$ favours many small blocks (mostly singletons, few coupled pairs). In practice we restrict to blocks of size at most 2: blocks of size $\geq 3$ would correspond to triangular cliques whose marginal mass is suppressed by factors of $1/\alpha _z$ at high concentration. We call (E.4) the infinite Pair HMM in the sense that the number of coupled edges per alignment is not capped by the prior — in contrast to the bounded-edge dynamic-programming methods of §E.2.

The original infinite Hidden Markov Model of Beal, Ghahramani & Rasmussen (4) and the HDP-HMM of Teh, Jordan, Beal & Blei (48) provide the broader Bayesian-nonparametric framework in which equation E.4 sits.

MCMC sampler from the infinite Pair HMM

The factorisation (E.4) admits an efficient Markov-chain Monte Carlo sampler that scales as $O(L^4)$ for one-time setup followed by $O(L^3)$ amortised cost across many sweeps. We give the setup phase first, then the three move types that compose the sweep.

Target distribution. The sampler operates on the joint state $(A, E)$ where $A$ is a TKF92 Pair HMM alignment and $E$ is a set of unordered Match-cell-pair edges. The unnormalised target distribution is \begin {equation} \label {eq:mcmc-target} \pi (A, E \mid X, Y) \;\propto \; P_{\mathrm {baseline}}(A \mid X, Y) \cdot \frac {\alpha _z^{|E|}}{\prod _{m=1}^{N_M(A)} (m - 1 + \alpha _z)} \cdot \prod _{e \in E} M(e \mid t), \end {equation} where $N_M(A)$ is the number of Match cells in $A$ and $M(e \mid t)$ is the four-residue Potts boost at edge $e$. The combinatorial prior is the canonical Ewens distribution restricted to blocks of size $\leq 2$.

Setup phase. The setup phase precomputes (a) the partial-Forward tensor $F^{\mathrm {partial}}[i, j; k, l]$ giving the baseline TKF92 Pair HMM probability of any alignment passing through Match at both anchor cells $(i, j)$ and $(k, l)$ (the same $O(L^4)$ object enumerated by the $F_2$ inside–outside table of §E.1), and (b) the four-residue Potts boost tensor $M$ at the observed amino acids over all anchor quadruples. Both are stored once per $(X, Y)$ pair and reused across all sweeps and chains.

Replica exchange. Parallel-tempering replica exchange over a ladder of $\alpha _z$ values (corresponding to varying coupling strengths) ensures global mixing across the multimodal posterior. Adjacent replicas swap states with the usual Metropolis acceptance ratio for the temperature-pair exchange.

Cost. The setup phase is $O(L^4)$ one-time per sequence pair. Each MCMC sweep does $O(1)$ inter-edge-anchor segment resamples; each segment resample is an $O(L_{\rm seg})$ stochastic traceback through the cached partial-Forward tensor, giving an amortised $O(L)$ per-sweep cost. Aggregated per-cell match indicators across post-burn-in samples yield a stochastic estimate of the per-residue posterior $Q'_{ij}$ that the FSA assembler consumes in place of the bounded-edge analytic estimate. The block move reduces to a pure Gibbs “segment between adjacent edge anchors” resample whenever $a_L$ and $a_R$ are adjacent in the current edge-anchor list.

Algorithmic detail

Algorithm 3: One MCMC sweep on the infinite Pair HMM

Require:   Current state $(A, E)$, partial-Forward tensor $F^{\mathrm {partial}}$, boost tensor $M$, concentration $\alpha _z$
Ensure:   New state $(A', E')$
1:   Pick $K_{\mathrm {seg}}$ segment-resample positions (e.g. $K_{\mathrm {seg}} = 8$, distributed by alignment length)
2:   for each segment $[a_L, a_R]$ do
3:    $A_{\mathrm {seg}}^{\mathrm {new}} \gets $ stochastic traceback of $F^{\mathrm {partial}}_{\mathrm {segment}}[a_L; a_R]$
4:    Discard edges in $E$ with at least one endpoint in $[a_L, a_R]$ (orphaned partners become singletons)
5:    $\rho \gets $ MH ratio combining CRP and boost factors (alignment factor cancels by construction of the proposal)
6:    if $u \sim \mathrm {Uniform}(0, 1) < \rho $ then
7:    $A \gets A$ with $[a_L, a_R]$ replaced by $A_{\mathrm {seg}}^{\mathrm {new}}$
8:    $E \gets E$ with orphaned edges removed
9:    end if
10:   end for
11:   Pick $K_{\mathrm {add}}$ add-edge candidates and $K_{\mathrm {rem}}$ remove-edge candidates (e.g. $K_{\mathrm {add}} = 4 |E|$, $K_{\mathrm {rem}} = |E|$)
12:   for each candidate do
13:    Propose under the appropriate move kernel
14:    Compute MH ratio (CRP $\times $ boost for add; reciprocal for remove)
15:    Accept / reject
16:   end for
17:   Periodically: attempt replica-exchange swap with adjacent $\alpha _z$ rung
18:   return $(A, E)$

The exact form of the segment-resample MH ratio depends on the edge-handling convention. Under the simplest convention (“Strategy S-1” in the implementation spec), edges whose endpoints lie wholly within the resampled region are discarded unconditionally and their boost contributions enter only the forward MH ratio; edges with one endpoint outside the region are discarded and their orphaned partner becomes a singleton.

Verification protocol

The first three checks pin the sampler to the exact reference at each of three orders of edge truncation; the fourth checks detailed balance independently of the data-conditioned target.

E.4 The conceptual hierarchy: infinite phylogenetic SCFG

The infinite Pair HMM is itself an approximation. The infinite Pair HMM of equation E.4 is the principled limit of the bounded-edge dynamic-programming approximations of §E.2, but it is itself an approximation to a still-more-principled object: the infinite-edge generalisation of the gravestone-augmented time-indexed pair SCFG of §D.5 of Appendix D. That fuller model samples not only the visible alignment $A$ and the edge set $E$ but also the timed gravestone-augmented branch history $G$ comprising every fragment that ever existed during the branch (whether visible at the alignment endpoints, deleted on the branch, or transiently inserted then deleted), with the same canonical CRP rule of equation E.4 governing edge spawn at every Match-class member of either an alive or a gravestone fragment. On the substitution side, transient gravestones inserted during the branch can be assigned the same $z_s$ key-DP class as their alive class-mates and therefore contribute coupling terms to those class-mates’ rates during the gravestone’s lifetime — so the substitution likelihood at a Match cell $(i, j)$ depends on $G$, not just on $A$. The infinite Pair HMM of this appendix marginalises $G$ implicitly via the standard TKF92 Pair HMM substitution emissions, which is correct in expectation when transient lineages are rare or when their coupling contribution is small, but loses the within-branch dynamics that the gravestone-augmented SCFG captures.

The resulting hierarchy is nested: the bounded-edge dynamic-programming methods of §E.2 are exact in the limit of the infinite Pair HMM of this section; the infinite Pair HMM is exact in the limit of the infinite gravestone-augmented pair SCFG; and that fuller object is the natural fixed point at which the indel-side and the substitution-side principled formulations meet. The MCMC sampler described above samples from the middle level of this hierarchy; lifting it to the full infinite gravestone-augmented pair SCFG — by interleaving the infinite-HMM CRP edge moves with the gravestone-augmented branch-history moves of §D.5 of Appendix D — is a natural extension.

Further: the infinite gravestone-augmented phylogenetic SCFG. The pairwise hierarchy above lifts to the full multi-branch phylogenetic setting at no further modelling cost, by composition with the Felsenstein-style upwards–downwards machinery of §D.5 that already glues per-branch gravestone-augmented histories into a tree-level posterior. At every internal node and along every branch of the tree, the gravestone-augmented latent state is sampled jointly with the alignment, while the same canonical CRP rule of equation E.4 governs edge spawn at every Match-class member of every fragment (alive or gravestone) at every node. The MCMC sampler above lifts by interleaving (i) per-branch gravestone-augmented history moves of §D.5, (ii) per-class-pair CRP edge moves of this appendix, and (iii) the per-internal-node Felsenstein-style updates already present in the SVI loop of §D.5. Each ingredient is already load-bearing in the existing pipeline; no new modelling assumptions are required to assemble them.

each rung exact in the limit of the next, the entire ladder falling out of the simple stick-breaking-TKF92-with-Potts construction of Appendices D–E with no further mathematical machinery.

E The infinite Pair HMM and its MCMC sampler