Supplementary material for
Insertions, deletions, and exchangeable couplings:
a Dirichlet process over TKF Potts models

Annabel Large and Ian Holmes

Department of Bioengineering, University of California, Berkeley
{annabel_large, ihh}@berkeley.edu

Abstract

This supplement collects the long technical derivations referenced by the main paper. It is organised into five appendices that mirror the conceptual ladder of the main text.

Appendix A develops the finite-state CTMC and the linear birth–death–immigration process, derives the closed-form bridge-expectation and Fisher-score sufficient statistics for TKF91 and TKF92, resolves the removable singularity at \(\lambda = \mu\) (equal insertion and deletion rates) by L'Hôpital's rule, and discusses the relationship between TKF92 and the latent-boundary-free General Geometric Indel (GGI) model.

Appendix B collects the closed-form substitution M-steps for TKF91, TKF92, and MixDom in their many GTR specialisations; the stochastic-variational Baum–Welch loop with its convergence theorem and linearised analysis at stationarity; Maraschino (the TKF92 cherry-distilled generalisation of CherryML) and its tree-level inference algorithms (FSA, BeamASR, VarAnc, svi-VarAnc); the mixture-of-trees variational ancestral presence/absence inference; and the structural bias of the BP cumulant under a column-factorised variational field.

Appendix C develops the recursive TKF family: MixDom, the hierarchical-mixture-of-domains generalisation of TKF92, with its exact closed-form M-step via six-step chain restoration through a fully exploded null-state model; the order-1 Maraschino adjacency distillation; algebraic distillation of MixDom; the MixDom-specific SVI-BW convergence considerations; the tree-level VEM and ancestral-reconstruction algorithms; the generalised phylo-HMM; the labeled-MixDom Singlet and WFST; and the recursive-grammar-elaboration rules together with worked recursive examples (L-TKF, TKFST, TKFStack, TKF-Genome).

Appendix D develops the TKF-DP generative model, the class-level path-measure variational likelihood with pairwise bridge expectations, the time-indexed gravestone-augmented pair SCFG, the SVI inference loop, and the pairwise alignment postprocessing landscape.

Appendix E develops the infinite Pair HMM as the principled fixed point and the Gibbs+MH+replica-exchange MCMC sampler that draws from it.

Contents
Common notation
A BDI and TKF foundations
A.1 The TKF91 Model
A.2 The TKF92 Model
A.3 TKF92 WFST by Singlet Division
A.4 Equal-Rate Limits for TKF Parameters
A.5 TKF91 Score Function
A.6 General BDI Sufficient Statistics
A.7 The General Geometric model
B EM, composite likelihoods, and variational inference
B.1 Substitution M-Steps for Specific Models
B.2 Stochastic Variational Baum–Welch Convergence
B.3 Expected Statistics and Linearized Convergence
B.4 Maraschino: Cherry-Counts for TKF92
B.5 Selected Inference Algorithms for TKF92
B.6 Mixture-of-trees variational ancestral presence/absence
B.7 Theory: structural bias of the BP cumulant under column-factorised \(q\)
C Recursive TKF
C.1 The TKF-Mixed Domain Model (MixDom)
C.2 Selected Inference Algorithms for MixDom
C.3 Exploded MixDom Pair HMM
C.4 Order-1 Maraschino: Distilled Adjacency Frequencies
C.5 Algebraic Distillation of MixDom
C.6 MixDom-Specific SVI-BW Convergence Considerations
C.7 Variational EM training of MixDom from tree-structured data
C.8 Mixture-of-trees variational MixDom ancestral inference
C.9 Generalized Phylo-HMM for MixDom
C.10 Labeled-MixDom Singlet HMM and WFST
C.11 Formal Grammar Elaboration Rules
C.12 Recursive TKF Models
D TKF-DP: Dirichlet-process Potts coupling
D.1 The TKF-DP generative model
D.2 IBP variant
D.3 Site classes and a GTR-parameterized generator
D.4 Class-level variational substitution likelihood
D.5 Augmented indel histories via a time-indexed pair SCFG
D.6 Posterior sampling and parameter learning
D.7 Pairwise alignment postprocessing
E The infinite Pair HMM and its MCMC sampler
E.1 Exact 0-or-1-edge marginal posteriors via Pair-SCFG inside-outside
E.2 Memory-augmented Pair HMM: the same content at \(O(L^2 A^2)\)
E.3 The principled formulation: three-factor model and MCMC
E.4 The conceptual hierarchy: infinite phylogenetic SCFG
References