Contents

Common notation
A BDI and TKF foundations
A.1 The TKF91 Model
A.1.1 Finite-State Continuous-Time Markov Chain (CTMC)
A.1.2 Sufficient Statistics for Finite-State CTMCs
A.1.3 Linear Birth-Death-Immigration Process (BDI)
A.1.4 Sufficient Statistics for Linear BDI Process
A.1.5 The TKF91 Model: Linear BDI + Finite-State CTMC
A.1.6 Finite State Machines of the TKF91 Model
A.1.7 Sufficient Statistics for the TKF91 Model
A.1.8 Baum-Welch Algorithm for TKF91
A.1.9 Extending TKF91 to Phylogenetic Trees
A.2 The TKF92 Model
A.2.1 Latent Information in TKF92
A.2.2 Singlet HMM, Pair HMM, and WFST Representations
A.2.3 Baum-Welch Algorithm for TKF92
A.2.4 Maraschino: Distilled Cherries
A.3 TKF92 WFST by Singlet Division
A.3.1 TKF92 Singlet HMM Transitions
A.3.2 Why an Ins0 / Ins1 Split is Necessary
A.3.3 The \(6{\times }6\) TKF92 Pair HMM
A.3.4 The \(6{\times }6\) TKF92 WFST
A.3.5 Structure and Verification
A.3.6 Comparison with the \(5{\times }5\) Form
A.3.7 Normalization Structure
A.3.8 Testable Invariants
A.4 Equal-Rate Limits for TKF Parameters
A.4.1 TKF91 Transition Parameters
A.4.2 Score Derivatives: General Case
A.4.3 Score Derivatives: L’Hôpital Limits
A.4.4 BDI Sufficient Statistics
A.4.5 Direct BDI Rate EM at Equal Rates
A.4.6 \(\log \kappa \) and \(\log (1{-}\kappa )\) Derivatives
A.4.7 Joint vs Conditional Likelihoods
A.4.8 Irreversible Models and the \(\lambda > \mu \) Regime
A.4.9 Summary of Limits
A.5 TKF91 Score Function
A.5.1 Derivatives of TKF parameters
A.5.2 Observed-data score
A.5.3 Joint likelihood correction
A.5.4 Recovering sufficient statistics and M-step
A.6 General BDI Sufficient Statistics
A.6.1 Complete-data log-likelihood
A.6.2 Score equations
A.6.3 Conservation law
A.6.4 Closed-form solution
A.6.5 Transition probability for general BDI
A.6.6 Score derivatives for \(\nu \)
A.6.7 Chain rule
A.7 The General Geometric model
B EM, composite likelihoods, and variational inference
B.1 Substitution M-Steps for Specific Models
B.1.1 JC69 (Jukes–Cantor)
B.1.2 K80 (Kimura 2-Parameter)
B.1.3 F81 (Felsenstein 1981)
B.1.4 HKY85 (Hasegawa–Kishino–Yano)
B.1.5 GTR (General Time-Reversible)
B.1.6 GY94 (Goldman–Yang Codon Model)
B.1.7 Summary and Practical Notes
B.1.8 Reversible Mixture with Per-Component GTR
B.1.9 Rate Rescaling: M-Step for a Global Scalar Multiplier
B.1.10 Joint Rate Rescaling and Equilibrium
B.1.11 Tied Equilibria across Class Blocks
B.1.12 Tied Equilibria with Per-Class Rate Rescaling
B.1.13 Stochastic Variational Baum-Welch
B.2 Stochastic Variational Baum–Welch Convergence
B.2.1 Setup
B.2.2 BDI sufficient statistics
B.2.3 Variance of Minibatch Estimates
B.2.4 Pseudocount representation and its advantages
B.2.5 SVB Convergence Rate
B.2.6 Practical Recommendations
B.2.7 Bias diagnostics: Hellinger, ESS, and Fisher readouts
B.3 Expected Statistics and Linearized Convergence
B.3.1 BDI expected statistics at stationarity
B.3.2 Per-pair variance of sufficient statistics
B.3.3 Relative error and per-pair coefficients of variation
B.4 Maraschino: Cherry-Counts for TKF92
B.4.1 Optimizing Maraschino: gradient methods vs. inner EM
B.4.2 EM-around-Maraschino: mixtures of TKF92
B.4.3 EM-around-CherryML: mixtures over site classes
B.4.4 Expected sufficient statistics as fast custom VJPs
B.5 Selected Inference Algorithms for TKF92
B.5.1 Fast Statistical Alignment (FSA)
B.5.2 Beam Search Ancestral Sequence Reconstruction (BeamASR)
B.5.3 Variational Ancestral Reconstruction (VarAnc)
B.5.4 Stochastic Variational VarAnc (svi-VarAnc)
B.6 Mixture-of-trees variational ancestral presence/absence
B.6.1 Setting and approximation
B.6.2 Restricted generative model
B.6.3 Singlet HMM at the root
B.6.4 Per-branch path log-likelihood
B.6.5 Tree-structured variational family
B.6.6 Expected log-likelihood under \(q\)
B.6.7 ELBO
B.6.8 Stable computation: cumulant trick
B.6.9 Belief propagation for pairwise marginals
B.6.10 Special cases and scalability
B.7 Theory: structural bias of the BP cumulant under column-factorised \(q\)
C Recursive TKF
C.1 The TKF-Mixed Domain Model (MixDom)
C.1.1 The MixDom Model
C.1.2 Singlet HMM for MixDom
C.1.3 Pair HMM for MixDom
C.1.4 Baum-Welch Algorithm for MixDom Pair HMM
C.1.5 WFSTs for MixDom
C.2 Selected Inference Algorithms for MixDom
C.2.1 Fast Statistical Alignment (FSA)
C.2.2 Beam Search Ancestral Sequence Reconstruction (BeamASR)
C.2.3 Phylogenetic Hidden Markov Model (PhyloHMM)
C.2.4 Phylogenetic composition
C.2.5 Beam Backward algorithm (BeamMSA)
C.2.6 Progressive alignment via profile construction (ProgRec)
C.3 Exploded MixDom Pair HMM
C.3.1 State Space
C.3.2 Transition Weights
C.3.3 Null State Classification
C.3.4 Null Elimination
C.3.5 Exact Count Restoration
C.3.6 Parameter Group Decomposition
C.4 Order-1 Maraschino: Distilled Adjacency Frequencies
C.4.1 Cherry-count summary statistics
C.4.2 Cherry-count likelihood for the MixDom Pair HMM
C.4.3 Distillation From MixDom To Order-1 Machines
C.4.4 Notation for path marginalizations
C.4.5 Distillation to Order-1 HMM
C.4.6 Distillation to Order-1 WFST
C.5 Algebraic Distillation of MixDom
C.5.1 Setup
C.5.2 Class-mixture emissions
C.5.3 Single HMM Distillation
C.5.4 Pair HMM Distillation
C.5.5 Block Structure and Matrix Inversions
C.5.6 Within-Domain Inversion: closed form
C.5.7 Within-Domain Inversion: \(\mathcal {F}= 1\) closed form
C.5.8 Bilinear Factored Form of Adjacency Frequencies
C.5.9 Full-Context Distillation: Passthrough Context for Insert and Delete
C.5.10 Domains versus Fragments versus Classes for Adjacency Capture
C.5.11 Identifiability
C.5.12 Scaling to \(\mathcal {N}, \mathcal {F}, C\)
C.5.13 Summary
C.6 MixDom-Specific SVI-BW Convergence Considerations
C.6.1 Parameter groups and Fisher information
C.6.2 Substitution vs. indel information
C.6.3 MixDom expected statistics
C.6.4 Convergence rate estimates
C.6.5 Discussion: why top-level indel rates are hardest
C.7 Variational EM training of MixDom from tree-structured data
C.7.1 Outer EM loop
C.7.2 Per-family E-step
C.7.3 M-step from aggregated sufficient statistics
C.7.4 Stochastic VBEM (SVI-VBEM)
C.7.5 Convergence and ELBO monitoring
C.7.6 Initialisation and warm-start
C.7.7 Computational scaling and minibatching
C.7.8 Comparison to SVI-BW
C.8 Mixture-of-trees variational MixDom ancestral inference
C.8.1 Setting and reduced state space
C.8.2 Restricted generative model
C.8.3 Variational family
C.8.4 Reduced WFST: marginalising \((g, e)\) and the class \(c\)
C.8.5 Per-branch path log-likelihood
C.8.6 Per-column expected indel log-likelihood under \(q\)
C.8.7 Per-column expected substitution log-likelihood
C.8.8 ELBO
C.8.9 Cross-column constraint vanishes
C.8.10 Special cases and recovery
C.8.11 Open issues
C.9 Generalized Phylo-HMM for MixDom
C.9.1 The Vanishing-Top-Level-Indel Limit
C.9.2 Partition Decomposition
C.9.3 Why the State Space Cannot Be Collapsed
C.9.4 Setup and Definitions
C.9.5 Intra-Block Forward Recurrence
C.9.6 The Forward Recursion
C.9.7 The Backward Recursion
C.9.8 Intra-Block Backward Recurrence
C.9.9 Posterior Domain and Fragment State Assignment
C.9.10 Root Residue Reconstruction
C.9.11 Why the Trick Fails with Top-Level Indels
C.9.12 Complexity
C.9.13 Simulation from MixDom
C.10 Labeled-MixDom Singlet HMM and WFST
C.10.1 Labeled Alphabet
C.10.2 Labeled-MixDom Singlet HMM
C.10.3 Labeled-MixDom WFST
C.10.4 Emission Weights
C.10.5 Verification: Composition Reproduces the Pair HMM
C.10.6 Simplification of Domain-Boundary WFST Weights
C.10.7 State Count and Sparsity
C.11 Formal Grammar Elaboration Rules
C.11.1 Base Grammar
C.11.2 Elaboration Rules
C.11.3 Null State Management
C.11.4 Composition Properties
C.11.5 Toward Implementation
C.12 Recursive TKF Models
C.12.1 Example One: Left-Recursive TKF (L-TKF)
C.12.2 Example Two: The TKF Structure Tree (TKFST)
C.12.3 Example Three: The TKF Basepair Stack (TKFStack)
C.12.4 Example Four: The TKF Genome
D TKF-DP: Dirichlet-process Potts coupling
D.1 The TKF-DP generative model
D.2 IBP variant
D.3 Site classes and a GTR-parameterized generator
D.4 Class-level variational substitution likelihood
D.5 Augmented indel histories via a time-indexed pair SCFG
D.6 Posterior sampling and parameter learning
D.7 Pairwise alignment postprocessing
E The infinite Pair HMM and its MCMC sampler
E.1 Exact 0-or-1-edge marginal posteriors via Pair-SCFG inside-outside
E.2 Memory-augmented Pair HMM: the same content at \(O(L^2 A^2)\)
E.3 The principled formulation: three-factor model and MCMC
E.4 The conceptual hierarchy: infinite phylogenetic SCFG