BDI and TKF foundations

This appendix builds the foundations on which the rest of the supplement rests. We develop the finite-state continuous-time Markov chain (CTMC) sufficient statistics; the linear birth–death–immigration (BDI) process and its endpoint-conditioned sufficient statistics via the Fisher score identity; the TKF91 model as a factorisation of these; its finite-state machine representations; the TKF92 generalisation; the explicit $6{\times }6$ WFST derivation by Pair-HMM-over-Singlet division; the L’Hôpital limits at $\insrate {=}\delrate $ that close the $0/0$ singularity in the score identity; and a discussion of how the latent-boundary TKF92 model relates to the latent-boundary-free General Geometric Indel (GGI) model via moment matching.

A.1 The TKF91 Model

A.1.1 Finite-State Continuous-Time Markov Chain (CTMC)

A finite-state continuous-time Markov chain (CTMC) on alphabet $\alphabet $ has rate matrix (generator) $\revsub $ with off-diagonal entries $\revsub _{ij} \geq 0$ (instantaneous rate of substitution from $i$ to $j$) and diagonal entries $\revsub _{ii} = -\sum _{j \neq i} \revsub _{ij}$. The stationary distribution $\eqm $ satisfies $\eqm \revsub = 0$ and $\sum _i \eqm _i = 1$.

We restrict our analysis to time-reversible CTMCs, which satisfy detailed balance $\eqm _i \revsub _{ij} = \eqm _j \revsub _{ji}$ for all $i,j$. The symmetrized rate matrix $\symsub _{ij} = \revsub _{ij}\sqrt {\eqm _i/\eqm _j}$ is real symmetric, so it has a spectral decomposition $S = \sum _\sumidx \xi ^{(\sumidx )} v^{(\sumidx )} {v^{(\sumidx )}}^\top $ with real eigenvalues $\xi ^{(\sumidx )} \leq 0$ and orthonormal eigenvectors $v^{(\sumidx )}$. Conjugating back, the transition matrix is \[ P(X(\evoltime ) = j \mid X(0) = i, \revsub ) = \left ( e^{\revsub \evoltime } \right )_{ij} = \sum _\sumidx e^{\xi ^{(\sumidx )} \evoltime } \sqrt {\frac {\eqm _j}{\eqm _i}}\, v^{(\sumidx )}_i v^{(\sumidx )}_j \]

A.1.2 Sufficient Statistics for Finite-State CTMCs

The sufficient statistics for a continuously-observed path $\xpath = \{X(t)\}_{0 \leq t \leq T}$ in the finite-state CTMC are the dwell times $W_i$ and transition counts $U_{ij}$. The complete-data log-likelihood is \begin {equation} \label {eq:ctmc-complete} \log P(\xpath \mid X(0), \revsub ) = \sum _{i \neq j} U_{ij} \log \revsub _{ij} + \sum _i W_i \revsub _{ii} = \sum _{i \neq j} U_{ij} \log \revsub _{ij} - \sum _i W_i \sum _{j \neq i} \revsub _{ij}. \end {equation} This is an exponential-family form with natural parameters $\log \revsub _{ij}$ and $-\revsub _{ii}$, and sufficient statistics $U_{ij}$ and $W_i$.

The a posteriori expectations of these sufficient statistics for an endpoint-conditioned path in the finite-state CTMC are (19, 20, 25, 47) \begin {align} \emcount ^W_i(a,b,\evoltime ) \equiv \expect [W_i|a,b] &= \frac {1}{M_{ab}(\evoltime )} I^{ab}_{ii}(\evoltime ) \label {eq:em-dwell} \\ \emcount ^U_{ij}(a,b,\evoltime ) \equiv \expect [U_{ij}|a,b] &= \frac {\revsub _{ij}}{M_{ab}(\evoltime )} I^{ab}_{ij}(\evoltime ) \label {eq:em-counts} \end {align}

where $M(\evoltime ) = e^{\revsub \evoltime }$ is the matrix exponential and \begin {equation} \label {eq:em-integral} I^{ab}_{ij}(\evoltime ) = \int _0^\evoltime M_{ai}(t)\, M_{jb}(\evoltime -t)\, dt. \end {equation}

Writing $M_{ij}(t)$ in the eigenbasis of the symmetrized rate matrix $\symsub _{ij} = \revsub _{ij}\sqrt {\eqm _i/\eqm _j}$ (with eigenvalues $\xi ^{(k)}$ and orthonormal eigenvectors $v^{(k)}$): \begin {equation} \label {eq:em-Jkl} I^{ab}_{ij}(\evoltime ) = \sqrt {\frac {\eqm _i \eqm _b}{\eqm _a \eqm _j}} \sum _k v_a^{(k)} v_i^{(k)} \sum _l v_j^{(l)} v_b^{(l)}\, J^{kl}(\evoltime ) \end {equation} where $J^{kl}(\evoltime ) = \evoltime e^{\xi ^{(k)}\evoltime }$ if $\xi ^{(k)} = \xi ^{(l)}$, and $(e^{\xi ^{(k)}\evoltime } - e^{\xi ^{(l)}\evoltime })/(\xi ^{(k)} - \xi ^{(l)})$ otherwise.

Using the GTR parameterization $\revsub _{ij} = \exch _{ij} \eqm _j$, where $\exch $ is a symmetric exchangeability matrix (so $\symsub _{ij} = \exch _{ij} \sqrt {\eqm _i \eqm _j}$), and including the prior $P(X(0)=i) = \eqm _i$, the complete data log-likelihood is \[ \log P(\xpath , X(0) \mid \exch , \eqm ) = \sum _i \left ( V_i + \sum _{j \neq i} U_{ji} \right ) \log \eqm _i + \sum _{j>i} (U_{ij} + U_{ji}) \log \exch _{ij} - \sum _{j > i} \exch _{ij} \left ( W_i \eqm _j + W_j \eqm _i \right ) \] where $V_i$ is the indicator/count for the initial state $X(0) = i$.

A.1.3 Linear Birth-Death-Immigration Process (BDI)

The linear birth-death-immigration (BDI) process (29) has per-capita birth rate $\insrate $, per-capita death rate $\delrate $, and constant immigration rate $\immrate $, so that the total event rates from state $k$ are $\insrate _k = k\insrate + \immrate $ and $\delrate _k = k\delrate $.

The transition distribution for this process is well-studied in the general case. We confine our analysis to the specific case $\immrate = \insrate , \delrate > \insrate $ where the immigration rate is equal to the birth rate and the process is positive recurrent and has an equilibrium distribution, a zero-based geometric with parameter $\insrate /\delrate $. This is the regime relevant to the TKF91 and related models. However, many of the formulas we derive will hold more generally, and the methods we use can be applied to other parameter regimes as well.

Let $Y_f$ denote the number of founder deaths and $Y_o$ the number of youngest orphans (i.e. the number of times a lineage goes extinct while leaving behind at least one surviving descendant). Then the finite-time transition probability is obtained by summing the joint probability \begin {multline*} P(X(\evoltime ) = j,\, Y_f = m,\, Y_o = n \mid X(0) = i) = \binom {i}{m} \binom {m}{n} \binom {j}{i-m+n} \\ \times \; \alpha ^{i-m} (1-\alpha )^m \, \beta ^{j-i+m-n} (1-\beta )^{i+1-m+n} \, \gamma ^n (1-\gamma )^{m-n} \end {multline*} over all $m,n$ consistent with $i$ and $j$ ($0 \leq m \leq i$ and $0 \leq n \leq \min (m, j-i+m)$), where

\begin {eqnarray} \alpha (\insrate ,\delrate ,\evoltime ) & = & \exp (-\delrate \evoltime ) \label {tkf:alpha} \\ \beta (\insrate ,\delrate ,\evoltime ) & = & \frac {\insrate \exp (-\insrate \evoltime ) - \insrate \alpha } {\delrate \exp (-\insrate \evoltime ) - \insrate \alpha } \label {tkf:beta} \\ \\ \gamma (\insrate ,\delrate ,\evoltime ) & = & 1 - \frac {\delrate \beta }{\insrate (1- \alpha )}. \label {tkf:gamma} \end {eqnarray}

A.1.4 Sufficient Statistics for Linear BDI Process

The sufficient statistics for an augmented continuously-observed path $\{X(t)\}_{0 \leq t \leq T}$ in the linear BDI process wherein immigration events are labeled as such (i.e. we observe the full path and know which births are due to immigration and which are due to reproduction) are the individual-birth count $B_\ind $, the immigration count $B_\imm $ (with $B = B_\ind + B_\imm $ the total), the death count $D$, and the time-integrated population size $S = \int _0^T X(t)\,dt$. The complete-data log-likelihood is \begin {equation} \label {eq:bdi-complete} \log P\bigl (\xpath \mid X(0),\insrate ,\delrate ,\nu \bigr ) = B_\ind \log \insrate + B_\imm \log \nu + D\log \delrate - (\insrate +\delrate )\,S - \nu T + c \end {equation} where $c$ does not depend on $(\insrate ,\delrate ,\nu )$. The natural parameters are $(\log \insrate , \log \delrate , \log \nu , -(\insrate +\delrate ))$ and the sufficient statistics $(B_\ind , D, B_\imm , S)$. This is a curved exponential-family form because the natural parameters are not independent.

Score function identity. Let $P_{ij}(T;\theta )$ denote the observed likelihood, i.e. the transition probability $P(X(T)=j \mid X(0)=i;\,\theta )$, with $\theta = (\insrate ,\delrate ,\nu )$. The score function identity states that the gradient of the observed log-likelihood equals the posterior expectation of the complete-data score: \begin {equation} \label {eq:score-identity} \nabla _\theta \log P_{ij}(T;\theta ) = \expect \!\left [\nabla _\theta \log P(\xpath \mid X(0),\theta ) \;\middle |\; X(0)=i,\, X(T)=j;\,\theta \right ]. \end {equation} No approximation is involved; (A.5) holds for any parametric family and any choice of complete/incomplete data.

Specializing to $\immrate = \insrate $ and writing $B = B_\ind + B_\imm $ for the total birth count, the complete-data log-likelihood (A.4) becomes $\log P = B \log \insrate + D \log \delrate - (\insrate +\delrate )\,S - \insrate T + c$ with two free parameters $(\insrate , \delrate )$. Differentiating and applying (A.5):

\begin {eqnarray*} \frac {\partial }{\partial \insrate } \log P_{ij}(T;\theta ) & = & \frac {\expect [B \mid i,j,T]}{\insrate } - \expect [S \mid i,j,T] - T \\ \frac {\partial }{\partial \delrate } \log P_{ij}(T;\theta ) & = & \frac {\expect [D \mid i,j,T]}{\delrate } - \expect [S \mid i,j,T] \end {eqnarray*}

Each equation involves two unknowns because the natural parameters $(\log \insrate , \log \delrate , \insrate +\delrate )$ are not independent. The conservation law $i + B - D = j$ provides the missing equation: \begin {equation} \label {eq:conservation} \expect [B \mid i,j] - \expect [D \mid i,j] = j - i \end {equation} Solving this system (using $\insrate \neq \delrate $):

\begin {eqnarray*} \expect [S \mid i,j] & = & \frac {j - i + \delrate \frac {\partial }{\partial \delrate } \log P_{ij} - \insrate \frac {\partial }{\partial \insrate } \log P_{ij} - \insrate T}{\insrate - \delrate } \\ \expect [B \mid i,j] & = & \insrate \frac {\partial }{\partial \insrate } \log P_{ij} + \insrate \, \expect [S \mid i,j] + \insrate T \\ \expect [D \mid i,j] & = & \delrate \frac {\partial }{\partial \delrate } \log P_{ij} + \delrate \, \expect [S \mid i,j] \end {eqnarray*}

Inclusion of an initial distribution $P(X(0)=L) = \kappa ^L (1-\kappa )$ brings the complete-data log-likelihood into the form \[ \log P\bigl (\xpath , X(0) \mid \insrate ,\delrate \bigr ) = B \log \insrate + D \log \delrate + L \log \kappa + M \log (1-\kappa ) - (\insrate +\delrate )\,S - \insrate T + c \] where $M=1$ represents the number of BDI trajectories observed, introduced for later generalization.

A.1.5 The TKF91 Model: Linear BDI + Finite-State CTMC

Thorne, Kishino, and Felsenstein’s 1991 model (TKF91) was introduced to describe the molecular evolution of a biological sequence (DNA, RNA, or protein) subject to insertions, deletions, and point substitutions (50).

The parameters $\theta = (\insrate , \delrate , \exch , \eqm )$ consist of the insertion rate $\insrate $, deletion rate $\delrate $, exchangeability matrix $\exch $, and equilibrium distribution $\eqm $.

The BDI part of the model, known as the links model, describes the evolution of an “immortal link” followed by zero or more “mortal links”. Each link evolves according to an independent linear BDI process; births and immigrations correspond to insertions, which generate new mortal links, and deaths correspond to deletions.

In the TKF91 model, each mortal link is associated with an observed character that evolves according to an independent finite-state CTMC with rate matrix $\exch $ and stationary distribution $\eqm $. Inserted characters are drawn from the equilibrium distribution $\eqm $.

In later TKF91-derived models, a more elaborate structure may be associated with each link; for example, in the TKF92 model, a fragment of characters of geometrically-distributed length (51).

To distinguish between these models we introduce the notation $\tkflinks (\model ;\insrate ,\delrate )$ for the links model where an independent stationary stochastic process $\state (t) \sim \model $ is associated with every mortal link. This defines a stationary stochastic process over sequences $\seq (t) \sim \tkflinks (\model ;\insrate ,\delrate )$ where $\seq $ is a sequence over the alphabet defined by the state space of $\model $.

Let $\subproc (\exch ,\eqm )$ denote the time-reversible continuous-time Markov chain with stationary distribution $\eqm $ and generator $\revsub = \exch \diag (\eqm )$, where $\exch $ is the exchangeability matrix. Then the TKF91 model is just the links model with a point substitution process transpiring at each mortal link \[ \seq (\evoltime ) \sim \tkflinks (\subproc (\exch ,\eqm );\insrate ,\delrate ) \]

A.1.6 Finite State Machines of the TKF91 Model

There are three distinct state machines associated with nodes and branches in a tree of TKF-related sequences.

TKF91 Stationary HMM The stationary distribution of the TKF91 model can be represented as a Hidden Markov Model (HMM) that emits $n \sim \geomdist (\kappa )$ characters drawn i.i.d. from $\eqm $.

TKF91 Transducer The conditional distribution of a descendant sequence $\seq (\evoltime )$ given ancestor sequence $\seq (0)$ can be represented as a Weighted Finite-State Transducer (WFST) that consumes an ancestor sequence and emits a descendant sequence, with transition weights determined by the TKF91 parameters \[ \tkfwfst (\insrate ,\delrate ,\evoltime ) = \left ( \begin {array}{r|ccccc} & \sta & \mat & \ins & \del & \fin \\ \hline \sta & 0 & (1-\beta )\alpha & \beta & (1-\beta )(1-\alpha ) & 1-\beta \\ \mat & 0 & (1-\beta )\alpha & \beta & (1-\beta )(1-\alpha ) & 1-\beta \\ \ins & 0 & (1-\beta )\alpha & \beta & (1-\beta )(1-\alpha ) & 1-\beta \\ \del & 0 & (1-\gamma )\alpha & \gamma & (1-\gamma )(1-\alpha ) & 1-\gamma \\ \fin & 0 & 0 & 0 & 0 & 0 \end {array} \right ) \] with $\mat $ aligning pairs of characters $(i,j)$ with probability $\exp (\revsub \evoltime )_{ij}$, $\ins $ emitting unaligned descendant characters $(\epsilon ,i)$ with $i \sim \eqm $, and $\del $ consuming unaligned ancestral characters $(i,\epsilon )$ without any additional weight penalty.

This transducer can be interpreted as a decomposition of each mortal-link BDI trajectory into founder survival/death, possible survival of the youngest orphan (which, by insertion order, is the leftmost surviving descendant when the founder dies), and a geometric tail of further offspring. This yields the parameters $\alpha ,\beta ,\gamma $ appearing in the WFST transition matrix and determines which transition counts correspond to birth and death events.

TKF91 Pair HMM The joint distribution of ancestor and descendant is given by the transducer composition product of the stationary HMM and the transition WFST. This product can itself be represented as a Pair HMM with states $\{\sta ,\mat ,\ins ,\del ,\fin \}$ that models the joint distribution, as opposed to the conditional distribution represented by the WFST. The transition matrix $\tkftrans $ of the Pair HMM is similar to the transition matrix $\tkfwfst $ of the transducer, with the differences that incoming transitions to $\mat $ and $\del $ states acquire a factor of $\kappa $ for the extension of the ancestral sequence and transitions to $\fin $ states acquire a factor of $1-\kappa $ for the termination of the ancestral sequence.

For concreteness, because we will refer to it frequently, the transition matrix of the TKF91 Pair HMM is \[ \tkftrans (\insrate ,\delrate ,\evoltime ) = \left ( \begin {array}{r|ccccc} & \sta & \mat & \ins & \del & \fin \\ \hline \sta & 0 & (1-\beta )\alpha \kappa & \beta \kappa & (1-\beta )(1-\alpha )\kappa & (1-\beta )(1-\kappa ) \\ \mat & 0 & (1-\beta )\alpha \kappa & \beta \kappa & (1-\beta )(1-\alpha )\kappa & (1-\beta )(1-\kappa ) \\ \ins & 0 & (1-\beta )\alpha \kappa & \beta \kappa & (1-\beta )(1-\alpha )\kappa & (1-\beta )(1-\kappa ) \\ \del & 0 & (1-\gamma )\alpha \kappa & \gamma \kappa & (1-\gamma )(1-\alpha )\kappa & (1-\gamma )(1-\kappa ) \\ \fin & 0 & 0 & 0 & 0 & 0 \end {array} \right ) \]

The emission weights of the Pair HMM are similarly related to the input/output emission weights of the WFST, with $\mat $- and $\ins $-state emissions acquiring a factor of $\eqm _i$ for the ancestral character $i$.

For the indel score identities associated with birth, death, and exposure statistics, it is convenient to work with the conditionally normalized TKF91 WFST representing $P(y|x,\theta )$, since its transition weights correspond directly to the finite-time BDI factors $\alpha ,\beta ,\gamma $. However, the M-step for $(\insrate ,\delrate )$ under the stationary TKF91 model must also include the prior contribution of the ancestral sequence length, which belongs to the joint distribution $P(x,y|\theta ) = P(x|\theta ) P(y|x,\theta )$. Accordingly, we use the conditional WFST to obtain posterior expectations of $B,D,S$ and then augment the M-step objective with the joint-model terms involving $L$ and $M$.

Appendix A.4 considers the limits and implications of these formulas in the biologically-relevant regime of long sequences, $\insrate \to \delrate $.

A.1.7 Sufficient Statistics for the TKF91 Model

For observed ancestral and descendant sequences $x,y$ under TKF91 with parameters $\theta =(\lambda ,\mu ,Q,\pi )$, and a particular hidden Pair HMM/WFST path $z$, the $(\insrate ,\delrate )$-dependent part of the complete-data log-likelihood is linear in the transition and emission counts, \[ \log P\bigl (y,z \mid x, \theta \bigr ) = \sum _{i,j} n_{ij}(z)\,\log \tkfwfst _{ij}(\lambda ,\mu ,t) \ +\ \sum _b e^{\ins }_{(\epsilon ,b)}(z)\,\log \eqm _b \ +\ \sum _{a,b} e^{\mat }_{(a,b)}(z)\,\log \exp (\revsub t)_{ab}. \]

\begin {eqnarray*} \hat n_{ij} & = & \expect [n_{ij}\mid x,y;\theta ] \\ \hat e^\ins _{(\epsilon ,b)} & = & \expect [e^\ins _{(\epsilon ,b)}\mid x,y;\theta ] \\ \hat e^\mat _{(a,b)} & = & \expect [e^\mat _{(a,b)}\mid x,y;\theta ] \\ \end {eqnarray*}

denote the posterior expectations of these counts returned by the Forward-Backward algorithm. By the score function identity,

\begin {eqnarray*} \lambda \frac {\partial }{\partial \lambda } \log P\bigl (y\mid x, \theta \bigr ) & = & \sum _{i,j} \hat n_{ij}\; \emcount ^\lambda _{ij}(\lambda ,\mu ,t) \\ \mu \frac {\partial }{\partial \mu } \log P\bigl (y\mid x,\theta \bigr ) & = & \sum _{i,j} \hat n_{ij}\; \emcount ^\mu _{ij}(\lambda ,\mu ,t) \end {eqnarray*}

where $\emcount ^\xi _{ij}(\lambda ,\mu ,t) = \xi \frac {\partial }{\partial \xi }\log \tkfwfst _{ij}(\lambda ,\mu ,t)$ for $\xi \in \{\lambda ,\mu \}$. (See Sections A.4.2.0 and A.4.3.0 for explicit formulae.) The conservation law is likewise linear in the same counts:

\begin {eqnarray} \label {eq:conservation-tkf} \expect [B \mid x,y] - \expect [D \mid x,y] & = & \hat n_{\sta \ins }+\hat n_{\mat \ins }+\hat n_{\ins \ins } - \hat n_{\sta \del }-\hat n_{\mat \del }-\hat n_{\del \del }-\hat n_{\ins \del } \\ & = & \sum _{i,j} \hat n_{ij}\; \emcount ^{\mathrm {cons}}_{ij} \end {eqnarray}

where $\emcount ^{\mathrm {cons}}_{ij} = \delta (j=\ins ) - \delta (j=\del )$ counts whether the WFST transition $i \to j$ corresponds to a birth, death, or neither. Substituting into the BDI score identities gives

\begin {eqnarray} \expect [S \mid x,y] & = & \frac { \sum _{i,j} \hat n_{ij} \left ( \emcount ^{\mathrm {cons}}_{ij} - \emcount ^\lambda _{ij} + \emcount ^\mu _{ij} \right ) - \lambda \evoltime }{\lambda -\mu } \label {eq:exposure-tkf} \\ \expect [B \mid x,y] & = & \lambda \sum _{i,j} \hat n_{ij}\; \emcount ^\lambda _{ij}(\lambda ,\mu ,t) + \lambda \,\expect [S \mid x,y] + \lambda \evoltime \label {eq:births-tkf} \\ \expect [D \mid x,y] & = & \mu \sum _{i,j} \hat n_{ij}\; \emcount ^\mu _{ij}(\lambda ,\mu ,t) + \mu \,\expect [S \mid x,y] \label {eq:deaths-tkf} \end {eqnarray}

Thus the posterior expectations of the BDI sufficient statistics $\emcount ^B({\bf n},\evoltime ) = \expect [B|x,y]$, $\emcount ^D({\bf n},\evoltime ) = \expect [D|x,y]$, and $\emcount ^S({\bf n},\evoltime ) = \expect [S|x,y]$ are affine functions of the Forward-Backward expected transition counts $\hat n_{ij}$.

\begin {eqnarray*} \expect [W_i \mid x,y] & = & \sum _{a,b} \hat {e}^\mat _{(a,b)}\; \emcount ^W_i(a,b,\evoltime )\ + \agecorrection ^W_i (x,y) \\ \expect [U_{ij} \mid x,y] & = & \sum _{a,b} \hat {e}^\mat _{(a,b)}\; \emcount ^U_{ij}(a,b,\evoltime )\ + \agecorrection ^U_{ij} (x,y) \\ \end {eqnarray*}

where the first term is the contribution from observed endpoint-matched lineages, obtained from CTMC bridge expectations conditional on $a \to b$ over time $\evoltime $. The correction terms $\agecorrection ^W_i, \agecorrection ^U_{ij}$ account for unobserved partial CTMC trajectories on lineages that are deleted before time $\evoltime $, inserted after time 0, or born and deleted within $(0,\evoltime )$. Exact evaluation of these terms would require genealogical information about the latent BDI history, including birth and death times and lineage lifetimes. In the reduced Pair-HMM EM algorithm used here, we optimize the endpoint alignment likelihood rather than the complete-data likelihood of the full birth-death-substitution process. Consequently, the reduced substitution update for $\exch $ retains only matched-lineage bridge contributions. For the equilibrium distribution $\eqm $, we additionally retain observable endpoint composition counts from insertion and deletion states, while omitting genealogical correction terms from nonterminal birth and death times and from fully transient hidden lineages.

Considering now the joint distribution $P(x,y|\theta )$ and including the sufficient statistics for the stationary composition and length distribution of ancestral sequences, as well as the compositional statistics for inserted sequences that are already included in the conditional case, the expectations are

\begin {eqnarray*} \expect [L \mid x,y] & = & \sum _{i \in \{\sta ,\mat ,\ins ,\del \}} (\hat {n}_{i\mat } + \hat {n}_{i\del }) = |x| \\ \expect [M \mid x,y] & = & 1 \\ \expect [V_i \mid x,y] & = & \sum _b \hat {e}^\mat _{(i,b)} + \hat {e}^\ins _{(\epsilon ,i)} + \hat {e}^\del _{(i,\epsilon )} + \agecorrection ^V_i (x,y) \end {eqnarray*}

where the three observable terms count, respectively, ancestral residues at match positions (drawn from $\eqm $ at time $0$ and surviving until $t$), descendant residues at insert positions (drawn from $\eqm $ at the time of insertion), and ancestral residues at delete positions (drawn from $\eqm $ at time $0$ and deleted before $t$); $\agecorrection ^V_i$ is a similar genealogical correction term, the omission of which again corresponds to assuming (for the purposes of the CTMC parameter update) that insertion and deletion events are synchronized with the trajectory endpoints, and that no transient characters are born and die within the trajectory. Thus $V_i$ counts every observable equilibrium $\eqm $-draw of character $i$ in the joint pair HMM, regardless of whether the residue subsequently underwent a match-position substitution, was deleted, or never had an ancestor (insertion). The conditional $P(\descendant | \ancestor , t)$ analogue drops the two ancestor-derived terms (matches and deletions), since the ancestral sequence is given.

A.1.8 Baum-Welch Algorithm for TKF91

Given a collection of ancestral and descendant sequences $\{ x^{(n)}, y^{(n)} \}$, the Baum-Welch algorithm iterates between an E-step and an M-step.

E-step. For each of the sequence pairs, run the Forward-Backward algorithm on the TKF91 WFST yielding the expected transition counts $\hat {n}_{ab}$ for $a,b \in \{\sta ,\mat ,\ins ,\del ,\fin \}$ and the expected emission counts $\hat {e}^\mat _{(a,b)}, \hat {e}^\ins _{(\epsilon ,b)}, \hat {e}^\del _{(a,\epsilon )}$. Use these to accumulate expectations for the sufficient statistics $S, B, D, W, U, L, M, V$ as described in the previous section. In this section we will simply refer to the accumulated expectations for these statistics using the variable names for the statistics themselves. Thus $S \equiv \expect [S \mid \{x^{(n)}, y^{(n)}\}]$ and so forth.

M-step. The goal is to maximize $\ell (\theta ) = \ell _1(\insrate ,\delrate ) + \ell _2(\exch ,\eqm )$ where

\begin {eqnarray} \ell _1(\insrate ,\delrate ) & = & (B + L) \log \insrate + (D - L - M) \log \delrate + M \log (\delrate - \insrate ) - (\insrate +\delrate )\,S - \insrate T \label {eq:ell1} \\ \ell _2(\exch ,\eqm ) & = & \sum _i V'_i \log \eqm _i + \sum _{j>i} (U_{ij} + U_{ji}) \log \exch _{ij} - \sum _{j > i} \exch _{ij} \left ( W_i \eqm _j + W_j \eqm _i \right ) \label {eq:ell2} \end {eqnarray}

using the reversible parameterization $\kappa = \insrate /\delrate $ and $\revsub _{ij} = \exch _{ij} \eqm _j$ and letting $V'_i = V_i + \sum _{j \neq i} U_{ji}$ for notational convenience. Differentiating and setting to zero yields equations the MLEs must satisfy. For $\insrate ,\delrate $ these are simultaneous quadratics

\begin {eqnarray} \partial _\insrate \ell _1 & = & \frac {B + L}{\insrate } - \frac {M}{\delrate - \insrate } - S - T = 0 \\ \partial _\delrate \ell _1 & = & \frac {D - L - M}{\delrate } + \frac {M}{\delrate - \insrate } - S = 0 \end {eqnarray}

which reduce to a single quadratic in $\kappa $. Setting $a = B + L$ and $b = D - L - M$ \begin {equation} \label {eq:kappa-quadratic} \kappa ^2 (S + T) b - \kappa \left ( S(a + b + 2M) + T(b + M) \right ) + S a = 0. \end {equation} For the coefficients $A_\kappa = (S + T) b$, $B_\kappa = -\left ( S(a + b + 2M) + T(b + M) \right )$, and $C_\kappa = S a$, the root $\kappa \in (0,1)$ will be the smaller of the two roots

\begin {eqnarray} \hat {\kappa }(B,D,L,M,S,T) & = & \frac {-B_\kappa - \sqrt {B_\kappa ^2 - 4 A_\kappa C_\kappa }}{2 A_\kappa } \label {eq:kappa-root} \\ \hat {\delrate }(B,D,L,M,S,T) & = & \frac {1}{S}\left (b + \frac {M}{1-\hat {\kappa }}\right ) \label {eq:delrate-root} \\ \hat {\insrate }(B,D,L,M,S,T) & = & \hat {\kappa } \hat {\delrate } \label {eq:insrate-root} \end {eqnarray}

\begin {eqnarray} \partial _{\exch _{ij}} \ell _2 & = & \frac {U_{ij} + U_{ji}}{\exch _{ij}} - \left ( W_i \eqm _j + W_j \eqm _i \right ) = 0 \\ \partial _{\eqm _i} \ell _2 & = & \frac {V'_i}{\eqm _i} - \sum _{j \neq i} \exch _{ij} W_j + \eta = 0 \end {eqnarray}

where in the second equation $\eta $ is a Lagrange multiplier for the constraint $\sum _i \eqm _i = 1$. These equations are nonlinearly coupled and do not, in general, have a closed-form solution, nor even a closed-form iterative solution. There are several practical alternatives:

A.1.9 Extending TKF91 to Phylogenetic Trees

Hein (18) showed how to compute a multi-sequence Forward likelihood for TKF91 on a binary tree. Holmes and Bruno (24) used a similar approach to compute posterior marginals for Gibbs sampling of ancestral sequences and alignments. Lunter et al showed how to calculate alignment likelihoods efficiently (36). Suchard and Redelings (42) extended the MCMC sampling approach to jointly sample over MSAs and trees. Westesson et al. (52) showed how this approach can be seen as a generalization of Felsenstein’s pruning algorithm for substitution-only models, where the WFST plays the role of a matrix exponential defined on whole sequences, with the WFST’s Forward algorithm computing the entries of this matrix exponential. The linear algebraic interpretation was further remarked upon by Bouchard-Côté (? ).

A.2 The TKF92 Model

In sequence-alignment terms, TKF92 is the classical fragment-based extension of TKF91 and induces gap behavior analogous to an affine gap penalty (51). Instead of each mortal link getting a single character from $\alphabet $, each link is associated with a fragment consisting of $\fraglen \geq 1$ characters from $\alphabet $, where $\fraglen \sim \geomdist (\ext )$. We write this as \[ \seq '(\evoltime ) \sim \tkffrags (\subproc (\exch ,\eqm );\insrate ,\delrate ,\ext ) \] where $\seq ' \in (\alphabet ^+)^\ast $ is a sequence of multi-character fragments. Specifically \[ \tkffrags (\subproc (\exch ,\eqm );\insrate ,\delrate ,\ext ) \equiv \tkflinks (\fragproc (\subproc (\exch ,\eqm );\ext );\insrate ,\delrate ) \] where $\fragproc (\model ,\ext )$ denotes a tuple of $\fraglen $ i.i.d. random variables, each governed by $\model $, and $\fraglen $ is geometrically-distributed with parameter $\ext $ \[ \state \sim \fragproc (\model ,\ext )\ \Leftrightarrow \ \fraglen \sim \geomdist (\ext ), \ \state = (x_1 \ldots x_\fraglen ),\ x_\sumidx \sim \model \] The TKF92 model thus has the same BDI process as TKF91, but with a more complex substitution process that emits fragments instead of single characters.

A.2.1 Latent Information in TKF92

The mean length of an insertion in TKF92 is $1/(1-\ext )$, i.e. the expected fragment length. For deletions, things are slightly more complicated. Superficially, it looks like deletions are also fragments so they should have the same mean length. However, if we do not have knowledge of the fragment boundaries, but are conditioning just on the current sequence length, we have to weight this by the posterior probability that the subsequence being deleted does in fact constitute a single fragment. For a subsequence of length $k$, this posterior probability is proportional to $\left (\frac {\ext }{\ext + (1-\ext )\frac {\insrate }{\delrate }}\right )^{k-1}$ (indels at the end of the sequence have a slightly different correction factor since they are more likely to constitute a fragment, but the $k$-dependence is the same). The rate of starting an insertion or deletion at a particular site is subject to a similar correction factor.

This illustrates how the derivation of the WFST from the Pair HMM of such models becomes more complicated as the models carry greater amounts of latent information in their state space.

A.2.2 Singlet HMM, Pair HMM, and WFST Representations

The TKF92 Singlet HMM, Pair HMM, and WFST have the same state spaces $\{\sta ,\mat ,\ins ,\del ,\fin \}$ as TKF91, but with transition probabilities modified to account for fragment extension. Within a link’s fragment, each character beyond the first extends the fragment with probability $\ext $, so the fragment terminates with probability $1-\ext $.

Singlet HMM (stationary distribution). The total sequence length in TKF92 is a geometric sum of geometric fragment lengths. Since $N \sim \geomdist (\kappa )$ links and each fragment has length $\fraglen \sim \geomdist (\ext )$, the total length $L = \sum _{i=1}^N \fraglen _i$ is distributed as $P(L=0) = 1-\kappa $ and $P(L=\ell \mid L \geq 1) = (1-p)\,p^{\ell -1}$ where $p = \ext + \kappa (1-\ext )$ is the effective continuation probability. The HMM emits no characters with probability $1-\kappa $, emits a first character with probability $\kappa $, and then continues with probability $p$ after each emitted character. This is equivalent to a zero-inflated geometric distribution with parameters $\kappa $ and $p$.

Pair HMM (finite-time joint distribution). Each emitting state has a fragment extension self-loop with probability $\ext $. On fragment termination (probability $1-\ext $), the TKF91 link-level transitions apply. The Pair HMM transition matrix is \[ \tkftrans '(\insrate ,\delrate ,\evoltime ,\ext ) = \left ( \begin {array}{r|ccccc} & \sta & \mat & \ins & \del & \fin \\ \hline \sta & 0 & \tkftrans _{\sta \mat } & \tkftrans _{\sta \ins } & \tkftrans _{\sta \del } & \tkftrans _{\sta \fin } \\ \mat & 0 & \ext + (1-\ext )\tkftrans _{\mat \mat } & (1-\ext )\tkftrans _{\mat \ins } & (1-\ext )\tkftrans _{\mat \del } & (1-\ext )\tkftrans _{\mat \fin } \\ \ins & 0 & (1-\ext )\tkftrans _{\ins \mat } & \ext + (1-\ext )\tkftrans _{\ins \ins } & (1-\ext )\tkftrans _{\ins \del } & (1-\ext )\tkftrans _{\ins \fin } \\ \del & 0 & (1-\ext )\tkftrans _{\del \mat } & (1-\ext )\tkftrans _{\del \ins } & \ext + (1-\ext )\tkftrans _{\del \del } & (1-\ext )\tkftrans _{\del \fin } \\ \fin & 0 & 0 & 0 & 0 & 0 \end {array} \right ) \] where $\tkftrans _{ab}(\insrate ,\delrate ,\evoltime )$ are the TKF91 Pair HMM transition probabilities. Each row sums to 1: for $a \in \{\mat ,\ins ,\del \}$, $\ext + (1-\ext )\sum _b \tkftrans _{ab} = \ext + (1-\ext ) = 1$.

WFST (finite-time conditional distribution). Forming a WFST for TKF92 is complicated by the latent information associated with fragment boundaries. If those boundaries are known, one can construct a WFST that respects them exactly. If they are not known, one can still construct a plausible WFST that imputes them on the fly. This trick, however, ceases to be straightforward for more highly decorated descendants of TKF91. Alternatively, one can expose the fragment boundary information in the transducer alphabet, at the cost of reifying indivisibility of fragments over time. The choice between these approaches is non-obvious and presumably application-dependent. The explicit TKF92 WFST construction, obtained by dividing the Pair HMM by the Singlet HMM, is given in Appendix A.3. That construction requires a $6{\times }6$ state space ($\sta ,\mat ,\ins _0,\ins _1,\del ,\fin $) rather than the $5{\times }5$ form one might naively borrow from TKF91, because the singlet’s outgoing-emission weight differs between the start state ($\kappa $ for the first ancestor character) and any subsequent emit state ($p = \ext + (1{-}\ext )\kappa $), so the insert state must be split into a pre-immortal-link variant ($\ins _0$) and a post-immortal-link variant ($\ins _1$). The resulting WFST is conditionally normalised in the global sense: for any input ancestor sequence $x$, the total weight of all paths from $\sta $ to $\fin $ that read $x$ exactly equals $1$, summing over all alignments and descendant outputs. Equivalently, $\text {Singlet}\circ \tkfwfst ''=\tkftrans ''$ is row-stochastic on the $6$-state space. Local single-step row sums and per-state $\varepsilon $-closure normalisation hold for the $\sta ,\mat ,\ins _0,\del $ rows but not for the $\ins _1$ row, which is improper in isolation but accumulates the correct $\kappa /p$ factor whenever it is reached from a proper source state along an $\sta $-rooted path (Appendix A.3.7). The same trade-off recurs for MixDom WFSTs in Section C.1.5, and is discussed more carefully in Appendix A.3.7.

These issues are not unique to TKF92, but are a general feature of models with latent information that is not directly observable in the sequence data. They are dealt with at greater length in the section on MixDom WFSTs (Section C.1.5), where the same issues arise in a more complex setting. To state the general principle: as TKF91-derived models acquire additional latent decoration, that decoration must either be promoted to explicit symbols in the interface alphabet of the transducer construction, or integrated out, in which case the resulting transducer is only approximate.

A.2.3 Baum-Welch Algorithm for TKF92

The Baum-Welch algorithm for TKF92 is similar to that for TKF91, but we must now resolve the Pair HMM’s self-looping transition counts onto their separate components (fragment extension vs. new link), and accumulate fragment-level sufficient statistics in addition to the substitution-level sufficient statistics.

E-step. Run Forward-Backward on the TKF92 Pair HMM. This yields expected transition counts $\hat {n}'_{ab}$. For $a \in \{\mat ,\ins ,\del \}$, the self-loop count $\hat {n}'_{aa}$ combines fragment extension (probability $\ext $) and new-link-same-state (probability $(1-\ext )\tkftrans _{aa}$). The expected number of fragment extensions from state $a$ is $\hat {n}'_{aa} \cdot \ext / (\ext + (1-\ext )\tkftrans _{aa})$ and the expected number of new-link self-transitions is the remainder. Thus, we accumulate new expectations for the additional fragment-level sufficient statistics, $F$ the number of fragment extensions and $E$ the number of fragment ends, and then recover the TKF91 link-level counts by removing the fragment extensions

\begin {eqnarray*} F_a & = & \frac {\hat {n}'_{aa} \cdot \ext }{\ext + (1-\ext )\tkftrans _{aa}},\ a \in \{\mat ,\ins ,\del \} \\ F & = & \sum _{a \in \{\mat ,\ins ,\del \}} F_a \\ E & = & \sum _{a \in \{\mat ,\ins ,\del \}} \sum _b \hat {n}'_{ab} - F \\ \hat {n}_{ab} & = & \hat {n}'_{ab} - \delta _{ab} F_a \end {eqnarray*}

We then proceed to accumulate $S, B, D, W, U, L, M, V$ as for TKF91. Note that $L$ counts links, which are now fragments rather than simply residues: $L = \hat {n}_\kappa = \sum _{i} (\hat {n}_{i\mat } + \hat {n}_{i\del })$ where $\hat {n}$ is the resolved (TKF91-level) count matrix. The character-level statistics $W,U,V$ are accumulated per character exactly as in TKF91, because fragment extension affects the indel process estimates but not the substitution process estimates.

M-step. This is identical to TKF91, except we also set $\ext \leftarrow F / (F + E)$.

A.2.4 Maraschino: Distilled Cherries

The TKF92 Baum-Welch algorithm, as with other instances of the Expectation Maximization algorithm, is most useful if there is hidden information that we need to marginalize during parameter estimation; e.g. an unknown alignment that needs to be summed over, or (as in Section C.1.1) latent site or fragment classes that are not directly observed.

For many applications, we can take the alignment data (and potentially site classes) as a given, and then a more efficient estimation strategy becomes available. This is the CherryML approach: a composite likelihood consisting of a product of alignments likelihoods for nearest-neighbor sibling pairs, a.k.a. “cherries” (40). The CherryML developers observe that, given cherry branch lengths, the alignment substitution counts are sufficient to estimate substitution rates and can be performed efficiently using autograd, independently of data size except for a fast preprocessing step We observe that this can be generalized to TKF92 by including the alignment transition statistics i.e. the sixteen transition counts $n_{\xstate \to \ystate }$ where $\xstate \in \{\sta ,\mat ,\ins \,del\}$ and $\ystate \in \{\mat ,\ins ,\del ,\fin \}$ states. The CherryML approach can then be applied: 1) discretize times, yielding a fixed-shape tensor of summary counts; 2) write down the observed data likelihood in terms of the summary counts; 3) maximize using autograd and standard gradient-based optimizers.

We call this algorithm Maraschino, as it involves distilling cherries to state machines.

A.3 TKF92 WFST by Singlet Division

This appendix derives the TKF92 conditional WFST $\tkfwfst '(\insrate ,\delrate ,\evoltime ,\ext )$ representing $P(\text {descendant} \mid \text {ancestor}, \theta )$ by dividing the TKF92 Pair HMM by the TKF92 Singlet HMM.

The TKF91 WFST (Section A.1.6) is a $5{\times }5$ matrix obtained from the TKF91 Pair HMM $\tkftrans $ by stripping $\kappa $ from ancestor-consuming columns ($\mat ,\del $) and $1{-}\kappa $ from the end column ($\fin $). The TKF92 WFST cannot be expressed as a $5{\times }5$ matrix in the same state space. As we show in Section A.3.2, the singlet’s outgoing-emission weight differs between its $\sta $ state and its emit state, so the two cases must be carried as distinct insert states (Ins0 and Ins1) in the WFST. The natural object is therefore $6{\times }6$, with state space $\{\sta ,\mat ,\ins _0,\ins _1,\del ,\fin \}$. A $5{\times }5$ presentation that lumps $\ins _0$ and $\ins _1$ into a single $\ins $ state and uses the emit-state divisor $p$ throughout is incorrect for paths that traverse $\sta \to \ins \to \cdots \to \{\mat ,\del \}$; this is discussed in Section A.3.6.

A.3.1 TKF92 Singlet HMM Transitions

The TKF92 singlet at stationarity emits sequences whose length $L$ follows a zero-inflated geometric distribution (Section A.2.2). The singlet HMM has two transition contexts:

The effective continuation probability $p$ decomposes as: fragment extension with probability $\ext $, or fragment termination ($1{-}\ext $) followed by a new nonempty link ($\kappa $). Each emitted character is drawn i.i.d. from $\eqm $.

The key consequence for the WFST construction below is that $\kappa \neq p$ whenever $\ext > 0$.

A.3.2 Why an Ins0 / Ins1 Split is Necessary

Transducer composition $\text {Singlet} \circ \text {WFST} = \text {Pair HMM}$ couples each ancestor-consuming Pair HMM transition to one of the two singlet emission events:

The Pair HMM’s $\ins $ state corresponds to “the descendant is in the middle of a descendant-only fragment.” This state is reached either before the first ancestor character has been consumed (entered from $\sta $, possibly via $\ins \to \ins $ self-loops) or after $\geq 1$ ancestor characters have been consumed (entered from $\mat ,\del ,\ins $). In the first case, the next ancestor-consuming transition out of $\ins $ corresponds to the $\sta \to \text {emit}$ event in the singlet (weight $\kappa $); in the second case, it corresponds to $\text {emit}\to \text {emit}$ (weight $p$). The two divisors differ unless $\ext =0$.

The $5{\times }5$ Pair HMM $\tkftrans '(\insrate ,\delrate ,\evoltime ,\ext )$ of Section A.2.2 marginalises over the “has any ancestor character been consumed yet?” bit and is correct as a generator of the joint distribution. But because that bit controls which singlet divisor applies, the WFST construction must expose it. We therefore split the Pair HMM $\ins $ state into

The $6{\times }6$ Pair HMM in this state space is row-stochastic (verified below), and the $5{\times }5$ Pair HMM is recovered by lumping $\ins _0$ and $\ins _1$ into a single $\ins $ row. The lumping preserves the joint distribution because the outgoing transitions from $\ins _0$ and $\ins _1$ have identical Pair HMM structure once weighted by their singlet contributions; what fails to lump is the WFST.

A.3.3 The $6{\times }6$ TKF92 Pair HMM

Writing $\alpha ,\beta ,\gamma ,\kappa $ for the TKF link-level parameters (as in Section A.1.6), and $r=\ext $ for the fragment-extension probability:

\begin {eqnarray} \lefteqn {\tkftrans ''(\insrate ,\delrate ,\evoltime ,\ext ) \;= \quad } \label {eq:tkf92-pair-hmm-6x6} \\ & & \renewcommand {\arraystretch }{1.6} \left ( \begin {array}{r|cccccc} & \sta & \mat & \ins _0 & \ins _1 & \del & \fin \\\hline \sta & 0 & (1{-}\beta )\kappa \alpha & \beta & 0 & (1{-}\beta )\kappa (1{-}\alpha ) & (1{-}\beta )(1{-}\kappa ) \\ \mat & 0 & r + (1{-}r)(1{-}\beta )\kappa \alpha & 0 & (1{-}r)\beta & (1{-}r)(1{-}\beta )\kappa (1{-}\alpha ) & (1{-}r)(1{-}\beta )(1{-}\kappa ) \\ \ins _0 & 0 & (1{-}r)(1{-}\beta )\kappa \alpha & r + (1{-}r)\beta & 0 & (1{-}r)(1{-}\beta )\kappa (1{-}\alpha ) & (1{-}r)(1{-}\beta )(1{-}\kappa ) \\ \ins _1 & 0 & (1{-}r)(1{-}\beta )\kappa \alpha & 0 & r + (1{-}r)\beta & (1{-}r)(1{-}\beta )\kappa (1{-}\alpha ) & (1{-}r)(1{-}\beta )(1{-}\kappa ) \\ \del & 0 & (1{-}r)(1{-}\gamma )\kappa \alpha & 0 & (1{-}r)\gamma & r + (1{-}r)(1{-}\gamma )\kappa (1{-}\alpha ) & (1{-}r)(1{-}\gamma )(1{-}\kappa ) \\ \fin & 0 & 0 & 0 & 0 & 0 & 0 \end {array} \right ) \nonumber \end {eqnarray}

The structural zeros are: $\sta $ cannot reach $\ins _1$ in one step (no ancestor has yet been emitted), and $\mat ,\del ,\ins _1$ cannot reach $\ins _0$ in one step (an ancestor has already been emitted). The $\ins _0$ and $\ins _1$ rows differ only in their self-loop column ($\ins _0\to \ins _0$ vs $\ins _1\to \ins _1$); their entries to $\mat ,\del , \fin $ are identical.

Row-stochasticity. For each non-$\fin $ source row, the row sum equals \[ r + (1{-}r)\Bigl [\beta + (1{-}\beta )\kappa \alpha + (1{-}\beta )\kappa (1{-}\alpha ) + (1{-}\beta )(1{-}\kappa )\Bigr ] = r + (1{-}r)\bigl [\beta + (1{-}\beta )\bigr ] = 1 \] (replacing $\beta $ with $\gamma $ for the $\del $ row), confirming that $\tkftrans ''$ is a valid Markov transition matrix.

Recovery of the $5{\times }5$ Pair HMM. Lumping $\ins _0$ and $\ins _1$ into a single $\ins $ state preserves all four non-$\ins $-column entries (which are identical between the two rows by construction). In the self-loop column, $\ins _0\to \ins _0$ and $\ins _1\to \ins _1$ both equal $r+(1{-}r)\beta $, so the lumped self-loop is also $r+(1{-}r)\beta $. The result is the $5{\times }5$ Pair HMM of Section A.2.2, recovered exactly.

A.3.4 The $6{\times }6$ TKF92 WFST

Each Pair HMM transition consumes either one ancestor character ($\to \mat $ or $\to \del $), zero ancestor characters ($\to \ins _0$ or $\to \ins _1$), or signals termination ($\to \fin $). The WFST $\tkfwfst ''$ is obtained by dividing each Pair HMM entry by the corresponding singlet weight:

The emission factor $\eqm _\anctok $ is also stripped from the emission weights, exactly as in TKF91, so that $\mat $-state emission becomes $\exp (\revsub \evoltime )_{\anctok \destok }$ and $\del $-state emission becomes $1$; we focus on the transition matrix below.

\begin {eqnarray} \lefteqn {\tkfwfst ''(\insrate ,\delrate ,\evoltime ,\ext ) \;= \quad } \label {eq:tkf92-wfst} \\ & & \renewcommand {\arraystretch }{1.7} \left ( \begin {array}{r|cccccc} & \sta & \mat & \ins _0 & \ins _1 & \del & \fin \\\hline \sta & 0 & (1{-}\beta )\alpha & \beta & 0 & (1{-}\beta )(1{-}\alpha ) & 1{-}\beta \\ \mat & 0 & \dfrac {r + (1{-}r)(1{-}\beta )\kappa \alpha }{p} & 0 & (1{-}r)\beta & \dfrac {(1{-}r)(1{-}\beta )\kappa (1{-}\alpha )}{p} & 1{-}\beta \\ \ins _0 & 0 & (1{-}r)(1{-}\beta )\alpha & r + (1{-}r)\beta & 0 & (1{-}r)(1{-}\beta )(1{-}\alpha ) & (1{-}r)(1{-}\beta ) \\ \ins _1 & 0 & \dfrac {(1{-}r)(1{-}\beta )\kappa \alpha }{p} & 0 & r + (1{-}r)\beta & \dfrac {(1{-}r)(1{-}\beta )\kappa (1{-}\alpha )}{p} & 1{-}\beta \\ \del & 0 & \dfrac {(1{-}r)(1{-}\gamma )\kappa \alpha }{p} & 0 & (1{-}r)\gamma & \dfrac {r + (1{-}r)(1{-}\gamma )\kappa (1{-}\alpha )}{p} & 1{-}\gamma \\ \fin & 0 & 0 & 0 & 0 & 0 & 0 \end {array} \right ) \nonumber \end {eqnarray}

The $\ins _1$ row uses the $p$ divisor on its $\mat $ and $\del $ exits and $1{-}\beta $ on its $\fin $ exit, matching the post-immortal-link regime. The $\ins _0$ row uses $\kappa $ on its $\mat ,\del $ exits (absorbed into the numerator: $(1{-}r)(1{-}\beta )\kappa \alpha /\kappa = (1{-}r)(1{-}\beta )\alpha $, etc.) and $1{-}\kappa $ on its $\fin $ exit (giving $(1{-}r)(1{-}\beta )$). These are exactly the TKF91 WFST $\sta /\mat /\ins $-row outputs scaled by $(1{-}r)$, reflecting the leading-extension fragment’s chance of terminating before any ancestor is consumed.

A.3.5 Structure and Verification

Sanity check 1: $\sta $ row. The $\sta $ row inherits the TKF91 form because the first character of the immortal-link descendant fragment has no extension self-loop. Multiplying by the singlet emission weights from $\sta $ ($\kappa $ for $\mat ,\del $, $1$ for $\ins _0$, $1{-}\kappa $ for $\fin $) recovers the Pair HMM $\sta $ row of (??).

Sanity check 2: $\ins _0$ row. Multiplying $\tkfwfst ''$ entries by the appropriate singlet weights (“from $\sta $” weights: $\kappa $ for $\mat /\del $, $1$ for $\ins _0$, $1{-}\kappa $ for $\fin $): \begin {align*} \kappa \cdot (1{-}r)(1{-}\beta )\alpha &= (1{-}r)(1{-}\beta )\kappa \alpha \quad =\;\tkftrans ''_{\ins _0\mat },\\ 1\cdot [r+(1{-}r)\beta ] &= r+(1{-}r)\beta \quad =\;\tkftrans ''_{\ins _0\ins _0},\\ \kappa \cdot (1{-}r)(1{-}\beta )(1{-}\alpha ) &= (1{-}r)(1{-}\beta )\kappa (1{-}\alpha ) \quad =\;\tkftrans ''_{\ins _0\del },\\ (1{-}\kappa )\cdot (1{-}r)(1{-}\beta ) &= (1{-}r)(1{-}\beta )(1{-}\kappa ) \quad =\;\tkftrans ''_{\ins _0\fin }. \end {align*}

All four entries match the Pair HMM, so $\text {Singlet}\circ \tkfwfst ''= \tkftrans ''$ on the $\ins _0$ row.

Sanity check 3: $\mat $ row. Multiplying by emit-state weights ($p$ for $\mat /\del $, $1$ for $\ins _1$, $(1{-}\ext )(1{-}\kappa )$ for $\fin $): \begin {align*} p\cdot \frac {r+(1{-}r)(1{-}\beta )\kappa \alpha }{p} &= r+(1{-}r)(1{-}\beta )\kappa \alpha \;=\;\tkftrans ''_{\mat \mat },\\ 1\cdot (1{-}r)\beta &= (1{-}r)\beta \;=\;\tkftrans ''_{\mat \ins _1},\\ p\cdot \frac {(1{-}r)(1{-}\beta )\kappa (1{-}\alpha )}{p} &= (1{-}r)(1{-}\beta )\kappa (1{-}\alpha ) \;=\;\tkftrans ''_{\mat \del },\\ (1{-}r)(1{-}\kappa )\cdot (1{-}\beta ) &= (1{-}r)(1{-}\beta )(1{-}\kappa ) \;=\;\tkftrans ''_{\mat \fin }. \end {align*}

The $\ins _1$ and $\del $ rows verify identically (with $\gamma $ for $\beta $ in the $\del $ row).

TKF91 limit ($\ext =0$). At $\ext =0$, $p=\kappa $ and $r=0$. The $\ins _0$ and $\ins _1$ rows become: \begin {align*} \tkfwfst ''_{\ins _0\,*\;|\,r=0} &= \bigl (0,\;(1{-}\beta )\alpha ,\;\beta ,\;0,\;(1{-}\beta )(1{-}\alpha ),\;1{-}\beta \bigr ),\\ \tkfwfst ''_{\ins _1\,*\;|\,r=0} &= \bigl (0,\;(1{-}\beta )\alpha ,\;0,\;\beta ,\;(1{-}\beta )(1{-}\alpha ),\;1{-}\beta \bigr ). \end {align*}

With the $\ins _0$ and $\ins _1$ self-loops both equal to $\beta $ and all other columns identical, the two rows reduce to the same TKF91-Ins row when $\ins _0,\ins _1$ are merged. The TKF91 WFST (Section A.1.6) is recovered.

Latent fragment-boundary posterior. The $\mat $ row’s $\mat \to \mat $ entry decomposes as \[ \tkfwfst ''_{\mat \mat } = \underbrace {\frac {r}{p}}_{\substack {\text {same}\\\text {frag}}} \cdot 1 \;+\; \underbrace {\frac {(1{-}r)\kappa }{p}}_{\substack {\text {new}\\\text {frag}}} \cdot (1{-}\beta )\alpha , \] exactly as in the original construction; under same-fragment continuation the descendant certainly stays in $\mat $ (the link persists within the fragment), and under new-fragment start the descendant must survive ($1{-}\beta $) and align ($\alpha $). The fragment-boundary posterior $r/p + (1{-}r)\kappa /p = 1$ holds by definition of $p$. The $\ins _0$ row has no same-fragment ancestor-consuming term — since the singlet has not yet emitted, the fragment-boundary posterior is degenerate at “new fragment” — which is why the $\ins _0\to \mat $ entry has the simple $(1{-}r)(1{-}\beta )\alpha $ form rather than a $1/p$-divided form.

A.3.6 Comparison with the $5{\times }5$ Form

A $5{\times }5$ presentation that replaces the two rows $\ins _0,\ins _1$ of (??) by a single $\ins $ row equal to the $\ins _1$ row is correct only at $\ext =0$. For $\ext >0$, that $5{\times }5$ form misweights paths $\sta \to \ins \to \cdots \to \{\mat ,\del ,\fin \}$ with one or more leading inserts before the first ancestor consumption: the divisor on the leading-insert exit should be $\kappa $ (or $1{-}\kappa $ for $\fin $) rather than $p$ (or $(1{-}\ext )(1{-}\kappa )$). For a single-leading-insert path $\sta \to \ins \to \mat \to \fin $, the multiplicative discrepancy is $(1{-}\ext )\kappa /p < 1$ when $\ext >0$, so the $5{\times }5$ WFST underweights such paths.

For benchmarks dominated by alignments without leading inserts (typical when the ancestor’s $N$-terminus is well-conserved), the numerical effect of the discrepancy is small per pair. However, the $5{\times }5$ form does not represent the true conditional distribution $P(\text {descendant} \mid \text {ancestor}, \theta )$, and the $6{\times }6$ WFST in (??) is the construction we recommend.

A.3.7 Normalization Structure

The WFST is conditionally normalised in the global sense: for every input sequence $x \in \Sigma ^*$, \[ \sum _{(y,\,\text {alignment})} \bigl [\text {WFST path-weight}\bigr ] \;=\; 1, \] where the sum is over all paths from $\sta $ to $\fin $ that read $x$ exactly. This is a direct consequence of $\text {Singlet} \circ \tkfwfst '' = \tkftrans ''$ (verified row by row in Section A.3.5) combined with the fact that $\tkftrans ''$ is row-stochastic and the Singlet HMM defines a proper distribution on $x$.

The WFST is not stochastic in the local single-step sense: its rows do not sum to $1$, and the divisor structure (ancestor-consuming columns divide by $\kappa $ from $\sta ,\ins _0$ and by $p$ from $\mat ,\ins _1,\del $; the $\fin $ column divides by $1{-}\kappa $ from $\sta ,\ins _0$ and by $(1{-}\ext )(1{-}\kappa )$ from $\mat ,\ins _1,\del $; insert columns divide by $1$) means a fixed source row’s outgoing weights are not constrained to sum to $1$.

$\varepsilon $-closure conditional normalisation. A more useful local property is the following. Define $w_X^L$ as the total WFST weight of paths starting at state $X$ that consume the next non-$\varepsilon $ input symbol $L$ (a character or end-of-input sentinel $\$\,$), where each path is a (possibly empty) geometric chain of insert self-loops followed by a single input-consuming exit. For our $6{\times }6$ WFST: \begin {equation} \begin {array}{c|cc} X & w_X^c & w_X^\$ \\\hline \sta & 1 & 1 \\ \mat & 1 & 1 \\ \ins _0 & 1 & 1 \\ \ins _1 & \kappa /p & 1/(1-\ext ) \\ \del & 1 & 1 \end {array} \label {eq:wfst-eps-closure-table} \end {equation} The $\sta ,\mat ,\ins _0,\del $ rows are conditionally normalised: each sums to exactly $1$. The $\ins _1$ row alone is not, because its destination divisors mix the $r$ (extension) and $1{-}r$ (exit) weights asymmetrically into the single-step edges.

This local non-normalisation does not affect the WFST’s correctness as a representation of $P(\text {descendant}\mid \text {ancestor},\theta )$: every WFST path begins at $\sta $, and from $\sta $ the chain-through-$\ins _1$ contribution accumulates exactly the right $\kappa /p$ factor at the appropriate step to bring the total back to $1$. Concretely, starting from $\mat $ the $\varepsilon $-closure weight to consume the next character is \[ w_\mat ^c =\bigl (\tkfwfst ''_{\mat \mat }+\tkfwfst ''_{\mat \del }\bigr ) +\tkfwfst ''_{\mat \ins _1}\cdot \frac {\tkfwfst ''_{\ins _1\mat }+\tkfwfst ''_{\ins _1\del }}{1-\tkfwfst ''_{\ins _1\ins _1}} =\frac {r+(1{-}r)(1{-}\beta )\kappa }{p}+(1{-}r)\beta \cdot \frac {\kappa }{p} =\frac {r+(1{-}r)\kappa }{p}=1, \] so the $\kappa /p$ factor that the $\ins _1$ row alone undershoots is exactly recovered when $\ins _1$ is reached from a “proper” source state. The $\ins _1$ row is improper only if the WFST were started mid-stream from $\ins _1$ in isolation, which never happens in normal use.

Per-input single-step row sums. The simpler invariant $\sum _b\tkfwfst ''_{Xb}=1$ does not hold for any of the body rows. This is a state-encoding artefact of folding the Bernoulli-$r$ extension-vs-exit decision into the same outgoing edges that carry the destination-singlet factor. An alternative WFST that exposes the extension event as a silent “decision” state would be locally row-stochastic at the cost of more states. The compact $6{\times }6$ form trades local stochasticity for a smaller graph; the global property $\text {Singlet}\circ \tkfwfst ''=\tkftrans ''$ (with $\tkftrans ''$ row-stochastic) is preserved either way.

A.3.8 Testable Invariants

The following invariants hold for any valid parameters $0 < \insrate , \delrate $, $\evoltime > 0$, $0 \leq \ext < 1$ and can be verified numerically from the WFST tensor code.

A.4 Equal-Rate Limits for TKF Parameters

For biologically realistic equilibrium lengths under TKF91/TKF92, one is typically driven into the near-critical regime $\insrate \approx \delrate $ ($\kappa \approx 1$). Under TKF91, the expected equilibrium sequence length is $\kappa /(1-\kappa )$ characters. Under TKF92 with fragment extension probability $\ext $, it is $\kappa /((1-\ext )(1-\kappa ))$ characters (over $\kappa /(1-\kappa )$ fragments). The average protein in UniProtKB/Swiss-Prot is approximately 360 amino acids long (49). Even under TKF92 with a mean indel length of $1/(1-\ext ) \approx 3$ residues (matching the empirically observed predominance of 1–5 residue indels in protein evolution (1, 7)), the required $\kappa $ is $120/121 \approx 0.992$. Under TKF91, $\kappa = 360/361 \approx 0.997$. For DNA sequences (a typical prokaryotic ORF or eukaryotic intron-free coding sequence is $\sim $1000 bp (53); intron-bearing eukaryotic genes and syntenic genomic regions can be much longer), $\kappa > 0.999$. Since $1 - \kappa = (\delrate - \insrate )/\delrate $, these biological sequence lengths force $\insrate $ very close to $\delrate $.

The TKF91 transition probabilities $\alpha , \beta , \gamma $ and the BDI score/sufficient statistics all contain factors of $(\delrate - \insrate )$ in denominators. When $\insrate \to \delrate $ ($\kappa \to 1$), these become $0/0$ indeterminate forms. This section derives the L’Hôpital limits, providing numerically stable expressions in the equal-rate regime.

Let $\varepsilon = \delrate - \insrate \to 0$ and $s = \delrate \,\evoltime $. We parameterize $\insrate = \delrate - \varepsilon $ and take $\varepsilon \to 0^+$.

Throughout, we write $\Phi = 1 - \alpha $ for the complement of the survival probability ($\Phi $ is well-defined at any $(\insrate ,\delrate )$). In the equal-rate regime, where $s = \delrate \evoltime $ and $\alpha = e^{-s}$, we additionally define: \[ \phi = \Phi /s = (1-e^{-s})/s \] Note $\phi (0) = 1$ and $\phi (s) \to 0$ as $s \to \infty $.

A.4.1 TKF91 Transition Parameters

$\beta $ (insertion continuation probability). \begin {equation} \beta = \frac {\insrate (e^{-\insrate \evoltime } - e^{-\delrate \evoltime })} {\delrate \,e^{-\insrate \evoltime } - \insrate \,e^{-\delrate \evoltime }} \end {equation} Write $e^{-\insrate \evoltime } = \alpha \,e^{\varepsilon \evoltime }$ and let $u = \varepsilon \evoltime $. Then: \begin {align*} \text {Numerator} &= (\delrate - \varepsilon )\,\alpha \,(e^u - 1) \\ \text {Denominator} &= \alpha \bigl (\delrate \,e^u - \delrate + \varepsilon \bigr ) \end {align*}

Cancelling $\alpha $ and expanding $e^u = 1 + u + u^2/2 + \cdots $ with $u = \varepsilon \evoltime $: \begin {align*} \text {Numer} &= (\delrate - \varepsilon )(\varepsilon \evoltime + \varepsilon ^2\evoltime ^2/2 + \cdots ) = \varepsilon \bigl (\delrate \evoltime + O(\varepsilon )\bigr ) \\ \text {Denom} &= \delrate \varepsilon \evoltime + \varepsilon (1 + O(\varepsilon )) = \varepsilon \bigl (\delrate \evoltime + 1 + O(\varepsilon )\bigr ) \end {align*}

After cancelling $\varepsilon $: \begin {equation} \boxed {\beta \;\xrightarrow {\;\insrate \to \delrate \;}\; \frac {s}{1+s}} \label {eq:beta-limit} \end {equation}

$\gamma $ (orphan insertion probability). Using $\gamma = 1 - \delrate \beta /(\insrate (1-\alpha ))$: \begin {equation} \boxed {\gamma \;\xrightarrow {\;\insrate \to \delrate \;}\; 1 - \frac {1}{(1+s)\,\phi } = \frac {(1+s)\phi - 1}{(1+s)\phi }} \label {eq:gamma-limit} \end {equation}

Complementary parameters. \begin {align} 1-\beta &\;\to \; \frac {1}{1+s} \label {eq:1mbeta-limit}\\ 1-\gamma &\;\to \; \frac {1}{(1+s)\,\phi } \label {eq:1mgamma-limit} \end {align}

A.4.2 Score Derivatives: General Case

Define $\eta = e^{-\insrate \evoltime }$ and $\delta = \delrate \eta - \insrate \alpha $ (the denominator of $\beta $). The score of the observed log-likelihood decomposes into 8 log-parameter groups. Below we list the derivatives of each log-factor with respect to $\insrate $ and $\delrate $, from which the log-elasticities follow by multiplication: \[ \frac {\partial \log \xi }{\partial \log \insrate } = \insrate \,\frac {\partial \log \xi }{\partial \insrate }, \qquad \frac {\partial \log \xi }{\partial \log \delrate } = \delrate \,\frac {\partial \log \xi }{\partial \delrate }, \qquad \frac {\partial \log \xi }{\partial \log (\insrate {+}\delrate )} = \frac {\partial \log \xi }{\partial \log \insrate } + \frac {\partial \log \xi }{\partial \log \delrate } \] The third identity holds because differentiating with respect to $\log (\insrate {+}\delrate )$ at fixed ratio $\kappa = \insrate /\delrate $ is equivalent to scaling both rates by a common factor $c$ and differentiating with respect to $\log c$, which gives $\insrate \,\partial _\insrate + \delrate \,\partial _\delrate $.

$\log \alpha $ and $\log (1{-}\alpha )$ Since $\alpha = e^{-\delrate \evoltime }$ depends only on $\delrate $: \begin {align} \frac {\partial \log \alpha }{\partial \insrate } &= 0, & \frac {\partial \log \alpha }{\partial \delrate } &= -\evoltime \label {eq:dlogalpha} \\[4pt] \frac {\partial \log (1{-}\alpha )}{\partial \insrate } &= 0, & \frac {\partial \log (1{-}\alpha )}{\partial \delrate } &= \frac {\evoltime \alpha }{1-\alpha } = \frac {\evoltime \alpha }{\Phi } \label {eq:dlog1malpha} \end {align}

$\log \beta $ and $\log (1{-}\beta )$ From $\log \beta = \log \insrate + \log (\eta - \alpha ) - \log \delta $: \begin {align} \frac {\partial \log \beta }{\partial \insrate } &= \frac {1}{\insrate } - \frac {\evoltime \eta }{\eta - \alpha } + \frac {\evoltime \delrate \eta + \alpha }{\delta } \label {eq:dlogbeta-dl} \\[4pt] \frac {\partial \log \beta }{\partial \delrate } &= \frac {\evoltime \alpha }{\eta - \alpha } - \frac {\eta + \evoltime \insrate \alpha }{\delta } \label {eq:dlogbeta-dm} \end {align}

Both $(\eta - \alpha ) \to 0$ and $\delta \to 0$ as $\insrate \to \delrate $, so each term individually diverges; the combination is finite (Section A.4.3).

For $\log (1{-}\beta )$, using the chain rule $\partial \log (1{-}\beta ) = -\frac {\beta }{1-\beta }\,\partial \log \beta $: \begin {align} \frac {\partial \log (1{-}\beta )}{\partial \insrate } &= -\frac {\beta }{1-\beta } \cdot \frac {\partial \log \beta }{\partial \insrate } \label {eq:dlog1mbeta-dl} \\[4pt] \frac {\partial \log (1{-}\beta )}{\partial \delrate } &= -\frac {\beta }{1-\beta } \cdot \frac {\partial \log \beta }{\partial \delrate } \label {eq:dlog1mbeta-dm} \end {align}

$\log (1{-}\gamma )$ and $\log \gamma $ From $1 - \gamma = \delrate \beta /(\insrate (1{-}\alpha ))$, so $\log (1{-}\gamma ) = \log \delrate + \log \beta - \log \insrate - \log (1{-}\alpha )$: \begin {align} \frac {\partial \log (1{-}\gamma )}{\partial \insrate } &= \frac {\partial \log \beta }{\partial \insrate } - \frac {1}{\insrate } \label {eq:dlog1mgamma-dl} \\[4pt] \frac {\partial \log (1{-}\gamma )}{\partial \delrate } &= \frac {1}{\delrate } + \frac {\partial \log \beta }{\partial \delrate } - \frac {\evoltime \alpha }{1-\alpha } \label {eq:dlog1mgamma-dm} \end {align}

For $\log \gamma $: \begin {align} \frac {\partial \log \gamma }{\partial \insrate } &= -\frac {1-\gamma }{\gamma } \cdot \frac {\partial \log (1{-}\gamma )}{\partial \insrate } \label {eq:dloggamma-dl} \\[4pt] \frac {\partial \log \gamma }{\partial \delrate } &= -\frac {1-\gamma }{\gamma } \cdot \frac {\partial \log (1{-}\gamma )}{\partial \delrate } \label {eq:dloggamma-dm} \end {align}

(sum of above two rows)

	$\log \alpha $	$\log (1{-}\alpha )$	$\log \beta $	$\log (1{-}\beta )$	$\log \gamma $	$\log (1{-}\gamma )$

$\dfrac {\partial }{\partial \log \insrate }$	$0$	$0$	$1 - \dfrac {s_\insrate \eta }{\eta -\alpha } + \dfrac {s_\insrate \delrate \eta /\insrate + \insrate \alpha /\insrate }{\delta /\insrate }$	$-\dfrac {\beta }{1-\beta }\cdot [\cdot ]_\beta ^\insrate $	$-\dfrac {1-\gamma }{\gamma }\cdot [\cdot ]_{1-\gamma }^\insrate $	$[\cdot ]_\beta ^\insrate - 1$

$\dfrac {\partial }{\partial \log \delrate }$	$-s$	$\dfrac {s\alpha }{\Phi }$	$\dfrac {s\alpha }{\eta -\alpha } - \dfrac {\delrate \eta + s\insrate \alpha }{\delta }$	$-\dfrac {\beta }{1-\beta }\cdot [\cdot ]_\beta ^\delrate $	$-\dfrac {1-\gamma }{\gamma }\cdot [\cdot ]_{1-\gamma }^\delrate $	$1 + [\cdot ]_\beta ^\delrate - \dfrac {s\alpha }{\Phi }$

$\dfrac {\partial }{\partial \log (\insrate {+}\delrate )}$

where $s_\insrate = \insrate \evoltime $, $s = \delrate \evoltime $, and $[\cdot ]_\xi ^\theta $ denotes the $\partial \log \xi /\partial \log \theta $ entry.

The table is dense because each $\log \beta $ and $\log \gamma $ entry involves the singular terms $(\eta -\alpha )^{-1}$ and $\delta ^{-1}$. These singularities cancel in each entry, as we show next.

A.4.3 Score Derivatives: L’Hôpital Limits

We systematically expand each derivative to first order in $\varepsilon = \delrate - \insrate $ and extract the finite limit. With $u = \varepsilon \evoltime $, the key expansions are: \begin {align} \eta - \alpha &= \alpha (e^u - 1) = \alpha \bigl (u + u^2/2 + \cdots \bigr ) \label {eq:eta-alpha-expand} \\ \delta &= \delrate \eta - \insrate \alpha = \alpha \varepsilon (1 + s + O(\varepsilon )) \label {eq:delta-expand} \end {align}

and the Laurent expansion $1/(e^u - 1) = 1/u - 1/2 + u/12 - \cdots $, giving: \begin {align} \frac {\evoltime \eta }{\eta - \alpha } = \frac {\evoltime e^u}{e^u - 1} &= \frac {1}{\varepsilon } + \frac {\evoltime }{2} + O(\varepsilon ) \label {eq:singular-A} \\ \frac {\evoltime \alpha }{\eta - \alpha } = \frac {\evoltime }{e^u - 1} &= \frac {1}{\varepsilon } - \frac {\evoltime }{2} + O(\varepsilon ) \label {eq:singular-B} \\ \frac {\evoltime \delrate \eta + \alpha }{\delta } &= \frac {1}{\varepsilon } + \frac {s\evoltime }{2(1+s)} + O(\varepsilon ) \label {eq:singular-C} \\ \frac {\eta + \evoltime \insrate \alpha }{\delta } &= \frac {1}{\varepsilon } - \frac {s\evoltime }{2(1+s)} + O(\varepsilon ) \label {eq:singular-D} \end {align}

Derivation of (??): the numerator is $\evoltime \delrate \eta + \alpha = \alpha (se^u + 1) = \alpha \bigl ((1{+}s) + s\varepsilon \evoltime + O(\varepsilon ^2)\bigr )$ and the denominator is $\delta = \alpha \varepsilon \bigl ((1{+}s) + s\varepsilon \evoltime /2 + O(\varepsilon ^2)\bigr )$ from (??). So: \[ \frac {\evoltime \delrate \eta + \alpha }{\delta } = \frac {(1{+}s) + s\varepsilon \evoltime + \cdots } {\varepsilon \bigl ((1{+}s) + s\varepsilon \evoltime /2 + \cdots \bigr )} = \frac {1}{\varepsilon } \cdot \frac {1 + \frac {s\varepsilon \evoltime }{1+s} + \cdots } {1 + \frac {s\varepsilon \evoltime }{2(1+s)} + \cdots } \] The $O(\varepsilon )$ coefficients differ: $s\evoltime /(1{+}s)$ in the numerator vs. $s\evoltime /(2(1{+}s))$ in the denominator. Their difference gives the $O(1)$ correction: $1/\varepsilon + s\evoltime /(2(1{+}s)) + O(\varepsilon )$. Similarly for (??).

$\partial \log \beta /\partial \insrate $ and $\partial \log \beta /\partial \delrate $ Using (??) with (??) and (??): \begin {align*} \frac {\partial \log \beta }{\partial \insrate } &= \frac {1}{\insrate } - \Bigl (\frac {1}{\varepsilon } + \frac {\evoltime }{2}\Bigr ) + \Bigl (\frac {1}{\varepsilon } + \frac {s\evoltime }{2(1+s)}\Bigr ) + O(\varepsilon ) \\ &= \frac {1}{\delrate } - \frac {\evoltime }{2} + \frac {s\evoltime }{2(1+s)} = \frac {1}{\delrate } - \frac {\evoltime }{2(1+s)} \end {align*}

Using (??) with (??) and (??): \begin {align*} \frac {\partial \log \beta }{\partial \delrate } &= \Bigl (\frac {1}{\varepsilon } - \frac {\evoltime }{2}\Bigr ) - \Bigl (\frac {1}{\varepsilon } - \frac {s\evoltime }{2(1+s)}\Bigr ) + O(\varepsilon ) \\ &= -\frac {\evoltime }{2} + \frac {s\evoltime }{2(1+s)} = -\frac {\evoltime }{2(1+s)} \end {align*}

Converting to log-elasticities: \begin {equation} \boxed { \frac {\partial \log \beta }{\partial \log \insrate } \;\to \; \frac {s+2}{2(1+s)}, \qquad \frac {\partial \log \beta }{\partial \log \delrate } \;\to \; -\frac {s}{2(1+s)}} \label {eq:logbeta-elasticity-limit} \end {equation}

Check. As a consistency check, we verify that the sum of the two elasticities (which gives the response to uniformly scaling both rates at fixed $\kappa = \insrate /\delrate $) matches the direct derivative of the limit formula. The sum is $(s{+}2)/(2(1{+}s)) + (-s)/(2(1{+}s)) = 1/(1{+}s)$. At the limit, $\beta = s/(1{+}s)$ where $s = \delrate \evoltime $, so $d\log \beta /d\log s = 1/(1{+}s)$. $\m@th \mathchar "458$

Check. Summing the two elasticities (uniform rate scaling): $(-s^2{-}2s{+}s^2)/(2(1{+}s)) = -s/(1{+}s)$. Direct: $d\log (1{-}\beta )/d\log s = -s/(1{+}s)$. $\m@th \mathchar "458$

From (??): $\partial \log (1{-}\gamma )/\partial \delrate = 1/\delrate + \partial \log \beta /\partial \delrate - \evoltime \alpha /\Phi \to 1/\delrate - \evoltime /(2(1{+}s)) - \evoltime \alpha /\Phi $.

Converting: \begin {equation} \boxed { \frac {\partial \log (1{-}\gamma )}{\partial \log \insrate } \;\to \; -\frac {s}{2(1+s)}, \qquad \frac {\partial \log (1{-}\gamma )}{\partial \log \delrate } \;\to \; \frac {s+2}{2(1+s)} - \frac {s\alpha }{\Phi }} \label {eq:log1mgamma-elasticity-limit} \end {equation}

Check. Summing the two elasticities (uniform rate scaling): $-s/(2(1{+}s)) + (s{+}2)/(2(1{+}s)) - s\alpha /\Phi = 1/(1{+}s) - s\alpha /\Phi $. Direct: at the limit, $1-\gamma = s/((1{+}s)\Phi )$, so $\log (1{-}\gamma ) = \log s - \log (1{+}s) - \log \Phi $. Under $s \to cs$: $d/d\log c|_{c=1} = 1 - s/(1{+}s) - s\alpha /\Phi = 1/(1{+}s) - s\alpha /\Phi $. $\m@th \mathchar "458$

\begin {equation} \boxed { \frac {\partial \log \gamma }{\partial \log \insrate } \;\to \; \frac {Rs}{2(1+s)}, \qquad \frac {\partial \log \gamma }{\partial \log \delrate } \;\to \; -R\Bigl (\frac {s+2}{2(1+s)} - \frac {s\alpha }{\Phi }\Bigr )} \label {eq:loggamma-elasticity-limit} \end {equation}

Summary Table: L’Hôpital Limits At $\insrate = \delrate $, with $s = \delrate \evoltime $, $\alpha = e^{-s}$, $\Phi = 1-\alpha $, $\phi = \Phi /s$, $R = 1/((1{+}s)\phi - 1)$:

	$\log \alpha $	$\log (1{-}\alpha )$	$\log \beta $	$\log (1{-}\beta )$	$\log \gamma $	$\log (1{-}\gamma )$

$\dfrac {\partial }{\partial \log \insrate }$	$0$	$0$	$\dfrac {s+2}{2(1+s)}$	$-\dfrac {s(s+2)}{2(1+s)}$	$\dfrac {Rs}{2(1+s)}$	$-\dfrac {s}{2(1+s)}$

$\dfrac {\partial }{\partial \log \delrate }$	$-s$	$\dfrac {s\alpha }{\Phi }$	$-\dfrac {s}{2(1+s)}$	$\dfrac {s^2}{2(1+s)}$	$-R\!\left (\dfrac {s{+}2}{2(1{+}s)}{-}\dfrac {s\alpha }{\Phi }\right )$	$\dfrac {s+2}{2(1+s)}{-}\dfrac {s\alpha }{\Phi }$

$\dfrac {\partial }{\partial \log (\insrate {+}\delrate )}$	$-s$	$\dfrac {s\alpha }{\Phi }$	$\dfrac {1}{1+s}$	$-\dfrac {s}{1+s}$	$R\!\left (\dfrac {s}{2(1{+}s)}{-}\dfrac {s{+}2}{2(1{+}s)}{+}\dfrac {s\alpha }{\Phi }\right )$	$\dfrac {1}{1+s}{-}\dfrac {s\alpha }{\Phi }$

All entries are finite. The third row equals the sum of the first two (as it must by construction). The $\log \alpha $ and $\log (1{-}\alpha )$ columns are trivially continuous (no $\insrate $ dependence). The $\log \beta $ and $\log (1{-}\beta )$ entries have a particularly clean form involving only $s/(1{+}s)$. The $\log \gamma $ entries additionally involve $\alpha /\Phi = e^{-s}/(1-e^{-s}) = 1/(e^s - 1)$ and the ratio $R = 1/((1{+}s)\phi - 1)$.

A.4.4 BDI Sufficient Statistics

The EM-relevant quantities are the posterior expectations $\mathbb {E}[B|i,j,T]$, $\mathbb {E}[D|i,j,T]$, $\mathbb {E}[S|i,j,T]$, recovered from the score via the conservation law $i + B - D = j$: \begin {align} \mathbb {E}[S] &= \frac {j - i + \delrate \,\partial _\delrate \log P_{ij} - \insrate \,\partial _\insrate \log P_{ij} - \insrate \evoltime }{\insrate - \delrate } \label {eq:ES-formula} \\ \mathbb {E}[B] &= \insrate \,\partial _\insrate \log P_{ij} + \insrate \,\mathbb {E}[S] + \insrate \evoltime \label {eq:EB-formula} \\ \mathbb {E}[D] &= \delrate \,\partial _\delrate \log P_{ij} + \delrate \,\mathbb {E}[S] \label {eq:ED-formula} \end {align}

The score $\partial _\insrate \log P_{ij}$ itself is computed from the transition count groups: \[ \partial _\insrate \log P_{ij} = \sum _{\xi \in \{\alpha ,1{-}\alpha ,\beta ,\ldots \}} n_{\log \xi } \cdot \frac {\partial \log \xi }{\partial \insrate } \] Each derivative $\partial \log \xi /\partial \insrate $ has a finite L’Hôpital limit (Section A.4.3), so the score itself has a finite limit. The denominator $\insrate - \delrate $ in (??) is what creates the $0/0$ for $\mathbb {E}[S]$.

The $\mathbb {E}[S]$ singularity. The numerator of (??) also vanishes at $\insrate = \delrate $, since the complete-data log-likelihood for $\immrate = \insrate $ is $\ell = B\log \insrate + D\log \delrate - (\insrate +\delrate )S - \insrate \evoltime + c$, and at $\insrate = \delrate $ the numerator becomes $j - i + \mathbb {E}[D] - \mathbb {E}[B] = 0$ by conservation.

To resolve the $0/0$, write $f(\insrate ) = j - i + \delrate \,\partial _\delrate \log P - \insrate \,\partial _\insrate \log P - \insrate \evoltime $ and $g(\insrate ) = \insrate - \delrate $. By L’Hôpital ($g' = 1$): \begin {equation} \lim _{\insrate \to \delrate } \mathbb {E}[S] = f'(\delrate ) = \delrate \,\partial ^2_{\insrate \delrate }\log P - \partial _\insrate \log P - \delrate \,\partial ^2_{\insrate ^2}\log P - \evoltime \label {eq:ES-limit} \end {equation} where all derivatives are evaluated at $\insrate = \delrate $.

The first and second derivatives of $\log P$ that appear in (A.16) can be expressed in terms of the transition count groups and the L’Hôpital-limited score derivatives from Section A.4.3, together with their first-order corrections in $\varepsilon $. Specifically, the second derivatives of $\log P$ require the $O(\varepsilon )$ terms in the Laurent expansions (??)–(??).

Alternative: direct gradient on $\delrate $. At equal rates, the BDI has a single parameter $\delrate $. The gradient is: \[ \frac {\partial \log P}{\partial \delrate }\bigg |_{\insrate =\immrate =\delrate } = \sum _\xi n_{\log \xi }\, \Bigl (\frac {\partial \log \xi }{\partial \insrate } + \frac {\partial \log \xi }{\partial \delrate }\Bigr ) = \frac {\mathbb {E}[B{+}D]}{\delrate } - 2\mathbb {E}[S] - \evoltime \] This is well-defined using the L’Hôpital limits, and gives the combined quantity $\mathbb {E}[B{+}D] - \delrate (2\mathbb {E}[S]+\evoltime )$ without needing to separate $\mathbb {E}[B]$, $\mathbb {E}[D]$, $\mathbb {E}[S]$ individually. The M-step (A.17) sets this to zero: $\hat \delrate = (\mathbb {E}[B]+\mathbb {E}[D])/(2\mathbb {E}[S]+\evoltime )$.

To break out of equal rates (determine whether $\insrate $ should differ from $\delrate $), we need the individual $\partial \log P/\partial \insrate $ and $\partial \log P/\partial \delrate $, which the table in Section A.4.3.0 provides, or we need to use additional information from the ancestral length distribution.

A.4.5 Direct BDI Rate EM at Equal Rates

If one ignores the stationary root-length prior and considers only the conditional BDI bridge model with $\insrate = \immrate = \delrate $, the complete-data log-likelihood (A.4) reduces to: \[ \ell = (B + D)\log \delrate - 2\delrate S - \delrate \evoltime + c \] This has a single free parameter $\delrate $. Setting $\partial \ell /\partial \delrate = 0$: \begin {equation} \boxed {\hat {\delrate } = \frac {B + D}{2S + \evoltime }} \label {eq:mu-mle-equal} \end {equation} where $B$, $D$, $S$ are understood as posterior expectations.

Implication. At equal rates, the M-step collapses to estimating a single rate from the total event count and total sojourn time.

A.4.6 $\log \kappa $ and $\log (1{-}\kappa )$ Derivatives

The $\kappa = \insrate /\delrate $ and $(1{-}\kappa )$ groups are not among the six $(\alpha ,\beta ,\gamma )$ factors but appear in the transition matrix. At $\kappa = 1$: \begin {align*} \frac {\partial \log \kappa }{\partial \insrate } &= \frac {1}{\insrate } \;\to \; \frac {1}{\delrate }, & \frac {\partial \log \kappa }{\partial \delrate } &= -\frac {1}{\delrate } \end {align*}

No singularity. However, $\log (1{-}\kappa )$ diverges: \[ \frac {\partial \log (1-\kappa )}{\partial \insrate } = \frac {-\kappa }{(1-\kappa )\insrate } \;\to \; -\infty \] Unlike the conditional WFST factors $\alpha ,\beta ,\gamma $, the stationary-length factors $\kappa $ and $1-\kappa $ do not admit a regular equal-rate limit in the full joint TKF model. As $\kappa \to 1$, the equilibrium length distribution ceases to be normalizable, and the $\log (1-\kappa )$ contribution diverges. Accordingly, the L’Hôpital limits derived above apply to the conditional bridge/WFST quantities, whereas the full stationary joint model must either remain in the subcritical regime $\kappa < 1$ or be reparameterized with a nonstationary root-length model.

A.4.7 Joint vs Conditional Likelihoods

The TKF91 transition matrix includes factors of $\kappa = \insrate /\delrate $ and $(1{-}\kappa )$ arising from the geometric$(\kappa )$ prior on ancestor length: $\Pr (|\text {anc}| = i) = \kappa ^i(1{-}\kappa )$. This prior is part of the joint likelihood $P(x,y)$ but can be factored out with respect to the ancestor-length prior to yield an ancestor length-conditioned likelihood $P_{\mathrm {cond}}(x,y \mid |x|=i) = P(x,y) / [\kappa ^i(1{-}\kappa )]$.

The four regimes. Of the four combinations of $\kappa < 1$ vs $\kappa = 1$ and joint vs ancestor length-conditioned, only one is degenerate:

	$\kappa < 1$	$\kappa = 1$
Joint $P(x,y)$	well-defined	degenerate ($\to 0$)
Ancestor length-conditioned $P_{\mathrm {cond}}$	well-defined	well-defined

At $\kappa = 1$, the geometric prior places vanishing mass on finite sequences, so $P(x,y) \to 0$ and $\log P \to -\infty $. The ancestor length-conditioned WFST removes the $\kappa $ and $(1{-}\kappa )$ factors corresponding to the stationary length distribution from the transition matrix, leaving only $(\alpha ,\beta ,\gamma )$-dependent terms. These have finite L’Hôpital limits (Sections A.4.3–A.4.4), so $P_{\mathrm {cond}}$ is finite and well-defined even at $\kappa = 1$.

BDI sufficient statistics. The BDI sufficient statistics $\mathbb {E}[B]$, $\mathbb {E}[D]$, $\mathbb {E}[S]$ are most naturally recovered from the ancestor length-conditioned WFST, whose score depends only on the six TKF bridge groups $(\alpha , 1{-}\alpha , \beta , 1{-}\beta , \gamma , 1{-}\gamma )$. In the joint model, additional contributions from the ancestral length prior appear through the $\kappa $ and $1{-}\kappa $ factors. These must be included in the M-step when optimizing the full stationary TKF likelihood, but they are conceptually distinct from the bridge statistics themselves.

Equations (??)-?? give the correct expectations for the BDI sufficient statistics in terms of the counts $\hat {n}_{ij}$ regardless of whether those counts were computed using the joint Pair HMM or the conditionally-normalized transducer, as long as the WFST transition matrix $\tkfwfst $ is used for the counts $\emcount ^\xi _{ij} = \xi \partial _\xi \log \tkfwfst _{ij}$ (e.g. using the tables in Section A.4.2.0 or Section A.4.3.0). The M-step for the joint likelihood includes additional terms from the $\kappa $ and $1{-}\kappa $ factors, leading to a coupled system of equations that can be solved in closed form (Section A.1.8).

In both cases, the M-step maximizes the correct $Q$-function, so EM monotonicity is guaranteed for the corresponding tracked objective ($\log P_{\mathrm {cond}}$ for the ancestor length-conditioned model, $\log P$ for the full joint model).

A.4.8 Irreversible Models and the $\insrate > \delrate $ Regime

The $\kappa > 1$ regime. When $\insrate > \delrate $, the ratio $\kappa = \insrate /\delrate > 1$ and the geometric$(\kappa )$ prior on ancestor length is not normalizable: $\sum _{i=0}^\infty \kappa ^i(1-\kappa )$ diverges. Therefore the joint likelihood $P(x,y)$ does not exist. However, the ancestor length-conditioned likelihood $P_{\mathrm {cond}}(x,y \mid |x|)$ is well-defined for all $\kappa > 0$. The transition parameters $\alpha $, $\beta $, $\gamma $ are continuous functions of $(\insrate ,\delrate ,\evoltime )$ with no singularity at $\insrate = \delrate $ or $\insrate > \delrate $. The BDI sufficient statistics $\mathbb {E}[B]$, $\mathbb {E}[D]$, $\mathbb {E}[S]$ remain well-defined, and the conditioned M-step (??) applies unchanged: it is unconstrained in $\insrate $ and $\delrate $ and permits $\hat \insrate > \hat \delrate $ if the data support it.

Irreversible substitution models. For completeness, our implementations support dropping the reversibility assumption for the substitution CTMC. The integrals described in Section A.1.2 and in (25) exploit the symmetric factorization of $\revsub $ using $\eqm $-weighted similarity, which yields real eigenvalues. In the irreversible case, $\revsub $ is a general rate matrix with no detailed-balance constraint. The eigendecomposition $\revsub = V\,\operatorname {diag}(d)\,V^{-1}$ may have complex eigenvalues. Assuming $\revsub $ is diagonalizable, the spectral formulae for the integrals of Section A.1.2 carry over, with $V^{-1}$ replacing $V^\top $ and complex arithmetic throughout: \[ I^{ab}_{ij}(\evoltime ) = \sum _{k,l} V_{ak}\,V^{-1}_{ki}\, \frac {e^{d_k\evoltime } - e^{d_l\evoltime }}{d_k - d_l}\, V_{jl}\,V^{-1}_{lb} \] with the convention $(e^{d_k\evoltime } - e^{d_l\evoltime })/(d_k - d_l) = \evoltime \,e^{d_k\evoltime }$ when $d_k = d_l$. The result is real (complex eigenvalues come in conjugate pairs).

Fixed-point iteration for $\eqm $. In the reversible case, the equilibrium distribution $\eqm $ is constrained to be the stationary distribution of $\revsub $ by by detailed balance, and may either be fixed a priori or updated jointly with the reversible rate parameters during the M-step. In the irreversible case, $\eqm $ must be recomputed as the left eigenvector of $\revsub $ (with eigenvalue 0) after each M-step update of $\revsub $ from the bridge-expectation sufficient statistics. This gives an ECM (Expectation-Conditional-Maximization) scheme (37): the E-step uses the current $(\revsub , \eqm )$; the M-step updates $\revsub $, then $\eqm $ is recomputed from the updated $\revsub $. Standard monotonicity results apply providing each conditional maximization step is well-defined.

Differentiability. Standard autograd libraries provide general eigendecomposition and matrix-exponential primitives (with Fréchet derivatives). The eigendecomposition route is convenient for EM-style closed-form updates, while gradient-based optimization is often more naturally based on the matrix exponential. We do not rely on differentiating through eigenvectors here.

A.4.9 Summary of Limits


Quantity	Formula	Limit as $\insrate \to \delrate $

$\alpha $	$e^{-\delrate \evoltime }$	$e^{-s}$ (continuous)
$\beta $	$\frac {\insrate (\eta -\alpha )}{\delrate \eta - \insrate \alpha }$	$\frac {s}{1+s}$
$1-\beta $		$\frac {1}{1+s}$
$\gamma $	$1 - \frac {\delrate \beta }{\insrate (1-\alpha )}$	$1 - \frac {1}{(1+s)\phi }$
$\kappa $	$\insrate /\delrate $	$1$
$\partial _{\log \insrate }\log \beta $	(??)	$\frac {s+2}{2(1+s)}$
$\partial _{\log \delrate }\log \beta $	(??)	$-\frac {s}{2(1+s)}$
$\hat {\delrate }_{\text {MLE}}$		$\frac {\mathbb {E}[B]+\mathbb {E}[D]}{2\mathbb {E}[S]+\evoltime }$

A.5 TKF91 Score Function

We derive the explicit partial derivatives needed to compute the observed-data score from Pair HMM transition counts, as described in Section A.1.8.

A.5.1 Derivatives of TKF parameters

Define $\eta = e^{-\insrate \evoltime }$ (so that $\alpha = e^{-\delrate \evoltime }$ and $\eta = e^{-\insrate \evoltime }$) and $\Delta = \delrate \eta - \insrate \alpha $ (the denominator of $\beta $).

$\alpha = e^{-\delrate \evoltime }$. \[ \frac {\partial \log \alpha }{\partial \insrate } = 0, \qquad \frac {\partial \log \alpha }{\partial \delrate } = -\evoltime \]

\begin {eqnarray*} \frac {\partial \log \beta }{\partial \insrate } & = & \frac {1}{\insrate } - \frac {\evoltime \eta }{\eta - \alpha } + \frac {\evoltime \delrate \eta + \alpha }{\Delta } \\ \frac {\partial \log \beta }{\partial \delrate } & = & \frac {\evoltime \alpha }{\eta - \alpha } - \frac {\eta + \evoltime \insrate \alpha }{\Delta } \end {eqnarray*}

\begin {eqnarray*} \frac {\partial \log (1-\gamma )}{\partial \insrate } & = & \frac {\partial \log \beta }{\partial \insrate } - \frac {1}{\insrate } \\ \frac {\partial \log (1-\gamma )}{\partial \delrate } & = & \frac {1}{\delrate } + \frac {\partial \log \beta }{\partial \delrate } + \frac {\evoltime \alpha }{1-\alpha } \end {eqnarray*}

and $\frac {\partial \log \gamma }{\partial \theta } = -\frac {1-\gamma }{\gamma }\frac {\partial \log (1-\gamma )}{\partial \theta }$ for $\theta \in \{\insrate ,\delrate \}$.

$\kappa = \insrate /\delrate $. \[ \frac {\partial \log \kappa }{\partial \insrate } = \frac {1}{\insrate }, \qquad \frac {\partial \log \kappa }{\partial \delrate } = -\frac {1}{\delrate } \]

Complementary parameters. For any parameter $\xi $: $\frac {\partial \log (1-\xi )}{\partial \theta } = -\frac {\xi }{1-\xi }\frac {\partial \log \xi }{\partial \theta }$.

A.5.2 Observed-data score

Table A.1: Correspondence between transition counts, complete-data log-likelihood terms, and lineage fates.


Log-likelihood term	Coefficient	Interpretation

$\log \alpha $	$n_{SM}+n_{MM}+n_{DM}+n_{IM}$	Founder survival (match)
$\log (1-\alpha )$	$n_{SD}+n_{MD}+n_{DD}+n_{ID}$	Founder death (delete)
$\log (1-\gamma )$	$n_{DM}+n_{DD}+n_{DE}$	Lineage extinction (no $\del \to \ins $)
$\log \gamma $	$n_{DI}$	Youngest orphan ($\del \to \ins $)
$\log \beta $	$n_{SI}+n_{MI}+n_{II}$	Further offspring (insertions continue)
$\log (1-\beta )$	$n_{IM}+n_{ID}+n_{IE}$	No further offspring (insertions end)

Using the transition count decomposition from Table A.1, the Pair HMM log-likelihood under the conditional model $P(y|x)$ is \[ \ell _{\text {cond}} = n_{\log \alpha }\log \alpha + n_{\log (1-\alpha )}\log (1-\alpha ) + n_{\log \beta }\log \beta + n_{\log (1-\beta )}\log (1-\beta ) + n_{\log \gamma }\log \gamma + n_{\log (1-\gamma )}\log (1-\gamma ) \] where we have suppressed the $\kappa $ and $(1-\kappa )$ terms since they cancel in the conditional likelihood (they appear symmetrically in the transition probabilities and the ancestral sequence prior).

The score with respect to $\insrate $ is \[ \frac {\partial \ell _{\text {cond}}}{\partial \insrate } = \sum _{\xi } n_\xi \frac {\partial \xi }{\partial \insrate } \] where $\xi $ ranges over $\{\log \alpha , \log (1-\alpha ), \log \beta , \log (1-\beta ), \log \gamma , \log (1-\gamma )\}$, and similarly for $\delrate $.

A.5.3 Joint likelihood correction

For the joint model $P(x,y)$, the log-likelihood includes the ancestral sequence prior \[ \ell _{\text {joint}} = \ell _{\text {cond}} + |x|\log \kappa + \log (1-\kappa ) + \sum _i n_i\log \eqm _i \] where $|x|$ is the ancestor length, $n_i$ counts character $i$ in the ancestor, and the $\kappa $/$1-\kappa $ terms come from the geometric length distribution. The additional score contributions are \[ \frac {\partial }{\partial \insrate }\left [|x|\log \kappa + \log (1-\kappa )\right ] = \frac {|x|}{\insrate } - \frac {1}{\delrate - \insrate }, \qquad \frac {\partial }{\partial \delrate }\left [|x|\log \kappa + \log (1-\kappa )\right ] = -\frac {|x|}{\delrate } + \frac {1}{\delrate - \insrate } \]

A.5.4 Recovering sufficient statistics and M-step

The gradients $\partial \ell /\partial \insrate $ and $\partial \ell /\partial \delrate $ relate to the BDI sufficient statistics via \[ \frac {\partial \ell }{\partial \insrate } = \frac {\expect [B]}{\insrate } - \expect [S] - T, \qquad \frac {\partial \ell }{\partial \delrate } = \frac {\expect [D]}{\delrate } - \expect [S] \] Combined with the conservation law (A.6), the closed-form M-step updates are $\insrate _{\text {new}} = \expect [B]/\expect [S]$ and $\delrate _{\text {new}} = \expect [D]/\expect [S]$.

Alternatively, gradient ascent can be performed directly using the score, without needing to recover the individual sufficient statistics.

A.6 General BDI Sufficient Statistics

We derive the endpoint-conditioned sufficient statistics for the general linear BDI process with birth rate $\insrate $, death rate $\delrate $, and immigration rate $\immrate $ (not necessarily equal to $\insrate $), conditioned on $X(0) = i$ and $X(T) = j$.

A.6.1 Complete-data log-likelihood

The complete-data log-likelihood for a path is \[ \log P(\xpath ) = B_\ind \log \insrate + B_\imm \log \immrate + D\log \delrate - (\insrate + \delrate )S - \immrate T + c \] with sufficient statistics $B_\ind $ (individual births), $B_\imm $ (immigrations), $D$ (deaths), and $S = \int _0^T X(t)\,dt$ (time-integrated population).

A.6.2 Score equations

\begin {eqnarray*} \frac {\partial }{\partial \insrate }\log P_{ij} & = & \frac {\expect [B_\ind ]}{\insrate } - \expect [S] \\ \frac {\partial }{\partial \delrate }\log P_{ij} & = & \frac {\expect [D]}{\delrate } - \expect [S] \\ \frac {\partial }{\partial \immrate }\log P_{ij} & = & \frac {\expect [B_\imm ]}{\immrate } - T \end {eqnarray*}

A.6.3 Conservation law

Since population changes only by births and deaths, $i + B_\ind + B_\imm - D = j$, giving \begin {equation} \label {eq:conservation-general} \expect [B_\ind ] + \expect [B_\imm ] - \expect [D] = j - i \end {equation}

A.6.4 Closed-form solution

Solving the system of four equations (three score equations plus conservation) in four unknowns, using $\insrate \neq \delrate $:

\begin {eqnarray*} \expect [B_\imm ] & = & \immrate \frac {\partial }{\partial \immrate }\log P_{ij} + \immrate T \\ \expect [S] & = & \frac {j - i + \delrate \frac {\partial }{\partial \delrate }\log P_{ij} - \insrate \frac {\partial }{\partial \insrate }\log P_{ij} - \immrate \frac {\partial }{\partial \immrate }\log P_{ij} - \immrate T}{\insrate - \delrate } \\ \expect [B_\ind ] & = & \insrate \frac {\partial }{\partial \insrate }\log P_{ij} + \insrate \,\expect [S] \\ \expect [D] & = & \delrate \frac {\partial }{\partial \delrate }\log P_{ij} + \delrate \,\expect [S] \end {eqnarray*}

A.6.5 Transition probability for general BDI

For the general linear BDI process with per-capita birth rate $\insrate $, per-capita death rate $\delrate $, and constant immigration rate $\immrate $ (where $\immrate $ is a free parameter, not necessarily $0$ or $\insrate $), the transition probability $P_{ij}(T) = P(X(T) = j \mid X(0) = i)$ is known in closed form (29).

Using the $\alpha $ and $\beta $ from Section A.1.3, define $\rho = \immrate / \insrate $ (the immigration-to-birth ratio). The transition probability factorizes as: \begin {equation} \label {eq:pij-general} P_{ij}(T) = \underbrace {P_{ij}^{(0)}(T)}_{\text {no-immigration}} \cdot \underbrace {R_{ij}(T)}_{\text {immigration factor}} \end {equation} The no-immigration factor is \begin {equation} \label {eq:pij-noimm} P_{ij}^{(0)}(T) = \sum _{k=0}^{\min (i,j)} \binom {i}{k} \binom {j-1}{k-1} \alpha ^k (1-\alpha )^{i-k} (1-\gamma )^{i-k} \gamma ^{j-k} (1-\beta )^{k+1} \beta ^{j-k-1} \cdot \delta _{j>0} \end {equation} (with the $k=0$ term understood as $(1-\alpha )^i (1-\gamma )^i$ when $j=0$, and the convention $\binom {-1}{-1} = 1$).

For the immigration factor, when $\immrate \neq \insrate $ (i.e. $\rho \neq 1$): \begin {equation} \label {eq:imm-factor} R_{ij}(T) = (1-\beta )^{\rho } \cdot {}_2F_1\!\left (-j, \rho ;\, 1;\, \frac {\beta }{1-\beta } \cdot \frac {1}{\beta /(1-\beta )}\right ) \end {equation} More explicitly, the full $P_{ij}(T)$ with immigration can be written using the negative binomial form (29): \begin {equation} \label {eq:pij-full} P_{ij}(T) = \sum _{k=0}^{\min (i,j)} \binom {i}{k}\binom {j-1}{k-1} \alpha ^k (1-\alpha -\gamma +\alpha \gamma )^{i-k} \gamma ^{j-k}(1-\beta )^{k+1}\beta ^{j-k-1} \cdot (1-\beta )^{\rho } \sum _{m=0}^{j-k} \binom {\rho +m-1}{m}\beta ^m \end {equation} where the inner sum is a truncated negative binomial in $\beta $ with parameter $\rho $. When $\rho = 1$ (i.e. $\immrate = \insrate $), the immigration factor simplifies to a factor of $(1-\beta )$, recovering the TKF91 transition probabilities. When $\rho = 0$ (no immigration), $R = 1$ and we recover $P_{ij}^{(0)}$.

A.6.6 Score derivatives for $\immrate $

The score with respect to $\immrate $ is \[ \frac {\partial }{\partial \immrate }\log P_{ij} = \frac {1}{P_{ij}} \frac {\partial P_{ij}}{\partial \immrate } \] Since $\immrate $ enters through $\rho = \immrate /\insrate $ and the factor $(1-\beta )^\rho $ times the negative binomial sum, we have \begin {equation} \frac {\partial }{\partial \immrate }\log P_{ij} = \frac {1}{\insrate }\left ( \log (1-\beta ) + \psi (\rho ) \cdot [\text {correction from the sum}] \right ) \end {equation} where $\psi $ is the digamma function (arising from $\partial /\partial \rho \;\binom {\rho +m-1}{m}$). The full expression is: \begin {equation} \label {eq:dlogP-dnu} \frac {\partial }{\partial \immrate }\log P_{ij} = \frac {1}{\insrate } \frac {\sum _k (\cdots ) \sum _m \binom {\rho +m-1}{m}\beta ^m \left [\log (1-\beta ) + \sum _{\ell =0}^{m-1}\frac {1}{\rho +\ell }\right ]} {\sum _k (\cdots ) \sum _m \binom {\rho +m-1}{m}\beta ^m} \end {equation} where $(\cdots )$ abbreviates the terms from the outer sum in (A.22).

A.6.7 Chain rule

In practice, automatic differentiation of (A.22) is the preferred approach for computing $\partial \log P_{ij}/\partial \insrate $, $\partial \log P_{ij}/\partial \delrate $, and $\partial \log P_{ij}/\partial \immrate $, since the nested sums and special functions make manual differentiation unwieldy.

For completeness, the chain rule decomposes as: \[ \frac {\partial \log P_{ij}}{\partial \insrate } = \frac {\partial \log P_{ij}}{\partial \alpha }\frac {\partial \alpha }{\partial \insrate } + \frac {\partial \log P_{ij}}{\partial \beta }\frac {\partial \beta }{\partial \insrate } + \frac {\partial \log P_{ij}}{\partial \gamma }\frac {\partial \gamma }{\partial \insrate } + \frac {\partial \log P_{ij}}{\partial \rho }\frac {\partial \rho }{\partial \insrate } \] with the $\alpha $, $\beta $, $\gamma $ derivatives given in Appendix A.5, and the additional term \[ \frac {\partial \rho }{\partial \insrate } = -\frac {\immrate }{\insrate ^2}, \qquad \frac {\partial \rho }{\partial \immrate } = \frac {1}{\insrate } \] The derivative $\partial \log P_{ij}/\partial \rho $ is the quantity in (A.24) times $\insrate $. The $\delrate $ derivative has no $\rho $ term since $\rho $ does not depend on $\delrate $.

The formulas above hold for any $(i,j,\insrate ,\delrate ,\immrate )$ with $\insrate \neq \delrate $ and $P_{ij}(T) > 0$. In the TKF91 regime ($\immrate = \insrate $, $i \in \{0,1\}$), they specialize to the results of the main text.

A.7 The General Geometric model

Ideally, the underlying evolutionary model we’d use for indels would be the General Geometric Indel (GGI) model (11, 23). The GGI model is, in some sense, the maximum-entropy indel model for a given expected rate and length of indels. Unfortunately, the probability distributions of aligned ancestor-descendant sequences at finite times under this model are hard to solve. It turns out that the pairwise alignment distributions under the GGI model are well approximated by the TKF92 model (51), so we use that model for inference. In this section we discuss the relationship between the models and, in particular, how we can map parameters between them in a principled way.

The GGI model has parameters $(\insrate _0,\delrate _0,x,y)$. In this model, the instantaneous rate of insertion of $k$ residues at any available site is $\insrate _0 x^{k-1} (1-x)$ (with the residues themselves sampled from a stationary distribution), while the instantaneous rate of deletion of a stretch of exactly $k$ contiguous residues is $\delrate _0 y^{k-1} (1-y)$. For reversibility (or, indeed, the existence of an equilibrium) we require that $\insrate _0 y (1-x) = \delrate _0 x (1-y)$. The mean insertion and deletion lengths are $1/(1-x)$ and $1/(1-y)$. The rate at which any given residue is deleted, counting deletions that start at or before that residue (in an infinitely long sequence), is $\delrate _0 / (1-y)$. The probability that a sequence has length $n$ at equilibrium is $(x/y)^n(1-x/y)$, so the expected sequence length at equilibrium is $\ell _{\mathrm {GGI}} = \frac {x/y}{1-x/y} = \frac {1}{y/x-1}$.

The TKF92 model has parameters $(\insrate ,\delrate ,\ext )$. In this model, $k$-residue fragments (where $k>0$ and $P(k) = \ext ^{k-1}(1-\ext )$) are inserted and deleted at rates $\insrate ,\delrate $. The rate at which any given residue is deleted is just the rate at which its fragment is deleted, $\delrate $. The mean number of fragments at equilibrium is $\frac {\insrate /\delrate }{1-\insrate /\delrate } = \frac {1}{\delrate /\insrate - 1}$ and the mean fragment length is $1/(1-\ext )$, so the mean sequence length at equilibrium is $\ell _{\mathrm {TKF}} = \frac {1}{(\delrate /\insrate - 1)(1 - \ext )}$. The precise distribution of sequence lengths at equilibrium is a zero-altered geometric distribution \[ P(n) = \left \{ \begin {array}{ll} 1 - \frac {\insrate }{\delrate } & \mbox {if $n=0$} \\ \frac {\insrate }{\delrate } \left ( \ext + (1-\ext )\frac {\insrate }{\delrate } \right )^{n-1} (1-\ext ) \left ( 1 - \frac {\insrate }{\delrate } \right ) & \mbox {if $n>0$} \end {array} \right . \]

The mean length of an insertion in TKF92 is $1/(1-\ext )$, i.e. the expected fragment length. For deletions, things are slightly more complicated. Superficially, it looks like deletions are also fragments so they should have the same mean length. However, if we do not have knowledge of the fragment boundaries, but are conditioning just on the current sequence length (as in GGI), we have to weight this by the posterior probability that the subsequence being deleted does in fact constitute a single fragment. For a subsequence of length $k$, this posterior probability is proportional to $\left (\frac {\ext }{\ext + (1-\ext )\frac {\insrate }{\delrate }}\right )^{k-1}$ (indels at the end of the sequence have a slightly different correction factor since they are more likely to constitute a fragment, but the $k$-dependence is the same). The rate of starting an insertion or deletion at a particular site is subject to a similar correction factor.

Given the equilibrium constraint on GGI, each model has three free parameters. In order to find a correspondence, it is useful to match moments (expectations). In view of the above discussion, the most natural expectations to equate are (in the form “GGI expectation = TKF92 expectation”)

These equations, together with the reversibility constraint $\insrate _0 y (1-x) = \delrate _0 x (1-y)$, fully specify a one-to-one mapping between GGI and TKF92. In what follows, we will simply use the TKF92 parameters $(\insrate , \delrate , \ext )$ and likelihoods (including the zero-adjusted equilibrium length distribution), but with the understanding that we are approximating GGI. In particular, when we jointly sample the pairwise alignments around any given node in the phylogenetic tree, we will not require the fragment boundaries to coincide (as would be the case in TKF92). Technically speaking, any indel model parameter priors should be on the GGI parameters rather than the TKF92 ones, but since we are using uninformative priors this may be an acceptable compromise.

One way in which the TKF92 model, where the process state is augmented with latent fragment boundaries, deviates from the sequence-only GGI is that when we use the TKF92 Pair HMM (with the fragment boundaries marginalized out) as an approximation to $P(X_t|X_0)$, we cannot expect it perfectly to obey the Chapman-Kolmogorov equation $P(X_{t+u}=x''|X_0=x) = \sum _{x'} P(X_t=x'|X_0=x)P(X_{t+u}=x''|X_t=x')$. This has consequences for any MCMC moves that split branches using a Pair HMM (which models exactly the Chapman-Kolmogorov equation), since it means that those moves will be miscalibrated in probability, but the approximation should be mild. In fact, it seems possible that no five-state ($\smide $) Pair HMM exactly obeys the Chapman-Kolmogorov equation for a GGI model, except in the special case where the model reduces to the TKF91 model. The best approximations known are obtained by matching moments of the underlying counting process and thereby deriving ODEs for the transition probabilities (23). However, the TKF92 model is a close runner-up (and actually appears to fit real data better).

	\(\log \alpha \)	\(\log (1{-}\alpha )\)	\(\log \beta \)	\(\log (1{-}\beta )\)	\(\log \gamma \)	\(\log (1{-}\gamma )\)

\(\dfrac {\partial }{\partial \log \insrate }\)	\(0\)	\(0\)	\(1 - \dfrac {s_\insrate \eta }{\eta -\alpha } + \dfrac {s_\insrate \delrate \eta /\insrate + \insrate \alpha /\insrate }{\delta /\insrate }\)	\(-\dfrac {\beta }{1-\beta }\cdot [\cdot ]_\beta ^\insrate \)	\(-\dfrac {1-\gamma }{\gamma }\cdot [\cdot ]_{1-\gamma }^\insrate \)	\([\cdot ]_\beta ^\insrate - 1\)

\(\dfrac {\partial }{\partial \log \delrate }\)	\(-s\)	\(\dfrac {s\alpha }{\Phi }\)	\(\dfrac {s\alpha }{\eta -\alpha } - \dfrac {\delrate \eta + s\insrate \alpha }{\delta }\)	\(-\dfrac {\beta }{1-\beta }\cdot [\cdot ]_\beta ^\delrate \)	\(-\dfrac {1-\gamma }{\gamma }\cdot [\cdot ]_{1-\gamma }^\delrate \)	\(1 + [\cdot ]_\beta ^\delrate - \dfrac {s\alpha }{\Phi }\)

\(\dfrac {\partial }{\partial \log (\insrate {+}\delrate )}\)

	\(\log \alpha \)	\(\log (1{-}\alpha )\)	\(\log \beta \)	\(\log (1{-}\beta )\)	\(\log \gamma \)	\(\log (1{-}\gamma )\)

\(\dfrac {\partial }{\partial \log \insrate }\)	\(0\)	\(0\)	\(\dfrac {s+2}{2(1+s)}\)	\(-\dfrac {s(s+2)}{2(1+s)}\)	\(\dfrac {Rs}{2(1+s)}\)	\(-\dfrac {s}{2(1+s)}\)

\(\dfrac {\partial }{\partial \log \delrate }\)	\(-s\)	\(\dfrac {s\alpha }{\Phi }\)	\(-\dfrac {s}{2(1+s)}\)	\(\dfrac {s^2}{2(1+s)}\)	\(-R\!\left (\dfrac {s{+}2}{2(1{+}s)}{-}\dfrac {s\alpha }{\Phi }\right )\)	\(\dfrac {s+2}{2(1+s)}{-}\dfrac {s\alpha }{\Phi }\)

\(\dfrac {\partial }{\partial \log (\insrate {+}\delrate )}\)	\(-s\)	\(\dfrac {s\alpha }{\Phi }\)	\(\dfrac {1}{1+s}\)	\(-\dfrac {s}{1+s}\)	\(R\!\left (\dfrac {s}{2(1{+}s)}{-}\dfrac {s{+}2}{2(1{+}s)}{+}\dfrac {s\alpha }{\Phi }\right )\)	\(\dfrac {1}{1+s}{-}\dfrac {s\alpha }{\Phi }\)

	\(\kappa < 1\)	\(\kappa = 1\)
Joint \(P(x,y)\)	well-defined	degenerate (\(\to 0\))
Ancestor length-conditioned \(P_{\mathrm {cond}}\)	well-defined	well-defined


Quantity	Formula	Limit as \(\insrate \to \delrate \)

\(\alpha \)	\(e^{-\delrate \evoltime }\)	\(e^{-s}\) (continuous)
\(\beta \)	\(\frac {\insrate (\eta -\alpha )}{\delrate \eta - \insrate \alpha }\)	\(\frac {s}{1+s}\)
\(1-\beta \)		\(\frac {1}{1+s}\)
\(\gamma \)	\(1 - \frac {\delrate \beta }{\insrate (1-\alpha )}\)	\(1 - \frac {1}{(1+s)\phi }\)
\(\kappa \)	\(\insrate /\delrate \)	\(1\)
\(\partial _{\log \insrate }\log \beta \)	(??)	\(\frac {s+2}{2(1+s)}\)
\(\partial _{\log \delrate }\log \beta \)	(??)	\(-\frac {s}{2(1+s)}\)
\(\hat {\delrate }_{\text {MLE}}\)		\(\frac {\mathbb {E}[B]+\mathbb {E}[D]}{2\mathbb {E}[S]+\evoltime }\)


Log-likelihood term	Coefficient	Interpretation

\(\log \alpha \)	\(n_{SM}+n_{MM}+n_{DM}+n_{IM}\)	Founder survival (match)
\(\log (1-\alpha )\)	\(n_{SD}+n_{MD}+n_{DD}+n_{ID}\)	Founder death (delete)
\(\log (1-\gamma )\)	\(n_{DM}+n_{DD}+n_{DE}\)	Lineage extinction (no \(\del \to \ins \))
\(\log \gamma \)	\(n_{DI}\)	Youngest orphan (\(\del \to \ins \))
\(\log \beta \)	\(n_{SI}+n_{MI}+n_{II}\)	Further offspring (insertions continue)
\(\log (1-\beta )\)	\(n_{IM}+n_{ID}+n_{IE}\)	No further offspring (insertions end)

A BDI and TKF foundations