C Recursive TKF

MixDom adds two levels of nested mixture structure on top of TKF92: a top-level TKF91 process governing domain births and deaths, and per-domain TKF92 processes governing fragments within. Each fragment also carries a substitution-class index. The exact Baum–Welch M-step proceeds via a six-step chain-restoration identity through a fully exploded null-state model. The same nesting pattern — a parent BDI / TKF process emitting child fragments that themselves carry latent class / domain / structure indices — supports a family of recursive TKF models that this appendix collects: MixDom, its order-1 Maraschino adjacency distillation, the algebraic full-Woodbury distillation, the MixDom-specific SVI-BW convergence theory, the tree-level VEM and ancestral-reconstruction algorithms, the generalised phylo-HMM, the labeled-MixDom Singlet and WFST, the formal recursive-grammar-elaboration rules, and four worked recursive examples (L-TKF, TKFST, TKFStack, TKF-Genome).

C.1 The TKF-Mixed Domain Model (MixDom)

This model was preliminarily described and empirically evaluated in (30).

C.1.1 The MixDom Model

The Mixture of TKF92 Domains (MixDom) is a multiply-nested hierarchical mixture model. At the top level is a TKF91-like links process where each link is associated with a domain of random type $\dom \sim \catdist (\domdist _1,\ldots ,\domdist _\ndom )$. Each top-level link emits its own domain sequence, with model parameters determined by domain type.

Three nesting levels. MixDom is generated as three nested processes:

1.: Top-level TKF91 over domains: a sequence of top-level links, each of domain type $\dom \sim \catdist (\domdist _1,\ldots ,\domdist _\ndom )$, evolving under its own per-domain TKF92 indel process.
2.: Per-domain TKF92 over fragments: within each domain of type $\dom $, a TKF92 process generates a sequence of nested links. Each such nested link is called a fragment. Different fragments are statistically independent.
3.: Intra-fragment Markov chain on fragment-types: a single fragment consists of a sequence of fragment-type characters drawn from a per-domain Markov chain with $\nfrag +2$ states (start, end, and $\nfrag $ emitting states $\frag \in \{1,\ldots ,\nfrag \}$). The initial fragment-type is drawn from $\fragdist _{\dom \frag }$. From type $\srcfrag $, the chain advances within the current fragment to type $\destfrag $ with probability $\ext ^{(\dom )}_{\srcfrag \destfrag }$, or terminates the fragment with probability $\notext ^{(\dom )}_\srcfrag = 1 - \sum _{\destfrag } \ext ^{(\dom )}_{\srcfrag \destfrag }$.

The chain restarts at Start for each fresh fragment; different fragments are independent Markov realisations.

Each emitted site within a fragment of domain $\dom $ and fragment-type $\frag $ independently draws a site class $\class \sim \catdist (\classdist _{\dom \frag 1}, \ldots , \classdist _{\dom \frag \nclasses })$, with $\nclasses $ the number of site classes. The site is then governed by the substitution process $\subproc (\exch ^{(\class )}, \eqm ^{(\class )})$.
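The third nesting level and the per-site class draw can be sketched as a small sampler. This is only an illustrative sketch: the function name `sample_fragment` and its argument names are hypothetical, the parameters are for a single fixed domain, and each row of `ext` is assumed to sum to strictly less than 1 so that fragments terminate.

```python
import random

def sample_fragment(frag_init, ext, class_dist, rng):
    """Sample one fragment as a list of (fragment_type, site_class) pairs.

    frag_init[f]     -- initial fragment-type distribution for this domain
    ext[f][g]        -- probability of extending from type f to type g
    class_dist[f][c] -- site-class distribution for fragment-type f
    """
    n_frag = len(frag_init)
    f = rng.choices(range(n_frag), weights=frag_init)[0]
    sites = []
    while True:
        # Each site draws its class independently given the fragment-type.
        c = rng.choices(range(len(class_dist[f])), weights=class_dist[f])[0]
        sites.append((f, c))
        # Either extend to a new type, or terminate with prob 1 - sum_g ext[f][g].
        stop = 1.0 - sum(ext[f])
        nxt = rng.choices(range(n_frag + 1), weights=list(ext[f]) + [stop])[0]
        if nxt == n_frag:
            return sites
        f = nxt
```

Calling the function repeatedly with the same `rng` gives independent fragments, since the chain restarts from the initial distribution on every call.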

We denote the per-domain process (TKF92 fragments with an intra-fragment fragment-type chain and per-(domain, fragment-type) site-class mixture) by $\hmmproc (\{\exch ^{(\class )},\eqm ^{(\class )}\}_\class ;\ext ^{(\dom )},\classdist _\dom )$. The full MixDom model is then

\begin {eqnarray*} M_{\text {dom}} & = & \tkflinks (\hmmproc (\{\exch ^{(\class )},\eqm ^{(\class )}\}_\class ;\ext ^{(\dom )},\classdist _\dom );\insrate _\dom ,\delrate _\dom ) \\ \text {MixDom} & = & \tkflinks (\mixture _{\dom \sim \domdist _\dom }(M_{\text {dom}});\insrate _\main ,\delrate _\main ) \end {eqnarray*}

where $\mixture (\ldots )$ denotes a mixture model, with weights $p$, over parameters $\theta $ for model $M$ \[ \state \sim \mixture _{\sumidx \sim p}(\model (\theta _\sumidx );p)\ \Leftrightarrow \ \sumidx \sim \catdist (p),\ \state \sim \model (\theta _\sumidx ) \]

Remark C.1 (Relationship to TKF92). When $\nfrag = 1$ and $\nclasses = 1$, each fragment’s intra-fragment chain has a single emitting state, so the fragment length is $\geomdist (\ext ^{(\dom )}_{11})$ and all sites share the single substitution model $\subproc (\exch , \eqm )$, recovering TKF92. In general, the $\nfrag \times \nfrag $ transition matrix $\ext ^{(\dom )}$ allows intra-fragment correlations between the fragment-types of adjacent positions within a single fragment (e.g., fragment-type 1 positions tend to follow fragment-type 1 positions when $\ext ^{(\dom )}_{11}$ is the dominant row entry), and the per-(fragment-type) class distributions $\classdist _{\dom \frag \class }$ allow the resulting substitution patterns to vary with fragment-type. Different fragments remain statistically independent under this scheme; Markov correlations are strictly within a fragment, carried by the fragment-type chain.
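As a worked check of the first claim in the remark, write $r = \ext ^{(\dom )}_{11}$ for the single continuation probability. A fragment has length $n$ exactly when the chain extends $n-1$ times and then terminates, so, assuming that $\geomdist $ here denotes the geometric distribution supported on $n \geq 1$, \[ P(\text {length} = n) = r^{n-1}(1-r), \qquad \expect [\text {length}] = \frac {1}{1-r}. \]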

C.1.2 Singlet HMM for MixDom

The Singlet HMM generates sequences from the stationary distribution. Each domain may be empty with probability $\notkappa _\dom \equiv 1-\kappa _\dom $, so the probability that a top-level link generates a zero-length domain sequence is $\emptyseg _0 = \sum _{\dom \in \ndom } \domdist _\dom \notkappa _\dom $. This leads to null cycles (a link is entered but immediately terminates). Eliminating these—by the Schur complement procedure described in the next section—yields a collapsed Singlet HMM with state space $\nonemptystates ^{(\mathrm {eqm})} = \{ \sta , \fin \} \cup \{ \ins _{\dom \frag } : \dom \in \ndom , \frag \in \nfrag \}$ ($\ndom \nfrag + 2$ states). Each emitting state $\ins _{\dom \frag }$ emits a character from $\sum _\class \classdist _{\dom \frag \class }\, \eqm ^{(\class )}$. The transition matrix $\nonemptytrans ^{(\mathrm {eqm})}$ for this collapsed Singlet HMM has entries \[ \nonemptytrans ^{(\mathrm {eqm})}_{\ins _{\srcdom \srcfrag },\, \ins _{\destdom \destfrag }} = \frac {\notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom \cdot \kappa _\main \cdot \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag }}{1 - \kappa _\main \emptyseg _0} + \delta (\srcdom {=}\destdom )\, \notext ^{(\srcdom )}_\srcfrag \kappa _\srcdom \fragdist _{\srcdom \destfrag } + \delta (\srcdom {=}\destdom )\, \ext ^{(\srcdom )}_{\srcfrag \destfrag } \] where $\notext ^{(\dom )}_\frag = 1 - \sum _{\destfrag } \ext ^{(\dom )}_{\frag \destfrag }$ is the fragment termination probability, and the three terms represent (respectively) inter-domain transitions via the null-corrected top-level geometric process, same-domain new-fragment transitions, and same-domain intra-fragment Markov transitions on fragment-types. 
The $\sta $ row is $\nonemptytrans ^{(\mathrm {eqm})}_{\sta ,\, \ins _{\destdom \destfrag }} = \kappa _\main \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } / (1 - \kappa _\main \emptyseg _0)$ and $\nonemptytrans ^{(\mathrm {eqm})}_{\sta ,\fin } = (1-\kappa _\main ) / (1 - \kappa _\main \emptyseg _0)$; the $\fin $ column is $\nonemptytrans ^{(\mathrm {eqm})}_{\ins _{\srcdom \srcfrag },\fin } = \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom (1-\kappa _\main ) / (1 - \kappa _\main \emptyseg _0)$.
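The entries above can be assembled and checked for row-normalization with a short sketch. The state labels `"S"`, `"E"`, and `("I", d, f)`, the function name, and the argument names are hypothetical and chosen for illustration; indices are 0-based here.

```python
def singlet_transitions(kappa_main, dom_dist, kappa, frag_init, ext):
    """Collapsed Singlet HMM transitions as a dict {(src, dst): prob}.

    States are "S" (start), "E" (end), and ("I", d, f) for domain d and
    fragment-type f.  dom_dist[d], kappa[d], frag_init[d][f] and
    ext[d][f][g] follow the parameters named in the text.
    """
    n_dom = len(dom_dist)
    n_frag = len(frag_init[0])
    empty0 = sum(dom_dist[d] * (1.0 - kappa[d]) for d in range(n_dom))
    z = 1.0 - kappa_main * empty0  # normaliser from eliminating null cycles
    emit = [(d, f) for d in range(n_dom) for f in range(n_frag)]

    def enter(d, f):
        # Open a nonempty domain of type d and start a fragment of type f.
        return kappa_main * dom_dist[d] * kappa[d] * frag_init[d][f] / z

    T = {("S", "E"): (1.0 - kappa_main) / z}
    for d2, f2 in emit:
        T[("S", ("I", d2, f2))] = enter(d2, f2)
    for d, f in emit:
        src = ("I", d, f)
        stop = 1.0 - sum(ext[d][f])       # fragment termination
        leave = stop * (1.0 - kappa[d])   # fragment and domain both terminate
        T[(src, "E")] = leave * (1.0 - kappa_main) / z
        for d2, f2 in emit:
            p = leave * enter(d2, f2)                    # inter-domain term
            if d2 == d:
                p += stop * kappa[d] * frag_init[d][f2]  # new fragment, same domain
                p += ext[d][f][f2]                       # same fragment continues
            T[(src, ("I", d2, f2))] = p
    return T
```

Summing each source state's outgoing probabilities should give 1, which is a quick sanity check on any particular parameter setting.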

C.1.3 Pair HMM for MixDom

In the joint TKF92 Pair HMM, the start$\to $end weight is $\tkftrans (\insrate ,\delrate ,\evoltime )_{\sta \fin }$. In MixDom, the probability that a $\mat $ state emits no sequence is obtained by summing this TKF92 null output probability over domain types, $\emptyseg _\evoltime = \sum _{\dom \in \ndom } \domdist _\dom \tkftrans ^{(\dom )}_{\sta \fin }$ where $\tkftrans ^{(\dom )} \equiv \tkftrans (\insrate _\dom ,\delrate _\dom ,\evoltime )$. The Singlet HMM’s null probability $\emptyseg _0$ also appears in the Pair HMM, governing the $\ins $ and $\del $ state null outputs (which involve only one sequence).

Start with the Pair HMM and split the $\mat $, $\ins $, and $\del $ states into non-emitting and emitting states. Let $\mnull $, $\inull $, $\dnull $ denote the separated empty-match, empty-insert, and empty-delete states, respectively. The $8 \times 8$ transition matrix for this null-separated joint pair HMM is \begin {equation} \exptrans = \left ( \begin {array}{r|cccccccc} & \sta & \mat & \ins & \del & \fin & \mnull & \inull & \dnull \\ \hline \sta & 0 & (1 - \emptyseg _\evoltime )\tkftrans _{\sta \mat } & (1 - \emptyseg _0) \tkftrans _{\sta \ins } & (1 - \emptyseg _0) \tkftrans _{\sta \del } & \tkftrans _{\sta \fin } & \emptyseg _\evoltime \tkftrans _{\sta \mat } & \emptyseg _0 \tkftrans _{\sta \ins } & \emptyseg _0 \tkftrans _{\sta \del } \\ \mat & 0 & (1 - \emptyseg _\evoltime )\tkftrans _{\mat \mat } & (1 - \emptyseg _0) \tkftrans _{\mat \ins } & (1 - \emptyseg _0) \tkftrans _{\mat \del } & \tkftrans _{\mat \fin } & \emptyseg _\evoltime \tkftrans _{\mat \mat } & \emptyseg _0 \tkftrans _{\mat \ins } & \emptyseg _0 \tkftrans _{\mat \del } \\ \ins & 0 & (1 - \emptyseg _\evoltime )\tkftrans _{\ins \mat } & (1 - \emptyseg _0) \tkftrans _{\ins \ins } & (1 - \emptyseg _0) \tkftrans _{\ins \del } & \tkftrans _{\ins \fin } & \emptyseg _\evoltime \tkftrans _{\ins \mat } & \emptyseg _0 \tkftrans _{\ins \ins } & \emptyseg _0 \tkftrans _{\ins \del } \\ \del & 0 & (1 - \emptyseg _\evoltime )\tkftrans _{\del \mat } & (1 - \emptyseg _0) \tkftrans _{\del \ins } & (1 - \emptyseg _0) \tkftrans _{\del \del } & \tkftrans _{\del \fin } & \emptyseg _\evoltime \tkftrans _{\del \mat } & \emptyseg _0 \tkftrans _{\del \ins } & \emptyseg _0 \tkftrans _{\del \del } \\ \fin & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \mnull & 0 & (1 - \emptyseg _\evoltime )\tkftrans _{\mat \mat } & (1 - \emptyseg _0) \tkftrans _{\mat \ins } & (1 - \emptyseg _0) \tkftrans _{\mat \del } & \tkftrans _{\mat \fin } & \emptyseg _\evoltime \tkftrans _{\mat \mat } & \emptyseg _0 \tkftrans _{\mat \ins } & \emptyseg _0 
\tkftrans _{\mat \del } \\ \inull & 0 & (1 - \emptyseg _\evoltime )\tkftrans _{\ins \mat } & (1 - \emptyseg _0) \tkftrans _{\ins \ins } & (1 - \emptyseg _0) \tkftrans _{\ins \del } & \tkftrans _{\ins \fin } & \emptyseg _\evoltime \tkftrans _{\ins \mat } & \emptyseg _0 \tkftrans _{\ins \ins } & \emptyseg _0 \tkftrans _{\ins \del } \\ \dnull & 0 & (1 - \emptyseg _\evoltime )\tkftrans _{\del \mat } & (1 - \emptyseg _0) \tkftrans _{\del \ins } & (1 - \emptyseg _0) \tkftrans _{\del \del } & \tkftrans _{\del \fin } & \emptyseg _\evoltime \tkftrans _{\del \mat } & \emptyseg _0 \tkftrans _{\del \ins } & \emptyseg _0 \tkftrans _{\del \del } \\ \end {array} \right ) \label {eq:exptrans_matrix} \end {equation} with $\tkftrans \equiv \tkftrans (\insrate _\main ,\delrate _\main ,\evoltime )$; or, elementwise,

\begin {eqnarray} \exptrans _{ij} & = & \left \{ \begin {array}{ll} 0 & \mbox {if $j = \sta $} \\ (1 - \emptyseg _\evoltime ) \tkftrans _{i\mat } & \mbox {if $j = \mat $} \\ (1 - \emptyseg _0) \tkftrans _{ij} & \mbox {if $j \in \{ \ins , \del \}$} \\ \tkftrans _{i\fin } & \mbox {if $j=\fin $} \\ \emptyseg _\evoltime \tkftrans _{i\mat } & \mbox {if $j = \mnull $} \\ \emptyseg _0 \tkftrans _{i\ins } & \mbox {if $j = \inull $} \\ \emptyseg _0 \tkftrans _{i\del } & \mbox {if $j = \dnull $} \end {array} \right . \label {eq:exptrans_elementwise} \end {eqnarray} Here a null source state $i \in \{ \mnull , \inull , \dnull \}$ uses the $\tkftrans $ row of its emitting counterpart ($\mat $, $\ins $, $\del $ respectively), and the $\fin $ row is identically zero, as in the matrix form above.
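A minimal sketch of this construction follows. The state labels (`"M0"`, `"I0"`, `"D0"` for the null states, `"E"` for the end state) and the function name are hypothetical, and the top-level matrix `tau` is passed in as given numbers rather than computed from the TKF rates.

```python
STATES = ["S", "M", "I", "D", "E", "M0", "I0", "D0"]
TWIN = {"M0": "M", "I0": "I", "D0": "D"}  # each null state reuses its emitting twin's row

def expanded_transitions(tau, empty_t, empty_0):
    """Null-separated 8x8 transition matrix as a dict-of-dicts.

    tau[i][j] -- top-level transition probabilities over S, M, I, D, E
    empty_t   -- probability that a Match domain emits nothing
    empty_0   -- probability that an Insert/Delete domain emits nothing
    """
    T = {i: {j: 0.0 for j in STATES} for i in STATES}
    for i in STATES:
        if i == "E":
            continue  # End is absorbing: its row stays zero
        row = tau[TWIN.get(i, i)]
        T[i]["M"] = (1.0 - empty_t) * row["M"]
        T[i]["I"] = (1.0 - empty_0) * row["I"]
        T[i]["D"] = (1.0 - empty_0) * row["D"]
        T[i]["E"] = row["E"]
        T[i]["M0"] = empty_t * row["M"]
        T[i]["I0"] = empty_0 * row["I"]
        T[i]["D0"] = empty_0 * row["D"]
    return T
```

Because each emitting column and its null twin split one entry of `tau` in proportions that sum to 1, every non-End row of the result is row-normalized whenever `tau` is.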

Let $\slicetrans _{\Phi _1,\Phi _2}$ be the matrix formed by zeroing all rows except $i \in \Phi _1$ and all columns except $j \in \Phi _2$ of $\exptrans $. Consider the matrix of transitions between the empty states $\mnull ,\inull ,\dnull $ \[ \slicetrans _{\mnull \inull \dnull ,\mnull \inull \dnull } = \left ( \begin {array}{r|cccc} & \ldots & \mnull & \inull & \dnull \\ \hline \ldots & \ldots & \ldots & \ldots & \ldots \\ \mnull & \ldots & \emptyseg _\evoltime \tkftrans _{\mat \mat } & \emptyseg _0 \tkftrans _{\mat \ins } & \emptyseg _0 \tkftrans _{\mat \del } \\ \inull & \ldots & \emptyseg _\evoltime \tkftrans _{\ins \mat } & \emptyseg _0 \tkftrans _{\ins \ins } & \emptyseg _0 \tkftrans _{\ins \del } \\ \dnull & \ldots & \emptyseg _\evoltime \tkftrans _{\del \mat } & \emptyseg _0 \tkftrans _{\del \ins } & \emptyseg _0 \tkftrans _{\del \del } \end {array} \right ) \] where the unshown rows and columns (corresponding to transitions to, from, or between $\sta ,\mat ,\ins ,\del ,\fin $) have zero entries. Summing over paths of all (including zero) lengths via $\mnull ,\inull ,\dnull $ (i.e. performing the Schur complement) yields \[ \sum _{\sumidx =0}^\infty \slicetrans _{\mnull \inull \dnull ,\mnull \inull \dnull }^\sumidx = (I - \slicetrans _{\mnull \inull \dnull ,\mnull \inull \dnull })^{-1} = \nullpathsum \] which has a relatively simple closed form (reducing to $3 \times 3$ matrix inversion). 
The effective nonempty $5 \times 5$ transition matrix (with $\mnull ,\inull ,\dnull $ summed out) is \[ \nonemptytrans \equiv \nonemptytrans (\theta ,\evoltime ) = \slicetrans _{\sta \mat \ins \del \fin ,\sta \mat \ins \del \fin } \ +\ \slicetrans _{\sta \mat \ins \del \fin ,\mnull \inull \dnull } \cdot \nullpathsum \cdot \slicetrans _{\mnull \inull \dnull ,\sta \mat \ins \del \fin } \] where $\theta = (\insrate _\main ,\delrate _\main ,\{\domdist _\dom \},\{\insrate _\dom ,\delrate _\dom \},\{\fragdist _{\dom \frag }\},\{\ext ^{(\dom )}_{\srcfrag \destfrag }\},\{\classdist _{\dom \frag \class }\},\{\exch ^{(\class )},\eqm ^{(\class )}\})$.
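The null-state elimination can be sketched generically for any transition matrix stored as a dict-of-dicts. The helper names `solve` and `eliminate_nulls` are hypothetical. Rather than forming the inverse explicitly, the sketch solves the small linear system $(I - \slicetrans _{\text {null},\text {null}}) X = \slicetrans _{\text {null},\text {kept}}$ by Gauss-Jordan elimination, which gives the same path sum.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[r]) + list(B[r]) for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [v / piv for v in M[c]]
        for r in range(n):
            if r != c:
                fac = M[r][c]
                M[r] = [v - fac * w for v, w in zip(M[r], M[c])]
    return [row[n:] for row in M]

def eliminate_nulls(T, keep, null):
    """Return T_kk + T_kn (I - T_nn)^{-1} T_nk as a dict-of-dicts over `keep`."""
    A = [[(1.0 if a == b else 0.0) - T[a][b] for b in null] for a in null]
    B = [[T[a][j] for j in keep] for a in null]
    X = solve(A, B)  # X[r][c]: null state r -> kept state c via null-only paths
    out = {}
    for i in keep:
        out[i] = {}
        for c, j in enumerate(keep):
            out[i][j] = T[i][j] + sum(T[i][a] * X[r][c] for r, a in enumerate(null))
    return out
```

For a row-normalized input in which every null state eventually exits, the collapsed rows remain row-normalized, which is a convenient test of an implementation.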

Following null state elimination, the collapsed Pair HMM has $5\ndom \nfrag +2$ states, namely \[\nonemptystates = \{ \sta \sta , \fin \fin \} \cup \{ \ustate \xstate _{\srcdom \srcfrag }: \ustate \xstate \in \{\mat \mat ,\mat \ins ,\mat \del ,\ins \ins ,\del \del \}, 1 \leq \srcdom \leq \ndom , 1 \leq \srcfrag \leq \nfrag \}\] where $\srcdom $ is a domain index and $\srcfrag $ is a fragment index. For notational convenience, we treat the distinguished states $\sta \sta $ and $\fin \fin $ as carrying sentinel indices $(\srcdom ,\srcfrag )=(0,0)$, which are unused by other states as true domain and fragment indices are 1-based. Thus expressions written in terms of a generic state $\ustate \xstate _{\srcdom \srcfrag }$ are understood to include the cases $\sta \sta $ and $\fin \fin $ by setting $(\srcdom ,\srcfrag )=(0,0)$, except where formulae explicitly require $1 \leq \srcdom \leq \ndom $ or $1 \leq \srcfrag \leq \nfrag $.

Loosely speaking, and noting the above caveat, every transition $\ustate \xstate _{\srcdom \srcfrag }\to \vstate \ystate _{\destdom \destfrag }$ involves three factors: a potential domain exit transition from $\xstate \to \fin $ in the nested model (weight $\domexit $); an inter-domain transition $\ustate \to \vstate $ in the top-level model that factors in paths through empty domains (weight $\nonemptytrans _{\ustate \vstate }$); and a potential domain re-entry transition from $\sta \to \ystate $ in the nested model (weight $\domenter $). The re-entry weight must include a factor of $(1 - \emptyseg _\evoltime )^{-1}$ (for top-level Match states) or $(1 - \emptyseg _0)^{-1}$ (for top-level Insert/Delete states) to account for domain entry being conditional on the domain being nonempty. These factors precisely cancel the factors in the $\mat $, $\ins $, and $\del $ columns of $\exptrans $, which were included solely to preserve row-normalization of $\exptrans $ and $\transnest $. If the domain type and top-level state are the same for source and destination state ($\ustate =\vstate $ and $\srcdom =\destdom $), then an additional intra-domain transition which extends the current domain, starting a new fragment, is folded in (weight $p_\text {SameDom}$). If, in addition, the nested state is unchanged ($\xstate =\ystate $), then an intra-fragment fragment-type transition from $\srcfrag $ to $\destfrag $ is also folded in (weight $p_\text {SameFrag}$), governed by the $\nfrag \times \nfrag $ transition matrix $\ext ^{(\srcdom )}_{\srcfrag \destfrag }$ of the Markov chain within the current fragment.

The transition matrix $\transnest (\theta ,\evoltime )$ for this collapsed Pair HMM has entries \begin {equation} \begin {array}{llrclll} \mbox {Source}\ i & \mbox {Destination}\ j & \multicolumn {5}{l}{\transnest _{ij} = \domexit \times \nonemptytrans _{\ustate \vstate } \times \domenter + \delta (\ustate =\vstate ) \samedom (p_\text {SameDom} + \delta (\xstate =\ystate ) p_\text {SameFrag})} \\ \left ( \ustate \xstate _{\srcdom \srcfrag } \right ) & \left ( \vstate \ystate _{\destdom \destfrag } \right ) & \domexit (\ustate ,\xstate ,\srcdom ,\srcfrag ) & \nonemptytrans _{\ustate \vstate } & \domenter (\vstate ,\ystate ,\destdom ,\destfrag ) & p_\text {SameDom}(\xstate ,\ystate ,\srcdom ,\srcfrag ,\destfrag ) & p_\text {SameFrag}(\srcdom ,\srcfrag ,\destfrag ) \\ \hline \sta \sta & \mat \ystate _{\destdom \destfrag } & 1 & \nonemptytrans _{\sta \mat } & (1 - \emptyseg _\evoltime )^{-1} \domdist _\destdom \tkftrans _{\sta \ystate }^{(\destdom )} \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \ins \ins _{\destdom \destfrag } & 1 & \nonemptytrans _{\sta \ins } & (1 - \emptyseg _0)^{-1} \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \del \del _{\destdom \destfrag } & 1 & \nonemptytrans _{\sta \del } & (1 - \emptyseg _0)^{-1} \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \fin \fin & 1 & \nonemptytrans _{\sta \fin } & 1 & 0 & 0 \\ \mat \xstate _{\srcdom \srcfrag } & \mat \ystate _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \tkftrans _{\xstate \fin }^{(\srcdom )} & \nonemptytrans _{\mat \mat } & (1 - \emptyseg _\evoltime )^{-1} \domdist _\destdom \tkftrans _{\sta \ystate }^{(\destdom )} \fragdist _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \tkftrans _{\xstate \ystate }^{(\srcdom )} \fragdist _{\srcdom \destfrag } & \ext ^{(\srcdom )}_{\srcfrag \destfrag } \\ & \ins \ins _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \tkftrans _{\xstate \fin }^{(\srcdom )} & 
\nonemptytrans _{\mat \ins } & (1 - \emptyseg _0)^{-1} \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \del \del _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \tkftrans _{\xstate \fin }^{(\srcdom )} & \nonemptytrans _{\mat \del } & (1 - \emptyseg _0)^{-1} \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \fin \fin & \notext ^{(\srcdom )}_\srcfrag \tkftrans _{\xstate \fin }^{(\srcdom )} & \nonemptytrans _{\mat \fin } & 1 & 0 & 0 \\ \ins \ins _{\srcdom \srcfrag } & \mat \ystate _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom & \nonemptytrans _{\ins \mat } & (1 - \emptyseg _\evoltime )^{-1} \domdist _\destdom \tkftrans _{\sta \ystate }^{(\destdom )} \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \ins \ins _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom & \nonemptytrans _{\ins \ins } & (1 - \emptyseg _0)^{-1} \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \kappa _\srcdom \fragdist _{\srcdom \destfrag } & \ext ^{(\srcdom )}_{\srcfrag \destfrag } \\ & \del \del _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom & \nonemptytrans _{\ins \del } & (1 - \emptyseg _0)^{-1} \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \fin \fin & \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom & \nonemptytrans _{\ins \fin } & 1 & 0 & 0 \\ \del \del _{\srcdom \srcfrag } & \mat \ystate _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom & \nonemptytrans _{\del \mat } & (1 - \emptyseg _\evoltime )^{-1} \domdist _\destdom \tkftrans _{\sta \ystate }^{(\destdom )} \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \ins \ins _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom & \nonemptytrans _{\del \ins } & (1 - \emptyseg _0)^{-1} \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } & 0 & 0 \\ & \del 
\del _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom & \nonemptytrans _{\del \del } & (1 - \emptyseg _0)^{-1} \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag } & \notext ^{(\srcdom )}_\srcfrag \kappa _\srcdom \fragdist _{\srcdom \destfrag } & \ext ^{(\srcdom )}_{\srcfrag \destfrag } \\ & \fin \fin & \notext ^{(\srcdom )}_\srcfrag \notkappa _\srcdom & \nonemptytrans _{\del \fin } & 1 & 0 & 0\\ \end {array}, \label {eq:mixdom_transitions} \end {equation} where $\tkftrans ^{(\dom )} \equiv \tkftrans (\insrate _\dom ,\delrate _\dom ,\evoltime )$, $\notext ^{(\dom )}_\srcfrag \equiv 1 - \sum _\destfrag \ext ^{(\dom )}_{\srcfrag \destfrag }$ is the fragment termination probability for fragment state $\srcfrag $ in domain $\dom $, and (as before) $\notkappa _\srcdom \equiv 1 - \kappa _\srcdom $.

The substitution parameters ($\exch ^{(\class )}, \eqm ^{(\class )}$) do not appear in the transition matrix, but in the emission probabilities of the various states. The probability of emitting token $(\anctok ,\destok )$ from state $\mat \mat _{\dom \frag }$ is \begin {equation} \label {eq:match-emission} P(\anctok ,\destok \mid \mat \mat _{\dom \frag }, \evoltime ) = \sum _{\class =1}^{\nclasses } \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\anctok \exp (\revsub ^{(\class )} \evoltime )_{\anctok ,\destok } \end {equation} where $\revsub ^{(\class )} = \exch ^{(\class )} \cdot \diag (\eqm ^{(\class )})$ is the rate matrix for site class $\class $, and $\classdist _{\dom \frag \class }$ is the probability that fragment state $\frag $ in domain $\dom $ generates site class $\class $. The probability of emitting ancestral token $\ancordestok $ from states $\{ \mat \del _{\dom \frag }, \del \del _{\dom \frag } \}$, or descendant token $\ancordestok $ from states $\{ \mat \ins _{\dom \frag }, \ins \ins _{\dom \frag } \}$, is $\sum _\class \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\ancordestok $.
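The match emission mixture can be sketched in plain Python. Two assumptions are made here that the text does not spell out: the diagonal of each rate matrix is set to minus the sum of the off-diagonal entries in its row, and the matrix exponential is approximated by scaling-and-squaring with a truncated Taylor series. All function names are hypothetical.

```python
def mat_mul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def mat_exp(Q, t, terms=20, squarings=10):
    """Approximate exp(Q t) by scaling-and-squaring with a Taylor series."""
    n = len(Q)
    scale = t / (2 ** squarings)
    A = [[q * scale for q in row] for row in Q]
    P = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms):
        term = [[v / k for v in row] for row in mat_mul(term, A)]  # A^k / k!
        P = [[p + v for p, v in zip(pr, tr)] for pr, tr in zip(P, term)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P

def rate_matrix(exch, eqm):
    """Off-diagonal Q[a][b] = exch[a][b] * eqm[b]; rows sum to zero (assumed)."""
    n = len(eqm)
    Q = [[exch[a][b] * eqm[b] if a != b else 0.0 for b in range(n)] for a in range(n)]
    for a in range(n):
        Q[a][a] = -sum(Q[a])
    return Q

def match_emission(class_dist, exch, eqm, t):
    """out[a][b] = sum_c class_dist[c] * eqm[c][a] * exp(Q_c t)[a][b]."""
    n = len(eqm[0])
    out = [[0.0] * n for _ in range(n)]
    for w, S, pi in zip(class_dist, exch, eqm):
        P = mat_exp(rate_matrix(S, pi), t)
        for a in range(n):
            for b in range(n):
                out[a][b] += w * pi[a] * P[a][b]
    return out
```

Here `class_dist`, `exch`, and `eqm` are the per-class weights, exchangeability matrices, and equilibrium vectors for one fixed (domain, fragment-type) pair. The returned table sums to 1 over all token pairs when each `eqm[c]` and `class_dist` are normalized.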

C.1.4 Baum-Welch Algorithm for MixDom Pair HMM

In order to map HMM transition counts back to the BDI sufficient statistics, we need to correct for the null transition elimination performed in the previous section, resolve the transition counts in the nested Pair HMM onto the separate components of the model, and then apply the formulas from earlier sections.

E-step. Run Forward-Backward on the collapsed $(5\ndom \nfrag +2)$-state Pair HMM (Section C.1.3). This yields expected transition counts $\hat {n}''_{ij}$ for all state pairs $i,j$ in the collapsed Pair HMM, and expected emission counts $\hat {e}^{i''}_{(\anctok ,\destok )}$ for each state $i$ and input-output token pair $(\anctok ,\destok )$. The sufficient statistics for the mixture component selectors $\{ \domdist _\dom , \fragdist _{\dom \frag } \}$ can be recovered directly at this stage, and the emission counts can be accumulated onto the appropriate $W, U, V$ statistics for the per-domain CTMCs.

Conceptually speaking, we next resolve the collapsed-HMM transition counts $\hat {n}''_{ij}$ to TKF91 Pair HMM-like transition counts $\hat {n}^{(\main )}_{ij}$ for transitions in the top-level inter-domain model; TKF92 Pair HMM-like transition counts $\hat {n}^{\mat (\srcdom )'}_{\xstate \ystate }$ for the nested intra-domain match, insert, and delete states $\mat \xstate _{\srcdom \frag }$ of each domain-match submodel (for $\xstate ,\ystate \in \{\mat ,\ins ,\del \}$); and singlet-style transition counts for the within-domain fragment Markov chain $\hat {n}^{\ins \ins (\srcdom )'}_{0}$, $\hat {n}^{\ins \ins (\srcdom )'}_{1}$, $\hat {n}^{\del \del (\srcdom )'}_{0}$, $\hat {n}^{\del \del (\srcdom )'}_{1}$ for the domain-insert and domain-delete submodels (where the “continuation” is now governed by the $\nfrag \times \nfrag $ Markov chain $\ext ^{(\srcdom )}$ rather than a scalar geometric parameter).

The intra-domain transition counts arise solely from the $p_\text {SameDom}$-weighted transitions, and can be isolated proportionally, while the inter-domain counts must correct for the Schur complement null cycle elimination. The zero-adjustment length counts $\hat {n}^{\ins \ins (\srcdom )'}_{\kappa }$, $\hat {n}^{\ins \ins (\srcdom )'}_{\notkappa }$, $\hat {n}^{\del \del (\srcdom )'}_{\kappa }$, $\hat {n}^{\del \del (\srcdom )'}_{\notkappa }$ accounting for the number of times a domain was empty vs nonempty must also receive contributions from the Schur complement correction.

The intra-domain counts can then be resolved to TKF91-like counts $\hat {n}^{(\srcdom )}_{ab}$ and intra-fragment fragment-type transition counts $\hat {n}^{(\dom )}_{\srcfrag \destfrag }$ (the expected number of transitions from fragment-type $\srcfrag $ to fragment-type $\destfrag $ within a single fragment of domain $\dom $), together with fragment termination counts $\hat {n}^{(\dom )}_{\notext ,\srcfrag }$. The singlet-style counts for each domain, $\hat {n}^{\ins \ins (\srcdom )'}_{0}$, $\hat {n}^{\ins \ins (\srcdom )'}_{1}$, $\hat {n}^{\del \del (\srcdom )'}_{0}$, $\hat {n}^{\del \del (\srcdom )'}_{1}$, can be resolved via a similar procedure to $\hat {n}_{\kappa }$, $\hat {n}_{\notkappa }$; these are directly the link sequence extension/termination counts $L^{(\srcdom )},M^{(\srcdom )}$, and are accumulated onto the running totals for those counts. Finally, all TKF91-like transition counts for the top-level and nested models are accumulated onto the BDI sufficient statistics $S, B, D, L, M$ as in Section A.1.8.

We now consider the null count restorations in more detail. In practice, we will use a notational shortcut (that is, nevertheless, correct and reliable, and entirely equivalent to the procedure we just outlined) to greatly simplify many of these calculations, bypassing much of this conceptual tower of piecewise count restorations. In doing this we again exploit a form of the score function identity—in this case, that the expected transition usage is the derivative of the log-likelihood with respect to the log-transition weight.

Resolving transition counts. Following the notation of Section C.1.3, let $(\slicetrans _{\Phi _1,\Phi _2})_{ij} = \exptrans _{ij} \delta (i \in \Phi _1, j \in \Phi _2)$ be the masked transition matrix, so that $\nullpathsum = (I - \slicetrans _{\mnull \inull \dnull ,\mnull \inull \dnull })^{-1}$ represents sums over null paths and $\nonemptytrans = \slicetrans _{\sta \mat \ins \del \fin ,\sta \mat \ins \del \fin } + \slicetrans _{\sta \mat \ins \del \fin ,\mnull \inull \dnull } \cdot \nullpathsum \cdot \slicetrans _{\mnull \inull \dnull ,\sta \mat \ins \del \fin }$ is the null-eliminated transition matrix.

Considering paths in $\exptrans $ that begin and end in $i,j \in \smide $, whose intermediate states (if any) are in $\{ \mnull , \inull , \dnull \}$, the expected transition usages and state occupancies are

\begin {eqnarray*} \expect [n_{ab}|i \to j] & = & \frac {(I + \slicetrans _{\sta \mat \ins \del \fin ,\mnull \inull \dnull } \cdot \nullpathsum )_{ia}\, \exptrans _{ab}\, (I + \nullpathsum \cdot \slicetrans _{\mnull \inull \dnull ,\sta \mat \ins \del \fin })_{bj}}{\nonemptytrans _{ij}} \\ & = & \frac {\partial \log \nonemptytrans _{ij}}{\partial \log \exptrans _{ab}} \end {eqnarray*}
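The equality between the expected usage and the log-derivative can be checked numerically on a toy chain. In this sketch the usage is computed as (weight of paths from $i$ to $a$) times $\exptrans _{ab}$ times (weight of paths from $b$ to $j$), divided by the collapsed weight, where both path weights include the zero-length path; the derivative is taken by central finite differences. The function names are hypothetical, and the null path sums are obtained by iterating the geometric series rather than by matrix inversion.

```python
import math

def null_path_sums(T, keep, null, iters=500):
    """Return (into, reach), the weights of paths running only through null states.

    into[i][a]  -- paths from kept state i that end at null state a
    reach[b][j] -- paths from null state b that end at kept state j
    """
    into = {i: {a: T[i][a] for a in null} for i in keep}
    reach = {b: {j: T[b][j] for j in keep} for b in null}
    for _ in range(iters):  # fixed-point iteration of the geometric series
        into = {i: {a: T[i][a] + sum(into[i][c] * T[c][a] for c in null)
                    for a in null} for i in keep}
        reach = {b: {j: T[b][j] + sum(T[b][c] * reach[c][j] for c in null)
                     for j in keep} for b in null}
    return into, reach

def collapsed(T, keep, null):
    """Null-eliminated transition weights between kept states."""
    _, reach = null_path_sums(T, keep, null)
    return {i: {j: T[i][j] + sum(T[i][b] * reach[b][j] for b in null)
                for j in keep} for i in keep}

def expected_usage(T, keep, null, i, j, a, b):
    """Expected number of a -> b steps inside one collapsed transition i -> j."""
    into, reach = null_path_sums(T, keep, null)
    left = (1.0 if a == i else 0.0) if a in keep else into[i][a]
    right = (1.0 if b == j else 0.0) if b in keep else reach[b][j]
    return left * T[a][b] * right / collapsed(T, keep, null)[i][j]

def log_derivative(T, keep, null, i, j, a, b, eps=1e-6):
    """Central finite difference of log(collapsed_ij) with respect to log(T_ab)."""
    def value(scale):
        T2 = {r: dict(row) for r, row in T.items()}
        T2[a][b] = T[a][b] * scale
        return math.log(collapsed(T2, keep, null)[i][j])
    return (value(math.exp(eps)) - value(math.exp(-eps))) / (2.0 * eps)
```

On any small absorbing chain the two functions should agree to within finite-difference error for every pair $(a,b)$.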

The expected transition usage is the score function identity, appearing in a new form: the $n_{ab}$ are sufficient statistics for the $\exptrans $ path log-likelihood. We can use this identity, with the chain rule, for many of the counts we seek:

\begin {eqnarray} \hat {n}^{(\main )}_{\ustate \vstate } & = & \sum _{i,j} \hat {n}''_{ij} \sum _{k,l} \frac {\partial \log \transnest _{ij}}{\partial \log \nonemptytrans _{kl}} \sum _{a,b} \frac {\partial \log \nonemptytrans _{kl}}{\partial \log \exptrans _{ab}} \frac {\partial \log \exptrans _{ab}}{\partial \log \tkftrans ^{(\main )}_{\ustate \vstate }} \label {eq:main-counts} \\ \hat {n}^{(\srcdom )}_{\ustate \vstate } & = & \sum _{i,j} \hat {n}''_{ij} \frac {\partial \log \transnest _{ij}}{\partial \log \tkftrans ^{(\srcdom )}_{\ustate \vstate }} \label {eq:domain-counts} \\ \hat {n}_{\vartheta } & = & \sum _{i,j} \hat {n}''_{ij} \left ( \frac {\partial \log \transnest _{ij}}{\partial \log \vartheta } + \sum _{k,l} \frac {\partial \log \transnest _{ij}}{\partial \log \nonemptytrans _{kl}} \sum _{a,b} \frac {\partial \log \nonemptytrans _{kl}}{\partial \log \exptrans _{ab}} \frac {\partial \log \exptrans _{ab}}{\partial \log \vartheta } \right ) \label {eq:var-counts} \end {eqnarray}

for $\vartheta \in \{ \domdist _\dom , \kappa _\dom , \notkappa _\dom , \fragdist _{\dom \frag }, \ext ^{(\dom )}_{\srcfrag \destfrag }, \notext ^{(\dom )}_\srcfrag \}$. A few notes:

1.: In these expressions we have written $\tkftrans ^{(\main )}_{\ustate \vstate }$ for the inter-domain TKF91 transition probabilities that just appear as $\tkftrans _{\ustate \vstate }$ in Equation (C.1).
2.: The term $\frac {\partial \log \nonemptytrans _{ij}}{\partial \log \exptrans _{ab}}$ is used to highlight use of the chain rule, but the actual calculation of this term can be done via the matrix formula for $\expect [n_{ab}|i \to j]$ above.
3.: We must be careful to treat $(\kappa _\dom ,\notkappa _\dom )$ as independent free parameters of $\transnest _{ij}$ for the purpose of the partial derivatives in the expression for $\hat {n}_{\vartheta }$ above, and similarly for $(\ext ^{(\dom )}_{\srcfrag \destfrag },\notext ^{(\dom )}_\srcfrag )$; even though they are deterministically related by $\kappa _\dom + \notkappa _\dom = 1$ and $\notext ^{(\dom )}_\srcfrag = 1 - \sum _\destfrag \ext ^{(\dom )}_{\srcfrag \destfrag }$, we should not differentiate through those constraints when calculating the derivatives of $\exptrans _{ab}$ with respect to $\kappa _\dom $, $\notkappa _\dom $, $\ext ^{(\dom )}_{\srcfrag \destfrag }$, and $\notext ^{(\dom )}_\srcfrag $.
4.: We are also treating $\exptrans _{ab}$ and $\vartheta $ as independent free parameters of $\transnest _{ij}$ for the purpose of the expression for $\hat {n}_{\vartheta }$ above, which is why we expand the derivative into two terms: one term involving $\frac {\partial \log \transnest _{ij}}{\partial \log \vartheta }$ which captures the direct dependence of $\transnest _{ij}$ on $\vartheta $ (e.g. via $\domdist _\destdom $ or $\fragdist _{\destdom \destfrag }$ in the transition formulas above), and another term which captures the indirect dependence of $\transnest _{ij}$ on $\vartheta $ via $\exptrans _{ab}$.
5.: Similar points apply to the roles of $\tkftrans ^{(\srcdom )}_{\ustate \vstate }$ in the expression for $\hat {n}^{(\srcdom )}_{\ustate \vstate }$, and of $\tkftrans ^{(\main )}_{\ustate \vstate }$ in the expression for $\hat {n}^{(\main )}_{\ustate \vstate }$: these must be treated as free parameters and not differentiated through.
6.: In contrast, however, we must differentiate through the $\emptyseg _0$ and $\emptyseg _\evoltime $ terms in $\exptrans $ when calculating the derivatives of $\exptrans _{ab}$ with respect to $\domdist _\dom $, $\notkappa _\dom $, and $\tkftrans ^{(\dom )}_{\sta \fin }$.
7.: Similarly, we must differentiate through $\domexit $, $\domenter $, $p_\text {SameDom}$, and $p_\text {SameFrag}$ when calculating the derivatives of $\transnest _{ij}$. These terms do not represent free parameters.
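To illustrate note 3 with a toy weight (a hypothetical monomial, not an actual entry of $\transnest $), take $w = \kappa _\dom ^{\,a}\, \notkappa _\dom ^{\,b}$ for a path that uses the extension factor $a$ times and the termination factor $b$ times. Treating the two parameters as independent recovers the usage counts directly, \[ \frac {\partial \log w}{\partial \log \kappa _\dom } = a, \qquad \frac {\partial \log w}{\partial \log \notkappa _\dom } = b, \] whereas differentiating through the constraint $\notkappa _\dom = 1 - \kappa _\dom $ would instead give \[ \frac {d \log w}{d \log \kappa _\dom } = a - b\, \frac {\kappa _\dom }{\notkappa _\dom }, \] which mixes the two counts and is no longer a sufficient statistic for either parameter.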

Finally, we use these counts to accumulate onto the BDI sufficient statistics for the top-level inter-domain TKF91 model and nested intra-domain TKF92 models as described in Section A.1.8. The fragment correction for the intra-domain counts has effectively already been performed by the way we resolved the intra-domain counts from the collapsed Pair HMM counts, which account for the probability of fragment continuation vs new fragments via the $p_\text {SameDom}$-weighted terms which were differentiated through by the score function identity. We also need to accumulate the expected counts for the mixture component selectors $\{ \domdist _\dom , \fragdist _{\dom \frag } \}$, which are $\hat {n}_{\vartheta }$ for $\vartheta \in \{ \domdist _\dom , \fragdist _{\dom \frag } \}$ as calculated above.

Accumulating sufficient statistics. Initialize all sufficient statistics to zero. Then, for each training pair $(x, y)$ with evolutionary time $\evoltime $, accumulate as follows.

Top-level BDI (inter-domain TKF91). From the top-level count matrix $\hat {n}^{(\main )}_{ij}$, compute BDI expectations using (??)–(??) with parameters $(\insrate _\main ,\delrate _\main ,\evoltime )$, and accumulate:

\begin {eqnarray*} B^{(\main )} & \pluseq & \emcount ^B(\hat {\bf n}^{(\main )},\evoltime ) \\ D^{(\main )} & \pluseq & \emcount ^D(\hat {\bf n}^{(\main )},\evoltime ) \\ S^{(\main )} & \pluseq & \emcount ^S(\hat {\bf n}^{(\main )},\evoltime ) \\ L^{(\main )} & \pluseq & \hat {n}^{(\main )}_{\kappa } \quad \text {(expected ancestral \& inserted domain count)} \\ M^{(\main )} & \pluseq & 1 \quad \text {(expected top-level terminations)} \\ T^{(\main )} & \pluseq & \evoltime \quad \text {(top-level BDI observation time)} \end {eqnarray*}

Note that $L^{(\main )}$ counts domains (top-level links), not residues. It is not directly observed, but is recovered from the M and D column sums of $\hat {n}^{(\main )}_{ij}$: $\hat {n}^{(\main )}_\kappa = \sum _{i}(\hat {n}^{(\main )}_{i\mat } + \hat {n}^{(\main )}_{i\del })$. The time statistic $T^{(\main )}$ is nominally weighted by $\hat {n}^{(\main )}_{\notkappa }$, since each independent BDI process contributes one trajectory of duration $\evoltime $; at the top level there is exactly one such process per training pair, so this weight is $1$, which is why the increments above are $M^{(\main )} \pluseq 1$ and $T^{(\main )} \pluseq \evoltime $.
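The column-sum recovery of $\hat {n}^{(\main )}_\kappa $ can be illustrated with a minimal sketch. The nested-dictionary layout and the state labels "S", "M", "I", "D", "E" are assumptions made for illustration only, and the counts are toy numbers rather than output of a real Forward-Backward pass.

```python
def link_count(n_hat, match="M", delete="D"):
    """Sum the match and delete columns of an expected transition-count matrix.

    n_hat[i][j] is the expected number of i -> j transitions. The state labels
    are hypothetical placeholders for the top-level pair-HMM states.
    """
    return sum(row.get(match, 0.0) + row.get(delete, 0.0) for row in n_hat.values())


# Toy expected counts for one training pair (illustrative numbers only).
n_hat_main = {
    "S": {"M": 0.9, "I": 0.1},
    "M": {"M": 2.0, "I": 0.3, "D": 0.5, "E": 0.6},
    "I": {"M": 0.4, "I": 0.1, "D": 0.1, "E": 0.1},
    "D": {"M": 0.2, "D": 0.1, "E": 0.3},
}

# Expected number of top-level links: every transition into M or D.
n_kappa = link_count(n_hat_main)
```

The same helper would apply unchanged to a per-domain count matrix, where the result counts fragments rather than domains.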

Per-domain BDI (intra-domain TKF92, domain $\srcdom $). The intra-domain counts $\hat {n}^{(\srcdom )}_{ij}$ have already had the TKF92 fragment correction applied (fragment continuation vs new fragments was resolved by differentiating through the $p_\text {SameFrag}$ terms). Thus:

\begin {eqnarray*} B^{(\srcdom )} & \pluseq & \emcount ^B(\hat {\bf n}^{(\srcdom )},\evoltime ) \\ D^{(\srcdom )} & \pluseq & \emcount ^D(\hat {\bf n}^{(\srcdom )},\evoltime ) \\ S^{(\srcdom )} & \pluseq & \emcount ^S(\hat {\bf n}^{(\srcdom )},\evoltime ) \\ L^{(\srcdom )} & \pluseq & \hat {n}^{(\srcdom )}_\kappa \quad \text {(expected ancestral \& inserted fragment count)} \\ M^{(\srcdom )} & \pluseq & \hat {n}^{(\srcdom )}_{\notkappa } \quad \text {(expected domain terminations)} \\ T^{(\srcdom )} & \pluseq & \evoltime \cdot \hat {n}^{(\srcdom )}_{\notkappa } \quad \text {(expected domain-level BDI time)} \\ F^{(\dom )}_{\srcfrag \destfrag } & \pluseq & \hat {n}^{(\dom )}_{\ext ,\srcfrag \destfrag } \quad \text {(fragment $\srcfrag \to \destfrag $ transitions)} \\ E^{(\dom )}_{\srcfrag } & \pluseq & \hat {n}^{(\dom )}_{\notext ,\srcfrag } \quad \text {(fragment $\srcfrag $ terminations)} \end {eqnarray*}

Here $L^{(\srcdom )}$ counts fragments (links within the domain), not residues. Each domain-$\srcdom $ entry runs an independent fragment-level BDI process for time $\evoltime $, so $T^{(\srcdom )}$ is weighted by the expected number of entries. The fragment-type transition counts $F^{(\dom )}_{\srcfrag \destfrag }$ form an $\nfrag \times \nfrag $ matrix per domain, recording the expected number of intra-fragment Markov transitions between fragment-types within a fragment of domain $\dom $.

Mixture selectors.

\begin {eqnarray*} \mixcount _{\domdist _\srcdom } & \pluseq & \hat {n}_{\domdist _\srcdom } \\ \mixcount _{\fragdist _{\srcdom \frag }} & \pluseq & \hat {n}_{\fragdist _{\srcdom \frag }} \\ \end {eqnarray*}

CTMC substitution (per site class). For each match emission at state $\mat \mat _{\dom \frag }$ with observed pair $(\anctok ,\destok )$, the posterior probability of site class $\class $ given the emission is $\gamma _\class \propto \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\anctok \exp (\revsub ^{(\class )} \evoltime )_{\anctok ,\destok }$. Using this posterior weight, accumulate endpoint-conditioned CTMC expectations (??)–(??) on the $|\alphabet |$-state chain with rate matrix $\revsub ^{(\class )}$:

\begin {eqnarray*} W^{(\class )}_\anctok & \pluseq & \text {(dwell in state $\anctok $, weighted by $\gamma _\class $ and HMM posterior)} \\ U^{(\class )}_{\anctok ,\anctok '} & \pluseq & \text {(transition $\anctok \to \anctok '$, weighted by $\gamma _\class $)} \\ V^{(\class )}_\anctok & \pluseq & \text {(composition count for $\anctok $ at match-position ancestors, insertions, and deletions, weighted by $\gamma _\class $)} \end {eqnarray*}

Site class assignment counts. For each emission at state $\mat \mat _{\dom \frag }$ (or the corresponding insert/delete states), accumulate the posterior site class assignment: \[ \mixcount _{\classdist _{\dom \frag \class }} \pluseq \gamma _\class \]
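As a hedged illustration of the site-class posterior $\gamma _\class $ and its accumulation, the sketch below assumes that the transition probabilities $\exp (\revsub ^{(\class )}\evoltime )$ have already been computed and are passed in as nested lists. The function names, the argument layout, and the `weight` argument (standing in for the HMM posterior of the emission) are hypothetical.

```python
def site_class_posterior(class_weights, eqm, trans_prob, anc, des):
    """Posterior over site classes for one match emission (anc, des).

    class_weights[c]    : prior weight of class c for this domain/fragment type
    eqm[c][a]           : equilibrium frequency of character a under class c
    trans_prob[c][a][b] : precomputed exp(Q_c * t)[a][b] (assumed given)
    """
    unnorm = [w * eqm[c][anc] * trans_prob[c][anc][des]
              for c, w in enumerate(class_weights)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def accumulate_class_counts(counts, gamma, weight=1.0):
    """Add gamma, scaled by the emission's posterior weight, onto the class counts."""
    for c, g in enumerate(gamma):
        counts[c] += weight * g


# Toy example: binary alphabet, a slow class (0) and a fast class (1).
class_weights = [0.5, 0.5]
eqm = [[0.5, 0.5], [0.5, 0.5]]
trans_prob = [
    [[0.9, 0.1], [0.1, 0.9]],   # slow class
    [[0.6, 0.4], [0.4, 0.6]],   # fast class
]
gamma = site_class_posterior(class_weights, eqm, trans_prob, anc=0, des=1)
class_counts = [0.0, 0.0]
accumulate_class_counts(class_counts, gamma, weight=1.0)
```

In this toy case a substitution is four times as likely under the fast class, so the fast class receives most of the posterior mass.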

Remark C.2 (Genealogical correction terms in nested models). As noted in Section A.1.8, the CTMC sufficient statistics $W$, $U$, $V$ omit genealogical correction terms from transient and partially-observed lineages. An analogous omission applies at each nesting level of the MixDom model: transient domain insertions and deletions (domains that are born and die between times $0$ and $\evoltime $ without being directly observed) contribute to the top-level BDI sufficient statistics $B$, $D$, $S$ via the null count restoration, but their internal fragment-level processes (intra-fragment fragment-type Markov chain and BDI) are not modeled. Similarly, domains that are inserted after time $0$ or deleted before time $\evoltime $ do not accumulate fragment-level statistics for the period of their non-existence. This is consistent with the principle that the M-step optimizes only the complete-data log-likelihood for structures whose existence is certified by the HMM state path.

M-step. All M-step updates use MAP estimates whose priors contribute additive pseudocounts to the sufficient statistics, so the maximizer formulas are the MLE formulas applied to prior-augmented statistics. For the multinomial parameter groups (mixture weights, fragment-type transitions, site-class distributions), the priors below are the standard Dirichlet conjugates. For the BDI rates and the reversible GTR submodel, the priors below are non-conjugate regularizers; “Comments on conjugacy” below describes the proper conjugate priors and explains why we use the simpler ones here.

Priors.

Gamma$(\alpha _\insrate , \beta _{\insrate \delrate })$ on $\insrate $ and Gamma$(\alpha _\delrate , \beta _{\insrate \delrate })$ on $\delrate $, sharing the rate parameter $\beta _{\insrate \delrate }$. Augmented statistics: $B \to B + \alpha _\insrate - 1$, $D \to D + \alpha _\delrate - 1$, $S \to S + \beta _{\insrate \delrate }$.
Gamma$(\alpha _\exch , \beta _\exch )$ on each $\exch _{ij}$ (shared $\beta _\exch $ per row): $U_{ij} \to U_{ij} + \alpha _\exch - 1$, $W_i \to W_i + \beta _\exch $.
Dirichlet$(\alpha _\eqm )$ on $\eqm $: $V_i \to V_i + \alpha _\eqm - 1$.
Dirichlet$(\alpha _\domdist )$ on domain weights: $\mixcount _{\domdist _\dom } \to \mixcount _{\domdist _\dom } + \alpha _\domdist - 1$.
Dirichlet$(\alpha _\fragdist )$ on fragment weights: $\mixcount _{\fragdist _{\dom \frag }} \to \mixcount _{\fragdist _{\dom \frag }} + \alpha _\fragdist - 1$.
Dirichlet$(\alpha _\ext )$ on each row of the fragment transition matrix (including the termination probability): $F^{(\dom )}_{\srcfrag \destfrag } \to F^{(\dom )}_{\srcfrag \destfrag } + \alpha _\ext - 1$, $E^{(\dom )}_\srcfrag \to E^{(\dom )}_\srcfrag + \alpha _\ext - 1$.
Dirichlet$(\alpha _\classdist )$ on site class distributions: $\mixcount _{\classdist _{\dom \frag \class }} \to \mixcount _{\classdist _{\dom \frag \class }} + \alpha _\classdist - 1$.

Comments on conjugacy. The Dirichlet priors on $\domdist $, $\fragdist $, $\classdist $, and the fragment-type transition matrix are conjugate to the corresponding multinomial likelihoods in the standard way. The two remaining cases warrant comment:

Reversible CTMC. Treated as an irreversible CTMC (independent off-diagonal rates $\revsub _{ij}$), the complete-data likelihood (A.1) is a regular exponential family and a product of independent Gammas on $\revsub _{ij}$ together with a Dirichlet on the initial distribution is conjugate. For the reversible parameterization $\revsub _{ij} = \exch _{ij}\eqm _j$ with symmetric exchangeabilities, the conjugate prior is not a product of independent Gammas on $\exch _{ij}$ and a Dirichlet on $\eqm $: detailed balance couples the two factors, and the conjugate prior is the cycle-corrected edge-flow density of Diaconis and Rolles (12), which arises as the de Finetti mixing measure for an edge-reinforced random walk on the state graph (10, 43). Reparameterizing in terms of an undirected edge weight $x_e = \eqm _i \exch _{ij} \eqm _j$ on each edge $e = \{i,j\}$ (with loops $x_{ii} = \eqm _i^2\, \revsub ^{\mathrm {loop}}_{ii}$ in the Grassmann-uniformized variant) and normalising so that $\sum _e x_e = 1$, the prior takes the form \[ \phi _{v_0,a}(x) \;\propto \; \Bigl (\prod _e x_e^{a_e - 1/2}\Bigr )\, x_{v_0}^{a_{v_0}/2}\, \prod _{v \ne v_0} x_v^{-(a_v + 1)/2} \;\sqrt {\det A(x)}, \] where $x_v = \sum _{e \ni v} x_e$ is the total flow at vertex $v$, $a_v = \sum _{e \ni v} a_e$ is the corresponding pseudocount, and $A(x)$ is a matrix indexed by a basis of cycles in the state graph whose determinant equals $\sum _{T} \prod _{e \notin T} x_e^{-1}$ (sum over spanning trees $T$), by Kirchhoff’s matrix-tree theorem. For the complete graph with loops (i.e. GTR), $\binom {|\alphabet |-1}{2}$ independent cycles contribute and the joint prior cannot be factored across edges. The Diaconis–Rolles prior conditions on the initial state $v_0$; including a stationary observation of $X(0)=v_0$ multiplies the prior by $\eqm (v_0) \propto \sqrt {x_{v_0}}$ in the edge-flow chart, which shifts the $v_0$-exponent and stays within the same family with adjusted hyperparameters. 
The independent Gamma${}\times {}$Dirichlet priors used here are non-conjugate regularizers in the reversible parameterization and are perfectly valid for MAP, but the closed-form posterior is in the Diaconis–Rolles family rather than back in Gamma${}\times {}$Dirichlet.

TKF91 BDI rates. The joint complete-data log-likelihood $\ell _1(\insrate ,\delrate )$ from (??), which includes the prior probability of the ancestral sequence length, is a curved exponential family in $(\insrate ,\delrate )$ because the $\log (\delrate -\insrate )$ term couples the natural parameters. A product of independent Gammas on $\insrate $ and $\delrate $ is therefore not conjugate; it is conjugate to the complete-data likelihood of the underlying linear birth–death process conditioned on the initial sequence length, but not to the joint likelihood that includes a stationary $L$-prior. The proper conjugate prior, derived in the queueing-theory literature by Armero and Bayarri (3) and applied to the linear-growth BDI by Conti (9), is most naturally written in terms of $\kappa = \insrate /\delrate \in (0,1)$ and $\delrate $: \[ \phi (\kappa ,\delrate ) \;\propto \; \kappa ^{a-1}(1-\kappa )^{b-1}\,\delrate ^{c-1}\, \exp \!\bigl (-\delrate \,(\tau _1\,\kappa + \tau _2)\bigr ), \] with five hyperparameters $(a,b,c,\tau _1,\tau _2)$ updated by $a \to a + B + L$, $b \to b + M$, $c \to c + B + D$, $\tau _1 \to \tau _1 + S + T$, $\tau _2 \to \tau _2 + S$. Marginalising $\delrate $ yields a Gauss-hypergeometric distribution on $\kappa $ (in the Johnson–Kotz–Balakrishnan family); the normaliser is a ${}_2F_1$ value and the posterior is tractable by one-dimensional quadrature or Gibbs sampling. Equivalently, the $M\log (\delrate -\insrate ) + L\log \insrate $ contribution from incorporating the stationary initial-length prior amounts to an extra $\mathrm {Beta}(L+1,M+1)$-like factor in $\kappa $ on top of the dynamics-only Gamma evidence. As above, our independent Gamma priors on $\insrate $ and $\delrate $ are non-conjugate regularizers in this parameterisation. 
In the long-time stationary regime where $B \approx D$, the $M\log (\delrate -\insrate )$ and $L\log \insrate - (L+M)\log \delrate $ stationary contributions are doing the work of identifying $\kappa $ separately from the overall rate scale; ignoring them entirely (i.e. Gamma EM with no $L,M$ counts) silently underuses the data when only one long observation is available.
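The five-hyperparameter update stated above can be written down directly. The sketch below is illustrative only: the class name `KappaMuPrior` and its methods are hypothetical, and the unnormalised log-density simply transcribes the displayed form of $\phi (\kappa ,\delrate )$ without computing the ${}_2F_1$ normaliser.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class KappaMuPrior:
    """Hyperparameters (a, b, c, tau1, tau2) of the prior on (kappa, mu)."""
    a: float
    b: float
    c: float
    tau1: float
    tau2: float

    def update(self, B, D, S, L, M, T):
        """Posterior hyperparameters after observing statistics (B, D, S, L, M, T)."""
        return KappaMuPrior(
            a=self.a + B + L,
            b=self.b + M,
            c=self.c + B + D,
            tau1=self.tau1 + S + T,
            tau2=self.tau2 + S,
        )

    def log_density(self, kappa, mu):
        """Unnormalised log-density at (kappa, mu), with 0 < kappa < 1 and mu > 0."""
        return ((self.a - 1.0) * math.log(kappa)
                + (self.b - 1.0) * math.log(1.0 - kappa)
                + (self.c - 1.0) * math.log(mu)
                - mu * (self.tau1 * kappa + self.tau2))


# Toy statistics (illustrative numbers only).
prior = KappaMuPrior(a=1.0, b=1.0, c=1.0, tau1=0.1, tau2=0.1)
posterior = prior.update(B=3.0, D=2.0, S=10.0, L=20.0, M=1.0, T=1.5)
```

Because the update is additive, statistics from several training pairs can be pooled before or after calling `update` with the same result.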

We use simple Gamma${}\times {}$Dirichlet pseudocounts throughout because they are easy to set, behave well as MAP regularizers, and keep the M-step closed-form (the augmented sufficient statistics still have a unique maximiser via the same quadratic in $\kappa $ and the same pooled GTR formula). A fully Bayesian treatment would substitute the Diaconis–Rolles and Armero–Bayarri/Conti priors above; we leave that to future work.

Indel rates. For the top-level rates $(\insrate _\main , \delrate _\main )$ and each per-domain rate pair $(\insrate _\dom , \delrate _\dom )$, solve the quadratic (A.7) with augmented $(B, D, L, M, S, T)$ and extract $\kappa ,\delrate ,\insrate $ via (??)–(??).

Fragment transition matrix. Row-normalize the augmented fragment-type transition counts per domain: $\ext ^{(\dom )}_{\srcfrag \destfrag } \leftarrow F^{(\dom )}_{\srcfrag \destfrag } / (E^{(\dom )}_\srcfrag + \sum _{\destfrag '} F^{(\dom )}_{\srcfrag \destfrag '})$ (with augmented $F$, $E$).
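A minimal sketch of this row normalisation, including the Dirichlet$(\alpha _\ext )$ pseudocounts from the priors list, is given below. The function name and the list-of-lists layout are assumptions for illustration.

```python
def fragment_mstep(F, E, alpha_ext=1.0):
    """MAP update for one domain's fragment-type chain.

    F[i][j] : expected i -> j fragment-type transitions
    E[i]    : expected terminations from fragment-type i
    Returns (ext, notext): extension probabilities and termination probabilities.
    """
    pseudo = alpha_ext - 1.0           # Dirichlet prior adds alpha - 1 to each count
    ext, notext = [], []
    for row, e in zip(F, E):
        aug_row = [f + pseudo for f in row]
        aug_e = e + pseudo
        denom = aug_e + sum(aug_row)   # termination shares the row's normaliser
        ext.append([f / denom for f in aug_row])
        notext.append(aug_e / denom)
    return ext, notext


# Toy counts for a domain with two fragment-types (illustrative numbers only).
F_counts = [[3.0, 1.0], [0.0, 2.0]]
E_counts = [1.0, 2.0]
ext, notext = fragment_mstep(F_counts, E_counts, alpha_ext=2.0)
```

Each row of `ext` together with the matching entry of `notext` sums to one, which mirrors the definition of the termination probability as the complement of the extension probabilities.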

Mixture weights. Normalize augmented counts: $\domdist _\dom \propto \mixcount _{\domdist _\dom }$, $\fragdist _{\dom \frag } \propto \mixcount _{\fragdist _{\dom \frag }}$.

Site class distributions. $\classdist _{\dom \frag \class } \propto \mixcount _{\classdist _{\dom \frag \class }}$.

CTMC parameters. The sufficient statistics are projected onto parameter groups.

Per-class equilibrium: $\eqm ^{(\class )}_\anctok \propto V^{(\class )}_\anctok + \alpha _\eqm - 1$. (This is the standard empirical-frequency estimator; the exact EM M-step couples $\eqm $ with $\exch $ through dwell-time statistics, but the approximation is standard practice for GTR models.)

Per-class exchangeability: The rate is $\exch ^{(\class )}_{\anctok ,\anctok '} \cdot \eqm ^{(\class )}_{\anctok '}$. The exchangeability is estimated from the bridge-expectation transition and dwell counts: \[ \exch ^{(\class )}_{\anctok ,\anctok '} = \frac {U^{(\class )}_{\anctok ,\anctok '} + U^{(\class )}_{\anctok ',\anctok }} {W^{(\class )}_\anctok \cdot \eqm ^{(\class )}_{\anctok '} + W^{(\class )}_{\anctok '} \cdot \eqm ^{(\class )}_\anctok } \]
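The exchangeability estimator can be checked numerically with a short sketch. The function name and argument layout are hypothetical; the toy statistics are constructed so that the expected transition count from $\anctok $ to $\anctok '$ equals dwell time times rate, in which case the estimator should recover the exchangeabilities used to build them.

```python
def exchangeability(U, W, eqm):
    """Symmetric exchangeabilities from transition counts U, dwell times W, equilibrium eqm."""
    n = len(W)
    S = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a + 1, n):
            num = U[a][b] + U[b][a]
            den = W[a] * eqm[b] + W[b] * eqm[a]
            S[a][b] = S[b][a] = num / den
    return S


# Toy check: build U from known exchangeabilities, then recover them.
S_true = [[0.0, 1.5, 0.5],
          [1.5, 0.0, 2.0],
          [0.5, 2.0, 0.0]]
eqm = [0.2, 0.3, 0.5]
W = [2.0, 3.0, 5.0]
# Expected a -> b transitions = dwell in a * rate(a -> b) = W[a] * S[a][b] * eqm[b].
U = [[W[a] * S_true[a][b] * eqm[b] for b in range(3)] for a in range(3)]
S_hat = exchangeability(U, W, eqm)
```

Passing prior-augmented $U$ and $W$ instead of the raw statistics would give the MAP version of the same formula.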

C.1.5 WFSTs for MixDom

As with TKF92 (Section A.2.2), constructing a WFST for MixDom is complicated by latent information—in this case, the domain type and fragment type associated with each position. We here outline two approaches to this issue.

The first approach is to preserve the latent information by promoting it to the transducer’s input/output alphabet: each character is decorated with its domain and fragment labels, yielding a Labeled-MixDom WFST whose state space is comparable to the Pair HMM but whose alphabet is enlarged. This approach is exact but produces larger machines.

The second approach integrates out the latent variables and approximates the result using compact order-1 machines whose transitions depend only on the most recently emitted characters. These are smaller and more efficient for tree-based inference, at the cost of approximating the full latent structure via local context.

C.2 Selected Inference Algorithms for MixDom

C.2.1 Fast Statistical Alignment (FSA)

Given a set of sequences and a phylogenetic tree with branch-specific pair HMMs (TKF92, MixDom, or distilled order-1 transducers), we construct a multiple sequence alignment using the sequence annealing approach of (6), to which we refer the reader for a full description of the algorithm.

Briefly, the method proceeds as follows. For each pair of sequences $(x,y)$ in a selected subset (either all $\binom {N}{2}$ pairs or an $O(N \log N)$ Erdős–Rényi sample), we compute pairwise residue alignment posteriors $P(x_i \sim y_j)$ by running the Forward-Backward algorithm on the pair HMM at an optimized evolutionary time $\hat {\tau }$. The time $\hat {\tau }$ is found by Newton–Raphson optimization of the expected log-likelihood (the “NR step”): \begin {align} \hat {\tau } &= \operatorname *{argmax}_\tau \; \mathbb {E}_{P(\pi | x,y,\tau _0)} \bigl [ \log P(x,y,\pi \mid \tau ) \bigr ] \label {eq:fsa-nr} \end {align}

where $\pi $ ranges over alignment paths and $\tau _0$ is an initial estimate. The expectation is computed from Forward-Backward expected counts at $\tau _0$, and the Newton iteration typically converges in 3–5 steps. (This time-maximization differs slightly from the approach of (6), which attempts to optimize all model parameters via unregularized EM for every pair and consequently must terminate the EM recursion early to avoid instability.)
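To make the NR step concrete, the following sketch runs a safeguarded Newton iteration on a toy objective. It is not the pair-HMM objective: as a stand-in, it uses a two-state symmetric substitution model with fixed expected counts `n_same` and `n_diff`, for which the maximiser is known in closed form. The function names and the safeguards (step halving to keep $\tau > 0$, a gradient-direction fallback where the objective is not locally concave) are assumptions of this illustration.

```python
import math


def newton_maximise(grad, hess, tau0, tol=1e-10, max_steps=50):
    """Maximise a 1-D objective over tau > 0 given its first and second derivatives."""
    tau = tau0
    for _ in range(max_steps):
        g, h = grad(tau), hess(tau)
        if h < 0:
            step = -g / h                          # ordinary Newton step
        else:
            step = math.copysign(0.5 * tau, g)     # fallback: move along the gradient
        new_tau = tau + step
        while new_tau <= 0:                        # keep the time strictly positive
            step *= 0.5
            new_tau = tau + step
        if abs(new_tau - tau) < tol:
            return new_tau
        tau = new_tau
    return tau


def make_toy_objective(n_same, n_diff):
    """Derivatives of Q(tau) = n_same*log((1+u)/2) + n_diff*log((1-u)/2), u = exp(-2 tau)."""
    def grad(tau):
        u = math.exp(-2.0 * tau)
        return 2.0 * u * (n_diff / (1.0 - u) - n_same / (1.0 + u))

    def hess(tau):
        u = math.exp(-2.0 * tau)
        inner = n_diff / (1.0 - u) - n_same / (1.0 + u)
        d_inner = n_diff / (1.0 - u) ** 2 + n_same / (1.0 + u) ** 2
        return -2.0 * u * (2.0 * inner + 2.0 * u * d_inner)

    return grad, hess


grad, hess = make_toy_objective(n_same=90.0, n_diff=10.0)
tau_hat = newton_maximise(grad, hess, tau0=0.5)
tau_exact = -0.5 * math.log(0.8)   # closed form: (1 - u)/2 = n_diff / (n_same + n_diff)
```

In the setting described above, the expected counts would come from Forward-Backward at $\tau _0$ and the derivatives from the pair HMM's transition and emission probabilities.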

The pairwise posteriors are then assembled into a multiple alignment by the greedy sequence annealing procedure of (6), which iteratively merges alignment columns to maximize a sum-of-pairs posterior objective.

C.2.2 Beam Search Ancestral Sequence Reconstruction (BeamASR)

We now describe an alternative progressive reconstruction method that finds the maximum-likelihood ancestral sequence at each internal node by beam search, without materializing the full composite automaton.

At each internal node $v$ with children $l,r$ and observed descendant sequences $c_l, c_r$, we seek \begin {align} \hat {a}_v &= \operatorname *{argmax}_{a} \bigl [ \log P(a, c_l \mid B_l) + \log P(a, c_r \mid B_r) - \log P(a \mid R) \bigr ] \label {eq:beam-ancestor} \end {align}

where $P(a, c_k \mid B_k)$ is the pair HMM forward probability on branch $k$ and $P(a \mid R)$ is the singlet probability under the root generator, subtracted to avoid double-counting the prior on $a$.

Incremental forward profiles. Since the branches are conditionally independent given the ancestor, we can evaluate (??) by maintaining incremental forward profiles: for each branch $k$, a 1D forward table $F_k[i, q]$ giving the log-probability that descendant positions $1,\ldots ,i$ have been emitted and the branch machine is in state $q$, given ancestor positions $1,\ldots ,j$ processed so far.

Each ancestor character extends both profiles independently in $O(L_k)$ time per branch, where $L_k = |c_k|$.

Beam search. The ancestor sequence $\hat {a}_v$ is built left-to-right by beam search. At each position $j$, the beam maintains $B$ candidate partial ancestors. For each candidate and each alphabet character $\sigma $:

1.: Extend both branch profiles by one ancestor character $\sigma $, comprising a match/delete phase (the ancestor emits $\sigma $, descendant positions advance via M or D transitions) and an insertion phase (descendant-only insertions following the ancestor emission).
2.: Update the singlet forward score for $\sigma $.
3.: Score the extension: $\Delta (j, \sigma ) = \Delta F_{l} + \Delta F_{r} - \Delta _{\mathrm {singlet}}$.

The top $B$ extensions (by cumulative score) are retained. Total cost per node is $O(K \cdot B \cdot A \cdot (L_l + L_r))$ where $K = |\hat {a}_v|$ and $A$ is the alphabet size.
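The control flow of the beam search can be sketched independently of the scoring model. In the sketch below, `delta` stands in for the extension score $\Delta (j,\sigma )$ and `stop_score` for the score of ending the ancestor at its current length; both are hypothetical callables, and the toy scores simply reward agreement with a fixed target string rather than evaluating branch profiles.

```python
import heapq


def beam_search(alphabet, delta, stop_score, beam_width, max_len):
    """Left-to-right beam search over ancestor strings.

    delta(prefix, ch)  : incremental log-score for appending ch (stand-in for Delta(j, sigma))
    stop_score(prefix) : log-score for terminating at this prefix (hypothetical)
    """
    beam = [(0.0, "")]
    best_score, best_seq = stop_score(""), ""
    for _ in range(max_len):
        # Expand every candidate by every alphabet character.
        extensions = [(s + delta(p, ch), p + ch) for s, p in beam for ch in alphabet]
        # Keep the top beam_width extensions by cumulative score.
        beam = heapq.nlargest(beam_width, extensions)
        for s, p in beam:
            total = s + stop_score(p)
            if total > best_score:
                best_score, best_seq = total, p
    return best_seq, best_score


# Toy scoring: reward matching a fixed target, penalise length mismatch.
target = "ACGT"

def toy_delta(prefix, ch):
    j = len(prefix)
    return 0.0 if j < len(target) and target[j] == ch else -2.0

def toy_stop(prefix):
    return -3.0 * abs(len(target) - len(prefix))

best_seq, best_score = beam_search("ACGT", toy_delta, toy_stop, beam_width=3, max_len=6)
```

Each iteration scores `beam_width` times the alphabet size extensions, which corresponds to the $B \cdot A$ factor in the cost above; the per-extension cost $O(L_l + L_r)$ would be incurred inside `delta` in a real implementation.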

Insertion phase via associative scan. The insertion recurrence within each branch profile has the form \begin {align} x_{i+1} &= \operatorname {logsumexp}\bigl (A_{II} \, x_i,\; b_i\bigr ) + e_i \label {eq:insert-recurrence} \end {align}

where $A_{II}$ is the I-to-I log-transition submatrix, $b_i$ collects transitions into insertion states from M and D, and $e_i$ is the emission score. This is a log-semiring affine recurrence, parallelizable via an associative scan with operator \[ (A_1, b_1) \oplus (A_2, b_2) = (A_2 \otimes A_1,\; \operatorname {logsumexp}(A_2 \, b_1,\; b_2)) \] where $\otimes $ denotes log-semiring matrix multiplication. This reduces the insertion phase from $O(L)$ sequential depth to $O(\log L)$.
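The composition operator can be checked against the sequential recurrence with a small pure-Python sketch. It assumes a scalar emission score $e_i$ per position (folded into each affine map) and uses a pairwise tree reduction to show the logarithmic depth; a full scan would also return every prefix, which is omitted here. All function names are hypothetical.

```python
import math

NEG_INF = float("-inf")

def logsumexp(values):
    m = max(values)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_matvec(A, x):
    return [logsumexp([a + xj for a, xj in zip(row, x)]) for row in A]

def log_matmul(A, B):
    cols = list(zip(*B))
    return [[logsumexp([a + b for a, b in zip(row, col)]) for col in cols] for row in A]

def element(A_II, b_i, e_i):
    """Fold the emission into the map x -> logsumexp(A_II x, b_i) + e_i."""
    return [[a + e_i for a in row] for row in A_II], [b + e_i for b in b_i]

def combine(first, second):
    """(A1, b1) (+) (A2, b2): apply `first`, then `second`."""
    A1, b1 = first
    A2, b2 = second
    shifted = log_matvec(A2, b1)
    return log_matmul(A2, A1), [logsumexp([u, v]) for u, v in zip(shifted, b2)]

def apply_map(elem, x):
    A, b = elem
    return [logsumexp([u, v]) for u, v in zip(log_matvec(A, x), b)]

def tree_reduce(elems):
    """Order-preserving pairwise reduction with O(log L) depth."""
    elems = list(elems)
    while len(elems) > 1:
        paired = [combine(elems[k], elems[k + 1]) for k in range(0, len(elems) - 1, 2)]
        if len(elems) % 2:
            paired.append(elems[-1])
        elems = paired
    return elems[0]

# Toy two-state insertion block over five descendant positions.
A_II = [[math.log(0.3), math.log(0.1)], [math.log(0.2), math.log(0.4)]]
b_seq = [[math.log(p), math.log(q)] for p, q in
         [(0.05, 0.02), (0.01, 0.03), (0.04, 0.04), (0.02, 0.01), (0.03, 0.05)]]
e_seq = [math.log(v) for v in (0.25, 0.1, 0.4, 0.25, 0.3)]
x0 = [math.log(0.5), math.log(0.5)]

elems = [element(A_II, b, e) for b, e in zip(b_seq, e_seq)]
x_scan = apply_map(tree_reduce(elems), x0)

x_seq = x0
for el in elems:
    x_seq = apply_map(el, x_seq)      # sequential reference
```

The tree reduction and the sequential loop agree because `combine` is associative, which is the property that licenses the parallel scan.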

Supported model types. The beam search interface is generic over the pair HMM used on each branch:

1.: TKF92 — order-0, 5-state pair HMM (the standard model).
2.: MixDom — the full latent-state pair HMM (Section C.1.1) with $2 + 5NK$ states; latent-state correlations are marginalized in the forward pass without distillation.

Potentials from neighboring columns. The inter-column coupling enters through the order-1 WFST transitions. For each edge $e = (u,v)$ and MSA column $c$ where $e$ has an event of type $\tau _c$, let $c^-$ denote the predecessor column (the previous column where $e$ had an event) and $c^+$ the successor column. The potential at column $c$ for edge $e$ receives two contributions:

As-child term (from $c^-$). The transition from column $c^-$ to $c$ on edge $e$ depends on the characters at both columns. Using the pairwise marginal from $q_{c^-}$: \begin {align} \log \phi _e^{\mathrm {child}}(a_{u,c}, a_{v,c}) &= \sum _{a', b'} q_{c^-}^{(u,v)}(a', b')\; \log w_e(\tau _{c^-\!}, \tau _c, a', b', a_{u,c}, a_{v,c}) \label {eq:pot-child} \end {align}

where the sum over $(a', b')$ uses the joint pairwise marginal $q_{c^-}^{(u,v)}(a', b')$—not the product of independent marginals. This is the key advantage over mean-field: the within-tree parent-child correlation at the predecessor column is preserved exactly.

As-parent term (from $c^+$). Symmetrically, column $c$ acts as the predecessor for column $c^+$: \begin {align} \log \phi _e^{\mathrm {parent}}(a_{u,c}, a_{v,c}) &= \sum _{a'', b''} q_{c^+}^{(u,v)}(a'', b'')\; \log w_e(\tau _c, \tau _{c^+\!}, a_{u,c}, a_{v,c}, a'', b'') \label {eq:pot-parent} \end {align}

For insert transitions (only the descendant is present at $c$), the potential reduces to a per-node function $\psi _v(a_{v,c})$. For delete transitions (only the ancestor is present), it becomes $\psi _u(a_{u,c})$. For match transitions, it contributes a per-edge potential $\phi _e(a_{u,c}, a_{v,c})$. The start and end transitions contribute analogous per-node or per-edge terms.

Felsenstein coordinate ascent. Each coordinate ascent step updates $q_c$ for a single MSA column $c$, holding all other columns fixed. We accumulate, for each edge $e$ and node $v$ in the tree at column $c$:

Per-edge log-potentials $\log \phi _e(a_u, a_v) = \log \phi _e^{\mathrm {child}} + \log \phi _e^{\mathrm {parent}}$ (for match transitions where both endpoints are present).
Per-node log-potentials $\log \psi _v(a_v)$ (from insert/delete transitions on incident edges, plus the root prior $\log \pi (a)$ at the root node).

The optimal $q_c$, given the potentials, is the Gibbs distribution on the tree at column $c$: \begin {align} q_c(\mathbf {h}_c) &\propto \prod _v \psi _v(a_v) \prod _{e=(u,v)} \phi _e(a_u, a_v) \prod _{\ell \in \mathrm {leaves}} \delta (a_\ell = y_\ell ) \label {eq:q-gibbs} \end {align}

Since this is a tree-structured MRF, the normalizing constant and all node and edge marginals can be computed exactly by Felsenstein peeling (postorder) and unpeeling (preorder) in $O(|E| \cdot |\alphabet |^2)$ time.

Peeling (postorder). For each node $v$ in postorder, compute the conditional likelihood: \begin {align} \mathrm {CL}_v(a) &= \psi _v(a) \prod _{\mathrm {children}\; c} \Bigl [\sum _{a_c} \phi _{(v,c)}(a, a_c)\; \mathrm {CL}_c(a_c)\Bigr ] \label {eq:peel} \end {align}

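with $\mathrm {CL}_\ell (a) = \delta (a = y_\ell )$ for observed leaves. The log-partition function is $\log Z_c = \log \sum _a \mathrm {CL}_{\mathrm {root}}(a)$; the root prior $\pi (a)$ enters through $\psi _{\mathrm {root}}(a)$, as specified in the per-node potentials above, and so is not multiplied in a second time.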
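The peeling recursion can be verified on a small tree by comparing against brute-force enumeration. The sketch below works in the probability domain for readability and makes two assumptions for illustration: the root prior is folded into $\psi $ at the root, and observed leaves are encoded as one-hot $\psi $ vectors. The function names and data layout are hypothetical.

```python
from itertools import product


def peel(children, psi, phi, root):
    """Postorder peeling on a tree-structured MRF; returns (CL, Z)."""
    CL = {}

    def visit(v):
        cl = list(psi[v])
        for c in children.get(v, []):
            visit(c)
            for a in range(len(cl)):
                # Sum out the child's character against the edge potential.
                cl[a] *= sum(phi[(v, c)][a][ac] * CL[c][ac] for ac in range(len(CL[c])))
        CL[v] = cl

    visit(root)
    return CL, sum(CL[root])


def brute_force_Z(children, psi, phi, n_chars):
    """Reference partition function by enumerating every joint assignment."""
    nodes = sorted(psi)
    total = 0.0
    for assignment in product(range(n_chars), repeat=len(nodes)):
        a = dict(zip(nodes, assignment))
        w = 1.0
        for v in nodes:
            w *= psi[v][a[v]]
        for v, kids in children.items():
            for c in kids:
                w *= phi[(v, c)][a[v]][a[c]]
        total += w
    return total


# Toy tree: root 0 -> (1, 2), node 1 -> (3, 4); binary alphabet.
children = {0: [1, 2], 1: [3, 4]}
edge = [[0.9, 0.1], [0.2, 0.8]]
phi = {(0, 1): edge, (0, 2): edge, (1, 3): edge, (1, 4): edge}
psi = {
    0: [0.6, 0.4],   # root prior folded into the root potential (assumption)
    1: [1.0, 1.0],   # unobserved internal node
    2: [1.0, 0.0],   # leaf observed as character 0
    3: [0.0, 1.0],   # leaf observed as character 1
    4: [0.0, 1.0],   # leaf observed as character 1
}
CL, Z = peel(children, psi, phi, root=0)
Z_ref = brute_force_Z(children, psi, phi, n_chars=2)
```

Peeling visits each edge once and does a quadratic amount of work per edge, whereas the brute-force reference grows exponentially with the number of nodes and is only usable as a check on tiny trees.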

Unpeeling (preorder). Propagate top-down to obtain the posterior marginal at each node: \begin {align} q_c^{(v)}(a) &\propto \mathrm {CL}_v(a) \cdot \mathrm {msg}_{\mathrm {parent} \to v}(a) \label {eq:unpeel-marginal} \end {align}

and the pairwise marginal on each edge: \begin {align} q_c^{(u,v)}(a_u, a_v) &\propto \mathrm {msg}_{\mathrm {above}\,u}(a_u) \cdot \phi _{(u,v)}(a_u, a_v) \cdot \mathrm {CL}_v(a_v) \label {eq:unpeel-pair} \end {align}

where the “message from above $u$” combines the top-down message to $u$ with $u$’s conditional likelihood excluding child $v$.

Sweep. One iteration sweeps through all MSA columns $c = 1, \ldots , L$: for each column, recompute the potentials from the current neighbor marginals, run peeling/unpeeling, and store the updated node and edge marginals. The sweep order is left-to-right; the “as-child” potentials use the just-updated predecessor marginals, while the “as-parent” potentials use stale successor marginals from the previous iteration.

Properties. The product-of-trees approximation enjoys the same monotonic convergence guarantee as mean-field coordinate ascent (each column update minimizes the free energy in its coordinate), with the additional guarantee that the best attainable ELBO is at least as tight as the best fully-factored mean-field bound. This follows because the product-of-trees family contains the mean-field family as a special case (where each $q_c$ is itself fully factored).

C.2.3 Phylogenetic Hidden Markov Model (PhyloHMM)

If the top-level indel rates in MixDom are low, and the ancestral presence/absence is fully specified by the MSA, the phylogenetic likelihood calculation and ancestral reconstruction problems admit systematic approximation by a generalized Phylo-HMM, yielding $O(L^2)$-complexity versions of the Forward and Forward-Backward algorithms. This approach is described in Section C.9.

C.2.4 Phylogenetic composition

The order-1 HMM (Section C.4.5) and transducers (Section C.4.6) can be composed on a phylogenetic tree to yield a single composite machine whose state encodes the joint configuration of all branch machines.

Given a rooted binary tree with $n$ leaves:

Number nodes $0,\ldots ,2n-2$ in preorder (root $= 0$).
Place the order-1 Singlet HMM on a notional branch above the root (node 0).
Place an order-1 Pair Transducer on each real branch.
Each node $v$ carries a tag $\in \alphabet \cup \{\varepsilon \}$, initially $\varepsilon $.

Each branch machine is either the root HMM or a branch transducer. All machines are converted to waiting-machine form (they already are, by construction above).

Composition rules

State constraint. A node’s machine may advance (take a transition) only if all higher-numbered nodes’ machines are in waiting states.

Tagging. When an internal node $v$’s transducer takes a transition with output symbol $\destok \neq \varepsilon $, node $v$ is tagged with $\destok $. If $v$’s tag is non-$\varepsilon $, the next move must feed $v$’s tag as input to both child branch transducers (forcing them out of their waiting states). This clears $v$’s tag (but the children’s transitions may tag downstream nodes).

Priority. If multiple nodes are tagged simultaneously, the lowest-numbered (closest to root) tagged node is cleared first.

Cascading. This system of tags allows upward propagation from observed leaf emissions (MSA columns) via Felsenstein-style pruning: working from leaves (known emissions) upward, each internal node’s ancestral character is a latent variable marginalized by the DP.

Composite state space. A configuration of the composed machine is a tuple \[ \sigma = (q_0, q_1, \ldots , q_{2n-2},\; \tau _0, \tau _1, \ldots , \tau _{2n-2}) \] where $q_v$ is node $v$’s machine state and $\tau _v \in \alphabet \cup \{\varepsilon \}$ is its tag. The start configuration has all machines in $\sta $ and all tags $\varepsilon $. The end configuration has all machines in $\fin $ and all tags $\varepsilon $.

Practical caveat. While this composition defines a valid single machine whose language is the set of MSAs weighted by the full phylogenetic likelihood, the composite state space is $O(|Q|^{2n-1} \cdot |\alphabet |^{2n-1})$ where $|Q|$ is the number of states per branch machine; that is, it grows exponentially in the number of taxa. Explicitly constructing the composed machine is therefore impractical for all but the smallest trees. The beam algorithms that follow (Sections C.2.5–C.2.6) avoid this by enumerating only the configurations reachable within a pruned beam, so that the effective state space remains manageable. The composition formalism is nonetheless useful as a specification: it defines the target distribution from which the beam search samples or whose expected counts the Forward-Backward algorithm estimates.

C.2.5 Beam Backward algorithm (BeamMSA)

Given a multiple sequence alignment (MSA) with $L$ columns and the composite machine from Section C.2.4, we compute the alignment likelihood using a beam Backward algorithm, working from the end configuration backward.

Columns and the emission constraint. Each MSA column $\ell = 1,\ldots ,L$ specifies, for each leaf $v$, either a character $y_v^\ell \in \alphabet $ or a gap. A configuration $\sigma $ is compatible with column $\ell $ if the set of leaf emissions implied by $\sigma $’s tags matches the column.

Backward recurrence. Let $B(\sigma )$ denote the Backward variable: the total probability of generating MSA columns $\ell , \ell +1, \ldots , L$ and reaching the end state, given that the composite machine is currently in configuration $\sigma $ just before column $\ell $. \begin {align} B(\sigma _{\fin }) &= 1 && \mbox {(end configuration)} \label {eq:back-base}\\ B(\sigma ) &= \sum _{\sigma '} T(\sigma , \sigma ')\, B(\sigma ') && \mbox {(all other configurations)} \label {eq:back-rec} \end {align}

where $T(\sigma ,\sigma ')$ is the composite transition weight (product of individual machine transitions, subject to the composition rules above) and the sum is over all successor configurations $\sigma '$.

The alignment likelihood is $B(\sigma _{\sta })$.

Beam pruning. Maintain a beam $\mathcal {B}_\ell $ of at most $W$ configurations per column, ranked by $B(\sigma )$. When expanding $\mathcal {B}_\ell $ from $\mathcal {B}_{\ell +1}$, discard any $\sigma $ whose $B(\sigma )$ falls below $B_{\max } / \Delta $ where $B_{\max }$ is the current maximum and $\Delta $ is the beam width ratio. If the beam collapses (no configurations remain), backtrack.
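The two pruning criteria (a relative threshold $B_{\max }/\Delta $ and a cap of $W$ configurations) can be combined as in the sketch below. The function name and the dictionary representation of a beam are assumptions; backtracking after a beam collapse is not shown.

```python
def prune_beam(scores, max_size, ratio):
    """Prune a beam given Backward values.

    scores   : dict mapping configuration -> B(configuration), probability domain
    max_size : W, the maximum number of configurations kept
    ratio    : Delta, the beam width ratio; keep B >= B_max / Delta
    """
    if not scores:
        return {}
    b_max = max(scores.values())
    kept = {cfg: b for cfg, b in scores.items() if b >= b_max / ratio}
    # Rank survivors by Backward value and keep at most max_size of them.
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:max_size])


# Toy beam with four configurations (labels and values are illustrative).
toy_scores = {"a": 0.5, "b": 0.2, "c": 0.004, "d": 0.1}
pruned = prune_beam(toy_scores, max_size=2, ratio=100.0)
```

With these toy values the ratio test removes "c" and the size cap then removes "d". In log space the ratio test becomes a subtraction, which is the form more likely to be used in practice.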

Epsilon closures within a column. A column-emitting move is any transition whose output cascades down the tree to produce a new MSA column: specifically, an insertion (a transition that outputs a character without consuming input), which then tags the node and cascades to its descendants. Between column-emitting moves, machines may make silent transitions (tag propagation, waiting-state transitions, etc.). These form an $\varepsilon $-closure that must be computed at each step. For each configuration in the beam, enumerate all reachable configurations via silent transitions (respecting the priority ordering), accumulating weights multiplicatively along each path. In practice, null cycles (if any) can either be ignored (assuming the model’s null-state topology is acyclic) or handled by allowing a configurable number of extra exploratory steps in the beam search.

Forward traceback. After the Backward pass reaches $\sigma _{\sta }$, a stochastic Forward traceback samples a path from $\sigma _{\sta }$ to $\sigma _{\fin }$: \begin {align} P(\sigma ' | \sigma ) &= \frac {T(\sigma ,\sigma ')\, B(\sigma ')}{B(\sigma )} \label {eq:fwd-sample} \end {align}

At each step, sample the next configuration with probability given by (??). This yields a sampled alignment (including ancestral sequences at internal nodes).
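The Backward recurrence and the stochastic traceback can be illustrated on a toy acyclic graph. The configuration labels, the adjacency-list layout, and the function names below are hypothetical, and the composition rules and column constraints are abstracted into the fixed transition weights.

```python
import random


def backward(transitions, end):
    """Backward values on an acyclic graph: B(end) = 1, B(s) = sum_d T(s, d) * B(d)."""
    B = {end: 1.0}

    def value(s):
        if s not in B:
            B[s] = sum(w * value(d) for d, w in transitions.get(s, []))
        return B[s]

    for s in transitions:
        value(s)
    return B


def sample_path(transitions, B, start, end, rng):
    """Sample a start-to-end path with P(d | s) = T(s, d) * B(d) / B(s)."""
    path = [start]
    while path[-1] != end:
        s = path[-1]
        options = [(d, w * B[d]) for d, w in transitions[s] if B[d] > 0]
        dsts, weights = zip(*options)
        # random.choices normalises the weights, which supplies the 1 / B(s) factor.
        path.append(rng.choices(dsts, weights=weights)[0])
    return path


# Toy graph: S -> {A, B}, A -> {E, B}, B -> E.
toy_transitions = {
    "S": [("A", 0.6), ("B", 0.4)],
    "A": [("E", 0.5), ("B", 0.5)],
    "B": [("E", 0.1)],
}
B_vals = backward(toy_transitions, end="E")
path = sample_path(toy_transitions, B_vals, "S", "E", random.Random(0))
```

Because every sampled step is weighted by the Backward value of its destination, the sampler never enters a configuration from which the end cannot be reached.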

Beam Forward-Backward. Alternatively, after the Backward beam pass:

1.: Prune dead-end configurations from each $\mathcal {B}_\ell $ (those with no predecessor in $\mathcal {B}_{\ell -1}$).
2.: Run a Forward pass over the pruned beam: \begin {align} F(\sigma _{\sta }) &= 1 \label {eq:fwd-base}\\ F(\sigma ') &= \sum _{\sigma \in \mathcal {B}} T(\sigma ,\sigma ')\, F(\sigma ) \label {eq:fwd-rec} \end {align}
3.: Posterior marginals for any feature $\phi $: \begin {align} P(\phi | \mbox {MSA}) &= \frac {1}{B(\sigma _{\sta })} \sum _{\sigma \to \sigma '} F(\sigma )\, T(\sigma ,\sigma ')\, B(\sigma ')\, [\phi (\sigma ,\sigma ')] \label {eq:fb-posterior} \end {align}
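The three steps above can be sketched on a toy acyclic graph, taking the feature $\phi $ to be the indicator of a single transition. The sketch assumes the retained configurations are supplied in topological order and that there is at most one transition between any ordered pair; the names and layout are hypothetical.

```python
def forward_backward(order, transitions, start, end):
    """Forward and Backward values plus per-transition posteriors.

    order       : configurations in topological order (assumed given)
    transitions : dict src -> list of (dst, weight)
    """
    F = {s: 0.0 for s in order}
    F[start] = 1.0
    for s in order:
        for d, w in transitions.get(s, []):
            F[d] += F[s] * w

    B = {s: 0.0 for s in order}
    B[end] = 1.0
    for s in reversed(order):
        if s != end:
            B[s] = sum(w * B[d] for d, w in transitions.get(s, []))

    Z = B[start]
    # Posterior probability that the path uses transition s -> d.
    post = {(s, d): F[s] * w * B[d] / Z
            for s in order for d, w in transitions.get(s, [])}
    return F, B, post


toy_order = ["S", "A", "B", "E"]
toy_transitions = {
    "S": [("A", 0.6), ("B", 0.4)],
    "A": [("E", 0.5), ("B", 0.5)],
    "B": [("E", 0.1)],
}
F_vals, B_vals, post = forward_backward(toy_order, toy_transitions, "S", "E")
```

Two consistency checks follow from the definitions: the Forward value at the end equals the Backward value at the start, and the posteriors of the transitions leaving the start sum to one.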

Beam Viterbi. Replace $\sum $ with $\max $ in the Backward recurrence (??) and store argmax pointers: \begin {align} B^V(\sigma _{\fin }) &= 1 \label {eq:vit-base}\\ B^V(\sigma ) &= \max _{\sigma '} T(\sigma ,\sigma ')\, B^V(\sigma ') \label {eq:vit-rec} \end {align}

The optimal alignment is recovered by Forward traceback following the argmax pointers.

C.2.6 Progressive alignment via profile construction (ProgRec)

We now describe a progressive multiple sequence alignment algorithm using the order-1 machines from Sections C.4.5–C.4.6, with model parameters from (30). This follows the transducer-composition approach of (52), adapted here for Mealy machines (I/O on transitions rather than states).

The antecedent of this approach is the full multidimensional alignment algorithm of (18), which computes the Forward algorithm for TKF91 on a binary tree. This may be seen as unifying the tree-based Viterbi multiple alignment approach of (44) with the statistical phylogenetics of (16). The approach described here also maintains a partial order graph of intermediate alignments (33), which is essentially the approach used by (35).

Recognizers and Profiles Let the phylogenetic tree have $n$ leaves with observed sequences $\{y_v : v \in \mbox {leaves}\}$, nodes numbered in preorder. Let $R$ denote the order-1 Singlet HMM (root generator, Section C.4.5) and $B_v$ the order-1 Pair Transducer on the branch to node $v$ (Section C.4.6).

Recognizers. For each leaf $v$, the exact-match recognizer $\mathcal {R}(y_v)$ is a transducer with empty output alphabet that accepts only $y_v$: it has $|y_v|+1$ states (positions $0,\ldots ,|y_v|$), with a single input-consuming transition $i \to i+1$ labeled by $y_v[i+1]$ at each position. All states are waiting states except the start.
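The exact-match recognizer can be sketched as a linear chain. The dictionary representation below is a hypothetical encoding chosen for illustration, and it uses 0-based indexing, so the transition $i \to i+1$ consumes `y[i]` rather than $y_v[i+1]$.

```python
def exact_match_recognizer(y):
    """Linear-chain recognizer with len(y) + 1 states (positions 0..len(y))."""
    return {
        "states": list(range(len(y) + 1)),
        "start": 0,
        "end": len(y),
        # one input-consuming transition per position: state -> (symbol, next state)
        "transitions": {i: (y[i], i + 1) for i in range(len(y))},
    }

def accepts(rec, x):
    """True iff the recognizer consumes exactly x and stops in its end state."""
    state = rec["start"]
    for ch in x:
        step = rec["transitions"].get(state)
        if step is None or step[0] != ch:
            return False
        state = step[1]
    return state == rec["end"]

rec = exact_match_recognizer("ACD")
```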

Profiles. A profile $E_v$ at node $v$ is a recognizer (empty output alphabet) that accepts a set of plausible ancestral sequences at $v$, weighted by their approximate posterior probability given the descendants of $v$. For leaves, $E_v = \mathcal {R}(y_v)$.

Progressive reconstruction Working from the leaves toward the root, for each internal node $v$ with children $l,r$:

Step 1: Compose branch and profile. For each child $c \in \{l,r\}$, form the composition $B_c \circ E_c$, which is a transducer mapping the sequence at $v$ to the constrained sequences at child $c$. Since $E_c$ has empty output, $B_c \circ E_c$ is a recognizer (it reads a candidate parent sequence and recognizes it with weight proportional to the probability of generating the descendant data at $c$).

Step 2: Intersect siblings. Form the intersection \[ H_v = (B_l \circ E_l) \cap (B_r \circ E_r) \] This recognizer reads a candidate sequence at $v$ and scores it by the joint probability of both children’s descendant data, given that parent sequence. In the Mealy-machine intersection, the composite state is $(q_l, e_l, q_r, e_r)$ where $q_c \in B_c$ and $e_c \in E_c$. Both sides must agree on the same input symbol when both are ready (waiting); when one side is not waiting, it advances silently while the other stays put.

Step 3: Compose with root prior. Form the generator \[ M_v = R \circ H_v \] This has empty input and empty output: it is a weighted automaton over the empty string, whose total weight $Z = \sum _\pi w(\pi )$ over all paths $\pi $ is the marginal likelihood of the descendant data below $v$ (under the stationary prior $R$ at $v$).

Step 4: Sample paths and construct profile. Sample $K$ paths from $P(\pi | M_v) = w(\pi ) / Z$ using a Forward pass followed by stochastic traceback (??). For each sampled path $\pi $, extract the $H_v$-component states visited. The profile $E_v$ is the sub-recognizer of $H_v$ containing exactly those states visited by at least $\tau \geq 2$ of the $K$ sampled paths. This bounds $|E_v| = O(KL)$ where $L = \max _v |y_v|$.

Mealy-machine composition and intersection For completeness, we state the composition and intersection rules for Mealy machines in waiting-machine normal form (“ready” states = waiting states with input-consuming transitions only; “unready” states = non-waiting, silent transitions only).

Composition. Given $T = (\Omega _X, \Omega _Y, \ldots )$ and $U = (\Omega _Y, \Omega _Z, \ldots )$ in Mealy normal form, $T \circ U$ has states $\subseteq \mathcal {S}_T \times \mathcal {S}_U$ with transition weight: \[ w''((t,u), \omega _x, \omega _z, (t',u')) = \begin {cases} \delta _{tt'}\, \delta _{\omega _x\varepsilon }\, w'(u,\varepsilon ,\omega _z,u') & \text {if } u \text { unready} \\[4pt] \delta _{uu'}\, \delta _{\omega _z\varepsilon }\, w(t,\omega _x,\varepsilon ,t') + \displaystyle \sum _{\omega _y} w(t,\omega _x,\omega _y,t')\, w'(u,\omega _y,\omega _z,u') & \text {if } u \text { ready} \end {cases} \]

Intersection. Given $T = (\Omega _X, \Omega _T, \ldots )$ and $U = (\Omega _X, \Omega _U, \ldots )$ in Mealy normal form, $T \cap U$ has states $\subseteq \mathcal {S}_T \times \mathcal {S}_U$ with output alphabet $\Omega _T \times \Omega _U$ and transition weight: \[ w''((t,u), \omega _x, (\omega _y,\omega _z), (t',u')) = \begin {cases} \delta _{tt'}\, \delta _{\omega _x\varepsilon }\, \delta _{\omega _y\varepsilon }\, w'(u,\varepsilon ,\omega _z,u') & \text {if } u \text { unready} \\[4pt] \delta _{uu'}\, \delta _{\omega _x\varepsilon }\, \delta _{\omega _z\varepsilon }\, w(t,\varepsilon ,\omega _y,t') & \text {if } t \text { unready, } u \text { ready} \\[4pt] w(t,\omega _x,\omega _y,t')\, w'(u,\omega _x,\omega _z,u') & \text {if both ready} \end {cases} \]

Forward recursion for $M_v$ The generator $M_v = R \circ H_v$ has states $m = (\rho , q_l, e_l, q_r, e_r)$ where $\rho \in R$, $(q_l,e_l) \in B_l \circ E_l$, and $(q_r,e_r) \in B_r \circ E_r$. The Forward variable $Z(m)$ satisfies: \begin {align} Z(\sta _M) &= 1 \label {eq:profile-fwd-base}\\ Z(m') &= \sum _{m : (m,\varepsilon ,\varepsilon ,m') \in \mathcal {T}} w(m,\varepsilon ,\varepsilon ,m')\, Z(m) \label {eq:profile-fwd-rec} \end {align}

where $\mathcal {T}$ is the transition set of $M_v$. The total likelihood is $Z(\fin _M)$.

The fill order iterates over $e_l$ and $e_r$ in topological order (corresponding to positions in the child profiles), with an inner loop over $Q_v = R \circ (B_l \cap B_r)$ states (the “comparison kernel” Pair HMM). This has time complexity $O(|B|^2\, |E_l|\, |E_r|)$ per internal node, where $|B|$ is the branch transducer state count.

Profile extraction Given the Forward table $Z$, sample paths $\pi ^{(1)},\ldots ,\pi ^{(K)}$ from $M_v$ using the stochastic traceback (??). For each path $\pi ^{(k)}$, let $\mathcal {H}^{(k)} = \{(q_l,e_l,q_r,e_r) : (\rho ,q_l,e_l,q_r,e_r) \in \pi ^{(k)}\}$ be the $H_v$-states visited.

The profile $E_v$ is the sub-automaton of $H_v$ induced by the states \[ \mathcal {S}_{E_v} = \{ h \in H_v : |\{k : h \in \mathcal {H}^{(k)}\}| \geq \tau \} \] with the same transition weights as $H_v$, restricted to $\mathcal {S}_{E_v}$. Adding appropriate wait states places $E_v$ in Mealy normal form.
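The thresholding step can be sketched as below, under the assumption that each sampled path is a list of $M_v$-state tuples $(\rho , q_l, e_l, q_r, e_r)$ and that transitions are stored in a hypothetical dictionary keyed by (source, destination) pairs. Each path contributes at most one count per $H_v$-state, however many times it visits that state.

```python
from collections import Counter

def profile_states(sampled_paths, tau):
    """H_v-states visited by at least tau of the sampled paths."""
    counts = Counter()
    for path in sampled_paths:
        # drop the root-generator component rho to get the H_v-state
        visited = {m[1:] for m in path}
        counts.update(visited)
    return {h for h, c in counts.items() if c >= tau}

def restrict(transitions, keep):
    """Sub-automaton induced by `keep`, with unchanged transition weights."""
    return {(a, b): w for (a, b), w in transitions.items() if a in keep and b in keep}

paths = [
    [("r0", "q", "e0", "q", "e0"), ("r1", "q", "e1", "q", "e1")],
    [("r0", "q", "e0", "q", "e0"), ("r1", "q", "e1", "q", "e2")],
]
kept = profile_states(paths, tau=2)
```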

Bubble merging. Paths through $E_v$ that traverse the same sequence of wait states and differ only in latent-state assignments can be merged by collapsing bubbles (identifying states with identical incoming and outgoing wait-state connectivity). This further compresses the profile without changing the recognized language.

MSA extraction A sampled path through $M_1$ (the root) determines, at each position, which $H_1$-state is visited, and therefore which states of $E_l$, $E_r$ are aligned. Recursing into the child profiles yields a full column assignment for all leaves.

Specifically, each emitting transition in the sampled path implies:

a character at the current node (from the root generator or branch match),
advancement of the left profile, right profile, or both,
and therefore a column in the MSA (with gaps for non-advancing sides).

When bubble merging has been applied, the canonical path (chosen during merging) is used to resolve any ambiguity in the sub-alignment of the clade below the merged bubble.

Viterbi Progressive Reconstruction An alternative to the sampling-based profile construction of Step 4 above is to use Viterbi decoding at each internal node. Instead of sampling $K$ paths from $M_v$ and building a multi-path profile, the Viterbi variant computes a single maximum-likelihood path through $M_v$ and uses the resulting ancestral sequence directly as the reconstructed sequence at node $v$. This gives a deterministic progressive reconstruction that avoids the $O(KL)$ profile size but sacrifices the ability to represent uncertainty in the ancestral sequence.

C.3 Exploded MixDom Pair HMM

C.3.1 State Space

The exploded MixDom Pair HMM makes every structural decision explicit as a separate state transition. Let $\ndom $ denote the number of domain types, $\nfrag $ the number of fragment types per domain. Parameters are indexed by domain type $k$ and fragment type $f$.

The states are shown in Table C.1 (emitting states marked with $\star $).

Table C.1: Exploded MixDom Pair HMM state space.


Category	States	Count

Start/End	$\sta $, $\fin $	$2$
Domain-level (top-level TKF91 states)	$\matdom $, $\insdom $, $\deldom $, $\matdomend $, $\insdomend $, $\deldomend $	$6$
Domain type selection (one per domain type $k$)	$\matdomtype {k}$, $\insdomtype {k}$, $\deldomtype {k}$	$3\ndom $
Fragment-level (inner TKF states within $\matdomtype {k}$)	$\mkfrag {k}$, $\mkifrag {k}$, $\mkdfrag {k}$	$3\ndom $
Fragment-level (single looping state within $\insdomtype {k}$, $\deldomtype {k}$)	$\ikfrag {k}$, $\dkfrag {k}$	$2\ndom $
Fragment type selection (one per fragment type $f$)	$\mkfragtype {k}{f}$, $\mkifragtype {k}{f}$, $\mkdfragtype {k}{f}$, $\ikfragtype {k}{f}$, $\dkfragtype {k}{f}$	$5\ndom \nfrag $
Emit states (the only emitting states)	$\mkfragemit {k}{f}$, $\mkifragemit {k}{f}$, $\mkdfragemit {k}{f}$, $\ikfragemit {k}{f}$, $\dkfragemit {k}{f}$	$5\ndom \nfrag $
Fragment end (fragment termination)	$\mkfragend {k}{f}$, $\mkifragend {k}{f}$, $\mkdfragend {k}{f}$, $\ikfragend {k}{f}$, $\dkfragend {k}{f}$	$5\ndom \nfrag $

Total	$8 + 8\ndom + 15\ndom \nfrag $
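The row counts in Table C.1 can be checked mechanically. The sketch below tallies each category (the category names are illustrative labels, not notation from the model) and compares the sum with the stated total, alongside the $5\ndom \nfrag + 2$ states of the collapsed model.

```python
def exploded_state_count(n_dom, n_frag):
    """Per-category state counts of the exploded Pair HMM, and their sum."""
    counts = {
        "start_end": 2,
        "domain_level": 6,
        "domain_type": 3 * n_dom,
        "fragment_level_M": 3 * n_dom,
        "fragment_level_ID": 2 * n_dom,
        "fragment_type": 5 * n_dom * n_frag,
        "emit": 5 * n_dom * n_frag,
        "fragment_end": 5 * n_dom * n_frag,
    }
    return counts, sum(counts.values())

def collapsed_state_count(n_dom, n_frag):
    """Start, end and the emit states only."""
    return 5 * n_dom * n_frag + 2

counts, total = exploded_state_count(n_dom=2, n_frag=3)
```

For example, with two domain types and three fragment types the exploded model has $8 + 16 + 90 = 114$ states, of which $32$ survive in the collapsed model.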

The emitting states correspond to the compound states of the collapsed model: $\mkfragemit {k}{f} = \matmat _{kf}$, $\mkifragemit {k}{f} = \matins _{kf}$, $\mkdfragemit {k}{f} = \matdel _{kf}$, $\ikfragemit {k}{f} = \insins _{kf}$, $\dkfragemit {k}{f} = \deldel _{kf}$.

C.3.2 Transition Weights

All transitions are between non-emitting states, or from non-emitting to emitting, or from emitting to non-emitting (Mealy machine: emissions occur on the transitions into emit states).

We use BDI parameters for two TKF91 processes:

Top-level (domain sequence): $\alpha _0, \beta _0, \gamma _0, \kappa _0$ from $(\insrate _0, \delrate _0, \evoltime )$
Per-domain $k$ (fragment sequence): $\alpha _k, \beta _k, \gamma _k, \kappa _k$ from $(\insrate _k, \delrate _k, \evoltime )$

Top-level transitions These implement the TKF91 Pair HMM structure with $\matdom /\insdom /\deldom /\fin $ (for incoming connections) and $\sta /\matdomend /\insdomend /\deldomend $ (for outgoing connections) playing the roles of $\sta /\mat /\ins /\del /\fin $: \begin {align} \sta &\to \matdom : \quad \tkftrans _{\sta \mat }(\insrate _0,\delrate _0,\evoltime ) \\ \sta &\to \insdom : \quad \tkftrans _{\sta \ins }(\insrate _0,\delrate _0,\evoltime ) \\ \sta &\to \deldom : \quad \tkftrans _{\sta \del }(\insrate _0,\delrate _0,\evoltime ) \\ \sta &\to \fin : \quad \tkftrans _{\sta \fin }(\insrate _0,\delrate _0,\evoltime ) \\ \matdomend &\to \matdom : \quad \tkftrans _{\mat \mat }(\insrate _0,\delrate _0,\evoltime ) \\ \matdomend &\to \insdom : \quad \tkftrans _{\mat \ins }(\insrate _0,\delrate _0,\evoltime ) \\ \matdomend &\to \deldom : \quad \tkftrans _{\mat \del }(\insrate _0,\delrate _0,\evoltime ) \\ \matdomend &\to \fin : \quad \tkftrans _{\mat \fin }(\insrate _0,\delrate _0,\evoltime ) \\ \insdomend &\to \matdom : \quad \tkftrans _{\ins \mat }(\insrate _0,\delrate _0,\evoltime ) \\ \insdomend &\to \insdom : \quad \tkftrans _{\ins \ins }(\insrate _0,\delrate _0,\evoltime ) \\ \insdomend &\to \deldom : \quad \tkftrans _{\ins \del }(\insrate _0,\delrate _0,\evoltime ) \\ \insdomend &\to \fin : \quad \tkftrans _{\ins \fin }(\insrate _0,\delrate _0,\evoltime ) \\ \deldomend &\to \matdom : \quad \tkftrans _{\del \mat }(\insrate _0,\delrate _0,\evoltime ) \\ \deldomend &\to \insdom : \quad \tkftrans _{\del \ins }(\insrate _0,\delrate _0,\evoltime ) \\ \deldomend &\to \deldom : \quad \tkftrans _{\del \del }(\insrate _0,\delrate _0,\evoltime ) \\ \deldomend &\to \fin : \quad \tkftrans _{\del \fin }(\insrate _0,\delrate _0,\evoltime ) \end {align}
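For concreteness, the sixteen top-level weights can be tabulated as in the following sketch. It assumes the standard TKF91 expressions for $\alpha $, $\beta $, $\gamma $, $\kappa $ in terms of the insertion rate, deletion rate and time (with $0 < \lambda < \mu $), which this section references but does not restate; the function names and the row labels are illustrative.

```python
import math

def tkf91_params(lam, mu, t):
    """Standard TKF91 (alpha, beta, gamma, kappa); assumes 0 < lam < mu and t > 0."""
    alpha = math.exp(-mu * t)
    e = math.exp((lam - mu) * t)
    beta = lam * (1.0 - e) / (mu - lam * e)
    gamma = 1.0 - mu * beta / (lam * (1.0 - alpha))
    kappa = lam / mu
    return alpha, beta, gamma, kappa

def tkf91_transitions(lam, mu, t):
    """Pair HMM transition rows for S, M, I, D over destinations M, I, D, E."""
    a, b, g, k = tkf91_params(lam, mu, t)

    def row(p_ins):
        rest = 1.0 - p_ins
        return {
            "M": rest * k * a,
            "I": p_ins,
            "D": rest * k * (1.0 - a),
            "E": rest * (1.0 - k),
        }

    # S, M and I rows share beta; only the D row uses gamma.
    return {"S": row(b), "M": row(b), "I": row(b), "D": row(g)}

tau = tkf91_transitions(lam=0.9, mu=1.0, t=1.0)
```

In the exploded model the rows labelled `S`, `M`, `I`, `D` would be attached to $\sta $, $\matdomend $, $\insdomend $, $\deldomend $, and the columns to $\matdom $, $\insdom $, $\deldom $, $\fin $.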

Domain type selection \begin {align} \matdom &\to \matdomtype {k}: \quad v_k \quad \text {(domain weight)} \\ \insdom &\to \insdomtype {k}: \quad v_k \\ \deldom &\to \deldomtype {k}: \quad v_k \end {align}

Domain-to-fragment entry (M-type domains) Within $\matdomtype {k}$, the fragment-level TKF91 begins. Again, this follows the TKF91 Pair HMM structure, now with $\mkfrag {k}/\mkifrag {k}/\mkdfrag {k}/\matdomend $ (for incoming connections) and $\matdomtype {k}/\mkfragend {k}/\mkifragend {k}/\mkdfragend {k}$ states (for outgoing connections) playing the roles of $\sta /\mat /\ins /\del /\fin $: \begin {align} \matdomtype {k} &\to \mkfrag {k}: \quad \tkftrans _{\sta \mat }(\insrate _k,\delrate _k,\evoltime ) \\ \matdomtype {k} &\to \mkifrag {k}: \quad \tkftrans _{\sta \ins }(\insrate _k,\delrate _k,\evoltime ) \\ \matdomtype {k} &\to \mkdfrag {k}: \quad \tkftrans _{\sta \del }(\insrate _k,\delrate _k,\evoltime ) \\ \matdomtype {k} &\to \matdomend : \quad \tkftrans _{\sta \fin }(\insrate _k,\delrate _k,\evoltime ) \\ \mkfragend {k} &\to \mkfrag {k}: \quad \tkftrans _{\mat \mat }(\insrate _k,\delrate _k,\evoltime ) \\ \mkfragend {k} &\to \mkifrag {k}: \quad \tkftrans _{\mat \ins }(\insrate _k,\delrate _k,\evoltime ) \\ \mkfragend {k} &\to \mkdfrag {k}: \quad \tkftrans _{\mat \del }(\insrate _k,\delrate _k,\evoltime ) \\ \mkfragend {k} &\to \matdomend : \quad \tkftrans _{\mat \fin }(\insrate _k,\delrate _k,\evoltime ) \\ \mkifragend {k} &\to \mkfrag {k}: \quad \tkftrans _{\ins \mat }(\insrate _k,\delrate _k,\evoltime ) \\ \mkifragend {k} &\to \mkifrag {k}: \quad \tkftrans _{\ins \ins }(\insrate _k,\delrate _k,\evoltime ) \\ \mkifragend {k} &\to \mkdfrag {k}: \quad \tkftrans _{\ins \del }(\insrate _k,\delrate _k,\evoltime ) \\ \mkifragend {k} &\to \matdomend : \quad \tkftrans _{\ins \fin }(\insrate _k,\delrate _k,\evoltime ) \\ \mkdfragend {k} &\to \mkfrag {k}: \quad \tkftrans _{\del \mat }(\insrate _k,\delrate _k,\evoltime ) \\ \mkdfragend {k} &\to \mkifrag {k}: \quad \tkftrans _{\del \ins }(\insrate _k,\delrate _k,\evoltime ) \\ \mkdfragend {k} &\to \mkdfrag {k}: \quad \tkftrans _{\del \del }(\insrate _k,\delrate _k,\evoltime ) \\ \mkdfragend {k} &\to \matdomend : \quad \tkftrans _{\del \fin }(\insrate _k,\delrate _k,\evoltime ) \end {align}

The $\matdomtype {k} \to \matdomend $ transition is the “phantom” null path: the domain is entered but the inner model immediately terminates with no fragments emitted. This leads to null cycles that must be eliminated by Schur complement.

Domain-to-fragment entry (I/D-type domains) $\insdomtype {k}$ and $\deldomtype {k}$ have a single looping fragment state: \begin {align} \insdomtype {k} &\to \ikfrag {k}: \quad \kappa _k \\ \insdomtype {k} &\to \insdomend : \quad 1 - \kappa _k \\ \deldomtype {k} &\to \dkfrag {k}: \quad \kappa _k \\ \deldomtype {k} &\to \deldomend : \quad 1 - \kappa _k \end {align}

Again, the $\to \insdomend / \deldomend $ transitions are null (empty domain).

Fragment type selection \begin {align} \mkfrag {k} &\to \mkfragtype {k}{f}: \quad w_{kf} \quad \text {(fragment weight)} \\ \mkifrag {k} &\to \mkifragtype {k}{f}: \quad w_{kf} \\ \mkdfrag {k} &\to \mkdfragtype {k}{f}: \quad w_{kf} \\ \ikfrag {k} &\to \ikfragtype {k}{f}: \quad w_{kf} \\ \dkfrag {k} &\to \dkfragtype {k}{f}: \quad w_{kf} \end {align}

Fragment emission These are the only transitions with emissions (the exploded HMM is represented as a Mealy machine, so emissions occur on transitions; the collapsed HMM treats emissions as state-based). The emission probability is summed over site classes $\class \in \{1,\ldots ,\nclasses \}$, weighted by the per-fragment class distribution $\classdist _{k\frag \class }$: \begin {align} \mkfragtype {k}{f} &\xrightarrow {(\anctok ,\destok )} \mkfragemit {k}{f}: \quad \sum _{\class =1}^{\nclasses } \classdist _{k\frag \class }\, \eqm ^{(\class )}_\anctok \exp (\revsub ^{(\class )} \evoltime )_{\anctok \destok } \quad \text {(align $(\anctok ,\destok )$)} \\ \mkifragtype {k}{f} &\xrightarrow {(\epsilon ,\destok )} \mkifragemit {k}{f}: \quad \sum _{\class =1}^{\nclasses } \classdist _{k\frag \class }\, \eqm ^{(\class )}_\destok \quad \text {(insert $\destok $)} \\ \mkdfragtype {k}{f} &\xrightarrow {(\anctok ,\epsilon )} \mkdfragemit {k}{f}: \quad \sum _{\class =1}^{\nclasses } \classdist _{k\frag \class }\, \eqm ^{(\class )}_\anctok \quad \text {(delete $\anctok $)} \\ \ikfragtype {k}{f} &\xrightarrow {(\epsilon ,\destok )} \ikfragemit {k}{f}: \quad \sum _{\class =1}^{\nclasses } \classdist _{k\frag \class }\, \eqm ^{(\class )}_\destok \quad \text {(insert $\destok $)} \\ \dkfragtype {k}{f} &\xrightarrow {(\anctok ,\epsilon )} \dkfragemit {k}{f}: \quad \sum _{\class =1}^{\nclasses } \classdist _{k\frag \class }\, \eqm ^{(\class )}_\anctok \quad \text {(delete $\anctok $)} \end {align}

Intra-fragment fragment-type transition vs. fragment termination Within each fragment of domain $k$ the fragment-type process is a Markov chain on $\nfrag +2$ states (start, end, and $\nfrag $ fragment-type states): from the current emit state with fragment-type $f$, the chain either advances within the fragment to fragment-type $g$ with probability $\ext ^{(k)}_{fg}$, or terminates the fragment (transition to the end state) with probability $\notext ^{(k)}_f = 1 - \sum _g \ext ^{(k)}_{fg}$. Different fragments are statistically independent realisations of this chain. The transition from each emit state goes to any fragment-type’s type-selection state within the current fragment (not just the same type), or to the fragment end: \begin {align} \mkfragemit {k}{f} &\to \mkfragtype {k}{g}: \quad \ext ^{(k)}_{fg} \quad \text {(intra-fragment type transition $f \to g$)} \\ \mkfragemit {k}{f} &\to \mkfragend {k}{f}: \quad \notext ^{(k)}_f \quad \text {(fragment termination)} \\ \mkifragemit {k}{f} &\to \mkifragtype {k}{g}: \quad \ext ^{(k)}_{fg} \\ \mkifragemit {k}{f} &\to \mkifragend {k}{f}: \quad \notext ^{(k)}_f \\ \mkdfragemit {k}{f} &\to \mkdfragtype {k}{g}: \quad \ext ^{(k)}_{fg} \\ \mkdfragemit {k}{f} &\to \mkdfragend {k}{f}: \quad \notext ^{(k)}_f \\ \ikfragemit {k}{f} &\to \ikfragtype {k}{g}: \quad \ext ^{(k)}_{fg} \\ \ikfragemit {k}{f} &\to \ikfragend {k}{f}: \quad \notext ^{(k)}_f \\ \dkfragemit {k}{f} &\to \dkfragtype {k}{g}: \quad \ext ^{(k)}_{fg} \\ \dkfragemit {k}{f} &\to \dkfragend {k}{f}: \quad \notext ^{(k)}_f \end {align}

where $g$ ranges over all $\nfrag $ fragment types. For $\nfrag = 1$, this reduces to a scalar self-extension with $\ext ^{(k)}_{11} = \ext _{k1}$ and $\notext ^{(k)}_1 = 1 - \ext _{k1}$.
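The following sketch computes the termination probabilities from an extension matrix and, as an added illustration, the expected fragment length by starting type. The expected length $L_f$ satisfies $L_f = 1 + \sum _g \ext ^{(k)}_{fg} L_g$ (a standard absorbing-chain identity), solved here by fixed-point iteration; the function names and the nested-list encoding are assumptions.

```python
def termination_probs(ext):
    """notext[f] = 1 - sum_g ext[f][g] for one domain's extension matrix."""
    notext = [1.0 - sum(row) for row in ext]
    if any(p < 0.0 for p in notext):
        raise ValueError("each row of ext must sum to at most 1")
    return notext

def expected_length(ext, iters=10000, tol=1e-12):
    """Expected number of positions in a fragment, indexed by its first type.

    Iterates L <- 1 + ext L, which converges when fragments terminate
    with probability one.
    """
    n = len(ext)
    L = [1.0] * n
    for _ in range(iters):
        new = [1.0 + sum(ext[f][g] * L[g] for g in range(n)) for f in range(n)]
        if max(abs(a - b) for a, b in zip(new, L)) < tol:
            return new
        L = new
    return L

# Scalar case (one fragment type): geometric length with mean 1 / (1 - ext).
scalar_notext = termination_probs([[0.75]])
scalar_length = expected_length([[0.75]])
```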

C.3.3 Null State Classification

Every state except $\sta $, $\fin $, and the five families of emit states $\mkfragemit {k}{f}$, $\mkifragemit {k}{f}$, $\mkdfragemit {k}{f}$, $\ikfragemit {k}{f}$, $\dkfragemit {k}{f}$ is a null state (non-emitting). The collapsed model retains only $\sta $, $\fin $, and the $5\ndom \nfrag $ emit states.

C.3.4 Null Elimination

The null states are eliminated by the standard HMM null closure: \[ \chi _{\text {emit},\text {emit}} = T_{\text {emit},\text {emit}} + T_{\text {emit},\text {null}} (I - T_{\text {null},\text {null}})^{-1} T_{\text {null},\text {emit}} \] Since there are no direct emit$\to $emit transitions in the exploded model (every path between emit states passes through at least one null state), $T_{\text {emit},\text {emit}} = 0$ and: \[ \chi = T_{\text {emit},\text {null}} (I - T_{\text {null},\text {null}})^{-1} T_{\text {null},\text {emit}} \]

This gives the collapsed $\chi $ matrix with states $\{\sta , \fin , \matmat _{kf}, \matins _{kf}, \matdel _{kf}, \insins _{kf}, \deldel _{kf}\}$, matching the collapsed MixDom Pair HMM in Section C.1.1.
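A small numerical sketch of the closure formula, using plain nested lists and a hand-written Gauss-Jordan solver in place of a linear-algebra library. The toy model (two emit states, two null states forming a cycle) is hypothetical and omits the $\sta $ and $\fin $ rows.

```python
def solve(A, B):
    """Solve A X = B for square A by Gauss-Jordan elimination with pivoting."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [x / p for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [row[n:] for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def null_closure(T_en, T_nn, T_ne):
    """chi = T_en (I - T_nn)^{-1} T_ne, assuming no direct emit-to-emit moves."""
    n = len(T_nn)
    I_minus_N = [[(1.0 if i == j else 0.0) - T_nn[i][j] for j in range(n)]
                 for i in range(n)]
    return matmul(T_en, solve(I_minus_N, T_ne))

T_en = [[1.0, 0.0], [0.0, 1.0]]   # each emit state enters its own null state
T_nn = [[0.0, 0.5], [0.2, 0.0]]   # a null cycle between the two null states
T_ne = [[0.5, 0.0], [0.0, 0.8]]   # null states exit to emit states
chi = null_closure(T_en, T_nn, T_ne)
```

Because every null state in this toy example eventually reaches an emit state, the rows of `chi` sum to one even though the null states form a cycle.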

Each state in the ($5\ndom \nfrag +2$)-state collapsed HMM corresponds to an uneliminated state in the ($15\ndom \nfrag +8\ndom +8$)-state exploded HMM: either $\sta $ ($\sta \sta $), $\fin $ ($\fin \fin $), or one of the emit states $\mkfragemit {\srcdom }{\srcfrag }$ ($\mat \mat _{\srcdom \srcfrag }$), $\mkifragemit {\srcdom }{\srcfrag }$ ($\mat \ins _{\srcdom \srcfrag }$), $\mkdfragemit {\srcdom }{\srcfrag }$ ($\mat \del _{\srcdom \srcfrag }$), $\ikfragemit {\srcdom }{\srcfrag }$ ($\ins \ins _{\srcdom \srcfrag }$), $\dkfragemit {\srcdom }{\srcfrag }$ ($\del \del _{\srcdom \srcfrag }$). The transition weight from $\ustate \xstate _{\srcdom \srcfrag }$ to $\vstate \ystate _{\destdom \destfrag }$ in the collapsed Pair HMM has the form $\domexit (\ustate ,\xstate ,\srcdom ,\srcfrag ) \times \nonemptytrans _{\ustate \vstate }(\ustate ,\vstate ) \times \domenter (\vstate ,\ystate ,\destdom ,\destfrag ) + \delta _{\ustate \vstate } \samedom (p_\text {SameDom}(\ustate ,\xstate ,\srcdom ,\srcfrag ,\destfrag ) + \delta _{\xstate \ystate } p_\text {SameFrag}(\srcdom ,\srcfrag ,\destfrag ))$, corresponding to the following path segments:

$\domexit (\ustate ,\xstate ,\srcdom ,\srcfrag )$ represents transitions from the emit state to the end state of the domain, e.g. $\tkftrans _{\mat \fin }(\mat ,\ins ,\srcdom ,\srcfrag )$ represents $\mkifragemit {\srcdom }{\srcfrag } \to \mkifragend {\srcdom }{\srcfrag } \to \matdomend $;
$\nonemptytrans _{\ustate \vstate }(\ustate ,\vstate )$ represents the sum over all paths from the domain end state (or $\sta $), through zero or more empty domains, to the next (nonempty) domain start (or $\fin $), e.g. $\nonemptytrans _{\mat \del }(\mat ,\del )$ represents paths like $\matdomend \to (\ldots \insdom \to \insdomtype {\ndom '} \to \insdomend \ldots )^\ast \to \deldom $. This is where null cycle elimination happens;
$\domenter (\vstate ,\ystate ,\destdom ,\destfrag )$ represents paths from the domain start state to an emit state inside the domain, e.g. $\tkftrans _{\sta \mat }(\mat ,\ins ,\destdom ,\destfrag )$ represents $\matdom \to \matdomtype {\destdom } \to \mkifrag {\destdom } \to \mkifragtype {\destdom }{\destfrag }$;
$p_\text {SameDom}(\xstate ,\ystate ,\srcdom ,\srcfrag ,\destfrag )$ represents paths from the emit state to another emit state of similar profile but (potentially) different fragment type within the same domain, e.g. $p_\text {SameDom}(\mat ,\ins ,\srcdom ,\srcfrag ,\destfrag )$ represents $\mkifragemit {\srcdom }{\srcfrag } \to \matdomend \to \matdom \to \matdomtype {\srcdom } \to \mkifrag {\srcdom } \to \mkifragtype {\srcdom }{\destfrag }$;
$p_\text {SameFrag}(\srcdom ,\srcfrag ,\destfrag ) = \ext ^{(\srcdom )}_{\srcfrag \destfrag }$ represents intra-fragment Markov transitions between fragment-types (extending the current fragment by one position), e.g. $\mkfragemit {\srcdom }{\srcfrag } \to \mkfragtype {\srcdom }{\destfrag }$ with weight $\ext ^{(\srcdom )}_{\srcfrag \destfrag }$. Different fragments are independent; the Markov structure is strictly within a single fragment, allowing the fragment-type to change at each position (not just self-loop).

C.3.5 Exact Count Restoration

Given Forward-Backward expected transition counts $\hat {n}_\chi (i,j)$ on the collapsed model, we recover the expected counts on the exploded model using the null closure inverse.

Define: \begin {align} C &= (I - T_{\text {null},\text {null}})^{-1} \quad \text {(null closure)} \\ C_{ab} &= \text {expected visits to null state $b$, starting from null state $a$} \end {align}

For each collapsed transition $\hat {n}_\chi (s, s')$ from emit state $s$ to emit state $s'$, the path in the exploded model is: \[ s \xrightarrow {1} \text {FragEnd}(s) \xrightarrow {\text {null chain}} \text {FragType}(s') \xrightarrow {1} s' \]

The null chain from $\text {FragEnd}(s)$ to $\text {FragType}(s')$ passes through a sequence of null states. The expected count for each null transition $(a \to b)$ along this chain is: \begin {equation} \hat {n}_{\text {exploded}}(a, b) = \sum _{s, s'} \hat {n}_\chi (s, s') \cdot \frac {T_{\text {emit},a'} \cdot C_{a',a} \cdot T_{a,b} \cdot C_{b,b'} \cdot T_{b',\text {emit}'}} {\chi (s, s')} \label {eq:count-restoration-general} \end {equation} where $a'$ is the first null state entered from $s$, and $b'$ is the last null state before reaching $s'$.

More explicitly, each collapsed transition decomposes into contributions to the following parameter groups.

Intra-fragment type transitions vs. new-fragment transitions For a transition $\hat {n}_\chi (s, s')$ where $s = \text {Emit}_{kf}^X$ and $s' = \text {Emit}_{kg}^Y$ with the same domain $k$, the path splits into:

Intra-fragment fragment-type transition: $s \to \text {FragType}_{kg} \to s'$, with weight $\ext ^{(k)}_{fg}$. This applies to all transitions where $X = Y$ (same TKF state type) and allows $f \neq g$. It extends the current fragment by one position without generating a new TKF92 link.
New fragment via domain loop: $s \to \text {FragEnd}_{kf} \to \ldots \to \text {Frag}_k \to \text {FragType}_{kg} \to s'$, with weight $\notext ^{(k)}_f \cdot \tkftrans _k[X,Y] \cdot w_{kg}$ (for M-type) or $\notext ^{(k)}_f \cdot \kappa _k \cdot w_{kg}$ (for I/D-type). This terminates the current fragment and initiates an independent fresh fragment via the TKF92 process.

The expected intra-fragment fragment-type transition count from $f$ to $g$ is: \begin {equation} \hat {n}_{\text {ext}}(k,f,g) = \hat {n}_\chi (s, s') \cdot \frac {\ext ^{(k)}_{fg}}{\ext ^{(k)}_{fg} + \notext ^{(k)}_f \cdot p_{\text {new},g}} \end {equation} where $p_{\text {new},g}$ is the new-fragment-to-type-$g$ probability. These counts form an $\nfrag \times \nfrag $ matrix per domain, and the M-step row-normalizes $(\hat {n}_{\text {ext}}(k,f,g), \hat {n}_{\notext }(k,f))$ to obtain the updated $\ext ^{(k)}_{fg}$.
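The split and the subsequent row normalisation can be sketched as follows. The function names are hypothetical, and `p_new_g` stands for $p_{\text {new},g}$, which is taken as given.

```python
def split_same_domain_count(n_chi, ext_fg, notext_f, p_new_g):
    """Apportion one collapsed count between the two competing paths.

    Returns (intra-fragment share, new-fragment share); the shares sum to n_chi.
    """
    denom = ext_fg + notext_f * p_new_g
    n_ext = n_chi * ext_fg / denom
    n_new = n_chi * notext_f * p_new_g / denom
    return n_ext, n_new

def update_ext_row(n_ext_row, n_notext):
    """M-step for one row f: normalise (n_ext(k,f,.), n_notext(k,f)) jointly."""
    total = sum(n_ext_row) + n_notext
    return [c / total for c in n_ext_row], n_notext / total

n_ext, n_new = split_same_domain_count(n_chi=10.0, ext_fg=0.6, notext_f=0.4, p_new_g=0.5)
new_row, new_notext = update_ext_row([3.0, 1.0], 4.0)
```

Normalising the extension counts together with the termination count keeps each updated row consistent with $\notext ^{(k)}_f = 1 - \sum _g \ext ^{(k)}_{fg}$.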

Intra-domain TKF transitions Each new-fragment transition (after fragment termination) contributes one TKF transition at the domain level: $\tkftrans _k[X, Y]$ for M-type domains, or $\kappa _k$ / $(1-\kappa _k)$ for I/D-type domains.

These counts go into the domain-$k$ TKF91 count matrix (for M-type) or the $\kappa _k$ / $(1-\kappa _k)$ accumulators (for I/D-type).

Domain entry/exit and phantom counts Inter-domain transitions pass through $\matdomend / \insdomend / \deldomend $ (exit from source domain) and $\matdom / \insdom / \deldom $ then $\matdomtype {k}$ (entry to destination domain).

Within the entry, the path $\matdomtype {k} \to \mkfrag {k}$ uses $\tkftrans _k[\sta , \cdot ]$. The phantom path $\matdomtype {k} \to \matdomend $ has probability $\tkftrans _k[\sta , \fin ] = (1-\beta _k)(1-\kappa _k)$ and contributes a phantom birth-death event to domain $k$’s BDI statistics.

Similarly for I/D-type entries: $\insdomtype {k} \to \insdomend $ with probability $(1-\kappa _k)$ is a phantom I/D-type domain.

Top-level TKF transitions Each inter-domain transition contributes one TKF transition at the top level: $\tkftrans _0[U, V]$ where $U \in \{\sta , \mat , \ins , \del \}$ is the domain-end type and $V$ is the domain-start type.

The null domain paths (empty domains via $\matdomtype {k} \to \matdomend $) contribute additional phantom top-level transitions via the null closure $(I - T_{\text {null},\text {null}})^{-1}$.

Domain and fragment weight counts Each domain entry contributes one count to $v_k$ (domain weight). Each fragment-type selection contributes one count to $w_{kf}$ (fragment weight).

C.3.6 Parameter Group Decomposition

Each transition weight in the exploded model is a product of one or more of the following parameter factors:


Parameter	Factor	Where it appears

$\alpha _0$	$(1-\beta _0)\kappa _0\alpha _0$	Top-level $\to \matdom $
$1-\alpha _0$	$(1-\beta _0)\kappa _0(1-\alpha _0)$	Top-level $\to \deldom $
$\beta _0$	$\beta _0$	Top-level $\to \insdom $
$1-\beta _0$	$(1-\beta _0)$	Top-level $\to \matdom , \deldom , \fin $
$\gamma _0$	$\gamma _0$	$\deldomend \to \insdom $
$1-\gamma _0$	$(1-\gamma _0)$	$\deldomend \to \matdom , \deldom , \fin $
$\kappa _0$	$\kappa _0$	Top-level $\to \matdom , \deldom $
$1-\kappa _0$	$(1-\kappa _0)$	Top-level $\to \fin $
$\alpha _k$	$(1-\beta _k)\kappa _k\alpha _k$	Domain-$k$ $\to $ MatFrag
$\beta _k$	$\beta _k$	Domain-$k$ $\to $ InsFrag
$\gamma _k$	$\gamma _k$	Domain-$k$ DelFragEnd $\to $ InsFrag
$\kappa _k$	$\kappa _k$	Domain-$k$ $\to $ MatFrag/DelFrag, I/D-type continuation
$1-\kappa _k$	$(1-\kappa _k)$	Domain-$k$ $\to $ DomEnd
$\ext ^{(k)}_{fg}$	$\ext ^{(k)}_{fg}$	Intra-fragment fragment-type transition $f \to g$
$\notext ^{(k)}_f$	$1 - \sum _g \ext ^{(k)}_{fg}$	Fragment termination
$v_k$	$v_k$	Domain type selection
$w_{kf}$	$w_{kf}$	Fragment type selection

Because each exploded transition involves a product of these factors, and each factor’s log depends on at most one natural parameter ($\insrate _k$ or $\delrate _k$), the Q-function on the exploded model decomposes into independent BDI score terms. The null-state count restoration maps collapsed counts exactly onto exploded counts, allowing the M-step to decompose into the same parameter-group updates used in the component TKF91/TKF92 models, together with standard mixture-weight updates.

C.4 Order-1 Maraschino: Distilled Adjacency Frequencies

“Cherries” are pairwise training examples chosen, in place of full phylogenetically-annotated multiple sequence alignments, as a composite likelihood approximation to the full phylogenetic likelihood (40).

“Maraschino Cherries” are order-1 counts tensors that summarize the adjacency statistics of such pairwise alignments. MixDom’s Maraschino Cherries generalize CherryML to include context-dependent substitution and indel patterns (40).

The Maraschino pipeline has two phases. First (Section C.4.1), pairwise alignments are reduced to fixed-shape cherry-count tensors that aggregate adjacency statistics binned by divergence time. Second (Section C.4.2), the parameters of the MixDom Pair HMM (Section C.1.1) are estimated by maximizing the cherry-count log-likelihood: a composite likelihood that scores each adjacency in each time bin under the latent-marginalized collapsed Pair HMM transition matrix $\transnest (\theta ,\evoltime )$ derived in Section C.1.1. After fitting, the MixDom model is then distilled to compact order-1 machines (an HMM and a WFST) suitable for use in tree algorithms (Sections C.4.5 and C.4.6).

C.4.1 Cherry-count summary statistics

The input to the Maraschino fitter is a precomputed tensor of pairwise adjacency counts. For each multiple sequence alignment, sibling pairs are extracted, columns that are gaps in both sequences are dropped, and the resulting pairwise alignment is classified into adjacency contexts: for every pair of consecutive non-empty alignment columns, we record the source column type ($\sta $, $\mat $, $\ins $, $\del $), the destination column type ($\mat $, $\ins $, $\del $, $\fin $), and the ancestor/descendant characters in each.

The pairwise $p$-distance of each cherry is converted to an estimated divergence time $\evoltime $, and $\evoltime $ is discretized into $n_\tau $ geometric bins $\{\evoltime _1,\ldots ,\evoltime _{n_\tau }\}$ with representative bin centres $\bar \evoltime _b$. Counts are accumulated per bin into the following tensors over the amino-acid alphabet $\alphabet $ ($|\alphabet |=20$), with extended vocabulary $\alphabet \cup \{\sta , \fin \}$ for boundary positions:


Tensor	Shape	Meaning

$B$	$n_\tau \times (|\alphabet |+2)^2$	Singlet bigrams (incl. $\sta $/$\fin $)
$C^{\mat \mat }$	$n_\tau \times |\alphabet |^4$	Match$\to $Match: $(\anctok ,\destok ,\anctok ',\destok ')$
$C^{\mat \ins }$	$n_\tau \times |\alphabet |^3$	Match$\to $Insert: $(\anctok ,\destok ,\destok ')$
$C^{\mat \del }$	$n_\tau \times |\alphabet |^3$	Match$\to $Delete: $(\anctok ,\destok ,\anctok ')$
$C^{\ins \mat }$	$n_\tau \times |\alphabet |^3$	Insert$\to $Match: $(\destok ,\anctok ',\destok ')$
$C^{\ins \ins }$	$n_\tau \times |\alphabet |^2$	Insert$\to $Insert: $(\destok ,\destok ')$
$C^{\ins \del }$	$n_\tau \times |\alphabet |^2$	Insert$\to $Delete: $(\destok ,\anctok ')$
$C^{\del \mat }$	$n_\tau \times |\alphabet |^3$	Delete$\to $Match: $(\anctok ,\anctok ',\destok ')$
$C^{\del \del }$	$n_\tau \times |\alphabet |^2$	Delete$\to $Delete: $(\anctok ,\anctok ')$
$C^{\del \ins }$	$n_\tau \times |\alphabet |^2$	Delete$\to $Insert: $(\anctok ,\destok ')$
$C^{\sta \mat }$	$n_\tau \times |\alphabet |^2$	Start$\to $Match: $(\anctok ',\destok ')$
$C^{\sta \ins }$	$n_\tau \times |\alphabet |$	Start$\to $Insert: $(\destok ')$
$C^{\sta \del }$	$n_\tau \times |\alphabet |$	Start$\to $Delete: $(\anctok ')$
$C^{\mat \fin }$	$n_\tau \times |\alphabet |^2$	Match$\to $End: $(\anctok ,\destok )$
$C^{\ins \fin }$	$n_\tau \times |\alphabet |$	Insert$\to $End: $(\destok )$
$C^{\del \fin }$	$n_\tau \times |\alphabet |$	Delete$\to $End: $(\anctok )$
$C^{\sta \fin }$	$n_\tau $	Start$\to $End (empty alignment)

The post-Insert and post-Delete tensors carry only the adjacent emitted character ($\destok $ for inserts, $\anctok $ for deletes) as context, not the previous match’s full $(\anctok ,\destok )$ context: the model will marginalise its context-rich frequencies down to this reduced context when computing the likelihood. The Match-to-Match tensor $C^{\mat \mat }$ is the largest and dominates the parameter budget. Boundary tensors record alignments that begin or end in a particular adjacency type.
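As an illustration of the adjacency classification described above, the following minimal sketch counts the adjacencies of a single sibling pair. It assumes the pair is given as two equal-length gapped strings with `-` as the gap character, uses the single-letter labels `S`, `M`, `I`, `D`, `E` for the column types, stores counts in a sparse `Counter` rather than dense tensors, and omits the time binning. The function names are hypothetical and are not taken from the Maraschino code.

```python
from collections import Counter

GAP = "-"

def column_type(anc, des):
    """Classify one alignment column; returns None for an all-gap column."""
    if anc != GAP and des != GAP:
        return "M"
    if anc == GAP and des != GAP:
        return "I"
    if anc != GAP and des == GAP:
        return "D"
    return None

def emitted(kind, anc, des):
    """Characters carried as context: (anc, des) for M, des for I, anc for D."""
    if kind == "M":
        return (anc, des)
    if kind == "I":
        return (des,)
    if kind == "D":
        return (anc,)
    return ()

def cherry_counts(anc_row, des_row):
    """Count adjacencies keyed by (source type, destination type, characters...)."""
    assert len(anc_row) == len(des_row)
    cols = [("S", ())]
    for a, d in zip(anc_row, des_row):
        kind = column_type(a, d)
        if kind is not None:          # drop columns that are gaps in both rows
            cols.append((kind, emitted(kind, a, d)))
    cols.append(("E", ()))
    counts = Counter()
    for (u, src_chars), (v, dst_chars) in zip(cols, cols[1:]):
        counts[(u, v) + src_chars + dst_chars] += 1
    return counts

counts = cherry_counts("AC-G", "A-TG")
```

Because an Insert column carries only its descendant character and a Delete column only its ancestor character, the keys produced here already have the reduced post-Insert and post-Delete context of the table.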

C.4.2 Cherry-count likelihood for the MixDom Pair HMM

The cherry-count tensors of Section C.4.1 are scored against the collapsed MixDom Pair HMM $\transnest (\theta ,\evoltime )$ defined by equation (C.2) of Section C.1.1. The free parameters are \[ \theta = \big (\insrate _\main , \delrate _\main ,\ \{\insrate _\dom , \delrate _\dom \}_{\dom =1}^{\ndom },\ \{\domdist _\dom \}_{\dom =1}^{\ndom },\ \{\fragdist _{\dom \frag }\}_{\dom ,\frag },\ \{\ext ^{(\dom )}_{\srcfrag \destfrag }\}_{\dom ,\srcfrag ,\destfrag },\ \{\classdist _{\dom \frag \class }\}_{\dom ,\frag ,\class },\ \{\eqm ^{(\class )}, \exch ^{(\class )}\}_{\class =1}^{\nclasses }\big ), \] parameterised in an unconstrained space (log-rates, log-Dirichlet weights, log-exchangeabilities) so that gradient methods can be applied without explicit constraints. Note in particular:

$\fragdist _{\dom \frag }$ is the per-domain Dirichlet over fragment entry types (initial state of the intra-fragment Markov chain).
$\ext ^{(\dom )}_{\srcfrag \destfrag }$ is a per-domain $\nfrag \times \nfrag $ row-substochastic matrix (row sums $\leq 1$) giving the intra-fragment Markov transition probability from fragment-type $\srcfrag $ to fragment-type $\destfrag $ within the same fragment; the residual mass $\notext ^{(\dom )}_\srcfrag = 1 - \sum _\destfrag \ext ^{(\dom )}_{\srcfrag \destfrag }$ is the fragment-termination probability. The TKF92 scalar self-extension is the special case $\nfrag =1$, with $\ext ^{(\dom )}_{11}$ playing the role of the TKF92 extension probability (the M-step closed forms for both reduce to a single binary count split per row of $\ext ^{(\dom )}$).
$\classdist _{\dom \frag \class }$ is a per-(domain, fragment-type) Dirichlet over $\nclasses $ static site classes (drawn independently per emitted site, with no chain-time class switching).
Each site class $\class $ has its own reversible substitution model $\subproc (\exch ^{(\class )}, \eqm ^{(\class )})$ with rate matrix $\revsub ^{(\class )} = \exch ^{(\class )}\,\diag (\eqm ^{(\class )})$. All rate variation across sites is captured by these per-class GTR matrices; there is no separate Yang-style discretized-gamma rate-multiplier mechanism.

Pair adjacency frequencies. For each $\evoltime $-bin centre $\bar \evoltime _b$, the collapsed $(5\ndom \nfrag +2)$-state Pair HMM transition matrix $\transnest ^{(b)} \equiv \transnest (\theta ,\bar \evoltime _b)$ is constructed via the closed form of equation (C.2). The marginal stationary distribution $\pi ^{\text {stat}}_b$ over emitting states is obtained as the left null vector of $I - \transnest ^{(b)}_{\emit \emit }$, where $\emit $ denotes the set of $5\ndom \nfrag $ emitting states.

For source machine state $u \in \{\sta , \mat , \ins , \del \}$, destination machine state $v \in \{\mat , \ins , \del , \fin \}$, and characters $(\anctok , \destok , \anctok ', \destok ')$ in their respective slots, the model-side adjacency frequency \[ F^{uv}_b(\anctok ,\destok ;\anctok ',\destok ') = \sum _{\substack {s \in \emit ^{u} \\ s' \in \emit ^{v}}} \big (\pi ^{\text {stat}}_{b,s}\big ) \,\emprob _s(\anctok ,\destok ) \,\transnest ^{(b)}_{s s'} \,\emprob _{s'}(\anctok ',\destok ') \] sums over latent (domain, fragment-type) realisations of source and destination collapsed states $s,s'$, weighting by their stationary probability and emission probabilities. The emission probability of a Match state $\mat \mat _{\dom \frag }$ emitting $(\anctok ,\destok )$ marginalises the static site-class mixture: \begin {equation} \emprob _{\mat \mat _{\dom \frag }}(\anctok ,\destok ) = \sum _{\class =1}^{\nclasses } \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\anctok \exp (\revsub ^{(\class )} \evoltime )_{\anctok \destok }, \label {eq:maraschino-match-emprob} \end {equation} and emission probabilities for $\mat \ins , \ins \ins , \mat \del , \del \del $ analogously marginalise the same class mixture but emit only one of $(\anctok , \destok )$: \[ \emprob _{\ins \ins _{\dom \frag }}(\destok ) =\emprob _{\mat \ins _{\dom \frag }}(\destok ) = \sum _\class \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\destok , \quad \emprob _{\del \del _{\dom \frag }}(\anctok ) =\emprob _{\mat \del _{\dom \frag }}(\anctok ) = \sum _\class \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\anctok . \] Boundary frequencies $F^{\sta v}_b$, $F^{u\fin }_b$, $F^{\sta \fin }_b$ replace the corresponding endpoint factor with the $\sta $ row or $\fin $ column of $\transnest ^{(b)}$. 
The reduced-context frequencies needed by post-Insert and post-Delete counts are obtained by marginalisation: \[ F^{\ins v}_b(\destok ;\cdot ) = \sum _{\anctok } F^{\ins v}_b(\anctok ,\destok ;\cdot ), \qquad F^{\del v}_b(\anctok ;\cdot ) = \sum _{\destok } F^{\del v}_b(\anctok ,\destok ;\cdot ). \] For each context $(u; \anctok ,\destok )$ at bin $b$, the row normalisation constant is \[ Z^{u}_b(\anctok ,\destok ) = \sum _{v} \sum _{\anctok ',\destok '} F^{uv}_b(\anctok ,\destok ;\anctok ',\destok '), \] where the inner sum runs over the characters carried by the destination adjacency type, and the post-Insert/post-Delete normalisations use the reduced-context frequencies above.

Cherry-count log-likelihood. The composite log-likelihood that the Maraschino fitter maximises is \begin {align} \mathcal {L}_{\text {cherry}}(\theta ) &= \mathcal {L}_{\text {singlet}}(\theta ) + \sum _{b=1}^{n_\tau } \mathcal {L}_{\text {pair},b}(\theta ), \\ \mathcal {L}_{\text {singlet}}(\theta ) &= \sum _{X,Y \in \alphabet \cup \{\sta ,\fin \}} \Big (\textstyle \sum _b B^{(b)}_{XY}\Big )\, \log P^{\text {singlet}}_{XY}(\theta ), \\ \mathcal {L}_{\text {pair},b}(\theta ) &= \sum _{u,v}\, \sum _{\anctok ,\destok ,\anctok ',\destok '} C^{uv,(b)}_{\anctok \destok \,\anctok '\destok '}\, \log \!\bigg (\frac {F^{uv}_b(\anctok ,\destok ;\anctok ',\destok ')} {Z^{u}_b(\anctok ,\destok )}\bigg ). \label {eq:maraschino-pair-ll} \end {align}

Here $P^{\text {singlet}}$ is the order-1 transition matrix obtained from the MixDom Singlet HMM by row-normalising its adjacency frequencies (Section C.4.5); the singlet term scores the bigram counts $B$ summed over time bins, since the singlet model is time-independent. The pair term scores each per-bin adjacency tensor under its own row-normalised conditional distribution.
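The pair term can be sketched for a single source type $u$ in a single time bin as follows. This is an illustrative NumPy sketch, not the fitter's implementation: the dictionaries keyed by destination type, the argument `n_ctx` (the number of leading axes that index the source context) and the function name are all assumptions made for the example.

```python
import numpy as np

def context_normalised_loglik(counts, freqs, n_ctx):
    """Sum of C * log(F / Z) for one source type in one time bin.

    counts, freqs: dicts mapping destination type -> array whose first
    n_ctx axes index the source context and whose remaining axes index
    the destination characters. freqs must cover every destination type.
    """
    # Z^u(context): total frequency over destination types and characters
    Z = 0.0
    for F in freqs.values():
        dest_axes = tuple(range(n_ctx, F.ndim))
        Z = Z + (F.sum(axis=dest_axes) if dest_axes else F)
    ll = 0.0
    for v, C in counts.items():
        F = freqs[v]
        cond = F / Z.reshape(Z.shape + (1,) * (F.ndim - n_ctx))
        mask = C > 0                      # skip empty cells, avoiding log(0)
        ll += float(np.sum(C[mask] * np.log(cond[mask])))
    return ll
```

For a Match source the context is $(\anctok ,\destok )$, so `n_ctx=2`; for Insert or Delete sources the reduced-context frequencies are passed with `n_ctx=1`.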

Optimisation. $\mathcal {L}_{\text {cherry}}(\theta )$ is differentiable in $\theta $. The maximisation is carried out by gradient methods (Adam, optionally L-BFGS for refinement) on the unconstrained parameterisation, using the same MixDom initialiser as the exact Baum–Welch trainer (including the same flags for the number of site classes, the class-equilibrium initialisation, and the fragment-class assignment), so the two fitters can be started from identical parameters and compared directly. The two fitters then differ only in objective and optimiser (a closed-form M-step on the full likelihood for exact EM, gradient ascent on the composite likelihood for Maraschino), which provides a controlled comparison of cherry-count fitting against full Baum–Welch on the same data and architecture.

The fitted MixDom parameters are written to a checkpoint with the same key layout as a train_pfam-produced checkpoint, so that either trainer’s output can be loaded and refined by the other and either can be used as input to the order-1 distillation (Sections C.4.5 and C.4.6).

C.4.3 Distillation From MixDom To Order-1 Machines

The MixDom HMMs defined above have large structured state spaces (domains $\times $ fragments). We now show how to distill these into compact order-1 machines (an HMM and a transducer) whose transition probabilities depend only on the most recently emitted characters. These machines are approximations of the full MixDom model. The MixDom model parameters are assumed to have been trained, e.g. via Baum–Welch or via Maraschino cherry-count fitting (Section C.4.2); see also (30).

The distillation operates on the alphabet $\alphabet $ (with the per-(domain, fragment-type) site-class mixture marginalised implicitly inside the match emission tensors of equation (C.6)). The Woodbury structural weights depend only on indel parameters; the per-class emission tensors $\eqm ^{(\class )}_\anctok \exp (\revsub ^{(\class )} \evoltime )_{\anctok \destok }$, weighted by $\classdist _{\dom \frag \class }$, are contracted with those weights to produce the order-1 transition probabilities.

Let $|\alphabet |$ denote the alphabet size and $\circ $ denote a distinguished beginning-of-sequence (BOS) symbol.

C.4.4 Notation for path marginalizations

Both distillations require marginalizing over null (non-emitting) states between consecutive emissions. We introduce a compact notation for the expected frequency of partially observed paths.

In any HMM or transducer with states partitioned into emitting states $\emit $ and non-emitting (null) states $\silent $, define the null closure \[ \nullcl \equiv (I - T_{\silent \silent })^{-1} \] where $T_{\silent \silent }$ is the submatrix of transitions among null states. The effective transition matrix from emitting-or-start to emitting-or-end, marginalizing all intervening null paths, is \[ \effT _{ij} = T_{ij} + \sum _{p,q \in \silent } T_{ip}\, \nullcl _{pq}\, T_{qj} \qquad i \in \emit \cup \{\sta \},\ j \in \emit \cup \{\fin \} \]
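The null closure and effective transition matrix can be computed directly from these definitions. The sketch below assumes a dense transition matrix `T` and explicit index lists for the null, source and destination states; the function name and argument layout are illustrative. It solves the linear system $(I - T_{\silent \silent })\,Z = T_{\silent j}$ rather than forming the inverse explicitly.

```python
import numpy as np

def effective_transitions(T, null_idx, src_idx, dst_idx):
    """Effective transitions src -> dst, marginalising paths through null states."""
    T = np.asarray(T, dtype=float)
    T_nn = T[np.ix_(null_idx, null_idx)]
    # (I - T_nn)^{-1} T_{null,dst}, obtained with a linear solve
    closure_to_dst = np.linalg.solve(np.eye(len(null_idx)) - T_nn,
                                     T[np.ix_(null_idx, dst_idx)])
    direct = T[np.ix_(src_idx, dst_idx)]
    via_null = T[np.ix_(src_idx, null_idx)] @ closure_to_dst
    return direct + via_null

# Toy machine: 0 = start, 1 = emitting, 2 = null, 3 = end
T = np.array([[0.0, 0.5,  0.5, 0.0],
              [0.0, 0.2,  0.5, 0.3],
              [0.0, 0.25, 0.5, 0.25],
              [0.0, 0.0,  0.0, 0.0]])
T_eff = effective_transitions(T, null_idx=[2], src_idx=[0, 1], dst_idx=[1, 3])
```

In this toy machine every null path eventually reaches an emitting or end state, so each row of `T_eff` sums to one.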

For paths through the MixDom HMMs, we use the notation \[ \pathE [\,\sta \to \underbrace {s_1}_{(\anctok _1,\destok _1)} \xrightarrow {\silent ^\ast } \underbrace {s_2}_{(\anctok _2,\destok _2)} \to \cdots \to \fin \,] \] to denote the expected number of times the path visits the indicated sequence of emitting states (with indicated emissions in subscript) separated by zero or more null states (denoted $\silent ^\ast $), summed over all completions of the path to the left ($\sta \to \cdots $) and right ($\cdots \to \fin $). Formally, if $\pi $ is the stationary distribution over states, \[ \pathE [\, \underbrace {s_1}_{(\anctok _1,\destok _1)} \xrightarrow {\silent ^\ast } \underbrace {s_2}_{(\anctok _2,\destok _2)} \,] = \sum _{s_1 \in \emit } \sum _{s_2 \in \emit } \left ( \sum _{i} \pi _i \effT ^\ast _{is_1} \right ) \emprob _{s_1}(\anctok _1,\destok _1)\, \effT _{s_1 s_2}\, \emprob _{s_2}(\anctok _2,\destok _2) \left ( \sum _{j} \effT ^\ast _{s_2 j} \right ) \] where $\emprob _s(\cdot )$ is the emission probability at state $s$ and $\effT ^\ast _{ij} = [(I - \effT _{\emit \emit })^{-1}]_{ij}$ (with $\effT _{\emit \emit }$ the submatrix of $\effT $ restricted to emitting states) sums over all paths through emitting states.

C.4.5 Distillation to Order-1 HMM

The MixDom Singlet HMM generates sequences from the stationary distribution. We distill it into an order-1 HMM with states \[ \{\sta , \fin \} \cup \{ \anctok : \anctok \in \alphabet \} \] where state $\anctok $ deterministically emits character $\anctok $. Define the adjacency frequency from the full MixDom Singlet HMM: \[ f(\anctok ,\destok ) = \pathE [\,\underbrace {s_1}_{\anctok } \xrightarrow {\silent ^\ast } \underbrace {s_2}_{\destok }\,] \] where the sum is over all emitting states $s_1,s_2$ weighted by emission probabilities $\emprob _{s_1}(\anctok )$ and $\emprob _{s_2}(\destok )$ as above. Each singlet emission probability $\emprob _{\ins _{\dom \frag }}(\anctok ) = \sum _\class \classdist _{\dom \frag \class } \eqm ^{(\class )}_\anctok $ marginalises the per-(domain, fragment-type) class mixture.

Parameterization. The order-1 Singlet HMM transition probabilities are: \begin {align} P(\destok | \sta ) &= \frac {\displaystyle \sum _{\destok '} f(\destok ,\destok ')}{\displaystyle \sum _{\anctok ',\destok '} f(\anctok ',\destok ')} && \mbox {(start $\to $ first emission)} \label {eq:hmm-start} \\ P(\destok | \anctok ) &= \frac {f(\anctok ,\destok )}{\displaystyle \sum _{\destok '} f(\anctok ,\destok ')} && \mbox {(emission $\to $ emission)} \label {eq:hmm-trans} \\ P(\fin | \anctok ) &= 1 - \sum _{\destok } P(\destok | \anctok ) && \mbox {(emission $\to $ end)} \label {eq:hmm-end} \end {align}

That is: normalize each row of the adjacency matrix $f$ to obtain transition probabilities, allocating the residual probability mass to the end state. The start distribution $P(\destok | \sta )$ is proportional to the row sums $\sum _{\destok '} f(\destok ,\destok ')$ of $f$.
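A minimal NumPy sketch of this normalisation is given below. The optional `f_end` vector, holding an emission-to-end adjacency frequency per character, is an assumption of this sketch and does not appear in the equations above; with its default of zero, the denominator is exactly the row sum of $f$ and the end probability vanishes.

```python
import numpy as np

def order1_hmm(f, f_end=None):
    """Order-1 Singlet HMM parameters from an adjacency matrix f[a, b].

    f_end (hypothetical, optional): per-character emission -> end frequency
    added to the row normaliser; defaults to zero.
    """
    f = np.asarray(f, dtype=float)
    n = f.shape[0]
    f_end = np.zeros(n) if f_end is None else np.asarray(f_end, dtype=float)
    start = f.sum(axis=1) / f.sum()            # P(b | start): row sums of f
    row_total = f.sum(axis=1) + f_end
    trans = f / row_total[:, None]             # P(b | a)
    end = 1.0 - trans.sum(axis=1)              # P(end | a): residual mass
    return start, trans, end

start, trans, end = order1_hmm([[2.0, 2.0], [1.0, 3.0]], f_end=[1.0, 0.0])
```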

C.4.6 Distillation to Order-1 WFST

The MixDom Pair HMM describes the joint distribution over ancestor-descendant sequence pairs. We distill it into an order-1 transducer (a Mealy machine) whose state depends on the last inputted ancestor character and the last outputted descendant character.

Machine states. The transducer has seven machine states $\{\sta ,\mat ,\ins ,\del ,\waitm ,\waitd ,\fin \}$ organized as a waiting machine:

Non-waiting (all outgoing transitions have $\varepsilon $ input): $\sta $ (start), $\mat $ (just matched), $\ins $ (just inserted), $\del $ (just deleted)
Waiting (all outgoing transitions consume input): $\waitm $ (ready after $\mat $ or $\ins $), $\waitd $ (ready after $\del $)
Terminal: $\fin $ (end)

The distinction $\waitm \neq \waitd $ is needed because outgoing transition weights differ: in the MixDom Pair HMM, the $\mat /\ins $ rows use $\beta $ while $\del $ rows use $\gamma $.

Transitions. For a state with last-input ancestor $X$ and last-output descendant $Y$ (where $X,Y \in \alphabet \cup \{\circ \}$ and $\circ $ denotes BOS):

Source	Dest	Input	Output	Weight

$\sta $	$\waitm $	$\varepsilon $	$\varepsilon $	$p_{\sta \waitm }$
$\sta $	$\ins $	$\varepsilon $	$\destok $	$p_{\sta \ins }(Y,\destok )$
$\sta $	$\fin $	$\varepsilon $	$\varepsilon $	$p_{\sta \fin }$

$\waitm $	$\mat $	$\anctok $	$\destok $	$p_{\waitm \mat }(X,Y,\anctok ,\destok )$
$\waitm $	$\del $	$\anctok $	$\varepsilon $	$p_{\waitm \del }(X,Y,\anctok )$

$\waitd $	$\mat $	$\anctok $	$\destok $	$p_{\waitd \mat }(X,Y,\anctok ,\destok )$
$\waitd $	$\del $	$\anctok $	$\varepsilon $	$p_{\waitd \del }(X,Y,\anctok )$

$\mat $	$\waitm $	$\varepsilon $	$\varepsilon $	$p_{\mat \waitm }$
$\mat $	$\ins $	$\varepsilon $	$\destok $	$p_{\mat \ins }(X,Y,\destok )$
$\mat $	$\fin $	$\varepsilon $	$\varepsilon $	$p_{\mat \fin }$

$\ins $	$\waitm $	$\varepsilon $	$\varepsilon $	$p_{\ins \waitm }$
$\ins $	$\ins $	$\varepsilon $	$\destok $	$p_{\ins \ins }(X,Y,\destok )$
$\ins $	$\fin $	$\varepsilon $	$\varepsilon $	$p_{\ins \fin }$

$\del $	$\waitd $	$\varepsilon $	$\varepsilon $	$p_{\del \waitd }$
$\del $	$\ins $	$\varepsilon $	$\destok $	$p_{\del \ins }(X,Y,\destok )$
$\del $	$\fin $	$\varepsilon $	$\varepsilon $	$p_{\del \fin }$

Note that $\waitm $ and $\waitd $ are not associated with emissions; they serve only to enforce the waiting-machine property. Transitions from $\waitm $ and $\waitd $ always consume an ancestor input symbol; transitions from $\mat $, $\ins $, $\del $, and $\sta $ never do.
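The transition table and the waiting-machine property can be written down as plain data and checked mechanically. In the sketch below, the state labels (`WM`, `WD` for the two wait states) and the tuple layout are illustrative choices; transition weights are omitted.

```python
# (source, destination, consumes_input, emits_output), one tuple per table row
TRANSITIONS = [
    ("S",  "WM", False, False), ("S",  "I", False, True), ("S",  "E", False, False),
    ("WM", "M",  True,  True),  ("WM", "D", True,  False),
    ("WD", "M",  True,  True),  ("WD", "D", True,  False),
    ("M",  "WM", False, False), ("M",  "I", False, True), ("M",  "E", False, False),
    ("I",  "WM", False, False), ("I",  "I", False, True), ("I",  "E", False, False),
    ("D",  "WD", False, False), ("D",  "I", False, True), ("D",  "E", False, False),
]
WAIT_STATES = {"WM", "WD"}

def is_waiting_machine(transitions, wait_states):
    """True if exactly the transitions leaving wait states consume input."""
    return all((src in wait_states) == consumes
               for src, _dst, consumes, _emits in transitions)

STATES = {s for src, dst, _, _ in TRANSITIONS for s in (src, dst)}
```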

Parameterization from the MixDom Pair HMM. We need the expected frequency of transitions conditioned on (last ancestor input $X$, last descendant output $Y$, transition type, new symbols). The key subtlety: through insert states, the last ancestor symbol $X$ must be propagated from the preceding match or delete, since inserts do not consume input.

Using the path notation from above, we enumerate all adjacency types that arise in the MixDom Pair HMM. Write $\mat [X,Y]$ for a match state that inputs $X$ and outputs $Y$, $\ins [Y]$ for an insert that outputs $Y$, and $\del [X]$ for a delete that inputs $X$. The last-ancestor and last-descendant context is carried implicitly.

Adjacency frequencies. The following table lists all adjacency types and their corresponding path marginalizations. In each case, the frequency is computed as a sum over MixDom Pair HMM states, with null states marginalized via $\nullcl $. Write $\circ $ for the boundary (start/end) context.

Context	Adjacency	MixDom path	Frequency

Start $\to $ Match	$\sta \to \mat [X',Y']$	$\sta \xrightarrow {\silent ^\ast } \mat [X',Y']$	$f^{\sta \mat }(X',Y')$
Start $\to $ Insert	$\sta \to \ins [Y']$	$\sta \xrightarrow {\silent ^\ast } \ins [Y']$	$f^{\sta \ins }(Y')$
Start $\to $ End (empty sequence)	$\sta \to \fin $	$\sta \xrightarrow {\silent ^\ast } \fin $	$f^{\sta \fin }$

Match $\to $ Match (via null states only)	$\mat [X,Y] \to \mat [X',Y']$	$\mat [X,Y] \xrightarrow {\silent ^\ast } \mat [X',Y']$	$f^{\mat \mat }(X,Y,X',Y')$
Match $\to $ Insert	$\mat [X,Y] \to \ins [Y']$	$\mat [X,Y] \xrightarrow {\silent ^\ast } \ins [Y']$	$f^{\mat \ins }(X,Y,Y')$
Match $\to $ Delete	$\mat [X,Y] \to \del [X']$	$\mat [X,Y] \xrightarrow {\silent ^\ast } \del [X']$	$f^{\mat \del }(X,Y,X')$
Match $\to $ End	$\mat [X,Y] \to \fin $	$\mat [X,Y] \xrightarrow {\silent ^\ast } \fin $	$f^{\mat \fin }(X,Y)$

Insert $\to $ Insert (ancestor context $X$ propagated)	$\ins [Y] \to \ins [Y']$	$\ins [Y] \xrightarrow {\silent ^\ast } \ins [Y']$	$f^{\ins \ins }(X,Y,Y')$
Insert $\to $ Match (ancestor context $X$ propagated)	$\ins [Y] \to \mat [X',Y']$	$\ins [Y] \xrightarrow {\silent ^\ast } \mat [X',Y']$	$f^{\ins \mat }(X,Y,X',Y')$
Insert $\to $ Delete (ancestor context $X$ propagated)	$\ins [Y] \to \del [X']$	$\ins [Y] \xrightarrow {\silent ^\ast } \del [X']$	$f^{\ins \del }(X,Y,X')$
Insert $\to $ End (ancestor context $X$ propagated)	$\ins [Y] \to \fin $	$\ins [Y] \xrightarrow {\silent ^\ast } \fin $	$f^{\ins \fin }(X,Y)$

Delete $\to $ Match (descendant context $Y$ propagated)	$\del [X] \to \mat [X',Y']$	$\del [X] \xrightarrow {\silent ^\ast } \mat [X',Y']$	$f^{\del \mat }(X,Y,X',Y')$
Delete $\to $ Delete (descendant context $Y$ propagated)	$\del [X] \to \del [X']$	$\del [X] \xrightarrow {\silent ^\ast } \del [X']$	$f^{\del \del }(X,Y,X')$
Delete $\to $ Insert (descendant context $Y$ propagated)	$\del [X] \to \ins [Y']$	$\del [X] \xrightarrow {\silent ^\ast } \ins [Y']$	$f^{\del \ins }(X,Y,Y')$
Delete $\to $ End (descendant context $Y$ propagated)	$\del [X] \to \fin $	$\del [X] \xrightarrow {\silent ^\ast } \fin $	$f^{\del \fin }(X,Y)$

In all cases, the ancestor context $X$ is propagated through insert states (which do not consume input), and the descendant context $Y$ is propagated through delete states (which do not emit output). The notation $\xrightarrow {\silent ^\ast }$ denotes zero or more transitions through null states, marginalized via the null closure $\nullcl $.

Computing the frequencies. Each frequency above is computed from the MixDom Pair HMM as follows. Let $\pi _s$ denote the stationary probability of state $s$ and $\effT _{s_1 s_2}$ the null-marginalized effective transition. For the simplest case (direct adjacency): \[ f^{\mat \del }(X,Y,X') = \sum _{s_1 \in \emit ^{\mat }} \sum _{s_2 \in \emit ^{\del }} \left (\sum _i \pi _i \effT ^\ast _{i s_1}\right ) \emprob _{s_1}(X,Y)\, \effT _{s_1 s_2}\, \emprob _{s_2}(X') \left (\sum _j \effT ^\ast _{s_2 j}\right ) \] Boundary frequencies use the start/end rows of $\effT $: \[ f^{\sta \mat }(X',Y') = \sum _{s \in \emit ^{\mat }} \effT _{\sta s}\, \emprob _s(X',Y') \left (\sum _j \effT ^\ast _{s j}\right ), \quad f^{\mat \fin }(X,Y) = \sum _{s \in \emit ^{\mat }} \left (\sum _i \pi _i \effT ^\ast _{i s}\right ) \emprob _s(X,Y)\, \effT _{s\,\fin } \] (and analogously for $f^{\sta \ins }$, $f^{\sta \fin }$, $f^{\ins \fin }$, $f^{\del \fin }$).

Normalization to transducer parameters. Given the adjacency frequencies, the order-1 transducer weights are obtained by normalization. For each context $(X,Y)$ and source machine state, normalize outgoing weights to sum to 1:

Start transitions (context $\circ $): \begin {align*} p_{\sta \waitm } &= \frac {\sum _{X',Y'} f^{\sta \mat }(X',Y')} {\sum _{X',Y'} f^{\sta \mat }(X',Y') + \sum _{Y'} f^{\sta \ins }(Y') + f^{\sta \fin }}, \quad p_{\sta \ins }(\destok ) = \frac {f^{\sta \ins }(\destok )} {\sum _{X',Y'} f^{\sta \mat }(X',Y') + \sum _{Y'} f^{\sta \ins }(Y') + f^{\sta \fin }} \end {align*}

(and $p_{\sta \fin }$ uses the same denominator with numerator $f^{\sta \fin }$).

Wait-after-match/insert transitions (context $(X,Y)$, consuming input $\anctok $): \begin {align*} p_{\waitm \mat }(X,Y,\anctok ,\destok ) &= \frac {f^{\cdot \mat }(X,Y,\anctok ,\destok )} {\sum _{\anctok '}\left [\sum _{\destok '} f^{\cdot \mat }(X,Y,\anctok ',\destok ') + f^{\cdot \del }(X,Y,\anctok ')\right ]} \\ p_{\waitm \del }(X,Y,\anctok ) &= \frac {f^{\cdot \del }(X,Y,\anctok )} {\sum _{\anctok '}\left [\sum _{\destok '} f^{\cdot \mat }(X,Y,\anctok ',\destok ') + f^{\cdot \del }(X,Y,\anctok ')\right ]} \end {align*}

where $f^{\cdot \mat }(X,Y,\anctok ,\destok ) = f^{\mat \mat }(X,Y,\anctok ,\destok ) + f^{\ins \mat }(X,Y,\anctok ,\destok )$ and $f^{\cdot \del }(X,Y,\anctok ) = f^{\mat \del }(X,Y,\anctok ) + f^{\ins \del }(X,Y,\anctok )$, aggregating over both match and insert sources that share the $\waitm $ wait state.

Wait-after-delete transitions (context $(X,Y)$, consuming input $\anctok $): \begin {align*} p_{\waitd \mat }(X,Y,\anctok ,\destok ) &= \frac {f^{\del \mat }(X,Y,\anctok ,\destok )} {\sum _{\anctok '}\left [\sum _{\destok '} f^{\del \mat }(X,Y,\anctok ',\destok ') + f^{\del \del }(X,Y,\anctok ')\right ]} \\ p_{\waitd \del }(X,Y,\anctok ) &= \frac {f^{\del \del }(X,Y,\anctok )} {\sum _{\anctok '}\left [\sum _{\destok '} f^{\del \mat }(X,Y,\anctok ',\destok ') + f^{\del \del }(X,Y,\anctok ')\right ]} \end {align*}

Post-match transitions (context $(X,Y)$, after matching with $\anctok ,\destok $; new context becomes $(\anctok ,\destok )$): \begin {align*} p_{\mat \waitm }(X,Y) &= \frac {\sum _{\anctok ',\destok '} f^{\mat \mat }(X,Y,\anctok ',\destok ') + \sum _{\anctok '} f^{\mat \del }(X,Y,\anctok ')} {Z^{\mat }(X,Y)} \\ p_{\mat \ins }(X,Y,\destok ) &= \frac {f^{\mat \ins }(X,Y,\destok )}{Z^{\mat }(X,Y)}, \quad p_{\mat \fin }(X,Y) = \frac {f^{\mat \fin }(X,Y)}{Z^{\mat }(X,Y)} \end {align*}

where $Z^{\mat }(X,Y) = \sum _{\anctok ',\destok '} f^{\mat \mat }(X,Y,\anctok ',\destok ') + \sum _{\anctok '} f^{\mat \del }(X,Y,\anctok ') + \sum _{\destok } f^{\mat \ins }(X,Y,\destok ) + f^{\mat \fin }(X,Y)$.

Post-insert transitions are analogous, using $f^{\ins \cdot }$ frequencies with $Z^{\ins }(X,Y)$.

Post-delete transitions (context $(X,Y)$, after deleting $\anctok $; new context becomes $(\anctok ,Y)$): \begin {align*} p_{\del \waitd }(X,Y) &= \frac {\sum _{\anctok '} f^{\del \del }(X,Y,\anctok ') + \sum _{\anctok ',\destok '} f^{\del \mat }(X,Y,\anctok ',\destok ')} {Z^{\del }(X,Y)} \\ p_{\del \ins }(X,Y,\destok ) &= \frac {f^{\del \ins }(X,Y,\destok )}{Z^{\del }(X,Y)}, \quad p_{\del \fin }(X,Y) = \frac {f^{\del \fin }(X,Y)}{Z^{\del }(X,Y)} \end {align*}

where $Z^{\del }(X,Y) = \sum _{\anctok '} f^{\del \del }(X,Y,\anctok ') + \sum _{\anctok ',\destok '} f^{\del \mat }(X,Y,\anctok ',\destok ') + \sum _{\destok } f^{\del \ins }(X,Y,\destok ) + f^{\del \fin }(X,Y)$.
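The normalisations above amount to dividing each frequency tensor by a context-indexed total. The following NumPy sketch shows the post-match and wait-after-match/insert cases, assuming dense arrays whose first two axes index the context $(X,Y)$; the array and function names are illustrative.

```python
import numpy as np

def post_match_weights(f_MM, f_MI, f_MD, f_ME):
    """Post-match weights p_{M->W_M}(X,Y), p_{M->I}(X,Y,b), p_{M->E}(X,Y)."""
    # Mass that proceeds to the wait state: next column is a match or a delete
    to_wait = f_MM.sum(axis=(2, 3)) + f_MD.sum(axis=2)
    Z = to_wait + f_MI.sum(axis=2) + f_ME          # Z^M(X, Y)
    return to_wait / Z, f_MI / Z[:, :, None], f_ME / Z

def wait_weights(f_to_M, f_to_D):
    """Wait-state weights; pass f^{.M}, f^{.D} for W_M or f^{DM}, f^{DD} for W_D."""
    denom = f_to_M.sum(axis=(2, 3)) + f_to_D.sum(axis=2)
    return f_to_M / denom[:, :, None, None], f_to_D / denom[:, :, None]
```

For the $\waitm $ state the inputs would be the aggregated tensors, e.g. `wait_weights(f_MM + f_IM, f_MD + f_ID)`, so that match and insert sources share one normaliser.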

C.5 Algebraic Distillation of MixDom

We investigate whether the distillation of the MixDom model to order-1 machines (Sections C.4.5 and C.4.6) can be performed in closed algebraic form, and how the computation scales with the number of domain types $\ndom $, the number of fragment types per domain $\nfrag $, and the number of site classes $\nclasses $. The same algebraic decomposition powers the cherry-count log-likelihood that the Maraschino fitter (Section C.4.2) maximises under MixDom.

C.5.1 Setup

The MixDom model has the following parameters (matching Section C.1.1):

Top-level TKF91: $\insrate _\main , \delrate _\main $ (domain birth/death rates).
Domain weights: $\domdist _\dom $, $\sum _\dom \domdist _\dom = 1$.
Per-domain TKF91 rates governing the per-domain TKF92 fragment process: $\insrate _\dom , \delrate _\dom $ for $\dom = 1, \ldots , \ndom $.
Per-domain fragment-type entry distribution: $\fragdist _{\dom \frag }$ with $\sum _\frag \fragdist _{\dom \frag } = 1$.
Per-domain intra-fragment Markov extension matrix $\ext ^{(\dom )}_{\srcfrag \destfrag }$, $\nfrag \times \nfrag $, with row sums $\leq 1$. The fragment-termination probability is $\notext ^{(\dom )}_\srcfrag = 1 - \sum _\destfrag \ext ^{(\dom )}_{\srcfrag \destfrag }$.
Per-(domain, fragment-type) site-class Dirichlet: $\classdist _{\dom \frag \class }$ with $\sum _\class \classdist _{\dom \frag \class } = 1$.
Per-class reversible substitution model: $(\exch ^{(\class )}, \eqm ^{(\class )})$ for $\class = 1, \ldots , \nclasses $; rate matrix $\revsub ^{(\class )} = \exch ^{(\class )} \diag (\eqm ^{(\class )})$.

The total scalar parameter count is $2 + (\ndom -1) + 2\ndom + \ndom \nfrag ^2 + \ndom (\nfrag -1) + \ndom \nfrag (\nclasses -1)$ plus $\nclasses $ rate matrices.
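The count can be evaluated term by term. In the small sketch below, the function and variable names are illustrative, and the per-class rate matrices are left out exactly as in the formula.

```python
def mixdom_scalar_param_count(n_dom, n_frag, n_classes):
    """Scalar parameter count of MixDom, excluding the per-class rate matrices."""
    top_level_rates = 2                                  # top-level ins/del rates
    domain_weights = n_dom - 1                           # simplex over domain types
    per_domain_rates = 2 * n_dom                         # per-domain ins/del rates
    ext_entries = n_dom * n_frag ** 2                    # extension matrices
    frag_entry = n_dom * (n_frag - 1)                    # fragment entry distributions
    class_weights = n_dom * n_frag * (n_classes - 1)     # site-class mixtures
    return (top_level_rates + domain_weights + per_domain_rates
            + ext_entries + frag_entry + class_weights)

print(mixdom_scalar_param_count(3, 2, 4))   # 2 + 2 + 6 + 12 + 3 + 18 = 43
```

With a single domain, fragment type and class, the count reduces to 5: the two top-level rates, the two per-domain rates and the scalar extension probability.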

Write $\kappa _\dom = \insrate _\dom / \delrate _\dom $, $\alpha _\dom = \alpha (\insrate _\dom , \delrate _\dom , \evoltime )$, $\beta _\dom = \beta (\insrate _\dom , \delrate _\dom , \evoltime )$, $\gamma _\dom = \gamma (\insrate _\dom , \delrate _\dom , \evoltime )$, and similarly $\alpha _\main , \beta _\main , \gamma _\main , \kappa _\main $ for the top-level parameters.

C.5.2 Class-mixture emissions

In MixDom the per-(domain, fragment-type) site-class mixture appears in every emission probability of the collapsed Pair HMM. Define the per-class match emission as \[ P^{(\class )}(\anctok ,\destok \mid \evoltime ) = \eqm ^{(\class )}_\anctok \, \exp (\revsub ^{(\class )}\evoltime )_{\anctok \destok }. \] The model’s per-(domain, fragment-type) emission factors at evolutionary time $\evoltime $ are: \begin {align*} \phi ^{\mat }_{\dom \frag }(\anctok , \destok ) &\equiv \emprob _{\mat \mat _{\dom \frag }}(\anctok ,\destok ) = \sum _{\class =1}^{\nclasses } \classdist _{\dom \frag \class }\, P^{(\class )}(\anctok ,\destok \mid \evoltime ), \\ \phi ^{\ins }_{\dom \frag }(\destok ) &\equiv \emprob _{\mat \ins _{\dom \frag }}(\destok ) = \emprob _{\ins \ins _{\dom \frag }}(\destok ) = \sum _\class \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\destok , \\ \phi ^{\del }_{\dom \frag }(\anctok ) &\equiv \emprob _{\mat \del _{\dom \frag }}(\anctok ) = \emprob _{\del \del _{\dom \frag }}(\anctok ) = \sum _\class \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\anctok . \end {align*}

Each of these is a $\nclasses $-term mixture over the per-class GTRs. The match emission $\phi ^{\mat }_{\dom \frag }$ inherits the time dependence from the per-class transition kernels $P^{(\class )}(\evoltime )$; the singlet emissions $\phi ^{\ins }_{\dom \frag }$, $\phi ^{\del }_{\dom \frag }$ are time-independent.
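The three emission factors for one (domain, fragment-type) context can be sketched in NumPy as follows. Two assumptions are made that the text does not spell out: the diagonal of each rate matrix is set so that rows sum to zero, and the matrix exponential is computed through the symmetrised matrix $\diag (\eqm )^{1/2}\,\revsub \,\diag (\eqm )^{-1/2}$, which requires symmetric exchangeabilities and strictly positive equilibria. All function names are illustrative.

```python
import numpy as np

def gtr_rate_matrix(exch, eqm):
    """Q[a, b] = exch[a, b] * eqm[b] off the diagonal; rows sum to zero (assumed)."""
    Q = np.asarray(exch, dtype=float) * np.asarray(eqm, dtype=float)[None, :]
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

def reversible_expm(Q, eqm, t):
    """exp(Q t) for a reversible Q, via eigendecomposition of its symmetrised form."""
    r = np.sqrt(eqm)
    sym = (r[:, None] * Q) / r[None, :]
    sym = 0.5 * (sym + sym.T)                 # remove round-off asymmetry
    w, V = np.linalg.eigh(sym)
    exp_sym = (V * np.exp(w * t)) @ V.T
    return exp_sym * (r[None, :] / r[:, None])

def class_mixture_emissions(class_w, exch, eqm, t):
    """class_w: (K,), exch: (K, A, A), eqm: (K, A). Returns phi_M, phi_I, phi_D."""
    phi_M = sum(w * pi[:, None] * reversible_expm(gtr_rate_matrix(S, pi), pi, t)
                for w, S, pi in zip(class_w, exch, eqm))
    phi_I = np.asarray(class_w) @ np.asarray(eqm)   # time-independent singlet emission
    return phi_M, phi_I, phi_I.copy()
```

Under these assumptions both marginals of `phi_M` equal the singlet emission, which gives a quick consistency check on any implementation.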

Remark C.3 (Bilinear emission tensor). The match emission factors as \[ \phi ^{\mat }_{\dom \frag }(\anctok ,\destok ) = \sum _{\class } \classdist _{\dom \frag \class }\, \eqm ^{(\class )}_\anctok \, \exp (\revsub ^{(\class )}\evoltime )_{\anctok \destok } \] which is linear in the class weights $\classdist _{\dom \frag \class }$. Every $|\alphabet | \times |\alphabet |$ emission matrix $\phi ^{\mat }_{\dom \frag }$ therefore lies in the span of the $\nclasses $ per-class matrices $P^{(\class )}(\cdot ,\cdot \mid \evoltime )$, so the emission tensor stacked over $(\dom ,\frag )$, viewed as an $\ndom \nfrag \times |\alphabet |^2$ matrix, has rank at most $\nclasses $. The single-class case $\classdist _{\dom \frag \class } = \delta _{\dom \class }$ reduces each domain to a single term of the sum (a single GTR per domain). The general parameterisation strictly generalises this by sharing the GTR pool across (domain, fragment-type) contexts and allowing any soft mixture.

C.5.3 Singlet HMM Distillation

State space. The MixDom Singlet HMM generates sequences from the stationary distribution. Its collapsed state space (Section C.1.1) is $\{ \sta , \fin \} \cup \{ \ins _{\dom \frag } : 1 \leq \dom \leq \ndom ,\ 1 \leq \frag \leq \nfrag \}$, so there are $\ndom \nfrag $ emitting states with emissions $\phi ^\ins _{\dom \frag }(\anctok )$.

Within a domain, fragment continuation is governed by the intra-fragment Markov extension matrix $\ext ^{(\dom )}_{\srcfrag \destfrag }$ and the new-fragment $\kappa _\dom \fragdist _{\dom ,\destfrag }$ branch. Adapting the formulae in Section C.1.1 for the singlet, the effective transition matrix between emitting states $\ins _{\srcdom \srcfrag } \to \ins _{\destdom \destfrag }$ has entries \[ \effT _{\ins _{\srcdom \srcfrag },\ins _{\destdom \destfrag }} = \ext ^{(\srcdom )}_{\srcfrag \destfrag }\delta _{\srcdom \destdom } + \notext ^{(\srcdom )}_\srcfrag \kappa _\srcdom \fragdist _{\srcdom \destfrag }\delta _{\srcdom \destdom } + \frac {\notext ^{(\srcdom )}_\srcfrag (1-\kappa _\srcdom )\, \kappa _\main \domdist _\destdom \kappa _\destdom \fragdist _{\destdom \destfrag }} {1 - \kappa _\main \emptyseg _0} \] where $\emptyseg _0 = \sum _\dom \domdist _\dom (1-\kappa _\dom )$. Computing the path-sum matrix $(I - \effT _{\emit ,\emit })^{-1}$ requires an $\ndom \nfrag \times \ndom \nfrag $ matrix inversion in general.
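The three terms of the effective transition matrix can be assembled directly from the parameters. The NumPy sketch below assumes the parameters are supplied as dense arrays with the shapes given in the docstring, and flattens the state index as $(\dom ,\frag ) \mapsto \dom \nfrag + \frag $; the function and argument names are illustrative.

```python
import numpy as np

def singlet_effective_T(kappa_main, kappa, dom_w, frag_w, ext):
    """Effective transitions between emitting Singlet states.

    kappa: (D,) per-domain ins/del rate ratios; dom_w: (D,) domain weights;
    frag_w: (D, F) fragment entry distributions; ext: (D, F, F) extension
    matrices. Returns a (D*F, D*F) matrix.
    """
    kappa, dom_w = np.asarray(kappa, float), np.asarray(dom_w, float)
    frag_w, ext = np.asarray(frag_w, float), np.asarray(ext, float)
    D, F = frag_w.shape
    notext = 1.0 - ext.sum(axis=2)                     # fragment termination
    empty0 = np.sum(dom_w * (1.0 - kappa))
    T = np.zeros((D, F, D, F))
    for d in range(D):
        # Same domain: extend the fragment, or end it and start a new one
        T[d, :, d, :] += ext[d] + notext[d][:, None] * kappa[d] * frag_w[d][None, :]
    # End both fragment and domain, then enter the first fragment of a new domain
    leave = notext * (1.0 - kappa)[:, None]
    enter = kappa_main * (dom_w * kappa)[:, None] * frag_w
    T += leave[:, :, None, None] * enter[None, None, :, :] / (1.0 - kappa_main * empty0)
    return T.reshape(D * F, D * F)
```

The rows of the result sum to less than one; the missing mass is the probability of moving to the end state.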

Adjacency frequencies. The Singlet adjacency frequency is \begin {equation} \label {eq:single-adj} f(\anctok , \destok ) = \sum _{(\srcdom ,\srcfrag ),(\destdom ,\destfrag )} W_{(\srcdom ,\srcfrag ),(\destdom ,\destfrag )}\, \phi ^\ins _{\srcdom \srcfrag }(\anctok )\, \phi ^\ins _{\destdom \destfrag }(\destok ) \end {equation} where the structural weights $W_{(\srcdom ,\srcfrag ),(\destdom ,\destfrag )} = L_{(\srcdom ,\srcfrag )}\, \effT _{(\srcdom ,\srcfrag )(\destdom ,\destfrag )}\, R_{(\destdom ,\destfrag )}$ are character-independent, with $L_{(\dom ,\frag )} = \sum _i \pi _i \effT ^\ast _{i,(\dom ,\frag )}$ and $R_{(\dom ,\frag )} = \sum _j \effT ^\ast _{(\dom ,\frag ),j}$. Substituting the class-mixture emission yields a bilinear sum over both the (domain, fragment-type) latent and the site-class latent: \[ f(\anctok , \destok ) = \sum _{\srcdom ,\srcfrag ,\destdom ,\destfrag ,\class _1,\class _2} W_{(\srcdom ,\srcfrag ),(\destdom ,\destfrag )} \classdist _{\srcdom \srcfrag \class _1} \classdist _{\destdom \destfrag \class _2}\, \eqm ^{(\class _1)}_\anctok \, \eqm ^{(\class _2)}_\destok . \]

The order-1 HMM transition is therefore \[ P(\destok \mid \anctok ) = \frac {f(\anctok , \destok )}{\sum _{\destok '} f(\anctok , \destok ')} = \frac { \sum _{\class _2} \big (\sum _{\srcdom ,\srcfrag ,\destdom ,\destfrag ,\class _1} W \classdist \classdist \, \eqm ^{(\class _1)}_\anctok \big ) \eqm ^{(\class _2)}_\destok }{ \sum _{\srcdom ,\srcfrag ,\class _1} \eqm ^{(\class _1)}_\anctok \sum _{\destdom ,\destfrag ,\class _2} W_{(\srcdom ,\srcfrag ),(\destdom ,\destfrag )} \classdist _{\srcdom \srcfrag \class _1} \classdist _{\destdom \destfrag \class _2} }. \]
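Because the structural weights are character-independent, the six-index sum collapses to two matrix products once the class mixture has been folded into the emissions. The sketch below assumes $W$ is given as a dense matrix over flattened (domain, fragment-type) states and the class weights as one row per state; the names are illustrative.

```python
import numpy as np

def singlet_adjacency(W, class_w, eqm):
    """f[a, b] = sum_{s, s'} W[s, s'] phi[s, a] phi[s', b].

    W: (S, S) structural weights with S = D*F; class_w: (S, K) site-class
    weights per state; eqm: (K, A) per-class equilibria.
    """
    phi = np.asarray(class_w) @ np.asarray(eqm)   # (S, A) singlet emissions
    return phi.T @ np.asarray(W) @ phi            # (A, A)

def order1_transition(f):
    """Row-normalise the adjacency matrix to obtain P(b | a)."""
    return f / f.sum(axis=1, keepdims=True)
```

If all classes share the same equilibrium, every row of `order1_transition(f)` equals that equilibrium and the transitions no longer depend on the previous character.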

Remark C.4 (Non-trivial character correlations). When the per-class equilibria $\eqm ^{(\class )}$ differ, the previous character $\anctok $ carries information about the latent class $\class _1$ of the source state via the Bayesian posterior $P(\class _1 \mid \anctok ) \propto \eqm ^{(\class _1)}_\anctok \, \sum _{\srcdom \srcfrag ,\destdom \destfrag ,\class _2} W \classdist \classdist $, and consequently about which (domain, fragment-type) is likely to be generating this region. The order-1 HMM therefore has genuinely $\anctok $-dependent transitions: a mixture of $\eqm ^{(\class _2)}$ tilted by the joint posterior over latent state and source class.

Remark C.5 (Closed form). All quantities in (C.7) are rational functions of the model parameters: the $\ndom \nfrag \times \ndom \nfrag $ matrix $(I - \effT )^{-1}$ has entries that are ratios of polynomials in $(\ext ^{(\dom )}_{\srcfrag \destfrag }, \kappa _\dom , \kappa _\main , \beta _\main , \fragdist _{\dom \frag }, \domdist _\dom )$. The class mixture enters linearly through $\classdist _{\dom \frag \class }$, and the per-class GTRs appear only inside $\eqm ^{(\class )}, P^{(\class )}(\evoltime )$.

C.5.4 Pair HMM Distillation

Emitting state space The collapsed MixDom Pair HMM with $\ndom $ domain types and $\nfrag $ fragment types per domain has $5\ndom \nfrag + 2$ states: the two non-emitting states $\{ \sta \sta , \fin \fin \}$ and the $5\ndom \nfrag $ emitting states $\{ \mat \mat _{\dom \frag }, \mat \ins _{\dom \frag }, \mat \del _{\dom \frag }, \ins \ins _{\dom \frag }, \del \del _{\dom \frag } : \dom \in \{1,\ldots ,\ndom \}, \frag \in \{1,\ldots ,\nfrag \}\}$.

Group the $5\ndom \nfrag $ emitting states by emission type: match $\mathcal {M} = \{\mat \mat _{\dom \frag }\}$, insert $\mathcal {I} = \{\mat \ins _{\dom \frag }, \ins \ins _{\dom \frag }\}$, delete $\mathcal {D} = \{\mat \del _{\dom \frag }, \del \del _{\dom \frag }\}$.

Class-mixture emissions in the Pair HMM The per-state emissions at $(d,f)$ are the class-mixture emissions $\phi ^{\mat ,\ins ,\del }_{\dom \frag }$ defined in Section C.5.2. Two states with the same $(d,f)$ but different top-level types share the same equilibrium mixture, but the match state combines this with the time-evolved class-mixture transition kernel.

The per-(domain, fragment-type) emission factors do not lift cleanly out of the structural sum; however, they have low rank in the class index $\class $ (Section C.5.2). Within a fixed $(d,f)$, all match states emit from $\phi ^{\mat }_{\dom \frag }$ and all insert/delete singlets from $\phi ^{\ins }_{\dom \frag }$ / $\phi ^{\del }_{\dom \frag }$.

Per-(domain, fragment)-pair structural weights The pair adjacency frequencies decompose as sums over (domain, fragment-type) pairs. For match-to-match, writing $X = \anctok , Y = \destok , X' = \anctok ', Y' = \destok '$: \begin {equation} \label {eq:pair-adj} f^{\mathcal {M}\mathcal {M}}(X,Y,X',Y') = \sum _{(\dom _1,\frag _1),(\dom _2,\frag _2)} C^{\mathcal {M}\mathcal {M}}_{(\dom _1,\frag _1),(\dom _2,\frag _2)}\, \phi ^{\mat }_{\dom _1\frag _1}(X,Y)\, \phi ^{\mat }_{\dom _2\frag _2}(X',Y') \end {equation} where \[ C^{\mathcal {M}\mathcal {M}}_{(\dom _1,\frag _1),(\dom _2,\frag _2)} = \sum _{s_1 \in \mathcal {M}_{\dom _1\frag _1}} \sum _{s_2 \in \mathcal {M}_{\dom _2\frag _2}} L_{s_1}\, \effT _{s_1 s_2}\, R_{s_2} \] are structural weights indexed by (domain, fragment-type) pairs (not just emission types). Similarly for other adjacency types ($\mathcal {M}\mathcal {I}$, $\mathcal {M}\mathcal {D}$, etc.), with the appropriate per-(domain, fragment-type) emissions.

Each adjacency frequency is therefore a sum of $(\ndom \nfrag )^2$ structural-weight terms, each multiplied by per-(domain, fragment-type) emission factors that are themselves rank-$\nclasses $ class mixtures.

Context dependence In the order-1 WFST, the wait-after-match/insert transition involves \begin {align*} &f^{\cdot \mathcal {M}}(X,Y,\anctok ',\destok ') = \sum _{(\dom _1,\frag _1),(\dom _2,\frag _2)} \Big ( C^{\mathcal {M}\mathcal {M}}_{(\dom _1,\frag _1),(\dom _2,\frag _2)}\, \phi ^{\mat }_{\dom _1\frag _1}(X,Y) \\ &\phantom {= \sum _{(\dom _1,\frag _1),(\dom _2,\frag _2)} \Big (} + C^{\mathcal {I}\mathcal {M}}_{(\dom _1,\frag _1),(\dom _2,\frag _2)}\, \phi ^{\ins }_{\dom _1\frag _1}(Y) \Big ) \phi ^{\mat }_{\dom _2\frag _2}(\anctok ',\destok ') \end {align*}

After normalization, $p_{\waitm \mat }(X,Y,\anctok ',\destok ')$ depends on $(X,Y)$ through the joint posterior over (domain, fragment-type, class) of the previous emission. Specifically, the context enters through the $\ndom \nfrag $-dimensional vector \[ \boldsymbol {\rho }(X,Y) = \Big ( \sum _{(\dom _2,\frag _2)} C^{\mathcal {M}\mathcal {M}}_{(1,1),(\dom _2,\frag _2)}\, \phi ^{\mat }_{1,1}(X,Y) + \sum _{(\dom _2,\frag _2)} C^{\mathcal {I}\mathcal {M}}_{(1,1),(\dom _2,\frag _2)}\, \phi ^{\ins }_{1,1}(Y), \ldots \Big ) \] which captures the relative likelihood of each source (domain, fragment-type).

Remark C.6 (Richer context than per-domain shared-$\eqm $ case). With class-mixture emissions, the WFST transition probabilities depend on context $(X,Y)$ through a joint (domain, fragment-type, class) posterior that is sensitive to both $X$ and $Y$ independently—not just through a single scalar ratio. The right-side character dependence ($\anctok ', \destok '$) also varies by (domain, fragment-type, class): the next emission is a mixture of $\eqm ^{(\class _2)} P^{(\class _2)}(\evoltime )$ weighted by the structural weights, the source posterior, and the destination $\classdist $. Despite this richer structure, all quantities are closed-form rational functions of the model parameters (composed with $\nclasses $ rate-matrix exponentials).

C.5.5 Block Structure and Matrix Inversions

Top-level null closure (always $3 \times 3$) The null states $\mnull , \inull , \dnull $ yield a $3 \times 3$ submatrix with entries depending on \begin {align*} \emptyseg _0 &= \sum _\dom \domdist _\dom (1 - \kappa _\dom ) \\ \emptyseg _\evoltime &= \sum _\dom \domdist _\dom (1 - \kappa _\dom )(1 - \beta _\dom ) \end {align*}

The null closure is a $3 \times 3$ inversion with closed-form determinant, independent of $\ndom , \nfrag , \nclasses $ (the dependence on $\ndom $ enters only through these two scalar sums; $\nfrag $ and $\nclasses $ do not appear at all because empty domains are independent of fragment and class structure inside the domain).

Block-diagonal-plus-low-rank decomposition The $5\ndom \nfrag \times 5\ndom \nfrag $ effective transition matrix between emitting states decomposes as \[ \effT _{\emit ,\emit } = \underbrace {\mathrm {diag}(D_1, \ldots , D_\ndom )}_{\text {intra-domain}} + \underbrace {E\,\nonemptytrans _\bullet \, S^\top }_{\text {inter-domain (rank } \leq 3\text {)}} \] The factors are as follows. Each $D_\dom $ is a $5\nfrag \times 5\nfrag $ within-domain block that combines the intra-fragment Markov ext matrix $\ext ^{(\dom )}_{\srcfrag \destfrag }$ (acting on the fragment-type axis) with the same-domain new-fragment TKF92 transitions ($\notext ^{(\dom )}_\srcfrag \tkftrans ^{(\dom )}_{\xstate \ystate } \fragdist _{\dom ,\destfrag }$ for M-type entries, $\notext ^{(\dom )}_\srcfrag \kappa _\dom \fragdist _{\dom ,\destfrag }$ for I/D-type entries). $E \in \mathbb {R}^{5\ndom \nfrag \times 3}$ stacks per-(domain,fragment-type) exit vectors projected to top-level types $\{\mat ,\ins ,\del \}$ via $(\notext ^{(\dom )}_\srcfrag \, \notkappa _\dom )$ for the I/D singlet rows and $(\notext ^{(\dom )}_\srcfrag \, \tkftrans ^{(\dom )}_{\xstate \fin })$ for the matched-domain rows. $S \in \mathbb {R}^{5\ndom \nfrag \times 3}$ stacks per-(domain,fragment-type) start vectors with $(1-\emptyseg _\evoltime )^{-1}\domdist _\destdom \, \tkftrans ^{(\destdom )}_{\sta \ystate }\fragdist _{\destdom \destfrag }$ for M-type entries and $(1-\emptyseg _0)^{-1}\domdist _\destdom \, \kappa _\destdom \, \fragdist _{\destdom \destfrag }$ for I/D-type entries (these $1/(1-\emptyseg )$ factors are the domenter normalisation that conditions on the destination domain being non-empty). Finally, $\nonemptytrans _\bullet $ is the relevant $5 \times 5$ submatrix of the null-eliminated top-level matrix; since $E$ and $S$ each have three columns, only its $3 \times 3$ restriction to the top-level types $\{\mat ,\ins ,\del \}$ enters the product.

Note that $D_{\dom _1} \neq D_{\dom _2}$ in general (different $\ext ^{(\dom )}_{\srcfrag \destfrag }, \fragdist _{\dom \frag }, \insrate _\dom , \delrate _\dom $). For $\nfrag > 1$, $D_\dom $ has no further emission-type block-diagonal sub-structure within the matched-domain block (the intra-fragment Markov ext couples $\{\mat \mat , \mat \ins , \mat \del \}$ across fragment-types); however, it admits a Kronecker-plus-low-rank decomposition that preserves the linear-in-$\ndom \nfrag $ scaling (Section C.5.6). The diagonal-ext baseline (--freeze-offdiag-ext in the fitter) recovers the diagonal-extension block structure as a special case.

Remark C.7 (Domenter normalization). The $\transnest $ matrix (Section C.1.1, Equation C.2) includes factors $(1-\emptyseg _\evoltime )^{-1}$ for M-type destinations and $(1-\emptyseg _0)^{-1}$ for I/D-type destinations, conditioning domain entry on the domain being non-empty. These factors cancel the corresponding $(1-\emptyseg _\evoltime )$ and $(1-\emptyseg _0)$ column scaling in $\nonemptytrans _\bullet $ (which arises from the $\exptrans $ null-elimination Schur complement). In the distillation, the start vectors $\mathbf {s}_{\dom ,\frag }$ carry the domenter factors while $\nonemptytrans _\bullet $ retains the column factors; the product $\nonemptytrans _\bullet \cdot \mathbf {s}_{\dom ,\frag }$ thus reproduces the correct $\transnest $ entries.

Woodbury identity Writing $G_\dom = (I-D_\dom )^{-1}$ (each $5\nfrag \times 5\nfrag $), $E = (\mathbf {e}_{\dom ,\frag })$ (exit vectors, $5\ndom \nfrag \times 3$), $S = (\mathbf {s}_{\dom ,\frag })$ (start vectors, $5\ndom \nfrag \times 3$), and $\nonemptytrans _\text {mid}$ for the $3\times 3$ subblock of $\nonemptytrans _\bullet $ restricted to $\{\mat ,\ins ,\del \}$ (the only top-level types that couple across domains), the path-sum matrix is \[ (I - \effT )^{-1} = G + G\,E\,\nonemptytrans _\text {mid} \bigl (I_3 - S^\top G\,E\,\nonemptytrans _\text {mid}\bigr )^{-1} S^\top G \] using the push-through form of the Woodbury identity, where $G = \mathrm {diag}(G_1, \ldots , G_\ndom )$. This avoids inverting $\nonemptytrans _\text {mid}$, which is singular (the TKF91 $\mat $ and $\ins $ rows are identical, so $\nonemptytrans _\text {mid}$ has rank 2).
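The push-through form can be checked numerically. The following is a minimal sketch with random stand-in blocks, not the actual MixDom matrices: `blk` stands in for $5\nfrag $, and `N_mid` is made singular on purpose (two identical rows) to show that the formula never needs its inverse. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dom, blk, r = 3, 5, 3          # hypothetical sizes; blk ~ 5 * nfrag
n = n_dom * blk

# Small random within-domain blocks, exit vectors E and start vectors S.
D_blocks = [0.1 * rng.random((blk, blk)) for _ in range(n_dom)]
E = 0.1 * rng.random((n, r))
S = 0.1 * rng.random((n, r))

# Singular middle matrix: rows 0 and 1 identical, hence rank 2.
N_mid = rng.random((r, r))
N_mid[1] = N_mid[0]

# Block-diagonal D and G = diag(G_1, ..., G_ndom), one small inverse per domain.
D = np.zeros((n, n))
G = np.zeros((n, n))
for d, Dd in enumerate(D_blocks):
    sl = slice(d * blk, (d + 1) * blk)
    D[sl, sl] = Dd
    G[sl, sl] = np.linalg.inv(np.eye(blk) - Dd)

# Push-through Woodbury: only a 3x3 kernel is inverted for the coupling.
K = np.linalg.inv(np.eye(r) - S.T @ G @ E @ N_mid)
path_sum = G + G @ E @ N_mid @ K @ S.T @ G

# Reference: direct inversion of the full matrix.
T_eff = D + E @ N_mid @ S.T
direct = np.linalg.inv(np.eye(n) - T_eff)
assert np.allclose(path_sum, direct)
```

The ordinary Woodbury form would require `np.linalg.inv(N_mid)`, which does not exist here; the push-through form sidesteps it.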

Proposition C.1 (Closed-form distillation for MixDom). All adjacency frequencies, and hence all order-1 HMM and WFST parameters, are closed-form rational functions of the model parameters (composed with the matrix exponentials $\exp (\revsub ^{(\class )}\evoltime )$), for any finite $\ndom , \nfrag , \nclasses $. The computation requires:

1.: A $3 \times 3$ inversion for the top-level null closure.
2.: $\ndom $ within-domain inversions $(I - D_\dom )^{-1}$, each of size $5\nfrag \times 5\nfrag $. By the Kronecker-plus-rank-3 decomposition of $D_\dom $ (Section C.5.6), this reduces to one $\nfrag \times \nfrag $ inversion of $(I - \ext ^{(\dom )})$ plus a $3 \times 3$ inner Woodbury kernel per domain. For $\nfrag = 1$ the whole within-domain block collapses to a $3 \times 3$ adjugate plus two scalar inversions (Section C.5.7).
3.: One $3 \times 3$ outer Woodbury correction for inter-domain coupling.
4.: Summation of $(\ndom \nfrag )^2$ (domain, fragment-type)-pair terms per adjacency entry, with each term involving a class-mixture emission factor of length $\nclasses $.

Every step is a rational function of the model parameters: no numerical iteration is required to evaluate the closed form.

Remark C.8 (Contrast with shared-emission case). If all $(d,f)$ shared a single class mixture (e.g. $\classdist _{\dom \frag \class } = u_\class $ uniform), the adjacency frequencies would still factor as in (C.8), but the emission tensor would have rank-1 structure across $(d,f)$, collapsing the WFST context dependence. With per-(domain, fragment-type) class mixtures, the structural constants are full $(\ndom \nfrag ) \times (\ndom \nfrag )$ matrices indexed by (domain, fragment-type) pairs, the emission tensors are class-rank-$\nclasses $, and the context dependence is genuinely multi-dimensional.

C.5.6 Within-Domain Inversion: closed form

For general $\nfrag $, the within-domain block $D_\dom $ has a Kronecker-plus-rank-3 structure that yields a closed-form inverse without ever inverting a $5\nfrag \times 5\nfrag $ matrix directly.

Kronecker-plus-rank-3 decomposition of the matched-domain block The matched-domain inner block (acting on $\{\mat \mat , \mat \ins , \mat \del \}$ at each fragment-type, total $3\nfrag $ states) combines intra-fragment Markov continuation with fragment termination plus a TKF92 same-domain new-fragment branch: \begin {equation} \label {eq:mixdom-D-match} D_\dom ^{\text {match}}[(\xstate , \frag ), (\ystate , \destfrag )] = \delta _{\xstate \ystate }\, \ext ^{(\dom )}_{\frag \destfrag } + \notext ^{(\dom )}_\frag \cdot \tkftrans ^{(\dom )}_{\xstate \ystate } \cdot \fragdist _{\dom ,\destfrag }. \end {equation} The first term is $I_{3} \otimes \ext ^{(\dom )}$ (the intra-fragment Markov chain, identical at every emission type); the second is $\tkftrans ^{(\dom )}_{3\times 3} \otimes (\notext ^{(\dom )}\, \fragdist _\dom ^\top )$ where $\notext ^{(\dom )}\, \fragdist _\dom ^\top $ is the rank-1 $\nfrag \times \nfrag $ outer product of fragment-termination probabilities and fragment-entry weights. The second term therefore has rank $\leq 3$ in the joint $(\xstate ,\frag )$ space (rank at most 3 from $\tkftrans ^{(\dom )}_{3\times 3}$, times rank 1 in the fragment slot from the outer product). In matrix form, \[ D_\dom ^{\text {match}} = I_{3} \otimes \ext ^{(\dom )} + U_\dom \, V_\dom ^\top , \quad U_\dom = \tkftrans ^{(\dom )}_{3\times 3} \otimes \notext ^{(\dom )}, \quad V_\dom = I_{3} \otimes \fragdist _\dom , \] where $U_\dom $ and $V_\dom $ are $3\nfrag \times 3$ matrices (treating $\notext ^{(\dom )}$ and $\fragdist _\dom $ as column vectors).

Inverse via inner Woodbury Since $I_{3\nfrag } = I_3 \otimes I_\nfrag $, we have $I - I_3 \otimes \ext ^{(\dom )} = I_3 \otimes (I_\nfrag - \ext ^{(\dom )})$, which inverts blockwise: $(I_3 \otimes (I_\nfrag - \ext ^{(\dom )}))^{-1} = I_3 \otimes (I_\nfrag - \ext ^{(\dom )})^{-1}$. Note that $(I_\nfrag - \ext ^{(\dom )})^{-1}$ is a single $\nfrag \times \nfrag $ inversion per domain. Applying the Sherman–Morrison–Woodbury identity to the rank-3 correction gives \begin {equation} \label {eq:mixdom-Dmatch-inverse} \begin {aligned} (I - D_\dom ^{\text {match}})^{-1} &= G_0 + G_0\, U_\dom \, K_\dom ^{\text {inner}}\, V_\dom ^\top G_0, \\ G_0 &\equiv I_{3} \otimes (I_\nfrag - \ext ^{(\dom )})^{-1}, \\ K_\dom ^{\text {inner}} &\equiv (I_3 - V_\dom ^\top G_0\, U_\dom )^{-1}. \end {aligned} \end {equation} $K_\dom ^{\text {inner}}$ is a $3 \times 3$ matrix whose entries are rational functions of $\fragdist _\dom $, $\ext ^{(\dom )}$, $\tkftrans ^{(\dom )}$: by the mixed-product property of the Kronecker product, $V_\dom ^\top G_0\, U_\dom = \bigl (\fragdist _\dom ^\top (I - \ext ^{(\dom )})^{-1} \notext ^{(\dom )}\bigr )\, \tkftrans ^{(\dom )}_{3\times 3}$, i.e. a single $\nfrag $-fold inner product times $\tkftrans ^{(\dom )}_{3\times 3}$.
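A small numerical sketch of the inner Woodbury inverse follows, using NumPy's Kronecker product. The extension matrix, fragment-entry distribution and TKF parameter values are arbitrary stand-ins chosen only so that every inverse exists; the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
F = 4                                           # hypothetical nfrag

# Sub-stochastic extension matrix; notext is the per-type termination probability.
ext = 0.6 * rng.dirichlet(np.ones(F), size=F)   # each row sums to 0.6
notext = 1.0 - ext.sum(axis=1)
fragdist = rng.dirichlet(np.ones(F))            # fragment-entry weights

# 3x3 TKF submatrix over (M, I, D) with identical M and I rows (arbitrary values).
alpha, beta, gamma, kappa = 0.7, 0.2, 0.3, 0.8
row_mi = [(1 - beta) * kappa * alpha, beta, (1 - beta) * kappa * (1 - alpha)]
row_d = [(1 - gamma) * kappa * alpha, gamma, (1 - gamma) * kappa * (1 - alpha)]
tau = np.array([row_mi, row_mi, row_d])

# Matched-domain block: I_3 (x) ext + tau (x) (notext fragdist^T).
D_match = np.kron(np.eye(3), ext) + np.kron(tau, np.outer(notext, fragdist))

# Inner Woodbury: one F x F inverse plus one 3 x 3 kernel.
B = np.linalg.inv(np.eye(F) - ext)
G0 = np.kron(np.eye(3), B)
U = np.kron(tau, notext[:, None])               # shape (3F, 3)
V = np.kron(np.eye(3), fragdist[:, None])       # shape (3F, 3)
K_inner = np.linalg.inv(np.eye(3) - V.T @ G0 @ U)
inv_woodbury = G0 + G0 @ U @ K_inner @ V.T @ G0

inv_direct = np.linalg.inv(np.eye(3 * F) - D_match)
assert np.allclose(inv_woodbury, inv_direct)
```

Ordering matters: with the emission type as the slow index, `np.kron(np.eye(3), ext)` reproduces the $(\xstate ,\frag )$ layout used in the block definition.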

The two singlet-domain blocks act on $\nfrag $ states each ($\{\ins \ins _{\dom \frag } : \frag \}$ and $\{\del \del _{\dom \frag } : \frag \}$): \[ d_\dom ^{\text {ins}}[\frag ,\destfrag ] = d_\dom ^{\text {del}}[\frag ,\destfrag ] = \ext ^{(\dom )}_{\frag \destfrag } + \notext ^{(\dom )}_\frag \, \kappa _\dom \, \fragdist _{\dom ,\destfrag }. \] Their inverse follows the same pattern: rank-1 correction to $(I - \ext ^{(\dom )})$, giving \[ (I - d_\dom ^{\text {ins}})^{-1} = (I - \ext ^{(\dom )})^{-1} + (I - \ext ^{(\dom )})^{-1} \notext ^{(\dom )}\, k_\dom ^{\text {ins}}\, \fragdist _\dom ^\top (I - \ext ^{(\dom )})^{-1}, \] with scalar inner kernel $k_\dom ^{\text {ins}} = \kappa _\dom \, [1 - \kappa _\dom \fragdist _\dom ^\top (I - \ext ^{(\dom )})^{-1} \notext ^{(\dom )}]^{-1}$, and identically for delete.
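The singlet-block inverse is a Sherman–Morrison (rank-1) update, which can be sketched numerically as below. The inputs are random stand-ins and the names (`ext`, `notext`, `fragdist`, `k_ins`) are hypothetical labels for the symbols in the formula.

```python
import numpy as np

rng = np.random.default_rng(3)
F, kappa = 4, 0.8                               # hypothetical nfrag and kappa

ext = 0.6 * rng.dirichlet(np.ones(F), size=F)   # sub-stochastic extension matrix
notext = 1.0 - ext.sum(axis=1)                  # termination probabilities
fragdist = rng.dirichlet(np.ones(F))            # fragment-entry weights

# Singlet block: extension plus rank-1 "terminate, then start a new fragment".
d_ins = ext + kappa * np.outer(notext, fragdist)

# Rank-1 update of B = (I - ext)^{-1} with the scalar inner kernel k_ins.
B = np.linalg.inv(np.eye(F) - ext)
k_ins = kappa / (1.0 - kappa * (fragdist @ B @ notext))
inv_rank1 = B + k_ins * np.outer(B @ notext, fragdist @ B)

inv_direct = np.linalg.inv(np.eye(F) - d_ins)
assert np.allclose(inv_rank1, inv_direct)
```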

Every entry of $(I - D_\dom )^{-1}$ is therefore a rational function of $(\ext ^{(\dom )}, \fragdist _\dom , \alpha _\dom , \beta _\dom , \gamma _\dom , \kappa _\dom )$, computed at cost $O(\nfrag ^3)$ per domain (the inversion of $I_\nfrag - \ext ^{(\dom )}$) plus $O(\nfrag ^2)$ for the singlet blocks and $O(1)$ for the inner $3 \times 3$ Woodbury kernel. No numerical iteration is required: the inversion can be carried out analytically for small $\nfrag $ (by Cramer’s rule on $I_\nfrag - \ext ^{(\dom )}$) or by a direct, non-iterative linear solve for larger $\nfrag $.

Remark C.9 (Compactness). For small $\nfrag $ the inverse $(I_\nfrag - \ext ^{(\dom )})^{-1}$ has a compact symbolic form (e.g. at $\nfrag = 2$ it is a $2 \times 2$ adjugate over a scalar determinant). The further reductions of the matched-domain block via (C.10) and the singlet blocks above involve only $3 \times 3$ and scalar inner inversions, so the overall within-domain inverse can be written down symbolically without ever exceeding a $\nfrag \times \nfrag $ inversion. The $\nfrag = 1$ scalar-extension special case collapses $(I_\nfrag - \ext ^{(\dom )})^{-1}$ to the scalar $1/(1 - \ext _\dom )$ and recovers the more compact $3 \times 3$ adjugate of Section C.5.7 below.

C.5.7 Within-Domain Inversion: $\nfrag = 1$ closed form

When $\nfrag = 1$ (scalar self-extension $\ext _\dom \equiv \ext ^{(\dom )}_{11}$, no off-diagonal Markov coupling), $D_\dom $ collapses to $5 \times 5$ with the additional block structure described below; we record this special case because it admits a particularly compact algebraic form.

Block decomposition: $3 \times 3 + 1 + 1$ In the $\nfrag = 1$ limit, the five emitting state types $\{\mat \mat , \mat \ins , \mat \del , \ins \ins , \del \del \}$ cannot transition between top-level types within a domain: if the domain is inserted (top-level $\ins $), all emissions stay in $\ins \ins $ until the domain ends; similarly for deleted domains ($\del \del $). Therefore $D_\dom $ is block-diagonal: \[ D_\dom = \begin {pmatrix} D_\dom ^{\text {match}} & 0 & 0 \\ 0 & d_\dom ^{\text {ins}} & 0 \\ 0 & 0 & d_\dom ^{\text {del}} \end {pmatrix} \] where $D_\dom ^{\text {match}}$ is $3 \times 3$ (for matched-domain states $\mat \mat , \mat \ins , \mat \del $) and the singlet-domain states have scalar self-loops: \[ d_\dom ^{\text {ins}} = d_\dom ^{\text {del}} = \ext _\dom + (1 - \ext _\dom )\kappa _\dom \]

The scalar inversions are trivial: $(1 - d_\dom ^{\text {ins}})^{-1} = [(1-\ext _\dom )(1-\kappa _\dom )]^{-1}$.

The $3\times 3$ matched-domain block The matched-domain block combines fragment extension (self-loop at rate $\ext _\dom $) with intra-domain TKF transitions: \[ D_\dom ^{\text {match}} = \ext _\dom I_3 + (1 - \ext _\dom )\, \tkftrans _{\mat \ins \del ,\mat \ins \del }^{(\dom )} \] where $\tkftrans _{\mat \ins \del ,\mat \ins \del }^{(\dom )}$ is the $3 \times 3$ submatrix of $\tkftrans ^{(\dom )}$ restricted to rows and columns $\mat , \ins , \del $.

A key property: in the TKF transition matrix, rows $\mat $ and $\ins $ are identical. Defining shorthand (suppressing domain subscript $\dom $): \begin {align*} \mathfrak {a} &= (1-\beta )\kappa \alpha , & \mathfrak {b} &= \beta , & \mathfrak {c} &= (1-\beta )\kappa (1-\alpha ) \\ \mathfrak {d} &= (1-\gamma )\kappa \alpha , & \mathfrak {g} &= \gamma , & \mathfrak {h} &= (1-\gamma )\kappa (1-\alpha ) \end {align*}

the $3 \times 3$ TKF submatrix is \[ \tkftrans _{\mat \ins \del ,\mat \ins \del }^{(\dom )} = \begin {pmatrix} \mathfrak {a} & \mathfrak {b} & \mathfrak {c} \\ \mathfrak {a} & \mathfrak {b} & \mathfrak {c} \\ \mathfrak {d} & \mathfrak {g} & \mathfrak {h} \end {pmatrix} \] with row sums $\mathfrak {a}+\mathfrak {b}+\mathfrak {c} = (1-\beta )\kappa + \beta $ for the first two rows and $\mathfrak {d}+\mathfrak {g}+\mathfrak {h} = (1-\gamma )\kappa + \gamma $ for the third.

Factoring out the extension rate Since $I - D^{\text {match}} = I - \ext I - (1-\ext )\tkftrans _{3\times 3} = (1-\ext )(I - \tkftrans _{3\times 3})$, the extension rate factors out as a scalar: \[ (I - D_\dom ^{\text {match}})^{-1} = \frac {1}{1-\ext _\dom }\, P_\dom ^{-1} \] where \[ P_\dom \equiv I - \tkftrans _{\mat \ins \del ,\mat \ins \del }^{(\dom )} = \begin {pmatrix} 1-\mathfrak {a} & -\mathfrak {b} & -\mathfrak {c} \\ -\mathfrak {a} & 1-\mathfrak {b} & -\mathfrak {c} \\ -\mathfrak {d} & -\mathfrak {g} & 1-\mathfrak {h} \end {pmatrix} \] with row sums $(1-\beta )(1-\kappa )$, $(1-\beta )(1-\kappa )$, $(1-\gamma )(1-\kappa )$ (the domain-exit probabilities from each inner state).

Determinant of $P_\dom $ Since rows 1 and 2 of $\tkftrans _{3\times 3}$ are identical, subtracting row 2 from row 1 in $P$ yields $(1, -1, 0)$. Expanding the determinant along this simplified row gives \begin {align*} \det (P_\dom ) &= (1-\mathfrak {b})(1-\mathfrak {h}) - \mathfrak {c}\mathfrak {g} - \mathfrak {a}(1-\mathfrak {h}) - \mathfrak {c}\mathfrak {d} \\ &= (1-\beta _\dom )(1-\kappa _\dom ) \end {align*}

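This factors cleanly. Grouping the terms as $(1-\mathfrak {a}-\mathfrak {b})(1-\mathfrak {h}) - \mathfrak {c}(\mathfrak {d}+\mathfrak {g})$ and substituting $1-\mathfrak {a}-\mathfrak {b} = (1-\beta )(1-\kappa \alpha )$, $\mathfrak {c} = (1-\beta )\kappa (1-\alpha )$, $1-\mathfrak {h} = 1-(1-\gamma )\kappa (1-\alpha )$ and $\mathfrak {d}+\mathfrak {g} = \gamma + (1-\gamma )\kappa \alpha $, all $\gamma $-dependent terms cancel and we have \[ \boxed {\det (P_\dom ) = (1-\beta _\dom )(1-\kappa _\dom )} \] The determinant is the product of the M/I-row no-insertion probability $(1-\beta )$ and the stationary emptiness probability $(1-\kappa )$; it coincides with the M/I-row sum of $P_\dom $, i.e. the domain-exit probability from the $\mat $ and $\ins $ inner states.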

Inverse of $P_\dom $ The adjugate $\text {adj}(P_\dom )$ has entries (again suppressing domain subscripts): \[ \text {adj}(P) = \begin {pmatrix} (1-\beta )(1-\kappa (1-\alpha )) & \beta + \kappa (1-\alpha )(\gamma -\beta ) & (1-\beta )\kappa (1-\alpha ) \\[4pt] (1-\beta )\kappa \alpha & 1 - (1-\beta )\kappa \alpha - (1-\gamma )\kappa (1-\alpha ) & (1-\beta )\kappa (1-\alpha ) \\[4pt] (1-\beta )\kappa \alpha & \gamma + \kappa \alpha (\beta -\gamma ) & (1-\beta )(1-\kappa \alpha ) \end {pmatrix} \] Note the symmetries: $\text {adj}(P)_{21} = \text {adj}(P)_{31}$ and $\text {adj}(P)_{13} = \text {adj}(P)_{23}$ (reflecting the identical rows of $\tkftrans _{3\times 3}$), while $\text {adj}(P)_{11} - \text {adj}(P)_{33} = (1-\beta )\kappa (2\alpha -1)$, which vanishes only when $\alpha = \tfrac {1}{2}$.

The full inverse is \[ P_\dom ^{-1} = \frac {\text {adj}(P_\dom )}{(1-\beta _\dom )(1-\kappa _\dom )} \] and therefore \[ (I - D_\dom ^{\text {match}})^{-1} = \frac {\text {adj}(P_\dom )}{(1-\ext _\dom )(1-\beta _\dom )(1-\kappa _\dom )} \]
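The determinant and adjugate above can be verified in exact rational arithmetic. The sketch below uses Python's standard `fractions` module with arbitrary parameter values; the helper names (`P_matrix`, `adj_closed_form`, `matmul3`) are hypothetical and introduced only for this check.

```python
from fractions import Fraction as Fr

def P_matrix(alpha, beta, gamma, kappa):
    """P = I - (3x3 TKF submatrix), built from the shorthand a, b, c, d, g, h."""
    a, b, c = (1 - beta) * kappa * alpha, beta, (1 - beta) * kappa * (1 - alpha)
    d, g, h = (1 - gamma) * kappa * alpha, gamma, (1 - gamma) * kappa * (1 - alpha)
    return [[1 - a, -b, -c], [-a, 1 - b, -c], [-d, -g, 1 - h]]

def adj_closed_form(alpha, beta, gamma, kappa):
    """Adjugate entries exactly as written in the text."""
    return [
        [(1 - beta) * (1 - kappa * (1 - alpha)),
         beta + kappa * (1 - alpha) * (gamma - beta),
         (1 - beta) * kappa * (1 - alpha)],
        [(1 - beta) * kappa * alpha,
         1 - (1 - beta) * kappa * alpha - (1 - gamma) * kappa * (1 - alpha),
         (1 - beta) * kappa * (1 - alpha)],
        [(1 - beta) * kappa * alpha,
         gamma + kappa * alpha * (beta - gamma),
         (1 - beta) * (1 - kappa * alpha)],
    ]

def matmul3(A, B):
    """Plain 3x3 matrix product on nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Arbitrary rational test values (not fitted parameters).
alpha, beta, gamma, kappa, ext = Fr(7, 10), Fr(1, 5), Fr(3, 10), Fr(4, 5), Fr(1, 2)
P = P_matrix(alpha, beta, gamma, kappa)
adj = adj_closed_form(alpha, beta, gamma, kappa)
det = (1 - beta) * (1 - kappa)

# P * adj(P) should equal det(P) * I exactly.
product = matmul3(P, adj)

# (I - D_match) = (1 - ext) * P, so its inverse is adj / ((1 - ext) * det).
I_minus_D = [[(1 - ext) * p for p in row] for row in P]
inverse = [[x / ((1 - ext) * det) for x in row] for row in adj]
check = matmul3(I_minus_D, inverse)
```

Because `Fraction` arithmetic is exact, agreement here is an identity check at the chosen point rather than a floating-point approximation.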

Remark C.10 (Compactness in the $\nfrag = 1$ limit). Each entry of $(I - D_\dom ^{\text {match}})^{-1}$ is a ratio of a polynomial with 1–3 terms (numerator, from the adjugate) over the product of three binomial factors (denominator $(1-\ext )(1-\beta )(1-\kappa )$). These are genuinely compact closed-form expressions in the TKF parameters $(\alpha _\dom , \beta _\dom , \gamma _\dom , \kappa _\dom , \ext _\dom )$. For $\nfrag > 1$ the inversion is still closed-form (a rational function in $\ext ^{(\dom )}, \fragdist _\dom , \alpha _\dom , \beta _\dom , \gamma _\dom , \kappa _\dom $), via the Kronecker-plus-rank-3 Woodbury decomposition of Section C.5.6: the only inversion of dimension exceeding $3 \times 3$ is a single $\nfrag \times \nfrag $ inverse $(I_\nfrag - \ext ^{(\dom )})^{-1}$ per domain, which can itself be written symbolically via the matrix adjugate.

C.5.8 Bilinear Factored Form of Adjacency Frequencies

The Woodbury identity gives the full path-sum matrix in a form that makes the adjacency computation practical for any $\ndom , \nfrag $.

Structure of the path-sum matrix Writing $G_\dom = (I - D_\dom )^{-1}$ ($5\nfrag \times 5\nfrag $ per domain), the Woodbury expansion gives, for states $s_1$ in $(\dom _1, \frag _1)$ and $s_2$ in $(\dom _2, \frag _2)$: \[ [(I - \effT )^{-1}]_{s_1 s_2} = \delta _{\dom _1\dom _2}\,[G_{\dom _1}]_{s_1 s_2} + [G_{\dom _1}\,\mathbf {e}_{\dom _1}]_{s_1}^\top \, \nonemptytrans _\text {mid}\,K\, [\mathbf {s}_{\dom _2}^\top G_{\dom _2}]_{s_2} \] where $K = \bigl (I_3 - \sum _\dom \mathbf {s}_\dom ^\top G_\dom \, \mathbf {e}_\dom \, \nonemptytrans _\text {mid}\bigr )^{-1}$ is the $3 \times 3$ Woodbury kernel (push-through form), computed once.

Bilinear form of structural weights The structural weight $C^{\alpha \beta }_{(\dom _1,\frag _1),(\dom _2,\frag _2)}$ (here $\alpha , \beta $ label emission types, not the TKF parameters) decomposes as a within-domain diagonal plus a cross-domain bilinear form: \[ C^{\alpha \beta }_{(\dom _1,\frag _1),(\dom _2,\frag _2)} = \delta _{\dom _1\dom _2}\, c^\alpha _{\dom _1\frag _1\frag _2} + \mathbf {a}^{\alpha \top }_{\dom _1\frag _1}\, M^{-1}\, \mathbf {b}^{\beta }_{\dom _2\frag _2} \] where $\mathbf {a}^\alpha _{\dom \frag }, \mathbf {b}^\beta _{\dom \frag } \in \mathbb {R}^3$ are per-(domain, fragment-type) vectors derived from $G_\dom $, $\mathbf {e}_\dom $, $\mathbf {s}_\dom $, and the emission-type projections; $M^{-1}$ denotes the $3 \times 3$ Woodbury middle factor ($\nonemptytrans _\text {mid}\,K$ in the expansion above); and $c^\alpha _{\dom \frag _1\frag _2}$ is the within-domain entry of $G_\dom $ between fragment-types $\frag _1$ and $\frag _2$ at emission type $\alpha $ (a $\nfrag \times \nfrag $ matrix per domain rather than a scalar).

Closed-form adjacency formula Substituting into the adjacency frequency (C.8) and using the per-(domain, fragment-type) class-mixture emissions $\phi ^\alpha _{\dom \frag }$ from Section C.5.2, the full adjacency table takes the form: \begin {equation} \label {eq:bilinear-adj} f^{\alpha \beta }(\text {chars}_L, \text {chars}_R) = \underbrace { \left [\sum _{\dom ,\frag } \mathbf {a}^\alpha _{\dom \frag }\, \phi ^\alpha _{\dom \frag }(\text {chars}_L)\right ]^\top M^{-1} \left [\sum _{\dom ,\frag } \mathbf {b}^\beta _{\dom \frag }\, \phi ^\beta _{\dom \frag }(\text {chars}_R)\right ] }_{\text {cross-domain: bilinear in two 3-vectors}} + \underbrace { \sum _\dom \sum _{\frag _1,\frag _2} c^\alpha _{\dom \frag _1\frag _2}\, \phi ^\alpha _{\dom \frag _1}(\text {chars}_L)\, \phi ^\beta _{\dom \frag _2}(\text {chars}_R) }_{\text {within-domain: $\nfrag \times \nfrag $ inner sum}} \end {equation}

This is an algebraic closed form: every quantity appearing ($G_\dom $ entries, $\mathbf {a}^\alpha _{\dom \frag }$, $\mathbf {b}^\beta _{\dom \frag }$, $c^\alpha _{\dom \frag _1\frag _2}$, $M^{-1}$, and $\phi ^\alpha _{\dom \frag }$) is a rational function of the model parameters (composed with the matrix exponentials $\exp (\revsub ^{(\class )}\evoltime )$). All inversions reduce to closed-form rational expressions: the only matrix inversion of dimension exceeding $3 \times 3$ in the entire distillation pipeline is $(I_\nfrag - \ext ^{(\dom )})^{-1}$ per domain (Section C.5.6), which is a single $\nfrag \times \nfrag $ adjugate-over-determinant. The Woodbury correction $M^{-1}$ is a single $3 \times 3$ inversion regardless of $\ndom , \nfrag , \nclasses $.

Remark C.11 (Practical computation for $\ndom = \nfrag = \nclasses = 3$). For $\ndom = 3, \nfrag = 3, \nclasses = 3$:

1.: Precompute $\nclasses = 3$ per-class transition kernels $\exp (\revsub ^{(\class )}\evoltime )$.
2.: Precompute 3 within-domain inverses $G_\dom $ (each $15 \times 15$, i.e. $5\nfrag = 15$).
3.: Precompute 9 pairs $(\mathbf {a}^\alpha _{\dom \frag }, \mathbf {b}^\beta _{\dom \frag })$ for each $(\alpha ,\beta )$ pair, plus the within-domain $\nfrag \times \nfrag $ matrices $c^\alpha _{\dom \frag _1\frag _2}$.
4.: Accumulate $\Sigma = \sum _\dom \mathbf {s}_\dom ^\top G_\dom \mathbf {e}_\dom $ and compute the push-through kernel $K$ (one $3 \times 3$ inversion).
5.: For each character entry: evaluate (C.11) by summing 9 scaled 3-vectors for the left and right factors of the cross-domain term, then a $3 \times 3$ bilinear product, plus a within-domain double sum over $(\frag _1, \frag _2)$.

The per-entry cost is $O(\ndom \nfrag + \nclasses )$ multiplications.
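The evaluation order in the steps above can be sketched as follows, with random stand-ins for every precomputed quantity. The arrays `a`, `b`, `M_inv`, `c`, `phi_L`, `phi_R` are hypothetical placeholders for $\mathbf {a}$, $\mathbf {b}$, $M^{-1}$, $c$ and the left and right emission factors at one fixed emission-type pair, and single characters are used on each side for brevity.

```python
import numpy as np

rng = np.random.default_rng(4)
n_dom, n_frag, n_chars = 3, 3, 4                 # sizes from the remark, small alphabet

# Stand-ins for precomputed quantities at one fixed emission-type pair.
a = rng.random((n_dom, n_frag, 3))               # left 3-vectors per (d, f)
b = rng.random((n_dom, n_frag, 3))               # right 3-vectors per (d, f)
M_inv = rng.random((3, 3))                       # 3x3 Woodbury middle factor
c = rng.random((n_dom, n_frag, n_frag))          # within-domain f1 x f2 matrices
phi_L = rng.random((n_dom, n_frag, n_chars))     # left emission factors
phi_R = rng.random((n_dom, n_frag, n_chars))     # right emission factors

# Cross-domain term: sum 9 scaled 3-vectors per side, then a 3x3 bilinear product.
left = np.einsum('dfk,dfx->kx', a, phi_L)        # shape (3, n_chars)
right = np.einsum('dfk,dfy->ky', b, phi_R)       # shape (3, n_chars)
cross = left.T @ M_inv @ right

# Within-domain term: double sum over (f1, f2) inside each domain.
within = np.einsum('dfg,dfx,dgy->xy', c, phi_L, phi_R)
f_fast = cross + within

# Reference: build the full structural-weight tensor C and contract directly.
C = np.einsum('dfk,kl,egl->dfeg', a, M_inv, b)
for d in range(n_dom):
    C[d, :, d, :] += c[d]
f_ref = np.einsum('dfeg,dfx,egy->xy', C, phi_L, phi_R)
assert np.allclose(f_fast, f_ref)
```

The factored route never materialises the $(\ndom \nfrag )^2$ structural-weight table, which is the point of the bilinear form.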

C.5.9 Full-Context Distillation: Passthrough Context for Insert and Delete

In a two-tape transducer, the state at any point should encode the most recent character on each tape:

After Match$(X,Y)$: context is $(X,Y)$ — both tapes updated.
After Insert emitting $Y'$: context is $(X, Y')$ — ancestor context $X$ unchanged (passthrough from prior Match or Delete), descendant updated to $Y'$.
After Delete consuming $X'$: context is $(X', Y)$ — descendant context $Y$ unchanged (passthrough from prior Match or Insert), ancestor updated to $X'$.

The adjacency frequencies in Section C.5 track only partial context for Insert and Delete states: $\phi ^{\ins }_{\dom \frag }(Y) = \sum _\class \classdist _{\dom \frag \class } \eqm ^{(\class )}_Y$ depends only on the descendant character, and $\phi ^{\del }_{\dom \frag }(X) = \sum _\class \classdist _{\dom \frag \class } \eqm ^{(\class )}_X$ depends only on the ancestor character. This loses information about which (domain, fragment-type, class) generated the passthrough context.

We now show that the full-context adjacency frequencies—with both $(X,Y)$ tracked in all states—can be computed in closed algebraic form, preserving the Woodbury structure.

Insert chain Green’s function Define the Insert chain as the sub-process restricted to Insert states $\mathcal {I} = \{\mat \ins _{\dom \frag }, \ins \ins _{\dom \frag } : \dom \in \{1,\ldots ,\ndom \}, \frag \in \{1,\ldots ,\nfrag \}\}$. The restricted transition matrix $T^{\mathcal {II}}_{\text {eff}}$ (Insert$\to $Insert transitions within $\effT $) has the same block-diagonal-plus-low-rank structure as the full $\effT $: \[ T^{\mathcal {II}}_{\text {eff}} = D^{II} + E^{II}\, \nonemptytrans ^{II}_{\text {mid}}\, {S^{II}}^\top \] where $D^{II} = \text {diag}(D^{II}_1, \ldots , D^{II}_\ndom )$ with each $D^{II}_\dom $ being the within-domain Insert self-loop block ($2\nfrag \times 2\nfrag $ for $\{\mat \ins _{\dom \frag }, \ins \ins _{\dom \frag }\}$ over fragment-types), and the cross-domain coupling has rank $\leq 2$ through the Insert rows of $\nonemptytrans _\bullet $.

The Insert chain Green’s function is therefore \[ G^{II} = (I - T^{\mathcal {II}}_{\text {eff}})^{-1} \] computable via Woodbury with a kernel of dimension $\leq 2$. Each within-domain inverse $(I - D^{II}_\dom )^{-1}$ is a $2\nfrag \times 2\nfrag $ inversion — simpler than the $5\nfrag \times 5\nfrag $ within-domain block of the full system.

An analogous Delete chain Green’s function $G^{DD} = (I - T^{\mathcal {DD}}_{\text {eff}})^{-1}$ handles the descendant passthrough through Delete states, with identical structure.

Ancestor-conditioned structural weights To enter the Insert chain with ancestor context $X$, the process must have come from a prior Match or Delete state in some $(\dom _0, \frag _0)$ that emitted ancestor $X$. The match emission satisfies $\sum _Y \phi ^{\mat }_{\dom \frag }(X,Y) = \phi ^{\del }_{\dom \frag }(X)$ (row-stochastic per ancestor), so the entry weight from $(\dom _0, \frag _0)$ with ancestor $X$ is proportional to $\phi ^{\del }_{\dom _0\frag _0}(X)$ for both Match and Delete source states.
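The marginalisation identity used here can be illustrated numerically. The sketch assumes the match emission has the class-mixture form $\phi ^{\mat }(X,Y) = \sum _\class \classdist _\class \, \eqm ^{(\class )}_X P^{(\class )}_{XY}$ (the definition itself lives in Section C.5.2 and is assumed, not restated, here), and uses random row-stochastic matrices as stand-ins for the kernels $P^{(\class )}(\evoltime )$. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
n_classes, n_chars = 3, 4                        # hypothetical sizes

class_w = rng.dirichlet(np.ones(n_classes))                   # class weights at one (d, f)
eqm = rng.dirichlet(np.ones(n_chars), size=n_classes)         # per-class equilibria
# Stand-in transition kernels: shape (class, X, Y), each row sums to 1.
P = rng.dirichlet(np.ones(n_chars), size=(n_classes, n_chars))

# Assumed match emission: phi_mat[X, Y] = sum_c class_w[c] * eqm[c, X] * P[c, X, Y].
phi_mat = np.einsum('c,cx,cxy->xy', class_w, eqm, P)
# Delete emission: phi_del[X] = sum_c class_w[c] * eqm[c, X].
phi_del = class_w @ eqm

# Summing the match emission over the descendant recovers the delete emission.
assert np.allclose(phi_mat.sum(axis=1), phi_del)
```

Only row-stochasticity of each kernel is used, so the identity does not depend on the particular substitution model.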

Define the ancestor-conditioned left vector: \[ \tilde {L}^{\mathcal {I}}_{\dom _1\frag _1}(X) = \sum _{\dom _0,\frag _0} \underbrace {(L_{\mat \mat _{\dom _0\frag _0}} + L_{\mat \del _{\dom _0\frag _0}} + L_{\del \del _{\dom _0\frag _0}})}_{\lambda _{\dom _0\frag _0}} \cdot \; \phi ^{\del }_{\dom _0\frag _0}(X) \cdot \; \bigl [\effT _{\text {entry}}\, G^{II}\bigr ]_{(\dom _0,\frag _0),(\dom _1,\frag _1)} \] where $L_s$ are the standard left boundary weights from the full path-sum computation, $\lambda _{\dom _0\frag _0}$ collects all ancestor-emitting (Match and Delete) boundary contributions for $(\dom _0, \frag _0)$, $\effT _{\text {entry}}$ is the entry transition from ancestor-emitting states $\{\mat \mat , \mat \del , \del \del \}$ into the Insert chain (i.e. the $\{M,D\} \to \{I\}$ block of $\effT $), and $G^{II}_{(\dom _0,\frag _0),(\dom _1,\frag _1)}$ sums over all Insert-chain continuations.

Key observation: $\phi ^{\del }_{\dom _0\frag _0}(X)$ factors out multiplicatively from the structural sum, preserving the bilinear structure.

Full-context adjacency for Insert-sourced transitions The full-context adjacency for Insert$\to $Match is: \begin {equation} \label {eq:full-im} f^{\mathcal {I}\mathcal {M}}_{\text {full}}(X, Y, X', Y') = \sum _{(\dom _1,\frag _1),(\dom _2,\frag _2)} \tilde {L}^{\mathcal {I}}_{\dom _1\frag _1}(X) \cdot \eta ^{\mathcal {IM}}_{(\dom _1,\frag _1),(\dom _2,\frag _2)} \cdot \phi ^{\ins }_{\dom _1\frag _1}(Y) \cdot \phi ^{\mat }_{\dom _2\frag _2}(X', Y') \end {equation} where $\eta ^{\mathcal {IM}}_{(\dom _1,\frag _1),(\dom _2,\frag _2)} = T_{I_{\dom _1\frag _1}, M_{\dom _2\frag _2}} \cdot R_{M_{\dom _2\frag _2}}$ and $Y$ is the descendant character at the Insert, $X$ the passthrough ancestor.
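To make the index structure of the full-context Insert$\to $Match tensor concrete, the sum can be written as a single contraction. The arrays below are random stand-ins with hypothetical names (`L_tilde`, `eta`, `phi_ins`, `phi_mat`); `n_states` again plays the role of $\ndom \nfrag $.

```python
import numpy as np

rng = np.random.default_rng(6)
n_states, n_chars = 6, 4                              # hypothetical sizes

L_tilde = rng.random((n_states, n_chars))             # L~[s1, X]: ancestor-conditioned left vector
eta = rng.random((n_states, n_states))                # eta[s1, s2]: transition times right weight
phi_ins = rng.random((n_states, n_chars))             # phi_ins[s1, Y]
phi_mat = rng.random((n_states, n_chars, n_chars))    # phi_mat[s2, X', Y']

# f_full[X, Y, X', Y'] = sum_{s1, s2} L~[s1, X] eta[s1, s2] phi_ins[s1, Y] phi_mat[s2, X', Y'].
f_full = np.einsum('sx,st,sy,tuv->xyuv', L_tilde, eta, phi_ins, phi_mat)
```

The passthrough ancestor $X$ and the inserted descendant $Y$ both attach to the source state, which is why the table gains a fourth character dimension relative to the partial-context version.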

The bilinear factored form (cf. (C.11)) generalizes to: \begin {equation} \label {eq:bilinear-full} f^{\mathcal {I}\mathcal {M}}_{\text {full}}(X, Y, X', Y') = \left [\sum _{\dom ,\frag } \tilde {\mathbf {a}}^{\mathcal {I}}_{\dom \frag }(X)\, \phi ^{\ins }_{\dom \frag }(Y) \right ]^\top K_I \left [\sum _{\dom ,\frag } \mathbf {b}^{\mathcal {M}}_{\dom \frag }\, \phi ^{\mat }_{\dom \frag }(X', Y')\right ] + \sum _{\dom ,\frag _1,\frag _2} \text {(diagonal terms)} \end {equation} where $\tilde {\mathbf {a}}^{\mathcal {I}}_{\dom \frag }(X)$ are modified per-(domain, fragment-type) left vectors incorporating the ancestor context through $\tilde {L}^{\mathcal {I}}_{\dom \frag }(X)$, and $K_I$ is the Woodbury kernel for the Insert-chain correction.

Similarly, for Insert$\to $Insert with ancestor passthrough: \[ f^{\mathcal {I}\mathcal {I}}_{\text {full}}(X, Y, X, Y') = \sum _{(\dom _1,\frag _1),(\dom _2,\frag _2)} \tilde {L}^{\mathcal {I}}_{\dom _1\frag _1}(X) \cdot \eta ^{\mathcal {II}}_{(\dom _1,\frag _1),(\dom _2,\frag _2)} \cdot \phi ^{\ins }_{\dom _1\frag _1}(Y) \cdot \phi ^{\ins }_{\dom _2\frag _2}(Y') \] where the ancestor $X$ is preserved on both sides (3 effective character dimensions, not 4).

By symmetry, the Delete-sourced adjacencies with descendant passthrough are: \begin {equation} \label {eq:full-dm} f^{\mathcal {D}\mathcal {M}}_{\text {full}}(X, Y, X', Y') = \sum _{(\dom _1,\frag _1),(\dom _2,\frag _2)} \tilde {L}^{\mathcal {D}}_{\dom _1\frag _1}(Y) \cdot \eta ^{\mathcal {DM}}_{(\dom _1,\frag _1),(\dom _2,\frag _2)} \cdot \phi ^{\del }_{\dom _1\frag _1}(X) \cdot \phi ^{\mat }_{\dom _2\frag _2}(X', Y') \end {equation} where $\tilde {L}^{\mathcal {D}}_{\dom _1\frag _1}(Y)$ is the descendant-conditioned left vector for the Delete chain, defined analogously with $G^{DD}$.

Computational cost The affected adjacency tables gain one character dimension:

Tensor	Original	Full-context	Per-entry cost
$f^{\mathcal {M}\mathcal {M}}$	$|\alphabet |^4$	$|\alphabet |^4$ (unchanged)	$O(\ndom \nfrag + \nclasses )$
$f^{\mathcal {M}\mathcal {I}}$	$|\alphabet |^3$	$|\alphabet |^3$ (unchanged)	$O(\ndom \nfrag + \nclasses )$
$f^{\mathcal {M}\mathcal {D}}$	$|\alphabet |^3$	$|\alphabet |^3$ (unchanged)	$O(\ndom \nfrag + \nclasses )$
$f^{\mathcal {I}\mathcal {M}}$	$|\alphabet |^3$	$|\alphabet |^4$	$O(\ndom \nfrag + \nclasses )$
$f^{\mathcal {I}\mathcal {I}}$	$|\alphabet |^2$	$|\alphabet |^3$ ($X$ preserved)	$O(\ndom \nfrag + \nclasses )$
$f^{\mathcal {I}\mathcal {D}}$	$|\alphabet |^2$	$|\alphabet |^3$ ($X$ preserved, $Y \to X'$)	$O(\ndom \nfrag + \nclasses )$
$f^{\mathcal {D}\mathcal {M}}$	$|\alphabet |^3$	$|\alphabet |^4$	$O(\ndom \nfrag + \nclasses )$
$f^{\mathcal {D}\mathcal {D}}$	$|\alphabet |^2$	$|\alphabet |^3$ ($Y$ preserved)	$O(\ndom \nfrag + \nclasses )$
$f^{\mathcal {D}\mathcal {I}}$	$|\alphabet |^2$	$|\alphabet |^3$ ($Y$ preserved, $X \to Y'$)	$O(\ndom \nfrag + \nclasses )$

The per-entry cost is $O(\ndom \nfrag + \nclasses )$ after Woodbury factoring (cf. the same per-entry cost for $f^{\mathcal {M}\mathcal {M}}$ noted in Remark on p. 136), since the Insert and Delete chain Green’s functions have the same bilinear decomposition. The total cost for the full adjacency table is $O((\ndom \nfrag + \nclasses ) \cdot |\alphabet |^4)$ — the same asymptotic scaling as the existing match-to-match computation. The Woodbury kernel remains $3 \times 3$ for the full system; the Insert and Delete chain kernels are $\leq 2 \times 2$. All quantities remain closed-form rational functions of the model parameters (composed with the $\nclasses $ rate-matrix exponentials).
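To make the storage overhead concrete, the following minimal sketch totals the entry counts of the nine adjacency tensors before and after the full-context correction. The character dimensions are read off the table above; the alphabet size of 20 (amino acids) is an illustrative assumption, and the names `DIMS` and `table_sizes` are ours.

```python
# Character dimensions (original, full-context) of each adjacency tensor,
# read off the table above. Keys are source-state, destination-state types.
DIMS = {
    "MM": (4, 4), "MI": (3, 3), "MD": (3, 3),
    "IM": (3, 4), "II": (2, 3), "ID": (2, 3),
    "DM": (3, 4), "DD": (2, 3), "DI": (2, 3),
}

def table_sizes(alphabet_size):
    """Return (original, full_context) total entry counts over all nine tensors."""
    original = sum(alphabet_size ** d0 for d0, _ in DIMS.values())
    full = sum(alphabet_size ** d1 for _, d1 in DIMS.values())
    return original, full

# Illustrative alphabet size (20 amino acids); an assumption, not from the text.
orig_total, full_total = table_sizes(20)
print(orig_total, full_total, round(full_total / orig_total, 2))
```

Under this assumption the full-context tables hold fewer than three times as many entries as the original ones, consistent with the unchanged $|\alphabet |^4$ leading order.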

Remark C.12 (Why the original formulation lost context). The emission functions $\phi ^{\ins }_{\dom \frag }(Y)$ and $\phi ^{\del }_{\dom \frag }(X)$ are genuinely single-character: the Insert emission does not depend on the ancestor, and vice versa. The lost context is not an emission effect but a transition routing effect: the joint posterior $P((\dom ,\frag ,\class ) \mid X, Y)$ depends on both characters, so the passthrough character informs which (domain, fragment-type, class) we are in, and hence which transition probabilities apply. The correction above recovers this information by conditioning the structural weights on the passthrough character.

C.5.10 Domains versus Fragments versus Classes for Adjacency Capture

The order-1 WFST has $O(|\alphabet |^4)$ free parameters (from match-to-match transition weights parameterized by context $(X,Y)$ and next pair $(\anctok ',\destok ')$), while the MixDom model has far fewer. We analyse which type of model complexity most efficiently captures adjacency structure.

Adjacency tensor rank The match-to-match adjacency table $f^{\mathcal {M}\mathcal {M}}(X,Y,X',Y')$ is a sum of $(\ndom \nfrag )^2$ structural-weight terms (in the (domain, fragment-type)-pair sense), each weighted by a rank-$\nclasses $ emission factor on each side. To approach saturating the WFST capacity ($|\alphabet |^4$ entries), the tensor rank must approach $|\alphabet |^2$; this is achievable through a combination of $(\ndom , \nfrag , \nclasses )$.

Domains: independent TKF rates and weights Each domain type $\dom $ brings its own $(\alpha _\dom , \beta _\dom , \gamma _\dom , \kappa _\dom )$, creating genuinely different:

$\mat \to \del $ vs. $\mat \to \ins $ ratios (different $\kappa _\dom (1-\alpha _\dom )$ vs. $\beta _\dom $),
$\waitm $ vs. $\waitd $ transition behaviour (different $\beta _\dom $ vs. $\gamma _\dom $),
overall indel/match balance.

These features drive cross-type adjacency diversity in the WFST and are not reproducible by fragments or classes alone.

Fragments: vary intra-fragment transition pattern as well as boundary frequency The intra-fragment Markov ext matrix $\ext ^{(\dom )}_{\srcfrag \destfrag }$ allows transitions between fragment-types within a single fragment, without invoking the TKF92 new-fragment branch. These intra-fragment $\srcfrag \to \destfrag $ transitions at each emitted site can carry persistent emission-class context: a chain that prefers one fragment-type tends to stay in that fragment-type’s class mixture, producing richer $\mat /\ins /\del $ adjacency patterns than fragments without intra-fragment correlation. The off-diagonal entries of $\ext ^{(\dom )}$ thus contribute genuinely new $\mat /\ins /\del $ adjacency content. Setting them to zero (--freeze-offdiag-ext in the fitter) recovers the diagonal-extension special case, in which fragments only modulate boundary frequency; the $\nfrag = 1$ case collapses fragments further, to a scalar self-extension per domain.

Classes: emission-mixture diversity, decoupled from structure Each site class $\class $ has its own GTR $(\exch ^{(\class )}, \eqm ^{(\class )})$. Classes contribute emission diversity: the rank-$\nclasses $ class-mixture emission $\phi ^{\mat ,\ins ,\del }_{\dom \frag }$ allows the joint match emission tensor to deviate from a single GTR’s product structure, even within a single $(d,f)$. Classes do not contribute to the $\mat /\ins /\del $ transition pattern (they do not appear in $\effT $ or in any of the structural weights $C^{\alpha \beta }$); they affect only the character-side of each adjacency frequency. Sharing the class pool across $(d,f)$ via $\classdist $ gives a low-parameter way to enrich the emission tensor without growing the latent-state space.

Remark C.13 (Parameter allocation). For adjacency capture in MixDom:

Domains are essential to enrich the $\mat /\ins /\del $ transition pattern via independent TKF rates.
Fragments enrich the $\mat /\ins /\del $ structure (via the intra-fragment Markov ext) and the boundary frequency (via $\notext ^{(\dom )}_\srcfrag $); they also support an additional layer of emission diversity through $\classdist _{\dom \frag \class }$.
Classes enrich emissions only, but cheaply (decoupled from $\ndom , \nfrag $): with $\nclasses $ classes, every $(d,f)$ has access to the full pool via its Dirichlet $\classdist _{\dom \frag \class }$.

The total adjacency-tensor rank scales as $\ndom \nfrag $ (structural-weight pairs) $\times \nclasses ^2$ (left/right class mixture); in practice, allocating the parameter budget across all three axes gives the most efficient match to the WFST capacity.

C.5.11 Identifiability

Are the MixDom parameters recoverable from the distilled order-1 models?

Generic identifiability (up to standard ambiguities). The distilled WFST provides $O(|\alphabet |^4)$ constraints on the MixDom parameters. With $\ndom $ domains, $\nfrag $ fragment types per domain, and $\nclasses $ site classes, there are $2 + (\ndom -1) + 2\ndom + \ndom \nfrag ^2 + \ndom \nfrag (\nclasses -1) + \nclasses (|\alphabet |^2 + |\alphabet | - 2)/2$ free parameters (the last term accounts for $\nclasses $ symmetric exchangeability matrices, with $|\alphabet |(|\alphabet |-1)/2$ free entries each, and $\nclasses $ equilibrium distributions, with $|\alphabet |-1$ free entries each). The system is heavily over-determined for moderate $(\ndom , \nfrag , \nclasses )$.
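The over-determination claim can be checked with a short count. The sketch below follows the parameter tally term by term; the reading of each term in the comments is our interpretation, and we assume that a symmetric exchangeability matrix is counted by its strict upper triangle and an equilibrium distribution by $|\alphabet |-1$ entries. The function name and the example sizes are hypothetical.

```python
def mixdom_free_params(n_dom, n_frag, n_classes, alphabet_size):
    """Free-parameter count of MixDom, term by term (interpretation in comments)."""
    top_level = 2                                     # top-level insertion and deletion rates
    domain_weights = n_dom - 1                        # categorical over domain types
    domain_rates = 2 * n_dom                          # per-domain insertion and deletion rates
    ext = n_dom * n_frag ** 2                         # intra-fragment extension matrices
    class_weights = n_dom * n_frag * (n_classes - 1)  # per-(domain, fragment-type) class mixtures
    # Assumption: symmetric exchangeabilities counted by the strict upper triangle,
    # equilibrium distributions by alphabet_size - 1 free entries.
    exch = alphabet_size * (alphabet_size - 1) // 2
    eqm = alphabet_size - 1
    gtr = n_classes * (exch + eqm)
    return top_level + domain_weights + domain_rates + ext + class_weights + gtr

# Illustrative sizes: 3 domains, 2 fragment types, 4 classes, 20-letter alphabet.
n_params = mixdom_free_params(3, 2, 4, 20)
n_constraints = 20 ** 4
print(n_params, n_constraints)
```

For these sizes the constraint count exceeds the parameter count by more than two orders of magnitude.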

The per-domain TKF parameters $(\insrate _\dom \evoltime , \delrate _\dom \evoltime , \ext ^{(\dom )}_{\srcfrag \destfrag })$ are recoverable from the within-(domain, fragment-type) $\mat /\ins /\del $ transition structure: $\alpha _\dom $ and $\kappa _\dom $ determine $\delrate _\dom \evoltime $ and $\insrate _\dom /\delrate _\dom $; $\beta _\dom $ provides a redundant check; $\ext ^{(\dom )}$ separates from $\kappa _\dom $ because intra-fragment Markov transitions decompose differently from new-fragment transitions (see Section C.3.5 of the Maraschino fitter and Baum–Welch derivation). The top-level parameters $(\insrate _\main \evoltime , \delrate _\main \evoltime )$ are identifiable from the inter-domain transition pattern. The per-class GTRs $(\exch ^{(\class )}, \eqm ^{(\class )})$ and per-(domain, fragment-type) $\classdist _{\dom \frag \class }$ are jointly identifiable up to label permutation of classes, because the WFST context dependence reveals the mixture components as the joint posterior shifts across different $(X, Y)$ contexts.

Unavoidable ambiguities.

1.: Label permutation: permuting domain labels gives the same model ($\ndom !$-fold), permuting fragment-type labels at fixed domain gives the same model ($(\nfrag !)^{\ndom }$-fold), and permuting class labels gives the same model ($\nclasses !$-fold).
2.: Rate-time confounding: only products $\insrate \evoltime , \delrate \evoltime $ are identifiable from pairwise data (a single evolutionary time).
3.: Class-domain mixing: when $\nclasses \geq \ndom \nfrag $, the class structure can absorb domain-level emission differences, creating identifiability issues unless the structural weights $C^{\alpha \beta }$ resolve them.

Lossy in distribution, injective in parameters. The distilled model captures only pairwise-adjacent correlations; the full MixDom has higher-order structure (e.g., runs of characters from the same fragment, intra-fragment Markov correlations between fragment-types). The distillation map MixDom $\to $ order-1 is therefore lossy for the sequence distribution but generically injective for the parameters: one can recover the MixDom parameters from the distilled model, even though the distilled model cannot reproduce all statistics of the MixDom.

C.5.12 Scaling to $\ndom , \nfrag , \nclasses $

The top-level null closure does not grow with $\ndom , \nfrag , \nclasses $ The null states $\mnull , \inull , \dnull $ are always exactly three, regardless of the latent-state cardinalities. Their transition submatrix depends on $\ndom $ only through the scalars $\emptyseg _0 = \sum _\dom \domdist _\dom (1 - \kappa _\dom )$ and $\emptyseg _\evoltime = \sum _\dom \domdist _\dom (1 - \kappa _\dom )(1 - \beta _\dom )$; $\nfrag $ and $\nclasses $ do not appear at this top level. The null closure and the effective $5 \times 5$ top-level matrix $\nonemptytrans $ are $O(\ndom )$ to compute but $O(1)$ in matrix dimension.

Domain types are drawn i.i.d., not as a Markov chain When a new domain begins, its type is drawn independently from $\domdist $, regardless of the previous domain’s type. This i.i.d. structure means the cross-domain transitions factor as $\effT (s, s') = e_\srcdom (s) \times \nonemptytrans _\bullet \times s_\destdom (s')$ (exit $\times $ top-level $\times $ start), and the cross-domain contribution to the full $5\ndom \nfrag \times 5\ndom \nfrag $ transition matrix has rank at most 3. Note that this i.i.d. property holds at the domain level only; within a fragment, the fragment-type process is a Markov chain on $\nfrag $ states (see Section C.1.1, Equation C.2), which is encoded in the within-domain block $D_\dom $ and does not break the cross-domain factorisation.

Woodbury for general $\ndom , \nfrag $ The emitting state space has $5\ndom \nfrag $ states. The effective transition matrix decomposes as \[ \effT = D + E\, \nonemptytrans _\bullet \, S^\top \] where $D = \text {diag}(D_1, \ldots , D_\ndom )$ with each $D_\dom $ a $5\nfrag \times 5\nfrag $ within-domain block (now domain-specific and, for $\nfrag > 1$, with intra-fragment Markov coupling), $E \in \mathbb {R}^{5\ndom \nfrag \times 3}$ stacks exit vectors projected to $\{\mat ,\ins ,\del \}$, and $S \in \mathbb {R}^{5\ndom \nfrag \times 3}$ stacks start vectors projected to $\{\mat ,\ins ,\del \}$.

The Woodbury identity gives $(I - \effT )^{-1}$ via:

1.: $\ndom $ independent $5\nfrag \times 5\nfrag $ inversions $(I - D_\dom )^{-1}$, each evaluated in closed form via the Kronecker-plus-rank-3 inner Woodbury of Section C.5.6: one $\nfrag \times \nfrag $ adjugate $(I_\nfrag - \ext ^{(\dom )})^{-1}$ plus a $3 \times 3$ inner kernel per domain. For $\nfrag = 1$ this further collapses to a $3\times 3$ adjugate (closed-form determinant $(1-\ext )(1-\beta )(1-\kappa )$) plus two scalar inversions (Section C.5.7). The total cost is $O(\ndom \nfrag ^3)$ per evaluation.
2.: Computing $\sum _\dom \mathbf {s}_\dom ^\top G_\dom \, \mathbf {e}_\dom $, a $3 \times 3$ matrix built by summing $\ndom $ contributions: $O(\ndom \nfrag ^2)$ work.
3.: One $3 \times 3$ inversion for the Woodbury correction: $O(1)$ work.
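The three steps above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the fitter's implementation: the blocks, exit and start vectors, and the $3 \times 3$ top-level matrix are random placeholders, the within-domain inverses use a generic dense inverse rather than the closed-form inner Woodbury, and the helper names are ours. The identity used is $(I - D - E N S^\top )^{-1} = G + G E N (I_3 - S^\top G E N)^{-1} S^\top G$ with $G = (I - D)^{-1}$.

```python
import numpy as np

def block_diag(blocks):
    """Assemble square blocks into one dense block-diagonal matrix."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        m = b.shape[0]
        out[i:i + m, i:i + m] = b
        i += m
    return out

def woodbury_inverse(D_blocks, E, N, S):
    """Return (I - T)^{-1} for T = blockdiag(D_blocks) + E @ N @ S.T."""
    # Step 1: independent within-domain inverses G_d = (I - D_d)^{-1}.
    G = block_diag([np.linalg.inv(np.eye(b.shape[0]) - b) for b in D_blocks])
    # Step 2: the small matrix S^T G E (3 x 3 when E and S have 3 columns).
    # G is kept dense here for readability; a blockwise sum over domains avoids that.
    small = S.T @ G @ E
    # Step 3: one small inversion for the Woodbury correction.
    kernel = np.linalg.inv(np.eye(N.shape[0]) - small @ N)
    return G + G @ E @ N @ kernel @ S.T @ G

# Illustrative sizes and random entries (assumptions, not fitted parameters).
rng = np.random.default_rng(0)
n_dom, n_frag = 3, 2
D_blocks = [0.05 * rng.random((5 * n_frag, 5 * n_frag)) for _ in range(n_dom)]
E = 0.1 * rng.random((5 * n_dom * n_frag, 3))
S = 0.1 * rng.random((5 * n_dom * n_frag, 3))
N = 0.1 * rng.random((3, 3))

T = block_diag(D_blocks) + E @ N @ S.T
direct = np.linalg.inv(np.eye(T.shape[0]) - T)
fast = woodbury_inverse(D_blocks, E, N, S)
print(np.max(np.abs(fast - direct)))
```

The final line compares the factored result against a direct $5\ndom \nfrag \times 5\ndom \nfrag $ inversion; the two should agree to floating-point precision.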

Proposition C.2 (Linear-in-$\ndom $, cubic-in-$\nfrag $, linear-in-$\nclasses $ scaling). The order-1 distillation computation scales as $O(\ndom \nfrag ^3)$ in within-domain inversions and $O(\nclasses )$ in per-class GTR exponentials:

$O(\ndom )$ to compute $\emptyseg _0, \emptyseg _\evoltime $ and the $5\times 5$ top-level matrix.
$O(\ndom \nfrag ^3)$ for within-domain path sums.
$O(\ndom \nfrag ^2)$ to accumulate the $3 \times 3$ Woodbury correction.
$O((\ndom \nfrag )^2)$ (domain, fragment-type)-pair terms per adjacency entry (or $O((\ndom \nfrag )^2 |\alphabet |^d \cdot \nclasses )$ for the full adjacency table, with $d \in \{2,3,4\}$ character dimensions depending on state type; see Section C.5.9).
$O(\nclasses |\alphabet |^3)$ for per-class transition kernels $\exp (\revsub ^{(\class )}\evoltime )$.

The Woodbury kernel is always $3 \times 3$, regardless of the latent state cardinalities.

C.5.13 Summary

Component	Matrix size	Scaling
Top-level null closure	$3 \times 3$	$O(1)$
Top-level eff. matrix	$5 \times 5$	$O(\ndom )$ to compute
Within-domain inversions	$5\nfrag \times 5\nfrag $ each	$O(\ndom \nfrag ^3)$
Woodbury correction	$3 \times 3$	$O(\ndom \nfrag ^2)$ to accumulate
Per-class GTR exponentials	$|\alphabet | \times |\alphabet |$	$O(\nclasses |\alphabet |^3)$
(Domain, fragment-type)-pair adjacency terms	—	$O((\ndom \nfrag )^2 \nclasses )$ per entry

The distillation is closed-form for any finite $\ndom , \nfrag , \nclasses $: all quantities are rational functions of the model parameters (composed with $\nclasses $ exponentials of rate-times-time products). The key structural features enabling this are:

1.: Top-level null closure is $O(1)$ in dimension. The MixDom null states $\mnull , \inull , \dnull $ are always exactly three, regardless of $\ndom , \nfrag , \nclasses $. All dependence on $\ndom $ enters through the scalars $\emptyseg _0, \emptyseg _\evoltime $.
2.: Cross-domain transitions factor. The exit-vector $\times $ top-level $\times $ start-vector structure caps the rank of the inter-domain coupling at 3 (the $\mat , \ins , \del $ top-level states that couple across domains), enabling Woodbury reduction to a fixed $3 \times 3$ inversion.
3.: Domain types are i.i.d., not Markov. If domain type depended on the previous domain (a Markov chain on domain types), the cross-domain block would lose its factored structure and the Woodbury reduction would fail—the full $5\ndom \nfrag \times 5\ndom \nfrag $ inversion would be required, scaling as $O((\ndom \nfrag )^3)$. The domain-level i.i.d. mixture is what keeps the cross-domain part linear in $\ndom $. The intra-fragment Markov chain on fragment-types is encoded inside $D_\dom $ and does not break this factorisation.
4.: Class mixture is bilinear in emissions. The class index $\class $ enters every emission as a linear sum weighted by $\classdist _{\dom \frag \class }$, never inside the structural weights $C^{\alpha \beta }$. This makes the class mixture a low-rank emission factor that does not interact with the block-diagonal-plus-low-rank structure of the transition matrix.

C.6 MixDom-Specific SVI-BW Convergence Considerations

This appendix specialises the model-agnostic convergence analysis of Appendix B.2 and the BDI expected-statistic formulae of Appendix B.3 to the hierarchical MixDom model. Each subsection corresponds to a question that arises specifically because MixDom carries multiple interacting parameter groups (top-level vs. per-domain BDI, intra-fragment fragment-type chains, and per-class substitution models).

C.6.1 Parameter groups and Fisher information

The MixDom model has several parameter groups, each with different Fisher information characteristics:

Parameter group	# params	Info per pair	Bottleneck
Top-level $(\insrate _0, \delrate _0)$	2	$O(1)$	few domains per sequence
Per-domain $(\insrate _d, \delrate _d)$	$2N_{\text {dom}}$	$O(w_d)$	domain frequency $w_d$
Intra-fragment fragment-type transitions $\ext ^{(d)}_{fg}$	$N_{\text {dom}} F^2$	$O(w_d \bar L_d)$	fragment count
Domain weights $w_d$	$N_{\text {dom}}-1$	$O(1)$	multinomial
Substitution $Q_c$	$O(|\mathcal {A}|^2 C)$	$O(\bar L \cdot w_c)$	alignment length

The bottleneck for SVB convergence is the indel parameters of rare domains. A domain with frequency $w_d = 0.05$ contributes useful BDI statistics to only $\sim 5\%$ of pairs, so its effective sample size is $\sim 0.05 B$ per minibatch.

C.6.2 Substitution vs. indel information

Substitution parameters benefit from $O(\bar L)$ information per pair (one observation per aligned column), while indel parameters get $O(1)$ information per pair. For a protein with $\bar L \approx 200$ residues, substitution parameters converge $\sim 200\times $ faster than indel parameters. This motivates decoupled update frequencies: update substitution parameters every minibatch, but average indel parameter estimates over multiple minibatches.

C.6.3 MixDom expected statistics

We now aggregate the per-process expectations of Appendix B.3 to the full MixDom model with $N_{\text {dom}}$ domain types.

Top-level (domain birth-death) The top-level BDI process has rates $(\insrate _0, \delrate _0)$ with $\kappa _0 = \insrate _0/\delrate _0$ and creates/destroys domains. At stationarity: \begin {alignat} {2} L_0 &= \kappa _0/(1-\kappa _0) &\quad & \text {(expected \# domains per sequence)} \label {eq:L0} \\ \expect [B_0] &= \insrate _0\,\evoltime /(1-\kappa _0) &\quad & \text {(domain births per pair)} \label {eq:EB0} \\ \expect [D_0] &= \expect [B_0] &\quad & \text {(domain deaths per pair)} \label {eq:ED0} \\ \expect [S_0] &= L_0\,\evoltime &\quad & \text {(time-integrated domain count)} \label {eq:ES0} \\ M_0 &= 1, \quad T_0 = \evoltime &\quad & \text {(one endpoint, one process)} \label {eq:M0T0} \end {alignat}
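The birth and death expectations above follow from a short rate argument, sketched here under the assumption that the top-level process is the usual TKF91 links process: each of the $n$ mortal links and the immortal link gives birth at rate $\insrate _0$, and each mortal link dies at rate $\delrate _0$. With the stationary mean $\expect [n] = L_0 = \kappa _0/(1-\kappa _0)$, \[ \begin {aligned} \expect [B_0] &= \insrate _0\,\evoltime \,(L_0 + 1) = \frac {\insrate _0\,\evoltime }{1-\kappa _0}, \\ \expect [D_0] &= \delrate _0\,\evoltime \,L_0 = \delrate _0\,\evoltime \,\frac {\kappa _0}{1-\kappa _0} = \frac {\insrate _0\,\evoltime }{1-\kappa _0}, \end {aligned} \] so births and deaths balance at stationarity, and $\expect [S_0] = L_0\,\evoltime $ is the stationary mean link count integrated over the branch.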

Per-domain type $d$ (fragment birth-death within a domain) Each surviving domain link of type $d$ (with probability $w_d$) contains a TKF92 fragment process with rates $(\insrate _d, \delrate _d)$, $\kappa _d = \insrate _d/\delrate _d$.

The number of domain links of type $d$ is approximately $M_d = w_d\,L_0$ at stationarity. Each such link contributes one independent BDI fragment process, so the aggregated fragment-level statistics for domain type $d$ are: \begin {alignat} {2} L_d &= w_d\,L_0\;\cdot \;\frac {\kappa _d}{1-\kappa _d} &\quad & \text {(total fragment links, type $d$)} \label {eq:Ld} \\ \expect [B_d] &= w_d\,L_0\;\cdot \; \frac {\insrate _d\,\evoltime }{1-\kappa _d} &\quad & \text {(fragment births, type $d$, per pair)} \label {eq:EBd} \\ \expect [D_d] &= \expect [B_d] &\quad & \text {(fragment deaths, type $d$, per pair)} \label {eq:EDd} \\ \expect [S_d] &= w_d\,L_0\;\cdot \; \frac {\kappa _d\,\evoltime }{1-\kappa _d} &\quad & \text {(time-integrated fragment count, type $d$)} \label {eq:ESd} \\ M_d &= w_d\,L_0 &\quad & \text {(\# independent processes)} \label {eq:Md} \\ T_d &= M_d\,\evoltime = w_d\,L_0\,\evoltime &\quad & \text {(total observation time)} \label {eq:Td} \end {alignat}

Note that $M_d$ and $T_d$ are themselves random (they depend on how many domains of type $d$ survive), but at stationarity their expectations are as above.

Intra-fragment fragment-type transitions and sequence length Within each fragment of domain $d$ the fragment-type chain is governed by an $F \times F$ Markov transition matrix $\ext ^{(d)}_{fg}$ (intra-fragment; different fragments are independent realisations). The termination probability from fragment-type $f$ is $\notext ^{(d)}_f = 1 - \sum _g \ext ^{(d)}_{fg}$. The expected sojourn length starting from fragment state $f$ is determined by the fundamental matrix $(I - \ext ^{(d)})^{-1}$: specifically, the expected number of sites emitted starting from a fragment initiated in state $f$ is $\sum _g [(I - \ext ^{(d)})^{-1}]_{fg}$. Averaging over the initial fragment distribution $\fragdist _{d,f}$, the expected number of residues per fragment is $\bar {K}_d = \sum _f \fragdist _{d,f} \sum _g [(I - \ext ^{(d)})^{-1}]_{fg}$.

The expected number of residues per domain of type $d$ is: \begin {equation} \label {eq:chars-per-domain} \bar {C}_d = \frac {\kappa _d}{1-\kappa _d}\, \bar {K}_d, \end {equation} and the total expected sequence length is: \begin {equation} \label {eq:total-length} \bar {L}_{\text {seq}} = L_0 \sum _d w_d\,\bar {C}_d = \frac {\kappa _0}{1-\kappa _0} \sum _d w_d\,\frac {\kappa _d}{1-\kappa _d}\, \bar {K}_d. \end {equation}
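The length formulas above reduce to one small linear solve per domain. A minimal sketch follows, with a hypothetical two-state ext matrix chosen so that every state terminates with probability $0.2$ (hence exactly five residues per fragment on average); the function names and all numerical values are illustrative assumptions.

```python
import numpy as np

def residues_per_fragment(init, ext):
    """K_d = init^T (I - ext)^{-1} 1: expected sites emitted per fragment."""
    n_states = ext.shape[0]
    fundamental = np.linalg.inv(np.eye(n_states) - ext)
    return float(init @ fundamental @ np.ones(n_states))

def sequence_length(kappa0, weights, kappas, K):
    """L_seq = L_0 * sum_d w_d * kappa_d / (1 - kappa_d) * K_d."""
    L0 = kappa0 / (1.0 - kappa0)
    C = [k / (1.0 - k) * Kd for k, Kd in zip(kappas, K)]  # residues per domain
    return L0 * sum(w * c for w, c in zip(weights, C))

# Hypothetical intra-fragment chain: each row sums to 0.8, so termination is 0.2.
ext = np.array([[0.6, 0.2],
                [0.1, 0.7]])
init = np.array([0.5, 0.5])
K1 = residues_per_fragment(init, ext)

# Single domain type with kappa_0 = kappa_1 = 0.5 (one link expected at each level).
L_seq = sequence_length(0.5, [1.0], [0.5], [K1])
print(K1, L_seq)
```

Because both rows of this ext matrix terminate with the same probability, the fragment length is geometric with mean $1/0.2 = 5$, which the fundamental-matrix computation reproduces.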

Remark C.14 (Convergence of Markov chain transition counts). When fragments are i.i.d. (the $\nfrag = 1$ scalar-extension special case), the sufficient statistics for the fragment extension parameter $\ext _f$ are Bernoulli counts (extend vs. terminate), and the convergence analysis reduces to a Beta-distributed posterior. With the general fragment-type Markov chain, the sufficient statistics are rows of a Markov chain transition count matrix: for each domain $d$ and source fragment state $f$, we observe counts $\hat {n}^{(d)}_{fg}$ of transitions to each target state $g$ plus termination counts $\hat {n}^{(d)}_{f,\text {end}}$. The M-step row-normalizes these counts (a Dirichlet posterior), and the per-pair Fisher information for each row scales with the expected number of visits to state $f$ per pair. For fragment states that are rarely visited (low stationary probability under $\ext ^{(d)}$), the effective sample size is correspondingly small, mirroring the rare-domain bottleneck at the top level.
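The row-normalising M-step described in the remark can be sketched as below for a single domain. The function name, the plain-list data layout, and the example counts are hypothetical; no Dirichlet pseudocounts are added, so an unvisited state is reported as an error rather than silently assigned a row.

```python
def mstep_ext(trans_counts, end_counts):
    """Row-normalise expected transition counts for one domain.

    trans_counts[f][g] is the expected number of f -> g extensions and
    end_counts[f] the expected number of terminations from state f.
    Returns (ext, notext), where ext[f][g] + notext[f] sums to one per row.
    """
    ext, notext = [], []
    for row, n_end in zip(trans_counts, end_counts):
        total = sum(row) + n_end
        if total <= 0:
            raise ValueError("state never visited; its row is not estimable")
        ext.append([n / total for n in row])
        notext.append(n_end / total)
    return ext, notext

# Hypothetical expected counts for a two-state fragment-type chain.
ext, notext = mstep_ext([[6.0, 2.0], [1.0, 7.0]], [2.0, 2.0])
print(ext, notext)
```

A rarely visited state has a small row total, which is the small effective sample size the remark refers to.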

Numerical example: d3 checkpoint We evaluate these formulas using the MixDom d3 checkpoint parameters: \begin {alignat*} {3} &\insrate _0 = 0.01328, \quad &&\delrate _0 = 0.01412, \quad &&\kappa _0 = 0.9405 \\ &w = (0.662,\; 0.075,\; 0.264) \\ &\insrate _d = (0.00302,\; 0.01033,\; 0.12370) \\ &\delrate _d = (0.00372,\; 0.04686,\; 0.17775) \\ &\kappa _d = (0.812,\; 0.220,\; 0.696) \\ &\evoltime = 0.5 \end {alignat*}

Quantity	Top-level ($d{=}0$)	Dom 1 ($d{=}1$)	Dom 2 ($d{=}2$)	Dom 3 ($d{=}3$)
$\kappa $	0.941	0.812	0.220	0.696
$L = \kappa /(1{-}\kappa )$	15.81	4.31	0.283	2.29
$w_d \cdot L_0$	—	10.47	1.19	4.17
$\expect [B]$ per pair	0.112	0.084	0.008	0.849
$\expect [D]$ per pair	0.112	0.084	0.008	0.849
$\expect [S]$	7.91	22.58	0.168	4.78
$M$	1	10.47	1.19	4.17
$T$	0.5	5.23	0.59	2.09
$v_\theta $ (approx.)	$\sim 18$	$\sim 24$	$\sim 255$	$\sim 2.4$

Computation of the table entries. Top level: $L_0 = 0.9405 / 0.0595 = 15.81$. $\expect [B_0] = 0.01328 \times 0.5 / 0.0595 = 0.112$. $\expect [S_0] = 15.81 \times 0.5 = 7.91$.

Domain type 1 ($w_1 = 0.662$): $M_1 = 0.662 \times 15.81 = 10.47$ independent fragment processes. $L_1 = 10.47 \times 0.812/0.188 = 10.47 \times 4.31 = 45.1$ total fragment links. $\expect [B_1] = 10.47 \times 0.00302 \times 0.5/0.188 = 0.084$. (The table shows $L$ per process and $\expect [B]$ aggregated over $M_d$ processes.)

Domain type 3 ($w_3 = 0.264$): $M_3 = 0.264 \times 15.81 = 4.17$. $\expect [B_3] = 4.17 \times 0.12370 \times 0.5/0.304 = 0.849$. This domain has high $\insrate _3$, so fragment births are frequent and its indel parameters are easy to estimate ($v \approx 2.4$).

$v_{\insrate _0}$: Using (B.53) with $\rho \approx 0.5$: $v_{\insrate _0} \approx 2/\expect [B_0] = 2/0.112 = 17.9$.

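$v_{\insrate _2}$: $\expect [B_2] = 1.19 \times 0.01033 \times 0.5 / 0.780 = 0.0079$. $v_{\insrate _2} \approx 2/\expect [B_2] \approx 255$ (evaluated before rounding $\expect [B_2]$; the rounded value $0.008$ would give $250$).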
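The table entries can be reproduced from the checkpoint parameters with a few lines of arithmetic. The sketch below is a check of the numbers quoted above, assuming the approximation $v_\theta \approx 2/\expect [B]$ used in the text; the function and variable names are ours.

```python
T = 0.5                                  # evolutionary time
INS0, DEL0 = 0.01328, 0.01412            # top-level rates
W = (0.662, 0.075, 0.264)                # domain weights
INS = (0.00302, 0.01033, 0.12370)        # per-domain insertion rates
DEL = (0.00372, 0.04686, 0.17775)        # per-domain deletion rates

def bdi_stats(ins_rate, del_rate, t, n_proc=1.0):
    """Stationary BDI statistics aggregated over n_proc independent processes."""
    kappa = ins_rate / del_rate
    links = kappa / (1.0 - kappa)        # links per process
    births = n_proc * ins_rate * t / (1.0 - kappa)
    return {
        "kappa": kappa, "L": links, "B": births, "D": births,
        "S": n_proc * links * t, "M": n_proc, "T": n_proc * t,
        "v": 2.0 / births,               # approximation v ~ 2 / E[B] from the text
    }

top = bdi_stats(INS0, DEL0, T)
doms = [bdi_stats(i, d, T, n_proc=w * top["L"]) for w, i, d in zip(W, INS, DEL)]
for name, row in zip(["top", "dom1", "dom2", "dom3"], [top] + doms):
    print(name, {k: round(val, 3) for k, val in row.items()})
```

Small differences in the last digit relative to the table come from rounding intermediate values.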

C.6.4 Convergence rate estimates

From (B.43) and the per-pair relative variance $v_\theta $, the number of pairs $N = BK$ needed for target relative error $\varepsilon $ is: \begin {equation} \label {eq:N-needed} N \;\geq \; \frac {v_\theta }{\varepsilon ^2}. \end {equation}

Parameter	$v_\theta $	$N$ for $\varepsilon {=}10\%$	$N$ for $\varepsilon {=}5\%$	$N$ for $\varepsilon {=}1\%$
$\insrate _0$ (top-level ins)	18	1 800	7 200	180 000
$\delrate _0$ (top-level del)	18	1 800	7 200	180 000
$\insrate _1$ (dom 1 ins)	24	2 400	9 600	240 000
$\insrate _2$ (dom 2 ins)	255	25 500	102 000	2 550 000
$\insrate _3$ (dom 3 ins)	2.4	240	960	24 000
$w_d$ (domain weights)	$\sim 1/L_0 \approx 0.06$	6	25	630
$\ext ^{(d)}_{fg}$ (fragment trans.)	$\sim 1/\bar {C}_d$	depends on domain
Substitution ($Q$)	$\sim 1/\bar {L}_{\text {seq}}$	$\ll 100$
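The numerical rows of the table follow directly from $N \geq v_\theta /\varepsilon ^2$. A minimal sketch that regenerates them (the dictionary keys are informal labels of ours):

```python
def pairs_needed(v, eps):
    """Lower bound N >= v / eps^2 on the number of pairs for relative error eps."""
    return v / eps ** 2

# Per-pair relative variances quoted in the table above.
V = {"ins_0": 18, "del_0": 18, "ins_1": 24, "ins_2": 255, "ins_3": 2.4}

for name, v in V.items():
    row = [round(pairs_needed(v, eps)) for eps in (0.10, 0.05, 0.01)]
    print(name, row)
```

Halving the target error quadruples the required number of pairs, which is why the $1\%$ column is out of reach for the rare domain type.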

C.6.5 Discussion: why top-level indel rates are hardest

The table reveals a clear hierarchy of estimation difficulty:

1. Substitution parameters are easiest. Each aligned residue pair contributes one independent observation to the substitution sufficient statistics. With $\bar {L}_{\text {seq}} \approx 200$–$400$ residues per sequence, the per-pair Fisher information for substitution parameters is $O(\bar {L}_{\text {seq}})$, so the per-pair relative variance is $O(1/\bar {L}_{\text {seq}})$. A few dozen pairs suffice for $\varepsilon = 5\%$ accuracy.

2. Domain weights converge fast. Each domain in the ancestor contributes one multinomial observation of the domain type. With $L_0 \approx 16$ domains per sequence, the per-pair Fisher information is $O(L_0)$, giving $v_{w_d} \sim 1/(w_d L_0)$. For the most common domain type ($w_1 = 0.662$), $v_{w_1} \approx 0.095$, and even for the rarest ($w_2 = 0.075$), $v_{w_2} \approx 0.84$. Domain weights are precisely estimated with $N \sim 100$ pairs.

3. Fragment-type transition parameters are moderate. Each intra-fragment fragment-type transition contributes one observation of the $F \times F$ Markov chain row $\ext ^{(d)}_{f,:}$. With $\sim L_0 w_d \kappa _d/(1-\kappa _d)$ fragments of type $d$ per pair, the per-pair information scales with the total fragment count. For domain 1 (which dominates), there are $\sim 45$ fragment links, so the per-row variance scales as $\sim F/(45 \cdot \pi _f)$ where $\pi _f$ is the stationary probability of fragment state $f$. Domain 2 fragments are rare ($\sim 0.3$ per pair) and harder to estimate.

4. Indel parameters for active domains are easy. Domain type 3 has high insertion rate ($\insrate _3 = 0.124$) and $\expect [B_3] \approx 0.85$ births per pair, giving $v_{\insrate _3} \approx 2.4$. A few hundred pairs suffice for $5\%$ accuracy.

5. Indel parameters for common but slow domains are moderate. Domain type 1 has $\expect [B_1] \approx 0.084$ births per pair, giving $v_{\insrate _1} \approx 24$. This requires $N \approx 10{,}000$ pairs for $5\%$ accuracy—feasible but not trivial.

6. Indel parameters for rare domains are the bottleneck. Domain type 2 has weight $w_2 = 0.075$, low $\kappa _2 = 0.22$ (short domains), and low $\insrate _2 = 0.01$. The expected fragment births per pair are only $\expect [B_2] \approx 0.008$. This means on average, only 1 in $\sim 125$ pairs shows even a single fragment birth event in a domain of type 2. The per-pair relative variance $v_{\insrate _2} \approx 255$ requires $N > 100{,}000$ pairs for $5\%$ accuracy.

7. Top-level indel rates are intrinsically hard. The top-level BDI controls domain creation and destruction. With $\insrate _0 = 0.013$, $\evoltime = 0.5$, and $L_0 \approx 16$ domains, the expected number of domain births per pair is only $\expect [B_0] \approx 0.11$. The per-pair relative variance $v_{\insrate _0} \approx 18$ requires $N \approx 7{,}200$ pairs for $5\%$ accuracy.

The fundamental reason is that domain-level indel events are rare compared to residue-level observations. A sequence of 300 residues organized into 16 domains gives $\sim 300$ substitution observations but only $\sim 0.1$ domain birth observations per pair at $\evoltime = 0.5$. The information ratio is roughly $300 / 0.1 = 3{,}000 \times $ in favor of substitution parameters.

Implications for training. These estimates motivate the decoupled update strategy recommended in Section 5: substitution parameters can be accurately estimated from small minibatches ($B \sim 10$), while indel parameters (especially for rare domain types) require accumulation over $N \sim 10^4$–$10^5$ pairs. The Maraschino pipeline (Section 5.5) achieves this by processing all training pairs in a single count tensor, at the cost of the composite-likelihood efficiency gap. The hybrid Maraschino $\to $ SVB pipeline combines the two strengths: Maraschino provides accurate initial estimates using all data, and SVB refines the indel parameters using the full model.

C.7 Variational EM training of MixDom from tree-structured data

The variational ELBO of Appendix C.8 treats the model parameters $\theta = (\insrate _\main , \delrate _\main , \{\insrate _\dom \}, \{\delrate _\dom \}, \domdist , \fragdist , \ext ^{(\dom )}, \classdist , \{\eqm ^{(c)}\}, \{S^{(c)}_{\mathrm {exch}}\})$ as fixed inputs and optimises only the variational distribution $q$ over per-(node, column) latent states. In this appendix we extend the framework to a Variational Bayesian EM (VBEM) training algorithm that learns $\theta $ by alternating between a per-family E-step (Adam ascent on the variational $q$) and a global M-step (closed-form $\theta $ update from aggregated sufficient statistics).

This is structurally analogous to the SVI Baum-Welch pipeline of Section C.1.4 (which trains MixDom from sequence pairs via the labelled Pair HMM), but operates on whole MSAs with their tree topology — each MSA family becomes a single training datum, with the variational distribution capturing the posterior over internal-node states and column-wide tuples.

C.7.1 Outer EM loop

Given the corpus $\{\mathcal {D}_i\}_{i=1}^N$ of MSA-with-tree training data and an initial parameter estimate $\theta ^{(0)}$, the outer loop is

E-step (per family $i$): fit the variational $q_i = (q_i^{(\tau )}, q_i^{(\pi |\tau )})$ by Adam ascent on the ELBO $\mathcal {L}_i(q_i; \theta ^{(t)})$ at the current parameter estimate, returning per-family sufficient statistics $\Phi _i = \{W^{(v\to w)}_i, q^{(\tau )}_i, L^{(\mathrm {sub})}_{i,n,c}\}$.
M-step (global): aggregate sufficient statistics across families and apply closed-form parameter updates for each parameter group, yielding $\theta ^{(t+1)}$.

The corpus may be subsampled minibatch-style each iteration (SVI-style), in which case the M-step uses an exponential-moving-average of sufficient statistics across iterations (Section B.1.13).
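A minimal sketch of such an exponential-moving-average update is given below. It is an illustration under assumptions, not the scheme of Section B.1.13: the step-size schedule $\rho _t = (t + \tau )^{-\kappa }$ is a common stochastic-approximation choice that we assume here, the minibatch statistics are rescaled to corpus size before blending, and all names and numbers are hypothetical.

```python
def step_size(t, tau=1.0, kappa=0.7):
    """Hypothetical Robbins-Monro schedule rho_t = (t + tau)^(-kappa)."""
    return (t + tau) ** (-kappa)

def ema_update(running, batch, rho, n_corpus, n_batch):
    """Blend rescaled minibatch sufficient statistics into the running average."""
    scale = n_corpus / n_batch  # a minibatch stands in for the whole corpus
    return {k: (1.0 - rho) * running[k] + rho * scale * batch[k] for k in running}

# Hypothetical scalar statistics (e.g. expected birth and death counts).
running = {"births": 10.0, "deaths": 8.0}
batch = {"births": 1.0, "deaths": 1.2}
running = ema_update(running, batch, rho=0.5, n_corpus=100, n_batch=10)
print(running, step_size(3))
```

The M-step would then be applied to `running` instead of to the statistics of the current minibatch alone.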

C.7.2 Per-family E-step

For family $i$ with binary tree $\parsetree _i$, branch lengths $\branchlen _i$, and observed leaf data $X_i$, the E-step maximises \begin {equation} \label {eq:vbem-estep-elbo} \mathcal {L}_i(q_i; \theta ) = \sum _{(v\to w) \in \parsetree _i} \expect _{q_i}[\log P^{\mathrm {WFST}}(Z^w \mid Z^v)] + \sum _n \expect _{q_i^{(f)}}[\log L^{(\mathrm {sub}),\text {tot}}_n] + \log p^{\text {red}}_\text {singlet}(Z^{\text {root}}) + H[q_i^{(\tau )}] + H[q_i^{(\pi |\tau )}] + \log Z_q \end {equation} over the variational logits $(\text {edge\_logits}, \text {root\_logit}, \text {tuple\_logits})$ via JIT-compiled Adam. The objective is identical to the inference-time ELBO of Appendix C.8, with $\theta $ held fixed throughout the E-step.

After convergence (typically 100–300 Adam iterations with $\text {lr} \approx 0.05$, Fitch-seeded init), the E-step extracts the following sufficient statistics from $q_i$:

Per-branch reduced expected counts. For each branch $(v\to w)$, the cumulant-trick prefix sum (Section C.8.6) yields \begin {equation} \label {eq:W-tensor} W^{(v\to w)}_{ss', \tau \tau '} = \sum _{N=1}^{L+1} \sum _{M=0}^{N-1} q^{(\tau )}_{n(M)}(\tau )\, q^{(\tau )}_{n(N)}(\tau ')\, P^{v\to w}_{q,M}(s)\, P^{v\to w}_{q,N}(s') \prod _{K=M+1}^{N-1} P^{v\to w}_{q,K}(\mathsf {Ig}), \end {equation} a $5 \times T \times 5 \times T$ tensor that summarises the expected labelled WFST transitions on this branch.
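For a single index combination $(s, s', \tau , \tau ')$ the double sum in (C.19) can be evaluated in linear time by carrying a running accumulator. The sketch below is one way to do this and is not necessarily the cumulant-trick implementation of Section C.8.6. The arrays `a`, `b`, `g` are hypothetical names for $a_M = q^{(\tau )}_{n(M)}(\tau )\,P_{q,M}(s)$, $b_N = q^{(\tau )}_{n(N)}(\tau ')\,P_{q,N}(s')$ and $g_K = P_{q,K}(\mathsf {Ig})$, each indexed $0, \ldots , L+1$.

```python
import random


def w_entry_bruteforce(a, b, g):
    """Direct O(L^3) evaluation of sum_{N>=1} sum_{M<N} a[M] b[N] prod_{M<K<N} g[K]."""
    total = 0.0
    for N in range(1, len(a)):
        for M in range(N):
            skip = 1.0
            for K in range(M + 1, N):
                skip *= g[K]
            total += a[M] * b[N] * skip
    return total


def w_entry_linear(a, b, g):
    """Same quantity in O(L) via acc_N = g[N-1] * acc_{N-1} + a[N-1]."""
    total = 0.0
    acc = 0.0
    for N in range(1, len(a)):
        # acc holds sum_{M<N} a[M] * prod_{M<K<N} g[K].
        acc = acc * g[N - 1] + a[N - 1]
        total += b[N] * acc
    return total


rng = random.Random(0)
length = 12  # plays the role of L + 2 positions
a = [rng.random() for _ in range(length)]
b = [rng.random() for _ in range(length)]
g = [rng.random() for _ in range(length)]
print(abs(w_entry_bruteforce(a, b, g) - w_entry_linear(a, b, g)) < 1e-12)
```

The accumulator satisfies $\mathrm {acc}_{N+1} = g_N\,\mathrm {acc}_N + a_N$ with $\mathrm {acc}_1 = a_0$, which is what makes the single pass equivalent to the nested sum.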

Per-column class posteriors. For each column $n$ and class $c$, compute \begin {equation} \label {eq:vbem-class-posterior} q^{(c|f)}_n(c) = \frac {\classdist _{f, c}\, L^{(\mathrm {sub})}_n(c; \mathcal {F}_n)} {\sum _{c'} \classdist _{f, c'}\, L^{(\mathrm {sub})}_n(c'; \mathcal {F}_n)}, \quad q^{(c)}_n(c) = \sum _f q^{(f)}_n(f)\, q^{(c|f)}_n(c), \end {equation} where $\mathcal {F}_n$ is the Fitch subtree of column $n$ and $L^{(\mathrm {sub})}_n(c)$ is the Felsenstein up-pass likelihood under class-$c$ rate matrix.
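A minimal sketch of (C.20) for one column follows. The function name and the list-of-lists layout are assumptions made for illustration; a practical implementation would likely work in log space to avoid underflow in the Felsenstein likelihoods.

```python
def class_posteriors(class_dist, sub_lik, q_frag):
    """Per-column class posteriors.

    class_dist[f][c] : prior weight of class c under fragchar f
    sub_lik[c]       : up-pass likelihood of the column under class c
    q_frag[f]        : variational marginal over the fragchar f
    Returns (cond, marg) with cond[f][c] = q(c | f) and marg[c] = q(c).
    """
    n_frag, n_class = len(class_dist), len(sub_lik)
    cond = []
    for f in range(n_frag):
        joint = [class_dist[f][c] * sub_lik[c] for c in range(n_class)]
        norm = sum(joint)
        cond.append([x / norm for x in joint])
    marg = [sum(q_frag[f] * cond[f][c] for f in range(n_frag))
            for c in range(n_class)]
    return cond, marg


cond, marg = class_posteriors(
    class_dist=[[0.5, 0.5], [0.9, 0.1]],
    sub_lik=[0.2, 0.6],
    q_frag=[0.5, 0.5],
)
print(cond, marg)
```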

Per-class substitution counts. For each class $c$ and column $n$, weighted Felsenstein expected substitution counts on $\mathcal {F}_n$: \begin {equation} \label {eq:vbem-subst-counts} \hat {M}^{(c)}_{i,n}[a, b] = q^{(c)}_n(c)\, \expect [N_{ab}(\text {branch}; \mathcal {F}_n) \mid \text {leaf data}, c, \theta ], \qquad \hat {T}^{(c)}_{i,n}[a] = q^{(c)}_n(c)\, \expect [\text {dwell time in } a \mid \text {leaf data}, c, \theta ]. \end {equation} These are computed via the standard bridge-expectation posterior formulae for substitution-only CTMCs on trees, weighted by the per-class responsibility $q^{(c)}_n(c)$.

C.7.3 M-step from aggregated sufficient statistics

Let $\Phi = \{\Phi _i\}_i$ denote the per-family sufficient statistics aggregated across the (mini)batch. The M-step decomposes parameter group by parameter group, exploiting the route decomposition (C.37) and the responsibilities derived above.

Route attribution. For each per-character labelled WFST transition the route posterior \begin {equation} \label {eq:route-posterior} \rho ^{(r)}_{ss', \tau \tau '}(\theta ) = \frac {\omega ^{(r)}_{\tau \tau '}\, \tilde T^{\text {lab},(r)}_{ss'}} {\hat T_{ss'}(\tau , \tau '; \theta )} \end {equation} partitions the per-branch counts $W^{(v\to w)}_{ss', \tau \tau '}$ into route-attributable soft counts \begin {equation} \label {eq:route-soft-counts} \tilde W^{(r), (v\to w)}_{ss', \tau \tau '} = W^{(v\to w)}_{ss', \tau \tau '} \cdot \rho ^{(r)}_{ss', \tau \tau '}(\theta ^{(t)}). \end {equation} Each route $r \in \{R1, R2, R3\}$ has a fixed BDI / fragment / domain factor signature (Section C.8.4): R1 contributes only to within-fragment extension counts; R2 contributes to fragment-level BDI counts in domain $d$; R3 contributes to both top-level BDI counts and to fragment-level BDI counts in the destination domain $d'$.
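For one entry of $W$, the split in (C.22)–(C.23) amounts to normalising the per-route products $\omega ^{(r)} \tilde T^{\text {lab},(r)}$ and scaling by the count. The sketch below assumes the denominator $\hat T$ is the route-sum (C.37); the function name and dictionary layout are hypothetical.

```python
def route_soft_counts(w, omega, t_lab):
    """Split one expected count w across routes.

    omega[r] : singlet route weight for route r
    t_lab[r] : per-route labelled WFST entry for the same (s, s')
    Returns a dict of route-attributed soft counts that sums to w.
    """
    contrib = {r: omega[r] * t_lab[r] for r in omega}
    t_hat = sum(contrib.values())  # reduced kernel entry, as a route-sum
    if t_hat == 0.0:
        # A transition with zero kernel weight carries no attributable count.
        return {r: 0.0 for r in omega}
    return {r: w * contrib[r] / t_hat for r in omega}


soft = route_soft_counts(
    w=12.0,
    omega={"R1": 0.5, "R2": 0.3, "R3": 0.2},
    t_lab={"R1": 0.2, "R2": 0.4, "R3": 0.1},
)
print(soft)
```

Because the route posteriors sum to one, the soft counts always add back up to the original count.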

Indel rate M-step. For each domain $d$, the per-route soft counts in $\tilde W^{(R2)}$ (restricted to $d = d'$ in source/dest tuples) and $\tilde W^{(R3)}$ (with destination domain $d'$) accumulate to a per-domain $5 \times 5$ WFST transition count matrix $\hat n^{(d)}_{ss'}$ analogous to the SVI-BW pair-counts. The standard transition-count groups (Section C.1.4, transition_count_groups) and quadratic-in-$\kappa $ closed form (m_step_indel_quadratic) then deliver $(\hat \insrate _d, \hat \delrate _d)$: \begin {equation} \label {eq:vbem-indel-mstep} (\hat \insrate _d, \hat \delrate _d) = \mathrm {m\_step\_indel\_quadratic}(\hat B_d, \hat D_d, \hat S_d, \hat L_d, \hat M_d, \hat T_d;\, \text {prior}). \end {equation} The top-level rates $(\hat \insrate _\main , \hat \delrate _\main )$ are recovered analogously from $\tilde W^{(R3)}$ contributions to the top-level WFST transition counts (with R3 supplying both a top-level BDI event and a destination-domain BDI event per transition).

Within-fragment Markov M-step. For each domain $d$, the per-fragchar transition matrix $\ext ^{(d)}_{f, f'}$ has a Dirichlet-conjugate update from the R1 soft counts: \begin {equation} \label {eq:vbem-ext-mstep} \hat \ext ^{(d)}_{f, f'} = \frac {\hat E_{d, f, f'} + \alpha _\ext - 1} {\sum _{f''} \hat E_{d, f, f''} + \hat N_{d, f} + (N_\nfrag + 1)(\alpha _\ext - 1)}, \quad \hat \notext ^{(d)}_f = 1 - \sum _{f'} \hat \ext ^{(d)}_{f, f'}, \end {equation} where $\hat E_{d, f, f'} = \sum _{(v\to w), s, s'} \tilde W^{(R1), (v\to w)}_{ss', (d,f),(d,f')}$ and $\hat N_{d, f} = \sum _{(v\to w), s, s'} \big [\tilde W^{(R2)}_{ss', (d,f),(d,\cdot )} + \tilde W^{(R3)}_{ss', (d,f),(\cdot ,\cdot )}\big ]$ are the within-fragment and fragment-end soft counts respectively.
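The update (C.25) for a single source row $(d, f)$ is a MAP estimate over $N_\nfrag + 1$ outcomes (each destination fragchar plus fragment termination). A small sketch, with hypothetical argument names:

```python
def ext_m_step(ext_counts, end_count, alpha=1.0):
    """MAP update for one row of the extension matrix.

    ext_counts[f2] : R1 soft counts for f -> f2 within the fragment
    end_count      : fragment-end soft count (R2 + R3) for source f
    alpha          : symmetric Dirichlet concentration
    Returns (ext_row, notext) with sum(ext_row) + notext == 1.
    """
    n_outcomes = len(ext_counts) + 1  # N_frag destinations plus termination
    denom = sum(ext_counts) + end_count + n_outcomes * (alpha - 1.0)
    ext_row = [(e + alpha - 1.0) / denom for e in ext_counts]
    return ext_row, 1.0 - sum(ext_row)


print(ext_m_step([6.0, 2.0], 2.0))             # flat prior
print(ext_m_step([6.0, 2.0], 2.0, alpha=2.0))  # one pseudo-count per outcome
```

With $\alpha _\ext = 1$ the update reduces to the maximum-likelihood ratio of counts; the termination probability is recovered as the residual and equals $(\hat N_{d,f} + \alpha _\ext - 1)$ over the same denominator.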

Domain and fragment weight M-step. The tuple priors update via Dirichlet-conjugate from per-route soft-counts: \begin {equation} \label {eq:vbem-domw-mstep} \hat \domdist _{d'} \propto \sum _{(v\to w)} \sum _{(d, f, f'), s, s'} \tilde W^{(R3), (v\to w)}_{ss', (d,f),(d',f')} + \alpha _\dom - 1, \end {equation} \begin {equation} \label {eq:vbem-fragw-mstep} \hat \fragdist _{d', f'} \propto \sum _{(v\to w)} \sum _{(d, f), s, s'}\Big [ \delta _{d, d'} \tilde W^{(R2)}_{ss', (d,f),(d',f')} + \tilde W^{(R3)}_{ss', (d,f),(d',f')} \Big ] + \alpha _\fragdist - 1, \end {equation} both normalised over the appropriate index.

Substitution M-step. Per-class rate-matrix and stationary updates use the standard class-weighted bridge-expectation closed forms (Section B.1.10 and the substitution-mstep.tex appendix), with per-class soft-counts $\hat M^{(c)} = \sum _i \sum _n \hat M^{(c)}_{i, n}$ and $\hat T^{(c)} = \sum _i \sum _n \hat T^{(c)}_{i, n}$. The class distribution itself updates via Dirichlet-conjugate: \begin {equation} \label {eq:vbem-classdist-mstep} \hat \classdist _{d, f, c} \propto \sum _i \sum _n q^{(\tau )}_{i, n}(d, f) \cdot q^{(c|f)}_{i, n}(c) + \alpha _\classdist - 1. \end {equation}

C.7.4 Stochastic VBEM (SVI-VBEM)

For corpora with $N \gg 100$ families a full-batch outer iteration is infeasible (Section C.7.7). We adopt the SVI-BW machinery of Section B.2.4 verbatim, with one substitution: the per-batch sufficient-statistic vector $s_{\text {batch}_k}$ in (B.37) is now the per-family Tree-VBEM aggregate \begin {equation} \label {eq:svi-vbem-batch-suff} s_{\text {batch}_k} \;=\; \sum _{i \in \mathcal {B}_k} \Big \{ \tilde {W}^{(R1)}_i,\; \tilde {W}^{(R2)}_i,\; \tilde {W}^{(R3)}_i,\; q^{(\tau )}_i,\; \hat M^{(c)}_i,\; \hat T^{(c)}_i \Big \}, \end {equation} each component being aggregated linearly over the families $i \in \mathcal {B}_k$ in minibatch $k$. (The route-attributed soft counts $\tilde {W}^{(r)}_i$ are themselves linear in the per-branch $W_i^{(v\to w)}$ tensors of (C.19) and the route posteriors (C.22), so the per-family contributions sum directly without bias.) The pseudocount EMA carries one such state $\tilde {\alpha }^{(g)}_k$ per parameter group $g$ and is updated each iteration as \begin {equation} \label {eq:svi-vbem-update} \tilde {\alpha }^{(g)}_k \;=\; (1 - \eta _k)\,\tilde {\alpha }^{(g)}_{k-1} \;+\; \eta _k\bigl (\alpha ^{(g)} + (N/|\mathcal {B}_k|)\, s^{(g)}_{\text {batch}_k}\bigr ), \end {equation} with the closed-form M-step applied on demand to derive $\theta ^{(t+1)}_g = f^{(g)}(\tilde {\alpha }^{(g)}_k)$ for each group: the BDI rates per domain via (C.24); the Dirichlet groups $\domdist $, $\fragdist $, $\classdist $, $\ext ^{(d)}_{f, \cdot }$ direct from (C.25)–(C.27) and (C.28); and the per-class GTR substitution parameters via the closed forms of Section B.1.10.

The Polyak–Ruppert / ESS / Fisher analyses of Section B.2.4 (eqs. (B.39) onward) carry over verbatim, with $s_{\text {batch}_k}$ replaced by the family-aggregate (C.29).

Step-size schedule. The standard Robbins–Monro choice $\eta _k = (k + \tau )^{-\kappa }$ with $\tau \in [10, 100]$, $\kappa \in (0.5, 1]$ guarantees almost-sure convergence as in SVI-BW; the lower end of the $\kappa $ range is open because $\sum _k \eta _k^2 < \infty $ requires $\kappa > 0.5$. For tree-VBEM we suggest defaults $\tau = 10$, $\kappa = 0.7$.
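The schedule and the pseudocount EMA of (C.30) can be sketched together for one parameter group. Function names are hypothetical, and the statistics are represented as flat lists for simplicity.

```python
def step_size(k, tau=10.0, kappa=0.7):
    """Robbins-Monro step size eta_k = (k + tau) ** (-kappa)."""
    return (k + tau) ** (-kappa)


def ema_update(alpha_tilde, prior, batch_stats, n_total, batch_size, eta):
    """One pseudocount EMA step for a single parameter group.

    alpha_tilde : running pseudocounts from the previous iteration
    prior       : prior pseudocounts alpha^(g)
    batch_stats : sufficient statistics summed over the minibatch
    """
    scale = n_total / batch_size  # rescale the minibatch to corpus size
    return [(1.0 - eta) * a + eta * (p + scale * s)
            for a, p, s in zip(alpha_tilde, prior, batch_stats)]


state = [1.0, 1.0]
for k in range(1, 4):
    state = ema_update(state, [1.0, 1.0], [2.0, 0.0],
                       n_total=100, batch_size=10, eta=step_size(k))
print(state)
```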

Breadth-first minibatch sampling. I.i.d. uniform minibatch sampling concentrates updates on the $\sim |\mathcal {B}|/N$ most-recently-visited families and under-represents the long tail; per-group ESS for a parameter that is informative in only a fraction $\varepsilon $ of families collapses to $\varepsilon \,\mathrm {ESS}_K$ (the “rare-parameter bottleneck” of Section B.2.4). We instead maintain a per-family visit count $c_i^{(0)} = 0$ and select each minibatch $\mathcal {B}_k$ as the $|\mathcal {B}|$ families with the smallest $c_i^{(k-1)}$, breaking ties at random; visit counts then update $c_i^{(k)} = c_i^{(k-1)} + 1$ for $i \in \mathcal {B}_k$. This deterministic round-robin guarantees every family is visited at least once every $\lceil N/|\mathcal {B}|\rceil $ iterations (“one epoch”), so the per-family contribution arrives in $\Theta (K|\mathcal {B}|/N)$ of the first $K$ batches with no starvation.
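The selection rule can be sketched as a generator. The name `breadth_first_batches` is hypothetical; ties are broken at random by shuffling first and then relying on the stability of Python's sort.

```python
import random


def breadth_first_batches(n_families, batch_size, n_iters, seed=0):
    """Yield (batch, visit_counts) choosing the least-visited families first."""
    rng = random.Random(seed)
    counts = [0] * n_families
    for _ in range(n_iters):
        order = list(range(n_families))
        rng.shuffle(order)                    # random tie-breaking
        order.sort(key=lambda i: counts[i])   # stable: ties keep shuffled order
        batch = order[:batch_size]
        for i in batch:
            counts[i] += 1
        yield batch, list(counts)


for batch, counts in breadth_first_batches(10, 3, 4):
    print(batch, min(counts), max(counts))
```

With 10 families and batches of 3, one epoch is $\lceil 10/3 \rceil = 4$ iterations, after which every family has been visited at least once and no two visit counts differ by more than one.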

Convergence diagnostics. Per outer iteration $k$, log:

Per-batch ELBO sum, scaled by $N/|\mathcal {B}_k|$ for cross-iteration comparison.
Per-group $\tilde {\alpha }^{(g)}_k$ effective sample size, $\mathrm {ESS}_k = (\sum _j w_{j,k})^2 / \sum _j w_{j,k}^2$ with $w_{j,k} = \eta _j \prod _{i=j+1}^{k} (1-\eta _i)$, to monitor warm-up.
Visit-count distribution $\min _i c_i^{(k)}$, $\max _i c_i^{(k)}$, to confirm the breadth schedule is achieving uniform coverage.

Validation ELBO on a held-out family subset is computed once per “epoch” (every $\lceil N/|\mathcal {B}|\rceil $ iterations).
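The effective-sample-size diagnostic depends only on the step-size history. A minimal sketch (hypothetical function name; the weight of the initial state is ignored, matching the formula in the list above):

```python
def ema_ess(etas):
    """ESS of the EMA weights; etas[j - 1] holds eta_j for j = 1..k."""
    k = len(etas)
    weights = []
    for j in range(k):
        w = etas[j]
        for i in range(j + 1, k):
            w *= 1.0 - etas[i]  # decay applied by every later update
        weights.append(w)
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2


print(ema_ess([1.0 / j for j in range(1, 6)]))
print(ema_ess([(j + 10.0) ** -0.7 for j in range(1, 201)]))
```

As a sanity check, the schedule $\eta _j = 1/j$ yields uniform weights, so the ESS equals the number of batches seen; slower-decaying schedules give a smaller ESS because older batches are discounted.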

C.7.5 Convergence and ELBO monitoring

Each E-step monotonically increases the per-family ELBO at fixed $\theta $; each M-step monotonically increases the corpus-aggregate ELBO at the new $q$ (this is the standard EM monotonicity result, modulo the variational E-step’s sub-optimality).

For monitoring purposes, log:

Per-iteration corpus ELBO (sum across families).
Per-domain BDI sufficient statistics aggregates.
Per-class soft-count totals and effective sample size.
Validation-set ELBO on a held-out family subset (for early stopping and overfitting monitoring; same role as val LL/pair in SVI-BW).

Convergence in practice: corpus ELBO increase plateaus below a relative threshold ($\sim 10^{-4}$ per outer iteration), or validation ELBO begins to decrease.
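The stopping rule can be written as a small predicate over the logged ELBO histories. This is a sketch under the stated criteria only; the function name and the choice to compare just the last two entries are assumptions.

```python
def should_stop(train_elbo, val_elbo, rel_tol=1e-4):
    """Return True when the corpus ELBO plateaus or validation ELBO drops.

    train_elbo, val_elbo : lists of values logged so far (latest last)
    rel_tol              : relative improvement threshold per outer iteration
    """
    if len(train_elbo) < 2:
        return False
    prev, curr = train_elbo[-2], train_elbo[-1]
    plateau = (curr - prev) <= rel_tol * abs(prev)
    val_drop = len(val_elbo) >= 2 and val_elbo[-1] < val_elbo[-2]
    return plateau or val_drop


print(should_stop([-1000.0, -999.99], []))
print(should_stop([-1000.0, -900.0], [-10.0, -9.0]))
```

In noisy minibatch settings one would probably smooth both series or require several consecutive triggers before stopping.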

C.7.6 Initialisation and warm-start

For warm-start from an existing SVI-BW checkpoint: $\theta ^{(0)}$ is loaded directly from the checkpoint (the variational EM operates on the same parameter space as SVI-BW provided the checkpoint includes class_pis and class_S_exch; checkpoints without a class layer are auto-promoted to a 1-class-per-domain structure).

For the variational $q^{(0)}$: per-family Fitch-seeded init for the inner 3-state $q$ (as in the inference-only benchmark of Appendix C.8); per-family tuple init biased toward the substitution-likelihood-maximising fragchar (via class_marginalised_sub_LL_per_column applied to the initial $\theta $).

The first outer iteration will see a large ELBO improvement as the variational distributions are fit to the warm-start parameters. Subsequent iterations refine $\theta $ toward the tree-aggregated likelihood maximum, which differs in general from the pair-aggregated SVI-BW maximum.

C.7.7 Computational scaling and minibatching

Per-family E-step cost is dominated by Adam-ELBO evaluation: $O(|\mathcal {E}| \cdot L \cdot T^2)$ per Adam step where $T = N_\dom N_\nfrag $ is the reduced tuple count, plus $O(L \cdot N_{\mathrm {cl}} \cdot \text {Felsenstein cost})$ for the substitution likelihoods (computed once per family at the start of the E-step). Typical numbers for unified-short ($T = 6$, $N_{\mathrm {cl}} = 3$, $|\mathcal {E}| \approx 40$, $L \approx 100$, 100 Adam iters) give $\sim 30$ s/family on a single GPU.

For the full Pfam corpus ($\sim 17{,}000$ families) a full E-step pass is $\sim 140$ GPU-hours. Stochastic VBEM (Section C.7.4) replaces the full pass with a breadth-first minibatch of $\sim 10$–$200$ families per outer iteration and an EMA accumulation of sufficient statistics across iterations, reducing per-iteration cost to a few GPU-minutes with convergence in $\sim 10\,N/|\mathcal {B}|$ outer iterations (i.e. $\sim 10$ epochs).
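The figures above follow from simple arithmetic on the quoted per-family cost. The helper names below are hypothetical, and the 30 s/family figure is the estimate from the preceding paragraph.

```python
import math


def full_pass_gpu_hours(n_families, sec_per_family):
    """GPU-hours for one full-batch E-step pass over the corpus."""
    return n_families * sec_per_family / 3600.0


def batch_gpu_minutes(batch_size, sec_per_family):
    """GPU-minutes for the E-step of one minibatch."""
    return batch_size * sec_per_family / 60.0


def outer_iterations(n_families, batch_size, epochs=10):
    """Outer iterations needed for the given number of epochs."""
    return epochs * math.ceil(n_families / batch_size)


print(full_pass_gpu_hours(17000, 30.0))
print(batch_gpu_minutes(10, 30.0))
print(outer_iterations(17000, 100))
```

A minibatch of 10 families costs about 5 GPU-minutes at this rate; the per-iteration cost grows linearly with the minibatch size, while the number of outer iterations per epoch shrinks correspondingly.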

C.7.8 Comparison to SVI-BW

The SVI-BW pipeline trains $\theta $ from sequence pairs sampled from the corpus, using the labelled Pair HMM forward-backward as its inference primitive. Tree-VBEM trains from whole MSAs with their tree topology, using the variational TreeVarAnc-MixDom inference of Appendix C.8.

Differences:

Information per training datum. A pair contributes two-leaf data; a family contributes $|\text {leaves}|$-leaf joint data with phylogenetic structure. Tree-VBEM extracts more information per datum but requires more compute per datum.
Bias. SVI-BW assumes pairs are i.i.d. samples from the model; real pairs share evolutionary history (within-family pairs are not independent under the tree). Tree-VBEM correctly handles this correlation. The cost: SVI-BW’s pair-likelihood is the exact data likelihood under its assumption; tree-VBEM’s ELBO is a lower bound on the family likelihood (with a non-trivial variational gap).
Application alignment. Parameters trained by tree-VBEM are likely better-suited to tree-based downstream tasks (ancestral reconstruction, progressive alignment) since the training objective matches the inference task. Parameters trained by SVI-BW may be better for pairwise tasks (pairwise alignment scoring).

Empirically, the two regimes can be combined: SVI-BW for fast warm-up followed by tree-VBEM for task-specific fine-tuning, exploiting both algorithms’ strengths.

C.8 Mixture-of-trees variational MixDom ancestral inference

This appendix generalises the variational ancestral-presence reconstruction of Section B.6 from the TKF92-conditional-WFST-as-MaxEnt-GGI approximation to the labelled MixDom model, with two further simplifications relative to a fully labelled treatment:

the per-character substitution class $c$ is integrated out at the model level via the per-column prior $\classdist _{f, c}$ (rather than carried as a variational latent);
the fragchar-boundary indicator $g$ and domain-end indicator $e$ are absorbed into a reduced per-character WFST kernel that sums over the implicit fragment/domain bookkeeping at each step (rather than tracked explicitly in the variational state).

The variational state per (internal node, MSA column) is therefore $\{N, D\} \cup \{(d, f)\}$, with $|\mathcal {Z}| = N_{\text {dom}} N_{\text {fr}} + 2$, a $(4 N_{\text {cl}})\times $ reduction over the fully-labelled state space. Three concrete payoffs: (i) the variational bound is strictly tighter than the fully-labelled ELBO at any non-Bayes-optimal class posterior, because the analytic marginalisation of $c$ and $(g, e)$ replaces two Jensen inequalities with equalities; (ii) the labelled-variant cross-column hard-zero structural constraint disappears, so per-column factorised $q^{(\tau )}$ is adequate (the column-Markov-chain upgrade is no longer required); (iii) per-branch evaluation is roughly $|\mathcal {T}^{\text {lab}}|^2 / |\mathcal {T}|^2$-times cheaper.

The bound is $\log p(\text {MSA} \mid \parsetree , \theta ) \geq \log \tilde {p}(\text {MSA} \mid \parsetree , \theta ) \geq \mathcal {L}[q]$, in the same restricted-model sense as Appendix B.6; the restriction-gap ($q$-independent) and variational-gap (KL) decomposition carries over.

C.8.1 Setting and reduced state space

Fix a rooted phylogeny $\parsetree $ with internal-node set $\mathcal {I}$, leaf set $\mathcal {L}$, and branch lengths $\branchlen _e$ on each edge $e$. The MSA has $L$ columns; each leaf $v \in \mathcal {L}$ has an observed presence indicator $X^v_n \in \{0, 1\}$ and (when present) a residue $\anctok ^v_n \in \alphabet $.

The per-(node, column) variational state space is \begin {equation} \label {eq:Z-mixdom} \mathcal {Z} \;=\; \{N, D\} \cup \mathcal {T}, \qquad \mathcal {T} \;=\; [N_{\text {dom}}] \times [N_{\text {fr}}]. \end {equation} A tuple-valued state $\tau = (d, f) \in \mathcal {T}$ encodes only the column’s domain $d$ and fragchar $f$; the per-character class $c$ is marginalised at the model level (Section C.8.7) and the fragchar-boundary / domain-end indicators $(g, e)$ are absorbed into the reduced WFST kernel (Section C.8.4).

The presence indicator at internal node $v$, column $n$ is $X^v_n = \mathbb {1}\{Z^v_n \in \mathcal {T}\}$. Leaf states are partially clamped: $X^v_n = 0$ pins $Z^v_n \in \{N, D\}$; $X^v_n = 1$ pins $Z^v_n \in \mathcal {T}$ (consistent with the column-wide tuple $\tau _n$; see below).

C.8.2 Restricted generative model

Following Section B.6.2, the model joint over MSA columns and internal-node states is \begin {equation} \label {eq:joint-mixdom} p(\text {MSA},\, \{Z^v\}_{v \in \mathcal {I}} \mid \parsetree , \theta ) \;=\; p_{\text {singlet}}^{\text {red}}(Z^{\text {root}}) \,\prod _{(v \to w) \in \parsetree } \hat {P}^{\text {WFST}}\!(Z^w \mid Z^v, \branchlen _{vw}, \theta ) \,\prod _{n: \mathcal {F}_n \neq \emptyset } L^{(\text {sub}),\text {tot}}_n(f_n; \mathcal {F}_n), \end {equation} where $\mathcal {F}_n$ is the Fitch-parsimony subtree of column $n$, $L^{(\text {sub}),\text {tot}}_n(f; \mathcal {F}_n)$ is the class-marginalised Felsenstein column-substitution likelihood under fragchar $f$ (Section C.8.7), and $p_{\text {singlet}}^{\text {red}}, \hat {P}^{\text {WFST}}$ are the reduced singlet HMM and WFST defined on $(d, f)$ alone (Section C.8.4). As in the simple case we marginalise (C.32) over internal patterns supported on the $L$ observed columns to obtain $\tilde {p}(\text {MSA} \mid \parsetree , \theta ) \le p(\text {MSA} \mid \parsetree , \theta )$ (with the same ghost-column caveat), and bound $\log \tilde {p}$ from below by the ELBO $\mathcal {L}[q]$.

C.8.3 Variational family

The variational $q$ factorises over MSA columns, $q(\{Z^v_n\}) = \prod _{n=1}^L q_n$, and within each column, \begin {equation} \label {eq:q-mixdom-factor} q_n \;=\; q_n^{(\tau )}(\tau _n) \cdot q_n^{(\pi |\tau )}\!\big (\{Z^v_n\}_{v \in \mathcal {I}} \,\big |\, \tau _n\big ), \end {equation} where $q_n^{(\tau )}$ is a free categorical over $\mathcal {T}$ on the column-wide tuple, and $q_n^{(\pi |\tau )}$ is the same 3-state irreversible tree-structured graphical model of Section B.6.5 (with $\state {P} \equiv \tau _n$). By construction $q$ assigns zero mass to within-column configurations in which Present nodes carry differing tuples: the column has one $(d, f)$, full stop. The cross-column structural-label constraints that forced a column-Markov chain in the labelled formulation vanish under reduction (Section C.8.9).

Free parameter count. Per column: $|\mathcal {T}| - 1 = N_{\text {dom}} N_{\text {fr}} - 1$ free parameters in $q_n^{(\tau )}$, plus $2 |\mathcal {E}|$ in the inner 3-state graphical model (as in Section B.6.10), plus 2 for the inner root distribution. For typical $(N_{\text {dom}}, N_{\text {fr}}) = (3, 2)$ that is 5 tuple parameters per column, vs $4 N_{\text {cl}} N_{\text {fr}} N_{\text {dom}} - 1$ in the labelled formulation (e.g. 239 for $N_{\text {cl}} = 10$) and $\sim |\mathcal {T}^{\text {lab}}|^2$ in the labelled column-Markov upgrade ($\sim 6 \times 10^4$).
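The tuple-parameter counts quoted above can be checked directly. The helper below is a hypothetical illustration; it takes the labelled tuple space to have $4 N_{\text {cl}} N_{\text {fr}} N_{\text {dom}}$ elements, as implied by the count in the paragraph.

```python
def tuple_param_counts(n_dom, n_fr, n_cl):
    """Return per-column tuple parameter counts.

    Output: (reduced, labelled, labelled column-Markov order of magnitude).
    """
    reduced = n_dom * n_fr - 1
    labelled_states = 4 * n_cl * n_fr * n_dom
    return reduced, labelled_states - 1, labelled_states ** 2


print(tuple_param_counts(3, 2, 10))
```

For $(N_{\text {dom}}, N_{\text {fr}}, N_{\text {cl}}) = (3, 2, 10)$ the labelled tuple space has 240 elements, so the column-Markov upgrade would need on the order of $240^2 = 57{,}600 \approx 6 \times 10^4$ parameters per column pair.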

C.8.4 Reduced WFST: marginalising $(g, e)$ and the class $c$

The reduced per-character WFST kernel is \begin {equation} \label {eq:reduced-WFST} \hat {T}_{ss'}\!\big ((d, f), (d', f');\, \branchlen , \theta \big ), \qquad s, s' \in \{\sta , \mat , \ins , \del , \fin \}, \end {equation} obtained by marginalising the labelled-MixDom WFST $T^{\text {lab}}_{ss'}((c, f, g, d, e), (c', f', g', d', e'); \branchlen , \theta )$ over $(c, c', g, e, g', e')$ at each end. The reduced state space is $\{\sta , \fin \} \cup \{(s, d, f) : s \in \{\mat , \ins , \del \}\}$, i.e. $3 N_{\text {dom}} N_{\text {fr}} + 2$ states.

Routes from $(d, f)$ to $(d', f')$. Under the labelled MixDom model, the per-character labelled transition $(d, f) \to (d', f')$ admits up to three latent routes, indexed by the value of $(g, e)$ at the source position:

R1.: Intra-fragment fragchar transition ($g = 0$): the current character is not the last of its fragment, and the next character stays in the same fragment with fragchar transitioning $f \to f'$ via $\ext ^{(d)}_{f, f'}$. Restricted to $d' = d$.
R2.: New fragment, same domain ($g = 1, e = 0$): the current fragment terminates ($\notext ^{(d)}_f$) and the same domain starts a new fragment ($\kappa _d$) with first fragchar drawn from $\fragdist _{d, f'}$. Restricted to $d' = d$.
R3.: New domain ($g = 1, e = 1$): the current fragment terminates, the current domain ends ($1{-}\kappa _d$), and (after possibly skipping geometrically-many empty domains) a new domain $d'$ begins with its first fragment’s first fragchar drawn from $\fragdist _{d', f'}$. Available for any $d'$, including the self-recurrence $d' = d$.

For $d' \neq d$ only R3 is enabled, so the source’s $(g, e) = (1, 1)$ is uniquely determined by the transition. For $d' = d$, multiple routes contribute and the latent $(g, e)$ at the source carries a non-trivial posterior over routes given the observed transition. The diagonal-extension special case ($\ext ^{(d)}_{f, f'} = \ext _f \delta _{f, f'}$) suppresses R1 whenever $f' \neq f$, but R2 and R3 still mix for any $(d, f) \to (d, f')$, so the route-sum machinery introduced below is required even there. With a general off-diagonal $\ext ^{(d)}$ matrix, R1 admits $f' \neq f$ transitions, so the route-sum acquires a third contributing term.

The per-route singlet contribution, accounting for the joint prior on $(g, e)$ at the source and the singlet emission of the next character, is: \begin {align} \label {eq:omega-routes} \omega ^{(R1)}_{(d, f) \to (d', f')} &= \delta _{d, d'}\,\ext ^{(d)}_{f, f'}, \\ \omega ^{(R2)}_{(d, f) \to (d', f')} &= \delta _{d, d'}\,\notext ^{(d)}_f\,\kappa _d\,\fragdist _{d, f'}, \\ \omega ^{(R3)}_{(d, f) \to (d', f')} &= \frac {\notext ^{(d)}_f\,(1-\kappa _d)\,\kappa _\main \,\domdist _{d'}\,\kappa _{d'}\,\fragdist _{d', f'}}{1-\zeta }, \end {align}

where $\zeta = \kappa _\main \sum _{d''} \domdist _{d''}(1-\kappa _{d''})$ is the empty-domain renormalisation (equation (C.69)). Each $\omega ^{(r)}$ factorises as $P((g, e) = (g_r, e_r) \mid (d, f)) \cdot P_{\text {singlet}}((d', f') \mid (d, f), g_r, e_r)$, so summing over routes recovers the marginal singlet emission probability: \begin {equation} \label {eq:omega} \omega ^{(d, f, d', f')} \;=\; \sum _r \omega ^{(r)}_{(d, f) \to (d', f')} \;=\; \delta _{d, d'}\!\Big [\ext ^{(d)}_{f, f'} + \notext ^{(d)}_f\,\kappa _d\,\fragdist _{d, f'}\Big ] \;+\; \frac {\notext ^{(d)}_f\,(1-\kappa _d)\,\kappa _\main \,\domdist _{d'}\,\kappa _{d'}\,\fragdist _{d', f'}}{1-\zeta }. \end {equation}
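The route weights and their normalisation can be checked numerically. In the sketch below all names are hypothetical, and the termination weight $\notext ^{(d)}_f (1-\kappa _d)(1-\kappa _\main )/(1-\zeta )$ is an assumption inferred from the requirement that the R3 row (destinations plus termination) sums to $\notext ^{(d)}_f (1-\kappa _d)$; it is not copied from the singlet table.

```python
def route_weights(d, f, ext, kappa, kappa_main, dom_dist, frag_dist):
    """Per-route singlet weights from source (d, f) to every (d2, f2).

    ext[d][f][f2]   : within-fragment extension probabilities
    kappa[d]        : per-domain fragment-continuation parameter
    kappa_main      : top-level continuation parameter
    dom_dist[d]     : domain weights; frag_dist[d][f]: first-fragchar weights
    Returns (omega, termination), omega[(d2, f2)] = {"R1", "R2", "R3"} weights.
    """
    n_dom, n_fr = len(dom_dist), len(frag_dist[0])
    notext = 1.0 - sum(ext[d][f])
    zeta = kappa_main * sum(dom_dist[x] * (1.0 - kappa[x]) for x in range(n_dom))
    omega = {}
    for d2 in range(n_dom):
        for f2 in range(n_fr):
            same = 1.0 if d2 == d else 0.0
            r1 = same * ext[d][f][f2]
            r2 = same * notext * kappa[d] * frag_dist[d][f2]
            r3 = (notext * (1.0 - kappa[d]) * kappa_main * dom_dist[d2]
                  * kappa[d2] * frag_dist[d2][f2] / (1.0 - zeta))
            omega[(d2, f2)] = {"R1": r1, "R2": r2, "R3": r3}
    # Assumed termination weight (see lead-in): the residual of the R3 row.
    termination = notext * (1.0 - kappa[d]) * (1.0 - kappa_main) / (1.0 - zeta)
    return omega, termination


ext = [[[0.5, 0.2], [0.1, 0.6]], [[0.3, 0.3], [0.2, 0.4]]]
omega, termination = route_weights(
    d=0, f=0, ext=ext, kappa=[0.6, 0.7], kappa_main=0.8,
    dom_dist=[0.4, 0.6], frag_dist=[[0.5, 0.5], [0.3, 0.7]],
)
total = sum(sum(routes.values()) for routes in omega.values()) + termination
print(total)
```

Under this assumption the total over all destinations, routes and termination is 1, which mirrors the row-sum sanity check used in the derivation of the reduced kernel.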

Marginalisation of $c$. The class $c$ at column $n$ is generated once per column by $\classdist _{f_n, \cdot }$ at column birth and governs the substitution likelihood across all branches. The labelled WFST’s indel block is class-independent, so $\sum _c \classdist _{f, c} = 1$ trivially. Class-marginalisation of $\hat {T}$ has no residue at the indel level; it is handled separately in the substitution term (Section C.8.7).

Step-by-step derivation of the reduced kernel. The route-sum (C.37) is not stipulated; it follows from two facts about the labelled MixDom model. We state and prove both.

Notation. Write $Z_n = (d_n, f_n, g_n, e_n)$ for the labelled chain state at position $n$ (dropping $c_n$, since the indel block is class-independent) and $\tau _n = (d_n, f_n)$ for the reduced state. Let $\sigma $ denote the labelled singlet HMM transition kernel of mixdom-wfst.tex Section C.10.2, and let $T^{\text {lab}}_{ss'}$ denote the labelled conditional WFST kernel of Section C.10.3. The labelled per-character Pair HMM joint factorises as \begin {equation} \label {eq:pair-fact} P\big ((s', Z') \mid (s, Z)\big ) \;=\; T^{\text {lab}}_{ss'}\!\big (Z, Z'\big )\, \sigma \!\big (Z' \mid Z\big ), \end {equation} i.e. the WFST conditional weight times the singlet emission weight; this is the construction of mixdom-wfst.tex Section C.10.3 (Singlet $\circ $ WFST $=$ Pair HMM).

Singlet-table convention. Throughout the proof we use the singlet HMM table at mixdom-wfst.tex line 165–170 with the following normalisation (consistent with the joint-contribution language at line 187–189 and the marginalisation check at line 228–237). Two structural facts about the labelled MixDom chain underpin this:

(a): $e$ is fragment-level. The domain-end indicator $e$ is constant for all characters within a single fragment (it records whether the current fragment is the last in its domain, which is set when the fragment begins and does not change as the fragment extends). Hence within-fragment transitions ($g{=}0$ rows) have $e' = e$ implicitly, and a single “$g{=}0$ row” covers all source $e$ values without depending on $e$.
(b): Destination indicators $(g', e')$ are summed implicitly. The destination $(g', e')$ in each row is treated as a free index ranging over $\{0, 1\}^2$, with the implicit prior $\pi (g', e' \mid d', f')$ to be applied separately if a finer-grained joint over destination indicators is needed.

Under these conventions, the table entry $\widetilde \sigma $ for a given source-row $(d, f, g, e)$ and destination structural state $(d', f')$ is the joint \[ \widetilde \sigma ((d, f, g, e) \to (d', f')) \;=\; \pi (g, e \mid d, f)\,\cdot \,\sigma _{\text {cond}}((d', f') \mid (d, f, g, e)), \] where $\sigma _{\text {cond}}$ is the source-conditional next-character emission probability and $\pi (g, e \mid d, f)$ is the source-row marginal. As a sanity check: summing each row of the table over $(d', f')$ recovers exactly $\pi (g, e \mid d, f)$, which sums in turn to 1 over the three rows (with termination column included). For the $g{=}0$ row: $\sum _{f'} \ext ^{(d)}_{f, f'} = 1{-}\notext ^{(d)}_f = \pi (g{=}0 \mid d, f)$. For the $g{=}1, e{=}0$ row: $\sum _{f'} \notext ^{(d)}_f \kappa _d \fragdist _{d, f'} = \notext ^{(d)}_f \kappa _d = \pi (g{=}1, e{=}0 \mid d, f)$. For the $g{=}1, e{=}1$ row, mixdom-wfst.tex line 228–237 verifies the row sum (over destinations plus termination) equals $\notext ^{(d)}_f (1{-}\kappa _d) = \pi (g{=}1, e{=}1 \mid d, f)$.

Lemma 1 (Conditional independence of $(g, e)$). Under the labelled MixDom singlet construction, $(g_n, e_n) \perp \{Z_k : k < n\} \mid (d_n, f_n)$ for every $n \geq 1$, with \begin {align} \label {eq:ge-prior} \pi (g{=}0 \mid d, f) &= 1 - \notext ^{(d)}_f, \\ \pi (g{=}1, e{=}0 \mid d, f) &= \notext ^{(d)}_f \,\kappa _d, \\ \pi (g{=}1, e{=}1 \mid d, f) &= \notext ^{(d)}_f\,(1 - \kappa _d). \end {align}

Proof. By the singlet-table convention above, the labelled-chain joint factorises position-by-position as \[ P(Z_n \mid Z_{n-1}) = P(d_n, f_n \mid Z_{n-1}) \cdot \pi (g_n, e_n \mid d_n, f_n), \] because the table entry at position $n$ has the destination prior $\pi (g_n, e_n \mid d_n, f_n)$ as an explicit multiplicative factor (line 187–189 of mixdom-wfst.tex), independent of $Z_{n-1}$ beyond what flows through $(d_n, f_n)$. Marginalising the chain joint over $Z_{1:n-1}$ at fixed $(d_n, f_n)$ leaves the $\pi (g_n, e_n \mid d_n, f_n)$ factor untouched, establishing $P(g_n, e_n \mid d_n, f_n, Z_{1:n-1}) = \pi (g_n, e_n \mid d_n, f_n)$. Substituting the explicit Bernoulli forms ($g_n \sim \text {Bern}(\notext ^{(d_n)}_{f_n})$ and, given $g_n = 1$, $e_n \sim \text {Bern}(1 - \kappa _{d_n})$, with $e_n$ at $g_n = 0$ folded into the joint as the residual $1 - \notext ^{(d)}_f$ row) gives the three prior probabilities stated in the lemma.

Lemma 2 (Singlet route-decomposition). The marginal singlet emission $\omega ^{(d, f, d', f')} := \sum _{g, e, g', e'} \widetilde \sigma ((d, f, g, e) \to (d', f', g', e'))$ decomposes as $\omega = \omega ^{(R1)} + \omega ^{(R2)} + \omega ^{(R3)}$, with the per-route weights $\omega ^{(R1)}$, $\omega ^{(R2)}$, $\omega ^{(R3)}$ given in the display preceding (C.35).

Proof. The singlet table mixdom-wfst.tex line 165–170 partitions the source $(g, e)$ values into three rows:

Row $g{=}0$ (combining all source $e$ values, since the $g{=}0$ row weight does not depend on source $e$): table entry $\widetilde \sigma _{R1} = \ext ^{(d)}_{f, f'}$ for destination $(d, f', g', e')$ summed over destination indicators.
Row $g{=}1, e{=}0$: $\widetilde \sigma _{R2} = \notext ^{(d)}_f \kappa _d \fragdist _{d, f'}$.
Row $g{=}1, e{=}1$: $\widetilde \sigma _{R3} = \notext ^{(d)}_f (1{-}\kappa _d) \kappa _\main \domdist _{d'} \kappa _{d'} \fragdist _{d', f'} / (1{-}\zeta )$.

By the singlet-table convention, each row is the joint contribution including the source-$(g, e)$ prior; in particular $\widetilde \sigma _{R1} = \pi (g{=}0 \mid d, f) \cdot \sigma _{\text {cond}}((d, f', g'{=}\cdot , e'{=}\cdot ) \mid (d, f, g{=}0, e))$ summed over destination $(g', e')$. Thus $\omega ^{(r)} = \widetilde \sigma _r$ directly: the row weight already equals the route contribution to the marginal singlet emission, so no separate “prior cancellation” step is needed — the source prior was baked in from the start. Summing the three rows gives $\omega ^{(d, f, d', f')}$ in (C.35). The fact that each row has support only on destinations consistent with its route is by inspection of the singlet table.

Proposition (Reduced kernel). The reduced per-character marginal Pair HMM kernel, $\hat T_{ss'}((d, f), (d', f')) := \sum _{g, e, g', e'} \pi (g, e \mid d, f) \cdot P((s', d', f', g', e') \mid (s, d, f, g, e))$, satisfies the route-sum (C.37).

Proof. Substitute the Pair HMM factorisation (C.36), and replace $\pi (g, e \mid d, f) \cdot \sigma _{\text {cond}}(\cdots ) = \widetilde \sigma (\cdots )$ via the singlet-table convention: \begin {align*} \hat T_{ss'}((d, f), (d', f')) &= \sum _{g, e, g', e'} \pi (g, e \mid d, f)\, T^{\text {lab}}_{ss'}\!\big ((d, f, g, e), (d', f', g', e')\big )\, \sigma _{\text {cond}}\!\big ((d', f', g', e') \mid (d, f, g, e)\big ) \\ &= \sum _{g, e, g', e'} T^{\text {lab}}_{ss'}\!\big ((d, f, g, e), (d', f', g', e')\big )\, \widetilde \sigma ((d, f, g, e) \to (d', f', g', e')). \end {align*}

By Lemma 2, $\widetilde \sigma $ has support partitioned into the three routes $R1, R2, R3$, each with a unique source $(g_r, e_r)$. The labelled WFST entry for fixed source $(g_r, e_r)$ and structural transition $(s, s')$ depends on the destination $(g', e')$ in only two ways: (i) the label-preservation constraint on $\mat $ (which forces $g' = g_{r,\text {dest}}$ and $e' = e_{r,\text {dest}}$ where the destination-side $(g_r, e_r)$ are determined by the route and the destination structural label, trivialising the destination sum); and (ii) for $\ins $, the destination prior on $(g', e')$ at the new $(d', f')$ is folded into the WFST as a $\pi (g', e' \mid d', f')$ factor (mixdom-wfst.tex eq. (C.74) gives the WFST = Pair HMM/singlet division explicitly). In both cases the $(g', e')$ summation cleanly factors out: \[ \sum _{g', e'} T^{\text {lab}}_{ss'}((d, f, g_r, e_r), (d', f', g', e'))\,\pi (g', e' \mid d', f')^{-1}\,\pi (g', e' \mid d', f') = \tilde T^{\text {lab}, (r)}_{ss'}((d, f), (d', f')) \] (the inverse-prior in the WFST cancels the prior in $\widetilde \sigma $’s destination factor; for $\mat $ the inverse prior is trivially 1 because of label preservation). Combining this with the source weights $\widetilde \sigma _r$, which after summation over destination indicators equal $\omega ^{(r)}$ by Lemma 2, we obtain eq. (C.37).

Corollary (Markov property of reduced chain). The marginal chain in $(d, f)$ obtained by summing $(g, e)$ out of the labelled singlet chain is exactly Markov, with per-step kernel $\hat T_{\sta s}$ at the reduced state space.

Proof. We show $P(\tau _n \mid \tau _{1:n-1}) = P(\tau _n \mid \tau _{n-1})$ for all $n \geq 2$, which is the defining property of a Markov chain. By the singlet’s chain factorisation, $P(\tau _n \mid Z_{1:n-1}) = \sigma _{\text {cond, marg}}(\tau _n \mid Z_{n-1})$, which depends on $Z_{n-1} = (\tau _{n-1}, g_{n-1}, e_{n-1})$. Marginalising $(g_{n-1}, e_{n-1})$ against its conditional distribution given $\tau _{1:n-1}$: \[ P(\tau _n \mid \tau _{1:n-1}) = \sum _{g_{n-1}, e_{n-1}} P(g_{n-1}, e_{n-1} \mid \tau _{1:n-1})\, \sigma _{\text {cond, marg}}(\tau _n \mid \tau _{n-1}, g_{n-1}, e_{n-1}). \] By Lemma 1 (conditional independence of $(g, e)$), $P(g_{n-1}, e_{n-1} \mid \tau _{1:n-1}) = \pi (g_{n-1}, e_{n-1} \mid \tau _{n-1})$, depending only on $\tau _{n-1}$. Substituting: \[ P(\tau _n \mid \tau _{1:n-1}) = \sum _{g_{n-1}, e_{n-1}} \pi (g_{n-1}, e_{n-1} \mid \tau _{n-1})\, \sigma _{\text {cond, marg}}(\tau _n \mid \tau _{n-1}, g_{n-1}, e_{n-1}) \] which is $\hat T_{\sta s}$ (eq. (C.37) for the $s = s' = $ trivial-transition case, i.e. depending only on $(\tau _{n-1}, \tau _n)$). This depends only on $\tau _{n-1}$, establishing the Markov property. The product factorisation $P(\tau _1, \ldots , \tau _L) = \prod _n \hat T(\tau _{n-1}, \tau _n)$ follows by chain rule.

Caveats. Two technical conditions buried in the proof deserve explicit flagging:

1.: Independence of WFST weights from destination $(g', e')$: the labelled WFST entries $T^{\text {lab}}_{ss'}((d, f, g, e), (d', f', g', e'))$ depend on the structural transition type ($\mat $/$\ins $/$\del $ at both ends) and on the source $(g, e)$ via the case split into mid-fragment / fragment-boundary / domain-boundary, but they do not depend on the destination $(g', e')$ except through the trivial label-preservation ($\mat $ requires matching labels) constraint already implicit in $T^{\text {lab}}$. This is by inspection of the labelled WFST tables of mixdom-wfst.tex Section C.10.3.
2.: Boundary entries: the proof above covers non-boundary positions $1 \leq n \leq L$ with $(s, s') \in \{\mat , \ins , \del \}^2$. The $\sta $ and $\fin $ row/column entries of $\hat T$ require separate derivation because the $\sta $ source has no preceding $(g, e)$ to enumerate routes over, and the $\fin $ destination requires the singlet’s termination weight rather than a next-character emission. These are flagged as Section C.8.11, item I.2.

Markov property of the reduced chain (consequence). The latent $(g, e)$ at position $n$ depends only on $(d_n, f_n)$ and the model parameters (Lemma 1). Consequently, summing out $(g, e)$ at each position yields a marginal chain in $(d, f)$ that remains exactly Markov, with per-step transition kernel $\hat T$ defined below. This justifies the path-LL formulation in Section C.8.5 as exact (not a mean-field approximation) under the reduced state space.

Reduced kernel as a route-sum. Combining the route enumeration with the labelled WFST entries gives the reduced per-character kernel as a sum of route contributions: \begin {equation} \label {eq:T-hat} \boxed {\; \hat {T}_{ss'}\!\big ((d, f), (d', f');\, \branchlen , \theta \big ) \;=\; \sum _{r \in \mathcal {R}((d, f), (d', f'))} \omega ^{(r)}_{(d, f) \to (d', f')}\, \tilde {T}^{\text {lab}, (r)}_{ss'}\!\big ((d, f), (d', f');\, \branchlen , \theta \big ), \;} \end {equation} with route set $\mathcal {R}((d, f), (d', f')) = \{R3\}$ for $d' \neq d$ and $\{R1, R2, R3\}$ for $d' = d$, and per-route labelled WFST entry \begin {equation} \label {eq:tilde-T-lab-r} \tilde {T}^{\text {lab}, (r)}_{ss'}\!\big ((d, f), (d', f');\, \branchlen , \theta \big ) \;=\; \sum _{g', e'} T^{\text {lab}}_{ss'}\!\big ((d, f, g_r, e_r), (d', f', g', e');\, \branchlen , \theta \big ), \end {equation} where $T^{\text {lab}}$ is the conditional labelled WFST of Section C.10.3, defined as the labelled Pair HMM joint divided by the labelled singlet emission, and $(g_r, e_r)$ is the source $(g, e)$ for route $r$: $(g_{R1}, e_{R1}) = (0, e)$ for any $e$ (since within-fragment transitions do not constrain $e$; the labelled-singlet table at mixdom-wfst.tex line 165 treats $e$ as a free variable in the $g{=}0$ row, so we adopt $(0, 0)$ as the canonical representative); $(g_{R2}, e_{R2}) = (1, 0)$; and $(g_{R3}, e_{R3}) = (1, 1)$. The product $\omega ^{(r)} \cdot \tilde {T}^{\text {lab}, (r)}_{ss'}$ then equals the per-route contribution to the labelled Pair HMM joint $(d, f) \to (d', f')$ summed over destination indicators $(g', e')$, so summing over routes recovers the marginal Pair HMM joint at the reduced state space; division of $T^{\text {lab}}$ by the singlet was applied per-route inside $\omega ^{(r)}$ to avoid double-counting.

When the route-sum collapses. The previously-claimed factorisation $\hat T = \omega \cdot \tilde T^{\text {lab}}$ holds only if, for the given $(s, s')$, the labelled WFST entry $\tilde T^{\text {lab}, (r)}_{ss'}$ is independent of $r$ across the enabled routes. This essentially never happens for the indel block. For $\mat \to \mat $, the per-route conditional WFST weights (Pair HMM joint divided by the route’s singlet emission) are \[ W^{(R1)}_{\mat \mat } = 1, \qquad W^{(R2)}_{\mat \mat } = (1{-}\beta _d)\alpha _d, \qquad W^{(R3)}_{\mat \mat } \;\propto \; \tau ^{(d)}_{\mat ,\fin }\, \nonemptytrans ^{[d \to d']}_{\mat ,\mat }\,\tau ^{(d')}_{\sta ,\mat }, \] where $W^{(R3)}_{\mat \mat }$ involves the destination-domain-specific top-level effective transition (in the implementation, T_exit_k; structurally, $\nonemptytrans ^{[d\to d']}_{\mat ,\mat }$ is the destination-typed version of the empty-domain-summed top-level $\mat \to \mat $ kernel, with the $\domdist _{d'}$ and other normalisation factors absorbed — see Section C.10.6). The three weights are distinct in general: R1 carries no separate BDI factor because in-fragment extension is generated by the singlet alone (the descendant character matches trivially via fragment continuation); R2 carries the standard new-fragment match factor $(1{-}\beta _d)\alpha _d$ (the singlet’s $\kappa _d$ cancels into $\tau ^{(d)}_{\mat ,\mat } = (1{-}\beta _d)\alpha _d\kappa _d$); R3 carries the cross-domain entry weight, involving both the source domain’s exit $\tau ^{(d)}_{\mat ,\fin } = (1{-}\beta _d)(1{-}\kappa _d)$ and the destination domain’s TKF92 entry $\tau ^{(d')}_{\sta ,\mat } = (1{-}\beta _{d'})\alpha _{d'}\kappa _{d'}$. Hence the route-sum genuinely cannot be collapsed to a single labelled-WFST entry for any $(d, f) \to (d', f')$ transition that admits more than one route, which includes every same-domain transition $(d, f) \to (d, f')$.

The reduced kernel is no more row-stochastic than the labelled one was — both are conditional transducers — but the input-conditional normalisation of the transducer is preserved by the route-sum marginalisation.

Numerical verification. The route-sum (C.37) has been verified at $t = 0.1$ on a $(N_\text {dom}, N_\text {fr}) = (2, 2)$ instance (random parameters, fixed seed) by reconstructing every $\mat \!\to \!\mat $ entry of the Pair HMM as $\sum _r \omega ^{(r)}\,W^{(r)}_{\mat \mat }$ and comparing against the corresponding entry of build_nested_trans; the maximum absolute discrepancy across the 16-entry $\mat \!\to \!\mat $ block is $1.1 \times 10^{-16}$, i.e. floating-point noise. The script is at python/verify_reduced_wfst_routes.py. The previous $t = 0$ check did not test the inter-route discrepancy in $W^{(r)}_{ss'}$ because at $t = 0$ the WFST collapses to (near-)identity and all routes coincide trivially.

C.8.5 Per-branch path log-likelihood

Read parent and child labelled states column-by-column on branch $v \to w$ to obtain the joint sequence $(Z^v_n, Z^w_n)_{n=1}^L$. By the same-tuple invariant of $q_n$, only five joint configurations carry positive variational mass, mapping to WFST states as in equation (B.61) (with $\tau $ now in the reduced $\mathcal {T}$): \begin {equation} \label {eq:state-map-mixdom-reduced} S_n = \begin {cases} \mat & (Z^v_n, Z^w_n) = (\tau , \tau ), \\ \del & (Z^v_n, Z^w_n) = (\tau , \state {D}), \\ \ins & (Z^v_n, Z^w_n) = (\state {N}, \tau ), \\ \mathsf {Ig} & (Z^v_n, Z^w_n) \in \{(\state {N},\state {N}), (\state {D}, \state {D})\}. \end {cases} \end {equation} With sentinels $S_0 = \sta $, $S_{L+1} = \fin $, the strip-Ignore reduction of equation (B.62) carries over: \begin {equation} \label {eq:branch-LL-mixdom-reduced} \log P\!\big (X^w \mid X^v, \branchlen , \theta \big ) \;=\; \sum _{N=1}^{L+1} \delta (S_N \neq \mathsf {Ig}) \sum _{M=0}^{N-1} \delta (S_M \neq \mathsf {Ig}) \log \hat {T}_{S_M S_N}\!\big (\tau _{n(M)},\, \tau _{n(N)};\, \branchlen , \theta \big ) \prod _{K=M+1}^{N-1} \delta (S_K = \mathsf {Ig}), \end {equation} with the reduced WFST $\hat {T}$ from equation (C.37) replacing the labelled $T^{\text {lab}}$. Tuples are now in $\mathcal {T} = [N_{\text {dom}}] \times [N_{\text {fr}}]$.
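The strip-Ignore reduction can be sketched in a few lines of Python. The double sum with the inner $\delta (S_K = \mathsf {Ig})$ product selects exactly the pairs of consecutive non-Ignore states, so the branch log-likelihood is a sum of $\log \hat {T}$ over adjacent entries of the Ignore-stripped path. The function names, the string encodings `"N"`, `"D"`, `"S"`, `"E"`, and the callable `log_t_hat` are assumptions made for illustration; they are not taken from the implementation.

```python
IG = "Ig"

def column_state(parent, child):
    """Map a (parent, child) labelled-state pair to a WFST state.

    Each argument is a tuple tau = (d, f), or the string "N" (not yet
    inserted) or "D" (deleted). Returns (state, tau); tau is None for Ig.
    """
    p_gap = parent in ("N", "D")
    c_gap = child in ("N", "D")
    if not p_gap and not c_gap:
        if parent != child:
            raise ValueError("same-tuple invariant violated")
        return "M", parent
    if not p_gap and child == "D":
        return "D", parent
    if parent == "N" and not c_gap:
        return "I", child
    if (parent, child) in (("N", "N"), ("D", "D")):
        return IG, None
    raise ValueError("configuration carries no variational mass")

def branch_log_likelihood(columns, log_t_hat):
    """Sum log T-hat over consecutive non-Ignore states.

    columns: list of (parent, child) pairs, one per alignment column.
    log_t_hat(s, tau, s2, tau2): hypothetical callable returning the log
    of the reduced WFST entry; sentinels are passed as ("S", None) and
    ("E", None).
    """
    path = [("S", None)]
    path += [st for st in (column_state(p, c) for p, c in columns)
             if st[0] != IG]
    path.append(("E", None))
    return sum(log_t_hat(s, t, s2, t2)
               for (s, t), (s2, t2) in zip(path, path[1:]))
```

Stripping Ignore columns first keeps the cost linear in $L$, whereas a literal evaluation of the double sum is quadratic.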

C.8.6 Per-column expected indel log-likelihood under $q$

The variational column-state probabilities $P^{v\to w}_q(S_n = s)$ depend only on the presence/absence pattern through the inner $q_n^{(\pi |\tau )}$ (which is $\tau $-independent in our factorisation), so the same belief-propagation machinery from the simple case (equation (??)) computes them.

The per-branch expected indel log-likelihood factorises as \begin {equation} \label {eq:E-branch-LL-mixdom-reduced} \expect _q\!\big [\log P(X^w \mid X^v, \branchlen , \theta )\big ] = \sum _{s, s'} \sum _{\tau , \tau ' \in \mathcal {T}} W^{(v\to w)}_{ss', \tau \tau '}\, \log \hat {T}_{ss'}\!\big (\tau , \tau ';\, \branchlen , \theta \big ), \end {equation} with the reduced expected transition counts \begin {equation} \label {eq:W-pair-mixdom-reduced} W^{(v\to w)}_{ss', \tau \tau '} \;=\; \sum _{N=1}^{L+1}\!\sum _{M=0}^{N-1} q^{(\tau )}_{n(M)}(\tau )\, q^{(\tau )}_{n(N)}(\tau ')\, P^{v\to w}_{q,M}(s)\, P^{v\to w}_{q,N}(s') \prod _{K=M+1}^{N-1} P^{v\to w}_{q,K}(\mathsf {Ig}), \end {equation} where the per-column factorisation of $q^{(\tau )}$ has reduced the pairwise tuple weight to a product of per-column marginals (no chain forward-backward needed). The cumulant trick of Section B.6.8 lifts unchanged to the inner $K$-product, with stable computation in $O(L \cdot |s|^2 \cdot |\mathcal {T}|^2)$ per branch via the running-logaddexp prefix on $\log P^{v\to w}_q(\mathsf {Ig})$. For $(N_{\text {dom}}, N_{\text {fr}}) = (3, 2)$ this gives $|\mathcal {T}|^2 = 36$ versus $|\mathcal {T}^{\text {lab}}|^2 = 57600$ in the labelled variant ($\sim 1600\times $ cheaper per branch).
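A minimal sketch of the running-accumulator evaluation of the reduced expected transition counts follows. It works in linear space for readability (the text above uses a running-logaddexp prefix for stability), and it assumes the caller has already encoded the sentinel positions $0$ and $L+1$ as ordinary rows of the inputs. The function name and the array layout are illustrative assumptions.

```python
def expected_transition_counts(q, p, ig):
    """W[s, s2, t, t2] via one pass over columns.

    q[n][t]: per-column tuple marginal at position n.
    p[n][s]: per-column state probability at position n.
    ig:      index of the Ignore state within each p[n].
    """
    n_states, n_tuples = len(p[0]), len(q[0])
    keep = [s for s in range(n_states) if s != ig]
    W = {(s, s2, t, t2): 0.0
         for s in keep for s2 in keep
         for t in range(n_tuples) for t2 in range(n_tuples)}
    # acc[s, t] = sum over M < N of q_M(t) p_M(s) prod_{M<K<N} p_K(Ig)
    acc = {(s, t): 0.0 for s in keep for t in range(n_tuples)}
    for qn, pn in zip(q, p):
        # Pair every accumulated source with the current destination.
        for (s, t), a in acc.items():
            for s2 in keep:
                for t2 in range(n_tuples):
                    W[s, s2, t, t2] += a * pn[s2] * qn[t2]
        # Extend open runs through an Ignore column, then add this
        # column as a new source.
        for s in keep:
            for t in range(n_tuples):
                acc[s, t] = acc[s, t] * pn[ig] + pn[s] * qn[t]
    return W
```

Each column costs $O(|s|^2 |\mathcal {T}|^2)$, matching the per-branch complexity stated above.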

C.8.7 Per-column expected substitution log-likelihood

For each column $n$, define the Fitch subtree $\mathcal {F}_n \subseteq \parsetree $ as the smallest connected subtree containing all leaves $v$ with $X^v_n = 1$. By stationarity and reversibility of the substitution model, only nodes inside $\mathcal {F}_n$ contribute to the substitution likelihood; nodes outside are treated as missing data and absorbed into normalisation. The Fitch subtree is determined by the leaf data alone, independently of $q$ and $\tau $.

Because $c$ is a per-column model latent (one class per column, governing all branches at that column), the column-substitution likelihood under fragchar $f$ marginalises $c$ at the model level: \begin {equation} \label {eq:L-sub-total} L^{(\text {sub}), \text {tot}}_n(f; \mathcal {F}_n) \;=\; \sum _{c=1}^{N_{\text {cl}}} \classdist _{f, c}\, L^{(\text {sub})}_n(c; \mathcal {F}_n), \end {equation} with \begin {equation} \label {eq:L-sub-mixdom} L^{(\text {sub})}_n(c; \mathcal {F}_n) = \sum _{a \in \alphabet } \eqm ^{(c)}_a\,\beta _n^{r_n}(a; c) \end {equation} the standard Felsenstein up-pass likelihood at column $n$ on $\mathcal {F}_n$ under class-$c$’s rate matrix $\revsub ^{(c)}$ and stationary distribution $\eqm ^{(c)}$ ($r_n$ is the Fitch-determined root of $\mathcal {F}_n$, $\beta _n^v(a; c)$ the standard Felsenstein up-message). The expected substitution log-likelihood under $q$ involves only the fragchar marginal $q_n^{(f)}(f) = \sum _d q_n^{(\tau )}(d, f)$: \begin {equation} \label {eq:E-sub-mixdom-reduced} \boxed {\; \expect _{q_n^{(\tau )}}\!\big [\log L^{(\text {sub}), \text {tot}}_n(f_n; \mathcal {F}_n)\big ] \;=\; \sum _{f=1}^{N_{\text {fr}}} q_n^{(f)}(f)\, \log \!\Big [\sum _c \classdist _{f, c}\, L^{(\text {sub})}_n(c; \mathcal {F}_n)\Big ]. \;} \end {equation} The presence factor $q_n^{(\pi |\tau )}$ does not enter: the substitution term is fully determined by the data-determined Fitch subtree and the variationally determined fragchar marginal.

Why $\log \sum _c \classdist L$, not $\sum _c \classdist \log L$. This is the proper integration over the per-column class prior. A labelled-variant counterpart that carried $c$ as a variational latent would instead contribute $\sum _c q^{(c)}_n(c) \log L^{(\text {sub})}_n(c)$ together with the class-prior and entropy terms of $q^{(c)}_n$. Writing $L_c = L^{(\text {sub})}_n(c; \mathcal {F}_n)$, Jensen’s inequality gives \[ \log \sum _c \classdist _{f, c}\, L_c \;\geq \; \sum _c q^{(c)}_n(c)\,\big [\log \classdist _{f, c} + \log L_c - \log q^{(c)}_n(c)\big ] \] for every $q^{(c)}_n$, with equality only when $q^{(c)}_n$ is at the Bayes-optimal class posterior $q^*(c) \propto \classdist _{f, c} L^{(\text {sub})}_n(c)$. The reduction analytically marginalises $c$ at the place where the labelled formulation left an inequality, so the reduced ELBO is at least as tight as the labelled ELBO and strictly tighter for any $q^{(c)}$ off the Bayes-optimal class posterior.
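The comparison between the analytic class marginal and a variational class latent can be checked numerically on a single column. The sketch below evaluates the marginal $\log \sum _c \classdist _{f, c} L_c$ and the standard variational lower bound with its prior and entropy terms; the function names and the toy numbers used with them are illustrative assumptions only.

```python
import math

def log_marginal(prior, lik):
    """log sum_c prior(c) * lik(c): the analytically marginalised term."""
    return math.log(sum(p * l for p, l in zip(prior, lik)))

def class_elbo(q, prior, lik):
    """sum_c q(c) [log prior(c) + log lik(c) - log q(c)], skipping q(c) = 0."""
    return sum(qc * (math.log(p) + math.log(l) - math.log(qc))
               for qc, p, l in zip(q, prior, lik) if qc > 0)

def class_posterior(prior, lik):
    """Bayes-optimal class posterior q*(c), proportional to prior(c) * lik(c)."""
    z = sum(p * l for p, l in zip(prior, lik))
    return [p * l / z for p, l in zip(prior, lik)]
```

For any strictly positive prior and likelihoods, `class_elbo` never exceeds `log_marginal`, and the two agree when `q` is set to `class_posterior(prior, lik)`.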

Numerical implementation. $\log L^{(\text {sub}), \text {tot}}_n(f)$ is computed in log-space: \begin {equation} \log L^{(\text {sub}), \text {tot}}_n(f) \;=\; \mathrm {logsumexp}_c\!\big (\log \classdist _{f, c} + \log L^{(\text {sub})}_n(c; \mathcal {F}_n)\big ), \end {equation} with the expected log under $q_n^{(f)}$ a simple weighted sum.
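A minimal pure-Python sketch of this log-space computation is given below. The helper names are hypothetical, and the per-class log-likelihoods $\log L^{(\text {sub})}_n(c; \mathcal {F}_n)$ are assumed to have been produced by a separate Felsenstein pruning pass.

```python
import math

def logsumexp(xs):
    """Stable log(sum(exp(x) for x in xs)); returns -inf if all are -inf."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_sub_total(log_class_dist_f, log_lik):
    """log L_tot(f) = logsumexp_c(log classdist[f][c] + log L(c))."""
    return logsumexp([a + b for a, b in zip(log_class_dist_f, log_lik)])

def expected_sub_loglik(q_f, log_class_dist, log_lik):
    """sum_f q(f) * log L_tot(f) for a single column."""
    return sum(qf * log_sub_total(row, log_lik)
               for qf, row in zip(q_f, log_class_dist) if qf > 0)
```

Subtracting the maximum before exponentiating keeps the sum representable even when every per-class log-likelihood is far below the floating-point underflow threshold.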

C.8.8 ELBO

Combining the four contributions: \begin {equation} \label {eq:elbo-mixdom-reduced} \boxed {\;\; \log p(\text {MSA} \mid \parsetree , \theta ) \;\geq \; \log \tilde {p}(\text {MSA} \mid \parsetree , \theta ) \;\geq \; \mathcal {L}[q], \;\;} \end {equation} with \begin {equation} \label {eq:elbo-mixdom-reduced-detail} \mathcal {L}[q] = \expect _q\!\big [\log p_{\text {singlet}}^{\text {red}}(Z^{\text {root}})\big ] + \sum _{(v\to w) \in \parsetree } \expect _q\!\big [\log \hat {P}^{\text {WFST}}\!(Z^w \mid Z^v, \branchlen _{vw}, \theta )\big ] + \sum _{n=1}^L \sum _f q_n^{(f)}(f)\,\log L^{(\text {sub}), \text {tot}}_n(f; \mathcal {F}_n) + H[q]. \end {equation} The first two terms use the reduced singlet HMM (defined on $\mathcal {T} = [N_{\text {dom}}] \times [N_{\text {fr}}]$ with the singlet kernel $\omega $ from equation (C.35)) and the reduced WFST $\hat {T}$ from equation (C.37). The third is the marginalised substitution term (C.45).

Entropy decomposition. \begin {equation} \label {eq:entropy-mixdom-reduced} H[q] = \sum _{n=1}^L H[q_n^{(\tau )}] \;+\; \sum _{n=1}^L H[q_n^{(\pi |\tau )} \mid \text {MSA}], \end {equation} with $H[q_n^{(\tau )}]$ the simple categorical entropy (closed-form, $|\mathcal {T}| - 1$ free parameters per column) and $H[q_n^{(\pi |\tau )} \mid \text {MSA}]$ the leaf-conditioned entropy of the simple-case 3-state graphical model (equations (B.71) and (B.72) of Section B.6.7, with the same $\log Z_q$ correction).

Bound interpretation. Identical to equation (B.73): the bound is on $\log p$ with gap = $q$-independent restriction-gap (ghost-column histories) + variational KL. The $(c, g, e)$-collapse introduces no additional restriction gap; the analytic marginalisations are exact at the model level, replacing two labelled-variant Jensen inequalities with equalities. The resulting bound is at least as tight as the labelled one, and strictly tighter whenever the labelled variational factors are off their Bayes-optimal posteriors.

C.8.9 Cross-column constraint vanishes

The labelled-variant cross-column constraint (Section M.8 of the labelled draft) forced $q^{(\tau )}$ to be a column-Markov chain to avoid hard-zero violations of structural rules ($d_{n+1} = d_n$ when $e_n = 0$, etc.). Under the $(g, e)$-marginalisation those rules disappear: every entry of $\omega ^{(d, f, d', f')}$ is positive (assuming irreducibility of $\ext ^{(d)}$ and positivity of $\domdist , \fragdist , \notext ^{(d)}, \kappa _{\main }, \kappa _{\dom }$), so no $(\tau _n, \tau _{n+1})$ pair is hard-zero in $\hat {T}$ either. Per-column factorised $q^{(\tau )}(\tau _1, \ldots , \tau _L) = \prod _n q^{(\tau )}_n(\tau _n)$ puts positive mass everywhere on the model’s support and $\mathbb {E}_q[\log p]$ is finite for any positive $q$.

A column-Markov $q^{(\tau )}$ remains an optional refinement: the true posterior on column-tuples is correlated across columns even after $(g, e)$-marginalisation (the singlet kernel is genuinely Markov on $(d, f)$ via $\omega $). The per-column factorised $q^{(\tau )}$ is nevertheless sufficient for $\mathcal {L}[q]$ to be a finite proper bound, eliminating the $|\mathcal {T}|^2$ tuple-Markov parameters per column-pair that the labelled variant required.

C.8.10 Special cases and recovery

Recovery of simple TreeVarAnc. Setting $|\mathcal {T}| = 1$ (single domain, single fragchar) collapses $\hat {T}$ to the GGI-approximation WFST $\tkfwfst '$ of Section A.3 and renders the substitution term constant, recovering Section B.6 verbatim.

Single-class limit. Setting $\classdist _{f, c} = \delta _{c, c_0}$ for a fixed class $c_0$ gives $L^{(\text {sub}), \text {tot}}_n(f) = L^{(\text {sub})}_n(c_0; \mathcal {F}_n)$, and the substitution term reduces to the standard Felsenstein log-likelihood under class $c_0$.

Single-fragchar-per-domain limit. $N_{\text {fr}} = 1$ collapses $\ext ^{(d)}$ to a scalar self-loop, the within-domain fragchar Markov reduces to a geometric extension, and the reduced model recovers a simpler scalar-extension form.

Fragment-boundary inference caveat. Were the labelled $g$ indicator a deterministic function of consecutive fragchar transitions ($g_n = \delta (f_n \neq f_{n+1})$ within the same domain — the special case where intra-fragment fragchar transitions are forbidden), the variational marginal would identify fragment boundaries directly. Under the reduced formulation the variational marginal $q_n^{(\tau )}(d, f)$ at each column gives the ancestral $(d, f)$ directly, but does not uniquely identify “fragment boundaries” in that restrictive sense: a fragchar transition $(d, f) \to (d, f')$ with $f' \neq f$ in the inferred ancestral trajectory may be either a within-fragment Markov move or a new-fragment start, depending on the latent route. For ancestral $(d, f)$ inference at observed columns this distinction is irrelevant. For downstream tasks that need fragment boundaries explicitly, one option is to augment $q^{(\tau )}$ post-hoc with a fragment-boundary indicator inferred from the route posterior over each $(d, f) \to (d, f')$ transition — but this is a derived quantity rather than an independent variational latent.

C.8.11 Open issues

Two points warrant explicit verification before relying on the construction quantitatively (a third was resolved by the numerical verification at $t = 0.1$; see Section C.8.4).

(I.1) $\hat {T}$ row-sum identity. The reduced WFST $\hat {T}$ inherits the labelled WFST’s input-conditional normalisation by construction (Section C.8.4), but we have not explicitly verified the row-sum identity. A small numerical check is recommended: for $(N_{\text {dom}}, N_{\text {fr}}) \in \{(2, 2), (3, 2)\}$, build $\hat {T}$ from the labelled $T^{\text {lab}}$ and verify that composition with the reduced singlet HMM on $(d, f)$ yields the same column-marginal Pair HMM probabilities as the labelled construction.

(I.2) Start/end-row boundary effects. Boundary entries $\hat {T}_{\sta s'}(\sta , \tau ')$ and $\hat {T}_{s\fin }(\tau , \fin )$ involve the labelled WFST’s start and end rows, which carry domain-level boundary indicators ($\nonemptytrans _{\sta \cdot }$, $\nonemptytrans _{\cdot \fin }$) that are not column-internal. These rows may need explicit derivation rather than mechanical application of equation (C.37); the construction in Section C.8.4 is correct for interior $(s, s') \in \{\mat , \ins , \del \}^2$ but should be verified at the boundaries.

These open issues are flagged here for transparency. Their resolution is a precondition for production use of the reduced ELBO in parameter learning; for ancestral-tuple inference at fixed $\theta $ the resulting bias (if any) is $q$-independent and so cancels from the variational optimum.

C.9 Generalized Phylo-HMM for MixDom

This appendix presents a polynomial-time algorithm for marginal ancestral reconstruction in a restricted MixDom model. The restriction is: the top-level TKF91 process has vanishing indel rates $\insrate _\main , \delrate _\main \to 0$ at fixed ratio $\kappa _\main = \insrate _\main / \delrate _\main $. As input we assume an MSA with a tree and a per-node gap/residue annotation (in practice obtained by Fitch parsimony on gaps).

C.9.1 The Vanishing-Top-Level-Indel Limit

Setting $\insrate _\main = \delrate _\main = 0$ literally produces $0/0$ indeterminate forms in the TKF91 transition probabilities (eq. A.4). We instead take the limit $\insrate _\main , \delrate _\main \to 0^+$ with $\kappa _\main $ fixed. In this limit the TKF91 $\alpha , \beta , \gamma $ coefficients evaluated at a branch length $\evoltime > 0$ satisfy $\alpha = e^{-\delrate _\main \evoltime } \to 1$, $\beta = \insrate _\main (1-e^{-(\delrate _\main - \insrate _\main )\evoltime })/ (\delrate _\main - \insrate _\main e^{-(\delrate _\main -\insrate _\main )\evoltime }) \to 0$, and $\gamma \to 0$. Inspecting the entries of the 5$\times $5 top-level matrix $\tkftrans _\main $: \begin {align*} \tkftrans _\main (\sta ,\mat ) &= (1-\beta )\kappa _\main \alpha \to \kappa _\main , \\ \tkftrans _\main (\sta ,\ins ) &= \beta \to 0, \\ \tkftrans _\main (\sta ,\del ) &= (1-\beta )\kappa _\main (1-\alpha ) \to 0, \\ \tkftrans _\main (\sta ,\fin ) &= (1-\beta )(1-\kappa _\main ) \to 1-\kappa _\main , \\ \tkftrans _\main (\mat ,\mat ) &\to \kappa _\main , \quad \tkftrans _\main (\mat ,\ins ), \tkftrans _\main (\mat ,\del ) \to 0, \\ \tkftrans _\main (\mat ,\fin ) &\to 1-\kappa _\main . \end {align*}

The $\ins $ and $\del $ columns become structurally unreachable on every branch, while the $\sta \to \mat \to \dots \to \mat \to \fin $ chain survives with probability $\kappa _\main ^{\ndom _*}(1-\kappa _\main )$ for a chain of $\ndom _*$ top-level match states (we write $\ndom _*$ for the number of domains, to distinguish it from the number of domain classes $\ndom $). In other words, $\ndom _*$ is distributed $\geomdist (\kappa _\main )$ at the root, and every descendant preserves the same ordered list of domains (no top-level births or deaths). The ratio $\kappa _\main $ thus becomes the single top-level parameter that survives the limit; all branch-length dependence at the top level vanishes.
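The limit can be illustrated numerically by evaluating the start-row entries listed above at small rates with $\kappa _\main $ held fixed. The sketch below assumes $\insrate _\main < \delrate _\main $ (so that $\kappa _\main < 1$); the function name and dictionary keys are illustrative.

```python
import math

def tkf91_start_row(ins_rate, del_rate, t):
    """Start-row entries of the TKF91 Pair HMM at branch length t.

    Assumes 0 < ins_rate < del_rate. Keys: "M", "I", "D", "E" for the
    match, insert, delete and end destinations.
    """
    kappa = ins_rate / del_rate
    alpha = math.exp(-del_rate * t)
    decay = math.exp(-(del_rate - ins_rate) * t)
    beta = ins_rate * (1.0 - decay) / (del_rate - ins_rate * decay)
    return {
        "M": (1.0 - beta) * kappa * alpha,
        "I": beta,
        "D": (1.0 - beta) * kappa * (1.0 - alpha),
        "E": (1.0 - beta) * (1.0 - kappa),
    }
```

Calling `tkf91_start_row(0.8 * eps, eps, 1.0)` for decreasing `eps` shows the match and end entries approaching $\kappa _\main = 0.8$ and $1 - \kappa _\main = 0.2$ while the insert and delete entries vanish; the four entries sum to one at every `eps`.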

C.9.2 Partition Decomposition

Let the MSA have $L$ columns. Conditional on the number of top-level domains $\ndom _*$ and their classes $d_1, \dots , d_{\ndom _*}$, the full model decomposes into $\ndom _*$ independent nested processes, each responsible for some contiguous subset of columns. Since there are no top-level births or deaths on any branch, each domain is either entirely absent or entirely present on each branch, and the correspondence between columns and domains is a single partition shared across the whole tree. Let \[ P = (b_1, b_2, \dots , b_B), \qquad b_i = [l_{i-1}+1,\, l_i], \] with $0 = l_0 < l_1 < \dots < l_B = L$, be a partition of $\{1,\dots ,L\}$ into $B$ contiguous blocks, and let $d_{b_i} \in \{1,\dots ,\ndom \}$ be the class label of block $i$, where $\ndom $ is the number of top-level domain classes in the model. Then \begin {equation} P(\text {MSA} \mid \text {tree}, \text {model}) = \sum _{P}\; P(P \mid \text {model}) \prod _{i=1}^{B} G(l_{i-1}+1, l_i, d_{b_i}), \label {eq:partition-sum} \end {equation} where the block likelihood $G(k, l, \dom )$ is the standalone phylogenetic likelihood of the sub-MSA on columns $k..l$ under the within-domain model for domain class $\dom $, now including the Markovian fragment process with transition matrix $\ext ^{(\dom )}$ and per-fragment site class distributions $\classdist _{\dom \frag \class }$. The partition prior is \begin {equation} P(P \mid \text {model}) = (1-\kappa _\main )\, \kappa _\main ^{B}\, \prod _{i=1}^{B} \domdist _{d_{b_i}}, \label {eq:partition-prior} \end {equation} where $\domdist $ is the top-level class (stationary) distribution. When marginalising over $P$ the number of summands is exponential in $L$, so direct evaluation of (C.50) is intractable. It is, however, a generalised hidden Markov model in which each “emission” is an arbitrarily long contiguous block of columns.

C.9.3 Why the State Space Cannot Be Collapsed

A standard reduction of a generalised HMM to an ordinary HMM would introduce a hidden state at every column encoding “which within-domain Pair-HMM state each phylogenetic lineage is in”. Under a single TKF92, each lineage has three relevant states ($\mat , \ins , \del $), but the per-column state vector across $T$ tree nodes has $3^T$ values. Furthermore, once a new top-level domain begins the within-domain Pair HMMs on every lineage all reset to $\sta $, so the memory of the within-domain state on any previously gapped lineage is lost at every block boundary: a newly present lineage can only be reasoned about as starting from $\sta $ at the first column of the new block. Conditioning on the partition $P$, on the other hand, makes each block a self-contained problem on a sub-MSA, and eliminates the combinatorial explosion at the price of a quadratic sum over start columns.

C.9.4 Setup and Definitions

We pretend there is an “infinitely long” branch above the root node, so every fragment on the root row is modeled as an insertion. We partition the alignment into blocks of domain type $\dom $.

Column presence profile. Let $A(j)$ denote the column presence/absence profile for column $j$—a binary vector with one entry per tree node, where $1$ indicates the node has a residue at column $j$ and $0$ indicates a gap (as determined by Fitch parsimony).

Fragment continuity. Define \[ k_{\min }(i,j) = \min \{ k : i \leq k \leq j,\ A(k') = A(j)\ \text {for all}\ k \leq k' \leq j \} \] This is the first column in $i..j$ that can be part of the same fragment as column $j$ (i.e., the earliest column from which an unbroken run of identical presence profiles extends to $j$).
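As a concrete reading of this definition, $k_{\min }(i,j)$ can be found by scanning leftwards from $j$ until the presence profile changes or the block start $i$ is reached. In the sketch below, `profiles` is a hypothetical list holding $A(k)$ at index $k$, with index $0$ unused so that columns stay 1-based.

```python
def k_min(profiles, i, j):
    """Earliest k in i..j such that profiles[k..j] all equal profiles[j].

    profiles[k] is the presence tuple A(k) for 1-based column k;
    profiles[0] is an unused placeholder.
    """
    k = j
    while k > i and profiles[k - 1] == profiles[j]:
        k -= 1
    return k
```

For profiles $A(1..5) = (110, 111, 111, 101, 101)$, the run ending at column 3 starts at column 2, and the run ending at column 5 starts at column 4.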

Per-branch TKF state. Let $S_{\mathrm {tkf}}(r, i, j)$ be the TKF Pair HMM emit state for row (branch) $r$ at column $j$, given block start $i$. This is a deterministic function of the presence/absence pattern: if all entries are zero for columns $i..j$ on branch $r$, the state is $\sta $; otherwise, the state depends on $(\text {ancestor\_present}_r(j),\; \text {descendant\_present}_r(j))$, mapping to $\mat $, $\ins $, or $\del $ as appropriate.

Per-branch TKF transitions. Let $B_{\mathrm {tkf}}(\dom , r, s, s')$ be the TKF91 branch transition matrix entry for domain $\dom $, branch $r$, from state $s$ to state $s'$. Define \[ T_{\mathrm {tkf}}(\dom , i, j) = \prod _r B_{\mathrm {tkf}}(\dom , r, S_{\mathrm {tkf}}(r, i, j{-}1),\; S_{\mathrm {tkf}}(r, i, j)) \] as the product of TKF91 transitions across all rows from column $j{-}1$ to column $j$ within a block starting at column $i$. Similarly define \[ T_{\mathrm {tkf,start}}(\dom , j) = \prod _r B_{\mathrm {tkf}}(\dom , r, \sta ,\; S_{\mathrm {tkf}}(r, j, j)) \] for block-start transitions and \[ T_{\mathrm {tkf,end}}(\dom , i, j) = \prod _r B_{\mathrm {tkf}}(\dom , r, S_{\mathrm {tkf}}(r, i, j),\; \fin ) \] for block-closing transitions.

Felsenstein emission likelihood. Let $U(j, \class )$ be the Felsenstein pruning likelihood for column $j$ under site class $\class $, computed over the present subtree at column $j$ with the substitution model $(\exch ^{(\class )}, \eqm ^{(\class )})$.

C.9.5 Intra-Block Forward Recurrence

Within a block assigned to domain $\dom $ spanning columns $i..j$, the Markovian fragment process induces a forward recurrence over fragment states.

Transition probabilities. Within-fragment extension (the current fragment continues from type $\srcfrag $ to type $\destfrag $): \[ T_{\mathrm {ext}}(\dom , j, \srcfrag , \destfrag ) = \ext ^{(\dom )}_{\srcfrag \destfrag } \] This transition is available only when $A(j{-}1) = A(j)$ (the presence profile has not changed, so the same fragment can continue).

Fragment termination followed by new fragment start (the presence profile changes, or the Markov chain starts a new fragment): \[ T_{\mathrm {notext}}(\dom , i, j, \srcfrag , \destfrag ) = \notext ^{(\dom )}_\srcfrag \cdot T_{\mathrm {tkf}}(\dom , i, j) \cdot \fragdist _{\dom \destfrag } \]

Emission weight. Each column $j$ in fragment state $\destfrag $ contributes the class-averaged Felsenstein likelihood: \[ E(\dom , j, \destfrag ) = \sum _{\class =1}^{\nclasses } \classdist _{\dom \destfrag \class }\, U(j, \class ) \]

Forward recurrence. Define $F_{i,j,\dom ,\destfrag }$ as the probability of columns $i..j$ within a block starting at $i$ in domain $\dom $, with column $j$ in fragment state $\destfrag $.

Base case: \begin {equation} F_{i,i,\dom ,\destfrag } = T_{\mathrm {tkf,start}}(\dom , i) \cdot \fragdist _{\dom \destfrag } \cdot E(\dom , i, \destfrag ) \label {eq:intra-forward-base} \end {equation}

Recursion for $j > i$: \begin {equation} F_{i,j,\dom ,\destfrag } = \sum _{\srcfrag =1}^{\nfrag } F_{i,j{-}1,\dom ,\srcfrag } \cdot \left ( \delta (A(j{-}1){=}A(j))\, T_{\mathrm {ext}}(\dom , j, \srcfrag , \destfrag ) + T_{\mathrm {notext}}(\dom , i, j, \srcfrag , \destfrag ) \right ) \cdot E(\dom , j, \destfrag ) \label {eq:intra-forward-recurse} \end {equation} where $\delta (A(j{-}1){=}A(j))$ restricts the fragment extension term to columns with identical presence profiles.

Block likelihood. \begin {equation} G(i, j, \dom ) = \sum _{\srcfrag =1}^{\nfrag } F_{i,j,\dom ,\srcfrag } \cdot \notext ^{(\dom )}_\srcfrag \cdot T_{\mathrm {tkf,end}}(\dom , i, j) \label {eq:block-likelihood} \end {equation}
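The base case, recursion and block likelihood above can be sketched for a single block and a single domain class as follows. Columns are indexed relative to the block start, and the TKF factors $T_{\mathrm {tkf,start}}$, $T_{\mathrm {tkf}}$ and $T_{\mathrm {tkf,end}}$ are assumed to be precomputed scalars passed in by the caller. All argument names are illustrative.

```python
def block_likelihood(emit, ext, notext, fragdist, same,
                     t_start, t_step, t_end):
    """G(i, j, dom) for one block, via the intra-block forward pass.

    emit[j][f]:  class-averaged emission E(dom, j, f), j = 0..m-1.
    ext[f][f2]:  within-fragment extension probability.
    notext[f]:   fragment termination probability.
    fragdist[f]: initial fragment-type distribution.
    same[j]:     True iff A(j-1) == A(j) (entry 0 is unused).
    t_step[j]:   T_tkf between columns j-1 and j (entry 0 is unused).
    """
    n_frag = len(fragdist)
    fwd = [t_start * fragdist[f] * emit[0][f] for f in range(n_frag)]
    for j in range(1, len(emit)):
        nxt = []
        for f2 in range(n_frag):
            total = 0.0
            for f in range(n_frag):
                stay = ext[f][f2] if same[j] else 0.0
                restart = notext[f] * t_step[j] * fragdist[f2]
                total += fwd[f] * (stay + restart)
            nxt.append(total * emit[j][f2])
        fwd = nxt
    return sum(fwd[f] * notext[f] for f in range(n_frag)) * t_end
```

A block of $m$ columns costs $O(m \, \nfrag ^2)$; in practice the forward vector for a fixed start column can be extended one column at a time, so all end columns share the work.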

C.9.6 The Forward Recursion

Define the Forward quantity \begin {equation} F(l, \dom ) = P(\text {columns } 1..l \text { of MSA, with last block ending at } l \text { in domain class } \dom \mid \text {tree}, \text {model}). \end {equation} Encoding the partition prior (C.51) into the recursion, we initialise with a virtual start marker $F(0, \emptyset ) = 1$ and advance by \begin {equation} F(l, \dom ) = \kappa _\main \, \domdist _{\dom }\, \sum _{k=0}^{l-1} \bar F(k)\, G(k+1, l, \dom ), \label {eq:forward-compact} \end {equation} where $\bar F(0) := 1$ and $\bar F(k) := \sum _m F(k,m)$ for $k > 0$. The total data likelihood is obtained by closing off the final block with the top-level termination factor: \begin {equation} P(\text {MSA}\mid \text {tree},\text {model}) = (1-\kappa _\main ) \sum _{\dom } F(L, \dom ). \label {eq:total-ll} \end {equation}

Full MSA probability. Given a specific partition into blocks $(i_1,j_1,\dom _1), \ldots , (i_B,j_B,\dom _B)$, the probability of the partitioned MSA given the tree is \[ P(\text {partitioned MSA} \mid \text {tree}) = (1-\kappa _\main ) \prod _{b=1}^{B} \kappa _\main \, \domdist _{\dom _b}\, G(i_b, j_b, \dom _b) \]
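The inter-block forward pass can be sketched as below, taking a precomputed table of block likelihoods as input. The table layout `G[k][l][d]` (1-based columns, index 0 unused) and the function name are assumptions for illustration; the sketch works in linear space and assumes $L \geq 1$.

```python
def msa_likelihood(G, kappa, domdist):
    """Inter-block forward pass over a block-likelihood table.

    G[k][l][d]: likelihood of columns k..l (1-based, k <= l) as one
                block of domain class d; entries with k > l are unused.
    kappa:      top-level ratio kappa_main.
    domdist[d]: top-level domain class distribution.
    Returns (total likelihood, F, F_bar).
    """
    L = len(G) - 1
    n_dom = len(domdist)
    F = [[0.0] * n_dom for _ in range(L + 1)]
    F_bar = [1.0] + [0.0] * L      # F_bar[0] = 1 is the virtual start
    for l in range(1, L + 1):
        for d in range(n_dom):
            F[l][d] = kappa * domdist[d] * sum(
                F_bar[k] * G[k + 1][l][d] for k in range(l))
        F_bar[l] = sum(F[l])
    return (1.0 - kappa) * sum(F[L]), F, F_bar
```

The returned total agrees with a brute-force sum of the partitioned-MSA probability over every partition and class labelling, while costing only $O(L^2 \, \ndom )$ given the table.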

C.9.7 The Backward Recursion

Since the domain class of the next block does not depend on the class of the current block (the partition prior factorises over blocks), the backward variable can be collapsed to a scalar. Define \begin {equation} \bar \beta (l) = P(\text {columns } l{+}1..L \text { of MSA} \mid \text {last block ended at column } l,\, \text {tree}, \text {model}). \end {equation} The boundary condition and recursion are \begin {align} \bar \beta (L) &= 1-\kappa _\main , \label {eq:backward-base}\\ \bar \beta (k) &= \sum _{l=k+1}^{L} \Bigl [\kappa _\main \sum _{\dom } \domdist _\dom \, G(k{+}1, l, \dom )\Bigr ]\, \bar \beta (l), \qquad k < L. \label {eq:backward} \end {align}

The total likelihood is recoverable as $\bar \beta (0)$—equivalent to (C.57) (since the $k{=}0$ case expands to $(1{-}\kappa _\main )\sum _\dom \kappa _\main \domdist _\dom \sum _{l=1}^{L} G(1,l,\dom )\,[\cdots ]$, matching the forward expression).

C.9.8 Intra-Block Backward Recurrence

To compute posterior fragment state and site class probabilities, we need an intra-block backward recurrence. Define $B_{i,k,j,\dom ,\srcfrag }$ as the probability of columns $k..j$ given a block $i..j$ in domain $\dom $, with column $k{-}1$ in fragment state $\srcfrag $.

Boundary condition: \begin {equation} B_{i,j{+}1,j,\dom ,\srcfrag } = \notext ^{(\dom )}_\srcfrag \cdot T_{\mathrm {tkf,end}}(\dom , i, j) \label {eq:intra-backward-boundary} \end {equation} (Note: this depends on $\srcfrag $ through $\notext ^{(\dom )}_\srcfrag $, the probability that fragment $\srcfrag $ terminates.)

Recursion for $i < k \leq j$ (the case $k = i$ is never needed, since column $i{-}1$ lies outside the block): \begin {equation} B_{i,k,j,\dom ,\srcfrag } = \sum _{\destfrag =1}^{\nfrag } \left ( \delta (A(k{-}1){=}A(k))\, T_{\mathrm {ext}}(\dom , k, \srcfrag , \destfrag ) + T_{\mathrm {notext}}(\dom , i, k, \srcfrag , \destfrag ) \right ) \cdot E(\dom , k, \destfrag ) \cdot B_{i,k{+}1,j,\dom ,\destfrag } \label {eq:intra-backward-recurse} \end {equation}

C.9.9 Posterior Domain and Fragment State Assignment

From the inter-block forward $F$ and backward $\bar \beta $, and the intra-block forward $F_{i,k,\dom ,\srcfrag }$ and backward $B_{i,k,j,\dom ,\srcfrag }$, we recover per-column posteriors.

Posterior domain assignment. Writing $Z = P(\text {MSA}\mid \text {tree},\text {model})$, \begin {equation} P(c \text { in block of class } \dom \mid \text {MSA}) \;=\; \frac {1}{Z}\,\kappa _\main \, \domdist _\dom \sum _{k=0}^{c-1} \sum _{l=c}^{L} \bar F(k)\, G(k{+}1, l, \dom )\, \bar \beta (l). \label {eq:post-col} \end {equation}
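A minimal sketch combining the scalar forward and backward passes into the per-column domain posterior is given below. It reuses the hypothetical table layout `G[k][l][d]` (1-based columns, index 0 unused), recomputes $\bar F$ and $\bar \beta $ internally so that it is self-contained, and evaluates the double sum directly rather than with any prefix-sum optimisation.

```python
def domain_posteriors(G, kappa, domdist):
    """Per-column posterior over domain classes.

    Returns (Z, F_bar, beta_bar, post) with post[c][d] the posterior
    that 1-based column c lies in a block of class d (post[0] unused).
    """
    L, n_dom = len(G) - 1, len(domdist)
    # Forward: F_bar[l] sums F(l, d) over d, with F_bar[0] = 1.
    F_bar = [1.0] + [0.0] * L
    for l in range(1, L + 1):
        F_bar[l] = sum(kappa * domdist[d] * F_bar[k] * G[k + 1][l][d]
                       for k in range(l) for d in range(n_dom))
    # Backward: beta_bar[L] = 1 - kappa closes the final block.
    beta_bar = [0.0] * L + [1.0 - kappa]
    for k in range(L - 1, -1, -1):
        beta_bar[k] = sum(
            kappa * domdist[d] * G[k + 1][l][d] * beta_bar[l]
            for l in range(k + 1, L + 1) for d in range(n_dom))
    Z = beta_bar[0]
    post = [[0.0] * n_dom for _ in range(L + 1)]
    for c in range(1, L + 1):
        for d in range(n_dom):
            post[c][d] = kappa * domdist[d] / Z * sum(
                F_bar[k] * G[k + 1][l][d] * beta_bar[l]
                for k in range(c) for l in range(c, L + 1))
    return Z, F_bar, beta_bar, post
```

Two consistency checks follow from the definitions: $\bar \beta (0)$ equals $(1-\kappa _\main )\,\bar F(L)$, and the posterior at each column sums to one over domain classes, because every partition covers each column with exactly one block.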

Posterior fragment state. The posterior probability that column $k$ is in fragment state $\srcfrag $, given a block $i..j$ of domain $\dom $, is: \begin {equation} P(\text {col } k \text { is frag } \srcfrag \mid \text {block } i..j, \dom ) = \frac {F_{i,k,\dom ,\srcfrag } \cdot B_{i,k{+}1,j,\dom ,\srcfrag }}{G(i, j, \dom )} \label {eq:post-frag} \end {equation} where $B_{i,k{+}1,j,\dom ,\srcfrag }$ uses $k{+}1$ because $B$ gives the probability of the remaining columns $k{+}1..j$ given fragment state $\srcfrag $ at column $k$.

Full unconditional fragment state posterior. The full posterior for fragment state at column $c$, marginalised over all block placements, is \begin {equation} P(\text {col } c \text { is fragtype } \srcfrag \mid \text {MSA}) = \sum _{\dom }\sum _{i \leq c}\sum _{j \geq c} P(\text {block } i..j,\, \dom \mid \text {MSA})\; \frac {F_{i,c,\dom ,\srcfrag }\, B_{i,c{+}1,j,\dom ,\srcfrag }}{G(i,j,\dom )}, \label {eq:post-frag-full} \end {equation} where the block–domain posterior is \begin {equation} P(\text {block } i..j,\, \dom \mid \text {MSA}) \;\propto \; \bar F(i{-}1)\;\kappa _\main \;\domdist _\dom \; G(i,j,\dom )\;\bar \beta (j). \label {eq:post-block} \end {equation}

Posterior site class. The posterior probability of site class $\class $ at column $k$ is obtained by mixing over the fragment state posterior: \begin {equation} P(\text {class } \class \text { at col } k \mid \text {block } i..j, \dom ) = \sum _{\srcfrag } P(\text {col } k \text { is frag } \srcfrag \mid \text {block } i..j, \dom ) \cdot \frac {\classdist _{\dom \srcfrag \class }\, U(k, \class )} {\sum _{\class '} \classdist _{\dom \srcfrag \class '}\, U(k, \class ')} \label {eq:post-class} \end {equation} The full (unconditional) posterior over site class at column $k$ is obtained by further marginalizing over blocks and domain types using the inter-block posterior (C.61).
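The conditional site-class posterior is a mixture of per-fragment-state posteriors, which the following Python sketch makes explicit. The argument names are hypothetical: `frag_post[f]` is the fragment-state posterior at the column, `classdist[f][c]` is the class prior for the fixed domain, and `U_k[c]` is the cached column likelihood $U(k,\class )$.

```python
def site_class_posterior(frag_post, classdist, U_k):
    """Mix per-fragment-state class posteriors by the fragment-state posterior."""
    n_class = len(U_k)
    post = [0.0] * n_class
    for f, pf in enumerate(frag_post):
        # Unnormalised class weights under fragment state f.
        weights = [classdist[f][c] * U_k[c] for c in range(n_class)]
        norm = sum(weights)
        for c in range(n_class):
            post[c] += pf * weights[c] / norm
    return post

# Toy instance: two fragment states, two site classes.
toy_class_post = site_class_posterior(
    [0.25, 0.75], [[1.0, 0.0], [0.5, 0.5]], [0.2, 0.6])
```

Here the first fragment state forces class 0, while the second yields class weights $(0.25, 0.75)$, so the mixture is $(0.4375, 0.5625)$.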

C.9.10 Root Residue Reconstruction

Given the posterior in (C.61) and (C.65), we obtain a posterior over root residues at column $c$ by mixing over the class assignment and re-using the Felsenstein pruning posterior under each class: \begin {equation} P(\text {root}_c = a \mid \text {MSA}) = \sum _{\dom } P(c \in \dom \mid \text {MSA}) \cdot \sum _{\class } P(\text {class } \class \text { at } c \mid \dom ) \cdot P(\text {root}_c = a \mid \text {MSA}, c \in \class ), \label {eq:root-post} \end {equation} where the per-class conditional on the right uses Felsenstein pruning at column $c$ with the substitution model $(\exch ^{(\class )}, \eqm ^{(\class )})$. The MAP root sequence is obtained by taking $\mbox {argmax}_a$ of this posterior column-by-column; gaps at the root are determined by the presence annotation obtained from Fitch parsimony.
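The mixing step can be sketched as below. This sketch assumes the per-class root posteriors have already been obtained by Felsenstein pruning and are passed in as `root_given_class[c][a]`; all names are hypothetical, and the gap decision from the presence annotation is not modelled.

```python
def root_posterior(dom_post, class_post, root_given_class):
    """Return (posterior over root residues, index of the MAP residue).

    dom_post[d]             P(column in domain class d | MSA)
    class_post[d][c]        P(site class c at the column | d)
    root_given_class[c][a]  P(root = a | MSA, class c), from pruning
    """
    n_res = len(root_given_class[0])
    post = [0.0] * n_res
    for d, pd in enumerate(dom_post):
        for c, pc in enumerate(class_post[d]):
            for a in range(n_res):
                post[a] += pd * pc * root_given_class[c][a]
    map_residue = max(range(n_res), key=post.__getitem__)
    return post, map_residue

# Toy instance: two domains, two classes, two residues.
toy_root_post, toy_map = root_posterior(
    [0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
```

In the toy instance the posterior is $(0.55, 0.45)$, so the MAP residue is the first one.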

C.9.11 Why the Trick Fails with Top-Level Indels

If we retained nonzero top-level insertion and deletion rates we would also need to decide on which branch each top-level domain was born or died. Given only the presence pattern at each column, a block whose residues appear on only a single clade is consistent with both (i) an ancestrally present domain that was deleted on every other branch and (ii) a recent insertion. Conditioning on the partition no longer factorises the likelihood across blocks because the per-branch top-level state sequence is correlated across blocks. The algorithm above relies crucially on the absence of top-level births and deaths so that every block is a fully independent problem with a fresh $\sta $ at every branch.

Remark C.15 (Relaxing the top-level constraint). The zero top-level indel-rate limit ($n_{\mathrm {top,indel}} = 0$) can be partially relaxed by conditioning on domain presence/absence profiles. At $n_{\mathrm {top,indel}} = 1$, we allow a single top-level domain insertion or deletion event on the tree, yielding $O(R)$ possible domain subtrees (where $R$ is the number of tree rows/edges). At $n_{\mathrm {top,indel}} = 2$, there are $O(R^2)$ configurations, and so on. The probability of each truncation level is roughly exponential in $n_{\mathrm {top,indel}}$ (controlled by $\insrate _\main + \delrate _\main $), so this provides a systematic expansion controlling the state space explosion. The leading term ($n_{\mathrm {top,indel}} = 0$) is the algorithm presented here.

C.9.12 Complexity

Let $\ndom $ be the number of top-level domain classes, $\nfrag $ the number of fragment types per domain, $\nclasses $ the number of site classes, $T$ the number of tree nodes, and $L$ the number of MSA columns.

Computing all $G(k+1,l,\dom )$ by incremental forward updates requires, for each starting column $k+1$, a left-to-right pass over $l = k+1, \dots , L$. Each column update involves $O(\nfrag ^2)$ work for the fragment Markov chain transitions and $O(\nfrag \cdot \nclasses )$ for the emission weights, plus $O(T)$ for the TKF transition products. Total: $O(L^2\,(\nfrag ^2 + \nfrag \cdot \nclasses + T)\,\ndom )$.
Computing $\bar F$ and $\bar \beta $ given $G$ takes $O(L^2\,\ndom )$.
Computing the per-column posteriors including fragment state requires the intra-block backward, making the total $O(L^3\,\nfrag ^2\,\ndom )$ in the worst case (since $B_{i,k,j}$ depends on both $i$ and $j$ for each $k$).
Each column likelihood $U(j, \class )$ is computed once and cached as an $O(L \cdot \nclasses \cdot T \cdot |\alphabet |)$ precomputation.

Total for the $G$ pass and inter-block DP: $O(L^2\,(\nfrag ^2 + \nfrag \cdot \nclasses + T)\,\ndom )$. The intra-block backward for posteriors adds $O(L^3\,\nfrag ^2\,\ndom )$ due to the dependence of $B_{i,k,j,\dom ,\srcfrag }$ on both $i$ (block start) and $j$ (block end) for each interior column $k$.

The algorithm is naturally vectorisable: all $G$ updates for a fixed starting column can be computed in parallel across $\ndom $ domain classes and across tree branches.

C.9.13 Simulation from MixDom

To generate data from the MixDom model on a phylogenetic tree:

1.: Sample domain sequence. Draw the number of top-level domains $\ndom _* \sim \geomdist (\kappa _\main )$. For each domain, draw its type $\dom _i \sim \catdist (\domdist _1, \ldots , \domdist _\ndom )$.
2.: Sample fragment trajectory. For each domain of type $\dom $, sample a fragment trajectory through the $\nfrag $-state Markov chain. The initial fragment state is $\frag _1 \sim \catdist (\fragdist _{\dom 1}, \ldots , \fragdist _{\dom \nfrag })$. At each step, transition to fragment state $\frag _{k+1}$ with probability $\ext ^{(\dom )}_{\frag _k, \frag _{k+1}}$, or terminate the fragment with probability $\notext ^{(\dom )}_{\frag _k} = 1 - \sum _{\destfrag } \ext ^{(\dom )}_{\frag _k, \destfrag }$; after a fragment terminates, another fragment of the same domain begins with probability $\kappa _\dom $, and otherwise the domain ends.
3.: Sample site class and root residue. For each emitted site from fragment state $\frag $, draw site class $\class \sim \catdist (\classdist _{\dom \frag 1}, \ldots , \classdist _{\dom \frag \nclasses })$. Draw the root residue $a \sim \eqm ^{(\class )}$.
4.: Evolve down the tree. For each site, evolve the root residue down the phylogenetic tree using the substitution model $(\exch ^{(\class )}, \eqm ^{(\class )})$ at evolutionary time $\evoltime _r$ on each branch $r$: $b \sim P^{(\class )}(\evoltime _r) \cdot e_a$, where $P^{(\class )}(\evoltime _r) = \exp (\revsub ^{(\class )} \evoltime _r)$. Apply the per-branch TKF92 indel process (with domain-specific parameters $\insrate _\dom , \delrate _\dom $) to create insertions and deletions.
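Steps 1 to 3 can be sketched in Python as follows; step 4 (substitution and indel evolution down the tree) is omitted. The sketch makes two assumptions that should be checked against the intended model: the geometric draw counts from zero, so a sequence may contain no domains, and, following the three-level description of Section C.1.1, $\notext $ ends the current fragment while a further fragment of the same domain begins with probability $\kappa _\dom $. The function name and argument layout are hypothetical.

```python
import random

def sample_root_sequence(rng, kappa_main, domdist, kappa_dom,
                         fragdist, ext, classdist, eqm):
    """Sample root-level sites as (domain, fragment state, class, residue)."""
    def draw(weights):
        # Categorical draw; returns an index into `weights`.
        return rng.choices(range(len(weights)), weights=weights)[0]

    sites = []
    while rng.random() < kappa_main:           # another top-level domain
        d = draw(domdist)
        while rng.random() < kappa_dom[d]:     # another fragment in domain d
            f = draw(fragdist[d])              # initial fragment state
            while True:
                c = draw(classdist[d][f])      # site class
                a = draw(eqm[c])               # root residue
                sites.append((d, f, c, a))
                row = list(ext[d][f])
                notext = max(0.0, 1.0 - sum(row))
                nxt = draw(row + [notext])     # last slot = terminate fragment
                if nxt == len(row):
                    break
                f = nxt
    return sites

# Toy parameters: 2 domains, 2 fragment states, 2 classes, 4 residues.
toy = dict(
    kappa_main=0.8, domdist=[0.5, 0.5], kappa_dom=[0.7, 0.6],
    fragdist=[[0.5, 0.5], [1.0, 0.0]],
    ext=[[[0.3, 0.2], [0.1, 0.4]], [[0.5, 0.0], [0.2, 0.2]]],
    classdist=[[[0.9, 0.1], [0.5, 0.5]], [[0.2, 0.8], [1.0, 0.0]]],
    eqm=[[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]])
toy_sites = sample_root_sequence(random.Random(0), **toy)
```

Each returned tuple is one root site; the per-site class label is what a downstream routine would use to select $(\exch ^{(\class )}, \eqm ^{(\class )})$ when evolving the residue along each branch.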

C.10 Labeled-MixDom Singlet HMM and WFST

In this section we define state machines for the MixDom model that foreground all latent state in the input/output alphabet, so they can be used directly in beam-search MSA, progressive reconstruction, and the variational ancestral-state framework of Section C.8 without algebraic distillation. We call these the Labeled-MixDom Singlet HMM and the Labeled-MixDom WFST, to distinguish them from the Maraschino-distilled order-1 HMMs and WFSTs.

Remark C.16 (Within-fragment fragchar dynamics). Each fragment in MixDom carries an intra-fragment Markov chain on fragchars: from a character with fragchar $\srcfrag $ in a fragment of domain $\dom $, the next character (still in the same fragment) has fragchar $\destfrag $ with probability $\ext ^{(\dom )}_{\srcfrag \destfrag }$, and the fragment terminates at the current character with probability $\notext ^{(\dom )}_\srcfrag = 1 - \sum _\destfrag \ext ^{(\dom )}_{\srcfrag \destfrag }$. This is consistent with the exploded-MixDom spec (§C.3.2.0 of exploded-mixdom.tex), where $\ext ^{(k)}_{fg}$ is the intra-fragment fragchar transition kernel. The TKF92 fragment concept is preserved ($g=1$ marks the last character of a fragment).

Within a fragment ($g=0$). The next character is in the same fragment with fragchar $\destfrag $ chosen via $\ext ^{(\dom )}_{\srcfrag \destfrag }$; the destination $\destfrag $ may or may not equal the source $\srcfrag $ (intra-fragment Markov, not self-loop only).
Fragment boundary ($g=1$). The current character is the last in its fragment; this happens with the fragment-termination weight $\notext ^{(\dom )}_\srcfrag $. After fragment termination, the TKF92 outer dynamics decide what happens next: with probability $\kappa _\dom $ another fragment in the same domain begins (its first character drawn from $\fragdist _{\dom , \cdot }$); with probability $1-\kappa _\dom $ the domain ends ($e=1$).
Domain termination ($e=1$). The TKF91 outer dynamics on domains (with parameter $\kappa _{\main }$) decide whether another domain begins or the sequence ends.
$\nfrag = 1$ special case. Setting $\nfrag = 1$ (single fragchar) collapses $\ext ^{(\dom )}$ to a scalar self-loop; the resulting expressions reduce to a TKF92-extension scalar form that admits a more compact algebra (used in the simpler closed forms elsewhere in this paper).

C.10.1 Labeled Alphabet

Each alphabet symbol $\anctok \in \alphabet $ is decorated with a label tuple $(\class , \frag , g, \dom , e)$ consisting of the following latent variables:

$\class \in \nclass = \{1,\ldots ,|\nclass |\}$: substitution site class.
$\frag \in \nfrag = \{1,\ldots ,|\nfrag |\}$: fragchar of the current character (the per-character class label whose Markov chain within a fragment is $\ext ^{(\dom )}$).
$g \in \{0,1\}$: fragment-end indicator ($g=1$ if this is the last character of the current fragment; $g=0$ otherwise).
$\dom \in \ndom = \{1,\ldots ,|\ndom |\}$: domain type.
$e \in \{0,1\}$: domain-end indicator ($e=1$ if this is the last fragment of the current domain).

We write $\anctok _{\class \frag g\dom e}$ for a labeled symbol and define $\mathcal {L} = \nclass \times \nfrag \times \{0,1\} \times \ndom \times \{0,1\}$, so $|\mathcal {L}| = 4 |\nclass | |\nfrag | |\ndom |$. The labeled alphabet has $|\alphabet | \cdot |\mathcal {L}|$ symbols.

For context tracking in order-1 machines we need only the structural label $\ell = (\frag , g, \dom , e)$, since the site class $\class $ does not affect transition structure. The number of distinct structural labels is $L = 4 |\nfrag | |\ndom |$.
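The label counts can be checked by direct enumeration. The following Python sketch uses a hypothetical function name and zero-based indices; it builds the full label set $\mathcal {L}$ and the structural labels obtained by dropping the site class.

```python
from itertools import product

def labeled_alphabet(n_class, n_frag, n_dom):
    """Enumerate full labels (c, f, g, d, e) and structural labels (f, g, d, e)."""
    full = list(product(range(n_class), range(n_frag), (0, 1),
                        range(n_dom), (0, 1)))
    # Dropping the site class leaves the structural label.
    structural = sorted({(f, g, d, e) for (_, f, g, d, e) in full})
    return full, structural

toy_full, toy_structural = labeled_alphabet(n_class=3, n_frag=2, n_dom=5)
```

With three classes, two fragchars and five domain types this gives $4 \cdot 3 \cdot 2 \cdot 5 = 120$ full labels and $4 \cdot 2 \cdot 5 = 40$ structural labels.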

C.10.2 Labeled-MixDom Singlet HMM

The Labeled-MixDom Singlet HMM generates sequences from the MixDom stationary distribution with all latent variables made explicit. It is an order-1 HMM whose state records the structural label $\ell = (\frag , g, \dom , e)$ of the most recently emitted character.

States. The state space is $\{\sta \} \cup \{(\frag , g, \dom , e) : \frag \in \nfrag ,\, g \in \{0,1\},\, \dom \in \ndom ,\, e \in \{0,1\}\} \cup \{\fin \}$, giving $L + 2$ states.

Emissions. In state $(\frag , g, \dom , e)$, the HMM emits symbol $\anctok _{\class \frag g\dom e}$ (with the $\frag , g, \dom , e$ components matching the state) with probability \begin {equation} \label {eq:singlet-emit} \emprob (\anctok _{\class \frag g\dom e} \mid \frag , g, \dom , e) = \classdist _{\frag \class }\, \eqm _{\class \anctok } \end {equation} summing to 1 over $(\class , \anctok )$.

Transitions. From the start state $\sta $: \begin {equation} \label {eq:singlet-start} P(\sta \to (\frag , g, \dom , e)) = \domdist _\dom \, \kappa _\dom \, \fragdist _{\dom \frag }\, \begin {cases} 1 - \notext ^{(\dom )}_\frag & \text {if } g = 0 \\ \notext ^{(\dom )}_\frag \, \kappa _\dom \, \fragdist _{\dom \frag '} & \text {if } g = 1,\, e = 0 \\ & \quad \text {(to state $(\frag ', g', \dom , e')$; see below)} \\ \notext ^{(\dom )}_\frag \,(1-\kappa _\dom )\, \kappa _\main \, \domdist _{\dom '}\, \kappa _{\dom '}\, \fragdist _{\dom '\frag '} & \text {if } g = 1,\, e = 1 \\ & \quad \text {(to state $(\frag ', g', \dom ', e')$; see below)} \end {cases} \end {equation} However, since the singlet HMM is order-1 and we only track the destination state, it is cleaner to express the transitions directly.

The stationary distribution for the MixDom singlet process factors as follows. At the domain level, a TKF91 process with parameters $(\insrate _\main , \delrate _\main )$ generates $N_{\rm dom} \sim \mathrm {Geom}(\kappa _\main )$ domains, each of type $\dom \sim \catdist (\domdist _1,\ldots ,\domdist _\ndom )$. Within domain $\dom $, an irreducible Markov chain over fragchars with transition matrix $\ext ^{(\dom )} \in [0, 1]^{\nfrag \times \nfrag }$ and termination weights $\notext ^{(\dom )}_\srcfrag = 1 - \sum _\destfrag \ext ^{(\dom )}_{\srcfrag \destfrag }$ generates the sequence of per-character fragchars; the first character in the domain is drawn from the chain’s initial distribution $\fragdist _{\dom \srcfrag }$. Within each character’s fragchar $\srcfrag $, the site class is $\class \sim \classdist _{\srcfrag \class }$ and the residue is $\anctok \sim \eqm _\class $.


Source $\ell $	Dest $\ell '$	Transition weight

$\sta $	$(\frag ', 0, \dom ', e')$	$\kappa _\main \, \domdist _{\dom '}\, \fragdist _{\dom '\frag '}$
$\sta $	$(\frag ', 1, \dom ', e')$	$\kappa _\main \, \domdist _{\dom '}\, \fragdist _{\dom '\frag '}$
$\sta $	$\fin $	$1 - \kappa _\main $

Mid-domain continuations:
$(\frag , 0, \dom , e)$	$(\frag ', g', \dom , e)$	$\ext ^{(\dom )}_{\frag \frag '}$
$(\frag , 1, \dom , 0)$	$(\frag ', g', \dom , e')$	$\notext ^{(\dom )}_\frag \cdot \kappa _\dom \cdot \fragdist _{\dom \frag '}$
$(\frag , 1, \dom , 1)$	$(\frag ', g', \dom ', e')$	$\notext ^{(\dom )}_\frag \cdot (1-\kappa _\dom ) \cdot \kappa _\main \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}\,/\,(1-\zeta )$

Termination:
$(\frag , 0, \dom , e)$	$\fin $	$0$ (cannot end mid-fragment)
$(\frag , 1, \dom , 0)$	$\fin $	$\notext ^{(\dom )}_\frag \cdot (1-\kappa _\dom ) \cdot (1-\kappa _\main )\,/\,(1-\zeta )$
$(\frag , 1, \dom , 1)$	$\fin $	$\notext ^{(\dom )}_\frag \cdot (1-\kappa _\dom ) \cdot (1-\kappa _\main )\,/\,(1-\zeta )$

The $g'$ and $e'$ in the destination labels range over all values ($g' \in \{0,1\}$, $e' \in \{0,1\}$); the destination’s $g'$ is set by the prior $P(g' = 1 | \frag ', \dom ') = \notext ^{(\dom ')}_{\frag '}$ (per-character fragment-termination probability under the intra-fragment Markov), and likewise $e'$ by the per-fragment domain-termination probability $1-\kappa _{\dom '}$. The weights above are the joint contributions; conditional probabilities of next-state indicators are recovered by multiplying by the appropriate prior factor for $(g', e')$.

The $\zeta $ correction $\zeta = \kappa _\main \cdot \emptyseg _0^{(\rm sing)}$ accounts for skipping over geometrically-many empty domains before the next emitting character or sequence termination, where $\emptyseg _0^{(\rm sing)} = \sum _\dom \domdist _\dom (1-\kappa _\dom )$ is the singlet-process probability of an empty domain (under per-domain TKF92 with parameter $\kappa _\dom $). Verification of the marginalised $(\dom , \frag )$-only kernel (Section C.8, equation (C.35)) against the existing MixDom Pair HMM implementation build_nested_trans at $t=0$ confirms these weights to 5+ decimal places on a (n_dom=2, n_fr=2) test instance.

More precisely, the probability of ending the sequence from a domain boundary state involves summing over all possible runs of empty domains before termination. Let $\zeta = \kappa _\main \cdot \emptyseg _0^{(\rm sing)}$ be the probability of generating an empty domain and continuing. Then from state $(\frag , 1, \dom , 1)$ (domain boundary): \begin {equation} \label {eq:singlet-end-domain} P(\text {to } \fin ) = \frac {1 - \kappa _\main }{1 - \zeta } \end {equation} and from state $(\frag , 1, \dom , 0)$ (fragment boundary, mid-domain): \begin {equation} \label {eq:singlet-end-frag} P(\text {to } \fin ) = (1 - \kappa _\dom ) \cdot \frac {1 - \kappa _\main }{1 - \zeta } \end {equation}

Similarly, the transition from a fragment boundary to a new character must account for possibly skipping empty domains. From $(\frag , 1, \dom , 1)$: \begin {equation} \label {eq:singlet-newdom} P((\frag , 1, \dom , 1) \to (\frag ', g', \dom ', e')) = \frac {\kappa _\main \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}}{1 - \zeta } \end {equation} accounting for the possibility that some intermediate domains were empty (the nonempty domain $\dom '$ is reached after a geometric number of empty-domain trials).
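The factor $1/(1-\zeta )$ in both expressions can be read as a geometric series over the number $n$ of empty domains skipped, each contributing a factor $\zeta $. As a short worked derivation (valid because $0 \leq \zeta \leq \kappa _\main < 1$): \[ \sum _{n \geq 0} \zeta ^n\, (1 - \kappa _\main ) = \frac {1 - \kappa _\main }{1 - \zeta }, \qquad \sum _{n \geq 0} \zeta ^n\, \kappa _\main \, \domdist _{\dom '}\, \kappa _{\dom '}\, \fragdist _{\dom '\frag '} = \frac {\kappa _\main \, \domdist _{\dom '}\, \kappa _{\dom '}\, \fragdist _{\dom '\frag '}}{1 - \zeta }. \] The first sum terminates the sequence after $n$ empty domains; the second reaches the first nonempty domain $\dom '$ with initial fragchar $\frag '$ after $n$ empty domains.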

Normalization check. From state $(\frag , 1, \dom , 1)$, summing over all destinations including $\fin $: \[ \sum _{\frag ',g',\dom ',e'} \frac {\kappa _\main \domdist _{\dom '} \kappa _{\dom '} \fragdist _{\dom '\frag '}}{1-\zeta } + \frac {1-\kappa _\main }{1-\zeta } = \frac {\kappa _\main (1 - \emptyseg _0^{(\rm sing)}) + 1 - \kappa _\main }{1 - \zeta } = \frac {1 - \zeta }{1 - \zeta } = 1, \] where we used $\sum _{\dom '} \domdist _{\dom '} \kappa _{\dom '} = 1 - \emptyseg _0^{(\rm sing)}$ and $\zeta = \kappa _\main \emptyseg _0^{(\rm sing)}$. The sum over $g'$ and $e'$ is implicit in the fragment/domain generation that follows.
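The same check can be run numerically. The Python sketch below uses hypothetical names and toy parameter values; it builds the outgoing weights from a domain-boundary state, with $g'$ and $e'$ left implicit as in the text, so destinations are indexed by $(\dom ', \frag ')$ only.

```python
def singlet_domain_boundary_row(kappa_main, domdist, kappa_dom, fragdist):
    """Outgoing weights from a domain-boundary state (f, 1, d, 1)."""
    # Singlet-process probability of an empty domain.
    empty = sum(p * (1.0 - k) for p, k in zip(domdist, kappa_dom))
    zeta = kappa_main * empty
    to_new = {
        (d, f): kappa_main * domdist[d] * kappa_dom[d] * fragdist[d][f]
                / (1.0 - zeta)
        for d in range(len(domdist))
        for f in range(len(fragdist[d]))
    }
    to_end = (1.0 - kappa_main) / (1.0 - zeta)
    return to_new, to_end, zeta

toy_new, toy_end, toy_zeta = singlet_domain_boundary_row(
    kappa_main=0.8, domdist=[0.3, 0.7], kappa_dom=[0.9, 0.5],
    fragdist=[[0.4, 0.6], [1.0, 0.0]])
toy_total = sum(toy_new.values()) + toy_end
```

For these values $\emptyseg _0^{(\rm sing)} = 0.3 \cdot 0.1 + 0.7 \cdot 0.5 = 0.38$ and $\zeta = 0.304$, and `toy_total` equals one up to rounding.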

C.10.3 Labeled-MixDom WFST

Remark C.17 (WFST tables use the matrix-kernel notation). The transition tables below use the intra-fragment Markov kernel $\ext ^{(\dom )}_{\srcfrag \destfrag }$ (matrix indexed by source and destination fragchar) and its per-row termination $\notext ^{(\dom )}_\frag = 1 - \sum _{\frag '} \ext ^{(\dom )}_{\frag \frag '}$. Where a within-fragment self-loop appears, the relevant entry is the diagonal $\ext ^{(\dom )}_{\frag \frag }$; where a new-fragchar-same-domain move appears, it is an off-diagonal $\ext ^{(\dom )}_{\frag \frag '}$ for $\frag ' \neq \frag $; the fragment-end weight is $\notext ^{(\dom )}_\frag $.

The Labeled-MixDom WFST represents the conditional distribution of a descendant labeled sequence given an ancestral labeled sequence, separated by evolutionary time $\evoltime $. When composed with the Labeled-MixDom Singlet HMM (Section C.10.2), it must reproduce the MixDom Pair HMM (Section C.1.1).

Remark C.18 (Reduced kernel for variational inference). The full $(c, f, g, d, e)$ alphabet is what makes this WFST self-contained (no algebraic distillation, exact for beam search and progressive reconstruction). For the variational ancestral-state framework (Appendix C.8) the labelled WFST is marginalised analytically over $(c, c', g, e, g', e')$ to a reduced per-character kernel $\hat {T}_{ss'}((d,f),(d',f'); \branchlen , \theta )$ over just $(d, f)$, with $3 N_\text {dom} N_\text {fr} + 2$ states. The marginalisation does not collapse to a single labelled-WFST entry: the per-character labelled transition $(d, f) \to (d', f')$ admits up to three latent routes (intra-fragment fragchar transition; new fragment, same domain; new domain that may equal $d$), and the source’s $(g, e)$ has a non-trivial posterior over routes whenever $d' = d$. The reduced kernel is therefore a route-sum $\hat {T}_{ss'} = \sum _r \omega ^{(r)} \tilde {T}^{\text {lab},(r)}_{ss'}$ (eq. (C.37)), which collapses to the cleaner form $\omega \cdot \tilde {T}^{\text {lab}}$ only when the per-route labelled WFST entries coincide — a degenerate special case of the trivial $\nfrag = 1, \ndom = 1$ instance. Class marginalisation is trivial at the indel level since the WFST indel block is class-independent. So the labelled WFST defined here and the reduced kernel used in the variational appendix are the same object viewed through different state-space lenses, with the variational appendix providing the explicit route-decomposition.

Design principles. The WFST has two kinds of states:

Emitting states (“unready”): $\mat $, $\mathrm {I_F}$, $\mathrm {I_D}$, $\mathrm {D_F}$, $\mathrm {D_D}$. These consume an input character, produce an output character, or both, then make a mandatory null transition to a Wait state.
Wait states (“ready”): $\mathrm {W_M}$, $\mathrm {W_{D_F}}$, $\mathrm {W_{D_D}}$. These inspect the boundary indicators $(g, e)$ of the current context and choose the next emitting state based on the hierarchical boundary structure. Wait states also handle end-of-sequence.

In addition there are the non-emitting $\sta $ and $\fin $ states.

States. Each emitting or wait state carries a structural label context $\ell = (\frag , g, \dom , e)$. The full state space is: \[ \{\sta \} \;\cup \; \{(X, \frag , g, \dom , e) : X \in \{\mat , \mathrm {I_F}, \mathrm {I_D}, \mathrm {D_F}, \mathrm {D_D}, \mathrm {W_M}, \mathrm {W_{D_F}}, \mathrm {W_{D_D}}\}\} \;\cup \; \{\fin \} \] Not all combinations occur (see constraints below), but the upper bound is $8L + 2$ states.

Context semantics. The context $\ell = (\frag , g, \dom , e)$ on each state records:

In $\mat $ and $\mathrm {W_M}$: the label of the most recent matched character (which is the same for both input and output, since match preserves labels).
In $\mathrm {I_F}$, $\mathrm {I_D}$: the label of the most recent output character.
In $\mathrm {D_F}$, $\mathrm {D_D}$, $\mathrm {W_{D_F}}$, $\mathrm {W_{D_D}}$: the label of the most recent input character.

Thus only one context tuple is needed per state.

Emitting-State Transitions (Unready $\to $ Wait). Every emitting state makes a mandatory null transition (no input, no output) to its corresponding wait state. These transitions carry the boundary-survival weights from the nested TKF92/TKF91 structure.


Source	Dest	Weight	Input	Output

$(\mat , \frag , g, \dom , e)$	$(\mathrm {W_M}, \frag , g, \dom , e)$	$w_{\mat }(g, e)$	$\varepsilon $	$\varepsilon $
$(\mathrm {D_F}, \frag , g, \dom , e)$	$(\mathrm {W_{D_F}}, \frag , g, \dom , e)$	$w_{\mathrm {D_F}}(g, e)$	$\varepsilon $	$\varepsilon $
$(\mathrm {D_D}, \frag , g, \dom , e)$	$(\mathrm {W_{D_D}}, \frag , g, \dom , e)$	$w_{\mathrm {D_D}}(g, e)$	$\varepsilon $	$\varepsilon $

$(\mathrm {I_F}, \frag , g, \dom , e)$	$(\mathrm {W_M}, \frag , g, \dom , e)$	$w_{\mathrm {I_F}}(g, e)$	$\varepsilon $	$\varepsilon $
$(\mathrm {I_D}, \frag , g, \dom , e)$	$(\mathrm {W_M}, \frag , g, \dom , e)$	$w_{\mathrm {I_D}}(g, e)$	$\varepsilon $	$\varepsilon $

However, the insert states need more careful treatment. An inserted fragment or domain is a complete sub-sequence generated by the descendant. The insert states $\mathrm {I_F}$ and $\mathrm {I_D}$ handle character-level emissions within an inserted fragment, and upon fragment/domain completion, control returns to the wait state that initiated the insertion. We therefore need to track whether we are inserting at the fragment level or domain level.

Let us reconsider the state structure more carefully.

Revised State Structure. In the MixDom Pair HMM, the five state types $\mat \mat $, $\mat \ins $, $\mat \del $, $\ins \ins $, $\del \del $ at each $(\dom , \frag )$ position represent:

$\mat \mat _{\dom \frag }$: ancestral and descendant both have a character (match/substitution)
$\mat \ins _{\dom \frag }$: descendant insertion within domain $\dom $, fragment $\frag $
$\mat \del _{\dom \frag }$: ancestral deletion within domain $\dom $, fragment $\frag $
$\ins \ins _{\dom \frag }$: insertion of an entire domain (insert at both the domain level and the nested fragment level)
$\del \del _{\dom \frag }$: deletion of an entire domain (delete at both the domain level and the nested fragment level)

The top-level states $\mat , \ins , \del $ refer to the domain-level TKF91 process, while the nested states $\mat , \ins , \del $ refer to the fragment-level TKF92 process within a domain.

For the WFST, we separate the domain-level and fragment-level indel processes:


WFST State Type	Input	Output

$\mat $ (Match)	$\anctok _{\class \frag g\dom e}$	$\destok _{\class \frag g\dom e}$
$\mathrm {I_F}$ (Insert Fragment char)	$\varepsilon $	$\destok _{\class \frag g\dom e}$
$\mathrm {I_D}$ (Insert Domain char)	$\varepsilon $	$\destok _{\class \frag g\dom e}$
$\mathrm {D_F}$ (Delete Fragment char)	$\anctok _{\class \frag g\dom e}$	$\varepsilon $
$\mathrm {D_D}$ (Delete Domain char)	$\anctok _{\class \frag g\dom e}$	$\varepsilon $
$\mathrm {W_M}$ (Wait after Match)	$\varepsilon $	$\varepsilon $
$\mathrm {W_{D_F}}$ (Wait after Delete-Fragment)	$\varepsilon $	$\varepsilon $
$\mathrm {W_{D_D}}$ (Wait after Delete-Domain)	$\varepsilon $	$\varepsilon $

Key constraint: label preservation. In a Match state, the WFST does not change the structural label: the input and output labels $(\frag , g, \dom , e)$ must be identical. The site class $\class $ is also preserved, while the character $\anctok \to \destok $ may change via substitution. Insertions create new characters with new labels; deletions consume characters without producing output.

Emitting to Wait Transitions. After emitting (or consuming), each emitting state transitions to its wait state. These null transitions carry weights that account for the fragment-extension and domain-continuation structure.

In the MixDom model, within a domain of type $\dom $, each fragment evolves under the intra-fragment Markov kernel $\ext ^{(\dom )}_{\srcfrag \destfrag }$. Within the fragment ($g=0$), the character simply continues. At a fragment boundary ($g=1$), the TKF92 process within the domain decides whether to start a new fragment or end the domain. At a domain boundary ($g=1, e=1$), the TKF91 domain-level process decides whether to start a new domain or end the sequence.

The weights on emitting $\to $ wait transitions are:


Transition	Condition	Weight

$\mat \to \mathrm {W_M}$	$g=0$	$1$
$\mat \to \mathrm {W_M}$	$g=1, e=0$	$(1-\beta _\dom )$
$\mat \to \mathrm {W_M}$	$g=1, e=1$	$(1-\beta _\dom )(1-\beta _\main )$

$\mathrm {D_F} \to \mathrm {W_{D_F}}$	$g=0$	$1$
$\mathrm {D_F} \to \mathrm {W_{D_F}}$	$g=1, e=0$	$(1-\gamma _\dom )$
$\mathrm {D_F} \to \mathrm {W_{D_F}}$	$g=1, e=1$	$(1-\gamma _\dom )(1-\beta _\main )$

$\mathrm {D_D} \to \mathrm {W_{D_D}}$	$g=0$	$1$
$\mathrm {D_D} \to \mathrm {W_{D_D}}$	$g=1, e=0$	$1$
$\mathrm {D_D} \to \mathrm {W_{D_D}}$	$g=1, e=1$	$(1-\gamma _\main )$

where $\beta _\dom = \beta (\insrate _\dom , \delrate _\dom , \evoltime )$, $\gamma _\dom = \gamma (\insrate _\dom , \delrate _\dom , \evoltime )$, $\beta _\main = \beta (\insrate _\main , \delrate _\main , \evoltime )$, $\gamma _\main = \gamma (\insrate _\main , \delrate _\main , \evoltime )$.

The rationale: at $g=0$ we are mid-fragment, so no boundary weight is needed. At $g=1$ (fragment boundary), the TKF92 boundary weight $(1-\beta )$ or $(1-\gamma )$ applies. At $g=1, e=1$ (domain boundary), both the fragment boundary and the domain boundary weights apply. For $\mathrm {D_D}$, the domain is being deleted as a unit; within the deleted domain, fragment structure is irrelevant (all fragments are consumed), so the fragment-level weights are unity and only the domain-level weight $(1-\gamma _\main )$ applies at domain end.
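The weight table above amounts to a small lookup, sketched here in Python. The function name and the string state tags are hypothetical, and the TKF quantities $\beta _\dom $, $\gamma _\dom $, $\beta _\main $, $\gamma _\main $ are taken as precomputed inputs rather than derived from rates.

```python
def emit_to_wait_weight(state, g, e, beta_dom, gamma_dom, beta_main, gamma_main):
    """Null-transition weight for M -> W_M, D_F -> W_DF, D_D -> W_DD."""
    if state not in ("M", "D_F", "D_D"):
        raise ValueError("state must be 'M', 'D_F' or 'D_D'")
    if g == 0:
        return 1.0                              # mid-fragment: no boundary weight
    if state == "D_D":
        # Deleted domain: only the domain-level weight applies, at domain end.
        return 1.0 - gamma_main if e == 1 else 1.0
    frag_weight = 1.0 - (beta_dom if state == "M" else gamma_dom)
    if e == 1:
        return frag_weight * (1.0 - beta_main)  # fragment and domain boundary
    return frag_weight                          # fragment boundary only

# Toy values for the TKF quantities (illustration only).
toy_params = dict(beta_dom=0.2, gamma_dom=0.3, beta_main=0.1, gamma_main=0.4)
toy_match_domain_end = emit_to_wait_weight("M", 1, 1, **toy_params)
toy_delete_frag_end = emit_to_wait_weight("D_F", 1, 0, **toy_params)
```

With the toy values, a match at a domain boundary carries $(1-0.2)(1-0.1) = 0.72$, and a fragment-level delete at a fragment boundary carries $1 - 0.3 = 0.7$.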

Insert states. Insert states ($\mathrm {I_F}$, $\mathrm {I_D}$) represent characters being inserted in the descendant. An inserted fragment is a self-contained TKF92 fragment; an inserted domain is a self-contained TKF91 domain.

After emitting an inserted character, the insert state loops back to itself (fragment extension) or transitions to a wait state (fragment/domain termination). The fragment extension self-loop:


Transition	Weight	Input	Output

$(\mathrm {I_F}, \srcfrag , g, \dom , e) \to (\mathrm {I_F}, \destfrag , g', \dom , e)$	$\ext ^{(\dom )}_{\srcfrag \destfrag }$	$\varepsilon $	$\destok _{\class \destfrag g'\dom e}$
$(\mathrm {I_D}, \srcfrag , g, \dom , e) \to (\mathrm {I_D}, \destfrag , g', \dom , e)$	$\ext ^{(\dom )}_{\srcfrag \destfrag }$	$\varepsilon $	$\destok _{\class \destfrag g'\dom e}$

On fragment termination within an inserted domain ($\mathrm {I_D}$), a new fragment may begin (with TKF92 parameters for the inserted domain):


Transition	Weight	Condition	I/O

$(\mathrm {I_D}, \frag , 1, \dom , 0) \to (\mathrm {I_D}, \frag ', g', \dom , e')$	$\notext ^{(\dom )}_\frag \cdot \kappa _\dom \cdot \fragdist _{\dom \frag '}$	new frag in inserted domain	$\varepsilon / \destok $
$(\mathrm {I_D}, \frag , 1, \dom , e) \to (\mathrm {W_M}, \ldots )$	$\notext ^{(\dom )}_\frag (1-\kappa _\dom )$	$e$ set appropriately	$\varepsilon / \varepsilon $

For $\mathrm {I_F}$ (inserted fragment within an existing domain), fragment termination returns control to $\mathrm {W_M}$:


Transition	Weight	Condition

$(\mathrm {I_F}, \frag , 1, \dom , e) \to (\mathrm {W_M}, \frag , 1, \dom , e)$	$\notext ^{(\dom )}_\frag $	fragment ends

Wait-State Transitions (Ready $\to $ Emitting). Wait states inspect the boundary indicators and decide the next action. The transitions depend on whether we are mid-fragment ($g=0$), at a fragment boundary ($g=1, e=0$), or at a domain boundary ($g=1, e=1$).

Case 1: Mid-fragment ($g = 0$). Within a fragment, only continuation of the current fragment is possible. No new fragments or domains can start.


Source	Dest	Weight	Input	Output

$(\mathrm {W_M}, \frag , 0, \dom , e)$	$(\mat , \frag , g', \dom , e)$	$\alpha _\dom $	$\anctok _{\class \frag g'\dom e}$	$\destok _{\class \frag g'\dom e}$
$(\mathrm {W_M}, \frag , 0, \dom , e)$	$(\mathrm {D_F}, \frag , g', \dom , e)$	$(1-\alpha _\dom )$	$\anctok _{\class \frag g'\dom e}$	$\varepsilon $

$(\mathrm {W_{D_F}}, \frag , 0, \dom , e)$	$(\mathrm {D_F}, \frag , g', \dom , e)$	$1$	$\anctok _{\class \frag g'\dom e}$	$\varepsilon $

$(\mathrm {W_{D_D}}, \frag , 0, \dom , e)$	$(\mathrm {D_D}, \frag , g', \dom , e)$	$1$	$\anctok _{\class \frag g'\dom e}$	$\varepsilon $

Here $g'$ can be 0 or 1 (determined by the input character’s label), $\alpha _\dom = \alpha (\insrate _\dom , \delrate _\dom , \evoltime )$, and the fragment type $\frag $ and domain indicators $(\dom , e)$ are unchanged. The emission weight for $\mat $ is $\exp (\revsub _\class \evoltime )_{\anctok \destok }$; the emission weight for $\mathrm {D_F}$ is 1 (input consumed, no output).

Case 2: Fragment boundary, mid-domain ($g = 1, e = 0$). At a fragment boundary within a domain, the TKF92 process within the domain decides: start a new fragment (match or delete), or insert a new fragment.


Source	Dest	Weight	Input	Output

$(\mathrm {W_M}, \frag , 1, \dom , 0)$	$(\mat , \frag ', g', \dom , e')$	$\alpha _\dom \cdot \fragdist _{\dom \frag '}$	$\anctok _{\class '\frag ' g'\dom e'}$	$\destok _{\class '\frag ' g'\dom e'}$
$(\mathrm {W_M}, \frag , 1, \dom , 0)$	$(\mathrm {D_F}, \frag ', g', \dom , e')$	$(1-\alpha _\dom ) \cdot \fragdist _{\dom \frag '}$	$\anctok _{\class '\frag ' g'\dom e'}$	$\varepsilon $
$(\mathrm {W_M}, \frag , 1, \dom , 0)$	$(\mathrm {I_F}, \frag ', g', \dom , 0)$	$\beta _\dom \cdot \fragdist _{\dom \frag '}$	$\varepsilon $	$\destok _{\class '\frag ' g'\dom 0}$
$(\mathrm {W_M}, \frag , 1, \dom , 0)$	$\fin $	$(1-\kappa _\dom ) \cdot (1-\kappa _\main )$	$\varepsilon $	$\varepsilon $

$(\mathrm {W_{D_F}}, \frag , 1, \dom , 0)$	$(\mat , \frag ', g', \dom , e')$	$\alpha _\dom \cdot \fragdist _{\dom \frag '}$	$\anctok _{\class '\frag ' g'\dom e'}$	$\destok _{\class '\frag ' g'\dom e'}$
$(\mathrm {W_{D_F}}, \frag , 1, \dom , 0)$	$(\mathrm {D_F}, \frag ', g', \dom , e')$	$(1-\alpha _\dom ) \cdot \fragdist _{\dom \frag '}$	$\anctok _{\class '\frag ' g'\dom e'}$	$\varepsilon $

$(\mathrm {W_{D_D}}, \frag , 1, \dom , 0)$	$(\mathrm {D_D}, \frag ', g', \dom , e')$	$\fragdist _{\dom \frag '}$	$\anctok _{\class '\frag ' g'\dom e'}$	$\varepsilon $

Note: in this case, a new fragment type $\frag '$ is drawn from $\fragdist _{\dom \frag '}$, and $e'$ is determined by the input character’s label. The domain type $\dom $ is unchanged.

Normalization of $\mathrm {W_M}$ at fragment boundary ($g=1, e=0$). The outgoing weights from $(\mathrm {W_M}, \frag , 1, \dom , 0)$ must sum to 1 when we include all possible input/output characters:

With an input character present (ancestral fragment continues): $\alpha _\dom + (1 - \alpha _\dom ) = 1$, weighted by $\fragdist _{\dom \frag '}$. But the input character determines $\frag '$, so the $\fragdist $ weight is a prior for the singlet composition, not a transition weight in the WFST.
With no input character: insertion weight $\beta _\dom $ or end weight.

For a WFST, however, the normalization is more subtle: the transducer weights need not sum to 1 at every state, because the transducer represents a conditional distribution. What must hold is that, when the WFST is composed with the singlet HMM, the resulting Pair HMM transitions are properly normalized.

Global vs. local normalization. The Labeled-MixDom WFST is constructed by dividing the row-stochastic Pair HMM $\transnest $ (Section C.1.1) by the row-stochastic Singlet HMM (Section C.10.2). By construction, $\text {Singlet} \circ \text {WFST} = \text {Pair HMM}$, so the WFST is conditionally normalized in the global sense: $\sum _{\text {output sequences } y} P(y \mid x, \theta , \evoltime ) = 1$ for any ancestor sequence $x$. However, the per-state outgoing weights from $\mat ,\mathrm {I_F},\mathrm {I_D}, \mathrm {D_F},\mathrm {D_D}$ do not sum to 1 over a fixed input symbol, even when the Singlet emission factor for the destination label is included. This is the same state-folding artifact discussed for the TKF92 WFST in Appendix A.3.7: the Bernoulli-$\ext $ extension-vs-exit decision (here, the $\ext ^{(\dom )}_{\srcfrag \destfrag }$ vs. $\notext ^{(\dom )}_\srcfrag $ event), and the $\kappa _\dom $ vs. $1{-}\kappa _\dom $ domain-continuation event, have been compiled into the same $\mat \to \cdot $ and $\mathrm {I_\cdot }\to \cdot $ edges that already carry the destination-singlet factors. Splitting each emitting state into a “just-arrived” and a “decision” state would restore local stochasticity at the cost of doubling (or tripling) the state graph. The compact form here trades local stochasticity for a smaller machine, exactly as in TKF92.

Let us instead directly specify the transition weights that, when composed with the Singlet HMM, reproduce the Pair HMM transition matrix $\transnest $ from Section C.1.1.

Complete WFST Transition Table

We now give the complete transition table. For readability, we abbreviate the state labels and split by source wait-state type and boundary case.

Let the following shorthand apply throughout: \begin {align*} \alpha _\dom &= \alpha (\insrate _\dom , \delrate _\dom , \evoltime ), & \beta _\dom &= \beta (\insrate _\dom , \delrate _\dom , \evoltime ), & \gamma _\dom &= \gamma (\insrate _\dom , \delrate _\dom , \evoltime ) \\ \alpha _\main &= \alpha (\insrate _\main , \delrate _\main , \evoltime ), & \beta _\main &= \beta (\insrate _\main , \delrate _\main , \evoltime ), & \gamma _\main &= \gamma (\insrate _\main , \delrate _\main , \evoltime ) \\ \kappa _\dom &= \insrate _\dom / \delrate _\dom , & \kappa _\main &= \insrate _\main / \delrate _\main \end {align*}

and recall $\nonemptytrans $ is the effective $5 \times 5$ domain-level transition matrix with null states eliminated (Section C.1.1).
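The shorthand above can be sketched numerically. This is a minimal sketch that assumes the standard TKF91 closed forms for $\alpha $, $\beta $ and $\gamma $; the definitions given elsewhere in this document take precedence if they differ. The function names are illustrative and not part of the original text.

```python
import math

def tkf_alpha(ins_rate, del_rate, t):
    # Assumed TKF91 form: probability that an ancestral link survives time t.
    return math.exp(-del_rate * t)

def tkf_beta(ins_rate, del_rate, t):
    # Assumed TKF91 form; requires ins_rate < del_rate.
    e = math.exp((ins_rate - del_rate) * t)
    return ins_rate * (1.0 - e) / (del_rate - ins_rate * e)

def tkf_gamma(ins_rate, del_rate, t):
    # Assumed TKF91 form, written in terms of alpha and beta.
    a = tkf_alpha(ins_rate, del_rate, t)
    b = tkf_beta(ins_rate, del_rate, t)
    return 1.0 - del_rate * b / (ins_rate * (1.0 - a))

def tkf_kappa(ins_rate, del_rate):
    # kappa = insertion rate / deletion rate, as in the shorthand above.
    return ins_rate / del_rate

def shorthand(ins_rate, del_rate, t):
    # Bundle the four quantities for one level (a domain type or the main level).
    return {
        "alpha": tkf_alpha(ins_rate, del_rate, t),
        "beta": tkf_beta(ins_rate, del_rate, t),
        "gamma": tkf_gamma(ins_rate, del_rate, t),
        "kappa": tkf_kappa(ins_rate, del_rate),
    }
```

Calling `shorthand` once per domain type and once for the main level gives all the per-level quantities used in the tables that follow.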

Start transitions. From $\sta $, the WFST enters its first emitting state. The weights mirror the first row of $\transnest $:


Dest	Weight	Input	Output

$(\mat , \frag ', g', \dom ', e')$	$\nonemptytrans _{\sta \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \ystate } \cdot \fragdist _{\dom '\frag '}$	$\anctok _{\class '\frag ' g'\dom ' e'}$	$\destok _{\class '\frag ' g'\dom ' e'}$
$(\mathrm {I_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\sta \ins } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\varepsilon $	$\destok _{\class '\frag ' g'\dom ' e'}$
$(\mathrm {D_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\sta \del } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\anctok _{\class '\frag ' g'\dom ' e'}$	$\varepsilon $
$\fin $	$\nonemptytrans _{\sta \fin }$	$\varepsilon $	$\varepsilon $

where $\ystate $ is determined by the nested state type of the destination ($\mat $ for Match, $\ins $ for Insert-Fragment, $\del $ for Delete-Fragment within the domain). For Match destinations that enter at the first character of a domain, $\tkftrans ^{(\dom ')}_{\sta \ystate }$ is the TKF92 Pair HMM transition from $\sta $ into the appropriate nested state.

Emitting to Wait transitions. After each emitting state, a null transition to the corresponding wait state occurs. The weight depends on the boundary indicators:


Transition	Condition on $(g, e)$	Weight

$(\mat , \ell ) \to (\mathrm {W_M}, \ell )$	$g=0$	$1$
	$g=1,\, e=0$	$\notext ^{(\dom )}_\frag \cdot 1$
	$g=1,\, e=1$	$\notext ^{(\dom )}_\frag \cdot 1$

$(\mathrm {D_F}, \ell ) \to (\mathrm {W_{D_F}}, \ell )$	$g=0$	$1$
	$g=1,\, e=0$	$\notext ^{(\dom )}_\frag $
	$g=1,\, e=1$	$\notext ^{(\dom )}_\frag $

$(\mathrm {D_D}, \ell ) \to (\mathrm {W_{D_D}}, \ell )$	$g=0$	$1$
	$g=1,\, e=0$	$\notext ^{(\dom )}_\frag $
	$g=1,\, e=1$	$\notext ^{(\dom )}_\frag $

The $\notext ^{(\dom )}_\frag $ factors tabulated above double-count the fragment extension, which is already handled elsewhere. At $g=0$, the character is mid-fragment; the next character in the same fragment follows with certainty (the matrix entry $\ext ^{(\dom )}_{\srcfrag \destfrag }$ is already accounted for in the singlet HMM emission of the next labelled character). At $g=1$, the fragment has ended, so the per-source termination weight $\notext ^{(\dom )}_\frag $ has already been “spent” by the fact that $g=1$ was observed. Since the boundary indicators are part of the label on the character, which is determined by the input (for ancestral) or output (for descendant), the fragment extension probability is absorbed into the singlet HMM, not the WFST.

Therefore all emitting-to-wait transitions have unit weight: \begin {equation} \label {eq:emit-to-wait} w(X \to W_X) = 1 \quad \text {for all } X \in \{\mat , \mathrm {D_F}, \mathrm {D_D}\} \end {equation} The fragment and domain boundary structure is encoded in the wait-state outgoing transitions, which condition on $(g, e)$.

Wait-State Outgoing Transitions

The wait states make all structural decisions. We organize by source state type and boundary case.

$\mathrm {W_M}$ (Wait after Match). Case $g=0$ (mid-fragment): Continue the current fragment. The next input character must have the same $(\frag , \dom , e)$.


Dest	Weight	Input	Output	Notes

$(\mat , \frag , g', \dom , e)$	$\alpha _\dom $	$\anctok _{\class '\frag g'\dom e}$	$\destok _{\class '\frag g'\dom e}$	match continues
$(\mathrm {D_F}, \frag , g', \dom , e)$	$(1-\alpha _\dom )$	$\anctok _{\class '\frag g'\dom e}$	$\varepsilon $	fragment-level deletion

Case $g=1, e=0$ (fragment boundary, mid-domain): The fragment has ended. The next step is a new fragment within the same domain (matched or deleted), an inserted fragment, or the end of the domain followed by a domain-level transition.


Dest	Weight	I	O	Notes

$(\mat , \frag ', g', \dom , e')$	$\tkftrans ^{(\dom )}_{\mat \mat } \cdot \fragdist _{\dom \frag '}$	$\anctok _{\ldots }$	$\destok _{\ldots }$	new frag, match
$(\mathrm {D_F}, \frag ', g', \dom , e')$	$\tkftrans ^{(\dom )}_{\mat \del } \cdot \fragdist _{\dom \frag '}$	$\anctok _{\ldots }$	$\varepsilon $	new frag, delete
$(\mathrm {I_F}, \frag ', g', \dom , 0)$	$\tkftrans ^{(\dom )}_{\mat \ins } \cdot \fragdist _{\dom \frag '}$	$\varepsilon $	$\destok _{\ldots }$	insert frag
$\fin _{\rm domain}$	$\tkftrans ^{(\dom )}_{\mat \fin }$			domain ends (see below)

When the domain ends, which happens with weight $\tkftrans ^{(\dom )}_{\mat \fin } = (1-\beta _\dom )(1-\kappa _\dom )$, the domain-level process takes over. This is where the $\nonemptytrans $ matrix from the null-eliminated domain-level Pair HMM applies.

Rather than introducing an intermediate “domain end” state, we can fold the domain-level transition into the fragment-boundary transitions. From $(\mathrm {W_M}, \frag , 1, \dom , 0)$, with the domain ending: the TKF92 weight is $\tkftrans ^{(\dom )}_{\mat \fin }$, and then the domain-level $\nonemptytrans _{\mat \cdot }$ applies to reach the next domain.

However, this conflates two levels of the hierarchy. To keep the WFST clean and composable, we should handle this through the domain-end indicator $e$: when $e=0$, we are mid-domain and only fragment-level transitions apply. The domain-end case $e=1$ is reached when the last fragment of a domain has ended.

To resolve this, recall what the labels encode. The labels $(g, e)$ on the input/output characters tell us the hierarchical position. The singlet HMM generates these labels according to the MixDom stationary process. The WFST must respect them: it cannot change $(g, e)$ on matched characters.

So the WFST sees:

Characters labeled $g=0$: mid-fragment
Characters labeled $g=1, e=0$: end of fragment, not end of domain
Characters labeled $g=1, e=1$: end of fragment and end of domain

At a fragment boundary in the ancestor ($g=1$), the WFST knows the ancestral fragment has ended. The next ancestral character (if any) will start a new fragment or domain. Between the end of one fragment and the start of the next, the WFST may insert new fragments (for $\mathrm {I_F}$) or entire domains (for $\mathrm {I_D}$).

This gives us the complete transition logic from each wait state.

Revised: $\mathrm {W_M}$ at $g=1, e=0$ (fragment boundary, mid-domain):


Dest	Weight	I/O	Notes

$(\mat , \frag ', g', \dom , e')$	$\alpha _\dom \cdot \fragdist _{\dom \frag '}$	$\anctok / \destok $	new fragment, match
$(\mathrm {D_F}, \frag ', g', \dom , e')$	$(1-\alpha _\dom ) \cdot \fragdist _{\dom \frag '}$	$\anctok / \varepsilon $	new fragment, delete
$(\mathrm {I_F}, \frag ', g', \dom , 0)$	$\beta _\dom \cdot \fragdist _{\dom \frag '}$	$\varepsilon / \destok $	insert fragment

Normalization: with an input character, the weight is $\alpha _\dom + (1-\alpha _\dom ) = 1$ (times $\fragdist $). Without an input character (insertion), $\beta _\dom $ (times $\fragdist $). These don’t need to sum to 1 together because input-present and input-absent are exclusive events in the WFST.

$\mathrm {W_M}$ at $g=1, e=1$ (domain boundary):

Here both the fragment and domain have ended. The domain-level TKF91 process decides what happens next.


Dest	Weight	I/O	Notes

$(\mat , \frag ', g', \dom ', e')$	$\nonemptytrans _{\mat \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \mat } \cdot \fragdist _{\dom '\frag '}$	$\anctok / \destok $	new domain, match
$(\mathrm {D_F}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\mat \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \del } \cdot \fragdist _{\dom '\frag '}$	$\anctok / \varepsilon $	new domain, del-frag
$(\mathrm {D_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\mat \del } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\anctok / \varepsilon $	delete domain
$(\mathrm {I_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\mat \ins } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\varepsilon / \destok $	insert domain
$\fin $	$\nonemptytrans _{\mat \fin }$	$\varepsilon / \varepsilon $	end

Here $\nonemptytrans _{\mat \cdot }$ are the null-eliminated domain-level transitions. The $\tkftrans ^{(\dom ')}_{\sta \mat }$ and $\tkftrans ^{(\dom ')}_{\sta \del }$ factors are the TKF92 entry transitions into the new domain $\dom '$.

$\mathrm {W_{D_F}}$ (Wait after Delete-Fragment).

$\mathrm {W_{D_F}}$ tracks the deletion of individual fragments within a domain (the domain itself is matched/surviving; only some fragments are deleted).

Case $g=0$ (mid-fragment):


Dest	Weight	I/O	Notes

$(\mathrm {D_F}, \frag , g', \dom , e)$	$1$	$\anctok / \varepsilon $	continue deleting fragment

Case $g=1, e=0$ (fragment boundary, mid-domain):


Dest	Weight	I/O	Notes

$(\mat , \frag ', g', \dom , e')$	$\alpha _\dom \cdot \fragdist _{\dom \frag '}$	$\anctok / \destok $	new frag, match
$(\mathrm {D_F}, \frag ', g', \dom , e')$	$(1-\alpha _\dom ) \cdot \fragdist _{\dom \frag '}$	$\anctok / \varepsilon $	new frag, delete

Case $g=1, e=1$ (domain boundary):


Dest	Weight	I/O	Notes

$(\mat , \frag ', g', \dom ', e')$	$\nonemptytrans _{\mat \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \mat } \cdot \fragdist _{\dom '\frag '}$	$\anctok / \destok $	new domain, match
$(\mathrm {D_F}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\mat \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \del } \cdot \fragdist _{\dom '\frag '}$	$\anctok / \varepsilon $	new domain, del-frag
$(\mathrm {D_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\mat \del } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\anctok / \varepsilon $	delete domain
$(\mathrm {I_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\mat \ins } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\varepsilon / \destok $	insert domain
$\fin $	$\nonemptytrans _{\mat \fin }$	$\varepsilon / \varepsilon $	end

Note: $\mathrm {W_{D_F}}$ at domain boundaries uses the same $\nonemptytrans _{\mat \cdot }$ row as $\mathrm {W_M}$, because fragment-level deletion within a domain does not affect the domain-level state (the domain was matched, i.e. the top-level state was $\mat $).

$\mathrm {W_{D_D}}$ (Wait after Delete-Domain).

$\mathrm {W_{D_D}}$ handles the deletion of entire domains. Within a deleted domain, all fragments are consumed without output.

Case $g=0$ (mid-fragment):


Dest	Weight	I/O	Notes

$(\mathrm {D_D}, \frag , g', \dom , e)$	$1$	$\anctok / \varepsilon $	continue consuming domain

Case $g=1, e=0$ (fragment boundary, mid-domain):


Dest	Weight	I/O	Notes

$(\mathrm {D_D}, \frag ', g', \dom , e')$	$\fragdist _{\dom \frag '}$	$\anctok / \varepsilon $	next fragment in deleted domain

Note: within a deleted domain, the entire domain is being consumed, so the fragment distribution weight $\fragdist _{\dom \frag '}$ is needed to match the singlet prior.

Case $g=1, e=1$ (domain boundary):


Dest	Weight	I/O	Notes

$(\mat , \frag ', g', \dom ', e')$	$\nonemptytrans _{\del \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \mat } \cdot \fragdist _{\dom '\frag '}$	$\anctok / \destok $	new domain, match
$(\mathrm {D_F}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\del \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \del } \cdot \fragdist _{\dom '\frag '}$	$\anctok / \varepsilon $	new domain, del-frag
$(\mathrm {D_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\del \del } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\anctok / \varepsilon $	delete another domain
$(\mathrm {I_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\del \ins } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\varepsilon / \destok $	insert domain
$\fin $	$\nonemptytrans _{\del \fin }$	$\varepsilon / \varepsilon $	end

Here $\nonemptytrans _{\del \cdot }$ is used because the domain-level state was $\del $.

Insert states ($\mathrm {I_F}$, $\mathrm {I_D}$).

Insert states handle character-level emissions for inserted fragments/domains. They self-loop for fragment extension and terminate back to wait states.

$\mathrm {I_F}$ (inserted fragment within an existing domain):


Source	Dest	Weight	I/O

$(\mathrm {I_F}, \srcfrag , 0, \dom , e)$	$(\mathrm {I_F}, \destfrag , g', \dom , e)$	$\ext ^{(\dom )}_{\srcfrag \destfrag }$ (emit next char)	$\varepsilon / \destok $
$(\mathrm {I_F}, \frag , 1, \dom , e)$	$(\mathrm {W_M}, \frag , 1, \dom , e)$	$1$ (fragment ended)	$\varepsilon / \varepsilon $

Since the label $g$ on the output character indicates whether the fragment continues ($g=0$) or ends ($g=1$), the fragment-extension matrix $\ext ^{(\dom )}$ is again handled by the singlet HMM generating the output labels. In the WFST, $\mathrm {I_F}$ at $g=0$ continues emitting; at $g=1$ the fragment ends and control returns to $\mathrm {W_M}$ to decide on the next fragment-level action.

For insert states, however, the output character labels are generated by the WFST itself (there is no ancestral character to copy from). The WFST must therefore assign the correct probabilities to the output labels. The emission weight at $(\mathrm {I_F}, \frag , g, \dom , e)$ for output character $\destok _{\class '\frag ' g'\dom ' e'}$ is: \begin {equation} \label {eq:if-emit} \delta _{\dom \dom '} \delta _{ee'} \cdot \classdist _{\frag '\class '} \eqm _{\class '\destok } \cdot \begin {cases} \ext ^{(\dom )}_{\frag \frag '} & \text {if } g' = 0 \\ \notext ^{(\dom )}_\frag \,\delta _{\frag \frag '} & \text {if } g' = 1 \end {cases} \end {equation} followed by a transition: if $g'=0$, loop to $(\mathrm {I_F}, \frag , 0, \dom , e)$; if $g'=1$, go to the appropriate wait state.
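The insert-fragment emission weight can be sketched as a small function. This is a minimal sketch under assumed data layouts: the nested-list tables `class_dist[frag][cls]`, `eqm[cls][tok]`, `ext[dom][frag][frag2]` and `notext[dom][frag]`, the tuple encodings of states and output labels, and the toy numbers are all hypothetical choices made for illustration.

```python
def if_emission_weight(src, out, class_dist, eqm, ext, notext):
    """Emission weight at an insert-fragment state, following the case split above.

    src = (frag, dom, e) is the current state context;
    out = (cls, tok, frag2, g2, dom2, e2) is the labelled output character.
    """
    frag, dom, e = src
    cls, tok, frag2, g2, dom2, e2 = out
    # The Kronecker deltas on domain type and domain-end flag.
    if dom2 != dom or e2 != e:
        return 0.0
    char_prob = class_dist[frag2][cls] * eqm[cls][tok]
    if g2 == 0:
        # Fragment continues: intra-fragment extension weight.
        return char_prob * ext[dom][frag][frag2]
    # Fragment ends: termination weight, with the delta on fragment type.
    same_type = 1.0 if frag2 == frag else 0.0
    return char_prob * notext[dom][frag] * same_type

# Toy parameters (hypothetical): one domain, two fragment types, two classes, two tokens.
TOY_CLASS_DIST = [[0.7, 0.3], [0.4, 0.6]]
TOY_EQM = [[0.25, 0.75], [0.5, 0.5]]
TOY_EXT = [[[0.5, 0.2], [0.1, 0.6]]]   # ext[dom][frag][frag2]
TOY_NOTEXT = [[0.3, 0.3]]              # notext[dom][frag] = 1 - sum of the ext row

def total_emission_mass(src):
    # Sum the emission weight over every output label compatible with src.
    total = 0.0
    for cls in range(2):
        for tok in range(2):
            for frag2 in range(2):
                for g2 in (0, 1):
                    out = (cls, tok, frag2, g2, src[1], src[2])
                    total += if_emission_weight(
                        src, out, TOY_CLASS_DIST, TOY_EQM, TOY_EXT, TOY_NOTEXT)
    return total
```

With these toy tables the weights over all compatible output labels sum to 1 for each source state, because each row of `ext` plus the matching `notext` entry sums to 1.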

$\mathrm {I_D}$ (inserted domain): similar to $\mathrm {I_F}$, but the entire domain is new. Within the inserted domain, the fragment-level TKF92 structure applies:


Source condition	Dest	Weight	Notes

$(\mathrm {I_D}, \frag , 0, \dom , e)$	$(\mathrm {I_D}, \frag , g', \dom , e)$, emit $\destok $		mid-fragment, continue
$(\mathrm {I_D}, \frag , 1, \dom , 0)$	$(\mathrm {I_D}, \frag ', g', \dom , e')$, emit $\destok $	$\kappa _\dom \cdot \fragdist _{\dom \frag '}$	new fragment in inserted domain
$(\mathrm {I_D}, \frag , 1, \dom , 0)$	next domain-level action	$(1-\kappa _\dom )$	domain ends within inserted domain
$(\mathrm {I_D}, \frag , 1, \dom , 1)$	domain-level wait		inserted domain complete

When $\mathrm {I_D}$ completes (the inserted domain ends), control goes back to a domain-level wait state. Since the domain was inserted (top-level state $\ins $), the next domain-level transition uses $\nonemptytrans _{\ins \cdot }$:


Dest	Weight	I/O	Notes

$(\mat , \frag ', g', \dom ', e')$	$\nonemptytrans _{\ins \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \mat } \cdot \fragdist _{\dom '\frag '}$	$\anctok / \destok $	match next domain
$(\mathrm {D_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\ins \del } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\anctok / \varepsilon $	delete next domain
$(\mathrm {I_D}, \frag ', g', \dom ', e')$	$\nonemptytrans _{\ins \ins } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}$	$\varepsilon / \destok $	insert another domain
$\fin $	$\nonemptytrans _{\ins \fin }$	$\varepsilon / \varepsilon $	end

C.10.4 Emission Weights

The emission weights for each emitting state type are:


State	Emission	Weight

$\mat $	input $\anctok _{\class \frag g\dom e}$, output $\destok _{\class \frag g\dom e}$	$\exp (\revsub _\class \evoltime )_{\anctok \destok }$
$\mathrm {I_F}$	output $\destok _{\class \frag g\dom e}$	$\classdist _{\frag \class }\, \eqm _{\class \destok } \cdot p_g(\ext ^{(\dom )}_{\frag \cdot })$
$\mathrm {I_D}$	output $\destok _{\class \frag g\dom e}$	$\classdist _{\frag \class }\, \eqm _{\class \destok } \cdot p_g(\ext ^{(\dom )}_{\frag \cdot })$
$\mathrm {D_F}$	input $\anctok _{\class \frag g\dom e}$	$1$
$\mathrm {D_D}$	input $\anctok _{\class \frag g\dom e}$	$1$

where $p_g(\ext ^{(\dom )}_{\frag \cdot })$ stands for $\ext ^{(\dom )}_{\frag \frag '}$ if $g=0$ (fragment continues with destination fragchar $\frag '$) and $\notext ^{(\dom )}_\frag $ if $g=1$ (fragment ends), and the substitution matrix $\revsub _\class $ is the rate matrix for site class $\class $.

For the WFST (conditional on ancestor), the delete emission weight is 1 because the ancestral emission probability is divided out (it was generated by the singlet HMM). The match emission divides out the ancestral $\eqm _{\class \anctok }$ from the pair emission $\eqm _{\class \anctok } \exp (\revsub _\class \evoltime )_{\anctok \destok }$, yielding just the substitution matrix entry.

For insert states, the emission includes the full descendant character probability because there is no ancestral character to condition on.
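The relation between pair and WFST emission weights can be checked numerically. This is a minimal sketch that assumes a single site class and a Jukes-Cantor-style rate matrix (equal off-diagonal rates), chosen only because its matrix exponential has a closed form; the function names are hypothetical.

```python
import math

def jc_transition(a, b, rate, t, n=4):
    # Closed-form exp(Q t)[a][b] for a rate matrix with equal off-diagonal
    # rate `rate` over n tokens (an illustrative assumption, not the model's Q).
    decay = math.exp(-n * rate * t)
    if a == b:
        return 1.0 / n + (1.0 - 1.0 / n) * decay
    return (1.0 - decay) / n

def pair_match_emission(a, b, eqm, rate, t):
    # Pair HMM match emission: equilibrium probability times substitution probability.
    return eqm[a] * jc_transition(a, b, rate, t, len(eqm))

def wfst_match_emission(a, b, eqm, rate, t):
    # WFST match emission: the ancestral equilibrium factor is divided out.
    return pair_match_emission(a, b, eqm, rate, t) / eqm[a]

def wfst_delete_emission(a, eqm):
    # Pair delete emission is eqm[a]; dividing out the ancestral factor leaves 1.
    return eqm[a] / eqm[a]
```

For a uniform equilibrium `eqm = [0.25] * 4`, `wfst_match_emission` returns the substitution-matrix entry itself, and each row of those entries sums to 1 over the descendant token.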

C.10.5 Verification: Composition Reproduces the Pair HMM

Claim. The Labeled-MixDom Singlet HMM composed with the Labeled-MixDom WFST is equivalent to the MixDom Pair HMM defined in Section C.1.1.

Sketch of proof. The composition proceeds as follows:

1.: The Singlet HMM generates the ancestral labeled sequence, with transitions governed by the MixDom stationary distribution.
2.: The WFST reads the ancestral labeled sequence as input and produces the descendant labeled sequence as output.
3.: The composed machine has states that are pairs (singlet state, WFST state). Since both are order-1 machines tracking structural labels, the composed state is a pair of structural labels.

We verify that the composed transition weights match $\transnest $ for each case:

Match-to-Match within a fragment ($g=0$). The singlet emits character $\anctok _{\class \frag 0\dom e}$ with weight $\classdist _{\frag \class } \eqm _{\class \anctok }$, and transitions to state $(\frag ', g', \dom , e)$ with weight $\ext ^{(\dom )}_{\frag \frag '}$ (if $g'=0$) or $\notext ^{(\dom )}_\frag \cdot \ldots $ (if $g'=1$). The WFST in state $(\mathrm {W_M}, \srcfrag , 0, \dom , e)$ reads this input and transitions to $(\mat , \destfrag , g', \dom , e)$ with weight $\alpha _\dom $, emitting $\destok $ with substitution weight $\exp (\revsub _\class \evoltime )_{\anctok \destok }$. The composed emission weight is $\classdist _{\destfrag \class } \eqm _{\class \anctok } \exp (\revsub _\class \evoltime )_{\anctok \destok }$, summing over $\class $ gives $\sum _\class \classdist _{\destfrag \class } \eqm _{\class \anctok } \exp (\revsub _\class \evoltime )_{\anctok \destok }$, which matches the Pair HMM emission for $\mat \mat _{\dom \destfrag }$. The transition weight within the fragment is $\ext ^{(\dom )}_{\srcfrag \destfrag } \cdot \alpha _\dom $, which corresponds to the $\samedom \ext ^{(\dom )}_{\srcfrag \destfrag }$ term in $\transnest $ (for the intra-fragment Markov contribution to $\mat \mat _{\srcdom \srcfrag } \to \mat \mat _{\srcdom \destfrag }$).

Match-to-Match across fragment boundary ($g=1, e=0$). Singlet: fragment ends, transitions to new fragment $\frag '$ within same domain with weight $\kappa _\dom \cdot \fragdist _{\dom \frag '}$. WFST: $(\mathrm {W_M}, \frag , 1, \dom , 0)$ transitions to $(\mat , \frag ', g', \dom , e')$ with weight $\alpha _\dom \cdot \fragdist _{\dom \frag '}$. Combined: $\kappa _\dom \cdot \fragdist _{\dom \frag '} \cdot \alpha _\dom \cdot \fragdist _{\dom \frag '}$.

The factor $\fragdist _{\dom \frag '}$ appears twice, which is wrong. This reveals that the WFST transition weight should not include $\fragdist _{\dom \frag '}$ when the input character determines $\frag '$. The fragment type of the next input character is determined by the singlet HMM, and the WFST simply reads whatever fragment type appears.

Let us correct: at fragment boundaries, when the next action involves reading an input character, the WFST does not weight by $\fragdist _{\dom \frag '}$. The $\fragdist $ is part of the singlet distribution, not the conditional (WFST) distribution.

Corrected Wait-State Transitions. The WFST represents the conditional distribution $P(\text {descendant} \mid \text {ancestor})$. Therefore:

Transitions that consume an input character should not include the prior probability of that input character’s labels (such as $\fragdist $, $\domdist $).
Transitions that produce an output character (insertions) do include the full probability of the output labels (since the WFST generates them).
Transitions involving the $\nonemptytrans $ matrix at domain boundaries must be adjusted: $\nonemptytrans $ was derived for the Pair HMM (joint distribution), so the WFST version divides out the ancestral prior factors.

Concretely, the $\transnest $ entry for $\mat \xstate _{\srcdom \srcfrag } \to \mat \ystate _{\destdom \destfrag }$ with $\srcdom = \destdom $ (same domain, new fragment) is: \[ \notext ^{(\dom )}_\srcfrag \cdot \tkftrans ^{(\dom )}_{\xstate \ystate } \cdot \fragdist _{\dom \destfrag } \] This is the joint weight. In the composed (singlet $\circ $ WFST) machine:

Singlet provides: $\notext ^{(\dom )}_\srcfrag \cdot \kappa _\dom \cdot \fragdist _{\dom \destfrag }$ (end fragment, continue domain, choose new fragment type)
WFST provides: $\frac {\tkftrans ^{(\dom )}_{\xstate \ystate }}{\kappa _\dom }$ (the conditional transition, dividing out the $\kappa _\dom $ from the joint)

Product: $\notext ^{(\dom )}_\srcfrag \cdot \kappa _\dom \cdot \fragdist _{\dom \destfrag } \cdot \tkftrans ^{(\dom )}_{\xstate \ystate } / \kappa _\dom = \notext ^{(\dom )}_\srcfrag \cdot \tkftrans ^{(\dom )}_{\xstate \ystate } \cdot \fragdist _{\dom \destfrag }$. This matches the Pair HMM. $\checkmark $

Similarly, for the domain-boundary case with $\srcdom \neq \destdom $:

Singlet provides: $\notext ^{(\srcdom )}_\srcfrag (1-\kappa _{\srcdom }) \cdot \frac {\kappa _\main \domdist _{\destdom } \kappa _{\destdom } \fragdist _{\destdom \destfrag }}{1-\zeta }$
WFST provides: the conditional weight that, when multiplied by the singlet weight, gives $\transnest _{\mat \xstate _{\srcdom \srcfrag } \to \mat \ystate _{\destdom \destfrag }}$

The required WFST transition weights therefore depend on the singlet transition structure. Rather than writing out all the corrected weights with singlet factors divided out, we express the key principle:

WFST weight principle: For any transition that consumes an input character with label $\ell '$, the WFST weight is the Pair HMM transition weight divided by the singlet transition weight for generating a character with label $\ell '$ from the current singlet state. For transitions that produce an output character (insertions), the WFST carries the full conditional weight. For null transitions (emitting-to-wait), the weight is 1.

Specifically, let $P_{\rm pair}(i \to j)$ be the Pair HMM transition from state $i$ to $j$, and let $P_{\rm sing}(\ell \to \ell ')$ be the singlet transition. Then: \begin {equation} \label {eq:wfst-weight-principle} w_{\rm WFST}(s, \ell _{\rm in}, \ell _{\rm out} \to s', \ell ') = \frac {P_{\rm pair}(i(s,\ell ) \to j(s',\ell '))}{P_{\rm sing}(\ell \to \ell ')} \end {equation} where $i(s,\ell )$ and $j(s',\ell ')$ are the corresponding Pair HMM states, and $\ell _{\rm in}$ is present when $s'$ consumes input.
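The weight principle amounts to a single division, which can be sketched and checked on the same-domain fragment-boundary case worked through earlier. This is a minimal sketch; the function names and the scalar-argument interface are hypothetical.

```python
def wfst_weight(pair_weight, singlet_weight):
    # WFST weight = Pair HMM transition weight / singlet transition weight.
    if singlet_weight <= 0.0:
        raise ValueError("singlet transition weight must be positive")
    return pair_weight / singlet_weight

def same_domain_fragment_boundary(notext, kappa, frag_prob, tkf_xy):
    # Singlet: end fragment, continue domain, choose new fragment type.
    singlet = notext * kappa * frag_prob
    # Pair HMM: joint weight for the same-domain, new-fragment transition.
    pair = notext * tkf_xy * frag_prob
    return pair, singlet, wfst_weight(pair, singlet)
```

For this case the returned WFST weight equals `tkf_xy / kappa`, and multiplying it by the singlet weight recovers the pair weight.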

Explicit corrected weights. We now tabulate the corrected WFST transition weights organized by source wait state, boundary case, and whether input is consumed.

In all cases below, the WFST context $\ell = (\frag , g, \dom , e)$ is the structural label of the last processed character. The singlet HMM is in the corresponding state. The factor $P_{\rm sing}(\ell \to \ell ')$ is the singlet transition weight from the current label to the next label.

$\mathrm {W_M}$, $g=0$ (mid-fragment, input consumed):


Dest	WFST weight	Notes

$\mat $	$\alpha _\dom $	match
$\mathrm {D_F}$	$(1-\alpha _\dom )$	frag-level delete

No singlet factor to divide out: at $g=0$ the singlet continues the fragment with weight $\ext ^{(\dom )}_{\srcfrag \destfrag }$ (producing the next character with destination fragchar $\destfrag $ in the same domain), and the WFST weight is purely the TKF92 match/delete split, independent of $\fragdist $.

$\mathrm {W_M}$, $g=1, e=0$ (fragment boundary, mid-domain):

With input (start new ancestral fragment):


Dest	WFST weight	Notes

$\mat $	$\alpha _\dom $	match new fragment
$\mathrm {D_F}$	$(1-\alpha _\dom )$	delete new fragment

The $\fragdist _{\dom \frag '}$ is supplied by the singlet; the WFST contributes only the match/delete split.

Without input (descendant insertion):


Dest	WFST weight	Notes

$(\mathrm {I_F}, \frag ', g', \dom , 0)$	$\beta _\dom \cdot \fragdist _{\dom \frag '} \cdot \classdist _{\frag '\class '} \eqm _{\class '\destok } \cdot p_{g'}(\ext _{\frag '})$	insert fragment

Here the WFST generates the full output character probability since there is no input to condition on.
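The structural part of these corrected weights can be sketched as follows. This is a minimal sketch: the function name and dictionary layout are hypothetical, and the per-character emission factors on the insertion branch are omitted so that only the $\beta _\dom \cdot \fragdist _{\dom \frag '}$ part is shown.

```python
def wm_fragment_boundary(alpha, beta, frag_dist):
    # Input-consuming branch: the WFST contributes only the match/delete split.
    with_input = {"M": alpha, "D_F": 1.0 - alpha}
    # Input-free branch: the WFST itself chooses the inserted fragment type.
    without_input = {("I_F", f): beta * p for f, p in enumerate(frag_dist)}
    return with_input, without_input
```

The input-consuming weights sum to 1, and the insertion weights sum to `beta` over fragment types; the two branches are not expected to sum to 1 together.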

$\mathrm {W_M}$, $g=1, e=1$ (domain boundary):

With input (start new ancestral domain):


Dest	WFST weight	Notes

$\mat $	$\dfrac {\nonemptytrans _{\mat \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \mat } \cdot \fragdist _{\dom '\frag '}} {P_{\rm sing}(\ell \to \ell ')}$	new domain, match

$\mathrm {D_F}$	$\dfrac {\nonemptytrans _{\mat \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \del } \cdot \fragdist _{\dom '\frag '}} {P_{\rm sing}(\ell \to \ell ')}$	new domain, del-frag

$\mathrm {D_D}$	$\dfrac {\nonemptytrans _{\mat \del } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}} {P_{\rm sing}(\ell \to \ell ')}$	delete domain

Without input:


Dest	WFST weight	Notes

$\mathrm {I_D}$	$\nonemptytrans _{\mat \ins } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '} \cdot \classdist _{\frag '\class '} \eqm _{\class '\destok } \cdot p_{g'}(\ext _{\frag '})$	insert domain
$\fin $	$\nonemptytrans _{\mat \fin } / P_{\rm sing}(\ell \to \fin )$	end

where $P_{\rm sing}(\ell \to \ell ') = \frac {\kappa _\main \domdist _{\dom '} \kappa _{\dom '} \fragdist _{\dom '\frag '}}{1-\zeta }$ is the singlet transition from a domain boundary to the next character label $\ell '$ (Equation C.71).

The $\mathrm {W_{D_F}}$ and $\mathrm {W_{D_D}}$ tables follow the same pattern, using $\nonemptytrans _{\mat \cdot }$ for $\mathrm {W_{D_F}}$ (since fragment deletion is within a matched domain) and $\nonemptytrans _{\del \cdot }$ for $\mathrm {W_{D_D}}$ (since the domain is being deleted).

$\mathrm {W_{D_F}}$ at $g=0$:


Dest	WFST weight

$\mathrm {D_F}$	$1$

$\mathrm {W_{D_F}}$ at $g=1, e=0$:


Dest	WFST weight

$\mat $	$\alpha _\dom $
$\mathrm {D_F}$	$(1-\alpha _\dom )$

$\mathrm {W_{D_F}}$ at $g=1, e=1$: Same structure as $\mathrm {W_M}$ at $g=1, e=1$, using $\nonemptytrans _{\mat \cdot }$.

$\mathrm {W_{D_D}}$ at $g=0$:


Dest	WFST weight

$\mathrm {D_D}$	$1$

$\mathrm {W_{D_D}}$ at $g=1, e=0$:


Dest	WFST weight

$\mathrm {D_D}$	$1$

(Within a deleted domain, all fragments are consumed; fragment type is irrelevant.)

$\mathrm {W_{D_D}}$ at $g=1, e=1$:


Dest	WFST weight	Notes

$\mat $	$\nonemptytrans _{\del \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \mat } \cdot \fragdist _{\dom '\frag '} / P_{\rm sing}(\ell \to \ell ')$	new domain, match
$\mathrm {D_F}$	$\nonemptytrans _{\del \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \del } \cdot \fragdist _{\dom '\frag '} / P_{\rm sing}(\ell \to \ell ')$	new domain, del-frag
$\mathrm {D_D}$	$\nonemptytrans _{\del \del } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '} / P_{\rm sing}(\ell \to \ell ')$	delete domain
$\mathrm {I_D}$	$\nonemptytrans _{\del \ins } \cdot (\text {full output prob})$	insert domain
$\fin $	$\nonemptytrans _{\del \fin } / P_{\rm sing}(\ell \to \fin )$	end

$\mathrm {I_D}$ completion at $g=1, e=1$:


Dest	WFST weight	Notes

$\mat $	$\nonemptytrans _{\ins \mat } \cdot \domdist _{\dom '} \cdot \tkftrans ^{(\dom ')}_{\sta \mat } \cdot \fragdist _{\dom '\frag '} / P_{\rm sing}(\ell \to \ell ')$	match next domain
$\mathrm {D_D}$	$\nonemptytrans _{\ins \del } \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '} / P_{\rm sing}(\ell \to \ell ')$	delete next domain
$\mathrm {I_D}$	$\nonemptytrans _{\ins \ins } \cdot (\text {full output prob})$	insert another domain
$\fin $	$\nonemptytrans _{\ins \fin } / P_{\rm sing}(\ell \to \fin )$	end

C.10.6 Simplification of Domain-Boundary WFST Weights

The domain-boundary WFST weights (Section C.10.3.0) involve a ratio of the Pair HMM transition weight to the singlet transition weight. This simplifies considerably.

For a transition that consumes input with label $\ell ' = (\frag ', g', \dom ', e')$ from a domain boundary ($g=1, e=1$), the singlet weight is (from Equation C.71): \[ P_{\rm sing}(\ell \to \ell ') = \frac {\kappa _\main \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}}{1 - \zeta } \]

The Pair HMM weight for $\mat \mat _{\srcdom \srcfrag } \to \mat \mat _{\destdom \destfrag }$ with $\srcdom \neq \destdom $ (different domain, going through domain boundary) is: \[ \notext ^{(\srcdom )}_\srcfrag \cdot \tkftrans ^{(\srcdom )}_{\mat \fin } \cdot \nonemptytrans _{\mat \mat } \cdot \domdist _{\destdom } \cdot \tkftrans ^{(\destdom )}_{\sta \mat } \cdot \fragdist _{\destdom \destfrag } \]

The WFST weight is therefore: \[ \frac {\notext ^{(\srcdom )}_\srcfrag \cdot \tkftrans ^{(\srcdom )}_{\mat \fin } \cdot \nonemptytrans _{\mat \mat } \cdot \domdist _{\destdom } \cdot \tkftrans ^{(\destdom )}_{\sta \mat } \cdot \fragdist _{\destdom \destfrag }} {\frac {\kappa _\main \cdot \domdist _{\destdom } \cdot \kappa _{\destdom } \cdot \fragdist _{\destdom \destfrag }}{1 - \zeta }} = \frac {(1-\zeta )\notext ^{(\srcdom )}_\srcfrag \cdot \tkftrans ^{(\srcdom )}_{\mat \fin } \cdot \nonemptytrans _{\mat \mat } \cdot \tkftrans ^{(\destdom )}_{\sta \mat }} {\kappa _\main \cdot \kappa _{\destdom }} \]

The $\domdist _{\destdom }$ and $\fragdist _{\destdom \destfrag }$ factors cancel, leaving a weight that depends on the source domain parameters and on the destination domain’s TKF parameters, but not on the destination fragment type $\destfrag $ or on the mixture weights. This is a significant simplification: given the source label $\ell $, the WFST transition weight at a domain boundary is the same for all destination labels $\ell '$ that share the destination domain $\destdom $.

Since $\tkftrans ^{(\srcdom )}_{\mat \fin } = (1-\beta _{\srcdom })(1-\kappa _{\srcdom })$ and $\tkftrans ^{(\destdom )}_{\sta \mat } = (1-\beta _{\destdom })\kappa _{\destdom }\alpha _{\destdom }$, the weight becomes: \[ \frac {(1-\zeta )\notext ^{(\srcdom )}_\srcfrag (1-\beta _{\srcdom })(1-\kappa _{\srcdom }) \cdot \nonemptytrans _{\mat \mat } \cdot (1-\beta _{\destdom })\alpha _{\destdom }} {\kappa _\main } \]

Note that this weight still depends on $\destdom $ through $(1-\beta _{\destdom })\alpha _{\destdom }$, so the cancellation is partial: $\domdist $ and $\fragdist $ cancel but the TKF92 entry parameters do not.
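The cancellation can be checked numerically. The sketch below evaluates the unsimplified ratio and the closed form side by side; the function names, argument names, and all parameter values are hypothetical, chosen only to exercise the identity, and `t_mm` stands in for $\nonemptytrans _{\mat \mat }$.

```python
def boundary_weight_ratio(zeta, kappa_main, not_ext_src, beta_src, kappa_src,
                          t_mm, alpha_dst, beta_dst, kappa_dst,
                          dom_dist_dst, frag_dist_dst):
    """Pair HMM weight divided by singlet weight, before any cancellation."""
    tau_m_end = (1.0 - beta_src) * (1.0 - kappa_src)          # tau^(src)_{M,E}
    tau_start_m = (1.0 - beta_dst) * kappa_dst * alpha_dst    # tau^(dst)_{S,M}
    pair = (not_ext_src * tau_m_end * t_mm
            * dom_dist_dst * tau_start_m * frag_dist_dst)
    singlet = (kappa_main * dom_dist_dst * kappa_dst * frag_dist_dst
               / (1.0 - zeta))
    return pair / singlet


def boundary_weight_simplified(zeta, kappa_main, not_ext_src, beta_src,
                               kappa_src, t_mm, alpha_dst, beta_dst):
    """Closed form after the mixture weights and kappa_dst have cancelled."""
    return ((1.0 - zeta) * not_ext_src * (1.0 - beta_src) * (1.0 - kappa_src)
            * t_mm * (1.0 - beta_dst) * alpha_dst / kappa_main)


# Hypothetical parameter values (not taken from any fitted model).
common = dict(zeta=0.2, kappa_main=0.6, not_ext_src=0.3, beta_src=0.1,
              kappa_src=0.5, t_mm=0.4, alpha_dst=0.8, beta_dst=0.15)
full = boundary_weight_ratio(kappa_dst=0.7, dom_dist_dst=0.25,
                             frag_dist_dst=0.5, **common)
other = boundary_weight_ratio(kappa_dst=0.7, dom_dist_dst=0.9,
                              frag_dist_dst=0.05, **common)
short = boundary_weight_simplified(**common)
print(full, other, short)
```

Changing the destination mixture weights leaves the ratio unchanged, while changing `alpha_dst` or `beta_dst` would not, which is the partial cancellation described above.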

C.10.7 State Count and Sparsity

The Labeled-MixDom WFST has at most $8L + 2$ states where $L = |\nfrag | \cdot |\ndom | \cdot 4$. In practice, many combinations are constrained away:

$\mathrm {I_F}$ states only occur with $e \neq 1$ (inserted fragments cannot be the last in a domain since they are inserted within a domain).
Mid-fragment ($g=0$) wait states have at most 2 outgoing transitions each (continue in same fragment, match or delete).
Fragment-boundary ($g=1, e=0$) wait states have at most 3 outgoing transition types.
Domain-boundary ($g=1, e=1$) wait states have at most 5 outgoing transition types (one per $\nonemptytrans $ column).

The key advantage over distillation is exactness: the Labeled-MixDom WFST preserves all correlations of the MixDom model without approximation, at the cost of a larger effective alphabet.

C.11 Formal Grammar Elaboration Rules

The TKF family of evolutionary models—TKF91, TKF92, MixDom, the TKF Structure Tree, and the TKF Genome—describes the joint evolution of biological sequences subject to insertions, deletions, and substitutions. Despite their apparent diversity, all these models share a common constructive pattern: they begin with a simple weighted context-free grammar (WCFG) for a geometrically distributed number of links, and then systematically elaborate that grammar through a series of formal transformations.

In the TKF91 model (50), each link carries a single character evolving by a continuous-time Markov chain (CTMC). TKF92 (51) extends this by replacing each character with a geometrically distributed fragment of characters. The MixDom model nests a TKF92 process inside a TKF91 process, decorating links with mixtures of domain types, fragment types, and substitution classes. The TKF Structure Tree (22) uses stochastic context-free grammar (SCFG) recursion to model RNA secondary structure with stems (emitting paired characters left and right) and loops. The TKF Genome extends these ideas to entire genomes with coding sequences, introns, RNA structures, and conserved elements.

Making the elaboration steps explicit and composable has several benefits:

(i): Correctness: each transformation can be verified independently, rather than checking a large monolithic grammar.
(ii): Modularity: new models (e.g., RNA models with basepair stacking, codon models with reading-frame-aware indels) can be constructed by composing well-understood building blocks.
(iii): Automation: the transformations are sufficiently formal to be implemented as software operations on grammar objects, enabling automatic derivation of dynamic programming algorithms from high-level model specifications.

This appendix defines seven elaboration rules and the associated null-state management procedures, then shows how each known TKF-family model arises as a specific sequence of elaborations.

C.11.1 Base Grammar

Weighted Context-Free Grammars

Definition C.1 (Weighted Context-Free Grammar). A weighted context-free grammar (WCFG) is a tuple $\WCFG = (\NT , \TM , \PR , S, \WF )$ where:

$\NT $ is a finite set of nonterminal symbols.
$\TM $ is a finite set of terminal symbols, disjoint from $\NT $.
$S \in \NT $ is the start symbol.
$\PR $ is a finite set of production rules of the form $X \to \alpha $ where $X \in \NT $ and $\alpha \in (\NT \cup \TM )^*$.
$\WF : \PR \to \mathbb {R}_{\geq 0}$ assigns a nonnegative weight to each production.

The grammar is proper if for every nonterminal $X$, the weights of all productions with left-hand side $X$ sum to 1: $\sum _{(X \to \alpha ) \in \PR } \WF (X \to \alpha ) = 1$. In a proper WCFG, weights are probabilities and the grammar defines a stochastic context-free grammar (SCFG).

Definition C.2 (Elaboration Rule). An elaboration rule (or grammar transformation) is a map $\mathcal {E} : \WCFG \to \WCFG '$ that takes a WCFG and a set of elaboration parameters, and produces a new WCFG. An elaboration is validity-preserving if it maps proper grammars to proper grammars.

We distinguish between the single-sequence grammar (describing the stationary distribution over sequences) and the pair grammar (describing the joint distribution over ancestor–descendant sequence pairs). Most elaborations operate on the single-sequence grammar; the Evolution elaboration (Section C.11.2.0) converts a single-sequence grammar into a pair grammar.

The Link Grammar

The fundamental building block of all TKF-family models is a grammar generating a geometrically distributed number of “links.” This grammar arises from the stationary distribution of the linear birth-death-immigration (BDI) process with per-capita birth rate $\insrate $, per-capita death rate $\delrate > \insrate $, and immigration rate $\immrate = \insrate $.

Definition C.3 (Link Grammar). The link grammar $\WCFG _{\mathrm {link}}(\kappa )$ with parameter $\kappa = \insrate /\delrate \in [0,1)$ is the WCFG with nonterminals $\{\texttt {IMM}, \texttt {MOR}\}$, start symbol $\texttt {IMM}$, and productions: \begin {align} \texttt {IMM} &\to \texttt {MOR}\;\texttt {IMM} && \text {weight } \kappa \label {eq:imm-extend} \\ \texttt {IMM} &\to \epsilon && \text {weight } 1-\kappa \label {eq:imm-end} \\ \texttt {MOR} &\to \texttt {MOR}\;\texttt {MOR} && \text {weight } \kappa \label {eq:mor-extend} \\ \texttt {MOR} &\to \epsilon && \text {weight } 1-\kappa \label {eq:mor-end} \end {align}

Remark C.19. In this grammar, $\texttt {IMM}$ (the “immortal link”) generates a sequence of $n \sim \geomdist (\kappa )$ mortal links. Each $\texttt {MOR}$ can recursively generate further mortal links; the recursive production $\texttt {MOR} \to \texttt {MOR}\;\texttt {MOR}$ reflects the offspring-generating property of the BDI process. The distinction between $\texttt {IMM}$ and $\texttt {MOR}$ captures the different roles: the immortal link corresponds to the BDI regime $\immrate = \insrate , X(0) = 0$ (immigration from nothing), while mortal links correspond to $\immrate = 0, X(0) = 1$ (a single founder that can die).

Remark C.20. The grammar in Definition C.3 generates only the empty string $\epsilon $, since $\texttt {MOR}$ has no terminal-producing rules. The elaboration rules below will add terminal emissions (characters, character pairs, etc.) to the mortal links.

Proposition C.3. The link grammar $\WCFG _{\mathrm {link}}(\kappa )$ is proper for any $\kappa \in [0,1)$. Under the start symbol $\texttt {IMM}$, the number of $\texttt {MOR}$ nonterminals that $\texttt {IMM}$ emits before choosing $\epsilon $ is distributed as $\geomdist (\kappa )$ (with support $\{0,1,2,\ldots \}$ and mean $\kappa /(1-\kappa )$).

Proof. For $\texttt {IMM}$: the productions $\texttt {IMM} \to \texttt {MOR}\;\texttt {IMM}$ and $\texttt {IMM} \to \epsilon $ have weights $\kappa $ and $1-\kappa $, summing to 1. Similarly for $\texttt {MOR}$. The number of $\texttt {MOR}$ nonterminals generated by $\texttt {IMM}$ before choosing $\epsilon $ is geometric with parameter $\kappa $: each step independently continues with probability $\kappa $, so $P(n) = \kappa ^n (1-\kappa )$. □
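As an illustration, the link grammar and the two claims of Proposition C.3 can be written down in a few lines. This is a minimal sketch rather than a reference implementation: the dictionary encoding of productions (right-hand sides as tuples, with `()` for $\epsilon $) and the function names are assumptions made for this example.

```python
import math


def link_grammar(kappa):
    """Productions of the link grammar as {lhs: [(rhs, weight), ...]}."""
    return {
        "IMM": [(("MOR", "IMM"), kappa), ((), 1.0 - kappa)],
        "MOR": [(("MOR", "MOR"), kappa), ((), 1.0 - kappa)],
    }


def is_proper(productions, tol=1e-12):
    """True if, for every nonterminal, weights are nonnegative and sum to 1."""
    return all(
        all(w >= 0 for _, w in rules)
        and math.isclose(sum(w for _, w in rules), 1.0, abs_tol=tol)
        for rules in productions.values()
    )


def imm_link_count_pmf(kappa, n):
    """P(IMM emits exactly n MOR symbols before choosing epsilon)."""
    return (kappa ** n) * (1.0 - kappa)


kappa = 0.3  # hypothetical value of lambda / mu
grammar = link_grammar(kappa)
# Truncated sum for the mean; the tail beyond n = 2000 is negligible here.
mean = sum(n * imm_link_count_pmf(kappa, n) for n in range(2000))
print(is_proper(grammar), mean, kappa / (1.0 - kappa))
```

The truncated mean agrees with $\kappa /(1-\kappa )$ to floating-point precision, and `is_proper` rejects any grammar whose weights for some nonterminal do not sum to 1.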

C.11.2 Elaboration Rules

We now define each elaboration rule as a formal grammar transformation. For each rule, we specify:

The input grammar fragment (which productions are targeted).
The output grammar fragment (the replacement productions).
The parameters introduced by the elaboration.
The validity conditions under which the transformation preserves properness.

CTMC Expansion

The most basic elaboration decorates each link with a character (or tuple of characters) that evolves according to a finite-state CTMC. This is the step that takes the bare link grammar to a model with observable sequences.

Definition C.4 (Emission Type). An emission type specifies where terminal symbols appear relative to the recursive expansion of a nonterminal:

Left-emission: terminal appears to the left of the recursive part. Production form: $X \to c\;\alpha $ where $c \in \TM $.
Right-emission: terminal appears to the right. Production form: $X \to \alpha \;c$.
LR-emission: terminals appear on both sides. Production form: $X \to c_L\;\alpha \;c_R$ where $c_L, c_R \in \TM $. This is the form needed for RNA basepair models (stems).

Definition C.5 (CTMC Expansion). Let $\WCFG $ be a WCFG containing a nonterminal $X$ with an $\epsilon $-generating production used for link termination. Let $\alphabet $ be a finite alphabet, $\eqm $ a probability distribution over $\alphabet $, and $\exch $ a rate matrix with stationary distribution $\eqm $. The left CTMC expansion of $X$ with parameters $(\alphabet , \eqm , \exch )$ replaces every production $X \to \alpha $ (where $\alpha \neq \epsilon $) with the family of productions: \begin {equation} X \to c\;\alpha \qquad \text {weight } \WF (X \to \alpha ) \cdot \eqm _c \quad \text {for each } c \in \alphabet \end {equation} The $\epsilon $-production $X \to \epsilon $ is left unchanged.

The right CTMC expansion replaces $X \to \alpha $ ($\alpha \neq \epsilon $) with: \begin {equation} X \to \alpha \;c \qquad \text {weight } \WF (X \to \alpha ) \cdot \eqm _c \quad \text {for each } c \in \alphabet \end {equation}

The LR CTMC expansion with alphabet $\alphabet \times \alphabet $ and equilibrium distribution $\eqm (c_L, c_R)$ replaces $X \to \alpha $ ($\alpha \neq \epsilon $) with: \begin {equation} X \to c_L\;\alpha \;c_R \qquad \text {weight } \WF (X \to \alpha ) \cdot \eqm (c_L, c_R) \quad \text {for each } (c_L, c_R) \in \alphabet \times \alphabet \end {equation}

Proposition C.4. Left, right, and LR CTMC expansion are validity-preserving: if $\WCFG $ is proper, then so is the elaborated grammar.

Proof. For left CTMC expansion: the total weight of productions with LHS $X$ becomes $\sum _{c \in \alphabet } \sum _{\alpha \neq \epsilon } \WF (X \to \alpha ) \cdot \eqm _c + \WF (X \to \epsilon ) = \sum _{\alpha \neq \epsilon } \WF (X \to \alpha ) \cdot 1 + \WF (X \to \epsilon ) = \sum _\alpha \WF (X \to \alpha ) = 1$. The right and LR cases are analogous. □
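The left expansion of Definition C.5 is mechanical enough to sketch as a function on grammar objects. The encoding below (productions as a dictionary of `(rhs_tuple, weight)` lists, `()` for $\epsilon $) and the equilibrium values are assumptions for illustration only; the rate matrix $\exch $ plays no role in the single-sequence grammar and is omitted.

```python
def left_ctmc_expand(productions, target, eqm):
    """Left CTMC expansion of `target`, following Definition C.5.

    productions: {lhs: [(rhs_tuple, weight), ...]}; () is the epsilon RHS.
    eqm: {character: equilibrium probability}.
    """
    expanded = {}
    for lhs, rules in productions.items():
        if lhs != target:
            expanded[lhs] = list(rules)
            continue
        new_rules = []
        for rhs, weight in rules:
            if rhs == ():
                new_rules.append((rhs, weight))      # epsilon left unchanged
            else:
                for c, p in eqm.items():
                    new_rules.append(((c,) + rhs, weight * p))
        expanded[lhs] = new_rules
    return expanded


def total_weight(productions, lhs):
    """Sum of the weights of all productions with the given left-hand side."""
    return sum(weight for _, weight in productions[lhs])


kappa = 0.3
link = {
    "IMM": [(("MOR", "IMM"), kappa), ((), 1.0 - kappa)],
    "MOR": [(("MOR", "MOR"), kappa), ((), 1.0 - kappa)],
}
eqm = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}  # hypothetical equilibrium
expanded = left_ctmc_expand(link, "MOR", eqm)
print(len(expanded["MOR"]), total_weight(expanded, "MOR"))
```

The expanded $\texttt {MOR}$ has one production per character plus the untouched $\epsilon $-production, and its weights still sum to 1, as Proposition C.4 requires.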

Example C.1 (TKF91). Applying left CTMC expansion to $\texttt {MOR}$ in the link grammar $\WCFG _{\mathrm {link}}(\kappa )$ with alphabet $\alphabet $ and equilibrium distribution $\eqm $ yields: \begin {align*} \texttt {IMM} &\to \texttt {MOR}\;\texttt {IMM} && \kappa \\ \texttt {IMM} &\to \epsilon && 1-\kappa \\ \texttt {MOR} &\to c\;\texttt {MOR}\;\texttt {MOR} && \kappa \cdot \eqm _c \quad (c \in \alphabet )\\ \texttt {MOR} &\to c && (1-\kappa ) \cdot \eqm _c \quad (c \in \alphabet ) \end {align*}

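This is the stationary (single-sequence) grammar for TKF91: it generates a geometric number of links, each carrying a character drawn i.i.d. from $\eqm $. Note that the emission is attached here to both $\texttt {MOR}$ productions, so the terminating production reads $\texttt {MOR} \to c$ rather than the unchanged $\texttt {MOR} \to \epsilon $ of Definition C.5; this is what gives every mortal link exactly one character.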

Example C.2 (TKF Structure Tree stems). In the TKF Structure Tree, stems use LR CTMC expansion. The nonterminal $S$ (“stem”) generates basepairs: \begin {align*} S &\to c_L\;S\;c_R && (1-\kappa _S) \cdot \eqm _S(c_L, c_R) \\ S &\to L && \kappa _S \end {align*}

where $L$ is the loop nonterminal (which uses left CTMC expansion), and $\eqm _S(c_L, c_R)$ is the joint equilibrium distribution over basepairs. The LR emission enables the grammar to generate palindromic structures characteristic of RNA secondary structure.

Fragment Expansion

Fragment expansion takes TKF91 to TKF92 by replacing each single-character link with a geometrically distributed sequence of characters.

Definition C.6 (Fragment Expansion). Let $\WCFG $ be a WCFG with a nonterminal $\texttt {MOR}$ representing a mortal link. The fragment expansion with parameter $\ext \in [0,1)$ (the extension probability) replaces $\texttt {MOR}$ with three nonterminals $\texttt {MOR\_S}$ (fragment start), $\texttt {MOR\_X}$ (fragment extend), and $\texttt {MOR\_E}$ (fragment end), defined as follows.

Every production in $\WCFG $ that references $\texttt {MOR}$ on its right-hand side is updated to reference $\texttt {MOR\_S}$ instead. Then:

Before (link with single terminal slot): \begin {align*} \texttt {MOR} &\to [\text {terminal}]\;\alpha && \WF _{\text {orig}} \end {align*}

After (link with geometric fragment): \begin {align*} \texttt {MOR\_S} &\to \texttt {MOR\_X} && 1 \\ \texttt {MOR\_X} &\to [\text {terminal}]\;\texttt {MOR\_X} && \ext \\ \texttt {MOR\_X} &\to [\text {terminal}]\;\texttt {MOR\_E} && 1 - \ext \\ \texttt {MOR\_E} &\to \alpha && \WF _{\text {orig}} \end {align*}

Here $[\text {terminal}]$ denotes whatever terminal emission was associated with $\texttt {MOR}$ (a single character from a CTMC expansion, or a character pair, etc.), and $\alpha $ denotes the rest of the original production’s right-hand side (the recursive continuation).

More precisely, if $\texttt {MOR}$ had productions $\texttt {MOR} \to c\;\texttt {MOR}\;\texttt {MOR}$ (weight $\kappa \cdot \eqm _c$) and $\texttt {MOR} \to c$ (weight $(1-\kappa ) \cdot \eqm _c$) after CTMC expansion, the fragment expansion produces: \begin {align*} \texttt {MOR\_S} &\to \texttt {MOR\_X} && 1 \\ \texttt {MOR\_X} &\to c\;\texttt {MOR\_X} && \ext \cdot \eqm _c \quad (c \in \alphabet ) \\ \texttt {MOR\_X} &\to c\;\texttt {MOR\_E} && (1-\ext ) \cdot \eqm _c \quad (c \in \alphabet ) \\ \texttt {MOR\_E} &\to \texttt {MOR\_S}\;\texttt {MOR\_S} && \kappa \\ \texttt {MOR\_E} &\to \epsilon && 1-\kappa \end {align*}

where $\texttt {MOR\_E}$ takes over the inter-link continuation logic from the original $\texttt {MOR}$.

Proposition C.5. Fragment expansion is validity-preserving. The expected number of terminals per fragment is $1/(1-\ext )$.
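Proposition C.5 is stated without proof; a short verification, using only the productions displayed above and writing $K$ for the number of terminals emitted by $\texttt {MOR\_X}$ before it hands over to $\texttt {MOR\_E}$ (a symbol introduced here for convenience), might run as follows. Properness is checked one nonterminal at a time, and the fragment length follows from the geometric self-loop: \begin {align*} \texttt {MOR\_S}: &\quad 1, \\ \texttt {MOR\_X}: &\quad \sum _{c \in \alphabet } \ext \cdot \eqm _c + \sum _{c \in \alphabet } (1-\ext ) \cdot \eqm _c = \ext + (1-\ext ) = 1, \\ \texttt {MOR\_E}: &\quad \kappa + (1-\kappa ) = 1, \\[6pt] P(K = k) &= \ext ^{k-1}(1-\ext ) \quad (k \geq 1), \\ \mathbb {E}[K] &= (1-\ext ) \sum _{k=1}^{\infty } k\,\ext ^{k-1} = \frac {1-\ext }{(1-\ext )^2} = \frac {1}{1-\ext }. \end {align*}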

Remark C.21. Fragment expansion must be applied after CTMC expansion (or simultaneously), because it assumes the link already has terminal emissions. If applied to a bare link grammar (no terminals), the result would have fragments of $\epsilon $’s, which is degenerate.

Example C.3 (TKF92). TKF92 = link grammar $\to $ (CTMC expansion with $\alphabet , \eqm , \exch $) $\to $ (fragment expansion with parameter $\ext $), the order required by Remark C.21. If the terminal slot is treated as a placeholder to be filled in later, the two elaborations can formally be applied in either order and yield the same grammar, because fragment expansion simply wraps the terminal emission in a geometric self-loop. The resulting grammar generates sequences where each link produces a fragment of $K \geq 1$ characters drawn from $\eqm $, with $P(K = k) = \ext ^{k-1}(1-\ext )$.

Mixture Expansion

Mixture expansion decorates a link with a latent categorical variable whose value determines subsequent model parameters.

Definition C.7 (Mixture Expansion). Let $\WCFG $ be a WCFG containing a nonterminal $X$. Let $\{1, \ldots , K\}$ be a finite set of mixture components with weights $p_1, \ldots , p_K > 0$ satisfying $\sum _k p_k = 1$. The mixture expansion of $X$ with $K$ components and weights $(p_k)$ replaces $X$ with $K$ new nonterminals $X_1, \ldots , X_K$ and modifies all productions referencing $X$ as follows.

Every production $Y \to \alpha \;X\;\beta $ in $\WCFG $ (where $Y \neq X$ and $\alpha , \beta \in (\NT \cup \TM )^*$) is replaced by $K$ productions: \begin {equation} Y \to \alpha \;X_k\;\beta \qquad \text {weight } \WF (Y \to \alpha X \beta ) \cdot p_k \quad \text {for } k = 1, \ldots , K \end {equation}

The productions of $X$ itself are copied to each $X_k$: for each production $X \to \gamma $, create \begin {equation} X_k \to \gamma _k \qquad \text {weight } \WF (X \to \gamma ) \end {equation} where $\gamma _k$ is $\gamma $ with any self-references to $X$ replaced by $X_k$ (maintaining the component assignment within a link).

Each $X_k$ may then undergo different subsequent elaborations (e.g., different CTMC parameters, different fragment extension rates).

Proposition C.6. Mixture expansion is validity-preserving.

Proof. For $Y$: the total weight of productions referencing $X_k$ is $\sum _k \WF (\cdot ) \cdot p_k = \WF (\cdot ) \cdot 1$. For each $X_k$: the production weights are copied from $X$, hence sum to 1. □
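A possible rendering of Definition C.7 as an operation on grammar objects is sketched below. The dictionary encoding, the `X_k` naming scheme, and the restriction to productions that reference the target exactly once are all assumptions of this sketch, not part of the definition.

```python
def mixture_expand(productions, target, weights):
    """Split nonterminal `target` into components target_1 .. target_K.

    productions: {lhs: [(rhs_tuple, weight), ...]}; () is the epsilon RHS.
    weights: mixture weights p_1 .. p_K, assumed positive and summing to 1.
    """
    names = [f"{target}_{k}" for k in range(1, len(weights) + 1)]
    out = {}
    for lhs, rules in productions.items():
        if lhs == target:
            continue
        new_rules = []
        for rhs, w in rules:
            if target not in rhs:
                new_rules.append((rhs, w))
                continue
            # Sketch limitation: exactly one reference to `target` per rule.
            assert rhs.count(target) == 1
            i = rhs.index(target)
            for name, p in zip(names, weights):
                new_rules.append((rhs[:i] + (name,) + rhs[i + 1:], w * p))
        out[lhs] = new_rules
    # Copy the target's own productions to each component, keeping
    # self-references inside the same component.
    for name in names:
        out[name] = [
            (tuple(name if s == target else s for s in rhs), w)
            for rhs, w in productions[target]
        ]
    return out


kappa = 0.4
link = {
    "IMM": [(("MOR", "IMM"), kappa), ((), 1.0 - kappa)],
    "MOR": [(("MOR", "MOR"), kappa), ((), 1.0 - kappa)],
}
mixed = mixture_expand(link, "MOR", [0.25, 0.75])  # hypothetical weights
for lhs, rules in mixed.items():
    print(lhs, sum(w for _, w in rules))
```

Applied to the link grammar, this yields $\texttt {IMM} \to \texttt {MOR}_k\;\texttt {IMM}$ with weight $\kappa p_k$ and an independent copy of the mortal-link productions for each component, each of which can then be elaborated separately.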

Remark C.22. In principle, the mixture component selector could itself evolve via a CTMC across evolutionary time, provided that nested expansions are resampled from equilibrium upon a change of component. In this appendix, we restrict attention to the simpler case where the component assignment is fixed throughout the lifetime of a link (i.e., it is sampled once at birth and does not change).

Example C.4 (MixDom: three levels of mixture). In the MixDom model, mixture expansion is applied at three levels:

1.: Domain mixture: each top-level link is assigned a domain type $\dom \sim \catdist (\domdist _1,\ldots ,\domdist _\ndom )$, determining $(\insrate _\dom , \delrate _\dom )$ for the nested TKF92 process.
2.: Fragment process: within domain $\dom $, the initial fragment type is $\frag \sim \catdist (\fragdist _{\dom 1},\ldots ,\fragdist _{\dom \nfrag })$. Subsequent fragment types are drawn from the $\nfrag \times \nfrag $ transition matrix $\ext ^{(\dom )}_{\srcfrag \destfrag }$ (the $\nfrag = 1$ case reduces to IID geometric extension).
3.: Site class mixture: within fragment state $\frag $ of domain $\dom $, each character position is assigned a class $\class \sim \catdist (\classdist _{\dom \frag 1},\ldots ,\classdist _{\dom \frag \nclasses })$, determining the substitution parameters $(\exch ^{(\class )}, \eqm ^{(\class )})$.

Link Sequence Concatenation

Concatenation decorates a single link with two or more consecutive sub-links.

Definition C.8 (Link Sequence Concatenation). Let $\WCFG $ be a WCFG with nonterminal $X$ representing a link. The binary concatenation of $X$ replaces it with two new nonterminals $X_A, X_B$ and the production: \begin {equation} X \to X_A\;X_B \qquad \text {weight } 1 \end {equation} $X_A$ and $X_B$ may then be independently elaborated. For $n$-ary concatenation, $X$ is replaced by $X \to X_1\;X_2\;\cdots \;X_n$.

Remark C.23 (Concatenation combined with mixture and fragments). A powerful pattern combines mixture, fragment, and concatenation to decorate a link with a variable-length sequence of categorically typed sub-links:

1.: Apply fragment expansion to the link, creating a geometric number of sub-link slots.
2.: Apply mixture expansion to each sub-link slot, assigning it a categorical type.
3.: If correlations between adjacent types are desired, replace the i.i.d. mixture with an HMM output distribution: the type sequence is generated by a hidden Markov model whose transition matrix captures adjacency preferences.

This yields a variable number of concatenated sub-links of varying types with Markovian correlations. In the TKF Genome, this pattern is used for genomic regions (coding, noncoding, structural) within a top-level link sequence.

Example C.5 (TKF Genome: region concatenation). The TKF Genome’s top-level grammar has: \begin {align*} \texttt {GENOME} &\to \texttt {REGION}\;\texttt {GENOME} && 1-\kappa _R \\ \texttt {GENOME} &\to \epsilon && \kappa _R \\ \texttt {REGION} &\to \texttt {INTER} && p_N \\ &\;\;|\;\; \texttt {FWDCDS} && p_G/2 \\ &\;\;|\;\; \texttt {REVCDS} && p_G/2 \\ &\;\;|\;\; \texttt {STRUCT} && p_S \\ &\;\;|\;\; \texttt {CONS} && p_C \end {align*}

Here $\texttt {REGION}$ is both a mixture expansion (over region types) and a concatenation point (each region type expands into its own sub-grammar).

Non-Recursive Nesting

Non-recursive nesting splices a complete sub-grammar into the transitions of a mortal link, without introducing bifurcation or self-reference. This is how nesting works in MixDom.

Definition C.9 (Non-Recursive Nesting). Let $\WCFG _{\mathrm {outer}}$ be a link grammar with nonterminal $\texttt {MOR}$ (mortal link), and let $\WCFG _{\mathrm {inner}}$ be an independent link grammar with its own start symbol $\texttt {IMM}_{\mathrm {inner}}$ and parameters. The non-recursive nesting of $\WCFG _{\mathrm {inner}}$ into $\texttt {MOR}$ of $\WCFG _{\mathrm {outer}}$ replaces each terminal-emitting production of $\texttt {MOR}$ with: \begin {equation} \texttt {MOR} \to \texttt {IMM}_{\mathrm {inner}}\;\alpha \qquad \text {weight } \WF _{\mathrm {orig}} \end {equation} where $\alpha $ is the original continuation. The inner grammar generates a complete sequence of inner links, each with its own parameters, for every outer mortal link.

More precisely, wherever $\texttt {MOR}$ previously emitted a terminal symbol $c$, it now expands into the entire inner grammar $\WCFG _{\mathrm {inner}}$, which itself may generate zero or more characters. The inner grammar’s $\epsilon $-productions (zero-length inner sequences) give rise to null states in the combined grammar.

Remark C.24. The key difference from recursive nesting (Section C.11.2.0) is that $\WCFG _{\mathrm {inner}}$ does not reference any nonterminals of $\WCFG _{\mathrm {outer}}$. There is no possibility of re-entering the outer grammar from within the inner grammar. This ensures that the combined grammar generates strings from a regular language (at each level), rather than a context-free language.

Example C.6 (MixDom as non-recursive nesting). The MixDom model nests a Markovian fragment process (inner grammar) into each mortal link of a TKF91 process (outer grammar). The outer link grammar has parameters $(\insrate _0, \delrate _0)$ and the inner grammar, for domain type $\dom $, has parameters $(\insrate _\dom , \delrate _\dom , \ext ^{(\dom )}_{\srcfrag \destfrag }, \classdist _{\dom \frag \class }, \exch ^{(\class )}, \eqm ^{(\class )})$. Each outer mortal link, instead of emitting a single character, expands into a domain sequence governed by the Markovian fragment process.

Since the inner grammar can generate the empty string (the inner link sequence may have length zero), this nesting creates null states in the combined Pair HMM. These must be eliminated by the procedures of Section C.11.3.

Recursive Nesting

Recursive nesting, used in the TKF Structure Tree, allows transitions into a mortal link to spawn a bifurcation: a new nonterminal whose sub-grammar may reference the original grammar’s nonterminals.

Definition C.10 (Recursive Nesting). Let $\WCFG $ be a link grammar with nonterminals including $\texttt {MOR}$ (mortal link). Let $\texttt {BIF}$ be a new nonterminal with its own sub-grammar $\WCFG _{\mathrm {bif}}$ that may reference nonterminals of $\WCFG $ (including $\texttt {IMM}$ and $\texttt {MOR}$).

The right recursive nesting of $\texttt {BIF}$ at $\texttt {MOR}$ replaces each terminal-emitting production of $\texttt {MOR}$ with a mixture of terminal emission and bifurcation:

Before: \begin {align*} \texttt {MOR} &\to c\;\alpha && \WF _{\mathrm {orig}} \cdot \eqm _c \end {align*}

After: \begin {align*} \texttt {MOR} &\to c\;\alpha && \WF _{\mathrm {orig}} \cdot s \cdot \eqm _c && \text {(terminal link, probability } s\text {)} \\ \texttt {MOR} &\to \texttt {BIF}\;\alpha && \WF _{\mathrm {orig}} \cdot (1-s) \cdot \domdist _{\mathrm {bif}} && \text {(bifurcation, probability } 1-s\text {)} \end {align*}

where $s \in (0,1]$ is the probability that a link is a terminal (character-emitting) link rather than a nesting point, and $\domdist _{\mathrm {bif}}$ is a distribution over bifurcation types if there are multiple $\texttt {BIF}$ variants.

For left recursive nesting: \begin {align*} \texttt {MOR} &\to \alpha \;\texttt {BIF} && \WF _{\mathrm {orig}} \cdot (1-s) \cdot \domdist _{\mathrm {bif}} \end {align*}

The sub-grammar $\WCFG _{\mathrm {bif}}$ for $\texttt {BIF}$ defines how the bifurcation expands. Since it may reference the start symbol of $\WCFG $ (e.g., $\texttt {IMM}$), the combined grammar is genuinely recursive: link sequences can contain nested link sequences of arbitrary depth.

Remark C.25. The recursive nesting creates null cycles whenever the nested sub-grammar can generate the empty string. These must be handled by the null-state management procedures (Section C.11.3). In the TKF Structure Tree, the nullability fixed-point iteration solves for the probability that each nonterminal generates $\epsilon $.

Example C.7 (TKF Structure Tree: stems and loops). The TKF Structure Tree has two types of link sequences:

Loop sequences ($L$): left-emitting links generating single nucleotides. Nonterminal rule: $L \to c_L\;L$ with CTMC expansion (left emission).
Stem sequences ($S$): LR-emitting links generating basepairs. Nonterminal rule: $S \to c_L\;S\;c_R$ with CTMC expansion (LR emission).

The recursive nesting works as follows. Within a loop sequence, a mortal link may either emit a character (probability $s_L$) or spawn a stem (probability $1-s_L$): \[ L \to c\;L \quad |\quad S\;L \] At the base of a stem (when the self-loop terminates), the grammar transitions to a loop: \[ S \to c_L\;S\;c_R \quad |\quad L \] This creates the alternating stem–loop structure characteristic of RNA secondary structure. The recursion arises because a loop can spawn a stem, which eventually returns to a loop, which can spawn another stem, ad infinitum.

Example C.8 (TKF92 with Recursive Domains). The recursive domain model has nonterminals $\texttt {L}_\dom $ (link sequence for domain $\dom $), $\texttt {A}_\dom $ (aligned component), $\texttt {I}_\dom $ (inserted component), $\texttt {D}_\dom $ (deleted component), etc. The aligned component rule is: \begin {align*} \texttt {A}_\dom &\to c && s_\dom \cdot \eqm _{\dom ,c} \quad \text {(terminal: character)} \\ \texttt {A}_\dom &\to \texttt {L}_{\dom '} && (1-s_\dom ) \cdot \domdist _{\dom \dom '} \quad \text {(bifurcation: nested link sequence)} \end {align*}

Since $\texttt {L}_{\dom '}$ can reference $\texttt {L}_\dom $ (if $\dom ' = \dom $ or through a chain of domain transitions), this creates genuine recursion: link sequences contain domains that contain further link sequences.

Evolution

The Evolution elaboration converts a single-sequence grammar (describing the stationary distribution) into a pair grammar (describing the joint ancestor–descendant distribution at evolutionary time $\evoltime $).

Definition C.11 (Evolution Elaboration). Let $\WCFG _1$ be a single-sequence grammar with link nonterminal $X$ generating terminals from alphabet $\alphabet $ with equilibrium distribution $\eqm $. Let $\exch $ be the CTMC rate matrix and $\evoltime > 0$ the evolutionary time.

Define the TKF parameters: \begin {align*} \alpha &= e^{-\delrate \evoltime }, \quad \beta = \frac {\insrate (e^{-\insrate \evoltime } - e^{-\delrate \evoltime })} {\delrate e^{-\insrate \evoltime } - \insrate e^{-\delrate \evoltime }}, \quad \gamma = 1 - \frac {\delrate \beta }{\insrate (1-\alpha )}, \quad \kappa = \frac {\insrate }{\delrate } \end {align*}

The evolution elaboration replaces each link nonterminal $X$ in $\WCFG _1$ with three nonterminals $X_\mat $, $X_\ins $, $X_\del $ in the pair grammar $\WCFG _2$:

Before (single-sequence grammar, left-emitting): \begin {align*} X &\to c\;X\;X && \kappa \cdot \eqm _c && \text {(link with offspring)} \\ X &\to c && (1-\kappa ) \cdot \eqm _c && \text {(terminal link)} \end {align*}

After (pair grammar):

For the immortal link continuation, each nonterminal $Y$ that previously generated the link sequence $Y \to X\;Y\;|\;\epsilon $ transforms as follows: \begin {align*} Y_\mat &\to X_\mat \;Y_\mat && (1-\beta _Y)\kappa \alpha \\ Y_\mat &\to X_\ins \;Y_\mat && \beta _Y \\ Y_\mat &\to X_\del \;Y_\del && (1-\beta _Y)\kappa (1-\alpha ) \\ Y_\mat &\to \epsilon && (1-\beta _Y)(1-\kappa ) \\[6pt] Y_\del &\to X_\mat \;Y_\mat && (1-\gamma _Y)\kappa \alpha \\ Y_\del &\to X_\ins \;Y_\mat && \gamma _Y \\ Y_\del &\to X_\del \;Y_\del && (1-\gamma _Y)\kappa (1-\alpha ) \\ Y_\del &\to \epsilon && (1-\gamma _Y)(1-\kappa ) \end {align*}

where $\beta _Y$ and $\gamma _Y$ use the appropriate $(\insrate , \delrate )$ for the link sequence that $Y$ belongs to.

The subscript indicates the alignment type:

$X_\mat $ (match): ancestral link survived; emits aligned pair $(c_a, c_d)$ with probability $\eqm _{c_a} \cdot \exp (\revsub \evoltime )_{c_a c_d}$.
$X_\ins $ (insert): new link born in descendant; emits descendant-only character $c_d$ with probability $\eqm _{c_d}$.
$X_\del $ (delete): ancestral link died; emits ancestor-only character $c_a$ with probability $\eqm _{c_a}$.

The transition weights are precisely the entries of the TKF91 Pair HMM transition matrix $\tkftrans (\insrate , \delrate , \evoltime )$.
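The parameter formulas and production weights of Definition C.11 can be evaluated directly. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the rates are arbitrary example values, and only the two rows written out in the definition ($Y_\mat $ and $Y_\del $) are built.

```python
import math


def tkf_parameters(ins_rate, del_rate, t):
    """Return (alpha, beta, gamma, kappa) for rates lambda < mu and time t > 0."""
    if not (0.0 < ins_rate < del_rate) or t <= 0.0:
        raise ValueError("requires 0 < ins_rate < del_rate and t > 0")
    e_ins = math.exp(-ins_rate * t)
    e_del = math.exp(-del_rate * t)
    alpha = e_del
    beta = ins_rate * (e_ins - e_del) / (del_rate * e_ins - ins_rate * e_del)
    gamma = 1.0 - del_rate * beta / (ins_rate * (1.0 - alpha))
    kappa = ins_rate / del_rate
    return alpha, beta, gamma, kappa


def pair_grammar_rows(alpha, beta, gamma, kappa):
    """Weights of the Y_M and Y_D productions, keyed by what is emitted next."""
    def row(p_insert):
        rest = 1.0 - p_insert
        return {
            "M": rest * kappa * alpha,           # X_M, ancestral link survives
            "I": p_insert,                       # X_I, new descendant link
            "D": rest * kappa * (1.0 - alpha),   # X_D, ancestral link dies
            "E": rest * (1.0 - kappa),           # epsilon, sequence ends
        }
    return {"M": row(beta), "D": row(gamma)}


# Hypothetical rates and time, for illustration only.
alpha, beta, gamma, kappa = tkf_parameters(ins_rate=1.0, del_rate=2.0, t=1.0)
rows = pair_grammar_rows(alpha, beta, gamma, kappa)
for source, row in rows.items():
    print(source, row, sum(row.values()))
```

Each row sums to 1, so the pair grammar stays proper; the only difference between the two rows is that $\gamma _Y$ replaces $\beta _Y$ after a deletion.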

Proposition C.7. The evolution elaboration roughly triples the number of nonterminals (each single-sequence nonterminal becomes three pair nonterminals). For LR-emitting nonterminals (as in stem sequences), match states emit paired tuples $(c_L^a, c_R^a, c_L^d, c_R^d)$, insert states emit $(c_L^d, c_R^d)$, and delete states emit $(c_L^a, c_R^a)$.

Remark C.26. When fragment expansion has been applied before evolution, the fragment self-loop interleaves with the alignment states. In TKF92, the Pair HMM transition matrix becomes $\tkftrans '$ where the $\mat $, $\ins $, and $\del $ self-loops gain a fragment-extension component: \[ \tkftrans '_{aa} = \ext + (1-\ext )\tkftrans _{aa} \qquad \text {for } a \in \{\mat , \ins , \del \} \] and off-diagonal transitions are scaled by $(1-\ext )$: \[ \tkftrans '_{ab} = (1-\ext )\tkftrans _{ab} \qquad \text {for } a \neq b \]
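The correction in Remark C.26 is a one-line transformation of the transition rows. The sketch below assumes the TKF91 rows for $\mat $, $\ins $, and $\del $ are supplied as nested dictionaries with an end column `"E"`; that encoding and the numerical values are assumptions for illustration.

```python
def tkf92_transitions(tau, ext):
    """Apply the fragment-extension correction of Remark C.26.

    tau: {source: {dest: weight}} with rows for the states M, I and D only.
    ext: fragment extension probability in [0, 1).
    """
    out = {}
    for a, row in tau.items():
        # Leaving the fragment happens with probability (1 - ext) ...
        out[a] = {b: (1.0 - ext) * w for b, w in row.items()}
        # ... and staying in it adds ext to the self-loop.
        out[a][a] = ext + (1.0 - ext) * row.get(a, 0.0)
    return out


# Hypothetical TKF91 rows (alpha=0.8, beta=0.1, gamma=0.05, kappa=0.5).
tau = {
    "M": {"M": 0.36, "I": 0.10, "D": 0.090, "E": 0.450},
    "I": {"M": 0.36, "I": 0.10, "D": 0.090, "E": 0.450},
    "D": {"M": 0.38, "I": 0.05, "D": 0.095, "E": 0.475},
}
tau92 = tkf92_transitions(tau, ext=0.6)
for source, row in tau92.items():
    print(source, row, sum(row.values()))
```

Because each TKF91 row sums to 1, each corrected row sums to $\ext + (1-\ext ) = 1$, so the transformation preserves properness.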

Remark C.27. For LR-emitting grammars (such as stem sequences in the TKF Structure Tree), the evolution elaboration creates nonterminals $X_M, X_I, X_D$ whose left and right emissions are correlated:

$X_M \to c_L^x\;c_L^y\;X_M\;c_R^y\;c_R^x$: ancestral basepair $(c_L^x, c_R^x)$ evolved to descendant basepair $(c_L^y, c_R^y)$ (match).
$X_I$: emits only descendant basepair $c_L^y\;\cdots \;c_R^y$ (insertion).
$X_D$: emits only ancestral basepair $c_L^x\;\cdots \;c_R^x$ (deletion).

The ancestor terminals go on the outside, the descendant terminals on the inside (or vice versa), preserving the palindromic nesting.

Example C.9 (TKF91 Pair HMM). Applying evolution to the TKF91 single-sequence grammar (Example C.1) yields the standard 5-state Pair HMM $(\sta , \mat , \ins , \del , \fin )$ with transition matrix $\tkftrans (\insrate , \delrate , \evoltime )$.

C.11.3 Null State Management

Null State Identification

Definition C.12 (Null state). A nonterminal $X$ in a WCFG is nullable if there exists a derivation $X \Rightarrow ^* \epsilon $. The nullability $\nullability (X) = P(X \Rightarrow ^* \epsilon )$ is the probability that a parse tree rooted at $X$ yields the empty string.
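Nullabilities satisfy one polynomial equation per nonterminal and can be approximated by fixed-point iteration started from zero. The following is a minimal sketch under assumptions: the grammar encoding, the fixed iteration count, and the toy grammar (a nested sequence `D` of loops `L`) are all hypothetical and not one of the models above.

```python
def nullability(productions, iterations=200):
    """Approximate P(X =>* epsilon) for every nonterminal by iteration.

    productions: {lhs: [(rhs_tuple, weight), ...]}; () is the epsilon RHS.
    Symbols that are not keys of `productions` are treated as terminals.
    """
    nu = {lhs: 0.0 for lhs in productions}
    for _ in range(iterations):
        updated = {}
        for lhs, rules in productions.items():
            total = 0.0
            for rhs, weight in rules:
                p = weight
                for symbol in rhs:
                    p *= nu.get(symbol, 0.0)   # terminals are never null
                total += p
            updated[lhs] = total
        nu = updated
    return nu


# Hypothetical toy grammar: D is a sequence of L's, L is a run of characters.
toy = {
    "D": [(("L", "D"), 0.5), ((), 0.5)],
    "L": [(("c", "L"), 0.6), ((), 0.4)],
}
nu = nullability(toy)
print(nu)
```

Here $\nullability (\texttt {L}) = 0.4$ and $\nullability (\texttt {D})$ solves $x = 0.5 + 0.5 \cdot 0.4 \cdot x$, giving $0.625$. Starting from zero, the iteration increases monotonically towards the smallest nonnegative solution.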

Elaborations that create null states include:

1.: Non-recursive nesting: the inner grammar may generate the empty string (e.g., a domain in MixDom may contain zero fragments).
2.: Recursive nesting: nested link sequences can be empty.
3.: Mixture expansion combined with nesting: a mixture component that expands into a nullable sub-grammar.

In the MixDom Pair HMM, null states arise because each domain’s inner TKF92 process can generate an empty sequence. The probability of an empty domain at evolutionary time $\evoltime $ is $\emptyseg _\evoltime = \sum _\dom \domdist _\dom (1-\kappa _\dom )(1-\beta _\dom )$.

Null State Removal: The $(I - T_{NN})^{-1}$ Closure

Definition C.13 (Null Closure). Let $\WCFG $ be a WCFG (or equivalently an HMM/transducer) with states partitioned into emitting states $\Omega $ and non-emitting (null) states $\mathcal {Z}$. Let $T_{\mathcal {Z}\mathcal {Z}}$ be the submatrix of transition weights among null states. The null closure is: \begin {equation} \nullcl = (I - T_{\mathcal {Z}\mathcal {Z}})^{-1} = \sum _{k=0}^{\infty } T_{\mathcal {Z}\mathcal {Z}}^k \end {equation} This converges provided the spectral radius $\rho (T_{\mathcal {Z}\mathcal {Z}}) < 1$.

Proposition C.8 (Effective Transition Matrix). The effective transition matrix between emitting (and start/end) states, with all null-state paths summed out, is: \begin {equation} \effT = T_{\Omega \Omega } + T_{\Omega \mathcal {Z}} \cdot (I - T_{\mathcal {Z}\mathcal {Z}})^{-1} \cdot T_{\mathcal {Z}\Omega } \end {equation} where subscripts denote submatrices restricted to the indicated state sets.
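A minimal sketch of Definition C.13 and Proposition C.8 in pure Python follows. It assumes the four transition blocks are given as nested lists, and it evaluates the closure by summing the series $\sum _k T_{\mathcal {Z}\mathcal {Z}}^k$ rather than by matrix inversion; the function names and the one-emitting-state, one-null-state example are hypothetical.

```python
def mat_mul(A, B):
    """Product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def null_closure(T_zz, tol=1e-14, max_terms=100000):
    """Sum I + T + T^2 + ... ; converges when the spectral radius is below 1."""
    n = len(T_zz)
    closure = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in closure]
    for _ in range(max_terms):
        term = mat_mul(term, T_zz)
        if max(abs(v) for row in term for v in row) < tol:
            return closure
        closure = [[closure[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    raise ValueError("series did not converge; is the spectral radius below 1?")


def effective_transitions(T_oo, T_oz, T_zz, T_zo):
    """T_eff = T_oo + T_oz (I - T_zz)^{-1} T_zo."""
    via_null = mat_mul(mat_mul(T_oz, null_closure(T_zz)), T_zo)
    return [[T_oo[i][j] + via_null[i][j] for j in range(len(T_oo[0]))]
            for i in range(len(T_oo))]


# One emitting state and one null state; every row of the full matrix sums to 1.
T_eff = effective_transitions([[0.5]], [[0.5]], [[0.5]], [[0.5]])
```

In this toy case the null state loops on itself with weight 0.5, so the closure is 2 and the effective self-transition of the emitting state is $0.5 + 0.5 \cdot 2 \cdot 0.5 = 1$.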

Null Cycle Detection and Removal

Definition C.14 (Null Cycle). A null cycle in a WCFG is a chain of unit productions (productions whose right-hand side is a single nonterminal) that returns to the starting nonterminal: $X \to Y_1 \to Y_2 \to \cdots \to X$. In the context of HMMs/transducers, this corresponds to a cycle among non-emitting states.

Null cycles arise in two main situations:

1.: Non-recursive nesting with empty inner grammars: when the inner grammar can produce $\epsilon $, a path $\sta \to \texttt {MOR}_{\mathrm {outer}} \to \texttt {IMM}_{\mathrm {inner}} \to \epsilon \to \texttt {MOR}_{\mathrm {outer}}$ creates a cycle through null states.
2.: Recursive nesting: the chain $\texttt {L}'_\dom \to \texttt {M}'_\dom \to \texttt {S}'_\dom \to \texttt {A}'_\dom \to \texttt {L}'_{\dom '}$ creates a null cycle when $\dom ' = \dom $ or when the chain of domain transitions eventually returns to $\dom $.

Definition C.15 (Null Cycle Removal). To remove null cycles from a WCFG:

1.: Compute nullabilities: for each nonterminal $X$, compute $\nullability (X) = P(X \Rightarrow ^* \epsilon )$. In the non-recursive case, this can be done in closed form. In the recursive case, iterate the fixed-point equations: \begin {equation} \nullability (X) = \sum _{(X \to \alpha ) \in \PR } \WF (X \to \alpha ) \cdot \prod _{Y \in \alpha } \nullability (Y) \end {equation} where the product is over nonterminals in $\alpha $, with $\nullability (\text {terminal}) = 0$ and $\nullability (\epsilon ) = 1$.
Initialize $\nullability ^{(0)}(X) = 0$ for all $X$ and iterate until convergence.
2.: Create non-nullable copies: for each nullable nonterminal $X$, create $X'$ whose productions never generate $\epsilon $. For a bifurcation rule $X \to Y\;Z$, the non-nullable version adds: \begin {align*} X' &\to Y'\;Z' && \WF (X \to YZ) \\ X' &\to Y' && \WF (X \to YZ) \cdot \nullability (Z) \\ X' &\to Z' && \WF (X \to YZ) \cdot \nullability (Y) \end {align*}
This accounts for the two ways one child can be null.
3.: Remove unit-production cycles: identify cycles $X' \to Y'_1 \to \cdots \to X'$ among the non-nullable nonterminals. For each such cycle, compute the transition matrix $\mathcal {A}$ among the cycle’s nonterminals and replace the cycle with its closure $(I - \mathcal {A})^{-1}$, distributing the accumulated weight to the non-cyclic continuations.
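Step 1 of Definition C.15 can be sketched as below. The grammar encoding (a dictionary from each nonterminal to a list of `(weight, rhs)` pairs, with any symbol that is not a key treated as a terminal) is an assumption made for this illustration, as are the function name and the toy grammar.

```python
def nullabilities(productions, tol=1e-12, max_iter=10000):
    """Iterate nu(X) = sum_rules weight * prod_{Y in rhs} nu(Y) from nu = 0."""
    nu = {X: 0.0 for X in productions}
    for _ in range(max_iter):
        new = {}
        for X, rules in productions.items():
            total = 0.0
            for weight, rhs in rules:
                p = weight                  # an empty rhs (epsilon) contributes the full weight
                for sym in rhs:
                    p *= nu.get(sym, 0.0)   # terminals have nullability 0
                total += p
            new[X] = total
        if max(abs(new[X] - nu[X]) for X in nu) < tol:
            return new
        nu = new
    raise RuntimeError("nullability iteration did not converge")


# Toy grammar: X -> X Y (0.5) | epsilon (0.5);  Y -> 'a' (0.7) | epsilon (0.3)
grammar = {
    "X": [(0.5, ("X", "Y")), (0.5, ())],
    "Y": [(0.7, ("a",)), (0.3, ())],
}
nu = nullabilities(grammar)
```

For this grammar the fixed point can be checked by hand: $\nullability (Y) = 0.3$ and $\nullability (X) = 0.5 / (1 - 0.5 \cdot 0.3)$.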

Example C.10 (Recursive domains: nullability fixed point). In the recursive domain model, $\nullability (\texttt {C}_\dom )$ (nullability of the child link sequence nonterminal) satisfies: \[ \nullability (\texttt {C}_\dom ) = \frac {1 - \kappa _\dom } {1 - \kappa _\dom (1-s_\dom ) \sum _{\dom '} \domdist _{\dom \dom '} \nullability (\texttt {C}_{\dom '})} \] This is solved by initializing $x_\dom ^{(0)} = 0$ and iterating: \[ x_\dom ^{(k+1)} = \frac {1 - \kappa _\dom } {1 - \kappa _\dom (1-s_\dom ) \sum _{\dom '} \domdist _{\dom \dom '} x_{\dom '}^{(k)}} \] After convergence, the null cycles $\texttt {L}'_\dom \to \texttt {L}'_{\dom '}$ and $\texttt {C}'_\dom \to \texttt {C}'_{\dom '}$ are removed using the $(I - \mathcal {A})^{-1}$ and $(I - \mathcal {B})^{-1}$ closures respectively, where $\mathcal {A}$ and $\mathcal {B}$ are the $\ndom \times \ndom $ transition matrices among domain types.
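As a sanity check on the iteration in Example C.10, consider the special case of a single domain type ($\ndom = 1$, so that $\domdist _{\dom \dom } = 1$) with $0 < \kappa _\dom < 1$ and $s_\dom < 1$. This worked case is an added illustration, not part of the general algorithm. Writing $x = \nullability (\texttt {C}_\dom )$, the fixed-point equation rearranges to a quadratic: \[ \kappa _\dom (1-s_\dom )\, x^2 - x + (1-\kappa _\dom ) = 0 \] The update map is increasing in $x$, so the iterates started from $x^{(0)} = 0$ increase monotonically to the smaller root: \[ x = \frac {1 - \sqrt {1 - 4 \kappa _\dom (1-\kappa _\dom )(1-s_\dom )}} {2 \kappa _\dom (1-s_\dom )} \] The square root is real because $4\kappa _\dom (1-\kappa _\dom ) \le 1$. For instance, $\kappa _\dom = s_\dom = 0.5$ gives $x = (1 - \sqrt {0.5})/0.5 \approx 0.586$, which satisfies $x = 0.5 / (1 - 0.25\,x)$.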

C.11.4 Composition Properties

Commutativity and Order The elaboration rules do not, in general, commute. The following table summarizes the ordering constraints.

Constraint	Reason
CTMC expansion commutes with mixture expansion	Mixture selects which CTMC parameters to use; the order of these two operations does not affect the final grammar. Both orderings produce the same set of productions.

Fragment expansion must follow (or be simultaneous with) CTMC expansion	Fragment expansion wraps terminal emissions in a geometric self-loop. Without terminals, fragment expansion produces fragments of $\epsilon $’s.

Non-recursive nesting must follow both CTMC and fragment expansion of the inner grammar	The inner grammar must be fully specified before it can be spliced into the outer grammar.

Evolution must be applied last (after all structural elaborations)	Evolution triples the nonterminals and introduces alignment-dependent transition weights ($\alpha , \beta , \gamma $). Applying structural elaborations after evolution would require modifying all three copies independently.

Mixture expansion commutes with concatenation	Both are structural operations on different aspects of a link.

Null state removal must follow all nullable elaborations but precede distillation	All null states must be identified before they can be summed out. The distilled order-1 machines assume null-free grammars.

Validity Conditions

Definition C.16 (Well-Formed Elaborated Grammar). An elaborated grammar $\WCFG '$ is well-formed if:

1.: Properness: for every nonterminal $X$, the production weights sum to 1.
2.: No unresolved null cycles: after null state removal, no cycles among non-emitting states remain. Equivalently, $\rho (T_{\mathcal {Z}\mathcal {Z}}) < 1$ for the null-state transition matrix.
3.: Convergent nullability: for recursive grammars, the fixed-point iteration for nullabilities converges. A sufficient condition is that $\kappa _\dom < 1$ and $s_\dom > 0$ for all domain types (every link has a positive probability of being terminal rather than a nesting point).
4.: Finite expected derivation length: the expected total number of terminals generated is finite. For the link grammar, this requires $\kappa < 1$. For recursive nesting, additional conditions on the nesting probabilities are needed.
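Conditions 1 and 2 of Definition C.16 lend themselves to simple numerical checks, sketched below under stated assumptions. The grammar encoding (nonterminal mapped to a list of `(weight, rhs)` pairs) and both function names are hypothetical. The second check is only a sufficient test: for a nonnegative matrix, the spectral radius is bounded by the largest row sum, so row sums strictly below 1 imply $\rho (T_{\mathcal {Z}\mathcal {Z}}) < 1$, but a matrix can fail this test and still have spectral radius below 1.

```python
def is_proper(productions, tol=1e-9):
    """Condition 1: the production weights of every nonterminal sum to 1."""
    return all(abs(sum(weight for weight, _ in rules) - 1.0) <= tol
               for rules in productions.values())


def null_rows_substochastic(T_zz):
    """Sufficient (not necessary) test for rho(T_zz) < 1 with nonnegative entries."""
    return all(sum(row) < 1.0 for row in T_zz)


grammar = {
    "X": [(0.5, ("X", "Y")), (0.5, ())],
    "Y": [(0.7, ("a",)), (0.3, ())],
}
proper = is_proper(grammar)
improper = is_proper({"X": [(0.5, ("a",)), (0.4, ())]})
contractive = null_rows_substochastic([[0.2, 0.3], [0.0, 0.9]])
```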

Proposition C.9. Each elaboration rule defined above is validity-preserving under its stated conditions. Composition of validity-preserving elaborations is validity-preserving, provided the ordering constraints above are respected.

Derivation of Existing Models We now show explicitly how each known TKF-family model arises as a sequence of elaborations applied to the base link grammar.

TKF91. \begin {equation*} \boxed {\text {TKF91}} = \WCFG _{\mathrm {link}}(\kappa ) \xrightarrow {\text {CTMC}(\alphabet , \eqm , \exch )} \WCFG _{\mathrm {TKF91}} \end {equation*} Steps:

1.: Start with the link grammar $\WCFG _{\mathrm {link}}(\kappa )$ (Definition C.3).
2.: Apply left CTMC expansion (Definition C.5) to $\texttt {MOR}$ with alphabet $\alphabet $, equilibrium $\eqm $, rate matrix $\exch $.

Result: each mortal link emits a single character from $\eqm $. The pair grammar is obtained by applying evolution (Definition C.11), yielding the standard 5-state Pair HMM with transition matrix $\tkftrans (\insrate , \delrate , \evoltime )$.

TKF92. \begin {equation*} \boxed {\text {TKF92}} = \WCFG _{\mathrm {link}}(\kappa ) \xrightarrow {\text {Frag}(\ext )} \xrightarrow {\text {CTMC}(\alphabet , \eqm , \exch )} \WCFG _{\mathrm {TKF92}} \end {equation*} Steps:

1.: Start with $\WCFG _{\mathrm {link}}(\kappa )$.
2.: Apply fragment expansion (Definition C.6) with extension probability $\ext $.
3.: Apply left CTMC expansion to each fragment position.

Result: each mortal link emits a fragment of $K \sim \geomdist (\ext )$ characters. The pair grammar has self-looping match/insert/delete states with fragment-extension probability $\ext $.

MixDom: Markovian fragments. \begin {equation*} \boxed {\text {MixDom}} = \WCFG _{\mathrm {link}}(\kappa _0) \xrightarrow {\text {Mix}(\domdist _\dom )} \xrightarrow [\text {for each } \dom ]{\text {NonRecNest}\bigl ( \WCFG _{\mathrm {link}}(\kappa _\dom ) \xrightarrow {\text {HMM}(\ext ^{(\dom )})} \xrightarrow {\text {Mix}(\fragdist _{\dom \frag })} \xrightarrow {\text {Mix}(\classdist _{\dom \frag \class })} \xrightarrow {\text {CTMC}(\alphabet , \eqm ^{(\class )}, \exch ^{(\class )})} \bigr )} \WCFG _{\mathrm {MixDom}} \end {equation*} Steps:

1.: Start with $\WCFG _{\mathrm {link}}(\kappa _0)$ (outer/top-level link grammar).
2.: Apply mixture expansion (Definition C.7) to $\texttt {MOR}$ with domain types $\dom \sim \catdist (\domdist _1,\ldots ,\domdist _\ndom )$.
3.: For each domain type $\dom $, construct an inner grammar:

(a): Start with $\WCFG _{\mathrm {link}}(\kappa _\dom )$ (inner link grammar).
(b): Replace geometric fragment extension with the Markovian fragment HMM governed by the $\nfrag \times \nfrag $ transition matrix $\ext ^{(\dom )}_{\srcfrag \destfrag }$.
(c): Apply mixture expansion for initial fragment types $\frag \sim \catdist (\fragdist _{\dom 1},\ldots ,\fragdist _{\dom \nfrag })$.
(d): Apply mixture expansion for site classes $\class \sim \catdist (\classdist _{\dom \frag 1},\ldots ,\classdist _{\dom \frag \nclasses })$.
(e): Apply CTMC expansion with $(\alphabet , \eqm ^{(\class )}, \exch ^{(\class )})$.

4.: Apply non-recursive nesting (Definition C.9): splice each domain’s inner grammar into the corresponding outer mortal link.

5.: Apply null state removal (Section C.11.3): the inner grammar can generate empty sequences (null domains), creating the null states $\mnull , \inull , \dnull $ in the null-separated Pair HMM. The $(I - T_{\mathcal {Z}\mathcal {Z}})^{-1}$ closure reduces the 8-state null-separated Pair HMM to the effective 5-state matrix $\nonemptytrans $.

TKF Structure Tree. \begin {equation*} \boxed {\text {TKF Structure Tree}} = \WCFG _{\mathrm {link}}(\kappa _L) \xrightarrow {\text {CTMC}_L(\alphabet , \eqm _L, \exch _L)} \xrightarrow {\text {RecNest}\bigl ( \WCFG _{\mathrm {link}}(\kappa _S) \xrightarrow {\text {CTMC}_{LR}(\alphabet ^2, \eqm _S, \exch _S)} \bigr )} \WCFG _{\mathrm {ST}} \end {equation*} Steps:

1.: Start with $\WCFG _{\mathrm {link}}(\kappa _L)$ (loop link grammar).
2.: Apply left CTMC expansion for loops with $(\alphabet , \eqm _L, \exch _L)$.
3.: Apply recursive nesting (Definition C.10): within the loop, a mortal link may spawn a stem. The stem sub-grammar is:

(a): Start with $\WCFG _{\mathrm {link}}(\kappa _S)$ (stem link grammar).
(b): Apply LR CTMC expansion with basepair alphabet $(\alphabet \times \alphabet , \eqm _S, \exch _S)$.
(c): At stem termination, return to the loop grammar (creating the recursion $S \to L$).

4.: The loop grammar now has two types of mortal links:

Character-emitting links (probability $s_L$): emit a single nucleotide $c$ with $\eqm _L(c)$.
Stem-spawning links (probability $1 - s_L$): expand into a stem $S$ nonterminal.

The LR emission in the stem grammar is critical: the rule $S \to c_L\;S\;c_R$ emits characters on both sides of the recursive expansion, generating the palindromic base-pairing structure of RNA stems.

TKF Genome. \begin {equation*} \boxed {\text {TKF Genome}} = \WCFG _{\mathrm {link}}(\kappa _R) \xrightarrow {\text {Concat+Mix(region types)}} \xrightarrow {\text {various nested elaborations per region type}} \WCFG _{\mathrm {Genome}} \end {equation*} Steps:

1.: Start with $\WCFG _{\mathrm {link}}(\kappa _R)$ (top-level genomic region grammar).
2.: Apply concatenation + mixture: each link is a “region” selected from types $\{\texttt {INTER}, \texttt {FWDCDS}, \texttt {REVCDS}, \texttt {STRUCT}, \texttt {CONS}\}$ with probabilities $(p_N, p_G/2, p_G/2, p_S, p_C)$.
3.: Each region type undergoes its own elaboration chain:

INTER: link grammar + left CTMC expansion (single nucleotides, neutral evolution).
FWDCDS/REVCDS: link grammar + concatenation into codons ($c_1 c_2 c_3$) + CTMC expansion with codon substitution model + recursive nesting for introns (introns contain a nested $\texttt {GENOME}$ nonterminal, flanked by splice donor/acceptor sites).
STRUCT: link grammar with LR CTMC expansion (stems) + recursive nesting into loops (left CTMC expansion).
CONS: link grammar + left CTMC expansion (conserved elements).

4.: The intron nesting creates recursion: $\texttt {GENOME} \to \texttt {CDS} \to \texttt {CODON} \to \texttt {INTRON} \to \texttt {GENOME}$.

C.11.5 Toward Implementation

Grammar Objects with Transformation Methods The elaboration rules defined above can be implemented as methods on a grammar object:

from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Production:
    # Assumed record type (not specified in the text): one weighted rule lhs -> rhs.
    lhs: str
    rhs: Tuple[str, ...]
    weight: float


class WCFG:
    # Interface sketch only: each method is a stub named after one elaboration rule.
    nonterminals: Set[str]
    terminals: Set[str]
    productions: List[Production]
    start: str

    def ctmc_expand(self, nonterminal, alphabet,
                    equilibrium, emission_type='left'):
        """CTMC Expansion"""

    def fragment_expand(self, nonterminal, extension_prob):
        """Fragment Expansion"""

    def mixture_expand(self, nonterminal, components, weights):
        """Mixture Expansion"""

    def concatenate(self, nonterminal, parts):
        """Link Sequence Concatenation"""

    def nest_nonrecursive(self, nonterminal, inner_grammar):
        """Non-Recursive Nesting"""

    def nest_recursive(self, nonterminal, bifurcation_grammar,
                       terminal_prob, side='right'):
        """Recursive Nesting"""

    def evolve(self, time, rates):
        """Evolution"""

    def remove_null_states(self):
        """Null State Removal"""

Each method validates the preconditions (e.g., that fragment expansion targets a nonterminal with terminal emissions), performs the transformation, and returns the modified grammar.

Automatic Derivation of DP Algorithms Given an elaborated grammar, the dynamic programming algorithm (Forward, Backward, Inside, Viterbi) can be derived automatically by:

1.: State space identification: each nonterminal in the elaborated grammar corresponds to a state in the DP. Emitting states correspond to observable positions (sequence characters); non-emitting states are handled by null closure.
2.: Transition structure: the productions define the recurrence relations. Linear (non-branching) productions yield HMM-style recurrences; branching (bifurcation) productions yield CYK-style recurrences.
3.: Emission probabilities: determined by the CTMC parameters and the emission type (left, right, LR, match/insert/delete).
4.: Fill order: determined by the topological sort of nonterminals (after null cycle removal). For recursive grammars, the fill order follows the CYK pattern (by span length for context-free rules).
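For the linear (HMM-style) case, the four steps above reduce to the familiar Forward recurrence on a 5-state Pair HMM, sketched here. The state order, the function name `pair_forward`, the toy transition matrix and the toy emission functions are all assumptions made for illustration; in particular the transition values are not derived from TKF91 rates.

```python
START, MATCH, INS, DEL, END = range(5)
SOURCES = (START, MATCH, INS, DEL)


def pair_forward(x, y, T, emit_match, emit_ins, emit_del):
    """Likelihood of ancestor x and descendant y, summed over all alignments."""
    n, m = len(x), len(y)
    # F[i][j][k]: probability of having emitted x[:i], y[:j] and being in state k
    F = [[[0.0] * 5 for _ in range(m + 1)] for _ in range(n + 1)]
    F[0][0][START] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:   # match consumes one character from each sequence
                F[i][j][MATCH] = emit_match(x[i - 1], y[j - 1]) * sum(
                    F[i - 1][j - 1][k] * T[k][MATCH] for k in SOURCES)
            if j > 0:             # insert consumes a descendant character only
                F[i][j][INS] = emit_ins(y[j - 1]) * sum(
                    F[i][j - 1][k] * T[k][INS] for k in SOURCES)
            if i > 0:             # delete consumes an ancestral character only
                F[i][j][DEL] = emit_del(x[i - 1]) * sum(
                    F[i - 1][j][k] * T[k][DEL] for k in SOURCES)
    return sum(F[n][m][k] * T[k][END] for k in SOURCES)


T = [
    [0.0, 0.6, 0.2, 0.1, 0.1],
    [0.0, 0.6, 0.2, 0.1, 0.1],
    [0.0, 0.5, 0.3, 0.1, 0.1],
    [0.0, 0.5, 0.1, 0.3, 0.1],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
match = lambda a, d: 0.25 * (0.7 if a == d else 0.1)   # uniform equilibrium, 4 letters
single = lambda c: 0.25
lik_empty = pair_forward("", "", T, match, single, single)
lik_ins = pair_forward("", "A", T, match, single, single)
lik_pair = pair_forward("A", "A", T, match, single, single)
```

The row-major loop is the fill order of step 4 for this grammar: each cell depends only on cells with a smaller ancestor index or a smaller descendant index. Grammars with bifurcations would instead need a span-indexed (CYK-style) table.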

Connection to Existing Frameworks The elaboration rules defined here are compatible with existing grammar and transducer frameworks:

Transducer composition: the Evolution elaboration produces a transducer (Pair HMM / WFST). These can be composed on phylogenetic trees using the standard composition and intersection operations for Mealy machines in waiting-machine normal form.
SCFG parsers: the elaborated grammars (especially those with recursive nesting and LR emission) are SCFGs amenable to Inside/Outside parsing algorithms.
Distillation: the elaborated pair grammars can be distilled to order-1 machines (HMMs and WFSTs) by computing adjacency frequencies and normalizing, as described in Section C.4.5. This step loses the hierarchical structure but produces compact machines suitable for phylogenetic composition.

C.12 Recursive TKF Models

Another way to nest TKF models is to borrow from grammar theory and allow recursion. This was previously used to develop RNA evolutionary models (22), and is developed here in a general way for proteins and genomic DNA.

We illustrate how the TKF-mixed domain model can be interpreted as a stochastic grammar, developing four examples—recursive protein domains (i.e. arbitrary nesting of motifs), a basic model of RNA foldback structure, a second more sophisticated model of RNA structure, and a basic model of a genome—that highlight the structure of such models as a series of stepwise grammar elaborations that constitute tree-adjoining moves on the space of grammars.

C.12.1 Example One: Left-Recursive TKF (L-TKF)

We can imagine a links model where, in a domain of type $\srcdom $, each mortal link is either (with probability $\tok _\srcdom $) associated with a character, or (with probability $(1 - \tok _\srcdom ) \domdist _{\srcdom \destdom }$) associated with its own independently-evolving links model of domain type $\destdom $. Links models can thus be nested ad infinitum.

\begin {eqnarray*} \model _\srcdom & = & \tkflinks (\mixture _{\sumidx \sim \tok }(L^{(\sumidx )}_\srcdom );\insrate _\srcdom ,\delrate _\srcdom ) \\ L^{(1)}_\srcdom & = & \hmmproc (\{\exch ^{(\class )},\eqm ^{(\class )}\}_\class ;\ext ^{(\srcdom )},\classdist _\srcdom ) \\ L^{(0)}_\srcdom & = & \mixture _{\destdom \sim \domdist _\srcdom }(\model _\destdom ) \end {eqnarray*}

where $\hmmproc $ denotes the Markovian fragment process (Section C.1.1), and $\mixture _{\sumidx \sim \tok }$ is defined for the Bernoulli index variable $\sumidx $ and probability $\tok $ as it was for categorical index variables \[ \state \sim \mixture _{\sumidx \sim p}(\model (\theta _\sumidx );p)\ \Leftrightarrow \ \sumidx \sim \berndist (p),\ \state \sim \model (\theta _\sumidx ) \]

We postpone the TKF92-like augmentations of $L^{(1)}_\srcdom $ (fragment types $\frag $ and site classes $\class $) for now, and start with a simplified TKF91-like version that retains fully recursive nesting of domains but allows only single-character fragments, with one site class per domain:

\begin {eqnarray*} \model '_\srcdom & = & \tkflinks (\mixture _{\sumidx \sim \tok }(L'^{(\sumidx )}_\srcdom );\insrate _\srcdom ,\delrate _\srcdom ) \\ L'^{(1)}_\srcdom & = & \subproc (\exch _\srcdom ,\eqm _\srcdom ) \\ L'^{(0)}_\srcdom & = & \mixture _{\destdom \sim \domdist _\srcdom }(\model '_\destdom ) \end {eqnarray*}

As with the nested TKF HMM, we have to account for the probability of zero-length components, leading to null cycles during likelihood and inference computations.

The joint distribution over ancestor-descendant alignments under the recursive TKF model is described by the following stochastic context-free grammar (SCFG) \[ \begin {array}{lrcccl} \mbox {Symbol interpretation} & \mbox {LHS} & & \multicolumn {2}{c}{\mbox {RHS}} & \mbox {Probability} \\ \hline \mbox {\underline {\bf L}ink sequence, domain $\srcdom $} & \ntlinks _\srcdom & \to & \ntlinks _\srcdom & \ntmor _\srcdom & \kappa _\srcdom \\ & & | & \ntimm _\srcdom & & 1 - \kappa _\srcdom \\ \mbox {Immortal lin\underline {\bf k}} & \ntimm _\srcdom & \to & \ntnew _\srcdom & & \beta _\srcdom \\ & & | & \epsilon & & 1 - \beta _\srcdom \\ \mbox {\underline {\bf M}ortal link} & \ntmor _\srcdom & \to & \ntsur _\srcdom & & \alpha _\srcdom \\ & & | & \ntexp _\srcdom & & 1 - \alpha _\srcdom \\ \mbox {\underline {\bf S}urviving mortal link} & \ntsur _\srcdom & \to & \ntaln _\srcdom & \ntnew _\srcdom & \beta _\srcdom \\ & & | & \ntaln _\srcdom & & 1 - \beta _\srcdom \\ \mbox {\underline {\bf E}xpired mortal link} & \ntexp _\srcdom & \to & \ntdel _\srcdom & \ntnew _\srcdom & \gamma _\srcdom \\ & & | & \ntdel _\srcdom & & 1 - \gamma _\srcdom \\ \mbox {\underline {\bf N}ewborn mortal link(s)} & \ntnew _\srcdom & \to & \ntins _\srcdom & \ntnew _\srcdom & \beta _\srcdom \\ & & | & \ntins _\srcdom & & 1 - \beta _\srcdom \\ \mbox {\underline {\bf A}ligned component} & \ntaln _\srcdom & \to & \term _{\anctok \destok } & & \tok _\srcdom \eqm _{\srcdom \anctok } \exp (\exch _\srcdom \evoltime )_{\anctok \destok }\\ & & | & \ntlinks _\destdom & & (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf I}nserted component} & \ntins _\srcdom & \to & \term _{\gap \destok } & & \tok _\srcdom \eqm _{\srcdom \destok } \\ & & | & \ntchild _\destdom & & (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf D}eleted component} & \ntdel _\srcdom & \to & \term _{\anctok \gap } & & \tok _\srcdom \eqm _{\srcdom \anctok } \\ & & | & \ntparent _\destdom & & (1 - 
\tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf C}hild link sequence (inserted)} & \ntchild _\srcdom & \to & \ntchild _\srcdom & \ntins _\srcdom & \kappa _\srcdom \\ & & | & \epsilon & & 1 - \kappa _\srcdom \\ \mbox {\underline {\bf P}arent link sequence (deleted)} & \ntparent _\srcdom & \to & \ntparent _\srcdom & \ntdel _\srcdom & \kappa _\srcdom \\ & & | & \epsilon & & 1 - \kappa _\srcdom \\ \end {array} \]

The standard path from here is to transform the grammar to Chomsky Normal Form (45). We don’t need to go all the way down that path; we only need to remove $\epsilon $-productions and null cycles.

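In order to train by EM, every time we remove $\epsilon $-productions and null cycles, we need to be able to convert a rule count in the $\epsilon $-eliminated grammar to a set of rule counts in the original grammar. For each non-nullable nonterminal $\xstate '$, a rule $\xstate ' \to \ystate '$ that arises from an original bifurcation $\xstate \to \ystate \;\zstate $ with $\zstate $ nullable has weight $\WF (\xstate \to \ystate \;\zstate ) \cdot \nullability (\zstate )$. Each use of this rule corresponds to one use of $\xstate \to \ystate \;\zstate $ in which $\zstate $ derived the empty string, so its expected count $c(\xstate ' \to \ystate ')$ is credited to the original rule $\xstate \to \ystate \;\zstate $ without a further factor of $\nullability (\zstate )$ (which is already part of the rule weight), and symmetrically when the first child was nullable. When $\xstate ' \to \ystate '$ also absorbs an original unit rule $\xstate \to \ystate $, the count is split between the two original rules in proportion to their contributions to the merged weight.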

More precisely, for the nullability fixed-point iteration, this means tracking an additional set of expected counts alongside the nullabilities. For each nonterminal $\xstate $ with nullability $\nullability (\xstate )$, define $\bar {c}(\xstate \to \alpha )$ as the expected number of times rule $\xstate \to \alpha $ would have been used in the original grammar, conditioned on $\xstate $ generating the empty string. These satisfy analogous fixed-point equations: \[ \bar {c}(\xstate \to \ystate _1 \cdots \ystate _k) = \WF (\xstate \to \ystate _1 \cdots \ystate _k) \prod _{i=1}^k \nullability (\ystate _i) + \sum _{i=1}^k \frac {\WF (\xstate \to \ystate _1 \cdots \ystate _k) \prod _{j=1}^k \nullability (\ystate _j)} {\nullability (\xstate )} \sum _{\alpha '} \bar {c}(\ystate _i \to \alpha ') \] and can be iterated to convergence alongside the nullabilities. Given posterior counts from the Inside-Outside algorithm on the $\epsilon $-eliminated grammar, the original-grammar rule counts are recovered by:

1.: For each non-null production $\xstate ' \to \alpha '$ in the $\epsilon $-eliminated grammar, its posterior count $c'(\xstate ' \to \alpha ')$ contributes directly to the corresponding original rule.
2.: For each “nullability shortcut” production $\xstate ' \to \ystate '$ (arising from $\xstate \to \ystate \;\zstate $ with $\zstate $ nullable), the count $c'(\xstate ' \to \ystate ')$ contributes $c'(\xstate ' \to \ystate ')$ to the original rule $\xstate \to \ystate \;\zstate $, plus $c'(\xstate ' \to \ystate ') \cdot \bar {c}(\zstate \to \alpha )$ to each rule within $\zstate $’s null derivation subtree.

The general theory of $\epsilon $-elimination with EM count recovery, including the null closure $(I - T_{\mathcal {Z}\mathcal {Z}})^{-1}$ and null cycle removal, is developed in Section C.11.3.

First we find the nullability $\nullability (\xstate )$ of each nonterminal $\xstate $. The nullability is the probability that a parse tree rooted in that nonterminal yields the empty string. We can’t solve for these in closed form (at least not in the general recursive model, where a link sequence can contain another link sequence of the same type). Instead we can solve approximately by iterating towards a fixed point.

We first observe that the nullabilities collectively satisfy the following

\begin {eqnarray*} \nullability (\ntlinks _\srcdom ) & = & \nullability (\ntimm _\srcdom ) \frac {1 - \kappa _\srcdom }{1 - \kappa _\srcdom \nullability (\ntmor _\srcdom )} \\ \nullability (\ntimm _\srcdom ) & = & 1 - \beta _\srcdom + \beta _\srcdom \nullability (\ntnew _\srcdom ) \\ \nullability (\ntmor _\srcdom ) & = & \alpha _\srcdom \nullability (\ntsur _\srcdom ) + (1 - \alpha _\srcdom ) \nullability (\ntexp _\srcdom ) \\ \nullability (\ntsur _\srcdom ) & = & \nullability (\ntaln _\srcdom ) \left ( \beta _\srcdom \nullability (\ntnew _\srcdom ) + (1 - \beta _\srcdom ) \right ) \\ \nullability (\ntexp _\srcdom ) & = & \nullability (\ntdel _\srcdom ) \left ( \gamma _\srcdom \nullability (\ntnew _\srcdom ) + (1 - \gamma _\srcdom ) \right ) \\ \nullability (\ntnew _\srcdom ) & = & \nullability (\ntins _\srcdom ) \frac {1 - \beta _\srcdom }{1 - \beta _\srcdom \nullability (\ntins _\srcdom )} \\ \nullability (\ntaln _\srcdom ) & = & (1 - \tok _\srcdom ) \sum _\destdom \domdist _{\srcdom \destdom } \nullability (\ntlinks _\destdom ) \\ \nullability (\ntins _\srcdom ) & = & (1 - \tok _\srcdom )\sum _\destdom \domdist _{\srcdom \destdom } \nullability (\ntchild _\destdom ) \\ \nullability (\ntdel _\srcdom ) & = & \nullability (\ntins _\srcdom ) \\ \nullability (\ntchild _\srcdom ) & = & \frac {1 - \kappa _\srcdom }{1 - \kappa _\srcdom \nullability (\ntins _\srcdom )} \\ \nullability (\ntparent _\srcdom ) & = & \nullability (\ntchild _\srcdom ) \\ \end {eqnarray*}

A general procedure is (i) iterate to solve for the $\nullability (\ntchild _\srcdom )$ (substituting in the downstream definition of $\nullability (\ntins _\srcdom )$ so the formula becomes self-referential); (ii) this directly yields $\nullability (\ntparent _\srcdom ), \nullability (\ntins _\srcdom ), \nullability (\ntdel _\srcdom ), \nullability (\ntnew _\srcdom ), \nullability (\ntexp _\srcdom ), \nullability (\ntimm _\srcdom )$; (iii) iterate to solve for $\nullability (\ntlinks _\srcdom )$ (again, first substituting to make it self-referential); (iv) this directly yields the remaining $\nullability (\ntmor _\srcdom ), \nullability (\ntsur _\srcdom ), \nullability (\ntaln _\srcdom )$.

In detail: initialize $x_\srcdom ^{(0)} \leftarrow 0$ for $\srcdom \in \{1,\ldots ,\ndom \}$. Iterate to convergence \[ x_\srcdom ^{(\sumidx +1)} \leftarrow \frac {1 - \kappa _\srcdom }{1 - \kappa _\srcdom (1 - \tok _\srcdom )\sum _\destdom \domdist _{\srcdom \destdom } x_\destdom ^{(\sumidx )}} \] We then set $\nullability (\ntchild _\srcdom ) \leftarrow \lim _{\sumidx \to \infty } x_\srcdom ^{(\sumidx )}$ and set $\nullability (\ntparent _\srcdom ), \nullability (\ntins _\srcdom ), \nullability (\ntdel _\srcdom ), \nullability (\ntnew _\srcdom ), \nullability (\ntexp _\srcdom ), \nullability (\ntimm _\srcdom )$ using the above equations. Now set $y_\srcdom ^{(0)} \leftarrow 0$ for $\srcdom \in \{1,\ldots ,\ndom \}$ and again iterate \[ y_\srcdom ^{(\sumidx +1)} \leftarrow \frac {(1 - \kappa _\srcdom ) \nullability (\ntimm _\srcdom )}{1 - \kappa _\srcdom (1 - \alpha _\srcdom ) \nullability (\ntexp _\srcdom ) - \kappa _\srcdom \alpha _\srcdom \left ( 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \right ) (1 - \tok _\srcdom ) \sum _\destdom \domdist _{\srcdom \destdom } y_\destdom ^{(\sumidx )} } \] Then set $\nullability (\ntlinks _\srcdom ) \leftarrow \lim _{\sumidx \to \infty } y_\srcdom ^{(\sumidx )}$ and set the remaining $\nullability (\ntmor _\srcdom ), \nullability (\ntsur _\srcdom ), \nullability (\ntaln _\srcdom )$ using the above equations.
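The two-stage procedure can be sketched in pure Python as follows. The function name, the argument layout (one list entry per domain type, `domdist[d][e]` for $\domdist _{\srcdom \destdom }$) and the numerical parameter values are assumptions chosen only for illustration; the update formulas are the nullability equations listed above.

```python
def ltkf_nullabilities(kappa, alpha, beta, gamma, tok, domdist,
                       tol=1e-13, max_iter=100000):
    n = len(kappa)

    def nested(v, d):
        # (1 - tok_d) * sum_e domdist[d][e] * v[e]
        return (1.0 - tok[d]) * sum(domdist[d][e] * v[e] for e in range(n))

    def iterate(step):
        v = [0.0] * n                      # start every iteration from zero
        for _ in range(max_iter):
            new = [step(v, d) for d in range(n)]
            if max(abs(a - b) for a, b in zip(new, v)) < tol:
                return new
            v = new
        raise RuntimeError("nullability iteration did not converge")

    # Stage 1: child link sequences, then everything that depends only on them.
    nu_C = iterate(lambda x, d: (1 - kappa[d]) / (1 - kappa[d] * nested(x, d)))
    nu_I = [nested(nu_C, d) for d in range(n)]
    nu_N = [nu_I[d] * (1 - beta[d]) / (1 - beta[d] * nu_I[d]) for d in range(n)]
    nu_E = [nu_I[d] * (gamma[d] * nu_N[d] + 1 - gamma[d]) for d in range(n)]  # nu(D) = nu(I)
    nu_K = [1 - beta[d] + beta[d] * nu_N[d] for d in range(n)]
    # Stage 2: aligned link sequences, then the remaining nonterminals.
    nu_L = iterate(lambda y, d: (1 - kappa[d]) * nu_K[d] / (
        1 - kappa[d] * (1 - alpha[d]) * nu_E[d]
        - kappa[d] * alpha[d] * (1 - beta[d] * (1 - nu_N[d])) * nested(y, d)))
    nu_A = [nested(nu_L, d) for d in range(n)]
    nu_S = [nu_A[d] * (beta[d] * nu_N[d] + 1 - beta[d]) for d in range(n)]
    nu_M = [alpha[d] * nu_S[d] + (1 - alpha[d]) * nu_E[d] for d in range(n)]
    return {"L": nu_L, "K": nu_K, "M": nu_M, "S": nu_S, "E": nu_E, "N": nu_N,
            "A": nu_A, "I": nu_I, "D": nu_I[:], "C": nu_C, "P": nu_C[:]}


# Two domain types with made-up parameters.
kappa, alpha = [0.5, 0.6], [0.8, 0.7]
beta, gamma, tok = [0.2, 0.3], [0.25, 0.35], [0.7, 0.6]
domdist = [[0.5, 0.5], [0.2, 0.8]]
nu = ltkf_nullabilities(kappa, alpha, beta, gamma, tok, domdist)
```

A useful self-check is that the returned values satisfy the first nullability equation, $\nullability (\ntlinks _\srcdom ) = \nullability (\ntimm _\srcdom )(1-\kappa _\srcdom )/(1-\kappa _\srcdom \nullability (\ntmor _\srcdom ))$, to within the iteration tolerance.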

We next develop a “non-nullable” version of the grammar that yields the same Inside probabilities, but explicitly separates out $\epsilon $-generations. For every nonterminal $\xstate _\srcdom $ we create a new nonterminal $\xstate '_\srcdom $ with rules that (by construction) never generate empty parse trees, but are otherwise identical to those transforming $\xstate _\srcdom $. In cases where the original grammar has bifurcation rules $\xstate _i \to \xstate _j \xstate _k$, we need to introduce transitions $\xstate '_i \to \xstate '_j$ and $\xstate '_i \to \xstate '_k$, to account for the cases where $\xstate _k$ or $\xstate _j$, respectively, derives the empty string. (We can also eliminate $\ntimm _\srcdom $, which only has one outgoing rule when its $\epsilon $-production is removed, and fold $\ntlinks _\srcdom \to \ntimm _\srcdom \to \ntnew _\srcdom $ into $\ntlinks _\srcdom \to \ntnew _\srcdom $.)

\[ \begin {array}{lrcccl} \mbox {Symbol interpretation} & \mbox {LHS} & & \multicolumn {2}{c}{\mbox {RHS}} & \mbox {Probability} \\ \hline \mbox {\underline {\bf L}ink sequence, domain $\srcdom $} & \ntlinks '_\srcdom & \to & \ntlinks '_\srcdom & \ntmor '_\srcdom & \kappa _\srcdom \\ \mbox {(nonempty)} & & | & \ntlinks '_\srcdom & & \kappa _\srcdom \nullability (\ntmor _\srcdom ) \\ & & | & \ntmor '_\srcdom & & \kappa _\srcdom \nullability (\ntlinks _\srcdom ) \\ & & | & \ntnew '_\srcdom & & (1 - \kappa _\srcdom ) \beta _\srcdom \\ \mbox {\underline {\bf M}ortal link (nonempty)} & \ntmor '_\srcdom & \to & \ntsur '_\srcdom & & \alpha _\srcdom \\ & & | & \ntexp '_\srcdom & & 1 - \alpha _\srcdom \\ \mbox {\underline {\bf S}urviving mortal link} & \ntsur '_\srcdom & \to & \ntaln '_\srcdom & \ntnew '_\srcdom & \beta _\srcdom \\ \mbox {(etc.; all $\xstate '_\srcdom $ are nonempty)} & & | & \ntnew '_\srcdom & & \beta _\srcdom \nullability (\ntaln _\srcdom ) \\ & & | & \ntaln '_\srcdom & & 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ \mbox {\underline {\bf E}xpired mortal link} & \ntexp '_\srcdom & \to & \ntdel '_\srcdom & \ntnew '_\srcdom & \gamma _\srcdom \\ & & | & \ntnew '_\srcdom & & \gamma _\srcdom \nullability (\ntdel _\srcdom ) \\ & & | & \ntdel '_\srcdom & & 1 - \gamma _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ \mbox {\underline {\bf N}ewborn mortal link(s)} & \ntnew '_\srcdom & \to & \ntins '_\srcdom & \ntnew '_\srcdom & \beta _\srcdom \\ & & | & \ntnew '_\srcdom & & \beta _\srcdom \nullability (\ntins _\srcdom ) \\ & & | & \ntins '_\srcdom & & 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ \mbox {\underline {\bf A}ligned component} & \ntaln '_\srcdom & \to & \term _{\anctok \destok } & & \tok _\srcdom \eqm _{\srcdom \anctok } \exp (\exch _\srcdom \evoltime )_{\anctok \destok }\\ & & | & \ntlinks '_\destdom & & (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf I}nserted component} & \ntins '_\srcdom & 
\to & \term _{\gap \destok } & & \tok _\srcdom \eqm _{\srcdom \destok } \\ & & | & \ntchild '_\destdom & & (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf D}eleted component} & \ntdel '_\srcdom & \to & \term _{\anctok \gap } & & \tok _\srcdom \eqm _{\srcdom \anctok } \\ & & | & \ntparent '_\destdom & & (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf C}hild link sequence (inserted)} & \ntchild '_\srcdom & \to & \ntchild '_\srcdom & \ntins '_\srcdom & \kappa _\srcdom \\ & & | & \ntchild '_\srcdom & & \kappa _\srcdom \nullability (\ntins _\srcdom ) \\ & & | & \ntins '_\srcdom & & \kappa _\srcdom \nullability (\ntchild _\srcdom ) \\ \mbox {\underline {\bf P}arent link sequence (deleted)} & \ntparent '_\srcdom & \to & \ntparent '_\srcdom & \ntdel '_\srcdom & \kappa _\srcdom \\ & & | & \ntparent '_\srcdom & & \kappa _\srcdom \nullability (\ntdel _\srcdom ) \\ & & | & \ntdel '_\srcdom & & \kappa _\srcdom \nullability (\ntparent _\srcdom ) \\ \end {array} \]
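The construction of the primed rules can be sketched mechanically. In this illustration a nonterminal's rules are given as `(weight, rhs)` pairs with at most two right-hand-side symbols, `nu` maps nonterminals to their nullabilities, and any symbol absent from `nu` is treated as a terminal; the function name and the numerical values are hypothetical.

```python
from collections import defaultdict


def nonnullable_rules(rules, nu):
    """Build the rules of X' from the rules of X, merging duplicate right-hand sides."""
    def prime(sym):
        return sym + "'" if sym in nu else sym     # terminals are left unprimed

    out = defaultdict(float)
    for weight, rhs in rules:
        if len(rhs) == 0:
            continue                               # drop the epsilon production
        if len(rhs) == 1:
            out[(prime(rhs[0]),)] += weight
        elif len(rhs) == 2:
            y, z = rhs
            out[(prime(y), prime(z))] += weight            # both children nonempty
            out[(prime(y),)] += weight * nu.get(z, 0.0)    # second child empty
            out[(prime(z),)] += weight * nu.get(y, 0.0)    # first child empty
        else:
            raise ValueError("expected at most two right-hand-side symbols")
    return {rhs: w for rhs, w in out.items() if w > 0.0}


# Surviving mortal link: S -> A N (beta) | A (1 - beta), with beta = 0.3.
beta = 0.3
nu = {"A": 0.2, "N": 0.1}
s_prime = nonnullable_rules([(beta, ("A", "N")), (1 - beta, ("A",))], nu)
```

With these values the merged unit rule to the aligned component has weight $1 - \beta (1 - \nullability (\ntnew )) = 0.73$, matching the corresponding row of the table above.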

The final step is to remove null cycles: chains of unit productions resulting in the same nonterminal. Specifically, we need to remove $\ntlinks '_\srcdom \to \ntmor '_\srcdom \to \ntsur '_\srcdom \to \ntaln '_\srcdom \to \ntlinks '_\destdom $, $\ntchild '_\srcdom \to \ntins '_\srcdom \to \ntchild '_\destdom $ and $\ntparent '_\srcdom \to \ntdel '_\srcdom \to \ntparent '_\destdom $. We do this by deleting the $\ntaln '_\srcdom \to \ntlinks '_\destdom $, $\ntins '_\srcdom \to \ntchild '_\destdom $ and $\ntdel '_\srcdom \to \ntparent '_\destdom $ transitions, and adding compensatory self-loops and self-bifurcations to $\ntlinks '_\srcdom $, $\ntchild '_\srcdom $, and $\ntparent '_\srcdom $ to account for the now-broken paths through $\ntaln '_\srcdom $, $\ntins '_\srcdom $ and $\ntdel '_\srcdom $, respectively. The modified grammar is

\[ \begin {array}{lrcccl} \mbox {Symbol} & \mbox {LHS} & & \multicolumn {2}{c}{\mbox {RHS}} & \mbox {Probability} \\ \hline \mbox {\underline {\bf L}ink} & \ntlinks ''_\srcdom & \to & \ntlinks ''_\srcdom & \ntmor ''_\srcdom & \kappa _\srcdom \\ & & | & \ntlinks ''_\srcdom & \ntlinks ''_\destdom & \kappa _\srcdom \alpha _\srcdom \left ( 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \right ) (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ & & | & \ntlinks ''_\destdom & & \ltrans _{\srcdom \destdom } \\ & & | & \ntmor ''_\srcdom & & \kappa _\srcdom \nullability (\ntlinks _\srcdom ) \\ & & | & \ntnew ''_\srcdom & & (1 - \kappa _\srcdom ) \beta _\srcdom \\ \mbox {\underline {\bf M}ortal} & \ntmor ''_\srcdom & \to & \ntsur ''_\srcdom & & \alpha _\srcdom \\ & & | & \ntexp ''_\srcdom & & 1 - \alpha _\srcdom \\ \mbox {\underline {\bf S}urviving} & \ntsur ''_\srcdom & \to & \ntaln ''_\srcdom & \ntnew ''_\srcdom & \beta _\srcdom \\ & & | & \ntnew ''_\srcdom & & \beta _\srcdom \nullability (\ntaln _\srcdom ) \\ & & | & \ntaln ''_\srcdom & & 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ \mbox {\underline {\bf E}xpired} & \ntexp ''_\srcdom & \to & \ntdel ''_\srcdom & \ntnew ''_\srcdom & \gamma _\srcdom \\ & & | & \ntparent ''_\destdom & \ntnew ''_\srcdom & \gamma _\srcdom (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ & & | & \ntnew ''_\srcdom & & \gamma _\srcdom \nullability (\ntdel _\srcdom ) \\ & & | & \ntdel ''_\srcdom & & 1 - \gamma _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ & & | & \ntparent ''_\destdom & & \left ( 1 - \gamma _\srcdom (1 - \nullability (\ntnew _\srcdom )) \right ) (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf N}ewborns} & \ntnew ''_\srcdom & \to & \ntins ''_\srcdom & \ntnew ''_\srcdom & \beta _\srcdom \\ & & | & \ntchild ''_\destdom & \ntnew ''_\srcdom & \beta _\srcdom (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ & & | & \ntnew ''_\srcdom & & \beta _\srcdom \nullability 
(\ntins _\srcdom ) \\ & & | & \ntins ''_\srcdom & & 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ & & | & \ntchild ''_\destdom & & \left ( 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \right ) (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf A}ligned} & \ntaln ''_\srcdom & \to & \term _{\anctok \destok } & & \tok _\srcdom \eqm _{\srcdom \anctok } \exp (\exch _\srcdom \evoltime )_{\anctok \destok }\\ \mbox {\underline {\bf I}nserted} & \ntins ''_\srcdom & \to & \term _{\gap \destok } & & \tok _\srcdom \eqm _{\srcdom \destok } \\ \mbox {\underline {\bf D}eleted} & \ntdel ''_\srcdom & \to & \term _{\anctok \gap } & & \tok _\srcdom \eqm _{\srcdom \anctok } \\ \mbox {\underline {\bf C}hild} & \ntchild ''_\srcdom & \to & \ntchild ''_\srcdom & \ntins ''_\srcdom & \kappa _\srcdom \\ & & | & \ntchild ''_\srcdom & \ntchild ''_\destdom & \kappa _\srcdom (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ & & | & \ntchild ''_\destdom & & \ctrans _{\srcdom \destdom } \\ & & | & \ntins ''_\srcdom & & \kappa _\srcdom \nullability (\ntchild _\srcdom ) \\ \mbox {\underline {\bf P}arent} & \ntparent ''_\srcdom & \to & \ntparent ''_\srcdom & \ntdel ''_\srcdom & \kappa _\srcdom \\ & & | & \ntparent ''_\srcdom & \ntparent ''_\destdom & \kappa _\srcdom (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ & & | & \ntparent ''_\destdom & & \ctrans _{\srcdom \destdom } \\ & & | & \ntdel ''_\srcdom & & \kappa _\srcdom \nullability (\ntchild _\srcdom ) \\ \end {array} \] where $\ltrans ,\ctrans $ are the $\ndom \times \ndom $ transition matrices

\begin {eqnarray*} \ltrans _{\srcdom \destdom } & = & \kappa _\srcdom \left ( \nullability (\ntmor _\srcdom ) \delta _{\srcdom \destdom } + \nullability (\ntlinks _\srcdom ) \alpha _\srcdom \left (1 - \beta _\srcdom \left (1 - \nullability (\ntnew _\srcdom )\right )\right ) \left (1 - \tok _\srcdom \right ) \domdist _{\srcdom \destdom } \right ) \\ \ctrans _{\srcdom \destdom } & = & \kappa _\srcdom \left ( \nullability (\ntins _\srcdom ) \delta _{\srcdom \destdom } + \nullability (\ntchild _\srcdom ) (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \right ) \end {eqnarray*}

Letting $\ltransinv =(I-\ltrans )^{-1}$ and $\ctransinv =(I-\ctrans )^{-1}$ be the geometric series sums of these matrices, update the rules to sum over all $\ntlinks ''_\srcdom \to \ntlinks ''_\destdom $, $\ntchild ''_\srcdom \to \ntchild ''_\destdom $, and $\ntparent ''_\srcdom \to \ntparent ''_\destdom $ transition chains explicitly: \[ \begin {array}{lrcccl} \mbox {Symbol} & \mbox {LHS} & & \multicolumn {2}{c}{\mbox {RHS}} & \mbox {Probability} \\ \hline \mbox {\underline {\bf L}ink} & \ntlinks '''_\srcdom & \to & \ntlinks '''_\destdom & \ntmor '''_\destdom & \ltransinv _{\srcdom \destdom } \kappa _\destdom \\ & & | & \ntlinks '''_\destdom & \ntlinks '''_\bifdom & \ltransinv _{\srcdom \destdom } \kappa _\destdom \alpha _\destdom \left ( 1 - \beta _\destdom (1 - \nullability (\ntnew _\destdom )) \right ) (1 - \tok _\destdom ) \domdist _{\destdom \bifdom } \\ & & | & \ntmor '''_\destdom & & \ltransinv _{\srcdom \destdom } \kappa _\destdom \nullability (\ntlinks _\destdom ) \\ & & | & \ntnew '''_\destdom & & \ltransinv _{\srcdom \destdom } (1 - \kappa _\destdom ) \beta _\destdom \\ \mbox {\underline {\bf C}hild} & \ntchild '''_\srcdom & \to & \ntchild '''_\destdom & \ntins '''_\destdom & \ctransinv _{\srcdom \destdom } \kappa _\destdom \\ & & | & \ntchild '''_\destdom & \ntchild '''_\bifdom & \ctransinv _{\srcdom \destdom } \kappa _\destdom (1 - \tok _\destdom ) \domdist _{\destdom \bifdom } \\ & & | & \ntins '''_\destdom & & \ctransinv _{\srcdom \destdom } \kappa _\destdom \nullability (\ntchild _\destdom ) \\ \mbox {\underline {\bf P}arent} & \ntparent '''_\srcdom & \to & \ntparent '''_\destdom & \ntdel '''_\destdom & \ctransinv _{\srcdom \destdom } \kappa _\destdom \\ & & | & \ntparent '''_\destdom & \ntparent '''_\bifdom & \ctransinv _{\srcdom \destdom } \kappa _\destdom (1 - \tok _\destdom ) \domdist _{\destdom \bifdom } \\ & & | & \ntdel '''_\destdom & & \ctransinv _{\srcdom \destdom } \kappa _\destdom \nullability (\ntchild _\destdom ) \\ \end {array} \]
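As a numerical illustration of the geometric-series sums, the following minimal Python sketch accumulates $I + T + T^2 + \cdots $ for a small matrix $T$ and checks that the result inverts $I - T$. The entries of `T` are made-up stand-ins for $\ltrans $ or $\ctrans $ with $\ndom = 2$ (an assumption, not values taken from the model), and the helper names are hypothetical. The series converges only when the spectral radius of $T$ is below one.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def identity(n):
    """Return the n x n identity matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def geometric_series_sum(t, tol=1e-14, max_terms=10_000):
    """Accumulate I + T + T^2 + ... until the latest power is below tol.

    The sum equals (I - T)^{-1} when the spectral radius of T is < 1.
    """
    n = len(t)
    total = identity(n)
    power = identity(n)
    for _ in range(max_terms):
        power = mat_mul(power, t)  # next power of T
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
        if max(abs(x) for row in power for x in row) < tol:
            return total
    raise ValueError("series did not converge; is the spectral radius < 1?")


# Hypothetical 2-domain unit-transition matrix (illustrative numbers only).
T = [[0.10, 0.05],
     [0.02, 0.20]]
T_inv = geometric_series_sum(T)

# Check: (I - T) times the series sum should be the identity.
I_minus_T = [[(1.0 if i == j else 0.0) - T[i][j] for j in range(2)]
             for i in range(2)]
residual = mat_mul(I_minus_T, T_inv)
```

In practice a direct linear solve would normally be preferred; the explicit series is shown only because it mirrors the sum over chains of unit transitions described above.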

Rules for $\xstate '''_\srcdom \to \ldots $ where $\xstate \in \{ \ntmor , \ntsur , \ntexp , \ntnew \}$ are copied over from the corresponding $\xstate ''_\srcdom \to \ldots $ rules, with $\xstate ''$ changed to $\xstate '''$ on the right-hand side as well.

\[ \begin {array}{lrcccl} \mbox {Symbol} & \mbox {LHS} & & \multicolumn {2}{c}{\mbox {RHS}} & \mbox {Probability} \\ \hline \mbox {\underline {\bf M}ortal} & \ntmor '''_\srcdom & \to & \ntsur '''_\srcdom & & \alpha _\srcdom \\ & & | & \ntexp '''_\srcdom & & 1 - \alpha _\srcdom \\ \mbox {\underline {\bf S}urviving} & \ntsur '''_\srcdom & \to & \ntaln '''_\srcdom & \ntnew '''_\srcdom & \beta _\srcdom \\ & & | & \ntnew '''_\srcdom & & \beta _\srcdom \nullability (\ntaln _\srcdom ) \\ & & | & \ntaln '''_\srcdom & & 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ \mbox {\underline {\bf E}xpired} & \ntexp '''_\srcdom & \to & \ntdel '''_\srcdom & \ntnew '''_\srcdom & \gamma _\srcdom \\ & & | & \ntparent '''_\destdom & \ntnew '''_\srcdom & \gamma _\srcdom (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ & & | & \ntnew '''_\srcdom & & \gamma _\srcdom \nullability (\ntdel _\srcdom ) \\ & & | & \ntdel '''_\srcdom & & 1 - \gamma _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ & & | & \ntparent '''_\destdom & & \left ( 1 - \gamma _\srcdom (1 - \nullability (\ntnew _\srcdom )) \right ) (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \mbox {\underline {\bf N}ewborns} & \ntnew '''_\srcdom & \to & \ntins '''_\srcdom & \ntnew '''_\srcdom & \beta _\srcdom \\ & & | & \ntchild '''_\destdom & \ntnew '''_\srcdom & \beta _\srcdom (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ & & | & \ntnew '''_\srcdom & & \beta _\srcdom \nullability (\ntins _\srcdom ) \\ & & | & \ntins '''_\srcdom & & 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \\ & & | & \ntchild '''_\destdom & & \left ( 1 - \beta _\srcdom (1 - \nullability (\ntnew _\srcdom )) \right ) (1 - \tok _\srcdom ) \domdist _{\srcdom \destdom } \\ \end {array} \] We use this last opportunity to reintroduce mixtures of site and fragment classes

\[ \begin {array}{lrcccl} \mbox {Symbol} & \mbox {LHS} & & \multicolumn {2}{c}{\mbox {RHS}} & \mbox {Probability} \\ \hline \mbox {\underline {\bf A}ligned} & \ntaln '''_\srcdom & \to & \ntfrag _\frag & & \tok _\srcdom \fragdist _{\srcdom \frag } \\ \mbox {\underline {\bf I}nserted} & \ntins '''_\srcdom & \to & \ntgen _\frag & & \tok _\srcdom \fragdist _{\srcdom \frag } \\ \mbox {\underline {\bf D}eleted} & \ntdel '''_\srcdom & \to & \ntrem _\frag & & \tok _\srcdom \fragdist _{\srcdom \frag } \\ \mbox {\underline {\bf F}ragment} & \ntfrag _\frag & \to & \term _{\anctok \destok } & \ntfrag _\destfrag & \ext ^{(\srcdom )}_{\frag \destfrag } \sum _\class \classdist _{\srcdom \frag \class } \eqm ^{(\class )}_\anctok \exp (\revsub ^{(\class )} \evoltime )_{\anctok \destok }\\ & & | & \term _{\anctok \destok } & & \notext ^{(\srcdom )}_\frag \sum _\class \classdist _{\srcdom \frag \class } \eqm ^{(\class )}_\anctok \exp (\revsub ^{(\class )} \evoltime )_{\anctok \destok } \\ \mbox {\underline {\bf G}enerated} & \ntgen _\frag & \to & \term _{\gap \destok } & \ntgen _\destfrag & \ext ^{(\srcdom )}_{\frag \destfrag } \sum _\class \classdist _{\srcdom \frag \class } \eqm ^{(\class )}_\destok \\ & & | & \term _{\gap \destok } & & \notext ^{(\srcdom )}_\frag \sum _\class \classdist _{\srcdom \frag \class } \eqm ^{(\class )}_\destok \\ \mbox {\underline {\bf R}emoved} & \ntrem _\frag & \to & \term _{\anctok \gap } & \ntrem _\destfrag & \ext ^{(\srcdom )}_{\frag \destfrag } \sum _\class \classdist _{\srcdom \frag \class } \eqm ^{(\class )}_\anctok \\ & & | & \term _{\anctok \gap } & & \notext ^{(\srcdom )}_\frag \sum _\class \classdist _{\srcdom \frag \class } \eqm ^{(\class )}_\anctok \\ \end {array} \] and add the top-level rule: \[ \begin {array}{lrcccl} \mbox {Symbol} & \mbox {LHS} & & \multicolumn {2}{c}{\mbox {RHS}} & \mbox {Probability} \\ \hline \mbox {\underline {\bf B}egin} & \ntbegin & \to & \ntlinks '''_1 & & 1 \\ & & | & \epsilon & & \nullability (\ntlinks _1) \\ 
\end {array} \]

With no $\epsilon $-productions below the start symbol $\ntbegin $, and with a strictly acyclic topological sort order on nonterminals of $\ntbegin \to \ntlinks \to \ntmor \to ( \ntsur \to (\ntaln ,\ \ntnew \to \ntins \to \ntchild ),\ \ntexp \to \ntdel \to \ntparent )$, this grammar is ready for an Inside parser.

At this point our model is the desired

C.12.2 Example Two: The TKF Structure Tree (TKFST)

Consider now the RNA evolutionary model derived in (22).

That model has stem ($S$) and loop ($L$) sequences, each of which is a TKF91 sequence with its own rates. Stems are sequences of base pairs; loops are sequences of individual bases and nested stems.

The model was developed as a proof of concept. Like TKF91, it has no affine gap penalty, and it also lacks basepair stacking effects and other empirically observed features of biological RNA structures.

TKF Structure Tree singlet rules ($a$)

Sequence $a$ terminals: $\{ u, v \}$.



\[ \begin {array}{rcl|l}
\mbox {LHS} & \to & \mbox {RHS} & P(a) \\ \hline
L_{a} & \to & u\ L_{a} & \kappa _l \pi _l(u) \\
 & | & S_{a}\ L_{a} & \kappa _l \pi _l(S) \\
 & | & \epsilon & 1-\kappa _l \\
S_{a} & \to & u\ S_{a}\ v & \kappa _s \pi _s(uv) \\
 & | & L_{a} & 1-\kappa _s \\
\end {array} \]

Table C.2: TKF Structure Tree. Singlet rule-set for $a$.
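The singlet rules can be read as a generative procedure. The following minimal Python sketch samples a sequence and its dot-bracket structure from the rules $L_a \to u\,L_a \mid S_a\,L_a \mid \epsilon $ and $S_a \to u\,S_a\,v \mid L_a$. It assumes a uniform distribution over unpaired bases and over six canonical base pairs as stand-ins for $\pi _l$ and $\pi _s$; these choices, the parameter values, and all function names are illustrative assumptions rather than part of the model.

```python
import random

NUCLEOTIDES = "ACGU"
# Assumption: stems emit the six canonical pairs uniformly (stand-in for pi_s).
PAIRS = ["AU", "UA", "CG", "GC", "GU", "UG"]


def sample_loop(rng, kappa_l, pi_l_stem, kappa_s):
    """Expand L -> u L | S L | epsilon; return (sequence, dot-bracket)."""
    seq, struct = [], []
    while rng.random() < kappa_l:        # another loop link, probability kappa_l
        if rng.random() < pi_l_stem:     # link is a stem, probability pi_l(S)
            s, d = sample_stem(rng, kappa_l, pi_l_stem, kappa_s)
        else:                            # link is a single unpaired base u
            s, d = rng.choice(NUCLEOTIDES), "."
        seq.append(s)
        struct.append(d)
    return "".join(seq), "".join(struct)


def sample_stem(rng, kappa_l, pi_l_stem, kappa_s):
    """Expand S -> u S v | L; return (sequence, dot-bracket)."""
    left, right = [], []
    while rng.random() < kappa_s:        # another base pair, probability kappa_s
        u, v = rng.choice(PAIRS)
        left.append(u)
        right.append(v)
    inner_seq, inner_struct = sample_loop(rng, kappa_l, pi_l_stem, kappa_s)
    n = len(left)
    # Pairs nest, so the right-hand halves appear in reverse order.
    seq = "".join(left) + inner_seq + "".join(reversed(right))
    return seq, "(" * n + inner_struct + ")" * n


rng = random.Random(0)
samples = [sample_loop(rng, kappa_l=0.6, pi_l_stem=0.1, kappa_s=0.7)
           for _ in range(200)]
```

Each sampled structure string has the same length as its sequence and has balanced parentheses, since every `(` is produced together with its matching `)` by a single $S_a \to u\,S_a\,v$ step.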

TKF Structure Tree singlet rules ($b$)

Sequence $b$ terminals: $\{ w, x \}$.



\[ \begin {array}{rcl|l}
\mbox {LHS} & \to & \mbox {RHS} & P(b) \\ \hline
L_{b} & \to & w\ L_{b} & \kappa _l \pi _l(w) \\
 & | & S_{b}\ L_{b} & \kappa _l \pi _l(S) \\
 & | & \epsilon & 1-\kappa _l \\
S_{b} & \to & w\ S_{b}\ x & \kappa _s \pi _s(wx) \\
 & | & L_{b} & 1-\kappa _s \\
\end {array} \]

Table C.3: TKF Structure Tree. Singlet rule-set for $b$.

TKF Structure Tree pair rules ($a \stackrel {t}{\to } b$)

Sequence $a$ terminals: $\{ u, v \}$. Sequence $b$ terminals: $\{ w, x \}$.



\[ \begin {array}{rcl|ll}
\mbox {LHS} & \to & \mbox {RHS} & P(a) & P(b|a) \\ \hline
L_{ab} & \to & u\ w\ L_{ab} & \kappa _l \pi _l(u) & (1-\beta _l) \alpha _l M_l(u,w) \\
 & | & w\ L_{ab} & 1 & \beta _l \pi _l(w) \\
 & | & u\ L_{a\gap b} & \kappa _l \pi _l(u) & (1-\beta _l) (1-\alpha _l) \\
 & | & S_{ab}\ L_{ab} & \kappa _l \pi _l(S) & (1-\beta _l) \alpha _l \\
 & | & S_{b}\ L_{ab} & 1 & \beta _l \pi _l(S) \\
 & | & S_{a}\ L_{a\gap b} & \kappa _l \pi _l(S) & (1-\beta _l) (1-\alpha _l) \\
 & | & \epsilon & 1-\kappa _l & 1-\beta _l \\
S_{ab} & \to & u\ w\ S_{ab}\ x\ v & \kappa _s \pi _s(uv) & (1-\beta _s) \alpha _s M_s(uv,wx) \\
 & | & w\ S_{ab}\ x & 1 & \beta _s \pi _s(wx) \\
 & | & u\ S_{a\gap b}\ v & \kappa _s \pi _s(uv) & (1-\beta _s) (1-\alpha _s) \\
 & | & L_{ab} & 1-\kappa _s & 1-\beta _s \\
L_{a\gap b} & \to & u\ w\ L_{ab} & \kappa _l \pi _l(u) & (1-\gamma _l) \alpha _l M_l(u,w) \\
 & | & w\ L_{ab} & 1 & \gamma _l \pi _l(w) \\
 & | & u\ L_{a\gap b} & \kappa _l \pi _l(u) & (1-\gamma _l) (1-\alpha _l) \\
 & | & S_{ab}\ L_{ab} & \kappa _l \pi _l(S) & (1-\gamma _l) \alpha _l \\
 & | & S_{b}\ L_{ab} & 1 & \gamma _l \pi _l(S) \\
 & | & S_{a}\ L_{a\gap b} & \kappa _l \pi _l(S) & (1-\gamma _l) (1-\alpha _l) \\
 & | & \epsilon & 1-\kappa _l & 1-\gamma _l \\
S_{a\gap b} & \to & u\ w\ S_{ab}\ x\ v & \kappa _s \pi _s(uv) & (1-\gamma _s) \alpha _s M_s(uv,wx) \\
 & | & w\ S_{ab}\ x & 1 & \gamma _s \pi _s(wx) \\
 & | & u\ S_{a\gap b}\ v & \kappa _s \pi _s(uv) & (1-\gamma _s) (1-\alpha _s) \\
 & | & L_{ab} & 1-\kappa _s & 1-\gamma _s \\
\end {array} \]

Table C.4: TKF Structure Tree. Pair rule-set for $a \stackrel {t}{\to } b$ branch. Requires singlet rule-sets for $a$ and $b$.
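Each pair rule has joint weight $P(a)\,P(b|a)$. As a sanity check, the minimal Python sketch below sums these joint weights over all $L_{ab}$ rules, grouped by rule class and with the emitted symbols already summed out. It assumes that $\pi _l$ sums to one over bases plus the stem type $S$ and that each row of $M_l$ sums to one; the function name and the numeric values are hypothetical.

```python
def loop_pair_rule_mass(kappa, alpha, beta):
    """Total joint weight P(a) * P(b|a) of each class of L_ab rule.

    Emissions are summed out, assuming pi_l sums to one over
    {bases, S} and every row of M_l sums to one.
    """
    return {
        # u w L_ab  and  S_ab L_ab
        "match": kappa * (1 - beta) * alpha,
        # w L_ab  and  S_b L_ab  (P(a) = 1 for inserted links)
        "insert": 1.0 * beta,
        # u L_{a gap b}  and  S_a L_{a gap b}
        "delete": kappa * (1 - beta) * (1 - alpha),
        # epsilon
        "end": (1 - kappa) * (1 - beta),
    }


# Illustrative values only.
mass = loop_pair_rule_mass(kappa=0.8, alpha=0.9, beta=0.05)
total = sum(mass.values())
```

The four classes sum to one for any values in $[0,1]$, because $(1-\beta )(\kappa \alpha + \kappa (1-\alpha ) + 1 - \kappa ) + \beta = 1$. The same check applies to the $L_{a\gap b}$ rules with $\gamma _l$ passed in place of $\beta _l$.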

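Parameters Let $\overline {xyz}$ denote the reverse complement of $xyz$, e.g. $\overline {AAG}=CTT$. In the emission columns below, $q_G(xyz)$ and $q_S(xy)$ abbreviate $q_G(x_1 x_2 x_3)$ and $q_S(x_1 x_2)$.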

Parameters: insertion and deletion rates $\lambda _F < \mu _F$, fragment extension probability $r_F$, substitution rate matrix $Q_F$, and equilibrium probability vector $q_F$ (so $q_F Q_F = 0$) for $F \in \{ R, N, G, S, L, C \}$. Splice donor/acceptor site distributions $q_{D1}$, $q_{D2}$, $q_{A1}$, $q_{A2}$. Region-type probabilities $p_G + p_N + p_S + p_C = 1$. Intron probability $p_I$.

The $Q_N$, $Q_S$, $Q_L$ and $Q_C$ models should be strand-invariant, so e.g. $Q_N(x_1,x_2) = Q_N(\overline {x_1},\overline {x_2})$.
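To make the strand-invariance condition concrete for the single-nucleotide case, the following minimal Python sketch implements the reverse complement and tests whether a rate matrix satisfies $Q(x_1,x_2) = Q(\overline {x_1},\overline {x_2})$. The example matrix is arbitrary, and the averaging step `strand_symmetrize` is only one simple way of obtaining an invariant matrix; it is an illustrative assumption, not a procedure prescribed by the model. For base-pair models such as $Q_S$ the same idea would apply over dinucleotide states.

```python
ALPHABET = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse the sequence and complement each base, e.g. AAG -> CTT."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def make_rate_matrix(off_diagonal):
    """Fill in the diagonal so that every row of Q sums to zero."""
    q = {x: dict(off_diagonal[x]) for x in ALPHABET}
    for x in ALPHABET:
        q[x][x] = -sum(off_diagonal[x].values())
    return q


def is_strand_invariant(q, tol=1e-12):
    """Check Q(x1, x2) == Q(comp(x1), comp(x2)) for single nucleotides."""
    return all(abs(q[x][y] - q[COMPLEMENT[x]][COMPLEMENT[y]]) <= tol
               for x in ALPHABET for y in ALPHABET)


def strand_symmetrize(q):
    """Average Q with its strand-swapped copy (illustrative choice)."""
    return {x: {y: 0.5 * (q[x][y] + q[COMPLEMENT[x]][COMPLEMENT[y]])
                for y in ALPHABET}
            for x in ALPHABET}


# Arbitrary, deliberately strand-asymmetric off-diagonal rates.
off_diagonal = {x: {y: 0.1 * (1 + ALPHABET.index(x) + 2 * ALPHABET.index(y))
                    for y in ALPHABET if y != x}
                for x in ALPHABET}
Q_raw = make_rate_matrix(off_diagonal)
Q_sym = strand_symmetrize(Q_raw)
```

For a single nucleotide the reverse complement reduces to the complement, which is why `is_strand_invariant` only needs the `COMPLEMENT` map.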

Functions For $F \in \{ R, N, G, S, L, C \}$:

\begin {eqnarray*} \kappa _F & = & \left ( 1 - \frac {\lambda _F}{\mu _F} \right ) \left ( 1 - r_F \right ) \end {eqnarray*}
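A minimal Python sketch of this function follows. In the stationary grammar, recursive rules carry weight $1 - \kappa _F$ and terminating rules carry weight $\kappa _F$, so the number of times the recursive rule fires is geometric with mean $(1-\kappa _F)/\kappa _F$. The function names and example values are hypothetical.

```python
def termination_probability(ins_rate, del_rate, ext_prob):
    """kappa_F = (1 - lambda_F / mu_F) * (1 - r_F)."""
    if not 0 <= ins_rate < del_rate:
        raise ValueError("requires 0 <= lambda_F < mu_F")
    if not 0 <= ext_prob < 1:
        raise ValueError("requires 0 <= r_F < 1")
    return (1 - ins_rate / del_rate) * (1 - ext_prob)


def expected_repeats(kappa):
    """Mean number of recursive steps (weight 1 - kappa each) before
    the terminating rule (weight kappa) fires."""
    return (1 - kappa) / kappa


# Illustrative values: lambda = 1, mu = 2, r = 0.5.
kappa_example = termination_probability(1.0, 2.0, 0.5)
mean_links = expected_repeats(kappa_example)
```

With these example values $\kappa _F = 0.25$, so the recursive rule is expected to fire three times.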

The Stationary Grammar \[ \begin {array}{rcl|ll} \mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline \nt {GENOME} & \to & \nt {REGION} \nt {GENOME} & 1 - \kappa _R \\ & | & \epsilon & \kappa _R \\ \nt {REGION} & \to & \nt {INTER} & p_N \\ & | & \nt {FWDCDS} & p_G / 2 \\ & | & \nt {REVCDS} & p_G / 2 \\ & | & \nt {STRUCT} & p_S \\ & | & \nt {CONS} & p_C \\ \nt {INTER} & \to & x_1\ \nt {INTER} & 1 - \kappa _N & q_N(x_1) \\ & | & \epsilon & \kappa _N \\ \nt {FWDCDS} & \to & \nt {FWDCOD} \nt {FWDCDS} & 1 - \kappa _G \\ & | & \epsilon & \kappa _G \\ \nt {FWDCOD} & \to & x_1\ x_2\ x_3 & 1 - p_I & q_G(xyz) \\ & | & x_1\ x_2\ x_3\ \nt {FWDINT} & p_I / 3 & q_G(xyz) \\ & | & x_1\ x_2\ \nt {FWDINT} x_3 & p_I / 3 & q_G(xyz) \\ & | & x_1\ \nt {FWDINT} x_2\ x_3 & p_I / 3 & q_G(xyz) \\ \nt {FWDINT} & \to & x_1\ x_2\ \nt {GENOME} x_3\ x_4 & 1 & q_{D1}(x_1) q_{D2}(x_2) q_{A1}(x_3) q_{A2}(x_4) \\ \nt {REVCDS} & \to & \nt {REVCDS} \nt {REVCOD} & 1 - \kappa _G \\ & | & \epsilon & \kappa _G \\ \nt {REVCOD} & \to & x_1\ x_2\ x_3 & 1 - p_I & q_G(\overline {xyz}) \\ & | & x_1\ x_2\ x_3\ \nt {REVINT} & p_I / 3 & q_G(\overline {xyz}) \\ & | & x_1\ x_2\ \nt {REVINT} x_3 & p_I / 3 & q_G(\overline {xyz}) \\ & | & x_1\ \nt {REVINT} x_2\ x_3 & p_I / 3 & q_G(\overline {xyz}) \\ \nt {REVINT} & \to & x_1\ x_2\ \nt {GENOME} x_3\ x_4 & 1 & q_{D1}(\overline {x_4}) q_{D2}(\overline {x_3}) q_{A1}(\overline {x_2}) q_{A2}(\overline {x_1}) \\ \nt {STRUCT} & \to & x_1\ \nt {STRUCT} x_2 & 1 - \kappa _S & q_S(xy) \\ & | & \nt {LOOP} & \kappa _S \\ \nt {LOOP} & \to & x_1\ \nt {LOOP} & 1 - \kappa _L & q_L(x_1) \\ & | & \epsilon & \kappa _L \\ \nt {CONS} & \to & x_1\ \nt {CONS} & 1 - \kappa _C & q_C(x_1) \\ & | & \epsilon & \kappa _C \\ \end {array} \]

The Joint Finite-Time Grammar The general rules for forming the joint grammar from the conditional grammar are as follows. For every nonterminal of the following form (here $\ntf {L}$, $\ntf {R}$, and/or $\ntf {E}$ are allowed to be $\epsilon $) \[ \begin {array}{rcl|ll} \mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline \nt {F} & \to & \nt {L} \nt {F} \nt {R} & 1 - \kappa _F & q_F(LR) \\ & | & \nt {E} & \kappa _F \\ \end {array} \] ...that is, for $\ntf {GENOME}$, $\ntf {INTER}$, $\ntf {FWDCDS}$, $\ntf {REVCDS}$, $\ntf {STRUCT}$, $\ntf {LOOP}$, and $\ntf {CONS}$ (with $F \in \{ R, N, G, S, L, C \}$), replace these rules with \[ \begin {array}{rcl|ll} \mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline \ntm {F} & \to & \ntm {L} \ntm {F} \ntm {R} & (1 - \beta _{M,F}) (1 - \kappa _F) \alpha _F & q_F(LR_x) \exp (Q_F t)(LR_x,LR_y) \\ & | & \nty {L} \ntm {F} \nty {R} & \beta _{M,F} & q_F(LR_y) \\ & | & \ntx {L} \ntd {F} \ntx {R} & (1 - \beta _{M,F}) (1 - \kappa _F) (1 - \alpha _F) & q_F(LR_x) \\ & | & \ntm {E} & (1 - \beta _{M,F}) \kappa _F \\ \ntd {F} & \to & \ntm {L} \ntm {F} \ntm {R} & (1 - \beta _{D,F}) (1 - \kappa _F) \alpha _F & q_F(LR_x) \exp (Q_F t)(LR_x,LR_y) \\ & | & \nty {L} \ntm {F} \nty {R} & \beta _{D,F} & q_F(LR_y) \\ & | & \ntx {L} \ntd {F} \ntx {R} & (1 - \beta _{D,F}) (1 - \kappa _F) (1 - \alpha _F) & q_F(LR_x) \\ & | & \ntm {E} & (1 - \beta _{D,F}) \kappa _F \\ \end {array} \] ...that is, two versions $\ntm {F}$ and $\ntd {F}$, with different $\beta $’s for each type.

For every other nonterminal $\ntf {N}$, there need to be three versions $\ntm {N}$, $\ntx {N}$ and $\nty {N}$, with outgoing rules for each type going to other nonterminals of the same type.

Top level. \[ \begin {array}{rcl|ll} \mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline \ntk {GENOME} & \to & \ntm {REGION} \ntm {GENOME} & (1 - \beta _{k,R}) (1 - \kappa _R) \alpha _R \\ & | & \nty {REGION} \ntm {GENOME} & \beta _{k,R} \\ & | & \ntx {REGION} \ntd {GENOME} & (1 - \beta _{k,R}) (1 - \kappa _R) (1 - \alpha _R) \\ & | & \epsilon & (1 - \beta _{k,R}) \kappa _R \\ \ntj {REGION} & \to & \ntj {INTER} & p_N \\ & | & \ntj {FWDCDS} & p_G / 2 \\ & | & \ntj {REVCDS} & p_G / 2 \\ & | & \ntj {STRUCT} & p_S \\ & | & \ntj {CONS} & p_C \\ \nt {INTER} & \to & x_1\ \nt {INTER} & 1 - \kappa _N & q_N(x_1) \\ & | & \epsilon & \kappa _N \\ \end {array} \]

Coding sequences. \[ \begin {array}{rcl|ll} \mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline \nt {FWDCDS} & \to & \nt {FWDCOD} \nt {FWDCDS} & 1 - \kappa _G \\ & | & \epsilon & \kappa _G \\ \nt {FWDCOD} & \to & x_1\ x_2\ x_3 & 1 - p_I & q_G(xyz) \\ & | & x_1\ x_2\ x_3\ \nt {FWDINT} & p_I / 3 & q_G(xyz) \\ & | & x_1\ x_2\ \nt {FWDINT} x_3 & p_I / 3 & q_G(xyz) \\ & | & x_1\ \nt {FWDINT} x_2\ x_3 & p_I / 3 & q_G(xyz) \\ \nt {FWDINT} & \to & x_1\ x_2\ \nt {GENOME} x_3\ x_4 & 1 & q_{D1}(x_1) q_{D2}(x_2) q_{A1}(x_3) q_{A2}(x_4) \\ \end {array} \]

\[ \begin {array}{rcl|ll} \mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline \nt {REVCDS} & \to & \nt {REVCDS} \nt {REVCOD} & 1 - \kappa _G \\ & | & \epsilon & \kappa _G \\ \nt {REVCOD} & \to & x_1\ x_2\ x_3 & 1 - p_I & q_G(\overline {xyz}) \\ & | & x_1\ x_2\ x_3\ \nt {REVINT} & p_I / 3 & q_G(\overline {xyz}) \\ & | & x_1\ x_2\ \nt {REVINT} x_3 & p_I / 3 & q_G(\overline {xyz}) \\ & | & x_1\ \nt {REVINT} x_2\ x_3 & p_I / 3 & q_G(\overline {xyz}) \\ \nt {REVINT} & \to & x_1\ x_2\ \nt {GENOME} x_3\ x_4 & 1 & q_{D1}(\overline {x_4}) q_{D2}(\overline {x_3}) q_{A1}(\overline {x_2}) q_{A2}(\overline {x_1}) \\ \end {array} \]

RNA structures. \[ \begin {array}{rcl|ll} \mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline \nt {STRUCT} & \to & x_1\ \nt {STRUCT} x_2 & 1 - \kappa _S & q_S(xy) \\ & | & \nt {LOOP} & \kappa _S \\ \nt {LOOP} & \to & x_1\ \nt {LOOP} & 1 - \kappa _L & q_L(x_1) \\ & | & \epsilon & \kappa _L \\ \nt {CONS} & \to & x_1\ \nt {CONS} & 1 - \kappa _C & q_C(x_1) \\ & | & \epsilon & \kappa _C \\ \end {array} \]

C.12.3 Example Three: The TKF Basepair Stack (TKFStack)

The simple TKF Structure Tree models RNA secondary structure with alternating stems (basepair sequences) and loops (single-nucleotide sequences). While a useful proof of concept, it lacks basepair stacking, multiloop junctions, bulges, and internal loops—features critical for realistic RNA structure modeling.

We now define an enhanced stem-loop grammar that incorporates these features while remaining within the TKF evolutionary framework. We then show how evolution elaboration, profile SCFG construction, and the triplet model for progressive reconstruction all apply to this grammar.

Parameters The grammar has the following parameters:

Stem link TKF91 rates $\insrate _S < \delrate _S$, giving $\kappa _S = \insrate _S / \delrate _S$.
Loop link TKF91 rates $\insrate _L < \delrate _L$, giving $\kappa _L = \insrate _L / \delrate _L$.
Bulge extension probability $\ext _B$.
Stacked-pair fragment extension probability $\ext _K$.
Stem link type probabilities $p_{\mathrm {bp}} + p_{\mathrm {st}} + p_{\mathrm {bu}} = 1$.
Loop link type probabilities $p_{\mathrm {lf}} + p_{\mathrm {rf}} + p_{\mathrm {sl}} = 1$.
Nucleotide equilibrium $\eqm (c)$ over $\alphabet = \{A,C,G,U\}$, with rate matrix $\exch $.
Single basepair equilibrium $\eqm _{\mathrm {bp}}(c_L, c_R)$ over $|\alphabet |^2 = 16$ states, rate matrix $\exch _{\mathrm {bp}}$.
Closing basepair equilibrium $\eqm _{\mathrm {cl}}(c_L, c_R)$ over $16$ states, rate matrix $\exch _{\mathrm {cl}}$.
Stacked-pair equilibrium $\eqm _K(c_L^1, c_L^2, c_R^2, c_R^1)$ over the $6^2 = 36$ canonical stacked pairs, rate matrix $\exch _K$. A stacked pair consists of two consecutive canonical basepairs $(c_L^1, c_R^1)$ and $(c_L^2, c_R^2)$; the state space is restricted to the six Watson-Crick and wobble pairs (AU, CG, GC, UA, GU, UG) for each position, giving 36 states rather than the full $16^2 = 256$.
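The 36-state stacked-pair space can be enumerated directly. The minimal Python sketch below builds the states in the argument order $(c_L^1, c_L^2, c_R^2, c_R^1)$ used by $\eqm _K$; the function and constant names are hypothetical.

```python
from itertools import product

# The six canonical pairs, each written as (left base, right base).
CANONICAL_PAIRS = ["AU", "CG", "GC", "UA", "GU", "UG"]


def stacked_pair_states():
    """Enumerate stacked pairs as tuples (cL1, cL2, cR2, cR1).

    The outer pair is (cL1, cR1) and the inner pair is (cL2, cR2);
    both are restricted to the canonical pairs.
    """
    states = []
    for (l1, r1), (l2, r2) in product(CANONICAL_PAIRS, repeat=2):
        states.append((l1, l2, r2, r1))
    return states


STATES = stacked_pair_states()
```

A rate matrix $\exch _K$ over this space would therefore be $36 \times 36$, compared with $256 \times 256$ for unrestricted stacked pairs.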

Nonterminals and Productions A stem-loop consists of a stem (nested basepairs with decorations), a closing basepair, and a loop (with possible multiloop branches). The start symbol is $\mathsf {SL}$ (stem-loop).

Stem-loop: \begin {align} \mathsf {SL} &\to \mathsf {STEM} && \text {weight } 1 \label {eq:sl} \end {align}

Stem (TKF91 sequence of stem-links, from outer to inner): \begin {align} \mathsf {STEM} &\to c_L\;\mathsf {STEM}\;c_R && \kappa _S\, p_{\mathrm {bp}}\, \eqm _{\mathrm {bp}}(c_L, c_R) && \text {[single basepair, LR]} \label {eq:stem-bp} \\ &\to c_L^1\, c_L^2\;\mathsf {STEM}\;c_R^2\, c_R^1 && \kappa _S\, p_{\mathrm {st}}\, (1-\ext _K)\, \eqm _K(\cdot ) && \text {[terminal stacked pair, LLRR]} \label {eq:stem-stack-term} \\ &\to c_L^1\, c_L^2\;\mathsf {STACK}\;c_R^2\, c_R^1 && \kappa _S\, p_{\mathrm {st}}\, \ext _K\, \eqm _K(\cdot ) && \text {[extended stacked pair, LLRR]} \label {eq:stem-stack-ext} \\ &\to \mathsf {LDECO}\;\mathsf {STEM}\;\mathsf {RDECO} && \kappa _S\, p_{\mathrm {bu}} && \text {[bulge]} \label {eq:stem-bulge} \\ &\to \mathsf {CLOSE}\;\mathsf {LOOP} && 1 - \kappa _S && \text {[end stem]} \label {eq:stem-end} \end {align}

A bulge link \eqref {eq:stem-bulge} represents an internal loop, a single-sided bulge, or a multihelix junction branch between basepairs. All non-basepair content between consecutive basepairs is consolidated into a single TKF link type whose internal structure is governed by $\mathsf {LDECO}$ and $\mathsf {RDECO}$.

Bulge decorations (L-side and R-side content): \begin {align} \mathsf {LDECO} &\to \mathsf {LFRAG}\;\mathsf {LDECO} && \text {[L-fragment, then more L-content]} \label {eq:ldeco-lfrag} \\ &\to \mathsf {SL}\;\mathsf {LDECO} && \text {[left branch: nested stem-loop]} \label {eq:ldeco-branch} \\ &\to \epsilon && \text {[end L-decorations]} \label {eq:ldeco-end} \\[6pt] \mathsf {RDECO} &\to \mathsf {RDECO}\;\mathsf {RFRAG} && \text {[R-fragment, more R-content]} \label {eq:rdeco-rfrag} \\ &\to \mathsf {RDECO}\;\mathsf {SL} && \text {[right branch: nested stem-loop]} \label {eq:rdeco-branch} \\ &\to \epsilon && \text {[end R-decorations]} \label {eq:rdeco-end} \end {align}

$\mathsf {LDECO}$ generates all content on the $5'$ side (between the outer basepair’s L-half and the continuation), while $\mathsf {RDECO}$ generates all content on the $3'$ side (between the continuation and the outer basepair’s R-half). An internal loop has both $\mathsf {LDECO} \neq \epsilon $ and $\mathsf {RDECO} \neq \epsilon $; a single-sided bulge has content on only one side; a multihelix junction branch has $\mathsf {SL}$ nested within $\mathsf {LDECO}$ or $\mathsf {RDECO}$.

Stacked-pair fragment continuation (LLRR emission): \begin {align} \mathsf {STACK} &\to c_L^1\, c_L^2\;\mathsf {STEM}\;c_R^2\, c_R^1 && (1-\ext _K)\, \eqm _K(\cdot ) && \text {[terminal]} \label {eq:stack-term} \\ &\to c_L^1\, c_L^2\;\mathsf {STACK}\;c_R^2\, c_R^1 && \ext _K\, \eqm _K(\cdot ) && \text {[extend]} \label {eq:stack-ext} \end {align}

Closing basepair (LR emission): \begin {align} \mathsf {CLOSE} &\to c_L\;\mathsf {CLOSE}'\;c_R && \eqm _{\mathrm {cl}}(c_L, c_R) \label {eq:close} \end {align}

where $\mathsf {CLOSE}'$ is a unit nonterminal ($\mathsf {CLOSE}' \to \epsilon $, weight $1$) that serves as a placeholder for the inside of the closing basepair.

Loop (TKF91 sequence of loop-links): \begin {align} \mathsf {LOOP} &\to \mathsf {LOOPLINK}\;\mathsf {LOOP} && \kappa _L && \text {[add loop link]} \label {eq:loop-link} \\ &\to \epsilon && 1 - \kappa _L && \text {[end loop]} \label {eq:loop-end} \end {align}

Loop link types: \begin {align} \mathsf {LOOPLINK} &\to \mathsf {LFRAG} && p_{\mathrm {lf}} && \text {[L-fragment]} \label {eq:loop-lfrag} \\ &\to \mathsf {RFRAG} && p_{\mathrm {rf}} && \text {[R-fragment]} \label {eq:loop-rfrag} \\ &\to \mathsf {SL} && p_{\mathrm {sl}} && \text {[nested stem-loop (multiloop)]} \label {eq:loop-sl} \end {align}

Unpaired nucleotide fragments in loops (geometric length $\geq 1$): \begin {align} \mathsf {LFRAG} &\to c\;\mathsf {LFRAG} && \ext _L\, \eqm (c) && \text {[extend, L-emission]} \label {eq:lfrag-ext} \\ &\to c && (1-\ext _L)\, \eqm (c) && \text {[terminal, L-emission]} \label {eq:lfrag-term} \\[6pt] \mathsf {RFRAG} &\to \mathsf {RFRAG}\;c && \ext _R\, \eqm (c) && \text {[extend, R-emission]} \label {eq:rfrag-ext} \\ &\to c && (1-\ext _R)\, \eqm (c) && \text {[terminal, R-emission]} \label {eq:rfrag-term} \end {align}

Emission Types and Span Tracking The grammar has four emission patterns, each determining how terminals consume positions from the span $[i,j]$ of the input sequence:

1.: L-emission ($c$): consumes from the left end, advancing $i \to i+1$. Used by $\mathsf {LFRAG}$, $\mathsf {LDECO}$ content.
2.: R-emission ($c$): consumes from the right end, retreating $j \to j-1$. Used by $\mathsf {RFRAG}$, $\mathsf {RDECO}$ content.
3.: LR-emission ($c_L, c_R$): consumes from both ends simultaneously, $i \to i+1$, $j \to j-1$. Used by $\mathsf {STEM}$ (single basepair), $\mathsf {CLOSE}$.
4.: LLRR-emission ($c_L^1, c_L^2, c_R^2, c_R^1$): consumes two from each end, $i \to i+2$, $j \to j-2$. Used by $\mathsf {STEM}$ (stacked pair), $\mathsf {STACK}$. Each LLRR unit represents a stacked dinucleotide pair: two consecutive canonical basepairs $(c_L^1, c_R^1)$ and $(c_L^2, c_R^2)$ treated as a single evolutionary unit with nearest-neighbor stacking energy. The state space is restricted to the $6 \times 6 = 36$ combinations of canonical (Watson-Crick and wobble) pairs.
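The four emission patterns can be sketched as a single span-update function. The minimal Python sketch below treats $[i,j]$ as an inclusive span and returns the characters consumed at each end together with the inner span. The function name, the width table, and the example sequence are hypothetical.

```python
# Number of positions consumed at the (left, right) end of the span.
EMISSION_WIDTHS = {"L": (1, 0), "R": (0, 1), "LR": (1, 1), "LLRR": (2, 2)}


def consume(seq, i, j, emission):
    """Apply one emission to the inclusive span [i, j] of seq.

    Returns (left_chars, right_chars, (new_i, new_j)). Right-hand
    characters are returned in sequence order, so for LLRR they read
    cR2 then cR1.
    """
    n_left, n_right = EMISSION_WIDTHS[emission]
    if n_left + n_right > j - i + 1:
        raise ValueError("span too short for this emission type")
    left = seq[i:i + n_left]
    right = seq[j - n_right + 1:j + 1]
    return left, right, (i + n_left, j - n_right)


seq = "GGAAACC"
step1 = consume(seq, 0, 6, "LLRR")  # stacked pair at both ends
step2 = consume(seq, 2, 4, "L")     # unpaired base on the left
step3 = consume(seq, 3, 4, "R")     # unpaired base on the right
```

After these three steps only position 3 remains, so a further LR-emission would be rejected because it needs two positions.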

Remark C.28 (L/R distinction in terminals vs. productions). The terminal alphabet is $\alphabet = \{A,C,G,U\}$ without L/R copies: both L-emission and R-emission produce characters from the same alphabet, differing only in which end of the span $[i,j]$ is consumed. At leaf nodes, the observed sequence carries no L/R annotation.

However, the L/R distinction is essential at the production level and therefore in the profile SCFG. Each nonterminal instance in a profile has a fixed emission direction (L, R, LR, or LLRR), determined by the parse tree from which it was extracted—analogous to the MATL, MATR, and MATP node types in Infernal covariance models (38). The branch PTT must preserve emission-direction compatibility when mapping parent productions to child productions. Consequently, a profile position that is LR-emitting (basepaired) at one node of the phylogeny may correspond to separate L-emitting and R-emitting positions (unpaired) at another node, reflecting structural change along the evolutionary lineage.

Remark C.29 (Structural interpretation). Consider a stem with links (from outer to inner): basepair $(c_L^1, c_R^1)$, bulge (left-side $b_1 b_2$, right-side branch $\mathsf {SL}'$), basepair $(c_L^2, c_R^2)$. The yield is: \[ c_L^1\; b_1\; b_2\; c_L^2\; [\text {loop}]\; c_R^2\; [\text {yield}(\mathsf {SL}')]\; c_R^1 \] The bulge nucleotides $b_1, b_2$ appear on the $5'$ side (via $\mathsf {LDECO}$), while the branch structure appears on the $3'$ side (via $\mathsf {RDECO}$). This represents an internal loop with unpaired nucleotides on the $5'$ strand and a branching sub-structure on the $3'$ strand, a configuration that the consolidated bulge link type \eqref {eq:stem-bulge} captures as a single TKF event.

The Pair Grammar via Evolution Elaboration Applying the evolution elaboration rules (Appendix C.11) to the singlet grammar produces the pair grammar for an ancestor–descendant pair separated by evolutionary time $\evoltime $. Each nonterminal participating in a TKF91 link sequence ($\mathsf {STEM}$, $\mathsf {LOOP}$) gains $\mat /\ins /\del $ versions with the standard TKF91 transition weights $(\alpha , \beta , \gamma )$ derived from $(\insrate , \delrate , \evoltime )$.
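For reference, the minimal Python sketch below evaluates the TKF91 weights from $(\insrate , \delrate , \evoltime )$ using the usual closed forms $\alpha = e^{-\delrate \evoltime }$, $\beta = \insrate (1 - e^{(\insrate - \delrate )\evoltime }) / (\delrate - \insrate e^{(\insrate - \delrate )\evoltime })$ and $\gamma = 1 - \delrate \beta / (\insrate (1 - \alpha ))$. It assumes that these match the definitions used elsewhere in this appendix, which take precedence if they differ; the function name and example values are hypothetical.

```python
import math


def tkf91_weights(ins_rate, del_rate, t):
    """Return (alpha, beta, gamma, kappa) for TKF91 rates and time t.

    alpha: probability an ancestral link survives
    beta:  probability of a further insertion after a match or insert
    gamma: probability of an insertion after a deletion
    kappa: ins_rate / del_rate, the singlet continuation probability
    """
    if not 0 < ins_rate < del_rate:
        raise ValueError("requires 0 < ins_rate < del_rate")
    if t <= 0:
        raise ValueError("requires t > 0")
    alpha = math.exp(-del_rate * t)
    e = math.exp((ins_rate - del_rate) * t)
    beta = ins_rate * (1 - e) / (del_rate - ins_rate * e)
    gamma = 1 - del_rate * beta / (ins_rate * (1 - alpha))
    kappa = ins_rate / del_rate
    return alpha, beta, gamma, kappa


# Illustrative values only.
alpha_S, beta_S, gamma_S, kappa_S = tkf91_weights(1.0, 2.0, 0.5)
```

Two limits give a quick check of the formulas: for small $\evoltime $, $\beta \approx \insrate \evoltime $, and for large $\evoltime $, $\beta $ approaches $\insrate / \delrate $.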

Stem Link Elaboration. The stem is a TKF91 sequence of links with rates $(\insrate _S, \delrate _S)$. Each $\mathsf {STEM}$ nonterminal becomes $\mathsf {STEM}_\mat $ (post-match/insert) and $\mathsf {STEM}_\del $ (post-delete):

\begin {align} \mathsf {STEM}_\mat &\to c_L^x\, c_L^y\;\mathsf {STEM}_\mat \;c_R^y\, c_R^x && (1-\beta _S) \kappa _S\, p_{\mathrm {bp}}\, \alpha _S\, \eqm _{\mathrm {bp}}(c_L^x, c_R^x)\, P_{\mathrm {bp}}(c_L^y, c_R^y | c_L^x, c_R^x) && \text {[BP match]} \label {eq:pair-bp-match} \\ &\to c_L^y\;\mathsf {STEM}_\mat \;c_R^y && \beta _S\, p_{\mathrm {bp}}\, \eqm _{\mathrm {bp}}(c_L^y, c_R^y) && \text {[BP insert]} \label {eq:pair-bp-ins} \\ &\to c_L^x\;\mathsf {STEM}_\del \;c_R^x && (1-\beta _S) \kappa _S\, p_{\mathrm {bp}}\, (1-\alpha _S)\, \eqm _{\mathrm {bp}}(c_L^x, c_R^x) && \text {[BP delete]} \label {eq:pair-bp-del} \end {align}

with analogous rules for $\mathsf {STEM}_\del $ using $\gamma _S$ in place of $\beta _S$.

For matched single basepairs \eqref {eq:pair-bp-match}, the nesting is $c_L^x\, c_L^y\;\cdots \;c_R^y\, c_R^x$: ancestor terminals on the outside, descendant terminals on the inside, preserving palindromic structure. For inserted basepairs \eqref {eq:pair-bp-ins}, only descendant terminals appear (LR emission). For deleted basepairs \eqref {eq:pair-bp-del}, only ancestor terminals appear (LR emission).

Stacked-Pair Elaboration. Matched stacked pairs have LLRR emission on both ancestor and descendant, yielding an $L^4R^4$ nesting pattern in the pair grammar: \begin {align} \mathsf {STEM}_\mat &\to c_{L1}^x\, c_{L2}^x\, c_{L1}^y\, c_{L2}^y\; \mathsf {STEM}_\mat \; c_{R2}^y\, c_{R1}^y\, c_{R2}^x\, c_{R1}^x \notag \\ &\qquad \qquad (1-\beta _S) \kappa _S\, p_{\mathrm {st}}\, (1-\ext _K)\, \alpha _S\, \eqm _K(\cdot )\, P_K(\cdot |\cdot ) \label {eq:pair-stack-match} \end {align}

with inserted stacked pairs emitting only descendant LLRR, and deleted stacked pairs emitting only ancestor LLRR.

Bulge Elaboration. A bulge link (??) gains $\mat /\ins /\del $ versions like any other stem link. A matched bulge has the form $\mathsf {LDECO}_\mat \;\mathsf {STEM}_\mat \;\mathsf {RDECO}_\mat $, with each side elaborated independently.

Within $\mathsf {LDECO}$, each sub-element ($\mathsf {LFRAG}$, nested $\mathsf {SL}$) gains $\mat /\ins /\del $ versions with L-emission direction preserved. For example, a matched left-branch within a bulge produces: \begin {align} \mathsf {LDECO}_\mat &\to \mathsf {SL}_\mat \;\mathsf {LDECO}_\mat && \alpha _{\mathrm {deco}}\, P(\text {branch}) && \text {[matched left branch]} \label {eq:pair-ldeco-match} \\ &\to \mathsf {SL}_\ins \;\mathsf {LDECO}_\mat && \beta _{\mathrm {deco}} && \text {[inserted left branch]} \label {eq:pair-ldeco-ins} \\ &\to \mathsf {SL}_\del \;\mathsf {LDECO}_\del && (1-\alpha _{\mathrm {deco}})\, P(\text {branch}) && \text {[deleted left branch]} \label {eq:pair-ldeco-del} \end {align}

Here $\mathsf {SL}_\mat $ generates aligned ancestor–descendant sub-structures, $\mathsf {SL}_\ins $ generates descendant-only sub-structures, and $\mathsf {SL}_\del $ generates ancestor-only sub-structures.

$\mathsf {RDECO}$ is elaborated symmetrically, with R-emission direction preserved for $\mathsf {RFRAG}$ elements and right-side branches.

Loop Elaboration. The loop is a TKF91 sequence of loop-links with rates $(\insrate _L, \delrate _L)$. Elaboration proceeds identically to the stem: each $\mathsf {LOOP}$ becomes $\mathsf {LOOP}_\mat / \mathsf {LOOP}_\del $, and each loop-link type ($\mathsf {LFRAG}$, $\mathsf {RFRAG}$, nested $\mathsf {SL}$) gains $\mat /\ins /\del $ versions.

Nested stem-loops within the loop (multiloop junctions) are handled recursively: $\mathsf {SL}_\mat $ aligns both ancestor and descendant sub-structures, allowing multiloop branches to be independently inserted, deleted, or matched.

C.12.4 Example Four: The TKF Genome

Parameters. Let $\overline {xyz}$ denote the reverse complement of $xyz$, e.g. $\overline {AAG}=CTT$.

The rate models $Q_N$, $Q_S$, $Q_L$ and $Q_C$ should be strand-invariant: for example, $Q_N(x_1,x_2) = Q_N(\overline {x_1},\overline {x_2})$.

Functions. For $F \in \{ R, N, G, S, L, C \}$, define

\begin {eqnarray*} \kappa _F & = & \left ( 1 - \frac {\lambda _F}{\mu _F} \right ) \left ( 1 - r_F \right ) \end {eqnarray*}

The Joint Finite-Time Grammar

How to form the joint grammar. The general rules for forming the joint grammar from the conditional grammar are as follows. For every nonterminal with rules of the following form (here $\ntf {L}$, $\ntf {R}$, and/or $\ntf {E}$ are allowed to be $\epsilon $)
\[ \begin {array}{rcl|ll}
\mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline
\nt {F} & \to & \nt {L} \nt {F} \nt {R} & 1 - \kappa _F & q_F(LR) \\
& | & \nt {E} & \kappa _F \\
\end {array} \]
that is, for $\ntf {GENOME}$, $\ntf {INTER}$, $\ntf {FWDCDS}$, $\ntf {REVCDS}$, $\ntf {STRUCT}$, $\ntf {LOOP}$, and $\ntf {CONS}$ (with $F \in \{ R, N, G, S, L, C \}$), replace these rules with
\[ \begin {array}{rcl|ll}
\mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline
\ntm {F} & \to & \ntm {L} \ntm {F} \ntm {R} & (1 - \beta _{M,F}) (1 - \kappa _F) \alpha _F & q_F(LR_x) \exp (Q_F t)(LR_x,LR_y) \\
& | & \nty {L} \ntm {F} \nty {R} & \beta _{M,F} & q_F(LR_y) \\
& | & \ntx {L} \ntd {F} \ntx {R} & (1 - \beta _{M,F}) (1 - \kappa _F) (1 - \alpha _F) & q_F(LR_x) \\
& | & \ntm {E} & (1 - \beta _{M,F}) \kappa _F \\
\ntd {F} & \to & \ntm {L} \ntm {F} \ntm {R} & (1 - \beta _{D,F}) (1 - \kappa _F) \alpha _F & q_F(LR_x) \exp (Q_F t)(LR_x,LR_y) \\
& | & \nty {L} \ntm {F} \nty {R} & \beta _{D,F} & q_F(LR_y) \\
& | & \ntx {L} \ntd {F} \ntx {R} & (1 - \beta _{D,F}) (1 - \kappa _F) (1 - \alpha _F) & q_F(LR_x) \\
& | & \ntm {E} & (1 - \beta _{D,F}) \kappa _F \\
\end {array} \]
giving two versions, $\ntm {F}$ and $\ntd {F}$, each with its own $\beta $ ($\beta _{M,F}$ and $\beta _{D,F}$ respectively).

Every other nonterminal $\ntf {N}$ needs three versions, $\ntm {N}$, $\ntx {N}$ and $\nty {N}$; the outgoing rules of each version lead only to nonterminals of the same version.
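As a sanity check on the replacement rules above, the four outgoing transition weights of $\ntm {F}$ (and likewise of $\ntd {F}$) should sum to one for any $\beta $, $\kappa _F$ and $\alpha _F$ in $[0,1]$. The Python sketch below tabulates those weights and also evaluates $\kappa _F = (1 - \lambda _F/\mu _F)(1 - r_F)$. The function names and dictionary keys are illustrative assumptions, not identifiers from the text.

```python
def termination_prob(ins_rate, del_rate, r):
    """kappa_F = (1 - lambda_F / mu_F) * (1 - r_F), as defined under Functions."""
    return (1 - ins_rate / del_rate) * (1 - r)

def joint_rule_weights(beta, kappa_f, alpha):
    """Transition weights of the four joint-grammar rules for one nonterminal.

    beta is beta_{M,F} for the M version or beta_{D,F} for the D version.
    """
    return {
        "match": (1 - beta) * (1 - kappa_f) * alpha,
        "insert": beta,
        "delete": (1 - beta) * (1 - kappa_f) * (1 - alpha),
        "end": (1 - beta) * kappa_f,
    }

# Example with arbitrary illustrative values: the weights form a distribution.
weights = joint_rule_weights(beta=0.1, kappa_f=termination_prob(1.0, 2.0, 0.5), alpha=0.8)
total = sum(weights.values())
```

Algebraically, the match, delete and end weights share the factor $(1-\beta )$ and their remaining factors sum to $(1-\kappa _F) + \kappa _F = 1$, so the total is $(1-\beta ) + \beta = 1$.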

Top level.
\[ \begin {array}{rcl|ll}
\mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline
\ntk {GENOME} & \to & \ntm {REGION} \ntm {GENOME} & (1 - \beta _{k,R}) (1 - \kappa _R) \alpha _R \\
& | & \nty {REGION} \ntm {GENOME} & \beta _{k,R} \\
& | & \ntx {REGION} \ntd {GENOME} & (1 - \beta _{k,R}) (1 - \kappa _R) (1 - \alpha _R) \\
& | & \epsilon & (1 - \beta _{k,R}) \kappa _R \\
\ntj {REGION} & \to & \ntj {INTER} & p_N \\
& | & \ntj {FWDCDS} & p_G / 2 \\
& | & \ntj {REVCDS} & p_G / 2 \\
& | & \ntj {STRUCT} & p_S \\
& | & \ntj {CONS} & p_C \\
\nt {INTER} & \to & x_1\ \nt {INTER} & 1 - \kappa _N & q_N(x_1) \\
& | & \epsilon & \kappa _N \\
\end {array} \]

Coding sequences.
\[ \begin {array}{rcl|ll}
\mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline
\nt {FWDCDS} & \to & \nt {FWDCOD} \nt {FWDCDS} & 1 - \kappa _G \\
& | & \epsilon & \kappa _G \\
\nt {FWDCOD} & \to & x_1\ x_2\ x_3 & 1 - p_I & q_G(x_1 x_2 x_3) \\
& | & x_1\ x_2\ x_3\ \nt {FWDINT} & p_I / 3 & q_G(x_1 x_2 x_3) \\
& | & x_1\ x_2\ \nt {FWDINT} x_3 & p_I / 3 & q_G(x_1 x_2 x_3) \\
& | & x_1\ \nt {FWDINT} x_2\ x_3 & p_I / 3 & q_G(x_1 x_2 x_3) \\
\nt {FWDINT} & \to & x_1\ x_2\ \nt {GENOME} x_3\ x_4 & 1 & q_{D1}(x_1) q_{D2}(x_2) q_{A1}(x_3) q_{A2}(x_4) \\
\end {array} \]

RNA structures.
\[ \begin {array}{rcl|ll}
\mbox {LHS} & \to & \mbox {RHS} & \mbox {Transition} & \mbox {Emission} \\ \hline
\nt {STRUCT} & \to & x_1\ \nt {STRUCT} x_2 & 1 - \kappa _S & q_S(x_1 x_2) \\
& | & \nt {LOOP} & \kappa _S \\
\nt {LOOP} & \to & x_1\ \nt {LOOP} & 1 - \kappa _L & q_L(x_1) \\
& | & \epsilon & \kappa _L \\
\nt {CONS} & \to & x_1\ \nt {CONS} & 1 - \kappa _C & q_C(x_1) \\
& | & \epsilon & \kappa _C \\
\end {array} \]



Category	States	Count

Start/End	\(\sta \), \(\fin \)	\(2\)
Domain-level (top-level TKF91 states)	\(\matdom \), \(\insdom \), \(\deldom \), \(\matdomend \), \(\insdomend \), \(\deldomend \)	\(6\)
Domain type selection (one per domain type \(k\))	\(\matdomtype {k}\), \(\insdomtype {k}\), \(\deldomtype {k}\)	\(3\ndom \)
Fragment-level (inner TKF states within \(\matdomtype {k}\))	\(\mkfrag {k}\), \(\mkifrag {k}\), \(\mkdfrag {k}\)	\(3\ndom \)
Fragment-level (single looping state within \(\insdomtype {k}\), \(\deldomtype {k}\))	\(\ikfrag {k}\), \(\dkfrag {k}\)	\(2\ndom \)
Fragment type selection (one per fragment type \(f\))	\(\mkfragtype {k}{f}\), \(\mkifragtype {k}{f}\), \(\mkdfragtype {k}{f}\), \(\ikfragtype {k}{f}\), \(\dkfragtype {k}{f}\)	\(5\ndom \nfrag \)
Emit states (the only emitting states)	\(\mkfragemit {k}{f}\), \(\mkifragemit {k}{f}\), \(\mkdfragemit {k}{f}\), \(\ikfragemit {k}{f}\), \(\dkfragemit {k}{f}\)	\(5\ndom \nfrag \)
Fragment end (fragment termination)	\(\mkfragend {k}{f}\), \(\mkifragend {k}{f}\), \(\mkdfragend {k}{f}\), \(\ikfragend {k}{f}\), \(\dkfragend {k}{f}\)	\(5\ndom \nfrag \)

Total	\(8 + 8\ndom + 15\ndom \nfrag \)
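The total in the last row can be reproduced by summing the per-category counts of the table for a given number of domain types $\ndom $ and fragment types $\nfrag $. The Python sketch below does this tally; the function name and the dictionary keys are illustrative labels chosen here, not identifiers from the text.

```python
def state_counts(n_dom, n_frag):
    """Per-category state counts, copied row by row from the table above."""
    return {
        "start_end": 2,
        "domain_level": 6,
        "domain_type_selection": 3 * n_dom,
        "fragment_level_in_match_domain": 3 * n_dom,
        "fragment_level_in_ins_del_domain": 2 * n_dom,
        "fragment_type_selection": 5 * n_dom * n_frag,
        "emit": 5 * n_dom * n_frag,
        "fragment_end": 5 * n_dom * n_frag,
    }

def total_states(n_dom, n_frag):
    """Sum of all categories; equals 8 + 8*n_dom + 15*n_dom*n_frag."""
    return sum(state_counts(n_dom, n_frag).values())
```

For example, three domain types with two fragment types each give $8 + 24 + 90 = 122$ states, of which only the $5 \cdot 3 \cdot 2 = 30$ emit states produce output.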


Parameter	Factor	Where it appears

\(\alpha _0\)	\((1-\beta _0)\kappa _0\alpha _0\)	Top-level \(\to \matdom \)
\(1-\alpha _0\)	\((1-\beta _0)\kappa _0(1-\alpha _0)\)	Top-level \(\to \deldom \)
\(\beta _0\)	\(\beta _0\)	Top-level \(\to \insdom \)
\(1-\beta _0\)	\((1-\beta _0)\)	Top-level \(\to \matdom , \deldom , \fin \)
\(\gamma _0\)	\(\gamma _0\)	\(\deldomend \to \insdom \)
\(1-\gamma _0\)	\((1-\gamma _0)\)	\(\deldomend \to \matdom , \deldom , \fin \)
\(\kappa _0\)	\(\kappa _0\)	Top-level \(\to \matdom , \deldom \)
\(1-\kappa _0\)	\((1-\kappa _0)\)	Top-level \(\to \fin \)
\(\alpha _k\)	\((1-\beta _k)\kappa _k\alpha _k\)	Domain-\(k\) \(\to \) MatFrag
\(\beta _k\)	\(\beta _k\)	Domain-\(k\) \(\to \) InsFrag
\(\gamma _k\)	\(\gamma _k\)	Domain-\(k\) DelFragEnd \(\to \) InsFrag
\(\kappa _k\)	\(\kappa _k\)	Domain-\(k\) \(\to \) MatFrag/DelFrag, I/D-type continuation
\(1-\kappa _k\)	\((1-\kappa _k)\)	Domain-\(k\) \(\to \) DomEnd
\(\ext ^{(k)}_{fg}\)	\(\ext ^{(k)}_{fg}\)	Intra-fragment fragment-type transition \(f \to g\)
\(\notext ^{(k)}_f\)	\(1 - \sum _g \ext ^{(k)}_{fg}\)	Fragment termination
\(v_k\)	\(v_k\)	Domain type selection
\(w_{kf}\)	\(w_{kf}\)	Fragment type selection


Tensor	Shape	Meaning

\(B\)	\(n_\tau \times (|\alphabet |+2)^2\)	Singlet bigrams (incl. \(\sta \)/\(\fin \))
\(C^{\mat \mat }\)	\(n_\tau \times |\alphabet |^4\)	Match\(\to \)Match: \((\anctok ,\destok ,\anctok ',\destok ')\)
\(C^{\mat \ins }\)	\(n_\tau \times |\alphabet |^3\)	Match\(\to \)Insert: \((\anctok ,\destok ,\destok ')\)
\(C^{\mat \del }\)	\(n_\tau \times |\alphabet |^3\)	Match\(\to \)Delete: \((\anctok ,\destok ,\anctok ')\)
\(C^{\ins \mat }\)	\(n_\tau \times |\alphabet |^3\)	Insert\(\to \)Match: \((\destok ,\anctok ',\destok ')\)
\(C^{\ins \ins }\)	\(n_\tau \times |\alphabet |^2\)	Insert\(\to \)Insert: \((\destok ,\destok ')\)
\(C^{\ins \del }\)	\(n_\tau \times |\alphabet |^2\)	Insert\(\to \)Delete: \((\destok ,\anctok ')\)
\(C^{\del \mat }\)	\(n_\tau \times |\alphabet |^3\)	Delete\(\to \)Match: \((\anctok ,\anctok ',\destok ')\)
\(C^{\del \del }\)	\(n_\tau \times |\alphabet |^2\)	Delete\(\to \)Delete: \((\anctok ,\anctok ')\)
\(C^{\del \ins }\)	\(n_\tau \times |\alphabet |^2\)	Delete\(\to \)Insert: \((\anctok ,\destok ')\)
\(C^{\sta \mat }\)	\(n_\tau \times |\alphabet |^2\)	Start\(\to \)Match: \((\anctok ',\destok ')\)
\(C^{\sta \ins }\)	\(n_\tau \times |\alphabet |\)	Start\(\to \)Insert: \((\destok ')\)
\(C^{\sta \del }\)	\(n_\tau \times |\alphabet |\)	Start\(\to \)Delete: \((\anctok ')\)
\(C^{\mat \fin }\)	\(n_\tau \times |\alphabet |^2\)	Match\(\to \)End: \((\anctok ,\destok )\)
\(C^{\ins \fin }\)	\(n_\tau \times |\alphabet |\)	Insert\(\to \)End: \((\destok )\)
\(C^{\del \fin }\)	\(n_\tau \times |\alphabet |\)	Delete\(\to \)End: \((\anctok )\)
\(C^{\sta \fin }\)	\(n_\tau \)	Start\(\to \)End (empty alignment)
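The shapes in the table above determine the storage needed per index along the leading $n_\tau $ axis. The Python sketch below counts entries from the exponents listed in the Shape column; the two-letter keys (S, M, I, D, E for start, match, insert, delete, end) and the function names are illustrative labels, not identifiers from the text.

```python
# Exponent of the alphabet size in each adjacency tensor's shape,
# read off the Shape column of the table (one entry per C tensor).
ADJACENCY_EXPONENT = {
    "MM": 4,
    "MI": 3, "MD": 3, "IM": 3, "DM": 3,
    "II": 2, "ID": 2, "DD": 2, "DI": 2, "SM": 2, "ME": 2,
    "SI": 1, "SD": 1, "IE": 1, "DE": 1,
    "SE": 0,
}

def adjacency_entries(alphabet_size):
    """Total entries across all 16 adjacency tensors, per leading index."""
    return sum(alphabet_size ** k for k in ADJACENCY_EXPONENT.values())

def singlet_bigram_entries(alphabet_size):
    """Entries of the singlet bigram tensor B, per leading index."""
    return (alphabet_size + 2) ** 2
```

Because the exponents occur with multiplicities 1, 4, 6, 4, 1, the adjacency total equals $(|\alphabet | + 1)^4$ by the binomial theorem: 625 entries for a four-letter alphabet and 194 481 for a twenty-letter one.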

Source	Dest	Input	Output	Weight

\(\sta \)	\(\waitm \)	\(\varepsilon \)	\(\varepsilon \)	\(p_{\sta \waitm }\)
\(\sta \)	\(\ins \)	\(\varepsilon \)	\(\destok \)	\(p_{\sta \ins }(Y,\destok )\)
\(\sta \)	\(\fin \)	\(\varepsilon \)	\(\varepsilon \)	\(p_{\sta \fin }\)

\(\waitm \)	\(\mat \)	\(\anctok \)	\(\destok \)	\(p_{\waitm \mat }(X,Y,\anctok ,\destok )\)
\(\waitm \)	\(\del \)	\(\anctok \)	\(\varepsilon \)	\(p_{\waitm \del }(X,Y,\anctok )\)

\(\waitd \)	\(\mat \)	\(\anctok \)	\(\destok \)	\(p_{\waitd \mat }(X,Y,\anctok ,\destok )\)
\(\waitd \)	\(\del \)	\(\anctok \)	\(\varepsilon \)	\(p_{\waitd \del }(X,Y,\anctok )\)

\(\mat \)	\(\waitm \)	\(\varepsilon \)	\(\varepsilon \)	\(p_{\mat \waitm }\)
\(\mat \)	\(\ins \)	\(\varepsilon \)	\(\destok \)	\(p_{\mat \ins }(X,Y,\destok )\)
\(\mat \)	\(\fin \)	\(\varepsilon \)	\(\varepsilon \)	\(p_{\mat \fin }\)

\(\ins \)	\(\waitm \)	\(\varepsilon \)	\(\varepsilon \)	\(p_{\ins \waitm }\)
\(\ins \)	\(\ins \)	\(\varepsilon \)	\(\destok \)	\(p_{\ins \ins }(X,Y,\destok )\)
\(\ins \)	\(\fin \)	\(\varepsilon \)	\(\varepsilon \)	\(p_{\ins \fin }\)

\(\del \)	\(\waitd \)	\(\varepsilon \)	\(\varepsilon \)	\(p_{\del \waitd }\)
\(\del \)	\(\ins \)	\(\varepsilon \)	\(\destok \)	\(p_{\del \ins }(X,Y,\destok )\)
\(\del \)	\(\fin \)	\(\varepsilon \)	\(\varepsilon \)	\(p_{\del \fin }\)

Context	Adjacency	MixDom path	Frequency


	\(\sta \to \mat [X',Y']\)	\(\sta \xrightarrow {\silent ^\ast } \mat [X',Y']\)	\(f^{\sta \mat }(X',Y')\)

	\(\sta \to \ins [Y']\)	\(\sta \xrightarrow {\silent ^\ast } \ins [Y']\)	\(f^{\sta \ins }(Y')\)

	\(\sta \to \fin \)	\(\sta \xrightarrow {\silent ^\ast } \fin \)	\(f^{\sta \fin }\)


	\(\mat [X,Y] \to \mat [X',Y']\)	\(\mat [X,Y] \xrightarrow {\silent ^\ast } \mat [X',Y']\)	\(f^{\mat \mat }(X,Y,X',Y')\)

	\(\mat [X,Y] \to \ins [Y']\)	\(\mat [X,Y] \xrightarrow {\silent ^\ast } \ins [Y']\)	\(f^{\mat \ins }(X,Y,Y')\)

	\(\mat [X,Y] \to \del [X']\)	\(\mat [X,Y] \xrightarrow {\silent ^\ast } \del [X']\)	\(f^{\mat \del }(X,Y,X')\)

	\(\mat [X,Y] \to \fin \)	\(\mat [X,Y] \xrightarrow {\silent ^\ast } \fin \)	\(f^{\mat \fin }(X,Y)\)


	\(\ins [Y] \to \ins [Y']\)	\(\ins [Y] \xrightarrow {\silent ^\ast } \ins [Y']\)	\(f^{\ins \ins }(X,Y,Y')\)

	\(\ins [Y] \to \mat [X',Y']\)	\(\ins [Y] \xrightarrow {\silent ^\ast } \mat [X',Y']\)	\(f^{\ins \mat }(X,Y,X',Y')\)

	\(\ins [Y] \to \del [X']\)	\(\ins [Y] \xrightarrow {\silent ^\ast } \del [X']\)	\(f^{\ins \del }(X,Y,X')\)

	\(\ins [Y] \to \fin \)	\(\ins [Y] \xrightarrow {\silent ^\ast } \fin \)	\(f^{\ins \fin }(X,Y)\)


	\(\del [X] \to \mat [X',Y']\)	\(\del [X] \xrightarrow {\silent ^\ast } \mat [X',Y']\)	\(f^{\del \mat }(X,Y,X',Y')\)

	\(\del [X] \to \del [X']\)	\(\del [X] \xrightarrow {\silent ^\ast } \del [X']\)	\(f^{\del \del }(X,Y,X')\)

	\(\del [X] \to \ins [Y']\)	\(\del [X] \xrightarrow {\silent ^\ast } \ins [Y']\)	\(f^{\del \ins }(X,Y,Y')\)

	\(\del [X] \to \fin \)	\(\del [X] \xrightarrow {\silent ^\ast } \fin \)	\(f^{\del \fin }(X,Y)\)

Parameter	\(v_\theta \)	\(N\) for \(\varepsilon {=}10\%\)	\(N\) for \(\varepsilon {=}5\%\)	\(N\) for \(\varepsilon {=}1\%\)
\(\insrate _0\) (top-level ins)	18	1 800	7 200	180 000
\(\delrate _0\) (top-level del)	18	1 800	7 200	180 000
\(\insrate _1\) (dom 1 ins)	24	2 400	9 600	240 000
\(\insrate _2\) (dom 2 ins)	255	25 500	102 000	2 550 000
\(\insrate _3\) (dom 3 ins)	2.4	240	960	24 000
\(w_d\) (domain weights)	\(\sim 1/L_0 \approx 0.06\)	6	25	630
\(\ext ^{(d)}_{fg}\) (fragment trans.)	\(\sim 1/\bar {C}_d\)	depends on domain
Substitution (\(Q\))	\(\sim 1/\bar {L}_{\text {seq}}\)	\(\ll 100\)
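The sample sizes tabulated for the rate parameters are numerically consistent with $N = v_\theta / \varepsilon ^2$. That relation is read off the table values here rather than quoted from the text, so the Python sketch below should be treated as an assumption-laden check; the function name is hypothetical.

```python
def samples_for_relative_error(v_theta, eps):
    """Sample size N giving relative error eps, assuming N = v_theta / eps**2."""
    return v_theta / eps ** 2

# Reproduce the four fully numeric rows of the table (values of v_theta).
rows = {"top-level ins/del": 18, "dom 1 ins": 24, "dom 2 ins": 255, "dom 3 ins": 2.4}
table = {
    name: [round(samples_for_relative_error(v, eps)) for eps in (0.10, 0.05, 0.01)]
    for name, v in rows.items()
}
```

Under this relation, halving the target error quadruples the required sample size. The domain-weight row uses a rounded $v_\theta \approx 0.06$, so its entries match only approximately.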


Source \(\ell \)	Dest \(\ell '\)	Transition weight

\(\sta \)	\((\frag ', 0, \dom ', e')\)	\(\kappa _\main \, \domdist _{\dom '}\, \fragdist _{\dom '\frag '}\)
\(\sta \)	\((\frag ', 1, \dom ', e')\)	\(\kappa _\main \, \domdist _{\dom '}\, \fragdist _{\dom '\frag '}\)
\(\sta \)	\(\fin \)	\(1 - \kappa _\main \)

Mid-domain continuations:
\((\frag , 0, \dom , e)\)	\((\frag ', g', \dom , e)\)	\(\ext ^{(\dom )}_{\frag \frag '}\)
\((\frag , 1, \dom , 0)\)	\((\frag ', g', \dom , e')\)	\(\notext ^{(\dom )}_\frag \cdot \kappa _\dom \cdot \fragdist _{\dom \frag '}\)
\((\frag , 1, \dom , 1)\)	\((\frag ', g', \dom ', e')\)	\(\notext ^{(\dom )}_\frag \cdot (1-\kappa _\dom ) \cdot \kappa _\main \cdot \domdist _{\dom '} \cdot \kappa _{\dom '} \cdot \fragdist _{\dom '\frag '}\,/\,(1-\zeta )\)

Termination:
\((\frag , 0, \dom , e)\)	\(\fin \)	\(0\) (cannot end mid-fragment)
\((\frag , 1, \dom , 0)\)	\(\fin \)	\(\notext ^{(\dom )}_\frag \cdot (1-\kappa _\dom ) \cdot (1-\kappa _\main )\,/\,(1-\zeta )\)
\((\frag , 1, \dom , 1)\)	\(\fin \)	\(\notext ^{(\dom )}_\frag \cdot (1-\kappa _\dom ) \cdot (1-\kappa _\main )\,/\,(1-\zeta )\)


Transition	Condition on \((g, e)\)	Weight

\((\mat , \ell ) \to (\mathrm {W_M}, \ell )\)	\(g=0\)	\(1\)
	\(g=1,\, e=0\)	\(\notext ^{(\dom )}_\frag \cdot 1\)
	\(g=1,\, e=1\)	\(\notext ^{(\dom )}_\frag \cdot 1\)

\((\mathrm {D_F}, \ell ) \to (\mathrm {W_{D_F}}, \ell )\)	\(g=0\)	\(1\)
	\(g=1,\, e=0\)	\(\notext ^{(\dom )}_\frag \)
	\(g=1,\, e=1\)	\(\notext ^{(\dom )}_\frag \)

\((\mathrm {D_D}, \ell ) \to (\mathrm {W_{D_D}}, \ell )\)	\(g=0\)	\(1\)
	\(g=1,\, e=0\)	\(\notext ^{(\dom )}_\frag \)
	\(g=1,\, e=1\)	\(\notext ^{(\dom )}_\frag \)



lhs	\(\to \)	rhs	\(P(a)\)


\(L_{a}\)	\(\to \)	\({u}\ L_{a}\)	\(\kappa _l \pi _l(u)\)
	\(|\)	\(S_{a}\ L_{a}\)	\(\kappa _l \pi _l(S)\)
	\(|\)	\(\epsilon \)	\(1-\kappa _l\)

\(S_{a}\)	\(\to \)	\({u}\ S_{a}\ {v}\)	\(\kappa _s \pi _s(uv)\)
	\(|\)	\(L_{a}\)	\(1-\kappa _s\)



lhs	\(\to \)	rhs	\(P(b)\)


\(L_{b}\)	\(\to \)	\({w}\ L_{b}\)	\(\kappa _l \pi _l(w)\)
	\(|\)	\(S_{b}\ L_{b}\)	\(\kappa _l \pi _l(S)\)
	\(|\)	\(\epsilon \)	\(1-\kappa _l\)

\(S_{b}\)	\(\to \)	\({w}\ S_{b}\ {x}\)	\(\kappa _s \pi _s(wx)\)
	\(|\)	\(L_{b}\)	\(1-\kappa _s\)



lhs	\(\to \)	rhs	\(P(a)\)	\(P(b|a)\)


\(L_{ab}\)	\(\to \)	\({u}\ {w}\ L_{ab}\)	\(\kappa _l \pi _l(u)\)	\((1-\beta _l) \alpha _l M_l(u,w)\)
	\(|\)	\({w}\ L_{ab}\)	\(1\)	\(\beta _l \pi _l(w)\)
	\(|\)	\({u}\ L_{a\gap b}\)	\(\kappa _l \pi _l(u)\)	\((1-\beta _l) (1-\alpha _l)\)

	\(|\)	\(S_{ab}\ L_{ab}\)	\(\kappa _l \pi _l(S)\)	\((1-\beta _l) \alpha _l\)
	\(|\)	\(S_{b}\ L_{ab}\)	\(1\)	\(\beta _l \pi _l(S)\)
	\(|\)	\(S_{a}\ L_{a\gap b}\)	\(\kappa _l \pi _l(S)\)	\((1-\beta _l) (1-\alpha _l)\)
	\(|\)	\(\epsilon \)	\(1-\kappa _l\)	\(1-\beta _l\)

\(S_{ab}\)	\(\to \)	\({u}\ {w}\ S_{ab}\ {x}\ {v}\)	\(\kappa _s \pi _s(uv)\)	\((1-\beta _s) \alpha _s M_s(uv,wx)\)
	\(|\)	\({w}\ S_{ab}\ {x}\)	\(1\)	\(\beta _s \pi _s(wx)\)
	\(|\)	\({u}\ S_{a\gap b}\ {v}\)	\(\kappa _s \pi _s(uv)\)	\((1-\beta _s) (1-\alpha _s)\)
	\(|\)	\(L_{ab}\)	\(1-\kappa _s\)	\(1-\beta _s\)

\(L_{a\gap b}\)	\(\to \)	\({u}\ {w}\ L_{ab}\)	\(\kappa _l \pi _l(u)\)	\((1-\gamma _l) \alpha _l M_l(u,w)\)
	\(|\)	\({w}\ L_{ab}\)	\(1\)	\(\gamma _l \pi _l(w)\)
	\(|\)	\({u}\ L_{a\gap b}\)	\(\kappa _l \pi _l(u)\)	\((1-\gamma _l) (1-\alpha _l)\)

	\(|\)	\(S_{ab}\ L_{ab}\)	\(\kappa _l \pi _l(S)\)	\((1-\gamma _l) \alpha _l\)
	\(|\)	\(S_{b}\ L_{ab}\)	\(1\)	\(\gamma _l \pi _l(S)\)
	\(|\)	\(S_{a}\ L_{a\gap b}\)	\(\kappa _l \pi _l(S)\)	\((1-\gamma _l) (1-\alpha _l)\)
	\(|\)	\(\epsilon \)	\(1-\kappa _l\)	\(1-\gamma _l\)

\(S_{a\gap b}\)	\(\to \)	\({u}\ {w}\ S_{ab}\ {x}\ {v}\)	\(\kappa _s \pi _s(uv)\)	\((1-\gamma _s) \alpha _s M_s(uv,wx)\)
	\(|\)	\({w}\ S_{ab}\ {x}\)	\(1\)	\(\gamma _s \pi _s(wx)\)
	\(|\)	\({u}\ S_{a\gap b}\ {v}\)	\(\kappa _s \pi _s(uv)\)	\((1-\gamma _s) (1-\alpha _s)\)
	\(|\)	\(L_{ab}\)	\(1-\kappa _s\)	\(1-\gamma _s\)
