Asymptotic Density of Zimin Words

Word $W$ is an instance of word $V$ provided there is a homomorphism $\phi$ mapping letters to nonempty words so that $\phi(V) = W$. For example, taking $\phi$ such that $\phi(c)=fr$, $\phi(o)=e$ and $\phi(l)=zer$, we see that "freezer" is an instance of "cool". Let $\mathbb{I}_n(V,[q])$ be the probability that a random length $n$ word on the alphabet $[q] = \{1,2,\ldots,q\}$ is an instance of $V$. Having previously shown that $\lim_{n \rightarrow \infty} \mathbb{I}_n(V,[q])$ exists, we now calculate this limit for two Zimin words, $Z_2 = aba$ and $Z_3 = abacaba$.


Introduction
Our present interest is in words: not the linguistic units with lexical value, but rather strings of symbols or letters. We are interested in words as abstract discrete structures. In particular, we are investigating elements of a free monoid. A monoid is an algebraic structure consisting of a set, an associative binary operation on the set, and an identity element. A free monoid is defined over some generating set of elements, which we view as an alphabet of letters. Its binary operation is simply concatenation, its elements (called free words) are all finite strings of letters, and its identity element is the empty word (generally denoted with ε or λ). Often, the operation of a monoid is called multiplication, so it is fitting that a "subword" of a free word is called a "factor." For example, in the free monoid over alphabet {a, b, c, d, r}, the word cadabra is a factor of abracadabra because abracadabra is the product of abra and cadabra.

Combinatorial Limit Theory
In an era of massive technological and computational advances, we have large systems for transportation, communication, education, and commerce (to name a few examples). We also possess massive quantities of information in every part of life. Therefore, in many applications of discrete mathematics, the useful theory is that which is relevant to arbitrarily large discrete structures. For example, graphs can be used to model a computer network, with each vertex representing a device and each edge a data connection between devices. The most well-known computer network, the Internet, consists of billions of devices with constantly changing connections; one cannot simply create a database of all billion-vertex graphs and their properties.
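x      b   n  s  u
φ(x)   vr  n  m  oo
Definition 1.9. $U$ is an instance of $V$, or a $V$-instance, provided $U = \phi(V)$ for some nonerasing homomorphism $\phi$; equivalently,
• $V = x_0 x_1 \cdots x_{m-1}$ where each $x_i$ is a letter;
• $U = A_0 A_1 \cdots A_{m-1}$ where each $A_i$ is a nonempty word and $A_i = A_j$ whenever $x_i = x_j$.
$W$ encounters $V$, denoted $V \preceq W$, provided some factor of $W$ is an instance of $V$. If $W$ fails to encounter $V$, we say $W$ avoids $V$.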
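As an illustration of the instance relation, here is a minimal Python sketch that searches for a nonerasing homomorphism by backtracking. The function names `is_instance` and `encounters` are hypothetical (they do not come from the paper), and `encounters` assumes the usual convention that a word encounters a pattern when some factor of it is an instance of the pattern.

```python
def is_instance(word, pattern):
    """Return True if word = phi(pattern) for some nonerasing homomorphism phi."""
    def extend(w_pos, p_pos, phi):
        # All pattern letters placed: succeed only if the word is used up too.
        if p_pos == len(pattern):
            return w_pos == len(word)
        letter = pattern[p_pos]
        if letter in phi:
            # The image of this letter is already fixed; it must occur here.
            image = phi[letter]
            return (word.startswith(image, w_pos)
                    and extend(w_pos + len(image), p_pos + 1, phi))
        # Try every nonempty image, leaving at least one symbol
        # for each remaining pattern letter.
        remaining = len(pattern) - p_pos - 1
        for end in range(w_pos + 1, len(word) - remaining + 1):
            phi[letter] = word[w_pos:end]
            if extend(end, p_pos + 1, phi):
                return True
            del phi[letter]
        return False

    return extend(0, 0, {})


def encounters(word, pattern):
    """Assumed convention: some nonempty factor of word is a pattern-instance."""
    n = len(word)
    return any(is_instance(word[i:j], pattern)
               for i in range(n) for j in range(i + 1, n + 1))
```

The search is exponential in the worst case, which is acceptable for the short words used as examples here but not for large-scale enumeration.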
To help distinguish the encountered word and the encountering word, "pattern" is elsewhere used to refer to $V$ in the encounter relation $V \preceq W$. Also, an instance of a word is sometimes called a "substitution instance" and "witness" is sometimes used in place of encounter.
Definition 1.10. A word $V$ is unavoidable provided, for any finite alphabet, there are only finitely many words that avoid $V$.
The first classification of unavoidable words was by Bean, Ehrenfeucht, and McNulty (1979). Three years later, Zimin published a fundamentally different classification of unavoidable words (Zimin 1982 in Russian, Zimin 1984 in English).
Definition 1.11. Define the $n$-th Zimin word recursively by $Z_0 := \varepsilon$ and, for $n \in \mathbb{N}$, $Z_{n+1} = Z_n x_n Z_n$. Using the English alphabet rather than indexed letters: $Z_1 = a$, $Z_2 = aba$, $Z_3 = abacaba$, $Z_4 = abacabadabacaba$. Equivalently, $Z_n$ can be defined over the natural numbers as the word of length $2^n - 1$ such that the $i$-th letter, $1 \le i < 2^n$, is the 2-adic order of $i$.
Theorem 1.12 (Zimin 1984). A word $V$ with $n$ distinct letters is unavoidable if and only if $Z_n$ encounters $V$.
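The two descriptions of Zimin words (the recursion and the 2-adic order) can be compared directly. The following Python sketch builds both; the helper names `zimin`, `two_adic_order` and `zimin_by_valuation` are illustrative choices, not notation from the paper.

```python
def zimin(n, letters="abcdefghijklmnopqrstuvwxyz"):
    """Build Z_n by the recursion Z_0 = empty word, Z_{k+1} = Z_k x_k Z_k."""
    word = ""
    for k in range(n):
        word = word + letters[k] + word
    return word


def two_adic_order(i):
    """Exponent of the largest power of 2 dividing i (requires i >= 1)."""
    return (i & -i).bit_length() - 1


def zimin_by_valuation(n):
    """Z_n over the natural numbers: the i-th letter is the 2-adic order of i."""
    return [two_adic_order(i) for i in range(1, 2 ** n)]
```

Mapping the letters a, b, c, ... to 0, 1, 2, ... turns the output of `zimin(n)` into the output of `zimin_by_valuation(n)`.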
With Zimin's concise characterization of unavoidable words, a natural combinatorial question follows: How long must a $q$-ary word be to guarantee that it encounters a given unavoidable word? Define $f(n, q)$ to be the smallest integer $M$ such that every $q$-ary word of length $M$ encounters $Z_n$.
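For very small parameters, $f(n, q)$ can be found by exhaustive search. The sketch below is one hedged way to do this in Python: it translates a pattern into a regular expression with backreferences (a technique chosen here for brevity, not one used in the paper) and looks for the first length at which every word matches. The names `pattern_regex` and `smallest_forcing_length` are hypothetical, the alphabet is assumed to be the digits `0..q-1` with $q \le 10$, and the search is only feasible for tiny cases such as $Z_2$.

```python
import re
from itertools import product


def pattern_regex(pattern):
    """Regex matching any instance of pattern: first occurrence of a letter
    becomes a capturing group, later occurrences become backreferences."""
    group_of = {}
    parts = []
    for letter in pattern:
        if letter in group_of:
            parts.append("(?:\\%d)" % group_of[letter])
        else:
            group_of[letter] = len(group_of) + 1
            parts.append("(.+)")
    return re.compile("".join(parts))


def smallest_forcing_length(pattern, q, max_len=12):
    """Smallest M <= max_len such that every q-ary word of length M
    contains an instance of pattern as a factor; None if not found."""
    regex = pattern_regex(pattern)
    alphabet = "0123456789"[:q]
    for length in range(1, max_len + 1):
        words = product(alphabet, repeat=length)
        if all(regex.search("".join(w)) for w in words):
            return length
    return None
```

For example, `smallest_forcing_length("aba", 2)` searches binary words for $Z_2 = aba$; larger Zimin words quickly exceed what this brute-force approach can handle.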

Asymptotic Probability of Being Zimin
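Definition 2.1. Let $\mathbb{I}_n(V, q)$ be the probability that a uniformly randomly selected length-$n$ $q$-ary word is an instance of $V$. That is,
$$\mathbb{I}_n(V, q) = \frac{|\{W \in [q]^n : W = \phi(V) \text{ for some nonerasing homomorphism } \phi\}|}{q^n}.$$
Denote $\mathbb{I}(V, q) = \lim_{n \to \infty} \mathbb{I}_n(V, q)$.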
Cooper and Rorabaugh (2016+) prove that $\mathbb{I}(V, q)$ exists for any word $V$. Moreover, they establish the following dichotomy for $q \ge 2$: $\mathbb{I}(V, q) = 0$ if and only if $V$ is doubled (that is, every letter in $V$ occurs at least twice). Trivially, if $V$ is composed of $k$ distinct, nonrecurring letters, then $\mathbb{I}_n(V, [q]) = 1$ for $n \ge k$, so $\mathbb{I}(V, q) = 1$. But if $V$ contains at least one recurring letter, it becomes a nontrivial task to compute $\mathbb{I}(V, q)$. We have from previous work the following bounds for the instance probability of Zimin words.
Proof: For the lower bound, note that $\|Z_n\| = |Z_n| - |L(Z_n)| = (2^n - 1) - n$. Theorem 3.3 from Cooper and Rorabaugh (2016+) tells us that for all $q \in \mathbb{Z}^+$ and nondoubled $V$,
For the upper bound, observe that the $n$ letters occurring in $Z_n$ have the following multiplicities: $r_j = 2^j$ for $0 \le j < n$. Since there is exactly one nonrecurring letter in $Z_n$ ($r_0 = 2^0 = 1$), Theorem 4.14 from Rorabaugh (2015) provides an upper bound of
A nice property of these bounds is that they are asymptotically equivalent as $q \to \infty$. For some specific $V$, we can do better. Presently, we provide infinite series for computing the asymptotic instance probability $\mathbb{I}(V, q)$ for two Zimin words, $V = Z_2 = aba$ (Section 3) and $V = Z_3 = abacaba$ (Section 4). Table 2 below gives numerical approximations for $2 \le q \le 6$. Our method also provides upper bounds on $\mathbb{I}(Z_n, q)$ for general $n$ (Section 5).

Tab. 2: Approximate values of $\mathbb{I}(Z_2, q)$ and $\mathbb{I}(Z_3, q)$ for $2 \le q \le 6$.

q                        2          3          4          5          6
$\mathbb{I}(Z_2, q)$     0.7322132  0.4430202  0.3122520  0.2399355  0.1944229
$\mathbb{I}(Z_3, q)$     0.1194437  0.0183514  0.0051925  0.0019974  0.0009253

3 Calculating $\mathbb{I}(Z_2, q)$
Definition 3.1. Nonempty word $V$ is a bifix of word $W$ provided $W = VA = BV$ for some nonempty words $A$ and $B$; that is, $V$ is both a proper prefix and suffix of $W$. Moreover, if bifix $V$ is an instance of word $Z$, then $V$ is a $Z$-bifix of $W$. If word $W$ has no bifixes, $W$ is bifix-free. If $W$ has no $Z$-bifix, $W$ is $Z$-bifix-free.
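Lemma 3.2. If word $W$ has a bifix, then it has a bifix of length at most $\lfloor |W|/2 \rfloor$.
Proof: Suppose, for contradiction, that the minimal-length bifix of $W$ has length $k$ with $\lfloor |W|/2 \rfloor < k < |W|$. Then the prefix and suffix copies of this bifix overlap, so we can write $W = W_1 W_2 W_3$ where $W_1 W_2 = W_2 W_3$ is the bifix and $|W_2| = 2k - |W| > 0$. But then $W$ has bifix $W_2$ with $|W_2| < k$, which contradicts our selection of the shortest bifix of $W$.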
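Lemma 3.2 is easy to sanity-check by brute force on short words. The following Python sketch lists the bifix lengths of a word and searches for a counterexample to the lemma; the names `bifix_lengths` and `find_counterexample` are hypothetical, and the alphabet is assumed to be the digits `0..q-1`.

```python
from itertools import product


def bifix_lengths(word):
    """Lengths k, 1 <= k < len(word), for which the length-k prefix
    of word equals its length-k suffix (Definition 3.1)."""
    return [k for k in range(1, len(word)) if word[:k] == word[-k:]]


def find_counterexample(q=2, max_len=10):
    """Return a word whose shortest bifix is longer than floor(len/2),
    or None if no such word exists up to max_len."""
    alphabet = "0123456789"[:q]
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            word = "".join(letters)
            lengths = bifix_lengths(word)
            if lengths and lengths[0] > n // 2:
                return word
    return None
```

Since the lemma holds, `find_counterexample` is expected to return `None` for every alphabet size and length bound.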
Although some words are neither $Z_2$-instances nor bifix-free, the proportion of such words is asymptotically 0. Hence, $1 - \mathbb{I}(Z_2, q)$ was previously computed by Nielsen (1973) as the asymptotic probability that a word is bifix-free. Equivalently, in a paper of Guibas and Odlyzko (1981) on the period, or overlap, of words, $1 - \mathbb{I}(Z_2, q)$ was computed as the proportion of strings with no period. Rather than restate these results, we reformulate them presently for completeness and as a warm-up for calculating $\mathbb{I}(Z_3, q)$.
Suppose $UaV$ has a bifix for some letter $a$. Then by the lemma, $UaV$ has a bifix of length at most $|UaV|/2$. But $W$ is bifix-free, so the only possibility is $U = aV$.
Therefore, for every bifix-free word of length $2k$ there are $q$ bifix-free words of length $2k + 1$. For every bifix-free word of length $2k - 1$, there are $q$ bifix-free words of length $2k$, with the exception of the length-$2k$ words that are the square of a bifix-free word of length $k$.
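The counting rule just described can be written as a recursion and compared against direct enumeration. In the Python sketch below, $a_n$ denotes the number of bifix-free $q$-ary words of length $n$; the base case $a_1 = q$ is an assumption made here (a single letter has no bifix), and the function names are illustrative.

```python
from itertools import product


def bifix_free_counts(q, max_len):
    """Return [a_1, ..., a_max_len] using
    a_{2k+1} = q * a_{2k} and a_{2k} = q * a_{2k-1} - a_k."""
    a = [0] * (max_len + 1)  # a[0] is unused padding
    for n in range(1, max_len + 1):
        if n == 1:
            a[n] = q  # assumed base case: every single letter is bifix-free
        elif n % 2 == 1:
            a[n] = q * a[n - 1]
        else:
            # Remove the squares of bifix-free words of length n/2.
            a[n] = q * a[n - 1] - a[n // 2]
    return a[1:]


def brute_force_count(q, n):
    """Count bifix-free words of length n over the digits 0..q-1 directly."""
    alphabet = "0123456789"[:q]
    total = 0
    for letters in product(alphabet, repeat=n):
        word = "".join(letters)
        if not any(word[:k] == word[-k:] for k in range(1, n)):
            total += 1
    return total
```

Agreement between the two functions for small lengths gives a quick check that the recursion has been transcribed correctly.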
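Theorem 3.4. For $q \ge 2$,
$$\mathbb{I}(Z_2, q) = \sum_{j=0}^{\infty} \frac{(-1)^j q^{1 - 2^{j+1}}}{\prod_{i=0}^{j} \left(1 - q^{1 - 2^{i+1}}\right)}.$$
Proof: Since $a_\ell = a^{(q)}_\ell$ counts bifix-free words, the number of $q$-ary words of length $M$ that are $Z_2$-instances is (without double-count)
$$\sum_{\ell=1}^{\lceil M/2 \rceil - 1} a_\ell q^{M - 2\ell},$$
so the proportion of $q$-ary words of length $M$ that are $Z_2$-instances is
$$\sum_{\ell=1}^{\lceil M/2 \rceil - 1} a_\ell q^{-2\ell}.$$
From the recursive definition of $a_\ell$, we obtain the functional equation
$$f(x) = qx + qx f(x) - f(x^2), \qquad \text{where } f(x) = f^{(q)}(x) := \sum_{\ell=1}^{\infty} a_\ell x^\ell.$$
Solving for $f(x)$ gives $f(x) = (qx - f(x^2))/(1 - qx)$; iterating this identity and setting $x = q^{-2}$ yields the stated series.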
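The proportion in this proof can be evaluated numerically. The Python sketch below assumes the count of $Z_2$-instances of length $M$ is $\sum_{\ell < M/2} a_\ell q^{M-2\ell}$ (each $Z_2$-instance has a unique shortest bifix, which is bifix-free), so that the limiting proportion is $\sum_{\ell \ge 1} a_\ell q^{-2\ell}$; the base case $a_1 = q$ and the function names are assumptions of this sketch rather than notation from the paper.

```python
from itertools import product


def z2_density(q, terms=60):
    """Partial sum of sum_{l>=1} a_l * q^(-2l), with a_l from the
    bifix-free recursion; terms decay roughly like q^(-l)."""
    a = [0, q]  # a[0] unused, a[1] = q (assumed base case)
    total = a[1] / q ** 2
    for n in range(2, terms + 1):
        correction = a[n // 2] if n % 2 == 0 else 0
        a.append(q * a[n - 1] - correction)
        total += a[n] / q ** (2 * n)
    return total


def z2_proportion(q, length):
    """Proportion of q-ary words of the given length that are Z_2-instances,
    via the finite sum over bifix-free lengths l < length / 2."""
    a = [0, q]
    for n in range(2, length):
        a.append(q * a[n - 1] - (a[n // 2] if n % 2 == 0 else 0))
    return sum(a[l] / q ** (2 * l) for l in range(1, (length + 1) // 2))


def brute_force_proportion(q, length):
    """Direct count: a word is a Z_2-instance iff it has a bifix
    of length k with 2k < length."""
    alphabet = "0123456789"[:q]
    hits = 0
    for letters in product(alphabet, repeat=length):
        w = "".join(letters)
        if any(w[:k] == w[-k:] for k in range(1, (length + 1) // 2)):
            hits += 1
    return hits / q ** length
```

With the default number of terms, `z2_density(q)` can be compared against the $\mathbb{I}(Z_2, q)$ row of Table 2.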

Proof: The lower bound follows from the fact that a word of length $M > 2$ is a $Z_2$-instance when the first and last characters are the same. This occurrence has probability $1/q$. Note that $f^{(q)}(q^{-2})$ is an alternating series. Moreover, the terms in absolute value are monotonically approaching 0; the routine proof of monotonicity can be found in the appendices (Lemma A.1). Hence, the partial sums provide successively better upper and lower bounds:

4 Calculating $\mathbb{I}(Z_3, q)$
We will use similar methods to compute $\mathbb{I}(Z_3, q)$. To avoid unnecessary subscripts and superscripts, assume throughout this section that we are using a fixed alphabet with $q > 1$ letters, unless explicitly stated otherwise. Since $Z_2$ has more interesting structure than $Z_1$, there are more cases to consider in developing the necessary recursion.
Then $LAL$ can be written in exactly one of the following ways:
Proof: With some thought, the reader should recognize that the five listed cases are in fact mutually exclusive. The proof that these are the only possibilities follows.
Given that $W$ has a $Z_2$-bifix and $L$ is bifix-free, it follows that $W$ has a $Z_2$-bifix $LBL$ for some nonempty $B$. Let $LBL$ be chosen of minimal length. We break this proof into nine cases depending on the lengths of $L$ and $LBL$ (Figure 1). Set $m = |W|$, $\ell = |L|$, and $k = |LBL|$.
In LAL, the first and last occurrences of LBL overlap by a length strictly between 0 and ℓ. This is impossible, since L is bifix-free.
Case (4): $2k = m + \ell$. This is case iii.
Case (5): $m + \ell < 2k < m + 2\ell$. The first and last occurrences of $LBL$ overlap by a length strictly between $\ell$ and $2\ell$. This is impossible, since $L$ is bifix-free.
But this contradicts the minimality of $LBL$, since $LLLLLL$ has $Z_2$-bifix $LLL$, which is shorter than $LBL = LLLL$.
Case (8) is a bifix of LAL, contradicting the minimality of LBL.
Case (9): m − ℓ < k < m. The first and last occurrences of LBL overlap by a length strictly between k − ℓ and k. This is impossible, since L is bifix-free.
For fixed bifix-free word $L$ of length $\ell$, define $b^\ell_m$ to count the number of $Z_2$-instances with bifix $L$ that are $Z_2$-bifix-free $q$-ary words of length $m$. Then
In order to form a recursive definition of $b_n$ as we did for $a_n$, we now describe two new terms. Let $AB$ be a word of length $n$ with $|A| = \lceil n/2 \rceil$ and $|B| = \lfloor n/2 \rfloor$. Then $AB$ has $q$ length-$(n + 1)$ children of the form $AxB$, each having $AB$ as its parent. In this way every nonempty word has exactly $q$ children and exactly 1 parent, which establishes the $1:q$ ratio of words of length $n$ to words of length $n + 1$. The set of a word's children together with successive generations of progeny we refer to as that word's descendants.
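The parent and child relations amount to deleting or inserting a single middle letter. A minimal Python sketch, with hypothetical function names `parent` and `children` and the split $|A| = \lceil n/2 \rceil$ taken from the description above:

```python
def children(word, alphabet):
    """Children of AB, where |A| = ceil(n/2) and |B| = floor(n/2):
    the words AxB for each letter x of the alphabet."""
    split = (len(word) + 1) // 2  # ceil(n / 2)
    return [word[:split] + x + word[split:] for x in alphabet]


def parent(word):
    """Parent of a nonempty word AxB of length n + 1: delete the letter x,
    which sits at index ceil(n/2) = len(word) // 2."""
    mid = len(word) // 2
    return word[:mid] + word[mid + 1:]
```

Applying `parent` to any element of `children(w, alphabet)` returns `w`, which is the 1-to-$q$ correspondence between consecutive lengths.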
Theorem 4.2. $b^\ell_n = c^\ell_n + d^\ell_n$ where $c_n = c^\ell_n$ and $d_n = d^\ell_n$ are defined recursively as follows: For even $\ell$: For odd $\ell > 1$: For $\ell = 1$:
Proof: Fix a bifix-free word $L$ of length $\ell$. The full recursion is too messy to prove all at once, so we build up to it in stages. Within each stage, $\approx$ indicates an incomplete definition. Example word trees with small $q$ and short $L$ are found in Appendix B.
Stage I
Since $L$ is bifix-free, any $Z_2$-instance with $L$ as a bifix has to be of greater length than $2\ell$. Thus we have $b_1 = \cdots = b_{2\ell} = 0$. The only such words of length $2\ell + 1$ are of the form $LxL$ for some letter $x$; therefore, $b_{2\ell+1} = q$.
Every word of length $n > 2\ell + 1$ has $L$ as a bifix if and only if its parent has $L$ as a bifix. This is why, for $k > \ell$, the definition of $b_{2k}$ includes the term $q b_{2k-1}$, and the definition of $b_{2k+1}$ includes the term $q b_{2k}$. If $b_n$ were counting $Z_2$-instances with bifix $L$, we would be done. However, we do not want $b_n$ to count words that have a $Z_2$-bifix. Thus, we must deal with each of the 5 cases listed in Lemma 4.1.
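The quantity $b_m$ can be computed by exhaustive enumeration for short words, which gives a way to check the initial values stated in Stage I. The Python sketch below is a brute-force counter, not the recursion of Theorem 4.2; the function names are hypothetical, and it reads "a $Z_2$-instance with bifix $L$" as a word $LAL$ with $A$ nonempty.

```python
from itertools import product


def bifix_lengths(word):
    """Lengths of all bifixes of word (proper, nonempty prefix = suffix)."""
    return [k for k in range(1, len(word)) if word[:k] == word[-k:]]


def is_z2_instance(word):
    """word = XYX with X, Y nonempty, i.e. a bifix shorter than half the word."""
    return any(2 * k < len(word) for k in bifix_lengths(word))


def has_z2_bifix(word):
    """True if some bifix of word is itself a Z_2-instance."""
    return any(is_z2_instance(word[:k]) for k in bifix_lengths(word))


def count_b(L, m, alphabet):
    """Number of words L A L of length m (A nonempty) with no Z_2-bifix.
    L is assumed to be bifix-free; lengths m <= 2|L| give 0."""
    gap = m - 2 * len(L)
    if gap < 1:
        return 0
    return sum(1 for mid in product(alphabet, repeat=gap)
               if not has_z2_bifix(L + "".join(mid) + L))
```

For a bifix-free `L` over a `q`-letter alphabet, `count_b(L, 2 * len(L) + 1, alphabet)` should equal `q`, matching $b_{2\ell+1} = q$.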
First, let us deal with case ii: $LAL = LBLLBL$ with $LBL$ the shortest $Z_2$-bifix of $LAL$. The number of these of length $2k$, with $k > \ell$, is $b_k$. Therefore, in the definition of $b_{2k}$, we subtract $b_k$. Conveniently, the descendants of case-ii words are precisely words of case i. Therefore, we have accounted for two cases at once.
Next, let us look at case iii: $LAL = LBLBL$ with $LBL$ the shortest $Z_2$-bifix of $LAL$. For the moment, assume $|L| = \ell$ is even. Then $|LBLBL|$ is even. The number of such words of length $2k$, with $k > \ell$, is $b_{k+\ell/2}$. We want to exclude words of this form, but we do not necessarily want to exclude their children. Therefore, in the definition of $b_{2k}$ we subtract $b_{k+\ell/2}$, but then we add $q b_{k+\ell/2}$ in the definition of $b_{2k+1}$. Now we look at when $|L|$ is odd, so $|LBLBL|$ is odd. The number of such words of length $2k + 1$, with $k > \ell$, is $b_{k+\lceil \ell/2 \rceil}$. Therefore, in the definition of $b_{2k+1}$ we subtract $b_{k+\lceil \ell/2 \rceil}$, but then we add $q b_{(k-1)+\lceil \ell/2 \rceil} = q b_{k+\lfloor \ell/2 \rfloor}$ in the definition of $b_{(2(k-1)+1)+1} = b_{2k}$.
Our work so far renders the following tentative definition of $b_n$.

For even
We continue with case iv: $LAL = LLFLLFLL$ with $LLFLL$ the shortest $Z_2$-bifix of $LAL$. Note that $|LLFLLFLL|$ is even. It would appear that the number of such words of length $2k$ would be $b_{k-\ell}$ (counting words of the form $LFL$), which we could deal with in the same fashion as we did for case iii. However, when counting words of the form $LFL$, we do not want words of the form $LLGLL$, because $LLFLLFLL = LLLGLLLLGLLL$ is already accounted for in case i.
Stage II
To address this issue, we will define two different recursions. Let $d_n$ count the $Z_2$-instances of the form $LLALL$ that are $Z_2$-bifix-free. Let $c_n$ count all other $Z_2$-instances of the form $LAL$ that are $Z_2$-bifix-free. Therefore, $b_n = c_n + d_n$ by definition.
As with $b_n$, we quickly see that $c_n = 0$ for $n \le 2\ell$ and $c_{2\ell+1} = q$. Now the shortest words counted by $d_n$ are of the form $LLxLL$ for some letter $x$, so $d_n = 0$ for $n \le 4\ell$ and $d_{4\ell+1} = q$.
To deal with cases i and ii, we can do the same things as before, but recognizing that $LL$ is a bifix of $LBLLBL$ if and only if $LL$ is a bifix of $LBL$. Therefore, subtract $c_k$ in the definition of $c_{2k}$ and subtract $d_k$ in the definition of $d_{2k}$ (both for $k > \ell$).
We also deal with case iii as before, recognizing that $LL$ is a bifix of $LBLBL$ if and only if $LL$ is a bifix of $LBL$. For even $\ell$: subtract $c_{k+\ell/2}$ in the definition of $c_{2k}$ and add $q c_{k+\ell/2}$ in the definition of $c_{2k+1}$; subtract $d_{k+\ell/2}$ in the definition of $d_{2k}$ and add $q d_{k+\ell/2}$ in the definition of $d_{2k+1}$. For odd $\ell$: subtract $c_{k+\lceil \ell/2 \rceil}$ in the definition of $c_{2k+1}$ and add $q c_{k+\lfloor \ell/2 \rfloor}$ in the definition of $c_{2k}$; subtract $d_{k+\lceil \ell/2 \rceil}$ in the definition of $d_{2k+1}$ and add $q d_{k+\lfloor \ell/2 \rfloor}$ in the definition of $d_{2k}$.
Having split $b_n$ into $c_n$ and $d_n$, we can address case iv: $LAL = LLFLLFLL$ with $LLFLL$ the shortest $Z_2$-bifix of $LAL$. These words are counted by $d_n$, not by $c_n$, and there are $d_{k+\ell}$ such words of length $2k$. Therefore, we subtract $d_{k+\ell}$ in the definition of $d_{2k}$ and add $q d_{k+\ell}$ in the definition of $d_{2k+1}$.
This brings us to the following tentative definitions of $c_n$ and $d_n$.
For even ℓ : For odd ℓ :

Stage III
Next, let us deal with case v: $LLLL$. We merely need to subtract 1 in the definition of $c_{4\ell}$. Since all of the words counted by $d_n$ are descendants of $LLLL$, this is what prevents overlap of the words counted by $c_n$ and $d_n$.
There was a small omission in the previous stage. When dealing with cases i and ii, we pointed out that $LL$ is a bifix of $LBLLBL$ if and only if $LL$ is a bifix of $LBL$; this was a true and important observation. The one problem is that $LLL$ has $LL$ as a bifix but is not of the form $LLALL$. Therefore, $LLLLLL$ was "removed" in the definition of $c_{6\ell}$ when it should have been "removed" from $d_{6\ell}$. We must account for this by adding 1 in the definition of $c_{6\ell}$ and subtracting 1 in the definition of $d_{6\ell}$.
Similarly, in dealing with case iii, we "removed" $LLLLL$ in the definition of $c_{5\ell}$ and "replaced" its children in the definition of $c_{5\ell+1}$. These should have happened to $d_n$. Therefore, we add 1 and subtract $q$ in the definitions of $c_{5\ell}$ and $c_{5\ell+1}$, respectively, then subtract 1 and add $q$ in the definitions of $d_{5\ell}$ and $d_{5\ell+1}$, respectively.
Since $LLL$ does not cause any trouble with case iv, we are done building the recursive definition for even $\ell$ as found in the theorem statement.
Stage IV
The recursion for odd $\ell$ has an additional caveat when $\ell = 1$. When $\ell = 1$, there exist conflicts in the recursive definitions: $4\ell + 1 = 5\ell$ and $5\ell + 1 = 6\ell$. After consolidating the "adjustments" for these cases, we get the definition for $\ell = 1$ as appears in the theorem statement.
With our recursively defined sequences $a_n$ and $b_n$, the latter in terms of $c_n$ and $d_n$, we are now able to formulate Theorem 3.4 for $Z_3$.
Theorem 4.3. For integers $q \ge 2$, where
Proof: Recalling Equation (2),
Similar to our proof for $\mathbb{I}(Z_2, q)$, let us define generating functions for the sequences $c_n = c^\ell_n$ and $d_n = d^\ell_n$:
Despite having to write the recursive relations three different ways, depending on $\ell$, the underlying recursion is fundamentally the same and results in the following functional equations:
Solving (3) for $g(x)$, we get with $r(x)$ and $s(x)$ as defined in the theorem statement. Expanding (5) gives
Likewise, solving (4) for $h(x)$, we get with $u(x)$ and $v(x)$ as defined in the theorem statement.
$(G(i) + H(i)) \le \mathbb{I}(Z_3, q)$; with $G(i) = G$

Now for any integer
Moreover, since the $a_\ell$ are nonnegative, the lower bound for the theorem is evident. For a bifix-free word $L$ of length $\ell$, $\sum_{m=0}^{\infty} b^\ell_m q^{-2m}$ is the limit, as $M \to \infty$, of the probability that a word of length $M$ is a $Z_3$-instance of the form $LALBLAL$. A necessary condition for such a word is that it starts and ends with $L$, which (for $M \ge 2\ell$) has probability $q^{-2\ell}$. Also $a_\ell$ counts the number of bifix-free words of length $\ell$, so $a_\ell \le q^\ell$. Hence for any integer $N \ge 0$:
The values in Table 4 were generated by the Sage code found in Appendix A.2, which was derived directly from Corollary 4.4 and can be used to compute $\mathbb{I}(Z_3, q)$ to arbitrary precision for any $q \ge 2$.
5 Bounding $\mathbb{I}(Z_n, q)$ for Arbitrary $n$
This programme is not practical for $n$ in general. The number of cases for a generalization of Lemma 3.1 is likely to grow with $n$. Even if that stabilizes somehow, the expression for calculating $\mathbb{I}(Z_n, q)$ requires $n$ nested infinite series. Nevertheless, ignoring some of the more subtle details, we proceed with this method to obtain computable upper bounds for $\mathbb{I}(Z_n, q)$.
Fix a $Z_{n-2}$-instance $L$ of length $\ell \ge 1$, and let $\hat{b}^\ell_m$ be the number of words of length $m$ of the form $LAL$ for $A \ne \varepsilon$ but not of the form $LBLBL$, $LBLLBL$, or $LBLCLBL$. This corresponds to Stage I from the proof of Theorem 4.2. As we do not account for the structure of $L$, $\hat{b}$ is an overcount for the number of $Z_{n-1}$-instances of the form $LAL$ that do not have a $Z_{n-1}$-bifix of the form $LAL$. Then $\hat{b}_m = \hat{b}^\ell_m$ is recursively defined as follows:
The associated generating function $\hat{f}_\ell(x) := \hat{f}$
Now $\hat{f}_\ell(q^{-2})$ gives an upper bound for the limit (as word-length approaches infinity) of the probability that a word is a $Z_n$-instance of the form $LALBLAL$ with $|L| = \ell$. Taking this one step further, for some $Z_i$-instance $K$ of length $\ell_i$, the asymptotic probability that a word is a $Z_n$-instance constructed with $2^{n-i+1}$ copies of $K$ is at most
Consequently,
We need to get control of the tails to turn this into a computable sum. A trivial upper bound for the asymptotic probability that a word is a $Z_n$-instance constructed with $2^{n-i}$ copies of $K$, and thus starts and ends with $K$, is $q^{-2\ell_i}$. Since there are at most $q^{\ell_i}$ $Z_i$-instances of length $\ell_i$, the asymptotic probability that a word is a $Z_n$-instance with a $Z_i$-component of length $\ell_i$ is at most $q^{-\ell_i}$. Therefore, the asymptotic probability that a word is a $Z_n$-instance with a $Z_i$-component of length greater than $N_i$ is at most
Now in the upper bound of $\mathbb{I}(Z_n, q)$, we can replace the partial tail
Therefore, $\mathbb{I}(Z_n, q) \le$

A Proofs and Computations for Sections 3 and 4
A.1 Proofs of Monotonicity