The Variance of the Profile in Digital Search Trees

What we call today a digital search tree (DST) is Coffman and Eve’s sequence tree, introduced in 1970. A digital search tree is a binary tree whose ordering of nodes is based on the values of bits in the binary representation of a node’s key. In fact, a digital search tree is a digital tree in which strings (keys, words) are stored directly in internal nodes. The profile of a digital search tree is a parameter that counts the number of nodes at the same distance from the root. In this paper we concentrate on the external profile, i.e., the number of external nodes at level k when n strings are sorted. Assuming that the n input strings are independent and follow a (binary) memoryless source, the asymptotic behaviour of the average profile was determined by Drmota and Szpankowski (2011). The purpose of the present paper is to extend their analysis and to provide a precise analysis of the variance of the profile. The main (technical) difference is that we have to deal with an inhomogeneous part in the functional-differential equations satisfied by the second moment and the Poisson variance. However, we show that the variance is asymptotically of the same order as the expected value, which implies concentration. These results are derived by methods of analytic combinatorics such as generating functions, the Mellin transform, Poissonization, the saddle point method and singularity analysis.


Introduction
In 1970, Coffman and Eve published a paper that introduced several modifications to hashing schemes using trees [1]. The paper included three types of trees and some analysis of their basic properties. These trees are known today by different names. Coffman and Eve's sequence trees are what today we call digital search trees. The method is well suited for storing data when the bit composition (or letter composition) of the data is available [16]. In fact, digital search trees and tries are two classes of so-called log n-trees, and their construction is based on digital keys and not on the order structure of the keys as in the case of binary search trees [2]. They are important in many computer science applications like data compression, pattern matching and hashing. For example, the popular Lempel-Ziv compression scheme is strongly related to digital search trees.
parsing scheme and digital search trees by assuming a random model based on independent memoryless input strings. They also analyzed digital tries with Markovian dependency [6]. There is a more general random model of dynamical sources but in this case the technicalities are usually more involved [13,23]. Furthermore, Louchard and Szpankowski [15] studied the average profile and limiting distribution for the phrase size in the Lempel-Ziv parsing algorithm, Jacquet and Régnier [9] discussed the limiting distribution in a trie partitioning process, and Knessl and Szpankowski [10] analyzed the asymptotic behavior of the height in a digital search tree and the longest phrase of the Lempel-Ziv scheme. They also considered the average profile of symmetric digital search trees [11] (which means that all letters in the memoryless source appear with the same frequency). Louchard [14] provided an exact and asymptotic distribution in digital search trees.
Recently Park et al. [17,18] studied the external and internal profiles of tries (with memoryless input strings) and provided a very detailed picture of the asymptotic behaviour of the profile (expected value, variance, limiting distribution). For digital search trees (DST) Drmota and Szpankowski [3] proved asymptotic results on the average profile which are similar to those for tries. The purpose of this paper is to extend their analysis to asymptotics for the variance of the profile.
Before we state our results we recall the construction of a digital search tree (DST), which stores (binary) input strings in its internal nodes. The root contains the first string, and the next string occupies the left or the right child of the root depending on whether its first symbol is 0 or 1. At each level of the tree a different bit of the key is checked; if the bit is 0, the search continues down the left subtree, if it is 1, the search continues down the right subtree. The remaining strings are stored in available nodes which are directly attached to nodes already existing in the tree, as shown in Figure 1 [12,16,21].
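The construction just described can be sketched in a few lines of Python. This is only an illustration under our own conventions: the names `Node`, `dst_insert`, `build` and `external_profile` are hypothetical, keys are given as tuples of bits, and every key is assumed to have at least as many bits as there are keys in the tree, so that the bit inspected at each level always exists.

```python
from collections import Counter

class Node:
    """Internal node of a digital search tree; it stores one key."""
    def __init__(self, key):
        self.key = key
        self.child = [None, None]  # child[0]: left (bit 0), child[1]: right (bit 1)

def dst_insert(root, key):
    """Insert `key` (a sequence of 0/1 bits) and return the root of the tree."""
    if root is None:
        return Node(key)
    node, level = root, 0
    while True:
        bit = key[level]               # the bit checked at this level picks the branch
        if node.child[bit] is None:    # first free slot on the path: store the key here
            node.child[bit] = Node(key)
            return root
        node, level = node.child[bit], level + 1

def build(keys):
    """Build a digital search tree by inserting the keys in the given order."""
    root = None
    for key in keys:
        root = dst_insert(root, key)
    return root

def external_profile(root):
    """Return a Counter mapping each level k to the number of external nodes at level k."""
    profile = Counter()
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        if node is None:
            profile[level] += 1        # an empty slot counts as an external node
        else:
            stack.append((node.child[0], level + 1))
            stack.append((node.child[1], level + 1))
    return profile
```

For example, `external_profile(build([(0, 0), (0, 1), (1, 0)]))` counts four external nodes at level 2, since the second and third key occupy the two children of the root.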
Let $B_{n,k}$ be the number of external nodes at level $k$ when $n$ strings are sorted. We study the external profile built over $n$ binary strings generated by a memoryless source, that is, we assume each string is a binary i.i.d. sequence with $p$ being the probability of a "1" ($0 < p < 1$); we also use $q := 1 - p$ and assume that $p < q$. We also mention that symmetric DST's, i.e. $p = q = \frac{1}{2}$, are not covered by our analysis because for $p < q$, $g(s, w)$ is an analytic function for $|w| < 1/T(s)$, where $T(s) = p^{-s} + q^{-s}$, and the saddle point analysis fails for $p = q$ [3]. In this case we expect a completely different behaviour and also the use of different methods; see also the discussion of the symmetric case in [3].
In order to state our main result we need the following notation. For a real number $\alpha$ with $(\log\frac{1}{p})^{-1} < \alpha < (\log\frac{1}{q})^{-1}$, let $\rho = \rho(\alpha)$ be defined by the equation [...] Explicitly, $\rho(\alpha)$ is given by [...] Furthermore set [...]
Theorem 1 Let $\mathrm{Var}(B_{n,k})$ denote the variance of the profile in unbalanced digital search trees with underlying probabilities $0 < p < q = 1 - p$. Let $k$ and $n$ be positive integers such that $k/\log n$ satisfies (log [...] where $\rho_{n,k} = \rho(k/\log n)$ and $L(\rho, x)$ is a non-zero periodic function with period 1 in $x$.
Remark 1 The function $L(\rho, x)$ can be represented as $L(\rho, x)$ [...], where $f(s)$ has the form $f(s) = g(s - 1, 1/T(s))\,D(s, 1/T(s))$ and the functions $g(s, w)$ and $D(s, w)$ are described in (17) and in Lemma 3, respectively. However, this representation is not really explicit.
This theorem shows that the variance $\mathrm{Var}(B_{n,k})$ is of the same order of magnitude as the expected value $\mathbf{E}(B_{n,k})$. In particular it follows that $B_{n,k}$ is concentrated around its expected value. The function $L$ has small amplitude, and its oscillations are consequences of an infinite number of saddle points appearing in the integrand of the associated Mellin transform. The only difference between the two asymptotic formulas (for the expected value and for the variance) lies in the function $L$, which is built from a different function $f$ in each case. The same phenomenon has been observed for tries [17,18], where it was also shown that $B_{n,k}$ satisfies a central limit theorem. It would be natural to expect a corresponding behaviour for DST's; however, the methods of [17,18] are not applicable in the present situation.
In order to analyze the variance we use the so-called Poisson variance $V_k(x)$ as a (hopefully) good approximation of the variance of the profile $\mathrm{Var}(B_{n,k})$ [18] if we set $x = n$. It is defined by [...] where [...] and $\Delta_k'(x, 1)$ and $\Delta_k''(x, 1)$ denote the first and the second derivative of $\Delta_k(x, u)$ with respect to $u$ at $u = 1$, respectively. It turns out that the Poisson variance satisfies a recurrence (10) that is quite similar to that of the "Poisson expectation" $\Delta_k^{(1)}(x)$, see (5); in particular, the inhomogeneous part in (10) is relatively small. This suggests that the asymptotics of the variance should be of the same order of magnitude as for the expected value. In order to make this heuristic rigorous we have to overcome several technical difficulties. First we have to obtain an explicit solution for $F_k^{(3)}(s)$ of (11), which is the most difficult part (Section 2). Then we have to find proper asymptotics for $F_k^{(3)}(s)$ and to invert the Mellin transform of $V_k(x)$. This leads us to an infinite number of saddle points (cf. also [3] and [18]). The final step will be to show that the Poisson variance $V_k(n)$ is asymptotically equal to $\mathrm{Var}(B_{n,k})$. The reader is referred to [3] and [18] for a detailed discussion of the above mentioned tools that belong to analytic combinatorics.

Generating Functions
In this section we recall (and extend) the combinatorial analysis of DST's with the help of generating functions.

Basic relations
As already introduced, $B_{n,k}$ denotes the (random) number of external nodes at level $k$ in a digital search tree built over $n$ strings generated by a memoryless source with parameter $q > p = 1 - q$. We recall the initial conditions [...] The probability generating function of the external profile, $P_{n,k}(u) = \mathbf{E}\,u^{B_{n,k}}$, satisfies the following recurrence relation (cf. [7]) [...] with initial conditions $P_{0,k}(u) = 1$ for $k \ge 1$, $P_{0,0}(u) = u$, $P_{n,0}(u) = 1$ for $n \ge 1$. The corresponding exponential generating function [...]. By taking derivatives with respect to $u$ and setting $u = 1$ we obtain for the exponential generating function [...] the following functional recurrence [...] with initial conditions $E_0^{(1)}(x) = 1$ and $E^{(1)}$ [...] with initial conditions $\Delta_0^{(1)}(x) = e^{-x}$ and $\Delta^{(1)}$ [...]. Similarly, by taking second derivatives with respect to $u$ and setting $u = 1$ we obtain for the exponential generating function [...] the following functional recurrence [...] with initial conditions $E^{(2)}$ [...] where $w_k(x) = 2\Delta^{(1)}$ [...] ${}_0(x) = e^{-x}$ and $\Delta^{(2)}$ [...]. By induction it is easy to prove that $\Delta_k^{(2)}(x)$ can be represented as a finite linear combination of functions of the form $e^{-p^{\ell_1} q^{\ell_2} x}$ with $\ell_1, \ell_2 \ge 0$, and of products of two of these functions. Hence, the Mellin transform of $\Delta_k^{(2)}(x)$ (see [5]) exists for all $s$ with $\Re(s) > 0$. Since $B_{n,k} = 0$ for $k > n$ it follows that $E^{(2)}$ [...] $(s)$ actually exists for $s$ with $\Re(s) > -k$.
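For small $n$ the mean and the variance of $B_{n,k}$ can be computed exactly, which gives a convenient check on any asymptotic formula. The following Python sketch does not use the generating functions of this section. It assumes the usual splitting description of a DST over a memoryless source: the root stores the first of $n+1$ strings, the remaining $n$ strings go to one subtree in number $j$ with probability $\binom{n}{j}p^j q^{n-j}$, and the two subtrees are independent given $j$. The function name `profile_moments` and the table layout are our own.

```python
from math import comb

def profile_moments(n_max, k_max, p):
    """Exact mean and variance of the external profile B_{n,k}, as tables indexed [n][k]."""
    q = 1.0 - p
    # m1[n][k] = E[B_{n,k}],  m2[n][k] = E[B_{n,k} (B_{n,k} - 1)]  (second factorial moment)
    m1 = [[0.0] * (k_max + 1) for _ in range(n_max + 1)]
    m2 = [[0.0] * (k_max + 1) for _ in range(n_max + 1)]
    m1[0][0] = 1.0  # empty tree: one external node at level 0, i.e. P_{0,0}(u) = u
    for n in range(n_max):                  # row n + 1 is computed from rows 0..n
        for k in range(1, k_max + 1):
            s1 = s2 = 0.0
            for j in range(n + 1):
                weight = comb(n, j) * p**j * q**(n - j)
                a1, b1 = m1[j][k - 1], m1[n - j][k - 1]
                a2, b2 = m2[j][k - 1], m2[n - j][k - 1]
                s1 += weight * (a1 + b1)
                # second derivative of a product of two PGFs at u = 1
                s2 += weight * (a2 + 2.0 * a1 * b1 + b2)
            m1[n + 1][k] = s1
            m2[n + 1][k] = s2
    var = [[m2[n][k] + m1[n][k] - m1[n][k] ** 2 for k in range(k_max + 1)]
           for n in range(n_max + 1)]
    return m1, var
```

As a sanity check, for $n = 3$ and $k = 2$ the profile equals 4 with probability $2pq$ and 1 otherwise, so that $\mathbf{E}(B_{3,2}) = 1 + 6pq$ and $\mathrm{Var}(B_{3,2}) = 18pq(1 - 2pq)$; the tables returned by `profile_moments` reproduce these values. The sketch is meant for moderate $n$ only, since the binomial weights are evaluated in floating point.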
Let us now express $\Delta_k^{*(2)}(s)$ as [...] where $\Gamma(s)$ is the Euler gamma function. By definition we know that $F_k^{(2)}(s)$ is a finite linear combination of functions $a^{-s}$ (with certain values of $a$). Thus, $F_k^{(2)}(s)$ is an entire function. Furthermore (7) translates into $F^{(2)}$ [...] where [...] and $F_0^{(2)}(s) = 1$. Note that (8) does not only hold for $\Re(s) > -k$, where the Mellin transform exists. Since $F_k^{(2)}(s)$ continues analytically to an entire function, (8) holds for all $s$, too. The inhomogeneous part in (8) is very large compared to the order of magnitude of the homogeneous equation for the first moment (compare with [3]). Since $F_k^{(1)}(s)$ behaves geometrically as $T(s)^k$ it seems that the term $F_{k+1}^{(1)}(s - 1)$ is negligible compared to the other two terms in (9) [3]. This phenomenon will also occur for $F_k^{(3)}(s)$ (Section 2.3).
We introduce the so-called Poisson variance [...] which should be a good approximation for the variance of the profile ([8]). By (5) and (7), $V_k(x)$ satisfies [...] with initial condition $V_0(x) = e^{-x}(1 - e^{-x})$ and $V_k(0) = 0$ ($k \ge 1$). The Mellin transform of $V_k(x)$ is then given as [...] and again, we can use a factorization of the form [...] where $V_k^*(s)$ and $F_k^{(3)}(s)$, and $F_k(s)$ respectively. In particular, (10) translates into [...] where $H_k^{(3)}(s)$ is the Mellin transform of $V_k(x)$ that exists for $\Re(s) > -k$. We will use this property several times.

Analysis of $F_k^{(1)}(s)$
Before we find a solution of (8) or (11) we recall (and extend) some facts about $F_k^{(1)}(s)$ (see [3]). We define the power series [...] Furthermore we introduce the function operator $A$ as follows [...] We also set [...] It is easy to compute $R_k(s)$ for a few small values of $k$. For example, [...].
In general we have a representation of the form [...] where the coefficients $c_{k,i}$ are uniformly bounded by [...] for some constant $C$ depending only on $p$ and $q$.
It is also easy to verify that $R_k(s)$ satisfies the recurrence [...] and satisfies $R_k(-\infty) = 0$ ($k \ge 1$). From (14) and (16) we obtain $g(s, w) = g(s - 1, w)/(1 - wT(s))$ and consequently [...] This shows that $w = 1/T(s)$ is the dominant polar singularity of the mapping $w \mapsto g(s, w)$ (if $s$ is sufficiently close to the real axis) and it also follows that [...] as $k \to \infty$. Actually this asymptotic relation is uniform for real $s \ge 0$. By (15) it also follows that [...] Next we turn to $F_k^{(1)}(s)$, which satisfies the same recurrence (16) as $R_k(s)$. Hence we expect a similar representation for the generating function $f^{(1)}(s, w)$. However, we have different initial conditions, namely $F_k^{(1)}(0) = 0$ ($k \ge 1$) or $f^{(1)}(0, w) = 1$. Anyway, it follows (as above) that [...] Note that the mapping $w \mapsto 1/g(0, w)$ is an entire function. Hence, $w = 1/T(s)$ is again the dominant polar singularity (if $s$ is sufficiently close to the real axis). Consequently [...] This is also one of the main observations of [3]. It follows from (18) that $F_k^{(1)}(s)$ is a linear combination of $R_\ell(s)$, $0 \le \ell \le k$, which also shows that $F_k^{(1)}(-\infty)$ exists because $R_k(0) = 0$ for all $k$ (see the proof of Theorem 3 in [3]). Furthermore, since the ratio $f^{(1)}(s, w)/g(s, w) = 1/g(0, w)$ is independent of $s$ it follows that $\frac{f^{(1)}(0, w)}{g(0, w)} = \frac{f^{(1)}(-\infty, w)}{g(-\infty, w)}$.
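The geometric growth governed by the pole $w = 1/T(s)$ can be observed numerically. Iterating the relation $g(s, w) = g(s - 1, w)/(1 - wT(s))$ suggests a product of the factors $1/(1 - wT(s - j))$, $j \ge 0$. The Python sketch below expands a truncated version of this product into a power series in $w$; the normalization $g(-\infty, w) = 1$, the truncation parameter `J`, and the function names are assumptions of the sketch and are not taken from the text.

```python
def T(s, p):
    """T(s) = p^(-s) + q^(-s) for real s, with q = 1 - p."""
    q = 1.0 - p
    return p ** (-s) + q ** (-s)

def g_coeffs(s, p, K, J):
    """Coefficients c[0..K] of prod_{j=0}^{J} 1 / (1 - w * T(s - j)) as a power series in w."""
    c = [1.0] + [0.0] * K
    for j in range(J + 1):
        a = T(s - j, p)
        # dividing by (1 - a*w): new[k] = old[k] + a * new[k-1], done in place for increasing k
        for k in range(1, K + 1):
            c[k] += a * c[k - 1]
    return c
```

With $p = 0.3$ and $s = 1$, the ratio of consecutive coefficients `c[k] / c[k - 1]` approaches $T(1) = 1/p + 1/q$ very quickly, because the factor with $j = 0$ has the pole of smallest modulus.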

Remark 2 The analysis of $F_k^{(1)}(s)$ that was given in [3] is not as direct as the approach given here. In [3] it was observed that the recurrence (9) translates to the relation [...] which can be translated to the power series representation (18). Both approaches use the fact that the limit [...]

Analysis of $F_k^{(3)}(s)$
The main difference between the recurrence (9) for $F_k^{(1)}(s)$ and the recurrence (11) for [...] The main problem with this inhomogeneous part is, as we will show in a moment, that the limit $H_k^{(3)}(-\infty)$ does not exist. First we derive a proper representation for $H_k^{(3)}(s)$.
Lemma 1 The function $H$ [...] where [...] (with $c = \Re(s)/2$). Furthermore we have uniformly for [...] and $H^{(3)}$ [...]
Proof: The representations (22) and (23) are immediate from (21) and from (19). We can also use the property that $(e^{-ax} e^{-bx})' = -(a + b)e^{-(a+b)x}$ and that its Mellin transform is given by $-\Gamma(s)(a + b)^{-s+1}$.
Next, we use the estimate $|F^{(1)}$ [...] Finally, by using (20) and the explicit representation (22) we also obtain (24). □
The next lemma is crucial for the solution of our problem.
Proof: Let $L_k(s)$ be defined by [...] By definition it is clear that $L_k(s)$ is a linear combination of terms of the form $(p^{k_1} q^{k_2})^{-s}$, where all coefficients have the same sign. Furthermore, since $1/g(s, w)$ is an entire function, we have $L_k(s) = O(\eta^k)$ for every (fixed) $\eta > 0$.
We will show that for every $k \ge 1$ there exists a function $D_k(s)$ with [...] and with an upper bound of the form [...] By Lemma 1 and by the definition of [...] is a linear combination of functions of the form $a^{-s}$ (for certain positive numbers $a$), where all coefficients have the same sign. Furthermore, there is only a bounded number of real numbers $a$ appearing there with $a \ge 1$. Recall that $a$ is of the form $p^{k_1} q^{k_2} p^{i} q^{\ell_1 - i} + p^{j} q^{\ell_2 - j}$.
This bound is also independent of $k$, $m$, $\ell_1$, $\ell_2$. Next observe that [...] for $a \ne 1$. Hence, each term of the form $a^{-s}$ can be written as a difference of the form $d(s) - d(s - 1)$.
Of course, this implies the existence of a function $D_k(s)$ satisfying (27).
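To make the difference representation concrete: for $a \ne 1$ one admissible choice, given here as an illustration and not necessarily in the form used in the original display, is $d(s) = a^{-s}/(1 - a)$, since

```latex
d(s) - d(s-1)
  = \frac{a^{-s}}{1-a} - \frac{a^{-(s-1)}}{1-a}
  = \frac{a^{-s}\,(1-a)}{1-a}
  = a^{-s}, \qquad a \neq 1 .
```

Summing such terms over all $a^{-s}$ that occur, with their coefficients, produces a function whose difference between $s$ and $s - 1$ reproduces the given linear combination.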
Let us consider one of the terms $L_{k-m}(s)\,G_{\ell_1,\ell_2}(s)$. Recall that we already know that [...], where $\eta > 0$ is arbitrary. We split it up into a sum [...] where $A(s)$ contains all terms of the form $a^{-s}$ with $a < 1$ and $B(s)$ the remaining ones. Since all coefficients have the same sign, $A(s)$ and $B(s)$ can be bounded from above by $A(s) + B(s)$. By the above observation we can represent $A(s)$ as [...] Note also that the constants that are implied by the O-notation are uniform; they only depend on $p$ and $q$. Putting all these estimates together we immediately deduce (28).
Finally, we set [...] which is analytic for $|w| < 1/T(\Re(s)/2 - 1)^2$. □
We are now ready to state and prove a proper representation for $f^{(3)}(s, w)$.
We recall that the dominating singularity of $g(s, w)$ is $w = 1/T(s)$. Since we have the relation [...] the function $D(s, w)$ is analytic in a region containing this singularity. Hence $g(s, w)$ dominates the behaviour of $f^{(3)}(s, w)$, as was the case with $f^{(1)}(s, w)$. Therefore we can expect that $F_k^{(3)}(s)$ behaves similarly to $F_k^{(1)}(s)$, which turns out to be true, see Section 3.
Proof: From (11) it follows that $f^{(3)}(s, w)$ satisfies [...] Hence, $D(s, w)$ and $D_0(s, w)$ differ by a function $C(s, w)$ that is periodic in $s$ (with period 1). However, since $\Delta_k^{(1)}(x)$ and $\Delta_k^{(2)}(x)$ are linear combinations of functions of the form $e^{-(p^i q^j)x}$ it follows that the Mellin transform of [...] has no periodic parts. Consequently, we have [...] for some function $K(w)$.
Next we observe that for every non-negative integer $m$ the function $f^{(3)}(-m, w)$ is a polynomial in $w$. This follows from the fact that $F$ [...] We now choose $-m \le \Re(s)$. Then this representation implies that $D(s, w)$ is analytic for $|w| < 1/T(\Re(s)/2 - 1)^2$. □

Asymptotic Analysis
We now prove Theorem 1 by establishing the asymptotic behavior of the variance of the profile. The plan of the proof is as follows. First we concentrate on the singularity analysis of $f^{(3)}(s, w)$ and obtain an asymptotic expansion for $F_k^{(3)}(s)$. Second we invert the Mellin transform $V_k^*(s)$ by a proper saddle point method applied to the dominant part $T(s)^k x^{-s} = \exp(k \log T(s) - s \log x)$ of the integrand in the inverse Mellin integral. Finally we show $\mathrm{Var}(B_{n,k}) \sim V_k(n)$ by standard depoissonization methods [18].
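The real saddle point of the dominant part can be written down explicitly. Setting the derivative of $k \log T(s) - s \log x$ to zero gives $T'(s)/T(s) = \log x / k$ with $T(s) = p^{-s} + q^{-s}$, and solving for $s$ yields a closed form. The Python sketch below implements this computation; the function names are our own, and identifying the resulting value with $\rho(k/\log n)$ of Theorem 1 is an assumption of the sketch, not a statement taken from the text.

```python
import math

def log_T_slope(s, p):
    """d/ds log T(s) for T(s) = p^(-s) + q^(-s), real s."""
    q = 1.0 - p
    a, b = p ** (-s), q ** (-s)
    return (a * math.log(1 / p) + b * math.log(1 / q)) / (a + b)

def exponent(s, k, x, p):
    """The exponent k*log T(s) - s*log x of the dominant part T(s)^k x^(-s)."""
    q = 1.0 - p
    return k * math.log(p ** (-s) + q ** (-s)) - s * math.log(x)

def real_saddle_point(k, x, p):
    """Real s with d/ds [k*log T(s) - s*log x] = 0, in closed form (requires p != q)."""
    q = 1.0 - p
    t = math.log(x) / k                       # equals 1/alpha for alpha = k / log x
    hi, lo = math.log(1 / p), math.log(1 / q)
    if not (min(lo, hi) < t < max(lo, hi)):
        raise ValueError("k / log x lies outside the admissible range")
    # the slope equals lo + (hi - lo) / (1 + (p/q)^s); solve this for s
    return math.log((hi - t) / (t - lo)) / math.log(p / q)
```

The range check mirrors the condition $(\log\frac{1}{p})^{-1} < k/\log n < (\log\frac{1}{q})^{-1}$: outside this window the equation has no real solution.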

Singularity Analysis of $f^{(3)}(s, w)$
In order to obtain asymptotic information for $F_k^{(3)}(s)$ we will analyze the generating function $f^{(3)}(s, w)$, which by Lemma 3 is given by $f^{(3)}(s, w) = D(s, w)\,g(s, w)$.
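The analysis in this section relies on the fact that $T(s) = p^{-s} + q^{-s}$ is quasi-periodic along vertical lines: a shift of $s$ by $2\ell\pi i/\log(q/p)$, with $\ell$ an integer, multiplies $T(s)$ by a factor of modulus one. This is what makes the real saddle point recur infinitely often. A small numerical check in Python, with hypothetical helper names and an arbitrary test point:

```python
import cmath
import math

def T(s, p):
    """T(s) = p^(-s) + q^(-s) for complex s, with q = 1 - p."""
    q = 1.0 - p
    return cmath.exp(-s * math.log(p)) + cmath.exp(-s * math.log(q))

def shift(ell, p):
    """The purely imaginary shift 2*ell*pi*i / log(q/p)."""
    q = 1.0 - p
    return 2j * math.pi * ell / math.log(q / p)

def phase(ell, p):
    """The unimodular factor exp(-2*ell*pi*i*log(p) / log(q/p))."""
    q = 1.0 - p
    return cmath.exp(-2j * math.pi * ell * math.log(p) / math.log(q / p))

s0, p0 = 0.7 + 0.2j, 0.3   # arbitrary illustrative values
for ell in (-2, -1, 1, 3):
    lhs = T(s0 + shift(ell, p0), p0)
    rhs = phase(ell, p0) * T(s0, p0)
    assert abs(lhs - rhs) < 1e-12
```

Since the factor has modulus one, $|T(s)^k x^{-s}|$ takes the same value at all shifted points on a vertical line.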
Since we will apply the inverse Mellin transform to $V_k^*(s) = \Gamma(s)F_k^{(3)}(s)$ we have to know the behaviour in a strip $\Re(s) \in [a, b]$ for given real numbers $a$, $b$.
Lemma 4 For every real interval $[a, b]$ there exist $k_0$, $\gamma > 0$ and $\varepsilon > 0$ such that [...] uniformly for all $s$ with $\Re(s) \in [a, b]$, $|\Im(s) - 2\ell\pi/\log(q/p)| \le \varepsilon$ for some integer $\ell$ and $k \ge k_0$, where $f(s) = g(s - 1, 1/T(s))\,D(s, 1/T(s))$ (see Remark 1) is an analytic function that satisfies $f(-r) = 0$ for $r = 1, 2, \ldots$ and is bounded in this region. Furthermore, if $|\Im(s) - 2\ell\pi/\log(q/p)| > \varepsilon$ for all integers $\ell$ then we have [...] for some $\eta > 0$ that can be chosen to be uniform for $\Re(s) \in [a, b]$.
Furthermore, by the representation (17) it is clear that $w = 1/T(s)$ is the dominant (and polar) singularity of $g(s, w)$ if $s$ is sufficiently close to the real axis, that is, $|\Im(s)| \le \varepsilon$ for some $\varepsilon > 0$ that can be chosen to be uniform for $\Re(s) \in [a, b]$. Since $T(s + 2\ell\pi i/\log(q/p)) = e^{-2\ell\pi i \log(p)/\log(q/p)}\,T(s)$ [...]
Next we apply Cauchy's formula for a contour of integration on the circle $|w| = e^{\gamma}/T(s)$ (for some sufficiently small $\gamma > 0$); by the residue theorem it follows that [...] These estimates are uniform for $s$ contained in a compact interval $[a, b]$. However, note that $f(-r) = 0$ for non-negative integers $r$, since $F_k^{(3)}(-r) = 0$. Hence, if the interval $[a, b]$ contains a non-positive integer then we multiply by $\Gamma(s)$ and do exactly the same calculations, which give (after all) the expansion (29).

Proof of Theorem 1
In this section, we show that the Poisson variance is asymptotically equal to the variance of the profile, i.e., $V_k(n) \sim \mathrm{Var}(B_{n,k})$. For this purpose, we use the same technique as in [17,18]. First we prove the following lemma that is necessary for proving our main result.