Digital search trees with m trees : Level polynomials and insertion costs

Helmut Prodinger

The following sentences from our own Prodinger (1995), which never appeared in a proper journal, can be reproduced here almost verbatim: A DST is constructed like a binary search tree, but the decision to go down to the left or right is made according to the representation of the key as a binary string of bits. If the first bit is 0, the item goes to the left, otherwise to the right. Then the second bit is responsible for left or right, etc., until there is an empty node where the item can be stored. Decisions are made independently.
The classic book Knuth (1973) contains a more elaborate description. Now, in order to study the average search costs in a DST built from n random data (i.e., in every decision, 0 and 1 are equally likely), one studies the polynomial $H_n(u)$, which has as the coefficient of $u^k$ the expected number of nodes on level k. By convention, the root is at level 0.
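The construction just described can be sketched in code. The following is an illustrative simulation (all names are ours, not from the paper): it builds a DST from n random bit strings and tallies the nodes per level, so that summing the level counts recovers n, i.e., $H_n(1)=n$.

```python
import random

class Node:
    """One stored item; children are reached via the item's bits."""
    def __init__(self):
        self.left = None   # taken when the next bit is 0
        self.right = None  # taken when the next bit is 1

def dst_insert(root, bits):
    """Insert an item into a digital search tree: bit 0 sends it left,
    bit 1 sends it right, until an empty node stores the item."""
    if root is None:
        return Node()
    node = root
    for b in bits:
        if b == 0:
            if node.left is None:
                node.left = Node()
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node()
                return root
            node = node.right
    raise RuntimeError("ran out of bits")

def level_counts(root):
    """Number of nodes on each level; the root is on level 0."""
    counts, frontier = [], [root] if root else []
    while frontier:
        counts.append(len(frontier))
        frontier = [c for nd in frontier
                    for c in (nd.left, nd.right) if c is not None]
    return counts

rng = random.Random(1)
n = 100
root = None
for _ in range(n):
    key = [rng.randint(0, 1) for _ in range(n)]  # n random bits always suffice
    root = dst_insert(root, key)
counts = level_counts(root)
assert counts[0] == 1 and sum(counts) == n  # one root; all n items stored
```

Averaging `counts` over many runs gives an empirical version of the coefficients of the level polynomial.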
DSTs and Approximate Counting are very similar when it comes to analysis; see for instance Prodinger (1992). In 2011, Cichoń (Cichoń and Macyna (2011)) had a novel idea about Approximate Counting, introducing an additional parameter m; see also Prodinger (2011).
Translating this idea to the world of DSTs goes as follows: instead of keeping one DST, we keep m DSTs, and each incoming datum is attached to one of them. For algorithmic purposes (insertion and searching), this choice must be deterministic, but for the analysis we assume that a random DST is chosen with probability $\frac1m$. The meaning of the level polynomial is then obvious: the coefficient of $u^k$ is the expected number of data on level k in all m DSTs combined.
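The m-tree model can likewise be simulated. The following sketch (names ours; the random choice with probability 1/m follows the analysis model above) stores N data in m DSTs and records the level of every datum, so the combined counts satisfy $H_N(1)=N$.

```python
import random

def dst_insert(root, bits):
    """DST as nested dicts {0: subtree, 1: subtree}.
    Returns (root, level), where level is where the item was stored."""
    if root is None:
        return {}, 0
    node, level = root, 0
    for b in bits:
        level += 1
        if b not in node:
            node[b] = {}
            return root, level
        node = node[b]
    raise RuntimeError("ran out of bits")

rng = random.Random(7)
m, N = 3, 300
roots = [None] * m
levels = []                      # level of each stored datum, all trees combined
for _ in range(N):
    i = rng.randrange(m)         # tree chosen with probability 1/m
    bits = [rng.randint(0, 1) for _ in range(N)]
    roots[i], lvl = dst_insert(roots[i], bits)
    levels.append(lvl)
# all N data are accounted for; at most m of them sit at level 0 (the roots)
assert len(levels) == N
assert levels.count(0) == sum(r is not None for r in roots) <= m
```

The histogram of `levels`, averaged over runs, approximates the coefficients of $H_N(u)$.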
Considering m-DSTs is equivalent (by adding an extra root) to the investigation of a tree structure with a prescribed root degree.This is not uncommon in Combinatorics and Computer Science; we just give one citation as an illustration: Kemp (1980).

An explicit formula for the level polynomials
We start from the classic formula (derived, e.g., in Prodinger (1995)) for the level polynomials $h_n(u)$ of classical DSTs and note that $h_n(1) = n$. Then the level polynomials $H_N(u)$ for the m-version satisfy
$$H_N(u)=\sum_{n_1+\cdots+n_m=N}\binom{N}{n_1,\ldots,n_m}m^{-N}\bigl(h_{n_1}(u)+\cdots+h_{n_m}(u)\bigr)=m\sum_{n=0}^{N}\binom{N}{n}m^{-n}\Bigl(1-\frac1m\Bigr)^{N-n}h_n(u),$$
which is easy to see since a total of N data splits into $n_1,\ldots,n_m$ data each, building the m DSTs, and the individual level polynomials have to be added; the second form follows by the symmetry of the m trees. The final explicit formula for the coefficients of $H_N(u)$ appears at the end of the following computations in Theorem 2.1. We can simplify, and the difference from the classical case is just an extra factor $m^{1-k}$. Note again that for our DST application we have $q=\frac12$, although the computations hold for general q.
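The binomial splitting above can be checked exactly in code. The sketch below computes $h_n(u)$ from the standard DST recurrence $h_{n+1}(u)=1+2u\,2^{-n}\sum_k\binom{n}{k}h_k(u)$ (taken from the classical literature, not displayed in this extract) and then mixes the trees binomially; it verifies $h_n(1)=n$ and $H_N(1)=N$. All names are ours.

```python
from fractions import Fraction
from math import comb

def poly_add(a, b):
    out = [Fraction(0)] * max(len(a), len(b))
    for i, c in enumerate(a): out[i] += c
    for i, c in enumerate(b): out[i] += c
    return out

def poly_scale(a, s):
    return [c * s for c in a]

def shift_u(a):                      # multiply by u: raise every level by one
    return [Fraction(0)] + a

def h_polys(n_max):
    """h_n(u) for a single DST (q = 1/2) via the classical recurrence."""
    h = [[]]                         # h_0 = 0: the empty tree
    for n in range(n_max):
        s = []
        for k in range(n + 1):       # binomial split of the n non-root items
            s = poly_add(s, poly_scale(h[k], Fraction(comb(n, k))))
        h.append(poly_add([Fraction(1)],
                          shift_u(poly_scale(s, Fraction(2, 2 ** n)))))
    return h

def H_poly(N, m, h):
    """Level polynomial of m DSTs: tree 1 holds Binomial(N, 1/m) data;
    multiply its expected polynomial by m for all trees combined."""
    out = []
    for n in range(N + 1):
        w = Fraction(comb(N, n)) * Fraction(1, m) ** n * Fraction(m - 1, m) ** (N - n)
        out = poly_add(out, poly_scale(h[n], w))
    return poly_scale(out, m)

h = h_polys(8)
assert all(sum(h[n], Fraction(0)) == n for n in range(9))   # h_n(1) = n
HN = H_poly(8, 3, h)
assert sum(HN, Fraction(0)) == 8                            # H_N(1) = N
```

For example, `h[3]` is $1+\frac32 u+\frac12 u^2$: with three items, the root is at level 0, the second item at level 1, and the third lands on level 1 or 2 with probability $\frac12$ each.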
Now we form a bivariate generating function.
We need the classic transformation of Heine (Andrews (1976)),
$$\sum_{n\ge0}\frac{(a;q)_n(b;q)_n}{(q;q)_n(c;q)_n}t^n=\frac{(b;q)_\infty(at;q)_\infty}{(c;q)_\infty(t;q)_\infty}\sum_{n\ge0}\frac{(c/b;q)_n(t;q)_n}{(q;q)_n(at;q)_n}b^n,$$
which is applied by setting $a = q$, $b = y$, $c = 0$, and $t = z$ and noticing that $\frac{(qz;q)_\infty}{(z;q)_\infty}=\frac{1}{1-z}$.
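Heine's transformation, in the specialization we read off from the parameters $a=q$, $b=y$, $c=0$, $t=z$ (our reading; the displayed equation was lost in this extract), reduces to $\sum_n(y;q)_n z^n=\frac{(y;q)_\infty}{1-z}\sum_n\frac{(z;q)_n}{(q;q)_n(qz;q)_n}y^n$, which can be checked numerically:

```python
def qpoch(a, q, n):
    """(a; q)_n = prod_{i=0}^{n-1} (1 - a q^i)."""
    p = 1.0
    for i in range(n):
        p *= 1.0 - a * q ** i
    return p

def heine_special(q, y, z, terms=80):
    """Both sides of Heine's transformation with a=q, b=y, c=0, t=z."""
    lhs = sum(qpoch(y, q, n) * z ** n for n in range(terms))
    y_inf = qpoch(y, q, 200)          # (y; q)_infinity, truncated
    rhs = y_inf / (1 - z) * sum(
        qpoch(z, q, n) * y ** n / (qpoch(q, q, n) * qpoch(q * z, q, n))
        for n in range(terms))
    return lhs, rhs

lhs, rhs = heine_special(0.5, 0.3, 0.4)
assert abs(lhs - rhs) < 1e-9
```

With $y=0$ both sides collapse to the geometric series $\frac{1}{1-z}$, a convenient sanity check.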
We can read off coefficients; this is the explicit formula that we wanted to derive. In it, we used an expansion due to Euler.

Theorem 2.1 The expected number of data on level l in an m-DST built from N random data is given by the explicit formula just derived. The instance m = 1 has been known before; see Louchard (1987); Prodinger (1995); Mahmoud (1992).

Insertion cost
The quantity $\frac{H_N(u)}{N}$ is the probability generating function of a random variable called insertion cost. While it is well studied in the classical instance, we will provide here average and variance for general (fixed) m, both explicitly and asymptotically. (The impatient reader can already jump forward to Theorem 3.1 for the results.) We need the products $\prod_{k\ge1}(1-2^{-k})$, common in Computer Science, as well as $(q)_n=(1-q)(1-q^2)\cdots(1-q^n)$.

Now we can engage in the asymptotics, using Rice's method. This method has been described in Flajolet and Sedgewick (1995): an alternating sum can be written as a contour integral,
$$\sum_{k=2}^{N}\binom{N}{k}(-1)^k f(k)=\frac{(-1)^N}{2\pi i}\oint_{\mathcal C} f(z)\,\frac{N!}{z(z-1)\cdots(z-N)}\,dz.$$
Here, the positively oriented curve $\mathcal C$ encloses the poles $2, 3, \ldots, N$, and no others. This formula follows from simple residue calculations. Note also that
$$\frac{N!}{z(z-1)\cdots(z-N)}=(-1)^{N+1}\,\frac{\Gamma(N+1)\Gamma(-z)}{\Gamma(N+1-z)}.$$
Extending the curve of integration, we encounter extra residues; in order to keep the formula correct, these residues must be subtracted. They give us the terms of the asymptotic expansion of interest. In all our examples there is a pole at $z = 1$, and it will give us the dominant contribution.
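The Gamma-function form of the Rice kernel, $\frac{N!}{z(z-1)\cdots(z-N)}=(-1)^{N+1}\frac{\Gamma(N+1)\Gamma(-z)}{\Gamma(N+1-z)}$, is the standard one from the Rice-method literature (our formulation, since the display was lost in this extract); it is easy to verify numerically:

```python
import math

def kernel_direct(N, z):
    """N! / (z (z-1) ... (z-N)), evaluated directly."""
    denom = 1.0
    for j in range(N + 1):
        denom *= (z - j)
    return math.factorial(N) / denom

def kernel_gamma(N, z):
    """The same kernel via Gamma functions:
    N!/(z(z-1)...(z-N)) = (-1)^(N+1) Gamma(N+1) Gamma(-z) / Gamma(N+1-z)."""
    return (-1) ** (N + 1) * math.gamma(N + 1) * math.gamma(-z) / math.gamma(N + 1 - z)

# agreement at several non-integer points (where all Gamma values exist)
for N in (3, 7, 12):
    for z in (0.5, -0.25, 3.7):
        d, g = kernel_direct(N, z), kernel_gamma(N, z)
        assert abs(d - g) < 1e-6 * abs(d)
```

It is this Gamma form that makes the residue computations at $z=1$ (i.e., $w=0$) mechanical.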
For convenience, we use w = z − 1, so that the expansions are around w = 0.
Just recently, in our paper Prodinger (2011), we used (with $Q=\frac1q=2$ and $L=\log Q=\log 2$) certain expansions around $w=0$. So we must consider
$$(q)_{w-1}\,m^{-w}\,\frac{\Gamma(N+1)\Gamma(-w-1)}{\Gamma(N-w)};$$
the residue can be computed by a computer. Now we obtain the expectation in the traditional way. For the second factorial moment we use an analogous expansion around $w=0$ and compute the residue, which we don't display in full. When we compute the variance via
$$\mathbb{V}=\mathbb{E}[X(X-1)]+\mathbb{E}[X]-\mathbb{E}[X]^2,$$
there are many cancellations. Altogether we summarize our results.

Theorem 3.1 Expectation and variance of the parameter insertion cost in m-DSTs admit asymptotic expansions in which $\delta_1(x)$ and $\delta_2(x)$ are tiny fluctuating functions that we did not compute explicitly here.
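The variance route via the second factorial moment, $\mathbb{V}=\mathbb{E}[X(X-1)]+\mathbb{E}[X]-\mathbb{E}[X]^2$, can be sanity-checked on any probability generating function; here with a binomial example (illustrative only, not from the paper):

```python
from fractions import Fraction
from math import comb

def pgf_moments(p):
    """Mean and variance from a PGF given by its coefficient list p
    (p[k] = P(X = k)), via E[X] = P'(1), E[X(X-1)] = P''(1),
    V[X] = P''(1) + P'(1) - P'(1)^2."""
    m1 = sum(k * c for k, c in enumerate(p))
    m2 = sum(k * (k - 1) * c for k, c in enumerate(p))
    return m1, m2 + m1 - m1 * m1

# Binomial(n, 1/2): mean n/2, variance n/4
n = 10
p = [Fraction(comb(n, k), 2 ** n) for k in range(n + 1)]
mean, var = pgf_moments(p)
assert mean == Fraction(n, 2) and var == Fraction(n, 4)
```

The same routine applied to $\frac{H_N(u)}{N}$ (with exact coefficients) reproduces the moments of the insertion cost for moderate N.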
The computation of the fluctuating functions is not difficult, and very similar computations have appeared in the relevant literature many times; here is an incomplete list of references: Kirschenhofer and Prodinger (1988); Kirschenhofer and Prodinger (1991); Flajolet and Sedgewick (1995). Note that the variance is (asymptotically) a constant (plus a tiny fluctuation) which does not depend on m; $\delta_1(x)$ also does not depend on m. For m = 1, this result appeared already in Kirschenhofer and Prodinger (1988); Szpankowski (1991).
So, as far as the main terms in the asymptotics are concerned, the dependency on m of average and variance is very minor, and m-DSTs don't show any improved behaviour. But such a statement can only be made after some thorough analysis, which is why we provided it here.