Discrete Mathematics & Theoretical Computer Science
This article deals with Pólya generalized urn models with constant balance in any dimension. It is based on the algebraic approach of Pouyanne (2005) and classifies urns having "large'' eigenvalues into five classes, depending on their almost sure asymptotics. These classes are described in terms of the spectrum of the urn's replacement matrix, and examples of each case are treated. We study the cases of so-called cyclic urns in any dimension and $m$-ary search trees for $m \geq 27$.
For a given matrix of size $n \times m$ over a finite alphabet $\mathcal{A}$, a bicluster is a submatrix composed of selected columns and rows satisfying a certain property. In microarray analysis one searches for largest biclusters in which the selected rows constitute the same string (pattern); in another formulation of the problem one tries to find a maximally dense submatrix. In a conceptually similar problem, namely the bipartite clique problem on graphs, one looks for the largest binary submatrix whose entries are all '1'. In this paper, we assume that the original matrix is generated by a memoryless source over a finite alphabet $\mathcal{A}$. We first consider the case where the selected biclusters are square submatrices and prove that with high probability (whp) the largest (square) bicluster having the same row-pattern is of size $\log_Q^2 nm$, where $Q^{-1}$ is the (largest) probability of a symbol. We observe, however, that when we consider $\textit{any}$ submatrices (not just $\textit{square}$ ones), the largest area of a bicluster jumps to $A \cdot n$ (whp), where $A$ is an explicitly computable constant. These findings complete some recent results concerning maximal biclusters and maximum balanced bicliques for random bipartite graphs.
This paper presents the first distributional analysis of a linear probing hashing scheme with buckets of size $b$. The exact distribution of the cost of successful searches for a $b\alpha$-full table is obtained, and moments and asymptotic results are derived. With the use of the Poisson transform, distributional results are also obtained for tables of size $m$ with $n$ elements. A key element in the analysis is the use of a new family of numbers that satisfies a recurrence resembling that of the Bernoulli numbers. These numbers may prove helpful in studying recurrences involving truncated generating functions, as well as in other problems related to buckets.
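As a toy illustration of the scheme being analyzed (not the paper's analytic machinery), here is a minimal first-come-first-served linear probing table with buckets of size $b$; under this policy a key's probe count at insertion time equals its later successful-search cost. The function and its signature are our own illustrative assumptions:

```python
def linear_probe_insert(m, b, keys, hashfn):
    """Insert keys into a table of m buckets, each holding up to b keys,
    by linear probing; return the list of per-key probe counts.

    For first-come-first-served linear probing, the number of buckets
    probed when a key is inserted equals its successful-search cost."""
    buckets = [[] for _ in range(m)]
    costs = []
    for k in keys:
        i = hashfn(k) % m
        probes = 1
        while len(buckets[i]) >= b:   # bucket full: move to the next one
            i = (i + 1) % m
            probes += 1
        buckets[i].append(k)
        costs.append(probes)
    return costs
```

For example, three keys all hashing to bucket 0 in a table with $m=3$, $b=2$ cost 1, 1, and 2 probes respectively.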
In a suffix tree, the multiplicity matching parameter (MMP) $M_n$ is the number of leaves in the subtree rooted at the branching point of the $(n+1)$st insertion. Equivalently, the MMP is the number of pointers into the database in the Lempel-Ziv '77 data compression algorithm. We prove that the MMP asymptotically follows the logarithmic series distribution plus some fluctuations. In the proof we compare the distribution of the MMP in suffix trees to its distribution in tries built over independent strings. Our results are derived by both probabilistic and analytic techniques of the analysis of algorithms. In particular, we utilize combinatorics on words, bivariate generating functions, pattern matching, recurrence relations, analytical poissonization and depoissonization, the Mellin transform, and complex analysis.
An $\textit{anticoloring}$ of a graph is a coloring of some of the vertices, such that no two adjacent vertices are colored in distinct colors. We deal with the anticoloring problem with two colors for planar graphs and, using Lipton and Tarjan's separation algorithm, provide an approximation algorithm with a bound on the error. In the particular cases of graphs which are strong products of two paths or two cycles, we provide an explicit optimal solution.
The machinery of Riordan arrays has been used recently by several authors. We show how meromorphic singularity analysis can be used to provide uniform bivariate asymptotic expansions, in the central regime, for a generalization of these arrays. We show how to do this systematically, for various descriptions of the array. Several examples from recent literature are given.
We build upon previous work of Fayolle (2004) and Park and Szpankowski (2005) to study asymptotically the average internal profile of tries and of suffix-trees. The binary keys and the strings are built from a Bernoulli source $(p,q)$. We consider the average number $p_{k,\mathcal{P}}(\nu)$ of internal nodes at depth $k$ of a trie whose number of input keys follows a Poisson law of parameter $\nu$. The Mellin transform of the corresponding bivariate generating function has a major singularity at the origin, which implies a phase reversal for the saturation rate $p_{k,\mathcal{P}}(\nu)/2^k$ as $k$ reaches the value $2\log(\nu)/(\log(1/p)+\log(1/q))$. We prove that the asymptotic average profiles of random tries and suffix-trees are mostly similar, up to second order terms, a fact that has been experimentally observed in Nicodème (2003); the proof follows from comparisons to the profile of tries in the Poisson model.
We consider the number of nodes in the levels of unlabeled rooted random trees and show that the joint distribution of several level sizes (where the level number is scaled by $\sqrt{n}$) weakly converges to the distribution of the local time of a Brownian excursion evaluated at the times corresponding to the level numbers. This extends existing results for simply generated trees and forests to the case of unlabeled rooted trees.
We introduce a new class of algorithms to estimate the cardinality of very large multisets using constant memory and making only one pass over the data. It is based on order statistics rather than on bit patterns in the binary representations of numbers. We analyse three families of estimators. They attain a standard error of $\frac{1}{\sqrt{M}}$ using $M$ units of storage, which places them in the same class as the best known algorithms so far. They have a very simple internal loop, which gives them an advantage in terms of processing speed. The algorithms are validated on internet traffic traces.
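A minimal sketch of a one-pass, order-statistics cardinality estimator in the spirit described above: keep the $M$ smallest hash values seen and estimate the cardinality from the $M$-th smallest. The paper analyses three refined families of such estimators; this basic variant and all names in it are our own illustrative assumptions:

```python
import hashlib
import heapq

def estimate_cardinality(stream, M=256):
    """One-pass estimate of the number of distinct elements.

    Each element is hashed to a pseudo-uniform point u in (0, 1]; the
    M smallest values are retained in a max-heap (via negation).  If
    the M-th smallest value is v, the cardinality estimate is (M-1)/v,
    with standard error on the order of 1/sqrt(M)."""
    heap = []    # max-heap of the M smallest hash values (negated)
    seen = set() # values currently in the heap, to skip duplicates
    for x in stream:
        h = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")
        u = (h + 1) / 2**64          # pseudo-uniform in (0, 1]
        if u in seen:
            continue
        if len(heap) < M:
            heapq.heappush(heap, -u)
            seen.add(u)
        elif u < -heap[0]:           # smaller than current M-th smallest
            seen.discard(-heapq.heappushpop(heap, -u))
            seen.add(u)
    if len(heap) < M:
        return float(len(heap))      # fewer than M distinct values: exact
    return (M - 1) / (-heap[0])
```

With $M = 256$ units of storage the relative standard error is about $1/\sqrt{256} \approx 6\%$.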
We show an asymptotic estimate for the number of labelled planar graphs on $n$ vertices. We also find limit laws for the number of edges, the number of connected components, and other parameters in random planar graphs.
We consider two probability distributions on Boolean functions defined in terms of their representations by $\texttt{and/or}$ trees (or formulas). The relationships between them, and connections with the complexity of the function, are studied. New and improved bounds on these probabilities are given for a wide class of functions, with special attention being paid to the constant function $\textit{True}$ and read-once functions in a fixed number of variables.
We consider simply generated trees, where the nodes are equipped with weakly monotone labellings with elements of $\{1, 2, \ldots, r\}$, for $r$ fixed. These tree families were introduced in Prodinger and Urbanek (1983) and studied further in Kirschenhofer (1984), Blieberger (1987), and Morris and Prodinger (2005). Here we give distributional results for several tree statistics (the depth of a random node, the ancestor-tree size and the Steiner-distance of $p$ randomly chosen nodes, the height of the $j$-th leaf, and the number of nodes with label $l$), which extend the existing results and also contain the corresponding results for unlabelled simply generated trees as the special case $r=1$.
We give several examples for Poisson approximation of quantities of interest in the analysis of algorithms: the distribution of node depth in a binary search tree, the distribution of the number of losers in an election algorithm and the discounted profile of a binary search tree. A simple and well-known upper bound for the total variation distance between the distribution of a sum of independent Bernoulli variables and the Poisson distribution with the same mean turns out to be very useful in all three cases.
We consider a sequence of $n$ geometric random variables and interpret the outcome as an urn model. For a given parameter $m$, we treat several parameters like what is the largest urn containing at least (or exactly) $m$ balls, or how many urns contain at least $m$ balls, etc. Many of these questions have their origin in some computer science problems. Identifying the underlying distributions as (variations of) the extreme value distribution, we are able to derive asymptotic equivalents for all (centered or uncentered) moments in a fairly automatic way.
Let $\mathcal{T}_n$ denote the set of unrooted unlabeled trees of size $n$ and let $\mathcal{M}$ be a particular (finite) tree. Assuming that every tree of $\mathcal{T}_n$ is equally likely, it is shown that the number of occurrences $X_n$ of $\mathcal{M}$ as an induced sub-tree satisfies $\mathbf{E} X_n \sim \mu n$ and $\mathbf{V}ar X_n \sim \sigma^2 n$ for some (computable) constants $\mu > 0$ and $\sigma \geq 0$. Furthermore, if $\sigma > 0$ then $(X_n - \mathbf{E} X_n) / \sqrt{\mathbf{V}ar X_n}$ converges to a limiting distribution with density $(A+Bt^2)e^{-Ct^2}$ for some constants $A,B,C$. However, in all cases in which we were able to calculate these constants, we obtained $B=0$ and thus a normal distribution. Further, if we consider planted or rooted trees instead of $\mathcal{T}_n$ then the limiting distribution is always normal. Similar results can be proved for planar, labeled and simply generated trees.
Renewed interest in caching techniques stems from their application to improving the performance of the World Wide Web, where storing popular documents in proxy caches closer to end-users can significantly reduce the document download latency and overall network congestion. Rules used to update the collection of frequently accessed documents inside a cache are referred to as cache replacement algorithms. Due to many different factors that influence the Web performance, the most desirable attributes of a cache replacement scheme are low complexity and high adaptability to variability in Web access patterns. These properties are primarily the reason why most of the practical Web caching algorithms are based on the easily implemented Least-Recently-Used (LRU) cache replacement heuristic. In our recent paper Jelenković and Radovanović (2004c), we introduce a new algorithm, termed Persistent Access Caching (PAC), that, in addition to desirable low complexity and adaptability, somewhat surprisingly achieves nearly optimal performance for the independent reference model and generalized Zipf's law request probabilities. Two drawbacks of the PAC algorithm are its dependence on the request arrival times and variable storage requirements. In this paper, we resolve these problems by introducing a discrete version of the PAC policy (DPAC) that, after a cache miss, places the requested document in the cache only if it is requested at least $k$ times among the last $m$, $m \geq k$, […]
We summarize several limit results for the profile of random plane-oriented recursive trees. These include the limit distribution of the normalized profile, asymptotic bimodality of the variance, asymptotic approximations of the expected width and the correlation coefficients of two level sizes. We also unveil an unexpected connection between the profile of plane-oriented recursive trees (with logarithmic height) and that of random binary trees (with height proportional to the square root of tree size).
We study a gcd algorithm directed by Least Significant Bits, the so-called LSB algorithm, and provide a precise average-case analysis of its main parameters (number of iterations, number of shifts, etc.). This analysis is based on a precise study of the dynamical systems which provide a continuous extension of the algorithm, and here it proves convenient to use both a 2-adic extension and a real one. This leads to the framework of products of random matrices, and our results thus involve a constant $\gamma$ which is the Lyapunov exponent of the set of matrices relative to the algorithm. The algorithm can be viewed as a race between a dyadic hare with a speed of 2 bits per step and a "real'' tortoise with a speed equal to $\gamma / \log 2 \sim 0.05$ bits per step. Even if the tortoise starts before the hare, the hare easily catches up with the tortoise (unlike in Aesop's fable [Ae]), and the algorithm terminates.
We investigate distances between pairs of nodes in digital trees (digital search trees (DST), and tries). By analytic techniques, such as the Mellin transform and poissonization, we describe a program to determine the moments of these distances. The program is illustrated on the mean and variance. One encounters delayed Mellin transform equations, which we solve by inspection. Interestingly, the unbiased case gives a bounded variance, whereas the biased case gives a variance growing with the number of keys. It is therefore possible in the biased case to show that an appropriately normalized version of the distance converges to a limit. The complexity of moment calculation increases substantially with each higher moment; a shortcut to the limit is needed via a method that avoids the computation of all moments. Toward this end, we utilize the contraction method to show that in biased digital search trees the distribution of a suitably normalized version of the distances approaches a limit that is the fixed-point solution (in the Wasserstein space) of a distributional equation. An explicit solution to the fixed-point equation is readily demonstrated to be Gaussian.
In this paper, we are concerned with random sampling of an $n$-dimensional integral point on an $(n-1)$-dimensional simplex according to a multivariate discrete distribution. We employ sampling via Markov chains and propose two "hit-and-run'' chains, one for approximate sampling and the other for perfect sampling. We introduce the idea of $\textit{alternating inequalities}$ and show that a $\textit{logarithmic separable concave}$ function satisfies the alternating inequalities. If a probability function satisfies the alternating inequalities, then our chain for approximate sampling mixes in $O(n^2 \ln(K\varepsilon^{-1}))$ steps, namely $(1/2)n(n-1)\ln(K\varepsilon^{-1})$, where $K$ is the side length of the simplex and $\varepsilon$ $(0<\varepsilon<1)$ is an error rate. Under the same condition, we design another chain and a perfect sampler based on monotone CFTP (Coupling from the Past). We discuss a condition under which the expected number of total transitions of the chain in the perfect sampler is bounded by $O(n^3 \ln(Kn))$.
In this paper we show that the CSMA IEEE 802.11 protocol (WiFi) yields packet access delays with power-law asymptotics. This feature allows us to specify optimal routing via a polynomial-time algorithm, whereas the general case is NP-hard.
For the tree algorithm introduced by [Cap79] and [TsMi78], let $L_N$ denote the expected collision resolution time given the collision multiplicity $N$. If $L(z)$ stands for the Poisson transform of $L_N$, then we show that $L_N - L(N) \simeq 1.29 \cdot 10^{-4} \cos(2\pi \log_2 N + 0.698)$.
It has become customary to prove binomial identities by means of the method for automated proofs as developed by Petkovšek, Wilf and Zeilberger. In this paper, we wish to emphasize the role of "human'' and constructive proofs in contrast with the somewhat lazy attitude of relying on "automated'' proofs. As a meaningful example, we consider the four formulas by Romik, related to Motzkin and central trinomial numbers. We show that a proof of these identities can be obtained by using the method of coefficients, a human method only requiring hand computations.
A tight upper bound on the size of the antidictionary of a binary string is presented, and it is shown that the size of the antidictionary of a binary string is always smaller than or equal to that of its dictionary. Moreover, an algorithm to reconstruct its dictionary from its antidictionary is given.
The aim of this paper is threefold: firstly, to explain a certain segment of ordinals in terms which are familiar to the analytic combinatorics community; secondly, to state a great many associated problems on the resulting counting functions; and thirdly, to provide some weak asymptotics for these counting functions. For simplicity we employ Tauberian methods. The analytic combinatorics community is encouraged to provide (maybe in joint work) sharper results in future investigations.
We show that data compression methods (or universal codes) can be applied to hypothesis testing in the framework of classical mathematical statistics. Namely, we describe tests based on data compression methods for the following three problems: i) identity testing, ii) testing for independence, and iii) testing of serial independence for time series. Applying our method of identity testing to pseudorandom number generators, we obtained experimental results which show that the suggested tests are quite efficient.
Given a set $\mathcal{S}$ of real-valued members, where each member is associated with one of two possible types, a multi-partitioning of $\mathcal{S}$ is a sequence of the members of $\mathcal{S}$ such that if $x,y \in \mathcal{S}$ have different types and $x < y$, then $x$ precedes $y$ in the sequence. We give two distribution-sensitive algorithms for the set multi-partitioning problem and a matching lower bound in the algebraic decision-tree model. One of the two algorithms can be made stable and can be implemented in place. We also give an output-sensitive algorithm for the problem.
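To make the definition concrete, here is a small sketch (our own, not the paper's distribution-sensitive algorithms): a verifier for the multi-partitioning property, plus the trivial $O(n \log n)$ solution obtained by sorting, which the paper's algorithms improve upon for favorable input distributions:

```python
def is_multi_partitioning(seq):
    """Check the multi-partitioning property for a sequence of
    (value, type) pairs: whenever two members have different types,
    the smaller value must precede the larger one."""
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            vi, ti = seq[i]
            vj, tj = seq[j]
            if ti != tj and vj < vi:   # larger value precedes smaller: invalid
                return False
    return True

def naive_multi_partition(items):
    """Sorting by value always yields a valid multi-partitioning,
    since same-type members may appear in any relative order."""
    return sorted(items)
```

Note that only different-type pairs are constrained, which is why distribution-sensitive algorithms can beat full sorting.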
We consider words or strings of characters $a_1a_2a_3 \ldots a_n$ of length $n$, where the letters $a_i \in \mathbb{Z}$ are independently generated with a geometric probability $\mathbb{P} \{ X=k \} = pq^{k-1}$ where $p+q=1$. Let $d$ be a fixed nonnegative integer. We say that we have an ascent of size $d$ or more if $a_{i+1} \geq a_i+d$. We determine the mean, variance and limiting distribution of the number of ascents of size $d$ or more in a random geometrically distributed word.
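A minimal sketch of the statistic in question, together with a straightforward sampler for geometrically distributed words (illustrative code, not part of the paper):

```python
import random

def random_geometric_word(n, p, rng):
    """Word of n letters, each independently geometric on {1,2,...}
    with P(X = k) = p * (1-p)**(k-1)."""
    word = []
    for _ in range(n):
        k = 1
        while rng.random() >= p:   # count trials until first success
            k += 1
        word.append(k)
    return word

def ascents(word, d):
    """Number of positions i with a_{i+1} >= a_i + d
    (ascents of size d or more)."""
    return sum(1 for i in range(len(word) - 1) if word[i + 1] >= word[i] + d)
```

For example, the word $1,3,1,5$ has two ascents of size $2$ or more.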
We consider the $\textit{master ring problem (MRP)}$, which often arises in optical network design. Given a network which consists of a collection of interconnected rings $R_1, \ldots, R_K$, with $n_1, \ldots, n_K$ distinct nodes, respectively, we need to find an ordering of the nodes in the network that respects the ordering of every individual ring, if one exists. Our main result is an exact algorithm for MRP whose running time approaches $Q \cdot \prod_{k=1}^K (n_k/ \sqrt{2})$ for some polynomial $Q$, as the $n_k$ values become large. For the $\textit{ring clearance problem}$, a special case of practical interest, our algorithm achieves this running time for rings of $\textit{any}$ size $n_k \geq 2$. This yields the first nontrivial improvement, by a factor of $(2 \sqrt{2})^K \approx (2.82)^K$, over the running time of the naive algorithm, which exhaustively enumerates all $\prod_{k=1}^K (2n_k)$ possible solutions.
This extended abstract introduces a new algorithm for the random generation of labelled planar graphs. Its principles rely on Boltzmann samplers as recently developed by Duchon, Flajolet, Louchard, and Schaeffer. It combines the Boltzmann framework, a judicious use of rejection, a new combinatorial bijection found by Fusy, Poulalhon and Schaeffer, as well as a precise analytic description of the generating functions counting planar graphs, which was recently obtained by Giménez and Noy. This gives rise to an extremely efficient algorithm for the random generation of planar graphs. There is a preprocessing step of some fixed small cost. Then, for each generation, the time complexity is quadratic for exact-size uniform sampling and linear for approximate-size sampling. This greatly improves on the best previously known time complexity for exact-size uniform sampling of planar graphs with $n$ vertices, which was a little over $\mathcal{O}(n^7)$.
On modern computers memory access patterns and cache utilization are as important, if not more important, than operation count in obtaining high-performance implementations of algorithms. In this work, the memory behavior of a large family of algorithms for computing the Walsh-Hadamard transform, an important signal processing transform related to the fast Fourier transform, is investigated. Empirical evidence shows that the family of algorithms exhibit a wide range of performance, despite the fact that all algorithms perform the same number of arithmetic operations. Different algorithms, while having the same number of memory operations, access memory in different patterns and consequently have different numbers of cache misses. A recurrence relation is derived for the number of cache misses and is used to determine the distribution of cache misses over the space of WHT algorithms.
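For concreteness, one standard member of the WHT algorithm family is the iterative in-place butterfly transform below (a generic textbook sketch, not code from the paper; the paper studies how different recursive factorizations of this same computation behave in the cache despite performing identical arithmetic):

```python
def wht(a):
    """Unnormalized fast Walsh-Hadamard transform of a sequence whose
    length is a power of two; returns a new list.  Every algorithm in
    the WHT family computes exactly this result with N log N
    additions/subtractions, differing only in memory access order."""
    a = list(a)
    n = len(a)
    assert n & (n - 1) == 0 and n > 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                # butterfly: combine elements h apart
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a
```

Applying the transform twice recovers the input scaled by $n$, a handy correctness check.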
The problem of finding the convex hull of the intersection points of random lines was studied in Devroye and Toussaint, 1993 and Langerman, Golin and Steiger, 2002, and algorithms with expected linear time were found. We improve the previous results of the model in Devroye and Toussaint, 1993 by giving a universal algorithm for a wider range of distributions.
The Additive-Increase Multiplicative-Decrease (AIMD) algorithm is an effective technique for controlling competitive access to a shared resource. Let $N$ be the number of users and let $x_i(t)$ be the amount of the resource in possession of the $i$-th user. The allocations $x_i(t)$ increase linearly until the aggregate demand $\sum_i x_i(t)$ exceeds a given nominal capacity, at which point a user is selected at a random time and its allocation reduced from $x_i(t)$ to $x_i(t)/ \gamma$, for some given parameter $\gamma >1$. In our new, generalized version of AIMD, the choice of users to have their allocations cut is determined by a selection rule whereby the probabilities of selection are proportional to $x_i^{\alpha}(t)/ \sum_j x_j^{\alpha}(t)$, with $\alpha$ a parameter of the policy. Varying the parameters allows one to adjust fairness under AIMD (as measured for example by the variance of $x_i(t)$) as well as to provide for differentiated service. The primary contribution here is an asymptotic, large-$N$ analysis of the above nonlinear AIMD algorithm within a baseline mathematical model that leads to explicit formulas for the density function governing the allocations $x_i(t)$ in statistical equilibrium. The analysis yields explicit formulas for measures of fairness and several techniques for supplying differentiated service via AIMD.
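As a rough illustration (the paper's contribution is an analytic large-$N$ treatment, not a simulation), a discrete-time caricature of the generalized AIMD dynamics with selection probabilities proportional to $x_i^{\alpha}(t)$ might look like the following; all names and the time discretization are our own assumptions:

```python
import random

def simulate_aimd(N=50, capacity=100.0, gamma=2.0, alpha=1.0,
                  steps=2000, seed=1):
    """Crude discrete-time AIMD caricature: in each step, all users
    increase equally until aggregate demand reaches capacity, then one
    user, chosen with probability proportional to x_i**alpha, has its
    allocation cut from x_i to x_i/gamma."""
    rng = random.Random(seed)
    x = [capacity / (2 * N)] * N
    for _ in range(steps):
        # additive increase: fill the remaining capacity uniformly
        deficit = capacity - sum(x)
        if deficit > 0:
            inc = deficit / N
            x = [xi + inc for xi in x]
        # multiplicative decrease of one user, chosen with prob ~ x_i**alpha
        weights = [xi ** alpha for xi in x]
        i = rng.choices(range(N), weights=weights)[0]
        x[i] /= gamma
    return x
```

Larger $\alpha$ biases the cuts toward users holding more of the resource, which is the knob for fairness and differentiated service mentioned above.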
Consider a set $S$ of points in the plane in convex position, where each point has an integer label from $\{0,1,\ldots,n-1\}$. This naturally induces a labeling of the edges: each edge $(i,j)$ is assigned label $i+j$, modulo $n$. We propose algorithms for finding large non-crossing $\textit{harmonic}$ matchings or paths, i.e., matchings or paths in which no two edges have the same label. When the point labels are chosen uniformly at random, and independently of each other, our matching algorithm with high probability (w.h.p.) delivers a nearly perfect matching, a matching of size $n/2 - O(n^{1/3}\ln n)$.
As a sequel to [arch04], the position of the maximum in a geometrically distributed sample is investigated. Samples of length $n$ are considered, where the maximum is required to be in the first $d$ positions. The probability that the maximum occurs in the first $d$ positions is sought for $d$ dependent on $n$ (as opposed to $d$ fixed in [arch04]). Two scenarios are discussed. The first is when $d = \alpha n$ for $0 < \alpha \leq 1$, where Mellin transforms are used to obtain the asymptotic results. The second is when $1 \leq d = o(n)$.
New cache-oblivious and cache-aware algorithms for simple dynamic programming based on Valiant's context-free language recognition algorithm are designed, implemented, analyzed, and empirically evaluated with timing studies and cache simulations. The studies show that for large inputs the cache-oblivious and cache-aware dynamic programming algorithms are significantly faster than the standard dynamic programming algorithm.
Ordinary generating series of multiple harmonic sums admit a full singular expansion in the basis of functions $\{(1-z)^{\alpha} \log^{\beta}(1-z)\}_{\alpha \in \mathbb{Z}, \beta \in \mathbb{N}}$, near the singularity $z=1$. A constructive proof of this result is given, and, by combinatorial arguments, an explicit evaluation of the Taylor coefficients of functions in some polylogarithmic algebra is obtained. In particular, the asymptotic expansion of multiple harmonic sums is easily deduced.
Using recent results on singularity analysis for Hadamard products of generating functions, we obtain the limiting distributions for additive functionals on $m$-ary search trees on $n$ keys with toll sequence (i) $n^{\alpha}$ with $\alpha \geq 0$ ($\alpha = 0$ and $\alpha = 1$ correspond roughly to the space requirement and total path length, respectively); (ii) $\ln \binom{n}{m-1}$, which corresponds to the so-called shape functional; and (iii) $\mathbf{1}_{n=m-1}$, which corresponds to the number of leaves.
In this report, we prove that under a Markovian model of order one, the average depth of suffix trees of index $n$ is asymptotically similar to the average depth of tries (a.k.a. digital trees) built on $n$ independent strings. This leads to an asymptotic behavior of $(\log{n})/h + C$ for the average depth of the suffix tree, where $h$ is the entropy of the Markov model and $C$ is a constant. Our proof compares the generating functions for the average depth in tries and in suffix trees; the difference between these generating functions is shown to be asymptotically small. We conclude by using the asymptotic behavior of the average depth in a trie under the Markov model found by Jacquet and Szpankowski ([JaSz91]).