DMTCS Proceedings vol. AH, 2007 Conference on Analysis of Algorithms (AofA 07)

The 2007 Conference on Analysis of Algorithms (AofA'07), has been held in Juan des Pins near the cities of Antibes and Nice, France, on June 17-22, 2007.

The Analysis of Algorithms is a scientific basis for computation, providing a link between abstract algorithms and the performance characteristics of their implementations in the real world. The general effort to precisely predict the performance of algorithms has come to involve research in analytic combinatorics, the analysis of random discrete structures, asymptotic analysis, exact and limiting distributions, and other fields of inquiry in computer science, probability theory, and enumerative combinatorics.

1. Expected values of statistics on permutation tableaux

Corteel, Sylvie ; Hitczenko, Pawel.
Permutation tableaux are new objects that were introduced by Postnikov in the context of enumeration of the totally positive Grassmannian cells. They are known to be in bijection with permutations and recently, they have been connected to PASEP model used in statistical physics. Properties of permutation tableaux became a focus of a considerable research activity. In this paper we study properties of basic statistics defined on permutation tableaux. We present a simple and unified approach based on probabilistic techniques and use it to compute the expected values of basic statistics defined on permutation tableaux. We also provide a non―bijective and very simple proof that there are n! permutation tableaux of length n.
Section: Proceedings

2. Limit laws for a class of diminishing urn models.

Kuba, Markus ; Panholzer, Alois.
In this work we analyze a class of diminishing 2×2 Pólya-Eggenberger urn models with ball replacement matrix M given by $M= \binom{ -a \,0}{c -d}, a,d∈\mathbb{N}$ and $c∈\mathbb{N} _0$. We obtain limit laws for this class of 2×2 urns by giving estimates for the moments of the considered random variables. As a special instance we obtain limit laws for the pills problem, proposed by Knuth and McCarthy, which corresponds to the special case $a=c=d=1$. Furthermore, we also obtain limit laws for the well known sampling without replacement urn, $a=d=1$ and $c=0$, and corresponding generalizations, $a,d∈\mathbb{N}$ and $c=0$.
Section: Proceedings

3. Minimal and maximal plateau lengths in Motzkin paths

Prodinger, Helmut ; Wagner, Stephan.
The minimal length of a plateau (a sequence of horizontal steps, preceded by an up- and followed by a down-step) in a Motzkin path is known to be of interest in the study of secondary structures which in turn appear in mathematical biology. We will treat this and the related parameters <i> maximal plateau length, horizontal segment </i>and <i>maximal horizontal segment </i>as well as some similar parameters in unary-binary trees by a pure generating functions approach―-Motzkin paths are derived from Dyck paths by a substitution process. Furthermore, we provide a pretty general analytic method to obtain means and limiting distributions for these parameters. It turns out that the maximal plateau and the maximal horizontal segment follow a Gumbel distribution.
Section: Proceedings

4. Degree distribution of random Apollonian network structures and Boltzmann sampling

Darrasse, Alexis ; Soria, Michèle.
Random Apollonian networks have been recently introduced for representing real graphs. In this paper we study a modified version: random Apollonian network structures (RANS), which preserve the interesting properties of real graphs and can be handled with powerful tools of random generation. We exhibit a bijection between RANS and ternary trees, that transforms the degree of nodes in a RANS into the size of particular subtrees. The distribution of degrees in RANS can thus be analysed within a bivariate Boltzmann model for the generation of random trees, and we show that it has a Catalan form which reduces to a power law with an exponential cutoff: $α ^k k^{-3/2}$, with $α = 8/9$. We also show analogous distributions for the degree in RANS of higher dimension, related to trees of higher arity.
Section: Proceedings

5. One-sided Variations on Tries: Path Imbalance, Climbing, and Key Sampling

Christophi, Costas A. ; Mahmoud, Hosam M..
One-sided variations on path length in a trie (a sort of digital trees) are investigated: They include imbalance factors, climbing under different strategies, and key sampling. For the imbalance factor accurate asymptotics for the mean are derived for a randomly chosen key in the trie via poissonization and the Mellin transform, and the inverse of the two operations. It is also shown from an analysis of the moving poles of the Mellin transform of the poissonized moment generating function that the imbalance factor (under appropriate centering and scaling) follows a Gaussian limit law. The method extends to several variations of sampling keys from a trie and we sketch results of climbing under different strategies. The exact probability distribution is computed in one case, to demonstrate that such calculations can be done, at least in principle.
Section: Proceedings

6. The average position of the first maximum in a sample of geometric random variables

Archibald, Margaret ; Knopfmacher, Arnold.
We consider samples of n geometric random variables $(Γ _1, Γ _2, \dots Γ _n)$ where $\mathbb{P}\{Γ _j=i\}=pq^{i-1}$, for $1≤j ≤n$, with $p+q=1$. The parameter we study is the position of the first occurrence of the maximum value in a such a sample. We derive a probability generating function for this position with which we compute the first two (factorial) moments. The asymptotic technique known as Rice's method then yields the main terms as well as the Fourier expansions of the fluctuating functions arising in the expected value and the variance.
Section: Proceedings

7. Tail Bounds for the Wiener Index of Random Trees

Khan, Tämur Ali ; Neininger, Ralph.
Upper and lower bounds for the tail probabilities of the Wiener index of random binary search trees are given. For upper bounds the moment generating function of the vector of Wiener index and internal path length is estimated. For the lower bounds a tree class with sufficiently large probability and atypically large Wiener index is constructed. The methods are also applicable to related random search trees.
Section: Proceedings

8. On the Exit Time of a Random Walk with Positive Drift

Drmota, Michael ; Szpankowski, Wojciech.
We study a random walk with positive drift in the first quadrant of the plane. For a given connected region $\mathcal{C}$ of the first quadrant, we analyze the number of paths contained in $\mathcal{C}$ and the first exit time from $\mathcal{C}$. In our case, region $\mathcal{C}$ is bounded by two crossing lines. It is noted that such a walk is equivalent to a path in a tree from the root to a leaf not exceeding a given height. If this tree is the parsing tree of the Tunstall or Khodak variable-to-fixed code, then the exit time of the underlying random walk corresponds to the phrase length not exceeding a given length. We derive precise asymptotics of the number of paths and the asymptotic distribution of the exit time. Even for such a simple walk, the analysis turns out to be quite sophisticated and it involves Mellin transforms, Tauberian theorems, and infinite number of saddle points.
Section: Proceedings

9. The Size of the rth Smallest Component in Decomposable Structures with a Restricted Pattern

Dong, Li ; Gao, Zhicheng ; Panario, Daniel.
In our previous work [paper1], we derived an asymptotic expression for the probability that a random decomposable combinatorial structure of size n in the \exp -\log class has a given restricted pattern. In this paper, under similar conditions, we provide the probability that a random decomposable combinatorial structure has a given restricted pattern and the size of its rth smallest component is bigger than k, for r,k given integers. Our studies apply to labeled and unlabeled structures. We also give several concrete examples.
Section: Proceedings

10. Asynchronous Cellular Automata and Brownian Motion

Chassaing, Philippe ; Gerin, Lucas.
This paper deals with some very simple interacting particle systems, \emphelementary cellular automata, in the fully asynchronous dynamics: at each time step, a cell is randomly picked, and updated. When the initial configuration is simple, we describe the asymptotic behavior of the random walks performed by the borders of the black/white regions. Following a classification introduced by Fatès \emphet al., we show that four kinds of asymptotic behavior arise, two of them being related to Brownian motion.
Section: Proceedings

11. Distributional asymptotics in the analysis of algorithms: Periodicities and discretization

Grübel, Rudolf.
It is well known that many distributions that arise in the analysis of algorithms have an asymptotically fluctuating behaviour in the sense that we do not have 'full' convergence, but only convergence along suitable subsequences as the size of the input to the algorithm tends to infinity. We are interested in constructions that display such behaviour via an ordinarily convergent background process in the sense that the periodicities arise from this process by deterministic transformations, typically involving discretization as a decisive step. This leads to structural representations of the resulting family of limit distributions along subsequences, which in turn may give access to their properties, such as the tail behaviour (unsuccessful search in digital search trees) or the dependence on parameters of the algorithm (success probability in a selection algorithm).
Section: Proceedings

12. Uniqueness of polynomial canonical representations

Lladser, Manuel.
Let $P(z)$ and $Q(y)$ be polynomials of the same degree $k \geq 1$ in the complex variables $z$ and $y$, respectively. In this extended abstract we study the non-linear functional equation $P(z)=Q(y(z))$, where $y(z)$ is restricted to be analytic in a neighborhood of $z=0$. We provide sufficient conditions to ensure that all the roots of $Q(y)$ are contained within the range of $y(z)$ as well as to have $y(z)=z$ as the unique analytic solution of the non-linear equation. Our results are motivated from uniqueness considerations of polynomial canonical representations of the phase or amplitude terms of oscillatory integrals encountered in the asymptotic analysis of the coefficients of mixed powers and multivariable generating functions via saddle-point methods. Uniqueness shall prove important for developing algorithms to determine the Taylor coefficients of the terms appearing in these representations. The uniqueness of Levinson's polynomial canonical representations of analytic functions in several variables follows as a corollary of our one-complex variables results.
Section: Proceedings

13. Online Bandwidth packing with symmetric distribution

Lelarge, Marc.
We consider the following stochastic bin packing process: the items arrive continuously over time to a server and are packed into bins of unit size according to an online algorithm. The unpacked items form a queue. The items have random sizes with symmetric distribution. Our first contribution identifies some monotonicity properties of the queueing system that allow to derive bounds on the queue size for First Fit and Best Fit algorithms. As a direct application, we show how to compute the stability region under very general conditions on the input process. Our second contribution is a study of the queueing system under heavy load. We show how the monotonicity properties allow one to derive bounds for the speed at which the stationary queue length tends to infinity when the load approaches one. In the case of Best Fit, these bounds are tight. Our analysis shows connections between our dynamic model, average-case results on the classical bin packing problem and planar matching problems.
Section: Proceedings

14. A new method for computing asymptotics of diagonal coefficients of multivariate generating functions

Raichev, Alexander ; Wilson, Mark C..
Let $\sum_{\mathbf{n} \in \mathbb{N}^d} F_{\mathbf{n}} \mathbf{x}^{\mathbf{n}}$ be a multivariate generating function that converges in a neighborhood of the origin of $\mathbb{C}^d$. We present a new, multivariate method for computing the asymptotics of the diagonal coefficients $F_{a_1n,\ldots,a_dn}$ and show its superiority over the standard, univariate diagonal method. Several examples are given in detail.
Section: Proceedings

15. The height of watermelons with wall - Extended Abstract

Feierl, Thomas.
We derive asymptotics for the moments of the height distribution of watermelons with $p$ branches with wall. This generalises a famous result by de Bruijn, Knuth and Rice on the average height of planted plane trees, and a result by Fulmek on the average height of watermelons with two branches.
Section: Proceedings

16. Quantum random walks in one dimension via generating functions

Bressler, Andrew ; Pemantle, Robin.
We analyze nearest neighbor one-dimensional quantum random walks with arbitrary unitary coin-flip matrices. Using a multivariate generating function analysis we give a simplified proof of a known phenomenon, namely that the walk has linear speed rather than the diffusive behavior observed in classical random walks. We also obtain exact formulae for the leading asymptotic term of the wave function and the location probabilities.
Section: Proceedings

17. Random permutations and their discrepancy process

Chapuy, Guillaume.
Let $\sigma$ be a random permutation chosen uniformly over the symmetric group $\mathfrak{S}_n$. We study a new "process-valued" statistic of $\sigma$, which appears in the domain of computational biology to construct tests of similarity between ordered lists of genes. More precisely, we consider the following "partial sums": $Y^{(n)}_{p,q} = \mathrm{card} \{1 \leq i \leq p : \sigma_i \leq q \}$ for $0 \leq p,q \leq n$. We show that a suitable normalization of $Y^{(n)}$ converges weakly to a bivariate tied down brownian bridge on $[0,1]^2$, i.e. a continuous centered gaussian process $X^{\infty}_{s,t}$ of covariance: $\mathbb{E}[X^{\infty}_{s,t}X^{\infty}_{s',t'}] = (min(s,s')-ss')(min(t,t')-tt')$.
Section: Proceedings

18. Analysis of the total costs for variants of the Union-Find algorithm

Kuba, Markus ; Panholzer, Alois.
We study the average behavior of variants of the UNION-FIND algorithm to maintain partitions of a finite set under the random spanning tree model. By applying the method of moments we can characterize the limiting distribution of the total costs of the algorithms "Quick Find Weighted'' and "Quick Find Biased'' extending the analysis of Knuth and Schönhage, Yao, and Chassaing and Marchand.
Section: Proceedings

19. The Height of List-tries and TST

Broutin, N. ; Devroye, L..
We characterize the asymptotics of heights of the trees of de la Briandais and the ternary search trees (TST) of Bentley and Sedgewick. Our proof is based on a new analysis of the structure of tries that distinguishes the bulk of the tree, called the $\textit{core}$, and the long trees hanging down the core, called the $\textit{spaghettis}$.
Section: Proceedings

20. Combinatorial Dominance Guarantees for Heuristic Algorithms

Berend, Daniel ; Skiena, Steven S. ; Twitto, Yochai.
An $f(n)$ $\textit{dominance bound}$ on a heuristic for some problem is a guarantee that the heuristic always returns a solution not worse than at least $f(n)$ solutions. In this paper, we analyze several heuristics for $\textit{Vertex Cover}$, $\textit{Set Cover}$, and $\textit{Knapsack}$ for dominance bounds. In particular, we show that the well-known $\textit{maximal matching}$ heuristic of $\textit{Vertex Cover}$ provides an excellent dominance bound. We introduce new general analysis techniques which apply to a wide range of problems and heuristics for this measure. Certain general results relating approximation ratio and combinatorial dominance guarantees for optimization problems over subsets are established. We prove certain limitations on the combinatorial dominance guarantees of polynomial-time approximation schemes (PTAS), and give inapproximability results for the problems above.
Section: Proceedings

21. Why almost all satisfiable $k$-CNF formulas are easy

Coja-Oghlan, Amin ; Krivelevich, Michael ; Vilenchik, Dan.
Finding a satisfying assignment for a $k$-CNF formula $(k \geq 3)$, assuming such exists, is a notoriously hard problem. In this work we consider the uniform distribution over satisfiable $k$-CNF formulas with a linear number of clauses (clause-variable ratio greater than some constant). We rigorously analyze the structure of the space of satisfying assignments of a random formula in that distribution, showing that basically all satisfying assignments are clustered in one cluster, and agree on all but a small, though linear, number of variables. This observation enables us to describe a polynomial time algorithm that finds $\textit{whp}$ a satisfying assignment for such formulas, thus asserting that most satisfiable $k$-CNF formulas are easy (whenever the clause-variable ratio is greater than some constant). This should be contrasted with the setting of very sparse $k$-CNF formulas (which are satisfiable $\textit{whp}$), where experimental results show some regime of clause density to be difficult for many SAT heuristics. One explanation for this phenomena, backed up by partially non-rigorous analytical tools from statistical physics, is the complicated clustering of the solution space at that regime, unlike the more "regular" structure that denser formulas possess. Thus in some sense, our result rigorously supports this explanation.
Section: Proceedings

22. Expected number of locally maximal solutions for random Boolean CSPs

Creignou, Nadia ; Daudé, Hervé ; Dubois, Olivier.
For a large number of random Boolean constraint satisfaction problems, such as random $k$-SAT, we study how the number of locally maximal solutions evolves when constraints are added. We give the exponential order of the expected number of these distinguished solutions and prove it depends on the sensitivity of the allowed constraint functions only. As a by-product we provide a general tool for computing an upper bound of the satisfiability threshold for any problem of a large class of random Boolean CSPs.
Section: Proceedings

23. Randomized Optimization: a Probabilistic Analysis

Cardinal, Jean ; Langerman, Stefan ; Louchard, Guy.
In 1999, Chan proposed an algorithm to solve a given optimization problem: express the solution as the minimum of the solutions of several subproblems and apply the classical randomized algorithm for finding the minimum of $r$ numbers. If the decision versions of the subproblems are easier to solve than the subproblems themselves, then a faster algorithm for the optimization problem may be obtained with randomization. In this paper we present a precise probabilistic analysis of Chan's technique.
Section: Proceedings

24. Optimal Prefix and Suffix Queries on Texts

Crochemore, Maxime ; Iliopoulos, Costas S. ; Rahman, M. Sohel.
In this paper, we study a restricted version of the position restricted pattern matching problem introduced and studied by Mäkinen and Navarro [Position-Restricted Substring Searching, LATIN 2006]. In the problem handled in this paper, we are interested in those occurrences of the pattern that lies in a suffix or in a prefix of the given text. We achieve optimal query time for our problem against a data structure which is an extension of the classic suffix tree data structure. The time and space complexity of the data structure is dominated by that of the suffix tree. Notably, the (best) algorithm by Mäkinen and Navarro, if applied to our problem, gives sub-optimal query time and the corresponding data structure also requires more time and space.
Section: Proceedings

25. On the Ehrenfeucht-Mycielski Balance Conjecture

Kieffer, John C. ; Szpankowski, W..
In 1992, A. Ehrenfeucht and J. Mycielski defined a seemingly pseudorandom binary sequence which has since been termed the EM-sequence. The balance conjecture for the EM-sequence, still open, is the conjecture that the sequence of EM-sequence initial segment averages converges to $1/2$. In this paper, we do not prove the balance conjecture but we do make some progress concerning it, namely, we prove that every limit point of the aforementioned sequence of averages lies in the interval $[1/4,3/4]$, improving the best previous result that every such limit point belongs to the interval $[0.11,0.89]$. Our approach is novel and exploits an analysis of the growth behavior as $n \to \infty$ of the rooted tree formed by the binary strings appearing at least twice as substrings of the length $n$ initial segment of the EM-sequence.
Section: Proceedings

26. Counting occurrences for a finite set of words: an inclusion-exclusion approach

Bassino, Frédérique ; Clément, Julien ; Fayolle, J. ; Nicodème, P..
In this paper, we give the multivariate generating function counting texts according to their length and to the number of occurrences of words from a finite set. The application of the inclusion-exclusion principle to word counting due to Goulden and Jackson (1979, 1983) is used to derive the result. Unlike some other techniques which suppose that the set of words is reduced (<i>i..e.</i>, where no two words are factor of one another), the finite set can be chosen arbitrarily. Noonan and Zeilberger (1999) already provided a MAPLE package treating the non-reduced case, without giving an expression of the generating function or a detailed proof. We give a complete proof validating the use of the inclusion-exclusion principle and compare the complexity of the method proposed here with the one using automata for solving the problem.
Section: Proceedings

27. A Note on the Approximation of Perpetuities

Knape, Margarete ; Neininger, Ralph.
We propose and analyze an algorithm to approximate distribution functions and densities of perpetuities. Our algorithm refines an earlier approach based on iterating discretized versions of the fixed point equation that defines the perpetuity. We significantly reduce the complexity of the earlier algorithm. Also one particular perpetuity arising in the analysis of the selection algorithm Quickselect is studied in more detail. Our approach works well for distribution functions. For densities we have weaker error bounds although computer experiments indicate that densities can also be well approximated.
Section: Proceedings

28. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

Flajolet, Philippe ; Fusy, Éric ; Gandouet, Olivier ; Meunier, Frédéric.
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of \emphdistinct elements (the cardinality) of very large data ensembles. Using an auxiliary memory of m units (typically, "short bytes''), HYPERLOGLOG performs a single pass over the data and produces an estimate of the cardinality such that the relative accuracy (the standard error) is typically about $1.04/\sqrt{m}$. This improves on the best previously known cardinality estimator, LOGLOG, whose accuracy can be matched by consuming only 64% of the original memory. For instance, the new algorithm makes it possible to estimate cardinalities well beyond $10^9$ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes. The algorithm parallelizes optimally and adapts to the sliding window model.
Section: Proceedings

29. Message passing for the coloring problem: Gallager meets Alon and Kahale

Ben-Shimon, Sonny ; Vilenchik, Dan.
Message passing algorithms are popular in many combinatorial optimization problems. For example, experimental results show that \emphsurvey propagation (a certain message passing algorithm) is effective in finding proper k-colorings of random graphs in the near-threshold regime. In 1962 Gallager introduced the concept of Low Density Parity Check (LDPC) codes, and suggested a simple decoding algorithm based on message passing. In 1994 Alon and Kahale exhibited a coloring algorithm and proved its usefulness for finding a k-coloring of graphs drawn from a certain planted-solution distribution over k-colorable graphs. In this work we show an interpretation of Alon and Kahale's coloring algorithm in light of Gallager's decoding algorithm, thus showing a connection between the two problems - coloring and decoding. This also provides a rigorous evidence for the usefulness of the message passing paradigm for the graph coloring problem.
Section: Proceedings

30. On expected number of maximal points in polytopes

Baryshnikov, Yu..
We answer an old question: what are possible growth rates of the expected number of vector-maximal points in a uniform sample from a polytope.
Section: Proceedings

31. Sorting using complete subintervals and the maximum number of runs in a randomly evolving sequence: Extended abstract.

Janson, Svante.
We study the space requirements of a sorting algorithm where only items that at the end will be adjacent are kept together. This is equivalent to the following combinatorial problem: Consider a string of fixed length n that starts as a string of 0's, and then evolves by changing each 0 to 1, with the n changes done in random order. What is the maximal number of runs of 1's? We give asymptotic results for the distribution and mean. It turns out that, as in many problems involving a maximum, the maximum is asymptotically normal, with fluctuations of order $n^{1/2}$, and to the first order well approximated by the number of runs at the instance when the expectation is maximized, in this case when half the elements have changed to 1; there is also a second order term of order $n^{1/3}$. We also treat some variations, including priority queues and sock-sorting.
Section: Proceedings

32. Lattice reduction in two dimensions: analyses under realistic probabilistic models

Vallée, Brigitte ; Vera, Antonio.
The Gaussian algorithm for lattice reduction in dimension 2 is precisely analysed under a class of realistic probabilistic models, which are of interest when applying the Gauss algorithm "inside'' the LLL algorithm. The proofs deal with the underlying dynamical systems and transfer operators. All the main parameters are studied: execution parameters which describe the behaviour of the algorithm itself as well as output parameters, which describe the geometry of reduced bases.
Section: Proceedings

33. Hamming distance from irreducible polynomials over $\mathbb {F}_2$

Lee, Gilbert ; Ruskey, Frank ; Williams, Aaron.
We study the Hamming distance from polynomials to classes of polynomials that share certain properties of irreducible polynomials. The results give insight into whether or not irreducible polynomials can be effectively modeled by these more general classes of polynomials. For example, we prove that the number of degree $n$ polynomials of Hamming distance one from a randomly chosen set of $\lfloor 2^n/n \rfloor$ odd density polynomials, each of degree $n$ and each with non-zero constant term, is asymptotically $(1-e^{-4}) 2^{n-2}$, and this appears to be inconsistent with the numbers for irreducible polynomials. We also conjecture that there is a constant $c$ such that every polynomial has Hamming distance at most $c$ from an irreducible polynomial. Using exhaustive lists of irreducible polynomials over $\mathbb{F}_2$ for degrees $1 ≤ n ≤ 32$, we count the number of polynomials with a given Hamming distance to some irreducible polynomial of the same degree. Our work is based on this "empirical" study.
Section: Proceedings

34. Coherent random permutations with record statistics

Gnedin, Alexander.
A two-parameter family of random permutations of $[n]$ is introduced, with distribution conditionally uniform given the counts of upper and lower records. The family interpolates between two versions of Ewens' distribution. A distinguished role of the family is determined by the fact that every sequence of coherent permutations $(π _n,n=1,2,\ldots)$ with the indicated kind of sufficiency is obtainable by randomisation of the parameters. Generating algorithms and asymptotic properties of the permutations follow from the representation via initial ranks.
Section: Proceedings

35. Properties of Random Graphs via Boltzmann Samplers

Panagiotou, Konstantinos ; Weißl, Andreas.
This work is devoted to the understanding of properties of random graphs from graph classes with structural constraints. We propose a method that is based on the analysis of the behaviour of Boltzmann sampler algorithms, and may be used to obtain precise estimates for the maximum degree and maximum size of a biconnected block of a "typical'' member of the class in question. We illustrate how our method works on several graph classes, namely dissections and triangulations of convex polygons, embedded trees, and block and cactus graphs.
Section: Proceedings

36. On Correlation Polynomials and Subword Complexity

Gheorghiciuc, Irina ; Ward, Mark Daniel.
We consider words with letters from a $q-ary$ alphabet $\mathcal{A}$. The kth subword complexity of a word $w ∈\mathcal{A}^*$ is the number of distinct subwords of length $k$ that appear as contiguous subwords of $w$. We analyze subword complexity from both combinatorial and probabilistic viewpoints. Our first main result is a precise analysis of the expected $kth$ subword complexity of a randomly-chosen word $w ∈\mathcal{A}^n$. Our other main result describes, for $w ∈\mathcal{A}^*$, the degree to which one understands the set of all subwords of $w$, provided that one knows only the set of all subwords of some particular length $k$. Our methods rely upon a precise characterization of overlaps between words of length $k$. We use three kinds of correlation polynomials of words of length $k$: unweighted correlation polynomials; correlation polynomials associated to a Bernoulli source; and generalized multivariate correlation polynomials. We survey previously-known results about such polynomials, and we also present some new results concerning correlation polynomials.
Section: Proceedings