Counting connected graphs with large excess

We enumerate the connected graphs that contain a linear number of edges with respect to the number of vertices. So far, only the first term of the asymptotics was known. Using analytic combinatorics, i.e. generating function manipulations, we derive the complete asymptotic expansion.


Introduction
We investigate the number $\mathrm{CSG}_{n,k}$ of connected graphs with $n$ vertices and $n + k$ edges. The quantity $k$, defined as the difference between the numbers of edges and vertices, is the excess of the graph.
Related works Trees are the simplest connected graphs, and reach the minimal excess $-1$. They were enumerated in 1860 by Borchardt, and his result, known as Cayley's Formula, is $\mathrm{CSG}_{n,-1} = n^{n-2}$. Rényi (1959) then derived the formula for $\mathrm{CSG}_{n,0}$, which counts the connected graphs that contain exactly one cycle, called unicycles. Wright (1980), using generating function techniques, obtained the asymptotics of connected graphs for $k = o(n^{1/3})$. This result was improved by Flajolet et al. (2004), who derived a complete asymptotic expansion for fixed excess. Łuczak (1990) obtained the asymptotics of $\mathrm{CSG}_{n,k}$ when $k$ goes to infinity while $k = o(n)$. Bender et al. (1990) derived the asymptotics for a larger range, requiring only that $2k/n - \log(n)$ is bounded. This covers the interesting case where $k$ is proportional to $n$. Their proof was based on differential equations obtained by Wright, involving the generating functions of connected graphs indexed by their excesses. Since then, two simpler proofs have been proposed. The proof of Pittel and Wormald (2005) relied on the enumeration of graphs with minimum degree at least 2. The second proof, derived by van der Hofstad and Spencer (2006), used probabilistic methods, analyzing a breadth-first search on a random graph. Erdős and Rényi (1960) proved that almost all graphs are connected when $2k/n - \log(n)$ tends to infinity. As a corollary, the number of connected graphs with those parameters is asymptotically equivalent to the total number of graphs.
Contributions In this article, we derive an exact expression for the number of connected graphs (Theorem 3), amenable to asymptotic analysis. Our main result is the following theorem.
Theorem 1. When $k/n$ has a positive limit and $d$ is fixed, then the following asymptotics holds
$$\mathrm{CSG}_{n,k} = D_{n,k} \left( 1 + c_1 n^{-1} + \cdots + c_{d-1} n^{-(d-1)} + O(n^{-d}) \right),$$
where the dominant term $D_{n,k}$ is derived in Lemma 6, and the $(c_\ell)$ are computable constants.

Notations and models
We introduce the notations adopted in this article, the standard graph model, a multigraph model better suited for generating function manipulations, and the concept of patchwork, used to translate to graphs the results derived on multigraphs.
Notations A multiset is an unordered collection of objects, where repetitions are allowed. Sets, or families, are then multisets without repetitions. A sequence, or tuple, is an ordered multiset. We use the parenthesis notation $(u_1, \ldots, u_n)$ for sequences, and the brace notation $\{u_1, \ldots, u_n\}$ for sets and multisets. The cardinality of a set or multiset $S$ is denoted by $|S|$. The double factorial notation for odd numbers stands for $(2k-1)!! = \frac{(2k)!}{2^k k!}$, and $[z^n] F(z)$ denotes the $n$th coefficient of the series expansion of $F(z)$ at $z = 0$.
Graphs We consider in this article the classic model of graphs, a.k.a. simple graphs, with labelled vertices and unlabelled unoriented edges. All edges are distinct and no edge links a vertex to itself. We naturally adopt for graphs generating functions exponential with respect to the number of vertices, and ordinary with respect to the number of edges (see Sedgewick (2009), or Bergeron et al. (1997)).
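Definition 1. A graph $G$ is a pair $(V(G), E(G))$, where $V(G)$ is the labelled set of vertices, and $E(G)$ is the set of edges. Each edge is a set of two vertices from $V(G)$. The number of vertices (resp. of edges) is $n(G) = |V(G)|$ (resp. $m(G) = |E(G)|$). The generating function of a family $F$ of graphs is
$$F(z, w) = \sum_{G \in F} w^{m(G)} \frac{z^{n(G)}}{n(G)!},$$
and $F_k(z)$ denotes the generating function of graphs from $F$ with excess $k$, $F_k(z) = [y^k] F(z/y, y)$.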
As always in analytic combinatorics and species theory, the labels are distinct elements that belong to a totally ordered set. When counting labelled objects (here, graphs), we always assume that the labels are consecutive integers starting at 1. Another formulation is that we consider two objects as equivalent if there exists an increasing relabelling sending one to the other.
With those conventions, the generating function of all graphs is
$$\mathrm{SG}(z, w) = \sum_{n \ge 0} (1 + w)^{\binom{n}{2}} \frac{z^n}{n!},$$
because a graph with $n$ vertices has $\binom{n}{2}$ possible edges. Since a graph is a set of connected graphs, the generating function of connected graphs $\mathrm{CSG}(z, w)$ satisfies the relation $\mathrm{SG}(z, w) = e^{\mathrm{CSG}(z, w)}$.
We obtain the classic closed form for the generating function of connected graphs
$$\mathrm{CSG}(z, w) = \log \left( \sum_{n \ge 0} (1 + w)^{\binom{n}{2}} \frac{z^n}{n!} \right).$$
This expression was the starting point of the analysis of Flajolet et al. (2004), who worked on graphs with fixed excess. However, as already observed by those authors, it is complex to analyze, because of "magical" cancellations in the coefficients. The reason for those cancellations is the presence of trees, which are the only connected components with negative excess. In this paper, we follow a different approach, closer to that of Pittel and Wormald (2005): we consider cores, i.e. graphs with minimum degree at least 2, and add rooted trees to their vertices. This setting produces all graphs without trees.
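For small sizes, the relation between all graphs and connected graphs can be turned into an exact computation. The following minimal sketch does not expand the logarithm directly: it uses the equivalent recurrence obtained by isolating the component that contains vertex 1. The function names (`connected_graph_polys`, `csg`) are ours and only serve the illustration.

```python
from math import comb

def poly_mul(a, b):
    """Product of two polynomials in w, given as coefficient lists."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def all_graphs_poly(n):
    """Coefficients of (1 + w)^binom(n, 2): graphs on n vertices, by number of edges."""
    e = comb(n, 2)
    return [comb(e, m) for m in range(e + 1)]

def connected_graph_polys(n_max):
    """table[n][m] = number of connected graphs with n labelled vertices and m edges."""
    table = [[0]]  # placeholder for n = 0 (no connected graph on 0 vertices)
    for n in range(1, n_max + 1):
        total = all_graphs_poly(n)
        for j in range(1, n):
            # The component of vertex 1 has j vertices (binom(n-1, j-1) choices),
            # the remaining n - j vertices carry an arbitrary graph.
            term = poly_mul(table[j], all_graphs_poly(n - j))
            weight = comb(n - 1, j - 1)
            for m, t in enumerate(term):
                total[m] -= weight * t
        table.append(total)
    return table

def csg(n, k, table):
    """Number of connected graphs with n vertices and excess k (n + k edges)."""
    m = n + k
    return table[n][m] if 0 <= m < len(table[n]) else 0

table = connected_graph_polys(7)
# Excess -1 corresponds to trees, counted by Cayley's formula n^(n-2).
print([csg(n, -1, table) for n in range(2, 8)])
```

This brute-force recurrence is only practical for small $n$; it is useful mainly as a sanity check for the exact and asymptotic formulas.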
Multigraphs As already observed by Flajolet et al. (1989) and Janson et al. (1993), multigraphs are better suited for generating function manipulations than graphs. Exact and asymptotic results on connected multigraphs are available in de Panafieu (2014). We propose a new definition for those objects, distinct from, but related to, the one used by Flajolet et al. (1989) and Janson et al. (1993), and link the generating functions of graphs and multigraphs in Lemma 1. We define a multigraph as a graph with labelled vertices and labelled oriented edges, where loops and multiple edges are allowed. Since vertices and edges are labelled, we choose exponential generating functions with respect to both quantities. Furthermore, a weight 1/2 is assigned to each edge, for a reason that will become clear in Lemma 1.
Definition 2. A multigraph $G$ is a pair $(V(G), E(G))$, where $V(G)$ is the set of labelled vertices, and $E(G)$ is the set of labelled edges (the edge labels are independent from the vertex labels). Each edge is a triplet $(v, w, e)$, where $v$, $w$ are vertices, and $e$ is the label of the edge. The number of vertices (resp. number of edges, excess) is $n(G) = |V(G)|$ (resp. $m(G) = |E(G)|$, $k(G) = m(G) - n(G)$). The generating function of the family $F$ of multigraphs is
$$F(z, w) = \sum_{G \in F} \frac{w^{m(G)}}{2^{m(G)} m(G)!} \frac{z^{n(G)}}{n(G)!},$$
and $F_k(z)$ denotes the generating function $[y^k] F(z/y, y)$. Figure 2 presents an example of multigraph. A major difference between graphs and multigraphs is the possibility of loops and multiple edges.
Definition 3. A loop (resp. double edge) of a multigraph G is a subgraph (V, E) (i.e. V ⊂ V (G) and E ⊂ E(G)) isomorphic to the following left multigraph (resp. to one of the following right multigraphs). The set of loops and double edges of a multigraph G is denoted by LD(G), and its cardinality by ld(G).
In particular, a multigraph that has no double edge contains no multiple edge. Multigraphs are better suited for generating function manipulations than graphs. However, we aim at deriving results on the graph model, since it has been adopted both by the graph theory and the combinatorics communities. The following lemma, illustrated in Figure 1, links the generating functions of both models.
Lemma 1. Let $\mathrm{MG} \setminus \mathrm{LD}$ denote the family of multigraphs that contain neither loops nor double edges, and $p$ the projection from $\mathrm{MG} \setminus \mathrm{LD}$ to the set $\mathrm{SG}$ of graphs that erases the edge labels and orientations, as illustrated in Figure 1. Let $F$ denote a subfamily of $\mathrm{MG} \setminus \mathrm{LD}$, stable under edge relabelling and change of orientations. Then there exists a family $H$ of graphs such that $p^{-1}(H) = F$. Furthermore, the generating functions of $F$ and $H$, with the respective conventions of multigraphs and graphs, are equal: $F(z, w) = H(z, w)$.

Patchworks To apply the previous lemma, we need to remove the loops and multiple edges from multigraph families. Our tool is the inclusion-exclusion technique, in conjunction with the notion of patchwork.
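The role of the weight 1/2 per edge in Lemma 1 can be checked by brute force on small cases. The sketch below, written under our own naming (`preimage_sizes` is a hypothetical helper), enumerates all multigraphs with $n$ vertices and $m$ labelled oriented edges, discards those containing a loop or a double edge, and groups the remaining ones by their image under the projection $p$. Each graph with $m$ edges should have $2^m m!$ preimages, which is exactly compensated by the factor $1/(2^m m!)$ of the multigraph convention.

```python
from collections import Counter
from itertools import product
from math import factorial

def preimage_sizes(n, m):
    """Map each simple graph with m edges on n vertices to the number of
    loopless, double-edge-free multigraphs projecting onto it."""
    vertices = range(1, n + 1)
    oriented_pairs = list(product(vertices, repeat=2))
    sizes = Counter()
    # The edge with label i is edges[i - 1]; its orientation is the pair order.
    for edges in product(oriented_pairs, repeat=m):
        if any(v == w for v, w in edges):
            continue  # contains a loop
        simple = frozenset(frozenset(e) for e in edges)
        if len(simple) < m:
            continue  # two edges link the same pair of vertices
        sizes[simple] += 1
    return sizes

sizes = preimage_sizes(4, 3)
# Every graph with 3 edges has the same number of preimages, 2^3 * 3!.
print(len(sizes), set(sizes.values()), 2 ** 3 * factorial(3))
```

Summing the weight $1/(2^m m!)$ over each preimage therefore gives 1 per graph, which is the equality of generating functions stated in the lemma.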
[Figure 1: ... As stated by Lemma 1, those generating functions are equal.]

[...] In particular, all pairs $(V_i, E_i)$ are distinct, $\mathrm{MG}(P)$ has minimum degree at least 2, and two edges in $E_i$, $E_j$ having the same label must link the same vertices. We use for patchwork generating functions the same conventions as for multigraphs, introducing an additional variable $u$ to mark the number of parts.

Lemma 2. The generating function of patchworks is equal to
$$P(z, w, u) = \sum_{k \ge 0} P_k(zw, u) w^k, \quad \text{where} \quad P_0(z, u) = e^{u z / 2 + u z^2 / 4}.$$
For each $k$, there is a polynomial $P^{\star}_k(z, u)$ such that $P_k(z, u) = P_0(z, u) P^{\star}_k(z, u)$.
Proof. A patchwork of excess 0 is a set of isolated loops and double edges (i.e. sharing no vertex with another loop or double edge), which explains the expression of $P_0(z, u)$. Let $P^{\star}_k$ denote the family of patchworks of excess $k$ that contain no isolated loop or double edge. Each vertex of degree 2 then belongs to exactly one double edge and no loop. The number of such double edges is at most $k$, because each increases the excess by 1. If we remove them, the corresponding multigraph has minimum degree at least 3 and excess at most $k$.
There is a finite number of such multigraphs (see e.g. Wright (1980), and we give the proof in Appendix 5.1 for completeness), so the family $P^{\star}_k$ is finite, and $P^{\star}_k(z, u)$ is a polynomial. Since any patchwork of excess $k$ is a set of isolated loops and double edges and a patchwork from $P^{\star}_k$, we have $P_k(z, u) = P_0(z, u) P^{\star}_k(z, u)$.

Exact enumeration
In this section, we derive an exact expression for $\mathrm{CSG}_k(z)$, suitable for asymptotic analysis. The proofs rely on tools developed by de Panafieu and Ramos (2016) and Collet et al. (2016).
Theorem 2. The generating function of cores, i.e. graphs with minimum degree at least 2, is
$$\mathrm{Core}(z, w) = \sum_{m \ge 0} (2m)! [x^{2m}] P(z e^x, w, -1) e^{z (e^x - 1 - x)} \frac{w^m}{2^m m!}.$$
Proof. Let MCore denote the set of multicores, i.e. multigraphs with minimum degree at least 2, and set
$$\mathrm{MCore}(z, w, u) = \sum_{G \in \mathrm{MCore}} u^{\mathrm{ld}(G)} \frac{w^{m(G)}}{2^{m(G)} m(G)!} \frac{z^{n(G)}}{n(G)!},$$
where $\mathrm{ld}(G)$ denotes the number of loops and double edges in $G$. According to Lemma 1, we have $\mathrm{Core}(z, w) = \mathrm{MCore}(z, w, 0)$. To express the generating function of multicores, the inclusion-exclusion method (see (Flajolet and Sedgewick, 2009, Section III.7.4)) advises us to consider $\mathrm{MCore}(z, w, u + 1)$ instead. This is the generating function of the set of multicores where each loop and double edge is either marked by $u$ or left unmarked. The set of marked loops and double edges forms, by definition, a patchwork. One can cut each unmarked edge into two labelled half-edges. Observe that the degree constraint implies that each vertex outside the patchwork contains at least two half-edges. Conversely, as illustrated in Figure 3, any such multicore can be uniquely built following the steps:
1. start with a patchwork $P$, which will be the final set of marked loops and double edges,
2. add a set of isolated vertices,
3. add to each vertex a set of labelled half-edges, such that each isolated vertex receives at least two of them. The total number of half-edges must be even, and is denoted by $2m$,
4. add to the patchwork the $m$ edges obtained by linking the half-edges with consecutive labels (1 with 2, 3 with 4 and so on).
Observe that a relabelling of the vertices (resp. the edges) occurs at step 2 (resp. 4). This construction implies, by application of the species theory (Bergeron et al. (1997)) or the symbolic method (Flajolet and Sedgewick (2009)), the generating function relation
$$\mathrm{MCore}(z, w, u + 1) = \sum_{m \ge 0} (2m)! [x^{2m}] P(z e^x, w, u) e^{z (e^x - 1 - x)} \frac{w^m}{2^m m!}.$$
For $u = -1$, we obtain the expression of $\mathrm{Core}(z, w) = \mathrm{MCore}(z, w, 0)$.

Any graph where no component is a tree can be built starting with a core, and replacing each vertex with a rooted tree. The components of smallest excess, zero, are then the unicycles. The difference with the multi-unicycles (connected multigraphs of excess 0) is that the cycle of a multi-unicycle can be a loop or a double edge. We recall the classic expressions of their generating functions (see Flajolet and Sedgewick (2009)).
Lemma 3. The generating functions of rooted trees, multi-unicycles, and unicycles are characterized by
$$T(z) = z e^{T(z)}, \qquad MV(z) = \frac{1}{2} \log \frac{1}{1 - T(z)}, \qquad V(z) = MV(z) - \frac{T(z)}{2} - \frac{T(z)^2}{4}.$$

We apply the previous results to investigate graphs where all components have positive excess, i.e. that contain neither trees nor unicycles. This is the key new ingredient in our proof of Theorem 1.
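The classic expressions for rooted trees and unicycles can be checked on their first coefficients. The sketch below assumes the standard closed forms $T(z) = z e^{T(z)}$, so that $n! [z^n] T(z) = n^{n-1}$, and $V(z) = \frac{1}{2} \log \frac{1}{1 - T(z)} - \frac{T(z)}{2} - \frac{T(z)^2}{4}$. It compares the coefficients of $V(z)$ with a brute-force count of connected graphs having as many edges as vertices. The truncation order `N` and the function names are ours.

```python
from fractions import Fraction
from itertools import combinations
from math import factorial

N = 6  # truncation order of all power series

def mul(a, b):
    """Product of two series truncated at order N."""
    out = [Fraction(0)] * (N + 1)
    for i, ai in enumerate(a):
        if ai:
            for j, bj in enumerate(b):
                if i + j > N:
                    break
                out[i + j] += ai * bj
    return out

# Rooted trees: [z^n] T(z) = n^(n-1) / n!
T = [Fraction(0)] + [Fraction(n ** (n - 1), factorial(n)) for n in range(1, N + 1)]

def unicycle_series():
    """V = (1/2) log(1/(1-T)) - T/2 - T^2/4 = (1/2) * sum_{j >= 3} T^j / j."""
    V = [Fraction(0)] * (N + 1)
    power = mul(T, T)  # T^2
    for j in range(3, N + 1):  # T^j starts at z^j, so j <= N suffices
        power = mul(power, T)
        for n in range(N + 1):
            V[n] += power[n] / (2 * j)
    return V

def count_unicycles_brute(n):
    """Number of connected graphs with n labelled vertices and n edges."""
    pairs = list(combinations(range(n), 2))
    count = 0
    for edges in combinations(pairs, n):
        adj = {v: [] for v in range(n)}
        for a, b in edges:
            adj[a].append(b)
            adj[b].append(a)
        seen, stack = {0}, [0]
        while stack:  # depth-first search from vertex 0
            v = stack.pop()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        if len(seen) == n:
            count += 1
    return count

V = unicycle_series()
print([int(V[n] * factorial(n)) for n in range(1, N + 1)])
print([count_unicycles_brute(n) for n in range(1, N + 1)])
```

Both lines are expected to agree; exact rational arithmetic avoids any rounding issue in the comparison.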
Lemma 4. The generating function of graphs with excess $k$ where each component has positive excess is [...]

Proof. In the expression of the generating function of cores, after developing the exponential as a sum over $n$ and applying the change of variable $m \leftarrow k + n$, we obtain [...]. The sum over $n$ is replaced by its closed form [...]. Lemma 2 is applied to expand $P(z e^x, w, -1)$. The generating function of cores of excess $k$ is then [...]. If we do not remove the loops and double edges, we obtain the generating function $\mathrm{MCore}_k(z)$ of multicores of excess $k$. In the generating function, this means replacing $P(z e^x, w, -1)$ with the constant 1, so $P_\ell$ vanishes except for $\ell = 0$, and [...]. A core of excess $k$ where the vertices are replaced by rooted trees can be uniquely decomposed as a set of unicycles, and a graph of excess $k$ where each component has a positive excess, so [...]. This leads to the results stated in the lemma, after division by $e^{V(z)}$ (resp. $e^{MV(z)}$).

According to Lemma 1, the generating function $\mathrm{MG}^{>0}(z, w)$ of multigraphs where all components have positive excess dominates [...]. Either by calculus, as a corollary of the previous lemma, or by a combinatorial argument, we obtain the following result, first proven by Wright (see also (Janson et al., 1993, Lemma 1 p.33)), which was a key ingredient of the proofs of Bender et al. (1990) and Flajolet et al. (2004).
Lemma 5. For each $k > 0$, there exists a computable polynomial $Q_k$ such that
$$\mathrm{SG}^{>0}_k(z) = \frac{Q_k(T(z))}{(1 - T(z))^{3k}}.$$

Observe that this result is only useful for fixed $k$. We finally prove an exact expression for the number of connected graphs, whose asymptotics is derived in Section 4.
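Theorem 3. For $k > 0$, the number of connected graphs with $n$ vertices and excess $k$ is
$$\mathrm{CSG}_{n,k} = n! [z^n] \sum_{q=1}^{k} \frac{(-1)^{q+1}}{q} \sum_{\substack{k_1 + \cdots + k_q = k \\ 1 \le k_j \le k - q + 1}} \prod_{j=1}^{q} \mathrm{SG}^{>0}_{k_j}(z).$$
Proof. Each graph in $\mathrm{SG}^{>0}$ is a set of connected graphs with positive excess, so
$$\sum_{\ell \ge 0} \mathrm{SG}^{>0}_\ell(z) y^\ell = e^{\sum_{k > 0} \mathrm{CSG}_k(z) y^k}.$$
Observe that $\mathrm{SG}^{>0}_0(z) = 1$. Indeed, the only graph of excess 0 where all components have positive excess is the empty graph (this can also be deduced by calculus from Lemma 4). Taking the logarithm of the previous expression and extracting the coefficient $[y^k]$, we obtain
$$\mathrm{CSG}_k(z) = [y^k] \log \left( 1 + \sum_{\ell \ge 1} \mathrm{SG}^{>0}_\ell(z) y^\ell \right),$$
which leads to the result by expansion of the logarithm and extraction of the coefficient $[z^n]$. Observe that $q \le k$ because each $k_j$ is at least 1, and $k_j \le k - q + 1$ for the same reason.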
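The expansion of the logarithm used in this proof is the identity $[y^k] \log(1 + \sum_{\ell \ge 1} a_\ell y^\ell) = \sum_{q=1}^{k} \frac{(-1)^{q+1}}{q} \sum_{k_1 + \cdots + k_q = k} \prod_j a_{k_j}$, where the inner sum runs over compositions of $k$ into $q$ positive parts. The sketch below checks this identity on scalar coefficients $a_\ell$ (in the theorem, the $a_\ell$ are power series in $z$, a case the sketch does not cover) against the usual recurrence for the logarithm of a power series. All function names are ours.

```python
from fractions import Fraction
from math import factorial

def compositions(k, q):
    """Yield all tuples (k_1, ..., k_q) of positive integers summing to k."""
    if q == 1:
        if k >= 1:
            yield (k,)
        return
    for first in range(1, k - q + 2):  # each part is at most k - q + 1
        for rest in compositions(k - first, q - 1):
            yield (first,) + rest

def log_coeff_by_compositions(a, k):
    """[y^k] log(1 + sum_{l >= 1} a[l] y^l), by expansion of the logarithm."""
    total = Fraction(0)
    for q in range(1, k + 1):
        inner = Fraction(0)
        for comp in compositions(k, q):
            prod = Fraction(1)
            for kj in comp:
                prod *= a[kj]
            inner += prod
        total += Fraction((-1) ** (q + 1), q) * inner
    return total

def log_coeff_by_recurrence(a, k_max):
    """Same coefficients from A' = L' A, i.e. k a_k = sum_j j l_j a_{k-j}, with a_0 = 1."""
    l = [Fraction(0)] * (k_max + 1)
    for k in range(1, k_max + 1):
        correction = sum((j * l[j] * a[k - j] for j in range(1, k)), Fraction(0))
        l[k] = a[k] - correction / k
    return l

# Example: A(y) = e^y, whose logarithm is y.
a = [Fraction(1, factorial(l)) for l in range(7)]
print([log_coeff_by_compositions(a, k) for k in range(1, 7)])
```

The number of compositions grows exponentially with $k$, which is one reason why the asymptotic analysis needs to isolate a few dominant terms of this sum.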

Asymptotics of connected graphs
In this section, we prove Theorem 1, deriving $\mathrm{CSG}_{n,k}$ up to a multiplicative factor $(1 + O(n^{-d}))$, where $d$ is an arbitrary fixed integer. Our strategy is to express $\mathrm{CSG}_{n,k}$ as a sum of finitely many non-negligible terms, whose asymptotic expansions are extracted using a saddle-point method. We will see that in the expression of $\mathrm{CSG}_{n,k}$ from Theorem 3, the dominant contribution comes from $q = 1$, i.e., applying Lemma 4, [...]. In this expression, the dominant contribution will come from $\ell = 0$. This means that a graph with $n$ vertices, excess $k$, and without tree or unicycle components, is connected with high probability, a fact already proven by Erdős and Rényi (1960) and used by Pittel and Wormald (2005). Furthermore, its loops and double edges are typically disjoint, hence forming a patchwork of excess 0. We now derive the asymptotics $D_{n,k}$ of this dominant term, and will use it as a reference, to which the other terms will be compared.
Lemma 6. When $k/n$ tends toward a positive constant, we have the following asymptotics [...], where the right-hand side is denoted by $D_{n,k}$, and $\lambda$ is the unique positive solution of
$$\frac{\lambda}{2} \frac{e^{\lambda} + 1}{e^{\lambda} - 1} = \frac{k}{n} + 1.$$
In particular, introducing the value $\zeta$ characterized by $T(\zeta) = \frac{\lambda}{e^{\lambda} - 1}$, we have [...]

Proof. Injecting the formulas for $P_0(z, u)$ and $V(z)$ derived in Lemmas 2, 3, the expression becomes [...] (1 − T(z)) B(z, x).
We recognize the classic large powers setting, and a bivariate saddle-point method (see e.g. Bender and Richmond (1999)) is applied to extract the asymptotics, which implies the second result of the lemma: [...], where $\zeta$, $\lambda$ and the $2 \times 2$ matrix $(H_{i,j}(z, x))_{1 \le i,j \le 2}$ are characterized by the equations [...]. The first result follows by application of the Stirling formula and expansion of the expression. The system of equations characterizing $\zeta$ and $\lambda$ is equivalent to [...]. The super-exponential term in the asymptotics $D_{n,k}$ is $n^{n+k}$. The exponential term is
$$\left( \frac{2k}{n} \right)^{k} e^{-n-k} \frac{B(\zeta, \lambda)^k}{\zeta^n \lambda^{2k}} = \left( \frac{e^{\lambda/2} - e^{-\lambda/2}}{\lambda^{1 + k/n}} \right)^{n}. \quad (1)$$
The coefficients of the symmetric matrix $H = H(\zeta, \lambda)$ are [...]. The constant and polynomial terms of the asymptotics $D_{n,k}$ are [...]. (2)
$D_{n,k}$ is then the product of $n^{n+k}$ with the right-hand sides of Equations (1) and (2).
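The saddle-point parameters of Lemma 6 are easy to evaluate numerically. The left-hand side of the equation defining $\lambda$ equals $\frac{\lambda}{2} \coth \frac{\lambda}{2}$, which increases from 1 to infinity on the positive reals, so a bisection is enough. The value $\zeta$ then follows from $T(\zeta) = \lambda / (e^{\lambda} - 1)$ and the relation $T(z) = z e^{T(z)}$, which gives $\zeta = T(\zeta) e^{-T(\zeta)}$. The sketch below is only an illustration; the function names and the tolerance are our own choices.

```python
import math

def solve_lambda(ratio, tol=1e-12):
    """Positive root of (lam/2) * (e^lam + 1) / (e^lam - 1) = ratio + 1,
    where ratio = k/n is assumed to be positive."""
    target = ratio + 1.0

    def f(lam):
        # (e^lam + 1) / (e^lam - 1) = coth(lam / 2)
        return 0.5 * lam / math.tanh(0.5 * lam)

    lo, hi = 1e-9, 1.0
    while f(hi) < target:  # f is increasing and unbounded
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def zeta_from_lambda(lam):
    """zeta such that T(zeta) = lam / (e^lam - 1), using zeta = T e^{-T}."""
    tau = lam / math.expm1(lam)
    return tau * math.exp(-tau)

lam = solve_lambda(1.0)  # k = n, i.e. twice as many edges as vertices
zeta = zeta_from_lambda(lam)
print(lam, zeta)
```

Since $0 < \lambda / (e^{\lambda} - 1) < 1$ for positive $\lambda$, the computed $\zeta$ always lies strictly inside the disc of convergence of $T(z)$, of radius $1/e$.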
In the expression of $\mathrm{CSG}_{n,k}$ from Theorem 3, the product over $j$ has the following simple bound.
Lemma 7. When $k/n$ tends to a positive constant, for any integer composition $k_1 + \cdots + k_q = k$, we have [...], where the big O is independent of $q$.
Proof. According to Lemma 4, we have [...]. Applying a classic bound (see e.g. (Flajolet and Sedgewick, 2009, Section VIII.2)), we obtain for all $j$ [...]. Taking the product over $j$ and using the facts $k_1 + \cdots + k_q = k$ and $e^{-MV(\zeta)} < 1$ leads to [...]. The result follows, as a consequence of the bound derived in Lemma 6.
We now identify, in the expression of $\mathrm{CSG}_{n,k}$ from Theorem 3, some negligible terms.
Lemma 8. For any fixed $d$ (resp. fixed $d$ and $q$), the following two terms are [...]
Proof. According to Lemma 7, it is sufficient to prove that the sequence [...] satisfies, for any fixed $d$ (resp. when $d$ and $q$ are fixed), [...]. The proof is available in Appendix 5.2. The two main ingredients are that the argument of the sum defining $S_{q,d,k}$ is maximal when one of the $k_j$ is large (then the others remain small), and that $S_{q,0,k} \le 3q$ for all $q \le k$ (proof by induction).
Using the previous lemma, we remove the negligible terms from $\mathrm{CSG}_{n,k}$ and simplify its expression.
Lemma 9. There exist computable polynomials $R_{q,r}$ such that, when $k/n$ has a positive limit, [...]
Proof. The previous lemma proves that in the expression of $\mathrm{CSG}_{n,k}$ from Theorem 3, we need only consider the terms corresponding to $q \le d + 4$, and $k - (d + 4) \le \max_j (k_j) \le k$. Since $k_1 + \cdots + k_q = k$, when $k$ is large enough and $d$ is fixed, there is at most one $k_j$ between $k - d$ and $k$. Up to a symmetry of order $q$, we can thus assume $k_q = \max_j (k_j)$, and introduce $r = k - k_q$. According to Lemma 5, there exist computable polynomials $(Q_k)_{k \ge 1}$ such that [...], where the denominator is $(1 - T(z))^{3r}$, and the numerator is the polynomial $R_{q,r}$ evaluated at $T(z)$.
The next lemma proves that the terms corresponding to patchworks with a large excess are negligible. The difficulty here is that we can only manipulate the generating functions of patchworks of finite excess.
Lemma 10. When $k/n$ has a positive limit and $q$, $r$ are fixed, then [...]
Proof. We only present the proof of the equality [...]. This corresponds to the case $q = 1$ and $r = 0$ of the lemma, the general proof being identical. Given a finite family $F$ of multigraphs, let $\mathrm{IE}_{<d}(F)$ denote the bounded inclusion-exclusion operator [...]. Let $\mathrm{MG}^{>0}_{n,k}$ denote the set of multigraphs with $n$ vertices, excess $k$, without tree or unicycle component. Its subset $\mathrm{MG}^{>0}_{n,k,<d}$ (resp. $\mathrm{MG}^{>0}_{n,k,\ge d}$) corresponds to multigraphs $G$ with maximal patchwork $\mathrm{LD}(G)$ of excess less than $d$ (resp. at least $d$). Given the decomposition $\mathrm{MG}^{>0}_{n,k} = \mathrm{MG}^{>0}_{n,k,<d} \uplus \mathrm{MG}^{>0}_{n,k,\ge d}$, we have $\mathrm{IE}_{<d}(\mathrm{MG}^{>0}_{n,k}) = \mathrm{IE}_{<d}(\mathrm{MG}^{>0}_{n,k,<d}) + \mathrm{IE}_{<d}(\mathrm{MG}^{>0}_{n,k,\ge d})$.
Working as in the proof of Lemma 4, we obtain [...]. Applying the same saddle-point method as in Lemma 6, the $\ell$th term of the sum is $\Theta(k^{-\ell} D_{n,k})$. By inclusion-exclusion, $\mathrm{IE}_{<d}(\mathrm{MG}^{>0}_{n,k,<d}) = \mathrm{SG}^{>0}_{n,k}$ so, injecting those results in Equation (4), [...]. We now bound $|\mathrm{IE}_{<d}(\mathrm{MG}^{>0}_{n,k,\ge d})|$. Any multigraph from $\mathrm{MG}^{>0}_{n,k,\ge d}$ contains, as a subgraph, a patchwork of excess $d$. Thus, $|\mathrm{MG}^{>0}_{n,k,\ge d}|$ is bounded by the number of multigraphs from $\mathrm{MG}^{>0}_{n,k}$ where a patchwork of excess $d$ is distinguished. If, in any such multigraph, we mark another patchwork of excess less than $d$ (which might well intersect the patchwork previously distinguished), we obtain the bound [...], where the second argument of $P_d$ is a 2, because each loop and double edge of the distinguished patchwork can be either marked or left unmarked. By the same saddle-point argument, this is $O(k^{-d} D_{n,k})$.
Combining Lemmas 9 and 10, $\mathrm{CSG}_{n,k}$ is expressed as a sum of finitely many terms (since $d$ is fixed) [...], where [...]. Applying the same saddle-point method as in Lemma 6, we obtain that the summand corresponding to $q$, $r$, $\ell$ is $\Theta(k^{-r-\ell} D_{n,k})$. Hence, $D_{n,k}$ is the dominant term in the asymptotics of $\mathrm{CSG}_{n,k}$. We can be more precise in our estimation of each summand. Its coefficient extraction is expressed as a Cauchy integral on a torus of radii $(\zeta, \lambda)$ (from Lemma 6),
$$[z^n x^{2k}] A_{q,r,\ell}(z, x) B(z, x)^k = \frac{1}{(2\pi)^2} \int_{\theta = -\pi}^{\pi} \int_{\varphi = -\pi}^{\pi} A_{q,r,\ell}(\zeta e^{i\theta}, \lambda e^{i\varphi}) \frac{B(\zeta e^{i\theta}, \lambda e^{i\varphi})^k}{\zeta^n e^{n i \theta} \lambda^{2k} e^{2 k i \varphi}} \, d\theta \, d\varphi,$$
and its asymptotic expansion follows, by application of (Pemantle and Wilson, 2013, Theorem 5.1.2),
$$n! (2(k - r - \ell) - 1)!! [z^n x^{2k}] A_{q,r,\ell}(z, x) B(z, x)^k = k^{-r-\ell} D_{n,k} \left( b_0 + \cdots + b_{d-1} n^{-(d-1)} + O(n^{-d}) \right),$$
where the $(b_\ell)$ are computable constants, and the factorials have been replaced by their asymptotic expansions. Injecting those expansions in Equation (5) concludes the proof of Theorem 1.
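The coefficient extraction by a Cauchy integral on a torus can be illustrated numerically on a toy example. The sketch below does not use the functions $A_{q,r,\ell}$ and $B$ of the paper: it takes the hypothetical test function $F(z, x) = e^{z(1+x)}$, for which $[z^n x^m] F(z, x) = \binom{n}{m} / n!$, and approximates the double integral by the trapezoidal rule. The function name `torus_coefficient` and the number of sample points are our own choices.

```python
import cmath
import math

def torus_coefficient(F, n, m, r_z, r_x, N=64):
    """Approximate [z^n x^m] F(z, x) by the trapezoidal rule applied to the
    Cauchy integral on the torus |z| = r_z, |x| = r_x, with N points per angle."""
    total = 0j
    for a in range(N):
        theta = 2 * math.pi * a / N
        z = r_z * cmath.exp(1j * theta)
        for b in range(N):
            phi = 2 * math.pi * b / N
            x = r_x * cmath.exp(1j * phi)
            total += F(z, x) * cmath.exp(-1j * (n * theta + m * phi))
    # Average over the N * N sample points, then divide by r_z^n * r_x^m.
    return (total / (N * N)).real / (r_z ** n * r_x ** m)

def toy(z, x):
    return cmath.exp(z * (1 + x))

approx = torus_coefficient(toy, 5, 2, 1.0, 1.0)
exact = math.comb(5, 2) / math.factorial(5)
print(approx, exact)
```

For an entire function such as this one, any radii give the same value up to numerical error. In the asymptotic analysis, the radii are instead chosen at the saddle point $(\zeta, \lambda)$, where the integrand concentrates around $\theta = \varphi = 0$.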
We apply Lemma 12 to bound the sum S q,0,k ≤ q(2 + O(k −2 )3k + o(1)), which is not greater than 3q for k large enough.
Lemma 14. For any fixed $d$, $k$ large enough and $q \le k$, we have $S_{q,k-d,k} \le 2^{-k}$.
Applying Stirling's formula, we bound the double factorials $C_1 (2k)^k e^{-k} \le (2k - 1)!! \le C_2 (2k)^k e^{-k}$ for some positive constants $C_1$, $C_2$. This implies, when $k_1 + \cdots + k_q = k$, [...]. The cardinality of the set $\{(k_1, \ldots, k_q) \in [0, d]^q \mid k_1 + \cdots + k_q = k\}$ is at most $d^q$, which is not greater than $d^k$, so [...]. The right-hand side is smaller than $2^{-k}$ when $d$ is fixed and $k$ is large enough.
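The two-sided Stirling bound on the double factorials can be observed numerically. The sketch below evaluates the ratio $(2k-1)!! / ((2k)^k e^{-k})$ through logarithms, using the identity $(2k-1)!! = (2k)! / (2^k k!)$. Stirling's formula predicts that this ratio tends to $\sqrt{2}$. The function names are ours, and the numerical range observed for the ratio is an experimental indication, not a proof of specific values for $C_1$ and $C_2$.

```python
import math

def log_double_factorial_odd(k):
    """log((2k-1)!!), computed from (2k-1)!! = (2k)! / (2^k k!)."""
    return math.lgamma(2 * k + 1) - k * math.log(2) - math.lgamma(k + 1)

def stirling_ratio(k):
    """(2k-1)!! divided by (2k)^k e^{-k}, for k >= 1."""
    return math.exp(log_double_factorial_odd(k) - k * math.log(2 * k) + k)

for k in (1, 10, 100, 1000):
    print(k, stirling_ratio(k))
print(math.sqrt(2))
```

Working with `math.lgamma` avoids overflow for large $k$, at the price of floating-point rounding.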