Rare Events and Conditional Events on Random Strings

Some strings, the texts, are assumed to be randomly generated according to a probability model that is either a Bernoulli model or a Markov model. A rare event is the over- or under-representation of a word or a set of words. The aim of this paper is twofold. First, a single word is given. We study the tail distribution of the number of its occurrences. Sharp large deviation estimates are derived. Second, we assume that a given word is overrepresented. The conditional distribution of a second word is studied; formulae for the expectation and the variance are derived. In both cases, the formulae are precise and can be computed efficiently. These results have applications in computational biology, where a genome is viewed as a text.


Introduction
In this paper, we study the distribution of the number of occurrences of a word or a set of words in random texts. So far, the first moments, e.g. the mean and the variance, have been extensively studied by various authors under different probability models and different counting schemes [Wat95, Rég00, Szp01]. Moreover, it is well known that the random variable that counts the number of occurrences converges, in law, to the normal law [BK93, PRdT95, RS97a, NSF99, FGSV01] when the size n of the text grows to infinity. Nevertheless, very few results are known out of the convergence domain, also called the central domain. This paper aims at filling this gap, as rare events occur out of the convergence domain.
First, we study the tail distribution. We consider a single given word, H 1 . In [RS97a, NSF99], a large deviation principle is established; in [RS97a] the rate function is implicitly defined, but left unsolved. In [RS98], the authors approximate the exact distribution by the so-called compound Poisson distribution, and compute the tail distribution of this approximate distribution. We provide a precise expansion of the exact probability out of the convergence domain. More precisely, we derive a computable formula for the rate function, and two more terms in the asymptotic expansion. This accuracy is made possible by the combinatorial structure of the problem. Second, we rely on these results to address the conditional counting problem. The overrepresentation (or the under-representation) of a word H 1 modifies the distribution of the number of occurrences of the other words. In this paper, we study the expectation and the variance of the number of occurrences of a word H 2 when another word, H 1 , is exceptional, that is, either overrepresented or under-represented. Our results on the tail distribution of H 1 allow us to show that the conditional expectation and variance of H 2 are linear functions of the size n of the text. We derive explicit formulae for the linearity constants.
The complexity of computing the tail distribution or the conditional counting moments is low. As a matter of fact, it turns out that the problem reduces to solving a polynomial equation whose degree equals the length of the overrepresented word. The approach is valid for various counting models.
These results have applications in computational biology, where a genome is viewed as a text. Available data on genomes are increasing continuously. To extract relevant information from this huge amount of data, it is necessary to provide efficient tools for "in silico" prediction of potentially interesting regions. The statistical methods now widely used [BJVU98, GKM00, BLS00, LBL01, EP02, MMML02] rely on a simple basic assumption: an exceptional word, i.e. a word which occurs significantly more (or less) often in real sequences than in random ones, may denote a biological functionality. The conditional counting problem is addressed when one wants to detect a weak biological signal, the word H 2 , hidden by a stronger signal, the word H 1 [BFW + 00, DRV01].
Section 2 is devoted to the introduction of some preliminary notions and results. The tail distribution of a single word is studied in Section 3. Conditional events are addressed in Section 4.

Probability model
Our assumption is that the languages are generated on some alphabet S of size V by an ergodic and stationary source. The models we handle are either the Bernoulli model or the Markov model.
In the Markov model, a text T is a realization of a stationary Markov process of order K where the probability of the next symbol occurrence depends on the K previous symbols. Given two K-uples α and β, the probability that a β-occurrence ends at position l, when an α-occurrence ends at position l − 1, does not depend on the position l in the text; we denote it P(β|α). The transition matrix P is sparse when K > 1. Vector π = (π 1 , . . . , π V K ) denotes the stationary distribution satisfying πP = π, and Π is the stationary matrix that consists of V K identical rows equal to π. Finally, Z is the fundamental matrix Z = (I − (P − Π)) −1 where I is the identity matrix.

Definition 2.1 Given a word z of length |z| greater than or equal to K, we denote P(w|z) the conditional probability that a w-occurrence starts at a given position l in the text, knowing that a z-occurrence starts at position l − |z| + 1.
Given a word w of size |w|, |w| ≥ K, we denote f (w) and l(w) the w-prefix and the w-suffix of length K. For i in {1, . . . , |w| − K + 1}, we denote w[i] the i-th factor of length K, that is, the factor of length K starting at position i in w. We denote P(w) the stationary probability that the word w occurs at a given position in a random text. It will appear that all counting results depend on the Markovian process through submatrices of the matrix F(z) defined below.

Definition 2.2 Given a Markovian model of order K, let F(z) be the V K × V K matrix defined by (1). It is worth noticing that F(z) can be re-expressed as a power series in Z.
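As an illustration of these matrix quantities, the following sketch computes the stationary distribution π satisfying πP = π, the stationary matrix Π, and the fundamental matrix Z = (I − (P − Π))⁻¹ for a two-state chain (order K = 1 on a binary alphabet). The transition matrix P below is a made-up example, not taken from the paper.

```python
import numpy as np

# Illustrative two-state Markov chain (hypothetical values).
P = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Stationary distribution: left eigenvector of P for eigenvalue 1,
# normalized so its entries sum to 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()

Pi = np.tile(pi, (2, 1))            # stationary matrix: identical rows pi
I = np.eye(2)
Z = np.linalg.inv(I - (P - Pi))     # fundamental matrix

print(pi)                           # here pi = (0.75, 0.25)
```

One checks that πP = π and that Z(I − (P − Π)) is the identity, as required by the definition.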
In the Bernoulli model, one assumes that the text is randomly generated by a memoryless source. Each letter s of the alphabet has a given probability p s to be generated at any step. Generally, the p s are not equal and the model is said to be biased. When all p s are equal, the model is said to be uniform. The Bernoulli model can be viewed as a Markovian model of order K = 0.
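Under the Bernoulli model just described, the stationary probability of a word is simply the product of its letter probabilities. The sketch below illustrates this with made-up letter probabilities for a DNA-like alphabet.

```python
# Hypothetical biased Bernoulli model on {A, C, G, T}.
p = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

def word_probability(w):
    """P(w) in the Bernoulli model: product of p_s over the letters of w."""
    prob = 1.0
    for s in w:
        prob *= p[s]
    return prob

print(word_probability("ATT"))   # 0.3 * 0.3 * 0.3, i.e. about 0.027
```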

The correlation polynomials and matrices
Finding a word in a random text is, in a certain sense, correlated to the previous occurrences of the same word or of other words. For example, the probability to find H 1 = ATT, knowing that one has just found H 2 = TAT, is, intuitively, rather good, since a T just after H 2 is enough to give H 1 . The correlation polynomials and the correlation matrices give a way to formalize this intuitive observation. At first, let us define the overlapping set and the correlation set [GO81] of two words.

Definition 2.3
The overlapping set of two words H i and H j is the set of suffixes of H i which are prefixes of H j . The correlation set is the set of H j -suffixes in the associated H j -factorizations. It is denoted by A i,j .
When H i = H j , the correlation set is called the autocorrelation set of H i .
For example, the overlapping set of H 1 = ATT and H 2 = TAT is {T}. The associated factorization of H 2 is T • AT. The correlation set is A 1,2 = {AT}. The overlapping set of H 2 with itself is {TAT, T}. The associated factorizations are TAT • ε and T • AT, where ε is the empty string. The autocorrelation set of H 2 is {ε, AT}. As any string belongs to its own overlapping set, the empty string belongs to any autocorrelation set.
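The overlapping and correlation sets of Definition 2.3 are straightforward to compute; the following sketch reproduces the example above (the function names are ours, for illustration only).

```python
def overlapping_set(hi, hj):
    """Suffixes of hi (including hi itself) that are prefixes of hj."""
    return {hi[k:] for k in range(len(hi)) if hj.startswith(hi[k:])}

def correlation_set(hi, hj):
    """For each overlap o, the leftover hj-suffix w in hj = o . w."""
    return {hj[len(o):] for o in overlapping_set(hi, hj)}

print(overlapping_set("ATT", "TAT"))   # {'T'}
print(correlation_set("ATT", "TAT"))   # {'AT'}
print(correlation_set("TAT", "TAT"))   # {'', 'AT'}: the autocorrelation set
```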

Definition 2.4
In the Markov model, the correlation polynomial of two words H i and H j is defined as A i,j (z) = Σ w∈A i,j P(w|H i ) z |w| . In the Bernoulli model, the correlation polynomial is A i,j (z) = Σ w∈A i,j P(w) z |w| . When H i = H j , this polynomial is called the autocorrelation polynomial of H i .
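A minimal sketch of the Bernoulli correlation polynomial, under the convention A i,j (z) = Σ w∈A i,j P(w) z |w| over the correlation set; the uniform letter probabilities are an illustrative assumption.

```python
# Uniform Bernoulli model on a 4-letter alphabet (illustrative).
p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def correlation_set(hi, hj):
    over = {hi[k:] for k in range(len(hi)) if hj.startswith(hi[k:])}
    return {hj[len(o):] for o in over}

def correlation_polynomial(hi, hj, z):
    """A_{i,j}(z) = sum over w in the correlation set of P(w) z^|w|."""
    total = 0.0
    for w in correlation_set(hi, hj):
        pw = 1.0
        for s in w:
            pw *= p[s]
        total += pw * z ** len(w)
    return total

# Autocorrelation polynomial of TAT: A(z) = 1 + P(AT) z^2 = 1 + z^2/16.
print(correlation_polynomial("TAT", "TAT", 1.0))   # 1.0625
```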
Given two words H 1 and H 2 , the matrix A(z) = (A i,j (z)) 1≤i,j≤2 is called the correlation matrix.
Definition 2.5 Given two ordered sets .

Word counting
There are several ways to count word occurrences, depending on the intended application. Let H 1 and H 2 be two words on the same alphabet. In the overlapping counting model [Wat95], every occurrence of each word is taken into account. Assume, for example, that H 1 = ATT, H 2 = TAT and that the text is TTATTATATATT. This text contains 2 occurrences of H 1 and 4 overlapping occurrences of H 2 , at positions 2, 5, 7 and 9. In other models, such as the renewal model [TA97], some overlapping occurrences are not counted. Although our approach is valid for different counting models, we restrict ourselves here to the most commonly used one, i.e. the overlapping model [Wat95].
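The overlapping counting model can be checked directly on the example above; the helper below is an illustrative sketch, reporting 1-based positions as in the text.

```python
def count_overlapping(text, word):
    """Number of (possibly overlapping) occurrences and their 1-based positions."""
    positions = [i + 1 for i in range(len(text) - len(word) + 1)
                 if text[i:i + len(word)] == word]
    return len(positions), positions

text = "TTATTATATATT"
print(count_overlapping(text, "ATT"))  # (2, [3, 10])
print(count_overlapping(text, "TAT"))  # (4, [2, 5, 7, 9])
```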
When several words are searched simultaneously, we need some additional conditions on this set of words, H. It is generally assumed that the set H is reduced.

Definition 2.6 [BK93]
A set of words is reduced if no word in this set is a proper factor of another word.
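Definition 2.6 is easy to test mechanically; the sketch below checks the reduction condition by substring search (the function name is ours).

```python
def is_reduced(words):
    """True iff no word of the set is a proper factor of another word."""
    return not any(u != v and u in v for u in words for v in words)

print(is_reduced({"ATT", "TAT"}))   # True
print(is_reduced({"AT", "ATT"}))    # False: AT is a proper factor of ATT
```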
The two words H 1 and H 2 do not play the same role in the conditional counting problem.We can partially relax the reduction condition.

Definition 2.7 A couple of words
Note that, in the case where the set of words is given by a regular expression, this regular expression must be unambiguous. A discussion on ambiguity in counting problems and on algorithmic issues can be found in [KM97].

Multivariate Probability Generating Functions
Our basic tools are multivariate probability generating functions. Let L be some language that is randomly generated according to one of the models described above. For any integer n, let L n be the set of words of size n that belong to L. Given two words H 1 and H 2 , we denote X i,n , with i ∈ {1, 2}, the random variable which counts the occurrences of H i in a text from this set L n ; we denote P(X i,n = k) the probability that H i occurs k times. The probability generating function of the random variable X i,n is denoted P i,n ; that is, P i,n (u) = Σ k≥0 P(X i,n = k) u k .

Definition 2.8 Given a language L, the multivariate generating function that counts H 1 and H 2 occurrences in the texts that belong to this language L is denoted L(z, u 1 , u 2 ). The multivariate generating function that counts H 1 -occurrences (only) is denoted L 1 (z, u 1 ).

Remark: These multivariate generating functions satisfy the equation L 1 (z, u 1 ) = L(z, u 1 , 1). Moreover, L 1 (z, 1) = L(z, 1, 1) is the ordinary generating function of the language L.
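For small n, the probability generating function P i,n can be tabulated by brute force, which gives a concrete, if inefficient, counterpart to the analytic treatment of the paper. The sketch below does so for the word AB on a binary alphabet under the uniform Bernoulli model (an illustrative setup, not from the paper).

```python
from itertools import product

def count_overlapping(text, word):
    return sum(text[i:i + len(word)] == word
               for i in range(len(text) - len(word) + 1))

def pgf_coefficients(word, n):
    """Coefficients P(X_n = k) of the PGF, by enumerating all 2^n texts."""
    coeffs = {}
    for letters in product("AB", repeat=n):
        k = count_overlapping("".join(letters), word)
        coeffs[k] = coeffs.get(k, 0.0) + 0.5 ** n
    return coeffs

coeffs = pgf_coefficients("AB", 5)
mean = sum(k * q for k, q in coeffs.items())
print(round(mean, 10))   # (n - 1) P(AB) = 4 * 0.25 = 1.0
```

The coefficients sum to 1 (it is a probability distribution), and the derivative at u = 1, i.e. the mean, matches the well-known first-moment formula.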
One important language is the set of all possible words on the alphabet S, denoted below by T. The language T is also named the language of texts. A general expression for its multivariate generating function is given in [Rég00]. For a single word H 1 of size m 1 , it depends on H 1 through an entire series D 1 (z) of the variable z, defined as follows. In the Bernoulli model, this series D 1 (z) is a polynomial.
Proposition 2.1 [RS97a] The multivariate generating function that counts the occurrences of a single word H 1 of size m 1 , in a Bernoulli or a Markov model, satisfies an equation in which H 1 enters only through the series D 1 (z). As a consequence, our counting results only depend on this series D 1 (z). Similarly, for two-word counts, all the results depend on H 1 and H 2 through the matrix D(z) defined below.

Definition 2.9 Given a reduced couple of words H 1 and H 2 of sizes m 1 and m 2 , let D(z) be the 2 × 2 matrix whose entries we denote D i,j (z), for i, j in {1, 2}.

Significance of an Exceptional Word

In this section, we study the tail distribution of the number of occurrences of a single word H 1 in a random text T. In [RS97a], a large deviation principle is established by the Gärtner-Ellis theorem. We derive below an explicit formula for the rate function and a precise expansion of the probabilities in the large deviation domain. These results should be compared to [Hwa98], although the validity domains in [Hwa98] are closer to the central region.

Sharp large deviations estimates
Definition 3.1 The fundamental equation is the equation where a is a real positive number satisfying 0 ≤ a ≤ 1.
Lemma 3.1 Assume that a > P(H 1 ). When H 1 is self-overlapping or when a < 1/m 1 , there exists a largest real positive solution of the fundamental equation that satisfies 0 < z a < 1. It is called the fundamental root of (E a ) and denoted z a .

Proof: Define the function of the real variable
We are now ready to state the main result of this section.

Theorem 3.1 Let H 1 be a given word and a be some real number such that a ≠ P(H 1 ). In a Bernoulli and a Markov model, the random variable X 1,n satisfies Equation (8), where z a is the fundamental root of (E a ).
Remark: D 1 (z)/(1 − z) is the generating function of a language [RS97a]. It satisfies D 1 (0) = 1. Hence, it has positive coefficients and cannot vanish at a real positive value. It follows that D 1 (z a ) ≠ 0 and that D 1 (z a ) + z a − 1 ≠ 0.
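Numerically, extracting the fundamental root of Lemma 3.1 amounts to finding the largest real root in (0, 1) of a polynomial whose degree is the word length (cf. the introduction). Since the fundamental equation (E a ) itself is not reproduced here, the cubic below is a hypothetical stand-in used only to illustrate the extraction step.

```python
import numpy as np

# Hypothetical degree-3 stand-in for (E_a): z^3 + 0.5 z^2 - 6.5 z + 3,
# whose real roots are -3, 0.5 and 2.
coeffs = [1.0, 0.5, -6.5, 3.0]

def fundamental_root(coeffs):
    """Largest real root in (0, 1), or None if there is none."""
    roots = np.roots(coeffs)
    real = [r.real for r in roots if abs(r.imag) < 1e-12 and 0 < r.real < 1]
    return max(real) if real else None

print(fundamental_root(coeffs))   # approximately 0.5
```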
Remark: It follows from (8) that −ln P(X 1,n ≥ na)/n has a finite limit, I(a), when n tends to ∞. This limit is the rate function of large deviation theory [DZ92]. Equation (8) provides two additional terms in the asymptotic expansion and a correction to the result claimed in [RS97a].
Remark: When a = P(H 1 ), Equation (8) still provides the probability in the central domain. As a matter of fact, the fundamental root z a is then equal to 1. The rate function is I(a) = 0, as expected in the central domain, and δ a = 0. One can check that the remaining term is the variance previously computed in the Bernoulli case by various authors [Wat95] and in the Markov case in [RS97a].
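The large-deviation limit of the remarks above can be sanity-checked in the simplest case: for a one-letter word H 1 under a Bernoulli model, X 1,n is Binomial(n, p) and the rate function is the classical relative entropy I(a) = a ln(a/p) + (1 − a) ln((1 − a)/(1 − p)). The numeric check below is ours, not from the paper.

```python
from math import comb, log

p, a, n = 0.25, 0.5, 400

def tail(n, p, k):
    """Exact binomial tail P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Rate function: relative entropy between Bernoulli(a) and Bernoulli(p).
I = a * log(a / p) + (1 - a) * log((1 - a) / (1 - p))
estimate = -log(tail(n, p, int(n * a))) / n
print(I, estimate)   # the estimate approaches I(a) as n grows
```

The Chernoff bound guarantees estimate ≥ I(a); the gap, of order (ln n)/n, is exactly the kind of sub-exponential correction that Theorem 3.1 makes explicit.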
The next proposition provides a local expansion of the rate function.
Proposition 3.1 The rate function I satisfies, for any ã in a neighbourhood of a, where

Technical results
Our proof of Theorem 3.1 is purely analytic. It follows from the definition of T 1 (z, u) in (2), using the expression (4), that when na is an integer, this function is an analytic function. Let us show that its radius of convergence is strictly greater than 1. The generating function M 1 (z) is the probability generating function of a language; hence, all its coefficients are positive, the radius of convergence is at least 1, and M 1 (1) = 1. Now, this equation implies that the singularities of M 1 are the singularities of D 1 (z) and the roots of D 1 (z) = 0. Hence, these singularities and these roots are necessarily greater than 1. Finally, all singularities of f a (z) are greater than 1.
Let us observe that there exists a direct proof in the Bernoulli model. The series has only positive coefficients; hence, its root of smallest modulus is real positive. As A 1 (z) and P(H 1 )z m 1 have positive coefficients, a real positive root of D 1 (z) is greater than 1.
Cauchy's formula for univariate series can be written as an integral along any contour around 0 included in the convergence circle. We define the function h a (z) of the complex variable z by the equation below. The integral can then be expressed in the form J g (a) = 1/(2iπ) ∮ e^{n h a (z)} g(z) dz, where g(z) is an analytic function. We need to establish an asymptotic expansion of this integral.

Theorem 3.2 Given an analytic function g, let J g (a) be the integral (15). If g is such that g(0) ≠ 0, then the integral J g (a) satisfies (16), where z a is the fundamental root of (7). If there exists an integer l such that g(z) = z l G(z) with G(0) ≠ 0, then (17) holds.

Before dealing with the proof of Theorem 3.2, we observe that h a (z a ) is the function I(a) defined in (9) and that the dominating term is e^{δ a }/σ a . This is Equation (8). The following terms in the expansion will be necessary to deal with conditional events in Section 4.

Proof of Theorem 3.2: We prove (16) by the saddle point method [Hen77]. We first need a technical lemma.

Lemma 3.2 Let a be a real number. The function h a (z) = a ln M 1 (z) − ln z and all its derivatives are rational functions of D 1 and its derivatives. They satisfy the following equalities. Moreover, there exists a neighbourhood of z a , included in the convergence domain, on which the real part of the second derivative admits a positive lower bound.

Proof: A differentiation of Equation (5) shows that the derivatives of h a (z) are rational functions of D 1 and its derivatives. The values at the point z a follow from the fundamental equation (E a ). As h ′′ (z a ) > 0, the second derivative h ′′ is strictly positive in some neighbourhood of z a ; this establishes the lower bound on the real part. ✷

Let us choose a suitable contour of integration for (15), starting from a Taylor expansion of h a (z) and g(z) around z = z a . With the change of variable y = x/(τ a √n), the integrand rewrites, when ny 3 is small, as a Gaussian factor in x. We choose as a first part of the contour a vertical segment. In order to keep ny 3 small
when ny 2 tends to ∞, we choose 1/3 < α < 1/2. In that case, each term x k provides a contribution, and F 2p+1 = 0. Hence, the odd terms do not contribute to the integral. This yields an asymptotic expansion of P(X 1,n = na) in 1/n p+1/2 . We now close our contour in order to get an exponentially negligible contribution. The bound (18) implies that the contributions of the segments [0, z 1 ] and [0, z 2 ] are exponentially smaller than e −nI(a) .
We now need to establish (17). In order to use (16), we rewrite the integral with ã defined by the equation na = (n − l) ã.
We substitute (ã, n − l) for (a, n) in Equation (16) and compute a Taylor expansion of all parameters: the fundamental root z ã , the rate function I(ã), the variance term τ ã and the constant term g(z ã ). From ∂φ/∂a (a, z a ) =

Rate function
We need here a local expansion of the rate function I(a) around the point a, which is interesting in its own right.
We have the following equalities. The coefficient of (ã − a) reduces to ∂ψ/∂a = ln M 1 (z). The coefficient of (ã − a) 2 rewrites as stated, and we get the rate function. As I(a) + a ln M 1 (z a ) = ln z a and G(z a )z a l = g(z a ), this term provides the correcting term.

Variance

We now compute the contribution of τ ã 2 (n − l).

Constant term
We now compute the contribution of G(z ã ).

This is Equation (17). ✷

Conditional Events
We consider here the conditional counting problem. The conditional expectation and variance can be expressed as functions of the coefficients of the multivariate generating function of the language of texts T; this follows from Equations (19) and (20), together with Definition (2). Given two words, the software RegExpCount allows one to compute and derive T(z, u 1 , u 2 ). The shift-of-the-mean method allows one to compute the linearity constants for the mean and the variance [Nic00]. This step is costly; notably, it must be repeated when n varies.
Our closed formulae provide an efficient alternative. The general expression for T(z, u 1 , u 2 ) given in [Rég00] is a matrix expression that is not suitable for the computation of the partial derivatives that occur in (19) and (20). In Section 4.1 below, we provide a new expression that is suitable for partial differentiation.
At the point u 2 = 1, the partial derivatives rewrite as ψ(z)/(1 − u 1 M 1 (z)) k , where ψ is analytic in z in a larger domain than 1/(1 − u 1 M 1 (z)). Hence, the methodology of Section 3 applies.

Multivariate Generating Functions for Word Counting
Our enumeration method follows the scheme developed in [Rég00]. More details on this formalism can be found in [Rég00, Szp01]. In this formalism, a set of basic languages, the initial, minimal and tail languages, is defined, and any counting problem is rewritten as a problem of text decomposition over these basic languages. This is in the same vein as the general decomposition of combinatorial structures over basic data structures presented in [FS96]. Such basic languages satisfy equations that depend on the counting model. These equations translate into equations for the corresponding generating functions, and the multivariate generating functions for the counting problem are rewritten over this set of basic generating functions.
We briefly present this formalism when two words H 1 and H 2 are counted. The initial languages R̃ i (for i = 1 or 2) are defined as the languages of words ending with H i and containing no other occurrence of H 1 or H 2 . The minimal language M i,j (for i ∈ {1, 2} and j ∈ {1, 2}) contains the words w which end with H j and such that H i w contains exactly two occurrences of {H 1 , H 2 }: the one at the beginning and the one at the end. The tail language Ũ i is the language of words w such that H i w contains exactly one occurrence of H i and no other {H 1 , H 2 }-occurrence. For example, let us assume that H 1 = ATT and H 2 = TAT; the text TTATTATATATT can be decomposed over these languages. Among the many decompositions of T according to these languages, the following new one is of particular interest for conditional counting.
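The decomposition of a text over initial, minimal and tail languages can be made concrete by cutting the text at the end position of each occurrence, as in the example above. The sketch below is ours and purely illustrative: the first piece belongs to an initial language, each middle piece to a minimal language, and the remainder is the tail.

```python
def decompose(text, words):
    """Cut text at the end position of every occurrence of the given words."""
    ends = sorted(i + len(w) for w in words
                  for i in range(len(text) - len(w) + 1)
                  if text[i:i + len(w)] == w)
    pieces, prev = [], 0
    for e in ends:
        pieces.append(text[prev:e])
        prev = e
    return pieces, text[prev:]          # (occurrence pieces, tail)

pieces, tail = decompose("TTATTATATATT", ["ATT", "TAT"])
print(pieces, repr(tail))   # ['TTAT', 'T', 'AT', 'AT', 'AT', 'T'] ''
```

Note how the overlapping occurrences of TAT yield the short minimal-language pieces AT, in agreement with the correlation set A 2,2 computed earlier.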
Theorem 4.1 Let T + ⊂ T be the set of words on the alphabet S which contain at least one occurrence of H 1 or at least one occurrence of H 2 . It satisfies a language equation that translates into a functional equation on the generating functions.

Proof: The first term of the right-hand side is the set of words of T + which do not contain any occurrence of H 1 ; such a text can be decomposed according to H 2 -occurrences, using the basic languages R̃ 2 , M 2,2 , Ũ 2 . The second term is the set of words of T + that contain at least one occurrence of H 1 ; such a text can be decomposed according to H 1 -occurrences, using the basic languages R 1 , M 1 , U 1 . ✷

The proposition below establishes a decomposition of the basic languages for a single pattern onto the basic languages for several words. The bivariate generating functions that count H 2 -occurrences in these basic languages follow.

Proposition 4.1 Given a reduced couple of words (H 1 , H 2 ), the basic languages satisfy the following equations, and the multivariate generating functions that count H 2 -occurrences in these languages follow.

Proof: The proof of the first equation relies on a very simple observation: a word w in R 1 is not in R̃ 1 iff it contains k occurrences of H 2 before H 1 , with k ≥ 1. Hence, such a word rewrites in a unique manner. A similar reasoning leads to the second and third equations. ✷

Partial derivatives
The proof of our main theorems, Theorem 4.2 and Theorem 4.3, relies on a suitable computation of the partial derivatives of the bivariate generating function. Notably, ∂T/∂u 2 (z, u 1 , 1) yields the generating function of conditional expectations.

Proposition 4.2 Let (H 1 , H 2 ) be a couple of words. The bivariate generating function of the H 1 -conditional expectation of H 2 -occurrences is given, in Bernoulli and Markov models, by (26).

Proof: Differentiating with respect to u 2 yields (30); Equations (23)-(25) allow for an easy derivation of it. The partial derivatives of the probability generating functions of the languages R 1 , U 1 and M 1 satisfy the corresponding equations. To complete the proof, we rely on the results proved in [RS97b, Rég00], where the monovariate generating functions of the basic languages are expressed in terms of the coefficients of D(z). More precisely:

Proposition 4.3 The matrix D(z) is regular when |z| < 1. The generating functions of the basic languages are defined by the following equations; the classical inversion formulae in dimension 2 lead to the stated expression. Setting u 2 = 1 in (30) and substituting the expressions given in (31)-(33) yields (26). ✷

Conditional expectation
Our main theorem is Theorem 4.2 below.We introduce a few notations.
Notation: Let us denote l and l̄ the orders at z = 0 of g and ḡ, respectively.

Theorem 4.2 Let T be the language of all possible words on an alphabet S. Assume that T is randomly generated by a Bernoulli or a Markov process. Given a reduced couple of words (H 1 , H 2 ), we denote X 1,n and X 2,n the two random variables that count the number of occurrences of H 1 and H 2 , respectively. The conditional expectation of X 2,n , knowing X 1,n , is linear in n, where µ and λ are functions of the autocorrelation polynomials at the point z a , the solution of Equation (7). With the notations of (8) and (34)-(37), these functions are given by (39).

Remark: In the central region, the substitutions z a = 1 and a = P(H 1 ) in (39) readily give µ(a) = P(H 2 ).
Proof: We are ready to compute (19). To get the linear term, we observe that M 1 (z, 1) is equal to M 1 (z) of the previous section and that D 1,1 (z) = D 1 (z). With k 1 = na, the ratio (19) to be computed becomes the following.
In this ratio, the computation of the numerator contribution is similar to the computation of (8). The integrand rewrites as D 1 (z) 2 /(D 1 (z) + z − 1) 2 and the ratio (19) becomes (na − 1) times a ratio of integrals of the form J g (a). Using (17), and (16) to compute J G (a) z a l̄−l , the exponential terms simplify and the γ-terms cancel. With the notations above, we have Ḡ(z)/G(z) · z l̄−l = ḡ(z)/g(z) = θ(z). It follows that the linear term is a θ(z a ) = µ(a).
Let us now compute the constant term. First, −ḡ(z a )/g(z a ) yields a contribution −θ(z a ). Second, the last term in the product contributes as stated. Finally, the last contribution to the constant term comes from the remaining factor.

Conditional variance
We prove here that the variance is a linear function of n, except for a few degenerate cases.We provide the linearity constant.
Theorem 4.3 With the same conditions as Theorem 4.2, the conditional variance of X 2,n , when X 1,n is known and equal to na, is a linear function of n.More precisely, where Remark: It follows from (41) that the expectation nµ(a) is a tight approximation of the variance, when M 2,2 (z) is small.This is the usual case, but the contribution of 2 M 2,2 (z a ) 1−M 2,2 (z a ) may be significant, for instance when H 2 = x * , where x is some character of the alphabet.
The linearity constant may also be 0 in some degenerate cases. For example, with an alphabet of size 2, the choice of the two words AB and BA leads to M 2,2 (z) = 0 and θ(z) = 1. The variance is 0. As a matter of fact, the difference between the number of occurrences of AB and the number of occurrences of BA is at most 1.
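The degenerate AB/BA case above can be verified empirically: in any binary text, the counts of AB and BA differ by at most 1, since their occurrences alternate. The check below is an illustrative sketch.

```python
import random

def count_overlapping(text, word):
    """Occurrences of a two-letter word in a text."""
    return sum(text[i:i + 2] == word for i in range(len(text) - 1))

random.seed(0)
for _ in range(1000):
    t = "".join(random.choice("AB") for _ in range(50))
    # AB- and BA-occurrences alternate, so the counts differ by at most 1.
    assert abs(count_overlapping(t, "AB") - count_overlapping(t, "BA")) <= 1
print("ok")
```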
We now use formula (20). We only need to compute the second partial derivative of T. We proceed with a second differentiation of (30), using again the partial derivatives, and finally we get Proposition 4.4 below.
Proposition 4.4 With the same hypotheses as Proposition 4.2, we have where Proof of Theorem 4.3: As a consequence of (42), we get Let us denote: Hence, To achieve the derivation, we need to establish relationships between g(z), g(z) and ḡ(z).We check that g(z) J g (a) + na f (z a ) g(z a ) and −2nµ(a)λ(a).The first term is a .
We now observe that the other contributions to ∂ 2 T/∂u 2 2 and E(X 2,n ) 2 have a common multiplicative factor:

Conclusion
Our formulae apply to both Bernoulli and Markov models of random text generation and provide sharp large deviation estimates. This approach needs far fewer computations than exact methods, in the domain where such methods are computable. Experimental evidence is presented in [DRV01], where our results are compared to others ([BFW + 00] and RSA-tools). Other applications, and a comparison with other methods [RS98, Nue01, RS01], are presented in [Rég03] and will be extended in a forthcoming paper. Maple procedures that implement a part of our results are available on request. An extension to under-represented words is possible, and related results are presented in [VM03].
A slight modification allows for the extension of these formulae to other counting models, such as the renewal model [Wat95, TA97]. A natural, and useful, generalisation of this work would be to give similar formulae for sets of motifs. In particular, computing the expectation and the variance conditioned by several overrepresented motifs would be useful to detect new significant information in biological sequences.
The function ψ(a, z) = h a (z) satisfies the functional equation ∂ψ/∂z (a, z a ) = φ(a, z a ) = h ′ a (z a ) = 0, where φ(a, z) = ∂ψ/∂z (a, z). It implicitly defines z a as a function z(a) of the variable a. Two differentiations with respect to a yield the derivatives of z(a) at the point a. More precisely, ∂φ/∂a + ∂φ/∂z · z ′ (a) = 0.

It follows that the quadratic terms (1/2) ḡ(z a )/g(z a ) and (ḡ(z a )/g(z a )) 2 cancel. Consequently, the variance is a linear function of n. In a few degenerate cases, it is a constant function. Let us now compute the linearity coefficient. First of all, the sum −(3/2) a · ḡ(z a )/g(z a ) + µ(a) contributes µ(a)(1 − 3θ(z a )). The term −θ(z a ) in λ(a) yields the contribution 2µ(a)θ(z a ). Then, we consider in turn the terms in (16) and (17) that contribute to (n 2 a 2 /2) J g (a).