Derivatives of Approximate Regular Expressions

Our aim is to construct a finite automaton recognizing the set of words that are at a bounded distance from some word of a given regular language. We define new regular operators, the similarity operators, based on a generalization of the notion of distance, and we introduce the family of regular expressions extended with similarity operators, which we call AREs (Approximate Regular Expressions). We give formulae to compute the Brzozowski derivatives and the Antimirov derivatives of an ARE, which allows us to solve the ARE membership problem and to construct two recognizers for the language denoted by an ARE. As far as we know, the family of approximate regular expressions is introduced for the first time in this paper. Classical approximate regular expression matching algorithms are approximate matching algorithms on regular expressions. Our approach is rather to perform exact matching on approximate regular expressions.


Introduction
This paper addresses the problem of constructing a finite automaton that recognizes the language of all the words that are at a distance less than or equal to a given positive integer k from some word of a given regular language. Our approach is based on the extension of regular expressions to approximate regular expressions (AREs) that handle distance operators. More precisely, we first define a new family of operators: given an integer k, the $\mathbb{F}_k$ operator is such that, for any regular language L, the language $\mathbb{F}_k(L)$ is the set of all the words that are at a distance less than or equal to k from some word of L. We then consider the family of approximate regular expressions obtained from the family of regular expressions by adding the family of $\mathbb{F}_k$ operators to the set of regular operators. We provide a formula that, given a regular language L, computes the quotient of the language $\mathbb{F}_k(L)$ with respect to a symbol. We finally extend the computation of Brzozowski derivatives [3] (resp. of Antimirov derivatives [1]) to the family of approximate regular expressions. The first benefit of the derivation of an ARE is that it yields an elegant solution for the approximate membership problem. Moreover, the set of Brzozowski derivatives (resp. of Antimirov derivatives) of an ARE is shown to be finite. As a consequence, the derivation of an ARE enables the computation of a finite automaton that recognizes the language of this ARE.
The similarity between two words is generally measured by a distance and two basic types of distance called Hamming distance and Levenshtein distance (or edit distance) are generally considered. In our constructions the similarity between two words is handled by a word comparison function, that is more general than a distance (for instance, a comparison function is not necessarily symmetrical). It is the reason why we will speak of similarity operators rather than of distance operators.
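As an illustration of the two basic distances mentioned above, the following minimal sketch computes the Hamming and the Levenshtein distances between two words. The function names are ours, and `None` is used here as a stand-in for an undefined value when the Hamming distance is asked for words of different lengths.

```python
from typing import Optional


def hamming(u: str, v: str) -> Optional[int]:
    """Number of mismatching positions; None when the lengths differ."""
    if len(u) != len(v):
        return None
    return sum(1 for x, y in zip(u, v) if x != y)


def levenshtein(u: str, v: str) -> int:
    """Edit distance (insertions, deletions, substitutions), row by row."""
    prev = list(range(len(v) + 1))          # distances from "" to prefixes of v
    for i, x in enumerate(u, start=1):
        cur = [i]                           # distance from u[:i] to ""
        for j, y in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,               # drop x
                           cur[j - 1] + 1,            # add y
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]
```

Both functions are symmetrical; the comparison functions considered in the rest of the paper need not be.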
The aim of this paper is to investigate the properties of the AREs family, in particular to define formulae for computing the set of (Brzozowski or Antimirov) derivatives of an ARE and to check the properties of this set. This theoretical study leads to a solution for the approximate membership problem as well as to a solution for the approximate regular expression matching problem (based on the automaton associated with the set of derivatives of an ARE). However, this paper is not an algorithmic contribution to the approximate regular expression matching problem: it investigates new automaton-theoretic constructions that hopefully make a sound foundation for the design of new approximate matching algorithms, but it does not present new efficient algorithms.
Let us recall that approximate matching consists in locating the segments of the text that approximately correspond to the pattern to be matched, i.e. segments that do not present too many errors with respect to the pattern. This research topic has numerous applications, in biology or in linguistics for example, and many algorithms have been designed in this framework for more than thirty years especially concerning approximate string matching (see [6,14] for a survey of such algorithms). Two contexts can be distinguished: in the off-line case, that is when a pre-computing of the text is performed, the basic tool is the construction of indexes [9]; otherwise, the basic technique is dynamic programming [12]. In both cases, automata constructions have been used, either to represent an index [18,2] or to simulate dynamic programming [8].
Several studies address the problem of constructing a finite automaton that recognizes the language of all the words that are at a distance less than or equal to a given positive integer k from a given word. For instance this problem is considered in [7] where Hamming distance is used and in [17] where Levenshtein distance is used. A challenging problem is to tackle the more general case where the pattern is no longer a word but a regular expression [15,19]. The solution described in [11] first computes k + 1 clones of some non-deterministic automaton recognizing the language of the regular expression and then interconnects these clones by a set of transitions that depends on the type of distance.
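The clone-based idea can be sketched as follows for the Levenshtein distance. This is only our reading of the general principle (states paired with an error counter ranging over 0..k), not the exact construction of [11]; the automaton encoding (a dictionary from `(state, symbol)` to a set of states, without ε-transitions) and the function name `approx_accepts` are assumptions made for the example. The product automaton is simulated on the fly instead of being built explicitly.

```python
def approx_accepts(initial, finals, delta, word, k):
    """True if some word accepted by the NFA is within k edits of `word`."""

    def skip_pattern_symbols(configs):
        # Follow transitions without reading input: the accepted word has an
        # extra symbol, which costs one error (move to the next clone).
        seen, todo = set(configs), list(configs)
        while todo:
            q, e = todo.pop()
            if e == k:
                continue
            for (p, _symbol), targets in delta.items():
                if p != q:
                    continue
                for t in targets:
                    if (t, e + 1) not in seen:
                        seen.add((t, e + 1))
                        todo.append((t, e + 1))
        return seen

    current = skip_pattern_symbols({(q, 0) for q in initial})
    for a in word:
        nxt = set()
        for q, e in current:
            if e < k:
                nxt.add((q, e + 1))        # `a` is an extra input symbol
            for (p, b), targets in delta.items():
                if p != q:
                    continue
                cost = 0 if b == a else 1  # exact match or substitution
                if e + cost <= k:
                    nxt.update((t, e + cost) for t in targets)
        current = skip_pattern_symbols(nxt)
    return any(q in finals for q, _ in current)
```

With the Hamming distance, only the match and substitution moves would be kept.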
As far as we know, the family of approximate regular expressions is introduced for the first time in this paper. The approximate regular expression matching algorithms described in the papers cited above are approximate matching algorithms on regular expressions. Our approach is rather to perform exact matching on approximate regular expressions. This paper is an extended version of [5]. Classical notions of language theory, such as derivative computation, are recalled in Section 2. Section 3 gives a formalization of the notion of word comparison function and provides a definition of the family of approximate regular expressions. The next two sections investigate derivation-based constructions of an automaton from an approximate regular expression. For the sake of clarity, the standard case of Hamming and Levenshtein distances is first described and illustrated in Section 4 (without any proof), while the general case is addressed in Section 5; finally the link between the proofs of the standard case and of the general case is shown in Subsection 5.4.

Preliminaries
Given a set X, we denote by Card(X) the number of elements in X.
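A finite automaton A is a 5-tuple (Σ, Q, I, F, δ) with:
• Σ the alphabet (a finite set of symbols),
• Q a finite set of states,
• I ⊂ Q the set of initial states,
• F ⊂ Q the set of final states,
• δ ⊂ Q × Σ × Q the set of transitions.

The set δ is equivalent to the function from $Q \times \Sigma$ to $2^Q$ defined by: $q' \in \delta(q, a)$ if and only if $(q, a, q') \in \delta$. The domain of the function δ is extended to $2^Q \times \Sigma^*$ as follows: $\forall P \subset Q$, $\forall a \in \Sigma$, $\forall w \in \Sigma^*$, $\delta(P, \varepsilon) = P$, $\delta(P, a) = \bigcup_{p \in P} \delta(p, a)$ and $\delta(P, a \cdot w) = \delta(\delta(P, a), w)$. The automaton A recognizes the language $L(A) = \{w \in \Sigma^* \mid \delta(I, w) \cap F \neq \emptyset\}$.

A regular expression E over an alphabet Σ is inductively defined by:

$$E = a, \quad E = \emptyset, \quad E = \varepsilon, \quad E = F + G, \quad E = F \cdot G, \quad E = F^*,$$

where a is any symbol in Σ and F and G are any two regular expressions. The language L(E) denoted by E is inductively defined by:

$$L(a) = \{a\}, \quad L(\emptyset) = \emptyset, \quad L(\varepsilon) = \{\varepsilon\}, \quad L(F + G) = L(F) \cup L(G), \quad L(F \cdot G) = L(F) \cdot L(G), \quad L(F^*) = L(F)^*,$$

where a is any symbol in Σ, F and G are any two regular expressions, and for any $L_1, L_2 \subset \Sigma^*$, $L_1 \cdot L_2 = \{w_1 w_2 \mid w_1 \in L_1 \wedge w_2 \in L_2\}$ and $L_1^* = \{w_1 \cdots w_k \mid k \geq 1 \wedge \forall j \in \{1, \ldots, k\}, w_j \in L_1\} \cup \{\varepsilon\}$. A language L is regular if there exists a regular expression E such that L(E) = L. It has been proved by Kleene [10] that a language is regular if and only if it is recognized by a finite automaton.

Given a language L over an alphabet Σ and a word w in $\Sigma^*$, the membership problem is to determine whether w belongs to L. It can be solved by the computation of the boolean r(w, L) defined by: r(w, L) = 1 if w ∈ L, 0 otherwise. The quotient of L w.r.t. a symbol a is the language $a^{-1}(L) = \{w \in \Sigma^* \mid aw \in L\}$. It can be recursively computed as follows, where b is a symbol distinct from a:

$$a^{-1}(\emptyset) = a^{-1}(\{\varepsilon\}) = a^{-1}(\{b\}) = \emptyset, \qquad a^{-1}(\{a\}) = \{\varepsilon\},$$
$$a^{-1}(L_1 \cup L_2) = a^{-1}(L_1) \cup a^{-1}(L_2), \qquad a^{-1}(L_1^*) = a^{-1}(L_1) \cdot L_1^*,$$
$$a^{-1}(L_1 \cdot L_2) = \begin{cases} a^{-1}(L_1) \cdot L_2 \cup a^{-1}(L_2) & \text{if } \varepsilon \in L_1, \\ a^{-1}(L_1) \cdot L_2 & \text{otherwise.} \end{cases}$$

The quotient of L w.r.t. a word w is the language $w^{-1}(L) = \{w' \in \Sigma^* \mid ww' \in L\}$. It can be recursively computed as follows: $\varepsilon^{-1}(L) = L$, $(aw')^{-1}(L) = w'^{-1}(a^{-1}(L))$ with $a \in \Sigma$ and $w' \in \Sigma^+$. The Myhill-Nerode Theorem [13,16] states that a language L is regular if and only if the set of quotients $\{u^{-1}(L) \mid u \in \Sigma^*\}$ is finite.

Since $r(w, L) = r(\varepsilon, w^{-1}(L))$, the membership problem can be solved using the quotient formulae and the following straightforward computation of r(ε, L):

$$r(\varepsilon, \emptyset) = r(\varepsilon, \{a\}) = 0, \quad r(\varepsilon, \{\varepsilon\}) = 1, \quad r(\varepsilon, L_1 \cup L_2) = r(\varepsilon, L_1) \vee r(\varepsilon, L_2), \quad r(\varepsilon, L_1 \cdot L_2) = r(\varepsilon, L_1) \wedge r(\varepsilon, L_2), \quad r(\varepsilon, L_1^*) = 1.$$

The notion of derivative of an expression has been introduced by Brzozowski [3]. The derivative of an expression E w.r.t. a word w is an expression denoting the quotient of L(E) w.r.t. w.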
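The extended transition function and the quotient-based membership test can be sketched as follows. The encoding is an assumption made for illustration: δ is a dictionary from `(state, symbol)` to a set of states, and the quotient is computed on a finite language given as a Python set of strings, which is enough to show how r(w, L) reduces to r(ε, w⁻¹(L)).

```python
def delta_ext(delta, states, word):
    """delta(P, a.w) = delta(delta(P, a), w), with delta(P, epsilon) = P."""
    for a in word:
        states = {t for q in states for t in delta.get((q, a), set())}
    return states


def accepts(initial, finals, delta, word):
    """The word is recognized if delta(I, w) meets the set of final states."""
    return bool(delta_ext(delta, initial, word) & finals)


def quotient(language, a):
    """a^{-1}(L) for a finite language given as a set of strings."""
    return {w[1:] for w in language if w[:1] == a}


def member(word, language):
    """r(w, L) computed as r(epsilon, w^{-1}(L)), one symbol at a time."""
    for a in word:
        language = quotient(language, a)
    return "" in language
```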
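Let E be a regular expression over an alphabet Σ and let a and b be two distinct symbols of Σ. The derivative of E w.r.t. a is the expression $\frac{d}{da}(E)$ inductively computed as follows:

$$\frac{d}{da}(a) = \varepsilon, \qquad \frac{d}{da}(b) = \frac{d}{da}(\varepsilon) = \frac{d}{da}(\emptyset) = \emptyset,$$
$$\frac{d}{da}(F + G) = \frac{d}{da}(F) + \frac{d}{da}(G), \qquad \frac{d}{da}(F^*) = \frac{d}{da}(F) \cdot F^*,$$
$$\frac{d}{da}(F \cdot G) = \begin{cases} \frac{d}{da}(F) \cdot G + \frac{d}{da}(G) & \text{if } \varepsilon \in L(F), \\ \frac{d}{da}(F) \cdot G & \text{otherwise.} \end{cases}$$

The derivative is extended to words of $\Sigma^*$ as follows: $\frac{d}{d\varepsilon}(E) = E$ and $\frac{d}{d(aw)}(E) = \frac{d}{dw}(\frac{d}{da}(E))$. For convenience, we set r(w, E) = r(w, L(E)). Notice that the boolean r(ε, E) can be inductively computed as follows:

$$r(\varepsilon, \emptyset) = r(\varepsilon, a) = 0, \quad r(\varepsilon, \varepsilon) = 1, \quad r(\varepsilon, E_1 + E_2) = r(\varepsilon, E_1) \vee r(\varepsilon, E_2), \quad r(\varepsilon, E_1 \cdot E_2) = r(\varepsilon, E_1) \wedge r(\varepsilon, E_2), \quad r(\varepsilon, E_1^*) = 1.$$

As a consequence, derivation provides a syntactical solution for the membership problem. Notice that the set $D_E$ of derivatives of an expression E is not necessarily finite. It has been proved by Brzozowski [3] that it is sufficient to use the ACI equivalence (that is based on the associativity, the commutativity and the idempotence of the sum of expressions) to obtain a finite set of derivatives: the set $D'_E$ of dissimilar derivatives. Given a class of ACI-equivalent expressions, a unique representative can be obtained after deleting parentheses (associativity), ordering terms of each sum (commutativity) and deleting redundant subexpressions (idempotence). Let $E_{\sim s}$ be the unique representative of the class of the expression E. The set of dissimilar derivatives can be computed as follows:

$$\frac{d'}{d'a}(a) = \varepsilon, \qquad \frac{d'}{d'a}(b) = \frac{d'}{d'a}(\varepsilon) = \frac{d'}{d'a}(\emptyset) = \emptyset,$$
$$\frac{d'}{d'a}(F + G) = \left(\frac{d'}{d'a}(F) + \frac{d'}{d'a}(G)\right)_{\sim s}, \qquad \frac{d'}{d'a}(F^*) = \left(\frac{d'}{d'a}(F) \cdot F^*\right)_{\sim s},$$
$$\frac{d'}{d'a}(F \cdot G) = \begin{cases} \left(\frac{d'}{d'a}(F) \cdot G + \frac{d'}{d'a}(G)\right)_{\sim s} & \text{if } \varepsilon \in L(F), \\ \left(\frac{d'}{d'a}(F) \cdot G\right)_{\sim s} & \text{otherwise.} \end{cases}$$

The dissimilar derivative finite automaton $B'(E) = (\Sigma, Q, \{q_0\}, F, \delta)$ of a regular expression E over an alphabet Σ is defined by: $Q = D'_E$, $q_0 = E$, $F = \{q \in Q \mid r(\varepsilon, q) = 1\}$ and $\delta = \{(q, a, q') \in Q \times \Sigma \times Q \mid q' = \frac{d'}{d'a}(q)\}$. The automaton B′(E) is deterministic and it recognizes the language L(E). Its size can be exponentially larger than the number of symbols of E.

Antimirov's algorithm [1] constructs a finite automaton from a regular expression E. It is based on the partial derivative computation. The partial derivative of a regular expression E w.r.t. a symbol a is the set $\frac{\partial}{\partial a}(E)$ of expressions defined as follows:

$$\frac{\partial}{\partial a}(a) = \{\varepsilon\}, \qquad \frac{\partial}{\partial a}(b) = \frac{\partial}{\partial a}(\varepsilon) = \frac{\partial}{\partial a}(\emptyset) = \emptyset,$$
$$\frac{\partial}{\partial a}(F + G) = \frac{\partial}{\partial a}(F) \cup \frac{\partial}{\partial a}(G), \qquad \frac{\partial}{\partial a}(F^*) = \frac{\partial}{\partial a}(F) \cdot F^*,$$
$$\frac{\partial}{\partial a}(F \cdot G) = \begin{cases} \frac{\partial}{\partial a}(F) \cdot G \cup \frac{\partial}{\partial a}(G) & \text{if } \varepsilon \in L(F), \\ \frac{\partial}{\partial a}(F) \cdot G & \text{otherwise,} \end{cases}$$

where, for a set $\mathcal{E}$ of expressions, $\mathcal{E} \cdot G = \{E' \cdot G \mid E' \in \mathcal{E}\}$. The partial derivative of E is extended to words of $\Sigma^*$ as follows: $\frac{\partial}{\partial \varepsilon}(E) = \{E\}$ and $\frac{\partial}{\partial (aw)}(E) = \bigcup_{E' \in \frac{\partial}{\partial a}(E)} \frac{\partial}{\partial w}(E')$. Every element of the partial derivative of E w.r.t. a word w in $\Sigma^*$ is called a derivated term of E w.r.t. w. The set of the derivated terms of E is the union of the sets of the derivated terms of E w.r.t. w, for all w in $\Sigma^*$.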
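The derivative-based membership test and the construction of the dissimilar derivative automaton can be sketched as follows. Expressions are encoded as nested tuples, an encoding chosen for this example only; the helper `plus` builds ACI-normalized sums (flattened, deduplicated, sorted) and no other reduction is applied.

```python
EMPTY, EPS = ("empty",), ("eps",)


def sym(a): return ("sym", a)
def cat(e, f): return ("cat", e, f)
def star(e): return ("star", e)


def plus(e, f):
    """Sum modulo associativity, commutativity and idempotence."""
    terms = set()
    for x in (e, f):
        terms.update(x[1] if x[0] == "sum" else (x,))
    ordered = tuple(sorted(terms, key=repr))
    return ordered[0] if len(ordered) == 1 else ("sum", ordered)


def nullable(e):
    """The boolean r(epsilon, E)."""
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "sum":
        return any(nullable(x) for x in e[1])
    return nullable(e[1]) and nullable(e[2])      # concatenation


def deriv(e, a):
    """Dissimilar derivative of e w.r.t. the symbol a."""
    tag = e[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if e[1] == a else EMPTY
    if tag == "sum":
        out = deriv(e[1][0], a)
        for x in e[1][1:]:
            out = plus(out, deriv(x, a))
        return out
    if tag == "cat":
        left = cat(deriv(e[1], a), e[2])
        return plus(left, deriv(e[2], a)) if nullable(e[1]) else left
    return cat(deriv(e[1], a), e)                 # star


def matches(e, word):
    for a in word:
        e = deriv(e, a)
    return nullable(e)


def build_dfa(e, alphabet):
    """States are the dissimilar derivatives reachable from e."""
    states, todo, delta = {e}, [e], {}
    while todo:
        q = todo.pop()
        for a in alphabet:
            t = deriv(q, a)
            delta[(q, a)] = t
            if t not in states:
                states.add(t)
                todo.append(t)
    finals = {q for q in states if nullable(q)}
    return states, finals, delta
```

Because only ACI is applied, the automaton obtained this way is correct but usually not minimal; extra reductions such as E + ∅ = E would shrink it.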
Antimirov [1] has shown that the size Card(DT E ) of the set DT E of the derivated terms of E is at most n+1, where n is the number of symbols of E.
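Furthermore, for any word w in $\Sigma^*$, $\bigcup_{E' \in \frac{\partial}{\partial w}(E)} L(E') = w^{-1}(L(E))$. Consequently, the partial derivation provides another syntactical solution for the membership problem as well as a finite automaton computation. Indeed, it can be shown that $r(w, E) = \bigvee_{E' \in \frac{\partial}{\partial w}(E)} r(\varepsilon, E')$. The derivated term finite automaton $A(E) = (\Sigma, Q, \{q_0\}, F, \delta)$ of a regular expression E is defined as follows: $Q = DT_E$, $q_0 = E$, $F = \{q \in Q \mid r(\varepsilon, q) = 1\}$ and $\delta = \{(q, a, q') \in Q \times \Sigma \times Q \mid q' \in \frac{\partial}{\partial a}(q)\}$. The automaton A(E) recognizes the language L(E).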
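The partial derivative computation and the set of derivated terms can be sketched as follows. The tuple encoding and the helper names are assumptions of this example; the concatenation helper identifies ε·F with F, a reduction we apply so that the computed terms stay small.

```python
EPS, EMPTY = ("eps",), ("empty",)


def sym(a): return ("sym", a)
def plus(e, f): return ("sum", e, f)
def star(e): return ("star", e)
def cat(e, f): return f if e == EPS else ("cat", e, f)


def nullable(e):
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "sum":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # concatenation


def pderiv(e, a):
    """Partial derivative of e w.r.t. the symbol a: a set of expressions."""
    tag = e[0]
    if tag in ("eps", "empty"):
        return set()
    if tag == "sym":
        return {EPS} if e[1] == a else set()
    if tag == "sum":
        return pderiv(e[1], a) | pderiv(e[2], a)
    if tag == "cat":
        out = {cat(x, e[2]) for x in pderiv(e[1], a)}
        if nullable(e[1]):
            out |= pderiv(e[2], a)
        return out
    return {cat(x, e) for x in pderiv(e[1], a)}   # star


def derived_terms(e, alphabet):
    """States of the derivated term automaton, e included."""
    terms, todo = {e}, [e]
    while todo:
        q = todo.pop()
        for a in alphabet:
            for t in pderiv(q, a):
                if t not in terms:
                    terms.add(t)
                    todo.append(t)
    return terms


def matches(e, word):
    """Run the (non-deterministic) derivated term automaton on the fly."""
    current = {e}
    for a in word:
        current = {t for q in current for t in pderiv(q, a)}
    return any(nullable(q) for q in current)
```

On the expression (aba + abb)·a·a*, which has 8 symbol occurrences, the number of derivated terms stays within the n + 1 bound recalled above.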
In this paper, we consider the approximate membership problem that is defined as follows: Given a regular expression E over an alphabet Σ, a word w in Σ * , a function F from Σ * × Σ * to N and an integer k, is there a word w ′ in L(E) satisfying F(w, w ′ ) ≤ k ?
In the following, we provide a syntactical solution for the approximate membership problem in the case where the function F satisfies specific properties.

Comparison Functions: Symbols, Sequences and Words
Let Σ be an alphabet, S = Σ ∪ {ε} and X be a subset of S × S. A cost function C over X is a function from X to N satisfying Condition 1: for all α in S, C(α, α) = 0. For any pair (α, β) in S × S such that C(α, β) is not defined, let us set C(α, β) = ⊥. Consequently, a cost function can be viewed as a function from S × S to N ∪ {⊥} satisfying Condition 1. Since we use ⊥ to deal with undefined computation, we set for all x in N ∪ {⊥}, ⊥ + x = x + ⊥ = x − ⊥ = ⊥ − x = ⊥ and for all integers x, y in N, x − y = ⊥ when y > x. A cost function can be represented by a directed and labelled graph whose vertices are the elements of S and whose transitions are the triples (α, C(α, β), β) for (α, β) in S × S. Transitions labelled by ⊥ can be omitted in the graphical representation, as well as the implicit transitions (α, 0, α) (see Example 1).
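The conventions on ⊥ translate directly into code. In the sketch below, `None` plays the role of ⊥, the empty string stands for ε, and the cost table passed to `make_cost` is a hypothetical example, not the cost function of Example 1.

```python
BOT = None  # stands for the undefined value


def add(x, y):
    """x + y, undefined as soon as one operand is undefined."""
    return BOT if x is BOT or y is BOT else x + y


def sub(x, y):
    """x - y, undefined if an operand is undefined or if y > x."""
    if x is BOT or y is BOT or y > x:
        return BOT
    return x - y


def make_cost(table):
    """Cost function from a partial table {(alpha, beta): cost}."""
    def cost(alpha, beta):
        if alpha == beta:
            return 0                     # Condition 1
        return table.get((alpha, beta), BOT)
    return cost
```

A table such as `{("a", "b"): 1, ("", "c"): 0}` yields a non-symmetrical cost function: the pair (a, b) costs 1 whereas (b, a) is undefined.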
Example 1 Let Σ = {a, b, c}. Let C be the cost function defined as follows: The cost function C can be represented by the graph in Figure 1.
Given a positive integer k, we now consider the set $S^k$ of all the sequences $s = (s_1, \ldots, s_k)$ of size k made of elements of S. A sequence comparison function is a function F from $\bigcup_{k>0} S^k \times S^k$ to N ∪ {⊥}. Given a pair (s, s′) of sequences with the same size, F(s, s′) either is an integer or is undefined. In the following we will consider sequence comparison functions F satisfying Condition 2: for any pair (α, β) in S × S, F((α), (β)) = C(α, β), where C is some cost function over S × S, and Condition 3: F is a symbol-wise comparison function, that is, for any two sequences $s = (s_1, \ldots, s_n)$ and $s' = (s'_1, \ldots, s'_n)$, $F(s, s') = \sum_{j \in \{1, \ldots, n\}} F((s_j), (s'_j))$. We consider that those functions satisfy Condition 1, i.e. for all α in S, F((α), (α)) = 0. Consequently, for any pair of sequences $s = (s_1, \ldots, s_k)$ and $s' = (s'_1, \ldots, s'_k)$ such that k > 1, Condition 4 is satisfied: if there exists an integer k′ in {1, . . . , k} such that $s_{k'} = s'_{k'}$, then the k′-th components can be removed from both sequences without changing the value, that is, $F(s, s') = F((s_1, \ldots, s_{k'-1}, s_{k'+1}, \ldots, s_k), (s'_1, \ldots, s'_{k'-1}, s'_{k'+1}, \ldots, s'_k))$.

As a consequence of Condition 3, a symbol-wise sequence comparison function is defined by the images of the pairs of sequences of size 1. Notice that a sequence comparison function is not necessarily symbol-wise, e.g. for a given cost function C, $F((s_1, \ldots, s_n), (s'_1, \ldots, s'_n)) = \sum_{k \in \{1, \ldots, n\}} C(s_k, s'_k)^k$. Two of the most well-known symbol-wise sequence comparison functions are the Hamming one (H) and the Levenshtein one (L), respectively defined for any integer n > 0 and for any pair of sequences $s = (s_1, \ldots, s_n)$ and $s' = (s'_1, \ldots, s'_n)$ by $H(s, s') = \sum_{k \in \{1, \ldots, n\}} H(s_k, s'_k)$ and $L(s, s') = \sum_{k \in \{1, \ldots, n\}} L(s_k, s'_k)$, with H and L the two cost functions respectively defined for all a, b in Σ ∪ {ε} by:

$$H(a, b) = \begin{cases} 0 & \text{if } a = b, \\ 1 & \text{if } a \neq b \text{ and } a, b \in \Sigma, \\ \bot & \text{otherwise,} \end{cases} \qquad L(a, b) = \begin{cases} 0 & \text{if } a = b, \\ 1 & \text{otherwise.} \end{cases}$$

Let us now explain how a word comparison function can be deduced from a sequence comparison function. Let w be a word in $\Sigma^*$ and |w| be its length. The sequence $s = (s_1, \ldots, s_n)$ in $S^n$ is said to be a split-up of w if $s_1 \cdots s_n = w$. The integer n is the size of s.
The set of all the split-ups of size k of a word w is denoted by Split k (w) and the set of all the split-ups of w is denoted by Split(w).
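For a fixed size k, the split-ups of a word can be enumerated by choosing which of the k positions carry the symbols of the word, the other positions being filled with ε. The sketch below follows that idea; the function name `split_ups` and the use of the empty string for ε are choices made for this example.

```python
from itertools import combinations


def split_ups(word, k):
    """All sequences of size k over the alphabet plus epsilon spelling `word`."""
    result = []
    for positions in combinations(range(k), len(word)):
        s = [""] * k                       # "" stands for epsilon
        for pos, symbol in zip(positions, word):
            s[pos] = symbol
        result.append(tuple(s))
    return result
```

There are C(k, |w|) split-ups of size k, and none when k < |w|; the set Split(w) of all split-ups is infinite since k is unbounded.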
Let F be a sequence comparison function, (u, v) be a pair of words of $\Sigma^*$, and k be a positive integer. We consider the following sets:

$$Y_k(u, v) = \{F(u', v') \mid u' \in \mathrm{Split}_k(u) \wedge v' \in \mathrm{Split}_k(v)\}, \qquad Y(u, v) = \bigcup_{k > 0} Y_k(u, v).$$

Definition 1 The word comparison function $\mathbb{F}$ associated with F is the function from $\Sigma^* \times \Sigma^*$ to N ∪ {⊥} that maps any pair (u, v) to the smallest integer in Y(u, v) if there is one, and to ⊥ otherwise.

Notice that a word comparison function is not necessarily symmetrical. Indeed, some problems can be modeled with a non-symmetrical function. For instance, given two words w and w′, can w be obtained from w′ by deleting some letters, i.e. is w a subword of w′? Such a problem can be modeled by the word comparison function $\mathbb{D}$ associated with the symbol-wise comparison function D defined for any pair of sequences of length 1 by: It can be shown that for any two words w and w′ in $\Sigma^*$:

In the case of a sequence comparison function based on a cost function, the whole set N need not be considered. Indeed, according to Condition 4, if u ≠ ε or v ≠ ε, then $Y(u, v) = Y_{|u|+|v|}(u, v)$ and we can write:

$$\mathbb{F}(u, v) = \begin{cases} \min(Y_{|u|+|v|}(u, v)) & \text{if } u \neq \varepsilon \vee v \neq \varepsilon, \\ 0 & \text{otherwise.} \end{cases}$$

The Hamming distance $\mathbb{H}$ and the Levenshtein distance $\mathbb{L}$ are the word comparison functions respectively associated with the sequence comparison functions H and L. Both of them satisfy the properties of word distances (i). Notice that in the following we will handle word comparison functions that are not necessarily distances (see Example 1 for the definition of a non-symmetrical cost function).
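For a symbol-wise function built from a cost function, the minimum over all alignments can be computed by dynamic programming instead of enumerating split-ups, since costs are non-negative and pairs (ε, ε) cost 0. The sketch below assumes that reading; `None` stands for ⊥, the empty string for ε, and the three cost functions are illustrations, the last one being a hypothetical non-symmetrical cost for the subword question, not necessarily the function D of the text.

```python
import math
from functools import lru_cache


def word_cost(u, v, cost):
    """Minimal total cost of an alignment of u and v, or None if undefined."""
    @lru_cache(maxsize=None)
    def best(i, j):
        if i == len(u) and j == len(v):
            return 0
        options = []
        if i < len(u) and j < len(v):
            options.append((cost(u[i], v[j]), best(i + 1, j + 1)))
        if i < len(u):
            options.append((cost(u[i], ""), best(i + 1, j)))
        if j < len(v):
            options.append((cost("", v[j]), best(i, j + 1)))
        totals = [c + rest for c, rest in options if c is not None]
        return min(totals, default=math.inf)

    result = best(0, 0)
    return None if result == math.inf else result


def hamming_cost(alpha, beta):
    if alpha == beta:
        return 0
    return None if "" in (alpha, beta) else 1


def levenshtein_cost(alpha, beta):
    return 0 if alpha == beta else 1


def subword_cost(alpha, beta):
    # Hypothetical: only symbols of the second word may be left unmatched.
    return 0 if alpha == beta or alpha == "" else None
```

With `subword_cost`, the result is 0 exactly when the first word is a subword of the second one and undefined otherwise, which shows a comparison function that is not symmetrical.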
Example 2 Let C be the cost function defined in Example 1. Let s = (s 1 ) and s ′ = (s ′ 1 ) be two sequences of size 1. We define four symbol-wise sequence comparison functions by setting the images of the pairs of sequences of size 1 from the cost function C.
Let us consider the two split-ups t = (a, c, a) and t′ = (c, a, c). According to Figure 2, it holds:

(i) A word distance D is a word comparison function satisfying the three following properties for all x, y, z ∈ $\Sigma^*$: D(x, y) = 0 if and only if x = y; D(x, y) = D(y, x); D(x, z) ≤ D(x, y) + D(y, z).

Any word comparison function can be used as a language operator in order to compute the set of words that are at a bounded distance from some word of a given language.
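Definition 2 Let L be a language over an alphabet Σ, $\mathbb{F}$ a word comparison function and k an element in N ∪ {⊥}. Then:

$$\mathbb{F}_k(L) = \{w \in \Sigma^* \mid \exists w' \in L, \mathbb{F}(w, w') \leq k\}.$$

The operator $\mathbb{F}_k$ is called a similarity operator. Let us notice that $\mathbb{F}_k(\mathbb{F}_{k'}(L))$ is not necessarily equal to $\mathbb{F}_{k+k'}(L)$. Indeed, let us consider the three languages

Definition 3 An approximate regular expression (ii) (ARE) E over an alphabet Σ is inductively defined by:

$$E = a, \quad E = \emptyset, \quad E = \varepsilon, \quad E = F + G, \quad E = F \cdot G, \quad E = F^*, \quad E = \mathbb{F}_k(F),$$

where a is any symbol in Σ, F and G are any two AREs, $\mathbb{F}$ is any symbol-wise word comparison function and k is any element in N ∪ {⊥}.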

Definition 4
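The language denoted by an ARE E is the language L(E) inductively defined by:

$$L(a) = \{a\}, \quad L(\emptyset) = \emptyset, \quad L(\varepsilon) = \{\varepsilon\}, \quad L(F + G) = L(F) \cup L(G), \quad L(F \cdot G) = L(F) \cdot L(G), \quad L(F^*) = L(F)^*, \quad L(\mathbb{F}_k(F)) = \mathbb{F}_k(L(F)),$$

where a is any symbol in Σ, F and G are any two AREs, $\mathbb{F}$ is any symbol-wise word comparison function and k is any element in N ∪ {⊥}.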
In order to prove that the language denoted by an ARE E is regular, we will show how to compute a finite automaton recognizing L(E).
(ii) The fact that any ARE denotes a regular language is proved in Corollary 1.

Hamming and Levenshtein Derivation Formulae
In this section, we extend the derivation formulae to the family of approximate regular expressions where the word comparison functions are the usual Hamming and Levenshtein distances. Notice that the proofs are not given in this section, but will be stated in Section 5.4, deduced from the proof of the general case provided in Section 5.
Let a be a symbol in an alphabet Σ and L be a regular language over Σ. Let k be an integer and $L' = \mathbb{L}_k(L)$. The quotient of L′ w.r.t. a is by definition the set of words w such that there exists a word w′ in L satisfying $\mathbb{L}(aw, w') \leq k$. Consequently, we distinguish the four following cases, according to the way w′ can be split:

Notice that for the Hamming distance, only the first two cases need to be considered since H(α, β) = ⊥ whenever α = ε and β ≠ ε, or α ≠ ε and β = ε.
As a consequence, the following lemma can be stated.
Lemma 1 Let L be a regular language over an alphabet Σ, a be a symbol in Σ and k be an element in N ∪ {⊥}. Then:

In the remainder of this section, we consider restricted AREs that only use Hamming and Levenshtein distances.
Definition 5 Let Σ be an alphabet. A Hamming-Levenshtein Approximate Regular Expression (HLARE) over Σ is an ARE E over Σ satisfying the following condition: For any subexpression G of E such that G = F k (H), either F = H or F = L.

Brzozowski Derivatives for an HLARE
In this subsection, we extend the Brzozowski derivation to the HLAREs. From an HLARE E and a word w, Brzozowski derivation allows us to syntactically compute an HLARE D ′ w (E), called the dissimilar derivative of E w.r.t. w, denoting the language w −1 (L(E)).

Definition 6
Let E be an HLARE over an alphabet Σ. Let a and b be two distinct symbols in Σ and w be a word in Σ * . The dissimilar derivative of E w.r.t. the symbol a (resp. the word w) is the HLARE D ′ a (E) (resp. D ′ w (E)) defined as follows: where E 1 and E 2 are any two HLAREs and k is any element in N ∪ {⊥}.
Lemma 2 Let E be an HLARE over an alphabet Σ. Let w be a word in $\Sigma^*$. Then: $L(D'_w(E)) = w^{-1}(L(E))$.

The next lemma shows that the boolean r(ε, E) is syntactically computable for any HLARE E using dissimilar derivatives.

Lemma 3 Let $E = \mathbb{H}_k(E')$ and $F = \mathbb{L}_k(F')$ be two HLAREs over an alphabet Σ. Then the two following propositions are satisfied:

Given an HLARE E, we denote by $D_{HL}(E)$ the set $\{D'_w(E) \mid w \in \Sigma^*\}$ of the dissimilar derivatives of E.

Lemma 4
The set D HL (E) of dissimilar derivatives of an HLARE E is finite.
From this finite set, one can compute a deterministic finite automaton that recognizes L(E).

Definition 7
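Let E be an HLARE over an alphabet Σ. The tuple B′(E) = (Σ, Q, I, F, δ) is defined by:
• $Q = D_{HL}(E)$,
• $I = \{E\}$,
• $F = \{E' \in Q \mid r(\varepsilon, E') = 1\}$,
• for all $E' \in Q$ and $a \in \Sigma$, $\delta(E', a) = \{D'_a(E')\}$.

Proposition 1 Let E be an HLARE over an alphabet Σ. Then: B′(E) is a deterministic finite automaton that recognizes L(E).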
For any HLARE E, the automaton B′(E) is called the dissimilar derivative finite automaton of E. Example 3 presents the computation of the dissimilar derivative automaton of an HLARE. Example 4 illustrates the computation of the boolean r(w, E) for an HLARE E. Notice that in both examples, the following reductions are used: The dissimilar derivatives of E are the following expressions: The dissimilar derivative automaton of E is given in Figure 3.
Notice that in this case: 1. The word w is split up into s w = (a, b, a, ε); 2. The word w ′ = abaa in L((aba + abb)a(a) * ) can be split up into s w ′ = (a, b, a, a); 3. It holds L(s w , s w ′ ) = 1.
Another split-up is presented in Example 6.
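The derivative-based membership test for expressions with Hamming and Levenshtein operators can be sketched as follows. The derivation rules coded here are our reading of the case analysis on the first aligned pair, namely (a, a), (a, b) with b ≠ a, (a, ε) and (ε, b), and are not copied from the formulae of Definition 6. The tuple encoding, the fixed alphabet `ALPHABET`, the representation of an exhausted budget (k < 0) by ∅ and the reductions applied by the constructors are all assumptions of this example.

```python
ALPHABET = "ab"          # assumed alphabet for the example
EMPTY, EPS = ("empty",), ("eps",)


def sym(a): return ("sym", a)
def star(e): return ("star", e)
def ham(k, e): return EMPTY if k < 0 or e == EMPTY else ("ham", k, e)
def lev(k, e): return EMPTY if k < 0 or e == EMPTY else ("lev", k, e)


def cat(e, f):
    if EMPTY in (e, f):
        return EMPTY
    if e == EPS:
        return f
    return e if f == EPS else ("cat", e, f)


def plus(*es):
    terms = set()
    for x in es:
        terms.update(x[1] if x[0] == "sum" else (x,))
    terms.discard(EMPTY)
    if not terms:
        return EMPTY
    ordered = tuple(sorted(terms, key=repr))
    return ordered[0] if len(ordered) == 1 else ("sum", ordered)


def nullable(e):
    tag = e[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "sum":
        return any(nullable(x) for x in e[1])
    if tag == "cat":
        return nullable(e[1]) and nullable(e[2])
    k, body = e[1], e[2]
    if tag == "ham":                       # only epsilon is comparable to epsilon
        return nullable(body)
    # Levenshtein: epsilon is accepted iff the body has a word of length <= k
    return nullable(body) or any(
        nullable(lev(k - 1, deriv(body, b))) for b in ALPHABET)


def deriv(e, a):
    tag = e[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if e[1] == a else EMPTY
    if tag == "sum":
        return plus(*(deriv(x, a) for x in e[1]))
    if tag == "cat":
        left = cat(deriv(e[1], a), e[2])
        return plus(left, deriv(e[2], a)) if nullable(e[1]) else left
    if tag == "star":
        return cat(deriv(e[1], a), e)
    k, body = e[1], e[2]
    others = [b for b in ALPHABET if b != a]
    if tag == "ham":                       # pairs (a, a) and (a, b)
        return plus(ham(k, deriv(body, a)),
                    *(ham(k - 1, deriv(body, b)) for b in others))
    return plus(lev(k, deriv(body, a)),                          # (a, a)
                *(lev(k - 1, deriv(body, b)) for b in others),   # (a, b)
                lev(k - 1, body),                                # (a, epsilon)
                *(deriv(lev(k - 1, deriv(body, b)), a)           # (epsilon, b)
                  for b in ALPHABET))


def matches(e, word):
    for a in word:
        e = deriv(e, a)
    return nullable(e)
```

The recursive call in the last case terminates because the budget strictly decreases. With the Levenshtein operator and k = 1 applied to (aba + abb)a(a)*, the word aba is accepted, in line with the alignment of aba with abaa discussed above.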

Antimirov Partial Derivatives of an HLARE
In this subsection, we extend the Antimirov derivation to the HLAREs. From an HLARE E and a word w, Antimirov derivation allows us to compute a set ∆ w (E) of HLAREs, called the partial derivative of E w.r.t. w. Any HLARE in ∆ w (E) is called a derivated term of E w.r.t. w. Finally, we state that the union of the languages denoted by the derivated terms in ∆ w (E) is equal to w −1 (L(E)).

Definition 8
Let E be an HLARE over an alphabet Σ. Let a and b be two distinct symbols in Σ and w be a word in $\Sigma^*$. The partial derivative of E w.r.t. the symbol a (resp. the word w) is the set $\Delta_a(E)$ (resp. $\Delta_w(E)$) of HLAREs defined as follows: where $E_1$ and $E_2$ are any two HLAREs and k is an element in N ∪ {⊥} and where for any set $\mathcal{E}$ of HLAREs, for any HLARE F, for any symbol a in Σ,

Lemma 5 Let E be an HLARE over an alphabet Σ. Let w be a word in $\Sigma^*$. Then: $\bigcup_{G \in \Delta_w(E)} L(G) = w^{-1}(L(E))$.

The next lemma shows that the boolean r(ε, E) is syntactically computable for any HLARE E using partial derivation.
Lemma 6 Let $E = \mathbb{H}_k(E')$ and $F = \mathbb{L}_k(F')$ be two HLAREs over an alphabet Σ. Then the two following conditions are satisfied:

Given an HLARE E, we denote by $DT_{HL}(E)$ the set $\bigcup_{w \in \Sigma^*} \Delta_w(E)$ of the derivated terms of E.

Lemma 7 The set DT HL (E) of the derivated terms of an HLARE E is finite.
From this finite set, one can compute a finite automaton that recognizes L(E).

Proposition 2 Let E be an HLARE over an alphabet Σ. Then:
A(E) is a finite automaton that recognizes L(E).
For any HLARE E, the automaton A(E) is the derivated term finite automaton of E. Example 5 presents the computation of the derivated term automaton of an HLARE. Example 6 illustrates the computation of the boolean r(w, E) for an HLARE E. Notice that in both of these examples, the five following reductions are used: Example 5 Let E be the HLARE defined in Example 3. The partial derivatives of E are the following sets of expressions: The derivated term automaton of E is given in Figure 4. Notice that in this case: 1. The word w is split up into s w = (a, b, ε, a); 2. The word w ′ = abaa in L((aba + abb)a(a) * ) can be split up into s w ′ = (a, b, a, a); 3. It holds L(s w , s w ′ ) = 1.
Another split-up is presented in Example 4.

Word Comparison Functions, Quotients and Derivatives
In this section, we address the general case. We present two constructions of an automaton from an ARE using Brzozowski's derivatives and Antimirov's ones, respectively leading to a deterministic automaton and a non-deterministic one. We first show how to compute the quotient of a given language F k (L) w.r.t. a symbol a, where F is a given word comparison function, k is an integer and L is a regular language.

Quotient of a Language
Let F be a word comparison function associated with a symbol-wise sequence comparison function F defined over an alphabet Σ. Let k be an integer, a be a symbol in Σ, u = aw be a word of Σ + , and L ′ be a regular language over Σ. According to Definition 2, the word u is in L = F k (L ′ ) if and only if there exists a word v ∈ L ′ such that F(u, v) ≤ k. According to Definition 1, this is equivalent to the existence of a positive integer n and of an alignment (iv) (u ′ , v ′ ) ∈ Split n (u) × Split n (v) between u and v, the cost F(u ′ , v ′ ) of which is not greater than k.
Moreover, let us set tu ′ 2 · · · u ′ n ; let us similarly set Since F is a symbol-wise word comparison function, there exists an alignment (u ′ , v ′ ) between u and v satisfying F(u ′ , v ′ ) ≤ k if and only if there exists an alignment (u ′′ , v ′′ ) between t and z satisfying ). According to Definition 1, this is equivalent to the existence of a word ). According to Definition 2, it is equivalent to say that the word t is in ). Depending on the value of (u ′ 1 , v ′ 1 ) we can distinguish the following cases: (iv) An alignment between two words u and v is a pair (s, s ′ ) of sequences of same size such that s ∈ Split(u) and s ′ ∈ Split(v).
These three cases provide a recursive expression of the quotient of the language $\mathbb{F}_k(L')$ w.r.t. a symbol a ∈ Σ. Unfortunately, its computation may imply a recursive loop, due to Case 3, when F((ε), (b)) = 0. It is possible to get rid of this loop by precomputing the set of all the quotients of L′ w.r.t. words w such that $\mathbb{F}(\varepsilon, w) = 0$. For this purpose, let us set Notice that if L′ is a regular language, the set of its residuals is finite; as a consequence, so is X(L′).
Lemma 8 Let L = F k (L ′ ) be a language over an alphabet Σ where L ′ is a regular language, F is a symbol-wise word comparison function associated with a sequence comparison function F and a be a symbol in Σ. The quotient of L w.r.t. a is the language a −1 (L) computed as follows:

Brzozowski Derivatives for an ARE
An extension of Brzozowski derivatives can be directly deduced from the computation of the quotient presented in Lemma 8.
Definition 10 Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet Σ where $\mathbb{F}$ is associated with F and a be a symbol in Σ. The dissimilar derivative of E w.r.t. a is the

Let us show that the set of dissimilar derivatives of any ARE E is finite (Lemma 9), that the dissimilar derivative of E w.r.t. a word w denotes the quotient of L(E) w.r.t. w (Lemma 10) and how to determine whether the empty word belongs to the language denoted by E (Lemma 11).
Lemma 9 Let E = F k (E ′ ) be an ARE over an alphabet Σ and D E be the set of dissimilar derivatives of E. Then D E is a finite set of AREs. Moreover, its computation halts.
Proof: Consider F associated with F. Let us show by induction over the structure of E ′ and by recurrence over k that D E is a finite set of AREs.
By induction, the set D E ′ is a finite set of AREs. Consequently, since In order to show that D E is a finite set, let us show that any derivative G of E satisfies the property P(E ′ , k): G is a finite sum of expressions of type F k ′ (G ′ ) with k ′ ≤ k and G ′ a derivative of E ′ .
According to Fact 1, any subexpression (F ))) also satisfies P(E ′ , k). Finally, by recurrence hypothesis, for k ′ < k, any derivative of an expression F k ′ (G ′ ) satisfies P(G ′ , k ′ ). Consequently, any derivative of (F )) satisfies P(E ′ , k). As a consequence, (Fact 2) any derivative of E w.r.t. a symbol a satisfies P(E ′ , k).
Let us show now that if an expression H satisfies P(E ′ , k), then any symbol derivative of H also satisfies P(E ′ , k). Since H is a sum of expressions of type F k ′ (G ′ ) where k ′ ≤ k and G ′ is a derivative of E ′ , any symbol derivative H ′ of H is the sum of the derivatives of the expression H. According to Fact 2, any symbol derivative of an expression F k ′ (G ′ ) satisfies P(G ′ , k ′ ). Since G ′ is a derivative of E ′ and k ′ ≤ k, any expression satisfying P(G ′ , k ′ ) also satisfies P(E ′ , k). As a consequence, any derivative of E w.r.t. a word w in Σ * satisfies P(E ′ , k).
As a conclusion, since any derivative of E is a sum of expressions all belonging to the finite set $\{\mathbb{F}_{k'}(G) \mid k' \leq k \wedge G \in D_{E'}\}$, using the ACI-equivalence, $D_E$ is a finite set of AREs. Moreover, by induction over E′ and by recurrence over k, since any derivative of an expression F in X(E′) belongs to the finite set of derivatives of E′ the computation of which halts, and since F((ε), (b)) ≠ 0 implies that k − F((ε), (b)) < k, the computation of $D_E$ halts. ✷

Lemma 10 Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet Σ and a be a symbol in Σ. Then $L(\frac{d'}{d'a}(E)) = a^{-1}(L(E))$.
Proof: By induction over the structure of E. According to Lemma 8: . As a consequence, there exists a surjection f from X(E ′ ) to X(L(E ′ )) such that for any expression G in X(E ′ ), f(G) = L(G) belongs to X(L(E ′ )). As a consequence: (E ′′ )))) Furthermore, by recurrence over k, for any F((ε), (b)) > 0, it holds: be an ARE over an alphabet Σ and a be a symbol in Σ. Let W F and X(E ′ ) be the sets defined by: . Let us consider the language L ′ defined by: (F )))). Then the two following propositions are equivalent: Furthermore, this equivalence defines a membership test that halts.

Proof:
Let and for any symbol α, β in Σ, let us set $k_{\alpha,\beta} = k − F((\alpha), (\beta))$. Obviously, Furthermore, (a) by induction over E′, the membership test defined by $\varepsilon \in \bigcup_{F \in X(E')} L(F)$ halts; (b) by recurrence over k, since $k_{\varepsilon,b} < k$ when F((ε), (b)) ≠ 0, the membership test defined by: (F )))) halts. ✷

Lemma 9 ensures that the derivative automaton B′(E) of an ARE E, computed from the set $D_E$ of dissimilar derivatives of E following the classical way, is a finite recognizer. Lemma 11 ensures that the set of final states can be computed, since the number of derivatives is finite. Finally, Lemma 10 ensures that the DFA B′(E) recognizes L(E).

Definition 11
Let E be an ARE over an alphabet Σ. The tuple B ′ (E) = (Σ, Q, I, F, δ) is defined by: proposition is satisfied by definition of δ. If w = w ′ a with w ′ ∈ Σ * and a ∈ Σ, by recurrence hypothesis it holds δ(E, As a first consequence, since Card(I) = 1, since δ is a function from Q × Σ * to 2 Q , and since for any pair (q, a) in Q × Σ, Card(δ(q, a)) = 1, then the tuple B ′ (E) is a deterministic automaton. Moreover, For any ARE E, the automaton B ′ (E) is the dissimilar derivative finite automaton of E. Consequently, according to Kleene theorem, we have the following corollary.

Corollary 1
The language denoted by any ARE is regular.

Antimirov Derivatives for an ARE
Partial derivatives are defined by means of sets of expressions instead of expressions and thus lead to the construction of a non-deterministic recognizer. We now extend partial derivatives to the family of AREs. For convenience, for a set $\mathcal{E}$ of expressions, let us set $\mathbb{F}_k(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} \{\mathbb{F}_k(E)\}$ and $L(\mathcal{E}) = \bigcup_{E \in \mathcal{E}} L(E)$.

Definition 12 Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet Σ where $\mathbb{F}$ is associated with F and a be a symbol in Σ. The partial derivative of E w.r.t. a is the set $\frac{\partial}{\partial a}(E)$ computed as follows:

Lemma 12 Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet Σ and a be a symbol in Σ. Then $L(\frac{\partial}{\partial a}(E)) = a^{-1}(L(E))$.

Proof:
By induction over the structure of E.
According to Lemma 8: . As a consequence: and: ))) Furthermore, by recurrence over k, for any F((ε), (b)) > 0, it holds: a −1 (L(E)) = L( ∂ ∂a (E)). ✷ Let DT E be the set of derivated terms of an ARE E, that is the set of the elements of all the partial derivatives of E.
Lemma 13 Let E = F k (E ′ ) be an ARE over an alphabet Σ. Then: . Moreover, the computation of DT E halts.
Proof: Consider that F is associated with F. Let us define the set S(E ′ , k) = k ′ ∈{0,...,k} F k ′ (DT E ′ ). Let us show by induction over the structure of E ′ and by recurrence over k that DT E ⊂ S(E ′ , k). Since X(E ′ ) is a finite set of derivated terms of E ′ , any subexpression of type F k−F ((a),(ε)) (F ) with F ∈ ) is a subset of S(E ′ , k). Finally, by recurrence hypothesis, for k ′ < k, any partial derivative of an expression F k ′ (H) is a subset of S(H, k ′ ). Consequently, any partial derivative of is a subset of S(E ′ , k). As a consequence, (Fact A) any derivated term of E w.r.t. a symbol a belongs to S(E ′ , k).
Furthermore, let us show that if G = F k ′ (H) is an expression that belongs to S(E ′ , k), then any partial derivative of G is a subset of S(E ′ , k). According to Fact A, any partial derivative of an expression F k ′ (H) is a subset of S(H, k ′ ). When H is a derivated term of E ′ and k ′ ≤ k, any expression in S(H, k ′ ) belongs to S(E ′ , k). As a consequence, any derivated term of E belongs to S(E ′ , k).
As a conclusion, $DT_E \subset S(E', k) = \bigcup_{k' \in \{0, \ldots, k\}} \mathbb{F}_{k'}(DT_{E'})$. Moreover, by induction over E′ and by recurrence over k, since any derivated term of an expression F in X(E′) belongs to the finite set of derivated terms of E′ the computation of which halts, and since k − F((ε), (b)) < k when F((ε), (b)) ≠ 0, the computation of $DT_E$ halts. ✷

Let $E = \mathbb{F}_k(E')$ be an ARE over an alphabet Σ. Then $DT_E$ is a finite set of AREs. Furthermore, $\mathrm{Card}(DT_E) \leq \mathrm{Card}(DT_{E'}) \times (k + 1)$.
Proposition 4 Let E be an approximate regular expression. Then: A(E) is a finite automaton that recognizes L(E).

Consequently:
Finally, by induction hypothesis and by recurrence over k, ✷ As a corollary of Proposition 5, the proofs of the lemmas and propositions of Section 4 can be deduced from the corresponding ones of Section 5.

Conclusion
The similarity operators that equip the family of approximate regular expressions make AREs a convenient tool for approximate regular expression matching. The extension of dissimilar derivatives and partial derivatives to the family of AREs allows us to provide a syntactical solution to the approximate membership problem; moreover, in each case the set of derivatives is finite and thus this extension also yields the construction of a recognizer. An additional advantage of similarity operators is that they can be combined with other regular operators, such as intersection and complementation operators [4], in order to produce even smaller expressions.