Regular expressions, automata and monoids

Published

September 3, 2013

In formal language theory the task is to specify, over some given alphabet, a set of valid strings. This is useful in searching for structures textual data through files (e.g. via grep), for specifying the syntactic structure of programming languages (e.g. in Bison or pandoc), and for generating output of a specified form (e.g. automatic computer science and mathematics paper generators).

An automoton is (roughly) a set of symbols, and a set of states, along with transitions for each state that take a symbol and return another state. They can be used to model (and verify) simple processes.

Automata can be brought into correspondence with formal languages in a very natural way; given an initial state s, and a sequence of symbols (a1, a2, …, an) the automata has a naturally assigned state (… ((s a1) a2) … an) (where “(state symbol)” represents the state obtained from the transition on symbol using state). Then if we nominate an initial state, and a set of “accepting” valid states, we say a string is in the language of the automata if and only if when applied to the initial state it ends in a final state.

This gives a very useful pairing in computer science; formal languages are useful tools, and automata (often) give an efficient way to implement them on a computer.

To get a little more mathematical a semigroup is a a closed associative binary operation; if we add a two sided identity it is called a monoid, and if we additionally add inverses it becomes a group. For instance under addition the set of positive numbers (1, 2, …) is a semigroup, the set of non-negative numbers is a monoid with identity 0, and the set of integers is a group with -a being the inverse of a. Clearly every group is a monoid (forgetting about the inverses) and every monoid is a semigroup (forgetting about the identity).

In the same way a group often arises as a set of invertible transformations (isomorphisms), a monoid often arises as a set of transformations (morphisms). Another useful example of a monoid is sets under union, with 0 corresponding to the empty set.

The free monoid generated by a set S, denoted S*, is the set of all (finite or infinite) sequences of elements of S with multiplication defined as concatenation.

For example {x}* has set \(\{\epsilon, x, xx, xxx, xxxx, \ldots\} = \{x^0, x^1, x^2, x^3, x^4, \ldots \}\) (where \(\epsilon\) represents the sequence with no elements), and multiplication is given by \(x^n x^m = x^{n+m}\) . As a further example some elements of {x, y}* are \(\epsilon\) , x, y, xx, xy, yx, yy, xxx, xxy, xyx, yxx, xyy, yxy, yyx, yyy, …

In computer science terms the free monoid generated by the set S is precisely the set of all strings (or words) in the alphabet S, with the monoid product corresponding to string concatenation.

Using this notation a language over a finite set S is a subset of S*; that is an element of the power set \(2^{S*}\) . More generally we can define a language over a monoid, M, as an element of the power set of M.

There is a natural product on the power set of a monoid; \(AB \equiv \{ab | a \in A, b \in B\}\) , and so it too is a monoid with identity \(\{\epsilon\}\) . There is another natural monoidal structure on any collection of subsets; union with the additive identity of the empty set. Notice that \(A (B \cup C) = (AB) \cup (AC)\) and similarly \((A \cup B) C = (AC) \cup (BC)\) , and \(\emptyset A = \emptyset = A \emptyset\) . Consequently the power set of a monoid naturally has the structure of a semiring.

Given a subset S of a monoid M, denote S* (the Kleene star) to be its monoidal closure; the smallest submonoid of M containing S. The regular expressions over a set (alphabet) Σ is defined to be the set generated by the elements \(\{\epsilon\}, \{a\}, a \in \Sigma\) using the Kleene algebra formed by the semiring \(2^M\) with the Kleene star.

On the other hand a deterministic finite automoton (DFA) over a set (alphabet) Σ is a finite set of states S, an initial state s in S, a subset F of accepting states of S, and a transition map \(t : S \times \Sigma \to S\) . The language of a DFA is the set of all strings (a b … x) of symbols in Σ such that \(t( \cdots t( t(s, a), b) \cdots , x) \in F\) . Often a DFA is represented diagrammatically using circles to represent states, and labelled arrows to represent the transitions between states [this looks rather like a category theory diagram]. The initial state is denoted by a horizontal arrow pointing to it, and the final states are represented by a double circle.

An example DFA Diagram

In the example above the alphabet is {0, 1} the states are {S1, S2}, the initial state is S1, and the final states are {S1}, the transitions are t(S1, 0) = S2, t(S1, 1) = S1, t(S2, 1) = S2, t(S2, 0) = S1.

Often the transitions are represented as a table with states listed vertically and transitions listed horizontally e.g.

0 1
S1 S2 S1
S2 S1 S2

More algebraically we can consider the transition to be the monoidal action; since the elements of Σ generate Σ* freely, the transition extends uniquely to a function \(T : S \times \Sigma* \to S\) such that T(t, xy) = T(T(t, x), y) for any state t and elements of Σ* x and y.

Rephrasing and generalising slightly, a DFA over a monoid M is a set of states S, a (contravariant) monoid homomorphism \(\rho : M \to \hbox{Map}(S)\) (where Map(S) represents all functions from S to S; i.e. \(\rho(xy) = \rho(y) \rho(x)\) ), an initial state s from S, and a subset F of accepting states in S. Then the language of a DFA is precisely \(L = \{ x \in M | \rho(x) \circ s \in F\}\) .

Theorem: The regular languages are equivalent to the languages representable by a DFA.

This theorem can be proved as follows: a DFA is inductively transformed into a regular expression by transforming the DFA that can only pass through an increasingly large subset of states. A regular expression is transformed into a nondeterministic finite automoton, which is in turn transformed into a DFA.

A nondeterministic finite automaton (NFA) over a set Σ is a set S, an initial state s, a set of accepting states F and a transition map t \(S \times \Sigma \to 2^S\) . Then a string (a b … c) is in the language of the NFA if and only if there is some \(s_1 \in t(s, a), s_2 \in t(s_1, b), ... s_n \in t(s_{n-1}, c)\) such that \(s_n \in F\) .

Sometimes epsilon transitions are allowed; that is transitions that take no input so t \(S \times \Sigma \cup \{\epsilon\} \to 2^S\) , and we allow arbitrary epsilon insertions in the string. As with DFAs, NFAs can be represented diagrammatically and implemented efficiently as a table (though in this case we need to trace every possible path of a transition).

These can be extended more algebraically as follows: a NFA over a monoid M is a set S, a set of initial states I, a set of accepting states F, and a monoid/semigroup homomorphism \(\rho : M \to 2^S\) . The language of an NFA is \(\{x \in M | \rho(x) \circ I \cap F \neq \emptyset \}\) . (The previous definition is simply M = Σ*, I = {s} and \(\rho(x) \circ A = \bigcup_{a \in A} t(x, a)\) ).

The monoidal case corresponds to no epsilon transitions, and the semigroup case allows monoidal transitions (for then \(\rho(\epsilon)\) need not be the identity). To promote a semigroup homomorphism ρ to a monoidal homomorphism η we simply define \(\eta(x) = \rho(x) \rho(\epsilon)^*\) (considering \(2^S\) as a Kleene algebra in the obvious way).

It is almost trivial to represent a given NFA as a DFA; we take the set of the DFA to be \(2^S\) , the initial state to be S, the final states to be any state intersecting F, and the transition function to be the same. This is the so called power-set construction.

So DFA=NFA=Regular Languages.