One Minute to Learn Programming: Finite Automata

Today, we're gonna talk about an incredibly simple programming system that has a ton of real-world applications: Finite State Machines (FSMs). You're gonna see them a lot in both theory and systems courses, so this is a very important lecture.

(By the way - for those of you who've learned about FSMs before; everything in this lecture refers to Deterministic Finite Automata (DFAs). We use the terms FSM and DFA interchangeably.)

FSMs take a string as input and output either YES or NO. By a string, we mean a finite sequence of characters, such as abaca or 10011101. The below diagram represents a typical FSM. The circles represent states, the double-circles represent accepting states, and the arrows represent transitions; the initial or starting state is indicated by the external arrow:

Let's say we fed the string 0111 to the above automata. What would happen?

  1. We start at the leftmost circle, or state (this is indicated by the external arrow).
  2. The first character is a 0. The transition arrow for that state says to stay at the same state.
  3. The second character is a 1. We move to the top state in accordance with the transition.
  4. The third character is a 1. We move to the right state (actually, we'd have done that even if it was a 0).
  5. The fourth character is a 1. We stay at the same state (right).
  6. Since we ended up at an accepting state (states with two circles are accepting states), we respond YES.

On the other hand, 001011110100 is not accepted. Why?

FSMs have two major components: states and transitions. A state stores information about the past. In other words, it reflects the input changes from the system start to the present moment. A transition indicates a state change and is described by a condition that would need to be fulfilled to enable the transition. A defining characteristic of an FSM is that the next state of the machine is completely determined by the current state and the current input.

Another way to think about finite automata is in terms of programs without any variables. All such a program can do is look at each incoming character to determine what line to go to, and eventually return true or false (depending on whether it thinks the string matches or doesn't). Nonetheless, such simple machines can display complex behavior.

Formal Definition

We define a finite state machine in terms of a finite set of states, a specific start state, a set of accepting states (a subset of state sets), a finite alphabet, and state transition instructions:

  • Finite set of states .
  • A finite alphabet . In the example DFA above, the alphabet is .
  • A start state .
  • A set of accepting states .
  • State transition function
    where means that when the machine is in state , it will transition to given input character . That is, we have a transition which looks like this: . Since is a function, transitions must be defined for every state and alphabet symbol combination. (This is not to say that DFAs cannot "crash," but any DFA which crashes can be made into one which doesn't crash by adding a single "garbage" state.)

Formally, let be a FSM. starts in state . In operation, machine reads one letter at a time from the input string (going from left to right). If is in state and reads some letter , moves to state (if is undefined, then the machine stops/crashes).

  • accepts the string if, after finishes reading , it ends in an accepting state.
  • rejects the string if, after finishes reading , it ends in a non-accepting state.

The set of inputs accepted by is: is called the language of the machine M. ( is defined as the set of all strings over the alphabet .)

Examples

What are the languages accepted by these machine?

In this case, the initial state (arrow symbol) is also the accepting state (double circle). The arrow represents the transition a, b. Any finite string of a's and b's ends up in the accepting state. So, the language (all finite strings of a's and b's.)

The string aa doesn't end up in an accepting state. Neither does abba. However, the strings a and aba do. In general, only strings that have an odd number of characters end up in the accepting state. Strings of even length are rejected. Hence, the language accepted by this machine .

In this case, the strings a, aaba, and bbba don't work, whereas ab, aabb, and bbbab do. .

.

Designing FSMs

(Most of the below content is from Sipser's Introduction to the Theory of Computation.)

We can also do the reverse: try to find a machine that accepts a given language. For example, constructing a machine for the language L = {all strings in {a,b}* with at least one a}.

In general, constructing a FSM is more a creative process than simple formula. However, the following approach might be somewhat helpful: put yourself in the place of the machine you are trying to design and then see how you would go about performing the machine's task.

Suppose that you are given some language and want to design a FSM that recognizes it. Pretending to be the automaton, you receive an input string and must determine whether it is a member of the language the automaton is supposed to recognize. You get to see the symbols in the string one by one. After each symbol, you must decide whether the string seen so far is in the language. The reason is that you, like the machine, don't know when the end of the string will come, so you must always be ready with the answer.

Also, you need to decide what you need to remember about the string as you are reading it. Remember, you're designing a finite state machine, and thus you cannot remember all that you have seen to date (the input may be thousands and thousands of characters, after all.) Fortunately, you only need to remember the crucial information about the input to date - what that information is varies from language to language.

Here are a couple of languages. What might we want to keep track of?

  • L = {all strings in {0,1}* with an odd number of 1's}. We just need to keep track of whether the number of 1s we've read to date is odd or even.
  • L = {all strings in {a,b}* that contain at least one a }. We just need a flag that becomes true if we've seen an a.

Can you design FSMs for the above? Here are a few more exercises:

  • L = {all strings in {a,b}* with an even number of ab pairs}.
  • L = {all strings in {a,b}* containing ababb as a consecutive substring}.

Here's a harder one, but a good test:

  • L = {all strings in {a,b}* that end in abb }.

In general, finite automata act as detectors, each one picking up a specified pattern or sequence. These are pattern-seeking machines.

A good strategy when designing FSMs is to come up with "names" for your states. For example, suppose that you wanted to build a machine which accepts a string from {0,1}* if and only if it is of length 1 modulo 3. You could do this in three states where your states could be named "string so far is of length 0 mod 3," "string so far is of length 1 mod 3," and "string so far is of length 2 mod 3." Whenever you are told to construct a machine for a given language, think about all the intermediate steps necessary to verify that a string will be accepted.

Languages

Let's take a quick diversion from FSMs and talk about the concept of a language. Any is defined to be a language. Basically, is just a set of strings (it is called a language for historical reasons.) Some examples of languages:

  • The set of all strings ending in a b where .
  • The set of all strings of the form : some number of a's followed by an equal number of b's where .
  • The set of all palindromes where .
  • The set of all graphs (encoded in binary) with a Hamiltonian tour .

is called a regular language if there is some finite automaton that accepts . In this chapter, we have already seen many regular languages, including , strings of even length, strings containing ababb, and others. In the list above, only the top example is regular. Let's try to get a grip on which languages are regular.

Theorem. Any finite language is regular.

Proof. Make a machine with a "path" for each string in the language. Basically, we can pick a tree to generate all strings in our finite `set (with an accepting state at the end of each string path.)

Here's an example with (Note: I'm being slightly lazy in the transitions - see exercise below)

Note that if we have an infinite list of words, the technique will not work.

Exercise
The above FSM is actually not complete - we don't show every possible transition. Certain classes you take will allow this: if a transition is undefined, the FSM will crash (obviously the input will not be accepted if that happens.) It doesn't actually matter if you allow this: as an exercise, explain how you can transform an FSM like the above into one without undefined transitions.

Theorem. The union of two regular languages is regular (we assume that they have the same alphabet)

(The union of two languages is the set of all strings present in either language.) How can we prove this? Here's what we'll do - since and are regular, there exist FSMs that accept them. Let's construct a new FSM based on that accepts the union of the two languages. This new FSM will just simulate and simultaneously, seeing whether either accept at the end.

Proof. Let , be FSMs that accept respectively. We define a new FSM to accept .

  • The states of are just the pairs of states from .
  • The alphabet is identical.
  • The transition function : on input state and character , output .
  • The initial state is just the state corresponding to .
  • The set of accept states just corresponds to .

We claim that this machine accepts . Consider a member . The construction is symmetric, so we assume that . Imagine running and simultaneously on . We have constructed the DFA such that at any point during the process, is in some state iff is in some tuple state of the form (i.e. the left part of the tuple will match the state is in. Also, since , must end up in an accepting state. However, our construction guarantees that if is an accepting state for , all states of the form are accepting states for . Therefore accepts this string.

We also need to show that if accepts a string, then that string must be in or . Let be the accepting state ends up in after reading . Our construction guarantees that either was an accepting state for or was an accepting state for . WLOG, assume the former. However, we know that would only be in some state after reading some word if were in state after reading , so will accept .

In general, regular languages are closed under the following operations.

  • Union: .
  • Intersection: .
  • Reverse: .
  • Negation: .
  • Concatenation: .
  • Star: .

Nonregular languages

How powerful are finite automata? In other words, are all languages regular?

Consider the language (remember that is just the symbol which denotes the empty string). In other words, we have a bunch of a's followed by an equal number of b's. No finite automaton accepts this language. Can we prove this?

Well, we could try saying that no machine has enough states to keep track of the number of a's it might encounter. But this alone is a fairly weak argument.

To see why this is, consider the following example: L = {strings where the number of occurrences of the pattern ab is equal to the number of occurrences of the pattern ba }. By the above argument this can't be regular: no machine has enough states to keep track of the number of occurrences of ab. But, as the following diagram shows, this language is regular!

The ABA automaton shown above accepts only the strings with an equal number of ab occurrences and ba occurrences. For example, it accepts aba, aabba, and , but rejects baaa. So, we can compare two quantities without counting.

We need a professional-strength proof that is not regular. Here are two very handy principles to keep in mind when developing a proof.

  • Pigeonhole principle: Given boxes and objects, at least one box must contain more than one object.

  • Letterbox principle: If the average number of letters per box is , then some box will have at least letters. (Similarly, some box has at most .)

Theorem. The language is not regular.

Proof. Assume that the language is regular. Then there is some machine with states that accepts it. For each , let be the state of after reading . By the pigeonhole principle, there exists some such that , but . Therefore, the machine will erroneously accept , which is a contradiction. Thus, there is no such machine and the language is not regular.

Again, make sure you understand this argument - its slightly tricky! If we assume that this language is regular, then there is some automata with states that accepts it. Let's try feeding to this machine, keeping track of the states that it visits. There are only states, so while its still just reading a's, the machine will have to be in the same state twice: let's say that its in state after reading a's and also in state after reading a's. This means, though, that if the machine accepts and (which it's supposed to), then it will also accept and - this is a contradiction! Therefore, the language is not regular.

Intuitively, we can say that finite automata can't count!

There's tons more to learn about finite automata. Much of that material is covered in advanced courses, such as 15-354: Computational Discrete Math or 15-453: Formal Languages, Automata, and Computation (FLAC).

Extra Remark. (You are not responsible for knowing this.) Since is a countably infinite set, the total number of languages is uncountable. However, there are only countably many FSMs and therefore, the number of regular languages is countable. We may therefore conclude that there are uncountably many languages which are not regular.

Application: String Matching

The Unix operating system has a command called grep. The grep command allows you to search for a specified pattern in one or more files, printing any lines which contain the specified set of characters. grep uses regular expressions to specify the search pattern and a finite automaton to implement the search itself.

Given input text of length and a string of length : does appear inside the text ?

Naively, we can take string and compare it to the text , looking for a match as we step through the text reading it character by character. In computational terms, we would need to perform on the order of comparisons: Cost: comparisons. This is fairly wasteful - we aren't taking advantage of the information we're learning while comparing (consider trying to find aaaab in aaaaaaaaab.)

The automata solution is to build a special-purpose machine to find a given pattern that accepts any string with as a consecutive substring. We then feed the text to M. In this case, the cost is comparisons plus the time to build M. It solves the problem reading the characters only once. As it happens, we can use the so-called Knuth-Morris-Pratt algorithm to build M quickly. Moreover, it turns out that looking for a given string can actually be done, even in the worst case, with fewer than t comparisons. We don't have to look at every symbol. We can save time by reading only a portion of the text.

15-211 covers the implementation of this algorithm, so we'll pass on doing it in this class.

The grep command or utility in Unix is just one of many real-world applications of finite state machines. Such mechanisms underlie or govern vending machines, fridge thermostats, elevators, train track switches, and lexical analyzers for parsers. In several of these instances, more complex approaches have replaced finite state machines as the underlying controllers. For example, a computer chip and a little Java program instead of a finite state machine now typically handles change in a vending machine. Similarly, control logic governs thermostats.

One place where finite state machines are still used is in train track switches, mainly because they are simple and can be analyzed mathematically. We can prove theorems about them. So the design is verifiable to ensure that the devices operate safely.

References

Introduction to the Theory of Computation, first edition. Chapter 1.1

Editor Notes

Diagrams were created using Microsoft Powerpoint 2007.

Topic revision: r6 - 2008-08-26 - 00:39:55 - AnupamGupta
 
This site is powered by the TWiki collaboration platformCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback