Converting a set of Strings to a Finite State Automata

This is a follow up to my post about converting a language model to a finite state transducer. Someone asked about the other necessary scripts for the task I was working on, so I am adding a short post describing converting a text file of sentences to a finite state automata (FSA).

My assumption is that each line in the text file contains a sequence of words. Each line is equally valid and each one represents a possible path. This means that the final acceptor will accept any of the strings in the file. Consider the example file below.

the dog barked
the cat meowed

Each word becomes an arc between two states. As in the previous post, we add start and end symbols. The default is <s> and </s>, but they can be changed using the command line options. The final FSA would be

0 1 <s>
1 2 the
2 3 dog
3 4 barked
4 5 </s>
0 6 <s>
6 7 the
7 8 cat
8 9 meowed
9 10 </s>

The final two numbers represent valid final states. Note that the resulting FSA could be simplified, but it is not necessary to handle in the script; the OpenFST command fstminimize can take care of that, if necessary.

Detailed information about the parameters and usage can be found by running the script with the option -h.

This entry was posted in Code and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s