## Converting a Language Model to a Finite State Transducer

I like the OpenFST toolkit. By converting different types of models into finite state machines (FSM), many algorithms within the toolkit can easily be applied. For instance, I was recently presented with the problem of segmenting a sequence of characters into a sequence of words. This problem requires the original sequence of characters and some method of determining the optimal segmentation.

In my case, I have a language model (LM) that represents the likelihood of any sequence of words. Of all the possible ways to segment a sequence of characters, we choose the segmentation that produces a sequence of words with the greatest likelihood. While writing code to handle this would be relatively straightforward, if we can convert our data into an FSM representation, the OpenFST toolkit can easily handle this problem for us. An added benefit is that it will likely be more efficient than any code that can be quickly written. Just compose the two models and then find the shortest path—a simple sequence of commands issued through the command line.

` fstcompose words.fsa lm.fst | fstshortestpath > words.fst `

The only slightly non-trivial part is the conversion of the language model to a finite state transducer (FST). I have provided a python script for converting an ARPA-format trigram language model to an FST, but I will also briefly discuss the details.

We will consider a simple ARPA-format language model. I assume the reader is already familiar with n-grams and language models.
``` \data\ ngram 1=x ngram 2=y ngram 3=z```

``` ```

```\1-grams: p(<s>) <s> b(<s>) p(</s>) </s> b(</s>) ```
The <s> and </s> are the default sentence start and end symbols respectively. Each $p(.)$ is a probability represented in log base 10. Each $b(.)$ is a backoff weight, also represented in log base 10.

Our approach to creating an FST will assume that each state is a sequence of words that have been seen. A link between any two states is the cost of seeing a particular word given the current state. Before even considering the LM, we instantiate the the FST with four special states.

1. An initial start state. This state will only link to the sentence start symbol, by default <s>. This forces the FST to only accept sequences that start with this symbol.
2. The sentence start symbol.
3. A null history state. This state represents a position where the start symbol has been seen, but we have kept no other history.
4. The sentence end symbol. Also, the only valid final state.

In the python script I create a state for every history sequence that appears in the LM. The integer values for a possible set of states are shown below. Note that the FST maintains no labels or other information about specific states.
``` <s> 1 NULL 2 cat 3 the 4 <s>_the 5 </s> 6 ```
Each entry in the LM actually represents a transition between two states. The first entry in the example LM is a special case, it represents the transition between the initial start state and the sentence start symbol. For our FST, using the integer values for the states given above, this produces the following line.
``` 0 1 <s> <s> p(<s>) ```
For the remainder of the unigrams, each entry produces a line representing a transition from the null history state to the state representing the word in the unigram. For instance, the line in our FST file representing the unigram probability of “cat”.
``` 2 3 cat cat p(cat) ```
This handles the unigram probabilities; we will come back to the backoff weights. The bigram entries are handled in a very similar manner. The bigram for the sequence “<s> the” is represented as
``` 1 5 the the p(the|<s>) ```

Notice that we are not transitioning to the state representing just the word “the”. The state represents the sequence “<s> the”. The extension to higher order n-grams is straightforward. Now it is time to come back to the backoff weights we have ignored thus far. The backoff weights are necessary when a required n-gram does not exist in the LM. It can be thought of as a penalty for erasing the history.

In each case the backoff weight is the penalty to transition to a state with less history. Going back through the original LM, we can start adding transitions for the backoff weights. Looking at the backoff weight for the unigram “the”, we need to transition to a state with less history. This is where our null history state is used. To the FST file we add
``` 4 2 <eps> <eps> b(the) ```
The symbol is the symbol used to represent an epsilon transition. In essence, it means that transitioning between these two states requires no input and produces no output. Adding the backoff transitions for higher order n-grams is similar.

Now we come to the single tricky case (to be honest, I didn’t even see it until I wrote this post). Consider the trigram “the cat barked”; it does not exist in our LM. In terms of the FST, we would be in a state representing the history “the cat”. By paying the backoff penalty associated with “the cat”, we can transition to the state representing the history “cat”. However, we will find that the bigram “cat barked” does not exist either. Missing this case would not crash the entire process—we can always pay a backoff penalty again to transition to the null history state—it just makes the final result slightly incorrect, paying two backoff penalties when we should only pay one.

The correct way to handle this would be to transition directly to the null history state if the future bigram does not exist. The difficulty is that I do not know how this can be incorporated into the FST representing the LM. How can the FST capture the logic of taking a transition if and only if a future state it will lead to does not exist? At the moment, I do not have an exact solution to this problem.

As far as I can tell, there are only two reasonable options. The first is to just ignore the problem. This leads to the issue of paying the backoff penalty twice as previously mentioned. The second option is to always include a transition to the null history state. This will only cause an issue when the unigram probability of a word is greater than its bigram probability. The chance of that happening is dependent on the dataset.

Instead of making the choice for you, I left it as an option in the script (-backoff [once | twice]). The script should work for any unigram, bigram, or trigram language models. Just use the -h option to get a detailed description of the options and use.

If you encounter any issues or discover a better solution to the backoff problem, please let me know. The script ‘convert_lm_to_fst.py’ can be found here.

This entry was posted in Code, Research and tagged , , . Bookmark the permalink.

### 5 Responses to Converting a Language Model to a Finite State Transducer

1. alex zatv says:

Btw, there is the tool by Google to convert LMs to FSA. It is here: http://www.opengrm.org/

• Thanks for pointing this out. I wasn’t aware it existed.

2. guoguo says:

For the P(barked | the cat) example, I guess srilm’s ngram tool will also calculate the backoff penalty twice if none of “the cat barked” and “cat barked” exists?

• If you are referring to ngram-count, then it is not a problem. That tool simply counts the ngram occurrences in the text and builds a language model. The issue of double counting only appears during decoding or rescoring (basically anytime you use the language model). This is not an issue for the decoders used in HTK or Kaldi, just my simple implementation here.