The standard procedure for building a language model is to collect a set of sentences and count the ngrams: the sequences of words of length 1 to n. Once the ngrams are collected, the creation of a language model is straightforward with standard tools (like the SRILM toolkit).
If, instead of a set of sentences, you have one or more lattices, the process is more complicated. This situation can arise in ASR when doing unsupervised training: each untranscribed sentence is decoded as a lattice, and a language model can be built from the lattices. In some languages, the word segmentation is not obvious, and a lattice of all possible segmentations could be used to build a language model. This latter case is the one I had in mind.
Take the sequence ‘ABC’. It can be segmented in four possible ways:

A B C
AB C
A BC
ABC
The entire sequence represents a probability mass of one. Each possible segmentation has a probability of 0.25—assuming that all segmentations are equally likely. In order to count the ngrams, one solution is to count the ngrams in each segmentation and multiply them by the probability of that segmentation. This is reasonable for short sequences, but quickly becomes intractable as the number of possible segmentations increases. If the segmentations are unconstrained, the number of possible segmentations is exponential in the length of the sequence.
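For short sequences, this brute-force scheme is easy to sketch in Python. The helper names below are my own (not from the script mentioned later); every segmentation is enumerated and each of its ngrams is credited with the segmentation's probability:

```python
from collections import Counter

def segmentations(seq):
    """Yield every way of splitting seq into contiguous words."""
    if not seq:
        yield []
        return
    for i in range(1, len(seq) + 1):
        for rest in segmentations(seq[i:]):
            yield [seq[:i]] + rest

def fractional_ngram_counts(seq, order=2):
    """Count the ngrams of every segmentation, weighting each count
    by the segmentation's probability (here: all equally likely)."""
    segs = list(segmentations(seq))
    prob = 1.0 / len(segs)
    counts = Counter()
    for words in segs:
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += prob
    return counts

counts = fractional_ngram_counts("ABC")
```

For ‘ABC’ there are four segmentations of probability 0.25 each, so for example the word ‘A’ (which appears in two of them) gets a fractional count of 0.5.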
Instead, the segmentations can be represented in a type of lattice. For this discussion, a finite state transducer (FST) representation is used. The FST also has the added benefit of easily generating all possible segmentations; simply compose a dictionary of each valid word with the original sequence.
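The effect of that composition can be mimicked in a few lines of plain Python. This sketch (the names are my own, not OpenFST calls) treats each position in the sequence as a state and adds an arc for every substring found in the dictionary:

```python
def build_lattice(seq, lexicon):
    """Arcs of the segmentation lattice: state i is position i in
    seq, with an arc i -> j labelled seq[i:j] whenever that
    substring is a word in the lexicon."""
    arcs = []
    for i in range(len(seq)):
        for j in range(i + 1, len(seq) + 1):
            if seq[i:j] in lexicon:
                arcs.append((i, j, seq[i:j]))
    return arcs

arcs = build_lattice("ABC", {"A", "B", "C", "AB", "BC", "ABC"})
```

With the full dictionary of substrings, this produces exactly the six arcs of the example below.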
Returning to the previous example, the set of four segmentations can be represented in a simple FST format as:
0 1 A
1 2 B
2 3 C
0 2 AB
1 3 BC
0 3 ABC
3

Each number represents a state, and each line with three fields represents a transition between two states. The final line, containing only ‘3’, defines state 3 as the final accepting state.
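A small parser for this text format might look like the following sketch (`parse_fst` is my own helper; the single-number final-state line follows OpenFST's text conventions):

```python
def parse_fst(text):
    """Parse the simple text format: lines with three fields are
    arcs (source, destination, label); a line with a single field
    marks that state as final."""
    arcs, finals = [], []
    for line in text.split("\n"):
        fields = line.split()
        if len(fields) == 3:
            src, dst, label = fields
            arcs.append((int(src), int(dst), label))
        elif len(fields) == 1:
            finals.append(int(fields[0]))
    return arcs, finals

example = """\
0 1 A
1 2 B
2 3 C
0 2 AB
1 3 BC
0 3 ABC
3"""
arcs, finals = parse_fst(example)
```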
Now the problem has been shifted to counting the fractional ngrams in this FST representation. Initially, when the segmentations were all expressly written, counting the ngrams was easy. The issue was that writing each possible segmentation is impractical for long sequences. The FST provides a compact representation of all segmentations, but no obvious method for counting the ngrams.
The ngrams can be computed in two stages. In the first stage, the probability of passing through each particular state is computed. The second stage is to traverse the the FST and count the ngrams.
Start from the final state. If there is more than one final state, add an additional state and a transition from each final state to the new state. To determine the probability of passing through a particular state, examine its parent states: divide each parent's probability by its number of children, and sum the resulting values over all parents to get the state's occupancy probability. This can be computed recursively until reaching the start state, whose probability is 1. In pseudocode, the algorithm is:
Let fst be the lattice, with parent and child links for each state.

def occupancy_probability(fst, state):
    if len(fst[state].parents) == 0:  # the start state
        fst[state].score = 1.0
        return fst[state].score
    score = 0
    for s in fst[state].parents:
        score += occupancy_probability(fst, s) / len(fst[s].children)
    fst[state].score = score
    return score

occupancy_probability(fst, final_state)
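The same recursion can be made runnable on a plain arc list. In this sketch (the helper names are my own) the parent and child links are derived from the arcs, and each arc leaving a state is assumed equally likely:

```python
def occupancy_probabilities(arcs, start=0):
    """Occupancy probability of every state, assuming the arcs
    leaving a state are equally likely."""
    parents, children = {}, {}
    states = set()
    for src, dst, _label in arcs:
        parents.setdefault(dst, []).append(src)
        children.setdefault(src, []).append(dst)
        states.update((src, dst))

    occ = {}
    def prob(state):
        if state in occ:
            return occ[state]
        if state == start:
            occ[state] = 1.0
        else:
            # Each parent splits its mass evenly among its children.
            occ[state] = sum(prob(p) / len(children[p])
                             for p in parents[state])
        return occ[state]

    for s in states:
        prob(s)
    return occ

# Arcs of the 'ABC' example: (source, destination, word).
arcs = [(0, 1, "A"), (1, 2, "B"), (2, 3, "C"),
        (0, 2, "AB"), (1, 3, "BC"), (0, 3, "ABC")]
occ = occupancy_probabilities(arcs)
```

Memoizing the results in `occ` keeps the recursion linear in the number of arcs, and the final state's occupancy comes out to 1, as it should, since every path ends there.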
Once the occupancy probability has been set for each state, counting the ngrams is simple. Traverse the FST and count all possible ngrams.
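The second stage can be sketched as follows (again with my own helper names): start an ngram at every state, weight it by that state's occupancy probability, and split the mass evenly across the outgoing arcs at each step. The occupancy values below are the ones computed for the ‘ABC’ example under the equally-likely-arcs assumption:

```python
from collections import Counter

def fractional_ngrams(arcs, occ, order=2):
    """Fractional ngram counts from a segmentation lattice: an ngram
    starting at a state inherits that state's occupancy probability,
    divided by the fan-out at every arc along the way."""
    out = {}
    for arc in arcs:
        out.setdefault(arc[0], []).append(arc)

    counts = Counter()
    def extend(state, words, mass, remaining):
        if words:
            counts[tuple(words)] += mass
        if remaining == 0 or state not in out:
            return
        for _src, dst, label in out[state]:
            extend(dst, words + [label],
                   mass / len(out[state]), remaining - 1)

    for state, p in occ.items():
        extend(state, [], p, order)
    return counts

arcs = [(0, 1, "A"), (1, 2, "B"), (2, 3, "C"),
        (0, 2, "AB"), (1, 3, "BC"), (0, 3, "ABC")]
occ = {0: 1.0, 1: 1 / 3, 2: 0.5, 3: 1.0}  # from the first stage
counts = fractional_ngrams(arcs, occ, order=2)
```

Each ngram occurrence is identified by its starting state and arc path, so nothing is counted twice.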
I have a simple Python script to count the ngrams in a lattice represented in the OpenFST format. Run the script with the ‘-h’ option to see the available options. The script should work for any ngram order and will produce a list of fractional counts for each ngram.