## Cross-Word Sub-Word Units for Low Resource Keyword Spotting

I presented this paper at the 2014 Workshop on Spoken Language Technology for Under-Resourced Languages (SLTU) [1]. The work was done with Lori Lamel and Jean-Luc Gauvain. It was supported by the Quaero project the IARPA Babel project.

Keyword spotting is the task of detecting specific words or sequences of words in audio. There seems to be a resurgence in the popularity of the task thanks to the IARPA Babel project. The Babel project is focused on a particular type of keyword spotting where searching for words must happen after the audio has been processed. The standard approach is to generate lattices and then search the lattice once the keywords arrive.

Given this setup, it is important to generate lattices such that you can accurately detect as many keywords as possible. If the keywords are not in-vocabulary, then this becomes impossible—at least to find the keywords by searching for exact matches. One solution is to use a more sophisticated approach to searching for the keywords—either use similar words as proxies or use some type of phonetic confusion model. In our work, we instead attempt to minimize the number of out-of-vocabulary words by generating lattices using sub words.

Subword units for keyword spotting are used by many systems in the literature. However, the subword units are almost always single characters or phones. We use longer subword units in this work. As implied by the title, we also explore subword units that cross word boundaries. This work was a preliminary exploration using the Kaldi speech recognition toolkit. We have since improved the results with better features and a more intelligent search strategy.

Our results—and more recent work—show that no particular type of subword unit appears to consistently work better than the other units. However, using multiple subword unit types provide complimentary information and combine quite well. Most of the subword units are character n-grams. For a given n (3,5, or 7 in this work), we find all possible sequences of characters of length n in the training script. Based on this set of subwords, we build a unigram langauge model. The training data is then segmented based on the language model. This is equivalent to selecting the segmentation that minimizes the number of segments in the data. Once the data has been initially segmented, a new trigram language model is built. Using the larger language model, the training data is segmented again. This process is repeated until convergence.

For each subword unit type—a total of 7 types are used in this work—a language model and pronunciation dictionary are produced. The data is decoded separately using each subword unit type using the Kaldi speech recognition toolkit. Keyword spotting is performed separately on each system and the results are combined.

Combining multiple systems significantly improves the performance of both in-vocabulary and out-of-vocabulary keywords. The downside to combining multiple systems is the increased computational cost. Our future work will focus on obtaining the performance of combining multiple systems , but requiring only a single decoding of the data.

[1] W. Hartmann, L. Lamel, and J.-L. Gauvain, “Cross-Word Sub-Word Units for Low Resource Keyword Spotting,” Proceedings of SLTU, pp. 112-117, 2013. (Preprint, postprint not yet available)

Posted in Paper, Research | | 1 Comment

## Google’s 1st European Doctoral Workshop on Speech Technology

Google recently hosted a two day workshop (April 29-30, 2014) at their London office focusing on both speech recognition and speech synthesis. They invited a total of 68 students and post-docs from across Europe. The attendees had a wide range of experience—I met someone just two months into their Phd and others who have been working in the field for years with dozens of publications. No proceedings were associated with the workshop, but each of the attendees were asked to present a poster of their work. Most seemed to present a poster from a previous conference (I used the poster from my recent ASRU paper).

Overall, I had a great time at the workshop, and I think most attendees would say the same thing. If Google demonstrated one thing, it is that they take very good care of their guests and employees. We were well fed and stayed in a nice hotel in the center of London. We spent three days in London, discussing state-of-the-art research and meeting new people, without spending a cent of our own money.

As the workshop began, I could sense the uncertainty from the other attendees. None of us really knew what the goal of the workshop would be. Some assumed it would be two days of hearing how great Google is and we would be handed job applications as we walked out the door. In truth, it was nothing like that.

The majority of the workshop consisted of Google researchers giving talks, not necessarily describing how great the Google product family is, but their personal research. Google is known by their products. Nearly every attendee uses Gmail, Google search, and Scholar. Half of us have smartphones running Android. I think they really wanted to make the point that, yes, Google is a company that makes products, but the research scientists are doing the cutting edge research that makes those products possible.

Obviously the Google employees take great pride in the work they do. Unlike in academia, they have the benefit of seeing their ideas work, not just on toy datasets, but in real world technologies used by millions of people. Often the papers they publish are about techniques that have a real world impact. This also explains why they sometimes have difficulty using research done in academia. If I have a technique that works well on a small dataset, there is little evidence that it would scale to a live system trained on thousands of hours of speech. Since academic labs do not have the resources of Google, it is difficult to prove the efficacy of a technique to the level desired by a large company like Google, Microsoft, or Apple.

While all of the research talks were interesting, it is the anecdotes and trivia that really stick in my mind. Below are some of the more interesting details I remember.

Their typical ASR system is a hybrid DNN-HMM system. Input are 26 frames of mel-scale filter bank features. Typically systems from other labs use a window with an equal amount of before and after context. Since Google focuses on real-time performance, they only allow five frames of future context. This behemoth of a system uses 8 hidden layers of 2560 units each, giving a total of 85 million parameters. Approximately 2,000 hours of speech are used to train the system.

There is a large focus on optimization. They use lots of tricks to reduce the total amount of computation. Surprisingly, they even have an offline system in Android that uses about 3 million parameters. Even with that huge reduction in parameters, they only lose about 30% in relative accuracy.

General Search
The web consists of approximately 60 trillion unique addresses. Google indexes around 10 billion of those per day and maintains an index that takes 100 petabytes of disk space. Everyday Google handles 3 billion searches. One amazing fact is that around 15% of those searches have never been seen before.

While I have no idea what the implementation details are, Google maintains a knowledge graph of over 500 million entries and 18 billion facts. This represents discrete knowledge about the world, such as celebrities, sports teams, and countries. This helps for disambiguation and better understanding of search queries.

Obviously their research is heavily tied to their current end goal, improved voice search. Within that space, there is apparently a lot of freedom. General expectations are that your project should bear fruit within a year. You can play with toy datasets, but you quickly need to move to real-world experiments. They tend to use their Icelandic system as an initial baseline for testing new ideas.

Rarely do they work towards a specific product deadline. Instead their goals are more performance-based. Publication is encouraged, but not required. In general, any paper published needs to have the ideas patented first. Most other companies have a similar policy. In the case of Google, they claim the patents are purely for defensive purposes. Since I never hear of Google suing smaller companies for patent infringement, I assume this is true.

## Recent Study Shows Two Random Statistics are Strongly Correlated

I am sure most people have seen the new site, Spurious Correlation . The site has a database of random statistics and continually compares the data to generate correlations between two sets of statistics. While there are relationships found by the site that are likely legitimately linked, most just seem ridiculous.

A random correlation I noticed was the “Number of People who Starve to Death in the US” and the “Per Capita Consumption of Margarine.” A correlation coefficient of 0.97 is nothing to sneeze at.

I can just picture a butter commercial encouraging people to dump margarine and save a life. More than anything, the site serves to show that with enough comparisons, you can find a statistically significant result somewhere.

This website could be a gold mine for the disingenuous reporter. The Daily Mail could run a daily article grabbing a result from the site, and probably have enough material for years. Just take two random things that are strongly correlated and speculate wildly about the reasons. This differs little from how many news stories—and even some scientific studies—are written now.

## Skeptics Guide to the Universe: Now only Two Years Behind

I have written about the Skeptics Guide to the universe podcast twice before. The first gave my impression after listening to the first 50 episodes. My review was not overly positive and I ended by stating I likely would not continue listening. Soon after that I reviewed episode 51 that contained a rather unique interview with a physics crank.

Since then I have listened to approximately 300 additional episodes, leaving me only two years behind real time. That may sound like a lot of episodes, but I do spend about two hours a day commuting.

I felt it was necessary to revise my review somewhat. Those 300 episodes cover about six years of real time. In that time, the quality has steadily improved, along with the chemistry between the hosts. Each episode typically begins with a discussion of current events—obviously with a focus on science. When I began listening, I recognized very few of the stories. Now that we are quickly reaching the present, I am beginning to recognize more of the stories. For instance, they just recently spent half an hour detailing all of the flaws and sloppy writing of the movie Prometheus, a critique I heartily agreed with.

As I said in my first post about the podcast, it really excels when discussing biology and medicine. These are two topics that I know very little about, so it is always enjoyable to learn more about them.

I now look more critically at my own beliefs. Where did they originate from and what evidence do I have?

They often discuss books of a scientific nature, either with the author or among themselves. The discussions have encouraged me to read many interesting books. A few examples are:

• A Universe from Nothing by Lawrence Krauss: A concise history of the theories of the universe. Presents a hypothesis on how the universe could have formed from “nothing”. The concept of nothing is in itself difficult and is discussed in some detail.
• Microcosm by Carl Zimmer: A introduction to evolution and life as told through the lens of E. coli. This bacteria is the most studied organism in history. By reviewing the experiments, Zimmer is able to describe some of the strongest evidence for evolution.
• 59 Seconds by Richard Wiseman: Like any popular self-help book, except actually based on scientific evidence. Interesting to learn that many of the common beliefs stated in the self-help industry are either based on zero evidence or completely contradicted by it. While Wiseman tends to overstate the case for things based on a single study, it is a good alternative to the garbage in the self-help industry.

The interviews are still good, though their importance has decreased as the show has improved. I now feel I can learn a great deal from a single episode even if there is no interview with an expert.

The podcast has also introduced me to two informative blogs. The first is Neurologica, the blog of Steven Novella. He generally posts something 4-5 days a week about science and skepticism. The second is Science-Based Medicine, where Steven Novella is also a contributor. Many of the posts give an insider’s look at the science behind medical treatments and procedures. They also spend much of their time combating the misinformation produced by alternative medicine.

Now that I am over 350 episodes into the podcast, I think I will continue until I reach the present day. My review of the first 50 episodes does not reflect the present quality of the show. It is both entertaining and informative, and something that makes my daily commute bearable.

Posted in Science Envy | | 2 Comments

## A Script for Combining the Outputs of Keyword Spotting Systems

The Keyword Spotting (KWS) task seems to be quite popular lately. For some work I am doing, I have had the need to be able to quickly merge the outputs of multiple systems (and sometimes the same system). I have written a simple python script to take care of this.

KWS is the task of detecting specific words in audio. The typical output of a KWS system is an XML file (at least in the case of NIST-style evaluations). The basic format is

<kwslist system_id=“ID1" language=“Unknown" kwlist_filename=“hitlist.xml”>
<detected_kwlist kwid=“KW001" search_time="1" oov_count="0">
<kw tbeg="195.23" dur="0.27" file=“file1 score="0.482474" channel="1" decision="NO" />
<kw tbeg="314.55" dur="0.83" file=“file2" score="0.470213" channel="1" decision="NO" />
</detected_kwlist>
</kwslit>

My script takes a set of these XML files and combines them in one of several ways; run with the -h option to get the full help. I refer to each of these files as hitlists and each of the individual entries as hits.

If we can assume that the XML files contain no overlapping entries, then we can use the fast (-m ‘fast’) option. When using this option, the merging is very fast because it does not have to search over the previous entries when adding a new one. Often I perform KWS by using a separate process for each audio file. I then merge them using this script.

In the case of overlapping hits, a decision must be made on how to combine the probabilities. The script gives five options: min, max, mean, gmean, and rank. Min, max, and mean are obvious. Gmean is similar, but uses the geometric mean instead of arithmetic mean.

The final option is rank. Rank simply considers the XML files in order. If there are overlapping entries, it takes the probability of the entry seen first.

There is one other option I wanted to mention, ‘–unscored’. When this option is given, any XML files that do not have an overlapping entry for a detection seen in another file are treated as having a probability of 0.

To be honest, I only use ‘–merge fast’ when merging files from the same system, and ‘—merge mean’ when merging files from separate systems. I have also never had a use for the ‘—unscored’ option. In all of my experiments, ‘–merge mean’ gives the best results out of all the merging options. However, others may find a use for the other options.

Once again, the script is called merge_hitlists.py.

## Counting Fractional Ngrams in a Lattice or FST

The standard procedure for building a language model is to collect a set of sentences and count the ngrams—the sequences of words of length 1 to $n$. Once the ngrams are collected, the creation of a language model is straightforward with standard tools (like the SRILM toolkit).

If instead of a set of sentences, you have one or more lattices, the process is more complicated. This situation can happen in ASR when doing unsupervised training. Each untranscribed sentence is decoded as a lattice, and a language model can be built from the lattices. In some languages, the word segmentation is not obvious. A lattice of all possible segmentations could be used to build a language model. This is the use case I had in mind.

Take the sequence ‘ABC’, it could be segmented in four possible ways.

A B C
AB C
A BC
ABC

The entire sequence represents a probability mass of one. Each possible segmentation has a probability of 0.25—assuming that all segmentations are equally likely. In order to count the ngrams, one solution is to count the ngrams in each segmentation and multiply it by the probability of the segmentation. This is reasonable for short sequences, but quickly becomes intractable as the number of possible segmentations increases. If the segmentations are unconstrained, then the number of possible segmentations is exponential in the length of the sequence.

Instead, the segmentations can be represented in a type of lattice. For this discussion, a finite state transducer (FST) representation is used. The FST also has the added benefit of easily generating all possible segmentations; simply compose a dictionary of each valid word with the original sequence.

Returning to the previous example, the set of four segmentations can be represented in a simple FST format as:

0 1 A
1 2 B
2 3 C
0 2 AB
1 3 BC
0 3 ABC
3

Each number represents a state, and each line represents a transition between two states. The final line defines state 3 as the final accepting state.

Now the problem has been shifted to counting the fractional ngrams in this FST representation. Initially, when the segmentations were all expressly written, counting the ngrams was easy. The issue was that writing each possible segmentation is impractical for long sequences. The FST provides a compact representation of all segmentations, but no obvious method for counting the ngrams.

The ngrams can be computed in two stages. In the first stage, the probability of passing through each particular state is computed. The second stage is to traverse the the FST and count the ngrams.

Start from the final state. If there is more than one final state, add an additional state and add transitions from each final state to the new state. To determine the probability of passing through any particular state, examine the parent states. Divide the probability for each parent by the number of children. The summation of this value for each parent becomes the occupancy probability for a state. This can be computed recursively until reaching the start state—whose probability is obviously 1. In pseudocode, the algorithm is:

Let fst be the lattice or fst.
occupancy_probability(fst, final_state)
def occupancy_probability(fst, state):
score = 0
for s in fst[state].parents:
score += occupancy_probability(fst, s) / len(fst[s].children)
if len(fst[state].parents) == 0:
score = 1.0
state.score = score
return score

Once the occupancy probability has been set for each state, counting the ngrams is simple. Traverse the FST and count all possible ngrams.

I have a simple python script to count the ngrams in a lattice represented in the OpenFST format. Run the script with the ‘-h’ option to see the available options. The script should work for any order of ngram and will produce a list of fractional counts for each ngram.

## Install Software without Root Access in a Roaming Home Directory

A common setup in the research labs I have seen is to give a roaming home directory to every user. Each user may have access to hundreds of individual machines, but their home directory follows them everywhere. In many ways this is convenient, but there is one aspect that always frustrates me. Installing the software that I need for each individual machine.

I just recently finished writing a bit of code in python for a research idea that I had. It worked great on my machine with a small dataset. Due to memory constraints, I needed to run the code on a larger server with the data I really wanted to use. The first error was that the numpy and scipy packages were not installed. Since I did not have root access, I attempted to install the packages to my home directory. Next error was that they required special mathematical libraries. When I tried to install these libraries, I realized the system did not have a FORTRAN compiler. At this point, I gave up out of frustration and realized I needed a new approach.

What follows is my current solution to this problem. I am not claiming it is the best approach, but it works for me. These instructions correspond specifically to an installation on a 64bit machine running OpenSUSE 12.1, but should generalize to other machines. I am also assuming a bash environment.

The first step is to setup a directory to work as our local root folder within the home directory. I chose ~/local_root. The goal is to have this setup work for any environment, so we will create a directory in local_root for each machine/environment we come across. Create a directory for the first environment, ~/local_root/env1 (or some better descriptive name). To save on typing, we can set up an environment variable.

export LOCAL_ROOT=~/local_root/env1

We also need to create two new directories within the local_root directory.

mkdir \$LOCAL_ROOT/local
mkdir \$LOCAL_ROOT/src

The first thing we will do is setup our .bashrc file. Some of the paths we are adding may not exist yet, but they will once we install gcc.

if [[ `hostname` == "machine_env1" ]]; then
export LOCAL_ROOT=~/local_root/env1
export LD_LIBRARY_PATH=\$LOCAL_ROOT/local/lib64:\$LD_LIBRARY_PATH
export CPLUS_INCLUDE_PATH=\$LOCAL_ROOT/local/include:\$CPLUS_INCLUDE_PATH
export PATH=\$LOCAL_ROOT/local/bin:\$PATH
fi

The if statement makes sure that the code only runs on a specific machine. This could also be modified for a specific set of machines. If there are multiple machines, a separate set of commands can exist for each machine.

Now that the .bashrc file is properly setup, we can begin installing software specific to the environment, without requiring root access. For this post, I will focus only on installing gcc—once we have gcc, we should be able to compile and install almost anything else we require.

You might be thinking, “Why should I go through the trouble of installing gcc, the one program nearly all systems already have installed?” If the machines you work with already have an up-to-date version of gcc (including gfortran), you can probably stop reading now. However, if you work at a place that has not updated their version of gcc since the previous millennium, then it will probably be worth the trouble.

Before installing gcc, we need to install any dependencies. My system needed MPFR and MPC. You may also need to download and install GMP. We will install each one individually using nearly the exact same setup.

Download and unpack MPFR into the \$LOCAL_ROOT/src directory (I used version 3.1.2). Run the following commands to build and install MPFR.

./configure —prefix=\$LOCAL_ROOT/local
make
make check
make install

Download and unpack MPC into the \$LOCAL_ROOT/src directory (I used version 1.02). Run the following commands to build and install MPC.

./configure --prefix=\$LOCAL_ROOT/local --with-mpfr-lib=\$LOCAL_ROOT/local/lib64 --with-mpfr-include=\$LOCAL_ROOT/local/include
make
make check
make install

The dependencies for gcc should now be installed. Get the most recent stable version of gcc ( I used version 4.8.0 ). As with the dependencies, unpack gcc in the \$LOCAL_ROOT/src directory and run the following commands.

./configure --prefix=\$LOCAL_ROOT/local --with-mpfr-lib=\$LOCAL_ROOT/local/lib64 --with-mpfr-include=\$LOCAL_ROOT/local/include --with-mpc-lib=\$LOCAL_ROOT/local/lib64 --with-gmp-lib=\$LOCAL_ROOT/local/lib64 --disable-multilib
make
make check
make install

Now we should have a working installation of gcc for a specific environment. The steps can be repeated for every environment that is needed. While compiling and installing other software may still require significant effort, you are no longer chained to the preexisting dependencies on any particular system.

If anyone else has a better solution for this problem, I would love to hear it.