In this assignment, you will learn how to build a speech recognizer using the CMU Sphinx system. There are three major components that go into a typical speech recognizer: an acoustic model, a pronunciation dictionary, and a language model.
Your task for this assignment is to build an acoustic model, dictionary, and language model, and use them to recognize a list of sentences, which will be provided to you in the form of audio files and corresponding text. You will calculate and report the error rate achieved by your system. It is common practice to refine your models on a held-out development set of data, then use a disjoint and unseen test set for final verification. We will follow this practice by testing your systems on a separate test set which is similar but not identical to the development set given to you.
You will need to have Perl installed on your computer. For Windows, we recommend using ActivePerl, or installing Cygwin. On Linux and Mac OS X, you should have a working C compiler and development libraries.
To build an acoustic model, you will mostly follow the instructions in the Sphinx tutorial. However, we have modified the training set slightly in order to make it more suitable for this task. First, download the assignment data package, and extract it to a suitable directory on your computer.
For Windows, the executable programs needed are included in the assignment data package. As mentioned above, you will need to have Perl installed. You can either use ActivePerl from the Windows command line (so-called "DOS Box"), or you can use the Perl that is included in Cygwin. You should verify that you have a working version of Perl before proceeding by opening a command-line window (either cmd.exe or Cygwin), changing into the directory created by extracting the assignment data, and running:
You should see output like this:
MODULE: 00 verify training files
O.S. is case insensitive ("A" == "a").
Phones will be treated as case insensitive.
Phase 1: DICT - Checking to see if the dict and filler dict agrees with the phonelist file.
Found 1219 words using 40 phones
Phase 2: DICT - Checking to make sure there are not duplicate entries in the dictionary
Phase 3: CTL - Check general format; utterance length (must be positive); files exist
Phase 4: CTL - Checking number of lines in the transcript should match lines in control file
Phase 5: CTL - Determine amount of training data, see if n_tied_states seems reasonable.
Total Hours Training: 1.52594188034191
This is a small amount of data, no comment at this time
Phase 6: TRANSCRIPT - Checking that all the words in the transcript are in the dictionary
Words in dictionary: 1216
Words in filler dictionary: 3
Phase 7: TRANSCRIPT - Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
On Linux or Mac OS X, you will also need to compile the Sphinx trainer and recognizer and install these programs in the assignment data directory. Download the provided snapshot of SphinxTrain (this corresponds to the current code as of September 8th, as there is currently no released version). Unpack it using tar and compile it:
tar zxf SphinxTrain-20080912.tar.gz
cd SphinxTrain
./configure
make
Next, you will need to compile the recognizer. We have also provided a snapshot of the stable branch of PocketSphinx (this corresponds to the upcoming 0.5.1 release). Unpack this using tar and use the build_all_static.sh script provided:
tar zxf pocketsphinx-0.5-20080912.tar.gz
cd pocketsphinx-0.5
./build_all_static.sh
Finally, change into the assignment directory, and use the setup scripts in SphinxTrain and PocketSphinx to copy the programs to the correct location.
cd hw1_data
../SphinxTrain/scripts_pl/setup_SphinxTrain.pl -task rm
../pocketsphinx-0.5/pocketsphinx/scripts/setup_sphinx.pl -task rm
Ensure that training is correctly configured by running:
You should see output like this:
MODULE: 00 verify training files
O.S. is case sensitive ("A" != "a").
Phones will be treated as case sensitive.
Phase 1: DICT - Checking to see if the dict and filler dict agrees with the phonelist file.
Found 1219 words using 40 phones
Phase 2: DICT - Checking to make sure there are not duplicate entries in the dictionary
Phase 3: CTL - Check general format; utterance length (must be positive); files exist
Phase 4: CTL - Checking number of lines in the transcript should match lines in control file
Phase 5: CTL - Determine amount of training data, see if n_tied_states seems reasonable.
Total Hours Training: 1.52594188034191
This is a small amount of data, no comment at this time
Phase 6: TRANSCRIPT - Checking that all the words in the transcript are in the dictionary
Words in dictionary: 1216
Words in filler dictionary: 3
Phase 7: TRANSCRIPT - Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
The training process is organized in a list of successive stages, each of which builds on the results of the previous one. The first step (01.vector_quantize) constructs a generic model of the acoustic feature space without distinguishing phones. In the second step (20.ci_hmm), individual phones are trained without reference to their context. The third step (30.cd_hmm_untied) trains separate models for every triphone (meaning a phone with a specific left and right context). In the fourth step (40.buildtrees), these triphones are clustered using decision trees, and in the next step (45.prunetree), the trees are pruned to produce a set of basic acoustic units known as senones or tied states. Finally, senone models are trained in the last (50.cd_hmm_tied) step.
If everything is set up correctly, you won't have to run each step manually. You can just use the RunAll.pl script to run the entire training process:
However, if one of the steps fails, it is useful to know which one failed so that you can look in the right place to diagnose the problem, and so that you can restart it after repairing it. For each stage, there is a corresponding directory in scripts_pl and logdir. The directory in scripts_pl contains a script beginning with "slave" which is (strangely enough) the "master" script for that particular stage of training. Running this script will run the entire stage. The directory in logdir contains log files for the subprocesses. If the training process fails, look in the most recent logfile to see what the error might be. There is also an HTML logfile called rm.html in the assignment directory.
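When a stage fails, it can be tedious to hunt through the per-stage directories by hand. The following is a minimal Python sketch, not part of the Sphinx tools, that walks logdir and reports the most recently modified file. The function name newest_log is hypothetical, and the sketch assumes only that the log files live somewhere under logdir as described above.

```python
import os


def newest_log(logdir="logdir"):
    """Return the path of the most recently modified file under logdir.

    Returns None if logdir does not exist or contains no files.
    """
    newest_path, newest_mtime = None, -1.0
    for dirpath, _dirnames, filenames in os.walk(logdir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if mtime > newest_mtime:
                newest_path, newest_mtime = path, mtime
    return newest_path


if __name__ == "__main__":
    # Run from the assignment directory; prints None if nothing is found.
    print(newest_log())
```

The directory that contains the reported file tells you which stage was running when training stopped.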
On a relatively modern PC (no more than about 4 years old), training should take a few hours. The most time-consuming part is the 40.buildtrees stage.
If training successfully completes, you will see directories for each stage in logdir, and there will be a directory in model_parameters called rm.cd_semi_1000, which will contain (at least) the files mdef, means, variances, mixture_weights, and transition_matrices.
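A quick way to confirm this is to check for the expected files programmatically. The sketch below is only an illustration: the helper name missing_model_files is hypothetical, and the file list is exactly the one given above (the directory may legitimately contain more).

```python
import os

# File names listed in the assignment text as the minimum expected output.
EXPECTED_FILES = [
    "mdef",
    "means",
    "variances",
    "mixture_weights",
    "transition_matrices",
]


def missing_model_files(model_dir="model_parameters/rm.cd_semi_1000"):
    """Return the expected model files that are not present in model_dir."""
    return [
        name
        for name in EXPECTED_FILES
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


if __name__ == "__main__":
    missing = missing_model_files()
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All expected model files are present.")
```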
There are two options available for language modeling: an N-Gram model or a finite state grammar.
Regardless of whether you elect to use an N-Gram model or a finite state grammar as your language model, you may find it helpful to create a formal definition of the grammar for this task. Informally, you will need to be able to recognize directions for a hypothetical speech-driven pizza ordering system. These will consist of requests for a specific size of pizza with one or more toppings. Each request optionally starts with one of the following phrases:
The opening phrase, if present, is followed by the size of pizza requested, either "small", "medium", "large", or "extra large", the word "pizza", and an optional list of toppings. The available toppings are:
extra cheese, mushrooms, onions, black olives, green olives, pineapple, green peppers, broccoli, tomatoes, spinach, anchovies, sausage, pepperoni, ham, bacon
Any of the various toppings should also be recognized in isolation, or in a list spoken separately from the main pizza ordering request. You can use the transcription from the development data as an example, but keep in mind that you will be tested on a separate data set which will contain a different set of sentences, so your model must be sufficiently general.
To build an N-Gram language model for the task, you will need a sufficiently large sample of representative text. In real-world systems this is usually collected from digital text resources such as newswires and web sites, or it is created by transcribing speech data. For this assignment, you will be creating it from scratch. Roughly, you need to create a text file which "covers" the description above. One possible way to do this is to write a program which randomly generates a large number of sentences according to that description.
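As an illustration of the random-generation approach, here is a minimal Python sketch. Several things in it are assumptions rather than part of the assignment: OPENERS contains a single example phrase and should be replaced with the actual opening phrases for the task; toppings are introduced with "with" and joined with "and"; and the probabilities are arbitrary. The 5000-line cap reflects the limit of the online knowledge base tool.

```python
import random

# Limit of the online knowledge base tool, per the assignment text.
LMTOOL_MAX_LINES = 5000

# ASSUMPTION: a single example opener; replace with the real list of phrases.
OPENERS = ["i want to order a"]
SIZES = ["small", "medium", "large", "extra large"]
TOPPINGS = [
    "extra cheese", "mushrooms", "onions", "black olives", "green olives",
    "pineapple", "green peppers", "broccoli", "tomatoes", "spinach",
    "anchovies", "sausage", "pepperoni", "ham", "bacon",
]


def topping_list(rng, max_toppings=4):
    """Pick 1..max_toppings distinct toppings, joined with 'and' (assumed)."""
    count = rng.randint(1, max_toppings)
    return " and ".join(rng.sample(TOPPINGS, count))


def random_sentence(rng):
    """Generate either a full pizza order or a bare list of toppings."""
    if rng.random() < 0.2:  # arbitrary share of topping-only utterances
        return topping_list(rng)
    words = []
    if rng.random() < 0.5:  # the opening phrase is optional
        words.append(rng.choice(OPENERS))
    words.append(rng.choice(SIZES))
    words.append("pizza")
    if rng.random() < 0.8:  # the topping list is optional
        words.append("with " + topping_list(rng))
    return " ".join(words)


def generate_corpus(n_sentences=LMTOOL_MAX_LINES, seed=0):
    """Return n_sentences random sentences, one per list element."""
    if n_sentences > LMTOOL_MAX_LINES:
        raise ValueError("the knowledge base tool accepts at most 5000 lines")
    rng = random.Random(seed)
    return [random_sentence(rng) for _ in range(n_sentences)]


def write_corpus(path, sentences):
    """Write one sentence per line, without punctuation."""
    with open(path, "w") as out:
        for sentence in sentences:
            out.write(sentence + "\n")


if __name__ == "__main__":
    # Print a few samples; use write_corpus() to produce the upload file.
    for sentence in generate_corpus(n_sentences=5):
        print(sentence)
```

Random sampling alone does not guarantee that every word or word pair appears, so it is worth checking the generated text against the task description before compiling it.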
This text file should be a plain text file consisting of one sentence per line. Punctuation is not needed. You should enter it into the upload field at http://www.speech.cs.cmu.edu/tools/lmtool.html and click the "Compile Knowledge Base" button. The knowledge base tool is limited to 5000 lines of input. It will print some status messages and then redirect you to a page where you can download the results. The only files you need to download are the dictionary and the language model. They will have numeric filenames. After downloading them, we recommend that you rename them to pizza.lm and pizza.dic. Now you can proceed to test the system.
Using a finite state grammar allows you to formally specify the language accepted by the speech recognizer. Internally, Sphinx uses a format to describe these grammars which is not easy for humans to write or read. However, we have provided a tool called sphinx_jsgf2fsg which allows you to write a grammar in the JSpeech Grammar Format. This is a grammar format originally developed by Sun for the Java Speech API, which is relatively easy to write. A short example JSGF grammar for part of this task would look like this:
#JSGF V1.0;
grammar pizza;
public <startPizza> = i want to order a <size> pizza with <topping>;
<size> = small | medium | large;
<topping> = pepperoni | mushrooms | anchovies;
Some things to note about the JSGF format:
After you have created the JSGF grammar file, you need to convert it to a Sphinx FSG file using sphinx_jsgf2fsg. We will assume you called your grammar file pizza.gram here. From the command-line, run:
bin\sphinx_jsgf2fsg < pizza.gram > pizza.fsg
Now you will need to create a pronunciation dictionary. The Perl script fsg2wlist.pl in the scripts_pl directory will extract the words from the resulting FSG file:
perl scripts_pl\fsg2wlist.pl pizza.fsg > pizza.words
You can now upload the word list pizza.words to http://www.speech.cs.cmu.edu/tools/lmtool.html and click the "Compile Knowledge Base" button. The only file you will need from the results page is the dictionary file. Download it and rename it to pizza.dic.
Finally, you can run the speech recognizer in "batch mode" to test the models that you have built. This requires you to have, in addition to the acoustic model, language model, and dictionary, the following files:
In general, the configuration file needs, at the very least, to contain the following arguments:
However, in this case, there are several additional arguments that are necessary:
Therefore, if you are using an N-Gram language model called pizza.lm and the corresponding dictionary is called pizza.dic, then your configuration file would contain:
-hmm model_parameters/rm.cd_semi_1000
-lm pizza.lm
-dict pizza.dic
-ctl etc/pizza_devel.fileids
-adcin yes
-adchdr 44
-cepext .wav
-cepdir wav
-hyp pizza_devel.hyp
-outlatdir lat
If, instead, you created a finite-state grammar and called it pizza.fsg, then your configuration file would contain:
-hmm model_parameters/rm.cd_semi_1000
-fsg pizza.fsg
-dict pizza.dic
-ctl etc/pizza_devel.fileids
-adcin yes
-adchdr 44
-cepext .wav
-cepdir wav
-hyp pizza_devel.hyp
-outlatdir lat
Remember that all filenames in the configuration file are interpreted relative to the current directory, so for the configuration files above to work, the language model or FSG and dictionary should be in the top level assignment directory, and you should run the recognizer from this directory. You should pass the configuration file as the only argument to the recognizer program, like this:
If recognition was successful, you should see a lot of output on the screen, ending with a few lines like this (the numbers will be different on your computer):
INFO: batch.c(386): TOTAL 409.99 seconds speech, 10.60 seconds CPU, 10.67 seconds wall
INFO: batch.c(388): AVERAGE 0.03 xRT (CPU), 0.03 xRT (elapsed)
The recognition results can now be found in the file pizza_devel.hyp. You will now need to compute the word error rate, which is the standard method for evaluating speech recognition systems. The script word_align.pl in the scripts_pl\decode directory compares the reference transcription to the recognition results and reports the error rate for each sentence, followed by the overall error rate. You can run it like this:
perl scripts_pl\decode\word_align.pl etc\pizza_devel.transcription pizza_devel.hyp
The final three lines of its output will report the number of errors and the error rate.
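To make the metric concrete: the word error rate is the minimum number of word substitutions, insertions, and deletions needed to turn the reference into the hypothesis, divided by the number of words in the reference. The Python sketch below illustrates that calculation only; it is not the implementation used by word_align.pl, and it assumes each sentence is already plain text with any utterance IDs or other annotations removed. The function names are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word edits turning reference into hypothesis.

    Both arguments are lists of words. Uses the standard edit-distance
    dynamic program, keeping only one row of the table at a time.
    """
    ref, hyp = list(reference), list(hypothesis)
    # previous[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_error_rate(pairs):
    """Overall WER for an iterable of (reference_text, hypothesis_text)."""
    errors = 0
    ref_words = 0
    for ref_text, hyp_text in pairs:
        ref = ref_text.split()
        errors += word_errors(ref, hyp_text.split())
        ref_words += len(ref)
    return errors / ref_words if ref_words else 0.0


if __name__ == "__main__":
    # Made-up example pair: one substitution out of four reference words.
    example = [("large pizza with ham", "large pizza with bacon")]
    print(word_error_rate(example))
```

Note that the overall rate is computed from total errors over total reference words, not by averaging the per-sentence rates.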
This assignment is due Wednesday, October 5th, before the beginning of class. Your final submission for this assignment should consist of a brief writeup and a .zip or .tar.gz archive containing:
As with the reading assignments, send your writeup and data to email@example.com. The archive should be no larger than about 6 megabytes. If you have trouble sending it, place it in an AFS directory and e-mail the path, or if you are unable to do this, contact the professor to make alternative arrangements.
In your writeup, you should answer the following questions:
While this assignment is not particularly difficult, the acoustic model training may be quite time-consuming, so we recommend that you start it as soon as possible.