15-492: ASR Assignment

In this assignment, you will learn how to build a speech recognizer using the CMU Sphinx system. There are three major components that go into a typical speech recognizer:

  1. The Acoustic model defines the acoustic units of recognition and the statistical models used to identify them. It is constructed by a training process which takes segmented audio data and the corresponding transcriptions as its input.
  2. The Dictionary (or pronunciation model) defines the mapping of words to acoustic units. It is constructed by hand, by using trained letter-to-sound rules, or frequently by some combination of the two.
  3. The Language model (or grammar) defines the set of possible sentences that can be recognized, as well as their relative prior probabilities. In the Sphinx system, it can be a Markov model defined over words, commonly known as an N-Gram model, or a finite-state automaton which recognizes a regular grammar. N-Gram models are trained from corpora of text, while finite-state grammar descriptions can be written by hand in the JSGF format.

Task Description

Your task for this assignment is to build an acoustic model, dictionary, and language model, and use them to recognize a list of sentences, which will be provided to you in the form of audio files and corresponding text. You will calculate and report the error rate achieved by your system. It is common practice to refine your models on a held-out development set of data, then use a disjoint and unseen test set for final verification. We will follow this practice by testing your systems on a separate test set which is similar but not identical to the development set given to you.

Required tools

You will need to have Perl installed on your computer. For Windows, we recommend using ActivePerl, or installing Cygwin. On Linux and Mac OS X, you will also need a working C compiler and development libraries.

Instructions: Acoustic Model

To build an acoustic model, you will mostly follow the instructions in the Sphinx tutorial. However, we have modified the training set slightly in order to make it more suitable for this task. First, download the assignment data package, and extract it to a suitable directory on your computer.

Installation on Windows

For Windows, the executable programs needed are included in the assignment data package. As mentioned above, you will need to have Perl installed. You can either use ActivePerl from the Windows command line (so-called "DOS Box"), or you can use the Perl that is included in Cygwin. You should verify that you have a working version of Perl before proceeding by opening a command-line window (either cmd.exe or Cygwin), changing into the directory created by extracting the assignment data, and running:

      perl scripts_pl\00.verify\verify_all.pl 
    

You should see output like this:

MODULE: 00 verify training files
O.S. is case insensitive ("A" == "a").
Phones will be treated as case insensitive.
    Phase 1: DICT - Checking to see if the dict and filler dict agrees with the phonelist file.
        Found 1219 words using 40 phones
    Phase 2: DICT - Checking to make sure there are not duplicate entries in the dictionary
    Phase 3: CTL - Check general format; utterance length (must be positive); files exist
    Phase 4: CTL - Checking number of lines in the transcript should match lines in control file
    Phase 5: CTL - Determine amount of training data, see if n_tied_states seems reasonable.
        Total Hours Training: 1.52594188034191
        This is a small amount of data, no comment at this time
    Phase 6: TRANSCRIPT - Checking that all the words in the transcript are in the dictionary
        Words in dictionary: 1216
        Words in filler dictionary: 3
    Phase 7: TRANSCRIPT - Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
    

Installation on Unix

On Linux or Mac OS X, you will also need to compile the Sphinx trainer and recognizer and install these programs in the assignment data directory. Download the provided snapshot of SphinxTrain (this corresponds to the current code as of September 8th, as there is currently no released version). Unpack it using tar and compile it:

      tar zxf SphinxTrain-20080912.tar.gz
      cd SphinxTrain
      ./configure
      make
    

Next, you will need to compile the recognizer. We have also provided a snapshot of the stable branch of PocketSphinx (this corresponds to the upcoming 0.5.1 release). Unpack this using tar and use the build_all_static.sh script provided:

      tar zxf pocketsphinx-0.5-20080912.tar.gz
      cd pocketsphinx-0.5
      ./build_all_static.sh
    

Finally, change into the assignment directory, and use the setup scripts in SphinxTrain and PocketSphinx to copy the programs to the correct location.

      cd hw1_data
      ../SphinxTrain/scripts_pl/setup_SphinxTrain.pl -task rm
      ../pocketsphinx-0.5/pocketsphinx/scripts/setup_sphinx.pl -task rm
    

Ensure that training is correctly configured by running:

      perl scripts_pl/00.verify/verify_all.pl 
    

You should see output like this:

MODULE: 00 verify training files
O.S. is case sensitive ("A" != "a").
Phones will be treated as case sensitive.
    Phase 1: DICT - Checking to see if the dict and filler dict agrees with the phonelist file.
        Found 1219 words using 40 phones
    Phase 2: DICT - Checking to make sure there are not duplicate entries in the dictionary
    Phase 3: CTL - Check general format; utterance length (must be positive); files exist
    Phase 4: CTL - Checking number of lines in the transcript should match lines in control file
    Phase 5: CTL - Determine amount of training data, see if n_tied_states seems reasonable.
        Total Hours Training: 1.52594188034191
        This is a small amount of data, no comment at this time
    Phase 6: TRANSCRIPT - Checking that all the words in the transcript are in the dictionary
        Words in dictionary: 1216
        Words in filler dictionary: 3
    Phase 7: TRANSCRIPT - Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
    

Training

The training process is organized as a sequence of stages, each of which builds on the results of the previous one:

  1. 01.vector_quantize constructs a generic model of the acoustic feature space without distinguishing phones.
  2. 20.ci_hmm trains individual phones without reference to their context.
  3. 30.cd_hmm_untied trains separate models for every triphone (meaning a phone with a specific left and right context).
  4. 40.buildtrees clusters these triphones using decision trees.
  5. 45.prunetree prunes the trees to produce a set of basic acoustic units known as senones or tied states.
  6. 50.cd_hmm_tied trains the senone models.

If everything is set up correctly, you won't have to run each step manually. You can just use the RunAll.pl script to run the entire training process:

      perl scripts_pl/RunAll.pl
    

However, if one of the steps fails, it is useful to know which one failed so that you can look in the right place to diagnose the problem, and so that you can restart that stage after fixing it. For each stage, there is a corresponding directory in scripts_pl and logdir. The directory in scripts_pl contains a script beginning with "slave" which is (strangely enough) the "master" script for that particular stage of training. Running this script will run the entire stage. The directory in logdir contains log files for the subprocesses. If the training process fails, look in the most recent logfile to see what the error might be. There is also an HTML logfile called rm.html in the assignment directory.
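For example, to re-run only the context-independent training stage after fixing a problem, you would run that stage's master script directly. The exact filename below is our best guess for the provided snapshot; check the stage's directory under scripts_pl for the script beginning with "slave":

      perl scripts_pl/20.ci_hmm/slave_convg.pl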

On a relatively modern PC (4 years old), training should take a few hours. The most time-consuming part is the 40.buildtrees stage.

If training successfully completes, you will see directories for each stage in logdir, and there will be a directory in model_parameters called rm.cd_semi_1000, which will contain (at least) the files mdef, means, variances, mixture_weights, and transition_matrices.
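You can quickly confirm that these files are present from the command line (use backslashes on Windows):

      ls model_parameters/rm.cd_semi_1000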

Instructions: Language Model

There are two options available for language modeling:

  1. Finite-state Grammar: You must write a JSGF Grammar for the task, as described below, and process it into a Sphinx FSG file using the sphinx_jsgf2fsg tool (which will be provided for you).
  2. N-Gram Model: You must create a corpus of representative text for the task, and use the Sphinx knowledge base tool to create an N-Gram language model.

Regardless of whether you elect to use an N-Gram model or a finite-state grammar as your language model, you may find it helpful to create a formal definition of the grammar for this task. Informally, you will need to be able to recognize directions for a hypothetical speech-driven pizza ordering system. These consist of requests for a specific size of pizza with one or more toppings. Each request optionally starts with an introductory phrase (such as "i want to order a", as in the example grammar below), followed by the size of pizza requested, either "small", "medium", "large", or "extra large", the word "pizza", and an optional list of toppings. The available toppings are:

extra cheese, mushrooms, onions, black olives, green olives, pineapple, green peppers, broccoli, tomatoes, spinach, anchovies, sausage, pepperoni, ham, bacon

Any of the various toppings should also be recognized in isolation, or in a list spoken separately from the main pizza ordering request. You can use the transcription from the development data as an example, but keep in mind that you will be tested on a separate data set which will contain a different set of sentences, so your model must be sufficiently general.

N-Gram Language Models

To build an N-Gram language model for the task, you will need a sufficiently large sample of representative text. In real-world systems this is usually collected from digital text resources such as newswires and web sites, or it is created by transcribing speech data. For this assignment, you will be creating it from scratch. Roughly, you need to create a text file which "covers" the description above. One possible way to do this is to write a program which randomly generates a large number of sentences according to that description.
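To make this concrete, here is a minimal sketch of such a generator in Perl. The opening phrases and sentence shapes below are illustrative assumptions, not the full task grammar (for instance, the real task also allows toppings spoken in isolation), so extend the word lists and patterns to cover the whole description:

#!/usr/bin/perl
# gen_sentences.pl: randomly generate training sentences for the pizza task.
use strict;
use warnings;

my @starts   = ("i want to order a", "i would like a");   # hypothetical openers
my @sizes    = ("small", "medium", "large", "extra large");
my @toppings = ("extra cheese", "mushrooms", "onions", "black olives",
                "green olives", "pineapple", "green peppers", "broccoli",
                "tomatoes", "spinach", "anchovies", "sausage", "pepperoni",
                "ham", "bacon");

# Pick a random element from a list.
sub pick { return $_[int(rand(scalar @_))]; }

for (1 .. 4000) {                   # stay under lmtool's 5000-line limit
    my @parts;
    push @parts, pick(@starts) if rand() < 0.8;    # the opener is optional
    push @parts, pick(@sizes), "pizza";
    # One to three toppings, joined the way people tend to say them.
    my @t = map { pick(@toppings) } 1 .. 1 + int(rand(3));
    push @parts, "with", join(" and ", @t);
    print join(" ", @parts), "\n";
}

Run it with "perl gen_sentences.pl > pizza.txt" and inspect the output before uploading it.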

This text file should contain one sentence per line; punctuation is not needed. Upload it using the form at http://www.speech.cs.cmu.edu/tools/lmtool.html and click the "Compile Knowledge Base" button; note that the tool is limited to 5000 lines of input. It will print some status messages and then redirect you to a page where you can download the results. The only files you need to download are the dictionary and the language model, which will have numeric filenames. After downloading them, we recommend that you rename them to pizza.lm and pizza.dic.
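The numeric filenames vary from run to run; for example, if the tool produced 1234.lm and 1234.dic (hypothetical names), you would rename them like this on Unix:

      mv 1234.lm pizza.lm
      mv 1234.dic pizza.dic

Now you can proceed to test the system.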

Finite State Grammars

Using a finite state grammar allows you to formally specify the language accepted by the speech recognizer. Internally, Sphinx uses a format to describe these grammars which is not easy for humans to write or read. However, we have provided a tool called sphinx_jsgf2fsg which allows you to write a grammar in the JSpeech Grammar Format. This is a grammar format originally developed by Sun for the Java Speech API, which is relatively easy to write. A short example JSGF grammar for part of this task would look like this:

#JSGF V1.0;

grammar pizza;

public <startPizza> = i want to order a <size> pizza with <topping>;

<size> = small | medium | large;

<topping> = pepperoni | mushrooms | anchovies;
    

Some things to note about the JSGF format: the file must begin with the header line "#JSGF V1.0;", each rule definition ends with a semicolon, and rule names are written in angle brackets. Only rules declared public are visible to the recognizer as entry points. Alternatives are separated by "|", square brackets mark optional items, and a parenthesized group followed by "*" (zero or more) or "+" (one or more) may be repeated.
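As an illustration of these constructs, the example above could be extended along these lines (the opening phrases are again illustrative; if your version of sphinx_jsgf2fsg does not accept the repetition operator, spell out the alternatives instead):

#JSGF V1.0;

grammar pizza;

public <order> = [<start>] <size> pizza [with <toppings>];

<start> = i want to order a | i would like a;

<size> = small | medium | large | extra large;

<toppings> = <topping> (and <topping>)*;

<topping> = pepperoni | mushrooms | anchovies | extra cheese;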

After you have created the JSGF grammar file, you need to convert it to a Sphinx FSG file using sphinx_jsgf2fsg. We will assume you called your grammar file pizza.gram here. From the command-line, run:

      bin\sphinx_jsgf2fsg < pizza.gram > pizza.fsg
    

Now you will need to create a pronunciation dictionary. The Perl script fsg2wlist.pl in the scripts_pl directory will extract the words from the resulting FSG file:

      perl scripts_pl\fsg2wlist.pl pizza.fsg > pizza.words
    

You can now upload the word list pizza.words to http://www.speech.cs.cmu.edu/tools/lmtool.html and click the "Compile Knowledge Base" button. The only file you will need from the results page is the dictionary file. Download it and rename it to pizza.dic.
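If you open pizza.dic in a text editor, you should see one word per line followed by its phone sequence, roughly like this (the pronunciations shown here are illustrative):

ANCHOVIES       AE N CH OW V IY Z
PEPPERONI       P EH P ER OW N IY
PIZZA           P IY T S AH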

Testing

Finally, you can run the speech recognizer in "batch mode" to test the models that you have built. In addition to the acoustic model, language model, and dictionary, this requires the control file listing the utterances to recognize (etc/pizza_devel.fileids), the corresponding audio files (in the wav directory), and a configuration file.

In general, the configuration file needs, at the very least, to contain arguments naming the acoustic model (-hmm), the language model (-lm) or finite-state grammar (-fsg), the dictionary (-dict), the control file (-ctl), and the file to which the recognition hypotheses will be written (-hyp).

However, in this case, there are several additional arguments that are necessary, because the recognizer reads audio files directly rather than precomputed feature files: -adcin tells it to treat the input as audio data, -adchdr gives the number of header bytes to skip (44 for standard WAV files), -cepext and -cepdir give the extension and directory of the input files, and -outlatdir names a directory in which to write word lattices.

Therefore, if you are using an N-Gram language model called pizza.lm and the corresponding dictionary is called pizza.dic, then your configuration file would contain:

-hmm model_parameters/rm.cd_semi_1000
-lm pizza.lm
-dict pizza.dic
-ctl etc/pizza_devel.fileids
-adcin yes
-adchdr 44
-cepext .wav
-cepdir wav
-hyp pizza_devel.hyp
-outlatdir lat
    

If, instead, you created a finite-state grammar and called it pizza.fsg, then your configuration file would contain:

-hmm model_parameters/rm.cd_semi_1000
-fsg pizza.fsg
-dict pizza.dic
-ctl etc/pizza_devel.fileids
-adcin yes
-adchdr 44
-cepext .wav
-cepdir wav
-hyp pizza_devel.hyp
-outlatdir lat
    

Remember that all filenames in the configuration file are interpreted relative to the current directory, so for the configuration files above to work, the language model or FSG and dictionary should be in the top level assignment directory, and you should run the recognizer from this directory. You should pass the configuration file as the only argument to the recognizer program, like this:

    bin\pocketsphinx_batch.exe pizza.cfg
    

If recognition was successful, you should see a lot of output on the screen, ending with a few lines like this (the numbers will be different on your computer):

INFO: batch.c(386): TOTAL 409.99 seconds speech, 10.60 seconds CPU, 10.67 seconds wall
INFO: batch.c(388): AVERAGE 0.03 xRT (CPU), 0.03 xRT (elapsed)
    

The recognition results can now be found in the file pizza_devel.hyp. You will now need to compute the word error rate, which is the standard method for evaluating speech recognition systems. The script word_align.pl in the scripts_pl\decode directory compares the reference transcription to the recognition results and reports the error rate for each sentence, followed by the overall error rate. You can run it like this:

      perl scripts_pl\decode\word_align.pl etc\pizza_devel.transcription pizza_devel.hyp
    

The final three lines of its output will report the number of errors and the error rate.
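Word error rate is computed as the total number of substitutions, deletions, and insertions in the recognizer output, divided by the number of words in the reference transcription: WER = (S + D + I) / N. For example, if the reference contains 200 words and your system makes 8 substitutions, 3 deletions, and 4 insertions, the word error rate is (8 + 3 + 4) / 200 = 7.5%. Because insertions are counted, the error rate can exceed 100%.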

Due date and submissions

This assignment is due Wednesday, October 5th, before the beginning of class. Your final submission for this assignment should consist of a brief writeup and a .zip or .tar.gz archive containing the files you created: your language model or grammar, your dictionary, and your recognition output (the .hyp file).

As with the reading assignments, send your writeup and data to nbach@cs.cmu.edu. The archive should not be larger than about 6 megabytes. If you have trouble sending it, place it in an AFS directory and e-mail the path; if you are unable to do this, contact the professor to make alternative arrangements.

In your writeup, you should answer the following questions:

  1. What was the initial error rate that you achieved?
  2. Look over the output of word_align.pl. Pick a few errors and listen to the corresponding audio files. Can you hear anything that might have caused the error?
  3. How would you reduce the error rate? (extra credit if you successfully implement a strategy for this)

While this assignment is not particularly difficult, the acoustic model training may be quite time-consuming, so we recommend that you start it as soon as possible.