The CMU Sphinx Group Open Source Speech Recognition Engines


Migrating to PocketSphinx 0.5

The upcoming 0.5 release of PocketSphinx will finally and irreversibly break source compatibility with Sphinx2. Therefore, code which relied on the API sort of being the same between them will need to be updated to use the new PocketSphinx API. There are some good reasons to do this besides sheer novelty:

  1. It is much more likely to remain stable both in terms of source and binary compatibility due to the use of abstract types.
  2. It is fully re-entrant, so there is no problem having multiple decoders in the same process.
  3. The new language model API (in SphinxBase) supports linear interpolation of multiple models at run-time.
  4. It has enabled a drastic reduction in code footprint and a modest but significant reduction in memory consumption.

Reference documentation for the new API is available at http://www.speech.cs.cmu.edu/sphinx/doc/doxygen/pocketsphinx/

Basic Usage (hello world)

Basic usage of the new API is fairly similar to the previous one. There are a few obvious differences though:

  1. Command-line parsing is done externally (in <cmd_ln.h>).
  2. Everything takes a ps_decoder_t * as the first argument.

To illustrate the new API, we will step through a simple "hello world" example. This example is somewhat specific to Unix in the locations of files and the compilation process. We will create a C source file called hello_ps.c. To compile it (on Unix), use this command:

gcc -o hello_ps hello_ps.c \
    -DMODELDIR=\"`pkg-config --variable=modeldir pocketsphinx`\" \
    `pkg-config --cflags --libs pocketsphinx sphinxbase`

Initialization

The first thing we need to do is to create a configuration object, which for historical reasons is called cmd_ln_t. Along with the general boilerplate for our C program, we will do it like this:

#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
        ps_decoder_t *ps;
        cmd_ln_t *config;

        config = cmd_ln_init(NULL, ps_args(), TRUE,
                             "-hmm", MODELDIR "/hmm/wsj1",
                             "-lm", MODELDIR "/lm/turtle/turtle.lm.DMP",
                             "-dict", MODELDIR "/lm/turtle/turtle.dic",
                             NULL);
        if (config == NULL)
                return 1;

        return 0;
}

The cmd_ln_init() function takes three fixed arguments, followed by a variable number of null-terminated strings (argument names and their values), terminated by NULL. The first argument is any previous cmd_ln_t * which is to be updated (here NULL, so a new one is created). The second argument is an array of argument definitions; the standard set can be obtained by calling ps_args(). The third argument is a flag telling the argument parser to be "strict": if this is TRUE, then duplicate arguments or unknown arguments will cause parsing to fail.

The MODELDIR macro is defined on the GCC command-line by using pkg-config to obtain the modeldir variable from PocketSphinx configuration. On Windows, you can simply add a preprocessor definition to the code, such as this:

#define MODELDIR "c:/sphinx/model"

(replace this with wherever your models are installed). Now, to initialize the decoder, use ps_init:

        ps = ps_init(config);
        if (ps == NULL)
                return 1;

Decoding a file stream

Because live audio input is somewhat platform-specific, we will confine ourselves to decoding audio files. The "turtle" language model covers a very simple "robot control" language, with phrases such as "go forward ten meters". In fact, there is an audio file helpfully included in the PocketSphinx source code which contains this very sentence. You can find it in test/data/goforward.raw. Copy it to the current directory. If you want to create your own version of it, it needs to be a single-channel (monaural), little-endian, unheadered 16-bit signed PCM audio file sampled at 16000 Hz.

To decode the file, we will first open it:

        FILE *fh;

        fh = fopen("goforward.raw", "rb");
        if (fh == NULL) {
                perror("Failed to open goforward.raw");
                return 1;
        }

And then decode it, using ps_decode_raw():

        int rv;

        rv = ps_decode_raw(ps, fh, "goforward", -1);
        if (rv < 0)
                return 1;

Now, to get the hypothesis, we can use ps_get_hyp():

        char const *hyp, *uttid;
        int32 score;

        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp == NULL)
                return 1;
        printf("Recognized: %s\n", hyp);

Decoding audio data from memory

Now, we will decode the same file again, but using the API for decoding audio data from blocks of memory. In this case, we need to first start the utterance using ps_start_utt():

        fseek(fh, 0, SEEK_SET);
        rv = ps_start_utt(ps, "goforward");
        if (rv < 0)
                return 1;

We will then read 512 samples at a time from the file, and feed them to the decoder using ps_process_raw():

        int16 buf[512];
        while (!feof(fh)) {
            size_t nsamp;
            nsamp = fread(buf, 2, 512, fh);
            rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
        }

Then we will need to mark the end of the utterance using ps_end_utt():

        rv = ps_end_utt(ps);
        if (rv < 0)
                return 1;

Retrieving the hypothesis string works in exactly the same way:

        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp == NULL)
                return 1;
        printf("Recognized: %s\n", hyp);

Cleaning up

To clean up, simply call ps_free() on the object that was returned by ps_init(). You should not do anything to free the configuration object.

Code listing

#include <pocketsphinx.h>

int
main(int argc, char *argv[])
{
        ps_decoder_t *ps;
        cmd_ln_t *config;
        FILE *fh;
        char const *hyp, *uttid;
        int16 buf[512];
        int rv;
        int32 score;

        config = cmd_ln_init(NULL, ps_args(), TRUE,
                             "-hmm", MODELDIR "/hmm/wsj1",
                             "-lm", MODELDIR "/lm/turtle/turtle.lm.DMP",
                             "-dict", MODELDIR "/lm/turtle/turtle.dic",
                             NULL);
        if (config == NULL)
                return 1;
        ps = ps_init(config);
        if (ps == NULL)
                return 1;

        fh = fopen("goforward.raw", "rb");
        if (fh == NULL) {
                perror("Failed to open goforward.raw");
                return 1;
        }

        rv = ps_decode_raw(ps, fh, "goforward", -1);
        if (rv < 0)
                return 1;
        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp == NULL)
                return 1;
        printf("Recognized: %s\n", hyp);

        fseek(fh, 0, SEEK_SET);
        rv = ps_start_utt(ps, "goforward");
        if (rv < 0)
                return 1;
        while (!feof(fh)) {
            size_t nsamp;
            nsamp = fread(buf, 2, 512, fh);
            rv = ps_process_raw(ps, buf, nsamp, FALSE, FALSE);
        }
        rv = ps_end_utt(ps);
        if (rv < 0)
                return 1;
        hyp = ps_get_hyp(ps, &score, &uttid);
        if (hyp == NULL)
                return 1;
        printf("Recognized: %s\n", hyp);

        fclose(fh);
        ps_free(ps);
        return 0;
}

Advanced Usage

For more complicated uses of the old API, there are some significant differences:

  1. There are no longer separate functions for getting partial and full hypotheses.
  2. Word segmentations are accessed via iterators rather than being returned as arrays or lists.
  3. Language model switching is done externally (in <ngram_model.h>).

The first of these is straightforward. Before, you had to use uttproc_partial_result() to get partial results (i.e. before uttproc_end_utt() was called), and uttproc_result() for full results. Now, ps_get_hyp() works for both.

For word segmentations, the API provides an iterator object which is used to, well, iterate over the sequence of words. This iterator object is an abstract type, with some accessors provided to obtain timepoints, scores, and (most interestingly) posterior probabilities for each word.

Finally, language model switching is quite different. The decoder is always associated with a language model set object (yes, even if there is only one language model). Switching language models is accomplished by:

  1. Getting a handle to the language model set object: ps_get_lmset()
  2. Selecting the new language model: ngram_model_set_select()
  3. Telling the decoder the language model set has been updated: ps_update_lmset()

PocketSphinxMigration (last edited 2008-09-22 15:31:41 by DavidHugginsDaines)

This page is maintained by David Huggins-Daines.
CMUSphinx is a project within the Sphinx Group at Carnegie Mellon