
CMU Sphinx FAQ

  1. What is CMU Sphinx?

    CMU Sphinx is a large vocabulary, speaker independent speech recognition codebase and suite of tools. The code is available for download and use. The main page is at http://www.speech.cs.cmu.edu/sphinx/, with CVS hosted on SourceForge.

  2. What is the difference between sphinx2 and sphinx3?

    sphinx2 is the real-time engine, and sphinx3 is slower but potentially more accurate. Sphinx2 is "semicontinuous" (uses tied mixtures), and Sphinx3 can use fully continuous observation densities (untied, so that each state has its own distribution statistics). Sphinx3 is not as fully developed for portability, but the codebase is cleaner than that of Sphinx2.

    Use sphinx2 for speed, and sphinx3 if you can afford to take longer than real-time to decode. Both are available under CVS.

  3. What is a phoneset?

    The phoneset is the list of 'phones', or speech sounds, that the engine can recognize. When you build acoustic models and pronunciations for words, they can be made to use any set of units, but they must be the same units. The acoustic models will search for the speech sounds (phones), and the word pronunciations are also given in terms of the phones in the phone set.

    The default phoneset for American English that comes with Sphinx2 contains the following phones: AA AE AH AO AW AX AXR AY B CH D DH DX EH ER EY F G HH IH IX IY JH K L M N NG OW OY P R S SH T TH UH UW V W Y Z ZH .

    There is also the silence phone, SIL, and a number of "noise" phones: +BREATH+ +COUGH+ +LAUGH+ +SMACK+ +UH+ +UHUM+ +UM+.

    Note: By default, Sphinx2 uses a slightly different phoneset for American English than is in the CMU Dictionary. Internally, Sphinx2 does not use lexical stress; you need to run the CMU Dictionary through the stress2sphinx utility included in the Sphinx2 release to convert it to the default phone set.

  4. What is a dictionary (.dic) file?

    The decoder needs to know the pronunciations of words, and the dictionary file (often with a .dic extension) lists words, each followed by a sequence of phones. If a word can be pronounced more than one way, each additional pronunciation must be given a distinct name by affixing a number after it in parentheses:

      ELEVEN                         AX L EH V AX N
      ELEVEN(2)                      IY L EH V AX N
      EXIT                           EH G Z AX T
      EXIT(2)                        EH K S AX T
      EXPLORE                        IX K S P L AO R
      FIFTEEN                        F IH F T IY N
    Here, there are two pronunciations for the word 'EXIT', as there are for 'ELEVEN'. Note that the first pronunciation does not have a number after it; make sure there is an unnumbered pronunciation whenever there are numbered ones.
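
    To make the convention concrete, here is a small sketch (load_dic is a hypothetical helper, not part of Sphinx2) that reads such entries and groups the alternate pronunciations, stripping the "(2)"-style suffixes:

```python
import re

def load_dic(lines):
    """Map each word to its list of pronunciations (each a list of phones)."""
    prons = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry, phones = line.split(None, 1)
        # "EXIT(2)" and "EXIT" are the same word with different pronunciations.
        word = re.sub(r'\(\d+\)$', '', entry)
        prons.setdefault(word, []).append(phones.split())
    return prons

entries = [
    "ELEVEN     AX L EH V AX N",
    "ELEVEN(2)  IY L EH V AX N",
    "EXIT       EH G Z AX T",
    "EXIT(2)    EH K S AX T",
]
prons = load_dic(entries)
print(len(prons["EXIT"]))  # 2
```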

  5. Do I need a Noise Word Dictionary? What is it?

    A noise word dictionary (noisedict) is a set of words made up of noise phones (like +UM+), and by convention the words have two plusses where the phones have one. Here is an example:

    ++BREATH++   +BREATH+
    ++COUGH++    +COUGH+
    ++LAUGH++    +LAUGH+
    ++SMACK++    +SMACK+
    ++UH++       +UH+
    ++UHUM++     +UHUM+
    ++UM++       +UM+ 
    The decoder can insert noise words and silences wherever it thinks they occur, without their being explicitly specified in the Language Model.

    A noise word dictionary is not required, but when used, it can significantly improve recognition accuracy by handling nonspeech sections of the input.
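
    The doubled-plus convention above is mechanical, so a noisedict can be generated from a list of noise phones. A minimal sketch (make_noisedict is a hypothetical helper, not a Sphinx tool):

```python
def make_noisedict(noise_phones):
    """Emit noisedict lines: the word doubles the plusses of its phone."""
    lines = []
    for phone in noise_phones:
        word = "+" + phone + "+"   # +COUGH+ becomes ++COUGH++
        lines.append(f"{word}   {phone}")
    return "\n".join(lines)

print(make_noisedict(["+BREATH+", "+COUGH+", "+UM+"]))
```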

  6. What is an Acoustic Model?

    Acoustic models characterize how sound changes over time. Each phoneme or speech sound is modeled by a sequence of states and signal observation probabilities -- distributions of sounds that you might hear (observe) in that state.

    Sphinx2 is implemented using a 5-state phonetic model; each phone model has exactly five states. At run-time, frames of the input audio are compared to the distributions in the states to see which ones the sound could have come from -- which might be likely producers of the observed (heard) audio.
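
    The "compare a frame to each state's distribution" idea can be illustrated with a toy example (this is not Sphinx2 code; real models use mixtures over multidimensional feature vectors, and the made-up means and variances below are for illustration only):

```python
import math

def gauss_logpdf(x, mean, var):
    """Log-likelihood of a single feature value under a Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

# Hypothetical 5-state phone model: one (mean, variance) per state.
states = [(0.0, 1.0), (1.5, 0.5), (3.0, 1.0), (4.5, 0.5), (6.0, 1.0)]

frame = 1.4  # one observed feature value from the input audio
scores = [gauss_logpdf(frame, m, v) for m, v in states]
best = max(range(len(scores)), key=scores.__getitem__)
print(best)  # state 1 is the likeliest producer of this frame
```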

    Acoustic models that are matched to the conditions they will be used in perform best. That is to say, English acoustic models work best for English, and telephone models work best on the telephone. With SphinxTrain, you can train acoustic models for any language, task, or channel condition.

    Context-independent phones (CI phones) are modeled using data from many different contexts, while triphones take the left and right context into account in the modeling. Sphinx2 uses tied states to reduce the total number of states in the system. The default English acoustic models contain 6,000 states, or senones in historic CMU nomenclature, shared among all the 5-state phone models; each of these states shares its parameter weights with other states. If space is at a premium, you can train smaller models by allowing fewer states -- at some performance loss.

  7. What is a Language Model (LM) file?

    An LM file (often with a .lm extension) contains the language model, which describes the likelihood, probability, or penalty taken when a sequence of words is seen. Sphinx2 uses N-gram models, usually with N = 3, so they are trigram models covering sequences of three words. The probabilities of three-word, two-word, and one-word sequences are combined using back-off weights in order to assign a probability to any sequence of words.

    Many of the advances in speech recognition accuracy have come from language modeling. A language model tuned to a particular application, especially one with a small language, gives much better results than a mismatched one. You can see this if you run the "turtle" demo, which is made from sentences like "rotate right forty five degrees" and "go forward ten meters," and then start reading Alice in Wonderland. The system will do the best it can to fit Alice into the toy vocabulary and language model.

    There are two recommended ways to build a language model at the moment: using the web-based Language Modelling Tool (LMtool), where you simply upload a set of sentences and it creates everything you need, or the more elaborate, but more powerful, CMU/Cambridge Statistical Language Modeling Toolkit.

    There are also some example language models, including a class language model.

  8. My audio hardware only does 48KHz audio. How can I run Sphinx2?

    You can effectively subsample 48 kHz audio to 16 kHz by taking every third sample: change the code that sends audio to be processed so that it keeps only every third sample, and pass the resulting buffer as usual.

  9. How do I get more information when I run sphinx2?

    Use the -verbosity flag with a number (1-9) to increase the amount of output.

  10. Can I run this as a server?

    There is demo code for a sphinx2-server that runs the recognizer on a host machine, and opens a tcp socket for a single listener to connect. The example code returns the top N decodes.

    Start sphinx2-server on the host machine. The easiest way to get sane defaults is to copy the sphinx2-simple script, change the program from sphinx2-continuous to sphinx2-server, and execute the new script.

    Connect to the specified port (default 7027) with your client; you can use telnet for this:

       telnet localhost 7027
       sphinx2-client localhost 7027
    Once the client has connected, the server will ask you to hit a carriage return (<CR>) to start up; do so. The server says READY... and waits for audio input.

    Say "Go forward ten meters." That should get you a result on the server side, and a bunch of output to the client. The output is the N-best list -- the top N hypotheses of what was said. The last line is END_UTT on a line by itself.

    The server then blocks until it receives an ACK from the client, and starts listening for the next utterance as soon as the ACK arrives. sphinx2-client sends the ACK automatically, but if you are using telnet or writing your own client, you will need to send it yourself.
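
    If you are writing your own client, the read loop described above can be sketched like this (read_nbest is a hypothetical helper; it works on any iterable of decoded lines, so it is shown here without a live socket, and the exact ACK format is whatever the sphinx2-client/server example code uses):

```python
def read_nbest(lines):
    """Collect N-best hypothesis lines until the END_UTT terminator."""
    hyps = []
    for line in lines:
        line = line.strip()
        if line == "END_UTT":
            break
        if line:
            hyps.append(line)
    return hyps

# Stand-in for lines read from the server socket after one utterance.
server_output = [
    "GO FORWARD TEN METERS",
    "GO FORWARD TWO METERS",
    "END_UTT",
]
print(read_nbest(server_output))
```

    A real client would then write the ACK back on the socket before reading the next utterance's output.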