CMU Sphinx is a large vocabulary, speaker independent speech recognition codebase and suite of tools. The code is available for download and use. The main page is at http://www.speech.cs.cmu.edu/sphinx/, with CVS hosted on SourceForge.
sphinx2 is the real-time engine, and sphinx3 is slower but potentially more accurate. Sphinx2 is "semicontinuous" (uses tied mixtures), and Sphinx3 can use fully continuous observation densities (untied, so that each state has its own distribution statistics). Sphinx3 is not as fully developed for portability, but the codebase is cleaner than that of Sphinx2.
Use sphinx2 for speed, and sphinx3 if you can afford to take longer than real-time to decode. Both are available under CVS.
The phoneset is the list of 'phones', or speech sounds, that the engine can recognize. When you build acoustic models and pronunciations for words, they can be made to use any set of units, but they must be the same units. The acoustic models will search for the speech sounds (phones), and the word pronunciations are also given in terms of the phones in the phone set.
The default phoneset for American English that comes with Sphinx2 contains
the following phones:
AA AE AH AO AW AX AXR AY B CH D DH DX
EH ER EY F G HH IH IX IY JH K L M N NG
OW OY P R S SH T TH UH UW V W Y Z ZH
There is also the silence phone, SIL, and a number of noise phones:
+BREATH+ +COUGH+ +LAUGH+ +SMACK+ +UH+ +UHUM+ +UM+.
By default, Sphinx2 uses a slightly different phoneset for American English
than is found in the CMU Dictionary. Internally, Sphinx2 does not use
lexical stress; you need to run the CMU Dictionary through the
stress2sphinx utility included in the Sphinx2 release
to convert it to the default phone set.
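As a rough illustration, much of that conversion amounts to stripping the stress digits that CMUdict attaches to its vowels. The sketch below shows only that step; the real stress2sphinx utility also remaps certain phones, so do not treat this as a substitute for it:

```python
import re

def strip_stress(pron):
    """Remove CMUdict stress digits (0/1/2) from each phone.

    Simplified sketch only: stress2sphinx also remaps some phones,
    which this function does not attempt.
    """
    return " ".join(re.sub(r"[012]$", "", p) for p in pron.split())

print(strip_stress("HH AH0 L OW1"))  # -> HH AH L OW
```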
The decoder needs to know the pronunciations of words, and the
dictionary file (often with a
.dic extension) is
a list of words, each followed by its sequence of phones. If a word can
be pronounced more than one way, each additional pronunciation gets its
own entry, formed by affixing a number in parentheses after the word:
ELEVEN AX L EH V AX N
ELEVEN(2) IY L EH V AX N
EXIT EH G Z AX T
EXIT(2) EH K S AX T
EXPLORE IX K S P L AO R
FIFTEEN F IH F T IY N
Here, there are two pronunciations for the word 'EXIT', as there are for 'ELEVEN'. Note that the first pronunciation does not have a number after it; make sure there is an unnumbered pronunciation whenever there are numbered ones.
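For scripting around such files, a small parser is easy to write. The helper below is hypothetical (not part of the Sphinx2 distribution); it folds the numbered variants back onto their base word:

```python
import re
from collections import defaultdict

def load_dict(lines):
    """Parse .dic lines into {word: [pronunciation, ...]}.

    Variant entries like EXIT(2) are folded onto the base word EXIT.
    """
    prons = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, phones = line.split(None, 1)
        base = re.sub(r"\(\d+\)$", "", word)  # drop the (2), (3) suffix
        prons[base].append(phones.split())
    return dict(prons)

entries = [
    "ELEVEN AX L EH V AX N",
    "ELEVEN(2) IY L EH V AX N",
    "EXIT EH G Z AX T",
    "EXIT(2) EH K S AX T",
]
d = load_dict(entries)
print(len(d["EXIT"]))  # -> 2
```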
A noise word dictionary (
noisedict) is a set of words
made up of noise phones (like
+UM+), and by convention
the words have two plusses where the phones have one. Here is an example:
++BREATH++ +BREATH+
++COUGH++ +COUGH+
++LAUGH++ +LAUGH+
++SMACK++ +SMACK+
++UH++ +UH+
++UHUM++ +UHUM+
++UM++ +UM+
The decoder can insert noise words and silences where it thinks they occur, without explicit specification in the Language Model.
A noise word dictionary is not required, but when used, it can significantly improve recognition accuracy by handling nonspeech sections of the input.
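The two-plus convention makes it mechanical to generate a noisedict entry from a noise phone; a minimal sketch:

```python
def noise_word(phone):
    """Map a noise phone like +UM+ to its noisedict word ++UM++.

    Follows the convention that the word has two plusses where the
    phone has one.
    """
    assert phone.startswith("+") and phone.endswith("+")
    return "+" + phone + "+"

for p in ["+BREATH+", "+COUGH+", "+UM+"]:
    print(noise_word(p), p)  # e.g. "++UM++ +UM+"
```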
Acoustic models characterize how sound changes over time. Each phoneme or speech sound is modeled by a sequence of states and signal observation probabilities -- distributions of sounds that you might hear (observe) in that state.
Sphinx2 is implemented using a 5-state phonetic model; each phone model has exactly five states. At run-time, frames of the input audio are compared to the distributions in the states to see which ones the sound could have come from -- which might be likely producers of the observed (heard) audio.
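The frame-to-state comparison can be illustrated by scoring one feature frame against per-state Gaussian distributions. The real engine uses tied mixture densities and a Viterbi search over state sequences, so this is only a sketch of the underlying idea, with made-up numbers:

```python
import math

def log_gauss(frame, mean, var):
    """Log-likelihood of a feature frame under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )

# Toy 5-state phone model: one (mean, variance) pair per state.
states = [([0.0, 0.0], [1.0, 1.0]),
          ([1.0, 1.0], [1.0, 1.0]),
          ([2.0, 2.0], [1.0, 1.0]),
          ([1.0, 1.0], [1.0, 1.0]),
          ([0.0, 0.0], [1.0, 1.0])]

frame = [1.1, 0.9]  # one observed feature frame
scores = [log_gauss(frame, m, v) for m, v in states]
best = scores.index(max(scores))
print(best)  # -> 1 (states 1 and 3 tie; index() returns the first)
```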
Acoustic models that are matched to the conditions they will be used in perform best. That is to say, English acoustic models work best for English, and telephone models work best on the telephone. With SphinxTrain, you can train acoustic models for any language, task, or channel condition.
Context-independent phones (CI phones) are modeled using data from many different contexts, while triphones take left and right context into account in the modeling. Sphinx2 uses tied states to reduce the total number of states in the system. The default English acoustic models contain 6,000 states, or senones in historic CMU nomenclature, that are shared among all the 5-state phone models. If space is at a premium, you can train smaller models by allowing fewer states -- at some performance loss.
An LM file (often with a
.lm extension) is a language
model. The language model describes the likelihood, probability, or
penalty assigned when a sequence of words is seen. Sphinx2 uses
N-gram models, and usually N is 3, so they are trigram models:
statistics over sequences of three words. All the sequences of three words,
two words, and one word are combined together using back-off weights
in order to assign probabilities to sequences of words.
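The back-off scheme can be sketched as: use the trigram probability if that trigram was seen in training, otherwise fall back to the bigram estimate scaled by the back-off weight of the two-word history. The probabilities and weights below are made up for illustration:

```python
def backoff_prob(w1, w2, w3, trigrams, bigrams, bow):
    """Trigram probability with back-off to the bigram estimate."""
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    # Unseen trigram: scale the bigram estimate by the back-off
    # weight of the (w1, w2) history.
    return bow.get((w1, w2), 1.0) * bigrams.get((w2, w3), 1e-6)

# Toy model with invented numbers.
trigrams = {("go", "forward", "ten"): 0.5}
bigrams = {("forward", "ten"): 0.2, ("forward", "five"): 0.1}
bow = {("go", "forward"): 0.4}

print(backoff_prob("go", "forward", "ten", trigrams, bigrams, bow))   # -> 0.5
print(backoff_prob("go", "forward", "five", trigrams, bigrams, bow))  # 0.4 * 0.1
```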
Many of the advances in accuracy in speech recognition have come from language modeling. A language model tuned to a particular application, especially when the language is small, gives much better results than a mismatched one. You can see this if you run the "turtle" demo, which is built from sentences like "rotate right forty five degrees" and "go forward ten meters," and then start reading Alice in Wonderland. The system will do the best it can to fit Alice into the toy vocabulary and language model.
There are two recommended ways to build a language model at the moment: the web-based Language Modelling Tool (LMtool), where you simply upload a set of sentences and it creates everything you need, or the more elaborate but more powerful CMU/Cambridge Statistical Language Modeling Toolkit.
There are also some example language models, including a class language model.
You can effectively subsample 48 kHz audio to 16 kHz by taking every third sample: change the code that sends the audio to be processed to use every third sample, and pass the buffer as usual.
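A minimal sketch of that decimation (note that a proper resampler would low-pass filter first to avoid aliasing; taking every third sample is the quick-and-dirty version described above):

```python
def downsample_48k_to_16k(samples):
    """Crude 48 kHz -> 16 kHz decimation: keep every third sample.

    A proper resampler would low-pass filter before decimating;
    this is the quick approach the text describes.
    """
    return samples[::3]

buf = list(range(12))
print(downsample_48k_to_16k(buf))  # -> [0, 3, 6, 9]
```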
Run the decoder with
-verbosity and a number (1-9) to increase the
amount of output.
There is demo code for a
sphinx2-server that runs the recognizer
on a host machine and opens a TCP socket for a single listener
to connect. The example code returns the top N decodes.
Run
sphinx2-server on the host machine. The easiest
way to get sane defaults for this is to copy the demo
script, change the program it runs to
sphinx2-server, and then execute the new script.
Connect to the specified port (default 7027) with your client; you
can use telnet for this:
telnet localhost 7027
or
sphinx2-client localhost 7027
Once the client has connected, the server will ask you to hit a carriage return (<CR>) to start up. You can do so. The server says
READY...
and waits for audio input.
Say "Go forward ten meters." That should get you a result on the
server side, and a bunch of output to the client. The output is the
N-best list -- the top N hypotheses of what was said. The
last line is
END_UTT on a line by itself.
The server now blocks until it gets an
ACK from the
client. It will start listening for one more utterance as soon
as it receives the ACK.
sphinx2-client sends it automatically, but if you are using telnet
or writing your own client, you will need to send it yourself.
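A minimal client for this protocol might look like the sketch below. The exact wire format (line endings and the spelling of the ACK) is assumed from the description above, so check it against the sphinx2-client source before relying on it:

```python
import socket

HOST, PORT = "localhost", 7027  # default port from the text

def read_utterance(lines):
    """Collect hypothesis lines until the END_UTT terminator."""
    hyps = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "END_UTT":
            break
        hyps.append(line)
    return hyps

def run_client():
    """Connect, start the server with a <CR>, read one N-best list,
    then send the ACK so the server listens for the next utterance."""
    with socket.create_connection((HOST, PORT)) as sock:
        f = sock.makefile("rw", newline="\n")
        f.write("\n")      # the carriage return the server asks for
        f.flush()
        hyps = read_utterance(f)
        f.write("ACK\n")   # assumed ACK format; verify against sphinx2-client
        f.flush()
        return hyps
```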