
CMU Sphinx

CMU Sphinx is a set of speech recognition development libraries and tools that can be linked in to speech-enable applications. The libraries and sample code can be used for both research and commercial purposes; for instance, Sphinx2 can be used as a telephone-based recognizer, which can be used in a dialog system. Sphinx3 is a slower, more accurate decoder.

Try a System

If you'd like to try out an application that uses CMU Sphinx, call the Communicator, an experimental system that helps you plan air travel. You can reach it at the toll-free number 1-877-CMU-PLAN (1-877-268-7526) or at +1 412 268 1084.

The system will provide real flight information. More complete information about the Communicator can be found here. The system may be sensitive to loud background noises, especially over cell phones.

Note that your call will be recorded for development purposes and may be shared with other researchers. We don't have a policy set up yet for placing such recordings into a publicly available database, and so there is no guarantee that this data will become publicly available -- though we're motivated to set that up in the future.

The Sphinx Group

The Sphinx Group at Carnegie Mellon University is committed to releasing the long-time, DARPA-funded Sphinx projects widely, in order to stimulate the creation of speech-using tools and applications, and to advance the state of the art both directly in speech recognition, as well as in related areas including dialog systems and speech synthesis.

The licensing terms for the Sphinx engines and tools are derived from BSD, and based, in particular, upon the license for the Apache web server. There is no restriction against commercial use or redistribution. (License terms for CMU Sphinx)

Automatic Speech Recognition

The task of an Automatic Speech Recognition (ASR) engine is to take audio input and turn it into a written representation. This is sometimes called "decoding," because you can think of the audio as an input channel and the written form, whether it is a sequence of words or speech sounds, as the output channel, with the decoder working between them. Once the speech has been converted into some text representation, it can be used to send events to an application.

The ability to speak, and the ability to listen, co-evolved in humans; the ear and the mouth are tuned to each other. A talker uses speech sounds in order to be understood, and a listener draws on expectations about speech when they listen.

Linguistic theory breaks speech down into a set of levels at which language can be described: phonetics, phonology, morphology, syntax, semantics, pragmatics. It is the first four of these that drive most speech recognition systems, though the division between phonetics and phonology is often ignored, and the speech sounds are referred to as "phones."

Speech recognition systems draw on these sources of knowledge to find the best combined result: the sequence of sounds and words that best harmonizes the phonetic models, the word models (from a pronunciation dictionary), and a "language model" or grammar that predicts which word sequences are likely (or possible at all).

Most modern ASR systems use statistical methods to train models for each of these levels, with linguistically informed choices. Hidden Markov Models (HMMs), decision trees, and other statistical and machine learning techniques are used throughout the system, and in speech science as a whole.

Many systems do not explicitly use semantic information in the recognizer, but leave that to a parser. Syntax may also be weakly modeled, with a word-sequence model (N-gram model) in the engine that does not directly capture long-distance relationships or effects such as subject-verb agreement.
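To make the idea of a word-sequence model concrete, here is a minimal bigram sketch in Python. This is a toy for illustration only, not the engine's actual language-model code, and the tiny corpus is made up:

```python
from collections import defaultdict

def train_bigrams(sentences):
    """Count adjacent word pairs and convert the counts
    to conditional probabilities P(word | previous word)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        # sentence-start and sentence-end markers
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    probs = {}
    for prev, following in counts.items():
        total = sum(following.values())
        probs[prev] = {w: c / total for w, c in following.items()}
    return probs

model = train_bigrams(["show me flights", "show me fares"])
# In this toy corpus, "me" always follows "show",
# while "flights" and "fares" each follow "me" half the time.
```

A real engine uses higher-order N-grams with smoothing for unseen word pairs, but the basic shape -- probabilities of a word given its recent predecessors -- is the same.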

A recognizer can often send the N-best decodes -- an ordered list of the best output text strings -- to the parser, which can then find which ones are syntactically or semantically consistent.

ASR is one component technology needed to build a speech-interactive application. It may be used under the control of a VoiceXML interpreter, or embedded in a more general speech-interactive system. After the text comes in, it may be passed to a parser, which turns the input text strings into a representation of the semantics of the message. A dialog manager can then take that meaning and perform some action in response, sending its output intentions to a language generation component, which in turn passes them on to a speech synthesizer.

From Starting Out to Standing Up!

Sphinx is a large, complex codebase that was originally created using "middle-out" development for research purposes. While it has grown into the basis of many commercial systems, and the free release has led to a lot of advances, it can still be intimidating to take the first steps.

The example code is a good place to start, but becoming fluent will take the experience of building a real task with the toolkit: linking the libraries into your own applications, going beyond the simple example apps (such as sphinx2-continuous), and doing new and interesting things with speech technology.

Sphinx2

Sphinx2 is a fast, large vocabulary speaker independent recognition system for continuous speech developed at Carnegie Mellon University. It was released under a BSD-style License, which makes it free for both commercial and non-commercial use. The license satisfies the Open Source definition.

Download

You can download sphinx2 from SourceForge. The CVS tree and the releases are available there.

Note: The tarball releases lag behind the CVS tree. Developers, please check the latest version out of CVS. The following commands will check it out. Just press the Enter key when the login asks for your password.

cvs -d:pserver:anonymous@cvs.cmusphinx.sourceforge.net:/cvsroot/cmusphinx login
cvs -z3 -d:pserver:anonymous@cvs.cmusphinx.sourceforge.net:/cvsroot/cmusphinx co sphinx2
Updates from within the module's directory do not need the -d parameter.

Bug Tracking and Discussion Groups

The SourceForge site also hosts fora for bug tracking and discussion. Please go there for help, questions, to report bugs, and to see the latest work. The work is currently pre-version 1.0, so there is a lot yet to be done.

Building Sphinx2

Under construction. This section will contain build instructions for Linux, Windows, Solaris, and Alpha platforms, with more as they become available.

Using Sphinx2

Running the examples

Under construction. This section will describe the example code.

Build a Language Model

You can build your own language models for use with Sphinx using the Language Modelling Tool web page. Just make a simple text file containing sentences and words that you would like to have recognized, and upload the file with your browser. You will see the progress as it builds your files, and then it will present you with the final products.

Note: The LM Tool still builds for the OLD phoneset right now, with the deletable stops and the TS phone. This will be updated shortly. If you want to use it with the current Sphinx2 CVS version, just edit the resulting .dic file to remove the D from PD TD KD BD DD GD (to make them P T K B D G), and split the phone TS into T and S.
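That edit can be done mechanically. Below is a hypothetical Python helper (a sketch, not part of the Sphinx release), assuming the usual .dic layout of a word followed by its phones on each line:

```python
def update_phoneset(line):
    """Map a .dic entry from the old phone set to the current one:
    drop the D from the deletable stops (PD -> P, TD -> T, ...)
    and split the TS phone into T S."""
    word, *phones = line.split()
    out = []
    for p in phones:
        if p in ("PD", "TD", "KD", "BD", "DD", "GD"):
            out.append(p[0])      # PD -> P, etc.
        elif p == "TS":
            out.extend(["T", "S"])
        else:
            out.append(p)
    return " ".join([word] + out)

# e.g. update_phoneset("CATS K AE TS") -> "CATS K AE T S"
```

Applying this to every line of the .dic file (and writing the result back out) should produce a dictionary usable with the current CVS version.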

You may also want to use the CMU-Cambridge Statistical Language Modeling Toolkit to build models, and get word pronunciations from the CMU Dictionary.

Note: By default, Sphinx2 uses a slightly different phone set for American English than is in the CMU Dictionary. Internally, Sphinx2 does not use lexical stress; you need to run the CMU Dictionary through the stress2sphinx utility included in the Sphinx2 release to convert it to the default phone set.
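For illustration, the stress-removal part of that conversion amounts to stripping the stress digits the CMU Dictionary attaches to its vowels (AH0, EY1, ...). The sketch below shows only that step, not everything the stress2sphinx utility does:

```python
def strip_stress(pron):
    """Remove lexical stress digits from a CMU Dictionary
    pronunciation string, e.g. "HH AH0 L OW1" -> "HH AH L OW"."""
    return " ".join(p.rstrip("0123456789") for p in pron.split())

# strip_stress("HH AH0 L OW1") -> "HH AH L OW"
```

Use stress2sphinx itself for real conversions, since it also handles the other phone-set differences.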

Acoustic Models

There is a set of models for American English that comes with Sphinx2, and there are also some other example acoustic models available.

You can use SphinxTrain to build acoustic models for any language and any channel conditions.

How Sphinx works

needs to be fleshed out

Sphinx uses Hidden Markov Models to find the best path through the combined constraints of the acoustic, lexical, and language model, given the input audio.

Internally, Sphinx2 uses a fixed topology of 5-state phone models with skips. Historically, these subphonetic states have been called senones at CMU.

[ picture needed ]

Each of these 5-state models makes up a phone model. At run-time, the system compares a frame of audio input to the distributions in each senone, and finds the best path using Viterbi decoding. Since the realization of a phone is influenced by its neighbors (often called coarticulation), one phone to the left and one to the right are used to describe this context. A triphone is a phone with the left and right context named. It is not three phones -- it is one context-dependent phone.
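The search itself can be pictured with a toy Viterbi pass over per-frame state scores. This sketch illustrates the algorithm only -- it is not Sphinx's implementation, and the two-state model and scores below are made up:

```python
import math

def viterbi(obs_loglikes, trans):
    """Find the best state path given per-frame log-likelihoods
    obs_loglikes[t][s] and log transition scores trans[s][s2]."""
    n_states = len(obs_loglikes[0])
    best = list(obs_loglikes[0])   # best score ending in each state
    back = []                      # backpointers, one list per frame
    for frame in obs_loglikes[1:]:
        ptr, new = [], []
        for s in range(n_states):
            # best predecessor state for state s at this frame
            prev = max(range(n_states), key=lambda p: best[p] + trans[p][s])
            new.append(best[prev] + trans[prev][s] + frame[s])
            ptr.append(prev)
        best = new
        back.append(ptr)
    # trace back from the best final state
    state = max(range(n_states), key=lambda s: best[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))

# toy left-to-right model: two states, three frames of "audio"
L = math.log
obs = [[L(0.9), L(0.1)], [L(0.5), L(0.5)], [L(0.1), L(0.9)]]
trans = [[L(0.5), L(0.5)], [L(1e-10), 0.0]]
# viterbi(obs, trans) -> [0, 1, 1]
```

The real decoder does this over many thousands of senones at once, with aggressive pruning of unlikely paths, but the dynamic-programming core is the same.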

A pronunciation dictionary is used to find the relevant sequences of phones, and the language model is used to find the probabilities of sequences of words.

History

Lots of prior history needs to be filled in here

The Sphinx Group has been supported for many years by funding from the Defense Advanced Research Projects Agency, and the recognition engines to be released are those that the group used for the various DARPA projects and their respective evaluations.

In early 2000, the Sphinx Group released Sphinx2, a real-time, large vocabulary, speaker independent speech recognition system as free software under the Apache-style license. Sphinx2 is the engine used in the Sphinx Group's dialog systems that require real-time speech interaction, such as the implementation of the DARPA Communicator project, a many-turn dialog for travel planning. The pre-made acoustic models include American English and French in full bandwidth, and telephone-bandwidth Communicator models; Sphinx2 is a decent candidate for handheld, portable, and embedded devices, and telephone and desktop systems that require short response times.

Resources

Language Modeling

Acoustic Modeling

SphinxTrain is a suite of training scripts and tools for training acoustic models for CMU Sphinx. It was first publicly released on June 7, 2001. With this contribution, people should be able to build models for any language and condition for which there is enough acoustic data.
Kevin Lenzo