The system will provide real flight information. More complete information about the Communicator can be found here. The system may be sensitive to loud background noises, especially over cell phones.
Note that your call will be recorded for development purposes and may be shared with other researchers. We don't have a policy set up yet for placing such recordings into a publicly available database, and so there is no guarantee that this data will become publicly available -- though we're motivated to set that up in the future.
The licensing terms for the Sphinx engines and tools are derived from BSD, and based, in particular, upon the license for the Apache web server. There is no restriction against commercial use or redistribution. (License terms for CMU Sphinx)
The ability to speak, and the ability to listen, co-evolved in humans; the ear and the mouth are tuned to each other. A talker uses speech sounds in order to be understood, and a listener draws on expectations about speech when they listen.
Linguistic theory breaks speech down into a set of levels at which language can be described: phonetics, phonology, morphonology, syntax, semantics, pragmatics. It is the first four of these that drive most speech recognition systems, though the division between phonetics and phonology is often ignored, and the speech sounds are referred to as "phones."
Speech recognition systems draw on these sources of knowledge and search for the sequence of sounds and words that best harmonizes the phonetic models, the word models (from a pronunciation dictionary), and a "language model" or grammar that predicts which word relationships are likely (or possible at all).
Most modern ASR systems use statistical methods to train models for each of these levels, with linguistically informed choices. Hidden Markov Models (HMMs), decision trees, and other statistical and machine learning techniques are used throughout the system, and in speech science as a whole.
Many systems do not explicitly use semantic information in the recognizer, but leave that to a parser. Syntax may also be weakly modeled, with a word sequence model (N-gram model) in the engine that does not capture long distance relationships, or effects such as subject-verb agreement directly.
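As a rough illustration of why a short word sequence model misses long-distance effects, the sketch below estimates bigram probabilities from a tiny made-up corpus. The corpus, the function names, and the unsmoothed maximum-likelihood estimate are all assumptions chosen for brevity; this is not how the Sphinx2 engine stores or smooths its language model.

```python
from collections import Counter

def train_bigram(sentences):
    """Return a function giving unsmoothed P(word | previous word)."""
    histories = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        histories.update(words[:-1])               # every word that has a successor
        bigrams.update(zip(words[:-1], words[1:]))

    def prob(prev, word):
        if histories[prev] == 0:
            return 0.0
        return bigrams[(prev, word)] / histories[prev]

    return prob

# Toy corpus (hypothetical): subject and verb are separated by "to boston".
corpus = [
    "the flight to boston leaves at noon",
    "the flights to boston leave at noon",
]
prob = train_bigram(corpus)

def sentence_prob(sentence):
    """Multiply bigram probabilities across the sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        p *= prob(prev, word)
    return p

good = sentence_prob("the flight to boston leaves at noon")
bad = sentence_prob("the flight to boston leave at noon")
```

In this toy setup `good` and `bad` come out equal, because the bigram model only sees "boston leaves" and "boston leave" and has no view of the subject three words earlier.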
A recognizer can often send the N-best decodes -- as an ordered list of the best output text strings -- to the parser, and it can then find which ones are syntactically or semantically consistent.
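The idea can be sketched as follows, assuming a hypothetical pattern-based stand-in for the parser and a made-up N-best list. None of these names come from the Sphinx API; a real parser would be far richer than one regular expression.

```python
import re

# Hypothetical stand-in for a parser: accepts "flights from X to Y" requests.
PATTERN = re.compile(r"^(show me )?flights from (\w+) to (\w+)$")

def parse(text):
    """Return a small semantic frame, or None if the text does not parse."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return {"origin": match.group(2), "destination": match.group(3)}

def first_parsable(nbest):
    """Walk the N-best list in recognizer order; return the first that parses."""
    for text in nbest:
        frame = parse(text)
        if frame is not None:
            return text, frame
    return None, None

# Made-up N-best list, ordered best-first as the recognizer would emit it.
nbest = [
    "flights from boston two denver",   # top-ranked, but does not parse
    "flights from boston to denver",
    "flight from austin to denver",
]
text, frame = first_parsable(nbest)
```

Here the second hypothesis is chosen because the top-ranked string fails to parse. A fuller system might combine the recognizer score with a parse score rather than taking the first match.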
ASR is one component technology needed to build a speech-interactive application. It may be used under the control of a VoiceXML interpreter, or integrated into a more general speech-interactive system. After the text comes in, it may be passed to a parser that turns the input text strings into a representation of the semantics of the message. A dialog manager could then take that meaning input and perform some action in response, sending its output intentions to a language generation component, which would in turn pass this on to a speech synthesizer.
The example code given is a good place to start, but real familiarity comes from the experience of building a real task using the toolkit: linking the libraries into your own applications, going beyond the simple example apps (such as sphinx2-continuous), and doing new and interesting things with speech technology.
Note: The tarball releases lag behind the CVS tree. Developers, please check the latest version out of CVS. The following commands will check it out. Just press the Enter key when the login asks for your password.
cvs -d:pserver:anonymous@cvs.cmusphinx.sourceforge.net:/cvsroot/cmusphinx login
cvs -z3 -d:pserver:anonymous@cvs.cmusphinx.sourceforge.net:/cvsroot/cmusphinx co sphinx2
Updates from within the module's directory do not need the -d parameter.
Note: The LM Tool still builds for the OLD phoneset right now, with the deletable stops and the TS phone. This will be updated shortly. If you want to use it with the current Sphinx2 CVS version, just edit the resulting .dic file to remove the D from PD TD KD BD DD GD (to make them P T K B D G), and split the phone TS into T and S.
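The manual edit described in the note could be scripted along these lines. This is a minimal sketch that assumes each .dic line is a word followed by whitespace-separated phones; the sample entries in the usage lines are made up, and the tab separator in the output is an assumption rather than a documented requirement.

```python
# Old-phoneset symbols named in the note, mapped to their replacements.
OLD_TO_NEW = {
    "PD": ["P"], "TD": ["T"], "KD": ["K"],
    "BD": ["B"], "DD": ["D"], "GD": ["G"],
    "TS": ["T", "S"],                     # TS is split into two phones
}

def convert_line(line):
    """Rewrite one dictionary line; assumes a 'WORD PHONE PHONE ...' layout."""
    fields = line.split()
    if not fields:
        return line                       # leave blank lines untouched
    word, phones = fields[0], fields[1:]
    new_phones = []
    for phone in phones:
        new_phones.extend(OLD_TO_NEW.get(phone, [phone]))
    return word + "\t" + " ".join(new_phones)

def convert_dictionary(lines):
    """Apply convert_line to every line of a dictionary."""
    return [convert_line(line) for line in lines]

# Made-up entries for illustration only.
converted = convert_dictionary(["CATS K AE TS", "BAD B AE DD"])
```

Only the phone fields are rewritten, so a word that happens to be spelled like one of the old phone symbols is left alone.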
You may also want to use the CMU-Cambridge Statistical Language Modeling Toolkit to build models, and get word pronunciations from the CMU Dictionary.
Note: By default, Sphinx2 uses a slightly different phone set for American English than is in the CMU Dictionary. Internally, Sphinx2 does not use lexical stress; you need to run the CMU Dictionary through the stress2sphinx utility included in the Sphinx2 release to convert it to the default phone set.
You can use SphinxTrain to build acoustic models for any language and any channel conditions.
Sphinx uses Hidden Markov Models to find the best path through the combined constraints of the acoustic, lexical, and language model, given the input audio.
Internally, Sphinx2 uses a fixed topology of 5-state phone models with skips. Historically, these subphonetic states have been called senones at CMU.
Each of these 5-state models makes up a phone model. At run-time, the system compares a frame of audio input to the distributions in each senone, and finds the best path using Viterbi decoding. Since the realization of a phone is influenced by its neighbors (often called coarticulation), one phone to the left and one to the right are used to describe this context. A triphone is a phone with the left and right context named. It is not three phones -- it is one context-dependent phone.
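To make the best-path search concrete, here is a minimal sketch of Viterbi decoding over a single 5-state left-to-right model with skip transitions. The transition probabilities and the toy emission scores are invented for illustration (and are not normalized at the final states); the real engine scores frames against trained senone distributions and searches over many models at once.

```python
import math

NUM_STATES = 5
NEG_INF = float("-inf")

def make_transitions(p_stay=0.5, p_next=0.4, p_skip=0.1):
    """Log transition matrix: self-loop, next state, and skip-one-state arcs."""
    trans = [[NEG_INF] * NUM_STATES for _ in range(NUM_STATES)]
    for s in range(NUM_STATES):
        trans[s][s] = math.log(p_stay)
        if s + 1 < NUM_STATES:
            trans[s][s + 1] = math.log(p_next)
        if s + 2 < NUM_STATES:
            trans[s][s + 2] = math.log(p_skip)
    return trans

def toy_emissions(preferred, p_match=0.9):
    """Stand-in for acoustic scores: frame t favors state preferred[t]."""
    p_other = (1.0 - p_match) / (NUM_STATES - 1)
    return [[math.log(p_match if s == want else p_other)
             for s in range(NUM_STATES)] for want in preferred]

def viterbi(log_emit, trans):
    """Best state path that starts in state 0 and ends in the last state."""
    num_frames = len(log_emit)
    score = [[NEG_INF] * NUM_STATES for _ in range(num_frames)]
    back = [[0] * NUM_STATES for _ in range(num_frames)]
    score[0][0] = log_emit[0][0]
    for t in range(1, num_frames):
        for s in range(NUM_STATES):
            # Pick the predecessor with the best accumulated log score.
            best_prev = max(range(NUM_STATES),
                            key=lambda p: score[t - 1][p] + trans[p][s])
            score[t][s] = (score[t - 1][best_prev] + trans[best_prev][s]
                           + log_emit[t][s])
            back[t][s] = best_prev
    # Trace back from the final state at the last frame.
    path = [NUM_STATES - 1]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, score[-1][NUM_STATES - 1]

preferred = [0, 0, 1, 3, 3, 4]            # six frames; state 2 is skipped
path, score = viterbi(toy_emissions(preferred), make_transitions())
```

With these toy numbers the decoder follows the favored states, including the skip from state 1 to state 3. Working in log probabilities keeps long products from underflowing.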
A pronunciation dictionary is used to find the relevant sequences of phones, and the language model is used to find the probabilities of sequences of words.
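The dictionary lookup and the naming of context-dependent phones can be sketched as below. The two-word dictionary, the SIL boundary symbol, the use of cross-word context, and the base(left,right) notation are assumptions made for this example; they are not taken from the Sphinx2 model files.

```python
# Made-up two-word dictionary; real entries would come from a .dic file.
DICTIONARY = {
    "GO": ["G", "OW"],
    "HOME": ["HH", "OW", "M"],
}

def words_to_phones(words):
    """Concatenate the dictionary pronunciations of a word sequence."""
    phones = []
    for word in words:
        phones.extend(DICTIONARY[word.upper()])
    return phones

def phones_to_triphones(phones, boundary="SIL"):
    """Label each phone with its left and right neighbor.

    One triphone is produced per phone, so the output has the same
    length as the input.
    """
    padded = [boundary] + phones + [boundary]
    return [
        f"{padded[i]}({padded[i - 1]},{padded[i + 1]})"
        for i in range(1, len(padded) - 1)
    ]

phones = words_to_phones(["go", "home"])
triphones = phones_to_triphones(phones)
```

Five phones yield five triphones, which matches the point that a triphone is one phone with its context named. Note also that the two OW phones get different labels, OW(G,HH) and OW(HH,M), because their neighbors differ.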
The Sphinx Group has been supported for many years by funding from the Defense Advanced Research Projects Agency, and the recognition engines to be released are those that the group used for the various DARPA projects and their respective evaluations.
In early 2000, the Sphinx Group released Sphinx2, a real-time, large vocabulary, speaker independent speech recognition system as free software under the Apache-style license. Sphinx2 is the engine used in the Sphinx Group's dialog systems that require real-time speech interaction, such as the implementation of the DARPA Communicator project, a many-turn dialog for travel planning. The pre-made acoustic models include American English and French in full bandwidth, and telephone-bandwidth Communicator models; Sphinx2 is a decent candidate for handheld, portable, and embedded devices, and telephone and desktop systems that require short response times.
You can build your own language models for use with Sphinx using the Language Modelling Tool web page. Just make a simple text file containing sentences and words that you would like to have recognized, and upload the file with your browser. You will see the progress as it builds your files, and then it will present you with the final products.
The CMU/Cambridge Statistical Language Modeling Toolkit is a powerful suite of tools for building language models from text corpora.