Note: This file must be read by anyone who intends to use the models in this package for recognition. The specifications provided here must be matched exactly in the user's setup to prevent recognition failures.

--------------------------------------------------------------------

The models have been trained using 140 hours of 1996 and 1997 hub4 training data. The phoneset for which models have been provided is that of the dictionary cmudict_0.6d, available through the CMU web pages. The dictionary has been used without stress markers, resulting in 40 phones, including the silence phone, SIL. Adding stress markers degrades performance by about 5% relative.

The models are 3-state within-word and cross-word triphone HMMs with no skips permitted between states. There are two sets of models in this package, comprising 4000 and 6000 senones respectively. They are placed in directories named 4000senones and 6000senones respectively.

For each set of models, there are two subsets of models, one for the feature set specified in SPHINX as "1s_c_d_dd" and the other for the set specified as "s3_1x39". Both subsets of models can be used with the s3.0 decoder. Only the "1s_c_d_dd" subset can be used with the s3.3 decoder. The models are in sub-directories named after the corresponding feature set.

The models have been trained with Mel-frequency cepstra (MFC) vectors derived from the hub4 data. Each vector is composed of 13 cepstral coefficients, 13 delta cepstra and 13 double delta cepstra. Each vector is thus 39-dimensional. The correct SPHINX name for the vectors used is "1s_c_d_dd", and this must be specified to the decoder(s) for correct usage of the acoustic models provided.
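The decoder and feature-set pairings described above can be checked before a run with a small lookup, sketched here in Python. The function name, the dictionary names and the decoder labels used as keys are hypothetical and are not part of the SPHINX tools. The sketch also assumes that each sub-directory name is exactly the feature-set name; only the pairings and the senone directory names are taken from this file.

```python
# Decoder / feature-set pairings as stated in this file.
# All names below are illustrative assumptions, not SPHINX identifiers.
SUPPORTED_FEATURES = {
    "s3.0": {"1s_c_d_dd", "s3_1x39"},
    "s3.3": {"1s_c_d_dd"},
}

# Top-level directories named in this file.
SENONE_DIRS = {4000: "4000senones", 6000: "6000senones"}


def model_subdir(decoder, senones, feature):
    """Return the relative model directory for a supported combination.

    Raises ValueError if the decoder, senone count or feature set
    does not match the combinations described in this file.
    """
    if decoder not in SUPPORTED_FEATURES:
        raise ValueError("unknown decoder: %s" % decoder)
    if senones not in SENONE_DIRS:
        raise ValueError("no model set with %s senones" % senones)
    if feature not in SUPPORTED_FEATURES[decoder]:
        raise ValueError(
            "feature set %s cannot be used with decoder %s" % (feature, decoder)
        )
    # Assumption: the sub-directory is named exactly after the feature set.
    return SENONE_DIRS[senones] + "/" + feature
```

For example, model_subdir("s3.3", 4000, "1s_c_d_dd") yields a path under 4000senones, while asking for "s3_1x39" with "s3.3" raises an error instead of silently selecting models that the decoder cannot use.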
Other specifications for the feature set are as follows:

preemphasis factor                : 0.970
sampling rate                     : 16000.000 Hz
frame rate                        : 100.000 frames/sec
Hamming window length             : 0.0256 sec
size of FFT                       : 512 samples
number of Mel filters             : 40
lower edge of filter bank         : 133.33334 Hz
upper edge of filter bank         : 6855.49756 Hz
number of MFCC coefficients/frame : 13
dither                            : added

One set of quantized models has been provided with each set of full models. The number of codewords used in the sub-vector quantization is:

for 4000senone models
---------------------
1024 for 1 gau/state models,
2048 for 2,4 gau/state models and
4096 for 8 gau/state models.

for 6000senone models
---------------------
1024 for 1 gau/state models,
2048 for 2 gau/state models and
4096 for 4,8 gau/state models.

The quantized models are for use with s3.2/3.3, which also require the corresponding un-quantized models during runtime (i.e., both must be provided). The quantized models are labeled as .quant and are placed in the same sub-directory as the corresponding full model. The un-quantized models in all sub-directories are for use with the s3 continuous decoder.

Language model:

The language model provided is a simple trigram model, which has been built for tasks similar to broadcast news. The text used to build this model was taken from a variety of permitted sources, including broadcast news. The vocabulary covers 64000 words and is listed in the file called language_model.vocabulary. The file language_model.arpaformat.gz can be used with the s2 decoder, while the file language_model.arpaformat.DMP.Z must be used with the s3 decoders. Note that the system will only recognize words which are within the vocabulary.

For a description of the arpa format, see http://www.speech.cs.cmu.edu/sphinxman/fr7.html
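Because the feature-set specifications listed earlier in this file must be matched exactly, it can help to keep them in one place and derive the per-frame quantities from them. The Python sketch below does that. The dictionary keys and function names are hypothetical, and rounding the window length to the nearest sample is an assumption; the actual front end may round differently.

```python
# Front-end values copied from the specification list in this file.
# Key names are illustrative assumptions, not SPHINX option names.
FRONT_END = {
    "preemphasis": 0.97,
    "sampling_rate_hz": 16000.0,
    "frame_rate_fps": 100.0,
    "window_length_sec": 0.0256,
    "fft_size": 512,
    "num_mel_filters": 40,
    "lower_edge_hz": 133.33334,
    "upper_edge_hz": 6855.49756,
    "num_cepstra": 13,
}


def derived(params):
    """Compute per-frame quantities implied by the specifications."""
    rate = params["sampling_rate_hz"]
    return {
        # 0.0256 sec at 16000 Hz is 409.6 samples; rounded here (assumption).
        "window_samples": round(params["window_length_sec"] * rate),
        # Samples between the starts of consecutive frames.
        "frame_shift_samples": round(rate / params["frame_rate_fps"]),
        # Cepstra plus delta plus double-delta cepstra.
        "feature_dim": 3 * params["num_cepstra"],
    }


def check(params):
    """Return a list of consistency problems; empty if none are found."""
    problems = []
    d = derived(params)
    if d["window_samples"] > params["fft_size"]:
        problems.append("analysis window is longer than the FFT")
    if params["upper_edge_hz"] > params["sampling_rate_hz"] / 2:
        problems.append("upper filter edge exceeds the Nyquist frequency")
    if params["lower_edge_hz"] >= params["upper_edge_hz"]:
        problems.append("filter bank edges are out of order")
    return problems
```

With the values in this file, the derived feature dimension is 39, matching the vector size stated for the models, and check(FRONT_END) reports no problems.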