INDEX (this document is under construction...)

ACOUSTIC MODELS FOR 16KHZ SAMPLED NORMAL BANDWIDTH SPEECH
bnmodels.tar.gz
These models are for the sphinx2 semi-continuous decoder. You can use CMUdict with these models; just make sure that every phone it uses has a corresponding .chmm-labelled phone model file in the model directory. The dictionary is very generic. To make it suitable for a specialized task, you will have to remove the words and pronunciations that are unlikely to occur in that task. The smaller the dictionary, the faster and better the recognition, provided it still covers all the words likely to be encountered in the task.
With these models and dictionary, you can use the bn.bigram.arpa.gz ARPA-format bigram language model. It has 57138 unigrams and about 10 million bigrams. It can be turned into a unigram LM by deleting the bigram entries, keeping the \end\ marker, the \2-grams marker and one bigram, and setting the 2-gram count to 1 in the header of the LM. This was initially a trigram LM, but it was too large to put up on the web from this site; 14 million trigrams were deleted to give this bigram LM. You can get better recognition with trigram LMs, but if you are only beginning to set up your system, work with this bigram LM, or better still, a unigram LM. LMs can easily be switched later by altering a single flag entry in the decoder arguments.
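The bigram-to-unigram trimming described above is mechanical enough to script. Here is a minimal Python sketch of it; the function name and filenames are placeholders, it assumes a standard ARPA layout with an "ngram 2=" line in the \data\ header, and you should sanity-check the output header counts before using the result:

```python
import gzip

def bigram_to_unigram(in_path, out_path):
    """Strip a gzipped ARPA bigram LM down to a unigram LM: set
    ngram 2=1 in the header, keep the \\2-grams: marker and exactly
    one bigram entry, and keep the \\end\\ marker."""
    with gzip.open(in_path, "rt") as f:
        lines = f.read().splitlines()
    out, in_bigrams, kept_one = [], False, False
    for line in lines:
        if line.startswith("ngram 2="):
            out.append("ngram 2=1")            # only one bigram survives
        elif line.startswith("\\2-grams:"):
            out.append(line)
            in_bigrams = True
        elif line.startswith("\\end\\"):
            out.append(line)
            in_bigrams = False
        elif in_bigrams:
            if not kept_one and line.strip():
                out.append(line)               # keep exactly one bigram
                kept_one = True
        else:
            out.append(line)
    with gzip.open(out_path, "wt") as f:
        f.write("\n".join(out) + "\n")
```

The surviving bigram's probability is irrelevant in practice; the decoder will back off to unigrams for everything else.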

Training data

Source: Hub4-1998 data provided by LDC.
Amount of data actually used: 36.67 hours

Feature set used

Mel-frequency cepstra computed using the front-end provided with the open-source distribution. The following specs were used to compute the cepstra:

• pre-emphasis factor : 0.970
• sampling rate : 16000.000 Hz
• frame rate : 100.000 frames/sec
• Hamming window length : 0.0256 sec
• size of FFT : 512 samples
• number of Mel filters : 40
• lower edge of filter bank : 133.33334 Hz
• upper edge of filter bank : 6855.49756 Hz
• number of MFCC coefficients/frame : 13

Model architecture

• HMM type: semi-continuous, with four 256-component mixture-weights per state (four codebooks with 256 codewords each; each codeword represents a Gaussian density function)
• Total number of phones: 51
+BREATH+, +COUGH+, +LAUGH+, +SMACK+, +UH+, +UHUM+, +UM+, SIL, AA, AE, AH, AO, AW, AX, AXR, AY, B, CH, D, DH, DX, EH, ER, EY, F, G, HH, IH, IX, IY, JH, K, L, M, N, NG, OW, OY, P, R, S, SH, T, TH, UH, UW, V, W, Y, Z, ZH
• Number of filler phones: 8
+BREATH+, +COUGH+, +LAUGH+, +SMACK+, +UH+, +UHUM+, +UM+, SIL
• Total number of triphonetic tied states: 6000
• Total number of triphones: 125665
The complete list of triphones and the manner in which states were tied can be seen in this model definition file.
• Model topology used: 5-state Bakis topology HMM with non-emitting last state
Number of states per model followed by the transition matrix template:
6
1.0	1.0	1.0	0.0	0.0	0.0
0.0	1.0	1.0	1.0	0.0	0.0
0.0	0.0	1.0	1.0	1.0	0.0
0.0	0.0	0.0	1.0	1.0	1.0
0.0	0.0	0.0	0.0	1.0	1.0
The last state has no outgoing arcs unless embedded in a sentence HMM structure.
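The template above can also be generated programmatically. The following small sketch (the function name and the use of numpy are my own, not part of the SPHINX tools) reproduces the 5-row, 6-column pattern: each emitting state allows a self-loop, a transition to the next state, and a skip of one state.

```python
import numpy as np

def bakis_template(n_states=6):
    """Transition template for 5 emitting states plus one non-emitting
    final state: 1.0 marks an allowed arc (self, next, or skip)."""
    T = np.zeros((n_states - 1, n_states))   # no row for the final state
    for i in range(n_states - 1):
        for j in (i, i + 1, i + 2):          # self-loop, next, skip
            if j < n_states:
                T[i, j] = 1.0
    return T
```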


Recognition feature set

c/1..L-1/,d/1..L-1/,c/0/d/0/dd/0/,dd/1..L-1/
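That string describes the four SPHINX-II feature streams: the cepstra c[1..12], their deltas, a 3-dimensional power stream (c0, its delta, its double delta), and the double deltas. As an illustration only, here is a Python sketch that splits a 13-dimensional cepstral matrix into those four streams; the ±2/±1 frame difference windows used here are assumptions for the sketch, not the exact SPHINX-II delta computation.

```python
import numpy as np

def delta(x, d):
    """Simple +/-d frame difference, with edges padded by repetition."""
    pad = np.pad(x, ((d, d), (0, 0)), mode="edge")
    return pad[2 * d:] - pad[:-2 * d]

def four_streams(cep):
    """cep: (T, 13) array of MFCC frames, c0 in column 0.
    Returns the four streams (cep, dcep, power, ddcep)."""
    d1 = delta(cep, 2)
    d2 = delta(d1, 1)
    power = np.stack([cep[:, 0], d1[:, 0], d2[:, 0]], axis=1)
    return cep[:, 1:], d1[:, 1:], power, d2[:, 1:]
```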

CMU internal

Here are the locations of some training files:

• Transcript file (force-aligned): /net/bert/usr7/archive/alf32/usr6/hub4opensource/1998.transcripts.faligned
• Control file (for force-aligned transcripts): /net/bert/usr7/archive/alf32/usr6/hub4opensource/1998.ctl.faligned
• Training dictionary: /net/bert/usr7/archive/alf32/usr6/hub4opensource/train.dict
• Training filler dictionary: /net/bert/usr7/archive/alf32/usr6/hub4opensource/train.filler.dict
• Training phonelist: /net/bert/usr7/archive/alf32/usr6/hub4opensource/train.phonelist
• Linguistic questions used for decision trees: /net/bert/usr7/archive/alf32/usr6/hub4opensource/linguistic_questions
• Decision trees (unpruned): /net/bert/usr7/archive/alf32/usr6/hub4opensource/trees/newfe_hub97.unpruned
• Decision trees (pruned): /net/bert/usr7/archive/alf32/usr6/hub4opensource/trees/newfe_hub97.6000
• sphinx3 model parameters: /net/bert/usr7/archive/alf32/usr6/hub4opensource/model_parameters/newfe_hub97.cd_semi_6000/means, variances, mixture_weights, transition_matrices
• sphinx2 models: /alf32/usr6/hub4opensource/opensrc_hub4/sphinx_2_format

Model format

The models provided are in the SPHINX-II format. They were trained using the SPHINX-III trainer and converted into SPHINX-II format using a format-conversion toolkit, which will be provided soon.

Performance for benchmarking

Test set yet to be decided

Things to check for if the models do not work for you

CMU internal: restructuring the opensource models to include newer triphones

The model set provided for 16KHz normal bandwidth speech includes models for 51 context-independent phones and 125665 triphones. This is the set of all triphones that could possibly be generated from the dictionary provided for recognition along with the models. However, no dictionary can claim to list all possible words in the language. New words will always be encountered, and their pronunciations may include triphones which were never seen in the dictionary provided. New triphones may also be generated when you compound words present in the recognition dictionary and treat each compounded word as a regular word. For example, the word

THAT  DH AE T


in the dictionary may give rise to many word-beginning triphones with DH as the central phone, many word-ending triphones with T as the central phone, and the word-internal triphone AE(DH,T). The word

DOES  D AX Z

similarly results in word-beginning triphones for D, word-ending triphones for Z, and the word-internal triphone AX(D,Z). When you compound these two words to get the compounded word

THAT_DOES  DH AE T D AX Z

it includes, amongst other triphones, four word-internal triphones (rather than two). While most of the triphones that can be generated from this word may already have been seen in the recognition dictionary, the new word-internal triphones T(AE,D) and D(T,AX) may not have been, so there may be no models for them in the given set. That is likely because the phone sequences AE T D and T D AX are extremely rare within any single word of the English language.
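The triphone bookkeeping in this example is easy to check mechanically. The following Python sketch (the function name and the CENTER(LEFT,RIGHT) string format are just illustrative conventions) lists the word-internal triphones of a pronunciation:

```python
def word_internal_triphones(phones):
    """Word-internal triphones CENTER(LEFT,RIGHT) of a pronunciation
    given as a list of phones (one per interior position)."""
    return ["%s(%s,%s)" % (phones[i], phones[i - 1], phones[i + 1])
            for i in range(1, len(phones) - 1)]

# THAT and DOES each contribute one word-internal triphone...
word_internal_triphones("DH AE T".split())   # -> ['AE(DH,T)']
word_internal_triphones("D AX Z".split())    # -> ['AX(D,Z)']
# ...but the compounded word THAT_DOES contributes four, two of them new:
word_internal_triphones("DH AE T D AX Z".split())
# -> ['AE(DH,T)', 'T(AE,D)', 'D(T,AX)', 'AX(D,Z)']
```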

Thus, if you are compounding words or introducing new ones whose pronunciations contain rather rare phone sequences, you must generate or construct models for them before recognition. This is where "tied states", or senones, come in handy. Senones can be viewed as independent model components, or model building blocks. A model for a triphone consists of a sequence of senones. Which sequence is right for a given triphone is decided on the basis of what are called "pruned" decision trees. Each leaf of a pruned decision tree represents a set of contexts (which are again phones) and is labeled by a number, called the "tied-state id" or the "senone id". Each leaf is a tied state, or senone. There is one pruned decision tree for each state of each PHONE (not triphone). Since the phone along with its contexts forms the triphone for which you want to compose a model, you only have to look for the leaf which includes that context and select the leaf id (or senone id) to represent the corresponding state of the triphone's HMM. The process is a little more involved than this description, but at a very crude level this is the idea.
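As a purely illustrative toy (the tree, the phone classes and the senone ids below are invented; real trees and linguistic questions are read from the tree files listed elsewhere in this document), senone selection amounts to walking a pruned decision tree with the triphone's contexts and returning the id at the leaf reached:

```python
# Toy pruned tree for one HMM state of one phone: each internal node
# asks whether a context phone belongs to a phone class; each leaf
# carries a senone id.
VOWELS = {"AA", "AE", "AH", "AX", "EH", "IY", "UW"}

tree = ("left_in", VOWELS,                      # question on left context
        ("right_in", {"T", "D"}, 4711, 4712),   # then on right context
        4713)

def senone_id(node, left, right):
    """Descend the tree using the triphone's contexts."""
    if not isinstance(node, tuple):             # reached a leaf
        return node
    kind, phone_class, yes, no = node
    ctx = left if kind == "left_in" else right
    return senone_id(yes if ctx in phone_class else no, left, right)

senone_id(tree, "AE", "D")   # -> 4711, the tied state for T(AE,D)
```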

This "senone selection" is done by the executable "tiestate" provided with the SPHINX package, once you specify the triphones that you need models for and the pruned trees to select the senones from. The usage of this executable is explained here.

Here is what you do when you think that you have new triphones (the list of triphones already provided with the models is given in the section describing the model architecture), and want to update your models:

1. First, check whether the new triphones are indeed absent from the recognition dictionary provided. This is very simple to do: just check whether the new phone sequences are present in the dictionary. If they are not, then
2. Take the recognition dictionary which has been provided with the models and include the new words and pronunciations in it. This is your new recognition dictionary.
3. Create a model definition file which now lists all possible triphones present in this dictionary. Use the script 01.listalltriphones.csh to generate this file. Click here to read more about a tied state model definition file. This will create a file called alltriphones.mdef
4. Use the script 02.tiestate.csh to create models for the triphones listed in alltriphones.mdef using the pruned decision trees corresponding to the current model set. This will create a file called newtriphones.6000.mdef
5. Edit the script 03.cvt.csh to convert the SPHINX-III format models to SPHINX-II format using the file newtriphones.6000.mdef. The right paths are currently entered in the script, but if you move the setup elsewhere, you will have to edit the script to introduce the correct pathnames for the SPHINX-III model parameters. These paths are listed under "CMU internal" in the description provided above for these models.
6. 03.cvt.csh will create a new directory called "newmodels_sphinx2_format", which you can use for recognition, along with the new dictionary.
The scripts are all in the directory /net/bert/usr7/archive/alf32/usr6/hub4opensource/restructure_s2models/

General instructions for adapting existing models to your task domain using within-domain data are here. Prior to using your within-domain data for adaptation, you must force-align this data using the original unadapted models. The force-aligned transcripts must then be used for adaptation. Here are the locations of some specific files needed for adaptation:

1. the location of the baum_welch executable and the required flags are described in /net/bert/usr7/archive/alf32/usr6/hub4opensource/adapt_s2models/bin/baum_welch.readme
2. the location of the norm executable and the required flags are described in /net/bert/usr7/archive/alf32/usr6/hub4opensource/adapt_s2models/bin/norm.readme
3. the location of the mixw_interp executable and the required flags are described in /net/bert/usr7/archive/alf32/usr6/hub4opensource/adapt_s2models/bin/mixw_interp.readme
4. After the adapted models are written, they must be converted to SPHINX-II format. The script 03.cvt.csh in /net/bert/usr7/archive/alf32/usr6/hub4opensource/restructure_s2models/ may be used for this. The model definition file used for adaptation must also be used during conversion.
ACOUSTIC MODELS FOR 8KHZ SAMPLED TELEPHONE BANDWIDTH SPEECH
Training data

Source: Communicator data collected at CMU
Amount of data actually used:

Feature set used

Mel-frequency cepstra computed using the front-end provided with the open-source distribution. The following specs were used to compute the cepstra:

• pre-emphasis factor : 0.970
• sampling rate : 8000.000 Hz
• frame rate : 100.000 frames/sec
• Hamming window length : 0.0256 sec
• size of FFT : 256 samples
• number of Mel filters : 31
• lower edge of filter bank : 200.0 Hz
• upper edge of filter bank : 3500.0 Hz
• number of MFCC coefficients/frame : 13

DECIDING WHICH DECODER TO USE WITH YOUR MODELS AND SETTING THE DECODER PARAMETERS

If you are about to train acoustic models to go with a decoder that you have already chosen, or if you are about to use existing acoustic models and want to choose the most compatible decoder, you need to know a little about the strengths and limitations of each decoder. SPHINX comes with three decoders:

• Sphinx 2.0 : This can only decode with 5-state/HMM, 4-feature-stream semi-continuous models in sphinx 2.0 format. Within CMU, there are two versions of this decoder: an old version which, in live mode, computes cepstra using an old cepstra-computation code. That cepstra-computation code is obsolete, but it has to be used with the old decoder. The newer version uses the new front-end code provided with the SPHINX open-source version and currently used for all tasks within CMU. The difference between the old and new front-end codes is that the old version used a log-linear function to warp the frequencies in the process of cepstrum computation, while the new one uses a proper mel function. The Sphinx 2.0 decoder decodes at real-time speeds.

• Sphinx 3.0: This can decode with semi-continuous and continuous models using all supported feature-vector configurations and HMM topologies. Decoding is slow and, depending on your models and data, can range from 10 to 60 times real time.

• Sphinx 3.2: This can decode only 3-state and 5-state continuous models. It uses two sets of models for each decode: a set of continuous models in the conventional Sphinx 3.0 format and another set which is a quantized version of the first. This decoder is also called the "Sphinx3 fastdecoder". Depending on your models, data, and the size of the LM, it runs between 2 and 8 times real time.
Flag settings

The flag settings vary depending on which decoder you are using, and what settings were involved in the computation of training features for your models.

Here are complete listings of flags that are accepted by these three decoders. The lines in green are the flags that a user would be typically expected to specify, depending on the type of data encountered during recognition and the specifications that come with the acoustic models being used. Standard values for these flags are indicated, and you can optimize around these values. The lines in red, however, must only be used if you are familiar with what is going on in the decoder at an algorithmic level. They are mostly active for research and debugging. In a standard task, don't mention these flags and don't worry about them.

Flag settings for the Sphinx 2.0 decoder (I still have to color the lines... this is not complete)

Acoustic models, LMs and dictionaries for the Diplomat English/Croatian translation system

LM texts

The decode dictionaries are also within the model directories. They are called DECODE.DICT and DECODE.NOISEDICT. Both must be used during recognition. Remember that ONLY the words that are in the dictionary AND the LM are recognizable. If you have words in your vocabulary that you want to recognize, and don't have examples of their usage in the LM text, then include them simply as unigrams in the LM. The LM vocabulary must not exceed 64,000 words. The dictionary is flexible. You can shorten it or add new words to it. Do not change the phoneset, though.
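Adding vocabulary words as bare unigrams can be done with a few lines of scripting. This Python sketch (the function name and the flat log probability are assumptions, the placement relies on a blank line terminating the \1-grams: section, and it does not renormalize the LM as a proper tool would) appends the new words to the unigram section and bumps the ngram 1= count:

```python
def add_unigrams(arpa_lines, new_words, logprob=-4.0):
    """Splice new words into the \\1-grams: section of an ARPA LM
    (given as a list of lines) with a flat, assumed log probability
    and a zero backoff weight; update the header count to match."""
    out, in_unigrams = [], False
    for line in arpa_lines:
        if line.startswith("ngram 1="):
            n = int(line.split("=")[1])
            line = "ngram 1=%d" % (n + len(new_words))
        elif line.startswith("\\1-grams:"):
            in_unigrams = True
        elif in_unigrams and not line.strip():
            # blank line ends the unigram section: insert entries here
            for w in new_words:
                out.append("%.4f %s %.4f" % (logprob, w, 0.0))
            in_unigrams = False
        out.append(line)
    return out
```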

The acoustic models are meant to recognize 16kHz sampled speech. Make sure that the signals you record are not clipped and do not have a "tabletop" appearance. If they do, you have a gain-control problem.

No AGC was used during training. Hence no AGC must be used for decoding: the agc flag must be set to FALSE.

Here are some tentative flag settings for the decoder to be used with these models. First try only these (leave out ALL other flags; don't mention them). If these settings slow down the decode (and the decodes look OK), try reducing -topn, but do not go below 2. This flag affects recognition hugely: 1 can be very fast but very bad; 2 is a little slower but reasonable; 4 gives you good decodes. You can of course optimize around these settings if you have the time:

-live TRUE                      -topsenfrm 4
-topsenthresh -50000            -nmprior TRUE
-fwdflat FALSE                  -bestpath TRUE
-top 4                          -fillpen 1e-10
-nwpen 0.01                     -silpen 0.005
-inspen 0.65                    -langwt 7.5
-ugwt 0.5                       -beam 2e-6
-npbeam 2e-6                    -nwbeam 5e-4
-lpbeam 2e-5                    -lponlybeam 5e-4
-rescorelw 9.5                  -hmmdir  tongues_english
-hmmdirlist tongues_english     -cbdir   tongues_english
-lmfn      english\English.arpabo
-kbdumpdir tongues_english      -dictfn  tongues_english\DECODE.DICT
-phnfn     tongues_english\phone  -ndictfn   tongues_english\DECODE.NOISEDICT
-mapfn      tongues_english\map  -sendumpfn  tongues_english\sendump
-normmean TRUE                   -8bsen TRUE
-matchfn c:\anything.match
-logfn c:\anything.log (omit this flag if you don't want to write a log)


The specifications for the Croatian models are the same as those for the English models. There is no decode noisedict for these models. The same settings (barring the actual model directory names and output filenames) should work for Croatian, with ONE difference: the -ndictfn flag must be omitted for the Croatian models. Add the missing words to the Croatian dictionary before using it to decode; a list of missing words is given. An ALL.tar.gz tar file for these LMs and dicts is available, to preserve the ISO characters.