INDEX (this document is under construction...)

  1. Acoustic models for 16 kHz sampled normal bandwidth speech
  2. Acoustic models for 8 kHz sampled telephone bandwidth speech
  3. Deciding which decoder to use with your models and setting the decoder parameters

These models are for the sphinx2 semicontinuous decoder. You can use CMUdict with these models; just make sure that the phones it uses match the .chmm labelled phone model files in the model directory. The dictionary is very generic. To make it suitable for a specialized task, you will have to remove words and pronunciations which are unlikely to occur within that task. The smaller the dictionary, the faster and better the recognition, provided it still covers all the words likely to be encountered in the task.
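The pruning step described above can be sketched in a few lines of Python. This is only an illustration of the idea; the entry format shown (word, optional alternate-pronunciation suffix in parentheses, then phones) follows the usual CMUdict convention, and all words and phone strings below are made up for the example.

```python
# Sketch: prune a CMUdict-style dictionary down to a task vocabulary.
# Alternate pronunciations such as HELLO(2) share the base word HELLO,
# so both survive if HELLO is in the task vocabulary.

def prune_dictionary(dict_lines, task_words):
    """Keep only entries whose base word is in the task vocabulary."""
    kept = []
    for line in dict_lines:
        if not line.strip():
            continue
        word = line.split()[0]
        base = word.split("(")[0]   # strip the alternate-pronunciation suffix
        if base in task_words:
            kept.append(line)
    return kept

dict_lines = [
    "HELLO  HH AH L OW",
    "HELLO(2)  HH EH L OW",
    "WORLD  W ER L D",
    "ZYMURGY  Z AY M ER JH IY",
]
task = {"HELLO", "WORLD"}
print(prune_dictionary(dict_lines, task))   # ZYMURGY is dropped
```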
With these models and dictionary, you can use the ARPA-format bigram language model. This has 57138 unigrams and about 10 million bigrams. It can be turned into a unigram LM by deleting the bigram entries, keeping the \end\ marker, the \2-grams marker and one bigram, and setting the 2-gram count to 1 at the beginning of the LM. This was initially a trigram LM, but it was too large to put up on the web from this site; 14 million trigrams were deleted to give this bigram LM. You can get better recognition with trigram LMs, but if you are only beginning to set up your system, work with this bigram LM, or better still, a unigram LM. LMs can easily be switched later by altering a single flag entry in the decoder arguments.
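The edit described above (keep one bigram, set the 2-gram count to 1) can be sketched as follows. This assumes the standard ARPA layout (\data\ header with "ngram N=count" lines, then \1-grams:, \2-grams:, and \end\ sections); the tiny LM at the bottom is a made-up example, not the actual 57138-word model.

```python
# Sketch: reduce an ARPA bigram LM to a near-unigram LM, as described above:
# keep exactly one bigram entry and set the 2-gram count in \data\ to 1.

def bigram_to_unigram(arpa_lines):
    out, in_bigrams, kept_one = [], False, False
    for line in arpa_lines:
        s = line.strip()
        if s.startswith("ngram 2="):
            out.append("ngram 2=1")        # only one bigram will remain
            continue
        if s == "\\2-grams:":
            in_bigrams = True
            out.append(line)
            continue
        if s == "\\end\\":
            in_bigrams = False
            out.append(line)
            continue
        if in_bigrams:
            if s and not kept_one:
                out.append(line)           # the single surviving bigram
                kept_one = True
            continue                       # every other bigram is deleted
        out.append(line)
    return out

lm = ["\\data\\", "ngram 1=3", "ngram 2=4", "",
      "\\1-grams:", "-1.0 hello -0.5", "-1.2 world -0.4", "-0.9 </s>", "",
      "\\2-grams:", "-0.3 hello world", "-0.7 world </s>",
      "-0.8 <s> hello", "-0.6 hello </s>",
      "\\end\\"]
print("\n".join(bigram_to_unigram(lm)))
```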

Training data

Source: Hub4-1998 data provided by LDC.
Amount of data actually used: 36.67 hours

Feature set used

Mel frequency cepstra computed using the front end provided with the open source release. The following specs were used to compute the cepstra:

Model architecture

Recognition feature set


Click here for an explanation of this feature

CMU internal

Here are the locations of some training files:

Model format

The models provided are in the SPHINX-II format. These were trained using the SPHINX-III trainer and converted into the SPHINX-II format using a format conversion toolkit which will be provided soon.

Performance for benchmarking

Test set yet to be decided

Things to check for if the models do not work for you

CMU internal: restructuring the opensource models to include newer triphones

The model set provided for 16 kHz normal bandwidth speech includes models for 51 context-independent phones and 125665 triphones. This is the set of all triphones that could possibly be generated from the dictionary provided for recognition along with the models. However, no dictionary can claim to list all possible words in the language. New words will always be encountered, and the pronunciations of these words may include triphones which were never seen in the dictionary provided. New triphones may also be generated when you compound words present in the recognition dictionary and treat each compounded word as a regular word. For example, the word


in the dictionary may give rise to many word-beginning triphones with DH as the central phone, many word-ending triphones with the central phone T, and the word-internal triphone AE(DH,T). The word

similarly results in word beginning triphones for D, word ending triphones for Z and the word-internal triphone AX(D,Z). When you compound these two words to get the compounded word

it includes, amongst other triphones, four word-internal triphones (rather than two). While most of the triphones that can be generated from this word might already have been seen in the recognition dictionary, the new word-internal triphones D(T,AX) or T(AE,D) may not have been seen. As a consequence there may be no models for these two triphones in the given set of models. That is likely because the sequence of phones AE T D or T D AX is extremely rare within any word in the English language.
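The triphone bookkeeping above can be made concrete with a short sketch. The two phone strings are the ones used in the example (DH AE T and D AX Z); the triphone notation phone(left,right) with a position tag follows the description in the text, and compounding is modeled simply as concatenating the two pronunciations.

```python
# Sketch: enumerate the triphones of a pronunciation, tagging each as
# word-beginning ("b"), word-internal ("i"), or word-ending ("e").

def word_triphones(phones):
    tris = []
    for k, p in enumerate(phones):
        left = phones[k - 1] if k > 0 else "*"              # "*" = word boundary
        right = phones[k + 1] if k < len(phones) - 1 else "*"
        pos = "b" if k == 0 else ("e" if k == len(phones) - 1 else "i")
        tris.append(f"{p}({left},{right}){pos}")
    return tris

first = ["DH", "AE", "T"]      # pronunciation of the first example word
second = ["D", "AX", "Z"]      # pronunciation of the second example word
compound = first + second      # the compounded word's pronunciation

internal = [t for t in word_triphones(compound) if t.endswith("i")]
print(internal)
# Four word-internal triphones, including the two possibly-unseen ones,
# T(AE,D) and D(T,AX), that straddle the old word boundary.
```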

Thus, if you are compounding words or introducing new ones with rather rare phone sequences in their pronunciations, you must generate or construct models for them before recognition. This is where "tied states", or senones, come in handy. Senones can be viewed as independent model components, or model building blocks. A model for a triphone consists of a sequence of senones. Which sequence is right for a given triphone is decided on the basis of what are called "pruned" decision trees. Each leaf of a pruned decision tree represents a bunch of contexts (which are again phones) and is labeled by a number, called the "tied state id" or the "senone id". Each leaf is a tied state or a senone. There is one pruned decision tree for each state of each PHONE (not triphone). Since the phone along with its contexts forms the triphone for which you want to compose a model, you only have to look for the leaf which includes that context and select the leaf-id (or senone-id) to represent the corresponding state of the triphone's HMM. The process is a little more involved than this description, but at a very crude level this is the idea.
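At that same crude level, the leaf lookup can be illustrated with a toy data structure. This is not the SPHINX tree format that tiestate reads; the trees here are flattened into leaves (a set of contexts mapped to a senone id), and all phone names, state counts, and senone ids are invented for the example.

```python
# Toy illustration of senone selection: one pruned tree per (phone, state),
# each tree flattened into leaves; a leaf is a set of (left, right)
# contexts mapped to a senone id.

PRUNED_TREES = {
    # (phone, hmm_state) -> list of (context set, senone id)
    ("AE", 0): [({("DH", "T"), ("DH", "D")}, 4001),
                ({("K", "T")}, 4002)],
    ("AE", 1): [({("DH", "T")}, 4101)],
}

def senone_sequence(phone, left, right, n_states=2):
    """Compose a triphone model as one senone id per HMM state."""
    seq = []
    for state in range(n_states):
        for contexts, senone_id in PRUNED_TREES[(phone, state)]:
            if (left, right) in contexts:
                seq.append(senone_id)   # this leaf covers our context
                break
        else:
            seq.append(None)            # no leaf covers this context
    return seq

print(senone_sequence("AE", "DH", "T"))   # -> [4001, 4101]
```

In the real toolkit this lookup (with proper tree traversal and backoff) is what the tiestate executable performs for you.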

This "senone selection" is done by the executable "tiestate" provided with the SPHINX package, once you specify to it the triphone that you need models for, and the pruned trees to select the senones from. The usage of this executable is explained here.

Here is what you do when you think that you have new triphones (the list of triphones already provided with the models is given in the section describing the model architecture), and want to update your models:

  1. First, check whether the new triphones are indeed absent from the recognition dictionary provided. This is very simple to do: just check whether the new phone sequences are present in the dictionary. If not, then
  2. Take the recognition dictionary which has been provided with the models and include the new words and pronunciations in it. This is your new recognition dictionary.
  3. Create a model definition file which now lists all possible triphones present in this dictionary. Use the script 01.listalltriphones.csh to generate this file. Click here to read more about a tied state model definition file. This will create a file called alltriphones.mdef
  4. Use the script 02.tiestate.csh to create models for the triphones listed in alltriphones.mdef using the pruned decision trees corresponding to the current model set. This will create a file called newtriphones.6000.mdef
  5. Edit the script 03.cvt.csh to convert the SPHINX-III format models to SPHINX-II format using the file newtriphones.6000.mdef. The right paths are currently entered in the script, but if you move the setup elsewhere, you will have to edit the script to introduce the correct pathnames for the SPHINX-III model parameters. These paths are listed under "CMU internal" in the description provided above for these models.
  6. 03.cvt.csh will create a new directory called "newmodels_sphinx2_format", which you can use for recognition, along with the new dictionary.
The scripts are all in the directory /net/bert/usr7/archive/alf32/usr6/hub4opensource/restructure_s2models/

CMU internal: adapting the opensource models to your task domain

General instructions for adapting existing models to your task domain using within-domain data are here. Prior to using your within-domain data for adaptation, you must force-align this data using the original unadapted models. The force-aligned transcripts must then be used for adaptation. Here are the locations of some specific files needed for adaptation:

  1. the following script can be used for force-aligning your adaptation data: /net/bert/usr7/archive/alf32/usr6/hub4opensource/adapt_s2models/bin/falign.csh
  2. the location of the baum_welch executable and the required flags are described in /net/bert/usr7/archive/alf32/usr6/hub4opensource/adapt_s2models/bin/baum_welch.readme
  3. the location of the norm executable and the required flags are described in /net/bert/usr7/archive/alf32/usr6/hub4opensource/adapt_s2models/bin/norm.readme
  4. the location of the mixw_interp executable and the required flags are described in /net/bert/usr7/archive/alf32/usr6/hub4opensource/adapt_s2models/bin/mixw_interp.readme
  5. After the adapted models are written, they must be converted to SPHINX-II format. The script 03.cvt.csh in /net/bert/usr7/archive/alf32/usr6/hub4opensource/restructure_s2models/ may be used for this. The model_definition file used for adaptation must also be used during conversion.
    Training data

    Source: Communicator data collected at CMU
    Amount of data actually used:

    Feature set used

    Mel frequency cepstra computed using the front end provided with the open source release. The following specs were used to compute the cepstra:


    If you are about to train acoustic models to go with a decoder that you have already decided to use, or if you are about to use existing acoustic models and want to choose the most compatible decoder, you have to know a little about the strengths and limitations of each decoder. The SPHINX package comes with three decoders:

    Flag settings

    The flag settings vary depending on which decoder you are using, and on what settings were involved in the computation of training features for your models.

    Here are complete listings of the flags accepted by these three decoders. The lines in green are the flags that a user would typically be expected to specify, depending on the type of data encountered during recognition and the specifications that come with the acoustic models being used. Standard values for these flags are indicated, and you can optimize around these values. The lines in red, however, must only be used if you are familiar with what is going on in the decoder at an algorithmic level. They are mostly active for research and debugging. In a standard task, don't mention these flags and don't worry about them.

    Flag settings for the Sphinx 2.0 decoder (I still have to color the lines... this is not complete)

    -force  Force  STRING  Force
    -argfile  ArgFile  STRING  Cmd line argument file
    -allphone  AllPhoneMode  BOOL  All Phone Mode
    -forceRec  ForceRec  BOOL  ForceRec
    -agcbeta  AgcBeta  BOOL  Use beta based AGC
    -agcmax  AgcMax  BOOL  Use max based AGC. Find the maximum c0 value in the current utterance and subtract it from the c0 of all frames, thereby forcing the max c0 to zero always.
    Problem: Normalization is based on the maximum value. The maximum value is a *single* point, and any statistic based on a single point is not robust; the max value may be an outlier, for instance. Also, we are always anchoring the max value to 0. Visualize two gaussians, one narrow and one wide. If we align them at the value at which the distributions reach some constant value other than the value at the mean (this is the typical scenario: the max value obtained from a narrow distribution will be closer to the mean than one obtained from a broad distribution), the two means will not align. As a result, any distribution computed from the union of the two distributions will be smeared with respect to both. If we had simply aligned the means of the two distributions instead, this wouldn't happen. When we perform CMN we set the means of all utterances to 0, thereby aligning the means of all utterances.
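The contrast drawn above, anchoring the max versus anchoring the mean, can be shown numerically. This is a toy c0 track, not real cepstra; it just shows that a single outlier frame shifts the whole utterance under max-based AGC, while CMN's shift is set by the mean of all frames.

```python
# Sketch of the two c0 normalizations discussed: max-based AGC anchors the
# per-utterance maximum c0 to 0, while CMN anchors the per-utterance mean to 0.

def agc_max(c0):
    m = max(c0)
    return [x - m for x in c0]      # forces the max c0 to zero

def cmn(c0):
    mu = sum(c0) / len(c0)
    return [x - mu for x in c0]     # forces the mean c0 to zero

utt = [1.0, 3.0, 2.0, 6.0, 2.0]     # toy c0 track; 6.0 is a single outlier
print(agc_max(utt))                 # whole track shifted by the outlier value
print(cmn(utt))                     # shift determined by all frames, not one
```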
    -agcemax  AgcEMax  BOOL  Use another max based AGC. Estimate max c0 as the average of the max c0 of the past (up to) 10 utterances and subtract it from the c0 of all frames. Needed when doing live decodes, since we don't have the entire utterance to find the max c0 from.
    Problem: Same as for -agcmax.
    -agcnoise  AgcNoise  BOOL  Use Noise based AGC
    -agcthresh  AgcThreshold  FLOAT  Threshold for Noise based AGC
    -normmean  NormalizeMean  BOOL  Normalize the feature means to 0.0
    -nmprior  NormalizeMeanPrior  BOOL  Normalize feature means with prior mean
    -compress  CompressBackground  BOOL  Compress excess background frames. Find a 100 point histogram of c0. Find the position where this peaks. Find the bin position some N bins away from the peak towards the min-energy bin (I found the number "5" for N in the code, but that seems wrong as this would be too close to the peak). This bin position is taken as a threshold. All frames with c0 below this threshold are simply deleted from the utterance, thereby shrinking the utterance length.
    Problem: This is based on a heuristically computed threshold, and the heuristic will not always work. First, it assumes that the shape of the histogram of the test data will be similar to the shape of the histogram of the data that the heuristic (the number N that is the shift from the peak of the histogram) was developed on. This may not be the case. The test data may have a broader distribution, in which case we would end up deleting speech. On the other hand, if the test data has lots of silence, we will find a peak in the histogram at the typical silence c[0]. If we shift to the left of this peak to find a threshold, the peak c[0] will lie above this threshold, thereby ensuring that most of the silence frames remain anyway, making the compress operation pointless. A better approach would have been to find a bimodal distribution, fold over the extremum point across the first peak, and use that for a threshold (like we did for SPINE). The way it is currently implemented may help at times, but it probably hurts more frequently than it helps.
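The heuristic described for -compress can be sketched as follows. This is a reading of the description above, not the actual decoder code: 100-bin c0 histogram, peak bin, shift N bins toward the low-energy end, drop frames below the resulting threshold.

```python
# Sketch of the -compress heuristic: histogram c0 into 100 bins, locate the
# peak, step n_shift bins toward the min-energy bin, and delete every frame
# whose c0 falls below that threshold. n_shift=5 mirrors the N found in the code.

def compress_background(c0, n_bins=100, n_shift=5):
    lo, hi = min(c0), max(c0)
    width = (hi - lo) / n_bins or 1.0          # guard against a flat track
    counts = [0] * n_bins
    for x in c0:
        b = min(int((x - lo) / width), n_bins - 1)
        counts[b] += 1
    peak = counts.index(max(counts))
    thresh_bin = max(peak - n_shift, 0)        # shift toward the min-energy bin
    threshold = lo + thresh_bin * width
    return [x for x in c0 if x >= threshold]   # delete "background" frames

# Toy track: a little silence near c0 = -5, mostly speech near c0 = 0.
# Here the histogram peak sits at the speech energy, so the silence frames
# fall below the threshold and get deleted.
utt = [-5.0] * 10 + [0.0] * 50
print(len(compress_background(utt)))
```

Note that if the silence frames had formed the peak instead, the threshold would land below them and nothing would be deleted, which is exactly the failure mode described above.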
    -compressprior  CompressPrior  BOOL  Compress excess background frames based on prior utt. For live decodes, the histogram is found from *previous* utterances and the threshold is based on this histogram of previous utterances. Delete all cepstra with c0 below this threshold.
    Problem: Same as for -compress.
    -dcep80msweight  Dcep80msWeight  DOUBLE  Weight for dcep80ms
    -live  LiveData  BOOL  Get input from A/D hardware
    -blockingad  A/D blocks on read  BOOL  A/D blocks on read
    -ctlfn  CtlFileName  STRING  Control file name
    -ctloffset  CtlLineOffset  INT  Number of lines to skip in ctl file
    -ctlcount  CtlCount  INT  Number of lines to process in ctl file
    -ctlincr  CtlLineIncr  INT  Do every nth line in the ctl file
    -compallsen  ComputeAllSenones  BOOL  Compute all senone scores every frame
    -topsenfrm  TopSenonesFrames  INT  #frames top senones for predicting phones
    -topsenthresh  TopSenonesThresh  INT  Top senones threshold for predicting phones
    -wsj1Sent  wsj1Sent  BOOL  Sent_Dir using wsj1 format
    -reportpron  ReportAltPron  BOOL  Report actual pronunciation in match file
    -matchfn  MatchFileName  STRING  Recognition output file name
    -matchsegfn  MatchSegFileName  STRING  Recognition output with segmentation
    -phoneconf  PhoneConfidence  INT  Phone confidence
    -pscr2lat  PhoneLat  BOOL  Phone lattice based on best senone scores
    -logfn  LogFileName  STRING  Recognition output file name
    -correctfn  CorrectFileName  STRING  Reference output file name
    -utt  Utterance  STRING  Utterance name
    -datadir  DataDirectory  STRING  Data directory
    -cepdir  DataDirectory  STRING  Data directory
    -vqdir  DataDirectory  STRING  Data directory
    -segdir  SegDataDirectory  STRING  Data directory
    -sentdir  SentDir  STRING  Sentence directory
    -sentext  SentExt  STRING  Sentence File Extension
    -lmnamedir  LMNamesDir  STRING  Directory for LM-name file for each utt
    -lmnameext  LMNamesExt  STRING  Filename extension for LM-name files
    -startworddir  StartWordDir  STRING  Startword directory
    -startwordext  StartWordExt  STRING  StartWord File Extension
    -nbestdir  NbestDir  STRING  N-best Hypotheses Directory
    -nbest  NbestCount  INT  No. N-best Hypotheses
    -nbestext  NbestExt  STRING  N-best Hypothesis File Extension
    -cepext  CepExt  STRING  Cepstrum File Extension
    -cext  CCodeExt  STRING  CCode File Extension
    -dext  DCodeExt  STRING  DCode File Extension
    -pext  PCodeExt  STRING  PCode File Extension
    -xext  XCodeExt  STRING  XCode File Extension (4 codebook only)
    -beam  BeamWidth  FLOAT  Beam Width
    -nwbeam  NewWordBeamWidth  FLOAT  New Word Beam Width
    -fwdflatbeam  FwdFlatBeamWidth  FLOAT  FwdFlat Beam Width
    -fwdflatnwbeam  FwdFlatNewWordBeamWidth  FLOAT  FwdFlat New Word Beam Width
    -lponlybw  LastPhoneAloneBeamWidth  FLOAT  Beam Width for Last Phones Only
    -lponlybeam  LastPhoneAloneBeamWidth  FLOAT  Beam Width for Last Phones Only
    -npbeam  NewPhoneBeamWidth  FLOAT  New Phone Beam Width
    -lpbeam  LastPhoneBeamWidth  FLOAT  Last Phone Beam Width
    -phnpen  PhoneInsertionPenalty  FLOAT  Penalty for each phone used
    -inspen  InsertionPenalty  FLOAT  Penalty for word transitions
    -nwpen  NewWordPenalty  FLOAT  Penalty for new word transitions
    -silpen  SilenceWordPenalty  FLOAT  Penalty for silence word transitions
    -fillpen  FillerWordPenalty  FLOAT  Penalty for filler word transitions
    -langwt  LanguageWeight  FLOAT  Weighting on Language Probabilities
    -rescorelw  RescoreLanguageWeight  FLOAT  LM prob weight for rescoring pass
    -fwdflatlw  FwdFlatLanguageWeight  FLOAT  FwdFlat Weighting on Language Probabilities
    -fwdtree  FwdTree  BOOL  Fwd tree search (1st pass)
    -fwdflat  FwdFlat  BOOL  Flat fwd search over fwdtree lattice
    -forwardonly  ForwardOnly  BOOL  Run only the forward pass
    -bestpath  Bestpath  BOOL  Shortest path search over lattice
    -fwd3g  TrigramInFwdPass  BOOL  Use trigram (if available) in forward pass
    -cbdir  CodeBookDirectory  STRING  Code book directory
    -ccbfn  CCodeBookFileName  STRING  CCode Book File Name
    -dcbfn  DCodeBookFileName  STRING  DCode Book File Name
    -pcbfn  PCodeBookFileName  STRING  PCode Book File Name
    -xcbfn  XCodeBookFileName  STRING  XCode Book File Name
    -use20msdp  Use20msDiffPow  BOOL  Use 20 ms diff power instead of c0
    -cepfloor  CepFloor  FLOAT  Floor of Cepstrum Variance
    -dcepfloor  DCepFloor  FLOAT  Floor of Delta Cepstrum Variance
    -xcepfloor  XCepFloor  FLOAT  Floor of XCepstrum Variance
    -top  TopNCodeWords  INT  Number of code words to use
    -skipalt  SkipAltFrames  INT  Skip alternate frames in exiting phones
    -matchscore  WriteScoreInMatchFile  BOOL  Write score in the match file
    -latsize  LatticeSizes  INT  BP and FP Tables Sizes
    -lmcachelines  LMCacheNumLines  INT  No. lines in LM cache
    -ilmugwt  ILMUGCacheWeight  INT  Weight(%) for ILM UG cache prob
    -ilmbgwt  ILMBGCacheWeight  INT  Weight(%) for ILM BG cache prob
    -dumplatdir  DumpLattice  STRING  Dump Lattice
    -samp  SamplingRate  INT  Sampling rate
    -adcin  UseADCInput  BOOL  Use raw ADC input
    -adcext  ADCFileExt  STRING  ADC file extension
    -adcendian  ADCByteOrder  INT  ADC file byte order (0=BIG/1=LITTLE)
    -adchdr  ADCHdrSize  INT  ADC file header size
    -rawlogdir  RawLogDir  STRING  Log directory for raw output files
    -mfclogdir  MFCLogDir  STRING  Log directory for MFC output files
    -tactlfn  TimeAlignCtlFile  STRING  Time align control file
    -taword  TimeAlignWord  BOOL  Time Align Word
    -taphone  TimeAlignPhone  BOOL  Time Align Phone
    -tastate  TimeAlignState  BOOL  Time Align State
    -segext  SegFileExt  STRING  Seg file extension
    -scoreext  ScoreFileExt  STRING  Score file extension
    -osentfn  OutSentFile  STRING  Output sentence file name
    -backtrace  PrintBackTrace  BOOL  Print Back Trace
    -cdcn  CDCNinitFile  STRING  CDCN Initialization File

    Acoustic models, LMs and dictionaries for the Diplomat English/Croatian translation system


    LM texts

    The decode dictionaries are also within the model directories. They are called DECODE.DICT and DECODE.NOISEDICT. Both must be used during recognition. Remember that ONLY the words that are in the dictionary AND the LM are recognizable. If you have words in your vocabulary that you want to recognize, and don't have examples of their usage in the LM text, then include them simply as unigrams in the LM. The LM vocabulary must not exceed 64,000 words. The dictionary is flexible. You can shorten it or add new words to it. Do not change the phoneset, though.

    The acoustic models are meant to recognize 16 kHz sampled speech. Make sure that the signals being recorded are not clipped and do not have a "tabletop" appearance. If they do, you have a gain control problem.

    No AGC was used during training. Hence no AGC must be used for decoding; the AGC flag must be set to FALSE.

    Here are some tentative flag settings for the decoder to be used with these models. First try only these (leave out ALL other flags; don't mention them). If these settings slow down the decode (and the decodes look OK), try reducing -top (do not go below 2; this flag affects recognition hugely: 1 can be very fast but very bad, 2 a little slower but reasonable, and 4 gives you good decodes). You can of course optimize around these settings if you have the time:

     -live TRUE                      -topsenfrm 4  
     -topsenthresh -50000            -nmprior TRUE
     -fwdflat FALSE                  -bestpath TRUE  
     -top 4                          -fillpen 1e-10
     -nwpen 0.01                     -silpen 0.005  
     -inspen 0.65                    -langwt 7.5  
     -ugwt 0.5                       -beam 2e-6  
     -npbeam 2e-6                    -nwbeam 5e-4  
     -lpbeam 2e-5                    -lponlybeam 5e-4
     -rescorelw 9.5                  -hmmdir  tongues_english  
     -hmmdirlist tongues_english     -cbdir   tongues_english  
     -lmfn      english\English.arpabo
     -kbdumpdir tongues_english      -dictfn  tongues_english\DECODE.DICT
     -phnfn     tongues_english\phone  -ndictfn   tongues_english\DECODE.NOISEDICT
     -mapfn      tongues_english\map  -sendumpfn  tongues_english\sendump
     -normmean TRUE                   -8bsen TRUE  
     -matchfn c:\anything.match  
     -logfn c:\anything.log (omit this flag if you don't want to write a log)

    The specifications for the Croatian models are the same as those for the English models. There is no decode noisedict for these models. The same settings (barring the actual model directory names and output filenames) should work for Croatian, with ONE difference: the -ndictfn flag must be omitted for the Croatian models. Add the missing words to the Croatian dictionary before using it to decode; a list of missing words is given. An ALL.tar.gz tar file for these LMs and dicts is available, to preserve the ISO characters.