Training data
Source: Hub4-1998 data provided by LDC.
Amount of data actually used: 36.67 hours
Feature set used
Mel frequency cepstra computed using the front-end provided with the opensource. The following specs were used to compute the cepstra:
Model architecture
Number of states per model: 6, with the following transition matrix template (one row per emitting state, one column per state; the last state has no outgoing arcs unless embedded in a sentence HMM structure):
1.0 1.0 1.0 0.0 0.0 0.0
0.0 1.0 1.0 1.0 0.0 0.0
0.0 0.0 1.0 1.0 1.0 0.0
0.0 0.0 0.0 1.0 1.0 1.0
0.0 0.0 0.0 0.0 1.0 1.0
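As a concrete illustration, here is a minimal Python sketch (illustrative only, not code from the trainer) that reproduces this template: each of the five emitting states permits a self-loop, a transition to the next state, and a skip over one state.

```python
import numpy as np

# 6-state left-to-right topology: 5 emitting states plus a final
# non-emitting state with no outgoing arcs (unless the model is
# embedded in a sentence HMM structure).
NUM_STATES = 6

template = np.zeros((NUM_STATES - 1, NUM_STATES))
for i in range(NUM_STATES - 1):
    for j in (i, i + 1, i + 2):        # self-loop, next state, skip
        if j < NUM_STATES:
            template[i, j] = 1.0

print(template)   # reproduces the 5 x 6 template shown above
```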
c/1..L-1/,d/1..L-1/,c/0/d/0/dd/0/,dd/1..L-1/
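This specification names four feature streams. Below is a minimal Python sketch of how such streams could be assembled from a (num_frames x L) array of mel cepstra; the two-frame difference used for the deltas is an assumption here, and the actual front end's delta windows may differ.

```python
import numpy as np

def sphinx2_feature_streams(cep, w=2):
    """Split mel cepstra into the four streams named above:
    c/1..L-1/, d/1..L-1/, c/0/d/0/dd/0/, dd/1..L-1/.
    cep: (num_frames, L) array of cepstra c0..c(L-1)."""
    pad = np.pad(cep, ((w, w), (0, 0)), mode="edge")
    d = pad[2 * w:] - pad[:-2 * w]                     # delta cepstra
    dpad = np.pad(d, ((w, w), (0, 0)), mode="edge")
    dd = dpad[2 * w:] - dpad[:-2 * w]                  # delta-delta cepstra

    return (cep[:, 1:],                                         # c/1..L-1/
            d[:, 1:],                                           # d/1..L-1/
            np.stack([cep[:, 0], d[:, 0], dd[:, 0]], axis=1),   # c/0/ d/0/ dd/0/
            dd[:, 1:])                                          # dd/1..L-1/
```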
CMU internal
Here are the locations of some training files:
Model format
The models provided are in the SPHINX-II format. These were trained using the SPHINX-III trainer and converted into the SPHINX-II format using a format conversion toolkit, which will be provided soon.
Performance for benchmarking
Test set yet to be decided
CMU internal: restructuring the opensource models to include newer triphones
The model set provided for 16 kHz normal bandwidth includes models for 51 context-independent phones and 125665 triphones. This is the set of all triphones that could possibly be generated from the dictionary provided for recognition along with the models. However, no dictionary can claim to have all possible words in the language listed in it. New words will always be encountered, and the pronunciations of these words may include triphones which were never seen in the dictionary provided. New triphones may also be generated when you compound words present in the recognition dictionary and treat each compounded word as a regular word. For example, the word
THAT DH AE T
in the dictionary may give rise to many word-beginning triphones with DH as the central phone, many word-ending triphones with T as the central phone, and the word-internal triphone AE(DH,T). The word
DOES D AX Z
similarly results in word-beginning triphones for D, word-ending triphones for Z, and the word-internal triphone AX(D,Z). When you compound these two words to get the compounded word
THAT_DOES DH AE T D AX Z
it includes, amongst other triphones, four word-internal triphones (rather than two). While most of the triphones that can be generated from this word might already have been seen in the recognition dictionary, the new word-internal triphones D(T,AX) and T(AE,D) may not have been. As a consequence there may be no models for these two triphones in the given set of models, likely because the phone sequences AE T D and T D AX are extremely rare within any word in the English language.
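Here is a small Python sketch of the triphone bookkeeping described above (the p(l,r) notation follows the text; this is illustrative, not the trainer's code):

```python
def word_internal_triphones(phones):
    """Word-internal triphones p(l,r) of a pronunciation: one for each
    phone that has both its neighbors inside the word."""
    return ["%s(%s,%s)" % (phones[i], phones[i - 1], phones[i + 1])
            for i in range(1, len(phones) - 1)]

print(word_internal_triphones("DH AE T".split()))         # ['AE(DH,T)']
print(word_internal_triphones("D AX Z".split()))          # ['AX(D,Z)']
print(word_internal_triphones("DH AE T D AX Z".split()))
# ['AE(DH,T)', 'T(AE,D)', 'D(T,AX)', 'AX(D,Z)'] -- four, including the
# two new ones, T(AE,D) and D(T,AX), that the dictionary never produced
```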
Thus, if you are compounding words or introducing new ones whose pronunciations contain rather rare phone sequences, you must generate or construct models for them before recognition. This is where "tied states" or senones come in handy. Senones can be viewed as independent model components, or model building blocks. A model for a triphone consists of a sequence of senones, and the right sequence for a given triphone is decided on the basis of what are called "pruned" decision trees. Each leaf of a pruned decision tree represents a bunch of contexts (which are again phones) and is labeled by a number, called the "tied state id" or the "senone id". Each leaf is a tied state or a senone. There is one pruned decision tree for each state of each PHONE (not triphone). Since the phone along with its contexts forms the triphone for which you want to compose a model, you only have to look for the leaf which includes that context and select the leaf id (or senone id) to represent the corresponding state of the triphone's HMM. The process is a little more involved than this description, but at a very crude level this is the idea.
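At that same crude level, senone selection amounts to a lookup, sketched below in Python with toy, hypothetical trees (the real pruned trees, context sets, and senone ids come from training):

```python
# One pruned decision tree per (base phone, HMM state). Each leaf is a
# set of left/right contexts plus the senone id that labels it.
# All entries below are made-up toy data.
trees = {
    ("AE", 0): [({"DH", "TH"}, {"T", "D"}, 412),
                ({"B", "P"},   {"T", "D"}, 413)],
    # ... one tree for every state of every base phone
}

def senone_for(phone, left, right, state):
    """Pick the senone for one state of the triphone phone(left,right):
    find the leaf whose context sets include this triphone's contexts."""
    for lefts, rights, senone_id in trees[(phone, state)]:
        if left in lefts and right in rights:
            return senone_id
    raise KeyError("no leaf covers this context")

print(senone_for("AE", "DH", "T", 0))   # -> 412
```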
This "senone selection" is done by the executable "tiestate" provided with the SPHINX package, once you specify to it the triphone that you need models for, and the pruned trees to select the senones from. The usage of this executable is explained here.
Here is what you do when you think that you have new triphones (the list of triphones already provided with the models is given in the section describing the model architecture), and want to update your models:
CMU internal: adapting the opensource models to your task domain
General instructions for adapting existing models to your task domain using within-domain data are here. Prior to using your within-domain data for adaptation, you must force-align this data using the original unadapted models. The force-aligned transcripts must then be used for adaptation. Here are the locations of some specific files needed for adaptation:
Training data
Source: Communicator data collected at CMU
Amount of data actually used:
Feature set used
Mel frequency cepstra computed using the front-end provided with the opensource. The following specs were used to compute the cepstra:
If you are about to train acoustic models to go with a decoder that you have already decided you will use, or if you are about to use existing acoustic models and want to choose the most compatible decoder, you have to know a little about the strengths and limitations of each decoder. The SPHINX comes with three decoders:
Here are complete listings of the flags accepted by these three decoders. The lines in green are the flags that a user would typically be expected to specify, depending on the type of data encountered during recognition and the specifications that come with the acoustic models being used. Standard values for these flags are indicated, and you can optimize around them. The lines in red, however, must only be used if you are familiar with what is going on in the decoder at an algorithmic level; they are mostly used for research and debugging. In a standard task, do not specify these flags and do not worry about them.
Flag settings for the Sphinx 2.0 decoder (I still have to color the lines... this is not complete)
| FLAG | INTERNAL NAME | TYPE OF SETTING | DESCRIPTION | DEFAULT VALUE | TYPICAL SETTING | EXPLANATION |
| -force | Force | STRING | Force | |||
| -argfile | ArgFile | STRING | Cmd line argument file | |||
| -allphone | AllPhoneMode | BOOL | All Phone Mode | |||
| -forceRec | ForceRec | BOOL | ForceRec | |||
| -agcbeta | AgcBeta | BOOL | Use beta based AGC | |||
| -agcmax | AgcMax | BOOL | Use max based AGC | | | Find the maximum c0 value in the current utterance and subtract it from the c0 of all frames, thereby forcing the max c0 to zero always. Problem: normalization is based on the maximum value, which is a *single* point, and any statistic based on a single point is not robust; the max value may be an outlier, for instance. Also, we are always anchoring the max value to 0. Visualize two Gaussians, one narrow and one wide. If we align them at the value at which each distribution attains some constant value other than its value at the mean (the typical scenario: the max value obtained from a narrow distribution will be closer to the mean than one obtained from a broad distribution), the two means will not align. As a result, any distribution computed from the union of the two distributions will be smeared with respect to both. If we had simply aligned the means of the two distributions instead, this wouldn't happen. When we perform CMN we set the means of all utterances to 0, thereby aligning the means of all utterances. See the sketch after this table. |
| -agcemax | AgcEMax | BOOL | Use another max based AGC | | | Estimate the max c0 as the average of the max c0 of the past (up to) 10 utterances and subtract it from the c0 of all frames. Needed for live decodes, since we don't have the entire utterance to find the max c0 from. Problem: same as for -agcmax. See the sketch after this table. |
| -agcnoise | AgcNoise | BOOL | Use Noise based AGC | |||
| -agcthresh | AgcThreshold | FLOAT | Threshold for Noise based AGC | |||
| -normmean | NormalizeMean | BOOL | Normalize the feature means to 0.0 | |||
| -nmprior | NormalizeMeanPrior | BOOL | Normalize feature means with prior mean | |||
| -compress | CompressBackground | BOOL | Compress excess background frames | | | Find a 100-point histogram of c0 and find the position where it peaks. Find the bin position some N bins away from the peak, toward the min-energy bin (I found the number "5" for N in the code, but that seems wrong, as it would be too close to the peak). This bin position is taken as a threshold; all frames with c0 below it are simply deleted from the utterance, thereby shrinking the utterance length. Problem: this relies on a heuristically computed threshold, and the heuristic will not always work. First, it assumes that the shape of the histogram of the test data is similar to the shape of the histogram of the data on which the heuristic (the shift N from the histogram peak) was developed. This may not be the case: the test data may have a broader distribution, in which case we would end up deleting speech. On the other hand, if the test data has lots of silence, the histogram will peak at the typical silence c0; if we shift to the left of this peak to find a threshold, the peak c0 will lie above the threshold, thereby ensuring that most silence frames remain anyway and making the compress operation pointless. A better approach would have been to find a bimodal distribution, fold the extremum point over across the first peak, and use that as the threshold (like we did for SPINE). As currently implemented it may help at times, but it probably hurts more often than it helps. See the sketch after this table. |
| -compressprior | CompressPrior | BOOL | Compress excess background frames based on prior utt | | | For live decodes the histogram is found from *previous* utterances and the threshold is based on that histogram; all cepstra with c0 below the threshold are deleted. Problem: same as for -compress. |
| -dcep80msweight | Dcep80msWeight | DOUBLE | Weight for dcep80ms | |||
| -live | LiveData | BOOL | Get input from A/D hardware | |||
| -blockingad | A/D blocks on read | BOOL | A/D blocks on read | |||
| -ctlfn | CtlFileName | STRING | Control file name | |||
| -ctloffset | CtlLineOffset | INT | Number of Lines to skip in ctl file | |||
| -ctlcount | CtlCount | INT | Number of lines to process in ctl file | |||
| -ctlincr | CtlLineIncr | INT | Do every nth line in the ctl file | |||
| -compallsen | ComputeAllSenones | BOOL | Compute all senone scores every frame | |||
| -topsenfrm | TopSenonesFrames | INT | #frames top senones for predicting phones | |||
| -topsenthresh | TopSenonesThresh | INT | Top senones threshold for predicting phones | |||
| -wsj1Sent | wsj1Sent | BOOL | Sent_Dir using wsj1 format | |||
| -reportpron | ReportAltPron | BOOL | Report actual pronunciation in match file | |||
| -matchfn | MatchFileName | STRING | Recognition output file name | |||
| -matchsegfn | MatchSegFileName | STRING | Recognition output with segmentation | |||
| -phoneconf | PhoneConfidence | INT | Phone confidence | |||
| -pscr2lat | PhoneLat | BOOL | Phone lattice based on best senone scores | |||
| -logfn | LogFileName | STRING | Recognition output file name | |||
| -correctfn | CorrectFileName | STRING | Reference output file name | |||
| -utt | Utterance | STRING | Utterance name | |||
| -datadir | DataDirectory | STRING | Data directory | |||
| -cepdir | DataDirectory | STRING | Data directory | |||
| -vqdir | DataDirectory | STRING | Data directory | |||
| -segdir | SegDataDirectory | STRING | Data directory | |||
| -sentdir | SentDir | STRING | Sentence directory | |||
| -sentext | SentExt | STRING | Sentence File Extension | |||
| -lmnamedir | LMNamesDir | STRING | Directory for LM-name file for each utt | |||
| -lmnameext | LMNamesExt | STRING | Filename extension for LM-name files | |||
| -startworddir | StartWordDir | STRING | Startword directory | |||
| -startwordext | StartWordExt | STRING | StartWord File Extension | |||
| -nbestdir | NbestDir | STRING | N-best Hypotheses Directory | |||
| -nbest | NbestCount | INT | No. N-best Hypotheses | |||
| -nbestext | NbestExt | STRING | N-best Hypothesis File Extension | |||
| -cepext | CepExt | STRING | Cepstrum File Extension | |||
| -cext | CCodeExt | STRING | CCode File Extension | |||
| -dext | DCodeExt | STRING | DCode File Extension | |||
| -pext | PCodeExt | STRING | PCode File Extension | |||
| -xext | XCodeExt | STRING | XCode File Extension (4 codebook only) | |||
| -beam | BeamWidth | FLOAT | Beam Width | |||
| -nwbeam | NewWordBeamWidth | FLOAT | New Word Beam Width | |||
| -fwdflatbeam | FwdFlatBeamWidth | FLOAT | FwdFlat Beam Width | |||
| -fwdflatnwbeam | FwdFlatNewWordBeamWidth | FLOAT | FwdFlat New Word Beam Width | |||
| -lponlybw | LastPhoneAloneBeamWidth | FLOAT | Beam Width for Last Phones Only | |||
| -lponlybeam | LastPhoneAloneBeamWidth | FLOAT | Beam Width for Last Phones Only | |||
| -npbeam | NewPhoneBeamWidth | FLOAT | New Phone Beam Width | |||
| -lpbeam | LastPhoneBeamWidth | FLOAT | Last Phone Beam Width | |||
| -phnpen | PhoneInsertionPenalty | FLOAT | Penalty for each phone used | |||
| -inspen | InsertionPenalty | FLOAT | Penalty for word transitions | |||
| -nwpen | NewWordPenalty | FLOAT | Penalty for new word transitions | |||
| -silpen | SilenceWordPenalty | FLOAT | Penalty for silence word transitions | |||
| -fillpen | FillerWordPenalty | FLOAT | Penalty for filler word transitions | |||
| -langwt | LanguageWeight | FLOAT | Weighting on Language Probabilities | |||
| -rescorelw | RescoreLanguageWeight | FLOAT | LM prob weight for rescoring pass | |||
| -fwdflatlw | FwdFlatLanguageWeight | FLOAT | FwdFlat Weighting on Language Probabilities | |||
| -fwdtree | FwdTree | BOOL | Fwd tree search (1st pass) | |||
| -fwdflat | FwdFlat | BOOL | Flat fwd search over fwdtree lattice | |||
| -forwardonly | ForwardOnly | BOOL | Run only the forward pass | |||
| -bestpath | Bestpath | BOOL | Shortest path search over lattice | |||
| -fwd3g | TrigramInFwdPass | BOOL | Use trigram (if available) in forward pass | |||
| -cbdir | CodeBookDirectory | STRING | Code book directory | |||
| -ccbfn | CCodeBookFileName | STRING | CCode Book File Name | |||
| -dcbfn | DCodeBookFileName | STRING | DCode Book File Name | |||
| -pcbfn | PCodeBookFileName | STRING | PCode Book File Name | |||
| -xcbfn | XCodeBookFileName | STRING | XCode Book File Name | |||
| -use20msdp | Use20msDiffPow | BOOL | Use 20 ms diff power instead of c0 | |||
| -cepfloor | CepFloor | FLOAT | Floor of Cepstrum Variance | |||
| -dcepfloor | DCepFloor | FLOAT | Floor of Delta Cepstrum Variance | |||
| -xcepfloor | XCepFloor | FLOAT | Floor of XCepstrum Variance | |||
| -top | TopNCodeWords | INT | Number of code words to use | |||
| -skipalt | SkipAltFrames | INT | Skip alternate frames in exiting phones | |||
| -matchscore | WriteScoreInMatchFile | BOOL | write score in the match file | |||
| -latsize | LatticeSizes | INT | BP and FP Tables Sizes | |||
| -lmcachelines | LMCacheNumLines | INT | No. lines in LM cache | |||
| -ilmugwt | ILMUGCacheWeight | INT | Weight(%) for ILM UG cache prob | |||
| -ilmbgwt | ILMBGCacheWeight | INT | Weight(%) for ILM BG cache prob | |||
| -dumplatdir | DumpLattice | STRING | Dump Lattice | |||
| -samp | SamplingRate | INT | Sampling rate | |||
| -adcin | UseADCInput | BOOL | Use raw ADC input | |||
| -adcext | ADCFileExt | STRING | ADC file extension | |||
| -adcendian | ADCByteOrder | INT | ADC file byte order (0=BIG/1=LITTLE) | |||
| -adchdr | ADCHdrSize | INT | ADC file header size | |||
| -rawlogdir | RawLogDir | STRING | Log directory for raw output files | |||
| -mfclogdir | MFCLogDir | STRING | Log directory for MFC output files | |||
| -tactlfn | TimeAlignCtlFile | STRING | Time align control file | |||
| -taword | TimeAlignWord | BOOL | Time Align Word | |||
| -taphone | TimeAlignPhone | BOOL | Time Align Phone | |||
| -tastate | TimeAlignState | BOOL | Time Align State | |||
| -segext | SegFileExt | STRING | Seg file extension | |||
| -scoreext | ScoreFileExt | STRING | Score file extension | |||
| -osentfn | OutSentFile | STRING | output sentence file name | |||
| -backtrace | PrintBackTrace | BOOL | Print Back Trace | |||
| -cdcn | CDCNinitFile | STRING | CDCN Initialization File |
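To make the contrast drawn in the -agcmax, -agcemax, and -normmean rows concrete, here is a minimal Python sketch of the three normalizations (the function names and the first-utterance fallback in the live case are assumptions, not the decoder's actual code):

```python
import numpy as np

def agc_max(cep):
    """-agcmax: anchor the utterance's maximum c0 to zero
    (a single-point statistic, hence not robust)."""
    out = cep.copy()
    out[:, 0] -= out[:, 0].max()
    return out

def cmn(cep):
    """-normmean: cepstral mean normalization; aligning the means of
    all utterances avoids the smearing described above."""
    return cep - cep.mean(axis=0)

class AgcEMax:
    """-agcemax: for live decodes, estimate the max c0 as the average
    of the max c0 of (up to) the past 10 utterances."""
    def __init__(self):
        self.past = []
    def __call__(self, cep):
        cur_max = cep[:, 0].max()
        # assumption: fall back to the current max for the first utterance
        est = np.mean(self.past[-10:]) if self.past else cur_max
        self.past.append(cur_max)
        out = cep.copy()
        out[:, 0] -= est
        return out
```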
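And a sketch of the -compress heuristic, with the histogram threshold computed as described in that row (the default shift N = 5 is the value reported found in the code):

```python
import numpy as np

def compress_background(cep, num_bins=100, shift=5):
    """-compress: delete frames whose c0 falls below a threshold taken
    `shift` bins from the histogram peak toward the min-energy bin."""
    c0 = cep[:, 0]
    counts, edges = np.histogram(c0, bins=num_bins)
    peak = int(np.argmax(counts))
    threshold = edges[max(peak - shift, 0)]   # toward the min-energy bin
    return cep[c0 >= threshold]               # frames below are deleted
```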
LM texts
The decode dictionaries are also within the model directories. They are called DECODE.DICT and DECODE.NOISEDICT. Both must be used during recognition. Remember that ONLY the words that are in the dictionary AND the LM are recognizable. If you have words in your vocabulary that you want to recognize, and don't have examples of their usage in the LM text, then include them simply as unigrams in the LM. The LM vocabulary must not exceed 64,000 words. The dictionary is flexible. You can shorten it or add new words to it. Do not change the phoneset, though.
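For instance, a vocabulary word with no usage examples in the LM text can be entered as a unigram in the ARPA-format LM. The fragment below is purely illustrative: the log probability and backoff weight are made-up placeholders, and the n-gram counts in the \data\ section must be updated to match.

```
\data\
ngram 1=...

\1-grams:
-3.5 NEWWORD -0.3
...
\end\
```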
The acoustic models are meant to recognize 16 kHz sampled speech. Make sure that the signals you record are not clipped and do not have a "tabletop" appearance; if they do, you have a gain control problem.
No AGC was used during training; hence, no AGC must be used during decoding. The AGC flag must be set to FALSE.
Here are some tentative flag settings for the decoder to be used with these models. First try only these (leave out ALL other flags; do not mention them). If these settings slow down the decode (and the decodes look OK), try reducing -top. Do not go below 2: this flag affects recognition hugely; 1 can be very fast and very bad, 2 a little slower and reasonable, 4 gives you good decodes. You can of course optimize around these settings if you have the time:
-live TRUE  -topsenfrm 4  -topsenthresh -50000  -nmprior TRUE
-fwdflat FALSE  -bestpath TRUE  -top 4
-fillpen 1e-10  -nwpen 0.01  -silpen 0.005  -inspen 0.65
-langwt 7.5  -ugwt 0.5
-beam 2e-6  -npbeam 2e-6  -nwbeam 5e-4  -lpbeam 2e-5  -lponlybeam 5e-4
-rescorelw 9.5
-hmmdir tongues_english  -hmmdirlist tongues_english  -cbdir tongues_english
-lmfn english\English.arpabo
-kbdumpdir tongues_english
-dictfn tongues_english\DECODE.DICT
-phnfn tongues_english\phone
-ndictfn tongues_english\DECODE.NOISEDICT
-mapfn tongues_english\map
-sendumpfn tongues_english\sendump
-normmean TRUE  -8bsen TRUE
-matchfn c:\anything.match
-logfn c:\anything.log   (omit this flag if you don't want to write a log)
The specifications for the Croatian models are the same as those for the English models. There is no decode noisedict for these models. The same settings (barring the actual model directory names and output file names) should work for Croatian, with ONE difference: the -ndictfn flag must be omitted for the Croatian models. Add the missing words to the Croatian dictionary before using it to decode; a list of missing words is given. An ALL.tar.gz tar file for these LMs and dictionaries is available, to preserve the ISO characters.